Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable paging for betydb functions #94

Merged
merged 33 commits into from
Mar 12, 2018
Merged

Enable paging for betydb functions #94

merged 33 commits into from
Mar 12, 2018

Conversation

dlebauer
Copy link
Collaborator

@dlebauer dlebauer commented Apr 14, 2017

This PR enables paging for betydb_* functions in order to avoid server timeout.

Description

The BETYdb API is fairly slow (PecanProject/bety#419), and with very large requests (> 500k records) functions can fail due to server timeouts. Although this is not an issue for the default database (betydb.org) it is becoming an issue for other instances of BETYdb.

These changes use a for loop to divide large requests into smaller requests (currently in chunks of 5000 rows) using the limit and offset parameters to iterate through.

Related Issue

Example

I've added new tests just to make sure that we are getting the right number of requested records! More examples can be seen in the tests file.

Here is an example I've used to test the effect of limit on timings:

options(
  betydb_url = "https://www.betydb.org/",
  betydb_api_version = "beta")

limits <- c(1, 10, 50, 100, 500, 1000, 5000, 10000)
times <- list()
for(i in seq_along(limits)){
  times[[i]] <- data.frame(limit = limits[i], time = as.vector(system.time(betydb_search("SLA", limit = limits[i]))['elapsed']))
} 

times <- dplyr::bind_rows(times)

library(ggplot2)
ggplot(data = times)+
  geom_point(aes(limit, time)) + 
  scale_x_log10('limit (# records)') + scale_y_log10('time (s)') + 
  theme_bw()

image

@dlebauer
Copy link
Collaborator Author

@infotroph please review

@codecov-io
Copy link

codecov-io commented Apr 14, 2017

Codecov Report

Merging #94 into master will increase coverage by 8.63%.
The diff coverage is 89.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #94      +/-   ##
==========================================
+ Coverage   15.44%   24.08%   +8.63%     
==========================================
  Files          17       17              
  Lines         602      681      +79     
==========================================
+ Hits           93      164      +71     
- Misses        509      517       +8
Impacted Files Coverage Δ
R/betydb.R 93.49% <89.02%> (-3.18%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63e9916...3cb72fd. Read the comment docs.

@dlebauer
Copy link
Collaborator Author

@sckott I would appreciate any advice on writing better tests ...

I expected that the tests I wrote would cover most of the code that I wrote and was pretty surprised to these results https://codecov.io/gh/ropensci/traits/pull/94/src/R/betydb.R that seem to indicate most of the new lines in betydb.R are not covered.

@sckott
Copy link
Contributor

sckott commented Apr 17, 2017

you don't have any requests with limit > 200 correct? seems like that block of lines 179-224 only invoked if limit > 200

@dlebauer
Copy link
Collaborator Author

aha. Thanks ... I was trying to minimize the testing time; will revise accordingly

@sckott
Copy link
Contributor

sckott commented Apr 17, 2017

def. do minimize testing time, just having one test that hits that bit of the code is good i think

change warning to message
tests: fixed errors testing API responses
Copy link
Contributor

@infotroph infotroph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for incomplete comments -- I'll need to come back to this, but submitting now to save my comments

R/betydb.R Outdated
lst <- jsonlite::fromJSON(txt, simplifyVector = TRUE, flatten = TRUE)

api_version <- ifelse(grepl('/beta/api', url), 'beta', 'v0')
if(!exists('per_call_limit')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention to test whether ... contains a parameter named per_call_limit? I don't think this will do that -- parameters in ... aren't assigned into the function environment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, the whole reason I added the per_call_limit as a parameter rather than a fixed number is so that it can be set to a smaller number in the testing. So in the testing I make per_call_limit <<- 20 a global variable ...

5000 is a reasonable limit when the package is being used and 20 is sufficient for testing

It is kind of hacky, but is the difference b/w tests that take seconds vs minutes (e.g. so that I can test with iterations of [0,1,n] and remainders [0,n].

if(args$limit <= per_call_limit){
txt <- betydb_http(url, args, key, user, pwd, ...)
lst <- jsonlite::fromJSON(txt, simplifyVector = TRUE, flatten = TRUE)
} else if (args$limit > per_call_limit){ # divide large requests (aka page)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be could be an unconditional else.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach makes it more clear and explicit the case under which we begin paging. the final else on line 264 below captures if args$limit captures cases where !args$limit <= per_call_limit | args$limit > per_call limit. (e.g. if args$limit can't be coerced to a numeric value)

expect_is(get.out, "response")
expect_true(grepl("OK", get.out[["headers"]]$status))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or expect_match(httr::headers(get.out)$status, "OK" ) ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

betydb_api_version = "beta",
betydb_key = "eI6TMmBl3IAb7v4ToWYzR0nZYY07shLiCikvT6Lv",
warn=-1 ## suppress warnings that we did not get all data
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware that option changes are NOT local to this test block -- they'll stay in effect for other tests until the end of the file. See test_that("URL & version options work" above for one approach to cleanup.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced w/ on.exit(reset_opts(opts))

betydb_url = "https://www.betydb.org/",
betydb_api_version = "v0",
warnings = 0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. This works, but on.exit(reset_opts(opts)) is still arguably safer: It both resets all previous options, and unsets any newly set options that didn't exist before this test.

@infotroph infotroph mentioned this pull request Oct 8, 2017
@infotroph infotroph mentioned this pull request Oct 10, 2017
@dlebauer
Copy link
Collaborator Author

@sckott with the latest updates this branch is now passing. Do you have a schedule for pushing new versions to CRAN? Would be nice to be able to use a stable release in the training materials I am writing.

@sckott sckott added this to the v0.4 milestone Oct 23, 2017
@sckott
Copy link
Contributor

sckott commented Oct 23, 2017

we can target for next milestone https://github.com/ropensci/traits/milestone/4

some of those issues i may move to next milestone, so could be pretty soon.

@sckott sckott merged commit 8fe9631 into ropensci:master Mar 12, 2018
@sckott
Copy link
Contributor

sckott commented Mar 12, 2018

merged, sorry about the long wait

dlebauer added a commit to dlebauer/traits that referenced this pull request Mar 16, 2018
added (Chris Black) (@infotroph) to author list based on contributions to the betydb interface (especially ropensci#94, ropensci#88, ropensci#82)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants