submission: arkdb #224
Editor checks:
Editor comments

Hi @cboettig. Thanks for the submission. I'll be editing this package. I get errors in the tests or examples when running

One issue raised by
I also note some genuine spelling errors. You may wish to look at
I will be looking for reviewers.
@lmullen I continue to learn tricks from you! I should remember to run the
@cboettig Thanks for fixing those. Everything looks good to start the review, and I've solicited reviewers.
Reviewers: @richfitz and @expectopatronum. Thanks to Verena and Rich for being willing to do this review. I've set a due date 3 weeks from now. Here are the reviewer's guide and the reviewer's template. @cboettig Could you please add the rOpenSci review badge to your repository if you haven't done so already?
Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
Documentation

The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 3

Review Comments

I was able to install the package without errors on Windows 10. I ran the following commands with the following output:
I did not yet tick the check box next to Packaging guidelines. As far as I can see most/all required items are fulfilled, but I have a few remarks:
Additionally, I have a few minor remarks about things I would change in the code (unless I'm missing something and this is the better way to do it):
In general it is a well-written, lightweight package. The intention of the package is clear and the documentation is useful. To highlight just a few of the things I liked:
Thanks for the helpful review, @expectopatronum.
@expectopatronum Thanks for this awesome review 🎉. Very helpful. I thought I'd wait for @richfitz's comments, but might as well check these off now and ask a few follow-up questions. (Rich has meanwhile given me a hint on at least one other thing I should improve.)
Good catch, fixed!
I'm terrible at adding these, at least until I get bug reports on particular cases. Are there some obvious example scenarios you can think of that should provide better error handling or error messages?
Grouping these, since they are related:

```r
old <- dbConnect(MySQL(), ...)
new <- dbConnect(MonetDBLite(), "my_db")

old %>%
  ark(tempdir()) %>%
  fs::dir_ls() %>%
  unark(new)
```

This is why
Okay, pushing these few edits, lemme know what you think!
Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
RGF: As @cboettig notes above, this is adapted from a similar bit of code I wrote a few months back, but I don't think that counts as a COI as I regularly hate on my own code...

Documentation

The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 1

Review Comments

I have not read the other review as I write this review (and having now looked through the other review, it looks like Carl already mostly dealt with those comments). Apologies this is so late.

This package provides an interface for loading data into a database from disk, and extracting data from a database to disk, via R, without needing to do the entire operation in memory. This will help with very large databases (on the order of hundreds of millions of rows or more) where fitting data into memory is not an option. It will also increase the responsiveness of such operations (probably at the cost of overall speed), making large-scale import and extraction from R a more pleasant experience. This package, with about a hundred lines of code and two exported functions, fits a "micropackage" description, which makes reviewing a lot easier.

Major comments

The biggest flaw I see with the package is that it violates the Open–closed principle (I know this is not OOP, but I feel this is one that really transcends any paradigm). Carl has used tsv

The package could be extended to take either 3 arguments (or some sort of wrapper object) that includes an extension and two functions (read/write). This interacts somewhat with the compression handling, but I think it's probably still doable. As it is, if someone has an aversion to tsv

Misc comments:

The example in the

The progress and general chattiness needs to be tuneable (e.g., with a

The log output as-is is not incredibly informative after it has been run:
Consider a little bit of reordering and indenting to get to
(and so on)

How is the default of

In

```r
if (is(db_con, "SQLiteConnection") |
    is(db_con, "MySQLConnection") |
    Sys.getenv("arkdb_windowing") == "FALSE") {
  query <- paste("SELECT * FROM", tablename, "LIMIT",
                 lines, "OFFSET", (start - 1) * lines)
} else {
  ## Postgres can do windowing
  query <- paste("SELECT * FROM",
                 tablename,
                 "WHERE rownum BETWEEN",
                 (start - 1) * lines,
                 "AND",
                 start * lines)
}
```

(The

Super minor comments

In

```r
if (is(db_con, "src_dbi")) {
```

I thought that

Your postgres windowed code is not tested. You might try that on travis?

Your

I am confused by the point of the untested early exit in

There are also a few error paths untested (see covr output).

In

```r
d <- reader()
body <- paste0(c(header, d$data), "\n", collapse = "")
p$tick()
chunk <- readr::read_tsv(body, ...)
DBI::dbWriteTable(db_con, tbl_name, chunk, append = TRUE)
```

I might be missing something, but surely one can compute the headers ahead of time (0'th iteration) and then do:

```r
chunk <- reader()
names(chunk) <- header
p$tick()
DBI::dbWriteTable(db_con, tbl_name, chunk, append = TRUE)
```

with modifications to `read_chunked`:

```r
read_chunked <- function(con, n) {
  assert_connection(con)
  next_chunk <- readLines(con, n)
  if (length(next_chunk) == 0L) {
    # if we don't stop, we will hit an error!
    stop("connection has already been completely read")
  }
  function() {
    data <- next_chunk
    next_chunk <<- read_tsv(con, n, header = FALSE) # <-- change here
    complete <- length(next_chunk) == 0L            # <-- probable change here too
    list(data = data, complete = complete)
  }
}
```

(This is quite possibly something I could do in my code that this comes from too!)

I would probably be inclined to split
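(To make the reader-closure protocol concrete, here is a minimal consumption sketch — my illustration, not package code — assuming the readLines-based `read_chunked` above: call the closure until the `complete` flag flips.)

```r
# Hypothetical driver loop consuming the read_chunked() closure sketched above
con <- file("flights.tsv", "r")
reader <- read_chunked(con, 10000L)
repeat {
  d <- reader()
  # ... parse and write d$data (a character vector of raw lines) ...
  if (d$complete) break
}
close(con)
```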
@richfitz and @expectopatronum Thank you both for your thorough reviews. @cboettig: You've already started on Verena's review, of course, but could you please address these two reviews and make any necessary changes by July 30?
@richfitz Thanks, this is excellent and the detail much appreciated! It addresses a bunch of things that have puzzled or bugged me, so I really appreciate your expertise here. Just skimming the review for now, but one point where I could use a bit of additional feedback is the hard-wiring of tsv. For instance, supporting other delimited calls from readr is relatively easy. Other parsers get more tricky -- e.g. I wanted to support binary formats. A somewhat related challenge that's not directly raised by the reviews is the issue of type stability. While some of the binary types I just mentioned support this out of the box, types (dates, factors, integer vs numeric, etc.) can change over a read-in / read-out round trip. At the moment I feel this is an out-of-scope issue for a first release, but maybe a metadata file should be written with column types when archiving a database? (This could work within a family like readr's.) Up against a deadline now but should be able to get this done by July 30.
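(To make the wrapper-object idea from the review concrete, here is a minimal sketch of what a pluggable format could look like; the constructor name `streamable` and its fields are hypothetical, not arkdb's actual API.)

```r
# Hypothetical wrapper bundling a file extension with paired read/write
# functions, so ark()/unark() need not hard-wire tsv.
streamable <- function(read, write, extension) {
  structure(list(read = read, write = write, extension = extension),
            class = "streamable")
}

# Example instance built on readr's tsv functions
tsv_format <- streamable(
  read      = function(path) readr::read_tsv(path),
  write     = function(x, path) readr::write_tsv(x, path),
  extension = "tsv"
)
```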
@cboettig - I've sketched out a proof-of-concept for generalising away from tsv here: ropensci/arkdb#5. While doing that I noticed this section:

```r
compress <- match.arg(compress)
rs <- DBI::dbSendQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))
size <- DBI::dbFetch(rs)
end <- size[[1]][[1]]
```
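(For reference — and not necessarily the point Rich is raising — `DBI::dbGetQuery()` bundles the send/fetch/clear steps, so the excerpt above could collapse to a one-liner.)

```r
# dbGetQuery() = dbSendQuery() + dbFetch() + dbClearResult() in one call
end <- DBI::dbGetQuery(db_con, paste("SELECT COUNT(*) FROM", tablename))[[1]]
```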
@richfitz Thanks much! I'll take a look tonight. Quick question: do you mean re
Don't listen to me on the COUNT(1) suggestion:

```r
microbenchmark::microbenchmark(
  carl = DBI::dbGetQuery(db$con, "SELECT COUNT(*) FROM flights"),
  rich = DBI::dbGetQuery(db$con, "SELECT COUNT(1) FROM flights"))
## Unit: milliseconds
##  expr      min       lq     mean   median        uq       max neval
##  carl 2.404747 2.536132 2.818021 2.687870  3.002489  4.136245   100
##  rich 8.938237 9.306043 9.851665 9.705794 10.161812 12.921496   100
```

We ran into this problem at work, but the tables in question have something on the order of 4 billion rows - so the approximate counts get quite useful...

Re
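(As an aside on those approximate counts: in Postgres, for example, an estimated row count is available from the catalog statistics, avoiding a full scan — a sketch assuming a table named `flights` and reasonably fresh statistics.)

```r
# Estimated (not exact) row count from Postgres planner statistics;
# accuracy depends on when ANALYZE last ran on the table.
DBI::dbGetQuery(db$con,
  "SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'flights'")
```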
So you're missing the
As I come from software engineering, I can think of several cases that could be tested (some of which might be overkill). I would at least test what happens when the file/database connections are not working, though. Other tests could be passing wrong parameters (e.g., names of tables that don't exist), ... But as I said, that might be overkill :)
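(For illustration, a couple of the suggested cases sketched with testthat; the exact expectations are hypothetical, not tests the package ships.)

```r
library(testthat)

test_that("ark() fails cleanly on a closed connection", {
  db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbDisconnect(db)
  expect_error(arkdb::ark(db, tempdir()))
})

test_that("reading a table that does not exist errors", {
  db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  expect_error(DBI::dbReadTable(db, "no_such_table"))
  DBI::dbDisconnect(db)
})
```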
Yes, that's completely reasonable and fine with me! Great job!
@richfitz @expectopatronum @lmullen Okay, I think I've hit just about all of these now.

@expectopatronum I have added error handling assertions for the argument types at the top of `ark()` and `unark()`.

@richfitz I've summarized my replies to your issues in ropensci/arkdb#6. In particular, see ropensci/arkdb#10, which implements your pluggable formats (supporting

Okay, I think this is a much better package now, though there are no doubt things that could be better still or new issues I've introduced. Over to you all now; thanks for all the input, and looking forward to your feedback.
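(A sketch of what such argument assertions might look like; the helper name and message are illustrative, not necessarily the package's exact code.)

```r
# Hypothetical argument check for the db_con parameter
assert_dbi <- function(x) {
  if (!(methods::is(x, "DBIConnection") || methods::is(x, "src_dbi")))
    stop("db_con must be a DBI connection, not ", class(x)[[1]],
         call. = FALSE)
}
```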
@cboettig Perfect, I ticked the final checkbox. From my point of view the package is good to go!
Glad to see these revisions, @cboettig. @expectopatronum Thanks very much for completing your thorough review. @richfitz Thanks for going back and forth to improve this package. Would you be able to look over Carl's changes within roughly one week?
Looks good to me!
Thanks, @richfitz and @expectopatronum. @cboettig, we will consider this package successfully onboarded. 🚀 Carl, could you please work on the following? To-dos:
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them `"rev"`-type contributors in the DESCRIPTION file.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practice and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.
Thanks @lmullen! I've ticked off the items above and transferred the package over to rOpenSci now. Badges / CI etc. seem to be working fine. Planning a CRAN release soon.
@lmullen can this issue be closed? 🙏 I'm asking because of my packages registry experiments, where I noticed the review badge of
@maelle Yes, definitely. Closing now.
Summary
Read large, compressed text files into a database in chunks. Write all tables in a database out into compressed tsv files.
https://github.com/cboettig/arkdb
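(A minimal usage sketch, assuming a local SQLite database; the directory handling shown is illustrative.)

```r
library(arkdb)

db <- DBI::dbConnect(RSQLite::SQLite(), "my_db.sqlite")
dir <- fs::dir_create("archive")

ark(db, dir)                 # all tables -> compressed tsv files, in chunks
unark(fs::dir_ls(dir), db)   # compressed tsv files -> database tables

DBI::dbDisconnect(db)
```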
data deposition! Because good archiving formats are crucial. Kinda crazy how many researchers and repositories distribute data only through APIs and uncompressed sql dumps.
Researchers working with large data files
yours differ or meet our criteria for best-in-category?
I don't think so?
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.

Detail
Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: