etl: framework for medium data #140

Closed
11 of 14 tasks
beanumber opened this issue Aug 7, 2017 · 8 comments

@beanumber

Summary

  • What does this package do? (explain in 50 words or less):
    Facilitates predictable and pipeable ETL (extract-transform-load) operations for publicly-accessible medium data sets

  • Paste the full DESCRIPTION file inside a code block below:

Package: etl
Type: Package
Title: Extract-Transform-Load Framework for Medium Data
Version: 0.3.6
Date: 2017-07-20
Authors@R: c(
    person("Ben", "Baumer", email = "ben.baumer@gmail.com",
      role = c("aut", "cre")),
    person("Carson", "Sievert", email = "cpsievert1@gmail.com", role = "ctb"))
Maintainer: Ben Baumer <ben.baumer@gmail.com>
Description: A predictable and pipeable framework for performing ETL 
    (extract-transform-load) operations on publicly-accessible medium-sized data 
    set. This package sets up the method structure and implements generic 
    functions. Packages that depend on this package download specific data sets 
    from the Internet, clean them up, and import them into a local or remote 
    relational database management system.
License: CC0
LazyData: TRUE
Imports:
    DBI,
    datasets,
    downloader,
    lubridate,
    methods,
    stringr,
    readr,
    utils
Depends:
    R (>= 2.10),
    dplyr
Suggests:
    airlines,
    dbplyr,
    knitr,
    RSQLite,
    RPostgreSQL,
    RMySQL,
    MonetDBLite,
    ggplot2,
    testthat,
    rmarkdown
URL: http://github.com/beanumber/etl
BugReports: https://github.com/beanumber/etl/issues
RoxygenNote: 6.0.1
VignetteBuilder: knitr

  • URL for the package (the development repository, not a stylized html page):
    http://github.com/beanumber/etl

  • Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

reproducibility, because the extensions of this package will lead to reproducible medium data sets used in research
data retrieval, since the extensions of this package download data
data munging, since the extensions of this package transform raw data into CSVs

  • Who is the target audience?
    R developers for the etl package itself
    R users for etl-dependent packages

  • Are there other R packages that accomplish the same thing? If so, how does
    yours differ or meet our criteria for best-in-category?
    No. This package depends heavily on dplyr and dbplyr, but it provides functionality specific to the ETL process that is not present in either.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@maelle added the package label Aug 8, 2017
@maelle
Member

maelle commented Aug 8, 2017

Thanks a lot for your submission @beanumber! We (rOpenSci onboarding editors) discussed the fit of the package and don't think it's in scope. In particular, we couldn't see the scientific application of the package or why it would lead to more reproducibility than other approaches. If you disagree with this decision, feel free to provide us with a more specific/descriptive explanation.

Don't hesitate to submit other packages in the future, potentially starting with a pre-submission enquiry in this same repo.

@beanumber
Author

@maelle Thanks for your response. I suppose it's sort of hard to see the value of this package on its own, since it's sort of a meta package. The idea is that the suite of packages that depend on this package will provide a consistent, robust user experience, instead of a collection of packages that all work in idiosyncratic ways.

@nicholasjhorton
Contributor

For me, the etl package is attractive since it provides a way for people to share data in a principled fashion (even if the individual files being shared are larger than 50MB). This gets around the 5MB (a crazy low value) recommended package size on CRAN and also simplifies the use of a GitHub package install to create a specific environment.

@maelle
Member

maelle commented Aug 8, 2017

Thanks both, but it's still unclear how this improves reproducibility compared to existing approaches such as this one for big datasets (data provenance, versioning, etc.).

Our saying the package is out-of-scope doesn't mean it's useless, of course!

@beanumber
Author

I don't know if this will change your position @maelle, but I've posted the long-form article for this on the arXiv. The manuscript explains the package itself and the purpose of the package in far greater detail.

@maelle
Member

maelle commented Aug 24, 2017

Thanks @beanumber, the editors' position was because we couldn't see how the package helps reproducibility compared to existing approaches. Is there a part of the article dealing more specifically with this? In any case, good work.

Oops edited now that I see the title of the paper 🤦‍♂️

@beanumber
Author

Section 2 addresses this. You might be interested in Section 4.2 to see an example of how this might work in practice.

@maelle
Member

maelle commented Aug 27, 2017

@beanumber, thanks for providing us the link to your manuscript.

After discussion within the editorial team, we still think that, although very useful, etl is out of scope for rOpenSci, and here is why:

  • It is a general data manipulation tool, not specifically aimed at retrieving or extracting a data type or source

  • The reproducibility packages that are in scope as per our policies are "Tools that facilitate reproducible research. This includes packages that facilitate use of version control, provenance tracking, automated testing of data inputs and statistical outputs, citation of software and scientific literature.",

which isn't the case for etl. etl doesn't help track the provenance or versions of a dataset.

However, we encourage the development of your package and your communication efforts for making it better known. We suggest you submit it to the R Journal.

Thanks again, and don't hesitate to ask any questions.


3 participants