Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Submission: skynet #214

Closed
12 of 19 tasks
FilipeamTeixeira opened this issue Apr 30, 2018 · 43 comments
Closed
12 of 19 tasks

Submission: skynet #214

FilipeamTeixeira opened this issue Apr 30, 2018 · 43 comments

Comments

@FilipeamTeixeira
Copy link

FilipeamTeixeira commented Apr 30, 2018

Summary

  • What does this package do? (explain in 50 words or less):
    This package allows the generation and analysis of air transport networks using the freely available data (DB1B and T100 databases) from the Bureau of Transport Statistics. I'm currently updating the package to accept other sources as well.

  • Paste the full DESCRIPTION file inside a code block below:

Type: Package
Package: skynet
Version: 1.0.2
Title: Generates Networks from BTS Data
Authors@R: person("Filipe", "Teixeira", email = "Filipe.MarquesTeixeira@Ugent.be", role = c("aut", "cre"))
Maintainer: Filipe Teixeira <Filipe.MarquesTeixeira@Ugent.be>
URL: https://github.com/FilipeamTeixeira/skynet
Description: A flexible tool that allows generating bespoke
    air transport statistics for urban studies based on publicly available
    data from the Bureau of Transport Statistics (BTS) in the United States
    <https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0>.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
Collate:
    'PowerLaw.R'
    'createNodes.R'
    'data_ODsample.R'
    'data_airports.R'
    'data_carriers.R'
    'data_metro.R'
    'disparityfilter.R'
    'findAirport.R'
    'fromto.R'
    'convertRaw.R'
    'netImport.R'
    'netDir.R'
    'netUnd.R'
    'netInt.R'
    'netPath.R'
    'skynet.R'
    'nodeStats.R'
    'import_t100.R'
    'import_db1b.R'
    'findCarrier.R'
    'plotMap.R'
    'skynet_example.R'
    'bootnet.R'
Depends:
    R (>= 3.1.2)
Imports:
    data.table,
    igraph,
    dplyr,
    ggplot2,
    ggrepel,
    stringr,
    maps
Suggests:
    knitr,
    rmarkdown,
    testthat,
    kableExtra
VignetteBuilder: knitr

  • URL for the package (the development repository, not a stylized html page):
    https://github.com/FilipeamTeixeira/skynet

  • Please indicate which category or categories from our package fit policies this package falls under *and why(? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

Geospatial data, network analysis, because it works with airline networks and it allows its generation and analysis.

  •   Who is the target audience and what are scientific applications of this package?  
    Air Transport
    Urban Networks

  • Are there other R packages that accomplish the same thing? If so, how does
    yours differ or meet our criteria for best-in-category?
    No.

  •   If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN? "It is already there. https://cran.r-project.org/web/packages/skynet/index.html"
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
    • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no gaurantee that your manuscript willl be within MEE scope.)
    • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@noamross
Copy link
Contributor

noamross commented Apr 30, 2018

Thank you for your submission, @FilipeamTeixeira! A couple of quick questions before we proceed:

  1. You say you are currently updating the package to incorporate other sources. We need a relatively stable version so that reviewers don't have to work with a moving target. Can you let us know the status of this and when you will complete this update?

  2. Just checking, do you know if these data are available for download via stable URLs or an API? The submission would be stronger if it enabled download as well as data processing.

  3. I see you checked off that the package has a research application but not that you want to submit to JOSS. Was this a mistake, and do you/don't you want to submit to JOSS. If you do, a 200-500 word paper.md is required

Finally, we'll also need you to add code coverage to your CI and include a code coverage badge in your readme, which you can do with usethis::use_coverage().

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Apr 30, 2018

Dear @noamross, thank you for your remarks.

  1. Skynet was built to use exclusively data from the Bureau of Transport Statistics, more specifically the DB1B and T100 dataset. Meaning that the final and stable version uses those datasets. However, as some researchers asked to include the possibility of importing data from other sources, I've been building gradually new import functions to allow including other airline data.
    The update will only be a new import function and it does not affect in any way the functioning of the package as it is. It just processes the csv files from other data sources rather than the DB1B and T100 and processes it into a format skynet can work with.

  2. I've updated the readme file to show the links to those datasets. The vignettes already had them, but hopefully this will be clearer now. Unfortunately the BTS doesn't have an API and web scrapping is rather unreliable in this case. I'm not sure if I understood the data processing remark as skynet, allows importing csv files from the links mentioned in the readme file.

  3. I've submitted it to another journal but it is pending review.

Added coverage badge as well.

Thank you.

@noamross
Copy link
Contributor

noamross commented May 1, 2018

Thanks for your responses, @FilipeamTeixeira, that's helpful.

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Below is the output from goodpractice::gp()
, which summarizes things that should be fixed before proceeding to review. I've noted that some are optional, but are notes for yourself and reviewers to consider. The main issue is that your code coverage testing coverage is low (<25%). We don't require 100% coverage, but we do require that there is coverage of the major areas, and from the coverage report I see lots of functions and files are not tested. We generally expect at least 70% coverage or so, and then have reviewers comment if they think edge cases are adequately tested.

── GP skynet ───────────────────────────────────────

It is good practice to

write unit tests for all functions, and all package code in
general. 22% of code lines are covered by test cases.

R/bootnet.R:21:NA
R/bootnet.R:22:NA
R/bootnet.R:23:NA
R/bootnet.R:24:NA
R/bootnet.R:25:NA
... and 656 more lines

add a "BugReports" field to DESCRIPTION, and point it to a bug
tracker. Many online code hosting services provide bug trackers for
free, https://github.com, https://gitlab.com, etc.
use '<-' for assignment instead of '='. '<-' is the standard, and R
users and developers are used it and it is easier to read your code
for them if you use '<-'.

R/convertRaw.R:34:7
R/convertRaw.R:36:7
R/convertRaw.R:38:7
R/convertRaw.R:40:7
R/createNodes.R:91:5
... and 15 more lines

avoid long code lines, it is bad for readability. Also, many people
prefer editor windows that are about 80 characters wide. Try make
your lines shorter than 80 characters (optional, to be considered by author and reviewers based on code readability.)

R/bootnet.R:26:1
R/convertRaw.R:47:1
R/convertRaw.R:48:1
R/convertRaw.R:60:1
R/convertRaw.R:61:1
... and 83 more lines

not import packages as a whole, as this can cause name clashes
between the imported packages. Instead, import only the specific
functions you need. (optional, to be considered by author and reviewers)
fix this R CMD check NOTE: Namespace in Imports field not imported
from: ‘maps’ All declared Imports should be used.
fix this R CMD check NOTE: Found the following hidden files and
directories: .travis.yml These were most likely included in error.
See section ‘Package structure’ in the ‘Writing R Extensions’
manual. (You should add an .Rbuildignore file for these, see, https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Building-package-tarballs)
avoid 'T' and 'F', as they are just variables which are set to the
logicals 'TRUE' and 'FALSE' by default, but are not reserved words
and hence can be overwritten by the user. Hence, one should always
use 'TRUE' and 'FALSE' for the logicals.

R/PowerLaw.R:NA:NA

──────────────────────────────────────────────


Reviewers: @toph-allen @berdaniera
Due date: 2018-05-23

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented May 1, 2018

Thank you @noamross. Your comments have been quite helpful.
I've listed all the remarks fixed with - [x] and the ones partially fixed or with comments with - [ ].

  • write unit tests for all functions, and all package code in
    general. (...)
    - 74% covered now.
  • add a "BugReports" field to DESCRIPTION (...)
  • use '<-' for assignment instead of '=' (...)
  • avoid long code lines, it is bad for readability. Also, many people
    prefer editor windows that are about 80 characters wide. (...)
    - Some of the lines couldn't be avoided as they were part of expressions. I could fix it but that would involve assigning more variables increasing the code run time.
  • not import packages as a whole (...) - That refers to the package igraph which is highly connected to skynet. I use a lot of its functions so I'm not sure what's the advantage of specifying them all. For all the other packages only the used functions are imported.
  • fix this R CMD check NOTE: Namespace in Imports field (...) - map package is needed for map_data through ggplot.
  • fix this R CMD check NOTE: Found the following hidden files and
    directories (...)
  • avoid 'T' and 'F', as (...)

@noamross
Copy link
Contributor

noamross commented May 2, 2018

Thanks for the (fast!) update. Things look good and I'll proceed to find reviewers.

Also, NBD, but it looks like one of your tests trips your own deprecation warning:

test_import_db1b.R:18: warning: (unknown)
netImport function is deprecated, please use `import_db1b()`, or `import_t100()`.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented May 2, 2018

Bug fixed.

@noamross
Copy link
Contributor

noamross commented May 2, 2018

Now that your package has passed editor check stage, you can add our review status badge status to your README if you would like:

[![](https://badges.ropensci.org/214_status.svg)](https://github.com/ropensci/onboarding/issues/214)

@noamross
Copy link
Contributor

noamross commented May 2, 2018

Reviewers assigned!

Reviewers: @toph-allen @berdaniera
Due date: 2018-05-23

Aaron and Toph may be interested in the (alpha) pkgreviewer package recently developed by one of our community members to help structure the process: https://github.com/ropenscilabs/pkgreviewr

@noamross
Copy link
Contributor

noamross commented May 18, 2018

👋 Hello @toph-allen and @berdaniera, just a friendly reminder that your reviews are due next Wednesday, May 23.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented May 23, 2018

Hi @noamross. Any further news on this? I'm not sure if the reviewers noticed that they were supposed to review the code until today.
Thanks

@noamross
Copy link
Contributor

noamross commented May 23, 2018

Hi @FilipeamTeixeira. I'm eager to see the reviews, too! I've been in touch with both reviewers and while they are a bit behind they are on it.

@berdaniera
Copy link

berdaniera commented May 28, 2018

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3


Review Comments

Overall, I think this is a valuable package. My impression is that this is important for consistently wrangling and formatting a complex dataset and quickly generating network analyses for that data. Streamlining this preprocessing (skynet's goal) has value for anyone interested in the airline data.

The primary functions are clean and simple. I hope the comments below will be helpful for clarifying some of the documentation. For the functions, can you talk about the choices for using snake case, camel case, and dot naming?

The main recommendations I have are about documentation for getting the data locally first. At the very least, I'd suggest a short numbered guide with instructions for how to download data. When exploring the functions, I was able to get everything to work with the example data. Before I looked at the vignette (which explains which variables I needed) I tried to download data from the weblink but did not get the necessary variables (which import_db1b asks for directly with dplyr::select), which gave me an error.

In a broader scope, I agree with @noamross suggestion to investigate a direct way to download the data without leaving R. The initial section below might be helpful in that. I'm willing to contribute to this portion of the code if we decide that it is in fact possible (will require some more research).

Initial investigation of direct data access

The web form at https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289 creates a POST link that submits a request to their server, with the main part being a sql string, e.g.,: +SELECT+ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID+FROM++T_DB1B_TICKET+WHERE+Quarter+=1+AND+YEAR=1993

That submission creates a temporary file that gets stored here https://transtats.bts.gov/ftproot/TranStatsData/
and the return of the POST contains the url for the file, which is then downloaded with a GET request.

Given that sequence, I think it is POSSIBLE that you could create a function that uses httr to POST the request with a specific set of parameters (pre-specifying the required variables, user defined year and quarter, etc.) to download the data directly.

The only hangup would be if the POST requires cookie credentials from the website. But, I think that could also be obtained in a first step.

Maybe at least think about if this is a desirable function for the package. It would increase the safety of the functions (no errors from users getting the wrong data) and make the process of getting the data that much easier on the user.

Package metadata files

For the readme, I'd suggest adding a short annotated example of downloading data and reading it in, instead of just a link to the files. E.g., do I download all of the data for a given year/quarter? Do I download the default fields from BTS? That step may be familiar to people who use these data but could be a challenge for newcomers.

Online documentation

The online documentation (https://FilipeamTeixeira.github.io/skynet/) is clear and provides a basic overview of the main functions.

The Overview page could give some additional detail to motivate me to use this package. What does the package enable? What are some of the main challenges that arise from this BTS data? E.g., data wrangling and preparation for network analysis?

The example code on the overview page requires me to have an 'OD_2011Q1' object in my environment, which I did not already. I'd recommend adding a line to that example to show the data download and import.

The Get Started page is a great feature of this documentation. On that page, the Import Datasets instructions say to use netImport(), but when I do that I get a warning message:

netImport function is deprecated, please use `import_db1b()`, or `import_t100()`.

This seemed to work:

import_db1b(skynet_example("Coupon_2001Q1.csv"), skynet_example("Ticket_2001Q1.csv"))
make.netDir(OD_2001Q1, disp = TRUE, alpha = 0.05)

On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with carrier = TRUE) and got an error about not yet being able to do this operation with the disparity filter. After checking out the help for ?make.netDir I inferred that I could remove disp = TRUE from the code, but it could be explicit in the documentation by showing a code snippet for the carriers line: make.netDir(OD_2001Q1, carrier = TRUE) and adding a note that it doesn't work yet with the disparity filter.

Functionality

The example code

import_db1b(skynet_example("Coupon_2001Q1.csv"), skynet_example("Ticket_2001Q1.csv"))
make.netDir(OD_2001Q1, disp = TRUE, alpha = 0.05)

# a minimal example code:
make.netDir(OD_2001Q1)

# a fully documented example with the default parameters specified:
make.netDir(OD_2001Q1, 
            disp = FALSE, alpha = 0.003, 
            cap = FALSE, pct = 10, 
            carrier = FALSE, metro = FALSE)


For the import_db1b documentation, I suggest explicitly stating in the description that this is for importing both the Coupon and Ticket data from the same year and quarter.

Group by carrier

On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with carrier = TRUE) before I looked at the help documentation and got an error about not yet being able to do this operation with the disparity filter.

After checking out the help for ?make.netDir I saw that I could remove disp = TRUE from the code. This could be explicit in the documentation by showing a code snippet with the carrier operation and adding a note that it doesn't work yet with the disparity filter.

Another suggestion is in the main example code to use all of the defaults, either displaying all of the arguments or just using the minimal default example (like I did above).

# does not work:
make.netDir(OD_2001Q1, disp = TRUE, alpha = 0.05, carrier = TRUE)

# works:
make.netDir(OD_2001Q1, carrier = TRUE)

Filter edge weight

In the help for make.netDir I'd suggest sorting the arguments by their use; disp with alpha, cap with pct.

make.netDir(OD_2001Q1, cap = TRUE)

Create map

The documentation to create maps assumes that I have an object called list, which I think is the saved network object. The first command (from the doc) does not work without creating the object first.

# does not work:
netMap(list$netDir)

# works:
my_network <- make.netDir(OD_2001Q1, cap = TRUE)
netMap(my_network$netDir)

The maps are very nice, good work on that!

Additional functions

my_airport <- findAirport("RDU")

findCarrier("Southwest")

findCarrier("WN")

fitPlaw(my_network$gDir)

# some extra exploration
nodeStats(OD_2001Q1) %>%
  dplyr::filter(airport == "RDU")

fromto_network <- make.netDir(OD_2001Q1)
fromto.stat(fromto_network$gDir, "ATL", orig = "from")

These functions work fine for me. For the power law fit, I wonder if it makes sense to export the Alpha and R^2 results as list objects that can be stored as output instead of just printing.

Code inspection

Inline code documentation is complete and clear, which will make future contributions easy.

Suggest using message or warning instead of print to communicate with the user (e.g., for the return message in fitPlaw or make.path).

@noamross
Copy link
Contributor

noamross commented May 28, 2018

Thank you for your review, @berdaniera! A quick note on your comments about importing data. Site-scraping functions like this are useful, but they can be unreliable as data providers may change their back-end processing. I leave it up to you whether to implement this @FilipeamTeixeira, but should you choose to you should make sure that the function documentation and maybe a message indicate that the function is dependent on an undocumented/unsupported API and may be unstable. You should also check that it doesn't violate terms of service and have your HTTP requests set a user-agent like skynet R package, developed by your@email. BTS does not appear to have a robots.txt file which would be the clearest way explicitly allow or disallow this sort of use.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented May 28, 2018

Thank you for your thorough review @berdaniera. I'll go through the comments as soon as possible, but I just wanted to first reply to the comment about importing data. Initially I had such feature, but they changed something on their website and suddenly it stopped working.
There are a couple of packages (R and Python) online which have the same aim, and they all stopped working after a while. A remark I got once from the Bureau of Transport Statistics was that for this data they weren't making any API available, meaning that scraping wasn't allowed as well. I believe that it might be related to the comercial databases providing this data for a hefty fee.

Anyway, I stopped investing time in finding a solution. It might be something to try again if there's an easy way of doing it, but if not I'll have to leave it as it is.

For the rest I'll go through it as soon as possible.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented May 28, 2018

For the functions, can you talk about the choices for using snake case, camel case, and dot naming?

The rationale behind it was: dot naming for main functions (e.g. make.netDir) with the case at the end to indicate type of network being created, camel case for all secondary functions (e.g. createNodes), snake case for import functions (e.g. import_db1b). If you believe that it doesn't make sense please let me know. When I started writing this package, I never thought it would grow this big, but I wanted to make sure that its functions would end up making some sense.

The Overview page could give some additional detail to motivate me to use this package. What does the package enable? What are some of the main challenges that arise from this BTS data? E.g., data wrangling and preparation for network analysis?

This is a tricky one. I have currently a paper in review, which I would like to wait to be published before adding more information. Then it will be possible to add the full paper, and as much information as possible. I'm presenting this as well in an air transport conference soon, meaning that I can even add more after than.

For the readme, I'd suggest adding a short annotated example of downloading data and reading it in, instead of just a link to the files. E.g., do I download all of the data for a given year/quarter? Do I download the default fields from BTS?

For the import_db1b documentation, I suggest explicitly stating in the description that this is for importing both the Coupon and Ticket data from the same year and quarter.

Done

The example code on the overview page requires me to have an 'OD_2011Q1' object in my environment, which I did not already. I'd recommend adding a line to that example to show the data download and import.

When importing a csv using the import_db1b for example, it creates an object with the name of those files. Meaning that when importing Coupon 2011Q1 and Ticket 2011Q1, it will generate an OD_2011Q1 object.

The Get Started page is a great feature of this documentation. On that page, the Import Datasets instructions say to use netImport(), but when I do that I get a warning message

There was an issue with the documentation which has been fixed meanwhile.

Group by carrier
On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with carrier = TRUE)

I'm not really sure what do you really refer to here. The limitation with the carrier and disparity filter is merely a mathematical challenge. Having multiple carriers, means that the origin-destination matrix will have parallel edges. However, the disparity filter wasn't designed to work with parallel edges. I didn't design the filter as I've mentioned in the documentation.

Filter edge weight
In the help for make.netDir I'd suggest sorting the arguments by their use; disp with alpha, cap with pct.

Done

Create map
The documentation to create maps assumes that I have an object called list, which I think is the saved network object.

Not sure if I understood your question. As from the documentation, the create map function asks for a data frame. The example with the list, lies on the fact that the make.net* functions always create a list with three objects.

These functions work fine for me. For the power law fit, I wonder if it makes sense to export the Alpha and R^2 results as list objects that can be stored as output instead of just printing.

I've been wanting to improve this function to reflect more what's needed in network analysis. For now if I understood your comment correctly, I've fixed what you've mentioned.

Once again thank you all and if you have any remarks or if I missed something please let me know. Current version is 1.0.4.

@toph-allen
Copy link

toph-allen commented Jun 1, 2018

Geez, sorry I'm so tardy with my review! I'll have it finished by the end Saturday.

I'm really looking forward to taking a look at this package! I have a more-than-passing familiarity with the underlying DB1B and T100 datasets, and am interested to see how they're surfaced here.

@noamross
Copy link
Contributor

noamross commented Jun 1, 2018

My apologies to @toph-allen for mis-tagging him and to @toph for bothering him unnecessarily!

@toph-allen
Copy link

toph-allen commented Jun 7, 2018

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

2.5-3


Review Comments

Overview

(I wrote this overview after I wrote most of the comments that follow)

The skynet package performs useful functions on some flight datasets which are tricky to work with by themselves. The DB1B and T100 datasets are complex, and having a package which helps wrangle them into a more usable format is a definite value.

I think there could be improvements made to the package's documentation, and to its initial setup process.

  • In particular, I echo the previous reviewer's comments about clarity getting the user set up with the correct data files downloaded. Ideally, as the other reviewers have noted, the package would include a way to download the data directly in R. Failing that, clearer documentation would be helpful.
  • I found that there were some variables named expected by import_db1b() which differed from the BTS names for downloaded files, and some other non-standard ways that the function operates, which could combine to cause confusion and frustration for a user who's unfamiliar with the package or the data. In particular, I think that the import function's use of argument names, and its assigning rather than returning the final object, should be changed.
  • I echo the previous reviewer's comments about some of the naming of objects in example code being somewhat confusing (e.g. an object called list in the netMap() function documentation).
  • I think that the documentation would be improved by some additional information, as described above, below, and echoing the comments from other reviewers.

Installation

Installed with no problem.

Importing the Files — Documentation & Code

DB1B and T100 are really confusing datasets, especially when you're first dealing with them.

The README.md file on the skynet repo does a good job of orienting the user to set up the package and referring them to the place they can download the files.

However, I ran into some holdups getting the right files downloaded for import. The README.md recommends downloaded files with a specific set of variables included, because the full datasets are very large.

The table on line 58 in README.md (line 55 of README.Rmd) lists the suggested variables to include. However, these variable names are not a direct match for the "Field Name" or the "Description" fields of the DB1B download form.

For example:

  • "Destination" is one of the recommended fields, and this corresponds to the variable whose "Field Name" is "Dest" and "Description" is "Destination Airport Code".
  • The DB1BTicket variable "FarePerMile" is referred to as "Yield" in the README.md file and the error messages displayed, and I was only able to figure out which variable needed to be checked by cross-referencing with the package source code.

For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc. displayed on the BTW download pages.

I'm unclear if the recommended names for the T100 dataset are a requirement for the package to function correctly or just a recommendation for clarity.

There is an error on line 63 of import_db1b.R. The bulk fare indicator, which is exported by BTS as BULK_FARE in the CSV header, is specified in the import function as BULKFARE, which causes an error. In addition, the function expects a variable named DISTANCE_FULL. I think this is referring to the DISTANCE BTS variable. You should probably double-check all the variable names, to be sure.

The import_db1b() function (and its T100 equivalent) should include a more explicit description of the object that they generate.

Comments on function style

I would recommend making some changes to the behavior of the function import_db1b(). If other functions in the package behave similarly, I would recommend changing them too.

First, I'd recommendation is to change the names of the function's arguments and the way they're handled.

The function requires two arguments, x and y, each a CSV file. The function searches for "ticket" and "coupon" in the file names to determine which is which. Using the file names as implicit indicators of their contents, and having two named arguments which are actually interchangable, goes against common coding style and might be confusing to users.

In my opinion, it would be clearer to users and more robust to different file names if the function required two arguments named coupon and ticket, each corresponding to the expected file — especially since both are required.

Second, I'd recommend changing the way the function returns its data. The function uses the filename of the input coupon dataset to determine the name of an object, and assigns that object. This object is assigned in an environment which is assigned in import_db1b.R, presumably when the package is loaded.

The expected behavior would be for the function to return the created object, so that it can be assigned like:

db1b <- import_db1b("data-raw/71336240_t_DB1B_COUPON.csv", "data-raw/71336240_t_DB1B_TICKET.csv")

This gives the user more control over the name and location of the object that's created by the import process.

The import_t100() function follows this second pattern, and I'd also recommend changing it there.

Network Generation Functions

I'm less familiar with iGraph, so my comments here are a little less detailed.

Using the functions included, I was able to replicate the vignette's network map from the DB1B datasets successfully.

All the functions which begin make. process the imported data to generate various different objects. Some of these are clear from the function documentation, but some aren't, and they could generally use a little more detailed description. In general, I would recommend including the following information:

  • Details on the type of object returned by the function. For example, make.netDir() returns a list containing a data frame, an iGraph object, and summary statistics. However, this information isn't detailed in the function's documentation, only in the package's vignette.
  • If the fields in the returned objects aren't immediately obvious from their names, I would include a list of the variable types and how they are computed.
  • For iGraph objects, which may be less widely known than objects like data frames, you might want to include a link to the iGraph package documentation online. This is less crucial, though — users can look this up themselves if need be — unless you link to a specific, relelvant part of the iGraph documentation.

Miscellaneous

  • I'd include a link in the documentation for import_db1b() and import_db1b() either to the pages where the user can download the files, or to the documentation which includes more detailed instructions on downloading the files with the requisite variables.
  • The README.md recommends naming the T100 CSV files to T100 year mkt and T100 year seg, but the function documentation recommends naming them T100 2011Q1mkt.csv and T100 2011Q1seg.csv. Ideally, the filenames shouldn't matter, and the files should explicitly be passed as one type or another.
  • The README.md should include community guidelines.
  • The usage of the skynet_example() function is not clear from its documentation. I don't know if this function is entirely necessary — example data could be included as .RData files in the package's data directory.
  • The netMap() function's help page is titled PlotMap — perhaps leftover from a previous name of that function, judging from the source file's name.
  • The map function uses curves, but these aren't great circle arcs — perhaps this caused unwildy plots for some map projections, though, so this isn't as big a deal. The plots produced are clear.
  • My initial thought was that the functions might be clearer and more consistent if they all used snake case for names — I wasn't sure if there was a specific rationale for using that versus dots versus camel case. I see that the other author has raised this, and while I think that consistent snake case would be more in line with what I'd have expected, it's not something which I'd quibble with too much.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 7, 2018

@toph-allen Thank you for your comments and thorough review.

In particular, I echo the previous reviewer's comments about clarity getting the user set up with the correct data files downloaded. Ideally, as the other reviewers have noted, the package would include a way to download the data directly in R. Failing that, clearer documentation would be helpful.

As I've mentioned before the BTS changes their website structure often which I assume that it is related to the fact that they don't allow scraping. It might be counterproductive to create a function that will be unusable after a couple of weeks.

I found that there were some variables named expected by import_db1b() which differed from the BTS names for downloaded files, and some other non-standard ways that the function operates, which could combine to cause confusion and frustration for a user who's unfamiliar with the package or the data. In particular, I think that the import function's use of argument names, and its assigning rather than returning the final object, should be changed.

Thank you for your feedback on this. However, the variables to be used are well described in the ReadMe file and help section for each function. There's as well an argument zip for when the users don't want to manually select their variables.

I echo the previous reviewer's comments about some of the naming of objects in example code being somewhat confusing (e.g. an object called list in the netMap() function documentation).
I think that the documentation would be improved by some additional information, as described above, below, and echoing the comments from other reviewers.
Installation

I'm not sure about this comment and to which naming you are referring to. Some of the issues previously mentioned by the other reviewer have been fixed.

However, I ran into some holdups getting the right files downloaded for import. The README.md recommends downloaded files with a specific set of variables included, because the full datasets are very large.

The table on line 58 in README.md (line 55 of README.Rmd) lists the suggested variables to include. However, these variable names are not a direct match for the "Field Name" or the "Description" fields of the DB1B download form.

For example:

"Destination" is one of the recommended fields, and this corresponds to the variable whose "Field Name" is "Dest" and "Description" is "Destination Airport Code".
The DB1BTicket variable "FarePerMile" is referred to as "Yield" in the README.md file and the error messages displayed, and I was only able to figure out which variable needed to be checked by cross-referencing with the package source code.
For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc. displayed on the BTW download pages.

I don't think that I fully understood these remarks. Both the DB1B and the T100 dataset use non-standard naming. This means that if you select the variables yourself, they will have one name, but if you download the full file (zipped) they will be named differently. The reason why I chose a different variable name was to homogenise this and other issues. As I am working on a way to import other datasets (e.g. Sabre, OAG), it only makes sense that the variables have a standard name closer to all the others. Also long named variables might generate confusion and they are not really clear for those reading with statistics.

I'm unclear if the recommended names for the T100 dataset are a requirement for the package to function correctly or just a recommendation for clarity.

File size is not an issue as they are quite small, so you can always download all the variables.

There is an error on line 63 of import_db1b.R. The bulk fare indicator, which is exported by BTS as BULK_FARE in the CSV header, is specified in the import function as BULKFARE, which causes an error. In addition, the function expects a variable named DISTANCE_FULL. I think this is referring to the DISTANCE BTS variable. You should probably double-check all the variable names, to be sure.

Again this is described in the ReadMe File. There are differences between the zipped file and the pre-selected variables file. The variables are named differently, and this might be the reason of your confusion. When importing the zipped file which uses the zip = TRUE argument, the variable is name Distance and BulkFare, however, when selecting the variables yourself, they are called DISTANCE_FULL and BULK_FARE respectively. So there's no error there. I lower cased everything as well to be easier to import to SQL databases.

The import_db1b() function (and its T100 equivalent) should include a more explicit description of the object that they generate.

Thank you for your comment, but this is explained in the Vignettes.

The function requires two arguments, x and y, each a CSV file. The function searches for "ticket" and "coupon" in the file names to determine which is which. Using the file names as implicit indicators of their contents, and having two named arguments which are actually interchangable, goes against common coding style and might be confusing to users.

In my opinion, it would be clearer to users and more robust to different file names if the function required two arguments named coupon and ticket, each corresponding to the expected file — especially since both are required.

Changed.

Second, I'd recommend changing the way the function returns its data. The function uses the filename of the input coupon dataset to determine the name of an object, and assigns that object. This object is assigned in an environment which is assigned in import_db1b.R, presumably when the package is loaded.

When working with temporal data it is easier this way. However, the name now is created not with basis on the file but on if it is DB1B, which will start then with OD, or T100, which will start with T100. The rest will be calculated with basis on the year and quarter variables.
The reason why I forced the name this way was that an uniform naming seemed to be helpful according to the researchers working with this data, and who gave me feedback on this.

Details on the type of object returned by the function. For example, make.netDir() returns a list containing a data frame, an iGraph object, and summary statistics. However, this information isn't detailed in the function's documentation, only in the package's vignette.
If the fields in the returned objects aren't immediately obvious from their names, I would include a list of the variable types and how they are computed.
For iGraph objects, which may be less widely known than objects like data frames, you might want to include a link to the iGraph package documentation online. This is less crucial, though — users can look this up themselves if need be — unless you link to a specific, relelvant part of the iGraph documentation.

Thank you for your comments. However I'm not sure if I understood them correctly.
I've added some extra information on the named objects generated, but I'm not 100% sure if based on your comments if it will be enough.

I'd include a link in the documentation for import_db1b() and import_db1b() either to the pages where the user can download the files, or to the documentation which includes more detailed instructions on downloading the files with the requisite variables.

It's in the ReadMe file.

The README.md recommends naming the T100 CSV files to T100 year mkt and T100 year seg, but the function documentation recommends naming them T100 2011Q1mkt.csv and T100 2011Q1seg.csv. Ideally, the filenames shouldn't matter, and the files should explicitly be passed as one type or another.
The README.md should include community guidelines.

Fixed.

The usage of the skynet_example() function is not clear from its documentation. I don't know if this function is entirely necessary — example data could be included as .RData files in the package's data directory.

This was required and requested by CRAN. As for the package I do understand your remarks, but I do want this package to follow CRAN's reviewers comments as well, so for now it has to be there.

The netMap() function's help page is titled PlotMap — perhaps leftover from a previous name of that function, judging from the source file's name.

Fixed.

The map function uses curves, but these aren't great circle arcs — perhaps this caused unwildy plots for some map projections, though, so this isn't as big a deal. The plots produced are clear.

Thank you for your comments here. I actually didn't want them to be great circles arcs. The reason why they are curved is to be easier to read the map. This was feedback from researchers using the package for their research.

My initial thought was that the functions might be clearer and more consistent if they all used snake case for names — I wasn't sure if there was a specific rationale for using that versus dots versus camel case. I see that the other author has raised this, and while I think that consistent snake case would be more in line with what I'd have expected, it's not something which I'd quibble with too much.

Thank you for your comments here. I've explained the rational behind the naming above in my comments to the other reviewer.

Thank you @toph-allen, @berdaniera and @noamross for all the comments and for all the help making this package better. Meanwhile I've pushed all the changes to my GitHub, and skynet is currently on version 1.1.0.

@noamross
Copy link
Contributor

noamross commented Jun 8, 2018

Thank you @toph-allen for your review and @FilipeamTeixeira for your response. Let me comment on a few things here now that both reviews are in:

  • Function naming: this was commented on by both reviewers, and I agree that the mix of approaches is unusual and somewhat unintuitive. We recommend snake case but pretty much require consistency. We find a bit of rationalization, like a pattern of prefix_verb(_object), (sky_fit_plaw, sky_gen_netmap) makes it easier for users to understand and discover functions, and works for packages with many functions. Due to widespread use of autocomplete, longish names are not really an issue. This would break the API and require a 2.0 release.
  • Having a function assign a value to the environment rather than to an object is not standard practice, especially as default behavior. I can imagine a number of cases where it would cause problems, such as if you were processing a large number of files from different dates within using an apply paradigm or parallel code. If you really think this behavior is necessary, one option would be to add an assign_default=FALSE argument to those import functions (or even allow the user to set a value under options() to set this behavior).
  • I note that @toph-allen's suggestion of making the filename order consistent in the import functions isn't implemented in convertRaw. This should be consistent across functions. I also note that your argument naming is fairly inconsistent across functions. Much like function names, consistency in arguments is a great help for the users. For instance, using graph as an argument wherever functions expect an igraph object, or, cfile, wherever they expect a coupon file, makes the function API easily interpretable for the user and code that uses the package much more readable and auto-completable.
  • For some of the documentation suggestions, please follow suggestions on documentation beyond the README to and suggested information in function documentation. With documentation, we often work in the opposite paradigm of "don't repeat yourself". Rather, the paradigm we use is, "multiple points of entry" - that a user may be hunting for information in one of several locations and they should all include critical information or at least link to each other.

@toph-allen and @berdaniera, please let us know how well @FilipeamTeixeira's changes address your comments.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 9, 2018

@noamross
Just a quick reply to your comments.

1- Indeed that would require a major change which I might include on a next major release.
2- I've been thinking of "fixing" this by working with classes. That would eliminate the need of specific naming for several objects. However, this would only come on a major release as well.
3- Fixed. I forgot that function indeed.
4- I'm not sure to what this refers to.

Thank you once again for your constructive and detailed comments.

@noamross
Copy link
Contributor

noamross commented Jun 10, 2018

Hello, @FilipeamTeixeira, regarding (4), a specific example I'm referring to is @toph-allen's suggestion

I'd include a link in the documentation for import_db1b() and import_db1b() either to the pages where the user can download the files, or to the documentation which includes more detailed instructions on downloading the files with the requisite variables.

Yes, this information is in the README, but the concept of "multiple points of entry" is that the user may be looking for information in the function file and should find via repeating information or linking to another source.

Regarding 1,2,3, and other issues brought up by reviewers: Having a consistent and fully documented API where functions behave as commonly expected is pretty core to our standards, so these major-version changes will be needed before we're able to accept. We at times accept packages that only address some review iterms via a "roadmap" for a future version (see the skimr review for an example), but this tends to be limited to cases of extensions or at times re-factoring of internals for the purpose of enabling extensions.

Specifically with (2) - classes seem like a likely solution and S3 classes are relatively simple to implement, though I'm still not sure what the problem is that assigning to the environment solves, based on your feedback from users. Perhaps if you describe the use-case in a bit more detail we could suggest a solution.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 11, 2018

Dear @noamross. Thank you for your comments and for clarifying my questions.
For (4) I've fixed this issue. I wasn't fully understanding what was exactly the remark but I've managed to understand what was the goal now.
Regarding (1), I'm afraid that it won't happen anytime soon. The main reason here is that this package's paper has been submitted and to change every function in the paper now would be sloppy. There is still some consistency in the functions and what they are meant to, and I do intend to change it in a nearby future, probably still this year when I add machine learning support. I hope you all understand this, and that I don't really have an option here.

About (2), I'm also not found of assigning objects like that but the issue is simple. Most users will use this package for longitudinal analysis. Meaning that they won't work merely with one year but with multiple. Some users mentioned that it would be easier to name the imported objects automatically, so they could incorporate them in an easy way into loops. If the object syntax is always the same, except for the suffix which shows the year and quarter, it is easy then to run any sort of analysis for lets say 10 years. From my understanding classes wouldn't solve this, as you would still need to name each object.
So for example if I wanted to run an average degree centrality function on all of my networks and get confidence intervals, it is easy to do it if the only thing changing is the suffix of my target object(s).
I am aware that you can do this from the import, but this standardises it and makes it more consistent. However, if you have any other suggestion I'm willing to look into it as I don't doubt that you all have more experience than I do.

(3) has been fixed, so there's nothing to point here.

Once again thank you for your remarks and for your help with this package.

@noamross
Copy link
Contributor

noamross commented Jun 11, 2018

@FilipeamTeixeira Regarding the API and review timing issue, one option would be to put this issue on hold until your other review is done. We generally don't recommend putting a package in review in two places at once, for just this reason. There are times where, because a package is in widespread use or tightly linked to other tools, it makes sense to avoid changing the API for compatibility reasons. However, I don't think that is the case here - we would rather help get the package to a point where it is more usable for more people.

Re: (2) Making function input or output depend on file or object naming is pretty fragile and incompatible with common functional programming strategies. A much more common and robust way to doing this type of workflow is to create a list of objects and then iterate on that list using a function such as lapply() or purrr::map(). This approach that is broken by assigning the object to the environment. If the date of the object is critical, a better way for it to be retained is as an attribute or as part of the list that makes up the imported object (along with the igraph and data frame). Then if you are iterating over several objects, it is easy to extract the date from each one. S3 classes help this along with, for instance, print methods that highlight important parts of the object such as the date, or extractor functions like date(object). If your audience isn't familiar with lists or iterating functions like lapply() you can demonstrate it in the vignette, and another option is to create a helper function that, say, imports a folder full of data and creates a list of objects.

@toph-allen and @berdaniera, it would be helpful if you could let us know well @FilipeamTeixeira's changes address your comments at this stage. Even if we pause the review it would be good to record those at this point so we have a record of the current state.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 11, 2018

@noamross I do understand and I also don't want to start a discussion here about this as it would be counterproductive. However, I have the feeling that I'm probably not properly conveying my message and the users' needs.

The need of having correctly named objects is to avoid having to create lists. surely that with lapply() or with purrr::map() it is possible to run a function based on the S3 classes, but I'm still not really sure how it would optimise this. Running a function would have to then scan through the list, check each S3 class from each object, return it, and apply a function on it if it matches the requirements. Now it only has to get the object name and run the function.
I'm all for optimising my code, but I would really like to understand how something will do it, as we shouldn't do something just because it has been standard. What might work for some, might not work for others and there are plenty of examples like that in R.

As for naming the functions I do understand as well. However, this decision was made after talking to people working with the package. I would like to reiterate and stress that this package is rather specific when it gets to its target. Meaning that we could not compare it with others. Also dplyr uses both snake case and dot naming, so this example exists in numerous packages.

If indeed this will be paused, it is a pity, but I regret to say that it will stay this way for a while as the users are my priority. If there are any remarks on the function naming and any arguments that would explain a standardisation of those functions, I'm happy to reconsider the changes. Another alternative is to create aliases for the functions as it is possible to find in numerous packages running the risk of increasing complexity.

In any case I'm truly thankful for all the help you guys gave so far and for all the time invested in this. So even if this goes nowhere it was already rather productive for me.

@toph-allen
Copy link

toph-allen commented Jun 12, 2018

@FilipeamTeixeira Thanks for your responses to my review. I'll attempt to respond concisely to them, and to clarify and contextualize some of my original comments.

For context, I have some familiarity with the DB1B and T100 datasets — I'm part of a team at work which is creating models using these datasets. This is our first time working with these datasets, and we wound up creating some internal scripts to do some of what skynet accomplishes. I can definitely see a role for a package like skynet in helping users work more quickly and efficiently with these datasets.

I see that you made some updates to the code in response to some of the discussion here. Thanks for being responsive to our feedback. I think that some of the changes, especially with regard to things like argument handling, will be beneficial to new users. @noamross's comments on the consistency of arguments across packages are good explanations as to why.

"Destination" is one of the recommended fields, and this corresponds to the variable whose "Field Name" is "Dest" and "Description" is "Destination Airport Code".
The DB1BTicket variable "FarePerMile" is referred to as "Yield" in the README.md file and the error messages displayed, and I was only able to figure out which variable needed to be checked by cross-referencing with the package source code.
For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc. displayed on the BTW download pages.

I don't think that I fully understood these remarks. Both the DB1B and the T100 dataset use non-standard naming. This means that if you select the variables yourself, they will have one name, but if you download the full file (zipped) they will be named differently. The reason why I chose a different variable name was to homogenise this and other issues. As I am working on a way to import other datasets (e.g. Sabre, OAG), it only makes sense that the variables have a standard name closer to all the others. Also long named variables might generate confusion and they are not really clear for those reading with statistics.

I see what you're saying here — since the datasets all use different names for certain values, you're using a generic name. In that case, it might be helpful to give some examples of what the variables might be named — but this wasn't a high priority comment, in my mind.

There is an error on line 63 of import_db1b.R. The bulk fare indicator, which is exported by BTS as BULK_FARE in the CSV header, is specified in the import function as BULKFARE, which causes an error. In addition, the function expects a variable named DISTANCE_FULL. I think this is referring to the DISTANCE BTS variable. You should probably double-check all the variable names, to be sure.

Again this is described in the ReadMe File. There are differences between the zipped file and the pre-selected variables file. The variables are named differently, and this might be the reason of your confusion. When importing the zipped file which uses the zip = TRUE argument, the variable is name Distance and BulkFare, however, when selecting the variables yourself, they are called DISTANCE_FULL and BULK_FARE respectively. So there's no error there. I lower cased everything as well to be easier to import to SQL databases.

When this happened, I was actually using a normal the custom-selected variable file. The following code outputs the first line of the file; you can see that the variables are named differently from the function. Do you think that this is a recent change to the BTS website, like you mentioned?

$ head -n 1 71336240_T_DB1B_TICKET.csv
"ITIN_ID","COUPONS","ORIGIN_AIRPORT_ID","ORIGIN_AIRPORT_SEQ_ID","ORIGIN_CITY_MARKET_ID","ROUNDTRIP","ITIN_YIELD","PASSENGERS","ITIN_FARE","BULKFARE","DISTANCE_FULL","DISTANCE_GROUP","ITIN_GEO_TYPE",
$ head -n 1 71336240_T_DB1B_COUPON.csv
"ITIN_ID","MKT_ID","SEQ_NUM","COUPONS","YEAR","ORIGIN_AIRPORT_ID","ORIGIN_AIRPORT_SEQ_ID","ORIGIN_CITY_MARKET_ID","QUARTER","ORIGIN","DEST_AIRPORT_ID","DEST_AIRPORT_SEQ_ID","DEST_CITY_MARKET_ID","DEST","TRIP_BREAK","COUPON_TYPE","OPERATING_CARRIER","DISTANCE","GATEWAY",

@noamross's comments about making documentation more available and comprehensive, and about returning an object from the import functions for assignment, convey my thoughts on those subjects very well. I had no idea that the temporal nature of the dataset was the reason for the assignment and file naming. (Our team was actually doing the same thing with these datasets — using file names to represent different quarters, and pulling that info from the file name, and I had us instead use the year and quarter columns from the datasets to pull that info, for the principles that have been outlined elsewhere in the thread.)

I'll keep an eye on this and am happy to help out more as needed.

@noamross
Copy link
Contributor

noamross commented Jun 13, 2018

Many of our packages have a specific audience, but as I hope the case of @toph-allen above shows, there are many potential users out there who may use your package for reasons you have not anticipated. Our purpose in standardizing is to make sure this larger audience is served and that your package can be useful to a broader audience, and that your work gets exposure to them. We do not want standards for standards' sake. Of course every package has idiosyncrasies driven by speciality uses. Yet we want to ensure the exceptions help a user understand the package and keep it compatible with other workflows they may have. (In dplyr, for instance, dot-naming specifically identifies S3 conversion methods such as as.xxx which is a very common pattern across packages and base R)

Aliasing names and deprecating functions over time is strategy we recommend! My co-editor Scott has written a guide to this: https://ropensci.org/technotes/2017/01/05/package-evolution/

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 13, 2018

@noamross Thank you for your comments. I'll try to add some aliases as soon as possible. As for @toph-allen 's comments, I believe that there might be some confusion here. He actually supported assigning names to the objects for the sake of temporal analysis.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 14, 2018

@noamross function naming was corrected according to your specifications. Except for one minor internal function which I will patch for the next version. I kept all the older functions but with a message stating that they are deprecated.
As for the object created by assignment I hope that @toph-allen's comment made it clearer on the reason behind such approach.

I believe that all the requirements have been fulfilled unless there's something else I might be missing.

Once again thank you.

@toph-allen
Copy link

toph-allen commented Jun 15, 2018

Just to eliminate any confusion: In my review, I wasn’t requesting an argument to let the users name the objects which are then assigned in-function to the global environment. Rather, I think it would be better to follow convention and return the object at the end of the function with return(object), so that the user can assign it when they call the function, as in processed_data <- import_db1b(x = “foo”, y = “bar”).

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 15, 2018

@toph-allen So what was your point with the temporal data and the naming? Because there's no other way of running temporal data (in this case), except for asking the users to name every single object. Which I find counterproductive. As we can't "scan" the entire environment for classes or even arguments I'm not sure what's the alternative.

I had no idea that the temporal nature of the dataset was the reason for the assignment and file naming. (Our team was actually doing the same thing with these datasets — using file names to represent different quarters, and pulling that info from the file name, and I had us instead use the year and quarter columns from the datasets to pull that info, for the principles that have been outlined elsewhere in the thread.)

What made you change between methods? I can give the user the option through an argument. That's the only option I see.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 16, 2018

@noamross @berdaniera @toph-allen
To reach a middle term, I've added the option to not have the import functions automatically assigning objects. This way the user can choose. I assume that this satisfies then both parties.

R CMD check results
0 errors | 0 warnings | 0 notes

@toph-allen
Copy link

toph-allen commented Jun 16, 2018

@FilipeamTeixeira Taking a look at the code, I see that a user can now provide the argument auto = FALSE to automatically return the created object. That assignment behavior is the one I'd expect, coming to the package with no prior knowledge, and it's good that it's available. 👍

To respond to your point about temporal data:

If I'm understanding it right, the motivation for having the assignment happen automatically in the function is that it makes it easier for your users to then loop over a vector of filenames for different time periods and automatically create the objects (for example, CSVs for 2017 Q1–Q4)?

If I were dealing with multiple files, and the function returned an object for each file, I might use one of the map() family of functions, from the purrr package, to either create a list of objects, or map assignment myself.

Or, I'd create a single data frame for all time periods, and use group_by() %>% summarize() to create summaries for different time periods.

And having the option to return the object enables both of these workflows.

@noamross
Copy link
Contributor

noamross commented Jun 17, 2018

Thanks, all. @berdaniera and @toph-allen, could you at this point look over the full changes and let us know if they cover all of your original concerns?

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 22, 2018

@noamross Sorry to bother but any news on this?
Thank you.

@berdaniera
Copy link

berdaniera commented Jun 22, 2018

@FilipeamTeixeira @noamross I still need to look over all of the changes. I've set aside time this afternoon for it. Look for a reply from me by the end of the day.

@berdaniera
Copy link

berdaniera commented Jun 22, 2018

Thank you for replying to my review comments @FilipeamTeixeira. I am happy with the changes that you have made. My primary concerns in the original review were around the function naming, basic examples, and documentation for new users. I understand the desire to wait on further overview documentation until some text is published. The changes in the functions address my original concerns and the expansion of the online documentation will be helpful for new users to get data. The update on importing data looks good to me too.

Overall, the author has responded to my review and made changes to my satisfaction. I recommend approving this package.

@toph-allen
Copy link

toph-allen commented Jun 27, 2018

I concur. I think that the changes @FilipeamTeixeira has made have address my concerns and improve the package overall, making it easier to begin using. Exposing the package to more people and workflows will help it to continue moving in a direction where it can be broadly useful.

@noamross
Copy link
Contributor

noamross commented Jun 27, 2018

Approved! Thanks @FilipeamTeixeira for submitting and @toph-allen and @berdaniera for your reviews!

To-dos:

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Add the rOpenSci footer to the bottom of your README
    " [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)"
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to the ropensci GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 28, 2018

@noamross Thank you for your help and for accepting the package.

I just have a practical and what might be a dumb question, but is it possible to keep the package both on my repo and on the ropensci repo? Even if mirrored or so?
Just asking because some people are used to check my GitHub for the package and I'm not sure how to re-point them to the new link.

Thank you

@noamross
Copy link
Contributor

noamross commented Jun 28, 2018

@FilipeamTeixeira GitHub automatically sets up redirects when packages are transferred, so users visiting the original link to your repo will find it.

Note you can also still pin the repository to your GitHub profile.

@FilipeamTeixeira
Copy link
Author

FilipeamTeixeira commented Jun 29, 2018

@noamross thank you a lot again for all the help. About the blog post you've mentioned I would love to contribute. So just let me know how can I do it.

Once again thank you.

@stefaniebutland
Copy link
Member

stefaniebutland commented Jul 18, 2018

About the blog post you've mentioned I would love to contribute.

@FilipeamTeixeira Thank you for your patience - was up to me to get back to you about your interest in
contributing a blog post but I'm about 19 days late!

This link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length: https://ropensci.org/tags/review/. Technotes are here for more short form posts: https://ropensci.org/technotes/.

Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post. Right now I have openings in the blog calendar for late August and September. When do you think you'd be ready to submit a draft?

Happy to answer any questions.

@noamross noamross closed this as completed May 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants