Submission: skynet #214
Geospatial data and network analysis, because the package works with airline networks and supports their generation and analysis.
Confirm each of the following by checking the box. This package:
The text was updated successfully, but these errors were encountered:
Thank you for your submission, @FilipeamTeixeira! A couple of quick questions before we proceed:
Finally, we'll also need you to add code coverage to your CI and include a code coverage badge in your readme, which you can do with
Dear @noamross, thank you for your remarks.
Added coverage badge as well.
Thanks for your responses, @FilipeamTeixeira, that's helpful.
Below is the output from
── GP skynet ───────────────────────────────────────
It is good practice to
Thank you @noamross. Your comments have been quite helpful.
Thanks for the (fast!) update. Things look good and I'll proceed to find reviewers.
Also, NBD, but it looks like one of your tests trips your own deprecation warning:
The package includes all the following forms of documentation:
Final approval (post-review)
Estimated hours spent reviewing: 3
Overall, I think this is a valuable package. My impression is that this is important for consistently wrangling and formatting a complex dataset and quickly generating network analyses for that data. Streamlining this preprocessing (skynet's goal) has value for anyone interested in the airline data.
The primary functions are clean and simple. I hope the comments below will be helpful for clarifying some of the documentation. For the functions, can you talk about the choices for using snake case, camel case, and dot naming?
The main recommendations I have are about documentation for getting the data locally first. At the very least, I'd suggest a short numbered guide with instructions for how to download data. When exploring the functions, I was able to get everything to work with the example data. Before I looked at the vignette (which explains which variables I needed) I tried to download data from the weblink but did not get the necessary variables (which
In a broader scope, I agree with @noamross's suggestion to investigate a direct way to download the data without leaving R. The initial investigation below might be helpful there. I'm willing to contribute to this portion of the code if we decide that it is in fact possible (it will require some more research).
Initial investigation of direct data access
The web form at https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289 creates a POST link that submits a request to their server, with the main part being a SQL string, e.g.:
That submission creates a temporary file that gets stored here https://transtats.bts.gov/ftproot/TranStatsData/
Given that sequence, I think it is POSSIBLE that you could create a function that uses
The only hangup would be if the POST requires cookie credentials from the website. But, I think that could also be obtained in a first step.
It may at least be worth considering whether this is a desirable function for the package. It would increase the safety of the functions (no errors from users getting the wrong data) and make getting the data that much easier for the user.
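As a very rough illustration of the idea, a helper might look like the sketch below. To be clear, the endpoint URL, the form field name (`sqlstr`), and the response parsing are all my assumptions, not a documented API; they would need to be verified against the actual POST request the web form sends.

```r
# Hypothetical sketch only: endpoint, field names, and response format
# are assumptions inferred from the web form, not a supported API.
library(httr)

get_bts_table <- function(sql, dest = tempfile(fileext = ".zip")) {
  # Submit the same POST the web form sends; the server responds with a
  # page linking to a generated file under /ftproot/TranStatsData/
  resp <- POST("https://www.transtats.bts.gov/DownLoad_Table.asp",
               body = list(sqlstr = sql), encode = "form",
               user_agent("skynet R package"))
  page <- content(resp, as = "text")
  # Pull the temporary file link out of the response and fetch it
  link <- regmatches(page, regexpr("ftproot/TranStatsData/[^\"']+", page))
  download.file(paste0("https://transtats.bts.gov/", link), dest, mode = "wb")
  dest
}
```

If the POST turns out to require cookie credentials, an initial `GET` on the form page with `httr` would capture them for reuse in the session.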
Package metadata files
For the readme, I'd suggest adding a short annotated example of downloading data and reading it in, instead of just a link to the files. E.g., do I download all of the data for a given year/quarter? Do I download the default fields from BTS? That step may be familiar to people who use these data but could be a challenge for newcomers.
The online documentation (https://FilipeamTeixeira.github.io/skynet/) is clear and provides a basic overview of the main functions.
The Overview page could give some additional detail to motivate me to use this package. What does the package enable? What are some of the main challenges that arise from this BTS data? E.g., data wrangling and preparation for network analysis?
The example code on the overview page requires me to have an 'OD_2011Q1' object in my environment, which I did not already have. I'd recommend adding a line to that example to show the data download and import.
The Get Started page is a great feature of this documentation. On that page, the Import Datasets instructions say to use
This seemed to work:
On the Get Started page in the Create Networks section I tried to modify the above function to group by carriers (with
The example code
Group by carrier
After checking out the help for
Another suggestion is in the main example code to use all of the defaults, either displaying all of the arguments or just using the minimal default example (like I did above).
Filter edge weight
In the help for
The documentation to create maps assumes that I have an object called
The maps are very nice, good work on that!
These functions work fine for me. For the power law fit, I wonder if it makes sense to export the Alpha and R^2 results as list objects that can be stored as output instead of just printing them.
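As a sketch of what I mean (not skynet's actual implementation — I'm using igraph's `fit_power_law()` as a stand-in for whatever is computed internally), the fit could be returned as a list so callers can store it:

```r
library(igraph)

# Sketch: return the fit statistics as a list instead of only printing.
# fit_power_law() stands in for skynet's internal computation.
fit_degree_power_law <- function(graph) {
  fit <- fit_power_law(degree(graph))
  list(alpha = fit$alpha, xmin = fit$xmin, KS_p = fit$KS.p)
}

g <- sample_pa(200)              # toy preferential-attachment graph
res <- fit_degree_power_law(g)
res$alpha                        # now storable and reusable, not just printed
```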
Inline code documentation is complete and clear, which will make future contributions easy.
Thank you for your review, @berdaniera! A quick note on your comments about importing data. Site-scraping functions like this are useful, but they can be unreliable, as data providers may change their back-end processing. I leave it up to you whether to implement this, @FilipeamTeixeira, but should you choose to, make sure that the function documentation (and perhaps a message) indicates that the function depends on an undocumented/unsupported API and may be unstable. You should also check that it doesn't violate the terms of service, and have your HTTP requests set a user-agent like
Thank you for your thorough review @berdaniera. I'll go through the comments as soon as possible, but I just wanted to first reply to the comment about importing data. Initially I had such a feature, but they changed something on their website and suddenly it stopped working.
Anyway, I stopped investing time in finding a solution. It might be something to try again if there's an easy way of doing it, but if not I'll have to leave it as it is.
For the rest I'll go through it as soon as possible.
The rationale behind it was: dot naming for main functions (e.g. make.netDir), with the suffix indicating the type of network being created; camel case for all secondary functions (e.g. createNodes); and snake case for import functions (e.g. import_db1b). If you believe that this doesn't make sense, please let me know. When I started writing this package, I never thought it would grow this big, but I wanted to make sure that its functions would end up making some sense.
This is a tricky one. I currently have a paper in review, and I would like to wait until it is published before adding more information. Then it will be possible to add the full paper and as much information as possible. I'm also presenting this at an air transport conference soon, so I can add even more after that.
When importing a csv using import_db1b, for example, it creates an object named after those files. This means that importing Coupon 2011Q1 and Ticket 2011Q1 will generate an OD_2011Q1 object.
There was an issue with the documentation, which has meanwhile been fixed.
I'm not really sure what you are referring to here. The limitation with the carrier and disparity filter is merely a mathematical challenge. Having multiple carriers means that the origin-destination matrix will have parallel edges, but the disparity filter wasn't designed to work with parallel edges. As I've mentioned in the documentation, I didn't design the filter myself.
I'm not sure I understood your question. As stated in the documentation, the create map function asks for a data frame. The example with the list reflects the fact that the make.net* functions always create a list with three objects.
I've been wanting to improve this function to better reflect what's needed in network analysis. For now, if I understood your comment correctly, I've fixed what you mentioned.
Once again, thank you all, and if you have any remarks or if I missed something, please let me know. The current version is 1.0.4.
Geez, sorry I'm so tardy with my review! I'll have it finished by the end of Saturday.
I'm really looking forward to taking a look at this package! I have a more-than-passing familiarity with the underlying DB1B and T100 datasets, and am interested to see how they're surfaced here.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
Final approval (post-review)
Estimated hours spent reviewing:
(I wrote this overview after I wrote most of the comments that follow)
I think there could be improvements made to the package's documentation, and to its initial setup process.
Installed with no problem.
Importing the Files — Documentation & Code
DB1B and T100 are really confusing datasets, especially when you're first dealing with them.
The README.md file on the skynet repo does a good job of orienting the user to set up the package and referring them to the place they can download the files.
However, I ran into some holdups getting the right files downloaded for import. The README.md recommends downloading files with a specific set of variables included, because the full datasets are very large.
The table on line 58 in README.md (line 55 of README.Rmd) lists the suggested variables to include. However, these variable names are not a direct match for the "Field Name" or the "Description" fields of the DB1B download form.
For these reasons, it'd be best to use the specific "Field Name", e.g. "ItinID", "MktID", etc., displayed on the BTS download pages.
I'm unclear if the recommended names for the T100 dataset are a requirement for the package to function correctly or just a recommendation for clarity.
There is an error on line 63 of
Comments on function style
I would recommend making some changes to the behavior of the function
First, I'd recommend changing the names of the function's arguments and the way they're handled.
The function requires two arguments,
In my opinion, it would be clearer to users and more robust to different file names if the function required two arguments named
Second, I'd recommend changing the way the function returns its data. The function uses the filename of the input coupon dataset to determine the name of an object, and assigns that object. This object is assigned in an environment which is assigned in
The expected behavior would be for the function to return the created object, so that it can be assigned like:
db1b <- import_db1b("data-raw/71336240_t_DB1B_COUPON.csv", "data-raw/71336240_t_DB1B_TICKET.csv")
This gives the user more control over the name and location of the object that's created by the import process.
Network Generation Functions
I'm less familiar with igraph, so my comments here are a little less detailed.
Using the functions included, I was able to replicate the vignette's network map from the DB1B datasets successfully.
All the functions which begin
@toph-allen Thank you for your comments and thorough review.
As I've mentioned before, the BTS changes their website structure often, which I assume is related to the fact that they don't allow scraping. It might be counterproductive to create a function that becomes unusable after a couple of weeks.
Thank you for your feedback on this. However, the variables to be used are well described in the README file and in the help section for each function. There's also an argument
I'm not sure about this comment and which naming you are referring to. Some of the issues previously mentioned by the other reviewer have been fixed.
I don't think I fully understood these remarks. Both the DB1B and the T100 datasets use non-standard naming. This means that if you select the variables yourself, they will have one name, but if you download the full (zipped) file, they will be named differently. The reason I chose different variable names was to homogenise this and other issues. As I am working on a way to import other datasets (e.g. Sabre, OAG), it makes sense for the variables to have a standard name closer to all the others. Also, long variable names can generate confusion and are not really clear for those working with statistics.
File size is not an issue as they are quite small, so you can always download all the variables.
Again, this is described in the README file. There are differences between the zipped file and the pre-selected variables file: the variables are named differently, and this might be the reason for your confusion. When importing the zipped file, which uses the
Thank you for your comment, but this is explained in the Vignettes.
When working with temporal data it is easier this way. However, the name is now created based not on the file but on whether the data is DB1B (in which case the name starts with OD) or T100 (starting with T100). The rest is derived from the year and quarter variables.
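For illustration, deriving the name from the data itself might look like the following. This is my sketch of the scheme described, not skynet's actual code, and the function name is hypothetical.

```r
# Build the object name from the dataset type plus the year and quarter
# columns, rather than from the input file name (illustrative sketch).
make_object_name <- function(type = c("db1b", "t100"), year, quarter) {
  prefix <- if (match.arg(type) == "db1b") "OD" else "T100"
  paste0(prefix, "_", year, "Q", quarter)
}

make_object_name("db1b", 2011, 1)  # "OD_2011Q1"
make_object_name("t100", 2011, 1)  # "T100_2011Q1"
```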
Thank you for your comments. However I'm not sure if I understood them correctly.
It's in the README file.
This was required and requested by CRAN. As for the package, I do understand your remarks, but I want this package to follow CRAN reviewers' comments as well, so for now it has to be there.
Thank you for your comments here. I actually didn't want them to be great-circle arcs. They are curved to make the map easier to read; this was feedback from researchers using the package in their research.
Thank you for your comments here. I've explained the rationale behind the naming above in my comments to the other reviewer.
Thank you @toph-allen, @berdaniera and @noamross for all the comments and for all the help making this package better. Meanwhile I've pushed all the changes to my GitHub, and skynet is currently on version 1.1.0.
1- Indeed, that would require a major change, which I might include in a next major release.
Thank you once again for your constructive and detailed comments.
Yes, this information is in the README, but the concept of "multiple points of entry" is that a user may be looking for information in the function file and should find it there, either through repeated information or a link to another source.
Regarding 1, 2, 3, and other issues brought up by reviewers: having a consistent and fully documented API where functions behave as commonly expected is pretty core to our standards, so these major-version changes will be needed before we're able to accept. We at times accept packages that only address some review items via a "roadmap" for a future version (see the skimr review for an example), but this tends to be limited to cases of extensions, or at times re-factoring of internals for the purpose of enabling extensions.
Specifically with (2) - classes seem like a likely solution and S3 classes are relatively simple to implement, though I'm still not sure what the problem is that assigning to the environment solves, based on your feedback from users. Perhaps if you describe the use-case in a bit more detail we could suggest a solution.
Dear @noamross. Thank you for your comments and for clarifying my questions.
About (2), I'm also not fond of assigning objects like that, but the issue is simple. Most users will use this package for longitudinal analysis, meaning they won't work with just one year but with multiple. Some users mentioned that it would be easier if the imported objects were named automatically, so they could easily incorporate them into loops. If the object syntax is always the same, except for the suffix showing the year and quarter, it is then easy to run any sort of analysis for, let's say, 10 years. From my understanding, classes wouldn't solve this, as you would still need to name each object.
(3) has been fixed, so there's nothing to point here.
Once again thank you for your remarks and for your help with this package.
@FilipeamTeixeira Regarding the API and review timing issue, one option would be to put this issue on hold until your other review is done. We generally don't recommend putting a package in review in two places at once, for just this reason. There are times where, because a package is in widespread use or tightly linked to other tools, it makes sense to avoid changing the API for compatibility reasons. However, I don't think that is the case here - we would rather help get the package to a point where it is more usable for more people.
Re: (2) Making function input or output depend on file or object naming is pretty fragile and incompatible with common functional programming strategies. A much more common and robust way of doing this type of workflow is to create a list of objects and then iterate on that list using a function such as
@toph-allen and @berdaniera, it would be helpful if you could let us know how well @FilipeamTeixeira's changes address your comments at this stage. Even if we pause the review, it would be good to record those now so we have a record of the current state.
@noamross I do understand and I also don't want to start a discussion here about this as it would be counterproductive. However, I have the feeling that I'm probably not properly conveying my message and the users' needs.
The need for correctly named objects is to avoid having to create lists. Surely with
As for naming the functions, I do understand as well. However, this decision was made after talking to people working with the package. I would like to reiterate and stress that this package is rather specific in its target audience, meaning we can't compare it with others. Also, dplyr uses both snake case and dot naming, so this example exists in numerous packages.
If this is indeed paused, it is a pity, but I regret to say that it will stay this way for a while, as the users are my priority. If there are any remarks on the function naming, or any arguments that would justify standardising those functions, I'm happy to reconsider the changes. Another alternative is to create aliases for the functions, as found in numerous packages, at the risk of increasing complexity.
In any case, I'm truly thankful for all the help you've given so far and for all the time invested in this. Even if this goes nowhere, it has already been rather productive for me.
@FilipeamTeixeira Thanks for your responses to my review. I'll attempt to respond concisely to them, and to clarify and contextualize some of my original comments.
For context, I have some familiarity with the DB1B and T100 datasets — I'm part of a team at work which is creating models using these datasets. This is our first time working with these datasets, and we wound up creating some internal scripts to do some of what
I see that you made some updates to the code in response to some of the discussion here. Thanks for being responsive to our feedback. I think that some of the changes, especially with regard to things like argument handling, will be beneficial to new users. @noamross's comments on the consistency of arguments across packages are good explanations as to why.
I see what you're saying here — since the datasets all use different names for certain values, you're using a generic name. In that case, it might be helpful to give some examples of what the variables might be named — but this wasn't a high priority comment, in my mind.
When this happened, I was actually using the custom-selected variable file. The following code outputs the first line of each file; you can see that the variables are named differently from what the function expects. Do you think this is a recent change to the BTS website, like you mentioned?
```
$ head -n 1 71336240_T_DB1B_TICKET.csv
"ITIN_ID","COUPONS","ORIGIN_AIRPORT_ID","ORIGIN_AIRPORT_SEQ_ID","ORIGIN_CITY_MARKET_ID","ROUNDTRIP","ITIN_YIELD","PASSENGERS","ITIN_FARE","BULKFARE","DISTANCE_FULL","DISTANCE_GROUP","ITIN_GEO_TYPE",
$ head -n 1 71336240_T_DB1B_COUPON.csv
"ITIN_ID","MKT_ID","SEQ_NUM","COUPONS","YEAR","ORIGIN_AIRPORT_ID","ORIGIN_AIRPORT_SEQ_ID","ORIGIN_CITY_MARKET_ID","QUARTER","ORIGIN","DEST_AIRPORT_ID","DEST_AIRPORT_SEQ_ID","DEST_CITY_MARKET_ID","DEST","TRIP_BREAK","COUPON_TYPE","OPERATING_CARRIER","DISTANCE","GATEWAY",
```
@noamross's comments about making documentation more available and comprehensive, and about returning an object from the import functions for assignment, convey my thoughts on those subjects very well. I had no idea that the temporal nature of the dataset was the reason for the assignment and file naming. (Our team was actually doing the same thing with these datasets — using file names to represent different quarters, and pulling that info from the file name, and I had us instead use the year and quarter columns from the datasets to pull that info, for the principles that have been outlined elsewhere in the thread.)
I'll keep an eye on this and am happy to help out more as needed.
Many of our packages have a specific audience, but as I hope the case of @toph-allen above shows, there are many potential users out there who may use your package for reasons you have not anticipated. Our purpose in standardizing is to make sure this larger audience is served and that your package can be useful to a broader audience, and that your work gets exposure to them. We do not want standards for standards' sake. Of course every package has idiosyncrasies driven by speciality uses. Yet we want to ensure the exceptions help a user understand the package and keep it compatible with other workflows they may have. (In dplyr, for instance, dot-naming specifically identifies S3 conversion methods such as
Aliasing names and deprecating functions over time is a strategy we recommend! My co-editor Scott has written a guide to this: https://ropensci.org/technotes/2017/01/05/package-evolution/
@noamross, function naming was corrected according to your specifications, except for one minor internal function, which I will patch in the next version. I kept all the older functions but with a message stating that they are deprecated.
I believe that all the requirements have been fulfilled unless there's something else I might be missing.
Once again thank you.
Just to eliminate any confusion: In my review, I wasn’t requesting an argument to let the users name the objects which are then assigned in-function to the global environment. Rather, I think it would be better to follow convention and return the object at the end of the function with
@toph-allen So what was your point with the temporal data and the naming? There's no other way of handling temporal data (in this case) except asking the users to name every single object, which I find counterproductive. As we can't "scan" the entire environment for classes or even arguments, I'm not sure what the alternative is.
What made you switch between methods? I can give the user the option through an argument; that's the only option I see.
@FilipeamTeixeira Taking a look at the code, I see that a user can now provide the argument
To respond to your point about temporal data:
If I'm understanding it right, the motivation for having the assignment happen automatically in the function is that it makes it easier for your users to then loop over a vector of filenames for different time periods and automatically create the objects (for example, CSVs for 2017 Q1–Q4)?
If I were dealing with multiple files, and the function returned an object for each file, I might use one of the
Or, I'd create a single data frame for all time periods, and use
And having the option to return the object enables both of these workflows.
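To make the two workflows above concrete, here is a runnable sketch. `read_quarter()` is a stand-in for an importer that returns its result (like the revised import_db1b); it fabricates a small data frame so the example runs without the real files.

```r
# Stand-in importer: *returns* a data frame instead of assigning one.
# The fabricated values are for illustration only.
read_quarter <- function(year, quarter) {
  data.frame(year = year, quarter = quarter, passengers = c(120, 80, 95))
}

# Workflow 1: a named list of per-quarter objects, iterated with lapply,
# instead of OD_2017Q1 ... OD_2017Q4 assigned into the global environment
quarters <- lapply(1:4, function(q) read_quarter(2017, q))
names(quarters) <- paste0("OD_2017Q", 1:4)
sapply(quarters, function(d) sum(d$passengers))  # totals per quarter

# Workflow 2: one combined data frame, analysed by group
combined <- do.call(rbind, quarters)
aggregate(passengers ~ quarter, data = combined, FUN = sum)
```

Either way, the year/quarter information lives in the data (or the list names), so a ten-year longitudinal analysis is a single loop over the list rather than ten globally assigned objects.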
Thank you for replying to my review comments @FilipeamTeixeira. I am happy with the changes that you have made. My primary concerns in the original review were around the function naming, basic examples, and documentation for new users. I understand the desire to wait on further overview documentation until some text is published. The changes in the functions address my original concerns and the expansion of the online documentation will be helpful for new users to get data. The update on importing data looks good to me too.
Overall, the author has responded to my review and made changes to my satisfaction. I recommend approving this package.
I concur. I think that the changes @FilipeamTeixeira has made have addressed my concerns and improved the package overall, making it easier to begin using. Exposing the package to more people and workflows will help it continue moving in a direction where it can be broadly useful.
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.
We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.
@noamross Thank you for your help and for accepting the package.
I just have a practical and what might be a dumb question, but is it possible to keep the package both on my repo and on the ropensci repo? Even if mirrored or so?
@FilipeamTeixeira Thank you for your patience - it was up to me to get back to you about your interest in
This link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length: https://ropensci.org/tags/review/. Technotes are here for more short form posts: https://ropensci.org/technotes/.
Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post. Right now I have openings in the blog calendar for late August and September. When do you think you'd be ready to submit a draft?
Happy to answer any questions.