-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Export and document data #29
Comments
* Remove information that is better recorded via git. Git knows who does what and when. So it's best to avoid recording this info as comments because it is too easy for such comments to become out of sync with changes in the code. * Describe file more generally so the comment remains up to date. Notice that the description refers to 4 tables; later a comment refers to 5 tables; and finally only 3 tables are exported. So it is best to not refer to how many tables we expect because this may continue to change. * Remove references to table numbers; instead rely on table names. This is a clearer way to refer to tables because it identifies tables in a unique way (by name as opposed to by name and number). Also, numbers become out of date too easily, as tables are added, removed or their order is changed. * Group code by table, not by task. This makes it easier to read and maintain the code.
The code that exports data contains this chunk. # eliminate rows where fam or sp is unknown #use unique(allo_main$species)
master <- subset(master, family != "Unkown")
# chnage name of "equation" column to "equation_form" 1. Do you still want to filter
|
For safety I've changed the script that exports data to subset the master dataset not by column position but by column name. Please review and ensure that the selected columns are the ones you intended. Subsetting objects by position is too risky to do it in a sript; it's OK for interactive use but not for programming because the order of the columns can change too easily and the script will be blind to such change. For programming it is important to subset data by column name. If the names change the scrpit will flag the change with an error. This is desirable because it makes the program safer. Consider the following example: dfm <- data.frame(x = 1, y = 2, z = 3)
# Say that you have something like this
selected_names <- names(dfm)[c(1, 3)]
# You can get a vector of names with `datapasta::dpasta()`
datapasta::dpasta(selected_names)
# This is the result
c("x", "z")
# Now you can use this vector to subset `dfm`
dfm_cols <- c("x", "z")
dfm[dfm_cols] |
Is the name of the file "allotemp_main.csv" expresive enough? I feel that "temp" is a sufix that may be missleading -- suggesting the data is not important but temporal. I like "_main" although I think you should pick only one of either "master" or "main". I suggest the name "allodb_master_data". |
See for example ggplot2 at https://github.com/tidyverse/ggplot2/tree/master/data-raw. Notice that each data-raw/filename.csv is cleaned in data-raw/filename.R and exported to data/ via use_data(); then each exported dataset is documented in R/data.R
Please review data-raw/data.Rmd. Some of the notes use informal abreviations (e.g. sp). Could you please rewrite, now considering that the document is not for personal use but to be shared with users? I think it is OK to abreviate words that mean a concept, for example, it is OK to use sp. to abreviate the concept species. But to refer not to the concept "species" but to an R object called |
This issue describes the process by which I export and document data. You can track this process searching for commits tagged with the number of this issue (#29).
The text was updated successfully, but these errors were encountered: