Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export and document data #29

Closed
maurolepore opened this issue Feb 28, 2018 · 4 comments
Closed

Export and document data #29

maurolepore opened this issue Feb 28, 2018 · 4 comments
Assignees
Labels

Comments

@maurolepore
Copy link
Member

maurolepore commented Feb 28, 2018

This issue describes the process by which I export and document data. You can track this process searching for commits tagged with the number of this issue (#29).

@maurolepore maurolepore self-assigned this Feb 28, 2018
maurolepore added a commit that referenced this issue Feb 28, 2018
* Remove information that is better recorded via git. Git knows who does what and when. So it's best to avoid recording this info as comments because it is too easy for such comments to become out of sync with changes in the code.

* Describe file more generally so the comment remains up to date. Notice that the description refers to 4 tables; later a comment refers to 5 tables; and finally only 3 tables are exported. So it is best to not refer to how many tables we expect because this may continue to change.

* Remove references to table numbers; instead rely on table names. This is a clearer way to refer to tables because it identifies tables in a unique way (by name as opposed to by name and number). Also, numbers become out of date too easily, as tables are added, removed or their order is changed.

* Group code by table, not by task. This makes it easier to read and maintain the code.
@maurolepore
Copy link
Member Author

maurolepore commented Feb 28, 2018

@gonzalezeb,

The code that exports data contains this chunk.

# eliminate rows where fam or sp is unknown #use unique(allo_main$species)
master <- subset(master, family != "Unkown")
# chnage name of "equation" column to "equation_form"

1. Do you still want to filter master to exclude rows where sp == "!Unknown"?

Your comment suggest so, but the code doesn't do that. Please update to remove ambiguity. If you express your goal in code you need not to repeate your goal in a comment.

This example may help you write the code you want:

dfm <- data.frame(a = 1:3, b = 3:1)
subset(dfm, a >= 2 & b <=2)
#>   a b
#> 2 2 2
#> 3 3 1

2. Do you still want to rename the column equation as equation_form?

Althought the comment suggests this, the code doesn't do it. Please solve this ambiguity.

This already donne by SHA 0e985e0

@maurolepore
Copy link
Member Author

maurolepore commented Feb 28, 2018

@gonzalezeb,

For safety I've changed the script that exports data to subset the master dataset not by column position but by column name. Please review and ensure that the selected columns are the ones you intended.

Subsetting objects by position is too risky to do it in a sript; it's OK for interactive use but not for programming because the order of the columns can change too easily and the script will be blind to such change. For programming it is important to subset data by column name. If the names change the scrpit will flag the change with an error. This is desirable because it makes the program safer.

Consider the following example:

dfm <- data.frame(x = 1, y = 2, z = 3)

# Say that you have something like this
selected_names <- names(dfm)[c(1, 3)]

# You can get a vector of names with `datapasta::dpasta()`
datapasta::dpasta(selected_names)

# This is the result
c("x", "z")

# Now you can use this vector to subset `dfm`
dfm_cols <- c("x", "z")
dfm[dfm_cols]

@maurolepore
Copy link
Member Author

maurolepore commented Feb 28, 2018

@gonzalezeb,

Is the name of the file "allotemp_main.csv" expresive enough? I feel that "temp" is a sufix that may be missleading -- suggesting the data is not important but temporal. I like "_main" although I think you should pick only one of either "master" or "main". I suggest the name "allodb_master_data".

maurolepore added a commit that referenced this issue Feb 28, 2018
See for example ggplot2 at https://github.com/tidyverse/ggplot2/tree/master/data-raw. Notice that each data-raw/filename.csv is cleaned in data-raw/filename.R and exported to data/ via use_data(); then each exported dataset is documented in R/data.R
@maurolepore
Copy link
Member Author

maurolepore commented Mar 1, 2018

@gonzalezeb

Please review data-raw/data.Rmd. Some of the notes use informal abreviations (e.g. sp). Could you please rewrite, now considering that the document is not for personal use but to be shared with users? I think it is OK to abreviate words that mean a concept, for example, it is OK to use sp. to abreviate the concept species. But to refer not to the concept "species" but to an R object called species I suggest you spell the word out and type species (and best is to use code syntax as I do here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants