Normalize database #48
Good to know that you see no value in separating it.

About my item 1 above: I believe that @gonzalezeb is the best person to decide what to do with the multiple rows of that table. While tracking the problem back might be relatively easy now, it will become harder as the data grows. That is why I now believe it is important to normalize the data as soon as possible, and then continue entering new data into each separate (normalized) table -- as in a SQL database.

Today I realized that, by creating the values of
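The split described above can be sketched in R. The table and column names here (`master`, `equation_id`, `site`) are assumptions for illustration, not the package's actual schema:

```r
library(dplyr)

# Hypothetical master table mixing equation-level and site-level columns
master <- tibble::tribble(
  ~equation_id, ~equation_allometry,     ~site,
  "eq1",        "exp(a + b * log(dbh))", "scbi",
  "eq1",        "exp(a + b * log(dbh))", "serc",
  "eq2",        "a * dbh^b",             "scbi"
)

# Normalized tables: each fact is stored exactly once
equations <- distinct(master, equation_id, equation_allometry)
sites_info <- distinct(master, equation_id, site)
```

New data would then be entered into each normalized table directly, and a master-like view reconstructed with a join only when needed.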
@maurolepore

```r
equations <- distinct(equations, equation_allometry, .keep_all = TRUE)
```
#48 (comment) (1): You are right: the code that split the tables lacked a line to pick only unique rows. I added that line yesterday -- although I pushed it only now, sorry. But the number of rows of `equations` is still greater than the number of unique values of `equation_allometry`:

```r
nrow(dplyr::distinct(allodb::equations))
#> [1] 178
nrow(dplyr::distinct(allodb::equations, equation_allometry))
#> [1] 147
```

Created on 2018-09-27 by the reprex package (v0.2.1)

Whenever you have the chance, please have a look at the
#48 (comment) (2): I think a good way to let users populate a spreadsheet in a safe and user-friendly way is via a Google Form. Users only interact with the form. Under the hood, the form populates a linked Google Sheet, which can be downloaded as a .csv. Automatically, the sheet gets a new timestamp for each new entry. That timestamp is itself a unique identifier, although we could also generate a random one if we wanted.
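If we ever preferred a random identifier over the timestamp, a minimal sketch could look like this (`random_id` is a made-up helper, not part of any package mentioned here):

```r
# Generate short random alphanumeric identifiers -- a hypothetical helper
random_id <- function(n = 1, length = 8) {
  vapply(
    seq_len(n),
    function(i) {
      paste(sample(c(letters, 0:9), length, replace = TRUE), collapse = "")
    },
    character(1)
  )
}

random_id(3)
```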
#48 (comment) (3): RE: After you merge the duplicated equations, there is nothing else you have to do. You would simply start editing the split tables instead of the master table. For you to practice, here I placed editable (.csv) versions of the data we already have in data/: (code that creates those .csv files). If by "in reverse" you mean the process of joining tables, this is possible with something like a table join.
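A minimal sketch of that joining step, assuming `dplyr::left_join()` and a shared `equation_id` key (the key and table names are assumptions):

```r
library(dplyr)

equations <- tibble::tibble(
  equation_id = c("eq1", "eq2"),
  equation_allometry = c("exp(a + b * log(dbh))", "a * dbh^b")
)
sites_info <- tibble::tibble(
  equation_id = c("eq1", "eq1", "eq2"),
  site = c("scbi", "serc", "scbi")
)

# Rebuild a master-like view from the split tables
master <- left_join(sites_info, equations, by = "equation_id")
```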
I think this will work well; I will give it a try tomorrow (on filling up the split tables). I reviewed a few of the duplicated equations and realized that the reason why the number of rows of `equations` is greater than the number of unique values of `equation_allometry` is that there is different data related to the site (e.g. site_units, proxy-species, spcode, etc.). I will still check them and see what I can merge.
I think I got it now!
@maurolepore I wanted to update the split tables using create_db.Rmd. I encountered this problem after running the 2nd chunk:

Error:
@gonzalezeb, try now. It was the kind of problem that you fix with here::here(). Now the last line is this:

```r
# Notice here()
fgeo.tool::dfs_to_csv(here("data-raw/db"))
```
@gonzalezeb, notice that before running create_db.Rmd you may want to update the files in data/ (i.e. first run data-raw/data.Rmd). After you do this once, and if you achieve a normalized database, then we should edit data-raw/data.Rmd so that it no longer starts from the master table but instead starts from each of the split .csv files, then writes them to data/ via
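A sketch of what the updated data-raw/data.Rmd could start with, assuming the split files live in data-raw/db/ (the exact saving step depends on how the package exports its data):

```r
library(here)

# Read every split .csv from data-raw/db into a named list of data frames
paths <- list.files(here("data-raw/db"), pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
names(tables) <- tools::file_path_sans_ext(basename(paths))

# Then save each table into data/, e.g. with usethis::use_data()
```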
@gonzalezeb, it looks like you have normalized the database, so I'm ready to close this issue. For reference, commit 233ef4c generated a split database of .csv files with populated
One important aspect of making the database easy to work with and maintainable in the long run is to normalize it. Eventually, we will need to move in that direction. Our current non-normalized structure already seems to be exposing some issues. For us to assess how urgent it is to normalize the database, I'll document the issues I notice here.