Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize database #48

Closed
maurolepore opened this issue Sep 26, 2018 · 14 comments
Closed

Normalize database #48

maurolepore opened this issue Sep 26, 2018 · 14 comments

Comments

@maurolepore
Copy link
Member

One important aspect of making the database easy to work with and maintainable in the long run is to normalize it. Eventually, we will need to move in that direction. Our current not-normalized structure already seems to be exposing some issues. For us to assess how urgent it is to normalize the database, I'll document the issues I'm noticing here.

@maurolepore
Copy link
Member Author

maurolepore commented Sep 26, 2018

  • The biggest, ultimate issue is that it's unclear how to link the different tables derived from master. I already described this at https://github.com/forestgeo/allodb/issues/36#issuecomment-422855508. To get around this problem quickly I suggested populating equation_id with the values of equation_allometry -- so that the equation acts as an identifier of itself (Populate equation_id #47 (comment)). But I'm finding that this is not so straight forward, as the equations table has more rows than unique values of equation_allometry, i.e. equations is not normalized. This may not be a problem but makes me stop and think before I build code that relies on equation_allometry.

  • I also noticed that there are many less unique values of equation_form than there are unique values of equation_allometry. Should the database be redesigned to normalize equation_form, with the other tables pointing to equation_form via an equation_form_id? I feel so ignorant in designing a database!

@teixeirak
Copy link
Member

teixeirak commented Sep 26, 2018

  • Unless I'm misunderstanding something, the first issue is one that should be trivial to solve: (1) identify unique allometries. This can probably done based on the field equation_allometry, but just in case there happen to be identical equations for the same allometery, you might check another field (identified by @gonzalezeb-- or alternatively she can just review to make sure there are no cases of this). Then assign a unique equation.id number to each allometry.

  • Regarding the second, we expect many fewer unique values of equation_form than of equation_allometry. This is not a problem. One could create a separate table or equation orms, but I don't see that as necessary--unless for some reason its hhelpul for coding.

@maurolepore
Copy link
Member Author

Good to know that you see no value in separating equation_form. That checks the second box of item 2 in my previous comment.

About my item 1 above:

I believe that @gonzalezeb is the best person to decide what to do with the multiple rows of the equations table that have identical values of equation_allometry. (That is, whether those equations should get different values of equation_id, or get the same value of equation_id after resolving the differences between the other columns.)

While tracking the problem back might be relatively easy now, this will become harder as the data grows. That is why I now believe it is important to normalize the data as soon as possible, and then continue entering new data in each separate (normalized) table -- as in a SQL database.

Today I realized that, by creating the values of equation_id, I could mess things up. Better I'll be on hold and happy to help with whatever you ask.

@gonzalezeb
Copy link
Contributor

  1. I can see what you say Mauro. The equation table should have only 'unique" equations so I think there is a missing line in the code to generate the final equation table (that resides in data). We should have something like distinct(master, equation_allometry). Those unique equations get a unique equation_id using random number, etc. Right now I get 176 unique equation_allometry'._

  2. But the problem comes as we add more sites and more unique equations and how we will make this user friendly for future updates. Will people add data (a row) into the master table? A user should be able to add a new equation that gets a new equation_id but how is that equation_id assigned? .........I guess I answer myself! users should be able to add equations to an equation table.

  3. That said, before we go 'live' I think I should keep adding data on the master table, it is easier for me. I don't know how to do it on reverse.

@gonzalezeb
Copy link
Contributor

@maurolepore
this line will do the work (for some reason I am working from home and couldn't commit the change from my github desktop..)

equations<-distinct(equations, equation_allometry, .keep_all = T)

maurolepore added a commit that referenced this issue Sep 27, 2018
* This reduces the size of the data. But still does not normalize �the `equations` table because there are more rows than unique values of `equation_allometry` (#47).
@maurolepore
Copy link
Member Author

@gonzalezeb,

#48 (comment) (1): You are right: The code that split that tables lacked a line to pick only unique rows. I added that line yesterday -- although pushed only now, sorry. But the number of rows of equations is still greater than the number of unique values of equation_allometry:

nrow(dplyr::distinct(allodb::equations))
#> [1] 178
nrow(dplyr::distinct(allodb::equations, equation_allometry))
#> [1] 147

Created on 2018-09-27 by the reprex package (v0.2.1)

Whenever you have the chance, please have a look at the equations table and see if you can merge the 31 duplicated rows (178 - 147).

@maurolepore
Copy link
Member Author

#48 (comment) (2): I think a good way to let users populate a spreadsheet in a safe and user-friendly way is via a google form.

Users only interact with the form.

Under the hood, the form populates a file that can be downloaded as a .csv to link to a google sheet. Automatically, the sheet gets a new time stamp for each new entry. This itself is a unique identifier, although we could also generate a random one if we wanted.

@maurolepore
Copy link
Member Author

#48 (comment) (3): RE

I don't know how to do it on reverse.

After you merge the duplicated equations, there is nothing else you have to do. You would simply start editing the split tables instead of the master table. For you to practice, here I placed editable (.csv) versions of the data we already have in data/:
https://github.com/forestgeo/allodb/tree/master/data-raw/db.

(Code that creates those .csv files)

If by "on reverse" you mean the process of joining tables, this is possible with something like dplyr::left_join(table1, table2). But I suggest we should build a little shiny app that joins the tables under the hood. The users (including ourselves) could simply check boxes to view data from as many tables as they want.

@gonzalezeb
Copy link
Contributor

I think this will work well, I will give it a try tomorrow (on filing up splitted tables).

I reviewed few of the duplicated equations and realized that the reason why the number of rows of equations is greater than the number of unique values of equation_allometry is because there is different data related to the site (i.e site_units, proxy-species, spcode, etc). I still will check them and see what I can merge.

@gonzalezeb
Copy link
Contributor

I think I got it now!

nrow(dplyr::distinct(allodb::equations))
[1] 167
nrow(dplyr::distinct(allodb::equations, equation_allometry))
[1] 167

@gonzalezeb
Copy link
Contributor

@maurolepore I wanted to update the spplited tables using create_db.Rmd. I encountered this problem after running the 2nd chunk:


dirs <- dir_ls(here("./data"))
datasets_chr <- path_ext_remove(path_file(dirs))
datasets_chr %>% 
  map(~get(.x, pos = "package:allodb")) %>% 
  set_names(datasets_chr) %>% 
  fgeo.tool::dfs_to_csv("data-raw/db")

Error: dir must match a valid directory. bad dir: 'data-raw/db'

@maurolepore
Copy link
Member Author

@gonzalezeb, try now. It was the kind of problem that you fix with here::her(). Now the last line is this:

  # Notice here()
  fgeo.tool::dfs_to_csv(here("data-raw/db"))

@maurolepore
Copy link
Member Author

maurolepore commented Oct 1, 2018

@gonzalezeb, notice that before running create_db.Rmd you may want to update files in data/ (i.e. first run data-raw/data.Rmd).

After you do this once, and if you achieve a normalized database, then we should edit data-raw/data.Rmd to no longer start from the master table but instead start from each of the split .csv files. Then write them to data/ via usethis::use_data(). From then on you would edit the spilt .csv files (no longer the master table -- which we should remove). Let me know if this is unclear. I'm happy to do it myself once you are happy with the split .csv files.

@maurolepore
Copy link
Member Author

@gonzalezeb, it looks like you have normalized the database so I'm ready to close this issue. For reference, the commit 233ef4c generated a split database of .csv files with populated equation_ids (#47). There may be some details to fix it may be better to treat them as new issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants