Minor discrepancies between subfolder csvs and master sheet #30

Closed
ArthurSpirling opened this issue Aug 15, 2023 · 4 comments

Comments

@ArthurSpirling

ArthurSpirling commented Aug 15, 2023

Hello @vincentarelbundock -- thanks so much for providing these data.

I did a very quick scan through the data and documentation for the same. In particular, I was looking for any discrepancies between this main sheet and the names of the data sets themselves (as in name.csv) stored in the subfolders.

Here are some that appear in the data as CSVs but are not documented on the sheet. This was very rough and ready, and I might have missed something, but just in case it's helpful for your sweeps:

"aldh2" "apoeapoc" "bomregions2011" "bomregions2012"
"bomsoi2001" "cf" "cnv" "crohn"
"Damian" "fa" "fsnps"
"head.injury" "hla" "inf1"
"jma.cojo" "l51" "lukas" "mao"
"meyer" "mfblong" "mr" "nep499"
"PD"

For example, bomregions2012.csv appears in the DAAG subfolder, but not on that master sheet. And indeed, it has documentation here.
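
A check of this kind can be sketched in Python. The sketch assumes each package has its own subfolder of name.csv files and that the master sheet is a CSV file with `Package` and `Item` columns. Those column names, the folder layout, and the function name `undocumented_csvs` are illustrative assumptions, not a description of how the scan above was actually done.

```python
import csv
from pathlib import Path


def undocumented_csvs(csv_root, index_path):
    """Return sorted (package, item) pairs that exist on disk as
    <csv_root>/<package>/<item>.csv but are missing from the index sheet.

    Assumed (hypothetical) layout: the index is a CSV file with
    'Package' and 'Item' columns.
    """
    with open(index_path, newline="", encoding="utf-8") as fh:
        documented = {(row["Package"], row["Item"]) for row in csv.DictReader(fh)}

    # Path.stem drops only the final ".csv", so "head.injury.csv" -> "head.injury".
    on_disk = {
        (path.parent.name, path.stem)
        for path in Path(csv_root).glob("*/*.csv")
    }
    return sorted(on_disk - documented)
```

Comparing (package, item) pairs rather than bare item names avoids false matches when two packages ship a dataset with the same name.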

Again, thanks for all this work!

@ArthurSpirling
Author

Ah, also, there are entries for both hdma and hmda, both from Ecdat and both with seemingly identical descriptions (?) and docs.

@ArthurSpirling
Author

Update: DAAG contains both a head.injury.csv and a headInjury.csv, which may be identical? Not sure.
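
Whether two such files are identical can be checked directly. The following is a minimal sketch using Python's standard `filecmp` module; the helper name `same_contents` and the example paths in the usage note are assumptions for illustration.

```python
import filecmp


def same_contents(path_a, path_b):
    """Return True if the two files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of relying
    # only on os.stat() metadata such as size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

For example, `same_contents("csv/DAAG/head.injury.csv", "csv/DAAG/headInjury.csv")` would answer the question above, assuming that folder layout. A byte-level match is a strict test: two files holding the same data with different row order or number formatting would compare as different.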

@vincentarelbundock
Owner

vincentarelbundock commented Aug 16, 2023

Thanks for the report. Glad the website is useful!

I looked at a few of these and my best guess is this:

  1. My script never calls git rm on anything, so datasets stay there forever. This is important in case someone links to the URL in one of their scripts.
  2. However, the main sheet index is created every time I run the script, and that's based on what is currently available in the packages. I think that also makes sense: If a package maintainer removes a dataset, I may still want to keep permanent links to protect users, but it's probably "polite" to not advertise the dataset anymore.
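
The two behaviours in this list can be illustrated with a short Python sketch. It is not the maintainer's script: the function name `publish`, the `Package`/`Item` index columns, and the folder layout are all hypothetical, chosen only to show how "never delete files" and "rebuild the index each run" combine.

```python
import csv
from pathlib import Path


def publish(current, csv_root, index_path):
    """Write current datasets and rebuild the index from them alone.

    `current` maps (package, item) -> CSV text for the datasets that the
    packages provide right now. Files written by earlier runs are never
    deleted, so old URLs keep resolving, but the index only advertises
    what is in `current`.
    """
    root = Path(csv_root)
    for (package, item), text in current.items():
        target = root / package / f"{item}.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    # The index is regenerated from scratch on every run (hypothetical columns).
    with open(index_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Package", "Item"])
        writer.writerows(sorted(current))
```

Under this design, a dataset dropped by its package keeps its file but disappears from the index on the next run, which produces exactly the kind of file-versus-sheet difference reported in this issue.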

The few datasets I checked didn't seem to be available in their packages anymore. And in the head.injury case, the DAAG changelog says it was a duplicate and was removed:

https://github.com/cran/DAAG/blob/master/NEWS#L27

Again, I didn't check them all, but my provisional conclusion is that things are probably fine as-is. Makes sense?

@ArthurSpirling
Author

Sounds good, thanks very much.
