Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ethically_culturally_sensitive columns have incorrect/outdated entries #16

Open
ivelsko opened this issue Jun 15, 2020 · 20 comments
Open
Assignees
Labels
bug Something isn't working

Comments

@ivelsko
Copy link

ivelsko commented Jun 15, 2020

All of the Ethically_Culturally_sensitive columns for my CMC data are wrong, where sample.Ethically_culturally_sensitive is all NA, extract.Ethically_culturally_sensitive is all FALSE, and library.Ethically_culturally_sensitive is a combination of NA and FALSE, and all should be TRUE.

I just checked the Pandora website and they’re all correctly marked “Yes” there. Is sidora.core pulling old information?

When the 'Ethically/Culturally sensitive' option was added to Pandora, my samples were already in there and were automatically checked “No”, so at one point FALSE was a correct entry for all of them, but now they’re all “Yes” so everything should be TRUE. Since sample.Ethically_culturally_sensitive is NA this was never correct, while extract.Ethically_culturally_sensitive and library.Ethically_culturally_sensitive as FALSE are outdated. I think the NAs in library.Ethically_culturally_sensitive come from extracts that were never built into libraries so those are probably ok

@jfy133
Copy link
Member

jfy133 commented Jun 15, 2020

Uhh that's odd. No, it shouldn't be out of date as it should always pull the data fresh, unless you are using a cache... When did you update the entries in Pandora?

@nevrome
Copy link
Member

nevrome commented Jun 15, 2020

library(magrittr)

con <- sidora.core::get_pandora_connection()

samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Sample")), con = con) %>%
  sidora.core::join_pandora_tables()

samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::select(sample.Ethically_culturally_sensitive)

@nevrome
Copy link
Member

nevrome commented Jun 15, 2020

It's your type conversion, James. Look at this:

con <- sidora.core::get_pandora_connection()

samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Sample")), con = con) %>%
  sidora.core::join_pandora_tables()

ids <- samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::pull(sample.Id)

sidora.core::get_con("TAB_Sample", con) %>% tibble::as_tibble() %>% dplyr::filter(Id %in% ids) %>% dplyr::select(Ethically_culturally_sensitive)

@nevrome
Copy link
Member

nevrome commented Jun 15, 2020

@ivelsko So @jfy133 and I probably found the problem: We have a function which transforms column types to represent them correctly in R: sidora.core:::enforce_types().

Apparently there seem to be two different ways of encoding logicals in Pandora. The column Ethically_culturally_sensitive is encoded with Yes/No which can not be automatically converted with as.logical.

I'll try to provide a quick fix now and then leave it to @jfy133 in our next hackhour to check the columns Robot and Title 😄

@nevrome
Copy link
Member

nevrome commented Jun 15, 2020

@ivelsko Maybe you can install the latest version from github and give it a test.

@ivelsko
Copy link
Author

ivelsko commented Jun 16, 2020

I installed the latest version and checked again and it's partially fixed - the Ethically_culturally_sensitive column for sample now has the correct TRUE entries, but for extract, library, capture, and sequencing the entries are still incorrectly FALSE

@nevrome
Copy link
Member

nevrome commented Jun 16, 2020

So: The values are FALSE, not NA, ja? What value do you see in Pandora? I would be really surprised if it is "Yes".

@ivelsko
Copy link
Author

ivelsko commented Jun 16, 2020

Yep, they're FALSE. In Pandora it says "Yes" in each of those 4 tabs, but it's in a grayed-out text box that's automatically populated down from checking the "Yes" button on the Individual tab. I don't have any control on changing that entry on any tab except Individual

@nevrome
Copy link
Member

nevrome commented Jun 17, 2020

I'm confused now. Please take a look at the result of this query:

library(magrittr)

con <- sidora.core::get_pandora_connection()

samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Extract")), con = con) %>%
  sidora.core::join_pandora_tables()

ids <- samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::pull(extract.Id)

sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::filter(Id %in% ids) %>% dplyr::select(Ethically_culturally_sensitive)

DBI::dbDisconnect(con)

Two things:

  1. The Ethically_culturally_sensitive column in the extract table seems to be of type integer. So that's why the conversion to boolean fails. It's very bad that the same type of information is encoded in different ways in different tables and we should discuss this with the Pandora devs (who do we have to contact, @stschiff & @jfy133? Or could this still be caused by the R interface?)

  2. The query above yields 0s for the extracts from site CMC. Why not 1s?

@jfy133
Copy link
Member

jfy133 commented Jun 17, 2020

I just checked, and it is only Sample that is problematic. The others are all 0,1.

But indeed, I have no idea what's going on with CMC... the only thing I can thing of is that there is some wierd propagation thing for the web UI only, where if you say at e.g. sample library something is sensitive that that is 'applied' everything downstream for display (with the assumption people shouldn't look further down?).

We should write an email to Robert F. about it though, as I'm completely lost... I'll put a reminder tomorrow.

My code (after your con set up above):

sidora.core::get_con("TAB_Sample", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Library", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Capture", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Sequencing", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Id"), dplyr::contains("Ethic")) %>% dplyr::filter(Ethically_culturally_sensitive == 1) %>% print(n = 30)

@nevrome
Copy link
Member

nevrome commented Jun 17, 2020

Alright. 0/1 can be converted automatically by as.logical. So the column in TAB_Sample should be transformed, the others can stay as they are.

The CMC mismatch is really curious. Irina confirmed that the info both in the webinterface and in the csv export is correct. Only we're getting the wrong data.

@jfy133
Copy link
Member

jfy133 commented Jun 17, 2020

Yeah that makes me very nervous. I'll try and ask Robert tomorrow

@jfy133
Copy link
Member

jfy133 commented Jun 18, 2020

Ok, the Extract level 1s were mistakes. This is now fixed.

Only Sample defines this. So I think @nevrome we need to have a different system of loading stuff now, where we internally find that information already and remove those at data loading. It's ugly but I'm not sure how we can deal with that unfortunately.

@jfy133
Copy link
Member

jfy133 commented Jun 19, 2020

We had a little discussion with the sidora dev team. Having to refer to the Sample table every time we want to exclude sensitive samples is very computationally intensive.

We will ask Robert to make all those columns the same type, and then propagate this downstream. Then we set our functions to by default remove any row with a sensititve column is true, UNLESS someone requests it purposefully.

Any public version of the API will always remove these, even if published data.

@ivelsko
Copy link
Author

ivelsko commented Jun 19, 2020

This is almost ok, but I see an issue with it. For calculus/microbiome samples the only sensitive part is the human reads, and those shouldn't be analyzed by anyone. The rest of the data is fine for anyone to use, so the whole dataset shouldn't be removed for searching in this case.

It's really a problem with what the Ethically/culturally sensitive check box is supposed to indicate, which is that the data shouldn't be automatically screened/processed by the pipeline here. In some cases (ALC/FPA, for example) the whole sample and all downstream parts should be excluded, but for CMC only part of the data is off limits

@jfy133
Copy link
Member

jfy133 commented Jun 19, 2020

The problem is that the final data that is recorded in the raw data tab is the entire sequencing data, not filtered. We can then not guaruntee that people won't use it anyway, or just skip checking the 'sample' tab entirely. Secondly, with that box we don't necessarily specify what is actually sensitive about it. To find it, they would have to speak to the person.

That's why there would be an option to not filter it out (this is what I mean, 'by default' - maybe I wasn't clear), but you would only turn that on if you knew what you were looking ofr, and you are also aware you will have to contact other pepole to find out if you can use it.

@nevrome nevrome added this to To Do in Project Tracking Jun 19, 2020
@jfy133 jfy133 moved this from To Do to In Progress in Project Tracking Jun 19, 2020
@jfy133 jfy133 added the bug Something isn't working label Jul 3, 2020
@nevrome
Copy link
Member

nevrome commented Feb 25, 2022

Quick update on this in February 2022:

  • sample.Ethically_culturally_sensitive, so the iteration of the Ethically_culturally_sensitive column in TAB_Sample is still the only one relevant. All other columns downstream are still only empty dummies used for the UI. Maybe we should explicitly remove them in the download process.
  • Automatic processing in the autorun pipeline excludes sensitive samples, afaik.

@ivelsko
Copy link
Author

ivelsko commented Feb 25, 2022

Do you mean explicitly removing all downstream entries stemming from samples marked as ethically/culturally sensitive? I'm for that.

If you mean removing the other columns that repeat ethically/culturally sensitive at lower levels (extract, library, etc) then I'd prefer those be left in b/c at least it's something to indicate they shouldn't be worked with. Particularly if someone only downloads a specific table and they won't see that column from the sample table.

Autorun does exclude the sensitive samples, as long as they actually were tagged that way. There were a bunch that should have been and never were that I caught, but possibly there are entries by other people, maybe some who've left, that also weren't marked and should be.

@nevrome
Copy link
Member

nevrome commented Feb 25, 2022

I mean removing the columns 🤔. They are not reliably filled with information, so they can not be used for any decisions. Only the TAB_Sample one is, to my understanding.

Let's think about other ways to raise awareness of this issue. We could add a startup message to the package. Whenever somebody loads the library the first time in a session with library(sidora.core) we could print a warning. Many packages do this.

For example:

Warning: Pandora features many samples that are ethically and culturally sensitive and are therefore not available for many analyses. 
They are indicated in the column Ethically_culturally_sensitive of TAB_Sample. Make sure to exclude these samples depending on your application!

I'm sure you have a better phrasing in you, @ivelsko.

For downstream applications/scripts like the ones by @TCLamnidis, this message can be suppressed with suppressPackageStartupMessages(library(sidora.core)). Not so sure about calls with ::, though...

@ivelsko
Copy link
Author

ivelsko commented Feb 28, 2022

I like the idea of printing a warning. Can you print it with a big red ASCII-art yield sign? I tend to ignore these messages that come up at the start unless there's some image that catches my eye...

I think the text is pretty good, I'd suggest only some small changes

Warning: Pandora contains many samples that are ethically and/or culturally sensitive and are therefore off-limits for many analyses. 
They are indicated in the column Ethically_culturally_sensitive of TAB_Sample. Make sure to exclude these samples depending on your application and permissions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Project Tracking
  
In Progress
Development

No branches or pull requests

3 participants