Ethically_culturally_sensitive columns have incorrect/outdated entries #16

ivelsko · 2020-06-15T19:46:54Z

All of the Ethically_Culturally_sensitive columns for my CMC data are wrong, where sample.Ethically_culturally_sensitive is all NA, extract.Ethically_culturally_sensitive is all FALSE, and library.Ethically_culturally_sensitive is a combination of NA and FALSE, and all should be TRUE.

I just checked the Pandora website and they’re all correctly marked “Yes” there. Is sidora.core pulling old information?

When the 'Ethically/Culturally sensitive' option was added to Pandora, my samples were already in there and were automatically checked “No”, so at one point FALSE was a correct entry for all of them, but now they’re all “Yes” so everything should be TRUE. Since sample.Ethically_culturally_sensitive is NA this was never correct, while extract.Ethically_culturally_sensitive and library.Ethically_culturally_sensitive as FALSE are outdated. I think the NAs in library.Ethically_culturally_sensitive come from extracts that were never built into libraries so those are probably ok

The text was updated successfully, but these errors were encountered:

jfy133 · 2020-06-15T19:52:04Z

Uhh that's odd. No, it shouldn't be out of date as it should always pull the data fresh, unless you are using a cache... When did you update the entries in Pandora?

nevrome · 2020-06-15T19:53:37Z

library(magrittr)

con <- sidora.core::get_pandora_connection()

samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Sample")), con = con) %>%
  sidora.core::join_pandora_tables()

samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::select(sample.Ethically_culturally_sensitive)

nevrome · 2020-06-15T19:59:24Z

It's your type conversion, James. Look at this:

con <- sidora.core::get_pandora_connection()

samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Sample")), con = con) %>%
  sidora.core::join_pandora_tables()

ids <- samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::pull(sample.Id)

sidora.core::get_con("TAB_Sample", con) %>% tibble::as_tibble() %>% dplyr::filter(Id %in% ids) %>% dplyr::select(Ethically_culturally_sensitive)

nevrome · 2020-06-15T20:16:35Z

@ivelsko So @jfy133 and I probably found the problem: We have a function which transforms column types to represent them correctly in R: sidora.core:::enforce_types().

Apparently there seem to be two different ways of encoding logicals in Pandora. The column Ethically_culturally_sensitive is encoded with Yes/No which can not be automatically converted with as.logical.

I'll try to provide a quick fix now and then leave it to @jfy133 in our next hackhour to check the columns Robot and Title 😄

nevrome · 2020-06-15T20:27:02Z

@ivelsko Maybe you can install the latest version from github and give it a test.

ivelsko · 2020-06-16T16:40:17Z

I installed the latest version and checked again and it's partially fixed - the Ethically_culturally_sensitive column for sample now has the correct TRUE entries, but for extract, library, capture, and sequencing the entries are still incorrectly FALSE

nevrome · 2020-06-16T16:42:11Z

So: The values are FALSE, not NA, ja? What value do you see in Pandora? I would be really surprised if it is "Yes".

ivelsko · 2020-06-16T18:11:26Z

Yep, they're FALSE. In Pandora it says "Yes" in each of those 4 tabs, but it's in a grayed-out text box that's automatically populated down from checking the "Yes" button on the Individual tab. I don't have any control on changing that entry on any tab except Individual

nevrome · 2020-06-17T08:05:06Z

I'm confused now. Please take a look at the result of this query:

library(magrittr)

con <- sidora.core::get_pandora_connection()

samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Extract")), con = con) %>%
  sidora.core::join_pandora_tables()

ids <- samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::pull(extract.Id)

sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::filter(Id %in% ids) %>% dplyr::select(Ethically_culturally_sensitive)

DBI::dbDisconnect(con)

Two things:

The Ethically_culturally_sensitive column in the extract table seems to be of type integer. So that's why the conversion to boolean fails. It's very bad that the same type of information is encoded in different ways in different tables and we should discuss this with the Pandora devs (who do we have to contact, @stschiff & @jfy133? Or could this still be caused by the R interface?)
The query above yields 0s for the extracts from site CMC. Why not 1s?

jfy133 · 2020-06-17T17:51:38Z

I just checked, and it is only Sample that is problematic. The others are all 0,1.

But indeed, I have no idea what's going on with CMC... the only thing I can thing of is that there is some wierd propagation thing for the web UI only, where if you say at e.g. sample library something is sensitive that that is 'applied' everything downstream for display (with the assumption people shouldn't look further down?).

We should write an email to Robert F. about it though, as I'm completely lost... I'll put a reminder tomorrow.

My code (after your con set up above):

sidora.core::get_con("TAB_Sample", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Library", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Capture", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Sequencing", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))

sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Id"), dplyr::contains("Ethic")) %>% dplyr::filter(Ethically_culturally_sensitive == 1) %>% print(n = 30)

nevrome · 2020-06-17T18:57:25Z

Alright. 0/1 can be converted automatically by as.logical. So the column in TAB_Sample should be transformed, the others can stay as they are.

The CMC mismatch is really curious. Irina confirmed that the info both in the webinterface and in the csv export is correct. Only we're getting the wrong data.

jfy133 · 2020-06-17T19:17:33Z

Yeah that makes me very nervous. I'll try and ask Robert tomorrow

jfy133 · 2020-06-18T08:01:00Z

Ok, the Extract level 1s were mistakes. This is now fixed.

Only Sample defines this. So I think @nevrome we need to have a different system of loading stuff now, where we internally find that information already and remove those at data loading. It's ugly but I'm not sure how we can deal with that unfortunately.

jfy133 · 2020-06-19T11:35:18Z

We had a little discussion with the sidora dev team. Having to refer to the Sample table every time we want to exclude sensitive samples is very computationally intensive.

We will ask Robert to make all those columns the same type, and then propagate this downstream. Then we set our functions to by default remove any row with a sensititve column is true, UNLESS someone requests it purposefully.

Any public version of the API will always remove these, even if published data.

ivelsko · 2020-06-19T11:44:25Z

This is almost ok, but I see an issue with it. For calculus/microbiome samples the only sensitive part is the human reads, and those shouldn't be analyzed by anyone. The rest of the data is fine for anyone to use, so the whole dataset shouldn't be removed for searching in this case.

It's really a problem with what the Ethically/culturally sensitive check box is supposed to indicate, which is that the data shouldn't be automatically screened/processed by the pipeline here. In some cases (ALC/FPA, for example) the whole sample and all downstream parts should be excluded, but for CMC only part of the data is off limits

jfy133 · 2020-06-19T12:45:53Z

The problem is that the final data that is recorded in the raw data tab is the entire sequencing data, not filtered. We can then not guaruntee that people won't use it anyway, or just skip checking the 'sample' tab entirely. Secondly, with that box we don't necessarily specify what is actually sensitive about it. To find it, they would have to speak to the person.

That's why there would be an option to not filter it out (this is what I mean, 'by default' - maybe I wasn't clear), but you would only turn that on if you knew what you were looking ofr, and you are also aware you will have to contact other pepole to find out if you can use it.

nevrome · 2022-02-25T13:17:56Z

Quick update on this in February 2022:

sample.Ethically_culturally_sensitive, so the iteration of the Ethically_culturally_sensitive column in TAB_Sample is still the only one relevant. All other columns downstream are still only empty dummies used for the UI. Maybe we should explicitly remove them in the download process.
Automatic processing in the autorun pipeline excludes sensitive samples, afaik.

ivelsko · 2022-02-25T13:37:55Z

Do you mean explicitly removing all downstream entries stemming from samples marked as ethically/culturally sensitive? I'm for that.

If you mean removing the other columns that repeat ethically/culturally sensitive at lower levels (extract, library, etc) then I'd prefer those be left in b/c at least it's something to indicate they shouldn't be worked with. Particularly if someone only downloads a specific table and they won't see that column from the sample table.

Autorun does exclude the sensitive samples, as long as they actually were tagged that way. There were a bunch that should have been and never were that I caught, but possibly there are entries by other people, maybe some who've left, that also weren't marked and should be.

nevrome · 2022-02-25T14:05:50Z

I mean removing the columns 🤔. They are not reliably filled with information, so they can not be used for any decisions. Only the TAB_Sample one is, to my understanding.

Let's think about other ways to raise awareness of this issue. We could add a startup message to the package. Whenever somebody loads the library the first time in a session with library(sidora.core) we could print a warning. Many packages do this.

For example:

Warning: Pandora features many samples that are ethically and culturally sensitive and are therefore not available for many analyses. 
They are indicated in the column Ethically_culturally_sensitive of TAB_Sample. Make sure to exclude these samples depending on your application!

I'm sure you have a better phrasing in you, @ivelsko.

For downstream applications/scripts like the ones by @TCLamnidis, this message can be suppressed with suppressPackageStartupMessages(library(sidora.core)). Not so sure about calls with ::, though...

ivelsko · 2022-02-28T12:57:16Z

I like the idea of printing a warning. Can you print it with a big red ASCII-art yield sign? I tend to ignore these messages that come up at the start unless there's some image that catches my eye...

I think the text is pretty good, I'd suggest only some small changes

Warning: Pandora contains many samples that are ethically and/or culturally sensitive and are therefore off-limits for many analyses. 
They are indicated in the column Ethically_culturally_sensitive of TAB_Sample. Make sure to exclude these samples depending on your application and permissions!

nevrome added a commit that referenced this issue Jun 15, 2020

quick fix of issue raised by @ivelsko in #16

0d0db83

nevrome assigned jfy133 Jun 15, 2020

nevrome added this to To Do in Project Tracking Jun 19, 2020

jfy133 moved this from To Do to In Progress in Project Tracking Jun 19, 2020

jfy133 added the bug Something isn't working label Jul 3, 2020

nevrome mentioned this issue Jan 30, 2022

Add support for new coredb ID entries #47

Closed

nevrome mentioned this issue Mar 9, 2022

proposal for a startup message #51

Merged

nevrome mentioned this issue Mar 19, 2022

Fake columns #56

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ethically_culturally_sensitive columns have incorrect/outdated entries #16

Ethically_culturally_sensitive columns have incorrect/outdated entries #16

ivelsko commented Jun 15, 2020

jfy133 commented Jun 15, 2020

nevrome commented Jun 15, 2020

nevrome commented Jun 15, 2020

nevrome commented Jun 15, 2020 •

edited

Loading

nevrome commented Jun 15, 2020

ivelsko commented Jun 16, 2020

nevrome commented Jun 16, 2020

ivelsko commented Jun 16, 2020

nevrome commented Jun 17, 2020 •

edited

Loading

jfy133 commented Jun 17, 2020

nevrome commented Jun 17, 2020 •

edited

Loading

jfy133 commented Jun 17, 2020

jfy133 commented Jun 18, 2020

jfy133 commented Jun 19, 2020

ivelsko commented Jun 19, 2020

jfy133 commented Jun 19, 2020

nevrome commented Feb 25, 2022

ivelsko commented Feb 25, 2022

nevrome commented Feb 25, 2022

ivelsko commented Feb 28, 2022

Ethically_culturally_sensitive columns have incorrect/outdated entries #16

Ethically_culturally_sensitive columns have incorrect/outdated entries #16

Comments

ivelsko commented Jun 15, 2020

jfy133 commented Jun 15, 2020

nevrome commented Jun 15, 2020

nevrome commented Jun 15, 2020

nevrome commented Jun 15, 2020 • edited Loading

nevrome commented Jun 15, 2020

ivelsko commented Jun 16, 2020

nevrome commented Jun 16, 2020

ivelsko commented Jun 16, 2020

nevrome commented Jun 17, 2020 • edited Loading

jfy133 commented Jun 17, 2020

nevrome commented Jun 17, 2020 • edited Loading

jfy133 commented Jun 17, 2020

jfy133 commented Jun 18, 2020

jfy133 commented Jun 19, 2020

ivelsko commented Jun 19, 2020

jfy133 commented Jun 19, 2020

nevrome commented Feb 25, 2022

ivelsko commented Feb 25, 2022

nevrome commented Feb 25, 2022

ivelsko commented Feb 28, 2022

nevrome commented Jun 15, 2020 •

edited

Loading

nevrome commented Jun 17, 2020 •

edited

Loading

nevrome commented Jun 17, 2020 •

edited

Loading