Ethically_culturally_sensitive columns have incorrect/outdated entries #16
Uhh that's odd. No, it shouldn't be out of date as it should always pull the data fresh, unless you are using a cache... When did you update the entries in Pandora?
It's your type conversion, James. Look at this:
@ivelsko So @jfy133 and I probably found the problem: We have a function which transforms column types to represent them correctly in R: Apparently there are two different ways of encoding logicals in Pandora. The column I'll try to provide a quick fix now and then leave it to @jfy133 in our next hackhour to check the columns
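A minimal sketch of the kind of normalisation described in this comment, written in Python for illustration rather than as sidora.core's actual R code. It assumes the two encodings are "Yes"/"No" text and 0/1 values, which the thread does not state outright; `to_logical` and the accepted spellings are hypothetical.

```python
from typing import Optional

# Assumed spellings of the two encodings; the thread only says that
# two different encodings exist, not exactly what they look like.
_TRUE_VALUES = {"yes", "true", "1"}
_FALSE_VALUES = {"no", "false", "0"}


def to_logical(value) -> Optional[bool]:
    """Map a raw database value to True, False, or None (missing)."""
    if value is None:
        return None
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in _TRUE_VALUES:
        return True
    if text in _FALSE_VALUES:
        return False
    # Unknown encodings become missing instead of silently becoming False.
    return None
```

Returning a missing value for anything unrecognised keeps an unexpected encoding visible, instead of turning it into a FALSE that looks like a real answer.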
@ivelsko Maybe you can install the latest version from GitHub and give it a test.
I installed the latest version and checked again and it's partially fixed - the
So: The values are FALSE, not NA, ja? What value do you see in Pandora? I would be really surprised if it is "Yes".
Yep, they're FALSE. In Pandora it says "Yes" in each of those 4 tabs, but it's in a grayed-out text box that's automatically populated down from checking the "Yes" button on the Individual tab. I don't have any control over changing that entry on any tab except Individual.
I'm confused now. Please take a look at the result of this query:
Two things:
I just checked, and it is only Sample that is problematic. The others are all 0,1. But indeed, I have no idea what's going on with CMC... the only thing I can think of is that there is some weird propagation thing for the web UI only, where if you say at e.g. sample library something is sensitive, that is 'applied' to everything downstream for display (with the assumption people shouldn't look further down?). We should write an email to Robert F. about it though, as I'm completely lost... I'll put a reminder tomorrow. My code (after your con set up above):
Alright. The CMC mismatch is really curious. Irina confirmed that the info both in the web interface and in the CSV export is correct. Only we're getting the wrong data.
Yeah that makes me very nervous. I'll try and ask Robert tomorrow.
Ok, the Extract level Only Sample defines this. So I think @nevrome we need to have a different system of loading stuff now, where we internally find that information already and remove those at data loading. It's ugly but I'm not sure how we can deal with that unfortunately.
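One way to picture "finding that information internally and removing those at data loading", sketched in Python for illustration only. The table layout (lists of dicts), the `sample_id` key, the column name, and the function name are all hypothetical and not taken from sidora.core or Pandora.

```python
def drop_sensitive_descendants(samples, downstream_rows):
    """Remove downstream rows whose parent sample is flagged as sensitive.

    `samples` and `downstream_rows` are lists of dicts. Only the sample
    table is trusted for the flag, as suggested in the comment above.
    """
    sensitive_ids = {
        row["sample_id"]
        for row in samples
        if row.get("ethically_culturally_sensitive") is True
    }
    return [
        row for row in downstream_rows
        if row["sample_id"] not in sensitive_ids
    ]


# Hypothetical usage with two samples, one of them flagged.
samples = [
    {"sample_id": "S1", "ethically_culturally_sensitive": True},
    {"sample_id": "S2", "ethically_culturally_sensitive": False},
]
extracts = [
    {"extract_id": "E1", "sample_id": "S1"},
    {"extract_id": "E2", "sample_id": "S2"},
]
kept = drop_sensitive_descendants(samples, extracts)
```

Every downstream table would need the same lookup against the sample table, which is the cost raised in the next comment.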
We had a little discussion with the sidora dev team. Having to refer to the Sample table every time we want to exclude sensitive samples is very computationally intensive. We will ask Robert to make all those columns the same type, and then propagate this downstream. Then we set our functions to remove, by default, any row where a sensitive column is TRUE, UNLESS someone requests it purposefully. Any public version of the API will always remove these, even for published data.
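The "remove by default, unless requested" behaviour could look roughly like the following Python sketch. It assumes the sensitive flag has already been propagated to each table as a proper logical column; the function name, column name, and `include_sensitive` parameter are hypothetical.

```python
def filter_sensitive(rows, column="ethically_culturally_sensitive",
                     include_sensitive=False):
    """Return rows with sensitive entries removed unless explicitly requested.

    `rows` is a list of dicts. A row is dropped only when the flag is
    exactly True; rows with a missing flag are kept in this sketch.
    """
    if include_sensitive:
        return list(rows)
    return [row for row in rows if row.get(column) is not True]


# Hypothetical usage: the default call hides the flagged row.
rows = [
    {"id": "A", "ethically_culturally_sensitive": True},
    {"id": "B", "ethically_culturally_sensitive": False},
    {"id": "C", "ethically_culturally_sensitive": None},
]
default_view = filter_sensitive(rows)
full_view = filter_sensitive(rows, include_sensitive=True)
```

Keeping rows with a missing flag is a design choice of this sketch; since the thread reports columns that are not reliably filled, treating missing values as sensitive would be the more cautious alternative.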
This is almost ok, but I see an issue with it. For calculus/microbiome samples the only sensitive part is the human reads, and those shouldn't be analyzed by anyone. The rest of the data is fine for anyone to use, so the whole dataset shouldn't be removed for searching in this case. It's really a problem with what the Ethically/culturally sensitive check box is supposed to indicate, which is that the data shouldn't be automatically screened/processed by the pipeline here. In some cases (ALC/FPA, for example) the whole sample and all downstream parts should be excluded, but for CMC only part of the data is off limits.
The problem is that the final data that is recorded in the raw data tab is the entire sequencing data, not filtered. We then cannot guarantee that people won't use it anyway, or just skip checking the 'sample' tab entirely. Secondly, with that box we don't necessarily specify what is actually sensitive about it. To find out, they would have to speak to the person. That's why there would be an option to not filter it out (this is what I mean by 'by default' - maybe I wasn't clear), but you would only turn that on if you knew what you were looking for, and you are also aware you will have to contact other people to find out if you can use it.
Quick update on this in February 2022:
Do you mean explicitly removing all downstream entries stemming from samples marked as ethically/culturally sensitive? I'm for that. If you mean removing the other columns that repeat ethically/culturally sensitive at lower levels (extract, library, etc) then I'd prefer those be left in b/c at least it's something to indicate they shouldn't be worked with. Particularly if someone only downloads a specific table and they won't see that column from the sample table. Autorun does exclude the sensitive samples, as long as they actually were tagged that way. There were a bunch that should have been and never were that I caught, but possibly there are entries by other people, maybe some who've left, that also weren't marked and should be.
I mean removing the columns 🤔. They are not reliably filled with information, so they can not be used for any decisions. Only the Let's think about other ways to raise awareness of this issue. We could add a startup message to the package. Whenever somebody loads the library the first time in a session with For example:
I'm sure you have a better phrasing in you, @ivelsko. For downstream applications/scripts like the ones by @TCLamnidis, this message can be suppressed with
I like the idea of printing a warning. Can you print it with a big red ASCII-art yield sign? I tend to ignore these messages that come up at the start unless there's some image that catches my eye... I think the text is pretty good, I'd suggest only some small changes
All of the Ethically_Culturally_sensitive columns for my CMC data are wrong, where `sample.Ethically_culturally_sensitive` is all NA, `extract.Ethically_culturally_sensitive` is all FALSE, and `library.Ethically_culturally_sensitive` is a combination of NA and FALSE, and all should be TRUE.
I just checked the Pandora website and they’re all correctly marked “Yes” there. Is sidora.core pulling old information?
When the 'Ethically/Culturally sensitive' option was added to Pandora, my samples were already in there and were automatically checked “No”, so at one point FALSE was a correct entry for all of them, but now they’re all “Yes” so everything should be TRUE. Since `sample.Ethically_culturally_sensitive` is NA this was never correct, while `extract.Ethically_culturally_sensitive` and `library.Ethically_culturally_sensitive` as FALSE are outdated. I think the NAs in `library.Ethically_culturally_sensitive` come from extracts that were never built into libraries so those are probably ok.