Analyze 3' end adapters #107
Thanks @balajtimate! Comments on the plots:
Did you already identify additional adapters (#106)? If so, please create a PR, link it to #106, and request a review from me so we can merge it. Once the adapters are merged, you can run the tests again. With the new plots, we should then be able to identify better defaults for the parameters. Please also add the table with the percentiles for the different distributions; we will need those to pick decent defaults.
Awesome! I think we need more data; the histograms are very sparse. But you have the code now, and with @BorisYourich's updated list, we may have a lot more data soon.
Hey, I just uploaded some data to the shared docs. I am running it again with greater coverage, just in case there are too few samples for some species.
I'm attaching the results from the updated tests. This was done using 456 samples, with the following conditions:
Unfortunately, the adapter is still not getting inferred in most cases. Looking at the results table, it's either:
Thanks a lot @balajtimate! So it looks like ~34% of adapters identified is the upper limit of what we can possibly get with the current settings. The bigger problem, of course, is the ~45% that we miss. For the remaining 21% - yeah, I guess we just don't have their adapters in the list. Here the k-mer approach would help. Of course it's also possible that, for whatever reason, there are just no adapters in these libraries. You could have a look at whether you find something striking in the metadata of these samples. Perhaps many of them have something in common, like the library selection strategy.
I ran the tests again with all 770 samples while lowering the frequency ratio. Even with the freq ratio set to 0.1%, only 56% have their adapters inferred, while there are 186 samples where no adapter gets inferred. I'll look into the metadata; that could be a good point. Interestingly, there are also 18 samples where nothing gets inferred: no organism, read length, or orientation either.
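The percentages discussed above can be computed directly from a results table. Below is a minimal sketch of that tally; note that the column names (`sample`, `adapter`) and the `NA` marker are assumptions for illustration, not HTSinfer's actual output schema.

```python
# Hedged sketch: tally how many samples got an adapter inferred from a
# results table. Column names and the "NA" sentinel are assumptions.
import csv
from io import StringIO

def summarize(results_csv: str) -> dict:
    """Count samples with and without an inferred adapter."""
    reader = csv.DictReader(StringIO(results_csv))
    total = inferred = 0
    for row in reader:
        total += 1
        # Treat empty strings and "NA" as "no adapter inferred".
        if row["adapter"] not in ("", "NA"):
            inferred += 1
    return {
        "total": total,
        "inferred": inferred,
        "fraction": inferred / total if total else 0.0,
    }

# Tiny fabricated table (not real data):
demo = "sample,adapter\nS1,AGATCGGAAGAGC\nS2,NA\nS3,CTGTCTCTTATACACATCT\n"
stats = summarize(demo)
```

The same loop run per cutoff value would reproduce the 56% / 186-sample split mentioned above for any given parameter setting.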
18 out of 770 is not too bad - it means that we get results for almost 98% of samples. Considering that we include samples from really uncommon organisms, I think this is quite a good result. Otherwise, I think it's good that we see a clear increase in the adapters we identify when lowering the cutoff.

Let's see what happens if we also run with shorter adapters; I think this will boost the numbers some more. We might be able to reduce the 186 while still increasing the 56%. At least that's my hope.

It would be really nice to have at least a small set of known adapters to see if we find them, but I guess we saw that these are almost never reported, and even the kits aren't reported consistently (or it's hard to trace a kit back to a specific sample). Perhaps we can run this on our own samples - at least there we should know the adapters. Perhaps you can ask in our internal Slack channel if we have a list of samples that we uploaded to SRA and for which we have metadata, and the adapter sequences in particular. Even if we have only a small set, say 20 samples, for which we know the adapter, we could use the stats on those to put our numbers across the whole set into context.
So I ran the tests again with the remaining 187 samples, with shortened adapters of lengths 12 and 10.
0420_htsinfer_results_adapterlen_12.csv

Based on this, shortening the adapters to 12 might be a good idea, but I wouldn't go any lower than that, even with the 0.1 setting. I managed to get a list of our own RNA-Seq samples uploaded to SRA: we have ~80 samples with metadata (some have the exact lib prep kit used, so that's good; I'm trying to find out the kit/adapters for those that don't), from various organisms (human, mouse, yeast, C. elegans, E. coli). I'll compile the list and run HTSinfer; it shouldn't take long.
Fantastic! So with that last test on our own data, I think we can bring this to an end in terms of setting defaults. We can still think about identifying a couple more adapters from the data, but it's probably not the highest priority. I agree with adapter fragment lengths of 12, setting the ratio param default to 2, and the min frequency to maybe 0.5% or even a bit lower. What do you think?
Yes, I definitely agree with shortening the adapter length and lowering the frequency parameter. I was going to suggest 1, as finding adapters in 1% of the reads is a fairly strong indicator that that particular adapter was used for sequencing, but 0.5% might be better, based on these results.

Also, I think it's worth noting (for issue #108) that apart from the 4 E. coli samples, all samples got their organism correctly inferred, while for the 770 samples it was only ~50%. I guess there could be multiple reasons for this; probably it's because in the case of our own samples mainly model organisms were used (human, mouse, yeast etc.)? I'll change the defaults then and create a PR.
Well, based on these data - why not use 0.1%? As long as the ratio check holds up, a lower frequency cutoff should be safe.

Interesting about the organisms. I agree with the conclusions on the model organisms. What was inferred for the E. coli samples? Do we even have E. coli in the list? One important thing to note here is that - apparently - organisms are not falsely assigned if they are from a model organism, which is critically important, as these represent the bulk of samples. But more generally, we should really try to lower the number of falsely assigned organisms, even at the cost of lowering true positives as well. If ~50% of organisms were correctly assigned for the 770 samples, what is the percentage of those where assignment was not possible, and of those that were falsely assigned? (Perhaps better to move this discussion to #108, though.)
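The correct / falsely assigned / unassigned split asked for above is a simple three-way tally. A hedged sketch, using made-up records and a made-up `(expected, inferred)` pair format rather than HTSinfer's actual output:

```python
# Hedged sketch: split organism-inference results into correct, wrong,
# and unassigned shares. Records are fabricated (expected, inferred)
# pairs; None marks "no assignment possible".
def organism_breakdown(records):
    correct = wrong = unassigned = 0
    for expected, inferred in records:
        if inferred is None:
            unassigned += 1
        elif inferred == expected:
            correct += 1
        else:
            wrong += 1
    total = len(records)
    return {
        "correct": correct / total,
        "wrong": wrong / total,
        "unassigned": unassigned / total,
    }

# Toy example (organism short names are illustrative only):
demo = [
    ("hsapiens", "hsapiens"),
    ("mmusculus", "mmusculus"),
    ("ecoli", None),
    ("ecoli", "scerevisiae"),
]
shares = organism_breakdown(demo)
```

Keeping "wrong" separate from "unassigned" matters here, since the discussion treats false assignments as much worse than missing ones.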
We will need the test suite to automatically generate a number of plots:
It should also generate a report with the actual numbers for the above (for the distributions, report some percentiles, e.g., 1, 2, 5, 10, 15, 25, 50, 75, 85, 90, 95, 98, 99).
The plots and report should inform the choice of default values for the following parameters:
--read-layout-min-match-percentage
--read-layout-min-frequency-ratio
For example, we could set the values such that, say, 95% of samples would have values above the defaults. But the best is to look at the actual histograms and see if they suggest a natural cutoff.
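The "95% of samples above the default" rule can be sketched as picking a low percentile of the per-sample metric distribution. The values below are fabricated, and the nearest-rank method is just one reasonable percentile convention:

```python
# Hedged sketch: choose a candidate default so that ~95% of samples
# exceed it, i.e. take the 5th percentile of the observed per-sample
# values. The data below are made up for illustration.
import math

def nearest_rank_percentile(values, q):
    """q-quantile by the nearest-rank method (no interpolation)."""
    s = sorted(values)
    k = max(math.ceil(q * len(s)), 1) - 1
    return s[k]

# Fabricated per-sample values for one metric:
values = [0.4, 0.6, 0.9, 1.1, 1.5, 2.0, 2.2, 3.0, 4.5, 6.0]
cutoff = nearest_rank_percentile(values, 0.05)
share_above = sum(v >= cutoff for v in values) / len(values)
```

Inspecting the histogram first, as suggested above, guards against a percentile landing in the middle of a mode rather than in a natural gap of the distribution.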
After running HTSinfer on all test samples, there will be at least some samples for which no adapter is identified.
For libraries for which no adapters are identified (all zero), even after adding the additional adapters in #106, we can try the following approach to see if they perhaps do contain an adapter that is not in our list: count k-mers across the reads and check for a k-mer (other than poly-A stretches like `AAAAAA`) that is significantly more frequent than others.
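The k-mer idea from the discussion can be sketched as follows. The choice of k, the poly-A filter, and the "significantly more frequent" threshold (here, a ratio over the median k-mer count) are all illustrative assumptions, not HTSinfer's implementation:

```python
# Hedged sketch: count k-mers in a set of reads and flag any k-mer
# (ignoring pure poly-A k-mers such as AAAAAA) that is much more
# frequent than the typical k-mer; such a k-mer may be a fragment of
# an adapter missing from the list. k and the ratio are illustrative.
from collections import Counter

def candidate_adapter_kmers(reads, k=6, ratio=5.0):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) != {"A"}:      # skip pure poly-A k-mers
                counts[kmer] += 1
    if len(counts) < 2:
        return []
    ranked = counts.most_common()
    # Use the median k-mer count as the background level.
    baseline = ranked[len(ranked) // 2][1]
    return [km for km, c in ranked if c >= ratio * max(baseline, 1)]

# Toy reads sharing the made-up fragment "AGATCG":
reads = ["CCTGAGATCGAAAA", "TTAGATCGGGCA", "AGATCGTTTTT", "GGGAGATCGAA"]
hits = candidate_adapter_kmers(reads, k=6, ratio=3.0)
```

In real data one would likely restrict the count to the 3' ends of reads and merge overlapping high-frequency k-mers into a longer candidate adapter, but the core signal is the same: one k-mer standing far above the background.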