
Analyze 3' end adapters #107

Closed
uniqueg opened this issue Mar 3, 2023 · 13 comments · Fixed by #124
@uniqueg
Member

uniqueg commented Mar 3, 2023

We will need the test suite to automatically generate a number of plots:

  • A bar chart with the total number and the fraction of libraries for which an adapter was and was not identified
  • A histogram of the fractions of reads that contained the most prevalent adapter fragments
  • A histogram of the ratios of the fractions of the reads that contained the most prevalent vs the second most prevalent adapter fragment

It should also generate a report with the actual numbers for the above (for the distributions, report some percentiles, e.g., 1, 2, 5, 10, 15, 25, 50, 75, 85, 90, 95, 98, 99).

The plots and report should inform the choice of default values for the following parameters:

  • --read-layout-min-match-percentage
  • --read-layout-min-frequency-ratio

For example, we could set the values such that, say, 95% of samples would have values above the defaults. But it is best to look at the actual histograms and see whether they suggest a natural cutoff.
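For illustration, here's a rough sketch (not HTSinfer code; the example data and variable names are made up) of how the percentile report and candidate defaults could be computed:

```python
# Minimal sketch: report the requested percentiles for a distribution and
# derive candidate defaults. Example data are hypothetical.
import numpy as np

PERCENTILES = [1, 2, 5, 10, 15, 25, 50, 75, 85, 90, 95, 98, 99]

def report(name, values):
    """Print the requested percentiles for one distribution."""
    values = np.asarray(values, dtype=float)
    print(name)
    for p, v in zip(PERCENTILES, np.percentile(values, PERCENTILES)):
        print(f"  {p:>2}th percentile: {v:.3f}")

# Hypothetical per-library values collected from the test runs:
fractions = [33.98, 12.4, 5.1, 2.2, 0.8]   # % of reads with top adapter
ratios = [3398.0, 45.2, 12.0, 3.3, 1.1]    # top / second-top fraction

report("fraction of reads with most prevalent adapter (%)", fractions)
report("ratio of most vs. second most prevalent adapter", ratios)

# "95% of samples above the default" corresponds to the 5th percentile:
print("candidate --read-layout-min-match-percentage:",
      np.percentile(fractions, 5))
print("candidate --read-layout-min-frequency-ratio:",
      np.percentile(ratios, 5))
```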

After running HTSinfer on all test samples, there will likely be at least some samples for which no adapter is identified.

For libraries for which there are no adapters identified (all zero), even after adding additional adapters in #106, we can try the following approach to see if they perhaps do contain an adapter that is not in our list:

  • Prepare a truncated version of the library that only contains the last $k$ nucleotides
  • For the chosen $k$, construct all (i.e., $4^k$) possible $k$-mers
  • For each $k$-mer, count the number of occurrences in the truncated library
  • Inspect the distribution of counts: is there a single high-entropy outlier (i.e., not something like AAAAAA) that is significantly more frequent than the other $k$-mers?
  • If so, try to identify a full 15 nt adapter candidate sequence by searching for the $k$-mer in the original library and comparing what's up-/downstream of each match (this can be automated, but it's probably easier to do manually)
  • Search for the candidate in other libraries: in libraries where an adapter was already clearly identified, there should be (almost) no counts for the new adapter; in libraries for which no adapter was found, there may be many counts
  • Depending on the outcome of the search, include the adapter candidate (15 nts)

A good range for $k$ might be between 5 and 8; the larger the $k$, the clearer the result will be; however, the counts per $k$-mer will also be lower, and the counting process will take exponentially longer.
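A rough sketch of this screen (the function names and example reads are made up):

```python
# Sketch of the k-mer screen described above (not HTSinfer code). k-mers
# absent from the truncated library simply get a count of zero, so the
# full 4^k enumeration is implicit.
from collections import Counter
from math import log2

def terminal_kmer_counts(reads, k):
    """Count the last k nucleotides of every read (the 'truncated library')."""
    counts = Counter()
    for read in reads:
        if len(read) >= k:
            counts[read[-k:]] += 1
    return counts

def entropy(kmer):
    """Shannon entropy of the nucleotide composition (0 for e.g. AAAAAA)."""
    freqs = [kmer.count(b) / len(kmer) for b in "ACGT" if b in kmer]
    return -sum(f * log2(f) for f in freqs)

# Hypothetical usage: look for a single high-entropy outlier among the counts.
reads = ["ACGTACGTAGATCGGAAGAG", "TTGCATGCAAAGATCGGAAG", "CCCCCCCCCCCCCCCCCCCC"]
for kmer, n in terminal_kmer_counts(reads, k=6).most_common(10):
    print(kmer, n, f"entropy={entropy(kmer):.2f}")
```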

@balajtimate
Collaborator

balajtimate commented Mar 16, 2023

After running the tests on the collected samples (58 total), here are some preliminary plots:
1. Bar chart showing the number of samples for which an adapter was and was not identified:
[bar chart image]
I still need to rotate the adapter category labels, but it can be seen that a large portion of the adapters is still not getting inferred. Next, I will run the tests again with the parameters set to different values.

2. Histogram of the fractions of reads that contained the most prevalent adapters:
[histogram image]
Here the maximum number of samples analyzed is higher, because this information was collected from the read_layout_XXX.json files (not from the combined result file), and in the case of PE samples, there are two files per sample.

3. Histogram of the ratios of the fractions of the reads that contained the most prevalent vs the second most prevalent adapter:
[histogram image]
Here I ran into the issue that when the fraction for the second most prevalent adapter is very close to 0 (like 0.1%), the ratio between the first and second most prevalent adapter becomes very high (e.g., the most prevalent adapter is at 33.98%, the second at 0.01%). Also, what should happen when the second most prevalent is at 0%?

@uniqueg
Member Author

uniqueg commented Mar 16, 2023

Thanks @balajtimate!

Comments on the plots:

  1. I don't think we need to see which adapter was found, so you can just show two bars: not identified and identified. Then you also probably won't have an issue with the labels. Would be good to put fractions on the Y axis.
  2. I think it would be good to have higher resolution here. Currently you have bins of 4%, but 4% is already high for most libraries. Maybe it's best to prepare two plots: One from 0 to 100%, with 50 bins of 2% (not very different from what you have now), and then another one from 0 to 10%, with 50 bins of 0.2%. You should probably also remove the cases where the fraction of adapters is zero.
  3. For that plot, I would use logarithmic scaling on the x axis. And to avoid zero divisions, add a small pseudocount to all values (e.g., 0.01% or something like that; you can also check the data and use the lowest non-zero value you find for the most common and second most common adapters, then round that down to the next decimal; e.g., if the lowest you find is 0.23%, use 0.2% as the pseudocount). However, remove the cases where even the most prevalent adapter fraction is zero (see the sketch below).
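For illustration, a minimal sketch of what I mean for point 3 (the example values are made up):

```python
# Minimal sketch of the ratio histogram with pseudocounts (not the actual
# analysis script; example values are hypothetical).
import numpy as np
import matplotlib.pyplot as plt

first = np.array([33.98, 12.4, 5.1, 0.0, 2.2])   # % most prevalent adapter
second = np.array([0.01, 4.1, 0.0, 0.0, 1.0])    # % second most prevalent

pseudocount = 0.01  # or: lowest non-zero value in the data, rounded down
mask = first > 0    # drop cases where even the top adapter is zero
ratios = (first[mask] + pseudocount) / (second[mask] + pseudocount)

# Logarithmically spaced bins so the log-scaled x axis has even bin widths.
bins = np.logspace(np.log10(ratios.min()), np.log10(ratios.max()), 50)
plt.hist(ratios, bins=bins)
plt.xscale("log")
plt.xlabel("ratio of most vs. second most prevalent adapter")
plt.ylabel("number of libraries")
plt.savefig("ratio_histogram.png")
```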

Did you already identify additional adapters (#106)? If so, please create a PR and link it to #106, then request a review from me to merge it. Once we have the adapters merged, you can run the tests again. With the new plots, we should then be able to identify better defaults for the --read-layout-min-match-percentage and --read-layout-min-frequency-ratio params.

Please add the table with the percentiles for the different distributions as well. We will need those to pick decent defaults.

@balajtimate
Collaborator

Thanks for the suggestions! I changed the graphs based on them:
[updated plots image]

I removed the cases where the fraction is 0, and added a pseudocount of 0.01 to the fractions of the second most prevalent adapter fragments.

I am also adding the table with the results; columns are named like '1_adapt_1' (file 1, most frequently predicted adapter and its percentage), '1_adapt_2' (file 1, second most frequently predicted adapter and its percentage), etc.:

0316_135548_graph_result.csv

I've only identified one additional adapter so far from the results; I'll go through the kits we found and the online resources again and make a PR. I'll also push the code for the tests and plots.

@uniqueg
Member Author

uniqueg commented Mar 20, 2023

Awesome!

I think we need more data; the histograms are very sparse. But you have the code now, and with @BorisYourich's updated list, we may have a lot more data soon.

@BorisYourich
Collaborator

Hey, just uploaded some data to the shared docs; I am running it again with greater coverage, just in case there are too few of some species.

@balajtimate
Collaborator

I'm attaching the results from the updated tests. This was done using 456 samples, with the following conditions:

Unfortunately, the adapter is still not getting inferred in most cases. Looking at the results table:

  • In some cases, the ratio between the most common and second most common adapter is close to 1 (e.g. ATCTCGTATGCCGTC and TCGTATGCCGTCTTC - they appear to be overlapping fragments of the same adapter, so one of them will be removed)
  • Only 156/456 samples (34%) have any of the adapters in at least 2% of the reads.
  • In 97 samples, no adapter can be found at all. Maybe I could try the k-mer approach on these?

@uniqueg
Member Author

uniqueg commented Apr 13, 2023

Thanks a lot @balajtimate!

So it looks like ~34% of adapters identified is our upper limit of what we could possibly get with --read-layout-min-match-percentage set to 2%. So with ~28% we are at ~82% of that maximum, which is not too bad. If you remove the other adapter fragments that align against each other, we may well get to >90%, hopefully. If necessary, we can still lower --read-layout-min-frequency-ratio a bit, but it should be mostly fine.

The bigger problem is of course that we miss ~45% because of the --read-layout-min-match-percentage threshold. Have you tried lowering it? You could try 1%, 0.5%, 0.2% and 0.1%.

For the remaining 21% - yeah, I guess we just don't have their adapters in the list. Here the k-mer approach would help. Of course it's also possible that, for whatever reason, there are just no adapters in these libraries. You could have a look at whether you find something striking in the metadata of these samples. Perhaps many of them have something in common, like the library selection strategy.

@balajtimate
Collaborator

I ran the tests again with all 770 samples, lowering --read-layout-min-match-percentage and with the adapters removed in #121; here are the results:
[image: merge_from_ofoct (1)]

Even with the min match percentage set to 0.1%, only 56% of samples have an adapter inferred, while there are 186 samples where no adapter gets inferred at all. I'll look into the metadata; that could be a good point. Interestingly, there are also 18 samples where nothing gets inferred: no organism, read length or orientation either.

@uniqueg
Member Author

uniqueg commented Apr 17, 2023

18 out of 770 is not too bad; it means that we get results for almost 98% of samples. Considering that we include samples from really uncommon organisms, I think this is quite a good result.

Otherwise I think it's good that we see a clear increase in the adapters we identify when lowering --read-layout-min-match-percentage - and without changing the ratio. The maximum we can get - with the current adapter fragment set - still seems to be around 76% ($(770-186)/770$). For the remaining samples, either there are no (or almost no) adapters in there, or we don't have the corresponding adapter fragment in our list yet. With up to 56% now, we identify about 73% of that 76% with the current --read-layout-min-frequency-ratio - could be better, but not too bad.

Let's see what happens if we also run with shorter adapters; I think this will boost the numbers some more. We might be able to reduce the 186, while still increasing the 56%. At least that's my hope.

It would be really nice to have at least a small set of samples with known adapters to see if we find them, but I guess we saw that these are almost never reported, and even the kits aren't reported consistently (or it's hard to track a kit back to a specific sample). Perhaps we can run this on our own samples - at least there we should know the adapters. Perhaps you can ask in our internal Slack channel if we have a list of samples that we uploaded to SRA and for which we have metadata, and the adapter sequences in particular. Even if we have a small set, say 20 samples, that we know the adapter of, we could use the stats on those to put our numbers across the whole set into context.

@balajtimate
Collaborator

balajtimate commented Apr 20, 2023

So I ran the tests again with the remaining 187 samples, with the adapters shortened to lengths of 12 and 10.
[image: merge_from_ofoct]

  • With the adapter length of 12, an additional 9 samples got an adapter inferred (5%), while 1 got two adapters with very similar percentages (ratio < 2), 45 had an inferred adapter at <0.1%, and 132 samples remained without any hits.
  • With the adapter length of 10, 11 samples got an adapter inferred (6%), plus 1 with a ratio < 2. In this case, only 13 samples remained without any hits, and 162 had an inferred adapter at <0.1%.

0420_htsinfer_results_adapterlen_12.csv
0420_htsinfer_results_adapterlen_10.csv

Based on this, shortening the adapters to 12 might be a good idea, but I wouldn't go any lower than that: even with --read-layout-min-match-percentage at 0.1, we can only identify a couple more samples than with length 12.

I managed to get a list of our own RNA-Seq samples uploaded to SRA; we have ~80 samples with metadata (some have the exact library prep kit used, so that's good; I'm trying to find out the kit/adapters where they don't), from various organisms (human, mouse, yeast, C. elegans, E. coli). I'll compile the list and run HTSinfer; it shouldn't take long.

@uniqueg
Member Author

uniqueg commented Apr 23, 2023

Fantastic!

So with that last test on our own data, I think we can bring this to an end in terms of setting defaults. We can still think about identifying a couple more adapters from the data, but it's probably not the highest priority.

I agree with an adapter fragment length of 12. And setting the ratio param default to 2, and the min match percentage maybe to 0.5% or even a bit lower. What do you think?

@balajtimate
Collaborator

Yes, I definitely agree with shortening the adapter length and lowering the frequency parameter. I was going to suggest 1%, as finding adapters in 1% of the reads is a fairly strong indicator that that particular adapter was used for sequencing, but 0.5% might be better, based on these results:
[image: merge_from_ofoct (2)]
0425_zavolab_samples_minmatch_01.csv
These are from our own 77 samples. I think they're fairly consistent with the results of the 770-sample runs: almost half of the adapters were inferred at 0.5%, and a much bigger fraction when lowering it to 0.1%.

Also, I think it's worth noting (for issue #108) that apart from the 4 E. coli samples, all samples got their organism correctly inferred, while for the 770 samples, it was only ~50% of samples. I guess there could be multiple reasons for this - probably because our own samples mainly come from model organisms (human, mouse, yeast, etc.)?

I'll change the defaults then and create a PR.

@uniqueg
Member Author

uniqueg commented Apr 25, 2023

Well, based on these data: Why not use 0.1%? As long as --read-layout-min-frequency-ratio remains discriminative (which I think it does), I wouldn't be too concerned about false positives.

Interesting about the organisms. I agree with the conclusions on the model organisms. What was inferred for the E. coli samples? Do we even have E. coli in the list? One important thing to note here is that - apparently - organisms are not falsely assigned if they are from a model organism - which is critically important, as these represent the bulk of samples. But more generally, we should really try to lower the number of falsely assigned organisms, even at the cost of lowering true positives as well. If ~50% of organisms were correctly assigned for the 770 samples, what is the percentage of those where assignment was not possible, and of those that were falsely assigned? (Perhaps better to move this discussion to #108, though.)
