-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved mapping #344
Improved mapping #344
Conversation
…rlap if a cluster overlaps another by >=80%, then we detect conflict and has to filter one out
different clusters
I'm not going to think about this til next week, but an idea for you. Suppose your mapping process is fixed, and we compare how it works on two different PRGs. How does your change in mapping interact with these. Is it better with prg1 or 2? Dan's prg is constructed differently to the 4way one. I wonder how the 4way does with his new e coli prg |
In this case, both the code proposed in this PR and the code previous to this PR will work similarly. If we overcluster a set of similar genes into a single PRG, we can assume that when we map a region from that set of genes, it will generate a single cluster of hits to that PRG only. Thus we won't have conflicting clusters and the improvements in this PR, which deal with conflicting clusters detection and selection won't have much of an effect. Overclustering creates another type of problem that
In this other case, when we map a region of a gene, we might map to two or more PRGs due to underclustering. In this case, the improvements described in this PR for conflicting clusters detection and selection will have a significant effect, as we are able to detect conflicting clusters much better, with less restrictions, and to select the PRG that fits better the mapped read (well, at least in our benchmarks).
I can run the 4-way evaluation on his PanRG also. Closing remarksWe could think of the argument that it is the user's responsibility to provide a cleaner PanRG, where over- and underclustering is reduced, but it seems in practice this is complicated. Usually the parameters we set for the tools we use for clustering alleles into genes pre-MSA don't work well for all genes, and will always undercluster and overcluster, sometimes a large fraction of the genes. It is also hard to quantify and fix this. I can speak of my own experience on this using mmseqs2 to cluster the plasmid genes. I think Dan is having the same issue with panaroo and his AMR genes. And we are in principle devs/advanced users of pandora. There are also the predefined sets of clusters (e.g. panX and piggy clusters we used in the paper) that we don't have control over these parameters... Anyway, what I am trying to say is that it is complicated for the user to build a very-high-quality PanRG, thus we should expect and model over- and underclustering in the PanRG in pandora. This PR is a first try onto modelling underclustering, and it shows significant improvements in Dan's benchmark and good improvements in the 4-way benchmark, so it is a change we should consider merging. |
I think it would be good to understand why the denovo results regress compared to dev (orange). Do you have any idea what that is being caused by? |
No idea, but I've started an investigation/debugging but will be delayed as the cluster is down for the week basically. I will update here when I understand and can explain what is happening. |
Merging, quoting Zam in slack:
|
Summary
@Danderson123 has been having some issues using pandora in his artificial amira dataset. This artificial dataset is simply composed of several PRGs, each one representing an AMR gene, built from their MSAs, and he tries to map the first allele of each MSA back to their respective PRGs with pandora. This should be a trivial task, and we should be getting 99.9% accuracy. Well, his use case is not that simple because many genes are contained inside another (he is using panaroo to cluster the alleles, and there are thresholds to separate genes, and there is always edge cases, e.g. if two alleles are identical but the shorter has a deletion spanning 30% of the original allele, do we classify as the same gene or as a new gene?). Anyway, it is a simple artificial dataset, but we should expect some shared regions between unrelated graphs due to the way the alleles are clustered, and this way of clustering alleles is standard, so in general I think pandora should be aware of this and behave well in this situation.
At first he was getting ~91% accuracy. PR #343 increased that to ~94% (I reported 94.7% on that PR but that is imprecise, as it was on a subset of the input). This PR increases the accuracy to 99.6%, and was done by Dan and I iteratively exchanging code, ideas, and benchmarks. I'd be up to increasing this accuracy to 99.9%, but this can be done later.
Details
This PR improves
pandora
mapping by doing a better detection of conflicting clusters and selection.Conflicting clusters happen in
pandora
when we have 2 good clusters mapping to two different genes and they stem from the same region of the read, so we have to choose one. Prior to this PR this was done simply by checking if one cluster contained the other. This containment can be understood as an envelopment, e.g. if cluster 1 stems from a region in the read spanning the interval[200, 600]
, then it envelops cluster 2 spanning[400, 600]
, so these clusters are declared as conflicting and we have to choose one. In several of @Danderson123 examples, this enveloping did not happen because the smaller cluster would match one or two more minimisers, spanning a little bit more bases, e.g.[400, 610]
and then the clusters would not conflict anymore and we would report both of them, whereas reporting the first cluster only is clearly the correct choice. In one extreme example, a single additional base was mapped and the clusters were inferred as not conflicting! Anyway, we've now changed this, and we now compute the overlap between two clusters, and if that overlap is >80% of the smaller cluster, we say we have a conflict. This 80% is a default value that can be changed with a new parameter,--min-cluster-overlap
:Once we have decided we have two conflicting clusters, we have to choose the best one. Analysing data from @Danderson123 results on his artificial dataset, we figured out that we would get better results by:
The
*.clusters_filter_report
files were improved to reflect all this information so that we can understand and debug the choicespandora
makes in real datasets. This is one example of some lines from this file for the amira artificial data, that shows thatpandora
preferred to map an allele ofgroup_23595
to its own PRG:The previous data shows that we filtered out the mapping to PRG
group_9877
favouring the mapping togroup_22365
becausegroup_22365
has many more unique minimisers (175 vs 61). However, when solving the next conflict, even though the mapping togroup_22365
has more unique minimisers than the mapping togroup_23595
(175 vs 174), this falls on the 5% tolerance and so we then look at the target coverage, and we see that the mapping togroup_23595
has a much better target coverage than the mapping togroup_22365
(99.4% vs 65%), and thus we choose to mapgroup_23595
to its own PRG.This unique minimisers tolerance for choosing the cluster is 5% by default, but can be configured with this new CLI parameter:
Core code
The core code for this PR is just these two, the rest is "supporting" code:
Evaluation
Amira results: @Danderson123 ran this version on his amira dataset, and he improved the accuracy of detecting alleles from ~94% to 99.6%;
4-way results
I ran this on the 4-way results (nanopore data only). No denovo results clearly improved (blue curve = current release, orange = dev, green = this PR):
However, with denovo, it is arguable if this PR improves the results (blue curve = current release, orange = dev, green = this PR). While we do increase recall from 82% to 83.8%, this comes with an increase of error rate from 0.17% to 0.25%, compared to
dev
. This is in contradiction with the no denovo results and with amira results, so it is a bit confusing to me...