Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved mapping #344

Merged
merged 12 commits into from
Dec 11, 2023
Merged

Improved mapping #344

merged 12 commits into from
Dec 11, 2023

Conversation

leoisl
Copy link
Collaborator

@leoisl leoisl commented Aug 8, 2023

Summary

@Danderson123 has been having some issues using pandora in his artificial amira dataset. This artificial dataset is simply composed of several PRGs, each one representing an AMR gene, built from their MSAs, and he tries to map the first allele of each MSA back to their respective PRGs with pandora. This should be a trivial task, and we should be getting 99.9% accuracy. Well, his use case is not that simple because many genes are contained inside another (he is using panaroo to cluster the alleles, and there are thresholds to separate genes, and there is always edge cases, e.g. if two alleles are identical but the shorter has a deletion spanning 30% of the original allele, do we classify as the same gene or as a new gene?). Anyway, it is a simple artificial dataset, but we should expect some shared regions between unrelated graphs due to the way the alleles are clustered, and this way of clustering alleles is standard, so in general I think pandora should be aware of this and behave well in this situation.

At first he was getting ~91% accuracy. PR #343 increased that to ~94% (I reported 94.7% on that PR but that is imprecise, as it was on a subset of the input). This PR increases the accuracy to 99.6%, and was done by Dan and I iteratively exchanging code, ideas, and benchmarks. I'd be up to increasing this accuracy to 99.9%, but this can be done later.

Details

This PR improves pandora mapping by doing a better detection of conflicting clusters and selection.

  1. Conflicting clusters detection

Conflicting clusters happen in pandora when we have 2 good clusters mapping to two different genes and they stem from the same region of the read, so we have to choose one. Prior to this PR this was done simply by checking if one cluster contained the other. This containment can be understood as an envelopment, e.g. if cluster 1 stems from a region in the read spanning the interval [200, 600], then it envelops cluster 2 spanning [400, 600], so these clusters are declared as conflicting and we have to choose one. In several of @Danderson123 examples, this enveloping did not happen because the smaller cluster would match one or two more minimisers, spanning a little bit more bases, e.g. [400, 610] and then the clusters would not conflict anymore and we would report both of them, whereas reporting the first cluster only is clearly the correct choice. In one extreme example, a single additional base was mapped and the clusters were inferred as not conflicting! Anyway, we've now changed this, and we now compute the overlap between two clusters, and if that overlap is >80% of the smaller cluster, we say we have a conflict. This 80% is a default value that can be changed with a new parameter, --min-cluster-overlap:

  --min-cluster-overlap FLOAT Minimum proportion of overlap between two clusters of hits to consider them conflicting. Only one of the conflicting clusters will be kept. [default: 0.8]
  1. Conflicting clusters selection

Once we have decided we have two conflicting clusters, we have to choose the best one. Analysing data from @Danderson123 results on his artificial dataset, we figured out that we would get better results by:

  1. Selecting the cluster that contains the highest number of unique minimisers with a tolerance (default is 5%). By unique minimisers we mean that if there is a minimiser that map to a single place or to 10 places in a PRG, we will count just as one, i.e. we are not multicounting minimisers that map to several different regions of a graph anymore;
  2. If both clusters have similar number of unique minimisers (within the tolerance) then we choose the cluster that best "fits" the graph, i.e. the one with highest target coverage. The target coverage is the number of unique minimising kmers the cluster goes through over the number of minimisers in the max path of the PRG;

The *.clusters_filter_report files were improved to reflect all this information so that we can understand and debug the choices pandora makes in real datasets. This is one example of some lines from this file for the amira artificial data, that shows that pandora preferred to map an allele of group_23595 to its own PRG:

read	prg	nb_of_unique_minimisers	target_cov	read_start	read_end	overlap	status	favoured_cluster_prg	favoured_cluster_nb_of_unique_minimisers	favoured_cluster_target_cov	favoured_cluster_read_start	favoured_cluster_read_end
group_23595	group_9877	61	0.663043	501	920	1	filtered_out	group_22365	175	0.650558	501	1762
group_23595	group_22365	175	0.650558	501	1762	1	filtered_out	group_23595	174	0.994286	502	1762
group_23595	group_23595	174	0.994286	502	1762	NA	kept

The previous data shows that we filtered out the mapping to PRG group_9877 favouring the mapping to group_22365 because group_22365 has many more unique minimisers (175 vs 61). However, when solving the next conflict, even though the mapping to group_22365 has more unique minimisers than the mapping to group_23595 (175 vs 174), this falls on the 5% tolerance and so we then look at the target coverage, and we see that the mapping to group_23595 has a much better target coverage than the mapping to group_22365 (99.4% vs 65%), and thus we choose to map group_23595 to its own PRG.

This unique minimisers tolerance for choosing the cluster is 5% by default, but can be configured with this new CLI parameter:

  --cluster-mini-tolerance FLOAT
                              When two clusters of hits are conflicting, the one with highest number of unique minimisers will be kept. However, if the difference between the number of unique minimisers is too small, less than this parameter, then we will prefer the cluster that has higher target coverage. [default: 0.05]

Core code

The core code for this PR is just these two, the rest is "supporting" code:

  1. Conflicting clusters detection: excerpt 1 and excerpt 2
  2. Conflicting clusters selection: excerpt 1

Evaluation

  1. Amira results: @Danderson123 ran this version on his amira dataset, and he improved the accuracy of detecting alleles from ~94% to 99.6%;

  2. 4-way results

I ran this on the 4-way results (nanopore data only). No denovo results clearly improved (blue curve = current release, orange = dev, green = this PR):

image

However, with denovo, it is arguable if this PR improves the results (blue curve = current release, orange = dev, green = this PR). While we do increase recall from 82% to 83.8%, this comes with an increase of error rate from 0.17% to 0.25%, compared to dev. This is in contradiction with the no denovo results and with amira results, so it is a bit confusing to me...

image

@iqbal-lab
Copy link
Collaborator

I'm not going to think about this til next week, but an idea for you. Suppose your mapping process is fixed, and we compare how it works on two different PRGs.
In prg1, we overcluster a bit, so sometimes two different genes end up in one msa
In prg2, we undercluster, so sometimes one gene is split in two.

How does your change in mapping interact with these. Is it better with prg1 or 2?

Dan's prg is constructed differently to the 4way one.

I wonder how the 4way does with his new e coli prg

@leoisl
Copy link
Collaborator Author

leoisl commented Aug 9, 2023

In prg1, we overcluster a bit, so sometimes two different genes end up in one msa

In this case, both the code proposed in this PR and the code previous to this PR will work similarly. If we overcluster a set of similar genes into a single PRG, we can assume that when we map a region from that set of genes, it will generate a single cluster of hits to that PRG only. Thus we won't have conflicting clusters and the improvements in this PR, which deal with conflicting clusters detection and selection won't have much of an effect. Overclustering creates another type of problem that pandora still does not handle: listing several maximum likelihood paths in the PRGs when there is evidence of overclustering or mixed samples, instead of always generating a single one... But that is out of the scope of this PR I guess...

In prg2, we undercluster, so sometimes one gene is split in two.

In this other case, when we map a region of a gene, we might map to two or more PRGs due to underclustering. In this case, the improvements described in this PR for conflicting clusters detection and selection will have a significant effect, as we are able to detect conflicting clusters much better, with less restrictions, and to select the PRG that fits better the mapped read (well, at least in our benchmarks). pandora code prior to this PR has little support for underclustering, I think it does a good job assigning random candidate PRGs for mapping a read when we have several equally good mappings, as evidenced by the improved results in Roundhound once this was added. But once this situation does not happen, e.g. a set of similar good mapping but not identical, it suffers to detect conflicting clusters and selecting the best one. And my guess is that this is also happening in many of our use cases, including Roundhound, as a more restricted situation (identical mappings) is happening.

I wonder how the 4way does with his new e coli prg

I can run the 4-way evaluation on his PanRG also.

Closing remarks

We could think of the argument that it is the user's responsibility to provide a cleaner PanRG, where over- and underclustering is reduced, but it seems in practice this is complicated. Usually the parameters we set for the tools we use for clustering alleles into genes pre-MSA don't work well for all genes, and will always undercluster and overcluster, sometimes a large fraction of the genes. It is also hard to quantify and fix this. I can speak of my own experience on this using mmseqs2 to cluster the plasmid genes. I think Dan is having the same issue with panaroo and his AMR genes. And we are in principle devs/advanced users of pandora. There are also the predefined sets of clusters (e.g. panX and piggy clusters we used in the paper) that we don't have control over these parameters... Anyway, what I am trying to say is that it is complicated for the user to build a very-high-quality PanRG, thus we should expect and model over- and underclustering in the PanRG in pandora. This PR is a first try onto modelling underclustering, and it shows significant improvements in Dan's benchmark and good improvements in the 4-way benchmark, so it is a change we should consider merging.

@mbhall88
Copy link
Member

mbhall88 commented Aug 9, 2023

I think it would be good to understand why the denovo results regress compared to dev (orange). Do you have any idea what that is being caused by?

@leoisl
Copy link
Collaborator Author

leoisl commented Aug 16, 2023

I think it would be good to understand why the denovo results regress compared to dev (orange). Do you have any idea what that is being caused by?

No idea, but I've started an investigation/debugging but will be delayed as the cluster is down for the week basically. I will update here when I understand and can explain what is happening.

@leoisl
Copy link
Collaborator Author

leoisl commented Dec 11, 2023

Merging, quoting Zam in slack:

I say merge. We'll deal with any future issues

@leoisl leoisl merged commit 40d4e78 into iqbal-lab-org:dev Dec 11, 2023
1 check passed
@leoisl leoisl deleted the improved_mapping branch December 13, 2023 11:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants