Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in counting of mutations at site 95 in 21J clade #312

Closed
jbloom opened this issue Dec 19, 2022 · 3 comments
Closed

bug in counting of mutations at site 95 in 21J clade #312

jbloom opened this issue Dec 19, 2022 · 3 comments

Comments

@jbloom
Copy link

jbloom commented Dec 19, 2022

I have been using matUtils, and I have encountered what I think is some bug in how mutations are being counted specifically for mutations to I at site 95 in spike for the Delta clades (eg, 21J). I cannot for the life of me figure out what is causing this as the results seems sensible for other mutations for Delta and for other clades.

Specifically, clade 21J should have a fair number of mutations to/from I at site 95 in spike. See for instance this nextstrain view: https://nextstrain.org/ncov/gisaid/global/6m?c=gt-S_95

But when I use matUtils to count the mutations either to or from I at site 95 in spike, I get 0 counts for clade 21J. This contrasts to other clades where there are counts.

See the code block below:

#!/bin/bash
  
curl http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2022/12/18/public-2022-12-18.all.masked.nextclade.pangolin.pb.gz > mat.pb.gz
wget -O - http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/wuhCor1.fa.gz | gunzip -c > ref.fasta
wget -O - http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/genes/ncbiGenes.gtf.gz | gunzip -c > ref.gtf

matUtils extract -i mat.pb.gz --clade "21J (Delta)" -o 21J.pb.gz
matUtils summary -i 21J.pb.gz -g ref.gtf -f ref.fasta -t 21J.tsv
cat 21J.tsv | grep S:.95I | wc -l > 21J_lines_with_95I.txt
cat 21J.tsv | grep S:I95 | wc -l > 21J_lines_with_I95.txt

matUtils extract -i mat.pb.gz --clade "21L (Omicron)" -o 21L.pb.gz
matUtils summary -i 21L.pb.gz -g ref.gtf -f ref.fasta -t 21L.tsv
cat 21L.tsv | grep S:.95I | wc -l > 21L_lines_with_95I.txt
cat 21L.tsv | grep S:I95 | wc -l > 21L_lines_with_I95.txt

The files 21J_lines_with_95I.txt and 21J_lines_with_I95.txt both have counts of zero, reflecting the (seemingly incorrect) count of no such mutations in the matUtils parsed mutation-annotated tree. In contract, the files 21L_lines_with_95I.txt and 21L_lines_with_I95.txt have non-zero counts reflecting the (presumably correct) existence of such mutations in clade 21L.

@AngieHinrichs
Copy link
Contributor

Nucleotide 21846 (the S:95 mutation) is masked in the Delta branch of the UCSC/UShER tree because it was very unreliably detected and that led to false branching problems in the tree. Starting with Delta, and increasingly with Omicron, amplicon dropout and suboptimal default configs in some genome assembly pipelines have caused a lot of problems, so as time goes on I'm masking more sites. In addition to mutations that frequently have false reversions to reference due to amplicon dropout, I often mask indel regions because sometimes stray/contam read alignments extend into those gaps and cause false substitutions. My script that masks specific mutations in specific branches of the tree after new sequences are placed in the tree and before matOptimize is here.

@jbloom
Copy link
Author

jbloom commented Dec 19, 2022

@AngieHinrichs, great, this is really helpful and explains things! We really appreciate all your quick help in understanding these things and in general maintaining such great MATs for easy use.

This may be too much of a request to implement easily, but I was wondering if you might consider eventually somehow putting the masked sites for each clade in a YAML or some other machine-readable format and documenting a bit more clearly that this masking happens? It totally makes sense when I look at this script, but for the types of analyses I often am trying to do (see which mutations are present to what extent in different clades), it's helpful to have an easy way to distinguish between what mutations actually have zero counts versus are just masked in a specific clade.

(Eg, we stumbled across this T95I when we were trying to relate our deep mutational scanning data to mutations that might preferentially occur in Omicron versus Delta lineages.)

@jbloom
Copy link
Author

jbloom commented Dec 19, 2022

Actually, I was able to extract mutations without much problem from script, so above is probably pretty low priority.

I'll go ahead and close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants