-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to count the number of mapped reads against a vg? #1701
Comments
I don't think there is a recommended way yet, but in theory we'd put something in
You might get fewer MAPQ > 0 reads in the VCF graph, as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty. |
Hi @edawson Thanks for the quick response.
So, this is to say that the MAPQ=0 reads are filtered out? With my tests, I the minimum MAPQ I ever saw was 1. I guess....is there any way I could access these? At the moment, I'm not sure how I'd use the GAM/JSON to access a (non-read) alignment with no MAPQ values and a read with MAPQ=0, right?
I guess this gets in the theory and purpose of a graph genome, but this is a bit of a conceptual problem, no? If there are reads which align on a reference with some MAPQ score against a somewhat unique sequence, a graph genome could end up dismissing these reads because of the inherent variation within the graph? Maybe the problem is the MAPQ metric applied for graph genomes? |
I think the addendum to the question above is, would accessing the MAPQ==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue? |
In protobuf, 0 values are not written. So reads with mapq=0 won't have
any mapping_quality field when you look at them.
To extract these, you can use jq
`vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG > mapq0.gam`
To find non-zero values, for example 55, you can do something similar
`vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG
mapq55.gam`
Going through JSON is slow, but it allows for fairly general queries of GAM
files.
…On Mon, May 28, 2018 at 5:16 AM, evanbiederstedt ***@***.***> wrote:
@edawson <https://github.com/edawson>
I think the addendum to the question above is, would accessing the mapq==0
reads require hacking on my part in the source code? It might be useful to
get the lines in question which would need to be changed. Perhaps this is a
feature request to be made in another issue?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1701 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA2_7od9rf8k9YFPi0ydu5ltjaPT_uxvks5t28CKgaJpZM4UMiWe>
.
|
Hi @glennhickey Thanks for the help For the commands given, i.e.
I see an output for
but there is an error with
Error:
Are the above commands correct? |
Sorry, I forgot a - at the very end of both examples.
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG - > mapq0.gam
vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG
- > mapq55.gam
…On Mon, May 28, 2018 at 12:34 PM, evanbiederstedt ***@***.***> wrote:
Hi @glennhickey <https://github.com/glennhickey>
Thanks for the help
For the commands given, i.e.
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG > mapq0.gam
I see an output for
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'
but there is an error with
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG
Error:
[vg view] error: no filename given
Are the above commands correct?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1701 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA2_7vb4eF-1C2313HnnB54B3VMoX_z9ks5t3Cb_gaJpZM4UMiWe>
.
|
@glennhickey |
So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph. Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):
Attempts at addition: (1)
That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320! However, I'm not sure if I should celebrate just yet...as @edawson mentioned,
(2)
15742494 is the length of the original GAMs for both. Why would that be? (3)
I'm not sure what this means, but here the value is greater for the GRAPH (4)
Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for Questions: (A) How should I interpret each of these values? (B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be? Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference? |
Some reads map more ambiguously to the linear reference than to the graph.
Other reads map more ambiguously to the graph than the linear reference.
It is possible that the latter group is larger. We are currently studying
this effect and ways to mitigate it, and hope to have some results to
publish shortly. Filtering rare variants out of the VCF is one simple
approach.
…On Mon, May 28, 2018 at 7:10 PM, evanbiederstedt ***@***.***> wrote:
@glennhickey <https://github.com/glennhickey> @edawson
<https://github.com/edawson>
So, I'm still a bit confused what to make of this. As I say, my priors
would be that the graph has more reads mapped than the "reference-only"
graph.
Here are the numbers I now see, after parsing the JSONs (with the caveat
that maybe I made a mistake somewhere):
REFERENCE-ONLY
length(json)= 15742494
length(scores) = 15530966
length(identity) = 15530966
length(mapping_quality)=14358612
Using the commands above to check for (mapping_quality | not) values:
length(json)= 1383882
length(scores) =1172354
length(identity) =1172354
length(mapping_quality)=0 ## this makes sense! good!
GRAPH
length(json)= 15742494 ### (It's interesting these are the same length...)
length(scores) =15523701
length(identity) =15523701
length(mapping_quality)=14280344
Using the commands above to check for (mapping_quality | not) values:
length(json)= 1462150
length(scores) =1243357
length(identity) =1243357
length(mapping_quality)= 0
Attempts at addition:
(1)
reference: length(scores) + length(scores)_mapq0 => 15530966 + 1172354 = 16703320
graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058
That means yes, there are perhaps more MAPQ==0 reads on the graph.
16767058 > 16703320!
However, I'm not sure if I should celebrate just yet...as @edawson
<https://github.com/edawson> mentioned,
so if you want to count mapped reads you'd want to check the MAPQs.
(2)
reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494
graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249
15742494 is the length of the original GAMs for both. Why would that be?
(3)
reference: length(scores) + length(json)_mapq0 => 15530966 + 1383882 = 16914848
graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851
I'm not sure what this means, but here the value is greater for the GRAPH
(4)
reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966
graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701
Here, the number of mapped reads with MAPQ>=1 is greater in the
reference-only graph versus the VCF graph. Using the length of the JSONs
with values for scores, the value for the reference-only graph is still
larger....I'm not sure what this means.
Questions:
(A) How should I interpret each of these values?
(B) What would be the recommended way to measure the total number of
mapped reads in this case? If the reference-only graph *still* has more
reads mapped than the VCF graph...why would this be?
Could the VCF affect things? For example, I think the VCF did not have
information on the sex chromosomes or the mitochondrial chromosome. Would
that possibly make a difference?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1701 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA2_7mo0bgxzjfy9w3zBVrVQ6gq7BkAFks5t3IPcgaJpZM4UMiWe>
.
|
Hi @glennhickey Thanks for the help.
I see. I guess this could make sense. One other thing I don't entirely understand yet: Using the JSON from For each case, the size of the json and the size of the
|
The graph with variation will have more ambiguity in it. This will result in more MAPQ=0 alignments. MAPQ=0 means the top two alignments had the same score. Look for reads that get MAPQ=0 and score=0. For a short read alignment in vg this means they are unmapped. In this case you're getting something like 0.4% more, which isn't a big deal IMO if you're mapping more reads correctly overall. It isn't clear that it's wrong to get more or less MAPQ=0 alignments if it's happening because you've learned that your read has more optimal mapping locations. |
I think we should go ahead and expose this at the CLI, either in the existing possible examples: |
Great work on vg, I've just come back to it and mapping is a lot more performant. Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency. Also, although I've tried to set a high pc identity threshold 0.95 (script below) eg:
Script 2 ... stats attempt:
|
vg stats -a will give you some mapping statistics. You can also filter by
identity using vg filter -p -u -f
…On Wed, Aug 5, 2020 at 8:16 AM Colin Davenport ***@***.***> wrote:
Great work on vg, I've just come back to it and mapping is a lot more
performant.
Has this stats/report functionality been provided yet for GAM files ? I'm
investigating multimapping against references from about 50 highly related
bacteria (metagenome), and am struggling to get my head around stats,
multimapping and mapping efficiency.
Also, although I've tried to set a high pc identity threshold 0.95 (script
below)
I'm still getting lower identity mappings in the result jsons
eg:
{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55
#!/bin/bash
# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020
threads=56
for fq in `ls *.fastq`
do
echo "Aligning FASTQ SE short"
echo $i
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 --min-ident 0.95 -f $fq > $fq.nomulti.gam
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1 --min-ident 0.95 -f $fq > $fq.1multi.gam
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5 --min-ident 0.95 -f $fq > $fq.5multi.gam
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam
#--surject-to bam
done
for gam in `ls *.gam`
do
echo "Converting gam"
echo $i
nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json
nohup srun -p short -c 1 -t 119 vg index -d $gam.index -N $gam
done
Script 2 ... stats attempt:
#!/bin/bash
# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020
for filename in `ls *.gam`
do
echo $i
nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json
grep identity $filename.json > $filename.json.al.json
echo "Alignment stats"
echo "Total lines (reads): "
wc -l $filename.json
echo "Total lines with identity (mapped reads): "
grep -c identity $filename.json
done
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1701 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG373RA32EB2RJGPRIVQX3R7FESJANCNFSM4FBSEWPA>
.
|
Hi there
This is another clarification question:
(1)
Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...
(2)
I created a non-branching graph against the reference genome with one component per chromosome using the following command:
I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa.
Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph. That's strange, right? We'd expect that, each alignment against the reference-only graph should also be a valid alignment against the reference+VCF graph....yet, the reference-only graph has a higher number of mapped reads (using the measurement of mapped reads in discussed in (1))
Any ideas what could be going on here? Perhaps I'm measuring the total number of mapped reads incorrectly?
The text was updated successfully, but these errors were encountered: