How to count the number of mapped reads against a vg? #1701

evanbiederstedt · 2018-05-24T16:00:32Z

Hi there

This is another clarification question:

(1)

Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...

(2)

I created a non-branching graph against the reference genome with one component per chromosome using the following command:

$ vg construct -r hg38.fa > output.vg

I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa.

Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph. That's strange, right? We'd expect that, each alignment against the reference-only graph should also be a valid alignment against the reference+VCF graph....yet, the reference-only graph has a higher number of mapped reads (using the measurement of mapped reads in discussed in (1))

Any ideas what could be going on here? Perhaps I'm measuring the total number of mapped reads incorrectly?

The text was updated successfully, but these errors were encountered:

edawson · 2018-05-24T16:27:28Z

Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...

I don't think there is a recommended way yet, but in theory we'd put something in vg stats, or a similar namespace. It sounds like each alignment gets a score/identity, but each read gets a MAPQ (which is computed from the multiple alignments per read), so if you want to count mapped reads you'd want to check the MAPQs.

I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa. Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph.

You might get fewer MAPQ > 0 reads in the VCF graph, as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.

evanbiederstedt · 2018-05-24T17:03:36Z

Hi @edawson

Thanks for the quick response.

You might get fewer MAPQ > 0 reads in the VCF graph

So, this is to say that the MAPQ=0 reads are filtered out? With my tests, I the minimum MAPQ I ever saw was 1.

I guess....is there any way I could access these? At the moment, I'm not sure how I'd use the GAM/JSON to access a (non-read) alignment with no MAPQ values and a read with MAPQ=0, right?

as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.

I guess this gets in the theory and purpose of a graph genome, but this is a bit of a conceptual problem, no? If there are reads which align on a reference with some MAPQ score against a somewhat unique sequence, a graph genome could end up dismissing these reads because of the inherent variation within the graph?

Maybe the problem is the MAPQ metric applied for graph genomes?

evanbiederstedt · 2018-05-28T09:16:57Z

@edawson

I think the addendum to the question above is, would accessing the MAPQ==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue?

glennhickey · 2018-05-28T13:10:55Z

In protobuf, 0 values are not written. So reads with mapq=0 won't have any mapping_quality field when you look at them. To extract these, you can use jq `vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG > mapq0.gam` To find non-zero values, for example 55, you can do something similar `vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG

mapq55.gam`

Going through JSON is slow, but it allows for fairly general queries of GAM files.

…

On Mon, May 28, 2018 at 5:16 AM, evanbiederstedt ***@***.***> wrote: @edawson <https://github.com/edawson> I think the addendum to the question above is, would accessing the mapq==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1701 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA2_7od9rf8k9YFPi0ydu5ltjaPT_uxvks5t28CKgaJpZM4UMiWe> .

evanbiederstedt · 2018-05-28T16:34:06Z

Hi @glennhickey

Thanks for the help

For the commands given, i.e.

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG > mapq0.gam

I see an output for

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'

but there is an error with

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG

Error:

[vg view] error: no filename given

Are the above commands correct?

glennhickey · 2018-05-28T16:45:00Z

…

On Mon, May 28, 2018 at 12:34 PM, evanbiederstedt ***@***.***> wrote: Hi @glennhickey <https://github.com/glennhickey> Thanks for the help For the commands given, i.e. vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG > mapq0.gam I see an output for vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' but there is an error with vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG Error: [vg view] error: no filename given Are the above commands correct? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1701 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA2_7vb4eF-1C2313HnnB54B3VMoX_z9ks5t3Cb_gaJpZM4UMiWe> .

evanbiederstedt · 2018-05-28T21:10:34Z

@glennhickey
Thanks for this. Ah yes, this makes sense, as you would need to pipe JSON format into jq

evanbiederstedt · 2018-05-28T23:10:18Z

@glennhickey @edawson

So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph.

Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):

REFERENCE-ONLY
length(json)= 15742494
length(scores) = 15530966
length(identity) = 15530966
length(mapping_quality)=14358612

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1383882
length(scores) =1172354
length(identity) =1172354
length(mapping_quality)=0 ## this makes sense! good!

GRAPH
length(json)= 15742494 ### (It's interesting these are the same length...)
length(scores) =15523701
length(identity) =15523701
length(mapping_quality)=14280344

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1462150
length(scores) =1243357
length(identity) =1243357
length(mapping_quality)= 0

Attempts at addition:

(1)

reference: length(scores) + length(scores)_mapq0 =>  15530966 + 1172354 = 16703320
graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

However, I'm not sure if I should celebrate just yet...as @edawson mentioned,

so if you want to count mapped reads you'd want to check the MAPQs.

(2)

reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494 
graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249

15742494 is the length of the original GAMs for both. Why would that be?

(3)

reference: length(scores) + length(json)_mapq0 =>  15530966 + 1383882  = 16914848
graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851

I'm not sure what this means, but here the value is greater for the GRAPH

(4)

reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966
graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701

Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores, the value for the reference-only graph is still larger....I'm not sure what this means.

Questions:

(A) How should I interpret each of these values?

(B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be?

Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference?

glennhickey · 2018-05-29T16:00:33Z

Some reads map more ambiguously to the linear reference than to the graph. Other reads map more ambiguously to the graph than the linear reference. It is possible that the latter group is larger. We are currently studying this effect and ways to mitigate it, and hope to have some results to publish shortly. Filtering rare variants out of the VCF is one simple approach.

…

On Mon, May 28, 2018 at 7:10 PM, evanbiederstedt ***@***.***> wrote: @glennhickey <https://github.com/glennhickey> @edawson <https://github.com/edawson> So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph. Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere): REFERENCE-ONLY length(json)= 15742494 length(scores) = 15530966 length(identity) = 15530966 length(mapping_quality)=14358612 Using the commands above to check for (mapping_quality | not) values: length(json)= 1383882 length(scores) =1172354 length(identity) =1172354 length(mapping_quality)=0 ## this makes sense! good! GRAPH length(json)= 15742494 ### (It's interesting these are the same length...) length(scores) =15523701 length(identity) =15523701 length(mapping_quality)=14280344 Using the commands above to check for (mapping_quality | not) values: length(json)= 1462150 length(scores) =1243357 length(identity) =1243357 length(mapping_quality)= 0 Attempts at addition: (1) reference: length(scores) + length(scores)_mapq0 => 15530966 + 1172354 = 16703320 graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058 That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320! However, I'm not sure if I should celebrate just yet...as @edawson <https://github.com/edawson> mentioned, so if you want to count mapped reads you'd want to check the MAPQs. (2) reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494 graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249 15742494 is the length of the original GAMs for both. Why would that be? (3) reference: length(scores) + length(json)_mapq0 => 15530966 + 1383882 = 16914848 graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851 I'm not sure what this means, but here the value is greater for the GRAPH (4) reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966 graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701 Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores, the value for the reference-only graph is still larger....I'm not sure what this means. Questions: (A) How should I interpret each of these values? (B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph *still* has more reads mapped than the VCF graph...why would this be? Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1701 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA2_7mo0bgxzjfy9w3zBVrVQ6gq7BkAFks5t3IPcgaJpZM4UMiWe> .

evanbiederstedt · 2018-05-29T17:12:40Z

Hi @glennhickey

Thanks for the help.

It is possible that the latter group is larger.

I see. I guess this could make sense.

One other thing I don't entirely understand yet: Using the JSON from (mapping_quality | not) above, there are values for identity and scores. Should I conclude that these are MAPQ==0 alignments?

For each case, the size of the json and the size of the identity fields are different, e.g.

 length(json)= 1383882, length(scores) =1172354

ekg · 2018-05-29T18:09:47Z

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

The graph with variation will have more ambiguity in it. This will result in more MAPQ=0 alignments. MAPQ=0 means the top two alignments had the same score.

Look for reads that get MAPQ=0 and score=0. For a short read alignment in vg this means they are unmapped.

In this case you're getting something like 0.4% more, which isn't a big deal IMO if you're mapping more reads correctly overall. It isn't clear that it's wrong to get more or less MAPQ=0 alignments if it's happening because you've learned that your read has more optimal mapping locations.

edawson · 2018-06-06T16:59:09Z

I think we should go ahead and expose this at the CLI, either in the existing stats subcommand, the sift subcommand, or via a new gamstats subcommand`. Open to suggested on the CLI.

possible examples:
vg stats -G reads.gam -m graph.vg
vg gamstats -m reads.gam
vg sift -S -m reads.gam

colindaven · 2020-08-05T12:16:22Z

Great work on vg, I've just come back to it and mapping is a lot more performant.

Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency.

Also, although I've tried to set a high pc identity threshold 0.95 (script below)
I'm still getting lower identity mappings in the result jsons

eg:

{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55

#!/bin/bash

# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020

threads=56

for fq in `ls *.fastq`

        do
                echo "Aligning FASTQ SE short"
                echo $i
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60       --min-ident 0.95 -f $fq > $fq.nomulti.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1  --min-ident 0.95 -f $fq > $fq.1multi.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5  --min-ident 0.95 -f $fq > $fq.5multi.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam

                #--surject-to bam 
done


for gam in `ls *.gam`

        do
                echo "Converting gam"
                echo $i
                nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json
                nohup srun -p short -c 1 -t 119 vg index -d  $gam.index -N $gam

done

Script 2 ... stats attempt:

#!/bin/bash

# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020

for filename in `ls *.gam`

        do
                echo $i
                nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json
                grep identity $filename.json > $filename.json.al.json

                echo "Alignment stats"
                echo "Total lines (reads): " 
                wc -l $filename.json
                echo "Total lines with identity (mapped reads): " 
                grep -c identity $filename.json
done

glennhickey · 2020-08-05T18:50:25Z

vg stats -a will give you some mapping statistics. You can also filter by identity using vg filter -p -u -f

…

On Wed, Aug 5, 2020 at 8:16 AM Colin Davenport ***@***.***> wrote: Great work on vg, I've just come back to it and mapping is a lot more performant. Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency. Also, although I've tried to set a high pc identity threshold 0.95 (script below) I'm still getting lower identity mappings in the result jsons eg: {"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55 #!/bin/bash # Run vg view for all vg gam files -> json on SLURM # Colin D, August 2020 threads=56 for fq in `ls *.fastq` do echo "Aligning FASTQ SE short" echo $i nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 --min-ident 0.95 -f $fq > $fq.nomulti.gam nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1 --min-ident 0.95 -f $fq > $fq.1multi.gam nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5 --min-ident 0.95 -f $fq > $fq.5multi.gam nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam #--surject-to bam done for gam in `ls *.gam` do echo "Converting gam" echo $i nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json nohup srun -p short -c 1 -t 119 vg index -d $gam.index -N $gam done Script 2 ... stats attempt: #!/bin/bash # Run vg view for all vg gam files -> json on SLURM # Colin D, August 2020 for filename in `ls *.gam` do echo $i nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json grep identity $filename.json > $filename.json.al.json echo "Alignment stats" echo "Total lines (reads): " wc -l $filename.json echo "Total lines with identity (mapped reads): " grep -c identity $filename.json done — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1701 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAG373RA32EB2RJGPRIVQX3R7FESJANCNFSM4FBSEWPA> .

edawson added the feature request For functionality that is desired at the CLI level label Jun 6, 2018

edawson self-assigned this Jun 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to count the number of mapped reads against a vg? #1701

How to count the number of mapped reads against a vg? #1701

evanbiederstedt commented May 24, 2018

edawson commented May 24, 2018

evanbiederstedt commented May 24, 2018 •

edited

Loading

evanbiederstedt commented May 28, 2018 •

edited

Loading

glennhickey commented May 28, 2018 via email

evanbiederstedt commented May 28, 2018

glennhickey commented May 28, 2018 via email

evanbiederstedt commented May 28, 2018

evanbiederstedt commented May 28, 2018

glennhickey commented May 29, 2018 via email

evanbiederstedt commented May 29, 2018

ekg commented May 29, 2018

edawson commented Jun 6, 2018

colindaven commented Aug 5, 2020

glennhickey commented Aug 5, 2020 via email

How to count the number of mapped reads against a vg? #1701

How to count the number of mapped reads against a vg? #1701

Comments

evanbiederstedt commented May 24, 2018

edawson commented May 24, 2018

evanbiederstedt commented May 24, 2018 • edited Loading

evanbiederstedt commented May 28, 2018 • edited Loading

glennhickey commented May 28, 2018 via email

evanbiederstedt commented May 28, 2018

glennhickey commented May 28, 2018 via email

evanbiederstedt commented May 28, 2018

evanbiederstedt commented May 28, 2018

glennhickey commented May 29, 2018 via email

evanbiederstedt commented May 29, 2018

ekg commented May 29, 2018

edawson commented Jun 6, 2018

colindaven commented Aug 5, 2020

glennhickey commented Aug 5, 2020 via email

evanbiederstedt commented May 24, 2018 •

edited

Loading

evanbiederstedt commented May 28, 2018 •

edited

Loading