Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to count the number of mapped reads against a vg? #1701

Open
evanbiederstedt opened this issue May 24, 2018 · 14 comments
Open

How to count the number of mapped reads against a vg? #1701

evanbiederstedt opened this issue May 24, 2018 · 14 comments
Assignees
Labels
feature request For functionality that is desired at the CLI level

Comments

@evanbiederstedt
Copy link

Hi there

This is another clarification question:

(1)

Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...

(2)

I created a non-branching graph against the reference genome with one component per chromosome using the following command:

$ vg construct -r hg38.fa > output.vg

I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa.

Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph. That's strange, right? We'd expect that, each alignment against the reference-only graph should also be a valid alignment against the reference+VCF graph....yet, the reference-only graph has a higher number of mapped reads (using the measurement of mapped reads in discussed in (1))

Any ideas what could be going on here? Perhaps I'm measuring the total number of mapped reads incorrectly?

@edawson
Copy link
Contributor

edawson commented May 24, 2018

Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...

I don't think there is a recommended way yet, but in theory we'd put something in vg stats, or a similar namespace. It sounds like each alignment gets a score/identity, but each read gets a MAPQ (which is computed from the multiple alignments per read), so if you want to count mapped reads you'd want to check the MAPQs.

I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa. Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph.

You might get fewer MAPQ > 0 reads in the VCF graph, as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.

@evanbiederstedt
Copy link
Author

evanbiederstedt commented May 24, 2018

Hi @edawson

Thanks for the quick response.

You might get fewer MAPQ > 0 reads in the VCF graph

So, this is to say that the MAPQ=0 reads are filtered out? With my tests, I the minimum MAPQ I ever saw was 1.

I guess....is there any way I could access these? At the moment, I'm not sure how I'd use the GAM/JSON to access a (non-read) alignment with no MAPQ values and a read with MAPQ=0, right?

as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.

I guess this gets in the theory and purpose of a graph genome, but this is a bit of a conceptual problem, no? If there are reads which align on a reference with some MAPQ score against a somewhat unique sequence, a graph genome could end up dismissing these reads because of the inherent variation within the graph?

Maybe the problem is the MAPQ metric applied for graph genomes?

@evanbiederstedt
Copy link
Author

evanbiederstedt commented May 28, 2018

@edawson

I think the addendum to the question above is, would accessing the MAPQ==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue?

@glennhickey
Copy link
Contributor

glennhickey commented May 28, 2018 via email

@evanbiederstedt
Copy link
Author

Hi @glennhickey

Thanks for the help

For the commands given, i.e.

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG > mapq0.gam

I see an output for

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'

but there is an error with

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG

Error:

[vg view] error: no filename given

Are the above commands correct?

@glennhickey
Copy link
Contributor

glennhickey commented May 28, 2018 via email

@evanbiederstedt
Copy link
Author

@glennhickey
Thanks for this. Ah yes, this makes sense, as you would need to pipe JSON format into jq

@evanbiederstedt
Copy link
Author

@glennhickey @edawson

So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph.

Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):

REFERENCE-ONLY
length(json)= 15742494
length(scores) = 15530966
length(identity) = 15530966
length(mapping_quality)=14358612

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1383882
length(scores) =1172354
length(identity) =1172354
length(mapping_quality)=0 ## this makes sense! good!

GRAPH
length(json)= 15742494 ### (It's interesting these are the same length...)
length(scores) =15523701
length(identity) =15523701
length(mapping_quality)=14280344

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1462150
length(scores) =1243357
length(identity) =1243357
length(mapping_quality)= 0

Attempts at addition:

(1)

reference: length(scores) + length(scores)_mapq0 =>  15530966 + 1172354 = 16703320
graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058 

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

However, I'm not sure if I should celebrate just yet...as @edawson mentioned,

so if you want to count mapped reads you'd want to check the MAPQs.

(2)

reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494 
graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249

15742494 is the length of the original GAMs for both. Why would that be?

(3)

reference: length(scores) + length(json)_mapq0 =>  15530966 + 1383882  = 16914848
graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851

I'm not sure what this means, but here the value is greater for the GRAPH

(4)

reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966
graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701

Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores, the value for the reference-only graph is still larger....I'm not sure what this means.

Questions:

(A) How should I interpret each of these values?

(B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be?

Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference?

@glennhickey
Copy link
Contributor

glennhickey commented May 29, 2018 via email

@evanbiederstedt
Copy link
Author

Hi @glennhickey

Thanks for the help.

It is possible that the latter group is larger.

I see. I guess this could make sense.

One other thing I don't entirely understand yet: Using the JSON from (mapping_quality | not) above, there are values for identity and scores. Should I conclude that these are MAPQ==0 alignments?

For each case, the size of the json and the size of the identity fields are different, e.g.

 length(json)= 1383882, length(scores) =1172354

@ekg
Copy link
Member

ekg commented May 29, 2018

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

The graph with variation will have more ambiguity in it. This will result in more MAPQ=0 alignments. MAPQ=0 means the top two alignments had the same score.

Look for reads that get MAPQ=0 and score=0. For a short read alignment in vg this means they are unmapped.

In this case you're getting something like 0.4% more, which isn't a big deal IMO if you're mapping more reads correctly overall. It isn't clear that it's wrong to get more or less MAPQ=0 alignments if it's happening because you've learned that your read has more optimal mapping locations.

@edawson edawson added the feature request For functionality that is desired at the CLI level label Jun 6, 2018
@edawson
Copy link
Contributor

edawson commented Jun 6, 2018

I think we should go ahead and expose this at the CLI, either in the existing stats subcommand, the sift subcommand, or via a new gamstats subcommand`. Open to suggested on the CLI.

possible examples:
vg stats -G reads.gam -m graph.vg
vg gamstats -m reads.gam
vg sift -S -m reads.gam

@edawson edawson self-assigned this Jun 6, 2018
@colindaven
Copy link
Contributor

Great work on vg, I've just come back to it and mapping is a lot more performant.

Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency.

Also, although I've tried to set a high pc identity threshold 0.95 (script below)
I'm still getting lower identity mappings in the result jsons

eg:

{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55

#!/bin/bash

# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020

threads=56

for fq in `ls *.fastq`

        do
                echo "Aligning FASTQ SE short"
                echo $i
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60       --min-ident 0.95 -f $fq > $fq.nomulti.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1  --min-ident 0.95 -f $fq > $fq.1multi.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5  --min-ident 0.95 -f $fq > $fq.5multi.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam

                #--surject-to bam 
done


for gam in `ls *.gam`

        do
                echo "Converting gam"
                echo $i
                nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json
                nohup srun -p short -c 1 -t 119 vg index -d  $gam.index -N $gam

done

Script 2 ... stats attempt:

#!/bin/bash

# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020

for filename in `ls *.gam`

        do
                echo $i
                nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json
                grep identity $filename.json > $filename.json.al.json

                echo "Alignment stats"
                echo "Total lines (reads): " 
                wc -l $filename.json
                echo "Total lines with identity (mapped reads): " 
                grep -c identity $filename.json
done

@glennhickey
Copy link
Contributor

glennhickey commented Aug 5, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request For functionality that is desired at the CLI level
Projects
None yet
Development

No branches or pull requests

5 participants