mpileup ignores (?) RG SM fields #599
Comments
RG and SM do not appear to be relevant to the pileup format.
@kirkmcclure I also noticed that RG and SM do make a difference in mpileup/bcftools output, but they seem to be completely ignored in the mpileup format, which is required as input for VarScan. How can I use VarScan for variant calling with one BAM file containing multiple samples labeled by different RG and SM values? Do you have any ideas? Thanks
I just ran into this issue as well using samtools versions
Here is my minimal example using samtools/bcftools/htslib:
> cat issue/test.sam
@HD VN:1.6 SO:unknown
@SQ SN:chr1 LN:1000000
@RG ID:0 SM:sample_0
@RG ID:1 SM:sample_1
r0 0 chr1 24 0 1M * 0 0 G I RG:Z:0
r1 0 chr1 24 0 1M * 0 0 G I RG:Z:1
> samtools mpileup issue/test.sam
[mpileup] 2 samples in 1 input files
chr1 24 N 2 ^!G$^!G$ II
As @ju-cheng describes,
> bcftools mpileup --no-reference issue/test.sam
[mpileup] 2 samples in 1 input files
Note: The maximum per-sample depth with -d 250 is 125.0x
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##bcftoolsCommand=mpileup --no-reference issue/test.sam
##contig=<ID=chr1,length=1000000>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_0 sample_1
chr1 24 . N G,<*> 0 . DP=2;I16=0,0,2,0,0,0,80,3200,0,0,0,0,0,0,0,0;QS=0,2,0;VDB=0.02;SGB=-0.759771;MQ0F=1 PL 4,3,0,4,3,4 4,3,0,4,3,4
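Since the pileup output above merges both samples into a single column group, one possible workaround is to split the input by sample before running mpileup. The following is only a minimal sketch, not a samtools feature: it uses plain awk on SAM text, recreates the minimal example from this comment, and assumes that every read carries an RG:Z: tag and that every @RG line has both ID and SM. The function name `split_sam_by_sample` and the output directory `rg_split_demo` are made up for illustration.

```shell
#!/bin/sh
# Sketch: split a SAM file into one SAM file per SM value using only awk.
# Assumption: every @RG line has ID and SM, and reads carry an RG:Z: tag.
set -eu

outdir=rg_split_demo          # hypothetical output directory
mkdir -p "$outdir"
sam="$outdir/test.sam"

# Recreate the minimal example from this thread (tab-separated fields).
{
  printf '@HD\tVN:1.6\tSO:unknown\n'
  printf '@SQ\tSN:chr1\tLN:1000000\n'
  printf '@RG\tID:0\tSM:sample_0\n'
  printf '@RG\tID:1\tSM:sample_1\n'
  printf 'r0\t0\tchr1\t24\t0\t1M\t*\t0\t0\tG\tI\tRG:Z:0\n'
  printf 'r1\t0\tchr1\t24\t0\t1M\t*\t0\t0\tG\tI\tRG:Z:1\n'
} > "$sam"

split_sam_by_sample() {
  # $1 = input SAM, $2 = directory for the per-sample SAM files
  awk -F '\t' -v outdir="$2" '
    /^@RG/ {
      id = ""; sm = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^ID:/) id = substr($i, 4)
        if ($i ~ /^SM:/) sm = substr($i, 4)
      }
      sample_of[id] = sm        # read group ID -> sample name
      rg_line[id] = $0
      next
    }
    /^@/ { shared = shared $0 "\n"; next }   # other header lines go to every output
    {
      rg = ""
      for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) rg = substr($i, 6)
      if (!(rg in sample_of)) next           # skip reads without a known read group
      sm = sample_of[rg]
      out = outdir "/" sm ".sam"
      if (!(sm in started)) {
        # First read for this sample: write the shared header and its @RG lines.
        printf "%s", shared > out
        for (id in sample_of) if (sample_of[id] == sm) print rg_line[id] > out
        started[sm] = 1
      }
      print > out
    }
  ' "$1"
}

split_sam_by_sample "$sam" "$outdir"
```

If the per-file behaviour described in this thread holds, running `samtools mpileup rg_split_demo/sample_0.sam rg_split_demo/sample_1.sam` on the two resulting files should then give one column group per sample. For real BAM input, the `samtools split` subcommand, which separates a file by read group, may be a more robust starting point than this text-level sketch.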
I just ran into this issue again and found my previous post through a Google search. I think this must be either a bug that has always existed or a feature that was planned but never implemented. @lh3?
See also the documentation fixes in PR #1055.
Hmm, the documentation is still ambiguous. From the man page for samtools 0.1.19 (http://www.htslib.org/doc/0.1.19/samtools.html)
And if I understand correctly, #1055 asserts that even as far back as samtools 0.1.19, the @RG header was never intended to be used for pileup output?
And in the man page for samtools 1.9 (http://www.htslib.org/doc/1.9/samtools.html), it's even harder to read the documentation any other way than that samtools mpileup supports the @RG header for pileup format:
As a user, I am rather confused :/
It looks like there is a feature request for this: #546
That is correct. The historical documentation was misleading about these historical versions' capabilities. That was fixed by PR #1055, which landed in 1.10. To reduce confusion, I suggest you use the latest samtools version and the corresponding current documentation. The historical documentation is provided for historical reasons and is (obviously) not separately maintained.
Ah, thanks! I thought 1.9 was the current version. Silly me -,-
Closing as fixed in current release.
I have a number (18) of BAM files with alignments from a total of 5 samples. I have added @RG lines to these with SM:sample_ids. When I run samtools mpileup on them, samtools reports (correctly) 5 samples in 18 BAM files. However, in the pileup file I get 18 series of columns, one for each file, which is exactly what I get when using the -R option.
I'm guessing that I either have the @RG header line set incorrectly, or that I'm expecting mpileup to do something it isn't supposed to do, but I can't quite work out from the documentation what's going on.
The command I'm using:
samtools mpileup -f ref.fasta *.bam > data.pileup
which then reports:
5 samples in 18 input files
The resulting file, though, has 18 sets of 3 columns reporting the alignment information (and no indication of which set is which).
The BAM file headers have lines of the form:
@RG ID:SP SM:SP
and individual alignment lines have:
RG:Z:SP
I expect this is not a bug, but rather my misunderstanding of what mpileup is supposed to do, but would rather like to confirm this.
thanks,
Martin
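Given that the pileup output described here has one set of columns per input file, one way to get one set per sample might be to merge the BAM files of each sample before calling mpileup. The sketch below is a dry run only: it prints the `samtools merge` and `samtools mpileup` commands without executing them. The manifest file, the run and sample names in it, the `.merged.bam` naming, and the function name `plan_per_sample_pileup` are all hypothetical and stand in for the real 18 files and 5 samples.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) one `samtools merge` per sample,
# followed by a single mpileup over the merged per-sample files.
set -eu

# Hypothetical manifest: "<bam path> <sample id>", whitespace-separated,
# one pair per line (paths containing spaces are not supported here).
manifest=rg_merge_manifest.txt
cat > "$manifest" <<'EOF'
run01.bam SP1
run02.bam SP1
run03.bam SP2
run04.bam SP2
EOF

plan_per_sample_pileup() {
  # $1 = manifest, $2 = reference FASTA
  awk -v ref="$2" '
    NF == 2 {
      if (!($2 in inputs)) order[++n] = $2     # remember first-seen sample order
      inputs[$2] = inputs[$2] " " $1           # collect the BAM files per sample
    }
    END {
      merged = ""
      for (i = 1; i <= n; i++) {
        s = order[i]
        print "samtools merge " s ".merged.bam" inputs[s]
        merged = merged " " s ".merged.bam"
      }
      print "samtools mpileup -f " ref merged " > data.pileup"
    }
  ' "$1"
}

plan_per_sample_pileup "$manifest" ref.fasta > rg_merge_plan.txt
cat rg_merge_plan.txt
```

The printed plan can be reviewed and then run by hand. With the merged files as input, the column sets in data.pileup would follow the order of the files on the mpileup command line, which also addresses the "which is what" problem, assuming the per-file column behaviour reported above.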