Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpileup ignores (?) RG SM fields #599

Closed
lmjakt opened this issue Jul 27, 2016 · 10 comments
Closed

mpileup ignores (?) RG SM fields #599

lmjakt opened this issue Jul 27, 2016 · 10 comments

Comments

@lmjakt
Copy link

lmjakt commented Jul 27, 2016

I have a number (18) of bam files with alignments from a total of 5 samples. I have added @rg lines to these with SM:sample_ids. When I run samtools mpileup with these samtools reports (correctly) 5 samples in 18 bam files. However in the pileup file I get 18 series of columns, one for each file; exactly the same as I get when using the -R option.
I'm guessing that I either have the @rg header line incorrectly set, or that I'm expecting mpileup to do something it isn't supposed to do, but can't quite work out from the documentation what's going on.
The command I'm using :

samtools mpileup -f ref.fasta *.bam > data.pileup

which then reports :
5 samples in 18 input files

the resulting files though have 18 sets of 3 columns reporting the alignment information (and no information as to which is what).

The bam file header have lines taking the form:
@rg ID:SP SM:SP

and individual alignment lines have:
RG:Z:SP

I expect this is not a bug, but rather my misunderstanding of what mpileup is supposed to do, but would rather like to confirm this.

thanks,

Martin

@kirkmcclure
Copy link

RG and SM do not appear to be relevant to the pileup format.
However, this issue is not valid in general, as mpileup does use RG and SM fields.
A simple demonstration (note additional field and SGB change in BCF output when RG tags are intentionally ignored):

> cat /tmp/Issue599.sam
@HD     VN:1.4  SO:coordinate
@RG     ID:grp1 DS:Group 1      LB:Library 1    SM:Sample1
@RG     ID:grp2 DS:Group 2      LB:Library 2    SM:Sample2
@SQ     SN:ref1 LN:56   M5:08c04d512d4797d9ba2a156c1daba468
@SQ     SN:ref2 LN:60   M5:7c35feac7036c1cdef3bee0cc4b21437
p001    0       ref1    1       0       1M      *       0       0       C      A MD:Z:1  NM:i:0  RG:Z:grp1
r001    0       ref2    1       0       1M      *       0       0       T      A MD:Z:1  NM:i:0  RG:Z:grp2

> ./samtools mpileup -g /tmp/Issue599.sam > /tmp/Issue599_two.bcf
[mpileup] 2 samples in 1 input files
<mpileup> Set max per-file depth to 4000

> ./samtools mpileup -R -g /tmp/Issue599.sam > /tmp/Issue599_one.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

> ../bcftools/bcftools view /tmp/Issue599_two.bcf > /tmp/Issue599_two.bcf.txt
> ../bcftools/bcftools view /tmp/Issue599_one.bcf > /tmp/Issue599_one.bcf.txt

> diff /tmp/Issue599_two.bcf.txt /tmp/Issue599_one.bcf.txt
4c4
< ##samtoolsCommand=samtools mpileup -g /tmp/Issue599.sam
---
> ##samtoolsCommand=samtools mpileup -R -g /tmp/Issue599.sam
23,26c23,26
< ##bcftools_viewCommand=view /tmp/Issue599_two.bcf; Date=Wed May 17 13:09:01 2017
< #CHROM        POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT Sample1  Sample2
< ref1  1       .       N       C,<*>   0       .       DP=1;I16=0,0,1,0,0,0,32,1024,0,0,0,0,0,0,0,0;QS=0,1,0;SGB=-0.157211;MQ0F=1      PL      4,3,0,4,3,4    0,0,0,0,0,0
< ref2  1       .       N       T,<*>   0       .       DP=1;I16=0,0,1,0,0,0,32,1024,0,0,0,0,0,0,0,0;QS=0,1,0;SGB=-0.157211;MQ0F=1      PL      0,0,0,0,0,0    4,3,0,4,3,4
---
> ##bcftools_viewCommand=view /tmp/Issue599_one.bcf; Date=Wed May 17 13:09:34 2017
> #CHROM        POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT /tmp/Issue599.sam
> ref1  1       .       N       C,<*>   0       .       DP=1;I16=0,0,1,0,0,0,32,1024,0,0,0,0,0,0,0,0;QS=0,1,0;SGB=-0.379885;MQ0F=1      PL      4,3,0,4,3,4
> ref2  1       .       N       T,<*>   0       .       DP=1;I16=0,0,1,0,0,0,32,1024,0,0,0,0,0,0,0,0;QS=0,1,0;SGB=-0.379885;MQ0F=1      PL      4,3,0,4,3,4

@ju-cheng
Copy link

ju-cheng commented Jun 6, 2018

@kirkmcclure I also noticed that RG and SM do make a difference in mpileup/bcftools, but seems to be completely ignored in the mpileup format, which is required as input for VarScan. I was wondering how to use VarScan for variant calling with one bam file containing multiple samples labeled by different RG and SM. Do you have any idea about that? Thanks

@winni2k
Copy link

winni2k commented Oct 8, 2019

I just ran into this issue as well using samtools versions 0.1.19-44428cd, 1.5, and 1.9.

Here is my minimal example using samtools/bcftools/htslib 1.9:

> cat issue/test.sam
@HD	VN:1.6	SO:unknown
@SQ	SN:chr1	LN:1000000
@RG	ID:0	SM:sample_0
@RG	ID:1	SM:sample_1
r0	0	chr1	24	0	1M	*	0	0	G	I	RG:Z:0
r1	0	chr1	24	0	1M	*	0	0	G	I	RG:Z:1
> samtools mpileup issue/test.sam                                                                                                                                  
[mpileup] 2 samples in 1 input files
chr1	24	N	2	^!G$^!G$	II

As @ju-cheng describes, bcftools mpileup seems to split the pileup by read group as expected

> bcftools mpileup --no-reference issue/test.sam 
[mpileup] 2 samples in 1 input files
Note: The maximum per-sample depth with -d 250 is 125.0x
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##bcftoolsCommand=mpileup --no-reference issue/test.sam
##contig=<ID=chr1,length=1000000>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample_0	sample_1
chr1	24	.	N	G,<*>	0	.	DP=2;I16=0,0,2,0,0,0,80,3200,0,0,0,0,0,0,0,0;QS=0,2,0;VDB=0.02;SGB=-0.759771;MQ0F=1	PL	4,3,0,4,3,4	4,3,0,4,3,4

@winni2k
Copy link

winni2k commented Mar 26, 2020

I just ran into this issue again and found my previous post through a google search. I think this must be a bug that has always existed or a feature that was planned but never implemented, @lh3?

@jmarshall
Copy link
Member

See also the documentation fixes in PR #1055:

Grouping by sample (SM) identifiers was only ever for VCF/BCF output. Remove SM info from the first paragraph: text pileup output is only grouped by input file.

@winni2k
Copy link

winni2k commented Mar 27, 2020

Hmm, the documentation is still ambiguous. From the man page for samtools 0.1.19 (http://www.htslib.org/doc/0.1.19/samtools.html)

mpileup samtools mpileup [-EBugp] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

Generate BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample identifiers in @rg header lines. If sample identifiers are absent, each input file is regarded as one sample.

And if I understand correctly, #1055 asserts that even as far back as samtools 0.1.19, the @rg header was never intended to be used for pileup output?

And in the man page for version samtools 1.9 (http://www.htslib.org/doc/1.9/samtools.html) it's even harder to read the documentation any other way than that samtools mpileup supports the @rg header for pileup format:

mpileup samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

Generate pileup for one or multiple BAM files. Alignment records are grouped by sample (SM) identifiers in @rg header lines. If sample identifiers are absent, each input file is regarded as one sample.

As a user, I am rather confused :/

@winni2k
Copy link

winni2k commented Mar 27, 2020

It looks like there is a feature request for this: #546

@jmarshall
Copy link
Member

jmarshall commented Mar 27, 2020

And if I understand correctly, #1055 asserts that even as far back as samtools 0.1.19, the @RG header was never intended to be used for pileup output?

That is correct. The historical documentation was misleading about these historical versions' capabilities. That was fixed by PR #1055, which landed in 1.10.

To reduce confusion, I suggest you use the latest samtools version and the corresponding current documentation. The historical documentation is provided for historical reasons and is (obviously) not separately maintained.

@winni2k
Copy link

winni2k commented Mar 27, 2020

Ah, thanks! I thought 1.9 was the current version. Silly me -,-

@daviesrob
Copy link
Member

Closing as fixed in current release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants