Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@PG ID:bwa line containing @RG ID:... causes SAMFormatException #677

Closed
jblachly opened this issue Aug 3, 2016 · 8 comments
Closed

@PG ID:bwa line containing @RG ID:... causes SAMFormatException #677

jblachly opened this issue Aug 3, 2016 · 8 comments

Comments

@jblachly
Copy link

jblachly commented Aug 3, 2016

_(Foreword) This may actually be an htslib issue rather than specifically htsjdk_

Verify

I searched the issue tracker for the text: "clash"

Subject of the issue

Briefly, when read groups are added at the alignment step with BWA (instead of adding them later), bwa inserts the @pg BAM header line that includes its complete command line after the CL: key. However, when the command line string includes -R @RG<tab>ID:xyz... , the read group ID is parsed as a new key by htslib/htsjdk's header parsing code. This causes an exception.

I must further note that this only occurs when the command line actually has the literal "tab" character, which is most likely to occur when bwa is called from a pipeline stage where the command line was build programmatically. When running bwa from the command line and typing the \t escape sequence as a string, bwa parses it correctly but includes the original \t in the @pg line, thus avoiding htslib/htsjdk interpreting ID: as a distinct key.

The behavior can be replicated by running bwa from command line and typing -R $'@RG<control-v><tab>ID:whatever<control-v><tab>SM:whatever<control-v><tab>PU:whatever'

This has not bit me 'til now as I am formally automating everything that was previously done with a hodgepodge of shell scripts.

[Wed Aug 03 17:32:48 EDT 2016] picard.analysis.directed.CollectHsMetrics BAIT_INTERVALS=[designed-probe-coords.interval_list] TARGET_INTERVALS=[my-targets.interval_list] INPUT=aligned_grch37/redacted.bam OUTPUT=aligned_grch37/redacted.hsmetrics.txt PER_TARGET_COVERAGE=aligned_grch37/redacted.pertarget.txt REFERENCE_SEQUENCE=/export/local/ref/genomes/grch37/grch37.fa    MINIMUM_MAPPING_QUALITY=20 MINIMUM_BASE_QUALITY=20 CLIP_OVERLAPPING_READS=true METRIC_ACCUMULATION_LEVEL=[ALL_READS] NEAR_DISTANCE=250 COVERAGE_CAP=200 SAMPLE_SIZE=10000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Aug 03 17:32:48 EDT 2016] Executing as x@x on Linux 4.4.0-24-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
[Wed Aug 03 17:32:49 EDT 2016] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2024275968
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair ID:redacted clashes with ID:bwa. Line:
@PG     ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:/export/local/tools/bwa/bwa mem -t 32 -R @RG      ID:redacted     SM:redacted     LB:1    PL:Illumina     PM:Miseq        PU:MiseqSN.1 CN:OSU /export/local/ref/indices/bwa/grch37/grch37 cutadapt/redacted_R1.fastq cutadapt/redacted_R2.fastq; File /home/x/capture/aligned_grch37/redacted.bam; Line number 88
        at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:238)
        at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:43)
        at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.<init>(SAMTextHeaderCodec.java:293)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:87)
        at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:524)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:177)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:132)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:293)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:145)
        at picard.analysis.directed.CollectTargetedMetrics.doWork(CollectTargetedMetrics.java:110)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Your environment

  • version of htsjdk
    htsjdk included with Picard 2.5.0 from GitHub
    Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
  • version of java
$ java -version
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
  • which OS
$ uname -a
Linux redacted 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

Tell us how to reproduce this issue. If possible, include a short code snippet to demonstrate the problem.

bwa -R 'ID:anything'

followed by picardtools

Expected behaviour

No exception (ignore previously seen keys, at least on @pg lines)

Actual behaviour

htsjdk.samtools.SAMFormatException as above

Proposed solution

ID: keys seen after the initial ID: on a @pg line should be ignored

Other commentary

The inclusion of RG at the alignment step is convenient and a better use of resources than doing a separate run of AddRemoveReadGroups.

The SAM specifications indicate that a command line follows CL:. This is imperfect, but I propose that bwa -R is common enough that htslib/htsjdk should ignore previously seen tags.

Workaround in the meantime

Pipelines should include escape sequence \t rather than literal <tab> control characters in bwa command lines

@nh13
Copy link
Member

nh13 commented Aug 4, 2016

@PG lines cannot have tabs, see section 1.3 of the SAM spec:

Thus header lines match /^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/.

@jblachly
Copy link
Author

jblachly commented Aug 4, 2016

Field values cannot have tabs, yes. I agree is a bwa bug. But is an htslib/jtsjdk issue nontheless -- feel free to mark 'wontfix,' but I thought it was important to document in case someone else runs across this, as there are I think zero google hits for this.

@nh13
Copy link
Member

nh13 commented Aug 4, 2016

Thanks for documenting it here for folks to find it, and feel free to link the issue you log with BWA, as it likely needs to sanitize its output in this case. Also, you can modify your script/code not to produce tabs.

@lucacozzuto
Copy link

lucacozzuto commented Jul 31, 2017

Dear all, I found that adding this command to bwa
-R '@rg\tID:${MYID}\tSM:${MYID}\tLB:library\n@PG\tID:bwa'
I fix this problem (at least for qualimap). In few word I attach to @rg a new line with @pg. I know it is a workaround...
Best,

@ethering
Copy link

ethering commented Dec 6, 2019

I've also run into this problem. Unfortunately the workaround used by @lucacozzuto doesn't work in picardtools as the ID tag is still present in both @RG and @PG tags. I think the best option is to leave out the -R option in BWA initially and use picard's AddOrReplaceReadGroups to add the @RG line later.

@dixonlab
Copy link

I've come across this problem as well when using Picard. My workaround has been to include "VALIDATION_STRINGENCY=LENIENT" in the Picard commands that throw this error.

@rwanwork
Copy link

rwanwork commented Jan 8, 2021

Thank you for opening this issue! Indeed, I just faced this problem while using gatk. I guess the problem is we're using the -R option to bwa to update the readgroup. But @PG lists the command-line that was used. So, this causes a duplicate in ID. One is the actual ID; the other one occurs in the -R option of bwa. I don't think the presence or absence of a tab character is the cause. (Though, to use bwa's -R option, we have to use the \t character.) The only workaround I can think of is to have a script go through and remove the "CL:" field of @PG...

@feiloo
Copy link

feiloo commented Nov 14, 2023

Fyi, you can just use two backslashes: -R "@RG\\tID:1\\tLB:...
bwa now has a warning about it, but for bwamem2 its atm still a gotcha.

posting so others who stumble here like me can quickly fix it, until its properly escaped
bwa-issue: lh3/bwa#83

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants