get_reference_sequence with_seq=True returns different reference for same position? #1180

semenko · 2023-03-31T18:01:43Z

I'm parsing an alignment of methylated sequencing data from biscuit.

At some regions, get_reference_sequence(with_seq=True) returns different bases for the same position, which I don't understand.

Here's an example run on this tiny .bam file, looking at chr1:10468

$ samtools faidx GRCh38-DAC-U2AF1.fna chr1:10468-10468
>chr1:10468-10468
T

However pysam reports both C and T as the reference (in addition to a substitution).

import pysam
print(f"Pysam Version: {pysam.__version__}")

with pysam.AlignmentFile("/home/semenko/sorted.tinysam.bam", "rb") as sam:
    for read in sam.fetch():
        print([x for x in read.get_aligned_pairs(with_seq=True) if x[1] == 10468])

Output:

Pysam Version: 0.20.0
[]
[]
[]
[(124, 10468, 'C')]   <----- ?????
[(128, 10468, 'c')]
[(112, 10468, 'T')]    <------- correct

(This seems a little like #895 -- I wonder if the MD tags are corrupt somehow?)

The text was updated successfully, but these errors were encountered:

jamorrison · 2023-04-04T17:36:01Z

I'm not sure what pysam's behavior is in handling insertions, but could this difference be the result of insertions in your reads?

jmarshall · 2023-04-05T21:29:49Z

The first problem here is that the coordinates in the tuples returned by get_aligned_pairs are 0-based. So to look at the same position shown in your faidx command, you need to look at 0-based position 10467:

Pysam Version: 0.21.0
[]
[]
[]
[(123, 10467, 'T')]
[(127, 10467, 't')]
[(111, 10467, 'T')]

which shows the expected unanimous T.

I will improve the get_aligned_pairs documentation so that it mentions that the coordinates are 0-based.

The second problem is that you would expect the answer to be unanimous for (1-based) position 10469 too. By looking at all the tuples returned by get_aligned_pairs or by using get_reference_sequence, you can print out the complete nearby reference sequence as computed by pysam. For these last three reads, the computed reference sequences are:

                                    3                                                                                               12
                                    |                                                                                               ||
    TAATTTTAATTTTATTTTaATTTTaATTTTaAcTTTaATTTTAATTTTAACTTTTAATTTTTAATTTTAATTTTAATTTTAATCCTAATTTTAATTTTaAcTcTTAATTTTAATTTTAATTTTAATTTTCGCGGT
      ATTTTAATTTtATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAAccTtcG
                     TAATTTTAATTTTAATTTTAATTTTAATTTTaAcTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAATTTTTgcgGTAcTTTTAgTT

I have also reconstructed these manually from CIGAR, SEQ, and MD for these three reads, and get the same results as pysam. So I am fairly sure pysam is correct for these MD values.

The position marked 1 (scroll right) is your position 10468 showing the unanimous T. The adjacent position marked 2 is the one your script printed out, and is indeed variously C or T.

A similar discrepancy between the three reads also occurs e.g. at the position marked 3, which is to the left of most of the indels in these reads so is easily manually checked from the CIGARs and MDs as emitted by biscuit. And the three alignment records do indeed predict different reference bases at this position.

So it would appear that the MD values emitted by biscuit, while they are syntactically correct w.r.t. the ‘0’ separators, are not quite correct.

It would be interesting to see what samtools calmd did for this BAM file. Is your GRCh38-DAC-U2AF1.fna reference file available for download?

semenko · 2023-04-05T22:38:57Z

Thanks for taking the time to look into this and for such a detailed reply!

Here's GRCh38-DAC-U2AF1.fna.gz

For reference, it's generated from:

GRCh38 from GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
the ENCODE DAC exclusion list from ENCFF356LFX.bed.gz
the NCBI U2AF1 fix from GCA_000001405.15_GRCh38_GRC_exclusions.bed

Via:

bedtools maskfasta -fi GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -fo GRCh38-DAC.fna -bed DAC-Exclusion-List-ENCFF356LFX.bed
bedtools maskfasta -fi GRCh38-DAC.fna -fo GRCh38-DAC-U2AF1.fna -bed U2AF1-GRCh38-Exclusion.bed

jamorrison · 2023-04-06T16:52:15Z

I looked at this for BISCUIT specifically (problem 2 from @jmarshall's comment). I'll keep that specific discussion over on BISCUIT's issue, but you can see my comment here: huishenlab/biscuit#37 (comment).

Clarification for #1180. In future, we may consider returning this as a collections.namedtuple, as there's no natural ordering to the elements.

jmarshall · 2023-09-01T12:00:23Z

Documentation has been clarified. The remainder of the issue has I think been covered by the linked BISCUIT issue.

semenko mentioned this issue Mar 31, 2023

MD tag generation incorrect? huishenlab/biscuit#37

Closed

jmarshall added a commit that referenced this issue Sep 1, 2023

Mention that get_aligned_pairs() results are 0-based

fb8b21b

Clarification for #1180. In future, we may consider returning this as a collections.namedtuple, as there's no natural ordering to the elements.

jmarshall closed this as completed Sep 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_reference_sequence with_seq=True returns different reference for same position? #1180

get_reference_sequence with_seq=True returns different reference for same position? #1180

semenko commented Mar 31, 2023

jamorrison commented Apr 4, 2023

jmarshall commented Apr 5, 2023

semenko commented Apr 5, 2023

jamorrison commented Apr 6, 2023

jmarshall commented Sep 1, 2023

get_reference_sequence with_seq=True returns different reference for same position? #1180

get_reference_sequence with_seq=True returns different reference for same position? #1180

Comments

semenko commented Mar 31, 2023

jamorrison commented Apr 4, 2023

jmarshall commented Apr 5, 2023

semenko commented Apr 5, 2023

jamorrison commented Apr 6, 2023

jmarshall commented Sep 1, 2023