Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_reference_sequence with_seq=True returns different reference for same position? #1180

Closed
semenko opened this issue Mar 31, 2023 · 5 comments

Comments

@semenko
Copy link

semenko commented Mar 31, 2023

I'm parsing an alignment of methylated sequencing data from biscuit.

At some regions, get_reference_sequence(with_seq=True) returns different bases for the same position, which I don't understand.

Here's an example run on this tiny .bam file, looking at chr1:10468

$ samtools faidx GRCh38-DAC-U2AF1.fna chr1:10468-10468
>chr1:10468-10468
T

However pysam reports both C and T as the reference (in addition to a substitution).

import pysam
print(f"Pysam Version: {pysam.__version__}")

with pysam.AlignmentFile("/home/semenko/sorted.tinysam.bam", "rb") as sam:
    for read in sam.fetch():
        print([x for x in read.get_aligned_pairs(with_seq=True) if x[1] == 10468])

Output:

Pysam Version: 0.20.0
[]
[]
[]
[(124, 10468, 'C')]   <----- ?????
[(128, 10468, 'c')]
[(112, 10468, 'T')]    <------- correct

(This seems a little like #895 -- I wonder if the MD tags are corrupt somehow?)

@jamorrison
Copy link

I'm not sure what pysam's behavior is in handling insertions, but could this difference be the result of insertions in your reads?

@jmarshall
Copy link
Member

The first problem here is that the coordinates in the tuples returned by get_aligned_pairs are 0-based. So to look at the same position shown in your faidx command, you need to look at 0-based position 10467:

Pysam Version: 0.21.0
[]
[]
[]
[(123, 10467, 'T')]
[(127, 10467, 't')]
[(111, 10467, 'T')]

which shows the expected unanimous T.

I will improve the get_aligned_pairs documentation so that it mentions that the coordinates are 0-based.


The second problem is that you would expect the answer to be unanimous for (1-based) position 10469 too. By looking at all the tuples returned by get_aligned_pairs or by using get_reference_sequence, you can print out the complete nearby reference sequence as computed by pysam. For these last three reads, the computed reference sequences are:

                                    3                                                                                               12
                                    |                                                                                               ||
    TAATTTTAATTTTATTTTaATTTTaATTTTaAcTTTaATTTTAATTTTAACTTTTAATTTTTAATTTTAATTTTAATTTTAATCCTAATTTTAATTTTaAcTcTTAATTTTAATTTTAATTTTAATTTTCGCGGT
      ATTTTAATTTtATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAAccTtcG
                     TAATTTTAATTTTAATTTTAATTTTAATTTTaAcTTTTAATTTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTAATTTTTAATTTTAATTTTAATTTTAATTTTTgcgGTAcTTTTAgTT

I have also reconstructed these manually from CIGAR, SEQ, and MD for these three reads, and get the same results as pysam. So I am fairly sure pysam is correct for these MD values.

The position marked 1 (scroll right) is your position 10468 showing the unanimous T. The adjacent position marked 2 is the one your script printed out, and is indeed variously C or T.

A similar discrepancy between the three reads also occurs e.g. at the position marked 3, which is to the left of most of the indels in these reads so is easily manually checked from the CIGARs and MDs as emitted by biscuit. And the three alignment records do indeed predict different reference bases at this position.

So it would appear that the MD values emitted by biscuit, while they are syntactically correct w.r.t. the ‘0’ separators, are not quite correct.

It would be interesting to see what samtools calmd did for this BAM file. Is your GRCh38-DAC-U2AF1.fna reference file available for download?

@semenko
Copy link
Author

semenko commented Apr 5, 2023

Thanks for taking the time to look into this and for such a detailed reply!

Here's GRCh38-DAC-U2AF1.fna.gz


For reference, it's generated from:

Via:

bedtools maskfasta -fi GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -fo GRCh38-DAC.fna -bed DAC-Exclusion-List-ENCFF356LFX.bed
bedtools maskfasta -fi GRCh38-DAC.fna -fo GRCh38-DAC-U2AF1.fna -bed U2AF1-GRCh38-Exclusion.bed

@jamorrison
Copy link

I looked at this for BISCUIT specifically (problem 2 from @jmarshall's comment). I'll keep that specific discussion over on BISCUIT's issue, but you can see my comment here: huishenlab/biscuit#37 (comment).

jmarshall added a commit that referenced this issue Sep 1, 2023
Clarification for #1180. In future, we may consider returning this as a
collections.namedtuple, as there's no natural ordering to the elements.
@jmarshall
Copy link
Member

Documentation has been clarified. The remainder of the issue has I think been covered by the linked BISCUIT issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants