Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in SAM/BAM files. It is a lightweight wrapper of the htslib C-API.
This page provides a quick introduction in using pysam followed by the API. See usage
for more detailed usage instructions.
To use the module to read a file in BAM format, create a ~pysam.AlignmentFile
object:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
Once a file is opened you can iterate over all of the read mapping to a specified region using ~pysam.AlignmentFile.fetch
. Each iteration returns a ~pysam.AlignedSegment
object which represents a single read along with its fields and optional tags:
for read in samfile.fetch('chr1', 100, 120):
print(read)
samfile.close()
To give:
EAS56_57:6:190:289:82 0 99 <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; 69 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA 0 192 1
EAS56_57:6:190:289:82 0 99 <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; 137 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC 73 64 1
EAS51_64:3:190:727:308 0 102 <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844 99 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG 99 18 1
...
You can also write to a ~pysam.AlignmentFile
:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
pairedreads = pysam.AlignmentFile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
if read.is_paired:
pairedreads.write(read)
pairedreads.close()
samfile.close()
An alternative way of accessing the data in a SAM file is by iterating over each base of a specified region using the ~pysam.AlignmentFile.pileup
method. Each iteration returns a ~pysam.PileupColumn
which represents all the reads in the SAM file that map to a single base in the reference sequence. The list of reads are represented as ~pysam.PileupRead
objects in the PileupColumn.pileups <pysam.PileupColumn.pileups>
property:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup("chr1", 100, 120):
print("\ncoverage at base %s = %s" % (pileupcolumn.pos, pileupcolumn.n))
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
# query position is None if is_del or is_refskip is set.
print('\tbase in read %s = %s' %
(pileupread.alignment.query_name,
pileupread.alignment.query_sequence[pileupread.query_position]))
samfile.close()
The above code outputs:
coverage at base 99 = 1
base in read EAS56_57:6:190:289:82 = A
coverage at base 100 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 101 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 102 = 2
base in read EAS56_57:6:190:289:82 = G
base in read EAS51_64:3:190:727:308 = G
...
Commands available in samtools are available as simple function calls. For example:
pysam.sort("-o", "output.bam", "ex1.bam")
corresponds to the command line:
samtools sort -o output.bam ex1.bam
Analogous to ~pysam.AlignmentFile
, a ~pysam.TabixFile
allows fast random access to compressed and tabix indexed tab-separated file formats with genomic data:
import pysam
tabixfile = pysam.TabixFile("example.gtf.gz")
for gtf in tabixfile.fetch("chr1", 1000, 2000):
print(gtf.contig, gtf.start, gtf.end, gtf.gene_id)
~pysam.TabixFile
implements lazy parsing in order to iterate over large tables efficiently.
More detailed usage instructions is at usage
.
Note
Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based coordinates.
Note
- The above examples can be run in the
tests
directory of the installation directory. Type 'make' before running them.
https://github.com/pysam-developers/pysam
The pysam code repository, containing source code and download instructions
http://pysam.readthedocs.org/en/latest/
The pysam website containing documentation
Objects of type ~pysam.AlignmentFile
allow working with BAM/SAM formatted files.
pysam.AlignmentFile
pysam.AlignmentHeader
An ~pysam.AlignedSegment
represents an aligned segment within a SAM/BAM file.
pysam.AlignedSegment
pysam.PileupColumn
pysam.PileupRead
pysam.IndexedReads
~pysam.TabixFile
opens tabular files that have been indexed with tabix.
pysam.TabixFile
To iterate over tabix files, use ~pysam.tabix_iterator
:
pysam.tabix_iterator
pysam.tabix_compress
pysam.tabix_index
pysam.asTuple
pysam.asVCF
pysam.asBed
pysam.asGTF
pysam.FastaFile
pysam.FastxFile
pysam.FastqProxy
pysam.VariantFile
pysam.VariantHeader
pysam.VariantRecord
pysam.VariantHeaderRecord
HTSFile is the base class for pysam.AlignmentFile
and pysam.VariantFile
.
pysam.HTSFile