Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation of AlignmentFile.fetch is unclear about the region argument #700

Open
blaiseli opened this issue Jul 10, 2018 · 1 comment

Comments

@blaiseli
Copy link

blaiseli commented Jul 10, 2018

The API documentation says:

fetch reads aligned in a region.

See AlignmentFile.parse_region() for more information on genomic regions. reference and end are also accepted for backward compatiblity as synonyms for contig and stop, respectively.

I don't see any documentation about such a parse_region method.

The term "region" links to the glossary which says:

A genomic region, stated relative to a reference sequence. A region consists of reference name (‘chr1’), start (10000), and end (20000). Start and end can be omitted for regions spanning a whole chromosome. If end is missing, the region will span from start to the end of the chromosome. Within pysam, coordinates are 0-based, half-open intervals, i.e., the position 10,000 is part of the interval, but 20,000 is not. An exception are samtools compatible region strings such as ‘chr1:10000:20000’, which are closed, i.e., both positions 10,000 and 20,000 are part of the interval.

This doesn't really tell how the region argument is to be provided.

Suppose I want to iterate over the alignments found within the first 100 positions of chromosome "I".

The above indications somehow hint me that region consists in three elements, so I naturally attempted using a (reference name, start, end) tuple: bamfile.fetch(region=("I", 0, 99)) but I got the following error:

pysam/libcalignmentfile.pyx in pysam.libcalignmentfile.AlignmentFile.fetch (pysam/libcalignmentfile.c:15021)()

pysam/libchtslib.pyx in pysam.libchtslib.HTSFile.parse_region (pysam/libchtslib.c:11479)()

~/lib/python3.6/re.py in split(pattern, string, maxsplit, flags)
    210     and the remainder of the string is returned as the final element
    211     of the list."""
--> 212     return _compile(pattern, flags).split(string, maxsplit)
    213 
    214 def findall(pattern, string, flags=0):

TypeError: expected string or bytes-like object

Same error if I use strings instead of ints for start and end.

I tried bamfile.fetch(region="I:0:99") and this resulted in ValueError: start out of range (-1), so I assume this is interpreted as a "samtools compatible region string" (and I should instead use "I:1:100"). But the documentation seems to consider this as "an exception". What is the standard, non-exceptional, way to specify the region, using zero-based coordinates, and with the possibility to omit start and end, as the documentation tells ? Apparently not by the means of a tuple.

Only after more careful inspection and experimentation did I realize that start and end were actually supposed to be passed as separate parameters, and region being actually only the chromosome name: bamfile.fetch(start=0, end=99, region="I")

I guess one source of confusion was that the glossary uses the terms "start" and "end" as belonging to the specification of a "region", while the method documentation refers to the end parameter as something just here for backwards compatibility with an older way of specifying regions, using a reference name. If I understand correctly, the up-to-date and standard way of doing is to use the contig, start and stop parameters, and that region is actually only intended to be used with a special "samtools compatible region string".

Arguments would benefit from more explicit documentation. Only until_eof and multiple_iterators are currently explicitly documented. Another improvement could be to give usage examples in the API documentation, not just in the introduction: I'm quite sure I had already learned how to this by following the introduction a few years ago. Today, after some time without pysam practice, I directly went to the API documentation to get a refresher about the specific method I wanted to use, and that's how I got into all this confusion.

@AndreasHeger
Copy link
Contributor

Thanks, I have fixed the broken link to the doc for parse_region - might take a moment for RTD to show.
Hopefully this resolves some ambiguity.
Definitely more to be done for the documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants