In this recipe we look at how to find the longest stretch of a specified set of characters in a ``BiologicalSequence`` object. In this particular case we'll work with ``NucleotideSequence`` objects, and look for stretches of purines and pyrimidines. In this context, I use the term *stretch* to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the start and end positions of the longest stretch for several character sets.

First, we'll configure the environment:

In [1]:
from __future__ import print_function
from skbio import NucleotideSequence

Next, let's load a single sequence from a FASTA file. In this case, it is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same.

In [2]:
nuc = NucleotideSequence.read('data/single_sequence1.fasta', seq_num=1)
nuc

<NucleotideSequence: GAGTTTGATC... (length: 1541)>

To find the longest stretch of purines, we use ``NucleotideSequence.find_features('purine_run')`` to find *all* purine stretches, and then Python's ``max`` function to find the longest stretch:

In [3]:
max(nuc.find_features('purine_run'), key=lambda e: len(e[2]))

(130, 141, 'AAAGAGGGGGA')

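The result includes the start position, the end position, and the subsequence of the longest stretch of purines. In this output the end position minus the start position equals the length of the stretch (141 - 130 = 11), so the end position is one past the last character of the stretch.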
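To make the shape of these tuples concrete, here is a minimal pure-Python sketch that produces the same kind of ``(start, end, subsequence)`` results using only the standard ``re`` module. It is not how scikit-bio implements ``find_features``; the helper name ``find_runs`` and the toy sequence are made up for illustration, and the sketch assumes purines are just ``A`` and ``G`` (no degenerate characters).

```python
import re

def find_runs(seq, chars):
    """Yield (start, end, subsequence) for each maximal run of `chars` in `seq`.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    pattern = re.compile('[' + re.escape(chars) + ']+')
    for match in pattern.finditer(seq):
        # match.end() is exclusive, so end - start == len(subsequence)
        yield match.start(), match.end(), match.group()

# Toy sequence (an assumption, not the 16S sequence used in this recipe)
toy = 'CTAAGGACTTGAGAGAGTC'

# Same pattern as above: generate all runs, then keep the longest one
longest = max(find_runs(toy, 'AG'), key=lambda e: len(e[2]))
print(longest)  # (10, 17, 'GAGAGAG')
```

As with the scikit-bio result, the tuple can be unpacked directly, e.g. ``start, end, subseq = longest``.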

To find the longest stretch of pyrimidines:

In [4]:
max(nuc.find_features('pyrimidine_run'), key=lambda e: len(e[2]))

(175, 181, 'CTCTTC')

To find the longest stretch of some other character or characters that don't have built-in support in scikit-bio's sequence objects, we can use ``BiologicalSequence.regex_iter``. For example, to find the longest stretch of ``N`` characters:

In [5]:
import re
n_run = re.compile(r'([N]+)')
max(nuc.regex_iter(n_run), key=lambda e: len(e[2]))

(572, 573, 'N')

Finally, let's try to find a stretch of a character that doesn't exist in the sequence:

In [6]:
x_run = re.compile(r'([X]+)')
list(nuc.regex_iter(x_run))

[]

After expanding the generator returned by ``NucleotideSequence.regex_iter``, we see that there are no stretches of the ``X`` character.
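One consequence worth keeping in mind: Python's ``max`` raises a ``ValueError`` when given an empty iterable, so the ``max(...)`` pattern used earlier in this recipe would fail for a character that never occurs. Below is a minimal sketch of one way to guard against that. It uses plain ``re`` on a toy string rather than scikit-bio, and ``longest_run`` is a hypothetical helper introduced only for this example.

```python
import re

def longest_run(seq, pattern):
    """Return (start, end, subsequence) of the longest match, or None.

    Hypothetical helper for illustration only; not part of scikit-bio.
    """
    runs = [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq)]
    if not runs:
        # No stretch found: return None instead of letting max() raise
        return None
    return max(runs, key=lambda e: len(e[2]))

# Toy sequence (an assumption, not the 16S sequence used in this recipe)
toy = 'ACGTNNACGTN'

print(longest_run(toy, re.compile(r'N+')))  # (4, 6, 'NN')
print(longest_run(toy, re.compile(r'X+')))  # None
```

The same idea applies to the generators returned by the sequence methods: expand them into a list first, check whether the list is empty, and only then call ``max``.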