# Sequence Reader

The `SeqReader` class reads a sequence from a FASTA-formatted file.

More generally, a `SeqReader` can be thought of as a _lazily-evaluated_ sequence.  Only a small amount of the full sequence is in working memory at any given time.  Operations on a `SeqReader` typically return generators that create the modified sequence on an as-needed basis.
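
As a rough illustration of lazy evaluation (not `SeqReader`'s actual implementation), the sketch below chains two plain Python generators. The helper names `windows` and `to_rna` are hypothetical, and an in-memory string stands in for the file.

```python
from itertools import islice

def windows(text, size):
    """Yield successive fixed-size windows of `text`, one at a time."""
    for start in range(0, len(text), size):
        yield text[start:start + size]

def to_rna(chunks):
    """Lazily replace T with U in each chunk as it is requested."""
    for chunk in chunks:
        yield chunk.replace("T", "U")

# Building the pipeline does no work yet.
pipeline = to_rna(windows("ACGT" * 1000, 8))

# Only the two windows actually requested are produced.
first_two = list(islice(pipeline, 2))
print(first_two)
```

The rest of the sequence is never touched unless something keeps iterating over `pipeline`.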

In [1]:
%load_ext autoreload
%autoreload 2

import set_dir

`SeqReader` can be imported straight from `Sequence`.  In addition, the functions `read_first` and `read_all` are used to create a `SeqReader` from a file.

In [2]:
from Sequence import Seq, SeqReader, read_first, read_all

Here, we read the first sequence named `sample_1` from the file `data/samples.fasta` and print it as DNA and as RNA, followed by its complement in both forms.

In [3]:
reader = read_first("data/samples.fasta", name="sample_1")
print(reader)

for line in reader.as_DNA():
    print(line)

for line in reader.as_RNA():
    print(line)

print("")

for line in reader.as_DNA(complement=True):
    print(line)

for line in reader.as_RNA(complement=True):
    print(line)

SeqReader(sample_1)
AAACCC
AAACCC

GGGTTT
GGGUUU
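
The `complement=True` output above (`GGGTTT` for `AAACCC`) is consistent with the reverse complement, that is, the opposite strand read 5' to 3'. A minimal sketch of that conversion on plain strings follows; `reverse_complement` and `as_rna` are hypothetical helpers, not part of the `Sequence` package.

```python
# Hypothetical helpers for illustration; they operate on plain strings.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse so the result reads 5' to 3'."""
    return dna.translate(DNA_COMPLEMENT)[::-1]

def as_rna(dna):
    """Transcribe DNA letters to RNA letters (T becomes U)."""
    return dna.replace("T", "U")

print(reverse_complement("AAACCC"))
print(as_rna(reverse_complement("AAACCC")))
```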


And here, we translate `sample_1` into amino acids, in all six reading frames.

In [4]:
print(list( reader.translate( 1) ))
print(list( reader.translate( 2) ))
print(list( reader.translate( 3) ))

print(list( reader.translate(-1) ))
print(list( reader.translate(-2) ))
print(list( reader.translate(-3) ))

['K', 'P']
['N']
['T']
['G', 'F']
['G']
['V']
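
For readers unfamiliar with reading frames, here is a small sketch of six-frame translation on a plain string. It assumes the standard genetic code and that negative frames are read from the reverse complement; the function `translate` below is a stand-in and is not `SeqReader.translate`.

```python
# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna, frame):
    """Lazily yield amino acids for frame 1, 2, 3 (forward) or -1, -2, -3 (reverse)."""
    strand = dna if frame > 0 else dna.translate(COMPLEMENT)[::-1]
    offset = abs(frame) - 1
    # Step through complete codons only; a trailing partial codon is dropped.
    for i in range(offset, len(strand) - 2, 3):
        yield CODON_TABLE[strand[i:i + 3]]

for frame in (1, 2, 3, -1, -2, -3):
    print(frame, list(translate("AAACCC", frame)))
```

On `AAACCC` this sketch gives the same amino acids as the cell output above.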


Here, we read all sequences with names that don't end in `"2"`, and print out their complements in groups of 10:

In [5]:
readers = read_all(
    "data/samples.fasta",
    id_filter=lambda x, _: x[-1] != "2"
)

for reader in readers:
    print(f"Reader {reader} for sequence of length {len(reader)}")
    for line in reader.as_DNA(window_size=10, complement=True):
        print(line)
    print("")

Reader SeqReader(sample_1) for sequence of length 0

Reader SeqReader(sample_3) for sequence of length 0

Reader SeqReader(sample_4) for sequence of length 80

Reader SeqReader(sample_5) for sequence of length 0



When a `SeqReader` is created, its sequence is stored in a tempfile.  The `SeqReader` can then produce as many windows into the file as desired.  Here, I zip together two generators for the same sequence, with different window lengths, to show that the generators don't interfere with each other's positions in the file.

(For this demo, I use a special numeric sequence to better illustrate that both sequences are at the correct positions.)

In [6]:
from Sequence import read_first

r1 = read_first( "data/sample_numeric.fasta" )

print(r1.filename)

for line1, line2 in zip(r1.as_DNA(window_size=5), r1.as_DNA(window_size=7)):
    print(f"{line1} -> {line2}")

C:\Users\Vince\AppData\Local\Temp\tmpwvcah_3g
01234 -> 0123456
56789 -> 789
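
One way to get this independence, sketched below under the assumption that each generator opens its own handle on the shared file (I am not claiming this is exactly how `SeqReader` does it internally). The `windows` helper is hypothetical.

```python
import os
import tempfile

def windows(path, size):
    """Yield fixed-size windows, using a file handle private to this generator."""
    with open(path, "r") as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                return
            yield chunk

# Write a small stand-in sequence to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0123456789")
    path = tmp.name

five, seven = windows(path, 5), windows(path, 7)
try:
    # Each generator tracks its own position, so zipping them is safe.
    pairs = list(zip(five, seven))
    for left, right in pairs:
        print(f"{left} -> {right}")
finally:
    # Close the generators so their handles are released before cleanup.
    five.close()
    seven.close()
    os.remove(path)
```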


This means computations can be run on the sequence in parallel.  For example, the `find_orfs` function scans through each of the six reading frames in separate threads, and combines all the results at the end.

In [7]:
r1 = read_first( "data/samples.fasta", name="sample_2" )

r1.find_orfs(verbose=True)
print("")

  - Read frame  1 in 0.005005598068237305 seconds.
  - Read frame -3 in 0.009793519973754883 seconds.
  - Read frame -2 in 0.012395858764648438 seconds.
  - Read frame  2 in 0.005517482757568359 seconds.
  - Read frame -1 in 0.010523319244384766 seconds.
  - Read frame  3 in 0.0030670166015625 seconds.

Total computing time: ~0.04630279541015625 seconds.
Actual elapsed time:   0.2812812328338623 seconds.
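
The general pattern (one task per reading frame, results merged at the end) can be sketched with `concurrent.futures`. This is a simplified stand-in on a plain string, not the package's `find_orfs`: the helper names are hypothetical, an ORF is taken to be `ATG` through the next in-frame stop codon, and reverse-frame offsets are relative to the reverse-complement strand.

```python
from concurrent.futures import ThreadPoolExecutor

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
FRAMES = (1, 2, 3, -1, -2, -3)

def orfs_in_frame(seq, frame):
    """Return (start, end) offsets of ATG...stop runs in one reading frame."""
    strand = seq if frame > 0 else seq.translate(COMPLEMENT)[::-1]
    found, start = [], None
    for i in range(abs(frame) - 1, len(strand) - 2, 3):
        codon = strand[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # open a new ORF
        elif start is not None and codon in STOPS:
            found.append((start, i + 3))   # close it at the stop codon
            start = None
    return found

def find_orfs_threaded(seq):
    """Scan all six frames in separate threads and combine the results."""
    with ThreadPoolExecutor(max_workers=len(FRAMES)) as pool:
        results = pool.map(lambda frame: orfs_in_frame(seq, frame), FRAMES)
        return dict(zip(FRAMES, results))

orfs = find_orfs_threaded("ATGAAATAAC")
print(orfs)
```

In standard CPython, threads share the global interpreter lock, so pure-Python scanning like this may gain little from threading. The elapsed time above exceeding the summed per-frame time would be consistent with that kind of overhead.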



A `SeqReader` can be converted to a plain `Seq` object by passing it to the `Seq()` constructor.  Be careful about doing this with long sequences -- sequences from a file are lazily evaluated for a reason.

In [8]:
r1 = read_first( "data/samples.fasta", name="sample_3" )
r2 = read_first( "data/samples.fasta", name="sample_5" )

# Convert the lazy readers to in-memory Seq objects
s1 = Seq(r1)
s2 = Seq(r2)

print(s1)
print(s2)
print("")

print(s1 & s2)
print(s1 ^ s2)
print(s1 | s2)
print("")

ACGTACGTACGTACGTACGTACGT
AAAACCCCGGGGTTTT----RRRR

A----C----G----T----A-G-
A----C----G----TACGT----
AMRWMCSYRSGKWYKTACGTRVRD
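
The `&` and `|` results above match set intersection and union over IUPAC nucleotide codes, with `-` standing for the empty set. The sketch below reproduces those two lines on plain strings; `combine` is a hypothetical helper, and `^` is left out because its exact semantics are not spelled out here.

```python
from operator import and_, or_

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "-": "", "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
TO_SET = {code: frozenset(bases) for code, bases in IUPAC.items()}
TO_CODE = {bases: code for code, bases in TO_SET.items()}

def combine(left, right, set_op):
    """Apply a set operation position by position and re-encode as IUPAC."""
    return "".join(
        TO_CODE[set_op(TO_SET[a], TO_SET[b])] for a, b in zip(left, right)
    )

s1 = "ACGTACGTACGTACGTACGTACGT"
s2 = "AAAACCCCGGGGTTTT----RRRR"
print(combine(s1, s2, and_))  # intersection, like s1 & s2
print(combine(s1, s2, or_))   # union, like s1 | s2
```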



A work in progress: `SeqReader` sequences can be combined with (some of) the same operations that work on plain `Seq`s.  These produce a new `SeqCombiner` object, which lazily evaluates the combination of the two sequences as needed.

A `Seq` and a `SeqReader` can be combined as well, in either order, as shown in the second and third examples below.

In [9]:
combined = r1 & r2
print(combined)
print(type(combined))
print("")

for line in combined:
    print(line)

for line in (r1 ^ s2):
    print(line)

for line in (s1 | r2):
    print(line)

<Sequence._reader.SeqCombiner object at 0x00000266DF6E8A60>
<class 'Sequence._reader.SeqCombiner'>

A----C----G----T----A-G-
A----C----G----TACGT----
AMRWMCSYRSGKWYKTACGTRVRD
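
The idea of a lazily evaluated combination can be sketched as a generator that pulls one window from each source only when the next combined window is requested. This is an assumption about the general approach, not the `SeqCombiner` code; `lazy_combine` and `match_or_gap` are hypothetical, and both streams are assumed to use the same window size.

```python
def lazy_combine(left_chunks, right_chunks, char_op):
    """Yield combined chunks on demand from two chunk iterators."""
    for left, right in zip(left_chunks, right_chunks):
        yield "".join(char_op(a, b) for a, b in zip(left, right))

def match_or_gap(a, b):
    """Stand-in operation: keep a base where both inputs agree, else a gap."""
    return a if a == b else "-"

left = iter(["ACGT", "ACGT"])
right = iter(["ACCT", "AGGT"])
combined = lazy_combine(left, right, match_or_gap)

first = next(combined)   # consumes only the first window of each input
rest = list(combined)    # consumes the remainder
print(first, rest)
```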


## Efficiency: Finding ORFs in Chromosome 21

A large part of my motivation for this project was to find ORFs in Human Chromosome 21 in the reverse direction as quickly as in the forward direction.  I show my results below.

In [None]:
reader = read_first( "data/chr21.fasta" )

orfs = reader.find_orfs(verbose=True)
print(f"{len(orfs)} ORFs found.")

I re-ran my old method sequentially, just to ensure I had accurate times.

Below is a comparison:

| Frame | Old Method (s) | New Method (s) |
| :-: | :-: | :-: |
|  1 | 5.759676694869995 | 63.35096454620361 |
|  2 | 5.744913816452026 | 62.14136457443237 |
|  3 | 5.478733062744141 | 62.42976236343384 |
| -1 | 61.48100733757019 | 149.64347290992737 |
| -2 | 59.72647547721863 | 150.68320631980896 |
| -3 | 60.6691198348999  | 152.72155261039734 |

I have mixed feelings about these results.  On the one hand, reading in the reverse direction now takes only about 2.4x as long as reading in the forward direction, compared with roughly 10x for the old method, so I came much closer to my goal of equalizing the two.
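
The reverse-to-forward ratios can be checked directly from the table above. The snippet copies the timings verbatim and compares the mean of the three reverse frames with the mean of the three forward frames for each method.

```python
from statistics import mean

# frame: (old_seconds, new_seconds), copied from the table above
timings = {
    1: (5.759676694869995, 63.35096454620361),
    2: (5.744913816452026, 62.14136457443237),
    3: (5.478733062744141, 62.42976236343384),
    -1: (61.48100733757019, 149.64347290992737),
    -2: (59.72647547721863, 150.68320631980896),
    -3: (60.6691198348999, 152.72155261039734),
}

def reverse_to_forward_ratio(column):
    """Mean reverse-frame time divided by mean forward-frame time."""
    forward = mean(t[column] for frame, t in timings.items() if frame > 0)
    reverse = mean(t[column] for frame, t in timings.items() if frame < 0)
    return reverse / forward

old_ratio = reverse_to_forward_ratio(0)
new_ratio = reverse_to_forward_ratio(1)
print(f"old method: reverse is {old_ratio:.1f}x forward")
print(f"new method: reverse is {new_ratio:.1f}x forward")
```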

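However, my new method is much slower than the previous one: roughly 11x slower in the forward frames and 2.5x slower in the reverse frames.  I suspect I've added too much overhead in creating these classes.  For example, more than half of the forward-frame time is taken up by converting to amino acids alone: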

In [27]:
from time import time

s = time()
for _ in reader.translate(1):  # exhaust the generator; results are discarded
    pass
t = time()
print(f"done: {t-s}")

done: 35.55509352684021


With the old method, translating reading frame `1` of Chromosome 21 took ~3 seconds.

Further optimization should bring these numbers down.  File reading no longer appears to be the bottleneck; the time now goes to other parts of the program, such as translation.
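
One way to find where the time now goes is the standard-library profiler. The sketch below wraps an arbitrary call in `cProfile` and returns a text report sorted by cumulative time. `profile_call` is a hypothetical helper, and `sum` is only a stand-in workload; in this notebook one would instead pass a function that exhausts `reader.translate(1)`.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, report_text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)  # show the `top` costliest entries
    return result, buffer.getvalue()

# Stand-in workload for demonstration purposes only.
total, report = profile_call(sum, range(100_000))
print(report)
```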

Below, I run the `1` and `-1` frames in isolation, just to confirm the times:

In [18]:
reader = read_first( "data/chr21.fasta" )

_, orfs, t = reader.find_orfs(frame=1, verbose=True, verbose_output=True)
print(f"Num orfs: {len(orfs)}")
print(f"Time: {t}")

Num orfs: 19609
Time: 46.730241537094116


In [19]:
reader = read_first( "data/chr21.fasta" )

_, orfs, t = reader.find_orfs(frame=-1, verbose=True, verbose_output=True)
print(f"Num orfs: {len(orfs)}")
print(f"Time: {t}")

Num orfs: 20428
Time: 113.19659638404846
