# Data streaming with Python


Data science libraries like *numpy*, *pandas*, *matplotlib* and *seaborn* provide superpowers to Python users.

**But**: like everything in programming, they are not the solution to every kind of problem.

If you can hold all of your data in your machine's memory, these libraries are awesome, but what about *really* huge datasets?

Let's explore this issue with the help of some Python internals:

In [None]:
import sys

In [None]:
help(sys)

In [None]:
mylist = [0,1,2,3,4]

In [None]:
for n in mylist:
    print(n)

In [None]:
myrange = range(5)

In [None]:
for n in myrange:
    print(n)

In [None]:
sys.getsizeof(mylist) # get the size of mylist in bytes

In [None]:
sys.getsizeof(myrange) # same thing for myrange

**scaling up**

In [None]:
mylargerlist = [0,1,2,3,4,5,6,7,8,9]

In [None]:
mylargerrange = range(10)

In [None]:
sys.getsizeof(mylargerlist)

In [None]:
sys.getsizeof(mylargerrange)

**going even larger**

In [None]:
myhugerange = range(10*10**6)

In [None]:
myhugelist = list(myhugerange) # not going to type this literally

In [None]:
sys.getsizeof(myhugerange), sys.getsizeof(myhugelist)
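
Keep in mind that `sys.getsizeof` reports only the size of the container object itself, not of the items it refers to, so the number for a list understates its real footprint. A rough sketch of the difference, assuming CPython; `n` and the variable names are chosen here for illustration only:

```python
import sys

n = 1000
small_range = range(n)
small_list = list(small_range)

# A range keeps only start, stop and step, so its size does not depend on n.
range_size = sys.getsizeof(small_range)
same_size = sys.getsizeof(range(n * 1000)) == range_size

# getsizeof(list) counts the list's internal array of references only.
# Adding the sizes of the int objects gives a rough (upper) estimate,
# since CPython shares some small int objects between containers.
shallow_size = sys.getsizeof(small_list)
deep_size = shallow_size + sum(sys.getsizeof(item) for item in small_list)

print(range_size, same_size, shallow_size, deep_size)
```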

**a range allows you to do many of the things you can do with the corresponding list without ever storing all the elements in memory**

In [None]:
len(myhugerange)

In [None]:
len(myhugelist)

In [None]:
sum(myhugerange)

In [None]:
sum(myhugelist)

**some things work even better with a range than with a list**

In [None]:
%timeit 9999999 in myhugerange

In [None]:
%timeit 9999999 in myhugelist
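
The difference comes from how the test is carried out: a list has to be scanned item by item, whereas a range can decide membership of an integer arithmetically from its start, stop and step. A simplified sketch of that idea for positive steps (the helper name `in_range` is made up here, and the built-in implementation covers more cases):

```python
def in_range(value, start, stop, step=1):
    """Simplified membership test for a range with a positive step."""
    if step <= 0:
        raise ValueError("this sketch only handles positive steps")
    # value must lie inside the bounds and sit exactly on the step grid
    return start <= value < stop and (value - start) % step == 0

# both calls give the same answer as the corresponding `in` test on a range
print(in_range(9999999, 0, 10 * 10**6))
print(in_range(7, 0, 100, 2))
```

No matter how long the range is, this takes a fixed number of operations, while the list lookup gets slower the further back the item sits.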

## Iterators and generators

Python ranges are one example of objects that yield potentially long sequences on demand.

Whenever you do not need all of your data at once to work with it, this behavior can save a lot of memory.

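Objects that hand out their items one at a time on request are called *iterators* in Python, and *generators* are a common kind of iterator. Both are very widely used. (A range is, strictly speaking, an iterable rather than an iterator, but `iter()` turns it into one.)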

Examples:

In [None]:
# a gigantic range; better not try to turn it into a list!
r = range(10**30)

In [None]:
it = iter(r)

In [None]:
next(it)

In [None]:
import random

# a so-called generator expression
random_numbers_generator = (random.randrange(100) for n in r)

In [None]:
import itertools

# an endless sequence of recycled characters
going_in_circles = itertools.cycle('Roundabout')

In [None]:
letters_and_numbers = zip(going_in_circles, random_numbers_generator)

In [None]:
# generator functions

def generate_fib():
    a, b = 0, 1
    yield a
    yield b
    while True:
        a, b = b, a+b
        yield b

fib_generator = generate_fib()
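
Calling a generator function does not run its body; the code only advances when an item is requested and pauses again at each `yield`. A small sketch to make this visible (the names `count_up_to` and `log` are invented for this illustration):

```python
log = []

def count_up_to(limit):
    log.append('started')          # runs only when the first item is requested
    n = 1
    while n <= limit:
        yield n                    # execution pauses here until the next request
        n += 1
    log.append('finished')

gen = count_up_to(2)
log_before = list(log)             # still empty: nothing has run yet
first = next(gen)
log_after_first = list(log)
remaining = list(gen)              # requests all remaining items
print(log_before, first, log_after_first, remaining, log)
```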

In [None]:
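# get a list of all Fibonacci numbers <= fib_max

def fib_seq(fib_max):
    seq = []
    # start a fresh generator on every call; the shared fib_generator
    # would continue from wherever an earlier call stopped
    for fib in generate_fib():
        if fib > fib_max:
            return seq
        seq.append(fib)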

### The iterator protocol

Iterable objects in Python define a special `__iter__` method that returns an *iterator* object.

*Iterator* objects, in turn, define:

- a `__next__` method that returns the next element of the *iterator* object
- their own `__iter__` method that just returns the *iterator* object itself

*Generators* are special *iterators* that Python creates automatically from generator expressions and generator functions.

You can iterate over any iterable by hand by calling `it = iter(iterable)` once and then `next(it)` repeatedly, or you can let a `for` loop handle this for you.
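
To see the protocol at work, here is a minimal hand-written iterator. The class `Countdown` is a made-up example, not something from the standard library:

```python
class Countdown:
    """Hypothetical iterator counting down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # an iterator returns itself from __iter__
        return self

    def __next__(self):
        if self.current <= 0:
            # tells the caller that there are no more items
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


# stepping through by hand
it = iter(Countdown(3))
first = next(it)
second = next(it)

# a for loop (or list()) calls next() for us and stops at StopIteration
rest = [n for n in it]
via_loop = list(Countdown(3))
print(first, second, rest, via_loop)
```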

In [None]:
def get_n_items_from_iterator(iterator, n):
    """A generic function to return up to n items from a possibly endless iterator as a list"""
    list_of_items = []
    if n == 0:
        return list_of_items
    for c, item in enumerate(iterator, 1):
        list_of_items.append(item)
        if c == n:
            break
    # also reached when the iterator runs out before n items were seen
    return list_of_items

In [None]:
get_n_items_from_iterator(letters_and_numbers, 12)
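
The standard library already offers this kind of slicing for iterators through `itertools.islice`. A short sketch of an equivalent helper (the name `take` is our own choice):

```python
import itertools

def take(iterable, n):
    """Return up to n items from iterable as a list."""
    return list(itertools.islice(iterable, n))

endless = itertools.count(1)     # 1, 2, 3, ... without end
first_five = take(endless, 5)
short = take(iter('abc'), 10)    # fewer than n items available
print(first_five, short)
```

Like the hand-written function, this consumes the items it returns, so a second call on the same iterator continues where the first one stopped.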

## Reading and writing files with Python

### Writing

In [None]:
o = open('/var/tmp/fileio/test.txt', 'w')

In [None]:
o

In [None]:
o.write('Hello world\nHow are you today?\n')

In [None]:
o.close()
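
If an error occurs between `open` and `close`, the file may stay open and buffered data may never reach the disk. A `with` block closes the file automatically. A minimal sketch, writing to a temporary directory (an assumption made so the example runs anywhere) instead of the path used above:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()                 # throwaway directory for this sketch
path = os.path.join(tmp_dir, 'test.txt')

with open(path, 'w') as o:
    # write returns the number of characters written
    chars_written = o.write('Hello world\nHow are you today?\n')

# the file is closed as soon as the with block ends
print(o.closed, chars_written)
```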

### Reading

In [None]:
i = open('/var/tmp/fileio/test.txt')

In [None]:
i

In [None]:
i.read()

In [None]:
i.close()

In [None]:
with open('/var/tmp/fileio/test.txt') as i:
    lines = i.readlines()

In [None]:
lines

### What if the file I'm reading is really large?

The `read` and `readlines` methods of file objects load the whole file into memory (as pandas does by default, too), so if your file is a couple of gigabytes in size, you may freeze your computer!

**Luckily file objects are also iterators!**

In [None]:
with open('/var/tmp/fileio/test.txt') as i:
    first_line = next(i)

In [None]:
first_line

In [None]:
with open('/var/tmp/fileio/test.txt') as i:
    for line in i:
        print(line, end='')  # each line already ends with a newline
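
Iterating line by line keeps only one line in memory at a time. Files without useful line breaks can be read in fixed-size chunks instead. A sketch of that approach, in which the helper name `read_chunks`, the chunk size and the demo file are arbitrary choices:

```python
import os
import tempfile

def read_chunks(path, chunk_size=1024):
    """Yield successive chunks of at most chunk_size characters from a text file."""
    with open(path) as i:
        while True:
            chunk = i.read(chunk_size)
            if not chunk:       # an empty string means end of file
                return
            yield chunk

# demo on a small throwaway file
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as o:
    o.write('ACGT' * 10)

chunks = list(read_chunks(path, chunk_size=16))
print([len(c) for c in chunks])
```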

### A first example with biological sequence data

In [None]:
# extract sequence identifiers from a multi-FASTA file
fasta_file = '/var/tmp/fileio/Mac_2020.fa'

In [None]:
# peek into the file
with open(fasta_file) as i:
    print(i.read(200))

In [None]:
# let's extract just the sequence names (from lines starting with ">")
seq_names = []
with open(fasta_file) as i:
    for line in i:
        if line[0] == '>':
            seq_names.append(line[1:].strip())

In [None]:
len(seq_names)

In [None]:
# rewrite just sequences of interest to a new file
seqs_wanted = ['chr_015', 'chr_020', 'chr_101']
write_line = False
with open(fasta_file) as i, open('wanted.fa', 'w') as o:
    for line in i:
        if line[0] == '>':
            # header line: decide whether to keep this record
            current_seq = line[1:].strip()
            write_line = current_seq in seqs_wanted
        if write_line:
            o.write(line)
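
The same line-by-line logic can be wrapped in a generator function that yields one complete record at a time, so later code can loop over sequences instead of raw lines. A sketch under the assumption of a plain FASTA layout (a header line starting with ">", followed by sequence lines); the function name `read_fasta` and the example data are made up:

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) tuples from an open FASTA file, one record at a time."""
    name, seq_lines = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(seq_lines)
            name, seq_lines = line[1:], []
        elif line and name is not None:
            seq_lines.append(line)
    if name is not None:
        # do not forget the last record
        yield name, ''.join(seq_lines)

# made-up example data standing in for a real file
example = io.StringIO('>chr_001\nACGT\nAC\n>chr_002\nGGTT\n')
records = list(read_fasta(example))
print(records)
```

Only one record is held in memory at a time, so this works for files far larger than the available memory as long as each individual sequence fits.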
