# Understanding Iterators and (mostly) Generators
Seetha Krishnan
<br>
ASPP - Asia Pacific 2018

## Iterators
Iterators are everywhere.
An iterable is any object you can loop over, say using a `for` loop. The iterator is the object that hands out its values one at a time.

In this extremely simple example, __range(4)__ is the iterable object, which at each iteration provides a different value to the variable __i__.

In [16]:
for i in range(4):
    print(i)

0
1
2
3
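
Behind the scenes, a `for` loop asks the iterable for an iterator with `iter()` and then calls `next()` on it until `StopIteration` is raised. The following is a minimal sketch of that protocol. The variable names are illustrative and not part of the original example.

```python
# Roughly what `for i in range(4)` does behind the scenes (illustrative sketch)
numbers = range(4)        # an iterable: it can hand out iterators
iterator = iter(numbers)  # an iterator: it remembers the current position

collected = []
while True:
    try:
        value = next(iterator)  # ask for the next value
    except StopIteration:       # raised when nothing is left
        break
    collected.append(value)

print(collected)
```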


You can iterate over strings, lists, files, dictionaries, etc.

In [79]:
filename = 'sometextIwrote.txt'
# When you iterate over a text file, you get lines
with open(filename, 'r') as f:  # the file is closed automatically
    for linenumber, line in enumerate(f):
        print(f'{linenumber} > {line}')

0 > This is one of those text files

1 > That you may think contains the secret to life and happiness

2 > In reality, this contains nothing

3 > Yes, I am certain this is the last line of the text

4 > Ok I lied, this is the last line. There is nothing more



Most of the time you can get away with iterating over objects, storing them and analysing them. But things can quickly get out of hand if you have a large dataset or multiple loops.

__For example__:
- You have a large CSV file (15 GB). If I ask you to find the mean number of characters per line, how would you do it?

## Generators
Generators are a simple yet elegant type of iterator.

__To create generators:__ 
- Define a function
- Instead of the return statement, use the __yield__ keyword.

In [73]:
def charcount(filename):
    """Generator function that reads lines and yields each line with its character count"""
    with open(filename) as fin:
        for line in fin:
            # len(line) includes the trailing newline character
            yield line, len(line)

In [74]:
c = charcount(filename='sometextIwrote.txt')
print(c)

<generator object charcount at 0x113777a98>


A generator does not hold its results in memory.
<br>It "yields" one result at a time and hasn't computed anything until you ask for a value by calling `next`.

In [75]:
c1 = next(c)

In [76]:
print(c1)

('This is one of those text files\n', 32)
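
To see this laziness directly, here is a small sketch with a made-up generator, `lazy_squares`, which is not part of the tutorial code. It records in a list when its body actually runs, so you can check that creating the generator does no work.

```python
log = []  # records when the generator body actually runs

def lazy_squares(n):
    """Hypothetical generator used only for this illustration."""
    log.append('started')
    for i in range(n):
        log.append(f'computing {i}')
        yield i * i

squares = lazy_squares(3)        # nothing runs yet
log_after_creation = list(log)   # snapshot of the log: still empty

first = next(squares)            # runs the body up to the first yield
log_after_first_next = list(log)

print(log_after_creation)
print(first)
print(log_after_first_next)
```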


Instead of calling `next` every time, you will typically loop over the generator like any other __iterator object__.

In [67]:
c = charcount(filename='sometextIwrote.txt')
for l in c:
    print(f'> {l[0][:-1]} char count: {l[1]}')

> This is one of those text files char count: 32
> That you may think contains the secret to life and happiness char count: 61
> In reality, this contains nothing char count: 34
> Yes, I am certain this is the last line of the text char count: 52
> Ok I lied, this is the last line. There is nothing more char count: 56
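
The large CSV question from earlier can be approached the same way: keep only a running total and a count, never the whole file. This is a sketch under assumptions. `line_lengths` and `mean_line_length` are hypothetical helper names, and `io.StringIO` stands in for a real (possibly huge) file object.

```python
import io

def line_lengths(lines):
    """Yield the number of characters in each line (newline included)."""
    for line in lines:
        yield len(line)

def mean_line_length(lines):
    """Average characters per line using only a running total and a count."""
    total = 0
    count = 0
    for length in line_lengths(lines):
        total += length
        count += 1
    return total / count if count else 0.0

# io.StringIO stands in for a file opened with open(); both yield lines
fake_file = io.StringIO('ab\ncdef\ngh\n')
mean_length = mean_line_length(fake_file)
print(mean_length)
```

Only one line is held in memory at any time, so the same function should work whether the file has three lines or three hundred million.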


### Task 1
- Define a second generator that yields the number of words in each line.
- Use `charcount(filename)` as the input to this generator.
- The output print statement should include the line number, character count and word count.

Tip: When writing a generator, I like to use print statements in place of `yield` at first.
Once I am satisfied that the printed values are correct, I convert the print to `yield`.

### Task 1 - Solution

In [12]:
def countwords(linearray):
    for lines, ccount in linearray:
#         print('Get number of words')
        yield ccount, len(lines.split())
#         print('After yield2')

In [52]:
# With the print calls in countwords uncommented, in what order would the statements print?
for i, l in enumerate(countwords(charcount(filename = 'sometextIwrote.txt'))):
    print(f'> line number:{i} char count: {l[0]}, number of words: {l[1]}')

> line number:0 char count: 32, number of words: 7
> line number:1 char count: 61, number of words: 11
> line number:2 char count: 34, number of words: 5
> line number:3 char count: 52, number of words: 12
> line number:4 char count: 56, number of words: 12
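
One way to explore the question in the comment above is to trace two chained generators. In this sketch `source` and `doubler` are made-up stand-ins for `charcount` and `countwords`, and a list collects the events so the order can be inspected.

```python
events = []  # execution trace

def source(items):
    """Hypothetical upstream generator."""
    for item in items:
        events.append(f'source yields {item}')
        yield item

def doubler(stream):
    """Hypothetical downstream generator that consumes `source`."""
    for item in stream:
        events.append(f'doubler yields {item * 2}')
        yield item * 2

for value in doubler(source([1, 2])):
    events.append(f'loop got {value}')

for event in events:
    print(event)
```

Each item travels through the whole chain before the next item is even read. The stages interleave rather than run one after another.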


## Generators are great for large datasets that you want to process one line at a time
- Unlike batch-style programming, which can create a large memory footprint, generators iterate over data __lazily__, without loading the entire data source into memory at once.
- __yield__ is not __return__!!
- When functions `return`, they are done for good. Generators stay alive until their values are exhausted.
- Functions always start from the first line; generators resume where you left off: at __yield__.
- __Limitation__ - with a generator you can only iterate forward. You can't peek ahead or look behind.
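
The points above can be checked with a tiny made-up generator. `countdown` is a hypothetical name, not part of the tutorial code. The sketch shows the generator resuming after `yield`, running out of values, and needing a fresh call to start over.

```python
def countdown(n):
    """Hypothetical generator: counts down from n to 1."""
    while n > 0:
        yield n
        n -= 1  # execution resumes here on the following next() call

gen = countdown(3)
first = next(gen)            # takes the first value
rest = list(gen)             # continues where it left off
second_pass = list(gen)      # exhausted: there is no rewinding
fresh = list(countdown(3))   # call the function again for a new generator

print(first, rest, second_pass, fresh)
```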

## Task 2 : Streaming with `yield`
Multiple CSV files stored in a directory contain the x-y position of a swimming zebrafish over time.
<br>__The task:__
1. Loop through each CSV file, read the x and y positions and find the distance travelled by the fish at each time point.
2. To find the distance travelled between two time points, you need the x and y positions of the fish in two consecutive frames.
3. Using the distances, print the time spent by the fish at a speed below the threshold.

  <img src="files/fish.png"  width="400" >

In [None]:
import csv
import os


def CSVfileGrabber(dirname):
    """Step 1 : Grab CSV files from a directory """
    for filename in os.listdir(dirname):
        if filename.endswith('.csv'):
            print('Working on: {}'.format(filename[:5]))  # Print name of fish
            yield os.path.join(dirname, filename)


def readcsv(filename):
    """Step 2 : read the csv files line by line """
    with open(filename) as f:
        # An extra step here using the built in csv library
        # to get a reader object that can be iterated over
        csvreader = csv.reader(f)
        for i, line in enumerate(csvreader):
            # Skip a few lines
            if i < 10:
                continue
            else:
                yield line


def getxy(linearray):
    """Step 3 : from every line yielded from the iterator, get x and y coordinates """
    for i in linearray:
        # x and y coordinates are in the 3rd and 4th column respectively
        yield [int(i[2]), int(i[3])]
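
As an aside, the "skip a few lines" step in `readcsv` could also be written with `itertools.islice`, which lazily drops the first items of any iterator. The sketch below is an alternative and not the tutorial's version. `readcsv_islice` is a hypothetical name, it takes an open file object rather than a filename, and the twelve fake rows are invented for the demonstration.

```python
import csv
import io
from itertools import islice

def readcsv_islice(fileobj, skip=10):
    """Hypothetical variant of readcsv: yield rows after skipping `skip` rows."""
    reader = csv.reader(fileobj)
    # islice(iterable, start, stop) lazily drops the first `start` items
    for row in islice(reader, skip, None):
        yield row

# Twelve fake rows stand in for a real tracking file
fake_csv = io.StringIO(''.join(f'{i},t{i},{i * 2},{i * 3}\n' for i in range(12)))
rows = list(readcsv_islice(fake_csv, skip=10))
print(rows)
```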

In [None]:
dirname = '/Users/seetha/Desktop/Microbetest/ExampleFile/'  # A small sample dataset

# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for g in getxy(readcsv(files)):
        #         print(g)
        numline += 1
    print('Number of lines parsed from this csv file: {}'.format(numline))

In [None]:
def consecutivexy1(linearray):
    """Step 4: get consecutive xy values"""
    # Here we want two consecutive xy positions to get speed/frame
    # Make use of the built-in next() function
    # (linearray must be an iterator, e.g. a generator, for next() to work)
    for i, line in enumerate(linearray):
        if i == 0:
            prevxy = line
            nextxy = next(linearray)
        else:
            prevxy = nextxy
            nextxy = line
        yield prevxy, nextxy

A nice way to do this is to use itertools, which is an amazing library for looping through iterators: https://docs.python.org/3/library/itertools.html

In [None]:
from itertools import tee


def consecutivexy2(linearray):
    # tee makes two independent iterators over the same iterable
    prevxy, nextxy = tee(linearray, 2)
    next(nextxy, None)  # discard one (the default avoids an error on empty input)
    yield from zip(prevxy, nextxy)  # Note here I am using "yield from"
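
To see what this pairing produces, it helps to try it on a short list of made-up positions. `pairwise_sketch` below is a hypothetical helper that follows the same `tee` idea. On Python 3.10 and later, `itertools.pairwise` offers this behaviour out of the box.

```python
from itertools import tee

def pairwise_sketch(iterable):
    """Return (item0, item1), (item1, item2), ... from any iterable."""
    first, second = tee(iterable, 2)
    next(second, None)  # advance the second iterator by one; safe if empty
    return zip(first, second)

pairs = list(pairwise_sketch([[0, 0], [3, 4], [6, 8]]))
single = list(pairwise_sketch([[0, 0]]))  # one point gives no pairs
print(pairs)
print(single)
```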

#### Sidenote : `yield from`
With `yield from`, we can skip an extra `for` loop

In [None]:
# A simple example to see what the yield from statement does
A = range(5)
B = range(6, 11)

def temp(A, B): #Without yield from
    for a, b in zip(A, B):
        yield a, b
            
for i in temp(A, B): 
    print(i)
# Two loops!! You need two loops!!

In [None]:
# From Python 3.3 onwards, yield from is available
def yieldfromexample(A, B):
    yield from zip(A, B)
for i in yieldfromexample(A, B):
    print(i)
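
`yield from` is also handy when one generator needs to hand over to another, for example to stream the rows of several files as one sequence. The sketch below uses made-up names. `rows_from` is a hypothetical stand-in for reading one file and `all_rows` chains several of them.

```python
def rows_from(label, n):
    """Hypothetical stand-in for reading n rows from one file."""
    for i in range(n):
        yield f'{label}-row{i}'

def all_rows(labels, n):
    """Chain the rows of several sources into one stream."""
    for label in labels:
        yield from rows_from(label, n)  # delegate to the inner generator

combined = list(all_rows(['fishA', 'fishB'], 2))
print(combined)
```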

In [None]:
# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for x, y in consecutivexy1(getxy(readcsv(files))):
#         print(x, y)
        numline += 1
    print('Number of consecutive pairs parsed from this csv file: {}'.format(numline))

## Write the next parts on your own
- Step 5 : Calculate distance between the two consecutive points
- Step 6 : Put it all together

In [None]:
# Step 5: Calculate euclidean distance
import math


def getdist(xy):
    """  
    Write a generator function that receives
    the previous and next x-y location of the fish 
    and calculates the distance between the two points
    
    Euclidean distance between two points (x1, y1) and (x2, y2) is 
    sqrt((x1-x2)^2 + (y1-y2)^2)
   """

In [None]:
# Step 6: Put it all together
def getframes(dist, threshold, frames_per_sec):
    """
    Count frames with distance below a user-defined threshold and
    complete the print statement given below
    (Hint: use enumerate to find number of frames)
    
    Example:
    Of 16.27 seconds recording time, time spent with speed less than 10 is 12.83 seconds
    """
    
    print('Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds')

## Task 2 : Solution

In [None]:
def getdist(xy):
    # Calculate euclidean distance
    for prevxy, nextxy in xy:
        # zip allows you to iterate over two lists in parallel
        dist = [(a - b)**2 for a, b in zip(prevxy, nextxy)]
        dist = math.sqrt(sum(dist))
        yield dist

# @tz.curry
def getframes(dist, threshold=10, frames_per_sec=30):
    nframes = 0  # stays 0 if dist yields nothing
    dist_count = 0
    # start=1 so that nframes ends up as the number of distances seen
    for nframes, d in enumerate(dist, start=1):
        if d < threshold:
            dist_count += 1
    print('Of {:0.3f} seconds recording time, time spent with speed less than {} is {:0.3f} seconds'.format(
        nframes / frames_per_sec, threshold, dist_count / frames_per_sec))
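
A quick sanity check for the distance step is to feed it points whose distances you know by heart, such as the 3-4-5 triangle. The sketch below is an assumption-laden alternative: `getdist_sketch` is a hypothetical name and it relies on `math.dist`, which is available from Python 3.8.

```python
import math

def getdist_sketch(pairs):
    """Yield the Euclidean distance for each (previous, next) pair of points."""
    for prevxy, nextxy in pairs:
        yield math.dist(prevxy, nextxy)  # requires Python 3.8+

# Made-up positions: move along a 3-4-5 triangle, stay put, move back
pairs = [([0, 0], [3, 4]), ([3, 4], [3, 4]), ([3, 4], [0, 0])]
distances = list(getdist_sketch(pairs))
print(distances)
```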

In [None]:
# Test your code with larger datasets
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for files in CSVfileGrabber(dirname):
    getframes(
        getdist(
            consecutivexy2(
                getxy(readcsv(files)))), threshold=10, frames_per_sec=30)

### The above statement that calls multiple generators looks ugly. <br> In such cases, with multiple generators lined up, yield can start to feel unintuitive and tedious

Enter Toolz
<br> Toolz by Matt Rocklin - http://toolz.readthedocs.io/en/latest/
<br> It makes streaming super easy - intuitive and concise!

For more examples and explanations, see Elegant SciPy, written by the brilliant ASPP faculty - https://github.com/elegant-scipy/notebooks/blob/master/notebooks/ch8.ipynb

(Filed under things I can't believe I hardly used before this tutorial)

In [None]:
import toolz as tz

#### tz.pipe - passes a value through a sequence of functions - one by one
Pipe is simply syntactic sugar to make multiple function calls easy
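
To make the idea concrete without installing anything, here is a plain-Python sketch of what a pipe does. `pipe_sketch`, `strip_lines` and `lengths` are hypothetical names written for this illustration. The real `tz.pipe` should be preferred in practice.

```python
def pipe_sketch(value, *functions):
    """Hypothetical stand-in for tz.pipe: feed value through each function in turn."""
    for function in functions:
        value = function(value)
    return value

def strip_lines(lines):
    """Lazily remove surrounding whitespace from each line."""
    return (line.strip() for line in lines)

def lengths(lines):
    """Lazily map each line to its length."""
    return (len(line) for line in lines)

# Nested calls read inside-out...
nested = sum(lengths(strip_lines([' ab ', 'cde '])))
# ...while the piped version reads left to right
piped = pipe_sketch([' ab ', 'cde '], strip_lines, lengths, sum)
print(nested, piped)
```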

In [None]:
# This does the same job as the previous nested call, without the nested brackets
# The function calls are cleaner and can be read from left to right - which is sooo much better
def pipeline(filename):
    pipe = tz.pipe(filename,
                readcsv,
                getxy,
                consecutivexy1,
                getdist,
                getframes(threshold=10, frames_per_sec=30)
               )
    return pipe

In [None]:
for i in CSVfileGrabber(dirname):
    pipeline(i)

### What happened there?

## The magic of curry

<br> Curry = Haskell Brooks Curry 
<br> __"Currying"__ means partially evaluating a function and returning another function. 

In [None]:
# If you don't give all inputs to a Python function, it becomes angry
sum()

### By currying, we are breaking down the evaluation of a function 
A curried function evaluates partially when you don't give it all the arguments, and fully when all arguments are available.
<br>(That's why Python screamed before we added @tz.curry, and why the curried function could then be added to the pipeline chain without any errors)

In [None]:
# The @curry decorator creates a curried function
from toolz import curry
@curry
def curried_sum(x, y):
    return x + y

In [None]:
A = curried_sum(2) #Just holds on to the value and produces no error

In [None]:
print(A(5)) #If you provide the second argument now - the function is fully evaluated
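
The standard library offers a related tool, `functools.partial`, which fixes some arguments up front and returns a new function. Unlike a curried function, you have to ask for the partial application explicitly. In the sketch below `report` is a hypothetical stand-in for `getframes` that returns a value instead of printing.

```python
from functools import partial

def report(distances, threshold, frames_per_sec):
    """Hypothetical stand-in for getframes: seconds spent below threshold."""
    slow_frames = sum(1 for d in distances if d < threshold)
    return slow_frames / frames_per_sec

# Fix the keyword arguments now, supply the data later
report_slow = partial(report, threshold=10, frames_per_sec=2)
seconds = report_slow([1, 20, 3, 4])  # three distances are below 10
print(seconds)
```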

Toolz has a number of useful curried functions to help us stream. <br> All of the Toolz functions are available as curried functions in the toolz.curried namespace, which also provides curried versions of functions like map, filter and reduce.

You can write any function that will analyse your dataset, line by line, which can then be curried and passed on to the toolz pipeline

Reminder :
1. map - Return an iterator that applies a function to every item of an iterable, yielding the results
2. filter - Construct an iterator from those elements of an iterable for which the function returns true
3. reduce - Apply a two-argument function cumulatively to the items of an iterable, reducing it to a single value
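
A short sketch of the three working together, using only the built-in `map` and `filter` and `functools.reduce`. The distances are made-up numbers, not data from the tutorial.

```python
from functools import reduce

distances = [2.0, 15.0, 4.0, 30.0]  # made-up per-frame distances

slow = filter(lambda d: d < 10, distances)            # lazy: keeps 2.0 and 4.0
doubled = map(lambda d: d * 2, slow)                  # lazy: 4.0 and 8.0
total = reduce(lambda acc, d: acc + d, doubled, 0.0)  # consumes the stream

print(total)
```

`map` and `filter` return lazy iterators, so no work happens until `reduce` pulls values through the chain.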