# Understanding Iterators and (mostly) Generators
Seetha Krishnan
<br>
ASPP - Asia Pacific 2018

## Iterators
Iterators are everywhere. 
An iterable is simply an object that can be iterated over, say using a `for` loop. An iterator is the object that does the stepping and hands out one value at a time.

In this extremely simple example, __range(4)__ is the iterable object, which at each iteration provides a different value to the variable __i__.

In [35]:
for i in range(4):
    print(i)

0
1
2
3


You can iterate over strings, lists, files, dictionaries, etc.

In [36]:
# Iterate over lines of a file
filename = 'sometextIwrote.txt'
f = open(filename, 'r')
for linenumber, lines in enumerate(f):
    print('{} > {}'.format(linenumber, lines))
f.close()
# enumerate is one of those super cool built-in Python functions that
# lets you loop over an object and get an automatic counter

0 > This is one of those text files

1 > That you may think contains the secret to life and happiness

2 > In reality, this contains nothing

3 > Yes, I am certain this is the last line of the text

4 > Ok I lied, this is the last line. There is nothing more
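A small aside on the output above: the blank lines appear because each line read from a file keeps its trailing newline and `print` adds another one. The sketch below uses `io.StringIO` as a stand-in for a real file (an assumption, so that no file on disk is needed) and also shows the optional `start` argument of `enumerate`.

```python
import io

# io.StringIO stands in for an open text file (assumption: made-up content)
fake_file = io.StringIO('first line\nsecond line\n')

numbered = []
for linenumber, line in enumerate(fake_file, start=1):
    # Each line keeps its trailing newline; strip it to avoid blank lines when printing
    clean = line.rstrip('\n')
    numbered.append((linenumber, clean))
    print('{} > {}'.format(linenumber, clean))
```

With `start=1` the counter begins at 1 instead of 0, which is often nicer for line numbers.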



__Sidenote__: The example above is not the recommended way to open and close a file. Use a `with` statement instead, which closes the file for you even if an error occurs inside the block. <br>(Objects that work with `with` are called context managers; we will talk more about them later.)

In [37]:
with open(filename) as f:
    for linenumber, lines in enumerate(f):
        print('{} > {}'.format(linenumber, lines))

0 > This is one of those text files

1 > That you may think contains the secret to life and happiness

2 > In reality, this contains nothing

3 > Yes, I am certain this is the last line of the text

4 > Ok I lied, this is the last line. There is nothing more
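To see that `with` really closes the file, you can check the file object's `closed` attribute. This is a minimal sketch that writes its own throwaway file with `tempfile` (an assumption, so that it runs anywhere); the file content is made up.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line one\nline two\n')
    tmp_path = tmp.name

with open(tmp_path) as f:
    lines = [line.rstrip('\n') for line in f]
    closed_inside = f.closed   # the file is still open inside the block

closed_after = f.closed        # the file was closed on leaving the block
print(lines, closed_inside, closed_after)

os.remove(tmp_path)            # clean up the throwaway file
```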



#### Underneath the covers is a specific protocol:
`__iter__()` : Returns the iterator object itself (this is what the built-in `iter()` calls).
<br>`__next__()` : Returns the next value (this is what the built-in `next()` calls).
<br>_StopIteration_ : Raised once all the values have been looped through.

If you want to know how to write an iterator from scratch, refer to some of these tutorials:
<br>https://www.programiz.com/python-programming/iterator
<br>http://anandology.com/python-practice-book/iterators.html
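For a quick taste before following those links, here is a minimal sketch of a hand-written iterator. The `Countdown` class is a made-up example, not taken from either tutorial; it implements the two protocol methods and raises `StopIteration` when it runs out of values.

```python
class Countdown:
    """Hypothetical example iterator: counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__
        return self

    def __next__(self):
        if self.current <= 0:
            # Signal that there is nothing left to hand out
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


print(list(Countdown(3)))
```

Because the class follows the protocol, it works anywhere an iterable is expected: in a `for` loop, in `list()`, in `enumerate()` and so on.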

In [38]:
it = iter(range(4))
print(it)

<range_iterator object at 0x10cc77030>


In [39]:
print(next(it))  # Run this multiple times

0
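Roughly speaking, a `for` loop does the same thing for you: it calls `iter()` once, then keeps calling `next()` until `StopIteration` is raised. The sketch below spells that out by hand; it is an illustration of the idea, not how the interpreter is literally implemented.

```python
it = iter(range(4))
collected = []

while True:
    try:
        value = next(it)      # ask the iterator for the next value
    except StopIteration:     # raised once the iterator is exhausted
        break
    collected.append(value)

print(collected)
```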


## Generators
Generators are a simple yet elegant type of iterator.

__To create generators:__ 
- Define a function
- Instead of the `return` statement, use the __yield__ keyword.

In [40]:
# Yield the lines of a text file one at a time
def readtxt(filename):
    with open(filename) as fin:
        for line in fin:
            yield line  # The yield keyword makes readtxt a generator function

In [49]:
r = readtxt(filename='sometextIwrote.txt')
print(r)

<generator object readtxt at 0x10cb94d00>


In [52]:
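# A generator does not hold the whole file in memory
# It "yields" one result at a time and hasn't computed anything until you ask for a value with next()
# Each call returns the following line; the output below is the second line,
# so next(r) had presumably already been called once before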
print(next(r))

That you may think contains the secret to life and happiness



Instead of calling `next` every time, you will typically loop over a generator like any other __iterator object__.

In [61]:
for l in readtxt(filename='sometextIwrote.txt'):
    print(f'> {l}')

> This is one of those text files

> That you may think contains the secret to life and happiness

> In reality, this contains nothing

> Yes, I am certain this is the last line of the text

> Ok I lied, this is the last line. There is nothing more



We can define a second generator that takes the first generator as input and counts the number of words in each line.

In [62]:
def countwords(linearray):
    for i, line in enumerate(linearray):
        yield i, len(line.split())  # (line index, number of words in that line)

In [69]:
for i, n in countwords(readtxt(filename = 'sometextIwrote.txt')):
    print('Number of words in line {} is {}'.format(i, n))

Number of words in line 0 is 7
Number of words in line 1 is 11
Number of words in line 2 is 5
Number of words in line 3 is 12
Number of words in line 4 is 12


In [70]:
print(list(countwords(readtxt(filename = 'sometextIwrote.txt'))))

[(0, 7), (1, 11), (2, 5), (3, 12), (4, 12)]


### What's so great about a generator? 
- Generators allow you to iterate over some data __lazily__ without loading the entire data source into memory at once. (Great for large datasets!)
- When a function hits `return`, it is done for good. A generator pauses at __yield__ and keeps its local state.
- A function always starts from its first line when called; a generator resumes where it left off, right after the last __yield__.
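The points above can be seen in a few lines. This is a minimal sketch with a made-up `squares` generator: nothing runs until the first `next`, each later `next` resumes after the previous `yield`, and a generator expression takes far less space than the equivalent list. Note that `sys.getsizeof` measures only the container itself, not the numbers it refers to, so treat the comparison as a rough indication.

```python
import sys


def squares(n):
    """Hypothetical example generator: yields 0, 1, 4, ... lazily."""
    for i in range(n):
        print('computing square of', i)  # runs only when a value is requested
        yield i * i


gen = squares(3)      # nothing is computed or printed yet
first = next(gen)     # runs the body up to the first yield
second = next(gen)    # resumes right after that yield, with i remembered
rest = list(gen)      # drains whatever is left

# Compare the container sizes of an eager list and a lazy generator expression
eager = [i * i for i in range(100000)]
lazy = (i * i for i in range(100000))
print(sys.getsizeof(eager), sys.getsizeof(lazy))
```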

### Real world example 1
Multiple CSV files stored in a directory contain the x-y position of a swimming zebrafish across time.
<br>__The task:__
1. Loop through each CSV file, read the x and y positions, and find the distance travelled by the fish at each time point.
2. To find the distance travelled between two time points, you need the x and y position of the fish in two consecutive frames.
3. Using the distances, print the time spent by the fish at a speed below the threshold.

  <img src="files/fish.png"  width="400" >

In [10]:
import csv
import os

# Step 1 : Grab CSV files from a directory
def CSVfileGrabber(dirname):
    for filename in os.listdir(dirname):
        if filename.endswith('.csv'):
            print('Working on: {}'.format(filename[:5]))  # Print name of fish
            yield os.path.join(dirname, filename)

# Step 2 : read the csv files line by line
def readcsv(filename):
    with open(filename) as f:
        # An extra step here using the built in csv library
        # to get a reader object that can be iterated over
        csvreader = csv.reader(f)
        for i, line in enumerate(csvreader):
            # Skip a few lines
            if i < 10:
                continue
            else:
                yield line

# Step 3 : get x and y coordinates
def getxy(linearray):
    for i in linearray:
        # x and y coordinates are in the 3rd and 4th column respectively
        yield [int(i[2]), int(i[3])]
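If you are wondering why `getxy` needs `int(...)`: `csv.reader` hands back every row as a list of strings. The sketch below shows this on an in-memory stand-in for a file. The column names and values are invented for illustration; only the position of x and y (3rd and 4th column) follows the code above.

```python
import csv
import io

# io.StringIO stands in for an open CSV file (assumption: made-up content)
fake_csv = io.StringIO('frame,time,x,y\n0,0.00,12,40\n1,0.03,15,44\n')

rows = list(csv.reader(fake_csv))
header, data = rows[0], rows[1:]
print(data[0])  # every field is a string at this point

# Convert the 3rd and 4th columns to integers, as getxy does
xy = [[int(row[2]), int(row[3])] for row in data]
print(xy)
```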

In [66]:
dirname = '/Users/seetha/Desktop/Microbetest/ExampleFile/'  # A small sample dataset

# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for g in getxy(readcsv(files)):
        #         print(g)
        numline += 1
    print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
Parsed lines from this csv file is 17
Working on: Fish2
Parsed lines from this csv file is 17


In [12]:
# Step 4: get consecutive xy
def consecutivexy1(linearray):
    # Here we want two consecutive xy positions to get speed per frame
    # Make use of the built-in next() function (linearray must be an iterator, not a list)
    for i, line in enumerate(linearray):
        if i == 0:
            prevxy = line
            nextxy = next(linearray)
        else:
            prevxy = nextxy
            nextxy = line
        yield prevxy, nextxy
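To see what this pairing logic produces, here is a sketch that runs the same logic on three made-up points. The function is a renamed copy (`consecutive_pairs`) so that the block is self-contained. The input is wrapped in `iter()` because `next()` only works on an iterator; passing a plain list would raise a `TypeError`.

```python
def consecutive_pairs(points):
    """Same logic as consecutivexy1; `points` must be an iterator."""
    for i, point in enumerate(points):
        if i == 0:
            prev_pt = point
            next_pt = next(points)  # pull the second item by hand
        else:
            prev_pt = next_pt
            next_pt = point
        yield prev_pt, next_pt


toy_points = iter([[0, 0], [3, 4], [6, 8]])  # made-up xy positions
pairs = list(consecutive_pairs(toy_points))
for pair in pairs:
    print(pair)
```

Three points give two pairs, which matches the 17 parsed lines turning into 16 pairs in the real data. One caveat: with a single-item input the inner `next()` has nothing to return, and on Python 3.7 and later that surfaces as a `RuntimeError`.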

In [67]:
# A nicer way is to use itertools (an amazing standard library module for working with iterators)
# https://docs.python.org/3/library/itertools.html
from itertools import tee

def consecutivexy2(linearray):
    # tee makes two independent iterators from the same iterable
    prevxy, nextxy = tee(linearray, 2)
    next(nextxy, None)  # advance the second copy by one (the default avoids an error on empty input)
    yield from zip(prevxy, nextxy)  # Note the cool 'yield from' syntax here
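Here is the `tee` trick on its own, as a minimal sketch with made-up numbers. The helper name `pairwise` is chosen for illustration. Giving `next` a default of `None` keeps it from raising on an empty input.

```python
from itertools import tee


def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    first, second = tee(iterable, 2)  # two independent iterators over the same data
    next(second, None)                # shift the second one forward by one item
    return zip(first, second)         # zip stops when the shorter one runs out


pairs = list(pairwise([10, 20, 30, 40]))
print(pairs)
```

Unlike the version that calls `next()` on its input directly, this works on lists as well as on generators, because `tee` accepts any iterable.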

#### Sidenote : `yield from`
With `yield from`, we can skip an extra `for` loop

In [14]:
# A simple example to see what yield from does
A = range(5)
B = range(6, 11)

def temp(A, B):
    for a, b in zip(A, B):
        yield a, b
            
for i in temp(A, B): 
    print(i)
# Two loops!! You need two loops!!

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


In [15]:
# Python 3.3 and later
def yieldfromexample(A, B):
    yield from zip(A, B)
for i in yieldfromexample(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)
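`yield from` works with any iterable, not just `zip`. As a further sketch, the made-up `chain_all` generator below delegates to several iterables in turn, which would otherwise need a nested `for` loop.

```python
def chain_all(*iterables):
    """Hypothetical example: yield every item of every iterable, in order."""
    for iterable in iterables:
        yield from iterable  # replaces: for item in iterable: yield item


combined = list(chain_all(range(3), 'ab', [10, 20]))
print(combined)
```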


In [71]:
# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for x, y in consecutivexy1(getxy(readcsv(files))):
#         print(x, y)
        numline += 1
    print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
Parsed lines from this csv file is 16
Working on: Fish2
Parsed lines from this csv file is 16


## Write the next parts on your own
- Step 5 : Calculate distance between the two consecutive points
- Step 6 : Put it all together

In [17]:
# Step 5: Calculate euclidean distance
import math


def getdist(xy):
    """
    Write a generator function that receives
    the previous and next x-y location of the fish
    and calculates the distance between the two points
    """

In [72]:
# Step 6: Put it all together
def getframes(dist, threshold, frames_per_sec):
    """
    Count frames with distance below a user-defined threshold and
    complete the print statement given below
    (Hint: use enumerate to find number of frames)
    
    Example:
    Of 16.27 seconds recording time, time spent with speed less than 10 is 12.83 seconds
    """
    
    print('Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds')

## Solution

In [89]:
def getdist(xy):
    # Calculate euclidean distance
    for prevxy, nextxy in xy:
        # zip lets you iterate over two lists in parallel
        dist = [(a - b)**2 for a, b in zip(prevxy, nextxy)]
        dist = math.sqrt(sum(dist))
        yield dist

# @tz.curry
def getframes(dist, threshold=10, frames_per_sec=30):
    dist_count = 0
    for i, d in enumerate(dist):
        if d < threshold:
            dist_count += 1
    print('Of {:0.3f} seconds recording time, time spent with speed less than {} is {:0.3f} seconds'.format(
        i / frames_per_sec, threshold, dist_count / frames_per_sec))

In [90]:
# Test your code with larger datasets
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for files in CSVfileGrabber(dirname):
    getframes(
        getdist(
            consecutivexy2(
                getxy(readcsv(files)))), threshold=10, frames_per_sec=30)

Working on: Fish1
Of 16.267 seconds recording time, time spent with speed less than 10 is 12.833 seconds
Working on: Fish6
Of 16.267 seconds recording time, time spent with speed less than 10 is 15.133 seconds
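If you do not have the CSV files at hand, the same chain of generators can be tried on a few made-up points. This is a sketch under stated assumptions: the function names (`consecutive`, `distances`, `count_slow`) and the track are invented, `math.hypot` replaces the explicit square root, and the counting step returns its numbers instead of printing them.

```python
import math
from itertools import tee


def consecutive(points):
    """Yield overlapping pairs of points."""
    a, b = tee(points, 2)
    next(b, None)
    yield from zip(a, b)


def distances(pairs):
    """Yield the Euclidean distance for each pair of (x, y) points."""
    for (x0, y0), (x1, y1) in pairs:
        yield math.hypot(x1 - x0, y1 - y0)


def count_slow(dists, threshold):
    """Return (number of distances, number of distances below threshold)."""
    total = slow = 0
    for d in dists:
        total += 1
        if d < threshold:
            slow += 1
    return total, slow


# Made-up track: the fish moves slowly, darts once, then slows down again
track = [(0, 0), (3, 4), (3, 5), (30, 41), (30, 42)]
total, slow = count_slow(distances(consecutive(track)), threshold=10)
print('{} of {} steps were slower than the threshold'.format(slow, total))
```

Nothing is computed until `count_slow` starts pulling values, and only one pair of points is in flight at any time.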


### The above statement that calls multiple generators looks ugly 
Let's make it more beautiful using toolz!
<br> Toolz by Matt Rocklin - http://toolz.readthedocs.io/en/latest/
<br> It makes streaming super easy: intuitive and concise!

(Filed under things I can't believe I hardly used before this tutorial)
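The idea behind a pipe is simple: take a value and pass it through a list of functions, one after the other. The sketch below is a hypothetical pure-Python stand-in written with `functools.reduce` to show the idea; it is not toolz's actual implementation, and the helper functions are made up.

```python
from functools import reduce


def pipe(value, *funcs):
    """Hypothetical stand-in: feed `value` through each function in turn."""
    return reduce(lambda acc, func: func(acc), funcs, value)


def double(x):
    return x * 2


def increment(x):
    return x + 1


nested = increment(double(5))      # read inside out
piped = pipe(5, double, increment)  # read in the order the steps run
print(nested, piped)
```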

In [91]:
import toolz as tz

In [92]:
# This does the same job as the previous nested call (without the pile of brackets)
# And it can be read in the order the steps run - which is sooo much better
def pipeline(filename):
    pipe = tz.pipe(filename,
                readcsv,
                getxy,
                consecutivexy1,
                getdist,
                # getframes also needs `dist`, so curry it to get a function
                # that waits for the missing argument
                tz.curry(getframes)(threshold=10, frames_per_sec=30)
               )
    return pipe

In [93]:
for i in CSVfileGrabber(dirname):
    pipeline(i)

Working on: Fish1
Of 16.267 seconds recording time, time spent with speed less than 10 is 12.833 seconds
Working on: Fish6
Of 16.267 seconds recording time, time spent with speed less than 10 is 15.133 seconds


## The magic of curry

<br> Curry = Named after Haskell Curry (yes, like the programming language)
<br> __"Currying"__ here means calling a function with only some of its arguments and getting back another function that waits for the rest.

In [94]:
# If you don't give a Python function all of its required inputs, it becomes angry
sum()

TypeError: sum expected at least 1 arguments, got 0
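The standard library already offers a related tool, `functools.partial`, which fixes some arguments of a function and returns a new function that waits for the rest. The sketch below uses it as a stand-in to illustrate the idea with a made-up `power` function; it is not toolz's `curry`, which works out by itself when arguments are still missing.

```python
from functools import partial


def power(base, exponent):
    """Hypothetical example function with two required arguments."""
    return base ** exponent


# Fix `exponent` now and supply `base` later
square = partial(power, exponent=2)
cube = partial(power, exponent=3)

print(square(4), cube(2))
```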