# Understanding Iterators and (mostly) Generators
Seetha Krishnan
<br>
ASPP - Asia Pacific 2018

## Iterators
Iterators are everywhere. 
Any object you can loop over, say with a `for` loop, is an iterable, and the iterator is the object that hands out its values one at a time.

In this extremely simple example, __range(4)__ is the iterable object, which at each iteration provides a different value to the variable __i__.

In [32]:
for i in range(4):
    print(i)

0
1
2
3
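
Under the hood, a `for` loop asks the iterable for an iterator with `iter()` and then calls `next()` on it until `StopIteration` is raised. A minimal sketch of that protocol (the variable names are only for illustration):

```python
numbers = range(4)        # an iterable: it can produce an iterator
it = iter(numbers)        # the iterator hands out one value per next() call

collected = []
while True:
    try:
        collected.append(next(it))
    except StopIteration:  # raised once the values run out
        break

print(collected)  # the same values the for loop printed
```

This is why anything that supports `iter()` can be used in a `for` loop.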


You can iterate over strings, lists, files, dictionaries, etc.

In [33]:
filename = 'sometext.txt'
# When you iterate over a text file, you get lines
with open(filename, 'r') as f:  # The file is closed automatically
    for linenumber, line in enumerate(f):
        print(f'{linenumber} > {line}')

0 > The skill to do math on a page

1 > Has declined to the point of outrage.

2 > Equations quadratica

3 > Are solved on Mathmatica,

4 > And on birthdays we don't know our age.


Most of the time you can get away with iterating over objects, storing the results and analysing them. But things can quickly get out of hand if you have a large dataset or multiple loops.

__For Example__: 
- You have a large csv file. Iterating and appending results to a list is not going to be memory or CPU efficient. You need something that will allow you to parse one line at a time. 

- If you have to stream data from a server, web service, or a camera, you are continuously generating a series of values. Now you want to iterate through this, but you do not know the length of this data or when it will end. And you don't want to keep appending it to a list.

## Generators
Generators are a simple, yet elegant type of iterator.

__To create generators:__ 
- Define a function
- Instead of the `return` statement, use the __yield__ keyword. 

In [82]:
def charcount(filename):
    """Generator function that reads a file and yields each line with its character count."""
    with open(filename) as fin:
        for line in fin:
            yield line, len(line)

In [83]:
c = charcount(filename='sometext.txt')
print(c)

<generator object charcount at 0x10fc0cca8>


A generator does not hold its results in memory.
<br>It "yields" one result at a time and hasn't computed anything until you ask for a value by calling `next`.

In [84]:
c1 = next(c)

In [85]:
print(c1)

('The skill to do math on a page\n', 31)
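
To see the laziness directly, here is a small sketch in which the generator records every value it computes. The names `squares` and `log` are made up for this illustration and are not part of the fish example.

```python
def squares(n, log):
    """Yield the squares of 0..n-1, recording each computation in log."""
    for i in range(n):
        log.append(i)      # runs only when a value is requested
        yield i * i


log = []
gen = squares(1000, log)
assert log == []           # creating the generator computed nothing

first = next(gen)          # computes exactly one value
second = next(gen)         # resumes and computes the next one
print(first, second, log)
```

Only two of the thousand possible values were ever computed, because only two were requested.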


Instead of calling `next` every time, you will typically loop over the generator like any other __iterator object__.

In [86]:
c = charcount(filename='sometext.txt')
for l in c:
    print(f'> {l[0][:-1]} \t char count: {l[1]}')

> The skill to do math on a page 	 char count: 31
> Has declined to the point of outrage. 	 char count: 38
> Equations quadratica 	 char count: 21
> Are solved on Mathmatica, 	 char count: 26
> And on birthdays we don't know our age 	 char count: 39


### Task1
- Define a second generator that yields the number of words in each line
- Use `charcount(filename)` as an input to this generator
- The output print statement should include the line number, character count and word count

Tip : What I like to do when writing a generator is to print statements instead of yield.
When I am satisfied with the accuracy of the print statement, I convert it to yield

### Task1: Solution
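
One possible solution, as a sketch rather than the official answer. The name `wordcount` is my own choice, and a small temporary file stands in for `sometext.txt` so that the example runs on its own.

```python
import os
import tempfile


def charcount(filename):
    """Yield each line of the file together with its character count."""
    with open(filename) as fin:
        for line in fin:
            yield line, len(line)


def wordcount(lines_and_chars):
    """Take the output of charcount and add a line number and word count."""
    for linenumber, (line, nchars) in enumerate(lines_and_chars):
        yield linenumber, nchars, len(line.split())


# A two-line stand-in for sometext.txt so the sketch runs on its own
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('The skill to do math on a page\nEquations quadratica\n')

results = list(wordcount(charcount(tmp.name)))
os.remove(tmp.name)

for linenumber, nchars, nwords in results:
    print(f'{linenumber} > char count: {nchars} \t word count: {nwords}')
```

The second generator never opens the file itself. It only consumes whatever the first generator yields, so the two stages can be chained freely.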

## Generators are great for large datasets that you want to process one line at a time
- A __generator__ is also an __iterator__ (but not vice versa)!
- generators iterate over data __lazily__ without loading the entire data source into memory at once.

__You make a generator that can be iterated over__

- When functions `return`, they are done for good. Generators stay alive until their values are exhausted
- Functions always start from the first line; generators resume where they left off: at __yield__ 
- __Limitation__ - with a generator you can only iterate forward, once. You can't peek ahead or look behind
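
The points above can be sketched with a toy generator (`countdown` is a made-up name for this illustration):

```python
def countdown(n):
    """Yield n, n-1, ..., 1, pausing at each yield."""
    while n > 0:
        yield n
        n -= 1             # execution resumes here on the next request


gen = countdown(3)
is_iterator = iter(gen) is gen   # a generator is its own iterator
first = next(gen)                # runs up to the first yield
rest = list(gen)                 # resumes after the yield, not from the top
leftover = list(gen)             # exhausted: nothing more to give
print(is_iterator, first, rest, leftover)
```

Once a generator is exhausted, the only way to go through the values again is to create a new generator.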

## Task 2 : Streaming with `yield`
Multiple CSV files stored in a directory contain the x-y position of a swimming zebrafish across time.
<br>__The task:__
1. Loop through each csv file, acquire the x and y position and find the distance travelled by the fish at each time point.
2. To find the distance travelled between two timepoints, you need to get the x and y position of the fish at two consecutive frames.
3. Using the acquired distance travelled, print the time spent by the fish at a speed below the threshold. 

  <img src="files/fish.png"  width="400" >

### Read from csv files - line by line

In [164]:
import csv
import os


def CSVfileGrabber(dirname):
    """Step 1 : Grab CSV files from a directory """
    for filename in os.listdir(dirname):
        if filename.endswith('.csv'):
            print('Working on: {}'.format(filename[:5]))  # Print name of fish
            yield os.path.join(dirname, filename)


def readxy(filename):
    """Step 2 : read the csv files line by line """
    with open(filename) as f:
        csvreader = csv.reader(f)
        for i, line in enumerate(csvreader):
            # Skip the first 10 lines
            if i < 10:
                continue
            # x and y coordinates are in the third and fourth columns
            x = int(line[2])
            y = int(line[3])
            yield (x, y)

Just to make sure things are working

In [165]:
dirname = '/Users/seetha/Desktop/Microbetest/ExampleFile/'  # A small sample dataset

for files in CSVfileGrabber(dirname):
    print(files)
    
# for files in CSVfileGrabber(dirname):
#     numline = 0
#     for g in readxy(files):
#         numline += 1
#     print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
/Users/seetha/Desktop/Microbetest/ExampleFile/Fish1_example.csv
Working on: Fish2
/Users/seetha/Desktop/Microbetest/ExampleFile/Fish2_example.csv


### Get consecutive values for distance calculation

In [166]:
def consecutivexy1(linearray):
    """Step 4: get consecutive xy values"""
    # Here we want to get two consecutive xy to get speed/frame
    # Make use of the built-in next() function (linearray must be an iterator)
    for i, line in enumerate(linearray):
        if i == 0:
            prevxy = line
            nextxy = next(linearray)
        else:
            prevxy = nextxy
            nextxy = line
        yield prevxy, nextxy

A nice way to do this is to use itertools (which is an amazing library for looping through iterators): https://docs.python.org/3/library/itertools.html
<br> `tee` : Return n independent iterators from a single iterable. `tee(iterable, n)`

In [167]:
from itertools import tee


def consecutivexy2(linearray):
    # This makes two independent iterators over the same iterable
    prevxy, nextxy = tee(linearray, 2)
    next(nextxy, None)  # discard one; the default avoids an error on empty input
    yield from zip(prevxy, nextxy)  # Note here I am using "yield from"
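
To check what the `tee` approach produces, here is the same idea run on three made-up positions. The name `pairs` and the data are hypothetical, chosen only for this sketch.

```python
from itertools import tee


def pairs(iterable):
    """Yield overlapping (previous, next) pairs from any iterable."""
    prev_items, next_items = tee(iterable, 2)
    next(next_items, None)           # advance the second copy by one
    yield from zip(prev_items, next_items)


positions = [(0, 0), (3, 4), (6, 8)]   # made-up x-y positions
consecutive = list(pairs(positions))
print(consecutive)
```

Three positions give two consecutive pairs. On Python 3.10 and later, `itertools.pairwise` provides the same behaviour out of the box.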

In [168]:
# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for x, y in consecutivexy2(readxy(files)):
        #         print(x, y)
        numline += 1
    print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
Parsed lines from this csv file is 16
Working on: Fish2
Parsed lines from this csv file is 16


### Sidenote : `yield from`
With `yield from`, we can skip an extra `for` loop.

In [44]:
# A simple example to see what yield from will do
A = range(5)
B = range(6, 11)

# Without yield from
def temp(range1, range2):
    for a, b in zip(range1, range2):
        yield a, b
        
# Two loops!! You need two loops!!
for i in temp(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


In [45]:
# Since Python 3.3, yield from is available
def yieldfromexample(A, B):
    yield from zip(A, B)
for i in yieldfromexample(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


`yield from` is especially useful when you have multiple iterators or recursive data structures.
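
As a sketch of the recursive case, here is a generator that flattens an arbitrarily nested list. The function name `flatten` and the data are made up for this illustration.

```python
def flatten(nested):
    """Yield the leaves of an arbitrarily nested list, left to right."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)   # delegate to the sub-generator
        else:
            yield item


flat = list(flatten([1, [2, [3, 4]], [5]]))
print(flat)
```

Without `yield from`, each level of recursion would need its own inner `for` loop to pass the values up.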

## Write the next parts on your own
- Step 5 : Calculate distance between the two consecutive points
- Step 6 : Put it all together

In [46]:
# Step 5: Calculate euclidean distance
import math


def getdist(xy):
    """  
    Write a generator function that receives 
    the previous and next x-y location of the fish 
    and calculates the distance between the two points
    
    Euclidean distance between two points (x1, y1) and (x2, y2) is 
    sqrt((x1-x2)^2 + (y1-y2)^2)
   """

In [None]:
# Step 6: Put it all together
def getframes(dist, threshold, frames_per_sec):
    """
    Count frames with distance below a user-defined threshold and
    complete the print statement given below
    (Hint: use enumerate to find number of frames)
    
    Example:
    Of 16.27 seconds recording time, time spent with speed less than 10 is 12.83 seconds
    """
    
    # Template for the final message; fill in the three values with .format()
    # and print the result
    message = 'Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds'

### Task2: Solution

In [188]:
import math
import toolz as tz  # needed for the @tz.curry decorator below


def getdist(xy):
    # Calculate euclidean distance
    for prevxy, nextxy in xy:
        # zip allows you to iterate over two sequences in parallel
        dist = [(a - b)**2 for a, b in zip(prevxy, nextxy)]
        dist = math.sqrt(sum(dist))
        yield dist

@tz.curry
def getframes(dist, threshold=10, frames_per_sec=30):
    dist_count = 0
    for i, d in enumerate(dist):
        if d < threshold:
            dist_count += 1
    print('Of {:0.3f} seconds recording time, time spent with speed less than {} is {:0.3f} seconds'.format(
        i / frames_per_sec, threshold, dist_count / frames_per_sec))

In [189]:
# Test your code with larger datasets
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for files in CSVfileGrabber(dirname):
    getframes(
        getdist(
            consecutivexy1(
                readxy(files))), threshold=10, frames_per_sec=30)

Working on: Fish1
Of 16.267 seconds recording time, time spent with speed less than 10 is 12.833 seconds
Working on: Fish6
Of 16.267 seconds recording time, time spent with speed less than 10 is 15.133 seconds


## The above statement that calls multiple generators looks ugly. <br> 
In such cases, with multiple generators lined up, yield can start to feel unintuitive and tedious.

Enter Toolz
<br> Toolz by Matt Rocklin - http://toolz.readthedocs.io/en/latest/
<br> It makes streaming super easy - intuitive and concise!

For more examples and explanations, see Elegant SciPy, written by the brilliant ASPP faculty - https://github.com/elegant-scipy/notebooks/blob/master/notebooks/ch8.ipynb

(Filed under things I can't believe I hardly used before this tutorial)

In [190]:
import toolz as tz

#### tz.pipe - passes a value through a sequence of functions - one by one
Pipe is simply syntactic sugar to make multiple function calls easy
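
To make the idea concrete, here is a pure-Python stand-in that does roughly what `tz.pipe` does. It is a sketch for illustration, not the toolz source, and `double` and `increment` are made-up helper functions.

```python
def pipe(data, *funcs):
    """Pass data through each function in turn, left to right."""
    for func in funcs:
        data = func(data)
    return data


def double(x):
    return 2 * x


def increment(x):
    return x + 1


nested = increment(double(3))        # read inside out
piped = pipe(3, double, increment)   # read left to right
print(nested, piped)
```

Both calls compute the same thing. The piped version simply lists the steps in the order they happen.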

In [191]:
# This will do exactly as the previous call (without the added brackets)
# The function calls are cleaner and can be read from left to right - which is sooo much better
def pipeline(filename):
    pipe = tz.pipe(filename,
                readxy,
                consecutivexy1,
                getdist,
                getframes(threshold=10, frames_per_sec=30)
               )
    return pipe

In [192]:
pipeline

<function __main__.pipeline>

In [193]:
for i in CSVfileGrabber(dirname):
    pipeline(i)

Working on: Fish1
Of 16.267 seconds recording time, time spent with speed less than 10 is 12.833 seconds
Working on: Fish6
Of 16.267 seconds recording time, time spent with speed less than 10 is 15.133 seconds


### What happened there?

## The magic of curry

<br> Curry = Haskell Brooks Curry 
<br> __"Currying"__ means partially evaluating a function and returning another function. 

In [None]:
# If you don't give all inputs to a Python function, it becomes angry
sum()

### By currying, we are breaking down the evaluation of a function 
A curried function evaluates partially when you don't give it all the arguments, and fully when all arguments are available. 

In [155]:
def currythis(my_function):
    def f1(a):
        def f2(b):
            return my_function(a, b)  
        return f2
    return f1

def func1(a, b):
    return a + b

In [157]:
func1(1)

TypeError: func1() missing 1 required positional argument: 'b'

In [158]:
f = currythis(my_function=func1) #Define a curried function

In [161]:
f(1)

<function __main__.currythis.<locals>.f1.<locals>.f2>

## Sidenote : Functions taking functions as input
__`map` function in Python__ : Returns an iterator that applies function to every item of iterable, yielding the results
<br>`map(function_to_apply, iterable_of_inputs)`

In [196]:
def getlen(text):
    return len(text.split())

list(map(getlen, ['Im ok', 'I will be ok']))  # map is lazy, so wrap it in list() to see the values

[2, 4]

In [195]:
filename = 'sometext.txt'
with open(filename) as f:
    for i in map(getlen, f):
        print(i, end=" ")

8 7 2 4 8 

What would this function look like?

In [89]:
def a_map_function(myfunc, myseq):
    for x in myseq: 
        yield myfunc(x)

In [90]:
with open(filename) as f:
    for i in a_map_function(getlen, f):
        print(i, end=" ")

8 7 2 4 8 

## Task 3 
__filter__ - Construct an iterator from those elements of iterable for which function returns true
<br>__To Do__ : Implement the built-in filter function by yourself. <br> `filter(function, iterable)`
<br>Get those lines of the text file which contain words longer than 10 letters

### Task3 : Solution 

In [153]:
def words(text):
    """Return True if any word in the text is longer than 10 characters."""
    for i in text.split():
        if len(i) > 10:
            return True
    return False


def myfilter(myfunc, myseq):
    for x in myseq:
        if myfunc(x):
            yield x

In [154]:
with open(filename) as f:
    for i in myfilter(words, f):
        print(i, end="\n")

Are solved on Mathematica,



## Let's get back to curry now

In [162]:
def currythis(my_function):
    """
    In python, functions can take other functions as input and even return them.
    """
    print('Going in')
    def f1(a):
        print('Evaluating partially give input', a)
        def f2(b):
            print('Evaluating fully after getting input', b)
            return my_function(a, b)  
#         print('f2', f2)
        return f2
#     print('f1', f1)
    return f1

def func1(a, b):
    return a + b

In [42]:
f = currythis(my_function=func1) #Define a curried function

Going in


In [43]:
g = f(1) #Just holds on to the value and produces no error

Evaluating partially give input 1


In [44]:
g(2)

Evaluating fully after getting input 2


3

`@currythis` above a function definition is just syntactic sugar for `func1 = currythis(func1)`
<br> The `@` syntax applies a decorator

In [45]:
@currythis
def func1(x, y):
    print(f'Sum {x + y}')
    return x + y


def test_curry(f, x, y):
    f1 = f(x)
    assert f1(y) == x + y

Going in


In [46]:
test_curry(func1, 3, 5) #If you provide the second argument now - the function is fully evaluated

Evaluating partially give input 3
Evaluating fully after getting input 5
Sum 8


In [47]:
# Try giving all the arguments to func1 now
func1(a = 3, b = 5) # Why doesn't this work?

TypeError: f1() got an unexpected keyword argument 'b'

In [48]:
import functools
def currythis(my_function):
    print('Going in')
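    def new_func(*args, **kwargs):
        try:
            # Works only once every required argument has been supplied
            return my_function(*args, **kwargs)
        except TypeError:
            # Caveat: this also swallows a TypeError raised inside my_function
            print('Not all arguments given')
            return functools.partial(new_func, *args, **kwargs)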
    return new_func

In [55]:
@currythis
def func1(a, b, c):
    print(f'Sum {a+b+c}')
    return a + b + c

Going in


In [58]:
f = func1(b = 1, c = 2)
f

Not all arguments given


functools.partial(<function currythis.<locals>.new_func at 0x10fbf0158>, b=1, c=2)

In [57]:
f(3)

Sum 6


6

`tz.curry` works in a similar manner

In [1]:
from toolz import curry
@curry
def func1(a, b, c):
    print(f'Sum {a+b+c}')
    return a + b + c

In [2]:
f = func1(b = 1, c = 2)

In [3]:
f(3)

Sum 6


6

### In our pipeline:
- Functions can be defined as `func1(sequence, *args, **kwargs)`. The `@curry` decorator converts them into curried functions, so calling one with only its keyword arguments returns a function that waits for the sequence.
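
This is what made `getframes(threshold=10, frames_per_sec=30)` usable as a pipeline stage. As a sketch of the same idea using only the standard library, `functools.partial` can stand in for the curried call. The function `count_below` and the distances are hypothetical, made up for this illustration.

```python
from functools import partial


def count_below(values, threshold=10):
    """Count the values in a sequence that fall below threshold."""
    return sum(1 for v in values if v < threshold)


# Fix the keyword argument now; the sequence is supplied later
stage = partial(count_below, threshold=5)

distances = (d for d in [1, 7, 3, 9, 4])   # made-up stream of distances
n_slow = stage(distances)
print(n_slow)
```

The stage holds on to its settings and only does the work when the stream finally arrives, which is exactly the shape a pipe needs: one function, one input.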

## Task 4: if time permits
use toolz.pipe 

### All the functions in toolz are available in curried form

Toolz has a number of useful curried functions to help us stream. <br> All of the Toolz functions are available as curried functions in the `toolz.curried` namespace, which also provides curried versions of functions like map, filter and reduce.

You can also write any function that will analyse your dataset, line by line, which can then be curried and passed on to the toolz pipeline

Reminder :
1. map - Return an iterator that applies function to every item of iterable, yielding the results
2. filter - Construct an iterator from those elements of iterable for which function returns true
3. reduce - Perform some computation on an iterator and return a single result.
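
A short sketch of the three working together, using the built-in `map` and `filter` and `functools.reduce` rather than the toolz versions. The two lines of text are taken from the example file; the threshold of 3 words is an arbitrary choice for this illustration.

```python
from functools import reduce

lines = ['The skill to do math on a page', 'Equations quadratica']

word_counts = map(lambda line: len(line.split()), lines)   # lazy
long_lines = filter(lambda n: n > 3, word_counts)          # lazy
total = reduce(lambda acc, n: acc + n, long_lines, 0)      # consumes the stream
print(total)
```

Nothing is computed until `reduce` starts pulling values through the chain, one line at a time.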