# Understanding Iterators and (mostly) Generators
Seetha Krishnan
<br>
ASPP - Asia Pacific 2018

## Iterators
Iterators are everywhere. 
An iterator is simply an object that can be iterated upon, for example with a `for` loop.

In this extremely simple example, __range(4)__ is the iterable object, which provides a different value to the variable __i__ at each iteration.

In [19]:
for i in range(4):
    print(i)

0
1
2
3
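Behind the scenes, a `for` loop calls `iter()` on the iterable to get an iterator, then calls `next()` on it until `StopIteration` is raised. A minimal sketch of that protocol (the names `it` and `values` are only illustrative):

```python
# Roughly what `for i in range(4)` does behind the scenes.
it = iter(range(4))               # ask the iterable for an iterator
values = []
while True:
    try:
        values.append(next(it))   # fetch the next value
    except StopIteration:         # raised when nothing is left
        break
print(values)  # [0, 1, 2, 3]
```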


You can iterate over strings, lists, files, dictionaries, etc.

In [20]:
filename = 'sometext.txt'  # text file used in the examples of this section

with open(filename) as f:
    for linenumber, lines in enumerate(f):
        print(f'{linenumber} > {lines}')

0 > The skill to do math on a page

1 > Has declined to the point of outrage.

2 > Equations quadratica

3 > Are solved on Mathematica,

4 > And on birthdays we don't know our age.


Most of the time you can get away with iterating over objects, storing the results and analysing them. But things can quickly get out of hand if you have a large dataset or multiple loops.

__For Example__: 
- You have a large CSV file. Reading every line and appending the results to a list is not going to be memory or CPU efficient. You need something that lets you parse one line at a time.

- If you have to stream data from a server, web service, or camera, you are continuously generating a series of values. You want to iterate through them, but you do not know the length of the data or when it will end, and you don't want to keep appending it to a list.
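The first scenario can be sketched without building any list: read one row, update a running result, and move on. The snippet below uses `io.StringIO` with made-up data as a stand-in for a large CSV file on disk, so the column layout is an assumption for illustration only.

```python
import csv
import io

# Stand-in for a large CSV file on disk (hypothetical data).
fake_csv = io.StringIO("t,x,y\n0,1,2\n1,3,4\n2,5,6\n")

row_count = 0
x_total = 0
reader = csv.reader(fake_csv)
next(reader)                 # skip the header row
for row in reader:           # only one row is held in memory at a time
    row_count += 1
    x_total += int(row[1])

print(row_count, x_total)    # 3 9
```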

## Generators
Generators are a simple yet elegant type of iterator.

__To create generators:__ 
- Define a function
- Instead of the `return` statement, use the __yield__ keyword.

In [21]:
def charcount(filename):
    """Generator function that reads a file and yields each line with its character count."""
    with open(filename) as fin:
        for line in fin:
            yield line, len(line)

In [22]:
c = charcount(filename='sometext.txt')
print(c)

<generator object charcount at 0x10e2cf938>


A generator does not hold its values in memory.
<br>It "yields" one result at a time and has not computed anything until you ask for a value by calling `next`.

In [23]:
c1 = next(c)

In [24]:
print(c1)

('The skill to do math on a page\n', 31)
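To make the laziness visible, here is a small sketch with a hypothetical generator, `noisy_counter`, that records in a list when its body actually runs. Creating the generator runs none of the body; the first `next` runs it only as far as the first `yield`.

```python
def noisy_counter(n):
    """Hypothetical generator that logs when its body runs."""
    log.append('started')
    for i in range(n):
        log.append(f'yielding {i}')
        yield i


log = []
gen = noisy_counter(2)
log_before = list(log)   # still empty: nothing has run yet
first = next(gen)        # runs the body up to the first yield
print(log_before, log)   # [] ['started', 'yielding 0']
```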


Instead of calling `next` every time, you will typically use a generator as an __iterator object__ in a `for` loop.

In [25]:
c = charcount(filename='sometext.txt')
for line, nchars in c:
    # rstrip() removes the trailing newline without cutting off the last
    # character of a final line that has no newline
    print(f'> {line.rstrip()} \t char count: {nchars}')

> The skill to do math on a page 	 char count: 31
> Has declined to the point of outrage. 	 char count: 38
> Equations quadratica 	 char count: 21
> Are solved on Mathematica, 	 char count: 27
> And on birthdays we don't know our age. 	 char count: 39


#### See basic_generator_example.py

### Task1
- Define a second generator that yields the number of words in each line
- Use charcount(filename) as an input to this generator
- The printed output should include the line number, character count and word count

Tip: When writing a generator, I like to start with `print` statements instead of `yield`.
Once I am satisfied that the printed values are correct, I convert them to `yield`.

### Task1: Solution

#### See basic_generator_example.py
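For readers without that file at hand, here is one possible sketch of the task; it is not necessarily what `basic_generator_example.py` contains. To keep it runnable without a text file, `charcount_lines` is a hypothetical stand-in for `charcount(filename)` that takes any iterable of lines, and `wordcount` is an assumed name for the second generator.

```python
def charcount_lines(lines):
    """Stand-in for charcount(filename): yields (line, number of characters)."""
    for line in lines:
        yield line, len(line)


def wordcount(char_counts):
    """Consume (line, charcount) pairs; yield line number, char and word count."""
    for linenumber, (line, nchars) in enumerate(char_counts):
        yield linenumber, nchars, len(line.split())


sample = ['The skill to do math on a page\n',
          'Has declined to the point of outrage.\n']
results = list(wordcount(charcount_lines(sample)))
for linenumber, nchars, nwords in results:
    print(f'{linenumber} > char count: {nchars} \t word count: {nwords}')
```

Because `wordcount` only pulls one pair at a time from its input, the two generators together still process the text one line at a time.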

## Generators are great for large datasets that you want to process one line at a time
- A __generator__ is also an __iterator__ (but not vice versa)!
- generators iterate over data __lazily__ without loading the entire data source into memory at once.

__You make a generator that can be iterated over__

- When functions `return`, they are done for good. Generators stay alive until their values are exhausted.
- Functions always start from the first line; generators resume where they left off, at __yield__.
- __Limitation__: with a generator you can only iterate forward. You can't peek ahead or look behind.
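These points can be seen in a short sketch with a hypothetical `countdown` generator: it resumes after the last `yield` each time, and once exhausted it cannot be rewound.

```python
def countdown(n):
    """Hypothetical generator that counts down from n to 1."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)           # runs up to the first yield
rest = list(gen)            # resumes right after that yield
again = list(gen)           # the generator is exhausted by now
print(first, rest, again)   # 3 [2, 1] []
```

To iterate a second time, you have to call the generator function again to get a fresh generator.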

## Task 2 : Streaming with `yield`
Multiple CSV files stored in a directory contain the x-y positions of a swimming zebrafish over time.
<br>__The task:__
1. Loop through each csv file, acquire the x and y position and find distance travelled by the fish at each time point.
2. To find distance travelled between two timepoints, you need to get the x and y position of fish at two consecutive frames.
3. Using the acquired distance travelled, print time spent by the fish at a speed below the threshold. 

  <img src="files/fish.png"  width="400" >

### Read from csv files - line by line

In [26]:
import csv
import os


def CSVfileGrabber(dirname):
    """Step 1 : Grab CSV files from a directory """
    for filename in os.listdir(dirname):
        if filename.endswith('.csv'):
            print('Working on: {}'.format(filename[:5]))  # Print name of fish
            yield os.path.join(dirname, filename)


def readxy(filename):
    """Step 2 : read the csv files line by line """
    with open(filename) as f:
        csvreader = csv.reader(f)
        for i, line in enumerate(csvreader):
            # Skip the first 10 lines
            if i < 10:
                continue
            # x and y coordinates
            x = int(line[2])
            y = int(line[3])
            yield (x, y)

Just to make sure things are working

In [27]:
dirname = '/Users/seetha/Desktop/Microbetest/ExampleFile/'  # A small sample dataset

for files in CSVfileGrabber(dirname):
    print(files)
    
# for files in CSVfileGrabber(dirname):
#     numline = 0
#     for g in readxy(files):
#         numline += 1
#     print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
/Users/seetha/Desktop/Microbetest/ExampleFile/Fish1_example.csv
Working on: Fish2
/Users/seetha/Desktop/Microbetest/ExampleFile/Fish2_example.csv


### Get consecutive values for distance calculation

In [28]:
def consecutivexy1(linearray):
    """Step 4: get consecutive xy values"""
    # Here we want to get two consecutive xy values to get speed/frame
    # Make use of the built-in next() function
    for i, line in enumerate(linearray):
        if i == 0:
            prevxy = line
            nextxy = next(linearray)
        else:
            prevxy = nextxy
            nextxy = line
        yield prevxy, nextxy

A nice way to do this is to use itertools (an amazing library for working with iterators): https://docs.python.org/3/library/itertools.html
<br> `tee` : Return n independent iterators from a single iterable. `tee(seq, n)`

In [29]:
from itertools import tee


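def consecutivexy2(linearray):
    """Step 4 (alternative): get consecutive xy values using tee"""
    # This makes two independent iterators over the same iterable
    prevxy, nextxy = tee(linearray, 2)
    # Discard one value; the default avoids an error on empty input
    next(nextxy, None)
    yield from zip(prevxy, nextxy)  # Note here I am using "yield from"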

In [30]:
# Just to make sure things are working
for files in CSVfileGrabber(dirname):
    numline = 0
    for x, y in consecutivexy2(readxy(files)):
        #         print(x, y)
        numline += 1
    print('Parsed lines from this csv file is {}'.format(numline))

Working on: Fish1
Parsed lines from this csv file is 16
Working on: Fish2
Parsed lines from this csv file is 16
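To see what the pairing produces, here is the same `tee` idea applied to a tiny list of made-up x-y positions (the names `points`, `prev_it` and `next_it` are illustrative). Note that n positions give n - 1 consecutive pairs.

```python
from itertools import tee

points = [(0, 0), (3, 4), (6, 8)]   # made-up x-y positions
prev_it, next_it = tee(points, 2)   # two independent iterators
next(next_it, None)                 # advance the second one by one step
pairs = list(zip(prev_it, next_it))
print(pairs)   # [((0, 0), (3, 4)), ((3, 4), (6, 8))]
```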


### Sidenote : `yield from`
With `yield from`, we can skip an extra `for` loop

In [31]:
# A simple example to see what the yield from statement will do
A = range(5)
B = range(6, 11)


# Without yield from: one loop inside the generator...
def temp(range1, range2):
    for a, b in zip(range1, range2):
        yield a, b


# ...and a second loop to consume it. Two loops!!
for i in temp(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


In [32]:
# Python 3.3 and later: yield from
def yieldfromexample(A, B):
    yield from zip(A, B)


for i in yieldfromexample(A, B):
    print(i)

(0, 6)
(1, 7)
(2, 8)
(3, 9)
(4, 10)


`yield from` is especially useful when you have multiple iterators or recursive data structures.
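As a sketch of the recursive case, the hypothetical `flatten` generator below walks arbitrarily nested lists and delegates each sub-list to a recursive call with `yield from`.

```python
def flatten(nested):
    """Hypothetical helper: yield the leaves of arbitrarily nested lists."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)   # delegate to the recursive call
        else:
            yield item


flat = list(flatten([1, [2, [3, 4]], 5]))
print(flat)   # [1, 2, 3, 4, 5]
```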

## Write the next parts on your own
- Step 5 : Calculate distance between the two consecutive points
- Step 6 : Put it all together

In [33]:
# Step 5: Calculate euclidean distance
import math


def getdist(xy):
    """  
    Write a generator function that receives 
    the previous and next x-y location of the fish 
    and calculates the distance between the two points
    
    Euclidean distance between two points (x1, y1) and (x2, y2) is 
    sqrt((x1-x2)^2 + (y1-y2)^2)
   """

In [39]:
# Step 6: Put it all together
def getframes(dist, threshold, frames_per_sec):
    """
    Count frames with distance below a user-defined threshold and
    complete the print statement given below
    (Hint: use enumerate to find number of frames)
    
    Example:
    Of 16.27 seconds recording time, time spent with speed less than 10 is 12.83 seconds
    """
    
    print('Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds')

### Task 2: Solution
Will be inserted here
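Until then, here is one possible sketch of Steps 5 and 6, not the tutorial's official solution. The names `getdist_sketch` and `getframes_sketch` are made up to avoid clashing with the exercise stubs, the input pairs are invented, and the distance covered per frame is treated as the speed, as in the task description.

```python
import math


def getdist_sketch(xy_pairs):
    """Yield the Euclidean distance between each pair of consecutive points."""
    for (x1, y1), (x2, y2) in xy_pairs:
        yield math.hypot(x1 - x2, y1 - y2)


def getframes_sketch(dist, threshold, frames_per_sec):
    """Print and return (total seconds, seconds spent below threshold)."""
    total_frames = 0
    slow_frames = 0
    for d in dist:
        total_frames += 1
        if d < threshold:
            slow_frames += 1
    total_time = total_frames / frames_per_sec
    slow_time = slow_frames / frames_per_sec
    print('Of {:0.2f} seconds recording time, time spent with speed '
          'less than {} is {:0.2f} seconds'.format(
              total_time, threshold, slow_time))
    return total_time, slow_time


# Made-up consecutive positions: the distances are 5.0, 0.0 and 13.0
pairs = [((0, 0), (3, 4)), ((3, 4), (3, 4)), ((3, 4), (8, 16))]
total_time, slow_time = getframes_sketch(
    getdist_sketch(pairs), threshold=10, frames_per_sec=30)
```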

In [40]:
# Test your code with larger datasets
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for files in CSVfileGrabber(dirname):
    getframes(
        getdist(
            consecutivexy1(
                readxy(files))), threshold=10, frames_per_sec=30)

Working on: Fish1
Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds
Working on: Fish6
Of {:0.2f} seconds recording time, time spent with speed less than {} is {:0.2f} seconds


## The above statement that calls multiple generators looks ugly.
In such cases, with multiple generators lined up, `yield` can start to feel unintuitive and tedious.

Enter Toolz
<br> Toolz by Matt Rocklin - http://toolz.readthedocs.io/en/latest/
<br> It makes streaming super easy: intuitive and concise!

For more examples and explanation, see Elegant SciPy, written by the brilliant ASPP faculty - https://github.com/elegant-scipy/notebooks/blob/master/notebooks/ch8.ipynb

(Filed under things I can't believe I hardly used before this tutorial)

In [41]:
import toolz as tz

#### tz.pipe - passes a value through a sequence of functions - one by one
Pipe is simply syntactic sugar to make multiple function calls easy
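To show the idea without requiring Toolz, the sketch below uses a pure-Python stand-in, here called `pipe_sketch`, that mirrors the behaviour of `tz.pipe(data, *funcs)`. The helper `double` is made up for the example.

```python
from functools import reduce


def pipe_sketch(data, *funcs):
    """Pure-Python stand-in for tz.pipe: feed data through funcs in order."""
    return reduce(lambda value, func: func(value), funcs, data)


def double(values):
    """Made-up streaming step: lazily double every value."""
    return (2 * v for v in values)


nested = sum(double(range(4)))               # nested calls, read inside-out
piped = pipe_sketch(range(4), double, sum)   # same steps, read left to right
print(nested, piped)   # 12 12
```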

In [42]:
# This will do exactly as the previous call (without the added brackets)
# The function calls are cleaner and can be read from left to right - which is sooo much better
def pipeline(filename):
    pipe = tz.pipe(filename,
                readxy,
                consecutivexy1,
                getdist,
                getframes(threshold=10, frames_per_sec=30)
               )
    return pipe

In [43]:
dirname = '/Users/seetha/Desktop/Microbetest/Collective/'
for i in CSVfileGrabber(dirname):
    pipeline(i)

Working on: Fish1


TypeError: getframes() missing 1 required positional argument: 'dist'
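The error arises because `getframes(threshold=10, frames_per_sec=30)` is called immediately, before `tz.pipe` can supply `dist`; `tz.pipe` expects a function in that position, not the result of a call. One way around this, sketched here with the standard library's `functools.partial` (Toolz offers `curry` for the same purpose), is to bind the keyword arguments first and pass the resulting function along. `count_below` is a hypothetical stand-in for `getframes`.

```python
from functools import partial


def count_below(dist, threshold):
    """Hypothetical stand-in for getframes: count values under threshold."""
    return sum(1 for d in dist if d < threshold)


# partial returns a new function that still waits for `dist`
count_below_10 = partial(count_below, threshold=10)
n_slow = count_below_10([5.0, 0.0, 13.0])
print(n_slow)   # 2
```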