# Generators in Python

1. [Generator Expressions](#exp)
2. [Yield](#yield)
3. [Simple Use Cases](#ex)
   * [Reading Large Files](#csv)
   * [Generating Infinite Sequences](#seq)
   * [Detecting Palindromes](#pal)
4. [Advanced Generator Methods](#adv)
5. [Creating Data Pipelines with Generators](#data)

* Generator functions are a special kind of function that returns a lazy iterator, also called a _generator object_
    * __Lazy (call-by-need)__ - An evaluation strategy that delays evaluating an expression until its value is needed and avoids repeated evaluations
        * Benefits - define control flow, define potentially infinite data structures, improve performance, and avoid error conditions
        * Often combined with memoization, where function results are stored in an indexed lookup table
* Lazy iterators/generator objects can be looped over like a list, but their contents are not stored in memory
* Can be advanced by a for loop or with the built-in next() function, which retrieves the next value from the generator
    * next() is particularly helpful for testing generators in the console/terminal
    * next() accepts an optional default as a second argument, but only the generator itself is required
    * If there is no next value and no default was given, next() raises `StopIteration`
* Generators use the __yield__ keyword
    * Yield - indicates where a value is sent back to the caller, but doesn't exit the function like a return would
    * The state of the function is remembered, so when next() is called on a generator object, it provides the next iteration after where you previously left off

In [7]:
def my_gen():
    # range() already advances num, so no manual increment is needed
    for num in range(5):
        yield num
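
Stepping a generator by hand with next() can be sketched as below. The function name `count_up_to` is a hypothetical example, not part of the notes above. It also shows the optional second argument to next(), which is returned instead of raising `StopIteration` once the generator is exhausted.

```python
def count_up_to(limit):
    """Yield 0, 1, ..., limit - 1 one value at a time."""
    for num in range(limit):
        yield num


gen = count_up_to(2)
first = next(gen)              # first yielded value
second = next(gen)             # generator resumes where it left off
fallback = next(gen, "done")   # exhausted, so the default is returned

try:
    next(gen)                  # no default given: exhaustion raises StopIteration
    exhausted = False
except StopIteration:
    exhausted = True
```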

## Generator Expressions <a class="anchor" id="exp"></a>

* Generators can be written like list comprehensions, except with () rather than []
* Allow the creation of the generator object without building and holding the entire object in memory before iteration
    * Result is equivalent to a generator function, just with an implied yield in each iteration
* List comprehensions can be faster to evaluate than equivalent generator expressions when the list is _smaller_ than the running machine's available memory
    * If speed is an issue, but memory isn't, a list comprehension is likely better
* Often used with aggregate functions like sum, max, and min
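
A short sketch of generator expressions passed directly to aggregate functions. The input values are made up for illustration. When a generator expression is the only argument, the extra pair of parentheses can be dropped.

```python
# Sum of squares without building an intermediate list
squares_total = sum(x ** 2 for x in range(5))

# Length of the longest word
longest = max(len(word) for word in ["lazy", "iterator", "yield"])

# Smallest even number in a range, using a filter clause
smallest_even = min(n for n in range(1, 10) if n % 2 == 0)
```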

In [11]:
g = (x**2 for x in range(5))
g

<generator object <genexpr> at 0x7f1994610c10>

In [12]:
print(next(g))
print(next(g))
print(next(g))
print(next(g))
print(next(g))

0
1
4
9
16


In [16]:
# A generator object is far smaller than the equivalent list
# (sys.getsizeof measures only the container itself, not the items it refers to)
import sys

square_lc = [i ** 2 for i in range(10000)]
lc = sys.getsizeof(square_lc)

square_gc = (i ** 2 for i in range(10000))
gc = sys.getsizeof(square_gc)

print(f'List comprehension: {lc} bytes\nGenerator: {gc} bytes')

List comprehension: 87616 bytes
Generator: 112 bytes


## More about yield <a class="anchor" id="yield"></a>

* Primary job is to control the flow of a generator function
* Calling a generator function/using a generator expression returns a generator object, which can be assigned to a variable in order to use it
    * It has its own methods, etc, like any other object
* When the yield statement is hit, the program suspends function execution and returns the yielded value to the caller
    * Because it's suspended, not _ended_, it can resume from the same place the next time it's called upon
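
The suspend-and-resume behavior can be made visible with a small sketch. The function `two_step` and the `log` list are hypothetical names used only for this illustration; the list records which parts of the function body have run so far.

```python
def two_step(log):
    log.append("started")
    yield "first"                      # execution pauses here
    log.append("resumed after first yield")
    yield "second"                     # and pauses again here
    log.append("finished")


log = []
gen = two_step(log)          # calling the function runs none of its body yet
log_after_call = list(log)

a = next(gen)                # runs up to the first yield
log_after_first = list(log)

b = next(gen)                # resumes right after the first yield
```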

## Simple Use Cases <a class="anchor" id="ex"></a>

### Reading large files <a class="anchor" id="csv"></a>

In [None]:
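# Use a csv reader GENERATOR FUNCTION
# opens the file and yields one row at a time, so the whole file is never held in memory
def csv_reader(file_name):
    # the with block closes the file once the generator is exhausted or closed
    with open(file_name, "r") as f:
        for row in f:
            yield row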

# Count the number of rows in a large csv
csv_gen = csv_reader("some_csv.txt")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

### Generating an infinite sequence <a class="anchor" id="seq"></a>

In [4]:
# For a finite range, you call range and evaluate in a list
a = range(5)
list(a)

[0, 1, 2, 3, 4]

In [None]:
# A list of every integer would never fit in memory, so we need a generator
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1
# This can be used in a for loop and continues until ended by hand (e.g. Ctrl+C)
for i in infinite_sequence():
    print(i, end=" ")
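
An infinite generator does not have to be interrupted by hand. One option, sketched here, is to take a bounded number of values with `itertools.islice` from the standard library. The generator is redefined so the example is self-contained.

```python
from itertools import islice


def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1


# islice stops after five values, so the infinite loop is never run to completion
first_five = list(islice(infinite_sequence(), 5))
```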

### Detecting Palindromes <a class="anchor" id="pal"></a>

* Practical use for infinite sequences
* Palindrome detector will locate all sequences of letters or numbers that are palindromes
    * The function will take an input number, reverse it and see if it's the same as the original 

In [None]:
def is_palindrome(num):
    # skip single digit inputs
    if num // 10 == 0:
        return False
    temp = num
    reversed_num = 0
    
    # Reverse the digits: move the last digit of temp onto the end of reversed_num
    while temp != 0:
        reversed_num = (reversed_num * 10) + (temp % 10)
        temp = temp // 10
    
    if num == reversed_num:
        return num
    else:
        return False

# use generator to print the infinite sequence of palindromes
for i in infinite_sequence():
    pal = is_palindrome(i)
    if pal:
        print(pal)

## Advanced Generator Methods <a class="anchor" id="adv"></a>

* Yield is actually an _expression_ rather than a statement, though it can be used as either.
    * This allows its result to be used in other code (ex: assigning the value of a yield expression to a variable, as in the refactored sequence generator)
* send() - 
    * Sends a value into the generator
    * When execution picks up after yield, i takes the value that was sent to the generator
* throw() -
    * Raises an exception inside the generator at the point where it is paused
* close() -
    * Stops the generator
    * Raises GeneratorExit inside the generator; any later next() call raises StopIteration
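
A minimal sketch of the three methods on a toy generator. The function `echo` and the `events` list are hypothetical names for this illustration only. Note that a generator must be advanced to its first yield (here with next()) before a non-None value can be sent in.

```python
def echo(events):
    """Yield back whatever was last sent in; record when the generator is closed."""
    received = None
    try:
        while True:
            received = yield received
    except GeneratorExit:
        events.append("closed")   # close() raised GeneratorExit at the paused yield
        raise


events = []
gen = echo(events)
primed = next(gen)         # run to the first yield, which yields None
reply = gen.send("hi")     # the yield expression evaluates to "hi", which is yielded back
gen.close()                # stops the generator

try:
    next(gen)              # a closed generator is exhausted
    finished = False
except StopIteration:
    finished = True

gen2 = echo(events)
next(gen2)
try:
    gen2.throw(ValueError("bad input"))   # raised inside gen2 at its paused yield
except ValueError as exc:
    thrown = str(exc)      # not handled in the generator, so it propagates to the caller
```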

In [1]:
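# refactor palindrome code to only return T/F
def is_pal(num):
    # skip single digit inputs
    if num // 10 == 0:
        return False
    temp = num
    reversed_num = 0

    # Reverse the digits: move the last digit of temp onto the end of reversed_num
    while temp != 0:
        reversed_num = (reversed_num * 10) + (temp % 10)
        temp = temp // 10

    if num == reversed_num:
        return True
    else:
        return False
    
# Refactor sequence generator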
def infinite_palindromes():
    num = 0
    
    while True:
        if is_pal(num):
            i = (yield num) # yield expression
            if i is not None:
                num = i
        num += 1

* The main function for our new palindrome generator sends back the lowest number with one more digit
    * ie: if the palindrome is 121, then it will send() 1000
* If the number of digits reaches 5, throw a ValueError
* Updated to use close() to end the loop instead
1. Create a generator object and iterate through it, only yielding a value when a palindrome is found
2. Determine the number of digits in that palindrome
3. Send 10 ** digits to the generator
4. Execution now goes back to the generator logic and assigns 10 ** digits to i
5. Because i now has a value, the program updates num, increments and checks for palindromes again. 

* Once we find and yield another palindrome, we iterate via the for loop in the below code again, and the generator picks up at `i = (yield num)`. i is now None, because we didn't explicitly send it a value

In [None]:
# main function code 
pal_gen = infinite_palindromes()
for i in pal_gen:
    print(i)
    digits = len(str(i))
    if digits == 5:
        #pal_gen.throw(ValueError("We don't like large palindromes"))
        pal_gen.close()
        break  # the generator is closed, so calling send() again would raise StopIteration
    pal_gen.send(10 ** (digits))

* This is called a __coroutine__ - a generator function into which you can pass data
    * Useful (but not necessary) for building data pipelines
    * <a href= "http://www.dabeaz.com/coroutines/">For more about coroutines click here</a>

## Data Pipelines with Generators <a class="anchor" id="data"></a>

___Plan___
1. Read every line of a file
2. Split each line into a list of values
3. Extract column names
4. Use the column names and lists to create a dictionary
5. Filter out irrelevant data (rounds)
6. Calculate the total and average values for relevant data (rounds)

In [27]:
# Read each line from file with generator expression
file_name = "resources/generators.csv"
lines = (line for line in open(file_name))

# Use another generator exp. to split lines into lists
list_line = (s.rstrip().split(",") for s in lines)

* rstrip() removes trailing characters from strings; with no argument it removes whitespace, including newlines
* split() splits a str into a list at a given separator

___Common Design Pattern:___
* The generator `list_line` iterates through the first generator `lines`
    * `lines` retrieves a row from the file
    * `list_line` turns the line into a list

In [28]:
# Extract column names from the file
cols = next(list_line)
print(cols)

['permalink', 'company', 'numEmps', 'category', 'city', 'state', 'fundedDate', 'raisedAmt', 'raisedCurrency', 'round']


* Because the column names are typically the first row, calling next() on the generator advances the iterator over `list_line` once
* To help with filtering and performing operations on the data, make __another generator__ to convert it to dictionaries
    * Key = column names from cols
    * Value = data from list_line
* zip() takes in iterables and returns an iterator of tuples, which can then be converted to a given data type
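
A small sketch of pairing column names with one row using zip() and dict(). The row values here are hypothetical and only stand in for a line of the file.

```python
cols = ["company", "round", "raisedAmt"]
data = ["ExampleCo", "a", "5000"]    # hypothetical row, already split on commas

pairs = zip(cols, data)              # lazy iterator of (column, value) tuples
record = dict(pairs)                 # consuming the iterator builds the dictionary
leftover = list(pairs)               # the zip iterator is now exhausted
```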

In [29]:
# Use generator expression to create dictionaries
company_dicts = (dict(zip(cols,data)) for data in list_line)

* Use __another generator__ to filter the desired funding round and pull the amount raised 
    * Iterates through the results of company_dicts and takes the raisedAmt for any company_dict where the round key is 'a'  
    
___Remember that generators don't iterate through all items at once. In fact, they don't iterate through anything until we actually use a loop or a function that works on iterables___
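
This laziness can be observed directly. In the sketch below, `noisy_square` and `calls` are hypothetical names; the list records each time the function is actually called.

```python
calls = []


def noisy_square(x):
    calls.append(x)          # record every real evaluation
    return x * x


squares = (noisy_square(x) for x in range(3))
calls_before = list(calls)   # nothing has been evaluated yet

total = sum(squares)         # sum() drives the generator to completion
calls_after = list(calls)

second_pass = sum(squares)   # the generator is exhausted, so there is nothing left to add
```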

In [30]:
# Filter dictionaries for data of interest
funding = (
    int(company_dict['raisedAmt']) for company_dict in company_dicts if company_dict['round'] == 'a')

# Find the sum (works on iterables)
total_series_a = sum(funding)
print(f'Total series A fundraising: ${total_series_a}')

KeyError: 'round'

In [21]:
file_name = "resources/generators.csv"
lines = (line for line in open(file_name))
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)
company_dicts = (dict(zip(cols, data)) for data in list_line)
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")

KeyError: 'round'
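
The notes do not show the contents of `resources/generators.csv`, so the cause of the `KeyError` above cannot be confirmed here. One possibility (an assumption) is a row with fewer fields than the header, such as a blank trailing line: zip() stops at the shorter input, so the resulting dictionary has no `round` key. The sketch below runs the same pipeline on hypothetical in-memory data, uses `dict.get()` to skip such rows, and also computes the average from step 6 of the plan.

```python
import io

# Hypothetical sample data standing in for resources/generators.csv
sample = io.StringIO(
    "company,raisedAmt,round\n"
    "ExampleCo,1000,a\n"
    "OtherCo,3000,b\n"
    "ThirdCo,2000,a\n"
    "\n"  # blank trailing line: yields a row without a 'round' key
)

lines = (line for line in sample)
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)
company_dicts = (dict(zip(cols, data)) for data in list_line)

# .get() returns None for rows lacking 'round' instead of raising KeyError.
# A list is used here because both the sum and the count are needed,
# and a generator can only be consumed once.
amounts = [int(d["raisedAmt"]) for d in company_dicts if d.get("round") == "a"]

total_series_a = sum(amounts)
average_series_a = total_series_a / len(amounts) if amounts else 0
```

Collecting the filtered amounts into a list gives up some laziness, which is acceptable when only the matching rows are kept; a running total and counter in a for loop would avoid the list entirely.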