In [1]:
import sys
print("Python Version:", sys.version, '\n')

# Advanced Data Types: The Collections Module

Over the years, a handful of everyday tasks have proven to be a regular pain in Python. To address them, the `collections` module provides several specialized data types that smooth over those rough spots. Let's look at some of those types.

## DefaultDict

Dictionaries expect that you will create a key-value pair before using the value. That's pretty reasonable most of the time, but sometimes you just want the dictionary to assume some basic value the first time a new key is used. See this example.

In [2]:
count = {}
count['duck'] = 0

animals = ['duck','duck','duck','goose']

for animal in animals:
    count[animal] += 1
    print(animal)

count

duck
duck
duck


KeyError: 'goose'

It didn't have a value for `goose`, so it couldn't add 1 to it. We can get around that with some try-except work, but that's sort of annoying. A `defaultdict` lets us specify ahead of time a function that produces the starting value for any new key. For instance, if we give it `int`, every missing key starts at `int()`, which is 0.

In [3]:
count = {}

animals = ['duck','duck','duck','goose']

for animal in animals:
    try:
        count[animal] += 1
    except KeyError:
        count[animal] = 1

count

{'duck': 3, 'goose': 1}

In [4]:
from collections import defaultdict

count = defaultdict(int)
animals = ['duck','duck','duck','goose']

for animal in animals:
    count[animal] += 1
    
count

defaultdict(int, {'duck': 3, 'goose': 1})
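
The default doesn't have to be a number. As a minimal sketch (the `observations` data here is made up for illustration), passing `list` makes every new key start out as an empty list, which is handy for grouping. It also shows one caveat: just reading a missing key is enough to create it.

```python
from collections import defaultdict

# Hypothetical sample data: (animal, note) pairs
observations = [('duck', 'quack'), ('goose', 'honk'), ('duck', 'waddle')]

# list() returns [], so every new key starts as an empty list
by_animal = defaultdict(list)
for animal, note in observations:
    by_animal[animal].append(note)

print(dict(by_animal))

# Caveat: merely reading a missing key inserts it with the default value
print('swan' in by_animal)   # False
by_animal['swan']
print('swan' in by_animal)   # True
```

If you don't want lookups to add keys, a plain dictionary with `count.get(animal, 0) + 1` is a reasonable alternative.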

## Named Tuple

Sometimes you want to create a class, but the class only needs to store data, and you are lazy.

You could put the data in a dictionary, but each instance has the same fixed set of fields, so the flexibility of a dictionary is wasted. You could put the data in a tuple, but then you need to remember the order. What if you could have the simplicity of a tuple, plus labels like a dictionary, so that each value can be accessed by name? That's a **named tuple**.

In [5]:
from collections import namedtuple

Alumni = namedtuple('Alumni','name age gender degree title salary employer')

alice = Alumni(name='Alice',
               age=29,
               gender='F',
               degree ='PhD',
               title = 'Data Scientist',
               salary = 115000,
               employer = 'Thumbtack')

# Call any attribute of 'alice' instance of Alumni using dot-notation:
alice.age

29
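
A named tuple is still a tuple underneath, so it is immutable and supports indexing and unpacking. Here is a small sketch of that behavior, using a hypothetical two-field `Point` type instead of `Alumni` to keep it short:

```python
from collections import namedtuple

# A smaller, hypothetical record type to keep the example short
Point = namedtuple('Point', 'x y')
p = Point(x=3, y=4)

# Still a tuple: indexing and unpacking both work
x, y = p
print(p[0], x, y)

# Immutable: assigning to a field raises AttributeError
try:
    p.x = 10
except AttributeError as err:
    print("cannot assign:", err)

# _replace returns a new instance; _asdict converts to a dictionary
moved = p._replace(x=10)
print(moved)
print(p._asdict())
```

To "change" a field you build a new instance with `_replace`, which leaves the original untouched.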

## Deque

A deque (double-ended queue) is a lovely type of object that's designed for accessing data on either end. A normal list is only optimized for adding and removing from the right with things like `append` and `pop`; adding or removing at the left forces every other element to shift over. Deques are designed to be equally fast on both sides.

In [6]:
from collections import deque

d = deque([1,2,3,4])
d.appendleft(3)
d

deque([3, 1, 2, 3, 4])

In [7]:
# Specify left side
d.popleft()

3

In [8]:
d

deque([1, 2, 3, 4])

In [9]:
# default append/pop is on the RIGHT
d.append(5)
d

deque([1, 2, 3, 4, 5])

In [10]:
d.pop()

5

In [11]:
d

deque([1, 2, 3, 4])

We can also use deques as a sliding window: with `maxlen` set, adding to a full deque automatically drops an item from the opposite end, so we don't have to play weird games chopping bits and pieces off to keep a fixed length.

In [12]:
window = deque(maxlen=4)
for idx in range(10):
    window.append(idx)
    print(window)
    
print("---SWITCH---")
for idx in range(10):
    window.appendleft(idx)
    print(window)

deque([0], maxlen=4)
deque([0, 1], maxlen=4)
deque([0, 1, 2], maxlen=4)
deque([0, 1, 2, 3], maxlen=4)
deque([1, 2, 3, 4], maxlen=4)
deque([2, 3, 4, 5], maxlen=4)
deque([3, 4, 5, 6], maxlen=4)
deque([4, 5, 6, 7], maxlen=4)
deque([5, 6, 7, 8], maxlen=4)
deque([6, 7, 8, 9], maxlen=4)
---SWITCH---
deque([0, 6, 7, 8], maxlen=4)
deque([1, 0, 6, 7], maxlen=4)
deque([2, 1, 0, 6], maxlen=4)
deque([3, 2, 1, 0], maxlen=4)
deque([4, 3, 2, 1], maxlen=4)
deque([5, 4, 3, 2], maxlen=4)
deque([6, 5, 4, 3], maxlen=4)
deque([7, 6, 5, 4], maxlen=4)
deque([8, 7, 6, 5], maxlen=4)
deque([9, 8, 7, 6], maxlen=4)
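
One common use for that fixed-length window is a moving average. The helper below is a hypothetical sketch (the name `moving_average` and its arguments are not from any library), just to show the window doing the bookkeeping for us:

```python
from collections import deque

def moving_average(values, size):
    """Return the mean of each full window of `size` consecutive values."""
    window = deque(maxlen=size)
    averages = []
    for value in values:
        window.append(value)           # the oldest value falls off automatically
        if len(window) == size:        # only report once the window is full
            averages.append(sum(window) / size)
    return averages

print(moving_average([1, 2, 3, 4, 5, 6], size=3))
```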


# Generators

Generators aren't in the `collections` module, but are instead built into the Python language itself. They're extremely powerful and solve a lot of problems for us.

Often in an analysis, we don't really want to load a whole dataset into memory. We really just want a `cursor` that knows where it is in the data. For instance, imagine I was trying to load all the books ever written into Python... that's too big for my RAM. However, if I just had an object that kept track of which book it was on, and what page it needs to read next, I could load things page-by-page. That's exactly what a generator does (although I've oversimplified a bit).

A **generator function** is defined like a normal function, but whenever it needs to generate a value, it does so with the `yield` keyword instead of `return`. If the body of a function contains `yield`, the function automatically becomes a generator function.

A generator function returns a **generator object**. Generator objects are used either by calling the built-in `next()` function on the generator object or by using the generator object in a `for...in...` loop.

We can use that to give us data over and over, without having to pre-generate all the data. Let's see an example.

Below: 
* `generate_numbers` is a **generator function**
* `my_generator` is a **generator object**

In [13]:
# This is a generator function
def generate_numbers():
    """
    An infinite number generator
    """
    x = 0
    while True:
        x += 1
        yield x # instead of return, I use yield, which makes this into a generator!
        
# This is a generator object:         
my_generator = generate_numbers()

# Calling the generator object
for iteration in range(10):
    next_number = next(my_generator)
    print(next_number)

1
2
3
4
5
6
7
8
9
10


This could go on until infinity! Now realistically, if I asked Python to generate an infinite `list` of numbers, I'd run out of RAM. But here, I've just asked Python to keep track of what number comes next, and to forget everything else. Then when it updates, it just says, "oh this number comes next now". Let's prove to ourselves that Python isn't pre-generating the whole `list` by comparing the size in memory of the generator and the list.

In [14]:
from sys import getsizeof as sizeof

In [15]:
a = [idx for idx in range(200)]
b = (idx for idx in range(200)) # By wrapping in parens, this is a generator
print(sizeof(a))
print(sizeof(b))

1664
112


The list is 1664 bytes, the generator is only 112 bytes! (The exact numbers vary by Python version and platform.) That's because the generator is not storing all the data, just a cursor to loop through the data.
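
To make the point sharper, here is a small sketch that repeats the comparison at a few different lengths. The helper names `make_list` and `make_generator` are made up for this example. Note that `getsizeof` on a list only counts the list's own array of references, not the integers it points to, so the real gap is even larger.

```python
from sys import getsizeof

def make_list(n):
    return [idx for idx in range(n)]

def make_generator(n):
    return (idx for idx in range(n))

# The list grows with n; the generator stays the same size
for n in (10, 1_000, 100_000):
    print(n, getsizeof(make_list(n)), getsizeof(make_generator(n)))
```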

In [16]:
type(a)

list

In [17]:
type(b)

generator

In [18]:
print(b)

<generator object <genexpr> at 0x7fc44f21e120>


Generators are iterables, so we can loop through them with a `for` just like normal.

In [None]:
# This will print all 200 numbers, one per line
for ix in b:
    print(ix)

Why does this matter? Because if we want to work with large, streaming data, we can't always fit it into memory. The generator doesn't need the data to fit in memory; it just remembers where it is pulling the data from... for instance, what line in the CSV am I on? Then it hands you the next piece of data as you ask for it. You can keep adding data to a file, or always pull the most recent data, and use that with generators.
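
As a rough sketch of that idea, the hypothetical `read_rows` generator below parses one comma-separated line at a time. An `io.StringIO` object stands in for a real open file so the example is self-contained; the data in it is made up.

```python
import io

def read_rows(lines):
    """Yield one parsed row at a time instead of building a full list."""
    for line in lines:
        yield line.strip().split(',')

# io.StringIO stands in for an open file here
fake_file = io.StringIO("duck,3\ngoose,1\nswan,2\n")

rows = read_rows(fake_file)
first_row = next(rows)      # only the first line has been parsed so far
print(first_row)

remaining = list(rows)      # pull everything that is left
print(remaining)
```

With a real file you would pass the object returned by `open(...)` instead, since file objects also hand out one line at a time when iterated.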

### More notes on Generators

**iteration** - reading items one by one. 
* Everything you can use a `for...in...` loop on is an iterable
* Iterables: lists, strings, files... 
* You can read iterables like lists and strings as many times as you wish, but they store all their values in memory, and this is not always what you want when you have a lot of values!

**Generators** are iterators that you can ***only iterate over once***. Generators do *not* store all the values in memory - **they generate the values on the fly**. 
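
One way to see the difference, sketched below: asking a list for an iterator gives back a fresh object every time, while a generator simply returns itself, so there is only ever one pass to use up.

```python
numbers = [0, 1, 4]
squares = (x * x for x in range(3))

# A list hands out a new, independent iterator on each request
print(iter(numbers) is numbers)   # False

# A generator is its own iterator
print(iter(squares) is squares)   # True

first_pass = list(squares)
second_pass = list(squares)       # nothing left to produce
print(first_pass, second_pass)
```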

In [19]:
# Another example: 
this_generator = (x*x for x in range(3))

for i in this_generator:
    print(i)

0
1
4


Important! 
* Use `()` instead of `[]` to make it a `generator` instead of a `list`
* You ***cannot*** perform `for i in this_generator` a second time since **generators can only be used *once***: they calculate 0, then forget about it and calculate 1, and end by calculating 4, one by one. 

Below - trying to loop over `this_generator` a second time prints *nothing*:

In [20]:
for i in this_generator:
    print(i)

The `yield` keyword is used like `return`, except that calling the function returns a generator instead of a single value.

In [21]:
def create_generator():
    my_list = range(3)
    for i in my_list:
        yield i*i
        
# Create a generator: 
another_generator = create_generator()

#another_generator is an object
print(another_generator)

<generator object create_generator at 0x7fc44f1ff040>


In [22]:
# First use of another_generator

for i in another_generator:
    print(i)

0
1
4


In [23]:
# Trying to use another_generator a second time -- no output!
for i in another_generator:
    print(i)

With `yield`, when you **call the function, the code you have written in the function body *does not run***. The function only returns the *generator object*!

Your code will continue where it left off each time `for` uses the generator. 

The first time the `for` loop asks the generator object for a value, it runs the function's code from the beginning until it hits `yield`, then hands back the first value. Each subsequent request resumes right after that `yield`, runs until the next `yield`, and hands back the next value. This continues until the generator is considered empty, which happens when the function finishes without hitting another `yield`. That can be because the loop has come to an end, or because an `if/else` condition is no longer satisfied. 

We can see this in action by calling `next()` on `my_generator` from earlier in this notebook a second time. Using the exact same code and `range(10)`, the generator will pick up where it left off earlier:

In [24]:
for iteration in range(10):
    next_number = next(my_generator)
    print(next_number)

11
12
13
14
15
16
17
18
19
20


In [25]:
# And again: 
for iteration in range(10):
    next_number = next(my_generator)
    print(next_number)

21
22
23
24
25
26
27
28
29
30
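
The pause-and-resume behavior can also be traced step by step. In this sketch, the hypothetical `traced` generator function records a message in the `events` list at each stage, so we can see exactly when the function body runs relative to the loop that consumes it.

```python
events = []

def traced():
    events.append("body started")
    for i in range(2):
        events.append(f"about to yield {i}")
        yield i
    events.append("body finished")

gen = traced()                        # nothing in the body has run yet
events.append("generator created")

for value in gen:
    events.append(f"loop received {value}")

for event in events:
    print(event)
```

Note that "generator created" is recorded before "body started": calling the function only builds the generator object, and the body does not start until the first value is requested.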


Another example - here's a generator for Fibonacci numbers:

In [26]:
# Creating a Fibonacci number generator-function
def fib(limit):
    
    # initialize the first two Fibonacci numbers: 
    a, b = 0, 1
    
    # one by one, yield next fib num
    while a < limit:
        yield a
        a, b = b, a + b
        
# Creating a fibonacci number generator-object with limit of 5
keep_fibbin = fib(5)

In [27]:
# Prints first num of Fib-sequence
print(keep_fibbin.__next__())

0


In [28]:
# Call it a few more times
print(keep_fibbin.__next__())
print(keep_fibbin.__next__())
print(keep_fibbin.__next__())
print(keep_fibbin.__next__())

1
1
2
3


In [29]:
# What about after we've reached the limit?
print(keep_fibbin.__next__())

StopIteration: 

Can't do it - the generator object has reached its limit and is now exhausted. 

We can, however, call the generator-*function* as many times as we want: 

In [30]:
for i in fib(5): 
    print(i)

0
1
1
2
3
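
When stepping through a generator by hand, the `StopIteration` error doesn't have to be fatal. A minimal sketch of two options: the built-in `next()` accepts a second argument to return instead of raising, or the exception can be caught directly. The `fib` function is repeated here so the example runs on its own.

```python
def fib(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

gen = fib(5)
values = []
while True:
    value = next(gen, None)    # returns None instead of raising StopIteration
    if value is None:
        break
    values.append(value)
print(values)

# The generator is now exhausted, so a bare next() raises StopIteration
try:
    next(gen)
except StopIteration:
    print("generator is exhausted")
```

The default-value form only works cleanly when the default (here `None`) can never be a real value produced by the generator.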


### Further Reading
* [How to use generators and yield in Python](https://realpython.com/introduction-to-python-generators/)
* [Python Wiki on Generators](https://wiki.python.org/moin/Generators)
* [Python Generators](https://www.programiz.com/python-programming/generator)