In [1]:
import sys
print("Python Version:", sys.version, '\n')

Python Version: 3.7.5 (default, Oct 25 2019, 10:52:18) 
[Clang 4.0.1 (tags/RELEASE_401/final)] 



# Advanced Data Types: The Collections Module

Over Python's lifetime, a handful of everyday tasks have turned out to be a regular pain for people. To address those, the `collections` module provides several specialized data types that smooth over these recurring issues in Python. Let's look at some of those types.

## DefaultDict

Dictionaries expect that you will create a key-value pair before using the value. That's pretty reasonable most of the time, but sometimes you just want it to assume some basic value whenever a new key is entered. See this example.

In [2]:
count = {}
count['duck'] = 0

animals = ['duck','duck','duck','goose']

for animal in animals:
    count[animal] += 1
    print(animal)

count

duck
duck
duck


KeyError: 'goose'

It didn't have a value for `goose`, so it couldn't add 1 to it. We can get around that with some try-except work, but that's sort of annoying. The `defaultdict` lets us specify ahead of time a factory function that supplies the starting value for any new key. For instance, if we tell it to use `int`, every new key starts at `int()`, which is 0.

In [3]:
count = {}

animals = ['duck','duck','duck','goose']

for animal in animals:
    try:
        count[animal] += 1
    except KeyError:
        count[animal] = 1

count

{'duck': 3, 'goose': 1}

In [4]:
from collections import defaultdict

count = defaultdict(int)
animals = ['duck','duck','duck','goose']

for animal in animals:
    count[animal] += 1
    
count

defaultdict(int, {'duck': 3, 'goose': 1})
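
The factory doesn't have to be `int`. As a minimal sketch (the `observations` data and variable names are invented for illustration), passing `list` gives every new key an empty list, which is handy for grouping. It also shows a quirk worth knowing: simply reading a missing key creates it.

```python
from collections import defaultdict

# Hypothetical (animal, note) pairs, invented for this illustration
observations = [('duck', 'quack'), ('goose', 'honk'), ('duck', 'waddle')]

groups = defaultdict(list)  # missing keys start out as an empty list
for animal, note in observations:
    groups[animal].append(note)

print(groups['duck'])   # ['quack', 'waddle']
print(groups['goose'])  # ['honk']

# Looking up a missing key inserts it with the default value
groups['swan']
print('swan' in groups)  # True
```

If that side effect is unwanted, `groups.get('swan')` or an `in` check leaves the dictionary unchanged.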

## Named Tuple

Sometimes you want to create a class, but the class only needs to store data, and you are lazy.

You could put the data in a dictionary, but every instance has the same fixed set of fields. You could put the data in a tuple, but then you need to remember the order. What if you could have the simplicity of a tuple, but with labels like a dictionary, so you can access each field by name? That's a **named tuple**.

In [5]:
from collections import namedtuple

Alumni = namedtuple('Alumni','name age gender degree title salary employer')

alice = Alumni(name='Alice',
               age=29,
               gender='F',
               degree='PhD',
               title='Data Scientist',
               salary=115000,
               employer='Thumbtack')

alice.age

29
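
A named tuple is still a tuple underneath, so it supports indexing and cannot be modified in place. The sketch below uses a hypothetical `Point` type (not part of the original example) to show both access styles, plus the `_replace` and `_asdict` helpers that every named tuple provides.

```python
from collections import namedtuple

# Hypothetical record type, invented for this illustration
Point = namedtuple('Point', 'x y')
p = Point(x=1, y=2)

# Fields are reachable by attribute name or by position
print(p.x, p[1])  # 1 2

# Named tuples are immutable, so assignment raises AttributeError
try:
    p.x = 10
except AttributeError:
    print("can't modify a namedtuple field")

# _replace returns a modified copy; _asdict converts to a mapping
moved = p._replace(x=10)
print(moved)                  # Point(x=10, y=2)
print(dict(moved._asdict()))  # {'x': 10, 'y': 2}
```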

## Deque

A deque (double-ended queue) is a lovely type of object that's designed for accessing data on either end. A normal list is only optimized for adding and removing on the right with things like `append` and `pop`; inserting or removing on the left forces every other element to shift. Deques are designed to be equally fast on both sides.

In [10]:
from collections import deque

d = deque([1,2,3,4])
d.appendleft(3)
d

deque([3, 1, 2, 3, 4])

In [11]:
d.popleft()

3
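
Beyond `appendleft` and `popleft`, a deque has a couple of other two-sided helpers. This is a small sketch with made-up values, showing `rotate` and `extendleft`.

```python
from collections import deque

d = deque([1, 2, 3, 4])

# rotate(n) shifts items n steps to the right; negative n shifts left
d.rotate(1)
print(d)  # deque([4, 1, 2, 3])
d.rotate(-2)
print(d)  # deque([2, 3, 4, 1])

# extendleft adds items one at a time, so they end up in reverse order
d.extendleft([10, 20])
print(d)  # deque([20, 10, 2, 3, 4, 1])
```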

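We can also use deques as a sliding window by setting `maxlen`. Once the deque is full, each new item pushes the oldest one off the opposite end, so we don't have to play weird games about chopping bits and pieces off if we want a fixed length.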

In [12]:
window = deque(maxlen=4)
for idx in range(10):
    window.append(idx)
    print(window)
    
print("---SWITCH---")
for idx in range(10):
    window.appendleft(idx)
    print(window)

deque([0], maxlen=4)
deque([0, 1], maxlen=4)
deque([0, 1, 2], maxlen=4)
deque([0, 1, 2, 3], maxlen=4)
deque([1, 2, 3, 4], maxlen=4)
deque([2, 3, 4, 5], maxlen=4)
deque([3, 4, 5, 6], maxlen=4)
deque([4, 5, 6, 7], maxlen=4)
deque([5, 6, 7, 8], maxlen=4)
deque([6, 7, 8, 9], maxlen=4)
---SWITCH---
deque([0, 6, 7, 8], maxlen=4)
deque([1, 0, 6, 7], maxlen=4)
deque([2, 1, 0, 6], maxlen=4)
deque([3, 2, 1, 0], maxlen=4)
deque([4, 3, 2, 1], maxlen=4)
deque([5, 4, 3, 2], maxlen=4)
deque([6, 5, 4, 3], maxlen=4)
deque([7, 6, 5, 4], maxlen=4)
deque([8, 7, 6, 5], maxlen=4)
deque([9, 8, 7, 6], maxlen=4)
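
One common use for a fixed-length window is a moving average. The following is a rough sketch, assuming a hypothetical helper named `moving_average` and invented sample values; it is just one way to do it.

```python
from collections import deque


def moving_average(values, size):
    """Return the mean of each full window of `size` consecutive values."""
    window = deque(maxlen=size)  # oldest value drops off automatically
    averages = []
    for value in values:
        window.append(value)
        if len(window) == size:  # skip until the window has filled up
            averages.append(sum(window) / size)
    return averages


print(moving_average([1, 2, 3, 4, 5, 6], size=3))  # [2.0, 3.0, 4.0, 5.0]
```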


# Generators

Generators aren't in the `collections` module; they are built into the Python language itself. They're extremely powerful and solve a lot of problems for us.

Often in an analysis, we don't really want to load a whole dataset into memory. We really just want a `cursor` that knows where it is in the data. For instance, imagine I was trying to load all the books ever written into Python... that's too big for my RAM. However, if I just had an object that kept track of which book it was on and what page it needs to read next, I could load things page-by-page. That's exactly what a generator does (although I've oversimplified a bit).

We can use that to give us data over and over, without having to pre-generate all the data. Let's see an example.

In [13]:
def generate_numbers():
    """
    An infinite number generator
    """
    x = 0
    while True:
        x += 1
        yield x # instead of return, I use yield, which makes this into a generator!
        
        
my_generator = generate_numbers()
for iteration in range(10):
    next_number = next(my_generator)
    print(next_number)

1
2
3
4
5
6
7
8
9
10


This could go on until infinity! Realistically, if I asked Python to generate an infinite `list` of numbers, I'd run out of RAM. But here, I've just asked Python to keep track of what number comes next, and to forget everything else. Then when it updates, it just says, "oh, this number comes next now". Let's prove to ourselves that Python isn't pre-generating the whole `list` by comparing the size in memory of the generator and the list.

In [14]:
from sys import getsizeof as sizeof

In [15]:
a = [idx for idx in range(200)]
b = (idx for idx in range(200)) # By wrapping in parens, this is a generator
print(sizeof(a))
print(sizeof(b))

1680
128


The list is 1680 bytes, while the generator is only 128 bytes! (The exact numbers vary with the Python version and platform.) That's because the generator isn't storing all the data, just a cursor to loop through the data.
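
The gap gets wider as the data grows. In this sketch (the sizes in `counts` are arbitrary choices), the list's footprint keeps increasing while the generator's stays flat. No expected output is shown because the byte counts depend on the interpreter.

```python
from sys import getsizeof

counts = (10, 1000, 100000)  # arbitrary sizes chosen for illustration
sizes = {}

for n in counts:
    as_list = [idx for idx in range(n)]
    as_gen = (idx for idx in range(n))
    sizes[n] = (getsizeof(as_list), getsizeof(as_gen))
    print(n, sizes[n])
```

Note that `getsizeof` on a list only counts the list's own array of references, not the integers it points to, so the true cost of the list is even higher.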

In [16]:
type(b)

generator

Generators are iterators, so we can loop through them with a `for` just like normal.

In [17]:
for ix in b:
    print(ix)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
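
One thing to keep in mind: a generator can only be walked through once. After a loop has consumed it, there is nothing left to hand out. A quick sketch with a made-up `squares` generator:

```python
squares = (n * n for n in range(5))

first_pass = list(squares)   # consumes every value
second_pass = list(squares)  # nothing left to yield

print(first_pass)   # [0, 1, 4, 9, 16]
print(second_pass)  # []

# next() on an exhausted generator raises StopIteration unless given a default
print(next(squares, 'exhausted'))  # exhausted
```

If you need the values again, create a fresh generator or store the results in a list.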


Why does this matter? Because if we want to work with large, streaming data, we can't always fit it into memory. The generator doesn't require the data to fit in memory; it just remembers where it is pulling the data from. For instance, which line of the CSV am I on? Then it hands you the next piece of data as you ask for it. You can keep adding data to a file, or always pull the most recent data, and use that with generators.
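
Here is a minimal sketch of that CSV idea. The helper name `read_counts` and the data are invented for illustration, and `io.StringIO` stands in for a real open file so the example runs anywhere.

```python
import csv
import io


def read_counts(handle):
    """Yield (animal, count) one row at a time instead of loading the file."""
    for row in csv.DictReader(handle):
        yield row['animal'], int(row['count'])


# io.StringIO stands in for an open file; the contents are made up
fake_file = io.StringIO("animal,count\nduck,3\ngoose,1\n")

total = 0
for animal, count in read_counts(fake_file):
    total += count  # only the current row is held in memory

print(total)  # 4
```

With a real file, you would pass the handle from `open(...)` inside a `with` block, and the loop would still only hold one row at a time.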