# Chapter 1
# Data Structures and Algorithms

## Table of Contents:
1. [Unpacking a Sequence into Separate Variables](#1.1-Unpacking-a-Sequence-into-Separate-Variables)
2. [Unpacking elements from Iterables of Arbitrary Length](#1.2-Unpacking-elements-from-Iterables-of-Arbitrary-Length)
3. [Keeping the Last N Items](#1.3-Keeping-the-Last-N-Items)
4. [Finding the Largest or Smallest N Items](#1.4-Finding-the-Largest-or-Smallest-N-Items)
5. [Implementing a Priority Queue](#1.5-Implementing-a-Priority-Queue)
6. [Mapping Keys to Multiple Values in a Dictionary](#1.6-Mapping-Keys-to-Multiple-Values-in-a-Dictionary)
7. [Keeping Dictionaries in Order](#1.7-Keeping-Dictionaries-in-Order)
8. [Calculating with Dictionaries](#1.8--Calculating-with-Dictionaries)
9. [Finding Commonalities in Two Dictionaries](#1.9-Finding-Commonalities-in-Two-Dictionaries)
10. [Removing Duplicates from a Sequence while Maintaining Order](#1.10-Removing-Duplicates-from-a-Sequence-while-Maintaining-Order)
11. [Naming a Slice](#1.11-Naming-a-Slice)
12. [Determining the Most Frequently Occurring Items in a Sequence](#1.12-Determining-the-Most-Frequently-Occurring-Items-in-a-Sequence)
13. [Sorting a List of Dictionaries by a Common Key](#1.13-Sorting-a-List-of-Dictionaries-by-a-Common-Key)
14. [Sorting Objects Without Native Comparison Support](#1.14-Sorting-Objects-Without-Native-Comparison-Support)
15. [Group Records Together Based on a Field](#1.15-Group-Records-Together-Based-on-a-Field)
16. [Filtering Sequence Elements](#1.16-Filtering-Sequence-Elements)
17. [Extracting a Subset of a Dictionary](#1.17-Extracting-a-Subset-of-a-Dictionary)
18. [Mapping Names to Sequence Elements](#1.18-Mapping-Names-to-Sequence-Elements)
19. [Transforming and Reducing Data at the Same Time](#1.19-Transforming-and-Reducing-Data-at-the-Same-Time)
20. [Combining Multiple Mappings into a Single Mapping](#1.20-Combining-Multiple-Mappings-into-a-Single-Mapping)

## 1.1 Unpacking a Sequence into Separate Variables

You want to unpack an N-element tuple or sequence into a collection of N variables.

This can be done with a simple assignment operation. The only requirement is that the number and structure of the variables match the sequence.

In [1]:
p = (4, 5)
a, b = p
a, b

(4, 5)

In [2]:
data = 'damn you beauty'.split(' ')
word1, word2, word3 = data

stuff = ['some stuff happens', (2017, 7, 12)]
event, (year, month, day) = stuff
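
As a minimal sketch of what happens when the counts do not match, the following snippet tries to unpack a two-element tuple into three variables and catches the resulting `ValueError`. The names `x`, `y`, `z`, and `message` are illustrative only.

```python
p = (4, 5)

try:
    # Three target names, but the tuple only holds two values.
    x, y, z = p
except ValueError as exc:
    message = str(exc)
    print(message)
```

The same exception type is raised when there are too many values for the available names.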

**NOTE**: Unpacking actually works with any iterable object, not just tuples or lists. This includes strings, files, iterators, and generators.

In [3]:
s = 'Hello'
a, b, c, d, e = s

When unpacking, if you want to discard certain values, you can just pick a throwaway variable name such as `_`:

In [4]:
_, date = stuff
date

(2017, 7, 12)



## 1.2 Unpacking elements from Iterables of Arbitrary Length

You need to unpack N elements from an iterable, but the iterable may be longer than N elements, causing a "too many values to unpack" exception.

Python "star expressions" can be used to address this problem. For example, suppose you run a course and decide at the end of the semester that you're going to drop the first and last homework grades and only average the rest of them. A star expression makes it easy:

In [6]:
from statistics import mean

def drop_first_last(grades):
    first, *middle, last = grades
    return mean(middle)  # average of everything except the first and last grade

Another use case: suppose you have user records that consist of a name and email address, followed by an arbitrary number of phone numbers. You could unpack the records like this:

In [7]:
record = ('Tu', 'tu@random.com', '773-555-1212', '267-694-9395')
name, email, *phone_numbers = record
name

'Tu'

In [8]:
phone_numbers

['773-555-1212', '267-694-9395']

**NOTE**: The `phone_numbers` variable will always be a list, regardless of how many elements are unpacked (including none). Thus, any code that uses `phone_numbers` won't have to account for the possibility that it might not be a list.
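
To make the "always a list" point concrete, here is a small sketch with a record that has no phone numbers at all. The record contents are made up for illustration.

```python
# A record with no phone numbers at all (hypothetical data).
record = ('Tu', 'tu@random.com')
name, email, *phone_numbers = record

print(phone_numbers)        # an empty list, not None
print(len(phone_numbers))   # list operations are safe to call on it
```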

The starred variable can also be the first one in the list.

In [9]:
*trailing, current = [10, 9, 8, 7, 6, 10, 23, 100]
print(trailing)
print(current)

[10, 9, 8, 7, 6, 10, 23]
100


It's also worth noting that the star syntax can be especially useful when iterating over a sequence of tuples of varying length.

In [10]:
records = [
    ('foo', 1, 2),
    ('bar', 'hello'),
    ('foo', 3, 4),
]

def do_foo(x, y):
    print('foo', x, y)
    
def do_bar(s):
    print('bar', s)
    
for tag, *args in records:
    if tag == 'foo':
        do_foo(*args)
    elif tag == 'bar':
        do_bar(*args)

foo 1 2
bar hello
foo 3 4


Star unpacking can also be useful when combined with string processing:

In [11]:
line = 'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false'
uname, *fields, homedir, sh = line.split(':')
print(uname)
print(homedir)
print(fields)
print(sh)

nobody
/var/empty
['*', '-2', '-2', 'Unprivileged User']
/usr/bin/false


In [12]:
record = ('ACME', 50)



## 1.3 Keeping the Last N Items

You want to keep a limited history of the last few items seen during iteration or during some other kind of processing.

Keeping a limited history is a perfect use for a `collections.deque`. For example, the following code performs a simple text match on a sequence of lines and yields the matching line along with the previous N lines of context when found:

In [14]:
from collections import deque

def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

When writing code to search for items, it is common to use a generator function involving `yield`, as shown in this recipe's solution. This decouples the process of searching from the code that uses the results.
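
The decoupling described above might look like this in practice. This is only a sketch: the `lines` list is hypothetical sample data, and `search()` is repeated so the snippet runs on its own.

```python
from collections import deque

def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

# Hypothetical input; any iterable of strings (such as an open file) would do.
lines = ['alpha', 'beta', 'python one', 'gamma', 'delta', 'python two']

matches = []
for line, prevlines in search(lines, 'python', history=2):
    # Copy the context: the generator keeps reusing the same deque object.
    matches.append((line, list(prevlines)))

print(matches)
```

The consuming loop decides what to do with each match (print it, store it, stop early), while `search()` only knows how to find matches.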

Using `deque(maxlen=N)` creates a fixed-sized queue. When new items are added and the queue is full, the oldest item is automatically removed. For example:

In [15]:
q = deque(maxlen=3)
q.append(1)
q.append(2)
q.append(3)
print(q)
q.append(4)
print(q)

deque([1, 2, 3], maxlen=3)
deque([2, 3, 4], maxlen=3)


More generally, a `deque` can be used whenever you need a simple queue structure. If you don't give a maximum size, you get an unbounded queue that lets you append and pop items on either end.

In [16]:
q = deque()
q.append(1)
q.append(2)
q.append(3)
q

deque([1, 2, 3])

In [17]:
q.appendleft(4)
q

deque([4, 1, 2, 3])

In [18]:
q.pop()
q

deque([4, 1, 2])

In [19]:
q.popleft()
q

deque([1, 2])

Adding or popping items from either end of a queue has O(1) complexity. This is unlike a list where inserting or removing items from the front of the list is O(N).



## 1.4 Finding the Largest or Smallest N Items

You want to make a list of the largest or smallest N items in a collection.

The `heapq` module has two functions - `nlargest()` and `nsmallest()` - that do exactly what you want:

In [21]:
import heapq

nums = [1, 8, 23, 2, 7, -4, 18, 23, 42, 37, 2]
print(heapq.nlargest(3, nums))
print(heapq.nsmallest(3, nums))

[42, 37, 23]
[-4, 1, 2]


Both functions also accept a key parameter that allows them to be used with more complicated data structures. For example:

In [22]:
portfolio = [
    {'name': 'IBM', 'shares': 100, 'price': 91.1},
    {'name': 'AAPL', 'shares': 50, 'price': 543.22},
    {'name': 'FB', 'shares': 200, 'price': 21.09},
    {'name': 'HPQ', 'shares': 35, 'price': 31.75},
    {'name': 'YHOO', 'shares': 45, 'price': 16.35},
    {'name': 'ACME', 'shares': 75, 'price': 115.65}
]

cheap = heapq.nsmallest(3, portfolio, key=lambda s: s['price'])
expensive = heapq.nlargest(3, portfolio, key=lambda s: s['price'])

In [23]:
print(cheap)
print(expensive)

[{'name': 'YHOO', 'shares': 45, 'price': 16.35}, {'name': 'FB', 'shares': 200, 'price': 21.09}, {'name': 'HPQ', 'shares': 35, 'price': 31.75}]
[{'name': 'AAPL', 'shares': 50, 'price': 543.22}, {'name': 'ACME', 'shares': 75, 'price': 115.65}, {'name': 'IBM', 'shares': 100, 'price': 91.1}]


Underneath the covers, these functions work by first converting the data into a list where items are ordered as a heap. For example:

In [24]:
nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]

heap = list(nums)
heapq.heapify(heap)
heap

[-4, 2, 1, 23, 7, 2, 18, 23, 42, 37, 8]

The most important feature of a heap is that `heap[0]` is always the smallest item. Moreover, subsequent items can be easily found using the `heapq.heappop()` function, which pops off the first item and replaces it with the next smallest item (an operation that takes O(log N) time, where N is the size of the heap). For example, to find the three smallest items, you would do this:

In [25]:
heapq.heappop(heap)

-4

In [26]:
heapq.heappop(heap)

1

In [27]:
heapq.heappop(heap)

2

The `nlargest()` and `nsmallest()` functions are most appropriate if you are trying to find a relatively small number of items. If you are simply trying to find the single smallest or largest item (N=1), it is faster to use `min()` and `max()`. Similarly, if N is about the same size as the collection itself, it is usually faster to sort it first and take a slice (i.e., use `sorted(items)[:N]` or `sorted(items)[-N:]`). It should be noted that the actual implementation of `nlargest()` and `nsmallest()` is adaptive in how it operates and will carry out some of these optimizations on your behalf (e.g., using sorting if N is close to the same size as the input).
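
The alternatives mentioned above can be sketched as follows. The snippet only checks that the different approaches agree on the results; it does not measure speed, and the value of `n` is arbitrary.

```python
import heapq

nums = [1, 8, 23, 2, 7, -4, 18, 23, 42, 37, 2]

# N = 1: min() and max() give the same answers as the heap functions.
smallest = min(nums)
largest = max(nums)

# N close to len(nums): sort once and take a slice.
n = 8
smallest_n = sorted(nums)[:n]
largest_n = sorted(nums, reverse=True)[:n]

print(smallest == heapq.nsmallest(1, nums)[0])
print(smallest_n == heapq.nsmallest(n, nums))
print(largest_n == heapq.nlargest(n, nums))
```

Note that `sorted(items)[-N:]` returns the largest items in ascending order, whereas `nlargest()` returns them in descending order, which is why the sketch sorts with `reverse=True`.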



## 1.5 Implementing a Priority Queue

You want to implement a queue that sorts items by a given priority and always returns the item with the highest priority on each pop operation.

The following class uses the `heapq` module to implement a simple priority queue:

In [34]:
import heapq


class PriorityQueue(object):
    
    def __init__(self):
        self._queue = []
        self._index = 0
        
    def push(self, item, priority):
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1
        
    def pop(self):
        return heapq.heappop(self._queue)[-1]

In [40]:
pq = PriorityQueue()
pq.push('foo', 1)
pq.push('bar', 5)
pq.push('spam', 4)
pq.push('grok', 1)

In [41]:
pq.pop()

'bar'

In [42]:
pq.pop()

'spam'

In [43]:
pq.pop()

'foo'

In [44]:
pq.pop()

'grok'

The core of this recipe concerns the use of the `heapq` module. The functions `heapq.heappush()` and `heapq.heappop()` insert and remove items from a list `_queue` in a way such that the first item in the list has the smallest priority. The `heappop()` method always returns the **smallest** item, so that is the key to making the queue pop the correct items. Moreover, since the push and pop operations have O(log N) complexity where N is the number of items in the heap, they are fairly efficient even for fairly large values of N.

In this recipe, the queue consists of tuples of the form `(-priority, index, item)`. The `priority` value is negated to get the queue to sort items from highest priority to lowest priority. This is the opposite of the normal heap ordering, which sorts from lowest to highest.

The role of the `index` variable is to properly order items with the same priority level. By keeping a constantly increasing index, the items will be sorted according to the order in which they were inserted. The index also plays an important role in making the comparison operations work when two items with the same priority level cannot be compared with each other.

If you make `(priority, item)` tuples, they can be compared as long as the priorities are different. However, if two tuples with equal priorities are compared, Python falls back to comparing the items themselves, and the comparison fails with a `TypeError` if the items do not support ordering.

By introducing the extra index and making `(priority, index, item)` tuples, you avoid this problem entirely since no two tuples will ever have the same value for `index` (and Python never bothers to compare the remaining tuple values once the result of the comparison can be determined).
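
The tuple-comparison behavior can be checked with a short sketch. The `Item` class below is a hypothetical payload type that defines no ordering methods; it is not part of the recipe's `PriorityQueue`.

```python
class Item:
    """Hypothetical payload type that defines no ordering methods."""
    def __init__(self, name):
        self.name = name

# Different priorities: the first elements decide, so items are never compared.
different_ok = (1, Item('foo')) < (5, Item('bar'))

# Equal priorities: Python falls back to comparing the Item objects.
try:
    (1, Item('foo')) < (1, Item('grok'))
    tie_failed = False
except TypeError:
    tie_failed = True

# Equal priorities plus a unique index: the index breaks the tie.
with_index_ok = (1, 0, Item('foo')) < (1, 1, Item('grok'))

print(different_ok, tie_failed, with_index_ok)
```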



## 1.6 Mapping Keys to Multiple Values in a Dictionary

You want to make a dictionary that maps keys to more than one value (a so-called "multidict").

A dictionary is a mapping where each key is mapped to a single value. If you want to map keys to multiple values, you need to store the multiple values in another container such as a list or set.

In [46]:
d = {
    'a': [1, 2, 3],
    'b': [4, 5]
}

e = {
    'a': {1, 2, 3},
    'b': {4, 5}
}

The choice of whether to use lists or sets depends on the intended use. Use a list if you want to preserve the insertion order of the items. Use a set if you want to eliminate duplicates (and don't care about the order).

To easily construct such dictionaries, you can use `defaultdict` in the `collections` module. A feature of `defaultdict` is that it automatically initializes the first value so you can simply focus on adding items. For example:

In [47]:
from collections import defaultdict

d = defaultdict(list)
d['a'].append(1)
d['a'].append(2)
d['b'].append(4)
print(d)

defaultdict(<class 'list'>, {'a': [1, 2], 'b': [4]})


One caution with `defaultdict` is that it will automatically create dictionary entries for keys accessed later on (even if they aren't currently found in the dictionary). If you don't want this behavior, you might use `setdefault()` on an ordinary dictionary instead.

In [48]:
d = {}
d.setdefault('a', []).append(1)
d.setdefault('a', []).append(2)
d.setdefault('b', []).append(4)
print(d)

{'a': [1, 2], 'b': [4]}


However, `setdefault()` reads a little unnaturally - not to mention the fact that it always creates a new instance of the initial value on each invocation (the empty list `[]` in the example).
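
The earlier caution about `defaultdict` creating entries on access can be illustrated with a small sketch. The key name `'missing'` and the variable `plain` are arbitrary choices for the example.

```python
from collections import defaultdict

d = defaultdict(list)
d['a'].append(1)

# Merely reading a missing key inserts it with an empty list.
_ = d['missing']
print(dict(d))                # 'missing' is now a key

# An ordinary dict only gains a key when setdefault() is called for it.
plain = {}
plain.setdefault('a', []).append(1)
print(plain.get('missing'))   # None, and no key is created
print(plain)
```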

In principle, constructing a multivalued dictionary is simple. However, initialization of the first value can be messy if you try to do it yourself.

In [49]:
d = {}
pairs = [('a', 1), ('a', 2), ('b', 4)]
for key, value in pairs:
    if key not in d:
        d[key] = []
    d[key].append(value)

Using a `defaultdict` simply leads to much cleaner code:

In [50]:
d = defaultdict(list)
for key, value in pairs:
    d[key].append(value)



## 1.7 Keeping Dictionaries in Order

You want to create a dictionary, and you also want to control the order of items when iterating or serializing.

To control the order of items in a dictionary, you can use an `OrderedDict` from the `collections` module. It exactly preserves the original insertion order of data when iterating:

In [52]:
from collections import OrderedDict

d = OrderedDict()
d['foo'] = 1
d['bar'] = 2
d['spam'] = 3
d['grok'] = 4

for key in d:
    print(key, d[key])

foo 1
bar 2
spam 3
grok 4


An `OrderedDict` can be particularly useful when you want to build a mapping that you may want to later serialize or encode into a different format. For example, if you want to precisely control the order of fields appearing in a JSON encoding, first building the data in an `OrderedDict` will do the trick:

In [53]:
import json
json.dumps(d)

'{"foo": 1, "bar": 2, "spam": 3, "grok": 4}'

An `OrderedDict` internally maintains a doubly linked list that orders the keys according to insertion order. When a new item is first inserted, it is placed at the end of the list. Subsequent reassignment of an existing key doesn't change the order.
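
A brief sketch of the ordering rule just described: reassigning a key leaves its position alone, while deleting and reinserting it moves it to the end. The variable names holding the results are illustrative.

```python
from collections import OrderedDict

d = OrderedDict()
d['foo'] = 1
d['bar'] = 2
d['spam'] = 3

# Reassigning an existing key keeps its original position.
d['foo'] = 10
order_after_reassign = list(d)

# Deleting and reinserting the key places it at the end.
del d['foo']
d['foo'] = 10
order_after_reinsert = list(d)

print(order_after_reassign)
print(order_after_reinsert)
```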

Be aware that the size of an `OrderedDict` is more than twice as large as a normal dictionary due to the extra linked list that's created. Thus, if you are going to build a data structure involving a large number of `OrderedDict` instances (e.g., reading 100,000 lines of a CSV file into a list of `OrderedDict` instances), you would need to study the requirements of your application to determine if the benefits outweighed the extra memory overhead.


## 1.8  Calculating with Dictionaries

You want to perform various calculations (e.g., minimum value, maximum value, sorting, etc.) on a dictionary of data.

Consider this:

In [1]:
prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}

In order to perform useful calculations on the dictionary contents, it is often useful to invert the keys and values of the dictionary using `zip()`. For example, here's how to find the min and max:

In [3]:
min_price = min(zip(prices.values(), prices.keys()))
max_price = max(zip(prices.values(), prices.keys()))

print(min_price)
print(max_price)

(10.75, 'FB')
(612.78, 'AAPL')


In [4]:
prices_sorted = sorted(zip(prices.values(), prices.keys()))

print(prices_sorted)

[(10.75, 'FB'), (37.2, 'HPQ'), (45.23, 'ACME'), (205.55, 'IBM'), (612.78, 'AAPL')]


In [5]:
# be aware that zip() creates an iterator that can only be consumed once
prices_names = zip(prices.values(), prices.keys())
print(min(prices_names))
print(max(prices_names))

(10.75, 'FB')


ValueError: max() arg is an empty sequence
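
One way around the single-use iterator, sketched here under the assumption that the data fits comfortably in memory, is to materialize the pairs into a list once and reuse it:

```python
prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}

# Materialize the (value, key) pairs once so they can be reused.
prices_names = list(zip(prices.values(), prices.keys()))

min_price = min(prices_names)
max_price = max(prices_names)
print(min_price)
print(max_price)
```

Calling `zip()` again for each reduction, as in the earlier cells, works equally well and avoids building the list.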

Often you want to know which key corresponds to the result (e.g., which stock has the lowest price?). You can get the key corresponding to the min or max value if you supply a key function to `min()` and `max()`.

In [6]:
min(prices, key=lambda k: prices[k])

'FB'

In [7]:
max(prices, key=lambda k: prices[k])

'AAPL'

However, to get the actual minimum or maximum value, you'll need to perform an extra lookup step. For example, for the maximum:

In [10]:
prices[max(prices, key=lambda k: prices[k])]

612.78

The solution involving `zip()` solves the problem by "inverting" the dictionary into a sequence of `(value, key)` pairs. When performing comparisons on such tuples, the `value` element is compared first, followed by the `key`. This gives you exactly the behavior that you want and allows reductions and sorting to be easily performed on the dictionary contents using a single statement.

It should be noted that in calculations involving `(value, key)` pairs, if multiple entries have the same value, the key will be used to determine the result.

In [11]:
prices = { 'AAA': 45.23, 'ZZZ': 45.23 }
min(zip(prices.values(), prices.keys()))

(45.23, 'AAA')



## 1.9 Finding Commonalities in Two Dictionaries

You have two dictionaries and want to find out what they might have in common (same keys, same values, etc.).

Consider two dictionaries:

In [13]:
a = {
    'x': 1,
    'y': 2,
    'z': 3
}

b = {
    'w': 10,
    'x': 11,
    'y': 2
}

To find out what the two dictionaries have in common, simply perform common set operations using the `keys()` or `items()` methods. For example:

In [14]:
# Find keys in common
a.keys() & b.keys()

{'x', 'y'}

In [15]:
# Find keys in a that are not in b
a.keys() - b.keys()

{'z'}

In [20]:
# Find (key, value) pairs in common
a.items() & b.items()

{('y', 2)}

These kinds of operations can also be used to alter or filter dictionary contents. For example, suppose you want to make a new dictionary with selected keys removed. Here is some example code to do that:

In [21]:
# Make a new dictionary with certain keys removed
c = {key: a[key] for key in a.keys() - {'z', 'w'}}
c

{'x': 1, 'y': 2}



The `keys()` method of a dictionary returns a keys-view object that exposes the keys, and the `items()` method returns an items-view object consisting of `(key, value)` pairs. These objects support common set operations such as unions, intersections, and differences, so you don't need to convert them into sets first.

Although similar, the `values()` method does not support the set operations described in this recipe. In part, this is due to the fact that unlike keys, the items contained in a values view aren't guaranteed to be unique. This alone makes certain set operations of questionable utility. However, if you must perform such calculations, they can be accomplished by simply converting the values to a set first.
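
A minimal sketch of the difference, reusing the two example dictionaries from this recipe. The flag `values_support_and` is just an illustrative name for recording whether the operator worked.

```python
a = {'x': 1, 'y': 2, 'z': 3}
b = {'w': 10, 'x': 11, 'y': 2}

# Values views do not implement the set operators.
try:
    a.values() & b.values()
    values_support_and = True
except TypeError:
    values_support_and = False

# Converting to sets first makes the intersection possible.
common_values = set(a.values()) & set(b.values())

print(values_support_and)
print(common_values)
```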

## 1.10 Removing Duplicates from a Sequence while Maintaining Order

You want to eliminate duplicate values in a sequence, but preserve the order of the remaining items.

If the values in the sequence are hashable, the problem can be easily solved using a set and a generator:

In [12]:
def dedupe(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

In [14]:
# Here's an example of how to use your function
a = [1, 5, 2, 1, 9, 1, 5, 10]
list(dedupe(a))

[1, 5, 2, 9, 10]

This only works if the items in the sequence are hashable. If you are trying to eliminate duplicates in a sequence of unhashable types (such as `dict`s), you can make a slight change to this recipe:

In [16]:
def dedupe(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

In [17]:
a = [
    {'x': 1, 'y': 2},
    {'x': 1, 'y': 3},
    {'x': 1, 'y': 2},
    {'x': 2, 'y': 4}
]

list(dedupe(a, key=lambda d: (d['x'], d['y'])))

[{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]

In [18]:
list(dedupe(a, key=lambda d: d['x']))

[{'x': 1, 'y': 2}, {'x': 2, 'y': 4}]

The latter solution also works nicely if you want to eliminate duplicates based on the value of a single field or attribute of a larger data structure, as the `key=lambda d: d['x']` call above does.

If all you want to do is eliminate duplicates, it is often easy enough to make a set. For example:

In [21]:
set([1, 5, 2, 1, 3, 9, 5, 10])

{1, 2, 3, 5, 9, 10}

However, this approach doesn't preserve any kind of ordering. So, the resulting data will be scrambled afterward. The solution shown avoids this.

The use of a generator function in this recipe reflects the fact that you might want the function to be extremely general purpose - not necessarily tied directly to list processing. For example, if you want to read a file, eliminating duplicate lines, you could simply do this:

```python
with open(somefile, 'r') as f:
    for line in dedupe(f):
        ...
```



## 1.11 Naming a Slice

Your program has become an unreadable mess of hardcoded slice indices and you want to clean it up.

Suppose you have some code that is pulling specific data fields out of a record string with fixed fields (e.g., from a flat file or similar format):

In [1]:
record = '....................100          .......513.25     ..........'

# avoid having a lot of mysterious hardcoded indices
# and what you're doing becomes much clearer
SHARES = slice(20, 32)
PRICE = slice(40, 48)

cost = int(record[SHARES]) * float(record[PRICE])
cost

51325.0

As a general rule, writing code with a lot of hardcoded index values leads to a readability and maintenance mess. For example, if you come back to the code a year later, you'll look at it and wonder what you were thinking when you wrote it. The solution shown is simply a way of more clearly stating what your code is actually doing.

In general, the built-in `slice()` creates a slice object that can be used anywhere a slice is allowed. For example:

In [2]:
items = [0, 1, 2, 3, 4, 5, 6]
a = slice(2, 4)
items[a]

[2, 3]

In [3]:
items[a] = [10, 11]
items

[0, 1, 10, 11, 4, 5, 6]

In [4]:
del items[a]
items

[0, 1, 4, 5, 6]

A `slice` instance also exposes its `start`, `stop`, and `step` values as attributes:

In [5]:
a = slice(10, 50, 2)
a.start

10

In [6]:
a.stop

50

In [20]:
a.step

2

In [22]:
# You can also map a slice onto a sequence of a specific
# size by using its indices(size) method.
# This returns a tuple (start, stop, step) where all values
# have been suitably limited to fit within bounds (so as to
# avoid IndexError exceptions when indexing).
s = "HelloWorld"
a.indices(len(s))

(10, 10, 2)

In [19]:
# range(10, 10, 2) is empty for this slice, so the loop prints nothing
for i in range(*a.indices(len(s))):
    print(i)
    print(s[i])
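
For a case where the clipped range is not empty, here is a small sketch with a slice that starts inside the string. The slice `b` and its bounds are hypothetical values chosen for illustration.

```python
s = "HelloWorld"

# A slice whose start lies inside the string (hypothetical values).
b = slice(5, 50, 2)

# indices() clips stop to len(s), giving bounds that range() can use directly.
bounds = b.indices(len(s))
picked = [s[i] for i in range(*bounds)]

print(bounds)
print(picked)
```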

## 1.12 Determining the Most Frequently Occurring Items in a Sequence

You have a sequence of items, and you'd like to determine the most frequently occurring items in the sequence.

The `collections.Counter` class is designed for just such a problem. It even comes with a handy `most_common()` method that will give you the answer.

In [21]:
from collections import Counter

words = [
   'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
   'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
   'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
   'my', 'eyes', "you're", 'under'
]

word_counts = Counter(words)
top_three = word_counts.most_common(3)
top_three

[('eyes', 8), ('the', 5), ('look', 4)]

In [23]:
# as input, a Counter can be fed any sequence of hashable items;
# under the covers, a Counter is a dictionary that maps the items
# to their number of occurrences
word_counts["not"]

1

In [24]:
word_counts["eyes"]

8

In [25]:
# if you want to increment the count manually, simply use
# addition
morewords = ['why', 'are', 'you', 'not', 'looking', 'in', 'my', 'eyes']
for word in morewords:
    word_counts[word] += 1
word_counts

Counter({'are': 1,
         'around': 2,
         "don't": 1,
         'eyes': 9,
         'in': 1,
         'into': 3,
         'look': 4,
         'looking': 1,
         'my': 4,
         'not': 2,
         'the': 5,
         'under': 1,
         'why': 1,
         'you': 1,
         "you're": 1})

In [27]:
# alternatively, you could use the update() method
word_counts.update(morewords)
word_counts

Counter({'are': 2,
         'around': 2,
         "don't": 1,
         'eyes': 10,
         'in': 2,
         'into': 3,
         'look': 4,
         'looking': 2,
         'my': 5,
         'not': 3,
         'the': 5,
         'under': 1,
         'why': 2,
         'you': 2,
         "you're": 1})

In [30]:
# we could also do mathematical operations
a = Counter(words)
b = Counter(morewords)

print(a + b)
print(a - b)

Counter({'eyes': 9, 'the': 5, 'look': 4, 'my': 4, 'into': 3, 'not': 2, 'around': 2, "don't": 1, "you're": 1, 'under': 1, 'why': 1, 'are': 1, 'you': 1, 'looking': 1, 'in': 1})
Counter({'eyes': 7, 'the': 5, 'look': 4, 'into': 3, 'my': 2, 'around': 2, "don't": 1, "you're": 1, 'under': 1})




## 1.13 Sorting a List of Dictionaries by a Common Key

You have a list of dictionaries and you would like to sort the entries according to one or more of the dictionary values.

Sorting this type of structure is easy using the `operator` module's `itemgetter` function.

In [1]:
rows = [
    {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
    {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
]

It's fairly easy to output these rows ordered by any of the fields common to all of the dictionaries. For example:

In [2]:
from operator import itemgetter

rows_by_fname = sorted(rows, key=itemgetter('fname'))
rows_by_uid = sorted(rows, key=itemgetter('uid'))

print(rows_by_fname)
print(rows_by_uid)

[{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}, {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}, {'fname': 'David', 'lname': 'Beazley', 'uid': 1002}, {'fname': 'John', 'lname': 'Cleese', 'uid': 1001}]
[{'fname': 'John', 'lname': 'Cleese', 'uid': 1001}, {'fname': 'David', 'lname': 'Beazley', 'uid': 1002}, {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}, {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}]


In [3]:
rows_by_lfname = sorted(rows, key=itemgetter('lname', 'fname'))
rows_by_lfname

[{'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
 {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
 {'fname': 'Big', 'lname': 'Jones', 'uid': 1004},
 {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}]

In this example, `rows` is passed to the built-in `sorted()` function, which accepts a keyword argument `key`. This argument is expected to be a callable that accepts a single item from `rows` as input and returns a value that will be used as the basis for sorting. The `itemgetter()` function creates just such a callable.

The `operator.itemgetter()` function takes as arguments the lookup indices used to extract the desired values from the records in `rows`. It can be a dictionary key name, a numeric list element, or any value that can be fed to an object's `__getitem__()` method. If you give multiple indices to `itemgetter()`, the callable it produces will return a tuple with all of the elements in it, and `sorted()` will order the output according to the sorted order of the tuples. This can be useful if you want to simultaneously sort on multiple fields (such as last and first name, as shown in the example).
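
To see what the callables produced by `itemgetter()` actually return, consider this short sketch. The names `get_uid`, `get_names`, and `get_second` are illustrative.

```python
from operator import itemgetter

row = {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}

get_uid = itemgetter('uid')                # single key -> single value
get_names = itemgetter('lname', 'fname')   # several keys -> tuple of values
get_second = itemgetter(1)                 # positional indices work too

print(get_uid(row))
print(get_names(row))
print(get_second(['a', 'b', 'c']))
```

Because the multi-key form returns a tuple, `sorted()` compares rows field by field in the order the keys were given.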

In [4]:
# sometimes lambda is used instead
rows_by_fname_ = sorted(rows, key=lambda r: r['fname'])
rows_by_lfname = sorted(rows, key=lambda r: (r['lname'], r['fname']))

Both work just fine, but the solution involving `itemgetter()` typically runs a bit faster.

In [5]:
# min and max can be applied
min(rows, key=itemgetter('uid'))

{'fname': 'John', 'lname': 'Cleese', 'uid': 1001}

In [6]:
max(rows, key=itemgetter('uid'))

{'fname': 'Big', 'lname': 'Jones', 'uid': 1004}

## 1.14 Sorting Objects Without Native Comparison Support

You want to sort objects of the same class, but they don't natively support comparison operations.

The built-in `sorted()` function takes a key argument that can be passed a callable that will return some value in the object that `sorted` will use to compare objects. For example, if you have a sequence of `User` instances in your application, and you want to sort them by their `user_id` attribute, you would supply a callable that takes a `User` instance as input and returns the `user_id`.

In [7]:
class User:
    def __init__(self, user_id):
        self.user_id = user_id
        
    def __repr__(self):
        return f'User({self.user_id})'
    
users = [User(23), User(3), User(99)]
users

[User(23), User(3), User(99)]

In [8]:
sorted(users, key=lambda u: u.user_id)

[User(3), User(23), User(99)]

In [9]:
# instead of using lambda, an alternative approach
# is to use operator.attrgetter()
from operator import attrgetter
sorted(users, key=attrgetter('user_id'))

[User(3), User(23), User(99)]

It's a matter of personal preference to use `lambda` or `attrgetter`. Similar to `itemgetter`, `attrgetter` is a tad bit faster and allows multiple fields to be extracted simultaneously.
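
Extracting multiple fields with `attrgetter()` could look like the following sketch. The `User` class here is a hypothetical variant with `last_name` and `first_name` attributes, not the class defined earlier in this recipe.

```python
from operator import attrgetter

class User:
    # Hypothetical variant of the User class with two name fields.
    def __init__(self, last_name, first_name):
        self.last_name = last_name
        self.first_name = first_name

    def __repr__(self):
        return f'User({self.last_name!r}, {self.first_name!r})'

users = [User('Jones', 'Brian'), User('Beazley', 'David'), User('Jones', 'Big')]

# With two names, attrgetter returns a (last_name, first_name) tuple per user.
by_name = sorted(users, key=attrgetter('last_name', 'first_name'))
print(by_name)
```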



## 1.15 Group Records Together Based on a Field

You have a sequence of dictionaries or instances and you want to iterate over the data in groups based on the value of a particular field, such as date.

The `itertools.groupby()` function is particularly useful for grouping data together like this. To illustrate, suppose you have the following list of dictionaries:

In [24]:
rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '2122 N CLARK', 'date': '07/03/2012'},
    {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
    {'address': '1060 W ADDISON', 'date': '07/02/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
    {'address': '1039 W GRANVILLE', 'date': '07/04/2012'},
]

Now suppose you want to iterate over the data in chunks grouped by date. To do this, first sort by the desired field (in this case, date) and then use `itertools.groupby()`:

In [27]:
from operator import itemgetter
from itertools import groupby

# sort by the desired field first
rows.sort(key=itemgetter('date'))

# iterate in group
for date, items in groupby(rows, key=itemgetter('date')):
    print(date)
    for i in items:
        print('\t', i)

07/01/2012
	 {'address': '5412 N CLARK', 'date': '07/01/2012'}
	 {'address': '4801 N BROADWAY', 'date': '07/01/2012'}
07/02/2012
	 {'address': '5800 E 58TH', 'date': '07/02/2012'}
	 {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'}
	 {'address': '1060 W ADDISON', 'date': '07/02/2012'}
07/03/2012
	 {'address': '2122 N CLARK', 'date': '07/03/2012'}
07/04/2012
	 {'address': '5148 N CLARK', 'date': '07/04/2012'}
	 {'address': '1039 W GRANVILLE', 'date': '07/04/2012'}


The `groupby()` function works by scanning a sequence and finding sequential "runs" of identical values (or values returned by the given key function). On each iteration, it returns the value along with an iterator that produces all of the items in a group with the same value.

An important preliminary step is sorting the data according to the field of interest. Since `groupby()` only examines consecutive items, failing to sort first won't group the records as you want.
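The effect of skipping the sort can be seen in a small sketch. The three-record `unsorted_rows` list below is assumed sample data, deliberately left out of order so that two records with the same date are not adjacent.

```python
from itertools import groupby
from operator import itemgetter

# Assumed sample data, deliberately left unsorted
unsorted_rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
]

# Without sorting, the two 07/01/2012 records are not adjacent,
# so groupby() reports that date as two separate groups
unsorted_keys = [date for date, _ in groupby(unsorted_rows, key=itemgetter('date'))]
print(unsorted_keys)

# After sorting, each date appears exactly once
sorted_rows = sorted(unsorted_rows, key=itemgetter('date'))
sorted_keys = [date for date, _ in groupby(sorted_rows, key=itemgetter('date'))]
print(sorted_keys)
```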

If your goal is to simply group the data together by dates into a large data structure that allows random access, you may have better luck using `defaultdict()` to build a multidict:

In [29]:
from collections import defaultdict

rows_by_date = defaultdict(list)
for row in rows:
    rows_by_date[row['date']].append(row)
    
print(rows_by_date)

defaultdict(<class 'list'>, {'07/01/2012': [{'address': '5412 N CLARK', 'date': '07/01/2012'}, {'address': '4801 N BROADWAY', 'date': '07/01/2012'}], '07/02/2012': [{'address': '5800 E 58TH', 'date': '07/02/2012'}, {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'}, {'address': '1060 W ADDISON', 'date': '07/02/2012'}], '07/03/2012': [{'address': '2122 N CLARK', 'date': '07/03/2012'}], '07/04/2012': [{'address': '5148 N CLARK', 'date': '07/04/2012'}, {'address': '1039 W GRANVILLE', 'date': '07/04/2012'}]})


If memory is no concern, it may be faster to do this than to first sort the records and iterate using `groupby()`.

In [30]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


## 1.16 Filtering Sequence Elements

You have data inside of a sequence, and need to extract values or reduce the sequence using some criteria.


The easiest way to filter sequence data is often to use a list comprehension. For example:

In [31]:
mylist = [1, 4, -5, 10, -7, 2, 3, -1]
[n for n in mylist if n > 0]

[1, 4, 10, 2, 3]

In [33]:
[n for n in mylist if n < 0]

[-5, -7, -1]

One potential issue with list comprehension is that if the input is large it might produce a large result. If this is a concern, you can use generator expressions to produce the filtered values iteratively:

In [39]:
pos = (n for n in mylist if n > 0)
pos

<generator object <genexpr> at 0x1071b5258>

In [40]:
for x in pos: print(x)

1
4
10
2
3


Sometimes the filtering criteria are more complicated, for example when they involve exception handling or some other tricky detail. In that case, put the filtering code into its own function and use the built-in `filter()` function:

In [42]:
values = ['1', '2', '-3', '-', '4', 'N/A', '5']

def is_int(val):
    try:
        int(val)
        return True
    except ValueError:
        return False
    
ivals = list(filter(is_int, values))
ivals

['1', '2', '-3', '4', '5']

List comprehensions and generator expressions are often the easiest and most straightforward ways to filter simple data. They can also transform the data at the same time.
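As a brief sketch of transforming while filtering, the example below reuses the same sample values. The names `roots` and `clipped` are introduced here only for illustration. The second comprehension shows a related pattern: replacing values that fail the criterion instead of dropping them.

```python
import math

mylist = [1, 4, -5, 10, -7, 2, 3, -1]

# Filter and transform in one pass: square roots of the positive values
roots = [math.sqrt(n) for n in mylist if n > 0]

# Replace rather than drop: clip non-positive values to 0 with a conditional expression
clipped = [n if n > 0 else 0 for n in mylist]

print(roots)
print(clipped)
```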

Another notable tool is `itertools.compress()`, which takes an iterable and an accompanying Boolean selector sequence as input. As output, it gives you all of the items in the iterable where the corresponding element in the selector is True. This can be useful if you're trying to apply the results of filtering one sequence to another related sequence. For example, suppose you have the following two columns of data:

In [43]:
addresses = [
    '5412 N CLARK',
    '5148 N CLARK',
    '5800 E 58TH',
    '2122 N CLARK',
    '5645 N RAVENSWOOD',
    '1060 W ADDISON',
    '4801 N BROADWAY',
    '1039 W GRANVILLE',
]

counts = [0, 3, 10, 4, 1, 7, 6, 1]

Now suppose you want to make a list of all addresses where the corresponding count value was greater than 5. Here's how you could do it:

In [46]:
from itertools import compress

more5 = [n > 5 for n in counts]
more5

[False, False, True, False, False, True, True, False]

In [47]:
list(compress(addresses, more5))

['5800 E 58TH', '1060 W ADDISON', '4801 N BROADWAY']

The key here is to create a sequence of Booleans that indicates which elements satisfy the desired condition. The `compress()` function then picks out the items corresponding to `True` values.

Like `filter()`, `compress()` returns an iterator, so you need to pass the result to `list()` if you want a list.
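Because `compress()` is lazy, the selector does not have to be a list either. The sketch below uses hypothetical `names` and `scores` columns, passes a generator expression as the selector, and compares the result with a roughly equivalent `zip()`-based generator expression.

```python
from itertools import compress

# Hypothetical parallel columns, used only for illustration
names = ['a', 'b', 'c', 'd']
scores = [3, 8, 1, 9]

# The selector can itself be a generator, so no intermediate list is built
picked = compress(names, (s > 5 for s in scores))

# compress() behaves like this zip-based generator expression
picked_zip = (name for name, s in zip(names, scores) if s > 5)

result = list(picked)
result_zip = list(picked_zip)
print(result)
```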

In [48]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


## 1.17 Extracting a Subset of a Dictionary

You want to make a dictionary which is a subset of another dictionary.

This can be easily achieved with a dictionary comprehension:

In [1]:
prices = {
    'orange': 4.99,
    'apple': 5.99,
    'grape': 3.99,
    'cherry': 6.99,
    'banana': 4.49,
}

In [2]:
# Make a dictionary of items with prices over $4
p1 = {key:value for key, value in prices.items() if value > 4}

# Make a dictionary of healthy items
healthy_names = ['grape', 'banana']
p2 = {key: value for key, value in prices.items() if key in healthy_names}

print(p1, p2)

{'orange': 4.99, 'apple': 5.99, 'cherry': 6.99, 'banana': 4.49} {'grape': 3.99, 'banana': 4.49}


Much of what can be accomplished with a dictionary comprehension can also be done by creating a sequence of tuples and passing it to the `dict()` function:

In [3]:
dict((key, value) for key, value in prices.items() if value > 4)

{'apple': 5.99, 'banana': 4.49, 'cherry': 6.99, 'orange': 4.99}

That said, the dictionary comprehension is usually clearer and runs faster. The second example can also be written using a set intersection on the dictionary's keys:

In [5]:
healthy_names_set = {'grape', 'banana'}
{key: prices[key] for key in prices.keys() & healthy_names_set}

{'banana': 4.49, 'grape': 3.99}

In [7]:
%reset

## 1.18 Mapping Names to Sequence Elements

You have code that accesses elements in lists or tuples by position, which makes the code hard to read. You would like to depend less on position in the structure by accessing the elements by name.

`collections.namedtuple()` provides these benefits while adding minimal overhead compared to a normal `tuple` object. It is actually a factory function that returns a subclass of the standard Python `tuple` type. You give it a type name and the fields it should have, and it returns a class that you can instantiate by passing in values for the fields you've defined.

In [2]:
from collections import namedtuple

Subscriber = namedtuple('Subscriber', ['address', 'joined'])
sub = Subscriber('Philly', '2018-02')

In [3]:
sub

Subscriber(address='Philly', joined='2018-02')

In [4]:
print(sub.address)
print(sub.joined)

Philly
2018-02


Although it is a class of its own, an instance of a `namedtuple` is interchangeable with a `tuple` and supports all of its operations:

In [5]:
sub[0]

'Philly'

In [6]:
addr, joined = sub

In [7]:
print(addr, joined)

Philly 2018-02


In [8]:
len(sub)

2

A major use case of `namedtuple` is to decouple your code from the position of the elements it manipulates. For example, if you get back a list of tuples from a database query and access the values by position, adding a new column to the table could break your code. If you first convert each returned tuple to a `namedtuple`, the code refers to fields by name and no longer depends on their position.

In [11]:
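# Stock must be defined before compute_cost_2() is called
Stock = namedtuple('Stock', ['name', 'shares', 'price'])

def compute_cost_1(records):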
    total = 0.0
    for rec in records:
        total += rec[1] * rec[2]
    return total


def compute_cost_2(records):
    total = 0.0
    for rec in records:
        s = Stock(*rec)
        total += s.shares * s.price
    return total

Naturally, you can skip the explicit conversion to a `Stock` namedtuple if the `records` sequence in the example already contains such instances.
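A minimal sketch of that variant follows. The `raw_rows` data and the `compute_cost()` name are hypothetical. The rows are converted once, up front, with the `_make()` class method, so the cost function works purely with named fields.

```python
from collections import namedtuple

Stock = namedtuple('Stock', ['name', 'shares', 'price'])

# Hypothetical rows, e.g. as returned by a database query
raw_rows = [('ACME', 100, 123.45), ('HUBS', 120, 111.25)]

# _make() builds a namedtuple from any iterable of field values
records = [Stock._make(row) for row in raw_rows]

def compute_cost(records):
    # No per-record conversion needed: fields are already accessible by name
    return sum(rec.shares * rec.price for rec in records)

total = compute_cost(records)
print(total)
```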

Another possible use of a `namedtuple` is as a replacement for a dictionary, which requires more memory to store. However, a `namedtuple` is **immutable**, so keep that in mind.

In [12]:
Stock = namedtuple('Stock', ['name', 'shares', 'price'])
s = Stock('AMZ', '30', '50')
s

Stock(name='AMZ', shares='30', price='50')

In [13]:
s.shares = 80

AttributeError: can't set attribute

If you do need to change an attribute, use the `_replace()` method. It leaves the original untouched and returns a new object with the specified values replaced.

In [14]:
s._replace(shares=80)

Stock(name='AMZ', shares=80, price='50')

A subtle and interesting use of `_replace()` is to have a "prototype" `tuple` containing default values and then use `_replace()` to create new tuples with new values.

In [16]:
Stock = namedtuple('Stock', ['name', 'shares', 'price', 'date', 'time'])

# create a prototype stock
prototype = Stock('', 0, 0.0, None, None)

# function to convert a dictionary to a stock
def dict_to_stock(s):
    return prototype._replace(**s)

In [17]:
a = {'name': 'ACME', 'shares': 100, 'price': 123.45}
b = {'name': 'HUBS', 'shares': 120, 'price': 111.25, 'date': '2018/02/25'}

In [18]:
dict_to_stock(a)

Stock(name='ACME', shares=100, price=123.45, date=None, time=None)

In [19]:
dict_to_stock(b)

Stock(name='HUBS', shares=120, price=111.25, date='2018/02/25', time=None)
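As an alternative to the prototype approach, and assuming Python 3.7 or later, `namedtuple()` accepts a `defaults` argument. The sketch below reuses the same field names. The defaults apply to the rightmost fields, so `date` and `time` fall back to `None` when they are omitted.

```python
from collections import namedtuple

# defaults apply to the rightmost fields: here 'date' and 'time'
# (assumes Python 3.7+, where the defaults parameter was added)
Stock = namedtuple('Stock', ['name', 'shares', 'price', 'date', 'time'],
                   defaults=[None, None])

a = {'name': 'ACME', 'shares': 100, 'price': 123.45}
stock_a = Stock(**a)
print(stock_a)
```

Unlike the prototype version, the fields without defaults (`name`, `shares`, `price`) are required here, so a dictionary that lacks one of them raises a `TypeError`.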

In [20]:
%reset

## 1.19 Transforming and Reducing Data at the Same Time

You need to execute some reducer function (`sum()`, `max()`, `min()`), but first need to transform or filter data.

A very elegant way to combine data reduction and transformation is to use a generator-expression argument:

In [1]:
# calculating the sum of squares
nums = [1, 2, 45, 100, 23]
sum(x*x for x in nums)

12559

In [7]:
# Some other examples
import os

# determine if there is any file that ends with
# .ipynb inside the current directory
files = os.listdir('.')
if any(name.endswith('.ipynb') for name in files):
    print('There are some iPython notebooks')
else:
    print('There is not!')

There are some iPython notebooks


In [8]:
# Output a tuple as a line of CSV
",".join(str(x) for x in ('AMZ', 300, 100.12))

'AMZ,300,100.12'

In [9]:
# Data reduction across fields of a data structure
portfolio = [
    {'name': 'GOOG', 'shares': 50},
    {'name': 'FB', 'shares': 75},
    {'name': 'MSFT', 'shares': 87},
    {'name': 'AMZN', 'shares': 100},
]
min(s['shares'] for s in portfolio)

50

The solution shows a subtle syntactic aspect of generator expressions when supplied as a single argument inside a function (i.e., you don't need repeated parentheses).

In [10]:
# These statements are the same
print(sum(x*x for x in nums))
print(sum((x*x for x in nums)))

12559
12559


Using a generator expression is more memory-efficient. A list comprehension also works and is fine for a small input, but for a large input it builds a complete temporary list in memory that is used once and then discarded. The generator expression produces the transformed values one at a time instead.
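A rough way to see the difference is to compare container sizes with `sys.getsizeof()`. This is only a sketch: the input size of 100,000 is an arbitrary assumption, and `getsizeof()` measures the list or generator object itself, not the integers it refers to.

```python
import sys

nums = range(100_000)

squares_list = [x * x for x in nums]   # materializes every value
squares_gen = (x * x for x in nums)    # produces values on demand

# getsizeof reports the container itself, not the integers it references,
# but the gap is already large
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size > gen_size)

# Both forms give the same reduction
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
```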

In [12]:
# Also, min and max accept a key argument that is useful
# where you might be inclined to use a generator as well
from operator import itemgetter

# original -> returns 50
print(min(s['shares'] for s in portfolio))

# alternative -> returns {'name': 'GOOG', 'shares': 50}
print(min(portfolio, key=itemgetter('shares')))

50
{'name': 'GOOG', 'shares': 50}


In [13]:
%reset

## 1.20 Combining Multiple Mappings into a Single Mapping

You have multiple dictionaries or mappings that you want to logically combine into a single mapping to perform certain operations, such as looking up values or checking for existing keys.

An easy way to do this is to use the `ChainMap` class from the `collections` module:

In [17]:
from collections import ChainMap

# suppose you have two dictionaries
a = {'x': 1, 'y': 2}
b = {'y': 2, 'z': 3}

c = ChainMap(a, b)

In [18]:
c['x']

1

In [19]:
c['y']

2

In [20]:
c['z']

3

A `ChainMap` takes multiple mappings and makes them logically appear as one. However, it doesn't literally merge them together. Instead, it simply keeps a list of the underlying mappings and redefines common dictionary operations to scan the list. Most of the operations will work:

In [21]:
len(c)

3

In [22]:
list(c.keys())

['x', 'y', 'z']

In [23]:
list(c.values())

[1, 2, 3]

In [24]:
list(c.items())

[('x', 1), ('y', 2), ('z', 3)]

If there are duplicate keys, the value from the first mapping is used. Thus `c['y']` in the example always refers to the value in dictionary `a`, not the one in `b` (here both happen to be 2).
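The lookup order is easier to see when the duplicate key has different values. In this sketch, the dictionaries `first` and `second` are hypothetical and share the key `'y'`.

```python
from collections import ChainMap

# Hypothetical dictionaries with different values for the shared key 'y'
first = {'x': 1, 'y': 2}
second = {'y': 20, 'z': 3}

chain = ChainMap(first, second)
reversed_chain = ChainMap(second, first)

print(chain['y'])           # found in `first`, the earliest mapping
print(reversed_chain['y'])  # order reversed, so `second` is consulted first
```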

In [25]:
# operations that mutate the mapping
# always affect the first mapping listed
c['z'] = 10
c['w'] = 40
del c['x']

In [26]:
a

{'w': 40, 'y': 2, 'z': 10}

A `ChainMap` is particularly useful when working with scoped values such as variables in a programming language (i.e., globals, locals, etc.). In fact, there are methods that make this easy:

In [31]:
values = ChainMap()
values['x'] = 1
values

ChainMap({'x': 1})

In [32]:
# Add a new mapping
values = values.new_child()
values['x'] = 2

In [34]:
# Add a new mapping
values = values.new_child()
values['x'] = 3

In [35]:
values

ChainMap({'x': 3}, {'x': 2}, {'x': 1})

In [36]:
values['x']

3

In [37]:
# Discard last mapping
values = values.parents
values['x']

2

In [38]:
# Discard last mapping
values = values.parents
values['x']

1

In [39]:
values

ChainMap({'x': 1})

As an alternative to `ChainMap`, you can merge dictionaries with the `update()` method:

In [40]:
merged = dict(b)
merged.update(a)
merged

{'w': 40, 'y': 2, 'z': 10}

This works, but it requires creating a completely separate dictionary object (or destructively altering one of the existing dictionaries). Also, if any of the original dictionaries is later modified, the changes are not reflected in the merged dictionary. A `ChainMap`, on the other hand, **refers to the original dictionaries**, so changes to them are visible through it.
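The difference can be sketched side by side. The dictionaries below are hypothetical and freshly defined for this illustration: `merged` is a snapshot taken at merge time, while `chained` is a view over the live dictionaries.

```python
from collections import ChainMap

# Hypothetical dictionaries used only for illustration
a = {'x': 1, 'z': 3}
b = {'y': 2, 'z': 4}

merged = dict(b)
merged.update(a)          # a snapshot: later changes to `a` are not seen
chained = ChainMap(a, b)  # a view over the live dictionaries

a['x'] = 42
print(merged['x'], chained['x'])
```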

In [41]:
%reset