# Chapter 10 Lecture Notes

Please read chapter N of the textbook.

These notes take 1 - 3 lecture hours to cover.

## Dictionaries

A Python **dictionary** is a data structure that stores *key:value* pairs. For
instance, a dictionary might store *ID*:*name* pairs of students in a course,
*word*:*definition* pairs for a dictionary, and so on.

Dictionaries are popular in practice because they allow fast searching by key,
and can be used to model many kinds of data. For instance, it's possible to
simulate lists using dictionaries.
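
As a small sketch of that last point, a dictionary whose keys are the integers 0, 1, 2, ... behaves much like a list. The names `colors_list` and `colors_dict` are made up for this illustration:

```python
# A list, and a dictionary that "simulates" it by using the indices as keys.
colors_list = ['red', 'green', 'blue']
colors_dict = {0: 'red', 1: 'green', 2: 'blue'}

print(colors_list[1])    # green
print(colors_dict[1])    # green

# Adding a new key at the next index is like colors_list.append('pink').
colors_dict[len(colors_dict)] = 'pink'
print(colors_dict[3])    # pink
print(len(colors_dict))  # 4
```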

## Making Dictionaries

Suppose a university has three semesters: spring, summer, and fall. Due to the
intricacies of its database system, the spring semester is numbered 1, the
summer semester is numbered 4, and the fall semester is numbered 7:

| Semester | Number |
|----------|--------|
| Spring   | 1      |
| Summer   | 4      |
| Fall     | 7      |

We can represent this in Python as a dictionary like this:

In [1]:
semester_code = {'spring': 1, 'summer': 4, 'fall': 7}

print(semester_code)          # {'spring': 1, 'summer': 4, 'fall': 7}
print(semester_code.keys())   # dict_keys(['spring', 'summer', 'fall'])
print(semester_code.values()) # dict_values([1, 4, 7])
print(len(semester_code))     # 3

{'spring': 1, 'summer': 4, 'fall': 7}
dict_keys(['spring', 'summer', 'fall'])
dict_values([1, 4, 7])
3


`semester_code` is a dictionary with three *key:value* pairs: `'spring': 1`,
`'summer': 4`, and `'fall': 7`. The keys are `'spring'`, `'summer'`, and
`'fall'`, and the corresponding values are `1`, `4`, and `7`.

You get the code for a semester by writing `semester_code[key]`, where `key` is
one of the key strings `'spring'`, `'summer'`, or `'fall'`:

In [4]:
semester_code = {'spring': 1, 'summer': 4, 'fall': 7}

print(semester_code['fall'])    # 7
print(semester_code['summer'])  # 4
print(semester_code['spring'])  # 1

print(semester_code['autumn'])  # KeyError: 'autumn' is not a key

7
4
1


KeyError: 'autumn'

If you look for a key that is not in the dictionary, you get a `KeyError`.
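
If a missing key should not stop the program, there are a couple of common ways to handle it. The sketch below shows two of them: the dictionary `get` method, which returns a default value when the key is missing, and a `try`/`except` block that catches the `KeyError`. The default `-1` and the variable `code` are arbitrary choices for this example:

```python
semester_code = {'spring': 1, 'summer': 4, 'fall': 7}

# get returns the value for a key, or the given default if the key is missing
print(semester_code.get('fall', -1))    # 7
print(semester_code.get('autumn', -1))  # -1

# try/except catches the KeyError instead of letting it end the program
try:
    code = semester_code['autumn']
except KeyError:
    code = None

print(code)  # None
```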

You can add new *key:value* pairs to a dictionary by assigning them:

In [2]:
semester_code = {}             # empty dictionary
semester_code['spring'] = 1    # {'spring': 1}
semester_code['summer'] = 4    # {'spring': 1, 'summer': 4}
semester_code['fall'] = 7      # {'spring': 1, 'summer': 4, 'fall': 7}

print(semester_code)           # {'spring': 1, 'summer': 4, 'fall': 7}
print(semester_code.keys())    # dict_keys(['spring', 'summer', 'fall'])
print(semester_code.values())  # dict_values([1, 4, 7])
print(len(semester_code))      # 3


{'spring': 1, 'summer': 4, 'fall': 7}
dict_keys(['spring', 'summer', 'fall'])
dict_values([1, 4, 7])
3


Assignment also lets you change a key's value:

In [5]:
semester_code = {}            # empty dictionary
semester_code['spring'] = 1
semester_code['summer'] = 4
semester_code['fall'] = 7     # {'spring': 1, 'summer': 4, 'fall': 7}

semester_code['fall'] = 8     # change the value of 'fall' to 8
                              # {'spring': 1, 'summer': 4, 'fall': 8}

print(semester_code)          # {'spring': 1, 'summer': 4, 'fall': 8}
print(semester_code.keys())   # dict_keys(['spring', 'summer', 'fall'])
print(semester_code.values()) # dict_values([1, 4, 8])
print(len(semester_code))     # 3

{'spring': 1, 'summer': 4, 'fall': 8}
dict_keys(['spring', 'summer', 'fall'])
dict_values([1, 4, 8])
3


In general, if `d` is a dictionary, then:
- `d[key]` is the value associated with `key`
- `d[key] = value` sets the value associated with `key` to `value`

These operations are *very* efficient, even with large dictionaries.

## Searching Dictionaries

The `in` operator efficiently checks if a *key* is in a dictionary:

In [6]:
semester_code = {'spring': 1, 'summer': 4, 'fall': 7}

print('fall' in semester_code)    # True
print('autumn' in semester_code)  # False
print(1 in semester_code)         # False

True
False
False


`k in semester_code` is `True` if `k` is a key in the dictionary, and `False`
otherwise.

### Hashing

The `in` operator does *not* search the keys one after the other. Instead, `in`
with dictionaries uses a neat trick called **hashing**. Even in a dictionary with
millions of keys, checking for a key with `in` is nearly instantaneous.

We won't go into the details of hashing in this course other than to say it is
the technique that makes dictionaries so efficient.
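
For the curious, here is a rough picture of the idea. It is a toy sketch only, and not how Python actually implements dictionaries. The keys are spread over a fixed number of "buckets", and the built-in `hash` function decides which bucket a key belongs in, so a search only has to look in one bucket. The names `NUM_BUCKETS`, `buckets`, `bucket_index`, `add_key`, and `has_key` are invented for this example:

```python
# A toy hash table: a list of buckets, where each bucket is a list of keys.
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_index(key):
    """Returns the index of the bucket that key belongs in."""
    return hash(key) % NUM_BUCKETS

def add_key(key):
    """Adds key to its bucket, if it's not already there."""
    bucket = buckets[bucket_index(key)]
    if key not in bucket:
        bucket.append(key)

def has_key(key):
    """Returns True if key has been added, and False otherwise."""
    # only one bucket is searched, not all of the keys
    return key in buckets[bucket_index(key)]

for k in ['spring', 'summer', 'fall']:
    add_key(k)

print(has_key('fall'))    # True
print(has_key('autumn'))  # False
```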

## Searching Dictionary Values

Searching through the *values* of a dictionary is not as efficient as searching
for keys, although it is easy to write:

In [None]:
semester_code = {'spring': 1, 'summer': 4, 'fall': 7}

print(1 in semester_code.values())  # True
print(2 in semester_code.values())  # False
print(7 in semester_code.values())  # True

True
False
True


Searching *values* does *not* use hashing. Instead, the values are checked one
at a time, with about the same performance as a loop that searches a list of
the values. This is much slower than searching for keys, so do it with care.
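
To make the cost concrete, the following sketch does by hand roughly what a value search has to do: look at the values one after another. It also shows one possible workaround when value searches are frequent, which is to build a second dictionary with the keys and values swapped. That only works when the values are unique. The names `has_value` and `code_to_semester` are made up for this example:

```python
def has_value(d, target):
    """Returns True if target is one of the values in dictionary d."""
    for key in d:
        if d[key] == target:   # checks the values one at a time
            return True
    return False

semester_code = {'spring': 1, 'summer': 4, 'fall': 7}

print(has_value(semester_code, 7))  # True
print(has_value(semester_code, 2))  # False

# Swap keys and values so that the codes can be searched as keys.
code_to_semester = {}
for name in semester_code:
    code_to_semester[semester_code[name]] = name

print(4 in code_to_semester)  # True
print(code_to_semester[4])    # summer
```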

## Example: Finding Reverse Words

Suppose we want to count how many words in [words.txt](words.txt) are the
*reverse* of another word. For example, "pat" is the reverse of "tap", and
"stressed" is the reverse of "desserts".

One way to do this is to read all the words into a list and then, for each
word, check whether its reverse is also in the list:

In [5]:
# read all the words into a list
word_list = open('words.txt', 'r').read().split('\n')
word_list.remove('')  # remove the empty string at the end of the file

def too_slow():
    count = 0
    for word in word_list:
        rev = word[::-1]  # word[::-1] is the reverse of word
        if rev in word_list:  
            count += 1
    return count

print(too_slow())

885


This takes over a minute to run on my computer!

It's so slow because `rev in word_list` searches through the list one word at a
time, from left to right. If $n$ is the number of words in the list, a search
that finds its word checks about $\frac{n}{2}$ words on average, and a search
that fails checks all $n$ words. Even if we use the smaller figure for every
search, this adds up to $\frac{n}{2} + \frac{n}{2} +
\ldots + \frac{n}{2} = n \cdot \frac{n}{2} = \frac{n^2}{2}$ checks.

For [words.txt](words.txt), which has over 113,000 words, that's about 6.5
billion words checked, and likely more:

In [2]:
word_list = open('words.txt', 'r').read().split('\n')

n = len(word_list)
print(f'number of words: {n}')
print(f'{n}^2 / 2 = {(n ** 2) / 2}')

number of words: 113784
113784^2 / 2 = 6473399328.0


### A Dictionary of Words

By storing the words in a dictionary we can greatly reduce the amount of work
done, and thus improve the performance. We'll store the words as keys, and the
corresponding values will all be 1s:

In [3]:
# store the words as keys in a dictionary
word_dict = {}
for word in open('words.txt', 'r'):
    word = word.strip()
    word_dict[word] = 1

def pretty_fast():
    """Returns the number of words that are the reverse of another word.
    Also writes the reverse words to reverse_words.txt.
    """
    rev_words = open('reverse_words.txt', 'w')
    count = 0
    for word in word_dict:
        rev = word[::-1]  # word[::-1] is the reverse of word
        if rev in word_dict:
            count += 1
            rev_words.write(f'{word}\n')
    rev_words.close()
    return count

print(pretty_fast())

885


This is *much* faster than the list version: it takes less than one-tenth of a
second to run on my computer.

This speedup is due to the fact that searching for a key in a dictionary using
`in` is very fast. With a dictionary, `in` does *not* search the words one at a
time the way it does with a list. Instead, it uses hashing to get the right
answer almost immediately, no matter the size of the dictionary.
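
Since the values in `word_dict` are never used, a Python *set* is another option worth knowing about: it stores only keys and also uses hashing for `in`. The sketch below is an alternative to the dictionary version, not a replacement for it. It uses a small made-up word list, `sample_words`, in place of words.txt so that it can be run on its own:

```python
# a small made-up word list standing in for words.txt
sample_words = ['pat', 'tap', 'desserts', 'stressed', 'python', 'noon']

word_set = set(sample_words)   # like a dictionary with keys but no values

count = 0
for word in word_set:
    rev = word[::-1]       # word[::-1] is the reverse of word
    if rev in word_set:    # fast, hashing-based search
        count += 1

print(count)  # 5
```

Note that "noon" is counted because it is its own reverse. The list and dictionary versions above count such words in the same way.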

## Example: Counting Characters in a String

Suppose you want to count the frequency of characters in a string. For instance,
in 'title' the letter *t* occurs 2 times, *i* occurs 1 time, *l* occurs 1 time,
and *e* occurs 1 time.

A dictionary is a good choice for storing such character counts: the keys are
the characters, and the values are the counts. For 'title', the dictionary will
be `{'t': 2, 'i': 1, 'l': 1, 'e': 1}`.

We can build this dictionary with a loop:

In [1]:
s = 'title'

letters = {}
for c in s:
    if c in letters:
        letters[c] += 1
    else:
        letters[c] = 1

print(letters)

{'t': 2, 'i': 1, 'l': 1, 'e': 1}


We can write this as a function to make it easier to re-use:

In [2]:
def count_letters(s):
    """Returns a dictionary with the frequency of each letter in s.
    """
    letters = {}
    for c in s:
        if c in letters:
            letters[c] += 1
        else:
            letters[c] = 1
    return letters

print(count_letters('title'))

{'t': 2, 'i': 1, 'l': 1, 'e': 1}
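
For comparison, here are two shorter ways to do the same counting. Both are sketches of alternatives, not the version used in the rest of these notes. The first uses the dictionary `get` method, which returns a default (here 0) when the key is missing. The second uses `Counter` from Python's standard `collections` module. The function name `count_letters_get` is made up for this example:

```python
from collections import Counter

def count_letters_get(s):
    """Returns a dictionary with the frequency of each character in s."""
    letters = {}
    for c in s:
        # get returns 0 the first time c is seen
        letters[c] = letters.get(c, 0) + 1
    return letters

print(count_letters_get('title'))  # {'t': 2, 'i': 1, 'l': 1, 'e': 1}

# Counter does the counting for us
letter_count = Counter('title')
print(letter_count['t'])  # 2
print(letter_count['z'])  # 0 (a missing key counts as 0, not a KeyError)
```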


`count_letters` works with any string, so let's count the frequency of letters in
English words:

In [8]:
# read in words.txt as a big string
big_string = open('words.txt', 'r').read()

letter_count = count_letters(big_string)

print(letter_count)
print('number of keys:', len(letter_count))

{'a': 68574, '\n': 113783, 'h': 20186, 'e': 106752, 'd': 34548, 'i': 77392, 'n': 60505, 'g': 27832, 's': 86526, 'l': 47003, 'r': 64963, 'v': 9186, 'k': 9366, 'w': 8533, 'o': 54538, 'f': 12706, 'b': 17794, 'c': 34281, 'u': 31151, 't': 57029, 'm': 24739, 'p': 25789, 'y': 13473, 'x': 2700, 'j': 1780, 'z': 3750, 'q': 1632}
number of keys: 27


Notice that there are 27 keys in `letter_count`, but there are only 26 lowercase
letters. The extra character is the `\n` (newline) character that files use to
separate the lines:

In [10]:
# read in words.txt as a big string
words = open('words.txt', 'r').read()

letter_count = count_letters(words)

print('number of newlines:', letter_count['\n'])

number of newlines: 113783


We can use the `pop` method to delete *key*:*value* pairs from a dictionary:

In [12]:
# read in words.txt as a big string
words = open('words.txt', 'r').read()

letter_count = count_letters(words)

letter_count.pop('\n')  # delete the '\n' key:value pair

print(letter_count)
print('number of keys:', len(letter_count))

{'a': 68574, 'h': 20186, 'e': 106752, 'd': 34548, 'i': 77392, 'n': 60505, 'g': 27832, 's': 86526, 'l': 47003, 'r': 64963, 'v': 9186, 'k': 9366, 'w': 8533, 'o': 54538, 'f': 12706, 'b': 17794, 'c': 34281, 'u': 31151, 't': 57029, 'm': 24739, 'p': 25789, 'y': 13473, 'x': 2700, 'j': 1780, 'z': 3750, 'q': 1632}
number of keys: 26
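
A few more details about removing pairs, shown on a tiny made-up dictionary: `pop` returns the value it removed, popping a missing key raises a `KeyError` unless you also give a default value, and the `del` statement is another way to delete a pair.

```python
# a tiny made-up dictionary of character counts
letter_count = {'a': 3, 'b': 1, '\n': 2}

removed = letter_count.pop('\n')  # pop returns the value that was removed
print(removed)       # 2
print(letter_count)  # {'a': 3, 'b': 1}

# 'z' is not a key, so the default value 0 is returned
print(letter_count.pop('z', 0))  # 0

del letter_count['b']  # del also removes a key:value pair
print(letter_count)    # {'a': 3}
```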


### Sorting the Letter Counts

Now suppose we want to *sort* the letter counts from most frequent to least
frequent.

Let's first write **pseudocode** for the program. Pseudocode is an informal
description that we can use as a blueprint for the program:

- Read the words into a dictionary, removing the `\n` character in the
  dictionary.
- Convert the dictionary to a list of `[count, letter]` pairs.
- Sort the list from smallest count to biggest count.
- Reverse the list so that the biggest count is first.
- Print the letters and counts.

A handy trick for writing programs based on pseudocode is to cut-and-paste the
steps above into Python and write them as comments:

```python
# Read the words into a dictionary, removing the `\n` character in the dictionary.

# Convert the dictionary to a list of `[count, letter]` pairs.

# Sort the list from smallest count to biggest count.

# Reverse the list so that the biggest count is first.

# Print the letters and counts.
```

Now we can write the program by "filling in the blanks":

In [19]:
# Read the words into a dictionary, removing the `\n` character in the dictionary.
words = open('words.txt', 'r').read()
letter_count = count_letters(words)
letter_count.pop('\n')

# Convert the dictionary to a list of `[count, letter]` pairs.
letter_list = []
for c in letter_count:
    pair = [letter_count[c], c]
    letter_list.append(pair)

# Sort the list from smallest count to biggest count.
letter_list.sort()

# Reverse the list so that the biggest count is first.
letter_list.reverse()

# Print the letters and counts.
for pair in letter_list:
    print(f'{pair[1]}: {pair[0]}')


e: 106752
s: 86526
i: 77392
a: 68574
r: 64963
n: 60505
t: 57029
o: 54538
l: 47003
d: 34548
c: 34281
u: 31151
g: 27832
p: 25789
m: 24739
h: 20186
b: 17794
y: 13473
f: 12706
k: 9366
v: 9186
w: 8533
z: 3750
x: 2700
j: 1780
q: 1632


## Questions

1. Why must dictionary keys be unique?

2. If `d` is a dictionary that does *not* have the key `k`, what happens when
   you evaluate `d[k]`?

3. What does this print?

   ```python
   d = {'a': 3, 'b': 4}
   d['a'] = 4
   d['a'] = 5
   print(d['a'])
   ```

4. If `d` is a dictionary, what expression will give you all of the keys in `d`?
   What about all of the values?

5. What makes searching for a key in a dictionary so much faster than searching
   for a value in a list?

6. How can you remove a key:value pair from a dictionary?

7. What is pseudocode?