# Lecture 29 Notes

## Dictionaries

Recall that a **data structure** is an organized collection of data. We've seen
two main Python data structures: *strings*, which are collections of characters,
and *lists*, which are collections of any values.

A **dictionary** is another useful data structure that stores data in
*key*:*value* pairs. Dictionaries are also known as **maps**, **associative
arrays**, **tables**, or **hash tables**.

For example, here is a dictionary of *name*:*age* pairs:


In [1]:
age = {'Marge': 34, 'Homer': 36}
print(age.keys())    # dict_keys(['Marge', 'Homer'])
print(age.values())  # dict_values([34, 36])

dict_keys(['Marge', 'Homer'])
dict_values([34, 36])


A dictionary key can be any *immutable* (non-changeable) type, such as strings
or numbers. For `age`, the keys are strings. Keys *can't* be *lists*, since
changing the list could cause the dictionary to lose track of the position of
its associated value.
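
As a small illustration of this rule, the sketch below tries to use a list as a key and catches the error Python raises. The `location` dictionary and its contents are invented for this example. It also uses a *tuple* (Python's immutable sequence type, not otherwise covered in these notes) to show that an unchangeable sequence is accepted as a key.

```python
# Hypothetical data, made up for this illustration.
location = {}

# A tuple is immutable, so it can be used as a key.
location[(3, 4)] = 'treasure'

# A list is mutable, so Python refuses to use it as a key.
try:
    location[[3, 4]] = 'treasure'
    list_key_allowed = True
except TypeError as err:
    list_key_allowed = False
    print('TypeError:', err)  # the message reports that a list is unhashable

print(location[(3, 4)])  # treasure
```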

You normally access values in a dictionary by using its associated key. For
example:

In [2]:
print(age['Marge'])  # 34 is the value of key 'Marge'
print(age['Homer'])  # 36 is the value of key 'Homer'

34
36


**Important** Accessing a dictionary value through its key is extremely
efficient. It is much faster than searching for an element in a list (e.g. using
`index`), and is often faster than *binary search*. But unlike binary search,
Python's dictionaries *don't* need to be stored in sorted order.

On the flip side, accessing a key by its value is *not* efficient. You can do
it, but it's about the same speed as doing a linear search through all the
values.

If you ask for a key that's *not* in the dictionary, you get a `KeyError`:


In [3]:
print(age['Bart'])  # KeyError: 'Bart' is not a key in the dictionary

KeyError: 'Bart'
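
If a missing key is a normal possibility rather than a bug, there are two common ways to avoid the `KeyError`. The sketch below shows both: testing with `in` before indexing, and the dictionary method `get`, which returns a default value instead of raising an error. The variable names `bart_age` and `lisa_age` are just for this example.

```python
age = {'Marge': 34, 'Homer': 36}

# Option 1: check that the key is present before using it.
if 'Bart' in age:
    print(age['Bart'])
else:
    print('Bart is not in the dictionary')

# Option 2: get returns None for a missing key, or a default that you supply.
bart_age = age.get('Bart')     # None
lisa_age = age.get('Lisa', 0)  # 0, the supplied default
print(bart_age, lisa_age)      # None 0
```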

You can only put keys in the square brackets of a dictionary. For instance, in
`age`, you *can't* efficiently get a name given an age:


In [4]:
print(age[34])  # KeyError: 34 is not a key in the dictionary

KeyError: 34
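
If you do need the names that go with a given age, one option is a linear search through the dictionary. The function `names_with_age` below is a hypothetical helper written for these notes, not a built-in. It returns a list because several names could share the same age.

```python
def names_with_age(age, target):
    """Return a list of all the keys in age whose value equals target."""
    names = []
    for name in age:  # visits every key, so this is a linear search
        if age[name] == target:
            names.append(name)
    return names

age = {'Marge': 34, 'Homer': 36}
print(names_with_age(age, 34))  # ['Marge']
print(names_with_age(age, 99))  # []
```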

## Keys are Unique, Values are Not

In a Python dictionary, all the keys *must* be different: repeated keys are not
allowed. However, values don't need to be unique: values can be repeated. For
example, this is okay:


In [5]:
age = {'Marge': 34, 'Homer': 36, 'Carl': 34}
print(age['Marge'])  # 34 is the value of key 'Marge'
print(age['Carl'])   # 34 is also the value of key 'Carl'

34
34


`'Carl'` and `'Marge'` have the same age, and that's no problem. Different
people can be the same age.

But this *is* a problem:


In [6]:
bad_age = {'Marge': 34, 'Homer': 36, 'Marge': 35}  # oops, two Marges
print(bad_age)  # {'Marge': 35, 'Homer': 36}

{'Marge': 35, 'Homer': 36}


Since identical keys are *not* allowed, the second `'Marge'` overwrites the
first one, which may or may not be what you want.

If you really do need to have two `'Marge'`s, then you should probably change
the keys:


In [7]:
age = {'Marge 1': 34, 'Homer': 36, 'Marge 2': 35}  # okay, different keys for Marge
print(age)  # {'Marge 1': 34, 'Homer': 36, 'Marge 2': 35}

{'Marge 1': 34, 'Homer': 36, 'Marge 2': 35}


## Dictionaries are Changeable

Like lists, dictionaries are *mutable*, i.e. they can be changed. You can
add/remove *key*:*value* pairs, and also change the value associated with a key.
For example:


In [8]:
age = {'Marge': 34, 'Homer': 36}

age['Bart'] = 10  # add a new key:value pair
print(age)  # {'Marge': 34, 'Homer': 36, 'Bart': 10}

age['Bart'] = 11  # change the value associated with a key
print(age)  # {'Marge': 34, 'Homer': 36, 'Bart': 11}

del age['Bart']   # remove a key:value pair
print(age)  # {'Marge': 34, 'Homer': 36}

{'Marge': 34, 'Homer': 36, 'Bart': 10}
{'Marge': 34, 'Homer': 36, 'Bart': 11}
{'Marge': 34, 'Homer': 36}


These are all fast operations, since they are based on the key.

## Example: Word Counting

Suppose you want to count how many times each different word occurs in a text
file. This is a good job for a dictionary: words will be the keys, and the
associated value is how many times the word appears, i.e. *word*:*count* pairs.

The word counts for a very small file might look like this:

```python
{'cow': 3, 'pig': 1, 'horse': 2}
```

This means `'cow'` appears 3 times, `'pig'` appears 1 time, and `'horse'`
appears 2 times.

Let's write a program that works like this:

- Open a text file.
- Read the file line-by-line.
- Split each line into individual words. Python strings have a built-in method
  called `split` to do this:
  
  ```python
  >>> names = 'Ken, Alex, Art, Anna'
  >>> names.split(', ')
  ['Ken', 'Alex', 'Art', 'Anna']
  ```

- Add the words to a dictionary of *word*:*count* pairs. If the word is already
  in the dictionary, then its count is incremented. If the word is not in the
  dictionary, it is added with a count of 1.
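
Before adding file handling, the last step can be tried on a single line of text. This is a minimal sketch using a made-up string, `line`, chosen so that the result matches the small example dictionary above.

```python
line = 'cow pig cow horse cow horse'  # made-up text for this sketch

word_count = {}
for w in line.split(' '):
    if w in word_count:
        word_count[w] += 1  # seen before: increment its count
    else:
        word_count[w] = 1   # first time: add it with a count of 1

print(word_count)  # {'cow': 3, 'pig': 1, 'horse': 2}
```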

### Cleaning the Text

Getting words from a file is conceptually easy, but the details are tricky
because of things like punctuation. For simplicity, we will strip out everything
that isn't a letter or a space using this function:


In [11]:
def clean_text(text):
    """Replace every non-letter character in text with a space.
    Keeps just alphabetic characters and spaces.
    >>> clean_text('Hello, world!')
    'Hello  world '
    """
    cleaned_text = ''
    for char in text:
        if char.isalpha() or char == ' ':
            cleaned_text += char  # keep spaces and letters
        else:
            cleaned_text += ' '   # replace other characters with spaces
    return cleaned_text

cleaned = clean_text('Hello, world!')
print(f'"{cleaned}"')

"Hello  world "


### A Word Counting Function

Now we can write `count_words`, which returns a dictionary of the counts of all
the words in a text file:


In [14]:
def count_words(fname):
    """Return a dictionary of word:count pairs for the words in a file.
    """
    # word_count stores the count of each word
    word_count = {}
    
    # the with-statement closes the file automatically when the loop is done
    with open(fname) as contents:
        for line in contents:
            # split the line into words
            words = clean_text(line).split(' ')
            
            # add all the words to the dictionary
            for w in words:
                w = w.strip()   # remove whitespace at the beginning and end
                if w != '':     # ignore empty strings
                    if w in word_count:
                        word_count[w] += 1
                    else:
                        word_count[w] = 1
    
    return word_count

[joke.txt](joke.txt) contains this text:

```
Who's there?
A broken pencil.
A broken pencil who?
Never mind. It's pointless.
```

Then:

In [19]:
words = count_words('joke.txt')
print(words)
print(words.keys())     # the distinct words, in order of first appearance
print(words.values())   # the count of each word, in the same order
print(words['broken'])  # 2
print(words['joke'])    # KeyError: 'joke' is not a key in the dictionary

{'Who': 1, 's': 2, 'there': 1, 'A': 2, 'broken': 2, 'pencil': 2, 'who': 1, 'Never': 1, 'mind': 1, 'It': 1, 'pointless': 1}
dict_keys(['Who', 's', 'there', 'A', 'broken', 'pencil', 'who', 'Never', 'mind', 'It', 'pointless'])
dict_values([1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1])
2


KeyError: 'joke'
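
Notice that `'Who'` and `'who'` are counted as two different words, because string comparison is case-sensitive. If that is not what you want, one possible fix is to lowercase the text before counting. The function `count_words_ignoring_case` below is a hypothetical variant written for these notes. To stay self-contained it works on a string instead of a file, repeats the cleaning idea from `clean_text`, and uses `get` to supply a count of 0 for new words.

```python
def count_words_ignoring_case(text):
    """Count the words in a string, treating 'Who' and 'who' as the same word."""
    # Same idea as clean_text: every non-letter becomes a space.
    cleaned = ''
    for char in text:
        cleaned += char if char.isalpha() else ' '

    word_count = {}
    # split() with no argument splits on any run of whitespace,
    # so it never produces empty strings.
    for w in cleaned.lower().split():
        word_count[w] = word_count.get(w, 0) + 1
    return word_count

# The text of joke.txt, written out as a string.
joke = ("Who's there?\n"
        "A broken pencil.\n"
        "A broken pencil who?\n"
        "Never mind. It's pointless.\n")

counts = count_words_ignoring_case(joke)
print(counts['who'])  # 2
print(counts['a'])    # 2
```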

[dracula.txt](dracula.txt) is the text of [Bram Stoker's *Dracula*](https://en.wikipedia.org/wiki/Dracula):

In [18]:
words = count_words('dracula.txt')
print('Unique words:', len(words))           # 10018
print(' Total words:', sum(words.values()))  # 161883

print('vampire:', words['vampire'])  # 14
print(' blood:', words['blood'])     # 112

Unique words: 10018
 Total words: 161883
vampire: 14
 blood: 112


## Example: Printing the Top 10 Most Frequent Words

Suppose you want the top 10 most frequent words in a file. We can calculate this
by first getting all the word counts using `count_words`, and then converting
that to a list of [*count*, *word*] pairs that we can sort from highest count to
lowest count:


In [20]:
def print_top10(fname):
    word_count = count_words(fname)
    
    #
    # make a list of [count, word] pairs
    #
    # e.g. if word_count is {'cat':3, 'dog':2, 'mouse':1}, count_pairs
    # will be [[3, 'cat'], [2, 'dog'], [1, 'mouse']]
    #
    count_pairs = []
    for w in word_count:  # this loops through all the keys in word_count
        pair = [word_count[w], w]
        count_pairs.append(pair)
    
    #
    # sort the words from highest count to lowest
    #
    # when sorting a list of lists, Python's built-in sort compares the first
    # elements (the counts) first, and uses the second elements (the words)
    # only to break ties
    #
    count_pairs.sort()
    count_pairs.reverse()  # reverse the list, we want highest count to lowest

    #
    # print the top 10 words
    #
    for pair in count_pairs[:10]:
        print(pair[1], pair[0])

print_top10('dracula.txt')

the 7205
and 5589
I 4831
to 4371
of 3572
a 2880
in 2384
that 2375
he 1931
was 1866
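
The same ranking can be written more compactly with the built-in `sorted` function and a `key` argument. This is an alternative sketch, not the version used above. The function name `top_n` and the sample dictionary are made up for this example. It sorts the dictionary's (word, count) pairs directly by count, so words with equal counts may come out in a different order than in `print_top10`.

```python
def top_n(word_count, n=10):
    """Return a list of the n (word, count) pairs with the highest counts."""
    # items() gives (word, count) pairs; the key function picks out the count
    pairs = sorted(word_count.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

word_count = {'cat': 3, 'dog': 2, 'mouse': 1, 'emu': 5}  # made-up counts
for word, count in top_n(word_count, 3):
    print(word, count)
# emu 5
# cat 3
# dog 2
```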
