# Dictionaries

## The basics

Dictionaries are key-value maps.

They are defined as follows:

In [1]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
assert len(eng_to_french) == 3

As you can see, the `len` function returns the number of key-value pairs in the dictionary.
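
Looking up and assigning values uses square brackets. The cell below is a minimal sketch of that; the `"yellow"` entry is made up for illustration and is not part of the original examples.

```python
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}

# Look up the value stored under a key with square brackets.
assert eng_to_french["red"] == "rouge"

# Assigning to a new key adds an entry...
eng_to_french["yellow"] = "jaune"
assert len(eng_to_french) == 4

# ...while assigning to an existing key replaces its value.
eng_to_french["yellow"] = "JAUNE"
assert eng_to_french["yellow"] == "JAUNE"
assert len(eng_to_french) == 4
```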


### `keys`, `values` and `items` views

You can obtain all the keys in the dictionary with the `keys` method. Dictionaries preserve the order in which keys were inserted (an implementation detail of CPython 3.6 that became a language guarantee in Python 3.7), so the order returned by `keys` is stable.

In [5]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
assert list(eng_to_french.keys()) == ["red", "blue", "green"]

You can obtain the values from a dictionary using the `values` method:

In [9]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
assert list(eng_to_french.values()) == ["rouge", "bleu", "vert"]



The `items()` method returns a view of key-value tuples that you can iterate over:

In [10]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
for k, v in eng_to_french.items():
    print(f"The French word for {k} is {v}")

The French word for red is rouge
The French word for blue is bleu
The French word for green is vert


| NOTE: |
| :---- |
| The `keys`, `values`, and `items` methods return *views*, not lists. A view is an iterable window onto the dictionary that automatically reflects any later change to it. |
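
Views differ from lists in a couple of practical ways. The sketch below shows that key views support set operations but cannot be indexed; the `eng_to_spanish` dictionary is a hypothetical second dictionary introduced only for this illustration.

```python
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
eng_to_spanish = {"red": "rojo", "green": "verde", "black": "negro"}  # hypothetical

# Key views support set operations such as intersection and difference.
common = eng_to_french.keys() & eng_to_spanish.keys()
only_french = eng_to_french.keys() - eng_to_spanish.keys()
assert common == {"red", "green"}
assert only_french == {"blue"}

# Views cannot be indexed; materialize a list first if you need positions.
try:
    eng_to_french.keys()[0]
except TypeError as e:
    print("Oops:", e)

first_key = list(eng_to_french.keys())[0]
assert first_key == "red"
```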

The `del` statement can be used to remove a key-value pair from the dictionary:

In [11]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}

del eng_to_french["red"]
assert eng_to_french == {"blue": "bleu", "green": "vert"}

The following example shows how views are updated dynamically:

In [13]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
eng_to_french_keys = eng_to_french.keys()
eng_to_french_key_list = list(eng_to_french_keys)

del eng_to_french["red"]
assert eng_to_french == {"blue": "bleu", "green": "vert"}
assert eng_to_french_keys == {"blue", "green"}
assert eng_to_french_key_list == ["red", "blue", "green"]

Notice that I didn't have to call the `keys()` method again after deleting an entry to see the change. The view was updated automatically.

By contrast, the list I materialized before deleting the element was not updated.

### Safe access to dictionary keys with `in` and `get`

Attempting to access a key that doesn't exist raises an error:

In [15]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
try:
    eng_to_french["purple"]
except KeyError as e:
    print("Oops: ", e)

Oops:  'purple'


To avoid this error, you can check for the key first with the `in` operator:

In [16]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}

if "red" in eng_to_french:
    print(f"red in French is {eng_to_french['red']}")

if "purple" not in eng_to_french:
    print("purple in French is unknown")


red in French is rouge
purple in French is unknown


Alternatively, you can use the `get` method, which accepts a second argument to return when the dictionary doesn't contain the key:

In [17]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}
assert eng_to_french.get("red", "unknown") == "rouge"
assert eng_to_french.get("purple", "unknown") == "unknown"
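
When the second argument is omitted, `get` falls back to `None`. The short sketch below also checks that `get` leaves the dictionary untouched:

```python
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}

# With no second argument, get() returns None for a missing key.
assert eng_to_french.get("purple") is None

# get() never adds the missing key to the dictionary.
assert "purple" not in eng_to_french
assert len(eng_to_french) == 3
```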

### Setting/Getting default values for keys with `setdefault` and `defaultdict`

The `setdefault` method:
+ returns the value stored under the key, if the key exists in the dictionary.
+ otherwise, inserts the key with the given default value and returns that default.

In [18]:
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}

assert eng_to_french.setdefault("red", "unknown") == "rouge"
assert eng_to_french.setdefault("purple", "unknown") == "unknown"

Note how confusing the method name is: despite the `set` prefix, it is mostly used as a lookup that only writes when the key is missing.
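
The side effect of `setdefault` is easy to miss, so here is a sketch that makes it visible, followed by one common use: grouping values into lists. The `words_by_initial` name and the word list are invented for this example.

```python
eng_to_french = {"red": "rouge", "blue": "bleu", "green": "vert"}

# A missing key is inserted with the default value.
eng_to_french.setdefault("purple", "unknown")
assert eng_to_french["purple"] == "unknown"

# An existing key is never overwritten.
eng_to_french.setdefault("red", "unknown")
assert eng_to_french["red"] == "rouge"

# Grouping: create the list on first use, then append to it.
words_by_initial = {}
for word in ["rouge", "bleu", "vert", "rose", "blanc"]:
    words_by_initial.setdefault(word[0], []).append(word)

assert words_by_initial == {
    "r": ["rouge", "rose"],
    "b": ["bleu", "blanc"],
    "v": ["vert"],
}
```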

There is also a `defaultdict` subclass of `dict` that you can import from the `collections` module. It takes a factory function that is called to produce the default value whenever a missing key is accessed.

In [19]:
from collections import defaultdict

eng_to_french = defaultdict(lambda: "unknown", {"red": "rouge", "blue": "bleu", "green": "vert"})
assert eng_to_french["red"] == "rouge"
assert eng_to_french["purple"] == "unknown"

That's much more sensible than `setdefault`.
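
One caveat is worth keeping in mind: subscript access on a `defaultdict` stores the default it produces, whereas `in` and `get` do not. The sketch below illustrates that, and then uses `int` as the factory to build a counter; the `counts` name and the sample string are assumptions made for the example.

```python
from collections import defaultdict

eng_to_french = defaultdict(lambda: "unknown", {"red": "rouge"})

# `in` and get() do not call the default factory...
assert "purple" not in eng_to_french
assert eng_to_french.get("purple") is None

# ...but subscript access does, and it stores the result.
assert eng_to_french["purple"] == "unknown"
assert "purple" in eng_to_french

# int() returns 0, so defaultdict(int) is handy for counting.
counts = defaultdict(int)
for letter in "abca":
    counts[letter] += 1
assert counts == {"a": 2, "b": 1, "c": 1}
```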

### Creating copies of a dictionary

You can create shallow copies of a dictionary using the `copy` method:

In [20]:
x = {"a": 1, "b": 2}
y = x.copy()
assert x == y

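You can create a deep copy of a dictionary using the `copy.deepcopy` function from the `copy` module. A shallow copy shares any nested objects with the original, whereas a deep copy duplicates them recursively.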
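
The difference only shows up when the values are themselves mutable. Here is a small sketch, using a made-up dictionary with a nested list:

```python
import copy

x = {"primary": ["red", "blue"], "count": 2}
shallow = x.copy()
deep = copy.deepcopy(x)

# Mutate the nested list and rebind a key in the original.
x["primary"].append("yellow")
x["count"] = 3

# The shallow copy shares the nested list with the original...
assert shallow["primary"] == ["red", "blue", "yellow"]
# ...but has its own set of keys, so rebinding in x does not affect it.
assert shallow["count"] == 2

# The deep copy is fully independent.
assert deep == {"primary": ["red", "blue"], "count": 2}
```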

### Merging dictionaries with `update`

The `update` method adds all the key-value pairs of a second dictionary to the first one, modifying it in place. For keys that are common to both, the values from the second dictionary override those of the first.

In [21]:
a = {1: "One", 2: "Two"}
b = {0: "Zero", 1: "__one__"}

a.update(b)
assert a == {1: "__one__", 2: "Two", 0: "Zero"}
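
If you would rather keep both inputs unchanged, one option is to build a new dictionary by unpacking them with `**`. This is a sketch of that alternative; `merged` is a name chosen for the example. On Python 3.9 and later, the `|` operator offers the same behavior.

```python
a = {1: "One", 2: "Two"}
b = {0: "Zero", 1: "__one__"}

# Later unpackings win for keys present in both dictionaries.
merged = {**a, **b}
assert merged == {1: "__one__", 2: "Two", 0: "Zero"}

# The inputs are left untouched.
assert a == {1: "One", 2: "Two"}
assert b == {0: "Zero", 1: "__one__"}

# Keys keep the position of their first insertion.
assert list(merged) == [1, 2, 0]
```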

### Exercise

Use a dictionary to count the frequencies of words in a sentence.

In [25]:
sample_string = "To be or not to be"
word_ocurrences = {}

for word in sample_string.split():
    word_ocurrences[word] = word_ocurrences.get(word, 0) + 1

for word, ocurrences in word_ocurrences.items():
    print(f"{word!r} occurs {ocurrences} times")

'To' occurs 1 times
'be' occurs 2 times
'or' occurs 1 times
'not' occurs 1 times
'to' occurs 1 times
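
Note that `'To'` and `'to'` are counted separately because dictionary keys are case-sensitive. As a possible variation, the sketch below lowercases the sentence first and uses `collections.Counter`, a `dict` subclass from the standard library designed for counting; `word_counts` is a name picked for this example.

```python
from collections import Counter

sample_string = "To be or not to be"

# Lowercase first so that "To" and "to" share one entry.
word_counts = Counter(sample_string.lower().split())

assert word_counts["to"] == 2
assert word_counts["be"] == 2
# A Counter reports 0 for words it has never seen.
assert word_counts["missing"] == 0
```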


## What can be used as a key?

Any hashable Python object can be used as a key to a dictionary. In practice, the built-in immutable types are hashable and the built-in mutable containers are not.

As a result, lists cannot be used as keys in a dictionary. However, you can work around this issue by using tuples instead.

Being hashable means that the key has a hash value, provided by the `__hash__` method, that doesn't change throughout the life of the object. A tuple computes its hash from the hashes of its elements, so a tuple that contains a mutable value such as a list cannot be used as a key.

The following table illustrates these restrictions:

| Python type | Immutable? | Hashable? | Dictionary key? |
| :---------- | :--------- | :-------- | :-------------- |
| int         | yes        | yes       | yes             |
| float       | yes        | yes       | yes             |
| bool        | yes        | yes       | yes             |
| complex     | yes        | yes       | yes             |
| str         | yes        | yes       | yes             |
| bytes       | yes        | yes       | yes             |
| bytearray   | no         | no        | no              |
| list        | no         | no        | no              |
| tuple       | yes        | sometimes<br>(only when all tuple elements are hashable) | sometimes<br>(only when all tuple elements are hashable) |
| set         | no         | no        | no              |
| frozenset   | yes        | yes       | yes             |
| dict        | no         | no        | no              |
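
A few rows of the table can be checked directly. The sketch below uses a throwaway `translations` dictionary, invented for this example, and relies on Python raising `TypeError` when an unhashable object is used as a key.

```python
translations = {}

# A tuple of strings is hashable, so it works as a key.
translations[("red", "blue")] = "tuple of strings"

# A list is not hashable.
try:
    translations[["red", "blue"]] = "list"
except TypeError as e:
    print("Oops:", e)

# A tuple that contains a list is not hashable either.
try:
    translations[("red", ["blue"])] = "tuple containing a list"
except TypeError as e:
    print("Oops:", e)

# A frozenset is the hashable counterpart of a set.
translations[frozenset({"red", "blue"})] = "frozenset"

assert len(translations) == 2
```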


### Exercise

When processing raw text, you often need to clean and normalize it before doing anything else. For example, to find the frequency of words in a text, you typically normalize everything first (convert it to a single case and remove punctuation). It's also common to break the text into a series of words and write each of them on its own line.

In this exercise, the task is to read the first part of the first chapter of Moby Dick, make sure everything is one case, remove all punctuation, and write the words one per line to a second file.

Finally, use that output file to count the number of times each word occurs.

The code that will generate the desired output file is as follows:

In [29]:
import string

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans("", "", string.punctuation)

with open("data/moby_01.txt", "r") as infile:
    with open("data/out/moby_01_normalized_out.txt", "w") as outfile:
        for line in infile:
            normalized_line = line.lower().translate(remove_punctuation)
            # Write each word of the line on its own output line.
            for word in normalized_line.split():
                outfile.write(word + "\n")
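
To try the normalization step without touching any files, the same logic can be wrapped in a small function. This is only a sketch: `normalize` and `PUNCTUATION_TABLE` are names made up for illustration, and the sample sentences are not taken from the input file. Keep in mind that `string.punctuation` covers ASCII punctuation only, and that removing hyphens joins hyphenated words together.

```python
import string

# Build the table once; it maps every ASCII punctuation character to None.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(line):
    """Lowercase `line`, delete ASCII punctuation and split it into words."""
    return line.lower().translate(PUNCTUATION_TABLE).split()


assert normalize("Call me Ishmael. Some years ago, never mind!") == [
    "call", "me", "ishmael", "some", "years", "ago", "never", "mind",
]

# Hyphens are punctuation too, so hyphenated words are fused.
assert normalize("Crowds of water-gazers") == ["crowds", "of", "watergazers"]
```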

With the file created, we can iterate over the words in the file and create a dictionary:

In [30]:
with open("data/out/moby_01_normalized_out.txt", "r") as infile:
    word_ocurrences = {}
    # Each line read from the file keeps its trailing "\n", so the keys do too.
    for word in infile:
        word_ocurrences[word] = word_ocurrences.get(word, 0) + 1


Finally, we can output the dictionary contents, stripping the trailing newline from each word:

In [32]:
for word, num_ocurrences in word_ocurrences.items():
    word = word.strip()
    print(f"{word!r} occurs {num_ocurrences} times")

'call' occurs 1 times
'me' occurs 5 times
'ishmael' occurs 1 times
'some' occurs 2 times
'years' occurs 1 times
'ago' occurs 1 times
'never' occurs 1 times
'mind' occurs 1 times
'how' occurs 1 times
'long' occurs 1 times
'precisely' occurs 1 times
'having' occurs 1 times
'little' occurs 2 times
'or' occurs 2 times
'no' occurs 1 times
'money' occurs 1 times
'in' occurs 4 times
'my' occurs 4 times
'purse' occurs 1 times
'and' occurs 9 times
'nothing' occurs 2 times
'particular' occurs 1 times
'to' occurs 5 times
'interest' occurs 1 times
'on' occurs 1 times
'shore' occurs 1 times
'i' occurs 9 times
'thought' occurs 1 times
'would' occurs 1 times
'sail' occurs 1 times
'about' occurs 2 times
'a' occurs 6 times
'see' occurs 1 times
'the' occurs 14 times
'watery' occurs 1 times
'part' occurs 1 times
'of' occurs 8 times
'world' occurs 1 times
'it' occurs 6 times
'is' occurs 7 times
'way' occurs 1 times
'have' occurs 1 times
'driving' occurs 1 times
'off' occurs 2 times
'spleen' occurs 1 times

As a bonus, we can create reports for:
+ printing the five most common words and their number of occurrences
+ printing the five least common words and their number of occurrences

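Dictionaries have no sorting method of their own, but the built-in `sorted` function accepts any iterable, including the key-value pairs returned by `items`:

In [39]:
# Use a name other than `sorted` so the built-in function is not shadowed.
sorted_words = sorted(word_ocurrences.items(), key=lambda x: x[1], reverse=True)
print(sorted_words[:10])

[('the\n', 14), ('and\n', 9), ('i\n', 9), ('of\n', 8), ('is\n', 7), ('a\n', 6), ('it\n', 6), ('me\n', 5), ('to\n', 5), ('in\n', 4)]


That makes it easy to find the most and least used words.

The `sorted` function takes an iterable of objects (key-value pairs in this case) and returns a new sorted list. The `key` parameter is a function that computes the value to sort by (here, the number of occurrences), and `reverse=True` sorts in descending order.

Therefore:

In [43]:
most_common = sorted_words[:5]
least_common = sorted_words[-5:]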

print("Most common words:")
for word, num_ocurrences in most_common:
    print(f"{word.strip()!r} occurs {num_ocurrences} times")
print()

print("Least common words:")
for word, num_ocurrences in least_common:
    print(f"{word.strip()!r} occurs {num_ocurrences} times")

Most common words:
'the' occurs 14 times
'and' occurs 9 times
'i' occurs 9 times
'of' occurs 8 times
'is' occurs 7 times

Least common words:
'land' occurs 1 times
'look' occurs 1 times
'at' occurs 1 times
'crowds' occurs 1 times
'watergazers' occurs 1 times
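
As an alternative to sorting by hand, `collections.Counter` provides a `most_common` method that returns the entries ordered from most to least frequent. The sketch below uses a short, made-up word list rather than the Moby Dick file, so the counts are purely illustrative.

```python
from collections import Counter

# Hypothetical sample data standing in for the normalized words.
words = ["the", "sea", "the", "ship", "sea", "the", "whale"]
counts = Counter(words)

# The n most frequent entries, highest count first.
assert counts.most_common(2) == [("the", 3), ("sea", 2)]

# Without an argument, most_common() ranks every entry,
# so the least frequent ones sit at the end of the list.
assert counts.most_common()[-2:] == [("ship", 1), ("whale", 1)]
```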
