# Associations and dictionaries

In mathematics, we group numbers or other elements in parentheses to form a *tuple*. For example, to represent a point in three-dimensional Euclidean space, we'd use 3-tuple notation like `(32,9,9732)`. Python uses the same notation:

In [1]:
p = (32,9,9732)
print(type(p))
print(p)

<class 'tuple'>
(32, 9, 9732)


Because Python also uses parentheses for grouping subexpressions like `(1+2)*3`, there is an ambiguity in the language. Does `(5)` represent a single-element tuple containing 5 or is it just the integer 5? It turns out that Python considers it an integer, so we use the slightly awkward notation `(5,)` to mean a 1-tuple.
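
A quick check makes the difference concrete. This is just a small sketch; the variable names `x` and `t` are only for illustration:

```python
x = (5)     # parentheses only group here, so this is the integer 5
t = (5,)    # the trailing comma is what makes this a 1-tuple

print(type(x))
print(type(t))
print(len(t))
```

The first `print` reports `int` and the second reports `tuple`, so it is the comma, not the parentheses, that creates the tuple.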

Tuples are ordered, so we access their elements using array indexing notation.

In [2]:
print(p[0])
print(p[1])
print(p[2])

32
9
9732


BUT, tuples are **immutable**, meaning you can't change the elements.  For example, `p[0]=34` gives you an error:

`TypeError: 'tuple' object does not support item assignment`.
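
To see the error without stopping the program, we can catch it. The sketch below also shows one possible workaround, which is to build a new tuple rather than modify the old one:

```python
p = (32, 9, 9732)

try:
    p[0] = 34                  # tuples do not allow item assignment
except TypeError as err:
    print('TypeError:', err)

# Build a brand new tuple from a new first element plus the old tail
p = (34,) + p[1:]
print(p)
```

Note that the last assignment rebinds the name `p` to a new tuple; the original tuple object is never changed.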

Looking ahead to our upcoming topic of document analysis, we'll associate a word (string) with the frequency (integer) with which it occurs in a document. For example, if the word "cat" appears 10 times, we'd create a tuple like this:

In [2]:
a = ('cat', 10)
print(type(a))
print(a)

<class 'tuple'>
('cat', 10)


The tuple notation works even when the values are variables rather than literals:

In [3]:
word = 'cat'
freq = 10
a = (word, freq)

## Bag of words representation

A document is a sequence of words that we can represent simply as a list of strings. For example, let's split apart a simple document into words:

In [4]:
doc = 'the cat sat on the hat on the mat'
words = doc.split(' ')
words

['the', 'cat', 'sat', 'on', 'the', 'hat', 'on', 'the', 'mat']

One representation of a bag of words is just a list of associations (the order of the tuples doesn't matter):

In [7]:
bag = [('the',3), ('cat',1), ('sat',1), ('on', 2), ('hat',1), ('mat', 1)]
bag

[('the', 3), ('cat', 1), ('sat', 1), ('on', 2), ('hat', 1), ('mat', 1)]

That is a faithful representation of a bag of words, but looking up word frequencies is not efficient. To find a word, we must linearly scan the list of tuples until we reach that word and then pluck out its frequency.

Here is a loop that walks `bag` to find and print out the number of occurrences of `'the'`.

In [5]:
bag = [('the',3), ('cat',1), ('sat',1), ('on', 2), ('hat',1), ('mat', 1)]
for a in bag:
    if a[0]=='the':
        print(a[1])
        break

3


### Exercise

What is the complexity, in "big O" notation, of walking that list of associations to find a word?

## Counter objects

Here's the easy way to get a bag of words using a `Counter` object, which is a kind of `dict`ionary:

In [6]:
words

['the', 'cat', 'sat', 'on', 'the', 'hat', 'on', 'the', 'mat']

In [7]:
from collections import Counter

c = Counter(words)
print(c)
print(c['the']) # index Counters like an array
print(c['on'])

Counter({'the': 3, 'on': 2, 'cat': 1, 'sat': 1, 'hat': 1, 'mat': 1})
3
2
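
A `Counter` offers a few conveniences beyond plain lookup. The sketch below rebuilds `words` so that it stands alone, then uses `most_common()` and `items()` from the standard `collections.Counter` class:

```python
from collections import Counter

words = 'the cat sat on the hat on the mat'.split(' ')
c = Counter(words)

print(c.most_common(2))   # the two most frequent words as (word, count) tuples
print(c['dog'])           # a word we never saw counts as 0 instead of raising an error
print(list(c.items()))    # back to a list of associations
```

The last line gives us the same kind of list of tuples we wrote by hand for `bag`.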


## Dictionaries

A list of tuples representing a list of associations is a perfectly fine way to represent a bag of words. It implies an order because it's in a list, but we could ignore that. The biggest problem is that lists are slow to search when they get big: finding a word takes $O(n)$ time for a list of $n$ associations. It turns out that there is a very efficient implementation for dictionaries, in which a lookup takes roughly constant time on average, and that makes dictionaries very attractive from an efficiency point of view. In Python, we can also access them using array-like notation. You will learn all about this in your project.

Creating a dictionary from a list of associations is easy:

In [8]:
bag

[('the', 3), ('cat', 1), ('sat', 1), ('on', 2), ('hat', 1), ('mat', 1)]

In [9]:
d = dict(bag) # make dict from list of associations
print(d)

{'the': 3, 'cat': 1, 'sat': 1, 'on': 2, 'hat': 1, 'mat': 1}


Python prints dictionaries out using `dict` literal notation, which we can use to define dictionaries directly:

In [10]:
e = {'the': 2, 'sat': 1, 'hat': 1, 'cat': 1}
print(e)

{'the': 2, 'sat': 1, 'hat': 1, 'cat': 1}


You can even do *dict comprehensions* similar to list comprehensions:

In [11]:
{c:ord(c) for c in 'abcde'}

{'a': 97, 'b': 98, 'c': 99, 'd': 100, 'e': 101}

In [12]:
[c.upper() for c in 'abcde'] # compare to this

['A', 'B', 'C', 'D', 'E']

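Given a dictionary, you can get the associations back out as a sequence of tuples. In Python 3, `items()` returns a `dict_items` view rather than a true list; pass it to `list()` if you need an actual list: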

In [13]:
tuples = e.items()
print(tuples)

dict_items([('the', 2), ('sat', 1), ('hat', 1), ('cat', 1)])
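
As a small sketch of how that result is typically used, we can convert it to a real list or loop over it directly, unpacking each tuple into two variables (the names `pairs`, `word`, and `freq` are just for illustration):

```python
e = {'the': 2, 'sat': 1, 'hat': 1, 'cat': 1}

pairs = list(e.items())        # an actual list of (word, frequency) tuples
print(pairs)

for word, freq in e.items():   # unpack each association as we walk the dict
    print(word, freq)
```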


Under the hood, however, dictionaries are more complicated than lists of associations; that extra complexity is what buys the speed.

Accessing elements of the dictionary looks like array indexing except that the index value is an arbitrary object, such as a string in our case:

In [14]:
d

{'the': 3, 'cat': 1, 'sat': 1, 'on': 2, 'hat': 1, 'mat': 1}

In [15]:
print(d['the'])
print(d['hat'])
d['hat'] = 99    # Replace the value for key hat with 99
print(d['hat'])

3
1
99
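
Assigning to a key that is not yet present adds it to the dictionary, which is enough to build a bag of words by hand. The following is a minimal sketch of what `Counter` does for us, not its actual implementation; it uses the `in` operator to test whether a word has been seen before:

```python
doc = 'the cat sat on the hat on the mat'

counts = {}                        # start with an empty dictionary
for w in doc.split(' '):
    if w in counts:                # seen before: bump the count
        counts[w] = counts[w] + 1
    else:                          # first sighting: add a new key
        counts[w] = 1

print(counts)
```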


For the search project, you will map each word to the `set` of document IDs that contain that word:

In [16]:
q = {'ronald':{3,4}, 'reagan':{19}}

Note that values can be mutable, such as those sets:

In [17]:
s = q['reagan']
s

{19}

In [18]:
s.add(77)
s.add(99)
s.add(11)

*That is a critical detail needed for your project!!!*
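
To see why this matters, here is a small sketch showing that a variable fetched from the dictionary refers to the very same set object stored in it, so changes made through the variable are visible through the dictionary too. The word `'nancy'` and document ID `5` are made-up values for illustration:

```python
q = {'ronald': {3, 4}, 'reagan': {19}}

s = q['reagan']               # s refers to the same set object stored in q
s.add(77)
print(sorted(q['reagan']))    # the set inside q has grown too
print(s is q['reagan'])       # same object, not a copy

# Record that a (hypothetical) new word appears in document 5
word, doc_id = 'nancy', 5
if word not in q:
    q[word] = set()           # first sighting: start with an empty set
q[word].add(doc_id)
print(q[word])
```

Because the value is updated in place, there is no need to store the set back into the dictionary after adding to it.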

Keys can be any immutable object, even compound things like tuples (but not lists, because lists are mutable):

In [19]:
td = {(1,2):'joe', (3,99):'mary'}
print(td)
print( td[(1,2)] ) # (1,2) is the key

{(1, 2): 'joe', (3, 99): 'mary'}
joe
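
For contrast, here is a short sketch of what happens when we try to use a list as a key. Python raises a `TypeError`; converting the list to a tuple first is one way around it:

```python
td = {(1, 2): 'joe'}

try:
    td[[1, 2]] = 'bob'         # a list cannot be used as a dictionary key
except TypeError as err:
    print('TypeError:', err)

key = tuple([1, 2])            # convert the list to an immutable tuple
td[key] = 'bob'                # same key as (1,2), so this replaces 'joe'
print(td)
```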


Trying to access a key that does not exist in the dictionary raises a `KeyError`, so it's best to check whether the key exists first:

In [20]:
# print(d['foo'])    # This would cause a KeyError!
if 'cat' in d:       # cat is indeed in dictionary d
    print(d['cat'])
if 'foo' in d:       # does not exist so we don't get an error on the next line
    print(d['foo'])  # does not execute

1
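
An alternative to testing with `in` is the dictionary's `get()` method, which returns a default value instead of raising an error. A brief sketch, using a small stand-in dictionary:

```python
d = {'the': 3, 'cat': 1}

print(d.get('cat'))       # key exists, so we get its value
print(d.get('foo'))       # missing key: get() returns None by default
print(d.get('foo', 0))    # or whatever default we supply as the second argument
```

For word frequencies, a default of `0` is a natural choice, since a word that never appears has a count of zero.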
