# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # example: "rvuduc3"
COLLABORATORS = [] # list of strings of your collaborators' IDs

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Part 3: Strings, default dictionaries, and iteration tools

Before we get to pairwise association mining algorithms, let's review string manipulation and introduce some tools that will prove useful.

## Strings review

Consider the following multiline string:

In [None]:
text = """How much wood could a woodchuck chuck
if a woodchuck could chuck wood?"""

> _(1 point)_ **Question 1.** Write some code to compute `words`, a list of the words contained in `text`, with every word converted to lowercase. You may assume that only whitespace (spaces and newlines) delimits words, so punctuation stays attached to its word (e.g., `wood?`).
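
As a quick refresher on the relevant string methods, here is a minimal sketch on a made-up string (`sample` and the other variable names are illustrative and not part of the assignment). It shows how splitting on any whitespace differs from splitting on a single space character when the string spans more than one line.

```python
# Hypothetical two-line sample string, used only for illustration.
sample = "The Quick brown\nFox"

# With no argument, str.split() splits on any run of whitespace,
# including the newline.
tokens = sample.split()
print(tokens)      # ['The', 'Quick', 'brown', 'Fox']

# Splitting on a single space leaves the newline inside a token.
by_space = sample.split(' ')
print(by_space)    # ['The', 'Quick', 'brown\nFox']

# str.lower() returns a lowercase copy; the original is unchanged.
lowered = sample.lower()
print(repr(lowered))
```

Since `text` above also spans two lines, the choice of delimiter affects how many words you get.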

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print ("Your solution produced:", words)

print ("Testing output...")
assert len (words) == 13
assert sum ([w == "how" for w in words]) == 1
assert sum ([w == "much" for w in words]) == 1
assert sum ([w == "wood" for w in words]) == 1
assert sum ([w == "could" for w in words]) == 2
assert sum ([w == "a" for w in words]) == 2
assert sum ([w == "woodchuck" for w in words]) == 2
assert sum ([w == "chuck" for w in words]) == 2
assert sum ([w == "if" for w in words]) == 1
assert sum ([w == "wood?" for w in words]) == 1
print ("\n(Passed.)")

> _(3 points)_ **Question 2.** Write a function that returns a dictionary whose keys are each of the unique letters of the string and whose corresponding values are the counts of each letter. Your implementation should also obey these rules:
>
> 1. As above, you should _ignore_ capitalization by "canonicalizing" all letters to lowercase.
> 2. You should _discard_ non-alphabetic characters, e.g., numbers and punctuation.
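
To make the two rules concrete, the following sketch applies them to a made-up string (the names `sample` and `letters` are hypothetical). It only filters characters and does not count them, so it is a starting point rather than a full solution.

```python
# Hypothetical sample containing digits, punctuation, and mixed case.
sample = "R2-D2 & C-3PO!"

# Rule 1: lowercase everything. Rule 2: keep alphabetic characters only.
letters = [c for c in sample.lower() if c.isalpha()]
print(letters)    # ['r', 'd', 'c', 'p', 'o']
```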

In [None]:
def count_letters (s):
    """Returns a dictionary of (letter, count) pairs for the given string."""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
counts = count_letters (text)

print ("count_letters ('''{0}''')\n==\n{1}".format (text, counts))

assert len (counts) == 12 # Number of unique letters

# Check frequency of each letter against the 'woodchuck' string
assert ('a' in counts) and (counts['a'] == 2)
assert ('c' in counts) and (counts['c'] == 11)
assert ('d' in counts) and (counts['d'] == 6)
assert ('f' in counts) and (counts['f'] == 1)
assert ('h' in counts) and (counts['h'] == 6)
assert ('i' in counts) and (counts['i'] == 1)
assert ('k' in counts) and (counts['k'] == 4)
assert ('l' in counts) and (counts['l'] == 2)
assert ('m' in counts) and (counts['m'] == 1)
assert ('o' in counts) and (counts['o'] == 11)
assert ('u' in counts) and (counts['u'] == 7)
assert ('w' in counts) and (counts['w'] == 5)

print ("\n(Passed.)")

## Default dictionaries: `collections.defaultdict`

Python also has a special type of dictionary called a _default dictionary_, or `defaultdict`. You can use it to simplify the letter counter. The `defaultdict` type is defined in the `collections` module. To see how, consider this example.

In [None]:
from collections import defaultdict

# Frequency table, take 2
def count_letters2 (s):
    """Returns a (default) dictionary of (letter, count) pairs for the given string."""
    counts = defaultdict (int)
    letters = [c for c in s.lower () if c.isalpha ()]
    for letter in letters:
        counts[letter] += 1
    return counts

In [None]:
# Check answers against the first method
counts1 = count_letters (text)
counts2 = count_letters2 (text)

print ("1. {}".format (counts1))
print ("2. {}".format (counts2))

for key, value in counts1.items (): assert counts2[key] == value
for key, value in counts2.items (): assert counts1[key] == value
print ("\n(Passed: Method 2 gives the same answer as method 1.)")

> _(2 points)_ **Question 3.** Explain how this implementation differs from your previous implementation. Why does `defaultdict` take an argument?

YOUR ANSWER HERE

Consider the following code fragment. Try to predict what it will print _before_ running it. Also try changing the base type from `int` to other things, like `str`, `float`, `list`, and `set`.

In [None]:
a = defaultdict (str)
print ("{} ==> length {}".format (a, len (a))) # What will this print?
print (a['non-existent-key']) # What will this print?
print ("{} ==> length {}".format (a, len (a))) # What will this print?
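
Once you have made your prediction, you may find it useful to see a base type other than `int` in action. The sketch below uses `list` to group a few words by their first letter; the word list and the name `words_by_initial` are made up for illustration.

```python
from collections import defaultdict

# Group words by first letter, using `list` as the base type.
words_by_initial = defaultdict(list)
for w in ["wood", "woodchuck", "chuck", "could"]:
    # No need to check whether the key w[0] already exists.
    words_by_initial[w[0]].append(w)

print(dict(words_by_initial))
# {'w': ['wood', 'woodchuck'], 'c': ['chuck', 'could']}
```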

## Combinations for co-occurring pairs

Suppose we wish to count co-occurring letters in words, in the sense of doing pairwise association mining. That is, you have some text consisting of a bunch of words. Let each word be a "basket" and let each alphabetic character within a word be an "item" within that basket. The question is which pairs of letters tend to co-occur most frequently.

**What about repetition?** Treat each instance of a repeated word as a distinct basket. Within a basket, count each occurrence of a letter separately. For example, in the word `wood`, the pairs (`w`, `o`) and (`o`, `d`) occur twice each, because there are two instances of `o`, while the pairs (`w`, `d`) and (`o`, `o`) occur once each. If `wood` were the only basket, then the table of co-occurrence counts would have 7 entries whose values are

- $T_{\mathtt{w}, \mathtt{o}} = T_{\mathtt{o}, \mathtt{w}} = T_{\mathtt{o}, \mathtt{d}} = T_{\mathtt{d}, \mathtt{o}} = 2$ and
- $T_{\mathtt{w}, \mathtt{d}} = T_{\mathtt{d}, \mathtt{w}} = T_{\mathtt{o}, \mathtt{o}} = 1$.

As it happens, Python provides a handy tool, `combinations`, from the `itertools` module, for producing an object that you can use to iterate over exactly these pairs. Here is an example of how to apply it to a string.

In [None]:
from itertools import combinations

print ('1.', combinations ('wood', 2))
print ('2.', list (combinations ('wood', 2)))
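
To connect `combinations` back to the co-occurrence table for `wood`, here is a minimal sketch that combines it with a `defaultdict`. The helper name `pair_counts_for_word` is hypothetical, and storing the mirrored entry explicitly is one possible way to keep the table symmetric, not the only one.

```python
from collections import defaultdict
from itertools import combinations

def pair_counts_for_word(word):
    """Hypothetical helper: symmetric co-occurrence counts for one basket."""
    table = defaultdict(int)
    for a, b in combinations(word, 2):
        table[(a, b)] += 1
        if a != b:
            # Mirror the entry so that T[a, b] == T[b, a].
            table[(b, a)] += 1
    return table

T = pair_counts_for_word('wood')
for pair, count in sorted(T.items()):
    print(pair, count)

print(len(T))    # 7 entries, matching the table for `wood` above
```

Note that `combinations` pairs characters by position, which is why both instances of `o` are counted and the pair (`o`, `o`) appears once.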