# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # example: "rvuduc3"
COLLABORATORS = [] # list of strings of your collaborators' IDs

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Part 4: A data structure for sparse matrices (sparse 2-D tables)

For pairwise association mining, it would be great if we had a compact data structure to store _sparse_ tables, that is, a 2-D table or matrix to store pairwise counts where most of the entries are empty (i.e., take up no space). Let's apply what we've learned so far to creating just such a data structure.

To test and demonstrate it in this part of the lab, you'll apply it to the problem of computing co-occurrence counts for letters in words, with the following running example. (Review Part 3 if needed.)

In [None]:
text = """How much wood could a woodchuck chuck
if a woodchuck could chuck wood?"""

We will also build on default dictionaries and combinations, so let's import those as well.

In [None]:
from collections import defaultdict
from itertools import combinations

## Sparse vectors for counting

First, let's "package up" the examples in Part 3.

Let's start by creating an abstract data type that we will refer to as a _sparse vector_. A sparse vector $\vec{x}$ is a collection of values, $\{x_k\}$, where $k$ is a "key" and $x_k$ is $k$'s value.

Typically, we think of vectors as mapping integer indices to real values in, for instance, linear algebra. However, in data mining we frequently map arbitrarily named objects into some integer index space. For this lab, let's design a flexible data structure that can use arbitrary (but distinct) names. That is, treat $k$ as an arbitrary name from a known set of possible names and take each $x_k$ to be an integer count.

The vector $\vec{x}$ is _sparse_ in that we expect "most" of its values to be 0. If we expect this fact to hold for our data, then we are motivated to write our code to exploit it.

From this definition of a sparse vector, it should seem "natural" that we might use a dictionary or default dictionary to represent $\vec{x}$.

In [None]:
def sparse_vector ():
    """Returns an empty sparse vector for storing counts."""
    return defaultdict (int)
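
As a brief illustration of why a default dictionary suits this role, the sketch below shows how `defaultdict(int)` treats absent keys. The variable names are hypothetical and used only for this demonstration. Note one subtlety: merely _reading_ an absent key with `x[k]` stores a zero entry, whereas `x.get(k, 0)` does not.

```python
from collections import defaultdict

x = defaultdict(int)     # the same construction that sparse_vector() uses
x['a'] += 1              # 'a' is absent, so it starts from int() == 0
x['a'] += 1

n_before = len(x)        # only 'a' is stored so far
missing = x['z']         # reading an absent key returns 0 ...
n_after = len(x)         # ... but also inserts 'z' as a stored entry
absent = x.get('q', 0)   # .get() reads a default without inserting

print(dict(x), n_before, n_after)
```

If the number of stored entries matters (as it does for the `len()` checks in the test cells), it may be worth keeping this insertion-on-read behavior in mind.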

For example, here is a function that takes an input string `s` and creates a sparse vector that maps each lowercase letter $k$ to a count $x_k$. It's basically just a slightly refactored version of `count_letters3` from before.

In [None]:
def count_letters_spvec (s):
    """Returns a sparse vector of (letter, count) pairs for the given string."""
    counts = sparse_vector ()
    letters = [c for c in s.lower () if c.isalpha ()]
    for k in letters:
        counts[k] += 1
    return counts

And here's a function to print a sparse vector.

In [None]:
def print_sparse_vector (x, name=None):
    """Prints a sparse vector with an optional name."""
    if name:
        name += ' '
    else:
        name = ''
    print ("=== Vector {}in Z^{}. ===".format (name, len (x)))
    elements = sorted (x.items (), key=lambda p: p[0]) # aside: what does this do?
    for key, value in elements:
        print ("%s: %d" % (key, value))

In [None]:
# Test code
counts_vec = count_letters_spvec (text)
print_sparse_vector (counts_vec, 'counts_vec')
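
Regarding the aside in `print_sparse_vector`: the `key` argument of `sorted` is a function applied to each element to obtain the value to sort by. Here is a minimal sketch on a small, made-up dictionary (`toy_counts` is a hypothetical name, not part of the lab).

```python
toy_counts = {'w': 2, 'o': 4, 'd': 2}   # hypothetical toy data

# Each item is a (key, value) pair p; p[0] is the key and p[1] is the value.
by_key = sorted(toy_counts.items(), key=lambda p: p[0])
by_value = sorted(toy_counts.items(), key=lambda p: p[1], reverse=True)

print(by_key)       # [('d', 2), ('o', 4), ('w', 2)]
print(by_value[0])  # ('o', 4)
```

Sorting by `p[0]` gives alphabetical order by letter, while sorting by `p[1]` with `reverse=True` puts the largest counts first.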

> _(4 points)_ **Question 1.** Write a function to update a vector $\vec{x}$ _in-place_. That is, given $\vec{x}$ and a second sparse vector $\vec{u}$, replace $\vec{x}$ by $\vec{x} \leftarrow \vec{x} + \vec{u}$.

In [None]:
def update_counts (x, u):
    """Given two sparse vectors, x and u, updates x in place by the formula, x <- x + u."""
    
    # Update x:
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return x

The following cell tests your function. Aside: In the code below, observe the use of the function `deepcopy()`. What role does this serve in this code?

In [None]:
a = defaultdict (int, {'a': 1, 'b': 2, 'c': 3})
b = defaultdict (int, {'a': -3, 'd': 4, 'e': 5, 'x': -2})

from copy import deepcopy
c = deepcopy (a) # ???
update_counts (c, b)

print ('{}\n  + {}\n  == {}'.format (a, b, c))

assert len (a) == 3 and a['a'] == 1 and a['b'] == 2 and a['c'] == 3
assert len (b) == 4 and b['a'] == -3 and b['d'] == 4 and b['e'] == 5 and b['x'] == -2
assert len (c) == 6
assert c['a'] == (a['a'] + b['a']) and c['b'] == a['b'] and c['c'] == a['c']
assert c['d'] == b['d'] and c['e'] == b['e'] and c['x'] == b['x']
assert sorted (c.keys ()) == ['a', 'b', 'c', 'd', 'e', 'x']
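
As a hint toward the aside about `deepcopy()`, the following sketch contrasts a plain assignment, which only creates a second name for the same object, with a copy, which creates an independent object. The names `alias` and `clone` are hypothetical.

```python
from collections import defaultdict
from copy import deepcopy

a = defaultdict(int, {'a': 1, 'b': 2})

alias = a              # a second name bound to the same object
clone = deepcopy(a)    # an independent object with equal contents

alias['a'] += 10       # this change is visible through `a` as well
clone['b'] += 10       # this change does not touch `a`

print(dict(a), dict(clone))
```

For a flat dictionary of integers a shallow copy would likely behave the same way; the distinction between shallow and deep copies matters more once the values are themselves mutable containers, as in a dictionary of dictionaries.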

## Sparse matrices (2-D tables) for counting co-occurring pairs

Let's now build a data structure for sparse matrices, using sparse vectors as a building block.

Let $\mathbf{X}$ denote the matrix and let $x_{ij}$ denote its $(i, j)$ entry, where $x_{ij}$ is the number of times that letter $i$ and letter $j$ co-occur within a word. As in an earlier part of this lab, treat each instance of a repeated word as a distinct "basket" and, within each basket, consider occurrences of each letter as being distinct.
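
To see what "occurrences of each letter as being distinct" means in practice, here is a small sketch of what `combinations` produces for one word. It only enumerates the pairs; how they are tallied into $\mathbf{X}$ should follow the conventions from Part 3, which this sketch does not attempt to reproduce.

```python
from itertools import combinations

word = "wood"   # one "basket" from the running example

# combinations() pairs up *positions*, so the two o's are treated as distinct.
pairs = list(combinations(word, 2))
print(pairs)
# [('w', 'o'), ('w', 'o'), ('w', 'd'), ('o', 'o'), ('o', 'd'), ('o', 'd')]
```

Each unordered pair of positions appears exactly once, so `('w', 'o')` shows up twice (once per `o`) but `('o', 'w')` never does.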

> _(3 points)_ **Question 2.** There are several possibilities for using a sparse vector as a building block for a sparse matrix. For this exercise, try creating it as a "dictionary of sparse vectors" (i.e., a dictionary of dictionaries). Thus, if `X` is a sparse matrix, we will be able to index it using the notation `X[i][j]`.
>
> In the following code cell, encode your solution as a "constructor," i.e., a function that returns an empty sparse matrix.

In [None]:
def sparse_matrix ():
    """Returns an empty sparse matrix (2-D table) for storing integer counts."""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert type (sparse_matrix ()) == defaultdict
orig_sparse_vector = sparse_vector
del sparse_vector
try:
    sparse_matrix ()
except NameError:
    print ("\n(Passed.)")
else:
    raise AssertionError ("Your sparse_matrix() implementation does not appear to use sparse_vector()")
finally:
    sparse_vector = orig_sparse_vector
    del orig_sparse_vector

> _(5 points)_ **Question 3.** Complete the following function, which returns a sparse matrix containing counts of all pairs. Remember to do the following, per our conventions from before:
>
> 1. Assume the input may have multiple words separated by spaces.
> 2. "Canonicalize" the letters by converting them to lowercase.
> 3. Ignore any character that is _not_ an alphabetic character.

In [None]:
def count_letter_pairs (s):
    """Returns a sparse matrix of co-occurring letter pairs within words,
    assuming words are separated by spaces.
    """
    Counts = sparse_matrix ()
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return Counts

Here is a function to print a sparse matrix in a somewhat more readable format, for debugging purposes.

In [None]:
def print_sparse_matrix (X, name=None):
    if not name:
        name = ''
    else:
        name += ' '
        
    # Aside: What do these lines do and how do they work?
    nr = len (X)
    nc = max (nr, max ((len (r) for r in X.values ()), default=0)) # default=0 handles an empty matrix
        
    print ("=== Matrix {}in Z^({}x{}) ===".format (name, nr, nc))
    sorted_rows = sorted (X.items (), key=lambda p: p[0])
    for (i, row_i) in sorted_rows:
        sorted_cols = sorted (row_i.items (), key=lambda p: p[0])
        print ("{:>3s} | {}".format (i, sorted_cols))

In [None]:
X = count_letter_pairs (text)

print ('"{}"\n==>'.format (text))
print_sparse_matrix (X)

assert X['h']['k'] == 4 and X['h']['k'] == X['k']['h']
assert X['c']['k'] == 8 and X['c']['k'] == X['k']['c']
assert X['k']['u'] == 4 and X['k']['u'] == X['u']['k']
assert X['d']['l'] == 2 and X['d']['l'] == X['l']['d']
assert X['c']['h'] == 9 and X['c']['h'] == X['h']['c']
assert X['k']['w'] == 2 and X['k']['w'] == X['w']['k']
assert X['h']['u'] == 5 and X['h']['u'] == X['u']['h']
assert X['h']['m'] == 1 and X['h']['m'] == X['m']['h']
assert X['l']['o'] == 2 and X['l']['o'] == X['o']['l']
assert X['k']['o'] == 4 and X['k']['o'] == X['o']['k']
assert X['h']['w'] == 3 and X['h']['w'] == X['w']['h']
assert X['d']['w'] == 4 and X['d']['w'] == X['w']['d']
assert X['h']['o'] == 5 and X['h']['o'] == X['o']['h']
assert X['c']['l'] == 2 and X['c']['l'] == X['l']['c']
assert X['c']['d'] == 6 and X['c']['d'] == X['d']['c']
assert X['d']['u'] == 4 and X['d']['u'] == X['u']['d']
assert X['c']['m'] == 1 and X['c']['m'] == X['m']['c']
assert X['d']['k'] == 2 and X['d']['k'] == X['k']['d']
assert X['o']['o'] == 4
assert X['d']['h'] == 2 and X['d']['h'] == X['h']['d']
assert X['m']['u'] == 1 and X['m']['u'] == X['u']['m']
assert X['f']['i'] == 1 and X['f']['i'] == X['i']['f']
assert X['d']['o'] == 10 and X['d']['o'] == X['o']['d']
assert X['u']['w'] == 2 and X['u']['w'] == X['w']['u']
assert X['o']['u'] == 6 and X['o']['u'] == X['u']['o']
assert X['c']['u'] == 11 and X['c']['u'] == X['u']['c']
assert X['c']['o'] == 10 and X['c']['o'] == X['o']['c']
assert X['o']['w'] == 9 and X['o']['w'] == X['w']['o']
assert X['c']['w'] == 4 and X['c']['w'] == X['w']['c']
assert X['c']['c'] == 4
assert X['l']['u'] == 2 and X['l']['u'] == X['u']['l']

print ("\n(Passed.)")

> _(5 points)_ **Question 4.** Given a sparse matrix $\mathbf{X}$ (e.g., computed as `X` in the preceding code), write a function to compute the top pairs, meaning those whose count is at least a given threshold $s$. This function should, more precisely,
>
> 1. return a list of nested tuples, `((i, j), x_ij)` where `(i, j)` are the indices of the entry $x_{\mathtt{i}, \mathtt{j}} =$ `x_ij` (i.e., count); and
> 2. since `X` is symmetric, only consider either the lower-triangle or upper-triangle.
>
> A reasonable scheme for this question is to sort all the entries to identify the top ones. Implement this idea below.

In [None]:
def top_entries (X, s):
    """Given a sparse (count) matrix X, returns a list of
    ((i, j), x_ij) entries whose counts are at least s.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
s = 6
top_pairs = top_entries (X, s)

print ("=== Entries of X with count >= {} ===".format (s))
for p in top_pairs:
    print (p)

assert (('c', 'u'), 11) in top_pairs or (('u', 'c'), 11) in top_pairs
assert (('c', 'o'), 10) in top_pairs or (('o', 'c'), 10) in top_pairs
assert (('d', 'o'), 10) in top_pairs or (('o', 'd'), 10) in top_pairs
assert (('c', 'h'), 9) in top_pairs or (('h', 'c'), 9) in top_pairs
assert (('o', 'w'), 9) in top_pairs or (('w', 'o'), 9) in top_pairs
assert (('c', 'k'), 8) in top_pairs or (('k', 'c'), 8) in top_pairs
assert (('c', 'd'), 6) in top_pairs or (('d', 'c'), 6) in top_pairs
assert (('o', 'u'), 6) in top_pairs or (('u', 'o'), 6) in top_pairs
assert len (top_pairs) == 8

print ("\n(Passed.)")

## The _A-Priori_ Algorithm

A potentially more efficient alternative to the previous algorithm is the _a-priori algorithm_. The key idea is to exploit monotonicity, which is the following property: if the pair of items, $(i, j)$, appears at least $s$ times, then items $i$ and $j$ must also appear at least $s$ times.
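
The monotonicity property can be checked empirically on a toy example. The sketch below is only a sanity check of the property itself, not the a-priori algorithm; the baskets are made up, and each basket is assumed to be a set so that an item counts at most once per basket.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets (not the lab's data).
baskets = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'd'}]
s = 2   # frequency threshold

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    items = sorted(basket)                       # canonical order within a basket
    item_counts.update(items)                    # one count per item per basket
    pair_counts.update(combinations(items, 2))   # one count per pair per basket

frequent_pairs = {p: n for p, n in pair_counts.items() if n >= s}

# Monotonicity: every frequent pair is made of individually frequent items.
for (i, j) in frequent_pairs:
    assert item_counts[i] >= s and item_counts[j] >= s

print(frequent_pairs)   # {('a', 'b'): 2, ('a', 'c'): 2}
```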

> _(3 points)_ **Question 5.** Based on this observation, devise a scheme that can identify frequent pairs by reading the entire data set only twice, using at most $O(n + k_s^2)$ storage, where $n$ is the number of items and $k_s$ is the number of items that appear at least $s$ times. Justify your approach.

YOUR ANSWER HERE