# Notebook 9: The A Priori Algorithm
***

In this notebook we'll run through the A Priori algorithm step-by-step for a simple data set.

We'll need numpy for this notebook, so let's load it.

In [1]:
import numpy as np 

<br>

<img width=350px src="https://img.memecdn.com/walmart-probs_o_3678107.jpg">

### Exercise 1: Coding up our example from class

Consider the basket data from the Walmart example in class:

In [2]:
b = "beer"
c = "coke"
j = "juice"
m = "milk"
n = "noodles"
p = "pepsi"

inventory = [b,c,j,m,n,p]

baskets = {0 : {b,c,m}, 1 : {j,m,p},   2 : {b,c,m,n}, 3 : {c,j},
           4 : {b,m,p}, 5 : {b,c,j,m}, 6 : {b,c,j},   7 : {b,c}}

Let's use a support threshold of 3 again, and have a look at how we would go about coding up the a priori algorithm that we stepped through in class.

In [3]:
# Support threshold
s = 3

The first pass creates two tables:
1. A lookup table that (if necessary) translates item names into integers $0, 1, 2, \ldots, n-1$, where $n$ is the number of items
2. A table of counts for the singleton itemsets, indexed by those integers

So as we read the basket data, we translate each item name into an integer, which is usually done using a hash table like we did last time. Then we use that integer as an index into the table of counts, which holds the support of each singleton itemset. We did the whole hash table song and dance last time, so we will skip it today. Suppose our lookup table is as follows:

In [4]:
lookup = {b : 0, c : 1, j : 2, m : 3, n : 4, p : 5}
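The hard-coded table above could also be built directly from the item list. Here is a minimal sketch, assuming the items are numbered in the order they appear in `inventory`. The name `lookup_sketch` is introduced here so that it does not overwrite `lookup`.

```python
# Item names as defined earlier in the notebook
inventory = ["beer", "coke", "juice", "milk", "noodles", "pepsi"]

# Number the items 0, 1, 2, ... in inventory order
lookup_sketch = {item: idx for idx, item in enumerate(inventory)}
print(lookup_sketch)
```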

Now we need to fill in a table of the `counts` by counting up the support for all items in the `inventory`. The entire inventory is our first candidate set, $C_1$.

In [5]:
C1 = inventory
n_item = len(C1)
counts = [0]*n_item # initialize to 0 counts

Before, we had assumed that we have all of the basket data at once. But in reality, we may be obtaining more and more basket data as it flows in. So it makes more sense to loop over the baskets and increment the `counts` table as we go.

In [6]:
#for basket in baskets.values():
    # TODO -- loop over this basket
        # TODO -- for each item in the basket, increment the counter
        #         (use the lookup table to figure out which index within the counts array)

In [7]:
# SOLUTION:

for basket in baskets.values():
    # loop over the items in this basket
    for item in basket:
        # use the lookup table to find the item's index in the counts array,
        # then increment that counter
        idx = lookup[item]
        counts[idx] += 1
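As a sanity check on the loop above, the same supports can be tallied with `collections.Counter` from the standard library. This is only a cross-check sketch: it assumes all baskets are available at once, unlike the streaming loop, and the names `tally` and `check_counts` are introduced here for illustration.

```python
from collections import Counter

baskets = {0: {"beer", "coke", "milk"}, 1: {"juice", "milk", "pepsi"},
           2: {"beer", "coke", "milk", "noodles"}, 3: {"coke", "juice"},
           4: {"beer", "milk", "pepsi"}, 5: {"beer", "coke", "juice", "milk"},
           6: {"beer", "coke", "juice"}, 7: {"beer", "coke"}}
lookup = {"beer": 0, "coke": 1, "juice": 2, "milk": 3, "noodles": 4, "pepsi": 5}

# Count how many baskets contain each item
tally = Counter(item for basket in baskets.values() for item in basket)

# Arrange the tallies in lookup order so they line up with the counts table
check_counts = [0] * len(lookup)
for item, idx in lookup.items():
    check_counts[idx] = tally[item]
print(check_counts)
```

If the loop above ran correctly, `check_counts` should match `counts` entry for entry.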

Now we need to prune the non-frequent itemsets to find our reduced set of candidates, $L_1$. Here is where we create the new numbering of the frequent items. In the book and slides, this is a 1D array with one element for each item:
* a 0 if the item is not frequent
* a unique integer 1 to $m$ if the item is frequent, where $m$ is the number of frequent singleton itemsets

We will put a twist on that presentation here: since Python indexing starts at 0, we will use integers 0 through $m-1$ for the frequent items and -1 for items that are not frequent. We will also store the numbering in a dictionary keyed by item name rather than in an array.

In [8]:
C1

['beer', 'coke', 'juice', 'milk', 'noodles', 'pepsi']

In [9]:
new_lookup = {}
cnt = 0
L1 = []
for item in C1:
    if counts[lookup[item]] >= s:
        L1.append(item)
        new_lookup[item] = cnt
        cnt += 1
    else:
        new_lookup[item] = -1

print(new_lookup)

{'beer': 0, 'coke': 1, 'juice': 2, 'milk': 3, 'noodles': -1, 'pepsi': -1}


The next pass requires generating all possible pairs from $L_1$; these form our candidate set for the second pass, $C_2$. We have four elements, so there are $C(4,2) = 6$ distinct pairs:
$$C_2 = \{\{b,c\}, \{b,j\}, \{b,m\}, \{c,j\}, \{c,m\}, \{j,m\}\}$$

The beauty of the renumbering in `new_lookup` is that we know which elements of $C_2$ correspond to which item pairs in, say, an upper-triangular matrix:
$$\begin{bmatrix}              & \text{beer} & \text{coke} & \text{juice} & \text{milk} \\
                  \text{beer}  & 0           & m_0         & m_1          & m_2         \\
                  \text{coke}  & 0           & 0           & m_3          & m_4         \\
                  \text{juice} & 0           & 0           & 0            & m_5         \\
                  \text{milk}  & 0           & 0           & 0            & 0           \end{bmatrix}$$
                  
Now, when we compute the counts for each of these six pairs, we can store them in a length-6 1D array and use either our triangular array indexing or a triples array. We will use a triples array here, and initialize it with all 0s for counts:

In [10]:
trips = []
for idx1 in range(len(L1)):
    for idx2 in range(idx1+1, len(L1)):
        trips.append((new_lookup[L1[idx1]], new_lookup[L1[idx2]], 0))
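The double loop above enumerates pairs in the same order as `itertools.combinations`, so the triples array could also be built as in the sketch below. It assumes the frequent items are numbered 0 through 3 as in `new_lookup`. The names `n_frequent` and `trips_sketch` are introduced here for illustration.

```python
from itertools import combinations

n_frequent = 4  # beer, coke, juice, milk

# combinations yields (0,1), (0,2), (0,3), (1,2), (1,3), (2,3): row by row
# through the upper triangle, matching m_0 ... m_5 in the matrix above
trips_sketch = [(i, j, 0) for i, j in combinations(range(n_frequent), 2)]
print(trips_sketch)
```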

And the whole point of the renumbering in `new_lookup` is that we can use the triples/triangular array indexing function from last time to grab whatever pair of frequent items we want! Recall that the function for the index within the triples array for elements with coordinates $i$ and $j$ within the full upper-triangular array (above) is:
$$k = i\cdot \left(n - \dfrac{i+1}{2}\right) + j - i -1$$

where $0 \leq i < j \leq n-1$ and $n$ is the total number of items. Here, we only have the frequent items, so $n$ is replaced by the number of frequent singleton itemsets. $i$ and $j$ are the indices within the frequent pairs matrix (above), which we can get from `new_lookup`. Say we wanted to get the index of (coke, milk) from the triples array. We know from looking at the array above that it should be $k=4$. We turn to our triples indexing to see if we can get this:

In [11]:
item1 = "coke"
item2 = "milk"
idx1 = new_lookup[item1]
idx2 = new_lookup[item2]

k = idx1*(len(L1) - (idx1+1)/2) + idx2 - idx1 - 1
print(k)

4.0


Cool!  But what if we had queried for (milk, coke)?

In [12]:
item1 = "milk"
item2 = "coke"
idx1 = new_lookup[item1]
idx2 = new_lookup[item2]

k = idx1*(len(L1) - (idx1+1)/2) + idx2 - idx1 - 1
print(k)

3.0


Well that's certainly not right... **Consider:** why did this happen? What line(s) of code could we add to the block above to fix it? *Hint: it shouldn't be more than 1-3 lines of code.*

In [13]:
# SOLUTION:

item1 = "milk"
item2 = "coke"
idx1 = new_lookup[item1]
idx2 = new_lookup[item2]
# swap if out of order
if idx1 > idx2:
    idx1, idx2 = idx2, idx1
    
k = idx1*(len(L1) - (idx1+1)/2) + idx2 - idx1 - 1
print(k)

4.0
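The swap and the index formula can be wrapped in a small helper. The sketch below is one way to do it, not code from the lecture: `triangular_index` is a hypothetical name, and it uses integer division so that $k$ comes back as an `int` rather than a float like `4.0`.

```python
def triangular_index(i, j, n):
    """Position of the pair (i, j) in the flattened upper triangle
    of an n x n array, with 0-based indices."""
    if i == j:
        raise ValueError("a pair needs two distinct items")
    if i > j:
        i, j = j, i  # order does not matter for an itemset
    # i*(n - (i+1)/2) rewritten as i*n - i*(i+1)/2; i*(i+1) is always even,
    # so the integer division is exact
    return i * n - i * (i + 1) // 2 + j - i - 1

print(triangular_index(1, 3, 4))  # (coke, milk)
print(triangular_index(3, 1, 4))  # (milk, coke)
```

Both calls should give the same index, since the helper orders the pair before applying the formula.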


Following along with the slides, we continue to pretend that we are streaming basket data one basket at a time. For each basket, we use a double loop to generate all pairs of frequent items in that basket. We'll use the third basket as an example.

In [14]:
basket = baskets[2]
print(basket)

{'coke', 'noodles', 'milk', 'beer'}


We see beer, coke and milk are frequent items within the basket. We can identify this using the & operator for Python sets (grabbing the intersection):

In [15]:
frequent_items = list(basket&set(L1))
print(frequent_items)

['beer', 'milk', 'coke']


Now in a double loop, we generate all pairs of frequent items within this basket and update the count of each pair in the triples array. We just need to be careful that the first index in each triple is always the smaller of the two indices that we fetch out of `new_lookup`.

In [16]:
for idx1 in range(len(frequent_items)):
    for idx2 in range(idx1+1, len(frequent_items)):
        # get the index within the triples array
        new_idx1 = new_lookup[frequent_items[idx1]]
        new_idx2 = new_lookup[frequent_items[idx2]]
        if new_idx1 > new_idx2:
            new_idx1, new_idx2 = new_idx2, new_idx1
        k = int(new_idx1*(len(L1) - (new_idx1+1)/2) + new_idx2 - new_idx1 - 1)
        # update the counter in the triples array
        # tuples are immutable, so need full reassignment
        trips[k] = (new_idx1, new_idx2, trips[k][2]+1)

Check the triangular matrix above against the updated triples array to verify that the pairs containing beer, milk and coke have been updated:

In [17]:
trips

[(0, 1, 1), (0, 2, 0), (0, 3, 1), (1, 2, 0), (1, 3, 1), (2, 3, 0)]
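Because each triple's position already encodes its pair, the counts alone could be kept in a plain list, which is the triangular-array alternative mentioned earlier. Here is a minimal sketch for the same basket, assuming the `new_lookup` numbering above. The names `n_frequent`, `indices` and `pair_counts` are introduced here for illustration.

```python
from itertools import combinations

new_lookup = {"beer": 0, "coke": 1, "juice": 2, "milk": 3, "noodles": -1, "pepsi": -1}
n_frequent = 4
basket = {"beer", "coke", "milk", "noodles"}

# One counter per pair of frequent items
pair_counts = [0] * (n_frequent * (n_frequent - 1) // 2)

# Keep only frequent items, sorted by index so that i < j without a swap
indices = sorted(new_lookup[item] for item in basket if new_lookup[item] >= 0)
for i, j in combinations(indices, 2):
    k = i * n_frequent - i * (i + 1) // 2 + j - i - 1
    pair_counts[k] += 1  # lists are mutable, so no tuple rebuild is needed
print(pair_counts)
```

The nonzero entries should sit at the same positions as the nonzero counts in the triples array above.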

Now, we only did that for the third basket, but to perform the entire second pass for a priori, we need to do this for *all* baskets. So, reset the triples array to all 0s and loop over all baskets to find the frequent pair itemsets. Recall from the lecture that they are:
$$L_2 = \{\{b,c\}, \{b,m\}, \{c,j\}, \{c,m\}\}$$

In [18]:
# Reset the triples array
trips = []
for idx1 in range(len(L1)):
    for idx2 in range(idx1+1, len(L1)):
        trips.append((new_lookup[L1[idx1]], new_lookup[L1[idx2]], 0))
        
for basket in baskets.values():
    print("Your code goes here!"); break # TODO

Your code goes here!


In [19]:
# SOLUTION:

for basket in baskets.values():
    frequent_items = list(basket&set(L1))
    for idx1 in range(len(frequent_items)):
        for idx2 in range(idx1+1, len(frequent_items)):
            # get the index within the triples array
            new_idx1 = new_lookup[frequent_items[idx1]]
            new_idx2 = new_lookup[frequent_items[idx2]]
            if new_idx1 > new_idx2:
                new_idx1, new_idx2 = new_idx2, new_idx1
            k = int(new_idx1*(len(L1) - (new_idx1+1)/2) + new_idx2 - new_idx1 - 1)
            # update the counter in the triples array
            # tuples are immutable, so need full reassignment
            trips[k] = (new_idx1, new_idx2, trips[k][2]+1)
            
print(trips)

# quiz solution: the triple for (beer, milk)
print(trips[2])

[(0, 1, 5), (0, 2, 2), (0, 3, 4), (1, 2, 3), (1, 3, 3), (2, 3, 2)]
(0, 3, 4)
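To finish the second pass, the triples can be filtered against the support threshold and mapped back to item names. Here is a sketch, assuming the final triples array printed above and the threshold `s = 3`. The name `L2` is introduced here, and the mapping back to names relies on the frequent items having been numbered in `L1` order.

```python
s = 3
L1 = ["beer", "coke", "juice", "milk"]
trips = [(0, 1, 5), (0, 2, 2), (0, 3, 4), (1, 2, 3), (1, 3, 3), (2, 3, 2)]

# Keep pairs whose count meets the threshold; L1[i] recovers the item name
# because frequent items were numbered in L1 order
L2 = [{L1[i], L1[j]} for i, j, count in trips if count >= s]
print(L2)
```

The result can be compared against the $L_2$ recalled from the lecture.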
