# Notebook 7: Frequent Itemsets
***

In this notebook we'll examine some of the tools of market-basket analysis that we just saw in class, like:
* finding frequent itemsets,
* finding association rules for those itemsets, and
* computing confidence and interest in those association rules.

We'll need some nice packages for this notebook, so let's load them.

In [1]:
import numpy as np 
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

<br>

### Exercise 1:  Computing support

Suppose we operate a poorly-stocked convenience store which, for the sake of giving it a name, we'll call the PDQ at Table Mesa and Foothills. Suppose the full inventory of our sad PDQ is just 6 items:

In [2]:
inventory = ["apple", "banana", "candy", "fancy feast", "grape soda", "ice cream"]

Consider the following data set of baskets and their contents.

In [3]:
baskets = {0 : set(["apple", "banana", "candy", "fancy feast"]),
           1 : set(["apple", "banana", "grape soda"]),
           2 : set(["banana", "ice cream"]),
           3 : set(["apple", "candy", "ice cream"]),
           4 : set(["apple", "fancy feast", "banana", "ice cream"]),
           5 : set(["apple", "banana", "candy", "ice cream"]),
           6 : set(["candy", "ice cream", "banana"]),
           7 : set(["banana", "fancy feast", "ice cream"])}

If you are not familiar with them, [set objects](https://www.geeksforgeeks.org/sets-in-python/) in Python are a handy way for us to compute some of the relevant quantities when it comes to analyzing market baskets, like support and interest, and finding frequent itemsets. Because you can't spell "itemset" without "set"!

Key set methods include **union**, **intersection**, **in** and **subset**, which we can find using the following commands.

In [4]:
# union:
print(set.union(baskets[0], baskets[1]))

# intersection:
print(set.intersection(baskets[0], baskets[1]))

# in:
print("apple" in baskets[0])

# subset:
print({"apple", "banana"} <= baskets[0])

{'fancy feast', 'apple', 'candy', 'grape soda', 'banana'}
{'apple', 'banana'}
True
True
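
As a side note, the same operations are also available as operators and instance methods on any Python set. Here is a minimal sketch, using two small sets `a` and `b` (names chosen just for this illustration) that mirror baskets 0 and 1:

```python
# Two small sets mirroring baskets 0 and 1 (a and b are illustrative names)
a = {"apple", "banana", "candy", "fancy feast"}
b = {"apple", "banana", "grape soda"}

union_ab = a | b       # same result as set.union(a, b)
common_ab = a & b      # same result as set.intersection(a, b)
only_a = a - b         # set difference: items in a that are not in b
is_sub = {"apple", "banana"}.issubset(a)   # same test as {"apple", "banana"} <= a

# Sets are unordered, so sort before printing to get a stable display
print(sorted(union_ab))
print(sorted(common_ab))
print(sorted(only_a))
print(is_sub)
```

Which spelling to use is mostly a matter of taste; the operator forms tend to be shorter inside list comprehensions.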


For more information, check out the manual pages and the above tutorial link. But for now, that's enough to get us off the ground. Let's start by computing the **support** of the item `apple` in our set of baskets, that is, the number of baskets that contain it.

In [5]:
# SOLUTION:

supp_apple = np.sum(["apple" in baskets[k] for k in range(len(baskets))])
print(supp_apple)

5


Now, what is the support of the itemset {`apple`, `banana`} in our basket data set? *Hint: consider using a logical AND operator within your code from the previous part.*

In [6]:
# SOLUTION:

supp_apple_and_banana = np.sum(["apple" in baskets[k] and "banana" in baskets[k] for k in range(len(baskets))])
print(supp_apple_and_banana)

4
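
The two computations above follow the same pattern, so they could be wrapped in one function. Below is a minimal sketch, assuming a hypothetical helper named `support` (not part of the original notebook) that uses the subset test `<=` in place of chained `and` conditions:

```python
# Same basket data as in the notebook, repeated so this cell runs on its own
baskets = {0: {"apple", "banana", "candy", "fancy feast"},
           1: {"apple", "banana", "grape soda"},
           2: {"banana", "ice cream"},
           3: {"apple", "candy", "ice cream"},
           4: {"apple", "fancy feast", "banana", "ice cream"},
           5: {"apple", "banana", "candy", "ice cream"},
           6: {"candy", "ice cream", "banana"},
           7: {"banana", "fancy feast", "ice cream"}}

def support(itemset, baskets):
    """Count the baskets that contain every item in `itemset` (hypothetical helper)."""
    items = set(itemset)
    # items <= basket is True exactly when all items appear in the basket
    return sum(items <= basket for basket in baskets.values())

print(support({"apple"}, baskets))
print(support({"apple", "banana"}, baskets))
```

Iterating over `baskets.values()` avoids assuming that the dictionary keys are exactly `0, 1, ..., n-1`.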


<br>

### Exercise 2:  Finding frequent itemsets

Recall that a *frequent itemset* is one that appears in at least $s$ baskets, where $s$ is some support threshold that we choose. In order to find frequent itemsets, we need to consider subsets of items of increasing size, up to the point where no more subsets are actually contained in any baskets. For example, if no subsets of size 3 are present in any basket, then there certainly aren't any subsets of size 4 present in any basket, and at that point we could stop checking for frequent itemsets.

Find all of the subsets of our PDQ's `inventory` and store them in a variable of some kind. Note that we can use a **recursive** approach to solve this problem, which amounts to finding the **power set** of the `inventory` set:
* **Base case:** If $S$ has exactly one element in it (call it $a$), then $\mathcal{P}(S) = \{\emptyset, \{a\}\}$
* Assume the set $S$ has more than one element in it.
* Then the power set of $S$ is composed of:
  * Each element of $\mathcal{P}(S_{-0})$, the power set of all elements of $S$ except the first one.
  * Each set that is of the form $S_{[k]} \cup \{S_0\}$, where $S_0$ is the first element of $S$ and $S_{[k]}$ is some element of $\mathcal{P}(S_{-0})$, including the empty set.
  * We can ignore the empty set in the last step if we just tack on $\{S_0\}$ by itself.

In [7]:
def powerset(s):
    '''
    Assumes s is an input set as a list, and 
    outputs the power set of s (as a list).
    --> Won't include the empty set since that isn't interesting
    --> You can also assume there are no repeats in s
    '''
    if len(s)==1:
        return 0 # TODO -- what should we return?
    else:
        s0 = [s[0]]  # a set containing only the first element of s
        ps = [s0]    # initializing the power set to be a set containing 
                     #    only the set containing the first element of s

        # For each element sk that is a set from the power set of the set
        # of all elements in s besides the first one...
        for sk in powerset(s[1:]):
            # add to the power set each element of the slightly smaller power set
            ps.append(0) # TODO (replace the 0 with something)
            
            # add to the power set the first element combined with each of the others
            ps.append(0) # TODO (replace the 0 with something)
            
        return sorted(ps, key=len) # sort it by increasing length so it looks "right"

In [8]:
# SOLUTION:

def powerset(s):
    '''
    Assumes s is an input set as a list, and 
    outputs the power set of s (as a list).
    --> Won't include the empty set since that isn't interesting
    --> You can also assume there are no repeats in s
    '''
    if len(s)==1:
        return [s]
    else:
        s0 = [s[0]]  # a set containing only the first element of s
        ps = [s0]    # initializing the power set to be a set containing 
                     #    only the set containing the first element of s

        # For each element sk that is a set from the power set of the set
        # of all elements in s besides the first one...
        for sk in powerset(s[1:]):
            # add to the power set each element of the slightly smaller power set
            ps.append(sk)
            
            # add to the power set the first element combined with each of the others
            ps.append(s0 + sk)
            
        return sorted(ps, key=len) # sort it by increasing length so it looks "right"
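
As a cross-check on the recursive solution, the non-empty subsets can also be enumerated size by size with `itertools.combinations` from the standard library. This is only a sketch of an alternative, and the function name `powerset_itertools` is made up for this example:

```python
from itertools import combinations

def powerset_itertools(items):
    """Return every non-empty subset of `items` as a list, smallest subsets first."""
    return [list(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]

inventory = ["apple", "banana", "candy", "fancy feast", "grape soda", "ice cream"]
subsets = powerset_itertools(inventory)

print(len(subsets))    # 2**6 - 1 = 63 non-empty subsets
print(subsets[:3])     # the first few singletons
```

Comparing `{frozenset(x) for x in subsets}` against the same construction applied to the recursive result is one way to confirm that the two approaches agree regardless of ordering.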

Let's check that we at least have the correct number of elements. There are 6 items in the inventory, and each one is either in or out of a given subset (2 choices), so the total number of elements in the power set should be $2^6 = 64$. If you left off the empty set, that means we should have 63 elements in the power set. Check that this is the case:

In [9]:
powerset(['1','2'])

[['1'], ['2'], ['1', '2']]

In [10]:
ps = powerset(inventory)
print(len(ps))
print(ps)

63
[['apple'], ['banana'], ['candy'], ['fancy feast'], ['grape soda'], ['ice cream'], ['apple', 'banana'], ['apple', 'candy'], ['apple', 'fancy feast'], ['apple', 'grape soda'], ['apple', 'ice cream'], ['banana', 'candy'], ['banana', 'fancy feast'], ['banana', 'grape soda'], ['banana', 'ice cream'], ['candy', 'fancy feast'], ['candy', 'grape soda'], ['candy', 'ice cream'], ['fancy feast', 'grape soda'], ['fancy feast', 'ice cream'], ['grape soda', 'ice cream'], ['apple', 'banana', 'candy'], ['apple', 'banana', 'fancy feast'], ['apple', 'banana', 'grape soda'], ['apple', 'banana', 'ice cream'], ['apple', 'candy', 'fancy feast'], ['apple', 'candy', 'grape soda'], ['apple', 'candy', 'ice cream'], ['apple', 'fancy feast', 'grape soda'], ['apple', 'fancy feast', 'ice cream'], ['apple', 'grape soda', 'ice cream'], ['banana', 'candy', 'fancy feast'], ['banana', 'candy', 'grape soda'], ['banana', 'candy', 'ice cream'], ['banana', 'fancy feast', 'grape soda'], ['banana', 'fancy feast', 'ice cream'

Using a support threshold of $s=0.6$ (here expressed as a fraction of the baskets rather than a raw count), what are all of the frequent itemsets? We can create another list whose elements are the proportion of baskets each subset appears in. This is done for a single element below, but you'll want to generalize this to look at all of the subsets in the power set `ps`. You can, of course, combine some of these steps into fewer commands.

In [11]:
# initialize the support list
supp = [0]*len(ps)

# for the 43rd subset, check which baskets it is in
membership = [set(ps[42]) <= baskets[k] for k in range(len(baskets))]

# then compute the number of baskets the set appears in
number_of_appearances = np.sum(membership)

# and take as a ratio of the total number of baskets
supp[42] = number_of_appearances/len(baskets)

# Now, do for ALL of them! Like... with a loop.
# TODO
print(membership)

[False, False, False, False, False, False, False, False]


In [12]:
# SOLUTION:

supp = [0]*len(ps)
for i in range(len(ps)):
    supp[i] = np.sum([set(ps[i]) <= baskets[k] for k in range(len(baskets))])/len(baskets)

Can you spot any frequent itemsets? It might be easiest to view the itemsets and their support all in one shot. And since there are 63 sets, we probably want to restrict to just the frequent ones.

In [13]:
for i,s in zip(ps,supp):
    if s >= 0.6:
        print(i,s)

['apple'] 0.625
['banana'] 0.875
['ice cream'] 0.75
['banana', 'ice cream'] 0.625


So it appears that bananas and ice cream are a frequent itemset. Note that if we were doing this out in the wild, we would *only* check triples that contained both banana and ice cream; since there are no other frequent pairs, there can't be any frequent triples that *don't* contain those tasty treats. In fact, every pair inside a frequent triple must itself be frequent, so with only one frequent pair, no triple can reach the threshold here.
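
The pruning idea in the paragraph above can be sketched in code, in the spirit of the candidate-generation step of the A-Priori algorithm. The helper name `candidates` is hypothetical, and the frequent itemsets are typed in by hand from the output above rather than recomputed:

```python
from itertools import combinations

def candidates(frequent_k, k):
    """Candidate (k+1)-itemsets whose every k-item subset appears in frequent_k."""
    frequent = {frozenset(f) for f in frequent_k}
    # Only items that occur in some frequent k-itemset can appear in a candidate
    items = sorted(set().union(*frequent))
    return [list(combo)
            for combo in combinations(items, k + 1)
            if all(frozenset(sub) in frequent for sub in combinations(combo, k))]

# Frequent itemsets at s = 0.6, copied from the printed results
frequent_singles = [{"apple"}, {"banana"}, {"ice cream"}]
frequent_pairs = [{"banana", "ice cream"}]

pair_candidates = candidates(frequent_singles, 1)
triple_candidates = candidates(frequent_pairs, 2)

print(pair_candidates)     # pairs worth counting
print(triple_candidates)   # triples worth counting
```

Only the candidates returned at each level need their support counted, which is far fewer itemsets than the full power set.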

It appears the customers of our PDQ love a good banana split. Yum!

<img width=300px src="https://i.pinimg.com/474x/9a/10/4b/9a104babf199c6b55d623972ad2bd243--banana-split-classroom-decor.jpg">

<br>

### Exercise 3:  Finding interesting association rules

The **confidence** in an association rule $I \rightarrow j$ (baskets that contain itemset $I$ tend to also contain item $j$) is given by:
$$\text{conf}(I \rightarrow j) = \dfrac{\text{support}(I \cup \{j\})}{\text{support}(I)}$$

We saw that bananas and ice cream appear to go together in purchase patterns for our haplessly understocked convenience store, but that didn't indicate which *direction* the relationship goes. Is it the case that customers who buy ice cream are more likely to buy bananas, or the other way around?

Compute both $\text{conf}(\text{banana} \rightarrow \text{ice cream})$ and $\text{conf}(\text{ice cream} \rightarrow \text{banana})$. Compare the two and form a hypothesis about the direction of this relationship.

In [14]:
# SOLUTION:

# banana --> ice cream
numer = supp[ps.index(["banana","ice cream"])]
denom = supp[ps.index(["banana"])]
conf_b_i = numer/denom

# ice cream --> banana
numer = supp[ps.index(["banana","ice cream"])]
denom = supp[ps.index(["ice cream"])]
conf_i_b = numer/denom

print("conf(banana-->ice cream) = {:0.4f}".format(conf_b_i))
print("conf(ice cream-->banana) = {:0.4f}".format(conf_i_b))

conf(banana-->ice cream) = 0.7143
conf(ice cream-->banana) = 0.8333
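
Written out with the fractional supports printed in Exercise 2 (banana 0.875 = 7/8, ice cream 0.75 = 6/8, both together 0.625 = 5/8), the arithmetic behind those two numbers is:

$$\text{conf}(\text{banana} \rightarrow \text{ice cream}) = \dfrac{5/8}{7/8} = \dfrac{5}{7} \approx 0.7143, \qquad \text{conf}(\text{ice cream} \rightarrow \text{banana}) = \dfrac{5/8}{6/8} = \dfrac{5}{6} \approx 0.8333$$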


Looks like both have pretty high confidences, but it seems a tad more likely that someone who buys ice cream will also buy bananas. We can also characterize this using the **interest** in an association rule, which is the difference between the confidence in the rule and the proportion of baskets that the "resultant" item ($j$) appears in:
$$\text{interest}(I \rightarrow j) = \text{conf}(I \rightarrow j) - P(j)$$

Compute the interest for both forms of the banana/ice cream relationship.

In [15]:
# SOLUTION:

int_b_i = conf_b_i - np.sum(["ice cream" in baskets[k] for k in range(len(baskets))])/len(baskets)
int_i_b = conf_i_b - np.sum(["banana" in baskets[k] for k in range(len(baskets))])/len(baskets)

print("interest(banana-->ice cream) = {:0.4f}".format(int_b_i))
print("interest(ice cream-->banana) = {:0.4f}".format(int_i_b))

interest(banana-->ice cream) = -0.0357
interest(ice cream-->banana) = -0.0417


Alas, neither form of that rule appears to be particularly interesting. But ice cream and bananas are so delicious! **Consider:** Why was our confidence in each rule so high, but our interest in them so low?
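
One way to explore that question is to compute both measures side by side for several rules. Below is a minimal sketch, assuming hypothetical helpers `frac_support`, `confidence`, and `interest` (none of which are defined earlier in this notebook) that follow the two formulas above:

```python
# Same basket data as in the notebook, repeated so this cell runs on its own
baskets = {0: {"apple", "banana", "candy", "fancy feast"},
           1: {"apple", "banana", "grape soda"},
           2: {"banana", "ice cream"},
           3: {"apple", "candy", "ice cream"},
           4: {"apple", "fancy feast", "banana", "ice cream"},
           5: {"apple", "banana", "candy", "ice cream"},
           6: {"candy", "ice cream", "banana"},
           7: {"banana", "fancy feast", "ice cream"}}

def frac_support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= basket for basket in baskets.values()) / len(baskets)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I with j added) / support(I).
    Assumes I appears in at least one basket, otherwise this divides by zero."""
    return frac_support(set(I) | {j}, baskets) / frac_support(I, baskets)

def interest(I, j, baskets):
    """interest(I -> j) = conf(I -> j) - P(j)."""
    return confidence(I, j, baskets) - frac_support({j}, baskets)

# A few rules to compare; the last one is an arbitrary extra example
rules = [({"banana"}, "ice cream"), ({"ice cream"}, "banana"), ({"candy"}, "apple")]
for I, j in rules:
    print("{} --> {}: conf = {:0.4f}, P(j) = {:0.4f}, interest = {:0.4f}".format(
        sorted(I), j, confidence(I, j, baskets),
        frac_support({j}, baskets), interest(I, j, baskets)))
```

Printing $P(j)$ next to each confidence makes it easier to see what the confidence is being compared against.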