# Generating Frequent Itemsets

The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:

In [1]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

We can transform it into the right format via the TransactionEncoder as follows:

In [4]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |


In [25]:
df.shape

(5, 11)
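
To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding a transaction list amounts to. It is an illustration only, not TransactionEncoder's actual implementation, and the names `columns` and `rows` are chosen just for this example.

```python
# Transaction data from the example above.
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# One column per distinct item, in sorted order.
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction: True where the transaction contains the item.
rows = [[item in set(transaction) for item in columns]
        for transaction in dataset]

print(columns)
print(len(rows), len(columns))  # 5 11
```

An item that is repeated within a transaction, such as 'Onion' in the last one, still yields a single True in that row.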

Now, let us return the items and itemsets with at least 60% support:

In [5]:
from mlxtend.frequent_patterns import apriori

frequent_itemset = apriori(df, min_support=0.6)
frequent_itemset

|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (3) |
| 1 | 1.0 | (5) |
| 2 | 0.6 | (6) |
| 3 | 0.6 | (8) |
| 4 | 0.6 | (10) |
| 5 | 0.8 | (3, 5) |
| 6 | 0.6 | (8, 3) |
| 7 | 0.6 | (5, 6) |
| 8 | 0.6 | (8, 5) |
| 9 | 0.6 | (10, 5) |
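
The integers in the itemsets column are positions in the DataFrame's column list. As a small illustration, with the column names copied from the one-hot encoded table above, they can be mapped back by hand; the `columns` list and the `to_names` helper are hypothetical names for this sketch, not part of mlxtend.

```python
# Column names in the order shown in the one-hot encoded DataFrame.
columns = ['Apple', 'Corn', 'Dill', 'Eggs', 'Ice cream', 'Kidney Beans',
           'Milk', 'Nutmeg', 'Onion', 'Unicorn', 'Yogurt']

def to_names(index_itemset):
    """Translate a collection of column positions into a frozenset of item names."""
    return frozenset(columns[i] for i in index_itemset)

print(to_names({3}) == {'Eggs'})              # True
print(to_names({8, 3}) == {'Onion', 'Eggs'})  # True
```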


By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:

In [6]:
frequent_itemset = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemset

|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Kidney Beans, Eggs) |
| 6 | 0.6 | (Onion, Eggs) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Kidney Beans, Yogurt) |
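
The support values can be checked by hand: the support of an itemset is the fraction of transactions that contain all of its items. The following sketch counts this directly in plain Python; the `support` helper is a hypothetical name for this illustration and does not reflect how apriori prunes its search internally.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

print(support({'Eggs'}, dataset))           # 0.8
print(support({'Kidney Beans'}, dataset))   # 1.0
print(support({'Onion', 'Eggs'}, dataset))  # 0.6
print(support({'Milk', 'Eggs'}, dataset))   # 0.4, below the 0.6 threshold
```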


# Selecting and Filtering Results

The advantage of working with pandas DataFrames is that we can use their convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:

In [7]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

|   | support | itemsets | length |
|---|---|---|---|
| 0 | 0.8 | (Eggs) | 1 |
| 1 | 1.0 | (Kidney Beans) | 1 |
| 2 | 0.6 | (Milk) | 1 |
| 3 | 0.6 | (Onion) | 1 |
| 4 | 0.6 | (Yogurt) | 1 |
| 5 | 0.8 | (Kidney Beans, Eggs) | 2 |
| 6 | 0.6 | (Onion, Eggs) | 2 |
| 7 | 0.6 | (Kidney Beans, Milk) | 2 |
| 8 | 0.6 | (Kidney Beans, Onion) | 2 |
| 9 | 0.6 | (Kidney Beans, Yogurt) | 2 |


Then, we can select the results that satisfy our desired criteria as follows:

In [9]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.8) ]

|   | support | itemsets | length |
|---|---|---|---|
| 5 | 0.8 | (Kidney Beans, Eggs) | 2 |


Similarly, using the pandas API, we can select entries based on the "itemsets" column:

In [11]:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]

|   | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Onion, Eggs) | 2 |


##### Frozensets

Note that the entries in the "itemsets" column are of type frozenset, a built-in Python type that is similar to a Python set but immutable, which makes it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. For example, the following three queries all return the same row as the {'Onion', 'Eggs'} query above:

In [13]:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Eggs', 'Onion'} ]

|   | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Onion, Eggs) | 2 |


In [15]:
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Eggs', 'Onion')) ]

|   | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Onion, Eggs) | 2 |


In [17]:
frequent_itemsets[ frequent_itemsets['itemsets'] == frozenset(('Onion', 'Eggs')) ]

|   | support | itemsets | length |
|---|---|---|---|
| 6 | 0.6 | (Onion, Eggs) | 2 |
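
The behaviour these queries rely on can be seen with plain Python objects, independent of pandas. The sketch below uses only built-in types; the variable names are illustrative.

```python
a = frozenset(('Eggs', 'Onion'))
b = frozenset(('Onion', 'Eggs'))

# Equality ignores item order and also holds against a plain set.
order_irrelevant = (a == b)
equals_plain_set = (a == {'Onion', 'Eggs'})
print(order_irrelevant, equals_plain_set)  # True True

# Because frozensets are immutable, they are hashable and can serve as dict keys.
support_by_itemset = {a: 0.6}
print(support_by_itemset[b])  # 0.6

# A mutable set is not hashable, so it cannot be used the same way.
try:
    hash({'Onion', 'Eggs'})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
print(set_is_hashable)  # False
```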


# Working with Sparse Representations

To save memory, you may want to represent your transaction data in a sparse format. This is especially useful if you have many products and small transactions, because most cells of the one-hot encoded table are then False.

In [19]:
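# With sparse=True, transform returns a SciPy sparse matrix instead of a dense array
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
# pd.SparseDataFrame was removed in pandas 1.0; build the sparse DataFrame from the matrix instead
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df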

|   | Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |


In [21]:
apriori(sparse_df, min_support=0.6, use_colnames=True)

|   | support | itemsets |
|---|---|---|
| 0 | 0.8 | (Eggs) |
| 1 | 1.0 | (Kidney Beans) |
| 2 | 0.6 | (Milk) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Yogurt) |
| 5 | 0.8 | (Kidney Beans, Eggs) |
| 6 | 0.6 | (Onion, Eggs) |
| 7 | 0.6 | (Kidney Beans, Milk) |
| 8 | 0.6 | (Kidney Beans, Onion) |
| 9 | 0.6 | (Kidney Beans, Yogurt) |
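
The memory argument can be illustrated without any library. The sketch below is a plain-Python stand-in for a sparse matrix (it is not how SciPy or pandas store data internally): it keeps only the coordinates of the True cells and compares that count with the size of the dense table. The names `true_cells` and `dense_cells` are chosen for this example.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

columns = sorted({item for transaction in dataset for item in transaction})

# (row, column) coordinates of the True cells only.
true_cells = [(r, c)
              for r, transaction in enumerate(dataset)
              for c, item in enumerate(columns)
              if item in transaction]

# Number of cells a dense one-hot table has to store.
dense_cells = len(dataset) * len(columns)

print(len(true_cells), dense_cells)  # 26 55
```

In this toy dataset 26 of the 55 cells are True, so little is saved. The benefit grows when each transaction touches only a small fraction of a large number of columns.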


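The frequent itemsets can also be passed to association_rules to derive rules. Here the support threshold is lowered to 0.07, so every itemset that occurs in at least one of the five transactions is kept, and only rules with a lift of at least 1 are returned:

In [29]: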
from mlxtend.frequent_patterns import association_rules
frequent_itemsets = apriori(df, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (Eggs) | (Apple) | 0.8 | 0.2 | 0.2 | 0.25 | 1.25 | 0.04 | 1.066667 |
| 1 | (Apple) | (Eggs) | 0.2 | 0.8 | 0.2 | 1.0 | 1.25 | 0.04 | inf |
| 2 | (Kidney Beans) | (Apple) | 1.0 | 0.2 | 0.2 | 0.2 | 1.0 | 0.0 | 1.0 |
| 3 | (Apple) | (Kidney Beans) | 0.2 | 1.0 | 0.2 | 1.0 | 1.0 | 0.0 | inf |
| 4 | (Milk) | (Apple) | 0.6 | 0.2 | 0.2 | 0.333333 | 1.666667 | 0.08 | 1.2 |
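
The metric columns follow from the three support values in each row. As a rough sketch, assuming the standard textbook definitions of confidence, lift, leverage, and conviction (which reproduce the numbers shown above), they can be recomputed by hand; `rule_metrics` is a hypothetical helper, not an mlxtend function.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Metrics for a rule A -> C, given support(A), support(C), and support(A and C)."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence of 1).
    if confidence >= 1.0:
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}

# Row 0: (Eggs) -> (Apple)
eggs_to_apple = rule_metrics(support_a=0.8, support_c=0.2, support_ac=0.2)
# Row 1: (Apple) -> (Eggs)
apple_to_eggs = rule_metrics(support_a=0.2, support_c=0.8, support_ac=0.2)

print(eggs_to_apple)
print(apple_to_eggs)
```

Lift and leverage are symmetric in antecedent and consequent, which is why rows 0 and 1 share those values, while confidence and conviction depend on the direction of the rule.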


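The rules DataFrame can be filtered in the same way as the frequent itemsets, for example to keep only rules with a lift of at least 5 and a confidence of at least 0.6:

In [30]: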
rules[ (rules['lift'] >= 5) &
      (rules['confidence'] >= 0.6) ]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 75 | (Corn, Eggs) | (Ice cream) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 78 | (Ice cream) | (Corn, Eggs) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 92 | (Corn, Onion) | (Ice cream) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 97 | (Ice cream) | (Corn, Onion) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 110 | (Milk, Corn) | (Unicorn) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 115 | (Unicorn) | (Milk, Corn) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 123 | (Corn, Yogurt) | (Unicorn) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 126 | (Unicorn) | (Corn, Yogurt) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 323 | (Kidney Beans, Corn, Eggs) | (Ice cream) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
| 327 | (Kidney Beans, Ice cream) | (Corn, Eggs) | 0.2 | 0.2 | 0.2 | 1.0 | 5.0 | 0.16 | inf |
