# INFS4203/7203 Data Mining Tutorial 2

Let's see how we can use an external library, mlxtend, to find frequent itemsets.

# 1. Install mlxtend library

The mlxtend package can be installed in either of these two ways:

1. Run **conda install -c conda-forge mlxtend** in your *Anaconda command prompt*

OR

2. Run **pip install mlxtend** in your *terminal*.

For more information, please check the documentation of mlxtend at http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
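
Before moving on, it can help to confirm that the packages are visible to the Python interpreter you are actually running. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of mlxtend.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if `package_name` can be imported by this interpreter."""
    return find_spec(package_name) is not None


# Hypothetical helper: check the packages this tutorial relies on.
for name in ("mlxtend", "pandas", "numpy"):
    status = "found" if is_installed(name) else "missing"
    print(name + ": " + status)
```

If a package is reported as missing, the install command was probably run in a different environment from the one that runs the notebook.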


# 2. Load data

We load the provided data *groceries.csv*.

In [1]:
# Import the libraries and load data
import mlxtend
import pandas as pd
import numpy as np

# load data from groceries.csv
df = pd.read_csv('groceries.csv')
# print a message once loading has finished
print("loading successful")

loading successful


# 3. Change the data format

According to the documentation, the input must be a one-hot encoded table like the one below, with one row per transaction and one column per item:



|     | Type 1 | Type 2 | ...    | Type N |
| --- | ------ | ----   | ----   |  ----  |
| 0   | True   | False  | ...    | True   |
| 1   | False  | True   | ...    | True   |
| 2   | True   | True   | ...    | False  |
| 3   | True   | True   | ...    | True   |

Next, we convert the original transaction data into this format.

**We first turn each transaction into a list**, which is a built-in data structure in Python.
(See here for more on lists: https://docs.python.org/3/tutorial/introduction.html#lists)


In [2]:
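# Convert the DataFrame to a list of rows, then remove the nan entries from each row
dataset = df.values.tolist()
cleanList = []

for trans in dataset: # for each transaction
    cleanTrans = []
    for x in trans: # for each element in the transaction
        if pd.notna(x): # if the item is not missing (nan), put it in the list
            cleanTrans.append(x)
    cleanList.append(cleanTrans)
# Keep a plain list of lists. The transactions have different lengths, so
# np.asarray(cleanList) is not needed and raises an error on recent NumPy versions.
dataset = cleanList

# print a message once the task has finished
print('Done')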

Done
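
To see what the cleaning loop does, here is a small sketch on made-up rows. It assumes, as the loop above suggests, that shorter transactions are padded with `nan` so that every row has the same length. The item names and the `strip_missing` helper are illustrative only and do not come from *groceries.csv*.

```python
import math

# Made-up rows, padded with NaN so that every row has the same length.
nan = float("nan")
rows = [
    ["milk", "bread", nan],
    ["beer", nan, nan],
    ["milk", "beer", "eggs"],
]


def strip_missing(row):
    """Keep only real items; drop the float NaN padding."""
    return [x for x in row if not (isinstance(x, float) and math.isnan(x))]


transactions = [strip_missing(row) for row in rows]
print(transactions)
# [['milk', 'bread'], ['beer'], ['milk', 'beer', 'eggs']]
```

The result is a list of lists with different lengths, which is why it is kept as a plain Python list rather than a rectangular array.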


**We then convert the list into the format required by mlxtend using *TransactionEncoder*.**

In [3]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# You do not need to understand this part in detail. You can just run the code and move on.
te = TransactionEncoder() # an encoder object provided by mlxtend
te_ary = te.fit(dataset).transform(dataset) # learn the item names, then encode each transaction as True/False values
df = pd.DataFrame(te_ary, columns=te.columns_) # put the encoded array back into a pandas DataFrame

# print a message once the task has finished
print('Done!')

Done!
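
If the encoding step feels opaque, the idea can be reproduced by hand. The sketch below is not mlxtend's implementation; it is a plain-Python illustration, on made-up transactions, of turning lists of items into the True/False table shown earlier.

```python
# Toy transactions (made up for illustration).
transactions = [
    ["milk", "bread"],
    ["beer"],
    ["milk", "beer", "eggs"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for trans in transactions for item in trans})

# One row per transaction: True where the transaction contains the item.
encoded = [[item in trans for item in columns] for trans in transactions]

print(columns)
for row in encoded:
    print(row)
# ['beer', 'bread', 'eggs', 'milk']
# [False, True, False, True]
# [True, False, False, False]
# [True, False, True, True]
```

Each row has the same length as `columns`, so the result fits naturally into a table with one column per item.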


# 4. Apply the Apriori algorithm

Set the minimum support to 0.008.

In [14]:
# define the MIN_SUPP
MIN_SUPP = 0.008

# apply the defined apriori algorithm
freq_set = apriori(df, min_support=MIN_SUPP, use_colnames=True)

print('Done!')

Done!


In [15]:
freq_set

|     | support  | itemsets                               |
| --- | -------- | -------------------------------------- |
| 0   | 0.008033 | (Instant food products)                |
| 1   | 0.033452 | (UHT-milk)                             |
| 2   | 0.017692 | (baking powder)                        |
| 3   | 0.052466 | (beef)                                 |
| 4   | 0.033249 | (berries)                              |
| ... | ...      | ...                                    |
| 480 | 0.014540 | (whole milk, root vegetables, yogurt)  |
| 481 | 0.008744 | (whole milk, sausage, yogurt)          |
| 482 | 0.010473 | (soda, whole milk, yogurt)             |
| 483 | 0.015150 | (yogurt, whole milk, tropical fruit)   |
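
To build intuition for what the Apriori call computes, the following sketch runs a simple level-wise search on a handful of made-up transactions. It is a teaching sketch under stated assumptions, not mlxtend's implementation: the function names `support` and `frequent_itemsets` and the toy data are hypothetical.

```python
from itertools import combinations

# Toy transactions (made up for illustration).
transactions = [
    {"milk", "bread"},
    {"milk", "beer"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for trans in transactions if itemset <= trans)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built from frequent (k-1)-itemsets."""
    items = sorted({item for trans in transactions for item in trans})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # Keep only the candidates that meet the threshold.
        frequent = {}
        for cand in current:
            s = support(cand, transactions)
            if s >= min_support:
                frequent[cand] = s
        result.update(frequent)
        # Join step: unions of two frequent k-itemsets that have exactly k+1 items.
        k += 1
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
    return result


for itemset, s in frequent_itemsets(transactions, min_support=0.5).items():
    print(sorted(itemset), s)
```

With a minimum support of 0.5, the toy data yields three frequent single items (bread, eggs, milk) and two frequent pairs ({bread, eggs} and {bread, milk}). The triple {bread, eggs, milk} is pruned because its subset {eggs, milk} is not frequent.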


#### Confirm whether {whole milk, rolls/buns, other vegetables} is a frequent itemset.

In [33]:
check_set = ['whole milk','rolls/buns','other vegetables']

# Select the idx from the frequent set based on the given check_set
itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(check_set)].tolist()
if itemset_idx==[]: # given check_set does not exist in the frequent set
    print('Not frequent!')
else:
    print('Found at location '+str(itemset_idx[0]))

Found at location 379


In [30]:
# let's see our result
freq_set

|     | support  | itemsets                                           |
| --- | -------- | -------------------------------------------------- |
| 0   | 0.008033 | (Instant food products)                            |
| 1   | 0.033452 | (UHT-milk)                                         |
| 2   | 0.017692 | (baking powder)                                    |
| 3   | 0.052466 | (beef)                                             |
| 4   | 0.033249 | (berries)                                          |
| ... | ...      | ...                                                |
| 996 | 0.005186 | (root vegetables, whipped/sour cream, other ve...  |
| 997 | 0.007829 | (yogurt, root vegetables, other vegetables, wh...  |
| 998 | 0.007626 | (yogurt, other vegetables, tropical fruit, who...  |
| 999 | 0.005592 | (yogurt, whipped/sour cream, other vegetables,...  |


## How to check the i-th frequent itemset?

In [13]:
# check the frequent itemset at index 379
freq_set.loc[[379], ['support', 'itemsets']]

|     | support  | itemsets                                   |
| --- | -------- | ------------------------------------------ |
| 379 | 0.017895 | (whole milk, other vegetables, rolls/buns) |


## How to check whether an itemset is frequent?

If it is frequent, print the location of the itemset in **freq_set**; otherwise print "Not frequent!".

### Check whether {whole milk, rolls/buns, other vegetables} is frequent

In [12]:
# specify the itemset you want to check
check_set = ['whole milk', 'rolls/buns', 'other vegetables']

# Select the idx from the frequent set based on the given check_set
itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(check_set)].tolist()
if itemset_idx==[]: # given check_set does not exist in the frequent set
    print('Not frequent!')
else:
    print('Found at location '+str(itemset_idx[0]))

Found at location 379


### Check whether {whole milk, tropical fruit, bread} is frequent

In [11]:
check_set = ['whole milk', 'tropical fruit', 'bread']

# Select the idx from the frequent set based on the given check_set
itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(check_set)].tolist()
if itemset_idx==[]: # given check_set does not exist in the frequent set
    print('Not frequent!')
else:
    print('Found at location '+str(itemset_idx[0]))

Not frequent!
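
The lookups above wrap `check_set` in `frozenset(...)` before comparing it with the `itemsets` column. A short plain-Python sketch shows why this matters: a frozenset comparison ignores the order of the items, whereas a list comparison does not. The small `itemsets` list below is a made-up stand-in for the column, not real output.

```python
# Order does not matter when comparing frozensets.
a = frozenset(["whole milk", "rolls/buns", "other vegetables"])
b = frozenset(["other vegetables", "whole milk", "rolls/buns"])
print(a == b)  # True

# A plain list comparison is order-sensitive.
print(["whole milk", "rolls/buns"] == ["rolls/buns", "whole milk"])  # False

# Toy stand-in for the 'itemsets' column (made-up contents).
itemsets = [frozenset(["beef"]), a, frozenset(["berries"])]
matches = [i for i, s in enumerate(itemsets) if s == b]
print(matches)  # [1]
```

This is why the items in `check_set` can be listed in any order.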


Great! You can play with your own sets and see if they are frequent.

**Exercise**: once we know an itemset is frequent, how can we obtain its support?
(Hint: use the location of the frequent itemset.)

## Section 4.1 Calculate the confidence

In [6]:
def get_itemset_support(freq_set, X):
    """Return the support of the given itemset X, or None if X is not frequent."""
    # Select the idx from the frequent set based on the given itemset X
    itemset_idx = freq_set.index[freq_set['itemsets'] == frozenset(X)].tolist()
    if itemset_idx==[]:
        return None # Requested itemset X does not exist in the frequent itemsets
    else:
        return freq_set.loc[itemset_idx[0], 'support'] # Return the corresponding support as a single number

def get_rule_confidence(freq_set, X, Y):
    """Print the confidence of the rule {X} -> {Y}."""
    itemset = X + Y # join itemset X and itemset Y 
    x_support = get_itemset_support(freq_set, X) # get support of X 
    joint_support = get_itemset_support(freq_set, itemset) # get support of X and Y together
    
    if joint_support is None or x_support is None: 
        print("Make sure that X, Y and X+Y are in the frequent list.")
        return
        
    # confidence = support of X and Y together, divided by support of X
    print("The confidence of rule {%s} -> {%s} is: %3f"%(X, Y, joint_support/x_support))

**Now, let's calculate the confidence of rule {X} -> {Y}**

In [8]:
# Specify the content of X and Y
X = ['whipped/sour cream']
Y = ['berries']

# Get the confidence
get_rule_confidence(freq_set, X, Y)

The confidence of rule {['whipped/sour cream']} -> {['berries']} is: 0.126241
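
The confidence of a rule {X} -> {Y} is the support of X and Y together divided by the support of X. The sketch below recomputes this directly from a few made-up transactions rather than from `freq_set`; the data and the function names `support` and `confidence` are illustrative assumptions.

```python
# Toy transactions (made up for illustration).
transactions = [
    {"cream", "berries"},
    {"cream", "milk"},
    {"cream", "berries", "milk"},
    {"milk"},
    {"cream"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def confidence(X, Y, transactions):
    """Support of X and Y together, divided by the support of X."""
    return support(X + Y, transactions) / support(X, transactions)


conf = confidence(["cream"], ["berries"], transactions)
print(round(conf, 3))  # 0.5
```

Here "cream" appears in 4 of the 5 transactions and "cream" with "berries" in 2 of them, so the confidence is 0.4 / 0.8 = 0.5: half of the transactions that contain cream also contain berries.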


#### Confirm whether {whipped/sour cream} -> {berries} is a strong rule when the minimum confidence is 0.2

In [7]:
# Specify the content of X and Y
X = ['whipped/sour cream']
Y = ['berries']

# Get the confidence
get_rule_confidence(freq_set, X, Y)

The confidence of rule {['whipped/sour cream']} -> {['berries']} is: 0.126241

Since 0.126241 is below the minimum confidence of 0.2, this is not a strong rule.

This is the end of the tutorial. You can play with your own rules now.
