**30E03000 - Data Science for Business I (2021)**

# Tutorial 3: Shopping Baskets analysis and Association Rule mining with Apriori

### Business problem

Association rule mining is a technique to identify underlying relations between different items. Take an example of a supermarket where customers can buy a variety of items. Usually there is a pattern in what customers buy. For instance, parents buy baby products such as milk and diapers **together**. University students may buy beer and chips **together**. In short, **transactions involve patterns. More profit can be generated if the relationship between the items purchased in different transactions can be identified.** 

<table><tr>
<td> <img src="Images/kplussa.jpg" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="Images/skortti.jpg" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

If items A and B are frequently bought together, several steps can be taken to increase profit. For example:

1. A and B can be placed together so that a customer who buys one of the products does not have to go far to buy the other.
2. People who buy one of the products can be targeted through an advertisement campaign to buy the other.
3. Collective discounts can be offered on these products if the customer buys both of them.
4. Both A and B can be packaged together. 

<font color='red'>**The process of identifying associations between products is called association rule mining.**</font> [[1]](https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/)




### Learning objectives

In this tutorial we will learn:
- how to perform association rule mining in Python with the **Apriori algorithm** to find common shopping patterns,
- how to define new variables based on these patterns, 
- how to use background data to understand what kind of customers are behind these patterns, 
- how to apply a Decision Tree algorithm to connect background data to the newly found ("mined") shopping patterns.

### Key words

`Apriori`, `association rule mining`, `shopping basket analysis`, `Decision Trees`

### Happy shopping!

<br>
<img src="Images/blackFriday.jpg" width="400">

## Apriori theory recap

There are three major components of the Apriori algorithm:

- Support
- Confidence
- Lift

Suppose we have a record of 1000 customer transactions, and we want to find the Support, Confidence, and Lift for two items e.g. burgers and ketchup. Out of 1000 transactions, 100 contain ketchup while 150 contain a burger. In 50 out of the 150 burger transactions, customers also bought ketchup. 

Using this data, we want to find the support, confidence, and lift.

<br>
<img src="Images/burgerKetchup.png" width="400">

#### Support

Support refers to the **default popularity of an item** and can be calculated by finding the number of transactions containing a particular item divided by total number of transactions. Suppose we want to find support for item B. This can be calculated as:

`Support(B) = (Transactions containing (B))/(Total Transactions)`

For instance if out of 1000 transactions, 100 transactions contain Ketchup then the support for item Ketchup can be calculated as:

`Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)
Support(Ketchup) = 100/1000 = 10%`

#### Confidence

Confidence refers to the **likelihood that an item B is also bought if item A is bought.** It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought. Mathematically, it can be represented as:

`Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)` 

Coming back to our problem, we had 50 transactions where a burger and ketchup were bought together, while burgers were bought in 150 transactions. The likelihood of buying ketchup when a burger is bought is the confidence of Burger -> Ketchup and can be written as:

`Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)
Confidence(Burger→Ketchup) = 50/150 = 33.3%`

#### Lift

`Lift(A -> B)` refers to the **increase in the ratio of sale of B when A is sold.** `Lift(A -> B)` can be calculated by dividing `Confidence(A -> B)` by `Support(B)`. Mathematically it can be represented as:

`Lift(A→B) = (Confidence (A→B))/(Support (B))`

Coming back to our Burger and Ketchup problem, the `Lift(Burger -> Ketchup)` can be calculated as:

`Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))
Lift(Burger→Ketchup) = 33.3/10 = 3.33`

Lift tells us that a customer who buys a burger is 3.33 times as likely to buy ketchup as a customer in general. A Lift of 1 means there is no association between products A and B. A Lift greater than 1 means products A and B are more likely to be bought together. Finally, a Lift of less than 1 means the two products are unlikely to be bought together.
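
The three measures can be reproduced with a few lines of arithmetic. The sketch below only uses the counts from the burger and ketchup example; the variable names are made up for illustration.

```python
# Counts taken from the burger/ketchup example above
total_transactions = 1000
ketchup_transactions = 100
burger_transactions = 150
burger_and_ketchup = 50

# Support: share of all transactions that contain ketchup
support_ketchup = ketchup_transactions / total_transactions

# Confidence: share of burger transactions that also contain ketchup
confidence_burger_ketchup = burger_and_ketchup / burger_transactions

# Lift: confidence relative to the baseline popularity of ketchup
lift_burger_ketchup = confidence_burger_ketchup / support_ketchup

print(f"Support(Ketchup)            = {support_ketchup:.1%}")
print(f"Confidence(Burger→Ketchup)  = {confidence_burger_ketchup:.1%}")
print(f"Lift(Burger→Ketchup)        = {lift_burger_ketchup:.2f}")
```

The same Lift follows from `Support(Burger and Ketchup) / (Support(Burger) * Support(Ketchup))`, which shows that Lift is symmetric: `Lift(A -> B)` equals `Lift(B -> A)`.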

## Steps involved in Apriori

For large sets of data, there can be hundreds of items in hundreds of thousands of transactions. A brute-force approach would try to extract rules for each possible combination of items. For instance, Lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4 and then item 2 and item 3, item 2 and item 4 and then combinations of items e.g. item 1, item 2 and item 3; similarly item 1, item 2, and item 4, and so on.

As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:

1. **Set minimum thresholds for support and confidence**. This means that we are only interested in finding rules for items that have a certain default existence (i.e. support) and a minimum value for co-occurrence with other items (i.e. confidence).
2. Extract all the subsets with a support value higher than the minimum threshold.
3. Select all the rules from the subsets with confidence value higher than minimum threshold.
4. Order the rules by descending order of Lift.
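
To make steps 1 and 2 more concrete, here is a minimal pure-Python sketch of the level-wise search for frequent itemsets. It is an illustration written for this tutorial, not the mlxtend implementation: the function name `frequent_itemsets` and the toy baskets are assumptions, and no attention is paid to efficiency.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def keep_frequent(candidates):
        # Support = share of baskets that contain the whole candidate itemset
        supports = {c: sum(c <= b for b in baskets) / n for c in candidates}
        return {c: s for c, s in supports.items() if s >= min_support}

    # Size 1: start from the individual items
    current = keep_frequent({frozenset([item]) for b in baskets for item in b})
    result = dict(current)

    k = 2
    while current:
        # Build size-k candidates by joining frequent itemsets of size k-1 ...
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # ... and drop any candidate with an infrequent (k-1)-subset (the pruning step)
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    return result

# Hypothetical toy baskets, not taken from shoppingdata.txt
baskets = [
    {"beer", "pizza", "beans"},
    {"beer", "pizza"},
    {"wine", "chocolate"},
    {"beer", "pizza", "beans"},
    {"wine", "chocolate", "fish"},
]
found = frequent_itemsets(baskets, min_support=0.4)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what saves time: an itemset can only be frequent if all of its subsets are frequent, so "fish" (support 0.2) is discarded once and never appears in any larger candidate.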

Enough with the theory, let's get our hands dirty!

## Import libraries

The first step, as always, is to import the required libraries. 

Luckily, we do not need to write the Apriori script to calculate support, confidence, and lift for all the possible combination of items. We will use an off-the-shelf library where all of the code has already been implemented: [mlxtend](https://github.com/rasbt/mlxtend)

In [None]:
#pip install mlxtend #run this if the package has not been installed yet
#pip install networkx #run this if the package has not been installed yet
#%config Completer.use_jedi = False #to fix a weird bug that prevents autocompletion

In [None]:
import pandas as pd #data frames (for storing data)
import numpy as np #scientific computing
import itertools 

import matplotlib.pyplot as plt #plotting

#Apriori library
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

#Visualize association rules
import networkx as nx  
import random

#sklearn for modeling
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier #Decision Tree algorithm
from sklearn.model_selection import train_test_split #Data split function
from sklearn.metrics import accuracy_score

#Decision tree plotting
import pydotplus
from IPython.display import Image 
from IPython.display import HTML, display

## Load data

In [None]:
data = pd.read_csv('shoppingdata.txt')
data.head() #Let's call the head() function to see how the dataset looks

## Exploratory data analysis (EDA)

Our data set has 18 features (dimensions) and 1000 observations:

In [None]:
data.shape

In [None]:
data.describe()

The shopping items (fruitveg, freshmeat, ...) are encoded as "F" or "T", indicating whether a shopping basket includes the item. Calling the `data.info()` function, we observe that these variables are of type object. **Pandas reads them as strings, not as boolean (True, False).**

In [None]:
data.info()

The Apriori algorithm cannot make sense of the strings, so we have to transform them into boolean values.

It is tempting to tell Pandas to replace *all* "T" and "F" values in the whole dataset with "True" and "False". This approach, however, would cause issues as it would also affect the gender column where "female" is encoded as "F". 

Instead, we store the column names of all shopping items in a list called `items` and then tell pandas to replace all values in these columns using the `replace` function:

In [None]:
items = ['fruitveg', 'freshmeat', 'dairy',  'cannedveg', 'cannedmeat', 'frozenmeal', 'beer', 
         'wine', 'softdrink', 'fish', 'confectionery']
data[items] = data[items].replace({'T':True, 'F':False}).astype(bool) #replace only values in the columns specified in "items"; astype(bool) makes the bool dtype explicit
data.info()

The shopping items are now of type `bool` (True, False).
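
The pitfall described above is easy to reproduce on a toy table. The sketch below uses a hypothetical three-row dataframe (not the tutorial data) with object-typed columns, as reported by `data.info()`; the names `toy`, `toy_items` and `corrupted` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; dtype=object mirrors how the strings were read in
toy = pd.DataFrame({
    "sex":  ["F", "M", "F"],   # "F" means female here
    "beer": ["T", "F", "T"],   # "T"/"F" means bought / not bought
    "wine": ["F", "F", "T"],
}, dtype=object)
toy_items = ["beer", "wine"]

# Replacing across the whole dataframe also rewrites "F" in the sex column
corrupted = toy.replace({"T": True, "F": False})
print(corrupted["sex"].tolist())

# Restricting the replacement to the item columns leaves "sex" untouched
toy[toy_items] = toy[toy_items].replace({"T": True, "F": False}).astype(bool)
print(toy)
```

In the `corrupted` copy, the female customers can no longer be told apart from a "not bought" flag, which is why the replacement is limited to the item columns.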

In [None]:
data.head()

## Applying Apriori: "*what* is bought together?"

Now we will use the Apriori algorithm to find out which items are commonly sold together, so that the store owner can take action to place the related items together or advertise them together in order to have increased profit.

First, we select all columns with shopping items (already stored in the `items` variable) and store them in a new dataframe `records`:

In [None]:
records = data[items]
records.head()

1. We select only items with a `min_support` of 10% in order to focus on the truly often bought items (recall that a min support = 0.1 means that the underlying item has been bought in at least 10% of all observations).


2. We filter out itemsets that contain fewer than 2 items. The Apriori algorithm would otherwise list single items in the itemset (yes, this is annoying).

In [None]:
frequent_itemsets = apriori(records, min_support=0.1, use_colnames=True) #filter min_support=10%
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x)) #create length column
frequent_itemsets[(frequent_itemsets['length'] >= 2)] #only list itemsets with at least 2 items 
#we do not remove the single items (length = 1) from the dataset! 

3. With the just-identified `frequent_itemsets` (what's bought together?), we call the `association_rules()` function from the mlxtend library to find a set of rules. Note that we only include rules with a Lift of 1.6 or higher in order to focus on rules that are likely to increase sales (recall that Lift tells us how many times more likely a customer is to buy B when A is in the basket, compared with the overall likelihood of buying B).

In [None]:
#rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.6) #only include rules with Lift >= 1.6
rules

4. This leaves us with 16 rules. Note how many rules are redundant (A -> B, B -> A).

5. The tables above show **three unique association rules** or itemsets:
    - wine and confectionery (chocolate)
    - fish and fruit/vegetables
    - beer, canned vegetables (beans) and frozen meals (pizza)
    
6. Plot the rules:

In [None]:
support = rules['support'].values
confidence = rules['confidence'].values

In [None]:
#The code below is arguably quite tricky. You are not expected to reproduce this in any assignment or exam! Either 
#it will be omitted or provided to you. It is included here to help you understand the connection of items and 
#rules by visualizing them in a "spider web".

def draw_graph(rules, rules_to_show):
    G1 = nx.DiGraph()
    
    color_map=[]
    N = 50
    colors = np.random.rand(N)
    
    strs = []
    for i in range(len(rules)):
        strs.append('R{}'.format(i))   
   
    for i in range (rules_to_show):      
        G1.add_nodes_from(["R"+str(i)])
        
        for a in rules.iloc[i]['antecedents']:
            G1.add_nodes_from([a])
            G1.add_edge(a, "R"+str(i), color=colors[i] , weight = 2)
       
        for c in rules.iloc[i]['consequents']:
            G1.add_nodes_from([c])
            G1.add_edge("R"+str(i), c, color=colors[i],  weight=2)
 
    for node in G1:
        found_a_string = False
        for item in strs: 
            if node==item:
                found_a_string = True
        if found_a_string:
            color_map.append('yellow')
        else:
            color_map.append('green')       
   
    edges = G1.edges()
    colors = [G1[u][v]['color'] for u,v in edges]
    weights = [G1[u][v]['weight'] for u,v in edges]
 
    pos = nx.spring_layout(G1, k=16, scale=1)
    plt.figure(1,figsize=(12,12)) 

    nx.draw(G1, pos, node_color = color_map, edge_color=colors, width=weights, font_size=16, with_labels=False)            
   
    for p in pos:  # raise text positions
        pos[p][1] += 0.07
    nx.draw_networkx_labels(G1, pos, font_size=16)
    plt.show()

draw_graph(rules, 16)
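
Because many rules are mirrored pairs (A -> B and B -> A), it can help to collapse them into the itemsets they describe. The sketch below does this for a handful of hypothetical rules written as pairs of frozensets, the same type that the `antecedents` and `consequents` columns hold; `toy_rules` is made up for illustration and is not the output of the cell above.

```python
# Hypothetical rules as (antecedents, consequents) pairs of frozensets
toy_rules = [
    (frozenset({"wine"}), frozenset({"confectionery"})),
    (frozenset({"confectionery"}), frozenset({"wine"})),
    (frozenset({"beer"}), frozenset({"cannedveg", "frozenmeal"})),
    (frozenset({"cannedveg", "frozenmeal"}), frozenset({"beer"})),
    (frozenset({"beer", "cannedveg"}), frozenset({"frozenmeal"})),
]

# The union of both sides is identical for mirrored rules, so a set removes the duplicates
unique_itemsets = {antecedents | consequents for antecedents, consequents in toy_rules}

for itemset in sorted(unique_itemsets, key=sorted):
    print(sorted(itemset))
```

Note that this only removes mirrored and re-partitioned rules. An itemset that is a subset of a larger one (for example a pair contained in a triple) would still be listed separately and has to be merged by inspection.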

## "So, *who* buys it?"

## Data Preprocessing

In [None]:
data.head()

### <font color='red'>Make new variables</font> 

Based on these findings, **we create 3 new variables.** Each of them is True ONLY if all their subset variables are True.

In [None]:
data['wine_and_choco'] = ((records['wine']==True) & (records['confectionery']==True))
data['fish_and_fruit'] = ((records['fish']==True) & (records['fruitveg']==True))
data['beer_beans_pizza'] = ((records['beer']==True) & (records['cannedveg']==True) & (records['frozenmeal']==True))

In [None]:
data.head().style
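
The "True ONLY if all subset variables are True" logic can be checked on a small example. The sketch below uses a hypothetical four-row dataframe and compares the element-wise `&` used above with a row-wise `all()`, which is one possible alternative when an itemset has many items.

```python
import pandas as pd

# Hypothetical baskets, not taken from shoppingdata.txt
toy = pd.DataFrame({
    "beer":       [True, True, False, True],
    "cannedveg":  [True, False, True, True],
    "frozenmeal": [True, True, True, False],
})

# Element-wise AND, as in the cell above
flag_and = toy["beer"] & toy["cannedveg"] & toy["frozenmeal"]

# Row-wise all(): True only where every listed column is True
flag_all = toy[["beer", "cannedveg", "frozenmeal"]].all(axis=1)

print(flag_and.tolist())
print(flag_all.tolist())
```

Only the first basket contains all three items, so only the first row is flagged by either approach.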

### Select feature variables

Create a list `features` with the names of all explanatory/independent/feature variables and a new dataframe `X` that stores them. We will use these variables to explain/predict whether customers will buy our newly found combined products.

In [None]:
features = ['value', 'pmethod', 'sex', 'homeown', 'income', 'age']
X = data[features]
X.head()

### Encode categorical variables

Using the (by now familiar) pandas `get_dummies()` function.

In [None]:
X = pd.get_dummies(X, columns=["pmethod", "sex", "homeown"], prefix=["pmethod", "sex", "homeown"]) #we add a prefix for easier identification

In [None]:
X.head()
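
To see what the encoding does, here is a small sketch on a hypothetical three-row dataframe (the values are made up). Each category of `sex` becomes its own indicator column, named with the prefix, while numeric columns are left as they are.

```python
import pandas as pd

# Hypothetical customers, not taken from shoppingdata.txt
toy = pd.DataFrame({"sex": ["M", "F", "F"], "income": [15000, 27000, 21000]})

encoded = pd.get_dummies(toy, columns=["sex"], prefix=["sex"])
print(encoded.columns.tolist())
print(encoded)
```

The indicator `sex_M` is 1 (or True, depending on the pandas version) for male customers and 0 (False) otherwise, which is how the tree splits further below should be read.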

## 1. Beer, Beans, Pizza

### Data split


 <font color='red'>In previous tutorials, we had one "global" label/target vector `y` (e.g. fraudulent, response, etc.). Now, we have three different label vectors for the three newly-mined association rules. Thus, we have to specify a new target vector for each association rule and perform a split accordingly. Instead of just `y`, we call it `y_BBP` (for beer, beans, pizza):</font> 


In [None]:
y_BBP = data['beer_beans_pizza']
X_train, X_test, y_train, y_test = train_test_split(X, y_BBP, test_size = 0.3, random_state = 12345) #split data 70:30

### Tree model

Train the Decision Tree model:

In [None]:
#Define and fit the Decision tree classifier with some default parameters
clf = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=2, 
                             min_samples_leaf=3).fit(X_train, y_train)
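
The `criterion = "gini"` setting means that candidate splits are compared by Gini impurity. As a rough illustration of that measure, the sketch below computes it by hand for a hypothetical node with 3 buyers and 7 non-buyers and for one possible split of that node. The labels and the helper `gini` are made up and are not part of scikit-learn.

```python
def gini(labels):
    """Gini impurity of a list of True/False labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_buy = sum(labels) / n
    return 1.0 - p_buy ** 2 - (1.0 - p_buy) ** 2

# Hypothetical node: 3 buyers (True) and 7 non-buyers (False)
parent = [True] * 3 + [False] * 7

# Hypothetical split: the left child is pure, the right child holds all buyers
left = [False] * 6
right = [True] * 3 + [False] * 1

# Impurity after the split is the size-weighted average of the children
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print(f"Gini before split: {gini(parent):.3f}")
print(f"Gini after split:  {weighted:.3f}")
```

A split is attractive when the weighted impurity of the children is clearly lower than that of the parent, as it is here.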

In [None]:
#Use classifier to predict labels
y_pred = clf.predict(X_test)

In [None]:
print("Accuracy is: ", round(accuracy_score(y_test, y_pred) * 100, 2))
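
Accuracy is simply the share of test observations that were labelled correctly. The sketch below recomputes it by hand on ten hypothetical labels. It also adds a comparison that is not part of the original tutorial: the accuracy of always predicting the most common class, which can be a useful yardstick when only a minority of baskets contain the itemset.

```python
# Hypothetical true labels and predictions (True = basket contains the itemset)
y_true = [False, False, False, False, True, False, False, True, False, False]
y_hat  = [False, False, False, False, True, False, False, False, False, False]

# Accuracy: correct predictions divided by all predictions
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Baseline: always predict the most frequent class in y_true
majority_class = max(set(y_true), key=y_true.count)
baseline = y_true.count(majority_class) / len(y_true)

print(f"Accuracy: {accuracy:.0%}")
print(f"Majority-class baseline: {baseline:.0%}")
```

In this toy case the model is only slightly better than the baseline, so a high accuracy on its own does not necessarily mean the tree has learned much about who buys the itemset.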

In [None]:
'''
The graphviz library is used to visualize the tree. 
'''

# Create DOT data
dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=X_train.columns, 
                                class_names=['no buy', 'buy'], filled=True) #or use y_train.unique()

# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())

# Create PNG 
#graph.write_png("clf.png") #uncomment this line to save the plot as a .png file

Who buys beer, canned vegetables (beans) and frozen meals (pizza)? 
- male (the split condition `sex_M <= 0.5` is False, i.e. the right branch, which means `sex_M` = 1)
- income below $16950 (mean 20171, max 30000)

Could be (university) students, low-income single men, etc.
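
Reading a split on a dummy variable can be confusing at first. The sketch below spells out the convention used in the plotted tree, where observations for which the condition is True go to the left child and the others go to the right child. The helper `goes_left` is made up for illustration.

```python
def goes_left(sex_M, threshold=0.5):
    """Return True if the split condition sex_M <= threshold holds (left branch)."""
    return sex_M <= threshold

# sex_M is a dummy variable: 0 for female customers, 1 for male customers
for label, sex_M in [("female", 0), ("male", 1)]:
    branch = "left (condition True)" if goes_left(sex_M) else "right (condition False)"
    print(f"{label}: sex_M = {sex_M} -> {branch}")
```

Since a dummy variable only takes the values 0 and 1, the threshold 0.5 simply separates the two groups.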

<table><tr>
<td> <img src="Images/menpizza.jpg" alt="Drawing" style="width: 450px;"/> </td>
<td> <img src="Images/pizzashirt.jpg" alt="Drawing" style="width: 250px;"/> </td>
    <td> <img src="Images/pizza.jpg" alt="Drawing" style="width: 350px;"/> </td>


</tr></table>

## 2. Wine and chocolate

### Data split

And the same for wine and chocolate:

In [None]:
y_WC = data['wine_and_choco']
X_train, X_test, y_train, y_test = train_test_split(X, y_WC, test_size = 0.3, random_state = 12345) #split data 70:30

### Tree model

In [None]:
#Define Decision tree classifier with some default parameters
clf = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, 
                             min_samples_leaf=3).fit(X_train, y_train)

In [None]:
#Use classifier to predict labels
y_pred = clf.predict(X_test)

In [None]:
print("Accuracy is: ", round(accuracy_score(y_test, y_pred) * 100, 2))

In [None]:
'''
The graphviz library is used to visualize the tree. 
'''

# Create DOT data
dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=X_train.columns, 
                                class_names=['no buy', 'buy'], filled=True) #or use y_train.unique()

# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())

# Create PNG 
#graph.write_png("clf.png") #uncomment this line to save the plot as a .png file

Who buys wine and chocolate?
- female 
- income above $20050 (mean 20171, max 30000)
- shopping value > $32 

Kinda hard to target beyond "is female". 

<table><tr>
<td> <img src="Images/businesswine.jpg" alt="Drawing" style="width: 450px;"/> </td>
<td> <img src="Images/winechoco.jpg" alt="Drawing" style="width: 350px;"/> </td>


</tr></table>

## 3. Fish & Fruit

### Data split

In [None]:
y_FF = data['fish_and_fruit']
X_train, X_test, y_train, y_test = train_test_split(X, y_FF, test_size = 0.3, random_state = 12345) #split data 70:30

### Tree model

In [None]:
#Define Decision tree classifier with some default parameters
clf = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=2, 
                             min_samples_leaf=3).fit(X_train, y_train)

In [None]:
#Use classifier to predict labels
y_pred = clf.predict(X_test)

In [None]:
print("Accuracy is: ", round(accuracy_score(y_test, y_pred) * 100, 2))

In [None]:
'''
The graphviz library is used to visualize the tree. 
'''

# Create DOT data
dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=X_train.columns, 
                                class_names=['no buy', 'buy'], filled=True) #or use y_train.unique()

# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())

# Create PNG 
#graph.write_png("clf.png") #uncomment this line to save the plot as a .png file

Who buys fish and fruit?
- younger than 24.5 years
- do NOT own a home (the split condition `homeown_No <= 0.5` is False, which means `homeown_No` = 1)

One possible explanation could be education (variable not available). Higher education might delay home ownership, as students enter working life later and cannot afford a house in their mid-20s. In contrast, people who go for jobs that do not require higher education start earning money earlier and are able to afford their own place earlier. Yet, this young generation is more health conscious and opts for buying fish and fresh vegetables. 

<table><tr>
<td> <img src="Images/uni.jpg" alt="Drawing" style="width: 450px;"/> </td>
<td> <img src="Images/fish.jpg" alt="Drawing" style="width: 350px;"/> </td>
    <td> <img src="Images/yoga.jpg" alt="Drawing" style="width: 450px;"/> </td>


</tr></table>

## Conclusion


In this tutorial we learned:
- how to perform association rule mining in Python with the **Apriori algorithm** to find common shopping patterns,
- how to define new variables based on these patterns, 
- how to use background data to understand what kind of customers are behind these patterns, 
- how to apply a Decision Tree algorithm to connect background data to the newly found ("mined") shopping patterns.

With these skills, **we are able to understand not only "what is frequently bought together", but also WHO buys it.**
