In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

! pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.23.0-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.0


# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze items that are usually purchased together.

You can refer to these articles to learn more about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [2]:
url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the Set of Purchased Products


Get the unique products that have been purchased.

In [3]:

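# Note: this inspects only column '0', so it captures every product only if
# each product appears at least once in that column.
uniqueproducts = df['0'].unique()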
uniqueproducts

array(['Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil',
       'Diaper', 'Milk'], dtype=object)
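
Reading only the first column works when every product shows up there at least once. As a hedged alternative, the sketch below collects unique products across all columns. The `toy` frame is a hypothetical miniature, not the real dataset, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the transaction table (not the real dataset).
toy = pd.DataFrame({
    "0": ["Bread", "Cheese", "Meat"],
    "1": ["Wine", "Meat", None],
    "2": ["Eggs", None, None],
})

# Flatten every cell into one Series, drop the missing padding, then deduplicate.
all_items = pd.Series(toy.to_numpy().ravel()).dropna()
unique_all = sorted(all_items.unique())

# Looking at the first column alone would miss "Wine" and "Eggs" here.
unique_first_col = sorted(toy["0"].unique())

print(unique_all)
print(unique_first_col)
```

On the store dataset, the same idea would be applied to `df` instead of `toy`.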

## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.

In [7]:
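# Build each transaction's item list with the NaN padding removed
# (not used by the encoding loop below)
itemset = df.apply(lambda row: list(row.dropna()), axis=1)

# One-hot encode each transaction: 1 if the product was purchased, else 0
encoded_data = []
for index, row in df.iterrows():
    labels = {}
    # Products absent from this transaction
    uncommons = list(set(uniqueproducts) - set(row))
    # Products present in this transaction
    commons = list(set(uniqueproducts).intersection(row))
    for x in uncommons:
        labels[x] = 0
    for y in commons:
        labels[y] = 1
    encoded_data.append(labels)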


In [8]:
# Create a new dataframe from the encoded features
encoded_df = pd.DataFrame(encoded_data)
# Show the new dataframe
encoded_df

Unnamed: 0,Bagel,Milk,Eggs,Cheese,Diaper,Meat,Pencil,Wine,Bread
0,0,0,1,1,1,1,1,1,1
1,0,1,0,1,1,1,1,1,1
2,0,1,1,1,0,1,0,1,0
3,0,1,1,1,0,1,0,1,0
4,0,0,0,0,0,1,1,1,0
...,...,...,...,...,...,...,...,...,...
310,0,0,1,1,0,0,0,0,1
311,0,1,0,0,0,1,1,0,0
312,0,0,1,1,1,1,1,1,1
313,0,0,0,1,0,1,0,0,0
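
The row-by-row loop can also be expressed with vectorized pandas operations. The following is a minimal sketch on a hypothetical toy frame (`toy` and `one_hot` are illustrative names); it is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the transaction table (not the real dataset).
toy = pd.DataFrame({
    "0": ["Bread", "Cheese", "Meat"],
    "1": ["Wine", "Meat", None],
    "2": ["Eggs", None, None],
})

# Stack all cells into one long Series indexed by (row, column),
# create one indicator column per product, then collapse back to
# one row per transaction by taking the maximum per original row.
stacked = toy.stack()
one_hot = pd.get_dummies(stacked).groupby(level=0).max().astype(int)

print(one_hot)
```

Missing cells produce no indicator, so padding does not create an extra column.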


Since the encoded dataframe can contain an empty column, we will drop any column containing NaN; alternatively, we could select all columns other than the first.

In [9]:
encoded_df = encoded_df.dropna(axis=1)
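
Recent mlxtend releases emit a warning when `apriori` receives a 0/1 integer DataFrame and recommend boolean columns instead. A minimal sketch of that conversion, using a hypothetical `toy_encoded` frame rather than the real `encoded_df`:

```python
import pandas as pd

# Hypothetical 0/1 encoded frame standing in for encoded_df.
toy_encoded = pd.DataFrame({"Bread": [1, 0, 1], "Milk": [0, 1, 1]})

# Convert the 0/1 indicators to True/False.
toy_bool = toy_encoded.astype(bool)

print(toy_bool.dtypes)
```

The supports computed from the boolean frame are the same as from the 0/1 frame, since only the dtype changes.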

## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased products.
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.

In [10]:
from mlxtend.frequent_patterns import apriori

min_support = 0.2
frequent_product = apriori(encoded_df, min_support=min_support, use_colnames=True)

frequent_product



Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.501587,(Milk)
2,0.438095,(Eggs)
3,0.501587,(Cheese)
4,0.406349,(Diaper)
5,0.47619,(Meat)
6,0.361905,(Pencil)
7,0.438095,(Wine)
8,0.504762,(Bread)
9,0.225397,"(Bagel, Milk)"
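
To make the support threshold concrete, here is a minimal pure-Python sketch of the level-wise idea behind Apriori: count support, keep itemsets above the threshold, and build larger candidates only from itemsets that survived. The functions `support` and `apriori_sketch` and the toy transactions are illustrative assumptions; this is not mlxtend's implementation.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical toy transactions (not the store dataset).
toy_transactions = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
]
frequent = apriori_sketch(toy_transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

In this toy example (Milk, Cheese) falls below the threshold, so the three-item candidate is pruned without ever being counted.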


Next, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6.

In [12]:
from mlxtend.frequent_patterns import association_rules
association_rules(frequent_product, metric = "confidence", min_threshold = 0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
1,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
2,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
3,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
4,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624,0.387409
5,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
6,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
7,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
8,"(Meat, Cheese)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845,0.296655
9,"(Meat, Milk)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137,0.524816


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.

1. **Antecedent Support:**
   - **Definition:** The support of the antecedent in an association rule.
   - **Formula:** Antecedent Support(A) = (Transactions containing A) / (Total transactions)
   - **Explanation:** It measures how frequently the antecedent (A) of an association rule occurs in the dataset.

2. **Consequent Support:**
   - **Definition:** The support of the consequent in an association rule.
   - **Formula:** Consequent Support(C) = (Transactions containing C) / (Total transactions)
   - **Explanation:** It measures how frequently the consequent (C) of an association rule occurs in the dataset.

3. **Support:**
   - **Definition:** The support of the whole rule, i.e. of the combined itemset containing both the antecedent and the consequent.
   - **Formula:** Support(A -> C) = (Transactions containing both A and C) / (Total transactions)
   - **Explanation:** It gauges how frequently the antecedent and consequent occur together in the dataset. More generally, Support(X) = (Transactions containing X) / (Total transactions) for any item or itemset X.

4. **Confidence:**
   - **Definition:** The probability of the consequent given the antecedent in an association rule.
   - **Formula:** Confidence(A -> C) = Support(A -> C) / Support(A)
   - **Explanation:** It is the fraction of transactions containing the antecedent (A) that also contain the consequent (C), so it assesses how reliable the rule is.

5. **Lift:**
   - **Definition:** Measures how much more likely the antecedent and consequent co-occur compared to random chance.
   - **Formula:** Lift(A -> C) = Confidence(A -> C) / Support(C) or Lift(A -> C) = Support(A -> C) / (Support(A) * Support(C))
   - **Explanation:** Lift > 1 indicates a stronger relationship than expected by chance. Lift = 1 implies independence, while Lift < 1 suggests a tendency for items to occur separately.

6. **Leverage:**
   - **Definition:** Quantifies the actual association between item or itemsets compared to what would be expected if they were unrelated.
   - **Formula:** Leverage(A -> C) = Support(A -> C) - Support(A) * Support(C)
   - **Explanation:** Leverage = 0 signifies independence. Positive values indicate association, while negative values suggest a tendency for items to occur separately.

7. **Conviction:**
   - **Definition:** Measures the dependency of the consequent on the antecedent in an association rule.
   - **Formula:** Conviction(A -> C) = (1 - Support(C)) / (1 - Confidence(A -> C))
   - **Explanation:** Higher conviction values indicate that the consequent depends more strongly on the antecedent, while Conviction = 1 signifies independence. If Confidence is 1, the denominator is zero and conviction is reported as infinite.
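
As a check on the formulas above, the sketch below recomputes the metrics for the rule (Meat, Milk) -> (Cheese) from the three support values in the rules table. The function `rule_metrics` is a hypothetical helper written for this illustration; because the inputs are rounded to six decimals, the results agree with the table only approximately.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute rule metrics from the three support values of a rule A -> C."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # When confidence is 1 the denominator is zero, so conviction is infinite.
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Values from row 9 of the rules table: (Meat, Milk) -> (Cheese).
metrics = rule_metrics(
    antecedent_support=0.244444,
    consequent_support=0.501587,
    support=0.203175,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here lift is above 1 and leverage is positive, which is consistent with Meat and Milk buyers being more likely than average to also buy Cheese.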