## Table of contents

1. Introduction
2. Goal
3. Import Dataset & libraries
4. Overview
5. Data Pre-processing
6. Statistical Techniques
7. Descriptive Statistical Analyses
8. Hypothesis Formulation and Testing
9. Jupyter Notebook Analysis
10. Machine Learning
11. Splitting
12. Training and Testing
13. Conclusions
14. References
15. GitHub repo link

### Import Dataset & libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn_extra.cluster import KMedoids
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# read the two sheets of the Excel file
full_path = "data/data.xlsx"
df_1 = pd.read_excel(full_path, sheet_name=0)
df_2 = pd.read_excel(full_path, sheet_name=1)
# print the shape of the two dataframes
print(df_1.shape, df_2.shape)

(525461, 8) (541910, 8)


In [None]:
# create the full dataframe
df = pd.concat([df_1, df_2], axis=0)

In [None]:
print("Our Dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))
display(df.describe())
display(df.head())
display(df.dtypes.value_counts())

Our Dataset has 1067371 rows and 8 columns


Unnamed: 0,Quantity,InvoiceDate,Price,Customer ID
count,1067371.0,1067371,1067371.0,824364.0
mean,9.938898,2011-01-02 21:13:55.394028544,4.649388,15324.638504
min,-80995.0,2009-12-01 07:45:00,-53594.36,12346.0
25%,1.0,2010-07-09 09:46:00,1.25,13975.0
50%,3.0,2010-12-07 15:28:00,2.1,15255.0
75%,10.0,2011-07-22 10:23:00,4.15,16797.0
max,80995.0,2011-12-09 12:50:00,38970.0,18287.0
std,172.7058,,123.5531,1697.46445


Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


object            4
float64           2
int64             1
datetime64[ns]    1
Name: count, dtype: int64

In [None]:
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


### Market Basket Analysis

In [None]:
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [None]:
# check for missing values
missing_values = df.isnull().sum()
print(missing_values)

Invoice             0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
Price               0
Customer ID    243007
Country             0
dtype: int64


In [None]:
# Drop the rows whose Description is missing. Rows with a missing Customer ID are kept:
# baskets are built per Invoice, and Invoice has no missing values.
# .copy() lets us modify prep_df later without touching df
prep_df = df.dropna(subset=['Description']).copy()


In [None]:
# check for missing values
missing_values = prep_df.isnull().sum()
print(missing_values)

Invoice             0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
Price               0
Customer ID    238625
Country             0
dtype: int64


In [None]:
# To mine association rules we just need invoice and StockCode, we can then retrieve the description of the product from the original dataset
prep_df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [None]:
# check if for a stockcode we have multiple descriptions
print(prep_df["StockCode"].nunique())
print(prep_df["Description"].nunique())

4950
5698


In [None]:
stockcode_description = prep_df.groupby("StockCode")["Description"].nunique()

In [None]:
stockcode_description[stockcode_description > 1]

StockCode
10080           2
10120           2
10133           2
16008           2
16011           2
               ..
DCGS0068        2
DCGS0069        2
DCGSSBOY        2
DCGSSGIRL       2
gift_0001_20    2
Name: Description, Length: 1232, dtype: int64

1,232 StockCode values are associated with more than one description. This implies that we have to use Description, rather than StockCode, as the item identifier passed to the TransactionEncoder.

In [None]:
# Map every unique description to a unique integer before using the TransactionEncoder:
# build a dictionary for the mapping and apply it to the dataframe with map
description_to_int = {desc: idx for idx, desc in enumerate(prep_df['Description'].unique())}
# Replace descriptions with corresponding integers
prep_df['Description'] = prep_df['Description'].map(description_to_int)

In [None]:
prep_df["Description"].nunique()

5698

In [None]:
# Group by 'Invoice' and aggregate 'Description' into lists
transactions = prep_df.groupby('Invoice')['Description'].apply(list)
te = TransactionEncoder()
# Fit TransactionEncoder to the transactions and transform the data into one-hot encoded format
one_hot_encoded = te.fit(transactions).transform(transactions)
# Convert the one-hot encoded format into a DataFrame
basket = pd.DataFrame(one_hot_encoded, columns=te.columns_)

In [None]:
transactions.head()

Invoice
489434                             [0, 1, 2, 3, 4, 5, 6, 7]
489435                                       [8, 9, 10, 11]
489436    [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2...
489437    [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 4...
489438    [53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 6...
Name: Description, dtype: object
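
To make the encoding step concrete, here is a minimal sketch in plain Python (no mlxtend) of the kind of one-hot matrix built above. The invoice numbers and item lists are hypothetical toy values, not rows from the dataset; the sketch only mirrors the idea of one row per invoice and one boolean column per distinct item.

```python
# Hypothetical toy invoices: invoice number -> list of integer-encoded items
toy_transactions = {
    "A001": [0, 1, 2],
    "A002": [1, 3],
    "A003": [0, 1],
}

# One column per distinct item, in sorted order
columns = sorted({item for items in toy_transactions.values() for item in items})

# One boolean row per invoice: True where the invoice contains the item
one_hot_rows = [
    [item in set(items) for item in columns]
    for items in toy_transactions.values()
]

for invoice, row in zip(toy_transactions, one_hot_rows):
    print(invoice, row)
```

The real basket has the same layout, only with 49353 rows and 5698 columns.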

In [None]:
prep_df["Invoice"].nunique()

49353

In [None]:
# double check that we have one row per unique invoice number and one column per unique description
print(basket.shape)
print(prep_df["Invoice"].nunique())
print(prep_df["Description"].nunique())


(49353, 5698)
49353
5698



We write a function, mine_association_rules, that takes a one-hot encoded dataset basket, a minimum support min_support, a minimum threshold min_threshold for the rule metric, and an algorithm choice ('apriori' or 'fpgrowth'). It mines association rules from the dataset with the chosen algorithm and the given support and threshold values.

Wrapping this logic in a function makes it easy to compare results across different values of min_support and min_threshold.

If the algorithm is 'apriori', the function uses the Apriori algorithm to generate frequent itemsets from the dataset. If it is 'fpgrowth', it uses the FP-growth algorithm.

After obtaining the frequent itemsets, the function derives association rules, keeping those whose confidence is at least min_threshold, and returns them.

In [None]:

def mine_association_rules(basket, min_support, min_threshold, algorithm='apriori'):
    # Generate frequent itemsets with the requested algorithm
    if algorithm == 'apriori':
        frequent_itemsets = apriori(basket, min_support=min_support, use_colnames=True)
    elif algorithm == 'fpgrowth':
        frequent_itemsets = fpgrowth(basket, min_support=min_support, use_colnames=True)
    else:
        # Fail early: otherwise frequent_itemsets would be undefined below
        raise ValueError("Choose apriori or fpgrowth")
    print(f"Generate frequent Itemset using: {algorithm}")

    # Keep only the rules whose confidence is at least min_threshold
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_threshold)
    return rules
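
For reference, the three core rule metrics can be computed by hand. The sketch below uses a hypothetical toy list of baskets and a helper named support that is introduced here only for illustration; it follows the standard definitions of support, confidence, and lift rather than reproducing mlxtend's implementation.

```python
# Hypothetical toy baskets, each basket is a set of items
toy_baskets = [
    {"tea", "cup"},
    {"tea", "cup", "saucer"},
    {"tea"},
    {"cup"},
    {"saucer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule under study: {tea} -> {cup}
antecedent, consequent = {"tea"}, {"cup"}

rule_support = support(antecedent | consequent, toy_baskets)
# Confidence: how often the consequent appears in baskets containing the antecedent
confidence = rule_support / support(antecedent, toy_baskets)
# Lift: confidence relative to the baseline frequency of the consequent
lift = confidence / support(consequent, toy_baskets)

print(rule_support, confidence, lift)
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.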

In [None]:
# let's write a function to print the rules
def explain_association_rules(rules):
    print("Number of rules found: ", rules.shape[0])
    print("The 5 first rules are: ")
    display(rules.head())  

In [None]:
# create the inverse dictionary to map the integer to the description
int_to_description = {idx: desc for desc, idx in description_to_int.items()}

In [None]:
# we define different thresholds for the support and confidence
min_support = [0.02, 0.05, 0.1]
min_threshold = [0.5, 0.6, 0.7, 0.8, 0.9]

#### Apriori Algorithm

In [None]:
# rules Apriori
for min_s in min_support:
    for min_t in min_threshold:
        print(f"Support: {min_s}, Threshold: {min_t}")
        rules = mine_association_rules(basket, min_s, min_t, algorithm='apriori')
        # Map the integer-encoded antecedents and consequents back to descriptions
        rules['antecedents'] = rules['antecedents'].apply(lambda x: tuple(int_to_description[desc] for desc in x))
        rules['consequents'] = rules['consequents'].apply(lambda x: tuple(int_to_description[desc] for desc in x))
        explain_association_rules(rules)

Support: 0.02, Threshold: 0.5


Generate frequent Itemset using: apriori
Number of rules found:  7
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(232),(4),0.031467,0.050412,0.022025,0.699936,13.884213,0.020439,3.164613,0.958125
1,(130),(91),0.036371,0.113347,0.025409,0.698607,6.163454,0.021286,2.941853,0.869373
2,(194),(694),0.04247,0.041254,0.022775,0.53626,12.999026,0.021023,2.06742,0.964012
3,(694),(194),0.041254,0.04247,0.022775,0.552063,12.999026,0.021023,2.137645,0.96279
4,(3760),(3759),0.040099,0.04405,0.020404,0.508843,11.551482,0.018638,1.946322,0.951589
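
As a sanity check, the derived columns of this table can be recomputed from the support columns. The sketch below takes the values of row 0 as printed (so already rounded) and applies the standard definitions of lift, leverage, and conviction; small differences from the table are therefore expected.

```python
# Row 0 of the table above, values as printed (rounded to 6 decimals)
antecedent_support = 0.031467
consequent_support = 0.050412
support = 0.022025
confidence = 0.699936

# lift: confidence relative to the baseline frequency of the consequent
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: expected error rate under independence over the observed error rate
conviction = (1 - consequent_support) / (1 - confidence)

print(round(lift, 3), round(leverage, 6), round(conviction, 3))
```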


Support: 0.02, Threshold: 0.6
Generate frequent Itemset using: apriori
Number of rules found:  4
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(232),(4),0.031467,0.050412,0.022025,0.699936,13.884213,0.020439,3.164613,0.958125
1,(130),(91),0.036371,0.113347,0.025409,0.698607,6.163454,0.021286,2.941853,0.869373
2,(4097),(4099),0.028387,0.029846,0.021154,0.745182,24.967392,0.020306,3.807242,0.987994
3,(4099),(4097),0.029846,0.028387,0.021154,0.708758,24.967392,0.020306,3.336097,0.98948


Support: 0.02, Threshold: 0.7
Generate frequent Itemset using: apriori
Number of rules found:  2
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(4097),(4099),0.028387,0.029846,0.021154,0.745182,24.967392,0.020306,3.807242,0.987994
1,(4099),(4097),0.029846,0.028387,0.021154,0.708758,24.967392,0.020306,3.336097,0.98948


Support: 0.02, Threshold: 0.8
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.02, Threshold: 0.9
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.5
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.6
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.7
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.8
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.9
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.5
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.6
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.7
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.8
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.9
Generate frequent Itemset using: apriori
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


#### FP-Growth Algorithm

In [None]:
# rules FP-Growth
for min_s in min_support:
    for min_t in min_threshold:
        print(f"Support: {min_s}, Threshold: {min_t}")
        rules = mine_association_rules(basket, min_s, min_t, algorithm='fpgrowth')
        # Map back antecedents and consequents to descriptions
        rules['antecedents'] = rules['antecedents'].apply(lambda x: tuple(int_to_description[desc] for desc in x))
        rules['consequents'] = rules['consequents'].apply(lambda x: tuple(int_to_description[desc] for desc in x))
        explain_association_rules(rules)

Support: 0.02, Threshold: 0.5
Generate frequent Itemset using: fpgrowth
Number of rules found:  7
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,"(RED HANGING HEART T-LIGHT HOLDER,)","(WHITE HANGING HEART T-LIGHT HOLDER,)",0.036371,0.113347,0.025409,0.698607,6.163454,0.021286,2.941853,0.869373
1,"(SWEETHEART CERAMIC TRINKET BOX,)","(STRAWBERRY CERAMIC TRINKET BOX,)",0.031467,0.050412,0.022025,0.699936,13.884213,0.020439,3.164613,0.958125
2,"(WOODEN FRAME ANTIQUE WHITE ,)","(WOODEN PICTURE FRAME WHITE FINISH,)",0.04247,0.041254,0.022775,0.53626,12.999026,0.021023,2.06742,0.964012
3,"(WOODEN PICTURE FRAME WHITE FINISH,)","(WOODEN FRAME ANTIQUE WHITE ,)",0.041254,0.04247,0.022775,0.552063,12.999026,0.021023,2.137645,0.96279
4,"(HEART OF WICKER LARGE,)","(HEART OF WICKER SMALL,)",0.040099,0.04405,0.020404,0.508843,11.551482,0.018638,1.946322,0.951589


Support: 0.02, Threshold: 0.6
Generate frequent Itemset using: fpgrowth
Number of rules found:  4
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,"(RED HANGING HEART T-LIGHT HOLDER,)","(WHITE HANGING HEART T-LIGHT HOLDER,)",0.036371,0.113347,0.025409,0.698607,6.163454,0.021286,2.941853,0.869373
1,"(SWEETHEART CERAMIC TRINKET BOX,)","(STRAWBERRY CERAMIC TRINKET BOX,)",0.031467,0.050412,0.022025,0.699936,13.884213,0.020439,3.164613,0.958125
2,"(GREEN REGENCY TEACUP AND SAUCER,)","(ROSES REGENCY TEACUP AND SAUCER ,)",0.028387,0.029846,0.021154,0.745182,24.967392,0.020306,3.807242,0.987994
3,"(ROSES REGENCY TEACUP AND SAUCER ,)","(GREEN REGENCY TEACUP AND SAUCER,)",0.029846,0.028387,0.021154,0.708758,24.967392,0.020306,3.336097,0.98948


Support: 0.02, Threshold: 0.7
Generate frequent Itemset using: fpgrowth
Number of rules found:  2
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,"(GREEN REGENCY TEACUP AND SAUCER,)","(ROSES REGENCY TEACUP AND SAUCER ,)",0.028387,0.029846,0.021154,0.745182,24.967392,0.020306,3.807242,0.987994
1,"(ROSES REGENCY TEACUP AND SAUCER ,)","(GREEN REGENCY TEACUP AND SAUCER,)",0.029846,0.028387,0.021154,0.708758,24.967392,0.020306,3.336097,0.98948


Support: 0.02, Threshold: 0.8
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.02, Threshold: 0.9
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.5
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.6
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.7
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.8
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.05, Threshold: 0.9
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.5
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.6
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.7
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.8
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric


Support: 0.1, Threshold: 0.9
Generate frequent Itemset using: fpgrowth
Number of rules found:  0
The 5 first rules are: 


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric



#### Candidate Generation:

- Apriori: Generates candidate itemsets of size $k$ by joining frequent itemsets of size $k-1$. It then prunes every candidate that contains an infrequent subset, following the Apriori principle.
- FP-growth: Does not generate candidate itemsets explicitly. Instead, it constructs an FP-tree to represent the dataset and mines frequent itemsets directly from the tree.
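
The join and prune steps of Apriori can be sketched in a few lines of plain Python. The function name generate_candidates and the toy itemsets are assumptions made for illustration; this is a simplified version of the idea, not the routine mlxtend uses internally.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets into k-itemset candidates, then prune."""
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            # Join step: keep unions that have exactly k items
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent
            if all(frozenset(s) in frequent_prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets
frequent_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
candidates_3 = generate_candidates(frequent_2, 3)
print(sorted(sorted(c) for c in candidates_3))
```

Here {a, b, d} and {b, c, d} are produced by the join but pruned, because {a, d} and {c, d} are not frequent; only {a, b, c} survives to be counted against the data.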

#### Database Scans:

- Apriori: Requires multiple scans of the dataset, typically one per itemset size. Each scan counts the support of the current candidate itemsets so that the infrequent ones can be pruned.
- FP-growth: Requires two passes over the dataset: one to count item frequencies and one to build the FP-tree. Frequent itemsets are then mined from the tree without rescanning the data, because the tree already captures the itemset frequencies and relationships.
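
To illustrate how the number of scans grows with Apriori, the following sketch runs a simplified level-wise search and counts one dataset pass per itemset size. The function apriori_with_scan_count, the toy transactions, and the absolute threshold min_count are all hypothetical; enumerating the single items is not counted as a separate pass here.

```python
from itertools import combinations

def apriori_with_scan_count(transactions, min_count):
    """Level-wise Apriori that also reports how many counting passes it made."""
    transactions = [frozenset(t) for t in transactions]
    scans = 0
    frequent = {}
    # Level 1 candidates: every single item
    candidates = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while candidates:
        scans += 1  # one full pass over the data to count this level
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent itemsets into size-k candidates, then prune by subsets
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans

toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
frequent, scans = apriori_with_scan_count(toy, min_count=3)
print(len(frequent), "frequent itemsets found in", scans, "scans")
```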

#### Memory Usage:

- Apriori: May require a large amount of memory, especially for large datasets or datasets with a high number of unique items, due to the need to store candidate itemsets.
- FP-growth: Typically requires less memory than Apriori because it does not need to generate candidate itemsets explicitly. Instead, it uses the compact FP-tree data structure.
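
The compactness of the FP-tree comes from transactions sharing prefixes. Below is a minimal sketch of tree construction, assuming a hypothetical FPNode class and build_fp_tree helper; the header table and node links that a full FP-growth implementation needs for mining are left out.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item plus the number of transactions through it."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count how many transactions contain each item
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, n in item_counts.items() if n >= min_count}
    root = FPNode(None)
    n_nodes = 0
    # Pass 2: insert each transaction, frequent items only, most frequent first
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-item_counts[i], i),
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                n_nodes += 1
            node = node.children[item]
            node.count += 1
    return root, n_nodes

toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "d"]]
root, n_nodes = build_fp_tree(toy, min_count=2)
total_items = sum(len(t) for t in toy)
print(n_nodes, "tree nodes for", total_items, "item occurrences")
```

Sorting items by descending frequency is what lets common prefixes overlap, so the tree needs far fewer nodes than there are item occurrences in the raw transactions.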

#### Performance:

- Apriori: While effective, it may become inefficient for large datasets or low minimum support thresholds, because it needs to generate a large number of candidate itemsets and perform multiple database scans.
- FP-growth: Generally faster and more scalable than Apriori, especially for large datasets or low minimum support thresholds, because it avoids repeated database scans and relies on the compact FP-tree data structure.

#### Algorithm Complexity:

- Apriori: The worst-case time complexity of Apriori is $O(2^d)$, where $d$ is the number of unique items, since up to $2^d$ itemsets may have to be considered. In practice the cost also depends on the minimum support threshold, which limits how long frequent itemsets can grow.
- FP-growth: The time complexity of FP-growth is generally lower than that of Apriori, often described as closer to $O(n \cdot m)$, where $n$ is the number of transactions and $m$ is the number of frequent itemsets.
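
To see where the exponential term comes from, one can count every non-empty itemset that can be formed from a given number of distinct items. The helper count_itemsets below is a hypothetical illustration; it enumerates the itemsets by brute force and compares the count with the closed form.

```python
from itertools import combinations

def count_itemsets(items):
    """Count all non-empty subsets of `items` by enumerating them."""
    items = list(items)
    return sum(
        1
        for k in range(1, len(items) + 1)
        for _ in combinations(items, k)
    )

for n_items in (3, 5, 10):
    # The brute-force count matches 2**n_items - 1
    print(n_items, count_itemsets(range(n_items)), 2 ** n_items - 1)
```

With the 5698 distinct descriptions in this dataset, exhaustive enumeration is out of reach, which is why both algorithms rely on the minimum support threshold to cut the search space.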