
# Frequent Pattern Mining Algorithms

 

## Apriori Algorithm

Apriori is an association rule mining algorithm used to uncover relationships between items that frequently occur together in transactions. Its most prominent practical application is recommending products based on the items already in a user's cart. Walmart in particular has made great use of the algorithm to suggest products to its users.
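The rules produced later in this notebook are ranked by support, confidence, and lift. As a rough illustration of what those numbers mean, here is a minimal sketch in plain Python. The transactions and the helper names `support`, `confidence`, and `lift` are made up for this example and are not part of mlxtend or the Online Retail data.

```python
# Toy transactions (hypothetical data, not from the Online Retail dataset)
transactions = [
    {"plates", "cups", "napkins"},
    {"plates", "cups"},
    {"plates", "napkins"},
    {"cups"},
    {"plates", "cups", "napkins"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, relative to the antecedent alone."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup = support({"plates", "cups"}, transactions)
conf = confidence({"plates"}, {"cups"}, transactions)
lft = lift({"plates"}, {"cups"}, transactions)
print(sup, conf, lft)
```

A lift above 1 suggests the two sides of a rule appear together more often than they would if they were independent; a lift below 1 suggests the opposite.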

### Import Dependencies

In [1]:
%pip install mlxtend --user



In [2]:
%pip install mlxtend --upgrade



In [4]:

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth

### Mount Google Drive

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Dataset

I used the UCI Online Retail [dataset](http://archive.ics.uci.edu/ml/datasets/Online+Retail), which contains all the transactions for a UK-based online retail store.

In [6]:
data = pd.read_excel('/content/drive/MyDrive/Online Retail.xlsx')
data.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [7]:
data.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [8]:
data.Country.unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

### Data Cleaning

In [9]:
data['Description'] = data['Description'].str.strip()
data.dropna(axis = 0, subset =['InvoiceNo'], inplace = True)
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
 
# Dropping cancelled transactions (invoice numbers containing 'C')
data = data[~data['InvoiceNo'].str.contains('C')]
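To see what the cleaning step does, the following sketch runs the same operations on a tiny hand-made frame. The rows are hypothetical. The UCI dataset description states that an invoice number starting with the letter "C" indicates a cancellation, so this sketch uses `str.startswith`, which is slightly stricter than the `str.contains` call above.

```python
import pandas as pd

# Hypothetical rows: one normal invoice, one cancellation, one normal invoice
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536381],
    "Description": [" WHITE METAL LANTERN ", "Discount", "RABBIT NIGHT LIGHT"],
})

# Same steps as the cleaning cell: trim whitespace, cast invoice numbers to str
toy["Description"] = toy["Description"].str.strip()
toy["InvoiceNo"] = toy["InvoiceNo"].astype("str")

# Keep only invoices that are not cancellations
kept = toy[~toy["InvoiceNo"].str.startswith("C")]
print(kept)
```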

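Here we build a separate basket for each country of interest: one row per invoice, one column per item, with the purchased quantity as the value.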

In [10]:
# Transactions done in France
basket_France = (data[data['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
 
# Transactions done in the United Kingdom
basket_UK = (data[data['Country'] =="United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
 
# Transactions done in Portugal
basket_Por = (data[data['Country'] =="Portugal"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
 
# Transactions done in Sweden
basket_Sweden = (data[data['Country'] =="Sweden"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
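The four baskets above repeat the same pipeline, so it could be wrapped in a small helper. The sketch below assumes a function named `build_basket`, which is hypothetical and not part of the original notebook, and runs it on a hand-made frame. It uses `unstack(fill_value=0)` instead of the `reset_index().fillna(0).set_index(...)` chain, which should give the same matrix.

```python
import pandas as pd

def build_basket(df, country):
    """Return an invoice-by-item quantity matrix for one country (hypothetical helper)."""
    return (df[df["Country"] == country]
            .groupby(["InvoiceNo", "Description"])["Quantity"]
            .sum()
            .unstack(fill_value=0))  # items missing from an invoice get quantity 0

# Hand-made rows for illustration only
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "2", "3"],
    "Description": ["CUPS", "PLATES", "CUPS", "CUPS", "PLATES"],
    "Quantity": [6, 6, 2, 3, 4],
    "Country": ["France", "France", "France", "France", "Sweden"],
})

basket_toy = build_basket(toy, "France")
print(basket_toy)
```

With such a helper, each country's basket becomes a one-line call, for example `build_basket(data, "France")`.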

### One-hot encoding the data

In [11]:
# Defining the one-hot encoding function: a quantity of at least 1
# becomes 1 (item present in the invoice), anything else becomes 0
def hot_encode(x):
    return 1 if x >= 1 else 0
 
# Encoding the datasets
basket_France = basket_France.applymap(hot_encode)
basket_UK = basket_UK.applymap(hot_encode)
basket_Por = basket_Por.applymap(hot_encode)
basket_Sweden = basket_Sweden.applymap(hot_encode)
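The element-wise encoding above can also be written as a single vectorized comparison. This is a sketch on hypothetical data, not the notebook's own cell. It produces a boolean frame, which is a format mlxtend's `apriori` and `fpgrowth` accept as an alternative to 0/1 values.

```python
import pandas as pd

# Hypothetical quantity matrix, including a zero and a negative quantity
basket_toy = pd.DataFrame(
    {"CUPS": [6, 5, 0], "PLATES": [6, 0, -1]},
    index=pd.Index(["1", "2", "3"], name="InvoiceNo"),
)

# True where the item was bought at least once in the invoice, else False
encoded = basket_toy >= 1
print(encoded)
```

The comparison avoids calling a Python function once per cell, which tends to matter on the much larger UK basket.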

### Build Model

For France, the rules below show that paper plates, cups, and napkins are usually bought together.





In [12]:
# Building the model
frq_items = apriori(basket_France, min_support = 0.05, use_colnames = True)
 
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 45 | (JUMBO BAG WOODLAND ANIMALS) | (POSTAGE) | 0.076531 | 0.765306 | 0.076531 | 1.0 | 1.306667 | 0.017961 | inf |
| 260 | (RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI... | (POSTAGE) | 0.05102 | 0.765306 | 0.05102 | 1.0 | 1.306667 | 0.011974 | inf |
| 272 | (RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI... | (POSTAGE) | 0.053571 | 0.765306 | 0.053571 | 1.0 | 1.306667 | 0.012573 | inf |
| 302 | (SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED... | (SET/6 RED SPOTTY PAPER PLATES) | 0.102041 | 0.127551 | 0.09949 | 0.975 | 7.644 | 0.086474 | 34.897959 |
| 300 | (SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET... | (SET/6 RED SPOTTY PAPER CUPS) | 0.102041 | 0.137755 | 0.09949 | 0.975 | 7.077778 | 0.085433 | 34.489796 |
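The metric columns are related by simple formulas, which can be checked against rule 302 from the France output. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction. The three input values are copied from the rounded table, so the results should agree with the table only to a few decimal places.

```python
# Values copied from rule 302 in the France output (rounded as displayed)
antecedent_support = 0.102041
consequent_support = 0.127551
support = 0.09949

# confidence: how often the consequent appears when the antecedent does
confidence = support / antecedent_support
# lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail if the sides were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 3), round(leverage, 4), round(conviction, 1))
```

The same formula explains the `inf` conviction of the POSTAGE rules: their confidence is 1.0, so the denominator `1 - confidence` is zero.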


For Portugal, the rules below show that colour pencil sets and knick knack tins are usually bought together.

In [15]:
frq_items = apriori(basket_Por, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 1170 | (SET 12 COLOUR PENCILS SPACEBOY) | (SET 12 COLOUR PENCILS DOLLY GIRL) | 0.051724 | 0.051724 | 0.051724 | 1.0 | 19.333333 | 0.049049 | inf |
| 1171 | (SET 12 COLOUR PENCILS DOLLY GIRL) | (SET 12 COLOUR PENCILS SPACEBOY) | 0.051724 | 0.051724 | 0.051724 | 1.0 | 19.333333 | 0.049049 | inf |
| 1172 | (SET 12 COLOUR PENCILS DOLLY GIRL) | (SET OF 4 KNICK KNACK TINS LONDON) | 0.051724 | 0.051724 | 0.051724 | 1.0 | 19.333333 | 0.049049 | inf |
| 1173 | (SET OF 4 KNICK KNACK TINS LONDON) | (SET 12 COLOUR PENCILS DOLLY GIRL) | 0.051724 | 0.051724 | 0.051724 | 1.0 | 19.333333 | 0.049049 | inf |
| 1174 | (SET 12 COLOUR PENCILS DOLLY GIRL) | (SET OF 4 KNICK KNACK TINS POPPIES) | 0.051724 | 0.051724 | 0.051724 | 1.0 | 19.333333 | 0.049049 | inf |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1056 | (POSTAGE) | (REGENCY CAKESTAND 3 TIER) | 0.517241 | 0.086207 | 0.051724 | 0.1 | 1.160000 | 0.007134 | 1.015326 |
| 1058 | (POSTAGE) | (RETROSPOT HEART HOT WATER BOTTLE) | 0.517241 | 0.086207 | 0.051724 | 0.1 | 1.160000 | 0.007134 | 1.015326 |
| 1066 | (POSTAGE) | (SET OF 3 CAKE TINS PANTRY DESIGN) | 0.517241 | 0.086207 | 0.051724 | 0.1 | 1.160000 | 0.007134 | 1.015326 |
| 1070 | (POSTAGE) | (SET OF 36 TEATIME PAPER DOILIES) | 0.517241 | 0.086207 | 0.051724 | 0.1 | 1.160000 | 0.007134 | 1.015326 |


For Sweden, the results indicate that boys' and girls' cutlery are usually bought together.

In [None]:

frq_items = apriori(basket_Sweden, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())

## FP Growth

In [16]:
from mlxtend.frequent_patterns import fpgrowth


### France

In [17]:
# Building the model
frq_items_france = fpgrowth(basket_France, min_support = 0.05, use_colnames = True)
 
frq_items_france

| | support | itemsets |
|---|---|---|
| 0 | 0.765306 | (POSTAGE) |
| 1 | 0.181122 | (RED TOADSTOOL LED NIGHT LIGHT) |
| 2 | 0.158163 | (ROUND SNACK BOXES SET OF4 WOODLAND) |
| 3 | 0.125000 | (SPACEBOY LUNCH BOX) |
| 4 | 0.104592 | (MINI PAINT SET VINTAGE) |
| ... | ... | ... |
| 190 | 0.061224 | (POSTAGE, LUNCH BAG APPLE DESIGN, LUNCH BAG RE... |
| 191 | 0.063776 | (CHILDRENS CUTLERY DOLLY GIRL, CHILDRENS CUTLE... |
| 192 | 0.053571 | (POSTAGE, CHARLOTTE BAG APPLES DESIGN) |
| 193 | 0.058673 | (POSTAGE, JUMBO BAG APPLES) |


The rules below show that, in France, **RED TOADSTOOL LED NIGHT LIGHT** and **RABBIT NIGHT LIGHT** tend to be bought together (lift of about 1.57).

In [19]:
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items_france, metric ="lift", min_threshold = 1)
rules_france = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (POSTAGE) | (RED TOADSTOOL LED NIGHT LIGHT) | 0.765306 | 0.181122 | 0.158163 | 0.206667 | 1.141033 | 0.019549 | 1.032199 |
| 1 | (RED TOADSTOOL LED NIGHT LIGHT) | (POSTAGE) | 0.181122 | 0.765306 | 0.158163 | 0.873239 | 1.141033 | 0.019549 | 1.851474 |
| 2 | (RED TOADSTOOL LED NIGHT LIGHT) | (RABBIT NIGHT LIGHT) | 0.181122 | 0.188776 | 0.053571 | 0.295775 | 1.566806 | 0.019380 | 1.151939 |
| 3 | (RABBIT NIGHT LIGHT) | (RED TOADSTOOL LED NIGHT LIGHT) | 0.188776 | 0.181122 | 0.053571 | 0.283784 | 1.566806 | 0.019380 | 1.143338 |
| 4 | (POSTAGE) | (ROUND SNACK BOXES SET OF4 WOODLAND) | 0.765306 | 0.158163 | 0.147959 | 0.193333 | 1.222366 | 0.026916 | 1.043599 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 343 | (CHARLOTTE BAG APPLES DESIGN) | (POSTAGE) | 0.068878 | 0.765306 | 0.053571 | 0.777778 | 1.016296 | 0.000859 | 1.056122 |
| 344 | (POSTAGE) | (JUMBO BAG APPLES) | 0.765306 | 0.066327 | 0.058673 | 0.076667 | 1.155897 | 0.007913 | 1.011199 |
| 345 | (JUMBO BAG APPLES) | (POSTAGE) | 0.066327 | 0.765306 | 0.058673 | 0.884615 | 1.155897 | 0.007913 | 2.034014 |
| 346 | (POSTAGE) | (RABBIT NIGHT LIGHT) | 0.765306 | 0.188776 | 0.165816 | 0.216667 | 1.147748 | 0.021345 | 1.035606 |
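Reading the strongest pairing off a long rule table is easier after ranking by lift. The sketch below rebuilds three of the France rules by hand, with values copied from the output above, and picks the rule with the highest lift. The frame `rules_toy` is a stand-in for illustration; on the real data the same `sort_values` call could be applied to the frame returned by `association_rules`.

```python
import pandas as pd

# Three rules copied by hand from the France FP-Growth output
rules_toy = pd.DataFrame({
    "antecedents": [
        frozenset({"POSTAGE"}),
        frozenset({"RED TOADSTOOL LED NIGHT LIGHT"}),
        frozenset({"POSTAGE"}),
    ],
    "consequents": [
        frozenset({"RED TOADSTOOL LED NIGHT LIGHT"}),
        frozenset({"RABBIT NIGHT LIGHT"}),
        frozenset({"ROUND SNACK BOXES SET OF4 WOODLAND"}),
    ],
    "confidence": [0.206667, 0.295775, 0.193333],
    "lift": [1.141033, 1.566806, 1.222366],
})

# Rank by lift and take the strongest rule
strongest = rules_toy.sort_values("lift", ascending=False).iloc[0]
pair = strongest["antecedents"] | strongest["consequents"]
print(sorted(pair), strongest["lift"])
```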


### Sweden

In [22]:
frq_items_swe = fpgrowth(basket_Sweden, min_support = 0.05, use_colnames = True)
 
frq_items_swe

| | support | itemsets |
|---|---|---|
| 0 | 0.194444 | (SET OF 3 CAKE TINS PANTRY DESIGN) |
| 1 | 0.138889 | (WORLD WAR 2 GLIDERS ASSTD DESIGNS) |
| 2 | 0.111111 | (ROUND SNACK BOXES SET OF 4 FRUITS) |
| 3 | 0.111111 | (PACK OF 72 RETROSPOT CAKE CASES) |
| 4 | 0.111111 | (PACK OF 60 SPACEBOY CAKE CASES) |
| ... | ... | ... |
| 2851 | 0.055556 | (MINI LIGHTS WOODLAND MUSHROOMS, RABBIT NIGHT ... |
| 2852 | 0.055556 | (POSTAGE, MINI LIGHTS WOODLAND MUSHROOMS, RABB... |
| 2853 | 0.055556 | (POSTAGE, MINI LIGHTS WOODLAND MUSHROOMS, GUMB... |
| 2854 | 0.055556 | (MINI LIGHTS WOODLAND MUSHROOMS, GUMBALL COAT ... |


The rules below show that, in Sweden, **PACK OF 60 SPACEBOY CAKE CASES** and **RETROSPOT TEA SET CERAMIC 11 PC** tend to be bought together, based on the lift value.

In [23]:
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items_swe, metric ="lift", min_threshold = 1)
rules_swe = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
rules_swe

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 121 | (MINI PAINT SET VINTAGE, PACK OF 72 RETROSPOT ... | (PACK OF 60 SPACEBOY CAKE CASES, RETROSPOT TEA... | 0.055556 | 0.055556 | 0.055556 | 1.000000 | 18.000000 | 0.052469 | inf |
| 122 | (PACK OF 72 RETROSPOT CAKE CASES, RETROSPOT TE... | (PACK OF 60 SPACEBOY CAKE CASES, MINI PAINT SE... | 0.055556 | 0.055556 | 0.055556 | 1.000000 | 18.000000 | 0.052469 | inf |
| 123 | (PACK OF 60 SPACEBOY CAKE CASES, MINI PAINT SE... | (PACK OF 72 RETROSPOT CAKE CASES, RETROSPOT TE... | 0.055556 | 0.055556 | 0.055556 | 1.000000 | 18.000000 | 0.052469 | inf |
| 124 | (PACK OF 60 SPACEBOY CAKE CASES, RETROSPOT TEA... | (MINI PAINT SET VINTAGE, PACK OF 72 RETROSPOT ... | 0.055556 | 0.055556 | 0.055556 | 1.000000 | 18.000000 | 0.052469 | inf |
| 149 | (SET OF 3 CAKE TINS PANTRY DESIGN, PACK OF 60 ... | (PACK OF 72 RETROSPOT CAKE CASES, RETROSPOT TE... | 0.055556 | 0.055556 | 0.055556 | 1.000000 | 18.000000 | 0.052469 | inf |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 104604 | (POSTAGE) | (MINI LIGHTS WOODLAND MUSHROOMS, GUMBALL COAT ... | 0.611111 | 0.055556 | 0.055556 | 0.090909 | 1.636364 | 0.021605 | 1.038889 |
| 1342 | (POSTAGE) | (ROUND SNACK BOXES SET OF4 WOODLAND) | 0.611111 | 0.083333 | 0.055556 | 0.090909 | 1.090909 | 0.004630 | 1.008333 |
| 1347 | (POSTAGE) | (ROUND SNACK BOXES SET OF4 WOODLAND, ROUND SNA... | 0.611111 | 0.083333 | 0.055556 | 0.090909 | 1.090909 | 0.004630 | 1.008333 |
| 81604 | (POSTAGE) | (CUPCAKE LACE PAPER SET 6) | 0.611111 | 0.083333 | 0.055556 | 0.090909 | 1.090909 | 0.004630 | 1.008333 |


### Portugal

In [20]:
frq_items_por = fpgrowth(basket_Por, min_support = 0.05, use_colnames = True)
 
frq_items_por

| | support | itemsets |
|---|---|---|
| 0 | 0.517241 | (POSTAGE) |
| 1 | 0.206897 | (LUNCH BAG CARS BLUE) |
| 2 | 0.086207 | (LUNCH BAG WOODLAND) |
| 3 | 0.068966 | (VINTAGE PAISLEY STATIONERY SET) |
| 4 | 0.051724 | (LUNCH BAG SUKI DESIGN) |
| ... | ... | ... |
| 7441 | 0.086207 | (SET OF 10 LED DOLLY LIGHTS, POSTAGE) |
| 7442 | 0.051724 | (VINTAGE DOILY JUMBO BAG RED, JUMBO BAG PAISLE... |
| 7443 | 0.051724 | (VINTAGE DOILY JUMBO BAG RED, LUNCH BAG PAISLE... |
| 7444 | 0.051724 | (VINTAGE DOILY JUMBO BAG RED, JUMBO BAG PAISLE... |


The rules below show that, in Portugal, **JUMBO BAG PAISLEY PARK** and **LUNCH BAG PAISLEY PARK** tend to be bought together, based on the lift value.

In [21]:
# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items_por, metric ="lift", min_threshold = 1)
rules_por = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (LUNCH BAG CARS BLUE) | (LUNCH BAG RED RETROSPOT) | 0.206897 | 0.241379 | 0.137931 | 0.666667 | 2.761905 | 0.087990 | 2.275862 |
| 1 | (LUNCH BAG RED RETROSPOT) | (LUNCH BAG CARS BLUE) | 0.241379 | 0.206897 | 0.137931 | 0.571429 | 2.761905 | 0.087990 | 1.850575 |
| 2 | (LUNCH BAG CARS BLUE) | (RETROSPOT TEA SET CERAMIC 11 PC) | 0.206897 | 0.241379 | 0.086207 | 0.416667 | 1.726190 | 0.036266 | 1.300493 |
| 3 | (RETROSPOT TEA SET CERAMIC 11 PC) | (LUNCH BAG CARS BLUE) | 0.241379 | 0.206897 | 0.086207 | 0.357143 | 1.726190 | 0.036266 | 1.233716 |
| 4 | (LUNCH BAG CARS BLUE, LUNCH BAG RED RETROSPOT) | (RETROSPOT TEA SET CERAMIC 11 PC) | 0.137931 | 0.241379 | 0.068966 | 0.500000 | 2.071429 | 0.035672 | 1.517241 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335159 | (VINTAGE DOILY JUMBO BAG RED) | (JUMBO BAG PAISLEY PARK, LUNCH BAG PAISLEY PARK) | 0.051724 | 0.068966 | 0.051724 | 1.000000 | 14.500000 | 0.048157 | inf |
| 335160 | (JUMBO BAG PAISLEY PARK) | (VINTAGE DOILY JUMBO BAG RED, LUNCH BAG PAISLE... | 0.068966 | 0.051724 | 0.051724 | 0.750000 | 14.500000 | 0.048157 | 3.793103 |
| 335161 | (LUNCH BAG PAISLEY PARK) | (VINTAGE DOILY JUMBO BAG RED, JUMBO BAG PAISLE... | 0.086207 | 0.051724 | 0.051724 | 0.600000 | 11.600000 | 0.047265 | 2.370690 |
| 335162 | (POSTAGE) | (RABBIT NIGHT LIGHT) | 0.517241 | 0.051724 | 0.051724 | 0.100000 | 1.933333 | 0.024970 | 1.053640 |


## References

1. https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-python/
2. https://www.kaggle.com/sajidcse/market-basket-analysis#FP-GROWTH(Frequent-Pattern-Growth)

