# Association Mining (Market Basket Analysis) Using Books Dataset

Save CharlesBookClub.csv in the same folder as this notebook (your Python working directory) so that the code below can read it.

Columns in the dataset:

'Seq#', 'ID#', 'Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',
 'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',
 'ItalCook', 'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase',
 'Mcode', 'Rcode', 'Fcode', 'Yes_Florence', 'No_Florence'

## Step 1. Import required libraries

In [None]:
# If the mlxtend package is not installed in your Python environment, uncomment the next line (remove the '#') and run it
#!pip install mlxtend 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from mlxtend.frequent_patterns import apriori              # Extracts frequent itemsets for association rule mining
from mlxtend.frequent_patterns import association_rules    # Generates association rules from frequent itemsets

In [None]:
# code to suppress warnings

import warnings
warnings.filterwarnings('ignore')

## Step 2. Load and explore the data

In [None]:
# Read dataset from csv file to a dataframe

books_df = pd.read_csv('CharlesBookClub.csv')

In [None]:
# Check how the data looks; head() prints the first 5 rows by default

books_df.head()

In [None]:
# Use the attribute 'columns' to list the columns in the dataframe

books_df.columns

In [None]:
# Similarly, use the method 'info()' to check the data type and non-null count of each column

books_df.info()

## Step 3. Preprocess the data

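After exploring the data, we need to convert it into the format that the Apriori algorithm expects.

First we select the columns of interest, then we generate a binary incidence matrix: one row per transaction, one column per book category, with a 1 if that category was purchased and a 0 otherwise.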
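As a minimal sketch of what a binary incidence matrix looks like, the toy example below turns purchase counts into purchased / not purchased flags. The data and the names `toy_counts` and `toy_incidence` are made up for illustration and do not come from CharlesBookClub.csv; a vectorized comparison is used here as one possible way to do the conversion.

```python
import pandas as pd

# Hypothetical toy data: number of books bought per category and customer
toy_counts = pd.DataFrame(
    {"ChildBks": [0, 2, 1], "CookBks": [3, 0, 1], "ArtBks": [0, 0, 4]},
    index=pd.Index([1, 2, 3], name="Seq#"),
)

# A count greater than zero means the category was purchased at least once
toy_incidence = toy_counts > 0

# Show the flags as 0/1 for readability
print(toy_incidence.astype(int))
```

Each row is one transaction and each column one item, which is the layout that frequent-itemset functions typically expect.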

In [None]:
# Selecting the columns that are of interest and would be used to generate the binary incidence matrix

books_matrix = books_df[['Seq#', 'ChildBks','YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks','ItalCook', 'ItalAtlas', 'ItalArt', 'Florence']]

In [None]:
# Check if the new dataframe contains the desired columns with their corresponding values

books_matrix.head()

In [None]:
# Set the column 'Seq#' as the index of the dataframe used to generate the binary incidence matrix
# Reassigning the result (instead of inplace = True) avoids modifying a slice of books_df

books_matrix = books_matrix.set_index('Seq#')

In [None]:
# To apply the Apriori algorithm, convert the purchase counts into a binary incidence matrix

# The following function replaces the number of books purchased with 1 if at least one book of that type was purchased, and 0 otherwise
def encode_units(x):
    if x == 0:
        return 0
    elif x > 0:
        return 1
    # Negative or missing counts fall through and return None, which shows up as a null value in the check further down

# The binary incidence matrix records, for every transaction, which types of books were purchased
# Note: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map
books_incidencematrix = books_matrix[['ChildBks','YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks','ItalCook', 'ItalAtlas', 'ItalArt', 'Florence']].applymap(encode_units)

In [None]:
# Check first few rows of the binary incidence matrix to verify it is correctly created

books_incidencematrix.head()

In [None]:
# Further check: a correctly created binary incidence matrix contains no null values, so every count below should be 0

books_incidencematrix.isnull().sum()

## Step 4. Apply Apriori Algorithm

In [None]:
# Create frequent itemsets using the apriori function imported from mlxtend.frequent_patterns

# Use a minimum support of 400/4000 = 10%, i.e., min_support = 0.1
itemsets = apriori(books_incidencematrix, min_support = 0.1, use_colnames = True)

In [None]:
#Check the contents of itemsets

itemsets
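To make the meaning of the minimum support concrete, here is a small pure-Python sketch that computes support by brute force. The transactions, the `support` helper, and the threshold of 0.4 are assumptions chosen for illustration; they are not taken from CharlesBookClub.csv, and this is not how mlxtend implements Apriori internally.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from CharlesBookClub.csv)
transactions = [
    {"CookBks", "ChildBks"},
    {"CookBks"},
    {"ChildBks", "ArtBks"},
    {"CookBks", "ChildBks", "ArtBks"},
    {"GeogBks"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.4  # illustrative threshold for this toy data
items = sorted(set().union(*transactions))

# Brute-force enumeration of 1- and 2-item sets. Apriori avoids much of this
# work by skipping supersets of itemsets that are already infrequent.
frequent = {}
for size in (1, 2):
    for combo in combinations(items, size):
        s = support(set(combo), transactions)
        if s >= min_support:
            frequent[frozenset(combo)] = s

for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

With 4000 transactions, a minimum support of 0.1 plays the same role: an itemset must appear in at least 400 transactions to be kept.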

## Step 5. Generate the association rules

In [None]:
# To generate the association rules, use the association_rules function with metric = 'confidence' and min_threshold = 0.5

rules_confidence = association_rules(itemsets, metric = 'confidence', min_threshold = 0.5)

# Drop the columns that are not needed here; drop() returns a new dataframe, so the result is assigned back
rules_confidence = rules_confidence.drop(columns = ['antecedent support','consequent support','conviction','leverage'])

# Sort the rules (minimum confidence = 0.5) by lift, highest first
rules_confidence.sort_values(by=['lift'], ascending = False).head()


In [None]:
# Check how many rules have been formed from the itemsets based on the desired metric and threshold i.e., minimum confidence 0.5

rules_confidence.shape # The attribute shape returns the number of rows and columns in rules_confidence, where number of rules = number of rows
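As a rough illustration of what the confidence threshold checks, the sketch below computes the confidence of a single rule A -> B from transaction counts. All counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical counts for illustration (not taken from CharlesBookClub.csv)
n_transactions = 4000
n_antecedent = 1000   # transactions that contain the antecedent A
n_both = 600          # transactions that contain both A and the consequent B

support_a = n_antecedent / n_transactions
support_ab = n_both / n_transactions

# confidence(A -> B) = support(A and B) / support(A)
confidence = support_ab / support_a

# A rule is kept when its confidence reaches the chosen minimum
min_confidence = 0.5
keep_rule = confidence >= min_confidence

print(f"confidence = {confidence:.2f}, kept = {keep_rule}")
```

Read this as: among the transactions that contain A, the stated fraction also contain B.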

In [None]:
# We can also generate the rules using the association_rules function with metric = 'lift' and min_threshold = 1

rules_lift = association_rules(itemsets, metric = 'lift', min_threshold = 1)

# Sort the rules (minimum lift = 1) by lift, highest first
rules_lift.sort_values(by=['lift'], ascending = False).head()

In [None]:
# Check how many rules have been formed from the itemsets based on the desired metric and threshold, i.e., minimum lift = 1

rules_lift.shape # number of rules = number of rows
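The following sketch shows how lift relates to the supports of a rule's two sides, using made-up support values. The `lift` helper is a hypothetical name introduced here for illustration, not an mlxtend function.

```python
def lift(support_a, support_b, support_ab):
    """lift(A -> B) = support(A and B) / (support(A) * support(B))."""
    return support_ab / (support_a * support_b)

# Hypothetical supports for illustration
support_a, support_b, support_ab = 0.25, 0.40, 0.15

lift_ab = lift(support_a, support_b, support_ab)
lift_ba = lift(support_b, support_a, support_ab)

# Confidence depends on the direction of the rule, lift does not
conf_ab = support_ab / support_a
conf_ba = support_ab / support_b

print(f"lift(A -> B) = {lift_ab:.2f}, lift(B -> A) = {lift_ba:.2f}")
print(f"conf(A -> B) = {conf_ab:.3f}, conf(B -> A) = {conf_ba:.3f}")
```

A lift above 1 means A and B occur together more often than they would if they were independent. Because lift is symmetric, filtering on lift alone tends to keep a rule together with its reverse, which is one reason a lift-based rule set can be larger than a confidence-based one.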

In [None]:
# Top 25 rules by lift (all rules are shown if there are fewer than 25)

print(rules_lift.sort_values(by=['lift'], ascending = False).head(25))

In [None]:
# Rules that satisfy confidence >= 0.1

rules = association_rules(itemsets, metric = 'confidence', min_threshold = 0.1)

rules.drop(columns=['antecedent support','consequent support','conviction','leverage'], inplace=True)

In [None]:
# Create a new column containing the number of items in the antecedent

rules['len'] = rules['antecedents'].apply(len)

In [None]:
# Rules that satisfy: 1. at least 2 items in the antecedent, 2. confidence > 0.5, and 3. lift > 1

rules[(rules['len']>=2)&(rules['confidence']>0.5)&(rules['lift']>1)]
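The same kind of filter can be tried on a tiny hand-made rules table. The rows below are invented for illustration and only mimic the column names used above; they are not results from the dataset.

```python
import pandas as pd

# Hypothetical rules table with the same column names as above
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"CookBks"}),
        frozenset({"CookBks", "ChildBks"}),
        frozenset({"ArtBks", "GeogBks"}),
    ],
    "consequents": [
        frozenset({"ChildBks"}),
        frozenset({"YouthBks"}),
        frozenset({"CookBks"}),
    ],
    "confidence": [0.7, 0.6, 0.4],
    "lift": [1.3, 1.8, 1.1],
})

# Number of items in each antecedent
toy_rules["len"] = toy_rules["antecedents"].apply(len)

# Each condition needs its own parentheses because & binds more tightly
# than the comparison operators
mask = (toy_rules["len"] >= 2) & (toy_rules["confidence"] > 0.5) & (toy_rules["lift"] > 1)
selected = toy_rules[mask]

print(selected)
```

Only the second row passes all three conditions: the first has a single-item antecedent and the third falls short on confidence.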

## References:

https://rasbt.github.io/mlxtend/

https://medium.com/@jihargifari/how-to-perform-market-basket-analysis-in-python-bd00b745b106

https://analyticsindiamag.com/hands-on-guide-to-market-basket-analysis-with-python-codes/