Hello!!

Here we will perform some data mining for market basket analysis. The methods we will use are **Apriori** (to find frequent itemsets) and **Association Rule Mining**. We are using the provided the-bread-basket dataset and will carry out the task step by step. Let's do it.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Step 1 : Reading the Data**

First we read the data from the bread basket.csv file into the dataframe df using the read_csv() method.


In [None]:
#setting up coding environment with necessary imports...

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import apriori, association_rules 

#reading the dataset in the dataframe

df = pd.read_csv('../input/the-bread-basket/bread basket.csv')

#checking the first 10 rows of the data
df.head(10)

# **Step 2 : Understanding the Data**

Now that we have loaded the data, it's time to understand it.
* We will check the summary statistics using **df.describe()**
* We will also check the datatype of each column using **df.info()**

In [None]:
# getting the statistical summary of the data
df.describe()

In [None]:
#understanding the data and its data types
df.info()

# **Step 3 : Checking the Data for Missing Values**

Here we check the data for missing values in each column. The **missing_values_count** Series holds the number of missing values per column. The result shows no missing values in any column.


In [None]:
#get no. of missing data points per column if any

missing_values_count = df.isnull().sum()

#checking the missing values in all columns if any
missing_values_count[0:4]
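
As a small illustration of what `isnull().sum()` reports, here is a minimal sketch on a made-up dataframe. The name `toy` and its values are hypothetical and are not taken from the bread basket data, which has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Item value
toy = pd.DataFrame({
    "Transaction": [1, 1, 2],
    "Item": ["Bread", np.nan, "Coffee"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
toy_missing = toy.isnull().sum()
print(toy_missing)

# Total number of missing cells across the whole dataframe
total_missing = int(toy_missing.sum())
print(total_missing)
```

A non-zero count in any column would mean those rows need to be filled or dropped before mining.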

In [None]:
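# Note: groupby() returns a new GroupBy object and does not modify df in place,
# so this line has no effect on the dataframe displayed below
df.groupby(['Transaction'])
df
#df.head()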

# **Step 4 : Cleaning the Data**

Now we clean the data. First we check the distinct items, then we strip any leading and trailing whitespace from the item names.

In [None]:
df.Item.unique()

In [None]:
#removing trailing and preceding whitespaces in item column   
df['Item']=df['Item'].str.strip()

# **Step 4 (continued) : Data Selection & Transformation**

After cleaning the data, we perform data selection. Here we prepare the data in the form required by the frequent itemset and association rule mining process.

First, the data is grouped by transaction and item to count how many times each item occurs in each transaction.

After that, we create a pivot table with transactions as the index and items as the columns to map items to transactions. Items that are not in a particular transaction are filled with 0.

In [None]:
grouped_df = df.groupby(['Transaction','Item'])['Item'].count().reset_index(name='Count')

basket_df = grouped_df.pivot_table(index='Transaction', columns='Item', values='Count', aggfunc='sum').fillna(0)
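
To make the shape of the result easier to picture, here is a minimal sketch of the same group-then-pivot steps on a tiny made-up dataset. The `toy`, `toy_grouped` and `toy_basket` names and the transactions are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: transaction 1 contains Bread twice
toy = pd.DataFrame({
    "Transaction": [1, 1, 1, 2, 3, 3],
    "Item": ["Bread", "Bread", "Coffee", "Coffee", "Bread", "Tea"],
})

# Count how many times each item occurs within each transaction
toy_grouped = (
    toy.groupby(["Transaction", "Item"])["Item"]
    .count()
    .reset_index(name="Count")
)

# One row per transaction, one column per item; absent items become 0
toy_basket = toy_grouped.pivot_table(
    index="Transaction", columns="Item", values="Count", aggfunc="sum"
).fillna(0)

print(toy_basket)
```

Each row is one transaction, and the Bread cell for transaction 1 holds a count of 2, not just a presence flag.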

As some items may occur more than once in a transaction, we transform the counts into 0/1 flags using the hot encoding function defined below: any count of 1 or more becomes 1, and 0 stays 0. After this step the data is suitable for the mining functions used later.

In [None]:
# Defining the hot encoding function to make the data suitable for the concerned libraries
# Any count of 1 or more becomes 1; everything else becomes 0
def hot_encode(x):
    return 1 if x >= 1 else 0

In [None]:
# Encoding the dataset
# Note: applymap is deprecated in pandas 2.1+ in favor of DataFrame.map
basket_encoded = basket_df.applymap(hot_encode)
basket_df = basket_encoded

basket_df

In the table above, each column is a unique item. If an item is present in a transaction, the value is 1; otherwise it is 0. The transformed data is now ready for mining.
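
The same 0/1 table can also be produced with a single vectorized comparison instead of an element-wise function. The sketch below is only an illustration on a made-up count table: the `counts` dataframe and the `to_flag` helper are hypothetical stand-ins for the pivot table and encoding function used in this notebook.

```python
import pandas as pd

# Hypothetical count table: transaction 1 contains Bread twice
counts = pd.DataFrame(
    {"Bread": [2.0, 0.0, 1.0], "Coffee": [1.0, 1.0, 0.0]},
    index=pd.Index([1, 2, 3], name="Transaction"),
)

# Element-wise version: apply a small function to every cell
def to_flag(x):
    return 1 if x >= 1 else 0

flags_elementwise = counts.apply(lambda col: col.map(to_flag))

# Vectorized version: compare the whole table at once, then cast True/False to 1/0
flags_vectorized = (counts >= 1).astype(int)

print(flags_vectorized)

# Check that both approaches give the same values
print((flags_elementwise == flags_vectorized).all().all())
```

On a table with thousands of transactions the vectorized form is usually faster, but both give the same flags.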

# **Step 5 : Data Mining**

Here we perform data mining by applying the apriori and association_rules functions imported from the **mlxtend.frequent_patterns** module.
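
To give a feel for what the Apriori step does, here is a minimal pure-Python sketch of the level-wise idea on made-up transactions. This is not how mlxtend implements it. The `support` and `apriori_sketch` functions and the toy data are hypothetical and only meant for illustration.

```python
from itertools import combinations

# Hypothetical toy transactions (not from the bread basket data)
transactions = [
    {"Bread", "Coffee"},
    {"Bread", "Coffee", "Cake"},
    {"Coffee", "Cake"},
    {"Bread", "Tea"},
    {"Coffee"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def apriori_sketch(transactions, min_support):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) itemsets."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

toy_frequent = apriori_sketch(transactions, min_support=0.4)
for itemset, s in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

The key point is the pruning: an itemset can only be frequent if all of its subsets are frequent, so larger candidates are built only from itemsets that already passed the threshold.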

> We have set some parameters of the algorithm as follows

For Apriori : **Min support = 0.01**

For Association Rules : **Min. confidence = 0.25**
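
To make these two thresholds concrete, here is a minimal sketch that computes support, confidence and lift by hand for one rule on made-up transactions. The toy data, the `support` helper and the rule Cake -> Coffee are hypothetical. The formulas are the standard definitions: confidence is the support of both sides together divided by the support of the antecedent, and lift is confidence divided by the support of the consequent.

```python
# Hypothetical toy transactions (not from the bread basket data)
transactions = [
    {"Bread", "Coffee"},
    {"Bread", "Coffee", "Cake"},
    {"Coffee", "Cake"},
    {"Bread", "Tea"},
    {"Coffee"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Rule under inspection: Cake -> Coffee
antecedent, consequent = {"Cake"}, {"Coffee"}

# How often both sides of the rule appear together
rule_support = support(antecedent | consequent)
# Share of antecedent transactions that also contain the consequent
confidence = rule_support / support(antecedent)
# Confidence relative to how common the consequent is on its own
lift = confidence / support(consequent)

print(f"support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}")

# The thresholds used in this notebook, applied to the toy rule
passes = rule_support >= 0.01 and confidence >= 0.25
print(passes)
```

A lift above 1 suggests the two sides appear together more often than they would if they were independent.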



In [None]:
# Building the model with min support = 0.01 (1%)
frq_items = apriori(basket_df, min_support = 0.01, use_colnames = True)
frq_items

The results above show the frequent itemsets that meet the minimum support threshold.
The block below performs association rule mining with the minimum confidence threshold set. The output is collected in the variable *rules*.

In [None]:
# Collecting the inferred rules in a dataframe 
rules = association_rules(frq_items, metric ="confidence", min_threshold = 0.25) 
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False]) 
rules.reset_index()

# **Step 6 : Model Evaluation Plots**

Having performed association rule mining on the bread basket data, we now look at the relationships between the rule metrics (support, confidence and lift). The following plots show them.

In [None]:
# SUPPORT Vs CONFIDENCE

plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()

In [None]:
# SUPPORT Vs LIFT


plt.scatter(rules['support'], rules['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()

In [None]:
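# LIFT Vs CONFIDENCE

# Fit a straight line (degree-1 polynomial) to confidence as a function of lift
fit = np.polyfit(rules['lift'], rules['confidence'], 1)
fit_fn = np.poly1d(fit)
# Yellow dots are the rules; the line is the least-squares fit
plt.plot(rules['lift'], rules['confidence'], 'yo', rules['lift'],
         fit_fn(rules['lift']))
plt.xlabel('lift')
plt.ylabel('confidence')
plt.title('Lift vs Confidence')
plt.show()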
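
For readers new to `np.polyfit` and `np.poly1d`, here is a minimal sketch of what the fit returns, using made-up points that lie exactly on a known line. The `x` and `y` values are hypothetical and unrelated to the rules dataframe.

```python
import numpy as np

# Hypothetical points lying exactly on the line y = 0.1 * x + 0.2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.1 * x + 0.2

# A degree-1 polyfit returns the coefficients in the order [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# poly1d turns the coefficients into a callable line
line = np.poly1d([slope, intercept])

print(round(slope, 3), round(intercept, 3), round(line(5.0), 3))
```

On real rules the points will not lie on a line, so the fit is only a rough summary of the trend between lift and confidence.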

This concludes the Apriori and association rule mining task. To explore further, try different values of the algorithm parameters and compare the results.

Please share your valuable feedback.

Thanks in advance.