# Market Basket Analysis

## Introduction

Whenever we go to a store to buy bread, we usually end up buying milk, butter or cheese as well.

And if you buy these groceries from a supermarket, chances are you have come across this weird thing:

    Bread and butter are kept in totally opposite corners of the supermarket. 

*Well this is not by accident.*



This is done to make you walk through the whole store for your butter or cheese, so that you see various other products on the way and may end up buying something you had not planned for.

There can be innumerable strategies like this. For example, someone who went to buy cereal may end up buying milk as well, simply because it was on the rack right next to the cereal. Finding associations is the key. All that remains then is to build strategies to maximize efficiency and revenue.

Market Basket Analysis is one of the most important stepping stones towards understanding how Netflix or Amazon build such world-class recommender systems.

Here we work on finding creative and out-of-the-box rules. Rules are nothing but associations between products, some imaginable and some totally out of the blue.

Maybe you will find that people who buy more tomato sauce also tend to buy more comic books. There may be no reasonable explanation at first, but after a good statistical analysis some relation might be found.

This is what Association Rule Learning aims to achieve: to give you insights and relations which you would otherwise never have thought of!

Of course, these rules need the **support** of a large number of transactions (hundreds at least) to be considered statistically significant.

So the more data you have access to, the better!




### Why should I care about these associations?

Supermarkets these days actually use data science to lay out their stores strategically and increase their sales.
The association between bread and butter is pretty obvious and simple, but there are thousands of products in a store and you never know which pair of products may have a high association.

For the sake of simplicity, assume there are 2 products P and Q.

If we can conclude that P and Q have a strong association, we can do the following:

*  Put both P and Q on the same shelf so that buyers of one item are prompted to buy the other.
*  Target advertisements for P at customers who buy Q more often.
*  Combine P and Q into a single product.
*  Apply buy 1 get 1 offers on P and Q for promotions.


### So have you ever wondered how Amazon shows "Customers who bought this also bought"??


![amazon-suggestion.png](amazon-suggestion.png)


### Well this is how..

There is an algorithm called **Apriori** that finds associations like these. Once you have found them, you can strategize on how to use them cleverly, for example to set up your stores.

This is a plus-plus for any business.

Apriori is a basic model. Amazon actually uses a more complex set of algorithms along with it to maximize revenue.


We will try implementing an Apriori model ourselves, but before that let's have a look at what the algorithm is:

# Basics of Association Rule Learning and the Apriori algorithm

We need to learn a bit of jargon to proceed further.


**Support** is the relative frequency of an item (or a set of items): the fraction of all transactions in the dataset that contain it. It basically says how popular the item is in the transactions dataset.

In many cases we want a high support, to make sure the relationship is backed by enough transactions.
However, there may be cases where a low support is useful, for example when looking for **“unusual”** relationships.

**Confidence** is a measure of the reliability of a rule. For a rule P -> Q it simply asks how likely item Q is to be purchased given that item P is purchased, i.e. the support of P and Q together divided by the support of P.

One issue with confidence is that it can misrepresent the importance of an association: if Q is very popular on its own, the confidence of P -> Q will be high even when P and Q have nothing to do with each other. You need to be careful with this parameter. I will come back to it when we do parameter selection.

**Lift** is the ratio of the observed support of P and Q together to the support expected if P and Q were independent. In layman's terms it tells how likely item Q is to be purchased when item P is purchased, while controlling for how popular item Q is. A lift above 1 means Q is more likely to be bought when P is bought, and a lift below 1 means it is less likely.

The Apriori principle simply states that if an itemset does not occur frequently, then all its supersets (larger itemsets that contain it) cannot occur frequently either. Equivalently, every subset of a frequent itemset must itself be frequent.

Here an itemset is any collection of items that were bought together in a transaction.

![formula.png](formula.png)
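To make the three definitions concrete, here is a small sketch in plain Python. The five toy transactions and the helper names `support`, `confidence` and `lift` are made up for illustration; they are not part of the tutorial's dataset or of any library.

```python
# Toy transactions, invented for this example only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(p, q, transactions):
    """Support of p and q together, divided by the support of p (rule p -> q)."""
    return support(set(p) | set(q), transactions) / support(p, transactions)

def lift(p, q, transactions):
    """Confidence of p -> q divided by the support of q."""
    return confidence(p, q, transactions) / support(q, transactions)

# Bread appears in 4 of 5 baskets, butter in 3, both together in 3.
print(round(support({"bread", "butter"}, transactions), 2))       # 0.6
print(round(confidence({"bread"}, {"butter"}, transactions), 2))  # 0.75
print(round(lift({"bread"}, {"butter"}, transactions), 2))        # 1.25
```

A lift of 1.25 here says that butter is bought about 25% more often in baskets containing bread than in baskets overall.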

**The algorithm is:**

1. Set a minimum value for support and confidence.

2. Take all itemsets in the transactions that have a support higher than the minimum support we chose.

3. From these itemsets, take all rules that have a confidence higher than the minimum confidence we chose.

4. Sort the rules by decreasing lift.
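The four steps can be sketched in a few lines of plain Python. This is a minimal, unoptimised illustration and not how apyori is implemented internally; the function names `frequent_itemsets` and `generate_rules` and the toy transactions are assumptions made up for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: a candidate of size k is only counted if all of
    its subsets of size k-1 were frequent (the Apriori principle)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for cand in candidates:
            sup = sum(cand <= t for t in transactions) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Build candidates of size k+1 from pairs of frequent k-itemsets,
        # pruning any candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in level for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
    return frequent

def generate_rules(frequent, min_confidence):
    """Split each frequent itemset into base -> add, keep the confident
    rules and return them sorted by decreasing lift."""
    found = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for base in combinations(itemset, size):
                base = frozenset(base)
                add = itemset - base
                conf = sup / frequent[base]
                if conf >= min_confidence:
                    found.append((set(base), set(add), sup, conf, conf / frequent[add]))
    return sorted(found, key=lambda rule: rule[4], reverse=True)

# Toy transactions, invented for this example only.
toy = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]
found = frequent_itemsets(toy, min_support=0.4)    # steps 1 and 2
ranked = generate_rules(found, min_confidence=0.7) # steps 3 and 4
for base, add, sup, conf, lift_value in ranked:
    print(base, "->", add, round(sup, 2), round(conf, 2), round(lift_value, 2))
```

In this toy run the itemset {bread, butter, jam} is never counted, because its subset {butter, jam} is already below the minimum support. That pruning is what keeps Apriori manageable on stores with thousands of products.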

#### It will get pretty clear once we start coding!

Let's start by importing whatever we need to build these associations.

```python
# We will start by importing the basic libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

## Dataset

The customer cart (basket) data we need is taken from the Udemy course Machine Learning A-Z, taught by Kirill Eremenko.

We have to use **"header = None"** while reading the dataset so that the first row doesn't become the column headers.
Let's load the dataset into a pandas dataframe.

```python
# Data Preprocessing
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
```

Let's have a look at what kind of data we are dealing with, using the dataframe's head() method.

```python
dataset.head()
```

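| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |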


## Description about the dataset

The dataset is a collection of the final checkout carts, over 1 week, of the last 7500 or so customers of a store (7501 rows in the file).

- Each row represents the items taken out by one customer. For example, row 1 represents the 2nd customer, who took out **burgers, meatballs and eggs** from the store.

- The columns have no significance, as there is no limit to how many items a customer can buy.

- NaN values just fill up the empty cells. For example, the 2nd customer bought only 3 items while the 1st customer bought 20. NaN values are introduced to keep the whole dataset in tabular form.

**Basically each row is 1 complete customer transaction at the checkout counter.**

Every time a customer comes for checkout and bill payment, his/her items are stored in the database row-wise.

Now, to run the Apriori algorithm we need a library called **apyori** (a single file, **apyori.py**, that implements Apriori-based association rule learning). It is an open-source project hosted on GitHub.

You can download the file from here:

https://github.com/ymoch/apyori/blob/master/apyori.py

Make sure you save **this file**, **the dataset** and **this script** in the same folder, so that the import works without extra setup.

We need to give this dataset as input to the apyori model, but the model expects its input as a list of lists.

So we need to do some data pre-processing: convert the dataset into transactions, a list of lists.

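```python
# Data Preprocessing: one list of item strings per transaction (row).
# Note: str() turns empty cells (NaN) into the string 'nan', so 'nan'
# can later show up as if it were a product.
n_rows, n_cols = dataset.shape
transactionset = []
for i in range(n_rows):
    transactionset.append([str(dataset.values[i, j]) for j in range(n_cols)])
```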
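Because the conversion above keeps the empty cells as the string 'nan', one possible variation is to drop missing cells while building the transactions. The sketch below shows the idea on two hand-written rows; the helper name `clean_row` is made up for this example. With this variant, rules containing 'nan' should no longer appear in the output.

```python
import math

def clean_row(row):
    """Return the items of one row as strings, skipping missing cells."""
    items = []
    for cell in row:
        if cell is None:
            continue
        if isinstance(cell, float) and math.isnan(cell):
            continue
        items.append(str(cell))
    return items

# With the dataframe loaded above, this could be used as:
# transactionset = [clean_row(row) for row in dataset.values.tolist()]

# Two hand-written rows that mimic the padded layout of the dataset.
example_rows = [
    ["burgers", "meatballs", "eggs", float("nan")],
    ["chutney", float("nan"), float("nan"), float("nan")],
]
print([clean_row(row) for row in example_rows])
# [['burgers', 'meatballs', 'eggs'], ['chutney']]
```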

# Support - Confidence - Lift

The next step is to decide the minimum support, minimum confidence and minimum lift for the model. These parameters determine what counts as an association. More specifically, assume there are 2 products A and B: if someone in the population likes A, what is the chance that the same person will also like B? The minimum values set the threshold for deciding whether a particular association is legitimate or not.

These parameters vary from problem to problem. They depend on:

1. Nature of the dataset
2. Problem Description
3. Number of entries in a dataset.

So we train the Apriori model by giving it the parameters and the transactions as input, and obtain a list of rules as output for our problem.

You can try these rules for a certain period of time. If you don't see an impact on your revenue, change the parameters, test the new rules, and keep experimenting till you find the strongest rules to maximize your revenue.

```python
# Training Apriori on the dataset
from apyori import apriori
rules = apriori(transactionset, min_support = 0.004, min_confidence = 0.2, min_lift = 3, min_length = 2)
```

# How to choose the parameters?

If you have gone through the algorithm description above, you now know what support, confidence and lift are.

These values for parameters depend on your business goals and if you are not satisfied with your rules you can change them again!

In this case we want to find strong rules about items that are bought at least 3 or 4 times a day, since associating those items together would eventually increase sales.

The 7500 transactions were recorded over a week, so a product purchased 4 times daily is purchased 4*7 = 28 times a week.

Hence **min_support** = 28/7500 ≈ 0.0037, which we round to 0.004.
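The same arithmetic as a tiny snippet, so you can plug in your own numbers. The variable names are made up for this example; the values are the ones used in the text.

```python
purchases_per_day = 4   # "bought at least 3 or 4 times a day"
days_recorded = 7       # the data covers one week
n_transactions = 7500   # approximate number of transactions

min_support = purchases_per_day * days_recorded / n_transactions
print(round(min_support, 4))  # 0.0037, rounded to 0.004 in this tutorial
```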

Now **min_confidence** is the tricky one. If I keep **min_confidence = 0.9 or 0.8**, my rules have to be correct **at least 80-90% of the time**. That sounds good, but it **is not a good indicator of association**: the rules that survive such a high threshold tend to involve products that are frequently purchased on their own, so they end up in the same basket because each is popular and not because they are associated with each other.

So again, use trial and error and see what works with your dataset. For these 7500 transactions a **20% confidence seemed to work nicely.**

80% is generally the default. Ideally you should keep dividing it by 2 until you get satisfying rules. **min_lift** is set to 3 so that we get relevant rules only.

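**min_length** is meant to set the minimum number of products you want in a rule. We want at least 2. Note that apyori's docstring lists min_support, min_confidence, min_lift and max_length as keyword arguments but no min_length, so this argument is probably ignored by the library.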
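The "keep dividing by 2" advice for min_confidence can be written as a small loop. This is only a sketch: `halve_until_enough`, its parameters and the fake miner used to demonstrate it are made up for this example, and what counts as "enough rules" is your own call.

```python
def halve_until_enough(mine_rules, start=0.8, min_rules=5, floor=0.05):
    """Halve min_confidence until `mine_rules` returns at least `min_rules` rules.

    `mine_rules` is any function that takes a confidence threshold and
    returns a list of rules.
    """
    conf = start
    while conf >= floor:
        found = mine_rules(conf)
        if len(found) >= min_rules:
            return conf, found
        conf /= 2
    return None, []

# With apyori and the transactions built above, the miner could look like:
# mine = lambda c: list(apriori(transactionset, min_support=0.004,
#                               min_confidence=c, min_lift=3))

# Stand-in miner with invented confidences, so the sketch runs on its own.
fake_confidences = [0.9, 0.5, 0.3, 0.25, 0.22, 0.21]

def fake_miner(min_confidence):
    return [c for c in fake_confidences if c >= min_confidence]

chosen, found_rules = halve_until_enough(fake_miner, start=0.8, min_rules=5)
print(chosen, len(found_rules))  # 0.2 6
```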

```python
# Collecting the rules into a list (apriori returns a generator)
results = list(rules)
```


The results will now get printed along with the support for each association.

apyori returns the rules in the order in which it finds them (smaller itemsets first) and does not sort them by relevance, so sort by lift yourself if you want the strongest rules on top.

We print here the first 1/4th of the association results to have a quick look.

**RULE**  --> which products are associated.

**SUPPORT**  --> popularity of that rule.

```python
for i in range(0, int(len(results)/4)):
    print('RULE:\t' + str(list(results[i][0])) + '\nSUPPORT:\t' + str(results[i][1]))
```

```
RULE:	['light cream', 'chicken']
SUPPORT:	0.004532728969470737
RULE:	['mushroom cream sauce', 'escalope']
SUPPORT:	0.005732568990801226
RULE:	['pasta', 'escalope']
SUPPORT:	0.005865884548726837
RULE:	['herb & pepper', 'ground beef']
SUPPORT:	0.015997866951073192
RULE:	['ground beef', 'tomato sauce']
SUPPORT:	0.005332622317024397
RULE:	['olive oil', 'whole wheat pasta']
SUPPORT:	0.007998933475536596
RULE:	['pasta', 'shrimp']
SUPPORT:	0.005065991201173177
RULE:	['nan', 'light cream', 'chicken']
SUPPORT:	0.004532728969470737
RULE:	['chocolate', 'frozen vegetables', 'shrimp']
SUPPORT:	0.005332622317024397
RULE:	['spaghetti', 'cooking oil', 'ground beef']
SUPPORT:	0.004799360085321957
RULE:	['herb & pepper', 'ground beef', 'eggs']
SUPPORT:	0.0041327822956939075
RULE:	['mushroom cream sauce', 'escalope', 'nan']
SUPPORT:	0.005732568990801226
RULE:	['pasta', 'escalope', 'nan']
SUPPORT:	0.005865884548726837
RULE:	['frozen vegetables', 'spaghetti', 'ground beef']
SUPPORT:	0.008665511265164644
```


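So, for example, the first association listed is between chicken and light cream, with a support of about 0.0045, which kind of makes sense. Note that 'nan' shows up in some rules: it is not a product but the placeholder for empty cells, which were converted to the string 'nan' during pre-processing.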
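The loop above prints only the itemset and its support. Each apyori result also carries an `ordered_statistics` list whose entries have `items_base`, `items_add`, `confidence` and `lift` fields, so the rules can be flattened and sorted by lift. The sketch below uses hand-made stand-in records with the same field names and invented numbers, so that it runs without the library or the dataset; `rules_by_lift` is a made-up helper name.

```python
from collections import namedtuple

# Stand-ins mirroring the field names of apyori's result records.
# The items and numbers below are invented for illustration.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

def rules_by_lift(records):
    """Flatten records into (base, add, support, confidence, lift) rows,
    highest lift first."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append((
                sorted(stat.items_base),
                sorted(stat.items_add),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return sorted(rows, key=lambda row: row[4], reverse=True)

fake_results = [
    RelationRecord(frozenset({"a", "b"}), 0.01,
                   [OrderedStatistic(frozenset({"a"}), frozenset({"b"}), 0.30, 3.2)]),
    RelationRecord(frozenset({"c", "d"}), 0.02,
                   [OrderedStatistic(frozenset({"c"}), frozenset({"d"}), 0.25, 4.1)]),
]

for base, add, sup, conf, lift_value in rules_by_lift(fake_results):
    print(f"{base} -> {add}  support={sup}  confidence={conf}  lift={lift_value}")
```

On the real output you would call `rules_by_lift(results)` instead of passing `fake_results`.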

# Summary

In short, we have simply done the following:

1. Used a dataset with the transaction history of about 7500 customers over 1 week.
2. Used the dataset to build a program that finds associations between different products.
3. Made sure these associations are relevant.
4. Looked at the resulting association rules and their support.


I hope this basic tutorial helped you understand how to find key associations that often go beyond common sense, and how you can use the rules you have found to build strategies, take business decisions and maximize sales and efficiency.

Do try the code yourself, experiment on other datasets as well, and figure out how you can improve your associations.

Some links where you can find similar datasets:

1. https://www.kaggle.com/c/instacart-market-basket-analysis/data  
2. http://fimi.ua.ac.be/data/ 
3. http://archive.ics.uci.edu/ml/datasets.html




You can connect with me on LinkedIn through this link - https://www.linkedin.com/in/rishabh-baid-8b04a6133/

