## Association Rule Mining in Python, Hands-on: Market Basket Analysis

<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fmarciaturner%2Ffiles%2F2018%2F01%2FWegmans-Produce-1.jpg" width="800px">

Let's take a detailed look at <b>Market Basket Analysis</b>. Retailers generally use it to find combinations of two or more items that customers are likely to buy together. So what is a market basket, and what does analyzing it mean?

### Importing Libraries

In [None]:
#Data manipulation libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

#Visualizations
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")
import squarify
import matplotlib

#for market basket analysis (using apriori)
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

#for preprocessing
from mlxtend.preprocessing import TransactionEncoder

### Importing data 

In [None]:
# Load market data here
data = 

In [None]:
#shape
data.shape

In [None]:
#one row is items purchased in a single transaction, for each item recorded in single column
#head of data
data.head()

#tail of data
data.tail()

As we can see, the data needs a lot of preprocessing, so let's preprocess it. We use the TransactionEncoder() class from mlxtend.preprocessing to do this work for us. TransactionEncoder() is an encoder class for transaction data stored in Python lists. It finds all the distinct products in the transactions and turns each transaction into a boolean array, where each index represents one product and the value (True or False) says whether that product was purchased in the transaction.

It needs a Python list of lists as input, where the outer list stores the n transactions and each inner list stores the items of one transaction.

It returns the one-hot encoded boolean array of the input transactions, where the columns represent the unique items found in the input array in alphabetic order. 
For further details, refer to its documentation: <br><a href="http://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/">TransactionEncoder</a>
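The encoding described above can be sketched in plain Python. This is only an illustration of the idea, not mlxtend's implementation, and the toy transactions are hypothetical data rather than the notebook's dataset.

```python
# Toy transactions (hypothetical data): a list of lists, one inner list per basket
transactions = [
    ["milk", "bread"],
    ["bread", "eggs", "butter"],
    ["milk"],
]

# Unique items in alphabetical order; these play the role of the encoded columns
columns = sorted({item for basket in transactions for item in basket})

# One boolean row per transaction: True if the item was purchased, else False
onehot = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in onehot:
    print(row)
```

Each row has one entry per distinct item, so the result can be loaded directly into a DataFrame with `columns` as the column names.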




In [None]:
#Converting into required format of TransactionEncoder()
transactions=[]
for i in range(0,len(data)):
    transactions.append([str(data.values[i,j]) for j in range(0,len(data.columns))])

transactions=np.array(transactions)

transactions

We will use the mlxtend package for the transaction encoder. <br>

Apriori is a popular algorithm for extracting frequent itemsets, with applications in association rule learning. <br>
An itemset is considered <i>"frequent"</i> if it meets a user-specified support threshold. <br>
We first need to convert the transactions into a one-hot encoded DataFrame.
[Refer to Example 1](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/)
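To make "support threshold" concrete, here is a small brute-force sketch that computes the support of every 1- and 2-item itemset and keeps the frequent ones. The helper `support`, the toy transactions, and the threshold of 0.5 are assumptions for illustration; Apriori itself avoids testing every combination.

```python
from itertools import combinations

# Toy transactions (hypothetical data)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

min_support = 0.5  # assumed threshold for this toy example
items = sorted(set().union(*transactions))

# Keep every itemset of length 1 or 2 whose support meets the threshold
frequent = {}
for k in (1, 2):
    for candidate in combinations(items, k):
        s = support(candidate, transactions)
        if s >= min_support:
            frequent[candidate] = s

for itemset, s in frequent.items():
    print(itemset, s)
```

Here {eggs, milk} appears in only one of four transactions, so it falls below the threshold and is not reported as frequent.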

In [None]:
# Import the transaction encoder function from mlxtend
from mlxtend.preprocessing import TransactionEncoder

# Instantiate transaction encoder and identify unique items
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame, remove the nan.
data = pd.DataFrame(onehot, columns = encoder.columns_,dtype=int).drop('nan', axis=1)

# Print the one-hot encoded transaction dataset
data.shape
data.head()

Now let's do some data visualizations.


### Data Visualizations


In [None]:
#Visualise the Top 20 items purchased frequently

r=data.sum(axis=0).sort_values(ascending=False)[:20]



We can see that mineral water is the most purchased item in the store, so we may advise that mineral water always be kept in stock.

In [None]:
# create a color palette, mapped to these values
my_values=r.values
cmap = matplotlib.cm.Blues
mini=min(my_values)
maxi=max(my_values)
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in my_values]

#treemap of top 20 frequent items
plt.figure(figsize=(10,10))
squarify.plot(sizes=r.values, label=r.index, alpha=.7,color=colors)
plt.title("Tree map of top 20 items")
plt.axis('off')

Let's now work on apriori to find frequent itemsets.

Refer to [Example 2](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/#example-2-selecting-and-filtering-results) - using Apriori to create frequent itemsets. 

In [None]:
#let us return itemsets with at least 5% support:
freq_items = apriori(data, min_support=0.05, use_colnames=True)

In [None]:
freq_items

This tells us that there are 27 frequent itemsets of different lengths, so the first step of the Apriori algorithm is finished.

In [None]:
#Now let's generate association rules with lift > 1.3
res = association_rules(freq_items, metric="lift", min_threshold=1.3)

In [None]:
res

Above we can see the 4 rules generated with lift greater than 1.3.

The intuition we get from confidence is that:<pre>
    32% of transactions containing chocolate also contain mineral water
    22% of transactions containing mineral water also contain chocolate
    34% of transactions containing spaghetti also contain mineral water
    25% of transactions containing mineral water also contain spaghetti
</pre>


The pair {spaghetti, mineral water} is a stronger association than {chocolate, mineral water}; we can judge how interesting each rule is by comparing the lift, leverage, and conviction of the two.
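The metrics mentioned here can all be derived from three support values. The sketch below uses the usual definitions of confidence, lift, leverage, and conviction for a rule A -> C; the function name `rule_metrics` and the support values are hypothetical and are not taken from the notebook's output.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Metrics for the rule A -> C, given support(A), support(C), support(A and C)."""
    confidence = support_ac / support_a            # P(C | A)
    lift = confidence / support_c                  # > 1 means A and C co-occur more than by chance
    leverage = support_ac - support_a * support_c  # difference from the independent case
    # Conviction is infinite when the rule never fails (confidence == 1)
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Hypothetical supports, chosen only to illustrate the arithmetic
metrics = rule_metrics(support_a=0.2, support_c=0.25, support_ac=0.08)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A lift above 1, a positive leverage, and a conviction above 1 all point the same way: the antecedent and consequent appear together more often than independence would predict.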

A rule is said to be interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it). It is a subjective measure.

### Selecting and Filtering the Itemsets

In [None]:
# generate frequent itemsets with 5% support.
frequent_itemsets = apriori(data, min_support=0.05, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

In [None]:
# filter the itemsets with length = 2 and support more than 10%

frequent_itemsets[(frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] > 0.1)]

In [None]:
# get the itemsets with length = 1 and support more than 10%

frequent_itemsets[(frequent_itemsets['length'] == 1) & (frequent_itemsets['support'] > 0.1)]

## FP GROWTH (Frequent Pattern Growth)



Shortcomings of the Apriori algorithm:

* Apriori needs to generate candidate itemsets. These candidates may be very numerous if the database contains many distinct items.
* Apriori needs multiple scans of the database to check the support of each generated itemset, which leads to high costs.

These shortcomings can be overcome using the FP-Growth algorithm.
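The two shortcomings can be made visible with a minimal level-wise sketch that counts how many candidates are tested and how many passes over the data are made. It is a plain-Python illustration under simplifying assumptions, not mlxtend's implementation; `apriori_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise search; returns (frequent itemsets, data scans, candidates tested)."""
    n = len(transactions)
    min_count = min_support * n
    # Level 1 candidates: every distinct item
    candidates = [frozenset([item]) for item in sorted(set().union(*transactions))]
    frequent, scans, candidates_tested = {}, 0, 0
    k = 1
    while candidates:
        scans += 1  # one full pass over the data per level
        candidates_tested += len(candidates)
        counts = {c: 0 for c in candidates}
        for basket in transactions:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt >= min_count}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-item subsets are all frequent
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, scans, candidates_tested

# Toy transactions (hypothetical data)
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
frequent, scans, tested = apriori_sketch(transactions, min_support=0.5)
print(frequent)
print("scans:", scans, "candidates tested:", tested)
```

Even on three transactions the search needs one pass per itemset length; on a real dataset both counters grow quickly as the support threshold is lowered.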

[Explanation of FP Growth](https://www.softwaretestinghelp.com/fp-growth-algorithm-data-mining/)

In [None]:
#Importing Libraries
from mlxtend.frequent_patterns import fpgrowth

In [None]:
#running the fpgrowth algorithm with at least 5% support:
res=fpgrowth(data,min_support=0.05,use_colnames=True)

In [None]:
res

In [None]:
res=association_rules(res,metric="lift",min_threshold=1.3)

In [None]:
res

We can observe from the lift that {spaghetti}->{mineral water} is the rule most likely to occur, which is the same result we got with Apriori.

## Apriori Vs FP Growth

Since FP-Growth doesn't require creating candidate sets explicitly, it can be orders of magnitude faster than the alternative Apriori algorithm. It can be about 5 times faster, though the gap depends on the data and the support threshold. Let's measure it.

In [None]:
import time
l=[0.01,0.02,0.03,0.04,0.05]
t=[]
for i in l:
    t1=time.time()
    apriori(data,min_support=i,use_colnames=True)
    t2=time.time()
    t.append((t2-t1)*1000)

In [None]:
l=[0.01,0.02,0.03,0.04,0.05]
f=[]
for i in l:
    t1=time.time()
    fpgrowth(data,min_support=i,use_colnames=True)
    t2=time.time()
    f.append((t2-t1)*1000)

In [None]:
sns.lineplot(x=l,y=f,label="fpgrowth")
sns.lineplot(x=l,y=t,label="apriori")
plt.xlabel("Min_support Threshold")
plt.ylabel("Run Time in ms")

The graph above compares the run time of apriori and fpgrowth across the minimum support thresholds.

## ECLAT ALGORITHM


Bottlenecks of Apriori:
<ul>
    <li>Candidate generation can result in huge candidate sets</li>
    <li>Multiple scans of the database: it needs (n+1) scans, where n is the length of the longest pattern</li>
</ul>

Eclat was introduced to solve some of the above problems.


The ECLAT algorithm stands for **Equivalence Class Clustering and bottom-up Lattice Traversal**. 
* It is one of the popular methods of Association Rule mining. It is a more efficient and scalable version of the Apriori algorithm. 
* While the Apriori algorithm works in a horizontal sense imitating the Breadth-First Search of a graph, the ECLAT algorithm works in a **vertical manner** just like the **Depth-First Search** of a graph. This vertical approach of the ECLAT algorithm makes it a faster algorithm than the Apriori algorithm.


How it works:

The basic idea is to use intersections of transaction ID sets (tidsets) to compute the support of a candidate, and to avoid generating subsets that do not exist in the prefix tree. In the first call of the function, all single items are used along with their tidsets. The function is then called recursively, and in each recursive call every item-tidset pair is verified and combined with other item-tidset pairs. This process continues until no candidate item-tidset pairs can be combined.


Example:

    t1={a,b,c}
    t2={a,b}
    t3={a}

Above is the horizontal layout, where t1, t2, t3 are transactions and a, b, c are products. Now let's convert it into the vertical layout:

    k=1,          min_support=0.5
    a={t1,t2,t3}, sup=1
    b={t1,t2},    sup=0.67
    c={t1},       sup=0.33


Now we eliminate c, as its support is below min_support, and then generate itemsets of length k=2:

    k=2,           min_support=0.5
    {a,b}={t1,t2}, sup=0.67

We can't generate any more itemsets, so the frequent itemsets are {a}, {b}, and {a,b}, and {a,b} is the only one of length 2.
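The tidset-intersection idea and the example above can be sketched as a short recursive function. This is a minimal illustration, not a reference implementation; the function name `eclat` and the dictionary-based input format are assumptions made for the sketch.

```python
def eclat(transactions, min_support):
    """Return {itemset: support} using tidset intersections, depth-first."""
    n = len(transactions)
    # Vertical layout: item -> set of transaction ids that contain it
    tidsets = {}
    for tid, basket in transactions.items():
        for item in basket:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs; each tidset is already
        # intersected with the tidset of the current prefix
        for i, (item, tids) in enumerate(candidates):
            support = len(tids) / n
            if support < min_support:
                continue  # infrequent, so no superset can be frequent
            itemset = prefix + (item,)
            frequent[itemset] = support
            # Intersect with later items to build longer itemsets
            suffix = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
            extend(itemset, suffix)

    extend((), sorted(tidsets.items(), key=lambda pair: pair[0]))
    return frequent

# The example from the text: horizontal layout keyed by transaction id
transactions = {"t1": {"a", "b", "c"}, "t2": {"a", "b"}, "t3": {"a"}}
result = eclat(transactions, min_support=0.5)
for itemset, support in result.items():
    print(itemset, round(support, 2))
```

Note that the support of each longer itemset comes only from the size of an intersected tidset; the original transactions are read once, when the vertical layout is built.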

<b>This method has an advantage over Apriori: it does not need to scan the database to find the support of (k+1)-itemsets, because the size of each tidset already gives the occurrence count of its itemset (and therefore its support). The bottleneck comes when there are many transactions, as the tidsets then take a lot of memory and intersecting them takes a lot of computation time.</b>

For further reference, see: <a href="https://www.geeksforgeeks.org/ml-eclat-algorithm/">Eclat Algo</a>


<b>Advantages over the Apriori algorithm:</b>
<ul>
<li>Memory Requirements: Since the ECLAT algorithm uses a Depth-First Search approach, it uses less memory than the Apriori algorithm.</li>
    <li>Speed: The ECLAT algorithm is typically faster than the Apriori algorithm.</li>
<li>Number of Computations: The ECLAT algorithm does not involve the repeated scanning of the data to compute the individual support values.</li>