Hello there!

I am going to perform Market Basket Analysis using association rule mining on the Groceries dataset, which is available [here](https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/groceries.csv).

Before jumping into the code, a quick question: have you ever wondered how you end up buying items that were not on your shopping list while shopping at a market or Dmart? It happens because the chances of buying certain products are related. When you buy a loaf of bread, you see butter on the next shelf. These items are not the same, but the relationship between them tends to increase the chance of buying them together. Through association rule mining, we are going to determine the strength of the relationship between two or more different items using the Apriori algorithm.

**Applications of Association Rule Mining**
Association rule mining has various applications in machine learning and data mining. Below are some popular ones:

*Market Basket Analysis:* It is one of the popular examples and applications of association rule mining. This technique is commonly used by big retailers to determine the association between items.

*Medical Diagnosis:* Association rules can support diagnosis by helping to identify the probability of an illness for a particular patient, which in turn helps in choosing a treatment.

*Protein Sequence:* Association rules help in determining the synthesis of artificial proteins.

It is also used for *Catalog Design*, *Loss-leader Analysis* and many other applications.

**Apriori Algorithm**
This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that are likely to be bought together. It can also be used in the healthcare field to find drug reactions for patients.

The Apriori algorithm first finds the frequent itemsets, i.e. those that occur at least as often as a minimum support threshold, and then generates association rules from them, pruning the rules that fail the minimum support and minimum confidence thresholds.
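
The level-wise search described above can be sketched in plain Python. This is a minimal illustration, not the `mlxtend` implementation used later: the function name `frequent_itemsets` and the toy baskets are made up for the example, and support is taken to be the fraction of transactions that contain an itemset.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy data, not taken from the groceries file.
toy = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]
result = frequent_itemsets(toy, min_support=0.6)
```

With a threshold of 0.6, only bread, butter, milk and the pair {bread, butter} survive. The pairs containing jam are never counted at all, because jam is already infrequent on its own; that early pruning is what keeps Apriori tractable.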

Before we proceed with market basket analysis, there are some terms we need to know.

**Support:** Support indicates how frequently an item or itemset appears in the dataset, i.e. the fraction of transactions that contain it.

**Confidence:** Given two products X and Y and the rule X (antecedent) => Y (consequent), confidence is the percentage of transactions containing X that also contain Y. It indicates how often the rule proves correct.

**Lift:** Lift is the ratio of the rule's confidence to the support of Y, and it indicates whether the two products X and Y are dependent on each other. A lift greater than 1 means that Y is more likely to be bought when X is bought, while a lift less than 1 means that Y is less likely to be bought when X is bought. A lift of exactly 1 indicates no association between the products.
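
To make the three definitions concrete, here is a small sketch that computes them by hand. The helper names and the four toy baskets are assumptions made for the example; they are not part of the notebook or of `mlxtend`.

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    # Of the transactions containing X, the share that also contain Y.
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

def lift(transactions, X, Y):
    # Confidence of X => Y relative to how common Y is on its own.
    return confidence(transactions, X, Y) / support(transactions, Y)

# Hypothetical toy baskets.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["milk"],
]

s_bread = support(baskets, ["bread"])               # 3 of 4 baskets
c_rule = confidence(baskets, ["bread"], ["butter"])  # 2 of the 3 bread baskets
l_rule = lift(baskets, ["bread"], ["butter"])        # (2/3) / (2/4)
```

Here the rule bread => butter has a lift of about 1.33, so butter appears more often in baskets with bread than in baskets overall.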

**Problem Statement**
1. Read the CSV file into a data frame. Get the shape.
2. Find the top 20 “sold items” that occur in the dataset.
3. Find how much of the total sales they account for.
4. Create a function prune_dataset, which will help us reduce the size of our dataset based on our requirements. 
5. Implement Apriori Algorithm and create rules.

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from collections import Counter

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
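# 1. Read the CSV file into a data frame and get the shape.
# sep='\n' keeps every line (one transaction) as a single column.
# Note: recent pandas releases may reject a newline separator.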
data = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv',sep='\n',header=None)
data.head(2)

Unnamed: 0,0
0,"citrus fruit,semi-finished bread,margarine,rea..."
1,"tropical fruit,yogurt,coffee"


In [62]:
data.shape

(9835, 1)

In [None]:
data.values

array([['citrus fruit,semi-finished bread,margarine,ready soups'],
       ['tropical fruit,yogurt,coffee'],
       ['whole milk'],
       ...,
       ['chicken,citrus fruit,other vegetables,butter,yogurt,frozen dessert,domestic eggs,rolls/buns,rum,cling film/bags'],
       ['semi-finished bread,bottled water,soda,bottled beer'],
       ['chicken,tropical fruit,other vegetables,vinegar,shopping bags']],
      dtype=object)

In [None]:
# 2-3. Find the top 20 sold items and the share of total sales they account for.
# Each row of data.values holds one transaction as a single comma-separated string.
transactions = []
for T in data.values:
    transactions.append(T.tolist())

In [None]:
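# Split each transaction string into a list of item names.
itemset = [itemset[0].split(",") for itemset in transactions]
# Flatten all transactions and count how often each item was sold.
item_name = [item for T in itemset for item in T]
item_counter = Counter(item_name)
items = item_counter.keys(); itemcounts = item_counter.values()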

In [None]:
df1 = pd.DataFrame(items, columns =['item_name'])
df2 = pd.DataFrame(itemcounts, columns=['item_count'])
goc_df = pd.concat([df1,df2],axis=1)

In [None]:
goc_df['item_perc'] = goc_df.item_count/goc_df.item_count.sum()
goc_df.sort_values(by='item_count', ascending=False, inplace=True)

In [64]:
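# Note: the output below was captured after the cumulative total_perc column
# (computed in the next cell) had been added; only the first 10 rows are shown.
goc_df.head(20)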

Unnamed: 0,item_name,item_count,item_perc,total_perc
7,whole milk,2513,0.057947,0.057947
11,other vegetables,1903,0.043881,0.101829
17,rolls/buns,1809,0.041714,0.143542
31,soda,1715,0.039546,0.183089
5,yogurt,1372,0.031637,0.214725
24,bottled water,1087,0.025065,0.239791
42,root vegetables,1072,0.024719,0.26451
4,tropical fruit,1032,0.023797,0.288307
52,shopping bags,969,0.022344,0.310651
50,sausage,924,0.021307,0.331957


In [None]:
goc_df['total_perc'] = goc_df['item_perc'].cumsum()
top20_trans = goc_df.head(20)
top20_trans.reset_index(drop=True)

Unnamed: 0,item_name,item_count,item_perc,total_perc
0,whole milk,2513,0.057947,0.057947
1,other vegetables,1903,0.043881,0.101829
2,rolls/buns,1809,0.041714,0.143542
3,soda,1715,0.039546,0.183089
4,yogurt,1372,0.031637,0.214725
5,bottled water,1087,0.025065,0.239791
6,root vegetables,1072,0.024719,0.26451
7,tropical fruit,1032,0.023797,0.288307
8,shopping bags,969,0.022344,0.310651
9,sausage,924,0.021307,0.331957
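
The share of sales covered by the top sellers (step 3 of the problem statement) can also be computed directly from the item counts. The following is a sketch on made-up data; `top_n_share` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

def top_n_share(transactions, n):
    """Share of all sold items accounted for by the n most sold items."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(n)) / total

# Hypothetical toy transactions: milk is sold 3 times, bread 2, soda 1, yogurt 1.
toy_transactions = [
    ["milk", "bread"],
    ["milk", "soda"],
    ["milk"],
    ["bread", "yogurt"],
]
share = top_n_share(toy_transactions, 2)  # (3 + 2) / 7
```

In the notebook itself, the same figure should be readable as the last `total_perc` value of `top20_trans`, since that column is a cumulative sum over the items sorted by count.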


In [None]:
# One-hot encode the transactions: one boolean column per item, one row per transaction.
TE = TransactionEncoder()
tran_data = TE.fit_transform(itemset)
tran_df = pd.DataFrame(tran_data, columns=TE.columns_)
tran_df

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,berries,beverages,bottled beer,bottled water,brandy,brown bread,butter,butter milk,cake bar,candles,candy,canned beer,canned fish,canned fruit,canned vegetables,cat food,cereals,chewing gum,chicken,chocolate,chocolate marshmallow,citrus fruit,cleaner,cling film/bags,cocoa drinks,coffee,condensed milk,cooking chocolate,cookware,cream,...,salty snack,sauces,sausage,seasonal products,semi-finished bread,shopping bags,skin care,sliced cheese,snack products,soap,soda,soft cheese,softener,sound storage medium,soups,sparkling wine,specialty bar,specialty cheese,specialty chocolate,specialty fat,specialty vegetables,spices,spread cheese,sugar,sweet spreads,syrup,tea,tidbits,toilet cleaner,tropical fruit,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9830,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True,False,False,False,True,False,False,False,False,...,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False
9831,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9832,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
9833,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
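
For readers who want to see what this encoding step amounts to, here is a plain-Python sketch that builds the same kind of boolean matrix. The function `one_hot` is a hypothetical stand-in written for illustration; it only mimics the shape of the result shown above and is not how `TransactionEncoder` is implemented.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows

# Two of the transactions shown earlier in the notebook.
sample = [
    ["tropical fruit", "yogurt", "coffee"],
    ["whole milk"],
]
columns, rows = one_hot(sample)
```

Each row has one entry per distinct item, and an entry is `True` only if the transaction contains that item.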


In [None]:
# Label encoding: every column holds only False/True, which LabelEncoder maps to 0/1.
gle = LabelEncoder()
for i in tran_df.columns:
    tran_df[i] = gle.fit_transform(tran_df[i])
tran_df.head(2)

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,berries,beverages,bottled beer,bottled water,brandy,brown bread,butter,butter milk,cake bar,candles,candy,canned beer,canned fish,canned fruit,canned vegetables,cat food,cereals,chewing gum,chicken,chocolate,chocolate marshmallow,citrus fruit,cleaner,cling film/bags,cocoa drinks,coffee,condensed milk,cooking chocolate,cookware,cream,...,salty snack,sauces,sausage,seasonal products,semi-finished bread,shopping bags,skin care,sliced cheese,snack products,soap,soda,soft cheese,softener,sound storage medium,soups,sparkling wine,specialty bar,specialty cheese,specialty chocolate,specialty fat,specialty vegetables,spices,spread cheese,sugar,sweet spreads,syrup,tea,tidbits,toilet cleaner,tropical fruit,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0


In [None]:
#Data pruning
def prune_dataset(input_df, length_trans, sales_perc):
    """Keep the best-selling items and the transactions that contain enough of them.

    input_df: one-column data frame, one comma-separated transaction per row.
    length_trans: minimum number of retained items a transaction must contain.
    sales_perc: keep items while their cumulative share of sales is below this value.
    """
    # Split each transaction string into a list of item names and count the items.
    itemset = [row[0].split(",") for row in input_df.values.tolist()]
    item_counter = Counter(item for T in itemset for item in T)

    new_df = pd.DataFrame({'item_name': list(item_counter.keys()),
                           'item_count': list(item_counter.values())})

    # Share of total sales per item and cumulative share, best sellers first.
    new_df['item_perc'] = new_df.item_count/new_df.item_count.sum()
    new_df.sort_values(by='item_count', ascending=False, inplace=True)
    new_df['total_perc'] = new_df['item_perc'].cumsum()

    top_items = list(new_df[new_df.total_perc < sales_perc].item_name)

    # One-hot encode the transactions and convert True/False to 1/0.
    TE = TransactionEncoder()
    tran_data = TE.fit_transform(itemset)
    tran_df = pd.DataFrame(tran_data, columns=TE.columns_)

    gle = LabelEncoder()
    for i in tran_df.columns:
        tran_df[i] = gle.fit_transform(tran_df[i])

    # Keep only the top items, then only the transactions with enough of them.
    tran_df = tran_df[top_items]
    df = tran_df[tran_df.sum(axis=1) >= length_trans]
    count = new_df[new_df.total_perc < sales_perc].reset_index(drop=True)

    return df, count

In [44]:
pruned_df, item_count = prune_dataset(data,2,0.4)
print(pruned_df)
#print(item_count)


      whole milk  other vegetables  ...  citrus fruit  bottled beer
1              0                 0  ...             0             0
4              1                 1  ...             0             0
5              1                 0  ...             0             0
7              0                 1  ...             0             1
10             0                 1  ...             0             0
...          ...               ...  ...           ...           ...
9829           0                 1  ...             0             0
9830           1                 0  ...             1             0
9832           0                 1  ...             1             0
9833           0                 0  ...             0             1
9834           0                 1  ...             0             0

[4585 rows x 13 columns]


In [53]:
frequent_itemsets = apriori(pruned_df, min_support=0.01, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.442748,(whole milk)
1,0.352454,(other vegetables)
2,0.311014,(rolls/buns)
3,0.279389,(soda)
4,0.256052,(yogurt)
...,...,...
219,0.010687,"(citrus fruit, other vegetables, tropical frui..."
220,0.010033,"(rolls/buns, root vegetables, yogurt, whole milk)"
221,0.010469,"(rolls/buns, yogurt, tropical fruit, whole milk)"
222,0.012214,"(root vegetables, yogurt, tropical fruit, whol..."


In [54]:
rule_mlx = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.3)
rule_mlx

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(other vegetables),(whole milk),0.352454,0.442748,0.160523,0.455446,1.028679,0.004475,1.023317
1,(whole milk),(other vegetables),0.442748,0.352454,0.160523,0.362562,1.028679,0.004475,1.015857
2,(rolls/buns),(whole milk),0.311014,0.442748,0.121483,0.390603,0.882224,-0.016218,0.914432
3,(soda),(whole milk),0.279389,0.442748,0.085932,0.307572,0.694689,-0.037767,0.804780
4,(yogurt),(whole milk),0.256052,0.442748,0.120174,0.469336,1.060051,0.006808,1.050102
...,...,...,...,...,...,...,...,...,...
190,"(yogurt, tropical fruit, whole milk)",(root vegetables),0.032497,0.207852,0.012214,0.375839,1.808207,0.005459,1.269141
191,"(yogurt, root vegetables, other vegetables)",(tropical fruit),0.027699,0.199346,0.010687,0.385827,1.935466,0.005165,1.303629
192,"(yogurt, root vegetables, tropical fruit)",(other vegetables),0.017448,0.352454,0.010687,0.612500,1.737817,0.004537,1.671087
193,"(root vegetables, other vegetables, tropical f...",(yogurt),0.026390,0.256052,0.010687,0.404959,1.581546,0.003930,1.250245
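
As a sanity check on how the columns relate, the metrics of rule 0 ((other vegetables) => (whole milk)) can be recomputed from its three support values. The formulas below are the standard definitions of confidence, lift, leverage and conviction; the inputs are the rounded values printed in the table, so the results agree only up to rounding.

```python
support_x = 0.352454   # antecedent support: other vegetables
support_y = 0.442748   # consequent support: whole milk
support_xy = 0.160523  # support of both items together

confidence = support_xy / support_x              # share of X baskets that also hold Y
lift = confidence / support_y                    # confidence relative to Y's base rate
leverage = support_xy - support_x * support_y    # observed minus expected co-occurrence
conviction = (1 - support_y) / (1 - confidence)  # how much less often the rule fails than expected
```

The values come out close to the 0.455446, 1.028679, 0.004475 and 1.023317 reported for rule 0.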


In [69]:
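# Summarise the rules per consequent. Note that .max() is applied to each column
# separately, so the values in one output row can come from different rules.
(rule_mlx[['antecedents','consequents','support','confidence','lift']]
 .groupby('consequents')
 .max()
 .reset_index()
 .sort_values(['lift', 'support','confidence'],
              ascending=False))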

Unnamed: 0,consequents,antecedents,support,confidence,lift
5,(root vegetables),"(yogurt, other vegetables, tropical fruit)",0.049727,0.445312,2.142453
8,"(other vegetables, whole milk)","(citrus fruit, root vegetables)",0.016794,0.333333,2.07654
6,(tropical fruit),"(yogurt, root vegetables, other vegetables)",0.019411,0.391608,1.964469
4,(yogurt),"(root vegetables, other vegetables, tropical f...",0.062814,0.474576,1.853435
1,(other vegetables),"(yogurt, root vegetables, tropical fruit)",0.160523,0.633333,1.796927
7,(sausage),"(rolls/buns, shopping bags)",0.012868,0.307292,1.720308
0,(whole milk),"(yogurt, root vegetables, tropical fruit)",0.160523,0.7,1.581034
3,(soda),"(shopping bags, sausage)",0.062159,0.363636,1.30154
2,(rolls/buns),"(yogurt, tropical fruit, whole milk)",0.065649,0.39749,1.278043
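
Because `groupby(...).max()` takes the maximum of every column independently, one row of the table above can combine numbers from several rules. If the goal is one complete rule per consequent, a possible alternative is to sort by lift and keep the first row for each consequent. The sketch below uses four rules copied from the rules table, with the itemsets written as plain strings for brevity (in `rule_mlx` they are frozensets), so treat it as an illustration under that assumption.

```python
import pandas as pd

# Four rules taken from the rules table above (rows 1, 192, 0 and 4).
rules = pd.DataFrame({
    "antecedents": ["whole milk",
                    "yogurt, root vegetables, tropical fruit",
                    "other vegetables",
                    "yogurt"],
    "consequents": ["other vegetables", "other vegetables",
                    "whole milk", "whole milk"],
    "support": [0.160523, 0.010687, 0.160523, 0.120174],
    "confidence": [0.362562, 0.612500, 0.455446, 0.469336],
    "lift": [1.028679, 1.737817, 1.028679, 1.060051],
})

# Column-wise maxima: support and lift may come from different rules.
mixed = rules.groupby("consequents")[["support", "confidence", "lift"]].max()

# One complete rule per consequent: the row with the highest lift.
best = (rules.sort_values("lift", ascending=False)
             .drop_duplicates("consequents")
             .set_index("consequents"))
```

For the consequent other vegetables, `mixed` pairs the support of 0.160523 (from the whole milk rule) with the lift of 1.737817 (from the three-item rule), whereas `best` keeps the three-item rule intact with its own support of 0.010687.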


**Conclusion:**

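Among the consequents, root vegetables show the strongest association: the highest lift of any rule that predicts them is 2.14, which means root vegetables are about twice as likely to appear in transactions containing that rule's antecedent as in the pruned transactions overall. The summary table should be read with care, because `.max()` is taken for each column separately: the 4.9% support, 44.5% confidence and 2.14 lift listed for root vegetables are the best values over all rules with that consequent and do not necessarily belong to the single rule {yogurt, other vegetables, tropical fruit} => {root vegetables}. In the full rules table, the rules built from the itemset {yogurt, other vegetables, tropical fruit, root vegetables} (rows 191 and 192) have a support of 0.010687, so this itemset is present in about 1.1% of the pruned transactions.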