<a href="https://colab.research.google.com/github/thalankiabhishek/MBA_AssocRules/blob/master/InstaCart_MBA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **Market Basket Analysis**


---


*Which products will an Instacart consumer purchase again?*

<img src="https://www.analyticsvidhya.com/wp-content/uploads/2015/06/kaggle-logo-transparent-300.png" width="300"> 
<img src="https://www.getpeanutbutter.com/wp-content/uploads/2019/09/instacart-logo_v3.png" width="200">&emsp;

In [None]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz -q
!tar -xzf instacart_online_grocery_shopping_2017_05_01.tar.gz ; rm -rf *.tar.gz

## Libraries and Data Import
---

In [None]:
import pandas as pd
import os
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
os.chdir('/content/instacart_2017_05_01/')

### Files and their content

In [None]:
# Map each CSV file name to its column names; nrows=0 reads only the header row.
schema = {}
for csv in tqdm(sorted(os.listdir())):
  if csv.endswith('.csv') and not csv.startswith(('sample', '.')):
    schema[csv] = pd.read_csv(csv, nrows=0).columns.tolist()
# Files have different numbers of columns, so build from Series to pad with NaN.
pd.DataFrame({name: pd.Series(cols) for name, cols in schema.items()})

## Pre-processing Data

### Join Aisles, Departments and Products into a single DataFrame for ease of use

In [None]:
products=pd.read_csv('products.csv').merge(pd.read_csv('aisles.csv'), on='aisle_id').merge(pd.read_csv('departments.csv'), on='department_id')
# Pass the axis by keyword; the positional form was removed in pandas 2.0.
products = products.drop(columns=['aisle_id', 'department_id'])
products.head()

### Merge Orders with Order Products to Attach User IDs

In [None]:
train = pd.read_csv('order_products__train.csv') 
priors = pd.read_csv('order_products__prior.csv')
orders = pd.read_csv('orders.csv')

In [None]:
train = train.merge(orders[orders.eval_set=='train'], on='order_id')
train = train.merge(products, on='product_id')
train = train.drop(columns='order_id product_id add_to_cart_order reordered eval_set order_dow order_hour_of_day days_since_prior_order'.split(" "))
train = train.sort_values(['user_id', 'order_number']).reset_index(drop=True)
train.head()

In [None]:
priors = priors.merge(orders[orders.eval_set=='prior'], on='order_id')
priors = priors.merge(products, on='product_id')
priors = priors.drop(columns='order_id product_id add_to_cart_order reordered eval_set order_dow order_hour_of_day days_since_prior_order'.split(" "))
priors = priors.sort_values(['user_id', 'order_number']).reset_index(drop=True)
priors.head()

### Every User's Shopping List

In [None]:
# One de-duplicated, sorted product list per user (groupby yields users in sorted order).
transactions=[]
for key, group in tqdm(priors.groupby('user_id')):
  transactions.append(sorted(set(group.product_name)))

### Shopping List to Encoded Dataframe 

In [None]:
!pip install --upgrade mlxtend -q
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax, association_rules

In [None]:
te = TransactionEncoder()
te_array = te.fit_transform(transactions)
transactions = pd.DataFrame(te_array.astype("int"), columns=te.columns_, index = priors.user_id.unique())
transactions.iloc[:5, :5]

### Generating Frequent Itemsets

In [None]:
frequent_itemsets = apriori(transactions, min_support=0.2, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets.itemsets.apply(len)
round(frequent_itemsets, 3)

## Generating and Filtering Results for Association Rules
---

### Compute ***Support***
```
support(A→C) = support(A ∪ C)
```
Measures how frequently an itemset appears in the dataset: the fraction of transactions that contain every item of both the antecedent A and the consequent C.
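
As a rough check of the definition, support can be counted by hand on a handful of baskets. The sketch below uses a made-up toy dataset (the `baskets` list and the `support` helper are illustrative assumptions, not part of the Instacart data or of mlxtend).

```python
# Toy baskets (hypothetical data, not taken from the Instacart files).
baskets = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk"},
    {"Milk", "Eggs"},
    {"Eggs", "Jam"},
    {"Milk"},
    {"Bread", "Butter", "Eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

print(support({"Bread"}, baskets))            # 0.625 (5 of 8 baskets)
print(support({"Butter"}, baskets))           # 0.5   (4 of 8 baskets)
print(support({"Bread", "Butter"}, baskets))  # 0.5   (4 of 8 baskets)
```

The support of the rule Bread→Butter is the support of the combined itemset {Bread, Butter}, so it is the same in either direction.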

In [None]:
association_rules(frequent_itemsets, support_only=True).dropna(axis=1)

### **Confidence** as a threshold metric
```
confidence(A→C) = support(A→C)/support(A)
```
Probability of seeing the consequent (Butter) in a transaction given that it also contains the antecedent (Bread).
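
A small hand computation may help make the conditional-probability reading concrete. This is only a sketch on made-up baskets; the `baskets` data and the `support` and `confidence` helpers are hypothetical names introduced for illustration.

```python
# Toy baskets (hypothetical data, not taken from the Instacart files).
baskets = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk"},
    {"Milk", "Eggs"},
    {"Eggs", "Jam"},
    {"Milk"},
    {"Bread", "Butter", "Eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(A and C together) divided by support(A)."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)

print(confidence({"Bread"}, {"Butter"}, baskets))  # 0.8
print(confidence({"Butter"}, {"Bread"}, baskets))  # 1.0
```

Unlike support, confidence depends on the direction of the rule: here every basket with Butter also has Bread, but not every basket with Bread has Butter.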

In [None]:
association_rules(frequent_itemsets, metric="confidence")

### **Lift** as a threshold metric
```
lift(A→C) = confidence(A→C)/support(C)
```
Measures how much more often the antecedent (Bread) and consequent (Butter) of a rule A→C occur together than would be expected if they were statistically independent.
If A and C are independent, the lift is exactly 1; values above 1 indicate a positive association and values below 1 a negative one.
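
The three cases (above 1, exactly 1, below 1) can be illustrated on a toy dataset. The sketch below assumes made-up baskets, and the `support` and `lift` helpers are hypothetical names, not mlxtend functions.

```python
# Toy baskets (hypothetical data, not taken from the Instacart files).
baskets = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk"},
    {"Milk", "Eggs"},
    {"Eggs", "Jam"},
    {"Milk"},
    {"Bread", "Butter", "Eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def lift(antecedent, consequent, baskets):
    """confidence(A -> C) divided by support(C)."""
    antecedent, consequent = set(antecedent), set(consequent)
    conf = support(antecedent | consequent, baskets) / support(antecedent, baskets)
    return conf / support(consequent, baskets)

print(lift({"Bread"}, {"Butter"}, baskets))  # 1.6 (bought together more than chance)
print(lift({"Butter"}, {"Jam"}, baskets))    # 1.0 (independent in this toy data)
print(lift({"Bread"}, {"Milk"}, baskets))    # 0.8 (bought together less than chance)
```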

In [None]:
association_rules(frequent_itemsets, metric="lift", min_threshold=3)

### **Leverage** as a threshold metric
```
leverage(A→C) = support(A→C)−support(A)×support(C)
```
Measures the difference between the observed frequency of A (Bread) and C (Butter) appearing together and the frequency that would be expected if A and C were independent. A leverage of 0 indicates independence.
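
To see the formula in action, leverage can be worked out by hand on a few baskets. This is a minimal sketch on made-up data; `baskets`, `support` and `leverage` are hypothetical names used only for illustration.

```python
# Toy baskets (hypothetical data, not taken from the Instacart files).
baskets = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk"},
    {"Milk", "Eggs"},
    {"Eggs", "Jam"},
    {"Milk"},
    {"Bread", "Butter", "Eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def leverage(antecedent, consequent, baskets):
    """Observed joint support minus the joint support expected under independence."""
    antecedent, consequent = set(antecedent), set(consequent)
    observed = support(antecedent | consequent, baskets)
    expected = support(antecedent, baskets) * support(consequent, baskets)
    return observed - expected

print(leverage({"Bread"}, {"Butter"}, baskets))  # 0.1875 (0.5 observed vs 0.3125 expected)
print(leverage({"Butter"}, {"Jam"}, baskets))    # 0.0    (independent in this toy data)
```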

In [None]:
association_rules(frequent_itemsets, metric="leverage", min_threshold=0.2)

### **Conviction** as a threshold metric
```
conviction(A→C) = (1−support(C)) / (1−confidence(A→C))
```
A high conviction value means that the consequent (Butter) is highly dependent on the antecedent (Bread). If A and C are independent, the conviction is 1.
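
A hand computation on a toy dataset can show how the ratio behaves, including the edge case where confidence is 1 and the denominator vanishes. The sketch below uses made-up baskets and hypothetical `support` and `conviction` helpers; treating the zero-denominator case as infinite is an assumption of this sketch.

```python
# Toy baskets (hypothetical data, not taken from the Instacart files).
baskets = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk"},
    {"Milk", "Eggs"},
    {"Eggs", "Jam"},
    {"Milk"},
    {"Bread", "Butter", "Eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def conviction(antecedent, consequent, baskets):
    """(1 - support(C)) / (1 - confidence(A -> C)); infinite when confidence is 1."""
    antecedent, consequent = set(antecedent), set(consequent)
    conf = support(antecedent | consequent, baskets) / support(antecedent, baskets)
    if conf == 1:
        return float("inf")  # the rule is never violated in this data
    return (1 - support(consequent, baskets)) / (1 - conf)

print(round(conviction({"Bread"}, {"Butter"}, baskets), 3))  # 2.5
print(conviction({"Butter"}, {"Jam"}, baskets))               # 1.0
print(conviction({"Butter"}, {"Bread"}, baskets))             # inf
```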

In [None]:
association_rules(frequent_itemsets, metric="conviction")