<div style="color:#485053;
           display:fill;
           border-radius:0px;
           background-color:#86C2DE;
           font-size:200%;
           padding-left:40px; 
           font-family:Verdana;
           font-weight:600; 
           letter-spacing:0.5px;
           ">

<p style="padding: 15px;
          color:white;
          text-align: center;">

 DATA WAREHOUSING AND MINING (CSE5021) 

</p>
</div>

## Assignment 2: Association Rule Mining
## Krupa Gajjar: 20MAI0014
## Scope: VIT, Vellore

### Dataset used here: https://www.kaggle.com/c/instacart-market-basket-analysis
***Introduction to association rule mining:***
This notebook demonstrates how the Apriori algorithm works. Apriori is widely used to mine frequent item sets and derive association rules from a transactional database. It relies on two parameters, “support” and “confidence”. The support of an item set is the fraction of transactions that contain it; the confidence of a rule {A} -> {B} is the conditional probability that a transaction containing A also contains B. The items in a transaction form an item set.

The main aim is to create association rules that identify relations between the products in the dataset. The implementation here leverages the Apriori algorithm to generate simple {A} -> {B} association rules.

![](https://miro.medium.com/max/875/1*-4cnnpMhGMrrQPlZY7DuTA.jpeg)

## Association Rule Mining:
Once the item sets have been generated using the Apriori algorithm, we can start mining association rules. Given that we are only looking at item sets of size 2, the association rules we generate will be of the form {A} -> {B}. One common application of these rules is in recommender systems, such as those behind Amazon's basket and BigBasket, where customers who purchased item A are recommended item B.

Three key metrics to consider when evaluating association rules are support, confidence, and lift:


![](https://www.researchgate.net/publication/337999958/figure/fig1/AS:867641866604545@1583873349676/Formulae-for-support-confidence-and-lift-for-the-association-rule-X-Y.ppm)
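To make the three metrics concrete, here is a minimal sketch on a toy set of transactions. The data and the helper names `support`, `confidence`, and `lift` are hypothetical and only illustrate the standard definitions; they are not part of the Instacart dataset or of the functions defined later in this notebook.

```python
# Toy transactions (hypothetical data, not taken from the Instacart dataset)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


support_ab = support({"bread", "milk"}, transactions)    # 2 of 4 transactions = 0.5
conf_ab = confidence({"bread"}, {"milk"}, transactions)  # 0.5 / 0.75 = 2/3
lift_ab = lift({"bread"}, {"milk"}, transactions)        # (2/3) / 0.75 = 8/9

print(support_ab, conf_ab, lift_ab)
```

A lift below 1, as in this toy case, suggests that the two items appear together less often than they would if they were independent.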

# Import Libraries

In [None]:
from IPython.display import display
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter

In [None]:
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))
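As a quick illustration of what the `size` helper reports, the sketch below applies it to two hypothetical objects. It relies on `sys.getsizeof`, which measures only the object itself, so a container holding a large object can still report a tiny size.

```python
import sys


def size(obj):
    # Same helper as above: size of the object itself, in megabytes
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))


payload = bytes(5_000_000)  # hypothetical 5 MB buffer
nested = [payload]          # a list that merely references the buffer

payload_size = size(payload)  # counts the buffer's own bytes
nested_size = size(nested)    # counts only the list object, not its contents

print(payload_size, nested_size)
```

For nested Python containers the reported figure is therefore a lower bound rather than the full memory footprint.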

# Load Dataset

In [None]:
import pandas as pd
orders = pd.read_csv("../input/instacart-market-basket-analysis/order_products__prior.csv")
display(orders.head())


## Transform into the format expected by the association rule function

In [None]:
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)
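The effect of this transformation can be seen on a small, hypothetical DataFrame. The sketch below assumes only the `order_id` and `product_id` columns used above; the extra `reordered` column and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the order/product table loaded above
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 2],
    "product_id": [10, 20, 10, 30, 20],
    "reordered":  [0, 1, 1, 0, 0],  # extra column, dropped by the transformation
})

# Same transformation as above: order_id becomes the index and the values
# are the product ids, renamed to item_id
toy_orders = toy.set_index("order_id")["product_id"].rename("item_id")

print(toy_orders)
print(type(toy_orders))
```

The result is a pandas Series with one row per (order, item) combination, which is the shape the helper functions in the next cell expect.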

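* `freq`: returns the frequency count of each item or item pair
* `order_count`: returns the number of unique orders
* `get_item_pairs`: generator that yields item pairs, one at a time
* `merge_item_stats`: attaches the frequency and support of each item to the item pairs
* `merge_item_name`: returns the name associated with each item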

In [None]:
def freq(iterable):
    # Count how often each distinct value (item or item pair) occurs
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else:
        return pd.Series(Counter(iterable)).rename("freq")

    
def order_count(order_item):
    return len(set(order_item.index))


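def get_item_pairs(order_item):
    # itertools.groupby only groups consecutive rows, so all rows of an
    # order must be adjacent in order_item
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]

        # Yield every 2-item combination within the order
        for item_pair in combinations(item_list, 2):
            yield item_pair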
            
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','confidenceBtoA','supportAB','supportA','supportB','freqA','freqAB','freqB', 
               'confidenceAtoB','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]               
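The pair-generation idea can be sketched on a toy Series as follows. The data and the name `toy_item_pairs` are hypothetical, and the stable sort by order id is an addition for safety that `get_item_pairs` above does not perform.

```python
from collections import Counter
from itertools import combinations, groupby

import pandas as pd

# Hypothetical (order_id -> item_id) Series
toy_orders = pd.Series(
    [10, 20, 10, 30, 20],
    index=pd.Index([1, 1, 2, 2, 2], name="order_id"),
    name="item_id",
)


def toy_item_pairs(order_item):
    # Sort by order id (stable, so item order within an order is kept),
    # because itertools.groupby only groups consecutive rows
    rows = order_item.sort_index(kind="mergesort").reset_index().to_numpy()
    for _, group in groupby(rows, key=lambda row: row[0]):
        items = [int(row[1]) for row in group]
        # Every 2-item combination within a single order
        yield from combinations(items, 2)


pair_counts = Counter(toy_item_pairs(toy_orders))
print(pair_counts)
```

Pairs are yielded in the order in which the items appear within an order, so in this sketch (30, 20) and (20, 30) would be counted as different pairs.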

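**This function does the following:**
* Calculates item frequency and support
* Filters out of order_item the items below the minimum support
* Filters out of order_item the orders with fewer than 2 items
* Recalculates item frequency and support
* Gets the item pairs generator
* Calculates item pair frequency and support, and filters out pairs below the minimum support
* Computes confidence and lift for each remaining pair and sorts the rules by lift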
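The filtering steps that the function performs can be sketched on a toy Series. The data and the 50% threshold are hypothetical, chosen only so that the effect of each filter is visible; support is expressed as a percentage of orders, as in the function below.

```python
import pandas as pd

# Hypothetical (order_id -> item_id) Series
toy_orders = pd.Series(
    [10, 20, 10, 30, 20, 40],
    index=pd.Index([1, 1, 2, 2, 2, 3], name="order_id"),
    name="item_id",
)
min_support = 50  # percent of orders (hypothetical threshold)

# Item frequency and support as a percentage of all orders
n_orders = toy_orders.index.nunique()
item_support = toy_orders.value_counts() / n_orders * 100

# Keep only the items that reach the minimum support
qualifying_items = item_support[item_support >= min_support].index
filtered = toy_orders[toy_orders.isin(qualifying_items)]

# Keep only the orders that still contain at least 2 items
order_size = filtered.index.value_counts()
qualifying_orders = order_size[order_size >= 2].index
filtered = filtered[filtered.index.isin(qualifying_orders)]

print(filtered)
```

Here items 30 and 40 fall below the threshold, which leaves order 3 empty and removes it from the orders used to generate pairs.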

In [None]:
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))

    # Calculate item frequency and support (support is a percentage of orders)
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100

    # Filter from order_item items below min support
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))

    # Filter from order_item orders with fewer than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))

    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100

    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)

    # Calculate item pair frequency and support
    item_pairs  = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))
    
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    # Supports are percentages, so confidence is a plain ratio, while lift
    # needs a factor of 100 to equal P(A and B) / (P(A) * P(B))
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    item_pairs['lift']           = item_pairs['supportAB'] * 100 / (item_pairs['supportA'] * item_pairs['supportB'])
    return item_pairs.sort_values('lift', ascending=False)

In [None]:
%%time
rules = association_rules(orders, 0.01)  # min_support is a percentage: 0.01 means 0.01% of orders

* Replace item ID with item name and display association rules

In [None]:
item_name   = pd.read_csv('../input/instacart-market-basket-analysis/products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules_final = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
df1 = rules_final.drop(columns=['lift', 'freqA', 'freqB'])
df1.head(11)