# Market Basket Analysis
-----
### Goal :
* Given all past orders, which departments tend to be bought together?
* Once known, which specific products within those departments tend to be bought together?


### Why?
* So the business can reorganize the store layout and run promotional campaigns that bundle these items together. [Reference](https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis)


<br/>

-----

### Proposed Solution :
1. We can answer the first question using a concept called "Association Rules"
    * In short, calculate the relationship (lift) of each pair of items. Say we have items A & B.
        * if lift > 1, A & B occur together more often than expected by chance
        * if lift = 1, A & B occur together exactly as often as expected by chance (independent)
        * if lift < 1, A & B occur together less often than expected by chance
        
    * calculate lift(A,B) = Pr(A & B bought together) / (Pr(A bought) * Pr(B bought))

<br/>

2. Formal answer : 
    * Pr : Probability of...

```
Step1 : 
calculate support(A,B), support(A), support(B)

sup(A,B) = Pr(A & B bought together) = Pr(A n B)
sup(A)   = Pr(A is bought) = Pr(A)
Note : sup(A,B) = sup(B,A)   (support is symmetric)
```


```
Step2 : 
use support(A,B) to get Confidence(A,B) = Pr(B bought, given A already bought)

Co(A,B) = Pr(B|A)
        = Pr(B n A) / Pr(A).  https://images.app.goo.gl/Cb8Z6aQrtpBpeWTC8
        = sup(A,B) / sup(A)
        
Note : Co(A,B) =/= Co(B,A)
```

```
Step3 :
use Confidence to get lift(A,B) = (likelihood of A & B bought together, relative to chance)
Li(A,B) = Pr(A & B bought together) / ( Pr(A bought) * Pr(B bought) )
        = Pr(A & B bought together) / Pr(A bought) / Pr(B bought)
        = Pr(A n B) / Pr(A) / Pr(B)
        = Pr(A n B) / Pr(B) / Pr(A)
        = Pr(A|B) / Pr(A)
        = Co(B,A) / sup(A)
```

```
0 <= sup(A,B) <= 1
0 <= Co(A,B) <= 1
0 <= Li(A,B) < inf
```
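
As a worked illustration of the three steps above, here is a minimal sketch on a toy dataset. The five orders and the helper names `support`, `confidence` and `lift` are hypothetical and are not the functions used later in this notebook.

```python
# Toy data (hypothetical): five orders, each a set of items.
orders = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]


def support(*items):
    """Fraction of orders that contain every item in `items`."""
    wanted = set(items)
    return sum(wanted <= order for order in orders) / len(orders)


def confidence(a, b):
    """Co(a, b) = Pr(b | a) = sup(a, b) / sup(a)."""
    return support(a, b) / support(a)


def lift(a, b):
    """Li(a, b) = sup(a, b) / (sup(a) * sup(b)); symmetric in a and b."""
    return support(a, b) / (support(a) * support(b))


sup_ab = support("bread", "butter")    # 2 of the 5 orders contain both
co_ab = confidence("bread", "butter")  # (2/5) / (3/5)
li_ab = lift("bread", "butter")        # (2/5) / ((3/5) * (3/5))

print(f"support={sup_ab:.3f}  confidence={co_ab:.3f}  lift={li_ab:.3f}")
```

Here lift is slightly above 1, so in this toy data bread and butter appear together a little more often than independence would predict. Swapping the arguments changes confidence in general, but not support or lift.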

<br/>

3. A minimum support threshold of 0.01 is fine: it means at least 1 out of every 100 unique orders contains that pair
* see the small example under [Apriori Algorithm](https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis#Apriori-Algorithm). There are 5 orders and we measure by pair, so the smallest non-zero support belongs to a pair that appears in one order out of 5, i.e. 1/5
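
To make the threshold concrete, here is a small sketch that assumes made-up pair counts for 5 orders. The names `pair_freq` and `filter_pairs` are illustrative only.

```python
n_orders = 5

# Hypothetical counts: number of orders in which each pair appears.
pair_freq = {("a", "b"): 3, ("a", "c"): 2, ("b", "c"): 1}

# Support of a pair = orders containing the pair / total orders.
pair_support = {pair: count / n_orders for pair, count in pair_freq.items()}

# With 5 orders, support moves in steps of 1/5, so 1/5 is the smallest non-zero value.
smallest_nonzero = 1 / n_orders


def filter_pairs(min_support):
    """Keep only the pairs whose support reaches the threshold."""
    return {p: s for p, s in pair_support.items() if s >= min_support}


kept_low = filter_pairs(0.01)   # 0.01 is below 1/5, so every observed pair survives
kept_high = filter_pairs(0.5)   # a stricter threshold keeps only frequent pairs

print(sorted(kept_low), sorted(kept_high))
```

On a dataset with millions of orders the same 0.01 threshold is much more selective, because a pair then needs to appear in tens of thousands of orders to survive.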

<br/>

4. How do "Association Rules" find item pairs that occur together more often than random? (simplified terms)
* take each order (one order contains many products)
* list every pair of products inside that order
* count how many orders contain each pair
    * the more often a pair occurs relative to how often each item occurs on its own, the higher the 'lift' value
    * if a pair's support falls below min_threshold, we throw that pair away entirely
* repeat for all orders
    * hence "0 <= number of rows in output result <= n C 2", where n is the number of distinct items and (A,B) and (B,A) are treated as the same pair (if n=10, -> 10C2)
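
The pair-counting idea can be sketched with the standard library as follows. The three orders are invented, and items are sorted inside each order so that (A,B) and (B,A) count as the same pair, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Hypothetical orders: order_id -> list of item ids.
orders = {1: [222, 223], 2: [222, 192], 3: [192, 222, 223]}

pair_freq = Counter()
for items in orders.values():
    # Sorting makes the pair key independent of the order in which items were added.
    pair_freq.update(combinations(sorted(set(items)), 2))

# In this sketch the number of distinct pairs cannot exceed "n choose 2" for n distinct items.
distinct_items = {item for items in orders.values() for item in items}
max_rows = comb(len(distinct_items), 2)

print(pair_freq.most_common(), max_rows)
```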

<br/>

-----

### Useful reads :
* [Association Rules Explained](https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce)

In [None]:
# First timer read here on [loading zip data issue](https://www.kaggle.com/dansbecker/finding-your-files-in-kaggle-kernels) 
# DO NOT RUN THIS CELL TOO OFTEN, WILL TAKE 2 MINS+ TO RELOAD DATA

# basic packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# garbage collector to free up memory
import gc
gc.enable()


# the "ori_" prefix marks the original dataframes, which are to be preserved unmodified
# ori_aisles = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/aisles.csv')
ori_dept = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/departments.csv')
ori_order_prod_prior = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/order_products__prior.csv')
# ori_order_prod_train = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/order_products__train.csv')
# ori_orders = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/orders.csv')
ori_products = pd.read_csv('../input/d/psparks/instacart-market-basket-analysis/products.csv')




## Obtain
* load dataset & see their size

In [None]:
# additional packages needed
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display
import pdb

# load data needed
orders_prod_prior = ori_order_prod_prior
products = ori_products
dept = ori_dept

# aisles = ori_aisles
# orders_prod_train = ori_order_prod_train
# orders = ori_orders



In [None]:
def size(obj):
    '''Input a variable object, Output size of the variable in MB '''
    
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))


# sneak peek of order data
print(f"orders_prod_prior dimensions : {orders_prod_prior.shape};   size : {size(orders_prod_prior)}")
print(orders_prod_prior.head(3))

print("------------------")

# sneak peek of product data
print(f"products dimensions : {products.shape};   size : {size(products)}")
print(products.head(3))

print("------------------")

# sneak peek of department data
print(f"departments dimensions : {dept.shape};   size : {size(dept)}")
print(dept.head(3))

## Scrub
* merge orders & products df first
* Convert dept dataframe to format suitable for association rules (df to series), which is order_id as index & dept_id as value
* Convert products dataframe to format suitable for association rules (df to series), which is order_id as index & product_id as value
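
A compact sketch of the dataframe-to-series conversion on made-up rows is shown below. It uses pandas `drop_duplicates` for brevity, which is a swapped-in alternative and not necessarily the approach taken in the cells that follow; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: order 4 contains two products from department 7.
rows = pd.DataFrame({
    "order_id":      [4, 4, 4, 5, 5],
    "department_id": [7, 7, 19, 7, 7],
})

# Keep one row per (order_id, department_id), then index by order_id
# so that each order maps to the departments it touches.
order_dept = (
    rows.drop_duplicates(subset=["order_id", "department_id"])
        .set_index("order_id")["department_id"]
        .rename("item_id")
)

print(order_dept)
```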

In [None]:
# every order lists each product_id at most once (add_to_cart_order is the position in which the product was added to the cart, not a quantity),
# so make every order list each dept_id at most once too


# merge orders_prod_prior & products df, to get dept_id
orders_products = orders_prod_prior.merge(products, on='product_id')

print(orders_products.head(3))
print(f"dimensions : {orders_products.shape};   size : {size(orders_products)}")   

In [None]:
# get order_id & dept_id only
orders_products = orders_products[['order_id', 'department_id']]

In [None]:
# removing duplicates idea
print(orders_products[orders_products['order_id'] == 4]) # what we have now
print(orders_products[orders_products['order_id'] == 4].drop_duplicates(subset=['department_id'])) # what we hope to have as final result

In [None]:
# drop dept_id duplicates using numpy vectorization

# # sort df by order_id
# orders_products = orders_products.sort_values(by=['order_id'])

# convert the two columns (order_id, department_id) to a numpy array of row pairs
orders_products_array = orders_products.to_numpy()


In [None]:
# %%timeit   result : 41 s ± 13.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# drop duplicates
orders_products_array_unique = np.unique(orders_products_array, axis=0)

In [None]:
orders_products_array_unique

In [None]:
# convert numpy array back to pandas df
orders_departments_id = pd.DataFrame(orders_products_array_unique, 
             columns=['order_id', 
                      'department_id'])

In [None]:
# set orders_departments_id that is suitable for 'associate rules' analysis
orders_departments_id = orders_departments_id.set_index('order_id')['department_id'].rename('item_id')
print(orders_departments_id.head(5))
print(f"dimensions: {orders_departments_id.shape};  \nsize: {size(orders_departments_id)};   \nunique_orders: {len(orders_departments_id.index.unique())};   \nunique_items: {len(orders_departments_id.value_counts())}; \ntype: {type(orders_departments_id)};")



In [None]:
# convert order df to series. order_id as index & product_id (renamed to item_id) as value
orders_products_id = ori_order_prod_prior.set_index('order_id')['product_id'].rename('item_id')   # must be built from the original order data; building it from orders_products causes an error in get_item_pairs()
print(orders_products_id.head(5))
print(f"dimensions: {orders_products_id.shape};  \nsize: {size(orders_products_id)};   \nunique_orders: {len(orders_products_id.index.unique())};   \nunique_items: {len(orders_products_id.value_counts())}; \ntype: {type(orders_products_id)};")



## Explore

* Skipped entirely, as this analysis alone uses almost 15GB of RAM
* Will be covered [here](https://www.kaggle.com/dwihdyn/mkt-bskt-prediction/)


## Model
* Functions to calculate association rules, plus their helpers
* Input: a series with order_id as index and product_id as value. Output: every pair of product_id together with how likely the pair is to be bought together (lift). The same procedure applies to departments
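
One detail worth illustrating: a pair generator built on `itertools.groupby` only groups consecutive rows, so rows belonging to the same order must sit next to each other. The sketch below uses a hypothetical helper, `pairs_per_order`, on made-up rows to show the difference.

```python
from itertools import combinations, groupby


def pairs_per_order(rows):
    """Yield item pairs from (order_id, item_id) rows, one order at a time."""
    for _, group in groupby(rows, key=lambda row: row[0]):
        items = [item for _, item in group]
        yield from combinations(items, 2)


# Same four rows, once grouped by order_id and once interleaved.
contiguous = [(1, 222), (1, 223), (2, 222), (2, 192)]
scattered = [(1, 222), (2, 222), (1, 223), (2, 192)]

pairs_ok = list(pairs_per_order(contiguous))
pairs_lost = list(pairs_per_order(scattered))            # every group has a single row
pairs_sorted = list(pairs_per_order(sorted(scattered)))  # sorting restores the groups

print(pairs_ok, pairs_lost, pairs_sorted)
```

Sorting by order_id before generating pairs is a cheap safeguard whenever the input may have been reshuffled, for example by a merge.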


In [None]:
# helper functions that let us calculate lift on a large dataset without running out of memory (data management)


def freq(iterable):
    '''Returns frequency counts for items & item pairs'''
    
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")


    
def order_count(order_item):
    '''Return no of unique orders'''
    return len(set(order_item.index))



def get_item_pairs(order_item):
    '''Return generator that yields item pairs, one at a time. (helps to facilitate large dataset)'''
    
    # input
    # order_id     product_id
    # 1             222
    # 1             223
    # 2             222
    # 2             192
    # output : array([1, 222], [1, 223], [2,222], [2,192])
    order_item = order_item.reset_index().to_numpy()
    
    
    # itertools.groupby only groups consecutive rows, so rows of the same order_id must be adjacent
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]

        for item_pair in combinations(item_list, 2):
            yield item_pair
        

        
def merge_item_stats(item_pairs, item_stats):
    '''Returns frequency & support associated with items'''
    
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True)
           )



def merge_targeted_category(rules, targeted_category):
    '''Returns name associated with item'''
    
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(targeted_category.rename(columns={'targeted_category': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(targeted_category.rename(columns={'targeted_category': 'itemB'}), left_on='item_B', right_on='item_id')
            )
    return rules[columns]               

In [None]:
# Association Rules function, to get lift() of all pairs

def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))

    # 1) Calculate item frequency and support of each item (support(A))
    item_stats             = freq(order_item).to_frame("freq") 
    item_stats['support']  = item_stats['freq'] / order_count(order_item) # * 100  
    



    
    # 2) Remove items whose support is below min_support from order_item
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]
     
 
    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    
    
    
    # 3) We want to know which PAIRS occur most often across ALL orders, so orders that contain only 1 item are of no use.
    # Hence, remove orders with fewer than 2 items from order_item (drop all single-item orders)
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))



    
    
    # 4) Since all single-item orders have been removed, recalculate item frequency and support (repeat of step 1 on the reduced data, which also makes the code run faster)
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) # * 100

    
    
    
    
    # 5) For every order, pair up its items; the generator yields the pairs lazily
    # (they are counted in step 6, which runs "Counter(item_pair_gen)" inside freq())
    item_pair_gen          = get_item_pairs(order_item)  


    
    
    
    # 6) Calculate item pair frequency and support of each pair (support(A,B))
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) # * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    
    
    
    # 7) Remove pairs whose supportAB is below min_support (same idea as step 2: drop useless data and make the code run faster)
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))
            


        
        
    # 8) Create table of association rules and compute relevant metrics (confidence & lift)
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)
    

In [None]:
# 0.01 min threshold is fine, meaning, out of 100 unique orders, 1 order has that pair 

# associate rules : products
products_rules = association_rules(orders_products_id, 0.01)  


In [None]:
# 1) Which products are frequently bought together?

# Replace item ID with item name and display association rules
targeted_category   = products.rename(columns={'product_id':'item_id', 'product_name':'targeted_category'})
products_rules_final = merge_targeted_category(products_rules, targeted_category).sort_values('lift', ascending=False)

# display the top item pairs, sorted by "lift" in descending order, only showing lift > 1
print(products_rules_final[products_rules_final['lift'] > 1][['itemA', 'itemB', 'lift']].head(10))
print(f"dimensions: {products_rules_final.shape};  \nsize: {size(products_rules_final)};   \nunique_orders: {len(products_rules_final.index.unique())};   \nunique_items: {len(products_rules_final.value_counts())};")

In [None]:
# associate rules : departments
dept_rules = association_rules(orders_departments_id, 0.01)  

In [None]:
# 2) Which departments are frequently bought together?

# Replace item ID with item name and display association rules
targeted_category   = dept.rename(columns={'department_id':'item_id', 'department':'targeted_category'})
dept_rules_final = merge_targeted_category(dept_rules, targeted_category).sort_values('lift', ascending=False)


# display the top item pairs, sorted by "lift" in descending order, only showing lift > 1
print(dept_rules_final[dept_rules_final['lift'] > 1][['itemA', 'itemB', 'lift']].head(10))
print(f"dimensions: {dept_rules_final.shape};  \nsize: {size(dept_rules_final)};   \nunique_orders: {len(dept_rules_final.index.unique())};   \nunique_items: {len(dept_rules_final.value_counts())};")

## iNterpret
* Jump to "Call-To-Action" section [here](https://dwihdyn.github.io/journals/6-mba-retail.html)


In [None]:
# # Scrub process 
# # naive way. DO NOT RUN THIS! It is here only to explain the concept of tackling this problem in the most reckless way

# # empty df to store non-duplicate dept
# unique_dept_tes = pd.DataFrame(columns = ['order_id', 'department_id'])


# # function to remove duplicate
# def remove_duplicate_depts_given_orderid(df, order_id):
#     '''input dataframe & order_id, output that given order_id with UNIQUE dept_id'''
    
#     return df[df['order_id'] == order_id].drop_duplicates(subset=['department_id'])



# # small version for order 2,3,4 only
# i = 2
# # while i <= tes['order_id'].max():
# while i <= 4:    
    
#     # remove duplicate in that selected order_id
#     temp_df = remove_duplicate_depts_given_orderid(tes, i)
    
#     # append temp_df to new df 
#     unique_dept_tes = unique_dept_tes.append(temp_df, ignore_index=True)

    
#     i = i + 1
    
#     if i == 10 or i == 100 or i == 1000 or i == 10000 or i == 100000 or i == 1000000 or i == 3000000 or i == 3421000 or i == 3421080:
#         print(f" order_id up to {i} out of {tes['order_id'].max()} is done, continuing............")