# Calculating Market Basket Analysis

### 1. What is Market Basket Analysis?
### 2. How is MBA calculated?
### 3. Doing MBA with Python.
### 4. Building a Recommender and testing it.

# 1. What is Market Basket Analysis (MBA)?

<h3>1.1 So what is MBA?</h3>
<ul>
    <li>It's a group of techniques often used in retail to uncover <b>associations</b> between items.</li>
    <li>If a customer buys a particular product, they are likely to buy related goods that complement it.</li>
</ul>

<img src="https://cdn-images-1.medium.com/max/800/1*YSDKzjONGi1xB6ub2gidXw.jpeg" alt="Recommender">

<h3>1.2. Basically, MBA will tell us things like:</h3>

<ul>
    <li>If bananas are purchased, then avocados are also likely to be purchased.</li>
    <li>If both milk and bread are purchased, then eggs are purchased 50% of the time.</li>
</ul>

<h3>1.3. What is MBA used for?</h3>

<ul>
    <li>Allowing retailers to make promotional product bundles and send coupons.</li>
    <li>Improving store layout to optimise flow and cross-selling, and content placement in e-commerce.</li>
    <li>Inventory planning and pricing. </li>
    <li>Creating product segments/groups - Category management.</li>
    <li>Identifying product gaps in baskets.</li>
    <li>Building E-Commerce recommendation engines.</li>
    <li>Fraud detection.</li>
</ul>

# 2. How is MBA calculated?

<h3>2.1. We will be using the Apriori algorithm to calculate MBA</h3>

<ul>
    <li>Classic Data Mining algorithm.</li>
    <li>Allows us to find frequent itemsets and association rules.</li>
    <li>It's also used for example in healthcare to identify the relationships between drugs and adverse reactions.</li>
</ul>

<h3>2.2. The algorithm has three basic steps:</h3>

<ol>
    <li>Calculate support.</li>
    <li>Calculate confidence.</li>
    <li>Calculate lift.</li>
</ol>

<h3>2.2.1. Support:</h3>

<p>Says how often an itemset appears, measured as the proportion of customer transactions that contain it.</p>
<img src="https://annalyzin.files.wordpress.com/2016/03/association-rule-support-eqn.png?w=248&h=68" alt="Recommender">
<img src="https://annalyzin.files.wordpress.com/2016/04/association-rule-support-table.png?w=503&h=447" alt="Recommender">
<ul>
    <li>Support {apple} = 4/8, or 50%</li>
    <li>Support {apple, beer, rice} = 2/8, or 25%</li>
</ul>

<a style = "font-size:16px" href=https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html>Source</a>
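
As a minimal sketch of the support calculation, the snippet below counts how many baskets contain every item of an itemset. The eight baskets are hypothetical toy data, chosen so that the results match the figures quoted above, and `support` is an illustrative helper name rather than a library function.

```python
# Hypothetical toy baskets (assumed data, chosen to match the figures above)
baskets = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

print(support({"apple"}, baskets))                  # 0.5
print(support({"apple", "beer", "rice"}, baskets))  # 0.25
```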

<h3>2.2.2. Confidence:</h3>

<p>Says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}.</p>
<p>The confidence of {apple -> beer} is 3 out of 4, or 75%</p>

<img src="https://annalyzin.files.wordpress.com/2016/03/association-rule-confidence-eqn.png?w=527&h=77" alt="Recommender">

<p style = "font-size:20px">One drawback of the confidence measure is that it might misrepresent the importance of an association. This is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general, there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the confidence measure. <b>To account for the base popularity of both constituent items, we use a third measure called lift.</b></p>

<a style = "font-size:16px" href=https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html>Source</a>
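
The drawback described above can be seen on a small example. This sketch uses hypothetical toy baskets and illustrative helper names (`support`, `confidence`); it compares the confidence of {apple -> beer} with how often beer is bought in general.

```python
# Hypothetical toy baskets (assumed data for illustration only)
baskets = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(x, y, baskets):
    """support{x, y} / support{x}: share of baskets with x that also hold y."""
    return support({x, y}, baskets) / support({x}, baskets)

print(confidence("apple", "beer", baskets))  # 0.75
print(confidence("beer", "apple", baskets))  # 0.5
print(support({"beer"}, baskets))            # 0.75
```

In this toy data the 75% confidence of {apple -> beer} is no higher than beer's overall support, so knowing that a basket contains apples adds nothing about beer. Confidence also depends on direction: {beer -> apple} is only 50%.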

<h3>2.2.3. Lift:</h3>

<p>This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is. </p>
<p>The lift of {X -> Y} is the confidence of {X -> Y} divided by the support of Y.</p>

<img src="https://annalyzin.files.wordpress.com/2016/03/association-rule-lift-eqn.png?w=566&h=80" alt="Recommender">

<ul>
    <li>lift <b>equal</b> to 1 implies no relationship between A and B (i.e. A and B occur together only by chance).</li>
    <li>lift <b>more than</b> 1 implies that there is a positive relationship between A and B (i.e. A and B occur together more often than random).</li>
    <li>lift <b>less than</b> 1 implies that there is a negative relationship between A and B (i.e. A and B occur together less often than random).</li>
</ul>

<a style = "font-size:16px" href=https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html>Source</a>
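
The three cases above can be reproduced with a short sketch. The baskets are hypothetical toy data and `lift` is an illustrative helper, written directly from the definition support{A,B} / (support{A} * support{B}).

```python
# Hypothetical toy baskets (assumed data for illustration only)
baskets = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def lift(a, b, baskets):
    """support{a, b} / (support{a} * support{b}); symmetric in a and b."""
    return support({a, b}, baskets) / (support({a}, baskets) * support({b}, baskets))

for a, b in [("apple", "beer"), ("beer", "rice"), ("apple", "milk")]:
    print(a, b, round(lift(a, b, baskets), 2))
```

With this data, apple and beer have a lift of exactly 1 (no relationship), beer and rice come out above 1, and apple and milk, which never share a basket, come out below 1.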

# 3. Doing MBA with Python.

In [3]:
# Ideas to explore:
# Affinity analysis
# TF-IDF
# Complementary
# Substitutive
# Clustering -- PCA?
# Collaborative filtering
# Word2vec vectors
# Latent factor
#
# Note: you might want to show bananas at the beginning but not in the product card

In [4]:
import sys
import collections
import itertools
import pandas as pd
from ipywidgets import widgets
from collections import Counter
from IPython.display import display
from IPython.display import clear_output
from itertools import combinations, groupby
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

In [5]:
########################
## OBTAINING PRODUCTS ##
########################

products = pd.read_csv("products.csv")
order_products_prior = pd.read_csv("order_products__prior.csv")
departments = pd.read_csv("departments.csv")
#sample_submission = pd.read_csv("../input/sample_submission.csv")
#orders = pd.read_csv("../input/orders.csv")
aisles = pd.read_csv("aisles.csv")
#order_products_train = pd.read_csv("../input/order_products__train.csv")

In [6]:
########################
## OBTAINING SAMPLE   ##
########################

order_products_prior.count()
prods = order_products_prior[:2000000]
orders_prior = pd.merge(products, prods, on="product_id")
#orders_prior_listed = orders_prior.groupby("order_id")["product_name"].apply(list)

In [7]:
#######################################
## MERGE PRODUCTS AND DEP/AISLE INFO ##
#######################################

orders_deps_prior = pd.merge(orders_prior, departments, on="department_id")
orders_total_prior = pd.merge(orders_deps_prior, aisles, on="aisle_id")
orders_total_prior.drop(['department_id', 'aisle_id'], inplace=True, axis=1)

## MOST SOLD PRODUCTS
######################################
count_prods = orders_total_prior.groupby("product_name")['order_id'].count()
count_prods = count_prods.to_frame().sort_values(by="order_id", ascending=False)

## 3.1. Starting point: The data we're going to work with:
* Order information (Order ID, Products ordered, amount, department, aisle)

In [8]:
display(orders_total_prior.sample(n=100).head())

Unnamed: 0,product_id,product_name,order_id,add_to_cart_order,reordered,department,aisle
983659,20765,Steel Wool Soap Pads Lemon Fresh Scent - 10 CT,206382,7,0,household,cleaning products
415741,45840,100% Pure Apple Juice,21869,10,1,beverages,juice nectars
1265890,18883,Honeydew Chunks,195385,1,1,produce,packaged vegetables fruits
1254553,6000,Organic Baby Romaine,197918,3,1,produce,packaged vegetables fruits
1108179,46041,Beef Franks,106887,7,0,meat seafood,hot dogs bacon sausage


In [9]:
#######################################
##  OBTAINING NUMBER OF TRANSACTIONS ##
#######################################

N_Transactions = len(orders_prior['order_id'].unique())
N_products = len(orders_prior['product_name'].unique())
#N_Transactions
display("Total number of orders: {}".format(N_Transactions))
display("Total number of different products: {}".format(N_products))

'Total number of orders: 198086'

'Total number of different products: 40477'

<h3> 3.2. Calculating Support (Individual Products):</h3>

* <b>Says how many times a product appears in customer transactions.</b>
* Support(Products) = Frequency(Products) / Total Number Of Orders.
* --> Frequency column / # of Orders (constant)
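
As a quick check of the formula, the sketch below plugs in two figures reported in this notebook's outputs: 29,255 orders containing Banana out of 198,086 orders in the sample. The variable names are illustrative. Note that the `Support` column for individual products is stored as a percentage, while the pair support computed later is a fraction.

```python
n_transactions = 198086   # total number of orders in the sample
banana_frequency = 29255  # orders that contain "Banana"

support_fraction = banana_frequency / n_transactions
support_percent = support_fraction * 100  # scale used by the Support column

print(round(support_fraction, 4), round(support_percent, 2))
```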

In [10]:
support_products = count_prods.rename(columns={'order_id': 'Frequency'})
# Support is stored as a percentage of all orders (e.g. 14.77 means 14.77%)
support_products['Support'] = support_products['Frequency'] * 100 / N_Transactions
support_products.head(n = 20)

product_name,Frequency,Support
Banana,29255,14.768838
Bag of Organic Bananas,23415,11.820623
Organic Strawberries,16285,8.221177
Organic Baby Spinach,14915,7.529558
Organic Hass Avocado,13107,6.616823
Organic Avocado,10736,5.419868
Large Lemon,9377,4.733802
Strawberries,8817,4.451097
Limes,8663,4.373353
Organic Raspberries,8425,4.253203


In [11]:
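# Keep only products that appear in more than 0.1% of orders (Support is a percentage)
frequent_products = support_products[support_products['Support'] > 0.1].index
mask = orders_total_prior['product_name'].isin(frequent_products)
display(len(orders_total_prior))
orders_total_prior = orders_total_prior[mask]
display(len(orders_total_prior))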

2000000

1276288

In [12]:
"""#######################################
##     APRIORI ALGORITHM BY DEPT     ##
#######################################

##   ORDERS WITH DEPARTMENT LISTS    ##
#######################################

orders_prior_listed = orders_total_prior.groupby('order_id')['department'].apply(list)
listed_baskets = list(orders_prior_listed)

##        APRIORI ALGORITHM          ##
#######################################

#### DATA PREP

te = TransactionEncoder()
te_ary = te.fit(listed_baskets).transform(listed_baskets)
df_support = pd.DataFrame(te_ary, columns=te.columns_)

#### APRIORI MODEL EXECUTION

df_apriori = apriori(df_support, min_support=0.001, use_colnames=True)
association_rules(df_apriori, metric="confidence", min_threshold=0.01)"""

In [13]:
#######################################
##  APRIORI ALGORITHM BY PRODUCT     ##
#######################################

##DATA PREP
###########

## PREPARING PRODUCTS INTO LISTS

basket_lists = list(orders_total_prior.groupby('order_id')['product_name'].apply(list))
product_counts = collections.defaultdict(int)

##IDENTIFYING PRODUCT PAIRS
###########################

for basket in basket_lists:
    basket.sort()
    for pair in itertools.combinations(basket, 2):
        product_counts[pair] += 1

#for product_pair, product_freq in product_counts.items():
#    print(product_pair, product_freq)

In [14]:
#SEPARATE PRODUCT TUPLES INTO COLUMNS
df_pairs = pd.DataFrame(list(product_counts.keys()))
df_pairs.columns = ['items1', 'items2']
df_pairs['values'] =  list(product_counts.values())

#SORT BY FREQUENCY AND CALCULATE SUPPORT ITEM1 --> ITEM2
df_pairs = df_pairs.sort_values(by="values", ascending=False)
df_pairs['Support_1_2'] = df_pairs['values']/N_Transactions
df_pairs.rename(columns={'values': 'Frequency'}, inplace=True)

<h3>Calculating Support (Product Pairs):</h3>

* <b>Says how often a pair of products appears together in customer transactions.</b>
* Support(Product 1, Product 2) = Frequency(Product 1, Product 2) / Total Number Of Orders.
* --> Frequency column / # of Orders (constant)
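
To illustrate how pair frequencies can be counted, here is a small sketch on hypothetical baskets. Sorting each basket first means a pair is always stored in the same order, so `('apple', 'beer')` and `('beer', 'apple')` are not counted as two different pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets (assumed data for illustration only)
baskets = [
    ["beer", "apple", "rice"],
    ["apple", "beer"],
    ["rice", "beer"],
    ["apple", "pear"],
]

pair_counts = Counter()
for basket in baskets:
    # sorted(set(...)) drops duplicates and fixes the order inside each pair
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

n_transactions = len(baskets)
pair_support = {pair: count / n_transactions for pair, count in pair_counts.items()}

print(pair_counts[("apple", "beer")])  # 2
print(pair_support[("beer", "rice")])  # 0.5
```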

In [15]:
df_pairs.sort_values(by="Support_1_2", ascending=False).head(n=5)

Unnamed: 0,items1,items2,Frequency,Support_1_2
72,Bag of Organic Bananas,Organic Hass Avocado,3808,0.019224
5302,Bag of Organic Bananas,Organic Strawberries,3799,0.019179
185,Banana,Organic Strawberries,3517,0.017755
180,Banana,Organic Avocado,3237,0.016341
559,Banana,Organic Baby Spinach,3219,0.016251


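* Imposing a minimum Support (Frequency) filters out rare pairs, so the remaining pairs are more relevant.
* The Apriori algorithm cuts off itemsets below a certain level of support: if an itemset is infrequent, every larger itemset that contains it is infrequent too, so those never need to be counted.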
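
The pruning idea can be sketched in a few lines. This is a simplified two-level illustration on hypothetical baskets, not the mlxtend implementation: items below `min_support` are discarded first, and candidate pairs are built only from the surviving items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets and threshold (assumed values for illustration)
baskets = [
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"beer", "rice"},
    {"apple", "pear"},
    {"beer", "milk"},
]
min_support = 0.4
n = len(baskets)

# Level 1: keep only items that are frequent on their own
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}

# Level 2: build candidate pairs from frequent items only
pair_counts = Counter()
for basket in baskets:
    kept = sorted(basket & frequent_items)
    pair_counts.update(combinations(kept, 2))

frequent_pairs = {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

print(sorted(frequent_items))
print(frequent_pairs)
```

Here `pear` and `milk` are dropped at the first level, so no pair containing them is ever generated or counted.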

In [16]:
# LET'S TEST THIS CONCEPT:

df_pairs[df_pairs['Support_1_2']>0.0010].sample(n=100).head(n=10) 

Unnamed: 0,items1,items2,Frequency,Support_1_2
8491,Red Peppers,Strawberries,253,0.001277
64787,Banana,Organic Russet Potato,330,0.001666
4441,Organic Baby Spinach,Strawberries,826,0.00417
21421,Organic Large Extra Fancy Fuji Apple,Organic Yellow Onion,400,0.002019
8847,Apple Honeycrisp Organic,Seedless Red Grapes,271,0.001368
8086,Organic Baby Carrots,Organic Whole Milk,357,0.001802
1349,Carrots,Organic Raspberries,316,0.001595
20946,Cucumber Kirby,Honeycrisp Apple,438,0.002211
9884,Bag of Organic Bananas,Organic Frozen Peas,330,0.001666
29902,Organic Strawberries,Yellow Bell Pepper,215,0.001085


In [17]:
#product_count.to_frame()
df_pairs = pd.merge(df_pairs, count_prods, right_on='product_name', left_on='items1')
df_pairs = pd.merge(df_pairs, count_prods, right_on='product_name', left_on='items2')

In [18]:
df_pairs

Unnamed: 0,items1,items2,Frequency,Support_1_2,order_id_x,order_id_y
0,Bag of Organic Bananas,Organic Hass Avocado,3808,0.019224,23415,13107
1,Banana,Organic Hass Avocado,2009,0.010142,29255,13107
2,Organic Baby Spinach,Organic Hass Avocado,2088,0.010541,14915,13107
3,Organic Avocado,Organic Hass Avocado,44,0.000222,10736,13107
4,Large Lemon,Organic Hass Avocado,940,0.004745,9377,13107
5,Organic Blueberries,Organic Hass Avocado,721,0.003640,6175,13107
6,Apple Honeycrisp Organic,Organic Hass Avocado,939,0.004740,5277,13107
7,Organic Garlic,Organic Hass Avocado,1102,0.005563,6762,13107
8,Limes,Organic Hass Avocado,1197,0.006043,8663,13107
9,Organic Cucumber,Organic Hass Avocado,1090,0.005503,4996,13107


In [19]:
df_pairs.rename(columns={'order_id_x': 'Freq1', 'order_id_y': 'Freq2'}, inplace=True)
df_pairs['Confidence_1_2'] = df_pairs['Support_1_2'] / (df_pairs['Freq1'] / N_Transactions)

<h3>Calculating Confidence (Product Pairs):</h3>

* <b>Says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}.</b>
* The confidence of {apple -> beer} is 3 out of 4, or 75%
* = support{apple, beer} / support{apple}
* = (3/8) / (4/8)
* = 0.75 or 75%
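
The same calculation can be checked against the notebook's own counts. The sketch below uses the frequencies shown in the output tables for Bag of Organic Bananas and Organic Hass Avocado, with illustrative variable names, and computes the confidence in both directions.

```python
n_transactions = 198086  # total number of orders in the sample
freq_pair = 3808         # orders with both products
freq_bananas = 23415     # orders with Bag of Organic Bananas
freq_avocado = 13107     # orders with Organic Hass Avocado

support_pair = freq_pair / n_transactions

# confidence{X -> Y} = support{X, Y} / support{X}
conf_bananas_to_avocado = support_pair / (freq_bananas / n_transactions)
conf_avocado_to_bananas = support_pair / (freq_avocado / n_transactions)

print(round(conf_bananas_to_avocado, 4))
print(round(conf_avocado_to_bananas, 4))
```

The notebook's `Confidence_1_2` column only covers the items1 -> items2 direction. The reverse direction is higher for this pair because the avocado is the less frequent product, so the same 3,808 shared orders make up a larger share of its orders.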

In [20]:
df_pairs[(df_pairs['Support_1_2']>0.00001)].sort_values(by='Support_1_2', ascending=False).head(10)

Unnamed: 0,items1,items2,Frequency,Support_1_2,Freq1,Freq2,Confidence_1_2
0,Bag of Organic Bananas,Organic Hass Avocado,3808,0.019224,23415,13107,0.162631
931,Bag of Organic Bananas,Organic Strawberries,3799,0.019179,23415,16285,0.162246
932,Banana,Organic Strawberries,3517,0.017755,29255,16285,0.120219
8443,Banana,Organic Avocado,3237,0.016341,29255,10736,0.110648
2094,Banana,Organic Baby Spinach,3219,0.016251,29255,14915,0.110032
2093,Bag of Organic Bananas,Organic Baby Spinach,3071,0.015503,23415,14915,0.131155
14825,Banana,Strawberries,2595,0.0131,29255,8817,0.088703
13330,Banana,Large Lemon,2526,0.012752,29255,9377,0.086344
933,Organic Hass Avocado,Organic Strawberries,2503,0.012636,13107,16285,0.190967
2886,Bag of Organic Bananas,Organic Raspberries,2480,0.01252,23415,8425,0.105915


In [21]:
df_pairs["lift"] = (df_pairs['Support_1_2']) / ((df_pairs['Freq2']/ N_Transactions)*(df_pairs['Freq1']/ N_Transactions)) 

<h3>Calculating Lift (Product Pairs):</h3>

* <b>This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is. </b>
* lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})  
* Unlike the confidence metric whose value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), lift has no direction. 
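
As a worked check of the lift formula, this sketch recomputes two values from counts that appear in the notebook's output tables. `lift` is an illustrative helper that takes raw frequencies rather than a DataFrame.

```python
n_transactions = 198086  # total number of orders in the sample

def lift(freq_pair, freq_1, freq_2, n):
    """support{1,2} / (support{1} * support{2}), computed from raw counts."""
    return (freq_pair / n) / ((freq_1 / n) * (freq_2 / n))

# Icelandic Style Skyr Blueberry Non-fat Yogurt and Non Fat Raspberry Yogurt
yogurt_lift = lift(443, 1144, 969, n_transactions)

# Bag of Organic Bananas and Organic Hass Avocado
banana_avocado_lift = lift(3808, 23415, 13107, n_transactions)

print(round(yogurt_lift, 2), round(banana_avocado_lift, 2))
```

The bananas and avocado pair is bought together far more often than the yogurt pair, yet its lift is much lower because both products are popular on their own. Swapping the two single-product frequencies leaves the result unchanged, which is what "lift has no direction" means.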

In [22]:
df_pairs[df_pairs['Support_1_2']>0.001].sort_values(by='lift', ascending=False).head(20)

Unnamed: 0,items1,items2,Frequency,Support_1_2,Freq1,Freq2,Confidence_1_2,lift
213980,Icelandic Style Skyr Blueberry Non-fat Yogurt,Non Fat Raspberry Yogurt,443,0.002236,1144,969,0.387238,79.16035
332685,Non Fat Raspberry Yogurt,Nonfat Icelandic Style Strawberry Yogurt,237,0.001196,969,630,0.244582,76.90203
386576,Icelandic Style Skyr Blueberry Non-fat Yogurt,Non Fat Acai & Mixed Berries Yogurt,232,0.001171,1144,544,0.202797,73.844277
332669,Icelandic Style Skyr Blueberry Non-fat Yogurt,Nonfat Icelandic Style Strawberry Yogurt,265,0.001338,1144,630,0.231643,72.833819
195858,Nonfat Icelandic Style Strawberry Yogurt,Vanilla Skyr Nonfat Yogurt,231,0.001166,630,1067,0.366667,68.07079
195778,Icelandic Style Skyr Blueberry Non-fat Yogurt,Vanilla Skyr Nonfat Yogurt,396,0.001999,1144,1067,0.346154,64.262634
195817,Non Fat Raspberry Yogurt,Vanilla Skyr Nonfat Yogurt,306,0.001545,969,1067,0.315789,58.625561
189737,Total 2% Lowfat Greek Strained Yogurt With Blu...,Total 2% Lowfat Greek Strained Yogurt with Peach,401,0.002024,1346,1226,0.29792,48.135183
181107,Total 2% Greek Strained Yogurt with Cherry 5.3 oz,Total 2% Lowfat Greek Strained Yogurt With Blu...,344,0.001737,1095,1346,0.314155,46.233103
85991,Total 2% Lowfat Greek Strained Yogurt With Blu...,Total 2% with Strawberry Lowfat Greek Strained...,579,0.002923,1346,1901,0.430163,44.823439


In [23]:
text = widgets.Text(
    value='Organic Raspberries',
    placeholder='Organic Raspberries|',
    description='Search:',
    disabled=False   
)

drop_down = widgets.Dropdown(
    options=list(df_pairs.sort_values(by='Freq1', ascending=False)['items1'].unique()[:1000]),
    value='Banana',
    description='Product:',
    disabled=False,
)

radio = widgets.RadioButtons(
    options=['search', 'dropdown'],
    value='dropdown',
    description='Use:',
    disabled=False
)



button = widgets.Button(
    description='Search',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
)

select = widgets.Select(
    options=['Freq2', 'Confidence_1_2', 'lift'],
    value='lift',
    # rows=10,
    description='Order:',
    disabled=False
)


slider = widgets.IntSlider(
    value=10,
    min=0,
    max=50,
    step=1,
    description='Frequency:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)



def handle_submit(sender):
    # Redraw the controls, then show recommendations for the chosen product
    clear_output()
    display(text)
    display(drop_down)
    display(radio)
    display(select)
    display(slider)
    display(button)

    # Take the search term from the text box or the dropdown, as selected
    term = text.value if radio.value == 'search' else drop_down.value
    print("Search term: {}".format(term))

    # regex=False matches the term literally (product names may contain regex characters)
    matches = df_pairs['items1'].str.contains(term, regex=False)
    popular = df_pairs['Freq2'] > slider.value
    recom = (df_pairs[matches & popular]
             .sort_values(by=select.value, ascending=False)
             [['items2', 'Support_1_2', 'Freq2', 'Confidence_1_2', 'lift']])
    display(recom)

<h3>4. Building a Recommender:</h3>

* What's the difference between ranking by Frequency, Confidence and Lift?
* What happens when I filter by department or aisle?
* What happens when the confidence values are readjusted?
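
To make the first question concrete, here is a small sketch of the ranking logic without widgets or pandas. The counts for pairs starting with Banana are copied from the tables above; `recommend` and the row layout are hypothetical names introduced for this example.

```python
N_TRANSACTIONS = 198086  # total number of orders in the sample
BANANA_FREQ = 29255      # orders that contain "Banana"

# (items2, pair frequency, Freq2) for items1 == "Banana"
banana_pairs = [
    ("Organic Strawberries", 3517, 16285),
    ("Organic Avocado", 3237, 10736),
    ("Organic Baby Spinach", 3219, 14915),
    ("Strawberries", 2595, 8817),
    ("Large Lemon", 2526, 9377),
]

def recommend(pairs, freq_1, n, by="lift", top=3):
    """Rank candidate products for one anchor product by the chosen metric."""
    rows = []
    for item, freq_pair, freq_2 in pairs:
        rows.append({
            "item": item,
            "frequency": freq_pair,
            "confidence": freq_pair / freq_1,
            "lift": (freq_pair * n) / (freq_1 * freq_2),
        })
    rows.sort(key=lambda row: row[by], reverse=True)
    return [row["item"] for row in rows[:top]]

print(recommend(banana_pairs, BANANA_FREQ, N_TRANSACTIONS, by="confidence"))
print(recommend(banana_pairs, BANANA_FREQ, N_TRANSACTIONS, by="lift"))
```

For a fixed anchor product, ranking by pair frequency and by confidence gives the same order, because confidence only divides by the anchor's own frequency. Ranking by lift discounts products that are popular with everyone, so Organic Avocado moves ahead of Organic Strawberries in this sample.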
    

In [24]:
display(text)
display(drop_down)
display(radio)
display(select)
display(slider)
display(button)
button.on_click(handle_submit)

Text(value='Organic Raspberries', description='Search:', placeholder='Organic Raspberries|')

Dropdown(description='Product:', options=('Banana', 'Bag of Organic Bananas', 'Organic Strawberries', 'Organic…

RadioButtons(description='Use:', index=1, options=('search', 'dropdown'), value='dropdown')

Select(description='Order:', index=2, options=('Freq2', 'Confidence_1_2', 'lift'), value='lift')

IntSlider(value=10, continuous_update=False, description='Frequency:', max=50)

Button(description='Search', style=ButtonStyle(), tooltip='Click me')

In [None]:
basket_lists = orders_total_prior.groupby('order_id')['product_name'].apply(list)

In [None]:
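# Product frequency relative to the most frequent product (values in (0, 1])
norm = count_prods['order_id'] / count_prods['order_id'].max()

In [None]:
import numpy as np

# Inverse frequency of each product: total orders / orders containing the product
idf = N_Transactions / count_prods['order_id']

# TF-IDF-style score: log inverse frequency weighted by normalised frequency
np.log(idf) * norm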