 # Calculating Market Basket Analysis

### 1. What is Market Basket Analysis?
### 2. How is MBA calculated?
### 3. Doing MBA with Python.

# 1. What is Market Basket Analysis (MBA)?

<h3>So what is MBA?</h3>
<ul>
    <li>It's a group of techniques often used in retail to uncover <b>associations</b> between items.</li>
    <li>If a customer buys a particular product, they are likely to buy related goods that complement it.</li>
</ul>

<img src="https://cdn-images-1.medium.com/max/800/1*YSDKzjONGi1xB6ub2gidXw.jpeg" alt="Recommender">

<h3>Basically, MBA will tell us things like:</h3>

<ul>
    <li>If bananas are purchased then avocados are also purchased</li>
    <li> If both milk and bread are purchased then eggs are purchased 50% of the time.</li>
</ul>

<h3>What is MBA used for?</h3>

<ul>
    <li>Allowing retailers to make promotional product bundles and send coupons.</li>
    <li>Improving store layout to optimise customer flow and cross-selling, and content placement in e-commerce.</li>
    <li>Inventory planning and pricing. </li>
    <li>Creating product segments/groups - Category management.</li>
    <li>Identifying product gaps in baskets.</li>
    <li>Building E-Commerce recommendation engines.</li>
    <li>Fraud detection.</li>
</ul>

# 2. How is MBA calculated?

<h3>We will be using the Apriori algorithm to calculate MBA</h3>

<ul>
    <li>Classic Data Mining algorithm.</li>
    <li>Allows us to find frequent itemsets and association rules.</li>
    <li>It's also used outside retail, for example in healthcare, to identify relationships between drugs and adverse reactions.</li>
</ul>
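To make "frequent itemsets" concrete, here is a minimal pure-Python sketch of the level-wise search Apriori is known for: keep only the itemsets that reach a minimum support, then build larger candidates from the survivors. The function name `frequent_itemsets`, the toy baskets, and the 0.5 threshold are illustrative assumptions and are not taken from the dataset used later in this notebook.

```python
from itertools import combinations

# Hypothetical toy baskets, used only for illustration.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: single items that are frequent on their own.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


found = frequent_itemsets(transactions, min_support=0.5)
for itemset, value in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With these baskets and this threshold, four single items and one pair ({beer, rice}) survive. The pruning step is what keeps the search manageable: a candidate is only counted if all of its smaller subsets were already frequent.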

<h3>The algorithm has three basic steps:</h3>

<ol>
    <li>Calculate support.</li>
    <li>Calculate confidence.</li>
    <li>Calculate lift.</li>
</ol>

<h3>Support:</h3>

<p>Says how often an item (or set of items) appears in customer transactions, expressed as a fraction of all transactions.</p>
<img src="https://annalyzin.files.wordpress.com/2016/03/association-rule-support-eqn.png?w=248&h=68" alt="Recommender">
<img src="https://annalyzin.files.wordpress.com/2016/04/association-rule-support-table.png?w=503&h=447" alt="Recommender">
<ul>
    <li>Support {apple} 4/8 or 50%</li>
    <li>Support {apple, beer, rice} 2/8 or 25%</li>
</ul>

<a style = "font-size:16px" href=https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html>Source</a>
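As a rough illustration of the support calculation, the sketch below counts how many baskets contain every item of an itemset and divides by the total number of baskets. The eight baskets are a hypothetical toy set, chosen so that the results agree with the figures quoted above (50% for {apple}, 25% for {apple, beer, rice}); the function name `support` is an assumption as well.

```python
# Hypothetical toy baskets, picked to match the percentages quoted above.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


print(support({"apple"}, transactions))                  # 0.5
print(support({"apple", "beer", "rice"}, transactions))  # 0.25
```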

<h3>Confidence:</h3>

<p>Says how likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}.</p>
<p>The confidence of {apple -> beer} is 3 out of 4, or 75%</p>

<img src="https://annalyzin.files.wordpress.com/2016/03/association-rule-confidence-eqn.png?w=527&h=77" alt="Recommender">

<p style = "font-size:20px">One drawback of the confidence measure is that it might misrepresent the importance of an association. This is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general, there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the confidence measure. <b>To account for the base popularity of both constituent items, we use a third measure called lift.</b></p>

<a style = "font-size:16px" href=https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html>Source</a>
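The asymmetry of confidence is easier to see in code. The sketch below restricts the baskets to those containing X and then measures how many of them also contain Y. The baskets are the same kind of hypothetical toy data (chosen so that {apple -> beer} comes out at 75%), and `confidence` is an illustrative name.

```python
# Hypothetical toy baskets, picked so that {apple -> beer} is 3 out of 4.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]


def confidence(x, y, transactions):
    """Share of the baskets containing `x` that also contain `y`."""
    x, y = set(x), set(y)
    with_x = [basket for basket in transactions if x <= basket]
    if not with_x:
        return 0.0  # X never occurs, so the rule cannot be evaluated
    return sum(1 for basket in with_x if y <= basket) / len(with_x)


print(confidence({"apple"}, {"beer"}, transactions))  # 0.75
print(confidence({"beer"}, {"apple"}, transactions))  # 0.5
```

The two directions differ because the denominator changes: four baskets contain apples, but six contain beer. In this toy data beer is in 75% of all baskets, so a 75% confidence for {apple -> beer} says nothing beyond beer's base popularity.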

<h3>Lift:</h3>

<p>This says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is.</p>
<p>The lift of {apple -> beer} is the confidence of {apple -> beer} (75%) divided by the support of {beer}.</p>

<img src="https://annalyzin.files.wordpress.com/2016/03/association-rule-lift-eqn.png?w=566&h=80" alt="Recommender">

<ul>
    <li>lift <b>equal</b> to 1 implies no relationship between X and Y (i.e. X and Y occur together only as often as chance would predict).</li>
    <li>lift <b>more than</b> 1 implies a positive relationship between X and Y (i.e. X and Y occur together more often than chance would predict).</li>
    <li>lift <b>less than</b> 1 implies a negative relationship between X and Y (i.e. X and Y occur together less often than chance would predict).</li>
</ul>

<a style = "font-size:16px" href=https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html>Source</a>
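The three cases can be checked on a small example. This sketch computes lift as the joint support divided by the product of the individual supports, again on hypothetical toy baskets; the names `support` and `lift` are illustrative assumptions.

```python
# Hypothetical toy baskets, for illustration only.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)


def lift(x, y, transactions):
    """support(X and Y) / (support(X) * support(Y))."""
    x, y = set(x), set(y)
    baseline = support(x, transactions) * support(y, transactions)
    if baseline == 0:
        return 0.0  # one side never occurs, so lift is undefined; report 0
    return support(x | y, transactions) / baseline


print(lift({"apple"}, {"beer"}, transactions))  # 1.0: no relationship
print(lift({"beer"}, {"rice"}, transactions))   # above 1: positive relationship
print(lift({"apple"}, {"milk"}, transactions))  # 0.0: never bought together
```

Unlike confidence, lift is symmetric: swapping X and Y leaves the value unchanged, because both supports appear in the denominator.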

# 3. Doing MBA with Python.

In [None]:
# Affinity analysis
# TFIDF
# Complementary
# Substitutive
# Clustering -- PCA?
# Collaborative filtering
# W2vec Vectors
# Latent factor 

You might want to show bananas at the beginning, but not in the product card.

In [11]:
import sys
import collections
import itertools
import pandas as pd
from ipywidgets import widgets
from collections import Counter
from IPython.display import display
from IPython.display import clear_output
from itertools import combinations, groupby
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

In [8]:
########################
## OBTAINING PRODUCTS ##
########################

products = pd.read_csv("products.csv")
order_products_prior = pd.read_csv("order_products__prior.csv")
departments = pd.read_csv("departments.csv")
#sample_submission = pd.read_csv("../input/sample_submission.csv")
#orders = pd.read_csv("../input/orders.csv")
aisles = pd.read_csv("aisles.csv")
#order_products_train = pd.read_csv("../input/order_products__train.csv")

In [9]:
########################
## OBTAINING SAMPLE   ##
########################

order_products_prior.count()
prods = order_products_prior[:100000]
orders_prior = pd.merge(products, prods, on="product_id")
#orders_prior_listed = orders_prior.groupby("order_id")["product_name"].apply(list)

In [13]:
#######################################
## MERGE PRODUCTS AND DEP/AISLE INFO ##
#######################################

orders_deps_prior = pd.merge(orders_prior, departments, on="department_id")
orders_total_prior = pd.merge(orders_deps_prior, aisles, on="aisle_id")
orders_total_prior.drop(['department_id', 'aisle_id'], inplace=True, axis=1)

## MOST SOLD PRODUCTS
######################################
count_prods = orders_total_prior.groupby("product_name")['order_id'].count()
count_prods = count_prods.to_frame().sort_values(by="order_id", ascending=False)

## The dataset we're using:
* Order information (order ID, product ordered, add-to-cart position, reorder flag, department, aisle)

In [16]:
display(orders_total_prior.head())

Unnamed: 0,product_id,product_name,order_id,add_to_cart_order,reordered,department,aisle
0,1,Chocolate Sandwich Cookies,1107,7,0,snacks,cookies cakes
1,1,Chocolate Sandwich Cookies,5319,3,1,snacks,cookies cakes
2,1,Chocolate Sandwich Cookies,7540,4,1,snacks,cookies cakes
3,1,Chocolate Sandwich Cookies,9228,2,0,snacks,cookies cakes
4,1,Chocolate Sandwich Cookies,9273,30,0,snacks,cookies cakes


In [5]:
#######################################
##  OBTAINING NUMBER OF TRANSACTIONS ##
#######################################

N_Transactions = len(orders_prior['order_id'].unique())
#N_Transactions

#######################################
##     APRIORI ALGORITHM BY DEPT     ##
#######################################

##   ORDERS WITH DEPARTMENT LISTS    ##
#######################################

orders_prior_listed = orders_total_prior.groupby('order_id')['department'].apply(list)
listed_baskets = list(orders_prior_listed)

##        APRIORI ALGORITHM          ##
#######################################

#### DATA PREP

te = TransactionEncoder()
te_ary = te.fit(listed_baskets).transform(listed_baskets)
df_support = pd.DataFrame(te_ary, columns=te.columns_)

#### APRIORI MODEL EXECUTION

df_apriori = apriori(df_support, min_support=0.001, use_colnames=True)
association_rules(df_apriori, metric="confidence", min_threshold=0.01)
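The data-prep step above turns each basket into a row of True/False flags, one column per distinct item. The sketch below is a rough pure-Python picture of that kind of table, not `mlxtend`'s implementation; the function name `one_hot` and the sample department baskets are assumptions for illustration.

```python
def one_hot(baskets):
    """Return (columns, rows): one boolean per distinct item for each basket."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = []
    for basket in baskets:
        present = set(basket)  # repeated items collapse to a single flag
        rows.append([item in present for item in columns])
    return columns, rows


# Hypothetical department baskets; note the repeated "produce" in the first one.
baskets = [
    ["produce", "dairy eggs", "produce"],
    ["snacks"],
    ["dairy eggs", "snacks"],
]

columns, rows = one_hot(baskets)
print(columns)
for row in rows:
    print(row)
```

Grouping by department means the same department can appear several times in one order. In a boolean table that repetition disappears, so support here measures "orders that include the department at least once", not the number of items bought from it.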

In [6]:
#######################################
##  APRIORI ALGORITHM BY PRODUCT     ##
#######################################

##DATA PREP
###########

## PREPARING PRODUCTS INTO LISTS

basket_lists = list(orders_total_prior.groupby('order_id')['product_name'].apply(list))
product_counts = collections.defaultdict(int)

##IDENTIFYING PRODUCT PAIRS
###########################

for basket in basket_lists:
    basket.sort()
    for pair in itertools.combinations(basket, 2):
        product_counts[pair] += 1

#for product_pair, product_freq in product_counts.items():
#    print(product_pair, product_freq)
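The pair-counting loop can be tried on a handful of baskets to see why each basket is sorted first: sorting gives every unordered pair a single spelling, so ("bread", "milk") and ("milk", "bread") land in the same counter. This sketch uses `collections.Counter` and adds a `set()` guard against repeated items, which is an addition for illustration and not part of the cell above; the baskets and the name `count_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes repeats and fixes the order inside each pair
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical baskets, for illustration only.
baskets = [
    ["milk", "bread", "eggs"],
    ["bread", "milk"],
    ["eggs", "bread"],
]

pair_counts = count_pairs(baskets)
for pair, freq in pair_counts.most_common():
    print(pair, freq)
```

A basket with n distinct items yields n(n-1)/2 pairs, so very large baskets dominate the running time.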

In [7]:
#SEPARATE PRODUCT TUPLES INTO COLUMNS
df_pairs = pd.DataFrame(list(product_counts.keys()))
df_pairs.columns = ['items1', 'items2']
df_pairs['values'] =  list(product_counts.values())

#SORT BY FREQUENCY AND CALCULATE SUPPORT ITEM1 --> ITEM2
df_pairs = df_pairs.sort_values(by="values", ascending=False)
df_pairs['Support_1_2'] = df_pairs['values']/N_Transactions

In [8]:
#product_count.to_frame()
df_pairs = pd.merge(df_pairs, count_prods, right_on='product_name', left_on='items1')
df_pairs = pd.merge(df_pairs, count_prods, right_on='product_name', left_on='items2')
df_pairs.head()

Unnamed: 0,items1,items2,values,Support_1_2,order_id_x,order_id_y
0,Limes,Michigan Organic Kale,2,2e-05,431,204
1,Large Lemon,Michigan Organic Kale,1,1e-05,475,204
2,100% Mighty Mango Juice Smoothie,Michigan Organic Kale,1,1e-05,12,204
3,Limes,Organic Greek Nonfat Yogurt With Mixed Berries,1,1e-05,431,24
4,Dairy Free Coconut Milk Yogurt Alternative,Organic Greek Nonfat Yogurt With Mixed Berries,1,1e-05,10,24


In [9]:
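# Confidence {items1 -> items2} = support(items1, items2) / support(items1)
df_pairs['confidence'] = df_pairs['Support_1_2'] / (df_pairs['order_id_x'] / N_Transactions)
# Lift = support(items1, items2) / (support(items1) * support(items2))
df_pairs["lift"] = df_pairs['Support_1_2'] / ((df_pairs['order_id_x'] / N_Transactions) * (df_pairs['order_id_y'] / N_Transactions))
df_pairs.rename(columns={'order_id_x': 'Support1', 'order_id_y': 'Support2'}, inplace=True)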
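Because each pair is stored once in sorted order, a lookup on `items1` alone only sees the rules that start from the alphabetically earlier product. One possible way around this, sketched here in plain Python under the definitions from section 2, is to emit a rule for each direction: confidence differs between the two, while support and lift are shared. The function name `pair_rules`, the dictionary keys, and the sample counts are all hypothetical.

```python
def pair_rules(pair_counts, item_counts, n_transactions):
    """Build one rule per direction from pair and single-item counts."""
    rules = []
    for (a, b), together in pair_counts.items():
        support_ab = together / n_transactions
        support_a = item_counts[a] / n_transactions
        support_b = item_counts[b] / n_transactions
        lift = support_ab / (support_a * support_b)  # same in both directions
        for antecedent, consequent, support_ante in ((a, b, support_a), (b, a, support_b)):
            rules.append({
                "antecedent": antecedent,
                "consequent": consequent,
                "support": support_ab,
                "confidence": support_ab / support_ante,
                "lift": lift,
            })
    return rules


# Hypothetical counts: 4 orders, bread in 3 of them, milk in 2, both in 2.
rules = pair_rules({("bread", "milk"): 2}, {"bread": 3, "milk": 2}, n_transactions=4)
for rule in rules:
    print(rule)
```

In this example every order with milk also has bread, so {milk -> bread} has confidence 1.0, while {bread -> milk} is lower because bread also appears without milk.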

In [10]:
df_pairs[(df_pairs['Support_1_2']>0.00001)].sort_values(by='Support_1_2', ascending=False)

Unnamed: 0,items1,items2,values,Support_1_2,Support1,Support2,confidence,lift
0,Limes,Michigan Organic Kale,2,0.00002,431,204,0.004640,0.009804
198,Apple Honeycrisp Organic,Organic Russet Potato,2,0.00002,271,101,0.007380,0.019802
45,Banana,Organic Fuji Apple,2,0.00002,1410,270,0.001418,0.007407
1471,Aluminum Foil,Comice Pear,1,0.00001,32,11,0.031250,0.090909
1473,All Whites 100% Egg Whites,Celery,1,0.00001,19,28,0.052632,0.035714
1474,All Whites 100% Egg Whites,Solid White Albacore Premium Tuna in Water,1,0.00001,19,4,0.052632,0.250000
1475,Tomato Basil Bisque RTS Organic Soup,Whole Grain CountryWild Rice,1,0.00001,2,1,0.500000,1.000000
1476,Coconut Fruit Bars,Original Life Cereal,1,0.00001,10,10,0.100000,0.100000
1477,Glory's Sweet Cherry Tomatoes,Original Life Cereal,1,0.00001,6,10,0.166667,0.100000
1478,Coconut Fruit Bars,Plain Better Than Cream Cheese,1,0.00001,10,10,0.100000,0.100000


In [11]:
text = widgets.Text(
    value='Organic Raspberries',
    placeholder='Organic Raspberries|',
    description='Search:',
    disabled=False   
)

drop_down = widgets.Dropdown(
    options=list(count_prods[:200].index),
    value='Banana',
    description='Product:',
    disabled=False,
)

radio = widgets.RadioButtons(
    options=['search', 'dropdown'],
    value='dropdown',
    description='Use:',
    disabled=False
)



button = widgets.Button(
    description='Search',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
)

select = widgets.Select(
    options=['Support2', 'confidence', 'lift'],
    value='lift',
    # rows=10,
    description='Order:',
    disabled=False
)

def handle_submit(sender):
    clear_output()
    display(text)
    display(drop_down)
    display(radio)
    display(select)
    display(button)

    
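    if radio.value == 'search':
        print("Search term: {}".format(text.value))
        # Match the search term literally, then sort before displaying
        matches = df_pairs[df_pairs['items1'].str.contains(text.value, regex=False)]
        display(matches.sort_values(by=select.value, ascending=False)['items2'])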
    else: 
        print("Search term: {}".format(drop_down.value))
        recom = df_pairs[df_pairs['items1'] == drop_down.value].sort_values(by=select.value, ascending=False)['items2']
        display(recom)
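The lookup inside the handler can also be written as a plain function, which makes it testable without a running notebook or any widgets. This is a minimal sketch over a list of dictionaries, not a replacement for the DataFrame code; the name `recommend`, its parameters, and the sample rows are hypothetical.

```python
def recommend(rules, product, order_by="lift", top=3):
    """Return up to `top` partner items for `product`, best `order_by` first."""
    matches = [row for row in rules if row["items1"] == product]
    matches.sort(key=lambda row: row[order_by], reverse=True)
    return [row["items2"] for row in matches[:top]]


# Hypothetical rows with made-up metric values, for illustration only.
rules = [
    {"items1": "Banana", "items2": "Organic Fuji Apple", "confidence": 0.10, "lift": 1.2},
    {"items1": "Banana", "items2": "Limes", "confidence": 0.30, "lift": 0.9},
    {"items1": "Banana", "items2": "Celery", "confidence": 0.05, "lift": 2.5},
    {"items1": "Limes", "items2": "Michigan Organic Kale", "confidence": 0.20, "lift": 1.1},
]

print(recommend(rules, "Banana"))
print(recommend(rules, "Banana", order_by="confidence", top=2))
```

Ranking by lift and ranking by confidence can give quite different lists: confidence favours partners that are popular in general, while lift favours partners that are unusually common alongside the chosen product.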

In [12]:
display(text)
display(drop_down)
display(radio)
display(select)
display(button)
button.on_click(handle_submit)

Text(value='Organic Raspberries', description='Search:', placeholder='Organic Raspberries|')

Dropdown(description='Product:', options=('Banana', 'Bag of Organic Bananas', 'Organic Strawberries', 'Organic…

RadioButtons(description='Use:', index=1, options=('search', 'dropdown'), value='dropdown')

Select(description='Order:', index=2, options=('Support2', 'confidence', 'lift'), value='lift')

Button(description='Search', style=ButtonStyle(), tooltip='Click me')

Search term: Banana


95                 Organic Peach Icelandic Nonfat Yogurt
122    Organic Yummy Tummy Maple & Brown Sugar Instan...
189                            Original Creole Seasoning
171                                       Chopped Ginger
72                                         Jicama Sticks
76                          Kix Crispy Corn Puffs Cereal
100                     Chocolate Fudge High Protein Bar
182                                      Lorraine Quiche
188                                          Onion Rings
172                            Geranium Liquid Dish Soap
148                                Yellow Tortilla Chips
191                                 Dark Chocolate Chips
97                      Feta Crumbled Traditional Cheese
192                                    Thai Coconut Soup
96      Gluten Free Cheddar Macaroni & Cheese Rice Pasta
98     Healthy Grains Oats & Honey Clusters with Toas...
51          Organic Whole Grain Oatmeal Cereal Baby Food
163                            

In [31]:
basket_lists = orders_total_prior.groupby('order_id')['product_name'].apply(list)

In [67]:
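# Normalise product counts by the most frequent product (kept as a Series, not a list)
counts = count_prods['order_id']
norm = counts / counts.max()

In [68]:
import numpy as np

# Inverse frequency of each product across transactions
idf = N_Transactions / counts

# TF-IDF-style weight per product; both operands are Series sharing the same index
np.log(idf) * norm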
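The weighting in the cell above combines a normalised product count with a log inverse frequency, in the spirit of TF-IDF. The sketch below shows the same arithmetic with the standard library only, so the effect of each factor is visible on a few numbers. The function name `tfidf_weights` and the counts are hypothetical.

```python
import math


def tfidf_weights(product_counts, n_transactions):
    """Weight = (count / max count) * log(n_transactions / count)."""
    max_count = max(product_counts.values())
    return {
        product: (count / max_count) * math.log(n_transactions / count)
        for product, count in product_counts.items()
    }


# Hypothetical counts out of 100 orders, for illustration only.
counts = {"Bread": 100, "Banana": 50, "Kale": 5}
weights = tfidf_weights(counts, n_transactions=100)
for product, weight in weights.items():
    print(product, round(weight, 4))
```

A product that appears in every order gets a weight of zero, because log(1) = 0: it carries no information about which basket it belongs to. Very rare products are damped by the normalised count instead, so the highest weights go to products that are common but not universal.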