# Market Basket Analysis - Association Rule Mining (ARM) - Apriori

### **Motivation**:

I hope that together, we can learn the basics of association rule mining through this implementation of the apriori algorithm. Additionally, I will make this analysis as business oriented as possible so that we can potentially discover business insights hidden within the data.

### **Content**:
1. MBA: Market Basket Analysis
1. Business insight?
1. Data Exploration
1. Input Dataset
1. What does the data look like?
1. Association Rule Mining
1. Apriori
1. Measures of association
1. Data transformation
1. Data Mining
1. Discoveries and Interpretation:
1. References

### Market Basket Analysis (MBA)

In the retail sector, this analysis is used to discover patterns and relationships within transactions and transaction **itemsets** (items frequently bought together). Of all the sources I have read online, IBM puts it best:

> "Market Basket Analysis is used to increase marketing effectiveness and to improve cross-sell and up-sell opportunities by making the right offer to the right customer. For a retailer, good promotions translate into increased revenue and profits. The objectives of the market basket analysis models are to identify the next product that the customer might be interested to purchase or to browse."  [Learn more](https://www.ibm.com/support/knowledgecenter/en/SSCJHT_1.0.0/com.ibm.swg.ba.cognos.pci_oth.1.0.0.doc/c_pci_retail_mba.html) [1]

The reason I like this explanation is that the intent of running the analysis is clear: "to improve cross-sell and up-sell opportunities". With this specific goal in mind, we can ask the right questions, which is what drives insight.


### Business insight?
Right! Before we implement the algorithm just for the sake of showing off our skills, what is our goal? As discussed previously, we are here to find cross-sell and up-sell opportunities. Let's start with some general questions as a framework: what sort of relationships do we wish to discover? And then, naturally: how would discovering such relationships help the business owner's bottom line? For now, let's keep these in the back of our minds.

Let's ask something a little more specific, shall we?

*Can we get rid of a product 'X' because it is sold infrequently?* A business owner may want to drop a slow-selling product to save the cost and overhead associated with it. But if X belongs to an itemset {X, Y} in which X and Y are complements, the decision is not so straightforward: removing X may also hurt the sales of Y.

### Data Exploration : 
Let's do an initial exploration to see what the data looks like and if we can think of some rules that might be intuitive. After all, if we were to analyze McDonald's transactional data, would you be thrilled to find out that burgers are related to fries?

In [None]:
# Including usual required libraries
import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

### Input Dataset

In [None]:
# Read data set from CSV file
data = pd.read_csv('../input/BreadBasket_DMS.csv')
# Initial snapshot
display(data.head(10))

### What does the data look like?
What is the frequency of the top selling items?

In [None]:
items = data['Item'].value_counts().index
item_frequency = data['Item'].value_counts().values
plt.figure(figsize=(12,6))
plt.xlabel('Items', fontsize='15')
plt.ylabel('Frequency', fontsize='15')
plt.title('Top 10 Selling Items', fontsize='15')
plt.bar(items[:10],item_frequency[:10], width = 0.7, color="gray",linewidth=0.4)
plt.show()

### Association Rule Mining (ARM)
An association rule is an implication expression of the form X→Y, where X and Y are disjoint itemsets [1].
Association rules can be built from itemsets of any size. However, in this particular application, given the size of the transactions and the number of significant items, we will only consider itemsets of size 2.
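
One way to check whether size-2 itemsets are a reasonable choice is to look at how many items each transaction contains. The sketch below does this on a small made-up DataFrame; it only assumes the two column names used in this notebook (`Transaction` and `Item`), and the rows are illustrative, not taken from the bakery data.

```python
import pandas as pd

# Toy stand-in with the same two columns as the bakery data (rows are made up)
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3, 3, 4],
    "Item": ["Bread", "Coffee", "Coffee", "Tea", "Cake", "Coffee", "Bread"],
})

# Number of item rows in each transaction
basket_sizes = toy.groupby("Transaction")["Item"].size()

# How many transactions have 1, 2, 3, ... items
size_distribution = basket_sizes.value_counts().sort_index()
print(size_distribution)

# Share of transactions large enough to contain at least one pair
share_with_pairs = (basket_sizes >= 2).mean()
print("Share of baskets with 2+ items: {:.2f}".format(share_with_pairs))
```

If most baskets turn out to hold only a handful of items, larger itemsets will rarely reach any useful support, which is the intuition behind stopping at pairs.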

 ### Finding our candidate itemsets: Apriori algorithm
 Quick explanation: The algorithm is used to generate the candidate itemsets that we will evaluate later on. The apriori approach iterates through the items and finds all the itemsets that meet your chosen threshold for being "frequent". It relies on the fact that an itemset can only be frequent if every item in it is frequent, so infrequent items can be discarded before any pairs are built.
 
 Not so quick explanation: This 2-itemset implementation using Python generators was done by 'datathèque', and you can find out more about it [here](https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis) [2].
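
To make the idea concrete, here is a minimal sketch of the apriori pruning step for pairs, written with plain Python collections rather than the generator-based implementation used later. The baskets and the `min_support` value are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (made up for illustration)
baskets = [
    {"Coffee", "Bread"},
    {"Coffee", "Cake"},
    {"Coffee", "Bread", "Cake"},
    {"Tea"},
    {"Coffee", "Bread"},
]
min_support = 0.4  # hypothetical threshold
n = len(baskets)

# Pass 1: count single items and keep only the frequent ones
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}

# Pass 2: build candidate pairs from frequent items only.
# A pair containing an infrequent item can never be frequent itself.
pair_counts = Counter()
for basket in baskets:
    kept = sorted(basket & frequent_items)  # sorted so (A, B) and (B, A) coincide
    pair_counts.update(combinations(kept, 2))

# Keep the pairs whose own support meets the threshold
frequent_pairs = {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}
print(frequent_pairs)
```

Here 'Tea' is dropped in the first pass, so no pair involving it is ever counted. On a real dataset this pruning is what keeps the number of candidate pairs manageable.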

### Measures of association:
After our itemsets are generated, we have to determine which associations are strong and which ones are weak. (Again, if you would like to read about these measures in more detail, [click here](http://michael.hahsler.net/research/association_rules/measures.html#) [3].)

**Support**: Support estimates the probability of an itemset by dividing the number of orders in which the itemset occurs by the total number of qualifying orders (orders with 2 or more items in them). Therefore, support is the probability of an itemset occurring among transactions that have 2 or more items, and it can be expressed as follows: Support(X and Y) = P( X ∩ Y )

**Confidence**: Confidence goes a step further and expresses the probability of item Y being purchased given that item X was purchased, P( Y | X ). It is calculated as follows: Conf( X -> Y ) = supp( X and Y ) / supp( X ) = P( Y | X ). Confidence is not symmetric, since P( Y | X ) generally differs from P( X | Y ), and it tends to be high whenever Y is simply a popular item. This is why the lift metric needs to be introduced.
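
As a quick worked example, the sketch below computes support and confidence on a handful of made-up baskets. The helper names `support` and `confidence` are my own, and for simplicity the toy version divides by all baskets rather than only those with 2 or more items.

```python
# Toy baskets (made up for illustration)
baskets = [
    {"Coffee", "Toast"},
    {"Coffee", "Toast"},
    {"Coffee"},
    {"Coffee", "Cake"},
    {"Tea"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

supp_pair = support({"Coffee", "Toast"}, baskets)
conf_toast_to_coffee = confidence({"Toast"}, {"Coffee"}, baskets)
conf_coffee_to_toast = confidence({"Coffee"}, {"Toast"}, baskets)

print("supp(Coffee, Toast)   =", supp_pair)
print("conf(Toast -> Coffee) =", conf_toast_to_coffee)
print("conf(Coffee -> Toast) =", conf_coffee_to_toast)
```

In this toy data every basket with toast also has coffee, but only half of the coffee baskets have toast, so the two confidence values differ even though they share the same support.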

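**Lift**: Lift compares how often X and Y are bought together with how often we would expect them to be bought together if they were independent. Unlike confidence, it is symmetric:

Lift( X ⇒ Y ) = Lift( Y ⇒ X ) = conf( X ⇒ Y ) / supp( Y ) = conf( Y ⇒ X ) / supp( X ) = P( X ∩ Y ) / ( P( X ) * P( Y ) )

If lift is greater than 1, the items are bought together more often than independence would predict and are said to be complements. A lift of exactly 1 means the items are independent, and a lift below 1 suggests that they tend to substitute for each other. [3]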
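
The three cases (lift above 1, equal to 1, and below 1) can be seen in a small worked example. The baskets below are made up so that the arithmetic comes out in round numbers, and the helper names are my own.

```python
# Toy baskets (made up so the numbers are easy to follow)
baskets = [
    {"Coffee", "Cake", "Toast"},
    {"Coffee", "Cake", "Toast"},
    {"Coffee", "Tea"},
    {"Coffee"},
    {"Cake", "Tea"},
    {"Cake", "Tea"},
    {"Tea"},
    {"Bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def lift(x, y, baskets):
    """P(X and Y) / (P(X) * P(Y)); symmetric in x and y."""
    return support({x, y}, baskets) / (support({x}, baskets) * support({y}, baskets))

lift_toast = lift("Coffee", "Toast", baskets)  # bought together more than expected
lift_cake = lift("Coffee", "Cake", baskets)    # exactly what independence predicts
lift_tea = lift("Coffee", "Tea", baskets)      # bought together less than expected

print(lift_toast, lift_cake, lift_tea)
print(lift("Toast", "Coffee", baskets) == lift_toast)  # same value in both directions
```

With these baskets the three lifts work out to 2.0, 1.0 and 0.5 respectively.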





In [None]:
#### A. Helper functions to the main association rules function #### 
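# Returns frequency counts for items and item pairs
def freq(iterable):
    # A pandas Series can be counted directly; any other iterable
    # (an Index, a generator of item pairs) goes through collections.Counter
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else:
        return pd.Series(Counter(iterable)).rename("freq")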

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


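# Returns generator that yields item pairs, one at a time.
# itertools.groupby only groups consecutive rows, so order_item must
# already be ordered by order id.
def get_item_pairs(order_item):
    # .as_matrix() was removed in pandas 1.0; .to_numpy() replaces it
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]

        for item_pair in combinations(item_list, 2):
            yield item_pair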
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]

In [None]:
#### B. Association rules function ####
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item)


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item)


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders)

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

### Data Transformation:
Our program expects the data as a pandas Series: essentially a single column containing the transaction items, indexed by the transaction number. We can accomplish that as follows:
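
One detail worth keeping in mind: the `get_item_pairs` helper groups rows with `itertools.groupby`, which only merges consecutive rows that share a key. The sketch below, on a made-up three-row series, shows what happens when the rows of one transaction are not contiguous and how sorting the index fixes it. The helper `baskets_from` is hypothetical and exists only for this illustration.

```python
from itertools import groupby

import pandas as pd

# Toy series in which transaction 1 appears in two separate runs
unsorted_orders = pd.Series(
    ["Coffee", "Tea", "Bread"],
    index=pd.Index([1, 2, 1], name="Transaction"),
    name="Item",
)

def baskets_from(series):
    """Group (transaction, item) rows the way itertools.groupby sees them."""
    rows = series.reset_index().to_numpy()
    return [
        (key, [row[1] for row in group])
        for key, group in groupby(rows, key=lambda row: row[0])
    ]

# Without sorting, transaction 1 is split into two separate groups
split_baskets = baskets_from(unsorted_orders)
print(split_baskets)

# A stable sort on the index brings each transaction's rows together
sorted_orders = unsorted_orders.sort_index(kind="mergesort")
whole_baskets = baskets_from(sorted_orders)
print(whole_baskets)
```

If the input file is already ordered by transaction number this makes no difference, but sorting first is a cheap safeguard.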

In [None]:
#Transforming original dataframe into the expected format: Series
orders_series = data.set_index('Transaction')['Item']
#Display series as dataframe for cosmetic purposes
display(pd.DataFrame(orders_series.head(10))) 

### Data Mining:

In [None]:
#Generates rules from the itemsets that meet the minimum support threshold
rules = association_rules(orders_series, 0.01)

#### Displaying output data: Sorted by lift


In [None]:
#Show output sorted by lift
display(rules.head(5))

#### Sorted by confidenceBtoA:

In [None]:
display(rules.sort_values('confidenceBtoA', ascending=False).head(10))

### Discoveries and Interpretation:
As we would expect, several itemsets contain the most frequently sold item (coffee). However, note that these do not have a lift greater than 1 (except for coffee and toast). We can interpret this as follows:

Itemsets in the form {'Coffee', ' X ' }: Most of these itemsets have a high 'confidenceBtoA'. In other words, given that a customer has bought item X, there is a high probability that the same customer buys coffee as well. But this is not true the other way around: buying coffee does not imply buying any other particular item, as 'confidenceAtoB' is usually less than 0.1.

Potential business solution: A simple option is to act at the point of sale. For anyone buying coffee and nothing else, the bakery may want to offer an incentive or promotion on a second item, preferably one that already has a relatively high 'confidenceAtoB'. Such an item is already bought alongside coffee more often than the others, which should make the promotion more effective.

Itemsets with lift greater than 1: I personally find these very interesting: {Hot chocolate, Cookies}, {Sandwich, Sandwich}, {Hot chocolate, Cake}, {Coffee, Toast}. In my opinion, bundling these itemsets into combos with a discount could be very useful for the business owner. The only variation I would suggest is for {Sandwich, Sandwich}: I would pair these two items with beverages, since it seems reasonable that customers who buy this itemset would also be a little thirsty.
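
For anyone who wants to pull these two groups out of the rules table programmatically, here is a minimal sketch. It uses a made-up table with the same column names as the output above; the item labels, the numbers and both thresholds are illustrative assumptions, not results from the bakery data.

```python
import pandas as pd

# Made-up rows using the same column names as the rules table;
# the values are illustrative only, not results from the bakery data.
toy_rules = pd.DataFrame({
    "item_A": ["P", "P", "Q"],
    "item_B": ["Q", "R", "R"],
    "confidenceAtoB": [0.10, 0.06, 0.40],
    "confidenceBtoA": [0.60, 0.55, 0.25],
    "lift": [0.95, 1.30, 1.80],
})

# Candidate bundles: pairs bought together more often than independence predicts
bundles = toy_rules[toy_rules["lift"] > 1].sort_values("lift", ascending=False)

# One-directional rules: B strongly implies A, but A rarely implies B
# (both cut-offs are hypothetical and would need tuning on real data)
asymmetric = toy_rules[
    (toy_rules["confidenceBtoA"] >= 0.5) & (toy_rules["confidenceAtoB"] <= 0.1)
]

print(bundles)
print(asymmetric)
```

The first filter corresponds to the combo idea and the second to the point-of-sale promotion for customers who buy only the popular item.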

Let me know how you would implement these discoveries in the comment section.

Camilo,


#### References
1. [1] Market Basket Analysis URL: https://www.ibm.com/support/knowledgecenter/en/SSCJHT_1.0.0/com.ibm.swg.ba.cognos.pci_oth.1.0.0.doc/c_pci_retail_mba.html
1. [2] Association Rule Mining/Market Basket Analysis - URL: https://www.kaggle.com/datatheque/association-rules-mining-market-basket-analysis
1. [3] Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules, 2015, URL: http://michael.hahsler.net/research/association_rules/measures.html