## What Is Market Basket Analysis?



* __<font size="4">Market basket analysis (also known as association rule mining) is a method of exploring customer buying patterns by revealing associations in stores' transaction databases (Chen et al., 2005).</font>__
* __<font size="4">It was first introduced in the article "Mining association rules between sets of items in large databases" by Rakesh Agrawal, Tomasz Imieliński, and Arun Swami (1993).</font>__
* __<font size="4">According to Agrawal et al., it can be stated as follows. Let X and Y be sets of products in a market. An association rule X → Y states that a customer who buys the items in X is also likely to buy the items in Y.</font>__



* Chen, Y. L., Tang, K., Shen, R. J., & Hu, Y. H. (2005). Market basket analysis in a multiple store environment. Decision support systems, 40(2), 339-354.
* Agrawal, R., Imieliński, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data (pp. 207-216).

__<font size="4">To conduct the analysis, we need three metrics, where Freq(X, Y) is the number of transactions that contain both X and Y, and N is the total number of transactions:</font>__
   * __<font size="4">Support: Support(X, Y) = Freq(X, Y) / N</font>__
   * __<font size="4">Confidence: Confidence(X, Y) = Freq(X, Y) / Freq(X)</font>__
   * __<font size="4">Lift: Lift(X, Y) = Support(X, Y) / (Support(X) * Support(Y))</font>__
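
As a minimal sketch of the three formulas above, the following computes them by hand for a single rule. The baskets and the names `baskets`, `freq`, `x`, and `y` are invented for illustration and are not taken from the grocery dataset.

```python
# Toy baskets, invented for illustration (not the notebook's dataset).
baskets = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"milk"},
    {"bread", "milk"},
]
n = len(baskets)  # N: total number of transactions


def freq(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


x, y = {"bread"}, {"milk"}

support_xy = freq(x | y) / n                            # Freq(X, Y) / N
confidence_xy = freq(x | y) / freq(x)                   # Freq(X, Y) / Freq(X)
lift_xy = support_xy / ((freq(x) / n) * (freq(y) / n))  # Support(X, Y) / (Support(X) * Support(Y))

print(f"support={support_xy:.2f} confidence={confidence_xy:.2f} lift={lift_xy:.4f}")
```

Here bread and milk appear together in 3 of 5 baskets, and each appears in 4 of 5 on its own, so the lift comes out slightly below 1.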

# Import Libraries

In [None]:
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Data Processing

In [None]:
data = pd.read_csv('../input/supermarket/GroceryStoreDataSet.csv')

In [None]:
data

In [None]:
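# The column label is the file's first line, which read_csv treats as a header by default
# (pass header=None to read_csv if that line is itself a basket).
# Split each comma-separated row into a list of products.
df = list(data["MILK,BREAD,BISCUIT"].apply(lambda x: x.split(',')))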
df

* Each row of the data is the set of products bought together by one customer, that is, one basket.

In [None]:
te = TransactionEncoder()
te_data = te.fit(df).transform(df)  # Boolean array: True where a product is in a basket
# Use each distinct product found in the baskets as a column name.
df = pd.DataFrame(te_data, columns=te.columns_)
# Convert True/False to 1/0 (astype alone is enough; no replace step is needed).
df = df.astype(int)
df

* After the transformation process, we can observe that each row represents a basket and if a product is in that basket, it is set as 1. If it is not in the basket, it is set as 0.
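
To make the encoding step concrete, here is a small pure-Python sketch of the same idea. It is not mlxtend's implementation; the baskets and the names `columns` and `encoded` are invented for illustration.

```python
# Toy baskets, invented for illustration.
baskets = [["MILK", "BREAD"], ["BREAD", "JAM"], ["MILK"]]

# Columns: every distinct product, sorted here for readability.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: 1 if the product is in the basket, 0 otherwise.
encoded = [[int(item in basket) for item in columns] for basket in baskets]

print(columns)
for row in encoded:
    print(row)
```

Each inner list lines up with `columns`, so the first basket (milk and bread) has a 1 under BREAD and MILK and a 0 under JAM.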

# Support, Confidence and Lift

In [None]:
freq_items = apriori(df, min_support=0.10, use_colnames=True)  # Itemsets that appear in at least 10% of baskets


In [None]:
freq_items.sort_values(by = "support" , ascending = False) # Sorting from highest to lowest

* Support is the fraction of baskets in which a product, or a combination of products, appears.
* For products that appear frequently, alone or in combination, product variety and brand diversity can be increased.
* Bread is an example in this dataset: it appears in 65% of all baskets (Row 2). Its variety can be increased (rye, whole wheat, buckwheat, etc.) and more buyers can be targeted by carrying more brands.
* By doing so, companies can increase customer satisfaction.
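
The effect of a minimum support threshold can be sketched in plain Python. Note that this is exhaustive enumeration over every item combination, not the Apriori algorithm's level-wise pruning, and the baskets, the 0.5 threshold, and the name `frequent` are assumptions made for illustration.

```python
from itertools import combinations

# Toy baskets, invented for illustration (not the notebook's dataset).
baskets = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"milk"},
    {"bread", "milk"},
]
min_support = 0.5  # hypothetical threshold, chosen for this toy data
n = len(baskets)
items = sorted(set().union(*baskets))

# Keep every itemset whose support reaches the threshold.
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        support = sum(1 for basket in baskets if set(combo) <= basket) / n
        if support >= min_support:
            frequent[combo] = support

# Show the surviving itemsets from highest to lowest support.
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, support)
```

Jam appears in only 2 of 5 baskets, so it and every combination containing it fall below the threshold and are dropped.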

In [None]:
df1 = association_rules(freq_items, metric = "confidence", min_threshold = 0.02)

In [None]:
df1.sort_values(by = "lift", ascending = False) # Sorting from highest to lowest with respect to lift

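* Lift compares how often two bundles of products are bought together with how often that would happen if they were bought independently. A lift above 1 means the bundles are bought together more often than expected, so they can be sold together; a lift below 1 means they are bought together less often than expected, which may indicate that they are substitutes; a lift equal to 1 means there is no relation between the bundles.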
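
The three cases can be illustrated with a short sketch. The helper `interpret_lift`, its tolerance, and the support values below are all hypothetical and chosen only to produce one example of each case.

```python
import math


def interpret_lift(lift, tol=1e-9):
    """Map a lift value to the three cases described above (hypothetical helper)."""
    if math.isclose(lift, 1.0, abs_tol=tol):
        return "no relation"
    if lift > 1:
        return "bought together more often than expected"
    return "bought together less often than expected"


# (Support(X), Support(Y), Support(X, Y)) for three made-up rules.
examples = {
    "bread -> milk": (0.8, 0.8, 0.6),
    "bread -> jam": (0.8, 0.4, 0.4),
    "tea -> sugar": (0.5, 0.4, 0.2),
}

results = {}
for rule, (s_x, s_y, s_xy) in examples.items():
    lift = s_xy / (s_x * s_y)
    results[rule] = (lift, interpret_lift(lift))
    print(f"{rule}: lift={lift:.4f} -> {results[rule][1]}")
```

In practice lift is rarely exactly 1, so values close to 1 are usually read as a weak or negligible relation.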

In [None]:
df1.sort_values(by = "confidence", ascending = False) # Sorting from highest to lowest with respect to confidence

* Confidence is a conditional probability: the chance that the consequent product is bought given that the antecedent product has already been bought.
* In this case, the probability that jam is bought given that bread has already been bought is 16.66% (Row 21).
* Bundles of products with a high confidence level can be placed in nearby sections of the store.
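
Because confidence is conditional on the antecedent, swapping antecedent and consequent generally changes its value. The sketch below shows this on invented baskets; the names `freq` and `confidence` are hypothetical helpers, not part of mlxtend.

```python
# Toy baskets, invented for illustration (not the notebook's dataset).
baskets = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"milk"},
    {"bread", "milk"},
]


def freq(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def confidence(antecedent, consequent):
    """Freq(X, Y) / Freq(X): share of antecedent baskets that also hold the consequent."""
    return freq(antecedent | consequent) / freq(antecedent)


bread_to_jam = confidence({"bread"}, {"jam"})
jam_to_bread = confidence({"jam"}, {"bread"})

print(f"bread -> jam: {bread_to_jam:.2f}")
print(f"jam -> bread: {jam_to_bread:.2f}")
```

In this toy data every basket with jam also contains bread, but only half of the bread baskets contain jam, so the direction of the rule matters when deciding which product to place next to which.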

In [None]:
df1[(df1.confidence > 0.8) & (df1.lift > 1)].sort_values(by="lift", ascending=False)

* The rules above are the strongest associations in the data: confidence above 0.8 and lift above 1. To increase income, these items can be combined with products that have lower support values, and special discounts can be offered on those combinations. Doing so can increase how often the lower-support products appear in baskets.