# What is the Apriori algorithm?
Apriori uses a breadth-first search strategy to count the support of itemsets and a candidate generation function that exploits the downward closure property of support. It works level by level: the frequent itemsets of size k are used to generate the candidate itemsets of size k + 1.

## Apriori Property
All subsets of a frequent itemset must be frequent (the Apriori property).
Equivalently, if an itemset is infrequent, all of its supersets are infrequent.

### **The following are the main steps of the algorithm:**
* Calculate the support of itemsets of size k = 1 in the transactional database (support measures how frequently an itemset occurs in the transactions). This is called generating the candidate set.

* Prune the candidate set by eliminating itemsets with a support less than the given threshold.

* Join the frequent itemsets to form candidate sets of size k + 1, and repeat the steps above until no more itemsets can be formed. This happens when every set formed has a support less than the given threshold.
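
The steps above can be sketched in plain Python. This is a minimal illustration, not the mlxtend implementation used later in this kernel: the function name `apriori_sketch` and the toy transactions are assumptions made up for the example, and support is taken to be the fraction of transactions containing an itemset.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Step 1: candidate itemsets of size k = 1
    current = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while current:
        # Step 2: count support and prune candidates below the threshold
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Step 3: join frequent k-itemsets into candidates of size k + 1
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Downward closure: keep a candidate only if all its (k-1)-subsets are frequent
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical toy data, not taken from the bakery dataset
toy = [
    ["Bread", "Coffee"],
    ["Bread", "Coffee", "Cake"],
    ["Coffee", "Cake"],
    ["Bread", "Tea"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The candidate {Bread, Coffee, Cake} is never counted here, because its subset {Bread, Cake} is already infrequent; that is the pruning the Apriori property allows.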

## Bakery Market Basket Analysis 
Market Basket Analysis is used to increase marketing effectiveness and to improve cross-sell and up-sell opportunities by making the right offer to the right customer. For a retailer, good promotions translate into increased revenue and profits. The objective of market basket analysis models is to identify the next product that the customer might be interested in purchasing or browsing.

Right! Before we implement the algorithm just for the sake of showing off our skills, what is our goal? As discussed above, we are here to determine up-sell opportunities. Let's start with some general questions as a framework: what sort of relationships do we wish to discover? And then, naturally: how would discovering such relationships help the business owner's bottom line? For now, let's keep these in the back of our minds.


* Can we get rid of a product 'X' because it is sold infrequently?
It might not be that straightforward. If the business owner drops X to save the cost and overhead associated with it, but X is often bought together with another product 'Y' (that is, X and Y are complements), removing X may also hurt the sales of Y.

## If you find this kernel useful, please UPVOTE

# Import libraries

In [None]:
import pandas as pd
import numpy as np
import io

import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import mlxtend as ml

# Load dataset

In [None]:
df = pd.read_csv('/kaggle/input/transactions-from-a-bakery/BreadBasket_DMS.csv')

In [None]:
# First five rows
df.head()

In [None]:
# Size of the dataset (rows, columns)
df.shape

In [None]:
# summary about dataset
df.info()

In [None]:
# statistical summary of numerical variables
df.describe()

# Exploratory data analysis

In [None]:
# check for missing values
df.isnull().sum() 

In [None]:
# merge date and time column
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df[["Datetime", "Transaction", "Item"]]

df.head()

In [None]:
df.dtypes

In [None]:
# check for unique value in items
df['Item'].value_counts().to_dict()

#### There are 786 'NONE' entries; we need to remove them

In [None]:
# Remove none
df = df[df['Item'] != 'NONE']

In [None]:
# Check that no 'NONE' rows remain (the result should be empty)
df[df['Item'] == 'NONE']

In [None]:
# Extract the hour of the day and the day of the week
# pandas numbers weekdays Monday=0 ... Sunday=6, so we add 1 to get Monday=1 ... Sunday=7

df['Hour'] = df['Datetime'].dt.hour

df["Weekday"] = df["Datetime"].dt.weekday + 1

df.head()

In [None]:
total_items = len(df)
# Count distinct calendar dates and distinct (year, month) pairs;
# .dt.day and .dt.month would only count day-of-month and month numbers
total_days = df.Datetime.dt.normalize().nunique()
total_months = df.Datetime.dt.to_period("M").nunique()
average_items = int(total_items / total_days)
unique_items = df.Item.unique().size

print("Unique items sold by the bakery: {}".format(unique_items))
print('-----------------------------')
print("Total sales: {} items sold in {} days throughout {} months".format(total_items, total_days, total_months))
print('-----------------------------')
print("Average daily sales: {} items".format(average_items))

In [None]:
# Rank the top 10 best-selling items
counts = df.Item.value_counts()

percent = df.Item.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

top_10 = pd.DataFrame({'counts': counts, '%': percent})

top_10.head(10)

In [None]:
# Rank by percentage (normalize=True returns fractions, so scale them to percent)
plt.figure(figsize=(8,5))
df.Item.value_counts(normalize=True).mul(100)[:10].plot(kind="bar", title="Percentage of Sales by Item").set(xlabel="Item", ylabel="Percentage")
plt.show()

# Rank by value
plt.figure(figsize=(8,5))
df.Item.value_counts()[:10].plot(kind="bar", title="Total Number of Sales by Item").set(xlabel="Item", ylabel="Total Number")
plt.show()

In [None]:
# set datetime as index 
df.set_index('Datetime', inplace=True)

In [None]:
# Number of items sold by day
df["Item"].resample("D").count().plot(figsize=(15,5), title="Total Number of Items Sold by Date").set(xlabel="Date", ylabel="Total Number of Items Sold")
plt.show()

In [None]:
# Number of items sold by month
df["Item"].resample("M").count().plot(figsize=(15,5), grid=True, title="Total Number of Items Sold by Month").set(xlabel="Date", ylabel="Total Number of Items Sold")
plt.show()

In [None]:
# Aggregate item sold by hour
df_groupby_hour = df.groupby("Hour").agg({"Item": lambda item: item.count()/total_days})
print(df_groupby_hour)

# Plot items sold by hour
plt.figure(figsize=(8,5))
sns.countplot(x='Hour',data=df)
plt.title('Item Sales by Hour')
plt.show()

In [None]:
# sales groupby weekday
df_groupby_weekday = df.groupby("Weekday").agg({"Item": lambda item: item.count()})
df_groupby_weekday.head()

# Modeling

In [None]:
# Build the transaction-by-item matrix used as input for Apriori
df_basket = df.groupby(["Transaction","Item"]).size().reset_index(name="Count")

market_basket = (df_basket.groupby(['Transaction', 'Item'])['Count'].sum().unstack().reset_index().fillna(0).set_index('Transaction'))
market_basket.head()

In [None]:
# Convert all of our numbers to either a 1 or a 0 (negative numbers are converted to zero, positive numbers are converted to 1)
def encode_data(datapoint):
  if datapoint <= 0:
    return 0
  else:
    return 1

In [None]:
# Apply the encoding to every cell of the market_basket dataset
market_basket = market_basket.applymap(encode_data)

# Check the result
market_basket.head()

market_basket.isna().sum()
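
As an aside, the same transaction-by-item matrix can be built more compactly. The sketch below is an alternative, not what this kernel does: it uses `pd.crosstab` on a small made-up DataFrame (the name `toy` and its rows are assumptions for illustration) and produces boolean values, which mlxtend's `apriori` also accepts.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the bakery dataset
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3, 3],
    "Item": ["Bread", "Coffee", "Coffee", "Bread", "Coffee", "Coffee"],
})

# crosstab counts how often each item occurs in each transaction;
# comparing with 0 turns the counts into True/False "item was bought" flags
basket_bool = pd.crosstab(toy["Transaction"], toy["Item"]) > 0
print(basket_bool)
```

Transaction 3 contains Coffee twice, but the resulting flag is still a single True, which matches the 0/1 encoding above.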

## Building the Apriori model 

### Support:
* refers to the overall popularity of an itemset: the number of transactions containing the itemset divided by the total number of transactions

### Confidence:

* refers to the likelihood that item B is also bought if item A is bought: the number of transactions where A and B are bought together divided by the number of transactions where A is bought

### Lift:

* refers to the increase in the ratio of sale of B when A is sold. Lift(A -> B) is Confidence(A -> B) divided by Support(B); a lift above 1 means A and B are bought together more often than expected if they were independent

### Leverage:

* computes the difference between the observed frequency of A and B appearing together and the frequency that would be expected if A and B were independent: Support(A and B) - Support(A) * Support(B)

### Conviction:

* Conviction(A -> B) is (1 - Support(B)) / (1 - Confidence(A -> B)). A high conviction value means that the consequent depends strongly on the antecedent
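
To make these definitions concrete, here is a small sketch that computes all five metrics by hand for a single rule. The helper `rule_metrics` and the five toy transactions are assumptions made up for this example; the formulas are the standard ones, and the sketch assumes the antecedent occurs in at least one transaction.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, lift, leverage and conviction for antecedent -> consequent."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    a, c = set(antecedent), set(consequent)

    supp_a = sum(a <= b for b in baskets) / n            # Support(A)
    supp_c = sum(c <= b for b in baskets) / n            # Support(B)
    supp_ac = sum((a | c) <= b for b in baskets) / n     # Support(A and B)

    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = float("inf") if confidence == 1 else (1 - supp_c) / (1 - confidence)
    return {"support": supp_ac, "confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Hypothetical toy data, not taken from the bakery dataset
toy = [
    ["Bread", "Coffee"],
    ["Bread", "Coffee"],
    ["Coffee"],
    ["Bread"],
    ["Tea"],
]
metrics = rule_metrics(toy, ["Bread"], ["Coffee"])
for name, value in metrics.items():
    print(name, round(value, 4))
```

In this toy data Bread and Coffee each appear in 3 of 5 baskets and together in 2 of 5, so the lift is slightly above 1 and the leverage is slightly above 0.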

In [None]:
# The apriori function requires a min_support: the minimum fraction of transactions in which an itemset must appear.
# We start with a min_support of 2% to begin seeing results

itemsets = apriori(market_basket, min_support=0.02, use_colnames=True)

In [None]:
# Build the association rules using the mlxtend association_rules function.
# min_threshold is the minimum value of the chosen metric that a rule must reach to be returned.
# With metric='lift' and min_threshold=0.5, rules with a lift of at least 0.5 are kept;
# the confidence filter is applied further below.

rules = association_rules(itemsets, metric='lift', min_threshold=0.5)

In [None]:
# The list of product combinations, sorted by lift
# This information can be used to build a cross-sell recommendation system that promotes these products together

rules.sort_values("lift", ascending = False, inplace = True)
rules.head(10)

In [None]:
# Scatter plot of support against confidence for every rule
support = rules.support.to_numpy()
confidence = rules.confidence.to_numpy()

plt.figure(figsize=(8,6))
plt.title('Association Rules')
plt.xlabel('support')
plt.ylabel('confidence')
sns.regplot(x=support, y=confidence, fit_reg=False)
plt.show()

In [None]:
# Recommendation of Market Basket
rec_rules = rules[ (rules['lift'] > 1) & (rules['confidence'] >= 0.5) ]

In [None]:
# Recommendation of Market Basket Dataset
cols_keep = {'antecedents':'item_1', 'consequents':'item_2', 'support':'support', 'confidence':'confidence', 'lift':'lift'}
cols_drop = ['antecedent support', 'consequent support', 'leverage', 'conviction']

recommendation_basket = pd.DataFrame(rec_rules).rename(columns= cols_keep).drop(columns=cols_drop).sort_values(by=['lift'], ascending = False)

display(recommendation_basket)
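
One way such a table could feed a cross-sell recommendation is a simple lookup by item. The sketch below is only an illustration: the `recommend` helper, the `toy_rules` table and all of its numbers are made up for the example (they are not results from the bakery data), and the table only mimics the `item_1` / `item_2` / `lift` columns of `recommendation_basket`.

```python
import pandas as pd

# Hypothetical rules table with made-up values, shaped like recommendation_basket
toy_rules = pd.DataFrame({
    "item_1": [frozenset({"Toast"}), frozenset({"Cake"}), frozenset({"Toast"})],
    "item_2": [frozenset({"Coffee"}), frozenset({"Coffee"}), frozenset({"Tea"})],
    "confidence": [0.70, 0.53, 0.20],
    "lift": [1.47, 1.10, 1.30],
})

def recommend(rules_df, item, top_n=3):
    """Return items suggested by rules whose antecedent contains `item`, highest lift first."""
    mask = rules_df["item_1"].apply(lambda antecedent: item in antecedent)
    matches = rules_df[mask].sort_values("lift", ascending=False)
    suggestions = []
    for consequent in matches["item_2"]:
        for product in sorted(consequent):
            if product not in suggestions:
                suggestions.append(product)
    return suggestions[:top_n]

toast_recs = recommend(toy_rules, "Toast")
print(toast_recs)
```

An item with no matching rule simply yields an empty list, so the caller can fall back to something like the overall best-sellers.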
