## **Project Overview**

The primary objective of this project is to assist the bakery company in increasing sales and customer base by leveraging association rule mining. 

We aim to identify relationships between items within the dataset found [here](https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket/data), which, according to the source, originates from a bakery in Edinburgh. Our goal is to uncover which products are frequently purchased together, quantify the strength of these co-occurrences, and provide strategic recommendations based on our findings.

This analysis draws inspiration from the article [*"A Conceptual Introduction into Association Rule Mining — Part 2"*](https://medium.com/delvify/a-conceptual-introduction-into-association-rule-mining-part-2-96c73c4ce87b) authored by Annette Catherine Paul, published on August 13, 2021. My goal is to re-create and adapt her analysis, but with my own unique approach. I am driven by a curiosity to explore the outcomes and a desire to verify the accuracy of her conclusions. I respect her methodological process and see this project as a valuable opportunity to enhance my understanding, as she has extensive experience in the field.

*“Imitation is not mechanical ‘parroting’, but complex, goal-oriented behavior which is central to learning” (Purainen-Marsh et al. 2012)*

### Main scope of work
* [1. Data understanding & data wrangling](#Step1)
* [2. Transactions encoding](#Step2)
* [3. Tree map](#Step3)
* [4. Apriori algorithm - frequent itemsets generation](#Step4)
* [5. Association rules](#Step5)
* [6. Visualizations for rules generated with support threshold 0.01](#Step6)
* [7. Final conclusions and recommendations](#Step7)

## **Modules**

In [None]:
import pandas as pd
import numpy as np

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

import squarify
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt 
from matplotlib import style 
import seaborn as sns 
sns.set()

from PyARMViz import PyARMViz
from PyARMViz.Rule import generate_rule_from_dict

## **1. Data understanding and data wrangling** <a id="Step1"></a>

In [None]:
data_bakery = pd.read_csv('bread basket.csv')
data_bakery

In [None]:
data_bakery.columns = ['Transaction', 'Item', 'Date_time', 'Period_day', 'Weekday_weekend']

### **Data types & Missing values**

In [None]:
data_bakery.info()

No missing values

In [None]:
data_bakery['Date_time'] = pd.to_datetime(data_bakery['Date_time'], dayfirst=True)

In [None]:
#data_bakery.isnull().sum()

### **Transactions**

In [None]:
data_bakery['Transaction'].nunique()

In [None]:
data_unique_trans = data_bakery.drop_duplicates(subset='Transaction')
data_unique_trans.reset_index(drop=True)

A total of 9465 transactions were carried out.

### **Items**

In [None]:
data_bakery['Item'].nunique()

In [None]:
sorted(data_bakery['Item'].unique())

Insights: 
- There are no spelling mistakes or typos in the item names.
- The shop provides a well-rounded selection of products:

* Bakery items: various breads, cakes, pastries, and cookies
* Confectionery: from chocolates and fudges to sweet snacks like popcorn and crisps
* Savory meals, snacks and condiments: chicken stew, frittata, polenta, focaccia, tacos
* Beverages: both hot drinks like coffee and tea and cold options like juices and flavored water
* Specialty and seasonal items: the shop offers products tailored to specific events (Christmas common, Valentine’s card) or dietary preferences, such as vegan options.
* Gifts & extras: gift voucher, postcard, nomad bag, t-shirt, afternoon with the baker
* And many more...

In [None]:
data_bakery['Item'].value_counts()

#### Network diagram 
The 30 most frequently purchased products across all transactions 

In [None]:
top_items = data_bakery['Item'].value_counts().nlargest(30)
top_items

In [None]:
df = pd.DataFrame({
    'from': ['Bakery'] * len(top_items),
    'to': top_items.index,
    'value': top_items.values
})

# Build the graph
G = nx.from_pandas_edgelist(df, 'from', 'to', create_using=nx.Graph())

# Function to format labels
def format_label(label):
    if ' ' in label:
        # Split label into two lines at the space
        return '\n'.join(label.split(' ', 1))
    else:
        # Return label as is
        return label

# Prepare labels with formatting
formatted_labels = {node: format_label(node) for node in G.nodes()}

# Draw the graph
plt.figure(figsize=(8, 7))

nx.draw(
    G,
    with_labels=True,                           # Show labels
    node_color='#2166ac',                       # Blue color for nodes
    node_size=1800,                             # Size of the nodes
    edge_color=df['value'],                     # Set edge color
    width=8.0,                                 # Edge width
    labels=formatted_labels,                    # Use formatted labels
    font_color='white',                         # Set the label color to white
    font_size=7,                                # Set the label size
    edge_cmap=plt.cm.Blues                      # Edge color map 
)

plt.title("Network Graph \n Top 30 products at the bread basket bakery", fontsize=15);

The network graph was built based on [this code](https://python-graph-gallery.com/325-map-colour-to-the-edges-of-a-network/). 

The visibility of the line indicates the strength of the connection, meaning the more visible the line, the more frequently the product appears. 

**Among 94 available items, products purchased most frequently (represented by the most visible lines) are <u>coffee, bread, tea, and cake</u>**. 

### **Date_time**

In [None]:
import plotly.express as px
fig = px.line(data_unique_trans, x = 'Date_time', y ='Transaction' )
fig.update_traces(line=dict(color='#B2182B', width=2))  

fig.update_layout(
    plot_bgcolor='white', 
    xaxis=dict(showgrid=True, gridcolor='#dee2e6'),  
    yaxis=dict(showgrid=True, gridcolor='#dee2e6')
)

fig

In [None]:
data_bakery['Date_time'].sort_values(ascending = False)

In [None]:
2000/30

Around 2000 transactions were conducted each month (roughly 67 per day, taking a month as 30 days), which suggests that transactions are spread fairly evenly across the months.

The dataset covers transactions from the last days of October 2016 (starting on 30/10/2016) to the first days of April 2017 (up to 09/04/2017). This means **we have data covering approximately 5 to 6 months (roughly half a year)**.
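As a quick cross-check of the time span, the sketch below computes the number of days between the October 2016 start date and the April 2017 end date using only the standard library. The average month length of 30.44 days is an assumption made for the conversion. It is not taken from the dataset.

```python
from datetime import date

# First and last transaction dates of the dataset (30/10/2016 and 09/04/2017)
first_day = date(2016, 10, 30)
last_day = date(2017, 4, 9)

span_days = (last_day - first_day).days

# Assumption: an average month lasts 365.25 / 12 = 30.44 days
span_months = span_days / 30.44

print(f"{span_days} days, roughly {span_months:.1f} months")
```

The result falls between five and six months, in line with the "roughly half a year" estimate.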

### **Period_day** 	

In [None]:
fig = px.line(data_unique_trans, x = 'Date_time', y ='Period_day' )
fig.update_traces(line=dict(color='#67a9cf', width=2))  

fig.update_layout(
    plot_bgcolor='white', 
    xaxis=dict(showgrid=True, gridcolor='#dee2e6'),  
    yaxis=dict(showgrid=True, gridcolor='#dee2e6')
)

fig

In [None]:
round((data_unique_trans['Period_day'].value_counts() / data_unique_trans.shape[0]) * 100,2)

In [None]:
53.77 + 43.35

Around 97% of transactions were conducted in the afternoon or morning, while evening and night are less popular.

The next question is: what exactly does each period of the day mean and what are the hour ranges for each?

In [None]:
#data_unique_trans[data_unique_trans['Period_day'] == 'afternoon']

In [None]:
data_unique_trans[data_unique_trans['Period_day'] == 'morning']['Date_time'].dt.strftime('%H:%M').sort_values()

From the earliest classified transaction to the latest 
* morning: 01:21, next transaction 7:29 - 11:59
* afternoon: 12:00 - 16:59 
* evening: 17:00 - 20:46
* night: 21:42 - 23:38
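The boundaries observed above can be written down as a small classifier. This is only a sketch of how the `Period_day` labels might be derived, and `period_of_day` is a hypothetical helper. Two boundaries are assumptions inferred from the observed ranges rather than rules documented by the dataset's author: the 21:00 cut-off between evening and night, and the treatment of everything before noon as morning.

```python
from datetime import time

def period_of_day(t: time) -> str:
    """Map a time of day to a period label using assumed boundaries."""
    if t < time(12, 0):
        return "morning"       # also covers the 01:21 outlier
    if t < time(17, 0):
        return "afternoon"
    if t < time(21, 0):        # assumed cut-off: no transactions seen between 20:46 and 21:42
        return "evening"
    return "night"

for sample in (time(1, 21), time(11, 59), time(12, 0), time(20, 46), time(21, 42)):
    print(sample.strftime("%H:%M"), "->", period_of_day(sample))
```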

Insights: 

The observed operating hours appear to run from about 7:30 AM to around midnight (only one transaction, at 01:21 AM, falls outside this range).
Some transactions also occurred very late in the evening, which is quite unusual for a bakery.

Maybe the shop is open 24/7, or maybe some of the transactions were made online? 

### **Weekday_weekend**

In [None]:
round((data_unique_trans['Weekday_weekend'].value_counts() / data_unique_trans.shape[0]) * 100,2)

65% of transactions occurred on weekdays, while 35% on weekends. 

In [None]:
round((data_unique_trans['Date_time'].dt.day_name().value_counts() / data_unique_trans.shape[0]) * 100,2)

Most transactions were conducted on Saturdays, which makes sense as this is a free day for many people, giving them more time for shopping. The shop is open every day of the week, on weekdays and weekends alike. It is likely closed on bank holidays, but this needs to be confirmed.

In [None]:
data_unique_trans[data_unique_trans['Date_time'].dt.day_name() == 'Friday']

Friday is treated as a weekday.

In [None]:
transactions_weekend = data_unique_trans[data_unique_trans['Weekday_weekend'] =='weekend'].reset_index(drop = True)
transactions_weekday = data_unique_trans[data_unique_trans['Weekday_weekend'] =='weekday'].reset_index(drop = True)

In [None]:
transactions_weekend['Date_time'].dt.strftime('%H:%M').sort_values()

In [None]:
transactions_weekday['Date_time'].dt.strftime('%H:%M').sort_values()

The opening hours are similar for both weekends and weekdays.

### **Bank holidays**

The fixed bank holidays in Scotland that could fall within the period covered by our dataset, according to [Wikipedia](https://en.wikipedia.org/wiki/Public_and_bank_holidays_in_Scotland), are collected in the list `bank_holidays`.

In [None]:
bank_holidays = [
    '2016-11-30',  # St. Andrew's Day
    '2016-12-25',  # Christmas Day
    '2016-12-26',  # Boxing Day
    '2017-01-01',  # New Year's Day
    '2017-01-02',  # 2nd January
]

#bank_holidays += ['2016-03-25', '2017-04-14']  # Good Friday for 2016 and 2017
bank_holidays = pd.to_datetime(bank_holidays)

In [None]:
bank_holiday_transactions = data_unique_trans[data_unique_trans['Date_time'].dt.date.isin(bank_holidays.date)]
bank_holiday_transactions.reset_index(drop=True)

In [None]:
bank_holiday_transactions[bank_holiday_transactions['Transaction'] == 4090]

Someone bought bread at 1:21AM on New Year's Day. Is it possible? 

In [None]:
bank_holiday_transactions.shape

In [None]:
bank_holiday_transactions['Date_time'].dt.day_name()

One transaction appears to be an outlier: it is the only transaction made on New Year's Day. The timing is somewhat explainable, as people often stay up late to celebrate. That might be normal for some types of shops, but it is unusual for a bakery.

As for the other bank holidays, the shop appears to close only on major holidays. On St. Andrew's Day, for example, the shop was open, although possibly with reduced opening hours, as the last transaction on that day occurred around 5 PM.
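To read the holiday transactions in context, it can help to know which weekday each date fell on. The sketch below uses only the standard library. `DAY_NAMES` and the `holidays` dictionary are illustrative names that mirror the dates in `bank_holidays`.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

holidays = {
    "St. Andrew's Day": date(2016, 11, 30),
    "Christmas Day": date(2016, 12, 25),
    "Boxing Day": date(2016, 12, 26),
    "New Year's Day": date(2017, 1, 1),
    "2nd January": date(2017, 1, 2),
}

holiday_weekdays = {name: DAY_NAMES[d.weekday()] for name, d in holidays.items()}

for name, day in holiday_weekdays.items():
    print(f"{name}: {day}")
```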

### **To sum up**

The dataset is complete with no errors or missing values, consisting of 9465 transactions and 94 unique products. The most popular products are coffee, bread, tea, and cake. Transactions are evenly distributed with approximately 2000 transactions per month, covering roughly half a year from October 30, 2016, to April 9, 2017.

The majority of transactions (97%) occur in the morning or afternoon, with fewer transactions in the evening and at night. The observed operating hours of the shop appear to run from about 7:30 AM to around midnight.

Weekdays account for 65% of transactions, while weekends for 35%. Saturdays are the busiest days, likely due to the increased availability of customers. The shop operates every day of the week, but it is presumed to be closed on major bank holidays.

Additionally, there is one outlier transaction made at 01:21AM on New Year's Day. This late transaction is unusual for a bakery but may be explained by the celebratory nature of the holiday.

## **2. Transactions encoding** <a id="Step2"></a>

In [None]:
basket =  data_bakery[['Transaction','Item']]
basket.head(50)

The items purchased for each transaction.

In [None]:
transactions = [ a[1]['Item'].tolist() for a in list(basket.groupby(['Transaction']))]
transactions

In [None]:
# We have 9465 transactions
len(transactions) 

The list of transactions is converted into a one-hot encoded NumPy array using the *TransactionEncoder* class from the *mlxtend* library.

In [None]:
te = TransactionEncoder()
te_ary = te.fit_transform(transactions)
te_ary

In [None]:
transactions_df = pd.DataFrame(te_ary,columns = te.columns_)
transactions_df

In [None]:
# convert False to 0, True to 1
transactions_df_int = transactions_df.astype(int)
transactions_df_int

In the resulting array, each transaction is represented by a row, and the one-hot encoding indicates the presence of each unique item.
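To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding does conceptually. The three transactions are made up for illustration, and the code is not how mlxtend implements `TransactionEncoder`.

```python
# Three made-up transactions (not taken from the bakery data)
toy_transactions = [
    ["Coffee", "Bread"],
    ["Tea"],
    ["Coffee", "Cake", "Bread"],
]

# Columns: every distinct item, in sorted order
columns = sorted({item for t in toy_transactions for item in t})

# Rows: one 0/1 flag per column for each transaction
encoded = [[int(item in t) for item in columns] for t in toy_transactions]

print(columns)
for row in encoded:
    print(row)
```

Each row has exactly as many ones as the transaction has distinct items, so summing a column gives the number of transactions that contain that item.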

In [None]:
transactions_df_int.sum(axis=0)

In [None]:
item_count_df = pd.DataFrame(transactions_df_int.sum(axis=0))
item_count_df = item_count_df.reset_index()
item_count_df.columns = ['Item', 'Count']

In [None]:
item_count_df

## **3. Tree map** of the top 30 items based on their count <a id="Step3"></a> 

A tree map is a visual representation of hierarchical data, where each branch is represented by a rectangle and its size is proportional to the data value it represents.

In [None]:
item_count_df_30 = item_count_df.sort_values('Count', ascending = False).head(30)
item_count_df_30

In [None]:
import matplotlib.cm as cm
def split_label(label, max_words=1):
    words = label.split()
    return '\n'.join([' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)])

# Apply the function to split the labels
item_count_df_30['Item'] = item_count_df_30['Item'].apply(split_label)

# Plotting
fig, ax = plt.subplots(figsize=(13, 8))  # Larger plot

#cmap = matplotlib.cm.coolwarm
cmap = plt.get_cmap("RdBu")
mini = min(item_count_df_30['Count'])
maxi = max(item_count_df_30['Count'])
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in item_count_df_30['Count']]

squarify.plot(sizes=item_count_df_30['Count'], label=item_count_df_30['Item'],alpha = 0.95, color=colors, text_kwargs={'fontsize': 8.5,'color': 'white'})

plt.axis('off')  # Turn off the axes
plt.show()  

As noted earlier, the most popular items, represented by the largest rectangles, are coffee, bread, tea, and cake, followed by items typically regarded as side dishes or supplementary offerings rather than main courses.

## **4. Apriori algorithm - frequent itemsets generation** <a id="Step4"></a>

The *apriori* function implements the Apriori algorithm, which finds frequent itemsets in transactional data. 

In the output:

* **support: the proportion of transactions that contain the itemset**

* **itemsets: the itemsets that meet the support threshold**

Support threshold: the minimum support an itemset must reach to be considered significant, that is, frequent enough to have a meaningful impact on profit.
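The level-wise idea behind Apriori can be sketched in a few lines of pure Python. This is a simplified illustration on made-up transactions, not mlxtend's implementation. The function name `frequent_itemsets` and the `toy` data are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=3):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item
    candidates = [frozenset([item]) for item in sorted({i for b in baskets for i in b})]
    frequent = {}
    size = 1
    while candidates and size <= max_len:
        # Keep only the candidates that are frequent enough
        level = {}
        for cand in candidates:
            support = sum(cand <= b for b in baskets) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Build the next level; prune candidates with an infrequent subset
        size += 1
        candidates = list({
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == size
            and all(frozenset(s) in level for s in combinations(a | b, size - 1))
        })
    return frequent

toy = [
    ["Coffee", "Bread"],
    ["Coffee", "Cake"],
    ["Coffee", "Bread", "Cake"],
    ["Tea"],
    ["Coffee", "Bread"],
]
result = frequent_itemsets(toy, min_support=0.4)
for itemset, support in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

The pruning step is what makes the approach efficient: a set such as (Coffee, Bread, Cake) is never counted because one of its subsets, (Bread, Cake), is already known to be infrequent.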

#### a) support threshold 0.05 
We keep only itemsets that appear in at least 5% of transactions (more than 473 times).
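The relative threshold translates into a minimum transaction count as sketched below. The sketch assumes the threshold is inclusive (support greater than or equal to `min_support`), which is how the `apriori` function is understood to behave here. `min_counts` is an illustrative name.

```python
import math

n_transactions = 9465  # number of transactions reported earlier

# Smallest whole number of transactions that reaches each threshold
min_counts = {
    min_support: math.ceil(min_support * n_transactions)
    for min_support in (0.05, 0.01)
}

for min_support, count in min_counts.items():
    print(f"min_support={min_support}: at least {count} transactions")
```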

In [None]:
len(transactions) * 0.05

In [None]:
frequent_itemsets_0_05 = apriori(transactions_df,min_support = 0.05, use_colnames = True, max_len = 3) # only itemsets with up to 3 items
frequent_itemsets_0_05

In [None]:
frequent_itemsets_0_05['length'] = frequent_itemsets_0_05['itemsets'].apply(lambda x: len(x))

In [None]:
#frequent_itemsets_0_05.sort_values('support', ascending=False)

In [None]:
frequent_itemsets_0_05[frequent_itemsets_0_05['length']== 1].sort_values('support', ascending=False)

In [None]:
frequent_itemsets_0_05[frequent_itemsets_0_05['length']== 2].sort_values('support', ascending=False)

In [None]:
frequent_itemsets_0_05[frequent_itemsets_0_05['length']== 3].sort_values('support', ascending=False)

#### b) support threshold 0.01  
We select only itemsets that appear in at least 1% of transactions (more than 94 times).

In [None]:
len(transactions) * 0.01

In [None]:
frequent_itemsets_0_01 = apriori(transactions_df,min_support = 0.01, use_colnames = True, max_len = 3)
#frequent_itemsets_0_01

In [None]:
frequent_itemsets_0_01['length'] = frequent_itemsets_0_01['itemsets'].apply(lambda x: len(x))

In [None]:
frequent_itemsets_0_01.sort_values('support', ascending=False)

In [None]:
#frequent_itemsets_0_01[frequent_itemsets_0_01['length']== 1].sort_values('support', ascending=False)

In [None]:
#frequent_itemsets_0_01[frequent_itemsets_0_01['length']== 2].sort_values('support', ascending=False)

In [None]:
frequent_itemsets_0_01[frequent_itemsets_0_01['length']== 3].sort_values('support', ascending=False)

#### Comparison 

In [None]:
frequent_itemsets_0_05.shape

In [None]:
frequent_itemsets_0_01.shape

The two sets of frequent itemsets differ only in size: the first (support threshold 5%) contains 11 itemsets, while the second (support threshold 1%) contains 61, that is, 50 more. 

In both: 

- coffee is present in about 48% of the transactions -> nearly half of the transactions include coffee!
- bread appears in about 33% of the transactions
- tea in 14%
- cake in 10%
- coffee and bread in around 9%, while coffee and cake in around 6%

Additionally, *frequent_itemsets_0_01* also contains itemsets composed of 3 items: (Coffee, Bread, Pastry), (Coffee, Bread, Cake), and (Coffee, Cake, Tea) each appear together in about 1% of the transactions. 

What is more, coffee is extremely popular: among the 2-item itemsets in *frequent_itemsets_0_01*, it appears almost everywhere. 

## **5. Association rules** <a id="Step5"></a>

#### Theory

The `association_rules` function from the mlxtend library generates association rules from the frequent itemsets. It is used to discover interesting relationships between items in large datasets.

Antecedent ⇒ Consequent

      IF  ⇒  THEN 

        X ⇒ Y
        
If a customer buys X, they are likely to also buy Y

<u>Key metrics</u>
* **support - how frequently the antecedent and consequent occur together in the dataset**

example: support = 0.02 = 2% -> the combination of items in a set appears in 2% of all transactions

* **confidence - how likely you are to buy one item if you have already bought another** 

example: confidence = 0.75 = 75% -> There is a 75% chance that if you bought X, you would buy Y

* **lift - it indicates if one item truly boosts the sale of another**

example: lift = 26.8 -> the two items are purchased together almost 27 times more often than would be expected if they were purchased independently of each other

lift > 1 - Y is likely to be bought with X 

lift = 1 - no associations between items 

lift < 1 - Y is unlikely to be bought with X 

High support ensures the rule is relevant, high confidence ensures the rule is reliable, and high lift ensures the rule reveals meaningful and non-random associations.

<u> What we are interested in are high support, confidence, and lift.</u> This indicates frequent, strong, significant associations between items.
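The three metrics can be computed by hand on a small example. The sketch below uses eight made-up transactions (not the bakery data) and the standard definitions: support of the combined itemset, confidence = support(X and Y) / support(X), and lift = confidence / support(Y). The helper names `support` and `rule_metrics` are illustrative.

```python
# Eight made-up transactions for illustration only
toy = [
    {"Coffee", "Cake"},
    {"Coffee", "Bread"},
    {"Coffee"},
    {"Bread"},
    {"Coffee", "Cake", "Bread"},
    {"Tea", "Bread"},
    {"Coffee", "Tea"},
    {"Bread", "Cake"},
]

def support(itemset):
    """Share of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in toy) / len(toy)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    supp = support(antecedent | consequent)
    conf = supp / support(antecedent)
    lift = conf / support(consequent)
    return supp, conf, lift

cake_coffee = rule_metrics({"Cake"}, {"Coffee"})
bread_coffee = rule_metrics({"Bread"}, {"Coffee"})

for name, (supp, conf, lift) in [("Cake => Coffee", cake_coffee), ("Bread => Coffee", bread_coffee)]:
    print(f"{name}: support={supp:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

In this toy data both rules have the same support, yet Cake => Coffee has a lift above 1 and Bread => Coffee a lift below 1, which shows why support alone is not enough to judge a rule.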

#### a) support threshold 0.05 

In [None]:
rules_0_05 = association_rules(frequent_itemsets_0_05, metric='lift', min_threshold = 0.05)
rules_0_05.sort_values(["confidence"], axis = 0, ascending = False)

Interpretation of the first rule: 

- 5.47% of transactions include both Cake and Coffee
- 52.70% of transactions with Cake also include Coffee
- The likelihood of Coffee being purchased with Cake is about 1.10 times higher than if they were independent
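These three figures can be checked against each other with simple arithmetic. The sketch uses the values quoted above together with the approximate coffee support of 48% found among the frequent itemsets. Because that value is rounded, the result only matches to about two decimal places.

```python
# Figures quoted in the text for the rule Cake => Coffee
support_cake_coffee = 0.0547
confidence_cake_to_coffee = 0.5270
support_coffee = 0.48  # approximate value from the frequent itemsets

# lift = confidence / support(consequent)
lift = confidence_cake_to_coffee / support_coffee

# support(antecedent) = support(rule) / confidence
implied_support_cake = support_cake_coffee / confidence_cake_to_coffee

print(f"lift is about {lift:.2f}")
print(f"implied support of Cake is about {implied_support_cake:.2f}")
```

The implied cake support of roughly 10% agrees with the single-item support reported in the frequent itemsets.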

The rules with lift above 1 (Cake → Coffee and Coffee → Cake) suggest a modest positive association between these items. 

The rules with lift below 1 (Bread → Coffee and Coffee → Bread) suggest a negative association. In other words, the presence of one item slightly decreases the likelihood of finding the other item in the same transaction.

#### b) support threshold 0.01  

In [None]:
rules_0_01 = association_rules(frequent_itemsets_0_01, metric='lift', min_threshold = 0.01)
rules_0_01.sort_values(["confidence"], axis = 0, ascending = False)

The first few rules show positive associations between various antecedents and Coffee. 

When Toast, Spanish Brunch, Medialuna, or a similar item is purchased, it is quite likely that Coffee will be purchased as well. 

## **6. Visualizations** for rules generated with support threshold 0.01 <a id="Step6"></a>

### **Heatmap**

The heatmap will help visualize the strength of associations between antecedents (2 items) and consequents (1 item) based on the lift metric.

lift > 1 -> when item X is bought, item Y is bought more often than expected; a positive relationship, useful for cross-promotional strategies or product bundling

lift = 1 -> no association; no specific action is needed

lift < 1 -> when item X is bought, item Y is bought less often than expected; a negative relationship, meaning the items may not be complementary, so it is better not to combine them
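Conceptually, the heatmap data is a grid with one row per antecedent pair and one column per consequent. The pure-Python sketch below shows that reshaping step on made-up rules. The lift values are hypothetical and not taken from the analysis, and `toy_rules` and `grid` are illustrative names.

```python
# Made-up rules for illustration only (lift values are not from the analysis)
toy_rules = [
    {"antecedents": frozenset({"Coffee", "Tea"}), "consequents": frozenset({"Cake"}), "lift": 1.9},
    {"antecedents": frozenset({"Coffee", "Bread"}), "consequents": frozenset({"Cake"}), "lift": 1.1},
    {"antecedents": frozenset({"Coffee", "Bread"}), "consequents": frozenset({"Pastry"}), "lift": 1.3},
    {"antecedents": frozenset({"Bread"}), "consequents": frozenset({"Coffee"}), "lift": 0.9},
]

grid = {}
for rule in toy_rules:
    # Keep only rules with two antecedents and one consequent
    if len(rule["antecedents"]) == 2 and len(rule["consequents"]) == 1:
        row = "_".join(sorted(rule["antecedents"]))   # e.g. "Bread_Coffee"
        col = next(iter(rule["consequents"]))         # the single consequent item
        grid.setdefault(row, {})[col] = rule["lift"]

for row, cells in grid.items():
    print(row, cells)
```

Cells with no matching rule stay empty, which is why a heatmap built this way usually contains blank squares.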

In [None]:
# https://python-graph-gallery.com/heatmap/

In [None]:
rules_0_01['antecedents length'] = rules_0_01['antecedents'].apply(lambda x: len(x))

In [None]:
rules_0_01['consequents length'] = rules_0_01['consequents'].apply(lambda x: len(x))

In [None]:
#sanity check
#rules_0_01.tail(20)

In [None]:
rules_0_01_filtered = rules_0_01[(rules_0_01['antecedents length'] == 2) & (rules_0_01['consequents length'] == 1) ].reset_index(drop=True)

In [None]:
rules_0_01_filtered

In [None]:
rules_0_01_filtered.shape

In [None]:
heatmap_df = rules_0_01_filtered.pivot(
    index='antecedents',
    columns='consequents',
    values='lift',
)

In [None]:
heatmap_df

In [None]:
# Join the antecedent items in sorted order so the row labels are deterministic
heatmap_df.index = heatmap_df.index.map(lambda x: '_'.join(sorted(x)))
heatmap_df.index.name = None

In [None]:
# for col in heatmap_df.columns:
#     #print(col)
#     print(next(iter(col)) )

In [None]:
heatmap_df.columns = [next(iter(col)) for col in heatmap_df.columns]
heatmap_df.columns 

In [None]:
heatmap_df

### Heatmap plot 

In [None]:
plt.figure(figsize=(11, 6))
ax = sns.heatmap(heatmap_df, annot=True, annot_kws={"size": 7}, cmap='RdBu',linewidths=0.5, linecolor='#dee2e6')

cbar = ax.collections[0].colorbar
cbar.set_label('Lift')

ax.set_frame_on(False)
#ax.set_facecolor('white') 

plt.xlabel('Consequents')
plt.ylabel('Antecedents')
plt.show()


value > 1 -> x1 & x2 -> Y, the combination of x1 and x2 increases the likelihood of purchasing y

value < 1 -> x1 & x2 -> Y, the combination of x1 and x2 decreases the likelihood of purchasing y

Strong positive association: Coffee and Tea → Cake - significantly increases

Weak positive association: Coffee and Bread → Cake, Coffee and Bread → Pastry, Coffee and Cake → Tea - slightly increases 

The other combinations show negative associations (lift below 1).

### **Scatterplot**

In [None]:
rules_0_01[rules_0_01['antecedents length']==1].sort_values(by = 'confidence', ascending = False).head(50)

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x="support", y="confidence", size="lift", hue="lift", data=rules_0_01, palette='RdBu', sizes=(20, 200))
plt.title('Metrics')
plt.xlabel('support')
plt.ylabel('confidence')
plt.show()

Overall, the scatter plot shows that **most rules have low support and low confidence with varying lift values**. Most bubbles are concentrated in the region with support between 1% and 3%. Some rules have confidence values between 50% and 60%, but their lift is around 1, so they are not especially informative: despite the decent confidence, they do not show a stronger association than would be expected by chance.

The majority of rules have confidence below 30%, meaning that if item X is purchased, the probability of also purchasing item Y is at most about 30%. High lift values are observed when confidence is relatively low (around 10-20%) and support is between 1% and 2.5%. These rules are infrequent and have lower confidence, yet they show a stronger association relative to how often the consequent is bought overall.
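The reason high lift can coincide with low confidence follows from the definition lift = confidence / support(consequent): a rare consequent inflates lift, while a very popular one such as coffee caps it. The two rules below are hypothetical and only illustrate the arithmetic.

```python
def lift(confidence, consequent_support):
    """Lift of a rule, given its confidence and the support of its consequent."""
    return confidence / consequent_support

# Hypothetical rule pointing to a very popular item (bought in 48% of baskets)
popular = lift(confidence=0.55, consequent_support=0.48)

# Hypothetical rule pointing to a rare item (bought in 5% of baskets)
rare = lift(confidence=0.15, consequent_support=0.05)

print(f"popular consequent: lift = {popular:.2f}")
print(f"rare consequent:    lift = {rare:.2f}")
```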


### **Parallel Category Plots**

In [None]:
rules_0_01

In [None]:
#len(transactions)

In [None]:
# Represent every transaction as a set for fast subset checks
transaction_sets = [set(t) for t in transactions]

def count_containing(itemset):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transaction_sets if itemset <= t)

# Build one dictionary per rule in the format passed to generate_rule_from_dict
rules_dict = []
for lhs, rhs in zip(rules_0_01['antecedents'], rules_0_01['consequents']):
    rules_dict.append({
        'lhs': tuple(lhs),
        'rhs': tuple(rhs),
        'count_full': count_containing(lhs | rhs),   # antecedent and consequent together
        'count_lhs': count_containing(lhs),          # antecedent only
        'count_rhs': count_containing(rhs),          # consequent only
        'num_transactions': len(transactions)
    })

In [None]:
rules_dict

In [None]:
rules = []
for rd in rules_dict: 
    rules.append(generate_rule_from_dict(rd))
rules

In [None]:
#type(rules[0])
# print(dir(rules[0]))

In [None]:
# Verification 
# bread_rules = [
#     rule for rule in rules
#     if "Bread" in rule.lhs or "Bread" in rule.rhs
# ]

# for rule in bread_rules:
#     print(rule)

In [None]:
# medialuna_rules = [
#     rule for rule in rules
#     if "Medialuna" in rule.lhs or "Medialuna" in rule.rhs
# ]

# for rule in medialuna_rules:
#     print(rule)

### Way 1

In [None]:
PyARMViz.generate_parallel_category_plot(rules)

First plot
1. The plot features two vertical axes. The left axis represents Antecedent 1, and the right axis represents the Consequent.
2. Each axis is segmented into categories. The width of the rectangular marker next to each category's name indicates the number of rules associated with that category. A broader marker signifies a higher count of rules originating from or leading to that category.
3. The most frequent Antecedent 1 is "coffee" (count = 16), followed by "bread" (count = 10), and "cake" and "tea" (both with counts = 4). The Consequent axis mirrors this distribution, indicating that "coffee" is also the most common consequent.
4. Each line represents an individual association rule. The uniform thickness of the lines and the hover information (count = 1) suggest that each association occurs exactly once, with no variation in frequency.
5. Example:
   
**Antecedent 1 (Coffee) → Consequent (Brownie) 
If a customer buys coffee, they are likely to buy a brownie.**

Second plot - rules with two antecedents
1. This plot introduces an additional layer. It has three vertical axes: Antecedent 2 on the left, Antecedent 1 in the center, and the Consequent on the right.
2. Similar to the first plot, categories are represented with rectangular markers, with width indicating rule frequency.
3. The most common Antecedent 2 is "coffee" (count = 6), the most frequent Antecedent 1 is "cake" (count = 3), and the most common Consequent is "coffee" again (count = 3).
4. As with the first plot, each line represents a single rule with a count of 1, indicated by the hover information and consistent line thickness.
5. Example:

**Antecedent 2 (Cake) + Antecedent 1 (Tea) → Consequent (Coffee)
Customers whose basket contains both cake and tea are likely to purchase coffee as well. The rule describes items bought together in one transaction, not the order in which they are chosen.**

### Way 2

In [None]:
rules_df = pd.DataFrame(rules_dict)

rules_df = rules_df.sort_values(by='count_full', ascending=False)
import plotly.express as px

# Create the parallel categories plot
fig = px.parallel_categories(
    rules_df,
    dimensions=['lhs', 'rhs'],
    color='count_full',  # You can color by 'support', 'confidence', 'lift', etc.
    color_continuous_scale=px.colors.sequential.RdBu
)

fig.update_layout(
    title='Parallel categories plot of association rules',
    coloraxis_colorbar=dict(
        title="support count"
    )
)

fig.show()

In this plot, blue lines represent rules with high support, indicating they occur most frequently in the dataset. Red lines, on the other hand, represent rules with lower support, meaning they are less common. **The majority of the rules are depicted by red lines, suggesting that most of the associations in this dataset have low support and are not frequent.**

Unlike the previous plot, this one also organizes categories on both axes, from the most frequent to the least frequent. This means that by simply glancing at the plot, we can quickly identify the most common categories in a dataset. In our dataset, **Coffee, Bread, Cake, and Tea are the most frequent items appearing on both the left-hand side (LHS) and the right-hand side (RHS) of the rules.**

<div class="alert alert-block alert-danger"> Coffee is overly dominant in the dataset and might overshadow other potentially interesting patterns or associations. Removing it could help reveal less obvious relationships between other items.</div>

### **Network graph**

In [None]:
PyARMViz.generate_rule_graph_plotly(rules)

## **7. Final conclusions and recommendations** <a id="Step7"></a>

#### Rules

In [None]:
rules_0_05.sort_values(by = 'confidence', ascending = False)

In [None]:
rules_0_01[(rules_0_01['antecedents length']==1) & (rules_0_01['lift'] > 1)].sort_values(by = 'confidence', ascending = False).head(50)

In [None]:
rules_0_01[(rules_0_01['antecedents length']==2)& (rules_0_01['lift'] > 1)].sort_values(by = 'confidence', ascending = False).head(50)

Marketing strategies

#### 1. **Toast -> Coffee (70% chance of buying Coffee)**

*“Toast your day with Coffee – your perfect breakfast duo”*

Promotion:
- bundle offer: Toast + Coffee at a special combo price, e.g., “Grab toast and get coffee at 20% off.”
- visual merchandising: Use signage showing steaming coffee and fresh toast together, emphasizing the morning rush appeal.
- digital marketing: Promote on social media with a “Start Your Day Right” theme, featuring happy customers enjoying their morning toast and coffee.

#### 2. **Spanish Brunch -> Coffee (60% chance of buying Coffee)**

*“Savor the flavor: Spanish brunch & a perfect coffee”*

Promotion:
- special brunch days: Promote certain days as “Brunch Days” with special pricing or a small added incentive, like a free refill or mini dessert.
- themed days: Highlight cultural themes, like “Spanish Sunday Brunch,” to attract brunch enthusiasts.

#### 3. **Sweet Pastry (like Medialuna, Alfajores, Cake, Cookies) -> Coffee (more than 50% chance of buying coffee)**

*“Sweet moments deserve coffee”*

Promotion:
- coffee & sweet treat special: Offer a combo price, like “Add any sweet treat to your coffee for a discount.”
- happy hour: Introduce a coffee and sweet combo deal during afternoon hours to capture the post-lunch crowd.
- social media campaign: Run a “Sweet pairing of the day” where you feature a different pastry and coffee combo, encouraging customers to try new things.

#### 4. **Juice / Hot chocolate  -> Coffee ( 53% / 51% chance of buying coffee )**

*“Juice for you, coffee for me – the duo delight!”*

Promotion:
- couple/ friends combo: Special pricing when purchased together.
- engage through special events: “Bring a Friend” days or themed offers that cater to social gatherings, with extra incentives like a free mini treat when bought in pairs.

#### 5. **Coffee + Tea -> Cake (20% Confidence) | Coffee + Cake -> Tea (18% Confidence)**

*“Cake completes the sip: coffee, tea and something sweet”*

Promotion:
- complementary add-ons: Suggest cake when coffee and tea are ordered together. Offer small discounts for adding cake.
- tea time special: Promote a “Tea Time Special” with coffee, tea, and a slice of cake for a combined price, ideal for afternoon breaks.
- tabletop suggestions: Use tabletop displays or digital menu boards to suggest these add-ons with prompts like, “Enjoy with a slice of cake” when ordering coffee and tea.

#### **General marketing tactics across all combos** 
1. Create a <u>loyalty program</u> that rewards customers for frequently purchasing these pairs. For example, after purchasing a combo five times, they get the next coffee or treat free.
2. Feature <u>customer testimonials, photos, or quotes on social media</u> of people enjoying these pairings, highlighting the social aspect of dining at your bakery.
3. Tailor these pairings to <u>seasonal themes</u>, like autumn flavors for pastries or cozy hot chocolate in winter, making the offers feel timely and relevant.
4. <u>Train staff</u> to suggest these pairings based on the initial order, guiding customers toward these combos in a natural, friendly way.

#### **Final thoughts** 

**People love buying coffee**, and it is a key driver of complementary sales. Customers frequently pair coffee with other items, such as pastries, cakes, or light meals like toast or a Spanish brunch. While this behavior is not new, it serves as a powerful validation of the concept that coffee often acts as an anchor product, encouraging additional purchases. By tapping into this insight, businesses can strategically boost transaction volume and attract a wider customer base.

Implementing this knowledge through targeted promotions—like discounts on pairings, special deals on certain days, or curated combo offers—can significantly enhance profitability. It is essential to continuously monitor how these strategies impact overall sales and profit margins to refine and optimize your approach further.

This tactic not only appeals to solo visitors but also creates a welcoming environment for various social scenarios—whether it is a casual meet-up with friends, a date, family outings, or even business conversations. It positions your establishment as a versatile space that caters to all.

In our analysis, many of the identified association rules were similar, but the insights were still valuable. Supports were relatively low, and confidence and lift could have been higher, yet the results reaffirmed the expected patterns.