In [7]:
# Checking if package is installed or not
!pip show mlxtend

Name: mlxtend
Version: 0.23.4
Summary: Machine Learning Library Extensions
Home-page: https://github.com/rasbt/mlxtend
Author: 
Author-email: Sebastian Raschka <mail@sebastianraschka.com>
License: BSD 3-Clause
Location: C:\Users\snjvm\anaconda3\Lib\site-packages
Requires: joblib, matplotlib, numpy, pandas, scikit-learn, scipy
Required-by: 


In [9]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [59]:
# header=None keeps the first row as data, since the sheet has no header row
df = pd.read_excel(r".\Data\Online retail.xlsx", sheet_name=0, header=None)
df.head()

Unnamed: 0,0
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."


In [61]:
# Lowercase each row, then split on commas to get one list of items per transaction
dataset = df[0].str.lower().str.split(',').tolist()

print(dataset[:3])

[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs'], ['chutney']]


In [63]:
dataset = [sorted(transaction) for transaction in dataset]
print(dataset[:3])

[['almonds', 'antioxydant juice', 'avocado', 'cottage cheese', 'energy drink', 'frozen smoothie', 'green grapes', 'green tea', 'honey', 'low fat yogurt', 'mineral water', 'olive oil', 'salad', 'salmon', 'shrimp', 'spinach', 'tomato juice', 'vegetables mix', 'whole weat flour', 'yams'], ['burgers', 'eggs', 'meatballs'], ['chutney']]


The data is now a list of item lists, which is the input format TransactionEncoder expects.

### Handling duplicate values in dataset

1. Duplicate Items Inside a Single Transaction
Imagine a customer buys two cartons of milk in one go. Our list might look like: `['Milk', 'Milk', 'Bread']`.

The Issue: The TransactionEncoder only cares if an item is present or not (True/False). It doesn't count "how many."

The Fix: We don't actually have to do anything! The TransactionEncoder handles this automatically. If it sees "Milk" twice, it just marks the "Milk" column as True for that row.

2. Duplicate Rows (Identical Transactions)
Imagine two different customers both bought exactly `['Milk', 'Bread']`.

The Impact: In association rule mining, we actually want to keep these!

Why: Association rules rely on support (frequency). If 100 people buy Milk and Bread together, that rule is much stronger than if only 1 person does. If we delete duplicate rows, every unique basket is counted only once, which distorts the support values and weakens the model's ability to find patterns.
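Both cases can be illustrated with a small pure-Python sketch. It is not mlxtend's implementation: `one_hot` and `share_with_both` are hypothetical helpers, and the three baskets are made-up toy data rather than rows from the retail file.

```python
# Toy transactions (hypothetical, not taken from the retail file).
transactions = [
    ["milk", "milk", "bread"],  # duplicate item inside one basket
    ["milk", "bread"],          # the same basket bought by another customer
    ["eggs"],
]

def one_hot(transactions):
    """Return (sorted item names, boolean rows), one row per transaction."""
    items = sorted({item for basket in transactions for item in basket})
    rows = [[item in basket for item in items] for basket in transactions]
    return items, rows

def share_with_both(rows, items, a, b):
    """Fraction of rows in which both item a and item b are present."""
    ia, ib = items.index(a), items.index(b)
    return sum(row[ia] and row[ib] for row in rows) / len(rows)

items, rows = one_hot(transactions)
print(items)
for row in rows:
    print(row)

# Duplicate rows kept: milk and bread appear together in 2 of 3 baskets.
kept = share_with_both(rows, items, "milk", "bread")

# Duplicate rows dropped: only 1 of the 2 remaining baskets contains both.
unique_rows = [list(r) for r in dict.fromkeys(tuple(r) for r in rows)]
dropped = share_with_both(unique_rows, items, "milk", "bread")
print(kept, dropped)
```

The repeated "milk" collapses into a single True, while removing the identical second basket changes the share of baskets containing both items, which is why duplicate rows are left in place.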

In [114]:
te=TransactionEncoder()
# Fit and Transform
# 'fit' learns all the unique labels (Milk, Bread, etc.)
# 'transform' creates the True/False matrix
te_ary = te.fit(dataset).transform(dataset)

# Convert back to a DataFrame
# te.columns_ gives us the names of the items for the column headers
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

# Peek at the result
df_encoded.head()

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
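
The first encoded column sorts ahead of `almonds`, which would be consistent with an item label that carries a leading space (for example `' asparagus'`) and is therefore treated as a different item from `'asparagus'`. This is an inference from the column order, not something the output confirms. If that is the cause, trimming whitespace before encoding would merge the two columns. A minimal sketch, where `clean_transaction` is a hypothetical helper and the sample strings are made up:

```python
def clean_transaction(raw):
    """Split a comma-separated basket, trim and lowercase each label, drop empties."""
    items = (part.strip().lower() for part in raw.split(","))
    # A set also removes repeated items within the basket; sorted() fixes the order.
    return sorted({item for item in items if item})

# Made-up rows that mimic stray spaces and a trailing comma.
sample = ["shrimp, asparagus,Almonds", " asparagus ,asparagus", "chutney,"]
cleaned = [clean_transaction(raw) for raw in sample]
print(cleaned)
```

Applied to the notebook, the same idea would mean stripping each item after the split and before calling `TransactionEncoder`.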


In [116]:
frequent_itemsets = apriori(df_encoded, min_support=0.01, use_colnames=True) # appearing in at least 1% of transactions

In [112]:
# Generate rules with a minimum lift threshold
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Sort rules by Lift to find the strongest relationships
rules.sort_values('lift', ascending=False)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
214,(herb & pepper),(ground beef),0.049460,0.098254,0.015998,0.323450,3.291994,1.0,0.011138,1.332860,0.732460,0.121457,0.249734,0.243136
215,(ground beef),(herb & pepper),0.098254,0.049460,0.015998,0.162822,3.291994,1.0,0.011138,1.135410,0.772094,0.121457,0.119261,0.243136
386,(ground beef),"(mineral water, spaghetti)",0.098254,0.059725,0.017064,0.173677,2.907928,1.0,0.011196,1.137902,0.727602,0.121097,0.121190,0.229696
383,"(mineral water, spaghetti)",(ground beef),0.059725,0.098254,0.017064,0.285714,2.907928,1.0,0.011196,1.262445,0.697788,0.121097,0.207886,0.229696
395,"(mineral water, spaghetti)",(olive oil),0.059725,0.065858,0.010265,0.171875,2.609786,1.0,0.006332,1.128021,0.656007,0.089017,0.113491,0.163873
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,(french fries),(low fat yogurt),0.170911,0.076523,0.013332,0.078003,1.019340,1.0,0.000253,1.001605,0.022885,0.056948,0.001603,0.126110
131,(eggs),(olive oil),0.179709,0.065858,0.011998,0.066766,1.013783,1.0,0.000163,1.000973,0.016574,0.051370,0.000972,0.124476
130,(olive oil),(eggs),0.065858,0.179709,0.011998,0.182186,1.013783,1.0,0.000163,1.003029,0.014554,0.051370,0.003019,0.124476
144,(escalope),(spaghetti),0.079323,0.174110,0.013998,0.176471,1.013557,1.0,0.000187,1.002866,0.014528,0.058463,0.002858,0.128434


In association rule mining, `min_threshold` is a parameter of `association_rules`, used during the rule generation phase (after the frequent itemsets have already been found). It sets the minimum value of the chosen `metric` (here, lift) that a rule must reach to be kept, so it acts as a "quality filter" on the reliability or strength of the extracted rules.

While `min_support` filters out rare itemsets, `min_threshold` filters out weak relationships.
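
As a rough illustration of how the metric and threshold work together, the sketch below filters three rows copied from the rules table. `filter_rules` is a hypothetical stand-in for the filtering step, not mlxtend code, and the confidence threshold of 0.2 is an arbitrary value chosen for the example.

```python
# Three rows copied from the rules table above.
rules = [
    {"rule": "herb & pepper -> ground beef", "confidence": 0.323450, "lift": 3.291994},
    {"rule": "french fries -> low fat yogurt", "confidence": 0.078003, "lift": 1.019340},
    {"rule": "eggs -> olive oil", "confidence": 0.066766, "lift": 1.013783},
]

def filter_rules(rules, metric, min_threshold):
    """Keep the rules whose value for `metric` is at least `min_threshold`."""
    return [r for r in rules if r[metric] >= min_threshold]

# metric="lift", min_threshold=1.0 mirrors the call used in this notebook.
by_lift = filter_rules(rules, "lift", 1.0)

# Switching the metric to confidence (threshold 0.2 is an assumption) is stricter here.
by_conf = filter_rules(rules, "confidence", 0.2)

print(len(by_lift))
print([r["rule"] for r in by_conf])
```

All three rules pass the lift filter, but only the herb & pepper rule passes the confidence filter, which shows that the same threshold parameter means different things depending on the metric.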

Support Analysis: Items like "mineral water" and "spaghetti" often have high support because they are staples.
The Power of Lift: Rules with a lift > 1 indicate that the items appear together more often than would be expected if they were bought independently.
For example, $\{Herb \ \& \ Pepper\} \to \{Ground \ Beef\}$ has the highest lift in the table above (about 3.29), which suggests a strong cooking-related association.

#### Business Insight
High-confidence rules can be used for store layout optimization (placing associated items together) or bundle deals.

#### What is Lift and why is it important?
Lift measures how much more likely the consequent is to be bought given the antecedent, compared to its overall popularity. It is important because it discounts "coincidences" where an item appears frequently only because it is a popular staple (like water).
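
Lift can be recomputed by hand from the support columns of the rules table, using lift = confidence / consequent support. The sketch below does this for the (herb & pepper) → (ground beef) row. The inputs are the rounded values shown in the table, so the results match the table only approximately.

```python
# Values copied from the rules table for (herb & pepper) -> (ground beef).
support_ab = 0.015998  # support: share of baskets containing both itemsets
support_a = 0.049460   # antecedent support
support_b = 0.098254   # consequent support

# Confidence: how often the consequent appears in baskets that contain the antecedent.
confidence = support_ab / support_a

# Lift: confidence relative to the consequent's overall popularity.
lift = confidence / support_b

print(round(confidence, 3), round(lift, 2))
```

A lift near 3.29 means ground beef shows up in herb & pepper baskets roughly three times as often as it does across all baskets.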

##### What are Support and Confidence? How do you calculate them?
Support: The % of total transactions containing the itemset ($\text{Count}(A,B) / \text{Total Transactions}$).

Confidence: The probability that item B is purchased when item A is purchased ($\text{Support}(A,B) / \text{Support}(A)$).

##### What are some limitations of Association Rule Mining?
It can be computationally expensive on massive datasets (though Apriori prunes the search space). It may also generate thousands of redundant rules if thresholds are set too low.
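
To give a feel for the computational cost and for what pruning buys, the sketch below counts candidate itemsets. Apriori relies on the fact that an itemset containing an infrequent item cannot itself be frequent, so candidate pairs only need to be built from items that already pass `min_support`. Both `n_items` and `n_frequent` are hypothetical numbers chosen for illustration; they are not measured from this dataset.

```python
from math import comb

n_items = 100     # hypothetical catalogue size (assumption)
n_frequent = 20   # hypothetical number of single items that pass min_support

# Without pruning, every pair of items is a candidate.
all_pairs = comb(n_items, 2)

# With pruning, pairs are built only from the frequent single items.
pruned_pairs = comb(n_frequent, 2)

# Number of non-empty itemsets of any size that could be formed from the catalogue.
all_itemsets = 2 ** n_items - 1

print(all_pairs, pruned_pairs)
print(all_itemsets)
```

Lowering `min_support` raises the number of frequent items, so the candidate count (and with it the number of generated rules) grows quickly.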