In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
import networkx as nx
import seaborn as sns

# Load the dataset
file_path = "Online retail.xlsx"
df = pd.read_excel(file_path)

# Data Preprocessing
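# Drop rows with missing identifiers or descriptions
df = df.dropna(subset=["InvoiceNo", "StockCode", "Description"])
# Keep only positive quantities
df = df[df['Quantity'] > 0]
# Exclude invoices whose number starts with 'C' (assumed to mark cancellations)
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]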

# Convert data into a suitable format for association rule mining
basket = df.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().fillna(0)
# Boolean one-hot matrix: True if the item appears on the invoice
basket = basket > 0

# Apply Apriori Algorithm
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Keep only rules above the support and confidence thresholds
rules = rules[(rules['support'] > 0.01) & (rules['confidence'] > 0.1)]

# Visualizing Top Association Rules
plt.figure(figsize=(8, 6))
sns.heatmap(rules[['support', 'confidence', 'lift']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between Support, Confidence, and Lift")
plt.show()

# Network Graph Visualization
G = nx.Graph()
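for _, row in rules.iterrows():
    # Antecedents and consequents are frozensets; join them into readable node labels
    antecedent = ', '.join(sorted(map(str, row['antecedents'])))
    consequent = ', '.join(sorted(map(str, row['consequents'])))
    G.add_edge(antecedent, consequent, weight=row['lift'])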

plt.figure(figsize=(10, 6))
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color='lightblue', edge_color='gray', font_size=8)
plt.title("Association Rules Network Graph")
plt.show()

# Display rules
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift', 'conviction']])


KeyError: ['InvoiceNo', 'StockCode', 'Description']
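
`dropna` raises this `KeyError` when the names listed in `subset` are not columns of `df`, so the loaded sheet probably has different headers (or no header row) than the code expects. A minimal sketch of a check to run right after loading follows; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, not part of pandas, and the example headers are made up:

```python
# Columns the preprocessing and basket steps rely on.
REQUIRED_COLUMNS = ["InvoiceNo", "StockCode", "Description", "Quantity"]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Made-up headers for illustration; in the notebook, pass df.columns instead.
example_headers = ["Invoice", "StockCode", "Description", "Quantity"]
print(missing_columns(example_headers))
```

If the returned list is non-empty, inspecting `df.columns` and renaming the columns (or adjusting how the file is read) before preprocessing should avoid the error.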

## Interview Questions:
### What is lift and why is it important in association rules?

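Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent. Equivalently, it is the confidence of the rule divided by the support of the consequent. It is important because confidence alone can be misleading: a rule can have high confidence simply because the consequent is popular overall. A lift greater than 1 indicates a positive association, a lift close to 1 indicates that the itemsets are roughly independent, and a lift less than 1 indicates a negative association.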
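
As a formula, using the same terms as the support and confidence definitions in this section (the numbers in the worked line are made up for illustration, not taken from the dataset):

$$\text{Lift} = \frac{\text{Confidence}}{\text{Support of consequent}} = \frac{\text{Support of antecedent and consequent}}{\text{Support of antecedent} \times \text{Support of consequent}}$$

$$\text{Lift} = \frac{0.10}{0.20 \times 0.25} = 2$$

In this hypothetical case the two itemsets appear together twice as often as independence would predict.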

### What is support and confidence? How do you calculate them?

Support is the proportion of transactions that contain both the antecedent and consequent items. It is calculated as:

$$\text{Support} = \frac{\text{Number of transactions containing both items}}{\text{Total number of transactions}}$$

Confidence is the probability that the consequent item is purchased when the antecedent item is purchased. It is calculated as:

$$\text{Confidence} = \frac{\text{Support of antecedent and consequent}}{\text{Support of antecedent}}$$

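The two definitions above can be checked by hand on a few transactions. This is a minimal sketch in plain Python; the transactions and the helper names `support` and `confidence` are invented for illustration and are not the mlxtend API. Fractions are used so the results are exact.

```python
from fractions import Fraction

# Toy transactions, invented for illustration (not from the retail dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    count = sum(itemset <= t for t in transactions)
    return Fraction(count, len(transactions))

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 3 of the 5 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # (3/5) / (4/5)
```
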
### What are some limitations or challenges of association rule mining?

- **Scalability:** The number of candidate itemsets grows exponentially with the number of distinct items, so mining can be computationally expensive on large datasets.
- **Quality of rules:** Association rule mining can generate a very large number of rules, many of which are redundant, trivial, or not actionable.
- **Threshold selection:** The thresholds chosen for support, confidence, and lift strongly affect the results. Setting them too high can miss important relationships, while setting them too low can produce too many irrelevant rules.
- **Interpretability:** Some rules are difficult to interpret or do not translate into clear, actionable insights.
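
The scalability and threshold points can be seen even on a tiny example. This sketch enumerates every possible itemset by brute force (deliberately not Apriori, which prunes candidates) and counts how many pass each minimum support; the transactions and the function `frequent_itemsets` are invented for illustration.

```python
from itertools import combinations

# Toy transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def frequent_itemsets(transactions, min_support):
    """Brute force: return {itemset: support} for itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    # With k distinct items there are 2**k - 1 non-empty candidates to check.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            if count / n >= min_support:
                result[candidate] = count / n
    return result

for threshold in (0.2, 0.4, 0.6):
    print(threshold, len(frequent_itemsets(transactions, threshold)))
```

With only four distinct items, raising the minimum support from 0.2 to 0.4 to 0.6 cuts the number of frequent itemsets from 11 to 7 to 3; with thousands of items the candidate space becomes far too large to enumerate this way.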