**ASSOCIATION RULES**

In [81]:
import pandas as pd

## Dataset:

Use the Online Retail dataset to mine association rules.


In [82]:
df = pd.read_csv('/content/Online_Retail.csv', encoding='ISO-8859-1')


## Data Preprocessing:

Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.


In [83]:
df.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6.0 | 12/1/10 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6.0 | 12/1/10 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8.0 | 12/1/10 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6.0 | 12/1/10 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6.0 | 12/1/10 8:26 | 3.39 | 17850.0 | United Kingdom |


In [84]:
df.shape

(64886, 8)

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64886 entries, 0 to 64885
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    64886 non-null  object 
 1   StockCode    64886 non-null  object 
 2   Description  64719 non-null  object 
 3   Quantity     64885 non-null  float64
 4   InvoiceDate  64885 non-null  object 
 5   UnitPrice    64885 non-null  float64
 6   CustomerID   40012 non-null  float64
 7   Country      64885 non-null  object 
dtypes: float64(3), object(5)
memory usage: 4.0+ MB


In [87]:
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [78]:
# Check for missing values
print(df.isnull().sum())

InvoiceNo          0
StockCode          0
Description      167
Quantity           1
InvoiceDate        1
UnitPrice          1
CustomerID     24874
Country            1
dtype: int64


In [79]:
# Drop rows with missing values
df.dropna(inplace=True)


In [80]:
# Remove duplicates
df.drop_duplicates(inplace=True)

In [88]:
# Convert the InvoiceNo column to string for consistency
df['InvoiceNo'] = df['InvoiceNo'].astype(str)

In [92]:
# Convert InvoiceDate to datetime object for easier analysis
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a column for the total price of each line item
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

In [89]:
# Focus on United Kingdom transactions (Optional, depending on the dataset specifics)
df = df[df['Country'] == 'United Kingdom']

In [90]:
# Filter for positive quantities
df = df[df['Quantity'] > 0]

In [93]:
# Collect the item descriptions of each invoice into a list of transactions
transactions = df.groupby(['InvoiceNo'])['Description'].apply(list).values.tolist()

In [95]:
# Check the structure of the data after preprocessing
df.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalPrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6.0 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.3 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6.0 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8.0 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.0 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6.0 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6.0 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |


## Association Rule Mining:

•	Implement the Apriori algorithm using a tool such as Python with libraries such as Pandas and Mlxtend.

•	Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.

•	Set appropriate thresholds for support, confidence and lift to extract meaningful rules.
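To make the level-wise idea behind Apriori concrete before handing the work to Mlxtend, here is a minimal pure-Python sketch. The function name `apriori_sketch` and the toy baskets are illustrative assumptions, not part of the notebook or of any library, and the code favours readability over speed.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset.

    Hypothetical helper for illustration only; not the Mlxtend implementation.
    """
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Toy baskets (made up for this example).
toy = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a minimum support of 0.5, the three single items and the three pairs are frequent, while the triple appears in only one of four baskets and is dropped.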


In [96]:

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [97]:
# Pivot to an invoice × item matrix of summed quantities (0 where the item is absent)
basket = (df.groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
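The pivot above is easier to follow on a tiny example. The sketch below applies the same `groupby` / `sum` / `unstack` pattern to a made-up three-invoice frame (the invoice numbers and item names are assumptions for illustration) and then turns the quantities into presence/absence flags.

```python
import pandas as pd

# Made-up line items: invoice A2 lists MUG twice, so its quantities are summed.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A2", "A3"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "BOWL", "PLATE"],
    "Quantity": [2, 1, 1, 3, 4, 6],
})

# One row per invoice, one column per item, summed quantity in each cell.
toy_basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
                 .sum()
                 .unstack(fill_value=0))

# True where the item appears on the invoice at all.
toy_basket_bool = toy_basket > 0

print(toy_basket)
print(toy_basket_bool)
```

Passing `fill_value=0` to `unstack` is one way to avoid the separate `fillna(0)` step; either form should give the same matrix.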

In [99]:
# Convert summed quantities to booleans (True if the item appears on the invoice)
basket_sets = basket > 0

# Apply Apriori algorithm
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Print the generated rules
print(rules)

               antecedents              consequents  antecedent support  \
0  (HEART OF WICKER LARGE)  (HEART OF WICKER SMALL)            0.082721   
1  (HEART OF WICKER SMALL)  (HEART OF WICKER LARGE)            0.105237   

   consequent support   support  confidence      lift  leverage  conviction  \
0            0.105237  0.051395    0.621302  5.903812   0.04269    2.362733   
1            0.082721  0.051395    0.488372  5.903812   0.04269    1.792863   

   zhangs_metric  
0       0.905524  
1       0.928311  


In [104]:
min_confidence = 0.2  # Minimum confidence
# Filter rules based on confidence
rules = rules[rules['confidence'] >= min_confidence]
print(rules)

               antecedents              consequents  antecedent support  \
0  (HEART OF WICKER LARGE)  (HEART OF WICKER SMALL)            0.082721   
1  (HEART OF WICKER SMALL)  (HEART OF WICKER LARGE)            0.105237   

   consequent support   support  confidence      lift  leverage  conviction  \
0            0.105237  0.051395    0.621302  5.903812   0.04269    2.362733   
1            0.082721  0.051395    0.488372  5.903812   0.04269    1.792863   

   zhangs_metric  
0       0.905524  
1       0.928311  


In [106]:
min_lift = 3  # Minimum lift
# Regenerate the rules, keeping only those with lift >= min_lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=min_lift)
print(rules)

               antecedents              consequents  antecedent support  \
0  (HEART OF WICKER LARGE)  (HEART OF WICKER SMALL)            0.082721   
1  (HEART OF WICKER SMALL)  (HEART OF WICKER LARGE)            0.105237   

   consequent support   support  confidence      lift  leverage  conviction  \
0            0.105237  0.051395    0.621302  5.903812   0.04269    2.362733   
1            0.082721  0.051395    0.488372  5.903812   0.04269    1.792863   

   zhangs_metric  
0       0.905524  
1       0.928311  
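Note that In [106] rebuilds `rules` from `frequent_itemsets`, so the confidence filter from In [104] no longer applies to the result. One possible way to enforce several thresholds at once is a single boolean mask. The sketch below uses a hand-built frame holding the two rules printed above; the name `rules_demo` and the stricter `min_confidence = 0.5` are assumptions chosen only to show the filter having an effect.

```python
import pandas as pd

# Metrics copied from the printed output for the two HEART OF WICKER rules.
rules_demo = pd.DataFrame({
    "antecedents": ["HEART OF WICKER LARGE", "HEART OF WICKER SMALL"],
    "consequents": ["HEART OF WICKER SMALL", "HEART OF WICKER LARGE"],
    "support": [0.051395, 0.051395],
    "confidence": [0.621302, 0.488372],
    "lift": [5.903812, 5.903812],
})

min_support = 0.05
min_confidence = 0.5   # assumed value, stricter than the 0.2 used in the notebook
min_lift = 3

# Combine all three thresholds in one mask so none of them is silently dropped.
mask = (
    (rules_demo["support"] >= min_support)
    & (rules_demo["confidence"] >= min_confidence)
    & (rules_demo["lift"] >= min_lift)
)
strong_rules = rules_demo[mask]
print(strong_rules)
```

With these assumed thresholds only the LARGE -> SMALL rule survives, because the reverse rule has a confidence below 0.5.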


## Analysis and Interpretation:

•	Analyse the generated rules to identify interesting patterns and relationships between the products.

•	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.


In [108]:
# Sort the rules by lift
rules = rules.sort_values(by='lift', ascending=False)

In [109]:
for index, row in rules.iterrows():
    antecedents = list(row['antecedents'])
    consequents = list(row['consequents'])
    print(f"Rule: {antecedents} -> {consequents}")
    print(f"Confidence: {row['confidence']}, Lift: {row['lift']}")

Rule: ['HEART OF WICKER SMALL'] -> ['HEART OF WICKER LARGE']
Confidence: 0.4883720930232558, Lift: 5.903811751754507
Rule: ['HEART OF WICKER LARGE'] -> ['HEART OF WICKER SMALL']
Confidence: 0.621301775147929, Lift: 5.903811751754506


In [110]:
print(rules.head(10))

               antecedents              consequents  antecedent support  \
1  (HEART OF WICKER SMALL)  (HEART OF WICKER LARGE)            0.105237   
0  (HEART OF WICKER LARGE)  (HEART OF WICKER SMALL)            0.082721   

   consequent support   support  confidence      lift  leverage  conviction  \
1            0.082721  0.051395    0.488372  5.903812   0.04269    1.792863   
0            0.105237  0.051395    0.621302  5.903812   0.04269    2.362733   

   zhangs_metric  
1       0.928311  
0       0.905524  


## Interview Questions:
1.	What is lift and why is it important in Association rules?

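Lift measures the strength of an association between items. For a rule A -> B, it is the ratio of the observed support of A and B together to the support expected if A and B were independent: lift(A -> B) = support(A and B) / (support(A) × support(B)). A lift of 1 indicates independence, a lift above 1 means the items occur together more often than chance would predict, and a lift below 1 means they occur together less often. Lift is important because confidence alone can be high simply because the consequent is popular; lift corrects for that. In this notebook the HEART OF WICKER rules have a lift of about 5.9, so the two items are bought together almost six times as often as independence would predict.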
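As a quick check on that definition, the lift reported by Mlxtend can be recomputed by hand from the support columns of the printed rules table. The variable names below are illustrative; the three input numbers are copied from the output for HEART OF WICKER LARGE -> HEART OF WICKER SMALL.

```python
# Values copied from the printed rules table.
support_both = 0.051395    # support of {LARGE, SMALL} together
support_large = 0.082721   # antecedent support
support_small = 0.105237   # consequent support

# Confidence: how often SMALL appears given that LARGE is in the basket.
confidence = support_both / support_large

# Lift: observed joint support relative to what independence would predict.
lift = support_both / (support_large * support_small)

# Leverage: absolute difference between observed and expected joint support.
leverage = support_both - support_large * support_small

print(f"confidence={confidence:.4f} lift={lift:.4f} leverage={leverage:.5f}")
```

The results should agree with the table's confidence, lift and leverage columns up to the rounding of the printed inputs.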


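2.	What are support and confidence, and how do you calculate them?

Support and confidence are the basic metrics for judging the significance and reliability of a rule A -> B. Support is the fraction of all transactions that contain both A and B: support(A -> B) = count(A and B) / N, where N is the total number of transactions. Confidence is the fraction of transactions containing A that also contain B: confidence(A -> B) = support(A and B) / support(A). For example, the rule HEART OF WICKER LARGE -> HEART OF WICKER SMALL above has a support of about 0.051 and a confidence of 0.051395 / 0.082721 ≈ 0.62.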
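Both metrics reduce to counting baskets, which a few lines of plain Python can show. The helper names `support` and `confidence` and the five toy baskets are assumptions made for this sketch.

```python
# Five made-up baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s = support({"bread", "milk"}, transactions)       # 2 of 5 baskets
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread baskets
print(f"support={s:.2f} confidence={c:.2f}")
```

Here bread and milk appear together in two of five baskets, and two of the three bread baskets also contain milk.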

3.	What are some limitations or challenges of Association rules mining?

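Association rule mining is a powerful tool, but it comes with limitations and challenges. The number of candidate itemsets grows exponentially with the number of distinct items, so computation and memory become expensive, especially at low support thresholds. Results are sensitive to the chosen thresholds: with min_support=0.05 this notebook finds only a single item pair, while a much lower threshold can produce more rules than anyone can review. Support-based pruning also misses rare but potentially valuable items. Finally, rules describe co-occurrence rather than causation, and many of them are redundant or obvious, so they still need domain knowledge to interpret.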
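One commonly cited challenge is the combinatorial growth of the search space: n distinct items give 2^n - 1 non-empty itemsets. The short sketch below counts them; `candidate_count` is a hypothetical helper written for this illustration.

```python
from math import comb


def candidate_count(n_items, max_len=None):
    """Number of non-empty itemsets of size up to max_len over n_items items."""
    max_len = n_items if max_len is None else max_len
    return sum(comb(n_items, k) for k in range(1, max_len + 1))


for n in (10, 20, 40):
    # All itemsets versus itemsets limited to singles and pairs.
    print(n, candidate_count(n), candidate_count(n, max_len=2))
```

Capping the itemset length, or raising the minimum support so that Apriori prunes early, are two usual ways to keep this count manageable.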
