# Association Rules.

### Objective :
The objective of this assignment is to introduce rule mining techniques, with a focus on market basket analysis, and to provide hands-on experience.

### Task-1 Data Preprocessing :
Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

In [3]:
# import the required libraries.
import pandas as pd
import warnings

In [7]:
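# Load and read the given dataset.
# Note: the output below suggests the file has no header row, so the first
# transaction becomes the column name; passing header=None would keep it as data.
warnings.simplefilter('ignore')
data=pd.read_excel('Online retail.xlsx')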

In [9]:
data.head()

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
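In the output above, the column name is itself a basket of items, which suggests the first transaction was consumed as a header. The sketch below reproduces the effect on a hypothetical three-line file using `pd.read_csv` with a tab separator (an assumption made only so the example needs no spreadsheet); `pd.read_excel` accepts the same `header` and `names` parameters.

```python
import io
import pandas as pd

# Hypothetical three-basket file; a tab separator keeps each line in one column.
raw = "shrimp,almonds,avocado\nburgers,meatballs,eggs\nchutney\n"

# Default behaviour: the first line is consumed as the column name.
with_header = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps the first line as data; names= supplies a column label.
without_header = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["Items"])

print(len(with_header))     # one basket fewer than the file contains
print(len(without_header))  # every basket kept as a row
```

Whether losing one transaction matters here is a judgment call, but reading with `header=None` would also make the later `data.columns=['Items']` step unnecessary.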


In [205]:
data

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
...,...
7495,"butter,light mayo,fresh bread"
7496,"burgers,frozen vegetables,eggs,french fries,ma..."
7497,chicken
7498,"escalope,green tea"


In [206]:
# Handling Missing Values.
data.dropna(inplace=True)

In [207]:
data.head(10)

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
5,"whole wheat pasta,french fries"
6,"soup,light cream,shallot"
7,"frozen vegetables,spaghetti,green tea"
8,french fries
9,"eggs,pet food"


In [208]:
# Removing Duplicate Rows
data.drop_duplicates(inplace=True)

In [209]:
data.head(20)

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
5,"whole wheat pasta,french fries"
6,"soup,light cream,shallot"
7,"frozen vegetables,spaghetti,green tea"
8,french fries
9,"eggs,pet food"
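Removing duplicate rows is part of the task, but in basket data two identical rows may be two genuine purchases, so de-duplication can shift the support values computed later. A minimal sketch of that effect on hypothetical baskets (not the assignment data):

```python
# Hypothetical baskets: the same purchase made on three separate occasions.
baskets = ["milk,bread", "milk,bread", "milk,bread", "eggs"]

def support(item, rows):
    """Fraction of rows whose comma-separated items include `item`."""
    return sum(item in row.split(",") for row in rows) / len(rows)

# Order-preserving de-duplication, analogous in effect to drop_duplicates().
deduplicated = list(dict.fromkeys(baskets))

support_before = support("milk", baskets)      # 3 of 4 baskets
support_after = support("milk", deduplicated)  # 1 of 2 baskets
print(support_before, support_after)
```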


#### Converting the Data into an Appropriate Format.

In [211]:
data.columns=['Items']

In [212]:
# Splitting the Items into Lists.
# Strip whitespace around each item and drop entries that are empty after stripping.
trans=data['Items'].str.split(',').apply(lambda x:[item.strip() for item in x if item.strip()])

In [213]:
# Converting to One-Hot Encoding format.
from mlxtend.preprocessing import TransactionEncoder
Te=TransactionEncoder()
data_Te=Te.fit(trans).transform(trans)
df=pd.DataFrame(data_Te,columns=Te.columns_)

In [214]:
data.head(20)

Unnamed: 0,Items
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
5,"whole wheat pasta,french fries"
6,"soup,light cream,shallot"
7,"frozen vegetables,spaghetti,green tea"
8,french fries
9,"eggs,pet food"


In [215]:
df.head(10)

Unnamed: 0,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
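The boolean table above has one column per distinct item and one row per basket. As a rough illustration of what the encoding step produces, here is a pure-Python sketch on hypothetical baskets; it mimics the alphabetical column order seen above but is not the `TransactionEncoder` implementation.

```python
# Hypothetical baskets, already split into lists of item names.
transactions = [["turkey", "avocado"], ["chutney"], ["avocado", "eggs"]]

# Column order: sorted unique items.
columns = sorted({item for basket in transactions for item in basket})

# One boolean row per basket: True where the basket contains the column's item.
encoded = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in encoded:
    print(row)
```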


### Task-2 Association Rule Mining :
1. Implement the Apriori algorithm using a tool such as Python with libraries such as Pandas and Mlxtend.
2. Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.
3. Set appropriate thresholds for support, confidence and lift to extract meaningful rules.

#### 1.Implement the Apriori algorithm using a tool such as Python with libraries such as Pandas and Mlxtend.

In [218]:
# Import required libraries
from mlxtend.frequent_patterns import apriori,association_rules

In [219]:
# Applying the Apriori algorithm to find frequent itemsets (support of at least 1%).
freq_itemsets=apriori(df,min_support=0.01,use_colnames=True)

In [220]:
freq_itemsets.head(10)

Unnamed: 0,support,itemsets
0,0.029179,(almonds)
1,0.011014,(antioxydant juice)
2,0.045797,(avocado)
3,0.01256,(bacon)
4,0.015459,(barbecue sauce)
5,0.020483,(black tea)
6,0.01314,(blueberries)
7,0.016232,(body spray)
8,0.045024,(brownies)
9,0.012367,(bug spray)
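To make the level-wise search behind frequent itemset mining concrete, the following is a minimal pure-Python sketch of the Apriori idea on a hypothetical toy dataset. The function name `frequent_itemsets` and the data are illustrative assumptions; this is not how `mlxtend` implements `apriori`.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (Apriori property).
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

# Hypothetical toy baskets.
toy = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk", "eggs"],
    ["bread"],
]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: an itemset is only counted if all of its smaller subsets were already frequent.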


#### 2.Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.
#### 3.Set appropriate thresholds for support, confidence and lift to extract meaningful rules.

In [222]:
# Generation of Association Rules.
rules=association_rules(freq_itemsets,metric="confidence",min_threshold=0.2)

In [223]:
# Filtering rules Based on Metrics.
rules=rules[rules['lift']>1.0]

In [224]:
rules[['antecedents','consequents','support','confidence','lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(almonds),(mineral water),0.010821,0.370861,1.237399
1,(avocado),(chocolate),0.010242,0.223629,1.089716
2,(avocado),(french fries),0.011594,0.253165,1.314069
3,(avocado),(milk),0.010821,0.236287,1.389528
4,(avocado),(mineral water),0.015845,0.345992,1.154421
...,...,...,...,...,...
350,"(shrimp, spaghetti)",(mineral water),0.012367,0.407643,1.360125
351,"(soup, mineral water)",(spaghetti),0.010821,0.323699,1.410054
352,"(soup, spaghetti)",(mineral water),0.010821,0.523364,1.746235
353,"(tomatoes, mineral water)",(spaghetti),0.013527,0.391061,1.703487


In [225]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(almonds),(mineral water),0.029179,0.299710,0.010821,0.370861,1.237399,0.002076,1.113092,0.197619
1,(avocado),(chocolate),0.045797,0.205217,0.010242,0.223629,1.089716,0.000843,1.023715,0.086281
2,(avocado),(french fries),0.045797,0.192657,0.011594,0.253165,1.314069,0.002771,1.081019,0.250476
3,(avocado),(milk),0.045797,0.170048,0.010821,0.236287,1.389528,0.003034,1.086732,0.293786
4,(avocado),(mineral water),0.045797,0.299710,0.015845,0.345992,1.154421,0.002120,1.070766,0.140185
...,...,...,...,...,...,...,...,...,...,...
350,"(shrimp, spaghetti)",(mineral water),0.030338,0.299710,0.012367,0.407643,1.360125,0.003274,1.182210,0.273058
351,"(soup, mineral water)",(spaghetti),0.033430,0.229565,0.010821,0.323699,1.410054,0.003147,1.139190,0.300865
352,"(soup, spaghetti)",(mineral water),0.020676,0.299710,0.010821,0.523364,1.746235,0.004624,1.469236,0.436362
353,"(tomatoes, mineral water)",(spaghetti),0.034589,0.229565,0.013527,0.391061,1.703487,0.005586,1.265209,0.427765
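The metric columns in this table can be recomputed from the three support values of a rule. As a check, the sketch below uses the numbers printed in row 0, (almonds) -> (mineral water), with the standard definitions of confidence, lift, leverage and conviction; small differences from the table come from the inputs being rounded to six decimals.

```python
# Values copied from row 0 of the table above: (almonds) -> (mineral water).
support_ab = 0.010821   # support of {almonds, mineral water}
support_a = 0.029179    # antecedent support
support_b = 0.299710    # consequent support

confidence = support_ab / support_a              # P(B | A)
lift = confidence / support_b                    # confidence relative to P(B)
leverage = support_ab - support_a * support_b    # observed minus expected co-occurrence
conviction = (1 - support_b) / (1 - confidence)  # how often the rule would fail by chance vs. observed

print(round(confidence, 4), round(lift, 4), round(leverage, 6), round(conviction, 4))
```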


In [226]:
rules_sort=rules.sort_values(by='lift',ascending=False)

In [227]:
rules_sort[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head()

Unnamed: 0,antecedents,consequents,support,confidence,lift
176,(whole wheat pasta),(olive oil),0.011014,0.271429,3.100757
126,(herb & pepper),(ground beef),0.022802,0.343023,2.5251
312,"(shrimp, mineral water)",(frozen vegetables),0.010435,0.312139,2.403747
305,"(frozen vegetables, spaghetti)",(ground beef),0.01256,0.321782,2.368738
343,"(milk, spaghetti)",(olive oil),0.010242,0.204633,2.337697


### Task-3 Analysis and Interpretation :
1. Analyse the generated rules to identify interesting patterns and relationships between the products.
2. Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.


In [230]:
print(rules_sort.head(10))

                            antecedents          consequents  \
176                 (whole wheat pasta)          (olive oil)   
126                     (herb & pepper)        (ground beef)   
312             (shrimp, mineral water)  (frozen vegetables)   
305      (frozen vegetables, spaghetti)        (ground beef)   
343                   (milk, spaghetti)          (olive oil)   
303    (ground beef, frozen vegetables)          (spaghetti)   
337               (soup, mineral water)               (milk)   
188                (eggs, french fries)            (burgers)   
330          (mineral water, spaghetti)        (ground beef)   
313  (frozen vegetables, mineral water)             (shrimp)   

     antecedent support  consequent support   support  confidence      lift  \
176            0.040580            0.087536  0.011014    0.271429  3.100757   
126            0.066473            0.135845  0.022802    0.343023  2.525100   
312            0.033430            0.129855  0.010435    0

### Interview Questions:
1. What is lift and why is it important in association rules?
2. What are support and confidence, and how do you calculate them?
3. What are some limitations or challenges of association rule mining?

#### 1.What is lift and why is it important in Association rules?
Lift measures how much more often two itemsets appear together than would be expected if they were statistically independent. It therefore shows whether a rule reflects a meaningful relationship rather than just the overall popularity of the consequent.

A lift greater than 1 indicates a positive association (the items appear together more often than chance predicts), a lift less than 1 indicates a negative association (they appear together less often than chance predicts), and a lift of exactly 1 indicates independence.
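Written as a formula, using the same support and confidence quantities reported in the rules table, lift can be expressed as below. The numeric line plugs in the confidence and consequent support printed for (herb & pepper) -> (ground beef) in the earlier output.

```latex
\mathrm{lift}(A \to B)
  = \frac{\mathrm{confidence}(A \to B)}{\mathrm{support}(B)}
  = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)}

\mathrm{lift}(\text{herb \& pepper} \to \text{ground beef})
  = \frac{0.343023}{0.135845} \approx 2.525
```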

#### 2.What are support and confidence, and how do you calculate them?
Support measures how frequently an item or a combination of items appears in the dataset. It is the fraction of all transactions that contain the itemset, which helps in identifying common itemsets.

Formula: Support(A → B) = (Transactions containing both A and B) / (Total transactions)

Confidence measures how reliable a rule is: it is the likelihood of finding item B in a transaction given that item A is already present.

Formula: Confidence(A → B) = (Transactions containing both A and B) / (Transactions containing A)
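The two formulas can be applied directly by counting transactions. The following sketch does so on a hypothetical set of five baskets; the helper names `support` and `confidence` are illustrative and not part of any library.

```python
# Hypothetical baskets used only to illustrate the two formulas.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk"},
]

def support(items, rows):
    """Fraction of rows that contain every item in `items`."""
    items = set(items)
    return sum(items <= row for row in rows) / len(rows)

def confidence(antecedent, consequent, rows):
    """Support of the combined itemset divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined, rows) / support(antecedent, rows)

support_milk_bread = support({"milk", "bread"}, baskets)             # 2 of 5 baskets
confidence_milk_bread = confidence({"milk"}, {"bread"}, baskets)     # 2 of the 4 milk baskets
print(support_milk_bread, confidence_milk_bread)
```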

#### 3.What are some limitations or challenges of association rule mining?
Scalability: With large datasets, the number of possible item combinations can be enormous, leading to high computational costs.

Threshold Selection: Setting support and confidence thresholds is tricky. Too low, and you get an overwhelming number of rules; too high, and you might miss potentially useful associations.

Relevancy: Not all discovered rules are useful or interesting; filtering out trivial or irrelevant rules can be challenging.
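To give a sense of the scalability point, the number of non-empty itemsets that can be formed from n distinct items is 2^n - 1. The short sketch below prints that count for a few illustrative values of n (chosen arbitrarily, not taken from the dataset); support thresholds and pruning exist precisely to avoid enumerating all of them.

```python
def candidate_itemsets(n_items):
    """Number of non-empty subsets of a set with n_items distinct items."""
    return 2 ** n_items - 1

for n in (5, 10, 20, 40):
    print(n, candidate_itemsets(n))
```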