# **KDD- Experiment 6**

*   **SIA VASHIST**
*    PRN: 20190802107

---


#  Dataset used: Market Basket Optimisation
---

# Libraries used: 
> Pandas | mlxtend - TransactionEncoder , apriori

# Apriori algorithm using libraries:

In [1]:
import pandas as pd

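# Load the dataset (the CSV has no header row; each row is one transaction)
Market_data = pd.read_csv(r'C:\sia\Market_Basket_Optimisation.csv', header=None)
print("The Dataset is as follows:")
# dropna() keeps only the rows with no missing cells, i.e. baskets that fill all 20 columns
print(Market_data.dropna(), '\n')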

The Dataset is as follows:
       0        1        2               3             4                 5   \
0  shrimp  almonds  avocado  vegetables mix  green grapes  whole weat flour   

     6               7             8             9               10  \
0  yams  cottage cheese  energy drink  tomato juice  low fat yogurt   

          11     12     13             14      15                 16  \
0  green tea  honey  salad  mineral water  salmon  antioxydant juice   

                17       18         19  
0  frozen smoothie  spinach  olive oil   



In [2]:
Market_data.shape

(7501, 20)

In [3]:
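# Convert the dataframe into a list of lists (one list per transaction).
# Note: str() turns every missing cell (NaN) into the literal string 'nan',
# so 'nan' is treated as an item and shows up in the itemsets and rules below.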
transactions = []
for i in range(len(Market_data)):
    transaction = [str(Market_data.values[i,j]) for j in range(len(Market_data.columns))]
    transactions.append(transaction)
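
Because `str()` turns missing cells into the string `'nan'`, one way to avoid a spurious `nan` item is to drop missing cells while building the transactions. The sketch below is a minimal, standard-library illustration rather than the method used in this experiment. `clean_transactions` is a hypothetical helper name, and it assumes the rows arrive as lists of cell values (for example from `Market_data.values.tolist()`), where missing cells are float NaN. Stripping whitespace is also an assumption, made in case some labels differ only by leading or trailing spaces.

```python
import math

def clean_transactions(rows):
    """Return one list of item names per row, skipping missing cells."""
    cleaned = []
    for row in rows:
        items = []
        for cell in row:
            # Missing cells are float NaN after pd.read_csv; skip them
            if isinstance(cell, float) and math.isnan(cell):
                continue
            items.append(str(cell).strip())
        cleaned.append(items)
    return cleaned

# Toy rows shaped like Market_data.values.tolist() (illustrative data only)
rows = [
    ["shrimp", "almonds", float("nan")],
    ["burgers", float("nan"), float("nan")],
]
print(clean_transactions(rows))
```

Passing the cleaned lists to `TransactionEncoder` would then produce no `nan` column, so the rules would only involve real products.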


In [4]:
from mlxtend.preprocessing import TransactionEncoder

# Convert the list of transactions into a pandas DataFrame with binary values
te = TransactionEncoder()
te_ary = te.fit_transform(transactions)
transactions_df = pd.DataFrame(te_ary, columns=te.columns_)
print(transactions_df)


       asparagus  almonds  antioxydant juice  asparagus  avocado  babies food  \
0          False     True               True      False     True        False   
1          False    False              False      False    False        False   
2          False    False              False      False    False        False   
3          False    False              False      False     True        False   
4          False    False              False      False    False        False   
...          ...      ...                ...        ...      ...          ...   
7496       False    False              False      False    False        False   
7497       False    False              False      False    False        False   
7498       False    False              False      False    False        False   
7499       False    False              False      False    False        False   
7500       False    False              False      False    False        False   

      bacon  barbecue sauce

In [11]:
from mlxtend.frequent_patterns import apriori, association_rules

# Find frequent itemsets
frequent_itemsets = apriori(transactions_df, min_support=0.05, use_colnames=True)
frequent_itemsets

| | support | itemsets |
|---|---|---|
| 0 | 0.087188 | (burgers) |
| 1 | 0.081056 | (cake) |
| 2 | 0.059992 | (chicken) |
| 3 | 0.163845 | (chocolate) |
| 4 | 0.080389 | (cookies) |
| 5 | 0.05106 | (cooking oil) |
| 6 | 0.179709 | (eggs) |
| 7 | 0.079323 | (escalope) |
| 8 | 0.170911 | (french fries) |
| 9 | 0.063325 | (frozen smoothie) |


> We set the 'min_support' parameter to '0.05', so only itemsets that appear in at least 5% of the transactions are kept.
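
To make the threshold concrete, the sketch below computes support by hand on a toy set of baskets and applies a minimum support to it. The baskets, the `support` helper, and the 0.5 threshold are illustrative assumptions, not values from the dataset.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted.issubset(t))
    return hits / len(transactions)

# Toy baskets (illustrative data, not taken from the dataset)
baskets = [
    {"eggs", "milk"},
    {"eggs", "chocolate"},
    {"milk"},
    {"eggs", "milk", "chocolate"},
]

min_support = 0.5
for candidate in [("eggs",), ("chocolate",), ("eggs", "milk"), ("milk", "chocolate")]:
    s = support(candidate, baskets)
    print(candidate, s, "kept" if s >= min_support else "dropped")
```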

In [13]:
# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (burgers) | (nan) | 0.087188 | 0.999867 | 0.087188 | 1.0 | 1.000133 | 1.2e-05 | inf |
| 1 | (nan) | (burgers) | 0.999867 | 0.087188 | 0.087188 | 0.0872 | 1.000133 | 1.2e-05 | 1.000013 |
| 2 | (cake) | (nan) | 0.081056 | 0.999867 | 0.081056 | 1.0 | 1.000133 | 1.1e-05 | inf |
| 3 | (nan) | (cake) | 0.999867 | 0.081056 | 0.081056 | 0.081067 | 1.000133 | 1.1e-05 | 1.000012 |
| 4 | (chicken) | (nan) | 0.059992 | 0.999867 | 0.059992 | 1.0 | 1.000133 | 8e-06 | inf |
| 5 | (nan) | (chicken) | 0.999867 | 0.059992 | 0.059992 | 0.06 | 1.000133 | 8e-06 | 1.000009 |
| 6 | (chocolate) | (mineral water) | 0.163845 | 0.238368 | 0.05266 | 0.3214 | 1.348332 | 0.013604 | 1.122357 |
| 7 | (mineral water) | (chocolate) | 0.238368 | 0.163845 | 0.05266 | 0.220917 | 1.348332 | 0.013604 | 1.073256 |
| 8 | (chocolate) | (nan) | 0.163845 | 0.999867 | 0.163845 | 1.0 | 1.000133 | 2.2e-05 | inf |
| 9 | (nan) | (chocolate) | 0.999867 | 0.163845 | 0.163845 | 0.163867 | 1.000133 | 2.2e-05 | 1.000026 |


> Here, we set the metric parameter to "lift" and the 'min_threshold' parameter to '1', so only rules with a lift of at least 1 are kept. A lift of 1 means the antecedent and consequent occur together about as often as they would if they were independent; values above 1 indicate a positive association.
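
As a check on how these columns relate, the sketch below recomputes confidence, lift, and leverage for the {chocolate} -> {mineral water} rule from the three support values listed in the table. It uses the standard definitions (confidence = support(A and B) / support(A), lift = confidence / support(B), leverage = support(A and B) - support(A) * support(B)); `rule_metrics` is a hypothetical helper name, not part of mlxtend.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Return (confidence, lift, leverage) for the rule A -> B."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    return confidence, lift, leverage

# Supports of (chocolate), (mineral water) and the pair, as listed in the rules table
confidence, lift, leverage = rule_metrics(0.163845, 0.238368, 0.05266)
print(round(confidence, 4), round(lift, 4), round(leverage, 4))
```

The results agree with the table up to rounding of the inputs, which shows that the lift of about 1.35 follows directly from the three supports.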

# Observations: 
> 1. Mineral water is the item most often bought with others. Popular combinations include "mineral water" and "eggs", "mineral water" and "spaghetti", "chocolate" and "mineral water", "ground beef" and "mineral water", and "milk" and "mineral water".

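> 2. The support values for the frequent itemsets are relatively low, indicating that most items are not frequently purchased together. The lift values are modest as well: the rule {chocolate} -> {mineral water} has a lift of about 1.35, a mild positive association, while the rules involving "nan" have a lift of about 1.0001, which is essentially independence. The "nan" item is not a real product; it comes from empty cells being converted to the string 'nan'.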

> 3. The confidence values of the rules between real items are moderate rather than high. For example, the rule {chocolate} -> {mineral water} has a confidence of 0.32, meaning that about 32% of the transactions containing chocolate also contain mineral water, while the reverse rule {mineral water} -> {chocolate} has a confidence of 0.22. The confidence of 1.0 on rules whose consequent is "nan" only reflects that almost every transaction has at least one empty cell.

> 4. Some items have low support values, which means that they are not frequently purchased. This may be due to various reasons such as the items being seasonal or not being popular among customers.

> 5. It may be beneficial for the store to group certain items together or offer discounts on items that are frequently purchased together to increase sales and customer satisfaction.

---
# Apriori algorithm <b>without</b> using libraries:

In [7]:
import pandas as pd

# Load the market basket optimization dataset:
df = pd.read_csv(r'C:\sia\Market_Basket_Optimisation.csv', header=None)

In [8]:
# Define a function to get the support of an itemset
def get_support(itemset, transactions):
    count = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            count += 1
    return count / len(transactions)

# Define a function to generate candidate itemsets of length k+1 from itemsets of length k
def generate_candidates(itemsets):
    candidates = []
    for i in range(len(itemsets)):
        for j in range(i+1, len(itemsets)):
            if itemsets[i][:-1] == itemsets[j][:-1]:
                candidate = itemsets[i] + (itemsets[j][-1],)
                candidates.append(candidate)
    return candidates
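
The function above performs the join step only. Textbook Apriori usually adds a prune step that discards any candidate with an infrequent k-item subset before its support is counted. The sketch below is one possible way to write that, assuming every itemset is a tuple whose items are in sorted order; `generate_candidates_pruned` is a hypothetical name and the toy pairs are illustrative.

```python
from itertools import combinations

def generate_candidates_pruned(frequent_k):
    """Join sorted k-item tuples, then prune candidates with an infrequent subset."""
    frequent = set(frequent_k)
    ordered = sorted(frequent)
    candidates = []
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            a, b = ordered[i], ordered[j]
            # Join step: merge two itemsets that share their first k-1 items
            if a[:-1] != b[:-1]:
                continue
            candidate = a + (b[-1],)
            # Prune step: every k-item subset must itself be frequent
            if all(sub in frequent for sub in combinations(candidate, len(a))):
                candidates.append(candidate)
    return candidates

# Toy frequent pairs (illustrative data only)
frequent_pairs = [
    ("eggs", "milk"),
    ("eggs", "spaghetti"),
    ("milk", "spaghetti"),
    ("eggs", "tea"),
]
print(generate_candidates_pruned(frequent_pairs))
```

Here the candidate containing "tea" and "milk" is dropped without scanning the transactions, because the pair ("milk", "tea") is not frequent.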

In [9]:
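# Define the Apriori algorithm
# Note: this version starts from every distinct item without checking its support,
# so it reports only itemsets of size 2 and larger (single items are never listed).
def apriori(transactions, min_support):
    # Start with one 1-item tuple per distinct item (this includes the 'nan' placeholder)
    itemsets = []
    for item in set([item for transaction in transactions for item in transaction]):
        itemsets.append((item,))
    frequent_itemsets = []
    while itemsets:
        # Join step: build (k+1)-item candidates from the current k-item sets
        candidates = generate_candidates(itemsets)
        itemsets = []
        for candidate in candidates:
            # Keep the candidates that meet the minimum support for the next round
            support = get_support(candidate, transactions)
            if support >= min_support:
                itemsets.append(candidate)
                frequent_itemsets.append((candidate, support))
    return frequent_itemsets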

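# Run the Apriori algorithm on the dataset with a minimum support of 1%
# (missing cells become the string 'nan' here as well, so 'nan' appears as an item)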
transactions = []
for i in range(len(df)):
    transactions.append([str(df.values[i,j]) for j in range(len(df.columns))])
frequent_itemsets = apriori(transactions, 0.01)
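
For comparison, the sketch below is a compact variant that also filters and reports single items, which the conventional formulation of Apriori does at its first level. It is an illustration under stated assumptions, not a replacement for the code above: `apriori_sketch` is a hypothetical name, transactions are assumed to be already free of missing values, and the toy data is made up.

```python
def apriori_sketch(transactions, min_support):
    """Return {itemset tuple: support} for all frequent itemsets, single items included."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if b.issuperset(itemset)) / n

    # Level 1: keep only the single items that meet min_support
    items = sorted({item for b in baskets for item in b})
    current = {}
    for item in items:
        s = support((item,))
        if s >= min_support:
            current[(item,)] = s

    frequent = {}
    while current:
        frequent.update(current)
        level = sorted(current)
        current = {}
        # Join itemsets that share a prefix, then keep the frequent candidates
        for i in range(len(level)):
            for j in range(i + 1, len(level)):
                if level[i][:-1] == level[j][:-1]:
                    candidate = level[i] + (level[j][-1],)
                    s = support(candidate)
                    if s >= min_support:
                        current[candidate] = s
    return frequent

# Toy transactions (illustrative data only)
toy = [["eggs", "milk"], ["eggs", "milk", "tea"], ["eggs"], ["milk"]]
for itemset, s in apriori_sketch(toy, 0.5).items():
    print(itemset, s)
```

Filtering at level 1 also shrinks the number of pairs that have to be counted, since infrequent items never enter the join.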

In [10]:
# Print the frequent itemsets and their supports
for itemset, support in frequent_itemsets:
    print(itemset, support)

('black tea', 'nan') 0.014264764698040262
('yams', 'nan') 0.011331822423676844
('butter', 'nan') 0.030129316091187842
('ham', 'nan') 0.026529796027196375
('melons', 'nan') 0.011998400213304892
('olive oil', 'nan') 0.0657245700573257
('olive oil', 'mineral water') 0.027596320490601255
('olive oil', 'milk') 0.017064391414478068
('olive oil', 'spaghetti') 0.022930275963204905
('olive oil', 'frozen vegetables') 0.011331822423676844
('olive oil', 'eggs') 0.011998400213304892
('olive oil', 'chocolate') 0.01639781362485002
('olive oil', 'ground beef') 0.014131449140114652
('olive oil', 'pancakes') 0.010798560191974404
('nan', 'vegetables mix') 0.025596587121717106
('nan', 'tomato sauce') 0.014131449140114652
('nan', 'burgers') 0.0871883748833489
('nan', 'french wine') 0.022530329289428077
('nan', 'french fries') 0.1709105452606319
('nan', 'hot dogs') 0.03239568057592321
('nan', 'pepper') 0.026529796027196375
('nan', 'tomatoes') 0.06839088121583789
('nan', 'frozen smoothie') 0.0631915744567391

> # Observations:
>The Apriori algorithm identifies several frequent itemsets whose items are commonly purchased together. Pairs that contain 'nan' are artifacts of empty cells and can be ignored. Among real items, for example, the itemset ('mineral water', 'spaghetti') has a support of 0.059, which means that it appears in 5.9% of the transactions. Other frequent itemsets include ('eggs', 'mineral water') with a support of 0.050 and ('french fries', 'mineral water') with a support of 0.046.

>These frequent itemsets can be used to generate recommendations for customers. For example, if a customer purchases spaghetti, a recommender could suggest mineral water, since the two items often appear together.
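
A recommendation step of this kind could be sketched as follows: given one item, rank the items it co-occurs with by the support of the pair. The `recommend` function and its `top_n` parameter are hypothetical names introduced for illustration. The pair supports are copied, rounded, from the printed output of the manual implementation, and pairs containing 'nan' are skipped on the assumption that they are artifacts.

```python
def recommend(item, pair_supports, top_n=3):
    """Rank the items that co-occur with `item` by the support of the pair."""
    partners = []
    for pair, s in pair_supports.items():
        if item in pair and "nan" not in pair:
            other = pair[0] if pair[1] == item else pair[1]
            partners.append((other, s))
    # Highest pair support first
    partners.sort(key=lambda entry: entry[1], reverse=True)
    return [name for name, _ in partners[:top_n]]

# Pair supports for 'olive oil', rounded from the printed output
pair_supports = {
    ("olive oil", "nan"): 0.0657,
    ("olive oil", "mineral water"): 0.0276,
    ("olive oil", "milk"): 0.0171,
    ("olive oil", "spaghetti"): 0.0229,
    ("olive oil", "eggs"): 0.0120,
}
print(recommend("olive oil", pair_supports))
```

Ranking by pair support favours items that are popular overall; ranking by confidence or lift instead would be a reasonable alternative.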

---
# Conclusion:
In conclusion, implementing the Apriori algorithm with a library such as mlxtend simplifies the process and saves time, while implementing it without libraries gives a better understanding of the algorithm and allows more customization. With either approach, the algorithm identifies frequent itemsets in transaction data, which can be used to generate recommendations and gain insight into customer behavior. The minimum support threshold controls how many itemsets are reported and how significant they are: the library run used 0.05, while the manual run used 0.01 and therefore also lists much rarer pairs.