# Association Rule Introduction

## Definition
A rule-based machine learning method for discovering interesting relations between variables in large databases.

## Application Areas
1. Market basket analysis in retail
2. Web usage mining
3. Intrusion detection
4. Continuous production
5. Bioinformatics

## Terminologies
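1. Support: how frequently an itemset appears in the dataset, i.e. the fraction of transactions that contain it
2. Confidence: how often the rule X -> Y has been found to be true, i.e. the fraction of transactions containing X that also contain Y
3. Lift: the ratio of the observed support of X and Y together to the support expected if X and Y were independent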
![formula.png](attachment:formula.png)
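
In case the attached image does not render, the three measures can be written out as below for a rule X -> Y. The notation is an assumption made here (N is the total number of transactions and count(·) is the number of transactions containing an itemset); it is not copied from the image.

$$
\begin{aligned}
\mathrm{support}(X) &= \frac{\mathrm{count}(X)}{N} \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}
\end{aligned}
$$

A lift above 1 suggests that X and Y appear together more often than independence would predict.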

## Process (two-step approach)
1. Frequent Itemset Generation: Find all frequent item-sets, i.e. those with support >= the pre-determined min_support threshold
2. Rule Generation: List all Association Rules from frequent item-sets. Calculate Support and Confidence for all rules. Prune rules that fail min_support and min_confidence thresholds.
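
The two steps above can be sketched in plain Python on a toy dataset. This is a brute-force illustration only: the transactions, thresholds and variable names are made up for the example, and real implementations such as Apriori prune the candidate search instead of enumerating every combination.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not taken from store_data.csv)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
min_support = 0.5      # illustrative threshold
min_confidence = 0.6   # illustrative threshold
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / n


# Step 1: frequent itemset generation (brute force over all candidate sizes)
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(frozenset(candidate))
        if s >= min_support:
            frequent[frozenset(candidate)] = s

# Step 2: rule generation from every frequent itemset with at least two items.
# Every subset of a frequent itemset is itself frequent, so the lookup is safe.
rules = []
for itemset, s in frequent.items():
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            antecedent = frozenset(antecedent)
            confidence = s / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), s, confidence))

for antecedent, consequent, s, confidence in rules:
    print(antecedent, "->", consequent, f"support={s:.2f} confidence={confidence:.2f}")
```

On this toy data, five itemsets pass the support threshold and four rules pass the confidence threshold.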

## Algorithms
1. Apriori algorithm
2. Eclat algorithm
3. FP-growth algorithm

# Apriori Algorithm

In [1]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

In [12]:
df = pd.read_csv('C:\\Users\\ooi.weixin\\Documents\\Learning Material\\9.2 Association Rule\\store_data.csv', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [16]:
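# Data preprocessing to convert pandas dataframe into a list of lists
# Note: empty cells are NaN and str() turns them into the string 'nan',
# so 'nan' is treated as an item and shows up in some of the rules below
records = []
for i in range(0, len(df)):
    records.append([str(df.values[i, j]) for j in range(0, 20)])
#print(records)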
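
If the 'nan' placeholders are not wanted as items, one possible variation is to skip missing cells while building the list. The sketch below uses a small stand-in DataFrame (`df_demo`, hypothetical rows) rather than the real file, and `records_clean` is a name introduced only for this example.

```python
import pandas as pd

# Small stand-in for store_data.csv (hypothetical rows; short baskets are padded with missing values)
df_demo = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", None, None],
    ["turkey", "avocado", None],
])

# Keep only the real items of each basket by skipping missing cells
records_clean = [
    [str(item) for item in row if pd.notna(item)]
    for row in df_demo.values
]
print(records_clean)
```

Each inner list then has only as many entries as the basket actually contained.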

Parameters to apply apriori:
1. The list of lists (transactions) that you would like to extract the rules from
2. min_support: keeps only itemsets whose support is at least the specified value
3. min_confidence: keeps only rules whose confidence is at least the specified threshold
4. min_lift: specifies the minimum lift value for the shortlisted rules
5. min_length: specifies the minimum number of items that you want in your rules

Let's suppose that we want rules for only those items that are purchased at least 5 times a day, or 7 x 5 = 35 times in one week, since our dataset covers a one-week time period. With about 7,500 transactions in the dataset, a count of 35 corresponds to a support of roughly 35/7500 ≈ 0.0047; the code below uses the slightly lower value 0.0045. <br>
The minimum confidence for the rules is 20% or 0.2. Similarly, we specify the value for lift as 3, and finally min_length is 2 since we want at least two products in our rules. <br>
<i>*these values are arbitrary</i>
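
The support threshold arithmetic can be checked in a few lines. The transaction count of 7501 is an assumption taken from the count reported for the first column in the FP-Growth output later in this notebook.

```python
purchases_per_day = 5
days_in_dataset = 7
n_transactions = 7501  # assumption: row count inferred from the FP-Growth output further down

# Minimum number of transactions an itemset must appear in
min_count = purchases_per_day * days_in_dataset

# The same threshold expressed as a fraction of all transactions
min_support = min_count / n_transactions
print(min_count, round(min_support, 4))
```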

In [19]:
# Apply apriori
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)

In [30]:
# Interpret result

# Check the total number of rules mined
print("Total number of rules mined = " ,len(association_results))

# Have a quick view on the first rule
print("\nFirst rule: \n", association_results[0])

Total number of rules mined =  48

First rule: 
 RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])


In [27]:
for item in association_results:

    # First field of the record: the frozenset of items in the rule.
    # A frozenset has no guaranteed order, so items[0] -> items[1] may not
    # match the real direction of the rule (items_base -> items_add)
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # Second field of the record: support of the itemset
    print("Support: " + str(item[1]))

    # Third field: list of ordered statistics. Take the first one and
    # read its confidence (index 2) and lift (index 3)
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Rule: chicken -> light cream
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
Rule: ground beef -> tomato sauce
Support: 0.005332622317024397
Confidence: 0.3773584905660377
Lift: 3.840659481324083
Rule: olive oil -> whole wheat pasta
Support: 0.007998933475536596
Confidence: 0.2714932126696833
Lift: 4.122410097642296
Rule: shrimp -> pasta
Support: 0.005065991201173177
Confidence: 0.3220338983050847
Lift: 4.506672147735896
Rule: chicken -> nan
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
Rule: shrimp -> frozen vegetables
Support: 0.005332622317024397
Confidence: 0.
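
Because the loop above reads the items from a frozenset, the printed direction of a rule can be reversed. A sketch of an alternative is to read the named fields shown in the first record (`items_base`, `items_add`, `confidence`, `lift`). The namedtuples below are stand-ins that only mimic those field names so the example runs without apyori, and `format_rules` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins with the same field names as the record printed earlier (illustrative only)
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])

sample_results = [
    RelationRecord(
        items=frozenset({"chicken", "light cream"}),
        support=0.004532728969470737,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"light cream"}),
                items_add=frozenset({"chicken"}),
                confidence=0.29059829059829057,
                lift=4.84395061728395,
            )
        ],
    )
]


def format_rules(results):
    """Return one line per rule, with antecedent and consequent in rule order."""
    lines = []
    for record in results:
        # A record can carry several ordered statistics, one per rule direction kept
        for stat in record.ordered_statistics:
            base = ", ".join(sorted(stat.items_base))
            add = ", ".join(sorted(stat.items_add))
            lines.append(
                f"Rule: {base} -> {add} | support={record.support:.4f} "
                f"confidence={stat.confidence:.4f} lift={stat.lift:.2f}"
            )
    return lines


for line in format_rules(sample_results):
    print(line)
```

With the real `association_results` list, the same helper should apply as long as the records expose these field names.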

Explanation of the first rule (light cream -> chicken, as given by items_base and items_add in the record printed earlier):<br>
- The support of 0.0045 is the number of transactions containing both light cream and chicken divided by the total number of transactions
- The confidence of 0.2906 shows that, out of all the transactions that contain light cream, 29.06% also contain chicken
- The lift of 4.84 tells us that chicken is 4.84 times more likely to be bought by customers who buy light cream than by customers in general.
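
The three numbers can be reproduced from raw counts. The counts below are not read from the data file; they are back-calculated from the printed support, confidence and lift, so treat them as an assumption that happens to be consistent with those values.

```python
# Counts back-calculated from the printed values (assumption, not read from the CSV)
n_transactions = 7501
n_light_cream = 117   # transactions containing light cream
n_chicken = 450       # transactions containing chicken
n_both = 34           # transactions containing both

support = n_both / n_transactions                 # share of all baskets with both items
confidence = n_both / n_light_cream               # share of light cream baskets that have chicken
lift = confidence / (n_chicken / n_transactions)  # confidence relative to chicken's baseline share
print(support, confidence, lift)
```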

# FP(Frequent Pattern)-Growth Algorithm

FP-Growth typically takes less execution time than Apriori because it builds a compact tree structure (the FP-tree) and then uses that tree for frequent itemset mining and rule generation, instead of repeatedly scanning the dataset to test candidate itemsets.
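
To give a feel for the compact tree, here is a minimal sketch of FP-tree construction only (the mining step is left out). The class and function names and the toy transactions are made up for this illustration and are not part of pyfpgrowth.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item, its count and its child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep only the frequent ones
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction with its items sorted by descending
    # frequency, so that common prefixes share the same path
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root


def count_nodes(node):
    """Number of item nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Toy transactions (hypothetical data)
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
tree = build_fp_tree(transactions, min_count=2)
print(count_nodes(tree))
```

The four toy baskets hold eight item occurrences in total, while the tree needs only five nodes because three baskets share the `bread` prefix.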

In [31]:
# Import libraries
import pandas as pd
import numpy as np
import pyfpgrowth

In [45]:
df = pd.read_csv('C:\\Users\\ooi.weixin\\Documents\\Learning Material\\9.2 Association Rule\\store_data.csv', header=None)

# Data preprocessing to convert pandas dataframe into a list of lists
records = []
for i in range(0, len(df)):
    records.append([str(df.values[i,j]) for j in range(0, 20)])

In [48]:
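# Convert each row into a dict of {column position: item}, dropping empty cells.
# Caution: iterating over a dict yields its keys, so the patterns mined below
# are made of column positions (0, 1, 2, ...) rather than item names
df = df.T.apply(lambda x: x.dropna().to_dict()).to_list()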
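
The difference between a dict per row and a list per row can be seen on a small stand-in DataFrame. `df_demo`, `as_dicts` and `as_lists` are hypothetical names used only here; the sketch assumes the mining step iterates over each transaction, which is consistent with the integer keys in the output shown further down.

```python
import pandas as pd

# Small stand-in for store_data.csv (hypothetical rows)
df_demo = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", None, None],
])

# One dict per row: iterating a dict yields its keys, i.e. the column positions
as_dicts = [row.dropna().to_dict() for _, row in df_demo.iterrows()]
print([list(t) for t in as_dicts])

# One list per row: iterating a list yields the item names themselves
as_lists = [row.dropna().tolist() for _, row in df_demo.iterrows()]
print(as_lists)
```

Passing a list of item-name lists should make the mined patterns refer to products rather than column positions.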

In [51]:
# find_frequent_patterns: to find patterns in baskets that occur over the support threshold
# 1. list of item set bought at each transaction
# 2. minimum threshold set for the support count
patterns = pyfpgrowth.find_frequent_patterns(df, 2)
patterns

{(0,): 7501,
 (1,): 5747,
 (2,): 4389,
 (3,): 3345,
 (4,): 2529,
 (5,): 1864,
 (6,): 1369,
 (7,): 981,
 (8,): 654,
 (9,): 395,
 (10,): 256,
 (11,): 154,
 (12,): 87,
 (13,): 47,
 (14,): 25,
 (15,): 8,
 (16,): 4,
 (17,): 4,
 (18,): 3,
 (0, 1): 5747,
 (0, 2): 4389,
 (0, 3): 3345,
 (0, 4): 2529,
 (0, 5): 1864,
 (0, 6): 1369,
 (0, 7): 981,
 (0, 8): 654,
 (0, 9): 395,
 (0, 10): 256,
 (0, 11): 154,
 (0, 12): 87,
 (0, 13): 47,
 (0, 14): 25,
 (0, 15): 8,
 (0, 16): 4,
 (0, 17): 4,
 (0, 18): 3,
 (1, 2): 4389,
 (1, 3): 3345,
 (1, 4): 2529,
 (1, 5): 1864,
 (1, 6): 1369,
 (1, 7): 981,
 (1, 8): 654,
 (1, 9): 395,
 (1, 10): 256,
 (1, 11): 154,
 (1, 12): 87,
 (1, 13): 47,
 (1, 14): 25,
 (1, 15): 8,
 (1, 16): 4,
 (1, 17): 4,
 (1, 18): 3,
 (2, 3): 3345,
 (2, 4): 2529,
 (2, 5): 1864,
 (2, 6): 1369,
 (2, 7): 981,
 (2, 8): 654,
 (2, 9): 395,
 (2, 10): 256,
 (2, 11): 154,
 (2, 12): 87,
 (2, 13): 47,
 (2, 14): 25,
 (2, 15): 8,
 (2, 16): 4,
 (2, 17): 4,
 (2, 18): 3,
 (3, 4): 2529,
 (3, 5): 1864,
 (3, 6): 1369,

In [None]:
# generate_association_rules: to find patterns that are associated with another with a certain minimum probability
# 1. patterns
# 2. minimum threshold set for confidence
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)
rules

In [None]:
countries = [np.nan, 'USA', 'UK', 'France']

In [None]:
countries[0] == countries[0]  # False: NaN never compares equal to itself

In [None]:
countries[1] == countries[1]  # True for ordinary values
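
The self-comparison shown in these cells can be used as a quick NaN filter. This is a sketch on the same toy list; `cleaned` is a name introduced for the example.

```python
countries = [float("nan"), "USA", "UK", "France"]

# x == x is False only for NaN, so the comparison drops missing entries
cleaned = [c for c in countries if c == c]
print(cleaned)
```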