## Association Rule Mining

Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs across a set of transactions. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It allows retailers to identify relationships between the items that people frequently buy together.

Given a set of transactions, we can find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. Three measures are commonly used to evaluate a rule X => Y:

- **Support (s)**:
The number of transactions that contain all items in both the {X} and {Y} parts of the rule, as a fraction of the total number of transactions. It measures how frequently the items occur together across all transactions.

    $$\text{Supp}(X \cup Y) = \frac{\sigma(X \cup Y)}{N}$$

Here $\sigma(X \cup Y)$ is the number of transactions that contain both X and Y, and $N$ is the total number of transactions.

- **Confidence (c)**:
The ratio of the number of transactions that contain all items in both {X} and {Y} to the number of transactions that contain all items in {X}.

    $$\text{Conf}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)}$$

It measures how often the items in Y appear in transactions that also contain the items in X.

- **Lift (l)**:
The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. Under independence, the expected confidence is simply the support of {Y}.

    $$\text{Lift}(X \Rightarrow Y) = \frac{\text{Conf}(X \Rightarrow Y)}{\text{Supp}(Y)}$$

A lift value near 1 indicates that X and Y appear together about as often as expected under independence. A value greater than 1 means they appear together more often than expected, and a value less than 1 means less often than expected. The further lift rises above 1, the stronger the positive association.
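
As a rough illustration of the three measures, the sketch below evaluates the rule {bread} => {milk} on five made-up transactions. The data and the helper names `support`, `confidence`, and `lift` are assumptions for this example only and are not part of the notebook's dataset or code.

```python
# Toy transactions (made up for illustration, not taken from the dataset)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Supp(X u Y) / Supp(X) for the rule X => Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Conf(X => Y) / Supp(Y) for the rule X => Y."""
    return confidence(x, y, transactions) / support(y, transactions)

x, y = {"bread"}, {"milk"}
print("support:   ", support(x | y, transactions))
print("confidence:", confidence(x, y, transactions))
print("lift:      ", lift(x, y, transactions))
```

Bread and milk occur together in 3 of 5 transactions, bread in 4 of 5, and milk in 4 of 5, so the lift comes out slightly below 1: in this toy data, buying bread makes milk marginally less likely than its baseline frequency.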

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../input/market-basket-optimisationcsv/Market_Basket_Optimisation.csv", header = None)
df

In [None]:
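records = []
for i in range(len(df)):
    # Convert every cell of row i to a string; missing cells become 'nan'
    records.append([str(df.values[i, j]) for j in range(df.shape[1])])
    # Drop the 'nan' padding so each record holds only the purchased items
    records[i] = [x for x in records[i] if x != 'nan']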
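
The same clean-up can be expressed without comparing against the string `'nan'`. The sketch below uses `pd.notna` on a small hypothetical table (`toy` is made up for illustration) to show how NaN padding in a ragged CSV is dropped from each row.

```python
import numpy as np
import pandas as pd

# Hypothetical three-basket table, padded with NaN like a ragged CSV
toy = pd.DataFrame([
    ["milk", "bread", np.nan],
    ["eggs", np.nan, np.nan],
    ["milk", "eggs", "butter"],
])

# Keep only the non-missing cells of each row
toy_records = [[str(x) for x in row if pd.notna(x)] for row in toy.values]
print(toy_records)
```

Testing for missing values before converting to strings avoids accidentally dropping a product that happens to be named "nan".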

In [None]:
dict_data = {'items': records}
df = pd.DataFrame.from_dict(dict_data)

df.head()

In [None]:
# Flatten all transactions into a single list of purchased items
items = [item for basket in df['items'] for item in basket]

items[0:10]

### Creating First Candidate (C1)

In [None]:
#Get unique element from list/array
unique_item = set(items)
list_unique_item = list(unique_item)
list_unique_item[0:5]

In [None]:
count_unique = []
for value in (list_unique_item):
    count_unique.append((value, items.count(value)))
count_unique[0:5]

In [None]:
candidate1 = pd.DataFrame(count_unique, columns=["itemset", "sup"])
candidate1.head()
len(candidate1)
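
Calling `items.count(value)` rescans the whole list once per unique item. As a possible alternative, `collections.Counter` tallies every item in a single pass. The sketch below uses a short made-up list (`toy_items`) in place of the real `items` list.

```python
from collections import Counter

import pandas as pd

# Hypothetical flattened item list, standing in for `items`
toy_items = ["milk", "bread", "milk", "eggs", "bread", "milk"]

# Counter tallies every distinct item in a single pass over the list
toy_counts = Counter(toy_items)
print(toy_counts.most_common())

# Same two-column layout as the candidate table: itemset and support count
toy_c1 = pd.DataFrame(list(toy_counts.items()), columns=["itemset", "sup"])
print(toy_c1)
```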

### Creating First Frequent Itemset (L1)

In [None]:
def filter_sup(candidate):
    """Keep only the itemsets whose support count exceeds the minimum."""
    minimum_sup = 50  # absolute number of transactions, not a fraction
    return candidate[candidate['sup'] > minimum_sup]

In [None]:
freq_itemset1 = filter_sup(candidate1)

In [None]:
freq_itemset1

In [None]:
freq_itemset1['itemset'].iloc[0]

### Creating Second Candidate (C2)

In [None]:
def self_join(prev_freq_itemset):
    """Join frequent k-itemsets pairwise into candidate (k+1)-itemsets."""
    # Each itemset is stored as a ", "-separated string; parse it into a set
    itemsets = [set(s.split(", ")) for s in prev_freq_itemset['itemset']]
    self_join_candidate = []
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            union = itemsets[i] | itemsets[j]
            # Keep only unions that grow the itemset by exactly one item
            if len(union) != len(itemsets[i]) + 1:
                continue
            # Sort the items so the same itemset always yields the same string
            union_candidate = ", ".join(sorted(union))
            if union_candidate not in self_join_candidate:
                self_join_candidate.append(union_candidate)
    return self_join_candidate
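
In the Apriori algorithm the join step is usually followed by a prune step: a candidate is discarded when any of its smaller subsets is not frequent. The sketch below is a minimal version of join plus prune on `frozenset` objects. `generate_candidates` and the toy itemsets are hypothetical and are not used elsewhere in this notebook.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build candidate k-itemsets from frequent (k-1)-itemsets (hypothetical helper)."""
    prev_frequent = set(prev_frequent)
    candidates = set()
    for a, b in combinations(prev_frequent, 2):
        union = a | b
        # Join step: the union must have exactly k items
        if len(union) != k:
            continue
        # Prune step: every (k-1)-subset must itself be frequent
        if all(frozenset(sub) in prev_frequent for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates

# Made-up frequent 2-itemsets
l2 = [frozenset(p) for p in [
    ("bread", "milk"), ("bread", "eggs"), ("eggs", "milk"), ("bread", "butter"),
]]
c3 = generate_candidates(l2, 3)
print(c3)
```

Only {bread, eggs, milk} survives here. Candidates containing butter are pruned because {butter, milk} and {butter, eggs} are not among the frequent pairs.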

In [None]:
candidate2_list = self_join(freq_itemset1)
# candidate2_list

In [None]:
count_candidate2 = []

#Set the Initial value of Second Count Candidate (C2)
for i in range(len(candidate2_list)):
    count_candidate2.append((candidate2_list[i], 0))

initial_df_candidate = pd.DataFrame(count_candidate2, columns=['itemset', 'sup'])
len(initial_df_candidate)

In [None]:
def count_support(database_dataframe, prev_candidate_list):
    """Count how many transactions contain each candidate itemset."""
    # Only the first 1000 candidates are counted to keep the runtime manageable;
    # use len(prev_candidate_list) instead to count all of them.
    n_candidates = min(1000, len(prev_candidate_list))

    # Every candidate starts with a support count of zero
    count_prev_candidate = [(prev_candidate_list[i], 0) for i in range(n_candidates)]
    df_candidate = pd.DataFrame(count_prev_candidate, columns=['itemset', 'sup'])
    print('Database D dataframe\n', database_dataframe)
    print('(Initial) Dataframe from Candidate with All zeros sup\n', df_candidate)

    # Parse each candidate string into a set once instead of once per transaction
    candidate_sets = [set(c.split(", ")) for c in df_candidate['itemset']]

    for transaction in database_dataframe['items']:
        transaction_set = set(transaction)
        for j, candidate in enumerate(candidate_sets):
            # A candidate is supported when all of its items are in the transaction
            if candidate.issubset(transaction_set):
                df_candidate.loc[j, 'sup'] += 1

    return df_candidate
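
The subset test at the heart of `count_support` can also be written with `Series.apply`, which avoids incrementing individual cells. The sketch below counts the support of one candidate on a made-up three-row database (`toy_db`) that uses the same `items` column layout.

```python
import pandas as pd

# Hypothetical transaction database with the same 'items' column layout
toy_db = pd.DataFrame({"items": [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs"],
]})

# Candidates are stored as ", "-separated strings
candidate = "bread, milk"
candidate_set = set(candidate.split(", "))

# One boolean per transaction: does it contain every item of the candidate?
contains = toy_db["items"].apply(lambda basket: candidate_set.issubset(basket))
sup = int(contains.sum())
print(sup)
```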

In [None]:
count_candidate2_df = count_support(df, candidate2_list)

In [None]:
count_candidate2_df

### Creating Second Frequent Itemset (L2)

In [None]:
freq_itemset2 = filter_sup(count_candidate2_df)

freq_itemset2 = freq_itemset2.reset_index(drop=True)
freq_itemset2

### Creating the Third Candidate (C3) - Using the Candidate Forming Technique

In [None]:
print(freq_itemset2[0:10])
self_join_result = self_join(freq_itemset2)

In [None]:
print(len(self_join_result))
self_join_result[0:10]

In [None]:
candidate3_list = self_join_result

In [None]:
count_candidate3_df = count_support(df, candidate3_list)

In [None]:
max(count_candidate3_df['sup'])

In [None]:
def filter_sup2(candidate):
    """Same filter as filter_sup, with a lower threshold for 3-itemsets."""
    minimum_sup = 20  # absolute number of transactions, not a fraction
    return candidate[candidate['sup'] > minimum_sup]
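
`filter_sup` and `filter_sup2` differ only in their threshold, so one option is a single function that takes the threshold as an argument. `filter_by_support` below is a hypothetical replacement, shown on made-up counts.

```python
import pandas as pd

def filter_by_support(candidate, minimum_sup):
    """Keep rows whose 'sup' count exceeds `minimum_sup` (hypothetical helper)."""
    return candidate[candidate["sup"] > minimum_sup].reset_index(drop=True)

# Made-up candidate table with the same columns as the notebook's candidates
toy_candidates = pd.DataFrame({
    "itemset": ["milk", "bread", "eggs"],
    "sup": [60, 25, 10],
})

print(filter_by_support(toy_candidates, 50))
print(filter_by_support(toy_candidates, 20))
```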

### Creating Third Frequent Itemset (L3)

In [None]:
freq_itemset3 = filter_sup2(count_candidate3_df)
for itemset, sup in zip(freq_itemset3['itemset'], freq_itemset3['sup']):
    print("Items: ", set(itemset.split(", ")), "\t Support: ", sup)

### Combining all the Frequent Itemsets

In [None]:
frequent_itemset = pd.concat([freq_itemset1, freq_itemset2, freq_itemset3], axis=0)
frequent_itemset_final = frequent_itemset.reset_index(drop=True)

In [None]:
for i in range(len(frequent_itemset_final)):
    print(frequent_itemset_final['itemset'].iloc[i], "\t", frequent_itemset_final['sup'].iloc[i])
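
Once support counts for the frequent itemsets are known, confidence and lift follow directly from the definitions at the top of this notebook. The sketch below derives every rule from one itemset. The counts, `n_transactions`, and `rules_from_itemset` are made up for illustration, and the helper assumes that the support of every subset is available.

```python
from itertools import combinations

# Made-up support counts keyed by itemset; n_transactions is also made up
n_transactions = 100
support_count = {
    frozenset(["bread"]): 60,
    frozenset(["milk"]): 50,
    frozenset(["bread", "milk"]): 40,
}

def rules_from_itemset(itemset, support_count, n_transactions):
    """Yield (antecedent, consequent, confidence, lift) for every split of `itemset`."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            x = frozenset(antecedent)
            y = itemset - x
            # Conf(X => Y) = Supp(X u Y) / Supp(X); the 1/N factors cancel
            conf = support_count[itemset] / support_count[x]
            # Lift(X => Y) = Conf(X => Y) / Supp(Y)
            lift = conf / (support_count[y] / n_transactions)
            yield x, y, conf, lift

for x, y, conf, lift in rules_from_itemset(["bread", "milk"], support_count, n_transactions):
    print(set(x), "=>", set(y), f"confidence={conf:.2f}", f"lift={lift:.2f}")
```

Both directions of the rule share the same lift, because lift is symmetric in X and Y, while their confidences differ.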