### Association rule mining

Association rule mining is a technique for identifying underlying relationships between items. Take the example of a supermarket where customers can buy a variety of items. There is usually a pattern in what customers buy. For instance, parents with babies buy baby products such as milk and diapers, young women may buy makeup, and bachelors may buy beer and chips. In short, transactions follow patterns, and more profit can be generated if the relationships between the items purchased in different transactions can be identified.

For instance, if items A and B are frequently bought together, several steps can be taken to increase profit:

- A and B can be placed together, so that a customer who buys one does not have to go far to buy the other.
- People who buy one of the products can be targeted through an advertising campaign to buy the other.
- Combined discounts can be offered to customers who buy both products.
- A and B can be packaged together.

The process of identifying associations between products is called association rule mining.

Taken from stackabuse.com.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import combinations

In [2]:
data = pd.read_csv("data/items.csv", header = 'infer')

# Replace the column names with their numeric positions (0, 1, 2, ...)
data.columns = range(data.shape[1])

data.fillna("NA", inplace = True)
column_list = list(data.columns)
data

Unnamed: 0,0,1,2,3
0,apple,beer,rice,chicken
1,apple,beer,rice,
2,apple,beer,,
3,apple,mango,,
4,milk,beer,rice,chicken
5,milk,beer,rice,
6,milk,beer,,
7,milk,mango,,


#### Finding all unique pairs that appear in each basket
Assumption: an item does not appear more than once in the same row.

In [3]:
pairwise_info = {}  # "a+b" -> row numbers of the baskets that contain both items
item_info = {}      # item  -> row numbers of the baskets that contain the item

for rownum, row in data.iterrows():
    # Items actually present in this basket (skip the "NA" padding)
    basket = [row[col] for col in column_list if row[col] != 'NA']
    for pair in combinations(basket, 2):
        # Sort so that (beer, apple) and (apple, beer) share one key
        first, second = sorted(pair)
        for item in (first, second):
            rows = item_info.setdefault(item, [])
            if rownum not in rows:
                rows.append(rownum)
        pairwise_info.setdefault(first + "+" + second, []).append(rownum)
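
As a cross-check, the pair counts can also be tallied with `collections.Counter`. This is a minimal sketch rather than a replacement for the cell above: `baskets` is a hard-coded copy of the eight rows shown in the table, and `pair_counts` is a name introduced here for illustration.

```python
from collections import Counter
from itertools import combinations

# Hard-coded copy of the eight baskets from the table (assumption: no "NA" padding needed)
baskets = [
    ["apple", "beer", "rice", "chicken"],
    ["apple", "beer", "rice"],
    ["apple", "beer"],
    ["apple", "mango"],
    ["milk", "beer", "rice", "chicken"],
    ["milk", "beer", "rice"],
    ["milk", "beer"],
    ["milk", "mango"],
]

pair_counts = Counter()
for basket in baskets:
    # set() enforces the no-duplicates assumption; sorted() gives each pair one canonical order
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

print(pair_counts[("apple", "beer")])  # 3
print(pair_counts[("beer", "rice")])   # 4
```

Each key is an alphabetically ordered tuple, so `("apple", "beer")` here corresponds to the `"apple+beer"` key used in `pairwise_info`.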

### Support
Support is the fraction of orders that contain the item set. For example, if there are 5 orders in total and {apple, egg} occurs in 3 of them, then:

             support{apple,egg} = 3/5 or 60%

The minimum support threshold required by the Apriori algorithm can be set based on knowledge of your domain. In a large grocery dataset, for example, there could be thousands of distinct items while each order contains only a small fraction of them, so a support threshold as low as 0.01% may be reasonable.
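
The support calculation and the threshold filter can be sketched in a few lines. This is only an illustration under stated assumptions: the five `orders` are hypothetical baskets chosen so that {apple, egg} appears in 3 of the 5, and `support_of`, `candidates`, and `min_support` are names introduced here, not part of the notebook.

```python
# Hypothetical orders (assumption): {apple, egg} occurs in 3 of the 5.
orders = [
    {"apple", "egg"},
    {"apple", "egg", "milk"},
    {"apple", "egg", "bread"},
    {"apple", "milk"},
    {"bread", "milk"},
]

def support_of(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= order for order in orders) / len(orders)

apple_egg_support = support_of({"apple", "egg"}, orders)
print(apple_egg_support)  # 0.6

# Keep only the item sets whose support reaches an illustrative threshold.
min_support = 0.5
candidates = [{"apple", "egg"}, {"apple", "milk"}, {"bread", "milk"}]
frequent = [s for s in candidates if support_of(s, orders) >= min_support]
print(frequent)
```

With a threshold of 0.5, only {apple, egg} survives; lowering `min_support` admits rarer item sets at the cost of more candidates to examine.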

#### Calculating support for all the pairs

In [4]:
# Support of a pair = number of baskets containing the pair / total number of baskets
item_list = list(pairwise_info)
support_list = [len(rows) / len(data) for rows in pairwise_info.values()]
support = pd.DataFrame(support_list, index=item_list, columns=['Support Value'])
support

Unnamed: 0,Support Value
apple+beer,0.375
apple+rice,0.25
apple+chicken,0.125
beer+rice,0.5
beer+chicken,0.25
chicken+rice,0.25
apple+mango,0.125
beer+milk,0.375
milk+rice,0.25
chicken+milk,0.125


**Note:** normalization is pending.

### Confidence
Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that item A was purchased. This is expressed as:

             confidence{A->B} = support{A,B} / support{A}   

Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased:

             confidence{B->A} = support{A,B} / support{B}    

Continuing the apple and egg example, and taking apple to appear in 4 of the 5 orders, the percentage of times that egg is purchased given that apple was purchased is:

             confidence{apple->egg} = support{apple,egg} / support{apple}
                                    = (3/5) / (4/5)
                                    = 0.75 or 75%

A confidence value of 0.75 implies that, out of all orders that contain apple, 75% also contain egg. Now we look at the confidence measure in the opposite direction (i.e., egg->apple):

             confidence{egg->apple} = support{apple,egg} / support{egg}
                                    = (3/5) / (3/5)
                                    = 1 or 100%  

Here we see that all of the orders that contain egg also contain apple. But does this mean that there is a relationship between these two items, or do they occur together in the same orders simply by chance? To answer this question, we look at another measure that takes the popularity of both items into account.
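
The directional nature of confidence can be checked with a short sketch. The `orders` below are hypothetical baskets chosen to match the figures in the text (apple in 4 of 5 orders, egg in 3 of 5, both in 3 of 5), and `confidence` is a helper name introduced here. It divides raw counts rather than supports, which gives the same ratio because the total number of orders cancels.

```python
# Hypothetical orders (assumption): apple in 4 of 5, egg in 3 of 5, both in 3 of 5.
orders = [
    {"apple", "egg"},
    {"apple", "egg", "milk"},
    {"apple", "egg", "bread"},
    {"apple", "milk"},
    {"bread", "milk"},
]

def confidence(antecedent, consequent, orders):
    """confidence{A->B} = support{A,B} / support{A}, computed from counts."""
    with_a = [order for order in orders if antecedent in order]
    with_both = [order for order in with_a if consequent in order]
    return len(with_both) / len(with_a)

apple_to_egg = confidence("apple", "egg", orders)
egg_to_apple = confidence("egg", "apple", orders)
print(apple_to_egg)  # 0.75
print(egg_to_apple)  # 1.0
```

The two directions differ because the denominators differ: four orders contain apple, but only three contain egg.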


We will use the support values computed above to calculate the pairwise confidences.

In [37]:
# Support of each single item: fraction of baskets that contain it
individual_support = {item: len(rows) / len(data) for item, rows in item_info.items()}

item_combinations = []
confidence_values = []

for pair_name in support.index:
    a, b = pair_name.split("+")
    pair_support = support.loc[pair_name, 'Support Value']
    # confidence{a->b} = support{a,b} / support{a}
    item_combinations.append(a + "->" + b)
    confidence_values.append(pair_support / individual_support[a])
    # confidence{b->a} = support{a,b} / support{b}
    item_combinations.append(b + "->" + a)
    confidence_values.append(pair_support / individual_support[b])

individual_support

{'apple': 0.5, 'beer': 0.75, 'rice': 0.5, 'chicken': 0.25, 'mango': 0.25, 'milk': 0.5}

In [50]:
pd.DataFrame({'Item': item_combinations, 'Confidence': confidence_values})

Unnamed: 0,Item,Confidence
0,apple->beer,0.75
1,beer->apple,0.5
2,apple->rice,0.5
3,rice->apple,0.5
4,apple->chicken,0.25
5,chicken->apple,0.5
6,beer->rice,0.6666666666666666
7,rice->beer,1.0
8,beer->chicken,0.3333333333333333
9,chicken->beer,1.0
