# Association Rule Learning with Mlxtend


## Introduction
Definition: Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large datasets. <br>
Let's take a very simple example. The rule <br> <center>{onion, potatoes} => {burger}</center><br> found in the sales data of a supermarket would indicate that a customer who buys onions and potatoes together is also likely to buy hamburger meat. <br>
Now you must be wondering what the use of such a rule is.<br>
This information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement.<br>
Apart from market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection, and bioinformatics.<br>
If you are familiar with the data science world, the first thing that crossed your mind was probably that there must be a ready-made algorithm for this in scikit-learn. However, scikit-learn doesn't support it. Don't be disheartened, though: we have something to the rescue. Tada: Mlxtend.
<p>
There is more good news: unlike many other, more complex ways of analyzing data, association analysis is relatively light on math and easy to explain to non-technical people (although I know you are a technical person ;)). <br>
In addition, it is an unsupervised learning technique.
</p>


## Association Rules 101
Let me provide another example.
##### Table 1.1
| Tid | Items   |
|------|------|
|   1  | { Bread,Milk }|
|   2  | { Bread, Diapers, Beer, Eggs } |
|   3  | { Milk, Diapers, Beer, Cola } |
|   4  | { Bread, Milk, Diapers, Beer } |
|   5  | { Bread, Milk, Diapers, Cola } |

In the above table the following rule can be extracted from the data set: <br>
<center> {Diapers} => {Beer} </center>
The rule suggests a relationship between the sale of diapers and beer, because many customers who buy diapers also buy beer. Retailers can use this kind of insight to grow their business. {Diapers} is known as the antecedent and {Beer} as the consequent. <br>
There are two key issues that need to be addressed in market basket analysis. First, discovering patterns in a large transaction dataset can be computationally expensive. Second, some of the discovered patterns may be spurious. 
<br><br>

##### The formal definition, as given by the people who introduced the concept, is as follows:
<br>
Let I = {i<sub>1</sub>, i<sub>2</sub>, i<sub>3</sub>, ..., i<sub>n</sub>} be a set of n binary attributes called items. <br>

Let D = {t<sub>1</sub>, t<sub>2</sub>, t<sub>3</sub>, ..., t<sub>m</sub>} be a set of transactions called the database.

Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form: 
X => Y , where <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/0f5aabe40ab2b554c2f23eda221556db5868959f" />

<br>
<br>
Every rule is composed of two different sets of items, also known as <I>itemsets</I>, X and Y, where X is called the antecedent or left-hand side and Y is called the consequent or right-hand side.


##### Table 1.2
| Tid | Bread   | Milk | Diapers | Beer | Eggs | Cola |
|------|------|------|------|------|------|------|
|   1  | 1|1|0|0|0|0|
|   2  | 1|0|1|1|1|0|
|   3  | 0|1|1|1|0|1|
|   4  | 1|1|1|1|0|0|
|   5  | 1|1|1|0|0|1|
<br><br>

Table 1.2 is derived from Table 1.1 and is simply its one-hot encoded form. This shows the first step that needs to be done: convert all the data into 0/1 form, where a 1 in an entry means the item is present in the corresponding transaction and a 0 means it is absent. 






Let's start with some preliminary processing. <br>
<br>

Let's take the same sample data and load it into pandas.






In [1]:
import pandas as pd
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

In [2]:
data1 = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
data =[['Bread','Milk'],
      ['Bread', 'Diapers', 'Beer', 'Eggs' ],
       ['Milk', 'Diapers', 'Beer', 'Cola'],
       ['Bread', 'Milk', 'Diapers', 'Beer'],
       ['Bread', 'Milk', 'Diapers', 'Cola']
      ]
df = pd.DataFrame(data,columns=['Item1','Item2','Item3','Item4'])

df

Unnamed: 0,Item1,Item2,Item3,Item4
0,Bread,Milk,,
1,Bread,Diapers,Beer,Eggs
2,Milk,Diapers,Beer,Cola
3,Bread,Milk,Diapers,Beer
4,Bread,Milk,Diapers,Cola


In [3]:
pd.get_dummies(df)

Unnamed: 0,Item1_Bread,Item1_Milk,Item2_Diapers,Item2_Milk,Item3_Beer,Item3_Diapers,Item4_Beer,Item4_Cola,Item4_Eggs
0,1,0,0,1,0,0,0,0,0
1,1,0,1,0,1,0,0,0,1
2,0,1,1,0,1,0,0,1,0
3,1,0,0,1,0,1,1,0,0
4,1,0,0,1,0,1,0,1,0


Now you must be wondering whether there is a way to get a single column for each individual item. The answer is Mlxtend.
Mlxtend is so easy and handy that by the end of this tutorial you will start loving it, and yes, you can thank me later.
Let's start by installing the Mlxtend module.
Just run !pip install mlxtend or !conda install mlxtend.

In [None]:
# !pip install mlxtend

In [None]:
# from mlxtend.preprocessing import TransactionEncoder
# import mlxtend


In [4]:
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk
0,False,True,False,False,False,True
1,True,True,False,True,True,False
2,True,False,True,True,False,True
3,True,True,False,True,False,True
4,False,True,True,True,False,True


In [5]:
from mlxtend.preprocessing import one_hot
import numpy as np

In [6]:
# one_hot(np.array(df))
arr = te.fit_transform(data1)
df1 = pd.DataFrame(arr,columns=te.columns_)
df1

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False


Let me explain what this code is doing: <br>
First, we use the TransactionEncoder class to convert our dataset into a one-hot encoding. TransactionEncoder is an encoder class for transaction data stored as Python lists. Second, we use fit to learn the unique item labels (the columns) and transform to convert the transactions into a one-hot encoded array. That's pretty easy, right?<br>
Here are some TransactionEncoder methods that will be useful throughout this tutorial.<br>
1. fit(X) : Learn the unique column names, where X is a list of lists.<br>
2. fit_transform(X, sparse=False) : Fit a TransactionEncoder and transform a dataset in a single step. <i>And many more.</i><br>
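
To make the two steps concrete, here is a minimal pure-Python sketch of what fit and transform amount to conceptually. It is not Mlxtend's implementation; the helper names `fit_columns` and `transform_rows` are made up for this illustration, and the transactions are the ones from Table 1.1.

```python
# Conceptual sketch only: fit_columns and transform_rows are hypothetical
# helpers written for this example, not part of Mlxtend.
transactions = [
    ['Bread', 'Milk'],
    ['Bread', 'Diapers', 'Beer', 'Eggs'],
    ['Milk', 'Diapers', 'Beer', 'Cola'],
    ['Bread', 'Milk', 'Diapers', 'Beer'],
    ['Bread', 'Milk', 'Diapers', 'Cola'],
]


def fit_columns(rows):
    """'fit': learn the sorted list of unique item labels."""
    return sorted({item for row in rows for item in row})


def transform_rows(rows, columns):
    """'transform': one boolean per column, True if the item is in the row."""
    return [[col in row for col in columns] for row in rows]


columns = fit_columns(transactions)
encoded = transform_rows(transactions, columns)

print(columns)
for row in encoded:
    print(row)
```

The printed rows should line up with the True/False table produced by TransactionEncoder above.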




In [7]:
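# from mlxtend.frequent_patterns import apriori
# Caution: `te` was re-fitted on `data1` in the previous cell, so its column
# labels no longer match `te_ary` (which was built from `data`). The item names
# printed below are therefore mislabeled; re-fit with te.fit(data) before
# calling inverse_transform to recover the original transactions.
df_ex = te.inverse_transform(te_ary) 
df_ex
# apriori(df, min_support=0.00)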

[['Corn', 'Kidney Beans'],
 ['Apple', 'Corn', 'Eggs', 'Ice cream'],
 ['Apple', 'Dill', 'Eggs', 'Kidney Beans'],
 ['Apple', 'Corn', 'Eggs', 'Kidney Beans'],
 ['Corn', 'Dill', 'Eggs', 'Kidney Beans']]

In [None]:
# apriori(df,  min_support=0.00,use_colnames=True)

## Preliminaries 
 
I will walk you through some important terms used in association rules, and then we will look at some of the algorithms associated with them.<br>
What is an itemset? <br>
A collection of zero or more items is referred to as an itemset. An itemset that contains k items is called a k-itemset. <br>
An important property of an itemset is its support count: the number of transactions that contain that itemset. Mathematically, the support count σ(X) is written as follows:<br>
<center>σ(X) = |{t<sub>i </sub>|  X ⊆ t<sub>i </sub>, t<sub>i </sub> ∈ T}| ,</center>

where T is the set of transactions and X is the itemset. In other words, it is the number of transactions in which the itemset occurs.
For example: here, {Beer, Diapers} occurs in 3 transactions, therefore σ(X)=3 for this itemset.<br>
The property that interests us is the support, which is simply the support count of an itemset divided by the total number of transactions.<br>

The second important property is confidence, which belongs to a rule rather than to a single itemset. As the name suggests, it tells us how often the rule is found to be true. 

The confidence of a rule X⇒Y with respect to a set of transactions T is the proportion of the transactions containing X that also contain Y.

The formal definitions of these metrics are: <br>
<center>Support, $$ s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} $$ <br> </center>
<center>Confidence,  $$ c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} $$ </center>

Here N is the total number of transactions.

Let's see what the two metrics are for the following rule:
{Milk, Diapers} → {Beer} <br>

Milk, Diapers and Beer appear together in 2 transactions, so the support count, and hence the numerator, is 2. The total number of transactions is 5, so the denominator is 5, which makes the support 2/5 = 0.4; in other words, 40% of the transactions contain this itemset.
Similarly, confidence has 2 as the numerator, and the support count of X = {Milk, Diapers} is 3, so the denominator is 3 and the confidence is 2/3, i.e. 0.67. 

There is one more metric, known as lift. 
Lift is defined as follows:<br>
<center> $$ \mathrm{lift}(X \rightarrow Y) = \frac{s(X \cup Y)}{s(X) \times s(Y)} $$</center>
<br>
For example, the rule {Milk, Diapers} → {Beer} <br>
has a lift of 0.4/(0.6 x 0.6) = 1.11.
A lift of 1 would indicate that the occurrence of the antecedent and that of the consequent are independent of each other.<br>

A lift greater than 1 means that the two occur together more often than would be expected if they were independent; a lift below 1 means they occur together less often.

Let's try to code these three metrics for the sample data. 




In [8]:
df

Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk
0,False,True,False,False,False,True
1,True,True,False,True,True,False
2,True,False,True,True,False,True
3,True,True,False,True,False,True
4,False,True,True,True,False,True


In [9]:
# Calculating for {Milk, Diapers} -> {Beer}
# Convert the boolean one-hot columns to 0/1 integers so they can be summed.
df = df.astype(int)
df


Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk
0,0,1,0,0,0,1
1,1,1,0,1,1,0
2,1,0,1,1,0,1
3,1,1,0,1,0,1
4,0,1,1,1,0,1


In [10]:
df['Milk_Diapers'] = df.Milk + df.Diapers 
df

Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk,Milk_Diapers
0,0,1,0,0,0,1,1
1,1,1,0,1,1,0,1
2,1,0,1,1,0,1,2
3,1,1,0,1,0,1,2
4,0,1,1,1,0,1,2


In [11]:
df.Milk_Diapers = df.Milk_Diapers.replace(1, np.nan)
df

Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk,Milk_Diapers
0,0,1,0,0,0,1,
1,1,1,0,1,1,0,
2,1,0,1,1,0,1,2.0
3,1,1,0,1,0,1,2.0
4,0,1,1,1,0,1,2.0


In [12]:
supp_Milk_num = df.Milk_Diapers.count()

In [13]:
supp_Milk_num

3

In [14]:
total = len(df.index)

In [15]:
total

5

In [16]:
df['Milk_Diapers_Beer'] = df.Milk + df.Diapers + df.Beer
df

Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk,Milk_Diapers,Milk_Diapers_Beer
0,0,1,0,0,0,1,,1
1,1,1,0,1,1,0,,2
2,1,0,1,1,0,1,2.0,3
3,1,1,0,1,0,1,2.0,3
4,0,1,1,1,0,1,2.0,2


In [17]:
df.Milk_Diapers_Beer = df.Milk_Diapers_Beer.replace(1, np.nan)
df.Milk_Diapers_Beer = df.Milk_Diapers_Beer.replace(2, np.nan)

In [18]:
df

Unnamed: 0,Beer,Bread,Cola,Diapers,Eggs,Milk,Milk_Diapers,Milk_Diapers_Beer
0,0,1,0,0,0,1,,
1,1,1,0,1,1,0,,
2,1,0,1,1,0,1,2.0,3.0
3,1,1,0,1,0,1,2.0,3.0
4,0,1,1,1,0,1,2.0,


In [19]:
supp_count_num = df.Milk_Diapers_Beer.count()
support = supp_count_num/total
support

0.40000000000000002

In [20]:
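support_count_beer = df.Beer.sum()
# Confidence divides by the support count of the antecedent {Milk, Diapers},
# not of the consequent {Beer}; both happen to equal 3 in this data set.
confidence = supp_count_num/supp_Milk_num
confidence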

0.66666666666666663

In [21]:
supp_X = supp_Milk_num/total
supp_Y = support_count_beer/total
lift = support/(supp_X*supp_Y)
lift

1.1111111111111112
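
The helper-column approach above works for this table, but it depends on replacing specific sums with NaN. As a cross-check, here is a minimal sketch that computes the same three metrics directly from the transactions using Python sets. The function name `support_of` and the variable names are my own choices for this example.

```python
# Sketch: support, confidence and lift for {Milk, Diapers} -> {Beer},
# computed straight from the transactions of Table 1.1.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]


def support_of(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


X = {'Milk', 'Diapers'}  # antecedent
Y = {'Beer'}             # consequent

supp_rule = support_of(X | Y)
conf_rule = supp_rule / support_of(X)
lift_rule = supp_rule / (support_of(X) * support_of(Y))

print(round(supp_rule, 2), round(conf_rule, 2), round(lift_rule, 2))
```

The rounded values should agree with the 0.4, 0.67 and 1.11 worked out by hand earlier.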

So we have calculated the three metrics for the example rule, but what if there is a large amount of data? Would you calculate them for each and every combination of columns? Wouldn't that be a very big and time-consuming task?<br>
The answer to this is Mlxtend (which I will come back to later in this notebook).<br>
Before that, you must be wondering what these metrics are for and why we use them. <br>
Let me give you a brief explanation.
A rule with low support may have occurred simply by chance. From a business perspective, a low-support rule is also unlikely to be interesting, because it might not be profitable to promote items that customers seldom buy together. <br>
Confidence, on the other hand, measures reliability. As said before, the closer the confidence is to 1, the more likely Y is to be present in transactions that contain X. <br>
Lift measures the importance of a rule: it shows how much more often X and Y occur together than would be expected if they were independent. 
<br>
<br>
Now, you must be thinking that to know all the metrics we could simply generate every combination of the columns, build the list of rules, and the work would be done. But this brute-force approach is very expensive for a large dataset, since it has to consider every possible rule.
The number of possible rules for a dataset with d items, assuming neither the LHS nor the RHS is an empty set, is:
                $$ 3^{d} - 2^{d+1} + 1 $$

Even in our small example, with d = 6 items, there are 729-128+1 = 602 possible rules, which is already costly. The next section gives a solution to the questions raised in this section.
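
If you want to convince yourself of the formula, the short sketch below counts the rules by brute force for small d and compares the result with the closed form. The idea is that every item is placed in the LHS, in the RHS, or left out, and only assignments with a non-empty LHS and a non-empty RHS count as rules. The function names are my own.

```python
from itertools import product


def count_rules_brute_force(d):
    """Count rules X -> Y over d items with X and Y non-empty and disjoint."""
    count = 0
    # Each item is assigned to the LHS (0), to the RHS (1), or left out (2).
    for assignment in product((0, 1, 2), repeat=d):
        if 0 in assignment and 1 in assignment:
            count += 1
    return count


def count_rules_formula(d):
    """Closed form: 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1


for d in range(1, 7):
    assert count_rules_brute_force(d) == count_rules_formula(d)

print(count_rules_formula(6))  # 602
```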

## Apriori Principle
Association Rule Discovery: Given a set of transactions T, find all the rules having support $\ge$ minsup and confidence $\ge$ minconf, where minsup and minconf are the corresponding support and confidence thresholds. This is known as mining association rules. <br>
With this definition, if minsup is taken as 20% and minconf as 50%, then 80% of the rules are discarded, so most of the computation spent on them is wasted. To avoid performing needless computation, it would be useful to prune rules early, without having to compute their support and confidence.
The first step towards this approach is to decouple support and confidence.
A common strategy followed by many algorithms is to decompose the problem into two subtasks. 
1. Frequent itemset generation, whose objective is to find all the itemsets that satisfy the minsup threshold.
2. Rule generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules. 

### Frequent Itemset generation
A lattice structure can be used to enumerate all possible itemsets. Suppose there are 5 items, I = {A, B, C, D, E}; the following is the itemset lattice for these items:
<img src="http://slideplayer.com/slide/4777788/15/images/11/Frequent+Itemset+Indentification:+the+Itemset+Lattice.jpg" />
In general, a dataset with k items can generate up to $2^{k}-1$ frequent itemsets, excluding the empty set.
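
As a quick illustration, the lattice for the five items can be enumerated level by level with itertools. This is only a sketch to show where the $2^{k}-1$ count comes from; the variable names are my own.

```python
from itertools import combinations

items = ['A', 'B', 'C', 'D', 'E']

# All non-empty itemsets, grouped by size (one lattice level per size).
lattice = {k: list(combinations(items, k)) for k in range(1, len(items) + 1)}

for k, itemsets in lattice.items():
    print(k, len(itemsets))

total = sum(len(itemsets) for itemsets in lattice.values())
print(total)  # 31, i.e. 2**5 - 1
```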
The complexity of frequent itemset generation is therefore also extremely high. There are three main approaches to reducing it:
1. Reduce the number of candidate itemsets (the Apriori principle).
2. Reduce the number of comparisons. 
3. Reduce the number of transactions.

We will discuss the first approach, i.e. reducing the number of candidate itemsets.

#### The <i>Apriori</i> Principle
This states that if an itemset is frequent, then all of its subsets must also be frequent. 
To illustrate this, assume that the itemset {C, D, E} is frequent. This implies that all of its subsets, i.e. {C, D}, {C, E}, {D, E}, {C}, {D}, {E}, are also frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too. 
The pseudocode for the algorithm is: <br>
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/8eed75c18217fe2f9b15f266c40b369ce038164d" />

Explanation:
Assume that the minsup value is given as 60%, which in our example corresponds to a minimum support count of 3. 
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the items that occur in fewer than 3 transactions are discarded; in our example these are {Cola} and {Eggs}. Only the remaining items, whose support count is at least 3, are used to generate candidate 2-itemsets. Because only 4 frequent items are left, there are only <sup>4</sup>C<sub>2</sub> = 6 candidate 2-itemsets. Four of them turn out to be frequent: {Beer, Diapers}, {Bread, Diapers}, {Bread, Milk} and {Diapers, Milk}. Without pruning, 3-itemset generation would have to consider <sup>6</sup>C<sub>3</sub> = 20 candidates; with the Apriori principle, only {Bread, Diapers, Milk} remains, because it is the only 3-itemset whose 2-item subsets are all frequent. Clearly, the Apriori principle reduces the computation needed for frequent itemset generation. 
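
The level-wise procedure described above can be sketched in plain Python as follows. This is a simplified illustration under my own naming (`support_count`, `candidates_per_level`), not Mlxtend's implementation: candidates of size k are formed by joining frequent (k-1)-itemsets and are kept only if all their (k-1)-item subsets are frequent.

```python
from itertools import combinations

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]
min_count = 3  # minsup of 60% on 5 transactions


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


items = sorted(set().union(*transactions))
candidates = [frozenset([item]) for item in items]
candidates_per_level = []  # how many candidates were counted at each level
frequent = {}              # frequent itemset -> support count
k = 1
while candidates:
    candidates_per_level.append(len(candidates))
    counts = {c: support_count(c) for c in candidates}
    level_frequent = {c: n for c, n in counts.items() if n >= min_count}
    frequent.update(level_frequent)
    k += 1
    # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
    joined = {a | b for a in level_frequent for b in level_frequent
              if len(a | b) == k}
    # Prune step: every (k-1)-item subset of a candidate must itself be frequent.
    candidates = [c for c in joined
                  if all(frozenset(s) in level_frequent
                         for s in combinations(c, k - 1))]

print(candidates_per_level)
for itemset, n in frequent.items():
    print(sorted(itemset), n)
```

On this data the sketch counts 6, 6 and 1 candidates at the three levels, 13 in total, whereas enumerating every itemset of up to three items would mean 6 + 15 + 20 = 41.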


Curious minds can refer to the <a href="https://en.wikipedia.org/wiki/Apriori_algorithm">Apriori Algorithm</a> article.

Let's see apriori in action. 

In [22]:
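from mlxtend.frequent_patterns import apriori

# `df` now also holds the helper columns Milk_Diapers and Milk_Diapers_Beer,
# which contain NaN and values other than 0/1. Drop them and pass only the
# one-hot item columns, as booleans, to apriori.
df_items = df.drop(columns=['Milk_Diapers', 'Milk_Diapers_Beer']).astype(bool)
apriori(df_items, min_support=0.6)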


Unnamed: 0,support,itemsets
0,0.6,[0]
1,0.8,[1]
2,0.8,[3]
3,0.8,[5]
4,0.6,"[0, 3]"
5,0.6,"[1, 3]"
6,0.6,"[1, 5]"
7,0.6,"[3, 5]"


In [23]:
# Pass only the one-hot item columns of `df`; the helper columns contain NaN.
df_Temp = apriori(df.drop(columns=['Milk_Diapers', 'Milk_Diapers_Beer']).astype(bool),
                  min_support=0.5, use_colnames=True)
df_big = apriori(df1, min_support=0.6, use_colnames=True)
# print(df_Temp)


In [24]:
import itertools as it

min_support = 0.5
n_transactions = len(data)


def support_of(itemset):
    """Fraction of transactions in `data` that contain every item in `itemset`."""
    hits = sum(1 for transaction in data if set(itemset) <= set(transaction))
    return hits / n_transactions


# Level 1: keep the single items whose support exceeds min_support.
# dict.fromkeys removes duplicates while preserving first-seen order.
all_items = list(dict.fromkeys(item for transaction in data for item in transaction))
support_c = {item: support_of([item]) for item in all_items}
list_s = [item for item, supp in support_c.items() if supp > min_support]

# Level 2: build candidate pairs only from the frequent single items.
item_two_dic = {pair: support_of(pair) for pair in it.combinations(list_s, 2)}
frequent_pairs = [pair for pair, supp in item_two_dic.items() if supp > min_support]

# Level 3: build candidate triples only from items that occur in a frequent pair.
items_in_pairs = [item for item in list_s
                  if any(item in pair for pair in frequent_pairs)]
item_three_dic = {triple: support_of(triple)
                  for triple in it.combinations(items_in_pairs, 3)}
frequent_triples = [t for t, supp in item_three_dic.items() if supp > min_support]

# Collect every frequent itemset; keys are space-joined for display only,
# so item names that contain spaces are still handled correctly internally.
final_dic = {item: support_c[item] for item in list_s}
for pair in frequent_pairs:
    final_dic[" ".join(pair)] = item_two_dic[pair]
for triple in frequent_triples:
    final_dic[" ".join(triple)] = item_three_dic[triple]
print(final_dic)
            

{'Bread': 0.8, 'Milk': 0.8, 'Diapers': 0.8, 'Beer': 0.6, 'Bread Milk': 0.6, 'Bread Diapers': 0.6, 'Milk Diapers': 0.6, 'Diapers Beer': 0.6}


This is a rookie implementation of apriori, and right now it is restricted to this example only (it stops at 3-itemsets). But the main idea carries over to a more general solution. 
Let's move on to the next step, rule generation, which is the second important part of the algorithm.

## Rule Generation
Association rules can be generated from the frequent itemsets. Each frequent k-itemset Y can produce up to $2^k-2$ association rules, ignoring rules that have an empty set on the L.H.S. or R.H.S.
A rule is extracted by partitioning the itemset Y into two non-empty subsets, X and Y −X, such that X → Y −X satisfies the confidence threshold.
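
Here is a minimal sketch of that partitioning step for a single itemset, using the sample transactions. I use {Bread, Milk, Diapers} as the itemset, which assumes a minsup of 40% so that it counts as frequent; the function name `rules_from_itemset` and the thresholds are choices made for this example only.

```python
from itertools import combinations

transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, min_conf):
    """Return (X, Y - X, confidence) for every split of `itemset` meeting min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):  # proper, non-empty antecedents only
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = support_count(itemset) / support_count(antecedent)
            if conf >= min_conf:
                rules.append((antecedent, consequent, conf))
    return rules


all_rules = rules_from_itemset({'Bread', 'Milk', 'Diapers'}, min_conf=0.0)
strong_rules = rules_from_itemset({'Bread', 'Milk', 'Diapers'}, min_conf=0.6)

print(len(all_rules))  # 6, i.e. 2**3 - 2
for x, y, conf in strong_rules:
    print(sorted(x), '->', sorted(y), round(conf, 2))
```

Only the three rules with a two-item antecedent reach a confidence of 2/3; the rules with a single-item antecedent stop at 0.5 and are filtered out.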

### Rule generation in Apriori
The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items in the rule consequent. Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
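
The merging step from this example can be sketched as below. The function `merge_consequents` is a hypothetical helper written for this illustration; it only produces the candidate rules, and each candidate would still have to pass the confidence check before being kept.

```python
from itertools import combinations


def merge_consequents(itemset, consequents):
    """Build candidate rules by merging pairs of same-size consequents."""
    if not consequents:
        return []
    itemset = frozenset(itemset)
    size = len(consequents[0]) + 1  # merged consequents grow by one item
    merged = {a | b for a, b in combinations(consequents, 2)
              if len(a | b) == size}
    # The antecedent is whatever is left of the itemset; skip empty antecedents.
    return [(itemset - m, m) for m in merged if itemset - m]


itemset = frozenset('abcd')
# Consequents of the high-confidence rules {acd} -> {b} and {abd} -> {c}.
high_conf_consequents = [frozenset('b'), frozenset('c')]

candidates = merge_consequents(itemset, high_conf_consequents)
for antecedent, consequent in candidates:
    print(sorted(antecedent), '->', sorted(consequent))
```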

 
Now let's use the library function association_rules and see the result. 



In [25]:
from mlxtend.frequent_patterns import association_rules

association_rules(df_Temp, metric="confidence", min_threshold=0.7)

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Diapers),(Beer),0.8,0.6,0.6,0.75,1.25,0.12,1.6
1,(Beer),(Diapers),0.6,0.8,0.6,1.0,1.25,0.12,inf
2,(Bread),(Diapers),0.8,0.8,0.6,0.75,0.9375,-0.04,0.8
3,(Diapers),(Bread),0.8,0.8,0.6,0.75,0.9375,-0.04,0.8
4,(Bread),(Milk),0.8,0.8,0.6,0.75,0.9375,-0.04,0.8
5,(Milk),(Bread),0.8,0.8,0.6,0.75,0.9375,-0.04,0.8
6,(Diapers),(Milk),0.8,0.8,0.6,0.75,0.9375,-0.04,0.8
7,(Milk),(Diapers),0.8,0.8,0.6,0.75,0.9375,-0.04,0.8


In [26]:
association_rules(df_big, metric="confidence", min_threshold=0.7)

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Eggs),(Kidney Beans),0.8,1.0,0.8,1.0,1.0,0.0,inf
1,(Kidney Beans),(Eggs),1.0,0.8,0.8,0.8,1.0,0.0,1.0
2,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
3,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
4,(Milk),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
5,(Onion),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
6,(Yogurt),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
7,"(Eggs, Onion)",(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf
8,"(Eggs, Kidney Beans)",(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6
9,"(Onion, Kidney Beans)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf
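
The tables above also contain two columns that were not defined earlier, leverage and conviction. As far as I understand Mlxtend's documentation, leverage is s(X∪Y) − s(X)·s(Y) and conviction is (1 − s(Y)) / (1 − c(X→Y)), with conviction reported as inf when the confidence is 1. The sketch below recomputes both for the rule (Diapers) → (Beer) from the first table; the function names are my own.

```python
def leverage(supp_xy, supp_x, supp_y):
    """Difference between the observed joint support and the one expected under independence."""
    return supp_xy - supp_x * supp_y


def conviction(supp_y, conf):
    """(1 - s(Y)) / (1 - confidence); infinite when the rule always holds."""
    if conf == 1:
        return float('inf')
    return (1 - supp_y) / (1 - conf)


# (Diapers) -> (Beer): antecedent support, consequent support, rule support.
supp_x, supp_y, supp_xy = 0.8, 0.6, 0.6
conf_xy = supp_xy / supp_x

print(round(leverage(supp_xy, supp_x, supp_y), 2))
print(round(conviction(supp_y, conf_xy), 2))
```

The rounded results should match the 0.12 and 1.6 shown in row 0 of the first rules table.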


With just one function you get every metric for the association rules, and with these you can determine the associations between products. 
Now let us work on a real-world example.

In [27]:
df_dataset = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df_dataset.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [None]:
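# Cleaning the dataset.
df_dataset.dropna(subset=['InvoiceNo'], inplace=True)
df_dataset['InvoiceNo'] = df_dataset['InvoiceNo'].astype('str')
# Drop cancelled orders, whose invoice numbers contain a 'C'.
# .copy() avoids a SettingWithCopyWarning on the assignment below.
df_dataset = df_dataset[~df_dataset['InvoiceNo'].str.contains('C')].copy()
# Strip stray whitespace from the product descriptions.
df_dataset['Description'] = df_dataset['Description'].str.strip()

df_dataset.head()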


In [28]:

# Select only the transactions from France.
basket = (df_dataset[df_dataset['Country'] =="France"])
basket=basket[['InvoiceNo','Description']]

basket = basket.set_index('InvoiceNo')

basket.head()


Unnamed: 0_level_0,Description
InvoiceNo,Unnamed: 1_level_1
536370,ALARM CLOCK BAKELIKE PINK
536370,ALARM CLOCK BAKELIKE RED
536370,ALARM CLOCK BAKELIKE GREEN
536370,PANDA AND BUNNIES STICKER SHEET
536370,STARS GIFT TAPE


In [32]:
te = TransactionEncoder()
# Group by invoice number: each invoice becomes one transaction,
# i.e. a list of the item descriptions on that invoice.
lis = basket.groupby('InvoiceNo')['Description'].apply(list).tolist()
te_ary = te.fit(lis).transform(lis)
df_encode = pd.DataFrame(te_ary, columns=te.columns_)
# POSTAGE is a shipping charge rather than a product, so it is excluded.
df_encode.drop('POSTAGE', inplace=True, axis=1)


In [None]:
 # df_encode.loc[df_encode['ALARM CLOCK BAKELIKE PINK'] == True]

In [38]:
df_calc = apriori(df_encode, min_support=0.07,use_colnames=True)
df_calc

Unnamed: 0,support,itemsets
0,0.084599,[ALARM CLOCK BAKELIKE GREEN]
1,0.086768,[ALARM CLOCK BAKELIKE PINK]
2,0.08026,[ALARM CLOCK BAKELIKE RED ]
3,0.086768,[DOLLY GIRL LUNCH BOX]
4,0.08243,[JUMBO BAG RED RETROSPOT]
5,0.10846,[LUNCH BAG APPLE DESIGN]
6,0.073753,[LUNCH BAG DOLLY GIRL DESIGN]
7,0.132321,[LUNCH BAG RED RETROSPOT]
8,0.104121,[LUNCH BAG SPACEBOY DESIGN ]
9,0.101952,[LUNCH BAG WOODLAND]


In [39]:
association_rules(df_calc, metric="confidence", min_threshold=0.01)
# With these association rules we can see which products customers buy together,
# which can help the store owner decide on product placement in the store.

Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(PLASTERS IN TIN CIRCUS PARADE ),(PLASTERS IN TIN SPACEBOY),0.147505,0.119306,0.078091,0.529412,4.437433,0.060493,1.871475
1,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN CIRCUS PARADE ),0.119306,0.147505,0.078091,0.654545,4.437433,0.060493,2.467747
2,(PLASTERS IN TIN CIRCUS PARADE ),(PLASTERS IN TIN WOODLAND ANIMALS),0.147505,0.145336,0.086768,0.588235,4.04741,0.06533,2.075612
3,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN CIRCUS PARADE ),0.145336,0.147505,0.086768,0.597015,4.04741,0.06533,2.11545
4,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.119306,0.145336,0.088937,0.745455,5.129172,0.071598,3.357608
5,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN SPACEBOY),0.145336,0.119306,0.088937,0.61194,5.129172,0.071598,2.269481
6,(SET/6 RED SPOTTY PAPER CUPS),(SET/20 RED RETROSPOT PAPER NAPKINS ),0.117137,0.112798,0.086768,0.740741,6.566952,0.073555,3.422064
7,(SET/20 RED RETROSPOT PAPER NAPKINS ),(SET/6 RED SPOTTY PAPER CUPS),0.112798,0.117137,0.086768,0.769231,6.566952,0.073555,3.825741
8,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS ),0.10846,0.112798,0.086768,0.8,7.092308,0.074534,4.436009
9,(SET/20 RED RETROSPOT PAPER NAPKINS ),(SET/6 RED SPOTTY PAPER PLATES),0.112798,0.10846,0.086768,0.769231,7.092308,0.074534,3.863341


In [None]:
# df_groceries = pd.read_csv('http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv',header=None)
# df_groceries.head()

## References
1. Association Rule Learning https://en.wikipedia.org/wiki/Association_rule_learning
2. Apriori Algorithm https://en.wikipedia.org/wiki/Apriori_algorithm
3. R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining)
