# An Introduction to Association Rules in Python

##### Association rule mining is a rule-based learning method used to discover frequent patterns and correlations in datasets such as transactional and relational data.

##### In essence, it computes co-occurrence statistics between items and expresses them as implications of the form X → Y.

##### For instance, in customer basket analysis, {diaper} → {beer} means that if diapers are bought, beer is likely to be in the basket as well.

#### 4 fundamental concepts in association rules:

* *(Not a Rule)* Support(X): the fraction of all transactions that contain X.

* Support(X→Y) is the probability that both items co-occur, measured over all transactions.

* Confidence(X→Y) is the probability that Y occurs given that X is present.

* Lift(X→Y) is the confidence of X→Y divided by the support of Y, so it measures how likely Y is given X while accounting for how popular Y already is.

So basically it is probability and statistics: a simple but useful decision-making tool for a wide range of uses such as market basket analysis, customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud & malware detection) and bioinformatics.
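The three rule metrics can be computed by hand in a few lines. The sketch below uses five hypothetical baskets (made up for illustration, not taken from any dataset in this notebook) and a helper named `support`, which is also an assumption of this example.

```python
# Hypothetical baskets, made up for illustration only.
baskets = [
    {"diaper", "beer", "chips"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

support_x = support({"diaper"})           # P(X)
support_y = support({"beer"})             # P(Y)
support_xy = support({"diaper", "beer"})  # P(X and Y)

confidence = support_xy / support_x       # P(Y | X)
lift = confidence / support_y             # P(Y | X) / P(Y)

print(f"support={support_xy:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 here would suggest that diaper buyers pick up beer more often than shoppers in general do.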


# Example 1

### Before getting into the formulas and terminology, let's begin with a simple example.

Mlxtend is a rich and useful library for machine learning. It provides association rule methods built around a major algorithm, *apriori*.

You can install mlxtend via pip or conda.

In [1]:
!pip install mlxtend



In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori,association_rules
import warnings
warnings.filterwarnings('ignore')

# apriori - function that finds all itemsets whose support is at least a given minimum

To use association rules, we first need some data in one-hot encoded format.

Imagine a grocery database in which each order ID is associated with some products...

In [5]:
data = {'ID':[1,2,3,4,5,6],
       'Onion':[1,0,0,1,1,1],
       'Potato':[1,1,0,1,1,1],
       'Burger':[1,1,0,0,1,1],
       'Milk':[0,1,1,1,0,1],
       'Beer':[0,0,1,0,1,0]}

# 1- bought
# 0 - not bought

In [6]:
data

{'ID': [1, 2, 3, 4, 5, 6],
 'Onion': [1, 0, 0, 1, 1, 1],
 'Potato': [1, 1, 0, 1, 1, 1],
 'Burger': [1, 1, 0, 0, 1, 1],
 'Milk': [0, 1, 1, 1, 0, 1],
 'Beer': [0, 0, 1, 0, 1, 0]}

In [7]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,ID,Onion,Potato,Burger,Milk,Beer
0,1,1,1,1,0,0
1,2,0,1,1,1,0
2,3,0,0,0,1,1
3,4,1,1,0,1,0
4,5,1,1,1,0,1


In [8]:
df

Unnamed: 0,ID,Onion,Potato,Burger,Milk,Beer
0,1,1,1,1,0,0
1,2,0,1,1,1,0
2,3,0,0,0,1,1
3,4,1,1,0,1,0
4,5,1,1,1,0,1
5,6,1,1,1,1,0


### Then, we can generate frequent itemsets based on *support*.

Here we need to set the minimum support to a value in [0, 1]. Using `min_support=0.5` means we only keep itemsets that appear in at least half of the transactions.

`apriori(df, min_support=0.5, use_colnames=False, max_len=None)`

In [9]:
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]], min_support=0.5, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.666667,(Onion)
1,0.833333,(Potato)
2,0.666667,(Burger)
3,0.666667,(Milk)
4,0.666667,"(Potato, Onion)"
5,0.5,"(Onion, Burger)"
6,0.666667,"(Potato, Burger)"
7,0.5,"(Potato, Milk)"
8,0.5,"(Potato, Onion, Burger)"


Itemsets with 1, 2 or 3 items are returned, each with support ≥ 0.5.

The only itemset with 3 products is [Onion, Potato, Burger].
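To see where these nine itemsets come from, here is a minimal level-wise sketch of the idea behind apriori: only itemsets that are already frequent get extended by one more item. It is written in plain Python for illustration and is not mlxtend's implementation; the function names are assumptions of this example.

```python
from itertools import combinations

# The same six baskets as the one-hot table above, written as sets of item names.
baskets = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def frequent_itemsets_levelwise(baskets, min_support):
    """Level-wise search: only extend itemsets that are already frequent."""
    items = sorted(set().union(*baskets))
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Keep the candidates of this size that meet the threshold.
        scored = {c: support(c, baskets) for c in current}
        frequent = {c: s for c, s in scored.items() if s >= min_support}
        result.update(frequent)
        # Build candidates one item larger from pairs of frequent itemsets,
        # dropping any candidate that has an infrequent subset.
        size = len(next(iter(frequent))) + 1 if frequent else 0
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in frequent for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result

found = frequent_itemsets_levelwise(baskets, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 6))
```

The pruning step is what keeps the search manageable: {Onion, Potato, Milk} is never counted because its subset {Onion, Milk} is already below the threshold.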

### Final Step: generate the rules with their corresponding support, confidence and lift, (and leverage & conviction):

```association_rules(df, metric='confidence', min_threshold=0.8)```

* Here, df means the frequent_itemsets dataframe;

* metric is the measure used to decide whether a rule is of interest. You can set it to one of the metric columns of the output, such as 'support', 'confidence', 'lift', 'leverage' or 'conviction';

* min_threshold is the minimum value required for the specified metric.

In [10]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

In [11]:
df

Unnamed: 0,ID,Onion,Potato,Burger,Milk,Beer
0,1,1,1,1,0,0
1,2,0,1,1,1,0
2,3,0,0,0,1,1
3,4,1,1,0,1,0
4,5,1,1,1,0,1
5,6,1,1,1,1,0


In [12]:
rules
# the support column below is the support of antecedents and consequents together

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Potato),(Onion),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
1,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
2,(Onion),(Burger),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
3,(Burger),(Onion),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
4,(Potato),(Burger),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
5,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
6,"(Potato, Onion)",(Burger),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
7,"(Potato, Burger)",(Onion),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
8,"(Onion, Burger)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf,0.333333
9,(Potato),"(Onion, Burger)",0.833333,0.5,0.5,0.6,1.2,0.083333,1.25,1.0


link: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/


* 'support': support(A→C) = support(A∪C), range: [0, 1]

* 'confidence': confidence(A→C) = support(A→C) / support(A), range: [0, 1]

* 'lift': lift(A→C) = confidence(A→C) / support(C), range: [0, ∞]

* 'leverage': leverage(A→C) = support(A→C) − support(A) × support(C), range: [−1, 1]

* 'conviction': conviction(A→C) = (1 − support(C)) / (1 − confidence(A→C)), range: [0, ∞]

* 'zhangs_metric': zhangs_metric(A→C) = (confidence(A→C) − confidence(A′→C)) / max[confidence(A→C), confidence(A′→C)], range: [−1, 1]
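As a check on these formulas, the sketch below recomputes all six metrics for Potato → Onion and Onion → Potato directly from the six baskets of Example 1, without mlxtend. The helper names (`support`, `rule_metrics`) are assumptions of this example, and A′ is read as "baskets that do not contain all of A".

```python
import math

# The six baskets of Example 1 as sets of item names.
baskets = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Compute the metrics listed above for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    supp_a, supp_c, supp_ac = support(a), support(c), support(a | c)
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = math.inf if confidence == 1 else (1 - supp_c) / (1 - confidence)
    # confidence(A' -> C): share of baskets WITHOUT the antecedent that contain C.
    without_a = [basket for basket in baskets if not a <= basket]
    conf_not_a = (
        sum(1 for basket in without_a if c <= basket) / len(without_a)
        if without_a else 0.0
    )
    denominator = max(confidence, conf_not_a)
    zhang = (confidence - conf_not_a) / denominator if denominator else 0.0
    return {
        "support": supp_ac,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }

potato_onion = rule_metrics({"Potato"}, {"Onion"})
onion_potato = rule_metrics({"Onion"}, {"Potato"})
for name, metrics in [("Potato -> Onion", potato_onion), ("Onion -> Potato", onion_potato)]:
    print(name, {k: round(v, 6) for k, v in metrics.items()})
```

The printed values can be compared with rows 0 and 1 of the rules table above.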

### Interpreting the result:

Every rule above has a lift of at least 1 (the threshold we set), and several reach 1.2, which means the antecedent and consequent occur together more often than would be expected if they were independent.

Several are high in confidence as well. But domain knowledge will be useful in explaining the phenomenon.

In [13]:
rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.7)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Potato),(Onion),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
1,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
4,(Potato),(Burger),0.833333,0.666667,0.666667,0.8,1.2,0.111111,1.666667,1.0
5,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,0.111111,inf,0.5
8,"(Onion, Burger)",(Potato),0.5,0.833333,0.5,1.0,1.2,0.083333,inf,0.333333


Subsetting on lift and confidence returns the rules whose itemsets are relatively highly correlated in this data.

We can see that:

* **If Onion or Burger is in a user's basket, it is highly likely that the user will buy Potato as well.**
* **If Burger and Onion are in a user's basket, it is highly likely that the user will also buy Potato.**

### Some notes on Lift, Conviction & Leverage:


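1.  Lift(X→Y) : the likelihood of Y being bought when X is present, taking into account the popularity of Y as well.
    > When Lift=1, X has no impact on Y (the two are independent)  
    > When Lift>1, X and Y occur together more often than expected (positive association)  
    > When Lift<1, X and Y occur together less often than expected (negative association)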
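One way to see why a lift of 1 means no impact: substituting the definition of confidence into the lift formula gives a ratio between the observed joint support and the value expected if X and Y were independent. A short derivation, with the Potato → Onion values from the rules table above plugged in:

```latex
\begin{aligned}
\mathrm{lift}(X \to Y)
  &= \frac{\mathrm{confidence}(X \to Y)}{\mathrm{support}(Y)}
   = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)} \\[4pt]
\mathrm{lift}(\text{Potato} \to \text{Onion})
  &= \frac{4/6}{(5/6)\,(4/6)} = \frac{6}{5} = 1.2
\end{aligned}
```

Because the right-hand side is symmetric in X and Y, lift(X→Y) = lift(Y→X), which matches the identical lift values of each rule and its reverse in the table.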


# Example 2

In [14]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }

In [23]:
retail = pd.DataFrame(retail_shopping_basket)
retail

Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk]"
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


In [24]:
retail = retail[['ID', 'Basket']]

In [25]:
pd.options.display.max_colwidth=100

Suppose each customer ID maps to a list of basket items:

In [26]:
retail

Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk]"
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


First one-hot encode the basket, but how?

In [27]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
retail = pd.DataFrame(mlb.fit_transform(retail.Basket), columns=mlb.classes_)
retail
# Each cell of `Basket` holds a list such as [a, b, c, d], so MultiLabelBinarizer is used.
# Traditional one-hot encoding expects a single value per cell, e.g. pd.get_dummies(x, drop_first=True).
# new_df = mlb.fit_transform(retail.Basket)
# mlb.inverse_transform(new_df)  --> reverses the one-hot encoding

Unnamed: 0,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,0,0,1,0,1,0,0,0,0,0,0,0,0,0
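What the encoder does can be sketched by hand: collect the sorted vocabulary of items, then mark each basket with 1 or 0 per item. This plain-Python version is only an illustration of the idea, not how `MultiLabelBinarizer` is implemented.

```python
# The six baskets of Example 2.
baskets = [
    ["Beer", "Diaper", "Pretzels", "Chips", "Aspirin"],
    ["Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"],
    ["Soda", "Chips", "Milk"],
    ["Soup", "Beer", "Diaper", "Milk", "IceCream"],
    ["Soda", "Coffee", "Milk", "Bread"],
    ["Beer", "Chips"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: 1 if the basket contains the column's item, else 0.
rows = [[int(item in basket) for item in columns] for basket in baskets]

print(columns)
for row in rows:
    print(row)
```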


In [28]:
# Another method for one-hot encoding: TransactionEncoder
df_shop = pd.DataFrame(retail_shopping_basket)
transactions = df_shop['Basket'].tolist()  # must be passed as a list of lists

from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)  # boolean array
df_bool = pd.DataFrame(te_array, columns=te.columns_)
df_encoded = df_bool.astype(int)  # True/False -> 1/0
df_encoded

# fit() and transform() are called separately here; TransactionEncoder also provides fit_transform().


Unnamed: 0,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,0,0,1,0,1,0,0,0,0,0,0,0,0,0


In [31]:
# Another method: Series.str.get_dummies(sep=',')  (just for reference)

ret = pd.DataFrame(retail_shopping_basket)
ret['Basket_str'] = ret['Basket'].apply(lambda x: ','.join(x))  # join each list into one comma-separated string
dummies = ret['Basket_str'].str.get_dummies(sep=',')
ret_encoded = ret.drop(columns=['Basket_str', 'Basket']).join(dummies)  # ID is kept, as in the output below
ret_encoded

Unnamed: 0,ID,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,2,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,3,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,4,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,5,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,6,0,0,1,0,1,0,0,0,0,0,0,0,0,0


In [18]:
retail

Unnamed: 0,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,0,0,1,0,1,0,0,0,0,0,0,0,0,0


Making use of `Series.str.get_dummies`, we can easily encode lists of items in a dataframe's column!
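A compact variant of the same idea, shown here on two small hypothetical baskets: `Series.str.join` turns each list into a delimited string, and `Series.str.get_dummies` splits it back into indicator columns. The `|` separator is an assumption of this sketch and must not appear inside any item name.

```python
import pandas as pd

# Two hypothetical baskets, each stored as a list in one Series cell.
baskets = pd.Series([["Beer", "Chips"], ["Soda", "Chips", "Milk"]])

# Join each list into one string, then expand it into 0/1 indicator columns.
encoded = baskets.str.join("|").str.get_dummies(sep="|")

# Boolean columns are a common input format for frequent-itemset functions.
encoded_bool = encoded.astype(bool)

print(encoded)
```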

In [19]:
frequent_itemsets_2 = apriori(retail, min_support=0.5,use_colnames=True)
frequent_itemsets_2



Unnamed: 0,support,itemsets
0,0.666667,(Beer)
1,0.666667,(Chips)
2,0.5,(Diaper)
3,0.666667,(Milk)
4,0.5,"(Chips, Beer)"
5,0.5,"(Diaper, Beer)"


Just by looking at support(X→Y), {Beer, Chips} and {Beer, Diaper} are the two frequent baskets of interest.

But which one is more correlated than the other?

In [20]:
association_rules(frequent_itemsets_2, metric='lift',min_threshold=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Chips),(Beer),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
1,(Beer),(Chips),0.666667,0.666667,0.5,0.75,1.125,0.055556,1.333333,0.333333
2,(Diaper),(Beer),0.5,0.666667,0.5,1.0,1.5,0.166667,inf,0.666667
3,(Beer),(Diaper),0.666667,0.5,0.5,0.75,1.5,0.166667,2.0,1.0



Clearly, {Diaper, Beer} is the most associated itemset in this data!

# Example 3 - Movie Genre Associations


Recommendation system using market basket analysis

It seems a bit boring playing only with basket analysis and imaginary datasets.

In this example, let's play with an open dataset [MovieLens (small)](https://grouplens.org/datasets/movielens/).

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

We might want to take a look at the data and its basic statistics first:

In [21]:
movies = pd.read_csv(r'D:\intellipaat ds ai ml\DS@AI\5 ml\06-01 Association rules mining/movies.csv')


In [22]:
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [23]:

movies_ohe = movies.drop('genres',axis=1).join(movies.genres.str.get_dummies(sep='|'))
#  encoding done 

# try with MultiLabelBinarizer also

In [24]:
movies_ohe.head()

Unnamed: 0,movieId,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
movies_ohe.shape

(9125, 22)

### Let's get back to analysing the genre associations:

In [26]:
movies_ohe.set_index(['movieId','title'],inplace=True)
movies_ohe

Unnamed: 0_level_0,Unnamed: 1_level_0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162672,Mohenjo Daro (2016),0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
163056,Shin Godzilla (2016),0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
163949,The Beatles: Eight Days a Week - The Touring Years (2016),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
164977,The Gay Desperado (1936),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [27]:
movies_ohe.shape

(9125, 20)

In [28]:
frequent_itemsets_movies = apriori(movies_ohe,use_colnames=True,min_support=0.01)
frequent_itemsets_movies



Unnamed: 0,support,itemsets
0,0.169315,(Action)
1,0.122411,(Adventure)
2,0.048986,(Animation)
3,0.063890,(Children)
4,0.363288,(Comedy)
...,...,...
87,0.010301,"(Mystery, Crime, Drama)"
88,0.032000,"(Thriller, Crime, Drama)"
89,0.012493,"(Thriller, Crime, Mystery)"
90,0.010411,"(Thriller, Horror, Drama)"


In [29]:
rules_movies =  association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1)

In [30]:
recs=rules_movies.sort_values('lift',ascending=False)
recs.head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
119,(Animation),"(Children, Adventure)",0.048986,0.02926,0.014685,0.299776,10.245163,0.013252,1.386328,0.948875
118,"(Children, Adventure)",(Animation),0.02926,0.048986,0.014685,0.501873,10.245163,0.013252,1.909178,0.929593
117,"(Adventure, Animation)",(Children),0.023342,0.06389,0.014685,0.629108,9.846673,0.013194,2.523941,0.919916
120,(Children),"(Adventure, Animation)",0.06389,0.023342,0.014685,0.229846,9.846673,0.013194,1.268132,0.959762
144,(Children),"(Comedy, Animation)",0.06389,0.02137,0.01337,0.209262,9.792409,0.012005,1.237617,0.959161
141,"(Comedy, Animation)",(Children),0.02137,0.06389,0.01337,0.625641,9.792409,0.012005,2.500567,0.917487
25,(Animation),(Children),0.048986,0.06389,0.027068,0.552573,8.648758,0.023939,2.092205,0.92993
24,(Children),(Animation),0.06389,0.048986,0.027068,0.423671,8.648758,0.023939,1.650122,0.944736
145,(Animation),"(Comedy, Children)",0.048986,0.032877,0.01337,0.272931,8.301641,0.011759,1.330166,0.924847
140,"(Comedy, Children)",(Animation),0.032877,0.048986,0.01337,0.406667,8.301641,0.011759,1.602832,0.909441


***As we can see in this dataset, the support and hence the confidence values are fairly small, which makes it difficult to interpret the result based on these two values alone. The lift and conviction, by contrast, remain intuitive and representative. That is why we should understand the meaning of all the metrics to interpret the result accurately!***
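A quick way to make this concrete is to recompute lift and conviction for one low-support rule from the numbers printed in the table. The sketch below copies the values of the Animation → Children row; the variable names are assumptions of this example.

```python
# Values copied from the Animation -> Children row of the rules table above.
consequent_support = 0.063890  # share of all movies tagged Children
confidence = 0.552573          # share of Animation movies that are also tagged Children

# Lift compares the conditional rate with the base rate of the consequent.
lift = confidence / consequent_support

# Conviction compares how often the rule would fail under independence
# with how often it actually fails.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(lift, 2), round(conviction, 2))
```

Only about 6% of all movies are tagged Children, but more than half of the Animation movies are, which is why the lift is large even though the support is small.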

In [31]:
rules_movies[(rules_movies.lift>2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Adventure),(Action),0.122411,0.169315,0.058301,0.476276,2.812955,0.037575,1.586111,0.734401
1,(Action),(Adventure),0.169315,0.122411,0.058301,0.344337,2.812955,0.037575,1.338475,0.775868
8,(IMAX),(Action),0.016767,0.169315,0.010082,0.601307,3.551410,0.007243,2.083521,0.730673
9,(Action),(IMAX),0.169315,0.016767,0.010082,0.059547,3.551410,0.007243,1.045489,0.864855
10,(Sci-Fi),(Action),0.086795,0.169315,0.040986,0.472222,2.789015,0.026291,1.573929,0.702416
...,...,...,...,...,...,...,...,...,...,...
168,(Thriller),"(Horror, Drama)",0.189479,0.017644,0.010411,0.054945,3.114122,0.007068,1.039470,0.837588
171,"(Thriller, Drama)",(Mystery),0.087123,0.059507,0.018301,0.210063,3.530062,0.013117,1.190592,0.785121
172,"(Mystery, Drama)",(Thriller),0.031671,0.189479,0.018301,0.577855,3.049696,0.012300,1.920004,0.694081
173,(Thriller),"(Mystery, Drama)",0.189479,0.031671,0.018301,0.096588,3.049696,0.012300,1.071857,0.829218


* Although we might expect the {Romance, Drama} pair to stand out, it is not as correlated as other groups such as {Animation, Children}, which have much higher lift & conviction levels.

In [32]:
rules_movies[(rules_movies.lift>2)].sort_values(by=['lift','confidence'], ascending=[False,False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
118,"(Children, Adventure)",(Animation),0.029260,0.048986,0.014685,0.501873,10.245163,0.013252,1.909178,0.929593
119,(Animation),"(Children, Adventure)",0.048986,0.029260,0.014685,0.299776,10.245163,0.013252,1.386328,0.948875
117,"(Adventure, Animation)",(Children),0.023342,0.063890,0.014685,0.629108,9.846673,0.013194,2.523941,0.919916
120,(Children),"(Adventure, Animation)",0.063890,0.023342,0.014685,0.229846,9.846673,0.013194,1.268132,0.959762
141,"(Comedy, Animation)",(Children),0.021370,0.063890,0.013370,0.625641,9.792409,0.012005,2.500567,0.917487
...,...,...,...,...,...,...,...,...,...,...
90,(Adventure),"(Thriller, Action)",0.122411,0.062904,0.016000,0.130707,2.077881,0.008300,1.077998,0.591097
108,(Thriller),"(Drama, Action)",0.189479,0.051178,0.020055,0.105842,2.068103,0.010358,1.061134,0.637202
107,"(Drama, Action)",(Thriller),0.051178,0.189479,0.020055,0.391863,2.068103,0.010358,1.332793,0.544322
112,"(Sci-Fi, Action)",(Thriller),0.040986,0.189479,0.015781,0.385027,2.032024,0.008015,1.317977,0.529586


By taking a subset and ordering it by lift & confidence:

* The highest correlation: {Animation, Children} correlates in both directions! Recall those Pixar & Disney films that we love watching.
* {Children, Adventure} ...
* {Fantasy, Adventure} ... How to interpret these two pairs?

The best way is to go back to your movies table and check it out!

In [33]:
# pd.options.display.max_rows=50

In [34]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


So we want Adventure & Children but NOT Animation. The cell below applies the same kind of filter to Romance movies that are NOT Comedy:

In [35]:
movies[ (movies.genres.str.contains('Romance')) & (~movies.genres.str.contains('Comedy'))]

# ~ is the element-wise NOT operator

Unnamed: 0,movieId,title,genres
14,15,Cutthroat Island (1995),Action|Adventure|Romance
16,17,Sense and Sensibility (1995),Drama|Romance
24,25,Leaving Las Vegas (1995),Drama|Romance
27,28,Persuasion (1995),Drama|Romance
33,35,Carrington (1995),Drama|Romance
...,...,...,...
9057,149606,Bajirao Mastani (2015),Romance|War
9065,152017,Me Before You (2016),Drama|Romance
9091,158783,The Handmaiden (2016),Drama|Romance|Thriller
9119,162542,Rustom (2016),Romance|Thriller


So, well, what are these movies? I hardly know any of them... (proves again the notion that domain knowledge is of utmost importance in data science!)

Voilà, I know ***Tomorrowland (2015)***! We all know this movie, so we sort of understand why {Children, Adventure} is an associated pair. It is not an animation, but its interesting fantasy storyline about discovering the secrets of a mystical place succeeded in appealing to little boys and girls.

There is more to discover. Try finding an interesting pair on your own!

# Summary

To recap, a straightforward four-step approach to association rules:

1. One-hot encode the baskets in a dataframe.
2. Generate frequent itemsets using `apriori`.
3. Generate rules with `association_rules`.
4. Interpret & evaluate the result with metrics.
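The four steps can also be traced end to end without any library, which may help in checking what the mlxtend calls return. The sketch below uses the six baskets of Example 2 and counts every subset of every basket by brute force, so it is only suitable for tiny data and is not how `apriori` works internally.

```python
from collections import Counter
from itertools import combinations

baskets = [
    ["Beer", "Diaper", "Pretzels", "Chips", "Aspirin"],
    ["Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"],
    ["Soda", "Chips", "Milk"],
    ["Soup", "Beer", "Diaper", "Milk", "IceCream"],
    ["Soda", "Coffee", "Milk", "Bread"],
    ["Beer", "Chips"],
]

# Step 1: represent each basket as a set (the set form of a one-hot row).
transactions = [frozenset(basket) for basket in baskets]
n = len(transactions)

# Step 2: count every non-empty subset of every basket, keep the frequent ones.
counts = Counter(
    frozenset(combo)
    for t in transactions
    for size in range(1, len(t) + 1)
    for combo in combinations(sorted(t), size)
)
min_support = 0.5
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}

# Step 3: split each frequent itemset into antecedent -> consequent.
# Every subset of a frequent itemset is itself frequent, so the lookups succeed.
rules = []
for itemset, supp in frequent.items():
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            consequent = itemset - antecedent
            confidence = supp / frequent[antecedent]
            lift = confidence / frequent[consequent]
            rules.append((sorted(antecedent), sorted(consequent), supp, confidence, lift))

# Step 4: keep rules with lift >= 1 and rank them by lift.
strong = sorted((r for r in rules if r[4] >= 1), key=lambda r: -r[4])
for antecedent, consequent, supp, confidence, lift in strong:
    print(antecedent, "->", consequent, round(supp, 3), round(confidence, 3), round(lift, 3))
```

The surviving rules can be compared with the Example 2 rules table.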

### References:
1. [Introduction to Market Basket Analysis in Python](http://pbpython.com/market-basket-analysis.html)
2. [Movie genre associations](https://mathematicaforprediction.wordpress.com/2013/10/06/movie-genre-associations/)
3. [Mining Association Rules](https://paginas.fe.up.pt/~ec/files_0506/slides/04_AssociationRules.pdf)
4. [Association Rules Generation from Frequent Itemsets](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)
5. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

# One-hot encoding using MultiLabelBinarizer

In [4]:
movies = pd.read_csv(r'D:\intellipaat ds ai ml\DS@AI\5 ml\06-01 Association rules mining/movies.csv')
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [None]:
# movies= movies.drop('genres', axis =1).join(movies.genres.str.split( '|'))

In [5]:
movies['genres_split'] = movies['genres'].str.split('|')  # split on '|' and store the list of genres in a new column

In [6]:
movies.head(3)

Unnamed: 0,movieId,title,genres,genres_split
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]"


In [17]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genre_ohe = mlb.fit_transform(movies['genres_split'])
genre_df = pd.DataFrame(genre_ohe, columns=mlb.classes_, index=movies.index)
movies_encoded = pd.concat([movies.drop(columns=['genres', 'genres_split']), genre_df], axis=1)
movies_encoded=movies_encoded.set_index(['movieId','title'])
movies_encoded

Unnamed: 0_level_0,Unnamed: 1_level_0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162672,Mohenjo Daro (2016),0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
163056,Shin Godzilla (2016),0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
163949,The Beatles: Eight Days a Week - The Touring Years (2016),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
164977,The Gay Desperado (1936),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [7]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

In [12]:
genre_ohe = mlb.fit_transform(movies['genres_split'])
genre_df = pd.DataFrame(genre_ohe, columns=mlb.classes_, index=movies.index)
genre_df

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9120,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
9121,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
9122,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
9123,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [14]:
movies_encoded = pd.concat([movies.drop(columns=['genres', 'genres_split']), genre_df], axis=1)

In [15]:
movies_encoded

Unnamed: 0,movieId,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9120,162672,Mohenjo Daro (2016),0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9121,163056,Shin Godzilla (2016),0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9122,163949,The Beatles: Eight Days a Week - The Touring Y...,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9123,164977,The Gay Desperado (1936),0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
movies_encoded=movies_encoded.set_index(['movieId','title'])
movies_encoded

Unnamed: 0_level_0,Unnamed: 1_level_0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162672,Mohenjo Daro (2016),0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
163056,Shin Godzilla (2016),0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
163949,The Beatles: Eight Days a Week - The Touring Years (2016),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
164977,The Gay Desperado (1936),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
# Another method for one-hot encoding: TransactionEncoder
movies = pd.read_csv(r'D:\intellipaat ds ai ml\DS@AI\5 ml\06-01 Association rules mining/movies.csv')

movies['genres_split'] = movies['genres'].str.split('|')  # split on '|' and store the list of genres in a new column
transactions = movies['genres_split'].tolist()  # must be passed as a list of lists

from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)  # boolean array
movies_encoded = pd.DataFrame(te_array, columns=te.columns_)
movies_encoded = movies_encoded.astype(int)  # True/False -> 1/0
movies_oh = pd.concat([movies.drop(columns=['genres', 'genres_split']), movies_encoded], axis=1)

movies_oh = movies_oh.set_index(['movieId', 'title'])
movies_oh

# fit() and transform() are called separately here; TransactionEncoder also provides fit_transform().


Unnamed: 0_level_0,Unnamed: 1_level_0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movieId,title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162672,Mohenjo Daro (2016),0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
163056,Shin Godzilla (2016),0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
163949,The Beatles: Eight Days a Week - The Touring Years (2016),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
164977,The Gay Desperado (1936),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
movies['genres_split'].tolist()

[['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
 ['Adventure', 'Children', 'Fantasy'],
 ['Comedy', 'Romance'],
 ['Comedy', 'Drama', 'Romance'],
 ['Comedy'],
 ['Action', 'Crime', 'Thriller'],
 ['Comedy', 'Romance'],
 ['Adventure', 'Children'],
 ['Action'],
 ['Action', 'Adventure', 'Thriller'],
 ['Comedy', 'Drama', 'Romance'],
 ['Comedy', 'Horror'],
 ['Adventure', 'Animation', 'Children'],
 ['Drama'],
 ['Action', 'Adventure', 'Romance'],
 ['Crime', 'Drama'],
 ['Drama', 'Romance'],
 ['Comedy'],
 ['Comedy'],
 ['Action', 'Comedy', 'Crime', 'Drama', 'Thriller'],
 ['Comedy', 'Crime', 'Thriller'],
 ['Crime', 'Drama', 'Horror', 'Mystery', 'Thriller'],
 ['Action', 'Crime', 'Thriller'],
 ['Drama', 'Sci-Fi'],
 ['Drama', 'Romance'],
 ['Drama'],
 ['Children', 'Drama'],
 ['Drama', 'Romance'],
 ['Adventure', 'Drama', 'Fantasy', 'Mystery', 'Sci-Fi'],
 ['Crime', 'Drama'],
 ['Drama'],
 ['Mystery', 'Sci-Fi', 'Thriller'],
 ['Children', 'Drama'],
 ['Drama', 'Romance'],
 ['Crime', 'Drama'],
 ['