**Fundamental concepts in association rules**

Support(X) (of an itemset, not a rule) is the fraction of all transactions that contain X.

Support(X→Y) is the probability that X and Y occur together, i.e. the fraction of all transactions that contain both.

Confidence(X→Y) is the probability that Y occurs given that X is present.

Lift(X→Y) is the confidence of X→Y divided by the support of Y: how much more likely Y is to be bought when X is present, taking into account the popularity of Y as well.

Conviction(X→Y) is a measure of implication. A value of 1 means X and Y are unrelated; the further the value rises above 1, the more strongly Y depends on X.
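
To make the first three definitions concrete, here is a minimal sketch that computes support, confidence and lift by hand for the rule Onion→Burger. It assumes the six toy baskets used in this notebook, written as plain Python sets; the helper name `support` is made up for illustration and is not part of mlxtend.

```python
# The six toy baskets, one set of items per transaction.
baskets = [
    {'Onion', 'Potato', 'Burger'},
    {'Potato', 'Burger', 'Milk'},
    {'Milk', 'Beer'},
    {'Onion', 'Potato', 'Milk'},
    {'Onion', 'Potato', 'Burger', 'Beer'},
    {'Onion', 'Potato', 'Burger', 'Milk'},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


x, y = {'Onion'}, {'Burger'}

sup_xy = support(x | y, baskets)           # P(X and Y)
confidence = sup_xy / support(x, baskets)  # P(Y | X)
lift = confidence / support(y, baskets)    # P(Y | X) / P(Y)

print('support   :', round(sup_xy, 3))
print('confidence:', round(confidence, 3))
print('lift      :', round(lift, 3))
```

Because lift divides by the support of Y, a rule whose consequent is popular everywhere needs a high confidence before its lift rises above 1.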

So basically it is probability and statistics: a simple but useful decision-making tool for a wide range of uses such as market basket analysis, customer relationship management, recommender systems, marketing activities, network traffic analysis, intrusion detection (fraud and malware detection) and bioinformatics.

In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
data = {'ID':[1,2,3,4,5,6],
       'Onion':[1,0,0,1,1,1],
       'Potato':[1,1,0,1,1,1],
       'Burger':[1,1,0,0,1,1],
       'Milk':[0,1,1,1,0,1],
       'Beer':[0,0,1,0,1,0]}

In [None]:
df = pd.DataFrame(data)
df = df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]]
df

Then, we can generate frequent itemsets based on support.
Here we need to set the minimum support to a value in [0, 1]. Using min_support = 0.5 means we only keep itemsets that appear in at least half of the transactions.

apriori(df, min_support=0.5, use_colnames=False, max_len=None)
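
To show what the support filter does, the sketch below enumerates every possible itemset in the toy data and keeps those that reach the threshold. This is a brute-force enumeration, not the Apriori algorithm (which prunes candidates instead of testing all of them), and the function name `frequent_itemsets_bruteforce` is made up for illustration. It assumes the threshold is inclusive.

```python
from itertools import combinations

# The six toy baskets, one set of items per transaction.
baskets = [
    {'Onion', 'Potato', 'Burger'},
    {'Potato', 'Burger', 'Milk'},
    {'Milk', 'Beer'},
    {'Onion', 'Potato', 'Milk'},
    {'Onion', 'Potato', 'Burger', 'Beer'},
    {'Onion', 'Potato', 'Burger', 'Milk'},
]


def frequent_itemsets_bruteforce(baskets, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            # Count the baskets that contain every item of the candidate.
            count = sum(set(combo) <= basket for basket in baskets)
            if count / n >= min_support:
                result[frozenset(combo)] = count / n
    return result


found = frequent_itemsets_bruteforce(baskets, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 3))
```

Beer appears in only two of the six baskets, so neither it nor any itemset containing it survives the 0.5 threshold.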

In [None]:
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer' ]], 
                            min_support=0.50, use_colnames=True)

In [None]:
frequent_itemsets

**Final step: generate the rules with their corresponding support, confidence and lift (plus leverage and conviction):**

association_rules(df, metric='confidence', min_threshold=0.8)

Here, df is the frequent_itemsets dataframe.

metric is the measure used to decide whether a rule is kept. You can set it to any one of the five metrics: support, confidence, lift, leverage or conviction.

min_threshold is the minimum value a rule must reach on the specified metric.
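
As a rough illustration of what rule generation involves, the sketch below splits one frequent itemset into every antecedent/consequent pair and keeps the pairs whose confidence reaches the threshold. It is a simplified stand-in written in plain Python, not mlxtend's implementation; the names `count` and `rules_from_itemset` are made up, and the threshold is assumed to be inclusive.

```python
from itertools import combinations

# The six toy baskets, one set of items per transaction.
baskets = [
    {'Onion', 'Potato', 'Burger'},
    {'Potato', 'Burger', 'Milk'},
    {'Milk', 'Beer'},
    {'Onion', 'Potato', 'Milk'},
    {'Onion', 'Potato', 'Burger', 'Beer'},
    {'Onion', 'Potato', 'Burger', 'Milk'},
]


def count(itemset):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)


def rules_from_itemset(itemset, min_confidence):
    """Yield (antecedent, consequent, confidence) for each qualifying split."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            confidence = count(itemset) / count(antecedent)
            if confidence >= min_confidence:
                yield antecedent, itemset - antecedent, confidence


pair_rules = list(rules_from_itemset({'Onion', 'Potato'}, min_confidence=0.8))
triple_rules = list(rules_from_itemset({'Onion', 'Potato', 'Burger'}, min_confidence=0.8))

for antecedent, consequent, confidence in pair_rules + triple_rules:
    print(sorted(antecedent), '->', sorted(consequent), round(confidence, 3))
```

Note that a rule and its reverse generally have different confidence values, because each divides by the count of its own antecedent.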

In [None]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

In [None]:
rules

In [None]:
rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]


Subsetting on lift and confidence returns the rules whose items are relatively strongly associated in this data.

We can see that:

If Onion or Burger is in a user's basket, it is highly likely that the user will buy Potato as well.

If Burger and Onion are both in a user's basket, it is highly likely that the user will also buy Potato.

**Some notes on Lift, Conviction & Leverage:**

**Lift(X→Y) :** the likelihood of Y being bought when X is present, taking into account the popularity of Y as well.

When Lift = 1, X has no impact on Y (the two are independent).

When Lift > 1, X and Y are positively associated: Y is more likely to be bought when X is present. When Lift < 1, Y is less likely to be bought when X is present.

**Conviction(X→Y):** a measure of implication. As with lift, the value is 1 if the items are independent.
A high conviction value means that the consequent depends strongly on the antecedent. In the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), and the conviction score is defined as 'inf'.

**Leverage(X→Y):** the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent. A leverage value of 0 indicates independence.
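
The notes above can be checked with a few lines of arithmetic. This is a minimal sketch that takes three support values as inputs; the function name `lift_leverage_conviction` is made up for illustration, and the 'inf' convention for perfect confidence follows the description above.

```python
import math


def lift_leverage_conviction(sup_x, sup_y, sup_xy):
    """Compute lift, leverage and conviction of X -> Y from three supports."""
    confidence = sup_xy / sup_x
    lift = confidence / sup_y
    leverage = sup_xy - sup_x * sup_y
    # With perfect confidence the denominator 1 - confidence is 0,
    # so conviction is defined as infinity.
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - sup_y) / (1 - confidence)
    return lift, leverage, conviction


# Independent items: the joint support equals the product of the supports.
independent = lift_leverage_conviction(sup_x=0.5, sup_y=0.4, sup_xy=0.2)

# Onion -> Potato in the toy data: every Onion basket also contains Potato.
perfect = lift_leverage_conviction(sup_x=4 / 6, sup_y=5 / 6, sup_xy=4 / 6)

print('independent:', independent)
print('perfect    :', perfect)
```

The independent case lands on the neutral values of all three metrics (lift 1, leverage 0, conviction 1), while the perfect-confidence case shows why conviction can be infinite even when lift is modest.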

In [None]:
retail_shopping_basket = {'ID':[1,2,3,4,5,6],
                         'Basket':[['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                   ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                   ['Soda', 'Chips', 'Milk'],
                                   ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                   ['Soda', 'Coffee', 'Milk', 'Bread'],
                                   ['Beer', 'Chips']
                                  ]
                         }

In [None]:
retail = pd.DataFrame(retail_shopping_basket)
retail = retail[['ID', 'Basket']]
pd.options.display.max_colwidth=100

In [None]:
retail

**First one-hot encode the basket, but how?**

In [None]:
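# Join each basket list into a comma-separated string, then expand it into 0/1 indicator columns.
retail = retail.drop(columns='Basket').join(retail.Basket.str.join(',').str.get_dummies(','))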
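
The one-liner packs three steps together. The sketch below unpacks them on a hypothetical three-basket frame (the `demo` data is made up for illustration) so that each intermediate result can be inspected.

```python
import pandas as pd

# Hypothetical three-basket frame, only to show the intermediate steps.
demo = pd.DataFrame({'ID': [1, 2, 3],
                     'Basket': [['Beer', 'Chips'], ['Milk'], ['Beer', 'Milk']]})

# Step 1: turn each list into one comma-separated string.
joined = demo['Basket'].str.join(',')

# Step 2: split on the comma and build one 0/1 column per distinct item.
dummies = joined.str.get_dummies(',')

# Step 3: replace the original list column with the indicator columns.
encoded = demo.drop(columns='Basket').join(dummies)

print(joined.tolist())
print(encoded)
```

This approach assumes that no item name contains the separator itself; if item names could contain commas, a different separator would be needed.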

In [None]:
retail

In [None]:
frequent_itemsets_2 = apriori(retail.drop(columns='ID'), use_colnames=True)

In [None]:
frequent_itemsets_2

Judging by support alone, [Beer, Chips] and [Beer, Diaper] are the two frequent baskets of interest.

But which one is more correlated than the other?

In [None]:
association_rules(frequent_itemsets_2, metric='lift')

In [None]:
association_rules(frequent_itemsets_2)

**What can you discover from the two rules?**

Clearly, {Diaper, Beer} is the most associated itemset in this data!
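
The comparison can also be verified by hand. The sketch below recomputes the lift of both candidate pairs directly from the six retail baskets, using plain Python sets; the helper names `support` and `lift` are made up for illustration.

```python
# The six retail baskets, one set of items per transaction.
baskets = [
    {'Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'},
    {'Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'},
    {'Soda', 'Chips', 'Milk'},
    {'Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'},
    {'Soda', 'Coffee', 'Milk', 'Bread'},
    {'Beer', 'Chips'},
]


def support(*items):
    """Fraction of baskets that contain all of the given items."""
    return sum(set(items) <= basket for basket in baskets) / len(baskets)


def lift(x, y):
    """Lift of x -> y; symmetric, so it equals the lift of y -> x."""
    return support(x, y) / (support(x) * support(y))


lift_chips = lift('Beer', 'Chips')
lift_diaper = lift('Beer', 'Diaper')

print('Beer & Chips :', round(lift_chips, 3))
print('Beer & Diaper:', round(lift_diaper, 3))
```

Both pairs have the same support, but Diaper is less common than Chips overall, which is why the Beer and Diaper pair ends up with the higher lift.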

In [None]:
movies = pd.read_csv('../input/movies.csv')

In [None]:
movies.head(10)

In [None]:
movies_ohe = movies.drop(columns='genres').join(movies.genres.str.get_dummies())
pd.options.display.max_columns=100

In [None]:
movies_ohe.head()

In [None]:
# Count the 0s and 1s in every genre column.
stat1 = movies_ohe.drop(columns=['title', 'movieId']).apply(pd.Series.value_counts)
stat1.head()

In [None]:
stat1 = stat1.transpose().drop(columns=0).sort_values(by=1, 
                                                      ascending=False).rename(columns={1:'No. of movies'})

In [None]:
stat1.head()

In [None]:
stat2 = movies.join(movies.genres.str.split('|').reset_index().genres.str.len(), rsuffix='r').rename(
    columns={'genresr':'genre_count'})
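
The join above only serves to count how many genres each movie has. A more direct way to get the same column is sketched below, assuming `genres` is a pipe-delimited string as the split in the cell above implies. The three-row `demo` frame is a hypothetical stand-in for movies.csv, whose real titles and genres will differ.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv, for illustration only.
demo = pd.DataFrame({'movieId': [1, 2, 3],
                     'title': ['A', 'B', 'C'],
                     'genres': ['Adventure|Animation|Children', 'Comedy', 'Drama|Romance']})

# Split each genres string on '|' and count the resulting pieces.
demo['genre_count'] = demo['genres'].str.split('|').str.len()

print(demo)
```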

In [None]:
stat2.head(10)

In [None]:
stat2 = stat2[stat2['genre_count']==1].drop(columns='movieId').groupby('genres').sum().sort_values(
    by='genre_count', ascending=False)

In [None]:
stat2.head(10)

In [None]:
stat2.shape

In [None]:
stat = stat1.merge(stat2, how='left', left_index=True, right_index=True).fillna(0)

In [None]:
stat.genre_count=stat.genre_count.astype(int)
stat.rename(columns={'genre_count': 'No. of movies with only 1 genre'},inplace=True)

In [None]:
stat.head()

In [None]:
movies_ohe.set_index(['movieId','title'],inplace=True)

In [None]:
movies_ohe.head()

In [None]:
frequent_itemsets_movies = apriori(movies_ohe,use_colnames=True, min_support=0.025)

In [None]:
frequent_itemsets_movies

In [None]:
rules_movies =  association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1.25)

In [None]:
rules_movies