## Association Rule Mining



We will define a few metrics that make it easier to decide whether a rule reflects a real association or is just coincidence. The worked examples use the following six transactions:


1. {bread, yoghurt}
2. {milk, bread, carrots}
3. {bread, carrots}
4. {bread, milk}
5. {milk, dates, carrots}
6. {milk, dates, yoghurt, bread}

### 1. Confidence

Confidence is the number of transactions that contain both X and Y divided by the number of transactions that contain X. In other words, confidence tells us the proportion of transactions containing item X that also contain item Y. Writing $T(\cdot)$ for the number of transactions that contain the given items:

$$c(X \to Y) = \frac{T(X \text{ AND } Y)}{T(X)} $$

$$c(\text{bread} \to \text{milk}) = \frac{3}{5} = 60\% $$

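Bread appears in five transactions (1, 2, 3, 4 and 6), and three of those (2, 4 and 6) also contain milk. According to this analysis, you can be 60% confident that you will find milk when you see bread in the basket.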
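
The calculation above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `count` and `confidence` are made up for illustration and are not part of any library.

```python
# The six example transactions from the list above, as sets of items.
transactions = [
    {"bread", "yoghurt"},
    {"milk", "bread", "carrots"},
    {"bread", "carrots"},
    {"bread", "milk"},
    {"milk", "dates", "carrots"},
    {"milk", "dates", "yoghurt", "bread"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def confidence(x, y, transactions):
    """c(X -> Y) = T(X AND Y) / T(X)."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


print(confidence({"bread"}, {"milk"}, transactions))  # 0.6
```

Note that confidence is not symmetric: for milk → bread the same helper gives 3/4 = 75%, because milk appears in only four transactions.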

### 2. Support

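The support of an itemset or rule measures how frequently it occurs in the data: the number of transactions that contain every item in the rule, divided by the total number of transactions $N$.

$$S(X \to Y) = \frac{T(X \text{ AND } Y)}{N} $$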
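
As a worked example on the six transactions listed earlier, bread and milk appear together in transactions 2, 4 and 6, so

$$S(\text{bread} \to \text{milk}) = \frac{3}{6} = 50\% $$

The same idea applies to a single item. Milk appears in transactions 2, 4, 5 and 6, so

$$S(\text{milk}) = \frac{4}{6} \approx 66.7\% $$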


### 3. Lift

$$L(X \to Y) = \frac{c(X \to Y)}{S(Y)} $$

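If the lift is greater than 1, the two items are found together more often than one would expect if they were independent. A lift of 1 means they co-occur exactly as often as expected by chance, and a lift below 1 means they co-occur less often than expected.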
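
To see how the three metrics fit together, here is a small sketch in plain Python on the six example transactions. The function names `support` and `lift` are illustrative only and are unrelated to the `mlxtend` functions used later.

```python
# The six example transactions from the start of this section.
transactions = [
    {"bread", "yoghurt"},
    {"milk", "bread", "carrots"},
    {"bread", "carrots"},
    {"bread", "milk"},
    {"milk", "dates", "carrots"},
    {"milk", "dates", "yoghurt", "bread"},
]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / n


def lift(x, y):
    """L(X -> Y) = c(X -> Y) / S(Y), with c(X -> Y) = S(X and Y) / S(X)."""
    conf = support(x | y) / support(x)
    return conf / support(y)


print(round(lift({"bread"}, {"milk"}), 2))  # 0.9
print(round(lift({"dates"}, {"milk"}), 2))  # 1.5
```

Even though bread → milk has 60% confidence, its lift is below 1 because milk is already in four of the six baskets. In this toy data, dates → milk is the stronger association, since every basket with dates also contains milk.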

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
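# mlxtend is a third-party package; if this import raises ModuleNotFoundError,
# install it first (for example with `pip install mlxtend`) and re-run the cell
from mlxtend.frequent_patterns import apriori, association_rules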

In [None]:
online_retail=pd.read_excel('Online Retail.xlsx')
online_retail.info()
online_retail.head()

In [None]:
online_retail.Country.value_counts()

## Data Cleaning

* Select UK transactions

In [None]:
online_retail[online_retail.Country.isin(['United Kingdom', 'Canada', 'RSA'])]

In [None]:
UK_retail=online_retail[online_retail.Country == 'United Kingdom']
UK_retail=UK_retail.reset_index(drop=True)

In [None]:
UK_retail.info()

* Convert Invoice number to string so that we can drop the cancellations

In [None]:
UK_retail['InvoiceNo'] = UK_retail['InvoiceNo'].astype('str')

* Drop cancellations

In [None]:
UK_retail=UK_retail[~UK_retail.InvoiceNo.str.contains('C')].reset_index(drop=True)

In [None]:
UK_retail.isnull().sum()

* Drop missing descriptions

In [None]:
UK_retail=UK_retail.dropna(subset=['Description']).reset_index(drop=True)
UK_retail.isnull().sum()

* Strip leading and trailing whitespace from the descriptions

In [None]:
UK_retail.Description=UK_retail.Description.str.strip()

* Take absolute values of the quantity

In [None]:
UK_retail['Quantity'] = UK_retail['Quantity'].abs()

In [2]:
UK_retail


In [None]:
transactions=UK_retail.groupby(['InvoiceNo', 'Description'])['Quantity']\
.sum().unstack().fillna(0)

In [None]:
transactions

In [None]:
def binary(x):
    if x > 0:
        return 1
    else:
        return 0

In [None]:
transactions.sum().sort_values(ascending=False).iloc[:10].plot(kind='barh')
plt.title('Top 10 items by quantity sold')

In [None]:
# Note: in pandas >= 2.1, DataFrame.applymap is deprecated in favour of DataFrame.map
transaction_final=transactions.applymap(binary)

In [None]:
transaction_final.sum().sort_values(ascending=False).iloc[:10].plot(kind='barh')
plt.title('Top 10 most frequently occurring items')

In [None]:
mba = apriori(transaction_final, min_support=0.02, use_colnames=True)

In [None]:
rules=association_rules(mba, metric='lift', min_threshold=1)

In [None]:
rules.sort_values(['confidence', 'lift'], ascending=[False, False]).reset_index(drop=True).iloc[:10]