# Association Rule Mining with the MLxtend Library

Association analysis using the Apriori algorithm and the MLxtend library.

- [MLxtend documentation example](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) for generating frequent itemsets with the Apriori algorithm
- [MLxtend documentation example](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) for deriving association rules


**Data Sources:**

- `data/raw/sales_total.csv`: real transaction dataset from a B2B retailer.

**Changes**

- 2019-07-07: Start notebook


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries,-load-data" data-toc-modified-id="Import-libraries,-load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries, load data</a></span></li><li><span><a href="#Prepare-basic-data-structure" data-toc-modified-id="Prepare-basic-data-structure-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Prepare basic data structure</a></span></li><li><span><a href="#Get-frequent-itemsets-with-Apriori-algorithm" data-toc-modified-id="Get-frequent-itemsets-with-Apriori-algorithm-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get frequent itemsets with Apriori algorithm</a></span></li></ul></div>

---

## Import libraries, load data

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

# Specials
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder

# My functions
import EDA_functions as EDA
import cleaning_functions as cleaning

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns 
# sns.set_style('whitegrid')
color = 'rebeccapurple'
%matplotlib inline

# Display settings
from IPython.display import display
pd.options.display.max_columns = 100

In [2]:
# Load data
transactions_raw = pd.read_csv('data/raw/sales_total.csv', parse_dates=['Fakturadatum'])

## Prepare basic data structure

**IMPORTANT:** MLxtend requires that all items of a transaction sit in a single row and that the items are one-hot encoded (one boolean column per item).
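To make the target structure concrete, here is a minimal pure-Python sketch on made-up data. The tuples in `line_items` and all variable names are hypothetical and only stand in for the `Kunde`, `Fakturadatum` and `Artikel` columns. The notebook itself does this step with pandas and the MLxtend `TransactionEncoder`.

```python
from collections import defaultdict

# Hypothetical line items: (customer, invoice_date, article_id)
line_items = [
    (1001, "2018-01-05", "A"),
    (1001, "2018-01-05", "B"),
    (1002, "2018-01-05", "B"),
    (1001, "2018-02-10", "C"),
    (1001, "2018-02-10", "A"),
]

# Step 1: collect all articles of one (customer, date) pair into one basket
baskets = defaultdict(set)
for customer, date, article in line_items:
    baskets[(customer, date)].add(article)

# Step 2: one-hot encode, one row per basket, one boolean column per article
columns = sorted({article for _, _, article in line_items})
one_hot_rows = [
    [article in baskets[key] for article in columns]
    for key in sorted(baskets)
]

for key, row in zip(sorted(baskets), one_hot_rows):
    print(key, dict(zip(columns, row)))
```

Each printed row corresponds to one transaction, which is the shape `apriori` expects as input.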

In [3]:
transactions_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2835054 entries, 0 to 2835053
Data columns (total 12 columns):
Kunde           int64
Fakturadatum    datetime64[ns]
Faktura         int64
Pos             int64
Artikel         object
Unit Price      float64
pro             int64
ME              object
Menge           float64
ME.1            object
Nettowert       float64
Währg           object
dtypes: datetime64[ns](1), float64(3), int64(4), object(4)
memory usage: 259.6+ MB


In [4]:
transactions_raw.sample(2)

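| | Kunde | Fakturadatum | Faktura | Pos | Artikel | Unit Price | pro | ME | Menge | ME.1 | Nettowert | Währg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 257849 | 8052723 | 2017-03-15 | 91589089 | 50 | 5315816 | 11.97 | 1 | ST | 1.0 | ST | 11.97 | CHF |
| 638342 | 8671117 | 2017-06-22 | 91695054 | 9 | 7199113 | 1.22 | 1 | ST | 10.0 | ST | 12.2 | CHF |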


In [5]:
"""Clean data"""

# Look at 2018 data only
transactions_18_full = transactions_raw.loc[transactions_raw['Fakturadatum'].dt.year == 2018]
transactions_18_full = transactions_18_full[['Kunde', 'Fakturadatum', 'Artikel']]
# Drop all articles whose code contains non-numeric characters
print("Unique artikel before cleaning:", transactions_18_full['Artikel'].nunique())
transactions_18_full['Artikel'] = pd.to_numeric(transactions_18_full['Artikel'], errors='coerce')
transactions_18 = transactions_18_full.dropna(how='any')
print("Unique artikel after cleaning:", transactions_18['Artikel'].nunique())
# Exclude special customers (keep only Kunde IDs above 700000)
transactions_18 = transactions_18.loc[transactions_18['Kunde'] > 700000]

transactions_18_grouped = pd.DataFrame(transactions_18.groupby(
        ['Kunde', 'Fakturadatum'])['Artikel'].unique())
transactions = transactions_18_grouped.reset_index(drop=True)

Unique artikel before cleaning: 74126
Unique artikel after cleaning: 49709


**Note:** By keeping only numeric article IDs, I lose about a third of the unique articles (from 74126 down to 49709). Let's proceed anyway.
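As a quick check of the "about a third" estimate, the share can be computed from the two counts printed by the cleaning cell. The variable names are illustrative only.

```python
# Counts copied from the output of the cleaning cell
unique_before = 74126
unique_after = 49709

lost = unique_before - unique_after
lost_share = lost / unique_before

print(f"Lost {lost} of {unique_before} unique artikel ({lost_share:.1%})")
```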

In [6]:
# Check results
print("Number of transactions:", len(transactions))
transactions.head()

Number of transactions: 204330


| | Artikel |
|---|---|
| 0 | [9900179.0, 4100130.0, 5308748.0, 5074021.0, 5... |
| 1 | [2960232.0, 5308591.0, 6436018.0, 66933.0, 629... |
| 2 | [4921011.0, 2947803.0, 2550805.0, 2458003.0, 6... |
| 3 | [7947701.0, 6634351.0, 7906101.0, 8621604.0, 6... |
| 4 | [2119635.0, 2119634.0, 2310306.0, 6103861.0, 8... |


In [8]:
"""OHE to sparse format with MLxtnd TransactionEncoder"""

te = TransactionEncoder()
products_array = np.array(transactions['Artikel'])
products_array_ohe = te.fit(products_array).transform(products_array, sparse=True)
transactions_sparse = pd.SparseDataFrame(products_array_ohe, 
                                        columns=te.columns_, 
                                        default_fill_value=False)

assert transactions_sparse.iloc[1,].sum() == len(transactions.iloc[1,0])

In [None]:
# # Check results
# transactions_sparse.head()

## Get frequent itemsets with Apriori algorithm

In [None]:
frequent_itemsets = apriori(transactions_sparse, min_support=0.05, use_colnames=True, verbose=1)
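To illustrate what `min_support` controls, here is a small pure-Python sketch of the level-wise Apriori idea on toy data. It is not MLxtend's implementation. The function name `apriori_sketch` and the toy baskets are hypothetical. The support of an itemset is the fraction of transactions that contain all of its items, and only itemsets at or above the threshold are kept and extended.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


toy = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["B"]]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy example the pair `B, C` appears in only one of four baskets, so it falls below the threshold of 0.5 and the triple `A, B, C` is pruned without being counted.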