# Module 9: Repetition

In an effort to modernize our sales process, management wants to target advertisements at customers more effectively. Market research has concluded that this kind of advertisement works best when the customer is in the process of ordering new products.

Our task is to build a system that predicts when a customer will order the next batch.

For this we need to:
- Analyze the historic data to identify features that can be used for prediction
- Preprocess the data for the following steps
- Build our system or model for prediction
- Evaluate the feasibility of the idea

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.colors import LinearSegmentedColormap, ListedColormap

try:
    # Try to use the BI style sheet for plots
    plt.style.use('matplotlibrc')
    plt.rcParams['axes.prop_cycle'] = plt.cycler(color=[(136/256, 76/256, 255/256), (60/256, 170/256, 207/256), (12/256, 229/256, 177/256)]) 
    
    colors = [(0.53125, 0.296875, 0.99609375), (0.453125, 0.3984375, 0.9453125), (0.375, 0.4921875, 0.89453125), (0.3046875, 0.578125, 0.8515625), (0.234375, 0.6640625, 0.80859375), (0.16015625, 0.75390625, 0.76171875), (0.09375, 0.8359375, 0.72265625), (0.046875, 0.89453125, 0.69140625), (0.0, 0.875, 0.6640625)]
    bicmap = LinearSegmentedColormap.from_list(name='BIcmp', 
                                                colors=colors,
                                                N=len(colors))
    cm_bright = ListedColormap([(0.53125, 0.296875, 0.99609375), (12/256, 229/256, 177/256)])
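except Exception:
    # Fall back to built-in colors if the style sheet is unavailable
    bicmap = plt.cm.BuGn
    colors = ['r', 'g', 'b']
    cm_bright = ListedColormap(['r', 'g'])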

## **Exercise 9.1: Exploratory Data Analysis**

We have the following two datasets available:
- `Customers`: Contains the data about the contact person that makes the orders.
- `Order History`: Contains the orders placed by each person for the past 1000 days.

**Familiarize yourself with the data**

In [None]:
# Load the dataset
customers = pd.read_csv(
    'customers.csv',
    index_col='customer_id',
    parse_dates=['birth_date', 'customer_since'])
customers.head()

In [None]:
order_history = pd.read_csv(
    'order_history.csv',
    index_col=[0, 1],
    parse_dates=True)
order_history.index.names = ['target', 'customer_id']
order_history

In [None]:
# Define which variables you want to use as features
features = ['country', 'customer_since', 'client_group']

In [None]:
# The target variable is the number of days between consecutive orders of each customer
target = order_history.reset_index(level=0).groupby(level=0).diff().dropna()['target'].dt.days
target
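
To make the one-liner above easier to follow, here is a minimal sketch of the same idea on a toy order history. The data and the names `toy_orders` and `toy_gaps` are made up for illustration. The explicit `sort_values` call is an addition that assumes the orders might not be stored chronologically.

```python
import pandas as pd

# Toy order history (hypothetical values): one row per order, indexed by
# order date and customer_id, mirroring the layout of order_history.
toy_orders = pd.DataFrame(
    {'volume': [10, 12, 7, 9, 11]},
    index=pd.MultiIndex.from_arrays(
        [pd.to_datetime(['2021-01-01', '2021-01-08', '2021-01-03',
                         '2021-01-13', '2021-01-20']),
         [1, 1, 2, 2, 1]],
        names=['target', 'customer_id']))

# Move the order date into a column, sort chronologically, then take the
# difference between consecutive order dates within each customer.
toy_gaps = (toy_orders
            .reset_index(level=0)
            .sort_values('target')
            .groupby(level=0)['target']
            .diff()
            .dropna()       # the first order of each customer has no gap
            .dt.days)
print(toy_gaps)
```

Each customer loses one row, because the first order has no predecessor to compare against.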

In [None]:
# Our features will be
x = customers.loc[target.index, features]
x

## **Exercise 9.2: Preprocessing**

In [None]:
# Preprocess the data
# Use sklearn's pipeline to make working with train/test splits easier
# You can use make_column_transformer and make_column_selector for this
from sklearn.compose import make_column_selector, make_column_transformer
# You can use the Standard scaler for numeric data (dtype_include=np.number)
# You can use the OneHotEncoder for categorical data (dtype_include=object)
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(categories=[customers[s].unique() for s in ['country', 'client_group']]), make_column_selector(dtype_include=object))
)

In [None]:
# Test the pipeline
ct.fit_transform(x).toarray()
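
One thing worth checking at this point: the dtype-based selectors only pick up numeric and object columns, so a datetime column such as `customer_since` appears to match neither and would be dropped by the transformer's default `remainder='drop'`. The sketch below illustrates this on toy data and shows one possible workaround, converting the date into a tenure in days. The toy values, the `reference_date`, and all `toy_*` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature frame (hypothetical values) with the same columns as `features`
toy_x = pd.DataFrame({
    'country': pd.Series(['DE', 'FR', 'DE', 'US'], dtype=object),
    'customer_since': pd.to_datetime(
        ['2015-01-01', '2018-06-01', '2020-03-15', '2012-09-30']),
    'client_group': pd.Series(['A', 'B', 'A', 'C'], dtype=object),
})

num_sel = make_column_selector(dtype_include=np.number)
cat_sel = make_column_selector(dtype_include=object)
print(num_sel(toy_x))  # columns routed to the scaler
print(cat_sel(toy_x))  # columns routed to the one-hot encoder

# Possible workaround: express the date as a number of days before a
# reference date (hypothetical cut-off) so the numeric selector finds it.
reference_date = pd.Timestamp('2022-01-01')
toy_x_num = toy_x.assign(
    customer_since=(reference_date - toy_x['customer_since']).dt.days)

toy_ct = make_column_transformer(
    (StandardScaler(), num_sel),
    (OneHotEncoder(handle_unknown='ignore'), cat_sel))
toy_features = toy_ct.fit_transform(toy_x_num)
print(toy_features.shape)
```

Whether tenure in days is a useful feature for this task is an open question; the sketch only shows how to keep the column in the pipeline.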

## **Exercise 9.3: Modeling**

**First Benchmark**

In [None]:
# We want to measure our performance against a really simple benchmark
# Always predict the average time between orders
from sklearn.dummy import DummyRegressor

bench1 = DummyRegressor(strategy='constant', constant=target.mean())
bench1.fit(x, target)

**Second Benchmark**

In [None]:
# This more advanced benchmark takes the average time between orders of each customer
bench2 = target.groupby(level=0).mean().loc[customers.loc[target.index].index]

### **Exercise 9.3.1: Simple solution**

In [None]:
# We can start with a simple model first
from sklearn.linear_model import LinearRegression

In [None]:
# Create a pipeline with your preprocessing pipeline and the LinearRegression() model
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(ct, LinearRegression())

In [None]:
# Evaluate the pipeline against the benchmarks
from sklearn.metrics import mean_absolute_percentage_error, mean_absolute_error, make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

print(f'[ML model]: {np.mean(cross_val_score(pipe, x, target, scoring=make_scorer(mean_squared_error)))}')

print(f'[Bench 1]: {mean_squared_error(target, bench1.predict(x))}')

print(f'[Bench 2]: {mean_squared_error(target, bench2)}')

### **Exercise 9.3.2: Optimization**

**Use some form of optimization to find a model that performs better than benchmark 1**  
Options:
- Manual search
- Grid search
- Random search
- Bayesian optimization

Tip: You can treat the model as a hyperparameter and search for the best model and the best hyperparameter combination at the same time.

Tip: It might suffice to use a more complex model instead of a linear one.
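
The first tip can be sketched with a grid search in which the pipeline step holding the model is itself part of the parameter grid. This is one possible setup, not the reference solution: the synthetic data, the candidate models, and the names `search_pipe`, `param_grid` and `search` are assumptions. Note that the search uses `scoring='neg_mean_squared_error'`, because `GridSearchCV` maximizes the score and a plain MSE scorer would therefore select the worst model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical) with a non-linear target
rng = np.random.default_rng(0)
toy_X = rng.uniform(-3, 3, size=(200, 2))
toy_y = np.sin(toy_X[:, 0]) * 10 + toy_X[:, 1] ** 2 + rng.normal(0, 0.5, 200)

search_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LinearRegression()),  # placeholder, replaced by the grid
])

# Each dict is one sub-grid: the 'model' entry swaps the estimator,
# the 'model__*' entries tune that estimator's hyperparameters.
param_grid = [
    {'model': [LinearRegression()]},
    {'model': [Ridge()], 'model__alpha': [0.1, 1.0, 10.0]},
    {'model': [RandomForestRegressor(random_state=0)],
     'model__n_estimators': [50, 100],
     'model__max_depth': [3, None]},
]

search = GridSearchCV(search_pipe, param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(f'best cross-validated MSE: {-search.best_score_}')
```

In the exercise, the `'scale'` step would be replaced by the column transformer `ct`, and `toy_X`, `toy_y` by `x` and `target`.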


In [None]:
# TODO

In [None]:
# Evaluate the pipeline against the benchmarks

print(f'[ML model]: {np.mean(cross_val_score(pipe, x, target, scoring=make_scorer(mean_squared_error)))}')

print(f'[Bench 1]: {mean_squared_error(target, bench1.predict(x))}')

print(f'[Bench 2]: {mean_squared_error(target, bench2)}')

### **Exercise 9.3.3: Outlier detection and Segmentation**

In [None]:
# We can aggregate the order_history data to visualize the behavior of each of our customers
avg_order_freq = target.groupby(level=0).mean()
avg_order_size = order_history.groupby(level=-1).mean()

agg_customers = avg_order_freq.to_frame().join(avg_order_size)
agg_customers

In [None]:
agg_customers.plot.scatter(x='target', y='volume')

In this image we can see two things:
- There seem to be clusters of customers that behave similarly (segment the data and build one model per segment)
- There are a lot of outliers that can reduce our prediction accuracy (remove the outliers to improve the learned function)

Depending on the algorithm you choose you can segment the data and do outlier removal at the same time or do it separately.

Then train one model per segment. Use some form of hyperparameter optimization to automatically find the best model for each segment.
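
As an illustration of doing both steps at once, a density-based clustering algorithm such as DBSCAN assigns every point either to a cluster or to the noise label `-1`, which can be read as an outlier flag. The sketch below uses synthetic points in place of `agg_customers`; the data and the `eps` and `min_samples` values are assumptions that would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for agg_customers (hypothetical values):
# two dense groups of customers plus a few scattered outliers.
# Columns: average days between orders, average order volume.
group_a = rng.normal(loc=[10, 100], scale=[1, 5], size=(50, 2))
group_b = rng.normal(loc=[30, 300], scale=[1, 5], size=(50, 2))
outliers = np.array([[60.0, 900.0], [90.0, 50.0], [-40.0, 1500.0]])
toy_points = np.vstack([group_a, group_b, outliers])

# Scale first so that both axes contribute equally to the distance
scaled = StandardScaler().fit_transform(toy_points)

# DBSCAN labels dense regions 0, 1, ... and marks isolated points as -1
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)

for segment in sorted(set(labels) - {-1}):
    mask = labels == segment
    print(f'segment {segment}: {mask.sum()} customers')
print(f'outliers: {(labels == -1).sum()}')
```

The boolean mask of each segment can then be used to select the matching customers and fit a separate model on them, leaving out the rows labeled `-1`.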

In [None]:
# TODO

## **Exercise 9.4: Evaluation**

Based on the work above, evaluate your final model on the test dataset.

You can use the complete training dataset for training now.

**What is the final evaluation performance you achieve on the test set?**

In [None]:
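# Load the test data
# NOTE: these paths are identical to the training files loaded above;
# point them at the held-out test files if those are stored elsewhere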
test_customers = pd.read_csv(
    'customers.csv',
    index_col='customer_id',
    parse_dates=['birth_date', 'customer_since'])
test_order_history = pd.read_csv(
    'order_history.csv',
    index_col=[0, 1],
    parse_dates=True)
test_order_history.index.names = ['target', 'customer_id']
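
The final evaluation could follow the pattern sketched below: fit the chosen model once on all training data, then score it once on the held-out test data, next to a simple baseline. The sketch runs on synthetic data. The helper `make_toy_data`, the random forest, and all `toy_*` names are assumptions standing in for your own preprocessed features and final model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)


def make_toy_data(n_rows):
    """Hypothetical stand-in for the preprocessed features and target."""
    features = rng.uniform(0, 10, size=(n_rows, 2))
    days = 5 + 3 * features[:, 0] + rng.normal(0, 1, n_rows)
    return features, days


toy_train_X, toy_train_y = make_toy_data(300)
toy_test_X, toy_test_y = make_toy_data(100)

# Fit on the complete training set; the test set is only used for scoring
final_model = RandomForestRegressor(n_estimators=100, random_state=0)
final_model.fit(toy_train_X, toy_train_y)

# Baseline fitted on the training set as well, mirroring benchmark 1
baseline = DummyRegressor(strategy='mean').fit(toy_train_X, toy_train_y)

model_mse = mean_squared_error(toy_test_y, final_model.predict(toy_test_X))
baseline_mse = mean_squared_error(toy_test_y, baseline.predict(toy_test_X))
print(f'[ML model]: {model_mse}')
print(f'[Bench 1]: {baseline_mse}')
```

Fitting the baseline on the training data only keeps the comparison fair, since neither predictor has seen the test targets.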

In [None]:
# TODO