# Predictions for Customer Lifetime Value 

## Problem overview:

This scenario focuses on predicting CLV (Customer Lifetime Value), where the goal is to estimate the future value a customer brings to the business. For any marketing department, CLV is a key metric that helps quantify the revenue attached to the future relationship with a customer. In general, CLV is used to optimize marketing spend while maximizing returns. While there are many ways to model CLV, in this scenario we will use a metric based on future customer spend in a given time range (e.g. a quarter). Predicting future spend relies on the known transaction history of customers. One of the most widely used models for capturing customer behavior is RFM (Recency, Frequency, Monetary). RFM features are derived from the customers' transaction history and then used as inputs to the machine learning models that predict future spend.

This scenario details the development of a machine learning model that predicts future customer spend. The model is trained on a public dataset containing transactions from an online retailer. The dataset contains the history of purchased items over 12 calendar months.

## Solution overview:

We will use the [Automated Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train) (autoML) capabilities of the [Azure Machine Learning service](https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml) to quickly train a model that can predict future customer spend. We will model our problem as a **Regression** problem where the goal of the trained model is to predict a numerical value (in our case, the future customer spend). The autoML capabilities enable us to evaluate different algorithms and hyperparameters to get the best trained model for the problem with minimum effort. The approach used in this example can be extended to other use cases that require predicting numerical values, whether related to customer behavior or not.


This notebook is organized into the following sections:

1. Basic setup

2. Data prep

3. Model training

4. Explore the results and evaluate the best model

5. Model Explainability: which features matter for the predictions?

## Section 1. Basic setup

Before starting this step, you need to create an Azure Machine Learning service workspace ([instructions](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace)).

Let's get started by creating an experiment in your Azure Machine Learning workspace. An experiment is a named object in a workspace, which is used to do model training.

In [None]:
import azureml.core
import pandas as pd
import numpy as np
import logging
import warnings
# Squash warning messages for cleaner output in the notebook
warnings.showwarning = lambda *args, **kwargs: None


from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
subscription_id = "<subscription id goes here>"
resource_group = "<resource group goes here>"
workspace_region = "<workspace region goes here>"
workspace_name = "<workspace name goes here>"

In [None]:
ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)

In [None]:
# choose a name for the run history container in the workspace
experiment_name = 'CLVFutureSpend'

# project folder
project_folder = './sample_projects/automl-clvfuturespendprediction'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
# None removes the column width limit (pandas 1.0+; older releases used -1)
pd.set_option('display.max_colwidth', None)
pd.DataFrame(data = output, index = ['']).T

## Section 2. Data prep

OnlineRetail.csv contains the customer purchase history between December 1st, 2010 and December 9th, 2011.

In [None]:
data = pd.read_csv("OnlineRetail.csv", parse_dates=['timeStamp'])

### 2.1 Inspect data

Display the first few rows of the data and view some plots to help you understand the dynamics within the dataset.

In [None]:
data.head(20)

In [None]:
# Plot to be added here

In [None]:
# Plot to be added here
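Until the plots are filled in, a monthly aggregate is one simple view of the dynamics in the data. The sketch below is an illustration on toy rows, not the notebook's actual plot: the column names `customerId`, `quantity`, and `unitPrice` are assumptions (only `timeStamp` appears in the notebook).

```python
import pandas as pd

# Toy rows; column names other than timeStamp are hypothetical
sample = pd.DataFrame({
    "customerId": ["A", "B", "A", "C", "B"],
    "timeStamp": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-10", "2011-02-11", "2011-03-15"]),
    "quantity": [2, 5, 1, 4, 1],
    "unitPrice": [10.0, 2.0, 20.0, 2.5, 8.0],
})
sample["amount"] = sample["quantity"] * sample["unitPrice"]

# Revenue and number of distinct buyers per calendar month
monthly = (sample
           .groupby(sample["timeStamp"].dt.to_period("M"))
           .agg(revenue=("amount", "sum"),
                activeCustomers=("customerId", "nunique")))
print(monthly)
```

A table like `monthly` can then be charted, for example with `monthly.plot(y="revenue", kind="bar")` when matplotlib is available.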

### 2.2 Engineer new features - Recency, Frequency, and Monetary

In [None]:
# RFM calculation to be added here
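One way the RFM features could be derived is sketched below. It is an illustration only: the column names `customerId`, `quantity`, and `unitPrice`, the helper `compute_rfm`, and the toy rows are assumptions (only `timeStamp` appears in the notebook). The definitions used here (recency as days since the last purchase, frequency as the number of distinct purchase days, monetary as total spend) are one common choice among several.

```python
import pandas as pd

# Toy transactions; column names other than timeStamp are hypothetical
transactions = pd.DataFrame({
    "customerId": ["A", "A", "A", "B", "B", "C"],
    "timeStamp": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-03-01",
        "2011-01-20", "2011-01-20", "2011-03-15",
    ]),
    "quantity": [2, 1, 3, 5, 1, 4],
    "unitPrice": [10.0, 20.0, 5.0, 2.0, 8.0, 2.5],
})

def compute_rfm(df, snapshot_date):
    """Return one row per customer with recency, frequency and monetary columns."""
    df = df.assign(
        amount=df["quantity"] * df["unitPrice"],
        purchaseDay=df["timeStamp"].dt.normalize(),
    )
    grouped = df.groupby("customerId")
    return pd.DataFrame({
        # Days between the snapshot date and the customer's last purchase
        "recency": (snapshot_date - grouped["timeStamp"].max()).dt.days,
        # Number of distinct days with at least one purchase
        "frequency": grouped["purchaseDay"].nunique(),
        # Total amount spent in the observation window
        "monetary": grouped["amount"].sum(),
    })

rfm = compute_rfm(transactions, snapshot_date=pd.Timestamp("2011-04-01"))
print(rfm)
```

The snapshot date marks the end of the observation window; spend after that date would serve as the prediction target rather than as a feature.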

### 2.3 Split the data into train and test sets

In [None]:
# Train/Test split to be added here
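A minimal sketch of a customer-level split follows. It assumes a per-customer table with RFM features and a hypothetical `futureSpend` target column; the values, the helper `split_customers`, and the 80/20 ratio are made up for illustration. `sklearn.model_selection.train_test_split` is the usual alternative to the hand-written shuffle.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer table: RFM features from the observation window
# and the spend observed in the following quarter as the target
customers = pd.DataFrame({
    "recency": [31, 71, 17, 5, 90, 44, 12, 60, 3, 25],
    "frequency": [3, 1, 1, 6, 1, 2, 4, 1, 8, 2],
    "monetary": [55.0, 18.0, 10.0, 240.0, 12.0, 60.0, 130.0, 20.0, 410.0, 75.0],
    "futureSpend": [40.0, 0.0, 15.0, 180.0, 0.0, 35.0, 90.0, 0.0, 300.0, 50.0],
})

def split_customers(df, target, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    n_test = max(1, int(round(len(df) * test_fraction)))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    features = df.drop(columns=[target])
    return (features.iloc[train_idx], features.iloc[test_idx],
            df[target].iloc[train_idx], df[target].iloc[test_idx])

X_train, X_test, y_train, y_test = split_customers(customers, target="futureSpend")
print(X_train.shape, X_test.shape)
```

Splitting by customer rather than by transaction keeps all of a customer's history on one side of the split, which avoids leaking information from the test set into training.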

## Section 3. Model training

In this section you will configure the [automated ML](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml) feature of the Azure Machine Learning service. Using the configuration you will then run an automated ML experiment which explores various algorithms and hyperparameter values to generate machine learning models. Finally, the best model will be selected. The training jobs run on local compute resources (provided and managed by Azure Notebooks).

The main advantage of automated ML is that it accelerates a data scientist's work by taking over a significant portion of the exploration. In addition, automated ML exposes rich data from the experiment runs, which provides control, transparency, and, most importantly, visibility into what is happening behind the scenes.

The configuration data required by automated ML contains information about the experiment itself as well as the training data used to train the models. Below is an example of the most important components of the configuration data:

Property | Description
--- | ---
task | regression
primary_metric | Metric that you want to optimize.<br>Regression supports the following primary metrics:<br>spearman_correlation<br>normalized_root_mean_squared_error<br>r2_score<br>normalized_mean_absolute_error
iterations | Number of iterations. In each individual iteration, automated ML trains one pipeline (algorithm and hyperparameters) on the given data.
iteration_timeout_minutes | The maximum number of minutes for each individual iteration
X | Training data in the form [n_samples, n_features]
y | Target values in the form [n_samples]
n_cross_validations | Number of [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) splits.
path | Relative path to the project folder. Automated ML stores configuration files for the experiment under this folder. You can specify a new empty folder.


### 3.1 Automated ML configuration

Automated ML provides several options to configure the experiment runs, giving you flexibility and control. For example, the primary_metric setting specifies which metric automated ML should use to optimize the machine learning model being built. There are multiple primary metrics available, and in our problem we will use NRMSE (Normalized Root Mean Squared Error).
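To make the metric concrete, here is a small worked sketch of NRMSE. It assumes the RMSE is normalized by the range (max minus min) of the true values, which is a common convention; treat that choice, the helper name `normalized_rmse`, and the sample numbers as assumptions rather than a statement of how automated ML computes it internally.

```python
import math

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the true values."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("range of y_true is zero; NRMSE is undefined")
    return rmse / value_range

# Illustrative quarterly spend values (made up for this example)
actual = [100.0, 250.0, 0.0, 400.0]
predicted = [120.0, 230.0, 20.0, 380.0]
nrmse = normalized_rmse(actual, predicted)
print(round(nrmse, 4))  # prints 0.05
```

Every prediction here is off by 20, so the RMSE is 20; dividing by the range of 400 gives 0.05. Normalizing makes the error comparable across targets with different scales.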

Notice that we've set the task to regression and that we also pass the training data (X_train and y_train) directly. This is possible because training is performed locally. When training is performed remotely (e.g. on AML compute resources), you need to provide a script that contains the code to get the data instead of the data itself.

In [None]:
automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_clvfuturespend_errors.log',
                             primary_metric= 'normalized_root_mean_squared_error',
                             iterations = 5,
                             iteration_timeout_minutes = 5,
                             X = X_train,
                             y = y_train,
                             n_cross_validations = 3,
                             path=project_folder,
                             verbosity = logging.INFO)
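The `n_cross_validations = 3` setting asks for three cross-validation splits. The sketch below shows generic k-fold index generation to illustrate the idea; it is not a description of how automated ML builds its folds, and the helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for consecutive folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=8, n_splits=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample lands in exactly one validation fold, so every row is used for both training and validation across the three splits.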

### 3.2 Train your models on local compute

When you call the submit method on the experiment object and pass the AutoMLConfig object, automated ML generates a number of machine learning models equal to **iterations** (5 in our case). Depending on the input data and the number of iterations, the time required to complete the training can range from a few minutes to hours (or even more). Once execution starts, you will see status messages being printed to the console.

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

### 3.3 Monitor training

To start the run and continue executing your code while it trains, specify ```show_output=False``` when calling experiment.submit. This also enables you to use a widget to view the status of all iterations (see the cell below).

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

## Section 4. Explore the results and test the best model

### 4.1 Retrieve all child runs

Each individual model is trained in the context of a child run having **local_run** as its parent. You can get the list of all child runs and their logged metrics.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

# Sort columns by iteration number (axis must be passed by keyword in recent pandas)
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
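Once the metrics are in a table, the best iteration for the primary metric can be read off directly. The sketch below uses made-up metric values in the same shape as `metricslist` (iteration number mapped to a dict of metrics); the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical metrics keyed by iteration number, mimicking metricslist
metricslist = {
    0: {"normalized_root_mean_squared_error": 0.081, "r2_score": 0.62},
    1: {"normalized_root_mean_squared_error": 0.064, "r2_score": 0.74},
    2: {"normalized_root_mean_squared_error": 0.097, "r2_score": 0.51},
}
rundata = pd.DataFrame(metricslist).sort_index(axis=1)

# Lower is better for NRMSE, so pick the column with the smallest value
primary = "normalized_root_mean_squared_error"
best_iteration = rundata.loc[primary].idxmin()
print(best_iteration, rundata.loc[primary, best_iteration])
```

For metrics where higher is better, such as `r2_score`, `idxmax` would be the counterpart.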

### 4.2 Retrieve the best fitted model

The **get_output** method enables you to retrieve the best child run and the associated fitted model.

In [None]:
best_run, fitted_model = local_run.get_output()
fitted_model.steps

### 4.3 Test the best fitted model

Description of the process to be added.

In [2]:
# Test model code to be added
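As a placeholder for the evaluation step, the sketch below computes MAE, RMSE, and R² for a set of predictions. The arrays are made-up stand-ins; in the notebook, the predictions would presumably come from something like `fitted_model.predict(X_test)`, which is an assumption here since the test set is not yet defined. The helper `regression_report` is hypothetical, and the `sklearn.metrics` functions imported in Section 1 should give the same numbers.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R^2 compares the model's squared error with that of predicting the mean
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": float(mae), "rmse": float(rmse), "r2": float(r2)}

# Made-up values standing in for y_test and the model's predictions
y_test_example = [40.0, 0.0, 180.0, 90.0]
y_pred_example = [50.0, 10.0, 170.0, 80.0]
report = regression_report(y_test_example, y_pred_example)
print(report)
```

Reporting several metrics together is useful because MAE and RMSE are in the units of spend, while R² indicates how much of the variance in spend the model explains.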

## Section 5. Model Explainability: Which features matter for the predictions?

Description of the process to be added

In [None]:
# Model explainability code to be added
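Until the explainability code is added, permutation importance is one generic, model-agnostic way to ask which features matter. The sketch below is not the Azure Machine Learning explainability API: it uses synthetic data and a plain least-squares model as a stand-in for the fitted pipeline, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for RFM features; the target depends strongly on
# "monetary", weakly on "frequency", and not at all on "recency"
feature_names = ["recency", "frequency", "monetary"]
X = rng.normal(size=(200, 3))
y = 1.0 * X[:, 1] + 5.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Simple least-squares model standing in for the fitted AutoML pipeline
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(features):
    return features @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_importance(predict_fn, X, y, n_repeats=10, seed=0):
    """Average increase in RMSE when each column is shuffled in turn."""
    shuffle_rng = np.random.default_rng(seed)
    baseline = rmse(y, predict_fn(X))
    importances = []
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = shuffle_rng.permutation(X_perm[:, col])
            increases.append(rmse(y, predict_fn(X_perm)) - baseline)
        importances.append(float(np.mean(increases)))
    return importances

importances = permutation_importance(predict, X, y)
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {value:.3f}")
```

Shuffling a column breaks its relationship with the target, so the larger the resulting increase in error, the more the model relies on that feature.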