# Use your own dataset with `neural-lifetimes`

Here we will demonstrate how to use your dataset with this library. For this purpose we use a dataset from Kaggle

[https://www.kaggle.com/shailaja4247/customer-lifetime-value-prediction/data](https://www.kaggle.com/shailaja4247/customer-lifetime-value-prediction/data)

and use components from the `neural-lifetimes` library to predict customer value.

Please download the dataset and place it in a `data` subdirectory.

## Import dependencies
If any of the dependencies is not installed, run `!pip install <package_name>` in a Jupyter cell to install it.

In [None]:
%load_ext tensorboard

from pathlib import Path
import datetime
import numpy as np
import pandas as pd

import plotly.express as px

## Explore Dataset

We load the dataset and set the columns' data types explicitly. We then clean up the dataset as described in the comments below.

In [None]:
dtypes = {
    'InvoiceNo': str,
    'StockCode': str,
    'Description': str,
    'Quantity': int,
    'UnitPrice': float,
    'CustomerID': str,
    'Country': str}
data = pd.read_csv('data/customer_segmentation.csv', encoding='cp1252', dtype=dtypes, parse_dates=['InvoiceDate'])

# remove rows with no customerID
data = data.dropna(subset=['CustomerID'])

# filter out customer returns (i.e. transactions with non-positive quantities)
data = data[data.Quantity > 0]
# filter out orders only shipping free items
data = data[data.UnitPrice > 0]

# log transform quantities and prices for better stability in modelling
data['LogUnitPrice'] = np.log(data['UnitPrice'])
data['LogQuantity'] = np.log(data['Quantity'])
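
The filters above also make the log transform well defined, since the logarithm requires strictly positive inputs. The transform is invertible, so values can be mapped back to the original scale with `np.exp`. A minimal sketch on made-up prices (the numbers are illustrative and not taken from the dataset):

```python
import numpy as np

# Made-up unit prices spanning several orders of magnitude (illustrative only).
prices = np.array([0.5, 2.0, 10.0, 400.0])

log_prices = np.log(prices)     # compresses the range of the values
recovered = np.exp(log_prices)  # inverse transform back to the original scale

print(log_prices.round(3))
print(np.allclose(recovered, prices))
```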


Now, let us check out the first few rows of the dataset:

In [None]:
data.head()

We can see that each row refers to a single product, so multiple rows can belong to one invoice. We will regard each invoice, rather than each individual item, as one transaction. This means we need to aggregate the rows belonging to the same invoice, which requires some feature engineering. We keep this rudimentary and record only the mean, standard deviation and sum of the log quantities and log unit prices, as well as the number of products per invoice.

In [None]:
def get_dataset_per_invoice(df):
    # one row per invoice holding the invoice-level columns
    data_per_invoice = df[['InvoiceNo', 'InvoiceDate', 'Country', 'CustomerID']].drop_duplicates().set_index('InvoiceNo')

    # aggregate the line items of each invoice; the assignments align on the InvoiceNo index
    grouped = df.groupby('InvoiceNo')
    data_per_invoice['MeanLogUnitPrice'] = grouped['LogUnitPrice'].mean()
    # the sample standard deviation is NaN for invoices with a single item, so we fill with 0
    data_per_invoice['StdLogUnitPrice'] = grouped['LogUnitPrice'].std().fillna(0)
    data_per_invoice['SumLogUnitPrice'] = grouped['LogUnitPrice'].sum()
    data_per_invoice['MeanLogQuantity'] = grouped['LogQuantity'].mean()
    data_per_invoice['StdLogQuantity'] = grouped['LogQuantity'].std().fillna(0)
    data_per_invoice['SumLogQuantity'] = grouped['LogQuantity'].sum()
    data_per_invoice['NumProducts'] = grouped['LogQuantity'].count()

    return data_per_invoice

data = get_dataset_per_invoice(data)
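
To see what this aggregation does, here is a small sketch on a made-up table of line items. It uses pandas named aggregation as an alternative way of writing the same kind of per-invoice summary; the toy values and the `items` and `per_invoice` names are assumptions for illustration only.

```python
import pandas as pd

# Toy line items: invoice "A" has two products, invoice "B" has one (made-up values).
items = pd.DataFrame({
    "InvoiceNo": ["A", "A", "B"],
    "LogQuantity": [0.0, 2.0, 1.0],
})

# One row per invoice; named aggregation computes all statistics in one call.
per_invoice = items.groupby("InvoiceNo").agg(
    MeanLogQuantity=("LogQuantity", "mean"),
    StdLogQuantity=("LogQuantity", "std"),
    SumLogQuantity=("LogQuantity", "sum"),
    NumProducts=("LogQuantity", "count"),
)

# The sample standard deviation is NaN for single-item invoices, hence fillna(0).
per_invoice["StdLogQuantity"] = per_invoice["StdLogQuantity"].fillna(0)

print(per_invoice)
```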


Now let us look at the number of transactions per customer.

In [None]:
num_transactions_per_customer = data.groupby(['CustomerID']).size()
summary_t_per_c = num_transactions_per_customer.describe()
pd.options.display.float_format = '{:,.2f}'.format
pd.DataFrame({'Transactions per Customer': summary_t_per_c})

This seems fine. We have customers with between 1 and 210 transactions. However, most have 5 or fewer transactions. We visualise this using a boxplot on the log scale.

In [None]:
fig = px.box(np.log(num_transactions_per_customer))
fig.show()

The boxplot marks `exp(4.007)`, i.e. about 55, as the maximum number of transactions (using the 1.5 x IQR rule) and treats any customer with more transactions as an outlier.
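
The 1.5 x IQR rule can be reproduced by hand. The sketch below uses made-up log counts rather than the real data, and NumPy's default percentile interpolation, which may differ slightly from the quartile method Plotly uses, so the numbers need not match the plot exactly.

```python
import numpy as np

# Made-up log transaction counts; 6.0 lies far from the bulk of the data.
values = np.array([0.0, 0.0, 0.7, 1.1, 1.4, 1.6, 2.1, 6.0])

q1, q3 = np.percentile(values, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)

# The whisker ends at the largest observation that is still inside the fence.
upper_whisker = values[values <= upper_fence].max()
# Everything beyond the fence is drawn as an outlier.
outliers = values[values > upper_fence]

print(upper_whisker, outliers)
```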

Next, let us look at the invoice dates:

In [None]:
print("First transaction:", data['InvoiceDate'].min(), "Last transaction:", data['InvoiceDate'].max())
fig = px.histogram(data, x='InvoiceDate')
fig.show()

We can observe that the transactions are fairly well spread over time, with a slight upward trend and peaks before Christmas. We may also observe gaps during the Christmas breaks.

Next, let us look at the distribution of countries associated with each transaction.

In [None]:
print(f"Number of Countries in Dataset : {data['Country'].nunique()}")
px.histogram(data, x='Country', histnorm='probability density')

We may observe a distribution with a very long tail with almost 90% of all transactions being done in the UK. We may also observe that `Unspecified` is included as one country.
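
The share of transactions per country can also be read off as numbers with `value_counts(normalize=True)`. A minimal sketch on a made-up series (on the real data, the same call on `data['Country']` should give the actual shares):

```python
import pandas as pd

# Made-up country column: 9 of 10 transactions come from one country.
countries = pd.Series(["United Kingdom"] * 9 + ["Germany"], name="Country")

# Relative frequency of each country, sorted from most to least common.
shares = countries.value_counts(normalize=True)

print(shares)
```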

Finally, let us print the mean and standard deviation for all continuous columns.

In [None]:
data.describe()

## Use neural-lifetimes

We will now use the `neural-lifetimes` library to fit a neural network model predicting sequences of transactions from which one can derive all kinds of quantities, such as customer lifetime values.

Our model will take our features, convert them from their original format (e.g. strings) to numeric values and find an embedding space for them. We feed these embeddings into a time sequence model (here a `GRU`) to generate multivariate time series predictions for events using a variational encoder-decoder model. For each feature, we apply a small decoder head at the end to predict its value for the next event in the sequence.

First some imports and configurations:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

import neural_lifetimes

from neural_lifetimes.data.datasets import PandasSequenceDataset
from neural_lifetimes.data.dataloaders import SequenceLoader
from neural_lifetimes.models.nets import CombinedEmbedder
from neural_lifetimes.models import TargetCreator
from neural_lifetimes.data.datamodules import SequenceDataModule
from neural_lifetimes.models.modules import VariationalEventModel

LOG_DIR = str(Path.cwd() / "logs")
print(f"Logging to: {LOG_DIR}")

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    device_type = "gpu"
else:
    device = torch.device("cpu")
    device_type = "cpu"

We will use the `Sequence` pattern implemented in the package: there are multiple classes in the package named `*Sequence*` that work together to process multivariate sequential data. For example, `SequenceLoader` (a PyTorch `DataLoader`), `SequenceDataset` (a PyTorch dataset) or `SequenceDataModule` (a PyTorch Lightning `DataModule`).

In particular, here we will use `neural_lifetimes.data.datasets.PandasSequenceDataset` to create a sequence dataset. It takes a pandas `DataFrame` as input and we only need to specify the relevant column names. Read #TODO for full documentation. The `continuous_feature_names` and `category_dict` specify the features our model will use.

In [None]:
# specify the dataset
dataset = PandasSequenceDataset(df=data, uid_col='CustomerID', time_col='InvoiceDate', asof_time=datetime.datetime(2012, 1, 1), min_items_per_uid=1)

# specify which columns to use
continuous_feature_names = ['MeanLogUnitPrice', 'StdLogUnitPrice', 'SumLogUnitPrice', 'MeanLogQuantity', 'StdLogQuantity', 'SumLogQuantity', 'NumProducts']
categorical_feature_names = ['Country']
category_dict = {}
for col in categorical_feature_names:
    category_dict[col] = data[col].unique()
print(category_dict)
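
Conceptually, a sequence dataset treats each customer as one time-ordered sequence of events. The sketch below illustrates that idea with plain pandas on made-up data. It is not how `PandasSequenceDataset` is implemented, and the `events` and `sequences` names are hypothetical.

```python
import pandas as pd

# Made-up invoices for two customers, deliberately not in time order.
events = pd.DataFrame({
    "CustomerID": ["c1", "c2", "c1", "c1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-03-01", "2011-01-15", "2011-01-10", "2011-02-05"]
    ),
    "NumProducts": [3, 1, 5, 2],
})

# Sort by time, then collect each customer's events into one ordered sequence.
sequences = (
    events.sort_values("InvoiceDate")
    .groupby("CustomerID")["NumProducts"]
    .apply(list)
    .to_dict()
)

print(sequences)
```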

Next we define the start tokens. We have variable-length time sequences, so we need to communicate to the model when a new series starts and an old one ends. The logging directory and the device were already set in the configuration cell above.

In [None]:
START_TOKEN_DISCR = 'StartToken'
START_TOKEN_CONT = 0
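
To illustrate the role of a discrete start token, here is a generic sketch of how categorical values are commonly mapped to integer indices with one index reserved for the token. This is an assumption about the general technique, not the library's encoding; `vocab`, `encode` and the category values are hypothetical.

```python
START_TOKEN_DISCR = "StartToken"

countries = ["United Kingdom", "France", "Germany"]  # made-up category values

# Reserve index 0 for the start token, then enumerate the real categories.
vocab = {token: i for i, token in enumerate([START_TOKEN_DISCR] + countries)}


def encode(sequence):
    """Prefix a sequence with the start token and map each value to its index."""
    return [vocab[token] for token in [START_TOKEN_DISCR] + list(sequence)]


print(encode(["France", "France", "Germany"]))
```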

We now need to decide on how to embed our data. We will use the following:

- `Country` will be passed through the discrete embedder
- The features derived from `UnitPrice` and `Quantity`, as well as `NumProducts`, will get a continuous embedding

We are ignoring the remaining columns for this demonstration, but it would be possible to add more complicated embeddings and more columns. For example, we might add a `BERT` embedding for the product description.

Using our lists of features, we will create two key parts of the training pipeline:

1. The `CombinedEmbedder` processes all features individually and finds neural embeddings before passing them on to the time series model.
2. The `TargetCreator` implements the decoding for each feature and sets up the appropriate loss for them.

#TODO Link to docs

In [None]:
emb = CombinedEmbedder(
    continuous_features=continuous_feature_names,
    category_dict=category_dict,
    embed_dim=128,
    drop_rate=0.1,
    pre_encoded=False,
)

target_transform = TargetCreator(
    cols=continuous_feature_names + categorical_feature_names,
    emb=emb,
    max_item_len=100,
    start_token_discr=START_TOKEN_DISCR,
    start_token_cont=START_TOKEN_CONT,
)


We can now create the `SequenceDataModule` (#TODO LINK TO DOCS), which automates a few tasks, such as creating the `DataLoaders` for us.

In [None]:
datamodule = SequenceDataModule(
    dataset=dataset,
    target_transform=target_transform,
    test_size=0.2,
    batch_points=1024,
    device=device,
    min_points=1,
)

Finally, we will instantiate our model. This model contains the embedder, encoder and decoder.

In [None]:
net = VariationalEventModel(
    emb,
    rnn_dim=128,
    drop_rate=0.2,
    bottleneck_dim=32,
    lr=0.001,
    target_cols=target_transform.cols,
    vae_sample_z=True,
    vae_sampling_scaler=1.0,
    vae_KL_weight=0.01,
)
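
The `vae_*` arguments configure the variational part of the model. As a rough guide to what such settings usually control, the sketch below shows the standard reparameterisation trick and KL term of a variational autoencoder in NumPy. It assumes the parameters mean what their names suggest (whether to sample the latent, how strongly to scale the noise, and how to weight the KL term). It is not the library's code, and `sample_latent` and `kl_to_standard_normal` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(mu, log_var, sampling_scaler=1.0, sample_z=True):
    """Reparameterisation trick: z = mu + scaler * sigma * eps, eps ~ N(0, I)."""
    if not sample_z:
        return mu  # deterministic: use the posterior mean
    eps = rng.standard_normal(mu.shape)
    return mu + sampling_scaler * np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


# Batch of 4 sequences with a 32-dimensional latent (cf. bottleneck_dim=32).
mu = np.zeros((4, 32))
log_var = np.zeros((4, 32))  # log variance 0 means unit variance

z = sample_latent(mu, log_var, sampling_scaler=1.0, sample_z=True)
kl = kl_to_standard_normal(mu, log_var)

reconstruction_loss = 1.0  # placeholder value, assumed for illustration
kl_weight = 0.01           # assumed role of vae_KL_weight
total_loss = reconstruction_loss + kl_weight * kl.mean()

print(z.shape, kl, total_loss)
```

A small KL weight lets the model focus on reconstructing the events first, at the cost of a less regularised latent space.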

In order to track the training progress, let us launch TensorBoard. You can also open TensorBoard in another tab.

In [None]:
%tensorboard --logdir logs/

Let's run the model :)
The `run_model` function sets up the training for you, as well as the logging and checkpointing. Using the `run_model` interface is optional, but it gives less experienced users a one-line interface to run their model.

ATTENTION: The default number of epochs below is set to `2`. We recommend you set this to `100+` for training.

In [None]:
neural_lifetimes.run_model(
    datamodule,
    net,
    log_dir=LOG_DIR,
    device_type=device_type,
    num_epochs=2,
    val_check_interval=18,
)

That is it. You can see how simple it is to fit a complicated neural network model on custom data using the `neural-lifetimes` library. If you have comments, please check out

https://github.com/transferwise/neural-lifetimes/blob/pandas-and-custom-dataset/examples/use_own_dataset.ipynb