<a href="https://colab.research.google.com/github/sundrop03/TL/blob/main/Inductive_TL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inductive transfer learning

## Introduction

This notebook demonstrates the applications of inductive transfer learning discussed in the paper ["Transfer Learning in the Actuarial Domain"]().

We first train a claims frequency prediction model on a large data set whose labels are fire and robbery claims. What this model learns is then transferred to injury claims frequency prediction on a smaller data set.

## Data Preparation

For transfer learning, we need a source dataset to learn from and a target dataset to transfer the learnings to.

Therefore, two datasets are used for the application.

For the source dataset, we use the Brazilian automobile claims data set that can be accessed [online](http://www2.susep.gov.br/menuestatistica/Autoseg/) and is also available in [CASdatasets](https://github.com/dutangc/CASdatasets). The target dataset is the Singapore automobile claims data, which is also available in CASdatasets.

We start with loading the necessary packages that include various metrics and loss functions used in constructing the neural network.

In [None]:
#load packages
import pandas as pd
import xlsxwriter
import io
import requests
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,mean_poisson_deviance
from pickle import dump
from pickle import load
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.backend import exp
from tensorflow.keras.models import Sequential, Model, save_model, load_model
from tensorflow.keras.layers import Input, Reshape, Dense, Activation, Flatten, Concatenate, Embedding, BatchNormalization, Dropout, Add
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam, Nadam, SGD, Adamax, Adagrad
from tensorflow.keras.utils import plot_model

The source data is preprocessed before it is read into the python environment.

Instances with blank values for 'Gender' and 'VehModel' are removed. Blank 'DrivAge' values for corporate policyholders are filled proportionally based on the age distribution of the whole data. A possible alternative is to fill the empty values with a nearest neighbor algorithm. Instances with values larger than 1 for the exposure feature are removed.
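The proportional fill can be sketched as follows. This is a minimal illustration, assuming that "filled proportionally" means drawing each missing age from the empirical distribution of the observed ages; the toy values and variable names are hypothetical and are not taken from the preprocessing that produced the source files.

```python
import numpy as np

# Toy driver ages with missing entries (np.nan); values are illustrative only
driv_age = np.array([23, 35, np.nan, 47, 35, np.nan, 61, 35, 23, np.nan])

rng = np.random.default_rng(0)  # seeded so the fill is repeatable

# Empirical distribution of the observed ages
observed = driv_age[~np.isnan(driv_age)]
values, counts = np.unique(observed, return_counts=True)
probs = counts / counts.sum()

# Draw one age per missing entry, in proportion to the observed distribution
missing = np.isnan(driv_age)
filled = driv_age.copy()
filled[missing] = rng.choice(values, size=missing.sum(), p=probs)

print(filled)
```

Observed entries are left untouched; only the blanks receive sampled values.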

Features that are not available in the target data are removed from the source data.

The original source data has 23 features available of which we use features on gender, exposure, vehicle age, vehicle type, driver age, and the number of claims.

These features are pre-processed to match the features of our target data. Gender consists of male, female, and corporate. Driver age is used to assign each instance to 'AgeCat', which has 5 age groups where 0 is the youngest and 4 is the oldest. Based on the vehicle year feature, each observation is categorized into 'AutoAge0', 'AutoAge1', 'AutoAge2', 'AutoAge', 'VAgeCat', and 'VAgecat1'. Information on the vehicle model and vehicle group determines 'VehicleType', which has 4 categories: A (auto), M (motorcycle), O (others), and T (truck).
ClaimNbRob and ClaimNbFire represent the number of claims due to robbery and fire. The sum of these two features is renamed Clm_Count.
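As a rough illustration of these two derived features, the sketch below sums the two claim counts and bins driver age into five groups. The age-band edges are hypothetical, since the actual cut points are not stated here, and the toy values are made up.

```python
import numpy as np

# Toy records; names follow the text, values are made up
claim_nb_rob = np.array([0, 1, 0, 2])
claim_nb_fire = np.array([0, 0, 1, 1])
driv_age = np.array([19, 34, 52, 71])

# Clm_Count is the sum of the robbery and fire claim counts
clm_count = claim_nb_rob + claim_nb_fire

# Hypothetical age-band edges (assumption): four edges give five groups, 0 to 4
age_edges = np.array([26, 36, 46, 66])
age_cat = np.digitize(driv_age, age_edges)  # 0 = youngest band, 4 = oldest band

print(clm_count)
print(age_cat)
```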

Now we have 11 final features in our source data which are Gender, VehicleType, Clm_Count, Exp_weights, AgeCat, AutoAge0, AutoAge1, AutoAge2, AutoAge, VAgeCat, and VAgecat1.

Let's load our source data and check:

In [None]:
!git clone https://github.com/sundrop03/TL.git

In [None]:
mysource1= pd.read_excel("TL/Source_dataBS1.xlsx")
mysource2= pd.read_excel("TL/Source_dataBS2.xlsx")
mysource3= pd.read_excel("TL/Source_dataBS3.xlsx")
mysource4= pd.read_excel("TL/Source_dataBS4.xlsx")
mysource5= pd.read_excel("TL/Source_dataBS5.xlsx")

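# ignore_index=True gives a unique 0..n-1 index, which the index-based joins in the encoding step rely on
mysource = pd.concat([mysource1, mysource2, mysource3, mysource4, mysource5], ignore_index=True)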
print(mysource)
mysource.describe()

All features except Exp_weights are categorical variables that need to be encoded to be used for machine learning methods.

Vehicle type and Gender are categorical variables, but without ordinal characteristics, so we apply one-hot encoding.

In [None]:
#process source data for use

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
mysource1 = pd.DataFrame(encoder.fit_transform(mysource[['Gender']]).toarray())
mysource2 =  mysource.join(mysource1)
mysource2.drop('Gender', axis=1, inplace=True)
mysource2.columns = ['VehicleType', 'Clm_Count', 'Exp_weights', 'AgeCat','AutoAge0',
                     'AutoAge1','AutoAge2','AutoAge','VAgeCat', 'VAgecat1', 'Corporate', 'Female', 'Male']

encoder1 = OneHotEncoder(handle_unknown='ignore')
mysource3 = pd.DataFrame(encoder1.fit_transform(mysource2[['VehicleType']]).toarray())
final_source = mysource2.join(mysource3)
final_source.drop('VehicleType', axis=1, inplace=True)
final_source.columns = ['Clm_Count', 'Exp_weights', 'AgeCat','AutoAge0','AutoAge1',
                        'AutoAge2','AutoAge','VAgeCat', 'VAgecat1', 'Corporate', 'Female', 'Male', 'Auto', 'Motorcycle', 'Other', 'Truck']
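The join-based encoding above aligns rows by index, so it relies on the index being unique and in row order. An alternative sketch, assuming the category labels shown in the toy frame below (they are illustrative), uses `pd.get_dummies`, which encodes the columns inside the frame and needs no join.

```python
import pandas as pd

# Toy frame with a deliberately non-unique index, as concatenated files can produce
df = pd.DataFrame(
    {"Gender": ["Male", "Female", "Corporate", "Male"],
     "VehicleType": ["A", "M", "T", "O"],
     "Clm_Count": [0, 1, 0, 0]},
    index=[0, 1, 0, 1],
)

# Encode both columns in one call; empty prefix keeps the raw category names
encoded = pd.get_dummies(df, columns=["Gender", "VehicleType"],
                         prefix="", prefix_sep="", dtype=float)
encoded = encoded.rename(columns={"A": "Auto", "M": "Motorcycle",
                                  "O": "Other", "T": "Truck"})

print(encoded.columns.tolist())
```

Because the dummy columns are created row by row from the original frame, a repeated index cannot misalign them.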

Now let's load our target data and apply the same encoding.

In [None]:
mytarget= pd.read_excel("TL/Target_dataB.xlsx")
print(mytarget)
mytarget.describe()

In [None]:
#process target data for use
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
mytarget1 = pd.DataFrame(encoder.fit_transform(mytarget[['Gender']]).toarray())
mytarget2 =  mytarget.join(mytarget1)
mytarget2.drop('Gender', axis=1, inplace=True)
mytarget2.columns = ['VehicleType', 'Clm_Count', 'Exp_weights', 'AgeCat','AutoAge0',
                     'AutoAge1','AutoAge2','AutoAge','VAgeCat', 'VAgecat1', 'Corporate', 'Female', 'Male']

encoder1 = OneHotEncoder(handle_unknown='ignore')
mytarget3 = pd.DataFrame(encoder1.fit_transform(mytarget2[['VehicleType']]).toarray())
final_target = mytarget2.join(mytarget3)
final_target.drop('VehicleType', axis=1, inplace=True)
final_target.columns = ['Clm_Count', 'Exp_weights', 'AgeCat','AutoAge0','AutoAge1',
                        'AutoAge2','AutoAge','VAgeCat', 'VAgecat1', 'Corporate', 'Female', 'Male', 'Auto', 'Motorcycle', 'Other', 'Truck']

## Setting seeds

We set seeds for reproducibility. Exact reproduction of the results is still hard to achieve because randomness enters through the bootstrap sampling, the random data splits, the dropout layers, and the training procedure. This does not change the overall findings.

In [None]:
seed_val = 10579
import os
os.environ['PYTHONHASHSEED']=str(seed_val)
import random
random.seed(seed_val)
np.random.seed(seed_val)
tf.random.set_seed(seed_val)
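One source of residual randomness is worth spelling out: `np.random.seed` only controls the legacy global generator, while a generator created with `np.random.default_rng()` and no argument is seeded from fresh entropy. The sketch below, using illustrative draws only, shows that passing a seed explicitly makes such a generator repeatable.

```python
import numpy as np

seed_val = 10579

# The legacy global seed controls np.random.choice
np.random.seed(seed_val)
a = np.random.choice(10, size=5)
np.random.seed(seed_val)
b = np.random.choice(10, size=5)

# default_rng is repeatable only when it is given a seed explicitly
g1 = np.random.default_rng(seed_val).choice(10, size=5, replace=False)
g2 = np.random.default_rng(seed_val).choice(10, size=5, replace=False)

print(np.array_equal(a, b), np.array_equal(g1, g2))
```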

## Defining the neural network

We define the structure of our neural network that will be used throughout the applications.

In [None]:
#define NN model

def NN_model(input_shape=(14,)):
    model = Sequential()
    model.add(Dense(28,input_shape=input_shape,kernel_initializer = 'GlorotNormal'))
    model.add(BatchNormalization()) #BatchNormalization
    model.add(Dropout(0.1))
    model.add(Activation('tanh'))
    model.add(Dense(21,kernel_initializer = 'GlorotNormal'))
    model.add(BatchNormalization()) #BatchNormalization
    model.add(Dropout(0.1))
    model.add(Activation('tanh'))
    model.add(Dense(14,kernel_initializer = 'GlorotNormal'))
    model.add(BatchNormalization()) #BatchNormalization
    model.add(Dropout(0.1))
    model.add(Activation('tanh'))
    model.add(Dense(1, activation='linear'))
    model.add(Dense(1, activation='exponential',trainable = False))
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    return model

We define the network as a Keras Sequential model; the same architecture could also be written with the functional API.

The network has an input layer that takes the 14 features. These are passed through three dense (hidden) layers with 28, 21, and 14 neurons respectively, each followed by batch normalization, dropout, and a tanh activation. The result is passed through a dense layer with 1 neuron and a linear activation function. Finally, the output layer applies an exponential activation function, and the loss function is set to 'poisson' since our outcome is the predicted claim frequency.
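To get a feel for the size of this network, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras conventions: a dense layer has one weight per input-output pair plus one bias per output, and batch normalization keeps four vectors per unit.

```python
# Layer widths from the text: 14 inputs -> 28 -> 21 -> 14 -> 1 -> 1
widths = [14, 28, 21, 14, 1, 1]

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

dense_counts = [dense_params(a, b) for a, b in zip(widths[:-1], widths[1:])]

# BatchNormalization keeps gamma, beta, moving mean, and moving variance per unit;
# only gamma and beta are trainable
bn_counts = [4 * n for n in widths[1:4]]

print(dense_counts)
print(bn_counts)
print(sum(dense_counts) + sum(bn_counts))
```

Most of the parameters sit in the three hidden dense layers, which is why the number of frozen layers matters so much in the transfer experiments.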

Unlike the transductive model, we initialize the weights for each hidden layer by a Glorot normal initializer.

The choice of optimizer was determined by experiments. We considered SGD (Stochastic Gradient Descent) with a learning rate of 0.1 and momentum of 0.9 as a possible alternative. With dropout applied to multiple layers, raising the learning rate and momentum gave reasonable results, but it also produced very large weights that performed poorly in terms of predictions. The final choice is Nadam, which is an Adam optimizer with Nesterov momentum. The learning rate is set to 0.004, a value that depends on the batch size used for training.

To control overfitting, we apply batch normalization and dropout to the first three dense layers. Opinions differ on the order in which batch normalization and dropout should be applied, with some suggesting that dropout should come first. It is also common to use a higher retention probability (e.g., 80\%) for input layers and a lower retention probability (e.g., 50\%) for other layers. These choices differ by application, and the order and hyperparameters used here were determined through experiments.

A large dropout probability works relatively well when training a neural network on a single dataset. For transfer learning, however, a small dropout probability worked better. Therefore, we set the dropout rate to a constant 0.1, which retains most of the output from every layer.
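What a dropout rate of 0.1 does during training can be sketched in a few lines of NumPy. This is an illustration of inverted dropout (the scheme Keras uses, where the surviving units are scaled by 1/(1 - rate)), not the Keras implementation itself; the function name and toy input are hypothetical.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest by 1/(1-rate)."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 28))          # stand-in for the output of a 28-unit layer
out = dropout(x, rate=0.1, rng=rng)

kept_fraction = (out != 0).mean()
print(round(float(kept_fraction), 3), round(float(out.mean()), 3))
```

About 90% of the units pass through, and the rescaling keeps the average activation roughly unchanged.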



## Baseline model

We are interested in the performance difference between models with and without transfer learning. Therefore, we run a baseline model that only learns from the target data.

The target data is split into 60\% train, 25\% validation, and 15\% test data.
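The split is done in two stages: 60% of the rows go to training, and 62.5% of the remaining 40% go to validation, which gives 0.4 × 0.625 = 25% validation and 15% test. A minimal sketch of that arithmetic, with a hypothetical helper name and a toy row count:

```python
import numpy as np

def three_way_split(n, train_frac, val_share_of_rest, rng):
    """Return index arrays for train, validation, and test."""
    idx = rng.permutation(n)
    n_train = round(n * train_frac)
    n_val = round((n - n_train) * val_share_of_rest)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

rng = np.random.default_rng(0)
tr, va, te = three_way_split(1000, train_frac=0.6, val_share_of_rest=0.625, rng=rng)
print(len(tr), len(va), len(te))
```

With `train_frac=0.8` the same helper yields an 80/12.5/7.5 split.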

The outcome of interest 'Clm_Count' should be considered together with its corresponding 'Exp_weights'. In the model, we take this into account by creating bootstrap samples in which each row is drawn with probability proportional to its 'Exp_weights' value.
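The effect of exposure-weighted resampling can be checked on a toy example. The exposures below are made up, and the sample size is inflated so that the empirical frequencies are easy to compare with the sampling probabilities; the notebook itself draws as many rows as the data contains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy exposures: policy 0 was in force for a full year, policy 3 for a tenth of a year
exposure = np.array([1.0, 0.5, 0.5, 0.1])
p = exposure / exposure.sum()

# Resample row indices with replacement, with probability proportional to exposure
idx_boot = rng.choice(len(exposure), size=100_000, replace=True, p=p)
freq = np.bincount(idx_boot, minlength=len(exposure)) / len(idx_boot)

print(np.round(p, 3))
print(np.round(freq, 3))
```

Rows with more exposure appear more often in the bootstrap sample, so they carry more weight in training and evaluation.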

In [None]:
callback =tf.keras.callbacks.EarlyStopping(monitor='val_loss',min_delta=0,patience=30,verbose=0,
                                           mode='auto',baseline=None,restore_best_weights=True)

import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)

    #split training and testing data within target

    scalerIBA = MinMaxScaler()
    final_target[['AgeCat','VAgeCat', 'VAgecat1']] = scalerIBA.fit_transform(final_target[['AgeCat','VAgeCat', 'VAgecat1']])
    Xmy = final_target.drop(columns = ['Exp_weights','Clm_Count']).values
    ymy = final_target['Clm_Count'].values
    vmy= final_target['Exp_weights'].values

    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.6), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    Xtr = Xmy_bs[idx1]
    ytr = ymy_bs[idx1]
    vtr = vmy_bs[idx1]

    Xva_my = Xmy_bs[idx2]
    yva_my = ymy_bs[idx2]
    vva_my = vmy_bs[idx2]

    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(Xva_my)), round(len(Xva_my)*0.625), replace=False)
    idx4 = np.delete(np.arange(len(Xva_my)),idx3)

    Xva = Xva_my[idx3]
    yva = yva_my[idx3]
    vva = vva_my[idx3]

    Xte = Xva_my[idx4]
    yte = yva_my[idx4]
    vte = vva_my[idx4]


    model1 = NN_model()
    model1.fit(Xtr,ytr,batch_size=512, callbacks=[callback], verbose=2,epochs=400, validation_data=(Xva,yva))

    Plain_preds = model1.predict(Xte)
    deviance = mean_poisson_deviance(yte, Plain_preds)
    error = mean_squared_error(yte, Plain_preds)

    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error


end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())

We run the model for 1,000 iterations to produce 1,000 sets of results. For each iteration, we calculate the MPD (Mean Poisson Deviance) and MSE (Mean Squared Error) on the test data to measure the performance of the model. The closer the predictions are to the true outcomes, the lower the deviance, so a decrease indicates better predictions. Since ours is a Poisson regression problem, reporting the MPD alongside the traditional regression metric MSE is a reasonable choice.
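For reference, the mean Poisson deviance averages 2(y log(y/μ) − y + μ) over the observations, with the term y log(y/μ) taken as 0 when y = 0. The NumPy sketch below follows that definition; the function names and toy predictions are hypothetical and serve only to show that better predictions give a lower deviance.

```python
import numpy as np

def mean_poisson_deviance_np(y_true, y_pred):
    """Average of 2 * (y*log(y/mu) - y + mu), with 0*log(0) taken as 0."""
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)   # log(1) = 0 handles the y == 0 terms
    return np.mean(2.0 * (y * np.log(ratio) - y + mu))

def mean_squared_error_np(y_true, y_pred):
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    return np.mean((y - mu) ** 2)

y = np.array([0, 0, 1, 0, 2])
good = np.array([0.1, 0.1, 0.8, 0.2, 1.5])   # tracks the outcomes closely
poor = np.full(5, 0.6)                        # constant prediction at the sample mean

print(mean_poisson_deviance_np(y, good), mean_poisson_deviance_np(y, poor))
print(mean_squared_error_np(y, good), mean_squared_error_np(y, poor))
```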

The batch size is chosen to balance runtime against predictive performance. A small batch size of 32 or 64 with a learning rate of 0.001 usually gives good predictions, but it greatly increases the runtime. We increased the batch size from 32 to 512, a factor of 16, and increased the learning rate of the optimizer to 0.004, a factor of $\sqrt{16} = 4$. Scaling the learning rate by the square root of the increase in batch size reduced the runtime with similar predictive performance.
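The scaling arithmetic is short enough to verify directly. The sketch below assumes a base batch size of 32, which is what a factor of 16 up to 512 implies; the linear-scaling value is shown only for comparison and is not used in the notebook.

```python
import math

base_batch, base_lr = 32, 0.001      # small-batch setting mentioned in the text
new_batch = 512

batch_factor = new_batch / base_batch                 # increase in batch size
sqrt_scaled_lr = base_lr * math.sqrt(batch_factor)    # square-root scaling
linear_scaled_lr = base_lr * batch_factor             # linear scaling, for comparison only

print(batch_factor, sqrt_scaled_lr, linear_scaled_lr)
```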

The baseline model has an MPD of 0.4478 (0.0530) and an MSE of 0.0984 (0.0233). Values in parentheses are standard deviations.

Now we have the baseline model! Let's move on to apply inductive transfer learning.

## Parameter-based methods

For inductive transfer learning, the source data and target data share the same domain but have different, though related, tasks. In our case, the source task is to predict auto claims due to robbery and fire, while the target task is to predict auto injury claims.

We experiment with a widely used parameter-based approach. The source data is split into 80\% training data, 12.5\% validation data, and 7.5\% test data.

### Implementation

Transfer learning is applied in two steps.

First, we train the network defined above on the large source data and save the model, which includes the structure and the weights learned during training. Out of the 1,000 iterations, we choose the best model to transfer based on the metrics.

Second, we load the saved model, remove the output layer, and set the remaining layers to 'non-trainable', also known as 'freezing' the layers. Then a new output layer is added to the model to adapt the learning from the source task to the target task.

If the source task and target task are very different, then we need to freeze fewer layers of the source model and update the unfrozen layers during training on the target data.
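The idea of freezing can be illustrated without Keras. In the NumPy sketch below, a single hidden layer stands in for the transferred source layers and is never updated, while a new output layer with exponential activation is trained by gradient descent on the Poisson loss. The weights, data, step size, and iteration count are all stand-in assumptions for illustration, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "source" hidden layer: random stand-in weights, never updated below
W_frozen = rng.normal(scale=0.5, size=(14, 14))
W_before = W_frozen.copy()

# Toy target data: 14 features and Poisson counts with a low claim frequency
X = rng.random((500, 14))
y = rng.poisson(0.1, size=500)

H = np.tanh(X @ W_frozen)            # features produced by the frozen layer

def poisson_loss(w, b):
    mu = np.exp(H @ w + b)
    return np.mean(mu - y * np.log(mu))

# New trainable output layer with exponential activation
w, b = np.zeros(14), 0.0
loss_start = poisson_loss(w, b)

for _ in range(300):
    mu = np.exp(H @ w + b)
    grad_w = H.T @ (mu - y) / len(y)   # gradient of the Poisson loss w.r.t. w
    grad_b = np.mean(mu - y)           # gradient w.r.t. the bias
    w -= 0.05 * grad_w
    b -= 0.05 * grad_b

loss_end = poisson_loss(w, b)
print(loss_start, loss_end)
```

Only `w` and `b` change during training; the frozen weights act as a fixed feature extractor, which is what setting `layer.trainable = False` achieves in Keras.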

### Step 1: Training source model

For each of the 1,000 iterations, we train the source model, record the metrics, and save the model itself.

In [None]:
callback =tf.keras.callbacks.EarlyStopping(monitor='val_loss',min_delta=0,patience=30,verbose=0,
                                           mode='auto',baseline=None,restore_best_weights=True)

import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)

    #split training and testing data within source

    scalerIBA1 = MinMaxScaler()
    final_source[['AgeCat','VAgeCat', 'VAgecat1']] = scalerIBA1.fit_transform(final_source[['AgeCat','VAgeCat', 'VAgecat1']])
    Xmy = final_source.drop(columns = ['Exp_weights','Clm_Count']).values
    ymy = final_source['Clm_Count'].values
    vmy= final_source['Exp_weights'].values

    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.8), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    Xtr = Xmy_bs[idx1]
    ytr = ymy_bs[idx1]
    vtr = vmy_bs[idx1]

    Xva_my = Xmy_bs[idx2]
    yva_my = ymy_bs[idx2]
    vva_my = vmy_bs[idx2]

    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(Xva_my)), round(len(Xva_my)*0.625), replace=False)
    idx4 = np.delete(np.arange(len(Xva_my)),idx3)

    Xva = Xva_my[idx3]
    yva = yva_my[idx3]
    vva = vva_my[idx3]

    Xte = Xva_my[idx4]
    yte = yva_my[idx4]
    vte = vva_my[idx4]

    model_TL = NN_model()
    model_TL.fit(Xtr,ytr,batch_size=512, callbacks=[callback], verbose=2,epochs=400, validation_data=(Xva,yva))

    Plain_preds = model_TL.predict(Xte)
    deviance = mean_poisson_deviance(yte, Plain_preds)
    error = mean_squared_error(yte, Plain_preds)

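    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error

    # Save the model from every iteration so the best one can be chosen afterwards.
    # The file name pattern "model<i>.h5" is an assumption; the next step loads 'model1.h5'.
    save_model(model_TL, f"model{i}.h5")


end = time.time()
print((end - start)/60.0, "min elapsed.")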

writer = pd.ExcelWriter('InSource.xlsx', engine='xlsxwriter')
BS.to_excel(writer, sheet_name='welcome', index=False)
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file

After the run, we have 1,000 models and a spreadsheet containing the deviance and error values of those models.

Based on the deviance and error values, we choose 'model1' as the model to be transferred to the target.
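The selection rule is not spelled out beyond "based on the deviance and error values". One simple possibility, shown here as an assumption rather than the rule actually used, is to take the iteration with the lowest deviance and break ties by the error. The metric values below are illustrative, not results from the run.

```python
import pandas as pd

# Illustrative metrics only; in the notebook these come from the 'InSource.xlsx' sheet
metrics_df = pd.DataFrame({"PD": [0.52, 0.47, 0.49],
                           "Error": [0.110, 0.098, 0.099]})

# Lowest deviance first, ties broken by the squared error (assumed rule)
best_iter = metrics_df.sort_values(["PD", "Error"]).index[0]

print(best_iter, metrics_df.loc[best_iter].to_dict())
```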

### Step 2: Transferred model with one unfrozen dense layer (Transfer 1)

We load the chosen 'model1', remove the output layer, and freeze the remaining layers. Then we add a new trainable output layer. In other words, we use all the parameters from the source model except those of the last layer, which is retrained to adapt the model from the source task to the target task.

In [None]:
model = load_model('model1.h5')

In [None]:
model1 = Sequential()

for layer in model.layers[:-1]:
    model1.add(layer)
for layer in model1.layers:
    layer.trainable=False

model1.add(Dense(1, activation='exponential',name='dense_4'))
model1.compile(optimizer=Nadam(0.004), loss='poisson')

This new model is trained on the target data.

In [None]:
callback =tf.keras.callbacks.EarlyStopping(monitor='val_loss',min_delta=0,patience=30,verbose=0,
                                           mode='auto',baseline=None,restore_best_weights=True)

import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)

    #split training and testing data within target
    scalerITA1 = MinMaxScaler()
    final_target[['AgeCat','VAgeCat', 'VAgecat1']] = scalerITA1.fit_transform(final_target[['AgeCat','VAgeCat', 'VAgecat1']])
    Xmy = final_target.drop(columns = ['Exp_weights','Clm_Count']).values
    ymy = final_target['Clm_Count'].values
    vmy= final_target['Exp_weights'].values

    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.6), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    Xtr = Xmy_bs[idx1]
    ytr = ymy_bs[idx1]
    vtr = vmy_bs[idx1]

    Xva_my = Xmy_bs[idx2]
    yva_my = ymy_bs[idx2]
    vva_my = vmy_bs[idx2]

    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(Xva_my)), round(len(Xva_my)*0.625), replace=False)
    idx4 = np.delete(np.arange(len(Xva_my)),idx3)

    Xva = Xva_my[idx3]
    yva = yva_my[idx3]
    vva = vva_my[idx3]

    Xte = Xva_my[idx4]
    yte = yva_my[idx4]
    vte = vva_my[idx4]

    model1.fit(Xtr,ytr,batch_size=512, callbacks=[callback], verbose=2, epochs=400, validation_data=(Xva,yva))

    Plain_preds = model1.predict(Xte)
    deviance = mean_poisson_deviance(yte, Plain_preds)
    error = mean_squared_error(yte, Plain_preds)

    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error


end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())

The first transfer model has an MPD of 0.4554 (0.0299) and an MSE of 0.0991 (0.0120), which does not improve on the baseline.

We move on to freezing fewer layers of the transferred source model. This leaves more layers available to adapt the model to the target data.

### Transferred model with two unfrozen layers (Transfer 2)

To check the effect of freezing fewer layers of the transferred model, we run additional experiments that use more layers for training on the target data.

We load the chosen 'model1', remove the last 2 dense layers (including the output layer), and freeze the remaining layers. Then we add 2 new trainable dense layers, the second of which is the new output layer.

In [None]:
model2 = Sequential()

for layer in model.layers[:-2]:
    model2.add(layer)
for layer in model2.layers:
    layer.trainable=False

model2.add(Dense(1, activation='linear',name='dense_3'))
model2.add(Dense(1, activation='exponential',name='dense_4'))
model2.compile(optimizer=Nadam(0.004), loss='poisson')

We train the new model on the target data and record the metrics.

The second transfer model has an MPD of 0.4523 (0.0295) and an MSE of 0.0988 (0.0120), which is an improvement over the first transfer model but still worse than the baseline model without any transfer learning.

We experiment once more, freezing even fewer dense layers of the transferred source model.

### Transferred model with three unfrozen dense layers (Transfer 3)

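We load the chosen 'model1' and remove the last 6 layers: the third hidden dense layer with its batch normalization, dropout, and activation layers, the linear dense layer, and the output layer. The remaining layers are frozen. Finally, we add a new hidden dense layer with batch normalization, dropout, and activation, followed by a new linear dense layer and a new output layer.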

In [None]:
model3 = Sequential()

for layer in model.layers[:-6]:
    model3.add(layer)
for layer in model3.layers:
    layer.trainable=False

model3.add(Dense(14,kernel_initializer = 'GlorotNormal',name='dense_2'))
model3.add(BatchNormalization(name='batch_normalization_2'))
model3.add(Dropout(0.1,name='dropout_2'))
model3.add(Activation('tanh',name='activation_2'))
model3.add(Dense(1, activation='linear',name='dense_3'))
model3.add(Dense(1, activation='exponential',name='dense_4'))
model3.compile(optimizer=Nadam(0.004), loss='poisson')
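The slices `[:-1]`, `[:-2]`, and `[:-6]` used in the three transfer models are easier to follow against the full layer sequence of `NN_model`. The sketch below lists the 14 layers in order and prints what each slice removes; the layer labels are illustrative and are not the names Keras assigns automatically.

```python
# Layer sequence of NN_model, in order: three hidden blocks, then two output layers
block = ["Dense", "BatchNormalization", "Dropout", "Activation"]
layers = [f"{name}_{i}" for i in range(3) for name in block] + ["Dense_linear", "Dense_exp"]

for cut in (1, 2, 6):
    kept, removed = layers[:-cut], layers[-cut:]
    print(f"[:-{cut}] keeps {len(kept)} layers, removes {removed}")
```

Removing the last 6 layers drops the whole third hidden block along with both output layers, so the frozen part ends at the activation of the second hidden block.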

We train the new model on the target data and record the metrics.

The third transfer model has an MPD of 0.4414 (0.0315) and an MSE of 0.0975 (0.0128), which is an improvement over the second transfer model and is also better than the baseline model.

In summary, parameter-based transfer learning in the inductive setting improves prediction performance here only when enough layers are unfrozen: the third transfer model outperforms the baseline, while the first and second do not.
One possible reason for the poor performance of the first and second transfer models is that the source task (i.e., predicting auto claims due to robbery and fire risks) is not that close to the target task (i.e., predicting auto claims due to injury). Adapting the source model to the target task would therefore require unfreezing more layers.
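The reported means can be put side by side as relative changes against the baseline. The sketch below uses only the MPD and MSE values quoted above; standard deviations are left out of the calculation, so it says nothing about whether the differences are statistically meaningful.

```python
# MPD and MSE means as reported in the text (standard deviations omitted)
results = {
    "Baseline":   (0.4478, 0.0984),
    "Transfer 1": (0.4554, 0.0991),
    "Transfer 2": (0.4523, 0.0988),
    "Transfer 3": (0.4414, 0.0975),
}

base_mpd, base_mse = results["Baseline"]
rel_change = {
    name: (100 * (mpd - base_mpd) / base_mpd, 100 * (mse - base_mse) / base_mse)
    for name, (mpd, mse) in results.items()
}

for name, (d_mpd, d_mse) in rel_change.items():
    print(f"{name:<10}  MPD {d_mpd:+.2f}%  MSE {d_mse:+.2f}%")
```

The changes in the means are small compared with the reported standard deviations (for example 0.0530 for the baseline MPD), which is worth keeping in mind when reading the ranking of the models.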