# Transductive transfer learning

## Introduction

This notebook demonstrates the applications of transductive transfer learning discussed in the paper ["Transfer Learning in the Actuarial Domain"]().

The Python package [ADAPT](https://github.com/adapt-python/adapt) is used for the implementation.

## Data Preparation

For transfer learning, we need a source dataset to learn from and a target dataset to transfer the learnings to.

Therefore, two datasets are used for the application.

For the source dataset, we use the Australian automobile claims data, which can be accessed from [CASdatasets](https://github.com/dutangc/CASdatasets). The target dataset is the Singapore automobile claims data, which is also available in CASdatasets.

We start with loading the necessary packages that include various metrics and loss functions used in constructing the neural network.

In [None]:
#load packages
import pandas as pd
import xlsxwriter
import io
import requests
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,mean_poisson_deviance
from pickle import dump
from pickle import load
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.backend import exp
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Reshape, Dense, Activation, Flatten, Concatenate, Embedding, BatchNormalization, Dropout, Add
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam, Nadam, SGD, Adamax, Adagrad
from tensorflow.keras.utils import plot_model

The original source data has 11 features available which are veh_value, exposure, clm, numclaims, claimcst0, veh_body, veh_age, gender, area, agecat, X_OBSTAT_.

We use 6 features: gender, exposure, veh_body, veh_age, agecat, and numclaims. These features are pre-processed to match the features of our target data. Gender is recoded as Female, a binary feature taking the values 0 and 1. No change is made to the Exposure feature. Veh_body denotes the vehicle body, categorized as Bus, Convertible, Coupe, Hatchback, Hardtop, Minibus, Motorized caravan, Panel van, Roadster, Sedan, Station wagon, Truck, and Utility. Based on veh_body, we create a feature Veh_type that categorizes all vehicles into A (auto), O (others), and T (truck). Veh_age is grouped from 1 to 4, with 1 being the youngest and 4 the oldest. Age_Cat is based on agecat and categorizes the policyholder into 5 age groups, where 1 is the youngest and 5 is the oldest. Numclaims represents the number of claims and is renamed to N_Claims.

Now we have 6 final features in our source data which are Female, Exposure, Veh_type, Veh_age, Age_Cat, N_Claims.

Let's load our source data and check:

In [None]:
mysource= pd.read_excel("Source_datav1.xlsx")
print(mysource)
mysource.describe()

All features except Exposure are categorical variables that need to be encoded to be used for machine learning methods.

Vehicle type is a categorical variable, but without ordinal characteristics, so we apply one-hot encoding. 

In [None]:
#process source data for use

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_mysource = pd.DataFrame(encoder.fit_transform(mysource[['Veh_type']]).toarray())
mysource1 = mysource.join(encoder_mysource)
mysource1.drop('Veh_type', axis=1, inplace=True)
mysource1.columns = ['Female', 'Exposure', 'Veh_age', 'Age_Cat', 'N_Claims', 'Auto', 'Other', 'Truck']
print(mysource1)

The original target data has 15 features available from which we use information on gender, exposure, vehicle age, vehicle type, driver age, and number of claims. The 6 final features in our target data are the same as the source data.

Now let's load our target data and check:

In [None]:
mytarget= pd.read_excel("Target_datav1.xlsx")
print(mytarget)
mytarget.describe()

The same encoding is applied to the target data.

In [None]:
#Preprocessing for target data

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_mytarget = pd.DataFrame(encoder.fit_transform(mytarget[['Veh_type']]).toarray())
final_target = mytarget.join(encoder_mytarget)
final_target.drop('Veh_type', axis=1, inplace=True)
final_target.columns = ['Female', 'Exposure', 'Veh_age', 'Age_Cat', 'N_Claims', 'Auto', 'Other', 'Truck']

## Setting seeds

We can set seeds for reproducibility, but exact reproduction of results is hard to achieve due to randomness in the sampling, cross-validation, dropout layers, and methods. This doesn't change the overall findings from the results.    

In [None]:
seed_val = 10579
import os
os.environ['PYTHONHASHSEED']=str(seed_val)
import random
random.seed(seed_val)
np.random.seed(seed_val)
tf.random.set_seed(seed_val)

## Defining the neural network

We define the structure of our neural network that will be used throughout the applications.

In [None]:
#define NN model

def NN_model():
    input_init = Input(shape=(6,))
    
    hidden1 = Dense(12)(input_init) #Hidden 1
    batch1 = BatchNormalization()(hidden1) #BatchNormalization1
    drop1 = Dropout(0.1)(batch1) #Dropout1
    act1 = Activation('tanh')(drop1)

    hidden2 = Dense(9)(act1) #Hidden 2
    batch2 = BatchNormalization()(hidden2) #BatchNormalization2
    drop2 = Dropout(0.1)(batch2) #Dropout2
    act2 = Activation('tanh')(drop2)

    hidden3 = Dense(6)(act2) #Hidden 3
    batch3 = BatchNormalization()(hidden3) #BatchNormalization3
    drop3 = Dropout(0.1)(batch3) #Dropout3
    act3 = Activation('tanh')(drop3)
    lin = Dense(1, activation='linear')(act3)
    
    output = Dense(1, activation='exponential',trainable = False)(lin) #Output
    model = Model(inputs=input_init,outputs=output)
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    return model

We define the model using the functional API, which allows multiple inputs to be fed at any stage of the network. The same model can also be defined as a sequential model.

The network has one input layer with 6 neurons that take the 6 features as input. These are passed through three Dense(hidden) layers with tanh activation functions that have 12, 9, and 6 neurons respectively. Then the output is combined and passed through a dense layer with 1 neuron and a linear activation function. Finally, the results are combined at the output layer with an exponential activation function and the loss function is set to 'poisson' since our outcome is the predicted claim frequency. 

The choice of optimizer was determined by experiments. We considered SGD (stochastic gradient descent) with a learning rate of 0.1 and momentum of 0.9 as a possible alternative. With dropout applied to multiple layers, increasing the learning rate and momentum to a higher level provided reasonable results. But this also resulted in very large weights that showed poor performance in terms of predictions. The final choice is Nadam, which is the Adam optimizer with Nesterov momentum. The learning rate is set to 0.004, a value that depends on the batch size used for training.

To control for overfitting, we apply regularization in the form of batch normalization and dropout to the first three dense layers. Opinions differ on the order in which batch normalization and dropout should be applied, with some suggesting that dropout should come first. Also, it is quite common in dropout to use higher probabilities (e.g., 80\%) for input layers and relatively lower probabilities (e.g., 50\%) for other layers. But these choices differ by application, and the proposed order and hyperparameters were determined through experiments.

Having a large dropout probability works relatively well when training a neural network on a single dataset. But for the application of transfer learning, a small dropout probability worked better. Therefore, we set it to a constant 0.1 to retain most of the output from all layers.



## Baseline model

We are interested in the performance difference between models with and without transfer learning. Therefore, we run a baseline model that only learns from the target data.

The target data is split into 60\% train, 25\% validation, and 15\% test data. The vehicle age group and policyholder age group are categorical variables that are ordinal at the same time. These two features are label encoded and assigned a numerical value to preserve the order of the groups.

The outcome of interest "N_Claims" should be considered together with its corresponding "Exposure". In the model, we take this into account by creating bootstrap samples drawn with probabilities proportional to the "Exposure" feature.
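The sampling and splitting steps described above can be sketched on toy data. This is a minimal illustration rather than the experiment itself: the helper name `bootstrap_and_split` and the simulated `exposure` array are hypothetical, and a seeded generator is used only so that the sketch is repeatable.

```python
import numpy as np

def bootstrap_and_split(exposure, rng):
    """Exposure-weighted bootstrap followed by a 60/25/15 split of row indices."""
    n = len(exposure)
    # Draw n rows with replacement; rows with larger exposure are drawn more often.
    boot = rng.choice(n, size=n, replace=True, p=exposure / exposure.sum())
    # Shuffle positions, then cut: 60% train, and 62.5% of the remaining 40%
    # (25% of the total) for validation. The rest (15%) is the test set.
    perm = rng.permutation(n)
    n_train = round(n * 0.6)
    n_valid = round((n - n_train) * 0.625)
    train = boot[perm[:n_train]]
    valid = boot[perm[n_train:n_train + n_valid]]
    test = boot[perm[n_train + n_valid:]]
    return train, valid, test

rng = np.random.default_rng(0)           # seed chosen only for this toy example
exposure = rng.uniform(0.1, 1.0, 1000)   # hypothetical exposures
exposure[:10] = 0.0                      # rows with zero exposure can never be drawn
train, valid, test = bootstrap_and_split(exposure, rng)
print(len(train), len(valid), len(test))
```

Because sampling is done with replacement before the split, the same policy can appear in more than one of the three sets, which is also the case in the loop that follows.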

In [None]:
callback =tf.keras.callbacks.EarlyStopping(monitor='val_loss',min_delta=0,patience=20,verbose=0,mode='auto',baseline=None,restore_best_weights=True)

import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)
    
    #scaling target variables
    scalerBA = MinMaxScaler()
    final_target[['Age_Cat','Veh_age']] = scalerBA.fit_transform(final_target[['Age_Cat','Veh_age']])
    Xmy = final_target.drop(columns = ['N_Claims','Exposure']).values
    ymy = final_target['N_Claims'].values
    vmy= final_target['Exposure'].values
    
    
    #Creating bootstrap samples taking into account the sample weights based on exposure
    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.6), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    #training set
    Xtr = Xmy_bs[idx1]
    ytr = ymy_bs[idx1]
    vtr = vmy_bs[idx1]
    
    Xva_my = Xmy_bs[idx2]
    yva_my = ymy_bs[idx2]
    vva_my = vmy_bs[idx2]

    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(Xva_my)), round(len(Xva_my)*0.625), replace=False)
    idx4 = np.delete(np.arange(len(Xva_my)),idx3)
    
    #validation set
    Xva = Xva_my[idx3]
    yva = yva_my[idx3]
    vva = vva_my[idx3]
    
    #test set
    Xte = Xva_my[idx4]
    yte = yva_my[idx4]
    vte = vva_my[idx4]

    
    y_va = yva.reshape((len(yva),1))
    v_te = vte.reshape((len(vte),1))
    y_te = yte.reshape((len(yte),1))
    y_sam = ytr.reshape((len(ytr),1))
    v_sam = vtr.reshape((len(vtr),1))
    
    
    model_TL = NN_model()
    model_TL.fit(Xtr,y_sam,batch_size=512, callbacks=[callback], verbose=2, epochs=300, validation_data=(Xva,y_va))
    
    TL_preds = model_TL.predict(Xte)
    
    deviance = mean_poisson_deviance(y_te, TL_preds)
    error = mean_squared_error(y_te, TL_preds)
    
    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error

end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())


We set the model to have 1,000 iterations to produce 1,000 sets of results. For each iteration, we calculate the MPD (Mean Poisson Deviance) and MSE (Mean Squared Error) to determine the performance of the model. If the predictions are closer to the true outcome, then the deviance decreases, which indicates an improvement in the predictions. Our problem is a Poisson regression problem, so using MPD alongside the traditional regression metric MSE is a reasonable choice.
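To make the two metrics concrete, here is a small hand-written version of both, evaluated on toy claim counts. The function names and the toy numbers are illustrative assumptions. The deviance formula is the usual Poisson one, $\frac{2}{n}\sum_i \left(y_i\log(y_i/\mu_i) - y_i + \mu_i\right)$, with the log term taken as 0 when $y_i = 0$, which is intended to agree with `mean_poisson_deviance` from scikit-learn.

```python
import math

def mean_poisson_deviance_manual(y_true, y_pred):
    """(2/n) * sum(y*log(y/mu) - y + mu), with the log term set to 0 when y == 0."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - y + mu
    return 2.0 * total / len(y_true)

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between observations and predictions."""
    return sum((y - mu) ** 2 for y, mu in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 0, 2]               # toy claim counts
close_preds = [0.1, 0.9, 0.1, 1.8]  # predictions near the observed counts
flat_preds = [0.5, 0.5, 0.5, 0.5]   # the same prediction for every policy

for name, preds in [("close", close_preds), ("flat", flat_preds)]:
    print(name,
          mean_poisson_deviance_manual(y_true, preds),
          mean_squared_error_manual(y_true, preds))
```

Both metrics are smaller for the predictions that track the observed counts more closely. Note that the Poisson deviance requires strictly positive predictions, which the exponential output activation guarantees.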

The batch size is determined through a balance between runtime and gain in performance. Usually, a small batch size of 32 or 64 with a learning rate of 0.001 provides good predictions, but it greatly increases the runtime. We increased the batch size to 512 and the learning rate of the optimizer to 0.004, which is an increase by a factor of $\sqrt{16}$. Increasing the learning rate by the square root of the factor of increase in batch size resulted in less runtime with similar predictive performance.
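The learning-rate adjustment can be checked with a short calculation. The sketch assumes the reference batch size is 32, so that 512 is a 16-fold increase; the function name `sqrt_scaled_lr` is hypothetical.

```python
import math

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    """Scale the learning rate by the square root of the batch-size ratio."""
    return base_lr * math.sqrt(new_batch / base_batch)

# 512 / 32 = 16, and sqrt(16) = 4, so the learning rate is multiplied by 4.
new_lr = sqrt_scaled_lr(base_lr=0.001, base_batch=32, new_batch=512)
print(new_lr)
```

Starting from a batch size of 64 instead, the same rule would give a smaller factor of $\sqrt{8}$, so the value 0.004 corresponds to the 32-to-512 comparison.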

The baseline model has a MPD of 0.4610(0.0731) with a MSE of 0.1027(0.0316). Values in the parentheses are standard deviations.

Now we have the baseline model! Let's move on to apply transfer learning.

## Transfer learning methods with no target labels

We experiment with instance-based and feature-based methods that don't require target labels for the learning.

### Instance-based methods

Let's start with instance-based methods. The instance-based approach aims to minimize the differences in the marginal distributions of the source data and the target data by reweighting the instances. We experiment with KMM and KLIEP.

### KMM(Kernel Mean Matching)

KMM minimizes the difference in means between the input of source and target instances in a reproducing kernel Hilbert space (RKHS) by reweighting the source instances. Matching the mean is equivalent to minimizing the discrepancy of the marginal distributions between the source and target. 
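The quantity that KMM drives down can be written out directly. The sketch below only evaluates the squared RKHS distance between the weighted source mean and the target mean for two hand-picked weight vectors; it does not solve the constrained optimization that KMM performs to find the weights. The function names and toy points are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def weighted_mean_discrepancy(Xs, Xt, weights, gamma=3.0):
    """Squared RKHS distance between the weighted source mean and the target mean."""
    ns, nt = len(Xs), len(Xt)
    Kss = rbf_kernel(Xs, Xs, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    return (weights @ Kss @ weights / ns**2
            - 2.0 * weights @ Kst.sum(axis=1) / (ns * nt)
            + Ktt.sum() / nt**2)

Xs = np.array([[0.0], [1.0]])     # toy source inputs
Xt = np.array([[1.0], [1.0]])     # toy target inputs, all located at 1
uniform = np.array([1.0, 1.0])    # every source instance counts equally
reweighted = np.array([0.0, 2.0]) # all weight on the source point that matches the target

mmd_uniform = weighted_mean_discrepancy(Xs, Xt, uniform)
mmd_reweighted = weighted_mean_discrepancy(Xs, Xt, reweighted)
print(mmd_uniform, mmd_reweighted)
```

Shifting the weight toward the source instance that resembles the target removes the discrepancy in this toy case, which is the effect the reweighting is designed to achieve.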

In [None]:
import adapt
from adapt.instance_based import KMM
callback =tf.keras.callbacks.EarlyStopping(monitor='val_loss',min_delta=0,patience=20,verbose=0,mode='auto',baseline=None,restore_best_weights=True)

import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)

    #scaling source data
    scalerKMMA = MinMaxScaler()
    mysource1[['Age_Cat','Veh_age']] = scalerKMMA.fit_transform(mysource1[['Age_Cat','Veh_age']])
    Xmy = mysource1.drop(columns = ['N_Claims','Exposure']).values
    ymy = mysource1['N_Claims'].values
    vmy= mysource1['Exposure'].values
    
    #Creating bootstrap samples taking into account the sample weights based on exposure
    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.8), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)
    
    #training set (source)
    Xs = Xmy_bs[idx1]
    ys = ymy_bs[idx1]
    vs = vmy_bs[idx1]
    
    #validation set (source)
    Xva = Xmy_bs[idx2]
    yva = ymy_bs[idx2]
    vva = vmy_bs[idx2]        
        
    #split training and testing data within target
    train_target = final_target.sample(frac=0.6, random_state=None)
    test_target = final_target.drop(train_target.index)

    #scale target data
    scalerKMM1 = MinMaxScaler()
    train_target[['Age_Cat','Veh_age']] = scalerKMM1.fit_transform(train_target[['Age_Cat','Veh_age']])
    Xtr = train_target.drop(columns = ['N_Claims','Exposure']).values
    ytr = train_target['N_Claims'].values
    vtr = train_target['Exposure'].values
    dump(scalerKMM1, open('scalerKMM1.pkl', 'wb'))

    scalerKMM1 = load(open('scalerKMM1.pkl', 'rb'))
    test_target[['Age_Cat','Veh_age']] = scalerKMM1.transform(test_target[['Age_Cat','Veh_age']])
    Xte = test_target.drop(columns = ['N_Claims','Exposure']).values
    yte = test_target['N_Claims'].values
    vte = test_target['Exposure'].values
    
    #training set (target)
    idx = np.random.choice(np.arange(len(Xtr)), len(Xtr), replace=True, p=vtr/vtr.sum())
    x_sample = Xtr[idx]
    y_sample = ytr[idx]
    v_sample = vtr[idx]

    y_s = ys.reshape((len(ys),1))
    v_s = vs.reshape((len(vs),1))
    y_va = yva.reshape((len(yva),1))
    v_te = vte.reshape((len(vte),1))
    y_te = yte.reshape((len(yte),1))
    y_sam = y_sample.reshape((len(y_sample),1))
    v_sam = v_sample.reshape((len(v_sample),1))
    
    
    model_TL = KMM(NN_model(), Xt=x_sample, kernel="rbf", gamma=3)
    model_TL.fit(X=Xs, y=y_s, Xt=x_sample, callbacks=[callback], validation_data=(Xva,y_va), epochs=300, batch_size=512, verbose=2)
    
    TL_preds = model_TL.predict(Xte)
    
    deviance = mean_poisson_deviance(y_te, TL_preds)
    error = mean_squared_error(y_te, TL_preds)
    
    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error
    

end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())

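#write results to Excel; the context manager saves and closes the file (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('KMM.xlsx', engine='xlsxwriter') as writer:
    BS.to_excel(writer, sheet_name='welcome', index=False)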

The value for parameter 'gamma' is determined by cross-validation.
The KMM model has a MPD of 0.4113(0.0237) with a MSE of 0.0816(0.0078) which is an improvement compared to the baseline model.

### KLIEP(Kullback–Leibler Importance Estimation Procedure)

KLIEP minimizes the Kullback-Leibler(KL) discrepancy between the source and target marginal distribution. This is achieved by reweighting the source instances to minimize the KL divergence.

In [None]:
import adapt
from adapt.instance_based import KLIEP

model_TL = KLIEP(NN_model(), Xt=x_sample, kernel="rbf", gamma=1)
model_TL.fit(Xs, y_s, Xt=x_sample, callbacks=[callback], validation_data=(Xva,y_va), epochs=300, batch_size=512, verbose=2)

The KLIEP model follows the same source training, source validation, target training, and target test data split used for the KMM model. These two methods require the source input, source label, and the target input. 

The value of the parameter 'gamma' is determined by cross-validation. Providing a list of candidate 'gamma' values to the model automatically triggers a 5-fold cross-validation to determine the best 'gamma'. The KLIEP model has a MPD of 0.4152(0.0209) with a MSE of 0.0802(0.0059), which improves the performance compared to the baseline.

### Feature-based methods

Let's consider feature-based methods. The idea of feature-based approaches is to minimize the differences between the source and target by transforming the features. We experiment with CORAL and Deep CORAL.

### CORAL(Correlation Alignment)

CORAL transforms the source features to minimize the discrepancy between the correlation matrices of the transformed source
and the non-transformed target. A linear transformation is applied to the covariance matrix of the source features which minimizes the distance between the transformed source covariance and the non-transformed target covariance.
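A common closed-form way to carry out this alignment is to whiten the source features and then re-color them with the target covariance. The sketch below follows that recipe on simulated data; the helper names are hypothetical, and the assumption that this matches the internals of the ADAPT implementation has not been verified here.

```python
import numpy as np

def matrix_power_sym(C, power):
    """Power of a symmetric positive-definite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * vals**power) @ vecs.T

def coral_transform(Xs, Xt, lambda_=1e-5):
    """Whiten the source features, then re-color them with the target covariance."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + lambda_ * np.eye(d)  # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + lambda_ * np.eye(d)  # regularized target covariance
    return Xs @ matrix_power_sym(Cs, -0.5) @ matrix_power_sym(Ct, 0.5)

rng = np.random.default_rng(0)   # seed chosen only for this toy example
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
Xs = rng.normal(size=(500, 3))            # simulated source features
Xt = rng.normal(size=(400, 3)) @ mixing   # simulated target features with a different covariance

Xs_aligned = coral_transform(Xs, Xt)
gap_before = np.linalg.norm(np.cov(Xs, rowvar=False) - np.cov(Xt, rowvar=False))
gap_after = np.linalg.norm(np.cov(Xs_aligned, rowvar=False) - np.cov(Xt, rowvar=False))
print(gap_before, gap_after)
```

After the transformation, the covariance of the source features is nearly identical to that of the target, while the target features themselves are left unchanged.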

In [None]:
import adapt
from adapt.feature_based import CORAL

model_TL = CORAL(NN_model(), Xt=x_sample)
model_TL.fit(X=Xs, y=y_s, Xt=x_sample, callbacks=[callback], validation_data=(Xva,y_va), epochs=300, batch_size=512, verbose=2)

CORAL requires a parameter 'lambda_' which is the regularization parameter that affects the level of adaptation. In the
experiment, we used the default lambda of 0.00001 to allow a higher level of adaptation between the covariances of the source and target inputs.

The CORAL model has a MPD of 0.4032(0.0427) with a MSE of 0.0789(0.0213) which is better than the instance-based methods KMM and KLIEP.

### Deep CORAL

Deep CORAL is an extension of CORAL where the learning is done in two parts. First, an encoder network is trained to learn new feature representations such that the correlation matrices of the source and target data are close. Second, these new feature representations are used for the source data to enable a task network to learn a model that predicts the frequency of claims.

The total loss to minimize can now be thought of in two parts: a CORAL loss from the encoder network, which is the difference in covariance matrices between the source and target input, and a task loss for our prediction problem from the task network.

We set the encoder model as a shallow network with 6 neurons and tanh activation. The task model is the same as the baseline neural network model. 

In [None]:
#define NN encoder model

def NN_model0(input_shape=(6,)):
    model = Sequential()
    model.add(Dense(6, activation='tanh')) #single encoder layer
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    return model

In [None]:
#define NN task model

def NN_model1():
    input_init = Input(shape=(6,))
    
    hidden1 = Dense(12)(input_init) #Hidden 1
    batch1 = BatchNormalization()(hidden1) #BatchNormalization1
    drop1 = Dropout(0.1)(batch1) #Dropout1
    act1 = Activation('tanh')(drop1)

    hidden2 = Dense(9)(act1) #Hidden 2
    batch2 = BatchNormalization()(hidden2) #BatchNormalization2
    drop2 = Dropout(0.1)(batch2) #Dropout2
    act2 = Activation('tanh')(drop2)

    hidden3 = Dense(6)(act2) #Hidden 3
    batch3 = BatchNormalization()(hidden3) #BatchNormalization3
    drop3 = Dropout(0.1)(batch3) #Dropout3
    act3 = Activation('tanh')(drop3)
    lin = Dense(1, activation='linear')(act3)
    
    output = Dense(1, activation='exponential',trainable = False)(lin) #Output
    model = Model(inputs=input_init,outputs=output)
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    return model

In [None]:
import adapt
from adapt.feature_based import DeepCORAL
modelEn = NN_model0()
modelTk = NN_model1()

model_TL = DeepCORAL(encoder=modelEn,task=modelTk, Xt=x_sample, lambda_= 0.8)
model_TL.fit(X=Xs, y=y_s, Xt=x_sample,callbacks=[callback], validation_data=(Xva,y_va), epochs=300, batch_size=512)

$\mathcal{L}_{total}=\mathcal{L}_{task}+\lambda\mathcal{L}_{coral}$

The parameter 'lambda_' balances the CORAL loss and the task loss, so that the model learned on the source data with the new feature representations makes good predictions on the target data. A higher value of 'lambda_' penalizes the difference in covariance matrices between the source and target more heavily. We set 'lambda_' to 0.8, which was determined by cross-validation.
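The way the two losses combine can be sketched with NumPy. The CORAL term below is the squared Frobenius distance between the covariance matrices of the encoded source and target features, scaled by $1/(4d^2)$ as in the original Deep CORAL formulation; the scaling used inside ADAPT may differ, and the function names and toy features are assumptions made for illustration.

```python
import numpy as np

def coral_loss(Hs, Ht):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2)."""
    d = Hs.shape[1]
    Cs = np.cov(Hs, rowvar=False)
    Ct = np.cov(Ht, rowvar=False)
    return np.sum((Cs - Ct) ** 2) / (4.0 * d**2)

def total_loss(task_loss, Hs, Ht, lambda_=0.8):
    """L_total = L_task + lambda * L_coral."""
    return task_loss + lambda_ * coral_loss(Hs, Ht)

rng = np.random.default_rng(0)       # seed chosen only for this toy example
Hs = rng.normal(size=(200, 6))       # hypothetical encoded source features
Ht_same = Hs.copy()                  # target with an identical covariance
Ht_shifted = 2.0 * Hs                # target whose covariance is four times larger

print(total_loss(0.4, Hs, Ht_same))                  # CORAL term vanishes
print(total_loss(0.4, Hs, Ht_shifted))               # CORAL term adds a penalty
print(total_loss(0.4, Hs, Ht_shifted, lambda_=2.0))  # a larger lambda penalizes more
```

When the encoded source and target features share the same covariance, only the task loss remains, which is the situation the encoder is trained to approach.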

The Deep CORAL model has a MPD of 0.4006(0.0135) with a MSE of 0.0778(0.0050) which is better than the CORAL model.   

## Transfer learning models with target labels

We experiment with instance-based, feature-based, and parameter-based methods that require target labels for the learning.

### Instance-based method

For the instance-based approach, we examine the Transfer AdaBoost for regression (“TrAdaBoostR2”) which is a boosting-based approach that uses reverse boosting to update the weights. 

### TrAdaBoostR2(Transfer AdaBoost for Regression)

In [None]:
import adapt
from adapt.instance_based import TrAdaBoostR2

model_TL = TrAdaBoostR2(NN_model(), n_estimators=10, lr=0.00001, Xt=x_sample, yt=y_sam)
model_TL.fit(Xs, y_s, Xt=x_sample,yt=y_sam, callbacks=[callback], validation_data=(Xva,y_va), epochs=300, batch_size=512, verbose=2)

TrAdaBoostR2 is a boosting-based method that requires the number of boosting iterations to be defined. This is the parameter 'n_estimators', which is 10 by default. The learning rate 'lr', which determines how fast the weights are updated, is set to 0.00001. In the fit function of the code, we can see that the target labels 'y_sam' are used at the training stage.

The TrAdaBoostR2 model has a MPD of 0.3950(0.0121) with a MSE of 0.0776(0.0049) which is better than all the models that didn't have access to target labels.

### Feature-based method

For the feature-based approach, we implement the Feature Augmentation (“FA”) method.

### FA(Feature Augmentation)

Feature Augmentation is done by adding null vectors to both the source and target vectors which results in all the features having three components.
The source feature Xs becomes (Xs, 0, Xs) and the target feature becomes (0, Xt, Xt).
The transformed source and target data are combined and used as the training data to learn a model which is then used to predict the outcome of the test data. 

We have 6 features for the source and target, so the transformed source and target will have 18 features. Therefore, we modify the neural network to take into account the change in input size.
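The change in input size is easy to verify on toy arrays. The helper names `augment_source` and `augment_target` and the toy inputs are hypothetical; the mapping itself is the one described above.

```python
import numpy as np

def augment_source(X):
    """Map each source row x to (x, 0, x)."""
    return np.concatenate((X, np.zeros_like(X), X), axis=-1)

def augment_target(X):
    """Map each target row x to (0, x, x)."""
    return np.concatenate((np.zeros_like(X), X, X), axis=-1)

Xs_toy = np.arange(12, dtype=float).reshape(2, 6)  # two source rows with 6 features
Xt_toy = np.ones((3, 6))                           # three target rows with 6 features

# Stack the augmented source and target rows into one training matrix.
X_train = np.concatenate((augment_source(Xs_toy), augment_target(Xt_toy)))
print(X_train.shape)
```

The last block of 6 columns is shared by both domains, while the first and second blocks are specific to the source and the target respectively. This lets the network learn common and domain-specific effects from the same input.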

In [None]:
#define NN model

def NN_model():
    input_init = Input(shape=(18,))
    
    hidden1 = Dense(36)(input_init) #Hidden 1
    batch1 = BatchNormalization()(hidden1) #BatchNormalization1
    drop1 = Dropout(0.1)(batch1) #Dropout1
    act1 = Activation('tanh')(drop1)

    hidden2 = Dense(27)(act1) #Hidden 2
    batch2 = BatchNormalization()(hidden2) #BatchNormalization2
    drop2 = Dropout(0.1)(batch2) #Dropout2
    act2 = Activation('tanh')(drop2)

    hidden3 = Dense(18)(act2) #Hidden 3
    batch3 = BatchNormalization()(hidden3) #BatchNormalization3
    drop3 = Dropout(0.1)(batch3) #Dropout3
    act3 = Activation('tanh')(drop3)
    lin = Dense(1, activation='linear')(act3)
    
    output = Dense(1, activation='exponential',trainable = False)(lin) #Output
    model = Model(inputs=input_init,outputs=output)
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    return model

In [None]:
import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)

    #scaling source variables
    
    scalerFA = MinMaxScaler()
    mysource1[['Age_Cat','Veh_age']] = scalerFA.fit_transform(mysource1[['Age_Cat','Veh_age']])
    Xmy = mysource1.drop(columns = ['N_Claims','Exposure']).values
    ymy = mysource1['N_Claims'].values
    vmy= mysource1['Exposure'].values

    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.8), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    Xs = Xmy_bs[idx1]
    ys = ymy_bs[idx1]
    vs = vmy_bs[idx1]

    Xva = Xmy_bs[idx2]
    yva = ymy_bs[idx2]
    vva = vmy_bs[idx2]
        
    #split training and testing data within target
    train_target = final_target.sample(frac=0.6, random_state=None)
    test_target = final_target.drop(train_target.index)

    #scale target data separately
    scalerFA1 = MinMaxScaler()
    train_target[['Age_Cat','Veh_age']] = scalerFA1.fit_transform(train_target[['Age_Cat','Veh_age']])
    Xtr = train_target.drop(columns = ['N_Claims','Exposure']).values
    ytr = train_target['N_Claims'].values
    vtr = train_target['Exposure'].values
    dump(scalerFA1, open('scalerFA1.pkl', 'wb'))

    scalerFA1 = load(open('scalerFA1.pkl', 'rb'))
    test_target[['Age_Cat','Veh_age']] = scalerFA1.transform(test_target[['Age_Cat','Veh_age']])
    Xte = test_target.drop(columns = ['N_Claims','Exposure']).values
    yte = test_target['N_Claims'].values
    vte = test_target['Exposure'].values
        
    idx = np.random.choice(np.arange(len(Xtr)), len(Xtr), replace=True, p=vtr/vtr.sum())
    x_sample = Xtr[idx]
    y_sample = ytr[idx]
    v_sample = vtr[idx]
    
    y_s = ys.reshape((len(ys),1))
    v_s = vs.reshape((len(vs),1))
    y_va = yva.reshape((len(yva),1))
    v_te = vte.reshape((len(vte),1))
    y_te = yte.reshape((len(yte),1))
    y_sam = y_sample.reshape((len(y_sample),1))
    v_sam = v_sample.reshape((len(v_sample),1))
    
    #transform source and target data
    Xs_emb = np.concatenate((Xs,np.zeros((len(Xs), Xs.shape[-1])),Xs),axis=-1)
    Xtr_emb = np.concatenate((np.zeros((len(x_sample), x_sample.shape[-1])),x_sample,x_sample),axis=-1)
    Xva_emb = np.concatenate((Xva,np.zeros((len(Xva), Xva.shape[-1])),Xva),axis=-1)
    Xte_emb = np.concatenate((np.zeros((len(Xte), Xte.shape[-1])),Xte,Xte),axis=-1)
    
    #Augment the transformed source and target data
    X_aug = np.concatenate((Xs_emb, Xtr_emb))
    y_aug = np.concatenate((y_s, y_sam))
    v_aug = np.concatenate((v_s, v_sam))
    
    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(X_aug)), round(len(X_aug)*0.8), replace=False)
    idx4 = np.delete(np.arange(len(X_aug)),idx3)

    final_x = X_aug[idx3]
    final_y = y_aug[idx3]
    final_v = v_aug[idx3]

    valid_x = X_aug[idx4]
    valid_y = y_aug[idx4]
    
    cf = sum(y_sam)/sum(v_sam) # claim frequency
    Int_ClaimNB = cf*v_te

    model_TL = NN_model()
    model_TL.fit(final_x, final_y, callbacks=[callback], validation_data=(valid_x,valid_y), epochs=300, batch_size=512, verbose=2)
    
    TL_preds = model_TL.predict(Xte_emb)
    
    deviance = mean_poisson_deviance(y_te, TL_preds)
    error = mean_squared_error(y_te, TL_preds)
    
    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error


end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())

The FA model has a MPD of 0.3915(0.0196) with a MSE of 0.0775(0.0065) which is the best of all models in the transductive learning application. FA is a simple feature transformation that can be applied to the source and target data along with applying other transfer learning methods.

### Parameter-based methods

The parameter-based approach assumes that a good estimator for the target data can be learned by utilizing the parameters of the source estimator. Specifically, transfer learning is done through shared parameters. In the experiment, we consider two approaches.

### Regularized Transfer Neural Networks(TR)

A neural network model is trained on the source data then its parameters are obtained. Then we use the target data to
train a new model using the parameters transferred from the source model. Here the final parameters for the new model are obtained by regularizing the distance between the transferred source parameters and the parameters learned using the target data.

In [None]:
import adapt
from adapt.parameter_based import RegularTransferNN
callback =tf.keras.callbacks.EarlyStopping(monitor='val_loss',min_delta=0,patience=20,verbose=0,mode='auto',baseline=None,restore_best_weights=True)

import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error','Actual','Predicted','PD_Int'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)
        
    #scale source data separately
    scalerP1A = MinMaxScaler()
    mysource1[['Age_Cat','Veh_age']] = scalerP1A.fit_transform(mysource1[['Age_Cat','Veh_age']])
    Xmy = mysource1.drop(columns = ['N_Claims','Exposure']).values
    ymy = mysource1['N_Claims'].values
    vmy= mysource1['Exposure'].values

    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy[idx_boot]
    ymy_bs = ymy[idx_boot]
    vmy_bs = vmy[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.8), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    Xst = Xmy_bs[idx1]
    yst = ymy_bs[idx1]
    vst = vmy_bs[idx1]

    Xsv = Xmy_bs[idx2]
    ysv = ymy_bs[idx2]
    vsv = vmy_bs[idx2]
       
    #split training and testing data within target
    
    scalerP1 = MinMaxScaler()
    final_target[['Age_Cat','Veh_age']] = scalerP1.fit_transform(final_target[['Age_Cat','Veh_age']])
    Xty = final_target.drop(columns = ['N_Claims','Exposure']).values
    yty = final_target['N_Claims'].values
    vty= final_target['Exposure'].values

    idx_boot = np.random.choice(np.arange(len(Xty)), len(Xty), replace=True, p=vty/vty.sum())
    Xty_bs = Xty[idx_boot]
    yty_bs = yty[idx_boot]
    vty_bs = vty[idx_boot]

    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(Xty_bs)), round(len(Xty_bs)*0.6), replace=False)
    idx4 = np.delete(np.arange(len(Xty_bs)),idx3)

    x_sample = Xty_bs[idx3]
    y_sample = yty_bs[idx3]
    v_sample = vty_bs[idx3]
    
    Xva_my = Xty_bs[idx4]
    yva_my = yty_bs[idx4]
    vva_my = vty_bs[idx4]

    rng2 = np.random.default_rng()
    idx5 = rng2.choice(np.arange(len(Xva_my)), round(len(Xva_my)*0.625), replace=False)
    idx6 = np.delete(np.arange(len(Xva_my)),idx5)
    
    Xva = Xva_my[idx5]
    yva = yva_my[idx5]
    vva = vva_my[idx5]
    
    Xte = Xva_my[idx6]
    yte = yva_my[idx6]
    vte = vva_my[idx6]

    #train source model
    src_model = RegularTransferNN(task=NN_model(),loss="poisson",lambdas=0)
    src_model.fit(Xst, yst, callbacks=[callback], validation_data=(Xsv,ysv), epochs=300, batch_size=512)
    
    #train regularized model
    modelTNN = RegularTransferNN(src_model.task_,loss="poisson", lambdas=0.5)
    modelTNN.fit(x_sample, y_sample, callbacks=[callback], validation_data=(Xva,yva), epochs=300, batch_size=512, verbose=2)
    
    TL_preds = modelTNN.predict(Xte).ravel()
    deviance = mean_poisson_deviance(yte, TL_preds)
    #MSE on the held-out target test set
    error = mean_squared_error(yte, TL_preds)
    
    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error

end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())

The parameter 'lambdas' controls how strongly the network parameters are pulled toward those of the source model. For the source model it is set to 0, since that model is trained on the source data alone and no penalty applies. For the regularized model it is set to 0.5, so that the difference between the source and target network parameters is penalized while training on the target data.
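To make the role of 'lambdas' concrete, the following is a minimal NumPy sketch of a parameter-regularized objective: the Poisson loss on the target data plus `lam` times the squared distance between the target and source weights. It is an illustration under assumptions, not ADAPT's internal code; the function names `poisson_loss` and `regularized_objective` and the toy numbers are hypothetical.

```python
import numpy as np

def poisson_loss(y_true, y_pred, eps=1e-7):
    """Keras-style Poisson loss: mean(y_pred - y_true * log(y_pred + eps))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred - y_true * np.log(y_pred + eps)))

def regularized_objective(y_true, y_pred, target_weights, source_weights, lam):
    """Task loss plus lam * squared L2 distance between target and source weights."""
    penalty = sum(
        float(np.sum((np.asarray(wt) - np.asarray(ws)) ** 2))
        for wt, ws in zip(target_weights, source_weights)
    )
    return poisson_loss(y_true, y_pred) + lam * penalty

# Toy numbers (hypothetical, for illustration only)
y = np.array([0.0, 1.0, 0.0, 2.0])
mu = np.array([0.1, 0.8, 0.2, 1.5])
w_source = [np.array([[0.5, -0.2]]), np.array([0.1])]
w_target = [np.array([[0.7, -0.2]]), np.array([0.0])]

plain = regularized_objective(y, mu, w_target, w_source, lam=0.0)
penalized = regularized_objective(y, mu, w_target, w_source, lam=0.5)
print(plain, penalized)
```

With `lam=0.0` the objective reduces to the plain Poisson loss, which corresponds to the source-only training. With `lam=0.5` every unit of squared distance from the source weights adds 0.5 to the objective, so the fitted target weights are kept close to the source weights.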

The Regularized Transfer model has an MPD of 0.4663 (0.0445) and an MSE of 0.1018 (0.0204), where the values in parentheses are the standard deviations over the bootstrap replicates. This is worse than the baseline model.
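The two metrics can also be computed by hand. The sketch below, assuming only NumPy, evaluates the mean Poisson deviance as 2 * mean(y * log(y / mu) - y + mu), with y * log(y / mu) taken as 0 when y = 0, together with the MSE on a toy example. The helper names `poisson_deviance` and `mse` are hypothetical; the notebook itself relies on the scikit-learn functions imported at the top.

```python
import numpy as np

def poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance; predictions must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    # y * log(y / mu) is taken to be 0 where y == 0
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)
    log_term = y_true * np.log(ratio)
    return float(2.0 * np.mean(log_term - y_true + y_pred))

def mse(y_true, y_pred):
    """Mean squared error between observed counts and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

# Toy claim counts and predicted frequencies (hypothetical)
y_toy = np.array([0.0, 1.0])
mu_toy = np.array([0.5, 1.0])

dev_toy = poisson_deviance(y_toy, mu_toy)
mse_toy = mse(y_toy, mu_toy)
print(dev_toy, mse_toy)
```

Both metrics are zero for a perfect prediction and grow as predictions move away from the observed counts, so lower values indicate a better model.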

### Transfer Embeddings (TR_emb)

Next we consider transferring certain layers from the trained source model to the target model and training the other layers with the target data. In this experiment, we transfer the embedding layers from the source to the target model, since all of our features are categorical. The code is adapted from [Entity Embedding Neural Net](https://www.kaggle.com/code/aquatic/entity-embedding-neural-net) and modified for this application.

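The source and target data must be processed differently from the other methods. Here Veh_type is ordinal-encoded to integer codes, and Age_Cat and Veh_age are left unscaled, since each feature is fed to its own embedding layer.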
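Because the embedding weights are later shared between source and target, the integer code of each category should mean the same thing in both datasets. An encoder fitted separately on each dataset only guarantees this when both contain the same set of categories. The sketch below is a precaution under that assumption, not the notebook's own preprocessing: it fixes the category order A, O, T explicitly with `pd.Categorical`. The helper `encode_veh_type` and the toy frames are hypothetical.

```python
import pandas as pd

VEH_TYPES = ["A", "O", "T"]  # auto, others, truck

def encode_veh_type(df, categories=VEH_TYPES):
    """Return a copy of df with Veh_type replaced by fixed integer codes."""
    out = df.copy()
    codes = pd.Categorical(out["Veh_type"], categories=categories).codes
    out["Veh_type"] = codes.astype("int64")
    return out

# Toy data (hypothetical): the target sample happens to contain no "O" rows
source_toy = pd.DataFrame({"Veh_type": ["A", "T", "O", "A"]})
target_toy = pd.DataFrame({"Veh_type": ["T", "A"]})

source_enc = encode_veh_type(source_toy)
target_enc = encode_veh_type(target_toy)
print(source_enc["Veh_type"].tolist(), target_enc["Veh_type"].tolist())
```

With a fixed category list, "T" maps to 2 in both frames. An encoder fitted on the toy target alone would map "T" to 1, which would point at the wrong row of a transferred embedding matrix.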

In [None]:
#import source data
mysource= pd.read_excel("/home/y/ykim775/PUBLIC_web/Data/Source_datav1.xlsx")

#process source data for use
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
mysource[['Veh_type']] = encoder.fit_transform(mysource[['Veh_type']])
print(mysource)

#import target data
mytarget= pd.read_excel("/home/y/ykim775/PUBLIC_web/Data/Target_datav1.xlsx")

#process target data for use
encoder = OrdinalEncoder()
mytarget[['Veh_type']] = encoder.fit_transform(mytarget[['Veh_type']])
print(mytarget)

We first define the embedding network that takes in the categorical features, creates the embeddings, and trains a model using these embeddings.
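As an illustration of what an embedding layer does, the following NumPy sketch treats each embedding as a lookup table whose rows are selected by the integer category codes, and then concatenates the looked-up rows into the input of the first dense layer. The table sizes match the (levels, dimensions) pairs used in the network; the random values and the toy codes are hypothetical stand-ins for learned weights and real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# (number of levels, embedding dimension) per feature, as in the network
sizes = {"Female": (2, 1), "Veh_type": (3, 2), "Veh_age": (4, 2), "Age_Cat": (5, 2)}

# Random stand-ins for the learned embedding matrices
tables = {name: rng.normal(size=shape) for name, shape in sizes.items()}

# Integer codes for three toy policies
codes = {
    "Female": np.array([1, 0, 1]),
    "Veh_type": np.array([0, 2, 1]),
    "Veh_age": np.array([3, 0, 2]),
    "Age_Cat": np.array([4, 1, 0]),
}

# An embedding lookup is row indexing; concatenation gives the dense input
dense_input = np.concatenate([tables[n][codes[n]] for n in sizes], axis=1)
print(dense_input.shape)
```

Each policy is represented by 1 + 2 + 2 + 2 = 7 numbers, and two policies with the same category share the same embedding row for that feature.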

In [None]:
#define embedding network
def build_embedding_network():
    
    inputs = []
    embeddings = []

    Female_input = Input(shape=(1,), name='Female')
    Fem_emb = Embedding(2, 1, input_length=1, name='Fem_emb')(Female_input)
    Flat1 = Reshape(target_shape=(1,))(Fem_emb)
    inputs.append(Female_input)
    embeddings.append(Flat1)

    Veh_type_input = Input(shape=(1,), name='Veh_type')
    VehType_emb = Embedding(3, 2, input_length=1, name='VehType_emb')(Veh_type_input)
    Flat2 = Reshape(target_shape=(2,))(VehType_emb)
    inputs.append(Veh_type_input)
    embeddings.append(Flat2)

    Veh_age_input = Input(shape=(1,), name='Veh_age')
    VehAge_emb = Embedding(4, 2, input_length=1, name='VehAge_emb')(Veh_age_input)
    Flat3 = Reshape(target_shape=(2,))(VehAge_emb)
    inputs.append(Veh_age_input)
    embeddings.append(Flat3)

    Age_Cat_input = Input(shape=(1,), name='Age_Cat')
    AgeCat_emb = Embedding(5, 2, input_length=1, name='AgeCat_emb')(Age_Cat_input)
    Flat4 = Reshape(target_shape=(2,))(AgeCat_emb)
    inputs.append(Age_Cat_input)
    embeddings.append(Flat4)

    x = Concatenate()(embeddings)
    x = Dense(14)(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = Activation('tanh')(x)
    x = Dense(10)(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = Activation('tanh')(x)
    x = Dense(6)(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = Activation('tanh')(x)
    x = Dense(1, activation='linear')(x)
    
    output = Dense(1, activation='exponential', trainable= False)(x)
    
    model = Model(inputs, output)
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    
    return model

Then we define the model that takes the embeddings learned by the embedding network and uses them as the initial embedding weights when learning the final model.

In [None]:
#define embedding model
import pickle

def build_embedding_model():
    
    inputs = []
    embeddings = []
    
    #load the embedding weights saved from the embedding network
    saved = []
    for k in range(1, 5):
        with open('EW%d.dict' % k, "rb") as f:
            saved.append(pickle.load(f))
    EW1, EW2, EW3, EW4 = saved

    Female_input = Input(shape=(1,), name='Female')
    Fem_emb = Embedding(2, 1, input_length=1, name='Fem_emb', weights = [EW1])(Female_input)
    Flat1 = Reshape(target_shape=(1,))(Fem_emb)
    inputs.append(Female_input)
    embeddings.append(Flat1)

    Veh_type_input = Input(shape=(1,), name='Veh_type')
    VehType_emb = Embedding(3, 2, input_length=1, name='VehType_emb', weights = [EW2])(Veh_type_input)
    Flat2 = Reshape(target_shape=(2,))(VehType_emb)
    inputs.append(Veh_type_input)
    embeddings.append(Flat2)

    Veh_age_input = Input(shape=(1,), name='Veh_age')
    VehAge_emb = Embedding(4, 2, input_length=1, name='VehAge_emb', weights = [EW3])(Veh_age_input)
    Flat3 = Reshape(target_shape=(2,))(VehAge_emb)
    inputs.append(Veh_age_input)
    embeddings.append(Flat3)

    Age_Cat_input = Input(shape=(1,), name='Age_Cat')
    AgeCat_emb = Embedding(5, 2, input_length=1, name='AgeCat_emb', weights = [EW4])(Age_Cat_input)
    Flat4 = Reshape(target_shape=(2,))(AgeCat_emb)
    inputs.append(Age_Cat_input)
    embeddings.append(Flat4)

    x = Concatenate()(embeddings)
    x = Dense(14)(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = Activation('tanh')(x)
    x = Dense(10)(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = Activation('tanh')(x)
    x = Dense(6)(x)
    x = BatchNormalization()(x)
    x = Dropout(0.1)(x)
    x = Activation('tanh')(x)
    x = Dense(1, activation='linear')(x)
    
    output = Dense(1, activation='exponential', trainable= False)(x)
    
    model = Model(inputs, output)
    model.compile(optimizer=Nadam(0.004), loss='poisson')
    
    return model

We also need to define the function that converts data into a list format to match the embedding network structure and the embedding model structure.
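The conversion can be illustrated on a toy example. The sketch below is a simplified variant of the mapping logic, assuming pandas and NumPy: each column's raw values are mapped to 0..k-1 based on the values seen in training, and one array per column is returned, which is the list format a multi-input Keras model expects. The function name `to_input_list` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def to_input_list(X_train, X_other, cols):
    """Map raw values to 0..k-1 (from training values); one array per column."""
    train_list, other_list = [], []
    for c in cols:
        val_map = {v: i for i, v in enumerate(np.unique(X_train[c]))}
        train_list.append(X_train[c].map(val_map).values)
        # values unseen in training become NaN and are sent to index 0
        other_list.append(X_other[c].map(val_map).fillna(0).values)
    return train_list, other_list

# Toy data (hypothetical): Veh_age level 3 never appears in training
train = pd.DataFrame({"Veh_age": [1, 2, 4], "Female": [0, 1, 1]})
valid = pd.DataFrame({"Veh_age": [4, 3], "Female": [1, 0]})

tr_list, va_list = to_input_list(train, valid, ["Female", "Veh_age"])
print(tr_list, va_list)
```

Note that the codes depend on which levels appear in the training sample: here Veh_age 4 becomes code 2 because level 3 is absent. When embedding weights are transferred, it is worth checking that a given code refers to the same level in the source and target samples.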

In [None]:
#convert data to list format to match the embedding model structure
def preproc(Xtr, Xva, Xte):

    input_list_train = []
    input_list_val = []
    input_list_test = []
    
    #the cols to be embedded: rescaling to range [0, # values)
    for c in embed_cols:
        raw_vals = np.unique(Xtr[c])
        val_map = {}
        for i in range(len(raw_vals)):
            val_map[raw_vals[i]] = i       
        input_list_train.append(Xtr[c].map(val_map).values)
        input_list_val.append(Xva[c].map(val_map).fillna(0).values)
        input_list_test.append(Xte[c].map(val_map).fillna(0).values)
    
    return input_list_train, input_list_val, input_list_test    


#convert data to list format to match the embedding network
def proproc(Xtr, Xva):

    input_list_train = []
    input_list_val = []
    
    #the cols to be embedded: rescaling to range [0, # values)
    for c in embed_cols:
        raw_vals = np.unique(Xtr[c])
        val_map = {}
        for i in range(len(raw_vals)):
            val_map[raw_vals[i]] = i       
        input_list_train.append(Xtr[c].map(val_map).values)
        input_list_val.append(Xva[c].map(val_map).fillna(0).values)
    
    return input_list_train, input_list_val   

We start by learning the embeddings using the embedding network. The embeddings are saved to disk and loaded back into the embedding model for training.
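The save-and-load step can be sketched in isolation. The example below, assuming only the standard library and NumPy, writes a weight matrix with `pickle` and reads it back, using `with` blocks so that the files are closed and the loaded value is actually assigned. The helpers `save_weights` and `load_weights` are hypothetical, and the toy matrix stands in for the output of `get_weights()[0]` on an embedding layer.

```python
import os
import pickle
import tempfile

import numpy as np

def save_weights(weights, path):
    """Pickle a weight matrix to disk."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)

def load_weights(path):
    """Load a pickled weight matrix from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for a (3 levels x 2 dimensions) Veh_type embedding matrix
EW2_toy = np.arange(6, dtype="float32").reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "EW2.dict")
    save_weights(EW2_toy, path)
    EW2_loaded = load_weights(path)

print(EW2_loaded.shape)
```

Since the source and target models are built in the same session, the weights could also be passed directly as function arguments; writing them to disk mainly helps when the two steps are run separately.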

In [None]:
import pickle
import time
start = time.time()
N = 1000
BS = pd.DataFrame(columns=['PD','Error'])
for i in range(N):
    if (i % 200 == 0):
        print("Iteration", i)
    #split training and testing data within source
    
    Xmy = mysource.drop(columns = ['N_Claims','Exposure'])
    ymy = mysource['N_Claims']
    vmy= mysource['Exposure']

    idx_boot = np.random.choice(np.arange(len(Xmy)), len(Xmy), replace=True, p=vmy/vmy.sum())
    Xmy_bs = Xmy.iloc[idx_boot]
    ymy_bs = ymy.iloc[idx_boot]
    vmy_bs = vmy.iloc[idx_boot]

    rng = np.random.default_rng()
    idx1 = rng.choice(np.arange(len(Xmy_bs)), round(len(Xmy_bs)*0.8), replace=False)
    idx2 = np.delete(np.arange(len(Xmy_bs)),idx1)

    xs_sample = Xmy_bs.iloc[idx1]
    ys_sample = ymy_bs.iloc[idx1]
    vs_sample = vmy_bs.iloc[idx1]

    Xvs = Xmy_bs.iloc[idx2]
    yvs = ymy_bs.iloc[idx2]
    vvs = vmy_bs.iloc[idx2]

    cols_use = [c for c in xs_sample.columns if (not c.startswith('cat_'))]
    xs_sample = xs_sample[cols_use]
    col_vals_dict = {c: list(xs_sample[c].unique()) for c in xs_sample.columns}
    embed_cols = []
    for c in col_vals_dict:
        if len(col_vals_dict[c])>1:
            embed_cols.append(c)
    
    poc_Xts, poc_Xvs = proproc(xs_sample, Xvs)
    
    #learn embedding using source data
    NNS = build_embedding_network()
    NNS.fit(poc_Xts, ys_sample.values, epochs=300, callbacks=[callback], validation_data=(poc_Xvs,yvs.values), batch_size=512)
    
    #save the embeddings
    EW1 = NNS.get_layer('Fem_emb').get_weights()[0]
    pickle.dump(EW1, open(str('EW1.dict'), "wb"))
    EW2 = NNS.get_layer('VehType_emb').get_weights()[0]
    pickle.dump(EW2, open(str('EW2.dict'), "wb"))
    EW3 = NNS.get_layer('VehAge_emb').get_weights()[0]
    pickle.dump(EW3, open(str('EW3.dict'), "wb"))
    EW4 = NNS.get_layer('AgeCat_emb').get_weights()[0]
    pickle.dump(EW4, open(str('EW4.dict'), "wb"))
    
        
    #split training and testing data within target
    
    Xty = mytarget.drop(columns = ['N_Claims','Exposure'])
    yty = mytarget['N_Claims']
    vty= mytarget['Exposure']

    idx_boot1 = np.random.choice(np.arange(len(Xty)), len(Xty), replace=True, p=vty/vty.sum())
    Xty_bs = Xty.iloc[idx_boot1]
    yty_bs = yty.iloc[idx_boot1]
    vty_bs = vty.iloc[idx_boot1]

    rng1 = np.random.default_rng()
    idx3 = rng1.choice(np.arange(len(Xty_bs)), round(len(Xty_bs)*0.6), replace=False)
    idx4 = np.delete(np.arange(len(Xty_bs)),idx3)

    xt_sample = Xty_bs.iloc[idx3]
    yt_sample = yty_bs.iloc[idx3]
    vt_sample = vty_bs.iloc[idx3]
    
    Xva_ty = Xty_bs.iloc[idx4]
    yva_ty = yty_bs.iloc[idx4]
    vva_ty = vty_bs.iloc[idx4]

    rng2 = np.random.default_rng()
    idx5 = rng2.choice(np.arange(len(Xva_ty)), round(len(Xva_ty)*0.625), replace=False)
    idx6 = np.delete(np.arange(len(Xva_ty)),idx5)
    
    Xva = Xva_ty.iloc[idx5]
    yva = yva_ty.iloc[idx5]
    vva = vva_ty.iloc[idx5]
    
    Xte = Xva_ty.iloc[idx6]
    yte = yva_ty.iloc[idx6]
    vte = vva_ty.iloc[idx6]

    cols_use = [c for c in xt_sample.columns if (not c.startswith('cat_'))]
    xt_sample = xt_sample[cols_use]
    col_vals_dict = {c: list(xt_sample[c].unique()) for c in xt_sample.columns}
    embed_cols = []
    for c in col_vals_dict:
        if len(col_vals_dict[c])>1:
            embed_cols.append(c)
    
    proc_Xtr, proc_Xva, proc_Xte = preproc(xt_sample, Xva, Xte)

    #train on target data with embeddings learned from source data
    NN = build_embedding_model()
    NN.fit(proc_Xtr, yt_sample.values, epochs=300, callbacks=[callback], validation_data=(proc_Xva,yva.values), batch_size=512)
    
    TL_preds = NN.predict(proc_Xte).ravel()
    deviance = mean_poisson_deviance(yte.values, TL_preds)
    #MSE on the held-out target test set
    error = mean_squared_error(yte.values, TL_preds)
    
    BS.loc[i, ['PD']] = deviance
    BS.loc[i, ['Error']] = error

end = time.time()
print((end - start)/60.0, "min elapsed.")

print(BS.mean())
print(BS.std())

In other words, the embeddings learned from the source data are transferred to the embedding model, which is then trained on the target data.

The TR_emb model has an MPD of 0.4747 (0.1630) and an MSE of 0.1112 (0.0892), which is even worse than the Regularized Transfer model.

A possible explanation for the poor performance of both the Regularized Transfer model and the Transfer Embeddings model is 'negative transfer', where transferring knowledge between dissimilar domains leads to worse performance than no transfer learning at all.

In summary, instance-based and feature-based methods that have access to the target labels show better results than methods with no access to the target labels. In addition, adapting to the distributional differences between the source and target by transforming features can provide relatively better predictions than transferring parameters from a source model and updating a target model.
