# A deep neural network for insurance classification written in tensorflow

## Original goal, and what I learned

I originally set out to write this post to practice designing neural networks with tensorflow, but along the way I learned a valuable lesson about class imbalance, which I will share here in addition to the model I designed. The process showed me that you can design a really nifty model, but if it is given data it cannot effectively learn from, your predictions will be just as garbage as your inputs.

Executable script versions of the different nn variants are available at: https://github.com/CNuge/kaggle_code/tree/master/insurance_classification

## What this post covers
1. Designing the neural network
2. My first training attempt (a.k.a. how to do it wrong)
3. Why the training was not working
4. Solution A: Downsampling the 0s
5. Solution B: Upsampling the 1s
6. Discussion of the results from the two methods of addressing class imbalance

So if all you want is the best form of the working model, you can look at just parts 1 and 5. If you're interested in learning from my mistakes in dealing with class imbalance, then read on!

### Housekeeping: imports


In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


## 1. Designing the network

The first three functions below are not part of the network. The two gini functions compute the normalized gini index score for the model, and I will be calling them once per training epoch so that we can check in on how well the model ranks its predictions.

### 1.a gini assessment function

In [None]:
def gini(actual, pred, cmpcol = 0, sortcol = 1):
	assert( len(actual) == len(pred) )
	# columns: actual value, prediction, original row position
	data = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=float)
	# sort by prediction (descending), breaking ties by original position
	data = data[ np.lexsort((data[:,2], -1*data[:,1])) ]
	totalLosses = data[:,0].sum()
	giniSum = data[:,0].cumsum().sum() / totalLosses

	giniSum -= (len(actual) + 1) / 2.
	return giniSum / len(actual)
 
def gini_normalized(a, p):
	return gini(a, p) / gini(a, a)
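
Because the normalized gini score only depends on how the predictions are ordered, for binary labels with untied predictions it works out to 2 * AUC - 1. The block below is a small sanity check of that relationship on made-up labels and scores. It restates the gini functions so it runs on its own, and `pairwise_auc` is a hypothetical helper written for this illustration, not part of the original scripts.

```python
import numpy as np

def gini(actual, pred):
    # Same logic as the gini function above, restated so this block is self-contained.
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Sort by prediction (descending), breaking ties by original position.
    order = np.lexsort((np.arange(len(actual)), -pred))
    sorted_actual = actual[order]
    gini_sum = sorted_actual.cumsum().sum() / sorted_actual.sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def pairwise_auc(actual, pred):
    # Hypothetical helper: fraction of (positive, negative) pairs
    # in which the positive example receives the higher score.
    actual = np.asarray(actual)
    pred = np.asarray(pred, dtype=float)
    pos = pred[actual == 1]
    neg = pred[actual == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Made-up labels and scores, with no tied scores.
y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.3, 0.7]

g = gini_normalized(y_true, scores)
auc = pairwise_auc(y_true, scores)
print("normalized gini:", g)
print("2 * AUC - 1:    ", 2 * auc - 1)
```

Rescaling the scores by any order-preserving transformation leaves the result unchanged, which is why the metric rewards good ranking rather than well-calibrated probabilities.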

### 1.b reset function for the tensorflow graph
Since we will be running multiple models in this notebook, we need to reset the tensorflow graph between runs so that the various parts aren't erroneously linked together. 

In [None]:

#for stability
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


reset_graph()

### 1.c Load the data, split categoricals, clean the data and standardize scale

To keep things chronological, this is how I first went about importing the data. Note that I do not even look at the number of 0s and the number of 1s in the training data! This mistake would come to bite me in the butt and cause the training to fail.
What happens here is that the data is loaded, the y is split off from the training dataframe and then the train and test are merged and one-hot encoded (categoricals) and scaled (numericals) as a unit. 

In [None]:
# Load the data

test_dat = pd.read_csv('../input/test.csv')
train_dat = pd.read_csv('../input/train.csv')
submission = pd.read_csv('../input/sample_submission.csv')

train_y = train_dat['target'].values
train_x = train_dat.drop(['target', 'id'], axis = 1)
test_dat = test_dat.drop(['id'], axis = 1)

In [None]:
#clean the data

merged_dat = pd.concat([train_x, test_dat],axis=0)

cat_features = [col for col in merged_dat.columns if col.endswith('cat')]
for column in cat_features:
	temp=pd.get_dummies(pd.Series(merged_dat[column]))
	merged_dat=pd.concat([merged_dat,temp],axis=1)
	merged_dat=merged_dat.drop([column],axis=1)

numeric_features = [col for col in merged_dat.columns if '_calc_' in  str(col)]
numeric_features = [col for col in numeric_features if '_bin' not in str(col)]

scaler = StandardScaler()
scaled_numerics = scaler.fit_transform(merged_dat[numeric_features])
scaled_num_df = pd.DataFrame(scaled_numerics, columns =numeric_features )


merged_dat = merged_dat.drop(numeric_features, axis=1)


merged_dat = np.concatenate((merged_dat.values,scaled_num_df), axis = 1)


train_x = merged_dat[:train_x.shape[0]]
test_dat = merged_dat[train_x.shape[0]:]


train_x = train_x.astype(np.float32)
test_dat = test_dat.astype(np.float32)
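
One thing to watch in the loop above: calling `pd.get_dummies` on a bare series names each new column after the category value alone, so different categorical features can produce columns with the same name. A minimal sketch of an alternative, assuming a toy frame whose column names only mimic the competition's naming, is to pass `columns=` so each indicator column is prefixed with its source feature. The scaling here is done by hand with the population standard deviation, which is what `StandardScaler` uses.

```python
import pandas as pd

# Toy frame; the column names are hypothetical stand-ins for the real features.
toy = pd.DataFrame({
    "ps_car_01_cat": [0, 1, 1, 2],
    "ps_ind_02_cat": [1, 1, 2, 2],
    "ps_calc_01": [0.1, 0.5, 0.9, 0.3],
})

cat_cols = [c for c in toy.columns if c.endswith("cat")]
num_cols = [c for c in toy.columns if "_calc_" in c and "_bin" not in c]

# columns= one-hot encodes the listed columns and prefixes every new column
# with the name of the feature it came from, so names stay unique.
encoded = pd.get_dummies(toy, columns=cat_cols)

# Standardize the numeric columns to zero mean and unit variance (ddof=0).
encoded[num_cols] = (
    (encoded[num_cols] - encoded[num_cols].mean()) / encoded[num_cols].std(ddof=0)
)

print(list(encoded.columns))
```

For a purely numeric array fed to a neural network the duplicate names do no harm, but unique names make it much easier to inspect the frame before it is converted.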

### 1.d Designing the deep neural network

To make the network I've used the tf.layers API, which lets you define the makeup of a given layer of the neural network very easily. Here, the input for any layer is always the variable name given to the previous layer. The second thing passed in is the number of neurons, followed by some hyperparameter arguments (discussed below).

Notes on the components I've used:
- This neural network is designed using 4 fully connected hidden layers, a batch normalization layer, a dropout layer and an output layer with a sigmoid activation function (so that useful probability predictions are generated).
- The batch normalization applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1 (effectively centering the outputs of the previous layer).
- Throughout the network the rectified linear unit (ReLU) activation function (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) is used, along with a He kernel initializer (https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf).
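
To make the last two notes concrete, here is a rough NumPy sketch of what He initialization and training-mode batch normalization compute. It is an illustration under simplifying assumptions (a random batch, fixed scale and shift), not the tf.layers implementation, and the function names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

def he_normal(fan_in, fan_out):
    # He initialization: zero-mean Gaussian weights with variance 2 / fan_in.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm_train(x, gamma=1.0, beta=0.0, eps=1e-3):
    # Training-mode batch norm: standardize each feature over the batch,
    # then apply a scale (gamma) and shift (beta), which are learned in practice.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

batch = rng.normal(size=(256, 50))          # hypothetical batch: 256 rows, 50 features
hidden = relu(batch @ he_normal(50, 100))   # dense layer followed by ReLU
normed = batch_norm_train(hidden)

print("before batch norm: mean", hidden.mean(), "std", hidden.std())
print("after batch norm:  mean", normed.mean(), "std", normed.std())
```

At inference time a batch normalization layer uses running averages of the mean and variance instead of the batch statistics, which is what the `momentum` argument controls.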


In [None]:
num_inputs = train_x.shape[1]
learning_rate = 0.1
num_classes = 2
n_hidden1 = 100
n_hidden2 = 400
n_hidden3 = 200
n_hidden4 = 100
dropout = 0.3


In [None]:
X = tf.placeholder(tf.float32, shape=(None, num_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")


with tf.variable_scope('ClassNet'):

	he_init = tf.contrib.layers.variance_scaling_initializer()

	training = tf.placeholder_with_default(False, shape=(), name='training')

	hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
							  kernel_initializer=he_init, name="hidden1")

	bn1 = tf.layers.batch_normalization(hidden1, training = training, momentum = 0.9)

	hidden2 = tf.layers.dense(bn1, n_hidden2, activation=tf.nn.relu,
							  kernel_initializer=he_init, name="hidden2")

	hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu,
							  kernel_initializer=he_init, name="hidden3")

	hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
							  kernel_initializer=he_init, name="hidden4")

	fc1 = tf.layers.dropout(hidden4, rate=dropout)

	logits = tf.layers.dense(fc1, num_classes, activation=tf.nn.sigmoid)
	

With the neural network graph defined, the next thing we need to do is define the methods used to calculate loss, to train the model, and to evaluate the model. These are all shown below. With all of the scopes defined, we can initialize the network and supporting tf functions.

In [None]:
with tf.name_scope("loss"):
	xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
	loss = tf.reduce_mean(xentropy, name="loss")
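
For reference, this is roughly what the sparse softmax cross-entropy computes, written as a NumPy sketch rather than the TensorFlow op itself. TensorFlow's documentation for this op says it expects unscaled scores because it applies the softmax internally. The example compares the loss on raw scores with the loss on the same scores after a sigmoid, using made-up numbers. Since the output layer above applies a sigmoid before the loss, the second case is the closer analogue; whether that affects the final ranking is not something tested here.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # Numerically stable log-softmax, then take the negative
    # log-probability assigned to the true class of each row.
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

# Made-up two-class scores for two confidently and correctly classified rows.
raw_scores = np.array([[4.0, -4.0], [-4.0, 4.0]])
squashed = 1.0 / (1.0 + np.exp(-raw_scores))   # the same scores after a sigmoid
labels = np.array([0, 1])

loss_raw = sparse_softmax_cross_entropy(raw_scores, labels).mean()
loss_squashed = sparse_softmax_cross_entropy(squashed, labels).mean()
print("mean loss on raw scores:     ", loss_raw)
print("mean loss on sigmoid outputs:", loss_squashed)
```

With both scores confined to the interval (0, 1), the gap between them is always below 1, so the per-row loss stays above log(1 + e^-1), about 0.31, however confident the network is.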


In [None]:
with tf.name_scope("train"):
	optimizer = tf.train.GradientDescentOptimizer(learning_rate)
	training_op = optimizer.minimize(loss)


In [None]:
with tf.name_scope("eval"):
	correct = tf.nn.in_top_k(logits, y, 1)
	accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))


In [None]:
init = tf.global_variables_initializer()
#saver = tf.train.Saver()



## 2. The first attempt at training the network


I've defined n_epochs = 20 for brevity's sake, as the flatline in the GINI NORM score persists on further iterations.
As you can see, here I just pass all the training data into the model and leave it to run. Note the scores and how they change from epoch to epoch: we aren't getting significant improvements to the final score!

### Train the model

In [None]:
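n_epochs = 20

# hold out a validation set for the per-epoch checks; the 80/20 split and the
# seed are assumptions, since the original split is not shown in this post
X_train, X_val, y_train, y_val = train_test_split(
	train_x, train_y, test_size=0.2, random_state=42)

# saver used to checkpoint the trained weights at the end of the session
saver = tf.train.Saver()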

with tf.Session() as sess:
	init.run()
	for epoch in range(n_epochs):
		sess.run(training_op, feed_dict={X: X_train, y: y_train})	
		acc_train = accuracy.eval(feed_dict={X: X_train, y: y_train})
		acc_test = accuracy.eval(feed_dict={X: X_val,
											y: y_val})

		###below is the new GINI test.
		prob_test = logits.eval(feed_dict={X: X_val,
								y: y_val})
		#switched from outputs to logits
		gini_n = gini_normalized(y_val, prob_test[:,1])

		print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test, 
			"\nGINI NORM:", gini_n)
	
	save_path = saver.save(sess, "./cams_model_final.ckpt")



## 3. Why is this not working?!


Without even evaluating on the test data we can see that this is not working! The GINI NORM scores flounder around 0.03, and considering that the leaderboard has scores in the 0.289 range, this tells us that the model is way, way off base. So why is the accuracy improving while the gini score has flatlined? I asked this same question (https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/43282) and was politely informed in the discussion section that I had completely forgotten to consider the distribution of the two classes in the data, which was a bonehead move on my part.

96%+ of the training data is 0s, so by just predicting a 0 each time I would be correct 96% of the time. This is no good because we cannot get an informative probability, and the ordering of the predictions relative to one another (which is what matters in the gini score) is all jumbled. For this reason accuracy isn't what we should be focusing on: this is more of a ranking task than a classification one.
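
The check I skipped takes only a couple of lines. Below is a minimal sketch on synthetic labels (roughly 4% positives, an illustrative stand-in for the real `target` column) that counts the classes and shows how much accuracy a model earns by always predicting 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels with roughly 4% positives; an assumption for illustration,
# not the real training data.
y = (rng.random(100_000) < 0.04).astype(int)

counts = np.bincount(y, minlength=2)
print("class counts (0s, 1s):", counts)
print("fraction of 1s:", counts[1] / counts.sum())

# A "model" that ignores its inputs and always predicts 0.
always_zero = np.zeros_like(y)
accuracy = (always_zero == y).mean()
print("accuracy of always predicting 0:", accuracy)
```

On the real data the equivalent check would be something like `train_dat['target'].value_counts(normalize=True)`.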

Rodrigo Bicalho provided a comment with a suggestion that I focused on as the best way to resolve the issue: 'There is a couple things you can try: i) Oversampling or Undersampling your sample If you do this, in your training set you might get 50%/50% split between Y=1 and Y=0. That way, your model will be less driven to predict all zeros.'


When the training data contains a higher proportion of 1s relative to 0s, the model can begin to learn the features that are informative in distinguishing 'claim' from 'no claim'. We need to balance the class distribution in the dataset, and can do this through either of the two methods. Undersampling drops instances from the more common class in order to restore an even ratio. Oversampling is the opposite: we duplicate observations of the less common class in order to bring the two classes into an even ratio.
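
As a quick illustration of the two ideas, here is a sketch on a toy dataframe (20 made-up rows, 4 of them positive). It uses `DataFrame.sample` with a fixed `random_state` so the result is reproducible; the oversampling here draws with replacement, which is one common variant and an assumption of this example.

```python
import pandas as pd

# Toy data for illustration only: 20 rows, 4 positives.
toy = pd.DataFrame({"feature": range(20), "target": [1] * 4 + [0] * 16})
ones = toy[toy["target"] == 1]
zeros = toy[toy["target"] == 0]

# Undersampling: keep a random subset of the 0s the same size as the 1s.
under = pd.concat([zeros.sample(n=len(ones), random_state=42), ones])

# Oversampling: draw 1s with replacement until they match the number of 0s.
over = pd.concat([zeros, ones.sample(n=len(zeros), replace=True, random_state=42)])

print("undersampled:", under["target"].value_counts().to_dict())
print("oversampled: ", over["target"].value_counts().to_dict())
```

Undersampling leaves 8 rows to train on and oversampling leaves 32, which is the speed versus data trade-off in miniature.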

But which one is better, over or under sampling? There is an obvious speed advantage to undersampling as the size of the training set is decreased significantly. But this comes at a cost of throwing out lots of really useful data, so maybe it is better to increase the number of 1s and work with an expanded training set with a 50:50 ratio of 1s and 0s. I decided to try both and compare the results!

## 4. Solution A: Downsampling the 0s

This method scores ~0.246, with the change to the number of epochs shown below.

In order to balance the 0s and 1s through downsampling, I split the 0s and 1s into separate dataframes, then take a random sample of the 0s dataframe that is equal in size to the 1s dataframe. This provides a 50:50 split of the 0s and 1s to pass in to the neural network. After training, the model scores ~0.246 on the public leaderboard, which is much higher than the dud of a GINI score that we were getting from the unaltered training dataset.

In [None]:
test_dat = pd.read_csv('../input/test.csv')
train_dat = pd.read_csv('../input/train.csv')
submission = pd.read_csv('../input/sample_submission.csv')

train_dat_1s = train_dat[train_dat['target'] == 1]

train_dat_0s = train_dat[train_dat['target'] == 0]
keep_0s = train_dat_0s.sample(frac=train_dat_1s.shape[0]/train_dat_0s.shape[0])


train_dat = pd.concat([keep_0s,train_dat_1s],axis=0)




### Cleaning steps - unchanged

In [None]:

train_y = train_dat['target'].values
train_x = train_dat.drop(['target', 'id'], axis = 1)
test_dat = test_dat.drop(['id'], axis = 1)

merged_dat = pd.concat([train_x, test_dat],axis=0)

cat_features = [col for col in merged_dat.columns if col.endswith('cat')]
for column in cat_features:
	temp=pd.get_dummies(pd.Series(merged_dat[column]))
	merged_dat=pd.concat([merged_dat,temp],axis=1)
	merged_dat=merged_dat.drop([column],axis=1)

numeric_features = [col for col in merged_dat.columns if '_calc_' in  str(col)]
numeric_features = [col for col in numeric_features if '_bin' not in str(col)]

scaler = StandardScaler()
scaled_numerics = scaler.fit_transform(merged_dat[numeric_features])
scaled_num_df = pd.DataFrame(scaled_numerics, columns =numeric_features )


merged_dat = merged_dat.drop(numeric_features, axis=1)


merged_dat = np.concatenate((merged_dat.values,scaled_num_df), axis = 1)


train_x = merged_dat[:train_x.shape[0]]
test_dat = merged_dat[train_x.shape[0]:]


train_x = train_x.astype(np.float32)
test_dat = test_dat.astype(np.float32)


In [None]:
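reset_graph()
# resetting clears the default graph, so re-run the cells from section 1.d
# (placeholders, layers, loss, optimizer, eval and init) before training again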


In [None]:
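n_epochs = 1500

# hold out a validation set for the per-epoch checks; the 80/20 split and the
# seed are assumptions, since the original split is not shown in this post
X_train, X_val, y_train, y_val = train_test_split(
	train_x, train_y, test_size=0.2, random_state=42)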

with tf.Session() as sess:
	init.run()
	for epoch in range(n_epochs):
		sess.run(training_op, feed_dict={X: X_train, y: y_train})	
		acc_train = accuracy.eval(feed_dict={X: X_train, y: y_train})
		acc_test = accuracy.eval(feed_dict={X: X_val,
											y: y_val})

		###below is the new GINI test.
		prob_test = logits.eval(feed_dict={X: X_val,
								y: y_val})
		#switched from outputs to logits
		gini_n = gini_normalized(y_val, prob_test[:,1])

		print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test, 
			"\nGINI NORM:", gini_n)
	
	# save the trained weights so the prediction cell below can restore them
	save_path = tf.train.Saver().save(sess, "./cams_model_final.ckpt")


In [None]:
#make external predictions on the test_dat
with tf.Session() as sess:
    # a fresh session starts with uninitialized variables, so restore the trained weights
    tf.train.Saver().restore(sess, "./cams_model_final.ckpt")
    Z = logits.eval(feed_dict={X: test_dat}) #switched from outputs to logits
    y_pred = Z[:,1]


In [None]:


dnn_output = submission
dnn_output['target'] = y_pred

dnn_output.to_csv('tf_dnn_downsample.csv', index=False, float_format='%.10f')



## 5. Solution B: Upsampling the 1s

This method scores ~0.248, with the change to the number of epochs shown below.

Here we do the inverse of the downsampling: instead of subsetting the 0s, we increase the number of 1s. Below I use a list comprehension to duplicate (and then merge) a set of 26 copies of the 1s dataframe. This brings the instances of 0s and 1s up to roughly a 50:50 ratio for the training of the neural network.

In [None]:
test_dat = pd.read_csv('../input/test.csv')
train_dat = pd.read_csv('../input/train.csv')
submission = pd.read_csv('../input/sample_submission.csv')


train_dat_0s = train_dat[train_dat['target'] == 0]


train_dat_1s = train_dat[train_dat['target'] == 1]
rep_1 =[train_dat_1s for x in range(train_dat_0s.shape[0]//train_dat_1s.shape[0] )]
keep_1s = pd.concat(rep_1, axis=0)

train_dat = pd.concat([keep_1s,train_dat_0s],axis=0)
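
Because the number of copies comes from integer division, the whole-copy approach above lands close to, but not exactly on, a 50:50 split. If an exact balance is wanted, one option is to top up the copies with a random sample of the remainder. The sketch below shows this on toy data; `upsample_minority` is a hypothetical helper, not something from the original scripts, and it assumes the 1s are the minority class.

```python
import pandas as pd

def upsample_minority(df, target_col="target", random_state=42):
    # Hypothetical helper: repeat the minority (1) rows in whole copies,
    # then top up with a random sample so both classes have equal counts.
    zeros = df[df[target_col] == 0]
    ones = df[df[target_col] == 1]
    n_copies, remainder = divmod(len(zeros), len(ones))
    parts = [zeros] + [ones] * n_copies
    if remainder:
        parts.append(ones.sample(n=remainder, random_state=random_state))
    return pd.concat(parts, ignore_index=True)

# Toy data for illustration only: 3 positives and 20 negatives.
toy = pd.DataFrame({"feature": range(23), "target": [1] * 3 + [0] * 20})
balanced = upsample_minority(toy)
print(balanced["target"].value_counts().to_dict())
```

In the toy case, six whole copies give 18 positives and the sampled remainder supplies the last 2.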

In [None]:

train_y = train_dat['target'].values
train_x = train_dat.drop(['target', 'id'], axis = 1)
test_dat = test_dat.drop(['id'], axis = 1)

merged_dat = pd.concat([train_x, test_dat],axis=0)

cat_features = [col for col in merged_dat.columns if col.endswith('cat')]
for column in cat_features:
	temp=pd.get_dummies(pd.Series(merged_dat[column]))
	merged_dat=pd.concat([merged_dat,temp],axis=1)
	merged_dat=merged_dat.drop([column],axis=1)

numeric_features = [col for col in merged_dat.columns if '_calc_' in  str(col)]
numeric_features = [col for col in numeric_features if '_bin' not in str(col)]

scaler = StandardScaler()
scaled_numerics = scaler.fit_transform(merged_dat[numeric_features])
scaled_num_df = pd.DataFrame(scaled_numerics, columns =numeric_features )


merged_dat = merged_dat.drop(numeric_features, axis=1)


merged_dat = np.concatenate((merged_dat.values,scaled_num_df), axis = 1)


train_x = merged_dat[:train_x.shape[0]]
test_dat = merged_dat[train_x.shape[0]:]


train_x = train_x.astype(np.float32)
test_dat = test_dat.astype(np.float32)

In [None]:
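reset_graph()
# resetting clears the default graph, so re-run the cells from section 1.d
# (placeholders, layers, loss, optimizer, eval and init) before training again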

In [None]:
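n_epochs = 1500

# hold out a validation set for the per-epoch checks; the 80/20 split and the
# seed are assumptions, since the original split is not shown in this post
X_train, X_val, y_train, y_val = train_test_split(
	train_x, train_y, test_size=0.2, random_state=42)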

with tf.Session() as sess:
	init.run()
	for epoch in range(n_epochs):
		sess.run(training_op, feed_dict={X: X_train, y: y_train})	
		acc_train = accuracy.eval(feed_dict={X: X_train, y: y_train})
		acc_test = accuracy.eval(feed_dict={X: X_val,
											y: y_val})

		###below is the new GINI test.
		prob_test = logits.eval(feed_dict={X: X_val,
								y: y_val})
		#switched from outputs to logits
		gini_n = gini_normalized(y_val, prob_test[:,1])

		print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test, 
			"\nGINI NORM:", gini_n)
	
	# save the trained weights so the prediction cell below can restore them
	save_path = tf.train.Saver().save(sess, "./cams_model_final.ckpt")


In [None]:
#make external predictions on the test_dat
with tf.Session() as sess:
    # a fresh session starts with uninitialized variables, so restore the trained weights
    tf.train.Saver().restore(sess, "./cams_model_final.ckpt")
    Z = logits.eval(feed_dict={X: test_dat}) #switched from outputs to logits
    y_pred = Z[:,1]


In [None]:
dnn_output = submission
dnn_output['target'] = y_pred

dnn_output.to_csv('tf_dnn_predictions_upsample.csv', index=False, float_format='%.10f')

## Conclusion

So before any hyperparameter tuning, simply by creating balanced training data from a very imbalanced training set we are able to produce a model that is fairly effective at predicting the relative likelihoods of people filing insurance claims. This has taught me the importance of assessing the data I am using, and of thinking about issues beyond the obvious ones such as missing data and the need to one-hot encode categoricals.

The difference here between a model that is completely useless and one that is a good predictor is very small, just 3 lines of pandas dataframe manipulation in the preprocessing, but it proved to make all the difference. Oversampling and undersampling proved to be roughly equivalent, with a slightly higher LB score observed for the oversampled data. I have learned my lesson and will always check the class distribution in future classification problems that I undertake!

