# Construction of a deep neural network for predicting insurance claim probabilities

In this script we will build a deep neural network (three hidden layers) to predict the probability that a customer files an insurance claim. I do the preprocessing in pandas and scale the data using scikit-learn's StandardScaler.

Imports for the functions we use:

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

Next we read in the data.

In [None]:
test_dat = pd.read_csv('../input/test.csv')
train_dat = pd.read_csv('../input/train.csv')
submission = pd.read_csv('../input/sample_submission.csv')

## Cleaning the data

We split the y values off into a separate object and drop the ids and targets from the train and test dataframes. Since we will be manipulating the data, we concatenate the train and test dataframes so that the same changes are applied to both. Otherwise the test columns could end up different from the training columns, and the trained model could not make predictions on the test set.

In [None]:
train_y = train_dat['target']
train_x = train_dat.drop(['target', 'id'], axis = 1)
test_dat = test_dat.drop(['id'], axis = 1)

merged_dat = pd.concat([train_x, test_dat],axis=0)
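
To illustrate why the two frames are concatenated before encoding, here is a minimal sketch on toy data. The column name ps_demo_cat and its values are hypothetical and not taken from the competition files. It shows that one-hot encoding train and test separately can produce different column sets, while encoding the concatenated frame gives both the same columns.

```python
import pandas as pd

# Hypothetical toy frames: the test set contains a category (2) that train lacks.
toy_train = pd.DataFrame({'ps_demo_cat': [0, 1, 1]})
toy_test = pd.DataFrame({'ps_demo_cat': [0, 2]})

# Encoding each frame on its own gives mismatched column sets.
sep_train = pd.get_dummies(toy_train['ps_demo_cat'], prefix='ps_demo_cat')
sep_test = pd.get_dummies(toy_test['ps_demo_cat'], prefix='ps_demo_cat')

# Encoding the concatenated frame gives one shared column set,
# which is then split back by row position.
toy_merged = pd.concat([toy_train, toy_test], axis=0)
merged_enc = pd.get_dummies(toy_merged['ps_demo_cat'], prefix='ps_demo_cat')
enc_train = merged_enc.iloc[:len(toy_train)]
enc_test = merged_enc.iloc[len(toy_train):]

print(list(sep_train.columns))
print(list(sep_test.columns))
print(list(enc_train.columns))
```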

Below we make several changes to the data prior to training the neural network. First, the float64 inputs are converted to float32 (float64 is not compatible with all of TensorFlow's models). Second, we one-hot encode the categorical variables. Third, we standardize the scale of the numerical variables (here, the non-binary '_calc_' columns).

In [None]:
#change data to float32
for c, dtype in zip(merged_dat.columns, merged_dat.dtypes): 
    if dtype == np.float64:     
        merged_dat[c] = merged_dat[c].astype(np.float32)

#one hot encode the categoricals
cat_features = [col for col in merged_dat.columns if col.endswith('cat')]
for column in cat_features:
    #prefix the dummy columns with the source column name so names stay unique
    temp = pd.get_dummies(merged_dat[column], prefix=column)
    merged_dat = pd.concat([merged_dat, temp], axis=1)
    merged_dat = merged_dat.drop([column], axis=1)

#standardize the scale of the numericals
#select the '_calc_' columns, excluding the binary ('_bin') ones
numeric_features = [col for col in merged_dat.columns if '_calc_' in str(col)]
numeric_features = [col for col in numeric_features if '_bin' not in str(col)]

scaler = StandardScaler()
scaled_numerics = scaler.fit_transform(merged_dat[numeric_features])
scaled_num_df = pd.DataFrame(scaled_numerics, columns=numeric_features)
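
As a rough illustration of what standardization does to a column, the sketch below applies the same transformation by hand with NumPy: subtract the column mean and divide by the column standard deviation. The toy values are hypothetical; the point is only that the result has mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical toy column on an arbitrary scale.
toy = np.array([[10.0], [20.0], [30.0], [40.0]])

# Standardize: subtract the column mean, divide by the column standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# the same convention StandardScaler uses.
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0
print(toy_scaled.std(axis=0))   # approximately 1
```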

With the data munged, we swap the scaled columns in for the originals and split the result back into train and test sets.

In [None]:
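#drop the unscaled numeric columns, then append the scaled versions
merged_dat = merged_dat.drop(numeric_features, axis=1)

#concatenating mixed dtypes upcasts the array, so cast the result back to float32
merged_dat = np.concatenate((merged_dat.values, scaled_num_df.values), axis=1).astype(np.float32)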

train_x = merged_dat[:train_x.shape[0]]
test_dat = merged_dat[train_x.shape[0]:]
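
One detail worth checking at this point is the dtype of the combined array. The sketch below uses small stand-in arrays (hypothetical shapes and values) to show that NumPy upcasts to float64 when float32 and float64 blocks are concatenated, and that the result can be cast back to float32 before the positional train/test split.

```python
import numpy as np

# Stand-ins: a float32 feature block and a float64 block of scaled columns.
block32 = np.ones((5, 2), dtype=np.float32)
block64 = np.zeros((5, 1), dtype=np.float64)

combined = np.concatenate((block32, block64), axis=1)
print(combined.dtype)  # float64

# Cast back to float32, then split by row position.
combined32 = combined.astype(np.float32)
n_train = 3  # hypothetical number of training rows
toy_train, toy_test = combined32[:n_train], combined32[n_train:]
print(toy_train.shape, toy_test.shape)
```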

## Training the neural network

Below we set up the neural network in TensorFlow and fit the training data. The code uses the tf.contrib.learn API, which is only available in TensorFlow 1.x.

In [None]:
config = tf.contrib.learn.RunConfig(tf_random_seed=42)

feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(train_x)

We then create the DNN (deep neural network) classifier.

hidden_units gives the number of units in each hidden layer of the network; all layers are fully connected.
For example, '[64,32]' would mean 64 nodes in the first hidden layer and 32 in the second. Here we use three hidden layers of 150 units each.

In [None]:
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[150,150,150], n_classes=2,
                                         feature_columns=feature_cols, config=config)
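
To get a feel for how hidden_units affects model size, the following sketch counts the weights and biases of a fully connected network. The helper count_dense_parameters, the feature count of 200, and the single output unit are all assumptions made for illustration; they are not read from the classifier above.

```python
def count_dense_parameters(n_inputs, hidden_units, n_outputs):
    """Count weights and biases in a fully connected feed-forward network."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # each layer has a fan_in x fan_out weight matrix plus one bias per unit
        total += fan_in * fan_out + fan_out
    return total

n_features = 200  # hypothetical number of input features
print(count_dense_parameters(n_features, [64, 32], 1))
print(count_dense_parameters(n_features, [150, 150, 150], 1))
```

Under these assumptions, the three-layer configuration has roughly five times as many parameters as the '[64,32]' example.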

Fitting the model works much as it does in scikit-learn. The steps argument sets the number of training iterations, and batch_size is the number of samples used to train the network on each step.

In [None]:
dnn_clf.fit(train_x, train_y, batch_size=50, steps=40000)
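
The relationship between steps, batch_size, and the number of passes over the training data can be sketched with a little arithmetic. The helper epochs_seen and the training set size of 500,000 rows are hypothetical, chosen only to make the numbers easy to follow.

```python
def epochs_seen(steps, batch_size, n_samples):
    """Approximate number of passes over the training data."""
    return steps * batch_size / n_samples

n_samples = 500_000  # hypothetical training set size
print(epochs_seen(40000, 50, n_samples))  # 4.0
```

With these settings the network sees 2,000,000 samples in total, so the number of epochs depends only on how many rows the training set actually has.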

## Predicting probabilities with the neural network

We then predict the probabilities using predict_proba. It returns a generator, so I wrap it in list() to get the results in list form (which is easy to put into a pandas dataframe).

In [None]:
dnn_y_pred = dnn_clf.predict_proba(test_dat)

dnn_out = list(dnn_y_pred)

Note that the predicted probabilities come in pairs of (P[0], P[1]). We want the probability of a claim (P[1]), so I use a list comprehension below to take only the second element of each array when adding the data to the output dataframe.

In [None]:
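#copy the sample submission so the original dataframe is left untouched
dnn_output = submission.copy()
#keep only P[1], the predicted probability of a claim
dnn_output['target'] = [x[1] for x in dnn_out]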

dnn_output.to_csv('dnn_predictions.csv', index=False, float_format='%.4f')
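
For reference, here is a small sketch of the P[1] extraction on toy data. The probability pairs are made up for illustration. It shows the list comprehension used above next to an equivalent NumPy column slice, which may be convenient when the predictions are already collected in a list.

```python
import numpy as np

# Hypothetical predictions: each entry is a (P[0], P[1]) pair.
toy_probs = [np.array([0.9, 0.1]), np.array([0.25, 0.75])]

# List comprehension, as in the cell above.
claim_probs = [p[1] for p in toy_probs]

# Equivalent NumPy version: stack into a 2-D array and take the second column.
claim_probs_np = np.asarray(toy_probs)[:, 1]

print(claim_probs)
print(claim_probs_np)
```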