<div class="list-group" id="list-tab" role="tablist">
<h2>Quick navigation </h2>

[1. Introduction](#1)
    
[2. Data processing and exploration](#2) 
    
&nbsp;&nbsp;&nbsp;&nbsp;[2.1. Setup and examine the train dataset](#2_1)    
&nbsp;&nbsp;&nbsp;&nbsp;[2.2. Training data and response correlation](#2_2)   
&nbsp;&nbsp;&nbsp;&nbsp;[2.3. Compare training and testing sets](#2_3)   
    
[3. Modeling (In progress)](#3)
    
&nbsp;&nbsp;&nbsp;&nbsp;[3.1. Define the model and the metrics](#3_1)    
&nbsp;&nbsp;&nbsp;&nbsp;[3.2. Set the correct initial bias](#3_2)   
&nbsp;&nbsp;&nbsp;&nbsp;[3.3. Class weight](#3_3)  
&nbsp;&nbsp;&nbsp;&nbsp;[3.4. Over sampling](#3_4)      
    
[4. Output](#4)    

<IMG align="center" src="https://www.netclipart.com/pp/m/84-843864_traffic-clipart-car-collision-car-accident-cartoon-insurance.png" alt="car accident img">


<a id="1"></a>
<h2> Introduction</h2>



# Context

****
Our client is an insurance company that has provided health insurance to its customers. It now needs your help in building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000, so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000, that is where the concept of probabilities comes into the picture. For example, there may be 100 customers like you paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) will be hospitalised that year. This way everyone shares the risk of everyone else.

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called the ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# Task Details
****
Your client is an insurance company that has provided health insurance to its customers. It now needs your help in building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance provided by the company.


# Evaluation Metric
****
The evaluation metric for this hackathon is ROC_AUC score.

<a id="2"></a>
<h2>Data processing and exploration</h2>

<a id="2_1"></a>
<h1 style='border:0; color:black'>2.1. Setup and examine the train dataset</h1>

# Setup
 
We will first import the libraries needed.

*RandomOverSampler and SMOTE (Synthetic Minority Over-sampling Technique) are used to treat imbalanced datasets, which is the case here, as the first figure shows.*

*RandomOverSampler duplicates randomly chosen minority-class samples until the minority class reaches a specified proportion of the majority class.*

*SMOTE generates synthetic minority-class samples by interpolating between existing ones instead of duplicating them, which reduces the risk of overfitting to repeated samples.*
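
To make the difference between the two techniques concrete, here is a minimal NumPy sketch on toy data. It is not the imblearn implementation: `random_oversample` and `smote_like` are hypothetical helper names, the data is made up, and the SMOTE-style function is a simplification that interpolates toward one randomly chosen neighbour among the k nearest minority samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class samples (hypothetical data, 2 features).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def random_oversample(samples, n_new, rng):
    """Duplicate existing rows chosen at random (with replacement)."""
    idx = rng.integers(0, len(samples), size=n_new)
    return samples[idx]

def smote_like(samples, n_new, k, rng):
    """Create synthetic rows between a sample and one of its k nearest neighbours."""
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(samples), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return samples[base] + gap * (samples[chosen] - samples[base])

duplicated = random_oversample(minority, 6, rng)
synthetic = smote_like(minority, 6, k=2, rng=rng)
print(duplicated)
print(synthetic)
```

Every row of `duplicated` is an exact copy of an existing sample, whereas the rows of `synthetic` are new points lying on segments between neighbouring samples.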

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import csv
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
import tempfile
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

from imblearn.over_sampling import RandomOverSampler,SMOTE
from imblearn.under_sampling  import RandomUnderSampler

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

In [None]:
mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

We load the train and test datasets:

In [None]:
df_train=pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')
df_test=pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/test.csv')
df_train.head()  

We now check whether any data is missing:

In [None]:
df_train.isnull().sum()

**There is no missing data.**

# Examine the class label imbalance

In [None]:
neg, pos = np.bincount(df_train['Response'])
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_train[df_train['Response']==1]),
            len(df_train[df_train['Response']==0])
        ], 
        name='Train Response'
    ),
]


for i in range(len(traces)):
    fig.append_trace(traces[i], (i // 2) + 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train Response distribution',
    height=400,
    width=400
)
fig.show()



The dataset is quite imbalanced.

<a id="2_2"></a>
<h1 style='border:0; color:black'>2.2. Training data and response correlation</h1>

First, we need to convert the text columns (`Gender`, `Vehicle_Age`, `Vehicle_Damage`) to integers:

In [None]:

df_train.loc[df_train['Gender'] == 'Male', 'Gender'] = 1
df_train.loc[df_train['Gender'] == 'Female', 'Gender'] = 2
df_train['Gender'] = df_train['Gender'].astype(int)
df_test.loc[df_test['Gender'] == 'Male', 'Gender'] = 1
df_test.loc[df_test['Gender'] == 'Female', 'Gender'] = 2
df_test['Gender'] = df_test['Gender'].astype(int)


df_train.loc[df_train['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
df_train.loc[df_train['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
df_train.loc[df_train['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0
df_train['Vehicle_Age'] = df_train['Vehicle_Age'].astype(int)
df_test.loc[df_test['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
df_test.loc[df_test['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
df_test.loc[df_test['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0
df_test['Vehicle_Age'] = df_test['Vehicle_Age'].astype(int)


df_train.loc[df_train['Vehicle_Damage'] == 'Yes', 'Vehicle_Damage'] = 1
df_train.loc[df_train['Vehicle_Damage'] == 'No', 'Vehicle_Damage'] = 0
df_train['Vehicle_Damage'] = df_train['Vehicle_Damage'].astype(int)
df_test.loc[df_test['Vehicle_Damage'] == 'Yes', 'Vehicle_Damage'] = 1
df_test.loc[df_test['Vehicle_Damage'] == 'No', 'Vehicle_Damage'] = 0
df_test['Vehicle_Damage'] = df_test['Vehicle_Damage'].astype(int)
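
As an alternative to the cell above, the same encoding can be written more compactly with `Series.map` and one dictionary per column. This is a sketch on a small hypothetical frame (`toy`); the names `ENCODINGS` and `encode_categories` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Category-to-integer codes, matching the values assigned in the cell above.
ENCODINGS = {
    'Gender': {'Male': 1, 'Female': 2},
    'Vehicle_Age': {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2},
    'Vehicle_Damage': {'No': 0, 'Yes': 1},
}

def encode_categories(df):
    """Return a copy of df with the text columns replaced by integer codes."""
    out = df.copy()
    for column, mapping in ENCODINGS.items():
        out[column] = out[column].map(mapping).astype(int)
    return out

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Vehicle_Age': ['> 2 Years', '< 1 Year', '1-2 Year'],
    'Vehicle_Damage': ['Yes', 'No', 'Yes'],
})
encoded = encode_categories(toy)
print(encoded)
```

Applying one function to both the train and test frames keeps the two encodings in sync. A category missing from the dictionary maps to NaN, so `astype(int)` fails and unexpected values surface early.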


In [None]:
df_train.head()

The figure below shows the correlation between the different columns:

In [None]:
f = plt.figure(figsize=(11, 13))
plt.matshow(df_train.corr(), fignum=f.number)
plt.xticks(range(df_train.shape[1]), df_train.columns, fontsize=14, rotation=75)
plt.yticks(range(df_train.shape[1]), df_train.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

<a id="2_3"></a>
<h1 style='border:0; color:black'>2.3. Compare training and testing sets</h1>

In [None]:
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Male', 'Female'], 
        y=[
            len(df_train[df_train['Gender']==1]),
            len(df_train[df_train['Gender']==2])
        ], 
        name='Train Gender',
        text = [
            str(round(100 * len(df_train[df_train['Gender']==1]) / len(df_train), 2)) + '%',
            str(round(100 * len(df_train[df_train['Gender']==2]) / len(df_train), 2)) + '%'
        ],
        textposition='auto'
    ),
    go.Bar(
        x=['Male', 'Female'], 
        y=[
            len(df_test[df_test['Gender']==1]),
            len(df_test[df_test['Gender']==2])
        ], 
        name='Test Gender',
        text=[
            str(round(100 * len(df_test[df_test['Gender']==1]) / len(df_test), 2)) + '%',
            str(round(100 * len(df_test[df_test['Gender']==2]) / len(df_test), 2)) + '%'
        ],
        textposition='auto'
    ),

]

for i in range(len(traces)):
    fig.append_trace(traces[i], (i // 2) + 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train/test gender column',
    height=400,
    width=700
)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_train[df_train['Driving_License']==1]),
            len(df_train[df_train['Driving_License']==0])
        ], 
        name='Train Driving_License',
        text = [
            str(round(100 * len(df_train[df_train['Driving_License']==1]) / len(df_train), 2)) + '%',
            str(round(100 * len(df_train[df_train['Driving_License']==0]) / len(df_train), 2)) + '%'
        ],
        textposition='auto'
    ),
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_test[df_test['Driving_License']==1]),
            len(df_test[df_test['Driving_License']==0])
        ], 
        name='Test Driving_License',
        text=[
            str(round(100 * len(df_test[df_test['Driving_License']==1]) / len(df_test), 2)) + '%',
            str(round(100 * len(df_test[df_test['Driving_License']==0]) / len(df_test), 2)) + '%'
        ],
        textposition='auto'
    ),

]

for i in range(len(traces)):
    fig.append_trace(traces[i], (i // 2) + 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train/test Driving_License column',
    title_x=0.5,
    height=400,
    width=700
)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_train[df_train['Previously_Insured']==1]),
            len(df_train[df_train['Previously_Insured']==0])
        ], 
        name='Train Previously_Insured',
        text = [
            str(round(100 * len(df_train[df_train['Previously_Insured']==1]) / len(df_train), 2)) + '%',
            str(round(100 * len(df_train[df_train['Previously_Insured']==0]) / len(df_train), 2)) + '%'
        ],
        textposition='auto'
    ),
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_test[df_test['Previously_Insured']==1]),
            len(df_test[df_test['Previously_Insured']==0])
        ], 
        name='Test Previously_Insured',
        text = [
            str(round(100 * len(df_test[df_test['Previously_Insured']==1]) / len(df_test), 2)) + '%',
            str(round(100 * len(df_test[df_test['Previously_Insured']==0]) / len(df_test), 2)) + '%'
        ],
        textposition='auto'
    ),

]

for i in range(len(traces)):
    fig.append_trace(traces[i], 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train/test Previously_Insured column',
    title_x=0.5,
    height=400,
    width=700
)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_train[df_train['Vehicle_Damage']==1]),
            len(df_train[df_train['Vehicle_Damage']==0])
        ], 
        name='Train Vehicle_Damage',
        text = [
            str(round(100 * len(df_train[df_train['Vehicle_Damage']==1]) / len(df_train), 2)) + '%',
            str(round(100 * len(df_train[df_train['Vehicle_Damage']==0]) / len(df_train), 2)) + '%'
        ],
        textposition='auto'
    ),
    go.Bar(
        x=['Yes', 'No'], 
        y=[
            len(df_test[df_test['Vehicle_Damage']==1]),
            len(df_test[df_test['Vehicle_Damage']==0])
        ], 
        name='Test Vehicle_Damage',
        text = [
            str(round(100 * len(df_test[df_test['Vehicle_Damage']==1]) / len(df_test), 2)) + '%',
            str(round(100 * len(df_test[df_test['Vehicle_Damage']==0]) / len(df_test), 2)) + '%'
        ],
        textposition='auto'
    ),

]

for i in range(len(traces)):
    fig.append_trace(traces[i], 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train/test Vehicle_Damage column',
    title_x=0.5,
    height=400,
    width=700
)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Bar(
        x=['> 2 Years', '1-2 Year', '< 1 Year'], 
        y=[
            len(df_train[df_train['Vehicle_Age']==2]),
            len(df_train[df_train['Vehicle_Age']==1]),
            len(df_train[df_train['Vehicle_Age']==0])
        ], 
        name='Train Vehicle_Age',
        text = [
            str(round(100 * len(df_train[df_train['Vehicle_Age']==2]) / len(df_train), 2)) + '%',
            str(round(100 * len(df_train[df_train['Vehicle_Age']==1]) / len(df_train), 2)) + '%',
            str(round(100 * len(df_train[df_train['Vehicle_Age']==0]) / len(df_train), 2)) + '%'
        ],
        textposition='auto'
    ),
    go.Bar(
        x=['> 2 Years', '1-2 Year', '< 1 Year'], 
        y=[
            len(df_test[df_test['Vehicle_Age']==2]),
            len(df_test[df_test['Vehicle_Age']==1]),
            len(df_test[df_test['Vehicle_Age']==0])
        ], 
        name='Test Vehicle_Age',
        text = [
            str(round(100 * len(df_test[df_test['Vehicle_Age']==2]) / len(df_test), 2)) + '%',
            str(round(100 * len(df_test[df_test['Vehicle_Age']==1]) / len(df_test), 2)) + '%',
            str(round(100 * len(df_test[df_test['Vehicle_Age']==0]) / len(df_test), 2)) + '%'
        ],
        textposition='auto'
    ),

]

for i in range(len(traces)):
    fig.append_trace(traces[i], 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train/test Vehicle_Age column',
    title_x=0.5,
    height=400,
    width=700
)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2)

traces = [
    go.Histogram(
        x=df_train['Age'], 
        name='Train Age'
    ),
    go.Histogram(
        x=df_test['Age'], 
        name='Test Age'
    ),

]

for i in range(len(traces)):
    fig.append_trace(traces[i], (i // 2) + 1, (i % 2)  +1)

fig.update_layout(
    title_text='Train/test Age column distribution',
    title_x=0.5,
    height=500,
    width=900
)
fig.show()
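
The bar charts above repeat the same count-and-percentage logic for each column. As a compact numerical cross-check, the proportions can also be tabulated side by side. This is a sketch on hypothetical toy frames; `compare_proportions` is an illustrative helper name.

```python
import pandas as pd

def compare_proportions(train, test, column):
    """Percentage of each value of `column` in train and test, side by side."""
    table = pd.DataFrame({
        'train_%': train[column].value_counts(normalize=True) * 100,
        'test_%': test[column].value_counts(normalize=True) * 100,
    })
    # A value missing from one set appears as NaN; report it as 0%.
    return table.fillna(0).round(2).sort_index()

# Hypothetical data for illustration only.
toy_train = pd.DataFrame({'Previously_Insured': [1, 0, 0, 1, 0]})
toy_test = pd.DataFrame({'Previously_Insured': [1, 1, 0, 0]})
summary = compare_proportions(toy_train, toy_test, 'Previously_Insured')
print(summary)
```

Large gaps between the two columns of the resulting table would indicate that the train and test sets are distributed differently for that feature.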

<a id="3"></a>
<h2>3. Modeling</h2>

In [None]:
train_arr=df_train.values.tolist()
data=[x[:-1] for x in train_arr]
response=[x[-1] for x in train_arr]
data = np.array(data, dtype='float')
response = np.array(response, dtype='float')


In [None]:
# split into 60% for training and 40% for testing (test_size=0.4)
data_training, data_testing, response_training, response_testing = train_test_split(data, response, test_size=0.4, random_state=42)
bool_response_training = response_training != 0


Normalize the input features using the sklearn StandardScaler. This sets the mean of each feature to 0 and its standard deviation to 1. The scaled values are then clipped to the range [-5, 5] to limit the influence of outliers.

Note: The StandardScaler is fit only on `data_training`, so that the model does not peek at the held-out testing data.

In [None]:
scaler = StandardScaler()
data_training = scaler.fit_transform(data_training)
data_testing = scaler.transform(data_testing)

data_training = np.clip(data_training, -5, 5)
data_testing = np.clip(data_testing, -5, 5)


print('Training labels shape:', response_training.shape)
print('Test labels shape:', response_testing.shape)

print('Training features shape:', data_training.shape)
print('Test features shape:', data_testing.shape)

<a id="3_1"></a>

# Define the model and the metrics

### Understanding useful metrics

A few metrics are defined below that the model can compute and that will be helpful when evaluating its performance.



*   **False** negatives and **false** positives are samples that were **incorrectly** classified
*   **True** negatives and **true** positives are samples that were **correctly** classified
*   **Accuracy** is the percentage of examples correctly classified
>   $\frac{\text{true samples}}{\text{total samples}}$
*   **Precision** is the percentage of **predicted** positives that were correctly classified
>   $\frac{\text{true positives}}{\text{true positives + false positives}}$
*   **Recall** is the percentage of **actual** positives that were correctly classified
>   $\frac{\text{true positives}}{\text{true positives + false negatives}}$
*   **AUC** refers to the Area Under the Curve of a Receiver Operating Characteristic curve (ROC-AUC). This metric is equal to the probability that a classifier will rank a random positive sample higher than a random negative sample.

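Note: Accuracy is not a helpful metric for this task. Because the classes are imbalanced, a model that predicts False (not interested) all the time already reaches a high accuracy without identifying a single interested customer.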

Read more:
*  [True vs. False and Positive vs. Negative](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative)
*  [Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)
*   [Precision and Recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall)
*   [ROC-AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
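
As a small worked example of the definitions above, the sketch below computes the four confusion-matrix counts and the derived metrics with NumPy on hypothetical labels and scores. The AUC is computed directly from its ranking interpretation (ties count as one half); this is for illustration and is not how Keras or scikit-learn compute it internally.

```python
import numpy as np

# Hypothetical ground truth (2 positives out of 10) and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.3])
y_pred = (y_score > 0.5).astype(int)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC: probability that a random positive is scored above a random negative.
pos_scores = y_score[y_true == 1]
neg_scores = y_score[y_true == 0]
diff = pos_scores[:, None] - neg_scores[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Baseline that always predicts the majority (negative) class.
always_negative_accuracy = np.mean(y_true == 0)

print(tp, fp, tn, fn)
print(accuracy, precision, recall, auc, always_negative_accuracy)
```

On this toy data the always-negative baseline reaches the same accuracy (0.8) as the model, while precision and recall (0.5 each) show how the model actually handles the positive class.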

In [None]:
METRICS = [
      tf.keras.metrics.TruePositives(name='tp'),
      tf.keras.metrics.FalsePositives(name='fp'),
      tf.keras.metrics.TrueNegatives(name='tn'),
      tf.keras.metrics.FalseNegatives(name='fn'), 
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
      tf.keras.metrics.AUC(name='auc'),
]

def make_model(metrics = METRICS, output_bias=None):
  if output_bias is not None:
    output_bias = tf.keras.initializers.Constant(output_bias)
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(16, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid',
                         bias_initializer=output_bias),
  ])

  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # `lr` is a deprecated alias
      loss=tf.keras.losses.BinaryCrossentropy(),
      metrics=metrics)

  return model


In [None]:
EPOCHS = 100
BATCH_SIZE = 2000

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_auc', 
    verbose=1,
    patience=10,
    mode='max',
    restore_best_weights=True)

<a id="3_2"></a>

# Set the correct initial bias

The dataset is imbalanced. Set the output layer's bias to reflect that (See: [A Recipe for Training Neural Networks: "init well"](http://karpathy.github.io/2019/04/25/recipe/#2-set-up-the-end-to-end-trainingevaluation-skeleton--get-dumb-baselines)). This can help with initial convergence.

The correct bias to set can be derived from:

$$ p_0 = pos/(pos + neg) = 1/(1+e^{-b_0}) $$
$$ b_0 = -\log_e(1/p_0 - 1) $$
$$ b_0 = \log_e(pos/neg) $$
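
The derivation can be checked numerically. The sketch below uses hypothetical class counts (`toy_pos`, `toy_neg`, not the real dataset) and verifies that both forms of the bias agree and that applying the sigmoid to the bias recovers the prior probability.

```python
import numpy as np

# Hypothetical class counts, for illustration only.
toy_pos, toy_neg = 120, 880

p0 = toy_pos / (toy_pos + toy_neg)   # prior probability of the positive class
b0 = np.log(toy_pos / toy_neg)       # initial bias from the last equation
b0_alt = -np.log(1 / p0 - 1)         # the equivalent second form

# Applying the sigmoid to b0 recovers the prior p0.
sigmoid_b0 = 1 / (1 + np.exp(-b0))
print(p0, b0, b0_alt, sigmoid_b0)
```

With this initialization, an untrained model outputs roughly the base rate for every sample instead of 0.5, so the first epochs are not spent just learning that positives are rare.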

In [None]:
model = make_model(output_bias  = np.log([pos/neg]))

In [None]:
model.predict(data_training[:10])

Checkpoint the initial weights

To make the various training runs more comparable, keep this initial model's weights in a checkpoint file, and load them into each model before training.

In [None]:
initial_weights = os.path.join(tempfile.mkdtemp(),'initial_weights')
model.save_weights(initial_weights)

In [None]:
history = model.fit(
    data_training,
    response_training,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks = [early_stopping],
    validation_data=(data_testing, response_testing))

Check training history :

In [None]:
def plot_metrics(history):
  metrics =  ['loss', 'auc', 'precision', 'recall']
  for n, metric in enumerate(metrics):
    name = metric.replace("_"," ").capitalize()
    plt.subplot(2,2,n+1)
    plt.plot(history.epoch,  history.history[metric], color=colors[0], label='Train')
    plt.plot(history.epoch, history.history['val_'+metric],
             color=colors[0], linestyle="--", label='Val')
    plt.xlabel('Epoch')
    plt.ylabel(name)
    if metric == 'loss':
      plt.ylim([0, plt.ylim()[1]])
    elif metric == 'auc':
      plt.ylim([0.8,1])
    else:
      plt.ylim([0,1])

    plt.legend()


In [None]:
plot_metrics(history)

In [None]:
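# predict_classes was removed from recent Keras versions; threshold the probabilities at 0.5 instead.
predict_train = (model.predict(data_training) > 0.5).astype('int32')
predict_test = (model.predict(data_testing) > 0.5).astype('int32')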

In [None]:
cm = confusion_matrix(response_testing, predict_test)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g')

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')

unique, counts = np.unique(response_testing, return_counts=True)
print(dict(zip(unique, counts)))

unique, counts = np.unique(predict_test, return_counts=True)
print(dict(zip(unique, counts)))


If the model had predicted everything perfectly, this would be a diagonal matrix: the values off the main diagonal, which indicate incorrect predictions, would be zero. In this case, we can accept a few false positives (89 cases in the matrix), meaning that we might offer auto insurance to a customer who is not interested.
However, we need to reduce the false negatives (18,783 cases) as much as possible, because they are customers who want auto insurance but whom the model flagged as not interested.

# Plot the ROC

Now plot the ROC. This plot is useful because it shows, at a glance, the range of performance the model can reach just by tuning the output threshold.

In [None]:
def plot_roc(name, labels, predictions, **kwargs):
  fp, tp, _ = sklearn.metrics.roc_curve(labels, predictions)

  plt.plot(100*fp, 100*tp, label=name, linewidth=2, **kwargs)
  plt.xlabel('False positives [%]')
  plt.ylabel('True positives [%]')
  plt.grid(True)
  ax = plt.gca()
  ax.set_aspect('equal')

In [None]:
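# ROC curves need continuous scores, so pass predicted probabilities rather than hard class labels.
plot_roc("Train Baseline", response_training, model.predict(data_training, batch_size=BATCH_SIZE), color=colors[0])
plot_roc("Test Baseline", response_testing, model.predict(data_testing, batch_size=BATCH_SIZE), color=colors[0], linestyle='--')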
plt.legend(loc='lower right')
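
To illustrate what tuning the output threshold means, the sketch below sweeps a few thresholds over hypothetical predicted probabilities and reports the resulting false-positive and true-positive rates. Each threshold corresponds to one point on the ROC curve. The helper name `rates_at_threshold` and the data are illustrative only.

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
toy_probs = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.80, 0.70])

def rates_at_threshold(labels, probs, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    predicted = probs >= threshold
    tpr = np.sum(predicted & (labels == 1)) / np.sum(labels == 1)
    fpr = np.sum(predicted & (labels == 0)) / np.sum(labels == 0)
    return fpr, tpr

thresholds = (0.75, 0.5, 0.25)
roc_points = [rates_at_threshold(toy_labels, toy_probs, t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, roc_points):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold raises the true-positive rate (fewer false negatives) at the cost of more false positives, which is the trade-off discussed for the confusion matrix above.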

<a id="3_3"></a>
# Class weights

Calculate class weights

The goal is to identify the customers interested in auto insurance, but you don't have very many of those positive samples to work with, so you would want to have the classifier heavily weight the few examples that are available. You can do this by passing Keras weights for each class through a parameter. These will cause the model to "pay more attention" to examples from an under-represented class.

In [None]:
# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.

weight_for_0 = (1 / neg)*(neg+pos)/2.0 
weight_for_1 = (1 / pos)*(neg+pos)/2.0

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
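
The claim in the comments above, that the total weight over all examples stays the same, can be verified with a little arithmetic. The sketch below uses hypothetical class counts (`toy_neg`, `toy_pos`), not the real dataset.

```python
# Hypothetical class counts, for illustration only.
toy_neg, toy_pos = 880, 120
toy_total = toy_neg + toy_pos

toy_weight_0 = (1 / toy_neg) * toy_total / 2.0
toy_weight_1 = (1 / toy_pos) * toy_total / 2.0

# Each class contributes half of the total weight...
weighted_neg = toy_neg * toy_weight_0
weighted_pos = toy_pos * toy_weight_1
# ...so the summed weight over all examples equals the number of examples.
weighted_total = weighted_neg + weighted_pos
print(toy_weight_0, toy_weight_1, weighted_total)
```

The rarer class gets the larger per-example weight, and both classes end up with equal total influence on the loss.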

In [None]:
weighted_model = make_model()
weighted_model.load_weights(initial_weights)

weighted_history = weighted_model.fit(
    data_training,
    response_training,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks = [early_stopping],
    validation_data=(data_testing, response_testing),
    # The class weights go here
    class_weight=class_weight) 

In [None]:
# The parameter must be named `history`; otherwise the body below plots the global baseline history.
def plot_metrics(history):
  metrics =  ['loss', 'auc', 'precision', 'recall']
  for n, metric in enumerate(metrics):
    name = metric.replace("_"," ").capitalize()
    plt.subplot(2,2,n+1)
    plt.plot(history.epoch,  history.history[metric], color=colors[0], label='Train')
    plt.plot(history.epoch, history.history['val_'+metric],
             color=colors[0], linestyle="--", label='Val')
    plt.xlabel('Epoch')
    plt.ylabel(name)
    if metric == 'loss':
      plt.ylim([0, plt.ylim()[1]])
    elif metric == 'auc':
      plt.ylim([0.8,1])
    else:
      plt.ylim([0,1])

    plt.legend()

In [None]:
plot_metrics(weighted_history)

In [None]:
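# predict_classes was removed from recent Keras versions; threshold the probabilities at 0.5 instead.
predict_weighted_train = (weighted_model.predict(data_training) > 0.5).astype('int32')
predict_weighted_test = (weighted_model.predict(data_testing) > 0.5).astype('int32')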

In [None]:
cm = confusion_matrix(response_testing, predict_weighted_test)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g')

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')

unique, counts = np.unique(response_testing, return_counts=True)
print(dict(zip(unique, counts)))

unique, counts = np.unique(predict_weighted_test, return_counts=True)
print(dict(zip(unique, counts)))

In [None]:
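# ROC curves need continuous scores, so pass predicted probabilities rather than hard class labels.
plot_roc("Train Baseline", response_training, model.predict(data_training, batch_size=BATCH_SIZE), color=colors[0])
plot_roc("Test Baseline", response_testing, model.predict(data_testing, batch_size=BATCH_SIZE), color=colors[0], linestyle='--')

plot_roc("Train Weighted", response_training, weighted_model.predict(data_training, batch_size=BATCH_SIZE), color=colors[1])
plot_roc("Test Weighted", response_testing, weighted_model.predict(data_testing, batch_size=BATCH_SIZE), color=colors[1], linestyle='--')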


plt.legend(loc='lower right')

<a id="3_4"></a>
# Oversampling

Oversample the minority class
A related approach would be to resample the dataset by oversampling the minority class.

In [None]:
pos_features = data_training[bool_response_training]
neg_features = data_training[~bool_response_training]

pos_labels = response_training[bool_response_training]
neg_labels = response_training[~bool_response_training]

Using NumPy
You can balance the dataset manually by choosing the right number of random indices from the positive examples:

In [None]:
ids = np.arange(len(pos_features))
choices = np.random.choice(ids, len(neg_features))

res_pos_features = pos_features[choices]
res_pos_labels = pos_labels[choices]

print(res_pos_features.shape)
print(pos_features.shape)

In [None]:
resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)

order = np.arange(len(resampled_labels))
np.random.shuffle(order)
resampled_features = resampled_features[order]
resampled_labels = resampled_labels[order]

resampled_features.shape

Using tf.data
If you're using tf.data the easiest way to produce balanced examples is to start with a positive and a negative dataset, and merge them. See the tf.data guide for more examples.

In [None]:
BUFFER_SIZE = 100000

def make_ds(features, labels):
  ds = tf.data.Dataset.from_tensor_slices((features, labels))#.cache()
  ds = ds.shuffle(BUFFER_SIZE).repeat()
  return ds

pos_ds = make_ds(pos_features, pos_labels)
neg_ds = make_ds(neg_features, neg_labels)

In [None]:
for features, label in pos_ds.take(1):
  print("Features:\n", features.numpy())
  print()
  print("Label: ", label.numpy())

Merge the two together using experimental.sample_from_datasets:

In [None]:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
resampled_ds = resampled_ds.batch(BATCH_SIZE).prefetch(2)

In [None]:
for features, label in resampled_ds.take(1):
  print(label.numpy().mean())

To use this dataset, you'll need the number of steps per epoch.

The definition of "epoch" in this case is less clear. Say it's the number of batches required to see each negative example once:

In [None]:
# steps_per_epoch should be an integer number of batches.
resampled_steps_per_epoch = int(np.ceil(2.0*neg/BATCH_SIZE))
resampled_steps_per_epoch

Train on the oversampled data
Now try training the model with the resampled data set instead of using class weights to see how these methods compare.

In [None]:
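resampled_model = make_model()
# Build the layers first so the weights exist before they are loaded and edited.
resampled_model.build(input_shape=(None, data_training.shape[1]))
resampled_model.load_weights(initial_weights)

# Reset the output bias to zero, since this dataset is balanced.
output_layer = resampled_model.layers[-1]
output_layer.bias.assign([0])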

val_ds = tf.data.Dataset.from_tensor_slices((data_testing, response_testing)).cache()
val_ds = val_ds.batch(BATCH_SIZE).prefetch(2) 

resampled_history = resampled_model.fit(
    resampled_ds,
    epochs=EPOCHS,
    steps_per_epoch=resampled_steps_per_epoch,
    callbacks = [early_stopping],
    validation_data=val_ds)

In [None]:
plot_metrics(resampled_history )

In [None]:
train_predictions_resampled = resampled_model.predict(data_training, batch_size=BATCH_SIZE)
test_predictions_resampled = resampled_model.predict(data_testing, batch_size=BATCH_SIZE)

In [None]:
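# ROC curves need continuous scores, so pass predicted probabilities rather than hard class labels.
plot_roc("Train Baseline", response_training, model.predict(data_training, batch_size=BATCH_SIZE), color=colors[0])
plot_roc("Test Baseline", response_testing, model.predict(data_testing, batch_size=BATCH_SIZE), color=colors[0], linestyle='--')

plot_roc("Train Weighted", response_training, weighted_model.predict(data_training, batch_size=BATCH_SIZE), color=colors[1])
plot_roc("Test Weighted", response_testing, weighted_model.predict(data_testing, batch_size=BATCH_SIZE), color=colors[1], linestyle='--')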

plot_roc("Train Resampled", response_training, train_predictions_resampled,  color=colors[2])
plot_roc("Test Resampled", response_testing, test_predictions_resampled,  color=colors[2], linestyle='--')
plt.legend(loc='lower right')

<a id="4"></a>
# Output

In [None]:
data_test=df_test.values.tolist()
data_test = np.array(data_test, dtype='float')

In [None]:
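# Apply the scaling and clipping fitted on the training split (the model was trained on scaled features).
scaled_test = np.clip(scaler.transform(data_test), -5, 5)
# predict_classes was removed from recent Keras versions; threshold the probabilities at 0.5 instead.
prediction = (resampled_model.predict(scaled_test) > 0.5).astype('int32')

# The first column of the test set holds the customer id.
ids = data_test[:, 0].astype('int')
result = prediction[:, 0]
combined = np.vstack((ids, result)).T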

In [None]:
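# Name the columns and drop the DataFrame index so the file contains only id and prediction.
pd.DataFrame(combined, columns=['id', 'Response']).to_csv('submission.csv', index=False)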