# Vehicle Insurance Predictive Analysis


A health insurance company wants to offer its existing customers a new vehicle insurance policy. They need us to build a model that predicts whether a customer would be interested in this new insurance.

The company has provided a dataset of former customers and their responses to the offer. They also ask us to use the resulting model to predict the likely responses for a second dataset of new customers.

The datasets are divided into the following variables:

| Variable | Definition |
| --- | --- |
| id | Identifier |
| Gender | Gender (M/F) |
| Age | Age |
| Driving_License | Customer has a driving license (1/0) |
| Region_Code | Customer's region code |
| Previously_Insured | Customer already has vehicle insurance (1/0) |
| Vehicle_Age | Vehicle age |
| Vehicle_Damage | Previous damage to the customer's vehicle (Yes/No) |
| Annual_Premium | Amount to pay for the new car insurance |
| Policy_Sales_Channel | Communication channel (e-mail, phone, in person, etc.) |
| Vintage | Number of days the customer has been with the company |
| Response | Dependent variable: yes, no (1/0) |

### The problem will be solved by a classification machine learning model

## Dataset import and split

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Drop the id column from the dataset so it does not compromise the predictions
train = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
x = train.iloc[:, 1:-1] # Features: every column except id (first) and Response (last)
y = train.iloc[:, -1] # Dependent variable (Response)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0) # Train and test set split

test_final = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv').iloc[:, 1:] # Final test dataset without the id column

## Dataset analysis

Dataset structure analysis

Complete dataset:

In [None]:
display(train)

Checking NaN values:

In [None]:
train.isnull().sum()

Checking data types

In [None]:
dtypes_train = train.dtypes
display(dtypes_train)

## Dataset visualization

In this section we visualize the dataset to make it easier to analyze how customers behave under the different variable conditions. This section is divided into:

- Categorical variables analysis
- Continuous variables analysis
- Variables comparison

### Categorical variables analysis

In [None]:
fig, ax = plt.subplots(2, 3, figsize = (30, 20))

# Gender plot
g_gender = sns.countplot(
    data = train,
    x = 'Gender',
    palette = sns.color_palette('Set2'),
    ax = ax[0, 0]
)

# Driving License plot
g_driving_license = sns.countplot(
    data = train,
    x = 'Driving_License',
    palette = sns.color_palette('Set2'),
    ax = ax[1, 0]
)

# Previously insured plot
g_previously_insured = sns.countplot(
    data = train,
    x = 'Previously_Insured',
    palette = sns.color_palette('Set2'),
    ax = ax[0, 1]
)

# Vehicle damage plot
g_damage = sns.countplot(
    data = train,
    x = 'Vehicle_Damage',
    palette = sns.color_palette('Set2'),
    ax = ax[1, 1]
)

# Vehicle Age plot
g_vehicle_age = sns.countplot(
    data = train,
    x = 'Vehicle_Age',
    palette = sns.color_palette('Set2'),
    ax = ax[0, 2]
)

# Response plot
g_response = sns.countplot(
    data = train,
    x = 'Response',
    palette = sns.color_palette('Set2'),
    ax = ax[1, 2]
)

# Titles
ax[0, 0].set_title('Gender', fontsize=20)
ax[1, 0].set_title('Driving License', fontsize=20)
ax[0, 1].set_title('Previously Insured', fontsize=20)
ax[1, 1].set_title('Vehicle Damage', fontsize=20)
ax[0, 2].set_title('Vehicle Age', fontsize=20)
ax[1, 2].set_title('Response', fontsize=20)

# Delete x and y labels (a separate loop variable avoids overwriting `ax`)
for a in ax.reshape(-1):
    a.set_xlabel(None)
    a.set_ylabel(None)

# Super title
fig.suptitle('Categorical variables', size = '40', y = 1.0)

plt.tight_layout() # Plots fit the fig area
plt.show()
plt.close()

- Gender: we observe an almost uniform distribution.
- Previously insured: we observe an almost uniform distribution.
- Vehicle age: uneven distribution. There is little data on customers with vehicles older than two years.
- Driving license: there is almost no data on customers without a license. Such data would be useful for offering insurance to future drivers.
- Vehicle damage: we observe a uniform distribution.
- Response: we observe many more negative responses than affirmative ones.
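
The imbalance in *Response* matters later, because a model that always answers "no" already scores a high accuracy. The following is a minimal sketch of how that baseline could be quantified; the `toy` frame is a hypothetical stand-in for `train`, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy sample standing in for the Response column of `train`
toy = pd.DataFrame({"Response": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]})

# Share of each class; normalize=True returns proportions instead of counts
class_share = toy["Response"].value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

On the real data, replacing `toy` with `train` would give the baseline against which the models' accuracy can be judged.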

### Continuous variables analysis

In [None]:
fig, ax = plt.subplots(5, 1, figsize = (30, 20))

# Age plot
g_edad = sns.histplot(
    data = train,
    x = 'Age',
    kde=True,
    ax = ax[0]
)

# Region Code plot
g_region = sns.histplot(
    data = train,
    x = 'Region_Code',
    kde=True,
    ax = ax[1]
)

# Premium annual plot
g_premium_anual = sns.histplot(
    data = train,
    x = 'Annual_Premium',
    kde=True,
    ax = ax[2]
)

# Policy sales channel plot
g_policy_sales_channel = sns.histplot(
    data = train,
    x = 'Policy_Sales_Channel',
    kde=True,
    ax = ax[3]
)

# Vintage plot
g_vintage = sns.histplot(
    data = train,
    x = 'Vintage',
    kde=True,
    ax = ax[4]
)

# Titles
ax[0].set_title('Age', fontsize=20)
ax[1].set_title('Region Code', fontsize=20)
ax[2].set_title('Annual Premium', fontsize=20)
ax[3].set_title('Policy Sales Channel', fontsize=20)
ax[4].set_title('Vintage', fontsize=20)

# Limit the x axis of annual premium because of its nonuniform distribution
ax[2].set_xlim(0, 90000)

# Delete x and y labels (a separate loop variable avoids overwriting `ax`)
for a in ax.reshape(-1):
    a.set_xlabel(None)
    a.set_ylabel(None)

# Super title
fig.suptitle('Continuous variables', size = '40', y = 1.0)

plt.tight_layout()  # Plots fit the fig area
plt.show()
plt.close()

- Age: there is much more data on young customers.
- Region: uneven distribution, possibly due to the marketing target.
- Annual Premium: very uneven distribution. It would be convenient to diversify the price to obtain a more valuable response from customers when analyzing future forecasts.
- Policy sales channel: most records fall in channels (25-30), (120-130) and (150, 165).
- Vintage: we observe a uniform distribution.

### Variables comparison

We can consider previous vehicle damage and the vehicle's age to be two of the most decisive variables. We also split these two variables by the customer's gender.

#### We compare the following variables:

- Vehicle's age
- Previous damage to the vehicle
- Gender
- Response

In [None]:
# Vehicle age vs Gender vs Vehicle Damage vs Response plot
g_vage_gender_vdama = sns.catplot(
    data = train,
    y = 'Vehicle_Age', hue = 'Response', col = 'Gender', row = 'Vehicle_Damage',    
    kind = 'count', height=6, aspect=13/6,    
    legend = True,  
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    edgecolor = '.6',
    margin_titles = True,
    legend_out = False)

# Title axis & super title  
g_vage_gender_vdama.set_titles(size = '30')
g_vage_gender_vdama.set_axis_labels("Total", "Vehicle Age", size = '30')
g_vage_gender_vdama.fig.suptitle('Vehicle Age vs Gender vs Vehicle Damage vs Response', size = '40', y = 1.1)
plt.setp(g_vage_gender_vdama._legend.get_texts(), fontsize=25)
plt.setp(g_vage_gender_vdama._legend.get_title(), fontsize=25)

plt.tight_layout() # Plots fit the fig area
plt.show()
plt.close()

We can see that customers whose vehicle has previously been damaged are more likely to give a positive response to the insurer's offer.

We can also see how the affirmative response from customers with newer vehicles is somewhat lower than in the other two cases.

Let's see how customers who already have insurance and do not have a driving license behave:

#### We compare the following variables:

- Vehicle's age
- Driving license
- Previously insured
- Response

In [None]:
# Previously Insured vs Vehicle Age vs Driving License Plot
g_pins_vdama = sns.catplot(
    data = train,
    hue = 'Response', x = 'Vehicle_Age', col = 'Previously_Insured', row = 'Driving_License',    
    kind = 'count', height=6, aspect=13/6,
    legend = True,  
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    edgecolor = '.6',
    margin_titles = True,
    legend_out = False
)

# Title axis & super title  
g_pins_vdama.set_titles(size = '30')
g_pins_vdama.set_axis_labels("Vehicle Age", "Total", size = '30')
g_pins_vdama.fig.suptitle('Previously Insured vs Vehicle Age vs Driving License vs Response', size = '40', y = 1.1)
plt.setp(g_pins_vdama._legend.get_texts(), fontsize=25)
plt.setp(g_pins_vdama._legend.get_title(), fontsize=25)

plt.tight_layout()  # Plots fit the fig area
plt.show()
plt.close()

We can see that customers who already have their vehicle insured respond negatively to the insurer's offer.

We don't have enough data on unlicensed customers to draw meaningful conclusions about them.
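
The same observation can be checked numerically rather than read off the plot. Below is a minimal sketch using a row-normalised contingency table; the `toy` frame and its values are hypothetical, and the notebook would pass `train` instead.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use `train` instead
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 1, 1, 1, 1],
    "Response":           [1, 0, 1, 0, 0, 0, 0, 0],
})

# Row-normalised contingency table: each row sums to 1, so every cell is the
# share of responses within one Previously_Insured group
rates = pd.crosstab(toy["Previously_Insured"], toy["Response"], normalize="index")
print(rates)
```

Adding `Driving_License` as a second row variable would also show how few unlicensed customers each cell contains.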

We compare some continuous variables to see how the responses are distributed.

#### We compare the following variables:

- Age
- Region code
- Policy Sales Channel
- Response

In [None]:
fig, ax = plt.subplots(4, 1, figsize = (20, 20))

# Age vs Response Plot
g_age_response = sns.histplot(
    data = train,
    x = 'Age', hue = 'Response',
    multiple = 'fill',
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    edgecolor = '.6',
    linewidth = .5,
    ax = ax[0]
)
# x ticks
ax[0].set_xticks(np.arange(19, 85, 2))

# Region vs Age vs Response Plot
g_region_ = sns.violinplot(
    data = train,     
    x = 'Region_Code', y = 'Age', hue = 'Response', 
    split = True, inner = 'quart', linewidth = 1,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    edgecolor = '.6',
    ax = ax[1]
)
#x ticks
ax[1].set_xticks(np.arange(0, 55, 2))

# Region vs Policy Sales Channel vs Response Plot
g_region_policy_response = sns.violinplot(
    data = train,     
    x = 'Region_Code', y = 'Policy_Sales_Channel', hue = 'Response', 
    split = True, inner = 'quartile', linewidth = 1,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    edgecolor = '.6',
    ax = ax[2]
)
#x ticks
ax[2].set_xticks(np.arange(0, 55, 2))

# Region vs Response
g_region_response = sns.histplot(
    data = train,
    x = 'Region_Code', hue = 'Response',
    multiple = 'fill',
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    edgecolor = '.6',
    linewidth = .5,
    ax = ax[3]
)
#x ticks
ax[3].set_xticks(np.arange(0, 55, 2))

# Titles
ax[0].set_title('Age', fontsize=20)
ax[1].set_title('Region Code - Age', fontsize=20)
ax[2].set_title('Region Code - Policy Channel', fontsize=20)
ax[3].set_title('Region Code', fontsize=20)

fig.tight_layout() # Plots fit the fig area
plt.show()
plt.close()

In these plots we observe that:

- Customers aged between 33 and 57 are more willing to respond yes to our offer
- Although we have positive answers between 19 and 27 years old, the proportion of affirmative answers is lower than for other ages
- Policy sales channels close to 150 seem to receive a very low proportion of affirmative responses
- We did not find any relevant pattern regarding the customer's region
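
The age ranges above were read from the plot. As a rough cross-check, the response rate per age band could be computed directly. This is a sketch only: the bin edges are an assumption chosen to mirror the ranges mentioned, and the `toy` values are invented.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Age and Response columns of `train`
toy = pd.DataFrame({
    "Age":      [20, 24, 26, 30, 35, 40, 45, 50, 60, 70],
    "Response": [0,  0,  1,  0,  1,  0,  1,  1,  0,  0],
})

# Bin edges are an assumption; intervals are right-closed, e.g. (32, 57]
bins = [18, 27, 32, 57, 85]
toy["Age_Group"] = pd.cut(toy["Age"], bins=bins)

# The mean of a 0/1 column is the share of affirmative responses in each group
rate_by_age = toy.groupby("Age_Group", observed=False)["Response"].mean()
print(rate_by_age)
```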

#### Vintage and Annual_Premium Analysis

In this last section we analyze the influence of two important values: the customer's seniority and the price to pay for the insurance.

We also analyze these variables against the policy sales channels, the vehicle's age, the customer's age and the customer's region.

In [None]:
# Vintage vs Response Plot
g_vintage = sns.displot(
    data = train,
    x = 'Vintage', hue = 'Response',
    kind=  'kde', height=6, aspect=3,
    multiple = 'fill',
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
)
g_vintage.fig.suptitle('Vintage', size = '30', y = 1.1)

# Annual Premium vs Response
g_annual_premium = sns.displot(
    data=train,
    x = 'Annual_Premium', hue = 'Response',
    kind = 'kde', height=6, aspect=3,
    multiple = 'fill',
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
)
# Title, xticks, xlimit (nonuniform distribution)
g_annual_premium.fig.suptitle('Annual Premium', size = '30', y = 1.1)
g_annual_premium.set(xticks=np.arange(0, 100000, 10000))
g_annual_premium.set(xlim=(0,90000))

# Scatter plots: Annual Premium vs (Vintage or Region or Policy Sales Channel or Age)
fig, ax = plt.subplots(5, 1, figsize=(30, 40))

g_vint_premium = sns.scatterplot(
    data = train, x = 'Vintage', y = 'Annual_Premium', hue = 'Response',
    linewidth=2.5,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    ax=ax[0])
ax[0].set_title('Vintage - Annual Premium', fontsize = 30)

g_region_premium = sns.scatterplot(
    data = train, x = 'Region_Code', y = 'Annual_Premium', hue = 'Response',
    linewidth=2.5,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    ax=ax[1])
ax[1].set_title('Region Code - Annual Premium', fontsize = 30)

g_pol_premium = sns.scatterplot(
    data = train, x = 'Policy_Sales_Channel', y = 'Annual_Premium', hue = 'Response',
    linewidth=2.5,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    ax=ax[2])
ax[2].set_title('Policy Sales Channel - Annual Premium', fontsize = 30)

g_age_premium = sns.scatterplot(
    data = train, x = 'Age', y = 'Annual_Premium', hue = 'Response',
    linewidth=2.5,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    ax=ax[3])
ax[3].set_title('Age - Annual Premium', fontsize = 30)

g_vehicle_age_premium = sns.histplot(
    data = train, x = 'Vehicle_Age', y = 'Annual_Premium', hue = 'Response',
    linewidth=2.5,
    palette = sns.color_palette(['#E74C3C', '#58D68D']),
    ax=ax[4])
ax[4].set_title('Vehicle Age - Annual Premium', fontsize = 30)

fig.tight_layout() # Plots fit the fig area
plt.show()
plt.close()


We receive affirmative responses from both the newest and the longest-standing customers.

As we saw earlier, the distribution of **annual premium** is not uniform. Despite this, we can see a clear concentration of affirmative answers around the 9000 mark. More data would be needed before drawing firm conclusions from this graph.

We note that **the insurer does not offer improvements in annual premium with respect to seniority, policy sales channel, region or customer age**. It would be interesting to make some kind of offer, at least to older customers.

In the last graph it is interesting to see the limit at which customers decide to pay the annual premium according to their vehicle's age. We found somewhat more permissive decisions for 1-2 year old vehicles. We could **lower the price for customers with newer cars**, as they would be less likely to need assistance.

## Dataset cleaning and processing

### Categorical values

Categorical values to bool values
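
One-hot encoding the train and test sets separately can produce different columns when a category appears in only one of them. The following is a minimal sketch of one way to align them, using `DataFrame.reindex`; the frames and the `Channel` column are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frames; category 'C' never appears in the training rows
train_toy = pd.DataFrame({"Age": [25, 40, 33], "Channel": ["A", "B", "A"]})
test_toy = pd.DataFrame({"Age": [51, 29], "Channel": ["B", "C"]})

train_enc = pd.get_dummies(train_toy, columns=["Channel"], prefix=["Channel"])
test_enc = pd.get_dummies(test_toy, columns=["Channel"], prefix=["Channel"])

# Align the test frame to the training columns: dummies missing from the test
# set are added with 0, and dummies unseen during training (Channel_C) are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

This has the same effect as adding the missing columns in a loop and then selecting the training columns, in a single call.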

In [None]:
# Generate boolean values for categorical columns (train set & test set)
le = LabelEncoder()
x_train = pd.get_dummies(x_train, columns=['Gender', 'Vehicle_Age', 'Policy_Sales_Channel', 'Region_Code'], prefix=['Gender', 'Vehicle_Age', 'Policy_Sales_Channel', 'Region_Code'])
x_train['Vehicle_Damage'] = le.fit_transform(x_train['Vehicle_Damage']) # Yes -> 1 | No -> 0

x_test = pd.get_dummies(x_test, columns=['Gender', 'Vehicle_Age', 'Policy_Sales_Channel', 'Region_Code'], prefix=['Gender', 'Vehicle_Age', 'Policy_Sales_Channel', 'Region_Code'])
x_test['Vehicle_Damage'] = le.fit_transform(x_test['Vehicle_Damage']) # Yes -> 1 | No -> 0

test_final = pd.get_dummies(test_final, columns=['Gender', 'Vehicle_Age', 'Policy_Sales_Channel', 'Region_Code'], prefix=['Gender', 'Vehicle_Age', 'Policy_Sales_Channel', 'Region_Code'])
test_final['Vehicle_Damage'] = le.fit_transform(test_final['Vehicle_Damage']) # Yes -> 1 | No -> 0

# Columns not ordered
display(x_train)
display(x_test)

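# Reordering columns so the continuous variables come first and the categorical data last
cols = list(x_train.columns) # List of column names
cols = cols[0:1] + cols[4:6] + cols[1:4] + cols[6:] # Age, Annual_Premium, Vintage, then the rest (cols[6:] avoids duplicating columns 4 and 5)
x_train = x_train[cols] # Copy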

# Get the columns present in the training set but missing from each test set
missing_cols_test = set(x_train.columns) - set(x_test.columns)
missing_cols_finaltest = set(x_train.columns) - set(test_final.columns)

# Add a missing column in test set with default value equal to 0
for c in missing_cols_test:
    x_test[c] = 0

for c in missing_cols_finaltest:
    test_final[c] = 0

# Ensure the test-set columns follow the same order as in the training set, and copy
x_test = x_test[x_train.columns]
test_final = test_final[x_train.columns]

# Columns ordered
display(x_train)
display(x_test)
display(test_final)

### Standardization

We standardize the continuous variables so that they do not distort later analysis. Very large values of certain variables can carry much more weight than other variables with smaller values. This may be the case for *Annual_Premium* over *Vintage*. Although they may have the same weight on the final decision, in a logistic regression model the *Annual_Premium* values will have greater weight if we do not standardize.
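
Standardization replaces each value $x$ with $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the column's mean and standard deviation. The sketch below illustrates this with `StandardScaler`, assuming the common practice of learning $\mu$ and $\sigma$ from the training data only and reusing them on unseen data; the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: a premium-like column and a vintage-like column
train_vals = np.array([[20000.0, 10.0], [30000.0, 150.0], [40000.0, 290.0]])
test_vals = np.array([[30000.0, 150.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train_vals)  # learns mean and std from the training data
test_scaled = sc.transform(test_vals)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # each column is centred on 0
print(train_scaled.std(axis=0))   # each column has unit standard deviation
print(test_scaled)
```

After scaling, both columns live on the same scale, so neither dominates purely because of its units.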

In [None]:
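# Continuous variables standardization
sc = StandardScaler()

# Continuous variables only (first three columns: Age, Annual_Premium, Vintage)
x_train.iloc[:, :3] = sc.fit_transform(x_train.iloc[:, :3]) # Fit the scaler on the training set only
x_test.iloc[:, :3] = sc.transform(x_test.iloc[:, :3]) # Reuse the training mean and std
test_final.iloc[:, :3] = sc.transform(test_final.iloc[:, :3]) # Reuse the training mean and std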

display(x_train)
display(x_test)
display(test_final)

## Predictive analytics: model creation

We now look for the best model for predicting customer responses.

We pay special attention to what our client asks of us. The insurer wants to contact the customers that our model predicts as interested. Therefore, apart from measuring each model's overall accuracy, **we must measure how many of the customers predicted as interested actually gave an affirmative response** (the precision of the positive class). This measure reflects the cost the company incurs in contacting potential customers. We bear in mind that the insurer will not contact the customers who appear as *"not interested"* in our model.
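
As an illustration of this second measure, the sketch below computes it from a confusion matrix on invented labels and compares it with scikit-learn's `precision_score`. The arrays `y_true` and `y_pred` are hypothetical and do not come from the insurer's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels: 1 = interested, 0 = not interested
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
tp = cm[1, 1]  # predicted 1 and actually 1
fp = cm[0, 1]  # predicted 1 but actually 0

# Share of predicted positives that are real positives
rate = tp / (tp + fp)
print(rate, precision_score(y_true, y_pred))
```

Both values agree, so the ratio computed by hand in the result cells corresponds to the precision reported for class 1 in `classification_report`.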

The support vector machine model is omitted because of its excessive computing time.

#### We'll store each model's scores in these arrays

In [None]:
accuracy_1_scores = np.empty(5)
accuracy_total = np.empty(5)

### Logistic regression analysis

#### Model configuration

In [None]:
logmodel = LogisticRegression(solver='lbfgs', max_iter=200) # Logistic Regression
logmodel.fit(x_train, y_train) # Training...
predictions_logmodel = logmodel.predict(x_test) # Predictions over test set

#### Result

In [None]:
# Reports: confusion matrix & accuracy score
cm = confusion_matrix(y_test, predictions_logmodel)
log_model_acc = accuracy_score(y_test, predictions_logmodel)

print('Accuracy: ', log_model_acc)
print('\nREPORT:\n', classification_report(y_test, predictions_logmodel))

# Confusion matrix table
fig, ax =plt.subplots(1, figsize=(10,4))

labels_pred =['Pred 0', 'Pred 1']
labels_result =['Actual 0', 'Actual 1']
ax.axis('tight')
ax.axis('off')
colors = [['g', 'r'],[ 'r', 'g']]

conf_matrix_table = ax.table(   cellText = cm, 
                                colLabels = labels_pred,
                                rowLabels = labels_result,    
                                loc = 'center',   
                                cellColours = colors, 
                                cellLoc = 'center')

ax.set_title('Confusion Matrix', fontsize = 20)
conf_matrix_table.auto_set_font_size(False)
conf_matrix_table.set_fontsize(14)
conf_matrix_table.scale(1.5, 1.5)

# Pred 1 rate
pred_1_rate = cm[1, 1] / (cm[0, 1] + cm[1, 1])

accuracy_1_scores[0] = pred_1_rate
accuracy_total[0] = log_model_acc

print('Actual 1 prediction rate: ', pred_1_rate)


- We get a very good overall accuracy.
- Most errors are false negatives.
- The drawback of this model is that it correctly predicts very few actual positives in absolute terms (future customers), although the ratio is good (less cost).

### kNN analysis

#### Model configuration

In [None]:
knn_model = KNeighborsClassifier(metric='minkowski', p=2, n_jobs=-1) # kNN model
knn_model.fit(x_train, y_train) # Training...
predictions_knn_model = knn_model.predict(x_test) # Predictions over test set

#### Result

In [None]:
# Reports: confusion matrix & accuracy score
cm = confusion_matrix(y_test, predictions_knn_model)
knn_model_acc = accuracy_score(y_test, predictions_knn_model)

print('Accuracy: ', knn_model_acc)
print('\nREPORT:\n', classification_report(y_test, predictions_knn_model))

# Confusion matrix table
fig, ax =plt.subplots(1, figsize=(10,4))

labels_pred =['Pred 0', 'Pred 1']
labels_result =['Actual 0', 'Actual 1']
ax.axis('tight')
ax.axis('off')
colors = [['g', 'r'],[ 'r', 'g']]

conf_matrix_table = ax.table(   cellText = cm, 
                                colLabels = labels_pred,
                                rowLabels = labels_result,    
                                loc = 'center',   
                                cellColours = colors, 
                                cellLoc = 'center')


ax.set_title('Confusion Matrix', fontsize = 20)
conf_matrix_table.auto_set_font_size(False)
conf_matrix_table.set_fontsize(14)
conf_matrix_table.scale(1.5, 1.5)

# Pred 1 rate
pred_1_rate = cm[1, 1] / (cm[0, 1] + cm[1, 1])

accuracy_1_scores[1] = pred_1_rate
accuracy_total[1] = knn_model_acc

print('Actual 1 prediction rate: ', pred_1_rate)

- Slight decrease in accuracy.
- Although this model is less precise on positives (more contact effort), it correctly predicts more actual positives in absolute terms (future customers).

### Naive Bayes analysis

#### Model configuration

In [None]:
nb_model = GaussianNB() # Naive Bayes model
nb_model.fit(x_train, y_train) # Training...
predictions_nb_model = nb_model.predict(x_test) # Predictions over test set

#### Result

In [None]:
# Reports: confusion matrix & accuracy score
cm = confusion_matrix(y_test, predictions_nb_model)
nb_model_acc = accuracy_score(y_test, predictions_nb_model)

print('Accuracy: ', nb_model_acc)
print('\nREPORT:\n', classification_report(y_test, predictions_nb_model))

# Confusion matrix table
fig, ax =plt.subplots(1, figsize=(10,4))

labels_pred =['Pred 0', 'Pred 1']
labels_result =['Actual 0', 'Actual 1']
ax.axis('tight')
ax.axis('off')
colors = [['g', 'r'],[ 'r', 'g']]

conf_matrix_table = ax.table(   cellText = cm, 
                                    colLabels = labels_pred,
                                    rowLabels = labels_result,
                                    loc = 'center',   
                                    cellColours = colors, 
                                    cellLoc = 'center')


ax.set_title('Confusion Matrix', fontsize = 20)
conf_matrix_table.auto_set_font_size(False)
conf_matrix_table.set_fontsize(14)
conf_matrix_table.scale(1.5, 1.5)

# Pred 1 rate
pred_1_rate = cm[1, 1] / (cm[0, 1] + cm[1, 1])

accuracy_1_scores[2] = pred_1_rate
accuracy_total[2] = nb_model_acc

print('Actual 1 prediction rate: ', pred_1_rate)

- Faster model, but with lower accuracy.
- It correctly predicts many more actual positives in absolute terms, with a good ratio between actual and predicted positives.

### Decision Tree analysis

#### Model configuration

In [None]:
dtc_model = DecisionTreeClassifier(criterion = 'entropy', random_state = 0) # Decision Tree Model
dtc_model.fit(x_train, y_train) # Training...
predictions_dtc_model = dtc_model.predict(x_test) # Predictions over test set

#### Result

In [None]:
# Reports: confusion matrix & accuracy score
cm = confusion_matrix(y_test, predictions_dtc_model)
dtc_model_acc = accuracy_score(y_test, predictions_dtc_model)

print('Accuracy: ', dtc_model_acc)
print('\nREPORT:\n', classification_report(y_test, predictions_dtc_model))

# Confusion matrix table
fig, ax =plt.subplots(1, figsize=(10,4))

labels_pred =['Pred 0', 'Pred 1']
labels_result =['Actual 0', 'Actual 1']
ax.axis('tight')
ax.axis('off')
colors = [['g', 'r'],[ 'r', 'g']]

conf_matrix_table = ax.table(   cellText = cm, 
                                colLabels = labels_pred,
                                rowLabels = labels_result,    
                                loc = 'center',   
                                cellColours = colors, 
                                cellLoc = 'center')


ax.set_title('Confusion Matrix', fontsize = 20)
conf_matrix_table.auto_set_font_size(False)
conf_matrix_table.set_fontsize(14)
conf_matrix_table.scale(1.5, 1.5)

# Pred 1 rate
pred_1_rate = cm[1, 1] / (cm[0, 1] + cm[1, 1])

accuracy_1_scores[3] = pred_1_rate
accuracy_total[3] = dtc_model_acc

print('Actual 1 prediction rate: ', pred_1_rate)

- Lower accuracy than kNN.
- Much faster than kNN.
- Higher ratio of correct positive predictions, and therefore less effort for the insurer when contacting customers.

### Random Forest analysis

#### Model configuration

In [None]:
rfc_model = RandomForestClassifier(n_estimators = 10, criterion='entropy', random_state=0) # Random Forest Model
rfc_model.fit(x_train, y_train) # Training...
predictions_rfc_model = rfc_model.predict(x_test) # Predictions over test set

#### Result

In [None]:
# Reports: confusion matrix & accuracy score
cm = confusion_matrix(y_test, predictions_rfc_model)
rfc_model_acc = accuracy_score(y_test, predictions_rfc_model)

print('Accuracy: ', rfc_model_acc)
print('\nREPORT:\n', classification_report(y_test, predictions_rfc_model))

# Confusion matrix table
fig, ax =plt.subplots(1, figsize=(10,4))

labels_pred =['Pred 0', 'Pred 1']
labels_result =['Actual 0', 'Actual 1']
ax.axis('tight')
ax.axis('off')
colors = [['g', 'r'],[ 'r', 'g']]

conf_matrix_table = ax.table(   cellText = cm, 
                                    colLabels = labels_pred,
                                    rowLabels = labels_result,  
                                    loc = 'center',   
                                    cellColours = colors, 
                                    cellLoc = 'center')


ax.set_title('Confusion Matrix', fontsize = 20)
conf_matrix_table.auto_set_font_size(False)
conf_matrix_table.set_fontsize(14)
conf_matrix_table.scale(1.5, 1.5)

# Pred 1 rate
pred_1_rate = cm[1, 1] / (cm[0, 1] + cm[1, 1])

accuracy_1_scores[4] = pred_1_rate
accuracy_total[4] = rfc_model_acc

print('Actual 1 prediction rate: ', pred_1_rate)

- Accuracy practically the same as logistic regression.
- The precision on predicted positives is much higher than for the other models.

### Model analysis

Visualizing the models' accuracy

In [None]:
# DataFrame of accuracy values
acc_data = pd.DataFrame({'Model': ['Log Model', 'kNN', 'Naive Bayes', 'Decision Tree', 'Random Forest'], 'Accuracy':accuracy_total, 'Actual 1 Accuracy':accuracy_1_scores})
display(acc_data)

# Plots
fig, ax =plt.subplots(2, figsize=(10,10))

sns.barplot(
    data=acc_data, x='Model', y='Accuracy', 
    estimator=sum, ci=None, 
    ax = ax[0])

ax[0].set_title('Accuracy', fontsize = '20')

sns.barplot(
    data=acc_data, x='Model', y='Actual 1 Accuracy', 
    estimator=sum, ci=None, 
    ax = ax[1])


ax[1].set_title('Actual 1 Accuracy', fontsize = '20')

plt.tight_layout()
plt.show()
plt.close()


We can see that the best model we have created is the **Random Forest**. It offers the second highest overall accuracy, but its precision on predicted positives is much higher than that of the other models.
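
The per-model cells above repeat the same evaluation steps. As a compact alternative, the comparison table could be assembled in a single loop. This is only a sketch: it runs on synthetic data from `make_classification` rather than the insurer's dataset, compares just two of the models, and the column name `Precision (class 1)` is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the insurer's dataset)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Log Model": LogisticRegression(max_iter=200),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, pred),
        # zero_division=0 returns 0 instead of warning if no positives are predicted
        "Precision (class 1)": precision_score(y_te, pred, zero_division=0),
    })

scores = pd.DataFrame(rows)
print(scores)
```

Keeping the evaluation in one place makes it harder for the cells to drift apart when a metric or label is changed.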

## Result and conclusions

Finally, we use the model to create a new *Response* column in the dataset of possible future customers provided by the insurer.

#### Making predictions

In [None]:
final_pred = logmodel.predict(test_final) # Using logmodel because it has the highest overall accuracy, even though in a real-world setting the random forest would be the better choice

#### Importing dataset again to return the results with the input format

In [None]:
test_final = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')
test_final['Response'] = final_pred
test_final.drop(test_final.columns.difference(['id','Response']), axis=1, inplace=True) # Keep only id and Response

In [None]:
display(test_final)

#### Saving dataset

In [None]:
test_final.to_csv('predictions_car_insurance.csv', index = False)