Prudential is one of the largest issuers of life insurance in the USA.

In a one-click shopping world with on-demand everything, the life insurance application process is antiquated. Customers provide extensive information to identify risk classification and eligibility, including scheduling medical exams, a process that takes an average of 30 days.

The result? People are turned off. That's why only 40% of U.S. households own individual life insurance. Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries.

By developing a predictive model that accurately classifies risk using a more automated approach, you can greatly impact public perception of the industry.


# Goal

In this dataset, you are provided over a hundred variables describing attributes of life insurance applicants. The task is to predict the "Response" variable for each Id in the test set. "Response" is an ordinal measure of risk that has 8 levels.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Importing necessary packages and data

In [None]:
import pandas_profiling as pdp
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.float_format = '{:.3f}'.format
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_info_columns', 200)

## Data Description

* train.csv - the training set, contains the Response values
* test.csv - the test set, you must predict the Response variable for all rows in this file


* Id :	A unique identifier associated with an application.
* Product_Info_1-7 :	A set of normalized variables relating to the product applied for
* Ins_Age :	Normalized age of applicant
* Ht :	Normalized height of applicant
* Wt :	Normalized weight of applicant
* BMI :	Normalized BMI of applicant
* Employment_Info_1-6 :	A set of normalized variables relating to the employment history of the applicant.
* InsuredInfo_1-6 :	A set of normalized variables providing information about the applicant.
* Insurance_History_1-9 :	A set of normalized variables relating to the insurance history of the applicant.
* Family_Hist_1-5 :	A set of normalized variables relating to the family history of the applicant.
* Medical_History_1-41 :	A set of normalized variables relating to the medical history of the applicant.
* Medical_Keyword_1-48 :	A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
* Response :	This is the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.
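
The variable families listed above can be recovered from the column names themselves. The following is a minimal sketch, assuming the names follow the `Family_<number>` pattern described here; the `columns` list is a shortened, hypothetical stand-in for `train.columns`, and `family` is a helper introduced only for this illustration.

```python
import re
from collections import Counter

# Hypothetical, shortened stand-in for train.columns
columns = ["Id", "Product_Info_1", "Product_Info_2", "Ins_Age", "Ht", "Wt", "BMI",
           "Medical_History_1", "Medical_History_2", "Medical_Keyword_1", "Response"]

def family(name):
    """Strip a trailing _<number>, so Product_Info_1 becomes Product_Info."""
    return re.sub(r"_\d+$", "", name)

# Number of columns in each variable family
family_counts = Counter(family(c) for c in columns)
print(family_counts)
```

Run against the real `train.columns`, the counts would give a quick check of the family sizes stated in the description.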

In [None]:
train=pd.read_csv('../input/prudential-life-insurance-assessment/train.csv.zip')
test=pd.read_csv('../input/prudential-life-insurance-assessment/test.csv.zip')

#  **Analysing features**

* **Weight**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15,7))
sns.boxplot(x = 'Wt', data=train,  orient='v' , ax=axes[0])
sns.distplot(train['Wt'],  ax=axes[1])

* **Height**

In [None]:

f, axes = plt.subplots(1, 2, figsize=(15,7))
sns.boxplot(x = 'Ht', data=train,  orient='v' , ax=axes[0])
sns.distplot(train['Ht'],  ax=axes[1])

* **BMI**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15,7))
sns.boxplot(x = 'BMI', data=train,  orient='v' , ax=axes[0])
sns.distplot(train['BMI'],  ax=axes[1])

* **Age**

In [None]:
f,axes=plt.subplots(1,2,figsize=(15,7))
sns.boxplot(x='Ins_Age',data=train,orient='v',ax=axes[0])
sns.distplot(train['Ins_Age'],ax=axes[1])

* **Target Variable Analysis**

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train['Response'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Response')
ax[0].set_ylabel('')
sns.countplot(x='Response', data=train, ax=ax[1])  # recent seaborn versions require x= as a keyword
ax[1].set_title('Response')
plt.show()


**We can see that Class 8 is the most frequent class.**
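
The class shares shown in the pie chart can also be computed directly, which gives the accuracy of a trivial "always predict the majority class" baseline. This is a small sketch on made-up values; `response` is a hypothetical stand-in for `train['Response']`.

```python
import pandas as pd

# Toy stand-in for train['Response']; the values are made up for illustration
response = pd.Series([8, 8, 8, 6, 6, 7, 2, 1, 5, 4])

# Share of each class, ordered by class label
class_share = response.value_counts(normalize=True).sort_index()

majority_class = class_share.idxmax()
majority_baseline = class_share.max()  # accuracy of always predicting the majority class
print(class_share)
print(majority_class, majority_baseline)
```

Any classifier trained later should be compared against this baseline, since accuracy alone can look reasonable on imbalanced classes.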

#  Processing data

In [None]:
## Dropping the "Id" column from the train and test sets.
train.drop(columns=['Id'], inplace=True)
test.drop(columns=['Id'], inplace=True)

## Saving the target values in "y".
y = train['Response'].reset_index(drop=True)

In [None]:
## Combining train and test datasets together so that we can do all the work at once. 
all_data = pd.concat((train, test)).reset_index(drop = True)
## Dropping the target variable. 
all_data.drop(['Response'], axis = 1, inplace = True)

#  Missing Value Analysis

In [None]:
def missing_percentage(df):
    """Return the count and percentage of missing values for every column of df that has any."""
    # Count missing values once, largest first, and keep only columns with at least one.
    total = df.isnull().sum().sort_values(ascending=False)
    total = total[total != 0]
    percent = round(total / len(df) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])


In [None]:
#missing_percentage(train)
missing_percentage(all_data)

In [None]:
#variables are discrete
all_data[['Medical_History_1','Medical_History_10','Medical_History_15','Medical_History_24','Medical_History_32']].describe()

In [None]:
# Fill missing values in the discrete medical-history variables with 0.
# Assigning back avoids relying on inplace fillna on a column selection.
discrete = ['Medical_History_1', 'Medical_History_10', 'Medical_History_15',
            'Medical_History_24', 'Medical_History_32']
all_data[discrete] = all_data[discrete].fillna(0)

In [None]:
#variables are continuous
all_data[['Family_Hist_2','Family_Hist_3','Family_Hist_4','Family_Hist_5']].describe()

In [None]:
# Fill each column with its own mean (not the mean of Family_Hist_3).
for col in ['Family_Hist_2', 'Family_Hist_3', 'Family_Hist_4', 'Family_Hist_5']:
    all_data[col] = all_data[col].fillna(all_data[col].mean())

In [None]:
#variables are continuous
all_data[['Employment_Info_1','Employment_Info_4','Employment_Info_6']].describe()

In [None]:
# Fill each column with its own mean.
for col in ['Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6']:
    all_data[col] = all_data[col].fillna(all_data[col].mean())

In [None]:
#variables are continuous
all_data[['Insurance_History_5']].describe()

In [None]:
all_data['Insurance_History_5'] = all_data['Insurance_History_5'].fillna(all_data['Insurance_History_5'].mean())

In [None]:
#missing_percentage(train)
missing_percentage(all_data)
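
The fills above use statistics computed over the train and test rows together. An alternative, shown here only as a sketch on made-up numbers, is to learn the means from the training rows and apply them to the test rows with scikit-learn's `SimpleImputer`. The toy frames `train_toy` and `test_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with made-up values; column names borrowed from the dataset
train_toy = pd.DataFrame({"Family_Hist_2": [0.2, np.nan, 0.4],
                          "Family_Hist_3": [np.nan, 0.6, 0.8]})
test_toy = pd.DataFrame({"Family_Hist_2": [np.nan, 0.9],
                         "Family_Hist_3": [0.1, np.nan]})

# Learn per-column means from the training rows only
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)

# Reuse the training means for the test rows
test_filled = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)
print(imputer.statistics_)
```

Whether this matters in practice depends on how different the train and test distributions are; it mainly keeps test information out of the preprocessing step.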

## Encoding

In [None]:
categorical=['Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 
             'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 
             'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 
             'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 
             'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 
             'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 
             'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 
             'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 
             'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 
             'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 
             'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 
             'Medical_History_40', 'Medical_History_41']
## Creating dummy variable 
final_features = pd.get_dummies(all_data, columns=categorical).reset_index(drop=True)
final_features.shape
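
Passing `columns=categorical` matters because most of the nominal variables are stored as integer codes, and `pd.get_dummies` by default expands only object, string, or category columns. A toy sketch of the difference, with made-up values:

```python
import pandas as pd

# Made-up values; one string-coded and one integer-coded nominal variable
toy = pd.DataFrame({"Product_Info_2": ["A1", "D3", "A1"],
                    "InsuredInfo_1": [1, 2, 3]})

# Default behaviour: only the string column is expanded
default_encoded = pd.get_dummies(toy)

# Explicit column list: the integer-coded column is expanded too
explicit_encoded = pd.get_dummies(toy, columns=["Product_Info_2", "InsuredInfo_1"])

print(list(default_encoded.columns))
print(list(explicit_encoded.columns))
```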

In [None]:
final_features

In [None]:
X = final_features.iloc[:len(y), :]

X_sub = final_features.iloc[len(y):, :]
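
The split above relies on the training rows coming first in the combined frame. Encoding train and test together also guarantees both parts end up with the same dummy columns, even when a category appears only in the test set. The sketch below illustrates this on hypothetical toy frames.

```python
import pandas as pd

# Toy data: category "c" appears only in the test rows
train_toy = pd.DataFrame({"f": ["a", "b", "a"], "Response": [1, 8, 2]})
test_toy = pd.DataFrame({"f": ["b", "c"]})

y_toy = train_toy["Response"]

# Encoding the training rows alone misses the unseen category
train_only = pd.get_dummies(train_toy[["f"]], columns=["f"])

# Encoding the combined frame gives one shared set of columns
combined = pd.concat([train_toy.drop(columns="Response"), test_toy], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["f"])

# Training rows come first, so the split is by position
X_toy = encoded.iloc[:len(y_toy)]
X_sub_toy = encoded.iloc[len(y_toy):]
print(list(train_only.columns), list(encoded.columns))
```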

In [None]:
X_sub

In [None]:
def overfit_reducer(df):
    """
    Return the columns of df in which a single value accounts for more than 95% of the rows.
    Such near-constant features carry little signal and can encourage overfitting.
    """
    overfit = []
    for i in df.columns:
        counts = df[i].value_counts()
        most_common = counts.iloc[0]  # count of the most frequent value (not necessarily zero)
        if most_common / len(df) * 100 > 95.0:
            overfit.append(i)
    return overfit


overfitted_features = overfit_reducer(X)

X = X.drop(overfitted_features, axis=1)
X_sub = X_sub.drop(overfitted_features, axis=1)
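
The filter above drops columns dominated by a single value. The same idea can be written without an explicit loop; this is a sketch on a toy frame, and `dominant_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def dominant_share(df):
    """Return, per column, the fraction of rows taken by the most frequent value."""
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Toy frame: one near-constant dummy column and one balanced column
toy = pd.DataFrame({"rare_flag": [0] * 99 + [1],
                    "balanced": [0, 1] * 50})

shares = dominant_share(toy)
near_constant = shares[shares > 0.95].index.tolist()
print(shares)
print(near_constant)
```

Inspecting `shares` before dropping anything makes it easier to judge whether the 95% threshold is too aggressive for rare but informative dummy variables.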

In [None]:
X_sub

# **KNN**

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import cohen_kappa_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 777)

In [None]:
knn = KNeighborsClassifier(n_neighbors=50)

In [None]:
knn.fit(X_train, y_train)

In [None]:
knn_pred = knn.predict(X_test)
accuracy_score(y_test, knn_pred) # 0.4028627561044064

In [None]:
cohen_kappa_score(y_test, knn_pred) #0.16614776021896493
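
Because `Response` is ordinal, the unweighted kappa used above treats a prediction that is off by one level the same as one that is off by seven. `cohen_kappa_score` accepts `weights="quadratic"`, which penalises distant errors more heavily. A sketch on made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Made-up labels on the 8-level scale
y_true = [1, 2, 3, 4, 5, 6, 7, 8]
near_miss = [2, 1, 4, 3, 6, 5, 8, 7]  # every prediction is off by one level
far_miss = [8, 7, 6, 5, 4, 3, 2, 1]   # predictions are reversed

# Unweighted kappa cannot tell the two sets of errors apart
plain_near = cohen_kappa_score(y_true, near_miss)
plain_far = cohen_kappa_score(y_true, far_miss)

# Quadratic weights reward predictions that land close to the true level
weighted_near = cohen_kappa_score(y_true, near_miss, weights="quadratic")
weighted_far = cohen_kappa_score(y_true, far_miss, weights="quadratic")

print(plain_near, plain_far)
print(weighted_near, weighted_far)
```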

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score

knn_pipe = Pipeline([('knn', KNeighborsClassifier(n_jobs=-1))])

knn_params = {'knn__n_neighbors': range(1, 10)}

knn_grid = GridSearchCV(knn_pipe, knn_params,
                        cv=5, n_jobs=-1, verbose=True)

knn_grid.fit(X_train, y_train)

knn_grid.best_params_, knn_grid.best_score_
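
The grid search above selects `n_neighbors` by accuracy, the default score for classifiers. As a sketch of one possible variation, the pipeline below adds feature scaling (KNN is distance-based) and selects by quadratic weighted kappa instead. The data is synthetic, and the names `demo_pipe`, `demo_grid`, and `qwk_scorer` are introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for X_train, y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

# Scorer that passes weights="quadratic" to cohen_kappa_score
qwk_scorer = make_scorer(cohen_kappa_score, weights="quadratic")

# Scale features before the distance-based classifier
demo_pipe = Pipeline([("scale", StandardScaler()),
                      ("knn", KNeighborsClassifier())])

demo_grid = GridSearchCV(demo_pipe, {"knn__n_neighbors": [5, 15, 25]},
                         scoring=qwk_scorer, cv=3)
demo_grid.fit(X_demo, y_demo)
print(demo_grid.best_params_, demo_grid.best_score_)
```

If the best value sits at the edge of the searched range, widening the range is usually worth a second run.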

In [None]:
knn9 = KNeighborsClassifier(n_neighbors=9)
knn9.fit(X_train, y_train)

In [None]:
knn9_pred = knn9.predict(X_test)
accuracy_score(y_test, knn9_pred) # 0.384619702497895

In [None]:
cohen_kappa_score(y_test, knn9_pred) # 0.18439306881473994
# accuracy is lower than with n=50, but kappa is higher

In [None]:
submission_knn9 = pd.read_csv("/kaggle/input/prudential-life-insurance-assessment/sample_submission.csv.zip")
submission_knn9.iloc[:,1] = knn9.predict(X_sub)

In [None]:
from IPython.display import FileLink

In [None]:
submission_knn9.to_csv("submission_knn9.csv", index=False)
FileLink('submission_knn9.csv')
# 0.29636

# **Logistic regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [None]:
lr = LogisticRegression(random_state=5, class_weight='balanced', solver='saga')

In [None]:
parameters = {'C': (0.0001, 0.001, 0.01, 0.1, 1, 10)}

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

In [None]:
grid_search = GridSearchCV(lr, parameters, n_jobs=-1, cv=skf)  # default scoring (accuracy); plain scoring='roc_auc' targets binary problems
grid_search = grid_search.fit(X_train, y_train)
grid_search.best_estimator_

In [None]:
# Standard deviation of the CV score for the second candidate (C=0.001)
grid_search.cv_results_['std_test_score'][1]

In [None]:
grid_search.best_score_

In [None]:
grid_search.cv_results_
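
The raw `cv_results_` dictionary is hard to read. Wrapping it in a DataFrame and keeping a few columns gives one row per candidate. The sketch below uses synthetic data and a hypothetical `demo_search` object rather than the `grid_search` fitted above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

demo_search = GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.01, 0.1, 1]}, cv=3)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
summary_cols = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(demo_search.cv_results_)[summary_cols].sort_values("rank_test_score")
print(results)
```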

# **Ordinal regression**