# Insurance Prediction

* Insurance provides financial support during emergencies and unforeseen circumstances. In the event of a sudden accident or the death of an earning family member, it can help cover medical expenses and even outstanding debts or loans.
* Applying machine learning to insurance prediction can reduce a company's administrative costs.
* The aim of this project is to identify likely travel insurance claims based on geographical data, the insurance agency, the type of insurance taken and the agency's commission and net sales.
* This would help the company plan for future claims and manage its available resources accordingly.

#### Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE 

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost.sklearn import XGBClassifier


from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import f1_score, roc_auc_score, recall_score, precision_score, balanced_accuracy_score, log_loss, confusion_matrix,classification_report

In [None]:
%matplotlib inline
np.random.seed(42)
plt.rcParams['figure.figsize'] = 6, 6

In [None]:
#Importing dataset
df=pd.read_csv('/kaggle/input/travel-insurance/travel insurance.csv')
columns_df = list(df.columns)
df.head(5)

Of the columns present, *Agency, Agency Type, Distribution Channel, Product Name, Claim, Destination and Gender* are categorical variables. *Duration, Net Sales, Commision (in value) and Age* are numerical variables.

Some column names contain spaces, which can cause confusion when the names are used directly, so they are renamed.

In [None]:
df.rename(columns={ 'Agency Type':'AgencyType', 'Distribution Channel':'DistChannel', 'Product Name':'ProdName',
                   'Net Sales':'NetSales', 'Commision (in value)':'Commission'},inplace=True)

In [None]:
df_cat=['Agency','AgencyType','DistChannel','ProdName','Destination'] #Defining categorical variables

In [None]:
for k in df_cat:
    print(k,':',df[k].unique())

In [None]:
print(df.info())
print(df.describe())
print(df.shape)

The dataset has a total of 63326 rows and 11 columns.

Of these, the *Gender* column has nearly 71% of its values missing. With so little data available, imputation is avoided and the column is dropped. *NetSales* and *Duration* also contain negative values, which are corrected by setting them to 0.

In [None]:
print(df['Gender'].isnull().sum())  # number of missing Gender values
df.drop(columns=['Gender'],inplace=True)
# Clamp negative values to 0
df['NetSales'] = df['NetSales'].clip(lower=0)
df['Duration'] = df['Duration'].clip(lower=0)
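The missing-value share and the count of negative entries mentioned above can be checked directly. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the real dataset) to show one way of doing so.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Gender": ["M", np.nan, np.nan, "F", np.nan],
    "Duration": [10, -1, 5, 0, 7],
})

# Percentage of missing values per column, highest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Number of negative entries in a numeric column
n_negative = int((toy["Duration"] < 0).sum())
print(n_negative)
```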

In [None]:
df.describe()

*Destination* is categorical with a large number of distinct values. To simplify it, the 20 most frequent countries are retained as is, while the rest are grouped as 'OTHERS'.

In [None]:
df1=df.groupby(by=["Destination"]).size().reset_index(name="counts")
df1=df1.nlargest(20,['counts'])
dest_top = list(df1['Destination'])
df['Destination'] = df['Destination'].apply(lambda x: "OTHERS" if x not in dest_top else x)

In [None]:
df['Destination'].unique()

#### EDA

In [None]:
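# Correlation is only defined for numeric columns; recent pandas versions raise an error on string columns
df.select_dtypes(include='number').corr()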

In [None]:
plt.scatter(x=df['NetSales'],y=df['Commission'],marker='*')

In [None]:
df['Agency'].value_counts().plot(kind='bar',color='purple')
plt.title('Top agencies for insurance')
plt.show()

Agency *EPX* sold the highest number of insurance policies.

In [None]:
df['DistChannel'].value_counts().plot(kind='bar',color='lightseagreen')
plt.title("Mode of Distribution Channel")
plt.show()

The dataset shows that most policies were sold online, with a negligible percentage sold offline.

In [None]:
plt.hist(df['Age'],bins=20,color='limegreen')
plt.title("Insurance across age groups")
plt.show()

Customers in the 30-40 age group took out the most insurance policies, the majority of them aged 35-40.

In [None]:
df['Claim'].value_counts().plot(kind='bar',color='tomato')
plt.title('Insurance Claimed')
plt.show()

The *Claim* column is the variable to be predicted, i.e., whether a claim was made or not. It is highly imbalanced, so some balancing needs to be carried out before feeding the data into any model.
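The degree of imbalance can be quantified before deciding how to rebalance. This is a minimal sketch on a made-up label column (the 95/5 split is an assumption for illustration only, not the real class distribution).

```python
import pandas as pd

# Made-up label column; the real 'Claim' distribution will differ
claim = pd.Series(["No"] * 95 + ["Yes"] * 5, name="Claim")

counts = claim.value_counts()
share = claim.value_counts(normalize=True) * 100   # class share in percent
imbalance_ratio = counts.max() / counts.min()      # majority-to-minority ratio

print(share)
print(imbalance_ratio)
```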

In [None]:
plt.pie(x=df1['counts'],labels=df1['Destination'],autopct='%1.1f%%')
plt.title('Share of countries in insurance')
plt.show()

Singapore, Malaysia and Thailand alone account for 43.7% of the policies among the top 20 destinations. This is plausible, as the insurance company is itself based in Singapore.

In [None]:
df2=df.groupby(by=["ProdName"]).size().reset_index(name="counts")
df2=df2.nlargest(10,['counts'])
plt.pie(x=df2['counts'],labels=df2['ProdName'],autopct='%1.1f%%')
plt.title('Type of insurance plans')
plt.show()

The top insurance plans taken are the Cancellation Plan and the 2-way Comprehensive Plan, which together account for ~53% of the ten most common plans.

In [None]:
dfclaimed=df[df['Claim']=="Yes"]
dfplot=dfclaimed.groupby(by=["Agency"]).size().reset_index(name="counts")
plt.bar(dfplot['Agency'],dfplot['counts'],color='deeppink')
plt.title("Agencies with claimed insurances")
plt.show()

Although agency *EPX* sold the maximum number of insurance plans, *C2B* has the maximum number of claimed policies.

In [None]:
dfplot=dfclaimed.groupby(by=['ProdName']).size().reset_index(name="counts")
dfplot=dfplot.nlargest(5,['counts'])
plt.figure(figsize=(12, 5))
plt.bar(dfplot['ProdName'],dfplot['counts'],color='slategray')
plt.title('Insurance plans ranked by number of claims')
plt.show()

Of the claimed plans, the *Bronze Plan* was claimed most often, followed by the *Annual Silver Plan*.

In [None]:
dfplot=dfclaimed.groupby(by=['AgencyType']).size().reset_index(name="counts")
dfplot=dfplot.nlargest(10,['counts'])
plt.figure(figsize=(4, 4))
plt.bar(dfplot['AgencyType'],dfplot['counts'],color='lightcoral')
plt.title('Type of agency against number of claims')
plt.show()

Customers of *Airlines* agencies filed more claims than customers of *Travel Agency* agencies.

In [None]:
dfplot=dfclaimed.groupby(by=['Destination']).size().reset_index(name="counts")
dfplot=dfplot.nlargest(10,['counts'])
plt.figure(figsize=(15, 4))
plt.bar(dfplot['Destination'],dfplot['counts'],color='olivedrab')
plt.title('Countries ranked on number of claims')
plt.show()

Singapore, where the insurance company is itself based, also saw the maximum number of claims (a majority of total claims as well).

#### Data Preprocessing

In [None]:
#Label encoding the categorical columns (one encoder per column, so each mapping can be inverted later)
le_agency= LabelEncoder()
df['Agency'] = le_agency.fit_transform(df['Agency'])

le_agtype= LabelEncoder()
df['AgencyType'] = le_agtype.fit_transform(df['AgencyType'])

le_dchannel= LabelEncoder()
df['DistChannel'] = le_dchannel.fit_transform(df['DistChannel'])

le_prodname= LabelEncoder()
df['ProdName'] = le_prodname.fit_transform(df['ProdName'])

le_dest= LabelEncoder()
df['Destination'] = le_dest.fit_transform(df['Destination'])

le_claim= LabelEncoder()
df['Claim'] = le_claim.fit_transform(df['Claim'])
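Keeping a separate fitted encoder per column makes it possible to map the integer codes back to the original labels later, for example when interpreting predictions. The sketch below illustrates this on a made-up frame, storing the encoders in a dictionary (`toy` and `encoders` are hypothetical names, not part of the notebook).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration
toy = pd.DataFrame({
    "Agency": ["EPX", "C2B", "EPX", "CWT"],
    "Claim": ["No", "Yes", "No", "No"],
})

# Fit one encoder per column and keep it for later decoding
encoders = {}
for col in ["Agency", "Claim"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

agency_codes = toy["Agency"].tolist()
# inverse_transform recovers the original labels from the integer codes
agency_labels = encoders["Agency"].inverse_transform(toy["Agency"]).tolist()
print(agency_codes, agency_labels)
```

Note that label encoding assigns codes in sorted order of the class names, so the integers carry no meaningful ranking.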

In [None]:
df.head(5)

In [None]:
X=df.drop(columns='Claim',inplace=False)
y=df['Claim']

In [None]:
#Balancing unbalanced 'Claim' column 
sm = SMOTE(random_state=42)
Xb, yb = sm.fit_resample(X, y)
print(f'''Shape of X before SMOTE: {X.shape}
Shape of X after SMOTE: {Xb.shape}''')
print('\nBalance of positive and negative classes (%):')
yb.value_counts(normalize=True) * 100

In [None]:
#Scaling dataset
scaler=MinMaxScaler()
cols_scaling = Xb.columns
Xb[cols_scaling]=scaler.fit_transform(Xb[cols_scaling])

#### Splitting the dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Xb,yb,test_size=0.3,random_state=42,shuffle=True,stratify=yb)
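A commonly recommended alternative ordering, which differs from the cells above, is to split first and then fit the scaler (and any resampler such as SMOTE) on the training portion only, so that no information from the test rows influences preprocessing. The following is a minimal sketch on synthetic data; `make_classification` stands in for the real dataset and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=42
)

# 1) Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# 2) Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# 3) Any resampling step would be applied here, to (X_tr_scaled, y_tr) only;
#    the test set keeps its original class distribution.
print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, test-set metrics reflect the original class imbalance rather than a synthetically balanced one.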

#### Training multiple models

In [None]:
values=[]
models = [RandomForestClassifier(),LogisticRegression(),DecisionTreeClassifier(random_state=42),SVC(),KNeighborsClassifier(),XGBClassifier()]
for m in models:
    m.fit(X_train,y_train)
    y_pred=m.predict(X_test)
    print(m)
    # classification_report returns a string; print it whole rather than indexing into it
    print(classification_report(y_test,y_pred))
    print(confusion_matrix(y_test,y_pred))
    # Note: roc_auc_score and log_loss are computed from hard labels here, not probabilities
    values.append([type(m).__name__,f1_score(y_test,y_pred), roc_auc_score(y_test,y_pred), recall_score(y_test,y_pred), precision_score(y_test,y_pred), 
      balanced_accuracy_score(y_test,y_pred), log_loss(y_test,y_pred)])
    print('==========================================================')
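ROC AUC and log loss are usually more informative when computed from predicted probabilities rather than hard labels. For a binary problem, the AUC of hard labels reduces to balanced accuracy. The sketch below shows the difference on synthetic data, with a single `LogisticRegression` as a stand-in model (data and variable names are illustrative assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real data
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
labels = clf.predict(X_te)               # hard 0/1 predictions
proba = clf.predict_proba(X_te)[:, 1]    # probability of the positive class

auc_from_labels = roc_auc_score(y_te, labels)
bal_acc = balanced_accuracy_score(y_te, labels)
auc_from_proba = roc_auc_score(y_te, proba)
loss_from_proba = log_loss(y_te, proba)

print(auc_from_labels, bal_acc)   # these two coincide for hard labels
print(auc_from_proba, loss_from_proba)
```

`SVC` does not expose `predict_proba` unless it is constructed with `probability=True`; its `decision_function` scores can be passed to `roc_auc_score` instead.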

In [None]:
result_columns = ['Model','f1_score','roc_auc_score','recall_score','precision_score','balanced_accuracy_score','log_loss']
results= pd.DataFrame(values,columns=result_columns)

In [None]:
results

As a further improvement, hyperparameter tuning can be performed on the best model to check whether performance improves.
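One possible way to do this is with `GridSearchCV`, which is already imported at the top of the notebook. The sketch below runs on synthetic data with a random forest; the choice of model, the grid values and the scoring metric are illustrative assumptions, not tuned recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # optimise F1 on the positive class
    cv=3,           # 3-fold cross-validation
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

For larger grids, `RandomizedSearchCV` samples a fixed number of parameter combinations instead of trying them all, which keeps the run time bounded.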