# Travel Insurance Logistic Regression

## Goal: 
The goal of this project is to predict whether a travel insurance policy will be claimed, based on the available features.
Before the data analysis, here is a brief introduction to travel insurance. A policy usually covers common traveler concerns such as flight delays, trip cancellation, or luggage loss. Some policies also cover medical emergencies, which means the insured's age might influence whether a policy is claimed.
## Problem:
For any insurance policy, we would like to know which features are associated with a claim. Features such as age, gender, or commission value could have an influence.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Read Data

In [None]:
df = pd.read_csv("../input/travel-insurance/travel insurance.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
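def cleanYesNo(s):
    """Map "Yes"/"No" to 1/0; any other value returns None."""
    if s == "Yes":
        return 1
    elif s == "No":
        return 0

# Numeric target column: 1 = claimed, 0 = not claimed
df["Claim0"] = df["Claim"].apply(cleanYesNo)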

The Gender column has too many missing values, so we drop it at this point.
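One way to support the decision to drop a column is to measure the share of missing values per column. The sketch below does this on a small made-up frame; the values and the 50% cut-off are assumptions for illustration, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "Gender": ["M", np.nan, np.nan, "F", np.nan],
    "Age": [31, 45, 28, 36, 52],
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely missing).
missing_share = toy.isna().mean()

# Hypothetical cut-off: flag columns with more than half of their values missing.
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(cols_to_drop)
```

On the real data, the same `isna().mean()` call on `df` would show how sparse Gender actually is.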

In [None]:
df.drop(["Claim", "Gender"], axis = 1, inplace = True)

In [None]:
df.groupby(["Agency"]).mean(numeric_only=True)

In [None]:
df.groupby(['Agency Type']).mean(numeric_only=True)

In [None]:
df.groupby(['Distribution Channel']).mean(numeric_only=True)

In [None]:
df.describe()

## Clean Data
Some values in some columns are suspicious, and we transform them where needed. For example, negative values in Duration make no sense, and in the Age column the only value above 100 is 118.

In [None]:
df[df["Duration"] <0]

In [None]:
df[df["Age"] > 100]

In [None]:
df.loc[df['Duration'] < 0, 'Duration'] = 49.317
df.loc[df['Age'] > 100, 'Age'] = 39.969981
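The constants in the cell above appear to be column means typed in by hand. As a minimal sketch, the replacement value could instead be computed from the rows that look valid. The toy frame below is invented, and taking the mean of the valid rows only is an assumption, not necessarily how the constants above were obtained.

```python
import pandas as pd

# Toy frame with one invalid Duration (-1) and one invalid Age (118).
toy = pd.DataFrame({
    "Duration": [10, -1, 30, 20],
    "Age": [30, 118, 40, 50],
})

# Masks for the rows treated as invalid in this notebook.
bad_duration = toy["Duration"] < 0
bad_age = toy["Age"] > 100

# Cast to float first so a non-integer mean can be stored in the column.
toy = toy.astype({"Duration": "float64", "Age": "float64"})

# Replace each invalid value with the mean of the valid values in that column.
toy.loc[bad_duration, "Duration"] = toy.loc[~bad_duration, "Duration"].mean()
toy.loc[bad_age, "Age"] = toy.loc[~bad_age, "Age"].mean()

print(toy)
```

Computing the value from the data keeps the cleaning step correct if the input file changes.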

In [None]:
df.describe()

The data set is imbalanced: 927 policies are claimed, while 62399 are not.

In [None]:
print("Claimed")
print(df[df["Claim0"] == 1]["Claim0"].count())
print("Not Claimed")
print(df[df["Claim0"] == 0]["Claim0"].count())
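To put the imbalance in numbers, the short calculation below uses the two counts quoted above. It also shows the accuracy that a trivial model predicting "not claimed" for every policy would reach, which is a useful baseline when reading accuracy figures.

```python
# Class counts as stated in the text above.
claimed = 927
not_claimed = 62399
total = claimed + not_claimed

# Fraction of policies that were claimed.
claim_rate = claimed / total

# Accuracy of a model that always predicts "not claimed".
baseline_accuracy = not_claimed / total

print(f"Claim rate: {claim_rate:.2%}")
print(f"Always-'not claimed' accuracy: {baseline_accuracy:.2%}")
```

With so few positive cases, accuracy alone says little about how well claims are detected.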

Because the dataset is imbalanced, oversampling will be used to deal with the imbalance. Before resampling, we visualize the data to look for potential relationships.

## Data Visualization

The following graph checks whether claimed policies occur in only a few agencies. According to the graph, claims are distributed evenly across agencies.

In [None]:
g = sns.catplot(x="Agency",y = "Claim0", data=df)
g.fig.set_size_inches(10,5)

The following visualization checks whether claimed policies are concentrated in a few countries. An abundance of claims in one country might mean that the country has some unavoidable circumstances that lead to injury or disease.

In [None]:
claimeddata = df[df["Claim0"]==1]

In [None]:
claimeddata['Destination'].value_counts().head(10).plot(kind='barh', figsize=(5,5))

In the graph below, we want to find out whether specific agencies lead to an increase in the number of claimed policies. The figure clearly shows no moral hazard: in the general plot, EPX has sold more policies, but this does not lead to more claimed policies.

In [None]:
f, axes = plt.subplots(1, 2)
f.set_size_inches(15,5)
axes[0].set_title('General')
axes[1].set_title('Agency (Claimed Policies)')
sns.countplot(y="Agency", data=df, ax = axes[0])
sns.countplot(y="Agency", data=claimeddata, ax = axes[1])

Different policy plans have different coverage. Some plans cover most types of injury or disease, so the possibility of claims may differ between plans.

In [None]:
a = pd.DataFrame(df.loc[:, "Product Name"].value_counts())
b = pd.DataFrame(claimeddata.loc[:, "Product Name"].value_counts())
combined = a.join(b, lsuffix = "_general", rsuffix = "_claimed")
combined.fillna(0, inplace = True)
combined

In [None]:
# Claim ratio per product: claimed policies divided by all policies sold
ratio = (combined.iloc[:, 1] / combined.iloc[:, 0]).to_frame(name="Ratio")

plt.figure(figsize=(7,7))
sns.barplot(data = ratio, y = ratio.index, x = "Ratio")

From the Ratio figure above, we see that different products are claimed at different rates.
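Because the target is coded as 0/1, the same per-product claim ratio can also be obtained as a group mean. The sketch below shows this on a toy frame; the plan names and values are invented for illustration.

```python
import pandas as pd

# Toy data: hypothetical plans with a 0/1 claim indicator.
toy = pd.DataFrame({
    "Product Name": ["Plan A", "Plan A", "Plan A", "Plan A", "Plan B", "Plan B"],
    "Claim0": [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column within a group is the fraction of claimed policies.
claim_rate = (
    toy.groupby("Product Name")["Claim0"]
    .mean()
    .sort_values(ascending=False)
    .rename("Ratio")
)

print(claim_rate)
```

This avoids building and joining two separate count tables, although products with very few policies will still produce noisy ratios.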

In [None]:
sns.heatmap(df.corr(numeric_only=True), square=True)

# Training Dataset and Testing Dataset

In [None]:
X = df.drop(columns=['Claim0'])
y = df['Claim0']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Oversampling
Oversampling is needed because the number of claimed policies is much smaller than the number of non-claimed ones.

In [None]:
updated_X = X_train.drop(columns = ["Agency", "Agency Type", "Distribution Channel", "Product Name", "Destination"])

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(updated_X, y_train)

In [None]:
(unique, counts) = np.unique(y_resampled, return_counts=True)
(unique, counts)
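Conceptually, random oversampling duplicates minority-class rows, drawn at random with replacement, until the classes are the same size. The NumPy sketch below illustrates that idea on toy data. It is a simplified illustration under that assumption, not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Append the duplicated minority rows to the original data.
X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print(np.unique(y_balanced, return_counts=True))
```

No new information is created: the minority rows are simply repeated, which is why the resampled data should only be used for training.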

## Model Building

In [None]:
X_test_updated = X_test.drop(columns = ["Agency", "Agency Type", "Distribution Channel", "Product Name", "Destination"])

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_resampled, y_resampled)
y_pred = clf.predict(X_test_updated)
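The model above is trained without the categorical columns, which are dropped before fitting. One possible way to keep such columns, sketched here as an assumption and not as part of the original analysis, is one-hot encoding with `pd.get_dummies`. The toy frames and category values are illustrative.

```python
import pandas as pd

# Toy train/test frames with one categorical and one numeric column.
train = pd.DataFrame({
    "Agency Type": ["Airlines", "Travel Agency", "Airlines"],
    "Duration": [10, 30, 5],
})
test = pd.DataFrame({
    "Agency Type": ["Travel Agency", "Airlines"],
    "Duration": [12, 7],
})

# One-hot encode each frame separately.
train_enc = pd.get_dummies(train, columns=["Agency Type"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Agency Type"], dtype=int)

# Align the test frame to the training columns; categories unseen in the
# test frame become all-zero columns, extra test-only categories are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc)
print(test_enc)
```

Columns with many distinct values, such as Destination, would produce many indicator columns, so whether this helps would need to be checked on the real data.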

Accuracy on the resampled training data is around 65%. With so few features in the model, it seems that more features are needed to improve accuracy.

In [None]:
clf.score(X_resampled, y_resampled)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
from sklearn.metrics import classification_report
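# Labels are sorted, so the first name belongs to class 0 (non-claimed) and the second to class 1 (claimed)
print(classification_report(y_test, y_pred, target_names = ["Non-claimed", "Claimed"]))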
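The figures in a classification report can be derived directly from the confusion matrix. The sketch below does this for the positive class, using a hypothetical 2x2 matrix in scikit-learn's layout (rows are true classes, columns are predicted classes, class 0 first). The counts are invented and are not the results of this notebook.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[700, 300],
               [10, 20]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted claims that were real claims
recall = tp / (tp + fn)     # share of real claims that the model caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On imbalanced data, precision and recall for the claimed class are usually more informative than overall accuracy.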

## Cross-Validation
To guard against overfitting, we use cross-validation to check the accuracy score. The results show that the accuracy scores are again around 65%.

In [None]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(clf, X_resampled, y_resampled, cv=10)

In [None]:
cv_results['test_score']
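The cell above cross-validates on data that has already been oversampled, so copies of the same minority row can end up in both the training and the validation part of a fold. One possible alternative, sketched here on synthetic data, is to oversample inside each fold and score on the untouched validation part. The helper `oversample` and the toy data are hypothetical, and recall is used as the score only as an example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic, imbalanced toy data (not the insurance dataset).
n_major, n_minor = 300, 30
X = np.vstack([rng.normal(0.0, 1.0, size=(n_major, 2)),
               rng.normal(1.5, 1.0, size=(n_minor, 2))])
y = np.array([0] * n_major + [1] * n_minor)

def oversample(X_part, y_part, rng):
    """Duplicate minority rows at random until both classes are the same size."""
    minority = np.flatnonzero(y_part == 1)
    majority = np.flatnonzero(y_part == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y_part)), extra])
    return X_part[keep], y_part[keep]

recalls = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Resample the training part only; the validation part keeps its class mix.
    X_tr, y_tr = oversample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(random_state=0).fit(X_tr, y_tr)
    recalls.append(recall_score(y[val_idx], model.predict(X[val_idx])))

print(np.round(recalls, 2))
```

Scores obtained this way reflect performance on data with the original class balance, which is closer to what the held-out test set measures.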