# Car Insurance Prediction
Step-by-Step Guide:
1. Importing the Necessary Libraries
2. Importing the Dataset 
3. Data Analysis
   * 3.1 Observing the data  
   * 3.2 Determining missing values
   * 3.3 Joining Train/Test Data
4. Visualizing and Comparing Features
   * 4.1 Correlation heatmap 
   * 4.2 Comparing the effect of different categorical features and features with 2 categories on the target (Response)
       * 4.2.1 Gender
       * 4.2.2 Vehicle Age
       * 4.2.3 Vehicle Damage
       * 4.2.4 Driving_License
       * 4.2.5 Previously_Insured
   * 4.3 Comparing the effect of different numerical features on the target (Response)
5. Feature Engineering
   * 5.1 Converting categorical columns to numerical values
       * 5.1.1 Mapping categorical Vehicle_Age feature
       * 5.1.2 Mapping categorical Gender feature
       * 5.1.3 Mapping categorical Vehicle_Damage feature 
   * 5.2 Dropping non-essential columns
       * 5.2.1 Dropping categorical columns
       * 5.2.2 Dropping Id column and Vintage column
6. Building/Training/Evaluating our models
   * 6.1 Separating Train/Test dataset
   * 6.2 Modelling various classifiers
   * 6.3 Hyperparameter tuning
   * 6.4 Submitting

# 1-Importing the Necessary Libraries

In [None]:
#Importing the data analysis libraries
import numpy as np # linear algebra
import pandas as pd # data processing

#Importing the visualization libraries
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
#Ensuring that we don't see any warnings while running the cells
import warnings
warnings.filterwarnings('ignore') 

#Importing the counter
from collections import Counter

#Importing the scikit-learn modules that we will need for this project
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split



# 2-Importing the Dataset

In [None]:
train = pd.read_csv("../input/health-insurance-cross-sell-prediction/train.csv")
test = pd.read_csv("../input/health-insurance-cross-sell-prediction/test.csv")

# 3-Data Analysis

## 3.1-Observing the data

In [None]:
train.sample(10)


### Observations/Discussion:
* As we can see from the sample, the dataset has a mixture of both Quantitative and Categorical variables
* This mix of variable types will cause problems while training our model
* We need to convert the Categorical variables into Quantitative variables so that our ML model doesn't run into trouble when training and predicting

In [None]:
train.describe(include="all")

## 3.2-Determining missing values

In [None]:
print(pd.isnull(train).sum())

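### Observations/Discussion:
* Diving deeper into the details of the dataset, we can observe that the dataset has no missing values
* We will still need to check whether the values are correct though
* The NaNs in the describe output are not missing data: they appear where a statistic does not apply to a column's type (for example, the mean of a Categorical feature). We will convert the Categorical features to Quantitative variables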

## 3.3 - Joining Train/Test

First we will combine the train and test data so that the feature engineering is applied to all of the data in the same way, and we don't get discrepancies when modelling and evaluating. We will split the dataframe again after the feature engineering process.
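
The combine, transform, and split-again pattern can be sketched on a tiny made-up dataset. This is only an illustration: the toy frames and the names `toy_train`, `toy_test`, and `n_train` are assumptions and not part of the real data.

```python
import pandas as pd

# Toy stand-ins for the real train/test files (values are made up)
toy_train = pd.DataFrame({"id": [1, 2, 3],
                          "Gender": ["Male", "Female", "Male"],
                          "Response": [0, 1, 0]})
toy_test = pd.DataFrame({"id": [4, 5],
                         "Gender": ["Female", "Male"]})

# Remember how many rows belong to the training set before stacking
n_train = len(toy_train)

# Stack the two frames; test rows get NaN in the Response column
combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# Apply a transformation once, on all rows
combined["Gender_Encoded"] = combined["Gender"].map({"Male": 0, "Female": 1})

# Split back by position: the first n_train rows are the training rows
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns="Response")

print(train_part.shape, test_part.shape)
```

Storing the training row count before concatenating keeps the later split independent of whatever happens to the original frames in between.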

In [None]:
df = pd.concat(objs = [train, test], axis = 0).reset_index(drop=True)
df.describe(include="all")

In [None]:
print(pd.isnull(df).sum())

### Observations and Discussion
* The missing values are all in the Response column: the test rows have no Response, so they show up as NaN after the concatenation
* Apart from these missing responses, there are no missing values in the dataset

### Separating categorical and numerical data

In [None]:
numerical_data = df.select_dtypes(include='number')
categorical_data = df.select_dtypes(exclude='number')

In [None]:
numerical_data.describe(include='all')

In [None]:
categorical_data.head()

# 4 - Visualizing and Comparing the Features

## 4.1 - Correlation heatmap


In [None]:
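#Vehicle_Age and Vehicle_Damage are still string columns here, so only the numeric columns are correlated
sn = sns.heatmap(df[["Response",
                "Age",
                "Driving_License", 
                "Region_Code", 
                "Previously_Insured", 
                "Vehicle_Age", 
                "Vehicle_Damage", 
                "Annual_Premium",
                "Policy_Sales_Channel",
                "Vintage"]].select_dtypes(include="number").corr(), cmap = 'coolwarm', annot = True)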

## 4.2 - Comparing the effect of different categorical features and features with 2 categories on the target (Response)

In [None]:
#A function to print and visualize the fraction of positive responses in each category of a feature
def bar_plot(feature):

    feature_categories = df[feature].sort_values().unique()
    for category in feature_categories:
        #The mean of the 0/1 Response column is the fraction of positive responses; NaN responses (test rows) are ignored
        response_rate = df.loc[df[feature] == category, "Response"].mean()
        #Skip categories that have no labelled responses, which would otherwise give NaN
        if pd.isna(response_rate):
            continue
        print("Percentage of individuals having {}: {}, who got the insurance: {:.2f} %".format(feature, category, response_rate*100))
    #visualize
    sns.barplot(x = feature, y = "Response", data = df).set_title('Fraction Who Got Insurance With Respect To {}'.format(feature))

### 4.2.1 - Gender

In [None]:
bar_plot("Gender")

### Observations and Discussion:
* From this graph it can be deduced that the fraction of males who got the insurance is slightly higher than the fraction of females who did
* The reasons for this difference cannot be determined yet and require further analysis
* From this observation it can be deduced that males have about a 4% greater chance of getting insurance than females
* Thus Gender also plays a part in determining whether an individual will get the insurance or not
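
The per-category rates discussed above can also be computed directly with a group-by. The snippet below is a minimal sketch on made-up data (the `toy` frame and its values are assumptions); it relies on the fact that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Made-up responses: 2 of 4 males and 1 of 4 females respond positively
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "Response": [1, 0, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the response rate
response_rate = toy.groupby("Gender")["Response"].mean()

# Difference between the two groups, expressed in percentage points
gap_pp = (response_rate["Male"] - response_rate["Female"]) * 100

print(response_rate)
print(gap_pp)
```

Reporting the gap in percentage points makes it clear that the comparison is between two rates, not a relative increase.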

### 4.2.2 - Vehicle Age

In [None]:
bar_plot("Vehicle_Age")

### Observations and Discussion:
* The rate of people getting insurance gets higher as the age of the vehicle increases
* Almost 30% of people whose vehicles are older than 2 years got the insurance
* This shows that Vehicle_Age plays a huge part in whether people will get insurance or not

### 4.2.3 - Vehicle Damage

In [None]:
bar_plot("Vehicle_Damage")

### Observations and Discussion:
* The rate of people getting insurance is significantly higher for people with vehicle damage
* Almost 25% of people whose vehicles are damaged got the insurance
* Just 0.5% of people whose vehicles are not damaged got the insurance
* This shows that Vehicle Damage plays a huge part in whether people will get insurance or not

### 4.2.4 - Driving License

In [None]:
bar_plot("Driving_License")

### Observations and Discussion:
* The rate of people getting insurance is more than double for people who have a Driving License as compared to those who don't
* Almost 12% of people who have a Driving License got the insurance
* This shows that Driving License plays a significant part in whether people will get insurance or not

### 4.2.5 - Previously Insured

In [None]:
bar_plot("Previously_Insured")

### Observations and Discussion:
* The rate of people getting insurance is significantly higher for people without prior insurance
* Almost 23% of people who didn't have previous insurance got the insurance
* In comparison, only 0.09% of people who already had insurance got it
* This shows that Previous Insurance status plays a huge part in whether people will get insurance or not
* This makes sense: people who already have insurance are unlikely to look for further insurance or a new insurance program
* People who are uninsured will be looking for insurance, which explains the difference in rate

## 4.3 - Comparing the effect of different numerical features on the target (Response)

### 4.3.1 - Correlation heatmap for numerical data

In [None]:
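#Vehicle_Age is still a string column here, so only the numeric columns are correlated
sn = sns.heatmap(df[["Response",
                    "Age", 
                    "Region_Code",
                    "Vehicle_Age",  
                    "Policy_Sales_Channel",
                    "Vintage"]].select_dtypes(include="number").corr(), cmap = 'coolwarm', annot = True)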

In [None]:
# A function that takes in a feature and displays its histogram, coloured by Response
def histograms(feature):
    fig = px.histogram(
        train, 
        feature, 
        color='Response',
        nbins=100, 
        title=('{} Vs Response'.format(feature)), 
        width=700,
        height=500
    )
    fig.show()

### 4.3.2 Age Vs Response Distribution

In [None]:
histograms("Age")

### Observations and Discussion:
* The 1st graph shows that the majority of those who do not go for insurance are in their 20s
* The 2nd graph shows that the majority of people who do go for insurance are between the ages of 38 and 50
* Thus age plays a significant role in determining whether an individual will get vehicle insurance or not

### 4.3.3 Vintage Vs Response Distribution

In [None]:
histograms("Vintage")

### Observations and Discussion:
* From this graph it is evident that the Vintage feature is evenly distributed
* This agrees with the heatmap, which shows that Vintage plays an insignificant role in determining whether an individual responds to the insurance offer
* Because its role is so insignificant, this is one of the features that can be removed from modelling
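
One way to back up a "negligible correlation" claim with numbers is to list each feature's correlation with the target and flag the weak ones. The sketch below uses a small made-up frame, and the 0.1 cut-off is an arbitrary assumption chosen for illustration, not a value taken from this notebook.

```python
import pandas as pd

# Made-up data: Response follows Age, while Vintage has the same mean
# in both Response groups and is therefore uncorrelated with it
toy = pd.DataFrame({
    "Age":      [22, 25, 28, 31, 44, 47, 52, 58],
    "Vintage":  [10, 200, 100, 290, 290, 100, 200, 10],
    "Response": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Correlation of every feature with the target, without the target itself
corr_with_target = toy.corr()["Response"].drop("Response")

# Hypothetical threshold: treat |correlation| below 0.1 as negligible
weak = corr_with_target[corr_with_target.abs() < 0.1].index.tolist()

print(corr_with_target)
print(weak)
```

Pearson correlation only captures linear relationships, so a low value is a hint that a feature may be dropped rather than proof that it carries no signal.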

### 4.3.4 Region Code Vs Response Distribution

In [None]:
histograms("Region_Code")

Observations:
* Region code = 28 has a very large number of counts as compared to other regions
* There are small spikes everywhere else but nothing as substantial as Region Code = 28

### 4.3.5 Policy_Sales_Channel Vs Response Distribution

In [None]:
histograms("Policy_Sales_Channel")

Observations:
* There are spikes at a few policy sales channels
* Most policy sales channels have very few customers

### 4.3.6 Annual_Premium Vs Response Distribution

In [None]:
histograms("Annual_Premium")

Observations:
* The vast majority of the customers have an annual premium of less than 100k
* As the annual premium increases, the chance of a positive response (Response = 1) decreases
* Higher premiums usually deter customers from the insurance offer, which can explain the point stated above
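
Histogram counts are dominated by how many customers fall in each range, so a rate per premium band can be a useful cross-check of the trend described above. The following is a sketch on made-up data; the bin edges, the labels, and the `Premium_Band` column are assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up premiums and responses
toy = pd.DataFrame({
    "Annual_Premium": [5_000, 20_000, 30_000, 45_000,
                       60_000, 80_000, 150_000, 300_000],
    "Response":       [0, 1, 0, 1, 0, 1, 0, 0],
})

# Arbitrary band edges; the last band is open-ended
bins = [0, 25_000, 50_000, 100_000, float("inf")]
labels = ["<25k", "25k-50k", "50k-100k", ">100k"]
toy["Premium_Band"] = pd.cut(toy["Annual_Premium"], bins=bins, labels=labels)

# Response rate and number of customers in each band
band_summary = toy.groupby("Premium_Band", observed=False)["Response"].agg(["mean", "count"])

print(band_summary)
```

Keeping the count next to the mean shows how much data supports each band's rate, which matters for the sparsely populated high-premium bands.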

# 5 - Feature Engineering

## 5.1 - Converting categorical columns to numerical values

### 5.1.1 - Vehicle_Age Column

In [None]:
df["Vehicle_Age_Encoded"] = df["Vehicle_Age"].map({"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2})

### 5.1.2 - Gender Column

In [None]:
df["Gender_Encoded"] = df["Gender"].map({"Male": 0, "Female": 1})

### 5.1.3 - Vehicle_Damage Column

In [None]:
df["Vehicle_Damage_Encoded"] = df["Vehicle_Damage"].map({"No": 0, "Yes": 1})

In [None]:
df.head()

## 5.2 - Dropping non-essential columns

### 5.2.1 - Dropping categorical columns

In [None]:
df = df.drop(["Vehicle_Age", "Vehicle_Damage", "Gender"], axis = 1)

Comments:
* I have already converted these categorical columns into numerical representations, thus these columns can now be dropped
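
Before the original columns are dropped, it can be worth confirming that the mapping covered every value, because `Series.map` returns NaN for any value that is not a key of the dictionary. The snippet below is a minimal sketch on made-up data; the misspelled value is deliberate and hypothetical.

```python
import pandas as pd

# Made-up column; the last value is deliberately misspelled
toy = pd.DataFrame({"Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 year"]})

mapping = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
toy["Vehicle_Age_Encoded"] = toy["Vehicle_Age"].map(mapping)

# Values missing from the mapping come back as NaN, so list them explicitly
unmapped = toy.loc[toy["Vehicle_Age_Encoded"].isna(), "Vehicle_Age"].unique().tolist()

print(unmapped)
```

An empty list means the encoded column is complete and the original column can be dropped safely.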

### 5.2.2 - Dropping Id column and Vintage column

In [None]:
customer_ID = pd.Series(df["id"], name = "CustomerId")
df = df.drop(["id", "Vintage"], axis = 1)

Comments:
* The Id of a particular customer plays no part in determining the outcome of the response, thus it must be dropped
* The Vintage feature has a negligible correlation with the response and the other features, as is evident from the correlation heatmap, thus it can also be dropped

In [None]:
df.sample(5)

## 6 - Building/Training/Evaluating our models

### 6.1 - Separating Train/Test dataset

In [None]:
train = df[:train.shape[0]]
test = df[train.shape[0]:].drop(["Response"], axis = 1)

### 6.2 - Modelling various classifiers

In [None]:
#StratifiedKFold aims to ensure each class is (approximately) equally represented across each test fold
k_fold = StratifiedKFold(n_splits=5)

X_train = train.drop(labels="Response", axis=1)
y_train = train["Response"]

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

# Creating objects of each classifier
LG_classifier = LogisticRegression(random_state=0)
SVC_classifier = SVC(kernel="rbf", random_state=0)
KNN_classifier = KNeighborsClassifier()
NB_classifier = GaussianNB()
DT_classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
RF_classifier = RandomForestClassifier(n_estimators=200, criterion="entropy", random_state=0)

#Putting the classifiers in a list so I can iterate over their results easily
#Only Logistic Regression is evaluated here; add the others in the same order as classifier_dict to compare them
insurance_classifiers = [LG_classifier]

#This dictionary is just to grab the name of each classifier
classifier_dict = {
    0: "Logistic Regression",
    1: "Support Vector Classification",
    2: "K Nearest Neighbor Classification",
    3: "Naive Bayes Classifier",
    4: "Decision Trees Classifier",
    5: "Random Forest Classifier",
}

#Collecting one row per classifier (DataFrame.append was removed in pandas 2.0)
result_rows = []

#Iterating over each classifier and getting the result
for i, classifier in enumerate(insurance_classifiers):
    classifier_scores = cross_val_score(classifier, X_train, y_train, cv=k_fold, n_jobs=2, scoring="accuracy")
    result_rows.append({"Model": classifier_dict[i],
                        "Mean Accuracy": classifier_scores.mean(),
                        "Standard Deviation": classifier_scores.std()})

insurance_results = pd.DataFrame(result_rows, columns=["Model", "Mean Accuracy", "Standard Deviation"])

In [None]:
print (insurance_results.to_string(index=False))
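
To illustrate what the stratified split does, the sketch below applies `StratifiedKFold` to a made-up, imbalanced target and counts the positives in each test fold. The toy arrays are assumptions; the last line also computes the accuracy of always predicting the majority class, which is a useful reference point when reading accuracy scores on imbalanced data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 20 positives out of 100 rows
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((100, 1))  # the feature values do not affect the split

skf = StratifiedKFold(n_splits=5)

# Each test fold should hold the same share of positives as the full data
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]

# Accuracy obtained by always predicting the majority class (0)
baseline_accuracy = 1 - y.mean()

print(positives_per_fold, fold_sizes, baseline_accuracy)
```

A model is only informative if its cross-validated accuracy clearly exceeds this majority-class baseline.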

### 6.3 - Hyperparameter Tuning

In [None]:
# from sklearn.model_selection import GridSearchCV

# RF_classifier = RandomForestClassifier()


# ## Search grid for optimal parameters
# RF_paramgrid = {"max_depth": [None],
#                   "max_features": [1, 3, 10],
#                   "min_samples_split": [2, 3, 10],
#                   "min_samples_leaf": [1, 3, 10],
#                   "bootstrap": [False],
#                   "n_estimators" :[100,200,300],
#                   "criterion": ["entropy"]}


# RF_classifiergrid = GridSearchCV(RF_classifier, param_grid = RF_paramgrid, cv=k_fold, scoring="accuracy", n_jobs= -1, verbose=1)

# RF_classifiergrid.fit(X_train,y_train)

# RFC_optimum = RF_classifiergrid.best_estimator_

# # Best Accuracy Score
# RF_classifiergrid.best_score_

In [None]:
IDtest = customer_ID[train.shape[0]:].reset_index(drop = True)

### 6.4 - Submitting

In [None]:

X_train = train.drop(labels="Response", axis=1)
y_train = train["Response"]

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Reuse the mean and standard deviation learned from the training data
X_test = sc.transform(test)

LG_classifier.fit(X_train, y_train)

test_predictions = pd.Series(LG_classifier.predict(X_test).astype(int), name="Response")
insurance_results = pd.concat([IDtest, test_predictions], axis = 1)
insurance_results.to_csv('submission.csv', index=False)
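
Scaling statistics are usually learned from the training data only and then reused for the test data. The sketch below, on made-up one-column arrays, shows how refitting the scaler on the test set would hide a shift between the two sets; all names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the test values are shifted relative to the training values
X_train_toy = np.array([[0.0], [10.0]])   # mean 5, standard deviation 5
X_test_toy = np.array([[10.0], [20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train_toy)  # learns mean/std from train
test_scaled = scaler.transform(X_test_toy)        # reuses the train mean/std

# Refitting on the test set maps it onto the same range as the training set,
# so the model would no longer see that the test values are larger
test_refit = StandardScaler().fit_transform(X_test_toy)

print(train_scaled.ravel(), test_scaled.ravel(), test_refit.ravel())
```

With `transform`, the test rows are expressed in the same units the model was trained on, which is what the fitted classifier expects.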