### Problem Statement:
An insurance company that provides health insurance to its customers is now planning to offer vehicle insurance as well. The company wants to know how many of its existing customers would be interested in vehicle insurance.

##### EDA:
Perform EDA to extract valuable insights from the data. 

##### Feature Engineering: 
Perform feature engineering to identify which columns matter most for model building, and try to derive new features that improve the model.

##### Modelling:
Build a model that outputs a probability score for how likely a person is to opt for vehicle insurance.
    
##### Note:
<b><p> I am still working on the model. Feel free to drop by later for more updates. </p></b>
Consider upvoting if you like my work. If you have any suggestions, please drop them in the comments and I will take a look and work on them.
     
Thank you!!!

In [None]:
## Importing libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# Loading the dataset
train_df = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
test_df = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')

In [None]:
train_df.head()

In [None]:
test_df.head()

### EDA

In [None]:
## Checking the number of features and instances
train_df.shape

In [None]:
## Check for missing values
train_df.isnull().sum()

We can infer from above that there are no missing values. 

In [None]:
## Looking at columns
train_df.columns

In [None]:
print(train_df["Region_Code"].unique())
print(train_df["Policy_Sales_Channel"].unique())

In [None]:
## Segregating columns
numerical_columns = ["Age","Region_Code", "Annual_Premium", "Policy_Sales_Channel", "Vintage"]
categorical_columns = ["Gender","Driving_License", 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage']

In [None]:
train_df[numerical_columns].describe()

##### Checking if the data is skewed. 

In [None]:
### Checking if the data is skewed. 
sns.countplot(x = train_df["Response"])

From the above plot we can see that the target classes are imbalanced: far more customers have a response of 0 than 1. Going forward, we need techniques such as random sampling or SMOTE to address this.
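
To put a number on the imbalance, the class shares can be computed directly. The sketch below uses a small made-up `response` series (an assumption for illustration, not the real data); in this notebook the same calls would be applied to `train_df["Response"]`.

```python
import pandas as pd

# Toy stand-in for train_df["Response"] (hypothetical values)
response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1], name="Response")

# normalize=True returns the share of each class instead of raw counts
class_share = response.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class to the minority class
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that plain accuracy will be a misleading metric and that some form of resampling or class weighting is worth trying.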

##### Gender participation

In [None]:
### Gender participation
sns.countplot(x = train_df["Gender"])

##### Gender distribution based on response

In [None]:
### Gender distribution based on response
fig, axis = plt.subplots(1, 2, figsize = (14, 5))

sns.countplot(ax = axis[0], x = train_df[train_df["Response"] == 1]["Gender"])
axis[0].set_title("When they subscribe to vehicle insurance")

sns.countplot(ax = axis[1], x = train_df[train_df["Response"] == 0]["Gender"])
axis[1].set_title("When they do not subscribe to vehicle insurance")

fig.tight_layout()

From the above we can see that more males than females subscribe to vehicle insurance. Gender may therefore be a useful feature for the model, although these raw counts should be read against the total number of males and females in the dataset.
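
Raw counts depend on how many males and females are in the data, so a per-group subscription rate is easier to compare. A minimal sketch on made-up rows (the frame `df` and its values are hypothetical; the column names follow the dataset):

```python
import pandas as pd

# Toy data standing in for train_df (hypothetical values)
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Male", "Male",
                 "Female", "Female", "Female", "Female"],
    "Response": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Response is 0/1, so its mean within each group is the subscription rate
rate_by_gender = df.groupby("Gender")["Response"].mean()
print(rate_by_gender)
```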

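##### Analysing Driving license

In [None]:
### Analysing Driving license
## Driving_License is a 0/1 flag, so sum() counts the license holders per gender
## (count() would only count the rows in each group)
temp = train_df.groupby("Gender")["Driving_License"].sum().to_frame().reset_index()
print(temp)
sns.barplot(x = temp["Gender"], y = temp["Driving_License"])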

##### Customer previously insured

In [None]:
sns.countplot(x = train_df["Previously_Insured"])

Customers who were previously insured and those who were not are almost equally represented.

##### Analysing Vehicle age

In [None]:
sns.countplot(x = train_df["Vehicle_Age"])

This doesn't tell us much about the data, so I need to check how many customers opted for insurance within each of the three groups above.

In [None]:
temp = train_df.groupby(["Vehicle_Age","Response"]).count()["id"].to_frame().reset_index()
temp

The above output doesn't help much because the dataset is highly imbalanced: the number of people who did not opt for insurance is much higher than the number who did. To make sense of response versus vehicle age, the dataset needs to be sampled.

##### Plotting the count for each group when people have opted for insurance

In [None]:
sns.catplot(x = "Vehicle_Age", y="id", col="Response", data=temp[temp["Response"] == 1], kind="bar")

From the above graph we can observe that people tend to take insurance when the vehicle is 1-2 years old. This might be an important factor when building the model.

But before coming to this conclusion, we need to check how many records fall under each of the three groups. If there are more records for vehicles aged 1-2 years than for the other groups, we can't draw the above conclusion from counts alone.
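
One way to account for unequal group sizes is to normalise the counts within each group. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows (the frame `df` is hypothetical; the category labels are assumed to match the dataset):

```python
import pandas as pd

# Toy data standing in for train_df (hypothetical values)
df = pd.DataFrame({
    "Vehicle_Age": ["< 1 Year"] * 4 + ["1-2 Year"] * 4 + ["> 2 Years"] * 2,
    "Response":    [0, 0, 0, 1,       0, 0, 1, 1,       1, 0],
})

# normalize="index" turns each row into proportions that sum to 1,
# so the column for Response == 1 is the opt-in rate of that age group
rates = pd.crosstab(df["Vehicle_Age"], df["Response"], normalize="index")
print(rates)
```

Comparing the `1` column across rows shows which age group opts in most often, independent of how many records each group has.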

##### Analysing Vehicle Age

In [None]:
sns.countplot(x = train_df["Vehicle_Age"])

##### Counting number of damaged vehicle

In [None]:
sns.countplot(x = train_df["Vehicle_Damage"])

From the above plot we can see that the numbers of records with damaged and non-damaged vehicles are roughly equal.

##### Analysing the response of the customers when they have damaged vehicles

In [None]:
temp = train_df.groupby(["Vehicle_Damage","Response"]).count()['id'].to_frame().reset_index()
temp

In [None]:
sns.catplot(x="Vehicle_Damage", y="id", col = "Response", data = temp[temp["Response"] == 1], kind = "bar")

From the above we can see that most people who opted for the insurance have previously had their vehicle damaged.

##### Analysing Annual Premium

In [None]:
sns.histplot(x = train_df["Annual_Premium"])

### Feature Engineering

##### Correlation plot

In [None]:
plt.figure(figsize = (10,10))
plt.title("Correlation Plot")
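## numeric_only=True skips the text columns; recent pandas versions raise an error without it
sns.heatmap(train_df.corr(numeric_only = True), linewidths = 5, annot = True, square = True, annot_kws={'size': 10}, cmap="YlGnBu")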

From the above plot we can see the correlation between each pair of features. If two features are highly correlated, we can eliminate one of them, since they carry largely redundant information and can contribute to overfitting. We should also make sure to keep the features that correlate strongly with the output, because they help prediction.
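
Reading highly correlated pairs off a heatmap gets tedious as the number of columns grows. Here is a rough sketch of how they could be listed programmatically; the toy frame `df`, its column names and the `threshold` of 0.9 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "b" is nearly a scaled copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off; tune for the problem at hand
high_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold  # NaN comparisons are False, so masked cells are skipped
]
print(high_pairs)
```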

##### Converting the data into 0-1 encodings

In [None]:
train_df.head()

In [None]:
## Reading Continuous and Categorical data
cont = ["Age", "Vintage", "Annual_Premium"]
cat = ["Gender", "Driving_License", "Region_Code", "Previously_Insured", "Vehicle_Age", "Vehicle_Damage"]

In [None]:
train = pd.get_dummies(train_df,drop_first = True)

In [None]:
train.head()

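If you observe the output above, `get_dummies` encodes only the text (categorical) columns and leaves the numeric ones unchanged. With `drop_first = True`, a column with k categories becomes k-1 indicator columns, so a two-category column such as `Gender` turns into a single 0/1 column. Also note that the new columns are named `<column>_<category>` (for example `Gender_Male`), so the next cell assigns cleaner names.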
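
The effect of `drop_first=True` can be checked on a tiny frame. This is a sketch with made-up rows, not the real data; `dtype=int` is an addition here so that the indicators show up as 0/1 rather than booleans.

```python
import pandas as pd

# Toy rows (hypothetical values) with one binary and one three-level text column
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Age": [25, 40, 61],
})

# Text columns are replaced by indicator columns; the first category
# (in sorted order) of each column is dropped as the reference level
encoded = pd.get_dummies(df, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The dropped category is implied whenever all indicators of that column are 0, which is why no information is lost.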

In [None]:
train.columns
train.columns = ['id', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
       'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response',
       'Gender_Male', 'Vehicle_Age__1_Year', 'Vehicle_Age_2_Years',
       'Vehicle_Damage_Yes']

In [None]:
train.head()

### Modelling

##### Random Sampling the data

In [None]:
# Getting the records which have the value as 1 for response
train_1 = train[train["Response"] == 1]

In [None]:
print(len(train_1))

# Getting the records which have value as 0 for response
train_0 = train[train["Response"] == 0]

print(len(train_0))

In [None]:
## Getting random samples of train_0 for modelling
## random_state makes the sample reproducible across runs
train_00 = train_0.sample(n = len(train_1), random_state = 42)

In [None]:
len(train_00)

In [None]:
## Concatenating the two dataframes to have equal number of records when response = 1 and 0
## (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
train_sampled = pd.concat([train_1, train_00])

In [None]:
len(train_sampled)
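
The undersampling steps above can also be wrapped in one small helper. This is only a sketch: the function name `undersample`, the toy frame and the fixed seed are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def undersample(df, target, random_state=42):
    """Randomly downsample every class to the size of the smallest class."""
    counts = df[target].value_counts()
    n_min = counts.min()
    parts = [
        df[df[target] == label].sample(n=n_min, random_state=random_state)
        for label in counts.index
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data (hypothetical): 2 positive rows and 8 negative rows
toy = pd.DataFrame({"x": range(10), "Response": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})
balanced = undersample(toy, "Response")
print(balanced["Response"].value_counts())
```

Undersampling discards most of the majority class, so it trades data volume for balance; over-sampling or class weights are alternatives when that loss matters.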

##### Scaling the data

In [None]:
train_sampled.head()

In [None]:
train_sampled = train_sampled.drop(["id"], axis = 1)

In [None]:
train_sampled.head(3)

In [None]:
train_sampled.columns

In [None]:
## Splitting the dataset into features and target variable
X = train_sampled[['Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
       'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Gender_Male', 'Vehicle_Age__1_Year', 'Vehicle_Age_2_Years',
       'Vehicle_Damage_Yes']]
y = train_sampled[["Response"]]

In [None]:
## Printing the first 3 rows of X
X.head(3)

In [None]:
## printing the first 3 rows of y
y.head(3)

In [None]:
# Declaring the standard scaler and transforming the dataset 
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

In [None]:
## Displaying the first 3 rows 
X_scaled[:3]


##### Train_Test_Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.33, random_state=42)
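
A common variation, different from the order used in the cells above, is to split first and fit the scaler on the training rows only, so that no statistics from the held-out rows leak into training. A minimal sketch on random toy arrays (the data and shapes are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and binary target (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split the raw features first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from training rows only
X_test_scaled = scaler.transform(X_test)        # the same mean/std reused on held-out rows

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The test-set means will generally be close to, but not exactly, zero, which is expected because the scaler never saw those rows.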

##### Building an ANN Model using PyTorch