# Health Insurance Lead Prediction

[Link to competition here!](https://datahack.analyticsvidhya.com/contest/job-a-thon/)

Go there and register to be able to download the dataset and submit your predictions.

Your client, FinMan, is a financial services company that provides services such as loans, investment funds, and insurance to its customers. FinMan wishes to cross-sell health insurance to existing customers, who may or may not already hold insurance policies with the company. The company recommends health insurance to its customers based on their profile once they land on the website. Customers might browse the recommended health insurance policy and then fill in a form to apply. When a customer fills in the form, their Response towards the policy is considered positive and they are classified as a lead.

Once these leads are acquired, the sales advisors approach them to convert them, so the company can sell the proposed health insurance to these leads more efficiently.

Now the company needs your help in building a model to predict whether a person will be interested in the proposed health plan/policy, given information about:

- Demographics (city, age, region etc.)
- Information regarding holding policies of the customer
- Recommended Policy Information

In [None]:
# import useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

In [None]:
# load in data and set seed
BASE = '../input/jobathon-analytics-vidhya/'
SEED = 2021

train = pd.read_csv(f'{BASE}train.csv')
test = pd.read_csv(f'{BASE}test.csv')
ss = pd.read_csv(f'{BASE}sample_submission.csv')

In [None]:
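# do a bit of cleaning: strip the literal '+' from durations and parse them as numbers
# regex=False makes sure '+' is treated as a plain character, not a regex quantifier
train['Holding_Policy_Duration'] = pd.to_numeric(train['Holding_Policy_Duration'].str.replace('+', '', regex=False))
test['Holding_Policy_Duration'] = pd.to_numeric(test['Holding_Policy_Duration'].str.replace('+', '', regex=False))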

In [None]:
# Prepare a few key variables to classify columns into categorical and numeric
ID_COL, TARGET_COL = 'ID', 'Response'

features = [c for c in train.columns if c not in [ID_COL, TARGET_COL]]

cat_cols = ['City_Code',
            'Region_Code',
            'Accomodation_Type',
            'Reco_Insurance_Type',
            'Is_Spouse',
            'Health Indicator',
            'Holding_Policy_Type',
            'Reco_Policy_Cat']

num_cols = [c for c in features if c not in cat_cols]

## EDA starts
First, we look at the first few rows of the train dataset and of the sample submission.

In [None]:
train.head(3)

In [None]:
ss.head(3)

In [None]:
# look at distribution of target variable
train[TARGET_COL].value_counts(), train[TARGET_COL].value_counts(normalize=True)

In [None]:
# look at which variables are null and if they were parsed correctly
train.info()

In [None]:
test.info()

In [None]:
# look at unique values in all columns
train.nunique()

In [None]:
test.nunique()

Looks like we have a lot of nulls in `Health Indicator`, `Holding_Policy_Duration`, and `Holding_Policy_Type`. :/ Otherwise, pandas parsed the columns quite well.
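
To put a number on "a lot of nulls", one option is to look at the share of missing values per column instead of reading it off `info()`. The sketch below uses a small made-up frame (`toy`) in place of `train`, so the values are purely illustrative. It also checks whether the two holding-policy columns are missing in the same rows, which would suggest the nulls simply mean "no policy held". That interpretation is an assumption, not something the dataset description confirms.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration
toy = pd.DataFrame({
    'Health Indicator': ['X1', None, 'X2', None],
    'Holding_Policy_Duration': [3.0, np.nan, np.nan, 14.0],
    'Holding_Policy_Type': [1.0, np.nan, np.nan, 3.0],
    'Reco_Policy_Premium': [11628.0, 30510.0, 7450.0, 17780.0],
})

# Share of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Are the two holding-policy columns missing in exactly the same rows?
same_rows = (toy['Holding_Policy_Duration'].isnull()
             == toy['Holding_Policy_Type'].isnull()).all()
print('Holding policy columns missing together:', same_rows)
```

On the real data, the same two expressions can be run on `train` and `test` directly.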

### Looking at categorical columns
Because of all the categorical columns, I decided to set a baseline with CatBoost. Here are the top 5 value counts and countplots for each of them, which prove useful.

In [None]:
# print top 5 values and plot data wrt target variable
for col in cat_cols:
    if col != 'Region_Code':  # too high granularity
        print(f'Analysing: {col}\nTrain top 5 counts:')
        print(train[col].value_counts().head(5))
        print('Test top 5 counts:')
        print(test[col].value_counts().head(5))
        plt.figure(figsize=(20, 5))
        sns.countplot(x=col, hue=TARGET_COL, data=train)
        plt.show()
        print('\n')

#### Observations
Here I am interested in the ratio of positive to negative responses within each category. If a category's ratio differs a lot from the ratios of the other categories, that category carries useful signal about the target.
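
The countplots show these ratios only visually. A minimal sketch of how they could be computed is given below, on a made-up frame rather than the competition data. The helper name `response_rate_by_category` is hypothetical. It reports, for each category, the number of rows, the share of positive responses, and how far that share is from the overall positive rate.

```python
import pandas as pd

# Toy data standing in for `train`; the values are made up for illustration
toy = pd.DataFrame({
    'Accomodation_Type': ['Owned', 'Owned', 'Owned', 'Owned',
                          'Rented', 'Rented', 'Rented', 'Rented'],
    'Response': [1, 0, 0, 0, 1, 1, 0, 0],
})

def response_rate_by_category(df, col, target='Response'):
    """Per-category row count, positive rate, and gap to the overall positive rate."""
    overall = df[target].mean()
    out = df.groupby(col)[target].agg(count='count', rate='mean')
    out['diff_from_overall'] = out['rate'] - overall
    return out.sort_values('rate', ascending=False)

rates = response_rate_by_category(toy, 'Accomodation_Type')
print(rates)
```

Categories with a large gap but a tiny `count` should be read with care, since their rate is estimated from few rows.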

### Analysis of continuous variables
For each continuous variable, I plotted a kernel density estimate and a boxplot split by the target variable to look for patterns.

In [None]:
# plot kernel density plot and a boxplot of data wrt target variable
for col in num_cols:
    print(f'Analysing: {col}')
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
    sns.kdeplot(train[col], ax=ax1)
    sns.boxplot(x=train[TARGET_COL], y=train[col], ax=ax2)
    plt.show()
    print('\n')

#### Observations
All numeric columns except `Reco_Policy_Premium` seem to have a bimodal distribution. `Reco_Policy_Premium` looks slightly skewed, with its mass toward the left, so let's try a log transformation.
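
As a rough illustration of why a log transformation is worth trying here: `log1p` compresses large values more than small ones, so it pulls in a long right tail. The sketch below uses synthetic log-normal data as a stand-in for a premium column (an assumption, the real column may behave differently) and compares the sample skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)

# Synthetic, right-tailed stand-in for a premium column (not the real data)
premium = pd.Series(rng.lognormal(mean=9.5, sigma=0.5, size=5000))

# Sample skewness: positive means a longer right tail, near zero means roughly symmetric
skew_raw = premium.skew()
skew_log = np.log1p(premium).skew()
print(f'skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}')
```

Running `train['Reco_Policy_Premium'].skew()` before and after `np.log1p` gives the same comparison on the real column.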

In [None]:
for col in ['Reco_Policy_Premium']:
    # plot kernel density plot and a boxplot of log-transformed data wrt target variable
    print(f'Analysing: {col}')
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
    sns.kdeplot(np.log1p(train[col]), ax=ax1)
    sns.boxplot(x=train[TARGET_COL], y=np.log1p(train[col]), ax=ax2)
    plt.show()
    print('\n')

#### Observations
Looks like the distributions do not differ much between the two target classes. :/
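
One way to back up this visual impression with numbers is to compare the quartiles of a variable per target class, which is what the boxplots draw. The sketch below does this on a made-up frame, so the values are illustrative only. A median gap that is small relative to the interquartile range points to heavily overlapping distributions.

```python
import pandas as pd

# Toy data standing in for `train`; the values are made up for illustration
toy = pd.DataFrame({
    'Response': [0, 0, 0, 0, 1, 1, 1, 1],
    'Reco_Policy_Premium': [10000.0, 12000.0, 14000.0, 16000.0,
                            11000.0, 13000.0, 15000.0, 17000.0],
})

# Quartiles of the variable within each target class
summary = (toy.groupby('Response')['Reco_Policy_Premium']
              .quantile([0.25, 0.5, 0.75])
              .unstack())
summary.columns = ['q25', 'median', 'q75']
summary['iqr'] = summary['q75'] - summary['q25']

# Gap between class medians, to be compared against the IQR
median_gap = summary.loc[1, 'median'] - summary.loc[0, 'median']
print(summary)
print('median gap:', median_gap)
```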

In [None]:
# Correlation heatmap
# not that useful for classification, especially with GBDTs,
# since predictions of tree-based models are largely insensitive to multicollinearity
plt.figure(figsize=(22, 8))
sns.heatmap(train[num_cols].corr(), annot=True);

In [None]:
# Pairplots => these might take longer to render
sns.pairplot(train[num_cols]);

## Baseline Model
Alright, with the EDA of all variables done, it's time to introduce an untuned `CatBoostClassifier` model as a baseline.

In [None]:
# Data preparation
y = train[TARGET_COL]
X = train.drop([TARGET_COL, ID_COL], axis=1)
X.head()

In [None]:
# Categorical features reminder
cat_cols

In [None]:
# fillnas and convert to right data types
print(X[cat_cols].info())

X_filled = X.copy()
X_filled['Health Indicator'] = X['Health Indicator'].fillna('NA')
X_filled['Holding_Policy_Type'] = X['Holding_Policy_Type'].fillna(0).astype(np.int64)

X_filled[cat_cols].info()

In [None]:
# Split the data into train and validation sets, stratified on the target
# Cross validation is not included in the baseline => the single validation score may be noisy or optimistic
X_train, X_validation, y_train, y_validation = train_test_split(
    X_filled, y,
    train_size=0.8,
    random_state=SEED,
    shuffle=True,
    stratify=y
)
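
A single 80/20 split yields one validation score. Stratified k-fold cross validation would give several scores and a sense of their spread. The sketch below is not part of this baseline: it runs on synthetic data from `make_classification`, and `LogisticRegression` is only a lightweight stand-in for the CatBoost model so that the example stays quick to run. The class weights are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 2021

# Synthetic imbalanced data standing in for (X_filled, y)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, weights=[0.76, 0.24], random_state=SEED
)

# Each fold keeps roughly the same class ratio as the full data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Stand-in classifier; swap in the real model when adapting this sketch
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=skf, scoring='roc_auc'
)
print(f'AUC per fold: {np.round(scores, 3)}')
print(f'mean={scores.mean():.3f}, std={scores.std():.3f}')
```

With CatBoost, the same `skf.split(X_filled, y)` indices could drive a manual loop that fits one model per fold, passing `cat_features=cat_cols` each time.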

In [None]:
model = CatBoostClassifier(
    random_seed=SEED,
    eval_metric='AUC',
)
model.fit(
    X_train, y_train,
    cat_features=cat_cols,
    use_best_model=True,
    eval_set=(X_validation, y_validation),
    verbose=50
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())

In [None]:
print('Tree count: ' + str(model.tree_count_))

In [None]:
model.get_feature_importance(prettified=True)

In [None]:
X_test = test.drop([ID_COL], axis=1)
X_test.head()

In [None]:
# fillnas and convert to right data types TEST
print(X_test[cat_cols].info())

X_test_filled = X_test.copy()
X_test_filled['Health Indicator'] = X_test['Health Indicator'].fillna('NA')
X_test_filled['Holding_Policy_Type'] = X_test['Holding_Policy_Type'].fillna(0).astype(np.int64)

X_test_filled[cat_cols].info()

In [None]:
contest_predictions = model.predict_proba(X_test_filled)[:,1]
print('Predictions:')
print(contest_predictions)

In [None]:
ss[TARGET_COL] = contest_predictions
ss.head()

In [None]:
ss.to_csv("Catboost_Baseline.csv", index=False)

In [None]:
# and we're done!
'Done!'