# Health Insurance Cross Sell Prediction

### Predict which Health Insurance customers will be interested in Vehicle Insurance

## Context
An insurance company, currently offering Health Insurance with coverage up to $200,000, plans to expand its product range to include a new segment: Vehicle Insurance.

Since it is cheaper to sell to existing customers than to acquire new ones, the business team wants to identify which customers in the current base might be interested in buying the new Vehicle Insurance. To do this, the company needs a predictive model that can rank customers by their purchase propensity, allowing for more effective targeting of offers.

Insurance is an agreement where a company promises to provide compensation for specific losses (such as damage, illness, or death) in exchange for payment of a premium. For example, a customer might pay an annual premium to ensure that if they are hospitalized, the insurance company will cover the costs up to a certain limit.

Developing this predictive model can save the company time and resources. With a ranked list of customers based on their purchase propensity, the business team can better target offers and increase the chances of making sales.

The data is available on Kaggle: [data](https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction)

## Aim
Create a purchase propensity score for this customer base and rank customers by it. Because sending offers to customers has a cost, the challenge is to order the customers so that the most likely buyers are contacted first.

Questions to answer:

- Who are the customers with the highest purchase propensity?
- Which customers should we target with offers to maximize the company's revenue?
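
To make the ranking idea concrete, here is a minimal sketch on toy data. The `propensity_score` column and the `top_k` and `recall_at_k` helpers are hypothetical illustrations, not part of this notebook; in practice the score would typically be a classifier's predicted probability of a positive `Response`.

```python
import pandas as pd

# Hypothetical toy data: 'propensity_score' stands in for a model's
# predicted probability of interest; 'response' is the observed outcome.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'propensity_score': [0.10, 0.80, 0.35, 0.90, 0.05, 0.60],
    'response': [0, 1, 0, 1, 0, 0],
})

def top_k(df, k, score_col='propensity_score'):
    """Return the k customers with the highest score (hypothetical helper)."""
    return df.sort_values(score_col, ascending=False).head(k)

def recall_at_k(df, k, score_col='propensity_score', target_col='response'):
    """Share of all interested customers that fall inside the top k."""
    captured = top_k(df, k, score_col)[target_col].sum()
    return captured / df[target_col].sum()

ranked = top_k(toy, 3)
print(ranked['id'].tolist())   # customers to contact first
print(recall_at_k(toy, 3))     # fraction of interested customers reached
```

With a limited contact budget, a metric such as recall at k shows how much of the interested population a given list size would reach.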

## Data Dictionary
- Train Data

| **Variable**             | **Definition**                                                                                      |
|--------------------------|-----------------------------------------------------------------------------------------------------|
| `id`                     | Unique ID for the customer                                                                          |
| `Gender`                 | Gender of the customer                                                                              |
| `Age`                    | Age of the customer                                                                                 |
| `Driving_License`        | Indicates whether the customer has a Driving License: 0 = No, 1 = Yes                               |
| `Region_Code`            | Unique code for the region of the customer                                                          |
| `Previously_Insured`     | Indicates if the customer already has Vehicle Insurance: 0 = No, 1 = Yes                            |
| `Vehicle_Age`            | Age of the Vehicle                                                                                  |
| `Vehicle_Damage`         | Indicates if the customer’s vehicle has been damaged in the past; stored as text rather than 0/1 in the raw data |
| `Annual_Premium`         | The amount the customer needs to pay as a premium for the year                                      |
| `Policy_Sales_Channel`   | Anonymized code for the channel used to reach the customer (e.g., Different Agents, Over Email, Over Phone, In Person, etc.) |
| `Vintage`                | Number of days the customer has been associated with the company                                    |
| `Response`               | Indicates if the customer is interested in the offer: 0 = No, 1 = Yes                               |

- Test Data

The test data contains the same variables as the train data, but it doesn't include the target variable `Response`.


# 0.0. IMPORTS

In [20]:
import pandas                       as pd
import numpy                        as np 
import matplotlib.pyplot            as plt
import seaborn                      as sns
from sklearn import model_selection as ms
from sklearn import ensemble        as en

plt.style.use('seaborn-v0_8-whitegrid')



## 0.1. Loading Dataset

In [2]:
df_raw = pd.read_csv( '../data/raw/train.csv')

# 1.0. DATA DESCRIPTION

In [3]:
df1 = df_raw.copy()

## 1.1 Rename Columns To Lowercase

In [4]:
cols_old = ['id', 'Gender', 'Age', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response']

lowercase = lambda x: x.lower()

cols_new = list(map(lowercase, cols_old))

# Rename the DataFrame columns
df1.columns = cols_new

## 1.2 Data Dimension

In [5]:
print( 'Number of Rows: {}'.format(df1.shape[0]))
print( 'Number of Cols: {}'.format(df1.shape[1]))

Number of Rows: 381109
Number of Cols: 12


## 1.3 Data Types

In [6]:
df1.dtypes

id                        int64
gender                   object
age                       int64
driving_license           int64
region_code             float64
previously_insured        int64
vehicle_age              object
vehicle_damage           object
annual_premium          float64
policy_sales_channel    float64
vintage                   int64
response                  int64
dtype: object

## 1.4 Check NA

In [7]:
df1.isna().sum()

id                      0
gender                  0
age                     0
driving_license         0
region_code             0
previously_insured      0
vehicle_age             0
vehicle_damage          0
annual_premium          0
policy_sales_channel    0
vintage                 0
response                0
dtype: int64

## 1.5 Check Duplicate ID

In [8]:
df1['id'].nunique()

381109

- The number of unique IDs (381109) equals the number of rows, so there are no duplicate IDs.
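
The same check can be made explicit, which also shows the offending rows when a duplicate exists. This is a minimal sketch on a hypothetical toy frame, not the notebook's data:

```python
import pandas as pd

# Hypothetical toy frame with one repeated id.
toy = pd.DataFrame({'id': [1, 2, 2, 3], 'age': [25, 40, 40, 31]})

# True only when every id appears exactly once.
ids_are_unique = toy['id'].is_unique

# keep=False marks every occurrence of a repeated id, not just the later ones.
duplicated_rows = toy[toy['id'].duplicated(keep=False)]

print(ids_are_unique)
print(duplicated_rows)
```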

## 1.6 Checking Values In Features

In [9]:
df1.nunique().sort_values()

gender                       2
driving_license              2
vehicle_damage               2
previously_insured           2
response                     2
vehicle_age                  3
region_code                 53
age                         66
policy_sales_channel       155
vintage                    290
annual_premium           48838
id                      381109
dtype: int64

- 6 of the 12 columns (50%) have only 2 or 3 unique values.
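
A short sketch of how the low-cardinality columns could be listed programmatically. The toy frame and the `max_levels` threshold are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy frame mixing low- and higher-cardinality columns.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'vehicle_age': ['< 1 Year', '1-2 Year', '> 2 Years', '1-2 Year'],
    'age': [22, 35, 47, 58],
    'response': [0, 1, 0, 0],
})

max_levels = 3  # assumed cut-off for "low cardinality"

n_unique = toy.nunique()
low_cardinality = n_unique[n_unique <= max_levels].index.tolist()
share = len(low_cardinality) / toy.shape[1]

print(low_cardinality)
print(share)
```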

## 1.7 Checking Percentage of Zeros

In [17]:
features = df1.columns

percentage_zeros = df1[features].apply(lambda x: (x == 0).mean() * 100)
percentage_zeros_df = percentage_zeros.reset_index()
percentage_zeros_df.columns = ['features', 'porcentagem_zeros']
percentage_zeros_df.sort_values('porcentagem_zeros', ascending=False)

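|    | features           | porcentagem_zeros |
|----|--------------------|-------------------|
| 11 | response           | 87.743664         |
| 5  | previously_insured | 54.178988         |
| 4  | region_code        | 0.530294          |
| 3  | driving_license    | 0.213062          |
| 2  | age                | 0.0               |
| 1  | gender             | 0.0               |
| 0  | id                 | 0.0               |
| 6  | vehicle_age        | 0.0               |
| 7  | vehicle_damage     | 0.0               |
| 8  | annual_premium     | 0.0               |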


- This check looks for features dominated by zeros, since that would indicate low variability, which is generally bad for modeling.
- `response` is 87.7% zeros, so only about 12.3% of customers showed interest and the target is imbalanced.
- The features 'gender', 'vehicle_damage', and 'vehicle_age' show no zeros because their values are stored as text rather than numbers.

In [18]:
df1.vehicle_age.value_counts()

vehicle_age
1-2 Year     200316
< 1 Year     164786
> 2 Years     16007
Name: count, dtype: int64

- It will be necessary to encode some variables such as 'vehicle_damage', 'vehicle_age', and 'gender'.
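
One possible encoding is sketched below on toy data. The `vehicle_age` labels are the ones shown in the output above; the 'Yes'/'No' and 'Male'/'Female' labels are assumptions about the raw values, and the ordinal mapping for `vehicle_age` is just one choice (one-hot or target encoding are alternatives).

```python
import pandas as pd

# Hypothetical toy rows; 'Yes'/'No' and 'Male'/'Female' are assumed labels.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'vehicle_age': ['< 1 Year', '1-2 Year', '> 2 Years'],
    'vehicle_damage': ['Yes', 'No', 'Yes'],
})

# Ordinal mapping that preserves the natural order of vehicle age.
vehicle_age_order = {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}

encoded = toy.assign(
    vehicle_age=toy['vehicle_age'].map(vehicle_age_order),
    vehicle_damage=toy['vehicle_damage'].map({'No': 0, 'Yes': 1}),
    gender=toy['gender'].map({'Female': 0, 'Male': 1}),
)

# Any label missing from a mapping becomes NaN, so check before modeling.
print(encoded)
print(encoded.isna().sum().sum())
```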

## 1.8 Descriptive Statistics

In [12]:
num_attributes = df1.select_dtypes( include=['int64', 'float64'])
cat_attributes = df1.select_dtypes( exclude=['int64', 'float64', 'datetime64[ns]'])

In [19]:
# Central Tendency - mean, median
ct1 = pd.DataFrame(num_attributes.apply( np.mean ) ).T
ct2 = pd.DataFrame( num_attributes.apply( np.median ) ).T


# Dispersion - std, min, max, range, skew, kurtosis
d1 = pd.DataFrame( num_attributes.apply( np.std ) ).T
d2 = pd.DataFrame( num_attributes.apply( min ) ).T
d3 = pd.DataFrame( num_attributes.apply( max )  ).T
d4 = pd.DataFrame( num_attributes.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( num_attributes.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( num_attributes.apply( lambda x: x.kurtosis() ) ).T

# concatenate
metrics = pd.concat( [d2, d3, d4, ct1, ct2, d1, d5, d6]).T.reset_index()
metrics.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
metrics

|   | attributes           | min    | max      | range    | mean         | median   | std          | skew         | kurtosis   |
|---|----------------------|--------|----------|----------|--------------|----------|--------------|--------------|------------|
| 0 | id                   | 1.0    | 381109.0 | 381108.0 | 190555.0     | 190555.0 | 110016.69187 | 9.443274e-16 | -1.2       |
| 1 | age                  | 20.0   | 85.0     | 65.0     | 38.822584    | 36.0     | 15.511591    | 0.672539     | -0.565655  |
| 2 | driving_license      | 0.0    | 1.0      | 1.0      | 0.997869     | 1.0      | 0.046109     | -21.59518    | 464.354302 |
| 3 | region_code          | 0.0    | 52.0     | 52.0     | 26.388807    | 28.0     | 13.229871    | -0.1152664   | -0.867857  |
| 4 | previously_insured   | 0.0    | 1.0      | 1.0      | 0.45821      | 0.0      | 0.498251     | 0.1677471    | -1.971871  |
| 5 | annual_premium       | 2630.0 | 540165.0 | 537535.0 | 30564.389581 | 31669.0  | 17213.132474 | 1.766087     | 34.004569  |
| 6 | policy_sales_channel | 1.0    | 163.0    | 162.0    | 112.034295   | 133.0    | 54.203924    | -0.9000081   | -0.97081   |
| 7 | vintage              | 10.0   | 299.0    | 289.0    | 154.347397   | 154.0    | 83.671194    | 0.003029517  | -1.200688  |
| 8 | response             | 0.0    | 1.0      | 1.0      | 0.122563     | 0.0      | 0.327935     | 2.301906     | 3.298788   |
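
A more compact way to build a similar summary is `DataFrame.agg` with a list of statistic names, sketched here on hypothetical toy data. One difference to keep in mind: pandas' `std` defaults to the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`), so the two can differ slightly.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric attributes.
toy = pd.DataFrame({
    'age': [20, 30, 40, 50, 60],
    'annual_premium': [2630.0, 30000.0, 31000.0, 32000.0, 90000.0],
})

# One row per attribute, one column per statistic.
summary = toy.agg(['min', 'max', 'mean', 'median', 'std', 'skew', 'kurt']).T
summary['range'] = summary['max'] - summary['min']

# Population standard deviation (ddof=0), for comparison with 'std' above.
population_std = toy.std(ddof=0)

print(summary)
print(population_std)
```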
