# Introduction

Our task is to predict whether an individual is interested in buying a vehicle insurance policy.

We are given the following variables:
* **id**: Unique ID for the customer
* **Gender**: Gender of the customer
* **Age**: Age of the customer
* **Driving_License**: 0 : Customer does not have DL, 1 : Customer already has DL
* **Region_Code**: Unique code for the region of the customer
* **Previously_Insured**: 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance
* **Vehicle_Age**: Age of the Vehicle
* **Vehicle_Damage**: 1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn't get his/her vehicle damaged in the past.
* **Annual_Premium**: The amount customer needs to pay as premium in the year
* **Policy_Sales_Channel**: Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
* **Vintage**: Number of Days, Customer has been associated with the company
* **Response**: 1 : Customer is interested, 0 : Customer is not interested

Of course, making good predictions does not necessarily mean we are going to **make good decisions**. There are many factors underlying the insurance industry. I hope that the analysis below can shed some light on this topic.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')


train_df= pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
test_df = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
test_df.info()

**Checking for null values**

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

# Exploratory Data Analysis

**Response**

The majority of customers are not interested in vehicle insurance.

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Response', data= train_df)
plt.show()

**Gender**

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Gender', data= train_df)
plt.show()

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df.groupby('Gender')['Response'].sum() / tmp_df.groupby('Gender')['Response'].count()

plt.figure(figsize=(8,6))
sns.pointplot(x='Gender', y='Response', data=tmp_df.reset_index())
plt.show()

Males are somewhat more likely to be interested. There is no obvious intuitive reason for this; the gap may be due to pure randomness or to other factors that are correlated with gender.
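
One way to check whether a gap like the one between genders could be pure randomness is a two-proportion z-test. The following is a minimal sketch using only the standard library. The helper name `two_proportion_z` and the counts are hypothetical placeholders, not figures taken from this dataset.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Placeholder counts (NOT taken from the dataset): interested / total.
z, p_value = two_proportion_z(140, 1000, 100, 1000)
print('z = {:.2f}, p-value = {:.4f}'.format(z, p_value))
```

A small p-value suggests the gap is unlikely to be noise alone, but it says nothing about which underlying factor drives it.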

**Age**

In [None]:
plt.figure(figsize=(8,6))
sns.distplot(train_df['Age'])
plt.show()

print('The maximum age is {}'.format(train_df['Age'].max()))
print('The minimum age is {}'.format(train_df['Age'].min()))

The maximum age in the dataset is 85, which makes me wonder whether customers of that age still drive at all.

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df.groupby('Age')['Response'].sum() / tmp_df.groupby('Age')['Response'].count()

# jointplot creates its own figure, so its size is set via `height`
sns.jointplot(x='Age', y='Response', data=tmp_df.reset_index(), height=6)
plt.show()

Unsurprisingly, those who are very old are not interested in buying vehicle insurance.

Those who are very young are not interested either. This may be because they are more risk tolerant and underestimate the importance of insurance.

It makes sense that people in their middle age are more interested.

The relationship between age and response seems to be **quadratic**.
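
To make the quadratic claim concrete, one could fit a degree-2 polynomial to the response rate by age and check that the leading coefficient is negative. This is only a sketch: the `ages` and `rates` arrays below are synthetic placeholders, not values from this dataset.

```python
import numpy as np

# Synthetic response rates by age (placeholder values, NOT from the dataset):
# low for the young and the old, highest in middle age.
ages = np.arange(20, 86, 5)
rates = 0.20 - 0.0002 * (ages - 45) ** 2

# Fit rate ~ a * age^2 + b * age + c.
a, b, c = np.polyfit(ages, rates, deg=2)

# For a concave parabola (a < 0) the peak sits at -b / (2a).
peak_age = -b / (2 * a)
print('a = {:.5f}, peak age = {:.1f}'.format(a, peak_age))
```

With the real data, `ages` and `rates` would come from the grouped response rates computed above.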

**Whether they have a driving license**

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Driving_License', data=train_df)
plt.show()

Most people have a driving license

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df.groupby('Age')['Driving_License'].sum() / tmp_df.groupby('Age')['Driving_License'].count()

sns.jointplot(x='Age', y='Driving_License', data=tmp_df.reset_index(), height=6)
plt.show()

People under 50 almost always have a driving license and then the proportion starts to decrease when they get older.

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df.groupby('Driving_License')['Response'].sum() / tmp_df.groupby('Driving_License')['Response'].count()

plt.figure(figsize=(8,6))
sns.pointplot(x='Driving_License', y='Response', data=tmp_df.reset_index())
plt.show()

This is obvious. We should target those who have a driving license.

**Region**

In [None]:
print(set(train_df['Region_Code']))
print('Number of regions: {}'.format(len(set(train_df['Region_Code']))))

In [None]:
tmp_df = train_df.copy()
tmp_df_count = tmp_df.groupby('Region_Code')['Response'].count()
print('Number of regions (sample size >= 30): {}'.format(len(tmp_df_count[tmp_df_count >= 30])))

tmp_df = tmp_df.groupby('Region_Code')['Response'].sum() / tmp_df.groupby('Region_Code')['Response'].count()
tmp_df = tmp_df.reset_index()
tmp_df['Region_Code'] = tmp_df['Region_Code'].apply(lambda x: str(int(x)))

plt.figure(figsize=(20, 6))
sns.pointplot(x='Region_Code', y='Response', data=tmp_df)
plt.show()

Region may be correlated with response:
* Some regions may be busier and thus have a higher demand for vehicle insurance.
* However, there are too many regions to interpret individually. More information about each region from the insurance company would make this more insightful.
* We can incorporate regions in our analysis given that all regions have a sample size >= 30.

**Whether they have vehicle insurance already**

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Previously_Insured', data=train_df)
plt.show()

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df[tmp_df['Driving_License'] == 1]
tmp_df = tmp_df.groupby('Age')['Previously_Insured'].sum() / tmp_df.groupby('Age')['Previously_Insured'].count()

sns.jointplot(x='Age', y='Previously_Insured', data=tmp_df.reset_index(), height=6)
plt.show()

Those between the ages of 40 and 50 are the least likely to have a vehicle insurance policy already.

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df[tmp_df['Driving_License'] == 1]
tmp_df = tmp_df.groupby('Previously_Insured')['Response'].sum() / tmp_df.groupby('Previously_Insured')['Response'].count()

plt.figure(figsize=(8,6))
sns.pointplot(x='Previously_Insured', y='Response', data=tmp_df.reset_index())
plt.show()

This is self-explanatory, but we should be careful with those who already have vehicle insurance: why would they want another vehicle insurance policy? Does it indicate **adverse selection**?

**Adverse selection** means that individuals with higher-than-average risks are more willing to buy insurance, especially insurance with a high limit. This is not **actuarially equitable** to those with average risks if everyone is charged the same premium. Underwriters have to pay attention to this issue when already-insured customers indicate that they want to buy another policy.
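
A small worked example may help to show why a flat premium is not actuarially equitable. All figures below are made-up placeholders, and `pooled_premium` is a hypothetical helper that ignores expenses and profit loading.

```python
# Placeholder figures (NOT from the dataset): expected annual claim cost per
# policyholder and the share of each risk group in the insured pool.
expected_claim = {'average_risk': 300.0, 'high_risk': 900.0}
share = {'average_risk': 0.8, 'high_risk': 0.2}

def pooled_premium(expected_claim, share):
    """Break-even premium when everyone is charged the same amount."""
    return sum(expected_claim[g] * share[g] for g in expected_claim)

flat = pooled_premium(expected_claim, share)
print('Flat premium: {:.0f}'.format(flat))
for group, cost in expected_claim.items():
    # Positive means the group pays more than its own expected cost.
    print('{} overpays by {:.0f}'.format(group, flat - cost))

# If high risks are keener to buy, their share grows and so does the premium.
shifted = pooled_premium(expected_claim, {'average_risk': 0.5, 'high_risk': 0.5})
print('Flat premium after the mix shifts: {:.0f}'.format(shifted))
```

In this toy setting the average risks subsidise the high risks, and the subsidy grows as more high risks join the pool.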

**Age of the vehicle**

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Vehicle_Age', data=train_df)
plt.show()

Although the age of the vehicle is numerical in nature, it is given as categorical input.

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df.groupby('Vehicle_Age')['Response'].sum() / tmp_df.groupby('Vehicle_Age')['Response'].count()

plt.figure(figsize=(8,6))
sns.pointplot(x='Vehicle_Age', y='Response', data=tmp_df.reset_index())
plt.show()

It is not clear in which direction vehicle age drives their decision. Perhaps more insight could be extracted if the vehicle age were given as a number.

We may consider these questions:
* Will new car owners be more interested in buying vehicle insurance because they are more vulnerable to any damage to their vehicle?
* Will old car owners be more interested in buying vehicle insurance because they expect a higher probability of accidents?

**Whether they have got their vehicle damaged before**

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Vehicle_Damage', data=train_df)
plt.show()

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df.groupby('Vehicle_Damage')['Response'].sum() / tmp_df.groupby('Vehicle_Damage')['Response'].count()

plt.figure(figsize=(8,6))
sns.pointplot(x='Vehicle_Damage', y='Response', data=tmp_df.reset_index())
plt.show()

This is in fact an extremely important factor for the insurance company. If individuals have had their vehicle damaged before, it usually indicates that they carry higher-than-average risk, which points to **adverse selection**.

The insurance company may charge them a higher premium. While our "task" is to predict who will be interested in buying an insurance policy, the insurance company ought to consider whether it wants to attract higher risks.

**Annual Premium**

In [None]:
plt.figure(figsize=(8,6))
sns.distplot(train_df['Annual_Premium'])
plt.show()

print('The mean annual premium is {:.2f}'.format(train_df['Annual_Premium'].mean()))
print('The median annual premium is {:.2f}'.format(train_df['Annual_Premium'].median()))
print('The maximum annual premium is {:.2f}'.format(train_df['Annual_Premium'].max()))
print('The minimum annual premium is {:.2f}'.format(train_df['Annual_Premium'].min()))
ninety_nineth_percentile = train_df['Annual_Premium'].quantile(0.99)
print('99% of people\'s annual premium is less than {:.2f}'.format(ninety_nineth_percentile))

This is quite positively skewed.

In [None]:
sns.jointplot(x='Age', y='Annual_Premium', data=train_df, height=6)
plt.show()

Normally we would expect premium to be positively correlated with age because older people are charged higher premium.

In [None]:
train_df[['Age', 'Annual_Premium']].corr()

They are still positively correlated but very weakly. This is perhaps because there is great differentiation between health insurance policies. There are policies with different limits and benefits so they are charged differently.

In [None]:
train_df[['Annual_Premium', 'Response']].corr()

There is a weak positive correlation between annual premium and response. This is reasonable because those who can afford an expensive health insurance policy are probably richer.

In [None]:
tmp_df = train_df.copy()
tmp_df['Annual_Premium'] = tmp_df['Annual_Premium'].apply(lambda x: round(x, -3))
tmp_df = tmp_df.groupby('Annual_Premium')['Response'].sum() / tmp_df.groupby('Annual_Premium')['Response'].count()

sns.jointplot(x='Annual_Premium', y='Response', data=tmp_df.reset_index(), height=6)
plt.show()

In [None]:
tmp_df = train_df.copy()
tmp_df = tmp_df[tmp_df['Annual_Premium'] < ninety_nineth_percentile]
tmp_df['Annual_Premium'] = tmp_df['Annual_Premium'].apply(lambda x: round(x, -3))
tmp_df = tmp_df.groupby('Annual_Premium')['Response'].sum() / tmp_df.groupby('Annual_Premium')['Response'].count()

sns.jointplot(x='Annual_Premium', y='Response', data=tmp_df.reset_index(), height=6)
plt.show()

Once the outliers are excluded, there is a clear positive correlation; it was merely distorted by them. The cut-off used here is the 99th percentile computed above (72963).

**Policy Sales Channel**

In [None]:
print(set(train_df['Policy_Sales_Channel']))
print('Number of policy sales channels: {}'.format(len(set(train_df['Policy_Sales_Channel']))))

In [None]:
tmp_df = train_df.copy()
tmp_df_count = tmp_df.groupby('Policy_Sales_Channel')['Response'].count()
print('Number of policy sales channels (sample size >= 30): {}'.format(len(tmp_df_count[tmp_df_count >= 30])))

tmp_df = tmp_df.merge(
    tmp_df_count.rename('Policy_Sales_Channel_Count'),
    how='left',
    left_on='Policy_Sales_Channel',
    right_on='Policy_Sales_Channel'
    )
tmp_df = tmp_df[tmp_df['Policy_Sales_Channel_Count'] >= 30]

tmp_df = tmp_df.groupby('Policy_Sales_Channel')['Response'].sum() / tmp_df.groupby('Policy_Sales_Channel')['Response'].count()
tmp_df = tmp_df.reset_index()

plt.figure(figsize=(25, 6))
sns.pointplot(x='Policy_Sales_Channel', y='Response', data=tmp_df)
plt.xticks(rotation='vertical')
plt.show()

There is a correlation between policy sales channel and response. However, because there are so many channels, the sample size of some of them is small. For example, channel 123 has a response rate of 100% but its count is only 1, which does not mean anything significant. This is why channels with fewer than 30 samples are excluded from the plot above.
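
To see why a 100% response rate on a single observation carries little information, one could look at a confidence interval for the rate. The sketch below uses the Wilson score interval with only the standard library. `wilson_interval` is a hypothetical helper, and the second set of counts is a made-up placeholder.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# One interested customer out of one, as in the channel mentioned above.
print(wilson_interval(1, 1))
# Placeholder counts (NOT from the dataset) for a well-populated channel.
print(wilson_interval(3000, 10000))
```

The interval for a single observation spans most of the range from 0 to 1, whereas a well-populated channel gives a narrow interval.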

**Vintage**

The number of days that they have been associated with the company

In [None]:
plt.figure(figsize=(8,6))
sns.distplot(train_df['Vintage'])
plt.show()

In [None]:
tmp_df = train_df.copy()
tmp_df['Vintage'] = tmp_df['Vintage'].apply(lambda x: round(x, -1))
tmp_df = tmp_df.groupby('Vintage')['Response'].sum() / tmp_df.groupby('Vintage')['Response'].count()

sns.jointplot(x='Vintage', y='Response', data=tmp_df.reset_index(), height=6)
plt.show()

I am not surprised that there is no clear correlation because the vintage is too low (< 300 days). I would expect loyal customers to be more willing to buy a policy from this insurance company (whether they already have one or not). However, these seem to be new customers.

# Preprocessing

**Categorical input**
* As noted above, there are 53 region codes and 153 policy sales channels. Transforming them into dummy variables would result in a lot of columns.
* As this notebook uses CatBoost, there is no need to transform the categorical input into dummy variables. CatBoost provides us with a convenient way to model the data. But be aware that float codes have to be converted to int first.
* **Policy_Sales_Channel**: I'd like to encode the channels whose sample sizes are below 30 as *9999*, instead of treating each of them as significant in its own right.
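
The rare-channel encoding described above can be sketched on toy data as follows. `group_rare_categories` is a hypothetical helper, not the function used later in this notebook, and the series are made up for illustration.

```python
import pandas as pd

def group_rare_categories(train_col, other_col, min_count=30, rare_code=9999):
    """Replace categories seen fewer than min_count times in train_col."""
    counts = train_col.value_counts()
    frequent = set(counts[counts >= min_count].index)

    def recode(col):
        # Categories that are rare, or unseen in training, collapse to rare_code.
        return col.where(col.isin(frequent), rare_code).astype(int)

    return recode(train_col), recode(other_col)

# Toy data (NOT from the dataset): channel 26 is frequent, 123 appears once.
train_channel = pd.Series([26.0] * 40 + [123.0])
test_channel = pd.Series([26.0, 123.0, 7.0])
train_out, test_out = group_rare_categories(train_channel, test_channel)
print(test_out.tolist())
```

Note that the counts come from the training column only, so a channel that is unseen in training also falls into the rare bucket.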


**Numerical input**
* Here we apply **min-max normalisation**. Another common way of preprocessing numerical input is **standardisation**, which subtracts the mean from each datum and divides the result by the standard deviation. **Min-max normalisation** puts the input in a confined range, i.e. from 0 to 1, whereas **standardisation** results in variables that range from negative to positive.
* For our preprocessing purposes, we use the min and max of the **training** dataset only. This is because the model is built on the training dataset and then used to make predictions on the testing dataset.
* **Age**: I add a degree-2 term (**Age_Squared**), given the quadratic relationship discussed above.
* **Premium**: I'd like to split the premium into the portion up to the 99th percentile and the portion above it.
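
As a minimal sketch of the two numerical steps above, the helpers below learn the scaling range from the training column only and split a column at a cap. The names `min_max_fit`, `min_max_apply` and `split_at_cap` are hypothetical, and the values are toy numbers rather than data from this notebook.

```python
import pandas as pd

def min_max_fit(train_col):
    """Learn the scaling range from the training column only."""
    return train_col.min(), train_col.max()

def min_max_apply(col, lo, hi):
    """Scale with a previously learned range; test values may leave [0, 1]."""
    return (col - lo) / (hi - lo)

def split_at_cap(col, cap):
    """Split a column into the part up to the cap and the excess above it."""
    return col.clip(upper=cap), (col - cap).clip(lower=0)

# Toy values (NOT from the dataset).
train_premium = pd.Series([2000.0, 30000.0, 40000.0, 500000.0])
test_premium = pd.Series([1000.0, 40000.0])

lo, hi = min_max_fit(train_premium)
print(min_max_apply(test_premium, lo, hi).tolist())

cap = train_premium.quantile(0.99)
below, above = split_at_cap(train_premium, cap)
print(below.tolist(), above.tolist())
```

Because the range comes from the training data, a test value below the training minimum scales to a negative number. That is expected and avoids leaking test information into the preprocessing.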

**Notes for cross-validation**
* When we build our model below, we are going to perform cross-validation.
* As a result, we need to perform the preprocessing upon the training dataset of each subsample. Notably, the sample size of each distinct categorical value (as in **Policy_Sales_Channel**) and the **min** and **max** of each numerical input (we need to perform the **min-max normalisation**) would be different in different subsamples.
* Thus, instead of preprocessing the whole dataset at once, we are going to create a function that can be reused below.

In [None]:
GENDER_MAPPING = {'Female': 0, 'Male': 1}
VEHICLE_AGE_MAPPING = {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}
VEHICLE_DAMAGE_MAPPING = {'No': 0, 'Yes': 1}

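def preprocessing(train_df, test_df):
    # work on copies so that the caller's dataframes are left untouched
    train_df = train_df.copy()
    test_df = test_df.copy()

    # categorical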
    train_df['Gender'] = train_df['Gender'].map(GENDER_MAPPING).astype(int)
    test_df['Gender'] = test_df['Gender'].map(GENDER_MAPPING).astype(int)
    train_df['Vehicle_Age'] = train_df['Vehicle_Age'].map(VEHICLE_AGE_MAPPING).astype(int)
    test_df['Vehicle_Age'] = test_df['Vehicle_Age'].map(VEHICLE_AGE_MAPPING).astype(int)
    train_df['Vehicle_Damage'] = train_df['Vehicle_Damage'].map(VEHICLE_DAMAGE_MAPPING).astype(int)
    test_df['Vehicle_Damage'] = test_df['Vehicle_Damage'].map(VEHICLE_DAMAGE_MAPPING).astype(int)

    train_df['Region_Code'] = train_df['Region_Code'].astype(int)
    test_df['Region_Code'] = test_df['Region_Code'].astype(int)

    # channels with fewer than 30 training samples (or unseen in training) are grouped under 9999
    tmp_df_count = train_df.groupby('Policy_Sales_Channel')['Response'].count()
    tmp_count_dict = tmp_df_count.to_dict()
    for index, val in tmp_count_dict.items():
        tmp_count_dict[index] = index if val >= 30 else 9999
    # use the same dtype (int) for both sets so that the category values match
    train_df['Policy_Sales_Channel'] = train_df['Policy_Sales_Channel'].map(tmp_count_dict).astype(int)
    test_df['Policy_Sales_Channel'] = test_df['Policy_Sales_Channel'].apply(
        lambda x: tmp_count_dict[x] if x in tmp_count_dict else 9999
    ).astype(int)
    
    # numerical
    train_df['Age_Squared'] = train_df['Age'].apply(lambda x: x ** 2)
    test_df['Age_Squared'] = test_df['Age'].apply(lambda x: x ** 2)
    min_age_squared = train_df['Age_Squared'].min()
    max_age_squared = train_df['Age_Squared'].max()
    train_df['Age_Squared'] = train_df['Age_Squared'].apply(lambda x: (x - min_age_squared) / (max_age_squared - min_age_squared))
    test_df['Age_Squared'] = test_df['Age_Squared'].apply(lambda x: (x - min_age_squared) / (max_age_squared - min_age_squared))

    min_age = train_df['Age'].min()
    max_age = train_df['Age'].max()
    train_df['Age'] = train_df['Age'].apply(lambda x: (x - min_age) / (max_age - min_age))
    test_df['Age'] = test_df['Age'].apply(lambda x: (x - min_age) / (max_age - min_age))

    min_vintage = train_df['Vintage'].min()
    max_vintage = train_df['Vintage'].max()
    train_df['Vintage'] = train_df['Vintage'].apply(lambda x: (x - min_vintage) / (max_vintage - min_vintage))
    test_df['Vintage'] = test_df['Vintage'].apply(lambda x: (x - min_vintage) / (max_vintage - min_vintage))
    
    min_annual_premium = train_df['Annual_Premium'].min()
    ninety_nineth_percentile = train_df['Annual_Premium'].quantile(0.99)
    max_annual_premium = train_df['Annual_Premium'].max()

    train_df['Premium_Below_Ninety_Nineth_Percentile'] = train_df['Annual_Premium'].apply(lambda x: (min(x, ninety_nineth_percentile) - min_annual_premium) / (ninety_nineth_percentile - min_annual_premium))
    test_df['Premium_Below_Ninety_Nineth_Percentile'] = test_df['Annual_Premium'].apply(lambda x: (min(x, ninety_nineth_percentile) - min_annual_premium) / (ninety_nineth_percentile - min_annual_premium))
    train_df['Premium_Above_Ninety_Nineth_Percentile'] = train_df['Annual_Premium'].apply(lambda x: max(0, x - ninety_nineth_percentile) / (max_annual_premium - ninety_nineth_percentile))
    test_df['Premium_Above_Ninety_Nineth_Percentile'] = test_df['Annual_Premium'].apply(lambda x: max(0, x - ninety_nineth_percentile) / (max_annual_premium - ninety_nineth_percentile))

    return train_df, test_df

# CatBoost Modelling (with Cross Validation and Backward Stepwise Selection)

In the following, we are going to build a model using the library **CatBoost**.

**Cross-validation**
* When training the model, we are going to perform a 5-fold cross-validation.
* The training dataset in each subsample will be preprocessed with the function defined above.
* We are going to evaluate the performance of the model using the validation dataset in each subsample by the **ROC AUC score**.
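
For intuition, the ROC AUC score can be read as the probability that a randomly chosen interested customer receives a higher predicted score than a randomly chosen uninterested one. The following pure-Python sketch computes it that way on toy values. It is illustrative only; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy labels and predicted probabilities (NOT from the dataset).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric suits an imbalanced target like this one.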

**Backward stepwise selection**
* In the first round of model training, we are going to include all features in our dataset.
* In subsequent rounds, the least important features are going to be dropped one by one.
* The final selected model will be based on **ROC AUC score**.
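
The selection procedure above can be sketched independently of any particular model. In the sketch below, `evaluate` is a hypothetical callback standing in for the cross-validated CatBoost fit, and the toy evaluator simply returns fixed numbers.

```python
def backward_elimination(features, evaluate):
    """Drop the least important feature one at a time.

    `evaluate(features)` is a hypothetical callback that returns
    (score, importances), with one importance value per feature.
    """
    features = list(features)
    history = []
    while features:
        score, importances = evaluate(features)
        history.append({'features': list(features), 'score': score})
        if len(features) == 1:
            break
        # Remove the feature with the smallest importance.
        weakest = min(range(len(features)), key=lambda i: importances[i])
        del features[weakest]
    return history

# Toy evaluator (NOT a real model): fixed importances, score = their sum.
TOY_IMPORTANCE = {'a': 0.5, 'b': 0.3, 'c': 0.01}

def toy_evaluate(features):
    importances = [TOY_IMPORTANCE[f] for f in features]
    return sum(importances), importances

history = backward_elimination(['a', 'b', 'c'], toy_evaluate)
for step in history:
    print(step)
```

Each entry of `history` records the feature set and score of one round, so the final choice can be made after all rounds have run.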

Here are all our numerical and categorical columns respectively.

In [None]:
numerical_cols = ['Age_Squared', 'Age', 'Premium_Below_Ninety_Nineth_Percentile', 'Premium_Above_Ninety_Nineth_Percentile', 'Vintage']
categorical_cols = ['Gender', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Policy_Sales_Channel']

In fact I'd like to drop Vintage because there is no sensible correlation (the number of days is too low).

In [None]:
numerical_cols.remove('Vintage')

In [None]:
X_col = numerical_cols + categorical_cols
print(X_col)

Let's split the dataset and perform preprocessing.

In [None]:
training_sets = []
validation_sets = []

K = 5
kf = KFold(n_splits=K, shuffle=True)

for train_index, test_index in kf.split(train_df):
    print(train_index, test_index)
    # kf.split yields positional indices, so select by position and take copies
    training_set = train_df.iloc[train_index].copy()
    validation_set = train_df.iloc[test_index].copy()

    training_set, validation_set = preprocessing(training_set, validation_set)

    training_sets.append(training_set)
    validation_sets.append(validation_set)

In [None]:
results = []

for i in range(len(X_col)):
    result = {}
    print('Features included in this iteration: {}'.format(X_col))
    result['features'] = X_col

    model_scores = []
    feature_scores = [0] * len(X_col)
    
    fold = 1
    for training_set, validation_set in zip(training_sets, validation_sets):
        print('Fold #{}'.format(fold))
        X_train = training_set[X_col]
        X_val = validation_set[X_col]
        y_train = training_set[['Response']]
        y_val = validation_set[['Response']]

        model = CatBoostClassifier()

        model = model.fit(
            X_train,
            y_train,
            cat_features=categorical_cols,
            eval_set=(X_val, y_val),
            early_stopping_rounds=10,
            verbose=False
        )

        # predicted probability of the positive class (Response == 1)
        y_pred = model.predict_proba(X_val)[:, 1]

        model_score = roc_auc_score(y_val, y_pred)
        print('ROC AUC score: {}'.format(model_score))
        model_scores.append(model_score)

        feature_importance = model.get_feature_importance()
        print('Feature importance: {}'.format(feature_importance))
        for j in range(len(X_col)):
            feature_scores[j] += feature_importance[j]

        fold += 1

    print('Overall:')
    print('ROC AUC score: {}'.format(np.mean(model_scores)))
    result['score'] = np.mean(model_scores)
    print('Feature importance: {}'.format(feature_scores))
    results.append(result)
    if len(X_col) > 1:
        least_importance_feature = X_col[feature_scores.index(min(feature_scores))]
        print('The least important feature is: {}'.format(least_importance_feature))
        print('Thus, for the next iteration, we are going to drop {}'.format(least_importance_feature))

        if least_importance_feature in numerical_cols:
            numerical_cols.remove(least_importance_feature)
        else:
            categorical_cols.remove(least_importance_feature)
        X_col = numerical_cols + categorical_cols

        print()
        print()

# Results

Here is a summary of the ROC AUC scores of all models above.

In [None]:
results_df = pd.DataFrame.from_records(results)
print(results_df)

All models have ROC AUC scores close to or above 80%, which indicates high predictive power.

We can observe a material difference between the 3-factor model (#8) and the models with fewer factors (#9-10). Adding more factors beyond three, however, makes no material difference (the ROC AUC scores differ by less than 1%). It is therefore justifiable to select the simplest model that retains high predictive power, i.e. #8.
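
The selection rule described above (the simplest model whose score is within 1% of the best) could be automated along these lines. `pick_simplest_within_tolerance` is a hypothetical helper and the scores are toy numbers, not the results of this notebook.

```python
def pick_simplest_within_tolerance(results, tolerance=0.01):
    """Return the result with the fewest features whose score is within
    `tolerance` of the best score."""
    best = max(r['score'] for r in results)
    candidates = [r for r in results if best - r['score'] <= tolerance]
    return min(candidates, key=lambda r: len(r['features']))

# Toy results (NOT the scores from this notebook).
toy_results = [
    {'features': ['a', 'b', 'c', 'd'], 'score': 0.858},
    {'features': ['a', 'b', 'c'], 'score': 0.855},
    {'features': ['a', 'b'], 'score': 0.820},
    {'features': ['a'], 'score': 0.780},
]
chosen = pick_simplest_within_tolerance(toy_results)
print(chosen)
```

The tolerance is a judgment call; a tighter value favours accuracy, a looser one favours simplicity.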

Note that this score has been calculated with cross-validation, so it indicates the predictive power of the model when applied to a new set of data. Features to be included are:

In [None]:
selected_features = results_df.iloc[8]['features']
print(selected_features)

# Final Model

Lastly, let's use the whole training dataset to train the model and apply it to the testing dataset.

In [None]:
training_set, test_set = preprocessing(train_df, test_df)

X_train = training_set[selected_features]
X_test = test_set[selected_features]
y_train = training_set['Response']

model = CatBoostClassifier()

model = model.fit(
    X_train,
    y_train,
    cat_features=['Previously_Insured', 'Vehicle_Damage'],
    early_stopping_rounds=10,
    verbose=100
)

CatBoost allows us to easily plot the importance of features.

In [None]:
feature_importance = model.get_feature_importance()
feature_importance_df = pd.DataFrame(
    data={'feature_importance': feature_importance},
    index=selected_features
)
feature_importance_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

plt.figure(figsize=(8, 6))
sns.barplot(x=feature_importance_df['feature_importance'], y=feature_importance_df.index)
plt.title('Feature Importance')
plt.show()

The resulting model has only 3 variables:
* whether they already have vehicle insurance
* their age squared
* whether they have got their vehicle damaged before

As discussed above, the contribution of **Vehicle_Damage** can imply adverse selection. Even if we correctly predict their willingness to buy, attracting them may not be desirable. The insurance company might charge them a higher premium or not target them at all.

We can also plot the tree.

In [None]:
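from catboost import Pool

# plot_tree takes a catboost Pool, which is needed when the model uses categorical features
model.plot_tree(
    tree_idx=0,
    pool=Pool(X_train, label=y_train, cat_features=['Previously_Insured', 'Vehicle_Damage'])
)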

# Output

In [None]:
y_pred_submit = [i[1] for i in model.predict_proba(X_test)]
# the identifier column in this dataset is `id`
submission_df = pd.DataFrame(data={'id': test_df['id'], 'Response': y_pred_submit})

In [None]:
submission_df.head()

In [None]:
submission_df.to_csv('submission.csv', index=False)

# Afterthought

While the model result is satisfactory, it does not necessarily translate to good decision making. As discussed over and over, adverse selection is an issue that needs to be tackled.

Besides, the insurance company should obtain more data in order to enable more insightful analysis:
* Investigate the relationship between age and willingness to buy insurance. Does risk tolerance play a role? What is the implication of the **quadratic** relationship?
* Develop a profile of the regions (we only have purely numerical identifiers). Perhaps people in busier regions are more willing to buy insurance.
* Obtain a more precise age of the vehicle.
* Obtain the type of health insurance policy that the individual is buying. This can reveal how rich they are.
* Policy sales channels can play a role (e.g. agent channels may be more effective due to interpersonal relationships). The number of channels in this dataset is too high. Perhaps they can be grouped.
* Vintage can play a role (e.g. we would expect loyal customers to be more willing to buy a policy). However, this dataset only contains new customers.