I'm going to try something on this dataset: downsizing (binning and grouping) some of the features, as you'll see in the notebook. The code is pretty straightforward, but if you have any questions, suggestions, or comments, please let me know in the comments.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px

In [None]:
warnings.filterwarnings('ignore')

In [None]:
train_df = pd.read_csv("../input/health-insurance-cross-sell-prediction/train.csv")
test_df  = pd.read_csv("../input/health-insurance-cross-sell-prediction/test.csv")

In [None]:
train_df = train_df.set_index('id')
test_df  = test_df.set_index('id')

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
train_df.isna().sum()

In [None]:
test_df.isna().sum()

Hmm. No missing values at all. To EDA we go!

#### But first, `Region_Code` and `Policy_Sales_Channel` are stored as floats, so let's convert them to strings

In [None]:
train_df['Region_Code'] = train_df['Region_Code'].astype(int).astype(str)
test_df['Region_Code'] = test_df['Region_Code'].astype(int).astype(str)

In [None]:
train_df['Policy_Sales_Channel'] = train_df['Policy_Sales_Channel'].astype(int).astype(str)
test_df['Policy_Sales_Channel'] = test_df['Policy_Sales_Channel'].astype(int).astype(str)
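Why go through `int` first? A small sketch on a toy Series (the values are made up, not taken from the dataset) shows the difference: casting a float straight to string keeps the trailing `.0`, which would give clumsy category labels.

```python
import pandas as pd

# Toy stand-in for a float-typed code column such as Region_Code
codes = pd.Series([28.0, 3.0, 11.0])

direct = codes.astype(str)               # float -> str keeps the ".0" suffix
via_int = codes.astype(int).astype(str)  # float -> int -> str gives clean labels

print(direct.tolist())
print(via_int.tolist())
```

Note that `astype(int)` would fail on missing values, so this only works because the columns have none.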

# EDA

### Feature distribution in Train vs Test

In [None]:
plt.rcParams['figure.figsize'] = 25, 3
sns.set_palette("deep")  # color_palette() only returns a palette; set_palette() applies it
for feature in ['Gender', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Policy_Sales_Channel']:
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True)
    # Pass the Series as x=; newer seaborn versions treat a positional argument as `data`
    ax0 = sns.countplot(x=train_df[feature].sort_values(), ax=ax[0])
    ax1 = sns.countplot(x=test_df[feature].sort_values(),  ax=ax[1])
    ax0.set_title(f'{feature} - Train');
    ax1.set_title(f'{feature} - Test');
    
    if feature == 'Region_Code':
        for tick in ax0.get_xticklabels():
            tick.set_rotation(90)
        for tick in ax1.get_xticklabels():
            tick.set_rotation(90)
        
    plt.show();
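The side-by-side plots can be backed up with numbers. Here is a minimal sketch that compares the share of each category in two frames. The toy frames and the helper name `compare_shares` are my own assumptions for illustration; on the real data you would pass `train_df` and `test_df`.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df / test_df
toy_train = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Male"]})
toy_test = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

def compare_shares(train, test, feature):
    """Return per-category shares in train and test plus their absolute gap."""
    shares = pd.concat(
        [
            train[feature].value_counts(normalize=True).rename("train"),
            test[feature].value_counts(normalize=True).rename("test"),
        ],
        axis=1,
    ).fillna(0.0)  # a category missing from one frame gets a share of 0
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares

shares = compare_shares(toy_train, toy_test, "Gender")
print(shares)
```

A small `abs_diff` for every category supports the "same distribution" reading of the plots.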

All the categorical features have similar distributions in Train and Test. Now for the continuous features:

In [None]:
plt.rcParams['figure.figsize'] = 25, 3
for feature in ['Age', 'Annual_Premium', 'Vintage']:
    fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True)
    # histplot(kde=True) is the replacement for the deprecated distplot
    ax0 = sns.histplot(x=train_df[feature], kde=True, stat='density', ax=ax[0][0])
    ax1 = sns.histplot(x=test_df[feature],  kde=True, stat='density', ax=ax[0][1])
    ax0.set_title(f'{feature} - Train');
    ax1.set_title(f'{feature} - Test');
    
    sns.boxplot(x=train_df[feature], ax=ax[1][0])
    sns.boxplot(x=test_df[feature],  ax=ax[1][1])
        
    plt.show();

Same distributions again, which is good!

### Let's figure out how they are related to the `Response` variable

For categorical features:

In [None]:
plt.rcParams['figure.figsize'] = 15, 8
sns.set_palette("deep")  # color_palette() only returns a palette; set_palette() applies it
for feature in ['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage']:
    fig, ax = plt.subplots(nrows=1, ncols=2)
    temp = train_df[[feature, 'Response']].groupby(feature)['Response'].apply(lambda x: x.sum()/x.count()).mul(100).rename('% (Response = 1)')
    
    train_df[feature].value_counts().plot.pie(ax=ax[1])
    temp.plot(kind='bar', ax=ax[0])
    
    plt.show();

Observations:

- People with no driver's license are not very interested (perhaps because insurers may ask for it during the process)
- People who have not been insured before are far more interested than people who already have insurance
- As the vehicle gets older, people tend to take insurance more
- If a vehicle has been damaged before, people tend to respond. Maybe they've had a bad experience where their vehicle was damaged and even the smallest part cost a lot
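The percentages behind these observations come from `x.sum()/x.count()` per group. Since `Response` is 0/1, that is just the group mean, so the shorter `.mean()` gives the same numbers. A quick check on a made-up toy frame (not the real data):

```python
import pandas as pd

# Toy frame with a binary target, standing in for train_df
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "Yes", "No", "No", "Yes"],
    "Response": [1, 0, 0, 0, 1],
})

# Long form used in the notebook: positives divided by group size
rate_long = toy.groupby("Vehicle_Damage")["Response"].apply(lambda x: x.sum() / x.count()).mul(100)

# Equivalent short form: the mean of a 0/1 column is the share of ones
rate_mean = toy.groupby("Vehicle_Damage")["Response"].mean().mul(100)

print(rate_mean)
```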

But what about continuous features?

In [None]:
train_dfc = train_df.copy()

In [None]:
plt.rcParams['figure.figsize'] = 7, 7
for feature in ['Age', 'Annual_Premium', 'Vintage']:    
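    # histplot(kde=True) is the replacement for the deprecated distplot
    ax = sns.histplot(x=train_dfc[train_dfc['Response'] == 1][feature], kde=True, stat='density', label='Response = 1')
    sns.histplot(x=train_dfc[train_dfc['Response'] == 0][feature], kde=True, stat='density', ax=ax, label='Response = 0')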
    
    plt.title(f'{feature} - Response == 0 vs Response == 1')
    plt.legend();
    plt.show();

Interestingly, the `Age` distribution for `Response` = 0 is very different from the one for `Response` = 1. Also, a large share of the interested customers seems to be between 40 and 50. What % exactly? Let's dive in!

# Feature Engineering

In [None]:
Q = 10
train_dfc['AgeGroups'] = pd.qcut(train_dfc['Age'], q=Q)

Why 10? No particular reason, could have used 11, 15, etc. as well.
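For context, `pd.qcut` picks bin edges from quantiles, so each bin holds roughly the same number of rows, while `pd.cut` with an integer splits the value range into equal-width bins. A small sketch on skewed toy values (squares of 1 to 40, chosen only for illustration) makes the difference visible:

```python
import pandas as pd

# 40 distinct, right-skewed toy values
values = pd.Series([i ** 2 for i in range(1, 41)])

# Quantile-based bins: equal number of rows per bin
qcut_counts = pd.qcut(values, q=4).value_counts().sort_index()

# Equal-width bins: the crowded low end lands in one bin
cut_counts = pd.cut(values, bins=4).value_counts().sort_index()

print(qcut_counts)
print(cut_counts)
```

When many rows share the same value, quantile edges can coincide, which is what `duplicates='drop'` handles in the later `qcut` calls.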

In [None]:
train_dfc[['AgeGroups', 'Response']].groupby('AgeGroups')['Response'].apply(lambda x: x.sum()/x.count() * 100).rename('% (Response == 1)').to_frame()

It appears that the young population up to age 29 doesn't respond much. My guess is that they have a HUGE number of insurance options (that they need to research) and, since they have the energy and time, they don't respond. As people enter their thirties, maybe because of their responsibilities towards family and to avoid further research on their own, they respond much, much more. As people get older, they may have their own trusted contacts who can get them their insurance. And as expected, there are a lot of interested people in the 40-50 age bucket.

Let's bin ages into four groups: up to 29, 30 to 35, 36 to 50, and over 50

In [None]:
train_dfc['self_defined_agegroups'] = pd.cut(train_dfc['Age'], bins=[0, 29, 35, 50, 100])

In [None]:
train_dfc[['self_defined_agegroups', 'Response']].groupby('self_defined_agegroups')['Response'].apply(lambda x: x.sum()/x.count() * 100).rename('% (Response == 1)').to_frame()

In [None]:
train_dfc['VehicleAgeDe'] = train_dfc['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})

In [None]:
Q = 10
train_dfc['Annual_Premium_Groups'] = pd.qcut(train_dfc['Annual_Premium'], q=Q, duplicates='drop')

In [None]:
train_dfc[['Annual_Premium_Groups', 'Response']].groupby('Annual_Premium_Groups')['Response'].apply(lambda x: x.sum()/x.count() * 100).rename('% (Response == 1)').to_frame()

Response rate is increasing, but very slowly. Let's define our own bins based on this.

In [None]:
train_dfc['Annual_Premium_self'] = pd.cut(train_dfc['Annual_Premium'], bins=[0, 30000, 35000, 37500, 41700, 48400, np.inf])

In [None]:
train_dfc[['Annual_Premium_self', 'Response']].groupby('Annual_Premium_self')['Response'].apply(lambda x: x.sum()/x.count() * 100).rename('% (Response == 1)').to_frame()

In [None]:
Q = 10
train_dfc['Vintage_Groups'] = pd.qcut(train_dfc['Vintage'], q=Q, duplicates='drop')

In [None]:
train_dfc[['Vintage_Groups', 'Response']].groupby('Vintage_Groups')['Response'].apply(lambda x: x.sum()/x.count() * 100).rename('% (Response == 1)').to_frame()

No pattern at all, so there is no point in binning `Vintage`

In [None]:
train_dfc['Annual_Premium_de'] = train_dfc['Annual_Premium_self'].cat.codes
train_dfc['self_defined_agegroups_en'] = train_dfc['self_defined_agegroups'].cat.codes
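A quick illustration of what `pd.cut` plus `.cat.codes` produces, using the same age bin edges as above on a few made-up ages. Intervals are closed on the right by default, so 29 falls in the first bin and 30 in the second. Anything outside the edges becomes missing, and `.cat.codes` encodes missing as -1.

```python
import pandas as pd

# Made-up ages, including boundary values and one out-of-range value (120)
ages = pd.Series([18, 29, 30, 35, 36, 50, 51, 85, 120])

groups = pd.cut(ages, bins=[0, 29, 35, 50, 100])  # (0, 29], (29, 35], (35, 50], (50, 100]
codes = groups.cat.codes                          # ordinal code per bin, -1 for missing

print(pd.DataFrame({"age": ages, "group": groups, "code": codes}))
```

So it's worth checking for -1 in the encoded columns before feeding them to a model.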

In [None]:
train_dfc['Gender_ohe'] = pd.get_dummies(train_dfc['Gender'], prefix='Gender', drop_first=True)['Gender_Male']

In [None]:
train_dfc['Vehicle_Damage_ohe'] = train_dfc['Vehicle_Damage'].map({'Yes': 1, 'No': 0})

### Deriving some features for categorical features with many values

In [None]:
train_dfc.groupby('Policy_Sales_Channel')['Response'].mean().mul(100).sort_values()

In [None]:
# Mean of a 0/1 target per channel = response rate; this avoids summing the string key column
temp_df = train_df.groupby('Policy_Sales_Channel')['Response'].mean().mul(100).sort_values()
ax = temp_df.plot.bar()

# Bars sit at integer positions 0..n-1, so label every 5th bar by position
ax.set_xticks(range(0, len(temp_df), 5))
ax.set_xticklabels(temp_df.index[::5], rotation=90)

`Policy_Sales_Channel` = 123 and 43 have 100% response rate! Let's look at these two `Policy_Sales_Channel` only

In [None]:
train_df[train_df['Policy_Sales_Channel'].isin(['123', '43'])]

Hmm.. so there are only two customers with those `Policy_Sales_Channel`. But is that the case with test set as well? Let's find out.

In [None]:
test_df[test_df['Policy_Sales_Channel'].isin(['123', '43'])]

4 customers only, which is okay.

I'm going to club similar `Policy_Sales_Channel` together on the basis of their Response rate using KMeans
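Here's a minimal sketch of the idea on toy numbers before doing it on the real data. The rates below are invented, and `n_init` and `random_state` are set only to make the sketch repeatable. One thing to keep in mind: KMeans cluster ids are arbitrary, so id 0 is not necessarily the lowest-rate group. The sketch re-ranks the ids by cluster center so they follow the response rate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Invented response rates (%) for five hypothetical channels
rates = pd.Series({"A": 1.0, "B": 2.0, "C": 20.0, "D": 22.0, "E": 100.0})

km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(rates.values.reshape(-1, 1))

# Rank the cluster centers so that 0 = lowest rate, 2 = highest rate
order = np.argsort(km_toy.cluster_centers_.ravel())
rank_of_label = {int(label): rank for rank, label in enumerate(order)}

ordered = pd.Series(km_toy.labels_, index=rates.index).map(rank_of_label)
print(ordered)
```

With ordered ids, the cluster column could also be used directly as an ordinal feature instead of being one-hot encoded.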

In [None]:
from sklearn.cluster import KMeans

In [None]:
km = KMeans(n_clusters=6, random_state=42).fit(X = temp_df.values.reshape(-1, 1))  # fixed seed so the clusters are repeatable

In [None]:
clusters = km.predict(temp_df.values.reshape(-1, 1))

In [None]:
temp_df = temp_df.rename('% (Response == 1)').reset_index()

In [None]:
temp_df['cluster'] = clusters

In [None]:
ax = sns.barplot(x=temp_df['Policy_Sales_Channel'], y=temp_df['% (Response == 1)'], hue=temp_df['cluster']);

ax.set_xticks(temp_df.index[::5])

for tick in ax.get_xticklabels():
    tick.set_rotation(90)

In [None]:
train_dfc = train_dfc.merge(temp_df[['Policy_Sales_Channel', 'cluster']], how='left').rename(columns={'cluster': 'Policy_Sales_Channel_cluster'})
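A side effect worth knowing: a column-on-column `merge` ignores the existing index, so the `id` index set earlier is replaced by a fresh 0..n-1 index. If you want to keep `id`, mapping through a lookup Series is one alternative. A small sketch on toy frames (the ids and cluster numbers are made up):

```python
import pandas as pd

# Toy frame indexed by id, like train_dfc
toy = pd.DataFrame(
    {"Policy_Sales_Channel": ["26", "152", "26"]},
    index=pd.Index([101, 102, 103], name="id"),
)
# Toy channel -> cluster lookup, like temp_df
lookup = pd.DataFrame({"Policy_Sales_Channel": ["26", "152"], "cluster": [3, 0]})

merged = toy.merge(lookup, how="left")  # index becomes 0, 1, 2
mapped = toy["Policy_Sales_Channel"].map(
    lookup.set_index("Policy_Sales_Channel")["cluster"]
)                                       # index stays 101, 102, 103

print(merged.index.tolist())
print(mapped.index.tolist())
```

Both approaches give a missing value for a channel that is not in the lookup, which matters when applying the mapping to the test set.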

In [None]:
train_dfc.groupby('Region_Code')['Response'].mean().mul(100).sort_values()

In [None]:
# Mean of a 0/1 target per region = response rate; this avoids summing the string key column
temp_df = train_df.groupby('Region_Code')['Response'].mean().mul(100).sort_values()
ax = temp_df.plot.bar()

# Bars sit at integer positions 0..n-1, so label every 3rd bar by position
ax.set_xticks(range(0, len(temp_df), 3))
ax.set_xticklabels(temp_df.index[::3], rotation=90)

Repeating the same clustering method for `Region_Code` as well

In [None]:
km = KMeans(n_clusters=6, random_state=42).fit(X = temp_df.values.reshape(-1, 1))  # fixed seed so the clusters are repeatable

In [None]:
clusters = km.predict(temp_df.values.reshape(-1, 1))

In [None]:
temp_df = temp_df.rename('% (Response == 1)').reset_index()

In [None]:
temp_df['cluster'] = clusters

In [None]:
ax = sns.barplot(x=temp_df['Region_Code'], y=temp_df['% (Response == 1)'], hue=temp_df['cluster']);

ax.set_xticks(temp_df.index[::3])

for tick in ax.get_xticklabels():
    tick.set_rotation(90)

In [None]:
train_dfc = train_dfc.merge(temp_df[['Region_Code', 'cluster']], how='left').rename(columns={'cluster': 'Region_Code_cluster'})

In [None]:
train_dfc = pd.concat([
    train_dfc,
    pd.get_dummies(train_dfc['Policy_Sales_Channel_cluster'], prefix='PSCC', drop_first=True)
], axis=1)

In [None]:
train_dfc = pd.concat([
    train_dfc,
    pd.get_dummies(train_dfc['Region_Code_cluster'], prefix='RCC', drop_first=True)
], axis=1)

In [None]:
train_dfc.columns

In [None]:
features_to_consider = ['Driving_License', 'Previously_Insured', 'VehicleAgeDe', 'self_defined_agegroups_en', 'Annual_Premium_de', 'Gender_ohe', 'Vehicle_Damage_ohe', 'PSCC_1', 'PSCC_2', 'PSCC_3', 'PSCC_4', 'PSCC_5', 'RCC_1', 'RCC_2', 'RCC_3', 'RCC_4', 'RCC_5']
target = 'Response'

# Deciding on CV

I'm going to use 10-fold stratified cross validation for this one

In [None]:
X = train_dfc.loc[:, features_to_consider].values
y = train_dfc.loc[:, target].values

In [None]:
from sklearn.model_selection import StratifiedKFold

In [None]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # fixed seed so every model sees the same folds

In [None]:
train_dfc['Response'].value_counts(normalize=True).mul(100).round(2)
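With an imbalanced target, stratification is what keeps the positive share the same in every fold. A minimal sketch on a made-up target with 20% positives (the numbers are for illustration only, not the dataset's actual balance):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 20 positives, 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

toy_skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Share of positives in each validation fold
fold_rates = [y_toy[val_idx].mean() for _, val_idx in toy_skf.split(X_toy, y_toy)]
print(fold_rates)
```

Every validation fold ends up with the same 20% share of positives as the full toy target.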

# Model Building

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, roc_auc_score

In [None]:
performance_tree = {}

In [None]:
for fold_no, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f'Running fold {fold_no}')
    X_train = X[train_idx, :]
    X_val   = X[val_idx,   :]
    y_train = y[train_idx]
    y_val   = y[val_idx]
    
    tree = DecisionTreeClassifier().fit(X_train, y_train)
    predictions = tree.predict(X_val)
    
    performance_tree[fold_no] = roc_auc_score(y_val, predictions)

In [None]:
performance_tree
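A side note on the metric in the loop above: `roc_auc_score` is given the hard 0/1 output of `predict`. With hard labels the ROC curve has a single operating point, and the score reduces to balanced accuracy. Scoring the positive-class probabilities from `predict_proba` uses the full ranking instead. A minimal sketch on synthetic data (the toy dataset and all names below are assumptions, not this notebook's data or results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem, standing in for the real data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

hard_labels = model.predict(X_va)             # 0/1 decisions at the default threshold
pos_scores = model.predict_proba(X_va)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_va, hard_labels)
auc_from_scores = roc_auc_score(y_va, pos_scores)
bal_acc = balanced_accuracy_score(y_va, hard_labels)

print(auc_from_labels, bal_acc, auc_from_scores)
```

In the loops here, that would mean replacing `tree.predict(X_val)` with `tree.predict_proba(X_val)[:, 1]`. I haven't rerun the folds that way, so the scores reported in this notebook are from hard labels.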

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf_performance = {}

In [None]:
for fold_no, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f'Running fold {fold_no}')
    X_train = X[train_idx, :]
    X_val   = X[val_idx,   :]
    y_train = y[train_idx]
    y_val   = y[val_idx]
    
    rf = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
    predictions = rf.predict(X_val)
    
    rf_performance[fold_no] = roc_auc_score(y_val, predictions)

In [None]:
rf_performance.values()

The performance is very poor. Should we use the original features instead?

In [None]:
train_dfc = pd.concat([train_dfc,
    pd.get_dummies(train_dfc['Region_Code'], prefix='Region_Code', drop_first=True)],
         axis=1)

In [None]:
train_dfc = pd.concat([train_dfc,
    pd.get_dummies(train_dfc['Policy_Sales_Channel'], prefix='Policy_Sales_Channel', drop_first=True)],
                          axis=1)

In [None]:
features_to_consider = ['Age', 'Driving_License', 'Previously_Insured', 'VehicleAgeDe', 'Vintage', 'Gender_ohe', 'Vehicle_Damage_ohe'] + \
                        [i for i in train_dfc.columns if 'Region_Code' in i and i != 'Region_Code_cluster' and i != 'Region_Code'] + \
                        [i for i in train_dfc.columns if 'Policy_Sales_Channel' in i and i != 'Policy_Sales_Channel_cluster' and i != 'Policy_Sales_Channel']

In [None]:
X = train_dfc.loc[:, features_to_consider].values

In [None]:
rf_performance_orig = {}

In [None]:
for fold_no, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f'Running fold {fold_no}')
    X_train = X[train_idx, :]
    X_val   = X[val_idx,   :]
    y_train = y[train_idx]
    y_val   = y[val_idx]
    
    rf = RandomForestClassifier(n_estimators=15).fit(X_train, y_train)
    predictions = rf.predict(X_val)
    
    rf_performance_orig[fold_no] = roc_auc_score(y_val, predictions)

In [None]:
rf_performance_orig

Hmm... a ~15% increase on average! Nice!

Although this downsizing process didn't work out well, at least I got to know what happens. This was just something I've been meaning to try out, and it is **based on** something I learnt during my time in financial data science.

Apart from that, I know that I'm missing many things while building the model, like normalizing/standardizing features, hyperparameter optimization, etc., but I'm going to skip them for now as I'm learning something new. Anyway, it was fun writing this one. Peace.