Your client is an insurance company that has provided health insurance to its customers. They now need your help building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance the company offers.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
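
The arithmetic behind this pooling can be sketched in a few lines. The premium, cover and customer count come from the example above; treating every claim as a full Rs. 200,000 payout is a worst-case assumption made only for illustration.

```python
# Figures from the example above; the claim counts are illustrative.
customers = 100
premium = 5_000   # Rs. paid per customer per year
cover = 200_000   # Rs. maximum payout per hospitalised customer

pool = customers * premium  # total premiums collected in the year

for claims in (2, 3):
    # Worst case: every claimant uses the full cover.
    worst_case_payout = claims * cover
    print(f"{claims} claims: pool Rs. {pool:,}, worst-case payout Rs. {worst_case_payout:,}")
```

With two full claims the pool covers the payout; with three it does not, which is why the claim frequency and typical claim size matter when a premium is set.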

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called the ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue. 

To predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

* Public leaderboard rank:
* Private leaderboard rank:

## Importing the libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder, OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer, FunctionTransformer
from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold, RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.feature_selection import SelectFromModel, SelectKBest, VarianceThreshold
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, accuracy_score, classification_report
from sklearn.decomposition import PCA, FactorAnalysis, TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier, VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
pd.set_option('display.max_columns', None)

In [None]:
train = pd.read_csv("/kaggle/input/janatahack-crosssell-prediction/train.csv")
test = pd.read_csv("/kaggle/input/janatahack-crosssell-prediction/test.csv")
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)
sample = pd.read_csv("/kaggle/input/janatahack-crosssell-prediction/sample_submission.csv")

In [None]:
train.info()

## Train data head

In [None]:
train.head()

## Missing values check

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

- No missing values in the data. Will confirm the same using unique values

In [None]:
for col in train.columns:
    print(f"{col} : {train[col].nunique()}")
    print(train[col].unique())

In [None]:
#separating continuous and categorical variables
cat_var = ["Gender","Driving_License","Previously_Insured","Vehicle_Age","Vehicle_Damage"]
con_var = list(set(train.columns).difference(cat_var+["Response"]))

In [None]:
train.Response.value_counts(normalize=True)

### Around 12.26% of customers have given a positive response
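
To see how imbalanced this is, the negative-to-positive ratio can be computed from the reported share. This is only a back-of-the-envelope sketch; it appears consistent with the `scale_pos_weight=7` used for the XGBoost and CatBoost models later in this notebook, though the source does not say how that value was chosen.

```python
positive_rate = 0.1226              # share of Response == 1 reported above
negative_rate = 1 - positive_rate

# Ratio of negatives to positives, a common starting point for
# class-weight style parameters such as scale_pos_weight.
neg_to_pos = negative_rate / positive_rate
print(round(neg_to_pos, 2))
```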

In [None]:
sns.countplot(x=train.Response)  # pass the data as x= (required by recent seaborn versions)
plt.title("Class count")
plt.show()

In [None]:
train.head(3)

In [None]:
# pairwise plots of the numeric columns, coloured by Response

sns.pairplot(train, hue='Response', diag_kind='hist')
plt.show()

In [None]:
def map_val(data):
    data["Gender"] = data["Gender"].replace({"Male":1, "Female":0})
    data["Vehicle_Age"] = data["Vehicle_Age"].replace({'> 2 Years':2, '1-2 Year':1, '< 1 Year':0 })
    data["Vehicle_Damage"] = data["Vehicle_Damage"].replace({"Yes":1, "No":0})
    return data

train = map_val(train)
test = map_val(test)

In [None]:
comb = pd.concat([train,test])
comb.shape , train.shape , test.shape

# Let's see the distribution of each category in the entire dataset

In [None]:

print('The distribution of gender:',comb['Gender'].value_counts())

In [None]:
comb.head()

In [None]:
comb.info()

In [None]:
list1 = ['Gender', 'Age', 'Region_Code', 'Previously_Insured',
       'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage']

In [None]:
# list1 = set(comb.columns) - set('Driving_License')
list1

In [None]:

fig, axes = plt.subplots(nrows=3, ncols=3,figsize=(20,20))

for i, column in enumerate(list1):
    print(column)
    sns.histplot(comb[column], kde=True, ax=axes[i//3,i%3])  # histplot replaces the deprecated distplot

A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.

Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables. They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable. The lines that join each point from the same hue level allow interactions to be judged by differences in slope, which is easier for the eyes than comparing the heights of several groups of points or bars.


[Click here for info](https://seaborn.pydata.org/generated/seaborn.pointplot.html)
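
What a point plot draws can be reproduced numerically: one mean per category level plus an interval around it. The sketch below uses made-up toy data and a normal-approximation interval as a simplifying assumption; seaborn itself bootstraps the interval by default.

```python
import pandas as pd

# Toy data (made up for illustration): a binary response per category level.
toy = pd.DataFrame({
    "level":    ["a", "a", "a", "a", "b", "b", "b", "b"],
    "response": [0,   0,   1,   0,   1,   1,   0,   1],
})

# Per-level mean (the plotted point) and standard error of the mean.
summary = toy.groupby("level")["response"].agg(["mean", "sem"])

# Normal-approximation 95% interval (the error bar).
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```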

In [None]:
train.head(3)

In [None]:
cat_var

* **As the vehicle age increases the response rate also increases** . 

In [None]:
fig, ax = plt.subplots(2,3 , figsize=(15,15))
ax = ax.flatten()
for i,col in enumerate(cat_var):
    sns.pointplot(x = col, y = 'Response',hue = 'Vehicle_Age',data=train, ax = ax[i])
plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(2,3 , figsize=(10,10))
ax = ax.flatten()
for i,col in enumerate(cat_var):
    sns.pointplot(x=col, y='Response', data=train, ax = ax[i])
plt.tight_layout()
plt.show()

# Some key findings:
* Males in general have higher response rates
* As the vehicle age increases, the response rate also increases
* People with prior vehicle damage are more likely to respond, as one would expect
* People who were previously insured have a low response rate, as expected
* Driving License has a considerable error margin


In [None]:
sns.catplot(x='Gender', y='Response',hue='Vehicle_Age', row = 'Previously_Insured',col='Vehicle_Damage',data=train, kind='point', height=3, aspect=2)
plt.show()

* When the vehicle was not previously insured and has sustained damage, the response rate is markedly higher, even more so for older vehicles
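
The same comparison can be read off a table of mean response per combination. A minimal sketch on made-up rows (the column names mirror the training data, the values do not):

```python
import pandas as pd

# Toy rows (made up); only the column names match the real data.
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 1, 1, 1, 1],
    "Vehicle_Damage":     [1, 1, 0, 0, 1, 1, 0, 0],
    "Response":           [1, 1, 0, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 column is the response rate for that cell.
rate = toy.pivot_table(index="Previously_Insured", columns="Vehicle_Damage",
                       values="Response", aggfunc="mean")
print(rate)
```

Running the same `pivot_table` call on `train` would give the actual rates behind the plot.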


In [None]:
fig, ax = plt.subplots(2,3 , figsize=(16,6))
ax = ax.flatten()
i = 0
for col in con_var:
    sns.boxplot(x='Response', y=col, data=train, ax = ax[i])
    i+=1
plt.tight_layout()
plt.show()

In [None]:
sns.catplot(x='Gender', y='Vintage',hue='Response', row = 'Previously_Insured',col='Vehicle_Damage',data=train, kind='box', height=3, aspect=2)
plt.show()

In [None]:
sns.catplot(x='Gender', y='Age',hue='Response', row = 'Previously_Insured',col='Vehicle_Damage',data=train, kind='box', height=3, aspect=2)
plt.show()

In [None]:
sns.catplot(x='Gender', y='Annual_Premium',hue='Response', row = 'Previously_Insured',col='Vehicle_Damage',data=train, kind='box', height=3, aspect=2)
plt.show()

In [None]:
plt.figure(figsize=(30,5))
sns.heatmap(pd.crosstab([train['Previously_Insured'], train['Vehicle_Damage']], train['Region_Code'],
                        values=train['Response'], aggfunc='mean', normalize='columns'), annot=True, cmap='inferno')
plt.show()

In [None]:
crosstab_df=pd.crosstab([train['Previously_Insured'], train['Vehicle_Damage']], train['Region_Code'],values=train['Response'], aggfunc='mean', normalize='columns')
crosstab_df

Annual_Premium showed interesting characteristics in the starting scatter plots

In [None]:
cat_var

In [None]:
train.head(1)

In [None]:
sns.relplot(x="Age", y="Annual_Premium", hue="Response", data=train)

In [None]:
sns.relplot(x="Vintage", y="Annual_Premium", hue="Response", data=train)

In [None]:
sns.relplot(x="Policy_Sales_Channel", y="Annual_Premium", hue="Response", data=train)

## Correlation Heatmap

In [None]:
corr = train.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)]=True
plt.figure(figsize=(10,6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='YlGnBu', mask=mask)
plt.title("Correlation Heatmap")
plt.show()

## Current Age/ Vintage/ Annual Premium distributions are not helping very much so we will try a log transformation

In [None]:
train.skew()

In [None]:
train['log_premium'] = np.log(train.Annual_Premium)
train['log_age'] = np.log(train.Age)
test['log_premium'] = np.log(test.Annual_Premium)
test['log_age'] = np.log(test.Age)
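
The effect of a log transform on a right-skewed column can be checked with `Series.skew()`. The sketch below uses synthetic log-normal values as a stand-in for a premium-like column, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a premium-like column.
raw = pd.Series(rng.lognormal(mean=10, sigma=1, size=5_000))
logged = np.log(raw)  # requires strictly positive values

skew_raw = raw.skew()
skew_log = logged.skew()
print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Comparing `train.skew()` before and after adding `log_premium` and `log_age` gives the same check on the real data.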

In [None]:
train.groupby(['Previously_Insured','Gender'])['log_premium'].plot(kind='kde')
plt.show()

In [None]:
train.groupby(['Previously_Insured','Gender'])['log_age'].plot(kind='kde')
plt.show()

# Feature importance / Feature selection 

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

In [None]:
def feature_engineering(data, col):
    mean_age_insured = data.groupby(['Previously_Insured','Vehicle_Damage'])[col].mean().reset_index()
    mean_age_insured.columns = ['Previously_Insured','Vehicle_Damage','mean_'+col+'_insured']
    mean_age_gender = data.groupby(['Previously_Insured','Gender'])[col].mean().reset_index()
    mean_age_gender.columns = ['Previously_Insured','Gender','mean_'+col+'_gender']
    mean_age_vehicle = data.groupby(['Previously_Insured','Vehicle_Age'])[col].mean().reset_index()
    mean_age_vehicle.columns = ['Previously_Insured','Vehicle_Age','mean_'+col+'_vehicle']
    data = data.merge(mean_age_insured, on=['Previously_Insured','Vehicle_Damage'], how='left')
    data = data.merge(mean_age_gender, on=['Previously_Insured','Gender'], how='left')
    data = data.merge(mean_age_vehicle, on=['Previously_Insured','Vehicle_Age'], how='left')
    # ratio of each row's value of `col` to its group mean (previously hard-coded to 'log_age')
    data[col+'_mean_insured'] = data[col]/data['mean_'+col+'_insured']
    data[col+'_mean_gender'] = data[col]/data['mean_'+col+'_gender']
    data[col+'_mean_vehicle'] = data[col]/data['mean_'+col+'_vehicle']
    data.drop(['mean_'+col+'_insured','mean_'+col+'_gender','mean_'+col+'_vehicle'], axis=1, inplace=True)
    return data

train = feature_engineering(train, 'log_age')
test = feature_engineering(test, 'log_age')

train = feature_engineering(train, 'log_premium')
test = feature_engineering(test, 'log_premium')

train = feature_engineering(train, 'Vintage')
test = feature_engineering(test, 'Vintage')
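
A ratio-to-group-mean feature of this kind can also be written without the intermediate merges by using `groupby(...).transform`. The helper name `ratio_to_group_mean` and the toy values are hypothetical, for illustration only.

```python
import pandas as pd

# Toy frame (made-up values); column names mirror the training data.
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 1, 1],
    "Gender":             [1, 1, 0, 0],
    "log_age":            [1.0, 3.0, 2.0, 6.0],
})

def ratio_to_group_mean(data, col, keys):
    """Return `col` divided by its mean within each group defined by `keys`."""
    return data[col] / data.groupby(keys)[col].transform("mean")

toy["log_age_mean_gender"] = ratio_to_group_mean(
    toy, "log_age", ["Previously_Insured", "Gender"]
)
print(toy)
```

`transform("mean")` returns a Series aligned with the original rows, so no merge or column clean-up is needed.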

In [None]:
train

In [None]:
test

# Preparing the data for training

In [None]:
X = train.drop(["Response"], axis=1)
Y = train["Response"]

Just a baseline submission check 

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(X,Y,test_size=0.2,random_state=294,stratify = Y)

In [None]:
lg=LGBMClassifier(boosting_type='gbdt',n_estimators=500,max_depth=5,learning_rate=0.04,objective='binary',metric='auc',is_unbalance=True,
                 colsample_bytree=0.5,reg_lambda=2,reg_alpha=2,random_state=294,n_jobs=-1)

lg.fit(X_train,y_train)
print(roc_auc_score(y_val,lg.predict_proba(X_val)[:,1]))

In [None]:
#Check for Permutation Importance of Features
perm = PermutationImportance(lg,random_state=294).fit(X_val, y_val)
eli5.show_weights(perm,feature_names=X_val.columns.tolist())

This function keeps only the features whose permutation importance weight is > 0, i.e. it drops those with weight <= 0

In [None]:

def drop_permute_feat(data,permuter):
    mask = permuter.feature_importances_ > 0 
    features = data.columns[mask]
    return features
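
The idea behind permutation importance can be sketched with scikit-learn's own `permutation_importance` instead of eli5: shuffle one feature at a time and measure how much the score drops. The data and model below are synthetic stand-ins, not the notebook's.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))        # synthetic features
y_demo = (X_demo[:, 0] > 0).astype(int)   # only feature 0 carries signal

model = LogisticRegression().fit(X_demo, y_demo)
result = permutation_importance(model, X_demo, y_demo, scoring="roc_auc",
                                n_repeats=5, random_state=0)

# Same rule as drop_permute_feat: keep features with positive importance.
keep = result.importances_mean > 0
print(result.importances_mean, keep)
```

Features whose shuffling does not hurt the score end up with importance near zero or below, and the `> 0` mask filters them out.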

In [None]:
features = drop_permute_feat(X_train,perm)
features

In [None]:
X_train_permute = X_train[features]
X_val_permute = X_val[features]

In [None]:
lg=LGBMClassifier(boosting_type='gbdt',n_estimators=500,max_depth=5,learning_rate=0.04,objective='binary',metric='auc',is_unbalance=True,
                 colsample_bytree=0.5,reg_lambda=2,reg_alpha=2,random_state=294,n_jobs=-1)

lg.fit(X_train_permute,y_train)
print(roc_auc_score(y_val,lg.predict_proba(X_val_permute)[:,1]))

In [None]:
## Full fit
lg=LGBMClassifier(boosting_type='gbdt',n_estimators=500,max_depth=10,learning_rate=0.04,objective='binary',metric='auc',is_unbalance=True,
                 colsample_bytree=0.5,reg_lambda=2,reg_alpha=2,random_state=294,n_jobs=-1)
lg.fit(X,Y)

In [None]:
submission_df=pd.read_csv('/kaggle/input/janatahack-crosssell-prediction/sample_submission.csv')
submission_df['Response']=np.array(lg.predict_proba(test)[:,1])
submission_df.to_csv('baseline_test.csv',index=False)
submission_df.head(5)

Helper to drop the features flagged by the eli5 permutation importance check

In [None]:
def drop(data,list2):
    data_new = data.drop(list2, axis=1,inplace = False)
    return data_new






#  Cross validation strategy 

* best scores are obtained from catboost 

In [None]:
X.info()

In [None]:
test.info()

In [None]:
X_select = X.copy()
test_select = test.copy()

In [None]:
from sklearn.model_selection import StratifiedKFold

In [None]:
model_xgb = XGBClassifier(n_jobs=4, random_state=1, scale_pos_weight=7, objective='binary:logistic')
model_lgbm = LGBMClassifier(n_jobs=4, random_state=1, is_unbalance=True, objective='binary')
model_cat = CatBoostClassifier(random_state=1, verbose=0, scale_pos_weight=7, custom_metric=['AUC'])

In [None]:
def submission(preds, model):
    sample["Response"] = preds
    sample.to_csv("model_"+model+".csv", index=False)

In [None]:
model_lgbm = LGBMClassifier(boosting_type='gbdt',n_estimators=500,max_depth=10,learning_rate=0.04,objective='binary',metric='auc',is_unbalance=True,
                 colsample_bytree=0.5,reg_lambda=2,reg_alpha=2,random_state=294,n_jobs=-1)

Baseline scores for each stacked model:

| Model | CV test score | Leaderboard score |
| --- | --- | --- |
| xgb_stacked | 0.8553 | 0.8563 |
| lgb_stacked | 0.858 | 0.8573 |
| cat_stacked | 0.85541 | 0.8559 |



In [None]:
def cv_generator(model,n_splits_user,X_select,Y):
    cv = StratifiedKFold(n_splits=n_splits_user, random_state=1, shuffle=True)
    predictions= []
    train_roc_score = 0
    test_roc_score = 0

    for train_index, test_index in cv.split(X_select, Y):
        xtrain, xtest = X_select.iloc[train_index], X_select.iloc[test_index]
        ytrain, ytest = Y[train_index], Y[test_index]

        model.fit(xtrain, ytrain)
        trainpred = model.predict_proba(xtrain)[:,1]
        testpred = model.predict_proba(xtest)[:,1]
        train_roc_score += roc_auc_score(ytrain, trainpred)
        test_roc_score += roc_auc_score(ytest, testpred)
        print("Train ROC AUC : %.4f Test ROC AUC : %.4f"%(roc_auc_score(ytrain, trainpred),roc_auc_score(ytest, testpred)))

        prediction = model.predict_proba(test_select)[:,1]
        predictions.append(prediction)
    
    print("The mean train score is :",train_roc_score/n_splits_user)
    print("The mean test score is :",test_roc_score/n_splits_user)
    
    # return every fold's test predictions so the caller can average them
    return predictions
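
The fold-averaging pattern used here can be sketched on synthetic data: fit one model per fold, score it on the held-out part, predict the unseen rows, and average those predictions across folds. `LogisticRegression` and the random data are stand-ins chosen to keep the sketch small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))                        # synthetic training features
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)
X_new = rng.normal(size=(50, 4))                          # synthetic "test" rows

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_preds, fold_scores = [], []
for tr_idx, va_idx in cv.split(X_demo, y_demo):
    clf = LogisticRegression().fit(X_demo[tr_idx], y_demo[tr_idx])
    val_prob = clf.predict_proba(X_demo[va_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[va_idx], val_prob))
    fold_preds.append(clf.predict_proba(X_new)[:, 1])

# One averaged probability per test row.
mean_pred = np.mean(fold_preds, axis=0)
print(np.mean(fold_scores), mean_pred.shape)
```

Averaging with `axis=0` only makes sense on the list of per-fold arrays; applied to a single fold's 1-D array it collapses to one number.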


In [None]:
predictions_lgbm = cv_generator(model = model_lgbm,n_splits_user = 5,X_select = X,Y = Y)
submission(np.mean(predictions_lgbm, axis=0), 'lgbm_stack')

In [None]:
predictions_xgb = cv_generator(model = model_xgb,n_splits_user = 5,X_select = X,Y = Y)
submission(np.mean(predictions_xgb, axis=0), 'xgb_stack')

In [None]:
predictions_cat = cv_generator(model = model_cat,n_splits_user = 5,X_select = X,Y = Y)
submission(np.mean(predictions_cat, axis=0), 'cat_stack')

CatBoost gave the best mean CV train score (0.883), but its test ROC AUC is lower (0.8554)
* The CatBoost model is probably overfitting

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(X,Y,test_size=0.2,random_state=294,stratify = Y)

In [None]:
# categorical column 
cat_col=['Gender','Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage','Policy_Sales_Channel']

In [None]:
X.info()

In [None]:
X.columns

In [None]:
X.Region_Code.dtype == 'float64'


In [None]:
X_copy = X.copy()

In [None]:
test_copy = test.copy()
for col in test.columns:
    if test[col].dtype == 'float64' :
        test_copy[col] = test[col].astype('int')
test_copy.info()          
        
        

In [None]:
for col in X.columns:
    if X[col].dtype == 'float64' :
        X_copy[col] = X[col].astype('int')
        
X_copy.info()        
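
CatBoost expects categorical features as integers or strings, which is the reason for the cast. The loops above convert every `float64` column, including continuous ones, which truncates their values. A narrower variant, sketched on made-up rows with an assumed list of code-like columns, casts only the categorical codes:

```python
import pandas as pd

toy = pd.DataFrame({
    "Region_Code": [28.0, 3.0],             # categorical code stored as float
    "Policy_Sales_Channel": [26.0, 152.0],  # categorical code stored as float
    "log_age": [3.78, 3.04],                # genuinely continuous
})

cat_like = ["Region_Code", "Policy_Sales_Channel"]  # assumed list of code columns
toy = toy.astype({c: "int64" for c in cat_like})
print(toy.dtypes)
```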
        

In [None]:
col_1=['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
X_t, X_tt, y_t, y_tt = train_test_split(X_copy[col_1], Y, test_size=.25, random_state=150303,stratify=Y,shuffle=True)

In [None]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
catb = CatBoostClassifier()
catb= catb.fit(X_t, y_t,cat_features=cat_col,eval_set=(X_tt, y_tt),plot=True,early_stopping_rounds=40,verbose=100)
#catb= catb.fit(X_t, y_t,cat_features=cat_col,eval_set=(X_tt, y_tt),plot=True,verbose=100)
y_cat = catb.predict(X_tt)
probs_cat_train = catb.predict_proba(X_t)[:, 1]
probs_cat_test = catb.predict_proba(X_tt)[:, 1]
print("Train ROC AUC:", roc_auc_score(y_t, probs_cat_train))
print("Validation ROC AUC:", roc_auc_score(y_tt, probs_cat_test))

In [None]:
cat_pred_new= catb.predict_proba(test_copy[col_1])[:, 1]


In [None]:
submission(cat_pred_new,'cat_boost_predictions_reduced_cols')


In [None]:
feat_importances = pd.Series(catb.feature_importances_, index=X_t.columns)
feat_importances.nlargest(15).plot(kind='barh')
#feat_importances.nsmallest(20).plot(kind='barh')
plt.show()

# Interaction features

Generate interaction features from the most important features
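
One common way to build such features is the pairwise product of selected columns. The sketch below is an assumption about the general technique, not the notebook's own method; the `top_features` list and the toy values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy rows (made up); column names mirror the training data.
toy = pd.DataFrame({
    "Age": [25, 40],
    "Vehicle_Damage": [1, 0],
    "Previously_Insured": [0, 1],
})

top_features = ["Age", "Vehicle_Damage", "Previously_Insured"]  # hypothetical selection
for a, b in combinations(top_features, 2):
    toy[f"{a}_x_{b}"] = toy[a] * toy[b]  # pairwise product as the interaction term
print(toy.columns.tolist())
```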

In [None]:
X_final = X_copy[col_1].copy()
test_final = test_copy[col_1].copy()

In [None]:
X_final.info()

In [None]:
test_final.info()