# Janatahack: Cross-sell Prediction 
### Public Rank - 72 (0.8584736363), Private Rank - 58 (0.8632359852)

### Problem Statement

Your client is an insurance company that has provided health insurance to its customers. They now need your help building a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
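The arithmetic behind this example can be checked in a few lines. The sketch below uses only the figures quoted above (100 customers, Rs. 5000 premium, Rs. 200,000 cover, 2-3 claims). The function name `pool_summary` is made up for illustration, and it assumes every claim reaches the full cover, which is the worst case.

```python
def pool_summary(n_customers, premium, cover, n_claims):
    """Return (premiums collected, worst-case payout, balance) for one year."""
    collected = n_customers * premium
    payout = n_claims * cover  # assumes each claim uses the full cover
    return collected, payout, collected - payout

for claims in (2, 3):
    collected, payout, balance = pool_summary(100, 5000, 200_000, claims)
    print(f"{claims} claims: collected Rs. {collected}, "
          f"worst-case payout Rs. {payout}, balance Rs. {balance}")
```

With two full claims the pooled premiums cover the payouts; with three full claims they fall short. The cover is a ceiling ("up to"), so actual claims are usually smaller than this worst case.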

Just like medical insurance, there is vehicle insurance, where the customer pays the insurance provider a premium of a certain amount every year so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then plan its communication strategy to reach out to those customers and optimise its business model and revenue.

To predict whether a customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (Vehicle Age, Damage), policy (Premium, sourcing channel), etc.

In [None]:
import pandas as pd
import numpy  as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.metrics import confusion_matrix,roc_auc_score
from pandas_profiling import ProfileReport
from imblearn.over_sampling import SMOTE,SMOTENC,ADASYN
from sklearn.model_selection import StratifiedKFold
# Comment this out if the visualisations don't render in your environment
%matplotlib inline

plt.style.use('bmh')

In [None]:
import warnings
warnings.filterwarnings('ignore')

##  Loading Train and Test Data

In [None]:
df_train = pd.read_csv('../input/health-insurance-cross-sell-prediction/train.csv')
df_test = pd.read_csv('../input/health-insurance-cross-sell-prediction/test.csv')
df_train.head()

In [None]:
# profile = ProfileReport(df_train,title='Profile Report',explorative=True)
# profile.to_widgets()

## Data Analysis:

1) The training data has 12 variables and no missing values.

2) The Region_Code feature has 2021 zero values; these need to be checked to confirm whether '0' is a valid region code or corrupted data.

3) The target variable (Response) is highly imbalanced, which needs to be handled with SMOTE or another resampling technique.
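The three observations above can be reproduced with a few pandas calls. The sketch below is one way to do it; `basic_checks` is a hypothetical helper name, and the toy frame stands in for `df_train` with made-up values.

```python
import pandas as pd

def basic_checks(df, target="Response", region_col="Region_Code"):
    """Summarise missing values, zero-valued region codes and class balance."""
    return {
        "n_missing": int(df.isna().sum().sum()),
        "n_zero_region": int((df[region_col] == 0).sum()),
        "positive_rate": float(df[target].mean()),
    }

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "Region_Code": [0.0, 28.0, 28.0, 0.0, 3.0],
    "Response":    [0, 0, 1, 0, 0],
})
print(basic_checks(toy))
```

Calling `basic_checks(df_train)` on the real data should report the same kind of summary. Running it on `df_test` as well (with the target check removed, since the test set has no `Response`) would show whether region code 0 also appears there, which is a hint, not proof, that it is a genuine code.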

In [None]:
df_train.info()

In [None]:
df_test.info()

## Pre-Processing

### Encoding Categorical features using Label Encoder

In [None]:
def label_encoder(data):
    le = preprocessing.LabelEncoder()
    data = le.fit_transform(data)
    return data
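Because `label_encoder` fits a fresh `LabelEncoder` on every call, the train and test mappings agree when both columns contain the same set of category values, but can diverge otherwise. A minimal sketch of one alternative, fitting each encoder once on the training column and reusing it, follows. `fit_encoders` and `apply_encoders` are hypothetical helper names, and the toy frames are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def fit_encoders(df, columns):
    """Fit one LabelEncoder per column on the training frame."""
    return {col: LabelEncoder().fit(df[col]) for col in columns}

def apply_encoders(df, encoders):
    """Return a copy of df with each column mapped by its fitted encoder."""
    out = df.copy()
    for col, le in encoders.items():
        out[col] = le.transform(out[col])
    return out

# Toy frames: the test column is missing one of the training categories
train = pd.DataFrame({"Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"]})
test = pd.DataFrame({"Vehicle_Age": ["> 2 Years", "1-2 Year"]})

encoders = fit_encoders(train, ["Vehicle_Age"])
train_enc = apply_encoders(train, encoders)
test_enc = apply_encoders(test, encoders)
print(list(encoders["Vehicle_Age"].classes_))
print(test_enc)
```

`transform` raises a `ValueError` for a category that was not seen during fitting, so a train/test mismatch surfaces as an error instead of being silently remapped.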

In [None]:
df_train['Gender'] = label_encoder(df_train['Gender'])
df_train['Vehicle_Age'] = label_encoder(df_train['Vehicle_Age'])
df_train['Vehicle_Damage'] = label_encoder(df_train['Vehicle_Damage'])
df_train.head()

In [None]:
df_train['Policy_Sales_Channel']  = df_train['Policy_Sales_Channel'].astype(int)
df_train['Region_Code']  = df_train['Region_Code'].astype(int)
df_train.head()

## Splitting up Train and Test Data

In [None]:
category_col=['Gender','Driving_License', 'Region_Code', 'Previously_Insured','Vehicle_Damage',
         'Vehicle_Age','Vintage','Policy_Sales_Channel']

X = df_train.copy()
y = X['Response']
X.drop(['id','Response'],axis=1,inplace=True)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.25,shuffle=True,random_state=99,stratify=y)
print("Size of X_train ", X_train.shape)
print("Size of X_cv ", X_cv.shape)
print("Size of y_train ", y_train.shape)
print("Size of y_cv ", y_cv.shape)
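Passing `stratify=y` keeps the share of positive responses roughly the same in both parts of the split, which matters when the target is imbalanced. The sketch below checks this on made-up labels with the same split settings; the 12% positive rate is an illustrative assumption, not a figure from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 12% positives (made up for illustration)
y_toy = np.array([1] * 120 + [0] * 880)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, shuffle=True, random_state=99, stratify=y_toy
)

print("overall positive rate:", y_toy.mean())
print("train positive rate:  ", y_tr.mean())
print("holdout positive rate:", y_val.mean())
```

The same comparison on the notebook's data would be `y.mean()`, `y_train.mean()` and `y_cv.mean()`.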


## Model

In [None]:
from catboost import CatBoostClassifier
params = {'iterations':1000,
        'learning_rate':0.1,
        'cat_features': category_col,
        'depth':7,
        'eval_metric':'AUC',
        'loss_function':'Logloss',
        'verbose':200,
        'od_type':"Iter", # overfitting detector type
        'od_wait':300, # iterations to keep training after the best-scoring iteration before stopping
        'random_seed': 99,
        'l2_leaf_reg' : 11
          }

cat_model = CatBoostClassifier(**params)
cat_model.fit(X_train, y_train,   
          eval_set=(X_cv, y_cv), 
          use_best_model=True, # True if we don't want to save trees created after iteration with the best validation score
          plot=True  
         );
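The `od_type`/`od_wait` pair describes a patience-style stopping rule: training stops once the validation metric has gone `od_wait` iterations without improving, and `use_best_model=True` keeps only the trees up to the best iteration. The pure-Python sketch below illustrates that rule on a made-up list of AUC values. It is a simplified illustration, not CatBoost's internal implementation, and `best_iteration_with_patience` is a hypothetical name.

```python
def best_iteration_with_patience(scores, wait):
    """Scan per-iteration validation scores (higher is better).

    Stops once `wait` iterations pass without a new best score and
    returns (best_iteration, stop_iteration), both 0-based.
    """
    best_score = float("-inf")
    best_iter = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= wait:
            return best_iter, i  # patience exhausted
    return best_iter, len(scores) - 1  # ran through every iteration

# Made-up validation AUC per iteration
auc_history = [0.80, 0.84, 0.85, 0.849, 0.848, 0.847, 0.851]
print(best_iteration_with_patience(auc_history, wait=3))
```

With a short patience the late improvement at the last iteration is never reached, which is the trade-off a large `od_wait` such as 300 is meant to soften.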

In [None]:
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())

In [None]:
from sklearn.metrics import roc_curve
# Score the CatBoost model on the full labelled data (this includes the training rows)
cat_proba = cat_model.predict_proba(X)[:,1]
cat_roc_auc = roc_auc_score(y, cat_proba)
fpr, tpr, thresholds = roc_curve(y, cat_proba)
plt.figure()
plt.plot(fpr, tpr, label='CatBoost (area = %0.2f)' % cat_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
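The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on a tiny made-up example; `pairwise_auc` is a hypothetical helper for illustration, and `roc_auc_score` remains the practical choice on real data.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive vs. every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise matrix grows with the product of the class sizes, so this form is only suitable for small samples.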

## Prediction on Test Data

In [None]:
X_test = df_test.copy()

# Encoding category using Label Encoder
X_test['Gender'] = label_encoder(X_test['Gender'])
X_test['Vehicle_Damage'] = label_encoder(X_test['Vehicle_Damage'])
X_test['Vehicle_Age'] = label_encoder(X_test['Vehicle_Age'])
X_test.drop(['id'],axis=1, inplace=True)
X_test['Policy_Sales_Channel']  = X_test['Policy_Sales_Channel'].astype(int)
X_test['Region_Code']  = X_test['Region_Code'].astype(int)
X_test.head()
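Before predicting, it can help to confirm that `X_test` has the same feature columns, in the same order, as the frame the model was trained on. A minimal sketch, with `check_columns` as a hypothetical helper and toy column names:

```python
import pandas as pd

def check_columns(train_cols, test_df):
    """Raise if test_df lacks a training column; return it in training order."""
    missing = [c for c in train_cols if c not in test_df.columns]
    if missing:
        raise ValueError(f"test data is missing columns: {missing}")
    return test_df[list(train_cols)]

# Toy example: same columns as training, different order
train_cols = ["Gender", "Age"]
toy_test = pd.DataFrame({"Age": [44, 21], "Gender": [1, 0]})
aligned = check_columns(train_cols, toy_test)
print(list(aligned.columns))
```

In the notebook the call would be `check_columns(X.columns, X_test)`.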

## Prediction

In [None]:
response = cat_model.predict_proba(X_test)
response

### Submission

In [None]:
submission = df_test[['id']].copy()
submission['Response'] = response[:, 1]
submission.head(10)
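The notebook stops at previewing the submission frame. A sketch of writing it to disk follows; the file name `submission.csv` and the helper `save_submission` are assumptions, and the ids and probabilities are made up.

```python
import os
import tempfile
import pandas as pd

def save_submission(ids, probabilities, path):
    """Write an id/Response CSV without the index column and return the frame."""
    out = pd.DataFrame({"id": list(ids), "Response": list(probabilities)})
    out.to_csv(path, index=False)
    return out

# Toy ids and probabilities (made up); the file name is an assumption
path = os.path.join(tempfile.gettempdir(), "submission.csv")
saved = save_submission([1, 2], [0.02, 0.31], path)
print(pd.read_csv(path).head())
```

For the frame built above, the equivalent one-liner would be `submission.to_csv('submission.csv', index=False)`.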