<b>Problem Statement:</b> Calculating claim probability on an insurance policy.

<b>Problem type:</b> A binary classification problem

<b>Evaluation metric:</b> Submissions are evaluated on the area under the <b>ROC (receiver operating characteristic)</b> curve between the predicted probability and the observed target.
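To make the metric concrete, here is a minimal sketch of ROC AUC computed from its ranking interpretation: the share of (positive, negative) pairs in which the positive observation receives the higher score, with ties counted as half. The function name `roc_auc_pairwise` and the toy labels and scores are illustrative assumptions, not part of the competition code; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy labels and scores, made up for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(toy_labels, toy_scores)
print(auc)  # 0.75
```

Because only the ordering of scores matters, any monotonic rescaling of the predicted probabilities leaves this value unchanged. The double loop is quadratic in the number of rows, so it is meant for intuition rather than for the full dataset.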

<h2 id="Approach">Approach to the problem</h2>
The idea is to develop a generalized approach for solving any binary classification problem.
<ol>
    <li>Performing exploratory data analysis.</li>
    <ol>
        <li><a href="#Target">Understanding Target feature distribution</a></li>
        <li><a href="#Corr">Correlation check</a></li>
        <li><a href="#TrainVisual">Visualizing Training dataset</a></li>
        <li><a href="#FeatureSummary">Understanding Training dataset features</a></li>
    </ol>
    <li><a href="#FeatureEng">Feature Engineering.</a></li> 
     <li>Data Preparation.</li>
    <ol>
        <li><a href="#MissingValues">Handling missing values</a></li>
    </ol>
    <li>Training linear and gradient boosting base models.</li>
    <ol>
        <li><a href="#LogisticRegression">Logistic Regression</a></li>
        <li><a href="#CatBoost">CatBoost Classification</a></li>
        <li><a href="#LGBM">LGBM Classification</a></li>
        <li><a href="#XGB">XGBoost Classification</a></li>
    </ol>
    <li>Basic Blending.</li>
    <ol>
        <li><a href="#Ratios">Calculating best blending ratios (using training predictions to calculate blending ratios)</a></li>
        <li><a href="#BlendPred">Calculating blended prediction</a></li>
    </ol>
</ol>

<h2>Required Libraries</h2>

In [None]:
#REQUIRED LIBRARIES

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc
import os
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression,RidgeClassifier
# from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import roc_auc_score,accuracy_score
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.cluster import KMeans
# import pickle
# from sklearn.externals import joblib

# import pandas_profiling as pp

warnings.filterwarnings('ignore')
gc.enable()
%matplotlib inline

<h2>Available Files</h2>

<br><a href="#Approach">back to main menu</a>

In [None]:
#CHECKING ALL AVAILABLE FILES
path='/kaggle/input/tabular-playground-series-sep-2021/'
data_files=list(os.listdir(path))
df_files=pd.DataFrame(data_files,columns=['file_name'])
df_files['size_in_mb']=df_files.file_name.apply(lambda x: round(os.path.getsize(path+x)/(1024*1024),2))
df_files['type']=df_files.file_name.apply(lambda x:'file' if os.path.isfile(path+x) else 'directory')
df_files['file_count']=df_files[['file_name','type']].apply(lambda x: 0 if x['type']=='file' else len(os.listdir(path+x['file_name'])),axis=1)

print('Following files are available under path:',path)
display(df_files)

<h2>All Custom Functions for this Notebook</h2>

<br><a href="#Approach">back to main menu</a>

In [None]:
#ALL CUSTOM FUNCTIONS

#FUNCTION FOR PROVIDING FEATURE SUMMARY
def feature_summary(df_fa):
    print('DataFrame shape')
    print('rows:',df_fa.shape[0])
    print('cols:',df_fa.shape[1])
    col_list=['null','unique_count','data_type','max/min','mean','median','mode','std','skewness','sample_values']
    df=pd.DataFrame(index=df_fa.columns,columns=col_list)
    df['null']=list([len(df_fa[col][df_fa[col].isnull()]) for i,col in enumerate(df_fa.columns)])
    #df['%_Null']=list([len(df_fa[col][df_fa[col].isnull()])/df_fa.shape[0]*100 for i,col in enumerate(df_fa.columns)])
    df['unique_count']=list([len(df_fa[col].unique()) for i,col in enumerate(df_fa.columns)])
    df['data_type']=list([df_fa[col].dtype for i,col in enumerate(df_fa.columns)])
    for i,col in enumerate(df_fa.columns):
        if 'float' in str(df_fa[col].dtype) or 'int' in str(df_fa[col].dtype):
            df.at[col,'max/min']=str(round(df_fa[col].max(),2))+'/'+str(round(df_fa[col].min(),2))
            df.at[col,'mean']=round(df_fa[col].mean(),4)
            df.at[col,'median']=round(df_fa[col].median(),4)
            df.at[col,'mode']=round(df_fa[col].mode()[0],4)
            df.at[col,'std']=round(df_fa[col].std(),4)
            df.at[col,'skewness']=round(df_fa[col].skew(),4)
        elif 'datetime64[ns]' in str(df_fa[col].dtype):
            df.at[col,'max/min']=str(df_fa[col].max())+'/'+str(df_fa[col].min())
        df.at[col,'sample_values']=list(df_fa[col].unique())
    display(df_fa.head())      
    return(df.fillna('-'))

#PREDICTION FUNCTIONS
import copy

def claim_predictor(X,y,test,model,model_name):
    """Fit `model` on stratified folds and return averaged test predictions,
    the best-scoring fitted model and averaged predictions on X."""

    df_preds=pd.DataFrame()
    df_preds_x=pd.DataFrame()
    k=1
    splits=5
    avg_score=0

    #CREATING STRATIFIED FOLDS
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=200)
    print('\nStarting KFold iterations...')
    for train_index,test_index in skf.split(X,y):
        df_X=X[train_index,:]
        df_y=y[train_index]
        val_X=X[test_index,:]
        val_y=y[test_index]

        #FITTING MODEL
        model.fit(df_X,df_y)

        #PREDICTING ON VALIDATION DATA (FOR SCORING) AND ON FULL X (FOR BLENDING)
        #NOTE: PREDICTIONS ON X INCLUDE ROWS THIS FOLD'S MODEL WAS TRAINED ON
        col_name=model_name+'xpreds_'+str(k)
        preds_x=pd.Series(model.predict_proba(val_X)[:,1])
        df_preds_x[col_name]=pd.Series(model.predict_proba(X)[:,1])

        #CALCULATING ROC AUC ON THE VALIDATION FOLD
        acc=roc_auc_score(val_y,preds_x)
        print('Iteration:',k,'  roc_auc_score:',acc)
        if k==1:
            score=acc
            #COPY THE FITTED MODEL; `model` ITSELF IS REFIT IN THE NEXT ITERATION
            best_model=copy.deepcopy(model)
            preds=pd.Series(model.predict_proba(test)[:,1])
            col_name=model_name+'preds_'+str(k)
            df_preds[col_name]=preds
        else:
            preds1=pd.Series(model.predict_proba(test)[:,1])
            preds=preds+preds1
            col_name=model_name+'preds_'+str(k)
            df_preds[col_name]=preds1
            if score<acc:
                score=acc
                best_model=copy.deepcopy(model)
        avg_score=avg_score+acc
        k=k+1
    print('\n Best score:',score,' Avg Score:',avg_score/splits)
    #TAKING AVERAGE OF PREDICTIONS
    preds=preds/splits

    print('Saving test and train predictions per iteration...')
    df_preds.to_csv(model_name+'.csv',index=False)
    df_preds_x.to_csv(model_name+'_.csv',index=False)
    x_preds=df_preds_x.mean(axis=1)
    return preds,best_model,x_preds
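`claim_predictor` stores predictions on the full `X` for every fold, so most of those predictions come from a model that was trained on the same rows. A common alternative is to keep out-of-fold predictions, where each training row is predicted only by the model that did not see it. The sketch below shows that variant under stated assumptions: `out_of_fold_predictions` is a hypothetical helper, and the data comes from `make_classification`, not from the competition files.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def out_of_fold_predictions(model, X, y, n_splits=5, seed=200):
    """Return one held-out probability per training row."""
    oof = np.zeros(len(y), dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        # Each row is predicted only by the model that did not train on it.
        oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    return oof

# Synthetic stand-in data; not the competition dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
oof_toy = out_of_fold_predictions(LogisticRegression(max_iter=1000), X_toy, y_toy)
print('OOF ROC AUC on toy data:', roc_auc_score(y_toy, oof_toy))
```

Scores computed on out-of-fold predictions are usually a less optimistic estimate of test performance, which may matter when they are used to pick blending ratios.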


In [None]:
%%time
#READING DATASET

df_train=pd.read_csv(path+'train.csv')
df_test=pd.read_csv(path+'test.csv')
df_submission=pd.read_csv(path+'sample_solution.csv')

<h2>Exploratory Data Analysis</h2>

<h2 id="Target">Understanding Target feature distribution</h2>
Let's visualize the target feature.

<h4>Observation</h4>
The training data contains almost equal numbers of claim and no-claim observations, so this is a balanced dataset.
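The same balance check can be done numerically. The sketch below uses a small made-up series as a stand-in for `df_train['claim']`; the name `toy_claim` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for df_train['claim']; the values are made up.
toy_claim = pd.Series([1, 0, 0, 1, 1, 0, 1, 0], name='claim')

counts = toy_claim.value_counts().sort_index()
shares = toy_claim.value_counts(normalize=True).sort_index()
balance = pd.DataFrame({'count': counts, 'share': shares})
print(balance)
```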

<br><a href="#Approach">back to main menu</a>

In [None]:
#Understanding Target (claim) feature distribution
pie_labels=['Claim-'+str(df_train['claim'][df_train.claim==1].count()),'No Claim-'+
            str(df_train['claim'][df_train.claim==0].count())]
pie_share=[df_train['claim'][df_train.claim==1].count()/df_train['claim'].count(),
           df_train['claim'][df_train.claim==0].count()/df_train['claim'].count()]
figureObject, axesObject = plt.subplots(figsize=(6,6))
pie_colors=('orange','grey')
pie_explode=(.01,.01)
axesObject.pie(pie_share,labels=pie_labels,explode=pie_explode,autopct='%.2f%%',colors=pie_colors,startangle=30,shadow=True)
axesObject.axis('equal')
plt.title('Percentage of Claim - No Claim Observations',color='blue',fontsize=12)
plt.show()

<h2 id="Corr">Correlation Check</h2>
Let's check whether any features are correlated. If two features are highly correlated, we can remove one of them.
This helps with dimensionality reduction.

<h4>Observation</h4>
No correlation is observed among Training dataset features.
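A heatmap with many features can be hard to read, so it may help to list the strongly correlated pairs explicitly. The sketch below is one way to do that; `correlated_pairs`, the 0.8 threshold and the synthetic columns `f1` to `f3` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Strict upper triangle: each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: f3 is nearly a scaled copy of f1, f2 is independent noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'f1': rng.normal(size=200), 'f2': rng.normal(size=200)})
toy['f3'] = 2.0 * toy['f1'] + 0.01 * rng.normal(size=200)
high = correlated_pairs(toy, threshold=0.8)
print(high)
```

An empty result on the real training data would support the observation above in numbers rather than by eye.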

<br><a href="#Approach">back to main menu</a>

In [None]:
#Correlation check
corr = df_train.iloc[:,1:].corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plotting correlation heatmap
plt.subplots(figsize=(22,20))
sns.heatmap(corr,mask=mask,xticklabels=corr.columns,yticklabels=corr.columns)
plt.show()

<h2 id="TrainVisual">Visualizing Training dataset</h2>
We use PCA, a dimensionality reduction technique, to visualize the training dataset.<br>
Visualization also helps in spotting any grouping or patterns within the dataset.
<h4>Observation</h4>
No pattern or grouping observed in training dataset
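How much a 3D PCA plot can reveal depends on how much of the total variance the first three components retain. The sketch below computes that share with a plain SVD on synthetic data; `explained_variance_ratio` is a hypothetical helper and the data is not the competition data. With scikit-learn, a fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components=3):
    """Share of total variance captured by each of the first principal components."""
    Xc = X - X.mean(axis=0)                        # PCA works on mean-centred data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2               # proportional to component variances
    return variances[:n_components] / variances.sum()

# Synthetic data with one dominant direction.
rng = np.random.default_rng(0)
toy = rng.normal(size=(500, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
ratios = explained_variance_ratio(toy, n_components=3)
print(ratios, ratios.sum())
```

If the three components together explain only a small share of the variance, the absence of visible groups in the plot says little about structure in the remaining dimensions.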

<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
X=df_train.drop(['id','claim'],axis=1)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

X_= imputer.fit_transform(X)
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X_)
principalDf = pd.DataFrame(data = principalComponents,columns = ['principal_component_1','principal_component_2','principal_component_3'])
principalDf['claim']=df_train['claim']

fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection = '3d')

ax.set_xlabel("principal_component_1")
ax.set_ylabel("principal_component_2")
ax.set_zlabel("principal_component_3")

sc=ax.scatter(xs=principalDf['principal_component_1'], ys=principalDf['principal_component_2'],
              zs=principalDf['principal_component_3'],c=principalDf['claim'],cmap='OrRd')
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()

In [None]:
gc.collect()

In [None]:
del X,X_
gc.collect()

<h2 id="FeatureSummary">Understanding Training dataset features</h2>
Understanding the training dataset features using basic statistical measures.

<h4>Observations</h4>

<br><a href="#Approach">back to main menu</a>

In [None]:
pd.set_option('display.max_rows', len(df_train.columns))
feature_summary(df_train)

In [None]:
feature_summary(df_submission)

In [None]:
gc.collect()

<h2 id="FeatureEng">Feature Engineering</h2>
Creating new features from the given features.

<h4>Observations</h4>
The missing value count per observation proved to be a very strong engineered feature for both linear and gradient boosting models.
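One way to inspect such a feature is to compare the claim rate across values of the missing count. The sketch below does this on a tiny made-up frame; the columns `f1` to `f3` and all values are assumptions, so the printed rates say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical feature columns; values are made up.
toy = pd.DataFrame({
    'f1': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'f2': [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    'f3': [0.5, 0.1, np.nan, np.nan, 0.2, 0.3],
    'claim': [1, 1, 0, 1, 0, 0],
})
feature_cols = ['f1', 'f2', 'f3']
toy['n_missing'] = toy[feature_cols].isna().sum(axis=1)

# The mean of a 0/1 target per group is the claim rate of that group.
claim_rate = toy.groupby('n_missing')['claim'].agg(['mean', 'size'])
print(claim_rate)
```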

<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
features=list(df_test.columns[1:])

df_train['n_missing'] = df_train[features].isna().sum(axis=1)
df_test['n_missing'] = df_test[features].isna().sum(axis=1)

df_train['std'] = df_train[features].std(axis=1)
df_test['std'] = df_test[features].std(axis=1)

df_train['mean'] = df_train[features].mean(axis=1)
df_test['mean'] = df_test[features].mean(axis=1)

df_train['max'] = df_train[features].max(axis=1)
df_test['max'] = df_test[features].max(axis=1)

df_train['min'] = df_train[features].min(axis=1)
df_test['min'] = df_test[features].min(axis=1)

df_train['kurt'] = df_train[features].kurtosis(axis=1)
df_test['kurt'] = df_test[features].kurtosis(axis=1)

features += ['n_missing', 'std','mean','max','min','kurt']

<h2 id="MissingValues">Handling missing values</h2>
There are a number of missing values in this dataset. We replace the missing values of each feature with that feature's mean.
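The cell below fills each dataset with its own column means. A common variant, shown here only as a sketch, computes the means on the training data and reuses them for the test data so that both sets receive identical fill values. The frames `toy_train` and `toy_test` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the train and test features.
toy_train = pd.DataFrame({'f1': [1.0, np.nan, 3.0], 'f2': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'f1': [np.nan, 4.0], 'f2': [np.nan, np.nan]})

train_means = toy_train.mean()               # NaNs are skipped by default
train_filled = toy_train.fillna(train_means)
test_filled = toy_test.fillna(train_means)   # reuse train statistics on test
print(test_filled)
```

Reusing the training means also covers the edge case where a test column is entirely missing, which a per-dataset mean cannot fill.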

<h4>Observations</h4>


<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
df_train[features] = df_train[features].fillna(df_train[features].mean())
df_test[features] = df_test[features].fillna(df_test[features].mean())

In [None]:
%%time
scaler = StandardScaler()
df_train[features] = scaler.fit_transform(df_train[features])
df_test[features] = scaler.transform(df_test[features])

In [None]:
X=df_train.drop(['id','claim'],axis=1).to_numpy()
y=df_train['claim'].values
test=df_test.drop(['id'],axis=1).to_numpy()

In [None]:
del df_train,df_test,scaler
gc.collect()

In [None]:
X.shape,y.shape,test.shape

<h2 id="LogisticRegression">LogisticRegression</h2>
Starting with a basic linear model.

<h4>Observations</h4>

<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
model=LogisticRegression()
print('Logistic Regression parameters:\n',model.get_params())

logistic_predictions,best_logistic_model,LRpreds=claim_predictor(X,y,test,model,'LR')

In [None]:
logistic_predictions

In [None]:
df_feature_impt=pd.DataFrame()
df_feature_impt['features']=features
df_feature_impt['importance']=best_logistic_model.coef_[0]

df_feature_impt.sort_values(by=['importance'],inplace=True,ascending=False)
plt.figure(figsize = (20,25))
sns.barplot(x=df_feature_impt['importance'],y=df_feature_impt['features'],data=df_feature_impt);

In [None]:
gc.collect()

<h2 id="CatBoost">CatBoostClassifier</h2>

<ul>
    <li>We are using GPU for training this model</li>
</ul>

<h4>Hyperparameters taken from the very informative notebook below</h4>
https://www.kaggle.com/mlanhenke/tps-09-simple-basic-stacking-lgbm-catb-xgb

<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
catb_params = {
    'eval_metric' : 'AUC',
    'iterations': 15585, 
    'objective': 'CrossEntropy',
    'bootstrap_type': 'Bernoulli', 
    'od_wait': 1144, 
    'learning_rate': 0.023575206684596582, 
    'reg_lambda': 36.30433203563295, 
    'random_strength': 43.75597655616195, 
    'depth': 7, 
    'min_data_in_leaf': 11, 
    'leaf_estimation_iterations': 1, 
    'subsample': 0.8227911142845009,
    'task_type' : 'GPU',
    'devices' : '0',
    'verbose' : 0
}
model=CatBoostClassifier(**catb_params)
print('CatBoost parameters:\n',model.get_params())

catb_predictions,best_catb_model,CBpreds=claim_predictor(X,y,test,model,'CB')

In [None]:
catb_predictions

In [None]:
df_feature_impt=pd.DataFrame()
df_feature_impt['features']=features
df_feature_impt['importance']=best_catb_model.feature_importances_

df_feature_impt.sort_values(by=['importance'],inplace=True,ascending=False)
plt.figure(figsize = (10,25))
sns.barplot(x=df_feature_impt['importance'],y=df_feature_impt['features'],data=df_feature_impt);

In [None]:
gc.collect()

<h2 id="LGBM">LGBMClassifier</h2>
<ul>
    <li>We are using GPU for training this model</li>
</ul>
    

<h4>Hyperparameters taken from the very informative notebook below</h4>
https://www.kaggle.com/mlanhenke/tps-09-simple-basic-stacking-lgbm-catb-xgb

<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
lgbm_params = {
    'metric' : 'auc',
    'objective' : 'binary',
    'device_type': 'gpu',
    'n_estimators': 10000, 
    'learning_rate': 0.12230165751633416, 
    'num_leaves': 1400, 
    'max_depth': 8, 
    'min_child_samples': 3100, 
    'reg_alpha': 10, 
    'reg_lambda': 65, 
    'min_split_gain': 5.157818977461183, 
    'subsample': 0.5, 
    'subsample_freq': 1, 
    'colsample_bytree': 0.2
}

model=lgb.LGBMClassifier(**lgbm_params)
print('LGBM parameters:\n',model.get_params())

lgb_predictions,best_lgb_model,LGBpreds=claim_predictor(X,y,test,model,'LGB')

In [None]:
lgb_predictions

In [None]:
df_feature_impt=pd.DataFrame()
df_feature_impt['features']=features
df_feature_impt['importance']=best_lgb_model.feature_importances_

df_feature_impt.sort_values(by=['importance'],inplace=True,ascending=False)
plt.figure(figsize = (10,25))
sns.barplot(x=df_feature_impt['importance'],y=df_feature_impt['features'],data=df_feature_impt);

In [None]:
gc.collect()

<h2 id="XGB">XGBClassifier</h2>

<ul>
    <li>We are using GPU for training this model</li>
</ul>

<h4>Hyperparameters taken from the very informative notebook below</h4>
https://www.kaggle.com/mlanhenke/tps-09-simple-basic-stacking-lgbm-catb-xgb

<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
xgb_params = {
    'eval_metric': 'auc', 
    'objective': 'binary:logistic', 
    'tree_method': 'gpu_hist',
    'gpu_id': 0,
    'predictor': 'gpu_predictor',
    'n_estimators': 10000, 
    'learning_rate': 0.01063045229441343, 
    'gamma': 0.24652519525750877, 
    'max_depth': 4, 
    'min_child_weight': 366, 
    'subsample': 0.6423040816299684, 
    'colsample_bytree': 0.7751264493218339, 
    'colsample_bylevel': 0.8675692743597421, 
    'lambda': 0, 
    'alpha': 10
}
model=xgb.XGBClassifier(**xgb_params)
print('XGB parameters:\n',model.get_params())

xgb_predictions,best_xgb_model,XGBpreds=claim_predictor(X,y,test,model,'XGB')

In [None]:
xgb_predictions

In [None]:
df_feature_impt=pd.DataFrame()
df_feature_impt['features']=features
df_feature_impt['importance']=best_xgb_model.feature_importances_

df_feature_impt.sort_values(by=['importance'],inplace=True,ascending=False)
plt.figure(figsize = (10,25))
sns.barplot(x=df_feature_impt['importance'],y=df_feature_impt['features'],data=df_feature_impt);

In [None]:
gc.collect()

<h2>Blending</h2>

<h2 id="Ratios">Calculating best blending ratios (using training predictions to calculate blending ratios)</h2>
We try to find the best blending ratios on the training dataset and then apply the same ratios to the predicted test values.

<h4>Observation</h4>
<ul>
    <li>This approach led to selecting a ratio of 1.0 for the best model and 0.0 for the others.</li>
    <li>Looking at the results of the models above, different models appear to work better than others on different parts of the dataset. How can we bring this into our blending strategy?</li>
    <li>As a changed blending strategy, we try to find the best blending ratios with every ratio greater than zero.</li>
</ul>
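Floating-point sums are not exact (for example, `0.1+0.2` differs from `0.3`), so one way to enumerate candidate ratios is to work on an integer grid and divide only at the end. The sketch below generates every combination of strictly positive weights in steps of 0.05 that sums to 1; `weight_grid` and its parameters are hypothetical names introduced for illustration.

```python
from itertools import product

def weight_grid(n_models, step_count=20, min_steps=1):
    """Yield weight tuples on a 1/step_count grid that sum to 1."""
    for combo in product(range(min_steps, step_count + 1), repeat=n_models - 1):
        last = step_count - sum(combo)
        if last >= min_steps:
            # Integer arithmetic decides membership; division happens only here.
            yield tuple(c / step_count for c in combo + (last,))

grid = list(weight_grid(n_models=4, step_count=20))
print(len(grid), grid[0])
```

Each tuple from `grid` could then be scored with `roc_auc_score` in a single loop. Setting `min_steps=0` would also allow zero weights.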


<br><a href="#Approach">back to main menu</a>

In [None]:
%%time
# blending_ratios=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
blending_ratios=[0.0,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1.0]
roc_final=0
LR_ratio=0
LGB_ratio=0
CB_ratio=0
XGB_ratio=0
for i in blending_ratios:
    for j in blending_ratios:
        for k in blending_ratios:
            for l in blending_ratios:
                # compare with a tolerance: sums of floats such as 0.05 steps are not exact
                if(np.isclose(i+j+k+l,1.0) and (i>0 and j>0 and k>0 and l>0)):
                    roc_new=roc_auc_score(y,(LRpreds*i+LGBpreds*j+CBpreds*k+XGBpreds*l))
                    print("LRratio: ",i," LGBratio: ",j," CBratio:",k," XGBratio: ",l," ROCscore: ",roc_new)
                    if roc_new>roc_final:
                        roc_final=roc_new
                        LR_ratio=i
                        LGB_ratio=j
                        CB_ratio=k
                        XGB_ratio=l
print("Final Ratios, LR ratio: ",LR_ratio," LGB ratio: ",LGB_ratio," CB ratio:",CB_ratio," XGB ratio: ",XGB_ratio," ROC score: ",roc_final)

<h2 id="BlendPred">Calculating blended prediction</h2>

<br><a href="#Approach">back to main menu</a>

In [None]:
df_submission['claim']=lgb_predictions*LGB_ratio+catb_predictions*CB_ratio+logistic_predictions*LR_ratio+xgb_predictions*XGB_ratio
#CREATING SUBMISSION FILE
df_submission.to_csv('submission.csv',index=False)
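The blended prediction is a weighted average of the per-model probabilities. The sketch below expresses that as a small helper that also checks the weights sum to 1; `blend` and the three made-up probability arrays are assumptions for illustration, not the notebook's actual predictions.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of per-model probability arrays."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError('blend weights should sum to 1')
    stacked = np.column_stack(predictions)   # shape: (n_rows, n_models)
    return stacked @ weights

# Made-up probabilities from three hypothetical models.
p_a = np.array([0.2, 0.8, 0.5])
p_b = np.array([0.4, 0.6, 0.5])
p_c = np.array([0.0, 1.0, 0.5])
blended = blend([p_a, p_b, p_c], [0.5, 0.25, 0.25])
print(blended)
```

With weights that sum to 1, the blended values stay within the range of the individual model predictions, so they remain valid probabilities.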

In [None]:
df_submission