## Competition Overview 
The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone.

These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard.

## Competition Objective
The original dataset deals with predicting whether a claim will be made on an insurance policy.

## Evaluation Metric
Submissions are evaluated on the area under the **ROC curve** (AUC) between the predicted probability and the observed target.
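
As a minimal sketch of how this metric behaves, the snippet below computes the AUC with scikit-learn's `roc_auc_score` on a few made-up labels and scores (the toy values are illustrative assumptions, not competition data). It also shows that AUC depends only on how the predictions are ranked, not on their absolute scale.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_prob)
print(f"AUC: {auc:.2f}")  # prints AUC: 0.75

# Rescaling the scores keeps their order, so the AUC is unchanged
auc_scaled = roc_auc_score(y_true, y_prob * 0.5)
```

Because only the ordering matters, predicted probabilities do not need to be calibrated to score well on this metric.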

## Competition Timeline
- Start Date - September 1, 2021
- Entry deadline - Same as the Final Submission Deadline
- Team Merger deadline - Same as the Final Submission Deadline
- Final submission deadline - September 30, 2021


## Points to note:

- This competition does not award ranking points
- This competition does not count towards tiers

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
import warnings


# for feature importance study
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp
import shap

# ML
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn import metrics
import os
import gc

# Reproducibility
def set_seed(seed = 0):
    # Seed NumPy and Python hashing (TensorFlow is not used in this notebook)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    print('*** --- Set seed "%i" --- ***' %seed)
    
# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')


# matplotlib setting
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-sep-2021/test.csv')

# High-Level Data Overview

In [None]:
print("The shape of the train dataset is : ", train.shape)
print()
print("The shape of the test dataset is : ", test.shape)
print()

In [None]:
print("The first 2 rows of the train dataset are ")
train.head(2)

In [None]:
print("The first 2 rows of the test dataset are ")
test.head(2)

In [None]:
train.columns

In [None]:
print('Total missing values in the training data' , sum(train.isna().sum()))

### Observations:
- The *claim* column is the target variable
- The train data has 118 feature columns + 1 target column + 1 id column
- The test data has 118 feature columns + 1 id column
- The train data has around 1M rows and the test data has around 500k rows
- The training data contains 1,820,782 missing values in total

# Missing Values

In [None]:
train_df = train
test_df = test

missing_train_df = pd.DataFrame(train_df.isna().sum())
missing_train_df = missing_train_df.drop(['id', 'claim']).reset_index()
missing_train_df.columns = ['feature', 'missing value count']

missing_train_percent_df = missing_train_df.copy()
missing_train_percent_df['missing value percentage'] = (missing_train_df['missing value count']/train_df.shape[0])*100

missing_test_df = pd.DataFrame(test_df.isna().sum())
missing_test_df = missing_test_df.drop(['id']).reset_index()
missing_test_df.columns = ['feature', 'missing value count']

missing_test_percent_df = missing_test_df.copy()
missing_test_percent_df['missing value percentage'] = (missing_test_df['missing value count']/test_df.shape[0])*100

In [None]:
plt.rcParams["axes.labelsize"] = 10
sns.set_theme(style="whitegrid")
f, ax = plt.subplots(2,2,figsize=(15,35))
ax0_sns = sns.barplot(y="feature", x="missing value count", data=missing_train_df, color="#00CDFF",orient='h',ax=ax[0][0]).set_title('Missing values in TRAIN data')
# ax0_sns.set_xlabel("Missing Values",fontsize=3, weight='bold')
ax1_sns = sns.barplot(y="feature", x="missing value count", data=missing_test_df, color="#00CDFF",orient='h',ax=ax[0][1]).set_title('Missing values in TEST data')
# ax1_sns.set_xlabel("Missing Values",fontsize=3, weight='bold')
ax2_sns = sns.barplot(y="feature", x="missing value percentage", data=missing_train_percent_df, color="#00CDFF",orient='h',ax=ax[1][0]).set_title('Missing values % in TRAIN data')
# ax2_sns.set_xlabel("Missing Values Percentage",fontsize=3, weight='bold')
ax3_sns = sns.barplot(y="feature", x="missing value percentage", data=missing_test_percent_df, color="#00CDFF",orient='h',ax=ax[1][1]).set_title('Missing values % in TEST data')
# ax3_sns.set_xlabel("Missing Values Percentage",fontsize=3, weight='bold')

In [None]:
# train_data missing values
null_values_train = []
for col in train.columns:
    c = train[col].isna().sum()
    pc = np.round((100 * (c)/len(train)), 2)
    dict1 ={
        'Features' : col,
        'null_train (count)': c,
        'null_train (%)': pc
    }
    null_values_train.append(dict1)
DF1 = pd.DataFrame(null_values_train, index=None).sort_values(by='null_train (count)',ascending=False)


# test_data missing values
null_values_test = []
for col in test.columns:
    c = test[col].isna().sum()
    pc = np.round((100 * (c)/len(test)), 2)
    dict2 ={
        'Features' : col,
        'null_test (count)': c,
        'null_test (%)': pc
    }
    null_values_test.append(dict2)
DF2 = pd.DataFrame(null_values_test, index=None).sort_values(by='null_test (count)',ascending=False)


df = pd.concat([DF1, DF2], axis=1)
df#.head()

import plotly.graph_objects as go
# Percentages are kept numeric so the y-axis is a continuous scale
fig = go.Figure(data=[go.Scatter(x=DF1['Features'],
                             y=DF1["null_train (%)"], mode= 'markers',
                             name='Train', marker_color='#0004FF'),

                go.Scatter(x=DF2['Features'],
                             y=DF2["null_test (%)"], mode= 'markers',
                             name='Test', marker_color='#00CDFF')])
fig.update_traces(marker_line_color='black', marker_line_width=1.5, opacity=1)
fig.update_layout(title_text='Null Values In Each Feature (%)',
                  #template='plotly_dark',
                  paper_bgcolor='#FFFFFF',
                  plot_bgcolor='#FFFFFF',
                  width=750, height=500,
                  xaxis_title='Features', yaxis_title='Missing values (%)',
                  title_font={'color':'black', 'size': 24, 'family': 'sans-serif'})
fig.show()

Handling missing values can be tricky. Some good places to learn about it:
- [A Guide to Handling Missing values in Python](https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python)
- [How to Handle Missing Data with Python](https://machinelearningmastery.com/handle-missing-data-python/)
- [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)
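
As a small illustration of one common option from these guides, the sketch below applies mean imputation with scikit-learn's `SimpleImputer` to a toy array (the array values and the variable names `X_train_toy` and `X_test_toy` are assumptions made for the example). Setting `add_indicator=True` appends one 0/1 column per feature that had missing values during fitting, so the model can still see where values were absent.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: each feature has one missing entry in the training part
X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan]])
X_test_toy = np.array([[np.nan, np.nan]])

# Column means are learned on the training data only and reused for the test data
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_train_filled = imputer.fit_transform(X_train_toy)
X_test_filled = imputer.transform(X_test_toy)

print(X_train_filled)  # two imputed feature columns followed by two indicator columns
print(X_test_filled)   # [[ 2. 15.  1.  1.]]
```

Fitting on the training data and only transforming the test data keeps the fill values consistent between the two sets.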

# Target Distribution 

In [None]:
claim_df = pd.DataFrame(train_df['claim'].value_counts()).reset_index()
claim_df.columns = ['claim', 'count']

claim_percent_df = pd.DataFrame((train_df['claim'].value_counts()/train_df.shape[0])*100).reset_index()
claim_percent_df.columns = ['claim', 'percentage']

In [None]:
plt.rcParams["axes.labelsize"] = 3
sns.set_theme(style="whitegrid")
f, ax = plt.subplots(1,2,figsize=(15,5))
sns.barplot(x="claim", y="count", data=claim_df, color="#00CDFF",orient='v',ax=ax[0]).set_title('Target Distribution TRAIN data')
sns.barplot(x="claim", y="percentage", data=claim_percent_df, color="#00CDFF",orient='v',ax=ax[1]).set_title('Target Distribution % TRAIN data')

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#FFFFFF')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = "#FFFFFF"
run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = list(train_df.columns[1:26])



run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#1700FF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#00CDFF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#FFFFFF')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = "#FFFFFF"
run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = list(train_df.columns[26:51])



run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#1700FF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#00CDFF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#FFFFFF')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = "#FFFFFF"
run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = list(train_df.columns[51:76])



run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#1700FF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#00CDFF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#FFFFFF')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = "#FFFFFF"
run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = list(train_df.columns[76:101])



run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#1700FF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#00CDFF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#FFFFFF')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)
background_color = "#FFFFFF"
run_no = 0
for row in range(0, 4):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1  

features = list(train_df.columns[101:119])



run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#1700FF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#00CDFF')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1
ax18.remove()
ax19.remove()

plt.show()

# Correlation Heatmap

In [None]:
train = pd.read_csv(r'/kaggle/input/tabular-playground-series-sep-2021/train.csv', index_col='id')
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(16, 16), facecolor='#EAECEE')
cmap = sns.color_palette("Blues", as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.05, vmin=-0.05, center=0, annot=False,
            square=True, linewidths=.5, cbar_kws={"shrink": 0.75})

ax.set_title('Correlation heatmap', fontsize=24, y= 1.05)
colorbar = ax.collections[0].colorbar
# Keep the ticks inside the plotted range (vmin=-0.05, vmax=0.05)
colorbar.set_ticks([-0.05, 0, 0.05])


# Model Baseline - XGBoost

In [None]:
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv', 
                    index_col = 0)
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv', 
                   index_col = 0)
ss = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')
claim = train.claim

train.drop(['claim'], axis = 1, inplace = True)
column_name = train.columns

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train = imputer.fit_transform(train)
test = imputer.transform(test)

In [None]:
def model_imp_viz(model, columns, bias = 0.01):
    imp = pd.DataFrame({'importance': model.feature_importances_,
                        'features': columns}).sort_values('importance', 
                                                          ascending = False)
    fig, ax = plt.subplots(figsize = (10, 40))
    plt.title('Feature importances', size = 15, fontweight = 'bold', fontfamily = 'serif')

    # "Blues_r" is the reversed "Blues" palette: darkest bar for the most important feature
    sns.barplot(x = imp.importance, y = imp.features, edgecolor = 'black',
                palette = sns.color_palette("Blues_r", len(imp.features)))

    for i in ['top', 'right']:
        ax.spines[i].set_visible(False)

    rects = ax.patches
    labels = imp.importance
    for rect, label in zip(rects, labels):
        x_value = rect.get_width() + bias
        y_value = rect.get_y() + rect.get_height() / 2

        ax.text(x_value, y_value, round(label, 4), fontsize = 9, color = 'black',
                 ha = 'center', va = 'center')
    ax.set_xlabel('Importance', fontweight = 'bold', fontfamily = 'serif')
    ax.set_ylabel('Features', fontweight = 'bold', fontfamily = 'serif')
    plt.show()

In [None]:
# Create data sets for training (80%) and validation (20%)
X_train, X_valid, y_train, y_valid = train_test_split(train, claim, 
                                                      test_size = 0.2,
                                                      random_state = 42)

In [None]:
# The basic model
params = {'random_state': 0,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'auc'}

model = XGBClassifier(**params)

model.fit(X_train, y_train, verbose = False)

preds = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_valid, preds)
print('Valid AUC: ', metrics.auc(fpr, tpr))

In [None]:
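# plot_confusion_matrix was removed in scikit-learn 1.2; this is its replacement (scikit-learn >= 1.0)
metrics.ConfusionMatrixDisplay.from_estimator(model, X_valid, y_valid,
                                              cmap = 'Blues')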
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

In [None]:
model_imp_viz(model, column_name.values, bias = 0.0005)

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

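# X_valid is a NumPy array after imputation, so pass the column names explicitly
shap.summary_plot(shap_values, X_valid, feature_names = column_name)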

# Improved Model - XGB

In [None]:
# The tuned model: many shallow trees with strong regularization
params = {'n_estimators': 5000,
          'max_depth': 2,
          'colsample_bytree': 0.30,
          'learning_rate': 0.09,
          'reg_lambda': 18,
          'reg_alpha': 18,
          'random_state': 2021,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'auc'}

model = XGBClassifier(**params)

model.fit(X_train, y_train, verbose = False)

preds = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_valid, preds)
print('Valid AUC: ', metrics.auc(fpr, tpr))

In [None]:
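# plot_confusion_matrix was removed in scikit-learn 1.2; this is its replacement (scikit-learn >= 1.0)
metrics.ConfusionMatrixDisplay.from_estimator(model, X_valid, y_valid,
                                              cmap = 'Blues')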
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

In [None]:
model_imp_viz(model, column_name.values, bias = 0.0005)

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

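# X_valid is a NumPy array after imputation, so pass the column names explicitly
shap.summary_plot(shap_values, X_valid, feature_names = column_name)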

# k-folds

In [None]:
FOLDS = 10
ss.claim = np.zeros(len(ss.claim))
metric = []
kfold = KFold(n_splits = FOLDS, random_state = 2021, shuffle = True)
i = 1
for train_idx, test_idx in kfold.split(train):
    # KFold yields positional indices, so index the target Series with .iloc
    X_train, y_train = train[train_idx, :], claim.iloc[train_idx]
    X_test, y_test = train[test_idx, :], claim.iloc[test_idx]

    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    
    preds = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, preds)
    AUC = metrics.auc(fpr, tpr)    
    print('[FOLD #{}] Validation AUC: {:.5f}'.format(i, AUC))

    ss.claim += model.predict_proba(test)[:, 1] / FOLDS
    metric.append(AUC)
    i += 1
print('*'*50)
print('[ALL FOLDS] Mean Validation AUC: {:.5f}'.format(np.mean(metric)))

In [None]:
ss.to_csv('submission.csv', index = False)
ss.head()
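
The fold-averaging pattern used in the loop above can be sketched on synthetic data. This is an illustration under stated assumptions: `make_classification` generates the toy data, and `LogisticRegression` stands in for `XGBClassifier` so the example runs quickly without a GPU. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the competition data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, y_tr = X[:800], y[:800]   # plays the role of train
X_te = X[800:]                  # plays the role of test

n_folds = 5
kfold = KFold(n_splits=n_folds, shuffle=True, random_state=2021)
oof_pred = np.zeros(len(X_tr))   # out-of-fold predictions on the training rows
test_pred = np.zeros(len(X_te))  # test predictions averaged over the folds
fold_aucs = []

for fold, (trn_idx, val_idx) in enumerate(kfold.split(X_tr), start=1):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr[trn_idx], y_tr[trn_idx])

    # Score the held-out fold and accumulate this fold's share of the test prediction
    oof_pred[val_idx] = clf.predict_proba(X_tr[val_idx])[:, 1]
    test_pred += clf.predict_proba(X_te)[:, 1] / n_folds
    fold_aucs.append(roc_auc_score(y_tr[val_idx], oof_pred[val_idx]))
    print(f"[FOLD #{fold}] Validation AUC: {fold_aucs[-1]:.5f}")

print(f"[ALL FOLDS] Mean Validation AUC: {np.mean(fold_aucs):.5f}")
```

Each test row ends up with the mean of the per-fold predictions, which usually varies less than the prediction of any single fold model.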

# LightGBM

In [None]:
import pandas as pd
import numpy as np
import random
import time
import os
import gc

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter('ignore')

In [None]:
N_SPLITS = 5
N_ESTIMATORS = 20000
EARLY_STOPPING_ROUNDS = 200
VERBOSE = 1000
SEED = 42

In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
seed_everything(SEED)

In [None]:
INPUT = "../input/tabular-playground-series-sep-2021/"

train = pd.read_csv(INPUT + "train.csv")
test = pd.read_csv(INPUT + "test.csv")
submission = pd.read_csv(INPUT + "sample_solution.csv")

In [None]:
features = [col for col in test.columns if 'f' in col]
TARGET = 'claim'

target = train[TARGET].copy()

In [None]:
train['n_missing'] = train[features].isna().sum(axis=1)
test['n_missing'] = test[features].isna().sum(axis=1)

train['std'] = train[features].std(axis=1)
test['std'] = test[features].std(axis=1)

features += ['n_missing', 'std']
n_missing = train['n_missing'].copy()
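
To make the two row-wise features in the cell above concrete, here is a toy sketch (the frame `toy` and its values are made up for illustration). It computes the same quantities, the number of missing values per row and the per-row standard deviation, and shows one side effect worth knowing: a row with a single observed value has an undefined sample standard deviation, so pandas returns `NaN` for it.

```python
import numpy as np
import pandas as pd

# Toy frame with three feature columns (illustrative values only)
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [2.0, np.nan, 3.0],
    "f3": [3.0, 5.0, np.nan],
})
feature_cols = ["f1", "f2", "f3"]

# Count of missing values in each row
toy["n_missing"] = toy[feature_cols].isna().sum(axis=1)

# Row-wise sample standard deviation (NaNs are skipped, ddof=1 by default)
toy["std"] = toy[feature_cols].std(axis=1)

print(toy[["n_missing", "std"]])
# row 0: n_missing=0, std=1.0
# row 1: n_missing=2, std=NaN (only one observed value)
# row 2: n_missing=1, std=0.0
```

Any `NaN` values in the new `std` column are handled like other gaps once the column is part of the feature list passed to the mean fill.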

In [None]:
train[features] = train[features].fillna(train[features].mean())
test[features] = test[features].fillna(test[features].mean())

In [None]:
lgb_params = {
    'objective': 'binary',
    'n_estimators': N_ESTIMATORS,
    'random_state': SEED,
    'learning_rate': 5e-3,
    'subsample': 0.6,
    'subsample_freq': 1,
    'colsample_bytree': 0.4,
    'reg_alpha': 10.0,
    'reg_lambda': 1e-1,
    'min_child_weight': 256,
    'min_child_samples': 20,
    'importance_type': 'gain',
}

In [None]:
lgb_oof = np.zeros(train.shape[0])
lgb_pred = np.zeros(test.shape[0])
lgb_importances = pd.DataFrame()

skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

for fold, (trn_idx, val_idx) in enumerate(skf.split(X=train, y=n_missing)):
    print(f"===== fold {fold} =====")
    X_train = train[features].iloc[trn_idx]
    y_train = target.iloc[trn_idx]
    X_valid = train[features].iloc[val_idx]
    y_valid = target.iloc[val_idx]
    X_test = test[features]
    
    start = time.time()
    model = lgb.LGBMClassifier(**lgb_params)
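    # LightGBM >= 4.0 takes early stopping and logging as callbacks
    # (the callbacks below are available from LightGBM 3.3 onwards)
    model.fit(
        X_train, 
        y_train,
        eval_set=[(X_valid, y_valid)],
        eval_metric='auc',
        callbacks=[
            lgb.early_stopping(EARLY_STOPPING_ROUNDS),
            lgb.log_evaluation(VERBOSE),
        ],
    )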
    
    fi_tmp = pd.DataFrame()
    fi_tmp['feature'] = model.feature_name_
    fi_tmp['importance'] = model.feature_importances_
    fi_tmp['fold'] = fold
    fi_tmp['seed'] = SEED
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    lgb_importances = pd.concat([lgb_importances, fi_tmp], ignore_index=True)

    lgb_oof[val_idx] = model.predict_proba(X_valid)[:, -1]
    lgb_pred += model.predict_proba(X_test)[:, -1] / N_SPLITS

    elapsed = time.time() - start
    auc = roc_auc_score(y_valid, lgb_oof[val_idx])
    print(f"fold {fold} - lgb auc: {auc:.6f}, elapsed time: {elapsed:.2f}sec\n")

print(f"oof lgb roc = {roc_auc_score(target, lgb_oof)}")

np.save("lgb_oof.npy", lgb_oof)
np.save("lgb_pred.npy", lgb_pred)

In [None]:
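# Store the LightGBM test predictions next to the XGBoost k-fold average in ss['claim']
ss['lgb_TARGET'] = lgb_pred

# LightGBM-only submission, written with just the id and claim columns
ss[['id', 'lgb_TARGET']].rename(columns={'lgb_TARGET': 'claim'}).to_csv('submission_1.csv', index = False)
ss.head()

# Blend: simple average of the LightGBM and XGBoost predictions
ss['final'] = (ss['lgb_TARGET']+ss['claim'])/2
ss['claim1'] = ss['claim']
ss['claim'] = ss['final']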

In [None]:
ss1 = ss[['id','claim']]
ss1.to_csv('sub_2.csv',index=False)

Credits to the notebooks that helped me build this one:

- [Notebook by Jaupula](https://www.kaggle.com/jarupula/tps-sep-getting-started)
- [Notebook by Sharlto Cope](https://www.kaggle.com/dwin183287/tps-september-2021-eda)
- [Notebook by des](https://www.kaggle.com/desalegngeb/sept-2021-tps-eda-model)


Some of my other works
- [How did Covid-19 impact Digital Learning - EDA](https://www.kaggle.com/udbhavpangotra/how-did-covid-19-impact-digital-learning-eda)
- [EDA + Optuna: An attempt at a clean notebook](https://www.kaggle.com/udbhavpangotra/eda-optuna-an-attempt-at-a-clean-notebook)
- [Heart Attacks! Extensive EDA and visualizations :)](https://www.kaggle.com/udbhavpangotra/heart-attacks-extensive-eda-and-visualizations)
- [CommonLit Readibility Prize Extensive EDA + Model](https://www.kaggle.com/udbhavpangotra/commonlit-readibility-prize-extensive-eda-model)
