## Introduction

**Kaggle** competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new to data science. Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The dataset used for this competition, [**Tabular Playground Series - Sep 2021**](https://www.kaggle.com/c/tabular-playground-series-sep-2021), is synthetic, but it is based on a real dataset and was generated using CTGAN. The original dataset deals with predicting whether a claim will be made on an insurance policy. Although the features are anonymized, they have properties relating to real-world features.

The ground truth `claim` is binary-valued, but a prediction may be any number from **0.0 to 1.0**, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

Submissions are evaluated on the **area under the ROC curve** between the predicted probability and the observed target.
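
To make the metric concrete, here is a minimal sketch that computes the area under the ROC curve on a few made-up labels and probabilities using scikit-learn's `roc_auc_score`. The values are illustrative only and are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

auc_raw = roc_auc_score(y_true, y_prob)

# AUC depends only on how the predictions are ranked, so a strictly
# increasing transform of the scores leaves it unchanged.
auc_squared = roc_auc_score(y_true, y_prob ** 2)

print(auc_raw, auc_squared)
```

Because only the ranking matters, the predictions do not need to be well-calibrated probabilities to score well on this metric.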

## Exploratory Data Analysis

In [None]:
# importing libraries
import numpy as np
import pandas as pd

from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
# read data into dataframe
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')

In [None]:
# first five rows
train.head()

In [None]:
# descriptive statistics
train.describe()

In [None]:
# checking for missing values
train.isnull().any()
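
`isnull().any()` only reports whether a column contains missing values. A minimal sketch of how the amount of missing data could be quantified is shown here on a small hypothetical frame (`toy` stands in for `train`; the column names and values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`.
toy = pd.DataFrame({
    "f1": [0.5, np.nan, 1.2, 0.7],
    "f2": [1.0, 2.0, 3.0, 4.0],
    "f3": [np.nan, np.nan, 0.1, 0.3],
})

missing_count = toy.isnull().sum()                   # NaNs per column
missing_share = toy.isnull().mean()                  # fraction of NaNs per column
rows_with_missing = toy.isnull().any(axis=1).sum()   # rows with at least one NaN

print(missing_count)
print(missing_share)
print(rows_with_missing)
```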

## Preprocessing

In [None]:
# predictor
X = train.drop(columns=['id','claim'])

# target
y = train['claim']

# test data 
test_df = test.drop(columns=['id'])

In [None]:
# preprocessing pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

X = pd.DataFrame(columns=X.columns,
                 data=pipeline.fit_transform(X))

test_df = pd.DataFrame(columns=test_df.columns,
                       data=pipeline.transform(test_df))
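
To illustrate what the two pipeline steps do, the sketch below runs the same `SimpleImputer` and `StandardScaler` combination on a tiny made-up array (`toy`, `toy_pipeline`, and `new_rows` are illustrative names, not part of the notebook). Missing entries are replaced by the column mean, so after scaling they end up at zero.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with one missing value per column.
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

toy_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Column means (2.0 and 15.0) fill the gaps, then each column is standardized.
transformed = toy_pipeline.fit_transform(toy)
print(transformed)

# New data is transformed with the statistics learned above, not refitted.
new_rows = np.array([[np.nan, 15.0]])
print(toy_pipeline.transform(new_rows))
```

This mirrors the notebook's use of `fit_transform` on the training features and `transform` on the test features.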

## Model - CatBoost Classifier

The parameters used for this model were taken from this [notebook](https://www.kaggle.com/mlanhenke/tps-09-optuna-study-catboostclassifier). Thanks, @mlanhenke.

In [None]:
# parameters

best_params = {
    'iterations': 15585, 
    'objective': 'CrossEntropy', 
    'bootstrap_type': 'Bernoulli', 
    'od_wait': 1144, 
    'learning_rate': 0.023575206684596582, 
    'reg_lambda': 36.30433203563295, 
    'random_strength': 43.75597655616195, 
    'depth': 7, 
    'min_data_in_leaf': 11, 
    'leaf_estimation_iterations': 1, 
    'subsample': 0.8227911142845009,
    'task_type' : 'GPU',
    'devices' : '0',
    'verbose' : 0
}

In [None]:
from catboost import CatBoostClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import roc_curve, auc

# k fold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

pred_fol = []
scores_list = []

for fold, (idx_train, idx_valid) in enumerate(kf.split(X)):
    X_train, y_train = X.iloc[idx_train], y.iloc[idx_train]
    X_valid, y_valid = X.iloc[idx_valid], y.iloc[idx_valid]
    
    # CatBoost Classifier
    model = CatBoostClassifier(**best_params)
    model.fit(X_train, y_train)

    # validation prediction
    pred_valid = model.predict_proba(X_valid)[:,1]
    fpr, tpr, _ = roc_curve(y_valid, pred_valid)
    score = auc(fpr, tpr)
    scores_list.append(score)
    
    print("Fold : {} Score : {}".format(fold + 1, score))
    print('--'*18)
    
    # test prediction
    y_pred = model.predict_proba(test_df)[:,1]
    pred_fol.append(y_pred)
    
print("Overall Validation Score : {}".format(np.mean(scores_list)))
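
The loop above scales and imputes the full training set before splitting it. One possible variation, sketched below under several assumptions, fits the preprocessing inside each fold so that validation rows do not influence the imputation means or scaling statistics. It uses synthetic data and `LogisticRegression` as a lightweight stand-in for CatBoost, and all names (`X_toy`, `oof`, `fold_model`) are hypothetical. It also collects out-of-fold predictions so that a single AUC over all rows can be reported.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the competition dataset.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan  # inject roughly 5% missing values

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y_toy))  # out-of-fold predictions
fold_scores = []

for idx_train, idx_valid in kf_toy.split(X_toy):
    # Imputer and scaler are fitted on the training part of the fold only.
    fold_model = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),  # stand-in for CatBoost
    ])
    fold_model.fit(X_toy[idx_train], y_toy[idx_train])
    oof[idx_valid] = fold_model.predict_proba(X_toy[idx_valid])[:, 1]
    fold_scores.append(roc_auc_score(y_toy[idx_valid], oof[idx_valid]))

print("Mean fold AUC:", np.mean(fold_scores))
print("Out-of-fold AUC:", roc_auc_score(y_toy, oof))
```

With mean imputation and standard scaling on a large dataset the difference is likely small, so this is a matter of hygiene rather than a correction to the scores above.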

In [None]:
# average predictions
pred = np.mean(np.column_stack(pred_fol),axis=1)
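
As a small illustration of this averaging step, the sketch below uses three hypothetical per-fold prediction vectors (made-up values) and shows that stacking them as columns and averaging across `axis=1` gives one blended prediction per test row.

```python
import numpy as np

# Hypothetical per-fold test predictions for three test rows.
fold_preds = [
    np.array([0.2, 0.6, 0.9]),
    np.array([0.4, 0.5, 0.7]),
    np.array([0.3, 0.7, 0.8]),
]

stacked = np.column_stack(fold_preds)      # shape: (test rows, folds)
avg_by_stack = stacked.mean(axis=1)        # mean across folds for each row
avg_direct = np.mean(fold_preds, axis=0)   # same result without stacking

print(stacked.shape)
print(avg_by_stack)
```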

In [None]:
# submission
submission['claim'] = pred
submission.to_csv('submission.csv', index=False)

**Thank You**