# Welcome to the September 2021 Tabular Playground Competition! #

In this competition, we predict whether a customer will make an insurance claim.

# Data #

The full dataset has almost one million rows. We'll load only the first 25,000 so we can explore the data more quickly.

In [None]:
import pandas as pd
from pathlib import Path

data_dir = Path('../input/tabular-playground-series-sep-2021/')

df_train = pd.read_csv(
    data_dir / "train.csv",
    index_col='id',
    nrows=25000,  # comment out this line to load the full dataset
)

FEATURES = df_train.columns[:-1]
TARGET = df_train.columns[-1]

df_train.head()

The target `'claim'` has binary outcomes: `0` for no claim and `1` for claim.
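
Before modeling, it can help to look at how the two classes are split and how much of each feature is missing. The following is a minimal sketch on a toy frame, not the competition data; the column names `f1` and `f2` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; the column names f1 and f2 are hypothetical.
toy = pd.DataFrame(
    {
        "f1": [0.2, np.nan, 0.5, 0.1],
        "f2": [1.0, 3.0, np.nan, np.nan],
        "claim": [0, 1, 1, 0],
    }
)

# Share of rows in each target class.
class_share = toy["claim"].value_counts(normalize=True).sort_index()

# Fraction of missing values in each feature column.
missing_share = toy.drop(columns="claim").isna().mean()

print(class_share)
print(missing_share)
```

The same two expressions should work on `df_train` with `TARGET` in place of `"claim"`.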

# Model #

Let's try out a simple XGBoost model. XGBoost handles missing values natively, though you could try imputing them instead. We use `XGBClassifier` (rather than `XGBRegressor`, for instance) because this is a classification problem.

In [None]:
from xgboost import XGBClassifier

X = df_train.loc[:, FEATURES]
y = df_train.loc[:, TARGET]

model = XGBClassifier(
    max_depth=3,
    subsample=0.5,
    colsample_bytree=0.5,
    n_jobs=-1,
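    # Uncomment to train on a GPU (recommended for the full training set).
    # Note: XGBoost 2.0+ deprecates 'gpu_hist' in favor of
    # tree_method='hist' together with device='cuda'.
    #tree_method='gpu_hist',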
    random_state=0,
)
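
The model above relies on XGBoost's built-in handling of missing values. If you would rather impute, one possible approach is median imputation with scikit-learn's `SimpleImputer`. This is only a sketch on a made-up array, not a recommendation for this dataset; the median strategy and the indicator columns are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (values are made up).
X_toy = np.array(
    [
        [1.0, np.nan],
        [3.0, 4.0],
        [np.nan, 8.0],
        [5.0, 6.0],
    ]
)

# Replace each NaN with its column median; add_indicator=True appends one
# 0/1 column per feature that had missing values, so the model can still
# see where data was absent.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_toy)

print(X_imputed)
```

In the notebook, the imputer could be combined with the classifier through `sklearn.pipeline.make_pipeline(imputer, model)`, so that the medians are learned separately inside each cross-validation fold.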


# Evaluation #

The evaluation metric is AUC, the area under the ROC (receiver operating characteristic) curve. Run the next code cell to evaluate the model.

In [None]:
from sklearn.model_selection import cross_validate
import warnings 
warnings.filterwarnings('ignore')

def score(X, y, model, cv):
    scoring = ["roc_auc"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
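    scores = pd.DataFrame(scores).T
    # Compute both statistics from the per-fold columns up front. With
    # lambdas inside assign(), the std would also include the freshly
    # added mean column, because assign() evaluates its arguments in order.
    return scores.assign(
        mean=scores.mean(axis=1),
        std=scores.std(axis=1),
    )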

scores = score(X, y, model, cv=2)

display(scores)

An AUC of 0.5 is what random guessing achieves, so anything above that means the model has learned something useful.
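
To make the metric more concrete: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all pairs. The helper name `pairwise_auc` is made up for this illustration, and the approach is far too slow for the real dataset.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The cost is O(n_pos * n_neg), so
    this is only meant for illustration on small inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
print(pairwise_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

The second call shows why 0.5 is the baseline: a model that gives every row the same score ranks no pair correctly and no pair incorrectly.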

# Make Submission #

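The predictions below are binary (`0` or `1`), but you're allowed to submit probabilities instead, which usually score better on a ranking metric like AUC. With the scikit-learn API, you would use the `predict_proba` method instead of `predict` and keep the column for class `1`.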

In [None]:
# Fit on all of the loaded training rows
model.fit(X, y)

X_test = pd.read_csv(data_dir / "test.csv", index_col='id')

# Make predictions
y_pred = pd.Series(
    model.predict(X_test),
    index=X_test.index,
    name=TARGET,
)

# Create submission file
y_pred.to_csv("submission.csv")
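
To see why submitting probabilities can matter for AUC, here is a small sketch that scores the same made-up predictions twice: once as probabilities and once after rounding to hard labels at 0.5. The labels and probabilities are invented for illustration and say nothing about how this model behaves on the competition data.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard labels: what a 0.5 threshold would produce.
hard = [int(p >= 0.5) for p in proba]

auc_proba = roc_auc_score(y_true, proba)
auc_hard = roc_auc_score(y_true, hard)

print(f"AUC with probabilities: {auc_proba:.3f}")
print(f"AUC with hard labels:   {auc_hard:.3f}")
```

Rounding throws away the ordering within each predicted class, so the hard labels score lower in this example. For the model in this notebook, the analogous change would be to pass `model.predict_proba(X_test)[:, 1]` to `pd.Series` in place of `model.predict(X_test)`.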