# Intro
Welcome to the 2021 monthly Kaggle Tabular Playground Series. This notebook covers the [February](https://www.kaggle.com/c/tabular-playground-series-feb-2021/overview) competition.
![](https://storage.googleapis.com/kaggle-competitions/kaggle/25225/logos/header.png)

This notebook is a simple tutorial for the second competition of the series. For feature encoding techniques, we recommend [this notebook](https://www.kaggle.com/drcapa/categorical-feature-engineering-2-xgb).

<span style="color: royalblue;">Please upvote the notebook if it helps you, and feel free to leave a comment. Thank you. </span>

# Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings("ignore")

# Path

In [None]:
path = '/kaggle/input/tabular-playground-series-feb-2021/'
os.listdir(path)

# Load Data

In [None]:
train_data = pd.read_csv(path+'train.csv')
test_data = pd.read_csv(path+'test.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

# Overview

In [None]:
print('Number train samples:', len(train_data.index))
print('Number test samples:', len(test_data.index))
print('Number features:', len(train_data.columns))

In [None]:
print('Missing values on the train data:', train_data.isnull().sum().sum())
print('Missing values on the test data:', test_data.isnull().sum().sum())

# Feature Engineering

In [None]:
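# Derive the feature lists from the column names so that no column is skipped.
features_cat = [col for col in train_data.columns if col.startswith('cat')]
features_num = [col for col in train_data.columns if col.startswith('cont')]
no_features = ['id', 'target']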

In [None]:
print('number of categorical features:', len(features_cat))
print('number of numerical features:', len(features_num))

Encoding of categorical features:

In [None]:
le = LabelEncoder()
for col in features_cat:
    # Fit on the training column only; transform() raises a ValueError
    # if the test column contains a category that was not seen here.
    le.fit(train_data[col])
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])
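The loop above works only if every category in the test data also occurs in the training data. As a minimal sketch of one way to tolerate unseen categories, the following pandas-only variant encodes them as -1. The toy frames and the helper name `encode_with_train_categories` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_data / test_data.
toy_train = pd.DataFrame({'cat0': ['A', 'B', 'A', 'C']})
toy_test = pd.DataFrame({'cat0': ['B', 'D', 'A']})  # 'D' never appears in train


def encode_with_train_categories(train_col, test_col):
    """Integer-encode both columns using only the categories seen in train.

    Categories are sorted, so known values get the same codes that
    LabelEncoder would assign; values missing from train become -1.
    """
    categories = sorted(train_col.unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes


train_codes, test_codes = encode_with_train_categories(
    toy_train['cat0'], toy_test['cat0']
)
print(list(train_codes))
print(list(test_codes))
```

Whether -1 is a sensible code for unknown categories depends on the model; tree-based models can usually split on it like on any other integer.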

We create row-wise statistical features (mean, standard deviation, max, min and sum over the numerical columns) for every sample in the train and test data.

In [None]:
# Add the same row-wise statistics to both data sets.
for df in (train_data, test_data):
    df['mean'] = df[features_num].mean(axis=1)
    df['std'] = df[features_num].std(axis=1)
    df['max'] = df[features_num].max(axis=1)
    df['min'] = df[features_num].min(axis=1)
    df['sum'] = df[features_num].sum(axis=1)
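To make the row-wise statistics concrete, here is a small worked example on a hypothetical two-row frame (the column values are made up for illustration). Note that pandas computes `std` with `ddof=1`, i.e. the sample standard deviation.

```python
import pandas as pd

# Hypothetical two-row frame with three numerical columns.
toy = pd.DataFrame({
    'cont0': [0.0, 1.0],
    'cont1': [1.0, 1.0],
    'cont2': [2.0, 4.0],
})
cols = ['cont0', 'cont1', 'cont2']

row_stats = pd.DataFrame({
    'mean': toy[cols].mean(axis=1),
    'std': toy[cols].std(axis=1),  # pandas default: ddof=1 (sample std)
    'max': toy[cols].max(axis=1),
    'min': toy[cols].min(axis=1),
    'sum': toy[cols].sum(axis=1),
})
print(row_stats)
```

Because the sum is just the mean multiplied by the number of columns, `sum` and `mean` carry the same information; keeping both is harmless but adds nothing new.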

# EDA

Distribution of the numerical values:

In [None]:
train_data.boxplot(column=features_num, figsize=(10,4))
plt.show()

Distribution of the categorical data:

In [None]:
train_data.boxplot(column=features_cat, figsize=(10,4))
plt.show()

Correlation matrix:

In [None]:
corr = train_data.corr()
# Styler.set_precision was removed in pandas 2.0; format() works across versions.
corr.style.background_gradient(cmap='coolwarm', axis=None).format('{:.2f}')
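A full correlation matrix can be hard to read with many features. As a sketch, the column of correlations with the target can be extracted and ranked. The toy data and the helper name `rank_by_target_correlation` are hypothetical; only the column name `target` is taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
n = 200
# Hypothetical toy data: 'cont_a' drives the target, 'cont_b' is pure noise.
toy = pd.DataFrame({
    'cont_a': rng.normal(size=n),
    'cont_b': rng.normal(size=n),
})
toy['target'] = 2.0 * toy['cont_a'] + 0.1 * rng.normal(size=n)


def rank_by_target_correlation(df, target='target'):
    """Return absolute Pearson correlations with the target, strongest first."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False)


ranking = rank_by_target_correlation(toy)
print(ranking)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out that a tree model finds the feature useful.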

# Train, Validation And Test Data

In [None]:
X_train = train_data[train_data.columns.difference(no_features)]
y_train = train_data['target']
X_test = test_data[test_data.columns.difference(no_features)]

Scale Data:

In [None]:
mean = X_train.mean()
X_train = X_train-mean
std = X_train.std()
X_train = X_train/std
X_test = (X_test-mean)/std
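The scaling above uses the mean and standard deviation of all training rows, including those that later become the validation set. A minimal sketch of the alternative order, split first and then scale with statistics from the fitting rows only, is shown below on hypothetical toy data; the manual index split is a simple stand-in for `train_test_split`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
# Hypothetical toy feature matrix standing in for X_train before the split.
X = pd.DataFrame(
    rng.normal(loc=5.0, scale=3.0, size=(100, 2)),
    columns=['cont0', 'cont1'],
)

# Hold out 10% of the rows first (a simple stand-in for train_test_split).
shuffled = rng.permutation(len(X))
val_idx, fit_idx = shuffled[:10], shuffled[10:]
X_fit, X_holdout = X.iloc[fit_idx], X.iloc[val_idx]

# Compute the statistics on the fitting rows only, then reuse them everywhere.
fit_mean = X_fit.mean()
fit_std = X_fit.std()
X_fit_scaled = (X_fit - fit_mean) / fit_std
X_holdout_scaled = (X_holdout - fit_mean) / fit_std

print(X_fit_scaled.mean().round(6).tolist())
print(X_fit_scaled.std().round(6).tolist())
```

With a data set of this size the difference between the two orders is likely small, but scaling after the split keeps the validation score free of information from the held-out rows.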

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.1, random_state=2021)

In [None]:
print('Train shape:', X_train.shape)
print('Val shape:', X_val.shape)
print('Test shape:', X_test.shape)

# Model

In [None]:
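# 'gpu_hist' requires a GPU; newer XGBoost releases replace it with
# tree_method='hist' together with device='cuda'.
model = XGBRegressor(objective='reg:squarederror',
                     booster='gbtree',
                     eval_metric='rmse',
                     tree_method='gpu_hist',
                     n_estimators=1000,
                     learning_rate=0.02,
                     random_state=2021)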

In [None]:
model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
print('Score validation data:', np.sqrt(mean_squared_error(y_val, y_val_pred)))
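The validation score is the root mean squared error (RMSE). To judge whether a given RMSE is good, it helps to compare it with a trivial baseline that always predicts the mean target. The sketch below uses hypothetical toy values; for brevity the baseline uses the mean of the toy targets themselves, whereas in the notebook it would be the mean of `y_train`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical toy targets and predictions.
y_true = np.array([7.0, 8.0, 9.0, 6.0])
y_pred = np.array([7.5, 7.5, 8.0, 6.0])

model_rmse = rmse(y_true, y_pred)
# Baseline: always predict the mean of the targets.
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))
print('model RMSE:', model_rmse)
print('baseline RMSE:', baseline_rmse)
```

A model is only useful if its RMSE is clearly below the baseline value.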

# Analyse Training

Feature Importance:

In [None]:
importance = model.feature_importances_
plt.figure(figsize=(10, 6))
# X_train keeps the column order the model was fitted on.
x = list(X_train.columns)
plt.barh(x, 100*importance, color='orange')
plt.title('Feature Importance', loc='left')
plt.xlabel('Percentage')
plt.grid()
plt.show()
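The bar chart is easier to read when the bars are sorted. A minimal sketch, using hypothetical importances and feature names (in the notebook they would come from `model.feature_importances_` and `X_train.columns`):

```python
import numpy as np
import pandas as pd

# Hypothetical importances and feature names for illustration only.
toy_importance = np.array([0.10, 0.55, 0.05, 0.30])
toy_names = ['cat0', 'cont1', 'cat1', 'cont2']

importance_sorted = (
    pd.Series(100 * toy_importance, index=toy_names)
    .sort_values(ascending=True)  # barh draws the first entry at the bottom
)
print(importance_sorted)
# importance_sorted.plot.barh(color='orange') would then draw the sorted chart.
```

Sorting in ascending order places the most important feature at the top of a horizontal bar chart.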

Visualization of the error:

In [None]:
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

fig, axs = plt.subplots(1, 2, figsize=(22, 6))
fig.subplots_adjust(hspace = .5, wspace=.5)
axs = axs.ravel()
axs[0].plot(y_train, y_train_pred, 'ro')
axs[0].plot(y_train, y_train, 'blue')
axs[1].plot(y_val, y_val_pred, 'ro')
axs[1].plot(y_val, y_val, 'blue')
for i in range(2):
    axs[i].grid()
    axs[i].set_xlabel('true')
    axs[i].set_ylabel('pred')
axs[0].set_title('train')
axs[1].set_title('val')
plt.show()

# Predict Test Data

In [None]:
y_test = model.predict(X_test)

In [None]:
output = samp_subm.copy()
output['target'] = y_test
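Before writing the file, a few sanity checks can catch a malformed submission early. The following is a sketch under the assumption that the submission has an `id` and a `target` column, as the code above suggests; the helper name `check_submission` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def check_submission(output, sample, target_col='target', id_col='id'):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(output.columns) == list(sample.columns), 'column mismatch'
    assert len(output) == len(sample), 'row count mismatch'
    assert output[id_col].equals(sample[id_col]), 'ids differ from sample'
    assert np.isfinite(output[target_col]).all(), 'non-finite predictions'
    return True


# Hypothetical toy frames standing in for samp_subm and the predictions.
toy_sample = pd.DataFrame({'id': [0, 5, 15], 'target': [0.5, 0.5, 0.5]})
toy_output = toy_sample.copy()
toy_output['target'] = np.array([7.1, 7.4, 7.9])

is_valid = check_submission(toy_output, toy_sample)
print(is_valid)
```

In the notebook the call would be `check_submission(output, samp_subm)`.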

# Write Output

In [None]:
output.to_csv('submission.csv', index=False)