This notebook covers some essential EDA. We will:
- check for missing values
- plot distributions for the train and test features
- plot distribution of the train target
- plot correlations
- check for outliers

Let's get to it!

# Import libraries

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 25)
import seaborn as sns
sns.set()
sns.set_palette('Set2')
from pathlib import Path

import os
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
        
input_path = Path('/kaggle/input/tabular-playground-series-feb-2021/')

# Read in the data files

In [None]:
df_train = pd.read_csv(input_path / 'train.csv', index_col='id')
display(df_train.head())

In [None]:
df_test = pd.read_csv(input_path / 'test.csv', index_col='id')
display(df_test.head())

In [None]:
df_submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')
display(df_submission.head())

# Check for missing values
There are no missing values in either set, so we don't have to deal with them.

In [None]:
print('Are there missing values in train set?', df_train.isnull().values.any())
print('Are there missing values in test set?', df_test.isnull().values.any())
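
If either check had returned True, the natural next step would be a per-column count. Here is a minimal sketch on a made-up toy frame; the `missing_summary` helper and the toy data are purely illustrative and not part of the competition data:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values, only for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "cat0": ["A", None, "B", "A"],
    "cont0": [0.1, 0.2, np.nan, np.nan],
    "cont1": [1.0, 2.0, 3.0, 4.0],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

On the real data this would be called as `missing_summary(df_train)` and `missing_summary(df_test)`; both should come back empty here.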

# Plot distributions

To check the distribution of each feature and compare the train and test sets, we'll combine both into a common DataFrame and plot histograms normalized to probability. For continuous features, we'll also make a box plot.

## Continuous features

In [None]:
df_train_copy = df_train.drop('target', axis=1)
df_train_copy['set'] = 'train'
df_test_copy = df_test.copy()
df_test_copy['set'] = 'test'
df_common = pd.concat([df_train_copy, df_test_copy])
cat_features = [col for col in df_test.columns if col.startswith('cat')]
cont_features = [col for col in df_test.columns if col.startswith('cont')]

In [None]:
for feature in cont_features:
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw=dict(height_ratios=(.15, .85)))
    sns.boxplot(data=df_common, x=feature, y='set', ax=ax_box)
    sns.histplot(data=df_common, x=feature, kde=True, hue='set', ax=ax_hist, stat='probability', common_norm=False)
    plt.show()

### Plot again with logarithmic y-axis to check for outliers

In [None]:
for feature in cont_features:
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw=dict(height_ratios=(.15, .85)))
    sns.boxplot(data=df_common, x=feature, y='set', ax=ax_box)
    sns.histplot(data=df_common, x=feature, kde=True, hue='set', ax=ax_hist, stat='probability', common_norm=False)
    ax_hist.set_yscale('log')
    plt.show()

### Observations:
- The box plots are misleading, since the features are far from normally distributed. I don't really see any nasty outliers, though: there are some weird values below 0 in cont5, but I don't consider them problematic for modelling.
- The train and test sets follow the same distributions.
- For certain models, it will be necessary to normalize the features, e.g. via quantile transformation.
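
The claim that train and test follow the same distributions is a visual judgement from the plots. One possible way to put a number on it is the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. The sketch below is a plain-Python illustration on made-up samples (the `ks_statistic` helper is hypothetical); in practice `scipy.stats.ks_2samp` does the same job and also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Share of values <= x
        return bisect_right(sorted_values, x) / len(sorted_values)

    # Both ECDFs are step functions, so the gap only changes at observed values
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x)) for x in a_sorted + b_sorted)


# Hypothetical toy samples standing in for one feature in train and test
same = ks_statistic([0.1, 0.4, 0.7], [0.1, 0.4, 0.7])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(same, shifted)  # prints 0.0 0.5
```

A value close to 0 for every continuous feature would back up the visual impression; a value close to 1 would indicate a shift between the sets.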

## Try to normalize continuous features

Let's see what effect a quantile transformation (sklearn's quantile_transform) will have on the data.

In [None]:
from sklearn.preprocessing import quantile_transform

df_train_trans = df_train[cont_features].copy()
for feature in cont_features:
    df_train_trans[feature] = quantile_transform(
        df_train_trans[feature].values.reshape(-1, 1), n_quantiles=900,
        output_distribution='normal'
    )
fig, axs = plt.subplots(5, 3, figsize=(15, 25))
axs = axs.flatten()
for i, feature in enumerate(cont_features):
    sns.histplot(data=df_train_trans, x=feature, ax=axs[i])
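
To make it clearer what a quantile transformation to a normal distribution does, here is a minimal sketch of the idea: replace each value by its rank, turn the rank into a probability, and push that through the inverse normal CDF. This is not sklearn's implementation, which differs in details (for instance, it interpolates between `n_quantiles` reference points); the `rank_to_normal` helper and the toy values are made up for illustration.

```python
from statistics import NormalDist


def rank_to_normal(values):
    """Map values to normal scores via their ranks (ties broken by position)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf stays finite
        scores[idx] = normal.inv_cdf((rank + 0.5) / n)
    return scores


# Hypothetical, strongly skewed toy feature
skewed = [0.01, 0.02, 0.05, 0.1, 0.9]
scores = rank_to_normal(skewed)
print([round(s, 3) for s in scores])
```

Only the ordering of the values matters, so the large gap between 0.1 and 0.9 disappears and the scores come out symmetric around 0.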

## Categorical features
Plot the distribution of the categorical features. Since the probabilities differ widely within some features, we'll use a logarithmic y-axis.

In [None]:
fig, axs = plt.subplots(4, 3, figsize=(20, 30))
axs = axs.flatten()
for i, feature in enumerate(cat_features):
    sns.histplot(data=df_common, x=feature, hue='set', multiple="dodge", shrink=.8, ax=axs[i],
                 stat='probability', common_norm=False)
    axs[i].set_yscale('log')

## Distribution of target

In [None]:
f, (ax_box, ax_lin, ax_log) = plt.subplots(3, sharex=True, figsize=(10, 6))
sns.boxplot(data=df_train, x='target', ax=ax_box)
sns.histplot(data=df_train['target'], kde=True, stat='probability', ax=ax_lin)
sns.histplot(data=df_train['target'], stat='probability', ax=ax_log)
ax_log.set_yscale('log')

### Observations:
- It's a bimodal distribution for sure. We'll try to apply a Gaussian mixture model below.
- Looks like we have some outliers. I would be tempted to remove everything below 3.

In [None]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2)
gmm.fit(df_train.target.values.reshape(-1, 1))
df_gmm = df_train[['target']].copy()
df_gmm['target_class'] = gmm.predict(df_train.target.values.reshape(-1, 1))
display(df_gmm.head())
sns.histplot(data=df_gmm, x='target', hue='target_class', stat='probability');
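
Conceptually, the prediction step of a Gaussian mixture assigns each point to the component with the highest weighted density. The sketch below illustrates that rule in one dimension; the weights, means and standard deviations are made up and are not the values fitted above, and `assign_component` is a hypothetical helper.

```python
from statistics import NormalDist


def assign_component(x, components):
    """Return the index of the component with the largest weight * density at x."""
    scores = [weight * dist.pdf(x) for weight, dist in components]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical mixture parameters, NOT the ones fitted on the target
components = [
    (0.4, NormalDist(mu=6.5, sigma=0.5)),
    (0.6, NormalDist(mu=8.0, sigma=0.7)),
]
labels = [assign_component(x, components) for x in (6.0, 7.0, 8.5)]
print(labels)
```

The fitted counterparts of these parameters can be read from `gmm.weights_`, `gmm.means_` and `gmm.covariances_`.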

We can also try quantile transformation here again:

In [None]:
target_trans = quantile_transform(
    df_train['target'].values.reshape(-1, 1), n_quantiles=900,
    output_distribution='normal'
)
sns.histplot(data=target_trans, stat='probability');

# Correlations

No particularly strong correlations with the target anywhere.

In [None]:
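# restrict to numeric columns: newer pandas versions raise on string columns in corr()
corr = df_train[cont_features + ['target']].corr().abs()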
# np.bool was removed in NumPy 1.24, so use the builtin bool
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(14, 14))

# plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            cbar_kws={"shrink": .8})
# yticks
plt.yticks(rotation=0);
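
The heatmap is dense, so a sorted list of absolute correlations with the target can be easier to read. A minimal sketch on made-up toy data (the `rank_by_target_correlation` helper is hypothetical):

```python
import pandas as pd


def rank_by_target_correlation(df, target="target"):
    """Absolute Pearson correlation of each numeric column with the target, sorted."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "cont0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "cont1": [2.0, 1.0, 4.0, 3.0, 5.0],
    "cat0": list("ABABA"),
    "target": [2.0, 4.0, 6.0, 8.0, 10.0],
})
ranking = rank_by_target_correlation(toy)
print(ranking)
```

On the real data, `rank_by_target_correlation(df_train)` would list the continuous features from most to least correlated with the target.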

# Baseline model

Finally, let's make a simple baseline model with LightGBM, without any preprocessing or hyperparameter optimization. I'm using LightGBM simply because it needs no preprocessing to give a halfway decent model. Any improvements that we come up with should lead to an RMSE lower than **0.84523**.

In [None]:
import lightgbm as lgb

df_train[cat_features] = df_train[cat_features].astype('category')
params = {'metrics': 'rmse',
          'objective': 'regression'}
d_train = lgb.Dataset(df_train.drop('target', axis=1), label=df_train.target)
result = lgb.cv(params, d_train, stratified=False, num_boost_round=1000, early_stopping_rounds=10,
                return_cvbooster=True, verbose_eval=50)
print(f'RMSE: {result["rmse-mean"][-1]}')
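
As a reference point for the cross-validated RMSE above, it helps to know what a model with no features at all would score. Always predicting the mean of the target gives an RMSE equal to the target's (population) standard deviation. The sketch below shows this on made-up numbers; the helper names and the toy target are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_baseline_rmse(y_true):
    """RMSE of the constant prediction 'always the mean of y_true'."""
    mean = sum(y_true) / len(y_true)
    return rmse(y_true, [mean] * len(y_true))


# Hypothetical toy target, only for illustration
toy_target = [6.0, 7.0, 8.0, 9.0]
baseline = mean_baseline_rmse(toy_target)
print(baseline)
```

On the real data, the same number is `df_train['target'].std(ddof=0)`. The gap between that value and the LightGBM score indicates how much signal the features actually carry.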

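It is also interesting to compare the distributions of the predictions and the target: the distribution of the predictions is much narrower. Some narrowing is expected, since a model trained on squared error estimates the conditional mean of the target, and with features this weakly correlated with the target the predictions stay close to the overall mean.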

In [None]:
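# the CVBooster wraps one booster per fold; predict() returns one array per fold
regressor = result['cvbooster']
# average the fold predictions for each training row
prediction = np.array(regressor.predict(df_train.drop('target', axis=1))).mean(axis=0)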
df = df_train.drop(cat_features + cont_features, axis=1)
df['prediction'] = prediction
sns.histplot(data=df.melt(), x='value', hue='variable');