In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import LocalOutlierFactor

from scipy.stats import norm
import scipy.stats as st

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the Data

Here we will load the data into a pandas dataframe.

In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv')
display(train_df.head())
train_df.describe()

In [None]:
FEATURES = ['cont%d' % (i) for i in range(1, 15)]

# Cleaning the Dataset

Following the steps of the Machine Learning Checklist we will start by cleaning out invalid values and outliers from the dataset.

### Invalid Values

In [None]:
train_df.info()

Here we can see that there are no null values, so there is nothing to remove.
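
As a complement to `info()`, missing values can also be counted per column directly. The snippet below is a minimal sketch on a made-up frame (`toy_df` and its values are hypothetical, not the competition data) showing what the check looks like when a null is present.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (hypothetical values, one missing entry)
toy_df = pd.DataFrame({
    "cont1": [0.1, 0.5, np.nan, 0.9],
    "target": [7.2, 8.1, 8.0, 7.9],
})

# Count missing values per column; all zeros would mean nothing to clean
null_counts = toy_df.isna().sum()
print(null_counts)

# Rows containing at least one missing value could be dropped like this
clean_df = toy_df.dropna()
print(len(clean_df))
```

On the real data the same `isna().sum()` call should report zero for every column, which matches the `info()` output above.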

### Outliers

#### **Removing outliers is less of a science and more of an art form. So I will leave the choice up to you, but show you how to visualise these points.**

First we will look at outliers for the *target*.

We will add noise to the one dimensional features in order to "explode" the points out, helping us see the distributions and potential outliers.

We will use two methods for finding outliers:
* The first will consider a point to be an outlier if it is N standard deviations from the mean. N is defined as the threshold.
* A more complex form of outlier detection is LOF (Local Outlier Factor), which uses a point's 20 nearest neighbours to determine whether it lies in a low-density region (and is therefore potentially an outlier).
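
Before plotting anything, the two methods can be compared numerically. This is a rough sketch on synthetic data (the values, the two planted extremes and the `contamination=0.01` setting are assumptions chosen for illustration, not taken from the dataset).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic 1-D data: 1000 points near 8 plus two planted extreme values
values = np.concatenate([rng.normal(8.0, 0.7, size=1000), [0.0, 20.0]])

# Method 1: flag points at least `threshold` standard deviations from the mean
threshold = 4
z_score = np.abs((values - values.mean()) / values.std())
z_outliers = z_score >= threshold

# Method 2: LOF; fit_predict returns -1 for outliers and 1 for inliers,
# and `contamination` sets roughly what fraction of points gets flagged
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(values.reshape(-1, 1))
lof_outliers = labels == -1

print(z_outliers.sum(), lof_outliers.sum())
```

Note the difference in behaviour: the z-score rule only flags points far from the mean, while LOF flags a fixed share of the data, so it can also mark points inside the main grouping if they sit in a locally sparse spot.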

In [None]:
def plot_outliers(df, feature, threshold=3):
    mean, std = np.mean(df), np.std(df)
    z_score = np.abs((df-mean) / std)
    good = z_score < threshold

    print(f"Rejecting {(~good).sum()} points")
    visual_scatter = np.random.normal(size=df.size)
    plt.scatter(df[good], visual_scatter[good], s=2, label="Good", color="#4CAF50")
    plt.scatter(df[~good], visual_scatter[~good], s=8, label="Bad", color="#F44336")
    plt.legend(loc='upper right')
    plt.title(feature)
    plt.show();
    
    return good
    
def plot_lof_outliers(df, feature):
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.001, p=1)
    # fit_predict returns 1 for inliers and -1 for outliers;
    # change `contamination` above to flag more or fewer points
    good = lof.fit_predict(df) == 1
    print(f"Rejecting {(~good).sum()} points")
    
    visual_scatter = np.random.normal(size=df.size)
    plt.scatter(df[good], visual_scatter[good], s=2, label="Good", color="#4CAF50")
    plt.scatter(df[~good], visual_scatter[~good], s=8, label="Bad", color="#F44336")
    plt.legend(loc='upper right')
    plt.title(feature)
    plt.show();
    
    return good

### Target Outliers

In [None]:
good = plot_outliers(train_df['target'], 'target', threshold=4)

Above we can see that these points are very reasonable outliers: the target values form a clear grouping, and the points marked in red fall outside it. I will therefore remove these 17 rejected points.

In [None]:
train_df = train_df[good]
print('Now train_df has %d rows.' % (train_df.shape[0]))

Next we will look at the LOF outliers.

In [None]:
good = plot_lof_outliers(train_df['target'].values.reshape(train_df['target'].shape[0], -1), 'target')

The above is harder to read, as LOF has picked some points inside the main grouping. However, since there are only 300 points and I trust the LOF measurement, I am going to remove these points from the dataset as well.

In [None]:
train_df = train_df[good]
print('Now train_df has %d rows.' % (train_df.shape[0]))

### Feature Outliers

First we will look at the threshold outliers.

In [None]:
for feature in FEATURES:
    plot_outliers(train_df[feature], feature)

Above we can see that the majority of the features do not contain outliers; however, features *cont7*, *cont9*, *cont10* and *cont13* do contain some points that could be considered outliers.

We will now look at the **LOF (Local Outlier Factor)** outliers.

In [None]:
for feature in FEATURES:
    # LOF expects a 2D array, so reshape the column to shape (n_samples, 1)
    plot_lof_outliers(train_df[feature].values.reshape(train_df[feature].shape[0], -1), feature)

We can see from the above that there are a small number of reasonable outliers selected here. I am therefore not going to remove any of these points as outliers.

# Analysing Distributions

Here we will look at the distribution of each feature.

In [None]:
for feature in FEATURES:
    sns.violinplot(x=train_df[feature], inner='quartile', bw=0.1)
    plt.title(feature)
    plt.show();

The above shows us that each feature has a unique distribution which could likely be used to help our models make predictions.

# Empirical CDFs

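The graphs below show the empirical CDF of each feature. The markers sit at the percentiles that correspond to evenly spaced z-scores between -4 and 4 on a standard normal distribution, so they are denser in the tails than plain 10th/20th/.../90th percentiles would be.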
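
To make the choice of percentiles concrete, the sketch below prints the values produced by the same `st.norm.cdf(np.linspace(-4, 4, 10))` expression used in the plotting code, and evaluates them on a synthetic normal sample (the sample is an assumption for illustration only).

```python
import numpy as np
import scipy.stats as st

# z-scores evenly spaced from -4 to 4, mapped to percentiles of a standard normal
z = np.linspace(-4, 4, 10)
ps = 100 * st.norm.cdf(z)
print(np.round(ps, 3))

# For a sample that really is normal, the values at these percentiles are
# roughly evenly spaced; a skewed or clipped feature will not show that pattern
rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
print(np.round(np.percentile(sample, ps), 2))
```

This is why the marker line is a useful reference: any feature whose markers bunch up or spread unevenly along the x-axis is departing from a normal shape.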

In [None]:
def plot_cdf(df, feature):
    # Percentiles at evenly spaced z-scores; the last argument of linspace is the number of percentiles
    ps = 100 * st.norm.cdf(np.linspace(-4, 4, 10))
    x_p = np.percentile(df, ps)

    xs = np.sort(df)
    ys = np.linspace(0, 1, len(df))

    plt.plot(xs, ys * 100, label="ECDF")
    plt.plot(x_p, ps, label="Percentiles", marker=".", ms=10)
    plt.legend()
    plt.ylabel("Percentile")
    plt.title(feature)
    plt.show();

for feature in FEATURES:
    plot_cdf(train_df[feature], feature)

This is perhaps the most revealing visualisation. It shows us that our features (especially '*cont2*' and '*cont5*') have unusual distributions. '*cont2*' appears to turn into a categorical variable when greater than 0.4, and '*cont5*' follows a linear distribution once above 0.3.

This could suggest that these variables need to be split into additional features, or have functions applied to their values, to create a bigger distinction between very similar values.
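
One possible way to split such a variable is sketched below. Everything here is an assumption for illustration: the toy values, the split point of 0.4 read off the plot by eye, and the new column names (`cont2_is_high`, `cont2_low_part`, `cont2_high_level`). Whether it helps a model would need to be checked by validation.

```python
import numpy as np
import pandas as pd

# Toy values standing in for a feature such as cont2 (hypothetical data)
toy = pd.DataFrame({"cont2": [0.05, 0.21, 0.37, 0.41, 0.55, 0.55, 0.62]})

# Hypothetical split point, read off the ECDF by eye
split = 0.4

# Flag which regime each row falls in
toy["cont2_is_high"] = (toy["cont2"] > split).astype(int)

# Keep the continuous part below the split and zero it above
toy["cont2_low_part"] = toy["cont2"].where(toy["cont2"] <= split, 0.0)

# Treat values above the split as discrete levels (-1 marks "not in this regime")
high_codes = toy["cont2"].where(toy["cont2"] > split).rank(method="dense")
toy["cont2_high_level"] = high_codes.fillna(-1).astype(int)

print(toy)
```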

# Correlation

Here we look at the correlations among the features, and between each feature and the target.

In [None]:
# This plots a 16x16 grid of pairwise scatter plots for every column (id, the 14 features and the target)
# Note: I sometimes comment this out because it takes a few minutes to run and doesn't show much useful information.

pd.plotting.scatter_matrix(train_df, figsize=(10, 10));

We can see that the above graph is far too busy to show us much useful information. However, at least we know that there aren't any clear correlations between a particular variable and the target.

The one interesting thing from this plot is that the target values are almost exclusively in the upper half of the range.

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) 
sns.heatmap(train_df.drop(columns=['id']).corr(), annot=True, cmap='viridis', fmt='0.2f', ax=ax)

Above we can see a cluster of features (cont1, cont6-cont13) that appear to be quite highly correlated together. This suggests that dimensionality reduction techniques could be used to reduce these features to a smaller set.
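
As a rough illustration of that idea, the sketch below applies PCA to a synthetic block of correlated columns. The data is made up (five noisy copies of one hidden signal) and the 90% variance target is an arbitrary assumption; on the real features the number of components kept would differ.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for a block of correlated features: five noisy copies
# of one hidden signal (hypothetical data, not the competition features)
signal = rng.normal(size=(1000, 1))
correlated = signal + 0.3 * rng.normal(size=(1000, 5))

# Keep just enough components to explain 90% of the variance
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(correlated)

print(correlated.shape, "->", reduced.shape)
print(np.round(pca.explained_variance_ratio_, 3))
```

With real features on different scales it is usually worth standardising the columns first, since PCA is sensitive to the variance of each input.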

# Analyse the Target

In [None]:
sns.violinplot(x=train_df['target'], inner='quartile', bw=0.1)
plt.title('target')
plt.show();

This doesn't show us much that is interesting, other than that the target is grouped around its mean of 8, with some long tails out to either side.

Finally, we will look at the 2D histogram plots of each feature vs. the target. These can give a clue to unusual relationships between the target and the features.

**Note:** There is also code for a KDE plot but these take a long time to run.

In [None]:
for feature in FEATURES:
    #sns.kdeplot(x=train_df['target'], y=train_df[feature], bins=20, cmap='magma', shade=True) 
    plt.hist2d(x=train_df['target'], y=train_df[feature], bins=20)
    # x holds the target and y holds the feature, so label the axes to match
    plt.xlabel('target')
    plt.ylabel(feature)
    plt.title(feature)
    plt.show()