# Exploratory Data Analysis (outliers, CDFs, categorical correlations)

Below is my exploratory analysis for the March tabular dataset.

Please let me know what you think in the comments and **upvote** if you find anything useful.

Thanks and enjoy!

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as ss

from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

from scipy.stats import norm
import scipy.stats as st

!pip install sklearn-contrib-py-earth
from pyearth import Earth

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the Data

Here we will load the data into a pandas dataframe.

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
display(train_df.head())
display(train_df.describe())  # a bare expression in the middle of a cell is not rendered, so display it explicitly
print(train_df.columns)

We can see here that we have 19 categorical variables (quite a few compared to previous months) and 11 continuous variables (`cont0` to `cont10`).

In [None]:
cont_FEATURES = ['cont%d' % (i) for i in range(0, 11)]
cat_FEATURES = ['cat%d' % (i) for i in range(0, 19)]
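
As a sanity check on the hard-coded ranges above, the feature lists can also be derived from the column names. This is only a minimal sketch, assuming the columns follow the `cat<N>` / `cont<N>` naming; `features_with_prefix` and `toy_df` are hypothetical names used for illustration, not part of the dataset or of pandas.

```python
import pandas as pd

def features_with_prefix(df, prefix):
    """Return the columns named '<prefix><number>', sorted by that number."""
    matches = [c for c in df.columns
               if c.startswith(prefix) and c[len(prefix):].isdigit()]
    return sorted(matches, key=lambda c: int(c[len(prefix):]))

# Empty toy frame standing in for train_df (hypothetical column set).
toy_df = pd.DataFrame(columns=['id', 'cat0', 'cat1', 'cat10', 'cat2',
                               'cont0', 'cont1', 'target'])

toy_cat_features = features_with_prefix(toy_df, 'cat')
toy_cont_features = features_with_prefix(toy_df, 'cont')
print(toy_cat_features)
print(toy_cont_features)
```

Sorting on the numeric suffix keeps `cat10` after `cat2`, which a plain string sort would not.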

# Cleaning the Dataset

Following the steps of the Machine Learning Checklist we will start by cleaning out invalid values and outliers from the dataset.

### Examine the target

In [None]:
print(set(train_df['target'].values))
train_df['target'].describe()

Above we can see that the target is binary and that the majority of the values (almost 75%) are 0.
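
The class balance can also be read off directly rather than inferred from `describe()`. Below is a small sketch on a made-up target column (`toy_target` is hypothetical and its values are not from the competition data).

```python
import pandas as pd

# Toy target standing in for train_df['target'] (hypothetical values).
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='target')

# Fraction of rows in each class, largest first.
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# For a 0/1 target the mean is simply the share of 1s.
positive_rate = toy_target.mean()
print(f"Share of 1s: {positive_rate:.3f}")
```

On the real data the same two calls would be applied to `train_df['target']`.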

### Invalid Values

In [None]:
train_df.info()

Here we can see that there are no null values, so there is nothing to remove.
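
Since `info()` reports non-null counts, it can be easier to count the nulls explicitly. A minimal sketch on a toy frame follows (`toy_df` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each of two columns (hypothetical data).
toy_df = pd.DataFrame({
    'cat0': ['A', 'B', None, 'A'],
    'cont0': [0.1, np.nan, 0.3, 0.4],
    'target': [0, 1, 0, 0],
})

# Number of missing entries per column, then overall.
nulls_per_column = toy_df.isna().sum()
total_nulls = int(nulls_per_column.sum())
print(nulls_per_column)
print("Total nulls:", total_nulls)
```

A total of zero on `train_df` would confirm there is nothing to impute or drop.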

### Outliers

Removing outliers is less of a science and more of an art form. So I will leave the choice up to you, but show you how to visualise these points.

We will add noise to the one dimensional features in order to "explode" the points out, helping us see the distributions and potential outliers.

We will consider a point to be an outlier if it lies more than N standard deviations from the mean, where N is the `threshold` argument.

In [None]:
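def plot_outliers(values, feature, threshold=3):
    # `values` is a single column (pandas Series); `feature` is only used as the plot title
    mean, std = np.mean(values), np.std(values)
    z_score = np.abs((values - mean) / std)
    good = z_score < threshold  # True for points within `threshold` standard deviations

    print(f"Rejecting {(~good).sum()} points")
    # Random vertical jitter so that overlapping one dimensional points spread out
    visual_scatter = np.random.normal(size=values.size)
    plt.scatter(values[good], visual_scatter[good], s=2, label="Good", color="#4CAF50")
    plt.scatter(values[~good], visual_scatter[~good], s=8, label="Bad", color="#F44336")
    plt.legend(loc='upper right')
    plt.title(feature)
    plt.show()

    return good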

### Feature Outliers


In [None]:
for feature in cont_FEATURES:
    plot_outliers(train_df[feature], feature)

We can see from the above that there are possibly some outliers for `cont8` that could be removed if you were struggling with the accuracy of your model.
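
If you do decide to drop those points, the same z-score rule can be used as a row filter. This is a hedged sketch on made-up data: `zscore_mask` and `toy_df` are hypothetical names, and the threshold of 3 simply mirrors the default used in this notebook.

```python
import numpy as np
import pandas as pd

def zscore_mask(values, threshold=3):
    """True where a value lies within `threshold` standard deviations of the mean."""
    z_score = np.abs((values - values.mean()) / values.std(ddof=0))
    return z_score < threshold

# Toy column: 99 evenly spread values plus one extreme point (hypothetical data).
toy_df = pd.DataFrame({'cont8': np.append(np.linspace(0.3, 0.7, 99), 5.0)})

keep = zscore_mask(toy_df['cont8'])
filtered_df = toy_df[keep]
print(f"Dropped {(~keep).sum()} of {len(toy_df)} rows")
```

Note that the extreme point itself inflates the mean and standard deviation, so a z-score rule becomes less sensitive the more outliers there are.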

# Analysing Distributions

Here we will look at the distributions of the features; correlations between them are covered further down.

First let's check that each row has its own unique id.

In [None]:
# True if the number of distinct ids equals the number of rows
train_df['id'].nunique() == len(train_df)

### Continuous Variables

In [None]:
for feature in cont_FEATURES:
    sns.violinplot(x=train_df[feature], inner='quartile', bw=0.1)
    plt.title(feature)
    plt.show();

In [None]:
for feature in cont_FEATURES:
    sns.violinplot(x='target', y=feature, data=train_df, inner='quartile');
    plt.title(feature)
    plt.show()

From the above analysis we can see that most of the features vary somewhat with the target value, but the differences are subtle, so no single feature is going to be a silver bullet.
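
To put rough numbers on what the violin plots show, we could compare per-class medians. The sketch below uses a toy frame (`toy_df`, `toy_cont_features` and the values are made up); the median gap is just one possible summary, not a substitute for looking at the full distributions.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical values).
toy_df = pd.DataFrame({
    'cont0': [0.2, 0.4, 0.6, 0.8],
    'cont1': [0.5, 0.5, 0.7, 0.9],
    'target': [0, 0, 1, 1],
})
toy_cont_features = ['cont0', 'cont1']

# Median of each feature within each target class.
per_class_median = toy_df.groupby('target')[toy_cont_features].median()
print(per_class_median)

# Absolute gap between the two classes, largest first.
median_gap = (per_class_median.loc[1] - per_class_median.loc[0]).abs()
median_gap = median_gap.sort_values(ascending=False)
print(median_gap)
```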

### Categorical Variables

First let's look at what values the categorical variables can take.

In [None]:
for cat in cat_FEATURES:
    values = train_df.groupby(cat)['id'].count().reset_index()
    sns.barplot(x=cat, y='id', data=values)
    plt.title(cat)
    plt.show();

This quick piece of analysis shows us that some categorical features are binary, while others have a large number of categories. We can also see that many of these features are heavily imbalanced, with one or two categories dominating, which could help us build a feature set to predict the target.

In [None]:
number_of_rows = train_df.shape[0]
for feature in cat_FEATURES:
    percentage_common_category = train_df.groupby(feature)['id'].count().reset_index()
    print(feature)
    print(percentage_common_category['id'].max() / number_of_rows)

In [None]:
# TODO: Stacked bar chart to show the percentage of target 0, that have a label, and the percentage of target 1 that have a label
# TODO: Like further down in this notebook https://www.kaggle.com/tsilveira/applying-heatmaps-for-categorical-data-analysis
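
One possible starting point for the stacked bar chart is a row-normalised crosstab. This is only a sketch on toy data (`toy_df` and its values are hypothetical); it computes the table and leaves the plotting call as a comment.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical values).
toy_df = pd.DataFrame({
    'cat0': ['A', 'A', 'B', 'A', 'B', 'B'],
    'target': [0, 0, 0, 1, 1, 1],
})

# One row per target value; each row sums to 1, so an entry is the share
# of that target class carrying the given label.
label_share_by_target = pd.crosstab(toy_df['target'], toy_df['cat0'],
                                    normalize='index')
print(label_share_by_target)

# label_share_by_target.plot(kind='bar', stacked=True) would draw the stacked bars.
```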

# Empirical CDFs

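The graphs below show the empirical CDF of each continuous feature. The markers sit at 10 percentiles taken from the normal CDF at evenly spaced z-scores between -4 and 4, so they crowd towards the tails rather than falling on the 10th/20th/.../90th percentiles.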

In [None]:
def plot_cdf(df, feature):
    ps = 100 * st.norm.cdf(np.linspace(-4, 4, 10)) # The last argument to linspace is the number of percentiles
    x_p = np.percentile(df, ps)

    xs = np.sort(df)
    ys = np.linspace(0, 1, len(df))

    plt.plot(xs, ys * 100, label="ECDF")
    plt.plot(x_p, ps, label="Percentiles", marker=".", ms=10)
    plt.legend()
    plt.ylabel("Percentile")
    plt.title(feature)
    plt.show();

for feature in cont_FEATURES:
    plot_cdf(train_df[feature], feature)

The majority of these continuous features' CDFs are smooth and show a relatively even distribution of values. However, we can see that `cont3` and `cont5` clearly have *steps* in their data, which suggests that these features could be discretized.
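
As a rough illustration of what discretizing such a feature could look like, here is a sketch using `pd.cut` on a made-up column. The values in `toy_feature` and the choice of three equal-width bins are assumptions for the example; on the real data the bin edges would be better placed where the steps in the ECDF actually occur.

```python
import pandas as pd

# Toy feature whose values cluster around three levels (hypothetical data).
toy_feature = pd.Series([0.10, 0.11, 0.12, 0.48, 0.50, 0.52, 0.88, 0.90, 0.91],
                        name='cont3')

# Three equal-width bins over the observed range, labelled 0, 1, 2.
toy_binned = pd.cut(toy_feature, bins=3, labels=False)
print(toy_binned.tolist())
```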

# Correlation

Here we look at the correlations among the features, and between each feature and the target.

In [None]:
# This plots a matrix of pairwise scatter plots for all the numeric columns, including the target
# Note: I sometimes comment this out because it takes a few minutes to run and doesn't show any useful information.

# pd.plotting.scatter_matrix(train_df, figsize=(10, 10));

### Continuous Features

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) 
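# Select the numeric columns explicitly; recent pandas versions raise an error if corr() is given string columns
sns.heatmap(train_df[cont_FEATURES + ['target']].corr(), annot=True, cmap='viridis', fmt='0.2f', ax=ax)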

It is promising to see a relatively high number of correlated feature pairs here. We can see some groups of features that could be suitable for PCA dimensionality reduction. For example `[cont1, cont2, cont3, cont7, cont8, cont9, cont10]`.
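
To give a feel for what PCA would do with a group of correlated features, here is a sketch on synthetic data. It computes PCA directly with NumPy's SVD so that the steps are visible; `pca_numpy` and `X_toy` are hypothetical names, and in practice `sklearn.decomposition.PCA` would be the more usual choice.

```python
import numpy as np

def pca_numpy(X, n_components):
    """Project the standardised columns of X onto their top principal components."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_std, full_matrices=False)
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return X_std @ Vt[:n_components].T, explained_ratio[:n_components]

# Synthetic data: three noisy copies of one latent signal (hypothetical).
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
X_toy = np.column_stack([latent + 0.1 * rng.normal(size=500) for _ in range(3)])

components, explained_ratio = pca_numpy(X_toy, n_components=1)
print(components.shape, explained_ratio)
```

When the columns are this strongly correlated, a single component captures almost all of the variance; the real features are less tightly linked, so more components would likely be needed.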


### Categorical Features



In [None]:
# I stole this method from here https://stackoverflow.com/questions/46498455/categorical-features-correlation/46498792#46498792

def cramers_v(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

cm = pd.DataFrame(columns=cat_FEATURES+['target'], index=cat_FEATURES+['target'])

for feature_1 in cat_FEATURES+['target']:
    for feature_2 in cat_FEATURES+['target']:
        confusion_matrix = pd.crosstab(train_df[feature_1], train_df[feature_2])
        #print(feature)
        #print(cramers_v(confusion_matrix.values))
        cm.at[feature_1, feature_2] = float(cramers_v(confusion_matrix.values))
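
To build some intuition for the statistic, the sketch below computes the plain (uncorrected) Cramér's V with NumPy on a toy frame. This is not the bias-corrected version used above; `cramers_v_uncorrected`, `toy_df` and the column names are hypothetical and only meant to show the two extremes of the scale.

```python
import numpy as np
import pandas as pd

def cramers_v_uncorrected(table):
    """Plain Cramér's V from a contingency table (no bias correction)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, k - 1)))

toy_df = pd.DataFrame({
    'cat_a': ['x', 'y', 'z'] * 4,
    'cat_b': ['p', 'q', 'r'] * 4,    # fully determined by cat_a
    'cat_c': ['u'] * 6 + ['v'] * 6,  # balanced across every level of cat_a
})

v_dependent = cramers_v_uncorrected(pd.crosstab(toy_df['cat_a'], toy_df['cat_b']).values)
v_independent = cramers_v_uncorrected(pd.crosstab(toy_df['cat_a'], toy_df['cat_c']).values)
print(v_dependent, v_independent)
```

A value near 1 means one variable almost determines the other, while a value near 0 means knowing one tells us nothing about the other.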

In [None]:
fig, ax = plt.subplots(figsize=(25,10)) 
sns.heatmap(cm.astype(float).values, vmin=0, vmax=1, xticklabels=cat_FEATURES+['target'], yticklabels=cat_FEATURES+['target'], annot=True, ax=ax)
plt.title('Categorical Features Correlation')
plt.show();

From the above we can see correlations among the features and even with the target. These can all help us decide whether to run dimensionality reduction such as Multiple Correspondence Analysis.

# Feature Engineering

This is still a work in progress, but next I will perform some feature engineering based on the above findings (such as PCA for the continuous features and MCA for the categorical features).

In [None]:
# TODO: Feature engineering based on the above findings

In [None]:
# TODO: PCA for continuous features

In [None]:
# TODO: MCA for categorical features
# https://pypi.org/project/mca/