# EDA skewness

At first glance, the train.csv file in the data tab contains heavily skewed features. Is this reality or just an impression? There are simple methods to check the skewness and kurtosis of features, but let's start by performing an initial EDA on missing values, duplicates, outliers, and correlations, before going back to them.

If you're browsing this notebook, feel free to comment. I'll take into account the advice and comments to improve it.

[I. Importing libraries](#import)

[II. Loading the dataset](#load)

[III. Summary statistics](#stats)

[IV. Checking for null values](#null)

[V. Checking for duplicates](#duplicates)

[VI. Checking for outliers](#outliers)

[VII. Checking the correlations](#correlations)

[VIII. Checking the distribution of the dataset](#distribution)

For more information on handling large volumes of data with PySpark, see also: [EDA with PySpark](https://www.kaggle.com/cmarquay/eda-pyspark)

<a id="import"></a>
## I. Importing libraries

We're content with basic libraries for this EDA.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_rows', None)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

<a id="load"></a>
## II. Loading the dataset

We load only the training set to perform the EDA. We aren't supposed to look at the content of the test set, to avoid tuning our choices to it. We start by displaying the first few lines and some basic information.

The training set contains 957,919 rows and 120 columns. The id column acts as the index, and the claim column is our y target: both are of type int64. The remaining 118 columns are features of type float64, which constitute our X.

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
X = pd.read_csv("/kaggle/input/tabular-playground-series-sep-2021/train.csv")

In [None]:
X.head()

In [None]:
X.shape

In [None]:
X.info()
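
The cells above keep all 120 columns in `X`. As an optional variation, here is a minimal sketch of how the id, features and target could be separated. The `toy` frame and the names `features_toy` and `y_toy` are hypothetical stand-ins, not part of the notebook; only the column roles (id, features, claim) come from the description above.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv, with the same column roles
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f1": [0.1, 0.2, None],
    "f2": [1.5, None, 3.5],
    "claim": [1, 0, 1],
})

# Use id as the index, then split the target from the features
toy = toy.set_index("id")
y_toy = toy["claim"]
features_toy = toy.drop(columns="claim")

print(features_toy.shape)  # (3, 2)
```

With the real file, passing `index_col="id"` to `pd.read_csv` would probably have the same effect as `set_index`.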

<a id="stats"></a>
## III. Summary statistics

We display some statistics for the 120 columns, transposing the output for the sake of readability.

In [None]:
X.describe().T

<a id="null"></a>
## IV. Checking for null values

We find that only the id and claim columns are free of missing values. In every other column, however, the missing values represent less than 2% of the rows, and they appear to be distributed randomly across the training set.
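
As a rough sketch of how these two observations could be quantified, the share of missing values per column is the mean of the boolean mask, and counting missing values per row gives a first idea of whether the gaps cluster in particular rows. The `toy` frame is a hypothetical stand-in for the training set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training set
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [1.0, 2.0, np.nan, np.nan],
    "claim": [0, 1, 0, 1],
})

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isnull().mean()
print(missing_share)

# Number of missing values per row, to see whether gaps concentrate in some rows
missing_per_row = toy.isnull().sum(axis=1)
print(missing_per_row.value_counts().sort_index())
```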

In [None]:
X.isnull().sum()

In [None]:
np.max(X.isnull().sum() / X.shape[0])

In [None]:
plt.rcParams["figure.dpi"] = 100  # the dpi can be set to enhance the resolution of the image
# Configuring retina format
%config InlineBackend.figure_format = "retina"
sns.heatmap(X.isnull(), cmap="viridis", yticklabels=False)

<a id="duplicates"></a>
## V. Checking for duplicates

There are no duplicate lines in the training set.
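
One caveat worth checking: if the id column is part of the frame and every id is unique, no two complete rows can be identical, so the count is 0 by construction. A minimal sketch, on a hypothetical `toy` frame, of how the check could be repeated with the id left out:

```python
import pandas as pd

# Hypothetical stand-in: rows 0 and 1 differ only by their id
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f1": [0.5, 0.5, 0.7],
    "f2": [1.0, 1.0, 2.0],
    "claim": [1, 1, 0],
})

with_id = toy.duplicated().sum()
without_id = toy.drop(columns="id").duplicated().sum()

print(with_id)     # 0
print(without_id)  # 1
```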

In [None]:
X.duplicated().sum()

<a id="outliers"></a>
## VI. Checking for outliers

We display the boxplots of the 120 columns to check for the presence of outliers.
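
To put numbers next to the plots, here is a sketch that counts, per column, the values lying beyond 1.5 times the interquartile range from the quartiles, which is the default whisker rule of a box plot. The function name `count_iqr_outliers` and the `toy` frame are hypothetical, and the quartiles are computed with pandas' default interpolation, so the counts may differ marginally from what the plots draw.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per column, the values lying beyond k * IQR from the quartiles."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    below = frame.lt(q1 - k * iqr)  # the Series aligns on column names
    above = frame.gt(q3 + k * iqr)
    return (below | above).sum()    # missing values compare as False

# Hypothetical stand-in for a few feature columns
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "f2": [10.0, 11.0, 12.0, 13.0, np.nan],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Dividing the result by the number of rows would give the share of outliers per column.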

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Box plot for features between 0 and 19", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[:20]):
    sns.boxplot(data=X, y=col, ax=axes[num_col // 4][num_col % 4])
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Box plot for features between 20 and 39", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[20:40]):
    sns.boxplot(data=X, y=col, ax=axes[num_col // 4][num_col % 4])
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Box plot for features between 40 and 59", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[40:60]):
    sns.boxplot(data=X, y=col, ax=axes[num_col // 4][num_col % 4])
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Box plot for features between 60 and 79", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[60:80]):
    sns.boxplot(data=X, y=col, ax=axes[num_col // 4][num_col % 4])
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Box plot for features between 80 and 99", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[80:100]):
    sns.boxplot(data=X, y=col, ax=axes[num_col // 4][num_col % 4])
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Box plot for features between 100 and 119", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[100:120]):
    sns.boxplot(data=X, y=col, ax=axes[num_col // 4][num_col % 4])
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

<a id="correlations"></a>
## VII. Checking the correlations

We see no obvious linear correlation between the features and the claim column, which is a bit annoying because no single feature stands out as a predictor. We also see no obvious correlation between the features themselves, which is on the other hand a good thing because it suggests that the features are not redundant. Apart from the diagonal, the correlations all lie between -0.06 and 0.03, which is very close to 0.
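
As a sketch of one way to read that range directly, the strict upper triangle of the correlation matrix holds each pair of columns exactly once and excludes the diagonal. The `toy` frame, built from random numbers, and the name `pairs` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: four independent random columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "claim"])

corr = toy.corr()  # Pearson correlation by default

# Indices of the strict upper triangle: each pair once, diagonal excluded
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.Series(
    corr.to_numpy()[rows, cols],
    index=[f"{corr.index[i]} ~ {corr.columns[j]}" for i, j in zip(rows, cols)],
)

print(pairs.min(), pairs.max())
print(pairs.abs().sort_values(ascending=False).head(3))
```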

In [None]:
sns.heatmap(X.corr(), cmap="RdYlGn")

In [None]:
X.corr().describe().T

In [None]:
corr = X.corr()  # compute the correlation matrix once and reuse it
corr[(corr < -0.06) | ((corr > 0.03) & (corr < 1.0))].sum().sum()

<a id="distribution"></a>
## VIII. Checking the distribution of the dataset

The seaborn distplot function was popular, but is now deprecated. We present here a way to obtain a comparable display of the distribution of each column using histplot.
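
The cells below derive the number of bins from `X[col].index` with the Freedman-Diaconis rule and cap it at 50. As a possible variation, here is a sketch that applies the same rule to the non-missing values of a column instead, computing the bin width by hand so that no array of bin edges is allocated for long-tailed columns. The helper name `fd_bin_count` is hypothetical.

```python
import numpy as np
import pandas as pd

def fd_bin_count(values: pd.Series, max_bins: int = 50) -> int:
    """Freedman-Diaconis bin count for the non-missing values, capped at max_bins."""
    clean = values.dropna().to_numpy()
    if clean.size < 2:
        return 1
    q75, q25 = np.percentile(clean, [75, 25])
    width = 2.0 * (q75 - q25) / np.cbrt(clean.size)  # h = 2 * IQR * n^(-1/3)
    span = clean.max() - clean.min()
    if width <= 0 or span <= 0:
        return 1  # constant column or zero IQR: fall back to a single bin
    return int(min(np.ceil(span / width), max_bins))

# Hypothetical columns to exercise the helper
rng = np.random.default_rng(0)
large_sample = pd.Series(rng.normal(size=100_000))
constant = pd.Series([3.0, 3.0, 3.0, np.nan])

print(fd_bin_count(large_sample))
print(fd_bin_count(constant))
```

The returned value could then be passed as `bins=` to `sns.histplot`.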

We find that some columns are skewed and/or heavy-tailed. As a rule of thumb for skewness, if a column has a value between -0.5 and 0.5, its distribution is approximately symmetric and everything is fine. If the value is between -1 and -0.5 or between 0.5 and 1, the skewness is moderate and we can use the column as is. If, on the other hand, the value is less than -1 or greater than 1, the column is highly skewed and we must intervene on it before going any further, because it can distort the results of the analysis. Note that the pandas kurt method returns the excess kurtosis (Fisher's definition), for which a normal distribution scores 0.
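
The thresholds above can be applied mechanically to the output of `skew()`. The following is a minimal sketch: the function name `skew_level`, the label strings and the `toy` frame are hypothetical, and the cut points are simply the ones quoted in the previous paragraph.

```python
import numpy as np
import pandas as pd

def skew_level(skewness: pd.Series) -> pd.Series:
    """Label each column according to the absolute value of its skewness."""
    return pd.cut(
        skewness.abs(),
        bins=[0, 0.5, 1, np.inf],
        labels=["approximately symmetric", "moderately skewed", "highly skewed"],
        include_lowest=True,  # so that a skewness of exactly 0 is labelled too
    )

# Hypothetical stand-in: one symmetric column and one with a long right tail
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_tail": rng.exponential(size=5000),
})
levels = skew_level(toy.skew())
print(levels)
```

Calling `levels.value_counts()` would then summarise how many columns fall into each group.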

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Density plot for features between 0 and 19", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[:20]):
    _, FD_bins = np.histogram(X[col].index, bins="fd")
    bin_nr = min(len(FD_bins)-1, 50)
    sns.histplot(data=X, x=col, bins=bin_nr, ax=axes[num_col // 4][num_col % 4], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Density plot for features between 20 and 39", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[20:40]):
    _, FD_bins = np.histogram(X[col].index, bins="fd")
    bin_nr = min(len(FD_bins)-1, 50)
    sns.histplot(data=X, x=col, bins=bin_nr, ax=axes[num_col // 4][num_col % 4], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
    axes[num_col // 4][num_col % 4].set_title(X[col].name)

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Density plot for features between 40 and 59", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[40:60]):
    _, FD_bins = np.histogram(X[col].index, bins="fd")
    bin_nr = min(len(FD_bins)-1, 50)
    sns.histplot(data=X, x=col, bins=bin_nr, ax=axes[num_col // 4][num_col % 4], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Density plot for features between 60 and 79", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[60:80]):
    _, FD_bins = np.histogram(X[col].index, bins="fd")
    bin_nr = min(len(FD_bins)-1, 50)
    sns.histplot(data=X, x=col, bins=bin_nr, ax=axes[num_col // 4][num_col % 4], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Density plot for features between 80 and 99", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[80:100]):
    _, FD_bins = np.histogram(X[col].index, bins="fd")
    bin_nr = min(len(FD_bins)-1, 50)
    sns.histplot(data=X, x=col, bins=bin_nr, ax=axes[num_col // 4][num_col % 4], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})

In [None]:
fig, axes = plt.subplots(5, 4, figsize=(20,25))
fig.suptitle("Density plot for features between 100 and 119", y=1)
fig.subplots_adjust(top=2)
plt.tight_layout()

for num_col, col in enumerate(X.columns[100:120]):
    _, FD_bins = np.histogram(X[col].index, bins="fd")
    bin_nr = min(len(FD_bins)-1, 50)
    sns.histplot(data=X, x=col, bins=bin_nr, ax=axes[num_col // 4][num_col % 4], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})

In [None]:
X.skew()

In [None]:
X.kurt()
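
The notebook does not say which intervention to apply to highly skewed columns. Purely as an illustration, and as an assumption rather than the author's method, here is a sketch of how a `log1p` transform reduces the skewness of a long right tail. It only applies to values greater than -1, and the `right_tail` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive column
rng = np.random.default_rng(0)
right_tail = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_before = right_tail.skew()
skew_after = np.log1p(right_tail).skew()  # log(1 + x), defined for x > -1

print(skew_before, skew_after)
```

For columns containing negative values, a different transform would be needed.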

Feel free to comment. I'll take into account the advice and comments to improve this notebook.

For more information on handling large volumes of data with PySpark, see also: [EDA with PySpark](https://www.kaggle.com/cmarquay/eda-pyspark)