<a id="section-top"></a>

# Table of Contents

**0)** [Introduction](#section-zero)

**1)** [Target Distribution](#section-one)

**2)** [Numerical Features - Distribution](#section-two)

**3)** [Numerical Features - Target](#section-three)

**4)** [Categorical Features - Distribution & Target](#section-four)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("max_info_columns", 300)

<a id="section-zero"></a>

# 0. Introduction

In [None]:
train = pd.read_csv("../input/tabular-playground-series-oct-2021/train.csv")

train.head(10)

`train.info()` lists every column only when the frame has at most `max_info_columns` columns (100 by default), so the option has to be raised before calling it:
> pd.set_option("max_info_columns", 300)
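
As a rough illustration of why the option matters, the sketch below captures the `info()` output for a hypothetical 150-column toy frame (`wide`, not part of the competition data), once under the default limit of 100 and once under the limit of 300 used here. `display.max_info_columns` is the full name of the option, and the helper `info_text` is made up for this example.

```python
import io

import pandas as pd

# Hypothetical toy frame: 3 rows, 150 integer columns named c0..c149.
wide = pd.DataFrame({f"c{i}": [0, 1, 2] for i in range(150)})


def info_text(frame, max_cols):
    """Return the text of frame.info() under a temporary column limit."""
    buf = io.StringIO()
    with pd.option_context("display.max_info_columns", max_cols):
        frame.info(buf=buf)
    return buf.getvalue()


short_report = info_text(wide, 100)  # pandas default limit
full_report = info_text(wide, 300)   # limit used in this notebook

# The full report has one row per column, so it is much longer.
print(len(short_report.splitlines()), len(full_report.splitlines()))
```

With the limit at 100 the report collapses to a single summary line for the columns; with 300 every column gets its own row with dtype and non-null count.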

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.isnull().sum().sum()

There are no missing values, nice.
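
The grand total alone would not say where gaps sit if there were any. A minimal sketch on a hypothetical toy frame (`toy`, with two values deliberately set to NaN) of breaking the same count down per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "f0" is complete, "f1" has two gaps.
toy = pd.DataFrame({
    "f0": [0.1, 0.2, 0.3, 0.4],
    "f1": [1.0, np.nan, np.nan, 4.0],
})

# Grand total, as in the cell above.
total_missing = int(toy.isnull().sum().sum())

# Per-column counts, keeping only the columns that have gaps.
per_column = toy.isnull().sum()
columns_with_gaps = per_column[per_column > 0].index.tolist()

print(total_missing, columns_with_gaps)
```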

In [None]:
target = "target"

# Features with at most 10 distinct values count as categorical, the rest as numerical.
predictors = [col for col in train.columns if col not in ["id", target]]
n_unique = train[predictors].nunique()

cat_cols = [col for col in predictors if n_unique[col] <= 10]
num_cols = [col for col in predictors if n_unique[col] > 10]

print(f"Number of Categorical columns: {len(cat_cols)}")
print(f"Number of Numerical columns: {len(num_cols)}")
print(f"Number of Predictors: {len(predictors)}")


len(cat_cols) + len(num_cols) == len(predictors)
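
The split above is a cardinality rule: a feature with at most 10 distinct values is treated as categorical, anything else as numerical. A minimal sketch of the same rule on a hypothetical toy frame (`toy`, with made-up columns `f_bin` and `f_cont`), just to show which side each kind of column lands on:

```python
import pandas as pd

# Hypothetical toy frame, not the competition data.
toy = pd.DataFrame({
    "id": range(12),
    "f_bin": [0, 1] * 6,                      # 2 distinct values
    "f_cont": [i * 0.5 for i in range(12)],   # 12 distinct values
    "target": [0, 1, 1, 0] * 3,
})

predictors = [c for c in toy.columns if c not in ("id", "target")]
n_unique = toy[predictors].nunique()

# Same threshold as in the notebook: 10 distinct values.
cat_cols = n_unique[n_unique <= 10].index.tolist()
num_cols = n_unique[n_unique > 10].index.tolist()

print(cat_cols, num_cols)
```

The threshold is a heuristic: a genuinely numerical feature that happens to take few distinct values would also end up in `cat_cols`.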

[take me to the top](#section-top)

<a id="section-one"></a>

# 1. Target Distribution

In [None]:
fig, ax = plt.subplots(figsize = (6, 6))

labels = train[target].value_counts().index.tolist()
palette = ["#0EB8F1", "#F1480F"]


ax.pie(train[target].value_counts(), labels = labels, autopct = '%1.2f%%', 
       startangle = 180, colors = palette)

ax.set_title(target)
plt.show()

The target classes are about equally represented, so the dataset is balanced, good.
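
The pie chart can be backed by numbers. A minimal sketch, assuming a hypothetical toy target (`toy_target`) instead of the real column, of computing the class shares and a simple max/min imbalance ratio:

```python
import pandas as pd

# Hypothetical toy target with four samples per class.
toy_target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="target")

# Share of each class, ordered by class label.
shares = toy_target.value_counts(normalize=True).sort_index()

# 1.0 means perfectly balanced; larger values mean more imbalance.
imbalance_ratio = shares.max() / shares.min()

print(shares.to_dict(), imbalance_ratio)
```

On the real data the same two lines would run on `train[target]`.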

[take me to the top](#section-top)

<a id="section-two"></a>

# 2. Numerical Features - Distribution

In [None]:
plt.rcParams["font.family"] = "Times New Roman"
fig = plt.figure(1, figsize=(40, 30), facecolor = "#E5E5E5")

# One KDE panel per numerical feature on a 15 x 16 grid (subplot indices start at 1).
for pos, col in enumerate(num_cols, start = 1):
    
    skewness = np.round(train[col].skew(), 3)
    kurtosis = np.round(train[col].kurtosis(), 3)
    
    ax = fig.add_subplot(15, 16, pos)
    sns.kdeplot(data = train, x = col, ax = ax, color = "#101820")
    
    ax.set_title(r"$\bf{" + col  + "}$" + "\nSkewness: " + str(skewness) + "\nKurtosis: " + str(kurtosis))
    ax.set_facecolor("#E5E5E5")
    ax.set_xlabel("")

plt.tight_layout()
plt.show()

[take me to the top](#section-top)

<a id="section-three"></a>

# 3. Numerical Features - Target

In [None]:
fig = plt.figure(1, figsize=(40, 30), facecolor = "#E5E5E5")

order = sorted(train[target].unique())
palette = ["#0EB8F1", "#F1480F", "#971194", "#FEE715", "#101820"]

# One box plot per numerical feature, split by target class.
for pos, col in enumerate(num_cols, start = 1):
    
    ax = fig.add_subplot(15, 16, pos)
    sns.boxplot(data = train, y = col, hue = target, 
                ax = ax, x = [""] * len(train), 
                palette = palette[:len(order)], linewidth = 0.5, 
                flierprops = dict(marker = "x", markersize = 3.5))
    
    ax.set_title(r"$\bf{" + col + "}$")
    ax.set_facecolor("#E5E5E5")
    ax.set_ylabel("")
    
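    ax.get_legend().remove()

# Add one shared legend after the loop; calling fig.legend() per panel stacks a legend for every feature.
handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels, loc = "upper center")

plt.tight_layout()
plt.show()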

[take me to the top](#section-top)

<a id="section-four"></a>

# 4. Categorical Features - Distribution & Target

In [None]:
fig = plt.figure(1, figsize=(40, 30), facecolor = "#E5E5E5")

order = sorted(train[target].unique())

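# Each feature takes two neighbouring cells of the 9 x 10 grid: counts on the left, class shares on the right.
for col, pos in zip(cat_cols, range(1, 90, 2)):
    
    ax1 = fig.add_subplot(9, 10, pos)
    # `order` holds the target classes, so it belongs to hue_order, not to the x-axis order.
    sns.countplot(x = col, data = train, hue = target, ax = ax1, hue_order = order, 
                  palette = palette[:len(order)], alpha = 1)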
    ax1.set_title("Counts For Feature:\n" + col)
    
    ax2 = fig.add_subplot(9, 10, pos + 1)
    
    # Share of each target class within every level of the feature.
    temp = (
        train.groupby(col)[target]
        .value_counts(normalize = True)
        .rename("percentage")
        .reset_index()
    )
    
    sns.barplot(x = col, y = "percentage", hue = target, data = temp, 
                ax = ax2, hue_order = order, palette = palette[:len(order)])
    
    ax2.set_title("Percentages For Feature: \n" + col)
    
    ax2.set_ylim(0,1)
    
    for p in ax2.patches:
        
        txt = "{:.1f}%".format(p.get_height() * 100)
        ax2.text(p.get_x(), p.get_height() + 0.02, txt, fontsize = 10)

    for ax in [ax1, ax2]:
        ax.get_legend().remove()
        ax.set_facecolor("#E5E5E5")
        ax.set_ylabel(" ")
        ax.tick_params(labelsize = 8)
        
# Add one shared legend after the loop instead of one per feature.
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, loc = "upper center")

plt.tight_layout()
plt.show()

[take me to the top](#section-top)