## <p style="background-color:#3a2c57; font-family:newtimeroman; margin-bottom:2px; font-size:32px; color: white; text-align:center">Table of Contents</p>  

<a id="table-of-contents"></a>
1. [Preparation](#preperation)
    * 1.1. [Loading Packages and Importing Libraries](#load_packages_import_libraries)
    * 1.2. [Data Description](#data_description)
2. [Exploratory Data Analysis (EDA)](#eda)
    * 2.1. [Categorical Variables](#categorical_variables)
        * 2.1.1. [Number of Categorical Variables](#no_cat_features)
        * 2.1.2. [Correlation Matrix of Categorical Variables](#corr_categorical_variables)
    * 2.2. [Numerical Variables](#numerical_variables)
        * 2.2.1. [Box Plot of Numerical Variables](#box_numerical_variables)
        * 2.2.2. [KDE Plot of Numerical Variables](#kde_numerical_variables)
        * 2.2.3. [Correlation Matrix of Numerical Variables](#corr_numerical_variables)
        * 2.2.4. [Histogram Plot of Numerical Variables](#hist_numerical_variables)
        * 2.2.5. [Q-Q Plot of Numerical Variables](#qq_numerical_variables)
    * 2.3. [Normality Check and Outlier Detection](#norm_check_outlier_detect)
        * 2.3.1. [Mild and Extreme Outlier Detection](#mild_extreme_outlier)

[back to top](#table-of-contents)
<a id="preperation"></a>
# <p style="background-color:#3a2c57; font-family:newtimeroman; font-size:150%; text-align:center">1. Preparation</p>


<a id="load_packages_import_libraries"></a>
## <p style="background-color:#664e99; font-family:newtimeroman; font-size:120%; text-align:center">1.1. Loading Packages and Importing Libraries</p>

* **Loading packages and importing some helpful libraries.**

In [None]:
!pip install simple-colors

In [None]:
import numpy as np
import pandas as pd
from simple_colors import *
from termcolor import colored

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go

from scipy.stats import normaltest
from scipy import stats
from scipy.stats import iqr

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

<a id="data_description"></a>
## <p style="background-color:#664e99; font-family:newtimeroman; font-size:120%; text-align:center">1.2. Data Description</p>

* **First, a few display options are set so that all rows and columns are shown, which gives a better overview of the data sets. Next, the train and test data sets are loaded and displayed.**
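* **As a hedged alternative to global settings, the same display options can be limited to a single block with `pd.option_context`. The sketch below uses a small made-up frame named `demo`, which is an assumption for illustration and not part of the competition data.**

```python
import pandas as pd

# Small illustrative frame (made up for this sketch, not the competition data)
demo = pd.DataFrame({"a": range(5), "b": [x / 3 for x in range(5)]})

default_max_rows = pd.get_option("display.max_rows")

# Options passed here apply only inside the `with` block
with pd.option_context("display.max_rows", None,
                       "display.max_columns", None,
                       "display.float_format", "{:,.3f}".format):
    inside_max_rows = pd.get_option("display.max_rows")
    print(demo)

# Outside the block the previous setting is restored
restored_max_rows = pd.get_option("display.max_rows")
```

* **Scoping the options this way avoids accidentally printing every row of a very large frame later in the notebook.**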

In [None]:
#Setting up options

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.options.display.float_format = "{:,.3f}".format

In [None]:
# Load the data

train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')

In [None]:
def data_desc(df):
    
    """
    This function helps us with simple data analysis.
    We may explore the common information about the dataset, missing values, features distribution and duplicated rows
    """
    
    # applying info() method
    print('*******************')
    print(cyan('General information of this dataset', 'bold'))
    print('*******************\n')
    print(df.info())
    
    print('\n*******************')
    print(cyan('Number of rows and columns', 'bold'))
    print('*******************\n')
    print("Number of rows:", colored(df.shape[0], 'green', attrs=['bold']))
    print("Number of columns:", colored(df.shape[1], 'green', attrs=['bold']))
    
    # missing values
    print('\n*******************')
    print(cyan('Missing value checking', 'bold'))
    print('*******************\n')
    if df.isna().sum().sum() == 0:
        print(colored('There are no missing values', 'green'))
        print('*******************')
    else:
        print(colored('Missing value detected!', 'green', attrs=['bold']))
        print("\nTotal number of missing values:", colored(sum(df.isna().sum()), 'green', attrs=['bold']))
        
        print('\n*******************')
        print(cyan('Missing values of features', 'bold'))
        print('*******************\n')
        display(df.isna().sum().sort_values(ascending = False).to_frame().rename({0:'Counts'}, axis = 1).T.style.background_gradient('Purples', axis = None))
        print('\n*******************')
        print(cyan('Percentage of missing values of features', 'bold'))
        print('*******************\n')
        display(round((df.isnull().sum() / (len(df.index)) * 100) , 3).sort_values(ascending = False).to_frame().rename({0:'%'}, axis = 1).T.style.background_gradient('PuBuGn', axis = None))

        
    # applying describe() method for categorical features
    cat_feats = [col for col in df.columns if df[col].nunique() < 3]
    print('\n*******************')
    print(cyan('Categorical columns', 'bold'))
    print('*******************\n')
    print("Total categorical (binary) features:", colored(len(cat_feats), 'green', attrs=['bold']))
    display(df[cat_feats].describe())
        
        
    # describe() for numerical features
    cont_feats = [col for col in df.columns if df[col].nunique() >= 3 and col not in ('Id', 'Cover_Type')]
    print('\n*******************')
    print(cyan('Numerical columns', 'bold'))
    print('*******************\n')
    print("Total numerical features:", colored(len(cont_feats), 'green', attrs=['bold']))
    df = df[df.columns.difference(['Id', 'Cover_Type'], sort = False)]
    display(df[cont_feats].describe())
    
    # Checking for duplicated rows -if any-
    if df.duplicated().sum() == 0:
        print('\n*******************')
        print(colored('There are no duplicates!', 'green', attrs=['bold']))
        print('*******************')
    else:
        print('\n*******************')
        print(colored('Duplicates found!', 'green', attrs=['bold']))
        print('*******************')
        display(df[df.duplicated()])

    print('\n*******************')
    print(cyan('Preview of the data - Top 10 rows', 'bold'))
    print('*******************\n')
    display(df.head(10))
    print('*******************\n')
    
    print('\n*******************')
    print(cyan('End of the report', 'bold'))

In [None]:
data_desc(train)

In [None]:
data_desc(test)

[back to top](#table-of-contents)
<a id="eda"></a>
# <p style="background-color:#3a2c57; font-family:newtimeroman; font-size:150%; text-align:center">2. Exploratory Data Analysis (EDA)</p>

* **All numerical and categorical variables will be explored in this section.**

In [None]:
plt.figure(figsize=(10, 7))
ax = sns.countplot(y=train["Cover_Type"], palette='muted', zorder=3, linewidth=5, orient='h', saturation=1, alpha=1)
ax.set_title('Distribution of Cover Type', fontname = 'Times New Roman', fontsize = 30, color = '#8c49e7', x = 0.5, y = 1.05)
background_color = "#8c49e7"
sns.set_palette(['#ffd514']*120)

for a in ax.patches:
    value = f'Amount and percentage of values: {a.get_width():,.0f} | {(a.get_width()/train.shape[0]):,.3%}'
    x = a.get_x() + (a.get_width() / 16) 
    y = a.get_y() + a.get_height() / 2  
    ax.text(x, y, value, ha='left', va='center', fontsize=11, 
    bbox=dict(facecolor='none', edgecolor='black', boxstyle='round4', linewidth=0.7))

# ax.margins(-0.12, -0.12)
ax.grid(axis="x")

sns.despine(right=True)
sns.despine(offset=15, trim=True)

In [None]:
categorical_features =[]
numerical_features =[]

for col in train.columns:
    if train[col].nunique() < 3:
        categorical_features.append(col)
    elif train[col].nunique() >= 3 and col not in ('Id', 'Cover_Type'):
        numerical_features.append(col)
print('Categorical features: ', categorical_features)
print()
print('Numerical features: ', numerical_features)

In [None]:
# Cardinality check

print(colored("In Train Dataset", 'cyan', attrs=['bold', 'underline']))
for col in categorical_features:
    print('{} unique values in {}'.format(train[col].nunique(), col))

print()
print(colored("In Test Dataset", 'cyan', attrs=['bold', 'underline']))
for col in categorical_features:
    print('{} unique values in {}'.format(test[col].nunique(), col))

In [None]:
def cardinality(data):
    for k in categorical_features:
        print(f'{k}\n{(np.round((data[k].value_counts() / len(data[k]))*100,3))}\n')

In [None]:
cardinality(train)

In [None]:
cardinality(test)

<a id="categorical_variables"></a>
## <p style="background-color:#664e99; font-family:newtimeroman; font-size:120%; text-align:center">2.1. Categorical Variables</p>

In [None]:
fig = go.Figure([go.Bar(x = train[categorical_features].nunique().index, y = train[categorical_features].nunique().values, marker_color='rgb(100, 14, 175)')])
#fig.show()

fig.update_traces(marker_line_color='rgb(120, 15, 155)', marker_line_width=1, opacity=0.7)

fig.update_layout(
    title="<b>Number of unique values of categorical features</b>",
    width=2000,
    height=1250,
    
    xaxis = dict(showline=True,
    # Nested title dict replaces the deprecated titlefont_size shortcut
    title = dict(text='<b>Categorical Variables</b>', font=dict(size=16)),
    tickangle = -30,
    tickfont = dict(family='Times New Roman', color='black', size=16),
    ),

    yaxis = dict(showline=True,
    ticks = "outside", tickwidth=2, tickcolor='red', ticklen=7.5,
    title = dict(text='<b># of unique values</b>', font=dict(size=16), standoff=5),
    tickfont = dict(family = 'Times New Roman', color='black', size=16),
    ),
    bargap = 0.50, # gap between bars of adjacent location coordinates.   
)

<a id="no_cat_features"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.1.1. Number of Categorical Variables</p>

In [None]:
def count_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 5
    nrow = int(np.ceil(L / ncol))  # rows needed to fit all features in ncol columns
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 80))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.97)
    plt.subplots_adjust(left=0.1,
                    bottom=0.01, 
                    right=0.9,  
                    wspace=0.4, 
                    hspace=0.4)

    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.countplot(x = feature, palette='rocket_r', data=data, hue=hue)  # pass the hue argument through
        plt.xlabel(feature, fontsize=14, fontweight = 'bold')
        plt.ylabel('#', fontsize=14, fontweight = 'bold')
        for p in ax.patches:
            height = p.get_height()
            value = f'{p.get_height():,.0f} | {(p.get_height()/data[feature].shape[0]):,.3%}'
            ax.text(p.get_x()+p.get_width()/2., height+15000, value, ha="center", fontsize = 11, fontweight = 'bold')     
        i += 1
    
    plt.suptitle(titleText, fontsize = 28, fontweight = 'bold', color = 'darkorange')
    plt.show()    

In [None]:
count_plot(train, categorical_features, 'Categorical features of train dataset', hue=None)

In [None]:
count_plot(test, categorical_features, 'Categorical features of test dataset', hue=None)

In [None]:
def count_plot_testtrain(data1, data2, features, titleText):
  
    L = len(features)
    ncol = 5
    nrow = int(np.ceil(L / ncol))  # rows needed to fit all features in ncol columns
    remove_last= (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 80))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last = remove_last - 1

    fig.subplots_adjust(top = 0.97)
    plt.subplots_adjust(left=0.1,
                    bottom=0.01, 
                    right=0.9,  
                    wspace=0.4, 
                    hspace=0.4)
    
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.countplot(x=feature, color='#61057c', data=data1, label='train')         
        ax = sns.countplot(x=feature, color='#b7f035', data=data2, label='test')
        plt.xlabel(feature, fontsize=14, fontweight = 'bold')
        plt.ylabel('#', fontsize=14, fontweight = 'bold')
        ax = ax.legend(loc = "best", fontsize = 12)
        i += 1

    plt.suptitle(titleText, fontsize = 28, fontweight = 'bold', color = 'indigo')
    plt.show()

In [None]:
count_plot_testtrain(train, test, categorical_features, titleText = 'Categorical features of train & test datasets')

* **Soil_Type7 and Soil_Type15 seem to take only the value 0.**
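* **This observation can also be checked programmatically. The following is a minimal sketch: the helper name `constant_columns` and the toy frame are assumptions made for illustration, not part of the original notebook.**

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that contain a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Soil_Type7": [0, 0, 0, 0],
    "Soil_Type8": [0, 1, 0, 0],
    "Elevation": [2500, 2700, 3100, 2900],
})

print(constant_columns(toy))
```

* **On the real data, the same helper could be called as `constant_columns(train)` and `constant_columns(test)`.**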

<a id="corr_categorical_variables"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.1.2. Correlation Matrix of Categorical Variables</p>

In [None]:
def correlation_matrix(data, features):
    
    fig, ax = plt.subplots(1, 1, figsize = (20, 20))
    plt.title('Pearson Correlation Matrix', fontweight='bold', fontsize=25)
    fig.set_facecolor('#d0d0d0') 
    corr = data[features].corr()

    # Mask to hide upper-right part of plot as it is a duplicate
    mask = np.triu(np.ones_like(corr, dtype = bool))
    sns.heatmap(corr, annot = False, center = 0, cmap = 'jet', mask = mask, linewidths = .5, square = True, cbar_kws = {"shrink": .70})
    ax.set_xticklabels(ax.get_xticklabels(), fontfamily = 'sans', rotation = 90, fontsize = 12)
    ax.set_yticklabels(ax.get_yticklabels(), fontfamily = 'sans', rotation = 0, fontsize = 12)
    plt.tight_layout()
    plt.show()

In [None]:
correlation_matrix(train, categorical_features)

In [None]:
correlation_matrix(test, categorical_features)

* **Apart from the pair 'Wilderness_Area1' and 'Wilderness_Area3', there is no significant correlation between the categorical variables in either the train or the test dataset; none of the other correlations even exceeds 0.01. Additionally, the relationships between the variables are similar in both data sets.**
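* **A heatmap is read by eye, so it may help to list the strongest pairs numerically as well. The sketch below is one possible way to do that; the helper name `top_correlated_pairs` and the toy values are assumptions for illustration.**

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, features, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df[features].corr()
    # Keep only the strict upper triangle so that each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value while keeping the sign in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy frame with made-up binary columns
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 1, 0, 1, 0],
    "Wilderness_Area3": [0, 1, 0, 1, 0, 1],
    "Soil_Type2":       [0, 0, 1, 1, 0, 1],
})

result = top_correlated_pairs(toy, list(toy.columns), n=3)
print(result)
```

* **Constant columns produce undefined (NaN) correlations, which this sketch simply drops.**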

<a id="numerical_variables"></a>
## <p style="background-color:#664e99; font-family:newtimeroman; font-size:120%; text-align:center">2.2. Numerical Variables</p>

<a id="box_numerical_variables"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.2.1. Box Plot of Numerical Variables</p>

In [None]:
def box_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 5
    nrow = int(np.ceil(L / ncol))  # rows needed to fit all features in ncol columns
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 20))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last = remove_last - 1

    fig.subplots_adjust(top = 0.94)
    plt.subplots_adjust(left=0.1,
                    bottom=0.01, 
                    right=0.9,  
                    wspace=0.1, 
                    hspace=0.7)
    
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        v0 = sns.color_palette(palette = "pastel").as_hex()[2]
        ax = sns.boxplot(x = data[feature], color=v0, saturation=.75)  
        ax = ax.legend(loc = "best")    
        plt.xlabel(feature, fontsize=14, fontweight = 'bold')
        plt.ylabel('Values', fontsize=14, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 28, fontweight = 'bold', color = 'navy')
    plt.show()

In [None]:
box_plot(train, numerical_features, 'Box Plot of Numerical Columns of Train Dataset')

In [None]:
box_plot(test, numerical_features, 'Box Plot of Numerical Columns of Test Dataset')

* **Several features clearly contain a significant number of outliers in both data sets. These outliers have to be handled.**
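* **One common way to handle such values without losing rows is to clip them to the IQR fences. This is only a sketch of that option, not the approach chosen in this notebook; the helper name `clip_to_iqr_fences` and the toy values are assumptions.**

```python
import pandas as pd

def clip_to_iqr_fences(df, features, c=1.5):
    """Return a copy of df with each feature clipped to [Q1 - c*IQR, Q3 + c*IQR]."""
    clipped = df.copy()
    for col in features:
        q1, q3 = df[col].quantile([0.25, 0.75])
        spread = q3 - q1
        clipped[col] = df[col].clip(lower=q1 - c * spread, upper=q3 + c * spread)
    return clipped

# Toy column with one extreme value (made-up numbers)
toy = pd.DataFrame({"Slope": [10.0, 11.0, 12.0, 13.0, 14.0,
                              15.0, 16.0, 17.0, 18.0, 200.0]})

toy_clipped = clip_to_iqr_fences(toy, ["Slope"])
print(toy_clipped["Slope"].max())
```

* **The fences should be computed on the train set and reused for the test set so that both are treated identically.**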

<a id="kde_numerical_variables"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.2.2. KDE Plot of Numerical Variables</p>

In [None]:
def kde_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 5
    nrow = int(np.ceil(L / ncol))  # rows needed to fit all features in ncol columns
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 20))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.94)
    plt.subplots_adjust(left=0.1,
                    bottom=0.01, 
                    right=0.9,  
                    wspace=0.1, 
                    hspace=0.7)
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        # fill=True replaces the deprecated shade=True; the legend label shows the skewness
        ax = sns.kdeplot(data[feature], color="m", fill=True, label="%.3f"%(data[feature].skew()))  
        ax = ax.legend(loc = "best")    
        plt.xlabel(feature, fontsize=14, fontweight = 'bold')
        plt.ylabel('Density', fontsize=14, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 28, fontweight = 'bold', color = 'navy')
    
    plt.show()

In [None]:
kde_plot(train, numerical_features, titleText = 'KDE Plot of Numerical Features of Train Dataset', hue = None)

In [None]:
kde_plot(test, numerical_features, titleText = 'KDE Plot of Numerical Features of Test Dataset', hue = None)

* **In line with the box plots, the KDE plots also show that there are various outliers.**

<a id="corr_numerical_variables"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.2.3. Correlation Matrix of Numerical Variables</p>

In [None]:
def correlation_matrix(data, features):
    
    fig, ax = plt.subplots(1, 1, figsize = (10, 10))
    plt.title('Pearson Correlation Matrix', fontweight='bold', fontsize=25)
    fig.set_facecolor('#d0d0d0') 
    corr = data[features].corr()

    # Mask to hide upper-right part of plot as it is a duplicate
    mask = np.triu(np.ones_like(corr, dtype = bool))
    sns.heatmap(corr, annot = False, center = 0, cmap = 'jet', mask = mask, linewidths = .5, square = True, cbar_kws = {"shrink": .70})
    ax.set_xticklabels(ax.get_xticklabels(), fontfamily = 'sans', rotation = 90, fontsize = 12)
    ax.set_yticklabels(ax.get_yticklabels(), fontfamily = 'sans', rotation = 0, fontsize = 12)
    plt.tight_layout()
    plt.show()

In [None]:
correlation_matrix(train, numerical_features)

In [None]:
correlation_matrix(test, numerical_features)

* **There is no significant correlation between the numerical variables in either the train or the test dataset.**

<a id="hist_numerical_variables"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.2.4. Histogram Plot of Numerical Variables</p>

In [None]:
def hist_plot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 5
    nrow = int(np.ceil(L / ncol))  # rows needed to fit all features in ncol columns
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 20))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.94)
    plt.subplots_adjust(left=0.1,
                    bottom=0.01, 
                    right=0.9,  
                    wspace=0.1, 
                    hspace=0.7)
    
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.histplot(data[feature], edgecolor="black", color="darkseagreen", alpha=0.7)  
        ax = ax.legend(loc = "best")    
        plt.xlabel(feature, fontsize=18, fontweight = 'bold')
        plt.ylabel('Frequency', fontsize=18, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 32, fontweight = 'bold', color = 'navy')
    plt.show()

In [None]:
# Plot a 25% random sample to keep the histograms fast to draw
train_frac = train.sample(frac = 0.25).reset_index(drop = True)
hist_plot(train_frac, numerical_features, titleText = 'Histogram of Numerical Features of Train Dataset', hue = None)

In [None]:
# Plot a 25% random sample to keep the histograms fast to draw
test_frac = test.sample(frac = 0.25).reset_index(drop = True)
hist_plot(test_frac, numerical_features, titleText = 'Histogram of Numerical Features of Test Dataset', hue = None)

* **The histograms show the same distribution patterns as the KDE plots.**

<a id="qq_numerical_variables"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.2.5. Q-Q Plot of Numerical Variables</p>

In [None]:
def qqplot(data, features, titleText, hue=None):

    L = len(features)
    ncol = 5
    nrow = int(np.ceil(L / ncol))  # rows needed to fit all features in ncol columns
    remove_last = (nrow * ncol) - L

    fig, axs = plt.subplots(nrow, ncol, figsize=(30, 20))
    fig.tight_layout()
    fig.set_facecolor('#e4e4e4')

    while remove_last > 0:
      axs.flat[-remove_last].set_visible(False)
      remove_last -= 1

    fig.subplots_adjust(top = 0.94)
    plt.subplots_adjust(left=0.1,
                    bottom=0.01, 
                    right=0.9,  
                    wspace=0.1, 
                    hspace=0.7)
        
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)   
        stats.probplot(data[feature],plot=plt)
        plt.title('\nQ-Q Plot')
        plt.xlabel(feature, fontsize=18, fontweight = 'bold')
        plt.ylabel('Sample Quantile', fontsize=18, fontweight = 'bold')
        i += 1

    plt.suptitle(titleText, fontsize = 32, fontweight = 'bold', color = 'navy')
    plt.show()

In [None]:
qqplot(train, numerical_features, 'Q-Q Plot of Numerical Features of Train Dataset', hue=None)

In [None]:
qqplot(test, numerical_features, 'Q-Q Plot of Numerical Features of Test Dataset', hue=None)

* **The Q-Q plots, which compare the sample quantiles with those of a normal distribution, also show very clearly that the data are not normally distributed.**
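* **The visual impression from the Q-Q plots can be complemented with two numbers per feature: skewness and excess kurtosis, both of which are 0 for a normal distribution. The sketch below assumes a helper named `shape_summary` and uses randomly generated toy data rather than the competition data.**

```python
import numpy as np
import pandas as pd

def shape_summary(df, features):
    """Skewness and excess kurtosis per feature (both are 0 for a normal distribution)."""
    return pd.DataFrame({
        "skewness": df[features].skew(),
        "excess_kurtosis": df[features].kurt(),
    })

# Randomly generated toy data with a fixed seed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "roughly_normal": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
})

summary = shape_summary(toy, ["roughly_normal", "right_skewed"])
print(summary)
```

* **On the real data this could be called as `shape_summary(train, numerical_features)`.**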

<a id="norm_check_outlier_detect"></a>
## <p style="background-color:#664e99; font-family:newtimeroman; font-size:120%; text-align:center">2.3. Normality Check and Outlier Detection</p>

In [None]:
# D'Agostino and Pearson's Test

def normality_check(data):
    for i in numerical_features:
        # normality test on the 1-D column, so that stat and p are scalars
        stat, p = normaltest(data[i])
        print('Statistics=%.3f, p=%.3f' % (stat, p))
        # interpret results
        alpha = 1e-2
        if p > alpha:
            print(f'{i} looks Gaussian (fail to reject H0)\n')
        else:
            print(f'{i} does not look Gaussian (reject H0)\n')

In [None]:
normality_check(train)

In [None]:
normality_check(test)

In [None]:
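def detect_outliers(x, c = 1.5):
    """
    Return the positions of the values lying outside the IQR fences
    [Q1 - c * IQR, Q3 + c * IQR], in the tuple format produced by np.where.
    """
    q1, q3 = np.percentile(x, [25,75])
    iqr = (q3 - q1)
    lob = q1 - (iqr * c)  # lower outlier bound
    uob = q3 + (iqr * c)  # upper outlier bound

    # Positions of the outliers
    indices = np.where((x > uob) | (x < lob))

    return indices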

In [None]:
# Detect all Outliers 
outliers = detect_outliers(train['Cover_Type'])
print("Total Outliers count for Cover Type : ", len(outliers[0]))

print("\nShape before removing outliers : ",train.shape)

# Outlier removal is left disabled, so the shape printed below is unchanged
#train.drop(outliers[0],inplace=True, errors = 'ignore')
print("Shape after removing outliers : ",train.shape)

* **Since the Cover Type variable is categorical, there is no outlier value for it. The other features contain many outliers, but no rows are dropped directly, so as not to lose an enormous number of rows.** 
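* **The cost of dropping rows can be estimated before deciding. The following sketch marks every row that falls outside the IQR fences for at least one feature; the helper name `rows_with_any_outlier` and the toy values are assumptions for illustration.**

```python
import pandas as pd

def rows_with_any_outlier(df, features, c=1.5):
    """Boolean Series: True where a row is outside the IQR fences for any feature."""
    q1 = df[features].quantile(0.25)
    q3 = df[features].quantile(0.75)
    spread = q3 - q1
    # Compare each column with its own lower and upper fence
    below = df[features].lt(q1 - c * spread, axis="columns")
    above = df[features].gt(q3 + c * spread, axis="columns")
    return (below | above).any(axis=1)

# Toy frame with one extreme value per column (made-up numbers)
toy = pd.DataFrame({
    "Slope":     [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 90.0],
    "Elevation": [100.0, 2500.0, 2600.0, 2700.0, 2800.0, 2900.0, 3000.0, 3100.0],
})

mask = rows_with_any_outlier(toy, ["Slope", "Elevation"])
print(f"Rows that would be dropped: {mask.sum()} of {len(toy)} ({mask.mean():.1%})")
```

* **On the real data, `rows_with_any_outlier(train, numerical_features).mean()` would give the share of rows lost by dropping every outlier.**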

<a id="mild_extreme_outlier"></a>
## <p style="background-color:#9370db; font-family:newtimeroman; font-size:100%; text-align:center">2.3.1. Mild and Extreme Outlier Detection</p>

In [None]:
# Empty frame with one column per IQR fence
train_iqr = pd.DataFrame(columns=["-3 IQR", "-1.5 IQR", "1.5 IQR", "3 IQR"])
train_iqr

In [None]:
data = []

columns = ["-3 IQR", "-1.5 IQR", "1.5 IQR", "3 IQR"]

for feature in numerical_features:

    feature_values = train[feature]
    q1 = feature_values.quantile(0.25)
    q3 = feature_values.quantile(0.75)
    
    # Mild fences at 1.5 * IQR, extreme fences at 3 * IQR
    iqr_value = (q3 - q1)  # named iqr_value to avoid shadowing scipy.stats.iqr
    lob_1 = q1 - (iqr_value * 1.5)
    uob_1 = q3 + (iqr_value * 1.5)
    lob_3 = q1 - (iqr_value * 3)
    uob_3 = q3 + (iqr_value * 3)
    
    # Share of rows beyond each fence, formatted as a percentage
    number_uob_1 = f'{(feature_values > uob_1).mean():,.3%}'
    number_lob_1 = f'{(feature_values < lob_1).mean():,.3%}'
    number_uob_3 = f'{(feature_values > uob_3).mean():,.3%}'
    number_lob_3 = f'{(feature_values < lob_3).mean():,.3%}'

    values = [number_lob_3, number_lob_1, number_uob_1, number_uob_3]
    a_dictionary = dict(zip(columns, values))
    print(a_dictionary)
    data.append(a_dictionary)

In [None]:
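# DataFrame.append was removed in pandas 2.0, so build the frame directly
# and label each row with its feature name
train_iqr = pd.DataFrame(data, columns=columns, index=numerical_features)
train_iqr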

In [None]:
def colour(value):

    if float(value.strip('%')) > 10:
        color = 'red'
    elif float(value.strip('%')) > 5:
        color = 'darkorange'
    else:
        color = 'green'

    return 'color: %s' % color

# train_iqr = train_iqr.set_axis([numerical_features], axis='index')
train_iqr = train_iqr.style.applymap(colour)

In [None]:
train_iqr

* **This dataframe will be very helpful when deciding how to handle outlier values in the feature engineering stage of model development.** 