# Art of EDA
> - In this notebook, I perform Exploratory Data Analysis (EDA) to describe the nuances in the data that help with strong feature engineering and building robust models.
> - My main motivation is to promote the use of statistics in the EDA process.
> - Please Upvote if you find this notebook useful.

---

# Import Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

---

# Data Loading

In [None]:
data = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")
sub = pd.read_csv("../input/tabular-playground-series-jun-2022/sample_submission.csv")

---

# EDA

In [None]:
print(data.shape)
data.head()

In [None]:
data.tail()

## Data Summary

In [None]:
print("Data Summary")
print("-"*50)
print(f"Total number of rows -- {data.shape[0]}")
print(f"Total number of columns -- {data.shape[1]}")
print("-"*50)
print(f"Missing values per column -- \n{data.isnull().sum()}")


The summary above shows that every column except `row_id` and the columns with the prefix `F_2` contains null values.

Let's analyze the columns in more detail.

## Column Summary

In [None]:
def print_column_summary(data):
    name = []
    dtype = []
    unique_values = []
    missing = []
    for column in data.columns:
        name.append(str(column))
        data_type = str(data[column].dtypes)
        dtype.append(data_type)
        if(data_type == 'float64'):
            unique_values.append("")
        else:
            unique_values.append(str(data[column].nunique()))
        missing.append("{:0.2f} % ".format(data[column].isnull().sum() / data.shape[0] * 100))
    
    dfSummary = pd.DataFrame(name,columns = ["Name"])
    dfSummary["Dtypes"] = dtype
    dfSummary["Unique Value Count"] = unique_values
    dfSummary["Missing Value %"] = missing
    return dfSummary

In [None]:
summary = print_column_summary(data)
summary

## Notes on Columns

> - row_id is the primary key
> - The columns with prefix `F_2` are categorical and have no nulls
> - All the other columns hold float values and contain nulls, which we need to impute in this competition
> - All the float columns have nearly 1.8% null values
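
The "nearly 1.8%" figure can be checked numerically instead of being read off the summary table. Below is a minimal sketch on a synthetic stand-in for `data`; the frame `demo_missing`, its column names, and the 2% missing rate are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data` (hypothetical columns): two float columns
# with roughly 2% of their cells blanked out, plus one integer column.
rng = np.random.default_rng(0)
demo_missing = pd.DataFrame({
    "F_1_0": rng.normal(size=1000),
    "F_3_0": rng.normal(size=1000),
    "F_2_0": rng.integers(0, 5, size=1000),
})
blank = rng.random((1000, 2)) < 0.02
for k, col in enumerate(["F_1_0", "F_3_0"]):
    demo_missing.loc[blank[:, k], col] = np.nan

# Missing rate per float column, in percent.
demo_float_cols = demo_missing.select_dtypes(include="float64").columns
missing_pct = demo_missing[demo_float_cols].isnull().mean() * 100

# min / mean / max show at a glance how tightly the rates cluster.
print(missing_pct.describe()[["min", "mean", "max"]])
```

On the real data, replacing `demo_missing` with `data` should give the spread of the missing rate across all float columns in one line.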

In [None]:
print("-"*50)
print("Variable Datatypes :-")
print(summary.Dtypes.value_counts())

---

# Analyze float64 features

There are 55 float64 features, let's plot their histograms individually



In [None]:
float_features = [f for f in data.columns if data[f].dtype == 'float64']

# Training histograms
fig, axs = plt.subplots(len(float_features)//4 + 1, 4, figsize=(16, 50))
for f, ax in zip(float_features, axs.ravel()):
    ax.hist(data[f], density=True, bins=100)
    ax.set_title(f'data {f}, std={data[f].std():.1f}')
plt.suptitle('Histograms of the float features')
plt.show()

All the float64 columns appear to be standardized and look close to normally distributed
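
Histograms only give a visual impression, so it can help to back it with numbers. The sketch below summarizes each column by mean, standard deviation, skewness, and excess kurtosis using pandas built-ins; for a standard normal column these should be close to 0, 1, 0, and 0. The helper name `normality_summary` and the synthetic frame are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def normality_summary(df):
    """Per-column moments; pandas skips NaN values in all four statistics."""
    return pd.DataFrame({
        "mean": df.mean(),
        "std": df.std(),
        "skew": df.skew(),             # about 0 for a symmetric distribution
        "excess_kurtosis": df.kurt(),  # about 0 for a normal distribution
    })

# Synthetic example: one normal column and one clearly skewed column.
rng = np.random.default_rng(0)
demo_shapes = pd.DataFrame({
    "normal_like": rng.normal(size=5000),
    "skewed": rng.exponential(size=5000),
})
moments = normality_summary(demo_shapes)
print(moments.round(2))
```

Calling `normality_summary(data[float_features])` would flag any float column whose skewness or kurtosis is far from zero.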

## Correlation Plot

In [None]:
plt.figure(figsize=(30, 30))
sns.heatmap(data[float_features].corr(), center=0, annot=True, fmt='.1f')

In [None]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top 2 Absolute Correlations")
print(get_top_abs_correlations(data[float_features], 2))

Columns `F_4_8` and `F_4_11` have the highest correlation. 

Correlation among different columns might be useful while performing the imputation
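
To illustrate why correlation can matter for imputation, here is a small sketch on synthetic data. It compares mean imputation with scikit-learn's `IterativeImputer`, which predicts each missing value from the other columns. This is a swapped-in technique chosen only for illustration, and the two correlated columns `a` and `b` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer, SimpleImputer

# Two strongly correlated synthetic columns (hypothetical data).
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = 0.9 * a + rng.normal(scale=0.3, size=n)
full = pd.DataFrame({"a": a, "b": b})

# Hide about 5% of column "b" so the true values stay known for scoring.
holes = rng.random(n) < 0.05
observed = full.copy()
observed.loc[holes, "b"] = np.nan

def holdout_rmse(imputer):
    """RMSE of the imputed values of "b" against the hidden true values."""
    filled = imputer.fit_transform(observed)
    truth = full.loc[holes, "b"].to_numpy()
    return float(np.sqrt(np.mean((filled[holes, 1] - truth) ** 2)))

rmse_mean = holdout_rmse(SimpleImputer(strategy="mean"))
rmse_iter = holdout_rmse(IterativeImputer(random_state=0))
print(f"mean imputer RMSE: {rmse_mean:.3f}, iterative imputer RMSE: {rmse_iter:.3f}")
```

The gain depends on how strongly the columns are correlated, so on weakly correlated features the two approaches may perform about the same.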

## Plot the bar graph to demonstrate number of missing values per column

In our previous analysis, we saw that all the float64 columns have nearly 1.8% null values. Let's plot them for more clarity.

In [None]:
plt.figure(figsize=(20, 60))
missing_values = data[float_features].isnull().sum()
missing_values.plot(
    kind="barh", title="Number of Missing Values per Column"
)

## Let's analyze a random sample of 200 rows for missing values

We are analyzing only `float64` columns since they are the ones with missing values

In [None]:
import missingno as msno
msno.matrix(data[float_features].sample(200))

## Number of missing values per sample

In [None]:
n_missing = data[float_features].isnull().sum(axis=1)
n_missing.value_counts().plot(
    kind="bar", title="Number of Missing Values per Sample"
)
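
One statistical way to read this plot: if every cell were missing independently with the same probability, the number of missing values per row would follow a binomial distribution. The sketch below compares observed shares with that reference on a synthetic mask; the helper `compare_with_binomial` and the toy data are assumptions for illustration, with 55 columns and a 1.8% rate taken from the analysis above.

```python
import numpy as np
import pandas as pd
from scipy.stats import binom

def compare_with_binomial(n_missing, n_cols):
    """Observed share of rows with k missing cells next to Binomial(n_cols, p)."""
    p = n_missing.mean() / n_cols  # estimated per-cell missing probability
    observed = n_missing.value_counts(normalize=True).sort_index()
    expected = pd.Series(
        binom.pmf(observed.index.to_numpy(), n_cols, p), index=observed.index
    )
    return pd.DataFrame({"observed": observed, "expected": expected})

# Synthetic mask: 5000 rows, 55 columns, each cell missing with probability 1.8%.
rng = np.random.default_rng(0)
toy_mask = rng.random((5000, 55)) < 0.018
toy_n_missing = pd.Series(toy_mask.sum(axis=1))

binomial_table = compare_with_binomial(toy_n_missing, n_cols=55)
print(binomial_table.round(4))
```

Running `compare_with_binomial(n_missing, len(float_features))` on the real counts would show whether the missingness looks independent across cells; large gaps between the two columns would hint at some structure.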

---

# Analyze int64 features

There are 26 int64 columns, including `row_id`; let's plot the value counts of the remaining ones



In [None]:
int_features = [f for f in data.columns if data[f].dtype == 'int64']
int_features.pop(0)  # drop the first int64 column, `row_id`, which is an identifier rather than a feature
fig, axs = plt.subplots(len(int_features)//3 + 1, 3, figsize=(20, 30))

for i,col in enumerate(int_features):
    data[col].value_counts(normalize = True).plot(kind = 'bar',ax = axs[i//3][i%3], title =col )

> - All the `int64` columns have a rare-class problem
> - None of the `int64` columns are binary or ternary
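
The rare-class observation can be quantified rather than judged from the bar charts. A minimal sketch follows; the helper `rare_levels` and the 1% threshold are assumptions chosen for illustration, and the toy series stands in for one of the `int64` columns.

```python
import pandas as pd

def rare_levels(series, threshold=0.01):
    """Return the levels whose relative frequency is below `threshold`."""
    freq = series.value_counts(normalize=True)
    return freq[freq < threshold]

# Toy column: the value 7 appears in only 0.5% of the rows.
toy_column = pd.Series([0] * 600 + [1] * 395 + [7] * 5)
rare = rare_levels(toy_column)
print(rare)
```

Applying `rare_levels` to each entry of `int_features` would list the sparse levels per column, which can then be grouped or handled separately during feature engineering.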

---

# Simple mean imputer

Let's try to build a simple mean imputer.

The following implementation is inspired by @reymaster's work (<a href="https://www.kaggle.com/code/reymaster/starter-code-sklearn-simpleimputer/notebook?scriptVersionId=97156185">link here</a>).


In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()
imputed_df = pd.DataFrame(imputer.fit_transform(data), columns = data.columns)
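
With its default settings, `SimpleImputer` replaces each missing value with the mean of its column. The toy example below checks that this matches a plain pandas `fillna` with column means; the frame `toy_frame` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny frame with one hole per column (hypothetical values).
toy_frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, np.nan]})

# Default strategy is "mean"; the result is a NumPy array, so rebuild the frame.
via_sklearn = pd.DataFrame(
    SimpleImputer().fit_transform(toy_frame), columns=toy_frame.columns
)

# Same operation expressed directly in pandas.
via_pandas = toy_frame.fillna(toy_frame.mean())

print(via_sklearn)
```

Because `fit_transform` returns a single float array, integer columns such as `row_id` also come back as floats after this step.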

In [None]:
for i, row in enumerate(sub.values):
    # The first field of each submission row has the form "<row index>-<column name>"
    imputed_row, imputed_col = row[0].split("-", 1)
    sub.at[i, "value"] = imputed_df.at[int(imputed_row), imputed_col]
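
The row-by-row loop can be slow on a large submission file. Below is a sketch of a vectorized alternative on toy stand-ins; `toy_imputed`, `toy_sub`, and the id column name `row-col` are assumptions made for this example, and the id format follows the "row index, hyphen, column name" pattern used in the loop.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `imputed_df` and `sub` (hypothetical values and names).
toy_imputed = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3], "F_3_0": [1.0, 2.0, 3.0]})
toy_sub = pd.DataFrame({"row-col": ["0-F_1_0", "2-F_3_0"], "value": [0.0, 0.0]})

# Split every id once into its row part and its column-name part.
parts = toy_sub.iloc[:, 0].str.split("-", n=1, expand=True)
row_pos = parts[0].astype(int).to_numpy()

# Translate column names into integer positions (-1 would mean "not found").
col_pos = toy_imputed.columns.get_indexer(parts[1])

# Pick all requested cells in a single NumPy indexing operation.
toy_sub["value"] = toy_imputed.to_numpy()[row_pos, col_pos]
print(toy_sub)
```

This relies on the row part of the id matching the positional row order of the imputed frame, which is also what the loop version assumes.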

In [None]:
sub.head()

In [None]:
sub.to_csv("submission.csv", index=False)

# To be continued ..