# About this Notebook
Hey all,
my goal is to write a **<font color="purple">compact guide</font>** on Exploratory Data Analysis.
**<font color="purple">I will add new sections to this notebook whenever I have enough time to work on it.</font>**

<br>
<br>

<div class="alert alert-danger" role="alert">
    <h3>Feel free to <span style="color:red">comment</span> if you have any suggestions   |   motivate me with an <span style="color:red">upvote</span> if you like this project.</h3>
</div>

some initial imports...

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

<h1 style="background-color:Purple; color:white" >-> Topics:</h1>

## 1. [What is Exploratory Data Analysis?](#sec1)

## 2. [Phase 1: Overview](#sec2)
#### 2.1. [Pandas Profiling](#sec21)
#### 2.2. [Describe()](#sec22)
#### 2.3. [Info()](#sec23)
#### 2.4. [Nunique()](#sec24)
## 3. [Phase 2: Relationship Analysis and Visualization](#sec3)
#### 3.1. [What makes a good visualization?](#sec31)
#### 3.2. [Relationship Reports](#sec32)

<a id="sec1"></a>
<h1 style="background-color:Purple; color:white" >-> 1. What is Exploratory Data Analysis?</h1>

EDA is only performed on **<font color="purple">training data</font>** to avoid the **<font color="purple">data snooping bias</font>**.
It combines multiple techniques to give a wider perspective on the data:
* summarize the main statistics of your data
* scan for Missing Values and Outliers
* identify the most important independent variables in the data set
* diagnose possible Transformations for your independent variables
* determine valuable combinations of your independent variables
* pinpoint the relationships between all variables, including the target variable
* utilize visualizations, to emphasize patterns and relationships 
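
The advice to explore only the training data can be sketched as follows: hold out a test split first, then run every EDA step on the training part only. This is a minimal sketch on a toy frame. The toy data, the 80/20 ratio, and the names `train_df` and `test_df` are assumptions for illustration and are not part of this notebook's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real data set
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(40, 95, size=100),
    "DEATH_EVENT": rng.integers(0, 2, size=100),
})

# Hold out 20% of the rows before looking at anything
train_df = toy.sample(frac=0.8, random_state=42)
test_df = toy.drop(train_df.index)

# All exploration from here on would use train_df only
print(train_df.shape, test_df.shape)
```

With a real data set, the same two lines would be applied to the loaded DataFrame instead of `toy`.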

<br>
<br>

### EDA consists of two main phases:
*  **<font color="purple">Phase 1: Overview</font>**  <br>Display the total number of columns and rows, distribution of each column, Unique Values, and Missing Values.

* **<font color="purple">Phase 2: Relationship Analysis and Visualization</font>**  <br>Create some visualizations to uncover the relationships between multiple variables.

After performing these 2 initial phases, and **<font color="purple">cleaning your data</font>** according to [these guidelines](https://www.kaggle.com/milankalkenings/comprehensive-tutorial-data-cleaning),  you will have to  iteratively create your final machine learning model by performing these 3 steps in a loop:
* (I) Try out some machine learning models 
* (II) perform [Feature Engineering](https://www.kaggle.com/milankalkenings/comprehensive-tutorial-feature-engineering) 
* (III) perform some Relationship Analysis (the second phase of EDA)

<a id="sec2"></a>
<h1 style="background-color:Purple; color:white" >-> 2. Phase 1: Overview</h1>

Getting an overview of your data is an essential step, and you should always perform this phase of EDA to avoid running into unexpected anomalies in your data later on.

Even though extracting [descriptive statistics](https://www.statisticshowto.com/summary-statistics/) from your data set will give you some very important information about your data, you should always look at the head and the tail of your data set as a first step.

The reason for this is that even extremely different looking data sets may have similar, or even equal **<font color="purple">descriptive statistics</font>**, which could lead you to treat a new data set similarly to another data set with similar statistics which you have explored before. A famous example of this is [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), four very different data sets with very similar **<font color="purple">descriptive statistics</font>**.

Take a look at [this video](https://www.youtube.com/watch?v=fVj3Z7mU6zk&list=LL&index=7) by [Python Programmer](https://www.youtube.com/channel/UC68KSmHePPePCjW4v57VPQg) on youtube for further motivation.

Treating a new data set like a familiar one just because their statistics match can seriously hurt the performance of your predictive models.
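
The point about Anscombe's quartet can be checked numerically. The sketch below hard-codes the four published data sets and computes a few descriptive statistics for each. The variable names (`quartet`, `summary`) are my own choice, and only means, the variance of x, and the correlation are compared here.

```python
import numpy as np
import pandas as pd

# Anscombe's quartet: data sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

rows = []
for name, (x, y) in quartet.items():
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    rows.append({
        "dataset": name,
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),            # sample variance
        "corr_xy": np.corrcoef(x, y)[0, 1],  # Pearson correlation
    })

summary = pd.DataFrame(rows).set_index("dataset").round(2)
print(summary)
```

The four rows of `summary` come out almost identical, although a scatter plot of each data set looks completely different.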

In [None]:
df.head()

In [None]:
df.tail()

One fundamental piece of information we need about the data is the distribution of each variable on its own. The distributions of the numerical variables are especially important if we aim to transform them into normal distributions.

In [None]:
fig, ax = plt.subplots(figsize=(15,12))
df.hist(ax=ax)
# some more padding between the subplots
#plt.subplots_adjust(hspace=2) 
plt.tight_layout()
plt.savefig("hist.png")
plt.show()
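
If the goal is to bring a right-skewed variable closer to a normal distribution, the skewness before and after a candidate transformation can be compared. The following is only a sketch on synthetic data: the log-normal toy column and the choice of `np.log1p` are assumptions, and a different transformation may suit a given column better.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (log-normal samples)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="toy_feature")

# Sample skewness: roughly 0 for a symmetric distribution, > 0 for a long right tail
skew_before = skewed.skew()
skew_after = np.log1p(skewed).skew()

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness closer to zero after the transformation suggests that the transformed column is more symmetric, which can then be confirmed with a histogram like the one above.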

<a id="sec21"></a>
<h1 style="background-color:Purple; color:white" >-> 2.1. Pandas Profiling</h1>

Besides providing some basic statistics for each feature, pandas profiling calculates correlations between your features and it even displays the distribution for each of them. Moreover, it provides some hints for each variable like 'many null values', 'many unique values', and so on. 

In [None]:
# Note: newer releases of this package are published as ydata_profiling
import pandas_profiling

report = pandas_profiling.ProfileReport(df)

In [None]:
from IPython.display import display
display(report)

<a id="sec22"></a>
<h1 style="background-color:Purple; color:white" >-> 2.2. Describe()</h1>

In [None]:
df.describe()

<a id="sec23"></a>
<h1 style="background-color:Purple; color:white" >-> 2.3. Info()</h1>

In [None]:
df.info()

<a id="sec24"></a>
<h1 style="background-color:Purple; color:white" >-> 2.4. Nunique()</h1>

In [None]:
df.nunique()
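
The information from *info()* and *nunique()* can also be collected in one table, which is handy for spotting columns with missing values or with very few distinct values. This is a minimal sketch: the helper name `overview` and the toy frame are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

def overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "age": [60.0, 75.0, np.nan, 50.0],
    "sex": [1, 0, 1, 1],
    "smoking": [0, 0, 1, np.nan],
})

summary_table = overview(toy)
print(summary_table)
```

Calling `overview(df)` on the notebook's DataFrame would give the same kind of table for the real columns.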

<a id="sec3"></a>
<h1 style="background-color:Purple; color:white" >-> 3. Phase 2: Relationship Analysis and Visualization</h1>

I will not focus so much on the large number of different visualizations we can use, because every situation demands you to visualize data differently and there are tons of comprehensive collections of notebooks on Kaggle already.

Instead, I will focus on the **<font color="purple">Main Ideas</font>**, **<font color="purple">Theoretical Concepts</font>**, and **<font color="purple">General Advice</font>** of Data Visualization. I will explain them using some very basic plots.




<a id="sec31"></a>
<h1 style="background-color:Purple; color:white" >-> 3.1. What makes a good visualization?</h1>

A good visualization...
* provides a quick solution to an information problem.
* emphasizes the **<font color="purple">key findings</font>**.
* induces the viewer to think about the substance rather than about the methodology.
* gives an **<font color="purple">intuitive understanding</font>** (self-explaining labels and scales).
* is **<font color="purple">easy to interpret</font>** (no zooming is needed to get all the information, chosen colors are related to the displayed variables, has a high resolution...).
* doesn't contain any bias from the creator or other **<font color="purple">distortions</font>**.
* reveals **<font color="purple">patterns and relationships</font>**.
* provides an elegant solution due to **<font color="purple">interesting design</font>** and thus gains attention.
* is created for a **<font color="purple">specific audience</font>**.
* evokes **<font color="purple">associations</font>**.
* keeps it **<font color="purple">simple</font>**.

<a id="sec32"></a>
<h1 style="background-color:Purple; color:white" >-> 3.2. Relationship Reports</h1>

Whenever you create a visualization to uncover valuable relationships, you should always write down your thoughts about the plot, so that...

* ...all **<font color="purple">findings</font>** are written down in a **<font color="purple">condensed</font>** manner, so you don't have to interpret the visualization every time you take a look at your notebook
* ...you can present your thoughts and conclusions within your **<font color="purple">presentations</font>** for your coworkers / team members
* ...you can **<font color="purple">track</font>** your feature engineering and modeling **<font color="purple">decisions</font>** throughout your whole project

<br>

Let's create a simple visualization using *seaborn.relplot()*, in order to take a look at an example  **<font color="purple">Relationship Report</font>**:

In [None]:
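def death_yes_no(x):
    # map the binary target to a readable label for the plot legend
    return 'yes' if x == 1 else 'no'

# copy the relevant columns so the original DataFrame stays untouched
data_1 = df.loc[:, ['serum_creatinine', 'age', 'DEATH_EVENT']].copy()
data_1['death'] = data_1['DEATH_EVENT'].apply(death_yes_no)
data_1 = data_1.sort_values(by='age')
data_1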

In [None]:
sns.relplot(kind='line', 
            x='age',  
            y="serum_creatinine", 
            hue='death',  
            style='death', 
            data=data_1, 
            palette="icefire",
            ci=None)
plt.show()

<b><mark style="background-color: purple"><font color="white">Findings:</font></mark></b>

The *serum_creatinine* values of patients who *died* are higher than those of patients who *survived* in almost every age category.

<b><mark style="background-color: purple"><font color="white">Conclusions:</font></mark></b>

Create a new feature, defined as the *serum_creatinine* values of each observation minus the mean *serum_creatinine* values of the respective age group.
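
The feature described in this conclusion could be built roughly like this. It is only a sketch on toy values: the bin edges, the column names `age_group` and `creatinine_vs_age_group`, and the toy data are assumptions, since the notebook does not define what an age group is.

```python
import pandas as pd

# Hypothetical toy data with the two relevant columns
toy = pd.DataFrame({
    "age": [45, 48, 62, 65, 68, 81],
    "serum_creatinine": [1.0, 1.4, 0.9, 1.5, 1.2, 2.0],
})

# Assumed age groups: (40, 60], (60, 80], (80, 100]; the lowest edge is included
bins = [40, 60, 80, 100]
toy["age_group"] = pd.cut(toy["age"], bins=bins, include_lowest=True)

# Mean serum_creatinine of each observation's own age group
group_mean = toy.groupby("age_group", observed=True)["serum_creatinine"].transform("mean")

# New feature: individual value minus the mean of the respective age group
toy["creatinine_vs_age_group"] = toy["serum_creatinine"] - group_mean
print(toy)
```

Positive values of the new column mark patients whose *serum_creatinine* is above the average of their age group.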

https://www.youtube.com/watch?v=5Zg-C8AAIGg&list=PLW89ucX5y-jQ1RFT9DkQmJ0FQt5HWCfJ4&index=6&t=8s