In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import warnings
warnings.filterwarnings('ignore')


# Exploratory Data Analysis (EDA)


<img src="img/eda.png" width="500" align="center" />

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings.


<a href="https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm">📖 NIST</a>

### Interactive links:

- <a href="https://seeing-theory.brown.edu/index.html">Seeing Theory</a>
- <a href="http://mfviz.com/central-limit/">Central Limit Theorem</a>


## Some statistical jargon:

<img src="img/ml.jpg" width="600"/>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
plt.style.use('bmh')

### Expected value / Mean

The **mean**, also referred to by statisticians as the average, is the most common statistic used to measure the center of a numerical data set. The mean is the sum of all the values in the data set divided by the number of values in the data set. The mean of the entire population is called the population mean, and the mean of a sample is called the sample mean.

<font color='red'>The average is easily influenced by outliers (very small or very large values that are not typical of the data set).</font>

$$ \Large \bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_i $$

In [None]:
lst = [1, 4, 3, 2, 6, 4, 4, 3, 2, 6]

In [None]:
def mean(t):
    return sum(t) / len(t)

In [None]:
mean(lst)

In [None]:
np.mean(lst)

### Median

The **median** is the middle value of a sorted data set: half of the values lie below it and half above. For an even number of values it is the average of the two middle values. The median is often more informative than the mean when a distribution is skewed and/or has outliers.

<img src="img/med.svg" width="250" align="center" />
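
To make the contrast between mean and median concrete, here is a minimal sketch that uses the standard-library `statistics` module. The numbers are made up for illustration (they are not from the Iris data), and the last value is a deliberate outlier.

```python
import statistics

# Hypothetical sample; 400 is an artificial outlier added for illustration.
values = [30, 32, 35, 36, 38]
with_outlier = values + [400]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

# The mean jumps from about 34 to about 95, the median barely moves.
print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

A single extreme value drags the mean far away from the bulk of the data, while the median shifts only from 35 to 35.5.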

In [None]:
def median(X):
    n = len(X)
    sortX = sorted(X)
    
    if n % 2 != 0:
        return sortX[n // 2]
    else:
        return ( sortX[n // 2 - 1] + sortX[n // 2] ) / 2


In [None]:
median(lst)

In [None]:
# lst1 = [0, 1, -1, -10, -16, -6]
np.median(lst)
# np.percentile(lst, 50)

### Variance

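The **variance** measures how far the values in a data set spread out from their mean: it is the average squared deviation from the mean. The higher the variance, the more widely the values are dispersed.

The formula below divides by $n$, which matches the default behaviour of `np.var`. Dividing by $n - 1$ instead gives the unbiased sample variance; see <a href="https://en.wikipedia.org/wiki/Bessel%27s_correction">Bessel's correction</a>.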

$$ \Large s^2 = \frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^2 $$
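
The effect of Bessel's correction can be checked directly with NumPy's `ddof` ("delta degrees of freedom") argument. This is a small sketch on the same toy list used in this notebook; it only illustrates the two divisors and is not a recommendation for either one.

```python
import numpy as np

lst = [1, 4, 3, 2, 6, 4, 4, 3, 2, 6]
n = len(lst)

pop_var = np.var(lst)             # ddof=0 (default): divide by n
sample_var = np.var(lst, ddof=1)  # ddof=1: divide by n - 1

print(pop_var, sample_var)

# The two estimates differ exactly by the factor n / (n - 1).
print(np.isclose(sample_var, pop_var * n / (n - 1)))
```

For large `n` the two values are nearly identical; the difference matters mostly for small samples.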

In [None]:
def variance(t):
    mu = mean(t)
    return sum((x - mu) ** 2 for x in t) / len(t)

In [None]:
variance(lst)

In [None]:
np.var(lst)

### Standard Deviation

The **standard deviation** (often written σ) measures how dispersed the data are around the mean. A low standard deviation means the data are clustered around the mean; a high standard deviation indicates the data are more spread out. It is the square root of the variance, so it is expressed in the same units as the data.

$$ \Large s = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^2} $$

<img src="img/std.jfif" width="500" align="center" />

In [None]:
def std(t):
    return variance(t) ** 0.5

In [None]:
std(lst)

In [None]:
np.std(lst)

### Mode
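
The **mode** is the value that occurs most often in a data set. A data set can have more than one mode when several values tie for the highest count. In the same spirit as the hand-written `mean` and `median` above, here is a minimal sketch built on `collections.Counter`. The helper name `modes` is an assumption made for this example, and it returns every tied value rather than a single one.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())  # raises ValueError on empty input
    return sorted(v for v, c in counts.items() if c == top)

lst = [1, 4, 3, 2, 6, 4, 4, 3, 2, 6]
print(modes(lst))           # [4]
print(modes([1, 1, 2, 2]))  # [1, 2]
```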

In [None]:
from scipy import stats

stats.mode(lst).mode

## Preparations

In [None]:
df = pd.read_csv('data/Iris.csv')

In [None]:
df.head()

In [None]:
df.tail(1)

In [None]:
df.columns

In [None]:
df.columns = df.columns.str.lower()

In [None]:
df.info()

In [None]:
missing_data = pd.DataFrame({
    'total_missing': df.isnull().sum(),
    'perc_missing': (df.isnull().sum()/df.shape[0]) * 100
})

missing_data.sort_values(by='perc_missing', ascending=False)

In [None]:
df.describe()

## Data Visualization

In [None]:
import seaborn as sns

In [None]:
num_columns = df.select_dtypes(include=np.number).columns
num_columns

In [None]:
col = df[num_columns[3]]

In [None]:
plt.figure(figsize=(9, 8))

# sns.distplot is deprecated; histplot with kde=True is its replacement
sns.histplot(col,
             color='b',
             bins=100,
             kde=True,
             stat='density',
             alpha=0.4)



### Skewness and Kurtosis

**Skewness** is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

**Kurtosis** is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. It refers to the degree of presence of outliers in the distribution.

<img src="img/skew.gif" width="500" align="center" />

<table><tr>
<td> <img src="img/skew.jfif" style="width: 250px;"/> </td>
<td> <img src="img/kurt.png" style="width: 250px;"/> </td>
</tr></table>
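
Both quantities can be written in terms of central moments. The sketch below uses the plain moment-based definitions (skewness as $m_3 / m_2^{3/2}$, excess kurtosis as $m_4 / m_2^2 - 3$) and compares them with `scipy.stats.skew` and `scipy.stats.kurtosis` at their default settings. The function names are assumptions made for this example. Note that pandas' `Series.skew()` and `Series.kurt()` apply a small-sample bias correction, so their results can differ slightly from these values.

```python
import numpy as np
from scipy import stats

def skewness(x):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    return m3 / m2 ** 1.5

def excess_kurtosis(x):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)
    return m4 / m2 ** 2 - 3.0

lst = [1, 4, 3, 2, 6, 4, 4, 3, 2, 6]
print(skewness(lst), stats.skew(lst))
print(excess_kurtosis(lst), stats.kurtosis(lst))
```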


In [None]:
print("Skewness: %f" % col.skew())
print("Kurtosis: %f" % col.kurt())

## Handling outliers

In [None]:
plt.figure(figsize=(18, 9))

df[['sepallengthcm', 'sepalwidthcm', 'petallengthcm', 'petalwidthcm']].boxplot()

plt.title("Box plot: Numerical variables in Iris dataset", fontsize=20)
plt.show()
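
The points drawn beyond the whiskers of a box plot follow the common 1.5 × IQR rule, which is matplotlib's default whisker setting. The sketch below applies that rule by hand. The helper name `iqr_outliers` and the sample values are assumptions made for this example, and whether flagged points should be removed, capped, or kept depends on the analysis.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

# Hypothetical measurements with one low and one high extreme value.
sample = [2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4]
print(iqr_outliers(sample))
```

On this sample the rule flags 2.0 and 4.4. The same function could be applied column by column to the numeric columns of `df`.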

## Correlation

In statistics, **correlation** or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data.

Pearson's correlation coefficient is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1.

<img src="img/r.jfif" width="500px"/>
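
The definition above translates almost word for word into code. This is a minimal sketch on made-up numbers; `pearson_r` is a name chosen for this example, and the result is compared against `np.corrcoef` as a sanity check.

```python
import numpy as np

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    cov = np.mean(dx * dy)            # covariance with divisor n
    return cov / (x.std() * y.std())  # std() also uses divisor n, so the factors cancel

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(pearson_r(x, y))
print(np.corrcoef(x, y)[0, 1])
```

Because the same divisor appears in the covariance and in both standard deviations, the choice between $n$ and $n - 1$ does not change the coefficient.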

In [None]:
fig = plt.gcf()
fig.set_size_inches(20,10)
__ = sns.violinplot(x='species', y='sepalwidthcm', data=df)

In [None]:
df_num = df[['sepallengthcm', 'sepalwidthcm', 'petallengthcm', 'petalwidthcm']]

In [None]:
# Plot one target column against every other numeric column
target_col = 'sepallengthcm'
sns.pairplot(data=df_num,
             x_vars=df_num.columns.difference([target_col]),
             y_vars=[target_col])


In [None]:
df_num.columns

In [None]:
corr_matrix = df_num.corr().abs()
corr_matrix

In [None]:
corr = df_num.corr()

plt.figure(figsize=(12, 10))

sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);


In [None]:
# Select upper triangle of correlation matrix (np.bool was removed from NumPy; use bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Keep features whose correlation with every preceding feature is at most 0.95
features_to_analyse = [column for column in upper.columns if not any(upper[column] > 0.95)]


In [None]:
from itertools import combinations


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 12))

# One regression plot per pair of retained features
for ax, (c1, c2) in zip(axes, combinations(features_to_analyse, 2)):
    sns.regplot(x=c1, y=c2, data=df[features_to_analyse], ax=ax)

#### Resources used for creating this notebook:

- https://github.com/learn-co-students/dsc-implementing-statistics-with-functions-lab-dc-ds-071519/blob/master/index.ipynb
- https://archive.ics.uci.edu/ml/datasets/iris
- https://www.kaggle.com/agrawaladitya/step-by-step-data-preprocessing-eda
- https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
- https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python
- https://www.kaggle.com/mjamilmoughal/eda-of-titanic-dataset-with-python-analysis