# Exploratory Data Analysis

<div class="slide-title">

# Exploratory Data Analysis
     
</div>

## Table of Contents  
- The EDA Process put into context
- Types of EDA
  - uni-/bi-/multi-variate
  - numerical vs. graphical
  - continuous vs. categorical data
- Measures of central tendency / dispersion
- Distribution functions / histograms / modality
- Correlations
- Special considerations for categorical / discrete variables
- Summary / References

  

## The EDA Process

### Research / Business Questions

Research or business questions arise when a researcher makes guesses about reality (the data). They are phrased as questions.

Examples:
* Which factors can increase ice cream sales?
* Can we predict skin cancer using photographs of the melanoma?
* How does the material composition of a bridge affect its durability?


### Hypothesis Generation
* Hypotheses are assumptions or educated guesses we make about the data, using our domain knowledge.
* You can phrase a hypothesis as "if ... then ..." or "the more ..., the ...".
* A hypothesis is formed as a measurable (operationalisable) statement you can validate by looking at data.
* A research question can have multiple hypotheses attached to it.

Examples:
* If the sun is shining then ice cream sales increase.
* The larger the melanoma, the greater the risk that it is malignant.
* The cheaper the material composition the shorter its durability.


### Hypothesis Generation != Hypothesis testing

Hypothesis generation is an educated guess: the hypothesis can still turn out to be true or false after EDA and hypothesis testing (is the observed result due to chance or not?).

[Hypothesis testing](https://www.analyticsvidhya.com/blog/2015/09/hypothesis-testing-explained/)


## Where does EDA belong in the bigger picture?

<center>
    <img src="../images/eda_intro/img_p2_6.png" width=700>
</center>   


Note:
Research question --> interface to Data Science --> above process  
- Area sizes denote complexity / effort  
- most interesting field is often small
- borders are fluid
- difference between *exploration* (for our own understanding) and *explanation* (to others)

## Black cats and domain experts

<div class="group">
    <div class="text">
The hardest thing of all is to find a black cat in a dark room, especially if there is no cat.
<br>
<br> 
<br> 
→ Talk to domain experts or become one.
    </div>
    <div class="images">
        <img src="../images/eda_intro/img_p1_2.png">
    </div>
</div>

Notes:
The more expert knowledge you have, the better you know, e.g. which features are (un)important / redundant / need processing

## What is EDA and why do we do it?

### What is EDA?

→ **Detective work**

EDA is the process of summarizing and visualising important characteristics of the data in order to gain insights.

It’s an **approach/process**, not a set of techniques.

**Tools:**
* Any method of looking at data that does not include formal statistical modelling and inference
* Visualisation 


<div class="alert alert-block alert-info">
<b>Note:</b> 

Confirm the expected or show the unexpected!
</div>

Notes: We will get to know some tools and terms today

### Goals and Benefits
* Understand each variable
* Get insight into relationships between the variables
* Draw valid conclusions
* Check assumptions
* Aid in decision making and planning
* Help in causal analysis

→ To build intuition about the data and gain insights

→ To generate and corroborate or reject hypotheses


### Types of EDA
* Univariate or multivariate
* Graphical or non-graphical
* Dealing with both categorical and numerical data


### Estimates

<div class="group">
    <div class="text">
        
* We rarely have the money/resources to measure everything
* So we work with **samples** of the **population**, which we hope are representative of the whole population
* The more data we have the more confident we are in the estimates
* Ideally: the results drawn from our experiments are reproducible

        
<div class="alert alert-block alert-info">
<b>Note:</b> 

In Data Science most metrics omit the word “estimate”...
nevertheless, almost all metrics we use are estimates.
</div>   
    </div>
    <div class="images">
        <img src="../images/eda_intro/img_p7_2.png">
    </div>
</div>

Notes:  
This figure shows two things:  
- population vs. sample (here: data) and
- the two types of statistics:  

EDA is mainly **descriptive statistics** to describe single variables or the relationship of different vars  
The figure also shows the other type of statistics: **inferential statistics**.

### Univariate vs. Multivariate Analysis

<div class="group">
    <div class="text">
        
**Univariate Analysis:**
* Analysis of a single variable (often called feature)  
* Characterises data by focusing on distribution, central tendency, dispersion, etc.
* Represents information numerically and visually<br><br>
    </div>
    <div class="text">
    
**Multivariate Analysis:**
* Simultaneous analysis of multiple variables
* Examines how changes in one variable are associated with changes in others
* Characterises dependence by a numerical coefficient<br><br>
    </div>
</div>

<div class="group">
    <div class="text">
→ Description and understanding of individual variables 
    </div>
    <div class="text">       
→ Understanding of the relationship and interaction among multiple variables 
    </div>
</div>



### Data Types
<img src="../images/eda_intro/data_types.png">

Notes:  
There are different kinds of "measurement levels". This one is a very common one.  
Sometimes "numerical" are also called "metric" or "scalar".

Further subclasses of scalar values:   
- Interval-scaled data: we can add / subtract, but not multiply / divide (there is no true zero), e.g. °C
- Ratio-scaled data: same as interval-scaled data but with an absolute zero, so ratios are meaningful

## Univariate EDA

Notes: Most of the descriptive statistics can be applied to numerical data. For categorical data the choice is limited.

### Describing Central Tendency

#### Mean

<div class="group">
    <div class="text">
        
* sum of data points divided by number of data points
* often used as default
* sensitive to extreme values
  
What is the mean in this example?        
|id | y | 
|---|---|
| 1 | 2 |
| 2 | 6 |
| 3 | 6 |
| 4 | 8 |
| 5 | 10 |
        
</div>
    <div class="images">
        
$$\bar{y}=\frac{1}{n} \sum\limits_{i=1}^{n}y_{i}$$        
        
<img src="../images/eda_intro/img_p14_2.png">
    </div>
</div>


Notes: mean = 6.4  
The shown distribution function is symmetric and continuous as an example of a special case (not the given one).

#### Median

<div class="group">
    <div class="text">
        
* value in the middle of a data series ordered by size
* more robust against extreme values but computationally more expensive since values need to be sorted
  
What is the median in this example?        
|id | y | 
|---|---|
| 1 | 2 |
| 2 | 6 |
| 3 | 6 |
| 4 | 8 |
| 5 | 10 |
        
</div>
    <div class="images">
        
$$       
y_{i} \leq y_{i+1}\;\text{for}\;i=1,...,n-1 \\
y_{\frac{n+1}{2}}\;\text{if}\;n\;\text{is odd} \\
\frac{1}{2}\big(y_{\frac{n}{2}}+y_{\frac{n}{2}+1}\big)\;\text{if}\;n\;\text{is even}
$$

<img src="../images/eda_intro/img_p15_2.png">
    </div>
</div>

Notes: median = 6. For skewed distributions, or when there is concern about outliers, the median might be preferred to the mean.

#### Mode
<div class="group">
    <div class="text">
        

* most frequent values
* not necessarily unique
* mostly used for categorical data
  
What is the mode in this example?        
|id | y | 
|---|---|
| 1 | 2 |
| 2 | 6 |
| 3 | 6 |
| 4 | 8 |
| 5 | 10 |
        
</div>
    <div class="images">
        <img src="../images/eda_intro/img_p16_2.png">
    </div>
</div>

Notes: mode = 6
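
The three measures of central tendency for the example table can be checked with Python's standard library. This is a minimal sketch; the variable names are chosen for illustration only.

```python
import statistics

# The example column y from the tables above
y = [2, 6, 6, 8, 10]

mean_y = statistics.mean(y)        # sum of values divided by their count
median_y = statistics.median(y)    # middle value of the sorted data
modes_y = statistics.multimode(y)  # all most frequent values (the mode may not be unique)

print(f"mean={mean_y}, median={median_y}, modes={modes_y}")
```

`multimode` is used instead of `mode` because it returns every most frequent value, which matches the "not necessarily unique" remark.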

### Describing the Spread

#### Range
<div class="group">
    <div class="text">
        
* difference between largest and smallest value
* measures the spread of the data
* sensitive to outliers
        
</div>
    <div class="images">
Which dataset has the larger range?       
        <img src="../images/eda_intro/img_p18_2.png" width=250>
    </div>
</div>

Notes: dataset1 = 38 - 20 = 18, dataset2 = 52 - 11 = 41

#### Quantiles

<div class="group">
    <div class="text">
        
* quantiles split sorted data into parts with **equal numbers of observations**
    * quartiles: split data into 4 parts
    * deciles: split data into 10 parts
    * percentiles: split data into 100 parts
        
</div>
    <div class="images">       
        <img src="../images/eda_intro/img_p17_2.png">
    </div>
</div>

Notes: The positions of the percentiles are not equidistant (they depend on the distribution)

#### Interquartile Range (IQR)

<div class="group">
    <div class="text">
        
* width of interval that contains the middle 50% of the data
* interval between the 25th and 75th percentile
* interval between 1st and 3rd quartile
* robust to outliers
        
</div>
    <div class="images">       
        <img src="../images/eda_intro/img_p19_2.png">
    </div>
</div>
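
To make range, quartiles and IQR concrete, here is a small sketch using only the standard library. The sample is hypothetical and contains one deliberately extreme value; note that different quantile conventions (here `method="inclusive"`) give slightly different cut points.

```python
import statistics

# Hypothetical sample with one extreme value (30), for illustration only
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles: the three cut points that split the sorted data into 4 parts.
# method="inclusive" treats the data as the whole population of interest;
# other conventions give slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

data_range = max(data) - min(data)  # driven by the extreme value
iqr = q3 - q1                       # width of the middle 50% of the data

print(f"Q1={q1}, median={q2}, Q3={q3}")
print(f"range={data_range}, IQR={iqr}")
```

The range is dominated by the single extreme value, while the IQR is not affected by it.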

#### Outliers

<div class="group">
    <div class="text">
        
* No generally recognized formal definition for outlier
* Values outside of the areas of a distribution that would commonly occur

<div class="alert alert-block alert-info">
<b>Note:</b> 

Whether an outlier is good or bad depends on the data problem. For example, for anomaly detection you want to keep outliers.
</div>
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p25_2.png">
    </div>
</div>

Notes:  
The lower line is not a line but should be brackets.  
Everything further than 1.5 x IQR below Q1 or above Q3 is regarded as an outlier here. The "extreme" threshold is (about) Q3 + 3 x IQR.
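
The 1.5 x IQR rule can be sketched in a few lines. The helper name `tukey_outliers` and the sample data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions shift the fences slightly.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample, for illustration only
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

outliers = tukey_outliers(data)         # 1.5 x IQR rule
extremes = tukey_outliers(data, k=3.0)  # stricter 3 x IQR rule for "extreme" values

print("outliers:", outliers)
print("extremes:", extremes)
```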

### Box Plots

<center>
<img src="../images/eda_intro/img_p51_2.png">
</center>

#### Variance & Standard Deviation

<div class="group">
    <div class="text">
        
**Variance**
* average squared difference of the values from the mean (sample variance):  $s^2 = \frac{1}{n-1}\sum_i{(x_i-\bar{x})^2}$

**Standard deviation**
* square root of the variance: $s = \sqrt{s^2}$
* typical distance between a data point and the mean
* has the same unit as the original data

Both are not robust to outliers.

[degrees of freedom](https://web.ma.utexas.edu/users/mks/M358KInstr/SampleSDPf.pdf)        
</div>
    <div class="images">
        
<img src="../images/eda_intro/img_p20_3.png" width=500>
    </div>
</div>

Notes: 
Include a live example here, e.g. [2, 2, 3, 5]: mean = 12/4 = 3, sample variance = (1+1+0+4)/3 = 2, SD = √2 ≈ 1.41.  
The denominator in the standard deviation of the *population* is n (instead of n-1).  
Want to know more? Consider the link about degrees of freedom (DoF).
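
A minimal sketch of the live example [2, 2, 3, 5], showing the difference between the sample (n-1) and population (n) versions with Python's standard library:

```python
import statistics

x = [2, 2, 3, 5]

sample_var = statistics.variance(x)  # divides by n - 1
sample_sd = statistics.stdev(x)      # square root of the sample variance
pop_var = statistics.pvariance(x)    # divides by n
pop_sd = statistics.pstdev(x)        # square root of the population variance

print(f"sample:     variance={sample_var}, SD={sample_sd:.3f}")
print(f"population: variance={pop_var}, SD={pop_sd:.3f}")
```

The squared deviations from the mean 3 are 1, 1, 0 and 4, so the sample variance is 6/3 and the population variance is 6/4.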

### Describing the Distribution

Notes: Note that we saw / used distributions already above. Now some more details

#### Skewness & Kurtosis
<div class="group">
    <div class="text">
        
**Skewness**
* degree of asymmetry of the distribution of the data

    <img src="../images/eda_intro/img_p21_2.png">
        
</div>
    <div class="text">
        
**Kurtosis**
* degree of pointedness relative to a normal distribution (flat vs. pointy); strictly speaking it reflects how heavy the tails are     
        <img src="../images/eda_intro/img_p21_3.png">

[Kurtosis](https://www.spektrum.de/lexikon/geographie/kurtosis/4488)
    </div>
</div>

Notes:  
Which estimate of location (mean, median, mode) would you use if your data is skewed? (mode: not affected by outliers)

For positively skewed distributions, the mode is typically the smallest value, followed by the median, then the mean.
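
The ordering of mode, median and mean in a positively skewed sample can be illustrated as follows. This is a sketch using simple moment-based definitions (libraries may apply additional bias corrections); the data and function names are hypothetical.

```python
import statistics

def moment_skewness(values):
    """Third central moment divided by the 1.5th power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Fourth standardised moment minus 3 (a normal distribution gives 0)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

# Hypothetical right-skewed sample: a few large values form a long right tail
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mode_d = statistics.mode(data)
median_d = statistics.median(data)
mean_d = statistics.mean(data)
skew_d = moment_skewness(data)

print(f"mode={mode_d}, median={median_d}, mean={mean_d}")
print(f"skewness={skew_d:.2f}, excess kurtosis={excess_kurtosis(data):.2f}")
```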

### [Data Distributions](https://en.wikipedia.org/wiki/List_of_probability_distributions)


<div class="group">
    <div class="text">
        
* <font size="4">***Uniform***  
 all events have the same frequency, e.g. outcome of a die roll</font>
* <font size="4">***Bernoulli***  
 two possible outcomes, e.g. a coin toss</font>
* <font size="4">***Binomial***  
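 number of successes in a fixed number of independent Bernoulli trials; for many trials it looks like a "discrete version" of the normal distribution, e.g. tossing two coins 100 times: probability of a certain number of double heads</font>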
* <font size="4">***Normal***  
 many continuous real-valued variables in nature approximately follow this distribution</font>
* <font size="4">***Poisson***  
 events occurring at random points of time and space - the number of events [Video](https://www.youtube.com/watch?v=jmqZG6roVqU)</font>
* <font size="4">***Exponential***  
 the interval between events</font>
        
</div>
    <div class="images"> 
        <img src="../images/eda_intro/img_p47_2.png">
    </div>
</div>

In general there are **discrete** and **continuous** [distributions](https://tinyheero.github.io/2016/03/17/prob-distr.html)  
Their number is [immense](https://en.wikipedia.org/wiki/List_of_probability_distributions)  
Create histograms to get an idea of the distribution (e.g. sns.histplot for one variable, or sns.pairplot for all numeric columns)
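
The link between Bernoulli and Binomial can be sketched by simulation: a Binomial count is just the number of successes in repeated Bernoulli trials. The example below assumes the two-coin setup (probability 0.25 for double heads); the function names, seed and number of repetitions are illustrative choices.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def bernoulli(p):
    """One trial with two outcomes: 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli trials."""
    return sum(bernoulli(p) for _ in range(n))

# Toss two fair coins 100 times and count the double heads (p = 0.5 * 0.5);
# repeat this experiment 1000 times to see the distribution of the count.
double_heads = [binomial(100, 0.25) for _ in range(1000)]
histogram = Counter(double_heads)

# Text histogram: one '#' per 5 experiments with that count
for count in sorted(histogram):
    print(f"{count:3d} {'#' * (histogram[count] // 5)}")
```

The printed shape is roughly bell-like and centred near 25, which is why the Binomial is sometimes described as a discrete relative of the normal distribution.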

### Q-Q Plot or Probability Plot

<div class="group">
    <div class="text">
        
* A Q–Q plot is a plot of the quantiles of two distributions against each other.
* It is used to check if the data follows a certain distribution
* If the data points roughly follow a straight line, your data approximately follows that distribution
        
</div>
    <div class="images"> 
        <img src="../images/eda_intro/img_p48_2.png">
    </div>
</div>
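
The points of a normal Q-Q plot can be computed by hand, which shows what the plot actually compares. This is a sketch under assumptions: the sample is hypothetical, the reference distribution is a normal fitted to the sample, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n avoid the undefined 0% and 100% quantiles
    return [(fitted.inv_cdf((i + 0.5) / n), value) for i, value in enumerate(ordered)]

# Hypothetical measurements, for illustration only
sample = [4.8, 5.1, 5.3, 5.6, 5.9, 6.0, 6.2, 6.5, 6.9, 7.3]

points = normal_qq_points(sample)
for theoretical, observed in points:
    print(f"{theoretical:6.2f} {observed:6.2f}")
```

Plotting the pairs as a scatter plot gives the Q-Q plot; points close to the diagonal suggest that the normal distribution is a reasonable description.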

### Visualising Data Distributions with Histograms

### Box Plots
What do we see here?
<center>
<img src="../images/eda_intro/img_p52_2.png">
</center>


#### Binwidth

<div class="group">
    <div class="text">
        
* **Binwidth matters!**
* Same data with bin width = 5, 2, 1
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p23_2.png">
    </div>
</div>


#### How to choose number of bins
Rule of thumb: Take the square root of the sample size as the number of bins for a first guess (trial and error).

Consider:  
- bins too wide &rarr; we lose details or end up having only one bin
- bins too narrow &rarr; we have too much detail or end up with one bin per observation
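
The square-root rule of thumb and the effect of the bin count can be tried out with a small sketch. The helper names and the data are hypothetical; plotting libraries offer their own binning options.

```python
import math

def sqrt_rule_bins(n_observations):
    """First guess for the number of bins: square root of the sample size."""
    return math.ceil(math.sqrt(n_observations))

def histogram_counts(values, n_bins):
    """Count observations in n_bins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just outside, so it goes into the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical data: the integers 1..25, for illustration only
values = list(range(1, 26))

n_bins = sqrt_rule_bins(len(values))
counts = histogram_counts(values, n_bins)
print(n_bins, counts)

# One bin hides all detail; one bin per observation shows no shape at all
print(histogram_counts(values, 1))
print(histogram_counts(values, len(values)))
```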

### Comparing Data Distributions with Histograms

<div class="group">
    <div class="text">
Histograms:<br>

* Good for looking at residuals (variance)
* Works best for comparing max 3-4 groups
* You can use so-called kernel density estimates (KDE) to plot it continuously  
    </div>
    <div class="text">
        <img src="../images/eda_intro/img_p50_2.png">
    </div>
</div>


Notes: A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions

#### Modality

* Plot a histogram and look at the number of peaks in the distribution

<center>
    <img src="../images/eda_intro/img_p22_2.png">
</center>

#### Data Summaries


| Central tendency | Spread | Modality | Shape | Outliers |
|:----------------:|:------:|:--------:|:-----:|:--------:|
| mean | range | unimodal | skewness | |
| median | interquartile range | bimodal | kurtosis | |
| mode | variance | multimodal | | |
| quantiles | standard deviation | uniform | | |

## Multivariate EDA

### Numerical Data

Notes: Most of the descriptive statistics can be applied to numerical data. For categorical data the choice is limited.

#### Scatter Plot
<div class="group">
    <div class="text">
        
* Used to visualise relationship between two numeric variables
* Also called correlation plots
* Can encode multiple dimensions by color and size

It visually answers the question:  
        
→ “How are these variables related?”    
        
→ “When variable X grows, what happens to variable Y? With which intensity?”
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p35_2.png">
    </div>
</div>

#### Pearson correlation coefficient / Pearson’s r
<div class="group">
    <div class="text">
        
* Measures the **linear relationship** between two variables
* Ranges between -1 and 1
        
→ close to 1: strong positive linear relationship
        
→ around 0: no linear relationship
        
→ close to -1: strong negative linear relationship
        
[guess the correlation](https://guessthecorrelation.com/)
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p36_3.png">
    </div>
</div>

Notes: https://guessthecorrelation.com/

#### The ice cream example

<div class="group">
    <div class="text">
        
| Month | Average Temp | Sales |
|-------|--------------|-------|
| January | 4 | 73 |
| February | 4 | 57 |
| March | 7 | 81 |
| April | 8 | 94 |
| May | 12 | 110 |
| June | 15 | 124 |
| July | 16 | 134 |
| August | 17 | 139 |
| September | 14 | 124 |
| October | 11 | 103 |
| November | 7 | 81 |
| December | 5 | 80 |
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p35_2.png">
    </div>
</div>

#### The ice cream example

<div class="group">
    <div class="text">
        
**r = 0.983**
        
<div class="alert alert-block alert-info">
<b>Note:</b> 

A correlation analysis may establish a linear
relationship but does not allow us to use it to
predict the value of one variable given another.
Regression analysis allows us to do this and
more.
</div>
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p35_2.png">
    </div>
</div>

Difference between correlation (just one number) and regression (the predictive formula): https://www.g2.com/articles/correlation-vs-regression
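
The ice cream numbers can be used to contrast the two ideas: Pearson's r is a single number, while a regression line is a formula that can be used for prediction. The sketch below uses plain Python and ordinary least squares; the function names are illustrative.

```python
import math

# Monthly values from the ice cream table
temps = [4, 4, 7, 8, 12, 15, 16, 17, 14, 11, 7, 5]
sales = [73, 57, 81, 94, 110, 124, 134, 139, 124, 103, 81, 80]

def pearson_r(x, y):
    """Covariation of x and y, scaled by the variation of each variable."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Slope and intercept of the line minimising the squared errors in y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

r = pearson_r(temps, sales)
slope, intercept = least_squares_line(temps, sales)

print(f"r = {r:.3f}")
print(f"sales ~ {intercept:.2f} + {slope:.3f} * temp")
```

The correlation only says how strongly the points cluster around a line; the slope and intercept say which line, and therefore what sales to expect at a given temperature.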

#### Spearman rank correlation coefficient - Spearman’s ρ


<div class="group">
    <div class="text_70">
 

* Measure of rank correlation: it is based on the ranks of the values rather than the raw data
* Represents the strength of a monotonic relationship 

**Monotonic function:**

→ increasing: as X increases Y never decreases

→ decreasing: as X increases Y never increases      
        
</div>
    <div class="text_30"> 
        <img src="../images/eda_intro/spearman.png" width=200>
    </div>
</div>


Notes:  
similar to Pearson: instead of using the raw data, use the rank numbers for x as well as for y.  
Monotonically increasing - as the x variable increases the y variable never decreases and vice versa.
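
Spearman's ρ can be sketched as "Pearson on ranks". The helpers below are illustrative (tied values share the average of their ranks, a common convention), and the cubic data is a hypothetical example of a monotonic but non-linear relationship.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks instead of the raw values."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but non-linear relationship: y = x^3
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

Because y never decreases as x increases, the ranks agree perfectly and ρ reaches its maximum, while Pearson's r stays below 1 because the relationship is not a straight line.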

#### Considerations about correlation  

<div class="group">
    <div class="text">
        
* If two variables are independent, their correlation is 0, but a correlation of 0 does not imply that two variables are independent!  
* The correlation coefficients cannot replace visual examination of data.  
* The presence of correlation is not enough to infer causation!  
        
   </div>
   <div class="images"> 
        <img src="../images/eda_intro/img_p40_2.png">
   </div>
</div>

Notes:  
Visualization would reveal those non-linear relationships, but it is not feasible when there are too many features.
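
A tiny constructed example shows why a correlation of 0 does not imply independence: here y is completely determined by x, yet Pearson's r is 0 because the relationship is not linear. The data is hypothetical.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# y depends perfectly on x, but symmetrically around x = 0
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_parabola = pearson_r(x, y)
print(f"r = {r_parabola:.3f}")
```

A scatter plot of these points would show the parabola immediately, which is why the coefficient cannot replace looking at the data.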

#### Correlation != Causation

<center>
<img src="../images/eda_intro/img_p42_2.png" width=800>
</center>

### Bivariate and Multivariate Analysis
<div class="group">
    <div class="text">
        
Looking at all possible combinations of features:
* for 9 features bivariate would mean 36 combinations: $\sum_{i=1}^{8} i$
* How do we reduce the exploration space or focus on interesting combinations?
        
[Correlation Matrix](https://www.displayr.com/what-is-a-correlation-matrix/)
        
</div>
    <div class="images"> 
        <img src="../images/eda_intro/img_p43_1.png">
    </div>
</div>

Notes:  
check correlation matrix for correlated combinations to  
- discover redundancies  
- form hypotheses relevant for our interest/business case and focus on confirming or refuting them  

3 terms describing similar things: pairplot, correlation matrix, heatmap  
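
The count of 36 bivariate combinations for 9 features can be verified by enumerating the pairs. The feature names below are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical feature names, for illustration only
features = [f"feature_{i}" for i in range(1, 10)]  # 9 features

pairs = list(combinations(features, 2))  # every unordered pair exactly once
n = len(features)

print(len(pairs))        # number of bivariate combinations
print(n * (n - 1) // 2)  # closed form of 1 + 2 + ... + (n - 1)
```

The number of pairs grows quadratically with the number of features, which is why a correlation matrix is a useful first filter.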

### Special consideration of discrete / categorical data  
* mode
* frequency tables: number of times a value occurs
* expected values: weighted mean when categories can be associated with numerical value

Notes: Most of the descriptive statistics can be applied to numerical data. For categorical data the choice is limited.

#### Frequency tables
<div class="group">
    <div class="text">
        
* Tabulation of the frequencies
* Show the range of values and frequency of occurrence
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p30_2.png">
    </div>
</div>
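
A frequency table for a categorical variable can be sketched with the standard library. The observations below are hypothetical.

```python
from collections import Counter

# Hypothetical categorical observations, for illustration only
colours = ["red", "blue", "red", "green", "blue", "red"]

frequencies = Counter(colours)
total = sum(frequencies.values())

# Absolute and relative frequency per category, most frequent first
for value, count in frequencies.most_common():
    print(f"{value:<6} {count:>3} {count / total:6.1%}")
```

The first row of the table is also the mode of the variable.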

#### Expected values
<div class="group">
    <div class="text">
        
* weighted mean
        
*Example:
Offers for different course plans.
For financial purposes we can sum this up in a single
“expected value”, which is a form of weighted mean
in which the weights are probabilities.*

$EV = 0.05 \cdot 300 + 0.15 \cdot 50 + 0.80 \cdot 0 = 22.5$        
        
</div>
    <div class="text"> 
        <img src="../images/eda_intro/img_p31_2.png">
    </div>
</div>

Notes: Example: A marketer for a new cloud technology offers two levels of service, one priced at \$300/month and another at \$50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the \$300 service, 15% will sign up for the \$50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value”, which is a form of weighted mean in which the weights are probabilities.
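
The expected value from the example is a probability-weighted mean and can be sketched in a few lines; the variable names are illustrative.

```python
import math

# (probability, monthly revenue) per outcome, taken from the example
outcomes = [(0.05, 300), (0.15, 50), (0.80, 0)]

# The probabilities must cover all outcomes, i.e. sum to 1
assert math.isclose(sum(p for p, _ in outcomes), 1.0)

# Weighted mean: each value counts in proportion to its probability
expected_value = sum(p * value for p, value in outcomes)
print(f"EV = {expected_value:.2f}")
```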

#### Cross-tabulation

<center>
<img src="../images/eda_intro/img_p45_2.png" width=700>
</center>
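
A cross-tabulation counts how often each combination of two categorical variables occurs. The sketch below uses only the standard library; the records and category names are hypothetical.

```python
from collections import Counter

# Hypothetical records of two categorical variables: (plan, churned)
records = [
    ("basic", "yes"), ("basic", "no"), ("basic", "yes"),
    ("premium", "no"), ("premium", "no"), ("premium", "yes"),
    ("basic", "no"), ("premium", "no"),
]

cell_counts = Counter(records)  # one count per (row, column) combination
rows = sorted({plan for plan, _ in records})
cols = sorted({churned for _, churned in records})

# Print the table: one row per plan, one column per churn value
print("plan".ljust(10) + "".join(c.rjust(6) for c in cols))
for r in rows:
    print(r.ljust(10) + "".join(str(cell_counts[(r, c)]).rjust(6) for c in cols))
```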

## Summary / Outlook

EDA is like a detective's investigation to  
- understand the data
- identify patterns

Why do we want to know our data? Because we want to find out
- how to answer our research / business question with the existing data
- whether the data is suitable / sufficient
- how to phrase/refine our hypotheses

#### Summaries vs. Details...

<center>
    <img src="../images/eda_intro/img_p26_2.gif" width=800>
</center>

Notes: Condensing the information from data into a few values is helpful, but we also lose a lot of detail → datasets with the same summary statistics can look completely different. Combine statistics with visualisation.
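
The same point can be made with two tiny constructed datasets (hypothetical, far simpler than the animated example): they share mean, median and variance, yet one has a central cluster with two far-out values and the other has two separate groups.

```python
import math
import statistics

# Two hypothetical datasets constructed to share their summary statistics
a = [7, 10, 10, 10, 10, 13]  # tight centre plus two far-out values
b = [8, 8, 9, 11, 12, 12]    # two separate groups, nothing in the middle

for name, values in (("a", a), ("b", b)):
    print(
        name,
        statistics.mean(values),
        statistics.median(values),
        statistics.variance(values),
        values,
    )
```

Only looking at the values themselves (or a plot of them) reveals the difference in shape.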

#### Techniques Map

<center>
<img src="../images/eda_intro/img_p59_6.png" width=900>
</center>

[Techniques Map](https://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf)

#### Visual Vocabulary

<center>
<img src="../images/eda_intro/img_p60_3.png" width=800>
</center>

[Visual Vocabulary](https://github.com/Financial-Times/chart-doctor/tree/main/visual-vocabulary)

#### References
* [Introduction to Data Science Lecture 6 Exploratory Data Analysis](https://bcourses.berkeley.edu/courses/1267848/files/51138259/download?verifier=SiKALrm4j6S8wszD3ZCnn7a39CcRwtCQnN1U8Eqp&wrap=1)
* [Exploratory Data Analysis](https://archive.org/details/exploratorydataa0000tuke_7616) - John W. Tukey
* [Practical Statistics for Data Science](https://books.google.de/books/about/Practical_Statistics_for_Data_Scientists.html?id=k2XcDwAAQBAJ&redir_esc=y) - Peter Bruce & Andrew Bruce
* [Econometric Methods with Applications in Business and Economics](https://dl.rasabourse.com/Econometrics_Erasmus/Christiaan_Heij_Paul_de_Boer.pdf) - Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek, Herman K. van Dijk
* [http://www.sumsar.net/blog/2014/03/oldies-but-goldies-statistical-graphics-books](http://www.sumsar.net/blog/2014/03/oldies-but-goldies-statistical-graphics-books)
* [The Dino dataset](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html)
* [AMLD EDA workshop](https://github.com/terezaif/workshops_data_exploration)
* [Clearly explained pearson vs. spearman correlation coefficient](https://towardsdatascience.com/clearly-explained-pearson-v-s-spearman-correlation-coefficient-ada2f473b8)
* [Visual Vocabulary](https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary)
* [EDA Checklist](https://github.com/neuefische/datascience-infographics/blob/main/EDA_Checklist.md)
* [https://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf](https://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf)
* [https://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf](https://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf)
* [A bit on distributions](https://www.kdnuggets.com/2020/02/probability-distributions-data-science.html)
* [Even more distributions](http://www.math.wm.edu/~leemis/chart/UDR/UDR.html)