# Chapter 1. Exploratory Data Analysis

pages 14-35

## History
- Probability Theory (the mathematical foundation for statistics)
    - developed in the 17th to 19th centuries
    - Thomas Bayes, Pierre-Simon Laplace, Carl Gauss
- Statistics
    - applied science concerned with analysis and modeling of data
    - Francis Galton and Karl Pearson
- Modern Statistics
    - roots trace back to the 1800s
    - R. A. Fisher (leading pioneer in the early 20th century)
        - experimental design
        - maximum likelihood estimation

(A) Classical statistics vs (B) Exploratory data analysis [EDA] 

(A) focused almost exclusively on inference

John W. Tukey in 1962
- wrote a seminal paper called "The Future of Data Analysis"
- proposed a new scientific discipline called _data analysis_
- forged links to the engineering and computer science communities
- coined the term 'bit'

## Elements of Structured Data

Sources of raw (often unstructured) data
- images
- text
- clickstreams

### 2 basic types of structured data:
1. numeric
    1. continuous
        - ex. wind speed, time duration, temperature
    2. discrete
        - ex. count of occurrence
2. categorical
    1. binary data (special case of categorical data)
        - [0,1] [yes, no] [true, false]
    2. ordinal data
        - rating from 1-5
        
Why learn the taxonomy/classification of data types?
- The data type helps determine the appropriate **type of visual display**, **data analysis**, or **statistical model**
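
As a rough illustration of the taxonomy above, the following sketch shows one way each data type might be represented in R. All variable names and values are made up for the example.

```r
# Continuous numeric data: any value in an interval
wind_speed <- c(3.2, 7.8, 5.1)

# Discrete numeric data: integer counts
n_events <- c(0L, 2L, 5L)

# Categorical data: a fixed set of labels
tv_type <- factor(c("LCD", "LED", "plasma", "LED"))

# Binary data: a categorical with exactly two values
is_member <- c(TRUE, FALSE, TRUE)

# Ordinal data: categories with an explicit order (rating from 1 to 5)
rating <- factor(c(3, 5, 1), levels = 1:5, ordered = TRUE)

class(wind_speed)      # "numeric"
class(n_events)        # "integer"
class(tv_type)         # "factor"
class(is_member)       # "logical"
class(rating)          # "ordered" "factor"
rating[1] < rating[2]  # TRUE: ordered factors support comparisons
```

Declaring a column as a factor or an ordered factor tells R functions to treat it as categorical rather than as free text or a plain number.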

### Rectangular Data

- Like a spreadsheet or database table
- 2-dimensional matrix with rows/records and columns/features
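
A minimal sketch of rectangular data in R, using a `data.frame` built by hand. The three rows are copied from the state table that appears later in these notes; the variable name `df` is only for illustration.

```r
# A tiny rectangular data set: rows are records, columns are features
df <- data.frame(
  State = c("Alabama", "Alaska", "Arizona"),
  Population = c(4779736, 710231, 6392017),
  Murder.Rate = c(5.7, 5.6, 4.7)
)

dim(df)             # 3 rows, 3 columns
names(df)           # column (feature) names
str(df)             # type of each column
df[2, ]             # one record
df[["Population"]]  # one feature as a vector
```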

### Nonrectangular Data Structures
- Time Series
- Spatial data structures
- Graph data structures

---
# Dimensions of summarizing a feature
1. Location
2. Variability
---

## Metrics of Location/ Estimates of Location

Goal: To find the best measure to describe the central value
- Mean
    - sum of all the values divided by the number of values.
$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
- Trimmed mean
    - eliminates the influence of extreme values
    - used in competitions (remove highest and lowest score)
    - can be thought of as a compromise between the median and the mean
        - it is robust to extreme values in the data, but uses more of the data than the median to estimate location
- Weighted Mean  
    - 2 motivations for using this
        - some values are intrinsically more variable than others, so highly variable observations are given a lower weight
        - the data collected may underrepresent or overrepresent some groups, and weights can correct for this
<br>
- Median
    - is the middle number on a sorted list of the data
    - robust to outliers
- Weighted Median  
    - is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list
    - robust to outliers
- Outlier  
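
To make the differences between these estimates concrete, here is a small sketch on a made-up vector with one extreme value. The data and weights are assumptions chosen so the arithmetic is easy to check by hand.

```r
# Made-up sample with one extreme value
x <- c(1, 2, 3, 4, 100)

mean(x)              # 22: pulled upward by the extreme value
mean(x, trim = 0.2)  # 3: drops one value from each end, then averages 2, 3, 4
median(x)            # 3: middle value of the sorted data

# Weighted mean: give the suspect observation zero weight
w <- c(1, 1, 1, 1, 0)
weighted.mean(x, w)  # 2.5 = (1 + 2 + 3 + 4) / 4
```

The trimmed mean and the median agree here, while the plain mean is dominated by the single value of 100.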

---
## R

data.frame
row.names

data.frame - does not support user-specified or multilevel indexes

2 popular packages
1. data.table
2. dplyr
---

Example: Location Estimates of Population and Murder Rates

In [1]:
state <- read.csv(file="data/state.csv")
mean(state[["Population"]])

In [5]:
mean(state[["Population"]], trim=0.1)
# excludes the largest and smallest 5 states, drops 10% from each end

In [3]:
median(state[["Population"]])

In [4]:
state

State,Population,Murder.Rate,Abbreviation
Alabama,4779736,5.7,AL
Alaska,710231,5.6,AK
Arizona,6392017,4.7,AZ
Arkansas,2915918,5.6,AR
California,37253956,4.4,CA
Colorado,5029196,2.8,CO
Connecticut,3574097,2.4,CT
Delaware,897934,5.8,DE
Florida,18801310,5.8,FL
Georgia,9687653,5.7,GA


If we want to compute the average murder rate for the country, we need to use a weighted mean or median to account for different populations in the states.

In [6]:
weighted.mean(state[["Murder.Rate"]], w=state[["Population"]])

In [9]:
# install.packages("matrixStats")


In [10]:
library("matrixStats")

In [11]:
weightedMedian(state[["Murder.Rate"]], w=state[["Population"]])
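
For readers who want to see the idea without the package, the definition of the weighted median given earlier can be sketched in base R. The helper `weighted_median_sketch` is hypothetical and is not part of `matrixStats`. It returns the smallest value at which the cumulative weight reaches half of the total, so its result may differ slightly from `weightedMedian`, which by default interpolates between neighboring values.

```r
# Hypothetical helper (not from matrixStats): smallest value at which the
# cumulative weight reaches half of the total weight
weighted_median_sketch <- function(x, w) {
  ord <- order(x)   # sort the values, carrying the weights along
  x <- x[ord]
  w <- w[ord]
  cum_w <- cumsum(w)
  x[which(cum_w >= sum(w) / 2)[1]]
}

# Made-up check values
weighted_median_sketch(c(5, 1, 3, 2, 4), rep(1, 5))    # 3, same as median()
weighted_median_sketch(c(1, 2, 3, 4), c(1, 1, 1, 10))  # 4, the heavy value dominates
```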

## Estimates of Variability/Dispersion

Terms
- Deviations (errors, residuals)
    - diff between observed values and the estimate of location
- Variance (mean-squared error)
    - sum of squared deviations from the mean divided by `n-1` (n = # of values) 
- Standard deviation (l2-norm, Euclidean norm)
    - square root of the variance
- Mean absolute deviation (l1-norm, Manhattan norm)
    - mean of the absolute value of the deviations from the mean
- Median absolute deviation from the median
- Range
- Order statistics (ranks)
- Percentile (quantile) 
- Interquartile range (IQR)
    - diff between the 75th percentile and the 25th percentile
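
The order-based terms above map onto base R functions roughly as follows. This is a sketch on made-up data; `quantile` supports several interpolation rules, and the values shown are for its default.

```r
x <- 1:9   # made-up data, already sorted

range(x)                    # 1 9: smallest and largest value
diff(range(x))              # 8: the range as a single number
sort(c(7, 2, 9))            # order statistics: 2 7 9
quantile(x, c(0.25, 0.75))  # 25th and 75th percentiles: 3 and 7
IQR(x)                      # 4 = 7 - 3
```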


## Standard Deviation and Related Estimates

Goal: tell us how dispersed the data is around the central value

For a set of data {1, 4, 4}, the mean is 3 and the median is 4.
The deviations from the mean are the differences:  
1 – 3 = –2  
4 – 3 = 1  
4 – 3 = 1.  
The absolute value of the deviations is {2 1 1} and their average is (2 + 1 + 1) / 3 = 1.33.

- the sum of the deviations from the mean is precisely zero
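
The arithmetic for the data set {1, 4, 4} can be checked in R with a few lines. This is only a verification sketch; the variable names are arbitrary.

```r
x <- c(1, 4, 4)

mean(x)             # 3
median(x)           # 4
dev <- x - mean(x)  # deviations from the mean: -2 1 1
sum(dev)            # 0: negative and positive deviations cancel
mean(abs(dev))      # 1.333...: mean absolute deviation
```

Because the deviations always sum to zero, their plain average says nothing about spread, which is why the absolute or squared deviations are averaged instead.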


### 1 Mean absolute deviation
$$\text{Mean absolute deviation} = \frac{\sum_{i=1}^{n} | x_i - \bar{x} |}{n}$$

### 2 Variance

$$\text{Variance} = s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

### 3 Standard deviation

$$\text{Standard deviation} = s = \sqrt{\text{Variance}}$$


None of the three is robust to outliers and extreme values.
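
A short sketch of the variance and standard deviation formulas in R, again on {1, 4, 4}, followed by a made-up second vector that adds one extreme value to show how strongly all three estimates react.

```r
x <- c(1, 4, 4)

sum((x - mean(x))^2) / (length(x) - 1)  # 3: variance by the formula
var(x)                                  # 3: built-in, also divides by n - 1
sd(x)                                   # 1.732...: square root of the variance

# One extreme value inflates all three estimates
y <- c(1, 4, 4, 100)
mean(abs(y - mean(y)))  # 36.375
var(y)                  # 2354.25
sd(y)                   # about 48.5
```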

Given that the standard deviation has a more complicated and less intuitive formula, it might seem peculiar that it is preferred in statistics over the mean absolute deviation. It owes its preeminence to statistical theory: mathematically, working with squared values is much more convenient than working with absolute values, especially for statistical models.

---
#### Degrees of Freedom

n or n – 1?

Dividing by n – 1 is based on the premise that you want to make estimates about a population based on a sample.

If you use the intuitive denominator of n in the variance formula, you will underestimate the true value of the variance and the standard deviation in the population. This is referred to as a *biased estimate*. However, if you divide by n – 1 instead of n, the variance becomes an *unbiased estimate*.

To fully explain why using n leads to a biased estimate involves the notion of *degrees of freedom*, which takes into account the number of constraints in computing an estimate. In this case, there are n – 1 degrees of freedom since there is one constraint: the standard deviation depends on calculating the sample mean. For many problems, data scientists do not need to worry about degrees of freedom, but there are cases where the concept is important (see “Choosing K”).
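
The bias can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a proof: it draws many small samples from a standard normal population (true variance 1) and averages the two versions of the variance estimate. The seed, sample size, and number of repetitions are arbitrary choices.

```r
set.seed(1)    # arbitrary seed so the run is repeatable
n <- 5         # small samples make the bias visible
reps <- 10000

# Each column is one sample of size n from a population with variance 1
samples <- matrix(rnorm(n * reps), nrow = n)

var_unbiased <- apply(samples, 2, var)    # divides by n - 1
var_biased <- var_unbiased * (n - 1) / n  # same sum of squares divided by n

mean(var_unbiased)  # close to the true variance, 1
mean(var_biased)    # close to (n - 1) / n = 0.8, an underestimate
```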

---

### 4 Median absolute deviation from the Median or MAD

$$\text{Median absolute deviation} = \text{Median}(|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|)$$

where m is the median of the data.
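
A brief sketch of the MAD in R on made-up data with one extreme value. Note that the built-in `mad` multiplies the result by a scaling constant (1.4826 by default) so that it is comparable to the standard deviation for normally distributed data; passing `constant = 1` gives the unscaled formula.

```r
x <- c(1, 2, 3, 4, 100)  # made-up data with one extreme value

m <- median(x)           # 3
median(abs(x - m))       # 1: the formula applied directly
mad(x, constant = 1)     # 1: built-in, without the scaling constant
mad(x)                   # 1.4826: default scaling
sd(x)                    # about 43.6, dominated by the extreme value
```

Unlike the standard deviation, the MAD barely registers the value of 100, which is what makes it a robust estimate of variability.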