<h2 align="center">Machine Learning</h2> 
<h3 align="center">Travis Millburn<br>Fall 2020</h3> 

<center>
<img src="../images/logo.png" alt="drawing" style="width: 300px;"/>
</center>

<h3 align="center">Class 6: Statistics Fundamentals for Data Science</h3> 


### Outline

1. Exam Next Class

2. Review Today

3. Sampling

4. Mean & Variance


In [1]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

### Key Statistics Concepts

* Mean & Variance (and related estimates)
* Distributions
* Sampling & Inference
* Correlation (& Causation)

### Statistics PRE-QUIZ

Define these terms:
    
1. Mean & Variance
2. Distribution
3. Sample

### Good versus Bad Sampling

Is polling this class a good way to estimate the average age of US college students? Why or why not?

<center>
<img src="../images/sampling_clouds_2.png" alt="drawing" style="width: 700px;" />
</center>



<center>
<img src="../images/pop_vs_sample.JPG" alt="drawing" style="width: 700px;" />
</center>

### Random Sampling

A central concept in statistics. If a sample is taken randomly, so that every member of the population is equally likely to be chosen, then the _Central Limit Theorem_ (CLT), together with the related _Law of Large Numbers_, allows us to compute confidence intervals, $p$-values, etc., describing how good our estimate is.

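Ex: If you sample 100 students randomly, your estimate of the average, e.g., 30.5 years, might come with a confidence interval of $\pm 3.5$ years at a stated confidence level (95% is a common choice). Meaning that intervals built this way, here [30.5-3.5, 30.5+3.5], capture the true population average in about that fraction of repeated samples.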

_HOWEVER_: it all goes out the window if we do a bad job of sampling. 
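
A minimal sketch of the idea above, assuming a made-up population of student ages (the numbers are illustrative, not real data) and the usual normal-approximation interval of 1.96 standard errors for a 95% confidence level:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: 10,000 made-up student ages (not real data).
population = np.maximum(rng.normal(loc=24.0, scale=6.0, size=10_000), 17.0)

# Simple random sample: every member equally likely, drawn without replacement.
n = 100
sample = rng.choice(population, size=n, replace=False)

x_bar = sample.mean()                      # sample mean (our estimate)
std_err = sample.std(ddof=1) / np.sqrt(n)  # estimated standard error of the mean

# Normal-approximation 95% confidence interval: 1.96 standard errors each side.
ci_low = x_bar - 1.96 * std_err
ci_high = x_bar + 1.96 * std_err

print(f"estimate = {x_bar:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
print(f"population mean = {population.mean():.2f}")
```

Rerunning with different seeds should show the interval covering the population mean most of the time, but not always.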

### "Real-World Sampling"

* Psychology studies that use psychology students.
* Several hours spent recording info on cars that drive on a street through campus.
* Internet searches, YouTube searches. 
* Study that uses patients who come into their clinic.
* Online polls.

Sampling bias: error in an estimate caused by sampling that is not sufficiently random. What kinds of biases might appear in the above list?
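
To make the effect concrete, here is a small simulation sketch. The population (mostly undergraduates plus a smaller group of older graduate students) is entirely hypothetical, and the "convenience" sample stands in for any of the non-random schemes listed above:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population: 9,000 undergraduates and 1,000 graduate students.
undergrads = rng.normal(loc=20.0, scale=1.5, size=9_000)
grads = rng.normal(loc=28.0, scale=4.0, size=1_000)
population = np.concatenate([undergrads, grads])

# Random sample: drawn from the whole population.
random_sample = rng.choice(population, size=200, replace=False)

# Biased convenience sample: only people who happen to be in a graduate seminar.
biased_sample = rng.choice(grads, size=200, replace=False)

print(f"population mean    = {population.mean():.2f}")
print(f"random sample mean = {random_sample.mean():.2f}")
print(f"biased sample mean = {biased_sample.mean():.2f}")
```

Both samples have the same size, so the gap in the biased estimate comes from how the sample was chosen, not from how many people were asked.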

### Most basic things to do in statistics:

<ol start="0">
  <li> Data display - _plots, histograms, ..._ </li>
  <li> Compute averages $\rightarrow$ _the mean_</li>
  <li> Quantify variability $\rightarrow$ _variance_, _standard deviation_</li>
</ol>

<center>
<img src="../images/key_terms.JPG" alt="drawing" style="width: 700px;" />
</center>

### Means... Simple averages

<img src="../images/population_and_sample_Nn.png" alt="drawing"  width="30%"  align="right"/>

#### Population mean: $\mu = \frac{\sum_{i=1}^N x_i}{N}$ 

   ...As in _the whole population_

#### Sample mean: $ \bar{x} = \frac{\sum_{i=1}^n x_i}{n}$ 

   ...As in _just the sample_



### Sample Mean (a.k.a. "Mean")

The arithmetic average of the data values

$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + \ldots + x_n}{n} $$
where $n$ is the sample size.
    
* The most common measure of center
* Can be affected by extreme data values (outliers)

<center><img src="../images/mean.png" width="600"></center>

In [2]:
print( (1+2+3+4+5)/5, np.mean((1,2,3,4,5)), (1+2+3+4+10)/5, np.mean((1,2,3,4,10)) )

3.0 3.0 4.0 4.0


### Mode

The most frequently occurring value

* There may be no mode or several modes. "Multimodal" implies multiple peaks in histogram.
* Not affected by extreme values (outliers)
    
<center><img src="../images/mode.png" width="600"></center>
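
One possible way to compute the mode, sketched with the standard library's `collections.Counter`. The helper name `modes` is hypothetical; it returns every value tied for the highest count, so a multimodal data set yields several values:

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest count, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 4]))    # [2]
print(modes([1, 1, 2, 2, 3]))    # [1, 2]  (two modes)
print(modes([1, 2, 2, 3, 400]))  # [2]     (the outlier 400 does not matter)
```

When every value appears exactly once, this helper returns all of them, which is one way of saying the data has no meaningful mode.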

### Percentile

The $p^{th}$ percentile - $p\%$ of the values in the data are less than or equal to this value ($0 \leq p \leq 100$)

#### Quartile: 
* $1^{st}$ quartile = $25^{th}$ percentile
* $2^{nd}$ quartile = $50^{th}$ percentile = median
* $3^{rd}$ quartile = $75^{th}$ percentile
    
<center><img src="../images/quartile.png" width="500"></center>

How might you compute this?
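
One option is `np.percentile`, sketched below on a small made-up data set. Note that several conventions exist for percentiles of small samples; NumPy's default interpolates linearly between neighboring sorted values, so other tools may give slightly different answers:

```python
import numpy as np

data = np.array([1, 2, 5, 10, 11, 12, 13, 18])

# 25th, 50th and 75th percentiles (the three quartiles),
# using NumPy's default linear interpolation.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

print(q1, q2, q3)        # 4.25 10.5 12.25
print(np.median(data))   # 10.5, same as the 2nd quartile
```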

### Variance... trickier

<img src="../images/population_and_sample_Nn.png" alt="drawing"  width="30%"  align="right"/>
    
#### Population variance: $\sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}$ 

   ...As in _the whole population_

#### Sample variance: $ s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}$ 

   ...As in _just the sample_
    
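Note that sneaky $n-1$ denominator. Dividing by $n-1$ instead of $n$ compensates for measuring deviations from $\bar{x}$ rather than from the unknown $\mu$, which would otherwise make the estimate too small on average.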

#### Standard deviation = $\sqrt{\text{Variance}}$ 

Variance is a kind of average too.
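
A short sketch of the two denominators, using a small made-up data set. In NumPy the `ddof` argument ("delta degrees of freedom") selects the denominator $n - \text{ddof}$; NumPy defaults to `ddof=0` while pandas defaults to `ddof=1`, which is an easy source of mismatched results:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5, squared deviations sum to 32

pop_var = np.var(data)             # divides by N     -> 32 / 8 = 4.0
sample_var = np.var(data, ddof=1)  # divides by n - 1 -> 32 / 7

pop_std = np.sqrt(pop_var)         # 2.0
sample_std = np.sqrt(sample_var)

print(pop_var, sample_var)
print(pop_std, sample_std)
```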

### Exercise


#### Sample variance: $ s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}$ 
    
#### Standard deviation = $\sqrt{\text{Variance}}$ 

How do these relate to the norms we covered?

Do these by hand...

In [3]:
import numpy
import pandas
vals = [1, 2, 5, 10, 11, 12, 13, 18]
vals_df = pandas.DataFrame(vals, columns=['value'])
vals_df

|   | value |
|---|-------|
| 0 | 1 |
| 1 | 2 |
| 2 | 5 |
| 3 | 10 |
| 4 | 11 |
| 5 | 12 |
| 6 | 13 |
| 7 | 18 |


In [4]:
vals_df['delta'] = vals_df['value'] - vals_df['value'].mean()
vals_df['delta_sq'] = vals_df['delta'] **2
vals_df

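|   | value | delta | delta_sq |
|---|-------|-------|----------|
| 0 | 1 | -8.0 | 64.0 |
| 1 | 2 | -7.0 | 49.0 |
| 2 | 5 | -4.0 | 16.0 |
| 3 | 10 | 1.0 | 1.0 |
| 4 | 11 | 2.0 | 4.0 |
| 5 | 12 | 3.0 | 9.0 |
| 6 | 13 | 4.0 | 16.0 |
| 7 | 18 | 9.0 | 81.0 |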


In [5]:
#variance:
vals_df['delta_sq'].sum() / (vals_df['delta_sq'].count()  - 1)

34.285714285714285

In [6]:
#std deviation by hand
(vals_df['delta_sq'].sum() / (vals_df['delta_sq'].count()  - 1) ) ** (1/2)

5.855400437691199

In [7]:
#What if we just ask pandas for the variance?
vals_df['value'].var()

34.285714285714285

In [8]:
#What if we just ask pandas for the std_deviation ?
vals_df['value'].std()

5.855400437691199
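
On the question of how these relate to norms: assuming the Euclidean ($L_2$) norm was among the norms covered, the sample standard deviation is the $L_2$ norm of the deviations from the mean, scaled by $\sqrt{n-1}$. A sketch on the same values:

```python
import numpy as np

vals = np.array([1, 2, 5, 10, 11, 12, 13, 18], dtype=float)

deltas = vals - vals.mean()        # deviations from the sample mean
l2 = np.linalg.norm(deltas)        # Euclidean norm: sqrt of the sum of squares
s = l2 / np.sqrt(len(vals) - 1)    # sample standard deviation

print(l2 ** 2)   # sum of squared deviations, approximately 240
print(s)         # matches the pandas .std() result above
```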

## What is the mean?

<center><img src="../images/dot_diagram_0.png" alt="drawing" style="width: 700px;"/></center>

       x = [1048 1059 1047 1066 1040 1070 1037 1073]


## <center>What is the mean?</center>

<center>
<img src="../images/fig2-2.png" alt="drawing" style="width: 800px;"/>
</center>





In [9]:
#Calculate by hand
(1048 + 1059 + 1047 + 1066 + 1040 + 1070 + 1037 + 1073)/8

1055.0

In [10]:
x_df = pandas.DataFrame([1048, 1059, 1047, 1066, 1040, 1070, 1037, 1073])
x_df.mean()

0    1055.0
dtype: float64

## What is the Variance & Standard Deviation?

<center>
<img src="../images/fig2-2.png" alt="drawing" style="width: 800px;"/>    
</center>

... how much are these points _spread out_ from the mean?


## What is the Variance & Standard Deviation?

* ## Sample variance: $ s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}$ 

<center><img src="../images/fig2-3.png" alt="drawing" style="width: 800px;"/></center>

* What are the units of variance and standard deviation?
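
A sketch of the calculation for the eight $x$ values shown earlier, using NumPy with `ddof=1` for the $n-1$ denominator. The comments note the units: variance is in squared units of the data, while standard deviation is in the same units as the data.

```python
import numpy as np

x = np.array([1048, 1059, 1047, 1066, 1040, 1070, 1037, 1073])

x_bar = x.mean()     # 1055.0, in the units of x
s2 = x.var(ddof=1)   # 1348 / 7, in squared units of x
s = x.std(ddof=1)    # square root of s2, back in the units of x

print(f"mean = {x_bar}, variance = {s2:.2f}, std deviation = {s:.2f}")
```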

## Recap

\begin{align}
 \text{Population mean} &= \mu = \frac{\sum_{i=1}^N x_i}{N} \\
 \text{Sample mean}     &= \bar{x} = \frac{\sum_{i=1}^n x_i}{n} \\ 
 \text{Population variance} &= \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N} \\
 \text{Sample variance}     &= s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^n x_i^2  - \frac{1}{n}(\sum_{i=1}^n x_i)^2}{n - 1} \\ 
 \text{Standard deviation} &= \sqrt{\text{Variance}}
\end{align}
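
The two forms of the sample variance in the recap can be checked numerically. This sketch reuses the exercise values and compares the definition with the shortcut form:

```python
import numpy as np

vals = np.array([1, 2, 5, 10, 11, 12, 13, 18], dtype=float)
n = len(vals)

# Definition: squared deviations from the sample mean, divided by n - 1.
by_definition = np.sum((vals - vals.mean()) ** 2) / (n - 1)

# Shortcut: sum of squares minus (sum)^2 / n, divided by n - 1.
by_shortcut = (np.sum(vals ** 2) - np.sum(vals) ** 2 / n) / (n - 1)

print(by_definition, by_shortcut)   # both approximately 34.2857
```

The shortcut is convenient by hand, but in floating point it subtracts two large, nearly equal numbers when the mean is large relative to the spread, so the definition form is usually the safer choice in code.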

#### The normal distribution is also referred to as a Gaussian distribution after Carl Friedrich Gauss, a prodigious German mathematician of the late 18th and early 19th centuries. Another name previously used for the normal distribution was the “error” distribution. Statistically speaking, an error is the difference between an actual value and a statistical estimate such as the sample mean.

<center><img src="../images/normal_dist.JPG" alt="drawing" style="width: 800px;"/></center>