## 02. Statistics

#### Resources

[Data Science from Scratch (pdf)](http://math.ecnu.edu.cn/~lfzhou/seminar/[Joel_Grus]_Data_Science_from_Scratch_First_Princ.pdf#page=96)<br/>
[How to Read Mathematical Formulae (video)](https://www.youtube.com/watch?v=-mu3TYZ_udM)<br/>
[Latex Editor](https://www.latex4technics.com/)<br/>
[List of Mathematical Symbols](https://www.rapidtables.com/math/symbols/Basic_Math_Symbols.html)<br/>
[List of Latex Symbols](https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols)<br/>
[Standard Deviation](https://simple.wikipedia.org/wiki/Standard_deviation)<br/>
[Variance and Standard Deviation](https://www.sciencebuddies.org/science-fair-projects/science-fair/variance-and-standard-deviation)<br/>
[Bessel's Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction)<br/>
[Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)<br/>
[Simpson's Paradox](http://www.statisticshowto.com/what-is-simpsons-paradox/)<br/>

#### Modules

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling as pdpf
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Text & Code

Statistics is the practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample. Essentially it refers to gaining an understanding of something from data.

In [None]:
# Some random data
np_data = np.random.randint(low=0,high=20,size=20)
np_data2 = np.random.randint(low=0,high=20,size=20)
pd_data = pd.DataFrame(np_data.tolist(), columns = ['data'])
pd_data2 = pd.DataFrame(np_data2.tolist(), columns = ['data2'])
pd_data2 = pd.concat([pd_data, pd_data2], axis=1)

# Basic Statistics for Numpy
np_data.min()                            # Minimum value
np_data.max()                            # Maximum value
np_range = np_data.max() - np_data.min() # Range
np.median(np_data)                       # Median value
stats.mode(np_data)                      # Mode value
np_data.sum()                            # Sum
np_data.cumsum()                         # Cumulative sum
np.percentile(np_data,10)                # 10th percentile

# Basic Statistics for Pandas
pd_data['data'].min()                                     # Minimum value
pd_data['data'].max()                                     # Maximum value
pd_range = pd_data['data'].max() - pd_data['data'].min()  # Range
pd_data['data'].median()                                  # Median value
pd_data['data'].mode()                                    # Mode value
pd_data['data'].sum()                                     # Sum 
pd_data['data'].cumsum()                                  # Cumulative sum
pd_data['data'].quantile(0.1)                             # 10th percentile

#### Basic Stats Exploration

In [None]:
# Automated stats using Pandas
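# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 keeps the first column as the index
df = pd.read_csv("https://raw.githubusercontent.com/nickhould/craft-beers-dataset/master/data/processed/beers.csv", index_col=0)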

In [None]:
df.info()                 # Basic Information

In [None]:
df.dtypes                 # Data Types

In [None]:
df.describe()             # Basic stats for a dataset

In [None]:
pdpf.ProfileReport(df)    # Creating a profile report from a pandas dataset

In [None]:
df.hist(bins=50, figsize=(20,15))    # Plotting the data
plt.show()

#### Statistical Functions & Formulae

**Moments**

In statistics, moments are quantities that describe the shape of a set of numbers, that is, what a histogram of the numbers looks like: where it is centred, how spread out it is, how symmetric it is, and how heavy its tails are. The four most commonly used are:

1. Mean
2. Variance
3. Skewness
4. Kurtosis

There is more information [here](http://www.statisticshowto.com/what-is-a-moment/).
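As a rough illustration of the four quantities listed above, the sketch below computes each one for a small made-up sample. The data and variable names are assumptions chosen for the example. Note that `scipy.stats.kurtosis` returns *excess* kurtosis by default (a normal distribution scores 0).

```python
import numpy as np
from scipy import stats

# Small made-up sample; the single large value gives it a long right tail
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9, 30])

mean = sample.mean()                       # 1st moment
variance = sample.var(ddof=0)              # 2nd central moment (population form)
skewness = stats.skew(sample)              # 3rd standardised moment
excess_kurtosis = stats.kurtosis(sample)   # 4th standardised moment minus 3

# Skewness by hand: third central moment divided by variance ** 1.5
manual_skew = np.mean((sample - mean) ** 3) / variance ** 1.5

print(mean, variance, skewness, excess_kurtosis)
print(np.isclose(skewness, manual_skew))
```

A positive skewness here reflects the long right tail created by the value 30.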

**Arithmetic Mean (1st Moment)**<br/>

The mean is the average of a set of numerical values, as calculated by adding them together and dividing by the number of terms in the set.

<span style="color:#888888">${\displaystyle A={\frac {1}{n}}\sum _{i=1}^{n}a_{i}={\frac {a_{1}+a_{2}+\cdots +a_{n}}{n}}}$</span>

Where:  
<span style="color:#888888">$A$ = Mean  
$i$ = Index of the observation (running from 1 to $n$)  
$n$ = Number of observations in the dataset   
$\sum$ = Sum   
$a$ = Individual value in the dataset  
</span>


In [None]:
np.mean(np_data)
pd_data['data'].mean()
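To connect the formula to the library call, here is a minimal sketch that computes the mean by hand for a small made-up array (`values` is an assumption for illustration) and compares it with `np.mean`.

```python
import numpy as np

# Made-up values for illustration
values = np.array([3, 7, 8, 10, 12])

n = len(values)                  # number of observations
manual_mean = values.sum() / n   # (3 + 7 + 8 + 10 + 12) / 5 = 8.0

print(manual_mean)
print(manual_mean == np.mean(values))
```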

**Variance (2nd Moment)**  
Variance measures how far a data set is spread out. The technical definition is “the average of the squared differences from the mean”, which gives a general idea of the spread of your data. It is expressed in squared units, whereas the **Standard Deviation** is expressed in the same units as the mean. Note that there are slightly different ways of calculating variance depending upon whether you are working with a sample or a full population. See [Bessel's Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) for more details. The example below uses Bessel's correction.  

<span style="color:#888888">${\displaystyle s^{2} = {\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}}$</span>

Where:  
<span style="color:#888888">$s^{2}$ = Variance  
$N$ = The number of observations in the sample  
$x_i$ = The $i$-th observation  
$\overline{x}$ = The mean (estimated)
</span>

In [None]:
np.var(np_data, ddof=1)           # NumPy divides by N by default (ddof=0), so set ddof=1 for N-1
pd_data['data'].var(ddof=1)       # pandas divides by N-1 by default (ddof=1)
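The effect of Bessel's correction is easiest to see on a small example. The sketch below uses a made-up array (an assumption for illustration), works through the formula step by step, and checks both the population and sample forms against `np.var`.

```python
import numpy as np

# Made-up values for illustration
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

n = len(values)
mean = values.sum() / n                      # 40 / 8 = 5.0
squared_diffs = (values - mean) ** 2         # sums to 32.0

population_var = squared_diffs.sum() / n     # 32 / 8 = 4.0 (divide by N)
sample_var = squared_diffs.sum() / (n - 1)   # 32 / 7 (Bessel's correction)

print(population_var, np.var(values))          # matches ddof=0, the NumPy default
print(sample_var, np.var(values, ddof=1))      # matches ddof=1
```

The sample variance is always slightly larger than the population form, and the gap shrinks as the number of observations grows.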

**Standard Deviation**<br/>
The standard deviation is a measurement statisticians use for the amount of variability (or spread) among the numbers in a data set. As the term implies, a standard deviation is a standard (or typical) amount of deviation (or distance) from the average (or mean, as statisticians like to call it). So the standard deviation, in very rough terms, is the average distance from the mean. Note that there are slightly different ways of calculating standard deviation depending upon whether you are working with a sample or a full population. See [Bessel's Correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) for more details.

The standard deviation is also used to describe where most of the data should fall, in a relative sense, compared to the average. For example, **if your data have the form of a bell-shaped curve (also known as a normal distribution)**:

* About 68% of the data fall within one standard deviation of the mean.  
* About 95% fall within two standard deviations.  
* About 99.7% fall within three standard deviations.  

For example, a standard deviation of 3 means that a majority of observations (about 68%, assuming a normal distribution) lie within 3 units of the average (one standard deviation).<br/>
Most observations (about 95%) lie within 6 units of the average (two standard deviations).<br/>
Almost all observations (about 99.7%) lie within 9 units of the average (three standard deviations).<br/>

This result is called the empirical rule, or the 68–95–99.7% rule. A low standard deviation means that most of the numbers are very close to the average. A high standard deviation means that the numbers are spread out. 
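The empirical rule can be checked numerically. The following is a sketch under stated assumptions: it draws simulated normal data with a mean of 50 and a standard deviation of 3 (both arbitrary choices, as are the seed and sample size) and measures the share of values within one, two and three standard deviations of the mean.

```python
import numpy as np

# Simulated normal data; seed, mean, spread and size are arbitrary choices
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=50, scale=3, size=100_000)

mean = sample.mean()
sd = sample.std(ddof=1)

# Share of observations within k standard deviations of the mean
shares = {}
for k in (1, 2, 3):
    shares[k] = np.mean(np.abs(sample - mean) <= k * sd)
    print(k, round(shares[k], 4))
```

The shares should land close to 0.68, 0.95 and 0.997. Data that are not roughly bell-shaped can give quite different proportions.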

<span style="color:#888888">${\displaystyle s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}}$</span>

Where:

<span style="color:#888888">$s$ = Standard Deviation  
$N$ = The number of observations in the sample  
$x_i$ = $x$ variable indexed with ${i}$  
$\overline{x}$ = The mean (estimated)
</span>

In [None]:
np.std(np_data, ddof=1)     # NumPy divides by N by default (ddof=0), so set ddof=1 for N-1
pd_data['data'].std()       # pandas divides by N-1 by default (ddof=1)

**Pearson Correlation**<br/>

The Pearson Correlation Coefficient measures the strength and direction of the linear relationship between two numerical variables X and Y. It is the covariance of the two variables divided by the product of their standard deviations. As with the examples above, the formula below is written for a sample dataset; the formula for a full population is slightly different. See [the Pearson Correlation Coefficient Wikipedia page](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for more details.

Values returned will range between -1 and 1, with positive numbers indicating a positive correlation and negative numbers indicating a negative correlation.

Note that you should always take account of [Simpson's Paradox](http://www.statisticshowto.com/what-is-simpsons-paradox/) when interpreting a correlation. This is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title **reversal paradox** or **amalgamation paradox**.
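A tiny constructed example makes Simpson's paradox concrete. The two groups below are made up for illustration: within each group `y` falls as `x` rises, yet the pooled data show a positive correlation because the second group sits higher on both axes.

```python
import numpy as np

# Group A: y decreases as x increases
x_a = np.array([1, 2, 3, 4, 5])
y_a = np.array([5, 4, 3, 2, 1])

# Group B: same downward trend, but shifted up on both axes
x_b = np.array([6, 7, 8, 9, 10])
y_b = np.array([10, 9, 8, 7, 6])

r_a = np.corrcoef(x_a, y_a)[0, 1]   # negative within group A
r_b = np.corrcoef(x_b, y_b)[0, 1]   # negative within group B

# Pooling the groups reverses the sign of the correlation
x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])
r_all = np.corrcoef(x_all, y_all)[0, 1]

print(r_a, r_b, r_all)
```

Checking the correlation within meaningful subgroups, as well as overall, is one way to spot this kind of reversal.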

<span style="color:#888888">${\displaystyle r =\frac{\sum _{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}}$</span>

Where:  
<span style="color:#888888">$n$ = Sample size  
$x_i$ = $x$ variable indexed with ${i}$  
$y_i$ = $y$ variable indexed with ${i}$  
$\overline{x}$ = The mean of $x$ (estimated)  
$\overline{y}$ = The mean of $y$ (estimated)  
</span>

In [None]:
np.corrcoef(np_data,np_data2)              # Correlation
pd_data2.corr()                            # Correlation
chart = sns.pairplot(pd_data2,kind="reg")  # Seaborn pairplot
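To show how the formula maps onto code, the sketch below computes Pearson's r by hand for two short made-up arrays (assumptions for illustration). It then checks the result against `np.corrcoef` and against the "covariance divided by the product of the standard deviations" definition.

```python
import numpy as np

# Made-up paired observations for illustration
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Deviations from each mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Numerator and denominator of the formula
numerator = (x_dev * y_dev).sum()
denominator = np.sqrt((x_dev ** 2).sum()) * np.sqrt((y_dev ** 2).sum())
r_manual = numerator / denominator

# Equivalent form: sample covariance over the product of sample standard deviations
r_from_cov = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print(r_manual, r_from_cov, np.corrcoef(x, y)[0, 1])
```

The `N-1` terms cancel between the covariance and the standard deviations, which is why both forms give the same value.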