# Course: Statistics and Probability

[Khan Academy Course link](https://www.khanacademy.org/math/statistics-probability)

In [1]:
# Pre-load modules used later
from IPython.display import Image

## Supplement: [Seeing Theory](https://seeing-theory.brown.edu)
- **Probability theory** is the mathematical framework that allows us to analyze chance events in a logically sound manner. 
- The **expectation** of a random variable is a number that attempts to capture the center of that random variable's distribution. 
- **Random variable** - A mathematical object holding a number assigned to the possible outcomes — say, 1 for heads and 0 for tails, or 1-6 for a die.
  - Or: A function that assigns a real number to each outcome in the probability space.
- The **variance** of a random variable quantifies the spread of that random variable's distribution.
- An **estimator** is simply any function of randomly sampled *observations*.
- **Bayes' Theorem**: Updates the probability of a hypothesis in light of new evidence: `P(A|B) = P(B|A) * P(A) / P(B)`. It answers questions such as "Given a positive test result, what is the probability that I actually have the disease?"
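
The disease-test question above can be worked through numerically. This is a minimal sketch; the function name and all three input rates (prevalence, sensitivity, false-positive rate) are hypothetical values chosen for illustration, not figures from the course.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Return P(disease | positive test) using Bayes' theorem."""
    true_positive = sensitivity * prior                    # P(+ and disease)
    false_positive = false_positive_rate * (1 - prior)     # P(+ and no disease)
    return true_positive / (true_positive + false_positive)

# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false-positive rate
p = posterior_given_positive(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # roughly 0.161
```

Even with a fairly accurate test, the posterior stays low here because the disease is rare in the assumed population.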

---
# Statistics

## Analyzing categorical data
- **midrange** - mean of the *maximum and minimum* values in a set.
- **mode** - most common value in a set
- Marginal distribution - In a spreadsheet-like table, these are the "total" values across either rows or columns.
- Conditional distribution - Values in a single row or column, typically given as percentages (%) of that row's or column's total.
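
As a rough illustration of marginal versus conditional distributions, the sketch below uses a made-up two-way table of counts. The category names and numbers are hypothetical.

```python
# Hypothetical two-way table of counts: rows = pet preference, columns = age group
table = {
    "cat": {"child": 10, "adult": 30},
    "dog": {"child": 20, "adult": 40},
}

grand_total = sum(sum(cols.values()) for cols in table.values())

# Marginal distribution of pet preference: each row total over the grand total
row_marginal = {pet: sum(cols.values()) / grand_total for pet, cols in table.items()}

# Conditional distribution of pet preference, given age group "child":
# each cell in the "child" column over that column's total
child_total = sum(cols["child"] for cols in table.values())
pet_given_child = {pet: cols["child"] / child_total for pet, cols in table.items()}

print(row_marginal)
print(pet_given_child)
```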

## Summarizing quantitative data
- **mean**, **median**
- **Variance** - The average of the squared differences from the mean.
- **Standard deviation** is the square root of the variance.
  - Excel function: **`STDEVP()`** (P for the whole Population)

In [2]:
# Standard deviation formula
Image(url= "https://www.mathsisfun.com/data/images/standard-deviation-formula.gif")

- ***Sample* standard deviation (or variance):** - Divide by **N-1** when calculating from a sample.
  - Excel function: **`STDEV()`**

In [3]:
# *Sample* standard deviation formula
Image(url= "https://www.mathsisfun.com/data/images/standard-deviation-sample.gif")

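- **Mean Absolute Deviation (MAD)** - The average of the absolute differences from the mean. Unlike standard deviation, it excludes the squaring and square-rooting.
- **Interquartile Range (IQR)** - The *range* of the middle 50% of values (Q3 - Q1).
  - Gotcha watch: With an odd number of values, *exclude the median* when finding the medians of the lower and upper halves!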
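
The spread measures above can be compared side by side with Python's standard `statistics` module. This is a sketch on a hypothetical data set; the `iqr` helper is an illustrative function (not a library call) that follows the "median of each half, excluding the overall median" convention from these notes.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

mean = statistics.fmean(data)
pop_var = statistics.pvariance(data)     # divides by N
pop_sd = statistics.pstdev(data)         # comparable to Excel STDEVP()
sample_var = statistics.variance(data)   # divides by N-1
sample_sd = statistics.stdev(data)       # comparable to Excel STDEV()

# Mean absolute deviation: average distance from the mean, no squaring
mad = sum(abs(x - mean) for x in data) / len(data)

def iqr(values):
    """Q3 - Q1, using the medians of the lower and upper halves."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2          # for odd lengths this skips the median
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(upper) - statistics.median(lower)

print(mean, pop_var, pop_sd, mad, iqr(data))
```

Other quartile conventions exist (for example the methods in `statistics.quantiles`), so IQR values can differ slightly between tools.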

## Modeling data distributions

- **Percentile** - What percent of *observations* are strictly *less than* a given value in a distribution.
- **z-score** - How many std devs away from the mean an observation is. (Applies to any type of distribution, not just normal.)
  - `z = (data point - mean) / std dev`
  - `z = (x − μ) / σ`
- [z-score table for reference](http://www.z-table.com/)
- **Probability Density Curve/Function**
  - The *area* under the curve is always 1.
- For reference: 
  - `area of a triangle = 1/2 * base * height`
  - `area of a trapezoid = (h1 + h2) / 2 * base`
- **Empirical Rule** aka **68-95-99.7 Rule** - For a *normal* distribution, 68% of the values lie **within** one std dev of the mean, 95% lie within two, and 99.7% lie within three.
  - Remember, the 68/95/99.7 are a range centered around the mean, NOT the proportion below that point.  E.g. to get the % *below the mean + 1 std dev*, add half of the remaining 32% to the 68%: `68% + 16% = 84%`.
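
The z-score formula and the 68-95-99.7 figures can be checked with `statistics.NormalDist` from the standard library. A small sketch follows; the `z_score` helper and the example numbers are illustrative assumptions.

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

std_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution within k std devs of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Proportion *below* the mean + 1 std dev (not the same as "within 1")
below_plus_one = std_normal.cdf(1)

print(z_score(130, mean=100, std_dev=15))  # hypothetical score: 2.0
print(within)
print(below_plus_one)
```

The `cdf` call plays the role of a z-table lookup: it returns the area under the curve to the left of a given z-score.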


In [4]:
# Empirical/68-95-99.7 Rule visualization
Image(url="https://i.stack.imgur.com/aM3fG.png")

## Exploring bivariate numerical data

[Article: How to select the Right Evaluation Metric for Machine Learning Models](https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0)

- **Correlation coefficient (r)** aka **Pearson correlation coefficient** - Measures the strength of the linear correlation between two variables.

In [5]:
# Correlation coefficient (r) formula
Image(url="http://www.stat.yale.edu/Courses/1997-98/101/cor.gif")

- **Residual** - Vertical distance between a data point and the regression line: `residual = actual y - predicted ŷ`.

**How to find the "least-squares" regression line equation from summary stats (means, std devs, and correlation coef.)**
1. Regression line equation: 
  - `ŷ = mx + b`
1. The slope equation from summary stats: 
 - `m = r * (σᵧ / σᵪ)`
1. Then find `b`, the y-intercept, by plugging in the one point we know, where the mean of x and mean of y intersect, into the line equation.
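
The three steps above can be sketched as a small function. The function name and the summary statistics passed to it are hypothetical.

```python
def line_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Least-squares slope and intercept from summary statistics."""
    slope = r * (sd_y / sd_x)
    # The line passes through (mean_x, mean_y), so solve mean_y = m * mean_x + b
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical summary stats
m, b = line_from_summary(mean_x=5.0, mean_y=10.0, sd_x=2.0, sd_y=4.0, r=0.5)
print(f"y_hat = {m}x + {b}")
```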


A **residual plot** is a good way to find out whether a linear or a higher-order equation may be a better fit for the data.

**r-squared** aka **coefficient of determination** - Measures percentage of the prediction error in the *y* variable eliminated when we use least-squares regression on the *x* variable.
  - More formally: Measures percent of the variability in the *y* variable accounted for by the regression on the *x* variable.
  - Equation: `1 - (squared error of the line / squared error of the mean of y) `
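
The r-squared equation above translates directly into code. This is a minimal sketch with hypothetical observed and predicted values; `r_squared` is an illustrative helper, not a library function.

```python
def r_squared(y_true, y_pred):
    """1 - (squared error of the line / squared error of the mean of y)."""
    mean_y = sum(y_true) / len(y_true)
    se_line = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    se_mean = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - se_line / se_mean

# Hypothetical observations and the predictions of some fitted line
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
print(r_squared(y_true, y_pred))  # about 0.8
```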

In [6]:
# r-squared equation
Image(url="https://cdn-images-1.medium.com/max/1600/1*WCaWmRreXCQxLez4yYOy5w.png")

In [7]:
# Squared error of the regression line
Image(url="http://onlinestatbook.com/2/regression/graphics/se_est.gif")

- **Mean-squared-error (MSE)**:
  - Measures the average *squared* prediction error in the *y* variable, so it is in squared units of *y*.
  - In more formal terms, measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.
  >\begin{equation*}
MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
\end{equation*}


```python
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
```

- **Root-mean-square error (RMSE)** aka **standard deviation of residuals** - Measures the size of a typical prediction error. Taking the square root puts the errors on the same scale as the targets.
  >\begin{equation*}
RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2} = \sqrt{MSE}
\end{equation*}

- Always remember: *Correlation does not necessarily imply causation*
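
The MSE and RMSE formulas above can be written out by hand to see what they compute. This is a sketch with hypothetical values; for the same inputs, the `mean_squared_error` call shown earlier should give the same MSE.

```python
import math

def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, back in the units of y."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred))
```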

## Study Design

- **10% Rule** - A convention that if the sample is <= 10% of the population, we can treat trials as *independent* even when sampling without replacement.

### Types of statistical studies
- **Sample study** - Goal is to estimate the value of a parameter of a population from a sample.
- **Observational study** - Goal is to find out whether two variables are correlated, without assigning any treatment.
- **Experimental study** - Goal is to establish *causality* through separate control and treatment groups. (The basis of the scientific method!)

### Types of sampling
- Simple random sample
- Stratified random sample - Take random samples from multiple groups to even-out representation.
- Cluster random sample
- Systematic random sample
#### Biased sampling types
- Voluntary response sampling: Self-selection within the population
- Convenience sampling: Choosing an easily-available sample without using any randomization

### Types of bias
- Undercoverage: Systematic exclusion of members of the population from being in the sample.
- Nonresponse: people chosen for the sample cannot be reached or refuse to participate.
- Response bias: when people are systematically dishonest when answering a question
- Biased wording

### Experiment design
- Block design - Divide up experiment population into blocks to ensure proper representation between control and treatment groups (subset of stratified design).
- Matched pairs design - Swap the treatment and control groups and perform experiment again to mitigate unintentional differences between groups.

### Experiment terms 
- A **response variable** is the focus of a question in a study or experiment; measures the result of a study.
- An **explanatory variable** is one that *explains changes in another variable* (the response variable). It can be anything that might affect the response variable.
- **treatment** - the specific level of the explanatory variable given to individuals in an experiment. E.g. 10mg of a drug and 0mg/placebo.

- **experimental units** - who or what we are assigning to a treatment.

---
# Probability
Simple definition: how likely something is to happen.

>```
P(condition) =     # of possibilities that meet conditions 
                ---------------------------------------------
                     # of equally likely possibilities
```

- **The "Monty Hall" problem** - Moral of the story: always switch doors after being shown a goat. ;)
  - **Intuition**: After revealing a door *without* the prize, the 2/3 probability of picking wrong *concentrates on* the door not picked/not shown.
  - The intuition feels much clearer when using an example with **100 doors** where 98 goats are revealed.  It's then obvious the chances of the car being behind the one unrevealed door, instead of the random one you picked, are very high (99/100 probability).
  
  
- **Experimental Probability** (vs. Theoretical Probability)
  - "Experimental" means *estimated* based on historical outcomes.
  
- **"Compound" sample spaces**
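
The Monty Hall claim is a convenient place to compare experimental and theoretical probability: a simulation estimates the win rate from repeated trials. The sketch below is an assumption-laden illustration (the function name, trial count, and seed are arbitrary). It uses the shortcut that, once the host has opened every other goat door, switching wins exactly when the first pick was wrong.

```python
import random

def monty_hall_win_rate(switch, trials=100_000, doors=3, seed=0):
    """Estimate the probability of winning the car by simulation."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins = 0
    for _ in range(trials):
        car = rng.randrange(doors)
        pick = rng.randrange(doors)
        # The host opens all but one of the other doors, never revealing the car.
        # Staying wins if the first pick was the car; switching wins otherwise.
        if switch:
            wins += pick != car
        else:
            wins += pick == car
    return wins / trials

print(monty_hall_win_rate(switch=True))             # close to 2/3
print(monty_hall_win_rate(switch=False))            # close to 1/3
print(monty_hall_win_rate(switch=True, doors=100))  # close to 99/100
```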
  
## Set theory
[Set symbols reference by RapidTables](https://www.rapidtables.com/math/symbols/Set_Symbols.html)

## Rules & Generalizations
- **Addition rule: for OR'd events**
>**`P(A or B) = P(A) + P(B) - P(A and B)`**
  - This handles both cases of mutually exclusive and not.
  
- **Multiplication rule: for AND'd events**
  - For *independent* events:
>**`P(A and B) = P(A) * P(B)`**

  - For *dependent* events:
>**`P(A and B) = P(A) * P(B|A)`**
  

- **Multiple *independent* events with the same probability**
>**`P(event then event then event) = P(event) ^ # events`**

- **Multiple *independent* events with different probabilities**
>**`P(event A then event B) = P(event A) * P(event B)`**

- **Probabilities involving "at least one" success**
>**`P(at least 1 success) = 1 - P(all failures)`**

  or
>**`P(at least 1 failure) = 1 - P(all successes)`**

- **Testing for event independence:** Events A and B are *independent* if any one of these holds (each implies the others):
  - `P(A|B) = P(A)`
  - `P(B|A) = P(B)`
  - `P(A and B) = P(A) * P(B)`


- **Probability of exactly *k* successes in *n* trials**
  >\begin{equation*}
P(E) = {n \choose k} p^k (1-p)^{n-k}
\end{equation*}
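
Several of the rules above can be verified exactly by enumerating a small sample space. The sketch below assumes two fair six-sided dice and uses `fractions.Fraction` to avoid rounding; the event functions and helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Sample space: two fair dice, 36 equally likely outcomes
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """# of outcomes meeting the condition / # of equally likely outcomes."""
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))

def first_is_six(outcome):
    return outcome[0] == 6

def second_is_six(outcome):
    return outcome[1] == 6

p_a = prob(first_is_six)
p_b = prob(second_is_six)
p_a_and_b = prob(lambda o: first_is_six(o) and second_is_six(o))
p_a_or_b = prob(lambda o: first_is_six(o) or second_is_six(o))

print(p_a_or_b == p_a + p_b - p_a_and_b)       # addition rule
print(p_a_and_b == p_a * p_b)                  # independence test
print(p_a_or_b == 1 - Fraction(5, 6) ** 2)     # at least one six = 1 - P(no sixes)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(2, 5, Fraction(1, 2)))      # exactly 2 heads in 5 fair flips
```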

---
# Counting, permutations, and combinations

- **Combinatorics** - "The branch of mathematics dealing with combinations of objects belonging to a finite set in accordance with certain constraints, such as those of graph theory."
- Note: **`0 factorial, 0! = 1`**

## Permutations

- **Permutation** - An *ordered* combination of items in a set.  Order/position *does* matter!
- Two types:
  - Repetition is allowed (such as marbles being replaced in a bag after choosing)
  - Repetition is *not* allowed (such as a lottery ball machine)
  
  
- **Permutation formula:** (where no item of the **n** may be repeated)
>```
nPk or P(n,k) =    n!               or   = n * (n-1) * ... * (n-(k-1))
                 ----------
                  (n - k)!
```

  - Where 
    - `n = # of items/events` and 
    - `k = # of positions`
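
The permutation formula can be cross-checked three ways in Python (3.8+ for `math.perm`): by the factorial formula, by the library call, and by listing every arrangement. The values `n = 5`, `k = 3` and the letters are hypothetical.

```python
from itertools import permutations
from math import factorial, perm

n, k = 5, 3  # hypothetical: 5 items, 3 positions

by_formula = factorial(n) // factorial(n - k)      # n! / (n - k)!
by_library = perm(n, k)
by_listing = len(list(permutations("ABCDE", k)))   # every ordered arrangement

# When repetition IS allowed, each of the k positions has n choices
with_repetition = n ** k

print(by_formula, by_library, by_listing, with_repetition)
```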

## Combinations

- **Combination** - A set where order does *not* matter.
- **Combinations formula:**
  >\begin{equation*}
{n \choose k} = \frac{n!}{k! \ (n - k)!} = \frac{P(n,k)}{k!}
\end{equation*}

  - "n choose k"
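  - Where
    - `n = # of items/trials` and
    - `k = # of items chosen`
  - `k!` is the # of ways to arrange the `k` chosen items, which is why the # of permutations is divided by it.
  - `n choose k` is aka the "binomial coefficient"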
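
A quick check of the combinations formula: dividing the permutation count by `k!`, the number of orderings of each chosen group, should match both `math.comb` and a direct listing. The values `n = 5`, `k = 3` are hypothetical.

```python
from itertools import combinations
from math import comb, factorial, perm

n, k = 5, 3  # hypothetical: choose 3 of 5 items

by_formula = factorial(n) // (factorial(k) * factorial(n - k))
from_permutations = perm(n, k) // factorial(k)     # remove duplicate orderings
by_library = comb(n, k)
by_listing = len(list(combinations("ABCDE", k)))   # every unordered selection

print(by_formula, from_permutations, by_library, by_listing)
```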


---
# Random variables

## Expected value: 
- `E(X) = μᵪ` 
  - (μ = mean)

## Variance:
- **`Var(X) = E((X - μᵪ)²) = σᵪ²`**
  - (This is just review)
  
## Combining random variables
- **`E(X +/- Y) = E(X) +/- E(Y)`**
- **`μ₍ᵪ₊₋ᵧ₎ = μᵪ +/- μᵧ`**
  - The means (same as "expected value") are either summed or subtracted accordingly. 


If X & Y are *independent* random variables:
- **`Var(X +/- Y) = Var(X) + Var(Y)`**
  - The variances are *always summed*, even for a difference: subtracting an independent variable still adds uncertainty.
  - Equivalent statement:
    - **`σ²₍ᵪ₊ᵧ₎ = σᵪ² + σᵧ²`**
    - **`σ²₍ᵪ₋ᵧ₎ = σᵪ² + σᵧ²`**
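  - Note the same is *not* true for std dev: take the square root of the summed variances, i.e. `σ = √(σᵪ² + σᵧ²)`, rather than summing the std devs.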
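
The variance rule can be confirmed exactly for a simple case. The sketch below assumes X and Y are two independent fair dice and enumerates all 36 equally likely pairs; the helper functions are illustrative, not library calls.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def expectation(values):
    """Mean of equally likely values, kept exact with Fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))

def variance(values):
    """E((X - mean)^2) over equally likely values."""
    values = list(values)
    mu = expectation(values)
    return expectation((v - mu) ** 2 for v in values)

var_x = variance(faces)                        # one die
pairs = list(product(faces, repeat=2))         # independent X and Y
var_sum = variance(x + y for x, y in pairs)    # Var(X + Y)
var_diff = variance(x - y for x, y in pairs)   # Var(X - Y)

print(var_x, var_sum, var_diff)
```

Both the sum and the difference come out to twice the single-die variance, while the standard deviation grows by a factor of √2 rather than doubling.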

## Binomial variables

Necessary conditions of a binomial variable:
- Made up of independent trials
- Probability of each trial's success is constant
- Each trial must have binary outcome
- Fixed # of trials

## Binomial distribution

- A *discrete* distribution of the number of successes in `n` trials. Its shape approaches the normal distribution as `n` grows.
- See [scipy.stats.binom( )](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) function
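
To see the whole distribution rather than a single probability, the sketch below builds the binomial probabilities for every `k` using only the standard library. The parameters `n = 10`, `p = 0.5` are hypothetical; `scipy.stats.binom.pmf(k, n, p)` is expected to return matching values if SciPy is available.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical: 10 fair coin flips

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                              # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))                # should equal n * p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))   # n * p * (1 - p)

print(total, mean, var)
```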