# Linear Regression Precursor - Important Statistics Definitions

## Expected Value 
### Discrete case:
The probability _weighted_ average of _all possible_ values of a random variable. It is important to note that the word _average_ or the word _mean_ may refer to the average or mean of a sample of the random variable, while expected value refers to the average of all possible outcomes. Therefore any sample mean is an estimate of the expected value. 

$$E[X] = \sum_{i=1}^N x_ip_i = \mu$$

Where $x_i$ is the outcome and $p_i$ is the probability of that outcome.

_example:_ Let $X$ be the outcome of a fair coin toss. Let $X=1$ represent landing on heads and $X=2$ represent landing on tails. Then the expected value of $X$ is 

$$E[X] = \sum_{i=1}^2 x_ip_i = 1 \times 0.5 + 2 \times 0.5 = 1.5$$

_note_: If all outcomes are equally likely, as in the above example, then the expected value is simply the arithmetic mean of the possible outcomes.
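
The discrete formula can be sketched in a few lines of Python. The die examples below are illustrative assumptions (they are not part of the original notes), and `expected_value` is a hypothetical helper name chosen for this sketch.

```python
def expected_value(outcomes, probabilities):
    """Probability-weighted average of the outcomes of a discrete random variable."""
    if len(outcomes) != len(probabilities):
        raise ValueError("outcomes and probabilities must have the same length")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))


# Assumed example: a fair six-sided die, every face equally likely.
faces = [1, 2, 3, 4, 5, 6]
fair = [1 / 6] * 6
ev_fair = expected_value(faces, fair)

# With equally likely outcomes, E[X] matches the arithmetic mean of the outcomes.
arithmetic_mean = sum(faces) / len(faces)

# Assumed example: a loaded die on which a six comes up half the time.
loaded = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
ev_loaded = expected_value(faces, loaded)

print(ev_fair, arithmetic_mean, ev_loaded)
```

For the fair die the two printed values agree (up to floating-point rounding), while the loaded die pulls the expected value toward six.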




### Continuous case:
In this case, the expected value is an integral over the real line:

$$E[X]=\int_{\mathbb{R}} x f(x) dx$$

Where $f(x)$ is the variable's probability density function. 

In [1]:
# Put some code here to demonstrate calculating expected value of continuous variables. 
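
One possible way to fill in this cell is to approximate the integral numerically. The sketch below assumes an exponential distribution with rate 2 (so the exact answer is $1/2$), truncates the real line to $[0, 40]$ where almost all of the probability mass lies, and uses a plain midpoint rule. The function names are hypothetical, and a library integrator would normally be preferable.

```python
import math


def expected_value_continuous(pdf, lower, upper, steps=100_000):
    """Approximate the integral of x * pdf(x) over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * width  # midpoint of the k-th sub-interval
        total += x * pdf(x) * width
    return total


RATE = 2.0  # assumed rate parameter of the exponential distribution


def exponential_pdf(x):
    """Density of an exponential random variable with rate RATE."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0


# The density is zero below 0 and negligible above 40, so we integrate on [0, 40].
approx_mean = expected_value_continuous(exponential_pdf, 0.0, 40.0)
exact_mean = 1.0 / RATE

print(approx_mean, exact_mean)
```

The approximation should agree with the exact mean to several decimal places. Increasing `steps` tightens the estimate at the cost of run time.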

## __Deviation__ 
The difference between the observed value of a variable and some other value, such as the expected value.
- If the deviation is between an observed value and the expected value, it is called an __error__.
- If the deviation is between an observed value and the estimate of the expected value (sample mean), it is called a __residual__.

In [2]:
# Code for computing deviation
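
A minimal sketch of the two kinds of deviation, assuming a fair six-sided die (whose expected value is known to be 3.5) and a made-up sample of five rolls:

```python
# Assumed setting: a fair six-sided die, so the true expected value is known.
expected_value = 3.5

# Hypothetical sample of five observed rolls.
sample = [2, 6, 3, 5, 1]
sample_mean = sum(sample) / len(sample)

# Errors: deviations from the true expected value.
errors = [x - expected_value for x in sample]

# Residuals: deviations from the estimate of the expected value (the sample mean).
residuals = [x - sample_mean for x in sample]

print("sample mean:", sample_mean)
print("errors:     ", errors)
print("residuals:  ", residuals)
print("sum of residuals:", sum(residuals))
```

The residuals sum to zero (up to rounding) because they are measured from the sample's own mean. The errors carry no such guarantee.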

## __Variance__ ($\sigma^2$) 
The variance of a random variable $X$ is the expectation of the squared deviation of $X$ from its mean. It is a measure of how spread out the values are around the mean. If $\mu$ is the mean: 

$$\text{Var}(X) = E[(X - \mu)^2] = \sum p_i (x_i - \mu)^2 = \sigma^2$$

Or in the case of N equally likely outcomes: 

$$\text{Var}(X) = E[(X - \mu)^2] = \frac{1}{N}\sum (x_i - \mu)^2 = \sigma^2$$

_note:_ If we are computing the variance of a sample rather than of the entire population, we divide by $N-1$ instead of $N$ (Bessel's correction). Deviations are measured from the sample mean, which by construction sits closer to the sample points than the true mean $\mu$ does, so dividing by $N$ tends to underestimate the population variance. Dividing by $N-1$ makes the result slightly larger and removes this bias.

In [3]:
# code to compute variance
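
As a sketch, the population and sample variances can be computed by hand and cross-checked against Python's standard `statistics` module. The data set is a made-up example.

```python
import statistics

# Hypothetical data set, treated first as a whole population, then as a sample.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: divide by N.
population_variance = sum(squared_deviations) / n

# Sample variance: divide by N - 1 to correct for bias.
sample_variance = sum(squared_deviations) / (n - 1)

# Cross-check against the standard library.
stdlib_population = statistics.pvariance(data)
stdlib_sample = statistics.variance(data)

print(population_variance, stdlib_population)
print(sample_variance, stdlib_sample)
```

The sample variance is always the larger of the two, and the gap shrinks as the number of observations grows.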

## __Standard Deviation__ ($\sigma$, s)
A measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that values tend to be close to the mean, while a high standard deviation indicates that values are more spread out. Mathematically, it is the square root of the __variance__, defined above:

$$\sigma = \sqrt{\text{Var}(X)}$$

A useful property of standard deviation is the fact that it is expressed in the same units as the data from which it is derived. This allows for human readability and interpretation of data. 

In [4]:
# code to compute standard deviation
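
A small sketch using made-up heights in centimetres, to show that the standard deviation comes out in the same unit as the data while the variance is in squared units:

```python
import math
import statistics

# Hypothetical heights, in centimetres, treated as a full population.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
n = len(heights_cm)

mean_cm = sum(heights_cm) / n

# Variance is expressed in squared units (cm^2).
variance_cm2 = sum((h - mean_cm) ** 2 for h in heights_cm) / n

# Standard deviation is the square root of the variance, back in cm.
std_cm = math.sqrt(variance_cm2)

# Cross-check against the standard library's population standard deviation.
stdlib_std_cm = statistics.pstdev(heights_cm)

print(f"mean = {mean_cm} cm, variance = {variance_cm2} cm^2, std = {std_cm:.3f} cm")
```

For a sample rather than a population, `statistics.stdev` applies the $N-1$ correction instead.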

## __Covariance: ($\sigma_{X,Y}$)__ 
Covariance is the measure of joint variability of two random variables. For example, if small values of variable $X$ are typically associated with small values of variable $Y$, while large values of $X$ are associated with large values of $Y$, then covariance is positive. 

$$\sigma_{X,Y} = E[(X - \mu_x)(Y - \mu_y)] = \sum_{i=1}^n p_i (x_i - \mu_x)(y_i - \mu_y)$$

Or if all $n$ outcomes are equally likely:

$$\sigma_{X,Y} = E[(X - \mu_x)(Y - \mu_y)] = \frac{1}{n} \sum (x_i - \mu_x)(y_i - \mu_y)$$

Reminder, expected value is the probability weighted mean of the entire population or sample space:

$$\mu_x = E[X] \\ \mu_y = E[Y]$$

The sign (+/-) of the covariance tells us about the direction of the linear relationship in our data. The magnitude, however, is not so useful, as it depends on the scale of the variables, which might be vastly different. To interpret the strength of the relationship, we turn to the correlation coefficient, defined next. 
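
The following sketch computes the covariance for $n$ equally likely paired outcomes and illustrates why its magnitude is hard to interpret: rescaling one variable (for example, metres to centimetres) rescales the covariance by the same factor. The data and the helper name `covariance` are assumptions made for illustration.

```python
def covariance(xs, ys):
    """Population covariance of paired values, assuming equally likely outcomes."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = covariance(x, y)

# Same data with x expressed in a unit 100 times smaller (e.g. m -> cm).
x_scaled = [100.0 * value for value in x]
cov_scaled = covariance(x_scaled, y)

print(cov_xy, cov_scaled)
```

Both results are positive, so the direction of the relationship is unchanged, but the second is 100 times larger even though the underlying relationship is the same.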

## __Correlation coefficient: ($\rho_{X,Y}$, r)__ 
Correlation is a measure of the strength and direction of the linear relationship between two variables, ranging in value from -1 to 1.

__$\rho = 1$__ implies that a straight line with positive slope passes through ALL data points, i.e. perfect positive correlation.

$0 < \rho < 1$ implies we see _positive_ correlation between variables, but not all points fall on a straight line. 

$\rho = 0$ implies that there is no _linear_ relationship between the variables. The data points may be highly scattered, but a nonlinear relationship can still exist. 

$-1 < \rho < 0$ implies we see _negative_ correlation between variables, but not all points fall on a straight line. 

$\rho = -1$ implies that we see perfect _negative_ correlation, much like the case $\rho = 1$, but this time the slope of the line passing through the data points is negative. 

The formula is: 

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\\
\sigma_X := \text{standard deviation of X}\\
\sigma_Y := \text{standard deviation of Y}$$
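
The formula can be sketched directly from its definition. The helper names and data below are assumptions for illustration, and population (divide by $n$) versions of covariance and standard deviation are used throughout.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (std_x * std_y)


# Hypothetical paired observations with a positive but imperfect relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
rho = correlation(x, y)

# Rescaling x leaves the correlation unchanged, unlike the covariance.
rho_scaled = correlation([100.0 * value for value in x], y)

# Points lying exactly on a line give +1 or -1 depending on the slope's sign.
rho_positive_line = correlation(x, [2.0 * value + 1.0 for value in x])
rho_negative_line = correlation(x, [-value for value in x])

print(rho, rho_scaled, rho_positive_line, rho_negative_line)
```

Because the standard deviations in the denominator absorb the units of both variables, the result is a unitless number between -1 and 1.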
