# Math for ML: Statistics

In this final Part 5 of the series on Math for ML we'll cover the essentials from ML's sister field of statistics. Statistics is at its heart the study of data using probability. We'll cover the following topics:
- Basic statistics: summary statistics of a random variable (mean, median, variance, std dev, max, min, range), correlation (Pearson vs. Spearman), outliers and how they affect statistics like correlations
- Statistical theory: maximum likelihood estimation, negative log likelihood and loss function, Bayesian (optional)
- Statistical inference: statistical tests, p-values, etc (maybe)
- Information theory: entropy, continuous entropy, cross entropy (maybe)

Let's get started.

## Statistics

Statistics is at its root the use of probability to study **data**. Data can be any real-world measurement, in essentially any form. Statistics treats data as either fixed values or random variables depending on the situation. If the data has already been observed we assume it's fixed. If not, we assume it's random and try to model it with a probability distribution. For the purposes of this lesson we'll assume data comes in the form of some 1D or 2D array and that each data point takes on a numerical value, either an integer or real number. 

Let's start with a simple 1D array of $m$ samples $\mathbf{x} = (x_0,x_1,\cdots,x_{m-1})$. We'd like to know what univariate distribution $p(x)$ would generate the samples $x_0,x_1,\cdots,x_{m-1}$. If we can get this probability distribution even approximately, we can gain a lot of insight into the nature of the data itself. 

Unfortunately, in all but the simplest cases it's really hard to figure out what distribution the data is coming from with only a finite amount of data. What we usually settle for instead is to *assume* the data come from a certain class of distributions, and then estimate the parameter values that make a distribution from that class fit the data well. In this sense, a whole lot of statistics boils down to how to estimate the parameters of a given distribution from the data.
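As a rough illustration of this "assume a class, then estimate its parameters" idea, here's a minimal sketch that assumes the data are Gaussian and recovers the two parameters from samples. The true values `true_mu` and `true_sigma`, the sample size, and the seed are all made up for the example, and the estimators used are the sample mean and sample standard deviation defined later in this section.

```python
import random
import statistics

# Hypothetical setup: pretend the "unknown" distribution is a Gaussian
# with these parameters. In practice we would not know them.
random.seed(0)
true_mu, true_sigma = 2.0, 0.5
x = [random.gauss(true_mu, true_sigma) for _ in range(10_000)]

# Assume the data are Gaussian and estimate the two parameters from the samples.
mu_hat = statistics.fmean(x)     # estimate of the mean parameter
sigma_hat = statistics.stdev(x)  # estimate of the std dev parameter (divides by m - 1)

print(f"estimated mu    = {mu_hat:.3f} (true value {true_mu})")
print(f"estimated sigma = {sigma_hat:.3f} (true value {true_sigma})")
```

With this many samples the estimates should land close to the true values, though they won't match exactly, since they are themselves computed from random data.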

Some of the most important parameters to estimate from an array of data are its moments. The hope is to find, without knowing the data's distribution, the best estimate of that distribution's moments from the given data. This is where the formulas you're probably used to come in for things like the mean, variance, standard deviation, etc. Traditionally, to avoid getting these moment estimates mixed up with the true distribution's moments, we call them **sample moments**. I'll define them below for the univariate case.

**Sample Mean:** $$\overline{x} = \frac{1}{m}\sum_{i=0}^{m-1} x_i = \frac{1}{m}(x_0 + x_1 + \cdots + x_{m-1})$$

**Sample Variance:** $$s^2 = \frac{1}{m-1}\sum_{i=0}^{m-1} (x_i-\overline{x})^2 = \frac{1}{m-1}\big((x_0-\overline{x})^2 + \cdots + (x_{m-1}-\overline{x})^2\big)$$

**Sample Standard Deviation:** $$s = \sqrt{s^2} = \sqrt{\frac{1}{m-1}\sum_{i=0}^{m-1} (x_i-\overline{x})^2}$$
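The three formulas above translate almost line for line into code. Here's a minimal pure-Python sketch; the function names and the example array are just for illustration. If you're using NumPy instead, note that `np.var` and `np.std` divide by $m$ by default, so you'd pass `ddof=1` to get the $m-1$ versions defined here.

```python
import math
import statistics

def sample_mean(x):
    """Sample mean: sum of the values divided by m."""
    return sum(x) / len(x)

def sample_variance(x):
    """Sample variance with the 1/(m-1) normalization."""
    m = len(x)
    if m < 2:
        raise ValueError("need at least two samples")
    xbar = sample_mean(x)
    return sum((xi - xbar) ** 2 for xi in x) / (m - 1)

def sample_std(x):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(x))

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_mean(x))      # 5.0
print(sample_variance(x))  # 32/7, roughly 4.571
print(sample_std(x))       # roughly 2.138

# Sanity check against the standard library, which also divides by m - 1.
assert math.isclose(sample_variance(x), statistics.variance(x))
assert math.isclose(sample_std(x), statistics.stdev(x))
```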

Other quantities that might be of interest to estimate aren't moments at all. One example is the **median**, which is defined as the midpoint of a distribution, i.e. the value $x$ that a sample is equally likely to fall below or above, so that the cumulative distribution function equals $1/2$ there. The median is another way of estimating the center of a distribution, but has slightly different properties than the mean. One of those properties is that the median depends only on the *rank order* of values, not on what numbers those values take on. This implies that, unlike the mean, the median is largely insensitive to points "far away from the center", called **outliers**.

We can estimate the sample median, call it $M$, of an array $x_0,x_1,\cdots,x_{m-1}$ by sorting the values in ascending order and plucking out the midpoint. With the sorted values indexed from $0$, if $m$ is odd this is just $M = x_{m//2}$. If $m$ is even, we by convention take the median to be the average of the two middle values, $M = \frac{1}{2}\big(x_{m//2-1} + x_{m//2}\big)$.
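Here's a minimal sketch of that sort-and-take-the-midpoint procedure, written with Python's 0-based list indexing. The function name `sample_median` and the example arrays are illustrative. The second half compares how the mean and the median react when one value is replaced by an outlier.

```python
def sample_median(x):
    """Sample median: middle value of the sorted data (0-based indexing)."""
    xs = sorted(x)
    m = len(xs)
    if m == 0:
        raise ValueError("need at least one sample")
    mid = m // 2
    if m % 2 == 1:
        # Odd m: a single middle element.
        return xs[mid]
    # Even m: average the two middle elements.
    return 0.5 * (xs[mid - 1] + xs[mid])

print(sample_median([4.0, 1.0, 3.0, 2.0]))  # 2.5

# Effect of an outlier on the mean vs. the median.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 500.0]

print(sum(clean) / len(clean), sample_median(clean))                       # 3.0 3.0
print(sum(with_outlier) / len(with_outlier), sample_median(with_outlier))  # 102.0 3.0
```

Replacing the largest value with $500$ drags the mean from $3$ to $102$, while the median doesn't move at all, since the rank order of the values is unchanged.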

Other quantities we might want to estimate from the data are the sample **minimum**, the sample **maximum**, and the sample **range**, which is defined as the difference between the sample max and min.
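These are simple enough to compute with Python built-ins. A minimal sketch on a made-up array follows; NumPy users can get the same quantities with `np.min`, `np.max`, and `np.ptp` ("peak to peak").

```python
x = [3.5, -1.0, 7.25, 0.0, 2.0]

x_min = min(x)           # sample minimum
x_max = max(x)           # sample maximum
x_range = x_max - x_min  # sample range: max minus min

print(x_min, x_max, x_range)  # -1.0 7.25 8.25
```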