# Statistical Methods and Probability in Data Analysis

Probability concerns the chance of an outcome happening. The Central Limit Theorem derives from probabilistic thinking and is applied extensively in inferential statistics.

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.
* __Descriptive statistics__ - Makes no assumption that the data come from a larger population; it is concerned solely with properties of the observed data. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory, and its measures are frequently nonparametric (they make minimal assumptions about the underlying distribution).
* __Inferential statistics__ - Assumes that the observed data set is sampled from a larger population, and is concerned with drawing conclusions about that population from the sample, typically by testing hypotheses. These hypotheses are **not** probability statements: they relate to the populations represented by the samples, and so are distinct from significance probabilities, or p-values. Each test produces a significance probability (p-value, or SP): the probability, computed under the distribution described by the null hypothesis, of observing a test statistic at least as extreme as the one actually observed. A p-value of 0.05 therefore means that, if the null hypothesis were true, results at least this extreme would occur about 5% of the time; it is not the probability that the null hypothesis is true. Read more on p-values [here](https://onlinelibrary.wiley.com/doi/full/10.1002/nur.21947).

> __Note:__ The word _inference_ in machine learning is often used to mean "generating a prediction, by evaluating an already trained model". This is a part of 'Predictive Inference' which emphasizes the prediction of future observations based on past observations.



# 1. Probability Basics

__Expected Value__ is a generalization of the weighted average. Informally, the expected value is the mean of the possible values a random variable can take, weighted by the probability of those outcomes.
$$E(X)=\int_{-\infty}^{\infty}xf(x)dx$$

__Variance__ is a measure of how far a set of numbers is spread out from their average value.
$$\text{VAR}(X)=E[(X-\mu)^2]=E[(X-E[X])^2]=E[X^2]-E[X]^2$$
$$\text{VAR}(X) = \int_{-\infty}^{\infty}(x-\mu)^2f(x)dx = \int_{-\infty}^{\infty}x^2f(x)dx - \mu^2$$
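
The formulas above are written for a continuous variable; for a discrete variable the integral becomes a sum over outcomes. The sketch below assumes a fair six-sided die (an illustrative choice, not from the text) and checks that the two variance expressions agree. The helper name `expected_value` is hypothetical.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6 (illustrative assumption).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_value(pmf, g=lambda x: x):
    """E[g(X)]: sum of g(x) * P(X = x) over all outcomes."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expected_value(pmf)                                        # E[X]
var_definition = expected_value(pmf, lambda x: (x - mu) ** 2)   # E[(X - mu)^2]
var_shortcut = expected_value(pmf, lambda x: x ** 2) - mu ** 2  # E[X^2] - E[X]^2

print(mu, var_definition, var_shortcut)
```

Exact fractions are used so that the two variance formulas can be compared without floating-point noise.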

# 2. Statistical measures

Given a range of data such as 100 points between -50 and +23, we can commonly summarize the data by measuring its "middle" point (broadly called "Central Tendency") and "spread" (broadly called "Variability or Dispersion"). 

But, how should we quantify what is meant by "middle" or "spread"? 

Several measures (aka "metrics") of central tendency can be thought of as solutions to a variational problem. That is, given a measure of statistical dispersion, one asks for the measure of central tendency that minimizes it: the center from which variation is minimal among all choices of center. The most familiar measure is the "Arithmetic Mean" (AM). This is the standard "average" calculation, done by summing all data points and dividing by the number of points: 
$$AM = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^N{x_i}$$

> __PROOF__: How does the Arithmetic Mean **minimize variation from the center**? 
> 
> TODO: 
> 
> Read more at this [Cross Validated discussion](https://stats.stackexchange.com/a/520289/177561).
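
A sketch of the argument, assuming variation is measured by the sum of squared deviations from a candidate center $c$ (the symbol $S$ is introduced here only for illustration; other dispersion measures lead to other centers):

$$S(c)=\sum_{i=1}^N(x_i-c)^2$$
$$\frac{dS}{dc}=-2\sum_{i=1}^N(x_i-c)=0 \;\Rightarrow\; \sum_{i=1}^N x_i = Nc \;\Rightarrow\; c=\frac{1}{N}\sum_{i=1}^N x_i$$

Because $\frac{d^2S}{dc^2}=2N>0$, this stationary point is a minimum, so under this assumption the minimizing center is the Arithmetic Mean.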

## 2.1 Univariate analysis (discrete)

Univariate analysis is concerned only with data in $\R^1$. As in the example above, we start with a finite data set $X\in\{x_1,x_2,\cdots,x_N\}$, i.e. we treat $X$ as a discrete random variable whose values are drawn from a non-continuous range. The "variational problem" stated previously can be associated with $L^p$ spaces (and thus _p_-norms); see Appendix. 

|$L^p$|Dispersion|<sub>calculation</sub>|<sub>notes</sub>|Central tendency|<sub>calculation</sub>|<sub>notes</sub>|Proof of minimum variance|
|-|-|-|-|-|-|-|-|
|$L^0$|Variation ratio ($v$)|$1-\frac{\max(freq)}{N_{all}}$|The larger the variation ratio, the more differentiated the data. Simple, but highly unstable.|Mode|$x_{\max(freq)}$|Unlike mean or median, mode also makes sense for non-numeric values i.e. "nominal data". Simple, but highly unstable.|-|
|$L^1$|[Average Absolute Deviation (AAD)](https://en.wikipedia.org/wiki/Average_absolute_deviation)|$\frac{1}{N}\sum_{i=1}^N{\|x_i-m(X)\|}$<br><sub>m(X) is any measure of central tendency, typically mean or median</sub>|-|Median|$x_{((N+1)/2)}$ for sorted data<br><sub>average of the two middle values when N is even</sub>|-|[via substitution](https://math.stackexchange.com/a/113279/481786)<br>[via differentiation](https://math.stackexchange.com/a/1862591/481786)|
|$L^2$|Standard deviation ($\sigma$)|$\sqrt{\frac{1}{N}\sum_{i=1}^N{(x_i-\mu)^2}}$<br><sub>where &mu; is Arithmetic Mean</sub>|Variance: $\text{VAR}(X)=\sigma^2$|Arithmetic Mean|$\frac{1}{N}\sum_{i=1}^N(x_i)$|-|[via differentiation](https://stats.stackexchange.com/a/520289/177561) [via expected value](https://faculty.washington.edu/swithers/seestats/SeeingStatisticsFiles/seeing/center/meanproof/meanProof.html)|
|$L^\infty$|Range aka. Maximum deviation ($D$)|$\max(X)-\min(X)$|-|Mid-range|$\frac{\max(X)+\min(X)}{2}$|-|-|
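
As a rough illustration of the table, the sketch below computes each pair of measures for a small made-up sample using only the Python standard library. The data values are assumptions chosen for readability.

```python
import statistics
from collections import Counter

data = [2, 3, 3, 5, 7, 10]  # illustrative sample, not from the text
n = len(data)

# L^0: mode and variation ratio
mode_value, mode_count = Counter(data).most_common(1)[0]
variation_ratio = 1 - mode_count / n

# L^1: median and average absolute deviation around the median
median = statistics.median(data)
aad = sum(abs(x - median) for x in data) / n

# L^2: arithmetic mean and (population) standard deviation
mean = statistics.fmean(data)
std = statistics.pstdev(data)

# L^inf: mid-range and range
mid_range = (max(data) + min(data)) / 2
data_range = max(data) - min(data)

print(mode_value, median, mean, mid_range)
print(variation_ratio, aad, std, data_range)
```

Note that the four "centers" differ for the same data, because each one answers a different minimization problem.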


Comparing the $L^1$, $L^2$ and $L^\infty$ rows, we can see that as the power increases, the resulting measure relies more heavily on the largest values. This idea is explored further in the cost functions of Machine Learning algorithms. In classical ML algorithms, every iteration is followed by an evaluation of performance so that the model weights can be adjusted in the next iteration. During this evaluation, we are interested in the error or "loss", i.e. the distance between the actual and predicted values. We typically use the (squared) Euclidean distance to measure loss.
Let's take the example of simple linear regression:

**Hypothesis:** $\hat{y}=h(x)=w_1x + w_2$  
**Parameters:** $w=(w_1, w_2)$  
**Cost Function:** $J(w) = \tfrac{1}{2}\text{MSE} = \frac{1}{2m}\sum_{i=1}^m{(\hat{y}_i-y_i)^2}$  
**Goal:** $\text{minimize } J(w)$  

where:    
* $w$ (also written $\theta$) is a vector containing all parameters in the model; $b$ is also commonly used for the intercept term.   
* $m$ is the number of data points.  
* Technically this is the half-MSE: the "2" in the divisor is there only to make the derivatives more convenient. When you take the derivative of the cost function to update the parameters during gradient descent, the 2 from the power cancels with the 1/2 multiplier, so the derivation is cleaner. [ref](https://stackoverflow.com/a/68629431/8742400)
* $J$ is the conventional symbol for the cost function. It should not be confused with the Jacobian matrix, which collects the partial derivatives of a function of multiple variables; the gradient of $J$ used in gradient descent is the scalar-valued special case. [Explained here.](https://math.stackexchange.com/a/2900485/481786)
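
The setup above can be sketched as plain gradient descent. This is a minimal illustration, assuming the usual linear hypothesis $h(x)=w_1x+w_2$, toy data generated from $y=2x+1$, and a hand-picked step size; none of these values come from the text.

```python
# Toy data generated from y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m = len(xs)

def cost(w1, w2):
    """Half-MSE: J(w) = 1/(2m) * sum((w1*x + w2 - y)^2)."""
    return sum((w1 * x + w2 - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient(w1, w2):
    """Partial derivatives of J; the 2 from the square cancels the 1/2."""
    errors = [w1 * x + w2 - y for x, y in zip(xs, ys)]
    dw1 = sum(e * x for e, x in zip(errors, xs)) / m
    dw2 = sum(errors) / m
    return dw1, dw2

w1, w2 = 0.0, 0.0
learning_rate = 0.05  # hypothetical step size
for _ in range(5000):
    dw1, dw2 = gradient(w1, w2)
    w1 -= learning_rate * dw1
    w2 -= learning_rate * dw2

print(w1, w2, cost(w1, w2))
```

The gradient has no stray factor of 2, which is exactly the convenience the half-MSE convention buys.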

Regularization is a technique that aims to improve the generalization of the linear regression model by adding a penalty term to the cost function. The additional term in the cost function ensures that the coefficients of the features do not grow too large, which helps prevent overfitting. For more exploration on the effect, read more [here](https://medium.com/@sangeeth.pogula_25515/a-deep-dive-into-linear-regression-bias-variance-cost-functions-and-regularization-35011abfe323) and [here](https://arxiv.org/pdf/2211.02989). 

__Lasso Regression ($L^1$)__ becomes:  
$J(w) = \frac{1}{2m}\sum_{i=1}^m{(\hat{y}_i-y_i)^2} + \lambda \sum_{j=1}^n{|w_j|}$   
The penalty contributes a gradient of constant magnitude ($\pm\lambda$) for each non-zero weight regardless of its size, so a weight can be driven linearly all the way down to exactly zero.  


__Ridge Regression ($L^2$)__ becomes:  
$J(w) = \frac{1}{2m}\sum_{i=1}^m{(\hat{y}_i-y_i)^2} + \lambda \sum_{j=1}^n{w_j^2}$  
Unlike $L^1$ regularization, the penalty gradient ($2\lambda w_j$) shrinks as the weight gets closer to zero, so its impact on the update gets smaller. This keeps the weights from growing too large, but does not drive them all the way to zero.
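
To see the difference in isolation, the sketch below applies only the penalty part of each update to a single weight, ignoring the data term entirely. It is an illustration under assumptions: the $L^1$ update uses a soft-threshold (proximal) step, which is one common way to handle the kink at zero and not something the text prescribes, and `lam` and `lr` are hypothetical values.

```python
def l1_step(w, lam, lr):
    """Soft-threshold step for the penalty lam * |w| (constant-size shrink)."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0
    return w - shrink if w > 0 else w + shrink

def l2_step(w, lam, lr):
    """Gradient step for the penalty lam * w**2, whose gradient is 2 * lam * w."""
    return w - lr * 2 * lam * w

lam, lr = 0.5, 0.1  # hypothetical penalty strength and step size
w_l1 = w_l2 = 1.0
for _ in range(50):
    w_l1 = l1_step(w_l1, lam, lr)
    w_l2 = l2_step(w_l2, lam, lr)

print(w_l1, w_l2)
```

The $L^1$ weight lands on exactly zero after a finite number of steps, while the $L^2$ weight only decays geometrically and stays non-zero.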

Higher or even fractional p-norms are possible but are not commonly used, for reasons of convenience and interpretability. However, they all serve the same purpose: to weight values of different magnitudes differently. For example, $L^3$ raises each value to the third power, giving even more weight to the largest values, and $L^p$ raises each to the power $p$. 
[Read a nice perspective here.](https://stats.stackexchange.com/questions/269298/why-do-we-only-see-l-1-and-l-2-regularization-but-not-other-norms)



### Arithmetic Mean
Before we get into more variations of central tendency measures, it is important to discuss why we like to use the Arithmetic Mean. 

The first reason is the Central Limit Theorem (CLT). Basically, if you draw independent samples from some fixed distribution with finite variance, then the larger your sample gets, the closer the distribution of the sample mean gets to a normal distribution centered on the population mean. The CLT is not saying the underlying data is normal; it's saying the sample mean is itself a random variable with its own distribution, which converges to normality. In addition, the spread of that distribution shrinks as the sample size increases, so the arithmetic mean becomes more stable and converges to the population mean.
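
A small simulation can make this concrete. The sketch below assumes an exponential distribution with mean 1 as the skewed, non-normal source (an arbitrary choice) and a fixed seed; it records how the spread of the sample mean shrinks as the sample size grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

spread = {}
for n in (1, 10, 100):
    means = [sample_mean(n) for _ in range(2000)]
    spread[n] = statistics.stdev(means)  # theory suggests roughly 1 / sqrt(n)
    print(n, round(statistics.fmean(means), 3), round(spread[n], 3))
```

Plotting a histogram of `means` for each `n` would show the shape moving from skewed towards bell-shaped, which is the part of the CLT that the spread alone does not capture.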

Second, the arithmetic mean minimizes the sum of squared errors of any given sample. Basically the mean is the $\argmin$ of the very popular, very common squared error loss function. __Note__: $\argmin$ is the set of values $c$ that minimize some function, in this case the squared error $\sum_{i=1}^N(x_i-c)^2$. Proof is linked in the table above.

The third motivation is that in many contexts, the mean gives us an estimator for a parameter we care about from an underlying distribution. For example:
* The sample mean of a normally distributed random variable x converges to the mean parameter for that distribution.
* The sample mean of a normally distributed random variable log(x) converges to the mean parameter of a log-normal distribution of x.
* The sample mean of the centered squares $(x_i - \bar{x})^2$ of a random variable x converges to a biased (but correctable) estimate of the distribution’s variance.
* The sample mean of a Bernoulli distributed random variable x converges to the probability parameter of a Bernoulli distribution.
* Given some r, the sample mean of a negative binomial distributed variable x, when divided by r, converges to the odds of the probability parameter p.

So by taking a mean, we can estimate these and many other distributions’ parameters.

> __Note__: For a deeper explanation and comparison among the key variations of means (AM vs GM vs HM), read [this](https://ryxcommar.com/2023/01/13/intuitive-explanation-of-arithmetic-geometric-harmonic-mean/).

#### Arithmetic mean (AM)
$$\text{AM} = \frac{1}{N}\sum_1^Nx_i$$  
__Function:__ $f_{AM}(x)=x \text{  and  } f_{AM}^{-1}(x)=x$  

#### Geometric mean (GM)
$$\text{GM} = \left(\prod_1^Nx_i\right)^{1/N}$$
__Function:__ $f_{GM}(x)=\log(x) \text{  and  } f_{GM}^{-1}(x)=e^x$  
__Consider:__ $\log(\bar{x}_\text{GM}) = \frac{1}{N}\log\left(\prod_1^Nx_i\right) = \frac{1}{N}\sum_1^N\log(x_i)$  
Practically, this has the same effect as taking the Arithmetic Mean of $\chi=\log(x)$, then converting the result back via $e^{\bar{\chi}}$. 

Some use cases
* **Inflation rates**: You have inflation of 1%, 2%, and 10%. What was the average inflation during that time? (1.01 * 1.02 * 1.10)^(1/3) ≈ 1.043, i.e. 4.3%.
* **Coupons**: You have coupons for 50%, 25% and 35% off. Assuming you can use them all, what’s the average discount? (i.e. What coupon could be used 3 times?). (0.5 * 0.75 * 0.65)^(1/3) ≈ 0.625, i.e. 37.5% off. Think of coupons as a “negative” return (for the store, anyway).
* **Area**: You have a plot of land 40 × 60 yards. What’s the “average” side, i.e. how large would the corresponding square be? (40 * 60)^(0.5) ≈ 49 yards.
* **Volume**: You’ve got a shipping box 12 × 24 × 48 inches. What’s the “average” size, i.e. how large would the corresponding cube be? (12 * 24 * 48)^(1/3) = 24 inches.
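
The use cases above can be checked with a few lines of Python. This sketch computes the geometric mean as an arithmetic mean in log space, as described earlier; the function name is a local helper, and positive inputs are assumed.

```python
import math

def geometric_mean(values):
    """Arithmetic mean in log space, mapped back with exp (positive values only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

inflation = geometric_mean([1.01, 1.02, 1.10]) - 1   # average growth rate
discount = 1 - geometric_mean([0.50, 0.75, 0.65])    # average coupon
side = geometric_mean([40, 60])                      # side of the equal-area square
edge = geometric_mean([12, 24, 48])                  # edge of the equal-volume cube

print(round(inflation, 4), round(discount, 4), round(side, 2), round(edge, 2))
```

Note that growth rates and discounts are first converted to multipliers (1.01, 0.50, ...) before averaging, and converted back afterwards.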

#### Harmonic mean (HM)
$$\text{HM} = \frac{N}{\sum_1^N(1/x_i)}$$ 
__Function:__ $f_{HM}(x)=\frac{1}{x} \text{  and  } f_{HM}^{-1}(x)=\frac{1}{x}$  
__Consider:__ $1/(\bar{x}_\text{HM}) = \frac{1}{N}\sum_1^N(1/x_i)$  
Similarly, this has the same effect as taking the Arithmetic Mean of $\chi=\frac{1}{x}$, then converting the result back via $\frac{1}{\bar{\chi}}$. 

Some use cases
* **Data transmission**: We’re sending data between a client and server. The client sends data at 10 gigabytes/dollar, and the server receives at 20 gigabytes/dollar. What’s the average rate? Well, we average 2 / (1/10 + 1/20) = 13.33 gigabytes/dollar for each part. That is, we could swap the client & server for two machines that each run at 13.33 gb/dollar. Because data is both sent and received (each part doing “half the job”), our true rate is 13.33 / 2 = 6.67 gb/dollar.
* **Machine productivity**: We’ve got a machine that needs to prep and finish parts. When prepping, it runs at 25 widgets/hour. When finishing, it runs at 10 widgets/hour. What’s the overall rate? Well, it averages 2 / (1/25 + 1/10) = 14.29 widgets/hour for each stage. That is, the existing stages could be replaced with two phases running at 14.29 widgets/hour for the same effect. Since a part goes through both phases, the machine completes 14.29/2 = 7.14 widgets/hour.
* **Buying stocks**: Suppose you buy $1000 worth of stocks each month, no matter the price (dollar cost averaging). You pay $25/share in Jan, $30/share in Feb, and $35/share in March. What was the average price paid? It is 3 / (1/25 + 1/30 + 1/35) = $29.44 (since you bought more at the lower price, and less at the more expensive one). And you have $3000 / 29.44 = 101.90 shares. The “workload” is a bit abstract: it’s turning dollars into shares. Some months use more dollars to buy a share than others, and in this case a high rate is bad.
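
The arithmetic in these use cases can be reproduced as follows. This is a small sketch with a local `harmonic_mean` helper; the stock example also counts the shares directly as a cross-check on the averaged price.

```python
def harmonic_mean(values):
    """N divided by the sum of reciprocals (non-zero values only)."""
    return len(values) / sum(1 / v for v in values)

transfer = harmonic_mean([10, 20])        # gigabytes/dollar per part
widgets = harmonic_mean([25, 10])         # widgets/hour per stage
avg_price = harmonic_mean([25, 30, 35])   # dollars per share

# Cross-check: count the shares bought each month directly.
shares_direct = sum(1000 / price for price in (25, 30, 35))
shares_via_hm = 3000 / avg_price

print(round(transfer, 2), round(widgets, 2), round(avg_price, 2))
print(round(shares_direct, 2), round(shares_via_hm, 2))
```

The two share counts agree, which is why the harmonic mean is the appropriate "average price" when a fixed amount of money is spent each month.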

#### Root Mean Square (RMS) - aka. Quadratic mean
$$\bar{x}_\text{RMS}=\sqrt{\frac{1}{N}\sum_1^Nx_i^2}$$
__Function:__ $f_{RMS}(x)=x^2 \text{  and  } f_{RMS}^{-1}(x)=\sqrt{x}$  
__Consider:__ $\bar{x}_\text{RMS}^2=\frac{1}{N}\sum_1^Nx_i^2$  
Similarly, this has the same effect as taking the Arithmetic Mean of $\chi=x^2$, then converting the result back via $\sqrt{\bar{\chi}}$. 

Root Mean Square is useful in engineering, but not often used in statistics. This is because it is not a good indicator of the center of the distribution when the distribution includes negative values.
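
The four means above share one pattern: transform the data with a function, take the Arithmetic Mean, then apply the inverse function. A minimal sketch of that pattern follows; the helper name `f_mean` and the sample values are assumptions for illustration.

```python
import math

def f_mean(values, f, f_inv):
    """Apply f, take the arithmetic mean, then map back with f_inv."""
    return f_inv(sum(f(v) for v in values) / len(values))

data = [1.0, 2.0, 4.0]  # illustrative positive values

am = f_mean(data, lambda x: x, lambda y: y)
gm = f_mean(data, math.log, math.exp)
hm = f_mean(data, lambda x: 1 / x, lambda y: 1 / y)
rms = f_mean(data, lambda x: x * x, math.sqrt)

print(hm, gm, am, rms)
```

For positive data that are not all equal, the results are ordered HM < GM < AM < RMS, which is a quick sanity check when choosing between them.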

### Other univariate measures of Central tendency
* Mid-Hinge
  * $\text{MH}(X)=\frac{Q_1+Q_3}{2}$
  * Two-point L-estimator: For a normal distribution, mid-hinge is a remarkably efficient estimator of population mean with 81% efficiency.
* Trimean
  * $\text{TM}(X)=\frac{Q_1+2Q_2+Q_3}{4}$
  * Three-point L-estimator: For a normal distribution, trimean is a remarkably efficient estimator of population mean with 88% efficiency. 
  * For context, median (single-point L-estimator) achieves 64%. Mid-Hinge achieves 81%.
* Weighted arithmetic mean (wAM)
  * Similar to an ordinary arithmetic mean, except that some data points contribute more than others.
  * If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
  * $\bar{x} = \frac{\sum_{1}^{N}w_ix_i}{\sum_{1}^{N}w_i}$
* Trimmed mean or Truncated mean
  * The Arithmetic Mean calculated after discarding a fraction $p$ of the sorted data at both the high and the low end.
  * $\bar{x}_{p}=\frac{1}{(1-2p)N}\sum_{i=pN+1}^{(1-p)N}x_{(i)}$, where $x_{(i)}$ is the $i$-th smallest value
* Interquartile mean (IQM)
  * $\bar{x}_\text{IQM}=\frac{2}{N}\sum_{i=N/4+1}^{3N/4}x_{(i)}$, where $x_{(i)}$ is the $i$-th smallest value
  * This is a Truncated mean with 25% of the data discarded at each end.
* Winsorized mean
  * Arithmetic Mean in which extreme values are replaced by values closer to the median.
  * For example, a 10% winsorized mean of a sorted 10-point sample: $\frac{(x_2 + x_2) + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + (x_9 + x_9)}{10}$
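
A minimal sketch of the trimmed, interquartile and Winsorized means follows. It assumes that `p` is the fraction handled at each end and that the number of affected points is rounded down; proper treatments of sample sizes that do not divide evenly use fractional weights, which this sketch skips.

```python
def trimmed_mean(values, p):
    """Mean after dropping a fraction p of the sorted data from each end."""
    ordered = sorted(values)
    k = int(p * len(ordered))            # points removed per side (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def winsorized_mean(values, p):
    """Mean after replacing the k lowest/highest points with the nearest kept value."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(p * n)
    clipped = [ordered[k]] * k + ordered[k:n - k] + [ordered[n - k - 1]] * k
    return sum(clipped) / n

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # illustrative data with one outlier

plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.10)
iqm = trimmed_mean(sample, 0.25)           # interquartile mean (approximate here)
winsorized = winsorized_mean(sample, 0.10)

print(plain, trimmed, iqm, winsorized)
```

The single outlier pulls the plain mean far from the bulk of the data, while the three robust variants stay near the middle.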

### Other univariate measures of Dispersion
* Interquartile range (IQR)
  * $\text{IQR}=Q_3-Q_1$
* Median absolute deviation (MAD)
  * $\text{MAD}=\text{Median}(|x_i-m(X)|)$
  * where $m(X)$ is any measure of central tendency, typically the median
* Mean absolute difference (aka. Gini mean absolute difference)
* Entropy
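
As a quick illustration of the first two measures, the sketch below uses the standard library. Quartile conventions differ between tools; this assumes the default ("exclusive") method of `statistics.quantiles`, and the sample is made up.

```python
import statistics

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # illustrative data with one outlier

# Interquartile range from the three quartile cut points.
q1, q2, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

# Median absolute deviation around the median.
center = statistics.median(sample)
mad = statistics.median(abs(x - center) for x in sample)

# Population standard deviation, for comparison.
std = statistics.pstdev(sample)

print(iqr, mad, round(std, 2))
```

Both IQR and MAD are barely affected by the outlier, whereas the standard deviation is dominated by it.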

### _"Inefficient"_ statistics: L-estimator [[ref]](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-17/issue-4/On-Some-Useful-Inefficient-Statistics/10.1214/aoms/1177730881.full)
> "L" is for **L**inear combination of order statistics (the sorted measurements). This is unrelated to the "$L^p$ space" referenced above.

An __L-estimator__ is an (often robust) statistical method that calculates descriptive metrics such as central tendency and dispersion by taking a weighted combination of ordered data points. We use $k$ to indicate the number of data points used in the calculation. Some examples include:
* Using single point (k=1): maximum, minimum, quantile, median
* Using 2 points (k=2): mid-range, mid-hinge, range, interquartile range (IQR)
* Using 3 points (k=3): trimean
* Using more points (k): interquartile mean (IQM), trimmed mean, Winsorized mean
* All points: Arithmetic mean (AM), geometric mean (GM), harmonic mean (HM), Standard Deviation

Using fewer points to estimate central tendency is beneficial, for example, when you want to analyze data that might contain outliers. 
This is also why most L-estimators are considered "robust" compared to the arithmetic mean. (Not all L-estimators are robust: those that include the minimum or maximum are not.)
In situations where the cost of analysis is greater than the cost of data generation, statistically "efficient" or "most powerful" procedures may not be viable. 
Low-k estimators can analyze large masses of data economically. However, since we are not considering the full range of data points, there will be some loss of information. This is a drawback when the "outliers" are in fact a signal.  
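
The robustness point can be sketched with the mid-hinge and trimean. This assumes the default quartile method of `statistics.quantiles` (other conventions give slightly different quartiles) and an invented data set in which the maximum is corrupted.

```python
import statistics

def mid_hinge(values):
    """Two-point L-estimator: average of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q1 + q3) / 2

def trimean(values):
    """Three-point L-estimator: (Q1 + 2*Q2 + Q3) / 4."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q1 + 2 * q2 + q3) / 4

clean = list(range(1, 12))       # 1..11, illustrative
dirty = clean[:-1] + [1000]      # same data with the maximum corrupted

for name, values in (("clean", clean), ("dirty", dirty)):
    print(name, statistics.fmean(values), mid_hinge(values), trimean(values))
```

Only the arithmetic mean moves when the maximum is corrupted; the low-k estimators never look at that point.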

## 2.2 Bivariate and multivariate analysis
The metrics in univariate analysis can be extended and generalized to multiple dimensions.
* Mean --> Centroid  
* Median --> Tukey median (aka. Centerpoint), Geometric median

## 2.3 Linear Analysis of Joint Variability  

__Covariance__ is a measure of joint variability between 2 variables. It is also easy to show that $\text{VAR}(X)=\text{COV}(X,X)$. To compute the covariance between two continuous random variables, Hoeffding's covariance identity is a useful tool. Read more [here](https://en.wikipedia.org/wiki/Covariance#Hoeffding's_covariance_identity).
$$\text{COV}(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]$$

__Correlation__: The challenge with covariance is that its value scales with the units of the variables, so its range extends over $(-\infty,\infty)$. This makes covariance difficult to interpret and compare. "Pearson's correlation coefficient" normalizes covariance by dividing the covariance of the two variables by the product of their standard deviations, which bounds the result to $[-1, 1]$. For more interpretation of correlation, see [here](https://www.stat.berkeley.edu/~rabbee/correlation.pdf).
$$\rho_{(X,Y)} = \frac{\text{COV}(X,Y)}{\sigma_X\sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}$$
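
The scaling problem and its fix can be sketched as follows, using population formulas and invented data. Changing the units of one variable changes the covariance by the same factor but leaves the correlation untouched.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: E[XY] - E[X]E[Y]."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]       # roughly 2x, illustrative
ys_scaled = [100 * y for y in ys]    # same relationship, different units

print(covariance(xs, ys), covariance(xs, ys_scaled))
print(correlation(xs, ys), correlation(xs, ys_scaled))
```

The helper also reflects the identity $\text{VAR}(X)=\text{COV}(X,X)$, which is how the standard deviations are obtained here.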

__Auto-covariance matrix__ of real random vectors  
TODO: ...

The __Covariance Matrix__ and __Correlation Matrix__ are square matrices giving the covariance and correlation, respectively, between each pair of elements of a given random vector. Essentially, you can visualize each as a table containing the covariance or correlation for each pair of variables. In 2D, a covariance matrix can be represented as an ellipse that captures the distribution of a vector of 2 random variables. It is useful to look at the determinant and eigenvalues of covariance and correlation matrices. (Note that the product of the n eigenvalues of A is the same as the determinant of A.) Covariance matrices are used a lot in data preprocessing techniques for machine learning models >> https://builtin.com/data-science/covariance-matrix

## 2.4 Non-linear analysis of variances
* Spearman's Correlation measures the monotonicity of the relationship between two sets of data (e.g. do they both increase and decrease at the same time?)
* Kendall's Tau measures the ordinal association between two sets of data. It is similar to the Spearman Correlation, but Kendall's Tau supposedly has more interpretable confidence intervals.
* Distance Correlation  >> https://en.wikipedia.org/wiki/Distance_correlation  
* Xi Correlation >> https://towardsdatascience.com/an-undeservedly-forgotten-correlation-coefficient-86245ccb774c  
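
As an illustration of the first bullet, Spearman's correlation can be computed as Pearson's correlation applied to ranks. The sketch below assumes there are no ties (to keep the ranking helper short) and uses a made-up monotonic but non-linear relationship, $y=x^3$.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank 1 = smallest. Assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]   # monotonic but not linear

print(pearson(xs, ys), spearman(xs, ys))
```

Pearson's coefficient falls short of 1 because the relationship is not a straight line, while Spearman's is 1 (up to rounding) because the ordering is perfectly preserved.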

## 2.5 Analysis of variance (ANOVA) and Analysis of covariance (ANCOVA) 

...

## 2.6 Partial Correlation

...

# 3. Exploratory data analysis


In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell beyond the formal modeling and thereby contrasts with traditional hypothesis testing, in which a model is supposed to be selected before the data is seen.

Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. 

EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.

### Techniques and tools
There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.

Typical graphical techniques used in EDA are:
* Box plot
* Histogram
* Scatter plot
* Run chart - also known as a Run-sequence plot, is a graph that displays observed data in a time sequence
* Pareto chart
* Parallel coordinates
* Heat map
* Bar chart

Dimensionality reduction:
* Multidimensional scaling
* Principal component analysis (PCA)
* Multilinear PCA
* Nonlinear dimensionality reduction (NLDR)

Typical quantitative techniques are:
* Median polish
* Trimean
* Ordination

# 4. Fundamental introduction to Machine Learning Methods

## The importance of the i.i.d. assumption
"Independent and identically distributed" random variables refer to a situation where all random variables (i.e. values that are randomly sampled from a larger population) have the same probability distribution (e.g. it is as likely to randomly find a 10th percentile in one population as in another population) and are mutually independent.
* __Independent__ means that the sample items are all independent events. In other words, they are not connected to each other in any way; knowledge of the value of one variable gives no information about the value of the other and vice versa.
* __Identically distributed__ means that there are no overall trends — the distribution does not fluctuate and all items in the sample are taken from the same probability distribution.

> TODO: The importance   
> ...   
> https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning

# Appendix

## Function space

A function space is an abstraction that expands on the familiar idea of x-y Cartesian coordinates in Euclidean space.
A space represents a set of all possible objects, which could be numbers, vectors, functions, etc.
Instead of a point, think of a function as another tangible object; a function space is just a collection of these objects. 
It's a space whose points (or elements) are functions. It's extremely convenient to study functions by thinking of them as elements in some space. 
For example, you can talk about two functions being close, just as you can talk about ordinary points being close. 
An approximation being good usually means that some kind of distance in the function space is minimized.

## $L^p$ space
> Relates to Regularization ($L^1$ and $L^2$) in Machine Learning cost functions.

$L^p$ spaces, also called "Lebesgue spaces", are function spaces defined using the concept of the _p-norm_. _p_ is not the number of dimensions of a vector space, meaning that distances in $\R^1, \R^2, \R^3, \R^N$ can all be calculated with any _p_-norm. 

Essentially, a norm is a measure of the size of an object, and therefore of the distance between objects such as points or functions. 
Mathematically speaking, a norm is a function mapping a space to the non-negative real numbers (i.e. a distance is always a non-negative real number) that quantifies a concept of "distance". 
Moreover, the _p-norm_ is a generalized concept: the formula can be evaluated for any real _p_ greater than zero (e.g. 0.5, 1, 2, 100), although it is a true norm only for $p \ge 1$; $L^0$ and $L^\infty$ are defined as limiting cases.

The most familiar example of a norm is the "Euclidean length" in Euclidean space: $||x||_2 = \sqrt{x_1^2+x_2^2+x_3^2+x_4^2+...+x_n^2}$ where $n$ is the number of dimensions that $x$ projects on. 
The Euclidean distance between two points $x$ and $y$ is the length $||x-y||_2$ of the straight line between the two points.
In many situations, the Euclidean distance is appropriate for capturing the actual distances in a given space. 
In contrast, consider taxi drivers in a grid street plan who should measure distance not in terms of the length of the straight line to their destination, 
but in terms of the "rectilinear distance" (i.e. counting along the direction of axis only), which takes into account that streets are either orthogonal or parallel to each other. The class of $p$-norms generalizes these two examples and has an abundance of applications in many parts of mathematics, physics, and computer science.

For a real number $p \ge 1$, the _p-norm_ or $L^p$-norm of $x$ is defined by:
$$||x||_p=(|x_1|^p+|x_2|^p+\cdots+|x_n|^p)^{1/p}$$
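
The definition translates directly into code. The sketch below assumes a displacement of 6 units along each axis (an illustrative choice) and treats `math.inf` as the maximum norm, which is the limiting case of the formula.

```python
import math

def p_norm(vector, p):
    """(|x1|^p + ... + |xn|^p)^(1/p) for p >= 1; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(x) for x in vector)
    return sum(abs(x) ** p for x in vector) ** (1 / p)

v = [6, 6]  # displacement between opposite corners of a 6 x 6 grid (illustrative)

taxicab = p_norm(v, 1)            # rectilinear distance
euclidean = p_norm(v, 2)          # straight-line distance
large_p = p_norm(v, 100)          # approaches the max norm as p grows
chebyshev = p_norm(v, math.inf)   # largest single coordinate

print(taxicab, euclidean, large_p, chebyshev)
```

As `p` increases, the result is dominated more and more by the largest coordinate, which is the same effect described for the dispersion measures earlier.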

<table style='margin-left: auto; margin-right: auto'><tr>
<td style="width:400px"><img src='./img/Lp_norms.png'></td>
<td style="width:400px">Illustrations of unit circles (see also superellipse) in ℝ<sup>2</sup> based on different p-norms (every vector from the origin to the unit circle has a length of one, the length being calculated with the length formula of the corresponding p).
</td></tr></table>


In statistics, you are unlikely to meet $L^3$ norms, but you will see 0, 1, 2 and infinity. 
Each measure of central tendency minimizes the average distance between the center and the data, in some distance metric. 
The mode minimizes $L^0$ (the count of observations different from the mode), the median minimizes $L^1$ (absolute deviation), the mean minimizes $L^2$ (variance = squared deviation), and the midrange minimizes $L^\infty$ (where only the largest and smallest observations in the data set matter).
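
These claims can be checked numerically with a brute-force search over candidate centers. The sketch below assumes a small odd-sized sample (so the median is unique) and a grid of candidates in steps of 0.01; both are illustrative choices.

```python
data = [1, 2, 3, 7, 12]  # illustrative sample: median 3, mean 5, mid-range 6.5

def l1_cost(c):
    """Sum of absolute deviations from the candidate center c."""
    return sum(abs(x - c) for x in data)

def l2_cost(c):
    """Sum of squared deviations from the candidate center c."""
    return sum((x - c) ** 2 for x in data)

def linf_cost(c):
    """Largest absolute deviation from the candidate center c."""
    return max(abs(x - c) for x in data)

# Candidate centers 0.00, 0.01, ..., 13.00 (brute force, chosen for clarity).
candidates = [i / 100 for i in range(0, 1301)]

best_l1 = min(candidates, key=l1_cost)
best_l2 = min(candidates, key=l2_cost)
best_linf = min(candidates, key=linf_cost)

print(best_l1, best_l2, best_linf)
```

Each search lands on a different center for the same data, matching the median, the mean and the mid-range respectively.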

<table style='margin-left: auto; margin-right: auto'><tr>
<td style="width:400px"><img src='./img/Manhattan_distance.png'></td>
<td style="width:400px">In taxicab geometry, the lengths of the red, blue, green, and yellow paths all equal 12, the taxicab distance between the opposite corners, and all four paths are shortest paths. Instead, in Euclidean geometry, the red, blue, and yellow paths still have length 12 but the green path is the unique shortest path, with length equal to the Euclidean distance between the opposite corners, 6√2 ≈ 8.49.
</td></tr></table>