My goal is to explain the Pearson correlation coefficient without using the word correlation, which is often used to describe it.
\
One way is to just give the definition: the Pearson correlation coefficient of two random variables $X$ and $Y$ is
$$
\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y},
$$
where 
$\sigma_X^2 =\mathrm{E} (X - \mu_X)^2$ is the variance of $X$, 
$\sigma_Y^2 =\mathrm{E} (Y - \mu_Y)^2$ is the variance of $Y$, 
$\sigma_{XY}=\mathrm{E} (X - \mu_X)(Y - \mu_Y)$ is the covariance of $X$ and $Y$,
$\mu_X = \mathrm{E} X$ is the mean of $X$,
and $\mu_Y = \mathrm{E} Y$ is the mean of $Y$.
\
But this is unsatisfying: why is this definition useful?

Consider the problem of estimating $Y$ from an observation of $X$.
It turns out that for the optimal linear estimator, *the number of standard deviations by which the estimate of $Y$ is above the mean of $Y$ is 
$\rho$ times
the number of standard deviations by which $X$ is above its mean.*
\
In other words, $\rho$ is the factor by which we shrink (and possibly flip) the deviation from the mean in one variable when we estimate the other.
\
So for example, consider a population of people where
height and weight are correlated with $\rho=0.72$,
heights are distributed with mean 170 cm and standard deviation 10 cm,
and weights are distributed with mean 70 kg and standard deviation 20 kg.
If you know that the height of a certain person is 190 cm,
a good guess for their weight is
$70 + 2 \cdot 0.72 \cdot 20 = 98.8$ kg.
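
The recipe (standardize, multiply by $\rho$, convert back) can be sketched in a few lines of Python. The function name `linear_estimate` and its parameter names are made up for illustration; the numbers are the ones from the height and weight example.

```python
def linear_estimate(x, mu_x, sigma_x, mu_y, sigma_y, rho):
    """Optimal linear (affine) estimate of Y from an observation x of X."""
    # Number of standard deviations by which x is above the mean of X.
    z = (x - mu_x) / sigma_x
    # Shrink (and possibly flip) by rho, then convert back to the units of Y.
    return mu_y + rho * z * sigma_y


# Height 190 cm is 2 standard deviations above the mean height.
weight = linear_estimate(x=190, mu_x=170, sigma_x=10, mu_y=70, sigma_y=20, rho=0.72)
print(round(weight, 1))
```

Note that the function needs only the two means, the two standard deviations, and $\rho$; nothing else about the joint distribution enters.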

The proof is very simple. Since we are dealing with linear (actually, affine) estimators, we need to show that the $a$ and $b$ that minimize 
$$
\text{MSE} := \mathrm{E} \left[ \left( \hat{Y} - Y \right)^2 \right],
$$
where $\hat{Y} := a (X - \mu_X) + b$,
are $\rho \sigma_Y / \sigma_X$ and $\mu_Y$.

The MSE is the sum of the squared bias and the variance of the error $\hat{Y} - Y$.
The variance doesn't depend on $b$, and the bias is
$\mathrm{E} \left[ \hat{Y} - Y \right] = b - \mu_Y$,
which doesn't depend on $a$.
So the two terms can be minimized separately, and the squared bias is zero exactly when $b=\mu_Y$.
To minimize the variance, we simplify
$$
\begin{align*}
\mathrm{Var}\left[\hat{Y} - Y\right]
&= \mathrm{Var}\left[a (X - \mu_X) - Y\right] \\
&= \mathrm{Var}\left[a \left(X - \mu_X\right) \right] 
    + \mathrm{Var}\left[ Y\right] 
    -2 \mathrm{Cov}\left[a \left(X - \mu_X\right), Y\right] \\
&= \sigma_X ^ 2 a^2 
   + \sigma_Y ^2
   -2  \sigma_{XY} a
\end{align*}
$$
This is just a parabola in $a$, so the optimal $a$ is
$$
\frac{2 \sigma_{XY}} {2 \sigma_X ^2}
=
\rho \frac{\sigma_Y } {\sigma_X }
$$
(which is what we wanted to show).

The estimator is unbiased, so its MSE is equal to its variance:
$$
\text{MSE} = \sigma_Y ^2 (1 - \rho ^ 2).
$$
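To spell out the step, substituting the optimal $a = \sigma_{XY} / \sigma_X^2$ into the variance expression and using $\sigma_{XY} = \rho \sigma_X \sigma_Y$ gives
$$
\begin{align*}
\mathrm{Var}\left[\hat{Y} - Y\right]
&= \sigma_X^2 a^2 + \sigma_Y^2 - 2 \sigma_{XY} a \\
&= \frac{\sigma_{XY}^2}{\sigma_X^2} + \sigma_Y^2 - 2 \frac{\sigma_{XY}^2}{\sigma_X^2} \\
&= \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} \\
&= \sigma_Y^2 (1 - \rho^2).
\end{align*}
$$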
This equation gives another concrete interpretation of $\rho$:
*If $X$ and $Y$ are correlated with coefficient $\rho$, observing $X$ shrinks the standard deviation of the error in estimating $Y$ from $\sigma_Y$ to at most $\sqrt{1 - \rho^2} \, \sigma_Y$.*
\
"At most" since the optimal linear estimator is equal to or worse than the optimal estimator.
\
In the example above, knowing the height decreases the standard deviation of the weight estimation error from 20 kg to $20 \sqrt{1 - 0.72^2} \approx 13.9$ kg.
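
The formula $\sigma_Y^2 (1 - \rho^2)$ can be checked with a small simulation. The sketch below assumes, purely for convenience of sampling, that height and weight are jointly Gaussian; the formula itself does not need that assumption. The function name `simulate_error` and its parameters are hypothetical.

```python
import math
import random


def simulate_error(rho, mu_x, sigma_x, mu_y, sigma_y, n=100_000, seed=0):
    """Sample (X, Y) pairs and return the mean and std of the error Y_hat - Y."""
    rng = random.Random(seed)
    a = rho * sigma_y / sigma_x  # optimal slope
    errors = []
    for _ in range(n):
        # Build a correlated pair from two independent standard normals.
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        x = mu_x + sigma_x * u
        y = mu_y + sigma_y * (rho * u + math.sqrt(1.0 - rho**2) * v)
        y_hat = a * (x - mu_x) + mu_y  # optimal linear estimate
        errors.append(y_hat - y)
    mean_err = sum(errors) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    return mean_err, math.sqrt(var_err)


mean_err, std_err = simulate_error(rho=0.72, mu_x=170, sigma_x=10, mu_y=70, sigma_y=20)
predicted_std = 20 * math.sqrt(1 - 0.72**2)
print(f"mean error: {mean_err:.3f}")
print(f"error std: {std_err:.3f} (formula: {predicted_std:.3f})")
```

The simulated error should average close to zero, and its standard deviation should land close to the value given by the formula, up to sampling noise.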

Notes:
1. If $X$ and $Y$ are jointly Gaussian, the optimal linear estimator is also the optimal estimator.

2. The "mean" in "MSE" is an average over the joint distribution of $X$ and $Y$, which is different from an average over the distribution of $Y$ given $X$, for which
our estimator is in general biased and not the optimal linear estimator.\
In our example, we estimated the weight to be 98.8 kg with error variance $20^2 (1 - 0.72^2) \approx 13.9^2$.
This doesn't mean that if we sample random people with height 190 cm, we would get a mean weight of 98.8 kg and variance smaller than $13.9^2$.
It means that if we sample random people and estimate their weight from their height using the optimal linear estimator, our error will be zero on average, with variance $13.9^2$.
If we use the optimal estimator instead, $13.9^2$ is an upper bound on the error variance.

3. The sentence "$X$ and $Y$ are not correlated" now has a concrete meaning: it means that the optimal linear estimator of $Y$ from $X$ will be the mean of $Y$, ignoring $X$ completely.

4. The discussion above is "Bayesian", in the sense that it assumes you have some knowledge about the distribution of $X$ and $Y$.
In practice we usually get $n$ samples of $(X, Y)$ pairs, and we use plug-in estimators to estimate the means, variances, and covariance, which we then use to build our linear estimator of $Y$ from $X$.
\
Machine learning people would say: we can take the samples to train a linear regression model to predict $Y$ from $X$. It sounds better, more "end-to-end"y.
But actually it gives exactly the same result (assuming we don't use [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction)):
TODO: proof
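
A sketch of one way the argument could go (not necessarily the proof the author had in mind): least-squares linear regression on samples $(x_1, y_1), \dots, (x_n, y_n)$ minimizes the empirical MSE, which is the MSE from the derivation above with every expectation replaced by a sample average. Writing $\bar{x}$ and $\bar{y}$ for the sample means, the objective is
$$
\frac{1}{n} \sum_{i=1}^{n} \left( a (x_i - \bar{x}) + b - y_i \right)^2 .
$$
The same bias and variance argument then applies verbatim to the empirical distribution, so the minimizers are
$$
\hat{b} = \bar{y},
\qquad
\hat{a} = \frac{s_{XY}}{s_X^2},
\qquad
s_{XY} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
s_X^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 ,
$$
which are exactly the plug-in estimates of $\mu_Y$ and $\sigma_{XY} / \sigma_X^2 = \rho \sigma_Y / \sigma_X$ with $1/n$ normalization. The symbols $\bar{x}$, $\bar{y}$, $s_{XY}$, and $s_X^2$ are notation introduced here for the sample quantities.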
