In [None]:
import numpy as np
from scipy import stats
from scipy.stats import mstats
mstats.pearsonr([1, 2, 3, 4, 5], [10, 9, 2.5, 6, 4])

(-0.7426106572325057, 0.1505558088534455)

There is a linear dependence between x and y if y = a + b*x + e, where
a and b are constants and e is a random error term, assumed to be independent
of x. For simplicity, assume that x is standard normal, a=0, b=1 and let
e follow a normal distribution with mean zero and standard deviation s>0.


In [None]:
s = 0.5
x = stats.norm.rvs(size=500)
e = stats.norm.rvs(scale=s, size=500)
y = x + e
mstats.pearsonr(x, y)

(0.9029601878969703, 8.428978827629898e-185) # may vary

This should be close to the exact value given by


In [None]:
1/np.sqrt(1 + s**2)

0.8944271909999159

For s=0.5, we observe a high level of correlation. In general, a large
variance of the noise reduces the correlation, while the correlation
approaches one as the variance of the error goes to zero.
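
The effect of the noise level can be checked with a small sweep over s. The sketch below is illustrative only: the seed, the sample size and the grid of s values are arbitrary choices, not part of the original example. The exact value follows from cov(x, y) = var(x) = 1 and var(y) = 1 + s**2.

```python
import numpy as np
from scipy.stats import mstats

# Arbitrary seed and sample size, chosen only to make the sketch reproducible.
rng = np.random.default_rng(12345)
n = 100_000
x = rng.standard_normal(n)

results = {}
for s in (0.1, 0.5, 1.0, 2.0, 5.0):
    e = rng.normal(scale=s, size=n)    # noise with standard deviation s
    r, _ = mstats.pearsonr(x, x + e)   # sample correlation of x and y = x + e
    exact = 1 / np.sqrt(1 + s**2)      # population correlation
    results[s] = (r, exact)
    print(f"s={s:>4}: sample r = {r:.4f}, exact = {exact:.4f}")
```

The sample coefficient should track the exact value closely at this sample size and fall toward zero as s grows.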

It is important to keep in mind that zero correlation does not imply
independence unless (x, y) is jointly normal. Correlation can even be zero
when there is a very simple dependence structure: if x follows a
standard normal distribution, let y = abs(x). Then the correlation
between x and y is zero. Indeed, since the expectation of x is zero,
cov(x, y) = E[x*y]. By the definition of y, this equals E[x*abs(x)], which
is zero by symmetry. The following lines of code illustrate this observation:


In [None]:
y = np.abs(x)
mstats.pearsonr(x, y)

(-0.016172891856853524, 0.7182823678751942) # may vary
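
The dependence that the linear coefficient misses becomes visible once x is transformed. As a minimal sketch (the seed, the sample size and the choice of x**2 as the transform are assumptions made for illustration), the snippet compares the correlation of y with x and with x**2:

```python
import numpy as np
from scipy.stats import mstats

# Arbitrary seed and sample size, chosen only to make the sketch reproducible.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = np.abs(x)

r_xy, _ = mstats.pearsonr(x, y)       # linear correlation of x and abs(x)
r_x2y, _ = mstats.pearsonr(x**2, y)   # correlation after squaring x

print(f"corr(x, y)    = {r_xy:.4f}")
print(f"corr(x**2, y) = {r_x2y:.4f}")
```

The first coefficient should be near zero and the second should be large, even though y is a deterministic function of x in both cases.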

A non-zero correlation coefficient can be misleading. For example, if x has
a standard normal distribution, define y = x if x < 0 and y = 0 otherwise.
Since cov(x, y) = 1/2 and var(y) = 1/2 - 1/(2*pi), a simple calculation
shows that corr(x, y) = sqrt(pi/(2*(pi - 1))) = 0.856...,
implying a high level of correlation:


In [None]:
y = np.where(x < 0, x, 0)
mstats.pearsonr(x, y)

(0.8537091583771509, 3.183461621422181e-143) # may vary

This is unintuitive, since y does not depend on x at all when x is greater
than zero, which happens in about half of the cases if we sample x and y.
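
One way to make this concrete is to look at the two halves of the sample separately. The sketch below uses an arbitrary seed and sample size, and the variable names are illustrative. It avoids calling `pearsonr` on the positive half, because y is constant there and the coefficient is not defined.

```python
import numpy as np
from scipy.stats import mstats

# Arbitrary seed and sample size, chosen only to make the sketch reproducible.
rng = np.random.default_rng(2024)
x = rng.standard_normal(100_000)
y = np.where(x < 0, x, 0)

neg = x < 0
frac_pos = np.mean(~neg)                     # share of samples with x >= 0
spread_pos = np.ptp(y[~neg])                 # range of y on that half
r_neg, _ = mstats.pearsonr(x[neg], y[neg])   # correlation where y equals x
r_all, _ = mstats.pearsonr(x, y)             # correlation on the full sample

print(f"fraction with x >= 0:       {frac_pos:.3f}")
print(f"range of y where x >= 0:    {spread_pos}")
print(f"corr(x, y) where x < 0:     {r_neg:.4f}")
print(f"corr(x, y) on all samples:  {r_all:.4f}")
```

The high overall coefficient comes entirely from the negative half, where y equals x. On the other half y carries no information about x.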