# Dive into [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)

In this notebook, I'll demonstrate three properties of the Pearson correlation coefficient:

- **shift invariance**
- **scale invariance**
- **sensitivity to outliers**

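Because the coefficient is sensitive to outliers, combining several series with the median can give a higher correlation than combining them with the mean when some of the series contain outliers. <br>
An example is provided at the end.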
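
For reference, the coefficient can be written out directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The sketch below is only an illustration in plain Python; `pearson_r` is a hypothetical helper name, and the rest of the notebook uses `np.corrcoef` instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Centered sums; the 1/n factors cancel between numerator and denominator.
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    # Note: a constant sequence makes sxx or syy zero, so r is undefined there.
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.1, 5.9, 8.2, 9.8]))
```

Only centered values enter the formula, and the scale of each variable cancels between numerator and denominator, which is where the two invariances listed above come from.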

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(seed=42)

In [None]:
mean = np.array([3, 8])
# Unit variances, so the off-diagonal entry is the population correlation (0.83).
cov = np.array([[1, 0.83], [0.83, 1]])

x, y = np.random.multivariate_normal(mean, cov, size=1000).T

plt.title(f"r={np.corrcoef(x, y)[0][1]}")
plt.plot(x, y, 'x')
plt.show()

## shift invariance

In [None]:
# Subtract a constant (here the sample mean) from every x value.
x_shift = x - np.mean(x)
plt.title(f"shift: r={np.corrcoef(x_shift, y)[0][1]}")
plt.plot(x_shift, y, 'x')
plt.show()
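
The plot above compares a single shift by eye. As a numerical check, the sketch below applies several constant shifts and asserts that r does not change. The data, the seed, and the shift values are assumptions made for this sketch, not the notebook's own `x` and `y`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)

r_base = np.corrcoef(x, y)[0, 1]
for shift in (-100.0, -1.0, 0.5, 1000.0):
    # Adding a constant moves the mean by the same constant, so the
    # centered values (x - mean) that enter the formula are unchanged.
    r_shifted = np.corrcoef(x + shift, y)[0, 1]
    assert np.isclose(r_shifted, r_base)

print(f"r = {r_base:.4f} for every shift tested")
```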

## scale invariance

In [None]:
# Multiply every x value by a positive constant.
x_scale = 3 * x
plt.title(f"scale: r={np.corrcoef(x_scale, y)[0][1]}")
plt.plot(x_scale, y, 'x')
plt.show()
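
One detail the factor 3 does not show: the invariance holds for positive factors, while a negative factor keeps the magnitude of r but flips its sign. A minimal check on synthetic data (the data and seed are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)

r_base = np.corrcoef(x, y)[0, 1]
r_pos = np.corrcoef(3 * x, y)[0, 1]   # positive factor: r unchanged
r_neg = np.corrcoef(-3 * x, y)[0, 1]  # negative factor: sign of r flips

print(f"base={r_base:.4f}  3x={r_pos:.4f}  -3x={r_neg:.4f}")
```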

## sensitivity to outliers

In [None]:
x_outlier = x.copy()
# Move a single point far from the cloud; y is left untouched.
x_outlier[10] += 10
plt.title(f"with outlier: r={np.corrcoef(x_outlier, y)[0][1]}")
plt.plot(x_outlier, y, 'x')
plt.show()
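
To get a feel for how much one point can matter, the sketch below displaces a single x value by increasingly large offsets and records r each time. The data, the seed, and the offsets are assumptions of this sketch, and the exact numbers depend on them; the point is only that one displaced value out of 500 can pull r toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n = 500
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

r_by_offset = {}
for offset in (0, 10, 100, 1000):
    x_out = x.copy()
    x_out[10] += offset  # displace one point out of n
    r_by_offset[offset] = np.corrcoef(x_out, y)[0, 1]

for offset, r in r_by_offset.items():
    print(f"offset={offset:>5}: r={r:.3f}")
```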

## Mean vs Median

### without outliers, Mean is better than Median

In [None]:
# y plus five series x1..x5: zero means, unit variances, pairwise correlations near 0.8.
mean = np.zeros(6)
cov = np.array([[1, 0.83, 0.80, 0.81, 0.82, 0.78], 
                [0.83, 1, 0.79, 0.80, 0.82, 0.81],
                [0.80, 0.79, 1, 0.82, 0.79, 0.77],
                [0.81, 0.80, 0.82, 1, 0.81, 0.83],
                [0.82, 0.82, 0.79, 0.81, 1, 0.82],
                [0.78, 0.81, 0.77, 0.83, 0.82, 1]
               ])

y, x1, x2, x3, x4, x5 = np.random.multivariate_normal(mean, cov, size=1000).T
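
`np.random.multivariate_normal` expects a symmetric, positive-semidefinite covariance matrix. When a matrix is typed in by hand like the one above, it can be worth checking both properties before sampling. A minimal sketch of such a check, repeating the matrix so the block stands alone:

```python
import numpy as np

cov = np.array([[1, 0.83, 0.80, 0.81, 0.82, 0.78],
                [0.83, 1, 0.79, 0.80, 0.82, 0.81],
                [0.80, 0.79, 1, 0.82, 0.79, 0.77],
                [0.81, 0.80, 0.82, 1, 0.81, 0.83],
                [0.82, 0.82, 0.79, 0.81, 1, 0.82],
                [0.78, 0.81, 0.77, 0.83, 0.82, 1]])

is_symmetric = np.allclose(cov, cov.T)
# eigvalsh is intended for symmetric matrices and returns
# real eigenvalues in ascending order, so index 0 is the smallest.
min_eigenvalue = np.linalg.eigvalsh(cov)[0]

print(f"symmetric: {is_symmetric}, smallest eigenvalue: {min_eigenvalue:.4f}")
```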

In [None]:
x_list = [x1, x2, x3, x4, x5]
x_mean = np.mean(x_list, axis=0)
plt.title(f"Mean: r={np.corrcoef(x_mean, y)[0][1]}")
plt.plot(x_mean, y, 'x')
plt.show()

In [None]:
x_median = np.median(x_list, axis=0)
plt.title(f"Median: r={np.corrcoef(x_median, y)[0][1]}")
plt.plot(x_median, y, 'x')
plt.show()

### with outliers, Median is better than Mean

In [None]:
# Inject the same two outliers (rows 10 and 50) into x1 and x2.
x1_outlier = x1.copy()
x1_outlier[10] += 10
x1_outlier[50] -= 15

x2_outlier = x2.copy()
x2_outlier[10] += 10
x2_outlier[50] -= 15

In [None]:
x_list = [x1_outlier, x2_outlier, x3, x4, x5]
x_mean = np.mean(x_list, axis=0)
plt.title(f"Mean: r={np.corrcoef(x_mean, y)[0][1]}")
plt.plot(x_mean, y, 'x')
plt.show()

In [None]:
x_median = np.median(x_list, axis=0)
plt.title(f"Median: r={np.corrcoef(x_median, y)[0][1]}")
plt.plot(x_median, y, 'x')
plt.show()
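
The four plots above can be condensed into one side-by-side comparison. The sketch below is not the notebook's data: it assumes five series that are each `y` plus independent unit-variance noise, and it uses deliberately exaggerated outliers (+100 and -150 in two of the five series) so that the effect is unambiguous. `r_with_y` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n = 5000
y = rng.normal(size=n)
xs = y + rng.normal(size=(5, n))  # each row is y plus independent unit noise

def r_with_y(series):
    """Pearson r between a combined series and y (hypothetical helper)."""
    return np.corrcoef(series, y)[0, 1]

r_mean_clean = r_with_y(xs.mean(axis=0))
r_median_clean = r_with_y(np.median(xs, axis=0))

# Contaminate rows 0 and 1 at the same two positions (exaggerated sizes).
xs_out = xs.copy()
xs_out[:2, 10] += 100
xs_out[:2, 50] -= 150

r_mean_out = r_with_y(xs_out.mean(axis=0))
r_median_out = r_with_y(np.median(xs_out, axis=0))

print(f"clean:    mean r={r_mean_clean:.3f}  median r={r_median_clean:.3f}")
print(f"outliers: mean r={r_mean_out:.3f}  median r={r_median_out:.3f}")
```

With only two of five series contaminated at a given position, the median at that position is still taken from the clean values, whereas the mean absorbs a fifth of each outlier.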