# Fundamentals of statistics in Python
## Measures of correlation between pairs of variables

***
<br>

## Relationship between the elements of two variables

* The relationship between the corresponding elements of two variables in a data set is called a __correlation__.
* Say there are two variables, $x$ and $y$, with an equal number of elements, $n$. Let $x_1$ from $x$ correspond to $y_1$ from $y$, $x_2$ from $x$ to $y_2$ from $y$, and so on. You can then say that there are $n$ pairs of corresponding elements: $(x_1, y_1)$, $(x_2, y_2)$, and so on.
* We can speak of the following types of correlation between pairs of variables:
    * __Positive correlation__ exists when larger values of $x$ correspond to larger values of $y$ and vice versa.
    * __Negative correlation__ exists when larger values of $x$ correspond to smaller values of $y$ and vice versa.
    * __Weak or no correlation__ exists if there is no such apparent relationship.

<img src="img/correlations.png" style="width:700px">
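The three cases can be illustrated with a minimal sketch on small, made-up lists. The helper name `co_movement` and all of the data here are assumptions chosen for illustration. The helper sums the products of each pair's deviations from the two means, so its sign shows whether the values tend to move together or in opposite directions.

```python
def co_movement(x, y):
    """Sum of products of deviations from the means (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y_pos = [2, 4, 5, 8, 10]   # tends to rise as x rises
y_neg = [10, 8, 5, 4, 2]   # tends to fall as x rises
y_none = [5, 1, 6, 1, 5]   # no clear pattern

print(co_movement(x, y_pos))   # positive value
print(co_movement(x, y_neg))   # negative value
print(co_movement(x, y_none))  # approximately zero
```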

## Covariance

* The __sample covariance__ is a measure that quantifies the strength and direction of a relationship between a pair of variables.
* If the correlation is positive, then the covariance is positive as well. For variables measured on a given scale, a stronger relationship corresponds to a higher value of the covariance.
* If the correlation is negative, then the covariance is negative as well. For variables measured on a given scale, a stronger relationship corresponds to a lower value (that is, a higher absolute value) of the covariance.
* If the correlation is weak, then the covariance is close to zero.
* The magnitude of the covariance also depends on the units of the variables, so covariances of different pairs of variables are not directly comparable.
* The covariance of the variables $x$ and $y$ is mathematically defined as $s^{xy} = \frac{\sum_{i} (x_i - \operatorname{mean}(x))(y_i - \operatorname{mean}(y))}{n - 1}$, where $i = 1, 2, \ldots, n$.
* The covariance of a variable with itself (that is, of two identical variables) is its variance.

In [2]:
x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n)) / (n - 1)
print(cov_xy)

19.95


* NumPy has the function `cov()`, which returns the covariance matrix. The off-diagonal element `[0, 1]` is the covariance of `x` and `y`.

In [3]:
import numpy as np

cov_matrix = np.cov(x, y)
print(cov_matrix)

cov_xy = cov_matrix[0, 1]
print(cov_xy)

[[38.5        19.95      ]
 [19.95       13.91428571]]
19.95
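The diagonal of the covariance matrix can be read as a check on the statement that the covariance of a variable with itself is its variance. The sketch below repeats the same `x` and `y` so that it runs on its own. It assumes the sample variance (`ddof=1`) is the quantity to compare against, because `np.cov()` divides by $n - 1$ by default while `np.var()` divides by $n$ unless told otherwise.

```python
import numpy as np

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

cov_matrix = np.cov(x, y)

# ddof=1 makes np.var divide by n - 1, matching np.cov's default
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1)

print(np.isclose(cov_matrix[0, 0], var_x))             # diagonal: variance of x
print(np.isclose(cov_matrix[1, 1], var_y))             # diagonal: variance of y
print(np.isclose(cov_matrix[0, 1], cov_matrix[1, 0]))  # the matrix is symmetric
```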


* Pandas `Series` objects have the method `.cov()`, which you can use to calculate the covariance.

In [4]:
import pandas as pd

x_, y_ = pd.Series(x), pd.Series(y)

cov_xy = x_.cov(y_)
print(cov_xy)

19.95


## Correlation coefficient

* The __correlation coefficient__, or __Pearson product-moment correlation coefficient__, is another measure of the correlation between data.
* You can think of it as a standardized covariance.
* The correlation coefficient is denoted by the symbol $r$.
* The value $r > 0$ indicates positive correlation.
* The value $r < 0$ indicates negative correlation.
* The value $r = 1$ is the maximum possible value of $r$. It corresponds to a perfect positive linear relationship between variables.
* The value $r = -1$ is the minimum possible value of $r$. It corresponds to a perfect negative linear relationship between variables.
* The value $r \approx 0$ means that the linear correlation between the variables is weak.
* The mathematical formula for the correlation coefficient is $r = \frac{s^{xy}}{s^x s^y}$, where $s^x$ and $s^y$ are the standard deviations of $x$ and $y$, respectively.

In [5]:
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n)) / (n - 1)
var_x = sum((item - mean_x)**2 for item in x) / (n - 1)
var_y = sum((item - mean_y)**2 for item in y) / (n - 1)
std_x, std_y = var_x ** 0.5, var_y ** 0.5
r = cov_xy / (std_x * std_y)
print(r)

0.861950005631606


* NumPy has the function `corrcoef()`, which returns the correlation coefficient matrix of the variables.

In [6]:
corr_matrix = np.corrcoef(x, y)
print(corr_matrix)
r = corr_matrix[0, 1]
print(r)

[[1.         0.86195001]
 [0.86195001 1.        ]]
0.8619500056316061
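The description of $r$ as a standardized covariance can be checked numerically. In the sketch below, multiplying `x` by 1000 stands in for a hypothetical change of units (the factor and the names `x_scaled`, `cov_scaled`, and `r_scaled` are assumptions for illustration). The covariance grows by the same factor, while the correlation coefficient should stay the same up to floating-point rounding.

```python
import numpy as np

x = np.arange(-10, 11)
y = np.array([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])

x_scaled = x * 1000  # hypothetical change of units

cov_original = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x_scaled, y)[0, 1]
r_original = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(cov_original, cov_scaled)  # covariance scales with the units
print(r_original, r_scaled)      # correlation coefficient is unaffected
```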


* `scipy.stats` has the routine `pearsonr()`, which calculates the correlation coefficient and the $p$-value.

In [7]:
import scipy.stats

scipy.stats.pearsonr(x, y)

(0.8619500056316061, 5.122760847201135e-07)
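The two returned values can be unpacked into separate names, which is a minimal way to work with them individually. The $p$-value is the probability of observing a correlation at least this strong if the two variables were in fact uncorrelated. The significance level `alpha = 0.05` below is an assumption chosen only for illustration.

```python
import scipy.stats

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

# Unpack the correlation coefficient and the p-value
r, p_value = scipy.stats.pearsonr(x, y)

alpha = 0.05  # hypothetical significance level
print(f"r = {r:.4f}, p = {p_value:.2e}")
print("significant" if p_value < alpha else "not significant")
```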

* Pandas `Series` objects have the method `.corr()` for calculating the correlation coefficient.

In [8]:
r = x_.corr(y_)
print(r)

0.8619500056316061
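For more than two variables, `DataFrame.corr()` returns the matrix of pairwise coefficients. The sketch below uses a made-up frame (the column names `a` to `d` and their values are assumptions for illustration) and shows one possible way to find the column most strongly correlated with a chosen one. It compares absolute values, so a strong negative correlation counts as much as a strong positive one.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 8, 10],   # rises with a, not perfectly
    "c": [10, 8, 6, 4, 2],   # falls linearly as a rises
    "d": [5, 1, 6, 1, 5],    # no clear pattern
})

corr = df.corr()                    # pairwise Pearson coefficients
with_a = corr["a"].drop("a")        # correlations with "a", excluding itself
strongest = with_a.abs().idxmax()   # largest magnitude regardless of sign

print(with_a)
print(strongest)
```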


## --- Exercise ---

Load the data stored in the `data\mtcars.csv` file and, using the `DataFrame.corr()` method, identify the feature most strongly correlated with the `qsec` feature (1/4 mile time).

Meaning of features in the dataset:
* mpg - Miles/(US) gallon
* cyl - Number of cylinders
* disp - Displacement (cu.in.)
* hp - Gross horsepower
* drat - Rear axle ratio
* wt - Weight (1000 lbs)
* qsec - 1/4 mile time
* vs - Engine (0 = V-shaped, 1 = straight)
* am - Transmission (0 = automatic, 1 = manual)
* gear - Number of forward gears
* carb - Number of carburetors 

In [None]:
# write your code here