# Covariance
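Covariance measures how two variables vary together: it is the average, over all data points, of the product of each variable's deviation from its mean. For two variables $H$ and $W$ (for example, height and weight), it is mathematically expressed as,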

$\text{Cov}(H, W) = \frac{1}{n} \sum_{i = 1}^n (H_i - \bar{H}) (W_i - \bar{W})$.

Where,
- $\text{Cov}(H, W)$ = Covariance between $H$ and $W$.
- $H_i$ = i-th value of the variable $H$.
- $\bar{H}$ = Mean of the variable $H$.
- $W_i$ = i-th value of the variable $W$.
- $\bar{W}$ = Mean of the variable $W$.
- $n$ = Total number of data points.
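As a minimal sketch of the formula above, the covariance can be computed directly in plain Python. The height and weight values are made up for illustration, and `covariance` is a hypothetical helper name, not a library function.

```python
def covariance(h, w):
    """Population covariance: mean of the products of deviations (1/n form)."""
    if len(h) != len(w) or not h:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(h)
    h_bar = sum(h) / n
    w_bar = sum(w) / n
    return sum((hi - h_bar) * (wi - w_bar) for hi, wi in zip(h, w)) / n


# Made-up sample (assumed values): heights in inches, weights in kilograms.
heights = [60, 62, 65, 68, 70]
weights = [50, 55, 60, 68, 72]

cov_hw = covariance(heights, weights)
print(cov_hw)  # 29.8
```

With NumPy, `np.cov(heights, weights, bias=True)[0, 1]` should give the same value; without `bias=True`, NumPy divides by $n - 1$ instead of $n$.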

Each term $(H_i - \bar{H}) (W_i - \bar{W})$ can be read as the signed area of the rectangle between a data point and the point of means $(\bar{H}, \bar{W})$. Taking the point of means as the origin, points in Q1 and Q3 contribute positive area and points in Q2 and Q4 contribute negative area.

When the average area is positive, which happens when the points lie mostly in Q1 and Q3, the data points are "positively correlated".

When the average area is negative, which happens when the points lie mostly in Q2 and Q4, the data points are "negatively correlated".

When the net area is approximately zero (the positive area is approximately equal to the negative area), the data points are "uncorrelated".
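The quadrant argument can be checked numerically. The sketch below splits the covariance sum into its positive and negative contributions, measured from the point of means. The three small datasets and the helper name `signed_areas` are assumptions made for illustration.

```python
def signed_areas(xs, ys):
    """Return (positive_area, negative_area, covariance) about the point of means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    positive = sum(p for p in products if p > 0)  # contributions from Q1 and Q3
    negative = sum(p for p in products if p < 0)  # contributions from Q2 and Q4
    return positive, negative, (positive + negative) / n


xs = [-2, -1, 0, 1, 2]
rising = [-4, -1, 0, 1, 4]   # increases with xs
falling = [4, 1, 0, -1, -4]  # decreases with xs
u_shaped = [4, 1, 0, 1, 4]   # positive and negative areas cancel

for name, ys in [("rising", rising), ("falling", falling), ("u_shaped", u_shaped)]:
    pos, neg, cov = signed_areas(xs, ys)
    print(f"{name}: positive={pos}, negative={neg}, covariance={cov}")
```

The sign of the covariance follows whichever pair of quadrants carries more area; when the two cancel, the covariance is zero.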

The covariance depends on the scale of measurement. Suppose one plot uses inches and kilograms, and the same data is plotted again in centimeters and pounds. The second plot is stretched because the units are finer, so the numerical values (and hence the areas) are larger even though the underlying relationship has not changed.
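A short sketch of this scale dependence, using made-up heights and weights. The conversion factors (1 inch = 2.54 cm, 1 kg ≈ 2.20462 lb) are standard; the data and the helper `covariance` are assumptions for illustration.

```python
def covariance(xs, ys):
    """Population covariance (1/n form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


INCH_TO_CM = 2.54
KG_TO_LB = 2.20462

# Made-up sample (assumed values).
heights_in = [60, 62, 65, 68, 70]
weights_kg = [50, 55, 60, 68, 72]

# Same measurements expressed in finer units.
heights_cm = [h * INCH_TO_CM for h in heights_in]
weights_lb = [w * KG_TO_LB for w in weights_kg]

cov_in_kg = covariance(heights_in, weights_kg)
cov_cm_lb = covariance(heights_cm, weights_lb)

print("inches/kg:", cov_in_kg)
print("cm/lb:    ", cov_cm_lb)
print("ratio:    ", cov_cm_lb / cov_in_kg)  # about 2.54 * 2.20462
```

The covariance grows by the product of the two conversion factors, although the data describe exactly the same people.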

A useful measure of association should not depend on whether centimeters, kilometers, or inches are used. To address this, statisticians standardize each deviation so that the resulting quantity is the same in every unit.

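The product of deviations, $(H_i - \bar{H}) (W_i - \bar{W})$, keeps the same form.

The only change is that each deviation is divided by the standard deviation of its variable, giving the standardized covariance $\frac{1}{n} \sum_{i = 1}^n \frac{(H_i - \bar{H})}{\sigma_h} \frac{(W_i - \bar{W})}{\sigma_w}$, where $\sigma_h$ and $\sigma_w$ are the standard deviations of $H$ and $W$ (computed with the same $\frac{1}{n}$ normalization).

A deviation divided by the standard deviation, such as $\frac{(H_i - \bar{H})}{\sigma_h}$, is called a Z-Score.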

This entire expression is called the correlation coefficient, denoted by $\rho$.

Therefore, $\rho = \frac{1}{n} \sum_{i = 1}^n \frac{(H_i - \bar{H})}{\sigma_h} \frac{(W_i - \bar{W})}{\sigma_w}$.
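A minimal sketch of this formula: standardize each variable into Z-scores, then average their products. The helper names are hypothetical, and the sample values are small illustrative lists. `statistics.pstdev` is the population standard deviation (dividing by $n$), which matches the $\frac{1}{n}$ in the formula.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; Z-scores are undefined")
    return [(v - mu) / sigma for v in values]


def correlation(h, w):
    """Correlation coefficient as the mean product of Z-scores."""
    zh, zw = z_scores(h), z_scores(w)
    return sum(a * b for a, b in zip(zh, zw)) / len(h)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
rho = correlation(x, y)
print(round(rho, 6))  # 0.774597
```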

$\rho$ always lies between -1 and +1. This is because dividing by the standard deviations normalizes the covariance, whose magnitude can never exceed $\sigma_h \sigma_w$.
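One standard way to see this bound is the Cauchy-Schwarz inequality. As a brief sketch, write $a_i = H_i - \bar{H}$ and $b_i = W_i - \bar{W}$ (shorthand introduced here only for this argument). Then,

$\left| \frac{1}{n} \sum_{i = 1}^n a_i b_i \right| \le \sqrt{\frac{1}{n} \sum_{i = 1}^n a_i^2} \sqrt{\frac{1}{n} \sum_{i = 1}^n b_i^2} = \sigma_h \sigma_w$.

Dividing both sides by $\sigma_h \sigma_w$ gives $|\rho| \le 1$, with equality only when the $b_i$ are an exact multiple of the $a_i$, that is, when all the data points lie on a straight line.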

$\rho$ is the same regardless of the units used.
- If $\rho$ is close to 0, the data has little or no linear correlation.
- If $\rho$ is close to 1, the data has a strong positive correlation.
- If $\rho$ is close to -1, the data has a strong negative correlation.
- If the data points are scattered evenly around the origin, with no visible trend, $\rho$ is approximately 0.
- If the data points cluster tightly along a line through Q2 and Q4, $\rho$ is close to -1. It equals -1 only when the points lie exactly on such a line.
- If the data points cluster tightly along a line through Q1 and Q3, $\rho$ is close to 1. It equals 1 only when the points lie exactly on such a line.
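Two of these claims can be checked with a small sketch: that $\rho$ does not change with the units, and that it reaches exactly +1 or -1 only for points on a straight line. The data values and the helper `correlation` are assumptions for illustration.

```python
from statistics import mean, pstdev


def correlation(xs, ys):
    """Population correlation: covariance divided by both standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))


# Made-up sample (assumed values) in two unit systems.
heights_in = [60, 62, 65, 68, 70]
weights_kg = [50, 55, 60, 68, 72]
heights_cm = [h * 2.54 for h in heights_in]
weights_lb = [w * 2.20462 for w in weights_kg]

rho_in_kg = correlation(heights_in, weights_kg)
rho_cm_lb = correlation(heights_cm, weights_lb)
print(rho_in_kg, rho_cm_lb)  # equal up to floating-point rounding

# Points lying exactly on a line give the extreme values.
xs = [1, 2, 3, 4, 5]
on_rising_line = [2 * x + 1 for x in xs]
on_falling_line = [-3 * x + 10 for x in xs]
print(correlation(xs, on_rising_line))   # 1 up to rounding
print(correlation(xs, on_falling_line))  # -1 up to rounding
```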

To summarize, the covariance changes with the units of measurement even when the relationship in the data is unchanged. To prevent this, the covariance is computed on Z-Scores, which gives the correlation coefficient.

# Correlation Test
Correlation tests are statistical methods used to assess the strength and direction of the relationship between two continuous variables. The most common is based on the Pearson correlation coefficient; others include the Spearman rank correlation and the Kendall tau rank correlation.
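As a rough comparison of the three coefficients, the sketch below applies SciPy's `pearsonr`, `spearmanr`, and `kendalltau` to the same data. The data are an assumed example in which `y` increases monotonically, but not linearly, with `x`.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Assumed illustrative data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

# Each function returns the coefficient and a p-value; only the
# coefficient is used here.
pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)
kendall_tau, _ = kendalltau(x, y)

print("Pearson: ", pearson_r)
print("Spearman:", spearman_rho)
print("Kendall: ", kendall_tau)
```

Because the rank-based coefficients depend only on the ordering of the values, both come out as 1 for perfectly monotonic data, while the Pearson coefficient stays below 1 because the relationship is not linear.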

# Pearson Correlation Coefficient
- Objective: Measures the linear relationship between 2 continuous variables.
- Null hypothesis (H0): There is no linear correlation between the 2 variables.
- Alternative hypothesis (H1): There is a linear correlation between the 2 variables.
- Test statistic: Pearson correlation coefficient ($r$).
- Assumptions: Assumes that the data are normally distributed and that there is a linear relationship between the variables.
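To connect the test statistic to a p-value, the sketch below recomputes the two-sided p-value by hand from $t = r \sqrt{\frac{n - 2}{1 - r^2}}$, which follows a t-distribution with $n - 2$ degrees of freedom under H0, and compares it with the value returned by `pearsonr`. The small sample is illustrative, and the variable names are assumptions.

```python
import math

from scipy.stats import pearsonr, t

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

r, p_scipy = pearsonr(x, y)

# Under H0 (and the normality assumption), this statistic follows a
# t-distribution with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value: probability of a |t| at least this large under H0.
p_manual = 2 * t.sf(abs(t_stat), n - 2)

print("t statistic:     ", t_stat)
print("p-value (manual):", p_manual)
print("p-value (scipy): ", p_scipy)
```

The two p-values should agree up to floating-point rounding. With only five points, the p-value is well above the common 0.05 threshold, so H0 would not be rejected for this sample despite the fairly large $r$.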

In [2]:
```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Small illustrative sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# pearsonr returns the coefficient r and the two-sided p-value
correlation_coefficient, p_value = pearsonr(x, y)
print("Pearson Correlation Coefficient:", correlation_coefficient)
print("P-Value:", p_value)

# Height/weight dataset with "Height" and "Weight" columns
df = pd.read_csv("weight-height.csv")

df[["Height", "Weight"]].corr()  # pairwise Pearson correlation matrix
```

Pearson Correlation Coefficient: 0.7745966692414834
P-Value: 0.12402706265755456


| | Height | Weight |
|---|---|---|
| Height | 1.0 | 0.924756 |
| Weight | 0.924756 | 1.0 |


In [3]:
```python
np.corrcoef(df["Height"], df["Weight"])  # Pearson correlation matrix via NumPy
```

array([[1.       , 0.9247563],
       [0.9247563, 1.       ]])