# Variance, Covariance, and Correlation

## Introduction 

In this lesson, we'll look at how the **Variance** of a random variable is used to calculate **Covariance** and **Correlation**, two key measures used in statistics for describing relationships between random variables. These measures help us identify the degree to which two sets of data tend to deviate from their expected values (i.e. their means) in a similar way. Based on these measures, we can see whether two variables move together, and to what extent. This lesson covers the concepts, the necessary calculations, and some precautions to keep in mind when using these measures.

## Learning Objectives

You will be able to:

* Understand and explain data variance and how it relates to standard deviation
* Understand and calculate Covariance and Correlation between two random variables
* Visualize and interpret Covariance and Correlation

---

## What is Variance ($\sigma^2$)

Before we talk about covariance, we need a working understanding of the **Variance** of a random variable. Variance describes the __spread of a variable's values in a data set__.

> __Variance is a measure used to quantify how much a random variable deviates from its mean value__. 

When we calculate variance, we're essentially asking, "__Given the relationship of all given data points, how distant from the mean do we expect the next data point to be?__"  Each data point's distance from the mean is its **deviation** (sometimes called the **error term**), and variance summarizes these deviations by averaging their squares.

Variance is shown using notation $\sigma^2$. Previously, we've seen $\sigma$ as a measure of standard deviation. Remember, standard deviation is also a measure of the spread of data. __Variance is simply the square of standard deviation. (Or we could say standard deviation is the square root of variance)__. 
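As a quick sanity check of this relationship, the sketch below computes both quantities with NumPy on a small made-up data set (the values are illustrative, not taken from this lesson) and confirms that squaring the standard deviation recovers the variance.

```python
import numpy as np

# Hypothetical values, invented for illustration
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_dev = np.std(data)     # population standard deviation (divides by n)
variance = np.var(data)    # population variance (divides by n)

print(std_dev)                              # 2.0
print(variance)                             # 4.0
print(np.isclose(variance, std_dev ** 2))   # True
```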

### Example Use Case
For example, you're at a music festival and two acts you'd love to see, _"Slaw Bomb"_ and _"House of the Gavins"_* are playing at overlapping times.  You've seen both several times before.  Slaw Bomb puts on a consistently good show. They're not your favorite band in the world, but they do a good enough job.  On the other hand, House of the Gavins is all over the place.  The lead singer is a loose cannon and their last tour was sloppily put together, but when they're on, they're the best band in the world, and their debut album changed your life.

Even if, on average, Slaw Bomb is the better band, House of the Gavins has a much higher variance.  Your decision on who to see would depend on your risk appetite -- would you rather know you'll have a good time, or take a risk and potentially have a better outcome?

*https://www.theverge.com/tldr/2018/1/23/16924680/coachella-lineup-botnik-studios

Here, we can see the distributions of each band's past performances on a 1-10 scale.

<center>
    <img src='images/slaw_bomb.png'/>
    <img src='images/gavins.png'/>
</center>

Slaw Bomb's distribution of show quality is more concentrated than House of the Gavins', as illustrated above.  A more concentrated distribution has a smaller variance.  Slaw Bomb's performances are closer to their mean, and House of the Gavins' are more dispersed.  Even though Slaw Bomb's mean is higher than House of the Gavins', House of the Gavins has a **much** higher variance.
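To make the comparison concrete, here is a minimal sketch using invented ratings for the two bands. The numbers are hypothetical, chosen only so that Slaw Bomb has the higher mean and House of the Gavins the higher variance; they are not the data behind the plots above.

```python
import numpy as np

# Hypothetical 1-10 ratings for ten past shows by each band (invented for illustration)
slaw_bomb = np.array([7, 7, 8, 7, 6, 7, 8, 7, 7, 6])
gavins = np.array([2, 10, 3, 10, 4, 9, 1, 10, 5, 6])

for name, ratings in [("Slaw Bomb", slaw_bomb), ("House of the Gavins", gavins)]:
    # np.var divides by n by default (population variance)
    print(f"{name}: mean={np.mean(ratings):.1f}, variance={np.var(ratings):.1f}")

# Slaw Bomb: mean=7.0, variance=0.4
# House of the Gavins: mean=6.0, variance=11.2
```

With these made-up ratings, the "safe" band wins on the average while the "risky" band has a spread roughly 28 times larger.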

### Interpreting Variance 

A variance of zero means that all of the values in a data set are identical. Variance can never be negative, so any other data set has a positive variance. The larger the variance, the more spread out the data. A large variance means that the numbers in a set are far from the mean and from each other; a small variance means that they are close together.
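The three small arrays below are a hedged illustration of these cases. The values are invented; each array has a mean of 5, so only the spread differs.

```python
import numpy as np

identical = np.array([5, 5, 5, 5])   # no spread at all
tight = np.array([4, 5, 5, 6])       # values close to the mean of 5
wide = np.array([1, 3, 7, 9])        # values far from the mean of 5

print(np.var(identical))  # 0.0
print(np.var(tight))      # 0.5
print(np.var(wide))       # 10.0
```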

### How to Calculate Variance

Variance is calculated by:
1. Taking the squared differences between each element in a data set and the mean. 
2. Summing those squared differences.
3. Dividing the sum of the resulting squares by the number of values in the set, *n*.

$$\sigma^2 = \frac{\sum(x-\mu)^2}{n}$$

Here, $x$ represents an individual data point and $\mu$ represents the mean of the data points. $n$ is the total number of data points. Remember that while calculating a sample's variance in order to estimate a population variance, the denominator of the variance equation becomes n - 1. Doing so removes bias, preventing under-estimation of the population variance.
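In NumPy, the choice of denominator is controlled by the `ddof` ("delta degrees of freedom") argument of `np.var()`. The sketch below, on made-up values, contrasts the two conventions; treat it as an illustration rather than a recommendation for any particular data set.

```python
import numpy as np

# Treat these made-up values as a sample drawn from a larger population
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(sample)

population_style = np.var(sample)        # default ddof=0: divides by n
sample_style = np.var(sample, ddof=1)    # ddof=1: divides by n - 1

print(population_style)                  # 4.0
print(round(sample_style, 4))            # 4.5714

# The two estimates differ only by the factor n / (n - 1)
print(np.isclose(sample_style, population_style * n / (n - 1)))  # True
```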

The following illustration summarizes how the spread of data around the mean (10) relates to the variance. 

<img src="images/var2.png" width=500>

### Code Example

In [1]:
#Let's calculate the variance of a randomly generated variable

import numpy as np
np.random.seed(0)


n = 100
X = np.random.normal(loc=0,scale=1,size=n)
mean = np.mean(X)

variance = sum([(x-mean)**2 for x in X])/n

print(variance)

1.0158266192149314


In [2]:
#We can also use np.var()
np.var(X)

1.0158266192149312

---

## Covariance ($\sigma_{xy}$)

Now that we know what variance is and what it measures, imagine extending the idea to two random variables to get a sense of how they change together (or fail to) across all of their paired values.

In statistics, if we're trying to figure out how two random variables **tend to vary** together, we're effectively talking about the **Covariance** between these variables. Covariance provides insight into how two variables **move in relation to each other**.

### How to Calculate Covariance
In essence, covariance is used to measure **how much variables change together**, and it's calculated using the formula:


$$ \large \sigma_{XY} = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)}{n}$$

Here $X$ and $Y$ are two random variables having $n$ elements each. We want to measure ___how $X$ and $Y$ vary together___, that is, how values in $Y$ tend to change as values in $X$ change (or vice versa).

> By convention, we treat $X$ as the __independent variable__ and $Y$ as the __dependent variable__, although covariance itself is symmetric: $\sigma_{XY} = \sigma_{YX}$.

$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ is the same for $X$ and $Y$, because they are paired.__)

$\mu_x$ = mean of the independent variable $X$

$\mu_y$ = mean of the dependent variable $Y$

$\sigma_{XY}$ = Covariance between $X$ and $Y$

*Notice that the formula above mirrors the variance formula: instead of squaring each element's deviation from its own mean, it multiplies the deviation of each $x_i$ by the deviation of the corresponding $y_i$. Hence the term __Co-Variance__. The covariance of a variable with itself is simply its variance.*

### Interpreting Covariance Values 

* A positive covariance indicates that **higher than average** values of one variable tend to pair with higher than average values of the other variable (and lower with lower).

* A negative covariance indicates that higher than average values of one variable tend to pair with **lower than average** values of the other variable.

* A covariance of zero, or close to zero, indicates no linear relationship: knowing whether one variable is above or below its mean says little about where the other falls.

These patterns are shown in the scatter plots below.
<img src="images/covariance.gif" width=500>



A large negative covariance shows an inverse relationship between the values on the x and y axes, i.e. y decreases as x increases. This is shown by the scatter plot on the left. The middle scatter plot shows values spread all over the plot, reflecting the fact that the variables on the x and y axes do not vary together. The covariance for these variables would be very close to zero.

In the scatter plot on the right, we see a strong positive relationship between the values on the x and y axes, i.e. y increases as x increases. The covariance for these variables would be large and positive.
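The three cases can also be reproduced numerically. The sketch below uses a small hypothetical helper, `population_covariance`, and made-up data chosen so that each sign appears once; it is an illustration of the formula above, not the data behind the scatter plots.

```python
import numpy as np

def population_covariance(a, b):
    # Mean of the products of paired deviations (divides by n)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.mean((a - a.mean()) * (b - b.mean()))

x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 6, 8, 10]   # high x pairs with high y
falls_with_x = [10, 8, 6, 4, 2]   # high x pairs with low y
unrelated = [3, 1, 4, 1, 3]       # no consistent pairing with x

print(population_covariance(x, rises_with_x))   # positive (4.0)
print(population_covariance(x, falls_with_x))   # negative (-4.0)
print(population_covariance(x, unrelated))      # approximately 0
```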

The units of covariance are difficult to interpret.  For example, $\sigma_{XY}$ is expressed in the units of $X$ times the units of $Y$.  So, if $X$ is in dollars and $Y$ is in days, the units would be dollar-days, which has no intuitive meaning.

>__Covariance is not standardized. Therefore, covariance values can range from negative infinity to positive infinity.__
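One practical consequence is that simply changing a unit changes the covariance. The sketch below uses hypothetical prices and delivery times (invented for illustration) and re-expresses the prices in cents; the relationship is identical, yet the covariance grows by a factor of 100.

```python
import numpy as np

# Hypothetical paired observations, invented for illustration
price_dollars = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
delivery_days = np.array([5.0, 4.0, 4.0, 2.0, 1.0])

price_cents = price_dollars * 100   # same prices, different unit

# np.cov returns a 2x2 matrix; the covariance of the pair is at [0, 1]
cov_dollars = np.cov(price_dollars, delivery_days)[0, 1]
cov_cents = np.cov(price_cents, delivery_days)[0, 1]

print(cov_dollars)   # in dollar-days
print(cov_cents)     # in cent-days: 100 times larger, same relationship
```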

### Code Example

In [3]:
from sklearn.datasets import make_regression
np.random.seed(0)

#generate X and Y variables with some covariance
X, Y = make_regression(n_samples=100, n_features=1, noise=10)
X = X.flatten()

In [4]:
#first, we'll compute covariance manually.
def covariance(a,b):
    # calculates the covariance between a and b
    
    #mean of a
    mu_a = sum(a)/len(a)

    #mean of b
    mu_b = sum(b)/len(b)

    #number of observations in the data set
    n = len(a)
    
    #calculate covariance
    return sum((a[i] - mu_a)*(b[i] - mu_b) for i in range(n)) / n

covariance(X,Y)

43.29395178574572

In [5]:
#now, with numpy
np.cov(X,Y)

array([[1.02608749e+00, 4.37312644e+01],
       [4.37312644e+01, 1.97912631e+03]])

#### Wait...
1. Using `np.cov()` gave us a matrix of 4 different numbers.
2. None of these numbers match up with the covariance we calculated manually.

#### Explanation:
The matrix returned shows:
- the variance of X at index `[0,0]`.
- the covariance of X and Y at index `[0,1]`.
- the covariance of Y and X at index `[1,0]`. (Note that this is the same as the covariance of X and Y.)
- the variance of Y at index `[1,1]`.

As for the numbers not matching up, NumPy's covariance calculation uses $n - 1$ in the denominator instead of $n$.  If we modify the original function accordingly, we'll get the same result.  Let's try that below.

In [6]:
def covariance_unbiased(a,b):
    #calculates covariance using n-1 instead of n
    n = len(a)
    #if we use the original function, multiply by the denominator, and divide by n-1,
    #we'll have essentially modified the equation into the unbiased estimator.
    return covariance(a,b)*n/(n-1)

covariance_unbiased(X,Y)

43.73126443004618

Now, our results match up with the results from `np.cov()`.
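NumPy can also be asked to divide by $n$ directly. The sketch below uses the `bias=True` argument of `np.cov()` on freshly generated data (the arrays `x` and `y` are illustrative stand-ins, not the `X` and `Y` from the cells above) and checks both conventions against a manual calculation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + rng.normal(size=100)   # y depends on x plus noise
n = len(x)

# Manual population covariance (divides by n)
manual_population = np.sum((x - x.mean()) * (y - y.mean())) / n

default_cov = np.cov(x, y)[0, 1]                # divides by n - 1
population_cov = np.cov(x, y, bias=True)[0, 1]  # divides by n

print(np.isclose(population_cov, manual_population))              # True
print(np.isclose(default_cov, manual_population * n / (n - 1)))   # True
```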

---

## Correlation 

Above, we saw how covariance can identify how much two random variables tend to vary together with a formula that depends on the units of the $X$ and $Y$ variables. During data analysis, covariance sometimes can't be directly used in data comparison, as different experiments may contain underlying data measured in different units. Therefore, we need to scale covariance into a standard unit, with interpretable results independent of the units of data. We achieve this with a derived normalized measure called correlation. 

Correlation is defined as covariance divided by the product of the standard deviations of $X$ and $Y$. This scaling restricts the range of possible values to between -1 and 1 and cancels out the units of each variable. So the correlation between $X$ and $Y$ is calculated as:

$$Correlation(X,Y) = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$

>When two random variables **correlate**, a change in one variable tends to be **associated with** a change in the other. Correlation alone does not show that one variable causes the other to change.

In practice, we typically look at correlation rather than covariance because it is more interpretable: it does not depend on the scale or unit of either random variable involved.
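The unit-independence can be checked directly. The sketch below defines a hypothetical helper, `correlation`, that applies the formula above with population estimates throughout, then rescales one variable from dollars to cents. The data are invented for illustration.

```python
import numpy as np

# Hypothetical paired observations, invented for illustration
price_dollars = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
delivery_days = np.array([5.0, 4.0, 4.0, 2.0, 1.0])

def correlation(a, b):
    # Population covariance divided by the product of population standard deviations
    cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
    return cov_ab / (np.std(a) * np.std(b))

r_dollars = correlation(price_dollars, delivery_days)
r_cents = correlation(price_dollars * 100, delivery_days)   # rescale one variable

print(r_dollars)
print(np.isclose(r_dollars, r_cents))                                          # True
print(np.isclose(r_dollars, np.corrcoef(price_dollars, delivery_days)[0, 1]))  # True
```

Because the same denominator convention is used for the covariance and both standard deviations, the $n$ versus $n - 1$ choice cancels out and the result agrees with `np.corrcoef()`.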

### Calculating Coefficient of Correlation (r)

Expanding the correlation formula from above, the Pearson correlation coefficient ($r$) is calculated using the following formula:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

So just like in the case of covariance,  $X$ and $Y$ are paired random variables having n elements each. 


$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ must be same for $X$ and $Y$__)

$\mu_x$ = mean of the independent variable $X$

$\mu_y$ = mean of the dependent variable $Y$

$r$ = Calculated Pearson Correlation


A detailed mathematical insight into this equation is available [in this paper](http://www.hep.ph.ic.ac.uk/~hallg/UG_2015/Pearsons.pdf)


### Why does Correlation Fall between -1 and 1?

Why does dividing by the product of the standard deviations confine correlation to the range -1 to 1?  For fixed standard deviations of $X$ and $Y$, the magnitude of the covariance is largest when the data fall exactly on a straight line (a consequence of the Cauchy-Schwarz inequality), so it is enough to check what $r$ equals in that extreme case.  The equation for a straight line is:

$$ y = m * x + b $$


Where $m$ is slope (either positive or negative) and $b$ is the y-intercept.
From this, it follows that:

$$ \mu_y = m * \mu_x + b $$

We can then substitute these equations into Pearson's correlation equation:

$$ r = \frac{\sum_{i=1}^{n}[(x_i -\mu_x)(m*x_i + b - (m * \mu_x + b))]} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(m*x_i + b-(m * \mu_x + b))^2}}$$

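Because $y_i - \mu_y = m(x_i - \mu_x)$, the numerator becomes $m\sum_{i=1}^{n}(x_i - \mu_x)^2$ and the denominator becomes $\sqrt{m^2}\sum_{i=1}^{n}(x_i - \mu_x)^2$. The sums cancel, which leaves:

$$ r = \frac{m}{\sqrt{m^2}} \quad \text{or} \quad \frac{m}{\left|m\right|}$$

Which is just the sign of $m$: -1 or 1 (provided $m \neq 0$).  Cool!  Any data set that does not sit exactly on a line lands strictly between these two extremes.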
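This result is easy to verify numerically. The sketch below builds points that lie exactly on a line for one positive and one negative slope (the slopes and intercept are arbitrary choices for illustration) and computes $r$ with `np.corrcoef()`.

```python
import numpy as np

x = np.arange(10, dtype=float)
results = {}

for m in (2.5, -0.5):
    y = m * x + 3.0                    # points exactly on a line with slope m
    results[m] = np.corrcoef(x, y)[0, 1]
    print(m, round(results[m], 6))     # r is 1.0 for the positive slope, -1.0 for the negative
```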

### Code Example

In [7]:
#Let's start with the same data we've been working with.

#Just to generate it again...
np.random.seed(0)
X, Y = make_regression(n_samples=100, n_features=1, noise=10)
X = X.flatten()

In [8]:
#Now to compute correlation.
#Once again, we should account for the fact that numpy uses n-1 instead of n for the unbiased estimator.
#We'll do this by changing the denominator of the equation to use n-1 when computing standard deviations.

def correlation_unbiased(a,b):
    n = len(a)
    return np.cov(a,b)[0,1] / ((np.std(a)*np.std(b)) * (n)/(n-1))

correlation_unbiased(X,Y)

0.9704274690934446

In [9]:
#Now with numpy's built-in correlation function.
np.corrcoef(X,Y)

array([[1.        , 0.97042747],
       [0.97042747, 1.        ]])

Similar to `np.cov()`, `np.corrcoef()` returns a matrix of Pearson correlation coefficients in the same orientation as before.  The relevant number to our needs is at index `[0,1]`.

### Types of Correlation Measures

The __linear correlation coefficient__, $r$, measures the strength and the direction of a linear relationship between two variables. It is also called __Pearson's correlation coefficient__.

In statistics, four commonly used correlation measures are:
* Pearson correlation
* Kendall rank correlation
* Spearman correlation
* Point-biserial correlation


For now, we'll focus on Pearson correlation as it's the go-to correlation measure for most situations. 

Pearson's $r$ can be computed for any pair of numeric variables, but the usual significance tests for it assume that both variables are approximately normally distributed. Other assumptions include linearity and homoskedasticity. Linearity assumes a straight-line relationship between the two variables, and homoskedasticity assumes that the data are spread evenly about the regression line.
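To see why the linearity assumption matters, the sketch below compares Pearson's $r$ with a rank-based (Spearman-style) correlation on a relationship that is perfectly monotonic but far from linear. The helper `spearman_no_ties` is a hypothetical, simplified implementation that ranks values with a double `argsort` and is only valid when there are no tied values; it is not a replacement for a library routine.

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman_no_ties(a, b):
    # Replace each value by its rank, then take the Pearson correlation of the ranks.
    # The double argsort yields ranks 0..n-1 and assumes no tied values.
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

x = np.arange(1.0, 11.0)
y = np.exp(x)                       # always increasing, but strongly non-linear

pearson_r = pearson(x, y)
spearman_r = spearman_no_ties(x, y)

print(pearson_r)    # noticeably below 1: the points do not follow a straight line
print(spearman_r)   # 1 (up to rounding): the ordering of x and y agrees perfectly
```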

### Use Cases


#### Social Media and Websites
Digital publishers want to understand the relationship between social media activity and visits to their website. For example, a digital publisher runs a correlation report between hourly Twitter mentions and website visits over two weeks. The correlation is found to be r = 0.28, which indicates a weak-to-moderate positive relationship between Twitter mentions and website visits.

#### Optimization for E-retailers
E-retailers are interested in driving increased revenue. For example, an e-retailer wants to compare a number of secondary success events (e.g., file downloads, product detail page views, internal search click-throughs, etc.) with weekly web revenue. They can quickly identify internal search click-throughs as having the highest correlation, which may indicate an area for optimization.

### Interpreting Correlation Values

If two variables have a correlation of +0.9, a change in one variable tends to be accompanied by a nearly proportional change in the same direction in the other. A correlation of -0.9 means that a change in one variable tends to be accompanied by a change in the opposite direction in the other. A Pearson correlation near 0 indicates little or no linear relationship. Here are some examples of Pearson correlation values as scatter plots.

![](images/pearson_2.png)

Think about stock markets in terms of correlation. Stock traders tend to use information about positive and negative correlations between the prices of assets when building their portfolios.  Major stock market indexes tend to move together in similar directions. When the Dow Jones loses 5%, the S&P 500 usually loses around 5%. When the Dow Jones gains 5%, the S&P 500 usually gains around 5%, because they are **highly correlated**.

On the other hand, there could also be negative correlation. You might observe that as the Dow Jones loses 5% of its value, gold gains 5%. Alternatively, if the Dow Jones gains 5% of its value, gold may lose 5% of its value. That's **negative correlation**.

### How do These Measures Relate to Each Other?

Are covariance and correlation the same thing? Somewhat.

While both covariance and correlation indicate whether variables are positively or inversely related to each other, they are not considered to be the same. This is because correlation also informs about the degree to which the variables tend to move together.

Covariance can be computed for variables with different units of measurement. Its sign tells analysts whether the variables tend to increase together or move in opposite directions, but its magnitude does not say how strongly they move together, because covariance is not expressed in a standardized unit.

Correlation, on the other hand, standardizes the measure of interdependence between two variables and informs researchers as to how closely the two variables move together.

---

## Summary
In this lesson, we looked at calculating the variance of random variables as a measure of deviation from the mean. We saw how this measure can be used to first calculate covariance, and then correlation, to analyze how the change in one variable is associated with change in another variable. Next, we'll see how we can use correlation analysis to run a __regression analysis__ and later, how covariance calculation helps us with dimensionality reduction.