# Calculating the Correlation

The correlation coefficient measures the strength of the __linear__ relationship between two sets of numbers. The coefficient can be either positive or negative, and its value ranges from −1 to 1. A correlation coefficient of 0 indicates that there’s no linear correlation between the two quantities. A coefficient of 1 or close to 1 indicates that there’s a strong positive linear correlation; a coefficient of exactly 1 is referred to as perfect positive correlation. Similarly, a correlation coefficient of −1 or close to −1 indicates a strong negative linear correlation; a coefficient of exactly −1 is referred to as perfect negative correlation.

## Correlation and Causation

In statistics, you’ll often come across the statement “correlation doesn’t imply causation.” This is a reminder that even if two sets of observations are strongly correlated with each other, that doesn’t mean one variable causes the other. When two variables are strongly correlated, sometimes there’s a third factor that influences both variables and explains the correlation.

## Calculating the Correlation Coefficient

The correlation coefficient is calculated using the formula: 

$$ correlation = \dfrac{n\Sigma xy - \Sigma x \Sigma y}{\sqrt{(n\Sigma x^2 - (\Sigma x)^2)(n\Sigma y^2 - (\Sigma y)^2)}} $$

In the above formula, $n$ is the total number of values present in each set of numbers (the sets have to be of equal length). The two sets of numbers are denoted by x and y (it doesn’t matter which one you denote as which). The other terms are described as follows:


$\Sigma xy$ - Sum of the products of the individual elements of the two sets of numbers, x and y

$\Sigma x$ - Sum of the numbers in set x

$\Sigma y$ - Sum of the numbers in set y

$(\Sigma x)^2$ - Square of the sum of the numbers in set x

$(\Sigma y)^2$ - Square of the sum of the numbers in set y

$\Sigma x^2$ - Sum of the squares of the numbers in set x

$\Sigma y^2$ - Sum of the squares of the numbers in set y

Once you’ve calculated these terms, you can combine them according to the preceding formula to find the correlation coefficient. For small lists, it’s possible to do this by hand without too much effort, but it gets tedious and error-prone as the size of each set of numbers increases.
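As a sketch of the hand calculation, take the two short lists used in the code later in this section, x = [1, 2, 3] and y = [4, 5, 12]. The individual terms work out to:

$$ n = 3, \quad \Sigma x = 6, \quad \Sigma y = 21, \quad \Sigma xy = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 12 = 50 $$

$$ \Sigma x^2 = 1 + 4 + 9 = 14, \quad \Sigma y^2 = 16 + 25 + 144 = 185 $$

Substituting these into the formula gives:

$$ correlation = \dfrac{3 \cdot 50 - 6 \cdot 21}{\sqrt{(3 \cdot 14 - 6^2)(3 \cdot 185 - 21^2)}} = \dfrac{24}{\sqrt{6 \cdot 114}} = \dfrac{24}{\sqrt{684}} \approx 0.92 $$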

We'll need the zip() function to help us calculate the sum of products from the two sets of numbers:

In [1]:
list1 = [1, 2, 3]
list2 = [4, 5, 6]
for x, y in zip(list1, list2):
    print(x, y)

1 4
2 5
3 6


The zip() function returns pairs of the corresponding elements from the two lists, which you can then use in a loop to perform other operations.
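As a small sketch of how this applies to the formula, the pairs from zip() can be multiplied and summed to get the $\Sigma xy$ term. The generator expression here is just one compact way to write it; an explicit loop that appends each product to a list works equally well. The name `sum_of_products` is chosen only for this illustration.

```python
list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Multiply each pair of corresponding elements, then add up the products
sum_of_products = sum(x * y for x, y in zip(list1, list2))

print(sum_of_products)  # 1*4 + 2*5 + 3*6 = 32
```

Note that zip() stops at the end of the shorter list, so lists of unequal length would silently drop the extra elements.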

Now, we can write our code:

In [4]:
def find_corr_x_y(x,y):
    n = len(x)

    # Find the sum of the products
    prod = []
    for xi, yi in zip(x,y):
        prod.append(xi*yi)

    sum_prod_x_y = sum(prod)
    sum_x = sum(x)
    sum_y = sum(y)
    squared_sum_x = sum_x**2
    squared_sum_y = sum_y**2

    # Find the sum of the squares of x
    x_square = []
    for xi in x:
        x_square.append(xi**2)
    x_square_sum = sum(x_square)

    # Find the sum of the squares of y
    y_square = []
    for yi in y:
        y_square.append(yi**2)
    y_square_sum = sum(y_square)

    # Now, just apply the formula
    numerator = n*sum_prod_x_y - sum_x*sum_y
    denominator_term1 = n*x_square_sum - squared_sum_x
    denominator_term2 = n*y_square_sum - squared_sum_y
    denominator = (denominator_term1*denominator_term2)**0.5
    correlation = numerator/denominator

    return correlation

list1 = [1, 2, 3]
list2 = [4, 5, 12]

print("The correlation of the lists is: {:.2f}".format(find_corr_x_y(list1, list2)))


The correlation of the lists is: 0.92


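The find_corr_x_y() function accepts two arguments, x and y, which are the two sets of numbers we want to calculate the correlation for, and returns the correlation coefficient. For the two lists above, the result of 0.92 indicates a strong positive linear correlation.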
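To check the function against the interpretation given at the start of this section, it can help to try inputs whose correlation is known in advance. The following is a minimal sketch, not a replacement for the function above: `correlation()` is a hypothetical compact variant of the same formula, and the two input checks (equal lengths and a nonzero denominator) are assumptions added here because the formula is undefined in those cases.

```python
import math


def correlation(x, y):
    """Hypothetical compact variant of find_corr_x_y() with input checks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi**2 for xi in x)
    sum_y2 = sum(yi**2 for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    if denominator == 0:
        # Happens when every value in x (or in y) is the same
        raise ValueError("correlation is undefined when either set is constant")

    return numerator / denominator


# y doubles x exactly: perfect positive correlation
print(correlation([1, 2, 3], [2, 4, 6]))   # 1.0

# y falls as x rises, in a straight line: perfect negative correlation
print(correlation([1, 2, 3], [6, 4, 2]))   # -1.0

# The lists used earlier in this section
print("{:.2f}".format(correlation([1, 2, 3], [4, 5, 12])))  # 0.92
```

Without the denominator check, a constant list such as [5, 5, 5] would cause a ZeroDivisionError.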

---

[Describing Data with Statistics](statistics.ipynb)

[Main Page](../README.md)