# Calculating correlation

The full equation for the correlation coefficient is

$$\rho_{X, Y} = \frac{\mathrm{cov}\left(X,Y\right)}{\sigma_{X}\sigma_{Y}}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$ respectively.  The covariance ($\mathrm{cov}$) is given by:

$$\mathrm{cov}\left(X,Y\right) = \mathrm{E}\left[\left(X-\mu_{X}\right)\left(Y-\mu_{Y}\right)\right]$$

where $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$, and $\mathrm{E}$ is the "expected value" (a fancy-pants term for the mean) of the term in the square brackets.  Note that $X$ and $Y$ are vectors (aka lists or arrays) of numbers, so the covariance is the *average of the products of the corresponding mean-subtracted X and Y values*, not the *product of the averages of X and Y*.
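To make that last distinction concrete, here is a small sketch on made-up numbers. The arrays `a` and `b` are purely illustrative and are not the assignment's data. It computes the average of the element-wise products of the deviations, and compares it with the product of the averaged deviations, which is a different quantity.

```python
import numpy as np

# Made-up illustrative data (not the assignment's X and Y)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

# Average of the element-wise products of the deviations from each mean
mean_of_products = np.mean((a - a.mean()) * (b - b.mean()))

# Product of the averaged deviations: each set of deviations averages
# to zero, so this carries no information about how a and b co-vary
product_of_means = np.mean(a - a.mean()) * np.mean(b - b.mean())

print(mean_of_products)  # 2.5
print(product_of_means)
```

One related detail: `np.std()` uses `ddof=0` by default, i.e. it divides by the number of points, which is consistent with treating $\mathrm{E}$ as a plain mean.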

# 1) Compute correlation the "hard" way

In the code cell below, write code to compute the correlation of `X` and `Y`:

1. Compute the means $\mu_X$ and $\mu_Y$ and save them in variables called `meanX` and `meanY`.
2. Compute the covariance $\mathrm{cov}\left(X,Y\right)$ and save it in a variable named `cov`.
3. Compute the correlation of $X$ and $Y$, and save it in a variable called `corr`.

You can (and should) test your code by changing the `generateData()` function to return different values.
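As a rough sketch of what such test data could look like, the generators below produce cases where the expected correlation is known in advance. The function names are hypothetical; to use one, you would copy its body into `generateData()` rather than rename anything.

```python
import numpy as np

def generatePerfectLine():
    # Hypothetical test case: y is an exact increasing linear function
    # of x, so the correlation should come out as 1
    x = np.arange(100, dtype=float)
    y = 3.0 * x + 2.0
    return (x, y)

def generateNegativeLine():
    # Hypothetical test case: exact decreasing line, correlation should be -1
    x = np.arange(100, dtype=float)
    y = -0.5 * x + 10.0
    return (x, y)

def generateIndependentNoise():
    # Hypothetical test case: x and y are drawn independently, so the
    # correlation should be close to (but not exactly) 0
    np.random.seed(2)
    x = np.random.randn(100)
    y = np.random.randn(100)
    return (x, y)
```

If your code reports a value outside the range $[-1, 1]$ for any of these, something is off in the covariance or standard deviation step.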

In [None]:
# You can change this function, but don't move it to another cell
import numpy as np

def generateData():
    np.random.seed(1) # Start the random number generator at a known state
    x = np.random.randn(100) # 100 random numbers, centered at 0
    y = x + np.random.randn(100) * 0.5
    
    return (x, y)

In [None]:
# Call the function above to generate some data.  If you want to change the data, change the
# function above, not this line.  Otherwise the autograder will fail, and we will both be sad.
X, Y = generateData()

# Your code here...

# Example showing the data and the correlation; you can remove this if you want
import matplotlib.pyplot as plt
plt.scatter(X, Y)
plt.title("Correlation: {}".format(corr))
plt.show()

# 2) Compute correlation the easy way

As you'd expect, NumPy has a built-in function to compute the correlation coefficient: `corrcoef(a, b)`, where `a` and `b` are two arrays of data.  (WARNING: there is also a NumPy function called `correlate()`, which computes the cross-correlation of two sequences. That is a totally different thing that also happens to be called correlation.)

Use the `corrcoef()` function to compute the correlation between `X` and `Y` and save the result into a variable called `corr2`.  Note that `corrcoef()` will return an array of values instead of a single number --- you'll need to figure out which value in the array is the one you care about, and extract only that value.
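The difference between the two functions is easiest to see from the shape of what they return. The sketch below uses made-up arrays `a` and `b` (not the assignment's data) and only inspects the outputs; picking out the value you need from the `corrcoef()` result is still up to you.

```python
import numpy as np

# Made-up illustrative data (not the assignment's X and Y)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])

# corrcoef() returns a matrix of pairwise correlation coefficients
# among the inputs, so two 1-D inputs give a 2x2 array
r = np.corrcoef(a, b)
print(r.shape)  # (2, 2)

# correlate() computes a sliding dot product (cross-correlation);
# in its default mode, equal-length inputs give a single value, sum(a * b)
c = np.correlate(a, b)
print(c)  # [20.]
```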

In [None]:
X, Y = generateData()

# Your code here...

print(corr2)

# 3) Your own data

Create a dataset (lists of points in variables called `x` and `y`) in which `x` and `y` are obviously related, but which has a statistical correlation of 0 (within floating-point rounding error).  There should be at least 10 unique data points (i.e., both `x` and `y` should have a length $\ge$ 10, and repeated points don't count toward the total).

*Challenge*: Make `x` and `y` contain at least 1000 unique points.

In [None]:
# Your code here...

plt.scatter(x, y)
# Add a title to the plot showing the correlation
# Your code here...
plt.show()