# Measuring Similarity

### How to Find Your Neighbor?

In neighborhood-based collaborative filtering, it is essential to be able to identify an individual's neighbors.  Let's look at a small dataset to understand how different metrics can be used to identify close neighbors.

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, kendalltau
import matplotlib.pyplot as plt
import tests as t
import helper as h

In [3]:
play_data = pd.DataFrame({'x1': [-3, -2, -1, 0, 1, 2, 3], 
               'x2': [9, 4, 1, 0, 1, 4, 9],
               'x3': [1, 2, 3, 4, 5, 6, 7],
               'x4': [2, 5, 15, 27, 28, 30, 31]
})

# select the columns explicitly so that their order is fixed
play_data = play_data[['x1', 'x2', 'x3', 'x4']]

### Measures of Similarity

The first metrics we will look at have similar characteristics:

1. Pearson's Correlation Coefficient
2. Spearman's Correlation Coefficient
3. Kendall's Tau

Let's take a look at each of these individually.

### Pearson's Correlation

First, **Pearson's correlation coefficient** measures the strength and direction of a **linear** relationship between two variables.

If we have two vectors **x** and **y**, we can compare their individual elements in the following way to calculate Pearson's correlation coefficient:

$$CORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}} $$

where $\bar{x}$ is the mean of **x**, and $\bar{y}$ is defined analogously for **y**:

$$\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}x_i$$

`1.` Write a function that takes in two vectors and returns the Pearson correlation coefficient.  You can then compare your answer to the built-in NumPy function by using the assert statements in the following cell.

In [11]:
def pearson_corr(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    corr - the pearson correlation coefficient for comparing x and y
    '''
    numerador = np.sum((x - np.mean(x))*(y - np.mean(y)))
    denominador = (np.sqrt(np.sum((x - np.mean(x))**2))*np.sqrt(np.sum((y - np.mean(y))**2)))
                           
    return numerador/denominador

In [30]:
x = [1,2,3,4]
y = [2,9,69,568]

print(pearson_corr(x, y))
print(np.corrcoef(x,y)[0][1])


0.833383160902722
0.8333831609027221


In [14]:
# This cell tests your function against the built-in NumPy function.
# np.isclose is used so that tiny floating point differences do not cause a failure.
assert np.isclose(pearson_corr(play_data['x1'], play_data['x2']), np.corrcoef(play_data['x1'], play_data['x2'])[0][1]), 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(pearson_corr(play_data['x1'], play_data['x2']))
assert np.isclose(pearson_corr(play_data['x1'], play_data['x3']), np.corrcoef(play_data['x1'], play_data['x3'])[0][1]), 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(np.corrcoef(play_data['x1'], play_data['x3'])[0][1], pearson_corr(play_data['x1'], play_data['x3']))
assert np.isclose(pearson_corr(play_data['x3'], play_data['x4']), np.corrcoef(play_data['x3'], play_data['x4'])[0][1]), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(np.corrcoef(play_data['x3'], play_data['x4'])[0][1], pearson_corr(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!")

If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!


`2.` Now that you have computed **Pearson's correlation coefficient**, use the dictionary below to identify the statements that are true about **this** measure.

In [28]:
a = True
b = False
c = "We can't be sure."


pearson_dct = {"If when x increases, y always increases, Pearson's correlation will be always be 1.": b,
               "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": b
}

t.sim_2_sol(pearson_dct)

That's right!  Pearson's correlation relates to a linear relationship.  The second and third cases are examples of perfect linear relationships, where we would receive values of 1 and -1.  Only having an increase or decrease that are directly related will not lead to a Pearson's correlation coefficient of 1 or -1.  You can see this by testing out your function using the examples above without using assert statements.
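
To make the point above concrete, here is a minimal sketch that compares two perfectly linear relationships with one that always increases but is not linear.  The vectors are made up for illustration, and `np.corrcoef` is used in place of `pearson_corr` so that the cell runs on its own.

```python
import numpy as np

# Illustrative data only (not part of the play_data dataframe)
x = np.array([1, 2, 3, 4, 5], dtype=float)

linear_up = 3 * x + 2      # y rises by exactly 3 for every unit increase in x
linear_down = -5 * x + 1   # y falls by exactly 5 for every unit increase in x
monotonic = x ** 3         # y always rises with x, but not at a constant rate

r_up = np.corrcoef(x, linear_up)[0, 1]
r_down = np.corrcoef(x, linear_down)[0, 1]
r_mono = np.corrcoef(x, monotonic)[0, 1]

print(r_up, r_down, r_mono)
```

The first two values should be 1 and -1 (up to floating point error), while the third should be positive but strictly less than 1, because the relationship is increasing without being linear.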


### Kendall's Tau

Kendall's Tau is quite similar to Spearman's correlation coefficient.  Both are nonparametric measures of a relationship.  Specifically, both Spearman's and Kendall's coefficients are calculated from the ranks of the data rather than from the raw values.

Similar to both of the previous measures, Kendall's Tau is always between -1 and 1, where values near -1 suggest a strong, negative relationship between two variables and values near 1 suggest a strong, positive relationship between two variables.

Though Spearman's and Kendall's measures are very similar, there are statistical advantages to choosing Kendall's measure, in that Kendall's Tau has smaller variability when using larger sample sizes.  However, Spearman's measure is more computationally efficient: a direct pairwise computation of Kendall's Tau is O(n^2), whereas Spearman's correlation is O(n log n). You can find more on this topic in [this thread](https://www.researchgate.net/post/Does_Spearmans_rho_have_any_advantage_over_Kendalls_tau).
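
Because both measures work on ranks, they react to the ordering of the values and not to their spacing.  The sketch below is only an illustration: it reuses the small `x` and `y` vectors from the Pearson example earlier in the notebook and calls the `spearmanr` and `kendalltau` functions from `scipy.stats` alongside `np.corrcoef`.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# y always increases with x, but far from linearly
x = [1, 2, 3, 4]
y = [2, 9, 69, 568]

pearson = np.corrcoef(x, y)[0, 1]

# Both scipy functions return (statistic, p-value); only the statistic is used here
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)

print(pearson, rho, tau)
```

Since the ranks of `x` and `y` agree exactly, both rank-based measures should come out as 1, while Pearson's coefficient stays below 1 for the same data.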

Let's take a closer look at exactly how this measure is calculated.  Again, we want to map our data to ranks:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Then we calculate Kendall's Tau as:

$$TAU(\textbf{x}, \textbf{y}) = \frac{2}{n(n -1)}\sum_{i < j}sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

Where $sgn$ takes the sign associated with the difference in the ranked values.  An alternative way to write 

$$sgn(x^r_i - x^r_j)$$ 

is in the following way:

$$
 \begin{cases} 
      -1  & x^r_i < x^r_j \\
      0 & x^r_i = x^r_j \\
      1 & x^r_i > x^r_j 
   \end{cases}
$$

Therefore the possible results of 

$$sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

are only 1, -1, or 0.  Summing these over all pairs and dividing by the number of pairs, $n(n-1)/2$, gives the proportion of pairs that **x** and **y** order in the same way minus the proportion of pairs they order in opposite ways.
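
As a small worked example, assume the following ranked vectors (made up purely for illustration, with $n = 3$):

$$\textbf{x}^{r} = (1, 2, 3), \quad \textbf{y}^{r} = (1, 3, 2)$$

There are three pairs with $i < j$, and each contributes one product of signs:

$$
\begin{aligned}
(i, j) = (1, 2): \quad & sgn(1 - 2)\,sgn(1 - 3) = (-1)(-1) = 1 \\
(i, j) = (1, 3): \quad & sgn(1 - 3)\,sgn(1 - 2) = (-1)(-1) = 1 \\
(i, j) = (2, 3): \quad & sgn(2 - 3)\,sgn(3 - 2) = (-1)(1) = -1
\end{aligned}
$$

Plugging these into the formula gives

$$TAU(\textbf{x}, \textbf{y}) = \frac{2}{3(3 - 1)}(1 + 1 - 1) = \frac{1}{3}$$

Two of the three pairs are ordered the same way in both vectors and one is ordered the opposite way, so the result is positive but well below 1.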

`5.` Write a function that takes in two vectors and returns Kendall's Tau.  You can then compare your answer to the built-in function in `scipy.stats` by using the assert statements in the following cell.