<font color='blue'>The Spearman Rank Correlation Coefficient allows us to determine whether or not two data series move together monotonically; that is, whether when one increases (decreases) the other also tends to increase (decrease).

<font color='blue'>This is more general than a linear relationship</font>; for instance, $y = e^x$ is a monotonic function, but not a linear one.

Therefore,<font color='blue'> in computing it we compare not the raw data but the ranks of the data.

<font color='blue'>This is useful when your data sets are measured in different units or on different scales, so that a linear relationship cannot be assumed.

<font color='blue'>It's also suitable for data sets which do not satisfy the assumptions that other tests require, such as the observations being normally distributed, as would be necessary for a t-test.
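
As a quick illustration of the monotonic-versus-linear distinction, the sketch below compares the two coefficients on $y = e^x$. The sample grid is an arbitrary choice made for illustration, and `scipy.stats.spearmanr` is used as a shortcut for the rank-based calculation.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data (an assumption, not taken from the lecture): an evenly spaced grid
x = np.linspace(0, 10, 50)
y = np.exp(x)  # strictly increasing in x, but far from linear

# Regular (Pearson) correlation works on the raw values
pearson = np.corrcoef(x, y)[0, 1]
# Spearman correlation works on the ranks of the values
spearman, _ = stats.spearmanr(x, y)

print('Regular correlation:  {:.3f}'.format(pearson))
print('Spearman correlation: {:.3f}'.format(spearman))
```

Because the ordering of `y` matches the ordering of `x` exactly, the Spearman coefficient is 1, while the regular correlation coefficient falls well short of 1.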

In [None]:
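import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Example of ranking data
data = [10, 9, 5, 7, 5]
print('Raw data: ' + str(data))
print('Ranking: ' + str(stats.rankdata(data, method='average').tolist()))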

The argument `method='average'` indicates that when we have a tie, we average the ranks that the numbers would occupy. Here the two 5s would occupy ranks 1 and 2, so each is assigned rank 1.5.
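
To see what the choice of tie-handling rule changes, the following sketch ranks the same list with each `method` value that `scipy.stats.rankdata` accepts. The `rankings` dictionary is only a container introduced for this illustration.

```python
import scipy.stats as stats

data = [10, 9, 5, 7, 5]  # same list as above; the two 5s tie for ranks 1 and 2

# Rank the data once per tie-handling rule and collect the results
rankings = {}
for method in ['average', 'min', 'max', 'dense', 'ordinal']:
    rankings[method] = [float(r) for r in stats.rankdata(data, method=method)]
    print('{:>8}: {}'.format(method, rankings[method]))
```

Only the ranks of the tied values differ between `'average'`, `'min'`, `'max'` and `'ordinal'`; `'dense'` additionally leaves no gaps after a tie, so it shifts the ranks of the larger values as well.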

<font color='blue'>The intuition is now that instead of looking at the relationship between the two variables, we look at the relationship between their ranks. This is robust to outliers and to the scale of the data.

<font color='blue'>To compute the Spearman rank correlation for two data sets $X$ and $Y$, each of size $n$, we use the formula
$r_S = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)}$

where $d_i$ is the difference between the ranks of the $i$th pair of observations, $\operatorname{rank}(X_i) - \operatorname{rank}(Y_i)$. This formula is exact only when there are no tied ranks.
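
A small worked example may help here. The sketch below applies the formula to five made-up pairs without ties (an assumption chosen so that the formula is exact) and compares the result with the correlation of the rank series and with `scipy.stats.spearmanr`.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data with no ties (made up for this example)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X)

# d_i is the difference between the rank of X_i and the rank of Y_i
d = stats.rankdata(X) - stats.rankdata(Y)
r_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Equivalent view: the regular correlation of the two rank series
r_ranks = np.corrcoef(stats.rankdata(X), stats.rankdata(Y))[0, 1]

# Library result for comparison
r_scipy, _ = stats.spearmanr(X, Y)

print('Formula:              {:.4f}'.format(r_formula))
print('Correlation of ranks: {:.4f}'.format(r_ranks))
print('scipy spearmanr:      {:.4f}'.format(r_scipy))
```

Here the squared rank differences sum to 4, so the formula gives $1 - 24/120 = 0.8$, and the other two calculations agree. When the data contain ties, the formula becomes an approximation, while the correlation of the ranks is still well defined.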

<font color='blue'>The result will always be between $-1$ and $1$. A positive value indicates a positive relationship between the variables, while a negative value indicates an inverse relationship.

<font color='blue'>A value of 0 implies the absence of any monotonic relationship. This does not mean that there is no relationship; for instance, if $Y$ is equal to $X$ with a delay of 2, they are related simply and precisely, but their $r_S$ can be close to zero.
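
The sign convention and the zero case can also be checked on constructed data. The sketch below uses a symmetric integer grid (an illustrative assumption) and three exact relationships: one increasing, one decreasing, and a U-shape that is exact but not monotonic.

```python
import numpy as np
import scipy.stats as stats

# Illustrative grid, symmetric around zero
x = np.arange(-30, 31, dtype=float)

r_increasing, _ = stats.spearmanr(x, x ** 3)   # monotonically increasing
r_decreasing, _ = stats.spearmanr(x, -x ** 3)  # monotonically decreasing
r_u_shape, _ = stats.spearmanr(x, x ** 2)      # exact relationship, but not monotonic

print('Increasing: {:.3f}'.format(r_increasing))
print('Decreasing: {:.3f}'.format(r_decreasing))
print('U-shape:    {:.3f}'.format(r_u_shape))
```

The U-shape yields a coefficient of zero even though $Y$ is fully determined by $X$, which is why a value near 0 should be read as "no monotonic relationship" rather than "no relationship".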

Because $e^X$ produces many values that are far away from the rest, we can think of this as modeling 'outliers' in our data.

<font color='blue'>Spearman rank compresses the outliers and does better at measuring the strength of the relationship. Regular correlation is confused by the outliers and on average will measure less of a relationship than is actually there.
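
The effect of a single outlier can be isolated with a deliberately simple example. The data below are made up for illustration: a perfect straight line whose last point is replaced by a much larger value.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data (an assumption): a perfect line with one extreme final value
x = np.arange(1.0, 11.0)
y = x.copy()
y[-1] = 1000.0  # the 'outlier'; it is still the largest value, so the ordering is unchanged

pearson = np.corrcoef(x, y)[0, 1]
spearman, _ = stats.spearmanr(x, y)

print('Regular correlation:  {:.3f}'.format(pearson))
print('Spearman correlation: {:.3f}'.format(spearman))
```

Ranking maps the outlier to rank 10 no matter how large it is, so the Spearman coefficient stays at 1, whereas the regular correlation coefficient drops noticeably.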

In [None]:
## Let's see an example of this
n = 100

def compare_correlation_and_spearman_rank(n, noise):
    X = np.random.poisson(size=n)
    Y = np.exp(X) + noise * np.random.normal(size=n)

    # Rank both series, averaging the ranks of tied values
    Xrank = stats.rankdata(X, method='average')
    Yrank = stats.rankdata(Y, method='average')

    diffs = Xrank - Yrank # order doesn't matter since we'll be squaring these values
    r_s = 1 - 6*sum(diffs*diffs)/(n*(n**2 - 1))
    c_c = np.corrcoef(X, Y)[0,1]
    
    return r_s, c_c

experiments = 1000
spearman_dist = np.empty(experiments)
correlation_dist = np.empty(experiments)
for i in range(experiments):
    r_s, c_c = compare_correlation_and_spearman_rank(n, 1.0)
    spearman_dist[i] = r_s
    correlation_dist[i] = c_c
    
print('Spearman Rank Coefficient: ' + str(np.mean(spearman_dist)))
# Compare to the regular correlation coefficient
print('Correlation coefficient: ' + str(np.mean(correlation_dist)))

In [None]:
plt.hist(spearman_dist, bins=50, alpha=0.5)
plt.hist(correlation_dist, bins=50, alpha=0.5)
plt.legend(['Spearman Rank', 'Regular Correlation'])
plt.xlabel('Correlation Coefficient')
plt.ylabel('Frequency');

In [None]:
n = 100
noises = np.linspace(0, 3, 30)
experiments = 100
spearman = np.empty(len(noises))
correlation = np.empty(len(noises))

for i in range(len(noises)):
    # Run many experiments for each noise setting
    rank_coef = 0.0
    corr_coef = 0.0
    noise = noises[i]
    for j in range(experiments):
        r_s, c_c = compare_correlation_and_spearman_rank(n, noise)
        rank_coef += r_s
        corr_coef += c_c
    spearman[i] = rank_coef/experiments
    correlation[i] = corr_coef/experiments
    
plt.scatter(noises, spearman, color='r')
plt.scatter(noises, correlation)
plt.legend(['Spearman Rank', 'Regular Correlation'])
plt.xlabel('Amount of Noise')
plt.ylabel('Average Correlation Coefficient')

<font color='blue'>We can see that the Spearman rank correlation copes with the non-linear relationship much better at most levels of noise. Interestingly, at very high levels, it seems to do worse than regular correlation.

In [None]:
n = 100

X = np.random.rand(n)
Xrank = stats.rankdata(X, method='average')
# Build Y as X delayed by 2: pad the front with two 1s and drop the last two values of X
Yrank = stats.rankdata([1,1] + list(X[:(n-2)]), method='average')

diffs = Xrank - Yrank # order doesn't matter since we'll be squaring these values
r_s = 1 - 6*sum(diffs*diffs)/(n*(n**2 - 1))
print(r_s)

It is important when using either regular or Spearman correlation to <font color='blue'>check for lagged relationships by offsetting your data and testing for different offset values.
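
One way to carry out such a check is sketched below. The helper `spearman_by_lag` is a hypothetical name introduced for this illustration, and the data are random numbers with $Y$ built as $X$ delayed by 2, padded in the same way as in the example above.

```python
import numpy as np
import scipy.stats as stats

def spearman_by_lag(x, y, max_lag):
    """Spearman coefficient between x[t] and y[t + lag] for lag = 0..max_lag."""
    results = {}
    for lag in range(max_lag + 1):
        # Drop the points that have no partner once y is shifted by `lag`
        x_part = x[:len(x) - lag]
        y_part = y[lag:]
        results[lag], _ = stats.spearmanr(x_part, y_part)
    return results

np.random.seed(0)  # illustrative seed
X = np.random.rand(100)
Y = np.concatenate(([1.0, 1.0], X[:-2]))  # Y is X delayed by 2

by_lag = spearman_by_lag(X, Y, max_lag=5)
for lag in sorted(by_lag):
    print('lag {}: {:.3f}'.format(lag, by_lag[lag]))

# The lag with the largest absolute coefficient
best_lag = max(by_lag, key=lambda k: abs(by_lag[k]))
print('Best lag: ' + str(best_lag))
```

At an offset of 2 the two series line up exactly and the coefficient is 1. Keep in mind that scanning many offsets raises the chance of finding a large coefficient by luck, so a lag found this way is best confirmed on separate data.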

We can also use the `spearmanr` function in the `scipy.stats` library:

In [None]:
# Generate two random data sets
np.random.seed(161)
X = np.random.rand(10)
Y = np.random.rand(10)

r_s = stats.spearmanr(X, Y)
print('Spearman Rank Coefficient: ' + str(r_s[0]))
print('p-value: ' + str(r_s[1]))

<font color='blue'>`spearmanr` also computes the p-value for this coefficient and sample size for us. The p-value tests the null hypothesis that the two data sets have no rank correlation.
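
To build some intuition for what this p-value measures, the sketch below estimates one with a permutation test. This is not how `spearmanr` arrives at its p-value; it is a swapped-in technique that illustrates the same null hypothesis. The data, the seed and the number of shuffles are all illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(0)  # illustrative seed
n = 30
X = rng.rand(n)
Y = X + 0.1 * rng.randn(n)      # synthetic data with a strong monotonic link

r_obs, p_scipy = stats.spearmanr(X, Y)

# Permutation test: shuffling Y destroys any relationship with X, so the
# shuffled coefficients show what chance alone produces at this sample size
n_perm = 1000
count = 0
for _ in range(n_perm):
    r_perm, _ = stats.spearmanr(X, rng.permutation(Y))
    if abs(r_perm) >= abs(r_obs):
        count += 1

# Add-one correction so the estimate is never exactly zero
p_perm = (count + 1.0) / (n_perm + 1.0)

print('Observed coefficient: {:.3f}'.format(r_obs))
print('scipy p-value:        {:.3g}'.format(p_scipy))
print('Permutation p-value:  {:.3g}'.format(p_perm))
```

A small p-value from either approach says that a coefficient this far from zero would rarely arise from unrelated series of this length.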

1. Download the CSV file from this link: https://gist.github.com/dursk/82eee65b7d1056b469ab
2. Upload it to the 'data' folder in your research account

In [None]:
mutual_fund_data = local_csv('mutual_fund_data.csv')
expense = mutual_fund_data['Annual Expense Ratio'].values
sharpe = mutual_fund_data['Three Year Sharpe Ratio'].values

plt.scatter(expense, sharpe)
plt.xlabel('Expense Ratio')
plt.ylabel('Sharpe Ratio')

r_S = stats.spearmanr(expense, sharpe)
print('Spearman Rank Coefficient: ' + str(r_S[0]))
print('p-value: ' + str(r_S[1]))

In [None]:
symbol_list = ['A', 'AA', 'AAC', 'AAL', 'AAMC', 'AAME', 'AAN', 'AAOI', 'AAON', 'AAP', 'AAPL', 'AAT', 'AAU', 'AAV', 'AAVL', 'AAWW', 'AB', 'ABAC', 'ABAX', 'ABB', 'ABBV', 'ABC', 'ABCB', 'ABCD', 'ABCO', 'ABCW', 'ABDC', 'ABEV', 'ABG', 'ABGB']

# Get the returns over the lookback window
start = '2014-12-01'
end = '2015-01-01'
historical_returns = get_pricing(symbol_list, fields='price', start_date=start, end_date=end).pct_change()[1:]

# Compute our stock score
# Mean daily return per symbol over the lookback window
scores = historical_returns.mean()
print('Our Scores\n')
print(scores)
print('\n')

start = '2015-01-01'
end = '2015-02-01'
walk_forward_returns = get_pricing(symbol_list, fields='price', start_date=start, end_date=end).pct_change()[1:]
# Mean daily return per symbol over the following month
walk_forward_returns = walk_forward_returns.mean()
print('The Walk Forward Returns\n')
print(walk_forward_returns)
print('\n')

plt.scatter(scores, walk_forward_returns)
plt.xlabel('Scores')
plt.ylabel('Walk Forward Returns')

r_s = stats.spearmanr(scores, walk_forward_returns)
print('Spearman Rank Coefficient: ' + str(r_s[0]))
print('p-value: ' + str(r_s[1]))