Multiple <font color='blue'>comparisons bias</font> is a pervasive problem in statistics, data science, and forecasting in general. The short explanation is that the more tests you run, <font color='blue'>the more likely you are to get the outcome you want or expect purely by chance. 

<font color='blue'>A particularly common example of this is looking for relationships in large data sets made up of many independent series or variables. In this case you run a test each time you evaluate whether a relationship exists between a set of variables.
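
To put a rough number on this, the sketch below computes the chance of seeing at least one false positive across several tests. It assumes the tests are independent and that no real relationship exists anywhere in the data; the function name `prob_at_least_one_false_positive` is made up for this illustration.

```python
def prob_at_least_one_false_positive(cutoff, num_tests):
    # Assumption: tests are independent and there is no real signal.
    # Each test passes by chance with probability `cutoff`, so every
    # test fails with probability (1 - cutoff) ** num_tests.
    return 1.0 - (1.0 - cutoff) ** num_tests


for m in [1, 10, 50, 190]:
    chance = prob_at_least_one_false_positive(0.05, m)
    print('%4d tests: chance of at least one false positive = %.3f' % (m, chance))
```

Under these assumptions the chance climbs quickly with the number of tests, which is why scanning many pairs of series almost always turns up something that looks significant.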

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

<font color='blue'>Spearman Rank Correlation

<font color='blue'>It's a variation of correlation that is computed on the ranks of the data rather than on the raw values. This can help with weird distributions or outliers that would confuse other measures.

<font color='blue'>A coefficient further from zero means a stronger estimated relationship.
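
As a minimal sketch of what "takes into account the ranks" means, the block below checks that the Spearman coefficient matches an ordinary Pearson correlation computed on the ranks, and then compares how much each measure moves when a single extreme outlier is added. The seed, sample size, and outlier values are arbitrary choices made for this illustration.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)  # arbitrary seed so the sketch is repeatable
x = np.random.normal(0, 1, 100)
y = x + np.random.normal(0, 1, 100)

# Spearman is the Pearson correlation of the ranks of the data.
rank_corr = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
spearman_corr = stats.spearmanr(x, y)[0]
print('Pearson on ranks: %.4f' % rank_corr)
print('Spearman:         %.4f' % spearman_corr)

# Add one extreme point that contradicts the trend (made-up values).
x_out = np.append(x, 50.0)
y_out = np.append(y, -50.0)

pearson_shift = abs(stats.pearsonr(x_out, y_out)[0] - stats.pearsonr(x, y)[0])
spearman_shift = abs(stats.spearmanr(x_out, y_out)[0] - spearman_corr)
print('Change in Pearson from one outlier:  %.4f' % pearson_shift)
print('Change in Spearman from one outlier: %.4f' % spearman_shift)
```

The outlier only occupies one rank, so the rank-based measure moves far less than the raw-value correlation does.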

In [None]:
X = pd.Series(np.random.normal(0, 1, 100))
Y = X

r_s = stats.spearmanr(Y, X)
print('Spearman Rank Coefficient: %s' % r_s[0])
print('p-value: %s' % r_s[1])

In [None]:
X = pd.Series(np.random.normal(0, 1, 100))
Y = X + np.random.normal(0, 1, 100)

r_s = stats.spearmanr(Y, X)
print('Spearman Rank Coefficient: %s' % r_s[0])
print('p-value: %s' % r_s[1])

<font color='blue'>p-values must be treated as binary.

<font color='blue'>A common mistake is that p-values are treated as more or less significant. This is bad practice as it allows for what's known as [p-hacking](https://en.wikipedia.org/wiki/Data_dredging) and will result in more false positives than you expect. Effectively, you will be too likely to convince yourself that relationships exist in your data.

To treat p-values as binary,<font color='blue'> a cutoff must be set in advance. Then the p-value must be compared with the cutoff and treated as significant/not significant.

We can refer to the cutoff as our <font color='blue'>significance level because a lower cutoff means that results which pass it are significant at a higher level of confidence. 

So if you have a cutoff of 0.05, then even on random data 5% of tests will pass based on chance. A cutoff of 0.01 reduces this to 1%, which is a more stringent test.

In [None]:
# Setting a cutoff of 5% means that there is a 5% chance
# of us getting a significant p-value given no relationship
# in our data (false positive).
# NOTE: This is only true if the test's assumptions have been
# satisfied and the test is therefore properly calibrated.
# All tests have different assumptions.
cutoff = 0.05

X = pd.Series(np.random.normal(0, 1, 100))
Y = X + np.random.normal(0, 1, 100)

r_s = stats.spearmanr(Y, X)
print('Spearman Rank Coefficient: %s' % r_s[0])
if r_s[1] < cutoff:
    print('There is significant evidence of a relationship.')
else:
    print('There is not significant evidence of a relationship.')

In [None]:
df = pd.DataFrame()

In [None]:
N = 20
T = 100

for i in range(N):
    X = np.random.normal(0, 1, T)
    X = pd.Series(X)
    name = 'X%s' % i
    df[name] = X

In [None]:
df.head()

In [None]:
cutoff = 0.05

significant_pairs = []

for i in range(N):
    for j in range(i+1, N):
        Xi = df.iloc[:, i]
        Xj = df.iloc[:, j]
        
        results = stats.spearmanr(Xi, Xj)
        
        pvalue = results[1]
        
        if pvalue < cutoff:
            significant_pairs.append((i, j))

In [None]:
def do_experiment(N, T, cutoff=0.05):
    df = pd.DataFrame()

    # Make random data
    for i in range(N):
        X = np.random.normal(0, 1, T)
        X = pd.Series(X)
        name = 'X%s' % i
        df[name] = X

    significant_pairs = []

    # Look for relationships
    for i in range(N):
        for j in range(i+1, N):
            Xi = df.iloc[:, i]
            Xj = df.iloc[:, j]

            results = stats.spearmanr(Xi, Xj)

            pvalue = results[1]

            if pvalue < cutoff:
                significant_pairs.append((i, j))
    
    return significant_pairs


num_experiments = 100

results = np.zeros((num_experiments,))

for i in range(num_experiments):
    # Run a single experiment
    result = do_experiment(20, 100, cutoff=0.05)
    
    # Count how many pairs
    n = len(result)
    
    # Add to array
    results[i] = n

In [None]:
np.mean(results)

<font color='blue'>What's happening here is that p-values should be uniformly distributed, given no signal in the underlying data. 

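Basically, they carry no information whatsoever and are as likely to be near 0.01 as near 0.99. 

<font color='blue'>Because they're popping out randomly, you should expect the fraction of p-values falling under any threshold you choose to be roughly equal to that threshold. The lower the threshold, the fewer will pass your test. With 20 series there are 20 * 19 / 2 = 190 pairs to test, so a cutoff of 0.05 should produce about 0.05 * 190 = 9.5 false positives per experiment on average.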
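
One way to check this claim directly is to generate many pairs of unrelated series and count how often the p-value lands under a few thresholds. This is only a sketch: the number of tests, the series length, and the seed are arbitrary, and the observed fractions will wobble around the thresholds from run to run.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(1)  # arbitrary seed so the sketch is repeatable
num_tests = 2000
T = 100

# Each test compares two series that are unrelated by construction.
pvalues = np.zeros(num_tests)
for k in range(num_tests):
    a = np.random.normal(0, 1, T)
    b = np.random.normal(0, 1, T)
    pvalues[k] = stats.spearmanr(a, b)[1]

# If the p-values are roughly uniform, the fraction under each
# threshold should be close to the threshold itself.
for threshold in [0.01, 0.05, 0.25, 0.50]:
    fraction = np.mean(pvalues < threshold)
    print('threshold %.2f: fraction of p-values below = %.3f' % (threshold, fraction))
```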



In [None]:
def get_pvalues_from_experiment(N, T):
    df = pd.DataFrame()

    # Make random data
    for i in range(N):
        X = np.random.normal(0, 1, T)
        X = pd.Series(X)
        name = 'X%s' % i
        df[name] = X

    pvalues = []

    # Look for relationships
    for i in range(N):
        for j in range(i+1, N):
            Xi = df.iloc[:, i]
            Xj = df.iloc[:, j]

            results = stats.spearmanr(Xi, Xj)

            pvalue = results[1]

            pvalues.append(pvalue)
    
    return pvalues

In [None]:
pvalues = get_pvalues_from_experiment(10, 100)
plt.hist(pvalues)
plt.ylabel('Frequency')
plt.title('Observed p-value');

In [None]:
pvalues = get_pvalues_from_experiment(50, 100)
plt.hist(pvalues)
plt.ylabel('Frequency')
plt.title('Observed p-value');

In [None]:
pvalues = get_pvalues_from_experiment(50, 100)
plt.vlines(0.01, 0, 150, colors='r', linestyle='--', label='0.01 Cutoff')
plt.vlines(0.05, 0, 150, colors='r', label='0.05 Cutoff')
plt.hist(pvalues, label='P-Value Distribution')
plt.legend()
plt.ylabel('Frequency')
plt.title('Observed p-value');

We can see that <font color='blue'>with a lower cutoff we should expect to get fewer false positives.

In [None]:
num_experiments = 100

results = np.zeros((num_experiments,))

for i in range(num_experiments):
    # Run a single experiment
    result = do_experiment(20, 100, cutoff=0.01)
    
    # Count how many pairs
    n = len(result)
    
    # Add to array
    results[i] = n

In [None]:
np.mean(results)

As with any adjustment of p-value cutoff, we have a tradeoff. <font color='blue'>A lower cutoff decreases the rate of false positives, but also decreases the chance we find a real relationship (true positive).
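
The other side of the tradeoff can be sketched by simulating data that does contain a real relationship and counting how often each cutoff detects it. The strength of the relationship (a coefficient of 0.25), the series length, and the number of trials are assumptions chosen for illustration only.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(2)  # arbitrary seed so the sketch is repeatable
num_trials = 500
T = 100

pvalues = np.zeros(num_trials)
for k in range(num_trials):
    x = np.random.normal(0, 1, T)
    # Weak but real relationship; 0.25 is an assumed strength.
    y = 0.25 * x + np.random.normal(0, 1, T)
    pvalues[k] = stats.spearmanr(x, y)[1]

# Fraction of trials in which the real relationship is detected.
detection_rate = {}
for cutoff in [0.05, 0.01]:
    detection_rate[cutoff] = np.mean(pvalues < cutoff)
    print('cutoff %.2f: relationship detected in %.1f%% of trials'
          % (cutoff, 100 * detection_rate[cutoff]))
```

Every trial here contains a true relationship, so any trial that fails the cutoff is a missed true positive. The stricter cutoff misses more of them.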

<font color='blue'>So you can't just decrease your cutoff to solve this problem. See https://en.wikipedia.org/wiki/Sensitivity_and_specificity for more on this tradeoff.


<font color='blue'>You can't really eliminate multiple comparisons bias, but you can reduce how much it impacts you.

Run fewer tests. <font color='blue'>Rather than just sweeping around hoping you hit an interesting signal, use your expert knowledge of the system to develop a great hypothesis and test that.

This process of exploring the data, coming up with a hypothesis, then gathering more data and testing the hypothesis on the new data is considered the gold standard in statistical and scientific research.

<font color='blue'>It's crucial that the data set on which you develop your hypothesis is not the one on which you test it. 

<font color='blue'>Moving to new data and testing there not only means you run just one test, but also acts as an 'unbiased estimator' of whether your hypothesis is true.

<font color='blue'>If you must run many tests, try to correct your p-values. This means applying a correction factor to your desired cutoff to obtain the cutoff actually used when deciding whether p-values are significant.

<font color='blue'>A common and very conservative correction is the Bonferroni correction.

The concept behind Bonferroni is quite simple. It just says that<font color='blue'> if we run $m$ tests, and we have a significance level/cutoff of $a$, then we should use $a/m$ as our new cutoff when determining significance.
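
A minimal sketch of the correction follows. It shows two equivalent ways to apply it: shrinking the cutoff, or inflating each p-value and keeping the original cutoff. The function names and the list of p-values are made up for this illustration.

```python
def bonferroni_cutoff(desired_level, num_tests):
    # Divide the desired significance level by the number of tests.
    return desired_level / float(num_tests)


def bonferroni_adjust(pvalues):
    # Equivalent view: multiply each p-value by the number of tests
    # (capped at 1) and compare against the original cutoff.
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


desired_level = 0.05
pvalues = [0.0001, 0.004, 0.03, 0.2]  # made-up p-values for illustration

new_cutoff = bonferroni_cutoff(desired_level, len(pvalues))
by_cutoff = [p < new_cutoff for p in pvalues]
by_adjustment = [p < desired_level for p in bonferroni_adjust(pvalues)]

print('Corrected cutoff: %s' % new_cutoff)
print('Significant using corrected cutoff:    %s' % by_cutoff)
print('Significant using adjusted p-values:   %s' % by_adjustment)
```

Note that 0.03 would have passed an uncorrected cutoff of 0.05 but fails once the four tests are accounted for.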

In [None]:
num_experiments = 100
results = np.zeros((num_experiments,))
N = 20
T = 100
desired_level = 0.05
num_tests = N * (N - 1) / 2
new_cutoff = desired_level / num_tests
for i in range(num_experiments):
    # Run a single experiment
    result = do_experiment(N, T, cutoff=new_cutoff)
    # Count how many pairs
    n = len(result)
    # Add to array
    results[i] = n

In [None]:
np.mean(results)

<font color='blue'>Because Bonferroni is so stringent, you can often end up passing over real relationships. 

There is a good example in the following article: https://en.wikipedia.org/wiki/Multiple_comparisons_problem

Effectively, <font color='blue'>it guards against the worst case across all the tests you are running, and doesn't take into account any structure in your data, such as correlation between the tests.

You may be able to design a more finely tuned correction factor, but this is adding a layer of complexity and therefore a point of failure to your research. 

<font color='blue'>Because of the over-zealousness of Bonferroni, often running fewer tests is the better option.

<font color='blue'>If you must run many tests, reserve multiple sets of data so your candidate signals can undergo an out-of-sample round of testing.
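
The idea of an out-of-sample round can be sketched as follows: search for candidate pairs on one half of the data, then retest only those candidates on the half that was held back. The data here is pure noise by construction, and the even split, the sizes, and the seed are assumptions made for this illustration.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(3)  # arbitrary seed so the sketch is repeatable
N, T, cutoff = 20, 200, 0.05

# Pure noise: no real relationships exist between any columns.
data = np.random.normal(0, 1, (T, N))
in_sample = data[:T // 2]
out_sample = data[T // 2:]

# Round 1: sweep every pair on the in-sample half.
candidates = []
for i in range(N):
    for j in range(i + 1, N):
        pvalue = stats.spearmanr(in_sample[:, i], in_sample[:, j])[1]
        if pvalue < cutoff:
            candidates.append((i, j))

# Round 2: retest only the candidates on data they have not seen.
confirmed = []
for (i, j) in candidates:
    pvalue = stats.spearmanr(out_sample[:, i], out_sample[:, j])[1]
    if pvalue < cutoff:
        confirmed.append((i, j))

print('Candidate pairs found in-sample:  %d' % len(candidates))
print('Pairs confirmed out-of-sample:    %d' % len(confirmed))
```

The second round still runs one test per candidate, so a few false positives can survive by chance. It should nonetheless filter out most of what the initial sweep turned up.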

p-hacking is just intentional or accidental abuse of multiple comparisons bias. It is surprisingly common, even in academic literature. 

The excellent statistical news website FiveThirtyEight has a great visualization here: https://fivethirtyeight.com/features/science-isnt-broken/

Wikipedia's article is also informative: https://en.wikipedia.org/wiki/Data_dredging

<font color='blue'>By running many tests or experiments and then focusing only on the ones that worked, you can present false positives as real results.

<font color='blue'>Keep in mind that this also applies to running many different models, many different types of experiments, or the same analysis on many different data sets.

As mentioned above, out-of-sample testing is one of the best ways to reduce your risk. You should always use it, no matter the circumstances.

Often one of the ways that false positives make it through your workflow is a lack of an out-of-sample test at the end.

Sources
 -  https://en.wikipedia.org/wiki/Multiple_comparisons_problem
 -  https://en.wikipedia.org/wiki/Sensitivity_and_specificity
 -  https://en.wikipedia.org/wiki/Bonferroni_correction
 -  https://fivethirtyeight.com/features/science-isnt-broken/