# Lecture 4b

## The "Multiple Comparisons" problem: testing each predictor singly or jointly?

This example demonstrates how testing predictors one at a time, when there are many of them, can lead to an incorrect conclusion. Consider the linear model:

$Y=\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$.

Below, we simulate $S$ datasets of $n$ observations each. Every dataset uses the same predictor matrix $X$, which contains a large number of predictors, $P$ (the $p$ in the model above).

Importantly, **no predictor has any effect on the response**: every true slope is zero. We then test this null hypothesis both singly (one t-test per coefficient) and jointly (one F-test for all coefficients).

In [43]:
S <- 20 #number of samples
P <- 100 #number of predictors
n <- 1000 #number of observations per sample

b0 <- 0;
b=matrix(0,P,1); # Px1 vector of parameters, set to 0. This means there is no effect of the predictors on the responses.

x <- replicate(P,rnorm(n)) # generate P predictor variables. These are just chosen randomly, but the same predictors are used for each sample.

e <- replicate(S,rnorm(n)) #draw S noise samples

pvals <- matrix(NA,P+1,S) #initialize matrix to store the p-value of each coefficient (intercept + P slopes) in each sample
fs <- matrix(NA,1,S) #initialize matrix to store the F-statistic of each sample

for (s in 1:S){
    y <- b0 + x%*%b + e[,s] #simulate response data using the "true" model and the noise from sample s.
             # %*% does matrix multiplication 
    
    lm.fit <- lm(y~x) # estimate the linear model on sample s.
    pvals[,s] <- summary(lm.fit)$coefficients[,4] #store the p-values (4th column of the coefficient table)
    fs[,s] <- summary(lm.fit)$fstatistic[1] #store the F-statistic
}

When $p$ is large, the individual t-tests do not account for the fact that you are making **many** of them. Each test at the 5% level wrongly rejects a true null 5% of the time, so with 100 tests you should expect about 5 predictors with a p-value < 0.05 even though none has any effect. This is known as the "multiple comparisons" problem.
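To put rough numbers on this, the sketch below computes the expected count of false rejections and the chance of seeing at least one. It rests on the simplifying assumption that the tests are independent, which holds only approximately for t-tests within a single regression. The names `alpha`, `m`, `expected_false_pos`, and `p_at_least_one` are illustrative and do not appear in the simulation.

```r
alpha <- 0.05   # level of each individual t-test
m <- 100        # number of true-null coefficients being tested

# Expected number of false rejections among m true nulls
expected_false_pos <- m * alpha

# Probability of at least one false rejection, ASSUMING independent tests
p_at_least_one <- 1 - (1 - alpha)^m

print(expected_false_pos)
print(round(p_at_least_one, 3))
```

Under that assumption, a spurious "significant" predictor is close to guaranteed when 100 true-null coefficients are each tested at the 5% level.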

In [46]:
colSums(pvals<0.05) # in each sample, how many coefficients (intercept included) have a p-value below 0.05?

In [16]:
mean(pvals<0.05) # across all coefficients and samples, we should expect a true null to be rejected about 5% of the time (using a 5% level test)
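The 5% figure is a per-test rate. A related question is how often a sample contains at least one spurious rejection among its slopes. The following is a minimal, self-contained sketch of that check. It uses smaller, hypothetical sizes (`S_small`, `P_small`, `n_small`) and an arbitrary seed so that it runs quickly, and it is not part of the original simulation.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility
S_small <- 50    # hypothetical number of simulated samples
P_small <- 20    # hypothetical number of predictors
n_small <- 200   # hypothetical number of observations per sample

# Same predictor matrix for every sample, as in the main simulation
x_small <- matrix(rnorm(n_small * P_small), n_small, P_small)
any_reject <- logical(S_small)

for (s in seq_len(S_small)) {
    y_small <- rnorm(n_small)                      # pure noise: every true slope is 0
    fit <- lm(y_small ~ x_small)
    p_slopes <- summary(fit)$coefficients[-1, 4]   # drop the intercept row
    any_reject[s] <- any(p_slopes < 0.05)          # at least one "significant" slope?
}

print(mean(any_reject))  # share of samples with at least one false rejection
```

With 20 true-null slopes, independent 5% tests would give at least one rejection in roughly 64% of samples ($1 - 0.95^{20}$), so a share far above 5% is to be expected here.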

However, the F-test of the joint null (all slopes equal to 0) takes account of this "multiple comparisons" problem. It corrects for the fact that you are testing many coefficients at once.
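For reference, the joint null and the standard overall F-statistic can be written as follows. The symbols $\mathrm{TSS}$, $\mathrm{RSS}$, $\hat{y}_i$, and $\bar{y}$ are notation introduced here for illustration and are not defined elsewhere in these notes.

$$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0, \qquad F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n-p-1)}$$

Here $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares and $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares of the fitted model. Under $H_0$ with normally distributed errors, $F$ follows an $F_{p,\,n-p-1}$ distribution, so a single 5% level test of the joint null rejects a true null only 5% of the time, however large $p$ is.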

In [44]:
pval_F <- pf(fs,P,n-P-1,lower.tail=FALSE) # for each sample, compute the p-value from the F-distribution with P = 100 restrictions and n-P-1 = 899 residual degrees of freedom

In [45]:
pval_F < 0.05

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
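As a sanity check on building the F-test p-value by hand, the sketch below compares it with the p-value that `anova()` reports when the full model is tested against an intercept-only model. It is a self-contained illustration on a small simulated dataset. The names `n_chk`, `P_chk`, `x_chk`, `y_chk`, `fit_full`, and `fit_null` are hypothetical and the seed is arbitrary.

```r
set.seed(2)  # arbitrary seed, used only for reproducibility
n_chk <- 200   # hypothetical number of observations
P_chk <- 10    # hypothetical number of predictors

x_chk <- matrix(rnorm(n_chk * P_chk), n_chk, P_chk)
y_chk <- rnorm(n_chk)        # no predictor has any effect

fit_full <- lm(y_chk ~ x_chk)
fit_null <- lm(y_chk ~ 1)    # intercept-only model

# p-value built by hand from the F-statistic and its degrees of freedom
f <- summary(fit_full)$fstatistic   # named vector: value, numdf, dendf
p_manual <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# p-value from the nested-model comparison
p_anova <- anova(fit_null, fit_full)[2, "Pr(>F)"]

print(c(manual = unname(p_manual), anova = p_anova))
```

The two values should agree up to floating-point error, and `f["dendf"]` equals `n_chk - P_chk - 1`, the same degrees-of-freedom rule used for the 899 above.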
