# Simulating a hypothesis test
Example: students sitting randomly in a classroom

In [2]:
xCoord = c(0.2, 0.1, 0.6, 0.5, 0.4)
yCoord = c(0.9, 0.8, 0.7, 0.5, 0.3)

In [7]:
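# Sum of Euclidean distances over all ordered pairs of points.
# Each pair is visited twice, as (i, j) and (j, i), so the total is
# twice the sum over distinct pairs; the i == j terms add 0.
pointDistances = function(xVec, yVec){
    total = 0
    n = length(xVec)
    for (i in seq_len(n)){
        for (j in seq_len(n)){
            dist_ij = sqrt((xVec[i] - xVec[j])^2 + (yVec[i] - yVec[j])^2)
            total = total + dist_ij
        }
    }
    return(total)
}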

In [8]:
testX = c(0,1)
testY = c(0,1)
pointDistances(testX, testY)

In [9]:
pointDistances(xCoord, yCoord)
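
As a cross-check on the double loop, the same statistic can be sketched with base R's `dist()`. The helper name `pointDistancesVec` is an assumption for illustration and is not part of the original notebook. `dist()` returns each unordered pair once, so the sum is doubled to match the loop version.

```r
# Hypothetical vectorized helper (name assumed, not from the notebook)
pointDistancesVec = function(xVec, yVec){
    # dist() gives Euclidean distances for each unordered pair once;
    # doubling matches the double loop, which counts (i, j) and (j, i)
    2 * sum(dist(cbind(xVec, yVec)))
}

xCoord = c(0.2, 0.1, 0.6, 0.5, 0.4)
yCoord = c(0.9, 0.8, 0.7, 0.5, 0.3)

checkTwoPoints = pointDistancesVec(c(0, 1), c(0, 1))  # 2 * sqrt(2)
checkClass = pointDistancesVec(xCoord, yCoord)        # about 8.417
checkTwoPoints
checkClass
```

If both versions agree on the two-point test and on the classroom data, the loop is doing what it is meant to do.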

Can we use the CLT to say that the statistic $S \sim N(\mu, \sigma)$ and run a hypothesis test where

$H_0:$ students are sitting randomly, $\mu=\mu_0$

$H_1:$ students are not sitting randomly, $\mu\neq\mu_0$

Issues:
- we can't assume the distribution of $S$ is normal, because n is small
- we don't know the parameters $\mu$ and $\sigma$
- so we have no distribution from which to calculate a p-value

Can we use the law of large numbers (LLN)?
- maybe, if we can figure out what $\bar{X}$ should be

What is a p-value?

In this case it is P(S is as extreme as or more extreme than the observed sample, given the null).

$P(S\in A \mid H_0) = E[1(S\in A) \mid H_0]$

where:

$1()$ is an indicator function: $1(expr)=1$ if expr is true and $1(expr)=0$ otherwise
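
A toy illustration of this identity, under assumptions that are not part of the classroom example: take $U$ uniform on $[0, 1]$ and the event $A = [0, 0.3)$, so the true probability is 0.3. Averaging the indicator over many draws should land close to that value.

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
u = runif(100000, 0, 1)     # toy random variable, assumed for illustration
inA = (u < 0.3)             # indicator 1(U in A) with A = [0, 0.3)
pHat = mean(inA)            # average of the indicator estimates P(U in A)
pHat
```

The same trick works for any event whose membership can be checked on simulated data, which is what makes it useful when no closed-form distribution is available.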


Use LLN:

1. simulate data of students picking seats randomly (under the null)
2. compute and save the statistic S
3. repeat many times and take the average $\bar{s}$
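
In formula form, a sketch assuming $R$ denotes the number of simulated runs and $s_r$ the statistic saved on run $r$ (this notation is introduced here for illustration):

$$\hat{p} = \frac{1}{R}\sum_{r=1}^{R} 1(s_r \in A) \;\longrightarrow\; E[1(S\in A) \mid H_0] = P(S\in A \mid H_0) \quad \text{as } R \to \infty$$

So if the quantity averaged in step 3 is the indicator of the extreme region $A$, the LLN says the average approaches the p-value as the number of runs grows.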



Let's see what extreme means...

In this case, it is any sum of distances smaller than the observed value 8.417 (that is, $s<8.417$)

In [17]:
numStudents = 5
runs = 1000

s = rep(NA, runs)
for (r in 1:runs){
    studentsX = runif(numStudents, 0, 1)
    studentsY = runif(numStudents, 0, 1)
    s[r] = pointDistances(studentsX, studentsY)
}
mean(s)
sd(s)


How do we use simulation to estimate the p-value?

Count how many simulated values fall in the extreme range: apply the indicator $1(s<8.417)$ to each run and take the average.

In [22]:
numStudents = 5
runs = 1000

sExtreme = rep(NA, runs)
dataStat = pointDistances(xCoord, yCoord)
sVals = rep(NA, runs)
for (r in 1:runs){
    studentsX = runif(numStudents, 0, 1)
    studentsY = runif(numStudents, 0, 1)
    s = pointDistances(studentsX, studentsY)
    sVals[r] = s
    sExtreme[r] = (s < dataStat)
}

# proportion of runs more extreme than the data, doubled for the two-sided H_1
mean(sExtreme)*2

Interpretation:

Because the p-value is about 0.314, which is greater than $\alpha = 0.05$, we fail to reject the null hypothesis: we cannot conclude that students do not sit randomly in the room.
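
The estimate changes from run to run because it rests on a finite number of simulations. A minimal sketch of how that Monte Carlo error could be quantified is below. The helper `simulatePValue`, its `seed` argument, and the cap at 1 are assumptions added for illustration and are not part of the original notebook. The standard error uses the usual binomial formula for a proportion.

```r
# Hypothetical helper (not in the original notebook): estimate the p-value
# together with its Monte Carlo standard error.
simulatePValue = function(xObs, yObs, runs = 1000, seed = NULL){
    if (!is.null(seed)) set.seed(seed)   # optional, for reproducibility
    # same statistic as pointDistances: every pair counted twice
    stat = function(x, y) 2 * sum(dist(cbind(x, y)))
    dataStat = stat(xObs, yObs)
    n = length(xObs)
    # one logical per run: was the simulated statistic below the observed one?
    sExtreme = replicate(runs, stat(runif(n, 0, 1), runif(n, 0, 1)) < dataStat)
    pOneTail = mean(sExtreme)
    # doubling follows the notebook; capping at 1 is an added assumption
    pValue = min(1, 2 * pOneTail)
    # binomial standard error of the proportion, doubled like the p-value
    se = 2 * sqrt(pOneTail * (1 - pOneTail) / runs)
    c(pValue = pValue, se = se)
}

xCoord = c(0.2, 0.1, 0.6, 0.5, 0.4)
yCoord = c(0.9, 0.8, 0.7, 0.5, 0.3)
result = simulatePValue(xCoord, yCoord, runs = 1000, seed = 42)
result
```

Increasing `runs` shrinks the standard error roughly in proportion to $1/\sqrt{runs}$, so the conclusion is only sensitive to simulation noise when the p-value sits close to $\alpha$.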

# Class work
