# Large sample statistics

Szymon Talaga | 15 March 2020

![ZIP logo](zip.png)

<hr>

In this notebook we cover R implementations of simple asymptotic methods, mainly $z$ tests and $\chi^2$ tests. We also discuss the bootstrap.

In [None]:
# Packages that we will need
library(tidyverse)    # read in core tidyverse packages at once
library(BSDA)         # some helper functions for working with z tests
library(latex2exp)    # for easy math on plots

# Set default theme for ggplot2
theme_set(theme_bw())

## One sample $z$ test

Example with the following set of hypotheses:

$$
H_0: \mu = 100
$$
$$
H_1: \mu \neq 100
$$
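
For reference, the quantities computed in the next cell can be written as follows. This is a sketch in our own notation (not taken from the original text): $\bar{x}$ is the sample average, $s^2$ the sample variance, $n$ the sample size, $\mu_0$ the mean assumed under $H_0$, $z_{0.975}$ the $0.975$ quantile of the standard normal distribution and $\Phi$ its distribution function.

$$
z = \frac{\bar{x} - \mu_0}{\sqrt{s^2 / n}}, \qquad
\text{CI}_{95\%} = \bar{x} \pm z_{0.975} \sqrt{s^2 / n}, \qquad
p = 2\left(1 - \Phi(|z|)\right)
$$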

In [None]:
# Sample average
xbar <- 102
# Sample variance
s2 <- 225
# Sample size
n <- 225

# Standard error
se <- sqrt(s2 / n)

# True mean under H0
mu0 <- 100

# Test statistic
z <- (xbar - mu0) / se
z

# 95% two-sided CI
ci95 <- xbar + c(-1, 1) * qnorm(0.975) * se
ci95

# p-value
p <- 2 * (1 - pnorm(abs(z)))
p

In [None]:
## Check with the implementation from the BSDA package
zsum.test(mean.x = 102, sigma.x = sqrt(225), n.x = 225, mu = 100, conf.level = .95)

## Two-sample $z$ test (disjoint groups)

Example with the following set of hypotheses:

$$
H_0: \mu_2 \leq \mu_1
$$
$$
H_1: \mu_2 > \mu_1
$$
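
For two independent groups, the statistic, the one-sided $p$-value and the lower bound of the one-sided $99\%$ confidence interval used below can be sketched as follows. The notation is ours: $\bar{x}_k$, $s_k^2$ and $n_k$ are the average, variance and size of group $k$, and $z_{0.99}$ is the $0.99$ quantile of the standard normal distribution.

$$
z = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}, \qquad
p = 1 - \Phi(z), \qquad
\text{CI}_{99\%} = \left[ (\bar{x}_2 - \bar{x}_1) - z_{0.99} \sqrt{s_1^2 / n_1 + s_2^2 / n_2}, \; \infty \right)
$$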

In [None]:
# First group
n1    <- 100    # sample size
xbar1 <- 10     # sample average
s2_1  <- 9      # sample variance

# Second group
n2    <- 70     # sample size
xbar2 <- 12     # sample average
s2_2  <- 16     # sample variance


# Difference between sample averages
dbar <- xbar2 - xbar1

# Variance of `dbar`
dbar_v <- s2_1/n1 + s2_2/n2

# Standard error
se <- sqrt(dbar_v)

# Test statistic
z <- dbar / se
z

# 99% one-sided CI
ci99 <- c(dbar - qnorm(.99) * se, Inf)
ci99

# p-value
p <- 1 - pnorm(z)
p

In [None]:
## Check with the implementation from the BSDA package
zsum.test(
    mean.x = 12, sigma.x = 4, n.x = 70,
    mean.y = 10, sigma.y = 3, n.y = 100,
    mu = 0, alternative = "greater", conf.level = .99
)

## Two-sample $z$ test (paired groups)

Example with the following set of hypotheses:
$$
H_0: \mu_2 = \mu_1
$$
$$
H_1: \mu_2 \neq \mu_1
$$
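
With paired measurements the test reduces to a one-sample $z$ test on the differences. A sketch in our own notation, where $d_i = y_i - x_i$, $\bar{d}$ is the mean difference, $s_d^2$ the sample variance of the differences and $n$ the number of pairs:

$$
z = \frac{\bar{d}}{\sqrt{s_d^2 / n}}, \qquad
p = 2\left(1 - \Phi(|z|)\right)
$$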

In [None]:
# 1st measurements
x <- c(10, 12, 15, 20, 7, 11, 15, 18, 9, 10, 11, 13, 15, 16, 9, 10)
# 2nd measurements
y <- c(11, 9,  14, 18, 12, 10, 9, 17, 15, 12, 12, 14, 15, 16, 14, 13)

# Differences (2nd - 1st)
d <- y - x
# Number of observations
n <- length(d)

# Mean difference
dbar <- mean(d)
# Variance of differences (it accounts for covariance between measurements)
s2_d <- var(d)

# Standard error
se <- sqrt(s2_d / n)

# Test statistic
z <- dbar / se
z

# Two-sided 95% confidence interval
ci95 <- dbar + c(-1, 1) * qnorm(0.975) * se
ci95

# p-value
p <- 2 * (1 - pnorm(abs(z)))
p

In [None]:
## Check with the implementation from the BSDA package
zsum.test(mean.x = dbar, sigma.x = sqrt(s2_d), n.x = n, alternative = "two.sided", conf.level = 0.95)
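
The comment above says that the variance of the differences accounts for the covariance between the two measurements. A minimal sketch of what that means, using only base R and the same data (the names `var_direct` and `var_decomp` are ours):

```r
# Same paired measurements as above
x <- c(10, 12, 15, 20, 7, 11, 15, 18, 9, 10, 11, 13, 15, 16, 9, 10)
y <- c(11, 9,  14, 18, 12, 10, 9, 17, 15, 12, 12, 14, 15, 16, 14, 13)
d <- y - x

# Variance of the differences computed directly ...
var_direct <- var(d)
# ... and through the identity Var(Y - X) = Var(X) + Var(Y) - 2 Cov(X, Y)
var_decomp <- var(x) + var(y) - 2 * cov(x, y)

all.equal(var_direct, var_decomp)

# For comparison: the variance we would use if we (wrongly)
# treated the two sets of measurements as independent
var(x) + var(y)
```

If the two measurements are positively correlated, ignoring the covariance term overstates the variance and makes the test needlessly conservative.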

## $\chi^2$ goodness-of-fit and independence

Test if a coin is unbiased:

$$
H_0: P(H) = P(T) = \frac{1}{2}
$$

In [None]:
## Check if a sequence of coin tosses is unbiased
x <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0)

# Distribution over outcomes assumed in H0
p0 <- c(1/2, 1/2)

# Tabulate values of x
x_tab <- table(x)
x_tab

chisq.test(x_tab, p = p0)
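
To see what `chisq.test` computes here, the goodness-of-fit statistic can be reproduced by hand. This is a sketch using base R only. The names `observed`, `expected` and `chi2` are ours.

```r
x  <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
p0 <- c(1/2, 1/2)

observed <- as.vector(table(x))   # counts of 0s and 1s
expected <- sum(observed) * p0    # counts expected under H0

# Pearson statistic: sum over categories of (O - E)^2 / E
chi2 <- sum((observed - expected)^2 / expected)
# Degrees of freedom: number of categories minus one
df   <- length(observed) - 1
# Upper-tail probability of the chi-squared distribution
p    <- pchisq(chi2, df = df, lower.tail = FALSE)

c(chi2 = chi2, df = df, p = p)
```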

Check if two coins are independent.

In [None]:
x <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
y <- c(1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0)

# Tabulate results together
xy_tab <- table(x, y)
xy_tab

chisq.test(xy_tab, correct = FALSE)

`R` throws a warning because some expected cell frequencies are smaller than $5$.
Moreover, for small sample sizes it is better to use the so-called Yates continuity correction (only for 2-by-2 tables).
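
The counts expected under independence, which are what the warning refers to, are stored in the object returned by `chisq.test`. A small sketch for inspecting them (the name `res` is ours, and `suppressWarnings` is used only to keep the output clean):

```r
x <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
y <- c(1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0)
xy_tab <- table(x, y)

res <- suppressWarnings(chisq.test(xy_tab, correct = FALSE))

# Expected counts: (row total * column total) / grand total
res$expected

# Is any expected count below 5?
any(res$expected < 5)
```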

In [None]:
chisq.test(xy_tab, correct = TRUE)
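
In its textbook form, the continuity-corrected statistic shrinks every deviation between the observed count $O_i$ and the expected count $E_i$ by $\frac{1}{2}$ before squaring. The formula below is this standard form in our own notation. The implementation in `R` additionally caps the correction so that it never exceeds $|O_i - E_i|$.

$$
\chi^2_{\text{Yates}} = \sum_{i} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i}
$$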

However, if we really want to test whether two coins are "working correctly", we should test for their independence and lack of bias jointly.
This is actually simple. We can treat their joint distribution (over pairs $HH$, $HT$, $TH$ and $TT$) as a distribution over a single categorical variable,
and, assuming the coins are independent and unbiased, we can fully specify it as:
$$
P(HH) = P(HT) = P(TH) = P(TT) = \frac{1}{4}
$$

This is our null hypothesis $H_0$.

In [None]:
# Distribution over joint outcomes under H0
p0 <- rep(1/4, 4) 

# We have to dump table to a vector
xy <- as.vector(xy_tab)
xy

In [None]:
chisq.test(xy, p = p0)

The test gives no grounds to reject $H_0$, so the data provide no evidence that the coins are dependent or biased.
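
Dumping the table to a vector loses the labels of the joint outcomes. As an alternative sketch (an assumption about presentation, not the approach used above), the pairs can be tabulated as a single categorical variable with base R's `interaction`, which keeps the labels. Here the first digit of each label is the value of `x` and the second the value of `y`.

```r
x <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
y <- c(1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0)

# One categorical variable with four levels: "00", "10", "01", "11"
joint     <- interaction(x, y, sep = "")
joint_tab <- table(joint)
joint_tab

# Same joint test as before, now on a labelled table
chisq.test(joint_tab, p = rep(1/4, 4))
```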

## Bootstrap

Below we present a simple example of a bootstrap computation.

We first implement it by hand and then use a dedicated `R` package.

We will construct a bootstrap confidence interval and a Wald-type test for the difference between medians.
We will use simulated data, for which we know the true answer, so that we can fully understand the results.

In [None]:
## Two normal distributions with the same median.
N  <- 1000 # number of observations
x1 <- rnorm(N, mean = 100, sd = 15)
x2 <- rnorm(N, mean = 100, sd = 15)

# Number of bootstrap replicates
B  <- 1000

# Preallocate vector for storing simulated median differences
median_diff_boot <- vector(mode = "numeric", length = B)

for (i in 1:B) {
    x1_sim <- sample(x1, size = N, replace = TRUE)
    x2_sim <- sample(x2, size = N, replace = TRUE)
    diff_x1_x2 <- median(x1_sim) - median(x2_sim)
    median_diff_boot[i] <- diff_x1_x2
}

# Standard error
median_diff_se <- sd(median_diff_boot)

# 95% Percentile CI
print(quantile(median_diff_boot, probs = c(.025, .975)))

In most runs the confidence interval contains zero and is roughly symmetric around it, so we have no reason to believe that the medians differ.
Now change the sample size ($N$) to $10000$ and check whether this changes the result. The interval should become narrower, but the conclusion should not change.
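
To make this experiment easier to repeat, the loop can be wrapped in a small function. The sketch below uses base R only. The helper name `boot_median_diff` and the fixed seed are our assumptions and are not part of the original notebook.

```r
# Hypothetical helper: bootstrap distribution of the difference
# between the medians of two independent samples
boot_median_diff <- function(x1, x2, B = 1000) {
    replicate(B, {
        median(sample(x1, replace = TRUE)) - median(sample(x2, replace = TRUE))
    })
}

set.seed(2020)  # assumption: any fixed seed, used only for reproducibility

for (N in c(1000, 10000)) {
    x1 <- rnorm(N, mean = 100, sd = 15)
    x2 <- rnorm(N, mean = 100, sd = 15)
    boot_dist <- boot_median_diff(x1, x2)
    cat("N =", N, "\n")
    # 95% percentile CI
    print(quantile(boot_dist, probs = c(.025, .975)))
}
```

`replicate` evaluates the expression `B` times and collects the results in a vector, so no preallocation is needed.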

We can examine the bootstrap distribution to decide whether a Wald-type test (which assumes that the statistic is approximately normally distributed) makes sense.

In [None]:
data.frame(median_diff = median_diff_boot) %>%
    ggplot(aes(x = median_diff)) +
    geom_histogram(color = "black")

It looks quite normal, so we may run a Wald-type test.
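
The Wald-type statistic divides the observed difference between medians by the bootstrap standard error, that is, the standard deviation of the bootstrap replicates. A sketch in our own notation, where $\hat{m}_1$ and $\hat{m}_2$ are the sample medians, $\hat{\delta}^*_b$ is the difference between medians in the $b$-th bootstrap sample and $\bar{\delta}^*$ is the average of these differences over the $B$ replicates:

$$
z = \frac{\hat{m}_1 - \hat{m}_2}{\widehat{SE}_{boot}}, \qquad
\widehat{SE}_{boot} = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left(\hat{\delta}^*_b - \bar{\delta}^*\right)^2}
$$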

In [None]:
z <- (median(x1) - median(x2)) / sd(median_diff_boot)
z

In [None]:
# p-value (two-sided)
2 * (1 - pnorm(abs(z)))

Now we repeat the same calculations but for data with different medians.

In [None]:
## Two normal distributions with different medians.
N  <- 1000 # number of observations
x1 <- rnorm(N, mean = 100, sd = 15)
x2 <- rnorm(N, mean = 105, sd = 15)

# Number of bootstrap replicates
B  <- 1000

# Preallocate vector for storing simulated median differences
median_diff_boot <- vector(mode = "numeric", length = B)

for (i in 1:B) {
    x1_sim <- sample(x1, size = N, replace = TRUE)
    x2_sim <- sample(x2, size = N, replace = TRUE)
    diff_x1_x2 <- median(x1_sim) - median(x2_sim)
    median_diff_boot[i] <- diff_x1_x2
}

# Standard error
median_diff_se <- sd(median_diff_boot)

# 95% Percentile CI
print(quantile(median_diff_boot, probs = c(.025, .975)))

In [None]:
data.frame(median_diff = median_diff_boot) %>%
    ggplot(aes(x = median_diff)) +
    geom_histogram(color = "black")

In [None]:
# Wald test
z <- (median(x1) - median(x2)) / sd(median_diff_boot)
z
# p-value (two-sided)
2 * (1 - pnorm(abs(z)))

### `boot` package

The `boot` package provides convenient utility functions for generating bootstrap distributions and estimators. It also implements many advanced bootstrap techniques.
You may find a short introduction [here](https://stats.idre.ucla.edu/r/library/r-library-introduction-to-bootstrapping/).

Below we solve our problem with the main function `boot` from the `boot` package. The main work we have to do is to define a function that takes a dataset and a vector of indices
selecting the observations of a bootstrap sample, and that returns the statistic computed on that sample.

In [None]:
library(boot)

# Number of bootstrap replicates
R <- 1000
# Data: two vectors joined as a single data frame
data <- data.frame(
    x1 = x1,
    x2 = x2
)

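# Bootstrapping function: difference between the medians of the resampled columns
# (not the median of the row-wise differences)
median_diff_func <- function(data, idx) {
    median(data[idx, "x1"]) - median(data[idx, "x2"])
}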


boot_result <- boot(data, median_diff_func, R = R)
boot_result

In [None]:
# Simulated median differences
head(boot_result$t)

In [None]:
# Original median difference
boot_result$t0

In [None]:
# 95% Percentile CI
quantile(boot_result$t, probs = c(0.025, 0.975))

In [None]:
# Wald-type test
z <- boot_result$t0 / sd(boot_result$t)
z
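
Because the two samples are independent, an alternative is to stack them in long format and let `boot` resample within each group through its `strata` argument. The sketch below is one possible setup, not the approach used above. The names `long` and `median_diff_strata` and the fixed seed are our assumptions. It also uses `boot.ci` from the same package to obtain percentile and normal-approximation intervals.

```r
library(boot)

set.seed(1)  # assumption: fixed seed, used only for reproducibility
N  <- 1000
x1 <- rnorm(N, mean = 100, sd = 15)
x2 <- rnorm(N, mean = 105, sd = 15)

# Long format: one row per observation plus a group label
long <- data.frame(
    value = c(x1, x2),
    group = factor(rep(c("x1", "x2"), each = N))
)

# Statistic: difference between group medians in the resampled rows
median_diff_strata <- function(data, idx) {
    d <- data[idx, ]
    median(d$value[d$group == "x1"]) - median(d$value[d$group == "x2"])
}

# `strata` makes boot resample separately within each group,
# so every bootstrap sample keeps N observations per group
boot_strata <- boot(long, median_diff_strata, R = 1000, strata = long$group)

# Percentile and normal-approximation (Wald-type) intervals
boot.ci(boot_strata, conf = 0.95, type = c("perc", "norm"))
```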