<font size="6"><b>TO REJECT OR NOT TO REJECT: STATISTICAL INFERENCE AND HYPOTHESIS TESTING</b></font>

<font size="5"><b>Serhat Çevikel</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(BBmisc)
library(plotly)
library(DescTools)
library(moments)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

![entropy](../imagesba/p_values.png)

(https://xkcd.com/1478/)

In this session, we will focus on a fundamental application of statistical inference:

Take a sample of values, and test whether the sample is from a hypothesized population.

We will cover topics such as hypothesis testing, null and alternative hypotheses, confidence intervals, p-values, Type I and Type II error and power of test.

We will start with the easiest case: A sample of values and a population with a known standard deviation to test against.

# One sample - Known population standard deviation

## Small Sample

Let's start with a simple setting:

- We have a sample of values of a certain size
- We want to test whether the sample is from a certain hypothesized population with a known mean and $\sigma$

- From the CLT we know that, whatever shape the population distribution has, the distribution of sample means (the sampling distribution) converges to a normal distribution as the sample size $n$ gets larger, and the standard deviation of the sample means (the standard error) obeys the square root law:

${\displaystyle \sigma _{\bar {x}}={\sqrt {\frac {\sigma ^{2}}{n}}}={\frac {\sigma }{\sqrt {n}}}}$
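
The square root law can be checked with a quick simulation. The following is only a sketch: the exponential population (rate 1, so mean 1 and standard deviation 1), the sample sizes and all `_demo` names are assumptions chosen for illustration and are not part of the simulation built in this session.

```r
# Sketch: compare simulated and theoretical standard errors.
# Assumption: an exponential population with rate 1 (sd = 1), picked only
# because it is clearly non-normal.
set.seed(42)
sigma_demo <- 1
n_demo <- c(5, 20, 80)

# For each sample size, draw 5000 samples and take the sd of their means
se_sim <- sapply(n_demo, function(n) {
  sd(replicate(5000, mean(rexp(n, rate = 1))))
})

# Square root law: sigma / sqrt(n)
se_theory <- sigma_demo / sqrt(n_demo)

round(cbind(n = n_demo, simulated = se_sim, theoretical = se_theory), 3)
```

The simulated and theoretical columns should be close to each other, and quadrupling the sample size should roughly halve the standard error.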

- To understand the concepts better we will draw multiple samples from the population and later on we will add more features into our simulation.

First, let's set some parameters for the first population (we will add more populations later on) and the samples:

The size, mean and $\sigma$ for population 1:

In [None]:
popsize <- 1e5
pop1mu <- 0
pop1sd <- 1

Number of samples and size of each sample:

In [None]:
nsamps <- 1e2
sizesamp <- 10 # 25

Let's create the first population and ensure that it obeys the parameters we set:

In [None]:
set.seed(1)
pop1 <- normalize(rnorm(popsize))
pop1 <- pop1 * pop1sd + pop1mu

In [None]:
summary(pop1)

In [None]:
sd(pop1)

In [None]:
hist(pop1)

Now let's draw the samples from population 1:

In [None]:
set.seed(2)
samps1 <- replicate(nsamps, sample(pop1, sizesamp), simplify = F)

And get the sampling distribution:

In [None]:
sampmus1 <- sapply(samps1, mean)

In [None]:
summary(sampmus1)

In [None]:
sd(sampmus1)

In [None]:
hist(sampmus1)

We see that even with a sample size of only 10, the sampling distribution is close to a normal distribution, and its standard deviation is close to the standard error given by the square root law:

In [None]:
pop1sd / sqrt(sizesamp) # theoretical standard error: sigma / sqrt(n)

Now let's combine the population, the samples and the sampling distribution into a large table with some additional identifiers:

In [None]:
pop1dt <- data.table(vals = pop1, pop = 1, ind = 0, type = "pop")
samps1dt <- mapply(function(x, y) data.table(vals = x, ind = y), samps1, seq_along(samps1), SIMPLIFY = F) %>% rbindlist
samps1dt[, (c("pop", "type")) := .(1, "samp")]
sampmus1dt <- data.table(vals = sampmus1, pop = 1, ind = 1, type = "sampmu")
sampdist_dt <- rbind(pop1dt, samps1dt, sampmus1dt)
popmus <- data.table(pop = 1, popmu = c(pop1mu), popsd = c(pop1sd))          
sampdist_dt <- sampdist_dt %>% left_join(popmus, by = "pop")# %>% left_join(lwdt, by = "type")

Remember the shape of normal distribution:

In [None]:
qvals1 <- seq(-4, 4, 0.05)

In [None]:
p1 <- plot_ly(x = qvals1, y = round(dnorm(qvals1), 3), type = "scatter", mode = "lines") %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "red"
    ),
    yaxis = list(showspikes = FALSE)
  )

p2 <- plot_ly(x = qvals1, y = round(pnorm(qvals1), 3), type = "scatter", mode = "lines") %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "red"
    ),
    yaxis = list(showspikes = FALSE)
  )

subplot(p1, p2, nrows = 2, shareX = TRUE) %>%
  layout(
      title = "Standard Normal Distribution",
    hovermode = "x unified",
      annotations = list(
 list(y = 0.9, text = "Probability Density Function", showarrow = F, xref='paper', yref='paper'),
  list(y = 0.3, text = "Cumulative Distribution Function", showarrow = F, xref='paper', yref='paper'))
  )


The probability of values away from the mean is lower, but how much lower?

Suppose we want to get the range of z-scores - quantile values of standard normal distribution of $\mu = 0$ and $\sigma = 1$ that are within a certain probability.

The support of the normal distribution is the whole continuum. So any value in this continuum has a non-zero probability density under any normal distribution, although the density converges to zero far enough from the mean.
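
To get a feeling for how quickly the tails thin out, we can compute the two-sided tail probability beyond 1, 2, 3 and 4 standard deviations. This is a small sketch with illustrative variable names (`z_demo`, `tail_prob_demo`):

```r
# Two-sided tail probability P(|Z| > z) for a standard normal variable
z_demo <- 1:4
tail_prob_demo <- 2 * pnorm(-z_demo)

signif(tail_prob_demo, 3)
```

The probabilities drop from about 32% beyond one standard deviation to about 4.6%, 0.27% and 0.006% beyond two, three and four standard deviations. They never reach exactly zero.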

However, we may want to limit the range of plausible values to a certain probability. Let's say we want to trim the two tails with a total probability of $\alpha = 0.05$. Here $\alpha$ is the probability that a value is more extreme (further away from the mean) than the quantiles at that probability. So a tail area of 0.025 is trimmed beyond each of the two quantiles, and with 95% probability the values in this distribution lie between these quantile values:

In [None]:
qnorm(c(0.025, 1 - 0.025))

Now we will add the quantiles of the sampling distribution for selected two-sided tail probabilities (0.01, 0.05 and 0.1):

In [None]:
sampdist_dt[1, line001l := qnorm(0.01/2, popmu, popsd /sqrt(sizesamp))]
sampdist_dt[1, line005l := qnorm(0.05/2, popmu, popsd /sqrt(sizesamp))]
sampdist_dt[1, line010l := qnorm(0.1/2, popmu, popsd /sqrt(sizesamp))]

In [None]:
sampdist_dt[1, line001r := qnorm(0.01/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]
sampdist_dt[1, line005r := qnorm(0.05/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]
sampdist_dt[1, line010r := qnorm(0.1/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]

And last, we create a unique group id to generate a separate density plot for each distribution:

In [None]:
sampdist_dt[, ind2 := .GRP, by = c("ind", "type", "pop")]

Now we will create a plot that shows the:

- Density of the population
- Density of each sample
- Density of sample means (sampling distribution)
- Vertical lines showing the quantiles of the sampling distribution (size 10 samples from population 1) for two-sided $\alpha = 0.01, 0.05, 0.1$, i.e. tail areas of $0.005, 0.025, 0.05$ on each side

In [None]:
p1 <- sampdist_dt %>%
mutate_at("vals", round, 4) %>%
mutate_at("pop", factor) %>%
mutate_at("type", factor, levels = c("pop", "sampmu", "samp")) %>%
ggplot(aes(x = vals, color = pop)) +
geom_density(aes(linewidth = type, group = ind2), adjust = 2) +
geom_vline(aes(xintercept = popmu, color = pop)) +
geom_vline(aes(xintercept = line010l, color = pop), linetype = 5) +
geom_vline(aes(xintercept = line010r, color = pop), linetype = 5) +
geom_vline(aes(xintercept = line005l, color = pop), linetype = 2) +
geom_vline(aes(xintercept = line005r, color = pop), linetype = 2) +
geom_vline(aes(xintercept = line001l, color = pop), linetype = 3) +
geom_vline(aes(xintercept = line001r, color = pop), linetype = 3) +
scale_linewidth_manual(values = c(pop = 1, sampmu = 0.5, samp = 0.02)) +
scale_color_manual(values = c("1" = "red", "2" = "blue")) +
xlim(c(-5, 5))

p1p <- ggplotly(p1)  %>%
          layout(legend = list(orientation = "h",
                     xanchor = "center",
                     x = 0.5, y = -0.6, title = list(text=NULL)), height = 600)

And we will also create a separate chart that shows the two-sided tail probabilities for the sampling distribution. The tail probability is two times the smaller of the cumulative probability $p$ and $1 - p$ calculated for a quantile. We add vertical lines for the quantiles that cut off left and right tail areas of $0.005, 0.025, 0.05$ (two-sided $\alpha = 0.01, 0.05, 0.1$):

In [None]:
qvals <- seq(-5, 5, 0.01)
pop1pdt <- data.table(qvals, pop1p = pnorm(qvals, pop1mu, pop1sd / sqrt(sizesamp)) %>% round(3), pop = 1)
pop1pdt[, pop1p := round(2 * pmin(pop1p, 1 - pop1p), 3)]
pop1pdt[1, line001l := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line005l := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line010l := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line001r := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1pdt[1, line005r := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1pdt[1, line010r := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
p2 <- pop1pdt %>% ggplot(aes(x = qvals, y = pop1p)) +
geom_line(color = "red") +
geom_vline(aes(xintercept = line010l), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005l), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001l), linetype = 3, color = "red") +
geom_vline(aes(xintercept = line010r), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005r), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001r), linetype = 3, color = "red")
p2p <- ggplotly(p2)

Now we will create a third plot that shows the absolute Z-scores of quantile values, calculated using the population mean and the standard error of the sampling distribution. The Z-scores at the two-sided probabilities $\alpha = 0.01, 0.05, 0.1$ are added as horizontal lines. These lines are the critical values:

In [None]:
qvals <- seq(-5, 5, 0.01)
pop1zdt <- data.table(qvals, pop1z = abs((qvals - pop1mu) / (pop1sd / sqrt(sizesamp))) %>% round(3), pop = 1)
pop1zdt[1, line001l := qnorm(0.01/2, lower.tail = F)]
pop1zdt[1, line005l := qnorm(0.05/2, lower.tail = F)]
pop1zdt[1, line010l := qnorm(0.10/2, lower.tail = F)]
p3 <- pop1zdt %>% ggplot(aes(x = qvals, y = pop1z)) +
geom_line(color = "red") +
geom_hline(aes(yintercept = line010l), linetype = 5, color = "black") +
geom_hline(aes(yintercept = line005l), linetype = 2, color = "black") +
geom_hline(aes(yintercept = line001l), linetype = 3, color = "black")
p3p <- ggplotly(p3)

Now a fourth plot is created that shows the confidence intervals: for each quantile value, the lower and upper bounds of the interval at a given confidence level, i.e. a total probability of $1 - \alpha$ around that value, for $\alpha = 0.01, 0.05, 0.1$. The population mean is shown as the horizontal black line:

In [None]:
qvals <- seq(-5, 5, 0.01)
pop1cidt <- data.table(qvals, pop1p = pnorm(qvals, pop1mu, pop1sd / sqrt(sizesamp)), pop = 1)
pop1cidt[, line001l := qnorm(0.01/2, qvals, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1cidt[, line005l := qnorm(0.05/2, qvals, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1cidt[, line010l := qnorm(0.10/2, qvals, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1cidt[, line001r := qnorm(0.01/2, qvals, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1cidt[, line005r := qnorm(0.05/2, qvals, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1cidt[, line010r := qnorm(0.10/2, qvals, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
p4 <- pop1cidt %>% ggplot(aes(x = qvals)) +
geom_line(aes(y = line010l), linetype = 5, color = "red") +
geom_line(aes(y = line005l), linetype = 2, color = "red") +
geom_line(aes(y = line001l), linetype = 3, color = "red") +
geom_line(aes(y = line010r), linetype = 5, color = "red") +
geom_line(aes(y = line005r), linetype = 2, color = "red") +
geom_line(aes(y = line001r), linetype = 3, color = "red") +
geom_hline(yintercept = pop1mu, color = "black")
p4p <- ggplotly(p4)

And we combine these four plots in tandem with parallel spikes that show the values at the quantile on hover:

In [None]:
p1p2 <- p1p %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p2p2 <- p2p %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p3p2 <- p3p %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p4p2 <- p4p %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )


sp1 <- subplot(p1p2, p2p2, p3p2, p4p2, nrows = 4, shareX = TRUE) %>%
  layout(
      title = NULL,
    hovermode = "x unified",
      annotations = list(
    list(y = 1, text = "Sampling Distribution", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.75, text = "P-Value for Population 1", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.5, text = "Z-Value for Population 1", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.25, text = "Confidence Interval", showarrow = F, xref='paper', yref='paper')),
      height = 600
  )


And let's show the resulting plot. You can hover over the values and read the information from all four plots in parallel with the tooltips:

In [None]:
sp1

Now we will go through what these plots are telling us, but in short: they are basically showing the same thing from different points of view!

The first plot shows the distribution of:

- The population with boldest line
- Separate samples with the thinnest filament like lines
- The sample means (sampling distribution) with the medium-width line. The sampling distribution is the more peaked one.

As for the vertical lines:
- The population mean is shown as the solid vertical bold line
- The upper and lower limits for the two-sided tail probability $\alpha = 0.1$ of the sampling distribution are shown as vertical longer dashed lines. The tail area under the extreme side of each line is $\displaystyle \frac{0.1}{2} = 0.05$.
- The upper and lower limits for the two-sided tail probability $\alpha = 0.05$ of the sampling distribution are shown as vertical shorter dashed lines. The tail area under the extreme side of each line is $\displaystyle \frac{0.05}{2} = 0.025$.
- The upper and lower limits for the two-sided tail probability $\alpha = 0.01$ of the sampling distribution are shown as vertical dotted lines. The tail area under the extreme side of each line is $\displaystyle \frac{0.01}{2} = 0.005$. 

The second plot shows:
- The p-values, two-sided tail probability of the quantiles in the sampling distribution.
- The same vertical lines in the first plot also appear here.

- So the p-value is simply two times the minimum of cumulative probability $p$ or $1 - p$.
- The p-value shows the probability that values can be more extreme - towards the tails - than the certain quantile value.
- At the population mean the p-value is 1, so the probability that any quantile value is at least as extreme as the population mean is 1.
- As we move away from the population mean p-values converge to zero.
- When we hover over the vertical lines, we see that the p-values are 0.1, 0.05 and 0.01 respectively.

The third plot shows:

- The absolute value of the standardized Z-scores of quantiles: the absolute difference of each quantile from the population mean, divided by the standard error ($\sigma$ of the sampling distribution).
- The horizontal lines show the Z-scores of the limits that correspond to the two-sided tail probabilities of $\alpha = 0.1, 0.05, 0.01$. These Z-scores are respectively:

In [None]:
round(qnorm(c(0.1, 0.05, 0.01)/2, lower.tail = F), 3)

When the p-value in the second plot is higher than certain alpha values, the Z-scores in the third plot are lower than the Z-scores that correspond to those alpha values.

The fourth plot shows the confidence intervals across quantile values for each alpha value and corresponding confidence level:

- So the longer dashed lines show the +/-1.645 * SE interval around the quantile values for $\alpha = 0.1$ and confidence level of $1 - 0.1 = 0.9$
- The shorter dashed lines show the +/-1.96 * SE interval around the quantile values for $\alpha = 0.05$ and confidence level of $1 - 0.05 = 0.95$
- The dotted lines show the +/-2.576 * SE interval around the quantile values for $\alpha = 0.01$ and confidence level of $1 - 0.01 = 0.99$

The horizontal line shows the population mean.

When the p-value of a quantile is higher than a certain $\alpha$ value in the second plot, and equivalently its absolute Z-score is lower than the Z-score that corresponds to that $\alpha$ value, then the population mean is within the lines that mark the interval for the confidence level $1 - \alpha$.
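
This agreement between the plots can be written as a chain of equivalences. The notation is introduced here only for illustration: $q$ is a value on the x axis (a candidate sample mean), $\mu$ the population mean, $SE$ the standard error, $z = (q - \mu)/SE$, $\Phi$ the standard normal cumulative distribution function and $z_{\alpha/2}$ the critical Z-score with upper tail area $\alpha/2$:

$$
\begin{aligned}
p = 2\,\bigl(1 - \Phi(|z|)\bigr) \ge \alpha
&\iff \Phi(|z|) \le 1 - \frac{\alpha}{2} \\
&\iff |z| \le z_{\alpha/2} \\
&\iff |q - \mu| \le z_{\alpha/2} \cdot SE \\
&\iff q - z_{\alpha/2} \cdot SE \le \mu \le q + z_{\alpha/2} \cdot SE
\end{aligned}
$$

The first line is the p-value view (second plot), the second line the Z-score view (third plot) and the last line the confidence interval view (fourth plot).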

Now let's come to some basic concepts:

- The hypothesized population mean is $\mu$, population standard deviation is $\sigma$
- The sample mean is $\displaystyle \bar {X}$, sample standard deviation is $s$.

First of all since we want to test whether the sample is coming from the hypothesized population, we have to form hypotheses:

- **Null hypothesis** (${\textstyle H_{0}}$): Sample is coming from the population, $\bar {X}$ is not significantly different than $\mu$

- **Alternative hypothesis** (${\textstyle H_{a}}$): Sample is not coming from the population, $\bar {X}$ is significantly different than $\mu$

So basically we will test whether the sample mean is significantly away from the population mean, considering the shape of sampling distribution which converges to normal distribution. Why *significantly*?

The support of normal distribution is the whole continuum. So any value on the continuum has a non-zero probability of being within the sampling distribution. However the hypothesized population mean is not the only plausible value. As we move away from the population mean in ${\textstyle H_{0}}$, the evidence's support for the null hypothesis decreases and alternative population mean values may become more plausible.

For this reason we set a **significance level** $\alpha$ in hypothesis testing. $\alpha$ states the probability that any value can be more extreme (away from the population mean) than the quantiles that match the $\alpha$ value.

So, if 

- the sample mean is closer to the mean than the quantiles that match the $\alpha$ value or
- The tail probability of the sample mean is larger than $\alpha$

then we **do not reject** ${\textstyle H_{0}}$ and do not **accept** ${\textstyle H_{a}}$.

Why do we **not reject** ${\textstyle H_{0}}$ instead of simply **accepting** it? Because there may be many plausible values for the population mean, and $\mu$ is just one of them (Agresti and Kateri 2021, Foundations of Statistics for Data Scientists, p.181).

If 

- the sample mean is further away from the mean than the quantiles that match the $\alpha$ value or
- The tail probability of the sample mean is smaller than $\alpha$

then we **reject** ${\textstyle H_{0}}$ and **accept** ${\textstyle H_{a}}$.

Why can we **accept** ${\textstyle H_{a}}$? Because ${\textstyle H_{a}}$ already contains all plausible population mean values other than $\mu$ (Agresti and Kateri 2021, Foundations of Statistics for Data Scientists, p.181).

Now let's use the parameters that we set above to conduct two examples.

Let's remember the population parameters and sample size:

In [None]:
pop1mu
pop1sd
sizesamp

Let's first calculate the standard error, the standard deviation of the sampling distribution:

In [None]:
se1 <- pop1sd / sqrt(sizesamp)
se1

Let's set the significance level $\alpha$ to 0.05 and conduct a two-sided test, so we will test whether a sample mean value is significantly away from the population in any direction, left or right:

In [None]:
siglev <- 0.05

Let's assume the mean of the first sample is:

In [None]:
sampm1 <- -0.7

### H0 rejected

#### P-value and Type I error

We now calculate the tail probability of having more extreme values than $\displaystyle \bar {X}$:

Consider an observed test-statistic ${\displaystyle t}$ from unknown distribution ${\displaystyle T}$. Then the p-value ${\displaystyle p}$ is what the prior probability would be of observing a test-statistic value at least as "extreme" as ${\displaystyle t}$ if null hypothesis ${\displaystyle H_{0}}$ were true. That is:

${\displaystyle p=\Pr(T\geq t\mid H_{0})}$ for a one-sided right-tail test-statistic distribution.

${\displaystyle p=\Pr(T\leq t\mid H_{0})}$ for a one-sided left-tail test-statistic distribution.

${\displaystyle p=2\min\{\Pr(T\geq t\mid H_{0}),\Pr(T\leq t\mid H_{0})\}}$ for a two-sided test-statistic distribution.

If the distribution of ${\displaystyle T}$ is symmetric about zero, then 
${\displaystyle p=\Pr(|T|\geq |t|\mid H_{0}).}$

(https://en.wikipedia.org/wiki/P-value)
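
The three definitions above can be sketched as a small helper for the known-$\sigma$ case. The function name `z_pvalue` and its arguments are hypothetical and only serve the illustration. The example values ($\bar{X} = -0.7$, $\mu = 0$, $\sigma = 1$, $n = 10$) repeat the parameters set earlier.

```r
# Sketch: p-value of a sample mean under H0 when the sampling distribution
# is normal with mean mu and standard deviation se (hypothetical helper).
z_pvalue <- function(xbar, mu, se,
                     alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  lower <- pnorm(xbar, mean = mu, sd = se)                      # Pr(T <= t | H0)
  upper <- pnorm(xbar, mean = mu, sd = se, lower.tail = FALSE)  # Pr(T >= t | H0)
  switch(alternative,
         less      = lower,
         greater   = upper,
         two.sided = 2 * min(lower, upper))
}

p_two  <- z_pvalue(-0.7, mu = 0, se = 1 / sqrt(10))
p_less <- z_pvalue(-0.7, mu = 0, se = 1 / sqrt(10), alternative = "less")
c(two.sided = p_two, less = p_less)
```

Because the sample mean lies in the left tail, the two-sided p-value is exactly twice the left-tail p-value.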

In [None]:
pval <- pnorm(sampm1, pop1mu, se1)
pval <- 2 * min(pval, 1-pval)
pval

Here we take the double of the probability of the tail that the sample mean is closer to. And:

In [None]:
pval < siglev

Since 0.027 < 0.05, we reject $H_{0}$

**The second subplot in the tandem plot above shows the P-values for any sample mean value along with selected quantiles at certain $\alpha$ values.**

What is the second meaning of $\alpha$ in a testing context? Since the sampling distribution can include any sample mean value in the continuum, a sample mean from the hypothesized population still falls in the rejection region with probability $\alpha$. **So when ${\textstyle H_{0}}$ is true, there is an $\alpha$ probability that we reject it while we shouldn't! That is known as the Type I error**.
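
This reading of $\alpha$ can be checked by simulation. The sketch below assumes a normal population with $\mu = 0$ and $\sigma = 1$ and samples of size 10, and uses illustrative `_demo` names. Every sample is drawn with $H_0$ being true, so each rejection is a Type I error.

```r
# Sketch: share of false rejections when H0 is true
set.seed(123)
alpha_demo <- 0.05
n_demo <- 10
se_demo <- 1 / sqrt(n_demo)

# Sample means of 20000 samples drawn from the hypothesized population
xbar_demo <- replicate(20000, mean(rnorm(n_demo, mean = 0, sd = 1)))

# Two-sided p-value of each sample mean
p_demo <- 2 * pnorm(-abs(xbar_demo / se_demo))

# Proportion of samples for which H0 is (wrongly) rejected
type1_rate <- mean(p_demo < alpha_demo)
type1_rate
```

The proportion should be close to the chosen significance level.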

#### Test statistic

Now we will approach the problem from another point of view: the critical test statistic, a Z-score of the standard normal distribution, can be calculated using half the significance level (the tail probability on each side):

In [None]:
critz05 <- qnorm(siglev/2, lower.tail = F)
critz05

So any absolute Z-score of a sample mean larger than that value should be deemed to be significantly different than $\mu$. Let's calculate the Z-score of the sample mean:

In [None]:
sampz1 <- abs((sampm1 - pop1mu) / se1)
sampz1

In [None]:
sampz1 > critz05

Since the absolute value of the test statistic - the Z-score - of the sample mean is larger than critical test statistic (2.21 > 1.96), we **reject** ${\textstyle H_{0}}$, so sample mean is significantly different than the population mean. We could do the same comparison without taking the absolute value of the test statistic. In that case:

$-2.21 < -1.96$

So if the test statistic is negative then the rejection area is when it is smaller than the negative critical value.

**The third subplot in the tandem plot above shows the Z-scores of sample means along with critical test-statistics.**

#### Confidence interval

Another way of looking at hypothesis testing is to state the boundaries around the sample mean within which the population mean plausibly lies at the chosen confidence level, known as the confidence interval:

$\displaystyle \bar {X} \pm Z_{crit} \cdot SE$

where $Z_{crit}$ is the critical Z-score for the selected $\alpha$ and $SE$ is the standard error of the sampling distribution.

In our example:

In [None]:
confint1 <- round(sampm1 + c(-1, 1) * critz05 * se1, 3)
confint1

This range is the 95% confidence interval in our example. What is the meaning of this confidence interval?

Let's create 1000 new samples from the same population, calculate their means and the confidence intervals in the same way:

In [None]:
set.seed(5)
sampmus1000 <- sapply(replicate(1e3, sample(pop1, sizesamp), simplify = F), mean)

In [None]:
confint1_l <- lapply(sampmus1000, function(x) x + c(-1, 1) * critz05 * se1)

Now let's check in what portion of these confidence intervals the true parameter lies within:

In [None]:
confwithin1 <- sapply(confint1_l, function(x) pop1mu %between% x)

In [None]:
sum(confwithin1) / length(confwithin1)

So about 95% of the time, the true population mean lies within the confidence interval around the sample mean.

When $\mu$ is within this range we do not reject $H_{0}$, and when $\mu$ is outside this range we reject $H_{0}$. In our case:

In [None]:
pop1mu %between% confint1

Population mean 0 is not within the confidence interval around the sample mean so we reject $H_{0}$

**The fourth subplot in the tandem plot above shows the confidence interval around sample mean values for selected confidence levels ($1 - \alpha$) along with the population mean**

#### Test criteria summary

So let's summarize the three different points of view for null hypothesis testing:

| Method      | Not reject H0 | Reject H0 | 
| ------------- | ------------- | ----- |
| P-value | pval $\ge \alpha$ | pval $< \alpha$ |
| Test statistic | \|$Z_{\bar {X}}$\| $\le$ \|$Z_{crit}$\| | \|$Z_{\bar {X}}$\| $>$ \|$Z_{crit}$\| |
| Confidence Interval | $\bar {X} - Z_{crit}SE \le \mu \le \bar {X} + Z_{crit}SE$ | $\mu < \bar {X} - Z_{crit}SE$ or $\mu > \bar {X} + Z_{crit}SE$ |

So we reject $H_0$ when:

- P-value of the sample mean is smaller than the significance level $\alpha$
- Absolute value of the Z-score of the sample mean is larger than the critical Z-score for selected $\alpha$
- When population mean is not within the confidence interval - interval around sample mean bounded by critical Z-score times the standard error of sampling distribution.

You can check these criteria by hovering over the quantile value of -0.7 on the x axis and comparing with the lines that correspond to $\alpha = 0.05$: the vertical lines in the first two plots, the horizontal line in the third plot, or the sloped lines in the fourth plot.
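
The three criteria can also be put side by side in code. The helper below is a sketch with a hypothetical name (`z_test_criteria`) that returns the reject decision under each criterion. The second sample mean of -0.5 is an assumed value, added only to show a case that is not rejected.

```r
# Sketch: reject decision (TRUE = reject H0) under each of the three criteria
z_test_criteria <- function(xbar, mu, sigma, n, alpha = 0.05) {
  se     <- sigma / sqrt(n)                       # standard error
  z      <- (xbar - mu) / se                      # test statistic
  z_crit <- qnorm(alpha / 2, lower.tail = FALSE)  # critical Z-score
  p      <- 2 * pnorm(-abs(z))                    # two-sided p-value
  ci     <- xbar + c(-1, 1) * z_crit * se         # confidence interval
  c(by_pvalue    = p < alpha,
    by_statistic = abs(z) > z_crit,
    by_confint   = mu < ci[1] || mu > ci[2])
}

res_a <- z_test_criteria(-0.7, mu = 0, sigma = 1, n = 10)  # value tested above
res_b <- z_test_criteria(-0.5, mu = 0, sigma = 1, n = 10)  # assumed second value
rbind("xbar = -0.7" = res_a, "xbar = -0.5" = res_b)
```

Within each row the three criteria give the same decision, since they are restatements of one another.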

#### ZTest function

Without going through these manual steps, we can run the one sample Z-test with known population standard deviation using the `ZTest` function from the `DescTools` package. Since the function can't run on a single sample mean value, we provide a small sample whose mean is the value we tested above:

In [None]:
set.seed(10)
samptest1 <- normalize(rnorm(sizesamp)) * pop1sd + sampm1

In [None]:
mean(samptest1)
sd(samptest1)

In [None]:
ZTest(samptest1, mu = pop1mu, sd_pop = pop1sd, conf.level = 1 - siglev)

We see that $H_0$ is rejected since:

- p-value of 0.027 < 0.05
- z statistic of -2.21 < -1.96
- Population mean of 0 is not within the confidence interval between -1.32 and -0.08.

### H0 not rejected

Now, keeping all else equal, we change the significance level $\alpha$ from 0.05 to 0.01, so that a sample mean has to be further away from the population mean in order to reject $H_0$. The confidence level ($1 - \alpha$) rises from 95% to 99%.

In [None]:
siglev <- 0.01

#### P-value and Type I error

The calculation of P-value doesn't change

In [None]:
pval <- pnorm(sampm1, pop1mu, se1)
pval <- 2 * min(pval, 1-pval)
pval

However the conclusion when we compare with the new significance level is different:

In [None]:
pval < siglev

Since 0.027 > 0.01, we **do not reject** $H_{0}$

#### Test statistic

We calculate the new critical test statistic:

In [None]:
critz05 <- qnorm(siglev/2, lower.tail = F) # the name is kept, but it now holds the critical value for alpha = 0.01
critz05

Test statistic of the sample mean is the same:

In [None]:
sampz1

However the comparison is different now:

In [None]:
sampz1 > critz05

The test statistic is not more extreme than the critical value (2.21 < 2.58). So we **do not reject** $H_0$.

#### Confidence interval

Let's calculate the new confidence interval using the new critical test statistic:

In [None]:
confint1 <- round(sampm1 + c(-1, 1) * critz05 * se1, 3)
confint1

And check whether population mean is within this interval now:

In [None]:
pop1mu %between% confint1

It is: -1.515 < 0 < 0.115. Population mean 0 is within the confidence interval around the sample mean, so we **do not reject** $H_{0}$

#### ZTest function

We automate the test again with `ZTest` function from `DescTools` package:

In [None]:
set.seed(10)
samptest1 <- normalize(rnorm(sizesamp)) * pop1sd + sampm1

In [None]:
mean(samptest1)
sd(samptest1)

In [None]:
ZTest(samptest1, mu = pop1mu, sd_pop = pop1sd, conf.level = 1 - siglev)

In [None]:
qnorm(0.005)

We see that $H_0$ is not rejected since:

- p-value of 0.027 > 0.01
- z statistic of -2.21 > -2.58
- Population mean of 0 is within the confidence interval between -1.51 and 0.11.

We **reject** $H_0$ at $\alpha = 0.05$ while we **do not reject** $H_0$ at $\alpha = 0.01$ with the same sample mean and population parameters. What happened?

## Type I and Type II Errors, Power of Test, Effect of Sample Size

### Type I and Type II errors

At a certain two-sided significance level $\alpha$, there is an $\alpha$ probability that the sample mean value can be more extreme (away from the population mean) than the quantile that corresponds to the half of $\alpha$ in any of the tails.

A second meaning of $\alpha$ is that there is an $\alpha$ probability of rejecting $H_0$ (concluding that the sample is not from the hypothesized population) when we shouldn't reject it (the sample is in fact from the hypothesized population).

So by changing the significance level from 0.05 to 0.01, it became harder to reject the $H_0$ and we decreased Type I error. However other things being equal that decrease in Type I error comes at a cost.
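
To put a number on that cost, the sketch below computes the probability of rejecting $H_0$ (the power of the test) when the sample actually comes from a different population. The alternative mean of -1 is an assumption made here for illustration, and `z_test_power` is a hypothetical helper name. The other values ($\mu = 0$, $\sigma = 1$, $n = 10$) repeat the parameters used so far.

```r
# Sketch: power of the two-sided Z-test against an assumed true mean
z_test_power <- function(mu_true, mu0, sigma, n, alpha) {
  se     <- sigma / sqrt(n)
  z_crit <- qnorm(alpha / 2, lower.tail = FALSE)
  # Boundaries of the rejection region for the sample mean under H0
  lower <- mu0 - z_crit * se
  upper <- mu0 + z_crit * se
  # Probability that a sample mean from the true population falls in it
  pnorm(lower, mean = mu_true, sd = se) +
    pnorm(upper, mean = mu_true, sd = se, lower.tail = FALSE)
}

power05 <- z_test_power(mu_true = -1, mu0 = 0, sigma = 1, n = 10, alpha = 0.05)
power01 <- z_test_power(mu_true = -1, mu0 = 0, sigma = 1, n = 10, alpha = 0.01)

round(c(power_0.05 = power05, type2_0.05 = 1 - power05,
        power_0.01 = power01, type2_0.01 = 1 - power01), 3)
```

Under these assumptions, moving from $\alpha = 0.05$ to $\alpha = 0.01$ lowers the power and raises the probability of failing to reject a false $H_0$ (the Type II error).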

Now let's draw samples from two more candidate populations. When we have a sample at hand, we don't know which population it is drawn from:

True population parameters:

In [None]:
pop2mu <- -1
pop2sd <- 1
pop3mu <- 1
pop3sd <- 1

Generated populations:

In [None]:
pop2 <- normalize(rnorm(popsize))
pop2 <- pop2 * pop2sd + pop2mu

pop3 <- normalize(rnorm(popsize))
pop3 <- pop3 * pop3sd + pop3mu

Samples drawn from populations:

In [None]:
samps2 <- replicate(nsamps, sample(pop2, sizesamp), simplify = F)
samps3 <- replicate(nsamps, sample(pop3, sizesamp), simplify = F)

Sampling distributions:

In [None]:
sampmus2 <- sapply(samps2, mean)
sampmus3 <- sapply(samps3, mean)
sampmus2dt <- data.table(vals = sampmus2, pop = 2, ind = 1, type = "sampmu")
sampmus3dt <- data.table(vals = sampmus3, pop = 3, ind = 1, type = "sampmu")
sampdist_dt <- rbind(sampmus1dt, sampmus2dt, sampmus3dt)
popmus <- data.table(pop = 1:3, popmu = c(pop1mu, pop2mu, pop3mu), popsd = c(pop1sd, pop2sd, pop3sd))
sampdist_dt <- sampdist_dt %>% left_join(popmus, by = "pop")# %>% left_join(lwdt, by = "type")

In [None]:
sampdist_dt %>% head

Now we will add the quantile information for some certain p-values - the two-sided areas under the tail of the normal distribution:

In [None]:
sampdist_dt[1, line001l := qnorm(0.01/2, popmu, popsd /sqrt(sizesamp))]
sampdist_dt[1, line005l := qnorm(0.05/2, popmu, popsd /sqrt(sizesamp))]
sampdist_dt[1, line010l := qnorm(0.1/2, popmu, popsd /sqrt(sizesamp))]

In [None]:
sampdist_dt[1, line001r := qnorm(0.01/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]
sampdist_dt[1, line005r := qnorm(0.05/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]
sampdist_dt[1, line010r := qnorm(0.1/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]

In [None]:
sampdist_dt[, ind2 := .GRP, by = c("ind", "type", "pop")]

Let's plot the densities of sampling distributions together along with vertical lines at quantiles for different significance levels:

In [None]:
p1b <- sampdist_dt %>%
mutate_at("vals", round, 4) %>%
mutate_at("pop", factor) %>%
mutate_at("type", factor, levels = c("pop", "sampmu", "samp")) %>%
ggplot(aes(x = vals, color = pop)) +
geom_density(aes(linewidth = type, group = ind2), adjust = 2) +
geom_vline(aes(xintercept = popmu, color = pop)) +
geom_vline(aes(xintercept = line010l, color = pop), linetype = 5) +
geom_vline(aes(xintercept = line010r, color = pop), linetype = 5) +
geom_vline(aes(xintercept = line005l, color = pop), linetype = 2) +
geom_vline(aes(xintercept = line005r, color = pop), linetype = 2) +
geom_vline(aes(xintercept = line001l, color = pop), linetype = 3) +
geom_vline(aes(xintercept = line001r, color = pop), linetype = 3) +
scale_linewidth_manual(values = c(pop = 1, sampmu = 0.5, samp = 0.02)) +
scale_color_manual(values = c("1" = "red", "2" = "blue", "3" = "green")) +
xlim(c(-5, 5))

p1bp <- ggplotly(p1b)  %>%
          layout(legend = list(orientation = "h",
                     xanchor = "center",
                     x = 0.5, y = -0.6, title = list(text=NULL)), height = 600)

We will also create a separate chart that shows the two-sided tail probabilities under the sampling distributions of the three populations. The tail probability at a quantile is two times the smaller of the cumulative probability and 1 minus the cumulative probability at that quantile. We add vertical lines at the quantiles that leave left and right tail areas of $\alpha/2 = 0.005, 0.025, 0.05$ under the sampling distribution of population 1:

In [None]:
qvals <- seq(-5, 5, 0.01)

In [None]:
pop1pdt <- mapply(function(x, y, z) data.table(qvals, popp = pnorm(qvals, y, z / sqrt(sizesamp)) %>% round(3), pop = x),
       1:3, c(pop1mu, pop2mu, pop3mu), c(pop1sd, pop2sd, pop3sd), SIMPLIFY = F) %>% rbindlist
pop1pdt[, popp := round(2 * pmin(popp, 1 - popp), 3)]
pop1pdt[1, line001l := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line005l := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line010l := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line001r := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1pdt[1, line005r := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1pdt[1, line010r := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
p2b <- pop1pdt %>%
mutate_at("pop", factor) %>%
ggplot(aes(x = qvals, y = popp)) +
geom_line(aes(color = pop)) +
geom_vline(aes(xintercept = line010l), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005l), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001l), linetype = 3, color = "red") +
geom_vline(aes(xintercept = line010r), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005r), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001r), linetype = 3, color = "red")
p2bp <- ggplotly(p2b)

And we sum the tail probabilities at each quantile across the three populations in a separate chart:

In [None]:
poppdt <- pop1pdt[, .(popp = sum(popp)), by = qvals]
poppdt[1, line001l := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
poppdt[1, line005l := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
poppdt[1, line010l := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
poppdt[1, line001r := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
poppdt[1, line005r := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
poppdt[1, line010r := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
p3b <- poppdt %>%
ggplot(aes(x = qvals, y = popp)) +
geom_line(color = "red") +
geom_vline(aes(xintercept = line010l), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005l), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001l), linetype = 3, color = "red") +
geom_vline(aes(xintercept = line010r), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005r), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001r), linetype = 3, color = "red")
p3bp <- ggplotly(p3b)

And we combine three charts:

In [None]:
p1bp2 <- p1bp %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p2bp2 <- p2bp %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p3bp2 <- p3bp %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

subplot(p1bp2, p2bp2, p3bp2, nrows = 3, shareX = TRUE) %>%
  layout(
    hovermode = "x unified",
      annotations = list(
    list(y = 1, text = "Sampling Distribution", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.67, text = "P-Values for All Populations", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.33, text = "Sum of P-Values for All Populations", showarrow = F, xref='paper', yref='paper')),
      height = 900
  )

In the first plot, when we decrease the significance level $\alpha$, the rejection region moves further away from the mean of population 1, hence Type I error decreases. However, the tail probability calculated at those bounds from the sampling distributions of the other plausible populations increases as we try to decrease Type I error.

The probability that we fail to reject $H_0$ when it is in fact false is known as **Type II error** and is denoted by $\beta$. In the charts above it is represented by the tail probability calculated under the sampling distribution of a different population. The level of Type II error depends on the alternative population mean it is calculated for, so there is no single $\beta$ value.
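
For a two-sided test with known $\sigma$, the Type II error for one specific alternative mean can also be written down directly. As a sketch, with $\mu_0$ the hypothesized mean, $\mu_1$ an assumed alternative mean, $\Phi$ the standard normal CDF and $z_{\alpha/2}$ the upper $\alpha/2$ quantile of the standard normal distribution (this notation is introduced here for illustration only):

$$\beta(\mu_1) = \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} + z_{\alpha/2}\right) - \Phi\left(\frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} - z_{\alpha/2}\right)$$

This is the probability that the sample mean lands between the two rejection bounds when the true mean is $\mu_1$. For $\mu_1 = \mu_0$ it reduces to $1 - \alpha$.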

The confusion matrix of the true state of $H_0$ against the decision to reject or not reject $H_0$ is given below:

<table class="wikitable">
<tbody><tr>
<th rowspan="2" colspan="2">Table of error types</th>
<th colspan="2">Null hypothesis ($H_0$) is</th></tr>
<tr>
<th>True</th>
<th>False</th></tr>
<tr>
<th rowspan="2">Decision<br>about null<br>hypothesis ($H_0$)</th>
<th>Not reject</th>
<td style="text-align:center;">Correct inference<br>(true negative)<br>(probability = $1-\alpha$)</td>
<td style="text-align:center;">Type II error<br>(false negative)<br>(probability = $\beta$)</td></tr>
<tr>
<th>Reject</th>
<td style="text-align:center;">Type I error<br>(false positive)<br>(probability = $\alpha$)</td>
<td style="text-align:center;">Correct inference<br>(true positive)<br>(probability = $1-\beta$)</td></tr></tbody></table>

(https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Table_of_error_types)

The probability that we reject the null hypothesis when it is actually false is $1 - \beta$, also known as the **power** of the test. It is the probability of detecting an effect (i.e. rejecting the null hypothesis) given that some prespecified effect actually exists, using a given test in a given context.

(https://en.wikipedia.org/wiki/Power_(statistics))
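
Power can also be estimated by simulation, in the spirit of the rest of this notebook. The sketch below draws many samples from an assumed alternative population and counts how often a two-sided z-test rejects $H_0$. The alternative mean of 1, $\sigma = 1$, sample size 10 and all names ending in `_ex` are illustrative assumptions:

```r
set.seed(1)

# Hypothetical setting for illustration only
mu0_ex    <- 0      # mean under H0
mu_alt_ex <- 1      # assumed true mean of the alternative population
sigma_ex  <- 1      # known population standard deviation
n_ex      <- 10     # sample size
alpha_ex  <- 0.05   # significance level
n_rep_ex  <- 5000   # number of simulated samples

z_crit_ex <- qnorm(alpha_ex / 2, lower.tail = FALSE)

# Means of samples drawn from the alternative population
means_ex <- replicate(n_rep_ex, mean(rnorm(n_ex, mu_alt_ex, sigma_ex)))

# Standardize against the hypothesized mean, as the test would do
z_ex <- (means_ex - mu0_ex) / (sigma_ex / sqrt(n_ex))

# Share of samples for which H0 is (correctly) rejected
power_sim_ex <- mean(abs(z_ex) > z_crit_ex)
power_sim_ex
```

The estimate depends on the assumed alternative mean: moving `mu_alt_ex` closer to `mu0_ex` lowers the power.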

In our example while the population mean for the null hypothesis is 0, the alternative means of other populations are set as -1 and 1.

In this particular example, since all three sampling distributions overlap to some extent, the sum of Type I and Type II errors does not go below 0.23 for any sample mean value:

- Increasing the significance level $\alpha$ increases Type I error while decreasing Type II error
- Decreasing the significance level $\alpha$ decreases Type I error while increasing Type II error

So there is a sort of trade-off between Type I and Type II errors when holding all other things equal and only changing the significance level.
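
The trade-off can be made concrete with a small sketch. It computes $\beta$ directly as the probability that the sample mean falls between the rejection bounds when the true mean is an assumed alternative. Note that this is not the same quantity as the summed tail-probability curve in the third subplot, so the numbers are not expected to match that chart. The alternative mean of 1, $\sigma = 1$, sample size 10 and the `_ex` names are illustrative assumptions:

```r
# Hypothetical setting for illustration only
mu0_ex    <- 0
mu_alt_ex <- 1
sigma_ex  <- 1
n_ex      <- 10
se_ex     <- sigma_ex / sqrt(n_ex)
alphas_ex <- c(0.10, 0.05, 0.01)

# Rejection bounds under H0 for each alpha
lower_ex <- qnorm(alphas_ex / 2, mu0_ex, se_ex)
upper_ex <- qnorm(alphas_ex / 2, mu0_ex, se_ex, lower.tail = FALSE)

# Type II error: probability of landing between the bounds under the alternative
beta_ex <- pnorm(upper_ex, mu_alt_ex, se_ex) - pnorm(lower_ex, mu_alt_ex, se_ex)

data.frame(alpha = alphas_ex,
           beta  = round(beta_ex, 3),
           power = round(1 - beta_ex, 3))
```

With everything else fixed, each step down in $\alpha$ comes with a step up in $\beta$.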

How can we decrease Type I error while keeping the power of the test high (that is, without increasing Type II error)? Let's increase the sample sizes and repeat everything else:

### Larger sample size

Now we increase the sample size from 10 to 50:

In [None]:
sizesamp <- 50
sizesamp

Samples drawn from populations:

In [None]:
samps1 <- replicate(nsamps, sample(pop1, sizesamp), simplify = F)
samps2 <- replicate(nsamps, sample(pop2, sizesamp), simplify = F)
samps3 <- replicate(nsamps, sample(pop3, sizesamp), simplify = F)

Sampling distributions:

In [None]:
sampmus1 <- sapply(samps1, mean)
sampmus2 <- sapply(samps2, mean)
sampmus3 <- sapply(samps3, mean)
sampmus1dt <- data.table(vals = sampmus1, pop = 1, ind = 1, type = "sampmu")
sampmus2dt <- data.table(vals = sampmus2, pop = 2, ind = 1, type = "sampmu")
sampmus3dt <- data.table(vals = sampmus3, pop = 3, ind = 1, type = "sampmu")
sampdist_dt <- rbind(sampmus1dt, sampmus2dt, sampmus3dt)
popmus <- data.table(pop = 1:3, popmu = c(pop1mu, pop2mu, pop3mu), popsd = c(pop1sd, pop2sd, pop3sd))
sampdist_dt <- sampdist_dt %>% left_join(popmus, by = "pop")# %>% left_join(lwdt, by = "type")

In [None]:
sampdist_dt %>% head

Now we will add the quantiles for selected significance levels, i.e. the bounds that leave the given two-sided tail area under the normal sampling distribution:

In [None]:
sampdist_dt[1, line001l := qnorm(0.01/2, popmu, popsd /sqrt(sizesamp))]
sampdist_dt[1, line005l := qnorm(0.05/2, popmu, popsd /sqrt(sizesamp))]
sampdist_dt[1, line010l := qnorm(0.1/2, popmu, popsd /sqrt(sizesamp))]

In [None]:
sampdist_dt[1, line001r := qnorm(0.01/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]
sampdist_dt[1, line005r := qnorm(0.05/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]
sampdist_dt[1, line010r := qnorm(0.1/2, popmu, popsd /sqrt(sizesamp), lower.tail = F)]

In [None]:
sampdist_dt[, ind2 := .GRP, by = c("ind", "type", "pop")]

Let's plot the densities of sampling distributions together along with vertical lines at quantiles for different significance levels:

In [None]:
p1b <- sampdist_dt %>%
mutate_at("vals", round, 4) %>%
mutate_at("pop", factor) %>%
mutate_at("type", factor, levels = c("pop", "sampmu", "samp")) %>%
ggplot(aes(x = vals, color = pop)) +
geom_density(aes(linewidth = type, group = ind2), adjust = 2) +
geom_vline(aes(xintercept = popmu, color = pop)) +
geom_vline(aes(xintercept = line010l, color = pop), linetype = 5) +
geom_vline(aes(xintercept = line010r, color = pop), linetype = 5) +
geom_vline(aes(xintercept = line005l, color = pop), linetype = 2) +
geom_vline(aes(xintercept = line005r, color = pop), linetype = 2) +
geom_vline(aes(xintercept = line001l, color = pop), linetype = 3) +
geom_vline(aes(xintercept = line001r, color = pop), linetype = 3) +
scale_linewidth_manual(values = c(pop = 1, sampmu = 0.5, samp = 0.02)) +
scale_color_manual(values = c("1" = "red", "2" = "blue", "3" = "green")) +
xlim(c(-5, 5))

p1bp <- ggplotly(p1b)  %>%
          layout(legend = list(orientation = "h",
                     xanchor = "center",
                     x = 0.5, y = -0.6, title = list(text=NULL)), height = 600)

We will also create a separate chart that shows the two-sided tail probabilities under the sampling distributions of the three populations. The tail probability at a quantile is two times the smaller of the cumulative probability and 1 minus the cumulative probability at that quantile. We add vertical lines at the quantiles that leave left and right tail areas of $\alpha/2 = 0.005, 0.025, 0.05$ under the sampling distribution of population 1:

In [None]:
qvals <- seq(-5, 5, 0.01)

In [None]:
pop1pdt <- mapply(function(x, y, z) data.table(qvals, popp = pnorm(qvals, y, z / sqrt(sizesamp)) %>% round(3), pop = x),
       1:3, c(pop1mu, pop2mu, pop3mu), c(pop1sd, pop2sd, pop3sd), SIMPLIFY = F) %>% rbindlist
pop1pdt[, popp := round(2 * pmin(popp, 1 - popp), 3)]
pop1pdt[1, line001l := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line005l := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line010l := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
pop1pdt[1, line001r := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1pdt[1, line005r := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
pop1pdt[1, line010r := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
p2b <- pop1pdt %>%
mutate_at("pop", factor) %>%
ggplot(aes(x = qvals, y = popp)) +
geom_line(aes(color = pop)) +
geom_vline(aes(xintercept = line010l), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005l), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001l), linetype = 3, color = "red") +
geom_vline(aes(xintercept = line010r), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005r), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001r), linetype = 3, color = "red")
p2bp <- ggplotly(p2b)

And we sum the tail probabilities at each quantile across the three populations in a separate chart:

In [None]:
poppdt <- pop1pdt[, .(popp = sum(popp)), by = qvals]
poppdt[1, line001l := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
poppdt[1, line005l := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
poppdt[1, line010l := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp)) %>% round(3)]
poppdt[1, line001r := qnorm(0.01/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
poppdt[1, line005r := qnorm(0.05/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
poppdt[1, line010r := qnorm(0.10/2, pop1mu, pop1sd /sqrt(sizesamp), lower.tail = F) %>% round(3)]
p3b <- poppdt %>%
ggplot(aes(x = qvals, y = popp)) +
geom_line(color = "red") +
geom_vline(aes(xintercept = line010l), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005l), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001l), linetype = 3, color = "red") +
geom_vline(aes(xintercept = line010r), linetype = 5, color = "red") +
geom_vline(aes(xintercept = line005r), linetype = 2, color = "red") +
geom_vline(aes(xintercept = line001r), linetype = 3, color = "red")
p3bp <- ggplotly(p3b)

And we combine three charts:

In [None]:
p1bp2 <- p1bp %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p2bp2 <- p2bp %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

p3bp2 <- p3bp %>%
  layout(
    xaxis = list(
      showspikes = TRUE, 
      spikemode = "across", 
      spikesnap = "cursor", 
      spikethickness = 1.5,
      spikecolor = "black"
    ),
    yaxis = list(showspikes = FALSE)
  )

subplot(p1bp2, p2bp2, p3bp2, nrows = 3, shareX = TRUE) %>%
  layout(
    hovermode = "x unified",
      annotations = list(
    list(y = 1, text = "Sampling Distribution", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.67, text = "P-Values for All Populations", showarrow = F, xref='paper', yref='paper'),
    list(y = 0.33, text = "Sum of P-Values for All Populations", showarrow = F, xref='paper', yref='paper')),
      height = 900
  )

When we increase the sample size, the standard errors of the sampling distributions get smaller in line with the square root law, so the sampling distributions are narrower and mostly do not overlap.

When we decrease the significance level from 0.1 to 0.05 and 0.01, the Type II error does not visibly increase. In the previous example with a sample size of 10, at the 0.01 significance level, the sum of Type I and Type II errors in the third subplot was 0.56. In the second example with a sample size of 50, Type II error is almost zero, so the total error comes only from Type I error and is 0.01.

So the way to increase the power of a test while also decreasing Type I error, without a trade-off between the two error types, is to increase the sample size.
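
The effect of the sample size can be sketched numerically as well. The block below keeps $\alpha$ fixed at 0.01 and computes $\beta$ directly (the probability of landing between the rejection bounds under an assumed alternative) for several sample sizes. The alternative mean of 1, $\sigma = 1$, the chosen sample sizes and the `_ex` names are illustrative assumptions:

```r
# Hypothetical setting for illustration only
mu0_ex    <- 0
mu_alt_ex <- 1
sigma_ex  <- 1
alpha_ex  <- 0.01
n_ex      <- c(10, 20, 50, 100)

# Standard error shrinks with the square root of the sample size
se_ex <- sigma_ex / sqrt(n_ex)

# Rejection bounds under H0, one pair per sample size
lower_ex <- qnorm(alpha_ex / 2, mu0_ex, se_ex)
upper_ex <- qnorm(alpha_ex / 2, mu0_ex, se_ex, lower.tail = FALSE)

# Type II error under the alternative for each sample size
beta_ex <- pnorm(upper_ex, mu_alt_ex, se_ex) - pnorm(lower_ex, mu_alt_ex, se_ex)

data.frame(n     = n_ex,
           se    = round(se_ex, 3),
           beta  = signif(beta_ex, 3),
           power = signif(1 - beta_ex, 3))
```

At a fixed $\alpha$, $\beta$ falls quickly as the sample size grows, because the two sampling distributions separate.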

# One sample - Unknown population standard deviation

## T-Distribution Simulation

We sample again from the first population with a size of 10, as we did in the first example:

In [None]:
sizesamp <- 10

Let's increase the number of samples:

In [None]:
nsamps <- 1e3

In [None]:
set.seed(20)
samps1 <- replicate(nsamps, sample(pop1, sizesamp), simplify = F)
sampmus1 <- sapply(samps1, mean)

Now let's calculate the standardized Z-scores - test statistics - of the sample means:

In [None]:
sampz_pop <- (sampmus1 - pop1mu) / (pop1sd / sqrt(sizesamp))

See the distribution:

In [None]:
hist(sampz_pop)

And summary values:

In [None]:
var(sampz_pop)
summary(sampz_pop)

That is something we can do when we know the true population standard deviation. However, we usually do not have this value, except for some specific cases such as population proportions, which follow the binomial distribution so that the standard deviation can be calculated from the proportion.

The best we can have as an indicator of population $\sigma$ is the sample standard deviation $s$. However that is also subject to a large variation:

In [None]:
sampsds1 <- sapply(samps1, sd)

The histogram of standard deviations:

In [None]:
hist(sampsds1)

We know that the sum of squares of k independent values drawn from the standard normal distribution follows a $\chi^2$ distribution with k degrees of freedom. Note that the above histogram shows the distribution of standard deviations, not of sums of squares.
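
The link between sample variances and the $\chi^2$ distribution can be checked with a short simulation. For samples of size n from a normal population, the scaled sample variance $(n-1)s^2/\sigma^2$ follows a $\chi^2$ distribution with n - 1 degrees of freedom, whose mean is n - 1 and whose variance is 2(n - 1). The sketch below uses freshly drawn normal samples rather than the notebook's `samps1`, and all `_ex` names are illustrative assumptions:

```r
set.seed(2)

# Hypothetical setting for illustration only
n_ex     <- 10     # sample size
sigma_ex <- 1      # population standard deviation
n_rep_ex <- 5000   # number of simulated samples

# Sample variance of each simulated sample
vars_ex <- replicate(n_rep_ex, var(rnorm(n_ex, 0, sigma_ex)))

# Scaled sample variances, expected to follow chi-square with n - 1 df
scaled_ex <- (n_ex - 1) * vars_ex / sigma_ex^2

rbind(
  simulated = c(mean = mean(scaled_ex), variance = var(scaled_ex)),
  theory    = c(mean = n_ex - 1,        variance = 2 * (n_ex - 1))
)
```

Overlaying `dchisq(x, df = n_ex - 1)` on a histogram of `scaled_ex` gives a visual version of the same check.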

Now let's calculate the test statistics of sample means again, but this time using the sample standard deviation s of each sample:

In [None]:
sampz_samp <- (sampmus1 - pop1mu) / (sampsds1  / sqrt(sizesamp))

And see the distribution of test statistics:

In [None]:
hist(sampz_samp)

And see the summary values:

In [None]:
var(sampz_samp)
summary(sampz_samp)

The test statistics calculated by sample standard deviations have a higher variance and more extreme values.

While the distribution of test statistics calculated with the population standard deviation has a kurtosis close to 3, the value for a normal distribution under the Pearson definition used by the `kurtosis` function from `moments`, the distribution of test statistics calculated with the sample standard deviation has a higher kurtosis, reflecting more extreme values and fat tails:

In [None]:
kurtosis(sampz_pop)
kurtosis(sampz_samp)

The test statistics calculated with sample standard deviations are t-distributed with degrees of freedom equal to the sample size minus 1. Let's draw a random sample from the t-distribution with n - 1 degrees of freedom:

In [None]:
set.seed(20)
tsamp <- rt(nsamps, sizesamp - 1)

Its histogram:

In [None]:
hist(tsamp)

And summary values:

In [None]:
var(tsamp)
summary(tsamp)
kurtosis(tsamp)

The summary statistics are similar to those of the test statistics we get by using sample standard deviations.
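
For reference, the t-distribution with $\nu$ degrees of freedom has variance $\nu/(\nu-2)$ for $\nu > 2$ and Pearson kurtosis $3 + 6/(\nu-4)$ for $\nu > 4$. The sketch below compares these theoretical values for 9 degrees of freedom with a large simulated sample. The kurtosis is computed by hand so that the block needs no extra package, and the `_ex` names are illustrative assumptions:

```r
set.seed(5)

df_ex <- 9   # degrees of freedom: sample size 10 minus 1

# Theoretical variance and Pearson kurtosis of the t-distribution
theory_ex <- c(variance = df_ex / (df_ex - 2),
               kurtosis = 3 + 6 / (df_ex - 4))

# Large simulated sample from the same t-distribution
t_draws_ex  <- rt(1e5, df_ex)
centered_ex <- t_draws_ex - mean(t_draws_ex)

# Pearson kurtosis: fourth central moment over squared second central moment
sim_ex <- c(variance = var(t_draws_ex),
            kurtosis = mean(centered_ex^4) / mean(centered_ex^2)^2)

round(rbind(theory = theory_ex, simulated = sim_ex), 3)
```

The simulated kurtosis is noisy even with many draws because it is driven by a few extreme values, so it will not match the theoretical value exactly.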

Now let's get the density of t-distribution with n - 1 df:

In [None]:
dens_t9 <- dt(qvals, sizesamp - 1)

And compare the densities:

In [None]:
plot(density(sampz_samp), col = "black")
lines(qvals, dens_t9, col = "red")

They almost overlap. So test statistics calculated using the sample standard deviations of samples of size n are t-distributed with n - 1 degrees of freedom.

Now let's combine the test-statistic distributions using population sd and sample sds, and add the quantile values for different $\alpha$ levels for both distributions:

In [None]:
sampt_dt <- rbind(data.table(vals = sampz_pop, type = "pop_sd"),
                  data.table(vals = sampz_samp, type = "samp_sd"))
# z-scores are standardized, so their reference distribution is N(0, 1) whatever pop1sd is
sampt_dt[1, line001l := qnorm(0.01/2, 0, 1)]
sampt_dt[1, line005l := qnorm(0.05/2, 0, 1)]
sampt_dt[1, line010l := qnorm(0.1/2, 0, 1)]
sampt_dt[1, line001r := qnorm(0.01/2, 0, 1, lower.tail = F)]
sampt_dt[1, line005r := qnorm(0.05/2, 0, 1, lower.tail = F)]
sampt_dt[1, line010r := qnorm(0.1/2, 0, 1, lower.tail = F)]
sampt_dt[1, tline001l := qt(0.01/2, sizesamp - 1)]
sampt_dt[1, tline005l := qt(0.05/2, sizesamp - 1)]
sampt_dt[1, tline010l := qt(0.1/2, sizesamp - 1)]
sampt_dt[1, tline001r := qt(0.01/2, sizesamp - 1, lower.tail = F)]
sampt_dt[1, tline005r := qt(0.05/2, sizesamp - 1, lower.tail = F)]
sampt_dt[1, tline010r := qt(0.1/2, sizesamp - 1, lower.tail = F)]

In [None]:
p1c <- sampt_dt %>%
mutate_at("vals", round, 4) %>%
mutate_at("type", factor, levels = c("pop_sd", "samp_sd", "tdist")) %>%
ggplot(aes(x = vals, color = type)) +
geom_density(aes(group = type), adjust = 2) +
geom_vline(aes(xintercept = 0, color = "red")) +
geom_vline(aes(xintercept = line010l, color = "red"), linetype = 5) +
geom_vline(aes(xintercept = line010r, color = "red"), linetype = 5) +
geom_vline(aes(xintercept = line005l, color = "red"), linetype = 2) +
geom_vline(aes(xintercept = line005r, color = "red"), linetype = 2) +
geom_vline(aes(xintercept = line001l, color = "red"), linetype = 3) +
geom_vline(aes(xintercept = line001r, color = "red"), linetype = 3) +
geom_vline(aes(xintercept = tline010l, color = "blue"), linetype = 5) +
geom_vline(aes(xintercept = tline010r, color = "blue"), linetype = 5) +
geom_vline(aes(xintercept = tline005l, color = "blue"), linetype = 2) +
geom_vline(aes(xintercept = tline005r, color = "blue"), linetype = 2) +
geom_vline(aes(xintercept = tline001l, color = "blue"), linetype = 3) +
geom_vline(aes(xintercept = tline001r, color = "blue"), linetype = 3) +
xlim(c(-5, 5))

p1cp <- ggplotly(p1c)  %>%
          layout(legend = list(orientation = "h",
                     xanchor = "center",
                     x = 0.5, y = -0.6, title = list(text=NULL)), height = 600)

In [None]:
p1cp

We see that the distribution of test statistics using sample standard deviations has more probability mass in the tails (fatter tails), and hence the critical values for different $\alpha$ levels are further away from the mean:

In [None]:
critical_stats <- data.table(alpha = c(0.1, 0.05, 0.01))

In [None]:
critical_stats[, norm := qnorm(alpha / 2, lower.tail = F) %>% round(2)]
critical_stats[, tdist := qt(alpha / 2, sizesamp - 1, lower.tail = F) %>% round(2)]

In [None]:
critical_stats

When the df is 1, the t-distribution becomes the Cauchy distribution. As the df approaches $\infty$, the t-distribution converges to the normal distribution.
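
Both limits can be seen in the critical values. The following sketch compares the upper 2.5% quantile of the t-distribution for a range of degrees of freedom with the Cauchy and normal quantiles. The chosen degrees of freedom and the `_ex` names are illustrative assumptions:

```r
p_ex   <- 0.975                       # upper quantile for a two-sided alpha of 0.05
dfs_ex <- c(1, 2, 9, 30, 100, 1000)   # degrees of freedom to compare

# Critical values of the t-distribution shrink as df grows
t_crit_ex <- qt(p_ex, dfs_ex)
data.frame(df = dfs_ex, t_crit = round(t_crit_ex, 3))

# The two limiting cases
cauchy_crit_ex <- qcauchy(p_ex)   # matches the t quantile at df = 1
normal_crit_ex <- qnorm(p_ex)     # approached as df grows
c(cauchy = cauchy_crit_ex, normal = normal_crit_ex)
```

With a sample size of 10 (df = 9) the t critical value is still visibly larger than the normal one, which is why using the normal quantile with an estimated standard deviation would reject too often.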

## T-test Example

Now let's find, among the samples we created above, the one whose mean is closest to the example sample mean we used before:

In [None]:
sampm1

In [None]:
sampleind <- which.min(abs(sampmus1 - sampm1)) # index of the single sample whose mean is closest to sampm1
sampleind

In [None]:
smp <- samps1[[sampleind]]

In [None]:
smp

Let's test whether this sample's mean is significantly different from the population mean at $\alpha = 0.05$ significance level.

So our $H_0$ is **the population that the sample is drawn from has a mean of 0**

Let's first get the sample mean:

In [None]:
smpm <- mean(smp)
smpm

And sample standard deviation:

In [None]:
smpsd <- sd(smp)
smpsd

Our hypothesized population mu is the same:

In [None]:
pop1mu

And the standard error is:

In [None]:
se <- smpsd/sqrt(sizesamp)
se

And let's set significance level:

In [None]:
siglev <- 0.05
siglev

### t-test statistic

First let's calculate the t-test statistic, standardized score of the sample mean using standard error of the sample:

In [None]:
tval <- (smpm - pop1mu)/se
tval

And let's calculate the critical test statistic again:

In [None]:
tcrit05 <- qt(siglev/2, df = sizesamp - 1, lower.tail = F)
tcrit05

The absolute value of the t-test statistic of the sample is smaller than the critical test statistic:

|-1.99| < 2.26

So **we do not reject $H_0$**.

Let's confirm with other methods.

### P-value

In [None]:
pval <- pt(tval, df = sizesamp - 1)
pval <- 2 * min(pval, 1 - pval)
pval

The p-value of the t-test statistic of the sample, 0.078, is larger than the $\alpha$ level of 0.05. So **we do not reject $H_0$**.

### Confidence interval

Using the critical test-statistic, let's calculate the lower and upper bounds of the confidence interval around the sample mean:

In [None]:
confintt <- round(smpm + c(-1, 1) * tcrit05 * se, 3)
confintt

In [None]:
pop1mu %between% confintt

Since hypothesized population mean is within this interval:

-1.48 < 0 < 0.095

**we do not reject $H_0$**

### t-test function

We can automate the two-sided t-test using the base `t.test` function:

In [None]:
t.test(smp, mu = pop1mu, conf.level = 1 - siglev)

We confirm the same conclusions:

- Absolute value of t-test statistic, 1.99, is smaller than the critical test statistic of 2.26 (which is not reported here in the output)
- P-value of 0.078 is greater than $\alpha$ level of 0.05
- Population mean of 0 is within the confidence interval between -1.48 and 0.095

So **we do not reject $H_0$**
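
The three decision rules always agree, because they are algebraically the same comparison. The sketch below checks this on a hypothetical sample (freshly generated, not the notebook's `smp`) and compares the manual results with `t.test`. The generating mean of 0.3 and all `_ex` names are illustrative assumptions:

```r
set.seed(3)

# Hypothetical sample for illustration only
x_ex     <- rnorm(10, mean = 0.3, sd = 1)
mu0_ex   <- 0      # hypothesized mean
alpha_ex <- 0.05   # significance level

n_ex      <- length(x_ex)
se_ex     <- sd(x_ex) / sqrt(n_ex)
t_ex      <- (mean(x_ex) - mu0_ex) / se_ex
t_crit_ex <- qt(alpha_ex / 2, df = n_ex - 1, lower.tail = FALSE)
p_ex      <- 2 * pt(-abs(t_ex), df = n_ex - 1)
ci_ex     <- mean(x_ex) + c(-1, 1) * t_crit_ex * se_ex

# Three equivalent ways of making the same decision
reject_by_stat_ex <- abs(t_ex) > t_crit_ex
reject_by_p_ex    <- p_ex < alpha_ex
reject_by_ci_ex   <- mu0_ex < ci_ex[1] || mu0_ex > ci_ex[2]
c(stat = reject_by_stat_ex, p = reject_by_p_ex, ci = reject_by_ci_ex)

# The same quantities as reported by t.test
res_ex <- t.test(x_ex, mu = mu0_ex, conf.level = 1 - alpha_ex)
c(t = unname(res_ex$statistic), p = res_ex$p.value,
  lower = res_ex$conf.int[1], upper = res_ex$conf.int[2])
```

Whatever sample is used, the three logical values are either all `TRUE` or all `FALSE`.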

Now let's change the significance level $\alpha$ to 0.1 and conduct the test again.

In [None]:
siglev <- 0.1

In [None]:
tcrit10 <- qt(siglev/2, df = sizesamp - 1, lower.tail = F)
tcrit10

In [None]:
t.test(smp, mu = pop1mu, conf.level = 1 - siglev)

In this second example:

- Absolute value of t-test statistic, 1.99, is larger than the critical test statistic of 1.83
- P-value of 0.078 is smaller than $\alpha$ level of 0.1
- Population mean of 0 is outside the confidence interval bounded by -1.33 and -0.05

Although the sample values and the hypothesized population mean are the same, with a higher significance level (and a higher Type I error), **we reject $H_0$**.

So we conclude that the sample is not drawn from the population with a mean of 0. However, we know that the sample was in fact drawn from that population. With a higher significance level, we rejected an $H_0$ that was in fact true. That is an example of Type I error.
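
How often does this happen? When $H_0$ is true, the share of samples for which we wrongly reject it should be close to $\alpha$. The sketch below checks this by running a t-test on many samples drawn from a population for which $H_0$ holds. It uses normal draws with mean 0 and standard deviation 1 rather than the notebook's `pop1`, and the `_ex` names are illustrative assumptions:

```r
set.seed(4)

# Hypothetical setting for illustration only
n_rep_ex <- 2000   # number of simulated samples
n_ex     <- 10     # size of each sample

# P-value of a two-sided t-test for each sample; H0 (mean = 0) is true here
pvals_ex <- replicate(n_rep_ex, t.test(rnorm(n_ex, 0, 1), mu = 0)$p.value)

# Share of false rejections at each significance level
alphas_ex   <- c(0.10, 0.05, 0.01)
type1_ex    <- sapply(alphas_ex, function(a) mean(pvals_ex < a))
data.frame(alpha = alphas_ex, rejection_rate = type1_ex)
```

Raising $\alpha$ from 0.05 to 0.1 roughly doubles the share of samples that lead to a false rejection.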

# Resources for Hypothesis Testing

You can get more information and simulated examples on hypothesis testing from these resources:

- Chapters 3-4-5 of Agresti and Kateri 2021, Foundations of Statistics for Data Scientists: With R and Python
- Chapters 3-4 of Vasishth and Broe 2011, The Foundations of Statistics: A Simulation-based Approach
- Chapters 5 and 13 of Kaplan and Pruim 2023, Statistical Modeling: A Fresh Approach (https://statistical-modeling.netlify.app/13-hypothesis-logic)

# Function for Lab Assignment

**by Kemal Arda Elaman**

**Copy this cell into the top of your notebook:** 

In [None]:
library(ggplot2)
library(data.table)

hypo_lab <- function(student_id) {
  
  # Parameters
  alpha <- 0.05
  n <- 50
  mu_null <- 100
  sd_true <- 10
  
  offset_range <- 2
  
  set.seed(student_id)
  
  # Controlled Random True Mean (around CI)
  se <- sd_true / sqrt(n)
  ci_halfwidth <- qt(1 - alpha/2, df = n - 1) * se
  offset <- sample(seq(-offset_range, offset_range, 0.1), 1)
  true_mean <- mu_null + offset * ci_halfwidth
  
  # Generating Data
  data <- rnorm(n, mean = true_mean, sd = sd_true)
  
  # T-Test
  t_test <- t.test(data, mu = mu_null)
  t_val <- as.numeric(t_test$statistic)
  p_val <- t_test$p.value
  mean_data <- mean(data)
  sd_data <- sd(data)
  t_crit <- qt(1 - alpha/2, df = n - 1)
  
  # Summary Table
  summary_dt <- data.table(
    Student_ID  = student_id,
    Alpha       = alpha,
    True_Mean   = round(true_mean, 2),
    Sample_Mean = round(mean_data, 2),
    Sample_SD   = round(sd_data, 2),
    N           = n,
    T_Value     = round(t_val, 3),
    P_Value     = round(p_val, 4),
    CI_Lower    = round(t_test$conf.int[1], 2),
    CI_Upper    = round(t_test$conf.int[2], 2),
    T_Crit      = round(t_crit, 3)
  )
  
  # Visualization
  x <- seq(-4, 4, length = 400)
  y <- dt(x, df = n - 1)
  df_t <- data.frame(x, y)
  
  plt <- ggplot(df_t, aes(x, y)) +
    geom_line(linewidth = 1.2, color = "black") +
    geom_area(data = subset(df_t, x <= -t_crit), fill = "red", alpha = 0.4) +
    geom_area(data = subset(df_t, x >=  t_crit), fill = "red", alpha = 0.4) +
    geom_area(data = subset(df_t, x > -t_crit & x < t_crit), fill = "skyblue", alpha = 0.4) +
    geom_vline(xintercept = t_val, color = "blue", linewidth = 1.2) +
    annotate("text", x = t_val, y = 0.05, label = paste0("t = ", round(t_val, 2)), 
             color = "blue", angle = 90, vjust = -0.5) +
    geom_vline(xintercept = c(-t_crit, t_crit), linetype = "dashed", color = "red") +
    labs(
      title = paste0("t-Distribution (α = ", alpha, ", n = ", n, ")"),
      subtitle = paste0("p-value = ", round(p_val, 4)),
      x = "t-value",
      y = "Density"
    ) +
    theme_minimal(base_size = 14)
  
  print(plt)
  # Return the summary table
  summary_dt
}

**Do not copy the cells below into your notebook. You can, however, execute the example run and read the descriptions below to understand the output for your own submission:**

Example run:

In [None]:
hypo_lab(2025000000)

Explanation of columns:

- **Alpha:** Significance level for the test, $\alpha$.
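- **True_Mean:** True mean of the population from which the sample was actually generated (randomized per student ID; unknown in a real study). Note that the t-test is run against the hypothesized mean `mu_null` = 100, not against this value.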
- **Sample_Mean:** Mean of the sample.
- **Sample_SD:** Standard deviation of the sample.
- **N:** Size of the sample.
- **T_Value:** t-test statistic of the sample: the difference between the sample mean and the hypothesized mean, divided by the estimated standard error $s/\sqrt{n}$ (blue vertical line in the plot).
- **P_Value:** Two-sided p-value of the t statistic: the probability, if $H_0$ is true, of drawing a sample whose t statistic is at least as extreme (in absolute value) as the one observed.
- **CI_Lower:** Lower boundary of the $1-\alpha$ confidence interval for the population mean. In a share $1-\alpha$ of all samples drawn from the same population, the true mean lies within the confidence interval calculated from that sample.
- **CI_Upper:** Upper boundary of confidence interval.
- **T_Crit:** Critical t value that leaves a probability of $\alpha/2$ in each tail ($\alpha$ in total). The values $\pm$ T_Crit are shown as dashed vertical red lines in the plot.
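
To make the column definitions concrete, the sketch below recomputes the t statistic, the two-sided p-value and the confidence interval by hand and compares them with the output of `t.test()`. The sample is simulated for illustration only (the seed and the generating mean of 102 are assumptions, not the lab data); only base R is used.

```r
# Illustrative sample (assumed values, not the lab data)
set.seed(1)
alpha   <- 0.05
n       <- 50
mu_null <- 100                                # hypothesized mean under H0
x       <- rnorm(n, mean = 102, sd = 10)

# Manual computation of the quantities reported in the summary table
se     <- sd(x) / sqrt(n)                     # estimated standard error
t_val  <- (mean(x) - mu_null) / se            # T_Value
t_crit <- qt(1 - alpha / 2, df = n - 1)       # T_Crit
p_val  <- 2 * pt(-abs(t_val), df = n - 1)     # two-sided P_Value
ci     <- mean(x) + c(-1, 1) * t_crit * se    # CI_Lower, CI_Upper

# The same quantities from t.test()
tt <- t.test(x, mu = mu_null, conf.level = 1 - alpha)

all.equal(t_val, unname(tt$statistic))
all.equal(p_val, tt$p.value)
all.equal(ci, as.numeric(tt$conf.int))
```

All three comparisons should return `TRUE`, which shows that `t.test()` uses the sample standard deviation and the t-distribution with $n-1$ degrees of freedom.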

$H_0$: the sample is drawn from a population whose mean equals the hypothesized mean (`mu_null` = 100 in `hypo_lab`).

In the example above we do not reject $H_0$.

We reject $H_0$ when:

- P-value is smaller than $\alpha$
- Absolute value of t-test statistic of the sample is larger than the critical t-test statistic
- Hypothesized population mean (`mu_null`) is not within the confidence interval

We do not reject $H_0$ when:

- P-value is larger than $\alpha$
- Absolute value of t-test statistic of the sample is smaller than the critical t-test statistic
- Hypothesized population mean (`mu_null`) is within the confidence interval

These three criteria are equivalent for a two-sided test at the same $\alpha$ and always lead to the same conclusion.
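
The agreement of the three decision rules can be checked by simulation. The following is a minimal sketch under assumed settings (1000 simulated samples, generating mean 102, hypothesized mean 100); the object names are illustrative and not part of `hypo_lab`.

```r
# Assumed simulation settings, for illustration only
set.seed(42)
alpha   <- 0.05
n       <- 50
mu_null <- 100
t_crit  <- qt(1 - alpha / 2, df = n - 1)

# For each simulated sample, record the reject decision under each rule
decisions <- t(replicate(1000, {
  x  <- rnorm(n, mean = 102, sd = 10)
  tt <- t.test(x, mu = mu_null, conf.level = 1 - alpha)
  c(by_p  = tt$p.value < alpha,
    by_t  = abs(unname(tt$statistic)) > t_crit,
    by_ci = mu_null < tt$conf.int[1] || mu_null > tt$conf.int[2])
}))

# The three rules should agree in every sample
all(decisions[, "by_p"] == decisions[, "by_t"])
all(decisions[, "by_t"] == decisions[, "by_ci"])

# Share of samples in which H0 is rejected
mean(decisions[, "by_p"])
```

Because the generating mean differs from `mu_null` here, the last line estimates the power of the test under these assumed settings; setting the generating mean to 100 would instead estimate the Type I error rate.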