Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
executable file 139 lines (107 sloc) 3.39 KB
---
title: "Two sample $t$ test example using `nycflights13` `flights` data"
author: "Chester Ismay"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
df_print: kable
vignette: |
%\VignetteIndexEntry{Two sample t test flights example}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r include=FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.height = 3)
```
**Note**: The `type` argument in `generate()` is automatically filled based on the entries for `specify()` and
`hypothesize()`. It can be removed throughout the examples that follow. It is left in to reiterate the type of generation process being performed.
## Data preparation
```{r message=FALSE, warning=FALSE}
library(nycflights13)
library(dplyr)
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>%
sample_n(size = 500) %>%
mutate(half_year = case_when(
between(month, 1, 6) ~ "h1",
between(month, 7, 12) ~ "h2"
)) %>%
mutate(day_hour = case_when(
between(hour, 1, 12) ~ "morning",
between(hour, 13, 24) ~ "not morning"
)) %>%
select(arr_delay, dep_delay, half_year,
day_hour, origin, carrier)
```
* Two numeric - `arr_delay`, `dep_delay`
* Two categories
- `half_year` (`"h1"`, `"h2"`),
- `day_hour` (`"morning"`, `"not morning"`)
* Three categories - `origin` (`"EWR"`, `"JFK"`, `"LGA"`)
* Sixteen categories - `carrier`
***
# One numerical variable, one categorical (2 levels)
## Calculate observed statistic
The recommended approach is to use `specify() %>% calculate()`:
```{r}
obs_t <- fli_small %>%
specify(arr_delay ~ half_year) %>%
calculate(stat = "t", order = c("h1", "h2"))
```
The observed $t$ statistic is `r obs_t`.
Or using `t_test` in `infer`
```{r}
obs_t <- fli_small %>%
t_test(formula = arr_delay ~ half_year, alternative = "two_sided",
order = c("h1", "h2")) %>%
dplyr::pull(statistic)
```
The observed $t$ statistic is `r obs_t`.
Or using another shortcut function in `infer`:
```{r}
obs_t <- fli_small %>%
t_stat(formula = arr_delay ~ half_year, order = c("h1", "h2"))
```
The observed $t$ statistic is `r obs_t`.
## Randomization approach to t-statistic
```{r}
t_null_perm <- fli_small %>%
# alt: response = arr_delay, explanatory = half_year
specify(arr_delay ~ half_year) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "t", order = c("h1", "h2"))
visualize(t_null_perm) +
shade_p_value(obs_stat = obs_t, direction = "two_sided")
```
## Calculate the randomization-based $p$-value
```{r}
t_null_perm %>%
get_p_value(obs_stat = obs_t, direction = "two_sided")
```
## Theoretical distribution
```{r }
t_null_theor <- fli_small %>%
# alt: response = arr_delay, explanatory = half_year
specify(arr_delay ~ half_year) %>%
hypothesize(null = "independence") %>%
# generate() ## Not used for theoretical
calculate(stat = "t", order = c("h1", "h2"))
visualize(t_null_theor, method = "theoretical") +
shade_p_value(obs_stat = obs_t, direction = "two_sided")
```
## Overlay appropriate $t$ distribution on top of permuted t-statistics
```{r}
visualize(t_null_perm, method = "both") +
shade_p_value(obs_stat = obs_t, direction = "two_sided")
```
## Compute theoretical p-value
```{r}
fli_small %>%
t_test(formula = arr_delay ~ half_year,
alternative = "two_sided",
order = c("h1", "h2")) %>%
dplyr::pull(p_value)
```