---
execute:
echo: true
---
::: {.content-visible when-format="pdf"}
```{=latex}
\setDOI{10.4324/9781003393764.10}
\thispagestyle{chapterfirstpage}
```
:::
# Infer {#sec-infer-chapter}
```{r}
#| label: setup-options
#| child: "../_common.qmd"
#| cache: false
```
::: {.callout}
**{{< fa regular list-alt >}} Outcomes**
- Identify the research goals of inferential data analysis
- Describe the workflow for inferential data analysis
- Indicate the importance of quantifying uncertainty in inferential data analysis
:::
```{r}
#| label: inference-data-packages
#| echo: false
# Packages
```
In this chapter, we consider approaches to deriving knowledge from information which can be generalized to the population from which the data is sampled. This process is known as statistical inference. The discussion here implements descriptive assessments, statistical tests, and evaluation procedures for a series of contexts which are common in the analysis of corpus-based data. During our treatment of these contexts, we will establish a foundational understanding of statistical inference using a simulation-based approach.
::: {.callout}
**{{< fa terminal >}} Lessons**
**What**: Advanced Tables \
**How**: In an R console, load {swirl}, run `swirl()`, and follow prompts to select the lesson.\
**Why**: To explore how to enhance dataset summaries using {janitor} and present them effectively with {kableExtra}'s advanced formatting options.
:::
## Orientation {#sec-infer-orientation}
In contrast to exploratory and predictive analyses, inference is not a data-driven endeavor. Rather, the goal of inferential data analysis (IDA)\index{inferential data analysis (IDA)} is to make theoretical claims about the population and assess the extent to which the data supports those claims\index{theory-driven research}. This entails two key methodological restrictions that are not in play in other analysis methods.
First, the research question\index{research question} and expected findings are formulated *before* the data is analyzed; strictly speaking, this should take place even before data collection\index{acquire data}. This helps ensure that the data is aligned with the research question, that the data is representative\index{representativeness} of the population\index{sampling}, and that the analysis has a targeted focus and does not run the risk of becoming a 'just-so' story[^hark] or a 'significance-finding' mission[^phack], both of which violate the principles of significance testing.
[^hark]: "Hypothesis After Result is Known" (HARKing) involves selectively analyzing data, trying different variables or combinations until a significant $p$-value is obtained, or stopping data collection when a significant result is found [@Kerr1998].
[^phack]: "$p$-hacking" is the practice of running multiple tests until a statistically significant result is found. This practice violates the principles of significance testing [@Head2015].
Second, the data used in IDA is only used once. That is to say, the entire dataset is used a single time to statistically interrogate the relationship(s) of interest. In both exploratory\index{exploratory data analysis (EDA)} and predictive data analysis\index{predictive data analysis (PDA)} the data can be approached multiple times in different ways and the results of the analysis can be used to inform the next steps in the analysis. In IDA, however, the data is used to test a specific hypothesis\index{hypothesis} and the results of the analysis are interpreted in the context of that hypothesis.\index{research interpretation}
The methodological approach to IDA\index{inferential data analysis (IDA)} is the most straightforward of the analysis types covered in this textbook. As the research goal is to test a claim, the steps necessary are fewer than in EDA or PDA, where the exploratory nature of these approaches includes various possible iterations. The workflow for IDA is shown in @tbl-infer-workflow.
<!-- Workflow -->
::: {#tbl-infer-workflow tbl-colwidths="[5, 15, 80]"}
| Step | Name | Description |
|:-----|:-----|:------------|
| 1 | Identify | Identify and map the hypothesis statement to the appropriate response and explanatory variables. |
| 2 | Inspect | Assess the distribution of the variable(s) with the appropriate descriptive statistics and visualizations. |
| 3 | Interrogate | Apply the appropriate statistical procedure to the dataset. |
| 4 | Interpret | Review the statistical results and interpret them in the context of the hypothesis. |
Workflow for inferential data analysis
:::
Based on the hypothesis\index{hypothesis} statement, we first identify and operationalize\index{operationalize} the variables. The response variable\index{response variable} is the variable whose variation we aim to explain. Additionally, in most statistical designs, one or more explanatory variables\index{explanatory variables} are included in the analysis in an attempt to gauge the extent to which these variables account for the variation in the response variable. For both response and explanatory variables, it is key to confirm that your operationalization is well-defined and that the data aligns with it.
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
What are the explanatory and/or response variables in each of these statements? How are these variables operationalized? What key sampling features are necessary for the data to test these hypotheses?
1. There will be statistically significant differences in the kinds of collocations used in English dialects spoken in urban areas compared to those spoken in rural areas.
2. French L2 learners will make more vocabulary errors in oral production than in written production.
3. The association strength between Mandarin words and their English translations will be a significant predictor of translation difficulty for novice translators.
4. The prevalence of gender-specific words in German-speaking communities on distinct online forums will significantly reflect gender roles.
5. The frequency of function words used by Spanish L2 learners will be a significant predictor of their stage in language acquisition.
:::
Next, we determine the informational values\index{informational types} of the variables\index{variables}. The informational value of each variable will condition how we approach visualization, interrogation, and ultimately interpretation of the results. Note that some informational types can be converted to other types; specifically, higher-order types can be converted to lower-order types. For example, a continuous variable\index{continuous variables} can be converted to a categorical variable\index{categorical variables}, but not vice versa. It is preferable, however, to use the highest informational value of a variable. Simplifying data results in a loss of information, and hence statistical power, which may lead to results that obscure meaningful patterns in the data [@Baayen2004]\index{Baayen}.
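A minimal sketch of this kind of conversion, using a hypothetical `ratings` vector: binning a continuous variable with `cut()` collapses distinct values into shared categories, discarding the within-bin distinctions.

```r
# Hypothetical continuous variable
ratings <- c(2.1, 3.8, 4.9, 1.2, 3.3)

# Convert to a two-level categorical variable; the distinctions
# within each bin are lost
cut(ratings, breaks = c(0, 2.5, 5), labels = c("low", "high"))
```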
With our design in place, we can now inspect the data. This involves assessing the distribution of the variables using descriptive statistics and visualizations\index{descriptive assessment}. The goal of this step is to confirm the integrity of the data (missing data\index{missing data}, anomalies, *etc*.), identify general patterns in the data, and identify potential outliers\index{outliers}. As much as this is a verification step, it also serves to provide a sense of the data and the extent to which the data aligns with the hypothesis. This is particularly true when statistical designs are complex and involve multiple explanatory variables. An appropriate visualization provides context for interpreting the results of the statistical analysis.
Interrogating the data involves applying the appropriate statistical procedure to the dataset. In the **Null Hypothesis Significance Testing** (NHST)\index{Null Hypothesis Significance Testing (NHST)} paradigm, this process includes calculating a statistic from the data, comparing it to a null hypothesis distribution, and measuring the evidence against the null hypothesis. The **null hypothesis distribution**\index{null hypothesis distribution} is a distribution of statistic values that we would expect if the null hypothesis were true, *i.e.* that there is no difference or relationship between the explanatory and/or response variables. By comparing the observed statistic to the null hypothesis distribution, we can determine the likelihood of observing the observed statistic, if the null hypothesis were true. The estimate of this likelihood is a **$p$-value**\index{p-value}. When the $p$-value is below a threshold, typically 0.05, the result is considered statistically significant. This means that the observed statistic is sufficiently different from the null hypothesis distribution that we can reject the null hypothesis.
Now let's consider how to approach interpreting the results from a statistical test. The $p$-value\index{p-value} provides a probability that the results of our statistical test could be explained by the null hypothesis. When this probability is below the alpha level of 0.05, the result is considered statistically significant, otherwise we have a 'null result' (*i.e.* non-significant).
However, this sets up a binary distinction that can be problematic. On the one hand, what is one to do if a test returns a $p$-value of 0.051? According to standard practice, these "marginally significant" results would not be statistically significant. On the other hand, if we get a statistically significant result, say a $p$-value of 0.049, do we move on ---case closed? To address both of these issues, it is important to calculate a confidence interval for the test statistic. The **confidence interval**\index{confidence interval} is the range of values within which we would expect the true statistic value to fall, at some level of uncertainty. Again, 95% is the most common level. The upper and lower bounds of this range are called the confidence limits for the test statistic.
Used in conjunction with $p$-values\index{p-value}, confidence intervals\index{confidence interval} can provide a more nuanced interpretation of the results of a statistical test. For example, if we get a $p$-value of 0.051, but the confidence interval is very narrow, we can be more confident that the results are reliable. Conversely, if we get a $p$-value of 0.049, but the confidence interval is very wide, we can be less confident that the results are reliable. If our confidence interval contains the null value, then even a significant $p$-value will require a more nuanced interpretation\index{research interpretation}.
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
Overgeneralization and undergeneralization are more formally known as Type I and Type II error, respectively. Type I error (false positive) occurs when we reject the null hypothesis when it is true. That is, we erroneously detect a significant result, when in fact the tested relationship is not borne out in the population. Type II error (false negative) occurs when we fail to reject the null hypothesis when it is false. This is a case of missing a significant result due to the limitations of the analysis which can stem from the sample size, the design of the study, or the statistical test used.
:::
It is important to underscore that the purpose of IDA is to draw conclusions from a dataset which are generalizable to the population. These conclusions require that there are rigorous measures to ensure that the results of the analysis do not overgeneralize (suggest there is a relationship when there is not one) and balance that with the fact that we don't want to undergeneralize (miss the fact that there is a relationship in the population, but our analysis was not capable of detecting it).
## Analysis {#sec-infer-analysis}
<!-- Goals of this section -->
In this section, we will discuss the practical application of inferential data analysis. The discussion will be divided into two sections based on the type of response variable: categorical and numeric. We will then explore specific designs for univariate, bivariate, and multivariate tests\index{multivariate analysis}. We will learn and implement NHST\index{Null Hypothesis Significance Testing (NHST)} using a simulation-based workflow. In contrast to theory-based methods, simulation-based methods tend to be more intuitive, easier to implement, and provide a better conceptual understanding of the statistical designs and analyses [@Morris2019; @Rossman2014a].
The steps for implementing a simulation-based approach to significance testing are outlined in @tbl-infer-simulation.
<!-- Simulation-based approach -->
::: {#tbl-infer-simulation tbl-colwidths="[5, 30, 65]"}
| Step | Name | Description |
|:-----|:-----|:------------|
| 1 | Specify | Specify the variables of interest and their relationship |
| 2 | Calculate | Calculate the observed statistic |
| 3 | Hypothesize | Generate the null hypothesis distribution |
| 4 | Get $p$-value | Calculate the $p$-value |
| 5 | Get confidence interval | Calculate the confidence interval |
Simulation-based workflow for significance testing
:::
<!-- Setup: packages/ options/ data -->
{infer} [@R-infer] provides a Tidyverse-friendly\index{Tidyverse} framework to implement simulation-based methods for statistical inference. Designed to be used in conjunction with {tidyverse}, {infer} provides a set of functions that can be used to specify the variables of interest, calculate the observed statistic, generate the null hypothesis distribution\index{null hypothesis distribution} and calculate the $p$-value\index{p-value} and the confidence interval\index{confidence interval}.
Let's load the necessary packages we will use in this section, as seen in @exm-infer-setup.
\pagebreak
::: {#exm-infer-setup}
```r
# Load packages
library(infer) # for statistical inference
library(skimr) # for descriptive statistics
library(janitor) # for cross-tabulation
```
\cindex{library()}
```{r}
#| label: infer-setup
#| echo: false
# Load packages
library(infer) # for statistical inference
library(skimr) # for descriptive statistics
library(janitor) # for cross-tabulation
```
:::
### Categorical {#sec-infer-categorical}
Here we demonstrate the application of IDA to categorical response variables\index{response variable}\index{categorical variables}. This will include various common statistical designs and analyses. In @tbl-infer-cat-design, we see common design scenarios, the variables involved, and the statistic used in the analysis.
::: {#tbl-infer-cat-design tbl-colwidths="[15, 30, 35, 20]"}
| Scenario | Explanatory Variable(s) | Statistical Test | `infer` |
|:---------|:------------------------|:-----------------|:------------|
| Univariate | - | Proportion | `prop` |
| Bivariate | Categorical | Difference in proportions | `diff in props` |
| Bivariate | Categorical (3+ levels) | Chi-square | `chisq` |
| Multivariate | Categorical or Numeric | Logistic regression | `fit()` |
Statistical test designs for categorical response variables
:::
We will use a derived version of the `dative` dataset from {languageR} [@R-languageR]. It contains over 3,000 observations describing the realization of the recipient clause in English dative constructions, drawn from the Switchboard corpus and the Treebank Wall Street Journal collection\index{corpus!reference}. To familiarize ourselves with the dataset, let's consider the data dictionary in @tbl-infer-cat-data-dict.
```{r}
#| label: tbl-infer-cat-data-dict
#| tbl-cap: "Data dictionary for the `dative_tbl` dataset."
#| tbl-colwidths: [10, 25, 15, 50]
#| echo: false
# Data dictionary for the `dative_tbl` dataset
read_csv("data/dative_ida_dd.csv") |>
tt(width = 1)
```
We see that this dataset has four variables, two categorical and two numeric. In our demonstrations, we are going to use `rcp_real` as the response variable, the variable whose variation we are investigating.\index{response variable}
For a bit more context, a dative is the phrase which reflects the entity that takes the recipient role in a ditransitive clause. In English, the recipient (dative) can be realized as either a prepositional phrase (PP) as seen in @exm-infer-cat-dative-examples (1) or as a noun phrase (NP) as seen in (2).
::: {#exm-infer-cat-dative-examples}
Dative examples
(1) John gave the book [to Mary ~PP~].
(2) John gave [Mary ~NP~] the book.
:::
Together these two syntactic options are known as the Dative Alternation [@Bresnan2007].
Let's go ahead and load the dataset, as seen in @exm-infer-cat-read-dative.
::: {#exm-infer-cat-read-dative}
```r
# Load datasets
dative_tbl <-
read_csv("../data/dative_ida.csv")
```
```{r}
#| label: infer-cat-read-dative
#| include: false
# Load datasets
dative_tbl <-
read_csv("data/dative_ida.csv")
```
\index{R packages!readr}
\cindex{read_csv()}
:::
```{r}
#| label: infer-cat-diagnostics
#| include: false
# Convert character variables to factors
dative_tbl <-
dative_tbl |>
mutate(across(where(is.character), factor))
# Statistical overview
dative_tbl |>
skim()
```
In preparation for statistical analysis, I performed a statistical overview and diagnostics of the dataset\index{descriptive assessment}. This included checking for missing data\index{missing data}, outliers\index{outliers}, and anomalies. I also checked the distribution of the variables using descriptive statistics\index{descriptive statistics} and visualizations, noting that the `rcp_len` and `thm_len` variables are right-skewed\index{skewed distribution}. This is something to keep in mind. The results of this overview and diagnostics are not shown here, but they are important steps in the IDA workflow. In this process, I converted the character variables to factors as most statistical tests require factors. A preview of the dataset is shown in @exm-infer-cat-dative-preview.
::: {#exm-infer-cat-dative-preview}
```{r}
#| label: infer-cat-dative-preview
# Preview
glimpse(dative_tbl)
```
:::
We can see that the dataset includes `r format(nrow(dative_tbl), big.mark=",")` observations. We will take a closer look at the descriptive statistics for the variables as we prepare for each analysis.
#### Univariate analysis {#sec-infer-cat-univariate}
The univariate analysis is the simplest statistical design and analysis. It includes only one variable. The goal is to describe the distribution of the levels of the variable. The `rcp_real` variable has two levels: NP and PP. A potential research question\index{research question} for a case like this may aim to test the claim that:
- NP realizations of the recipient clause are the canonical form in English dative constructions, and therefore will be the most frequent realization of the recipient clause.
This hypothesis can be tested using a **proportion test**\index{proportion test}, which compares the observed proportion to a point value (here, 0.5). The null hypothesis\index{null hypothesis} is that there is no difference in the proportion of NP and PP realizations of the recipient clause. The alternative hypothesis\index{alternative hypothesis} is that NP realizations of the recipient clause are more frequent than PP realizations.
Before we get into statistical analysis, it is always a good idea to cross-tabulate\index{contingency table} or visualize the question, depending on the complexity of the relationship. In @exm-infer-cat-univariate-tbl, we see the code that shows the distribution of the levels of the `rcp_real` variable in a contingency table.
::: {#exm-infer-cat-univariate-tbl}
```r
# Contingency table of `rcp_real`
dative_tbl |>
tabyl(rcp_real) |>
adorn_pct_formatting(digits = 2) |>
kable() |>
kable_styling()
```
\index{R packages!janitor}\index{R packages!kableExtra}\index{R packages!knitr}
\cindex{tabyl()}\cindex{adorn_pct_formatting()}\cindex{kable()}\cindex{kable_styling()}
```{r}
#| label: tbl-infer-cat-univariate
#| tbl-cap: "Distribution of the levels of the `rcp_real` variable."
#| tbl-colwidths: [50, 25, 25]
#| echo: false
dative_tbl |>
tabyl(rcp_real) |>
adorn_pct_formatting(digits = 2) |>
as_tibble() |>
tt(width = 1)
```
:::
From @tbl-infer-cat-univariate, we see that the proportion of NP realizations of the recipient clause is higher than the proportion of PP realizations of the recipient clause. However, we cannot conclude that there is a difference in the proportion of NP and PP realizations of the recipient clause\index{proportion}. We need to conduct a statistical test to determine if the difference is statistically significant.
To determine if the distribution of the levels of the `rcp_real` variable is different from what we would expect if the null hypothesis were true, we need to calculate the statistic observed in the sample and compare it to the statistics observed in many samples where the null hypothesis is true.
First, let's calculate the proportion of NP and PP realizations of the recipient clause in the sample. We turn to the `specify()` function from {infer} to specify the variable of interest, step 1 in the simulation-based workflow in @tbl-infer-simulation. In this case, we only have the response variable\index{response variable}. Furthermore, the argument `success` specifies the level of the response variable that we will use as the 'success'. The term 'success' is used because the `specify()` function was designed for binomial variables where the levels are 'success' and 'failure', as seen in @exm-infer-cat-specify.
::: {#exm-infer-cat-specify}
```{r}
#| label: infer-cat-specify
# Specify the variable of interest
dative_spec <-
dative_tbl |>
specify(
response = rcp_real,
success = "NP"
)
# Preview
dative_spec
```
\index{R packages!infer}
\cindex{specify()}
:::
The `dative_spec` is a data frame with attributes which are used by {infer} to maintain information about the statistical design for the analysis. In this case, we only have information about what the response variable is.
Step 2 is to calculate the observed statistic. The `calculate()` function is used to calculate the proportion statistic setting `stat = "prop"`, as seen in @exm-infer-cat-calculate.
\pagebreak
::: {#exm-infer-cat-calculate}
```{r}
#| label: infer-cat-calculate
# Calculate the proportion statistic
dative_obs <-
dative_spec |>
calculate(stat = "prop")
# Preview
dative_obs
```
\index{R packages!infer}
\cindex{calculate()}
:::
Note that the observed statistic, proportion, is the same as the proportion we calculated in @tbl-infer-cat-univariate. In such a simple example, the summary statistic\index{central tendency} and the observed statistic are the same. But this simple example shows how choosing the 'success' level of the response variable is important. If we had chosen the 'PP' level as the 'success' level, then the observed statistic would be the proportion of PP realizations of the recipient clause. There is nothing wrong with choosing the 'PP' level as the 'success' level, but it would change the direction of the observed statistic.
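To see this concretely, here is a quick check using the objects above: with "PP" as the 'success' level, the statistic is simply the complement of the NP proportion.

```r
# Choosing "PP" as the success level yields the complementary proportion
dative_tbl |>
  specify(response = rcp_real, success = "PP") |>
  calculate(stat = "prop")
```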
Now that we have the observed statistic, our goal will be to determine if the observed statistic is different from what we would expect if the null hypothesis were true. To do this, we simulate samples where the null hypothesis\index{null hypothesis distribution} is true, step 3 in our workflow.
Simulation means that we will randomly sample from the `dative_tbl` data frame many times. We need to determine how the sampling takes place. Since `rcp_real` is a variable with only two levels, the null hypothesis is that both levels are equally likely. In other words, in a null hypothesis world, we would expect the proportions of NP and PP to be roughly 50/50.
To formalize this hypothesis with `infer` we use the `hypothesize()` function and set the null hypothesis to "point" and the proportion to 0.5. Then we can `generate()` a number of samples, say 1,000, drawn from our 50/50 world. Finally, the `prop` (proportion) statistic is calculated for each of the 1,000 samples and returned in a data frame, as seen in @exm-infer-cat-null-hypothesis.
::: {#exm-infer-cat-null-hypothesis}
```{r}
#| label: infer-cat-null-hypothesis
# Generate the null hypothesis distribution
dative_null <-
dative_spec |>
hypothesize(null = "point", p = 0.5) |>
generate(reps = 1000, type = "draw") |>
calculate(stat = "prop")
# Preview
dative_null
```
\index{R packages!infer}
\cindex{hypothesize()}\cindex{generate()}\cindex{calculate()}
:::
The result of @exm-infer-cat-null-hypothesis is a data frame with as many rows\index{observations} as there are samples. Each row contains the proportion statistic for each sample drawn from the hypothesized distribution that the proportion of NP realizations of the recipient clause is 0.5.
To appreciate the null hypothesis distribution\index{null hypothesis distribution}, we can visualize it using a histogram\index{histogram}. {infer} provides a convenient `visualize()` function for visualizing distributions, as seen in @exm-infer-cat-null-hypothesis-vis.
::: {#exm-infer-cat-null-hypothesis-vis}
```r
# Visualize the null hypothesis distribution
visualize(dative_null)
```
```{r}
#| label: fig-infer-cat-null-hypothesis
#| fig-cap: "Simulation-based null distribution"
#| fig-alt: "A histogram showing the spread of values possible under the null hypothesis."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Visualize the null hypothesis distribution
dative_null |>
visualize() +
labs(title = "", y = "Count", x = "Statistic") +
theme_qtalr(font_size = 10)
```
\index{R packages!infer}
\cindex{visualize()}\cindex{labs()}
```{r}
#| label: infer-cat-null-hypothesis-range
#| include: false
null_range <- round(range(dative_null$stat), 2)
null_mean <- round(mean(dative_null$stat), 2)
null_sd <- round(sd(dative_null$stat), 2)
```
:::
In @fig-infer-cat-null-hypothesis, the x-axis shows the proportion statistic of NP realizations of the recipient clause that we would expect if the null hypothesis were true. For the 1,000 samples, the proportion statistic ranges from `r null_range[1]` to `r null_range[2]`. Importantly, most of the proportion statistics cluster around 0.5. In fact, the mean is `r null_mean` with a standard deviation\index{standard deviation}\index{dispersion} of `r null_sd`, which is what we would expect if the null hypothesis were true. But there is variation, as we would also expect.
Why would we expect variation? Consider the following analogy. If we were to flip a fair coin 10 times, we would expect to get 5 heads and 5 tails. But this doesn't always happen. Sometimes we get 6 heads and 4 tails. Sometimes we get 7 heads and 3 tails, and so on. As the number of flips increases, however, we would expect the proportion of heads to be closer to 0.5, but there would still be variation. The same is true for the null hypothesis distribution. As the number of samples increases, we would expect the proportion of NP realizations of the recipient clause to be closer to 0.5, but there would still be variation. The question is whether the observed statistic we obtained from our data, in @exm-infer-cat-calculate, is within some level of variation that we would expect if the null hypothesis were true.
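To make the coin-flip analogy concrete, here is a quick base-R simulation: each of 1,000 simulated sets of 10 fair flips yields a proportion of heads, and those proportions vary around 0.5.

```r
# Simulate 1,000 sets of 10 fair coin flips and record the
# proportion of heads in each set
set.seed(123)
flips <- replicate(1000, mean(rbinom(10, size = 1, prob = 0.5)))

range(flips) # the proportions vary from set to set
mean(flips)  # but they center near 0.5
```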
Let's visualize the observed statistic on the null hypothesis distribution\index{null hypothesis distribution}, as seen in @fig-infer-cat-null-hypothesis-obs, to gauge whether the observed statistic is within some level of variation that we would expect if the null hypothesis were true. The `shade_p_value()` function takes the null hypothesis distribution and the observed statistic and shades the sample statistics that are as or more extreme than the observed statistic.
::: {#exm-infer-cat-null-hypothesis-obs}
```r
dative_null |>
visualize() + # note we are adding a visual layer `+`
shade_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "greater" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\index{one-sided test}
\cindex{shade_p_value()}\cindex{visualize()}
```{r}
#| label: fig-infer-cat-null-hypothesis-obs
#| fig-cap: "Simulation-based null distribution with the observed statistic."
#| fig-alt: "A histogram showing the spread of values possible under the null hypothesis with the observed statistic represented as a line on the x-axis."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Visualize the null hypothesis distribution with the observed statistic
dative_null |>
visualize() + # note we are adding a visual layer `+`
shade_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "greater", # the direction of the alternative hypothesis
color = "grey"
) +
labs(title = "", y = "Count", x = "Statistic") +
theme_qtalr(font_size = 10)
```
\index{R packages!infer}
\cindex{visualize()}\cindex{shade_p_value()}\cindex{labs()}
\index{one-sided test}
:::
Just from a visual inspection, it is obvious that the observed statistic lies far away from the null distribution, far right of the right tail. No shading appears in this case as the observed statistic is far from the expected variation. This suggests that the observed statistic is not within the level of variation that we would expect if the null hypothesis were true.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The direction of the alternative hypothesis is important because it determines the $p$-value range. The "two-sided" direction means that we are interested in the proportion being different from 0.5. If we were only interested in the proportion of one outcome being greater than 0.5, then we would use the "greater" direction, or "less" in the opposite scenario.
:::
But we need to quantify this. We need to calculate the probability of observing the observed statistic, or a more extreme one, if the null hypothesis were true: the $p$-value\index{p-value}. Calculating this estimate is step 4 in the workflow. The $p$-value is calculated as the proportion of samples in the null hypothesis distribution that are as or more extreme than the observed statistic. The conventional threshold for rejecting the null hypothesis is the **alpha level**\index{alpha level}, typically 0.05, which corresponds to a 95% confidence level. This means that if the $p$-value is less than 0.05, then we reject the null hypothesis. If the $p$-value is greater than 0.05, then we fail to reject the null hypothesis.
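To make the counting concrete, here is a minimal sketch of the calculation by hand, using the `dative_null` and `dative_obs` objects created above: the $p$-value is the proportion of null statistics at least as extreme as the observed statistic.

```r
# Manual p-value for a one-sided ("greater") test: the proportion of
# null statistics at least as large as the observed statistic
mean(dative_null$stat >= dative_obs$stat)
```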
With `infer` we can calculate the $p$-value using the `get_p_value()` function. Let's calculate the $p$-value for our observed statistic, as seen in @exm-infer-cat-p-value.
\pagebreak
::: {#exm-infer-cat-p-value}
```{r}
#| label: infer-cat-p-value
# Calculate the $p$-value (observed statistic)
dative_null |>
get_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "greater" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\cindex{get_p_value()}
\index{one-sided test}
```{.r code-line-numbers="false"}
Warning message:
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
```
:::
The $p$-value for our observed statistic is reported as $0$, with a warning that the $p$-value estimate is contingent on the number of samples we generate in the null distribution. With 1,000 samples, a reported value of $0$ means the $p$-value is smaller than 1/1,000, so we have a statistically significant result at the alpha level of 0.05.
The $p$-value\index{p-value} is one, traditionally very common, estimate of uncertainty. Another estimate of uncertainty is the confidence interval, our 5th and final step. The confidence interval\index{confidence interval} is the range of values within which we would expect the true statistic value to fall, at some level of uncertainty. Again, 95% is the most common level. The upper and lower bounds of this range are called the **confidence limits**\index{confidence limits} for the test statistic. To calculate them, instead of generating a null hypothesis distribution, we generate a distribution based on resampling from the observed data. This is called the **bootstrap distribution**\index{bootstrap distribution}: it is generated by resampling from the observed data, with replacement, many times, which simulates the process of sampling\index{sampling} from the population many times. The test statistic is calculated for each resample. The confidence limits are the 2.5th and 97.5th percentiles of the bootstrap distribution, and the confidence interval is the range between them.
In @exm-infer-cat-confidence-interval, we see the code for calculating the confidence interval for our observed statistic.
\pagebreak
::: {#exm-infer-cat-confidence-interval}
```r
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "prop")
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95) # 95% confidence interval
dative_ci
```
:::
```{r}
#| label: infer-cat-confidence-interval
#| echo: false
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "prop")
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95) # 95% confidence interval
dative_ci
```
\index{R packages!infer}
\cindex{generate()}\cindex{get_confidence_interval()}\cindex{calculate()}
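As a sanity check on the percentile method described above, the confidence limits can be computed directly from the bootstrap statistics with base R's `quantile()`.

```r
# The 2.5th and 97.5th percentiles of the bootstrap distribution
# reproduce the 95% confidence limits
quantile(dative_boot$stat, probs = c(0.025, 0.975))
```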
Let's visualize the confidence interval using the `visualize()` and `shade_confidence_interval()` functions in @exm-infer-cat-confidence-interval-visualize on our bootstrapped samples, as seen in @fig-infer-cat-confidence-interval-visualize.
::: {#exm-infer-cat-confidence-interval-visualize}
```r
# Visualize the bootstrap distribution with the confidence interval
dative_boot |>
visualize() +
shade_confidence_interval(
dative_ci # the confidence interval
)
```
\index{R packages!infer}
\cindex{visualize()}\cindex{shade_confidence_interval()}
```{r}
#| label: fig-infer-cat-confidence-interval-visualize
#| fig-cap: "Bootstrap distribution of the proportion of NP realizations of the recipient clause with the confidence interval."
#| fig-alt: "A histogram showing the spread of values generated by bootstrapping the observed data with the confidence interval represented as a shaded area."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Visualize the bootstrap distribution with the confidence interval
dative_boot |>
visualize() +
shade_confidence_interval(
endpoints = dative_ci, # the confidence interval
color = "grey2",
fill = "grey"
) +
labs(title = "", y = "Count", x = "Statistic") +
theme_qtalr(font_size = 10)
```
:::
The confidence level is the probability that an interval constructed in this way contains the true value; it is typically set to 0.95 in the social sciences. In practice, if the confidence interval contains the null hypothesis value, we fail to reject the null hypothesis. If it does not, we reject the null hypothesis.
Confidence intervals\index{confidence interval} are often misinterpreted. A 95% confidence interval does not mean there is a 95% probability that the true value lies within this particular range; the true value is either within the range or not. Rather, 95% of intervals constructed in this way would contain the true value. This is a subtle but important distinction. Interpreted correctly, confidence intervals can enhance our understanding of the uncertainty of our test statistic and reduce the interpretation of $p$-values\index{p-value} (which are based on a relatively arbitrary alpha level\index{alpha level}) as a binary decision, significant or not significant. Confidence intervals encourage us to think about the uncertainty of our test statistic rather than a single cutoff.
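A small base-R simulation can illustrate the coverage interpretation. Assuming a hypothetical population proportion of 0.7, we draw many samples, build a 95% confidence interval for each, and count how often the interval contains the true value.

```r
# Coverage of 95% confidence intervals for a known proportion
set.seed(123)
true_p <- 0.7 # hypothetical population value, for illustration only

covered <- replicate(1000, {
  draws <- rbinom(100, size = 1, prob = true_p)  # one sample of 100
  ci <- prop.test(sum(draws), n = 100)$conf.int  # 95% CI for the proportion
  ci[1] <= true_p && true_p <= ci[2]             # does the CI contain true_p?
})

mean(covered) # close to 0.95
```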
Our stat is `r dative_obs$stat` and the confidence interval limits are `r dative_ci$lower_ci` and `r dative_ci$upper_ci`. The confidence interval does not contain the null hypothesis\index{null hypothesis} value of 0.5, which supports the evidence from the $p$-value that the proportion of NP realizations of the recipient clause is greater than 0.5.
#### Bivariate analysis {#sec-infer-cat-bivariate}
The univariate case is not very interesting or common in statistical inference, but it is a good place to start to understand the simulation-based process and the logic of statistical inference. The bivariate case, on the other hand, is much more common and interesting. The bivariate case includes two variables. The goal is to test the relationship between the two variables\index{research question}.
Using the `dative_tbl` dataset, we can imagine making the claim that:
- The proportion of NP and PP realizations of the recipient clause are contingent on the modality.
This hypothesis can be approached using a **difference in proportions** test,\index{difference in proportions test} as both variables are binomial (have two levels). The null hypothesis is that there is no difference in the proportion of NP and PP realizations of the recipient clause by modality. The alternative hypothesis is that there is a difference in the proportion of NP and PP realizations of the recipient clause by modality.
We can cross-tabulate or visualize, but let's cross-tabulate this relationship as it is a basic 2-by-2 contingency table\index{contingency table}. In @exm-infer-cat-bivariate-tbl, we see the code for the cross-tabulation of the `rcp_real` and `modality` variables. Note I've made use of {janitor} to adorn this table with percentages, totals, and observation numbers.
::: {#exm-infer-cat-bivariate-tbl}
```r
dative_tbl |>
tabyl(rcp_real, modality) |> # cross-tabulate
adorn_totals(c("row", "col")) |> # provide row and column totals
adorn_percentages("col") |> # add percentages to the columns
adorn_pct_formatting(rounding = "half up", digits = 0) |> # round the digits
adorn_ns() |> # add observation number
adorn_title("combined") |> # add a header title
kable(booktabs = TRUE) |> # pretty table)
kable_styling()
```
\index{R packages!janitor}\index{R packages!kableExtra}\index{R packages!knitr}
\cindex{tabyl()}\cindex{adorn_totals()}\cindex{adorn_percentages()}\cindex{adorn_pct_formatting()}\cindex{adorn_ns()}\cindex{adorn_title()}\cindex{kable()}\cindex{kable_styling()}
```{r}
#| label: tbl-infer-cat-bivariate
#| tbl-cap: "Contingency table for `rcp_real` and `modality`."
#| tbl-colwidths: [36, 22, 22, 22]
#| echo: false
dative_tbl |>
tabyl(rcp_real, modality) |> # cross-tabulate
adorn_totals(c("row", "col")) |> # provide row and column totals
adorn_percentages("col") |> # add percentages to the columns
adorn_pct_formatting(rounding = "half up", digits = 0) |> # round the digits
adorn_ns() |> # add observation number
adorn_title("combined") |> # add a header title
as_tibble() |>
tt(width = 1)
```
:::
In @tbl-infer-cat-bivariate, we can appreciate that the proportion of NP realizations of the recipient clause is higher in both modalities, as we might expect from our univariate analysis. However, the proportions appear to differ, with the spoken modality having a higher proportion of NP realizations than the written modality. But we cannot conclude that there is a difference in the proportion of NP and PP realizations of the recipient clause by modality. We need to conduct a statistical test to determine if the difference is statistically significant.
To determine if the distribution of the levels of the `rcp_real` variable by the levels of the `modality` variable is different from what we would expect if the null hypothesis were true, we need to calculate the difference observed in the sample and compare it to the differences observed in many samples where the null hypothesis is true.
{infer} provides a pipeline, steps 1 through 5, which maintains a consistent workflow for statistical inference. As such, the procedure is very similar to the univariate analysis we performed, with some adjustments. Let's focus on the adjustments. First, our `specify()` call needs to include the relationship between two variables: `rcp_real` and `modality`. The `response` argument is the response variable, which is `rcp_real`. The `explanatory` argument is the explanatory variable, which is `modality`.
There are two approaches to specifying the relationship between the response\index{response variable} and explanatory variables\index{explanatory variables}. The first approach is to specify the response variable and the explanatory variable separately as values of the arguments `response` and `explanatory`. The second approach is to specify the response variable and the explanatory variable as a formula using the `~` operator\cindex{\textasciitilde}. The formula approach is more flexible and allows for more complex relationships between the response and explanatory variables. In @exm-infer-cat-specify-bivariate, we see the code for the `specify()` call using the formula approach.
::: {#exm-infer-cat-specify-bivariate}
```{r}
#| label: infer-cat-specify-bivariate
# Specify the relationship between the response and explanatory variables
dative_spec <-
dative_tbl |>
specify(
rcp_real ~ modality,
success = "NP"
)
# Preview
dative_spec
```
\index{R packages!infer}
\cindex{specify()}
:::
The `dative_spec` now contains attributes about the response and explanatory variables encoded into the data frame.
We now calculate the observed statistic with `calculate()`, as seen in @exm-infer-cat-calculate-bivariate.
::: {#exm-infer-cat-calculate-bivariate}
```{r}
#| label: infer-cat-calculate-bivariate
# Calculate the observed statistic
dative_obs <-
dative_spec |>
calculate(
stat = "diff in props",
order = c("spoken", "written")
)
# Preview
dative_obs
```
\index{R packages!infer}
\cindex{calculate()}
:::
Two differences are that our statistic is now a difference in proportions and that we are asked to specify the order of the levels of `modality`. The statistic is clear: we are investigating whether the proportion of NP realizations of the recipient clause is different between the spoken and written modalities. The order of the levels of `modality` is important because it determines the direction of the alternative hypothesis, specifically how the statistic is calculated (the order of the subtraction).
So our observed statistic, `r round(dative_obs$stat, 3)`, is the proportion of NP realizations of the recipient clause in the spoken modality minus the proportion in the written modality: NP realizations appear `r round(dative_obs$stat * 100, 0)` percentage points more often in the spoken modality than in the written modality.
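To see how the order determines the sign of the statistic, here is a quick check reversing the order with the objects above.

```r
# Reversing the order flips the sign of the difference in proportions
dative_spec |>
  calculate(
    stat = "diff in props",
    order = c("written", "spoken")
  )
```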
The question remains, is this difference statistically significant? To answer this question, we generate the null hypothesis distribution\index{null hypothesis distribution} and calculate the $p$-value\index{p-value}, as seen in @exm-infer-cat-null-hypothesis-bivariate.
::: {#exm-infer-cat-null-hypothesis-bivariate}
```{r}
#| label: infer-cat-null-hypothesis-bivariate
# Generate the null hypothesis distribution
dative_null <-
dative_spec |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "diff in props", order = c("spoken", "written"))
# Calculate the $p$-value
dative_null |>
get_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "two-sided" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\cindex{hypothesize()}\cindex{generate()}\cindex{calculate()}\cindex{get_p_value()}
\index{two-sided test}
:::
Note, when generating the null hypothesis distribution, we use the `hypothesize()` function with the `null` argument set to "independence". This is because we are interested in the relationship between the response and explanatory variables. The null hypothesis is that there is no relationship between the response and explanatory variables. When generating the samples, we use the permutation approach, which randomly shuffles the response variable values for each sample. This simulates the null hypothesis that there is no relationship between the response and explanatory variables.
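A one-line base-R illustration of what a single permutation does: shuffling the response values breaks any association with the explanatory variable while preserving the overall proportions.

```r
# One permutation: shuffle the response values of `rcp_real`
head(sample(dative_tbl$rcp_real))
```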
The $p$-value is reported as $0$. To provide some context, we will generate a confidence interval for our observed statistic using the bootstrap method, as seen in @exm-infer-cat-confidence-interval-bivariate.
::: {#exm-infer-cat-confidence-interval-bivariate}
```r
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "diff in props", order = c("spoken", "written"))
# Calculate the confidence interval
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95)
# Preview
dative_ci
```
:::
::: {.content-visible when-format="pdf"}
\vspace{3em}
:::
```{r}
#| label: infer-cat-confidence-interval-bivariate
#| echo: false
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "diff in props", order = c("spoken", "written"))
# Calculate the confidence interval
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95)
# Preview
dative_ci
```
The confidence interval does not contain the null hypothesis value of 0 (no difference), which provides evidence that the proportion of NP realizations of the recipient clause is different between the spoken and written modalities.
#### Multivariate analysis
In many scenarios, it is common to have multiple explanatory variables\index{explanatory variables} that need to be considered. In such cases, logistic regression is a suitable modeling technique. Logistic regression\index{logistic regression} allows for the inclusion of both categorical and continuous explanatory variables. The primary objective of using logistic regression is to assess the association between these variables and the response variable. By analyzing this relationship, we can determine how changes in the explanatory variables influence the probability of the outcome occurring.
To explore this scenario, let's posit that:
- NP and PP realizations of the recipient clause are contingent on modality and word length ratio of the recipient and theme\index{research question}.
The length ratio gets at the length of the recipient clause relative to the length of the theme clause. This ratio is an operationalization\index{operationalize} of a phenomenon known as 'Heavy NP' shift. There are many ways to operationalize this phenomenon, but the length ratio is a simple method to approximate the phenomenon. It attempts to capture the idea that the longer the theme clause is relative to the recipient clause, the more likely the recipient clause will be realized as an NP ---in other words, when the theme is relatively longer than the recipient, the theme is ordered last in the sentence, and the recipient is ordered first in the sentence and takes the form of an NP (instead of a PP).
The hypothesis, then, is that @exm-heavy-np-shift (2) would be less likely than (1) because the theme is relatively longer than the recipient.
::: {#exm-heavy-np-shift}
(1) John gave [Mary ~NP~] the large book that I showed you in class yesterday.
(2) John gave the large book that I showed you in class yesterday [to Mary ~PP~].
:::
Let's consider this variable `length_ratio` and `modality` together as explanatory variables for the realizations of the recipient clause `rcp_real`.
Let's create the `length_ratio` variable by dividing the `thm_len` by the `rcp_len`. This will give us values larger than 1 when the theme is longer than the recipient. And since we are working with a skewed distribution\index{skewed distribution}, let's log-transform\index{log transformation} the `length_ratio` variable. In @exm-infer-cat-create-length-ratio, we see the code for creating the `length_ratio` variable.
::: {#exm-infer-cat-create-length-ratio}
```{r}
#| label: infer-cat-create-length-ratio
# Create the `length_ratio_log` variable
dative_tbl <-
dative_tbl |>
mutate(
length_ratio_log = log(thm_len / rcp_len)
) |>
select(-thm_len, -rcp_len)
# Preview
glimpse(dative_tbl)
```
\index{R packages!dplyr}
\cindex{mutate()}\cindex{select()}\cindex{glimpse()}\cindex{log()}
:::
Let's visualize the relationship between `rcp_real` and `length_ratio_log` separately and then together with `modality`, as seen in @exm-infer-cat-bivariate-vis-length-ratio-log\index{bar plot}\index{boxplot}.
::: {#exm-infer-cat-bivariate-vis-length-ratio-log}
```r
# Visualize the proportion of `rcp_real` by `modality`
dative_tbl |>
ggplot(aes(x = rcp_real, fill = modality)) +
geom_bar(position = "fill") +
labs(
x = "Realization of recipient clause",
y = "Proportion",
fill = "Modality"
)
# Visualize the relationship between `rcp_real` and `length_ratio_log`
dative_tbl |>
ggplot(aes(x = rcp_real, y = length_ratio_log)) +
geom_boxplot() +
labs(
x = "Realization of recipient clause",
y = "Length ratio"
)
```
:::
```{r}
#| label: fig-infer-cat-bivariate-length-ratio-log
#| fig-cap: "Distribution of the variables `modality` and `length_ratio_log` by the levels of the `rcp_real` variable."
#| fig-alt: "A bar plot and a boxplot showing the distribution of the variables `modality` and `length_ratio_log` by the levels of the `rcp_real` variable."
#| fig-subcap:
#| - "RCP by modality"
#| - "RCP by length ratio"
#| layout-ncol: 2
#| echo: false
# Visualize the proportion of `rcp_real` by `modality`
dative_tbl |>
ggplot(aes(x = rcp_real, fill = modality)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("#525252", "#BABABA")) +
labs(
x = "Realization of recipient clause",
y = "Proportion",
fill = "Modality"
) +
theme_qtalr(font_size = 10)
# Visualize the relationship between `rcp_real` and `length_ratio_log`
dative_tbl |>
ggplot(aes(x = rcp_real, y = length_ratio_log)) +
geom_boxplot() +
labs(
x = "Realization of recipient clause",
y = "Length ratio"
) +
theme_qtalr(font_size = 10)
```
\index{R packages!ggplot2}
\cindex{ggplot()}\cindex{aes()}\cindex{geom_bar()}\cindex{labs()}\cindex{geom_boxplot()}\cindex{labs()}
To understand the visualizations in @fig-infer-cat-bivariate-length-ratio-log, remember that the null hypothesis is that there is no difference in the proportion of NP and PP realizations of the recipient clause by modality or length ratio. On the flip side, the alternative hypothesis is that there is a difference in the proportion of NP and PP realizations of the recipient clause by modality and length ratio. From the visual inspection, it appears that NP realizations of the recipient clause are more common in the spoken modality and that the NP realizations have a higher overall length ratio (larger theme relative to recipient) than PP realizations of the recipient clause. This suggests that the alternative hypothesis is likely true, but we need to conduct a statistical test to determine if the differences are statistically significant.
Let's calculate the statistics (not statistic) for our logistic regression by specifying the relationship between the response and explanatory variables and then using `fit()` to fit the logistic regression model, as seen in @exm-infer-cat-logistic-regression.
::: {#exm-infer-cat-logistic-regression}
```{r}
#| label: infer-cat-logistic-regression
# Specify the relationship
dative_spec <-
dative_tbl |>
specify(
rcp_real ~ modality + length_ratio_log
)
# Fit the logistic regression model
dative_fit <-
dative_spec |>
fit()
# Preview
dative_fit
```
\index{R packages!infer}
\cindex{specify()}\cindex{fit()}
:::
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The reference level in R is assumed to be the first level alphabetically, unless otherwise specified. We can override this default by using the `fct_relevel()` function from {forcats} [@R-forcats]. The reason we would want to do this is to make the reference level more interpretable. In our case, we would want to make the spoken modality the reference level. This allows us to estimate the difference in the proportion of NP realizations of the recipient as a positive value. Remember that in @fig-infer-cat-bivariate-length-ratio-log-1, the proportion of NP realizations of the recipient clause is higher in the spoken modality than in the written modality. If we were to use the written modality as the reference level, the difference would be negative. Not that we couldn't interpret a negative value, but positive values are easier to interpret.
:::
Note I pointed out statistics, not statistic. In logistic regression models, the number of statistics reported depends on the number of explanatory variables. If there are two variables, there will be at least three terms: one for each variable and the intercept term. If one or more variables are categorical, however, there will be additional terms when a categorical variable has three or more levels.
In our case, the `modality` variable has two levels, so there are three terms. The first is the intercept term: the log odds of an NP realization of the recipient clause when `modality` is at its reference level (spoken) and `length_ratio_log` is 0. The second, `modalitywritten`, is the change in those log odds for the written modality relative to the spoken modality, holding `length_ratio_log` constant. The third, `length_ratio_log`, is the change in the log odds for a one-unit increase in `length_ratio_log`, holding modality constant. Notably, the spoken modality does not appear as a term of its own; it is the baseline built into the intercept, against which the `modalitywritten` term is compared. For categorical variables, one of the levels serves as the point of reference, or **reference level**\index{reference level}, against which every other level is compared.
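Log odds are not always intuitive, so it can help to convert an estimate to an odds ratio or a probability. Here is a quick illustration with a hypothetical value (not an estimate from `dative_fit`):

```{r}
#| label: infer-cat-log-odds-conversion
# A hypothetical log odds value, for illustration only
log_odds <- 0.75
# Exponentiating gives the odds ratio
exp(log_odds)
# The inverse logit gives the corresponding probability
plogis(log_odds)
```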
Now let's generate the null hypothesis distribution\index{null hypothesis distribution} and calculate the $p$-value for each of the terms, as seen in @exm-infer-cat-null-hypothesis-logistic-regression.
::: {#exm-infer-cat-null-hypothesis-logistic-regression}
```{r}
#| label: infer-cat-null-hypothesis-logistic-regression
# Generate the null hypothesis distribution
dative_null <-
dative_spec |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
fit()
# Calculate the $p$-value
dative_null |>
get_p_value(
dative_fit, # the observed statistics
direction = "two-sided" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\cindex{hypothesize()}\cindex{generate()}\cindex{fit()}\cindex{get_p_value()}
\index{two-sided test}
:::
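The $p$-values are easier to appraise alongside the permutation distributions themselves. As a sketch (assuming the `dative_null` and `dative_fit` objects created above), {infer}'s `visualize()` plots the null distribution for each term and `shade_p_value()` marks the observed statistics:

```{r}
#| label: infer-cat-null-visualize-sketch
#| eval: false
# Plot the null distribution for each term and shade the regions
# at least as extreme as the observed statistics
dative_null |>
  visualize() +
  shade_p_value(dative_fit, direction = "two-sided")
```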
It appears that our main effects, `modality` and `length_ratio_log`, are statistically significant. Let's generate the confidence intervals\index{confidence interval} for each of the terms, as seen in @exm-infer-cat-confidence-interval-logistic-regression.
::: {#exm-infer-cat-confidence-interval-logistic-regression}
```{r}
#| label: infer-cat-confidence-interval-logistic-regression
# Generate the bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
fit()
# Calculate the confidence interval
dative_ci <-
dative_boot |>
get_confidence_interval(
point_estimate = dative_fit,
level = 0.95
)
# Preview
dative_ci
```
\index{R packages!infer}
\cindex{generate()}\cindex{fit()}\cindex{get_confidence_interval()}
:::
The confidence intervals for the main effects, `modality` and `length_ratio_log`, do not contain the null hypothesis value of 0, which provides evidence that each of the explanatory variables is related to the proportion of NP realizations of the recipient clause.
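A visual check is possible here too; a sketch, assuming the objects created above:

```{r}
#| label: infer-cat-ci-visualize-sketch
#| eval: false
# Plot the bootstrap distribution for each term and shade the
# 95% confidence intervals calculated above
dative_boot |>
  visualize() +
  shade_confidence_interval(endpoints = dative_ci)
```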
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
Significance tests are not the only way to evaluate the evidence against the null hypothesis. We can also quantify the effect size\index{effect size} of each of the explanatory variables by converting the (log) odds ratio to $r$ (correlation coefficient) and $R^2$\index{R-squared} (coefficient of determination) values. {effectsize} [@R-effectsize] provides the function `logoddsratio_to_r()` for this conversion in logistic regression models.
These measures help distinguish statistically significant results from practically significant ones. A statistically significant result is one that is unlikely to have occurred by chance; a practically significant result is one whose effect is large enough to be meaningful.
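A sketch of how this conversion might look (the log odds ratio here is a hypothetical value, not output from our model):

```{r}
#| label: infer-cat-effect-size-sketch
#| eval: false
# Convert a hypothetical log odds ratio to r and R-squared
library(effectsize)

r <- logoddsratio_to_r(0.75)
r   # correlation coefficient
r^2 # coefficient of determination
```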
:::
Our logistic regression model as specified considers the explanatory variables `modality` and `length_ratio_log` independently, controlling for the other explanatory variable. This is an **additive model**\index{additive model}, which is what we stated in our hypothesis and represented in the formula `y ~ x1 + x2`.
Not all multivariate relationships are additive. We can also hypothesize an interaction between the explanatory variables. An **interaction model**\index{interaction model} is one which hypothesizes that the effect of one explanatory variable on the response variable is dependent on the other explanatory variable(s). In our case, we could have hypothesized that the effect of `length_ratio_log` on the proportion of NP realizations of the recipient clause is dependent on `modality`. We can specify this relationship using the formula approach, as seen in @exm-infer-cat-logistic-regression-interaction.
::: {#exm-infer-cat-logistic-regression-interaction}
```{r}
#| label: infer-cat-logistic-regression-interaction
# Specify the relationship between the response and explanatory variables
dative_inter_spec <-
dative_tbl |>
specify(
rcp_real ~ modality * length_ratio_log
)
```
\index{R packages!infer}
\cindex{specify()}
:::
Replacing the `+` with a `*` tells the model to consider the interaction between the explanatory variables. A model with an interaction changes the terms and the estimates. In @exm-infer-cat-logistic-regression-interaction-terms, we see the terms for the logistic regression model with an interaction.
::: {#exm-infer-cat-logistic-regression-interaction-terms}
```{r}
#| label: infer-cat-logistic-regression-interaction-terms
# Fit the logistic regression model
dative_inter_fit <-
dative_inter_spec |>
fit()
# Preview
dative_inter_fit
```
\index{R packages!infer}
\cindex{fit()}
:::
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
As an exercise, consider the following hypothesis:
- NP and PP realizations of the recipient clause are contingent on modality and word length ratio of the recipient and theme, and the effect of the length ratio on the proportion of NP realizations of the recipient clause is dependent on the modality.
Follow the simulation-based process to test this hypothesis. What are the results? What are the implications of the results?
:::
The additional term `modalitywritten:length_ratio_log` is the interaction term. We also see that the log odds estimates for the previous terms have changed, because the interaction term absorbs some of the explanatory power that the main effects carried in the additive model. Whether or not we run an interaction model depends on our research question. Again, the hypothesis precedes the model: if we hypothesize an interaction, then we should run an interaction model; if we do not, then we should not.
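As a final note on syntax, the `*` operator is shorthand: a formula with `*` expands to the main effects plus their interaction, so the following two specifications are equivalent:

```{r}
#| label: infer-cat-formula-expansion
#| eval: false
# `*` expands to the main effects plus the interaction term
rcp_real ~ modality * length_ratio_log
rcp_real ~ modality + length_ratio_log + modality:length_ratio_log
```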
### Numeric {#sec-infer-numeric}
We now turn our attention to the analysis scenarios where the response variable is numeric\index{continuous variables}. Just as for categorical variables, we can have univariate, bivariate, and multivariate analysis scenarios. The statistical tests for numeric variables are summarized in @tbl-infer-num-design.
::: {#tbl-infer-num-design tbl-colwidths="[15, 30, 30, 25]"}
| Scenario | Explanatory Variable(s) | Statistical Test | `infer` |
|:---------|:------------------------|:-----------------|:--------|
| Univariate | - | Mean | `mean` |
| Bivariate | Numeric | Correlation | `correlation` |
| Bivariate | Categorical (2 levels) | Difference in means | `diff in means` |
| Bivariate | Categorical (3+ levels) | ANOVA | `F` |
| Multivariate | Numeric or Categorical | Linear regression | `fit()` |
Statistical test design for numeric response variables
:::
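These map onto the same {infer} workflow we have been using: the value in the last column is passed to the `stat` argument of `calculate()`, while the multivariate case uses `fit()` as before. A minimal sketch with placeholder names (not the SWDA dataset):

```{r}
#| label: infer-num-workflow-sketch
#| eval: false
# Placeholder tibble and column names, for illustration only
obs_mean <-
  my_tbl |>
  specify(response = my_numeric_var) |>
  calculate(stat = "mean")
```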
The dataset we will use is drawn from the Switchboard Dialog Act Corpus [@SWDA2008]\index{Switchboard Dialog Act Corpus (SWDA)}\index{corpus!specialized}. The data dictionary is found in @tbl-infer-num-data-dict.
```{r}
#| label: tbl-infer-num-data-dict
#| tbl-cap: "Data dictionary for the transformed SWDA dataset"
#| tbl-colwidths: [15, 18, 15, 52]
#| echo: false