---
execute:
echo: true
---
::: {.content-visible when-format="pdf"}
```{=latex}
\setDOI{10.4324/9781003393764.10}
\thispagestyle{chapterfirstpage}
```
:::
# Infer {#sec-infer-chapter}
```{r}
#| label: setup-options
#| child: "../_common.qmd"
#| cache: false
```
::: {.callout}
**{{< fa regular list-alt >}} Outcomes**
- Identify the research goals of inferential data analysis
- Describe the workflow for inferential data analysis
- Indicate the importance of quantifying uncertainty in inferential data analysis
:::
```{r}
#| label: inference-data-packages
#| echo: false
# Packages
```
In this chapter, we consider approaches to deriving knowledge from information which can be generalized to the population from which the data is sampled. This process is known as statistical inference. The discussion here implements descriptive assessments, statistical tests, and evaluation procedures for a series of contexts which are common in the analysis of corpus-based data. During our treatment of these contexts, we will establish a foundational understanding of statistical inference using a simulation-based approach.
::: {.callout}
**{{< fa terminal >}} Lessons**
**What**: Advanced Tables \
**How**: In an R console, load {swirl}, run `swirl()`, and follow prompts to select the lesson.\
**Why**: To explore how to enhance dataset summaries using {janitor} and present them effectively with {kableExtra}'s advanced formatting options.
:::
## Orientation {#sec-infer-orientation}
In contrast to exploratory and predictive analyses, inference is not a data-driven endeavor. Rather, the goal of inferential data analysis (IDA)\index{inferential data analysis (IDA)} is to make theoretical claims about the population and assess the extent to which the data supports those claims\index{theory-driven research}. This entails two key methodological restrictions that are not in play in other analysis methods.
First, the research question\index{research question} and expected findings are formulated *before* the data is analyzed; strictly speaking, this should take place even before data collection\index{acquire data}. This helps ensure that the data is aligned with the research question, that the data is representative\index{representativeness} of the population\index{sampling}, and that the analysis has a targeted focus and does not run the risk of becoming a 'just-so' story[^hark] or a 'significance-finding' mission[^phack], both of which violate the principles of significance testing.
[^hark]: "Hypothesis After Result is Known" (HARKing) involves selectively analyzing data, trying different variables or combinations until a significant $p$-value is obtained, or stopping data collection when a significant result is found [@Kerr1998].
[^phack]: "$p$-hacking" is the practice of running multiple tests until a statistically significant result is found. This practice violates the principles of significance testing [@Head2015].
Second, the data used in IDA is only used once. That is to say, the entire dataset is used a single time to statistically interrogate the relationship(s) of interest. In both exploratory\index{exploratory data analysis (EDA)} and predictive data analysis\index{predictive data analysis (PDA)} the data can be approached multiple times in different ways and the results of the analysis can be used to inform the next steps in the analysis. In IDA, however, the data is used to test a specific hypothesis\index{hypothesis} and the results of the analysis are interpreted in the context of that hypothesis.\index{research interpretation}
The methodological approach to IDA\index{inferential data analysis (IDA)} is the most straightforward of the analysis types covered in this textbook. As the research goal is to test a claim, the steps necessary are fewer than in EDA or PDA, where the exploratory nature of these approaches includes various possible iterations. The workflow for IDA is shown in @tbl-infer-workflow.
<!-- Workflow -->
::: {#tbl-infer-workflow tbl-colwidths="[5, 15, 80]"}
| Step | Name | Description |
|:-----|:-----|:------------|
| 1 | Identify | Identify and map the hypothesis statement to the appropriate response and explanatory variables. |
| 2 | Inspect | Assess the distribution of the variable(s) with the appropriate descriptive statistics and visualizations. |
| 3 | Interrogate | Apply the appropriate statistical procedure to the dataset. |
| 4 | Interpret | Review the statistical results and interpret them in the context of the hypothesis. |
Workflow for inferential data analysis
:::
Based on the hypothesis\index{hypothesis} statement, we first identify and operationalize\index{operationalize} the variables. The response variable\index{response variable} is the variable whose variation we aim to explain. Additionally, in most statistical designs, one or more explanatory variables\index{explanatory variables} are included in the analysis in an attempt to gauge the extent to which these variables account for the variation in the response variable. For both response and explanatory variables, it is key to confirm that your operationalization is well-defined and that the data aligns with it.
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
What are the explanatory and/or response variables in each of these statements? How are these variables operationalized? What key sampling features are necessary for the data to test these hypotheses?
1. There will be statistically significant differences in the kinds of collocations used in English dialects spoken in urban areas compared to those spoken in rural areas.
2. French L2 learners will make more vocabulary errors in oral production than in written production.
3. The association strength between Mandarin words and their English translations will be a significant predictor of translation difficulty for novice translators.
4. The prevalence of gender-specific words in German-speaking communities on distinct online forums will significantly reflect gender roles.
5. The frequency of function words used by Spanish L2 learners will be a significant predictor of their stage in language acquisition.
:::
Next, we determine the informational values\index{informational types} of the variables\index{variables}. The informational value of each variable will condition how we approach visualization, interrogation, and ultimately interpretation of the results. Note that some informational types can be converted to other types; specifically, higher-order types can be converted to lower-order types. For example, a continuous variable\index{continuous variables} can be converted to a categorical variable\index{categorical variables}, but not vice versa. It is preferable, however, to use the highest informational value of a variable. Simplifying data results in a loss of information, and hence statistical power, which may lead to results that obscure meaningful patterns in the data [@Baayen2004]\index{Baayen}.
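A minimal sketch of this kind of conversion, using a hypothetical `ratings` vector: binning a continuous variable with `cut()` collapses distinct values into shared categories, discarding the within-bin distinctions.

```r
# Hypothetical continuous variable
ratings <- c(2.1, 3.8, 4.9, 1.2, 3.3)

# Convert to a two-level categorical variable; the distinctions
# within each bin are lost
cut(ratings, breaks = c(0, 2.5, 5), labels = c("low", "high"))
```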
With our design in place, we can now inspect the data. This involves assessing the distribution of the variables using descriptive statistics and visualizations\index{descriptive assessment}. The goal of this step is to confirm the integrity of the data (missing data\index{missing data}, anomalies, *etc*.), identify general patterns in the data, and identify potential outliers\index{outliers}. As much as this is a verification step, it also serves to provide a sense of the data and the extent to which the data aligns with the hypothesis. This is particularly true when statistical designs are complex and involve multiple explanatory variables. An appropriate visualization provides context for interpreting the results of the statistical analysis.
Interrogating the data involves applying the appropriate statistical procedure to the dataset. In the **Null Hypothesis Significance Testing** (NHST)\index{Null Hypothesis Significance Testing (NHST)} paradigm, this process includes calculating a statistic from the data, comparing it to a null hypothesis distribution, and measuring the evidence against the null hypothesis. The **null hypothesis distribution**\index{null hypothesis distribution} is a distribution of statistic values that we would expect if the null hypothesis were true, *i.e.* that there is no difference or relationship between the explanatory and/or response variables. By comparing the observed statistic to the null hypothesis distribution, we can determine the likelihood of observing the observed statistic, if the null hypothesis were true. The estimate of this likelihood is a **$p$-value**\index{p-value}. When the $p$-value is below a threshold, typically 0.05, the result is considered statistically significant. This means that the observed statistic is sufficiently different from the null hypothesis distribution that we can reject the null hypothesis.
Now let's consider how to approach interpreting the results from a statistical test. The $p$-value\index{p-value} provides a probability that the results of our statistical test could be explained by the null hypothesis. When this probability is below the alpha level of 0.05, the result is considered statistically significant, otherwise we have a 'null result' (*i.e.* non-significant).
However, this sets up a binary distinction that can be problematic. On the one hand, what is one to do if a test returns a $p$-value of 0.051? According to standard practice, these "marginally significant" results would not be statistically significant. On the other hand, if we get a statistically significant result, say a $p$-value of 0.049, do we move on ---case closed? To address both of these issues, it is important to calculate a confidence interval for the test statistic. The **confidence interval**\index{confidence interval} is the range of values within which we would expect the true statistic value to fall, at some level of uncertainty. Again, 95% is the most common level. The upper and lower bounds of this range are called the confidence limits for the test statistic.
Used in conjunction with $p$-values\index{p-value}, confidence intervals\index{confidence interval} can provide a more nuanced interpretation of the results of a statistical test. For example, if we get a $p$-value of 0.051, but the confidence interval is very narrow, we can be more confident that the results are reliable. Conversely, if we get a $p$-value of 0.049, but the confidence interval is very wide, we can be less confident that the results are reliable. If our confidence interval contains the null value, then even a significant $p$-value will require a more nuanced interpretation\index{research interpretation}.
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
Overgeneralization and undergeneralization are more formally known as Type I and Type II error, respectively. Type I error (false positive) occurs when we reject the null hypothesis when it is true. That is, we erroneously detect a significant result, when in fact the tested relationship is not borne out in the population. Type II error (false negative) occurs when we fail to reject the null hypothesis when it is false. This is a case of missing a significant result due to the limitations of the analysis which can stem from the sample size, the design of the study, or the statistical test used.
:::
It is important to underscore that the purpose of IDA is to draw conclusions from a dataset which are generalizable to the population. These conclusions require that there are rigorous measures to ensure that the results of the analysis do not overgeneralize (suggest there is a relationship when there is not one) and balance that with the fact that we don't want to undergeneralize (miss the fact that there is a relationship in the population, but our analysis was not capable of detecting it).
## Analysis {#sec-infer-analysis}
<!-- Goals of this section -->
In this section, we will discuss the practical application of inferential data analysis. The discussion will be divided into two sections based on the type of response variable: categorical and numeric. We will then explore specific designs for univariate, bivariate, and multivariate tests\index{multivariate analysis}. We will learn and implement NHST\index{Null Hypothesis Significance Testing (NHST)} using a simulation-based workflow. In contrast to theory-based methods, simulation-based methods tend to be more intuitive, easier to implement, and provide a better conceptual understanding of the statistical designs and analyses [@Morris2019; @Rossman2014a].
The steps for implementing a simulation-based approach to significance testing are outlined in @tbl-infer-simulation.
<!-- Simulation-based approach -->
::: {#tbl-infer-simulation tbl-colwidths="[5, 30, 65]"}
| Step | Name | Description |
|:-----|:-----|:------------|
| 1 | Specify | Specify the variables of interest and their relationship |
| 2 | Calculate | Calculate the observed statistic |
| 3 | Hypothesize | Generate the null hypothesis distribution |
| 4 | Get $p$-value | Calculate the $p$-value |
| 5 | Get confidence interval | Calculate the confidence interval |
Simulation-based workflow for significance testing
:::
<!-- Setup: packages/ options/ data -->
{infer} [@R-infer] provides a Tidyverse-friendly\index{Tidyverse} framework to implement simulation-based methods for statistical inference. Designed to be used in conjunction with {tidyverse}, {infer} provides a set of functions that can be used to specify the variables of interest, calculate the observed statistic, generate the null hypothesis distribution\index{null hypothesis distribution} and calculate the $p$-value\index{p-value} and the confidence interval\index{confidence interval}.
Let's load the necessary packages we will use in this section, as seen in @exm-infer-setup.
\pagebreak
::: {#exm-infer-setup}
```r
# Load packages
library(infer) # for statistical inference
library(skimr) # for descriptive statistics
library(janitor) # for cross-tabulation
```
\cindex{library()}
```{r}
#| label: infer-setup
#| echo: false
# Load packages
library(infer) # for statistical inference
library(skimr) # for descriptive statistics
library(janitor) # for cross-tabulation
```
:::
### Categorical {#sec-infer-categorical}
Here we demonstrate the application of IDA to categorical response variables\index{response variable}\index{categorical variables}. This will include various common statistical designs and analyses. In @tbl-infer-cat-design, we see common design scenarios, the variables involved, and the statistic used in the analysis.
::: {#tbl-infer-cat-design tbl-colwidths="[15, 30, 35, 20]"}
| Scenario | Explanatory Variable(s) | Statistical Test | `infer` |
|:---------|:------------------------|:-----------------|:------------|
| Univariate | - | Proportion | `prop` |
| Bivariate | Categorical | Difference in proportions | `diff in props` |
| Bivariate | Categorical (3+ levels) | Chi-square | `chisq` |
| Multivariate | Categorical or Numeric | Logistic regression | `fit()` |
Statistical test designs for categorical response variables
:::
We will use a derived version of the `dative` dataset from {languageR} [@R-languageR]. It contains over 3,000 observations describing the realization of the recipient clause in English dative constructions, drawn from the Switchboard corpus and the Treebank Wall Street Journal collection\index{corpus!reference}. To familiarize ourselves with the dataset, let's consider the data dictionary in @tbl-infer-cat-data-dict.
```{r}
#| label: tbl-infer-cat-data-dict
#| tbl-cap: "Data dictionary for the `dative_tbl` dataset."
#| tbl-colwidths: [10, 25, 15, 50]
#| echo: false
# Data dictionary for the `dative_tbl` dataset
read_csv("data/dative_ida_dd.csv") |>
tt(width = 1)
```
We see that this dataset has four variables, two categorical and two numeric. In our demonstrations, we are going to use `rcp_real` as the response variable, the variable whose variation we are investigating.\index{response variable}
For a bit more context, a dative is the phrase which reflects the entity that takes the recipient role in a ditransitive clause. In English, the recipient (dative) can be realized as either a prepositional phrase (PP) as seen in @exm-infer-cat-dative-examples (1) or as a noun phrase (NP) as seen in (2).
::: {#exm-infer-cat-dative-examples}
Dative examples
(1) John gave the book [to Mary ~PP~].
(2) John gave [Mary ~NP~] the book.
:::
Together these two syntactic options are known as the Dative Alternation [@Bresnan2007].
Let's go ahead and load the dataset, as seen in @exm-infer-cat-read-dative.
::: {#exm-infer-cat-read-dative}
```r
# Load datasets
dative_tbl <-
read_csv("../data/dative_ida.csv")
```
```{r}
#| label: infer-cat-read-dative
#| include: false
# Load datasets
dative_tbl <-
read_csv("data/dative_ida.csv")
```
\index{R packages!readr}
\cindex{read_csv()}
:::
```{r}
#| label: infer-cat-diagnostics
#| include: false
# Convert character variables to factors
dative_tbl <-
dative_tbl |>
mutate(across(where(is.character), factor))
# Statistical overview
dative_tbl |>
skim()
```
In preparation for statistical analysis, I performed a statistical overview and diagnostics of the dataset\index{descriptive assessment}. This included checking for missing data\index{missing data}, outliers\index{outliers}, and anomalies. I also checked the distribution of the variables using descriptive statistics\index{descriptive statistics} and visualizations, noting that the `rcp_len` and `thm_len` variables are right-skewed\index{skewed distribution}. This is something to keep in mind. The results of this overview and diagnostics are not shown here, but they are important steps in the IDA workflow. In this process, I converted the character variables to factors as most statistical tests require factors. A preview of the dataset is shown in @exm-infer-cat-dative-preview.
::: {#exm-infer-cat-dative-preview}
```{r}
#| label: infer-cat-dative-preview
# Preview
glimpse(dative_tbl)
```
:::
We can see that the dataset includes `r format(nrow(dative_tbl), big.mark=",")` observations. We will take a closer look at the descriptive statistics for the variables as we prepare for each analysis.
#### Univariate analysis {#sec-infer-cat-univariate}
The univariate analysis is the simplest statistical design and analysis. It includes only one variable. The goal is to describe the distribution of the levels of the variable. The `rcp_real` variable has two levels: NP and PP. A potential research question\index{research question} for a case like this may aim to test the claim that:
- NP realizations of the recipient clause are the canonical form in English dative constructions, and therefore will be the most frequent realization of the recipient clause.
This hypothesis can be tested using a **proportion test**\index{proportion test}, which compares the observed proportion to a point value (here, 0.5). The null hypothesis\index{null hypothesis} is that there is no difference in the proportion of NP and PP realizations of the recipient clause. The alternative hypothesis\index{alternative hypothesis} is that NP realizations of the recipient clause are more frequent than PP realizations.
Before we get into statistical analysis, it is always a good idea to cross-tabulate\index{contingency table} or visualize the question, depending on the complexity of the relationship. In @exm-infer-cat-univariate-tbl, we see the code that shows the distribution of the levels of the `rcp_real` variable in a contingency table.
::: {#exm-infer-cat-univariate-tbl}
```r
# Contingency table of `rcp_real`
dative_tbl |>
tabyl(rcp_real) |>
adorn_pct_formatting(digits = 2) |>
kable() |>
kable_styling()
```
\index{R packages!janitor}\index{R packages!kableExtra}\index{R packages!knitr}
\cindex{tabyl()}\cindex{adorn_pct_formatting()}\cindex{kable()}\cindex{kable_styling()}
```{r}
#| label: tbl-infer-cat-univariate
#| tbl-cap: "Distribution of the levels of the `rcp_real` variable."
#| tbl-colwidths: [50, 25, 25]
#| echo: false
dative_tbl |>
tabyl(rcp_real) |>
adorn_pct_formatting(digits = 2) |>
as_tibble() |>
tt(width = 1)
```
:::
From @tbl-infer-cat-univariate, we see that the proportion of NP realizations of the recipient clause is higher than the proportion of PP realizations of the recipient clause. However, we cannot conclude that there is a difference in the proportion of NP and PP realizations of the recipient clause\index{proportion}. We need to conduct a statistical test to determine if the difference is statistically significant.
To determine if the distribution of the levels of the `rcp_real` variable is different from what we would expect if the null hypothesis were true, we need to calculate the statistic observed in the sample and compare it to the statistics observed in many samples where the null hypothesis is true.
First, let's calculate the proportion of NP and PP realizations of the recipient clause in the sample. We turn to the `specify()` function from {infer} to specify the variable of interest, step 1 in the simulation-based workflow in @tbl-infer-simulation. In this case, we only have the response variable\index{response variable}. Furthermore, the argument `success` specifies the level of the response variable that we will use as the 'success'. The term 'success' is used because the `specify()` function was designed for binomial variables where the levels are 'success' and 'failure', as seen in @exm-infer-cat-specify.
::: {#exm-infer-cat-specify}
```{r}
#| label: infer-cat-specify
# Specify the variable of interest
dative_spec <-
dative_tbl |>
specify(
response = rcp_real,
success = "NP"
)
# Preview
dative_spec
```
\index{R packages!infer}
\cindex{specify()}
:::
The `dative_spec` is a data frame with attributes which are used by {infer} to maintain information about the statistical design for the analysis. In this case, we only have information about what the response variable is.
Step 2 is to calculate the observed statistic. The `calculate()` function is used to calculate the proportion statistic setting `stat = "prop"`, as seen in @exm-infer-cat-calculate.
\pagebreak
::: {#exm-infer-cat-calculate}
```{r}
#| label: infer-cat-calculate
# Calculate the proportion statistic
dative_obs <-
dative_spec |>
calculate(stat = "prop")
# Preview
dative_obs
```
\index{R packages!infer}
\cindex{calculate()}
:::
Note that the observed statistic, proportion, is the same as the proportion we calculated in @tbl-infer-cat-univariate. In such a simple example, the summary statistic\index{central tendency} and the observed statistic are the same. But this simple example shows how choosing the 'success' level of the response variable is important. If we had chosen the 'PP' level as the 'success' level, then the observed statistic would be the proportion of PP realizations of the recipient clause. There is nothing wrong with choosing the 'PP' level as the 'success' level, but it would change the direction of the observed statistic.
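To see this concretely, here is a quick check using the objects above: with "PP" as the 'success' level, the statistic is simply the complement of the NP proportion.

```r
# Choosing "PP" as the success level yields the complementary proportion
dative_tbl |>
  specify(response = rcp_real, success = "PP") |>
  calculate(stat = "prop")
```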
Now that we have the observed statistic, our goal will be to determine if the observed statistic is different from what we would expect if the null hypothesis were true. To do this, we simulate samples where the null hypothesis\index{null hypothesis distribution} is true, step 3 in our workflow.
Simulation means that we will randomly sample from the `dative_tbl` data frame many times. We need to determine how the sampling takes place. Since `rcp_real` is a variable with only two levels, the null hypothesis is that both levels are equally likely. In other words, in a null hypothesis world, we would expect the proportions of NP and PP to be roughly 50/50.
To formalize this hypothesis with `infer` we use the `hypothesize()` function and set the null hypothesis to "point" and the proportion to 0.5. Then we can `generate()` a number of samples, say 1,000, drawn from our 50/50 world. Finally, the `prop` (proportion) statistic is calculated for each of the 1,000 samples and returned in a data frame, as seen in @exm-infer-cat-null-hypothesis.
::: {#exm-infer-cat-null-hypothesis}
```{r}
#| label: infer-cat-null-hypothesis
# Generate the null hypothesis distribution
dative_null <-
dative_spec |>
hypothesize(null = "point", p = 0.5) |>
generate(reps = 1000, type = "draw") |>
calculate(stat = "prop")
# Preview
dative_null
```
\index{R packages!infer}
\cindex{hypothesize()}\cindex{generate()}\cindex{calculate()}
:::
The result of @exm-infer-cat-null-hypothesis is a data frame with as many rows\index{observations} as there are samples. Each row contains the proportion statistic for each sample drawn from the hypothesized distribution that the proportion of NP realizations of the recipient clause is 0.5.
To appreciate the null hypothesis distribution\index{null hypothesis distribution}, we can visualize it using a histogram\index{histogram}. {infer} provides a convenient `visualize()` function for visualizing distributions, as seen in @exm-infer-cat-null-hypothesis-vis.
::: {#exm-infer-cat-null-hypothesis-vis}
```r
# Visualize the null hypothesis distribution
visualize(dative_null)
```
```{r}
#| label: fig-infer-cat-null-hypothesis
#| fig-cap: "Simulation-based null distribution"
#| fig-alt: "A histogram showing the spread of values possible under the null hypothesis."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Visualize the null hypothesis distribution
dative_null |>
visualize() +
labs(title = "", y = "Count", x = "Statistic") +
theme_qtalr(font_size = 10)
```
\index{R packages!infer}
\cindex{visualize()}\cindex{labs()}
```{r}
#| label: infer-cat-null-hypothesis-range
#| include: false
null_range <- round(range(dative_null$stat), 2)
null_mean <- round(mean(dative_null$stat), 2)
null_sd <- round(sd(dative_null$stat), 2)
```
:::
In @fig-infer-cat-null-hypothesis, the x-axis shows the proportion statistic of NP realizations of the recipient clause that we would expect if the null hypothesis were true. For the 1,000 samples, the proportion statistic ranges from `r null_range[1]` to `r null_range[2]`. Importantly, most of the proportion statistics cluster around 0.5. In fact, the mean is `r null_mean` with a standard deviation\index{standard deviation}\index{dispersion} of `r null_sd`, which is what we would expect if the null hypothesis were true. But there is variation, as we would also expect.
Why would we expect variation? Consider the following analogy. If we were to flip a fair coin 10 times, we would expect to get 5 heads and 5 tails. But this doesn't always happen. Sometimes we get 6 heads and 4 tails. Sometimes we get 7 heads and 3 tails, and so on. As the number of flips increases, however, we would expect the proportion of heads to be closer to 0.5, but there would still be variation. The same is true for the null hypothesis distribution. As the number of samples increases, we would expect the proportion of NP realizations of the recipient clause to be closer to 0.5, but there would still be variation. The question is whether the observed statistic we obtained from our data, in @exm-infer-cat-calculate, is within some level of variation that we would expect if the null hypothesis were true.
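To make the coin-flip analogy concrete, here is a quick base-R simulation: each of 1,000 simulated sets of 10 fair flips yields a proportion of heads, and those proportions vary around 0.5.

```r
# Simulate 1,000 sets of 10 fair coin flips and record the
# proportion of heads in each set
set.seed(123)
flips <- replicate(1000, mean(rbinom(10, size = 1, prob = 0.5)))

range(flips) # the proportions vary from set to set
mean(flips)  # but they center near 0.5
```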
Let's visualize the observed statistic on the null hypothesis distribution\index{null hypothesis distribution}, as seen in @fig-infer-cat-null-hypothesis-obs, to gauge whether the observed statistic is within some level of variation that we would expect if the null hypothesis were true. The `shade_p_value()` function takes the null hypothesis distribution and the observed statistic and shades the sample statistics that are as or more extreme than the observed statistic.
::: {#exm-infer-cat-null-hypothesis-obs}
```r
dative_null |>
visualize() + # note we are adding a visual layer `+`
shade_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "greater" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\index{one-sided test}
\cindex{shade_p_value()}\cindex{visualize()}
```{r}
#| label: fig-infer-cat-null-hypothesis-obs
#| fig-cap: "Simulation-based null distribution with the observed statistic."
#| fig-alt: "A histogram showing the spread of values possible under the null hypothesis with the observed statistic represented as a line on the x-axis."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Visualize the null hypothesis distribution with the observed statistic
dative_null |>
visualize() + # note we are adding a visual layer `+`
shade_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "greater", # the direction of the alternative hypothesis
color = "grey"
) +
labs(title = "", y = "Count", x = "Statistic") +
theme_qtalr(font_size = 10)
```
\index{R packages!infer}
\cindex{visualize()}\cindex{shade_p_value()}\cindex{labs()}
\index{one-sided test}
:::
Just from a visual inspection, it is obvious that the observed statistic lies far away from the null distribution, far right of the right tail. No shading appears in this case as the observed statistic is far from the expected variation. This suggests that the observed statistic is not within the level of variation that we would expect if the null hypothesis were true.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The direction of the alternative hypothesis is important because it determines the $p$-value range. The "two-sided" direction means that we are interested in the proportion being different from 0.5. If we were only interested in the proportion of one outcome being greater than 0.5, then we would use the "greater" direction, or "less" in the opposite scenario.
:::
But we need to quantify this. We need to calculate the probability of observing the observed statistic, or a more extreme one, if the null hypothesis were true: the $p$-value\index{p-value}. Calculating this estimate is step 4 in the workflow. The $p$-value is calculated as the proportion of samples in the null hypothesis distribution that are as or more extreme than the observed statistic. The conventional threshold for rejecting the null hypothesis is the **alpha level**\index{alpha level}, typically 0.05, which corresponds to a 95% confidence level. This means that if the $p$-value is less than 0.05, then we reject the null hypothesis. If the $p$-value is greater than 0.05, then we fail to reject the null hypothesis.
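To make the counting concrete, here is a minimal sketch of the calculation by hand, using the `dative_null` and `dative_obs` objects created above: the $p$-value is the proportion of null statistics at least as extreme as the observed statistic.

```r
# Manual p-value for a one-sided ("greater") test: the proportion of
# null statistics at least as large as the observed statistic
mean(dative_null$stat >= dative_obs$stat)
```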
With `infer` we can calculate the $p$-value using the `get_p_value()` function. Let's calculate the $p$-value for our observed statistic, as seen in @exm-infer-cat-p-value.
\pagebreak
::: {#exm-infer-cat-p-value}
```{r}
#| label: infer-cat-p-value
# Calculate the $p$-value (observed statistic)
dative_null |>
get_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "greater" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\cindex{get_p_value()}
\index{one-sided test}
```{.r code-line-numbers="false"}
Warning message:
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
```
:::
The $p$-value for our observed statistic is reported as $0$, with a warning that the $p$-value estimate is contingent on the number of samples we generate in the null distribution. With 1,000 samples, a reported value of $0$ means the $p$-value is smaller than 1/1,000, so we have a statistically significant result at the alpha level of 0.05.
The $p$-value\index{p-value} is one, traditionally very common, estimate of uncertainty. Another estimate of uncertainty is the confidence interval, our 5th and final step. The confidence interval\index{confidence interval} is the range of values within which we would expect the true statistic value to fall, at some level of uncertainty. Again, 95% is the most common level. The upper and lower bounds of this range are called the **confidence limits**\index{confidence limits} for the test statistic. To calculate them, instead of generating a null hypothesis distribution, we generate a distribution based on resampling from the observed data. This is called the **bootstrap distribution**\index{bootstrap distribution}: it is generated by resampling from the observed data, with replacement, many times, which simulates the process of sampling\index{sampling} from the population many times. The test statistic is calculated for each resample. The confidence limits are the 2.5th and 97.5th percentiles of the bootstrap distribution, and the confidence interval is the range between them.
In @exm-infer-cat-confidence-interval, we see the code for calculating the confidence interval for our observed statistic.
\pagebreak
::: {#exm-infer-cat-confidence-interval}
```r
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "prop")
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95) # 95% confidence interval
dative_ci
```
:::
```{r}
#| label: infer-cat-confidence-interval
#| echo: false
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "prop")
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95) # 95% confidence interval
dative_ci
```
\index{R packages!infer}
\cindex{generate()}\cindex{get_confidence_interval()}\cindex{calculate()}
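As a sanity check on the percentile method described above, the confidence limits can be computed directly from the bootstrap statistics with base R's `quantile()`.

```r
# The 2.5th and 97.5th percentiles of the bootstrap distribution
# reproduce the 95% confidence limits
quantile(dative_boot$stat, probs = c(0.025, 0.975))
```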
Let's visualize the confidence interval using the `visualize()` and `shade_confidence_interval()` functions in @exm-infer-cat-confidence-interval-visualize on our bootstrapped samples, as seen in @fig-infer-cat-confidence-interval-visualize.
::: {#exm-infer-cat-confidence-interval-visualize}
```r
# Visualize the bootstrap distribution with the confidence interval
dative_boot |>
visualize() +
shade_confidence_interval(
dative_ci # the confidence interval
)
```
\index{R packages!infer}
\cindex{visualize()}\cindex{shade_confidence_interval()}
```{r}
#| label: fig-infer-cat-confidence-interval-visualize
#| fig-cap: "Bootstrap distribution of the proportion of NP realizations of the recipient clause with the confidence interval."
#| fig-alt: "A histogram showing the spread of values generated by bootstrapping the observed data with the confidence interval represented as a shaded area."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Visualize the bootstrap distribution with the confidence interval
dative_boot |>
visualize() +
shade_confidence_interval(
endpoints = dative_ci, # the confidence interval
color = "grey2",
fill = "grey"
) +
labs(title = "", y = "Count", x = "Statistic") +
theme_qtalr(font_size = 10)
```
:::
The confidence level is the probability that an interval constructed in this way contains the true value; it is typically set to 0.95 in the social sciences. In practice, if the confidence interval contains the null hypothesis value, we fail to reject the null hypothesis. If it does not, we reject the null hypothesis.
Confidence intervals\index{confidence interval} are often misinterpreted. A 95% confidence interval does not mean there is a 95% probability that the true value lies within this particular range; the true value is either within the range or not. Rather, 95% of intervals constructed in this way would contain the true value. This is a subtle but important distinction. Interpreted correctly, confidence intervals can enhance our understanding of the uncertainty of our test statistic and reduce the interpretation of $p$-values\index{p-value} (which are based on a relatively arbitrary alpha level\index{alpha level}) as a binary decision, significant or not significant. Confidence intervals encourage us to think about the uncertainty of our test statistic rather than a single cutoff.
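A small base-R simulation can illustrate the coverage interpretation. Assuming a hypothetical population proportion of 0.7, we draw many samples, build a 95% confidence interval for each, and count how often the interval contains the true value.

```r
# Coverage of 95% confidence intervals for a known proportion
set.seed(123)
true_p <- 0.7 # hypothetical population value, for illustration only

covered <- replicate(1000, {
  draws <- rbinom(100, size = 1, prob = true_p)  # one sample of 100
  ci <- prop.test(sum(draws), n = 100)$conf.int  # 95% CI for the proportion
  ci[1] <= true_p && true_p <= ci[2]             # does the CI contain true_p?
})

mean(covered) # close to 0.95
```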
Our stat is `r dative_obs$stat` and the confidence interval limits are `r dative_ci$lower_ci` and `r dative_ci$upper_ci`. The confidence interval does not contain the null hypothesis\index{null hypothesis} value of 0.5, which supports the evidence from the $p$-value that the proportion of NP realizations of the recipient clause is greater than 0.5.
#### Bivariate analysis {#sec-infer-cat-bivariate}
The univariate case is not very interesting or common in statistical inference, but it is a good place to start to understand the simulation-based process and the logic of statistical inference. The bivariate case, on the other hand, is much more common and interesting. The bivariate case includes two variables. The goal is to test the relationship between the two variables\index{research question}.
Using the `dative_tbl` dataset, we can imagine making the claim that:
- The proportion of NP and PP realizations of the recipient clause are contingent on the modality.
This hypothesis can be approached using a **difference in proportions** test,\index{difference in proportions test} as both variables are binomial (have two levels). The null hypothesis is that there is no difference in the proportion of NP and PP realizations of the recipient clause by modality. The alternative hypothesis is that there is a difference in the proportion of NP and PP realizations of the recipient clause by modality.
We can cross-tabulate or visualize, but let's cross-tabulate this relationship as it is a basic 2-by-2 contingency table\index{contingency table}. In @exm-infer-cat-bivariate-tbl, we see the code for the cross-tabulation of the `rcp_real` and `modality` variables. Note I've made use of {janitor} to adorn this table with percentages, totals, and observation numbers.
::: {#exm-infer-cat-bivariate-tbl}
```r
dative_tbl |>
tabyl(rcp_real, modality) |> # cross-tabulate
adorn_totals(c("row", "col")) |> # provide row and column totals
adorn_percentages("col") |> # add percentages to the columns
adorn_pct_formatting(rounding = "half up", digits = 0) |> # round the digits
adorn_ns() |> # add observation number
adorn_title("combined") |> # add a header title
kable(booktabs = TRUE) |> # pretty table)
kable_styling()
```
\index{R packages!janitor}\index{R packages!kableExtra}\index{R packages!knitr}
\cindex{tabyl()}\cindex{adorn_totals()}\cindex{adorn_percentages()}\cindex{adorn_pct_formatting()}\cindex{adorn_ns()}\cindex{adorn_title()}\cindex{kable()}\cindex{kable_styling()}
```{r}
#| label: tbl-infer-cat-bivariate
#| tbl-cap: "Contingency table for `rcp_real` and `modality`."
#| tbl-colwidths: [36, 22, 22, 22]
#| echo: false
dative_tbl |>
tabyl(rcp_real, modality) |> # cross-tabulate
adorn_totals(c("row", "col")) |> # provide row and column totals
adorn_percentages("col") |> # add percentages to the columns
adorn_pct_formatting(rounding = "half up", digits = 0) |> # round the digits
adorn_ns() |> # add observation number
adorn_title("combined") |> # add a header title
as_tibble() |>
tt(width = 1)
```
:::
In @tbl-infer-cat-bivariate, we can appreciate that the proportion of NP realizations of the recipient clause is higher in both modalities, as we might expect from our univariate analysis. However, the proportions appear to differ, with the spoken modality having a higher proportion of NP realizations than the written modality. But we cannot conclude that there is a difference in the proportion of NP and PP realizations of the recipient clause by modality. We need to conduct a statistical test to determine if the difference is statistically significant.
To determine if the distribution of the levels of the `rcp_real` variable by the levels of the `modality` variable is different from what we would expect if the null hypothesis were true, we need to calculate the difference observed in the sample and compare it to the differences observed in many samples where the null hypothesis is true.
{infer} provides a pipeline, steps 1 through 5, which maintains a consistent workflow for statistical inference. As such, the procedure is very similar to the univariate analysis we performed, with some adjustments. Let's focus on the adjustments. First, our `specify()` call needs to include the relationship between two variables: `rcp_real` and `modality`. The `response` argument is the response variable, which is `rcp_real`. The `explanatory` argument is the explanatory variable, which is `modality`.
There are two approaches to specifying the relationship between the response\index{response variable} and explanatory variables\index{explanatory variables}. The first approach is to specify the response variable and the explanatory variable separately as values of the arguments `response` and `explanatory`. The second approach is to specify the response variable and the explanatory variable as a formula using the `~` operator\cindex{\textasciitilde}. The formula approach is more flexible and allows for more complex relationships between the response and explanatory variables. In @exm-infer-cat-specify-bivariate, we see the code for the `specify()` call using the formula approach.
::: {#exm-infer-cat-specify-bivariate}
```{r}
#| label: infer-cat-specify-bivariate
# Specify the relationship between the response and explanatory variables
dative_spec <-
dative_tbl |>
specify(
rcp_real ~ modality,
success = "NP"
)
# Preview
dative_spec
```
\index{R packages!infer}
\cindex{specify()}
:::
The `dative_spec` now contains attributes about the response and explanatory variables encoded into the data frame.
We now calculate the observed statistic with `calculate()`, as seen in @exm-infer-cat-calculate-bivariate.
::: {#exm-infer-cat-calculate-bivariate}
```{r}
#| label: infer-cat-calculate-bivariate
# Calculate the observed statistic
dative_obs <-
dative_spec |>
calculate(
stat = "diff in props",
order = c("spoken", "written")
)
# Preview
dative_obs
```
\index{R packages!infer}
\cindex{calculate()}
:::
Two differences are that our statistic is now a difference in proportions and that we are asked to specify the order of the levels of `modality`. The statistic is clear: we are investigating whether the proportion of NP realizations of the recipient clause is different between the spoken and written modalities. The order of the levels of `modality` is important because it determines the direction of the alternative hypothesis, specifically how the statistic is calculated (the order of the subtraction).
So our observed statistic, `r round(dative_obs$stat, 3)`, is the proportion of NP realizations of the recipient clause in the spoken modality minus the proportion in the written modality: NP realizations appear `r round(dative_obs$stat * 100, 0)` percentage points more often in the spoken modality than in the written modality.
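To see how the order determines the sign of the statistic, here is a quick check reversing the order with the objects above.

```r
# Reversing the order flips the sign of the difference in proportions
dative_spec |>
  calculate(
    stat = "diff in props",
    order = c("written", "spoken")
  )
```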
The question remains, is this difference statistically significant? To answer this question, we generate the null hypothesis distribution\index{null hypothesis distribution} and calculate the $p$-value\index{p-value}, as seen in @exm-infer-cat-null-hypothesis-bivariate.
::: {#exm-infer-cat-null-hypothesis-bivariate}
```{r}
#| label: infer-cat-null-hypothesis-bivariate
# Generate the null hypothesis distribution
dative_null <-
dative_spec |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "diff in props", order = c("spoken", "written"))
# Calculate the $p$-value
dative_null |>
get_p_value(
obs_stat = dative_obs, # the observed statistic
direction = "two-sided" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\cindex{hypothesize()}\cindex{generate()}\cindex{calculate()}\cindex{get_p_value()}
\index{two-sided test}
:::
Note, when generating the null hypothesis distribution, we use the `hypothesize()` function with the `null` argument set to "independence". This is because we are interested in the relationship between the response and explanatory variables. The null hypothesis is that there is no relationship between the response and explanatory variables. When generating the samples, we use the permutation approach, which randomly shuffles the response variable values for each sample. This simulates the null hypothesis that there is no relationship between the response and explanatory variables.
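A one-line base-R illustration of what a single permutation does: shuffling the response values breaks any association with the explanatory variable while preserving the overall proportions.

```r
# One permutation: shuffle the response values of `rcp_real`
head(sample(dative_tbl$rcp_real))
```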
The $p$-value is reported as $0$. To provide some context, we will generate a confidence interval for our observed statistic using the bootstrap method, as seen in @exm-infer-cat-confidence-interval-bivariate.
::: {#exm-infer-cat-confidence-interval-bivariate}
```r
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "diff in props", order = c("spoken", "written"))
# Calculate the confidence interval
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95)
# Preview
dative_ci
```
:::
::: {.content-visible when-format="pdf"}
\vspace{3em}
:::
```{r}
#| label: infer-cat-confidence-interval-bivariate
#| echo: false
# Generate bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "diff in props", order = c("spoken", "written"))
# Calculate the confidence interval
dative_ci <-
dative_boot |>
get_confidence_interval(level = 0.95)
# Preview
dative_ci
```
The confidence interval does not contain the null hypothesis value of 0 (no difference), which provides evidence that the proportion of NP realizations of the recipient clause is different between the spoken and written modalities.
#### Multivariate analysis
In many scenarios, it is common to have multiple explanatory variables\index{explanatory variables} that need to be considered. In such cases, logistic regression is a suitable modeling technique. Logistic regression\index{logistic regression} allows for the inclusion of both categorical and continuous explanatory variables. The primary objective of using logistic regression is to assess the association between these variables and the response variable. By analyzing this relationship, we can determine how changes in the explanatory variables influence the probability of the outcome occurring.
To explore this scenario, let's posit that:
- NP and PP realizations of the recipient clause are contingent on modality and word length ratio of the recipient and theme\index{research question}.
The length ratio gets at the length of the recipient clause relative to the length of the theme clause. This ratio is an operationalization\index{operationalize} of a phenomenon known as 'Heavy NP' shift. There are many ways to operationalize this phenomenon, but the length ratio is a simple method to approximate the phenomenon. It attempts to capture the idea that the longer the theme clause is relative to the recipient clause, the more likely the recipient clause will be realized as an NP ---in other words, when the theme is relatively longer than the recipient, the theme is ordered last in the sentence, and the recipient is ordered first in the sentence and takes the form of an NP (instead of a PP).
The hypothesis, then, is that @exm-heavy-np-shift (2) would be less likely than (1) because the theme is relatively longer than the recipient.
::: {#exm-heavy-np-shift}
(1) John gave [Mary ~NP~] the large book that I showed you in class yesterday.
(2) John gave the large book that I showed you in class yesterday [to Mary ~PP~].
:::
Let's consider this variable `length_ratio` and `modality` together as explanatory variables for the realizations of the recipient clause `rcp_real`.
Let's create the `length_ratio` variable by dividing the `thm_len` by the `rcp_len`. This will give us values larger than 1 when the theme is longer than the recipient. And since we are working with a skewed distribution\index{skewed distribution}, let's log-transform\index{log transformation} the `length_ratio` variable. In @exm-infer-cat-create-length-ratio, we see the code for creating the `length_ratio` variable.
::: {#exm-infer-cat-create-length-ratio}
```{r}
#| label: infer-cat-create-length-ratio
# Create the `length_ratio_log` variable
dative_tbl <-
dative_tbl |>
mutate(
length_ratio_log = log(thm_len / rcp_len)
) |>
select(-thm_len, -rcp_len)
# Preview
glimpse(dative_tbl)
```
\index{R packages!dplyr}
\cindex{mutate()}\cindex{select()}\cindex{glimpse()}\cindex{log()}
:::
Let's visualize the relationship between `rcp_real` and `length_ratio_log` separately and then together with `modality`, as seen in @exm-infer-cat-bivariate-vis-length-ratio-log\index{bar plot}\index{boxplot}.
::: {#exm-infer-cat-bivariate-vis-length-ratio-log}
```r
# Visualize the proportion of `rcp_real` by `modality`
dative_tbl |>
ggplot(aes(x = rcp_real, fill = modality)) +
geom_bar(position = "fill") +
labs(
x = "Realization of recipient clause",
y = "Proportion",
fill = "Modality"
)
# Visualize the relationship between `rcp_real` and `length_ratio_log`
dative_tbl |>
ggplot(aes(x = rcp_real, y = length_ratio_log)) +
geom_boxplot() +
labs(
x = "Realization of recipient clause",
y = "Length ratio"
)
```
:::
```{r}
#| label: fig-infer-cat-bivariate-length-ratio-log
#| fig-cap: "Distribution of the variables `modality` and `length_ratio_log` by the levels of the `rcp_real` variable."
#| fig-alt: "A bar plot and a boxplot showing the distribution of the variables `modality` and `length_ratio_log` by the levels of the `rcp_real` variable."
#| fig-subcap:
#| - "RCP by modality"
#| - "RCP by length ratio"
#| layout-ncol: 2
#| echo: false
# Visualize the proportion of `rcp_real` by `modality`
dative_tbl |>
ggplot(aes(x = rcp_real, fill = modality)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("#525252", "#BABABA")) +
labs(
x = "Realization of recipient clause",
y = "Proportion",
fill = "Modality"
) +
theme_qtalr(font_size = 10)
# Visualize the relationship between `rcp_real` and `length_ratio_log`
dative_tbl |>
ggplot(aes(x = rcp_real, y = length_ratio_log)) +
geom_boxplot() +
labs(
x = "Realization of recipient clause",
y = "Length ratio"
) +
theme_qtalr(font_size = 10)
```
\index{R packages!ggplot2}
\cindex{ggplot()}\cindex{aes()}\cindex{geom_bar()}\cindex{labs()}\cindex{geom_boxplot()}\cindex{labs()}
To understand the visualizations in @fig-infer-cat-bivariate-length-ratio-log, remember that the null hypothesis is that there is no difference in the proportion of NP and PP realizations of the recipient clause by modality or length ratio. On the flip side, the alternative hypothesis is that there is a difference in the proportion of NP and PP realizations of the recipient clause by modality and length ratio. From the visual inspection, it appears that NP realizations of the recipient clause are more common in the spoken modality and that the NP realizations have a higher overall length ratio (larger theme relative to recipient) than PP realizations of the recipient clause. This suggests that the alternative hypothesis is likely true, but we need to conduct a statistical test to determine if the differences are statistically significant.
Let's calculate the statistics (not statistic) for our logistic regression by specifying the relationship between the response and explanatory variables and then using `fit()` to fit the logistic regression model, as seen in @exm-infer-cat-logistic-regression.
::: {#exm-infer-cat-logistic-regression}
```{r}
#| label: infer-cat-logistic-regression
# Specify the relationship
dative_spec <-
dative_tbl |>
specify(
rcp_real ~ modality + length_ratio_log
)
# Fit the logistic regression model
dative_fit <-
dative_spec |>
fit()
# Preview
dative_fit
```
\index{R packages!infer}
\cindex{specify()}\cindex{fit()}
:::
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The reference level in R is assumed to be the first level alphabetically, unless otherwise specified. We can override this default by using the `fct_relevel()` function from {forcats} [@R-forcats]. The reason we would want to do this is to make the reference level more interpretable. In our case, we would want to make the spoken modality the reference level. This allows us to estimate the difference in the proportion of NP realizations of the recipient as a positive value. Remember that in @fig-infer-cat-bivariate-length-ratio-log-1, the proportion of NP realizations of the recipient clause is higher in the spoken modality than in the written modality. If we were to use the written modality as the reference level, the difference would be negative. Not that we couldn't interpret a negative value, but positive values are easier to interpret.
:::
Note I pointed out statistics, not statistic. In logistic regression models, the number of statistics reported depends on the number of explanatory variables. If there are two variables, there will be at least three terms: one for each variable and the intercept term. If one or more variables are categorical, however, there will be additional terms when a categorical variable has three or more levels.
In our case, the `modality` variable has two levels, so there are three terms. The first is the intercept term: the log odds of an NP realization of the recipient clause when `modality` is at its reference level (spoken) and `length_ratio_log` is 0. The second, `modalitywritten`, is the change in those log odds for the written modality relative to the spoken modality, holding `length_ratio_log` constant. The third, `length_ratio_log`, is the change in the log odds for a one-unit increase in `length_ratio_log`, holding modality constant. Notably, the spoken modality does not appear as a term of its own; it is the baseline built into the intercept, against which the `modalitywritten` term is compared. For categorical variables, one of the levels serves as the point of reference, or **reference level**\index{reference level}, against which every other level is compared.
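Log odds are not always intuitive, so it can help to convert an estimate to an odds ratio or a probability. Here is a quick illustration with a hypothetical value (not an estimate from `dative_fit`):

```{r}
#| label: infer-cat-log-odds-conversion
# A hypothetical log odds value, for illustration only
log_odds <- 0.75
# Exponentiating gives the odds ratio
exp(log_odds)
# The inverse logit gives the corresponding probability
plogis(log_odds)
```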
Now let's generate the null hypothesis distribution\index{null hypothesis distribution} and calculate the $p$-value for each of the terms, as seen in @exm-infer-cat-null-hypothesis-logistic-regression.
::: {#exm-infer-cat-null-hypothesis-logistic-regression}
```{r}
#| label: infer-cat-null-hypothesis-logistic-regression
# Generate the null hypothesis distribution
dative_null <-
dative_spec |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
fit()
# Calculate the $p$-value
dative_null |>
get_p_value(
dative_fit, # the observed statistics
direction = "two-sided" # the direction of the alternative hypothesis
)
```
\index{R packages!infer}
\cindex{hypothesize()}\cindex{generate()}\cindex{fit()}\cindex{get_p_value()}
\index{two-sided test}
:::
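The $p$-values are easier to appraise alongside the permutation distributions themselves. As a sketch (assuming the `dative_null` and `dative_fit` objects created above), {infer}'s `visualize()` plots the null distribution for each term and `shade_p_value()` marks the observed statistics:

```{r}
#| label: infer-cat-null-visualize-sketch
#| eval: false
# Plot the null distribution for each term and shade the regions
# at least as extreme as the observed statistics
dative_null |>
  visualize() +
  shade_p_value(dative_fit, direction = "two-sided")
```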
It appears that our main effects, `modality` and `length_ratio_log`, are statistically significant. Let's generate the confidence intervals\index{confidence interval} for each of the terms, as seen in @exm-infer-cat-confidence-interval-logistic-regression.
::: {#exm-infer-cat-confidence-interval-logistic-regression}
```{r}
#| label: infer-cat-confidence-interval-logistic-regression
# Generate the bootstrap distribution
dative_boot <-
dative_spec |>
generate(reps = 1000, type = "bootstrap") |>
fit()
# Calculate the confidence interval
dative_ci <-
dative_boot |>
get_confidence_interval(
point_estimate = dative_fit,
level = 0.95
)
# Preview
dative_ci
```
\index{R packages!infer}
\cindex{generate()}\cindex{fit()}\cindex{get_confidence_interval()}
:::
The confidence intervals for the main effects, `modality` and `length_ratio_log`, do not contain the null hypothesis value of 0, which provides evidence that each of the explanatory variables is related to the proportion of NP realizations of the recipient clause.
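A visual check is possible here too; a sketch, assuming the objects created above:

```{r}
#| label: infer-cat-ci-visualize-sketch
#| eval: false
# Plot the bootstrap distribution for each term and shade the
# 95% confidence intervals calculated above
dative_boot |>
  visualize() +
  shade_confidence_interval(endpoints = dative_ci)
```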
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
Significance tests are not the only way to evaluate the evidence against the null hypothesis. We can also quantify the effect size\index{effect size} of each of the explanatory variables by converting the (log) odds ratio to $r$ (correlation coefficient) and $R^2$\index{R-squared} (coefficient of determination) values. {effectsize} [@R-effectsize] provides the function `logoddsratio_to_r()` for this conversion in logistic regression models.
These measures help distinguish statistically significant results from practically significant ones. A statistically significant result is one that is unlikely to have occurred by chance; a practically significant result is one whose effect is large enough to be meaningful.
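A sketch of how this conversion might look (the log odds ratio here is a hypothetical value, not output from our model):

```{r}
#| label: infer-cat-effect-size-sketch
#| eval: false
# Convert a hypothetical log odds ratio to r and R-squared
library(effectsize)

r <- logoddsratio_to_r(0.75)
r   # correlation coefficient
r^2 # coefficient of determination
```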
:::
Our logistic regression model as specified considers the explanatory variables `modality` and `length_ratio_log` independently, controlling for the other explanatory variable. This is an **additive model**\index{additive model}, which is what we stated in our hypothesis and represented in the formula `y ~ x1 + x2`.
Not all multivariate relationships are additive. We can also hypothesize an interaction between the explanatory variables. An **interaction model**\index{interaction model} is one which hypothesizes that the effect of one explanatory variable on the response variable is dependent on the other explanatory variable(s). In our case, we could have hypothesized that the effect of `length_ratio_log` on the proportion of NP realizations of the recipient clause is dependent on `modality`. We can specify this relationship using the formula approach, as seen in @exm-infer-cat-logistic-regression-interaction.
::: {#exm-infer-cat-logistic-regression-interaction}
```{r}
#| label: infer-cat-logistic-regression-interaction
# Specify the relationship between the response and explanatory variables
dative_inter_spec <-
dative_tbl |>
specify(
rcp_real ~ modality * length_ratio_log
)
```
\index{R packages!infer}
\cindex{specify()}
:::
Replacing the `+` with a `*` tells the model to consider the interaction between the explanatory variables. A model with an interaction changes the terms and the estimates. In @exm-infer-cat-logistic-regression-interaction-terms, we see the terms for the logistic regression model with an interaction.
::: {#exm-infer-cat-logistic-regression-interaction-terms}
```{r}
#| label: infer-cat-logistic-regression-interaction-terms
# Fit the logistic regression model
dative_inter_fit <-
dative_inter_spec |>
fit()
# Preview
dative_inter_fit
```
\index{R packages!infer}
\cindex{fit()}
:::
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
As an exercise, consider the following hypothesis:
- NP and PP realizations of the recipient clause are contingent on modality and word length ratio of the recipient and theme, and the effect of the length ratio on the proportion of NP realizations of the recipient clause is dependent on the modality.
Follow the simulation-based process to test this hypothesis. What are the results? What are the implications of the results?
:::
The additional term `modalitywritten:length_ratio_log` is the interaction term. We also see that the log odds estimates for the previous terms have changed, because the interaction term absorbs some of the explanatory power that the main effects carried in the additive model. Whether or not we run an interaction model depends on our research question. Again, the hypothesis precedes the model: if we hypothesize an interaction, then we should run an interaction model; if we do not, then we should not.
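As a final note on syntax, the `*` operator is shorthand: a formula with `*` expands to the main effects plus their interaction, so the following two specifications are equivalent:

```{r}
#| label: infer-cat-formula-expansion
#| eval: false
# `*` expands to the main effects plus the interaction term
rcp_real ~ modality * length_ratio_log
rcp_real ~ modality + length_ratio_log + modality:length_ratio_log
```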
### Numeric {#sec-infer-numeric}
We now turn our attention to the analysis scenarios where the response variable is numeric\index{continuous variables}. Just as for categorical variables, we can have univariate, bivariate, and multivariate analysis scenarios. The statistical tests for numeric variables are summarized in @tbl-infer-num-design.
::: {#tbl-infer-num-design tbl-colwidths="[15, 30, 30, 25]"}
| Scenario | Explanatory Variable(s) | Statistical Test | `infer` |
|:---------|:------------------------|:-----------------|:--------|
| Univariate | - | Mean | `mean` |
| Bivariate | Numeric | Correlation | `correlation` |
| Bivariate | Categorical (2 levels) | Difference in means | `diff in means` |
| Bivariate | Categorical (3+ levels) | ANOVA | `F` |
| Multivariate | Numeric or Categorical | Linear regression | `fit()` |
Statistical test design for numeric response variables
:::
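These map onto the same {infer} workflow we have been using: the value in the last column is passed to the `stat` argument of `calculate()`, while the multivariate case uses `fit()` as before. A minimal sketch with placeholder names (not the SWDA dataset):

```{r}
#| label: infer-num-workflow-sketch
#| eval: false
# Placeholder tibble and column names, for illustration only
obs_mean <-
  my_tbl |>
  specify(response = my_numeric_var) |>
  calculate(stat = "mean")
```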
The dataset we will use is drawn from the Switchboard Dialog Act Corpus [@SWDA2008]\index{Switchboard Dialog Act Corpus (SWDA)}\index{corpus!specialized}. The data dictionary is found in @tbl-infer-num-data-dict.
```{r}
#| label: tbl-infer-num-data-dict
#| tbl-cap: "Data dictionary for the transformed SWDA dataset"
#| tbl-colwidths: [15, 18, 15, 52]
#| echo: false