---
execute:
  echo: false
---
::: {.content-visible when-format="pdf"}
```{=latex}
\setDOI{10.4324/9781003393764.3}
\thispagestyle{chapterfirstpage}
```
:::
# Analysis {#sec-analysis-chapter}
<!-- Deriving Knowledge from Data: A Guide to Descriptive and Analytical Methods in Text Analysis -->
```{r}
#| label: setup-options
#| child: "../_common.qmd"
#| cache: false
```
::: {.callout}
**{{< fa regular list-alt >}} Outcomes**
- Recall the fundamental concepts and principles of statistics in data analysis.
- Articulate the roles of diagnostic, analytic, and interpretive statistics in quantitative analysis.
- Compare the similarities and differences between analytic approaches to data analysis.
:::
```{r}
#| label: analysis-packages
pacman::p_load(skimr, janitor, tidytext, forcats)
```
The goal of an analysis is to break down complex information into simpler components that are more readily interpretable. In what follows, we will cover the main steps in this process. The first is to inspect the data to ensure its quality and understand its characteristics. The second is to interrogate the data to uncover patterns and relationships and to interpret the findings. To conclude this chapter, I will outline methods for communicating the analysis results and procedure in a transparent and reproducible manner, and why doing so matters.
::: {.callout}
**{{< fa terminal >}} Lessons**
**What**: Summarizing data, Visual summaries\
**How**: In an R console, load {swirl}, run `swirl()`, and follow prompts to select the lesson.\
**Why**: To showcase methods for statistical summaries of vectors and data frames and to create informative graphics that enhance data interpretation and analysis.
:::
<!-- Set up the dataset/ dictionary -->
```{r}
#| label: analysis-belc
#| eval: false
# TalkBank API
library(TBDBr)
# Talkbank: BELC
corpus_name <- "slabank"
corpora <- c("slabank", "English", "BELC", "1-written(4t)_10-16")
# Get tokens ----
belc_tokens_tbl <-
getTokens(
corpusName = corpus_name, # corpus name
corpora = corpora
) |> # corpus path
unnest(everything()) # unnest variables
# Get participants ----
belc_participants_tbl <-
getParticipants(
corpusName = corpus_name, # corpus name
corpora = corpora
) |> # corpus path
unnest(everything()) # unnest variables
# Join tokens and participants ----
belc_parts_tokens_tbl <-
left_join(belc_participants_tbl, belc_tokens_tbl)
# Wrangling steps ----
belc_tbl <-
belc_parts_tokens_tbl |>
# separate time_group and part_id
separate(filename, c("time_group", "part_id"), sep = "c") |>
# replace "A" with "T" in time_group labels
mutate(time_group = str_remove(time_group, "A")) |>
# filter out time_group "2B"
filter(time_group != "2B") |>
mutate(time_group = str_c("T", time_group, sep = "")) |>
# remove redundant variables
select(-path, -who, -name, -role, -language, -age) |>
# select names and order
select(part_id, sex,
group = time_group,
month_age = monthage, num_words = numwords,
num_utts = numutts, avg_utt = avgutt, median_utt = medianutt,
utt_id = uid, word_id = wordnum, word, lemma = stem, pos
) |>
# subset columns
select(part_id:month_age, utt_id:pos) |>
# arrange observations
arrange(part_id, utt_id, word_id) |>
# impute missing values
mutate(lemma = case_when(
pos == "n" & is.na(lemma) ~ word,
TRUE ~ lemma
)) |>
# missing pos: tag as "L2" unless a known English word (I, football, basketball, english)
mutate(pos = case_when(
is.na(pos) & !(word %in% c("I", "football", "basketball", "english")) ~ "L2",
is.na(pos) ~ "n",
TRUE ~ pos
)) |>
# remaining empty lemmas from word
mutate(lemma = case_when(
is.na(lemma) ~ word,
TRUE ~ lemma
)) |>
# adjust numeric variables
# adjust utt_id and word_id
mutate(
utt_id = utt_id + 1,
word_id = word_id + 1
)
belc_essay_tbl <-
belc_tbl |>
# group by essay
group_by(part_id, sex, group) |>
# summarize data
summarize(
# number of words
tokens = n(),
# number of unique words
types = n_distinct(word),
# number of L1 tokens
l1_tokens = sum(str_count(pos, "L2"))
) |>
# ungroup by essay
ungroup() |>
mutate(
# proportion of L2 words out of total tokens
prop_l2 = 1 - round((l1_tokens / tokens), 3),
# type/ token ratio
ttr = round((types / tokens), 3),
# assign number to essays
essay_id = str_c("E", row_number(), sep = "")
) |>
# select variables
select(essay_id, part_id:types, ttr, prop_l2)
belc_essay_tbl <-
belc_essay_tbl |>
mutate(
across(
where(is.character),
factor
)
) |>
mutate(group = fct_inorder(group, ordered = TRUE)) |>
mutate(group = fct_relevel(group, "T1", "T2", "T3", "T4"))
# Write data ----
write_rds(belc_essay_tbl, "data/analysis-belc_essay_tbl.rds")
# Create data dictionary ----
create_data_dictionary(belc_essay_tbl, "data/analysis-belc_essay_tbl_dd.csv", model = "gpt-3.5-turbo")
```
<!-- Load dataset/ origin/ dictionary -->
```{r}
#| label: analysis-belc-dataset-data-dictionary
# Dataset in rds format for vector types
belc_essay_tbl <- read_rds("data/analysis-belc_essay_tbl.rds")
# Data dictionary
belc_essay_dd <- read_csv("data/analysis-belc_essay_tbl_dd.csv")
```
## Describe {#sec-analysis-describe}
<!-- Purpose -->
The goal of descriptive statistics is to summarize the data in order to understand it and prepare it for the analysis approach to be performed. This is accomplished through a combination of statistical measures and/ or tabular or graphic summaries. The choice of descriptive statistics is guided by the type of data, as well as the question(s) being asked of the data.
In descriptive statistics, there are four basic questions that are asked of each of the variables in the dataset. Each corresponds to a different type of descriptive measure.
1. **Central Tendency**: Where do the data points tend to be located?
2. **Dispersion**: How spread out are the data points?
3. **Distribution**: What is the overall shape of the data points?
4. **Association**: How are these data points related to other data points?
<!-- Dataset used to groud the discussion -->
To ground this discussion I will introduce a new dataset. This dataset is drawn from the Barcelona English Language Corpus (BELC) [@Munoz2006], which is found in the TalkBank repository. I've selected the "Written composition" task from this corpus, which contains 80 writing samples from 36 second language learners of English. Participants were given the task of writing for 15 minutes on the topic of "Me: my past, present and future". Data were collected from each participant between one and three times over the course of seven years (at 10, 12, 16, and 17 years of age).
In @tbl-analysis-belc-dd we see the data dictionary for the BELC dataset, which reflects the structural and transformational steps I've applied so that we start with a tidy dataset with `essay_id` as the unit of observation.
```{r}
#| label: tbl-analysis-belc-dd
#| tbl-cap: "Data dictionary for the BELC dataset"
#| tbl-colwidths: [13, 23, 12, 52]
# Data dictionary ----
belc_essay_dd |>
tt(width = 1)
```
Now, let's take a look at the first few observations of the BELC dataset to get another perspective as we view the values themselves.
```{r}
#| label: tbl-analysis-belc-overview
#| tbl-cap: "First 5 observations of the BELC dataset"
#| tbl-colwidths: [12, 12, 12, 12, 12, 12, 12, 12]
# View data ----
belc_essay_tbl |>
slice_head(n = 5) |>
tt(width = 1)
```
::: {.callout .halfsize}
**{{< fa regular file-alt >}} Case study**
Type-Token Ratio (TTR) is a standard metric for measuring lexical diversity, but it is not without its flaws. Most importantly, TTR is highly sensitive to the length of the text in words. @Duran2004 discuss this limitation, as well as the limitations of other lexical diversity measures, and propose a new measure, $D$, which shows a stronger correlation with language proficiency in their comparative studies.
:::
In @tbl-analysis-belc-overview, each of the variables is an attribute or measure of the `essay_id` variable. `tokens` is the number of total words, `types` is the number of unique words, and `ttr` is the ratio of unique words to total words. This ratio is known as the Type-Token Ratio, a standard metric for measuring lexical diversity. Finally, the proportion of L2 words (English) to the total words (tokens) is provided in `prop_l2`.
Let's now turn our attention to exploring descriptive measures using the BELC dataset.
### Central tendency {#sec-analysis-central-tendency}
<!-- Location -->
<!-- - Central tendency (mean, median, mode) -->
```{r}
#| label: analysis-belc-descriptive-functions
# Function: calculate the mode ----
calculate_mode <- function(x) {
x |>
# convert to tibble
as_tibble() |>
# count values
count(value) |>
# select most frequent value
filter(n == max(n)) |>
# pull value
pull(value)
}
# Function: calculate the normalized entropy ----
calculate_norm_entropy <- function(x) {
# add NA to x
x <- addNA(x, ifany = TRUE)
# get value proportions
prop <- prop.table(table(x))
# calculate entropy
entropy <- -sum(prop * log2(prop))
# calculate max entropy
max_entropy <- log2(length(prop))
# calculate normalized entropy
normalized_entropy <- entropy / max_entropy
return(normalized_entropy)
}
```
The central tendency is a measure that aims to summarize the data points in a variable as the most representative, middle, or most typical value. There are three common measures of central tendency: the mode, mean, and median. Each differs in how it summarizes the data points.
The **mode** is the value that appears most frequently in a set of values. If multiple values share the highest frequency, the variable is said to be multimodal. The mode is the most versatile of the central tendency measures, as it can be applied to all levels of measurement, although it is not often used for numeric variables because it is less informative than other measures.
The more common measures for numeric variables are the mean and the median. The **mean** is a summary statistic calculated by summing all the values and dividing by the number of values. The **median** is calculated by sorting all the values in the variable and then selecting the middle value.
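As a quick illustration, base R's `mean()` and `median()`, together with the `calculate_mode()` helper defined above, return these measures for a single variable:
```{r}
#| label: analysis-central-tendency-example
#| eval: false
mean(belc_essay_tbl$tokens) # sum of values divided by their number
median(belc_essay_tbl$tokens) # sorted middle value
calculate_mode(belc_essay_tbl$tokens) # most frequent value(s)
```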
:::: {.callout}
**{{< fa regular lightbulb >}} Consider this**
::: {.content-visible when-format="html"}
<img src="figures/data-word-mapper.png" width="35%" style="float: right;">
:::
::: {.content-visible when-format="latex"}
```{=latex}
\begin{wrapfigure}{r}{0.50\textwidth}
\centering
\includegraphics[width=0.45\textwidth]{part_2/figures/data-word-mapper.png}
\end{wrapfigure}
```
:::
@Grieve2018 compiled an 8.9 billion-word corpus of geotagged posts from Twitter between 2013 and 2014 in the United States. The authors provide a [search interface](https://isogloss.shinyapps.io/isogloss/) to explore the relationship between lexical usage and geographic location. Explore this corpus by searching for terms related to slang ("hella", "wicked"), geography ("mountain", "river"), weather ("snow", "rain"), and/ or any other terms. What types of patterns do you find? What are the benefits and/ or limitations of this type of data, data summarization, and/ or interface?
:::
```{r}
#| label: tbl-analysis-belc-central-tendency
#| tbl-cap: "Central tendency measures for the BELC dataset"
#| tbl-subcap:
#| - "Categorical variables"
#| - "Numeric variables"
#| layout-ncol: 2
#| layout-valign: top
#| tbl-colwidths: auto
# Skim function ----
aa_skim <- skim_with(
factor = sfl(top_counts = top_counts),
character = sfl(top_counts = top_counts),
numeric = sfl(mean = mean, median = median),
append = FALSE
)
belc_essay_tbl |>
aa_skim() |>
yank("factor") |>
select(-n_missing, -complete_rate) |>
select(variable = skim_variable, everything()) |>
tibble() |>
tt(width = 1, digits = 2)
belc_essay_tbl |>
aa_skim() |>
yank("numeric") |>
select(-n_missing, -complete_rate) |>
select(variable = skim_variable, everything()) |>
tibble() |>
select(variable, mean, median) |>
tt(width = 1, digits = 2)
```
As the mode is the most frequent value, the `top_counts` measure in @tbl-analysis-belc-central-tendency provides the mode for the categorical variables. For the numeric variables, note that the mean and the median are not the same. Such differences between the mean and median will be of interest to us later in this chapter.
### Dispersion
To understand how representative a central tendency measure is, we use a calculation of the spread of the values around the central tendency, or **dispersion**. The more spread out the values, the less representative the central tendency measure is.
For categorical variables, the spread is framed in terms of how balanced the values are across the levels. One way to assess this is to use proportions. The **proportion** of each level is the frequency of the level divided by the total number of values. Another way is to calculate the (normalized) entropy. **Entropy** is a single measure of uncertainty. The more balanced the values are across the levels, the closer the entropy is to 1. In practice, however, proportions are often used to assess the balance of the values across the levels.
The most common measure of dispersion for numeric variables is the **standard deviation**. The standard deviation is the square root of the **variance**, which is the average of the squared differences from the mean. More succinctly, the standard deviation is a measure of the spread of the values around the mean. Where the standard deviation is anchored to the mean, the **interquartile range** (IQR) is tied to the median. The median represents the sorted middle of the values, in other words the 50th percentile. The IQR is the difference between the 75th percentile and the 25th percentile.
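To make these definitions concrete, here is a minimal sketch that verifies them on the `tokens` variable. Note that R's `sd()` and `var()` use the sample formula, dividing by n - 1 rather than n:
```{r}
#| label: analysis-dispersion-example
#| eval: false
x <- belc_essay_tbl$tokens
# variance: average squared difference from the mean (R divides by n - 1)
var_manual <- sum((x - mean(x))^2) / (length(x) - 1)
# the standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var_manual))
# the IQR is the 75th percentile minus the 25th percentile
all.equal(IQR(x), unname(diff(quantile(x, c(0.25, 0.75)))))
```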
```{r}
#| label: tbl-analysis-belc-dispersion
#| tbl-cap: "Dispersion measures for the BELC dataset"
#| tbl-subcap:
#| - "Categorical variables"
#| - "Numeric variables"
#| layout-valign: top
#| layout-ncol: 2
#| tbl-colwidths: auto
# Skim function ----
aa_skim <- skim_with(
factor = sfl(norm_entropy = calculate_norm_entropy),
character = sfl(norm_entropy = calculate_norm_entropy),
numeric = sfl(sd = sd, iqr = IQR),
append = FALSE
)
belc_essay_tbl |>
# custom skim function
aa_skim() |>
yank("factor") |>
select(-n_missing, -complete_rate) |>
select(variable = skim_variable, everything()) |>
tibble() |>
select(variable, norm_entropy) |>
tt(width = 1, digits = 2)
belc_essay_tbl |>
aa_skim() |>
yank("numeric") |>
select(-n_missing, -complete_rate) |>
select(variable = skim_variable, everything()) |>
tibble() |>
select(variable, sd, iqr) |>
tt(width = 1, digits = 2)
```
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
The inability to compare summary statistics across variables is a key reason why **standardization** is often applied before submitting a dataset for analysis [@Johnson2008; @Baayen2008a].
Standardization is a scale-based transformation that changes the scale of the values to a common scale, or *z-scores*. The result of this transformation puts data points of each variable on the same scale and allows for direct comparison. Furthermore, standardization also mitigates the influence of variables with large values relative to other variables. This is particularly important in multivariate analysis where the influence of variables with large values can be magnified.
The caveat is that standardization masks the original meaning of the data. That is, if we consider token frequency, before standardization, we can say that a value of 1000 tokens is 1000 tokens. After standardization, we can only say that a value of 1 is 1 standard deviation from the mean. This is why standardization is often applied after the descriptive phase of analysis.
:::
In @tbl-analysis-belc-dispersion-1, the normalized entropy helps us understand the balance of the values across the levels of the categorical variables. In @tbl-analysis-belc-dispersion-2, the standard deviation and IQR provide a sense of the spread of the values around the mean and median, respectively, for the numeric variables.
When interpreting numeric central tendency and dispersion values, it is important to compare only column-wise, that is, within a single variable, not across variables. Each variable, as is, is measured on its own scale, and its values only make sense relative to that scale.
### Distributions {#sec-analysis-distributions}
<!-- - Distributions -->
Summary statistics of the central tendency and dispersion of a variable provide a sense of the most representative value and how spread out the data is around this value. However, to gain a more comprehensive understanding of the variable, it is key to consider the frequencies of all the data points. The **distribution** of a variable is the pattern or shape of the data that emerges when the frequencies of all data points are considered. This can reveal patterns that might not be immediately apparent from summary statistics alone.
When assessing the distribution of categorical variables, we can use a frequency table or bar plot. **Frequency tables** display the frequency and/ or proportion of each level in a categorical variable in a clear and concise manner. In @tbl-analysis-belc-frequency-table we see the frequency tables for the variables `sex` and `group`.
<!-- sex frequency table -->
```{r}
#| label: tbl-analysis-belc-frequency-table
#| tbl-cap: "Frequency table for variables `sex` and `group`."
#| tbl-subcap:
#| - "Sex"
#| - "Time group"
#| layout-ncol: 2
#| layout-valign: top
#| tbl-colwidths: auto
belc_essay_tbl |>
tabyl(sex) |>
tibble() |>
select(sex, frequency = n, proportion = percent) |>
tt(width = 1)
belc_essay_tbl |>
tabyl(group) |>
tibble() |>
select(group, frequency = n, proportion = percent) |>
tt(width = 1)
```
A **bar plot** is a type of plot where the x-axis is a categorical variable and the y-axis is the frequency of the values. The frequency is represented by the height of the bar. The levels can be ordered by frequency, alphabetically, or in some other order. @fig-analysis-belc-barplots shows bar plots for the variables `sex` and `group` ordered alphabetically.
```{r}
#| label: fig-analysis-belc-barplots
#| fig-cap: "Bar plots for categorical variables `sex` and `group`"
#| fig-alt: "Two bar plots. On the left, a bar plot for the variable sex with the x-axis labeled male and female and the y-axis labeled Frequency. On the right, a bar plot for the variable group with the x-axis labeled T1, T2, T3, and T4 and the y-axis labeled Frequency."
#| fig-subcap:
#| - "Bar plot for `sex`"
#| - "Bar plot for `group`"
#| layout-ncol: 2
# Function to create bar plots ----
create_barplot <- function(data, variable, x_lab = NULL, y_lab = NULL) {
# Create a frequency table for the variable
freq_table <- data |> count({{ variable }})
# Set the y-limits to 0 to the sum of the variable frequencies
ylim <- c(0, sum(freq_table$n))
# Create the bar plot
ggplot(freq_table, aes(x = {{ variable }}, y = n)) +
geom_bar(stat = "identity") +
ylim(ylim) +
labs(x = x_lab, y = y_lab)
}
# Bar plots for categorical variables ----
# Bar plot `sex` ----
belc_essay_tbl |>
create_barplot(sex, "Sex", "Frequency") + theme_qtalr(font_size = 12)
# Bar plot `group` ----
belc_essay_tbl |>
create_barplot(group, "Time group", "Frequency") + theme_qtalr(font_size = 12)
```
So with a frequency table or bar plot, we can see the frequency of each level of a categorical variable. This gives us some knowledge about the BELC dataset: there are more girls in the dataset, and more essays appear in the first and third time groups. If we were to see any clearly lopsided categories, this would be a sign of imbalance in the data, and we would need to consider how this might impact our analysis.
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
The goal of descriptive statistics is to summarize the data in a way that is meaningful and interpretable. With this in mind, compare the frequency tables in [-@tbl-analysis-belc-frequency-table] and bar plots in [-@fig-analysis-belc-barplots]. Does one provide a more interpretable summary of the data? Why or why not? Are there any other ways you might communicate this distribution more effectively?
:::
Numeric variables are best understood visually. The most common visualizations of the distribution of a numeric variable are histograms and density plots. **Histograms** are a type of bar plot where the x-axis is a numeric variable and the y-axis is the frequency of the values falling within a determined range of values, or bins. The frequency of values within each bin is represented by the height of the bars.
**Density plots** are a smoothed version of histograms. The y-axis of a density plot is the probability of the values. When frequent values appear closely together, the plot line is higher. When the frequency of values is lower or more spread out, the plot line is lower.
```{r}
#| label: fig-analysis-belc-histogram-density-tokens
#| fig-cap: "Distribution plots for the variable `tokens`."
#| fig-alt: "Three plots with histograms overlayed with density plots. The three plots represent the distribution of the values of the variables `tokens`, `types`, and `ttr`. Of the three, the `ttr` plot is the most symmetric."
#| fig-subcap:
#| - "Histogram"
#| - "Density plot"
#| layout-ncol: 2
# Define range for x-axis
x_range <- belc_essay_tbl |>
pull(tokens) |>
range()
# Histogram ----
belc_essay_tbl |>
ggplot(aes(x = tokens)) +
geom_histogram(bins = 30, color = "black", fill = "white") +
labs(x = "Number of tokens", y = "Frequency") +
scale_x_continuous(limits = x_range) + theme_qtalr(font_size = 12)
# Density plot ----
belc_essay_tbl |>
ggplot(aes(x = tokens)) +
geom_density() +
labs(x = "Number of tokens", y = "Probability") +
scale_x_continuous(limits = x_range) + theme_qtalr(font_size = 12)
```
Both the histogram in @fig-analysis-belc-histogram-density-tokens-1 and the density plot in @fig-analysis-belc-histogram-density-tokens-2 show the distribution of the variable `tokens` in slightly different ways, which translates into trade-offs in interpretability.
The histogram shows the frequency of the values in bins. The number of bins and/ or the binwidth can be changed for more or less granularity. A coarse-grained histogram shows the general shape of the distribution, but it is difficult to see the details. A fine-grained histogram shows the details of the distribution, but it is difficult to see the general shape. The density plot shows the general shape of the distribution, but it hides the details. Given this trade-off, it is often useful to explore outliers with histograms and the overall shape of the distribution with density plots.
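The granularity of a histogram is controlled by the `bins` (or `binwidth`) argument of `geom_histogram()`. A minimal sketch of the trade-off, plotting the same variable at a coarse and a fine setting:
```{r}
#| label: analysis-belc-bins-example
#| eval: false
# Coarse-grained: general shape, few details
belc_essay_tbl |>
  ggplot(aes(x = tokens)) +
  geom_histogram(bins = 10)
# Fine-grained: details (gaps, outliers), general shape harder to see
belc_essay_tbl |>
  ggplot(aes(x = tokens)) +
  geom_histogram(bins = 100)
```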
```{r}
#| label: fig-analysis-belc-histograms
#| fig-cap: "Histograms for numeric variables `tokens`, `types`, and `ttr`."
#| fig-subcap:
#| - "Number of tokens"
#| - "Number of types"
#| - "Type-token ratio score"
#| layout-ncol: 3
# Histograms ----
belc_essay_tbl |>
ggplot(aes(x = tokens)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "white") +
geom_density() +
labs(x = "", y = "") +
theme(axis.text.y = element_blank()) +
theme_qtalr(font_size = 13)
belc_essay_tbl |>
ggplot(aes(x = types)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "white") +
geom_density() +
labs(x = "", y = "") +
theme(axis.text.y = element_blank()) +
theme_qtalr(font_size = 13)
belc_essay_tbl |>
ggplot(aes(x = ttr)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "white") +
geom_density() +
labs(x = "", y = "") +
theme(axis.text.y = element_blank()) +
theme_qtalr(font_size = 13)
```
In @fig-analysis-belc-histograms we see both histograms and density plots combined for the variables `tokens`, `types`, and `ttr`. Focusing on the details captured in the histogram, we are better able to detect potential outliers. Outliers can reflect valid values that are simply extreme, or they can reflect something erroneous in the data. To distinguish between these two possibilities, it is important to know the context of the data. Take, for example, @fig-analysis-belc-histograms-3. We see that there is a bin near the value 1.0. Given that the type-token ratio is the ratio of the number of types to the number of tokens, it is unlikely that the type-token ratio would be exactly 1.0, as this would mean that every word in an essay is unique. Another, less dramatic, example is the bin to the far right of @fig-analysis-belc-histograms-1. In this case, the bin represents the number of tokens in an essay. An uptick in the number of essays with a large number of tokens is not surprising and would not typically be considered an outlier. On the other hand, consider the bin near the value 0 in the same plot. It is unlikely that a true essay would have 0, or near 0, words, and therefore a closer look at the data is warranted.
It is important to recognize that outliers exert undue influence on overall measures of central tendency and dispersion. To appreciate this, let's consider another helpful visualization, the **boxplot**, which represents the central tendency, dispersion, and distribution of a numeric variable in one plot.
```{r}
#| label: fig-analysis-belc-histogram-boxplot
#| fig-cap: "Understanding the similarities between boxplots and histograms"
#| fig-alt: "Two plots of the `ttr` shown one above the other. On top a histogram and below a boxplot. The histogram is includes vertical lines for the first quartile, median, mean, and third quartile. These are the same values represented by the boxplot. These lines are vertically aligned."
#| fig-subcap:
#| - "Histogram"
#| - "Boxplot"
#| fig-width: 8
#| fig-asp: 0.283
#| layout-nrow: 2
# Calculate quantiles and mean
quants <-
belc_essay_tbl |>
pull(ttr) |>
quantile(probs = c(0.25, 0.5, 0.75))
mean_val <-
belc_essay_tbl |>
pull(ttr) |>
mean()
# Histogram plot ----
p1 <-
belc_essay_tbl |>
ggplot(aes(x = ttr)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "#BDBDBD", fill = "white") +
# geom_density() +
geom_vline(aes(xintercept = quants[[1]]), linetype = "solid") + # first quartile
geom_vline(aes(xintercept = quants[[2]]), linetype = "solid", linewidth = 0.75) + # median
geom_vline(aes(xintercept = mean_val), linetype = "dashed", linewidth = 0.75) + # mean
geom_vline(aes(xintercept = quants[[3]]), linetype = "solid") + # third quartile
labs(x = "", y = "") +
theme(axis.text.y = element_blank()) +
scale_x_continuous(limits = c(0.4, 1.02))
p1
# Boxplot ----
p2 <-
belc_essay_tbl |>
ggplot(aes(x = ttr)) +
geom_boxplot() +
scale_x_continuous(limits = c(0.4, 1.02)) +
annotate("text", x = -0.38, y = 0.38, label = "Median", hjust = 0, vjust = 0) +
geom_segment(aes(y = -0.38, yend = 0.38, x = mean(ttr), xend = mean(ttr)), linetype = "dashed", linewidth = 0.5) +
labs(y = "", x = "") +
theme(axis.text.y = element_blank())
p2
```
In @fig-analysis-belc-histogram-boxplot-2 we see a boxplot for the `ttr` variable. The box in the middle of the plot represents the interquartile range (IQR), which is the range of values between the first quartile and the third quartile. The solid line in the middle of the box represents the median. The lines extending from the box are called 'whiskers' and cover the range of values which are within 1.5 times the IQR. Values outside of this range are plotted as individual points.
Now let's consider boxplots from another angle. Just above, in @fig-analysis-belc-histogram-boxplot-1, I've plotted a histogram. In this view, we can see that a boxplot is a simplified histogram augmented with central tendency and dispersion statistics. While histograms focus on the frequency distribution of data points, boxplots focus on the data's quartiles and potential outliers.
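To see exactly where the whiskers stop, here is a minimal sketch that computes the 1.5 times IQR 'fences' for `ttr` by hand; points beyond these fences are the ones plotted individually:
```{r}
#| label: analysis-belc-fences-example
#| eval: false
q <- quantile(belc_essay_tbl$ttr, c(0.25, 0.75)) # first and third quartiles
iqr <- q[[2]] - q[[1]] # interquartile range
c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr) # whisker fences
```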
Concerning outliers, it is important to address them to safeguard the accuracy of the analysis. There are two main approaches: eliminating observations with outliers or transforming the data. The elimination, or **trimming**, of outliers is more extreme, as it removes data, but it can be the best approach for true outliers. Transforming the data is an approach to mitigating the influence of extreme but valid values. **Transformation** involves applying a mathematical function to the data which changes the scale and/ or shape of the distribution, but does not remove data, nor does it change the relative order of the values.
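As a minimal sketch of the two strategies (the thresholds below are illustrative, not prescriptive):
```{r}
#| label: analysis-belc-outliers-example
#| eval: false
# Trimming: drop observations judged to be erroneous
belc_trimmed <-
  belc_essay_tbl |>
  filter(tokens > 5, ttr < 1) # illustrative thresholds
# Transformation: rescale extreme but valid values
belc_transformed <-
  belc_essay_tbl |>
  mutate(tokens_log = log10(tokens))
```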
<!-- Normal distribution/ skewed distributions -->
The exploration of the data points with histograms and boxplots has helped us to identify outliers. Now we turn to the question of the overall shape of the distribution.
When values are symmetrically dispersed around the central tendency, the distribution is said to be normal. The **Normal Distribution** is characterized by a mean and median that are the same. The Normal Distribution has a key role in theoretical statistics and is the foundation for many statistical tests. It is also known as the Gaussian Distribution or the Bell Curve, for the hallmark bell shape of the distribution. In a normal distribution, extreme values are less likely than values near the center.
When values are not symmetrically dispersed around the central tendency, the distribution is said to be skewed. A distribution in which values tend to disperse to the left of the central tendency is **left skewed** and a distribution in which values tend to disperse to the right of the central tendency is **right skewed**.
Simulations of these distributions appear in @fig-analysis-distributions.
```{r}
#| label: fig-analysis-distributions
#| fig-cap: "Mean and median for normal and skewed distributions"
#| fig-alt: "Three plots that show the distribution of values for left-skewed, normal, and right-skewed distributions. The left skewed distribution has a mean to the left of the median, the normal distribution has a mean equal to the median, and the right skewed distribution has a mean to the right of the median."
#| fig-subcap:
#| - "Left-skewed"
#| - "Normal"
#| - "Right-skewed"
#| layout-ncol: 3
shape1 <- 6
shape2 <- 2
# Left skewed distribution ----
set.seed(123)
left_skew_data <- tibble(value = rbeta(1000, shape1, shape2)) # left skewed
ggplot(left_skew_data, aes(x = value)) +
geom_function(fun = dbeta, args = list(shape1 = shape1, shape2 = shape2), color = "black") +
geom_vline(aes(xintercept = mean(value)), linetype = "dashed") +
geom_vline(aes(xintercept = median(value)), linetype = "solid") +
labs(x = "Values", y = "Density") +
theme(axis.text = element_blank()) + theme_qtalr(font_size = 12)
# Normal distribution ----
set.seed(123)
norm_data <- tibble(value = rnorm(1000))
ggplot(norm_data, aes(x = value)) +
geom_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "black") +
geom_vline(aes(xintercept = median(value)), linetype = "solid") +
geom_vline(aes(xintercept = mean(value)), linetype = "dashed") +
labs(x = "Values", y = "Density") +
theme(axis.text = element_blank()) + theme_qtalr(font_size = 12)
# Right skewed distribution ----
set.seed(123)
right_skew_data <- tibble(value = rbeta(1000, shape2, shape1)) # right skewed
ggplot(right_skew_data, aes(x = value)) +
geom_function(fun = dbeta, args = list(shape1 = shape2, shape2 = shape1), color = "black") +
geom_vline(aes(xintercept = mean(value)), linetype = "dashed") +
geom_vline(aes(xintercept = median(value)), linetype = "solid") +
labs(x = "Values", y = "Density") +
theme(axis.text = element_blank()) + theme_qtalr(font_size = 12)
```
<!-- [ ] Make note somehow that a simulation-based inference method approach will be used in this textbook @Morris2019 and @Rossman2014a -->
Assessing the distribution of a variable is important for two reasons. First, the distribution of a variable can inform the choice of statistical test in theory-based hypothesis testing. Data that are normally, or near-normally, distributed are often analyzed using parametric tests, while data that exhibit a skewed distribution are often analyzed using non-parametric tests. Second, highly skewed distributions have the effect of compressing the range of values. This can lead to a loss of information and can make it difficult to detect patterns in the data.
Skewed frequency distributions are commonly found for linguistic units (*e.g.* phonemes, morphemes, words, *etc.*). However, these distributions tend to follow a particular type of skew known as a Zipfian distribution. According to **Zipf's Law** [@Zipf1949], the frequency of a linguistic unit is inversely proportional to its rank. In other words, the most frequent unit will appear roughly twice as often as the second most frequent unit, three times as often as the third most frequent unit, and so on.
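A quick way to probe this empirically is to build a rank-frequency table from word counts. A minimal sketch, assuming the word-level `belc_tbl` tibble constructed earlier in this chapter:
```{r}
#| label: analysis-zipf-example
#| eval: false
word_ranks <-
  belc_tbl |>
  count(word, sort = TRUE) |> # frequency of each word, descending
  mutate(rank = row_number()) # rank 1 = most frequent
# under Zipf's Law, frequency * rank is roughly constant
word_ranks
```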
The plot in @fig-analysis-zipf-distribution-1 is simulated data that fits a Zipfian distribution.
```{r}
#| label: fig-analysis-zipf-distribution
#| fig-cap: "Zipfian distribution"
#| fig-alt: "Two plots that show the distribution of values for a Zipfian distribution. The left plot shows the Zipfian distribution and the right plot shows the log-transformed Zipfian distribution. The Zipfian distribution is highly right-skewed, with a deep curve. The log transformation smooths the curve, spreading out the values of the distribution."
#| fig-subcap:
#| - "Zipfian distribution"
#| - "Log-transformed Zipfian distribution"
#| layout-ncol: 2
set.seed(123)
zipf_data <- tibble(rank = 1:100, frequency = 100 / (1:100))
# Zipfian distribution
zipf_data |>
ggplot(aes(x = rank, y = frequency)) +
geom_line() +
labs(x = "Rank", y = "Frequency") + theme_qtalr(font_size = 12)
# Log-transformed Zipfian distribution
zipf_data |>
ggplot(aes(x = rank, y = log(frequency))) +
geom_line() +
labs(x = "Rank", y = "Frequency (log)") + theme_qtalr(font_size = 12)
```
Zipf's law describes a theoretical distribution, and the actual distribution of units in a corpus is affected by various sampling factors, including the size of the corpus. The larger the corpus, the closer the distribution will be to the Zipf distribution.
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
As stated above, Zipfian distributions are typical of natural language and are observed at various linguistic levels. This is because natural language is a complex system, and complex systems tend to exhibit Zipfian distributions. Other examples of complex systems that exhibit Zipfian distributions include the size of cities, the frequency of species in ecological communities, the frequency of links on the World Wide Web, *etc.*
:::
In the case that a variable is highly skewed (as in linguistic frequency distributions), it is often useful to transform the variable to reduce the skewness. In contrast to scale-based transformations (*e.g.* centering and scaling), shape-based transformations change both the scale and the shape of the distribution. The most common shape-based transformation is the logarithmic transformation. The **logarithmic transformation** (log-transformation) takes the log (typically base 10) of each value in a variable. It is useful for reducing skewness because it compresses large values and expands small values, as in the case of the Zipfian distribution in @fig-analysis-zipf-distribution-2.
It is important to note, however, that if scale-based transformations are to be applied to a variable, they should be applied after the log-transformation, since centering produces negative values and the log of a non-positive value is undefined.
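A minimal sketch of this ordering applied to the `tokens` variable:
```{r}
#| label: analysis-transform-example
#| eval: false
belc_essay_tbl |>
  mutate(
    tokens_log = log10(tokens), # shape-based: reduce the skew first
    tokens_z = as.numeric(scale(tokens_log)) # scale-based: z-scores after
  )
```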
### Association
<!-- Purpose: nature and strength -->
We have covered the first three of the four questions we are interested in asking in a descriptive analysis. The fourth, and last, question is whether there is an association between variables. If so, what is its directionality, and what is the apparent magnitude of the dependence? Knowing the answers to these questions will help frame our approach to analysis.
To assess association, the number and informational types of the variables under consideration are important. Let's start by considering two variables. If we are working with two variables, we are dealing with a **bivariate** relationship. Given there are three informational types (categorical, ordinal, and numeric), there are six logical bivariate combinations: categorical-categorical, categorical-ordinal, categorical-numeric, ordinal-ordinal, ordinal-numeric, and numeric-numeric.
How we summarize a relationship, whether in tabular or graphic form, depends on the informational types of the variables involved. In @tbl-analysis-summary-types, we see the appropriate summary types for each of the six bivariate combinations.
::: {#tbl-analysis-summary-types tbl-colwidths="[17, 27, 28, 28]"}
| | Categorical | Ordinal | Numeric |
|-----------------|-------------------|-----------------------------|----------------------|
| **Categorical** | Contingency table | Contingency table/ Bar plot | Pivot table/ Boxplot |
| **Ordinal** | - | Contingency table/ Bar plot | Pivot table/ Boxplot |
| **Numeric** | - | - | Scatterplot |
Summaries for different combinations of variable types
:::
<!-- Nominal + ? -->
Let's first start with the combinations that include a categorical or ordinal variable. Categorical and ordinal variables reflect measures of class-type information, with the addition of meaningful ranks for ordinal variables. To assess a relationship between these variable types, a table is always a good place to start. When the two variables are combined, a **contingency table** is the appropriate table: a cross-tabulation of two class-type variables, essentially a two-way frequency table. This means that three of the six bivariate combinations are assessed with a contingency table: categorical-categorical, categorical-ordinal, and ordinal-ordinal.
In @tbl-analysis-belc-contingency-tables we see contingency tables for the categorical variable `sex` and ordinal variable `group` in the BELC dataset. A contingency table may include only counts, as in @tbl-analysis-belc-contingency-tables-1, or may include proportions or percentages in an effort to normalize the counts and make them more comparable, as in @tbl-analysis-belc-contingency-tables-2.
<!-- sex + group contingency table -->
```{r}
#| label: tbl-analysis-belc-contingency-tables
#| tbl-cap: "Contingency tables for categorical variable `sex` and ordinal variable `group`"
#| tbl-subcap:
#| - "Counts"
#| - "Percentages"
#| layout: [[1, 1]]
#| layout-valign: top
#| tbl-colwidths: [25, 25, 25, 25]
belc_essay_tbl |>
tabyl(group, sex) |>
adorn_totals(c("row", "col")) |>
as_tibble() |>
tt(width = 1, digits = 0)
belc_essay_tbl |>
tabyl(group, sex) |>
adorn_totals(c("row", "col")) |>
adorn_percentages("row") |>
adorn_pct_formatting(digits = 2) |>
as_tibble() |>
tt(width = 1, digits = 0)
```
It is sometimes helpful to visualize a contingency table as a bar plot when there is a large number of levels in either or both of the variables. Again, looking at the relationship between `sex` and `group`, we can plot either the counts or the proportions. In @fig-analysis-belc-bar-plots, we see both.
<!-- sex + group bar plots (counts/ proportions) -->
```{r}
#| label: fig-analysis-belc-bar-plots
#| fig-cap: "Bar plots for the relationship between `sex` and `group`"
#| fig-alt: "Two bar plots. On the left, a bar plot for the relationship between `sex` and `group` as counts on the y-axis. On the right, a bar plot for the relationship between `sex` and `group` as proportions on the y-axis."
#| fig-subcap:
#| - "Counts"
#| - "Proportions"
#| layout-ncol: 2
belc_essay_tbl |>
ggplot(aes(x = sex, y = after_stat(count), fill = group)) +
geom_bar(position = "stack", color = "black") +
scale_fill_brewer(palette = "Greys") +
labs(x = "Sex", y = "Frequency", fill = "Group") +
ylim(0, 80) +
theme_qtalr(font_size = 12)
belc_essay_tbl |>
ggplot(aes(x = sex, fill = group)) +
geom_bar(position = "fill", color = "black") +
scale_fill_brewer(palette = "Greys") +
labs(x = "Sex", y = "Proportion", fill = "Group") +
theme_qtalr(font_size = 12)
```
To summarize and assess the relationship between a categorical or an ordinal variable and a numeric variable, we cannot use a contingency table. Instead, this type of relationship is best summarized with a **pivot table**, a table in which a class-type variable is used to group a numeric variable by some summary statistic appropriate for numeric variables, *e.g.* mean, median, standard deviation, *etc.*
In @tbl-analysis-belc-pivot-table, we see a pivot table for the relationship between `group` and `tokens` in the BELC dataset. Specifically, we see the mean number of tokens by group. We see that the mean number of tokens increases from Group T1 to T4, which is consistent with the idea that the students in the higher groups are writing longer essays.
<!-- group + tokens pivot table (mean) -->
```{r}
#| label: tbl-analysis-belc-pivot-table
#| tbl-cap: "Pivot table for the mean `tokens` by `group`"
#| tbl-colwidths: [50, 50]
belc_essay_tbl |>
group_by(group) |>
summarise(mean_tokens = mean(tokens)) |>
tt(width = 1)
```
Although a pivot table may be appropriate for targeted numeric summaries, a visualization is often more informative for assessing the dispersion and distribution of a numeric variable by a categorical or ordinal variable. There are two main types of visualizations for this type of relationship: a boxplot and a **violin plot**. A violin plot summarizes the distribution of a numeric variable by a categorical or ordinal variable, adding the overall shape of the distribution, much as a density plot does for a histogram.
In @fig-analysis-belc-boxplot-violin-plot, we see both a boxplot and a violin plot for the relationship between `group` and `tokens` in the BELC dataset. From the boxplot in @fig-analysis-belc-boxplot-violin-plot-1, we see a general trend towards more tokens used by students in higher groups. But we can also appreciate the dispersion of the data within each group by looking at the boxes and whiskers. On the surface, it appears that the values within groups T1 and T3 are more tightly clustered than within groups T2 and T4, which show more within-group variability. Furthermore, we can see outliers in groups T1 and T3, but not in groups T2 and T4. From the violin plot in @fig-analysis-belc-boxplot-violin-plot-2, we can see the same information, but we can also see the overall shape of the distribution of tokens within each group. In this plot, it is very clear that group T4 includes a wide range of token counts.
<!-- group + tokens boxplot/ voilin plot -->
```{r}
#| label: fig-analysis-belc-boxplot-violin-plot
#| fig-cap: "Boxplot and violin plot for the relationship between `group` and `tokens`"
#| fig-alt: "Two plots that `group` on the x-axis and `tokens` on the y-axis. The left plot is a boxplot and the right plot is a violin plot. The boxplot shows the median, first and third quartiles, and the whiskers. The violin plot shows the distribution of the data by the width of the plot at the points where the data is most dense."
#| fig-subcap:
#| - "Boxplot"
#| - "Violin plot"
#| layout-ncol: 2
belc_essay_tbl |>
ggplot(aes(x = group, y = tokens)) +
geom_boxplot(color = "black") +
labs(x = "Group", y = "Tokens") +
theme_qtalr(font_size = 12)
belc_essay_tbl |>
ggplot(aes(x = group, y = tokens)) +
geom_violin(color = "black") +
labs(x = "Group", y = "Tokens") +
theme_qtalr(font_size = 12)
```
<!-- Numeric + Numeric -->
The last bivariate combination is numeric-numeric. To summarize this type of relationship, a scatterplot is used. A **scatterplot** is a visualization that plots each data point as a point in a two-dimensional space, with one numeric variable on the x-axis and the other numeric variable on the y-axis. Depending on the type of relationship you are trying to assess, you may want to add a trend line to the scatterplot. A trend line is a line that summarizes the overall trend in the relationship between the two numeric variables. To assess the extent to which the relationship is linear, a straight line is drawn which minimizes the distance between the line and the points.
In @fig-analysis-belc-scatter-plot, we see a scatterplot, and a scatterplot with a trend line, for the relationship between `ttr` and `types` in the BELC dataset. We see an apparent negative relationship between these two variables: as the number of types increases, the type-token ratio decreases. This is consistent with TTR's sensitivity to text length noted earlier: essays with more types tend to be longer, and in a longer text each additional word is less likely to be new, pulling the ratio of types to tokens down. Since we are evaluating a linear relationship, we are assessing the extent to which there is a **correlation** between `ttr` and `types`. A correlation simply means that as the values of one variable change, the values of the other variable change in a consistent manner.
<!-- ttr + types scatter plot: points, points + trend line -->
```{r}
#| label: fig-analysis-belc-scatter-plot
#| fig-cap: "Scatter plot for the relationship between `ttr` and `types`"
#| fig-alt: "Two scatterplots in which the y-axis is `ttr` and the x-axis is `types`. The first scatterplot shows the points only. The second scatterplot shows the points with a linear trend line which minimizes the distance between the line and the points. In this case, that line slopes from the top left to the bottom right."
#| fig-subcap:
#| - "Points"
#| - "Points with a linear trend line"
#| layout-ncol: 2
#| fig-pos: H
belc_essay_tbl |>
ggplot(aes(x = types, y = ttr)) +
geom_point(color = "black", alpha = 0.5) +
labs(x = "Number of types", y = "Type-Token Ratio score") +
theme_qtalr(font_size = 12)
belc_essay_tbl |>
ggplot(aes(x = types, y = ttr)) +
geom_point(color = "black", alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Number of types", y = "Type-Token Ratio score") +
theme_qtalr(font_size = 12)
```
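To quantify the strength of a linear association, we can calculate Pearson's correlation coefficient, which ranges from -1 (perfect negative association) through 0 (no linear association) to 1 (perfect positive association). A minimal sketch:
```{r}
#| label: analysis-correlation-example
#| eval: false
belc_essay_tbl |>
  summarise(r = cor(types, ttr)) # Pearson's r between types and ttr
```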
## Analyze {#sec-analysis-analyze}
The goal of analysis, generally, is to generate knowledge from information. The type of knowledge generated, and the process by which it is generated, differ, and these differences can be broadly grouped into three analysis types: exploratory, predictive, and inferential.
In this section, I will elaborate briefly on the distinctions between the analysis types seen in @tbl-analysis-analysis-types. I will structure the discussion moving from the least structured (inductive) to the most structured (deductive) approach to deriving knowledge from information, with the aim of providing enough information for you to identify these research approaches in the literature and to make appropriate decisions as to which approach your research should adopt.
::: {#tbl-analysis-analysis-types tbl-colwidths="[15, 19, 22, 22, 22]"}
| Type | Aims | Approach | Methods | Evaluation |
|------|------|----------|---------|------------|
| Exploratory | Explore: gain insight | Inductive, data-driven, and iterative | Descriptive, pattern detection with machine learning (unsupervised) | Associative |
| Predictive | Predict: validate associations | Semi-deductive, data-/ theory-driven, and iterative | Predictive modeling with machine learning (supervised) | Model performance, feature importance, and associative |
| Inferential | Explain: test hypotheses | Deductive, theory-driven, and non-iterative | Hypothesis testing with statistical tests | Causal |
Overview of analysis types
:::
### Explore {#sec-analysis-explore}
In **Exploratory Data Analysis (EDA)**, we use a variety of methods to identify patterns, trends, and relations within and between variables. The goal of EDA is to uncover insights in an inductive, data-driven manner. That is to say, we do not enter into EDA with a fixed hypothesis in mind, but rather we explore intuition, probe anecdote, and follow hunches to identify patterns and relationships and to evaluate whether and why they are meaningful. We are admittedly treading new or unfamiliar terrain, letting the data guide our analysis. This means that we can use and reuse the same data to explore different angles and approaches, adjusting our methods and measures as we go. In this way, EDA is an iterative, meaning-generating process.
<!-- Identification of variables -->
In line with the investigative nature of EDA, the identification of variables of interest is a discovery process. We most likely have an intuition about the variables we would like to explore, but we are able to adjust our variables as needed to suit our research aims. When the identification and selection of variables is open, the process is known as **feature engineering**. As much an art as a science, feature engineering leverages a mixture of relevant domain knowledge, intuition, and trial and error to identify features that best represent the data and best serve the research aims. Furthermore, the roles of features in EDA are fluid: no variable has a special status, as seen in @fig-eda-variables. We will see that in other types of analysis, some or all of the roles of the variables are fixed.
::: {#fig-eda-variables}
[![](figures/analysis-eda-variables.drawio.png){width=85%}]{fig-alt="A table with columns labeled 'feat_1', 'feat_2', and so on. Above, these columns are labeled as 'features'. This figure aims to show that no feature has a special status in exploratory data analysis."}
Roles of variables in exploratory data analysis
:::
Any given dataset could serve as a starting point to explore many different types of research questions. In order to maintain research coherence, so our efforts do not careen into a free-for-all, we need to tether our feature engineering to a unit of analysis that is relevant to the research question. A **unit of analysis** is the entity that we are interested in studying. This is not to be confused with the unit of observation, which is the entity that we are able to observe and measure [@Sedgwick2015]. Depending on the perspective we are interested in investigating, the choice of how to approach engineering features to gain insight will vary.
By the same token, approaches for interrogating the dataset can differ significantly, both between research projects and within the same project, but for instructive purposes, let's draw a distinction between descriptive methods and unsupervised learning methods, as seen in @tbl-eda-methods.
::: {#tbl-eda-methods tbl-colwidths="[50, 50]"}
| Descriptive methods | Unsupervised learning methods |
|---------------------|-----------------------------|
| Frequency analysis | Cluster analysis |
| Co-occurrence analysis | Principal component analysis |
| Keyness analysis | Topic Modeling |
| | Vector space models |
Some common exploratory data analysis methods
:::
The first group, **descriptive methods**, can be seen as an extension of the descriptive statistics covered earlier in this chapter, including statistical, tabular, and visual techniques. The second group, **unsupervised learning**, is a subtype of machine learning in which an algorithm is used to find patterns within and between variables in the data without any guidance (supervision). In this way, the algorithm, or machine learner, is left to make connections and associations wherever they may appear in the input data.
Whether through descriptive methods, unsupervised learning methods, or a combination of both, EDA employs quantitative methods to summarize, reduce, and sort complex datasets in order to provide the researcher a novel perspective to be qualitatively assessed. Exploratory methods produce results that require associative thinking and pattern detection. Speculative as they are, the results from exploratory methods can be highly informative, lead to new insight, and inspire further study in directions that may not have been expected.
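As a minimal sketch of an unsupervised method, consider k-means clustering applied to the numeric features of the BELC dataset. The number of clusters (here 3) is an illustrative choice, not a recommendation:
```{r}
#| label: analysis-eda-kmeans-example
#| eval: false
set.seed(123)
belc_clusters <-
  belc_essay_tbl |>
  select(tokens, types, ttr) |> # numeric features only
  scale() |> # standardize to a common scale
  kmeans(centers = 3) # group essays into 3 clusters
belc_clusters$cluster # cluster assignment for each essay
```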
### Predict {#sec-analysis-predict}
**Predictive Data Analysis (PDA)** employs a variety of techniques to examine and evaluate the strength of the association between a set of variables and a target variable, with a specific focus on prediction. The aim of PDA is to construct models that can accurately forecast future outcomes, using either data-driven or theory-driven approaches. In this process, **supervised learning** methods, in which the machine learning algorithm is guided (supervised) by a target outcome variable, are used. This means we don't begin PDA with a completely open-ended exploration, but rather with an objective: accurate predictions. However, the path to achieving this objective can be flexible, allowing us freedom to adjust our models and methods. Unlike EDA, where the entire dataset can be reused for different approaches, PDA requires a portion of the data to be reserved for evaluation, enhancing the validity of our predictive models. Thus, PDA is an iterative process that combines the flexibility of exploratory analysis with the rigor of confirmatory analysis.
<!-- Identification of variables -->
There are two types of variables in PDA: the outcome variable and the predictor variables, or features. The **outcome variable** is the variable that the researcher is trying to predict. It is the only variable that is necessarily fixed as part of the research question. The features are the variables that are used to predict the outcome variable. An overview of the roles of these variables in PDA is shown in @fig-pda-variables.
::: {#fig-pda-variables}
[![](figures/analysis-pda-variables.drawio.png){width=85%}]{fig-alt="A table which has one column labeled 'outcome' and the other columns labeled 'feat_1', 'feat_2', and so on. Above the columns, their roles are labeled as 'Outcome' and 'Features'. This figure aims to show that the outcome variable is fixed and the features are flexible in predictive data analysis."}
Roles of variables in predictive data analysis
:::
Feature selection can be either data-driven or theory-driven. Data-driven features are those that are engineered to enhance predictive power, while theory-driven features are those that are selected based on theoretical relevance.
The approach to interrogating the dataset includes three main steps: feature engineering, model selection, and model evaluation. We've discussed feature engineering, so what are model selection and model evaluation?
**Model selection** is the process of choosing a machine learning algorithm and set of features that produces the best prediction accuracy for the outcome variable. To refine our approach such that we arrive at the best combination of algorithm and features, we need to train our machine learner on a variety of combinations and evaluate the accuracy of each.
There are many different types of machine learning algorithms, each with their own strengths and weaknesses. The first rough cut is to decide what type of outcome variable we are predicting: categorical or numeric. If the outcome variable is categorical, we are performing a **classification** task, and if the outcome variable is numeric, we are performing a **regression** task. As we see in @tbl-pda-algorithms, there are various algorithms that can be used for each task.
::: {#tbl-pda-algorithms tbl-colwidths="[50, 50]"}
| Classification | Regression |
|:-----------------------|:--------------------------|
| Logistic regression | Linear regression |
| Random forest classifier | Random forest regressor |
| Support vector machine | Support vector regression |
| Neural network classifier | Neural network regressor |
Some common supervised learning algorithms used in PDA
:::
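To illustrate one of the classification algorithms in @tbl-pda-algorithms, here is a minimal sketch that fits a logistic regression with base R's `glm()`, continuing the hypothetical `train_set` and `test_set` objects from the earlier split. The 0.5 threshold is a common default, not a requirement.

```r
# Fit a logistic regression classifier on the training data
fit <- glm(outcome ~ feat_1 + feat_2, data = train_set, family = binomial)

# Predict class probabilities for the held-out test set
probs <- predict(fit, newdata = test_set, type = "response")

# Convert probabilities to class labels: "informal" is the second factor
# level, so probs estimate the probability of "informal"
preds <- ifelse(probs > 0.5, "informal", "formal")
```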
There are a number of algorithm-specific strengths and weaknesses to be considered in the process of model selection. These hinge on characteristics of the data, such as the size of the dataset, the number and type of features, and the expected type of relationships between features, or on computing resources, such as the time available to train the model or the memory available to store it.
**Model evaluation** is the process of assessing the accuracy of the model on the test set, which serves as a proxy for how well the model will generalize to new data. Model evaluation is performed quantitatively by calculating accuracy metrics, but it is important to note that judging whether those metrics are good enough is, to some degree, a qualitative judgment.
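Continuing the sketch above, a first quantitative pass at model evaluation might compare predicted and actual labels on the test set. Because the simulated features are pure noise, accuracy here should hover near chance; with real features, the researcher would still need to judge whether the numbers are good enough for the task at hand.

```r
# Overall accuracy: proportion of correct predictions on the test set
mean(preds == test_set$outcome)

# A confusion matrix gives a fuller picture than accuracy alone
table(predicted = preds, actual = test_set$outcome)
```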
### Infer {#sec-analysis-infer}
The most commonly recognized of the three data analysis approaches, **Inferential Data Analysis (IDA)** is the bread and butter of science. IDA is a deductive, theory-driven approach in which all aspects of the analysis stem from a pre-determined premise, or hypothesis, about the nature of a relationship in the world, and it aims to test whether this relationship is statistically supported given the evidence. Since the goal is to infer conclusions about a relationship in the population based on a statistical evaluation of a (corpus) sample, the representativeness of the sample is of utmost importance. Furthermore, the use of the data is limited to the scope of the hypothesis --that is, the data cannot be reused iteratively for exploratory purposes.
<!-- Identify variables -->
The selection of variables and the roles they play in the analysis are determined by the hypothesis. In a nutshell, a **hypothesis** is a formal statement about the state of the world. This statement is theory-driven, meaning that it is predicated on previous research. We are not exploring or examining relationships; rather, we are testing a specific relationship. In practice, however, we are in fact proposing two mutually exclusive hypotheses. The first is the **Alternative Hypothesis**, or $H_1$. This is the hypothesis just described --the statement grounded in the previous literature outlining a predicted relationship. The second is the **Null Hypothesis**, or $H_0$. This is the flip-side of the hypothesis testing coin and states that there is no difference or relationship. For example, $H_1$ might state that texts from one register have shorter sentences on average than texts from another, while the corresponding $H_0$ states that there is no difference in average sentence length. Together $H_1$ and $H_0$ cover all logical outcomes.
Now, in standard IDA one variable is the response variable and one or more variables are explanatory variables. The **response variable**, sometimes referred to as the outcome or dependent variable, is the variable which contains the information which is hypothesized to depend on the information in the explanatory variable(s). It is the variable whose variation a research study seeks to explain. An **explanatory variable**, sometimes referred to as an independent or predictor variable, is a variable whose variation is hypothesized to explain the variation in the response variable.
Explanatory variables add to the complexity of a study because they are part of our research focus, specifically our hypothesis. It is, however, common to include other variables which are not of central focus but are commonly assumed to contribute to the explanation of the variation in the response variable. These are known as **control variables**. Control variables are included in the analysis to account for the influence of other variables on the relationship between the response and explanatory variables, but they are neither included in the hypothesis nor interpreted in our results.
We can now see in @fig-analysis-ida-variables the roles assigned to variables in a hypothesis-driven study.
::: {#fig-analysis-ida-variables}
[![](figures/analysis-ida-variables.drawio.png){width=85%}]{fig-alt="A table which has one column labeled 'response', two columns labeled 'expl_1' and 'expl_2', and two more labeled 'cont_1' and 'cont_2'. Above the columns, their roles are labeled as 'Response', 'Explanatory', and 'Control'. This figure aims to show that the response variable and the explanatory variables are fixed, and that control variables can optionally be included in inferential data analysis."}
Roles of variables in inferential data analysis
:::
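One way to see these roles in practice is in an R model formula, where explanatory and control variables enter the model in the same way but are treated differently in interpretation. The sketch below uses simulated data with the hypothetical names from @fig-analysis-ida-variables.

```r
# A minimal sketch mapping variable roles onto a model formula
# (all names and values are hypothetical, simulated data)
set.seed(123)

dat <- data.frame(
  response = rnorm(100),
  expl_1 = rnorm(100), expl_2 = rnorm(100),  # hypothesized predictors
  cont_1 = rnorm(100), cont_2 = rnorm(100)   # control variables
)

# Controls are included to account for their influence, but only the
# explanatory variables are interpreted against the hypothesis
fit <- lm(response ~ expl_1 + expl_2 + cont_1 + cont_2, data = dat)
summary(fit)
```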
The type of statistical test that one chooses is based on (1) the informational value of the response variable and (2) the number of explanatory variables included in the analysis. Together these two characteristics go a long way in determining the appropriate class of statistical test (see @Gries2013a and @Paquot2020a for a more exhaustive description).
IDA relies heavily on quantitative evaluation methods to draw conclusions that can be generalized to the target population. It is key to understand that our goal in hypothesis testing is not to find evidence in support of $H_1$, but rather to assess the likelihood that we can reliably reject $H_0$.
Traditionally, $p$-values have been used to determine the likelihood of rejecting $H_0$. A $p$-value is the probability of observing a test statistic as extreme as the one observed, given that $H_0$ is true. However, $p$-values are not the only metric used to evaluate the likelihood of rejecting $H_0$. Other metrics, such as effect size and confidence intervals, are also used to interpret the results of hypothesis tests.
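As a toy example of these metrics, here is a minimal sketch of a two-sample $t$-test in base R on simulated data, along with a hand-computed standardized effect size. The group labels, means, and equal group sizes are assumptions made purely for illustration.

```r
# A minimal sketch of a hypothesis test on simulated data
set.seed(123)

scores <- data.frame(
  group = rep(c("A", "B"), each = 50),
  value = c(rnorm(50, mean = 5), rnorm(50, mean = 5.5))
)

test <- t.test(value ~ group, data = scores)
test$p.value   # probability of a statistic this extreme if H0 is true
test$conf.int  # confidence interval for the difference in means

# Cohen's d with a pooled standard deviation (the simple pooling below
# is valid here because the two groups are the same size)
means <- tapply(scores$value, scores$group, mean)
sds   <- tapply(scores$value, scores$group, sd)
d     <- (means["B"] - means["A"]) / sqrt(mean(sds^2))
d
```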
## Communicate {#sec-analysis-communicate}
<!-- Purpose -->
Conducting research should be enjoyable and personally rewarding, but the effort you have invested and the knowledge you have generated should be shared with others. Whether as part of a blog, presentation, or journal article, or simply for your own purposes, it is important to document your analysis results and process in a way that is informative and interpretable. This enhances the value of your work, allowing others to learn from your experience and build on your findings.
### Report {#sec-analysis-report}
<!-- Purpose -->