---
execute:
echo: true
---
::: {.content-visible when-format="pdf"}
```{=latex}
\setDOI{10.4324/9781003393764.8}
\thispagestyle{chapterfirstpage}
```
:::
# Explore {#sec-explore-chapter}
```{r}
#| label: setup-options
#| child: "../_common.qmd"
#| cache: false
```
:::{.callout}
**{{< fa regular list-alt >}} Outcomes**
- Determine the suitability of exploratory data analysis for a research project.
- Understand descriptive analysis and unsupervised learning methods, their strengths in pattern recognition and data summarization.
- Interpret insights from data summarization and pattern recognition, considering their potential to guide further research.
:::
In this chapter, we examine a wide range of strategies for exploratory data analysis. The chapter outlines two main branches of exploratory data analysis: descriptive analysis, which statistically and/or visually summarizes a dataset, and unsupervised learning, a machine learning approach that does not assume any particular relationship between variables in a dataset. Whether through descriptive or unsupervised learning methods, exploratory data analysis employs quantitative methods to summarize, reduce, and sort complex datasets, and to interrogate a dataset statistically and visually, in order to provide the researcher with novel perspectives to be qualitatively assessed.
::: {.callout}
**{{< fa terminal >}} Lessons**
**What**: Advanced objects\
**How**: In an R console, load {swirl}, run `swirl()`, and follow prompts to select the lesson.\
**Why**: To learn about advanced objects in R, including lists and matrices, and create, inspect, access, and manipulate these objects.
:::
## Orientation {#sec-explore-orientation}
The goal of exploratory data analysis is to discover, describe, and posit new hypotheses.\index{exploratory data analysis (EDA)} This analysis approach is best-suited for research questions where the literature is scarce, where the gap in knowledge is wide, or where new territories are being explored. The researcher may not know what to expect, but they are willing to let the data speak for itself. The researcher is open to new insights and new questions that may emerge from the analysis process.
While exploratory data analysis allows flexibility, it is essential to have a guiding research question\index{research question} that provides a focus for the analysis. This question will help to determine the variables of interest and the methods to be used. The research question will also help to determine the relevance of the results and the potential for the results to be used in further research.
The general workflow for exploratory data analysis is shown in @tbl-explore-workflow.
::: {#tbl-explore-workflow tbl-colwidths="[5, 15, 80]"}
| Step | Name | Description |
|:----:|:-----|:------------|
| 1 | Identify | Consider the research question and identify variables of potential interest to provide insight into our question. |
| 2 | Inspect | Check for missing data, outliers, *etc*. and check data distributions and transform if necessary. |
| 3 | Interrogate | Submit the selected variables to descriptive (frequency, keyword, co-occurrence analysis, *etc.*) or unsupervised learning (clustering, dimensionality reduction, vector space modeling, *etc.*) methods to provide quantitative measures to evaluate. |
| 4 | Interpret | Evaluate the results and determine if they are valid and meaningful to respond to the research question. |
| 5 | Iterate (Optional) | Repeat steps 1-4 as new questions emerge from your interpretation. |
Workflow for exploratory data analysis
:::
## Analysis {#sec-explore-analysis}
To frame our demonstration and discussion of exploratory data analysis, let's tackle a task. The task will be to identify relevant materials for an English language learner (ELL) textbook.\index{English as a second language (ESL)} This will involve multiple research questions and allow us to illustrate some very fundamental concepts that emerge across text analysis research in both descriptive and unsupervised learning approaches.
Since our task is geared towards English language use, we will want a representative data sample. For this, we will use the Manually Annotated Sub-Corpus of American English (MASC)\index{Manually Annotated Sub-Corpus of American English (MASC)} of the American National Corpus [@Ide2008]\index{American National Corpus (ANC)}.
The data dictionary for the dataset we will use as our point of departure is shown in @tbl-explore-masc-dd-show.
<!-- Show data dictionary -->
```{r}
#| label: tbl-explore-masc-dd-show
#| tbl-cap: "Data dictionary for the MASC dataset"
#| tbl-colwidths: [15, 15, 15, 55]
#| echo: false
# Read in the data dictionary
read_csv("data/masc_transformed_dd.csv") |>
filter(!variable %in% c("description", "domain")) |>
tt(width = 1)
```
<!-- Load the MASC dataset/ preview -->
First, I'll read in and preview the dataset in @exm-explore-masc-read.
::: {#exm-explore-masc-read}
```r
# Read the dataset
masc_tbl <-
read_csv("../data/masc/masc_transformed.csv")
# Preview the MASC dataset
glimpse(masc_tbl)
```
:::
\index{R packages!readr}\index{R packages!dplyr}
\cindex{read_csv()}\cindex{glimpse()}
<!--Add space to push output to next page-->
::: {.content-visible when-format="pdf"}
\vspace{1em}
:::
```{r}
#| label: explore-masc-read
#| echo: false
# Read and subset the MASC dataset
masc_tbl <-
read_csv("data/masc_transformed.csv") |>
select(-description, -domain)
# Preview the MASC dataset
glimpse(masc_tbl)
```
From the output in @exm-explore-masc-read, we get some sense of the structure of the dataset. However, we also need to perform diagnostic and descriptive procedures.\index{descriptive statistics} This will include checking for missing data and anomalies and assessing central tendency, dispersion, and/or distributions of the variables. This may include using {skimr}, {dplyr}, {stringr}, {ggplot2}, *etc.* to identify the most relevant variables for our task and to identify any potential issues with the dataset.
```{r}
#| label: exm-explore-masc-doc-id
#| include: false
# Change doc_id to character
masc_tbl <-
masc_tbl |>
mutate(doc_id = as.character(doc_id))
masc_org_obs <- nrow(masc_tbl)
masc_org_vars <- ncol(masc_tbl)
```
```{r}
#| label: exm-explore-masc-diagnostic
#| include: false
# Load package
library(skimr)
# Generate summary of the MASC dataset
masc_tbl |>
skim()
masc_tbl |>
filter(is.na(lemma))
# insert term for missing lemma (if any)
masc_tbl <-
masc_tbl |>
mutate(lemma =
case_when(
is.na(lemma) ~ term,
TRUE ~ lemma
)
)
# Remove missing lemma
masc_tbl <-
masc_tbl |>
filter(lemma != "na")
masc_tbl |>
skim()
# masc_tbl |>
# mutate(term_len = nchar(term)) |>
# ggplot(aes(x = term_len)) +
# geom_histogram(binwidth = 1)
# line divisions, URLs, email addresses, and other non-words
library(stringr)
masc_tbl <-
masc_tbl |>
mutate(term_len = nchar(term)) |>
filter(!str_detect(term, "\\.(com|edu|net|org)")) |>
filter(!str_detect(term, "\\w[@]\\w")) |>
filter(!(term_len > 3 & pos == "PUNCT")) |>
filter(!str_detect(term, "^[=0-9\\.]{12,}")) |>
filter(!str_detect(term, "(http|https|www)")) |>
filter(!(term_len > 18 & str_detect(term, "\\w-\\w"))) |>
filter(term_len < 25) |>
arrange(doc_id, term_num) |>
select(-term_len)
# Filter out cardinal numbers, foreign words, list markers, symbols, and punctuation
masc_tbl <-
masc_tbl |>
filter(!(pos %in% c("CD", "FW", "LS", "SYM", "PUNCT")))
masc_new_obs <- nrow(masc_tbl)
masc_new_vars <- ncol(masc_tbl)
masc_skm_obj <-
masc_tbl |>
skim() |>
as_tibble() |>
janitor::clean_names()
doc_id <- filter(masc_skm_obj, skim_variable == "doc_id")
```
After a descriptive and diagnostic assessment of the dataset\index{descriptive assessment}, not included here, I identified and addressed missing data\index{missing data} and anomalies (including many non-words).\index{text normalization} I also recoded the `doc_id` variable to a character variable.\index{variable recoding} The dataset now has `r format(masc_new_obs, big.mark = ",")` observations, a reduction from the original `r format(masc_org_obs, big.mark = ",")` observations. There are 392 documents, 2 modalities, 18 genres, almost 38k unique terms\index{terms} (which are words), almost 26k lemmas\index{lemmas}, and 34 distinct POS tags.
### Descriptive analysis {#sec-explore-descriptive}
Descriptive analysis\index{descriptive methods} includes common techniques such as frequency analysis\index{frequency analysis} to determine the most frequent words or phrases, dispersion analysis\index{dispersion analysis} to see how terms are distributed throughout a document or corpus, keyword analysis\index{keyword analysis} to identify distinctive terms, and/or co-occurrence analysis\index{co-occurrence analysis} to see what terms tend to appear together.
Using the MASC dataset, we will entertain questions such as:
- What are the most common terms a beginning ELL should learn?
- Are there term differences between spoken and written discourses that should be emphasized?
- What are some of the most common verb particle constructions?
Along the way, we will discuss frequency, dispersion, and co-occurrence measures. In addition, we will apply various descriptive analysis techniques and visualizations to explore the dataset and identify new questions and new variables of interest.
#### Frequency analysis {#sec-explore-frequency}
<!-- 4 I's: identify, inspect, interrogate, interpret -->
At its core, frequency analysis\index{frequency analysis} is a descriptive method that counts the number of times a linguistic unit\index{terms} occurs in a dataset. The results of frequency analysis can be used to describe the dataset and to identify terms that are linguistically distinctive or distinctive to a particular group or sub-group in the dataset.
<!--- Raw frequency (counting) --->
##### Raw frequency {#sec-explore-frequency-raw}
\vspace{-1em}
Let's consider what the most common words in the MASC dataset are as a starting point to making inroads on our task by identifying relevant vocabulary for an ELL textbook.
In the `masc_tbl` data frame we have the linguistic unit `term`\index{terms} which corresponds to the word-level annotation of the MASC. The `lemma` corresponds to the base form of each term. For words with inflectional morphology\index{morphological features}, the lemma is the word sans the inflection (*e.g.* is/be, are/be). For other words, the `term` and the `lemma` will be the same (*e.g.* the/the, in/in). These two variables pose a choice point for us: do we consider words to be the actual forms or the base forms? There is an argument to be made for both. In this case, I will operationalize\index{operationalize} our linguistic unit as the `lemma` variable, as this will allow us to group words with distinct inflectional morphology together.
To perform a basic word frequency analysis, we can apply `summarize()` in combination with `n()` or the convenient `count()` function from {dplyr}. Our sorted lemma counts appear in @exm-explore-masc-count.
::: {#exm-explore-masc-count}
```{r}
#| label: explore-masc-count
# Lemma count, sorted in descending order
masc_tbl |>
count(lemma, sort = TRUE)
```
\index{R packages!dplyr}
\cindex{count()}
:::
The output of this raw frequency\index{raw frequency} tabulation in @exm-explore-masc-count is a data frame with two columns: `lemma` and `n`.
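To make the relationship between the two approaches concrete, here is a minimal sketch on a toy data frame. The `toy_tbl` object and its values are hypothetical stand-ins for `masc_tbl`; the point is only that `count()` bundles the grouping, counting, and sorting steps.

```r
library(dplyr)

# Toy data (hypothetical): one row per token, as in masc_tbl
toy_tbl <- tibble(lemma = c("be", "the", "be", "dog", "the", "be"))

# Long form: group, count rows per group, then sort
counts_long <-
  toy_tbl |>
  group_by(lemma) |>
  summarize(n = n()) |>
  arrange(desc(n))

# Short form: count() does the grouping, counting, and sorting
counts_short <-
  toy_tbl |>
  count(lemma, sort = TRUE)
```

Both objects hold the same lemma-by-count table, so the choice between them is a matter of convenience.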
As we discussed in @sec-analysis-distributions, the frequency of linguistic units in a corpus tends to follow a highly right-skewed distribution\index{skewed distribution}, approximating the Zipf distribution.\index{Zipf distribution} If we calculate the **cumulative frequency**\index{cumulative frequency}, a rolling sum of the frequency term by term, of the lemmas in the `masc_tbl` data frame, we can see that the top 10 types account for around 25% of the lemma tokens in the entire corpus. By 100 types that share increases to nearly 50%, and by 1,000 types to around 75%, as seen in @fig-explore-masc-count-cumulative.
```{r}
#| label: fig-explore-masc-count-cumulative
#| fig-cap: "Cumulative frequency of lemmas"
#| fig-alt: "A line plot showing the cumulative frequency of lemmas in the MASC dataset. Vertical lines at 10, 100, and 1000 types show that the top 10 lemmas account for 25% of the corpus, 100 lemmas account for 50% of the corpus, and 1000 lemmas account for 75% of the corpus."
#| fig-width: 6
#| fig-asp: 0.30
#| echo: false
# Calculate cumulative frequency
lemma_cumul_freq <-
masc_tbl |>
count(lemma) |>
arrange(desc(n)) |>
mutate(cumulative = cumsum(n)) |>
mutate(percent = cumulative / sum(n))
p <-
lemma_cumul_freq |>
slice_head(n = 1500) |>
ggplot(aes(x = reorder(lemma, desc(n)), y = percent, group = 1)) +
geom_path(color = "black") +
scale_y_continuous(labels = scales::percent, limits = c(0, 1), expand = c(0, 0)) +
scale_x_discrete(breaks = seq(0, 1000, 100), labels = seq(0, 1000, 100), expand = c(0, 0)) +
labs(x = "Types", y = "Cumulative frequency (%)") +
theme(axis.text = element_text(size = 10),
panel.grid.major = element_line(color = "#cccccc")) +
theme_qtalr(font_size = 10)
p +
geom_vline(xintercept = 10, linetype = "dashed", color = "grey40") +
annotate("text", x = 105, y = 0.20, label = "10 lemmas", size = 3, angle = 0) +
geom_vline(xintercept = 100, linetype = "dashed", color = "grey40") +
annotate("text", x = 210, y = 0.45, label = "100 lemmas", size = 3, angle = 0) +
geom_vline(xintercept = 1000, linetype = "dashed", color = "grey40") +
annotate("text", x = 1120, y = 0.65, label = "1000 lemmas", size = 3, angle = 0)
```
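The cumulative frequency calculation can be sketched on a handful of made-up counts. The `toy_counts` values below are hypothetical and chosen only so the arithmetic is easy to follow; the same `cumsum()` pattern applies to the real lemma counts.

```r
library(dplyr)

# Toy lemma counts (hypothetical values)
toy_counts <- tibble(
  lemma = c("the", "be", "of", "dog", "run"),
  n = c(50, 25, 15, 6, 4)
)

toy_cumul <-
  toy_counts |>
  arrange(desc(n)) |>
  mutate(
    cumulative = cumsum(n),        # rolling sum of counts
    percent = cumulative / sum(n)  # share of all tokens covered so far
  )
```

In this toy example the single most frequent lemma already covers half of the tokens, which mirrors, in exaggerated form, the skew we see in the corpus.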
If we look at the 50 most frequent types, we can also appreciate another property of language use. Let's list the top 50 types in @tbl-explore-masc-count-top-50.
```{r}
#| label: tbl-explore-masc-count-top-50
#| tbl-cap: "Top 50 lemma types"
#| tbl-colwidths: [20, 20, 20, 20, 20]
#| echo: false
# Top 50 types
lemma_cumul_freq |>
slice_head(n = 50) |>
pull(lemma) |>
matrix(ncol = 5) |>
kable(col.names = NULL)
```
For the most part, the most frequent words are function words. **Function words**\index{function words} are a closed class of relatively few words that are used to express grammatical relationships between content words (*e.g.* determiners, prepositions, pronouns, and auxiliary verbs). Given the importance of these words, it then is no surprise that they comprise many of the most frequent words in a corpus.
Another key observation is that those content words\index{content words} (*e.g.* nouns, verbs, adjectives, adverbs) that do figure among the most frequent words are quite generic semantically speaking. That is, they are words that are used in a wide range of contexts and take a wide range of meanings. Take for example the adjective 'good'. It can be used to describe a wide range of nouns, such as 'good food', 'good people', 'good times', *etc*. In some contexts, such as 'good student', the word 'studious' is a near-synonym of 'good'. Yet 'studious' is not as frequent as 'good', as it is used to describe a narrower range of nouns, such as 'studious student', 'studious scholar', 'studious researcher', *etc*. In this way, 'studious' is more semantically specific than 'good'.
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
Based on what you now know about the expected distribution of words in a corpus, what if you were asked to predict what the most frequent English word used is in each U.S. State? What would you predict? How confident would you be in your prediction? What if you were asked to predict what the most frequent word used is in the language of a given country? What would you want to know before making your prediction?
:::
Because they are so common across corpus samples, in some analyses these function words (and sometimes generic content words) are considered irrelevant and are filtered out. In our ELL materials task, however, we might exclude them for a simpler reason: given their overall frequency, it is a given that we will teach these words. Let's aim to focus solely on the content words in the dataset.
One approach to filtering out these words is to use a list of words to exclude, known as a **stopwords**\index{stopwords} lexicon. {tidytext} includes a data frame `stop_words` which includes stopword lexicons for English. We can select a lexicon from `stop_words` and use `anti_join()` to filter out the words that appear in the `word` variable from the `lemma` variable in the `masc_tbl` data frame.\index{R packages!tidytext}\index{joining datasets!anti join}
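As a rough sketch of this approach, the example below uses a small hand-made stopword table in place of the real lexicon. Note that `toy_stop_words` and `toy_tbl` are hypothetical; `toy_stop_words` only borrows the column names `word` and `lexicon` from `stop_words`, and the lexicon name used in the filter is an illustrative choice.

```r
library(dplyr)

# Hypothetical stand-in for tidytext's `stop_words`, using the same
# column names (`word`, `lexicon`) but only a handful of rows
toy_stop_words <- tibble(
  word = c("the", "be", "of", "the"),
  lexicon = c("snowball", "snowball", "snowball", "SMART")
)

# Toy token table (hypothetical)
toy_tbl <- tibble(lemma = c("the", "linguist", "be", "curious", "of"))

# Pick one lexicon, then drop every row whose lemma matches a stopword
toy_content <-
  toy_tbl |>
  anti_join(
    toy_stop_words |> filter(lexicon == "snowball"),
    by = c("lemma" = "word")
  )
```

Filtering to a single lexicon before the join keeps the choice of stopword list explicit, and the `by` argument handles the fact that the two data frames name the matching column differently.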
Eliminating words in this fashion, however, may not always be the best approach. Available lists of stopwords vary in their contents and are determined by other researchers for other potential uses. We may instead opt to create our own stopword list that is tailored to the task, or we may opt to use a statistical approach based on their distribution in the dataset using frequency and/or dispersion measures.
For our case, however, we have another available strategy. Since our task is to identify relevant vocabulary, beyond the fundamental function words in English, we can use the POS tags to reduce our dataset to just the content words, that is nouns, verbs, adjectives, and adverbs.\index{part-of-speech tagging} Using the Penn tagset as reference, we can create a vector with the POS tags we want to retain and then use the `filter()` function on the dataset. I will assign this new data frame to `masc_content_tbl` to keep it separate from our main data frame `masc_tbl`, seen in @exm-explore-masc-filter-pos.
::: {#exm-explore-masc-filter-pos}
```{r}
#| label: explore--masc-filter-pos
# Penn Tagset for content words
# Nouns: NN, NNS
# Verbs: VB, VBD, VBG, VBN, VBP, VBZ
# Adjectives: JJ, JJR, JJS
# Adverbs: RB, RBR, RBS
content_pos <- c("NN", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJ", "JJR", "JJS", "RB", "RBR", "RBS")
# Select content words
masc_content_tbl <-
masc_tbl |>
filter(pos %in% content_pos)
```
\index{R packages!dplyr}
\cindex{filter()}
:::
Let's now preview the top 50 lemmas in the `masc_content_tbl` data frame to see how the most frequent words have changed in @tbl-explore-masc-filter-pos.
```{r}
#| label: tbl-explore-masc-filter-pos
#| tbl-cap: "Frequency of tokens after filtering out lemmas with POS tags that are not content words"
#| tbl-colwidths: [20, 20, 20, 20, 20]
#| echo: false
# Preview top 50
masc_content_tbl |>
count(lemma, sort = TRUE) |>
slice_head(n = 50) |>
pull(lemma) |>
matrix(ncol = 5) |>
kable(col.names = NULL)
```
The resulting list in @tbl-explore-masc-filter-pos paints a different picture of the most frequent words in the dataset. The most frequent words are now content words, and they include more semantically specific words. By focusing on the content words, we have reduced the number of observations by `r label_percent()(1-nrow(masc_content_tbl) / nrow(masc_tbl))`. We are getting closer to identifying the vocabulary that we want to include in our ELL materials, but we will need some more tools to help us identify the most relevant vocabulary.
\pagebreak
##### Dispersion {#sec-explore-frequency-dispersion}
\vspace{-1em}
**Dispersion** is a measure of how evenly distributed a linguistic unit is across a dataset.\index{dispersion} This is a key concept in text analysis, as important as frequency. It is important to recognize that frequency and dispersion are measures of different characteristics. We can have two words that occur with the same frequency, but one word may be more evenly distributed across a dataset than the other. Depending on the researcher's aims, this may be an important distinction to make. For our task, it is likely the case that we want to capture words that are well-dispersed across the dataset, as words that have a high frequency and a low dispersion tend to be connected to a particular context, whether that be a particular genre, a particular speaker, a particular topic, *etc*. In other research, the aim may be the reverse: to identify words that are highly frequent and highly concentrated in a particular context, and therefore distinctive to that context.
There are a variety of measures that can be used to estimate the distribution of types across a corpus. Let's focus on three measures: document frequency ($df$), inverse document frequency ($idf$), and Gries' Deviation of Proportions ($dp$).
The most basic measure is **document frequency** ($df$).\index{document frequency} This is the number of documents in which a type appears at least once. For example, if a type appears in 10 documents, then the document frequency is 10. This is a very basic measure, but it is a decent starting point.
A nuanced version of document frequency is **inverse document frequency** ($idf$).\index{inverse document frequency} This measure takes the total number of documents and divides it by the document frequency. This results in a measure that is inversely proportional to the document frequency. That is, the higher the document frequency, the lower the inverse document frequency. This measure is often log-transformed\index{log transformation} to spread out the values.
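A small sketch may help fix the two measures described so far. The `toy_tbl` corpus below is hypothetical, and the natural logarithm is an assumption on my part; other log bases are also used in practice.

```r
library(dplyr)

# Toy corpus (hypothetical): one row per token, three documents
toy_tbl <- tibble(
  doc_id = c("1", "1", "1", "2", "2", "3", "3", "3"),
  lemma  = c("dog", "run", "dog", "dog", "cat", "dog", "cat", "sleep")
)

# Total number of documents in the toy corpus
n_docs <- n_distinct(toy_tbl$doc_id)

toy_disp <-
  toy_tbl |>
  distinct(doc_id, lemma) |>      # one row per document-lemma pair
  count(lemma, name = "df") |>    # documents containing each lemma
  mutate(idf = log(n_docs / df))  # log-transformed inverse document frequency
```

A lemma that appears in every document, such as 'dog' here, ends up with an $idf$ of zero, while lemmas confined to a single document receive the highest $idf$.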
One thing to consider about $df$ and $idf$ is that neither takes into account the length of the documents in which the type appears nor the spread of each type within each document. To take these factors into account, we can use Gries' **deviation of proportions** ($dp$) measure [@Gries2023, pp. 87-88].\index{Gries' deviation of proportions} The $dp$ measure considers the proportion of a type's frequency in each document relative to its total frequency. This produces a measure that is more sensitive to the distribution of types within and across documents in a corpus.
\pagebreak
Let's consider how these measures differ with three scenarios:
1. Scenario A: A type with a token frequency of 100 appears in each of the 10 documents in a corpus. Each document is 100 words long, and the type appears 10 times in each document.
2. Scenario B: The same type with a token frequency of 100 appears in each of the 10 documents, each 100 words long. However, in this scenario, the type appears once in 9 documents and 91 times in 1 document.
3. Scenario C: Nine of the documents constitute 99% of the corpus. The type appears once in each of these 9 documents and 91 times in the 10th document.
In these scenarios, Scenario A is the most dispersed, Scenario B is less dispersed, and Scenario C is the least dispersed. Despite these differences, the type's document frequency ($df$) and inverse document frequency ($idf$) scores remain the same across all three scenarios. However, the dispersion ($dp$) score will accurately reflect the increasing concentration of the type's dispersion from Scenario A to Scenario B to Scenario C.
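The three scenarios can be worked through by hand. The sketch below assumes the commonly cited formulation of $dp$, half the sum of the absolute differences between the observed and expected proportions per document; `calc_dp()` is a hypothetical helper written for this illustration, not the {qtkit} implementation, which may differ (for example, by normalizing the score). The document sizes for Scenario C are also assumed: nine documents of 1,100 words each and one of 100 words.

```r
# Deviation of proportions for one type (hypothetical helper):
# half the summed absolute differences between the share of the type's
# tokens found in each document and the share of the corpus that each
# document represents
calc_dp <- function(type_freqs, doc_sizes) {
  observed <- type_freqs / sum(type_freqs)
  expected <- doc_sizes / sum(doc_sizes)
  sum(abs(observed - expected)) / 2
}

even_sizes   <- rep(100, 10)          # Scenarios A and B: equal-length documents
uneven_sizes <- c(rep(1100, 9), 100)  # Scenario C (assumed sizes): nine documents = 99%

dp_a <- calc_dp(rep(10, 10), even_sizes)
dp_b <- calc_dp(c(rep(1, 9), 91), even_sizes)
dp_c <- calc_dp(c(rep(1, 9), 91), uneven_sizes)

round(c(A = dp_a, B = dp_b, C = dp_c), 2)
```

Under these assumptions the scores come out to 0 for Scenario A, 0.81 for Scenario B, and 0.90 for Scenario C, with values closer to 1 indicating greater concentration. In all three cases the type appears in all 10 documents, so $df$ is 10 throughout.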
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
You may wonder why we would want to use $df$ or $idf$ at all. The answer is some combination of the fact that they are computationally less expensive to calculate, they are widely used (especially $idf$), and/or in many practical situations they are often highly correlated with $dp$.
:::
So for our task we will use $dp$ as our measure of dispersion.\index{dispersion} {qtkit} includes the `calc_type_metrics()` function which calculates, among other metrics, the dispersion metrics $df$, $idf$, and/or $dp$. Let's select `dp` and assign the result to `masc_lemma_disp`, as seen in @exm-explore-masc-dp.
::: {#exm-explore-masc-dp}
```{r}
#| label: explore-masc-dp
# Load package
library(qtkit)
# Calculate deviation of proportions (DP)
masc_lemma_disp <-
masc_content_tbl |>
calc_type_metrics(
type = lemma,
documents = doc_id,
dispersion = "dp"
) |>
arrange(dp)
# Preview
masc_lemma_disp |>
slice_head(n = 10)
```
\index{R packages!qtkit}\index{R packages!dplyr}
\cindex{calc_type_metrics()}\cindex{arrange()}\cindex{slice_head()}
:::
We would like to identify lemmas that are frequent and well-dispersed. But an important question arises: what thresholds for frequency and dispersion should we use to identify the lemmas that we want to include in our ELL materials?
There are statistical approaches to identifying natural breakpoints but a visual inspection is often good enough for practical purposes. Let's create a density plot to see if there is a natural break in the distribution of our dispersion measure, as seen in @fig-explore-masc-dp-density.
::: {#exm-explore-masc-dp-density}
```r
# Density plot of dp
masc_lemma_disp |>
ggplot(aes(x = dp)) +
geom_density() +
scale_x_continuous(breaks = seq(0, 1, .1)) +
labs(x = "Deviation of Proportions", y = "Density")
```
:::
```{r}
#| label: fig-explore-masc-dp-density
#| fig-cap: "Density plot of lemma dispersion"
#| fig-alt: "The plot shows a bend in the distribution between 0.85 and 0.97."
#| fig-width: 6
#| fig-asp: 0.30
#| echo: false
# Density plot of dp
masc_lemma_disp |>
ggplot(aes(x = dp)) +
geom_density() +
scale_x_continuous(breaks = seq(0, 1, .1)) +
labs(x = "Deviation of Proportions", y = "Density") +
theme_qtalr(font_size = 10)
```
\index{R packages!ggplot2}
\cindex{ggplot()}\cindex{geom_density()}\cindex{scale_x_continuous()}\cindex{labs()}\cindex{aes()}\cindex{seq()}
What we are looking for is a distinctive bend in the distribution of dispersion measures. In @fig-explore-masc-dp-density, we can see one roughly between $0.87$ and $0.97$. The inflection point appears to be near $0.95$. This bend is called an elbow, and using this bend to make informed decisions about thresholds is called the **elbow method**\index{elbow method}.
In @exm-explore-masc-dp-filter, I keep lemmas that have a dispersion measure of $0.95$ or less, filtering out the more concentrated lemmas above this threshold.
\pagebreak
::: {#exm-explore-masc-dp-filter}
```{r}
#| label: explore--masc-dp-filter
# Filter for lemmas with dp <= 0.95
masc_lemma_disp_thr <-
masc_lemma_disp |>
filter(dp <= 0.95) |>
arrange(desc(n))
```
\index{R packages!dplyr}
\cindex{filter()}\cindex{arrange()}
:::
Then in [Tables @tbl-explore-masc-dp-filter-preview-top] and [-@tbl-explore-masc-dp-filter-preview-bottom], I preview the top and bottom 25 lemmas in the dataset.
```{r}
#| label: tbl-explore-masc-dp-filter-preview-top
#| tbl-cap: "Top 25 lemmas after our dispersion threshold"
#| tbl-colwidths: [20, 20, 20, 20, 20]
#| echo: false
# Preview top
masc_lemma_disp_thr |>
slice_head(n = 25) |>
pull(type) |>
matrix(ncol = 5) |>
kable(col.names = NULL)
```
```{r}
#| label: tbl-explore-masc-dp-filter-preview-bottom
#| tbl-cap: "Bottom 25 lemmas after our dispersion threshold"
#| tbl-colwidths: [20, 20, 20, 20, 20]
#| echo: false
# Preview bottom
masc_lemma_disp_thr |>
filter(type != "fu") |> # Remove 'fu' lemma
slice_tail(n = 25) |>
pull(type) |>
matrix(ncol = 5) |>
kable(col.names = NULL)
```
We now have a solid candidate list of common vocabulary that is spread well across the corpus.
\pagebreak
##### Relative frequency {#sec-explore-frequency-relative}
\vspace{-1em}
Gauging frequency and dispersion across the entire corpus sets the foundation for any frequency analysis, but it is often the case that we want to compare the frequency and/or dispersion of linguistic units across corpora or sub-corpora.\index{relative frequency}
In the case of the MASC dataset, for example, we may want to compare metrics across the two modalities or the various genres. Simply comparing raw frequency\index{raw frequency} counts across these sub-corpora is not a good approach, and can be misleading, as the sub-corpora will likely vary in size. For example, if one sub-corpus is twice as large as another sub-corpus, then, all else being equal, the frequency counts will be twice as large in the larger sub-corpus. This is why we use relative frequency measures, which are normalized by the size of the sub-corpus.
::: {.callout .halfsize}
**{{< fa regular lightbulb >}} Consider this**
A variable in the MASC dataset that has yet to be used is the `pos` variable. How could we use this POS variable to refine our frequency and dispersion analysis of lemma types?
Hint: consider lemma forms that may be tagged with different parts of speech.
:::
To normalize the frequency of linguistic units across sub-corpora, we can use the **relative frequency**\index{relative frequency} ($rf$) measure. This is the frequency of a linguistic unit divided by the total number of linguistic units in the sub-corpus. This bakes in the size of the sub-corpus into the measure. The notion of relative frequency is key to all research working with text, as it is the basis for the statistical approach to text analysis where comparisons are made.
There are some field-specific terms that are used to refer to relative frequency measures. For example, in NLP\index{Natural Language Processing (NLP)} literature, the relative frequency measure is often referred to as the **term frequency**\index{term frequency} ($tf$). In corpus linguistics, the relative frequency measure is often modified slightly to include a constant (*e.g.* $rf * 100$) which is known as the **observed relative frequency**\index{observed relative frequency} ($orf$). Although the observed relative frequency per number of tokens is not strictly necessary, it is often used to make the values more interpretable as we can now talk about an observed relative frequency of 1.5 as a linguistic unit that occurs 1.5 times per 100 linguistic units.
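To make the arithmetic concrete, here is a minimal sketch using invented counts rather than MASC data. The data frame `counts` and its values are hypothetical; only the two formulas come from the discussion above.

```r
# Hypothetical counts for one type in two sub-corpora of different sizes
counts <- data.frame(
  modality = c("Spoken", "Written"),
  n        = c(50, 100),       # raw frequency of the type
  total    = c(10000, 40000)   # total tokens in each sub-corpus
)

# Relative frequency: raw count divided by sub-corpus size
counts$rf <- counts$n / counts$total

# Observed relative frequency: occurrences per 100 tokens
counts$orf <- counts$rf * 100

counts
```

In this toy case the raw count is higher in the written sub-corpus, yet the relative frequency is higher in the spoken one, which is exactly the kind of reversal that raw counts hide.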
Let's consider how we might compare the frequency\index{frequency analysis} and dispersion\index{dispersion analysis} of lemmas\index{lemmas} across the two modalities in the MASC dataset, spoken and written. To make this a bit more interesting and more relevant, let's add the `pos` variable to our analysis. The intent, then, will be to identify lemmas tagged with particular parts of speech that are particularly indicative of each modality.
We can do this by collapsing the `lemma` and `pos` variables into a single variable, `lemma_pos`, with the `str_c()` function, as seen in @exm-explore-masc-type.
::: {#exm-explore-masc-type}
```{r}
#| label: explore-masc-type
# Collapse lemma and pos into type
masc_content_tbl <-
masc_content_tbl |>
mutate(lemma_pos = str_c(lemma, pos, sep = "_"))
# Preview
masc_content_tbl |>
slice_head(n = 5)
```
\index{R packages!stringr}\index{R packages!dplyr}
\cindex{mutate()}\cindex{str_c()}\cindex{slice_head()}
:::
Now this will increase the number of lemma types in the dataset as we are now considering lemmas where the same lemma form is tagged with different parts of speech.
Getting back to calculating the frequency and dispersion of lemmas in each modality, we can use the `calc_type_metrics()` function with `lemma_pos` as our type argument. We will, however, need to apply this function to each sub-corpus independently and then concatenate the two data frames\index{concatenating datasets}. This function returns a (raw) frequency ($n$) measure by default, but we can specify the `frequency` argument to `rf` to calculate the relative frequency of the linguistic units as in @exm-explore-masc-metrics-modality.
::: {#exm-explore-masc-metrics-modality}
```{r}
#| label: explore--masc-metrics-modality
# Calculate relative frequency
# Spoken
masc_spoken_metrics <-
masc_content_tbl |>
filter(modality == "Spoken") |>
calc_type_metrics(
type = lemma_pos,
documents = doc_id,
frequency = "rf",
dispersion = "dp"
) |>
mutate(modality = "Spoken") |>
arrange(desc(n))
# Written
masc_written_metrics <-
masc_content_tbl |>
filter(modality == "Written") |>
calc_type_metrics(
type = lemma_pos,
documents = doc_id,
frequency = "rf",
dispersion = "dp"
) |>
mutate(modality = "Written") |>
arrange(desc(n))
# Concatenate metrics
masc_metrics <-
bind_rows(masc_spoken_metrics, masc_written_metrics)
# Preview
masc_metrics |>
slice_head(n = 5)
```
\index{R packages!dplyr}\index{R packages!qtkit}
\cindex{calc_type_metrics()}\cindex{mutate()}\cindex{arrange()}\cindex{slice_head()}\cindex{bind_rows()}
:::
With the `rf` measure, we are now in a position to compare 'apples to apples', as you might say. We can now compare the relative frequency of lemmas across the two modalities. Let's preview the top 10 lemmas in each modality, as seen in @exm-explore-masc-relative-frequency-top.
::: {#exm-explore-masc-relative-frequency-top}
```r
# Preview top 10 lemmas in each modality
masc_metrics |>
group_by(modality) |>
slice_max(n = 10, order_by = rf)
```
::: {.content-visible when-format="pdf"}
\vspace{5em}
:::
```{r}
#| label: explore--masc-relative-frequency-top
#| echo: false
# Preview top 10 lemmas in each modality
masc_metrics |>
group_by(modality) |>
slice_max(n = 10, order_by = rf)
```
\index{R packages!dplyr}
\cindex{group_by()}\cindex{slice_max()}
:::
We can appreciate, now, that there are similarities and a few differences between the most frequent lemmas for each modality. First, there are similar lemmas in written and spoken modalities, such as 'be', 'have', and 'not'. Second, the top 10 include verbs and adverbs. Now we are looking at the most frequent types, so it is not surprising that we see more in common than not. However, looking closely we can see that contracted forms are more frequent in the spoken modality, such as 'isn't', 'don't', and 'can't', and that the ordering of the verb tenses differs to some degree. Whether these are important distinctions for our task is something we will need to consider.
We can further cull our results by filtering out lemmas that are not well-dispersed across the sub-corpora. Although it may be tempting to use the threshold we used earlier, we should consider that the sizes of the sub-corpora are different and the distribution of the dispersion measure may be different. With this in mind, we need to visualize the distribution of the dispersion measure for each modality and apply the elbow method\index{elbow method} to identify a threshold for each modality.
After assessing the density plots for the dispersion of each modality via the elbow method, we update our thresholds. We maintain the $0.95$ threshold for the written sub-corpus and use a $0.79$ threshold for the spoken sub-corpus. I apply these filters as seen in @exm-explore-masc-subcorpora-filtered.
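The density plots themselves are not reproduced here. As a rough sketch of how such a comparison might be drawn, the following uses base R graphics on simulated values; the vectors `dp_spoken` and `dp_written` are hypothetical stand-ins, not the MASC dispersion scores, and the chapter's own plots may have been produced differently.

```r
set.seed(1)  # simulated values only, for illustration

# Hypothetical dispersion (dp) scores, bounded between 0 and 1
dp_spoken  <- rbeta(500, shape1 = 5, shape2 = 2)
dp_written <- rbeta(500, shape1 = 8, shape2 = 1.5)

# Kernel density estimates, restricted to the valid range of dp
dens_spoken  <- density(dp_spoken,  from = 0, to = 1)
dens_written <- density(dp_written, from = 0, to = 1)

# Overlay the two curves and look for the bend (the 'elbow') in each
plot(dens_written, main = "Simulated dp by modality", xlab = "dp")
lines(dens_spoken, lty = 2)
legend("topleft", legend = c("Written", "Spoken"), lty = c(1, 2))
```

Because the two curves can bend at different points, reading a separate threshold from each is more defensible than reusing a single cutoff.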
\pagebreak
::: {#exm-explore-masc-subcorpora-filtered}
```{r}
#| label: explore--masc-subcorpora-filtered
# Filter for lemmas with
# dp <= 0.95 for written and
# dp <= .79 for spoken
masc_metrics_thr <-
masc_metrics |>
filter(
(modality == "Written" & dp <= 0.95) |
(modality == "Spoken" & dp <= .79)
) |>
arrange(desc(rf))
```
\index{R packages!dplyr}
\cindex{filter()}\cindex{arrange()}
:::
Filtering the less-dispersed types reduces the dataset from `r format(nrow(masc_metrics), big.mark = ",")` to `r format(nrow(masc_metrics_thr), big.mark = ",")` observations. This will provide us with a more succinct list of common and well-dispersed lemmas that are used in each modality.
As much as the frequency and dispersion measures can provide us with a starting point, they do not tell us which types are more indicative of a particular sub-corpus, the modality sub-corpora in our case. We can do this by calculating the log odds ratio of each lemma in each modality.
The **log odds ratio**\index{log odds ratio} is a measure that quantifies the difference between the frequencies of a type in two corpora or sub-corpora. In spirit and in name, it compares the odds of a type occurring in one corpus versus the other. The values range from negative to positive infinity, with negative values indicating that the type is more frequent in the first corpus and positive values indicating that the type is more frequent in the second corpus. The magnitude of the value indicates the strength of the association.
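A small worked example may help fix the idea. The counts below are invented, and the orientation of the ratio (second corpus over first) is chosen only to match the sign convention described above; other sources and packages may orient it the other way.

```r
# Hypothetical counts of one type in two sub-corpora
n_first  <- 30;  total_first  <- 1000   # e.g., spoken
n_second <- 20;  total_second <- 4000   # e.g., written

# Odds of the type in each sub-corpus: occurrences vs. everything else
odds_first  <- n_first  / (total_first  - n_first)
odds_second <- n_second / (total_second - n_second)

# Log odds ratio, oriented so that negative values favor the first corpus
log_odds_ratio <- log(odds_second / odds_first)
log_odds_ratio
```

The result is negative, so under this convention the type is more characteristic of the first (spoken) sub-corpus, even though the two raw counts are close.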
{tidylo} provides a convenient function `bind_log_odds()` to calculate the log odds ratio, and a weighted variant, for each type in each sub-corpus. The weighted log odds ratio measure provides a more robust and interpretable measure for comparing term frequencies across corpora, especially when term frequencies are low or when corpora are of different sizes. The weighting (or standardization\index{standardization}) also makes it easier to identify terms that are particularly distinctive or characteristic of one corpus over another.
Let's calculate the weighted log odds ratio for each lemma in each modality and preview the top 10 lemmas in each modality, as seen in @exm-explore-masc-log-odds-weighted.
\pagebreak
::: {#exm-explore-masc-log-odds-weighted}
```{r}
# Load package
library(tidylo)
# Calculate log odds ratio
masc_metrics_thr <-
masc_metrics_thr |>
bind_log_odds(
set = modality,
feature = type,
n = n
)
# Preview top 10 lemmas in each modality
masc_metrics_thr |>
group_by(modality) |>
slice_max(n = 10, order_by = log_odds_weighted)
```
\index{R packages!dplyr}\index{R packages!tidylo}
\cindex{bind_log_odds()}\cindex{group_by()}\cindex{slice_max()}
:::
Let's imagine we would like to extract the most indicative verbs for each modality using the weighted log odds as our measure. We can do this with a little regex\index{regular expression (regex)} magic. Let's use the `str_detect()` function inside `filter()` to keep lemmas that contain `_V` and then use `slice_max()` to extract the top 10 most indicative verb lemmas, as seen in @exm-explore-masc-log-odds-weighted-verbs.
::: {#exm-explore-masc-log-odds-weighted-verbs}
```{r}
#| label: explore--masc-log-odds-weighted-verbs
# Preview (ordered by log_odds_weighted)
masc_metrics_thr |>
group_by(modality) |>
filter(str_detect(type, "_V")) |>
slice_max(n = 10, order_by = log_odds_weighted) |>
select(-n)
```
\index{R packages!dplyr}\index{R packages!stringr}
\cindex{group_by()}\cindex{filter()}\cindex{slice_max()}\cindex{select()}
:::
Note that the log odds are larger for the spoken modality than the written modality. This indicates that these types are more strongly indicative of the spoken modality than the types in the written modality are indicative of the written modality. This is not surprising, as the written modality is typically more diverse in terms of lexical usage than the spoken modality, where the terms tend to be repeated more often, including verbs.\index{lexical diversity}
#### Co-occurrence analysis {#sec-explore-co-occurrence}
Moving forward on our task, we have a general idea of the vocabulary that we want to include in our ELL materials and can identify lemma types that are particularly indicative of each modality. Another useful approach to complement our analysis is to identify words that co-occur with our target lemmas (verbs). In English, it is common for verbs to appear with a preposition or adverb, such as 'give up', 'look after'. These 'phrasal verbs' form a semantic unit that is distinct from the verb alone.
In a case such as this, we are aiming to do a co-occurrence analysis. Co-occurrence analysis\index{co-occurrence analysis} is a set of methods that are used to identify words that appear in close proximity to a target type.
<!-- Concordances -->
An exploratory, primarily qualitative, approach is to display the co-occurrence of words in a Keyword in Context (KWIC) search. **KWIC**\index{Keyword in Context (KWIC)} produces a table that displays the target word in the center of the table and the words that appear before and after the target word within some defined window context. This is a useful approach for spotting co-occurring patterns which include the target word or phrase. However, it can be a time-consuming process to manually inspect these results and is likely not a feasible approach for large datasets.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
KWIC tables are a common tool in corpus linguistics and can be used either before or after a quantitative analysis. If you are interested, {quanteda} includes a function `kwic()` that can be used to create a KWIC table.\index{R packages!quanteda}
:::
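To illustrate what a KWIC table contains, here is a bare-bones sketch in base R on a toy token vector. The helper `kwic_base()` is hypothetical and written only for this illustration; it is not the {quanteda} function and lacks its pattern matching and document handling.

```r
# A toy sentence split into tokens (not MASC data)
tokens <- c("she", "will", "look", "after", "the", "dog", "and",
            "look", "up", "the", "address")

# Hypothetical helper: words within `window` tokens of each exact match
kwic_base <- function(tokens, target, window = 2) {
  hits <- which(tokens == target)
  context <- function(i, side) {
    if (side == "pre") {
      idx <- seq(max(1, i - window), length.out = min(window, i - 1))
    } else {
      idx <- seq(i + 1, length.out = min(window, length(tokens) - i))
    }
    paste(tokens[idx], collapse = " ")
  }
  data.frame(
    pre     = vapply(hits, context, character(1), side = "pre"),
    keyword = tokens[hits],
    post    = vapply(hits, context, character(1), side = "post")
  )
}

kwic_base(tokens, "look")
```

Each row is one occurrence of the target, which makes it easy to scan the right-hand context for recurring particles.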
<!-- N-grams -->
A straightforward quantitative way to explore co-occurrence is to set the unit of observation\index{unit of observation} to an ngram\index{ngrams} of word terms. Then, the frequency and dispersion metrics can be calculated for each ngram. Yet, there is an issue with this approach for our purposes. The frequency and dispersion of ngrams does not necessarily relate to whether the two words form a semantic unit. For example, in any given corpus there will be highly frequent pairings of function words, such as 'of the', 'in the', 'to the', *etc*. These combinations are bound to occur frequently in large part because of the high frequency of each individual word. However, these combinations do not have the same semantic cohesion as other, likely lower-frequency, ngrams such as 'look after', 'give up', *etc*.
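As a minimal sketch of the ngram idea, the following builds and counts bigrams from a toy token vector in base R. The tokens are invented, and a real corpus would also need to avoid pairing the last token of one document with the first token of the next.

```r
# Toy token sequence (not MASC data)
tokens <- c("look", "after", "the", "dog", "in", "the", "park",
            "and", "look", "after", "the", "cat")

# Pair each token with the one that follows it
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Frequency of each bigram, most frequent first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 3)
```

Even in this tiny sample, 'after the' ties with 'look after' on raw frequency, which hints at why frequency alone is a poor guide to semantic cohesion.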
<!-- Collocation -->
To better address our question, we can use a statistical measure to estimate collocational strength between two words. A **collocation**\index{collocation} is a sequence of words that co-occur more often than would be expected by chance. A common measure of collocation is the **pointwise mutual information** (PMI)\index{pointwise mutual information (PMI)} measure. PMI scores reflect the likelihood of two words occurring together given their individual frequencies and compares this to the actual co-occurrence frequency. A high PMI indicates a strong semantic association between the words.
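The following sketch computes PMI by hand for two bigrams with invented counts. The base-2 logarithm is an assumption made for the illustration; the base used by a given package may differ, and the counts are not from MASC.

```r
# Hypothetical counts in a corpus of N bigram positions
N <- 10000
pairs <- data.frame(
  x    = c("look", "of"),
  y    = c("after", "the"),
  n_xy = c(20, 30),    # times x is immediately followed by y
  n_x  = c(50, 300),   # times x occurs
  n_y  = c(40, 600)    # times y occurs
)

# PMI: log ratio of the observed joint probability to the probability
# expected if x and y were independent (base-2 log is an assumption)
pairs$pmi <- log2((pairs$n_xy / N) / ((pairs$n_x / N) * (pairs$n_y / N)))
pairs
```

Although 'of the' is the more frequent pair in this toy data, 'look after' receives the much higher PMI because its parts rarely occur apart.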
One consideration that we need to take into account in our goal of identifying verb particle constructions is how we ultimately want to group our `lemma_pos` values. This is particularly important given the fact that our `pos` tags for verbs include information about the verb's tense and person attributes. This means that a verb in a verb particle bigram, such as 'look after', will be represented by multiple `lemma_pos` values, such as `look_VB`, `look_VBP`, `look_VBD`, and `look_VBG`. We want to group the verb particle bigrams by a single verb value, so we need to reclassify\index{feature reclassification} the `pos` values for verbs. We can do this with the `case_when()` function from {dplyr}.
In @exm-explore-masc-lemma-pos, I recode the `pos` values for verbs to `V` and then join the `lemma` and `pos` columns into a single string.
::: {#exm-explore-masc-lemma-pos}
```{r}
#| label: explore--masc-bigrams-prep
masc_lemma_pos_tbl <-
masc_tbl |>
mutate(pos = case_when(
str_detect(pos, "^V") ~ "V",
TRUE ~ pos
)) |>
group_by(doc_id) |>
mutate(lemma_pos = str_c(lemma, pos, sep = "_")) |>
ungroup()
```
\index{R packages!dplyr}\index{R packages!stringr}
\cindex{mutate()}\cindex{case_when()}\cindex{str_c()}\cindex{group_by()}\cindex{ungroup()}
:::
Let's calculate the PMI for all the bigrams in the MASC dataset. We can use the `calc_assoc_metrics()` function from {qtkit}. We need to set the `association` argument to `pmi` and the `type` argument to our `lemma_pos` variable, as seen in @exm-explore-masc-bigrams-pmi.
::: {#exm-explore-masc-bigrams-pmi}
```r
masc_lemma_pos_assoc <-
masc_lemma_pos_tbl |>
calc_assoc_metrics(
doc_index = doc_id,
token_index = term_num,
type = lemma_pos,
association = "pmi"
)
# Preview
masc_lemma_pos_assoc |>
arrange(desc(pmi)) |>
slice_head(n = 10)
```
::: {.content-visible when-format="pdf"}
\vspace{3em}
:::
```{r}
#| label: explore-masc-bigrams-pmi
#| echo: false
masc_lemma_pos_assoc <-
masc_lemma_pos_tbl |>
calc_assoc_metrics(
doc_index = doc_id,
token_index = term_num,
type = lemma_pos,
association = "pmi"
)
# Preview
masc_lemma_pos_assoc |>
arrange(desc(pmi)) |>
slice_head(n = 10)
```
\index{R packages!qtkit}\index{R packages!dplyr}
\cindex{calc_assoc_metrics()}\cindex{arrange()}\cindex{slice_head()}
:::
One caveat to using the PMI\index{pointwise mutual information (PMI)} measure is that it is sensitive to the frequency of the words. If the words in a bigram pair are infrequent, and especially if they only occur once, then the PMI measure will be unduly inflated. To mitigate this issue, we can apply a frequency threshold to the bigrams. Let's keep only bigrams that occur at least 10 times and have a positive PMI, and while we are at it, let's also filter `x` and `y` for the forms we are targeting, `_V` for `x` and `_IN` for `y`, as seen in @exm-explore-masc-bigrams-pmi-filtered.
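To see why rare pairs are inflated, consider the limiting case. Assuming PMI is defined as the base-2 log ratio of the joint probability to the product of the individual probabilities, take a corpus of $N$ bigrams in which $x$ and $y$ each occur exactly once, and occur together:

$$
\text{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)} = \log_2 \frac{1/N}{(1/N)(1/N)} = \log_2 N
$$

Under these assumptions the score depends only on corpus size, so a single chance pairing in a corpus of one million bigrams would score close to 20, well above most genuine collocations.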
::: {#exm-explore-masc-bigrams-pmi-filtered}
```{r}
#| label: explore-masc-bigrams-pmi-filtered
# Filter for target bigrams
masc_verb_part_assoc <-
masc_lemma_pos_assoc |>
filter(n >= 10 & pmi > 0) |>
filter(str_detect(x, "_V")) |>
filter(str_detect(y, "_IN"))
# Preview
masc_verb_part_assoc |>
slice_max(order_by = pmi, n = 10)
```
\index{R packages!dplyr}\index{R packages!stringr}
\cindex{filter()}\cindex{str_detect()}\cindex{slice_max()}
:::
We have a working method for identifying verb particle constructions. We can clean up the results a bit by removing the POS tags from the `x` and `y` variables, raising our minimum PMI value, and creating a network plot to visualize the results. A **network plot**\index{network plot} is a type of graph that shows relationships between entities. In this case, the entities are verbs and particles, and the relationships are the PMI values between them. The connections between them are represented by edges, and the shade of each edge reflects the PMI value, with darker edges indicating higher values.
```{r}
#| label: fig-explore-masc-verb-part-network
#| fig-cap: "Network plot of verb particle constructions"
#| fig-alt: "A network plot showing the association between verbs and prepositions in the MASC dataset. The plot shows a network of verbs and prepositions connected by arrows whose shade varies with the PMI value."
#| fig-width: 8
#| fig-asp: 0.75
#| echo: false
# Clean up results
masc_verb_part_assoc_plot <-
masc_verb_part_assoc |>
filter(pmi >= 2) |>
filter(str_detect(y, "that_|whether_", negate = TRUE)) |>
mutate(
x = str_remove(x, "_V.*"),
y = str_remove(y, "_IN")
)
# Create an association network plot
# `x` and `y` are the nodes
# `pmi` is the edge weight
library(igraph)
library(ggraph)
masc_verb_part_assoc_plot |>
graph_from_data_frame() |>
ggraph(layout = "nicely") +
geom_edge_link(aes(color = pmi),
alpha = 0.8,
edge_width = 0.8,
arrow = grid::arrow(length = unit(8, "points"))
) +
geom_node_text(aes(label = name), repel = TRUE) +
scale_edge_color_gradient(low = "grey90", high = "grey20") +
theme_void()
```
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
{ggplot2} cannot create network plots directly, so we use {ggraph} [@R-ggraph] and {igraph} [@R-igraph] to create the network plot. For more information on creating network plots, see the {ggraph} documentation.
:::
From @fig-explore-masc-verb-part-network, and from the underlying data, we can explore verb particle constructions. We could go further and apply our co-occurrence methods to each modality separately, if we wanted to identify verb particle constructions that are distinctive to each modality. We could also apply our co-occurrence methods to other parts of speech, such as adjectives and nouns, to identify collocations of these parts of speech. There is much more to explore with co-occurrence analysis, but this should give you a good idea of the types of questions that can be addressed.
### Unsupervised learning {#sec-explore-unsupervised}
Aligned in purpose with descriptive approaches, unsupervised learning\index{unsupervised learning} approaches to exploratory data analysis\index{exploratory data analysis (EDA)} are used to identify patterns in the data from an algorithmic perspective. Common methods in text analysis include principal component analysis, clustering, and vector space modeling.
We will continue to use the MASC dataset\index{Manually Annotated Sub-Corpus of American English (MASC)} as we develop materials for our ELL textbook to illustrate unsupervised learning methods. In the process, we will explore the following questions:
- Can we identify and group documents based on linguistic features or co-occurrence patterns of the data itself?
- Do the groups of documents relate to categories in the dataset?
- Can we estimate the semantics of words based on their co-occurrence patterns?
Through these questions we will build on our knowledge of frequency, dispersion, and co-occurrence analysis and introduce concepts and methods associated with machine learning.
#### Clustering {#sec-explore-clustering}
**Clustering**\index{clustering} is an unsupervised learning technique that can be used to group similar items in the text data, helping to organize the data into distinct categories and discover relationships between different elements in the text. The main steps in the procedure include identifying the relevant linguistic features\index{feature selection} to use for clustering, representing the features\index{feature engineering} in a way that can be used for clustering, applying a clustering algorithm to the data, and then interpreting the results.
In our ELL textbook task, we may very well want to explore the similarities and/or differences between the documents based on the distribution of linguistic features. This provides us a view to evaluate to what extent the variables in the dataset, say genre for this demonstration, map to the distribution of linguistic features. Based on this evaluation, we may want to consider re-categorizing the documents, collapsing categories, or even adding new categories.
Instead of relying entirely on the variables' values in the MASC dataset, we can let the data itself say something about how documents may or may not be related. Yet, a pivotal question is what linguistic features we will use, otherwise known as **feature selection**\index{feature selection}. We could use terms or lemmas, but we may want to consider other features, such as parts of speech or some co-occurrence pattern. We are not locked into using one criterion, and we can perform clustering multiple times with different features, but we should consider the implications of our feature selection for our interpretation of the results\index{research interpretation}.
Imagine that among the various features that we could use to associate documents, we consider lemma use and POS use. However, we need to operationalize\index{operationalize} what we mean by 'use'. In machine learning, this process is known as **feature engineering**\index{feature engineering}. We likely want to use some measure of frequency. Since we are comparing documents, a relative frequency measure will be most useful. Another consideration is what it means to use lemmas or POS tags as our features. Each represents a different linguistic aspect of the documents. Lemmas represent the lexical diversity\index{lexical diversity} of the documents while POS tags approximate the grammatical diversity\index{grammatical diversity} of the documents [@Petrenz2011].
Let's assume that our interest is to gauge the grammatical diversity of the documents, so we will go with POS tags. With this approach, we aim to distinguish between documents in a way that may allow us to consider whether genre-document categories are meaningful, along grammatical lines.
The next question to address in any analysis is how to represent the features. In machine learning, the most common way to represent \mbox{document-feature} relationships is in a matrix\index{matrix}. In our case, we want to create a matrix with the documents in the rows and the features in the columns. The values in the matrix will be the operationalization\index{operationalize} of grammatical diversity in each document. This configuration is known as a **document-term matrix** (DTM)\index{document-term matrix (DTM)}.
To recast a data frame into a DTM, we can use the `cast_dtm()` function from {tidytext}. This function takes a data frame with a document identifier, a feature identifier, and a value for each observation and casts it into a matrix. Operations such as normalization are easily and efficiently performed in R on matrices, so initially we can cast a frequency table of POS tags into a matrix and then normalize the matrix by documents.\index{R packages!tidytext}\index{standardization}
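As a small illustration of the document-by-feature layout, the sketch below builds a count matrix from a toy token table using base R cross-tabulation. The data frame `toy_tbl` is hypothetical, and this stands in for, rather than reproduces, what `cast_dtm()` followed by `as.matrix()` yields.

```r
# Hypothetical token-level data: one row per token
toy_tbl <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2", "d2"),
  pos    = c("NN", "VB", "NN", "DT", "NN", "VB", "VB")
)

# Cross-tabulate documents (rows) by POS tags (columns)
toy_dtm <- unclass(table(toy_tbl$doc_id, toy_tbl$pos))
toy_dtm
```

Note that document `d1` gets an explicit zero for `DT`: a matrix representation records the absence of a feature as well as its presence.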
Let's see how this works with the MASC dataset in @exm-explore-masc-dtms.
```{r}
#| label: exm-explore-masc-filter-us
#| include: false
# Filter out lemmas with CD, FW, LS, SYM, PUNCT, NNP, NNPS for pos
masc_tbl <-
masc_tbl |>
filter(!(pos %in% c("CD", "FW", "LS", "SYM", "PUNCT", "NNP", "NNPS")))
```
::: {#exm-explore-masc-dtms}
```{r}
#| label: explore-masc-dtms
#| results: hold
# Load package
library(tidytext)
# Create a document-term matrix of POS tags
masc_pos_dtm <-
masc_tbl |>
count(doc_id, pos) |>
cast_dtm(doc_id, pos, n) |>
as.matrix()
# Inspect
dim(masc_pos_dtm)
# Preview
masc_pos_dtm[1:5, 1:5]
```
\index{R packages!tidytext}
\cindex{count()}\cindex{cast_dtm()}\cindex{as.matrix()}\cindex{dim()}\cindex{library()}
:::
The matrix `masc_pos_dtm` has `r nrow(masc_pos_dtm)` documents and `r ncol(masc_pos_dtm)` POS tags. The values in the matrix are the frequency of each POS tag in each document. Note that to preview a subset of the contents of a matrix, such as in @exm-explore-masc-dtms, we use bracket syntax `[]` instead of the `head()` function.\cindex{[]}
We can now normalize the matrix by documents. We can do this by dividing each feature count by the total count in each document. This is a row-wise transformation, so we can use the `rowSums()` function from base R to calculate the total count in each document. Then each count is divided by its row's total count, as seen in @exm-explore-masc-dtms-normalized.\index{transformation}
::: {#exm-explore-masc-dtms-normalized}
```{r}
#| label: explore--masc-dtms-normalized
# Normalize pos matrix by documents
masc_pos_dtm <-
masc_pos_dtm / rowSums(masc_pos_dtm)
```
\cindex{rowSums()}
:::
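It may not be obvious why dividing a matrix by a vector of row totals works row-wise. The sketch below checks this on a small invented matrix (`toy_counts` is hypothetical) and compares the result with the more explicit `prop.table()`.

```r
# Hypothetical count matrix: 2 documents by 3 POS tags
toy_counts <- matrix(
  c(0, 2, 1,
    1, 1, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), c("DT", "NN", "VB"))
)

# Dividing a matrix by a vector recycles the vector down the columns,
# so a vector of row totals divides each row by its own total
toy_prop <- toy_counts / rowSums(toy_counts)

# Each document's proportions should now sum to 1
rowSums(toy_prop)

# An equivalent, more explicit form
all.equal(toy_prop, prop.table(toy_counts, margin = 1))
```

The shortcut relies on the vector having exactly one element per row; dividing by `colSums()` in the same way would not normalize by column.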
There are two concerns to address before we can proceed with clustering.\index{clustering} First, clustering algorithm performance tends to degrade as the number of features grows. Second, clustering algorithms perform better with more informative features. That is to say, features that are more distinct across the documents provide better information for deriving useful clusters.
We can address both of these concerns by reducing the number of features and increasing the informativeness of the features. One way to accomplish this is to use dimensionality reduction. **Dimensionality reduction**\index{dimensionality reduction} is a set of methods that are used to reduce the number of features in a dataset while retaining as much information as possible. The most common method for dimensionality reduction is **principal component analysis** (PCA)\index{principal component analysis (PCA)}. PCA is a method that transforms a set of correlated variables into a set of uncorrelated variables, known as principal components. The principal components are ordered by the amount of variance that they explain in the data. The first principal component explains the most variance, the second principal component explains the second most variance, and so on.
We can apply PCA to the matrix and assess how well it accounts for the variation in the data and how the variation is distributed across components. The `prcomp()` function from base R can be used to perform PCA.
Let's apply PCA to the matrix, as seen in @exm-explore-masc-dtms-pca.
::: {#exm-explore-masc-dtms-pca}
```{r}
#| label: explore--masc-dtms-pca
set.seed(123) # for reproducibility
# Apply PCA to matrix
masc_pos_pca <-
masc_pos_dtm |>
prcomp()
```
\cindex{prcomp()}
:::
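The share of variance carried by each component can also be read directly from a `prcomp()` result. The sketch below does so on simulated data; `toy_mat` is a hypothetical matrix with two deliberately correlated columns, not the MASC POS matrix.

```r
set.seed(42)  # simulated data only

# Hypothetical matrix: 50 'documents' by 4 'features', two of them correlated
shared <- rnorm(50)
toy_mat <- cbind(
  f1 = shared + rnorm(50, sd = 0.2),
  f2 = shared + rnorm(50, sd = 0.2),
  f3 = rnorm(50),
  f4 = rnorm(50, sd = 0.5)
)

toy_pca <- prcomp(toy_mat)

# Proportion of variance explained by each component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Cumulative proportion, useful for deciding how many components to keep
round(cumsum(var_explained), 3)
```

The proportions are ordered from largest to smallest and sum to 1, so the cumulative values show how much information is retained when only the first few components are kept.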
We can visualize the amount of variance explained by each principal component with a scree plot. A **scree plot**\index{scree plot} is a bar plot\index{bar plot} ordered by the amount of variance explained by each principal component. The `fviz_eig()` function from {factoextra} implements a scree plot on a PCA object. We can set the number of components to visualize with `ncp =`, as seen in @exm-explore-masc-dtms-pca-scree.