---
execute:
echo: true
---
::: {.content-visible when-format="pdf"}
```{=latex}
\setDOI{10.4324/9781003393764.9}
\thispagestyle{chapterfirstpage}
```
:::
# Predict {#sec-predict-chapter}
```{r}
#| label: setup-options
#| child: "../_common.qmd"
#| cache: false
```
::: {.callout}
**{{< fa regular list-alt >}} Outcomes**
- Identify the research goals of predictive data analysis
- Describe the workflow for predictive data analysis
- Recognize quantitative and qualitative methods for evaluating predictive models
:::
In this chapter, I introduce supervised learning as an approach to text analysis. Supervised learning aims to establish a relationship between a target (or outcome) variable and a set of feature variables derived from text data. By leveraging this relationship, statistical generalizations (models) can be created to accurately predict values of the target variable based on the values of the feature variables. Throughout the chapter, we explore practical tasks and theoretical applications of statistical learning in text analysis.
::: {.callout}
**{{< fa terminal >}} Lessons**
**What**: Advanced Visualization\
**How**: In an R console, load {swirl}, run `swirl()`, and follow prompts to select the lesson.\
**Why**: To dive deeper into {ggplot2} to enhance visual summaries and provide an introduction to {factoextra} and {ggfortify} that extend {ggplot2} capabilities to model objects.
:::
## Orientation {#sec-predict-orientation}
Predictive data analysis (PDA)\index{predictive data analysis (PDA)} is a powerful analysis method for making predictions about new or future data based on patterns in existing data. PDA is a type of supervised learning\index{supervised learning}, which means that it involves training a model on a labeled dataset where the input data and desired output are both provided. The model is able to make predictions\index{text regression} or classifications\index{text classification} based on the input data by learning the relationships between the input and output data. Supervised machine learning is an important tool for linguists studying language and communication, as it allows us to analyze language data to identify patterns or trends in language use, assess hypotheses, and prescribe actions.\index{machine learning}
The approach to conducting predictive analysis shares some commonalities with exploratory data analysis (@sec-explore-orientation), as well as with inferential analysis (@sec-infer-chapter), but there are also some key differences. Consider the workflow in @tbl-predict-workflow.
<!-- Workflow -->
::: {#tbl-predict-workflow tbl-colwidths="[5, 15, 80]"}
| Step | Name | Description |
|------|-------------|-------------|
| 1 | Identify | Consider the research question and aim and identify relevant variables |
| 2 | | Split the data into representative training and testing sets |
| 3 | | Apply variable selection and engineering procedures |
| 4 | Inspect | Inspect the data to ensure that it is in the correct format and that the training and testing sets are representative of the data |
| 5 | Interrogate | Train and evaluate the model on the training set, adjusting models or hyperparameters as needed, to produce a final model |
| 6 | (Optional) Iterate | Repeat steps 3-5 to select new variables, models, hyperparameters |
| 7 | Interpret | Interpret the results of the final model in light of the research question or hypothesis |
Workflow for predictive data analysis
:::
Focusing on the overlap with other analysis methods, we can see some fundamental steps such as identifying relevant variables, inspecting the data, interrogating the data, and interpreting the results. And if our research aim is exploratory in nature, iteration may also be a part of the workflow.\index{research aim}
There are two main differences, however, between the PDA and the EDA workflow we discussed in @sec-explore-chapter\index{exploratory data analysis (EDA)}. The first is that PDA requires partitioning the data into training and testing sets. The training set is used to develop the model, and the testing set is used to evaluate the model's performance. This strategy is used to ensure that the model is robust and generalizes well to new data. It is well known, and makes intuitive sense, that using the same data to develop and evaluate a model likely will not produce a model that generalizes well to new data. This is because the model will have potentially conflated the nuances of the data ('the noise') with any real trends ('the signal') and therefore will not be able to generalize well to new data. This is called **overfitting**\index{overfitting} and by holding out a portion of the data for testing, we can evaluate the model's performance on data that it has not seen before and therefore get a more accurate estimate of the generalizable trends in the data.
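To make the idea of overfitting concrete, the following is a minimal sketch using simulated data and base R only. The data, the variable names (`sim_tbl`, `fit_simple`, `fit_flexible`), and the choice of a degree-10 polynomial are assumptions made for illustration; they are not part of the analysis in this chapter.

```r
# Simulated data: a linear trend (the signal) plus random noise
set.seed(123)
n <- 100
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
sim_tbl <- data.frame(x = x, y = y)

# Hold out 20 of the 100 observations for testing
test_idx <- sample(n, size = 20)
sim_train <- sim_tbl[-test_idx, ]
sim_test <- sim_tbl[test_idx, ]

# Root mean square error of a model's predictions on a dataset
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# A simple model and a much more flexible model
fit_simple <- lm(y ~ x, data = sim_train)
fit_flexible <- lm(y ~ poly(x, degree = 10), data = sim_train)

# Compare error on seen (training) and unseen (testing) data
data.frame(
  model = c("simple", "flexible"),
  train_rmse = c(rmse(fit_simple, sim_train), rmse(fit_flexible, sim_train)),
  test_rmse = c(rmse(fit_simple, sim_test), rmse(fit_flexible, sim_test))
)
```

The flexible model can never do worse than the simple model on the training set, because it contains the simple model as a special case. Whether that advantage carries over to the testing set is exactly what the held-out data lets us check.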
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
Prediction modeling is a hot topic. Given the potential to make actionable predictions about future outcomes, it attracts a lot of attention from organizations which aim to leverage data to make informed decisions. Its use in research is also growing, moving beyond the development of better models toward the use of predictive models to address research questions and hypotheses.
We will apply predictive modeling in the context of language data as a semi-inductive method. However, it is also increasingly used in hypothesis testing scenarios, see @Gries2014, @Deshors2016, and @Baayen2011 for examples\index{Gries}\index{Baayen}.
:::
Another procedure to avoid the perils of overfitting is to use resampling methods as part of the model evaluation on the training set. Resampling\index{resampling} is the process of repeatedly drawing samples from the training set and evaluating the model on each sample. The two most common resampling methods are **bootstrapping**\index{bootstrapping} (resampling with replacement) and **cross-validation**\index{cross-validation} (resampling without replacement). The performance of these multiple models is summarized and the error between them is assessed. The goal is to minimize the performance differences between the models while maximizing the overall performance. These measures go a long way to avoiding overfitting and therefore maximizing the chance that the training phase will produce a model which is robust at the testing phase.
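As a rough sketch of what resampling looks like in practice, the code below uses `vfold_cv()` and `bootstraps()` from {rsample}. The built-in `mtcars` data and the simple `lm()` model are stand-ins chosen for illustration only; the object names are hypothetical.

```r
library(rsample)

set.seed(123)

# Toy training set: the built-in mtcars data stands in for real training data
toy_train <- mtcars

# Cross-validation: 5 folds, each observation is held out exactly once
cv_folds <- vfold_cv(toy_train, v = 5)

# Bootstrapping: 10 resamples drawn with replacement
boot_samples <- bootstraps(toy_train, times = 10)

# Fit a model on the analysis portion of each fold and
# compute RMSE on the held-out assessment portion
cv_rmse <- vapply(cv_folds$splits, function(split) {
  fit <- lm(mpg ~ wt, data = analysis(split))
  held_out <- assessment(split)
  sqrt(mean((held_out$mpg - predict(fit, newdata = held_out))^2))
}, numeric(1))

# Summarize performance across the folds
mean(cv_rmse) # average error
sd(cv_rmse)   # variability between folds
```

A low average error with little variability between folds suggests that the model's performance does not depend heavily on which observations it happened to be trained on.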
The second difference, not reflected in the workflow but inherent in predictive analysis, is that PDA requires a fixed outcome variable\index{outcome variable}. This means that the outcome variable must be defined from the outset and cannot be changed during the analysis. Furthermore, the informational nature of the outcome variable will dictate what type of algorithm we choose to interrogate the data and how we will evaluate the model's performance.
If the outcome is categorical in nature, we will use a **classification algorithm**\index{classification algorithms} (*e.g.* logistic regression, naive Bayes, *etc.*). Classification evaluation metrics include accuracy, precision, recall, and F1 scores (a metric which balances precision and recall) which can be derived from and visualized in a cross-tabulation of the predicted and actual outcome values\index{research interpretation}.
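The classification metrics mentioned above can be computed directly from a cross-tabulation. The following base R sketch uses ten made-up labels (not results from this chapter) and treats "Learner" as the class of interest.

```r
# Hypothetical actual and predicted labels for ten texts
actual <- factor(
  c("Learner", "Learner", "Learner", "Learner", "Learner", "Learner",
    "Native", "Native", "Native", "Native"),
  levels = c("Learner", "Native")
)
predicted <- factor(
  c("Learner", "Learner", "Learner", "Learner", "Learner", "Native",
    "Native", "Native", "Native", "Learner"),
  levels = c("Learner", "Native")
)

# Cross-tabulation of predicted and actual values (a confusion matrix)
conf_mat <- table(predicted = predicted, actual = actual)
conf_mat

# Treat "Learner" as the class of interest (the 'positive' class)
tp <- conf_mat["Learner", "Learner"] # learners predicted as learners
fp <- conf_mat["Learner", "Native"]  # natives predicted as learners
fn <- conf_mat["Native", "Learner"]  # learners predicted as natives

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)     # 8 / 10
precision <- tp / (tp + fp)                         # 5 / 6
recall <- tp / (tp + fn)                            # 5 / 6
f1 <- 2 * precision * recall / (precision + recall) # 5 / 6

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

Precision asks how many of the texts labeled "Learner" really are learner texts, while recall asks how many of the learner texts the model found. F1 is the harmonic mean of the two.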
If the outcome is numeric in nature, we will use a **regression algorithm**\index{regression algorithms} (*e.g.* linear regression, support vector regression, *etc.*). Since the difference between prediction and actual values is numeric, metrics that quantify numerical differences, such as root mean square error (RMSE) or $R^2$, are used to evaluate the model's performance\index{research interpretation}.
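For regression metrics, a small worked example may help. The sketch below computes RMSE and one common definition of $R^2$ (one minus the ratio of residual to total sum of squares) in base R. The four scores are invented for illustration.

```r
# Hypothetical actual and predicted placement scores
actual <- c(60, 70, 80, 90)
predicted <- c(62, 68, 83, 87)

# Root mean square error: the typical size of a prediction error,
# expressed in the units of the outcome variable
errors <- actual - predicted     # -2, 2, -3, 3
rmse <- sqrt(mean(errors^2))     # sqrt(26 / 4)

# R-squared: proportion of variance in the outcome accounted for
ss_res <- sum(errors^2)                  # 26
ss_tot <- sum((actual - mean(actual))^2) # 500
r_squared <- 1 - ss_res / ss_tot         # 0.948

c(rmse = rmse, r_squared = r_squared)
```

Lower RMSE and higher $R^2$ both indicate predictions that track the actual values more closely. Note that some software computes $R^2$ as the squared correlation between predicted and actual values instead, which can give a different number.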
The evaluation of the model is quantitative on the one hand, but it is also qualitative in that we need to consider the implications of the model's performance in light of the research question or hypothesis. Furthermore, depending on our research question we may be interested in exploring the features that are most important to the model's performance. This is called **feature importance**\index{feature importance} and can be derived from the model's coefficients or weights. Notably, however, some of the most powerful models in use today, such as deep neural networks, are not easily interpretable and therefore feature importance is not easily derived. This is something to keep in mind when considering the research question and the type of model that will be used to address it\index{research interpretation}.
## Analysis {#sec-predict-analysis}
<!-- Goals of this section -->
In this section, we now turn to the practical application of predictive data analysis. The discussion will be separated into classification and regression tasks, as model selection and evaluation procedures differ between the two. For each task, we will frame a research goal and work through the process of building a predictive model to address that goal. Along the way we will cover concepts and methods that are common to both classification and regression tasks and specific to each.
<!-- Research questions -->
To frame our analyses, we will posit research aimed at identifying language usage patterns in second language use, one for a classification task and one for a regression task. Our first research question will be to assess whether Spanish language use can be used to distinguish native speakers from L1 English learners (categorical). Our second research question will be to gauge the extent to which the L1 English learners' Spanish language placement test scores (numeric) can be predicted based on their language use.
```{r}
#| label: predict-class-curate-data-run
#| eval: false
#| echo: false
# Read in the datasets
learners_tbl <-
read_csv("data/cedel2_learners.csv") |> # read in the learners dataset
mutate(subcorpus = "Learner") |>
select(subcorpus, place_score = placement_test_score_percent, proficiency, text) # select variables
natives_tbl <-
read_csv("data/cedel2_natives.csv") |> # read in the natives dataset
mutate(place_score = NA) |> # create an outcome variable
mutate(proficiency = "Native") |> # create a proficiency variable
select(subcorpus, place_score, proficiency, text) # select variables
# Combine the datasets by row
cedel_tbl <-
bind_rows( # combine the datasets by row
learners_tbl,
natives_tbl
) |>
mutate(doc_id = row_number()) |> # create a document id
select(doc_id, subcorpus, place_score, proficiency, text) # select variables
# Write to disk
cedel_tbl |>
write_csv("data/cedel2_transformed.csv")
# Create data dictionary
create_data_dictionary(
data = cedel_tbl,
file_path = "data/cedel2_transformed_dd.csv",
model = "gpt-3.5-turbo",
grouping = "subcorpus"
)
```
```{r}
#| label: tbl-predict-cedel2-data-dictionary
#| tbl-cap: "Data dictionary for the CEDEL2 corpus"
#| tbl-colwidths: [15, 15, 15, 55]
#| echo: false
# Read in the data dictionary for the CEDEL2 corpus
read_csv("data/cedel2_transformed_dd.csv") |>
tt(width = 1)
```
We will use data from the CEDEL2 corpus [@Lozano2022]\index{Corpus Escrito del Español como L2 (CEDEL)}. We will include a subset of the variables from this data that are relevant to our research questions\index{research question}. The data dictionary for this dataset is seen in @tbl-predict-cedel2-data-dictionary.
Let's go ahead and read the transformed dataset and preview it in @exm-predict-cedel-read.
::: {#exm-predict-cedel-read}
```r
# Read in the dataset
cedel_tbl <-
read_csv("../data/cedel2/cedel2_transformed.csv")
# Preview
glimpse(cedel_tbl)
```
```{r}
#| label: predict-cedel-read
#| echo: false
# Read in the dataset
cedel_tbl <-
read_csv("data/cedel2_transformed.csv")
# Preview
glimpse(cedel_tbl)
```
\index{R packages!readr}
\cindex{read_csv()}\cindex{glimpse()}
:::
The output of @exm-predict-cedel-read provides some structural information about the dataset: the number of rows and columns as well as the variable types.
```{r}
#| label: predict-cedel-diagnostics
#| include: false
# Load packages
library(skimr)
library(janitor)
# Diagnostics
cedel_tbl |>
skim()
# There appear to be 1051 NA values for place_score. This is likely due to the fact that the natives dataset does not have a place_score variable, let's take a look.
cedel_tbl |>
filter(is.na(place_score)) |>
select(subcorpus, place_score, proficiency) |>
head()
# Yes, placement score is NA for the natives dataset.
# Next, let's look at how the proficiency variable is distributed.
cedel_tbl |>
tabyl(proficiency) |>
adorn_pct_formatting(digits = 1)
# Lower beginner and upper advanced have very few observations. But since this will not be a class, we can proceed with the analysis.
# Next, let's look at the distribution of the place_score variable.
cedel_tbl |>
ggplot(aes(x = place_score)) +
geom_histogram(aes(y = after_stat(density))) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of placement scores",
x = "Placement score",
y = "Density")
# Some left skew, with an apparent bimodal distribution. Since this is only a learner dataset, we can assume that the bimodal distribution is due to the fact that the learners are at different levels of proficiency. Let's explore this further.
cedel_tbl |>
ggplot(aes(x = place_score, fill = proficiency)) +
# geom_histogram(aes(y = after_stat(density)), position = "identity", alpha = 0.5, binwidth = 2) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of placement scores by proficiency",
x = "Placement score",
y = "Density")
# Looks like our two peaks are due to the lower and upper proficiency levels. This is good to know and will be useful for our analysis.
# Let's gauge the variability of the place_score variable by proficiency. A boxplot will be useful for this.
# Before we do that, let's make proficiency and subcorpus factor variables in the order we want.
cedel_tbl <-
cedel_tbl |>
mutate(proficiency =
factor(proficiency,
levels =
c("Lower beginner",
"Upper beginner",
"Lower intermediate",
"Upper intermediate",
"Lower advanced",
"Upper advanced",
"Native")
),
subcorpus = factor(subcorpus, levels = c("Learner", "Native")
)
)
cedel_tbl |>
ggplot(aes(x = proficiency, y = place_score, fill = proficiency)) +
geom_boxplot() +
labs(title = "Variability of placement scores by proficiency",
x = "Proficiency",
y = "Placement score")
# Finally, let's look at the distribution of the subcorpus variable.
cedel_tbl |>
tabyl(subcorpus) |>
adorn_pct_formatting(digits = 1)
```
After I performed some diagnostics and made some adjustments based on a descriptive assessment\index{descriptive assessment}, the dataset is in good order to proceed with the analysis. I updated the variables `subcorpus` and `proficiency` as factor variables and ordered them in a way that makes sense for the analysis. The `place_score` variable is distributed well across the proficiency levels. The `subcorpus` variable is less balanced, with around 65% of the texts being from learners. This is not a problem, but it is something to keep in mind when building and interpreting the predictive models.
We will be using the Tidymodels framework\index{Tidymodels} in R to perform these analyses. {tidymodels} is a meta-package, much like {tidyverse}, that provides a consistent interface for machine learning modeling. Some key packages unique to {tidymodels} are {recipes}, {parsnip}, {workflows}, and {tune}. {recipes} includes functions for pre-processing and engineering features. {parsnip} provides a consistent interface for specifying modeling algorithms. {workflows} allows us to combine recipes and models into a single pipeline. Finally, {tune} gives us the ability to evaluate and adjust, or 'tune', the parameters of models.
Since we are using text data, we will also be using {textrecipes} which makes various functions available for pre-processing text including extracting and engineering features.
Let's go ahead and do the setup, loading the necessary packages, seen in @exm-predict-packages-data.
::: {#exm-predict-packages-data}
```r
# Load packages
library(tidymodels) # modeling metapackage
library(textrecipes) # text pre-processing
# Prefer tidymodels functions
tidymodels_prefer()
```
```{r}
#| label: predict-packages-data-run
#| echo: false
# Load packages
library(tidymodels) # modeling metapackage
library(textrecipes) # text pre-processing
library(janitor) # data inspection
# Prefer tidymodels functions
tidymodels_prefer()
```
\index{R packages!tidymodels}\index{R packages!textrecipes}
\cindex{tidymodels_prefer()}\cindex{library()}
:::
### Text classification {#sec-predict-text-classification}
<!-- Research question: outcome and features -->
The goal of text classification analysis is to develop a model that can accurately label text samples as either native or learner. This is a binary classification problem. We will approach this problem from an exploratory perspective, and therefore our aim is to identify features from the text that best distinguish between the two classes and explore the features that are most important to the model's performance.\index{text classification}
Let's modify the data frame to include only the variables we need for this analysis, assigning it to `cls_tbl`. In the process, we will rename the `subcorpus` variable to `outcome` to reflect that it is the outcome variable. This is seen in @exm-predict-class-data.
::: {#exm-predict-class-data}
```{r}
#| label: predict-class-data
# Rename subcorpus to outcome
cls_tbl <-
cedel_tbl |>
select(outcome = subcorpus, proficiency, text)
```
:::
<!-- Step 1: identify features -->
Let's begin the workflow from @tbl-predict-workflow by identifying the features that we will use to classify the texts. There may be many features that we could use. These could be features derived from raw text (*e.g.* characters, words, ngrams, *etc.*), feature vectors (*e.g.* word embeddings), or meta-linguistic features (*e.g.* part-of-speech tags, syntactic parses, or semantic features) that have been derived from these through manual or automatic annotation.\index{feature selection}
If our research question\index{research question} specifies the types of features to use, then we should proceed toward deriving those features. If not, a simple approach is to use words as the predictor features\index{predictor features}. This will serve as a baseline for more complex models, if necessary.
This provides us the linguistic unit we will use, but we still need to decide how to operationalize what we mean by 'use' in our research statement. Do we use raw token counts? Do we use normalized frequencies? Do we use some type of weighting scheme? These are questions that we need to consider as we embark on this analysis. Since we are exploring, we can use trial-and-error, consider the implications of each approach and choose the one that best fits our research question, or both.
Let's approach this with a bit more nuance as we already have some domain knowledge about word use. First, we know that the frequency distribution of words is highly skewed\index{skewed distribution}, meaning that a few words occur very frequently and most words occur very infrequently. Second, we know that the most frequent words in a language are often function words\index{function words} (*e.g.* 'the', 'and', 'of', *etc.*) and that these words are not very informative for distinguishing between classes of texts. Third, we know that comparing raw counts across texts conflates word use with text length, as longer texts will tend to have higher counts.
With these considerations in mind, we will tokenize by words and apply a metric known as the **term-frequency inverse-document frequency** ($tf$-$idf$)\index{term-frequency inverse-document frequency (tf-idf)}. The $tf$-$idf$ measure, as the name suggests, is the product of $tf$ and $idf$ for each term. In effect, it produces a weighting scheme, which will downweight words that are common across all documents and upweight words that are unique to a document. It also mitigates the varying lengths of the documents. This is a common approach in text classification and is a good starting point for our analysis.\index{feature engineering}
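To see how the weighting works, here is a small base R sketch on three toy documents. It assumes one common formulation, where $tf$ is a term's count divided by the document length and $idf$ is the log of the number of documents divided by the number of documents containing the term. The exact definition used by a given package may differ in details such as normalization, and the documents themselves are made up.

```r
# Three toy documents (hypothetical data)
docs <- c(
  d1 = "el gato come",
  d2 = "el perro come mucho",
  d3 = "el gato duerme"
)

# Tokenize by whitespace and count each term per document
tokens <- strsplit(docs, split = " ")
vocab <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(doc) table(factor(doc, levels = vocab))))

# Term frequency: counts relative to document length
tf <- counts / rowSums(counts)

# Inverse document frequency: log of (documents / documents containing term)
idf <- log(nrow(counts) / colSums(counts > 0))

# tf-idf: multiply each term's tf by its idf
tf_idf <- sweep(tf, MARGIN = 2, STATS = idf, FUN = "*")
round(tf_idf, 3)
```

The word 'el' appears in every document, so its $idf$ is $\log(3/3) = 0$ and it receives no weight at all. The word 'mucho' appears in only one document and receives the highest $idf$.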
<!-- Step 2: Initial split -->
With our features and engineering approach identified, we can move on to step 2 of our workflow and split the data into training and testing sets. We make the splits to our data at this point to draw a line in the sand between the data we will use to train the model and the data we will use to test the model. A typical approach in supervised machine learning\index{supervised learning} is to allocate around 75-80% of the data to the training set and the remaining 20-25% to the testing set, depending on the number of observations\index{observations}. We have `r nrow(cls_tbl)` observations in our dataset, so we can allocate 80% of the data to the training set and 20% of the data to the testing set.
In @exm-predict-class-split, we will use the `initial_split()` function from {rsample} to split the data into training and testing sets. The `initial_split()` function takes a data frame and a proportion and returns a `split` object which contains the training and testing sets. We will use the `strata` argument to stratify the data by the `outcome` variable. This will ensure that the training and testing sets have the same proportion of native and learner texts.\index{outcome variable}
\pagebreak
::: {#exm-predict-class-split}
```{r}
#| label: predict-class-split
set.seed(123) # for reproducibility
# Split the data into training and testing sets
cls_split <-
initial_split(
data = cls_tbl,
prop = 0.8,
strata = outcome
)
# Create training set
cls_train <- training(cls_split) # 80% of data
# Create testing set
cls_test <- testing(cls_split) # 20% of data
```
\index{R packages!rsample}
\cindex{initial_split()}\cindex{training()}\cindex{testing()}\cindex{set.seed()}
:::
A confirmation of the distribution of the data across the training and testing sets as well as a breakdown of the outcome variable, created by {janitor}'s `tabyl()` function, can be seen in @exm-predict-class-split-tabyl.
::: {#exm-predict-class-split-tabyl}
```{r}
#| label: predict-class-split-tabyl
#| results: hold
# View the distribution
# Training set
cls_train |>
tabyl(outcome) |>
adorn_totals("row") |>
adorn_pct_formatting(digits = 1)
# Testing set
cls_test |>
tabyl(outcome) |>
adorn_totals("row") |>
adorn_pct_formatting(digits = 1)
```
\index{R packages!janitor}
\cindex{tabyl()}\cindex{adorn_totals()}\cindex{adorn_pct_formatting()}
:::
We can see that the split was successful. The training and testing sets have very similar proportions of native and learner texts.
<!-- Step 3: Integrate: plan to select and engineer features -->
We are now ready to create a 'recipe', step 3 in our analysis. A recipe is Tidymodels terminology for a set of instructions or blueprint which specifies the outcome variable and the predictor variable and determines how to pre-process and engineer the feature variables.
We will use the `recipe()` function from {recipes} to create the recipe. The `recipe()` function minimally takes a formula and a data frame and returns a `recipe` object. **R formulas**\index{R formula} provide a way to specify relationships between variables and are used extensively in R data modeling. Formulas specify the outcome variable ($y$) and the predictor variable(s)\index{predictor variables} ($x_1, \ldots, x_n$). For example, `y ~ x` can be read as "y as a function of x". In our particular case, we will use the formula `outcome ~ text` to specify that the outcome variable is the `outcome` variable and the predictor variable is the `text` variable. The code is seen in @exm-predict-class-recipe.\cindex{\textasciitilde}
::: {#exm-predict-class-recipe}
```{r}
#| label: predict-class-recipe
# Create a recipe
base_rec <-
recipe(
formula = outcome ~ text, # formula
data = cls_train
)
# Preview
base_rec
```
\index{R packages!recipes}
\cindex{recipe()}
```
── Recipe ─────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 1
```
:::
The recipe object at this moment contains just one instruction: what the variables are and what their relationship is.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
R formulas are a powerful way to specify relationships between variables and are used extensively in data modeling including exploratory, predictive, and inferential analysis. The basic formula syntax is `y ~ x` where `y` is the outcome variable and `x` is the feature variable. The formula syntax can be extended to include multiple feature variables, interactions, and transformations. For more information on R formulas, see R for Data Science [@Wickham2017].
:::
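As a brief illustration of the formula syntax described in the tip, the sketch below builds a few formulas in base R. The `mpg`, `wt`, and `hp` variables come from the built-in `mtcars` dataset and are used only as an example.

```r
# A formula with a single predictor
f_single <- outcome ~ text
class(f_single)    # "formula"
all.vars(f_single) # the variables named in the formula

# Multiple predictors are joined with `+`
f_multi <- mpg ~ wt + hp

# `*` expands to the main effects plus their interaction (`wt:hp`)
f_inter <- mpg ~ wt * hp

# Transformations can be applied inside the formula
f_trans <- log(mpg) ~ wt

# Modeling functions take a formula and a data frame
fit <- lm(f_inter, data = mtcars)
names(coef(fit))
```

Note that creating a formula does not evaluate the variables it names. They are only looked up when a modeling function pairs the formula with a dataset.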
{recipes} provides a wide range of `step_*()` functions which can be applied to the recipe to specify how to engineer the variables in our recipe call. These include functions to scale (*e.g.* `step_center()`, `step_scale()`, *etc.*) and transform (*e.g.* `step_log()`, `step_pca()`, *etc.*) numeric variables, and functions to encode (*e.g.* `step_dummy()`, `step_integer()`, *etc.*) categorical variables.\index{feature engineering}
These step functions are great when we have selected the variables we want to use in our model and we want to engineer them in a particular way. In our case, however, we need to derive features from the text in the `text` column of datasets before we engineer them.
To ease this process, {textrecipes} provides a number of step functions for pre-processing text data. These include functions to tokenize\index{tokens} (*e.g.* `step_tokenize()`), remove stop words\index{stopwords} (*e.g.* `step_stopwords()`), and to derive meta-features\index{metadata} (*e.g.* `step_lemma()`, `step_stem()`, *etc.*)[^1]. Furthermore, there are functions to engineer features in ways that are particularly relevant to text data, such as feature frequencies and weights (*e.g.* `step_tf()`, `step_tfidf()`, *etc.*) and token filtering (*e.g.* `step_tokenfilter()`).\index{standardization}
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
For other tokenization strategies and feature engineering methods, see {textrecipes} documentation [@R-textrecipes]. There are also packages which provide integration with {textrecipes} for other languages, for example, {washoku} for Japanese text processing [@R-washoku].
:::
[^1]: Note that functions for meta-features require more sophisticated text analysis software to be installed on the computing environment (e.g. {spacyr} for `step_lemma()`, `step_pos()`, *etc.*). See {textrecipes} documentation for more information.
So let's build on our basic recipe `base_rec` by adding steps relevant to our task. To extract our features, we will use the `step_tokenize()` function to tokenize the text into words. The default behavior of the `step_tokenize()` function is to tokenize the text into words, but other token units can be derived and various options can be added to the function call (as {tokenizers} is used under the hood). Adding the `step_tokenize()` function to our recipe is seen in @exm-predict-class-recipe-tokenize.
::: {#exm-predict-class-recipe-tokenize}
```{r}
#| label: predict-class-recipe-tokenize
# Add step to tokenize the text
cls_rec <-
base_rec |>
step_tokenize(text) # tokenize
# Preview
cls_rec
```
\index{R packages!tokenizers}\index{R packages!textrecipes}
\cindex{step_tokenize()}
::: {.content-visible when-format="pdf"}
\vspace{1em}
:::
```
── Recipe ─────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 1
── Operations
• Tokenization for: text
```
:::
The recipe object `cls_rec` now contains two instructions, one for the outcome variable and one for the feature variable. The feature variable instruction specifies that the text should be tokenized into words.
We now need to consider how to engineer the word features. If we add `step_tf()` we will get a matrix of token counts by default\index{matrix}, with the option to specify other weights. The step function `step_tfidf()` creates a matrix of term frequencies\index{frequency} weighted by inverse document frequency\index{inverse document frequency}.
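To get a feel for what a matrix of token counts looks like, the sketch below applies `step_tokenize()` and `step_tf()` to three made-up documents. The `toy_tbl` data and object names are hypothetical. The `prep()` and `bake()` calls estimate the recipe and return the engineered features.

```r
library(recipes)
library(textrecipes)

# Three toy documents (hypothetical data)
toy_tbl <- data.frame(
  outcome = factor(c("Learner", "Native", "Learner")),
  text = c("el gato come", "el perro come mucho", "el gato duerme")
)

# Tokenize into words, then count each token per document
toy_rec <-
  recipe(outcome ~ text, data = toy_tbl) |>
  step_tokenize(text) |>
  step_tf(text)

# Estimate the recipe on the toy data and return the engineered features
toy_counts <-
  toy_rec |>
  prep() |>
  bake(new_data = NULL)

toy_counts
```

Each row is a document and each feature column is a word from the vocabulary, so the number of columns grows with the number of distinct words in the texts.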
We decided in step 1 that we will start with $tf$-$idf$\index{term-frequency inverse-document frequency (tf-idf)}, so we will add `step_tfidf()` to our recipe. This is seen in @exm-predict-class-recipe-tfidf.
::: {#exm-predict-class-recipe-tfidf}
```{r}
#| label: predict-class-recipe-tfidf
# Add step to compute tf-idf weights
cls_rec <-
cls_rec |>
step_tfidf(text, smooth_idf = FALSE)
# Preview
cls_rec
```
\index{R packages!textrecipes}
\cindex{step_tfidf()}
```
── Recipe ─────────────────────────────────────────
Number of variables by role
outcome: 1
predictor: 1
── Operations
• Tokenization for: text
• Term frequency-inverse document frequency with: text
```
:::
<!-- Step 4: Inspect: -->
<!-- [ ] tmp fmt -->
\pagebreak
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The `step_tfidf()` function by default adds a *smoothing term* to the inverse document frequency ($idf$) calculation\index{inverse document frequency}. This setting has the effect of reducing the influence of the $idf$ calculation. Thus, terms that appear in many (or all) documents will not be downweighted as much as they would be if the smoothing term was not added. For our purposes, we want to downweight or eliminate the influence of the most frequent terms, so we will set `smooth_idf = FALSE`.
:::
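The effect of the smoothing term can be sketched with a small calculation in base R. This uses one common formulation of $idf$, the log of the number of documents divided by the number of documents containing the term, with 1 added inside the log for the smoothed version. The exact formula used by {textrecipes} may differ in detail, and the toy documents are assumptions for illustration.

```r
# Three hypothetical tokenized documents
docs <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "sat"),
  c("the", "bird", "flew")
)

n_docs <- length(docs)
vocab <- sort(unique(unlist(docs)))

# Number of documents each term appears in
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(docs, function(d) term %in% d))
})

# idf without and with a smoothing term (assumed formulation)
idf_plain <- log(n_docs / doc_freq)
idf_smooth <- log(1 + n_docs / doc_freq)

round(rbind(idf_plain, idf_smooth), 3)
```

In this sketch "the" appears in every document, so its unsmoothed $idf$ is `log(1) = 0` and it contributes nothing, whereas the smoothed version leaves it with a positive weight.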
To make sure things are in order and that the recipe performs as expected, we can use the functions `prep()` and `bake()` to inspect the recipe. The `prep()` function takes a recipe object and a data frame and returns a `prep` object. The `prep` object contains the recipe and the data frame with the feature variables engineered according to the recipe. The `bake()` function takes a `prep` object and an optional new dataset to apply the recipe to. If we only want to see the application to the training set, we can use the `new_data = NULL` argument.
In @exm-predict-class-recipe-prep, we use the `prep()` and `bake()` functions to create a data frame with the feature variables. We can then inspect the data frame to see if the recipe performs as expected.
::: {#exm-predict-class-recipe-prep}
```{r}
#| label: predict-class-recipe-prep
# Prep and bake
cls_bake <-
cls_rec |>
prep() |> # create a prep object
bake(new_data = NULL) # apply to training set
# Preview
dim(cls_bake)
```
\index{R packages!recipes}
\cindex{prep()}\cindex{bake()}\cindex{dim()}
:::
The resulting engineered features\index{feature engineering} data frame\index{data frame} has `r format(nrow(cls_bake), big.mark = ",")` observations and `r format(ncol(cls_bake), big.mark = ",")` variables. That is a lot of features! Given the fact that for each writing sample, only a small subset of them will actually appear, most of our cells will be filled with zeros. This is what is known as a **sparse matrix**\index{sparse matrix}\index{matrix}.
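A toy example in base R shows how quickly zeros come to dominate. The documents below are hypothetical, and the matrix is built by hand rather than with {recipes}, so treat this as a sketch of the concept only.

```r
# Three hypothetical tokenized documents
docs <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "barked"),
  c("a", "bird", "flew")
)

vocab <- sort(unique(unlist(docs)))

# One row per document, one column per word, cells hold counts
dtm <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))

# Proportion of cells that are zero
sparsity <- mean(dtm == 0)
sparsity
```

Even with three short documents, more than half of the cells are zero. With thousands of distinct words the proportion is typically far higher.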
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
When applying tokenization and feature engineering steps to text data the result is often contained in a matrix object. Using {recipes} a data frame with a matrix-like structure is returned. Remember, a matrix is a data frame where all the vector types\index{vector types} are the same.
Furthermore, the features are prefixed with the variable name and transformation step labels. In @exm-predict-class-recipe-prep we applied $tf$-$idf$ to the `text` variable. Therefore the features are prefixed with `tfidf_text_`.
:::
But we should pause. This is an unwieldy number of features for a model, one for every single word, and it is likely that many of these features are not useful for our classification task. Furthermore, the more features we have, the greater the chance that they will capture the idiosyncrasies of these particular writing samples, increasing the likelihood that we overfit the model\index{overfitting}. All in all, we need to reduce the number of features.
We can filter out features by stopword\index{stopwords} list or by frequency\index{frequency} of occurrence. Let's start by frequency of occurrence. We can set the maximum number of the top features with an arbitrary threshold to start. The `step_tokenfilter()` function can filter out features on a number of criteria. Let's use the `max_tokens` argument to set the maximum number of features to 100.
This particular step needs to be applied before the `step_tfidf()` step, so we will add it to our recipe before the `step_tfidf()` step. This is seen in @exm-predict-class-recipe-tokenfilter.
::: {#exm-predict-class-recipe-tokenfilter}
```{r}
#| label: predict-class-recipe-tokenfilter
#| results: hold
# Rebuild recipe with tokenfilter step
cls_rec <-
base_rec |>
step_tokenize(text) |>
step_tokenfilter(text, max_tokens = 100) |>
step_tfidf(text, smooth_idf = FALSE)
# Prep and bake
cls_bake <-
cls_rec |>
prep() |>
bake(new_data = NULL)
# Preview
dim(cls_bake)
cls_bake[1:5, 1:5]
```
\index{R packages!recipes}\index{R packages!textrecipes}
\cindex{step_tokenfilter()}\cindex{step_tokenize()}\cindex{step_tfidf()}
\cindex{prep()}\cindex{bake()}\cindex{dim()}
:::
We now have a manageable set of features, fewer of which will be filled with zeros. Only during the interrogation step will we know if they are useful.
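The logic of keeping only the most frequent tokens can be sketched in base R. This is a simplified stand-in for what `max_tokens` does, using a hypothetical token vector and a cutoff of 2 instead of 100.

```r
# Hypothetical tokens pooled across all documents
tokens <- c("the", "cat", "the", "dog", "sat", "the", "sat", "bird")

# Count each token and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)

# Keep the names of the top 2 tokens (our stand-in for max_tokens)
top_tokens <- names(head(counts, 2))
top_tokens

# Filter the original tokens to the retained vocabulary
tokens[tokens %in% top_tokens]
```

Because the filter operates on token counts, it has to happen before the counts are converted into weights, which is why the filtering step sits between tokenization and $tf$-$idf$ in the recipe.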
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
The `prep()` and `bake()` functions are useful for inspecting the recipe and the engineered features, but they are not required to build a recipe. When a recipe is added to a workflow, the `prep()` and `bake()` functions are called automatically as part of the process.
:::
<!-- Step 5: Interrogate -->
We are now ready to turn our attention to step 5 of our workflow, interrogating the data. In this step, we will first select a classification algorithm\index{classification algorithms}, then add this algorithm and our recipe to a workflow object. We will then use the workflow object to train and assess the resulting models, adjusting them until we believe we have a robust final model to apply on the testing set for our final evaluation.
<!-- Select a classification algorithm -->
There are many classification algorithms to choose from with their own strengths and shortcomings. In @tbl-predict-class-algorithms, we list some of the most common classification algorithms and their characteristics.
::: {#tbl-predict-class-algorithms tbl-colwidths="[15, 28, 28, 29]"}
| Algorithm | Strengths | Shortcomings | Tuning Recommendation |
|-----------|-----------|--------------|-----------------------|
| Logistic regression | Interpretable, fast, high-dimensional data | Linear relationship, not for complex tasks | Cross-validate regularization strength |
| Naive Bayes | Interpretable, fast, high-dimensional data, multi-class | Assumes feature (naive) independence, poor with small data | None |
| Decision trees | Nonlinear, interpretable, numerical/ categorical data | Overfitting, high variance | Cross-validate maximum tree depth |
| Random forest | Nonlinear, numerical/ categorical data, less overfitting | Less interpretable, poor with high-dimensional data | Cross-validate number of trees |
| Support vector machines | Nonlinear, high-dimensional data, numerical/ categorical | Requires parameter tuning, memory intensive | Cross-validate regularization parameter |
| Neural networks | Nonlinear, large data, auto feature learning | Overfitting, difficult to interpret, expensive | Cross-validate learning rate |
Common classification algorithms
:::
<!-- Models: simple to complex -->
In the process of selecting an algorithm, simple, computationally efficient, and interpretable models are preferred over complex, computationally expensive, and uninterpretable models, all things being equal. Only if the performance of the simple model is not good enough should we move on to a more complex model.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
{parsnip} provides a consistent interface to many different models, `r nrow(parsnip::model_db)` at the time of writing. You can peruse the list of models by running `parsnip::model_db`.
You can also retrieve the list of potential engines for a given model specification with the `show_engines()` function. For example, `show_engines("logistic_reg")` will return a data frame with the engines available for the logistic regression model specification. Note, the engines represent R packages that need to be installed to use the engine.
:::
With this end in mind, we will start with a simple logistic regression\index{logistic regression} model to see how well we can classify the texts in the training set with the features we have engineered. We will use the `logistic_reg()` function from {parsnip} to specify the logistic regression model. We then select the implementation engine (`glmnet`, which fits regularized generalized linear models). The mode of the model is `classification`, which is the default for `logistic_reg()`. The implementation engine is the software that will be used to fit the model. The code to set up the model specification is seen in @exm-predict-class-model-spec.
::: {#exm-predict-class-model-spec}
```{r}
#| label: predict-class-model-spec
#| output: false
# Create a model specification
cls_spec <-
logistic_reg() |>
set_engine("glmnet")
# Preview
cls_spec
```
```
Logistic Regression Model Specification (classification)
Computational engine: glmnet
```
\index{R packages!parsnip}\index{R packages!glmnet}
\cindex{logistic_reg()}\cindex{set_engine()}
:::
<!-- WARN: define hyperparameters here -->
Now, different algorithms will have different parameters that can be adjusted which can affect the performance of the model (see @tbl-predict-class-algorithms). So as not to confuse these parameters with the features, which are also parameters of the model, these are given the name **hyperparameters**\index{hyperparameters}. The adjustment process is called **hyperparameter tuning**\index{hyperparameter tuning} and involves fitting the model to the training set with different hyperparameters and evaluating the model's performance to determine the best hyperparameter values to use for the model.
<!-- [ ] tmp fmt -->
\pagebreak
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
You can find the hyperparameters for a model-engine by consulting the `parsnip::model_db` object and unnesting the `parameters` column. For example, `parsnip::model_db |> filter(model == "logistic_reg") |> unnest(parameters)` will return a data frame with the hyperparameters for the logistic regression model.
To learn more about the hyperparameters for a specific model, you can consult the documentation for the {parsnip} model (*e.g.* `?logistic_reg`).
:::
For example, the logistic regression model using `glmnet` can be tuned to prevent overfitting. The regularization typically applied is the LASSO (L1) penalty[^lasso]. The `logistic_reg()` function takes the arguments `penalty` and `mixture`. We set `mixture = 1`, which selects the LASSO, but we now need to decide what value to use for the strength of the `penalty` argument. The default range of values provided by {dials} runs from nearly 0 to 1, where values near 0 indicate almost no penalty and 1 indicates a strong penalty.
[^lasso]: The LASSO (least absolute shrinkage and selection operator) is a type of regularization that penalizes the absolute value of the coefficients. In essence, it smooths the coefficients by shrinking them towards zero to avoid coefficients picking up on particularities of the training data that will not generalize to new data.
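The shrinking behavior of the LASSO can be illustrated with the soft-thresholding operator, which gives the LASSO solution in the idealized case of uncorrelated, standardized predictors. This is a conceptual sketch only and not how `glmnet` is implemented. The coefficient values and the penalty of 0.6 are assumptions chosen for illustration.

```r
# Soft-thresholding: shrink each coefficient towards zero by lambda,
# setting it to exactly zero if it is smaller than lambda in magnitude
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Hypothetical unpenalized coefficients
beta <- c(2.0, -0.5, 0.1, -1.5)

# Apply a penalty strength of 0.6
beta_lasso <- soft_threshold(beta, lambda = 0.6)
beta_lasso
```

The two small coefficients are set to zero and the two large ones are pulled towards zero, which is why the LASSO doubles as a form of feature selection.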
Instead of guessing, we will use {tune} to tune the hyperparameters of the model. The `tune()` function serves as a placeholder for the hyperparameters we want to tune. We can add the `tune()` function to our model specification to specify the hyperparameters we want to tune. The code is seen in @exm-predict-class-model-spec-tune.
::: {#exm-predict-class-model-spec-tune}
```{r}
#| label: predict-class-model-spec-tune-lasso
#| output: false
# Create a model specification (with tune)
cls_spec <-
logistic_reg(penalty = tune(), mixture = 1) |>
set_engine("glmnet")
# Preview
cls_spec
```
```
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = 1
Computational engine: glmnet
```
:::
We can see that the `cls_spec` model specification now includes the `tune()` function as the value for the `penalty` argument.
<!-- Combine recipe and model specification -->
To tune our model, we will need to combine our recipe and model specification into a workflow object which sequences our feature selection, engineering, and model selection. We will use the `workflow()` function from {workflows} to do this. The code is seen in @exm-predict-class-workflow.
::: {#exm-predict-class-workflow}
```{r}
#| label: predict-class-workflow
#| output: false
# Create a workflow
cls_wf <-
workflow() |>
add_recipe(cls_rec) |>
add_model(cls_spec)
# Preview
cls_wf
```
\index{R packages!workflows}
\cindex{workflow()}\cindex{add_recipe()}\cindex{add_model()}
```
══ Workflow ═══════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ───────────────────────────────────
3 Recipe Steps
• step_tokenize()
• step_tokenfilter()
• step_tfidf()
── Model ──────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = 1
Computational engine: glmnet
```
:::
We now have a workflow `cls_wf` that includes our recipe and model specification, including the `tune()` function as a placeholder for a range of values for the penalty hyperparameter. To tune the penalty hyperparameter\index{hyperparameters}, we use the `grid_regular()` function from {dials} to specify a grid of values to try. Let's choose a regular (evenly spaced) set of 10 values, as seen in @exm-predict-class-model-spec-tune-grid-values.
::: {#exm-predict-class-model-spec-tune-grid-values}
```{r}
#| label: predict-class-model-spec-tune-grid-values
# Create a grid of values for the penalty hyperparameter
cls_grid <-
grid_regular(penalty(), levels = 10)
# Preview
cls_grid
```
\index{R packages!dials}
\cindex{grid_regular()}
:::
The 10 values chosen to be in the grid range from nearly 0 to 1, where 0 indicates no penalty and 1 indicates a strong penalty.
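The grid is regular on a logarithmic scale rather than on the raw scale. Assuming the default range of `penalty()` is $10^{-10}$ to $10^{0}$, the same values can be sketched in base R as follows.

```r
# Ten evenly spaced exponents between -10 and 0 (assumed default range)
exponents <- seq(-10, 0, length.out = 10)

# Convert from the log10 scale back to penalty values
penalty_values <- 10^exponents
penalty_values
```

This is why most of the grid sits very close to 0: each step multiplies the previous penalty by the same factor instead of adding a fixed amount.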
Now to perform the tuning and arrive at an optimal value for `penalty` we need to create a tuning workflow. We do this by calling the `tune_grid()` function with our tuning model specification workflow, a resampling object, and our hyperparameter grid. This returns a `tune_grid` object.
Resampling\index{resampling} is a strategy that allows us to generate multiple training and testing sets from a single dataset (in this case the training data we split at the outset). Each generated training-testing pair is called a fold, which is why this type of resampling is called **k-fold cross-validation**\index{k-fold cross-validation}. The `vfold_cv()` function from {rsample} takes a data frame and a number of folds and returns a `vfold_cv` object. We will apply the `cls_wf` workflow to the 10 folds of the training set with `tune_grid()`. For each fold, each of the 10 values of the penalty hyperparameter will be tried and the model's performance will be evaluated. The code is seen in @exm-predict-class-model-spec-tune-grid-cv.
::: {#exm-predict-class-model-spec-tune-grid-cv}
```{r}
#| label: predict-class-model-spec-tune-grid-cv
set.seed(123) # for reproducibility
# Create a resampling object
cls_vfold <- vfold_cv(cls_train, v = 10)
# Tune the model
cls_tune <-
tune_grid(
cls_wf,
resamples = cls_vfold,
grid = cls_grid
)
# Preview
cls_tune
```
\index{R packages!rsample}
\cindex{vfold_cv()}\cindex{tune_grid()}\cindex{set.seed()}
:::
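To see what a resampling object holds conceptually, here is a base R sketch of k-fold splitting. It is not how {rsample} is implemented. The 20 observations, the 5 folds, and the `training`/`testing` names are assumptions for illustration.

```r
set.seed(123) # for reproducibility

n <- 20 # hypothetical number of observations
k <- 5  # number of folds

# Randomly assign each observation to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, hold out its observations and train on the rest
folds <- lapply(seq_len(k), function(i) {
  list(
    training = which(fold_id != i),
    testing = which(fold_id == i)
  )
})

# Sizes of the training and testing sets in each fold
sapply(folds, lengths)
```

Every observation is held out exactly once, so each model is always assessed on data it did not see during fitting.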
The `cls_tune` object contains the results of the tuning for each fold. We can see the results of the tuning for each fold by calling the `collect_metrics()` function on the `cls_tune` object, as seen in @exm-predict-class-model-spec-tune-grid-collect. Passing the `cls_tune` object to `autoplot()` produces the visualization in @fig-predict-class-model-spec-tune-grid-collect.
::: {#exm-predict-class-model-spec-tune-grid-collect}
```r
# Collect the results of the tuning
cls_tune_metrics <-
collect_metrics(cls_tune)
# Visualize metrics
autoplot(cls_tune)
```
```{r}
#| label: fig-predict-class-model-spec-tune-grid-collect
#| fig-cap: "Metrics for each fold of the tuning process"
#| fig-alt: "Two line plots one showing accuracy and the other showing ROC-AUC for each fold of the tuning process. The y-axis is the metric value and the x-axis is the penalty hyperparameter value."
#| fig-width: 6
#| fig-asp: 0.5
#| echo: false
# Collect the results of the tuning
cls_tune_metrics <-
collect_metrics(cls_tune)
# Visualize metrics
cls_tune_metrics |>
filter(.metric != "brier_class") |>
ggplot(aes(x = penalty, y = mean)) +
geom_line() +
geom_point() +
scale_x_log10() +
facet_wrap( ~ .metric, nrow = 2) +
labs(x = "Amount of regularization", y = "") +
theme_qtalr(font_size = 10)
```
\index{R packages!tune}\index{R packages!ggplot2}
\cindex{collect_metrics()}\cindex{autoplot()}
:::
The most common metrics for model performance in classification are accuracy\index{accuracy} and the **receiver operating characteristic area under the curve** (ROC-AUC)\index{receiver operating characteristic area under the curve (ROC-AUC)}. Accuracy is simply the proportion of correct predictions. The ROC-AUC provides a single score which summarizes how well the model can distinguish between classes. The closer to 1 the more discriminative power the model has.
In the plot of the metrics, @fig-predict-class-model-spec-tune-grid-collect, we can see that many of the penalty values performed similarly, with a drop-off in performance at the higher values. Conveniently, the `show_best()` function from {tune} takes a `tune_grid` object and returns the best performing hyperparameter values\index{hyperparameters}. The code is seen in @exm-predict-class-model-spec-tune-grid-collect-best.
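Both metrics can be computed by hand on a small example. The sketch below uses six hypothetical observations with made-up predicted probabilities, a 0.5 classification threshold, and the rank-based formulation of the ROC-AUC, which is the proportion of positive-negative pairs in which the positive case receives the higher probability. {yardstick} computes these for us in practice.

```r
# Hypothetical true classes (1 = positive) and predicted probabilities
truth <- c(1, 1, 1, 0, 0, 0)
prob <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)

# Accuracy: proportion of correct predictions at a 0.5 threshold
pred <- as.integer(prob >= 0.5)
accuracy <- mean(pred == truth)

# ROC-AUC via ranks (equivalent to the Mann-Whitney U statistic)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

c(accuracy = accuracy, roc_auc = auc)
```

Note that accuracy depends on the threshold we choose, while the ROC-AUC only depends on how the probabilities are ordered.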
::: {#exm-predict-class-model-spec-tune-grid-collect-best}
```{r}
#| label: predict-class-model-spec-tune-grid-collect-best
# Show the best performing hyperparameter value
cls_tune |>
show_best(metric = "roc_auc")
```
\index{R packages!tune}
\cindex{show_best()}
:::
We can make this selection programmatically by using the `select_best()` function. This function needs a metric to select by. We will use the ROC-AUC and select the best value for the penalty hyperparameter. The code is seen in @exm-predict-class-model-spec-tune-grid-collect-select.
\pagebreak
::: {#exm-predict-class-model-spec-tune-grid-collect-select}
```{r}
#| label: predict-class-model-spec-tune-grid-collect-select
# Select the best performing hyperparameter value
cls_best <-
select_best(cls_tune, metric = "roc_auc")
# Preview
cls_best
```
\index{R packages!tune}
\cindex{select_best()}
:::
All of that to tune a hyperparameter! Now we can update the model specification and workflow with the best performing hyperparameter value using the previous `cls_wf` workflow and the `finalize_workflow()` function. The `finalize_workflow()` function takes a workflow and the selected parameters and returns an updated `workflow` object, as seen in @exm-predict-class-tune-hyperparameters-update-workflow.
::: {#exm-predict-class-tune-hyperparameters-update-workflow}
```{r}
#| label: predict-class-tune-hyperparameters-update-workflow
#| output: false
# Update model specification
cls_wf_lasso <-
cls_wf |>
finalize_workflow(cls_best)
# Preview
cls_wf_lasso
```
\index{R packages!workflows}
\cindex{finalize_workflow()}
```
══ Workflow ═══════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ───────────────────────────────────
• step_tokenize()
• step_tokenfilter()
• step_tfidf()
── Model ──────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = 0.000464158883361278
mixture = 1
Computational engine: glmnet
```
:::
Our model specification and the workflow are updated with the tuned hyperparameter.
As a reminder, we are still working in step 5 of our workflow, interrogating the data. So far, we have selected and engineered the features, split the data into training and testing sets, and selected a classification algorithm. We have also tuned the hyperparameters of the model and updated the model specification and workflow with the best performing hyperparameter value.
The next step is to assess the performance of the model on the training set given the features we have engineered, the algorithm we have selected, and the hyperparameters we have tuned. Instead of evaluating the model on the training set directly, we will use cross-validation\index{k-fold cross-validation} on the training set to gauge the variability of the model.
The reason for this is that the model's performance on the entire training set at once is not a reliable indicator of the model's performance on new data ---just imagine if you were to take the same test over and over again, you would get better and better at the test, but that doesn't mean you've learned the material any better. Cross-validation is a technique that allows us to estimate the model's performance on new data by simulating the process of training and testing the model on different subsets of the training data.
Similar to what we did to tune the hyperparameters, we can use cross-validation to gauge the variability of the model. The `fit_resamples()` function takes a workflow and a resampling object and returns metrics for each fold. The code is seen in @exm-predict-class-tune-hyperparameters-evaluate-workflow-cv.
::: {#exm-predict-class-tune-hyperparameters-evaluate-workflow-cv}
```{r}
#| label: predict-class-tune-hyperparameters-evaluate-workflow-cv
# Cross-validate workflow
cls_lasso_cv <-
cls_wf_lasso |>
fit_resamples(
resamples = cls_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
```
\index{R packages!tune}
\cindex{fit_resamples()}
:::
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The `collect_metrics()` function takes the results of a cross-validation and returns a data frame with the metrics.
\pagebreak
::: {#exm-predict-class-tune-hyperparameters-evaluate-workflow-cv-collect}
```{r}
#| label: predict-class-tune-hyperparameters-evaluate-workflow-cv-collect
# Collect metrics
collect_metrics(cls_lasso_cv)
```
\cindex{collect_metrics()}
:::
From the accuracy and ROC-AUC metrics in @exm-predict-class-tune-hyperparameters-evaluate-workflow-cv-collect it appears we have a decent candidate model, but there is room for potential improvement. A good next step is to evaluate the model errors and see if there are any patterns that can be addressed before considering what approach to take to improve the model.
::: {.callout .halfsize}
**{{< fa regular hand-point-up >}} Tip**
To provide context in terms of what is a good model performance, it is useful to compare the model's performance to a null model. A **null model**\index{null model} (or baseline model) is a simple model that is easy to implement and provides a benchmark for the model's performance. For classification tasks\index{text classification}, a common null model is to predict the most frequent class. In modeling, this is the minimal benchmark we want to beat. If we are doing better than this, we are doing better than simply guessing the most frequent class every time.
:::
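The accuracy of this kind of null model is simply the proportion of the most frequent class. A minimal sketch in base R, using hypothetical class counts that do not reflect our dataset:

```r
# Hypothetical outcome labels: 7 learners and 3 natives
outcome <- c(rep("learner", 7), rep("natives", 3))

# Always predicting the most frequent class is correct this often
baseline_accuracy <- max(prop.table(table(outcome)))
baseline_accuracy
```

With these made-up counts, a model would need an accuracy above 0.7 before we could say it has learned anything beyond the class imbalance.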
For classification tasks, a good place to start is to visualize a confusion matrix. A **confusion matrix**\index{confusion matrix} is a cross-tabulation of the predicted and actual outcomes. The `conf_mat_resampled()` function takes a `fit_resamples` object (with predictions saved) and returns a table (`tidy = FALSE`) with the confusion matrix for the aggregated folds. We can pass this to the `autoplot()` function to plot as in @exm-predict-class-tune-hyperparameters-evaluate-workflow-cv-confusion.
```{r}
#| label: fig-predict-class-tune-hyperparameters-evaluate-workflow-cv-confusion
#| include: false
# Plot confusion matrix
p <- cls_lasso_cv |>
conf_mat_resampled(tidy = FALSE) |>
autoplot(type = "heatmap") +
theme_qtalr(font_size = 10) +
theme(legend.position = "none")
ggsave("figures/fig-class-tune-hyperparameters-evaluate-workflow-cv-confusion.png", p, width = 4.5, height = 4, dpi = 300)
```
::: {#exm-predict-class-tune-hyperparameters-evaluate-workflow-cv-confusion}
```r
# Plot confusion matrix
cls_lasso_cv |>
conf_mat_resampled(tidy = FALSE) |>
autoplot(type = "heatmap")
```
\index{R packages!tune}\index{R packages!ggplot2}
\cindex{conf_mat_resampled()}\cindex{autoplot()}
:::
::: {#fig-class-tune-hyperparameters-evaluate-workflow-cv-confusion}
![](figures/fig-class-tune-hyperparameters-evaluate-workflow-cv-confusion.png){width="75%" fig-alt="Heatmap of the confusion matrix for the aggregated folds of the cross-validation. Actual classes are on the y-axis and predicted classes are on the x-axis."}
Confusion matrix for the aggregated folds of the cross-validation
:::
The top left to bottom right diagonal contains the true positives and true negatives. These are the correct predictions. The top right to bottom left diagonal contains the false positives and false negatives ---our errors. The convention is to speak of one class being the positive class and the other class being the negative class. In our case, we will consider the positive class to be the 'learner' class and the negative class to be the 'natives' class.
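The four cells can be reproduced with a cross-tabulation in base R. The six predictions below are hypothetical, and here the predicted classes are placed in the rows and the actual classes in the columns, with 'learner' as the positive class.

```r
classes <- c("learner", "natives")

# Hypothetical actual and predicted classes for six writing samples
actual <- factor(
  c("learner", "learner", "learner", "natives", "natives", "learner"),
  levels = classes
)
predicted <- factor(
  c("learner", "natives", "learner", "natives", "learner", "learner"),
  levels = classes
)

# Cross-tabulate predictions against actual classes
cm <- table(Predicted = predicted, Actual = actual)
cm

# Name each cell, treating 'learner' as the positive class
c(
  true_pos = cm["learner", "learner"],
  false_neg = cm["natives", "learner"],
  false_pos = cm["learner", "natives"],
  true_neg = cm["natives", "natives"]
)
```

The diagonal of `cm` holds the correct predictions, so `sum(diag(cm)) / sum(cm)` recovers the accuracy.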
We can see that there are more learners falsely predicted to be natives than the other way around. This may be due to the fact that there are simply more learners than natives in the dataset, or it could signal that some learners are more similar to natives than other learners are. Clearly this can't be the entire explanation, as the model is not perfect: even some natives are classified falsely as learners! But it may be an interesting avenue for further exploration. Perhaps these are learners that are more advanced or have a particular style of writing that is more similar to natives.
::: {.callout .halfsize}
**{{< fa medal >}} Dive deeper**
Another perspective often applied to evaluate a model is the receiver operating characteristic (ROC) curve\index{receiver operating characteristic (ROC) curve}. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds. This metric, and visualization, can be useful to gauge the model's ability to distinguish between the two classes. {yardstick} provides the `roc_curve()` function to calculate the ROC curve from the predictions saved in a `fit_resamples` object.
:::
<!-- Model improvement: tune max tokens -->
To improve supervised learning models, consider: