---
title: "Build a pipeline"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{build_pipeline}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
rmarkdown.html_vignette.check_title = FALSE
)
```
## Load libraries
```{r eval = FALSE}
library(autotextclassifier) # Auto text classifier
library(parallel) # Parallel processing
library(doParallel) # Parallel processing
library(here) # Creating reproducible file paths
library(patchwork) # Putting ggplots together
library(recipes) # Preprocessing
library(zeallot) # Multiple assignments
library(yardstick) # Metrics
```
## Import data
```{r eval = FALSE}
load(file = here("inst/extdata/sample_data.rda"))
```
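Once the package is installed, the same file can also be read from the installed copy of `inst/extdata` via `system.file()`, which avoids hard-coding a development path. A minimal sketch, assuming the file ships with `autotextclassifier`:

```{r eval = FALSE}
# Locate sample_data.rda inside the installed package (assumes the file is
# bundled with autotextclassifier) and load it into the global environment
load(system.file("extdata", "sample_data.rda", package = "autotextclassifier"))
```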
## Data munging
Don't forget to convert the outcome variable to a factor.
```{r eval = FALSE}
names(sample_data) <- c("category", "org_name", "ein", "text")
sample_data$category <- as.factor(sample_data$category)
```
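Before moving on, it is worth confirming the conversion and checking the class balance, since heavily imbalanced categories affect the choice of metric later on:

```{r eval = FALSE}
# Confirm the outcome is a factor and inspect how the classes are distributed
is.factor(sample_data$category)
table(sample_data$category)
```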
## Apply basic recipe
The `rec` object records the preprocessing steps listed below. `apply_basic_recipe()` also checks whether the `text` column has missing values or contains extremely short documents (fewer than five words).

There are two basic options for text preprocessing:

1. Without word embedding
    * Tokenization for text [trained]
    * Stop word removal for text [trained]
    * Text filtering for text [trained]
    * Term frequency-inverse document frequency with text [trained]
2. With word embedding
    * Tokenization for text [trained]
    * Stop word removal for text [trained]
    * Text filtering for text [trained]
    * Word embeddings aggregated from text [trained]
```{r eval = FALSE}
# Without word embedding
rec <- apply_basic_recipe(sample_data, category ~ text, text)
# With word embedding
rec_alt <- apply_basic_recipe(sample_data, category ~ text, text, add_embedding = TRUE)
```
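To see the features the trained recipe actually produces, you can bake it on the training data. A short sketch, assuming `apply_basic_recipe()` returns a prepped `recipes` object:

```{r eval = FALSE}
# Apply the trained steps to the training data and inspect the resulting
# feature columns (tf-idf terms or aggregated embedding dimensions)
baked <- recipes::bake(rec, new_data = NULL)
dim(baked)
head(names(baked))
```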
## Build a pipeline
The `build_pipeline` function reduces the number of steps needed to build a classifier pipeline. It splits the data, creates tuning parameters, search spaces, workflows, and 10-fold cross-validation samples, then finds the best model from each algorithm and fits it to the data.
```{r eval = FALSE}
# Use parallel processing to speed up tuning; detectCores() returns a single
# integer, and we leave one physical core free for the OS
all_cores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(all_cores - 1)
registerDoParallel(cl)
```
```{r eval = FALSE}
set.seed(1234)
# The pipeline returns the best-fitted lasso, random forest, and XGBoost models
c(lasso_fit, rand_fit, xg_fit) %<-%
  build_pipeline(sample_data, category, rec, prop_ratio = 0.8, metric_choice = "roc_auc")
```
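Once the models are fitted, release the parallel workers so later code runs sequentially:

```{r eval = FALSE}
# Shut down the worker processes and restore sequential execution
parallel::stopCluster(cl)
foreach::registerDoSEQ()
```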
## Evaluate the model using visualization
```{r eval = FALSE}
# Class-based metrics; test_x_class and test_y_class are the held-out test
# features and labels from the data-splitting step
viz_class_fit(lasso_fit, "Lasso", test_x_class, test_y_class, "class") +
viz_class_fit(rand_fit, "Random forest", test_x_class, test_y_class, "class") +
viz_class_fit(xg_fit, "XGBoost", test_x_class, test_y_class, "class")
# Probability-based metrics
viz_class_fit(lasso_fit, "Lasso", test_x_class, test_y_class, "probability") +
viz_class_fit(rand_fit, "Random forest", test_x_class, test_y_class, "probability") +
viz_class_fit(xg_fit, "XGBoost", test_x_class, test_y_class, "probability")
```
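Because `patchwork` composes `ggplot` objects, the combined panels can be saved like any other plot. A sketch, assuming `viz_class_fit()` returns ggplots (the `outputs/` path is illustrative):

```{r eval = FALSE}
# Store the composed panel and write it to disk (output path is illustrative)
p_class <- viz_class_fit(lasso_fit, "Lasso", test_x_class, test_y_class, "class") +
  viz_class_fit(rand_fit, "Random forest", test_x_class, test_y_class, "class") +
  viz_class_fit(xg_fit, "XGBoost", test_x_class, test_y_class, "class")

ggplot2::ggsave(here("outputs", "class_metrics.png"), p_class, width = 10, height = 4)
```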