---
title: "Performance improvements"
author: Kohei Watanabe and Stefan Müller
output:
  html_document:
    toc: true
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  comment = "##",
  fig.width = 8,
  fig.height = 2,
  dpi = 150,
  warning = FALSE
)
```
```{r, echo=FALSE, include=FALSE}
#data_corpus_guardian <- readRDS('/home/kohei/Dropbox/Public/data_corpus_guardian2016-10k.rds')
#data_corpus_guardian <- readRDS('C:/Users/watan/Dropbox/Public/data_corpus_guardian2016-10k.rds')
data_corpus_guardian <- quanteda.corpora::download("data_corpus_guardian")
```
## Overview and benchmarking approach

**quanteda** version 4.0 processes textual data significantly faster than earlier versions, thanks to the `tokens_xptr` object and a new glob pattern matching mechanism. More information on the features and advantages of the new xptr object is available in a [separate vignette](./articles/pkgdown/tokens_xptr.html).
How we performed the comparison: we created the **quanteda3** package from **quanteda** version 3.3 and compared it with version 4.0 on a Windows laptop with an AMD Ryzen 7 PRO processor (8 cores). We used sentences from 10,000 English-language news articles for this benchmark.

We repeated each operation with both versions of the same function to obtain a distribution of execution times. The results show that many v4.0 functions run in about half the time of their v3.3 counterparts.
```{r getting-started}
# remotes::install_github("quanteda/quanteda3")
library("quanteda")
library("ggplot2")

# reshape the corpus to sentence-level documents
corp <- corpus_reshape(data_corpus_guardian, to = "sentences")

# tokenize the corpus
toks <- tokens(corp, remove_punct = FALSE, remove_numbers = FALSE,
               remove_symbols = FALSE)

# convert the tokens object to a tokens_xptr object
xtoks <- as.tokens_xptr(toks)

ndoc(toks)  # the number of sentences
sum(ntoken(toks))  # the total number of tokens
```
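
As a quick illustration of the new object (a minimal sketch, not part of the benchmark), a `tokens_xptr` object converts back to a regular `tokens` object with `as.tokens()`, and glob patterns such as `"econom*"` are matched as usual:

```{r xptr-illustration}
# a minimal sketch: convert to tokens_xptr and back, and select
# features with a glob pattern (glob is the default valuetype)
xt <- as.tokens_xptr(tokens("The economy and economic policy"))
tokens_select(xt, "econom*") |> as.tokens()
```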
## Tokenising a corpus

Although the v4 tokenizer is more flexible, its speed is roughly the same as the v3 tokenizer's. The shorter execution time of version 4.0 comes from the faster removal of punctuation marks, numbers, and symbols. We compare performance by tokenizing the corpus with the `tokens()` function from **quanteda** versions 4.0 and 3.3 ("v4" and "v3" in the plots, respectively).
```{r, echo=FALSE, include=FALSE}
# adjust custom ggplot2 theme to make it consistent with other vignettes
ggplot2::theme_set(ggplot2::theme_bw())
```
```{r tokenising-corpus}
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
                         remove_symbols = TRUE),
  v4 = tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
              remove_symbols = TRUE),
  times = 10
) |> autoplot(log = FALSE)
```
## Modifying tokens objects

We insert `as.tokens_xptr()` before each v4 function because `tokens_xptr` objects are modified in place; copying first keeps the original `xtoks` object intact across benchmark repetitions.
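
The sketch below (not part of the benchmark) illustrates why the copy matters; it assumes, as described in the `tokens_xptr` vignette, that `tokens_*` functions modify a `tokens_xptr` object directly:

```{r inplace-illustration}
# a minimal sketch: tokens_* functions modify a tokens_xptr object in place
xt <- as.tokens_xptr(tokens("one two three"))
invisible(tokens_remove(xt, "two"))  # changes xt itself
as.tokens(xt)  # "two" is gone from the original object
```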
```{r modifying-tokens}
# generate n-grams
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens_ngrams(toks),
  v4 = as.tokens_xptr(xtoks) |>
    tokens_ngrams(),
  times = 10
) |> autoplot(log = FALSE)

# look up dictionary keywords
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens_lookup(toks, dictionary = data_dictionary_LSD2015),
  v4 = as.tokens_xptr(xtoks) |>
    tokens_lookup(dictionary = data_dictionary_LSD2015),
  times = 10
) |> autoplot(log = FALSE)

# remove stop words
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens_remove(toks, pattern = stopwords("en"), padding = TRUE),
  v4 = as.tokens_xptr(xtoks) |>
    tokens_remove(pattern = stopwords("en"), padding = TRUE),
  times = 10
) |> autoplot(log = FALSE)

# compound tokens
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens_compound(toks, pattern = "&", window = 1),
  v4 = as.tokens_xptr(xtoks) |>
    tokens_compound(pattern = "&", window = 1),
  times = 10
) |> autoplot(log = FALSE)

# group sentences back into articles
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens_group(toks),
  v4 = as.tokens_xptr(xtoks) |>
    tokens_group(),
  times = 10
) |> autoplot(log = FALSE)
```
## Combining tokens objects

Combining tokens objects with `c()` is also substantially faster. Because both packages define methods for `c()`, we call the internal `c.tokens()` and `c.tokens_xptr()` methods directly so that each version's own method is benchmarked.
```{r combining-tokens}
# get the first 5,000 documents
toks1 <- head(toks, 5000)
# get the last 5,000 documents
toks2 <- tail(toks, 5000)

# convert both objects to tokens_xptr objects
xtoks1 <- as.tokens_xptr(toks1)
xtoks2 <- as.tokens_xptr(toks2)

# combine the tokens objects
microbenchmark::microbenchmark(
  v3 = quanteda3:::c.tokens(toks1, toks2),
  v4 = quanteda:::c.tokens_xptr(xtoks1, xtoks2),
  times = 10
) |> autoplot(log = FALSE)
```
## Constructing a document-feature matrix

We also compare the speed of constructing a document-feature matrix (DFM) from tokens objects.
```{r dfm-from-tokens}
microbenchmark::microbenchmark(
  v3 = quanteda3::dfm(toks),
  v4 = as.tokens_xptr(xtoks) |> dfm(),
  times = 10
) |> autoplot(log = FALSE)
```
## Simple pipeline: tokenising a corpus and creating a document-feature matrix

Finally, we benchmark a complete pipeline: tokenising the corpus, removing stop words with padding, and constructing a DFM while discarding the pads. In version 4.0, `tokens(x, xptr = TRUE)` returns a `tokens_xptr` object directly, so no separate conversion step is needed.
```{r simple-pipeline}
microbenchmark::microbenchmark(
  v3 = quanteda3::tokens(corp) |>
    quanteda3::tokens_remove(stopwords("en"), padding = TRUE) |>
    quanteda3::dfm(remove_padding = TRUE),
  v4 = tokens(corp, xptr = TRUE) |>
    tokens_remove(stopwords("en"), padding = TRUE) |>
    dfm(remove_padding = TRUE),
  times = 10
) |> autoplot(log = FALSE)
```
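
As a closing note (a small sketch, not a benchmark): `dfm()` applied to a `tokens_xptr` object returns an ordinary `dfm`, so downstream code that consumes the matrix does not need to change.

```{r pipeline-check}
# a minimal sketch: the xptr pipeline yields a regular dfm object
dfmat <- tokens(corp, xptr = TRUE) |>
  tokens_remove(stopwords("en"), padding = TRUE) |>
  dfm(remove_padding = TRUE)
class(dfmat)
```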