Consider switching to post-processing of special tokens #1503
That's a brilliant idea, and would finally give us the option to separate the core token segmenter function from our own preferred handling of segmented tokens. This would make it possible (finally!) to address #276. Other "tokenizers" are faster than …
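For concreteness, a minimal sketch of that separation, assuming `stri_split_boundaries()` from stringi as the core segmenter and quanteda's `as.tokens()` for conversion (the same building blocks used in the benchmarks further down; the example string is invented):

```r
library(quanteda)
library(stringi)

txt <- "Follow @someone and #rstats for more"

# step 1: core segmentation only, punctuation and spaces kept as tokens
toks <- as.tokens(stri_split_boundaries(txt, type = "word",
                                        skip_word_none = FALSE))

# step 2: user-level post-processing, e.g. re-attach "#"/"@" prefixes
# and drop whitespace tokens
toks <- tokens_compound(toks, list(c("#", "*"), c("@", "*")), concatenator = "")
toks <- tokens_remove(toks, "^\\p{Z}+$", valuetype = "regex")
```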
If we are going to do minimal pre-processing and provide a handful of post-processing functions (…
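The comment is truncated, but quanteda already exports functions that could serve in such a post-processing chain; a sketch using `tokens_tolower()`, `tokens_remove()`, and the re-exported pipe (the input string is made up):

```r
library(quanteda)

toks <- tokens("A minimally pre-processed text, with punctuation kept!",
               remove_punct = FALSE)
toks %>%
    tokens_tolower() %>%
    tokens_remove("^\\p{P}+$", valuetype = "regex")  # drop punctuation tokens
```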
If we really want to move to post-processing, better to add a `window` argument, so that

```r
tokens_compound(toks, list(c("#", "*"), c("@", "*")), concatenator = "") # twitter
tokens_compound(toks, list(c("*", "-"), c("-", "*")), concatenator = "") # hyphen
```

is the same as

```r
tokens_compound(toks, c("#", "@"), window = c(0, 1), concatenator = "") # twitter
tokens_compound(toks, "-", window = c(1, 1), concatenator = "") # hyphen
```
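To see the equivalence on a toy input (a sketch, assuming a quanteda build in which `tokens_compound()` has the proposed `window` argument; the input string is invented):

```r
library(quanteda)
library(stringi)

toks <- as.tokens(stri_split_boundaries("a #tag in state-of-the-art prose",
                                        type = "word", skip_word_none = FALSE))

# pattern-sequence form: "#" followed by any token
tokens_compound(toks, list(c("#", "*")), concatenator = "")

# window form: join "#" with the single token that follows it
tokens_compound(toks, "#", window = c(0, 1), concatenator = "")
```

Both calls should rejoin "#" and "tag" into a single "#tag" token.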
Post-processing of twitter tags is 3 times (…):

```r
require(quanteda)
require(stringi)
require(lubridate)
quanteda_options(threads = 8)

corp <- readRDS("~/Documents/Sputnik/Data/data_corpus_tweets.RDS")

# baseline: stringi word segmentation only
split <- function(x) {
    stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>%
        as.tokens()
}

# post-processing with wildcard pattern sequences
post <- function(x) {
    stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>%
        as.tokens() %>%
        tokens_compound(list(c("#", "*"), c("@", "*"), c("*", "-"), c("-", "*")),
                        concatenator = "") %>%
        tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
}

# post-processing with window-based compounding
post2 <- function(x) {
    stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>%
        as.tokens() %>%
        tokens_compound("-", window = c(1, 1), concatenator = "") %>%
        tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") %>%
        tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
}

# current pre-processing via tokens()
pre <- function(x) {
    tokens(x, remove_punct = FALSE)
}

txt <- texts(corp)
microbenchmark::microbenchmark(
    split(txt),
    post(txt),
    post2(txt),
    pre(txt),
    times = 10
)

# profile as.tokens() on a 10,000-tweet sample
txt2 <- head(txt, 10000)
lis <- stri_split_boundaries(txt2, type = "word", skip_word_none = FALSE)
microbenchmark::microbenchmark(
    as.tokens(lis),
    times = 10
)
profvis::profvis(
    as.tokens(lis)
)

# candidate ways to extract unique types from the flattened token vector
v <- unlist(lis, use.names = FALSE)
microbenchmark::microbenchmark(
    # fastmatch::coalesce(v),
    unique(v),
    v[!duplicated(v)],
    rle(v),
    times = 2
)

# pattern-sequence vs window-based compounding on the same tokens
toks <- tokens(txt)
microbenchmark::microbenchmark(
    tokens_compound(toks, phrase("not *"), concatenator = ""),
    tokens_compound(toks, "not", window = c(0, 1), concatenator = ""),
    times = 5
)
```
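A small sanity check that window-based post-processing reproduces what `tokens()` currently returns, reusing `post2()` and `pre()` from the script above (the sample tweet is invented; whitespace tokens aside, the outputs should line up):

```r
s <- "RT @someone: #rstats is state-of-the-art"
as.list(post2(s))  # segment, then rejoin "#"/"@" prefixes and hyphens
as.list(pre(s))    # current tokens() behaviour
```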
Post-processing is as fast as current pre-processing.