Consider switching to post-processing of special tokens #1503

Closed
koheiw opened this issue Nov 24, 2018 · 4 comments · Fixed by #1857
Comments

koheiw (Collaborator) commented Nov 24, 2018

Post-processing is nearly as fast as the current pre-processing:

require(quanteda)
require(stringi)

ndoc(corp)
##  [1] 84599
system.time({
toks <- stri_split_boundaries(texts(corp), 
                              type = "word", skip_word_none = FALSE) %>% 
        as.tokens() %>% 
        tokens_compound(phrase(c("# *", "@ *", "* -", "- *")), concatenator = "") %>% 
        tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
})
##   user  system elapsed 
## 127.989  11.377  85.952 
system.time({
toks <- tokens(corp)
})
##    user  system elapsed 
##  98.810   2.519  77.285
kbenoit (Collaborator) commented Nov 24, 2018

That's a brilliant idea, and would finally give us the option to separate the core token segmenter function from our own preferred handling of segmented tokens. This would make it possible (finally!) to address #276. Other "tokenizers" are faster than tokens(x, what = "word") but only because of our special handling of things like Twitter punctuation characters.

kbenoit changed the title from "Consider swithicing to post-processing of special tokens" to "Consider switching to post-processing of special tokens" Nov 24, 2018
koheiw (Collaborator, Author) commented Nov 24, 2018

If we are going to do minimal pre-processing and provide a handful of post-processing functions (tokens.tokens(), tokens_split(), tokens_compound(), tokens_select() with a new position argument), users can do tag extraction on tokens, and we can kill corpus_segment(), which is one of the ugliest functions we have.
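
For illustration, a minimal sketch of tag extraction on tokens using the existing functions (the toy text is made up, and tokens_select() is used with glob patterns here rather than the proposed position argument):

require(quanteda)
require(stringi)

txt <- "RT @SputnikInt: #Trump shifts away from #cybersecurity group"

# minimal split, compound the tag markers with the following word, then keep only the tags
tags <- as.tokens(stri_split_boundaries(txt, type = "word", skip_word_none = FALSE)) %>% 
  tokens_compound(phrase(c("# *", "@ *")), concatenator = "") %>% 
  tokens_select(c("#*", "@*"), selection = "keep", valuetype = "glob")
as.list(tags)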

kbenoit added this to the v2.0 milestone Dec 18, 2018
kbenoit modified the milestones: v2.0, version 2.0 Feb 24, 2019
koheiw (Collaborator, Author) commented Dec 12, 2019

If we really want to move to post-processing, it would be better to add a window argument to tokens_compound(), because

tokens_compound(list(c("#", "*"), c("@", "*")), concatenator = "") # twitter
tokens_compound(c("*", "-"), c("-", "*")), concatenator = "") # hyphen

is the same as

tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") # twitter
tokens_compound("-", window = c(1, 1), concatenator = "") # hyphen

The window approach is better because we can avoid the use of "*".
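
For the record, a small sketch of the equivalence on a made-up tweet (assuming tokens_compound() gains a window argument with the semantics described above):

require(quanteda)
require(stringi)

txt <- "RT @SputnikInt: #Trump shifts away from US-Russian #cybersecurity group"
toks <- as.tokens(stri_split_boundaries(txt, type = "word", skip_word_none = FALSE))

# wildcard-based compounding
a <- toks %>% 
  tokens_compound(phrase(c("# *", "@ *", "* -", "- *")), concatenator = "") %>% 
  tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")

# window-based compounding, no "*" needed
b <- toks %>% 
  tokens_compound("-", window = c(1, 1), concatenator = "") %>% 
  tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") %>% 
  tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")

identical(as.list(a), as.list(b))  # should be TRUE if the two calls are really equivalent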

koheiw (Collaborator, Author) commented Dec 15, 2019

Post-processing of Twitter tags is about 3 times faster than the current pre-processing, once the ~47 s cost of the bare split() step is subtracted from both: (129 - 47) / (75 - 47) ≈ 2.9.

require(quanteda)
require(stringi)
require(lubridate)
quanteda_options(threads = 8)

corp <- readRDS("~/Documents/Sputnik/Data/data_corpus_tweets.RDS")

split <- function(x) {
  stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>% 
    as.tokens()
}

post <- function(x) {
  stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>% 
    as.tokens() %>% 
    tokens_compound(list(c("#", "*"), c("@", "*"), c("*", "-"), c("-", "*")), concatenator = "") %>% 
    tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
}

post2 <- function(x) {
  stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>% 
    as.tokens() %>% 
    tokens_compound("-", window = c(1, 1), concatenator = "") %>% 
    tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") %>% 
    tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
}

pre <- function (x) {
  tokens(x, remove_punct = FALSE)
}

txt <- texts(corp)
microbenchmark::microbenchmark(
  split(txt),
  post(txt), 
  post2(txt),
  pre(txt), 
  times = 10
)

txt2 <- head(txt, 10000)
lis <- stri_split_boundaries(txt2, type = "word", skip_word_none = FALSE)
microbenchmark::microbenchmark(
  as.tokens(lis),
  times = 10
)

profvis::profvis(
  as.tokens(lis)
)

v <- unlist(lis, use.names = FALSE)
microbenchmark::microbenchmark(
  #fastmatch::coalesce(v),
  unique(v),
  v[!duplicated(v)],
  rle(v),
  times = 2
)

toks <- tokens(txt)
microbenchmark::microbenchmark(
  tokens_compound(toks, phrase("not *"), concatenator = ""),
  tokens_compound(toks, "not", window = c(0, 1), concatenator = ""),
  times = 5
)
Unit: seconds
       expr       min        lq      mean    median        uq       max neval
 split(txt)  44.87085  46.32514  47.64618  47.55630  49.03696  50.73318    10
  post(txt)  72.38992  75.56673  78.32646  78.83763  80.78479  82.59906    10
 post2(txt)  69.82946  71.31607  75.88459  73.76547  74.61518 100.48264    10
   pre(txt) 123.00084 124.02030 129.71545 126.18319 130.29893 153.10076    10


> length(txt)
[1] 1292364

> post2(head(txt))
tokens from 6 documents.
text1 :
 [1] "Trump"         "Shifts"        "Away"          "From"          "US-Russian"   
 [6] "Cybersecurity" "Group"         " - "           "Russian"       "MP"           
[11] ":"             "https"         ":"             "/"             "/"            
[16] "t.co"          "/"             "y0v00rk0XC"    "via"           "@SputnikInt"  

text2 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"             

text3 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"             

text4 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"             

text5 :
 [1] "#Trump"         "shifts"         "away"           "from"           "US-Russian"    
 [6] "#cybersecurity" "group"          "due"            "to"             "pressure"      
[11] " - "            "Russian"        "MP"             "https"          ":"             
[16] "/"              "/"              "t.co"           "/"              "RbHsX6C8BX"    
[21] "…"              "https"          ":"              "/"              "/"             
[26] "t.co"           "/"              "rkLsYeCYgM"    

text6 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"
