Consider switching to post-processing of special tokens #1503

Closed
koheiw opened this issue Nov 24, 2018 · 4 comments · Fixed by #1857
Comments

koheiw (Collaborator) commented Nov 24, 2018

Post-processing is nearly as fast as the current pre-processing:

require(quanteda)
require(stringi)

ndoc(corp)
##  [1] 84599
system.time({
toks <- stri_split_boundaries(texts(corp), 
                              type = "word", skip_word_none = FALSE) %>% 
        as.tokens() %>% 
        tokens_compound(phrase(c("# *", "@ *", "* -", "- *")), concatenator = "") %>% 
        tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
})
##   user  system elapsed 
## 127.989  11.377  85.952 
system.time({
toks <- tokens(corp)
})
##    user  system elapsed 
##  98.810   2.519  77.285
kbenoit (Collaborator) commented Nov 24, 2018

That's a brilliant idea, and would finally give us the option to separate the core token segmenter function from our own preferred handling of segmented tokens. This would make it possible (finally!) to address #276. Other "tokenizers" are faster than tokens(x, what = "word") but only because of our special handling of things like Twitter punctuation characters.

kbenoit changed the title from "Consider swithicing to post-processing of special tokens" to "Consider switching to post-processing of special tokens" Nov 24, 2018
koheiw (Collaborator, Author) commented Nov 24, 2018

If we are going to do minimal pre-processing and provide a handful of post-processing functions (tokens.tokens(), tokens_split(), tokens_compound(), tokens_select() with a new position argument), users can do tag extraction on tokens, and we can kill corpus_segment(), which is one of the ugliest functions we have.
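
For illustration, a minimal sketch of tag extraction on tokens using the existing functions (the toy text is made up, and tokens_select() is used with glob patterns here rather than the proposed position argument):

require(quanteda)
require(stringi)

txt <- "RT @SputnikInt: #Trump shifts away from #cybersecurity group"

# minimal split, compound the tag markers with the following word, then keep only the tags
tags <- as.tokens(stri_split_boundaries(txt, type = "word", skip_word_none = FALSE)) %>% 
  tokens_compound(phrase(c("# *", "@ *")), concatenator = "") %>% 
  tokens_select(c("#*", "@*"), selection = "keep", valuetype = "glob")
as.list(tags)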

kbenoit added this to the v2.0 milestone Dec 18, 2018
kbenoit modified the milestones: v2.0, version 2.0 Feb 24, 2019
koheiw (Collaborator, Author) commented Dec 12, 2019

If we really want to move to post-processing, it would be better to add a window argument to tokens_compound(), because

tokens_compound(list(c("#", "*"), c("@", "*")), concatenator = "") # twitter
tokens_compound(c("*", "-"), c("-", "*")), concatenator = "") # hyphen

is the same as

tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") # twitter
tokens_compound("-", window = c(1, 1), concatenator = "") # hyphen

The window approach is better because we can avoid the use of "*".
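
For the record, a small sketch of the equivalence on a made-up tweet (assuming tokens_compound() gains a window argument with the semantics described above):

require(quanteda)
require(stringi)

txt <- "RT @SputnikInt: #Trump shifts away from US-Russian #cybersecurity group"
toks <- as.tokens(stri_split_boundaries(txt, type = "word", skip_word_none = FALSE))

# wildcard-based compounding
a <- toks %>% 
  tokens_compound(phrase(c("# *", "@ *", "* -", "- *")), concatenator = "") %>% 
  tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")

# window-based compounding, no "*" needed
b <- toks %>% 
  tokens_compound("-", window = c(1, 1), concatenator = "") %>% 
  tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") %>% 
  tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")

identical(as.list(a), as.list(b))  # should be TRUE if the two calls are really equivalent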

koheiw (Collaborator, Author) commented Dec 15, 2019

Post-processing of Twitter tags is about 3 times faster than the current pre-processing, once the ~47 s cost of the bare split() step is subtracted from both: (129 - 47) / (75 - 47) ≈ 2.9.

require(quanteda)
require(stringi)
require(lubridate)
quanteda_options(threads = 8)

corp <- readRDS("~/Documents/Sputnik/Data/data_corpus_tweets.RDS")

split <- function(x) {
  stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>% 
    as.tokens()
}

post <- function(x) {
  stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>% 
    as.tokens() %>% 
    tokens_compound(list(c("#", "*"), c("@", "*"), c("*", "-"), c("-", "*")), concatenator = "") %>% 
    tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
}

post2 <- function(x) {
  stri_split_boundaries(x, type = "word", skip_word_none = FALSE) %>% 
    as.tokens() %>% 
    tokens_compound("-", window = c(1, 1), concatenator = "") %>% 
    tokens_compound(c("#", "@"), window = c(0, 1), concatenator = "") %>% 
    tokens_remove("^[\\p{Z}\\p{C}]+$", valuetype = "regex")
}

pre <- function (x) {
  tokens(x, remove_punct = FALSE)
}

txt <- texts(corp)
microbenchmark::microbenchmark(
  split(txt),
  post(txt), 
  post2(txt),
  pre(txt), 
  times = 10
)

txt2 <- head(txt, 10000)
lis <- stri_split_boundaries(txt2, type = "word", skip_word_none = FALSE)
microbenchmark::microbenchmark(
  as.tokens(lis),
  times = 10
)

profvis::profvis(
  as.tokens(lis)
)

v <- unlist(lis, use.names = FALSE)
microbenchmark::microbenchmark(
  #fastmatch::coalesce(v),
  unique(v),
  v[!duplicated(v)],
  rle(v),
  times = 2
)

toks <- tokens(txt)
microbenchmark::microbenchmark(
  tokens_compound(toks, phrase("not *"), concatenator = ""),
  tokens_compound(toks, "not", window = c(0, 1), concatenator = ""),
  times = 5
)
Unit: seconds
       expr       min        lq      mean    median        uq       max neval
 split(txt)  44.87085  46.32514  47.64618  47.55630  49.03696  50.73318    10
  post(txt)  72.38992  75.56673  78.32646  78.83763  80.78479  82.59906    10
 post2(txt)  69.82946  71.31607  75.88459  73.76547  74.61518 100.48264    10
   pre(txt) 123.00084 124.02030 129.71545 126.18319 130.29893 153.10076    10


> length(txt)
[1] 1292364

> post2(head(txt))
tokens from 6 documents.
text1 :
 [1] "Trump"         "Shifts"        "Away"          "From"          "US-Russian"   
 [6] "Cybersecurity" "Group"         " - "           "Russian"       "MP"           
[11] ":"             "https"         ":"             "/"             "/"            
[16] "t.co"          "/"             "y0v00rk0XC"    "via"           "@SputnikInt"  

text2 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"             

text3 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"             

text4 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"             

text5 :
 [1] "#Trump"         "shifts"         "away"           "from"           "US-Russian"    
 [6] "#cybersecurity" "group"          "due"            "to"             "pressure"      
[11] " - "            "Russian"        "MP"             "https"          ":"             
[16] "/"              "/"              "t.co"           "/"              "RbHsX6C8BX"    
[21] "…"              "https"          ":"              "/"              "/"             
[26] "t.co"           "/"              "rkLsYeCYgM"    

text6 :
 [1] "RT"             "@SputnikInt"    ":"              "#Trump"         "shifts"        
 [6] "away"           "from"           "US-Russian"     "#cybersecurity" "group"         
[11] "due"            "to"             "pressure"       " - "            "Russian"       
[16] "MP"             "https"          ":"              "/"              "/"             
[21] "t.co"           "/"              "RbHsX6C8BX"     "#USRussia"      "#put"          
[26] "…"
