tokens() options are inconsistent with what = "fasterword" #1447

kbenoit · 2018-10-07T16:54:43Z

Different

remove_punct = TRUE has no effect:

tokens("This, a test.", what = "fasterword", remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "This," "a"     "test."

remove_twitter makes no difference for Twitter characters:

tokens("@kenbenoit #quanteda", what = "fasterword", 
       remove_twitter = TRUE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "@kenbenoit" "#quanteda" 

tokens("@kenbenoit #quanteda", what = "fasterword", 
       remove_twitter = FALSE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "@kenbenoit" "#quanteda"

remove_separators behaves differently:

txt <- "one two\nthree\tfour"

# unexpected
tokens(txt, what = "fasterword", remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "one"              "two\nthree\tfour"

# correct
tokens(txt, what = "fasterword", remove_separators = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"   "two"   "three" "four" 

# correct
tokens(txt, what = "word", remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "one"   " "     "two"   "\n"    "three" "\t"    "four" 

# correct
tokens(txt, what = "word", remove_separators = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"   "two"   "three" "four"

Same

remove_url = TRUE is ok:

tokens("https://quanteda.io is our website", what = "fasterword", remove_url = TRUE)
## tokens from 1 document.
## text1 :
## [1] "is"      "our"     "website"

tokens("https://quanteda.io is our website", what = "fasterword", remove_url = FALSE)
## tokens from 1 document.
## text1 :
## [1] "https://quanteda.io" "is"      "our"     "website"

remove_numbers is ok:

tokens("99 red balloons 4ever", what = "fasterword", remove_numbers = TRUE) %>%
    as.character()
## [1] "red"      "balloons" "4ever"
tokens("99 red balloons 4ever", what = "fasterword", remove_numbers = FALSE) %>%
    as.character()
## [1] "99"       "red"      "balloons" "4ever"

remove_symbols is ok:

tokens("2 < 3 = TRUE #logic", what = "fasterword", remove_symbols = TRUE) %>%
    as.character()
## [1] "2"      "3"      "TRUE"   "#logic"
tokens("2 < 3 = TRUE #logic", what = "fasterword", remove_symbols = FALSE) %>%
    as.character()
## [1] "2"      "<"      "3"      "="      "TRUE"   "#logic"

remove_hyphens is ok:

txt <- "Jacob-Rees - second-rate"
identical(
    tokens(txt, what = "word", remove_hyphens = FALSE) %>% as.character(),
    tokens(txt, what = "fasterword", remove_hyphens = FALSE) %>% as.character()
)
## [1] TRUE
identical(
    tokens(txt, what = "word", remove_hyphens = TRUE) %>% as.character(),
    tokens(txt, what = "fasterword", remove_hyphens = TRUE) %>% as.character()
)
## [1] TRUE


koheiw · 2018-10-08T21:56:41Z

This seems to be caused by a change in #1420

quanteda/R/tokens.R

Lines 670 to 674 in bedca07

    
    tok <- if (remove_separators) {
        stri_split_regex(txt, "\\p{WHITE_SPACE}+")
    } else {
        stri_split_regex(txt, "\\p{Z}+")
    }
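To see why the two branches differ on the `txt` from the report, the two patterns can be run directly with stringi. This is only a minimal sketch outside quanteda, and the names `split_z` and `split_ws` are made up for illustration. `\p{Z}` is the Unicode Separator category, which includes the space but not the control characters `\n` and `\t`, while `\p{WHITE_SPACE}` includes all three.

```r
library(stringi)

txt <- "one two\nthree\tfour"

# \p{Z} matches separator characters (e.g. the space) but not "\n" or "\t",
# so only the first gap is split
split_z <- stri_split_regex(txt, "\\p{Z}+")[[1]]
print(split_z)
## [1] "one"              "two\nthree\tfour"

# \p{WHITE_SPACE} also matches "\n" and "\t", so every gap is split
split_ws <- stri_split_regex(txt, "\\p{WHITE_SPACE}+")[[1]]
print(split_ws)
## [1] "one"   "two"   "three" "four"
```

The first result matches the "unexpected" output reported for what = "fasterword" with remove_separators = FALSE.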


kbenoit · 2018-10-24T14:24:42Z

#1459 solved the separators issue; that now leaves just the punctuation and Twitter differences.

kbenoit added the tokens label Oct 7, 2018

koheiw added a commit that referenced this issue Oct 22, 2018

Address #1447

8e206f9

- Revert change in ad359e6.
- Fix handling of new line markers.
- Update documentation as well.
- Correct tests.

kbenoit added a commit that referenced this issue Oct 24, 2018

Merge pull request #1459 from quanteda/issue-1447

73d63a1

Address #1447

This was referenced Oct 24, 2018

tokens(x, remove_punct) behaviour is inconsistent for what = "fast(er|est)word" #1464

Closed

tokens(x, remove_twitter) behaviour is inconsistent for what = "fast(er|est)word" #1465

Closed

kbenoit closed this as completed Oct 24, 2018
