tokens() options are inconsistent with what = "fasterword" #1447

Closed
kbenoit opened this issue Oct 7, 2018 · 2 comments

kbenoit commented Oct 7, 2018

Different

remove_punct = TRUE has no effect:

tokens("This, a test.", what = "fasterword", remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "This," "a"     "test."

Twitter characters are kept regardless of remove_twitter:

tokens("@kenbenoit #quanteda", what = "fasterword", 
       remove_twitter = TRUE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "@kenbenoit" "#quanteda" 

tokens("@kenbenoit #quanteda", what = "fasterword", 
       remove_twitter = FALSE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "@kenbenoit" "#quanteda" 
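
One possible reading of the results above, assuming that "fasterword" splits purely on whitespace (as the tokens.R excerpt quoted later in this thread suggests): punctuation stays attached because nothing ever separates it from the word. The sketch below uses only stringi to illustrate this. The stripping step is a hypothetical post-processing workaround, not quanteda's implementation or the eventual fix.

```r
library(stringi)

# Splitting on whitespace alone keeps punctuation attached to the word.
tok <- stri_split_regex("This, a test.", "\\p{WHITE_SPACE}+")[[1]]
print(tok)

# Hypothetical workaround (an assumption, not quanteda code):
# strip leading and trailing punctuation from each token.
stripped <- stri_replace_all_regex(tok, "^\\p{P}+|\\p{P}+$", "")
print(stripped)
```

Note that `\p{P}` also matches `@` and `#`, so a workaround like this would need extra care to honour remove_twitter = FALSE.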

remove_separators behaves differently:

txt <- "one two\nthree\tfour"

# unexpected
tokens(txt, what = "fasterword", remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "one"              "two\nthree\tfour"

# correct
tokens(txt, what = "fasterword", remove_separators = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"   "two"   "three" "four" 

# correct
tokens(txt, what = "word", remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "one"   " "     "two"   "\n"    "three" "\t"    "four" 

# correct
tokens(txt, what = "word", remove_separators = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"   "two"   "three" "four" 

Same

remove_url = TRUE is ok:

tokens("https://quanteda.io is our website", what = "fasterword", remove_url = TRUE)
## tokens from 1 document.
## text1 :
## [1] "is"      "our"     "website"

tokens("https://quanteda.io is our website", what = "fasterword", remove_url = FALSE)
## tokens from 1 document.
## text1 :
## [1] "https://quanteda.io" "is"      "our"     "website"

remove_numbers is ok:

tokens("99 red balloons 4ever", what = "fasterword", remove_numbers = TRUE) %>%
    as.character()
## [1] "red"      "balloons" "4ever"
tokens("99 red balloons 4ever", what = "fasterword", remove_numbers = FALSE) %>%
    as.character()
## [1] "99"       "red"      "balloons" "4ever"  

remove_symbols is ok:

tokens("2 < 3 = TRUE #logic", what = "fasterword", remove_symbols = TRUE) %>%
    as.character()
## [1] "2"      "3"      "TRUE"   "#logic"
tokens("2 < 3 = TRUE #logic", what = "fasterword", remove_symbols = FALSE) %>%
    as.character()
## [1] "2"      "<"      "3"      "="      "TRUE"   "#logic" 

remove_hyphens is ok:

txt <- "Jacob-Rees - second-rate"
identical(
    tokens(txt, what = "word", remove_hyphens = FALSE) %>% as.character(),
    tokens(txt, what = "fasterword", remove_hyphens = FALSE) %>% as.character()
)
## [1] TRUE
identical(
    tokens(txt, what = "word", remove_hyphens = TRUE) %>% as.character(),
    tokens(txt, what = "fasterword", remove_hyphens = TRUE) %>% as.character()
)
## [1] TRUE
kbenoit added the tokens label Oct 7, 2018

koheiw commented Oct 8, 2018

This seems to be caused by a change in #1420:

quanteda/R/tokens.R

Lines 670 to 674 in bedca07

tok <- if (remove_separators) {
    stri_split_regex(txt, "\\p{WHITE_SPACE}+")
} else {
    stri_split_regex(txt, "\\p{Z}+")
}
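
The difference between the two patterns can be reproduced with stringi alone. This is a minimal sketch on the `txt` value from the report, not a test of `tokens()` itself: `\p{Z}` covers only the Unicode separator categories, so newline and tab (which are control characters) are not treated as split points.

```r
library(stringi)

txt <- "one two\nthree\tfour"

# \p{Z} matches the separator categories (Zs, Zl, Zp) only;
# "\n" and "\t" are control characters (Cc), so they are not split on.
split_z <- stri_split_regex(txt, "\\p{Z}+")[[1]]

# \p{WHITE_SPACE} is the Unicode White_Space property, which includes them.
split_ws <- stri_split_regex(txt, "\\p{WHITE_SPACE}+")[[1]]

print(split_z)
print(split_ws)
```

This matches the "unexpected" output in the report, where only the plain space produced a split.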

koheiw added a commit that referenced this issue Oct 22, 2018
- Revert change in ad359e6.
- Fix handling of new line markers.
- Update documentation as well.
- Correct tests
kbenoit added a commit that referenced this issue Oct 24, 2018

kbenoit commented Oct 24, 2018

#1459 solved the separators issue, so that just leaves the punctuation and Twitter differences.
