tokens(x, what = "character") fails with Twitter characters #637

Closed
kbenoit opened this issue Apr 5, 2017 · 0 comments · Fixed by #709

kbenoit commented Apr 5, 2017

tokens("This: is, a @test!", what = "character", remove_punct = FALSE)
# tokens from 1 document.
# Component 1 :
#     [1] "T" "h" "i" "s" ":" "i" "s" "," "a" "_" "a" "s" "_" "t" "e" "s" "t" "!"

tokens("This: is, a @test!", what = "character", remove_punct = TRUE)
# tokens from 1 document.
# Component 1 :
#  [1] "T" "h" "i" "s" "i" "s" "a" "a" "s" "t" "e" "s" "t"

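This almost certainly stems from our handling of Twitter characters, even though that handling is not supposed to apply to character segmentation. The step that restores these characters after tokenizing fails because its regex cannot match once the text has been segmented into single characters: in the output above, the "@" comes back as "_" "a" "s" "_" when punctuation is kept, and as a stray "a" "s" when punctuation is removed.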
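A minimal base-R sketch of the suspected mechanism, not the actual quanteda code. It assumes the "@" is swapped for a placeholder before segmentation and swapped back afterwards; the placeholder `_as_` is only inferred from the `"_" "a" "s" "_"` tokens in the output above, and the two-step protect/restore flow and all variable names are illustrative.

```r
# Hypothetical placeholder, inferred from the "_" "a" "s" "_" tokens above
placeholder <- "_as_"
x <- "This: is, a @test!"

# Step 1: protect "@" before segmentation
protected <- gsub("@", placeholder, x, fixed = TRUE)

# Step 2a: word-like segmentation keeps the placeholder inside one token,
# so restoring it per token works
words <- strsplit(protected, "[[:space:]]+")[[1]]
restored_words <- gsub(placeholder, "@", words, fixed = TRUE)
# restored_words holds: "This:", "is,", "a", "@test!"

# Step 2b: character segmentation splits the placeholder into four tokens,
# so no single token contains "_as_" and the restore step is a no-op
chars <- strsplit(gsub("[[:space:]]", "", protected), "")[[1]]
restored_chars <- gsub(placeholder, "@", chars, fixed = TRUE)
# restored_chars still contains "_", "a", "s", "_" and no "@"
```

Under these assumptions, the character-level result reproduces the `remove_punct = FALSE` output above, which suggests a patch could simply skip the placeholder substitution when `what = "character"`.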

I want to overhaul the whole token-segmentation code, but until we do we ought to fix this with a patch.

@kbenoit kbenoit added the tokens label Apr 5, 2017
@kbenoit kbenoit modified the milestone: CRAN v0.9.9.9000 Apr 6, 2017
koheiw added a commit that referenced this issue May 4, 2017
@koheiw koheiw mentioned this issue May 4, 2017