tokens(x, what = "character") fails with Twitter characters #637

kbenoit · 2017-04-05T22:49:47Z

tokens("This: is, a @test!", what = "character", remove_punct = FALSE)
# tokens from 1 document.
# Component 1 :
#     [1] "T" "h" "i" "s" ":" "i" "s" "," "a" "_" "a" "s" "_" "t" "e" "s" "t" "!"

tokens("This: is, a @test!", what = "character", remove_punct = TRUE)
# tokens from 1 document.
# Component 1 :
#  [1] "T" "h" "i" "s" "i" "s" "a" "a" "s" "t" "e" "s" "t"

It no doubt has to do with our handling of Twitter characters, even though this handling is not supposed to apply to character segmentation. The replacement after tokenizing fails because the regex that matches the placeholder does not work once the text has been segmented into characters.

I want to overhaul the whole token-segmentation code, but until we do we ought to fix this with a patch.
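
The failure can be imitated outside quanteda with a few lines of base R. This is only a sketch of the protect/segment/restore idea, not the package's actual code: the `segment_chars()` helper and its `protect_twitter` argument are hypothetical, and the `_as_` placeholder is inferred from the `"_" "a" "s" "_"` tokens in the output above.

```r
# Illustrative only: a base-R imitation of the protect/segment/restore steps,
# not quanteda's actual implementation. The "_as_" placeholder is an
# assumption inferred from the output shown in this issue.
segment_chars <- function(x, protect_twitter = TRUE) {
  # Step 1: swap "@" for a placeholder so a tokenizer would not split on it
  if (protect_twitter) x <- gsub("@", "_as_", x, fixed = TRUE)
  # Step 2: segment into single characters and drop spaces
  chars <- strsplit(x, "")[[1]]
  chars <- chars[chars != " "]
  # Step 3: restore the placeholder token by token; this only works if a
  # token still contains the whole placeholder
  gsub("_as_", "@", chars, fixed = TRUE)
}

buggy <- segment_chars("This: is, a @test!")
fixed <- segment_chars("This: is, a @test!", protect_twitter = FALSE)

"@" %in% buggy   # FALSE: the placeholder was split into "_" "a" "s" "_"
"@" %in% fixed   # TRUE: skipping the protection step keeps "@" intact
```

Under these assumptions, one possible interim patch is simply to skip the Twitter-character protection when `what = "character"`, since character segmentation never needed it.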

kbenoit added the tokens label Apr 5, 2017

kbenoit assigned koheiw Apr 5, 2017

kbenoit modified the milestone: CRAN v0.9.9.9000 Apr 6, 2017

koheiw added a commit that referenced this issue May 4, 2017

Add test for #637

66413ff

koheiw mentioned this issue May 4, 2017

Issue 637 #709

Merged

kbenoit closed this as completed in #709 May 4, 2017
