tokens(x, what = "character") fails with Twitter characters #637

@kbenoit

tokens("This: is, a @test!", what = "character", remove_punct = FALSE)
# tokens from 1 document.
# Component 1 :
#     [1] "T" "h" "i" "s" ":" "i" "s" "," "a" "_" "a" "s" "_" "t" "e" "s" "t" "!"

tokens("This: is, a @test!", what = "character", remove_punct = TRUE)
# tokens from 1 document.
# Component 1 :
#  [1] "T" "h" "i" "s" "i" "s" "a" "a" "s" "t" "e" "s" "t"

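This is almost certainly caused by our handling of Twitter characters, even though that handling is not supposed to apply to character segmentation. Judging from the output, the "@" is swapped for a placeholder ("_as_") before tokenizing, and the step that restores it afterwards fails because its regex cannot match once the placeholder has itself been split into single characters. With remove_punct = TRUE the underscores are then dropped, leaving a stray "a" "s" in the result.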
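The failure mode can be reproduced in base R without quanteda. This is only a sketch of the mechanism described above, not the package's internal code: the placeholder string "_as_" is inferred from the output shown earlier, and the variable names are made up for illustration.

```r
# Hypothetical illustration of the bug; NOT quanteda's actual internals.
x <- "This: is, a @test!"

# Step 1 (assumed): protect "@" with a placeholder before segmentation.
protected <- gsub("@", "_as_", x, fixed = TRUE)

# Step 2: character segmentation also splits the placeholder apart.
chars <- strsplit(gsub("\\s", "", protected), "")[[1]]

# Step 3: restoring token by token matches nothing, because no single
# token equals "_as_" any more, so the tokens come back unchanged.
restored <- gsub("_as_", "@", chars, fixed = TRUE)
unchanged <- identical(restored, chars)

# One possible patch (an assumption): skip the placeholder step entirely
# when segmenting by character, so "@" survives as its own token.
chars_direct <- strsplit(gsub("\\s", "", x), "")[[1]]
has_at <- "@" %in% chars_direct
```

Here `chars` reproduces the 18 tokens from the first example, including the stray "_" "a" "s" "_", while `chars_direct` keeps "@" intact.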

I want to overhaul the whole token-segmentation code, but until we do we ought to fix this with a patch.
