tokens("This: is, a @test!", what = "character", remove_punct = FALSE)
# tokens from 1 document.
# Component 1 :
# [1] "T" "h" "i" "s" ":" "i" "s" "," "a" "_" "a" "s" "_" "t" "e" "s" "t" "!"
tokens("This: is, a @test!", what = "character", remove_punct = TRUE)
# tokens from 1 document.
# Component 1 :
# [1] "T" "h" "i" "s" "i" "s" "a" "a" "s" "t" "e" "s" "t"
This no doubt has to do with our handling of Twitter characters, even though that handling is not supposed to apply to character segmentation. The "@" appears to be swapped for a placeholder before tokenizing (the "_" "a" "s" "_" in the first output above), and the replacement after tokenizing fails because the regex that matches the placeholder does not work once character segmentation has split it into single characters.
I want to overhaul the whole token-segmentation code, but until we do, we ought to fix this with a patch.
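
To make the suspected mechanism concrete, here is a minimal base R sketch. It is not quanteda's actual code: the placeholder "_as_" is inferred from the output above, and the three steps (protect, segment, restore) are an assumed simplification of what the tokenizer does internally.

```r
# Hypothetical illustration only; this is not the quanteda implementation.
x <- "This: is, a @test!"

# Step 1 (assumed): protect "@" with a placeholder before tokenizing.
# The placeholder "_as_" is inferred from the output shown above.
protected <- gsub("@", "_as_", x, fixed = TRUE)

# Step 2: character segmentation (spaces dropped) splits the placeholder too.
chars <- strsplit(gsub(" ", "", protected, fixed = TRUE), "")[[1]]

# Step 3 (assumed): the restore step looks for the whole placeholder inside
# each token. No single-character token contains it, so nothing is restored.
restored <- gsub("_as_", "@", chars, fixed = TRUE)
identical(restored, chars)   # TRUE: the restore step changed nothing

# One possible patch: skip the substitution entirely for character
# segmentation, so "@" survives as an ordinary character.
chars_fixed <- strsplit(gsub(" ", "", x, fixed = TRUE), "")[[1]]
"@" %in% chars_fixed         # TRUE
```

Skipping the substitution when what = "character" is only one option for the patch; restoring the placeholder before segmenting would be another.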