tokens(x, what = "fasterword") bug? not splitting on \n \t \r #1420

kbenoit · 2018-08-30T14:34:02Z

Because of how we split for what = "fasterword", we are running into this problem:

> tokens("one\ntwo\tthree", what = "fasterword")
tokens from 1 document.
text1 :
[1] "one\ntwo\tthree"

Those should be split into three tokens.

This behaviour seems to come from stringi:

> stringi::stri_split_regex("one\ntwo\tthree", "\\p{Z}+")
[[1]]
[1] "one\ntwo\tthree"

> stringi::stri_split_regex("one\ntwo\tthree", "\\p{WHITE_SPACE}+")
[[1]]
[1] "one"   "two"   "three"

Because I expected the Z Unicode category to match \t and \n, I filed an issue for this at gagolews/stringi#327.
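
A possible explanation for the difference, offered as a sketch and not as the fix that was eventually merged: in Unicode, \n and \t have general category Cc (control), while Z covers only the separator categories (Zs, Zl, Zp). The snippet below checks this with stringi and then splits on \p{WHITE_SPACE}, which does include those control characters. The helper name split_on_whitespace is hypothetical and only for illustration; it is not part of quanteda or stringi.

```r
library(stringi)

# Characters to probe: newline, tab, and an ordinary space
chars <- c("\n", "\t", " ")

# Does each character belong to the Z (separator) general category?
in_z <- stri_detect_regex(chars, "\\p{Z}")

# Does each character belong to the Cc (control) general category?
in_cc <- stri_detect_regex(chars, "\\p{Cc}")

# Does each character have the White_Space binary property?
in_ws <- stri_detect_regex(chars, "\\p{WHITE_SPACE}")

# Hypothetical helper (an assumption for illustration, not a quanteda API):
# split a character vector on runs of any White_Space character.
split_on_whitespace <- function(x) {
  stri_split_regex(x, "\\p{WHITE_SPACE}+")
}

tokens_ws <- split_on_whitespace("one\ntwo\tthree")[[1]]

print(data.frame(char = c("\\n", "\\t", "space"), in_z, in_cc, in_ws))
print(tokens_ws)
```

If this reading is right, the stringi result for \p{Z} matches the Unicode definition, and a pattern based on White_Space (or one that adds the relevant control characters explicitly) would be needed to split on \n, \t and \r.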

Fix #1420

kbenoit added bug tokens labels Aug 30, 2018

kbenoit assigned kbenoit and koheiw Aug 30, 2018

koheiw closed this as completed in ad359e6 Sep 2, 2018

koheiw added a commit that referenced this issue Sep 2, 2018

Merge pull request #1424 from quanteda/Issue-1420

78ce83b

Fix #1420

koheiw mentioned this issue Oct 8, 2018

tokens() options are inconsistent with what = "fasterword" #1447

Closed
