tokens(x, what = "fasterword") bug? not splitting on \n \t \r #1420

Closed
kbenoit opened this issue Aug 30, 2018 · 0 comments

kbenoit commented Aug 30, 2018

Because of how we split when what = "fasterword", we are running into this problem:

> tokens("one\ntwo\tthree", what = "fasterword")
tokens from 1 document.
text1 :
[1] "one\ntwo\tthree"

Those should be split into three tokens.

This behaviour seems to come from stringi:

> stringi::stri_split_regex("one\ntwo\tthree", "\\p{Z}+")
[[1]]
[1] "one\ntwo\tthree"

> stringi::stri_split_regex("one\ntwo\tthree", "\\p{WHITE_SPACE}+")
[[1]]
[1] "one"   "two"   "three"

Because I would expect the Unicode Z category to match \t and \n, I filed an issue for this at gagolews/stringi#327.
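A possible workaround, sketched here under the assumption that splitting on the WHITE_SPACE property is acceptable for "fasterword", is shown below. The helper name `split_on_whitespace` is hypothetical and not part of quanteda or stringi. The two `stri_detect_charclass()` calls are only a diagnostic: as far as I can tell, \n and \t have the Unicode general category Cc (control) rather than one of the Z separator categories, which would explain why `\p{Z}+` leaves them alone.

```r
library(stringi)

# Hypothetical helper (not part of quanteda): split each string on runs of
# characters that carry the Unicode White_Space property, which covers
# space, \n, \t and \r.
split_on_whitespace <- function(x) {
  stri_split_regex(x, "\\p{WHITE_SPACE}+", omit_empty = TRUE)
}

x <- "one\ntwo\tthree"
split_on_whitespace(x)[[1]]

# Diagnostic: which of newline, tab and space fall into each class?
stri_detect_charclass(c("\n", "\t", " "), "\\p{Z}")
stri_detect_charclass(c("\n", "\t", " "), "\\p{WHITE_SPACE}")
```

If this holds up, the pattern used for "fasterword" could be switched from `\p{Z}+` to `\p{WHITE_SPACE}+` without waiting on a change in stringi.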
