dfm error with special characters #554

Closed
jhfowler opened this issue Feb 15, 2017 · 4 comments · Fixed by #555

@jhfowler

I am trying to turn a list of tweets into a dfm, and it occasionally throws an error on special characters. Here's an example:

library(quanteda)
x <- dfm(corpus("あ゙づい゙"))
Error in Matrix::sparseMatrix(i = index, p = cumsum(c(1, lengths(x))) - :
NA's in (i,j) are not allowed

@jhfowler
Author

jhfowler commented Feb 16, 2017

On further inspection, it seems like the error is always caused by Hiragana. Strings that fail:
"づい゙"
"゛ん゙"
"たー゚"
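
To narrow down which characters are involved, it may help to list the code points of each failing string. The snippet below is only a sketch in base R: `code_points` is a hypothetical helper name, and it assumes the session reads these literals as UTF-8. It also checks for the combining voiced and semi-voiced sound marks (U+3099, U+309A), which is a guess about where to look, not a confirmed cause.

```r
# Strings reported as failing in this thread, copied verbatim.
failing <- c("づい゙", "゛ん゙", "たー゚")

# Hypothetical helper: label each code point of a single string as U+XXXX.
code_points <- function(s) sprintf("U+%04X", utf8ToInt(enc2utf8(s)))

cps <- lapply(failing, code_points)
names(cps) <- failing
print(cps)

# Flag strings containing the combining (semi-)voiced sound marks, if any.
has_combining <- vapply(
  cps,
  function(cp) any(cp %in% c("U+3099", "U+309A")),
  logical(1)
)
print(has_combining)
```

Comparing this output for strings that fail and strings that work should show whether the problem is tied to specific code points rather than to Hiragana in general.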

@kbenoit
Collaborator

kbenoit commented Feb 16, 2017

OK, definitely one for @koheiw. It's an issue with the hashing function:

> tokens(c("づい゙", "゛ん゙", "たー゚"))
tokens from 3 documents.
Component 1 :
[1] ""  NA    "い゙"

Component 2 :
[1] ""  NA    "ん゙"

Component 3 :
[1] ""  NA    "ー゚"

> tokens(c("づい゙", "゛ん゙", "たー゚"), hash = FALSE)
tokenizedTexts from 3 documents.
Component 1 :
[1] ""  ""    "い゙"

Component 2 :
[1] ""  ""    "ん゙"

Component 3 :
[1] ""  ""    "ー゚"

@jhfowler
Author

jhfowler commented Feb 16, 2017 via email

@kbenoit
Collaborator

kbenoit commented Feb 16, 2017 via email

koheiw mentioned this issue Feb 17, 2017
kbenoit added a commit that referenced this issue Feb 18, 2017
But skip test on CI and CRAN.