dfm error with special characters #554

jhfowler · 2017-02-15T18:59:24Z

I am trying to turn a list of tweets into a dfm, and it occasionally throws an error on special characters. Here's an example:

library(quanteda)
x <- dfm(corpus("あﾞづいﾞ"))
Error in Matrix::sparseMatrix(i = index, p = cumsum(c(1, lengths(x))) - :
NA's in (i,j) are not allowed

jhfowler · 2017-02-16T21:35:18Z

On further inspection, it seems like the error is always caused by Hiragana. Strings that fail:
"づいﾞ"
"゛んﾞ"
"たーﾟ"
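
To see exactly which characters these strings contain, one option is to print their Unicode code points. The following is a minimal base-R sketch (no quanteda calls). The variable names are illustrative, and the idea that the halfwidth sound marks U+FF9E and U+FF9F are worth checking for is an assumption, not something confirmed in this thread.

```r
# The three strings reported as failing in this thread.
failing <- c("づいﾞ", "゛んﾞ", "たーﾟ")

# Convert each string to its Unicode code points, formatted as U+XXXX.
code_points <- lapply(failing, function(s) {
  sprintf("U+%04X", utf8ToInt(enc2utf8(s)))
})
names(code_points) <- failing
print(code_points)

# Assumption: check for the halfwidth voiced (U+FF9E) and
# semi-voiced (U+FF9F) sound marks in each string.
halfwidth_marks <- c(0xFF9E, 0xFF9F)
has_halfwidth_mark <- vapply(failing, function(s) {
  any(utf8ToInt(enc2utf8(s)) %in% halfwidth_marks)
}, logical(1))
print(has_halfwidth_mark)
```

If every failing string is flagged, that would suggest looking at the trailing sound mark rather than at the hiragana letters themselves.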

kbenoit · 2017-02-16T22:21:59Z

OK, definitely one for @koheiw. It's an issue with the hashing function:

> tokens(c("づいﾞ", "゛んﾞ", "たーﾟ"))
tokens from 3 documents.
Component 1 :
[1] "づ"  NA    "いﾞ"

Component 2 :
[1] "゛"  NA    "んﾞ"

Component 3 :
[1] "た"  NA    "ーﾟ"

> tokens(c("づいﾞ", "゛んﾞ", "たーﾟ"), hash = FALSE)
tokenizedTexts from 3 documents.
Component 1 :
[1] "づ"  ""    "いﾞ"

Component 2 :
[1] "゛"  ""    "んﾞ"

Component 3 :
[1] "た"  ""    "ーﾟ"
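
The two outputs above differ only in the middle token: NA with hashing and an empty string without. As a rough illustration, and not the fix that was applied to the package, the sketch below drops both kinds of token from a plain list of character vectors. The lists are hand-written stand-ins for the tokenizer output, and `drop_bad_tokens` is a hypothetical helper name.

```r
# Hand-written stand-ins for the two outputs shown above,
# stored as plain lists of character vectors (no quanteda needed).
toks_hashed <- list(
  c("づ", NA, "いﾞ"),
  c("゛", NA, "んﾞ"),
  c("た", NA, "ーﾟ")
)
toks_unhashed <- list(
  c("づ", "", "いﾞ"),
  c("゛", "", "んﾞ"),
  c("た", "", "ーﾟ")
)

# Hypothetical helper: remove NA and zero-length tokens from each document.
drop_bad_tokens <- function(x) {
  lapply(x, function(tok) tok[!is.na(tok) & nzchar(tok)])
}

clean_hashed <- drop_bad_tokens(toks_hashed)
clean_unhashed <- drop_bad_tokens(toks_unhashed)
print(clean_hashed)
```

After cleaning, both versions hold the same two tokens per document, which is the property a downstream sparse matrix constructor needs, since it rejects NA indices.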

jhfowler · 2017-02-16T22:42:35Z

Thanks! BTW I am really loving the package. This is the first snag I've hit. j


kbenoit · 2017-02-16T22:43:50Z

Thanks Jim! Testimonials always welcome at #461.

Ken


koheiw mentioned this issue Feb 17, 2017

Issue hiragana #555

Merged

kbenoit closed this as completed in #555 Feb 18, 2017

kbenoit added a commit that referenced this issue Feb 18, 2017

Add back #554 test to test-tokens.R

b7c89d0

But skip test on CI and CRAN.

kbenoit mentioned this issue Mar 28, 2023

Incoporate RBBI tokenizer for v4.0 #2216

Merged
