I am working with UTF-8 encoded, multilingual text (mostly Hebrew, English, and Russian, plus some other languages and character sets). While trying to construct a dfm, I noticed that even though all the text is read correctly as UTF-8 and handled correctly by the tokens() command, the encoding is lost upon dfm() construction and I get gibberish:
> x
[1] "привет hi שלום moin" "להיתראות пока bye tschüß"
> tokens(x)
tokens from 2 documents.
text1 :
[1] "привет" "hi" "שלום" "moin"
text2 :
[1] "להיתראות" "пока" "bye" "tschüß"
> dfm(tokens(x))
Document-feature matrix of: 2 documents, 8 features (50% sparse).
2 x 8 sparse Matrix of class "dfm"
features
docs привет hi שלו×\u009d moin להיתר×\u0090ות пока bye tschüß
text1 1 1 1 1 0 0 0 0
text2 0 0 0 0 1 1 1 1
> dfm(x)
Document-feature matrix of: 2 documents, 8 features (50% sparse).
2 x 8 sparse Matrix of class "dfm"
features
docs привет hi שלו×\u009d moin להיתר×\u0090ות пока bye tschüß
text1 1 1 1 1 0 0 0 0
text2 0 0 0 0 1 1 1 1
Curiously, though, I have _sometimes_ been able to obtain correct output, for instance like this:
> x
[1] "привет hi שלום moin" "להיתראות пока bye tschüß Hi"
> tokens(x)
tokens from 2 documents.
text1 :
[1] "привет" "hi" "שלום" "moin"
text2 :
[1] "להיתראות" "пока" "bye" "tschüß" "Hi"
> dfm(tokens(x))
Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
2 x 8 sparse Matrix of class "dfm"
features
docs привет hi שלום moin להיתראות
text1 1 1 1 1 0
text2 0 1 0 0 1
features
docs пока bye tschüß
text1 0 0 0
text2 1 1 1
> dfm(x)
Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
2 x 8 sparse Matrix of class "dfm"
features
docs привет hi שלום moin להיתראות
text1 1 1 1 1 0
text2 0 1 0 0 1
features
docs пока bye tschüß
text1 0 0 0
text2 1 1 1
In this case, all I changed was to add one redundant ASCII token to the original text; I have no idea why this would make a difference. What would help me is to understand why the first example loses the encoding in dfm() and the second does not, so that I can build a workaround, or to know whether this could be fixed.
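One way to narrow this down is to check whether the feature names are actually damaged or only displayed incorrectly. The following is a minimal sketch, not a confirmed diagnosis: `inspect_encoding` is a hypothetical helper (it is not part of quanteda) that uses only base R (`Encoding`, `validUTF8`, `nchar`), and the strings are written with `\u` escapes so the script does not depend on the encoding of the source file.

```r
# Hypothetical helper (not a quanteda function): report how R has marked each
# string and whether its underlying bytes are valid UTF-8.
inspect_encoding <- function(s) {
  data.frame(
    declared   = Encoding(s),                # "unknown", "UTF-8", "latin1" or "bytes"
    valid_utf8 = validUTF8(s),               # TRUE if the bytes form valid UTF-8
    n_bytes    = nchar(s, type = "bytes"),   # byte length, independent of display
    stringsAsFactors = FALSE
  )
}

# Two of the tokens from the example: the Hebrew word and an ASCII word.
toks <- c("\u05e9\u05dc\u05d5\u05dd", "hi")
print(inspect_encoding(toks))

# Assumed usage with quanteda loaded (featnames() returns the dfm's features):
# inspect_encoding(featnames(dfm(x)))
```

If the feature names come back marked as UTF-8 with valid bytes, the problem would more likely be in how the dfm is printed than in how it is stored.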
I am using the latest R release in RStudio:
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readtext_0.71 quanteda_1.3.0 stm_1.3.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 xml2_1.2.0 magrittr_1.5 stopwords_0.9.0 munsell_0.5.0 colorspace_1.3-2 tm_0.7-4 lattice_0.20-35 R6_2.2.2
[10] rlang_0.2.1 fastmatch_1.1-0 httr_1.3.1 stringr_1.3.1 plyr_1.8.4 tools_3.5.0 parallel_3.5.0 grid_3.5.0 data.table_1.11.4
[19] gtable_0.2.0 utf8_1.1.4 cli_1.0.0 spacyr_0.9.9 assertthat_0.2.0 lazyeval_0.2.1 RcppParallel_4.4.0 tibble_1.4.2 crayon_1.3.4
[28] Matrix_1.2-14 NLP_0.1-11 ggplot2_2.2.1 SnowballC_0.5.1 slam_0.1-43 stringi_1.1.7 compiler_3.5.0 pillar_1.2.3 scales_0.5.0
[37] lubridate_1.7.4