encoding lost in dfm #1387

@christianbaden

Description

I am working with UTF-8 encoded, multilingual text (mostly Hebrew, English, and Russian, plus some other languages and character sets). When constructing a dfm, I noticed that even though all text is read correctly as UTF-8 and handled correctly by tokens(), the encoding is lost in the dfm() output and some feature names turn into gibberish:

> x
[1] "привет hi שלום moin"      "להיתראות пока bye tschüß"
> tokens(x)
tokens from 2 documents.
text1 :
[1] "привет" "hi"     "שלום"   "moin"  

text2 :
[1] "להיתראות" "пока"     "bye"      "tschüß"  

> dfm(tokens(x))
Document-feature matrix of: 2 documents, 8 features (50% sparse).
2 x 8 sparse Matrix of class "dfm"
       features
docs    привет hi שלו×\u009d moin להיתר×\u0090ות пока bye tschüß
  text1            1  1        1    1                0        0   0        0
  text2            0  0        0    0                1        1   1        1
> dfm(x)
Document-feature matrix of: 2 documents, 8 features (50% sparse).
2 x 8 sparse Matrix of class "dfm"
       features
docs    привет hi שלו×\u009d moin להיתר×\u0090ות пока bye tschüß
  text1            1  1        1    1                0        0   0        0
  text2            0  0        0    0                1        1   1        1

Curiously, though, I have _sometimes_ been able to obtain correct output, for instance like this:

> x
[1] "привет hi שלום moin"         "להיתראות пока bye tschüß Hi"
> tokens(x)
tokens from 2 documents.
text1 :
[1] "привет" "hi"     "שלום"   "moin"  

text2 :
[1] "להיתראות" "пока"     "bye"      "tschüß"   "Hi"      

> dfm(tokens(x))
Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
2 x 8 sparse Matrix of class "dfm"
       features
docs                                              привет hi                             שלום moin                                                         להיתראות
  text1                                                1  1                                1    1                                                                0
  text2                                                0  1                                0    0                                                                1
       features
docs                                пока bye tschüß
  text1                                0   0      0
  text2                                1   1      1
> dfm(x)
Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
2 x 8 sparse Matrix of class "dfm"
       features
docs                                              привет hi                             שלום moin                                                         להיתראות
  text1                                                1  1                                1    1                                                                0
  text2                                                0  1                                0    0                                                                1
       features
docs                                пока bye tschüß
  text1                                0   0      0
  text2                                1   1      1

In this case, all I changed was to add one redundant ASCII token to the original text; I have no idea why this would make a difference. It would help me to understand why the first example loses the encoding in dfm() and the second does not, so that I can build a workaround, or to have this fixed.
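One observation that may help narrow this down: in the broken output only ם (U+05DD, UTF-8 bytes D7 9D) and א (U+05D0, UTF-8 bytes D7 90) are garbled, and 0x9D and 0x90 are bytes with no character assigned in Windows-1252, which is the locale shown in the session info. The sketch below uses base R only and flags characters whose UTF-8 encoding contains such a byte. The helper name chars_at_risk is hypothetical, and the idea that the printing step round-trips through the native code page is an assumption, not something confirmed for dfm(). It also does not explain why the second example prints correctly.

```r
# Bytes that have no character assigned in Windows-1252
cp1252_undefined <- as.raw(c(0x81, 0x8d, 0x8f, 0x90, 0x9d))

# Hypothetical helper: return the distinct characters of a string whose
# UTF-8 byte sequence contains one of the unassigned bytes above
chars_at_risk <- function(s) {
  chars <- strsplit(enc2utf8(s), "")[[1]]
  hit <- vapply(
    chars,
    function(ch) any(charToRaw(ch) %in% cp1252_undefined),
    logical(1),
    USE.NAMES = FALSE
  )
  unique(chars[hit])
}

# \u escapes keep the example independent of the script's file encoding
shalom <- "\u05e9\u05dc\u05d5\u05dd"   # the Hebrew token from text1
tschuess <- "tsch\u00fc\u00df"         # the German token from text2

risky <- chars_at_risk(shalom)
print(utf8ToInt(risky))                # code point(s) of the flagged character(s)
print(length(chars_at_risk(tschuess))) # number of flagged characters
```

If the flagged characters always coincide with the garbled ones in real data, that would point towards a conversion to the native code page somewhere in the printing of feature names rather than towards corrupted data inside the dfm itself.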

I am using the latest R release in RStudio:
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readtext_0.71  quanteda_1.3.0 stm_1.3.3     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       xml2_1.2.0         magrittr_1.5       stopwords_0.9.0    munsell_0.5.0      colorspace_1.3-2   tm_0.7-4           lattice_0.20-35    R6_2.2.2          
[10] rlang_0.2.1        fastmatch_1.1-0    httr_1.3.1         stringr_1.3.1      plyr_1.8.4         tools_3.5.0        parallel_3.5.0     grid_3.5.0         data.table_1.11.4 
[19] gtable_0.2.0       utf8_1.1.4         cli_1.0.0          spacyr_0.9.9       assertthat_0.2.0   lazyeval_0.2.1     RcppParallel_4.4.0 tibble_1.4.2       crayon_1.3.4      
[28] Matrix_1.2-14      NLP_0.1-11         ggplot2_2.2.1      SnowballC_0.5.1    slam_0.1-43        stringi_1.1.7      compiler_3.5.0     pillar_1.2.3       scales_0.5.0      
[37] lubridate_1.7.4   
