Wordscores' predicted values become NA #1380

koheiw · 2018-06-19T03:54:53Z

Describe the bug

`predict()` on a `textmodel_wordscores` model returns NaN for some documents, and NaN for every standard error.

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

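library(quanteda)
# Only the second and third documents get reference scores; NA marks the rest as unscored
ws <- textmodel_wordscores(data_dfm_lbgexample, c(NA, 1.5, 1.0, NA, NA, NA))
predict(ws, newdata = data_dfm_lbgexample, se.fit = TRUE)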

Its output

$fit
      R1       R2       R3       R4       R5       V1 
     NaN 1.385582 1.114418      NaN      NaN 1.279662 

$se.fit
[1] NaN NaN NaN NaN NaN NaN

Expected behavior

`predict()` should return a score for every document.

System information

Please run sessionInfo() and paste the output.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User Edition 5.12

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testthat_2.0.0 quanteda_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       lubridate_1.7.4    lattice_0.20-35    R6_2.2.2           grid_3.4.4         plyr_1.8.4         gtable_0.2.0       magrittr_1.5       scales_0.5.0      
[10] RcppParallel_4.4.0 ggplot2_2.2.1      pillar_1.2.3       stringi_1.2.3      rlang_0.2.1        lazyeval_0.2.1     data.table_1.11.4  Matrix_1.2-14      stopwords_0.9.0   
[19] fastmatch_1.1-0    tools_3.4.4        stringr_1.3.1      munsell_0.5.0      yaml_2.1.19        spacyr_0.9.9       compiler_3.4.4     colorspace_1.3-2   tibble_1.4.2

Additional info

NaN values in the word scores seem to be the cause of the problem:

> ws$wordscores
       A        B        C        D        E        F        G        H        I        J        K        L        M        N        O        P        Q        R        S 
     NaN      NaN      NaN      NaN      NaN 1.500000 1.500000 1.500000 1.500000 1.500000 1.487500 1.487288 1.467949 1.438889 1.382199 1.297927 1.202073 1.117801 1.061111 
       T        U        V        W        X        Y        Z       ZA       ZB       ZC       ZD       ZE       ZF       ZG       ZH       ZI       ZJ       ZK 
1.032051 1.012712 1.012500 1.000000 1.000000 1.000000 1.000000 1.000000      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
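
To illustrate where such values can come from, here is a minimal base R sketch of a Wordscores-style word score computation on made-up counts. It is not quanteda's implementation; the matrix `counts`, the vector `ref_scores`, and the other variable names are hypothetical. The assumption is that a word which never occurs in any scored reference document gets a 0/0 when its frequencies are normalised across references, and that the resulting NaN then propagates into the score of any document that uses the word.

```r
# Toy counts (hypothetical data): rows are scored reference documents,
# columns are words. Word "A" never occurs in either reference document.
counts <- matrix(c(0, 4, 1,
                   0, 1, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("R2", "R3"), c("A", "F", "W")))
ref_scores <- c(R2 = 1.5, R3 = 1.0)

# Relative frequency of each word within each reference document
F_wr <- counts / rowSums(counts)

# Share of each word's relative frequency coming from each reference document.
# The all-zero column "A" is divided by 0, which gives 0 / 0 = NaN.
P_wr <- sweep(F_wr, 2, colSums(F_wr), "/")

# Word score: mean of the reference scores, weighted by P_wr
word_scores <- colSums(P_wr * ref_scores)
print(word_scores)  # A is NaN, F is 1.4, W is 1.1

# A new document that contains "A" inherits the NaN
doc <- c(A = 2, F = 3, W = 5)
doc_score <- sum(doc / sum(doc) * word_scores)
print(doc_score)    # NaN
```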


koheiw · 2018-06-19T04:37:56Z

The easiest solution would be to remove words with NA scores, but it is worth investigating why the NAs are produced.
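
A rough base R sketch of that idea, using made-up numbers: drop every word without a finite score before computing a document's score. The names `word_scores`, `doc`, and `text_score` are hypothetical, and the sketch assumes that relative frequencies are renormalised over the scored words only.

```r
# Hypothetical word scores; "A" has no usable score
word_scores <- c(A = NaN, F = 1.4, W = 1.1)
# Hypothetical word counts for a document to be scored
doc <- c(A = 2, F = 3, W = 5)

# Keep only the words whose score is finite (drops NA and NaN)
scored <- is.finite(word_scores)
kept <- doc[scored]

# Document score: frequency-weighted mean of the remaining word scores
text_score <- sum(kept / sum(kept) * word_scores[scored])
print(text_score)  # 1.2125
```

Dropping the words changes the denominator of the relative frequencies, so the result depends on the surviving vocabulary only.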

koheiw · 2018-06-20T07:38:47Z

A workaround is to use smoothing.
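
A small base R sketch of why smoothing helps, on made-up counts: adding a constant to every count means no word has an all-zero column, so the 0/0 never occurs. This is an illustration only, with hypothetical names (`counts`, `ref_scores`, `smooth_k`); as far as I know `textmodel_wordscores()` exposes a `smooth` argument for this purpose, but the sketch does not call quanteda.

```r
# Toy counts (hypothetical data): word "A" is absent from both reference documents
counts <- matrix(c(0, 4, 1,
                   0, 1, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("R2", "R3"), c("A", "F", "W")))
ref_scores <- c(R2 = 1.5, R3 = 1.0)

# Add-one smoothing: every word now has a positive count in every document
smooth_k <- 1
smoothed <- counts + smooth_k

# Same word score computation as without smoothing
F_wr <- smoothed / rowSums(smoothed)
P_wr <- sweep(F_wr, 2, colSums(F_wr), "/")
smoothed_scores <- colSums(P_wr * ref_scores)
print(smoothed_scores)  # all finite; "A" sits midway between the reference scores
```

The price is that unseen words are pulled toward the middle of the reference scale and the scores of observed words shift as well.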

@koheiw

- Omits features from the `wordscores` in a `textmodel_wordscores` class object that were all zero in a training set.
- Implements a new warning when features are found in `newdata` (through `predict.textmodel_wordscores()`) for which no word scores were computed. These values will not contribute to the computation of the text's predicted document score.
- Keeps the old variable names that @koheiw didn't like because they match the original quantity names from LBG (2003) (for better or worse...)
- Fixes #1380
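
The warning behaviour described above could look roughly like this in base R. This is a sketch under stated assumptions, not the code from the pull request: `predict_sketch()` is a hypothetical helper, and the warning text is invented for illustration.

```r
# Hypothetical helper: score one document from a named vector of word scores.
# Features without a word score trigger a warning and are left out.
predict_sketch <- function(doc, word_scores) {
  scored <- names(doc) %in% names(word_scores)
  if (any(!scored)) {
    warning(sum(!scored), " feature(s) in newdata have no word score and are ignored")
  }
  kept <- doc[scored]
  # Frequency-weighted mean of the word scores of the remaining features
  sum(kept / sum(kept) * word_scores[names(kept)])
}

# "A" was all zero in training, so it has no entry in word_scores
word_scores <- c(F = 1.4, W = 1.1)
doc <- c(A = 2, F = 3, W = 5)

score <- suppressWarnings(predict_sketch(doc, word_scores))
print(score)  # 1.2125
```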

Alternative solution to #1380

koheiw added bug textmodel labels Jun 19, 2018

koheiw assigned kbenoit and koheiw Jun 19, 2018

kbenoit mentioned this issue Jul 3, 2018

Alternative solution to #1380 #1392

Merged

kbenoit closed this as completed in #1392 Jul 3, 2018

kbenoit added a commit that referenced this issue Jul 3, 2018

Merge pull request #1392 from quanteda/issue-1380-alt

9cba696

Alternative solution to #1380
