wordscores's predicted value become NA #1380

Closed
koheiw opened this issue Jun 19, 2018 · 2 comments

koheiw commented Jun 19, 2018

Describe the bug

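textmodel_wordscores() fails to predict scores for some documents: predict() returns NaN for several fitted values and for every standard error.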

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

library(quanteda)
ws <- textmodel_wordscores(data_dfm_lbgexample, c(NA, 1.5, 1.0, NA, NA, NA))
predict(ws, newdata = data_dfm_lbgexample, se.fit = TRUE)

Its output

$fit
      R1       R2       R3       R4       R5       V1 
     NaN 1.385582 1.114418      NaN      NaN 1.279662 

$se.fit
[1] NaN NaN NaN NaN NaN NaN

Expected behavior

Predict scores for all the documents

System information

Please run sessionInfo() and paste the output.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User Edition 5.12

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testthat_2.0.0 quanteda_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       lubridate_1.7.4    lattice_0.20-35    R6_2.2.2           grid_3.4.4         plyr_1.8.4         gtable_0.2.0       magrittr_1.5       scales_0.5.0      
[10] RcppParallel_4.4.0 ggplot2_2.2.1      pillar_1.2.3       stringi_1.2.3      rlang_0.2.1        lazyeval_0.2.1     data.table_1.11.4  Matrix_1.2-14      stopwords_0.9.0   
[19] fastmatch_1.1-0    tools_3.4.4        stringr_1.3.1      munsell_0.5.0      yaml_2.1.19        spacyr_0.9.9       compiler_3.4.4     colorspace_1.3-2   tibble_1.4.2      

Additional info

NaN values in the word scores seem to be the cause of the problem:

> ws$wordscores
       A        B        C        D        E        F        G        H        I        J        K        L        M        N        O        P        Q        R        S 
     NaN      NaN      NaN      NaN      NaN 1.500000 1.500000 1.500000 1.500000 1.500000 1.487500 1.487288 1.467949 1.438889 1.382199 1.297927 1.202073 1.117801 1.061111 
       T        U        V        W        X        Y        Z       ZA       ZB       ZC       ZD       ZE       ZF       ZG       ZH       ZI       ZJ       ZK 
1.032051 1.012712 1.012500 1.000000 1.000000 1.000000 1.000000 1.000000      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN 

koheiw commented Jun 19, 2018

The easiest solution would be to remove words with NA scores, but it is worth investigating why the NAs are produced.
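
To illustrate where such values can come from, here is a minimal base R sketch of the word score calculation on made-up counts. It is not the quanteda implementation; the matrix, the names `counts`, `ref_scores`, `newdoc` and the intermediate variables are all hypothetical. A word that never occurs in any reference text gives a 0/0 division, hence NaN, and that NaN propagates into any document score that uses the word unless it is dropped first.

```r
# Toy counts (hypothetical): rows are reference texts, columns are words.
counts <- matrix(c(2, 1, 0,
                   0, 3, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("ref1", "ref2"), c("a", "b", "c")))
ref_scores <- c(ref1 = 1.5, ref2 = 1.0)

# Relative frequency of each word within each reference text.
Fwr <- counts / rowSums(counts)
# Probability of each reference text given the word.
# Word "c" has a column sum of 0, so this step computes 0/0 = NaN.
Pwr <- sweep(Fwr, 2, colSums(Fwr), "/")
# Word score: reference scores weighted by Pwr.
Sw <- colSums(Pwr * ref_scores)

# A new document (hypothetical) that contains the unscored word "c".
newdoc <- c(a = 1, b = 1, c = 2)

# Naive document score: the NaN word score contaminates the result.
naive <- sum(newdoc / sum(newdoc) * Sw)

# Dropping unscored words before computing the score avoids the problem.
keep <- !is.na(Sw)
fixed <- sum(newdoc[keep] / sum(newdoc[keep]) * Sw[keep])
```

Here `Sw` has a NaN entry for `c`, `naive` is NaN, and `fixed` is computed from the two scored words only.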

koheiw commented Jun 20, 2018

A workaround is to use smoothing.
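
A sketch of what that could look like, reusing the call from the report and assuming the `smooth` argument of `textmodel_wordscores()` as found in quanteda 1.3.x (in later releases the function lives in the quanteda.textmodels package). The value 1 is only an example, and `ws_smooth` and `pred` are illustrative names:

```r
library(quanteda)

# Same reference scores as in the report; smooth = 1 adds a constant to the
# feature counts so that no feature is all zero across the reference texts.
ws_smooth <- textmodel_wordscores(data_dfm_lbgexample,
                                  c(NA, 1.5, 1.0, NA, NA, NA),
                                  smooth = 1)

# With every feature scored, the fitted values should no longer be NaN.
pred <- predict(ws_smooth, newdata = data_dfm_lbgexample, se.fit = TRUE)
pred$fit
```

Note that smoothing changes the word scores themselves, so the predicted values will differ from those of an unsmoothed model.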

kbenoit added a commit that referenced this issue Jul 3, 2018
- Omits features from the `wordscores` in a `textmodel_wordscores` class object that were all zero in a training set.
- Implements a new warning when features are found in `newdata` (through `predict.textmodel_wordscores()`) for which no word scores were computed.  These values will not contribute to the computation of the text's predicted document score.
- Keeps the old variable names that @koheiw didn't like because they match the original quantity names from LBG (2003) (for better or worse...)
- Fixes #1380