NA fill at the tail if textstat_frequency() is set above the number of features in a given dfm. #1929

martincadek opened this issue Apr 17, 2020 · 1 comment · Fixed by #1930


Describe the bug

Suppose dfm_data contains 1595 features. The user wants to view all of them to quickly check whether the features are sensible, and so might call textstat_frequency(dfm_data, 2000) %>% View(). However, this results in 2000 - 1595 = 405 rows at the tail of the data.frame being filled with NA, because the "group" column runs past the maximum possible number of features.

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

library(quanteda)
library(magrittr) # provides %>%
test <- corpus_subset(data_corpus_inaugural, Year > 1980)
textstat_frequency(dfm(test), 4000) %>% View() # then scroll down to the NA rows

Expected behavior

The result shown in View() should not extend beyond the number of features in the dfm; it should stop at nfeat(dfm(test)) rows.

System information

Please run sessionInfo() and paste the output.

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gghighlight_0.3.0         ggrepel_0.8.2             lexicon_1.3.0            
 [4] rlang_0.4.5               ids_1.0.1                 showtext_0.7-1           
 [7] showtextdb_2.0            sysfonts_0.8              flextable_0.5.9          
[10] janitor_2.0.1             tidylog_1.0.0             newsmap_0.7.1            
[13] officer_0.3.8             patchwork_1.0.0           factoextra_1.0.7         
[16] e1071_1.7-3               caret_6.0-86              lattice_0.20-38          
[19] spacyr_1.2.1              readtext_0.76             quanteda.textmodels_0.9.1
[22] Cairo_1.5-12              quanteda_2.0.1            here_0.1                 
[25] colorspace_1.4-1          forcats_0.5.0             stringr_1.4.0            
[28] dplyr_0.8.5               purrr_0.3.3               readr_1.3.1              
[31] tidyr_1.0.2               tibble_3.0.0              ggplot2_3.3.0            
[34] tidyverse_1.3.0          

loaded via a namespace (and not attached):
 [1] ellipsis_0.3.0       class_7.3-15         rprojroot_1.3-2      snakecase_0.11.0    
 [5] base64enc_0.1-3      fs_1.4.1             rstudioapi_0.11      prodlim_2019.11.13  
 [9] fansi_0.4.1          lubridate_1.7.8      xml2_1.3.1           codetools_0.2-16    
[13] splines_3.6.3        knitr_1.28           jsonlite_1.6.1       pROC_1.16.2         
[17] packrat_0.5.0        broom_0.5.5          cluster_2.1.0        kernlab_0.9-29      
[21] dbplyr_1.4.2         compiler_3.6.3       httr_1.4.1           backports_1.1.6     
[25] assertthat_0.2.1     Matrix_1.2-18        cli_2.0.2            htmltools_0.4.0     
[29] tools_3.6.3          gtable_0.3.0         glue_1.4.0           reshape2_1.4.4      
[33] LiblineaR_2.10-8     fastmatch_1.1-0      Rcpp_1.0.4.6         cellranger_1.1.0    
[37] vctrs_0.2.4          nlme_3.1-147         iterators_1.0.12     timeDate_3043.102   
[41] gower_0.2.1          xfun_0.13            stopwords_1.0        rvest_0.3.5         
[45] lifecycle_0.2.0      MASS_7.3-51.5        scales_1.1.0         ipred_0.9-9         
[49] clisymbols_1.2.0     hms_0.5.3            SparseM_1.78         gdtools_0.2.2       
[53] rpart_4.1-15         stringi_1.4.6        foreach_1.5.0        zip_2.0.4           
[57] lava_1.6.7           pkgconfig_2.0.3      systemfonts_0.1.1    RSSL_0.9.1          
[61] evaluate_0.14        recipes_0.1.10       tidyselect_1.0.0     plyr_1.8.6          
[65] magrittr_1.5         R6_2.4.1             generics_0.0.2       DBI_1.1.0           
[69] pillar_1.4.3         haven_2.2.0          withr_2.1.2          survival_3.1-8      
[73] nnet_7.3-12          modelr_0.1.6         crayon_1.3.4         uuid_0.1-4          
[77] rmarkdown_2.1        syuzhet_1.0.4        grid_3.6.3           readxl_1.3.1        
[81] data.table_1.12.8    ModelMetrics_1.2.2.2 reprex_0.3.0         digest_0.6.25       
[85] RcppParallel_5.0.0   stats4_3.6.3         munsell_0.5.0        quadprog_1.5-8      

kbenoit commented Apr 18, 2020

Thanks, that's a bug! Affects the result with groups too.

Here's a simpler reproducible example:

## Package version: 2.0.2

corp <- c("a a b c d", "a d d e", "a b b")
dfmat <- dfm(corp)

# should not have NA
textstat_frequency(dfmat, n = 6)
##   feature frequency rank docfreq group
## 1       a         4    1       3   all
## 2       b         3    2       2   all
## 3       d         3    2       2   all
## 4       c         1    4       1   all
## 5       e         1    4       1   all
## 6    <NA>        NA   NA      NA   all
textstat_frequency(dfmat, n = 6, groups = c(1, 2, 2))
##    feature frequency rank docfreq group
## 1        a         2    1       1     1
## 2        b         1    2       1     1
## 3        c         1    2       1     1
## 4        d         1    2       1     1
## 5     <NA>        NA   NA      NA     1
## 6     <NA>        NA   NA      NA     1
## 7        a         2    1       2     2
## 8        b         2    1       1     2
## 9        d         2    1       1     2
## 10       e         1    4       1     2
## 11    <NA>        NA   NA      NA     2
## 12    <NA>        NA   NA      NA     2
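
Until a fix is available, one possible workaround is to drop the padding rows after the fact. The sketch below is only an illustration: it rebuilds the ungrouped output above as a plain data frame in base R (so it runs without quanteda), and drop_na_features() is a hypothetical helper, not part of the package. It assumes the object returned by textstat_frequency() can be subset like an ordinary data.frame.

```r
# Rebuild the ungrouped result shown above as a plain data frame
# (values copied from the output of textstat_frequency(dfmat, n = 6)).
res <- data.frame(
  feature   = c("a", "b", "d", "c", "e", NA),
  frequency = c(4, 3, 3, 1, 1, NA),
  rank      = c(1, 2, 2, 4, 4, NA),
  docfreq   = c(3, 2, 2, 1, 1, NA),
  group     = "all",
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a quanteda function): keep only rows that
# actually name a feature, which removes the NA padding at the tail.
drop_na_features <- function(x) {
  x[!is.na(x$feature), , drop = FALSE]
}

res_clean <- drop_na_features(res)
nrow(res_clean)
```

Filtering on the feature column also covers the grouped case, where the padding rows appear at the end of each group rather than only at the end of the table.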

