NA fill at the tail if textstat_frequency() is set above the number of features in a given dfm. #1929

Closed
martincadek opened this issue Apr 17, 2020 · 1 comment · Fixed by #1930


Describe the bug

Suppose dfm_data contains 1595 features. To check quickly whether the features are sensible, a user may want to scroll through all of them and call something like textstat_frequency(dfm_data, 2000) %>% View(). However, this results in 2000 - 1595 = 405 rows at the tail of the data.frame being filled with NA, apparently because the "group" column runs past the actual number of features.
[Screenshot: View() output with NA rows at the tail]

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

library("quanteda")
library("magrittr")  # provides %>% (any package exporting the pipe will do)

test <- corpus_subset(data_corpus_inaugural, Year > 1980)
head(dfm(test))
textstat_frequency(dfm(test), 4000) %>% View()  # then scroll down

Expected behavior

The returned data.frame should not extend past the number of features in the dfm; it should stop at nfeat(dfm(test)) rows, even when n is larger.
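
The expected behaviour is the same as what base R's head() does when n exceeds the number of rows. A minimal sketch in base R only, using a hand-built data.frame `freq` as a hypothetical stand-in for a frequency table (this is not quanteda's implementation, and the padding mechanism shown is an illustration, not a confirmed diagnosis):

```r
# Hypothetical stand-in for a frequency table with 5 features
freq <- data.frame(
  feature   = c("a", "b", "d", "c", "e"),
  frequency = c(4, 3, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Asking for more rows than exist: head() stops at nrow(freq)
top <- head(freq, 6)
nrow(top)

# Plain row indexing past the end pads the result with an NA row instead
padded <- freq[seq_len(6), ]
nrow(padded)
```

Here `top` keeps the 5 real rows, while `padded` has 6 rows, the last one all NA, which is the kind of tail reported above.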

System information

Please run sessionInfo() and paste the output.

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gghighlight_0.3.0         ggrepel_0.8.2             lexicon_1.3.0            
 [4] rlang_0.4.5               ids_1.0.1                 showtext_0.7-1           
 [7] showtextdb_2.0            sysfonts_0.8              flextable_0.5.9          
[10] janitor_2.0.1             tidylog_1.0.0             newsmap_0.7.1            
[13] officer_0.3.8             patchwork_1.0.0           factoextra_1.0.7         
[16] e1071_1.7-3               caret_6.0-86              lattice_0.20-38          
[19] spacyr_1.2.1              readtext_0.76             quanteda.textmodels_0.9.1
[22] Cairo_1.5-12              quanteda_2.0.1            here_0.1                 
[25] colorspace_1.4-1          forcats_0.5.0             stringr_1.4.0            
[28] dplyr_0.8.5               purrr_0.3.3               readr_1.3.1              
[31] tidyr_1.0.2               tibble_3.0.0              ggplot2_3.3.0            
[34] tidyverse_1.3.0          

loaded via a namespace (and not attached):
 [1] ellipsis_0.3.0       class_7.3-15         rprojroot_1.3-2      snakecase_0.11.0    
 [5] base64enc_0.1-3      fs_1.4.1             rstudioapi_0.11      prodlim_2019.11.13  
 [9] fansi_0.4.1          lubridate_1.7.8      xml2_1.3.1           codetools_0.2-16    
[13] splines_3.6.3        knitr_1.28           jsonlite_1.6.1       pROC_1.16.2         
[17] packrat_0.5.0        broom_0.5.5          cluster_2.1.0        kernlab_0.9-29      
[21] dbplyr_1.4.2         compiler_3.6.3       httr_1.4.1           backports_1.1.6     
[25] assertthat_0.2.1     Matrix_1.2-18        cli_2.0.2            htmltools_0.4.0     
[29] tools_3.6.3          gtable_0.3.0         glue_1.4.0           reshape2_1.4.4      
[33] LiblineaR_2.10-8     fastmatch_1.1-0      Rcpp_1.0.4.6         cellranger_1.1.0    
[37] vctrs_0.2.4          nlme_3.1-147         iterators_1.0.12     timeDate_3043.102   
[41] gower_0.2.1          xfun_0.13            stopwords_1.0        rvest_0.3.5         
[45] lifecycle_0.2.0      MASS_7.3-51.5        scales_1.1.0         ipred_0.9-9         
[49] clisymbols_1.2.0     hms_0.5.3            SparseM_1.78         gdtools_0.2.2       
[53] rpart_4.1-15         stringi_1.4.6        foreach_1.5.0        zip_2.0.4           
[57] lava_1.6.7           pkgconfig_2.0.3      systemfonts_0.1.1    RSSL_0.9.1          
[61] evaluate_0.14        recipes_0.1.10       tidyselect_1.0.0     plyr_1.8.6          
[65] magrittr_1.5         R6_2.4.1             generics_0.0.2       DBI_1.1.0           
[69] pillar_1.4.3         haven_2.2.0          withr_2.1.2          survival_3.1-8      
[73] nnet_7.3-12          modelr_0.1.6         crayon_1.3.4         uuid_0.1-4          
[77] rmarkdown_2.1        syuzhet_1.0.4        grid_3.6.3           readxl_1.3.1        
[81] data.table_1.12.8    ModelMetrics_1.2.2.2 reprex_0.3.0         digest_0.6.25       
[85] RcppParallel_5.0.0   stats4_3.6.3         munsell_0.5.0        quadprog_1.5-8      

kbenoit (Collaborator) commented Apr 18, 2020

Thanks, that's a bug! Affects the result with groups too.

Here's a simpler reproducible example:

library("quanteda")
## Package version: 2.0.2

corp <- c("a a b c d", "a d d e", "a b b")
dfmat <- dfm(corp)

# should not have NA
textstat_frequency(dfmat, n = 6)
##   feature frequency rank docfreq group
## 1       a         4    1       3   all
## 2       b         3    2       2   all
## 3       d         3    2       2   all
## 4       c         1    4       1   all
## 5       e         1    4       1   all
## 6    <NA>        NA   NA      NA   all
textstat_frequency(dfmat, n = 6, groups = c(1, 2, 2))
##    feature frequency rank docfreq group
## 1        a         2    1       1     1
## 2        b         1    2       1     1
## 3        c         1    2       1     1
## 4        d         1    2       1     1
## 5     <NA>        NA   NA      NA     1
## 6     <NA>        NA   NA      NA     1
## 7        a         2    1       2     2
## 8        b         2    1       1     2
## 9        d         2    1       1     2
## 10       e         1    4       1     2
## 11    <NA>        NA   NA      NA     2
## 12    <NA>        NA   NA      NA     2
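
Until a fix is available, one possible workaround is to drop the padded rows after the fact. The sketch below is an assumption about usage rather than anything from the package: `res` is built by hand to mirror the ungrouped output shown above so that the snippet runs without quanteda, and `res_clean` is a hypothetical name.

```r
# Hand-built copy of the textstat_frequency(dfmat, n = 6) output shown above
res <- data.frame(
  feature   = c("a", "b", "d", "c", "e", NA),
  frequency = c(4, 3, 3, 1, 1, NA),
  rank      = c(1, 2, 2, 4, 4, NA),
  docfreq   = c(3, 2, 2, 1, 1, NA),
  group     = "all",
  stringsAsFactors = FALSE
)

# Keep only the rows that correspond to a real feature
res_clean <- res[!is.na(res$feature), ]
nrow(res_clean)
```

Filtering on the feature column should also cover the grouped case, where each group can hold fewer features than the dfm as a whole, so capping n at nfeat() alone would not be enough.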
