textstat_frequency after dfm_tfidf weighting #1646

Closed
Astelix opened this issue Mar 13, 2019 · 11 comments

Astelix commented Mar 13, 2019

Describe the bug

I tried to replicate the basic tf-idf example from Julia Silge's tidytext book with quanteda.
This worked in quanteda 1.4.1 but not in version 1.4.2 from GitHub.

Reproducible code

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.4.2
#> Warning: package 'ggplot2' was built under R version 3.4.4
#> Warning: package 'tibble' was built under R version 3.4.4
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.4
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4
#> Warning: package 'forcats' was built under R version 3.4.3
library(tidytext)
#> Warning: package 'tidytext' was built under R version 3.4.4
library(stringr)
library(quanteda)
#> Package version: 1.4.2
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(janeaustenr)

########################################################################
# From Silge chapter 3
########################################################################

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
#> Warning: package 'bindrcpp' was built under R version 3.4.4

tidy_books <- original_books %>%
  unnest_tokens(word, text) %>% 
  group_by(book,word) %>% 
  count() %>%
  ungroup()

########################################################################
# Replicate Silge's first tf-idf example with quanteda
########################################################################

austen_dfm <- tidy_books %>% cast_dfm(book, word, n) %>%
  dfm(remove=stopwords("en"), tolower = T) %>%
  dfm_group(groups=docnames()) %>%
  dfm_tfidf()

res <-
  textstat_frequency(austen_dfm, groups = docnames(austen_dfm), n=15) %>%
  mutate(feature = factor(feature, levels=rev(unique(feature))))
#> Error: will not group a weighted dfm; use force = TRUE to override

ggplot(res, aes(feature, frequency, fill=group)) +
  geom_col(show.legend = F) +
  facet_wrap(~group, ncol=2, scales="free") + 
  coord_flip()
#> Error in ggplot(res, aes(feature, frequency, fill = group)): object 'res' not found


<sup>Created on 2019-03-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1)</sup>

## Expected behavior

It should behave as in 1.4.1, or textstat_frequency() should get a force parameter that makes it work again with tf-idf weights.

## System information

sessionInfo()
#> R version 3.4.1 (2017-06-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: OS X El Capitan 10.11.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 janeaustenr_0.1.5 quanteda_1.4.2
#> [4] tidytext_0.2.0 forcats_0.3.0 stringr_1.3.1
#> [7] dplyr_0.7.8 purrr_0.2.5 readr_1.1.1
#> [10] tidyr_0.8.2 tibble_2.0.1 ggplot2_3.1.0
#> [13] tidyverse_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_0.2.5 reshape2_1.4.3 haven_1.1.2
#> [4] lattice_0.20-35 colorspace_1.3-2 SnowballC_0.6.0
#> [7] htmltools_0.3.6 yaml_2.2.0 rlang_0.3.1
#> [10] pillar_1.3.1 foreign_0.8-69 glue_1.3.0
#> [13] withr_2.1.2 modelr_0.1.2 readxl_1.1.0
#> [16] bindr_0.1.1 plyr_1.8.4 munsell_0.5.0
#> [19] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
#> [22] psych_1.8.10 evaluate_0.12 knitr_1.20
#> [25] parallel_3.4.1 tokenizers_0.1.4 broom_0.4.5
#> [28] Rcpp_1.0.0 spacyr_1.0.1 backports_1.1.2
#> [31] scales_1.0.0 RcppParallel_4.4.2 jsonlite_1.6
#> [34] fastmatch_1.1-0 stopwords_0.9.0 mnormt_1.5-5
#> [37] hms_0.4.2 digest_0.6.18 stringi_1.3.1
#> [40] grid_3.4.1 rprojroot_1.3-2 cli_1.0.1
#> [43] tools_3.4.1 magrittr_1.5 lazyeval_0.2.1
#> [46] crayon_1.3.4 pkgconfig_2.0.2 Matrix_1.2-11
#> [49] data.table_1.12.0 xml2_1.2.0 lubridate_1.7.4
#> [52] assertthat_0.2.0 rmarkdown_1.10 httr_1.3.1
#> [55] R6_2.3.0 nlme_3.1-131 compiler_3.4.1


koheiw commented Mar 13, 2019

This is a change introduced in #1547. @kbenoit, it might be better to allow grouping of weighted dfms.

kbenoit commented Mar 14, 2019

Well, we now do allow grouping of weighted dfms (see #1547), but we did not pass this ability through to textstat_frequency(). For consistency we should do this, but there remain very good reasons why it is a bad idea: the properties of neither (inverse) document frequency nor logarithms are preserved by summing.

Consider a simple example with three documents.

library("quanteda")
## Package version: 1.4.1

dfmat <- dfm(c(
  "a a b b b c d",
  "a c c c c",
  "b b b d"
))
dfmat
## Document-feature matrix of: 3 documents, 4 features (33.3% sparse).
## 3 x 4 sparse Matrix of class "dfm"
##        features
## docs    a b c d
##   text1 2 3 1 1
##   text2 1 0 4 0
##   text3 0 3 0 1

This is what the tf-idf weighted version of this matrix looks like:

dfm_tfidf(dfmat)
## Document-feature matrix of: 3 documents, 4 features (33.3% sparse).
## 3 x 4 sparse Matrix of class "dfm"
##        features
## docs            a         b         c         d
##   text1 0.3521825 0.5282738 0.1760913 0.1760913
##   text2 0.1760913 0         0.7043650 0        
##   text3 0         0.5282738 0         0.1760913
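As a quick sanity check of the numbers above, the default dfm_tfidf() scheme multiplies each raw count by log10(ndoc / docfreq), which can be verified by hand for one cell:

```r
# Default dfm_tfidf() scheme: tf-idf = count * log10(ndoc / docfreq)
n_docs     <- 3                          # three documents in dfmat
docfreq_a  <- 2                          # "a" occurs in text1 and text2
idf_a      <- log10(n_docs / docfreq_a)  # inverse document frequency of "a"
tfidf_a_t1 <- 2 * idf_a                  # count of "a" in text1 is 2
round(tfidf_a_t1, 7)                     # 0.3521825, matching the matrix above
```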

and here is a grouped version:

dfm_group(dfmat, groups = c(1, 1, 2))
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##     features
## docs a b c d
##    1 3 3 5 1
##    2 0 3 0 1

If we "group" the weighted matrix by summing the cells (the default, and currently the only, method), then this is the result:

dfm_tfidf(dfmat) %>%
  dfm_group(groups = c(1, 1, 2), force = TRUE)
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##     features
## docs         a         b         c         d
##    1 0.5282738 0.5282738 0.8804563 0.1760913
##    2 0         0.5282738 0         0.1760913

but this is very different from what happens if we group it first and then weight it:

dfm_group(dfmat, groups = c(1, 1, 2)) %>%
  dfm_tfidf()
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##     features
## docs       a b       c d
##    1 0.90309 0 1.50515 0
##    2 0       0 0       0

Here, b and d are weighted to zero because they occur in every new "document", where a document is now defined as a group. This is very different from summing the weights before grouping: weighting first reduces the weight of a and makes it equal in importance to b, which gets a zero weight if grouping happens first.

The moral of the story is: Make sure you understand what you intend by a "document" before you weight by inverse document frequency. My view is that the weighting first and then grouping is simply wrong.
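The zeroing of b and d follows directly from the idf term: once grouping reduces the matrix to two documents, any feature present in both has a document frequency equal to the number of documents, so its idf vanishes. A one-line check:

```r
# "b" and "d" each occur in both of the 2 grouped documents,
# so their idf term is log10(2 / 2) = 0, and any tf * 0 is 0
idf_both <- log10(2 / 2)
stopifnot(idf_both == 0)
```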

@kbenoit changed the title from "textstat_frequency on dam_tfidf" to "textstat_frequency after dfm_tfidf weighting" on Mar 14, 2019
koheiw commented Mar 14, 2019

I thought the "Error: will not group a weighted dfm; use force = TRUE" message came from dfm_group(), but I was wrong. It does not make sense to do dfm_tfidf() %>% textstat_frequency().

kbenoit commented Mar 14, 2019

No, probably not when the textstat_frequency output columns are called frequency and docfreq.

Astelix commented Mar 14, 2019

I did the grouping before computing tf-idf and used the same grouping in textstat_frequency(). I was not interested in docfreq, and I knew that frequency was tf-idf weighted. I basically wanted to extract the results from the dfm into a plottable data frame. The result of the procedure replicated Silge's plots exactly (though on a different scale). I can completely circumvent textstat_frequency() and use the following procedure, which is more complicated but leads to exactly the same results.

austen_dfm <- tidy_books %>% cast_dfm(book, word, n) %>%
  dfm(remove=stopwords("en"), tolower = T) %>%
  dfm_group(groups=docnames()) %>%
  dfm_tfidf()

austen_trip <- convert(austen_dfm, to="tripletlist") 
austen_df <- data.frame(book=austen_trip$document,
                        feature=austen_trip$feature,
                        tf_idf=austen_trip$frequency)


res <- austen_df %>%
  arrange(desc(tf_idf)) %>%
  mutate(feature=factor(feature, levels=rev(unique(feature)))) %>%
  group_by(book) %>%
  slice(1:15) %>%
  ungroup()

Created on 2019-03-14 by the reprex package (v0.2.1)

kbenoit commented Mar 25, 2019

@koheiw I've been thinking more about this, looking more at the application above, and also finding myself overriding our own rules (using force = TRUE and suppressing warnings) in the dev-svm branch to get better classifier performance in machine learning.

I think I'm coming around to the view that we could let a user group a weighted dfm, and weight an already weighted dfm, without a force option: allow it always, with a warning that can be toggled off. The warning at least calls attention to a procedure that may not be recommended, or alerts a user who forgot about an earlier weighting.

What do you think?

koheiw commented Mar 26, 2019

If textstat_frequency() is used primarily for converting to a long format for tidytext, convert() should be used instead (I am not sure whether that is currently possible). However, I am fine with giving users the freedom to be creative in data transformation, as long as these safeguard measures are consistent across our functions.

By the way, can you tell me what combination of feature weighting improved SVM performance?

kbenoit commented Mar 26, 2019

tf-idf improved predictive performance a lot. For computational performance, in the current dev-svm branch the key is sending it a matrix that is as sparse as possible. I tried "scaling" (centering and dividing by the standard deviation), and this caused the computation to fail entirely because the matrix had become 100% dense.

koheiw commented Mar 26, 2019

That is probably because the SVM is a non-text model that accepts any quantity as a predictor. Are there feature weighting functions for a regular Matrix that you could use instead? (If not, we should generalize the code in dfm_weight() for broader usage.)

kbenoit commented Mar 26, 2019

The testing I've done is based on a linear SVM, which seems to be the recommended kernel for textual features. The need to normalize in some fashion is the same as for a measure such as correlation: the space in which the separating plane is computed is affected by the absolute positions of the inputs, and this means that longer documents influence the decision values more. But proportion weighting does not perform nearly as well as tf-idf, even in our default version where the tf are counts (and not proportions). So here, the idf seems to be helping a lot by getting rid of common features.
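The document-length effect described here can be illustrated in two lines (a toy example of my own, not from the thread): scaling all counts in a document moves its absolute position in the feature space, while proportion weighting is invariant to length.

```r
doc     <- c(a = 2, b = 3, c = 1)  # toy feature counts for one document
doubled <- 2 * doc                 # the same document, twice as long
# proportions are unchanged, so a proportion-weighted SVM input does not move
stopifnot(isTRUE(all.equal(doubled / sum(doubled), doc / sum(doc))))
# the raw counts, by contrast, have moved in the feature space
stopifnot(all(doubled != doc))
```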

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

performance <- function(mytable, verbose = TRUE) {
  truePositives <- mytable[1, 1]
  trueNegatives <- sum(diag(mytable)[-1])
  falsePositives <- sum(mytable[1, ]) - truePositives
  falseNegatives <- sum(mytable[, 1]) - truePositives
  precision <- truePositives / (truePositives + falsePositives)
  recall <- truePositives / (truePositives + falseNegatives)
  accuracy <- sum(diag(mytable)) / sum(mytable)
  tnr <- trueNegatives / (trueNegatives + falsePositives)
  balanced_accuracy <- sum(c(precision, tnr), na.rm = TRUE) / 2
  if (verbose) {
    print(mytable)
    cat(
      "\n    precision =", round(precision, 2),
      "\n       recall =", round(recall, 2),
      "\n     accuracy =", round(accuracy, 2),
      "\n    bal. acc. =", round(balanced_accuracy, 2),
      "\n"
    )
  }
  invisible(c(precision, recall))
}

# define training texts and the "true" govt/opp status
y <- ifelse(docvars(data_corpus_dailnoconf1991, "name") == "Haughey", "Govt", NA)
y <- ifelse(docvars(data_corpus_dailnoconf1991, "name") %in% c("Spring", "deRossa"), "Opp", y)
truth <- ifelse(docvars(data_corpus_dailnoconf1991, "party") %in% c("FF", "PD"), "Govt", "Opp")

# no weighting: poor
dfm(data_corpus_dailnoconf1991) %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt    6   0
##   Opp    19  33
## 
##     precision = 1 
##        recall = 0.24 
##      accuracy = 0.67 
##     bal. acc. = 1

# proportions: poor, predicts everyone to be opposition
dfm(data_corpus_dailnoconf1991) %>%
  dfm_weight(scheme = "prop") %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##      truth
## .     Govt Opp
##   Opp   25  33
## 
##     precision = 0.43 
##        recall = 1 
##      accuracy = 0.43 
##     bal. acc. = 0.22

# scaled - results in a fully dense dfm, and poor performance
dfm(data_corpus_dailnoconf1991) %>%
  scale() %>%
  as.dfm() %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt   24  25
##   Opp     1   8
## 
##     precision = 0.49 
##        recall = 0.96 
##      accuracy = 0.55 
##     bal. acc. = 0.37

# tf-idf: better
dfm(data_corpus_dailnoconf1991) %>%
  dfm_tfidf() %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt   16   3
##   Opp     9  30
## 
##     precision = 0.84 
##        recall = 0.64 
##      accuracy = 0.79 
##     bal. acc. = 0.88

# tf-idf: best with document frequency weights
dfm(data_corpus_dailnoconf1991) %>%
  dfm_tfidf() %>%
  textmodel_svm(y, weight = "docfreq") %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt   15   2
##   Opp    10  31
## 
##     precision = 0.88 
##        recall = 0.6 
##      accuracy = 0.79 
##     bal. acc. = 0.91

kbenoit commented Mar 28, 2019

Solution after discussing with @koheiw: add the ability to pass arguments through to dfm_group() via .... @kbenoit to implement.
