textstat_frequency after dfm_tfidf weighting #1646

Closed
Astelix opened this issue Mar 13, 2019 · 11 comments

Astelix commented Mar 13, 2019

Describe the bug

I tried to replicate the basic tf-idf example from Julia Silge's tidytext book with quanteda.
This worked in quanteda 1.4.1 but not in version 1.4.2 from GitHub.

Reproducible code

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.4.2
#> Warning: package 'ggplot2' was built under R version 3.4.4
#> Warning: package 'tibble' was built under R version 3.4.4
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.4
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4
#> Warning: package 'forcats' was built under R version 3.4.3
library(tidytext)
#> Warning: package 'tidytext' was built under R version 3.4.4
library(stringr)
library(quanteda)
#> Package version: 1.4.2
#> Parallel computing: 2 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(janeaustenr)

########################################################################
# From Silge chapter 3
########################################################################

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
#> Warning: package 'bindrcpp' was built under R version 3.4.4

tidy_books <- original_books %>%
  unnest_tokens(word, text) %>% 
  group_by(book,word) %>% 
  count() %>%
  ungroup()

########################################################################
# Replicate Silge's first tf-idf example with quanteda
########################################################################

austen_dfm <- tidy_books %>% cast_dfm(book, word, n) %>%
  dfm(remove=stopwords("en"), tolower = T) %>%
  dfm_group(groups=docnames()) %>%
  dfm_tfidf()

res <-
  textstat_frequency(austen_dfm, groups = docnames(austen_dfm), n=15) %>%
  mutate(feature = factor(feature, levels=rev(unique(feature))))
#> Error: will not group a weighted dfm; use force = TRUE to override

ggplot(res, aes(feature, frequency, fill=group)) +
  geom_col(show.legend = F) +
  facet_wrap(~group, ncol=2, scales="free") + 
  coord_flip()
#> Error in ggplot(res, aes(feature, frequency, fill = group)): object 'res' not found


<sup>Created on 2019-03-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1)</sup>

## Expected behavior

It should behave as in 1.4.1, or textstat_frequency() should get a force parameter that makes it work again with tf-idf weights.

## System information

sessionInfo()
#> R version 3.4.1 (2017-06-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: OS X El Capitan 10.11.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 janeaustenr_0.1.5 quanteda_1.4.2
#> [4] tidytext_0.2.0 forcats_0.3.0 stringr_1.3.1
#> [7] dplyr_0.7.8 purrr_0.2.5 readr_1.1.1
#> [10] tidyr_0.8.2 tibble_2.0.1 ggplot2_3.1.0
#> [13] tidyverse_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_0.2.5 reshape2_1.4.3 haven_1.1.2
#> [4] lattice_0.20-35 colorspace_1.3-2 SnowballC_0.6.0
#> [7] htmltools_0.3.6 yaml_2.2.0 rlang_0.3.1
#> [10] pillar_1.3.1 foreign_0.8-69 glue_1.3.0
#> [13] withr_2.1.2 modelr_0.1.2 readxl_1.1.0
#> [16] bindr_0.1.1 plyr_1.8.4 munsell_0.5.0
#> [19] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
#> [22] psych_1.8.10 evaluate_0.12 knitr_1.20
#> [25] parallel_3.4.1 tokenizers_0.1.4 broom_0.4.5
#> [28] Rcpp_1.0.0 spacyr_1.0.1 backports_1.1.2
#> [31] scales_1.0.0 RcppParallel_4.4.2 jsonlite_1.6
#> [34] fastmatch_1.1-0 stopwords_0.9.0 mnormt_1.5-5
#> [37] hms_0.4.2 digest_0.6.18 stringi_1.3.1
#> [40] grid_3.4.1 rprojroot_1.3-2 cli_1.0.1
#> [43] tools_3.4.1 magrittr_1.5 lazyeval_0.2.1
#> [46] crayon_1.3.4 pkgconfig_2.0.2 Matrix_1.2-11
#> [49] data.table_1.12.0 xml2_1.2.0 lubridate_1.7.4
#> [52] assertthat_0.2.0 rmarkdown_1.10 httr_1.3.1
#> [55] R6_2.3.0 nlme_3.1-131 compiler_3.4.1


koheiw commented Mar 13, 2019

This is a change introduced in #1547. @kbenoit, it might be better to allow grouping of weighted dfms.

kbenoit commented Mar 14, 2019

Well, we now do allow grouping of weighted dfms (see #1547), but we did not pass this ability through to textstat_frequency(). For consistency we should do this, but there remain very good reasons why it is a bad idea: the properties of neither (inverse) document frequency nor logarithms are preserved by summing.

Consider a simple example with three documents.

library("quanteda")
## Package version: 1.4.1

dfmat <- dfm(c(
  "a a b b b c d",
  "a c c c c",
  "b b b d"
))
dfmat
## Document-feature matrix of: 3 documents, 4 features (33.3% sparse).
## 3 x 4 sparse Matrix of class "dfm"
##        features
## docs    a b c d
##   text1 2 3 1 1
##   text2 1 0 4 0
##   text3 0 3 0 1

This is what the tf-idf weighted version of this matrix looks like:

dfm_tfidf(dfmat)
## Document-feature matrix of: 3 documents, 4 features (33.3% sparse).
## 3 x 4 sparse Matrix of class "dfm"
##        features
## docs            a         b         c         d
##   text1 0.3521825 0.5282738 0.1760913 0.1760913
##   text2 0.1760913 0         0.7043650 0        
##   text3 0         0.5282738 0         0.1760913
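As a quick sanity check of the numbers above, the default dfm_tfidf() scheme multiplies each raw count by log10(ndoc / docfreq), which can be verified by hand for one cell:

```r
# Default dfm_tfidf() scheme: tf-idf = count * log10(ndoc / docfreq)
n_docs     <- 3                          # three documents in dfmat
docfreq_a  <- 2                          # "a" occurs in text1 and text2
idf_a      <- log10(n_docs / docfreq_a)  # inverse document frequency of "a"
tfidf_a_t1 <- 2 * idf_a                  # count of "a" in text1 is 2
round(tfidf_a_t1, 7)                     # 0.3521825, matching the matrix above
```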

and here is a grouped version:

dfm_group(dfmat, groups = c(1, 1, 2))
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##     features
## docs a b c d
##    1 3 3 5 1
##    2 0 3 0 1

If we "group" the weighted matrix by summing the cells (the default, and currently the only, method), then this is the result:

dfm_tfidf(dfmat) %>%
  dfm_group(groups = c(1, 1, 2), force = TRUE)
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##     features
## docs         a         b         c         d
##    1 0.5282738 0.5282738 0.8804563 0.1760913
##    2 0         0.5282738 0         0.1760913

but this is very different from what happens if we group it first and then weight it:

dfm_group(dfmat, groups = c(1, 1, 2)) %>%
  dfm_tfidf()
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
##     features
## docs       a b       c d
##    1 0.90309 0 1.50515 0
##    2 0       0 0       0

Here, b and d are weighted to zero because they occur in every new "document", where a document is now defined as a group. This is very different from summing the weights before grouping: weighting first reduces the weight of a and makes it equal in importance to b, which gets a zero weight if grouping happens first.

The moral of the story is: Make sure you understand what you intend by a "document" before you weight by inverse document frequency. My view is that the weighting first and then grouping is simply wrong.
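The zeroing of b and d follows directly from the idf term: once grouping reduces the matrix to two documents, any feature present in both has a document frequency equal to the number of documents, so its idf vanishes. A one-line check:

```r
# "b" and "d" each occur in both of the 2 grouped documents,
# so their idf term is log10(2 / 2) = 0, and any tf * 0 is 0
idf_both <- log10(2 / 2)
stopifnot(idf_both == 0)
```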

@kbenoit changed the title from "textstat_frequency on dam_tfidf" to "textstat_frequency after dfm_tfidf weighting" on Mar 14, 2019
koheiw commented Mar 14, 2019

I thought the "Error: will not group a weighted dfm; use force = TRUE" message came from dfm_group(), but I was wrong. It does not make sense to do dfm_tfidf() %>% textstat_frequency().

kbenoit commented Mar 14, 2019

No, probably not when the textstat_frequency output columns are called frequency and docfreq.

Astelix commented Mar 14, 2019

I did the grouping before computing tf-idf and used the same grouping in textstat_frequency(). I was not interested in docfreq, and I knew that frequency was tf-idf weighted. I basically wanted to extract the results from the dfm into a plottable data frame. The result of the procedure replicated Silge's plots exactly (though on a different scale). I can completely circumvent textstat_frequency() and use the following procedure, which is more complicated but leads to exactly the same results.

austen_dfm <- tidy_books %>% cast_dfm(book, word, n) %>%
  dfm(remove=stopwords("en"), tolower = T) %>%
  dfm_group(groups=docnames()) %>%
  dfm_tfidf()

austen_trip <- convert(austen_dfm, to="tripletlist") 
austen_df <- data.frame(book=austen_trip$document,
                        feature=austen_trip$feature,
                        tf_idf=austen_trip$frequency)


res <- austen_df %>%
  arrange(desc(tf_idf)) %>%
  mutate(feature=factor(feature, levels=rev(unique(feature)))) %>%
  group_by(book) %>%
  slice(1:15) %>%
  ungroup()

Created on 2019-03-14 by the reprex package (v0.2.1)

kbenoit commented Mar 25, 2019

@koheiw I've been thinking more about this, looking more at the application above, and also finding myself overriding our own rules (using force = TRUE and suppressing warnings) in the dev-svm branch to get better classifier performance in machine learning.

I think I'm coming around to the view that we could let a user group a weighted dfm, and weight an already weighted dfm, without a force option: allow it always, with a warning that can be toggled off. The warning at least calls attention to a procedure that may not be recommended, or alerts a user who forgot about an earlier weighting.

What do you think?

koheiw commented Mar 26, 2019

If textstat_frequency() is used primarily for converting to a long format for tidytext, convert() should be used instead (I am not sure whether that is currently possible). However, I am fine with giving users the freedom to be creative in data transformation, as long as these safeguard measures are consistent across our functions.

By the way, can you tell me what combination of feature weighting improved SVM performance?

kbenoit commented Mar 26, 2019

tf-idf improved predictive performance a lot. For computational performance, in the current dev-svm branch the key is sending it a matrix that is as sparse as possible. I tried "scaling" (centering and dividing by the standard deviation), and this caused the computation to fail entirely because the matrix had become 100% dense.

koheiw commented Mar 26, 2019

That is probably because the SVM is a non-text model that accepts any quantity as a predictor. Are there feature weighting functions for a regular Matrix that you could use instead? (If not, we should generalize the code in dfm_weight() for broader usage.)

kbenoit commented Mar 26, 2019

The testing I've done is based on a linear SVM, which seems to be the recommended kernel for textual features. The need to normalize in some fashion is the same as for a measure such as correlation: the space in which the separating plane is computed is affected by the absolute positions of the inputs, and this means that longer documents influence the decision values more. But proportion weighting does not perform nearly as well as tf-idf, even in our default version where the tf are counts (and not proportions). So here, the idf seems to be helping a lot by getting rid of common features.
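The document-length effect described here can be illustrated in two lines (a toy example of my own, not from the thread): scaling all counts in a document moves its absolute position in the feature space, while proportion weighting is invariant to length.

```r
doc     <- c(a = 2, b = 3, c = 1)  # toy feature counts for one document
doubled <- 2 * doc                 # the same document, twice as long
# proportions are unchanged, so a proportion-weighted SVM input does not move
stopifnot(isTRUE(all.equal(doubled / sum(doubled), doc / sum(doc))))
# the raw counts, by contrast, have moved in the feature space
stopifnot(all(doubled != doc))
```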

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

performance <- function(mytable, verbose = TRUE) {
  truePositives <- mytable[1, 1]
  trueNegatives <- sum(diag(mytable)[-1])
  falsePositives <- sum(mytable[1, ]) - truePositives
  falseNegatives <- sum(mytable[, 1]) - truePositives
  precision <- truePositives / (truePositives + falsePositives)
  recall <- truePositives / (truePositives + falseNegatives)
  accuracy <- sum(diag(mytable)) / sum(mytable)
  tnr <- trueNegatives / (trueNegatives + falsePositives)
  balanced_accuracy <- sum(c(precision, tnr), na.rm = TRUE) / 2
  if (verbose) {
    print(mytable)
    cat(
      "\n    precision =", round(precision, 2),
      "\n       recall =", round(recall, 2),
      "\n     accuracy =", round(accuracy, 2),
      "\n    bal. acc. =", round(balanced_accuracy, 2),
      "\n"
    )
  }
  invisible(c(precision, recall))
}

# define training texts and the "true" govt/opp status
y <- ifelse(docvars(data_corpus_dailnoconf1991, "name") == "Haughey", "Govt", NA)
y <- ifelse(docvars(data_corpus_dailnoconf1991, "name") %in% c("Spring", "deRossa"), "Opp", y)
truth <- ifelse(docvars(data_corpus_dailnoconf1991, "party") %in% c("FF", "PD"), "Govt", "Opp")

# no weighting: poor
dfm(data_corpus_dailnoconf1991) %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt    6   0
##   Opp    19  33
## 
##     precision = 1 
##        recall = 0.24 
##      accuracy = 0.67 
##     bal. acc. = 1

# proportions: poor, predicts everyone to be opposition
dfm(data_corpus_dailnoconf1991) %>%
  dfm_weight(scheme = "prop") %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##      truth
## .     Govt Opp
##   Opp   25  33
## 
##     precision = 0.43 
##        recall = 1 
##      accuracy = 0.43 
##     bal. acc. = 0.22

# scaled - results in a fully dense dfm, and poor performance
dfm(data_corpus_dailnoconf1991) %>%
  scale() %>%
  as.dfm() %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt   24  25
##   Opp     1   8
## 
##     precision = 0.49 
##        recall = 0.96 
##      accuracy = 0.55 
##     bal. acc. = 0.37

# tf-idf: better
dfm(data_corpus_dailnoconf1991) %>%
  dfm_tfidf() %>%
  textmodel_svm(y) %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt   16   3
##   Opp     9  30
## 
##     precision = 0.84 
##        recall = 0.64 
##      accuracy = 0.79 
##     bal. acc. = 0.88

# tf-idf: best with document frequency weights
dfm(data_corpus_dailnoconf1991) %>%
  dfm_tfidf() %>%
  textmodel_svm(y, weight = "docfreq") %>%
  predict() %>%
  table(truth) %>%
  performance()
##       truth
## .      Govt Opp
##   Govt   15   2
##   Opp    10  31
## 
##     precision = 0.88 
##        recall = 0.6 
##      accuracy = 0.79 
##     bal. acc. = 0.91

kbenoit commented Mar 28, 2019

Solution after discussing with @koheiw: add the ability to pass arguments through to dfm_group() via .... @kbenoit to implement.
