textstat_frequency after dfm_tfidf weighting #1646
Well, we now do allow grouping of weighted dfms (see #1547), but we did not pass this ability through to textstat_frequency(). Consider a simple example with three documents.

library("quanteda")
## Package version: 1.4.1
dfmat <- dfm(c(
"a a b b b c d",
"a c c c c",
"b b b d"
))
dfmat
## Document-feature matrix of: 3 documents, 4 features (33.3% sparse).
## 3 x 4 sparse Matrix of class "dfm"
## features
## docs a b c d
## text1 2 3 1 1
## text2 1 0 4 0
## text3 0 3 0 1

This is what the tf-idf weighted version of this matrix looks like:

dfm_tfidf(dfmat)
## Document-feature matrix of: 3 documents, 4 features (33.3% sparse).
## 3 x 4 sparse Matrix of class "dfm"
## features
## docs a b c d
## text1 0.3521825 0.5282738 0.1760913 0.1760913
## text2 0.1760913 0 0.7043650 0
## text3 0 0.5282738 0 0.1760913

and here is a grouped version:

dfm_group(dfmat, groups = c(1, 1, 2))
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
## features
## docs a b c d
## 1 3 3 5 1
## 2 0 3 0 1

If we "group" the weighted matrix using the default method of summing the cells (the only method we currently use), then this is the result:

dfm_tfidf(dfmat) %>%
dfm_group(groups = c(1, 1, 2), force = TRUE)
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
## features
## docs a b c d
## 1 0.5282738 0.5282738 0.8804563 0.1760913
## 2 0 0.5282738 0 0.1760913

but this is very different from what happens if we group it first and then weight it:

dfm_group(dfmat, groups = c(1, 1, 2)) %>%
dfm_tfidf()
## Document-feature matrix of: 2 documents, 4 features (25.0% sparse).
## 2 x 4 sparse Matrix of class "dfm"
## features
## docs a b c d
## 1 0.90309 0 1.50515 0
## 2 0 0 0 0

Here, b and d are weighted to zero because they occur in each new "document", defined as the group. This is very different from having summed the weights pre-grouping: weighting before grouping halves the weight of a and makes it equal in importance to b, which has a zero weight if grouping happens first. The moral of the story: make sure you understand what you intend by a "document" before you weight by inverse document frequency. My view is that weighting first and then grouping is simply wrong.
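For anyone checking the arithmetic: quanteda's default dfm_tfidf() scheme multiplies each count by log10(N / docfreq). A minimal base-R sketch reproducing the numbers above (the count matrix is rebuilt by hand rather than taken from a dfm, so this does not depend on quanteda's internals):

```r
# counts from the three-document example above
counts <- matrix(c(2, 3, 1, 1,
                   1, 0, 4, 0,
                   0, 3, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(docs = c("text1", "text2", "text3"),
                                 features = c("a", "b", "c", "d")))
# every feature appears in 2 of the 3 documents
idf <- log10(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(counts, 2, idf, "*")
tfidf["text1", "a"]  # 2 * log10(3/2) = 0.3521825

# grouping first changes the document frequencies, hence the idf:
grouped <- rbind(`1` = counts[1, ] + counts[2, ], `2` = counts[3, ])
idf2 <- log10(nrow(grouped) / colSums(grouped > 0))  # b and d now get idf = 0
sweep(grouped, 2, idf2, "*")["1", "a"]  # 3 * log10(2) = 0.90309
```

This reproduces both results above from the same counts, which makes clear that the divergence comes entirely from when docfreq is computed.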
I thought "Error: will not group a weighted dfm; use force = TRUE" is from
No, probably not when the
I did the grouping before the computation of tf-idf and used the same grouping in textstat_frequency. I was not interested in docfreq, and I knew that frequency was weighted by tf-idf. I basically wanted to extract the results from the dfm into a plottable data frame. The result of the procedure replicated Silge's plots exactly (but on a different scale). I can completely circumvent textstat_frequency and use the following procedure, which is more complicated but leads to exactly the same results.

austen_dfm <- tidy_books %>% cast_dfm(book, word, n) %>%
dfm(remove=stopwords("en"), tolower = T) %>%
dfm_group(groups=docnames()) %>%
dfm_tfidf()
austen_trip <- convert(austen_dfm, to="tripletlist")
austen_df <- data.frame(book=austen_trip$document,
feature=austen_trip$feature,
tf_idf=austen_trip$frequency)
res <- austen_df %>%
arrange(desc(tf_idf)) %>%
mutate(feature=factor(feature, levels=rev(unique(feature)))) %>%
group_by(book) %>%
slice(1:15) %>%
ungroup()

Created on 2019-03-14 by the reprex package (v0.2.1)
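The last step (top features per book from a long data frame) can also be done without dplyr. A base-R sketch with a hypothetical toy data frame mirroring the triplet-list structure above (books "A"/"B" and the values are made up for illustration):

```r
# hypothetical long-format tf-idf results, one row per (book, feature) pair
austen_df <- data.frame(
  book    = c("A", "A", "A", "B", "B"),
  feature = c("x", "y", "z", "x", "w"),
  tf_idf  = c(0.9, 0.5, 0.1, 0.7, 0.2)
)
# sort by book, then by descending tf-idf, and keep the top 2 per book
ord <- austen_df[order(austen_df$book, -austen_df$tf_idf), ]
res <- do.call(rbind, lapply(split(ord, ord$book), head, n = 2))
res
```

Here `head(n = 2)` plays the role of `slice(1:15)` in the dplyr version.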
@koheiw I've been thinking more about this, looking more at the above application, and finding myself overriding our rules (using force = TRUE).

I think I'm coming around to the view that we could allow a user to group a weighted dfm, and to weight an already weighted dfm: rather than requiring the user to force it, allow it always and emit a warning that can be toggled off (at least this calls attention to a procedure that may not be recommended, or warns the user if they forgot about an earlier weighting). What do you think?
By the way, can you tell me what combination of feature weighting improved SVM performance?
tf-idf improved predictive performance a lot. For computational performance, in the current
It is probably because SVM is a non-text model that takes any quantity for prediction. Are there feature weighting functions for regular Matrix that you can use instead? (if not, we should generalize the code in |
The testing I've done is based on a linear SVM, which seems to be the recommended kernel for textual features. The need to normalize in some fashion is the same as for a measure such as correlation: the space in which the separating plane is computed is affected by the absolute positions of the inputs, which means that longer documents influence the decision values more. But proportion weighting does not perform nearly as well as tf-idf, even in our default version where the tf are counts (not proportions). So here, the idf seems to be helping a lot by getting rid of common features.

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
performance <- function(mytable, verbose = TRUE) {
truePositives <- mytable[1, 1]
trueNegatives <- sum(diag(mytable)[-1])
falsePositives <- sum(mytable[1, ]) - truePositives
falseNegatives <- sum(mytable[, 1]) - truePositives
precision <- truePositives / (truePositives + falsePositives)
recall <- truePositives / (truePositives + falseNegatives)
accuracy <- sum(diag(mytable)) / sum(mytable)
tnr <- trueNegatives / (trueNegatives + falsePositives)
balanced_accuracy <- sum(c(precision, tnr), na.rm = TRUE) / 2
if (verbose) {
print(mytable)
cat(
"\n precision =", round(precision, 2),
"\n recall =", round(recall, 2),
"\n accuracy =", round(accuracy, 2),
"\n bal. acc. =", round(balanced_accuracy, 2),
"\n"
)
}
invisible(c(precision, recall))
}
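As a quick sanity check on the formulas in performance() above, here is the same arithmetic done by hand on a hypothetical 2 x 2 table (predictions in rows, truth in columns; the numbers are made up, not taken from the debate data):

```r
# hypothetical confusion table: rows = predicted class, columns = truth
m <- matrix(c(8, 2,
              1, 9), nrow = 2, byrow = TRUE,
            dimnames = list(predicted = c("Govt", "Opp"),
                            truth = c("Govt", "Opp")))
tp <- m[1, 1]                 # 8 true positives
tn <- m[2, 2]                 # 9 true negatives
fp <- sum(m[1, ]) - tp        # 2 false positives
fn <- sum(m[, 1]) - tp        # 1 false negative
precision <- tp / (tp + fp)         # 8/10 = 0.8
recall    <- tp / (tp + fn)         # 8/9
accuracy  <- sum(diag(m)) / sum(m)  # 17/20 = 0.85
```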
# define training texts and the "true" govt/opp status
y <- ifelse(docvars(data_corpus_dailnoconf1991, "name") == "Haughey", "Govt", NA)
y <- ifelse(docvars(data_corpus_dailnoconf1991, "name") %in% c("Spring", "deRossa"), "Opp", y)
truth <- ifelse(docvars(data_corpus_dailnoconf1991, "party") %in% c("FF", "PD"), "Govt", "Opp")
# no weighting: poor
dfm(data_corpus_dailnoconf1991) %>%
textmodel_svm(y) %>%
predict() %>%
table(truth) %>%
performance()
## truth
## . Govt Opp
## Govt 6 0
## Opp 19 33
##
## precision = 1
## recall = 0.24
## accuracy = 0.67
## bal. acc. = 1
# proportions: poor, predicts everyone to be opposition
dfm(data_corpus_dailnoconf1991) %>%
dfm_weight(scheme = "prop") %>%
textmodel_svm(y) %>%
predict() %>%
table(truth) %>%
performance()
## truth
## . Govt Opp
## Opp 25 33
##
## precision = 0.43
## recall = 1
## accuracy = 0.43
## bal. acc. = 0.22
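# For context on what scheme = "prop" does: each count is divided by its
# document's total, so every row sums to 1 and absolute document length drops
# out. A base-R sketch on toy counts (not the debate data):

```r
# toy counts: two documents, four features
counts <- matrix(c(2, 3, 1, 1,
                   1, 0, 4, 0), nrow = 2, byrow = TRUE)
prop <- counts / rowSums(counts)  # row-wise relative frequencies
rowSums(prop)  # each row sums to 1
```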
# scaled - results in a fully dense dfm, and poor performance
dfm(data_corpus_dailnoconf1991) %>%
scale() %>%
as.dfm() %>%
textmodel_svm(y) %>%
predict() %>%
table(truth) %>%
performance()
## truth
## . Govt Opp
## Govt 24 25
## Opp 1 8
##
## precision = 0.49
## recall = 0.96
## accuracy = 0.55
## bal. acc. = 0.37
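# Why scale() destroys sparsity: it centers each column by its mean, so every
# zero cell becomes -mean/sd, which is nonzero for any feature with a nonzero
# mean. A base-R illustration on a toy sparse matrix (values made up):

```r
# toy sparse matrix: most cells are zero
m <- matrix(c(0, 0, 5,
              0, 2, 0,
              1, 0, 0), nrow = 3, byrow = TRUE)
z <- scale(m)  # column z-scores: subtract column mean, divide by column sd
sum(m == 0)    # 6 zero cells before scaling
sum(z == 0)    # 0 zero cells after: the matrix is now fully dense
```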
# tf-idf: better
dfm(data_corpus_dailnoconf1991) %>%
dfm_tfidf() %>%
textmodel_svm(y) %>%
predict() %>%
table(truth) %>%
performance()
## truth
## . Govt Opp
## Govt 16 3
## Opp 9 30
##
## precision = 0.84
## recall = 0.64
## accuracy = 0.79
## bal. acc. = 0.88
# tf-idf: best with document frequency weights
dfm(data_corpus_dailnoconf1991) %>%
dfm_tfidf() %>%
textmodel_svm(y, weight = "docfreq") %>%
predict() %>%
table(truth) %>%
performance()
## truth
## . Govt Opp
## Govt 15 2
## Opp 10 31
##
## precision = 0.88
## recall = 0.6
## accuracy = 0.79
## bal. acc. = 0.91
Describe the bug
I tried to replicate the basic tf-idf example in Julia Silge's tidytext book with quanteda.
This worked in quanteda 1.4.1 but not in version 1.4.2 from GitHub.
Reproducible code
sessionInfo()
#> R version 3.4.1 (2017-06-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: OS X El Capitan 10.11.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 janeaustenr_0.1.5 quanteda_1.4.2
#> [4] tidytext_0.2.0 forcats_0.3.0 stringr_1.3.1
#> [7] dplyr_0.7.8 purrr_0.2.5 readr_1.1.1
#> [10] tidyr_0.8.2 tibble_2.0.1 ggplot2_3.1.0
#> [13] tidyverse_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_0.2.5 reshape2_1.4.3 haven_1.1.2
#> [4] lattice_0.20-35 colorspace_1.3-2 SnowballC_0.6.0
#> [7] htmltools_0.3.6 yaml_2.2.0 rlang_0.3.1
#> [10] pillar_1.3.1 foreign_0.8-69 glue_1.3.0
#> [13] withr_2.1.2 modelr_0.1.2 readxl_1.1.0
#> [16] bindr_0.1.1 plyr_1.8.4 munsell_0.5.0
#> [19] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
#> [22] psych_1.8.10 evaluate_0.12 knitr_1.20
#> [25] parallel_3.4.1 tokenizers_0.1.4 broom_0.4.5
#> [28] Rcpp_1.0.0 spacyr_1.0.1 backports_1.1.2
#> [31] scales_1.0.0 RcppParallel_4.4.2 jsonlite_1.6
#> [34] fastmatch_1.1-0 stopwords_0.9.0 mnormt_1.5-5
#> [37] hms_0.4.2 digest_0.6.18 stringi_1.3.1
#> [40] grid_3.4.1 rprojroot_1.3-2 cli_1.0.1
#> [43] tools_3.4.1 magrittr_1.5 lazyeval_0.2.1
#> [46] crayon_1.3.4 pkgconfig_2.0.2 Matrix_1.2-11
#> [49] data.table_1.12.0 xml2_1.2.0 lubridate_1.7.4
#> [52] assertthat_0.2.0 rmarkdown_1.10 httr_1.3.1
#> [55] R6_2.3.0 nlme_3.1-131 compiler_3.4.1