topfeatures() misleading on fcm with only upper triangle #2141

Closed
kbenoit opened this issue Oct 15, 2021 · 3 comments · Fixed by #2328

kbenoit commented Oct 15, 2021

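fcm() objects created with the default of tri = TRUE produce misleading feature frequencies when using topfeatures(), which calls the inherited method for dfm objects that simply sums the columns. Because only the upper triangle is stored, this downweights features that come first in the feature order: their co-occurrences sit in their rows, not in their columns.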

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

txt <- c(
  "a b c d",
  "a a b c d",
  "c d"
)
dfmat <- tokens(txt, remove_punct = TRUE) %>%
  dfm()

fcmat <- fcm(dfmat)
fcmat
## Feature co-occurrence matrix of: 4 by 4 features.
##         features
## features a b c d
##        a 1 3 3 3
##        b 0 0 2 2
##        c 0 0 0 3
##        d 0 0 0 0

topfeatures(fcmat)
## d c b a 
## 8 5 3 1

A solution would be to create a new method topfeatures.fcm() that first forces the matrix to be symmetric and then sums the columns.

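topfeatures.fcm <- function(x,
                            n = 10,
                            decreasing = TRUE,
                            scheme = c("count", "docfreq"),
                            ...) {
  scheme <- match.arg(scheme)
  # mirror the upper triangle so each column holds all of a feature's co-occurrences
  x <- as.dfm(Matrix::forceSymmetric(x))
  # pass the arguments through instead of silently falling back to the defaults
  topfeatures(x, n = n, decreasing = decreasing, scheme = scheme)
}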

topfeatures.fcm(fcmat)
##  a  c  d  b 
## 10  8  8  7

TKrarup commented Oct 15, 2021

Consider adding a diag = FALSE option to filter out same-token co-occurrences (which are rarely analytically interesting).
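
A minimal base R sketch of how such an option could behave, using a plain matrix as a stand-in for the fcm printed earlier in this thread. The helper `top_cooc()` and its `keep_diag` argument are hypothetical names for illustration, not part of quanteda.

```r
# Plain-matrix stand-in for the upper-triangular fcm shown above (tri = TRUE)
m <- matrix(c(1, 3, 3, 3,
              0, 0, 2, 2,
              0, 0, 0, 3,
              0, 0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(letters[1:4], letters[1:4]))

# Hypothetical helper (not a quanteda function): symmetrise, optionally
# drop same-token co-occurrences, then rank features by column sum
top_cooc <- function(x, n = 10, keep_diag = TRUE) {
  sym <- x + t(x)                               # mirror upper triangle into lower
  diag(sym) <- if (keep_diag) diag(x) else 0    # count the diagonal once, or not at all
  head(sort(colSums(sym), decreasing = TRUE), n)
}

top_cooc(m)                     # a = 10, c = 8, d = 8, b = 7
top_cooc(m, keep_diag = FALSE)  # a = 9,  c = 8, d = 8, b = 7
```

With the diagonal kept, the totals match the symmetrised result shown above; dropping it only changes `a`, the one feature that co-occurs with itself in this example.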

kbenoit added this to the v4 release milestone Apr 12, 2023

koheiw commented Nov 27, 2023

We might want to disable topfeatures() for fcm objects, or have it return the margins (the feature frequencies stored in the object's metadata).

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens("a bb cc a dd")

fcmt <- fcm(toks)
fcmt@meta$object$margin
#>  a bb cc dd 
#>  2  1  1  1

dfmt <- dfm(toks)
topfeatures(dfmt)
#>  a bb cc dd 
#>  2  1  1  1

Created on 2023-11-27 with reprex v2.0.2

kbenoit commented Nov 29, 2023

I'd favour disabling topfeatures() altogether for fcm objects, since feature frequencies are not defined for them in the same way as for dfm objects. The current man page for topfeatures() refers only to dfm objects; it works for fcm objects only because of inheritance.
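
As a rough illustration of what "disabling" could look like, here is a base R sketch with toy S3 classes. The generic `top_items()` and the classes `toy_dfm` and `toy_fcm` are hypothetical stand-ins, not quanteda's actual classes or the change that was eventually merged.

```r
# Toy generic standing in for topfeatures(); all names here are hypothetical
top_items <- function(x, ...) UseMethod("top_items")

# Method for the dfm-like class: rank features by column sum
top_items.toy_dfm <- function(x, n = 10, ...) {
  head(sort(colSums(unclass(x)), decreasing = TRUE), n)
}

# Without this method, a toy_fcm would silently dispatch to top_items.toy_dfm
# through inheritance; defining it turns a misleading result into an error
top_items.toy_fcm <- function(x, ...) {
  stop("top_items() is not defined for fcm-like objects", call. = FALSE)
}

m <- matrix(c(1, 3,
              0, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), c("a", "b")))
d <- structure(m, class = c("toy_dfm", "matrix"))
f <- structure(m, class = c("toy_fcm", "toy_dfm", "matrix"))

top_items(d)                                      # works: b = 3, a = 1
msg <- tryCatch(top_items(f), error = conditionMessage)
msg                                               # the error message from the fcm method
```

The design point is that the more specific method intercepts dispatch before the inherited one is reached, so the failure is explicit rather than a plausible-looking but wrong ranking.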

kbenoit pushed a commit that referenced this issue Dec 14, 2023