C++ tokens functions return stochastic ordering #2100

Closed
kbenoit opened this issue Mar 25, 2021 · 8 comments

@kbenoit
Collaborator

kbenoit commented Mar 25, 2021

A user pointed out to me that the ordering of the token types is indeterminate when creating ngrams. Below, I set the threads to 1, just in case multithreading was causing the issue.

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 1 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

quanteda_options(threads = 1)
set.seed(123)

toks <- tokens(data_corpus_inaugural)
toks1 <- tokens_ngrams(toks, n = 2)
toks2 <- tokens_ngrams(toks, n = 2)

setequal(types(toks1), types(toks2))
## [1] TRUE
all(types(toks1) == types(toks2))
## [1] TRUE
tail(types(toks1))
## [1] "._Driven"      "Driven_by"     "by_conviction" "love_with"    
## [5] "God_protect"   "troops_."
tail(types(toks2))
## [1] "._Driven"      "Driven_by"     "by_conviction" "love_with"    
## [5] "God_protect"   "troops_."

This was already a known issue with tokens_compound(), as in https://stackoverflow.com/questions/66256443/tokens-compound-in-quanteda-changes-the-order-of-features.

Is this something we should fix? Or just treat their ordering as indeterminate, like a Python set (but only on creation)?

Created on 2021-03-25 by the reprex package (v1.0.0)

@koheiw
Collaborator

koheiw commented Mar 25, 2021

I don't think there is any practical problem. Ngram generation is multi-threaded, so tokens_ngrams() and tokens_compound() assign different IDs to types depending on the task scheduling. The order of the types affects the order of features in the DFM, but it can be made deterministic by dfmt <- dfmt[,sort(featnames(dfmt))].

@kbenoit
Collaborator Author

kbenoit commented Mar 26, 2021

I never thought it a problem either, but I can see two aspects where fixing the order could be desirable.

  1. From R, set.seed() does not guarantee the same results. For replication, for instance with resampling approaches, it would be desirable to ensure reproducibility.
  2. Some dfm methods, despite being bag-of-words, assign random starting values, and these are affected by the column order in a dfm (the topicmodels package does this, for instance). So different orderings mean different model results.

Of course there are ways to fix this (as you point out above), but most users are likely to expect set.seed() to make this happen.

@koheiw
Collaborator

koheiw commented Mar 26, 2021

Users should set threads = 1 to get the same results. as.list() also helps. There is no other way.
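
A minimal sketch of both suggestions, assuming quanteda 3.x and a tiny made-up corpus (the texts and object names are only for illustration). Converting to a plain list with as.list() compares the token strings themselves, so the result should not depend on how the integer type IDs were assigned:

```r
library(quanteda)

# Restrict quanteda to a single thread so type IDs are assigned in a fixed order
quanteda_options(threads = 1)

# Toy corpus, purely illustrative
toks <- tokens(c(d1 = "a b c d", d2 = "c d e f"))

ng1 <- tokens_ngrams(toks, n = 2)
ng2 <- tokens_ngrams(toks, n = 2)

# as.list() returns character vectors per document, so this comparison
# ignores the internal type-to-ID mapping
identical(as.list(ng1), as.list(ng2))
```

The as.list() comparison is useful when checking replication across runs, because it tests the content of the tokens rather than their internal representation.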

@kbenoit
Collaborator Author

kbenoit commented Mar 26, 2021

I thought the same about threads, but in the example above, setting it to 1 did not prevent the differences in types mapping.

Re-running the code above (and in fact looking at what I originally pasted!), it is stable when threads is set to 1. I think I got this wrong because, in my interactive session, the call to set the threads to 1 went through but the setting was still not at 1. With reprex(), however, it was set just once.

One way to make them the same would be to sort the types before returning the final value.

@koheiw
Collaborator

koheiw commented May 12, 2021

The random order of the types becomes a bigger problem in the DFM, so I often sort the columns to make it deterministic.

dfmt <- dfm(toks)
dfmt <- dfmt[,order(featnames(dfmt))]

We could add dfm(sort = TRUE) to do the same.
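
Until something like that exists, the same thing can be wrapped in a small helper. This is only a sketch: dfm_sort_features() is a hypothetical name, not a quanteda function, and the toy texts are made up. It uses method = "radix" in order() so that the sort is done in the C locale and does not vary with the user's collation settings:

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): sort DFM columns by feature name.
# method = "radix" sorts character vectors in the C locale, so the column
# order does not depend on the session's locale.
dfm_sort_features <- function(x) {
  x[, order(featnames(x), method = "radix")]
}

# Toy example, purely illustrative
dfmt <- dfm(tokens(c(d1 = "b a c", d2 = "c b d")))
featnames(dfm_sort_features(dfmt))
#> [1] "a" "b" "c" "d"
```

Note that C-locale sorting puts all uppercase letters before lowercase ones, which may look odd but is stable across machines.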

@koheiw
Collaborator

koheiw commented Nov 18, 2021

@kbenoit, it seems that there are people who are puzzled by the problem. See comments on my blog post.

I left this issue open until now because sorting in alphabetical order is too artificial, but here is the solution. The resulting DFM should be the same as with threads = 1.

require(quanteda)
#> Loading required package: quanteda
#> Warning: package 'quanteda' was built under R version 4.0.5
#> Warning in stringi::stri_info(): Your native charset is not a superset of US-
#> ASCII. This may cause serious problems. Consider switching to UTF-8.

#> Warning in stringi::stri_info(): Your native charset is not a superset of US-
#> ASCII. This may cause serious problems. Consider switching to UTF-8.
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
quanteda_options(threads = 8)

toks <- tokens(data_corpus_inaugural)
toks_ng1 <- tokens_ngrams(toks, n = 2)
toks_ng2 <- tokens_ngrams(toks, n = 2)
identical(types(toks_ng1), types(toks_ng2))
#> [1] FALSE
setequal(types(toks_ng1), types(toks_ng2))
#> [1] TRUE

dfmt1 <- dfm(toks_ng1, tolower = FALSE)
dfmt2 <- dfm(toks_ng2, tolower = FALSE)
identical(featnames(dfmt1), featnames(dfmt2))
#> [1] FALSE
setequal(featnames(dfmt1), featnames(dfmt2))
#> [1] TRUE

# get the types in order of their occurrences
u1 <- unique(unlist(unclass(toks_ng1), use.names = FALSE))
f1 <- types(toks_ng1)[u1]

u2 <- unique(unlist(unclass(toks_ng2), use.names = FALSE))
f2 <- types(toks_ng2)[u2]

# reorder columns of DFM by first occurrence of each type
identical(featnames(dfmt1[,f1]), featnames(dfmt2[,f2]))
#> [1] TRUE

dfmt1[,f1]
#> Document-feature matrix of: 59 documents, 66,616 features (96.97% sparse) and 4 docvars.
#>                  features
#> docs              Fellow-Citizens_of of_the the_Senate Senate_and and_of
#>   1789-Washington                  1     20          1          1      2
#>   1793-Washington                  0      4          0          0      1
#>   1797-Adams                       0     29          0          0      2
#>   1801-Jefferson                   0     28          0          0      3
#>   1805-Jefferson                   0     17          0          0      1
#>   1809-Madison                     0     20          0          0      2
#>                  features
#> docs              the_House House_of of_Representatives Representatives_:
#>   1789-Washington         2        2                  2                 1
#>   1793-Washington         0        0                  0                 0
#>   1797-Adams              0        0                  0                 0
#>   1801-Jefferson          0        0                  0                 0
#>   1805-Jefferson          0        0                  0                 0
#>   1809-Madison            0        0                  0                 0
#>                  features
#> docs              :_Among
#>   1789-Washington       1
#>   1793-Washington       0
#>   1797-Adams            0
#>   1801-Jefferson        0
#>   1805-Jefferson        0
#>   1809-Madison          0
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 66,606 more features ]
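
The reordering above could be packaged as a helper along these lines. This is a sketch under assumptions: types_by_occurrence() is a hypothetical name, not part of quanteda, and the toy texts are made up; the body is the same unique(unlist(unclass(...))) approach used in the reprex:

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): return the types of a tokens
# object in the order in which they first occur in the documents.
types_by_occurrence <- function(x) {
  ids <- unique(unlist(unclass(x), use.names = FALSE))
  types(x)[ids]
}

# Toy example, purely illustrative
toks <- tokens(c(d1 = "a b c", d2 = "c b d"))
toks_ng <- tokens_ngrams(toks, n = 2)

dfmt <- dfm(toks_ng, tolower = FALSE)
# Reorder the columns so they no longer depend on how type IDs were assigned
dfmt <- dfmt[, types_by_occurrence(toks_ng)]
featnames(dfmt)
#> [1] "a_b" "b_c" "c_b" "b_d"
```

Because the order is derived from the documents rather than from the type IDs, it should come out the same whatever the thread count.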

@kbenoit
Collaborator Author

kbenoit commented Nov 18, 2021

Stabilising the order to threads = 1 is a great idea, esp since setting the seed in R will not make this stable.

Is tokens_ngrams() the only function where this happens?

@koheiw
Collaborator

koheiw commented Nov 19, 2021

Both tokens_ngrams() and tokens_compound().

koheiw closed this as completed Nov 25, 2021