C++ tokens functions return stochastic ordering #2100

kbenoit · 2021-03-25T13:38:45Z

A user pointed out to me that the ordering of token types is indeterminate when creating ngrams. Below, I set the threads to 1, just in case multithreading was causing the issue.

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 1 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

quanteda_options(threads = 1)
set.seed(123)

toks <- tokens(data_corpus_inaugural)
toks1 <- tokens_ngrams(toks, n = 2)
toks2 <- tokens_ngrams(toks, n = 2)

setequal(types(toks1), types(toks2))
## [1] TRUE
all(types(toks1) == types(toks2))
## [1] TRUE
tail(types(toks1))
## [1] "._Driven"      "Driven_by"     "by_conviction" "love_with"    
## [5] "God_protect"   "troops_."
tail(types(toks2))
## [1] "._Driven"      "Driven_by"     "by_conviction" "love_with"    
## [5] "God_protect"   "troops_."

This was already a known issue with tokens_compound(), as in https://stackoverflow.com/questions/66256443/tokens-compound-in-quanteda-changes-the-order-of-features.

Is this something we should fix? Or just treat their ordering as indeterminate, like a Python set (but only on creation)?

Created on 2021-03-25 by the reprex package (v1.0.0)


koheiw · 2021-03-25T22:28:34Z

I don't think there is any practical problem. Ngram generation is multi-threaded, so tokens_ngrams() and tokens_compound() assign different IDs to types depending on the task scheduling. The order of the types affects the order of features in a DFM, but it can be made deterministic by dfmt <- dfmt[,sort(featnames(dfmt))].
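As a minimal sketch of the column sort mentioned above (the two toy documents are made up for illustration; only tokens(), tokens_ngrams(), dfm() and featnames() from quanteda are used):

```r
library(quanteda)

# Toy input, invented for this example
toks <- tokens(c(d1 = "a b c a b", d2 = "c b a"))
toks_ng <- tokens_ngrams(toks, n = 2)
dfmt <- dfm(toks_ng, tolower = FALSE)

# Reorder the columns alphabetically by feature name, so the layout no longer
# depends on the order in which the types happened to receive their IDs
dfmt_sorted <- dfmt[, sort(featnames(dfmt))]
featnames(dfmt_sorted)
```

The sort changes only the column order; the set of features and the counts are unchanged.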

kbenoit · 2021-03-26T11:02:05Z

I never thought it a problem either, but I can see two aspects where fixing the order could be desirable.

- From R, set.seed() does not guarantee the same results. For replication, for instance with resampling approaches, it would be desirable to ensure reproducibility.
- Some dfm methods, despite being bag-of-words, assign random starting values, and these are affected by the column order in a dfm (the topicmodels package does this, for instance). So different orderings mean different model results.

Of course there are ways to fix this (as you point out above), but most users are likely to expect set.seed() to make this happen.
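To illustrate the second point with a toy example in base R: init_values() below is a hypothetical stand-in for a model that draws one random starting value per column, not the topicmodels implementation. The two matrices hold the same data and differ only in column order.

```r
# Toy document-feature counts, invented for this example
m1 <- matrix(c(1, 0, 2, 1, 0, 3), nrow = 2,
             dimnames = list(c("d1", "d2"), c("a_b", "b_c", "c_a")))
m2 <- m1[, c("c_a", "a_b", "b_c")]  # same data, different column order

# Hypothetical model initialisation: one random draw per column, in column order
init_values <- function(m, seed) {
  set.seed(seed)
  setNames(runif(ncol(m)), colnames(m))
}

s1 <- init_values(m1, seed = 123)
s2 <- init_values(m2, seed = 123)

# The seed fixes the sequence of draws by position ...
identical(unname(s1), unname(s2))
# ... but a given feature ends up with a different starting value
s1[["a_b"]] == s2[["a_b"]]
```

With the same seed, the draws are identical by position, yet the feature "a_b" sits in a different column in each matrix and so receives a different value.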

koheiw · 2021-03-26T11:39:07Z

Users should set threads = 1 to get the same results. as.list() also helps. There is no other way.
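A short sketch of both suggestions, using made-up toy documents. as.list() converts each document to a character vector, so the comparison does not depend on the internal type IDs at all.

```r
library(quanteda)

# Restrict quanteda to a single thread, as suggested above
quanteda_options(threads = 1)

# Toy input, invented for this example
toks <- tokens(c(d1 = "a b c a b", d2 = "c b a"))
ng1 <- tokens_ngrams(toks, n = 2)
ng2 <- tokens_ngrams(toks, n = 2)

# Compare the ngrams as character vectors rather than via their integer IDs
identical(as.list(ng1), as.list(ng2))
```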

kbenoit · 2021-03-26T11:42:02Z

~~I thought the same about threads, but in the example above, setting it to 1 did not prevent the differences in types mapping.~~

Re-running the code above - and in fact looking at what I originally pasted! - it is stable when threads is set to 1. I think I got this wrong because, in my interactive session, it let me set threads to 1 but the value in effect was still not 1. Running it through reprex(), however, set it just once.

One way to make them the same would be to sort the types before returning the final value.
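The idea can be sketched in base R on a toy stand-in for a tokens object, where each document is a vector of integer IDs indexing a shared vector of types. sort_types() is a hypothetical helper for illustration, not a quanteda function, and says nothing about how the C++ code would do it.

```r
# Hypothetical helper: sort the types and remap the IDs to match
sort_types <- function(ids, types) {
  ord <- order(types)               # positions of the types in sorted order
  new_id <- integer(length(types))
  new_id[ord] <- seq_along(ord)     # lookup table: old ID -> new ID
  list(ids = lapply(ids, function(x) new_id[x]), types = types[ord])
}

# The same two documents encoded under two different ID assignments,
# as could happen with different thread scheduling
run1 <- sort_types(list(d1 = c(1L, 2L, 3L), d2 = c(3L, 1L)),
                   c("b_c", "a_b", "c_a"))
run2 <- sort_types(list(d1 = c(3L, 1L, 2L), d2 = c(2L, 3L)),
                   c("a_b", "c_a", "b_c"))

# After sorting, both runs have the same types and the same IDs
identical(run1, run2)
```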

koheiw · 2021-05-12T12:03:48Z

The random order of the types becomes a bigger problem in a DFM, so I often sort the columns to make the order consistent.

dfmt <- dfm(toks)
dfmt <- dfmt[,order(featnames(dfmt))]

We could add dfm(sort = TRUE) to do the same.

koheiw · 2021-11-18T12:08:49Z

@kbenoit, it seems that there are people who are puzzled by the problem. See comments on my blog post.

I left this issue open until now because sorting in alphabetical order is too artificial, but here is the solution. The DFM should be the same as with threads = 1.

require(quanteda)
#> Loading required package: quanteda
#> Warning: package 'quanteda' was built under R version 4.0.5
#> Warning in stringi::stri_info(): Your native charset is not a superset of US-
#> ASCII. This may cause serious problems. Consider switching to UTF-8.

#> Warning in stringi::stri_info(): Your native charset is not a superset of US-
#> ASCII. This may cause serious problems. Consider switching to UTF-8.
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
quanteda_options(threads = 8)

toks <- tokens(data_corpus_inaugural)
toks_ng1 <- tokens_ngrams(toks, n = 2)
toks_ng2 <- tokens_ngrams(toks, n = 2)
identical(types(toks_ng1), types(toks_ng2))
#> [1] FALSE
setequal(types(toks_ng1), types(toks_ng2))
#> [1] TRUE

dfmt1 <- dfm(toks_ng1, tolower = FALSE)
dfmt2 <- dfm(toks_ng2, tolower = FALSE)
identical(featnames(dfmt1), featnames(dfmt2))
#> [1] FALSE
setequal(featnames(dfmt1), featnames(dfmt2))
#> [1] TRUE

# get the types in order of their occurrences
u1 <- unique(unlist(unclass(toks_ng1), use.names = FALSE))
f1 <- types(toks_ng1)[u1]

u2 <- unique(unlist(unclass(toks_ng2), use.names = FALSE))
f2 <- types(toks_ng2)[u2]

# sort columns of DFM
identical(featnames(dfmt1[,f1]), featnames(dfmt2[,f2]))
#> [1] TRUE

dfmt1[,f1]
#> Document-feature matrix of: 59 documents, 66,616 features (96.97% sparse) and 4 docvars.
#>                  features
#> docs              Fellow-Citizens_of of_the the_Senate Senate_and and_of
#>   1789-Washington                  1     20          1          1      2
#>   1793-Washington                  0      4          0          0      1
#>   1797-Adams                       0     29          0          0      2
#>   1801-Jefferson                   0     28          0          0      3
#>   1805-Jefferson                   0     17          0          0      1
#>   1809-Madison                     0     20          0          0      2
#>                  features
#> docs              the_House House_of of_Representatives Representatives_:
#>   1789-Washington         2        2                  2                 1
#>   1793-Washington         0        0                  0                 0
#>   1797-Adams              0        0                  0                 0
#>   1801-Jefferson          0        0                  0                 0
#>   1805-Jefferson          0        0                  0                 0
#>   1809-Madison            0        0                  0                 0
#>                  features
#> docs              :_Among
#>   1789-Washington       1
#>   1793-Washington       0
#>   1797-Adams            0
#>   1801-Jefferson        0
#>   1805-Jefferson        0
#>   1809-Madison          0
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 66,606 more features ]
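Why ordering by first occurrence is stable can be seen on a toy example in base R. The ID vectors and type vectors below are invented, and occurrence_order() is a hypothetical helper that mirrors the unique(unlist(...)) step above.

```r
# Hypothetical helper: types in the order in which they first occur in the text
occurrence_order <- function(ids, types) {
  types[unique(unlist(ids, use.names = FALSE))]
}

# The same two documents under two different ID assignments
f1 <- occurrence_order(list(d1 = c(1L, 2L, 3L), d2 = c(3L, 1L)),
                       c("b_c", "a_b", "c_a"))
f2 <- occurrence_order(list(d1 = c(3L, 1L, 2L), d2 = c(2L, 3L)),
                       c("a_b", "c_a", "b_c"))

# The order depends only on the text, not on which ID each type received
identical(f1, f2)
```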

kbenoit · 2021-11-18T14:19:15Z

Stabilising the order to match threads = 1 is a great idea, especially since setting the seed in R will not make this stable.

Is tokens_ngrams() the only function where this happens?


koheiw · 2021-11-19T11:33:18Z

Both tokens_ngrams() and tokens_compound().

kbenoit added the tokens label Mar 25, 2021

kbenoit assigned koheiw and kbenoit Mar 25, 2021

koheiw added a commit that referenced this issue Nov 18, 2021

Sort the columns if DFM in order of the occurrences of tokens to addr…

62a491c

…ess #2100

koheiw mentioned this issue Nov 18, 2021

Sort DFM columns #2150

Merged

koheiw closed this as completed Nov 25, 2021
