C++ tokens functions return stochastic ordering #2100
Comments
I don't think there is any practical problem. Ngram generation is multi-threaded, so |
I never thought it a problem either, but I can see two aspects where fixing the order could be desirable.
Of course there are ways to fix this (as you point out above), but most users are likely to expect |
Users should set |
Re-running the code above - and in fact looking at what I originally pasted! - it is stable when threads is set to 1. I think I got this wrong because interactively, it let me set the threads to 1 but was still not at 1. Using reprex() however set it just once. One way to make them the same would be to sort the types before returning the final value. |
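A minimal sketch of the "sort the types before returning" idea, in base R only. The toy `types_*`/`docs_*` objects and the `sort_types()` helper are hypothetical stand-ins for a tokens object (a shared type table plus integer IDs per document); they are not quanteda's internals or API. The point is just that sorting the type table and remapping the IDs makes two runs identical even if the types came back in a different order.

```r
# Hypothetical toy data: same tokens, but the type table is in a
# different order in each "run" (as can happen with multi-threading).
types_a <- c("of_the", "the_Senate", "Fellow-Citizens_of")
docs_a  <- list(d1 = c(3L, 1L, 2L), d2 = c(1L, 1L))

types_b <- c("the_Senate", "Fellow-Citizens_of", "of_the")
docs_b  <- list(d1 = c(2L, 3L, 1L), d2 = c(3L, 3L))

# Hypothetical helper: sort the type table and remap every ID.
sort_types <- function(types, docs) {
  # method = "radix" sorts character vectors in the C locale,
  # so the result does not depend on the session's locale.
  ord <- order(types, method = "radix")   # new position -> old ID
  new_id <- integer(length(types))
  new_id[ord] <- seq_along(types)         # old ID -> new position
  list(types = types[ord],
       docs  = lapply(docs, function(ids) new_id[ids]))
}

a <- sort_types(types_a, docs_a)
b <- sort_types(types_b, docs_b)

identical(types_a, types_b)  # the raw type tables differ
identical(a, b)              # after sorting and remapping they agree
```

The cost is one sort over the types plus one pass over the IDs, and the resulting order is alphabetical rather than anything natural to the text.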
The random order of the types becomes a bigger problem in DFM, so I often sort columns to make it.
dfmt <- dfm(toks)
dfmt <- dfmt[,order(featnames(dfmt))]
We could add |
@kbenoit, it seems that there are people who are puzzled by the problem. See comments on my blog post. I left this issue open until now because sorting in alphabetical order is too artificial, but here is the solution. The DFM should be the same as
require(quanteda)
#> Loading required package: quanteda
#> Warning: package 'quanteda' was built under R version 4.0.5
#> Warning in stringi::stri_info(): Your native charset is not a superset of US-
#> ASCII. This may cause serious problems. Consider switching to UTF-8.
#> Warning in stringi::stri_info(): Your native charset is not a superset of US-
#> ASCII. This may cause serious problems. Consider switching to UTF-8.
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
quanteda_options(threads = 8)
toks <- tokens(data_corpus_inaugural)
toks_ng1 <- tokens_ngrams(toks, n = 2)
toks_ng2 <- tokens_ngrams(toks, n = 2)
identical(types(toks_ng1), types(toks_ng2))
#> [1] FALSE
setequal(types(toks_ng1), types(toks_ng2))
#> [1] TRUE
dfmt1 <- dfm(toks_ng1, tolower = FALSE)
dfmt2 <- dfm(toks_ng2, tolower = FALSE)
identical(featnames(dfmt1), featnames(dfmt2))
#> [1] FALSE
setequal(featnames(dfmt1), featnames(dfmt2))
#> [1] TRUE
# get the types in order of their occurrences
u1 <- unique(unlist(unclass(toks_ng1), use.names = FALSE))
f1 <- types(toks_ng1)[u1]
u2 <- unique(unlist(unclass(toks_ng2), use.names = FALSE))
f2 <- types(toks_ng2)[u2]
# sort columns of DFM
identical(featnames(dfmt1[,f1]), featnames(dfmt2[,f2]))
#> [1] TRUE
dfmt1[,f1]
#> Document-feature matrix of: 59 documents, 66,616 features (96.97% sparse) and 4 docvars.
#> features
#> docs Fellow-Citizens_of of_the the_Senate Senate_and and_of
#> 1789-Washington 1 20 1 1 2
#> 1793-Washington 0 4 0 0 1
#> 1797-Adams 0 29 0 0 2
#> 1801-Jefferson 0 28 0 0 3
#> 1805-Jefferson 0 17 0 0 1
#> 1809-Madison 0 20 0 0 2
#> features
#> docs the_House House_of of_Representatives Representatives_:
#> 1789-Washington 2 2 2 1
#> 1793-Washington 0 0 0 0
#> 1797-Adams 0 0 0 0
#> 1801-Jefferson 0 0 0 0
#> 1805-Jefferson 0 0 0 0
#> 1809-Madison 0 0 0 0
#> features
#> docs :_Among
#> 1789-Washington 1
#> 1793-Washington 0
#> 1797-Adams 0
#> 1801-Jefferson 0
#> 1805-Jefferson 0
#> 1809-Madison 0
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 66,606 more features ] |
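The same occurrence-order idea can be sketched without quanteda, applied to the type table itself rather than to the DFM columns. Everything here is a hypothetical toy (the `types_*`/`docs_*` data and the `order_by_occurrence()` helper are assumptions for illustration, not quanteda internals): types are reordered by the first position at which their ID appears in the documents, and the IDs are remapped to match.

```r
# Hypothetical toy data: identical tokens, differently ordered type tables.
types_a <- c("of_the", "the_Senate", "Fellow-Citizens_of")
docs_a  <- list(d1 = c(3L, 1L, 2L), d2 = c(1L, 1L))

types_b <- c("the_Senate", "Fellow-Citizens_of", "of_the")
docs_b  <- list(d1 = c(2L, 3L, 1L), d2 = c(3L, 3L))

# Hypothetical helper: order types by first occurrence in the documents.
order_by_occurrence <- function(types, docs) {
  # Old IDs in the order they are first used across all documents.
  first_seen <- unique(unlist(docs, use.names = FALSE))
  new_id <- integer(length(types))
  new_id[first_seen] <- seq_along(first_seen)  # old ID -> new position
  # Types that never occur in any document are dropped by this indexing.
  list(types = types[first_seen],
       docs  = lapply(docs, function(ids) new_id[ids]))
}

a <- order_by_occurrence(types_a, docs_a)
b <- order_by_occurrence(types_b, docs_b)

identical(a, b)  # both runs now give the same types and IDs
a$types          # types appear in order of first use
```

Unlike alphabetical sorting, this order depends only on the token sequence, so it stays the same regardless of how many threads produced the type table.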
Stabilising the order to Is |
Both |
A user pointed out to me that the tokens ordering is indeterminate, when creating ngrams. Below, I set the threads to 1, just in case it was multithreading that was causing the issue.
This was already a known issue with tokens_compound(), as in https://stackoverflow.com/questions/66256443/tokens-compound-in-quanteda-changes-the-order-of-features. Is this something we should fix? Or just treat their ordering as indeterminate, like a Python set (but only on creation)?
Created on 2021-03-25 by the reprex package (v1.0.0)