Benchmark dictionary lookup functions #960
In 'normal' usage of tokens_lookup(), performance looks like this:

microbenchmark::microbenchmark(
    fixed = tokens_lookup(toks, dict_liwc, valuetype = 'fixed', verbose = FALSE),
    glob = tokens_lookup(toks, dict_liwc, valuetype = 'glob', verbose = FALSE),
    times = 1
)
# Unit: seconds
#  expr       min        lq      mean    median        uq       max neval
# fixed  61.17901  61.17901  61.17901  61.17901  61.17901  61.17901     1
#  glob 101.81795 101.81795 101.81795 101.81795 101.81795 101.81795     1
However, it takes forever to finish lemmatization (or custom stemming), because of the large number of keys; most of the time is spent converting characters to IDs. I was thinking to create a types converter for users who wish to use a custom word stemmer. The code would be like:

type <- char_tolower(attr(toks, 'types'))
type <- type[stringi::stri_detect_regex(type, '^[a-zA-Z]+$')]
stem <- char_wordstem(type) # this will be the user's custom stemmer

# takes forever
dict_stem <- dictionary(split(type, stem))
length(dict_stem) # 216000

tokens_lookup(toks, dict_stem[1:100], valuetype = 'fixed', verbose = FALSE)
# in 3 sec

tokens_convert <- function(x, from, to) {
    type_old <- attr(x, 'types')
    type_new <- to[fastmatch::fmatch(type_old, from)]
    # fall back to the original type where no match is found
    type_new <- ifelse(is.na(type_new), type_old, type_new)
    attr(x, 'types') <- type_new
    quanteda:::tokens_recompile(x)
}
tokens_convert(toks, from = type, to = stem)
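A minimal self-contained check of this sketch, assuming the internal attr(x, 'types') representation used above; the tiny example tokens and the run/running mapping are illustrative, not from the benchmark:

library(quanteda)
toks_demo <- tokens(c(d1 = "running runs ran quickly"))
# map three verb forms onto a single stem; 'quickly' is left untouched
tokens_convert(toks_demo,
               from = c("running", "runs", "ran"),
               to = c("run", "run", "run"))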
What do you mean by the "custom stemming"? At what stage does this happen in the workflow above?
It does not happen inside tokens_lookup(); the stemming is done beforehand:

type <- char_tolower(attr(toks, 'types'))
type <- type[stringi::stri_detect_regex(type, '^[a-zA-Z]+$')]
stem <- char_wordstem(type) # this will be user's custom stemmer
dict_stem <- dictionary(split(type, stem))
length(dict_stem) # 216000
dict_stem[50010:50020]
# Dictionary object with 11 key entries.
# - [dobrica]:
# - dobrica
# - [dobrindt]:
# - dobrindt
# - [dobrinja]:
# - dobrinja
# - [dobriskey]:
# - dobriskey
# - [dobro]:
# - dobro
# - [dobrokhotov]:
# - dobrokhotov
# - [dobromyslova]:
# - dobromyslova
# - [dobroyd]:
# - dobroyd
# - [dobrynia]:
# - dobrynia
# - [dobrynska]:
# - dobrynska
# - [dobrzycka]:
# - dobrzycka
tokens_lookup(toks, dict_stem, valuetype = 'fixed', verbose = FALSE)

This is really inefficient.
It would also be good to have an argument for a user-defined function to be applied to types:

tokens_convert(toks, fun = yourStemmer)

This is the same as:

toks <- tokens(txt)
types <- attr(toks, 'types')
attr(toks, 'types') <- yourStemmer(types)
toks <- quanteda:::tokens_recompile(toks)

Advanced users (including me) would love this.
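A hedged sketch of what that fun argument could look like; tokens_convert_fun is a hypothetical name, not an existing quanteda function:

tokens_convert_fun <- function(x, fun, ...) {
    # apply the user-supplied function to the type vector, then recompile
    attr(x, 'types') <- fun(attr(x, 'types'), ...)
    quanteda:::tokens_recompile(x)
}
# e.g. lowercase all types:
# tokens_convert_fun(toks, char_tolower)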
Performance is slow when the dictionaries are large and the texts are large, but we are not really sure where the bottlenecks are occurring. See this comment from SO, for instance.
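One way to locate the bottleneck is to profile a single lookup with base R's Rprof(); a minimal sketch, assuming toks and dict_liwc from the benchmark above are in scope:

Rprof("lookup.prof")
tokens_lookup(toks, dict_liwc, valuetype = 'fixed', verbose = FALSE)
Rprof(NULL)
# functions ranked by time spent in their own code
head(summaryRprof("lookup.prof")$by.self)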
For both lookup functions (a benchmark sketch follows below):
- dfm_lookup()
- tokens_lookup()
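A hedged benchmark sketch covering both functions, again assuming toks and dict_liwc from above; the dfm is built once outside the timing so that only the lookup itself is measured:

dfmat <- dfm(toks)
microbenchmark::microbenchmark(
    tokens = tokens_lookup(toks, dict_liwc, valuetype = 'fixed', verbose = FALSE),
    dfm = dfm_lookup(dfmat, dict_liwc, valuetype = 'fixed'),
    times = 1
)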