
Benchmark dictionary lookup functions #960

Closed

kbenoit opened this issue Sep 12, 2017 · 4 comments
kbenoit (Collaborator) commented Sep 12, 2017

Performance is slow when both the dictionaries and the texts are large, but we are not really sure where the bottlenecks are occurring. See, for instance, this comment from SO.

For (a starting-point benchmark sketch follows the list):

  • dfm_lookup()
  • tokens_lookup()
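
A minimal sketch of how such a benchmark could be set up, using quanteda's bundled inaugural-address corpus and a toy dictionary as stand-ins (illustrative choices, not data from this issue):

library(quanteda)
dict <- dictionary(list(pos = c('good', 'great*'), neg = c('bad', 'terribl*')))
toks <- tokens(data_corpus_inaugural)
dfmat <- dfm(toks)
microbenchmark::microbenchmark(
    tokens = tokens_lookup(toks, dict, valuetype = 'glob'),
    dfm = dfm_lookup(dfmat, dict, valuetype = 'glob'),
    times = 10
)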
@koheiw koheiw changed the title benchmark dictionary lookup functions Benchmark dictionary lookup functions Sep 14, 2017
koheiw (Collaborator) commented Sep 14, 2017

In 'normal' usage of tokens_lookup() (i.e. N-to-N matching), its performance is still good, despite it being such a complex function producing nuanced results. With the Guardian corpus (500 MB) and the LIWC2007 dictionary (10,603 words in 64 keys):

microbenchmark::microbenchmark(
    fixed=tokens_lookup(toks, dict_liwc, valuetype='fixed', verbose=FALSE),
    glob=tokens_lookup(toks, dict_liwc, valuetype='glob', verbose=FALSE),
    times=1
)
#Unit: seconds
#  expr       min        lq      mean    median        uq       max neval
# fixed  61.17901  61.17901  61.17901  61.17901  61.17901  61.17901     1
#  glob 101.81795 101.81795 101.81795 101.81795 101.81795 101.81795     1

However, it takes forever to finish lemmatization (or custom stemming), because of the large number of keys. Most of the time goes into converting characters to IDs in regex2id(), which is absolutely unnecessary in 1-to-1 matching. For example, tokens_wordstem() just stems the types of the tokens.
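
To see why type-level operations are cheap (a toy illustration, not from the original comment): a tokens object stores each unique word form once in its types attribute, so a stemmer applied there touches each form only once, no matter how often it occurs in the texts.

toks <- tokens('cats chase cats chasing cats')
attr(toks, 'types')
# [1] "cats"    "chase"   "chasing"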

I was thinking of creating a types converter for users who wish to use a custom word stemmer. The code would be something like:

type <- char_tolower(attr(toks, 'types'))
type <- type[stringi::stri_detect_regex(type, '^[a-zA-Z]+$')]
stem <- char_wordstem(type) # this will be user's custom stemmer

# takes forever
dict_stem <- dictionary(split(type, stem))
length(dict_stem) #216000
tokens_lookup(toks, dict_stem[1:100], valuetype='fixed', verbose=FALSE)

# in 3 sec
tokens_convert <- function(x, from, to) {
    type <- to[fastmatch::fmatch(attr(x, 'types'), from)]
    # fall back to the original type where no match was found
    type <- ifelse(is.na(type), attr(x, 'types'), type)
    attr(x, 'types') <- type
    quanteda:::tokens_recompile(x)
}

tokens_convert(toks, from = type, to = stem)
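
(A note on why this is fast: fastmatch::fmatch() builds a hash table for the from vector on first use and caches it, so the whole conversion is a single vectorized hash lookup over the types, with no regex or character-to-ID conversion at all.)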

kbenoit (Collaborator, Author) commented Sep 14, 2017

What do you mean by "custom stemming"? At what stage does this happen in the tokens_lookup() process?

koheiw (Collaborator) commented Sep 14, 2017

It does not happen in tokens_lookup(), but dictionary lookup is currently the only way for users to convert tokens (see #514). In order to use stemmers not in quanteda, users have to create and apply a massive dictionary in this way:

type <- char_tolower(attr(toks, 'types'))
type <- type[stringi::stri_detect_regex(type, '^[a-zA-Z]+$')]
stem <- char_wordstem(type) # this will be user's custom stemmer

dict_stem <- dictionary(split(type, stem))
length(dict_stem) #216000

dict_stem[50010:50020]
# Dictionary object with 11 key entries.
# - [dobrica]:
#     - dobrica
# - [dobrindt]:
#     - dobrindt
# - [dobrinja]:
#     - dobrinja
# - [dobriskey]:
#     - dobriskey
# - [dobro]:
#     - dobro
# - [dobrokhotov]:
#     - dobrokhotov
# - [dobromyslova]:
#     - dobromyslova
# - [dobroyd]:
#     - dobroyd
# - [dobrynia]:
#     - dobrynia
# - [dobrynska]:
#     - dobrynska
# - [dobrzycka]:
#     - dobrzycka

tokens_lookup(toks, dict_stem, valuetype='fixed', verbose=FALSE)

This is really inefficient.

koheiw (Collaborator) commented Sep 14, 2017

It would also be good to have an argument for a user-defined function to be applied to the types:

tokens_convert(toks, fun = yourStemmer)

This is the same as:

toks <- tokens(txt)
types <- attr(toks, 'types')
attr(toks, 'types') <- yourStemmer(types)
toks <- quanteda:::tokens_recompile(toks)

Advanced users (including me) would love this.
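
A minimal sketch of how that variant could be written (tokens_convert and its fun argument are proposals in this thread, not part of quanteda's API; this just wraps the three lines above):

tokens_convert <- function(x, fun, ...) {
    # apply the user-supplied function over the unique types, not the tokens
    attr(x, 'types') <- fun(attr(x, 'types'), ...)
    # merge any types that became identical after conversion
    quanteda:::tokens_recompile(x)
}

toks <- tokens_convert(toks, fun = char_wordstem)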

@kbenoit kbenoit added this to the v0.99.x refresh milestone Sep 18, 2017
@kbenoit kbenoit removed their assignment Sep 18, 2017
@kbenoit kbenoit closed this as completed Sep 19, 2017