
Benchmark dictionary lookup functions #960

Closed

kbenoit opened this issue Sep 12, 2017 · 4 comments
kbenoit (Collaborator) commented Sep 12, 2017

Performance is slow when both the dictionaries and the texts are large, but we are not really sure where the bottlenecks are occurring. See, for instance, this comment from SO.

For (a starting-point benchmark sketch follows the list):

  • dfm_lookup()
  • tokens_lookup()
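
A minimal sketch of how such a benchmark could be set up, using quanteda's bundled inaugural-address corpus and a toy dictionary as stand-ins (illustrative choices, not data from this issue):

library(quanteda)
dict <- dictionary(list(pos = c('good', 'great*'), neg = c('bad', 'terribl*')))
toks <- tokens(data_corpus_inaugural)
dfmat <- dfm(toks)
microbenchmark::microbenchmark(
    tokens = tokens_lookup(toks, dict, valuetype = 'glob'),
    dfm = dfm_lookup(dfmat, dict, valuetype = 'glob'),
    times = 10
)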
@koheiw koheiw changed the title benchmark dictionary lookup functions Benchmark dictionary lookup functions Sep 14, 2017
koheiw (Collaborator) commented Sep 14, 2017

In 'normal' usage of tokens_lookup() (i.e. N-to-N matching), its performance is still good, despite it being such a complex function producing nuanced results. With the Guardian corpus (500 MB) and the LIWC2007 dictionary (10,603 words in 64 keys):

microbenchmark::microbenchmark(
    fixed=tokens_lookup(toks, dict_liwc, valuetype='fixed', verbose=FALSE),
    glob=tokens_lookup(toks, dict_liwc, valuetype='glob', verbose=FALSE),
    times=1
)
#Unit: seconds
#  expr       min        lq      mean    median        uq       max neval
# fixed  61.17901  61.17901  61.17901  61.17901  61.17901  61.17901     1
#  glob 101.81795 101.81795 101.81795 101.81795 101.81795 101.81795     1

However, it takes forever to finish lemmatization (or custom stemming), because of the large number of keys. Most of the time goes into converting characters to IDs in regex2id(), which is absolutely unnecessary in 1-to-1 matching. For example, tokens_wordstem() just stems the types of the tokens.
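
To see why type-level operations are cheap (a toy illustration, not from the original comment): a tokens object stores each unique word form once in its types attribute, so a stemmer applied there touches each form only once, no matter how often it occurs in the texts.

toks <- tokens('cats chase cats chasing cats')
attr(toks, 'types')
# [1] "cats"    "chase"   "chasing"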

I was thinking of creating a types converter for users who wish to use a custom word stemmer. The code would be something like:

type <- char_tolower(attr(toks, 'types'))
type <- type[stringi::stri_detect_regex(type, '^[a-zA-Z]+$')]
stem <- char_wordstem(type) # this will be user's custom stemmer

# takes forever
dict_stem <- dictionary(split(type, stem))
length(dict_stem) #216000
tokens_lookup(toks, dict_stem[1:100], valuetype='fixed', verbose=FALSE)

# in 3 sec
tokens_convert <- function(x, from, to) {
    type <- to[fastmatch::fmatch(attr(x, 'types'), from)]
    # fall back to the original type where no match was found
    type <- ifelse(is.na(type), attr(x, 'types'), type)
    attr(x, 'types') <- type
    quanteda:::tokens_recompile(x)
}

tokens_convert(toks, from = type, to = stem)
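
(A note on why this is fast: fastmatch::fmatch() builds a hash table for the from vector on first use and caches it, so the whole conversion is a single vectorized hash lookup over the types, with no regex or character-to-ID conversion at all.)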

kbenoit (Collaborator, Author) commented Sep 14, 2017

What do you mean by "custom stemming"? At what stage does this happen in the tokens_lookup() process?

koheiw (Collaborator) commented Sep 14, 2017

It does not happen in tokens_lookup(), but dictionary lookup is currently the only way for users to convert tokens (see #514). In order to use stemmers not in quanteda, users have to create and apply a massive dictionary in this way:

type <- char_tolower(attr(toks, 'types'))
type <- type[stringi::stri_detect_regex(type, '^[a-zA-Z]+$')]
stem <- char_wordstem(type) # this will be user's custom stemmer

dict_stem <- dictionary(split(type, stem))
length(dict_stem) #216000

dict_stem[50010:50020]
# Dictionary object with 11 key entries.
# - [dobrica]:
#     - dobrica
# - [dobrindt]:
#     - dobrindt
# - [dobrinja]:
#     - dobrinja
# - [dobriskey]:
#     - dobriskey
# - [dobro]:
#     - dobro
# - [dobrokhotov]:
#     - dobrokhotov
# - [dobromyslova]:
#     - dobromyslova
# - [dobroyd]:
#     - dobroyd
# - [dobrynia]:
#     - dobrynia
# - [dobrynska]:
#     - dobrynska
# - [dobrzycka]:
#     - dobrzycka

tokens_lookup(toks, dict_stem, valuetype='fixed', verbose=FALSE)

This is really inefficient.

koheiw (Collaborator) commented Sep 14, 2017

It would also be good to have an argument for a user-defined function to be applied to the types:

tokens_convert(toks, fun = yourStemmer)

This is the same as:

toks <- tokens(txt)
types <- attr(toks, 'types')
attr(toks, 'types') <- yourStemmer(types)
toks <- quanteda:::tokens_recompile(toks)

Advanced users (including me) would love this.
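
A minimal sketch of how that variant could be written (tokens_convert and its fun argument are proposals in this thread, not part of quanteda's API; this just wraps the three lines above):

tokens_convert <- function(x, fun, ...) {
    # apply the user-supplied function over the unique types, not the tokens
    attr(x, 'types') <- fun(attr(x, 'types'), ...)
    # merge any types that became identical after conversion
    quanteda:::tokens_recompile(x)
}

toks <- tokens_convert(toks, fun = char_wordstem)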

@kbenoit kbenoit added this to the v0.99.x refresh milestone Sep 18, 2017
@kbenoit kbenoit removed their assignment Sep 18, 2017
@kbenoit kbenoit closed this as completed Sep 19, 2017