Consider implementing tokens_recompile in C++ if necessary #510
We should discuss this carefully before you spend time on it, since there are a lot of good arguments for keeping the core data objects in R rather than having them live via pointers into the C++ space. A better short-term solution is to see whether we can improve the performance of
I will not start writing this code anytime soon. I just wrote down my future idea.
I am investigating the bottlenecks in:

```r
toks <- tokens(data_corpus_guardian)
dict_lex <- dictionary(file = '/home/kohei/Documents/Dictionary/Lexicoder/LSDaug2015/LSD2015_NEG.lc3')
seq_lex <- quanteda:::sequence2list(unlist(dict_lex, use.names = FALSE))
length(seq_lex) # 4581

out <- tokens_compound(toks, seq_lex, valuetype = 'glob', join = TRUE)
## regex2id: 33.7241 secs
## qatd_cpp_tokens_compound: 35.35927 secs
## tokens_hashed_recompile: 13.99193 secs

out <- tokens_compound(toks, c('not *'), valuetype = 'glob', join = TRUE)
## regex2id: 10.63352 secs
## qatd_cpp_tokens_compound: 25.27257 secs
## tokens_hashed_recompile: 17.59145 secs
```
My guess is that the bottlenecks are in the `lapply` calls. The recompile function could be implemented in C++, since it basically just reindexes the types table to (a) eliminate gaps and (b) join duplicates. Isn't the overhead of passing the object costly? Note: I'd like to make
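The reindexing described above (dropping unused types, joining duplicates) can be sketched in standalone C++. Everything below is an illustrative assumption about the data layout (documents as vectors of integer IDs into a types table), not quanteda's actual internals:

```cpp
// Hypothetical sketch of a C++ recompile step: rewrite token IDs so
// that duplicate type strings collapse to one ID and unused IDs leave
// no gaps in the types table. Not quanteda's actual code.
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// `docs` holds token IDs indexing into `types`; rewritten in place.
// Returns the compacted types table.
std::vector<std::string> recompile(std::vector<std::vector<int>>& docs,
                                   const std::vector<std::string>& types) {
    std::unordered_map<std::string, int> seen;  // type string -> new ID
    std::vector<int> remap(types.size(), -1);   // old ID -> new ID
    std::vector<std::string> new_types;

    for (auto& doc : docs) {
        for (int& id : doc) {
            if (remap[id] == -1) {
                // Only types actually used get a new ID, and identical
                // strings are merged onto a single ID.
                auto it = seen.find(types[id]);
                if (it == seen.end()) {
                    int new_id = static_cast<int>(new_types.size());
                    seen.emplace(types[id], new_id);
                    new_types.push_back(types[id]);
                    remap[id] = new_id;
                } else {
                    remap[id] = it->second;
                }
            }
            id = remap[id];
        }
    }
    return new_types;
}
```

Since this touches only integer IDs and compares strings without any locale or encoding handling, it should avoid the per-element overhead of the R-level version.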
Moving tokens to the C++ side entirely is an interesting idea, but would be a more fundamental change. This would be similar to the data.table approach (which is admittedly a great approach).
I think that the problem is in `lapply` too. It is worth making an experimental recompiler in C++. It is easy, but it is difficult to speed up
Good point on
We need more tests, but the results of the C++ version of the recompiler are promising.

```r
out <- tokens_compound(toks, seq_lex, valuetype = 'glob', join = TRUE)
## regex2id: 51.37615 secs
## qatd_cpp_tokens_compound: 38.23306 secs
## tokens_hashed_recompile: 27.79845 secs
## qatd_cpp_recompile: 4.334961 secs

out <- tokens_compound(toks, c('not *'), valuetype = 'glob', join = TRUE)
## regex2id: 11.30806 secs
## qatd_cpp_tokens_compound: 26.17273 secs
## tokens_hashed_recompile: 16.10702 secs
## qatd_cpp_recompile: 5.497874 secs
```
Excellent! I suggest changing
Fixed in f60ab8e.
While the `qatd_cpp_tokens_*` functions are really fast, `tokens_hashed_recompile` appears to be the bottleneck. Since no character encoding is involved, we can recompile `tokens` in C++ much faster. This would also be one step toward an architecture where `tokens` objects remain on the C++ side until requested.
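As a rough illustration of that architecture, a C++-side registry could own the tokens objects while R holds only an opaque handle (in practice this would likely go through Rcpp external pointers). Everything here — `TokensStore`, `TokensRegistry`, `materialize()`, the handle scheme — is a hypothetical sketch, not quanteda's API:

```cpp
// Hypothetical sketch: tokens live on the C++ side; the caller (R)
// keeps only an integer handle. Operations such as compounding or
// recompiling would run on integer IDs inside the registry; strings
// are materialized only when explicitly requested.
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct TokensStore {
    std::vector<std::vector<int>> docs;  // token IDs per document
    std::vector<std::string> types;      // ID -> type string
};

class TokensRegistry {
public:
    // Take ownership of a tokens object; hand back an opaque handle.
    int create(TokensStore store) {
        int h = next_++;
        objects_[h] = std::make_shared<TokensStore>(std::move(store));
        return h;
    }

    // Convert one document back to strings, only when requested.
    std::vector<std::string> materialize(int handle, std::size_t doc) const {
        const TokensStore& s = *objects_.at(handle);
        std::vector<std::string> out;
        out.reserve(s.docs.at(doc).size());
        for (int id : s.docs.at(doc)) out.push_back(s.types.at(id));
        return out;
    }

private:
    int next_ = 0;
    std::unordered_map<int, std::shared_ptr<TokensStore>> objects_;
};
```

The design point is that repeated operations (compound, select, recompile) never cross the R/C++ boundary with the full object; only the final, explicitly requested view does.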