
Consider implementing tokens_recompile in C++ if necessary #510

Closed · koheiw opened this issue Jan 24, 2017 · 10 comments

koheiw commented Jan 24, 2017

While the qatd_cpp_tokens_* functions are really fast, tokens_hashed_recompile appears to be the bottleneck. Since no character encoding is involved, we can recompile tokens in C++ much faster. This would also be one step toward an architecture where tokens objects remain on the C++ side until requested.
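
(For context, a minimal sketch of the hashed-tokens layout being discussed: documents are stored as integer vectors indexing into a shared character vector of types. Names and structure here are illustrative only, not quanteda's actual internals.)

toks <- list(d1 = c(1L, 2L, 3L, 2L),
             d2 = c(3L, 1L))
attr(toks, "types") <- c("quick", "brown", "fox")
# "Recompiling" means rebuilding this index so that the types vector
# contains no unused entries and no duplicates.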

kbenoit commented Jan 24, 2017

We should discuss this carefully before you spend time on it, since there are a lot of good arguments for keeping the core data objects in R rather than living via pointers into the C++ space. A better short-term solution is to see whether we can improve the performance of tokens_hashed_recompile, after identifying the precise bottlenecks.
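
(A sketch of one way to identify those bottlenecks with base R's profiler before rewriting anything; `toks` and `seq_lex` are the objects used in the benchmarks further down this thread.)

library(quanteda)
Rprof("profile.out")
out <- tokens_compound(toks, seq_lex, valuetype = "glob", join = TRUE)
Rprof(NULL)
head(summaryRprof("profile.out")$by.self)  # self-time spent per function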

koheiw commented Jan 24, 2017

I will not start writing this code anytime soon; I just wrote down an idea for the future.

koheiw changed the title from "Implement qatd_cpp_tokens_recompile()" to "Consider implementing tokens_recompile in C++ if necessary" on Jan 24, 2017
koheiw added a commit that referenced this issue Feb 4, 2017
koheiw commented Feb 4, 2017

I am investigating the bottlenecks in the dev-wrap branch. The C++ function is doing well with the large corpus, but the auxiliary functions are taking a lot of time.

toks <- tokens(data_corpus_guardian)
dict_lex <- dictionary(file='/home/kohei/Documents/Dictionary/Lexicoder/LSDaug2015/LSD2015_NEG.lc3')
seq_lex <- quanteda:::sequence2list(unlist(dict_lex, use.names = FALSE))
length(seq_lex) # 4581

out <- tokens_compound(toks, seq_lex, valuetype='glob', join=TRUE)
## regex2id: 33.7241 secs 
## qatd_cpp_tokens_compound: 35.35927 secs 
## tokens_hashed_recompile: 13.99193 secs

out <- tokens_compound(toks, c('not *'), valuetype='glob', join=TRUE)
## regex2id: 10.63352 secs 
## qatd_cpp_tokens_compound: 25.27257 secs 
## tokens_hashed_recompile: 17.59145 secs 

kbenoit commented Feb 4, 2017

My guess is that the bottlenecks are in the lapply parts of tokens_hashed_recompile(). I could be wrong on this, but match() and the other direct indexing functions in R are already about as fast as anything we could reimplement in C++.

The recompile function could be implemented in C++, since it basically just reindexes the types table to a) eliminate gaps and b) join duplicates. But wouldn't the overhead of passing the object be costly?

Note: I'd like to make tokens an S4 object in a restructuring of the tokens class, but this change could be independent of that.
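
(A plain-R sketch of the reindexing described above; recompile_sketch is a hypothetical helper for exposition, not the quanteda implementation.)

recompile_sketch <- function(toks, types) {
    used <- sort(unique(unlist(toks, use.names = FALSE)))  # a) drop unused ids (gaps)
    types_used <- types[used]
    types_new <- unique(types_used)                        # b) join duplicate types
    map <- integer(length(types))                          # old id -> new id
    map[used] <- match(types_used, types_new)
    toks_new <- lapply(toks, function(x) map[x])
    attr(toks_new, "types") <- types_new
    toks_new
}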

kbenoit commented Feb 4, 2017

Moving tokens to the C++ side entirely is an interesting idea, but would be a more fundamental change. This would be similar to the data.table approach (which is admittedly a great approach).

koheiw commented Feb 5, 2017

I think the problem is in lapply too. It is worth making an experimental recompiler in C++. That part is easy, but it is difficult to speed up regex2id, because it relies on regex matching. Any idea how?
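
(To illustrate why regex2id is hard to speed up: every glob pattern becomes a regular expression that must be matched against the whole types vector, so the cost grows with patterns x types. A simplified single-token example, using base R's glob2rx:)

types <- c("not", "nothing", "good", "bad")
patterns <- c("not*", "go*")
ids <- lapply(glob2rx(patterns), function(rx) which(grepl(rx, types)))
# With ~4,500 dictionary patterns and a large types vector, this
# pattern-by-pattern scan is what dominates the timings above.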

kbenoit commented Feb 5, 2017

Good point on regex2id() - I created a separate issue for this (#542).

koheiw commented Feb 5, 2017

We need more tests, but the results of the C++ version of the recompiler are promising.

out <- tokens_compound(toks, seq_lex, valuetype='glob', join=TRUE)
## regex2id: 51.37615 secs 
## qatd_cpp_tokens_compound: 38.23306 secs 
## tokens_hashed_recompile: 27.79845 secs 
## qatd_cpp_recompile: 4.334961 secs 

out <- tokens_compound(toks, c('not *'), valuetype='glob', join=TRUE)
## regex2id: 11.30806 secs 
## qatd_cpp_tokens_compound: 26.17273 secs 
## tokens_hashed_recompile: 16.10702 secs 
## qatd_cpp_recompile: 5.497874 secs

kbenoit commented Feb 5, 2017

Excellent!

I suggest changing tokens_hashed_recompile to call the C++ routine by default, with an old = FALSE option that, if TRUE, calls the R version. That way we can implement extensive tests to make sure the returns are identical.
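
(A sketch of that dispatch; apart from the old argument, the R-side name recompile_r is a hypothetical stand-in, not quanteda's actual API.)

tokens_hashed_recompile <- function(x, old = FALSE) {
    if (old) recompile_r(x)        # original R implementation
    else qatd_cpp_recompile(x)     # new C++ implementation
}
# Equivalence could then be tested directly, e.g.:
# identical(tokens_hashed_recompile(toks, old = TRUE),
#           tokens_hashed_recompile(toks, old = FALSE))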

kbenoit commented Feb 6, 2017

Fixed in f60ab8e.

kbenoit closed this as completed Feb 6, 2017