
Consider implementing tokens_recompile in C++ if necessary #510

Closed · koheiw opened this issue Jan 24, 2017 · 10 comments

koheiw commented Jan 24, 2017

While the qatd_cpp_tokens_* functions are really fast, tokens_hashed_recompile appears to be the bottleneck. Since no character encoding is involved, we can recompile tokens in C++ much faster. This would also be one step toward an architecture where tokens objects remain on the C++ side until requested.
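
(For context, a minimal sketch of the hashed-tokens layout being discussed: documents are stored as integer vectors indexing into a shared character vector of types. Names and structure here are illustrative only, not quanteda's actual internals.)

toks <- list(d1 = c(1L, 2L, 3L, 2L),
             d2 = c(3L, 1L))
attr(toks, "types") <- c("quick", "brown", "fox")
# "Recompiling" means rebuilding this index so that the types vector
# contains no unused entries and no duplicates.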

kbenoit commented Jan 24, 2017

We should discuss this carefully before you spend time on it, since there are a lot of good arguments for keeping the core data objects in R rather than living via pointers into the C++ space. A better short-term solution is to see whether we can improve the performance of tokens_hashed_recompile, after identifying the precise bottlenecks.
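
(A sketch of one way to identify those bottlenecks with base R's profiler before rewriting anything; `toks` and `seq_lex` are the objects used in the benchmarks further down this thread.)

library(quanteda)
Rprof("profile.out")
out <- tokens_compound(toks, seq_lex, valuetype = "glob", join = TRUE)
Rprof(NULL)
head(summaryRprof("profile.out")$by.self)  # self-time spent per function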

koheiw commented Jan 24, 2017

I will not start writing this code anytime soon; I just wrote down an idea for the future.

koheiw changed the title from "Implement qatd_cpp_tokens_recompile()" to "Consider implementing tokens_recompile in C++ if necessary" on Jan 24, 2017
koheiw added a commit that referenced this issue Feb 4, 2017
koheiw commented Feb 4, 2017

I am investigating the bottlenecks in the dev-wrap branch. The C++ function is doing well with the large corpus, but the auxiliary functions are taking a lot of time.

toks <- tokens(data_corpus_guardian)
dict_lex <- dictionary(file='/home/kohei/Documents/Dictionary/Lexicoder/LSDaug2015/LSD2015_NEG.lc3')
seq_lex <- quanteda:::sequence2list(unlist(dict_lex, use.names = FALSE))
length(seq_lex) # 4581

out <- tokens_compound(toks, seq_lex, valuetype='glob', join=TRUE)
## regex2id: 33.7241 secs 
## qatd_cpp_tokens_compound: 35.35927 secs 
## tokens_hashed_recompile: 13.99193 secs

out <- tokens_compound(toks, c('not *'), valuetype='glob', join=TRUE)
## regex2id: 10.63352 secs 
## qatd_cpp_tokens_compound: 25.27257 secs 
## tokens_hashed_recompile: 17.59145 secs 

kbenoit commented Feb 4, 2017

My guess is that the bottlenecks are in the lapply parts of tokens_hashed_recompile(). I could be wrong on this, but match() and the other direct indexing functions in R are already about as fast as anything we could reimplement in C++.

The recompile function could be implemented in C++, since it basically just reindexes the types table to a) eliminate gaps and b) join duplicates. But wouldn't the overhead of passing the object be costly?

Note: I'd like to make tokens an S4 object in a restructuring of the tokens class, but this change could be independent of that.
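
(A plain-R sketch of the reindexing described above; recompile_sketch is a hypothetical helper for exposition, not the quanteda implementation.)

recompile_sketch <- function(toks, types) {
    used <- sort(unique(unlist(toks, use.names = FALSE)))  # a) drop unused ids (gaps)
    types_used <- types[used]
    types_new <- unique(types_used)                        # b) join duplicate types
    map <- integer(length(types))                          # old id -> new id
    map[used] <- match(types_used, types_new)
    toks_new <- lapply(toks, function(x) map[x])
    attr(toks_new, "types") <- types_new
    toks_new
}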

kbenoit commented Feb 4, 2017

Moving tokens to the C++ side entirely is an interesting idea, but would be a more fundamental change. This would be similar to the data.table approach (which is admittedly a great approach).

koheiw commented Feb 5, 2017

I think the problem is in lapply too. It is worth making an experimental recompiler in C++. That part is easy, but it is difficult to speed up regex2id, because it relies on regex matching. Any idea how?
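
(To illustrate why regex2id is hard to speed up: every glob pattern becomes a regular expression that must be matched against the whole types vector, so the cost grows with patterns x types. A simplified single-token example, using base R's glob2rx:)

types <- c("not", "nothing", "good", "bad")
patterns <- c("not*", "go*")
ids <- lapply(glob2rx(patterns), function(rx) which(grepl(rx, types)))
# With ~4,500 dictionary patterns and a large types vector, this
# pattern-by-pattern scan is what dominates the timings above.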

kbenoit commented Feb 5, 2017

Good point on regex2id() - I created a separate issue for this (#542).

koheiw commented Feb 5, 2017

We need more tests, but the results of the C++ version of the recompiler are promising.

out <- tokens_compound(toks, seq_lex, valuetype='glob', join=TRUE)
## regex2id: 51.37615 secs 
## qatd_cpp_tokens_compound: 38.23306 secs 
## tokens_hashed_recompile: 27.79845 secs 
## qatd_cpp_recompile: 4.334961 secs 

out <- tokens_compound(toks, c('not *'), valuetype='glob', join=TRUE)
## regex2id: 11.30806 secs 
## qatd_cpp_tokens_compound: 26.17273 secs 
## tokens_hashed_recompile: 16.10702 secs 
## qatd_cpp_recompile: 5.497874 secs

kbenoit commented Feb 5, 2017

Excellent!

I suggest changing tokens_hashed_recompile to call the C++ routine by default, with an old = FALSE option that, if TRUE, calls the R version. That way we can implement extensive tests to make sure the returns are identical.
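
(A sketch of that dispatch; apart from the old argument, the R-side name recompile_r is a hypothetical stand-in, not quanteda's actual API.)

tokens_hashed_recompile <- function(x, old = FALSE) {
    if (old) recompile_r(x)        # original R implementation
    else qatd_cpp_recompile(x)     # new C++ implementation
}
# Equivalence could then be tested directly, e.g.:
# identical(tokens_hashed_recompile(toks, old = TRUE),
#           tokens_hashed_recompile(toks, old = FALSE))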

kbenoit commented Feb 6, 2017

Fixed in f60ab8e.

kbenoit closed this as completed Feb 6, 2017