Track R-GSOC-2016 Progress #2

Open
qinwf opened this issue Feb 10, 2016 · 5 comments

Comments

@qinwf
Owner

qinwf commented Feb 10, 2016

No description provided.

@gagolews

Qin,

Congratulations! As you probably already know, the RE2 project has been accepted.

@qinwf
Owner Author

qinwf commented May 18, 2016

Project Status Report

May 19, 2016

Changes during Community Bonding

1. Set up continuous integration and code coverage tests

The package is now built and checked by CI on Mac, Linux, and Windows, and code coverage is tracked with codecov.io.

2. More docs and tests

Added docs and test cases for both new and existing functions.

3. Documentation Pages

Initial work on the documentation pages: https://qinwf.github.io/re2r_doc/

4. Parallel Support

All pattern matching routines have been implemented to work in parallel with RcppParallel.

5. Add split and locate functions

Added split and locate methods for pattern matching.

6. Add regular expression visualization with regexper library

Added the show_regex function to visualize RE2 regular expressions.

re2 images

7. Improve Performance

Used Google Performance Tools to profile the compiled C++ code. Rewrote some critical code with the raw R C API to avoid the overhead of Rcpp_PreserveObject and other Rcpp helper functions.

Issue Status

#3 Solaris build

The code is still changing from time to time, so Solaris testing can wait until later.

#4 Long Vector Tests

Initial test cases were added.

#5 Match failure when LC_COLLATE is not UTF-8

Use stringi::stri_enc_toutf8 to convert input strings and pattern strings to UTF-8. The changes have landed.

Initial test cases were added.

#6 Question: argument order

Changed the argument order from (pattern, string) to (string, pattern). The changes have landed.

#7 Using SET_STRING_ELT and Rf_mkCharLenCE to handle output string encoding

The changes have landed.

One case still needs care: Rcpp exception strings use the native encoding instead of UTF-8, so if a pattern cannot be parsed, the error message raised through Rcpp may contain garbled characters. To fix this, we can remove the Rcpp dependency in the near future.

Most of the code is already independent of Rcpp, so this should be easy to fix.

#8 Handle NA_STRING

NA_STRING handling has been implemented in all pattern matching routines, including match, replace, detect, extract, split, locate, and quote.

Initial test cases were added.
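
For illustration only, the missing-value handling in #8 might look like the sketch below. It is written in Python with the standard `re` module rather than re2r's R/C++ code, `None` stands in for NA_STRING, and the function name `detect` and the rule "missing input gives missing output" (the stringi convention) are assumptions, not something this report spells out. The argument order follows the (string, pattern) order from #6.

```python
import re
from typing import List, Optional


def detect(strings: List[Optional[str]], pattern: str) -> List[Optional[bool]]:
    """Hypothetical helper: report whether each string contains a match.

    None plays the role of NA_STRING and is passed through unchanged,
    so a missing input yields a missing output instead of an error.
    """
    compiled = re.compile(pattern)  # compile once, reuse for every string
    return [
        None if s is None else compiled.search(s) is not None
        for s in strings
    ]
```

Calling `detect(["abc", None, "xyz"], "b")` leaves the missing element missing and returns booleans for the other two.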

Future Plan

1. Follow the timeline in the proposal

See the proposal.

2. Make functions vectorized

Make functions accept multiple patterns with multiple strings.

3. Add more test cases and close existing issues

Add more test cases and improve test coverage.

4. Possibly explore some new ideas and refine the APIs
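
As a rough picture of what item 2 could mean, here is a minimal sketch of matching multiple patterns against multiple strings with R-style recycling of the shorter argument. It uses Python's standard `re` module, not RE2 or re2r, and the name `detect_vectorized` and the recycling rule are assumptions made for illustration.

```python
import re
from typing import List


def detect_vectorized(strings: List[str], patterns: List[str]) -> List[bool]:
    """Hypothetical helper: pair strings with patterns element by element.

    The shorter argument is recycled, so a single pattern is applied to
    every string, and n patterns with n strings are matched pairwise.
    """
    if not strings or not patterns:
        return []
    n = max(len(strings), len(patterns))
    compiled = [re.compile(p) for p in patterns]  # compile each pattern once
    return [
        compiled[i % len(compiled)].search(strings[i % len(strings)]) is not None
        for i in range(n)
    ]
```

With one pattern this reduces to the common "one regex, many subjects" case, for example `detect_vectorized(["apple", "banana", "cherry"], ["an"])`; with two patterns and two strings each pattern is checked only against the string in the same position.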

Thanks for any help and advice!

@gagolews @tdhock

@gagolews

gagolews commented May 18, 2016

You're way ahead of the timeline! Theoretically, you should now "Look for examples of how regular expressions are used in existing R packages." 😛 Congrats!

@tdhock

tdhock commented May 19, 2016

About vectorizing: I think it is mainly necessary to vectorize the subject (not the pattern), since the typical usage is "apply this single regex to this set of subjects".

@gagolews

On the other hand, @qinwf could make the API as similar to stringi (and hence stringr) as possible. Who knows, maybe re2r will someday be wrapped by stringr too.

3 participants