Make the Source class report the encoding #677
This pull request is the first of a series that adds multibyte encoding support to the readr tokenizers. I made a first attempt in #673, but that PR was too big to be easily reviewed.
With this pull request, the datasource() R function and the associated C++ Source class now require an encoding. If this encoding is ambiguous in endianness (like UTF-16 or UTF-32, which mandate a Byte Order Mark), the BOM is detected and skipped (as happened before) and the encoding is updated to reflect the endianness (UTF-16LE, UTF-16BE...).
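To illustrate the BOM handling described above, here is a minimal, self-contained sketch. It is not the code in this PR: `BomResult`, `startsWith` and `resolveEncoding` are hypothetical names chosen for the example, and it assumes the caller passes the first bytes of the source as a `std::string`. It only covers the endianness-ambiguous encodings mentioned here (UTF-16 and UTF-32).

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type (not part of readr): how many BOM bytes to skip
// and the encoding name once the endianness is known.
struct BomResult {
  std::size_t skip;
  std::string encoding;
};

// True if `data` begins with `prefix`; a shorter `data` never matches.
static bool startsWith(const std::string& data, const std::string& prefix) {
  return data.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: resolve an endianness-ambiguous encoding from the BOM.
// If no BOM is found, or the encoding is not ambiguous, nothing is skipped
// and the declared encoding is returned unchanged.
BomResult resolveEncoding(const std::string& data, const std::string& encoding) {
  if (encoding == "UTF-32") {
    if (startsWith(data, std::string("\xFF\xFE\x00\x00", 4))) return {4, "UTF-32LE"};
    if (startsWith(data, std::string("\x00\x00\xFE\xFF", 4))) return {4, "UTF-32BE"};
  }
  if (encoding == "UTF-16") {
    if (startsWith(data, std::string("\xFF\xFE", 2))) return {2, "UTF-16LE"};
    if (startsWith(data, std::string("\xFE\xFF", 2))) return {2, "UTF-16BE"};
  }
  return {0, encoding};
}
```

Branching on the declared encoding first avoids confusing the UTF-32LE BOM (FF FE 00 00) with the UTF-16LE BOM (FF FE) followed by a NUL character. What to do when a UTF-16 or UTF-32 source has no BOM is left open in this sketch; it simply keeps the declared name.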
Future pull requests will use this encoding for proper source parsing.
As this PR is not a new feature (not yet at least) and it is not a bug fix, I have not added a NEWS entry. Feel free to ask for it if you feel it is better to have it.
2 times, most recently
May 16, 2017
Once this is merged, I will submit another PR that allows multiple comment prefixes and moves comment parsing into the Source class (instead of duplicating the code in Source and the Tokenizers):