
Make the Source class report the encoding #677

Open: 1 commit to merge into master


zeehio commented May 16, 2017

This pull request is the first of a series that adds multibyte encoding support to the readr tokenizers. I made a first attempt in #673, but that PR was too big to be reviewed easily.

With this pull request, the datasource() R function and the associated C++ Source class now require an encoding. If the encoding is ambiguous in endianness (like UTF-16 or UTF-32, which mandate a Byte Order Mark), the BOM is detected and skipped (as before) and the encoding is updated to reflect the endianness (UTF-16LE, UTF-16BE...).
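To illustrate the behaviour described above, here is a minimal sketch of BOM detection for the endianness-ambiguous encodings. It is not the code in this PR: `BomResult`, `startsWith`, and `resolveBom` are hypothetical names chosen for the example, and the sketch assumes the source bytes are already available as a `std::string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; not part of readr's Source class.
struct BomResult {
  std::string encoding;  // encoding name, with endianness resolved if a BOM was found
  std::size_t skip;      // number of leading BOM bytes the reader should skip
};

// True if `data` begins with the `n` bytes pointed to by `bom`.
static bool startsWith(const std::string& data, const char* bom, std::size_t n) {
  return data.size() >= n && data.compare(0, n, bom, n) == 0;
}

// Resolve "UTF-16" / "UTF-32" to an explicit-endianness name by inspecting the BOM.
// Any other encoding, or input without a BOM, is returned unchanged with skip == 0.
BomResult resolveBom(const std::string& data, const std::string& encoding) {
  if (encoding == "UTF-32") {
    if (startsWith(data, "\xFF\xFE\x00\x00", 4)) return {"UTF-32LE", 4};
    if (startsWith(data, "\x00\x00\xFE\xFF", 4)) return {"UTF-32BE", 4};
  } else if (encoding == "UTF-16") {
    if (startsWith(data, "\xFF\xFE", 2)) return {"UTF-16LE", 2};
    if (startsWith(data, "\xFE\xFF", 2)) return {"UTF-16BE", 2};
  }
  return {encoding, 0};
}
```

The sketch branches on the declared encoding first because the UTF-16LE BOM (FF FE) is a prefix of the UTF-32LE BOM (FF FE 00 00), so the bytes alone cannot always distinguish the two.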

Future pull requests will use this encoding for proper source parsing.

As this PR is neither a new feature (not yet, at least) nor a bug fix, I have not added a NEWS entry. Feel free to ask for one if you think it is better to have it.

@zeehio zeehio force-pushed the zeehio:add_source_encoding branch 2 times, most recently from 240ab09 to 76fdebe May 16, 2017

@zeehio zeehio changed the title Make the Source class report the encoding [RFC] Make the Source class report the encoding May 17, 2017


zeehio commented May 17, 2017

Once this is merged, I will submit another PR that offers the possibility of multiple comments and moves comment parsing into the Source class (instead of duplicating the code in Source and the Tokenizers):

@zeehio zeehio force-pushed the zeehio:add_source_encoding branch from 76fdebe to 95d4463 Oct 24, 2017

Provide encoding to datasource()
If this encoding is ambiguous in endianness (like UTF-16 or UTF-32 which mandate
a Byte Order Mark) the BOM is detected and skipped (as before) and the encoding
is updated to reflect the endianness (UTF-16LE, UTF-16BE...)

@zeehio zeehio force-pushed the zeehio:add_source_encoding branch from 95d4463 to 1323253 Dec 17, 2017

@zeehio zeehio changed the title [RFC] Make the Source class report the encoding Make the Source class report the encoding Dec 17, 2017
