Crashes when parsing data which is not valid UTF-8 #48

Closed
ysangkok opened this issue May 22, 2016 · 6 comments

Comments

ysangkok commented May 22, 2016

janus@zeus ~/pyte/examples % PYTHONPATH=.. python3 capture.py cat /dev/urandom
Traceback (most recent call last):
  File "capture.py", line 43, in <module>
    stream.feed(os.read(fd, 1024).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 1: invalid start byte

This is a simple but contrived example. But you get the same problem with e.g. Shift-JIS content:

PYTHONPATH=.. python3 capture.py curl \
http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml -H 'User-Agent: Mozilla/9'

ysangkok (Author) commented

Also, even valid UTF-8 would fail to decode correctly whenever a multi-byte character straddles the boundary between two 1024-byte blocks.
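
A minimal sketch of the block-boundary problem, assuming a two-byte character that happens to be split across two reads. The byte strings are illustrative and are not taken from capture.py:

```python
# Illustrative only: "é" encodes to two bytes in UTF-8 (0xC3 0xA9).
data = "é".encode("utf-8")

# Simulate a read boundary that falls in the middle of the character.
first, second = data[:1], data[1:]

try:
    # Decoding the first block in isolation fails, even though the
    # complete byte sequence is perfectly valid UTF-8.
    first.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

print(failed)
```

The same failure would occur for any block size, since the decoder has no memory of the bytes it saw in the previous call.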

jquast commented May 23, 2016

This issue sounds similar to something I've worked with before.

The solution is to use Python's https://docs.python.org/2.7/library/codecs.html#codecs.getincrementaldecoder. Rather than attempting to decode each byte block individually as it is received, each block should be fed into an incremental decoder instance, and final=True should only be passed with the last block (such as on close/EOF). This allows partial decoding of byte blocks as they are received.
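
The approach described above could look roughly like the following. This is a sketch, not pyte's actual code: `decode_blocks` is a hypothetical helper, and the input blocks are made up to force a split in the middle of a character.

```python
import codecs


def decode_blocks(blocks, encoding="utf-8", errors="strict"):
    """Hypothetical helper: decode an iterable of byte blocks incrementally."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    pieces = []
    for block in blocks:
        # Incomplete trailing bytes are buffered inside the decoder
        # and completed by the next block.
        pieces.append(decoder.decode(block))
    # final=True is passed only once, at EOF, to flush the buffer.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# "café" with the two bytes of "é" split across two blocks.
text = decode_blocks([b"caf\xc3", b"\xa9"])
print(text)
```

With the default `errors="strict"` this still raises on genuinely invalid input; it only addresses characters split across blocks.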

superbobry (Collaborator) commented May 23, 2016

Thanks @jquast, we do something similar to what you've described in ByteStream, although we never pass final=True because of the poor API choices we made in the past :)

It looks like the incremental decoder initialized with errors="replace" is as bulletproof as it gets. @ysangkok, what do you think?
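
As a rough illustration of how errors="replace" behaves on an incremental decoder (a sketch using only the standard library; the input bytes are chosen for the example, with 0x9f borrowed from the traceback in the report):

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# An invalid start byte is replaced with U+FFFD instead of raising.
replaced = decoder.decode(b"a\x9fb")

# A lone lead byte is buffered, since the next block might complete it...
pending = decoder.decode(b"\xc3")

# ...and is replaced only when the stream is declared finished.
flushed = decoder.decode(b"", final=True)

print(repr(replaced), repr(pending), repr(flushed))
```

If final=True is never passed, a truncated character at the very end of the stream simply stays in the decoder's buffer and is never emitted.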

ysangkok (Author) commented May 23, 2016

Yeah, sounds great :)

Another option would be to not decode, and dump on stdout straight using sys.stdout.buffer.write.
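
A sketch of that pass-through idea, assuming Python 3. `passthrough` is a hypothetical helper and the `out` parameter exists only so the example can be exercised without touching the real stdout:

```python
import io
import sys


def passthrough(blocks, out=None):
    """Hypothetical helper: write raw byte blocks without decoding them."""
    if out is None:
        # Binary layer underneath the text-mode sys.stdout (Python 3).
        out = sys.stdout.buffer
    for block in blocks:
        out.write(block)
    out.flush()


# Invalid and split bytes pass through untouched.
sink = io.BytesIO()
passthrough([b"\x9f\xc3", b"\xa9"], out=sink)
```

This avoids decode errors entirely, but it also means the data is never turned into text on the Python side.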

superbobry (Collaborator) commented

Cool, I'll push a fix.

Another option would be to not decode, and dump on stdout straight using sys.stdout.buffer.write.

AFAIK this won't work on Python 2.

jquast commented May 26, 2016

Verified, 953c098 is just as I described, good!
