Crashes when parsing data which is not valid UTF-8 #48

ysangkok · 2016-05-22T23:06:47Z

janus@zeus ~/pyte/examples % PYTHONPATH=.. python3 capture.py cat /dev/urandom
Traceback (most recent call last):
  File "capture.py", line 43, in <module>
    stream.feed(os.read(fd, 1024).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 1: invalid start byte

This is a simple but contrived example, but you get the same problem with, e.g., Shift-JIS content:

PYTHONPATH=.. python3 capture.py curl \
http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml -H 'User-Agent: Mozilla/9'


ysangkok · 2016-05-23T08:01:36Z

Also, even if the data is valid UTF-8, a multi-byte character that straddles the boundary between two 1024-byte blocks won't decode correctly.
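To illustrate the boundary problem, here is a minimal sketch. The sample string and the split position are made up for the example; they stand in for a 1024-byte block boundary that happens to fall inside a multi-byte character.

```python
# "é" encodes to two bytes in UTF-8: 0xC3 0xA9.
data = "café".encode("utf-8")  # b'caf\xc3\xa9'

# Hypothetical block boundary that falls in the middle of the two-byte sequence.
first, second = data[:4], data[4:]


def try_decode(chunk):
    """Return the decoded text, or a short error marker if decoding fails."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# Decoding each block on its own fails for both halves...
results = [try_decode(first), try_decode(second)]
# ...even though the complete byte string is perfectly valid UTF-8.
whole = data.decode("utf-8")

print(results)
print(whole)
```

Both blocks fail on their own, although the concatenated bytes decode without any error.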

jquast · 2016-05-23T18:24:48Z

This issue sounds similar to something I've worked with before.

The solution is to use Python's incremental decoder (https://docs.python.org/2.7/library/codecs.html#codecs.getincrementaldecoder) rather than attempting to decode each byte block individually. Each byte block should be fed into an incremental decoder instance, and final=True should only be passed with the last block (such as on close/EOF). This allows byte blocks to be decoded partially as they are received.
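A minimal sketch of that approach, using only the standard library. The helper name decode_chunks and the sample chunks are made up for illustration; this is not pyte's code.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte blocks, yielding text as it becomes available.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered inside the decoder
        # and completed by the next block.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Signal end of input; with the default strict error handling this
    # raises if a multi-byte sequence was left incomplete.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# The two-byte "é" (0xC3 0xA9) is split across the two blocks.
pieces = list(decode_chunks([b"caf\xc3", b"\xa9!"]))
print("".join(pieces))
```

Because the decoder keeps state between calls, the split character is emitted only once its second byte arrives.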

superbobry · 2016-05-23T21:37:38Z

Thanks @jquast, we do something similar to what you've described in ByteStream, although we never pass final=True because of poor API choices we made in the past :)

It looks like an incremental decoder initialized with errors="replace" is as bulletproof as it gets. @ysangkok, what do you think?
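As a rough sketch of what errors="replace" buys, assuming plain standard-library usage rather than pyte's actual ByteStream code, with made-up sample blocks:

```python
import codecs

# Incremental UTF-8 decoder that substitutes U+FFFD for undecodable bytes
# instead of raising UnicodeDecodeError.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# Made-up input: one invalid byte (0x9f) and one character split across blocks.
chunks = [b"ok \x9f", b" caf\xc3", b"\xa9"]

text = "".join(decoder.decode(chunk) for chunk in chunks)
print(repr(text))
```

The invalid byte becomes a replacement character, and the split character still decodes correctly because the decoder buffers its first byte.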

ysangkok · 2016-05-23T22:01:39Z

Yeah, sounds great :)

Another option would be to not decode, and dump on stdout straight using sys.stdout.buffer.write.
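A sketch of that alternative, assuming Python 3 (where sys.stdout.buffer exists). The function name copy_raw and its parameters are hypothetical:

```python
import os
import sys


def copy_raw(fd, out=None, block_size=1024):
    """Copy bytes from a file descriptor to a binary stream without decoding.

    Hypothetical helper for illustration; `out` defaults to sys.stdout.buffer.
    Returns the number of bytes copied.
    """
    if out is None:
        out = sys.stdout.buffer
    total = 0
    while True:
        block = os.read(fd, block_size)
        if not block:  # an empty read means EOF
            break
        out.write(block)
        total += len(block)
    out.flush()
    return total
```

Since nothing is decoded, invalid UTF-8 and split multi-byte characters pass through untouched, but the text is then not available as str for further processing.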

superbobry · 2016-05-25T18:27:02Z

Cool, I'll push a fix.

> Another option would be to not decode, and dump on stdout straight using sys.stdout.buffer.write.

AFAIK this won't work on Python 2.

jquast · 2016-05-26T23:41:54Z

Verified: 953c098 is just as I described. Good!

superbobry closed this as completed in 953c098 May 25, 2016

superbobry mentioned this issue May 25, 2016

Revise ByteStream API #49

Closed
