Skip to content

Conversation

nfelt
Copy link
Contributor

@nfelt nfelt commented Oct 31, 2017

This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to markdown_to_safe_html(). When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details.

This both fixes the TextPlugin decoding logic itself, and also fixes markdown_to_safe_html() so that it will discard null bytes and log a warning about them before doing markdown interpretation.

Tested under python 2 and 3 via bazel test --python_path=....

@nfelt nfelt requested a review from chihuahua October 31, 2017 01:45
Copy link
Member

@chihuahua chihuahua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice investigative work!

@@ -166,8 +166,7 @@ def text_array_to_html(text_arr):
"""
if not text_arr.shape:
# It is a scalar. No need to put it in a table, just apply markdown
return plugin_util.markdown_to_safe_html(
text_arr.astype(np.dtype(str)).tostring())
return plugin_util.markdown_to_safe_html(np.asscalar(text_arr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - it looks like np.asscalar(text_arr) returns a str (raw bytes representation) in python2 and str (unicode) in python3. Either way, markdown_to_safe_html now gracefully deals with the value.

And we avoid encoding in a way that assumes little-endian UTF-32 (woah) as noted in #647.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the point was just to make the treatment for scalars match what's being done for all the non-scalar cases where they're converted into HTML element-wise (and then concatenated). Numpy will natively represent the strings as you described, str == bytes in py2 and str == unicode in py3, so I figured I wouldn't change that.

Using np.asscalar(a) is literally just syntatic sugar for doing a.item() but it's slightly more explicit, so I went with that.

# there will probably be a bunch of null bytes. These would be stripped by
# the sanitizer no matter what, but make sure we remove them before markdown
# interpretation to avoid affecting output (e.g. "_with_" gets italicized).
s = u'word_with_underscores'.encode('utf-16')[2:] # Strip byte-order mark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the 2 chars are not stripped, do we get a UnicodeDecodeError? Is that why we strip them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's why. I realized that using "-le" or "-be" suffixes on the encoding to specify endianness will avoid writing out a byte order mark in the first place though which is significantly cleaner, so now I've updated this to just use utf-32-le since that's the exact encoding Numpy uses.

@nfelt nfelt merged commit 0e05e14 into tensorflow:master Oct 31, 2017
@nfelt nfelt deleted the fix-textplugin-scalar-decoding branch October 31, 2017 22:25
jart pushed a commit to jart/tensorboard that referenced this pull request Oct 31, 2017
This fixes tensorflow#647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`.  When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly.  See the issue for details.

This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation.

Tested under python 2 and 3 via `bazel test --python_path=...`.
jart pushed a commit that referenced this pull request Nov 1, 2017
This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`.  When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly.  See the issue for details.

This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation.

Tested under python 2 and 3 via `bazel test --python_path=...`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

text plugin incorrectly deserializes scalar strings when running python 3
2 participants