-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Fix TextPlugin scalar string decoding errors in py3 #685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TextPlugin scalar string decoding errors in py3 #685
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice investigative work!
@@ -166,8 +166,7 @@ def text_array_to_html(text_arr): | |||
""" | |||
if not text_arr.shape: | |||
# It is a scalar. No need to put it in a table, just apply markdown | |||
return plugin_util.markdown_to_safe_html( | |||
text_arr.astype(np.dtype(str)).tostring()) | |||
return plugin_util.markdown_to_safe_html(np.asscalar(text_arr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great - it looks like np.asscalar(text_arr)
returns a str
(raw bytes representation) in python2 and str
(unicode) in python3. Either way, markdown_to_safe_html
now gracefully deals with the value.
And we avoid encoding in a way that assumes little-endian UTF-32 (woah) as noted in #647.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the point was just to make the treatment for scalars match what's being done for all the non-scalar cases where they're converted into HTML element-wise (and then concatenated). Numpy will natively represent the strings as you described, str == bytes in py2 and str == unicode in py3, so I figured I wouldn't change that.
Using np.asscalar(a)
is literally just syntatic sugar for doing a.item()
but it's slightly more explicit, so I went with that.
tensorboard/plugin_util_test.py
Outdated
# there will probably be a bunch of null bytes. These would be stripped by | ||
# the sanitizer no matter what, but make sure we remove them before markdown | ||
# interpretation to avoid affecting output (e.g. "_with_" gets italicized). | ||
s = u'word_with_underscores'.encode('utf-16')[2:] # Strip byte-order mark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the 2 chars are not stripped, do we get a UnicodeDecodeError
? Is that why we strip them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's why. I realized that using "-le" or "-be" suffixes on the encoding to specify endianness will avoid writing out a byte order mark in the first place though which is significantly cleaner, so now I've updated this to just use utf-32-le since that's the exact encoding Numpy uses.
This fixes tensorflow#647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`. When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details. This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation. Tested under python 2 and 3 via `bazel test --python_path=...`.
This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`. When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details. This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation. Tested under python 2 and 3 via `bazel test --python_path=...`.
This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to
markdown_to_safe_html()
. When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details.This both fixes the TextPlugin decoding logic itself, and also fixes
markdown_to_safe_html()
so that it will discard null bytes and log a warning about them before doing markdown interpretation.Tested under python 2 and 3 via
bazel test --python_path=...
.