Fix TextPlugin scalar string decoding errors in py3 #685

nfelt · 2017-10-31T01:45:36Z

This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to markdown_to_safe_html(). When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details.

This both fixes the TextPlugin decoding logic itself, and also fixes markdown_to_safe_html() so that it will discard null bytes and log a warning about them before doing markdown interpretation.

Tested under python 2 and 3 via bazel test --python_path=....

chihuahua

Nice investigative work!

chihuahua · 2017-10-31T03:29:33Z

tensorboard/plugins/text/text_plugin.py

@@ -166,8 +166,7 @@ def text_array_to_html(text_arr):
  """
  if not text_arr.shape:
    # It is a scalar. No need to put it in a table, just apply markdown
-    return plugin_util.markdown_to_safe_html(
-        text_arr.astype(np.dtype(str)).tostring())
+    return plugin_util.markdown_to_safe_html(np.asscalar(text_arr))


Great - it looks like np.asscalar(text_arr) returns a str (raw bytes representation) in python2 and str (unicode) in python3. Either way, markdown_to_safe_html now gracefully deals with the value.

And we avoid encoding in a way that assumes little-endian UTF-32 (woah) as noted in #647.

Yeah, the point was just to make the treatment for scalars match what's being done for all the non-scalar cases where they're converted into HTML element-wise (and then concatenated). Numpy will natively represent the strings as you described, str == bytes in py2 and str == unicode in py3, so I figured I wouldn't change that.

Using np.asscalar(a) is literally just syntatic sugar for doing a.item() but it's slightly more explicit, so I went with that.

chihuahua · 2017-10-31T03:39:34Z

tensorboard/plugin_util_test.py

+    # there will probably be a bunch of null bytes. These would be stripped by
+    # the sanitizer no matter what, but make sure we remove them before markdown
+    # interpretation to avoid affecting output (e.g. "_with_" gets italicized).
+    s = u'word_with_underscores'.encode('utf-16')[2:]  # Strip byte-order mark


If the 2 chars are not stripped, do we get a UnicodeDecodeError? Is that why we strip them?

Yeah, that's why. I realized that using "-le" or "-be" suffixes on the encoding to specify endianness will avoid writing out a byte order mark in the first place though which is significantly cleaner, so now I've updated this to just use utf-32-le since that's the exact encoding Numpy uses.

This fixes tensorflow#647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`. When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details. This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation. Tested under python 2 and 3 via `bazel test --python_path=...`.

This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`. When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly. See the issue for details. This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation. Tested under python 2 and 3 via `bazel test --python_path=...`.

Fix TextPlugin scalar string decoding errors in py3

1baa8d1

nfelt requested a review from chihuahua October 31, 2017 01:45

chihuahua approved these changes Oct 31, 2017

View reviewed changes

tweak test code encoding to avoid needing to strip BOM

cc0ff9f

nfelt merged commit 0e05e14 into tensorflow:master Oct 31, 2017

nfelt deleted the fix-textplugin-scalar-decoding branch October 31, 2017 22:25

This was referenced Nov 7, 2017

Tensorboard tf.summary.text mangles markdown linebreaks #222

Closed

Proper np array object to string conversion #239

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix TextPlugin scalar string decoding errors in py3 #685

Fix TextPlugin scalar string decoding errors in py3 #685

Uh oh!

nfelt commented Oct 31, 2017

Uh oh!

chihuahua left a comment

Uh oh!

chihuahua Oct 31, 2017

Uh oh!

nfelt Oct 31, 2017

Uh oh!

chihuahua Oct 31, 2017

Uh oh!

nfelt Oct 31, 2017

Uh oh!

Uh oh!

Fix TextPlugin scalar string decoding errors in py3 #685

Fix TextPlugin scalar string decoding errors in py3 #685

Uh oh!

Conversation

nfelt commented Oct 31, 2017

Uh oh!

chihuahua left a comment

Choose a reason for hiding this comment

Uh oh!

chihuahua Oct 31, 2017

Choose a reason for hiding this comment

Uh oh!

nfelt Oct 31, 2017

Choose a reason for hiding this comment

Uh oh!

chihuahua Oct 31, 2017

Choose a reason for hiding this comment

Uh oh!

nfelt Oct 31, 2017

Choose a reason for hiding this comment

Uh oh!

Uh oh!