Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mypyc: Fix C string encoding #7978

Merged
merged 2 commits into from Nov 20, 2019

Conversation

@saleemrashid
Copy link
Contributor

saleemrashid commented Nov 19, 2019

Using repr() to encode C strings relied on implementation-defined behaviour and had a number of bugs, for example:

  • The C string could inadvertently contain trigraphs.

  • Valid hexdecimal characters following a hexadecimal escape sequence would be parsed as part of the escape sequence.

Octal escape sequences are used for unprintable characters because they do not have the same issue as hexadecimal escape sequences.

It would be possible to mitigate the issue with hexadecimal escape sequences by emitting multiple string literals. However, the complexity is not worth it, especially as the output is not designed to be human-readable.

As an ad hoc test, I generated a Python module containing string literals for every possible character:

with open("literals.py", "w") as f:
    f.write(
        "LITERALS = [{}]".format(
            ", ".join([repr(bytes([x, ord("0")])) for x in range(256)])
        )
    )

Then I compiled the module with Mypyc and checked it produced the correct byte strings:

import literals

assert literals.__file__.endswith(".so")

for x in range(256):
    assert literals.LITERALS[x] == bytes([x, ord("0")])
Using repr() to encode C strings relied on implementation-defined
behaviour and had a number of bugs, for example:

* The C string could inadvertently contain trigraphs.

* Valid hexdecimal characters following a hexadecimal escape sequence
  would be parsed as part of the escape sequence.

Octal escape sequences are used for unprintable characters because they
do not have the same issue as hexadecimal escape sequences.

It would be possible to mitigate the issue with hexadecimal escape
sequences by emitting multiple string literals. However, the complexity
is not worth it, especially as the output is not designed to be
human-readable.
@msullivan msullivan self-requested a review Nov 19, 2019
Copy link
Collaborator

msullivan left a comment

Good catch! Thanks for this.

Basically looks good, but a few comments about reorganizing it a little.

I'm also sad about the octal thing but oh well.

@@ -421,8 +449,7 @@ def encode_as_c_string(s: str) -> Tuple[str, int]:

def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
"""Produce a single-escaped, quoted C string and its size from a bytes"""
# This is a kind of abusive way to do this...
escaped = repr(b)[2:-1].replace('"', '\\"')
escaped = ''.join(map(C_CHAR_MAP.__getitem__, b))

This comment has been minimized.

Copy link
@msullivan

msullivan Nov 19, 2019

Collaborator

I'd write this as a list comprehension instead of a map

@@ -67,6 +68,33 @@
# A list of (file name, file contents) pairs.
FileContents = List[Tuple[str, str]]

# The C standard specifies that any number of valid hexadecimal characters are

This comment has been minimized.

Copy link
@msullivan

msullivan Nov 19, 2019

Collaborator

I think it's worth moving this out of emitmodule at this point. mypyc.cstring, maybe?

This comment has been minimized.

Copy link
@msullivan

msullivan Nov 19, 2019

Collaborator

(And then encode_as_c_string and encode_bytes_as_c_string too)

# These assignments must be done after string.printable because they are
# overrides for the printable characters that need to be escaped in string
# literals.
C_CHAR_MAP[ord('\'')] = '\\\''

This comment has been minimized.

Copy link
@msullivan

msullivan Nov 19, 2019

Collaborator

These are all generated the same way (adding a backslash), so I think it would be an improvement to generate them all mechanically from a list of characters to escape.

This comment has been minimized.

Copy link
@saleemrashid

saleemrashid Nov 19, 2019

Author Contributor

I considered this, but the neatest way I can think to do this is:

for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
    escaped = '\\{}'.format(x)
    value = escaped.encode('ascii').decode('unicode_escape')
    C_CHAR_MAP[ord(value)] = escaped

I am not sure whether this is an improvement. What do you think?

@saleemrashid saleemrashid force-pushed the saleemrashid:mypyc-cstr-encoding branch from d389606 to bad98fa Nov 19, 2019
@saleemrashid

This comment has been minimized.

Copy link
Contributor Author

saleemrashid commented Nov 19, 2019

I've force-pushed some changes, i.e. rewrote the comments, and refactored the escaped strings to raw string literals.

I will try and implement your suggestions shortly.

I'm also sad about the octal thing but oh well.

@msullivan My original prototype used itertools.groupby to group blocks of printable characters and unprintable characters so you could safely use hexadecimal escape sequences. It worked fine but it seemed like unnecessary complexity.

@msullivan

This comment has been minimized.

Copy link
Collaborator

msullivan commented Nov 19, 2019

@msullivan My original prototype used itertools.groupby to group blocks of printable characters and unprintable characters so you could safely use hexadecimal escape sequences. It worked fine but it seemed like unnecessary complexity.

I agree that using octal is preferable to writing complex code that avoids it.

@saleemrashid saleemrashid force-pushed the saleemrashid:mypyc-cstr-encoding branch from bad98fa to 7bc5d89 Nov 19, 2019

# These assignments must come last because we prioritize the escape
# sequences in the C standard over any other representation.
CHAR_MAP[ord('\'')] = r'\''

This comment has been minimized.

Copy link
@saleemrashid

saleemrashid Nov 19, 2019

Author Contributor

@msullivan This is the simplest method I have been able to devise to generate these escape sequences:

for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
    escaped = '\\{}'.format(x)
    value = escaped.encode('ascii').decode('unicode_escape')
    C_CHAR_MAP[ord(value)] = escaped

While I am not a huge fan of manually listing each escape sequence, I dislike this method of generating them even more.

This comment has been minimized.

Copy link
@msullivan

msullivan Nov 19, 2019

Collaborator

The encode/decode is a little hokey but I think I like this more

This comment has been minimized.

Copy link
@saleemrashid

saleemrashid Nov 19, 2019

Author Contributor

The main reason I dislike this is that we're calling into Python's Unicode APIs. There's also codecs.escape_decode that handles ASCII only, but it's undocumented.

@saleemrashid saleemrashid force-pushed the saleemrashid:mypyc-cstr-encoding branch 2 times, most recently from 9389393 to 3ca384a Nov 19, 2019
@msullivan

This comment has been minimized.

Copy link
Collaborator

msullivan commented Nov 19, 2019

Could you also remove -Wno-trigraph from our C flags in build.py <_<

return encode_bytes_as_c_string(s.encode('utf-8'))


def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:

This comment has been minimized.

Copy link
@saleemrashid

saleemrashid Nov 19, 2019

Author Contributor

This is what an implementation with hexadecimal escape sequences, instead of octal, would look like:

diff --git i/mypyc/cstring.py w/mypyc/cstring.py
index 37f3ab38..0557a216 100644
--- i/mypyc/cstring.py
+++ w/mypyc/cstring.py
@@ -19,9 +19,11 @@ octal digits.
 """
 
 import string
+import itertools
 from typing import Tuple
 
-CHAR_MAP = ['\\{:03o}'.format(i) for i in range(256)]
+HEX_DIGITS = frozenset(string.hexdigits)
+CHAR_MAP = {}
 
 # It is safe to use string.printable as it always uses the C locale.
 for c in string.printable:
@@ -49,5 +51,13 @@ def encode_as_c_string(s: str) -> Tuple[str, int]:
 
 def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
     """Produce a quoted C string literal and its size, for a byte string."""
-    escaped = ''.join([CHAR_MAP[i] for i in b])
+    escaped = ''
+    for printable, group in itertools.groupby(b, key=CHAR_MAP.__contains__):
+        if printable:
+            s = ''.join([CHAR_MAP[i] for i in group])
+            if len(escaped) and s[0] in HEX_DIGITS:
+                escaped += '" "'
+            escaped += s
+        else:
+            escaped += ''.join(['\\x{:02X}'.format(i) for i in group])
     return '"{}"'.format(escaped), len(b)
The compiler will not complain about trigraphs anymore because we now
escape all question marks in C string literals.
@saleemrashid saleemrashid force-pushed the saleemrashid:mypyc-cstr-encoding branch 3 times, most recently from 22bc75f to d4f3656 Nov 19, 2019
@msullivan msullivan merged commit 05b92c0 into python:master Nov 20, 2019
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@msullivan

This comment has been minimized.

Copy link
Collaborator

msullivan commented Nov 20, 2019

Thanks!

@saleemrashid saleemrashid deleted the saleemrashid:mypyc-cstr-encoding branch Nov 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.