octal escapes applied inconsistently throughout the interpreter and lib #81548

mr-nfamous · 2019-06-21T19:46:09Z

BPO	37367
Nosy	@gvanrossum, @terryjreedy, @taleinat, @ezio-melotti, @serhiy-storchaka, @mr-nfamous, @websurfer5
PRs	bpo-37367: octal escapes applied inconsistently throughout the interpreter and lib #14654

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-06-21.19:46:08.946>
labels = ['expert-regex', 'easy', '3.9']
title = 'octal escapes applied inconsistently throughout the interpreter and lib'
updated_at = <Date 2019-11-20.09:14:13.360>
user = 'https://github.com/mr-nfamous'

bugs.python.org fields:

activity = <Date 2019-11-20.09:14:13.360>
actor = 'taleinat'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2019-06-21.19:46:08.946>
creator = 'bup'
dependencies = []
files = []
hgrepos = []
issue_num = 37367
keywords = ['patch', 'easy', 'easy (C)']
message_count = 9.0
messages = ['346246', '346865', '347261', '347411', '347516', '356817', '357026', '357031', '357041']
nosy_count = 8.0
nosy_names = ['gvanrossum', 'terry.reedy', 'taleinat', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'bup', 'Jeffrey.Kintscher']
pr_nums = ['14654']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue37367'
versions = ['Python 3.9']

mr-nfamous · 2019-06-21T19:46:09Z

At present, the bytecode compiler can generate 512 different unicode characters, one for each integral from the range [0-511), 512 being the total number of syntactically valid permutations of 3 octal digits preceded by a backslash. However, this does not match the regex compiler, which raises an error regardless of the input type when it encounters an an octal escape character with a decimal value greater than 255. On the other hand... the bytes literal:

>> b'\407'

is somehow valid, and can lead to extremely difficult bugs to track down, such as this nonsense:

>>> re.compile(b'\407').search(b'\a')
<re.Match object; span=(0, 1), match=b'\x07'>

I propose that the regex parser be augmented, enabling for unicode patterns the interpretation of three character octal escapes from the range(256, 512), while the bytecode parser be adjusted to match the behavior of the regex parser, raising an error for bytes literals > b"\400", rather than truncating the 9th bit.

terryjreedy · 2019-06-28T21:04:31Z

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
says, for \ooo, "In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value."

I agree that sometimes truncating an invalid integer instead of always raising ValueError is strange.

>>> ord(b'\407')
7
>>> bytes((0o407,))
...
ValueError: bytes must be in range(0, 256)

I don't know is there was an intentional back-compatibility reason for this.

Without an example of re raising, I don't understand the re complaint.

websurfer5 · 2019-07-04T08:07:51Z

>>> b'\407'
b'\x07'
>>> ord(b'\407')
7

This is the object structure passed to builtin_ord():

(lldb) p *((PyBytesObject *)(c))
(PyBytesObject) $19 = {
ob_base = {
ob_base = {
ob_refcnt = 4
ob_type = 0x00000001003cb0b0
}
ob_size = 1
}
ob_shash = 8685212186264880044
ob_sval = {
[0] = '\a'
}
}

If two bytes were stored (0x107), I would expect ob_sval[0] to be 7 ('\a') and ob_sval[1] to be 1 on a little endian system, but ob_sval[1] is 0:

(lldb) p (long)(unsigned char) (((PyBytesObject *)(c))->ob_sval[0])
(long) $23 = 7
(lldb) p (long)(unsigned char) (((PyBytesObject *)(c))->ob_sval[1])
(long) $24 = 0

This means the truncation to a single byte is happening when the byte string object is created.

websurfer5 · 2019-07-06T00:53:00Z

Here is the problematic code in _PyBytes_DecodeEscape in Objects/bytesobject.c:

            c = s[-1] - '0';
            if (s < end && '0' <= *s && *s <= '7') {
                c = (c<<3) + *s++ - '0';
                if (s < end && '0' <= *s && *s <= '7')
                    c = (c<<3) + *s++ - '0';
            }
            *p++ = c;

c is an int, and p is a char pointer to the new bytes object's string buffer. For b'\407', c gets correctly calculated as 263 (0x107), but the upper bits are lost when it gets recast as a char and stored in the location pointed to by p. Hence, b'\407' becomes b'\x07' when the object is created.

IMO, this should raise "ValueError: bytes must be in range(0, 256)" instead of silently throwing away the upper bits. I will work on a PR.

I also took a look at how escaped hex values are handled by the same function. It may seem at first glance that

>>> b'\x107'
b'\x107'

is returning the hex value 0x107, but in reality it is returning '\x10' as the first character and '7' as the second character. While visually misleading, it is syntactically and semantically correct.

websurfer5 · 2019-07-09T03:13:19Z

The PR is ready for review. It raises ValueError if the escaped octal value in a byte string is greater than 255.

taleinat · 2019-11-17T16:00:26Z

The PR looks pretty good, but the question is whether we want to break backwards compatibility in the name of correctness.

In this case, the silent buggy behavior of keeping the (mod 256) of the value seems worth fixing to me.

terryjreedy · 2019-11-20T04:13:55Z

I can't find who wrote this block treating octal escapes beginning with 4-7 the same as those beginning with 0-3.

        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7':
            c = s[-1] - '0';
            if (s < end && '0' <= *s && *s <= '7') {
                c = (c<<3) + *s++ - '0';
                if (s < end && '0' <= *s && *s <= '7')
                    c = (c<<3) + *s++ - '0';
            }
            *p++ = c;
            break;

Antoine Pitrou merged from somewhere in 2010 and Christiqn Heimes renamed something in 2008 and before that, ???. Guido wrote the initial bytesobject.c in 2006.

Guido, do you agree that the current behavior, treating the same int differently when input into a bytes in different ways, is a bug, and if so, should we backport the fix?

>>> b'\407'
b'\x07'
>>> bytes((0o407,))
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    bytes((0o407,))
ValueError: bytes must be in range(0, 256)

gvanrossum · 2019-11-20T06:09:39Z

For sure the Python tokenizer/parser should reject octal escapes that produce values >= 256. Certainly in bytes strings. Probably also in text strings (nobody using Unicode thinks in octal). This is ancient behavior though (it's the same in 2.7) so it may require a deprecation for the text string case. (For byte strings, it should become an error in 3.9 -- dropping the top bit is nonsensical.)

The regex parser should not be changed.

taleinat · 2019-11-20T09:14:13Z

Alright, so let's push through the existing PR for rejecting such octal escapes in byte strings.

We'll also need another PR for raising a deprecation warning for such octal escapes in strings.

serhiy-storchaka · 2022-04-18T17:29:57Z

#91668 is an alternate PR which deprecates invalid octal sequences in bytes and string literals, in the same way as invalid escape sequences are deprecated.

terryjreedy · 2022-04-18T21:43:15Z

Given that the current code matches the doc, which says "As in Standard C, up to three octal digits are accepted." without qualification, it seems reasonable to treat this as a design bug (to be deprecated) rather than an implementation bug (to be immediately changed). I think it definitely preferable to deprecate now, before the upcoming beta, and close #14654, than to do nothing before 3.12.

…77 (GH-91668)

…e strings

…ngs (GH-92643) Automerge-Triggered-By: GH:pablogsal

…e strings (pythonGH-92643) Automerge-Triggered-By: GH:pablogsal (cherry picked from commit 0d8500c) Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>

…ngs (GH-92643) Automerge-Triggered-By: GH:pablogsal (cherry picked from commit 0d8500c) Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>

SilentGhost mannequin added topic-regex 3.9 only security fixes labels Jun 21, 2019

taleinat added easy labels Nov 20, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 18, 2022

pythongh-81548: Deprecate octal sequences with value larger than 0o377

906049a

serhiy-storchaka mentioned this issue Apr 18, 2022

gh-81548: Deprecate octal sequences with value larger than 0o377 #91668

Merged

serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.11 only security fixes and removed topic-regex easy 3.9 only security fixes labels Apr 18, 2022

serhiy-storchaka closed this as completed in #91668 Apr 30, 2022

serhiy-storchaka added a commit that referenced this issue Apr 30, 2022

gh-81548: Deprecate octal escape sequences with value larger than 0o3…

3483299

…77 (GH-91668)

sweeneyde mentioned this issue Apr 30, 2022

gh-81548: Convert :issue: to :gh: in whatsnew #92089

Merged

pablogsal added a commit to pablogsal/cpython that referenced this issue May 10, 2022

pythongh-81548: Clarify the deprecation of octal sequences affect byt…

592abf2

…e strings

miss-islington pushed a commit that referenced this issue May 16, 2022

gh-81548: Clarify the deprecation of octal sequences affect byte stri…

0d8500c

…ngs (GH-92643) Automerge-Triggered-By: GH:pablogsal

vstinner mentioned this issue Oct 18, 2022

Reject invalid escape sequences (and octal escape sequences) in bytes and Unicode strings #98401

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

octal escapes applied inconsistently throughout the interpreter and lib #81548

octal escapes applied inconsistently throughout the interpreter and lib #81548

mr-nfamous mannequin commented Jun 21, 2019

mr-nfamous mannequin commented Jun 21, 2019

terryjreedy commented Jun 28, 2019

websurfer5 mannequin commented Jul 4, 2019

websurfer5 mannequin commented Jul 6, 2019

websurfer5 mannequin commented Jul 9, 2019

taleinat commented Nov 17, 2019

terryjreedy commented Nov 20, 2019

gvanrossum commented Nov 20, 2019

taleinat commented Nov 20, 2019

serhiy-storchaka commented Apr 18, 2022

terryjreedy commented Apr 18, 2022

octal escapes applied inconsistently throughout the interpreter and lib #81548

octal escapes applied inconsistently throughout the interpreter and lib #81548

Comments

mr-nfamous mannequin commented Jun 21, 2019

mr-nfamous mannequin commented Jun 21, 2019

terryjreedy commented Jun 28, 2019

websurfer5 mannequin commented Jul 4, 2019

websurfer5 mannequin commented Jul 6, 2019

websurfer5 mannequin commented Jul 9, 2019

taleinat commented Nov 17, 2019

terryjreedy commented Nov 20, 2019

gvanrossum commented Nov 20, 2019

taleinat commented Nov 20, 2019

serhiy-storchaka commented Apr 18, 2022

terryjreedy commented Apr 18, 2022