
Patch for potential buffer overrun in tokenizer.c #41437

Closed
glchapman mannequin opened this issue Jan 13, 2005 · 9 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments
glchapman mannequin commented Jan 13, 2005

BPO 1101726
Nosy @loewis, @doerwalter
Files
  • tokenizer.c.3.diff: patch without utf8 manipulation
  • pep263tests.zip: test files
  • test_pep263.py.diff
  • tokenizer.c.4.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/loewis'
    closed_at = <Date 2005-07-12.21:59:22.000>
    created_at = <Date 2005-01-13.15:45:20.000>
    labels = ['interpreter-core']
    title = 'Patch for potential buffer overrun in tokenizer.c'
    updated_at = <Date 2005-07-12.21:59:22.000>
    user = 'https://bugs.python.org/glchapman'

    bugs.python.org fields:

    activity = <Date 2005-07-12.21:59:22.000>
    actor = 'doerwalter'
    assignee = 'loewis'
    closed = True
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2005-01-13.15:45:20.000>
    creator = 'glchapman'
    dependencies = []
    files = ['6431', '6432', '6433', '6434']
    hgrepos = []
    issue_num = 1101726
    keywords = ['patch']
    message_count = 9.0
    messages = ['47529', '47530', '47531', '47532', '47533', '47534', '47535', '47536', '47537']
    nosy_count = 3.0
    nosy_names = ['loewis', 'doerwalter', 'glchapman']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1101726'
    versions = ['Python 2.4']

    glchapman mannequin commented Jan 13, 2005

    The fp_readl function in tokenizer.c has a potential
    buffer overrun; see:

    www.python.org/sf/1089395

    It is also triggered by trying to generate the pycom
    file for the Excel 11.0 typelib, which is where I ran
    into it.

    The attached patch allows successful generation of the
    Excel file, and it runs the fail.py script from the
    above report without an access violation. It also
    doesn't break any of the tests in the regression suite
    (probably something like fail.py should be added as a
    test).
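
    A minimal sketch of the kind of file that triggers the overrun
    (hypothetical; the actual fail.py is attached to the report above
    and may differ):

    ```python
    # Hypothetical reproduction in the spirit of fail.py: the coding
    # declaration sends the tokenizer through the decoding fp_readl path,
    # and a single very long line forces repeated buffer refills.
    f = open("longline_repro.py", "w")
    f.write("# -*- coding: utf-8 -*-\n")
    f.write("s = '%s'\n" % ("x" * 1000000))  # far beyond the initial buffer
    f.close()

    # Tokenizing the file from disk exercises fp_readl; on an unpatched
    # 2.4 build this could overrun the buffer.
    execfile("longline_repro.py")
    ```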

    It is not as efficient as it might be; with a function
    for determining the number of Unicode characters in a
    UTF-8 string, you could avoid some memory allocations.
    Perhaps such a function should be added to unicodeobject.c?
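
    A sketch of that helper idea (written in modern Python for
    illustration; the real function would be C in unicodeobject.c): in
    UTF-8, every code point begins with a byte that is not of the form
    10xxxxxx, so characters can be counted without decoding or allocating.

    ```python
    def utf8_char_count(data):
        # Count bytes whose top two bits are not 10 (continuation bytes);
        # each remaining byte starts exactly one code point in valid UTF-8.
        return sum(1 for b in bytearray(data) if (b & 0xC0) != 0x80)

    assert utf8_char_count(u"h\xe9llo".encode("utf-8")) == 5
    ```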

    And, of course, the patch definitely needs review. I'm
    especially concerned that my use of
    tok->decoding_buffer might be violating some kind of
    assumption that I missed.

    glchapman mannequin closed this as completed Jan 13, 2005
    glchapman mannequin assigned loewis Jan 13, 2005
    glchapman mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 13, 2005
    glchapman mannequin commented Jan 23, 2005

    I'm attaching a new patch (tokenizer.c.2.diff), which should
    be better since it avoids some unnecessary allocations and
    decoding/encoding. Hopefully I haven't made any
    unwarranted assumptions about UTF-8.

    doerwalter (contributor) commented

    I think the patch looks good; staring at it for a while, I
    believe the reference counts are OK. Can you incorporate
    fail.py into the test suite?

    glchapman mannequin commented Feb 27, 2005

    After thinking about it some more, I realized that fp_readl
    doesn’t have to do any UTF-8 manipulation. If the string it
    copies into the buffer does not include a \n, the buffer is
    expanded and fp_readl is called again until a \n is returned
    (or the end of the file is reached). It doesn’t matter if,
    during this loop, the buffer temporarily contains broken
    UTF-8 characters. So I’m attaching another new patch. This
    copies as many characters as fit into the buffer and saves
    any overflow into a new string object (stored in
    tok->decoding_buffer). When it is called again by the loop, it
    uses this overflow string to copy characters into the buffer.

    I’m also attaching a zip file with a modified test_pep263.py
    and some ancillary files. To test fp_readl, Python needs to
    read from a file. Perhaps it would be better to generate
    temporary files rather than include these extra files (see
    the sketch below). Also, although the tests will trigger the
    assert using a debug build of the 2.4 source, I’m not getting
    an access violation running regrtest against a release build
    (of 2.4) – perhaps different tests are in order?
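
    The temporary-file approach might look something like this (a
    sketch; the file contents and names here are made up):

    ```python
    # Generate the long-line test file on the fly instead of shipping
    # pep263_longline.py with the suite.
    import os
    from test.test_support import TESTFN

    filename = TESTFN + ".py"
    f = open(filename, "w")
    f.write("# -*- coding: utf-8 -*-\n")
    f.write("s = '%s'\n" % ("x" * 100000))  # longer than the tokenizer's buffer
    f.close()
    try:
        execfile(filename)  # reads the file from disk through fp_readl
    finally:
        os.remove(filename)
    ```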

    doerwalter (contributor) commented

    Maybe I've put the test files in the wrong spot, but what
    I'm getting is:

    ./python Lib/test/test_pep263.py
    test_pep263 (__main__.PEP263Test) ... ok
    test_evilencoding (__main__.EvilEncodingTest) ... ok
    Segmentation fault

    You should probably put pep263_evilencoding.py and
    pep263_longline.py alongside test_pep263.py and then find
    them via os.path.join(__file__.rsplit(os.sep, 1)[0],
    "pep263_longline.py"), as sketched below.

    glchapman mannequin commented Mar 21, 2005

    You're quite right: 2.4 reliably triggers an access violation
    with the new test_pep263. The copy of python24.dll I was
    testing against already had (the first version of) the patch
    applied. Anyway, I'm attaching a diff for test_pep263
    (against the original 2.4 test_pep263) which incorporates
    your suggestions about not assuming the current directory.

    doerwalter (contributor) commented

    Here is another version of the patch; this version doesn't
    pass any arguments to readline(), because it's no longer
    necessary to limit the line length. What do you think?
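
    For context, the difference at the level of the readline() call that
    fp_readl makes (illustrative file use only): a size argument caps how
    much of a line comes back per call, which is what forced the refill
    loop; with no argument, the whole line arrives at once.

    ```python
    f = open("pep263_longline.py")
    partial = f.readline(64)  # at most 64 characters; '\n' may be missing
    f.seek(0)
    whole = f.readline()      # the complete line, however long
    f.close()
    ```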

    glchapman mannequin commented May 17, 2005

    Looks good to me.

    doerwalter (contributor) commented

    Checked in as:
    Parser/tokenizer.c 2.78
    Parser/tokenizer.c 2.76.2.1

    ezio-melotti transferred this issue from another repository Apr 9, 2022