Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TODO in tailmatch(): it does not support backward in all cases #60485

Closed
vstinner opened this issue Oct 19, 2012 · 9 comments
Closed

TODO in tailmatch(): it does not support backward in all cases #60485

vstinner opened this issue Oct 19, 2012 · 9 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@vstinner
Copy link
Member

BPO 16281
Nosy @loewis, @vstinner, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2013-01-03.02:20:27.254>
created_at = <Date 2012-10-19.00:51:23.526>
labels = ['interpreter-core']
title = 'TODO in tailmatch(): it does not support backward in all cases'
updated_at = <Date 2013-01-03.13:38:55.832>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2013-01-03.13:38:55.832>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2013-01-03.02:20:27.254>
closer = 'python-dev'
components = ['Interpreter Core']
creation = <Date 2012-10-19.00:51:23.526>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 16281
keywords = []
message_count = 9.0
messages = ['173301', '174441', '174847', '174884', '174893', '178901', '178902', '178941', '178944']
nosy_count = 4.0
nosy_names = ['loewis', 'vstinner', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue16281'
versions = ['Python 3.3', 'Python 3.4']

@vstinner
Copy link
Member Author

Oh oh, it looks like the implementation of tailmatch() was not finished:

        /* If both are of the same kind, memcmp is sufficient */
        if (kind_self == kind_sub) {
            return ...;
        }
        /* otherwise we have to compare each character by first accesing it */
        else {
            /* We do not need to compare 0 and len(substring)-1 because
               the if statement above ensured already that they are equal
               when we end up here. */
            /* TODO: honor direction and do a forward or backwards search */
            for (i = 1; i < end_sub; ++i) {
                if (PyUnicode_READ(kind_self, data_self, offset + i) !=
                    PyUnicode_READ(kind_sub, data_sub, i))
                    return 0;
            }
            return 1;
        }

@serhiy-storchaka
Copy link
Member

The result does not depend on the direction of comparison. This only affects speed. But who can to say in which direction comparison will be faster?

Here I see a one obvious opportunity for optimization:

if (kind_self < kind_sub)
    return 0;

After that and after processing the case (kind_self == kind_sub) only 3 special cases left: UCS1 in UCS2, UCS1 in UCS4, and UCS2 in UCS4. Get rid of slow PyUnicode_READ() for this cases will speed up the code. Also note that comparing first and last characters before memcmp can be a slowdown (because PyUnicode_READ() is slow). Try to compare first and last bytes.

@vstinner
Copy link
Member Author

vstinner commented Nov 4, 2012

Oh, PyUnicode_Tailmatch() documentation doesn't mention that the function can fail.

@serhiy-storchaka
Copy link
Member

Oh, PyUnicode_Tailmatch() documentation doesn't mention that the function
can fail.

But it does.

.. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
Py_ssize_t start, Py_ssize_t end, int direction)

Return 1 if substr matches str[start:end] at the given tail end
(direction == -1 means to do a prefix match, direction == 1 a suffix match),
0 otherwise. Return -1 if an error occurred.

@vstinner
Copy link
Member Author

vstinner commented Nov 5, 2012

Oh, PyUnicode_Tailmatch() documentation doesn't mention that the function
can fail.

But it does.

.. c:function:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr,
Py_ssize_t start, Py_ssize_t end, int direction)

Return 1 if substr matches str[start:end] at the given tail end
(direction == -1 means to do a prefix match, direction == 1 a suffix match),
0 otherwise. Return -1 if an error occurred.

Oh, I read the "documentation" in unicodeobject.h:

/* Return 1 if substr matches str[start:end] at the given tail end, 0
otherwise. */

The problem is that tailmatch() returns 0 if PyUnicode_READY() failed.
It is a bug, even if it cannot occur because PyUnicode_Tailmatch()
checks that the both strings are ready.

@serhiy-storchaka serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 15, 2012
@python-dev
Copy link
Mannequin

python-dev mannequin commented Jan 3, 2013

New changeset 49eb2488145d by Victor Stinner in branch 'default':
Close bpo-16281: handle tailmatch() failure and remove useless comment
http://hg.python.org/cpython/rev/49eb2488145d

@python-dev python-dev mannequin closed this as completed Jan 3, 2013
@vstinner
Copy link
Member Author

vstinner commented Jan 3, 2013

"Here I see a one obvious opportunity for optimization: ..."

@serhiy: Can you please open a new issue for this? I consider the issue as fixed: I just removed the TODO (for the reason explained in the changeset).

@serhiy-storchaka
Copy link
Member

Shouldn't this be applied to 3.3?

As for optimization, I made some benchmarks and didn't saw any significant difference. Usually this function used to check short ASCII heads and tails and any optimization will not be seen even under a microscope.

@vstinner
Copy link
Member Author

vstinner commented Jan 3, 2013

Shouldn't this be applied to 3.3?

It's just a cleanup, it doesn't fix any real bug. I prefer to not
pollute old versions with cleanup.

As for optimization, I made some benchmarks and didn't saw any significant difference. Usually this function used to check short ASCII heads and tails and any optimization will not be seen even under a microscope.

Ok, agreed.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)
Projects
None yet
Development

No branches or pull requests

2 participants