Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add efficient primitives for str.strip(). #18742

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

advait-dixit
Copy link
Contributor

Fixes mypyc/mypyc#1090.

Copying cpython implementation for strip, lstrip and rstrip to str_ops.c.

Copy link
Collaborator

@JukkaL JukkaL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good overall, just a few things I noticed. Can you run some microbenchmarks to make sure the performance impact is as expected?

@@ -774,3 +774,16 @@ def test_surrogate() -> None:
assert ord(f()) == 0xd800
assert ord("\udfff") == 0xdfff
assert repr("foobar\x00\xab\ud912\U00012345") == r"'foobar\x00«\ud912𒍅'"

[case testStrip]
# This is a negative test. strip variants without args does not use efficient primitives.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment out of date / incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was out of date. Removed it.

s = "xxb2yy"
assert s.lstrip("xy") == "b2yy"
assert s.strip("xy") == "b2"
assert s.rstrip("xy") == "xxb2"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test all string kinds and different character code ranges, such as these (and mixing these):

  • Character codes between 128 and 255 (0x80 to 0xff)
  • Character codes between 256 and 65535 (0x100 to 0xffff)
  • Character codes 65536+ (0x10000+)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -5,6 +5,58 @@
#include <Python.h>
#include "CPy.h"

// Copied from cpython.git:Objects/unicodeobject.c.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention the Python version or commit date from where this is from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BLOOM_MASK sepmask;
Py_ssize_t seplen;

kind = PyUnicode_KIND(self);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to call PyUnicode_READY on self and sepobj (and check the return values) on 3.9 at least, before you can call PyUnicode_KIND etc. You can check the relevant function in Python 3.9 to see how it's used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Copied from do_strip function in cpython.git/Objects/unicodeobject.c.
PyObject *_CPyStr_Strip(PyObject *self, int strip_type, PyObject *sep) {
if (sep == NULL || sep == Py_None) {
Py_ssize_t len, i, j;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, I think you'll need to call PyUnicode_READY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Fixing code comments.
* Adding tests with more unicode chars.
* Adding commit ID for code copied from cpython.git.
@advait-dixit
Copy link
Contributor Author

Thanks for the PR! Looks good overall, just a few things I noticed. Can you run some microbenchmarks to make sure the performance impact is as expected?

I have pushed a new commit that addresses review comments.
Below is a micro-benchmark on on my laptop. It compares CPython vs mypyc@this branch vs mypyc@master.

(py313_venv) advait@advaits-mbp mypy % cat prof_strip.py 
def solve1(s: str) -> str:
    for i in range(200):
        s.strip()
    return "s"

(py313_venv) advait@advaits-mbp mypy % cat driver.py 
import prof_strip
import timeit
import random
random_characters = "".join([chr(random.randint(0x0020, 0x10FFFF)) for _ in range(20)])
print(timeit.timeit(lambda: prof_strip.solve1(random_characters), number=100000))

(py313_venv) advait@advaits-mbp mypy % python driver.py                            
0.38529958299477585

(py313_venv) advait@advaits-mbp mypy % mypyc prof_strip.py > /dev/null && python driver.py
0.22940995899261907

(py313_venv) advait@advaits-mbp mypy % rm -rf build *.so && git checkout master && mypyc prof_strip.py > /dev/null && python driver.py
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
0.45345445899874903

@advait-dixit advait-dixit requested a review from JukkaL March 5, 2025 00:08
@cdce8p cdce8p added the topic-mypyc mypyc bugs label Mar 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-mypyc mypyc bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add primitives for strip(), rstrip() and lstrip()
3 participants