Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate strings contains operations to pylibcudf #15880

Merged

Conversation

brandon-b-miller
Copy link
Contributor

This PR creates pylibcudf strings contains APIs and migrates the cuDF cython to leverage them. Part of #15162.

@brandon-b-miller brandon-b-miller requested a review from a team as a code owner May 29, 2024 16:33
@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue labels May 29, 2024
return plc.interop.from_arrow(pa_target_col)


@pytest.fixture(params=["A"], scope="module")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to add some more actual tests here.

cpdef Column contains_re(
Column input,
RegexProgram prog
):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also want to start the sphinx docs for the strings submodule for pylibcudf.

As far as I can tell, there's no .rst files for strings in the api_docs/pylibcudf folder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few sphinx docs here

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good with two small questions.

raise ValueError("Do not instantiate RegexProgram directly, use create")

@staticmethod
def create(object pattern, object flags):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def create(object pattern, object flags):
def create(str pattern, RegexFlags flags):

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experimentation leads me to believe it must be done this way: 758755c

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you always have to use the cimported name when typing Cython functions, not the imported name. regex_flags != RegexFlags because the latter is imported rather than cimported. This is deliberate! We import the aliased name because that is the name we want to expose to users of pylibcudf in pure Python, using Python naming conventions for classes (CapsCase) since enums in Python are classes, while regex_flags is the snake_case representation of the Cython/C enum.

if isinstance(flags, (int, RegexFlags)):
c_flags = flags
with nogil:
c_prog = regex_program.create(c_pattern, c_flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Whose job is it to ensure that the provided input is a "valid" regex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The libcudf method should validate the pattern at the c++ level:

cudf::logic_error If pattern is invalid or contains unsupported features

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some simple tests of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's reasonable. Will add some

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of tests at b15588a

@lithomas1 lithomas1 added feature request New feature or request non-breaking Non-breaking change labels May 29, 2024
@lithomas1 lithomas1 added the pylibcudf Issues specific to the pylibcudf package label May 31, 2024
@brandon-b-miller
Copy link
Contributor Author

brandon-b-miller commented Jun 3, 2024

This commit 758755c broke tests it seems. Not sure why yet, but this causes cuDF python tests to fail when passing a value that is outside the enum range but represents the result of a bitwise or of valid values.

I think we can't type this signature this strongly because a value really can be an object, like a python integer. Typing the signature as regex_flags causes an attempted enum lookup of the int value which might be outside the range.

@lithomas1
Copy link
Contributor

lithomas1 commented Jun 3, 2024

What if you set the integer values (in the Cython .pxd) corresponding to the enum values like here?
https://docs.rapids.ai/api/libcudf/legacy/flags_8hpp_source#l00035

EDIT: Nvm, I misunderstood. I guess maybe just type it as an int then?

@brandon-b-miller
Copy link
Contributor Author

cef5d27 types the flags as an int, but requires a cast - I can't honestly tell if it feels better this way or not.

@lithomas1
Copy link
Contributor

Cool, this LGTM then.

One final note:
I realize that for docs, it's probably better to make a subfolder for strings (since it's possible that some modulenames clash, e.g. replace).

I don't think we necessarily need to rework the docs in this PR, but just wanted to give you a heads up that I'm probably going to rework the docs in #15839 (where I'm doing replace).

@brandon-b-miller
Copy link
Contributor Author

@wence- any final thoughts here?

@lithomas1 lithomas1 added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jun 5, 2024
@lithomas1
Copy link
Contributor

@brandon-b-miller

Just a quick heads-up that #15839 is probably going to land before this one (so you'll need to re-sync this PR with the doc changes there).

I've tagged this PR with DO NOT MERGE tentatively until #15839 gets in (but feel free to remove afterwards).

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Brandon

@lithomas1 lithomas1 removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jun 5, 2024
@brandon-b-miller
Copy link
Contributor Author

Just a quick heads-up that #15839 is probably going to land before this one (so you'll need to re-sync this PR with the doc changes there).

Thanks @lithomas1 , I merged the latest and plugged the docs into the structure you built out in the other PR. Should be good to go now.

@brandon-b-miller
Copy link
Contributor Author

brandon-b-miller commented Jun 5, 2024

Ah seems there's a small doc error somewhere:

/__w/cudf/cudf/docs/cudf/source/user_guide/api_docs/pylibcudf/contains.rst: WARNING: document isn't included in any toctree

Will look into whats missing.

EDIT: I think this was an extra file hanging around.

@brandon-b-miller
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 7fd6918 into rapidsai:branch-24.08 Jun 6, 2024
70 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jun 13, 2024
This PR adds cudf-polars code for evaluating the `StringFunction.Contains` expression node.

Depends on #15880

Authors:
  - https://github.com/brandon-b-miller
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15918
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants