[BUG] cudf::strings::replace_re can be 20X slower than a single thread on the CPU for some cases #12778

Closed
ttnghia opened this issue Feb 14, 2023 · 3 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

ttnghia commented Feb 14, 2023

As in #12694, we found a performance problem with cudf::strings::replace_re for some input data.

In particular, using the same data file as in #12694, Spark on the CPU (regexp_replace) completes its job in around 600ms, while the GPU version (which calls cudf::strings::replace_re) takes around 11 seconds on the same input.

@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify labels Feb 14, 2023

ttnghia commented Feb 14, 2023

Code to reproduce:

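{
    // ScopeTimer (presumably a local RAII timing helper) and strs_input (the input
    // strings column) are defined elsewhere and not shown here.
    ScopeTimer timer("replace_re");

    // Matches a literal "[{" or "}]".
    auto const pattern    = std::string("\\[\\{|\\}\\]");
    auto const regex_prog = cudf::strings::regex_program::create(
      pattern, cudf::strings::regex_flags::DEFAULT, cudf::strings::capture_groups::NON_CAPTURE);
    // Replace every match with the empty string.
    auto const replace = cudf::string_scalar("");
    cudf::strings::replace_re(strs_input, *regex_prog, replace);
}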

Output:

replace_re: 5002.102684ms


davidwendt commented Feb 15, 2023

Yes, slow performance on long strings is a known issue for any regex. This is likely not going to be fixed.
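
For a single-threaded CPU reference point on what the reproducer's pattern does, here is a minimal sketch using std::regex. It is only an illustration: it uses the C++ standard library's ECMAScript grammar, not Spark's Java regex engine and not libcudf's, and the function name `strip_brackets` is made up for this example.

```cpp
#include <regex>
#include <string>

// Illustrative only (hypothetical helper): removes every literal "[{" or "}]" from
// the input on one CPU thread, mirroring the pattern and empty replacement used in
// the reproducer. Uses std::regex, not the Spark or libcudf engines.
inline std::string strip_brackets(std::string const& input)
{
  static std::regex const pattern(R"(\[\{|\}\])");
  return std::regex_replace(input, pattern, "");
}
```

For example, `strip_brackets("[{\"a\":1}]")` returns `"a":1`, while a string containing neither `[{` nor `}]` comes back unchanged.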

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Apr 2, 2023
@GregoryKimball

I'll close this in favor of #13048, where we can track progress on long string performance.
