[BUG] Memcheck error found in CSV_TEST CsvReaderNumericTypeTest/0.SingleColumn #14140

Closed
davidwendt opened this issue Sep 20, 2023 · 7 comments · Fixed by #15293
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@davidwendt
Contributor

Describe the bug
The nightly build's memcheck run found an error in the CSV_TEST gtest CsvReaderNumericTypeTest/0.SingleColumn.
No other CSV_TEST gtests are failing.

Steps/Code to reproduce bug

# compute-sanitizer --tool memcheck gtests/CSV_TEST --gtest_filter=CsvReaderNumericTypeTest/0.SingleColumn --rmm_mode=cuda
========= COMPUTE-SANITIZER
Note: Google Test filter = CsvReaderNumericTypeTest/0.SingleColumn
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CsvReaderNumericTypeTest/0, where TypeParam = long
[ RUN      ] CsvReaderNumericTypeTest/0.SingleColumn
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxPopCurrent_v2.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2cbe56]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0xd999f]
=========                in /conda/envs/rapids/lib/libcufile.so.0

@davidwendt davidwendt added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Sep 20, 2023
@davidwendt
Contributor Author

This is definitely something introduced in the code stack.
I can reproduce the error on several versions of compute-sanitizer, including:

  • 2022.3.0 (11.8)
  • 2022.2.1 (11.7)
  • 2022.1.1 (11.6)

The stack includes libcufile.so, so perhaps there was a recent change to it?

@davidwendt
Contributor Author

The nightly builds started failing on September 19.
I tried building locally with a commit before that day: 3b691f4
But that fails (as per the description) as well.
I'm not sure what the next steps are to track this down.

@davidwendt
Contributor Author

Looks like this error has been occurring for a while but never triggered a memcheck build failure. I checked logs back to the beginning of July and found the error has been occurring at least since then (logs are not kept before that).

@vuule
Contributor

vuule commented Sep 25, 2023

Ran the repro locally; it looks like this test fails only because it's the first one to run. Running a different CSV test on its own also leads to the error.
A bit more of the stack:

========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxPopCurrent_v2.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2aa4f4]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x12204f]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcufile.so.0
=========     Host Frame: [0x1251a4]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcufile.so.0
=========     Host Frame: [0x1256bd]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcufile.so.0
=========     Host Frame: [0x12c83e]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcufile.so.0
=========     Host Frame:cuFileDriverOpen [0x12ea29]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcufile.so.0
=========     Host Frame:kvikio::cuFileAPI::cuFileAPI() [0x150688c]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcudf.so
=========     Host Frame:kvikio::defaults::defaults() [0x1509255]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcudf.so
=========     Host Frame:kvikio::defaults::instance() [0x150956c]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcudf.so
=========     Host Frame:cudf::io::(anonymous namespace)::file_source::file_source(char const*) [0x1517cf7]
=========                in /home/vukasin/anaconda3/envs/cudf_dev/lib/libcudf.so
=========     Host Frame:cudf::io::datasource::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) [0x1519eb0]

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 9, 2023
@vuule
Contributor

vuule commented Dec 20, 2023

Looks like we forgot to update the issue: we discovered that there is a cuFile bug where cuFileDriverOpen fails if it's the first CUDA call in the thread/process. So, not a libcudf bug. We can add a simple workaround if needed.
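
The ordering problem described above can be sketched without any CUDA dependency. What follows is a small simulation only, not cuFile or libcudf code: `fake_driver_open` stands in for `cuFileDriverOpen`, `fake_runtime_noop` stands in for a no-op CUDA runtime call such as `cudaFree(0)`, and `fake_thread_state` is a hypothetical placeholder for per-thread CUDA context state. All three names are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for per-thread CUDA state; not a real CUDA type.
struct fake_thread_state {
  bool context_ready = false;
};

// Stand-in for a no-op runtime call such as cudaFree(0). Its only effect in
// this simulation is to mark the context as initialized; repeated calls are
// harmless.
inline void fake_runtime_noop(fake_thread_state& st) { st.context_ready = true; }

// Stand-in for cuFileDriverOpen. It reports failure when no other "CUDA call"
// has been made on the thread yet, mirroring the behavior described above.
inline bool fake_driver_open(fake_thread_state const& st) { return st.context_ready; }

// Workaround sketch: always issue the no-op call before opening the driver.
inline bool open_driver_with_workaround(fake_thread_state& st)
{
  fake_runtime_noop(st);
  bool const ok = fake_driver_open(st);
  assert(ok);  // holds by construction in this simulation
  return ok;
}
```

In this sketch, calling `fake_driver_open` on a fresh `fake_thread_state` returns `false`, while `open_driver_with_workaround` returns `true` regardless of what ran before it, which is why the test order stops mattering.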

@davidwendt
Contributor Author

... So, not a libcudf bug. We can add a simple workaround if needed.

I would definitely appreciate a workaround for this, since the error generates a significant amount of noise in the memcheck builds.

rapids-bot bot pushed a commit that referenced this issue Mar 16, 2024
Closes #14140

Added a no-op CUDA call before creating a `kvikio::FileHandle` to avoid the error in `cuFileDriverOpen`.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #15293
@davidwendt
Contributor Author

Looks like this has been fixed for CSV_TEST, but I'm still seeing a failure in the first PARQUET_TEST:

Running compute-sanitizer on PARQUET_TEST
========= COMPUTE-SANITIZER
[==========] Running 325 tests from 94 test suites.
[----------] Global test environment set-up.
[----------] 11 tests from ParquetChunkedReaderTest
[ RUN      ] ParquetChunkedReaderTest.TestChunkedReadNoData
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxPopCurrent_v2.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2ce616]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0xf2faf]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/./libcufile.so.0

rapids-bot bot pushed a commit that referenced this issue Apr 1, 2024
Issue #14140

Follow-up on #15293

Moving the `cudaFree(0)` call to a function called both by file `datasource` and `data_sink`.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15335
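
The follow-up change routes both the file `datasource` and the `data_sink` through one function that makes the no-op call. A rough sketch of that structure is below. It is not the libcudf implementation: `force_context_init`, `make_file_handle`, `file_source`, `file_sink`, and the call counter are all hypothetical names, and the body of `force_context_init` only counts calls where real code would call `cudaFree(0)`.

```cpp
#include <cassert>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical call counter so the example can observe the shared call.
inline int& init_calls()
{
  static int n = 0;
  return n;
}

// Stand-in for the no-op CUDA call; real code would call cudaFree(0) here.
inline void force_context_init() { ++init_calls(); }

// Hypothetical placeholder for a file handle such as kvikio::FileHandle.
struct file_handle {
  std::string path;
};

// Single shared entry point: every handle is created after the no-op call.
inline file_handle make_file_handle(std::string path)
{
  assert(!path.empty());
  force_context_init();
  return file_handle{std::move(path)};
}

// Reader side: obtains its handle through the shared function.
struct file_source {
  file_handle handle;
  explicit file_source(std::string path) : handle(make_file_handle(std::move(path))) {}
};

// Writer side: uses the same shared function, so it gets the workaround too.
struct file_sink {
  file_handle handle;
  explicit file_sink(std::string path) : handle(make_file_handle(std::move(path))) {}
};

}  // namespace sketch
```

Placing the call in the shared function rather than in the reader alone means a test that starts by writing a file, as some Parquet tests may, goes through the same initialization path as one that starts by reading.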