Nightly memcheck failure caused by compute-sanitizer bug #15258

Open
davidwendt opened this issue Mar 8, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

@davidwendt (Contributor)
Describe the issue
Nightly builds are failing due to memcheck errors in specific gtests. The errors appear to be a compute-sanitizer tool issue, which has been filed as nvbug 4553815.
This issue documents the problem while possible workarounds are investigated until the bug is fixed.

The two errors appear as follows:

[ RUN      ] NumericValueIteratorTest/1.non_null_iterator
========= Invalid __shared__ read of size 16 bytes
=========     at 0x9670 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, thrust::minimum<void>>::Policy600, short *, short *, unsigned int, thrust::minimum<void>, short, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331d50]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x14fb4]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x70aae]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<short*, short*, thrust::minimum<void>, short, int>(void*, unsigned long&, short*, short*, int, thrust::minimum<void>, short, CUstream_st*) [clone .isra.0] [0x2fa199]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/ITERATOR_TEST
[ RUN      ] MinMaxReductionTest/0.MinMaxTypes
========= Invalid __shared__ read of size 16 bytes
=========     at 0x4310 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>>::Policy600, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331d50]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x14fb4]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x70aae]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, int>(void*, unsigned long&, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, CUstream_st*) [clone .isra.0] [0x18950ae]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda()#2}::operator()() const [0x18984c3]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:std::unique_ptr<cudf::scalar, std::default_delete<cudf::scalar> > cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1898a70]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduction::detail::min(cudf::column_view const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x187ea46]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:decltype(auto) cudf::detail::aggregation_dispatcher<cudf::reduction::detail::reduce_dispatch_functor, cudf::reduce_aggregation const&>(cudf::aggregation::Kind, cudf::reduction::detail::reduce_dispatch_functor&&, cudf::reduce_aggregation const&) [0x193431e]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduction::detail::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1934d71]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, rmm::mr::device_memory_resource*) [0x193583f]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:std::pair<short, bool> ReductionTest<short>::reduction_test<short>(cudf::column_view const&, cudf::reduce_aggregation const&, std::optional<cudf::data_type>) [clone .constprop.0] [0x28ec47]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/./REDUCTIONS_TEST

If these were real errors, they would also appear when running without compute-sanitizer.
The nvbug report includes a small reproducer that shows the error without any libcudf-specific code.

Steps/Code to reproduce

compute-sanitizer --tool memcheck gtests/ITERATOR_TEST --gtest_filter=NumericValueIteratorTest/1.non_null_iterator --rmm_mode=cuda
compute-sanitizer --tool memcheck gtests/REDUCTIONS_TEST --gtest_filter=MinMaxReductionTest/0.MinMaxTypes --rmm_mode=cuda

Note that the failure occurs only for int16 (short) integer types when performing a min-reduction through CUB.

Additional context
The behavior on various compute-sanitizer versions is as follows:

2022.3.0    ok
2022.4.0    ok
2022.4.1    fail
2023.1.1    fail
2023.2.2.0  fail
2023.3.1    fail

In general, it fails only with the compute-sanitizer versions shipped with CUDA 12.0 and above.

@davidwendt davidwendt added the bug Something isn't working label Mar 8, 2024
@davidwendt davidwendt self-assigned this Mar 8, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 11, 2024
Provides a workaround for the compute-sanitizer issue described in #15258 causing memcheck failures in nightly builds.
An environment variable, `LIBCUDF_MEMCHECK_ENABLED`, is introduced so test code can bypass the specific tests that trigger the compute-sanitizer error. The env var is set only during memcheck tests, since the failure does not occur in normal testing.
The failure occurs only for some `int16` or `uint16` reduction tests, so managing these few tests is reasonable.

Other possible workarounds include:
1. Reverting compute-sanitizer to the CUDA 11.8 version.
   Using the latest version is preferable, since the fix will likely not be backported.
2. Adding an exclude filter for the CUB Reduce kernel.
   This would disable checking for almost all reduction kernels.
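For the exclude-filter workaround, compute-sanitizer supports kernel-name filtering options. A sketch of what such an invocation might look like is below; the exact flag name and regex syntax should be verified against the compute-sanitizer documentation for the version in use.

```shell
# Hypothetical invocation: exclude CUB's DeviceReduceSingleTileKernel from
# memcheck so the false positive is suppressed. As noted above, this also
# disables checking for nearly every other reduction kernel, which is why
# the env-var-based test skip was preferred.
compute-sanitizer --tool memcheck \
  --kernel-name-exclude regex:DeviceReduceSingleTileKernel \
  gtests/REDUCTIONS_TEST --gtest_filter=MinMaxReductionTest/0.MinMaxTypes --rmm_mode=cuda
```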

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #15259
@vyasr (Contributor) commented May 17, 2024

We are skipping the problematic test to avoid CI failures as of #15259, but we don't have a fix for the underlying issue yet.

Status: In Progress
2 participants