Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Spark Crashing with data corruption after contiguous split change #6842

Closed
revans2 opened this issue Nov 24, 2020 · 4 comments
Closed
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@revans2
Copy link
Contributor

revans2 commented Nov 24, 2020

Describe the bug
After 591bead went in we started to see spark tests crash with memory corruption.

Steps/Code to reproduce bug
There is no simple repro case right now. We run the hash aggregate integration test in the spark plugin repo and they crash very consistently (with errors indicating memory corruption), but when I capture the input to contiguous split as parquet and rerun it everything works fine. I am going to try and trace this down more, but I don't know when I might come up with that.

Expected behavior
No crashing no memory corruption.

Environment overview (please complete the following information)
My desktop Ubuntu 16 with a V100 16GB

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Nov 24, 2020
@revans2
Copy link
Contributor Author

revans2 commented Nov 25, 2020

So I have a repro case now and it looks like there are multiple things wrong. I managed to find, and I think fix one of them, but I have not been able to trace down and will probably need some help with.

First the repro case.

diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 462f66e6ab..a8a55a4c03 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -705,3 +705,7 @@ add_custom_command(OUTPUT CUDF_DOXYGEN
                    VERBATIM)
 
 add_custom_target(docs_cudf DEPENDS CUDF_DOXYGEN)
+
+
+add_executable(my_main "my_main.cu")
+target_link_libraries(my_main cudf)
diff --git a/cpp/my_main.cu b/cpp/my_main.cu
new file mode 100644
index 0000000000..f117064584
--- /dev/null
+++ b/cpp/my_main.cu
@@ -0,0 +1,21 @@
+#include <cudf/io/parquet.hpp>
+#include <cudf/copying.hpp>
+
+#include <iostream>
+
+
+int main(int argc, char ** argv) {
+  for (int i = 1; i < argc; i++) {
+    try {
+      std::cerr << "READING " << i << " " << argv[i] << std::endl;
+      cudf::io::source_info source(argv[i]);
+      cudf::io::parquet_reader_options opts = cudf::io::parquet_reader_options::builder(source).build();
+      cudf::io::table_with_metadata tab = cudf::io::read_parquet(opts);
+      std::vector<cudf::size_type> indicies;
+      indicies.emplace_back(tab.tbl->num_rows());
+      std::vector<cudf::contiguous_split_result> result = cudf::contiguous_split(*tab.tbl, indicies);
+    } catch (const std::exception & e) {
+      std::cerr << "ERROR WITH " << i << " " << argv[i] << " " << e.what() << std::endl;
+    }
+  }
+}
diff --git a/cpp/src/copying/contiguous_split.cu b/cpp/src/copying/contiguous_split.cu
index 0719af9756..5b92862286 100644
--- a/cpp/src/copying/contiguous_split.cu
+++ b/cpp/src/copying/contiguous_split.cu
@@ -587,7 +587,7 @@ BufInfo build_output_columns(InputIter begin,
     uint8_t const* data_ptr;
     size_type size;
     std::tie(data_ptr, size) = [&]() {
-      if (src.head()) {
+      if (src.num_children() == 0) {
         auto const ptr = base_ptr + current_info->dst_offset;
         // if we have data, num_elements will always be the correct size.
         // we don't want to use num_rows because if we are an offset column, num_rows

This adds in a new executable that reads in a parquet file and then does a contiguous split on it for the entire size of the table.

The change to contiguous_split.cu is what I found after debugging yesterday where if we have a String column that is all empty strings or nulls then the code interpreted that incorrectly and assumed that the data column was a parent column and got off by one. But I am still running into errors with columns that come after that, which I cannot totally explain.

The files to test are in the following zip. They are generated from random data so it is not a big deal.

So to reproduce the error you can run something like

$ ./my_main target/95_18.parquet
READING 1 ./target/95_18.parquet
ERROR WITH 1 ./target/95_18.parquet cuDF failure at: ../src/table/table_view.cpp:33: Column size mismatch.

data.zip

I also see a lot of errors when running with mem-check, but the issues don't show up for all of the failing parquet files.

The files that I regularly see fail are

100_20.parquet
100_46.parquet
101_24.parquet
101_26.parquet
102_38.parquet
102_4.parquet
102_93.parquet
103_2.parquet
103_30.parquet
104_44.parquet
104_6.parquet
105_16.parquet
105_48.parquet
94_14.parquet
94_40.parquet
95_124.parquet
95_18.parquet
95_32.parquet
96_10.parquet
96_28.parquet
97_12.parquet
97_137.parquet
97_34.parquet
98_132.parquet
98_36.parquet
98_8.parquet
99_22.parquet
99_42.parquet

Some of those that I see mem-check errors with are

95_32.parquet
99_22.parquet

But I think there are others that pass, but also have mem-check errors.

One of the main places I see an issue is something like

========= Invalid __global__ read of size 4
=========     at 0x000003e0 in cudf::_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e21::copy_partition(int, int, unsigned char**, unsigned char**, cudf::_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e21::dst_buf_info*)
=========     by thread (4,0,0) in block (15,0,0)
=========     Address 0x7f58bd202e40 is out of bounds
=========     Device Frame:cudf::_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e21::copy_partition(int, int, unsigned char**, unsigned char**, cudf::_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e21::dst_buf_info*) (cudf::_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e21::copy_partition(int, int, unsigned char**, unsigned char**, cudf::_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e21::dst_buf_info*) : 0x3e0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel_ptsz + 0x346) [0x2ca006]
=========     Host Frame:/home/roberte/miniconda3/envs/cudf_dev/lib/libcudart.so.10.1 [0x19bcb]
=========     Host Frame:/home/roberte/miniconda3/envs/cudf_dev/lib/libcudart.so.10.1 [0x19c77]
=========     Host Frame:/home/roberte/miniconda3/envs/cudf_dev/lib/libcudart.so.10.1 (cudaLaunchKernel_ptsz + 0x225) [0x4ecf5]
=========     Host Frame:/home/roberte/src/cudf/cpp/build/libcudf_base.so (_Z144__device_stub__ZN4cudf75_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e2114copy_partitionEiiPPhS2_PNS0_12dst_buf_infoEiiPPhS0_PN4cudf75_GLOBAL__N__51_tmpxft_00002622_00000000_6_contiguous_split_cpp1_ii_26302e2112dst_buf_infoE + 0x12a) [0x7af42a]
=========     Host Frame:/home/roberte/src/cudf/cpp/build/libcudf_base.so (_ZN4cudf6detail16contiguous_splitERKNS_10table_viewERKSt6vectorIiSaIiEEN3rmm16cuda_stream_viewEPNS9_2mr22device_memory_resourceE + 0xd05) [0x7b0605]
=========     Host Frame:/home/roberte/src/cudf/cpp/build/libcudf_base.so (_ZN4cudf16contiguous_splitERKNS_10table_viewERKSt6vectorIiSaIiEEPN3rmm2mr22device_memory_resourceE + 0x8e) [0x7b10be]
=========     Host Frame:./my_main [0x47df]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21bf7]
=========     Host Frame:./my_main [0x54ea]

I will keep digging.

@sperlingxx
Copy link
Contributor

sperlingxx commented Nov 26, 2020

Hi @revans2, I am following your work, and I think I made some progress. After changing the contiguous_split.cu with:

diff --git a/cpp/src/copying/contiguous_split.cu b/cpp/src/copying/contiguous_split.cu
index 38466de12c..d7d21ff2da 100644
--- a/cpp/src/copying/contiguous_split.cu
+++ b/cpp/src/copying/contiguous_split.cu
@@ -435,6 +435,11 @@ std::pair<src_buf_info*, size_type> buf_info_functor::operator()<cudf::string_vi
                           parent_offset_index,
                           false,
                           col.offset());
+
+  if (scv.chars_size() == 0) {
+    return {current + 1, offset_stack_pos + offset_depth};
+  }
+
   current++;
   offset_stack_pos += offset_depth;
 
@@ -600,7 +605,8 @@ BufInfo build_output_columns(InputIter begin,
       // Parent columns w/o data (e.g., strings, lists) don't have an associated `dst_buf_info`,
       // therefore, use the first child's info. their num_rows value will be correct (also see
       // comment above)
-      return std::make_pair(static_cast<uint8_t const*>(nullptr), current_info->num_rows);
+      auto const size = (src.num_children() == 0) ? 0 : current_info->num_rows;
+      return std::make_pair(static_cast<uint8_t const*>(nullptr), size);
     }();
     auto children = std::vector<column_view>{};
     children.reserve(src.num_children());

It seems to work on the case of string columns without valid value. I did some verification through adding case of null string columns into split (unit) tests. The additional case fails with latest branch-0.17, and succeeds after above code change. And I also verified it with parquets you provided. 95_18.parquet, 95_32.parquet, and 99_22.parquet are all happy to this change.
Due to some environmental problems, I am not able to run integration tests on my local machine.

sperlingxx added a commit that referenced this issue Nov 30, 2020
This PR attempts to address issue #6842, which may be caused by multiple reasons. For now, one certain problem is contiguous splitting on null (fully invalid) string columns, which only contains single child column (offset column). This PR is about to fix this problem.
@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 3, 2020

Was this fixed by #6853?

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Dec 3, 2020
@revans2
Copy link
Contributor Author

revans2 commented Dec 3, 2020

Was this fixed by #6853?

Yes

@revans2 revans2 closed this as completed Dec 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

3 participants