
Collaborations on columnar data structures #1

Closed
wesm opened this issue May 8, 2017 · 6 comments
Labels
proposal Change current process or code

Comments

@wesm
Contributor

wesm commented May 8, 2017

Excited to see this new org created. I am interested to see if Apache Arrow (i.e. contiguous columnar data, validity bitmap for nulls) is the appropriate data model for data on the GPU, and if we can collaborate on some aspects of the code. It seems that CUDA 7 now supports C++11, so in theory we could compile the Arrow C++ libraries with nvcc and provide necessary APIs to enable Numba to interact with the raw memory buffers. This might simplify IPC with GPU main memory (record batch loading and unloading) and make less work for you here. I have an NVIDIA GPU on my home desktop, so I could help with testing.
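For readers unfamiliar with the layout being discussed: Arrow stores each column as one contiguous data buffer plus a validity bitmap with one bit per value (least-significant bit first, per the Arrow format). A minimal pure-Python sketch of that layout (helper names here are illustrative, not Arrow's API):

```python
import array

def make_column(values):
    """Build an Arrow-style column: a contiguous int64 data buffer
    plus a validity bitmap (1 bit per value, LSB-first)."""
    data = array.array("q", (v if v is not None else 0 for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return data, bitmap

def is_valid(bitmap, i):
    """Check the validity bit for row i."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))

data, bitmap = make_column([1, None, 3])
```

Null slots keep a placeholder (here zero) in the data buffer; consumers are expected to check the bitmap before reading a value.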

@tmostak

tmostak commented May 16, 2017

Hi @wesm, thanks for this. Yes, we are excited about Arrow (even though we only support a subset at the moment) because it provides interoperability with many other systems and makes sense as a way to represent columnar data. I don't see any reason why it should not be performant on the GPU, as the MapD native format is quite similar (except we store nulls in-line when possible to save space and bandwidth). Would it make sense to set up a call with the project members so we can discuss ways to collaborate?

@wesm
Contributor Author

wesm commented May 16, 2017

That sounds good to me. Adding @julienledem @xhochy since they will be interested, and maybe others from the Apache Arrow team.

I am interested in

  • Ingest data into MapD from Arrow record batches (zero-copy, preferably)
  • Export data as Arrow record batches
  • UDF protocol for batch-based UDFs
  • Benchmarks and analysis of pros/cons of different columnar memory layouts on the GPU (you say you store nulls inline -- does that mean sentinel values? Otherwise I am not sure how you could be more efficient than 1 bit per value for data that has nulls).
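To make the trade-off in that last point concrete: a validity bitmap costs ceil(N/8) extra bytes per column, while sentinel nulls cost no extra memory but give up one representable value. A quick back-of-the-envelope sketch (INT32_MIN is an illustrative sentinel choice, not necessarily what MapD uses):

```python
# Bitmap overhead for one million int32 values: 125 KB on top of ~4 MB of data.
N = 1_000_000
bitmap_bytes = (N + 7) // 8           # 1 bit per value, rounded up to bytes
data_bytes = 4 * N                    # int32 payload
overhead = bitmap_bytes / data_bytes  # ~3.1% extra memory

# Sentinel encoding: nulls live in-line in the data buffer itself.
INT32_MIN = -2**31                    # illustrative sentinel value
values = [7, INT32_MIN, 42]           # the second entry is "null"
mask = [v != INT32_MIN for v in values]
```

The bitmap costs a predictable ~3% for int32 data; sentinels are free in space but require a comparison per value on read and shrink the valid domain of the type.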

As background, I did some GPU development for accelerating Bayesian inference problems years ago and did a fair amount of CUDA C and PyCUDA work, so I've had a long-standing interest in architecting data structures and memory access patterns for the GPU.

@billmaimone

Bingo on all fronts; all of this was mentioned in the talk I gave last week at GTC. We also have some basic work to do to support the rest of the data types (the prototype handled only simple, uncompressed numerics to keep things simple).

@wesm
Contributor Author

wesm commented May 18, 2017

Does the GPU benefit from columnar compression techniques like CPU-based columnar databases do?

@m1mc

m1mc commented May 18, 2017

@wesm, we already have some compression in the core engine, like dictionary encoding. We are also planning to tokenize any string column that contains only digits to save memory. These techniques don't need to be columnar, if you mean something like RLE or HCC. Either way, the aim is to keep GPU decoding fast.
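For context, dictionary encoding replaces each value with a small integer index into a dictionary of distinct values, which is what makes it GPU-friendly: decoding is a single gather. A pure-Python sketch of the idea (function names are illustrative, not the engine's API):

```python
def dictionary_encode(column):
    """Replace each value with an index into a list of distinct values."""
    dictionary, indices, seen = [], [], {}
    for v in column:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

def dictionary_decode(dictionary, indices):
    # Decoding is a gather: one lookup per row, trivially parallel on a GPU.
    return [dictionary[i] for i in indices]

dictionary, indices = dictionary_encode(["nyc", "sf", "nyc", "nyc"])
```

For low-cardinality string columns the savings are large: the strings are stored once, and each row shrinks to a small fixed-width index.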

@billmaimone

billmaimone commented Jun 7, 2017 via email

@mike-wendt mike-wendt added the proposal Change current process or code label Aug 6, 2018
@mike-wendt mike-wendt changed the title Collaborations on columnar data structures Aug 8, 2018
kkraus14 pushed a commit that referenced this issue Aug 23, 2018
mike-wendt pushed a commit that referenced this issue Oct 26, 2018
kkraus14 pushed a commit that referenced this issue Nov 27, 2018
harrism added a commit that referenced this issue Jan 24, 2019
ayushdg pushed a commit to ayushdg/cudf that referenced this issue May 8, 2019
raydouglass pushed a commit that referenced this issue May 13, 2019
shwina referenced this issue in shwina/cudf May 24, 2019
rjzamora pushed a commit to rjzamora/cudf that referenced this issue Jun 19, 2019
kkraus14 pushed a commit that referenced this issue Feb 10, 2020
OlivierNV added a commit to OlivierNV/cudf that referenced this issue Feb 21, 2020
kkraus14 pushed a commit that referenced this issue Apr 6, 2020
codereport added a commit to codereport/cudf that referenced this issue Jun 19, 2020
codereport added a commit to codereport/cudf that referenced this issue Jun 26, 2020
codereport added a commit to codereport/cudf that referenced this issue Jun 29, 2020
codereport added a commit to codereport/cudf that referenced this issue Jul 2, 2020
sperlingxx pushed a commit to sperlingxx/cudf that referenced this issue Oct 23, 2020
mythrocks added a commit to mythrocks/cudf that referenced this issue May 4, 2021
rapids-bot bot pushed a commit that referenced this issue Jul 22, 2022
bwyogatama referenced this issue in bwyogatama/cudf Aug 19, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 28, 2022
rapids-bot bot pushed a commit that referenced this issue Jun 9, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 22, 2023
raydouglass pushed a commit that referenced this issue Nov 7, 2023