[BUG] Parquet reader unable to read duration types written by pyarrow #13410

Closed
galipremsagar opened this issue May 22, 2023 · 2 comments · Fixed by #15617
Labels
  • 2 - In Progress: Currently a work in progress
  • bug: Something isn't working
  • cuIO: cuIO issue
  • libcudf: Affects libcudf (C++/CUDA) code.

Comments

@galipremsagar
Contributor

Describe the bug
When a pyarrow table containing duration types is written to Parquet, the cudf reader appears to read the columns as int64 rather than the correct timedelta64[..] types.

Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

In [1]: import cudf

In [3]: df = cudf.DataFrame({"ms": cudf.Series([1234, 3456, 32442], dtype='timedelta64[ms]')})

In [4]: pa_table = df.to_arrow()

In [5]: import pyarrow as pa

In [6]: pa.parquet.write_table(pa_table, "a")

In [8]: pa.parquet.read_table("a")
Out[8]: 
pyarrow.Table
ms: duration[ms]
----
ms: [[1234,3456,32442]]

In [9]: pa_table
Out[9]: 
pyarrow.Table
ms: duration[ms]
----
ms: [[1234,3456,32442]]

In [10]: cudf.read_parquet("a")
Out[10]: 
      ms
0   1234
1   3456
2  32442

In [11]: df
Out[11]: 
                      ms
0 0 days 00:00:01.234000
1 0 days 00:00:03.456000
2 0 days 00:00:32.442000


In [12]: cudf.read_parquet("a").dtypes
Out[12]: 
ms    int64
dtype: object

In [13]: df.dtypes
Out[13]: 
ms    timedelta64[ms]
dtype: object

Expected behavior

In [10]: cudf.read_parquet("a")
Out[10]: 
                      ms
0 0 days 00:00:01.234000
1 0 days 00:00:03.456000
2 0 days 00:00:32.442000


In [12]: cudf.read_parquet("a").dtypes
Out[12]: 
ms    timedelta64[ms]
dtype: object

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [from source]

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Click here to see environment details
 ***git***
 commit 9b1496df64b9ae9bd7b44a30cfaa42a2f7e2db3f (HEAD -> branch-23.06)
 Author: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
 Date:   Mon May 22 13:52:36 2023 -0400
 
 Fix groupby head/tail for empty dataframe (#13398)
 
 Closes #13397
 
 Authors:
 - Ashwin Srinath (https://github.com/shwina)
 
 Approvers:
 - GALI PREM SAGAR (https://github.com/galipremsagar)
 - Bradley Dice (https://github.com/bdice)
 
 URL: https://github.com/rapidsai/cudf/pull/13398
 ***git submodules***
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
 NAME="Ubuntu"
 VERSION="18.04.4 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.4 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux dt07 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Mon May 22 13:53:56 2023
 +---------------------------------------------------------------------------------------+
 | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
 |-----------------------------------------+----------------------+----------------------+
 | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                                         |                      |               MIG M. |
 |=========================================+======================+======================|
 |   0  Tesla T4                        On | 00000000:3B:00.0 Off |                    0 |
 | N/A   45C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   1  Tesla T4                        On | 00000000:5E:00.0 Off |                    0 |
 | N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   2  Tesla T4                        On | 00000000:AF:00.0 Off |                    0 |
 | N/A   29C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   3  Tesla T4                        On | 00000000:D8:00.0 Off |                    0 |
 | N/A   29C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 
 +---------------------------------------------------------------------------------------+
 | Processes:                                                                            |
 |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
 |        ID   ID                                                             Usage      |
 |=======================================================================================|
 |  No running processes found                                                           |
 +---------------------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              64
 On-line CPU(s) list: 0-63
 Thread(s) per core:  2
 Core(s) per socket:  16
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
 Stepping:            4
 CPU MHz:             1412.660
 BogoMIPS:            4200.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            22528K
 NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
 NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
 
 ***CMake***
 /nvme/0/pgali/envs/cudfdev/bin/cmake
 cmake version 3.26.4
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /nvme/0/pgali/envs/cudfdev/bin/g++
 g++ (conda-forge gcc 11.3.0-19) 11.3.0
 Copyright (C) 2021 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 /nvme/0/pgali/envs/cudfdev/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2022 NVIDIA Corporation
 Built on Wed_Sep_21_10:33:58_PDT_2022
 Cuda compilation tools, release 11.8, V11.8.89
 Build cuda_11.8.r11.8/compiler.31833905_0
 
 ***Python***
 /nvme/0/pgali/envs/cudfdev/bin/python
 Python 3.10.11
 
 ***Environment Variables***
 PATH                            : /nvme/0/pgali/envs/cudfdev/bin:/nvme/0/pgali/envs/cudfdev/bin:/nvme/0/pgali/.cargo/bin:/home/nfs/pgali/.vscode-server/bin/b3e4e68a0bc097f0ae7907b217c1119af9e03435/bin/remote-cli:/nvme/0/pgali/.cargo/bin:/nvme/0/pgali/anaconda3/bin:/nvme/0/pgali/anaconda3/condabin:/nvme/0/pgali/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin
 LD_LIBRARY_PATH                 : /usr/local/cuda/lib64::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /nvme/0/pgali/envs/cudfdev
 PYTHON_PATH                     :
 
 ***conda packages***
 /nvme/0/pgali/anaconda3/bin/conda
 # packages in environment at /nvme/0/pgali/envs/cudfdev:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                 conda_forge    conda-forge
 _openmp_mutex             4.5                       2_gnu    conda-forge
 _sysroot_linux-64_curr_repodata_hack 3                   h69a702a_13    conda-forge
 accessible-pygments       0.0.4              pyhd8ed1ab_0    conda-forge
 aiobotocore               2.5.0              pyhd8ed1ab_0    conda-forge
 aiohttp                   3.8.4           py310h1fa729e_0    conda-forge
 aioitertools              0.11.0             pyhd8ed1ab_0    conda-forge
 aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
 alabaster                 0.7.13             pyhd8ed1ab_0    conda-forge
 anyio                     3.6.2              pyhd8ed1ab_0    conda-forge
 argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
 argon2-cffi-bindings      21.2.0          py310h5764c6d_3    conda-forge
 arrow-cpp                 11.0.0          ha770c72_20_cpu    conda-forge
 asttokens                 2.2.1              pyhd8ed1ab_0    conda-forge
 async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
 attrs                     23.1.0             pyh71513ae_1    conda-forge
 aws-c-auth                0.6.27               he072965_1    conda-forge
 aws-c-cal                 0.5.26               hf677bf3_1    conda-forge
 aws-c-common              0.8.19               hd590300_0    conda-forge
 aws-c-compression         0.2.16               hbad4bc6_7    conda-forge
 aws-c-event-stream        0.2.20               hb4b372c_7    conda-forge
 aws-c-http                0.7.7                h2632f9a_4    conda-forge
 aws-c-io                  0.13.21              h9fef7b8_5    conda-forge
 aws-c-mqtt                0.8.11               h2282364_1    conda-forge
 aws-c-s3                  0.3.0                hcb5a9b2_2    conda-forge
 aws-c-sdkutils            0.1.9                hbad4bc6_2    conda-forge
 aws-checksums             0.1.14               hbad4bc6_7    conda-forge
 aws-crt-cpp               0.20.1               he0fdcb3_3    conda-forge
 aws-sam-translator        1.55.0             pyhd8ed1ab_0    conda-forge
 aws-sdk-cpp               1.10.57             hb0b1f3a_12    conda-forge
 aws-xray-sdk              2.12.0             pyhd8ed1ab_0    conda-forge
 babel                     2.12.1             pyhd8ed1ab_1    conda-forge
 backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
 backports                 1.0                pyhd8ed1ab_3    conda-forge
 backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
 backports.zoneinfo        0.2.1           py310hff52083_7    conda-forge
 bcrypt                    3.2.2           py310h5764c6d_1    conda-forge
 beautifulsoup4            4.12.2             pyha770c72_0    conda-forge
 binutils                  2.39                 hdd6e379_1    conda-forge
 binutils_impl_linux-64    2.39                 he00db2b_1    conda-forge
 binutils_linux-64         2.39                h5fc0e48_13    conda-forge
 blas                      1.0                         mkl    conda-forge
 bleach                    6.0.0              pyhd8ed1ab_0    conda-forge
 blinker                   1.6.2              pyhd8ed1ab_0    conda-forge
 bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
 boto3                     1.26.76            pyhd8ed1ab_0    conda-forge
 botocore                  1.29.76            pyhd8ed1ab_0    conda-forge
 brotlipy                  0.7.0           py310h5764c6d_1005    conda-forge
 bzip2                     1.0.8                h7f98852_4    conda-forge
 c-ares                    1.19.0               hd590300_0    conda-forge
 c-compiler                1.5.2                h0b41bf4_0    conda-forge
 ca-certificates           2023.5.7             hbcca054_0    conda-forge
 cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
 certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
 cffi                      1.15.1          py310h255011f_3    conda-forge
 cfgv                      3.3.1              pyhd8ed1ab_0    conda-forge
 cfn-lint                  0.75.1             pyhd8ed1ab_0    conda-forge
 charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
 click                     8.1.3           unix_pyhd8ed1ab_2    conda-forge
 cloudpickle               2.2.1              pyhd8ed1ab_0    conda-forge
 cmake                     3.26.4               hcfe8598_0    conda-forge
 colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
 comm                      0.1.3              pyhd8ed1ab_0    conda-forge
 commonmark                0.9.1                      py_0    conda-forge
 coverage                  7.2.5           py310h2372a71_0    conda-forge
 cryptography              40.0.2          py310h34c0648_0    conda-forge
 cubinlinker               0.2.2           py310hf09951c_0    rapidsai
 cuda-python               11.8.1          py310h01a121a_2    conda-forge
 cuda-sanitizer-api        11.8.86                       0    nvidia
 cudatoolkit               11.8.0              h37601d7_11    conda-forge
 cudf                      23.6.0                   pypi_0    pypi
 cupy                      12.0.0          py310h9216885_1    conda-forge
 cxx-compiler              1.5.2                hf52228f_0    conda-forge
 cyrus-sasl                2.1.27               h9033bb2_6    conda-forge
 cython                    0.29.34         py310heca2aa9_0    conda-forge
 cytoolz                   0.12.0          py310h5764c6d_1    conda-forge
 dask                      2023.3.2           pyhd8ed1ab_0    conda-forge
 dask-core                 2023.3.2           pyhd8ed1ab_0    conda-forge
 dask-cuda                 23.06.00a       py310_230522_gcf6e9fb_24    rapidsai-nightly
 dask-cudf                 23.6.0                   pypi_0    pypi
 dataclasses               0.8                pyhc8e2a94_3    conda-forge
 datasets                  2.12.0             pyhd8ed1ab_0    conda-forge
 debugpy                   1.6.7           py310heca2aa9_0    conda-forge
 decopatch                 1.4.10             pyhd8ed1ab_0    conda-forge
 decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
 defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
 dill                      0.3.6              pyhd8ed1ab_1    conda-forge
 distlib                   0.3.6              pyhd8ed1ab_0    conda-forge
 distributed               2023.3.2.1         pyhd8ed1ab_0    conda-forge
 distro                    1.8.0              pyhd8ed1ab_0    conda-forge
 dlpack                    0.5                  h9c3ff4c_0    conda-forge
 docker-py                 6.1.0              pyhd8ed1ab_0    conda-forge
 docutils                  0.19            py310hff52083_1    conda-forge
 doxygen                   1.8.20               had0d8f1_0    conda-forge
 ecdsa                     0.18.0             pyhd8ed1ab_1    conda-forge
 entrypoints               0.4                pyhd8ed1ab_0    conda-forge
 exceptiongroup            1.1.1              pyhd8ed1ab_0    conda-forge
 execnet                   1.9.0              pyhd8ed1ab_0    conda-forge
 executing                 1.2.0              pyhd8ed1ab_0    conda-forge
 expat                     2.5.0                hcb278e6_1    conda-forge
 fastavro                  1.7.4           py310h2372a71_0    conda-forge
 fastrlock                 0.8             py310hd8f1fbe_3    conda-forge
 filelock                  3.12.0             pyhd8ed1ab_0    conda-forge
 flask                     2.3.2              pyhd8ed1ab_0    conda-forge
 flask_cors                3.0.10             pyhd3deb0d_0    conda-forge
 flit-core                 3.9.0              pyhd8ed1ab_0    conda-forge
 fmt                       9.1.0                h924138e_0    conda-forge
 freetype                  2.12.1               hca18f0e_1    conda-forge
 frozenlist                1.3.3           py310h5764c6d_0    conda-forge
 fsspec                    2023.5.0           pyh1a96a4e_0    conda-forge
 future                    0.18.3             pyhd8ed1ab_0    conda-forge
 gcc                       11.3.0              h02d0930_13    conda-forge
 gcc_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
 gcc_linux-64              11.3.0              he6f903b_13    conda-forge
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 glog                      0.6.0                h6f12383_0    conda-forge
 gmock                     1.13.0               ha770c72_1    conda-forge
 gmp                       6.2.1                h58526e2_0    conda-forge
 gmpy2                     2.1.2           py310h3ec546c_1    conda-forge
 graphql-core              3.2.3              pyhd8ed1ab_0    conda-forge
 greenlet                  2.0.2           py310hc6cd4ac_1    conda-forge
 gtest                     1.13.0               h00ab1b0_1    conda-forge
 gxx                       11.3.0              h02d0930_13    conda-forge
 gxx_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
 gxx_linux-64              11.3.0              hc203a17_13    conda-forge
 huggingface_hub           0.14.1             pyhd8ed1ab_0    conda-forge
 hypothesis                6.75.3             pyha770c72_0    conda-forge
 identify                  2.5.24             pyhd8ed1ab_0    conda-forge
 idna                      3.4                pyhd8ed1ab_0    conda-forge
 imagesize                 1.4.1              pyhd8ed1ab_0    conda-forge
 importlib-metadata        6.6.0              pyha770c72_0    conda-forge
 importlib_metadata        6.6.0                hd8ed1ab_0    conda-forge
 iniconfig                 2.0.0              pyhd8ed1ab_0    conda-forge
 intel-openmp              2022.1.0          h9e868ea_3769
 ipykernel                 6.23.1             pyh210e3f2_0    conda-forge
 ipython                   8.13.2             pyh41d4057_0    conda-forge
 ipython_genutils          0.2.0                      py_1    conda-forge
 itsdangerous              2.1.2              pyhd8ed1ab_0    conda-forge
 jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
 jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
 jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
 joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
 jschema-to-python         1.2.3              pyhd8ed1ab_0    conda-forge
 jsondiff                  2.0.0              pyhd8ed1ab_0    conda-forge
 jsonpatch                 1.32               pyhd8ed1ab_0    conda-forge
 jsonpickle                2.2.0              pyhd8ed1ab_0    conda-forge
 jsonpointer               2.0                        py_0    conda-forge
 jsonschema                3.2.0              pyhd8ed1ab_3    conda-forge
 junit-xml                 1.9                pyh9f0ad1d_0    conda-forge
 jupyter-cache             0.6.1              pyhd8ed1ab_0    conda-forge
 jupyter_client            8.2.0              pyhd8ed1ab_0    conda-forge
 jupyter_core              5.3.0           py310hff52083_0    conda-forge
 jupyter_events            0.6.3              pyhd8ed1ab_0    conda-forge
 jupyter_server            2.5.0              pyhd8ed1ab_0    conda-forge
 jupyter_server_terminals  0.4.4              pyhd8ed1ab_1    conda-forge
 jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
 kernel-headers_linux-64   3.10.0              h4a8ded7_13    conda-forge
 keyutils                  1.6.1                h166bdaf_0    conda-forge
 krb5                      1.20.1               h81ceb04_0    conda-forge
 lcms2                     2.15                 haa2dc70_1    conda-forge
 ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
 lerc                      4.0.0                h27087fc_0    conda-forge
 libabseil                 20230125.2      cxx17_h59595ed_2    conda-forge
 libarrow                  11.0.0          h6564b11_20_cpu    conda-forge
 libblas                   3.9.0            16_linux64_mkl    conda-forge
 libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
 libbrotlidec              1.0.9                h166bdaf_8    conda-forge
 libbrotlienc              1.0.9                h166bdaf_8    conda-forge
 libcblas                  3.9.0            16_linux64_mkl    conda-forge
 libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
 libcufile                 1.4.0.31                      0    nvidia
 libcufile-dev             1.4.0.31                      0    nvidia
 libcurand                 10.3.0.86                     0    nvidia
 libcurand-dev             10.3.0.86                     0    nvidia
 libcurl                   8.1.0                h409715c_0    conda-forge
 libdeflate                1.18                 h0b41bf4_0    conda-forge
 libedit                   3.1.20191231         he28a2e2_2    conda-forge
 libev                     4.33                 h516909a_1    conda-forge
 libevent                  2.1.12               h3358134_0    conda-forge
 libexpat                  2.5.0                hcb278e6_1    conda-forge
 libffi                    3.4.2                h7f98852_5    conda-forge
 libgcc-devel_linux-64     11.3.0              h210ce93_19    conda-forge
 libgcc-ng                 12.2.0              h65d4601_19    conda-forge
 libgfortran-ng            12.2.0              h69a702a_19    conda-forge
 libgfortran5              12.2.0              h337968e_19    conda-forge
 libgomp                   12.2.0              h65d4601_19    conda-forge
 libgoogle-cloud           2.10.1               hac9eb74_1    conda-forge
 libgrpc                   1.54.2               hb20ce57_2    conda-forge
 libiconv                  1.17                 h166bdaf_0    conda-forge
 libjpeg-turbo             2.1.5.1              h0b41bf4_0    conda-forge
 libkvikio                 23.06.00a       cuda11_230522_g2fbcd33_26    rapidsai-nightly
 liblapack                 3.9.0            16_linux64_mkl    conda-forge
 libllvm11                 11.1.0               he0ac6c6_5    conda-forge
 libnghttp2                1.52.0               h61bc06f_0    conda-forge
 libnsl                    2.0.0                h7f98852_0    conda-forge
 libntlm                   1.4               h7f98852_1002    conda-forge
 libnuma                   2.0.16               h0b41bf4_1    conda-forge
 libpng                    1.6.39               h753d276_0    conda-forge
 libprotobuf               3.21.12              h3eb15da_0    conda-forge
 librdkafka                1.9.2                ha5a0de0_2    conda-forge
 librmm                    23.06.00a       cuda11_230522_gc11ea8a5_19    rapidsai-nightly
 libsanitizer              11.3.0              h239ccf8_19    conda-forge
 libsodium                 1.0.18               h36c2ea0_1    conda-forge
 libsqlite                 3.42.0               h2797004_0    conda-forge
 libssh2                   1.10.0               hf14f497_3    conda-forge
 libstdcxx-devel_linux-64  11.3.0              h210ce93_19    conda-forge
 libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
 libthrift                 0.18.1               h8fd135c_1    conda-forge
 libtiff                   4.5.0                ha587672_6    conda-forge
 libutf8proc               2.8.0                h166bdaf_0    conda-forge
 libuuid                   2.38.1               h0b41bf4_0    conda-forge
 libuv                     1.44.2               h166bdaf_0    conda-forge
 libwebp-base              1.3.0                h0b41bf4_0    conda-forge
 libxcb                    1.15                 h0b41bf4_0    conda-forge
 libzlib                   1.2.13               h166bdaf_4    conda-forge
 livereload                2.6.3              pyh9f0ad1d_0    conda-forge
 llvmlite                  0.39.1          py310h58363a5_1    conda-forge
 locket                    1.0.0              pyhd8ed1ab_0    conda-forge
 lz4                       4.3.2           py310h0cfdcf0_0    conda-forge
 lz4-c                     1.9.4                hcb278e6_0    conda-forge
 makefun                   1.15.1             pyhd8ed1ab_0    conda-forge
 markdown                  3.4.3              pyhd8ed1ab_0    conda-forge
 markdown-it-py            2.2.0              pyhd8ed1ab_0    conda-forge
 markupsafe                2.1.2           py310h1fa729e_0    conda-forge
 matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
 mdit-py-plugins           0.3.5              pyhd8ed1ab_0    conda-forge
 mdurl                     0.1.0              pyhd8ed1ab_0    conda-forge
 mimesis                   10.0.0             pyhd8ed1ab_0    conda-forge
 mistune                   2.0.5              pyhd8ed1ab_0    conda-forge
 mkl                       2022.1.0           hc2b9512_224
 moto                      4.1.10             pyhd8ed1ab_0    conda-forge
 mpc                       1.3.1                hfe3b2da_0    conda-forge
 mpfr                      4.2.0                hb012696_0    conda-forge
 msgpack-python            1.0.5           py310hdf3cbec_0    conda-forge
 multidict                 6.0.4           py310h1fa729e_0    conda-forge
 multiprocess              0.70.14         py310h5764c6d_3    conda-forge
 myst-nb                   0.17.2             pyhd8ed1ab_0    conda-forge
 myst-parser               0.18.1             pyhd8ed1ab_0    conda-forge
 nbclassic                 1.0.0              pyhb4ecaf3_1    conda-forge
 nbclient                  0.7.4              pyhd8ed1ab_0    conda-forge
 nbconvert                 7.2.9              pyhd8ed1ab_0    conda-forge
 nbconvert-core            7.2.9              pyhd8ed1ab_0    conda-forge
 nbconvert-pandoc          7.2.9              pyhd8ed1ab_0    conda-forge
 nbformat                  5.8.0              pyhd8ed1ab_0    conda-forge
 nbsphinx                  0.9.1              pyhd8ed1ab_0    conda-forge
 ncurses                   6.3                  h27087fc_1    conda-forge
 nest-asyncio              1.5.6              pyhd8ed1ab_0    conda-forge
 networkx                  2.8.8              pyhd8ed1ab_0    conda-forge
 ninja                     1.11.1               h924138e_0    conda-forge
 nodeenv                   1.8.0              pyhd8ed1ab_0    conda-forge
 notebook                  6.5.4              pyha770c72_0    conda-forge
 notebook-shim             0.2.3              pyhd8ed1ab_0    conda-forge
 numba                     0.56.4          py310h0e39c9b_1    conda-forge
 numpy                     1.23.5          py310h53a5b5f_0    conda-forge
 numpydoc                  1.5.0              pyhd8ed1ab_0    conda-forge
 nvcc_linux-64             11.8                h41dc85b_22    conda-forge
 nvtx                      0.2.5           py310h1fa729e_0    conda-forge
 openapi-schema-validator  0.2.3              pyhd8ed1ab_0    conda-forge
 openapi-spec-validator    0.4.0              pyhd8ed1ab_1    conda-forge
 openjpeg                  2.5.0                hfec8fc6_2    conda-forge
 openssl                   3.1.0                hd590300_3    conda-forge
 orc                       1.8.3                hfdbbad2_0    conda-forge
 packaging                 23.1               pyhd8ed1ab_0    conda-forge
 pandas                    1.5.3                    pypi_0    pypi
 pandoc                    3.1.2                h32600fe_1    conda-forge
 pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
 paramiko                  3.1.0              pyhd8ed1ab_0    conda-forge
 parquet-cpp               1.5.1                         2    conda-forge
 parso                     0.8.3              pyhd8ed1ab_0    conda-forge
 partd                     1.4.0              pyhd8ed1ab_0    conda-forge
 pbr                       5.11.1             pyhd8ed1ab_0    conda-forge
 pexpect                   4.8.0              pyh1a96a4e_2    conda-forge
 pickleshare               0.7.5                   py_1003    conda-forge
 pillow                    9.5.0           py310h582fbeb_1    conda-forge
 pip                       23.1.2             pyhd8ed1ab_0    conda-forge
 platformdirs              3.5.1              pyhd8ed1ab_0    conda-forge
 pluggy                    1.0.0              pyhd8ed1ab_5    conda-forge
 pooch                     1.7.0              pyha770c72_3    conda-forge
 pre-commit                3.3.2              pyha770c72_0    conda-forge
 prometheus_client         0.16.0             pyhd8ed1ab_0    conda-forge
 prompt-toolkit            3.0.38             pyha770c72_0    conda-forge
 prompt_toolkit            3.0.38               hd8ed1ab_0    conda-forge
 protobuf                  4.21.12         py310heca2aa9_0    conda-forge
 psutil                    5.9.5           py310h1fa729e_0    conda-forge
 pthread-stubs             0.4               h36c2ea0_1001    conda-forge
 ptxcompiler               0.8.1           py310h01a121a_0    conda-forge
 ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
 pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
 py-cpuinfo                9.0.0              pyhd8ed1ab_0    conda-forge
 pyarrow                   11.0.0          py310he6bfd7f_20_cpu    conda-forge
 pyasn1                    0.4.8                      py_0    conda-forge
 pycparser                 2.21               pyhd8ed1ab_0    conda-forge
 pydata-sphinx-theme       0.13.3             pyhd8ed1ab_0    conda-forge
 pygments                  2.15.1             pyhd8ed1ab_0    conda-forge
 pynacl                    1.5.0           py310h5764c6d_2    conda-forge
 pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
 pyopenssl                 23.1.1             pyhd8ed1ab_0    conda-forge
 pyorc                     0.8.0           py310hd52fb3e_4    conda-forge
 pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
 pyrsistent                0.19.3          py310h1fa729e_0    conda-forge
 pysocks                   1.7.1              pyha2e5f31_6    conda-forge
 pytest                    7.3.1              pyhd8ed1ab_0    conda-forge
 pytest-benchmark          4.0.0              pyhd8ed1ab_0    conda-forge
 pytest-cases              3.6.14             pyhd8ed1ab_0    conda-forge
 pytest-cov                4.0.0              pyhd8ed1ab_0    conda-forge
 pytest-xdist              3.3.1              pyhd8ed1ab_0    conda-forge
 python                    3.10.11         he550d4f_0_cpython    conda-forge
 python-confluent-kafka    1.9.2           py310h5764c6d_2    conda-forge
 python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
 python-fastjsonschema     2.17.1             pyhd8ed1ab_0    conda-forge
 python-jose               3.3.0              pyh6c4a22f_1    conda-forge
 python-json-logger        2.0.7              pyhd8ed1ab_0    conda-forge
 python-snappy             0.6.1           py310hcee4d7c_0    conda-forge
 python-xxhash             3.2.0           py310h1fa729e_0    conda-forge
 python_abi                3.10                    3_cp310    conda-forge
 pytorch                   1.11.0             py3.10_cpu_0    pytorch
 pytorch-mutex             1.0                         cpu    pytorch
 pytz                      2023.3             pyhd8ed1ab_0    conda-forge
 pywin32-on-windows        0.1.0              pyh1179c8e_3    conda-forge
 pyyaml                    6.0             py310h5764c6d_5    conda-forge
 pyzmq                     25.0.2          py310h059b190_0    conda-forge
 re2                       2023.03.02           h8c504da_0    conda-forge
 readline                  8.2                  h8228510_1    conda-forge
 recommonmark              0.7.1              pyhd8ed1ab_0    conda-forge
 regex                     2023.5.5        py310h2372a71_0    conda-forge
 requests                  2.31.0             pyhd8ed1ab_0    conda-forge
 responses                 0.18.0             pyhd8ed1ab_0    conda-forge
 rfc3339-validator         0.1.4              pyhd8ed1ab_0    conda-forge
 rfc3986-validator         0.1.1              pyh9f0ad1d_0    conda-forge
 rhash                     1.4.3                h166bdaf_0    conda-forge
 rmm                       23.06.00a       cuda11_py310_230522_gc11ea8a5_19    rapidsai-nightly
 rsa                       4.9                pyhd8ed1ab_0    conda-forge
 s2n                       1.3.44               h06160fa_0    conda-forge
 s3fs                      2023.5.0           pyhd8ed1ab_0    conda-forge
 s3transfer                0.6.1              pyhd8ed1ab_0    conda-forge
 sacremoses                0.0.53             pyhd8ed1ab_0    conda-forge
 sarif-om                  1.0.4              pyhd8ed1ab_0    conda-forge
 scikit-build              0.17.1             pyh56297ac_0    conda-forge
 scipy                     1.10.1          py310ha4c1d20_3    conda-forge
 sed                       4.8                  he412f7d_0    conda-forge
 send2trash                1.8.2              pyh41d4057_0    conda-forge
 setuptools                67.7.2             pyhd8ed1ab_0    conda-forge
 six                       1.16.0             pyh6c4a22f_0    conda-forge
 snappy                    1.1.10               h9fff704_0    conda-forge
 sniffio                   1.3.0              pyhd8ed1ab_0    conda-forge
 snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
 sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
 soupsieve                 2.3.2.post1        pyhd8ed1ab_0    conda-forge
 spdlog                    1.11.0               h9b3ece8_1    conda-forge
 sphinx                    5.3.0              pyhd8ed1ab_0    conda-forge
 sphinx-autobuild          2021.3.14          pyhd8ed1ab_0    conda-forge
 sphinx-copybutton         0.5.2              pyhd8ed1ab_0    conda-forge
 sphinx-markdown-tables    0.0.17             pyh6c4a22f_0    conda-forge
 sphinxcontrib-applehelp   1.0.4              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
 sphinxcontrib-htmlhelp    2.0.1              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
 sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
 sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_2    conda-forge
 sphinxcontrib-websupport  1.2.4              pyhd8ed1ab_1    conda-forge
 sqlalchemy                2.0.15          py310h2372a71_0    conda-forge
 sshpubkeys                3.3.1              pyhd8ed1ab_0    conda-forge
 stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
 streamz                   0.6.4              pyh6c4a22f_0    conda-forge
 sysroot_linux-64          2.17                h4a8ded7_13    conda-forge
 tabulate                  0.9.0              pyhd8ed1ab_1    conda-forge
 tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
 terminado                 0.17.1             pyh41d4057_0    conda-forge
 tinycss2                  1.2.1              pyhd8ed1ab_0    conda-forge
 tk                        8.6.12               h27826a3_0    conda-forge
 tokenizers                0.13.1          py310h633acb5_2    conda-forge
 toml                      0.10.2             pyhd8ed1ab_0    conda-forge
 tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
 toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
 tornado                   6.3.2           py310h2372a71_0    conda-forge
 tqdm                      4.65.0             pyhd8ed1ab_1    conda-forge
 traitlets                 5.9.0              pyhd8ed1ab_0    conda-forge
 transformers              4.24.0             pyhd8ed1ab_0    conda-forge
 typing-extensions         4.5.0                hd8ed1ab_0    conda-forge
 typing_extensions         4.5.0              pyha770c72_0    conda-forge
 tzdata                    2023.3                   pypi_0    pypi
 ucx                       1.14.1               h8c404fb_0    conda-forge
 ukkonen                   1.0.1           py310hbf28c38_3    conda-forge
 urllib3                   1.26.15            pyhd8ed1ab_0    conda-forge
 virtualenv                20.23.0            pyhd8ed1ab_0    conda-forge
 wcwidth                   0.2.6              pyhd8ed1ab_0    conda-forge
 webencodings              0.5.1                      py_1    conda-forge
 websocket-client          1.5.2              pyhd8ed1ab_0    conda-forge
 werkzeug                  2.3.4              pyhd8ed1ab_0    conda-forge
 wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
 wrapt                     1.15.0          py310h1fa729e_0    conda-forge
 xmltodict                 0.13.0             pyhd8ed1ab_0    conda-forge
 xorg-libxau               1.0.11               hd590300_0    conda-forge
 xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
 xxhash                    0.8.1                h0b41bf4_0    conda-forge
 xz                        5.2.6                h166bdaf_0    conda-forge
 yaml                      0.2.5                h7f98852_2    conda-forge
 yarl                      1.9.1           py310h2372a71_0    conda-forge
 zeromq                    4.3.4                h9c3ff4c_1    conda-forge
 zict                      3.0.0              pyhd8ed1ab_0    conda-forge
 zipp                      3.15.0             pyhd8ed1ab_0    conda-forge
 zlib                      1.2.13               h166bdaf_4    conda-forge
 zstd                      1.5.2                h3eb15da_6    conda-forge


@galipremsagar galipremsagar added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels May 22, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 7, 2023
@mhaseeb123
Member

Investigating this. Some insights:

  1. As far as the Parquet reader and writer are concerned, the behavior seen above is correct. The actual problem seems to be in the df.to_arrow() function, which converts timedelta64[unit] to the duration[unit] dtype instead of the correct time32[unit] or time64[unit] dtypes.
  2. pyarrow.duration() is indeed stored as int64, which is what shows up when the Parquet file is read back.
  3. We see the same behavior when we start with a pa.Table and convert it to a cudf.DataFrame. The cudf.DataFrame.from_arrow() function fails if the column type in the pa.Table is time64[unit] instead of duration[unit].
  4. Note that time64[s] and time64[ms] are not valid types in pyarrow, as seconds and milliseconds are written as int32 in pyarrow as well as libcudf. We should update the label names to reflect that as well.

Summary: Seems like the issue is in our to_arrow() and from_arrow() functions which I will investigate more.

Note this interesting example

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from io import BytesIO
import cudf

times = pa.array(
    [1234, 3456, 32442], type="duration[ms]"
)  # setting type to "time32[ms]", "time64[us]", etc. makes cudf.DataFrame.from_arrow fail
names = ["s"]
pa_table = pa.Table.from_arrays([times], names=names)
buf = BytesIO()

pq.write_table(pa_table, buf)
df2 = cudf.read_parquet(buf)
df3 = pq.read_table(buf)

print("Original table (pa)", pa_table, pa_table["s"].type)
print("cudf read parquet", df2, df2["s"].dtype)
print("pyarrow read parquet", df3, df3["s"].type)

df = cudf.DataFrame.from_arrow(
    pa_table
)  # using "time32[ms]", "time64[us]", etc. in pa_table makes cudf.DataFrame.from_arrow fail
buf2 = BytesIO()
df.to_parquet(buf2)
df4 = cudf.read_parquet(buf2)  # read back the file written by cudf, not the pyarrow one
df5 = pq.read_table(buf2)

print("from_arrow table (cudf)", df, df["s"].dtype)
print("cudf read parquet", df4, df4["s"].dtype)
print("pyarrow read parquet", df5, df5["s"].type)

@mhaseeb123
Member

mhaseeb123 commented Mar 22, 2024

Updates:

  1. Our to_arrow and from_arrow functions are working as expected and correctly convert between cudf:timedelta64 and pyarrow:duration.
  2. (Detailed in [BUG] Unable to write timedelta64[s] type correctly with parquet writer #13409): Arrow encodes duration as plain int64 in Parquet instead of TimeType. It also adds a serialized Arrow schema to the Parquet file, which is used to correctly convert int64 back to the duration type when needed: [C++][Parquet] Support DurationType in writing/reading parquet apache/arrow#23117 and ARROW-6780: [C++][Parquet] Support DurationType in writing/reading parquet (written as int64) apache/arrow#12449.
  3. Without something similar, we can't directly encode/decode an interoperable duration type in cudf.
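
To make point 2 concrete: once a duration column is stored as plain int64, only the unit recorded elsewhere (here, in the serialized Arrow schema) tells a reader how to interpret the integers. Below is a minimal standard-library sketch of that reinterpretation. The helper name `as_durations` and the unit table are illustrative assumptions, not part of cudf or pyarrow.

```python
from datetime import timedelta

# Hypothetical mapping from an Arrow-style unit string to a timedelta keyword.
# Nanoseconds are left out because datetime.timedelta has no nanosecond field.
UNIT_TO_KWARG = {"s": "seconds", "ms": "milliseconds", "us": "microseconds"}


def as_durations(raw_values, unit):
    """Reinterpret raw integers as durations expressed in `unit`."""
    kwarg = UNIT_TO_KWARG[unit]
    return [timedelta(**{kwarg: value}) for value in raw_values]


# What a reader sees when it ignores the schema hint: bare integers.
raw = [1234, 3456, 32442]

# With the unit restored, the same integers become durations again.
for value, duration in zip(raw, as_durations(raw, "ms")):
    print(value, "->", duration)
```

The integers themselves never change. What the int64 result from `cudf.read_parquet` lacks is the unit, which is why some schema side channel is needed.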

@mhaseeb123 mhaseeb123 self-assigned this Apr 29, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress and removed 0 - Backlog In queue waiting for assignment labels Apr 29, 2024
rapids-bot bot pushed a commit that referenced this issue May 15, 2024

This PR adds support for reading and using the `arrow:schema` struct from the serialized `arrow:ipc` message written to the key-value metadata section of the Parquet file under the `ARROW:schema` key. This allows cudf to read and interoperate with Arrow for non-standard Parquet types (`DurationType` in this PR).

Arrow uses Google flatbuffers (inside Schema.fbs) to serialize the `arrow:Schema` structure (containing column descriptors) and puts it (padded for 8-byte alignment) into the header of an empty `ipc:Message` (also a flatbuffer-serialized structure inside Message.fbs). The `ipc:Message` is prepended with two integers containing a `validity` message and the `size of the header` (the `arrow:Schema` + padding). The final message is encoded as a base64 string and written to the Parquet file footer key-value metadata using the `"ARROW:schema"` key.

In this PR, we base64-decode the `ipc:Message`, decode the `validity` message and the header size, and offset the pointers to the `arrow:Schema` flatbuffer. We then use Flatbuffer structs to walk the `arrow:Schema` and collect information on the columns of interest in an unordered_map (keyed by column name). This unordered_map is used inside the `select_columns` function to build cudf Table columns and get the correct `dtype`.
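
The framing described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the libcudf implementation: the function names are hypothetical, the header is a placeholder byte string and not a real flatbuffer, and the first integer is assumed to be the `0xFFFFFFFF` continuation marker from the Arrow IPC format, with both integers being 32-bit little-endian.

```python
import base64
import struct

# Assumed marker value for a valid message (Arrow IPC continuation indicator).
VALID_MARKER = 0xFFFFFFFF
PREFIX = struct.Struct("<Ii")  # marker (uint32) + header size (int32), little-endian


def encode_message(header: bytes) -> str:
    """Pad the header to 8 bytes, prepend the two integers, base64-encode."""
    padded = header + b"\x00" * (-len(header) % 8)
    return base64.b64encode(PREFIX.pack(VALID_MARKER, len(padded)) + padded).decode("ascii")


def decode_message(value: str) -> bytes:
    """Reverse encode_message and return the padded header bytes."""
    raw = base64.b64decode(value)
    marker, size = PREFIX.unpack_from(raw, 0)
    if marker != VALID_MARKER:
        raise ValueError("missing validity marker")
    header = raw[PREFIX.size:PREFIX.size + size]
    if len(header) != size:
        raise ValueError("header is shorter than the declared size")
    return header  # a real reader would now walk this as a flatbuffer


# Placeholder bytes stand in for the flatbuffer-serialized schema.
encoded = encode_message(b"placeholder-schema")
print(len(decode_message(encoded)))
```

A real reader would hand the returned bytes to flatbuffer accessors to walk the schema; that step is omitted here because it needs the generated Schema.fbs and Message.fbs bindings.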

Closes #13410

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #15617