Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] pyorc does not read string column statistics of cuDF generated files #9313

Open
1 of 2 tasks
devavret opened this issue Sep 27, 2021 · 4 comments
Open
1 of 2 tasks
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@devavret
Copy link
Contributor

devavret commented Sep 27, 2021

When reading the statistics for an ORC file written by cuDF, the result for sum is wrong when read using cuDF and absent when using pyorc.

In [1]: import cudf

In [2]: import pyorc

In [3]: gdf = cudf.DataFrame({'b':[1,7], 'a':['Badam khao', 'roz']})

In [4]: gdf.to_orc("temp.orc")

In [5]: cudf.io.orc.read_orc_statistics(["temp.orc"])
Out[5]: 
([{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}],
 [{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}])

In [6]: f = open("temp.orc", 'rb')

In [7]: r = pyorc.Reader(f)

In [8]: r[1].statistics
Out[8]: 
{'has_null': False,
 'number_of_values': 2,
 'minimum': 1,
 'maximum': 7,
 'sum': 8,
 'kind': <TypeKind.LONG: 4>}

In [9]: r[2].statistics
Out[9]: {'has_null': False, 'number_of_values': 2, 'kind': <TypeKind.STRING: 7>}

Expected result

Sum statistics contains the sum of lengths of all the strings in the column. We do correctly compute this in libcudf, so it should be present when reading with pyorc and correct when reading with cudf.

There's two issues here:

@devavret devavret added bug Something isn't working cuIO cuIO issue labels Sep 27, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added the good first issue Good for newcomers label Jun 28, 2022
@vuule vuule added this to Issue-Needs prioritizing in v22.10 Release via automation Aug 2, 2022
@vuule vuule moved this from Issue-Needs prioritizing to Issue-P0 in v22.10 Release Aug 2, 2022
@vuule
Copy link
Contributor

vuule commented Aug 2, 2022

This is a correctness issue, prioritizing for 22.10.

@vuule vuule self-assigned this Aug 3, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 22, 2022
Issue #9313
The root cause is that the sum value was encoded as an unsigned int. ORC specs show that the value should be encoded as signed.
Because both encode and decode where assuming unsigned encoding, the existing C++ test (OrcStatisticsTest, Basic) was passing even without this fix. Added a Python test that uses a different decode method, so it fails without the fix.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #11740
@vuule vuule removed this from Issue-P0 in v22.10 Release Sep 23, 2022
@vuule vuule added this to Issue-Needs prioritizing in v22.12 Release via automation Sep 23, 2022
@vuule
Copy link
Contributor

vuule commented Sep 26, 2022

Additional info related to the pyorc part of the issue: Spark is able to read ORC string column statistics, and uses them for predicate based filtering.

@vuule vuule moved this from Issue-Needs prioritizing to Issue-P2 in v22.12 Release Sep 26, 2022
@vuule vuule removed inactive-30d good first issue Good for newcomers labels Sep 26, 2022
@vuule vuule removed their assignment Sep 27, 2022
@vuule vuule changed the title [BUG] ORC string sum statistics are wrong [BUG] pyorc does not read string column statistics of cuDF generated files Sep 27, 2022
@vuule vuule removed this from Issue-P2 in v22.12 Release Nov 16, 2022
@vuule vuule added this to Issue-Needs prioritizing in v23.02 Release via automation Nov 16, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
No open projects
v23.02 Release
  
Issue-Needs prioritizing
Development

No branches or pull requests

3 participants