
[BUG] ORC read/write is incompatible with Spark in some corner case(s) #11525

Closed
ttnghia opened this issue Aug 12, 2022 · 12 comments · Fixed by #11586 or #11699

Labels: bug (Something isn't working), cuIO (cuIO issue)

ttnghia (Contributor) commented Aug 12, 2022

When reading or writing timestamp data in the ORC file format, the following incompatibilities were discovered. For an input column containing a single row with the timestamp value 1839-12-24 03:58:55.000826:

Write in Spark + read in libcudf:

auto const in_opts = cudf_io::orc_reader_options::builder(cudf_io::source_info{filepath})
                         .use_index(false)
                         .timestamp_type(cudf::data_type(cudf::type_id::TIMESTAMP_MICROSECONDS))
                         .build();
auto const& table = cudf_io::read_orc(in_opts).tbl;
auto const a = table->get_column(0);
cudf::test::print(a.view());

Result: 1839-12-24T03:58:54.000826Z

Write in libcudf + read in Spark:

+--------------------------+
|_col0                     |
+--------------------------+
|1839-12-24 03:58:56.000826|
+--------------------------+

In particular, the seconds value is wrong: the input has 55 seconds, but it comes back as 54 in one direction and 56 in the other.

Note that round-tripping this value entirely within libcudf (both write and read) returns the input unchanged.
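
For context, ORC stores each timestamp as whole seconds (relative to an epoch) plus a separate nanosecond field, so a writer has to split the value in two, and there are two natural splits that disagree for pre-epoch values. Below is a minimal Python sketch of that ambiguity (illustration only, not libcudf's code):

from datetime import datetime, timedelta, timezone

# The reported value, as int64 microseconds since the Unix epoch (negative).
ts = datetime(1839, 12, 24, 3, 58, 55, 826, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
micros = (ts - epoch) // timedelta(microseconds=1)  # -4103121664999174

# Split 1: floor division. Seconds round toward -infinity and the
# sub-second remainder stays positive.
sec_floor = micros // 1_000_000               # -4103121665
nanos_floor = (micros % 1_000_000) * 1_000    # 826000

# Split 2: truncating (C-style) division. Seconds round toward zero, i.e.
# one second *higher* for negative values, and the remainder goes negative.
sec_trunc = -(-micros // 1_000_000)                      # -4103121664
nanos_trunc = (micros - sec_trunc * 1_000_000) * 1_000   # -999174000

A reader that assumes one split while the writer used the other ends up off by exactly one second, which matches the 03:58:54 / 03:58:56 results above.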

@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify labels Aug 12, 2022
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Aug 12, 2022
@ttnghia ttnghia changed the title [BUG] ORC read/write is incompatible with Spark at some corner case(s) [BUG] ORC read/write is incompatible with Spark in some corner case(s) Aug 12, 2022
@ttnghia ttnghia added the cuIO cuIO issue label Aug 13, 2022
ttnghia (Contributor, Author) commented Aug 16, 2022

Other data that can reproduce the issue:

1930-12-24 03:58:55.000826
1950-12-24 03:58:55.000826

sameerz (Contributor) commented Aug 16, 2022

@ttnghia can you add the output resulting from these values?

Other data that can reproduce the issue:

1930-12-24 03:58:55.000826
1950-12-24 03:58:55.000826

ttnghia (Contributor, Author) commented Aug 16, 2022

If these timestamps are written by Spark on the CPU, the timestamps read back in cudf always show ... 03:58:54 ... (the seconds value is 54 instead of 55).

GregoryKimball (Contributor) commented Aug 17, 2022

Thanks @ttnghia, it looks like we can reproduce the same issue from the Python side with pyarrow.orc as well.

import cudf
import pyarrow.orc as orc
import pyarrow as pa
import pandas as pd

def output_orc(df, path):
    table = pa.Table.from_pandas(df, preserve_index=False)
    orc.write_table(table, path)

ref = cudf.DataFrame({'a': [pd.Timestamp('1839-12-24 03:58:55.000826')]})

ref.to_orc('cudf.orc')
df = cudf.read_orc('cudf.orc')
print('cudf write, cudf read', df['a'][0])
df = pd.read_orc('cudf.orc')
print('cudf write, pd read', df['a'][0])

output_orc(ref.to_pandas(), 'pyorc.orc')
df = cudf.read_orc('pyorc.orc')
print('pa write, cudf read', df['a'][0])
df = pd.read_orc('pyorc.orc')
print('pa write, pd read', df['a'][0])

Output:

cudf write, cudf read: 1839-12-24T03:58:55.000826000
cudf write, pd read: 1839-12-24 03:58:56.000826
pa write, cudf read: 1839-12-24T03:58:54.000826000
pa write, pd read: 1839-12-24 03:58:55.000826

Also, this does not seem to affect Parquet 🤔

GregoryKimball (Contributor) commented Aug 17, 2022

This error appears to affect all timestamps before the start of the UTC epoch in 1970 whose sub-second microsecond count is between 0 and 1000 (see the sketch after the table below).

result           timestamp string           timestamp(us) as int64

error by 1.0 sec 1808-02-13 14:13:43.000764 -5108521576999236
OK               1811-05-06 13:47:37.001848 -5006743942998152
error by 1.0 sec 1821-04-15 14:27:47.000786 -4692936732999214
error by 1.0 sec 1834-12-27 22:29:54.000112 -4260562205999888
OK               1886-02-14 06:30:28.001961 -2646926971998039
OK               1922-01-14 21:26:17.001589 -1513564422998411
error by 1.0 sec 1930-11-15 00:35:55.000082 -1234826644999918
OK               1941-09-20 21:13:19.001565 -892435600998435
OK               1944-02-24 22:41:32.001253 -815793507998747
error by 1.0 sec 1951-05-01 18:13:25.000087 -589182394999913
error by 1.0 sec 1965-12-17 06:22:31.000880 -127503448999120
OK               1967-09-07 13:28:38.001946 -73132281998054
OK               1970-04-01 04:42:13.000171 7792933000171
OK               1991-12-02 13:21:14.001617 691680074001617
OK               2004-11-01 01:01:47.001943 1099270907001943
OK               2014-09-06 12:26:14.000901 1410006374000901
OK               2023-10-14 02:04:12.000802 1697249052000802
OK               2027-07-24 21:26:55.000092 1816464415000092
OK               2038-07-10 12:41:41.001826 2162378501001826
OK               2042-08-04 13:01:13.000769 2290770073000769
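
A quick sketch of the pattern in the table as a Python predicate (the boundary values are inferred from the samples above, not taken from the ORC code):

def looks_affected(micros_since_epoch: int) -> bool:
    """Match the failing pattern above: pre-1970 timestamps whose
    sub-second part is below one millisecond (1..999 microseconds)."""
    sub_second_us = micros_since_epoch % 1_000_000  # Python %: result >= 0
    return micros_since_epoch < 0 and 0 < sub_second_us < 1000

# e.g. looks_affected(-4260562205999888) -> True  (the 1834-12-27 row)
#      looks_affected(-2646926971998039) -> False (the 1886-02-14 row)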

vuule (Contributor) commented Aug 23, 2022

Probably related: #5529 (comment)

vuule (Contributor) commented Aug 24, 2022

Looks like the comment above is not directly related to the root cause.
The difference between https://github.com/apache/orc/blob/fa9c011e13e8376d2a185bd76af834bd644f4332/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1221-L1227 and our ORC reader (the apache reader's logic works with milliseconds rather than seconds) might be. Looking further into this.

rapids-bot pushed a commit that referenced this issue Sep 13, 2022
Fixes #11525

Contains a chain of fixes:

1. Allow negative nanoseconds in negative timestamps - aligns the writer with pyorc;
2. Limit the seconds adjustment to positive nanoseconds - fixes the off-by-one issue reported in #11525 (see the sketch below);
3. Fix the decoding of large uint64_t values (larger than max `int64_t`) - fixes reading of cuDF-encoded negative nanoseconds;
4. Avoid mode 2 encoding when the base value is larger than max `int64_t` - follows the spec and fixes reading of negative nanoseconds by non-cuDF readers.
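
A minimal sketch of one reading of fix 2 (an interpretation, not the actual libcudf kernel): the reader rebuilds a timestamp from ORC's seconds-plus-nanoseconds split and borrows one second only when a negative timestamp carries positive nanoseconds:

def combine(seconds: int, nanos: int) -> int:
    """Rebuild nanoseconds-since-epoch from ORC's seconds + nanos split."""
    if seconds < 0 and nanos > 0:
        # Positive nanos count up from the *lower* whole second, so the
        # truncated (toward-zero) seconds value must be shifted down by one.
        seconds -= 1
    return seconds * 1_000_000_000 + nanos

# The 1839-12-24 03:58:55.000826 example, split both ways, decodes the same:
assert combine(-4103121664, 826_000) == combine(-4103121664, -999_174_000)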

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #11586
Bug Squashing automation moved this from Needs prioritizing to Closed Sep 13, 2022
ttnghia (Contributor, Author) commented Sep 13, 2022

Sorry guys. Our tests just discovered new failing cases for these values:

1647-05-20 19:25:03.000638
1846-7-2 21:3:40.000508
1626-11-11 21:9:20.000733

I'm not sure if this is still the old bug or a new one.

Edit: Added more cases.

@ttnghia ttnghia reopened this Sep 13, 2022
Bug Squashing automation moved this from Closed to Needs prioritizing Sep 13, 2022
vuule (Contributor) commented Sep 13, 2022

Sorry guys. Our tests just discovered new failed cases for these values:

1647-05-20 19:25:03.000638
1846-7-2 21:3:40.000508

I'm not sure if this is still the old bug or a new one.

What do the timestamps in the comment represent? Two failing cases, the correct timestamp and the cudf result, or something else?

ttnghia (Contributor, Author) commented Sep 13, 2022

If these values are written by Spark on the CPU and then read by cudf, the output differs from the input (off by one second). For example:

cpu = datetime.datetime(1647, 5, 20, 19, 25, 3, 638)
gpu = datetime.datetime(1647, 5, 20, 19, 25, 2, 638)

Note that 638 here is in microseconds, so it should be written as 000638 when testing.

vuule (Contributor) commented Sep 13, 2022

No repro so far; all three timestamps are read correctly with cuDF. @ttnghia can you please share more detailed repro instructions?

ttnghia (Contributor, Author) commented Sep 13, 2022

Reproducing:

Write in Spark (spark-shell):

scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> val df = Seq(Timestamp.valueOf("1647-5-20 19:25:3.000638")).toDF("v")
df: org.apache.spark.sql.DataFrame = [v: timestamp]

scala> df.coalesce(1).write.mode("overwrite").orc("/home/nghiat/Devel/tmp/ts.orc")

Read in cudf:

auto const filepath =
  "/home/nghiat/Devel/tmp/ts.orc/"
  "part-00000-32a8c643-dcaf-40ef-b3eb-7e335d491bf1-c000.snappy.orc";
auto const in_opts = cudf_io::orc_reader_options::builder(cudf_io::source_info{filepath})
                       .use_index(false)
                       .timestamp_type(cudf::data_type(cudf::type_id::TIMESTAMP_MICROSECONDS))
                       .build();
auto const& table = cudf_io::read_orc(in_opts).tbl;
auto const a      = table->get_column(0);
cudf::test::print(a.view());


Result: 1647-05-20T19:25:02.000638Z

Bug Squashing automation moved this from Needs prioritizing to Closed Sep 14, 2022
rapids-bot pushed a commit that referenced this issue Sep 14, 2022
…or (#11699)

Closes #11525
Not sure why, but the apache Java ORC reader does the following when reading negative timestamps: https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1284-L1285
This detail does not impact the cuDF and pyorc writers (reading cuDF files with the apache reader already works) because these libraries write negative timestamps with negative nanoseconds.

This PR modifies the ORC reader behavior to match the apache reader, so that cuDF correctly reads ORC files written by the apache writer.
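
For reference, a quick way to re-check one of the originally reported values from Python (a sketch that assumes a cuDF build containing both #11586 and #11699; the file name is arbitrary):

import cudf
import pandas as pd

# Write the value with a non-cuDF (pyarrow-based) ORC writer, then read it
# back with cuDF; with the fixes in place it should come back unchanged.
ref = pd.DataFrame({"v": [pd.Timestamp("1839-12-24 03:58:55.000826")]})
ref.to_orc("ts_check.orc")
print(cudf.read_orc("ts_check.orc")["v"][0])  # expect 1839-12-24 03:58:55.000826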

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Elias Stehle (https://github.com/elstehle)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11699
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024