
[BUG] ORC read/write is incompatible with Spark in some corner case(s) #11525

Closed
ttnghia opened this issue Aug 12, 2022 · 12 comments · Fixed by #11586 or #11699

Labels: bug (Something isn't working), cuIO (cuIO issue)

ttnghia (Contributor) commented Aug 12, 2022

When reading or writing timestamp data in the ORC file format, the following incompatibilities were discovered. For an input column containing a single row with the timestamp value 1839-12-24 03:58:55.000826:

Write in Spark + read in libcudf:

auto const in_opts = cudf_io::orc_reader_options::builder(cudf_io::source_info{filepath})
                         .use_index(false)
                         .timestamp_type(cudf::data_type(cudf::type_id::TIMESTAMP_MICROSECONDS))
                         .build();
auto const& table = cudf_io::read_orc(in_opts).tbl;
auto const a = table->get_column(0);
cudf::test::print(a.view());

Result: 1839-12-24T03:58:54.000826Z

Write in libcudf + read in Spark:

+--------------------------+
|_col0                     |
+--------------------------+
|1839-12-24 03:58:56.000826|
+--------------------------+

In particular, the seconds value is wrong: the input has 55 seconds, but it comes back as 54 in one direction and 56 in the other.

Note that round-tripping this value entirely within libcudf (both write and read) returns the input unchanged.
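
For context, ORC stores each timestamp as whole seconds (relative to an epoch) plus a separate nanosecond field, so a writer has to split the value in two, and there are two natural splits that disagree for pre-epoch values. Below is a minimal Python sketch of that ambiguity (illustration only, not libcudf's code):

from datetime import datetime, timedelta, timezone

# The reported value, as int64 microseconds since the Unix epoch (negative).
ts = datetime(1839, 12, 24, 3, 58, 55, 826, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
micros = (ts - epoch) // timedelta(microseconds=1)  # -4103121664999174

# Split 1: floor division. Seconds round toward -infinity and the
# sub-second remainder stays positive.
sec_floor = micros // 1_000_000               # -4103121665
nanos_floor = (micros % 1_000_000) * 1_000    # 826000

# Split 2: truncating (C-style) division. Seconds round toward zero, i.e.
# one second *higher* for negative values, and the remainder goes negative.
sec_trunc = -(-micros // 1_000_000)                      # -4103121664
nanos_trunc = (micros - sec_trunc * 1_000_000) * 1_000   # -999174000

A reader that assumes one split while the writer used the other ends up off by exactly one second, which matches the 03:58:54 / 03:58:56 results above.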

@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify labels Aug 12, 2022
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Aug 12, 2022
@ttnghia ttnghia changed the title [BUG] ORC read/write is incompatible with Spark at some corner case(s) [BUG] ORC read/write is incompatible with Spark in some corner case(s) Aug 12, 2022
@ttnghia ttnghia added the cuIO cuIO issue label Aug 13, 2022
ttnghia (Contributor, Author) commented Aug 16, 2022

Other data that can reproduce the issue:

1930-12-24 03:58:55.000826
1950-12-24 03:58:55.000826

sameerz (Contributor) commented Aug 16, 2022

@ttnghia can you add the output resulting from these values?

Other data that can reproduce the issue:

1930-12-24 03:58:55.000826
1950-12-24 03:58:55.000826

ttnghia (Contributor, Author) commented Aug 16, 2022

If these timestamps are written by Spark on the CPU, the timestamps read back in cudf always show ... 03:58:54 ... (the seconds value is 54 instead of 55).

GregoryKimball (Contributor) commented Aug 17, 2022

Thanks @ttnghia, it looks like we can reproduce the same issue from the Python side with pyarrow.orc as well.

import cudf
import pyarrow.orc as orc
import pyarrow as pa
import pandas as pd

def output_orc(df, path):
    table = pa.Table.from_pandas(df, preserve_index=False)
    orc.write_table(table, path)

ref = cudf.DataFrame({'a': [pd.Timestamp('1839-12-24 03:58:55.000826')]})

ref.to_orc('cudf.orc')
df = cudf.read_orc('cudf.orc')
print('cudf write, cudf read', df['a'][0])
df = pd.read_orc('cudf.orc')
print('cudf write, pd read', df['a'][0])

output_orc(ref.to_pandas(), 'pyorc.orc')
df = cudf.read_orc('pyorc.orc')
print('pa write, cudf read', df['a'][0])
df = pd.read_orc('pyorc.orc')
print('pa write, pd read', df['a'][0])

Output:

cudf write, cudf read: 1839-12-24T03:58:55.000826000
cudf write, pd read: 1839-12-24 03:58:56.000826
pa write, cudf read: 1839-12-24T03:58:54.000826000
pa write, pd read: 1839-12-24 03:58:55.000826

Also, this does not seem to affect Parquet 🤔

GregoryKimball (Contributor) commented Aug 17, 2022

This error appears to affect all timestamps before the start of the UTC epoch in 1970 whose sub-second microsecond count is between 0 and 1000 (see the sketch after the table below).

result           timestamp string           timestamp(us) as int64

error by 1.0 sec 1808-02-13 14:13:43.000764 -5108521576999236
OK               1811-05-06 13:47:37.001848 -5006743942998152
error by 1.0 sec 1821-04-15 14:27:47.000786 -4692936732999214
error by 1.0 sec 1834-12-27 22:29:54.000112 -4260562205999888
OK               1886-02-14 06:30:28.001961 -2646926971998039
OK               1922-01-14 21:26:17.001589 -1513564422998411
error by 1.0 sec 1930-11-15 00:35:55.000082 -1234826644999918
OK               1941-09-20 21:13:19.001565 -892435600998435
OK               1944-02-24 22:41:32.001253 -815793507998747
error by 1.0 sec 1951-05-01 18:13:25.000087 -589182394999913
error by 1.0 sec 1965-12-17 06:22:31.000880 -127503448999120
OK               1967-09-07 13:28:38.001946 -73132281998054
OK               1970-04-01 04:42:13.000171 7792933000171
OK               1991-12-02 13:21:14.001617 691680074001617
OK               2004-11-01 01:01:47.001943 1099270907001943
OK               2014-09-06 12:26:14.000901 1410006374000901
OK               2023-10-14 02:04:12.000802 1697249052000802
OK               2027-07-24 21:26:55.000092 1816464415000092
OK               2038-07-10 12:41:41.001826 2162378501001826
OK               2042-08-04 13:01:13.000769 2290770073000769
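
A quick sketch of the pattern in the table as a Python predicate (the boundary values are inferred from the samples above, not taken from the ORC code):

def looks_affected(micros_since_epoch: int) -> bool:
    """Match the failing pattern above: pre-1970 timestamps whose
    sub-second part is below one millisecond (1..999 microseconds)."""
    sub_second_us = micros_since_epoch % 1_000_000  # Python %: result >= 0
    return micros_since_epoch < 0 and 0 < sub_second_us < 1000

# e.g. looks_affected(-4260562205999888) -> True  (the 1834-12-27 row)
#      looks_affected(-2646926971998039) -> False (the 1886-02-14 row)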

vuule (Contributor) commented Aug 23, 2022

Probably related: #5529 (comment)

vuule (Contributor) commented Aug 24, 2022

Looks like the comment above is not directly related to the root cause.
The difference between https://github.com/apache/orc/blob/fa9c011e13e8376d2a185bd76af834bd644f4332/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1221-L1227 and our ORC reader (the apache reader's logic works with milliseconds rather than seconds) might be. Looking further into this.

rapids-bot pushed a commit that referenced this issue Sep 13, 2022
Fixes #11525

Contains a chain of fixes:

1. Allow negative nanoseconds in negative timestamps - aligns the writer with pyorc;
2. Limit the seconds adjustment to positive nanoseconds - fixes the off-by-one issue reported in #11525 (see the sketch below);
3. Fix the decoding of large uint64_t values (larger than max `int64_t`) - fixes reading of cuDF-encoded negative nanoseconds;
4. Avoid mode 2 encoding when the base value is larger than max `int64_t` - follows the spec and fixes reading of negative nanoseconds by non-cuDF readers.
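
A minimal sketch of one reading of fix 2 (an interpretation, not the actual libcudf kernel): the reader rebuilds a timestamp from ORC's seconds-plus-nanoseconds split and borrows one second only when a negative timestamp carries positive nanoseconds:

def combine(seconds: int, nanos: int) -> int:
    """Rebuild nanoseconds-since-epoch from ORC's seconds + nanos split."""
    if seconds < 0 and nanos > 0:
        # Positive nanos count up from the *lower* whole second, so the
        # truncated (toward-zero) seconds value must be shifted down by one.
        seconds -= 1
    return seconds * 1_000_000_000 + nanos

# The 1839-12-24 03:58:55.000826 example, split both ways, decodes the same:
assert combine(-4103121664, 826_000) == combine(-4103121664, -999_174_000)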

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #11586
Bug Squashing automation moved this from Needs prioritizing to Closed Sep 13, 2022
ttnghia (Contributor, Author) commented Sep 13, 2022

Sorry guys. Our tests just discovered new failing cases for these values:

1647-05-20 19:25:03.000638
1846-7-2 21:3:40.000508
1626-11-11 21:9:20.000733

I'm not sure if this is still the old bug or a new one.

Edit: Added more cases.

@ttnghia ttnghia reopened this Sep 13, 2022
Bug Squashing automation moved this from Closed to Needs prioritizing Sep 13, 2022
vuule (Contributor) commented Sep 13, 2022

Sorry guys. Our tests just discovered new failed cases for these values:

1647-05-20 19:25:03.000638
1846-7-2 21:3:40.000508

I'm not sure if this is still the old bug or a new one.

What do the timestamps in the comment represent? Two failing cases, the correct timestamp and the cudf result, or something else?

ttnghia (Contributor, Author) commented Sep 13, 2022

If these values are written by Spark on the CPU and then read by cudf, the output differs from the input (off by one second). For example:

cpu = datetime.datetime(1647, 5, 20, 19, 25, 3, 638)
gpu = datetime.datetime(1647, 5, 20, 19, 25, 2, 638)

Note that 638 here is in microseconds, so it should be written as 000638 when testing.

vuule (Contributor) commented Sep 13, 2022

No repro so far; all three timestamps are read correctly with cuDF. @ttnghia can you please share more detailed repro instructions?

ttnghia (Contributor, Author) commented Sep 13, 2022

Reproducing:

Write in Spark (spark-shell):

scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> val df = Seq(Timestamp.valueOf("1647-5-20 19:25:3.000638")).toDF("v")
df: org.apache.spark.sql.DataFrame = [v: timestamp]

scala> df.coalesce(1).write.mode("overwrite").orc("/home/nghiat/Devel/tmp/ts.orc")

Read in cudf:

auto const filepath =
  "/home/nghiat/Devel/tmp/ts.orc/"
  "part-00000-32a8c643-dcaf-40ef-b3eb-7e335d491bf1-c000.snappy.orc";
auto const in_opts = cudf_io::orc_reader_options::builder(cudf_io::source_info{filepath})
                       .use_index(false)
                       .timestamp_type(cudf::data_type(cudf::type_id::TIMESTAMP_MICROSECONDS))
                       .build();
auto const& table = cudf_io::read_orc(in_opts).tbl;
auto const a      = table->get_column(0);
cudf::test::print(a.view());


Result: 1647-05-20T19:25:02.000638Z

Bug Squashing automation moved this from Needs prioritizing to Closed Sep 14, 2022
rapids-bot pushed a commit that referenced this issue Sep 14, 2022
…or (#11699)

Closes #11525
Not sure why, but the apache Java ORC reader does the following when reading negative timestamps: https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1284-L1285
This detail does not impact the cuDF and pyorc writers (reading cuDF files with the apache reader already works) because these libraries write negative timestamps with negative nanoseconds.

This PR modifies the ORC reader behavior to match the apache reader, so that cuDF correctly reads ORC files written by the apache writer.
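
For reference, a quick way to re-check one of the originally reported values from Python (a sketch that assumes a cuDF build containing both #11586 and #11699; the file name is arbitrary):

import cudf
import pandas as pd

# Write the value with a non-cuDF (pyarrow-based) ORC writer, then read it
# back with cuDF; with the fixes in place it should come back unchanged.
ref = pd.DataFrame({"v": [pd.Timestamp("1839-12-24 03:58:55.000826")]})
ref.to_orc("ts_check.orc")
print(cudf.read_orc("ts_check.orc")["v"][0])  # expect 1839-12-24 03:58:55.000826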

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Elias Stehle (https://github.com/elstehle)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11699
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024