Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include writer code and writerVersion in ORC files #14458

Merged

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Nov 20, 2023

Description

Closes #14325
Changes some of the metadata written to ORC file:

  • Include the (cuDF) writer code (5).
  • include writerVersion, with the value of 7; This value means that bugs up to ORC-517 are fixed. This version (6+ required) allows us to write the nanosecond statistics.

This change can have unexpected impact, depending on how readers use these fields.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added feature request New feature or request cuIO cuIO issue breaking Breaking change labels Nov 20, 2023
@vuule vuule self-assigned this Nov 20, 2023
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue conda Java Affects Java cuDF API. labels Nov 20, 2023
@vuule vuule changed the base branch from branch-23.12 to branch-24.02 November 20, 2023 23:07
@vuule vuule changed the title Include writer code and correct version in ORC files Include writer code and writerVersion in ORC files Nov 21, 2023
@vuule vuule marked this pull request as ready for review November 22, 2023 02:40
@vuule vuule requested a review from a team as a code owner November 22, 2023 02:40
@@ -40,6 +40,9 @@ namespace orc {
static constexpr uint32_t block_header_size = 3;
// Seconds from January 1st, 1970 to January 1st, 2015
static constexpr int64_t orc_utc_epoch = 1420070400;
// Each ORC writer implementation should write its code to the file footer
// the codes are specified in the ORC specification
static constexpr int32_t cudf_writer_code = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are ORC writer number 5?
Is there a link to these codes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we got assigned 5. Search for cudf in https://orc.apache.org/specification/ORCv1/

I can add the link above.

@harrism harrism removed their request for review November 23, 2023 01:34
@harrism
Copy link
Member

harrism commented Nov 23, 2023

I think someone more familiar with ORC should review. Removing my assignment.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@@ -2702,7 +2703,8 @@ void writer::impl::close()
ps.footerLength = pbw.size();
ps.compression = _compression_kind;
ps.compressionBlockSize = _compression_blocksize;
ps.version = {0, 12};
ps.version = {0, 12}; // Hive 0.12
ps.writerVersion = 7; // ORC-517 fixed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this 7 be defined above like MAGIC with a comment about it? Do we change the writer version every major release or major changes? Is this something we never change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that we would increase this version when we implement newer features and fix notable bugs, see https://github.com/apache/orc/blob/main/proto/orc_proto.proto#L412-L422

I'm not sure if this should be a constant. I'll make the change if you think that's better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is a constant it is easier to add a multi-line comment explaining that major changes should increment it. I'd also be ok with a comment about the version. At the extreme, I would have a changelist in the comments

// version 1: Initial implementation
// version 2: added nested types
// version 3: fixed bug xyz

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included the list of versions, let me know what you think

static constexpr int32_t cudf_writer_code = 5;
// Each ORC writer implementation should write its version to the PostScript
// The version values are based on the ORC Java writer bug fixes and features
// From https://github.com/apache/orc/blob/main/proto/orc_proto.proto:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link cannot be accessed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, they removed the proto module in a recent PR :(
I'll track down where the information is now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the link
what timing!

uint32_t compressionBlockSize{}; // the maximum size of each compression chunk
std::vector<uint32_t> version; // the version of the file format [major, minor]
uint64_t metadataLength = 0; // the length of the metadata section in bytes
std::optional<uint32_t> writerVersion; // The version of the writer that wrote the file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is an optional? Should it be a permanent field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's based on the protobuf specs: https://github.com/apache/orc/blob/branch-1.9/proto/orc_proto.proto#L439
We might always write the version now, but we should be able to read files without this field.

@@ -90,6 +113,7 @@ struct FileFooter {
uint64_t numberOfRows = 0; // the total number of rows in the file
std::vector<ColStatsBlob> statistics; // Column statistics blobs
uint32_t rowIndexStride = 0; // the maximum number of rows in each index entry
std::optional<uint32_t> writer; // Writer code
Copy link
Contributor

@ttnghia ttnghia Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, should it be a permanent field?

@vuule vuule requested a review from ttnghia December 6, 2023 23:54
@vuule
Copy link
Contributor Author

vuule commented Dec 7, 2023

/merge

@rapids-bot rapids-bot bot merged commit b02e82f into rapidsai:branch-24.02 Dec 7, 2023
67 checks passed
@vuule vuule deleted the fea-write_orc-version-and-code branch December 7, 2023 17:48
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
Closes rapidsai#14325
Changes some of the metadata written to ORC file:

- Include the (cuDF) writer code (5).
- include writerVersion, with the value of 7; This value means that bugs up to ORC-517 are fixed. This version (6+ required) allows us to write the nanosecond statistics.

This change can have unexpected impact, depending on how readers use these fields.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#14458
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down
6 participants