[Draft]Add SQL logic tests for Run-End Encoded (REE) #16715

Draft
wants to merge 1 commit into
base: main
from

Conversation

rich-t-kid-datadog
@rich-t-kid-datadog rich-t-kid-datadog commented Jul 8, 2025

This PR contributes to the larger run-length encoding (RLE) / run-end encoding (REE) epic.

Rationale for this change

DataFusion currently supports REE encoding through Arrow's RunEndEncoded type, but lacks tests confirming that this functionality works correctly across SQL operations.
This PR adds test coverage for REE-encoded data to ensure that:

  • Basic SQL operations (filtering, selection, aggregation) work correctly with REE columns
  • String functions (LOWER, UPPER, CONCAT, SUBSTR, REPLACE, REVERSE) properly handle REE-encoded data
  • Complex queries involving multiple REE columns function as expected
  • The encoding/decoding process is transparent to users and maintains data integrity
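
As a rough illustration of the last point, the sketch below models a run-end encoded column as two plain Python lists: `run_ends` (the exclusive end index of each run) and `values` (one value per run). The helper names `ree_encode` and `ree_decode` are hypothetical and are not part of DataFusion or Arrow; the sketch only shows the round trip that the tests are meant to protect.

```python
from itertools import groupby


def ree_encode(column):
    """Collapse consecutive equal values into (run_ends, values)."""
    run_ends, values = [], []
    position = 0
    for value, run in groupby(column):
        position += sum(1 for _ in run)
        run_ends.append(position)  # exclusive end index of this run
        values.append(value)
    return run_ends, values


def ree_decode(run_ends, values):
    """Expand (run_ends, values) back into the plain column."""
    column, start = [], 0
    for end, value in zip(run_ends, values):
        column.extend([value] * (end - start))
        start = end
    return column


names = ["Alice", "Alice", "Alice", "Bob", "Bob", "Carol"]
run_ends, values = ree_encode(names)
print(run_ends, values)

# Data integrity: decoding must give back exactly the original rows.
assert ree_decode(run_ends, values) == names
```

Six rows collapse into three runs here, and decoding restores the original six rows unchanged.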

What changes are included in this PR?

This PR adds a new test file run_end_encoding.slt that provides comprehensive testing for Run-End Encoded data:

  1. Basic REE Functionality
  • Table creation with REE-encoded columns using arrow_cast(column, 'RunEndEncoded(Int32, Utf8)')
  • Verification of correct data types via DESCRIBE statements
  2. Filtering and Selection
  • Filtering on REE columns
  • Combined filtering on REE and non-REE columns
  3. Aggregation Functions
  • COUNT(*) and COUNT(DISTINCT) on REE columns
  4. String Function Testing
  • LOWER() and UPPER() on REE columns
  • CONCAT() with REE columns (including nested operations)
  • SUBSTR()/SUBSTRING() on REE columns
  • REPLACE() on REE columns
  • REVERSE() on REE columns
  • Combined string functions (e.g., UPPER(SUBSTR(...)))

Are these changes tested?

The changes are themselves tests.

Are there any user-facing changes?

Not at present. Most of these tests won't pass yet because REE support is still incomplete, but they give a guideline for where our focus should be.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 8, 2025
@rich-t-kid-datadog rich-t-kid-datadog force-pushed the RunEndEncoding-slt-test branch from 147e058 to bef6ee3 Compare July 9, 2025 14:06
@rich-t-kid-datadog rich-t-kid-datadog changed the title draft of .slt file. Implemented the basics, need to test with cast ch… [Draft]Add SQL logic tests for Run-End Encoded (REE) Jul 9, 2025
@rich-t-kid-datadog
Author

TODO: Add Hash Joins
Not a priority


# LOWER function tests
query T
SELECT LOWER(name) FROM ree_test_two_columns WHERE name = 'Alice' LIMIT 1;
Contributor

For these tests, it would be nice to have a way to verify that the output type is also Run-End Encoded (as opposed to DataFusion implicitly casting to Utf8).
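
To make this concern concrete, here is a minimal sketch in plain Python of why comparing result values alone cannot catch an implicit cast. The `RunEndEncoded` class and both `lower_*` functions are hypothetical stand-ins for illustration, not DataFusion or Arrow APIs.

```python
from dataclasses import dataclass


@dataclass
class RunEndEncoded:
    run_ends: list  # exclusive end index of each run
    values: list    # one value per run

    def decode(self):
        out, start = [], 0
        for end, value in zip(self.run_ends, self.values):
            out.extend([value] * (end - start))
            start = end
        return out


def lower_keeping_encoding(col):
    # Transform one value per run; run_ends are reused as-is.
    return RunEndEncoded(col.run_ends, [v.lower() for v in col.values])


def lower_with_implicit_cast(col):
    # Decode first, then transform every row; the encoding is lost.
    return [v.lower() for v in col.decode()]


names = RunEndEncoded([3, 5], ["Alice", "Bob"])
kept = lower_keeping_encoding(names)
cast = lower_with_implicit_cast(names)

# Row-level results are identical, so a value comparison cannot
# tell the two code paths apart.
assert kept.decode() == cast == ["alice"] * 3 + ["bob"] * 2

# Only an explicit type check distinguishes them.
assert isinstance(kept, RunEndEncoded)
assert not isinstance(cast, RunEndEncoded)
```

In an .slt file, one possible approach would be an extra query that wraps the expression in DataFusion's `arrow_typeof` function, assuming it reports the REE type for such columns.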

Contributor

@gabotechs gabotechs left a comment

Looking good! How about adding some tests that stress REE with more edge cases? For example, we could have tests for REE input containing NULLs, or for an input table that contains no duplicates.
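
As a rough sketch of what such inputs look like once encoded, the snippet below runs a hypothetical `ree_encode` helper (plain Python, not a DataFusion or Arrow API) over a few edge-case columns. It assumes NULL is modeled as `None` and treated like any other value, so adjacent NULLs merge into a single run.

```python
from itertools import groupby


def ree_encode(column):
    """Return (run_ends, values); None is treated as an ordinary value."""
    run_ends, values, position = [], [], 0
    for value, run in groupby(column):
        position += len(list(run))
        run_ends.append(position)  # exclusive end index of this run
        values.append(value)
    return run_ends, values


cases = {
    "nulls_between_runs": ["a", None, None, "a", "a"],
    "all_null": [None, None, None],
    "no_duplicates": ["a", "b", "c", "d"],
    "empty": [],
}

for name, column in cases.items():
    run_ends, values = ree_encode(column)
    print(f"{name}: run_ends={run_ends} values={values}")
    # With no duplicates, every row is its own run.
    if name == "no_duplicates":
        assert len(values) == len(column)
```

The no-duplicates case is the worst case for the encoding, since the number of runs equals the number of rows.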

Labels
sqllogictest SQL Logic Tests (.slt)
3 participants