Spark: Support Parquet dictionary encoded UUIDs #13324

Fokko · 2025-06-16T11:35:25Z

While fixing some issues on the PyIceberg side to fully support UUIDs (apache/iceberg-python#2007), I noticed this issue. I was surprised, since UUIDs used to work with Spark, but it turns out that reading dictionary-encoded UUIDs was never implemented.

PyIceberg only generates small amounts of data, so this wasn't caught earlier.
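For readers unfamiliar with the encoding, here is a minimal sketch of what resolving a dictionary-encoded UUID column involves: the distinct 16-byte values are stored once in a dictionary, and each row carries only an index into it. This is an illustration in plain Python, not the reader code changed in this PR; the function name `decode_dictionary_uuids` and the sample values are hypothetical.

```python
import uuid


def decode_dictionary_uuids(dictionary, indices):
    """Resolve dictionary indices to UUID values (illustrative only).

    dictionary: list of 16-byte values, each distinct UUID stored once
    indices:    one entry per row; None marks a null row
    """
    for raw in dictionary:
        if len(raw) != 16:
            raise ValueError("UUID dictionary entries must be 16 bytes")
    # Each row is a lookup into the dictionary, followed by a
    # big-endian bytes -> UUID conversion.
    return [None if i is None else uuid.UUID(bytes=dictionary[i]) for i in indices]


# Hypothetical sample data: two distinct UUIDs shared by five rows.
a = uuid.UUID("f79c3e09-677c-4bbd-a479-3f349cb785e7")
b = uuid.UUID("00000000-0000-0000-0000-000000000001")
dictionary = [a.bytes, b.bytes]
indices = [0, 1, 0, None, 1]

decoded = decode_dictionary_uuids(dictionary, indices)
```

A plain-encoded column would instead store all 16 bytes for every row, so a reader can handle plain pages correctly and still lack the dictionary lookup step.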

Closes #4581

kevinjqliu

Generally LGTM

Is there a way to test this? Can we add a dictionary encoded UUID like this?

dingo4dev

@Fokko TIA.

Do we have any tests for reading UUID partitions?

Fokko · 2025-06-18T05:43:54Z

@DinGo4DEV Yes, we have tests for plain-encoded UUIDs. Let me add one for dictionary-encoded UUIDs as well 👍

Fokko · 2025-06-18T08:14:49Z

> Is there a way to test this? Can we add a dictionary encoded UUID like this?

Just found out that the test above is not testing this code path, since Spark projects a UUID into a String.
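As a rough illustration of the projection mentioned here, the sketch below turns the 16 raw bytes of a UUID into its canonical string form, which is the shape a query result would take when the column is exposed as a string. This is plain Python with a hypothetical helper name, not Spark or Iceberg code.

```python
import uuid


def uuid_bytes_to_string(raw):
    """Render 16 big-endian UUID bytes as the canonical 8-4-4-4-12 string."""
    return str(uuid.UUID(bytes=raw))


# Hypothetical sample value.
raw = bytes.fromhex("f79c3e09677c4bbda4793f349cb785e7")
as_string = uuid_bytes_to_string(raw)
```

Because the result is a string, a test that only compares query output cannot tell by itself which decoding path produced it, which is consistent with having to confirm the path by other means.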

A test has been added, and I verified with breakpoints that it hits the newly added lines 👍

github-actions bot added spark arrow labels Jun 16, 2025

Fokko mentioned this pull request Jun 16, 2025

Spark: Cannot read or write UUID columns #4581

Open

Fokko force-pushed the fd-uuid branch from c000b5c to 97c150d Compare June 16, 2025 11:41

Fokko force-pushed the fd-uuid branch from 97c150d to f033e4d Compare June 16, 2025 14:44

Fokko mentioned this pull request Jun 16, 2025

fix: correct UUIDType partition representation for BucketTransform apache/iceberg-python#2003

Open

kevinjqliu approved these changes Jun 17, 2025

dingo4dev reviewed Jun 18, 2025

Add another test

cc50da4
