Spark: Cannot read or write UUID columns #4581

@RussellSpitzer

Because Spark sees UUID columns as String while they are stored as fixed binary, the String -> Fixed Binary conversion leaves both the readers and the writers incorrect.

The vectorized reader initializes a fixed-binary reader on a column that we report as a String, which fails with an unsupported-type exception:

java.lang.UnsupportedOperationException: Unsupported type: UTF8String
	at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:82)
	at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:140)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Sour
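
For the read path, the missing piece is a step that turns the 16 stored bytes back into the string form Spark expects. The following is only a sketch using the plain JDK, not Iceberg's reader code: the class and method names are hypothetical, and it assumes the bytes are laid out big-endian, most significant long first, matching the writer proposed in this issue. In Spark the result would presumably still need to be wrapped, for example with UTF8String.fromString.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidFixedBinaryDecode {

  // Hypothetical helper: decode a 16-byte fixed binary value into the
  // canonical UUID string. Assumes big-endian layout (ByteBuffer's default).
  static String uuidStringFromFixed(byte[] fixed) {
    if (fixed.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + fixed.length);
    }
    ByteBuffer buf = ByteBuffer.wrap(fixed);
    long mostSignificant = buf.getLong();   // bytes 0..7
    long leastSignificant = buf.getLong();  // bytes 8..15
    return new UUID(mostSignificant, leastSignificant).toString();
  }

  public static void main(String[] args) {
    byte[] fixed = new byte[16];
    for (int i = 0; i < 16; i++) {
      fixed[i] = (byte) i;
    }
    // prints 00010203-0405-0607-0809-0a0b0c0d0e0f
    System.out.println(uuidStringFromFixed(fixed));
  }
}
```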

The writer is broken because it receives String columns from Spark but needs to write fixed binary.

Something like this is needed as a fix:

  // Factory for a writer that accepts Spark's String representation of a UUID.
  private static PrimitiveWriter<UTF8String> uuids(ColumnDescriptor desc) {
    return new UUIDWriter(desc);
  }

  private static class UUIDWriter extends PrimitiveWriter<UTF8String> {
    // Reused for every value to avoid allocating 16 bytes per row.
    private final ByteBuffer buffer = ByteBuffer.allocate(16);

    private UUIDWriter(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public void write(int repetitionLevel, UTF8String string) {
      // Parse the String form, then pack it as 16 big-endian bytes.
      UUID uuid = UUID.fromString(string.toString());
      buffer.rewind();
      buffer.putLong(uuid.getMostSignificantBits());
      buffer.putLong(uuid.getLeastSignificantBits());
      buffer.rewind();
      // The buffer is overwritten on the next call, hence fromReusedByteBuffer.
      column.writeBinary(repetitionLevel, Binary.fromReusedByteBuffer(buffer));
    }
  }
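
The String-to-bytes conversion at the core of that writer can be checked in isolation with the plain JDK. This is a minimal sketch, not part of the proposed patch: the class and method names are hypothetical, and it allocates a fresh buffer per call for simplicity instead of reusing one.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidFixedBinaryEncode {

  // Hypothetical helper: pack a UUID string into 16 big-endian bytes,
  // most significant long first, as the proposed UUIDWriter does.
  static byte[] uuidToFixed(String value) {
    UUID uuid = UUID.fromString(value);
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] fixed = uuidToFixed("00010203-0405-0607-0809-0a0b0c0d0e0f");
    StringBuilder hex = new StringBuilder();
    for (byte b : fixed) {
      hex.append(String.format("%02x", b));
    }
    // prints 000102030405060708090a0b0c0d0e0f
    System.out.println(hex);
  }
}
```

Allocating per call keeps the sketch easy to test. The writer above reuses a single buffer instead, which is why it hands the value to Parquet through Binary.fromReusedByteBuffer.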
