Extract Delta Lake deletion vectors #661
base: main
Conversation
InternalDeletionVector deletionVector =
    actionsConverter.extractDeletionVector(snapshotAtVersion, (AddFile) action);
if (deletionVector != null) {
  deletionVectors.put(deletionVector.dataFilePath(), deletionVector);
deletionVector.dataFilePath() points to the path of the associated Parquet Data File.
We should use deletionVector.getPhysicalPath() instead. Thoughts? @ashvina
Thanks for the review @piyushdubey
The intention is to use the path of the data file with which this deletion vector is associated (see the comment on line 118). This is used to update the maps of added and removed files.
Makes sense. I see that on line 168 you are concatenating the deletion vectors to the internal files, which avoids adding both the DV and the data file without skipping.
Thanks for the clarification.
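For context, here is a minimal sketch of the reconciliation being discussed; the surrounding variables (actions, addedFiles, removedFiles) are illustrative names, not necessarily the PR's exact ones:

// Sketch: key the deletion-vector map by the associated data file's path
// (dataFilePath), not the DV's physical path, so entries can be matched
// against the added/removed file maps.
Map<String, InternalDeletionVector> deletionVectors = new HashMap<>();
for (Action action : actions) {
  if (action instanceof AddFile) {
    InternalDeletionVector deletionVector =
        actionsConverter.extractDeletionVector(snapshotAtVersion, (AddFile) action);
    if (deletionVector != null) {
      deletionVectors.put(deletionVector.dataFilePath(), deletionVector);
    }
  }
}
// A data file that was "removed" and re-"added" only to attach a DV is
// dropped from both maps so it is not reported as a real file change.
for (String dataFilePath : deletionVectors.keySet()) {
  addedFiles.remove(dataFilePath);
  removedFiles.remove(dataFilePath);
}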
Depends on #662
 * binary representation of the deletion vector. The consumer can use the {@link
 * #ordinalsIterator()} to extract the ordinals represented in the binary format.
 */
byte[] binaryRepresentation;
Currently when this field is set to a non-null value, the ordinalsIterator is also set. I think it may be cleaner to remove this and rely directly on the ordinalsIterator. Is there something in the future, though, where this may be used directly?
My main worry is that future developers implementing support for deletion vectors may eagerly parse the data into this field.
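One possible direction, sketched with an assumed supplier-based shape (illustrative only, not the PR's actual API):

import java.util.Iterator;
import java.util.function.Supplier;

// Sketch: expose ordinals lazily instead of storing a parsed byte[].
// The supplier defers reading and decoding the DV file until a consumer
// actually iterates, so nothing is eagerly materialized.
public class InternalDeletionVector {
  private final Supplier<Iterator<Long>> ordinalsSupplier;

  public InternalDeletionVector(Supplier<Iterator<Long>> ordinalsSupplier) {
    this.ordinalsSupplier = ordinalsSupplier;
  }

  // Each call returns a fresh iterator over deleted row ordinals.
  public Iterator<Long> ordinalsIterator() {
    return ordinalsSupplier.get();
  }
}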
Configuration conf = new Configuration();
DeltaLog deltaLog = Mockito.mock(DeltaLog.class);
when(snapshot.deltaLog()).thenReturn(deltaLog);
when(deltaLog.dataPath()).thenReturn(new Path(basePath));
when(deltaLog.newDeltaHadoopConf()).thenReturn(conf);

long[] ordinals = {45, 78, 98};
Mockito.doReturn(ordinals)
    .when(actionsConverter)
    .parseOrdinalFile(conf, new Path(deleteFilePath), size, offset);
Can you pull the common testing setup into a helper method?
Similarly, the assertions below could be moved into a common method so there are fewer places to update if the assertions need to change due to a new field or something like that.
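For example, the shared mock wiring could move into a helper along these lines (the method name is illustrative):

// Sketch: common Mockito setup for the DeltaLog/Snapshot mocks used by
// several tests; returns the Configuration so individual tests can stub
// calls that depend on it.
private Configuration mockDeltaLog(Snapshot snapshot, String basePath) {
  Configuration conf = new Configuration();
  DeltaLog deltaLog = Mockito.mock(DeltaLog.class);
  when(snapshot.deltaLog()).thenReturn(deltaLog);
  when(deltaLog.dataPath()).thenReturn(new Path(basePath));
  when(deltaLog.newDeltaHadoopConf()).thenReturn(conf);
  return conf;
}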
@@ -151,7 +153,7 @@ public TableChange getTableChangeForCommit(Long versionNumber) {
   // entry which is replaced by a new entry, AddFile with delete vector information. Since the
   // same data file is removed and added, we need to remove it from the added and removed file
   // maps which are used to track actual added and removed data files.
-  for (String deletionVector : deletionVectors) {
+  for (String deletionVector : deletionVectors.keySet()) {
nitpick: the name deletionVector is no longer representative of the actual string. Something like dataFileForDeletionVector would be clearer.
}

private void validateDeletionInfoForCommit(
    TableState tableState,
This is unused in the method, is that intentional?
Iterator<Long> iterator = deleteInfo.ordinalsIterator();
List<Long> deletes = new ArrayList<>();
iterator.forEachRemaining(deletes::add);
assertEquals(deletes.size(), deleteInfo.getRecordCount());
Should we also validate the ordinals are correct here?
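For instance, reusing the ordinals stubbed earlier (45, 78, 98), the check could be extended along these lines (assumes java.util.Arrays is imported):

// Sketch: verify the exact ordinals, not just the count.
assertEquals(Arrays.asList(45L, 78L, 98L), deletes);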
SourceTable tableConfig =
    SourceTable.builder()
        .name(testSparkDeltaTable.getTableName())
-       .basePath(testSparkDeltaTable.getBasePath())
+       .basePath(tableBasePath)
        .formatName(TableFormat.DELTA)
        .build();
DeltaConversionSource conversionSource =
    conversionSourceProvider.getConversionSourceInstance(tableConfig);
InternalSnapshot internalSnapshot = conversionSource.getCurrentSnapshot();

// validateDeltaPartitioning(internalSnapshot);
Can you remove this comment?
@@ -91,11 +99,24 @@ void setUp() {
   conversionSourceProvider.init(hadoopConf);
 }

+private static class TableState {
+  Map<String, AddFile> activeFiles;
+  List<Row> rowsToDelete;
This list looks like it is unused, is that intentional?
Fixes: #343 and #345
This change extracts deletion vectors, represented as roaring bitmaps in Delta Lake files, and converts them into the XTable intermediate representation.
Previously, XTable only detected table changes that added or removed data files. Now the detected table change also includes any deletion vector files added in the commit.
Note that in Delta Lake, deletion vectors are stored in a compressed binary format. Once extracted by XTable, however, the offsets are currently materialized as a list of longs. This representation is not the most efficient for large datasets; optimization has been deferred to prioritize completing the end-to-end conversion.
relates to #627
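As a rough illustration of the decode step described above, here is a sketch using the plain 32-bit org.roaringbitmap library; Delta's actual DV encoding is a portable 64-bit variant, so the format details here are assumptions:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.roaringbitmap.RoaringBitmap;

// Sketch: decode a serialized roaring bitmap and materialize its set bits
// as long ordinals, mirroring the list-of-longs representation mentioned
// in the description. Illustrative only; not Delta's exact on-disk format.
public static List<Long> decodeOrdinals(byte[] serializedBitmap) throws IOException {
  RoaringBitmap bitmap = new RoaringBitmap();
  bitmap.deserialize(new DataInputStream(new ByteArrayInputStream(serializedBitmap)));
  List<Long> ordinals = new ArrayList<>();
  bitmap.forEach((int ordinal) -> ordinals.add((long) ordinal));
  return ordinals;
}

Materializing every ordinal defeats the compression for large DVs, which is the inefficiency the description calls out; streaming the bitmap's iterator directly would avoid holding the full list in memory.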