vdk-oracle: optimize batching of payload rows with different keysets #2931

DeltaMichael · 2023-11-22T21:51:43Z

Overview

Prerequisites

Use case of payload objects with different key sets.

https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-plugins/vdk-oracle/tests/jobs/oracle-ingest-job-different-payloads-no-table/10_ingest.py#L6

We still want to batch inserts with cursor.executemany(). A single call cannot handle rows whose key sets differ, because the values are substituted into one insert query with a fixed list of columns, e.g.

f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({', '.join([':' + str(i + 1) for i in range(len(columns))])})"

We build this query once per ingestion payload, so the payload has to be uniform, i.e. every row (object) must have the same keys and the same number of values. That contradicts the use case above. The current workaround is to batch the payload further by key set and make a separate cursor.executemany() call for each key set.
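As a rough illustration of the per-key-set batching described above, here is a minimal sketch. The function name `group_rows_by_keyset` and the sample payload are hypothetical and are not taken from the plugin code; the database call is left out so the grouping step can be seen on its own.

```python
from collections import defaultdict


def group_rows_by_keyset(payload):
    """Group payload rows (dicts) by the set of keys each row contains."""
    batches = defaultdict(list)
    for row in payload:
        # frozenset(row) is built from the row's keys; it is hashable,
        # so it can serve as the dict key that identifies a batch.
        batches[frozenset(row)].append(row)
    return dict(batches)


# Hypothetical payload: two rows share a key set, one row differs.
payload = [
    {"id": 1, "name": "a"},
    {"id": 2},
    {"name": "b", "id": 3},
]
batches = group_rows_by_keyset(payload)
# Each batch would then get its own insert query and its own
# cursor.executemany() call.
```

Key order inside a row does not matter for grouping, so the first and third rows land in the same batch.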

There are a few drawbacks to this approach.

We create a frozenset for each row. This is O(n) in the number of columns, so it could become a problem for wide rows.
Frozensets are hashable, but depending on the cost of the hash function, there is probably a cheaper option to use as a batch key. This should be researched further.
We have to rebuild each row so that value order is consistent across the batch. Each batch contains rows that are dicts, and we have to convert them to lists in order to execute the query. The column list was converted earlier from a key set, where order is not guaranteed, so each row has to follow the order of the column list; otherwise there will be errors when executing the queries. This requires visiting every element of the row and placing it in a list at the correct position.
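The reordering step in the last point can be sketched as follows. The helper names `build_insert` and `rows_in_column_order` and the table name `test_table` are hypothetical; the query template is the one quoted earlier in this issue, and sorting the key set is only one assumed way to fix a column order.

```python
def build_insert(table_name, columns):
    """Build the positional-bind insert query quoted in the issue."""
    placeholders = ", ".join(":" + str(i + 1) for i in range(len(columns)))
    return f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders})"


def rows_in_column_order(rows, columns):
    """Convert dict rows to lists whose values follow the column order."""
    # Every value of every row is visited here, which is the cost
    # the third drawback refers to.
    return [[row[col] for col in columns] for row in rows]


# A set has no guaranteed order, so pick one order (here: sorted)
# and use it for both the query and the rows.
columns = sorted({"name", "id"})
query = build_insert("test_table", columns)
params = rows_in_column_order(
    [{"name": "a", "id": 1}, {"id": 3, "name": "b"}], columns
)
# query and params would then be passed to cursor.executemany(query, params).
```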

Linear time in the number of rows is probably a lower bound, because batching has to look at every row at least once. Making the data uniform, however, might not require looking at every value of every row, so there is room for optimization.

Proposed solution

A good alternative approach might be to compute the union of all key sets in the payload (all possible columns). Then, for each row, any missing keys are set to null, and the whole payload goes through a single cursor.executemany() call.
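A minimal sketch of this proposal, assuming rows are plain dicts and that None is what ends up as null in the database. The function name `make_uniform` and the sample payload are hypothetical and do not reflect the implementation that was eventually merged.

```python
def make_uniform(payload):
    """Return (columns, rows) where every row covers all columns.

    columns is the combined set of keys across the payload (a set union),
    kept in first-seen order; keys missing from a row become None.
    """
    columns = []
    seen = set()
    for row in payload:
        for key in row:
            if key not in seen:
                seen.add(key)
                columns.append(key)
    # dict.get returns None for absent keys, which stands in for null.
    rows = [[row.get(col) for col in columns] for row in payload]
    return columns, rows


# Hypothetical payload with three different key sets.
payload = [
    {"id": 1, "name": "a"},
    {"id": 2, "price": 9.5},
    {"name": "c"},
]
columns, rows = make_uniform(payload)
# One insert query over `columns` and one cursor.executemany() call
# would then cover the whole payload.
```

One trade-off to weigh when deciding whether this is worth doing: an explicit null is written for every missing key, so a column default defined on the table would generally not apply to those rows.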

Acceptance criteria

Decide if this is worth optimizing
Implement optimization
Measure results

## Why?
In order to support more use cases, vdk should support connecting and ingesting to an oracle database

## What?
Add oracle plugin. Plugin supports simple queries, cli queries and ingestion.

## How was this tested?
Local functional tests, CI tests are part of a separate task

## What kind of change is this?
Feature/non-breaking

## Follow-up
[Set up testcontainers for CI](#2928)
[Support type inference when ingesting](#2929)
[Support passing math.nan and None for ingestion](#2930)
[Optimize batching of payload rows with different keysets](#2931)
[ORA-01002: fetch out of sequence error in _cache_tables when some rows fail to ingest](#2932)
[Further load testing](#2933)
[Investigate possible segfaults](#2934)

Signed-off-by: Dilyan Marinov <mdilyan@vmware.com>
Co-authored-by: Antoni Ivanov <aivanov@vmware.com>

DeltaMichael added the enhancement New feature or request label Nov 22, 2023

DeltaMichael added this to the VDK Oracle milestone Nov 22, 2023

DeltaMichael added the story Task for an Epic label Nov 22, 2023

DeltaMichael mentioned this issue Nov 22, 2023

vdk-oracle: create oracle plugin #2927

Merged

DeltaMichael mentioned this issue Nov 27, 2023

vdk-oracle: further load testing #2933

Open

DeltaMichael self-assigned this Feb 21, 2024

stefan-pulov added the initiative: VDK Oracle VDK Oracle support label Mar 6, 2024

DeltaMichael linked a pull request Mar 12, 2024 that will close this issue

vdk-oracle: Pass ingestion payload rows in uniform batches #3194

Merged

DeltaMichael closed this as completed in #3194 Mar 22, 2024
