Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vdk-oracle: optimize batching of payload rows with different keysets #2931

Closed
DeltaMichael opened this issue Nov 22, 2023 · 0 comments · Fixed by #3194
Closed

vdk-oracle: optimize batching of payload rows with different keysets #2931

DeltaMichael opened this issue Nov 22, 2023 · 0 comments · Fixed by #3194
Assignees
Labels
enhancement New feature or request initiative: VDK Oracle VDK Oracle support story Task for an Epic
Milestone

Comments

@DeltaMichael
Copy link
Contributor

DeltaMichael commented Nov 22, 2023

Overview

Prerequisites

#2933
Benchmark values

Use case of payload objects with different key sets.

https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-plugins/vdk-oracle/tests/jobs/oracle-ingest-job-different-payloads-no-table/10_ingest.py#L6

We still want to be able to batch queries using cursor.executemany(). This is not feasible when the keysets differ, because we're substituting in an insert query with a static number of columns, e.g.

f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({', '.join([':' + str(i + 1) for i in range(len(columns))])})"

We call this for each ingestion payload. This means that the payload should be uniform, e.g. have the same keys and same number of values for every row (object). However, this contradicts our desired use case. We've solved this by further batching the payload by key set and then doing separate cursor.executemany() calls for each key set.

There are a few drawbacks to this approach.

  1. We create a frozenset for each row. Depending on the number of columns, this could be a problem, because it's an O(n) operation.
  2. Frozen sets are hashable, but depending on the hash function, there is probably a better option to use as a key. This should be researched further.
  3. We have to recreate the key-value mappings, so that order is preserved across the batch, e.g. each batch contains a bunch of rows that are dicts and we have to convert them to lists in order to execute the query. The columns are a list and were converted earlier from a keyset, where order is not guaranteed. We have to make each row follow the order of the column list, otherwise there will be errors when executing the queries. This requires seeing each element of the row and putting it in a list at the correct position.

We might be constrained by linear time, because batching requires seeing at least every row. Making the data uniform maybe does not require seeing every value of every row, so there's room for optimization.

Proposed solution

A good alternative approach might be to get the sum of all key sets in the payload (all possible columns). Then, for each row, if there are missing keys, we just set them to null and do a single cursor.executemany().

Acceptance criteria

  1. Decide if this is worth optimizing
  2. Implement optimization
  3. Measure results
@DeltaMichael DeltaMichael added the enhancement New feature or request label Nov 22, 2023
@DeltaMichael DeltaMichael added this to the VDK Oracle milestone Nov 22, 2023
@DeltaMichael DeltaMichael added the story Task for an Epic label Nov 22, 2023
DeltaMichael added a commit that referenced this issue Nov 24, 2023
## Why?
    
In order to support more use cases, vdk should support connecting and
ingesting to an oracle database
    
## What?
    
Add oracle plugin. Plugin supports simple queries, cli queries and
ingestion.
    
## How was this tested?
    
Local functional tests, CI tests are part of a separate task
    
## What kind of change is this?
    
Feature/non-breaking
    
## Follow-up

[Set up testcontainers for
CI](#2928)
[Support type inference when
ingesting](#2929)
[Support passing math.nAn and None for
ingestion](#2930)
[Optimize batching of payload rows with different
keysets](#2931)
[ORA-01002: fetch out of sequence error in _cache_tables when some rows
fail to
ingest](#2932)
[Further load
testing](#2933)
[Investigate possible
segfaults](#2934)

Signed-off-by: Dilyan Marinov <mdilyan@vmware.com>
Co-authored-by: Antoni Ivanov <aivanov@vmware.com>
@DeltaMichael DeltaMichael self-assigned this Feb 21, 2024
@stefan-pulov stefan-pulov added the initiative: VDK Oracle VDK Oracle support label Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request initiative: VDK Oracle VDK Oracle support story Task for an Epic
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants