[New Feature] Faster Bulk-Data Loading in YugabyteDB #11765

Open
ymahajan opened this issue Mar 15, 2022 · 1 comment
Labels: area/ycql, kind/enhancement, priority/medium

Comments

ymahajan (Contributor) commented Mar 15, 2022

Jira Link: DB-4641

Description

Master issue to track improvements to make it easier and faster to get large amounts of data into YugabyteDB.

Phase 1

| Status | Feature | GitHub Issue | Comments |
| --- | --- | --- | --- |
| | Faster non transactional writes during bulk load | #7809 | Allows faster writes for the COPY command via the session variable `yb_force_non_transactional_writes`. |
| | Disable transactional writes during bulk data loading for indexes | #11266 | Add the `yb_disable_transactional_writes` session variable to reduce bulk-load latency for index tables, for example when the COPY command is used, which goes through the insert write path (not delete or update). |
| | Implement Async Flush for COPY command | #11628 | Currently, we synchronously wait for a flush response every time we flush. We want to make this asynchronous to reduce the time spent waiting and improve the performance of COPY. |
| | Speed up YSQL inserts by skipping lookup of keys being inserted | #11269 | During bulk load (for example, inserts by the COPY command), skip the lookup of the key being inserted to speed up inserts. This is similar to the upsert mode supported for YCQL. |
| | Optimize memory allocation/deallocation in bulk insert/copy using Protobuf's arena | #11720 | Currently, when running a bulk insert or COPY command, about 15 percent of CPU time in the PostgreSQL backend is spent on memory allocation and deallocation. |
| | Perf improvement by eliminating serialization to the WAL format | #11409 | When writing data to the RocksDB layer, an additional step serializes the data to the WAL format, which is unnecessary and wastes work. |
| | Tuning parameters for faster copy performance | #12293 | Tuning parameters for faster copy performance |
| | Pack columns in DocDB storage format for better performance | #3520 | Packing columns into a single RocksDB entry per row, instead of one per column (as we do currently), improves YSQL performance. |
| ⬜️ | Parallelize copy command | #11453 | Distribute the copy operation internally using multiple workers. |
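
As a rough illustration of how the session variables named in the table might be wrapped around a COPY, the sketch below only builds the SQL statements; it does not connect to a database. The helper name `bulk_load_statements`, the table name `orders`, and the assumption that the variable is a boolean set with `SET ... = true` are illustrative and not confirmed by this issue.

```python
def bulk_load_statements(table, setting="yb_disable_transactional_writes"):
    """Return the SQL statements for one bulk-load session, in order.

    Assumptions (not confirmed by this issue): the setting is a boolean
    session variable, and `table` is a trusted, already-quoted identifier.
    """
    return [
        f"SET {setting} = true",                       # affects this session only
        f"COPY {table} FROM STDIN WITH (FORMAT csv)",  # standard PostgreSQL COPY syntax
        f"RESET {setting}",                            # restore the default afterwards
    ]


# Hypothetical table name, used only for illustration.
for statement in bulk_load_statements("orders"):
    print(statement + ";")
```

Passing `"yb_force_non_transactional_writes"` as the second argument produces the same sequence for the other variable. Because these settings turn off transactional writes, they are presumably meant for initial loads where a failed run can simply be repeated.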

Phase 2

| Status | Feature | GitHub Issue | Comments |
| --- | --- | --- | --- |
| ⬜️ | Streaming ingest to YugabyteDB without using JDBC | | Around 1 billion records are inserted through the streaming interface every day. Transferring this volume of records over the JDBC interface would be inefficient. One possible approach is implementing the Spark RDD write interface. |
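
To put the daily volume above in perspective, a quick back-of-the-envelope calculation gives the sustained rate it implies. This assumes the load is spread evenly over 24 hours, which the issue does not state; real traffic would likely have peaks above this average.

```python
RECORDS_PER_DAY = 1_000_000_000   # "around 1 billion records" from the issue
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

# Average rate if ingest were perfectly uniform (an assumption).
records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY
print(f"{records_per_second:,.0f} records/second on average")
```

That works out to roughly 11,600 records per second around the clock, which is the scale any streaming ingest path would need to sustain.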
ymahajan added the kind/new-feature label Mar 15, 2022
rthallamko3 added the area/ycql label Dec 29, 2022
yugabyte-ci added the priority/medium label Dec 29, 2022
yugabyte-ci added the kind/enhancement and status/awaiting-triage labels and removed the kind/new-feature and status/awaiting-triage labels Jan 9, 2023

ddorian (Contributor) commented Oct 27, 2023

Streaming ingest to YugabyteDB without using JDBC

Maybe https://www.postgresql.org/about/news/apache-arrow-flight-sql-adapter-for-postgresql-010-2716/ could be of use.

Projects: Bulk Load (Awaiting triage)