Build optimised bulk file load #161

Open
flyingsilverfin opened this issue Jun 23, 2021 · 0 comments


flyingsilverfin commented Jun 23, 2021

Issue to Solve

To improve the getting-started experience, we want to give users an easy way to load a large amount of data quickly (and so show the speeds that are attainable) without having to write their own producer/multi-consumer parallelised loader. Instead, we can build this pattern into console.

The goal is to consume a file consisting purely of insert or match-insert queries, which we can restrict to one query per line. This file can be large (hundreds of MB, or probably a few GB), so the data must be loaded across multiple transactions. Compare this to the source command that we currently have in the transaction inner REPL, which, by virtue of running inside the transaction REPL, executes in a single transaction.
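As a rough illustration of the multi-transaction idea, here is a minimal Python sketch of the sequential path: read one query per line, group the queries into fixed-size batches, and hand each batch to a callback that is expected to run it in its own transaction. The names `read_queries`, `batches`, `load_sequentially`, and `commit_batch`, as well as the default batch size, are assumptions made for this sketch and are not part of console.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def read_queries(lines: Iterable[str]) -> Iterator[str]:
    """Yield one query per non-empty line (the one-query-per-line restriction)."""
    for line in lines:
        query = line.strip()
        if query:
            yield query


def batches(queries: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group queries into lists of at most batch_size items, preserving order."""
    iterator = iter(queries)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_sequentially(
    lines: Iterable[str],
    commit_batch: Callable[[List[str]], None],
    batch_size: int = 1000,
) -> int:
    """Commit each batch in file order and return the number of queries loaded.

    commit_batch is a hypothetical callback: it is assumed to open a write
    transaction, run every query in the batch, and commit.
    """
    total = 0
    for batch in batches(read_queries(lines), batch_size):
        commit_batch(batch)
        total += len(batch)
    return total
```

Because `lines` can be an open file object, the file is streamed rather than read into memory, which matters for inputs in the GB range.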

The feature will probably live at the session or top-level REPL and could look like this:

> bulk-load <database> <TypeQL inserts file path> [--parallel]
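To make the proposed syntax concrete, the following Python sketch parses such a command line into its parts. It only illustrates the shape of the command; `BulkLoadCommand` and `parse_bulk_load` are hypothetical names, and console's real command parsing may work quite differently.

```python
import shlex
from dataclasses import dataclass


@dataclass
class BulkLoadCommand:
    database: str
    path: str
    parallel: bool = False


def parse_bulk_load(line: str) -> BulkLoadCommand:
    """Parse 'bulk-load <database> <file path> [--parallel]'."""
    tokens = shlex.split(line)  # honours quotes, so paths may contain spaces
    if not tokens or tokens[0] != "bulk-load":
        raise ValueError("expected a bulk-load command")
    flags = [t for t in tokens[1:] if t.startswith("--")]
    args = [t for t in tokens[1:] if not t.startswith("--")]
    unknown = [f for f in flags if f != "--parallel"]
    if unknown:
        raise ValueError("unknown flag(s): " + ", ".join(unknown))
    if len(args) != 2:
        raise ValueError("usage: bulk-load <database> <file path> [--parallel]")
    return BulkLoadCommand(args[0], args[1], "--parallel" in flags)
```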

Without the --parallel flag, we should load the queries sequentially in batches, because queries may depend on one another. In the help menu we should state that the --parallel flag requires every query to be independent of the others (e.g. no query may rely on the results of a prior insert). This allows us to follow the BioGrakn-Semmed migrator style of data loading: a file-reader thread feeds a blocking queue, which is read by multiple writer threads that send transaction batches to the server in parallel.
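The reader/blocking-queue/writers arrangement could be sketched in Python roughly as follows. This is not the BioGrakn-Semmed migrator code; `load_parallel`, `commit_batch`, and the default sizes are assumptions for illustration, and `commit_batch` stands in for "open a write transaction, run the batch, commit".

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel telling a writer thread to stop


def load_parallel(
    lines: Iterable[str],
    commit_batch: Callable[[List[str]], None],
    batch_size: int = 1000,
    writers: int = 4,
    queue_size: int = 16,
) -> int:
    """Load independent queries with one reader and several writer threads."""
    # Bounded queue: the reader blocks when the writers fall behind,
    # so only a limited number of batches is held in memory.
    work: queue.Queue = queue.Queue(maxsize=queue_size)
    lock = threading.Lock()
    errors: List[Exception] = []
    loaded = 0

    def writer() -> None:
        nonlocal loaded
        while True:
            batch = work.get()
            if batch is _DONE:
                return
            try:
                commit_batch(batch)  # hypothetical: one transaction per batch
            except Exception as error:
                with lock:
                    errors.append(error)
                continue  # keep draining so the reader never blocks forever
            with lock:
                loaded += len(batch)

    threads = [threading.Thread(target=writer) for _ in range(writers)]
    for thread in threads:
        thread.start()

    # The calling thread acts as the file reader (producer).
    batch: List[str] = []
    for line in lines:
        query = line.strip()
        if not query:
            continue
        batch.append(query)
        if len(batch) == batch_size:
            work.put(batch)
            batch = []
    if batch:
        work.put(batch)
    for _ in threads:
        work.put(_DONE)  # one sentinel per writer
    for thread in threads:
        thread.join()
    if errors:
        raise errors[0]
    return loaded
```

Batches complete in no particular order here, which is exactly why the flag is only safe when every query is independent of the others.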

This loader command should also be silent, or show only a progress bar, unlike the source command, which prints the output of every query. That printing makes source slow, both because of the extra network time and round trips needed to collect the printed data and because printing itself is slow.
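A progress bar of this kind can be very cheap, since it needs only a counter and no query results. The Python sketch below is one possible rendering; `render_progress` and `report` are hypothetical names, and it assumes the total number of queries is known up front (for example from a line count taken before loading).

```python
import sys
from typing import TextIO


def render_progress(done: int, total: int, width: int = 30) -> str:
    """Return a one-line textual progress bar."""
    fraction = 1.0 if total <= 0 else min(done / total, 1.0)
    filled = int(fraction * width)
    return "[{}{}] {}/{} queries".format("#" * filled, "." * (width - filled), done, total)


def report(done: int, total: int, stream: TextIO = sys.stderr) -> None:
    """Redraw the bar in place; the carriage return overwrites the previous line."""
    stream.write("\r" + render_progress(done, total))
    stream.flush()
```

Calling `report` once per committed batch, rather than once per query, keeps the terminal output from becoming a bottleneck of its own.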

haikalpribadi self-assigned this Jun 25, 2021
haikalpribadi added this to the TypeDB Rust Rewrite milestone Oct 29, 2021
flyingsilverfin removed this from the TypeDB: Rust Rewrite milestone Nov 1, 2023