Add the import process implementation for data loader #2462

inv-jishnu · 2025-01-10T11:03:18Z

Description

This PR adds import processors that process the import source file according to its file format, along with the related DTOs and utility classes.

Related issues and/or PRs

Please review this PR once the PRs below have been reviewed and merged, and the master branch has been merged into this branch with those changes.

Some more information on data chunk size and transaction size
The data chunk size and transaction size are introduced in these changes. The data chunk size is used to split the input file into data chunks of the specified size. If the ScalarDB mode is transaction, the records in each data chunk are processed in transactions: the records are further split into batches based on the transaction size, and each batch is processed together as a single transaction.
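To illustrate the two levels of splitting described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class name `ChunkSplitSketch`, the helper `partition`, and the sample sizes are hypothetical, and plain integers stand in for import records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the data loader's actual classes.
class ChunkSplitSketch {

  /** Splits items into consecutive sublists of at most {@code size} elements. */
  static <T> List<List<T>> partition(List<T> items, int size) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    List<List<T>> parts = new ArrayList<>();
    for (int start = 0; start < items.size(); start += size) {
      int end = Math.min(start + size, items.size());
      parts.add(new ArrayList<>(items.subList(start, end)));
    }
    return parts;
  }

  public static void main(String[] args) {
    List<Integer> records = new ArrayList<>();
    for (int i = 1; i <= 10; i++) {
      records.add(i);
    }
    int dataChunkSize = 4; // assumed value for the example
    int transactionSize = 3; // assumed value for the example

    // First level: split the input into data chunks.
    for (List<Integer> chunk : partition(records, dataChunkSize)) {
      System.out.println("data chunk " + chunk);
      // Second level (transaction mode): split each chunk into batches,
      // where each batch would be processed as one transaction.
      for (List<Integer> batch : partition(chunk, transactionSize)) {
        System.out.println("  transaction batch " + batch);
      }
    }
  }
}
```

With these sample values, 10 records yield three data chunks (4, 4, and 2 records), and each 4-record chunk is split into transaction batches of 3 and 1 records.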

Changes made

Added classes to process the import source file based on the file format, along with the related DTOs and utility classes.

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes.
Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
Tests (unit, integration, etc.) have been added for the changes.
My changes generate no new warnings.
Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Road map to merge remaining data loader core files. Current status

General
- ScalarDB Dao and related files: Add ScalarDB Dao and related files #2417
- TableMetadataService (partially replaced by ConsensusUtils): Add table metadata service #2434
Export
- Export options and validations: Add export options validator #2435
- ProducerTasks: 1 PR incoming
Import
- Dto classes and utilities: Add data chunk and task result enums and dtos #2442
- Import processor and task code: 2-3 PRs incoming
  - Add dtos and other classes for task #2446
  - Add the import process implementation for data loader #2462
- Code for import transaction batch and data chunk import: 1 PR incoming
- ControlFile related Dtos: Add Control file module files and validation #2445
- Import logger: 1 PR incoming

Release notes

NA

komamitsu

Left a minor comment. But other than that, LGTM! 👍

komamitsu · 2025-04-04T05:15:52Z

data-loader/core/src/main/resources/config.properties

@@ -0,0 +1,2 @@
+transaction.batch.thread.pool.size=16
+import.data.chunk.queue.size=256


[minor] A file should end with a newline https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

I have added a new line.
Thank you.

brfrn169 · 2025-04-04T05:54:03Z

...r/core/src/main/java/com/scalar/db/dataloader/core/dataimport/processor/ImportProcessor.java

+  private ImportDataChunkStatus processDataChunkWithTransactions(
+      ImportDataChunk dataChunk, int transactionBatchSize, int numCores) {


We no longer use numCores. Could you please remove it?

brfrn169 · 2025-04-04T05:56:40Z

...r/core/src/main/java/com/scalar/db/dataloader/core/dataimport/processor/ImportProcessor.java

+    Instant startTime = Instant.now();
+    AtomicInteger successCount = new AtomicInteger(0);
+    AtomicInteger failureCount = new AtomicInteger(0);
+    ExecutorService recordExecutor = Executors.newFixedThreadPool(numCores);


We should not use the number of cores here for the same reason mentioned in #2462 (comment).

brfrn169 · 2025-04-04T06:01:16Z

data-loader/core/src/main/java/com/scalar/db/dataloader/core/util/ConfigUtil.java

+ * <p>This class reads properties from a {@code config.properties} file located in the classpath.
+ */
+public class ConfigUtil {
+  public static final String CONFIG_PROPERTIES = "config.properties";


So, are we using a fixed configuration file? Could you please explain why we don’t use command-line arguments for the queue size and thread pool size?

Anyway, if we’re using a file for the configurations, I think we should pass the configuration file name via command-line arguments.

I think we should pass the configuration file name via command-line arguments.

👍

@brfrn169 san, @komamitsu san,
Thank you for the suggestions.
I have updated the code to use command-line arguments to configure the queue size (a new parameter) and the thread pool size (a parameter for this already existed, but I had not used it because the initial new data loader changes relied on the number of cores, so I had left that unchanged). I have removed the properties-file configuration completely; asking users to add a properties file for just two parameters while the rest are passed as arguments seemed confusing. I have also removed numCores from the method arguments and now use the value directly, as instructed.
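For readers following the thread, the outcome described above can be sketched roughly as follows: a bounded queue and a fixed thread pool whose sizes come from arguments rather than from the number of cores or a properties file. This is a hedged illustration, not the code merged in this PR. The class `ConfigurablePoolSketch`, the method `run`, and the positional-argument handling are hypothetical; the defaults 256 and 16 are taken from the `config.properties` values shown earlier in this review.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the data loader's actual classes.
class ConfigurablePoolSketch {
  // Sentinel value telling a worker to stop.
  private static final int POISON = -1;

  /** Pushes chunkCount dummy chunk IDs through a bounded queue; returns how many were processed. */
  static int run(int chunkCount, int queueSize, int threadPoolSize) {
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(queueSize);
    ExecutorService workers = Executors.newFixedThreadPool(threadPoolSize);
    AtomicInteger processed = new AtomicInteger(0);

    for (int i = 0; i < threadPoolSize; i++) {
      workers.submit(() -> {
        try {
          while (true) {
            int chunkId = queue.take();
            if (chunkId == POISON) {
              return;
            }
            processed.incrementAndGet(); // stand-in for importing one data chunk
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    try {
      for (int i = 0; i < chunkCount; i++) {
        queue.put(i); // blocks while the queue is full, bounding memory use
      }
      for (int i = 0; i < threadPoolSize; i++) {
        queue.put(POISON); // one stop signal per worker
      }
      workers.shutdown();
      if (!workers.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while importing", e);
    } finally {
      workers.shutdownNow(); // no effect if the pool has already terminated
    }
    return processed.get();
  }

  public static void main(String[] args) {
    // Positional arguments are an assumption made for this sketch.
    int queueSize = args.length > 0 ? Integer.parseInt(args[0]) : 256;
    int threadPoolSize = args.length > 1 ? Integer.parseInt(args[1]) : 16;
    System.out.println("processed " + run(1000, queueSize, threadPoolSize));
  }
}
```

Passing both sizes in explicitly keeps the behaviour independent of the machine the loader runs on, which seems to be the concern behind the request to drop numCores.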

inv-jishnu · 2025-04-07T04:31:54Z

@brfrn169 san,
I have made further changes based on feedback.
Please take a look at this again when you get a chance.
Thank you.

brfrn169

LGTM! Thank you!

ypeckstadt

LGTM. Thank you.

Torch3333

LGTM, thank you!

Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>

inv-jishnu and others added 30 commits December 4, 2024 15:59

Util classes for data loader

753618b

Fix spotbug issue

8d39d02

Removed error message and added core error

bf94c49

Applied spotless

47be388

Fixed unit test failures

913eb1c

Merge branch 'master' into feat/data-loader/utils

1f204b8

Basic data import enum and exception

6cfa83a

Removed exception class for now

d381b2b

Added DECIMAL_FORMAT

67f2474

Path util class updated

14e3593

Feedback changes

a096d51

Merge branch 'master' into feat/data-loader/utils

dbf1940

Merge branch 'master' into feat/data-loader/utils

cd8add9

Changes

52890c8

Merge branch 'master' into feat/data-loader/import-data-1

5114639

Merge branch 'feat/data-loader/utils' into feat/data-loader/scaladb-dao

4f9cd75

Added ScalarDB Dao

1997eb8

Merge branch 'master' into feat/data-loader/scaladb-dao

91e6310

Remove unnecessary files

8a7338b

Initial commit [skip ci]

2b52eeb

Changes

e206073

Changes

26d3144

spotbugs exclude

b86487d

spotbugs exclude -2

818a2b4

Added a file [skip ci]

90c4105

Added unit test files [skip ci]

3d5d3e0

Spotbug fixes

6495202

Removed use of List.of to fix CI error

90abd9e

Merged changes from master after resolving conflict

ba2b3dd

Merge branch 'master' into feat/data-loader/metadata-service

b1b811b

inv-jishnu requested review from brfrn169 and komamitsu April 2, 2025 11:25

inv-jishnu added 2 commits April 3, 2025 12:10

Thread exexcuter changes

d9f239c

Changed few values to be configurable

723bd51

inv-jishnu mentioned this pull request Apr 4, 2025

Rename parameters from ScalarDB to ScalarDb #2582

Merged

7 tasks

komamitsu approved these changes Apr 4, 2025

Added new line

450aaea

brfrn169 reviewed Apr 4, 2025

inv-jishnu added 3 commits April 6, 2025 21:47

reverted config utils and add CLI options

aeaa08f

Updated tests

44bf503

Removed explict passing of thread size and use it directly

a5c0b91

inv-jishnu requested a review from brfrn169 April 7, 2025 04:31

brfrn169 requested a review from ypeckstadt April 7, 2025 12:01

brfrn169 approved these changes Apr 7, 2025

ypeckstadt approved these changes Apr 8, 2025

Torch3333 approved these changes Apr 9, 2025

ypeckstadt merged commit 2eacbbc into master Apr 9, 2025
48 checks passed

feeblefakie pushed a commit that referenced this pull request Apr 9, 2025

Add the import process implementation for data loader (#2462)

6b0302a

Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>

feeblefakie pushed a commit that referenced this pull request Apr 9, 2025

Add the import process implementation for data loader (#2462)

10d0a3d

Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>

feeblefakie mentioned this pull request Apr 9, 2025

Backport to branch(3) : Add the import process implementation for data loader #2590

Merged

inv-jishnu added a commit that referenced this pull request Apr 10, 2025

Add the import process implementation for data loader (#2462)

dafc939

Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>

inv-jishnu added a commit that referenced this pull request Apr 10, 2025

Add the import process implementation for data loader (#2462)

1f5c521

Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>

inv-jishnu added a commit that referenced this pull request Apr 10, 2025

Add the import process implementation for data loader (#2462)

64c8979

Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>

brfrn169 deleted the feat/data-loader/import-process branch April 10, 2025 07:57

inv-jishnu mentioned this pull request Apr 11, 2025

Add import log classes and utils #2591

Merged

6 tasks
