Contributor

@inv-jishnu inv-jishnu commented Jan 10, 2025

Description

This PR adds import processors that handle the import source file according to its file format, along with the related DTOs and utility classes.

Related issues and/or PRs

Please review this PR once the PRs below have been reviewed and merged, and the master branch has been merged into this branch with those changes.

Some more information on data chunk and transaction size
This change introduces a data chunk size and a transaction size. The data chunk size is used to split the input file into data chunks of the specified size. If the ScalarDB mode is transaction, the records in each data chunk are processed as transactions: the records are further split into groups based on the transaction size, and each group is processed together as a single transaction.
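
The two-level split described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class name `ChunkingSketch`, the helper `partition`, and the sample sizes are all hypothetical, and records are stood in for by plain integers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the data loader's actual classes.
class ChunkingSketch {

  /** Splits items into consecutive sublists holding at most size elements each. */
  static <T> List<List<T>> partition(List<T> items, int size) {
    if (size <= 0) {
      throw new IllegalArgumentException("size must be positive");
    }
    List<List<T>> parts = new ArrayList<>();
    for (int start = 0; start < items.size(); start += size) {
      int end = Math.min(start + size, items.size());
      parts.add(new ArrayList<>(items.subList(start, end)));
    }
    return parts;
  }

  public static void main(String[] args) {
    List<Integer> records = new ArrayList<>();
    for (int i = 1; i <= 10; i++) {
      records.add(i);
    }
    int dataChunkSize = 4; // assumed value, for illustration only
    int transactionSize = 2; // assumed value, for illustration only

    // First level: split the input records into data chunks.
    for (List<Integer> chunk : partition(records, dataChunkSize)) {
      System.out.println("data chunk " + chunk);
      // Second level: split each chunk into transaction-sized batches.
      for (List<Integer> batch : partition(chunk, transactionSize)) {
        // In transaction mode, each batch would be handled as one transaction.
        System.out.println("  transaction batch " + batch);
      }
    }
  }
}
```

With these sample sizes, the last data chunk and its batches are simply smaller than the configured sizes; no padding is applied in this sketch.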

Changes made

Added classes to process the import source file based on the file format, along with the related DTOs and utility classes.

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Road map to merge remaining data loader core files. Current status

Release notes

NA

@inv-jishnu inv-jishnu requested review from brfrn169 and komamitsu April 2, 2025 11:25
Contributor

@komamitsu komamitsu left a comment

Left a minor comment. But other than that, LGTM! 👍

@@ -0,0 +1,2 @@
transaction.batch.thread.pool.size=16
import.data.chunk.queue.size=256
No newline at end of file
Contributor

Contributor Author

I have added a new line.
Thank you.

Comment on lines 330 to 331
private ImportDataChunkStatus processDataChunkWithTransactions(
ImportDataChunk dataChunk, int transactionBatchSize, int numCores) {
Collaborator

We no longer use numCores. Could you please remove it?

Instant startTime = Instant.now();
AtomicInteger successCount = new AtomicInteger(0);
AtomicInteger failureCount = new AtomicInteger(0);
ExecutorService recordExecutor = Executors.newFixedThreadPool(numCores);
Collaborator

We should not use the number of cores here for the same reason mentioned in #2462 (comment).
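
As a rough illustration of this request, the pool could be sized from an explicitly supplied setting instead of a core count. The sketch below is an assumption about one possible shape, not the merged implementation: `PoolSizeSketch`, `processAll`, and the `threadPoolSize` parameter are hypothetical names, and the submitted tasks only increment a counter in place of real record processing.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: not the data loader's actual classes.
class PoolSizeSketch {

  /** Runs recordCount dummy tasks on a pool whose size comes from the caller. */
  static int processAll(int recordCount, int threadPoolSize) {
    AtomicInteger successCount = new AtomicInteger(0);
    // The pool size is an explicit setting, not derived from the number of cores.
    ExecutorService recordExecutor = Executors.newFixedThreadPool(threadPoolSize);
    try {
      for (int i = 0; i < recordCount; i++) {
        // Placeholder for processing one record.
        recordExecutor.execute(() -> successCount.incrementAndGet());
      }
    } finally {
      // Stop accepting new tasks; already submitted tasks still run.
      recordExecutor.shutdown();
    }
    try {
      if (!recordExecutor.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for tasks", e);
    }
    return successCount.get();
  }
}
```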

* <p>This class reads properties from a {@code config.properties} file located in the classpath.
*/
public class ConfigUtil {
public static final String CONFIG_PROPERTIES = "config.properties";
Collaborator

So, are we using a fixed configuration file? Could you please explain why we don’t use command-line arguments for the queue size and thread pool size?

Anyway, if we’re using a file for the configurations, I think we should pass the configuration file name via command-line arguments.

Contributor

I think we should pass the configuration file name via command-line arguments.

👍

Contributor Author

@brfrn169 san, @komamitsu san,
Thank you for the suggestions.
I have updated the code to use command-line arguments to configure the queue size (a new parameter) and the thread pool size. A parameter for the thread pool size already existed, but I had not used it because the initial data loader changes relied on the number of cores, so I had left that unchanged. I have removed the properties-file configuration completely; asking users to add a properties file for just two parameters while the rest are passed as arguments seemed confusing. I have also removed numCores from the method arguments and now use the value directly, as instructed.
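
For readers following the thread, a minimal sketch of reading both settings from command-line arguments might look like the following. The option names `--data-chunk-queue-size` and `--max-threads` and the class `ImportArgsSketch` are hypothetical, since the actual option names are not shown in this conversation; the defaults of 256 and 16 are borrowed from the properties file quoted earlier in the review.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: option names and class are illustrative assumptions.
class ImportArgsSketch {
  static final String QUEUE_SIZE_FLAG = "--data-chunk-queue-size"; // assumed name
  static final String THREADS_FLAG = "--max-threads"; // assumed name

  final int queueSize;
  final int threadPoolSize;

  ImportArgsSketch(int queueSize, int threadPoolSize) {
    this.queueSize = queueSize;
    this.threadPoolSize = threadPoolSize;
  }

  /** Parses "flag value" pairs, falling back to defaults for missing flags. */
  static ImportArgsSketch parse(String[] args) {
    Map<String, String> options = new HashMap<>();
    for (int i = 0; i < args.length; i += 2) {
      if (i + 1 >= args.length) {
        throw new IllegalArgumentException("Missing value for " + args[i]);
      }
      options.put(args[i], args[i + 1]);
    }
    int queueSize = positiveInt(options, QUEUE_SIZE_FLAG, 256);
    int threadPoolSize = positiveInt(options, THREADS_FLAG, 16);
    return new ImportArgsSketch(queueSize, threadPoolSize);
  }

  /** Reads a positive integer option, or returns the default when it is absent. */
  private static int positiveInt(Map<String, String> options, String flag, int defaultValue) {
    String raw = options.get(flag);
    if (raw == null) {
      return defaultValue;
    }
    int value = Integer.parseInt(raw);
    if (value <= 0) {
      throw new IllegalArgumentException(flag + " must be positive");
    }
    return value;
  }
}
```

Keeping both settings on the command line matches the point made above: every option is supplied the same way, with no separate properties file to maintain.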

@inv-jishnu
Contributor Author

@brfrn169 san,
I have made further changes based on feedback.
Please take a look at this again when you get a chance.
Thank you.

@inv-jishnu inv-jishnu requested a review from brfrn169 April 7, 2025 04:31
@brfrn169 brfrn169 requested a review from ypeckstadt April 7, 2025 12:01
Collaborator

@brfrn169 brfrn169 left a comment

LGTM! Thank you!

Contributor

@ypeckstadt ypeckstadt left a comment

LGTM. Thank you.

Contributor

@Torch3333 Torch3333 left a comment

LGTM, thank you!

@ypeckstadt ypeckstadt merged commit 2eacbbc into master Apr 9, 2025
48 checks passed
feeblefakie pushed a commit that referenced this pull request Apr 9, 2025
Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>
feeblefakie pushed a commit that referenced this pull request Apr 9, 2025
Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>
inv-jishnu added a commit that referenced this pull request Apr 10, 2025
Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>
inv-jishnu added a commit that referenced this pull request Apr 10, 2025
Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>
inv-jishnu added a commit that referenced this pull request Apr 10, 2025
Co-authored-by: Peckstadt Yves <peckstadt.yves@gmail.com>
@brfrn169 brfrn169 deleted the feat/data-loader/import-process branch April 10, 2025 07:57
@inv-jishnu inv-jishnu mentioned this pull request Apr 11, 2025
6 tasks
Labels

enhancement New feature or request

5 participants