Skip to content

Add transaction mode support for data export in dataloader core and CLI #2740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 37 commits into
base: master
Choose a base branch
from

Conversation

inv-jishnu
Copy link
Contributor

@inv-jishnu inv-jishnu commented Jun 5, 2025

Description

This PR adds support for transactional mode in data export within the Dataloader core, along with corresponding updates to the CLI.

Previously, the export functionality only supported storage mode because there was no transactional scanner API that allowed streaming data without loading everything into memory. With the introduction of a suitable transactional scanner, we can now support export in transaction mode as well.

Related issues and/or PRs

NA

Changes made

On core

  • Added changes to support transaction for export in TableMetadataService, ExportManager, and ScalarDBDao
  • Added/updated unit tests.
    On CLI
  • Added a parameter to input the Scalardb mode.
  • Added checks to use the transaction based on input for metadata retrieval and export.

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • I have considered whether similar issues could occur in other products, components, or modules if this PR is for bug fixes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

NA

Release notes

Added transaction mode in data export on data loader core and CLI

@inv-jishnu inv-jishnu self-assigned this Jun 5, 2025
@inv-jishnu inv-jishnu added the enhancement New feature or request label Jun 5, 2025
@inv-jishnu inv-jishnu marked this pull request as draft June 5, 2025 10:39
@ypeckstadt ypeckstadt changed the title Add transaction mode in export on dataloader core and CLI Add transaction mode support for data export in dataloader core and CLI Jun 9, 2025
@ypeckstadt ypeckstadt marked this pull request as ready for review June 11, 2025 00:26
@ypeckstadt ypeckstadt requested review from a team, komamitsu, brfrn169, feeblefakie, Torch3333 and ypeckstadt and removed request for a team June 11, 2025 00:26
@komamitsu
Copy link
Contributor

@inv-jishnu I think there are many if conditions to check whether it's storage or transaction. This is just an idea, but ScalarDbDao might be a good wrapper of the storage and transaction. How about making ScalarDbDao an interface (or abstract class) and adding concrete classes something like ScalarDbStorageDao and ScalarDbTransactionDao which contain a storage or a transaction, respectively, instead of accepting either of storage or transaction at method level (e.g., get()) ? I'm not sure it's a reasonable suggestion in terms of release schedule and so on, though.

@@ -18,7 +19,17 @@
public class TableMetadataService {

private final DistributedStorageAdmin storageAdmin;
private final DistributedTransactionAdmin transactionAdmin;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class seems the union of TableMetadataService for storage and for transaction with some internal if branches. How about making TableMetadataService abstract and creating 2 child classes that return the TableMetadata for storage or transaction from something like getTableMetadataInternal()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@komamitsu san,
I have split up table metadata service as suggested.
Thank you.

@inv-jishnu inv-jishnu requested a review from komamitsu June 11, 2025 10:35
@inv-jishnu
Copy link
Contributor Author

@inv-jishnu I think there are many if conditions to check whether it's storage or transaction. This is just an idea, but ScalarDbDao might be a good wrapper of the storage and transaction. How about making ScalarDbDao an interface (or abstract class) and adding concrete classes something like ScalarDbStorageDao and ScalarDbTransactionDao which contain a storage or a transaction, respectively, instead of accepting either of storage or transaction at method level (e.g., get()) ? I'm not sure it's a reasonable suggestion in terms of release schedule and so on, though.

@komamitsu san,
Thank you for the suggestion.
I will work on this in another PR (as you mentioned about the release schedule). I will add the necessary changes in that one.

@@ -144,4 +145,11 @@ public class ExportCommandOptions {
description = "Size of the data chunk to process in a single task (default: 200)",
defaultValue = "200")
protected int dataChunkSize;

@CommandLine.Option(
names = {"--mode", "-sm"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why -sm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brfrn169 san,
-m is already used by the --include-matadata input as this was already present

@CommandLine.Option(
      names = {"--include-metadata", "-m"},
      description = "Include transaction metadata in the exported data (default: false)",
      defaultValue = "false")
  protected boolean includeTransactionMetadata;

I didn't want to change the existing value as it has been there from the always and there may be users who use it.
Then also the --mode in import is -m(which also was added in latest change) Should I change the mode to-mand include metadata to something else like -im` ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brfrn169 san,
After discussing with Yves san, I have set -m for mode and -im to inlcude-metadata to make it consistent with import.
Thank you.

Scan scan =
createScan(namespace, table, null, null, new ArrayList<>(), projectionColumns, limit);
try {
return transaction.getScanner(scan);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use DistributedTransactionManager.getScanner(). You don’t need to begin or commit a transaction when using it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brfrn169 san,
I had initially used DistributedTransactionManager directly but then was a bit confused whether I can use it like that (in import transaction was used in the Dao). So I used transaction. I have updated this as suggested.
Thank you.

exportOptions.getMaxThreadCount() > 0
? exportOptions.getMaxThreadCount()
: Runtime.getRuntime().availableProcessors();
executorService = Executors.newFixedThreadPool(threadCount);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is as follows:

  • The main thread (a single thread) scans the table.
  • Multiple threads process the scan results in batches and write them through a single writer (a single file).

Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct.

@ypeckstadt
Copy link
Contributor

@brfrn169 Hiro would like is to focus on the migration of existing features only for the 3.16 release. So no need to handle/merge this PR yet this week.

Is it best that I add this to the 3.16.1 project when available instead? (and remove it from 3.16 for now)?

@inv-jishnu
Copy link
Contributor Author

@brfrn169 san
I have made changes as suggested.
Please take a look at this again when you get a chance.
Thank you.

@inv-jishnu inv-jishnu requested a review from brfrn169 June 12, 2025 07:54
@brfrn169
Copy link
Collaborator

Is it best that I add this to the 3.16.1 project when available instead? (and remove it from 3.16 for now)?

@ypeckstadt Basically, we don’t include new features in patch versions.

As discussed with @feeblefakie, we’ve decided not to release 3.16 this week but to release it next week instead. So I think we can include this PR in that release. Thanks.

@thongdk8
Copy link
Contributor

FYI, as discussed in another place Slack conv. With the default config of transaction manager consensus-commit when exporting data in transaction mode, the OMM will occur if the export data exceeds the available memory. So we are expecting to run exporting in transaction mode with scalar.db.transaction_manager=single-crud-operation, with single-crud-operation transaction manager the memory usage will be safe as exporting in storage mode.

@thongdk8 thongdk8 self-requested a review June 18, 2025 07:09
@inv-jishnu
Copy link
Contributor Author

@brfrn169 san, @komamitsu san,
I have made updates based on feedback.
Please take a look at this PR when you get a chance.
Thank you

DistributedStorage storage = StorageFactory.create(scalarDbPropertiesFilePath).getStorage();
return createExportManagerWithStorage(storage, scalarDbDao, fileFormat, taskFactory);
} else {
DistributedTransactionManager distributedTransactionManager =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to consider #2740 (comment) ?

@inv-jishnu
Copy link
Contributor Author

@komamitsu san, @brfrn169 san (cc @ypeckstadt )

Based on comment from @thongdk8 , for export operations in transaction mode, is it acceptable to explicitly set scalar.db.transaction_manager to single-crud-operation, even if the user hasn’t specified this value in the database configuration file?

@komamitsu
Copy link
Contributor

@inv-jishnu I think it's a reasonable and safe option. But it depends on the requirement (e.g., users can use consensus-commit to export snapshot data even it has the risk of OOM). Probably, you can talk with @ypeckstadt to clarify the requirement.

@brfrn169
Copy link
Collaborator

Based on comment from @thongdk8 , for export operations in transaction mode, is it acceptable to explicitly set scalar.db.transaction_manager to single-crud-operation, even if the user hasn’t specified this value in the database configuration file?

@inv-jishnu Let me discuss it with @feeblefakie.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants