-
Notifications
You must be signed in to change notification settings - Fork 37
Add transaction mode support for data export in dataloader core and CLI #2740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@inv-jishnu I think there are many |
@@ -18,7 +19,17 @@ | |||
public class TableMetadataService { | |||
|
|||
private final DistributedStorageAdmin storageAdmin; | |||
private final DistributedTransactionAdmin transactionAdmin; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class seems the union of TableMetadataService
for storage and for transaction with some internal if
branches. How about making TableMetadataService
abstract and creating 2 child classes that return the TableMetadata for storage or transaction from something like getTableMetadataInternal()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@komamitsu san,
I have split up table metadata service as suggested.
Thank you.
@komamitsu san, |
@@ -144,4 +145,11 @@ public class ExportCommandOptions { | |||
description = "Size of the data chunk to process in a single task (default: 200)", | |||
defaultValue = "200") | |||
protected int dataChunkSize; | |||
|
|||
@CommandLine.Option( | |||
names = {"--mode", "-sm"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why -sm
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brfrn169 san,
-m
is already used by the --include-matadata
input as this was already present
@CommandLine.Option(
names = {"--include-metadata", "-m"},
description = "Include transaction metadata in the exported data (default: false)",
defaultValue = "false")
protected boolean includeTransactionMetadata;
I didn't want to change the existing value as it has been there from the always and there may be users who use it.
Then also the --mode in import is
-m(which also was added in latest change) Should I change the mode to
-mand include metadata to something else like
-im` ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brfrn169 san,
After discussing with Yves san, I have set -m
for mode and -im
to inlcude-metadata to make it consistent with import.
Thank you.
Scan scan = | ||
createScan(namespace, table, null, null, new ArrayList<>(), projectionColumns, limit); | ||
try { | ||
return transaction.getScanner(scan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use DistributedTransactionManager.getScanner()
. You don’t need to begin or commit a transaction when using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brfrn169 san,
I had initially used DistributedTransactionManager
directly but then was a bit confused whether I can use it like that (in import transaction
was used in the Dao). So I used transaction
. I have updated this as suggested.
Thank you.
exportOptions.getMaxThreadCount() > 0 | ||
? exportOptions.getMaxThreadCount() | ||
: Runtime.getRuntime().availableProcessors(); | ||
executorService = Executors.newFixedThreadPool(threadCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is as follows:
- The main thread (a single thread) scans the table.
- Multiple threads process the scan results in batches and write them through a single writer (a single file).
Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct.
@brfrn169 Hiro would like is to focus on the migration of existing features only for the 3.16 release. So no need to handle/merge this PR yet this week. Is it best that I add this to the 3.16.1 project when available instead? (and remove it from 3.16 for now)? |
@brfrn169 san |
@ypeckstadt Basically, we don’t include new features in patch versions. As discussed with @feeblefakie, we’ve decided not to release 3.16 this week but to release it next week instead. So I think we can include this PR in that release. Thanks. |
FYI, as discussed in another place Slack conv. With the default config of transaction manager |
@brfrn169 san, @komamitsu san, |
DistributedStorage storage = StorageFactory.create(scalarDbPropertiesFilePath).getStorage(); | ||
return createExportManagerWithStorage(storage, scalarDbDao, fileFormat, taskFactory); | ||
} else { | ||
DistributedTransactionManager distributedTransactionManager = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to consider #2740 (comment) ?
@komamitsu san, @brfrn169 san (cc @ypeckstadt ) Based on comment from @thongdk8 , for export operations in transaction mode, is it acceptable to explicitly set scalar.db.transaction_manager to single-crud-operation, even if the user hasn’t specified this value in the database configuration file? |
@inv-jishnu I think it's a reasonable and safe option. But it depends on the requirement (e.g., users can use |
@inv-jishnu Let me discuss it with @feeblefakie. |
Description
This PR adds support for transactional mode in data export within the Dataloader core, along with corresponding updates to the CLI.
Previously, the export functionality only supported storage mode because there was no transactional scanner API that allowed streaming data without loading everything into memory. With the introduction of a suitable transactional scanner, we can now support export in transaction mode as well.
Related issues and/or PRs
NA
Changes made
On core
On CLI
Checklist
Additional notes (optional)
NA
Release notes
Added transaction mode in data export on data loader core and CLI