Added `move_dataset` function to be able to move datasets to a different project #1163
Conversation
Reviewer's Guide

This PR implements dataset relocation across projects by adding a high-level move_dataset operation, extending metastore and warehouse support for project_id changes, refactoring the catalog's rename logic, and enriching tests with a shared table_row_count utility and dedicated move_dataset scenarios.

Sequence diagram for the move_dataset operation:

```mermaid
sequenceDiagram
    actor User
    participant dc_datasets as dc.datasets.move_dataset
    participant Session
    participant Catalog
    participant Metastore
    participant Warehouse
    User->>dc_datasets: move_dataset(name, namespace, project, new_namespace, new_project)
    dc_datasets->>Session: get(session, in_memory)
    dc_datasets->>Catalog: get_dataset(name, Metastore.get_project(project, namespace))
    dc_datasets->>Metastore: get_project(new_project, new_namespace)
    dc_datasets->>Catalog: update_dataset(dataset, project_id=new_project_id)
    Catalog->>Metastore: update_dataset(dataset, project_id)
    Metastore->>Metastore: get_project_by_id(project_id)
    Metastore-->>Catalog: updated dataset
    Catalog->>Warehouse: rename_dataset_tables(dataset, dataset_updated)
    Warehouse-->>Catalog: (tables renamed if needed)
    Catalog-->>dc_datasets: updated dataset
    dc_datasets-->>User: (done)
```
Class diagram for new and updated dataset movement logic:

```mermaid
classDiagram
    class Metastore {
        +get_project(name, namespace_name, conn)
        +get_project_by_id(project_id, conn)
        +update_dataset(dataset, conn, **kwargs)
    }
    class Catalog {
        +update_dataset(dataset, conn, **kwargs)
        +get_dataset(name, project)
    }
    class Warehouse {
        +rename_dataset_table(dataset, old_name, new_name, old_version, new_version)
        +rename_dataset_tables(dataset, dataset_updated)
    }
    class dc.datasets {
        +move_dataset(name, namespace, project, new_namespace, new_project, session, in_memory)
    }
    Metastore <|-- Catalog
    Catalog o-- Warehouse
    dc.datasets ..> Catalog : uses
    dc.datasets ..> Metastore : uses
    dc.datasets ..> Warehouse : indirectly via Catalog
```
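For orientation, a minimal usage sketch of the new API, following the signature in the diagrams and the calls in the tests below (the namespace/project names here are made up):

```python
import datachain as dc

# Move "my_ds" from dev/analytics to prod/analytics.
# Metadata (project_id) is updated first; the warehouse tables backing
# every dataset version are then renamed to match the new location.
dc.move_dataset(
    "my_ds",
    namespace="dev",
    project="analytics",
    new_namespace="prod",
    new_project="analytics",
)
```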
Hey @ilongin - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/data_storage/metastore.py:991` </location>
<code_context>
```diff
             values[field] = json.dumps(value)
             dataset_values[field] = DatasetRecord.parse_schema(value)
+        elif field == "project_id":
+            if not value:
+                raise ValueError("Cannot set empty project_id for dataset")
+            dataset_values["project"] = self.get_project_by_id(value)
+            values[field] = value
```
</code_context>
<issue_to_address>
Strict check for falsy project_id may reject valid values.
'if not value' will also reject 0, which may be a valid project_id. Use 'if value is None' to only reject None values.
</issue_to_address>
<suggested_fix>
```
<<<<<<< SEARCH
        elif field == "project_id":
            if not value:
                raise ValueError("Cannot set empty project_id for dataset")
            dataset_values["project"] = self.get_project_by_id(value)
            values[field] = value
=======
        elif field == "project_id":
            if value is None:
                raise ValueError("Cannot set empty project_id for dataset")
            dataset_values["project"] = self.get_project_by_id(value)
            values[field] = value
>>>>>>> REPLACE
```
</suggested_fix>
### Comment 2
<location> `tests/unit/lib/test_datachain.py:3438` </location>
<code_context>
```diff
+        session=test_session,
+    )
+
+    if new_project != old_project:
+        with pytest.raises(DatasetNotFoundError):
+            catalog.get_dataset(ds_name, old_project)
+    else:
+        catalog.get_dataset(ds_name, old_project)
```
</code_context>
<issue_to_address>
Consider adding a test for moving a dataset to a project where a dataset with the same name already exists.
Please add a test to cover the case where the destination project already has a dataset with the same name, and verify the expected behavior (error, overwrite, or merge).
Suggested implementation:
```python
    dc.move_dataset(
        ds_name,
        namespace=old_project.namespace.name,
        project=old_project.name,
        new_namespace=new_project.namespace.name,
        new_project=new_project.name,
        session=test_session,
    )

    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)

    dataset_updated = catalog.get_dataset(ds_name, new_project)

    # check if dataset tables are renamed correctly as well
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:
            assert old_table_name == new_table_name


# --- New test: moving to a project where a dataset with the same name exists ---
def test_move_dataset_to_project_with_existing_dataset(
    catalog, test_session, old_project, new_project, ds_name
):
    # Create a dataset in the destination project with the same name
    catalog.create_dataset(
        name=ds_name,
        project=new_project,
        session=test_session,
    )
    # Attempt to move the dataset and expect an error
    with pytest.raises(DatasetAlreadyExistsError):
        dc.move_dataset(
            ds_name,
            namespace=old_project.namespace.name,
            project=old_project.name,
            new_namespace=new_project.namespace.name,
            new_project=new_project.name,
            session=test_session,
        )
```
- Ensure that `DatasetAlreadyExistsError` is imported or available in the test context.
- You may need to adjust the fixture or setup for `old_project`, `new_project`, and `ds_name` to match your test suite's conventions.
- If your codebase expects a different error or behavior (e.g., overwrite or merge), adjust the assertion accordingly.
</issue_to_address>
elif field == "project_id": | ||
if not value: | ||
raise ValueError("Cannot set empty project_id for dataset") | ||
dataset_values["project"] = self.get_project_by_id(value) | ||
values[field] = value |
suggestion (bug_risk): Strict check for falsy project_id may reject valid values.
'if not value' will also reject 0, which may be a valid project_id. Use 'if value is None' to only reject None values.
elif field == "project_id": | |
if not value: | |
raise ValueError("Cannot set empty project_id for dataset") | |
dataset_values["project"] = self.get_project_by_id(value) | |
values[field] = value | |
elif field == "project_id": | |
if value is None: | |
raise ValueError("Cannot set empty project_id for dataset") | |
dataset_values["project"] = self.get_project_by_id(value) | |
values[field] = value |
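The distinction matters because `0` is falsy in Python; a quick sketch of the difference:

```python
value = 0  # a hypothetical (but plausibly valid) project_id

if not value:
    print("rejected by 'if not value'")      # triggers for 0, "", [], and None

if value is None:
    print("rejected by 'if value is None'")  # triggers only for None
```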
```python
    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
```
suggestion (testing): Consider adding a test for moving a dataset to a project where a dataset with the same name already exists.
Please add a test to cover the case where the destination project already has a dataset with the same name, and verify the expected behavior (error, overwrite, or merge).
Suggested implementation:

```python
    dc.move_dataset(
        ds_name,
        namespace=old_project.namespace.name,
        project=old_project.name,
        new_namespace=new_project.namespace.name,
        new_project=new_project.name,
        session=test_session,
    )

    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)

    dataset_updated = catalog.get_dataset(ds_name, new_project)

    # check if dataset tables are renamed correctly as well
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:
            assert old_table_name == new_table_name


# --- New test: moving to a project where a dataset with the same name exists ---
def test_move_dataset_to_project_with_existing_dataset(
    catalog, test_session, old_project, new_project, ds_name
):
    # Create a dataset in the destination project with the same name
    catalog.create_dataset(
        name=ds_name,
        project=new_project,
        session=test_session,
    )
    # Attempt to move the dataset and expect an error
    with pytest.raises(DatasetAlreadyExistsError):
        dc.move_dataset(
            ds_name,
            namespace=old_project.namespace.name,
            project=old_project.name,
            new_namespace=new_project.namespace.name,
            new_project=new_project.name,
            session=test_session,
        )
```
- Ensure that `DatasetAlreadyExistsError` is imported or available in the test context.
- You may need to adjust the fixture or setup for `old_project`, `new_project`, and `ds_name` to match your test suite's conventions.
- If your codebase expects a different error or behavior (e.g., overwrite or merge), adjust the assertion accordingly.
```diff
@@ -16,7 +16,7 @@
 from datachain.lib.listing import parse_listing_uri
 from datachain.query.dataset import DatasetQuery
 from datachain.sql.types import Float32, Int, Int64
-from tests.utils import assert_row_names, dataset_dependency_asdict
+from tests.utils import assert_row_names, dataset_dependency_asdict, table_row_count
```
issue (code-quality): Don't import test modules. (`dont-import-test-modules`)

Explanation: Tests should be self-contained and not depend on each other. If a helper function is used by multiple tests, define it in a helper module instead of importing one test from the other.
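As a sketch of what the linter is asking for, the shared helper lives in a plain helper module rather than a test module. The body below is illustrative only — it assumes the `db` wrapper exposes `has_table` and `execute`, which this PR does not confirm:

```python
# tests/utils.py -- a helper module, not a test module
import sqlalchemy as sa


def table_row_count(db, table_name):
    """Return the row count of a table, or None if the table doesn't exist."""
    if not db.has_table(table_name):  # assumed API
        return None
    query = sa.select(sa.func.count()).select_from(sa.table(table_name))
    return next(iter(db.execute(query)))[0]
```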
```python
from tests.utils import (
    ANY_VALUE,
    df_equal,
    skip_if_not_sqlite,
    sort_df,
    sorted_dicts,
    table_row_count,
)
```
issue (code-quality): Don't import test modules. (`dont-import-test-modules`) — see explanation above.
```python
    for _ in range(2):
        (
            dc.read_values(num=[1, 2, 3], session=test_session)
            .settings(namespace=old_project.namespace.name, project=old_project.name)
            .save(ds_name)
        )
```
issue (code-quality): Avoid loops in tests. (`no-loop-in-tests`)

Explanation: Avoid complex code, like loops, in test functions. Google's software engineering guidelines say: "Clear tests are trivially correct upon inspection". To reach that, avoid complex code in tests:

- loops
- conditionals

Some ways to fix this:

- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off the screen. It doesn't take much logic to make a test more difficult to reason about.

(Software Engineering at Google / Don't Put Logic in Tests)
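One way to apply that advice to the loop above is to push the two `save` calls into a fixture, so the test body stays linear (the fixture and argument names are assumed from the surrounding tests):

```python
import pytest

import datachain as dc


@pytest.fixture
def dataset_with_two_versions(test_session, old_project, ds_name):
    # Setup moved out of the test body: saving twice creates two versions.
    for _ in range(2):
        (
            dc.read_values(num=[1, 2, 3], session=test_session)
            .settings(namespace=old_project.namespace.name, project=old_project.name)
            .save(ds_name)
        )
    return ds_name
```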
```python
    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)
```
issue (code-quality): Avoid conditionals in tests. (`no-conditionals-in-tests`)

Explanation: Avoid complex code, like conditionals, in test functions — the same guidance and fixes as for `no-loop-in-tests` above apply.
```python
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:
            assert old_table_name == new_table_name
        else:
            assert table_row_count(catalog.warehouse.db, old_table_name) is None
            assert table_row_count(catalog.warehouse.db, new_table_name) == 3
```
issue (code-quality): Avoid loops in tests. (`no-loop-in-tests`) — see explanation above.
```python
        if old_project == new_project:
            assert old_table_name == new_table_name
        else:
            assert table_row_count(catalog.warehouse.db, old_table_name) is None
```
issue (code-quality): Avoid conditionals in tests. (`no-conditionals-in-tests`) — see explanation above.
```python
        rows = list(self.db.execute(query, conn=conn))
        if not rows:
            raise ProjectNotFoundError(f"Project with id {project_id} not found.")
```
issue (code-quality): We've found these issues:

- Use named expression to simplify assignment and conditional (`use-named-expression`)
- Lift code into else after jump in control flow (`reintroduce-else`)
- Swap if/else branches (`swap-if-else-branches`)
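Taken together, those suggestions would reshape the snippet roughly like this sketch (the method name comes from this PR; the query construction and return value are elided assumptions):

```python
def get_project_by_id(self, project_id, conn=None):
    query = ...  # built as in the original method
    # The named expression folds the assignment into the condition, and
    # raising first puts the happy path in a straight line below it.
    if not (rows := list(self.db.execute(query, conn=conn))):
        raise ProjectNotFoundError(f"Project with id {project_id} not found.")
    return rows[0]
```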
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1163      +/-   ##
==========================================
- Coverage   88.72%   88.63%    -0.10%
==========================================
  Files         152      152
  Lines       13549    13582      +33
  Branches     1885     1888       +3
==========================================
+ Hits        12022    12038      +16
- Misses       1086     1098      +12
- Partials      441      446       +5
```

Flags with carried forward coverage won't be shown.
Deploying datachain-documentation with Cloudflare Pages

- Latest commit: 102950a
- Status: ✅ Deploy successful!
- Preview URL: https://98774507.datachain-documentation.pages.dev
- Branch Preview URL: https://ilongin-1161-move-dataset-to.datachain-documentation.pages.dev
```diff
@@ -373,6 +374,24 @@ def rename_dataset_table(
 
         self.db.rename_table(old_ds_table_name, new_ds_table_name)
 
+    def rename_dataset_tables(
```
can I move only a particular version? why not?
This task was about moving a whole dataset from project to project. What you suggest is moving a particular version across different datasets. We can create a separate issue for that if it's really needed?
what is the difference? why not allow this right away? E.g. I want to make only a particular version of the dataset a production version. Is it only about accepting (limiting the operation to) a version?
(probably we also need a duplicate operation)
In my impression it was about moving versions 🙂
The use case I have in mind: I was massaging `my_ds`, created 13 versions, and I'd like to "promote" this 13th to the latest major version of `prod/actions/animal_planet`, which might be, let's say, `animal_planet@4.2.2`.
I need something like `dc.datasets.move("my_ds", "prod.actions.animal_planet")` to create `animal_planet@5.0.0`.
Note, there is an assumption that it increases the major version by default.
Moving the whole dataset seems like a separate operation. We do need it but I'm not sure it's common.
I think we need both. Moving the whole dataset is much needed - just to rename, reorganize things.
My 2cs: I feel the promotion scenario is much needed too, but will be more niche.

> Note, there is an assumption that it increases the major version by default.

this seems too complicated, seems like a different operation (from the rename / move that we talked about in the first place) 🤔
A few important questions:

- transactional semantics - what happens if it fails in the middle, e.g. we are renaming tables and 50 out of 100 succeeded - what happens? Will people be able to run it again?
- what happens if, in a single session, I:
  - created a dataset
  - renamed it
  - the session failed with an exception

  will the created dataset be deleted (we have special logic that handles this - will it handle renames?)
  (same btw with deletions - are we going to restore them?)
- do we need to lock the operation if other jobs are running that are accessing this dataset?
Good questions. We were speaking about these transactional issues before, but somehow it was never really a priority.
Regarding your questions:

We first update metadata and then we start to rename warehouse tables. If it fails in the middle, the tables that were renamed will be usable, and those that were not will be in some limbo state. If the user starts to use a dataset version whose table is not renamed, an error will be thrown. It's impossible to run it again without some manual intervention. BTW this could all be much easier if we can just use

What do you mean by session? Our

Deletions are also not under a transaction, but the situation is a little better, as the dataset will just stay present with a smaller number of dataset versions (those that were not removed). The worst that can happen is that some dataset table in the warehouse is not removed while its metadata is, which means it will hang there forever - but it's easy to create some cleanup job for hanging dataset tables that are not attached to any metadata.

Yes, probably something like that needs to happen, but we need to brainstorm about it. These are all general questions which need to be addressed in a separate issue IMO, as they touch the whole codebase - at the time we didn't want to deal with it to not lose time, but maybe now the time has come to refactor everything.
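The cleanup job mentioned above could look roughly like this sketch — `dataset_table_name` appears in this PR, but `list_datasets`, `list_dataset_tables`, and `drop_table` are hypothetical names:

```python
def cleanup_orphan_dataset_tables(metastore, warehouse):
    """Drop warehouse tables no longer referenced by any dataset version."""
    # Every table name the metastore still knows about.
    known = {
        warehouse.dataset_table_name(dataset, version.version)
        for dataset in metastore.list_datasets()  # hypothetical
        for version in dataset.versions
    }
    # Drop any physical dataset table that metadata no longer points to.
    for table_name in warehouse.list_dataset_tables():  # hypothetical
        if table_name not in known:
            warehouse.drop_table(table_name)  # hypothetical
```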
```python
    session = cloud_test_catalog.session
    catalog = cloud_test_catalog.catalog

    dc.move_dataset(
```
I thought we wanted to do `datasets.move`?
```python
        session=session,
    )

    dataset = catalog.get_dataset(dogs_dataset.name, project)
```
let's use `dc.read_dataset` instead of internal calls
```python
    assert dataset.get_version("1.0.0").num_objects == expected_table_row_count


def test_move_dataset(cloud_test_catalog, dogs_dataset, project):
```
let's add a test for when the next query after the move fails (e.g. some UDF breaks)
("old", "old", "old", "old"), | ||
], | ||
) | ||
def test_move_dataset( |
it doesn't look like a unit test, what is the difference with the func test above?
```python
    assert table_row_count(catalog.warehouse.db, new_table_name) == 3


def test_move_dataset_wrong_old_project(test_session, project):
```
this is not a unit test, this is a regular func test
we are going into more complicated operations I think - we didn't have much happening before
we should try to design things in a way that doesn't require that (the implementation is too complicated). E.g. don't rename tables - use UUIDs as names, etc, etc
we can really try to structure the code / keep it simple so that it is also not required, or everything happens within a single query
yes, exactly. And in general I'm for UUIDs more (vs names, or internal DB ids like ids from RDS)
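A sketch of the UUID idea being floated here - if the physical table name were derived from an immutable UUID assigned at creation, a move would touch only metastore rows and no warehouse renames would be needed (the naming scheme below is purely illustrative):

```python
import uuid

# Assigned once when the dataset version is created, never changed afterwards.
version_uuid = uuid.uuid4()
table_name = f"ds_{version_uuid.hex}"

# Moving the dataset then only rewrites metadata (project_id, name);
# the physical table keeps its name, so the move can be a single transaction.
print(table_name)
```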
TODO
Summary by Sourcery

Implement dataset relocation across projects by introducing a `move_dataset` API that updates the dataset's project association and renames underlying tables accordingly, along with metastore, catalog, and warehouse support and accompanying tests.

New Features:

- Introduce a `move_dataset` function in the DataChain API to move datasets between namespaces and projects
- Expose `move_dataset` at the top-level DataChain module

Enhancements:

- Add `get_project_by_id` and `project_id` handling in the metastore to support project reassignment
- Add `rename_dataset_tables` in the warehouse for batch renaming of version tables on metadata changes
- Refactor `catalog.update_dataset` to use the new warehouse table-rename method instead of ad-hoc logic

Tests:

- Add tests for `move_dataset`, including valid moves and error cases for wrong projects
- Add a `table_row_count` helper for database verification