Added `move_dataset` function to be able to move datasets to a different project #1163
Conversation
Reviewer's Guide

This PR implements dataset relocation across projects by adding a high-level move_dataset operation, extending metastore and warehouse support for project_id changes, refactoring the catalog's rename logic, and enriching tests with a shared table_row_count utility and dedicated move_dataset scenarios.

Sequence diagram for the move_dataset operation:

```mermaid
sequenceDiagram
    actor User
    participant dc_datasets as dc.datasets.move_dataset
    participant Session
    participant Catalog
    participant Metastore
    participant Warehouse
    User->>dc_datasets: move_dataset(name, namespace, project, new_namespace, new_project)
    dc_datasets->>Session: get(session, in_memory)
    dc_datasets->>Catalog: get_dataset(name, Metastore.get_project(project, namespace))
    dc_datasets->>Metastore: get_project(new_project, new_namespace)
    dc_datasets->>Catalog: update_dataset(dataset, project_id=new_project_id)
    Catalog->>Metastore: update_dataset(dataset, project_id)
    Metastore->>Metastore: get_project_by_id(project_id)
    Metastore-->>Catalog: updated dataset
    Catalog->>Warehouse: rename_dataset_tables(dataset, dataset_updated)
    Warehouse-->>Catalog: (tables renamed if needed)
    Catalog-->>dc_datasets: updated dataset
    dc_datasets-->>User: (done)
```
Class diagram for new and updated dataset movement logic:

```mermaid
classDiagram
    class Metastore {
        +get_project(name, namespace_name, conn)
        +get_project_by_id(project_id, conn)
        +update_dataset(dataset, conn, **kwargs)
    }
    class Catalog {
        +update_dataset(dataset, conn, **kwargs)
        +get_dataset(name, project)
    }
    class Warehouse {
        +rename_dataset_table(dataset, old_name, new_name, old_version, new_version)
        +rename_dataset_tables(dataset, dataset_updated)
    }
    class dc.datasets {
        +move_dataset(name, namespace, project, new_namespace, new_project, session, in_memory)
    }
    Metastore <|-- Catalog
    Catalog o-- Warehouse
    dc.datasets ..> Catalog : uses
    dc.datasets ..> Metastore : uses
    dc.datasets ..> Warehouse : indirectly via Catalog
```
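For orientation, a minimal usage sketch of the new API, following the signature in the diagrams and the calls in the tests below (the namespace/project names here are made up):

```python
import datachain as dc

# Move "my_ds" from dev/analytics to prod/analytics.
# Metadata (project_id) is updated first; the warehouse tables backing
# every dataset version are then renamed to match the new location.
dc.move_dataset(
    "my_ds",
    namespace="dev",
    project="analytics",
    new_namespace="prod",
    new_project="analytics",
)
```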
Hey @ilongin - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/data_storage/metastore.py:991` </location>
<code_context>
```diff
             values[field] = json.dumps(value)
             dataset_values[field] = DatasetRecord.parse_schema(value)
+        elif field == "project_id":
+            if not value:
+                raise ValueError("Cannot set empty project_id for dataset")
+            dataset_values["project"] = self.get_project_by_id(value)
+            values[field] = value
```
</code_context>
<issue_to_address>
Strict check for falsy project_id may reject valid values.
'if not value' will also reject 0, which may be a valid project_id. Use 'if value is None' to only reject None values.
</issue_to_address>
<suggested_fix>
```
<<<<<<< SEARCH
        elif field == "project_id":
            if not value:
                raise ValueError("Cannot set empty project_id for dataset")
            dataset_values["project"] = self.get_project_by_id(value)
            values[field] = value
=======
        elif field == "project_id":
            if value is None:
                raise ValueError("Cannot set empty project_id for dataset")
            dataset_values["project"] = self.get_project_by_id(value)
            values[field] = value
>>>>>>> REPLACE
```
</suggested_fix>
### Comment 2
<location> `tests/unit/lib/test_datachain.py:3438` </location>
<code_context>
```diff
+        session=test_session,
+    )
+
+    if new_project != old_project:
+        with pytest.raises(DatasetNotFoundError):
+            catalog.get_dataset(ds_name, old_project)
+    else:
+        catalog.get_dataset(ds_name, old_project)
```
</code_context>
<issue_to_address>
Consider adding a test for moving a dataset to a project where a dataset with the same name already exists.
Please add a test to cover the case where the destination project already has a dataset with the same name, and verify the expected behavior (error, overwrite, or merge).
Suggested implementation:
```python
    dc.move_dataset(
        ds_name,
        namespace=old_project.namespace.name,
        project=old_project.name,
        new_namespace=new_project.namespace.name,
        new_project=new_project.name,
        session=test_session,
    )

    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)

    dataset_updated = catalog.get_dataset(ds_name, new_project)

    # check if dataset tables are renamed correctly as well
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:
            assert old_table_name == new_table_name


# --- New test: moving to a project where a dataset with the same name exists ---
def test_move_dataset_to_project_with_existing_dataset(
    catalog, test_session, old_project, new_project, ds_name
):
    # Create a dataset in the destination project with the same name
    catalog.create_dataset(
        name=ds_name,
        project=new_project,
        session=test_session,
    )
    # Attempt to move the dataset and expect an error
    with pytest.raises(DatasetAlreadyExistsError):
        dc.move_dataset(
            ds_name,
            namespace=old_project.namespace.name,
            project=old_project.name,
            new_namespace=new_project.namespace.name,
            new_project=new_project.name,
            session=test_session,
        )
```
- Ensure that `DatasetAlreadyExistsError` is imported or available in the test context.
- You may need to adjust the fixture or setup for `old_project`, `new_project`, and `ds_name` to match your test suite's conventions.
- If your codebase expects a different error or behavior (e.g., overwrite or merge), adjust the assertion accordingly.
</issue_to_address>
elif field == "project_id": | ||
if not value: | ||
raise ValueError("Cannot set empty project_id for dataset") | ||
dataset_values["project"] = self.get_project_by_id(value) | ||
values[field] = value |
suggestion (bug_risk): Strict check for falsy project_id may reject valid values.
'if not value' will also reject 0, which may be a valid project_id. Use 'if value is None' to only reject None values.
elif field == "project_id": | |
if not value: | |
raise ValueError("Cannot set empty project_id for dataset") | |
dataset_values["project"] = self.get_project_by_id(value) | |
values[field] = value | |
elif field == "project_id": | |
if value is None: | |
raise ValueError("Cannot set empty project_id for dataset") | |
dataset_values["project"] = self.get_project_by_id(value) | |
values[field] = value |
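The distinction matters because `0` is falsy in Python; a quick sketch of the difference:

```python
value = 0  # a hypothetical (but plausibly valid) project_id

if not value:
    print("rejected by 'if not value'")      # triggers for 0, "", [], and None

if value is None:
    print("rejected by 'if value is None'")  # triggers only for None
```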
```python
    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
```
suggestion (testing): Consider adding a test for moving a dataset to a project where a dataset with the same name already exists.
Please add a test to cover the case where the destination project already has a dataset with the same name, and verify the expected behavior (error, overwrite, or merge).
Suggested implementation:

```python
    dc.move_dataset(
        ds_name,
        namespace=old_project.namespace.name,
        project=old_project.name,
        new_namespace=new_project.namespace.name,
        new_project=new_project.name,
        session=test_session,
    )

    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)

    dataset_updated = catalog.get_dataset(ds_name, new_project)

    # check if dataset tables are renamed correctly as well
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:
            assert old_table_name == new_table_name


# --- New test: moving to a project where a dataset with the same name exists ---
def test_move_dataset_to_project_with_existing_dataset(
    catalog, test_session, old_project, new_project, ds_name
):
    # Create a dataset in the destination project with the same name
    catalog.create_dataset(
        name=ds_name,
        project=new_project,
        session=test_session,
    )
    # Attempt to move the dataset and expect an error
    with pytest.raises(DatasetAlreadyExistsError):
        dc.move_dataset(
            ds_name,
            namespace=old_project.namespace.name,
            project=old_project.name,
            new_namespace=new_project.namespace.name,
            new_project=new_project.name,
            session=test_session,
        )
```
- Ensure that `DatasetAlreadyExistsError` is imported or available in the test context.
- You may need to adjust the fixture or setup for `old_project`, `new_project`, and `ds_name` to match your test suite's conventions.
- If your codebase expects a different error or behavior (e.g., overwrite or merge), adjust the assertion accordingly.
```diff
@@ -16,7 +16,7 @@
 from datachain.lib.listing import parse_listing_uri
 from datachain.query.dataset import DatasetQuery
 from datachain.sql.types import Float32, Int, Int64
-from tests.utils import assert_row_names, dataset_dependency_asdict
+from tests.utils import assert_row_names, dataset_dependency_asdict, table_row_count
```
issue (code-quality): Don't import test modules. (`dont-import-test-modules`)

Explanation: Tests should be self-contained and not depend on each other. If a helper function is used by multiple tests, define it in a helper module instead of importing one test from the other.
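As a sketch of what the linter is asking for, the shared helper lives in a plain helper module rather than a test module. The body below is illustrative only — it assumes the `db` wrapper exposes `has_table` and `execute`, which this PR does not confirm:

```python
# tests/utils.py -- a helper module, not a test module
import sqlalchemy as sa


def table_row_count(db, table_name):
    """Return the row count of a table, or None if the table doesn't exist."""
    if not db.has_table(table_name):  # assumed API
        return None
    query = sa.select(sa.func.count()).select_from(sa.table(table_name))
    return next(iter(db.execute(query)))[0]
```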
```python
from tests.utils import (
    ANY_VALUE,
    df_equal,
    skip_if_not_sqlite,
    sort_df,
    sorted_dicts,
    table_row_count,
)
```
issue (code-quality): Don't import test modules. (`dont-import-test-modules`) — see explanation above.
```python
    for _ in range(2):
        (
            dc.read_values(num=[1, 2, 3], session=test_session)
            .settings(namespace=old_project.namespace.name, project=old_project.name)
            .save(ds_name)
        )
```
issue (code-quality): Avoid loops in tests. (`no-loop-in-tests`)

Explanation: Avoid complex code, like loops, in test functions. Google's software engineering guidelines say: "Clear tests are trivially correct upon inspection". To reach that, avoid complex code in tests:

- loops
- conditionals

Some ways to fix this:

- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off the screen. It doesn't take much logic to make a test more difficult to reason about.

(Software Engineering at Google / Don't Put Logic in Tests)
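One way to apply that advice to the loop above is to push the two `save` calls into a fixture, so the test body stays linear (the fixture and argument names are assumed from the surrounding tests):

```python
import pytest

import datachain as dc


@pytest.fixture
def dataset_with_two_versions(test_session, old_project, ds_name):
    # Setup moved out of the test body: saving twice creates two versions.
    for _ in range(2):
        (
            dc.read_values(num=[1, 2, 3], session=test_session)
            .settings(namespace=old_project.namespace.name, project=old_project.name)
            .save(ds_name)
        )
    return ds_name
```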
```python
    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)
```
issue (code-quality): Avoid conditionals in tests. (`no-conditionals-in-tests`)

Explanation: Avoid complex code, like conditionals, in test functions — the same guidance and fixes as for `no-loop-in-tests` above apply.
```python
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:
            assert old_table_name == new_table_name
        else:
            assert table_row_count(catalog.warehouse.db, old_table_name) is None
            assert table_row_count(catalog.warehouse.db, new_table_name) == 3
```
issue (code-quality): Avoid loops in tests. (`no-loop-in-tests`) — see explanation above.
```python
        if old_project == new_project:
            assert old_table_name == new_table_name
        else:
            assert table_row_count(catalog.warehouse.db, old_table_name) is None
```
issue (code-quality): Avoid conditionals in tests. (`no-conditionals-in-tests`) — see explanation above.
```python
        rows = list(self.db.execute(query, conn=conn))
        if not rows:
            raise ProjectNotFoundError(f"Project with id {project_id} not found.")
```
issue (code-quality): We've found these issues:

- Use named expression to simplify assignment and conditional (`use-named-expression`)
- Lift code into else after jump in control flow (`reintroduce-else`)
- Swap if/else branches (`swap-if-else-branches`)
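Taken together, those suggestions would reshape the snippet roughly like this sketch (the method name comes from this PR; the query construction and return value are elided assumptions):

```python
def get_project_by_id(self, project_id, conn=None):
    query = ...  # built as in the original method
    # The named expression folds the assignment into the condition, and
    # raising first puts the happy path in a straight line below it.
    if not (rows := list(self.db.execute(query, conn=conn))):
        raise ProjectNotFoundError(f"Project with id {project_id} not found.")
    return rows[0]
```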
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1163      +/-   ##
==========================================
- Coverage   88.72%   88.63%    -0.10%
==========================================
  Files         152      152
  Lines       13549    13582      +33
  Branches     1885     1888       +3
==========================================
+ Hits        12022    12038      +16
- Misses       1086     1098      +12
- Partials      441      446       +5
```

Flags with carried forward coverage won't be shown.
Deploying datachain-documentation with Cloudflare Pages

- Latest commit: 102950a
- Status: ✅ Deploy successful!
- Preview URL: https://98774507.datachain-documentation.pages.dev
- Branch Preview URL: https://ilongin-1161-move-dataset-to.datachain-documentation.pages.dev
```diff
@@ -373,6 +374,24 @@ def rename_dataset_table(
 
         self.db.rename_table(old_ds_table_name, new_ds_table_name)
 
+    def rename_dataset_tables(
```
can I move only a particular version? why not?
This task was about moving a whole dataset from project to project. What you suggest is moving a particular version across different datasets. We can create a separate issue for that if it's really needed?
what is the difference? why not allow this right away? E.g. I want to make only a particular version of the dataset a production version. Is it only about accepting (limiting the operation to) a version?
(probably we also need a duplicate operation)
In my impression it was about moving versions 🙂
The use case I have in mind: I was massaging `my_ds`, created 13 versions, and I'd like to "promote" this 13th to the latest major version of `prod/actions/animal_planet`, which might be, let's say, `animal_planet@4.2.2`.
I need something like `dc.datasets.move("my_ds", "prod.actions.animal_planet")` to create `animal_planet@5.0.0`.
Note, there is an assumption that it increases the major version by default.
Moving the whole dataset seems like a separate operation. We do need it but I'm not sure it's common.
I think we need both. Moving the whole dataset is much needed - just to rename, reorganize things.
My 2cs: I feel the promotion scenario is much needed too, but will be more niche.

> Note, there is an assumption that it increases the major version by default.

this seems too complicated, seems like a different operation (from the rename / move that we talked about in the first place) 🤔
A few important questions:

- transactional semantics - what happens if it fails in the middle, e.g. we are renaming tables and 50 out of 100 succeeded - what happens? Will people be able to run it again?
- what happens if, in a single session, I:
  - created a dataset
  - renamed it
  - the session failed with an exception

  will the created dataset be deleted (we have special logic that handles this - will it handle renames?)
  (same btw with deletions - are we going to restore them?)
- do we need to lock the operation if other jobs are running that are accessing this dataset?
Good questions. We were speaking about these transactional issues before, but somehow it was never really a priority.
Regarding your questions:

We first update metadata and then we start to rename warehouse tables. If it fails in the middle, the tables that were renamed will be usable, and those that were not will be in some limbo state. If the user starts to use a dataset version whose table is not renamed, an error will be thrown. It's impossible to run it again without some manual intervention. BTW this could all be much easier if we can just use

What do you mean by session? Our

Deletions are also not under a transaction, but the situation is a little better, as the dataset will just stay present with a smaller number of dataset versions (those that were not removed). The worst that can happen is that some dataset table in the warehouse is not removed while its metadata is, which means it will hang there forever - but it's easy to create some cleanup job for hanging dataset tables that are not attached to any metadata.

Yes, probably something like that needs to happen, but we need to brainstorm about it. These are all general questions which need to be addressed in a separate issue IMO, as they touch the whole codebase - at the time we didn't want to deal with it to not lose time, but maybe now the time has come to refactor everything.
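The cleanup job mentioned above could look roughly like this sketch — `dataset_table_name` appears in this PR, but `list_datasets`, `list_dataset_tables`, and `drop_table` are hypothetical names:

```python
def cleanup_orphan_dataset_tables(metastore, warehouse):
    """Drop warehouse tables no longer referenced by any dataset version."""
    # Every table name the metastore still knows about.
    known = {
        warehouse.dataset_table_name(dataset, version.version)
        for dataset in metastore.list_datasets()  # hypothetical
        for version in dataset.versions
    }
    # Drop any physical dataset table that metadata no longer points to.
    for table_name in warehouse.list_dataset_tables():  # hypothetical
        if table_name not in known:
            warehouse.drop_table(table_name)  # hypothetical
```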
```python
    session = cloud_test_catalog.session
    catalog = cloud_test_catalog.catalog

    dc.move_dataset(
```
I thought we wanted to do `datasets.move`?
```python
        session=session,
    )

    dataset = catalog.get_dataset(dogs_dataset.name, project)
```
let's use `dc.read_dataset` instead of internal calls
```python
    assert dataset.get_version("1.0.0").num_objects == expected_table_row_count


def test_move_dataset(cloud_test_catalog, dogs_dataset, project):
```
let's add a test for when the next query after the move fails (e.g. some UDF breaks)
("old", "old", "old", "old"), | ||
], | ||
) | ||
def test_move_dataset( |
it doesn't look like a unit test, what is the difference with the func test above?
```python
    assert table_row_count(catalog.warehouse.db, new_table_name) == 3


def test_move_dataset_wrong_old_project(test_session, project):
```
this is not a unit test, this is a regular func test
we are going into more complicated operations I think - we didn't have much happening before
we should try to design things in a way that doesn't require that (the implementation is too complicated). E.g. don't rename tables - use UUIDs as names, etc, etc
we can really try to structure the code / keep it simple so that it is also not required, or everything happens within a single query
yes, exactly. And in general I'm for UUIDs more (vs names, or internal DB ids like ids from RDS)
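A sketch of the UUID idea being floated here - if the physical table name were derived from an immutable UUID assigned at creation, a move would touch only metastore rows and no warehouse renames would be needed (the naming scheme below is purely illustrative):

```python
import uuid

# Assigned once when the dataset version is created, never changed afterwards.
version_uuid = uuid.uuid4()
table_name = f"ds_{version_uuid.hex}"

# Moving the dataset then only rewrites metadata (project_id, name);
# the physical table keeps its name, so the move can be a single transaction.
print(table_name)
```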
TODO
Summary by Sourcery

Implement dataset relocation across projects by introducing a `move_dataset` API that updates the dataset's project association and renames underlying tables accordingly, along with metastore, catalog, and warehouse support and accompanying tests.

New Features:

- Introduce a `move_dataset` function in the DataChain API to move datasets between namespaces and projects
- Expose `move_dataset` at the top-level DataChain module

Enhancements:

- Add `get_project_by_id` and `project_id` handling in the metastore to support project reassignment
- Add `rename_dataset_tables` in the warehouse for batch renaming of version tables on metadata changes
- Refactor `catalog.update_dataset` to use the new warehouse table-rename method instead of ad-hoc logic

Tests:

- Add tests for `move_dataset`, including valid moves and error cases for wrong projects
- Add a `table_row_count` helper for database verification