
Added move_dataset function to be able to move datasets to a different project #1163


Open
wants to merge 3 commits into main

Conversation

Contributor

@ilongin ilongin commented Jun 21, 2025

TODO

Summary by Sourcery

Implement dataset relocation across projects by introducing a move_dataset API that updates the dataset’s project association and renames underlying tables accordingly, along with metastore, catalog, and warehouse support and accompanying tests.

New Features:

  • Add move_dataset function in the DataChain API to move datasets between namespaces and projects
  • Expose move_dataset at the top-level DataChain module

Enhancements:

  • Add get_project_by_id and project_id handling in metastore to support project reassignment
  • Implement rename_dataset_tables in warehouse for batch renaming of version tables on metadata changes
  • Refactor catalog.update_dataset to use the new warehouse table-rename method instead of ad-hoc logic

Tests:

  • Add unit tests for move_dataset, including valid moves and error cases for wrong projects
  • Add functional tests covering dataset relocation and table row counts after moving
  • Update test utilities to include table_row_count helper for database verification
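A minimal sketch of what such a table_row_count helper might look like (hypothetical; the real helper lives in tests/utils.py, and the warehouse db interface used here is an assumption - the tests only rely on it returning None for a missing table and the row count otherwise):

```python
# Hypothetical sketch only - not the actual implementation in tests/utils.py.
import sqlalchemy as sa


def table_row_count(db, table_name):
    """Return the row count of `table_name`, or None if the table does not exist."""
    if not db.has_table(table_name):  # assumed helper on the warehouse db wrapper
        return None
    query = sa.select(sa.func.count()).select_from(sa.table(table_name))
    return next(iter(db.execute(query)))[0]
```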

Contributor

sourcery-ai bot commented Jun 21, 2025

Reviewer's Guide

This PR implements dataset relocation across projects by adding a high-level move_dataset operation, extending metastore and warehouse support for project_id changes, refactoring catalog’s rename logic, and enriching tests with a shared table_row_count utility and dedicated move_dataset scenarios.

Sequence diagram for move_dataset operation

sequenceDiagram
    actor User
    participant dc_datasets as dc.datasets.move_dataset
    participant Session
    participant Catalog
    participant Metastore
    participant Warehouse

    User->>dc_datasets: move_dataset(name, namespace, project, new_namespace, new_project)
    dc_datasets->>Session: get(session, in_memory)
    dc_datasets->>Catalog: get_dataset(name, Metastore.get_project(project, namespace))
    dc_datasets->>Metastore: get_project(new_project, new_namespace)
    dc_datasets->>Catalog: update_dataset(dataset, project_id=new_project_id)
    Catalog->>Metastore: update_dataset(dataset, project_id)
    Metastore->>Metastore: get_project_by_id(project_id)
    Metastore-->>Catalog: updated dataset
    Catalog->>Warehouse: rename_dataset_tables(dataset, dataset_updated)
    Warehouse-->>Catalog: (tables renamed if needed)
    Catalog-->>dc_datasets: updated dataset
    dc_datasets-->>User: (done)
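For orientation, a minimal usage sketch matching the flow above (argument names follow the diagram and the tests in this PR; the exact public signature may differ):

```python
import datachain as dc

# Move a dataset to another namespace/project; its metadata is re-pointed and
# the per-version warehouse tables are renamed accordingly.
dc.move_dataset(
    "my_dataset",
    namespace="old_namespace",
    project="old_project",
    new_namespace="new_namespace",
    new_project="new_project",
)
```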

Class diagram for new and updated dataset movement logic

classDiagram
    class Metastore {
        +get_project(name, namespace_name, conn)
        +get_project_by_id(project_id, conn)
        +update_dataset(dataset, conn, **kwargs)
    }
    class Catalog {
        +update_dataset(dataset, conn, **kwargs)
        +get_dataset(name, project)
    }
    class Warehouse {
        +rename_dataset_table(dataset, old_name, new_name, old_version, new_version)
        +rename_dataset_tables(dataset, dataset_updated)
    }
    class dc.datasets {
        +move_dataset(name, namespace, project, new_namespace, new_project, session, in_memory)
    }
    Metastore <|-- Catalog
    Catalog o-- Warehouse
    dc.datasets ..> Catalog : uses
    dc.datasets ..> Metastore : uses
    dc.datasets ..> Warehouse : indirectly via Catalog
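A condensed sketch of the batch-rename helper captured in the diagram (method and attribute names follow their usage elsewhere in this PR; exact signatures are assumptions, not the actual implementation):

```python
def rename_dataset_tables(self, dataset, dataset_updated):
    # Rename every version table of `dataset` to match `dataset_updated`.
    for v in dataset.versions:
        old_name = self.dataset_table_name(dataset, v.version)
        new_name = self.dataset_table_name(dataset_updated, v.version)
        if old_name == new_name:
            # Nothing relevant changed for this version, so skip the rename.
            continue
        self.db.rename_table(old_name, new_name)
```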

File-Level Changes

Change Details Files
Introduced move_dataset in the DataChain API layer
  • Defined move_dataset in dc.datasets to update project_id
  • Registered move_dataset in datachain/__init__.py and dc/__init__.py
  • Wired move_dataset to catalog.update_dataset
src/datachain/lib/dc/datasets.py
src/datachain/__init__.py
src/datachain/lib/dc/__init__.py
Extended metastore to handle project ID lookups and dataset moves
  • Added abstract get_project_by_id and concrete implementation
  • Mapped project_id field to a Project object in update_dataset
  • Scoped dataset update query by name and original project_id
src/datachain/data_storage/metastore.py
Refactored catalog.update_dataset to centralize table renaming
  • Removed inline per-version rename logic
  • Invoked new warehouse.rename_dataset_tables helper
  • Returned the updated DatasetRecord
src/datachain/catalog/catalog.py
Enhanced warehouse with batch table renaming
  • Introduced rename_dataset_tables to rename all version tables
  • Skipped renaming when parameters didn’t change
  • Left rename_dataset_table in place with a TODO
src/datachain/data_storage/warehouse.py
Consolidated row-count helper and added move_dataset tests
  • Extracted table_row_count into tests.utils
  • Replaced inline get_table_row_count in unit and functional tests
  • Added parametric unit tests and end-to-end tests for move_dataset
tests/utils.py
tests/unit/lib/test_datachain.py
tests/func/test_datasets.py

Possibly linked issues

  • #0: PR implements move_dataset function, updating catalog, metastore, and warehouse to support moving datasets.


@ilongin ilongin marked this pull request as draft June 21, 2025 00:09
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/data_storage/metastore.py:991` </location>
<code_context>
                     values[field] = json.dumps(value)
                     dataset_values[field] = DatasetRecord.parse_schema(value)
+            elif field == "project_id":
+                if not value:
+                    raise ValueError("Cannot set empty project_id for dataset")
+                dataset_values["project"] = self.get_project_by_id(value)
+                values[field] = value
</code_context>

<issue_to_address>
Strict check for falsy project_id may reject valid values.

'if not value' will also reject 0, which may be a valid project_id. Use 'if value is None' to only reject None values.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
            elif field == "project_id":
                if not value:
                    raise ValueError("Cannot set empty project_id for dataset")
                dataset_values["project"] = self.get_project_by_id(value)
                values[field] = value
=======
            elif field == "project_id":
                if value is None:
                    raise ValueError("Cannot set empty project_id for dataset")
                dataset_values["project"] = self.get_project_by_id(value)
                values[field] = value
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `tests/unit/lib/test_datachain.py:3438` </location>
<code_context>
+        session=test_session,
+    )
+
+    if new_project != old_project:
+        with pytest.raises(DatasetNotFoundError):
+            catalog.get_dataset(ds_name, old_project)
+    else:
+        catalog.get_dataset(ds_name, old_project)
</code_context>

<issue_to_address>
Consider adding a test for moving a dataset to a project where a dataset with the same name already exists.

Please add a test to cover the case where the destination project already has a dataset with the same name, and verify the expected behavior (error, overwrite, or merge).

Suggested implementation:

```python

    dc.move_dataset(
        ds_name,
        namespace=old_project.namespace.name,
        project=old_project.name,
        new_namespace=new_project.namespace.name,
        new_project=new_project.name,
        session=test_session,
    )

    if new_project != old_project:
        with pytest.raises(DatasetNotFoundError):
            catalog.get_dataset(ds_name, old_project)
    else:
        catalog.get_dataset(ds_name, old_project)

    dataset_updated = catalog.get_dataset(ds_name, new_project)

    # check if dataset tables are renamed correctly as well
    for version in [v.version for v in dataset.versions]:
        old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
        new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
        if old_project == new_project:

# --- New test: moving to a project where a dataset with the same name exists ---

def test_move_dataset_to_project_with_existing_dataset(
    dc, catalog, test_session, old_project, new_project, ds_name, DatasetAlreadyExistsError
):
    # Create a dataset in the destination project with the same name
    existing_dataset = catalog.create_dataset(
        name=ds_name,
        project=new_project,
        session=test_session,
    )

    # Attempt to move the dataset and expect an error
    with pytest.raises(DatasetAlreadyExistsError):
        dc.move_dataset(
            ds_name,
            namespace=old_project.namespace.name,
            project=old_project.name,
            new_namespace=new_project.namespace.name,
            new_project=new_project.name,
            session=test_session,
        )

```

- Ensure that `DatasetAlreadyExistsError` is imported or available in the test context.
- You may need to adjust the fixture or setup for `old_project`, `new_project`, and `ds_name` to match your test suite's conventions.
- If your codebase expects a different error or behavior (e.g., overwrite or merge), adjust the assertion accordingly.
</issue_to_address>



@@ -16,7 +16,7 @@
from datachain.lib.listing import parse_listing_uri
from datachain.query.dataset import DatasetQuery
from datachain.sql.types import Float32, Int, Int64
from tests.utils import assert_row_names, dataset_dependency_asdict
from tests.utils import assert_row_names, dataset_dependency_asdict, table_row_count
Contributor

issue (code-quality): Don't import test modules. (dont-import-test-modules)

Explanation: Don't import test modules.

Tests should be self-contained and not depend on each other.

If a helper function is used by multiple tests,
define it in a helper module,
instead of importing one test from the other.

Comment on lines +42 to +49
from tests.utils import (
ANY_VALUE,
df_equal,
skip_if_not_sqlite,
sort_df,
sorted_dicts,
table_row_count,
)
Contributor

issue (code-quality): Don't import test modules. (dont-import-test-modules)


Comment on lines +3420 to +3425
for _ in range(2):
(
dc.read_values(num=[1, 2, 3], session=test_session)
.settings(namespace=old_project.namespace.name, project=old_project.name)
.save(ds_name)
)
Contributor

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation: Avoid complex code, like loops, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +3438 to +3442
if new_project != old_project:
with pytest.raises(DatasetNotFoundError):
catalog.get_dataset(ds_name, old_project)
else:
catalog.get_dataset(ds_name, old_project)
Contributor

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation: Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +3447 to +3455
for version in [v.version for v in dataset.versions]:
old_table_name = catalog.warehouse.dataset_table_name(dataset, version)
new_table_name = catalog.warehouse.dataset_table_name(dataset_updated, version)
if old_project == new_project:
assert old_table_name == new_table_name
else:
assert table_row_count(catalog.warehouse.db, old_table_name) is None

assert table_row_count(catalog.warehouse.db, new_table_name) == 3
Contributor

issue (code-quality): Avoid loops in tests. (no-loop-in-tests)


Comment on lines +3450 to +3453
if old_project == new_project:
assert old_table_name == new_table_name
else:
assert table_row_count(catalog.warehouse.db, old_table_name) is None
Contributor

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +828 to +830
rows = list(self.db.execute(query, conn=conn))
if not rows:
raise ProjectNotFoundError(f"Project with id {project_id} not found.")
Contributor

issue (code-quality): We've found these issues:


codecov bot commented Jun 21, 2025

Codecov Report

Attention: Patch coverage is 82.35294% with 6 lines in your changes missing coverage. Please review.

Project coverage is 88.63%. Comparing base (4edef66) to head (102950a).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/datachain/data_storage/metastore.py 75.00% 2 Missing and 2 partials ⚠️
src/datachain/data_storage/warehouse.py 77.77% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1163      +/-   ##
==========================================
- Coverage   88.72%   88.63%   -0.10%     
==========================================
  Files         152      152              
  Lines       13549    13582      +33     
  Branches     1885     1888       +3     
==========================================
+ Hits        12022    12038      +16     
- Misses       1086     1098      +12     
- Partials      441      446       +5     
Flag Coverage Δ
datachain 88.56% <82.35%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/datachain/__init__.py 100.00% <ø> (ø)
src/datachain/catalog/catalog.py 85.92% <100.00%> (-0.13%) ⬇️
src/datachain/lib/dc/__init__.py 100.00% <100.00%> (ø)
src/datachain/lib/dc/datasets.py 95.00% <100.00%> (+0.33%) ⬆️
src/datachain/data_storage/warehouse.py 87.46% <77.77%> (-1.48%) ⬇️
src/datachain/data_storage/metastore.py 93.81% <75.00%> (-0.54%) ⬇️

... and 3 files with indirect coverage changes


@ilongin ilongin linked an issue Jun 21, 2025 that may be closed by this pull request
@ilongin ilongin marked this pull request as ready for review June 21, 2025 01:38

cloudflare-workers-and-pages bot commented Jul 1, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 102950a
Status: ✅  Deploy successful!
Preview URL: https://98774507.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1161-move-dataset-to.datachain-documentation.pages.dev

View logs

@@ -373,6 +374,24 @@ def rename_dataset_table(

self.db.rename_table(old_ds_table_name, new_ds_table_name)

def rename_dataset_tables(
Member

can I move only a particular version? why not?

Contributor Author
@ilongin ilongin Jul 2, 2025

This task was about moving a whole dataset from one project to another. What you suggest is moving a particular version across different datasets. We can create a separate issue for that if it's really needed?

Member

what is the difference? why not allow this right away? E.g. I want to make only a particular version of the dataset a production version? Is it only about accepting (limiting the operation to version?)

(probably we also need a duplicate operation)

Member

In my impression it was about moving versions 🙂

The use case I have in mind - I was massaging my_ds, created 13 versions, and I'd like to "promote" this 13th to the latest major version of prod/actions/animal_planet, which might be, let's say, animal_planet@4.2.2

I need something like dc.datasets.move("my_ds", "prod.actions.animal_planet") to create animal_planet@5.0.0

Note, there is an assumption that it increases major version by default.

Member

Moving the whole dataset seems like a separate operation. We do need it but I'm not sure it's common.

Member

I think we need both. Moving the whole dataset is much needed - just to rename, reorganize things.

My 2cs: I feel the promotion scenario is much needed, but it will be more niche.

Note, there is an assumption that it increases major version by default.

this seems too complicated; it seems like a different operation (from the rename / move that we talked about in the first place) 🤔

Member
@shcheklein shcheklein left a comment

A few important questions:

  • transactional semantics - what happens if it fails in the middle, e.g. we are renaming tables and 50 out of 100 succeeded - what happens? Will people be able to run it again?
  • what happens if I in a single session:
    1. created a dataset
    2. renamed it
    3. the session failed with an exception

will the created dataset be deleted (we have a special logic that handles this - will it handle renames?)

(same btw with deletions - are we going to restore them?)

  • do we need to lock the operation if other jobs are running that are accessing this dataset?

@ilongin
Contributor Author

ilongin commented Jul 2, 2025

A few important questions:

Good questions. So we were speaking about these transactional issues before, but somehow it was not really a priority.
Anyway, I think there are 2 problems:

  1. Not using DB transactions for the metastore where we can -> for this we also probably need to be able to start a transaction above the metastore, e.g. in the Catalog class, but currently that is not possible - we need to refactor the code and analyze where transactions are actually missing - maybe there is not much that's missing.
  2. Multi-DB transactions -> this would need to be implemented in code, which doesn't solve the main issue - what if the process just dies... unless we have a really complex mechanism which is able to resume after the process comes back up.

Regarding your questions:

  • transactional semantics - what happens if it fails in the middle, e.g. we are renaming tables and 50 out of 100 succeeded - what happens? Will people be able to run it again?

We first update metadata and then we start to rename warehouse tables. If it fails in the middle, the tables that were renamed will be usable, and those that were not will be in some limbo state. If a user starts to use a dataset version whose table was not renamed, an error will be thrown. It's impossible to run it again without some manual intervention. BTW, this could all be much easier if we could just use a UUID in dataset table names instead; we actually had that before, but then the requirement was to have human-readable names.
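For reference, a hypothetical outline of the ordering described here, condensed from the reviewer's guide above (not the actual code):

```python
def move_dataset_outline(catalog, dataset, new_project_id):
    # 1. Metadata is updated first, so the dataset record already points at
    #    the new project.
    dataset_updated = catalog.metastore.update_dataset(
        dataset, project_id=new_project_id
    )
    # 2. Version tables are then renamed one by one; a crash inside this call
    #    leaves already-renamed versions usable and the rest in limbo.
    catalog.warehouse.rename_dataset_tables(dataset, dataset_updated)
    return dataset_updated
```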

  • what happens if I in a single session:

    1. created a dataset
    2. renamed it
    3. the session failed with an exception

will the created dataset be deleted (we have a special logic that handles this - will it handle renames?)

What do you mean by session? Our datachain.query.session.Session object? All those operations are separate from the DB perspective, i.e. there is no single high-level transaction or anything of that kind. So in your example the dataset will be created, and if something fails when updating dataset tables in the warehouse, the same scenario as in the example above will happen - so no, the dataset won't be deleted.

(same btw with deletions - are we going to restore them?)

Deletions are also not under a transaction, but the situation is a little bit better, as the dataset will just stay present with fewer dataset versions (those that were not removed). The worst that can happen is that some dataset table in the warehouse is not removed while its metadata is, which means it will hang there forever; but it's easy to create a cleanup job for hanging dataset tables which are not attached to any metadata.

  • do we need lock the operation if other jobs are running that are accessing this dataset?

Yes, probably something like that needs to happen, but we need to brainstorm about it. These are all general questions which need to be addressed in a separate issue IMO, as this affects the whole codebase - at the time we didn't want to deal with it so as not to lose time, but maybe now the time has come to refactor everything.

@ilongin ilongin requested a review from shcheklein July 2, 2025 13:20
session = cloud_test_catalog.session
catalog = cloud_test_catalog.catalog

dc.move_dataset(
Member

I thought we wanted to do datasets.move ?

session=session,
)

dataset = catalog.get_dataset(dogs_dataset.name, project)
Member

let's use dc.read_dataset instead of internal calls

assert dataset.get_version("1.0.0").num_objects == expected_table_row_count


def test_move_dataset(cloud_test_catalog, dogs_dataset, project):
Member

let's add a test for when the next query after the move fails (e.g. some UDF, and it breaks)

("old", "old", "old", "old"),
],
)
def test_move_dataset(
Member

it doesn't look like a unit test, what is the difference with the func test above?

assert table_row_count(catalog.warehouse.db, new_table_name) == 3


def test_move_dataset_wrong_old_project(test_session, project):
Member

this is not a unit test, this is a regular func test

@shcheklein
Member

So we were speaking about these transactional issues before, but somehow it was not really a priority.

we are going into more complicated operations I think - we didn't have much happening before

Multi DB transactions

we should try to design things in a way that doesn't require that (the implementation is too complicated). E.g. don't rename tables - use UUIDs as names, etc, etc

Not using DB transactions for metastore where we can

we can really try to structure the code / keep it simple so that it is also not required, or so that everything happens within a single query

BTW, this could all be much easier if we could just use a UUID in dataset table names instead; we actually had that before, but then the requirement was to have human-readable names.

yes, exactly. And in general, lean on UUIDs more (vs names, or internal DB ids like ids from RDS)
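To illustrate the idea (a purely hypothetical naming scheme): if version tables were keyed by an immutable UUID, moving or renaming a dataset would become a metadata-only update and no warehouse tables would ever need to be renamed.

```python
import uuid

def dataset_table_name(version_uuid: str) -> str:
    # Hypothetical: the table name depends only on the version's UUID, so
    # renaming or moving the dataset never touches warehouse tables.
    return f"ds_{version_uuid.replace('-', '_')}"

print(dataset_table_name(str(uuid.uuid4())))
```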

Development

Successfully merging this pull request may close these issues:
Add ability to move dataset to different namespace / project