Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Init collection from #1364

Merged
merged 11 commits into from
Jan 22, 2023
Merged

Init collection from #1364

merged 11 commits into from
Jan 22, 2023

Conversation

generall
Copy link
Member

@generall generall commented Jan 17, 2023

Add a parameter to initialize collection from another collection.
Might be useful for:

  • checking different configurations faster
  • in combination with aliases - allows to emulate re-partition

ToDo:

  • Collection lock (?) ✔️
  • Shards status during data migration (?)
  • Tests ✔️

@generall generall changed the base branch from master to dev January 17, 2023 23:09
@generall generall marked this pull request as ready for review January 18, 2023 20:55

/// Handlers for migration data from one collection into another within single cluster

/// Get a list of local shards, which can be used for migration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the usage of the term migration in this file and this context a bit confusing.
As far as I can see, it is actually a copy and not a migration as the source is not deleted at the end of the routine.

Source and target coexist at the end of the process.

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renames, not sure it became better, though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time, I like it better. It is internal stuff anyway, we can rename it later 👍

source_collection: &CollectionId,
) -> Result<(), StorageError> {
let collection = self.get_collection(source_collection).await?;
let collection_vectors_schema = collection.state().await.config.params.vectors;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this preventing users from changing the distance function used for measuring distance between vectors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and one reason for this is if you try to migrate from Cosine => Dot it might not work as expected, as we normalized the vectors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's unfortunate, would have been a good use case.
An important point for the docs I guess 📝


let source_collection =
handle_get_collection(collections_read.get(source_collection_name))?;
let _updates_guard = source_collection.lock_updates().await;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this properly, this is not a transactional copy of the collection as it was at a given point in time.
As we progress through the offset, we will fetch the latest values for each point.
Update occurring on already fetched ids will be missed.

Am I understanding this right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do not forward updates into the new collection automatically. But the lock allows the user to start pushing updates into both collections from the external source, and, therefore, switch from one collection to another transparently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, it could be in the docs as well 📝

@generall generall merged commit 61d970e into dev Jan 22, 2023
generall added a commit that referenced this pull request Feb 6, 2023
* WIP

* WIP: add shards selection

* fmt

* WIP: collection init from other collection

* fmt

* fix bugs

* add collection write lock + use collection manager runtime instead of search runtime

* fmt

* integration test

* better names ?

* payload indexes transfer
@agourlay agourlay mentioned this pull request Feb 7, 2023
@generall generall mentioned this pull request Apr 19, 2023
8 tasks
@agourlay agourlay deleted the init-collection-from branch July 12, 2023 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants