Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate points for shard reads #4301

Merged
merged 11 commits into from
May 24, 2024
Merged

Deduplicate points for shard reads #4301

merged 11 commits into from
May 24, 2024

Conversation

timvisee
Copy link
Member

@timvisee timvisee commented May 22, 2024

Tracked in: #4213

Deduplicate points based on their point ID in all reads from shards.

This is necessary for resharding, where multiple shards may have a single point ID.

The idea was to only deduplicate this while resharding is active, but after a discussion with @generall we decided to enable it at all times. We might benefit from this in other cases too, such as when we have the same point ID across multiple shard keys. At this time it is unclear yet what performance impact that might have, so that's yet to be tested.

Tasks

  • Add tests for every deduplication scenario

Open questions

  • Do we need to handle read errors in any special way?
    • e.g., ignore read errors from new shard?

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully ran tests with your changes locally?

Copy link
Contributor

@coszio coszio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there is no way to dedup in count since the response from each shard is just a number

Copy link
Contributor

@ffuugoo ffuugoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like how simple the implementation turned out to be. 👌

@timvisee
Copy link
Member Author

@coszio is right that count should be updated too. I propose to do that in a separate PR though, as it requires a different approach than what I've done in this PR.

@timvisee timvisee merged commit 466debb into dev May 24, 2024
17 checks passed
@timvisee timvisee deleted the dedup-points-from-shards branch May 24, 2024 09:55
generall pushed a commit that referenced this pull request May 26, 2024
* Deduplicate merge from shards based on point ID

* Merge shard points more efficiently, reuse hash set across batches

* Deduplicate points for scroll

* Deduplicate points for retrieve

* Deduplicate points for scroll without order by

* Add point deduplication test for scroll by without ordering

* Add point deduplication test for retrieve

* Add point deduplication test for scroll by with ordering

* Properly define and use constants, add some assertions

* Add point deduplication test for search

* refactor `merge_from_shards` for less intermediate allocation and kmerge. (#4311)

---------

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
timonv pushed a commit to bosun-ai/swiftide that referenced this pull request Jun 28, 2024
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [qdrant/qdrant](https://qdrant.com/)
([source](https://togithub.com/qdrant/qdrant)) | service | patch |
`v1.9.2` -> `v1.9.7` |

---

### Release Notes

<details>
<summary>qdrant/qdrant (qdrant/qdrant)</summary>

### [`v1.9.7`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.7)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.6...v1.9.7)

### Change log

#### Improvements

-
[qdrant/qdrant#4517
- Do not allow embedding the web UI in an iframe
-
[qdrant/qdrant#4556
- Include HNSW configuration in snasphots to fix some edge cases

#### Bug fixes

-
[qdrant/qdrant#4555
- Fix panic on start with sparse index from versions 1.9.3 to 1.9.6
-
[qdrant/qdrant#4551
- Fix positive/negative points IDs being excluded when using
recommendation search with `lookup_from`

### [`v1.9.6`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.6)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.5...v1.9.6)

### Change log

#### Bug fixes

-
[qdrant/qdrant#4472
- fix potential panic on recovery sparse vectors from crash
-
[qdrant/qdrant#4426
- improve error message on missing payload index
-
[qdrant/qdrant#4375
- fix in-place updates for sparse index
-
[qdrant/qdrant#4523
- fix missing payload index issue, introduced in v1.9.5

### [`v1.9.5`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.5)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.4...v1.9.5)

### Change log

#### Features

-
[qdrant/qdrant#4254
- Add pyroscope integration for continuous profiling on demand

#### Improvements

-
[qdrant/qdrant#4309
- Allow to configure default number of shards per node
-
[qdrant/qdrant#4317
- Allow to overwrite optimizer settings via config
-
[qdrant/qdrant#4312,
[qdrant/qdrant#4369
- Improve vector size estimations, making index thresholds more reliable
-
[qdrant/qdrant#4428
- Improve default maximum segment size, base it on number of CPUs used
for indexing
-
[qdrant/qdrant#4370
- Use consistent RocksDB settings for both put and remove
-
[qdrant/qdrant#4376
- Improve ordering of insertions and deletions in RocksDB
-
[qdrant/qdrant#4371
- Log error if segment flushing failed on drop
-
[qdrant/qdrant#4352
- Promote REST request processing problems from warning to error
-
[qdrant/qdrant#4368
- Improve error messages in cases of missing vectors
-
[qdrant/qdrant#4391
- Improve shard state log message, not strictly related to snapshot
recovery
-
[qdrant/qdrant#4414
- Improve Dockerfile, don't invalidate caches each commit and allow
debug settings

#### Bug fixes

-
[qdrant/qdrant#4402
- Fix deadlock caused by concurrent snapshot and optimization
-
[qdrant/qdrant#4411
- Fix potentially losing vectors on crash by enabling RocksDB WAL
-
[qdrant/qdrant#4416,
[qdrant/qdrant#4440
- Respect `max_segment_size` on data ingestion with optimizers disabled,
create segments as needed
-
[qdrant/qdrant#4442
- Fix potentially having bad HNSW links on multithreaded systems

### [`v1.9.4`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.4)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.3...v1.9.4)

### Change log

#### Bug fixes

-
[qdrant/qdrant#4332
- Fix potentially losing a segment when creating a snapshot with ongoing
updates
-
[qdrant/qdrant#4342
- Fix potential panic on start if there is no appendable segment
-
[qdrant/qdrant#4328
- Prevent panic when searching with huge limit

### [`v1.9.3`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.3)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.2...v1.9.3)

### Change log

#### Improvements

-
[qdrant/qdrant#4165
- Handle Out-Of-Disk on insertions gracefully
-
[qdrant/qdrant#3964
- Faster consensus convergence with batched updates
-
[qdrant/qdrant#4301
- Deduplicate points by ID for custom sharding

#### Bug fixes

-
[qdrant/qdrant#4307
- Fix overflow panic if scroll limit is usize::MAX
-
[qdrant/qdrant#4322
- Fix panic with missing sparse vectors after recovery of corrupted
storage

#### Web UI

-
[qdrant/qdrant-web-ui#183
- Notification for miss-configured collections

Full change log:
https://github.com/qdrant/qdrant-web-ui/releases/tag/v0.1.26

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/bosun-ai/swiftide).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40MjAuMSIsInVwZGF0ZWRJblZlciI6IjM3LjQyMC4xIiwidGFyZ2V0QnJhbmNoIjoibWFzdGVyIiwibGFiZWxzIjpbXX0=-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants