fix(pgvector): fix ConcurrentInsertRunner for non-thread-safe DBs#764

Merged
XuanYang-cn merged 1 commit into zilliztech:main from XuanYang-cn:fix-conc-insert-for-threadsafe-false
Apr 21, 2026
Conversation

@XuanYang-cn
Collaborator

For non-thread-safe DBs (e.g. PgVector), ConcurrentInsertRunner clamps max_workers to 1, so there is always exactly one worker thread. There is no need to deepcopy self.db per thread — the single worker can use self.db directly via the connection already opened by task()'s with self.db.init():.

The original code called deepcopy(self.db) inside _get_thread_db() after task() had already opened a live psycopg C-extension Connection on self.db. C-extension objects cannot be deep-copied, causing:
  TypeError: no default __reduce__ due to non-trivial __cinit__

Fix: remove the deepcopy branch entirely. All workers (thread-safe or not) now use self.db directly; thread-safety is guaranteed for non-thread-safe DBs by the max_workers=1 clamp.
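A minimal sketch of the fixed pattern (class and attribute names here are illustrative stand-ins, not the actual vectordb-bench API): the runner clamps `max_workers` to 1 for non-thread-safe DBs, so every worker can share `self.db` directly and no deepcopy branch is needed.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeDB:
    """Stand-in for a vector DB client; `thread_safe` mirrors the flag
    the runner checks (illustrative, not the real client class)."""
    def __init__(self, thread_safe: bool):
        self.thread_safe = thread_safe
        self.inserted = 0

    def insert(self, batch):
        self.inserted += len(batch)

class ConcurrentInsertRunnerSketch:
    def __init__(self, db, requested_workers: int):
        self.db = db
        # The clamp: a non-thread-safe DB gets exactly one worker, so
        # all tasks can safely share self.db without per-thread copies.
        self.max_workers = requested_workers if db.thread_safe else 1

    def _get_thread_db(self):
        # After the fix there is no deepcopy branch: every worker,
        # thread-safe or not, uses self.db directly.
        return self.db

    def run(self, batches):
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            list(pool.map(lambda b: self._get_thread_db().insert(b), batches))
```

With `thread_safe=False`, a request for 8 workers still yields a single worker thread, which is what makes sharing `self.db` safe.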

Also clean up stale comments in pgvector.py left over from #760/#763.

Adds tests/test_pgvector.py with:

  • unit test that reproduces the bug (fails on original, passes on fix)
  • e2e regression test via ConcurrentInsertRunner + OpenAI 50K dataset
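The shape of the unit test can be sketched without psycopg installed (all names below are hypothetical, not the actual test code): a generator object, like a psycopg C-extension Connection, cannot be deep-copied, so it works as a proxy for the live connection that `deepcopy(self.db)` choked on.

```python
import copy

class FakePgVector:
    """Illustrative stand-in: `conn` plays the role of the live psycopg
    Connection opened by task()'s `with self.db.init():`. A generator
    is also un-deep-copyable, making it a convenient proxy."""
    def __init__(self):
        self.conn = None

    def init(self):
        # Simulate opening a live, non-copyable connection.
        self.conn = (row for row in range(3))

def deepcopy_raises_after_init() -> bool:
    """Reproduce the original bug: deepcopy after init() -> TypeError."""
    db = FakePgVector()
    db.init()
    try:
        copy.deepcopy(db)  # original _get_thread_db() behavior
        return False
    except TypeError:
        return True  # the fixed code never takes this branch
```

On the original code this path raises; after the fix the runner never deep-copies, so the same scenario passes.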

See also: #756

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XuanYang-cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@XuanYang-cn XuanYang-cn merged commit 02e5d33 into zilliztech:main Apr 21, 2026
4 checks passed
@XuanYang-cn XuanYang-cn deleted the fix-conc-insert-for-threadsafe-false branch April 21, 2026 07:02