AsyncQdrantClient calls blocking event loop #615

Open
amey-wynk opened this issue Apr 27, 2024 · 3 comments

Comments

@amey-wynk
I'm currently trying to use Qdrant for a production use case that requires building an API for real-time vector search. It is crucial for this API to be async so that it scales well for multiple users. But when using the async client with FastAPI and Uvicorn, latency increases as the number of concurrent users increases.

Current Behavior

Latency of the endpoint increases as concurrent users increase.

Steps to Reproduce

from fastapi import FastAPI, Request
import asyncio
from qdrant_client import AsyncQdrantClient

app = FastAPI()
asyncClient = AsyncQdrantClient(url='prod.qdrant.com', port=80)

@app.post('/sleep')
async def sleep(request: Request) -> dict:
    await asyncio.sleep(0.1)  # simulate a 100ms I/O call
    return {'message': 'success'}

@app.post('/recommend')
async def recommend(request: Request) -> dict:
    await asyncClient.recommend(
        collection_name='my_collection',
        positive=['4be0643f1d98573b97cdca98a65347dd'],
        limit=10,
    )
    return {'message': 'success'}
  1. Run the above app using uvicorn
  2. Load test the endpoints using locust (a minimal locustfile sketch is shown after this list)
  3. The /sleep endpoint scales to 50 concurrent users while maintaining a p99 latency of 110ms (as expected, since the requests are handled concurrently)
  4. The /recommend endpoint does not. At 1 concurrent user the p99 latency is 10ms, 65ms at 10 users, and 260ms at 50 concurrent users (this is unexpected if the requests are processed concurrently)
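
For reference, a minimal locustfile along these lines; the host and the exact mix of endpoints per user are placeholders, not the original test setup:

# locustfile.py -- hypothetical load test for the two endpoints above
# run with: locust -f locustfile.py --host http://localhost:8000
from locust import HttpUser, task

class ApiUser(HttpUser):
    @task
    def sleep(self):
        # baseline endpoint that only awaits asyncio.sleep
        self.client.post('/sleep')

    @task
    def recommend(self):
        # endpoint that awaits AsyncQdrantClient.recommend
        self.client.post('/recommend')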

Uvicorn version: 0.27.1
FastAPI version: 0.110.0
Qdrant version: 1.7.3 (both server and client)
Qdrant is deployed on k8s with 10 pods and 20 vCPUs each. All vectors are in memory and utilization of pods is <1 vCPU during testing.

Expected Behavior

The requests should be processed concurrently. Perhaps the async call through the client is blocking the event loop and preventing FastAPI from processing other requests.
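
One way to test that hypothesis (a sketch, not part of the original report) is a background task that measures how late asyncio.sleep wakes up; if the recommend call blocks the event loop, the reported lag grows while /recommend is under load:

# hypothetical loop-lag monitor, reusing the `app` object from the snippet above
import asyncio
import time

async def monitor_loop_lag(interval: float = 0.1):
    while True:
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lag = time.perf_counter() - start - interval
        if lag > 0.05:  # woke up more than 50ms late
            print(f'event loop lag: {lag * 1000:.1f} ms')

@app.on_event('startup')
async def start_lag_monitor():
    # keep a reference so the task is not garbage collected
    app.state.lag_task = asyncio.create_task(monitor_loop_lag())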

@amey-wynk amey-wynk added the bug Something isn't working label Apr 27, 2024
@generall generall transferred this issue from qdrant/qdrant Apr 27, 2024
@generall
Member

Latency of the endpoint increases as concurrent users increase.

This is kinda expected. You can't scale indefinitely by just adding parallel calls.

But also considering that

memory and utilization of pods is <1 vCPU during testing.

there might be two possibilities, depending on the configuration of the collection: either the bottleneck is in the disk, or the bottleneck is on the client side.

Could you please try to do the same with gRPC?
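
If it helps, switching the client to gRPC is a constructor-level change; a minimal sketch, assuming the server exposes Qdrant's default gRPC port 6334 (the URL is the placeholder from the snippet above):

from qdrant_client import AsyncQdrantClient

# prefer_grpc routes search/recommend calls over gRPC instead of REST;
# grpc_port=6334 is Qdrant's default and is an assumption here
asyncClient = AsyncQdrantClient(url='prod.qdrant.com', prefer_grpc=True, grpc_port=6334)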

@amey-wynk
Author

Thanks for the response @generall

I know it's not possible to scale indefinitely using parallel calls, as you will hit a bottleneck at some point. However, that ceiling should be far above the load in the tests I have run, and latency should not grow linearly for such a small number of requests (the /sleep endpoint, for comparison, hits 110ms instead of 100ms). Also, the request I'm sending Qdrant is the most basic kind of request (a batch recommend request would be heavier on the CPU side). Even scaling that basic request to 2/3/5 parallel calls, the latency increases almost linearly.

Regarding CPU consumption, I've tried this with multiple collection configs and the latency results are always the same; only the CPU and memory usage on the pods changes. Also, as mentioned, the DB is entirely in memory.

I've experimented with different values of shard_number, replication_factor and segment count, but the latency is always the same. I think I've assigned more than sufficient hardware resources to the DB, and I've gone through the optimization section of the docs as well.
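
For context, a sketch of where those knobs sit at collection-creation time; the collection name, vector size and parameter values below are placeholders, not the configuration from this issue:

from qdrant_client import AsyncQdrantClient
from qdrant_client.models import Distance, OptimizersConfigDiff, VectorParams

async def create_collection_example(client: AsyncQdrantClient):
    # placeholder values; the real collection parameters are not given in the issue
    await client.create_collection(
        collection_name='my_collection',
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
        shard_number=2,
        replication_factor=2,
        optimizers_config=OptimizersConfigDiff(default_segment_number=4),
    )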

Is there some specific configuration of the collection I should try? (assuming the db is the bottleneck)

@generall
Member

(assuming the db is the bottleneck)

If storage is all in memory and CPU usage on the DB side is small, I don't think the DB is actually the bottleneck.

I would try checking the client-side CPU usage. You can also check GET /telemetry to verify that latency, as observed from the DB side, correlates with the client's measurements.
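
A minimal sketch of pulling that endpoint with httpx (any HTTP client works; the URL is the placeholder host from the snippet above):

# hypothetical helper to fetch Qdrant's GET /telemetry output for comparison
# with client-side latency measurements
import httpx

async def fetch_telemetry() -> dict:
    async with httpx.AsyncClient() as http:
        resp = await http.get('http://prod.qdrant.com:80/telemetry')
        resp.raise_for_status()
        return resp.json()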

@joein joein removed the bug Something isn't working label May 21, 2024