🚀 The feature, motivation and pitch
When the server experiences high concurrency with many long requests, preprocessing time (which includes tokenization) becomes a bottleneck. This inflates overall latency and hurts time-to-first-token (TTFT), because requests are forced to wait in a serialized queue for preprocessing (tokenization) even while XPU compute resources may sit idle.
The current implementation in vLLM uses a single-threaded executor to handle request preprocessing (in serving_engine.py):

```python
self._tokenizer_executor = ThreadPoolExecutor(max_workers=1)
```
The cost of tokenization and detokenization scales linearly with the length of the input/output sequences. Under high load with long contexts, a queue of requests forms, each waiting for the previous one to finish tokenizing.
A multiprocessing-based tokenizer process pool could parallelize the encoding and decoding steps, breaking this serialization bottleneck. The overhead of inter-process communication (IPC) should be minor relative to the gains from parallelizing long tokenization tasks.
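
As a rough illustration (not a concrete implementation proposal), such a pool could be built on `concurrent.futures.ProcessPoolExecutor`, with each worker loading its own tokenizer once at startup. The names below (`TokenizerProcessPool`, `_init_worker`) and the use of a Hugging Face `AutoTokenizer` are assumptions for the sketch:

```python
from concurrent.futures import ProcessPoolExecutor
from transformers import AutoTokenizer

# Hypothetical sketch: each worker process loads its own tokenizer copy
# once at startup, so per-request work is just encode, done in parallel.
_worker_tokenizer = None


def _init_worker(model_name: str) -> None:
    global _worker_tokenizer
    _worker_tokenizer = AutoTokenizer.from_pretrained(model_name)


def _encode(prompt: str) -> list[int]:
    return _worker_tokenizer.encode(prompt)


class TokenizerProcessPool:
    def __init__(self, model_name: str, num_workers: int = 4):
        self._pool = ProcessPoolExecutor(
            max_workers=num_workers,
            initializer=_init_worker,
            initargs=(model_name,),
        )

    def encode_async(self, prompt: str):
        # Returns a Future; only the prompt string and the resulting
        # token IDs cross the process boundary, so IPC cost stays
        # bounded by sequence length.
        return self._pool.submit(_encode, prompt)


if __name__ == "__main__":
    pool = TokenizerProcessPool("gpt2", num_workers=4)
    futures = [pool.encode_async(f"request {i}") for i in range(8)]
    print([f.result() for f in futures])
```

Compared to the current `max_workers=1` thread, the payload crossing the process boundary is just the prompt string and the token IDs, which is what makes the IPC overhead small relative to the tokenization work itself.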
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.