🚀 The feature, motivation and pitch
There are metrics that report the number of requests currently running and waiting: num_requests_running and num_requests_waiting. But these metrics alone do not reveal whether requests are being thrashed (repeatedly preempted and rescheduled) and thus underutilizing GPUs.
The proposed metric num_requests_preempted, reflecting the number of requests that were preempted and are waiting for execution, would provide direct insight into request thrashing. This would enable higher-level schedulers to avoid routing new requests to thrashing GPUs.
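As a sketch of how a higher-level scheduler could consume the proposed metric, the snippet below parses a Prometheus-style text exposition (the format vLLM's /metrics endpoint uses) and routes away from replicas reporting preemptions. The metric name vllm:num_requests_preempted, the threshold, and the helper names are assumptions for illustration, not existing vLLM API.

```python
# Hypothetical sketch: using the proposed num_requests_preempted metric
# in a higher-level scheduler's routing decision. The metric name and
# threshold are illustrative assumptions.

def parse_metrics(text: str) -> dict:
    """Parse a Prometheus text exposition into {metric_name: value}.

    Simplified: ignores timestamps and collapses labeled series
    (e.g. name{...} value) onto the bare metric name.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        name = name.split("{", 1)[0]  # drop any label block
        try:
            values[name] = float(value)
        except ValueError:
            pass  # tolerate lines we cannot parse in this sketch
    return values

def accepts_new_requests(metrics: dict, preempt_threshold: float = 0.0) -> bool:
    """Return False for a replica whose preemption count signals thrashing."""
    preempted = metrics.get("vllm:num_requests_preempted", 0.0)
    return preempted <= preempt_threshold
```

A scheduler would poll each replica's /metrics endpoint periodically and only dispatch to replicas for which accepts_new_requests returns True.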
Alternatives
No response
Additional context
from openai import OpenAI
import threading

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

def query():
    # Each thread issues one completion request against the local vLLM server.
    client.completions.create(
        model="facebook/opt-125m",
        prompt="Sachin Tendulkar is",
        max_tokens=2040,
        n=1,
    )

threads = []
for i in range(1000):
    thread = threading.Thread(target=query)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
I ran the above script against a vLLM server started with the command python -m vllm.entrypoints.openai.api_server on a machine with one NVIDIA T400; below are my observations:
Observation 1
As soon as the script starts sending requests, num_requests_running reaches 256. At this point, gpu_cache_usage_perc is at 6 percent.
Observation 2
After some time, gpu_cache_usage_perc shoots up to 99 percent.
Observation 3
Gradually, num_requests_running comes down to 100 while gpu_cache_usage_perc remains around 99 percent.
Observation 4
Gradually, num_requests_running goes back up to 256 while gpu_cache_usage_perc remains around 99 percent.
Observations 3 and 4 repeat until all the requests are completed. The metrics num_requests_running and gpu_cache_usage_perc had to be correlated manually to conclude that requests were being thrashed. It would be great if we could provide num_requests_preempted, as this would give a direct measure of thrashing.
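The manual correlation described above can be approximated in code: flag thrashing when the KV cache stays nearly full while the running-request count swings widely between samples. The function below is a workaround sketch; the sampling interface and thresholds are illustrative assumptions, not vLLM API.

```python
# Workaround sketch: infer thrashing today by correlating the two
# existing metrics, mirroring the observations above. Thresholds are
# illustrative assumptions.

def looks_like_thrashing(samples, cache_high=0.95, swing=50):
    """samples: list of (num_requests_running, gpu_cache_usage_perc)
    pairs taken at regular intervals, with cache usage in [0, 1].

    Returns True when the cache is saturated while the running count
    oscillates by at least `swing` requests, as in Observations 2-4.
    """
    saturated = [running for running, cache in samples if cache >= cache_high]
    if len(saturated) < 2:
        return False  # not enough saturated samples to see a swing
    return max(saturated) - min(saturated) >= swing
```

With a direct num_requests_preempted metric, this heuristic (and its tuning) would be unnecessary.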