🚀 The feature, motivation and pitch
There are metrics that report the number of requests currently running and waiting: num_requests_running and num_requests_waiting. But these metrics alone do not reveal whether requests are being thrashed (repeatedly preempted and rescheduled) and thus underutilizing GPUs.
The proposed metric num_requests_preempted, reflecting the number of requests that were preempted and are waiting for execution, would provide direct insight into request thrashing. This would enable higher-level schedulers to avoid routing new requests to thrashing GPUs.
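As a sketch of how a higher-level scheduler could consume the proposed metric, the snippet below parses a Prometheus-style text exposition (the format vLLM's /metrics endpoint uses) and routes away from replicas reporting preemptions. The metric name vllm:num_requests_preempted, the threshold, and the helper names are assumptions for illustration, not existing vLLM API.

```python
# Hypothetical sketch: using the proposed num_requests_preempted metric
# in a higher-level scheduler's routing decision. The metric name and
# threshold are illustrative assumptions.

def parse_metrics(text: str) -> dict:
    """Parse a Prometheus text exposition into {metric_name: value}.

    Simplified: ignores timestamps and collapses labeled series
    (e.g. name{...} value) onto the bare metric name.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        name = name.split("{", 1)[0]  # drop any label block
        try:
            values[name] = float(value)
        except ValueError:
            pass  # tolerate lines we cannot parse in this sketch
    return values

def accepts_new_requests(metrics: dict, preempt_threshold: float = 0.0) -> bool:
    """Return False for a replica whose preemption count signals thrashing."""
    preempted = metrics.get("vllm:num_requests_preempted", 0.0)
    return preempted <= preempt_threshold
```

A scheduler would poll each replica's /metrics endpoint periodically and only dispatch to replicas for which accepts_new_requests returns True.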
Alternatives
No response
Additional context
from openai import OpenAI
import threading

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

def query():
    # Each thread issues one completion request against the local vLLM server.
    client.completions.create(
        model="facebook/opt-125m",
        prompt="Sachin Tendulkar is",
        max_tokens=2040,
        n=1,
    )

threads = []
for i in range(1000):
    thread = threading.Thread(target=query)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
I ran the above script against a vLLM server started with the command python -m vllm.entrypoints.openai.api_server on a machine with one NVIDIA T400; below are my observations:
Observation 1
As soon as the script starts sending requests, num_requests_running reaches 256. At this point, gpu_cache_usage_perc is at 6 percent.
Observation 2
After some time, gpu_cache_usage_perc shoots up to 99 percent.
Observation 3
Gradually, num_requests_running comes down to 100 while gpu_cache_usage_perc remains around 99 percent.
Observation 4
Gradually, num_requests_running goes back up to 256 while gpu_cache_usage_perc remains around 99 percent.
Observations 3 and 4 repeat until all the requests are completed. The metrics num_requests_running and gpu_cache_usage_perc had to be correlated manually to conclude that requests were being thrashed. It would be great if we could provide num_requests_preempted, as this would give a direct measure of thrashing.
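The manual correlation described above can be approximated in code: flag thrashing when the KV cache stays nearly full while the running-request count swings widely between samples. The function below is a workaround sketch; the sampling interface and thresholds are illustrative assumptions, not vLLM API.

```python
# Workaround sketch: infer thrashing today by correlating the two
# existing metrics, mirroring the observations above. Thresholds are
# illustrative assumptions.

def looks_like_thrashing(samples, cache_high=0.95, swing=50):
    """samples: list of (num_requests_running, gpu_cache_usage_perc)
    pairs taken at regular intervals, with cache usage in [0, 1].

    Returns True when the cache is saturated while the running count
    oscillates by at least `swing` requests, as in Observations 2-4.
    """
    saturated = [running for running, cache in samples if cache >= cache_high]
    if len(saturated) < 2:
        return False  # not enough saturated samples to see a swing
    return max(saturated) - min(saturated) >= swing
```

With a direct num_requests_preempted metric, this heuristic (and its tuning) would be unnecessary.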