
[Feature]: Add num_requests_preempted metric #5051

Open
sathyanarays opened this issue May 25, 2024 · 1 comment

Comments

@sathyanarays

🚀 The feature, motivation and pitch

The num_requests_running and num_requests_waiting metrics report how many requests are currently running and waiting. However, these metrics alone do not indicate whether requests are being thrashed (repeatedly preempted and re-queued), which leaves GPUs underutilized.

The proposed metric, num_requests_preempted, would reflect the number of requests that have been preempted and are waiting to resume execution, giving a direct signal of request thrashing. Higher-level schedulers could use it to avoid routing new requests to GPUs that are already thrashing.
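
To make the intent concrete, here is a hypothetical sketch of how a higher-level scheduler could consume such a metric when deciding where to route new requests. The gauge name vllm:num_requests_preempted is the one proposed in this issue and does not exist yet; the replica URLs and the routing policy are illustrative assumptions only.

import urllib.request

# Hypothetical replica endpoints; adjust to your deployment.
REPLICAS = [
    "http://vllm-0:8000",
    "http://vllm-1:8000",
]

def preempted_requests(base_url: str) -> float:
    # Read the proposed gauge from the replica's Prometheus /metrics endpoint.
    # "vllm:num_requests_preempted" is the metric proposed here, not an existing one.
    body = urllib.request.urlopen(f"{base_url}/metrics").read().decode()
    for line in body.splitlines():
        if line.startswith("vllm:num_requests_preempted"):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def pick_replica() -> str:
    # Prefer the replica with the least preemption pressure, i.e. avoid GPUs
    # whose requests are already being thrashed out of the KV cache.
    return min(REPLICAS, key=preempted_requests)

The exact routing policy does not matter; the point is that a single preemption gauge would be enough to steer load away from a thrashing replica.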

Alternatives

No response

Additional context

from openai import OpenAI
import threading

# Client pointed at a locally running vLLM OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

def query():
    # Each request asks for a long completion so the KV cache fills up quickly.
    client.completions.create(
        model="facebook/opt-125m",
        prompt="Sachin Tendulkar is",
        max_tokens=2040,
        n=1,
    )

# Fire 1000 concurrent requests to overload the scheduler.
threads = []
for i in range(1000):
    thread = threading.Thread(target=query)
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

I ran the above script against a vLLM server started with python -m vllm.entrypoints.openai.api_server on a machine with one NVIDIA T400; below are my observations:

Observation 1

As soon as the script starts issuing requests, num_requests_running reaches 256. At this point, gpu_cache_usage_perc is about 6 percent.

Observation 2

After some time, gpu_cache_usage_perc shoots up to 99 percent.

Observation 3

Gradually, num_requests_running drops to about 100 while gpu_cache_usage_perc stays around 99 percent.

Observation 4

Gradually, num_requests_running climbs back to 256 while gpu_cache_usage_perc stays around 99 percent.

Observations 3 and 4 repeat until all requests complete. To understand that the requests were being thrashed, I had to correlate num_requests_running with gpu_cache_usage_perc. It would be great if we could provide num_requests_preempted, as it would give a direct measure of thrashing.
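
For reference, below is a rough sketch of the kind of correlation I had to do by hand. It is only an illustration: it assumes the vllm:-prefixed gauges exposed on the server's /metrics endpoint (vllm:num_requests_running, vllm:gpu_cache_usage_perc), and the thrashing heuristic and threshold are arbitrary; the cache-usage gauge may be reported as a 0-1 fraction or as 0-100 depending on version.

import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def read_gauge(text: str, name: str) -> float:
    # Return the first sample value for the given metric name, e.g.
    #   vllm:num_requests_running{model_name="facebook/opt-125m"} 256.0
    pattern = rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    return float(match.group(1)) if match else 0.0

prev_running = None
for _ in range(60):  # poll for a few minutes while the load script runs
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    running = read_gauge(body, "vllm:num_requests_running")
    cache_usage = read_gauge(body, "vllm:gpu_cache_usage_perc")

    # Heuristic: running requests dropping while the KV cache stays (nearly) full
    # suggests requests are being preempted and recomputed, i.e. thrashing.
    # The gauge may be a 0-1 fraction or 0-100; adjust the threshold accordingly.
    if prev_running is not None and running < prev_running and cache_usage >= 0.95:
        print(f"possible thrashing: running {prev_running} -> {running}, "
              f"gpu_cache_usage_perc={cache_usage}")
    prev_running = running
    time.sleep(5)

A first-class num_requests_preempted gauge would replace this two-metric heuristic with a single, direct signal.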

@sathyanarays
Author

Might be related to request_with_evicted_tokens and total_evicted_tokens in #5041
