Adds metrics endpoint #78

Merged

ilopezluna merged 8 commits into main from add-metrics on Jun 16, 2025

Conversation

ilopezluna (Contributor) commented Jun 12, 2025

This PR uses the llama.cpp metrics endpoint to collect and aggregate the metrics of all active runners, exposing them on a single /metrics endpoint with backend, mode, and model labels.

No active runners:

curl http://localhost:13434/metrics
# No active runners

An active runner with completions mode:

curl http://localhost:13434/metrics                                
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 1

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 2

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 285.714

# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0.047

# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 1000

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 47

# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0

# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0.007

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 2

An active runner with embeddings mode:

curl http://localhost:13434/metrics
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 1

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 1

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0
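
For reference, here's what consuming the aggregated endpoint programmatically could look like. This is only an illustrative sketch, not part of this PR: it scrapes http://localhost:13434/metrics with github.com/prometheus/common/expfmt (the parser library discussed in the review below) and sums llamacpp:prompt_tokens_total across all labeled series. Metric and label names are taken from the examples above.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Scrape the aggregated endpoint exposed by the scheduler.
	resp, err := http.Get("http://localhost:13434/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Sum llamacpp:prompt_tokens_total across all labeled series
	// (one series per backend/mode/model combination).
	var total float64
	if mf, ok := families["llamacpp:prompt_tokens_total"]; ok {
		for _, m := range mf.Metric {
			total += m.GetCounter().GetValue()
		}
	}
	fmt.Printf("prompt tokens across all runners: %v\n", total)
}
```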

@ilopezluna ilopezluna changed the title [WIP] Adds metrics endpoint Adds metrics endpoint Jun 12, 2025
@ilopezluna ilopezluna requested a review from a team June 12, 2025 12:58
@ilopezluna ilopezluna marked this pull request as ready for review June 12, 2025 12:58
Review thread on NewPrometheusParser:

}

// NewPrometheusParser creates a new Prometheus metrics parser
func NewPrometheusParser() *PrometheusParser {

ilopezluna (Contributor Author) replied:

I get a 500 when I visit the link, but I assume it's temporary. I'll take a look tomorrow, thanks!

A Collaborator replied:

Yeah, looks like pkg.go.dev was offline for a bit; it seems back up now. I'd also advocate for less code that we need to manage and test.

ilopezluna (Contributor Author) replied:

I've changed it to use https://pkg.go.dev/github.com/prometheus/common/expfmt, so I can call `families, err := parser.TextToMetricFamilies(strings.NewReader(string(body)))` and then get the metrics per family and add our labels.
Let me know what you think 🙏
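
For illustration, here is a minimal sketch of that flow (not the PR's actual code): parse a runner's output with expfmt.TextParser, append backend/mode/model labels to every metric, and re-encode with expfmt.NewEncoder. The strPtr and addLabels helpers and the sample body are assumptions made for the example; the label values come from the outputs above.

```go
package main

import (
	"fmt"
	"os"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

func strPtr(s string) *string { return &s }

// addLabels appends the given labels to every metric in a family.
func addLabels(mf *dto.MetricFamily, labels map[string]string) {
	for _, m := range mf.Metric {
		for name, value := range labels {
			m.Label = append(m.Label, &dto.LabelPair{
				Name:  strPtr(name),
				Value: strPtr(value),
			})
		}
	}
}

func main() {
	// body stands in for the response from a single runner's metrics endpoint.
	body := `# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 47
`

	// Parse the runner's output into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	// Attach the scheduler's labels and re-encode in the Prometheus text format.
	enc := expfmt.NewEncoder(os.Stdout, expfmt.FmtText)
	for _, mf := range families {
		addLabels(mf, map[string]string{
			"backend": "llama.cpp",
			"mode":    "completion",
			"model":   "ai/llama3.2",
		})
		if err := enc.Encode(mf); err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
	}
}
```

Reusing the upstream expfmt types this way avoids maintaining a custom parser, in line with the "less code to manage and test" point above.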

- Remove custom prometheus_metrics.go
- Use expfmt.TextParser for parsing and expfmt.NewEncoder for output
# Conflicts:
#	go.mod
#	pkg/inference/scheduling/scheduler.go
xenoscopic (Collaborator) left a comment:

LGTM, just a few minor suggestions.

@ilopezluna ilopezluna merged commit 9933b7d into main Jun 16, 2025
4 checks passed
@ilopezluna ilopezluna deleted the add-metrics branch June 16, 2025 08:18