
A modern, full-stack chat application demonstrating how to integrate a React frontend with a Go backend and run local Large Language Models (LLMs) using Docker's Model Runner. This repository also integrates the GenAI app with an observability stack that includes Prometheus, Grafana, and Jaeger.
This project showcases a complete Generative AI interface that includes:
- React/TypeScript frontend with a responsive chat UI
- Go backend server for API handling
- Integration with Docker's Model Runner to run Llama 3.2 locally
- Comprehensive observability with metrics, logging, and tracing
- NEW: llama.cpp metrics integration directly in the UI
- Interactive chat interface with message history
- Real-time streaming responses (tokens appear as they're generated)
- Light/dark mode support based on user preference
- Dockerized deployment for easy setup and portability
- Run AI models locally without cloud API dependencies
- Cross-origin resource sharing (CORS) enabled
- Integration testing using Testcontainers
- Metrics and performance monitoring
- Structured logging with zerolog
- Distributed tracing with OpenTelemetry
- Grafana dashboards for visualization
- Advanced llama.cpp performance metrics

The application consists of these main components:
```
┌──────────────┐      ┌──────────────┐      ┌───────────────┐
│   Frontend   │ ───> │   Backend    │ ───> │ Model Runner  │
│  (React/TS)  │      │     (Go)     │      │  (Llama 3.2)  │
└──────────────┘      └──────────────┘      └───────────────┘
     :3000                 :8080                 :12434
                              │
               ┌──────────────┴──────────────┐
               │                             │
┌──────────────┐      ┌──────────────┐      ┌───────────────┐
│   Grafana    │ <─── │  Prometheus  │      │    Jaeger     │
│  Dashboards  │      │   Metrics    │      │    Tracing    │
└──────────────┘      └──────────────┘      └───────────────┘
     :3001                 :9091                 :16686
```
There are two ways to connect to Model Runner.
The first method uses Docker's internal DNS resolution to connect to the Model Runner:
- Connection URL: http://model-runner.docker.internal/engines/llama.cpp/v1/
- Configuration is set in backend.env
The second method uses host-side TCP support:
- Connection URL: host.docker.internal:12434
- Requires updates to the environment configuration, as sketched below
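If you switch to the TCP method, the change typically lands in backend.env. A minimal sketch, assuming the Model Runner exposes the same OpenAI-compatible path over TCP (verify the exact path suffix for your Docker Desktop version):

```
# Illustrative only - adjust to your setup
BASE_URL=http://host.docker.internal:12434/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
```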
- Docker Desktop 4.41.0 or later
- Docker Compose 2.35 or later
- Git
- Go 1.19 or higher (for local development)
- Node.js and npm (for frontend development)
Before starting, pull the required model:
docker model pull ai/llama3.2:1B-Q8_0
- Clone this repository:
  git clone https://github.com/ajeetraina/genai-model-runner-metrics.git
  cd genai-model-runner-metrics
- Start the application using Docker Compose:
  docker compose up -d --build
- Access the frontend at http://localhost:3000
- Access the observability dashboards:
  - Grafana: http://localhost:3001 (admin/admin). When configuring the Prometheus data source in Grafana, use http://prometheus:9090 instead of http://localhost:9090 so that Grafana can reach Prometheus over the Docker network and show the metrics on the dashboard.
  - Jaeger UI: http://localhost:16686
  - Prometheus: http://localhost:9091
The frontend is built with React, TypeScript, and Vite:
cd frontend
npm install
npm run dev
This will start the development server at http://localhost:3000.
The Go backend can be run directly:
go mod download
go run main.go
Make sure to set the required environment variables from backend.env:
- BASE_URL: URL for the model runner
- MODEL: Model identifier to use
- API_KEY: API key for authentication (defaults to "ollama")
- LOG_LEVEL: Logging level (debug, info, warn, error)
- LOG_PRETTY: Whether to output pretty-printed logs
- TRACING_ENABLED: Enable OpenTelemetry tracing
- OTLP_ENDPOINT: OpenTelemetry collector endpoint
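For reference, a complete backend.env along these lines satisfies the variables above. The values are illustrative; the defaults shipped in this repository may differ:

```
BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=ollama
LOG_LEVEL=info
LOG_PRETTY=true
TRACING_ENABLED=true
OTLP_ENDPOINT=jaeger:4318
```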
- The frontend sends chat messages to the backend API
- The backend formats the messages and sends them to the Model Runner
- The LLM processes the input and generates a response
- The backend streams the tokens back to the frontend as they're generated
- The frontend displays the incoming tokens in real-time
- Observability components collect metrics, logs, and traces throughout the process
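To make the streaming step concrete, here is a minimal sketch of a Go handler that forwards the chat request to the Model Runner's OpenAI-compatible chat/completions endpoint with streaming enabled and relays the chunks back to the browser. It is not the repository's main.go: the /chat route, the payload shape, and the error handling are simplified assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

func chatHandler(w http.ResponseWriter, r *http.Request) {
	// Read the chat payload sent by the frontend.
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	// Build an OpenAI-style chat completion request for the Model Runner.
	payload := map[string]any{
		"model":    os.Getenv("MODEL"),
		"messages": json.RawMessage(body), // assumes the frontend sends a messages array
		"stream":   true,
	}
	buf, _ := json.Marshal(payload)

	upstream, err := http.Post(os.Getenv("BASE_URL")+"chat/completions", "application/json", bytes.NewReader(buf))
	if err != nil {
		http.Error(w, "model runner unreachable", http.StatusBadGateway)
		return
	}
	defer upstream.Body.Close()

	// Relay the streamed chunks to the browser as they arrive.
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Access-Control-Allow-Origin", "*") // CORS, as listed in the features
	flusher, _ := w.(http.Flusher)

	chunk := make([]byte, 4096)
	for {
		n, readErr := upstream.Body.Read(chunk)
		if n > 0 {
			w.Write(chunk[:n])
			if flusher != nil {
				flusher.Flush()
			}
		}
		if readErr != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/chat", chatHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The real backend additionally records metrics, logs, and trace spans around this flow, which is what the observability sections below describe.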
βββ compose.yaml # Docker Compose configuration
βββ backend.env # Backend environment variables
βββ main.go # Go backend server
βββ frontend/ # React frontend application
β βββ src/ # Source code
β β βββ components/ # React components
β β βββ App.tsx # Main application component
β β βββ ...
βββ pkg/ # Go packages
β βββ logger/ # Structured logging
β βββ metrics/ # Prometheus metrics
β βββ middleware/ # HTTP middleware
β βββ tracing/ # OpenTelemetry tracing
β βββ health/ # Health check endpoints
βββ prometheus/ # Prometheus configuration
βββ grafana/ # Grafana dashboards and configuration
βββ observability/ # Observability documentation
βββ ...
The application includes detailed llama.cpp metrics displayed directly in the UI:
- Tokens per Second: Real-time generation speed
- Context Window Size: Maximum tokens the model can process
- Prompt Evaluation Time: Time spent processing the input prompt
- Memory per Token: Memory usage efficiency
- Thread Utilization: Number of threads used for inference
- Batch Size: Inference batch size
These metrics help in understanding the performance characteristics of llama.cpp models and can be used to optimize configurations.
The project includes comprehensive observability features:
- Model performance (latency, time to first token)
- Token usage (input and output counts)
- Request rates and error rates
- Active request monitoring
- llama.cpp specific performance metrics
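The Prometheus side of these metrics lives in pkg/metrics. As a rough sketch of what such instrumentation usually looks like with the official Go client (the metric names here are illustrative, not the repository's actual series):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names are illustrative; the repository's pkg/metrics defines its own.
var (
	FirstTokenLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "genai_time_to_first_token_seconds",
		Help:    "Time from request start until the first token is streamed back.",
		Buckets: prometheus.DefBuckets,
	})

	OutputTokens = promauto.NewCounter(prometheus.CounterOpts{
		Name: "genai_output_tokens_total",
		Help: "Total number of tokens generated by the model.",
	})

	ActiveRequests = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "genai_active_requests",
		Help: "Number of chat requests currently in flight.",
	})
)

// Handler exposes the default registry so Prometheus can scrape it,
// conventionally mounted at /metrics.
func Handler() http.Handler {
	return promhttp.Handler()
}
```

Prometheus scrapes whatever endpoint this handler is mounted on, and Grafana queries Prometheus to render the dashboards.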
- Structured JSON logs with zerolog
- Log levels (debug, info, warn, error, fatal)
- Request logging middleware
- Error tracking
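A minimal zerolog setup along these lines (an illustration, not the exact contents of pkg/logger) produces the structured JSON logs described above, honoring the LOG_LEVEL and LOG_PRETTY variables:

```go
package logger

import (
	"io"
	"os"

	"github.com/rs/zerolog"
)

// New builds a logger from the LOG_LEVEL and LOG_PRETTY variables
// documented earlier in this README.
func New() zerolog.Logger {
	level, err := zerolog.ParseLevel(os.Getenv("LOG_LEVEL"))
	if err != nil || level == zerolog.NoLevel {
		level = zerolog.InfoLevel
	}

	var out io.Writer = os.Stdout
	if os.Getenv("LOG_PRETTY") == "true" {
		// Human-friendly console output instead of raw JSON.
		out = zerolog.ConsoleWriter{Out: os.Stdout}
	}

	return zerolog.New(out).Level(level).With().Timestamp().Logger()
}
```

With such a logger, a call like log.Info().Str("path", "/chat").Int("status", 200).Msg("request completed") emits a single JSON line with those fields attached.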
- Request flow tracing with OpenTelemetry
- Integration with Jaeger for visualization
- Span context propagation
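Tracer setup follows the standard OpenTelemetry Go SDK pattern. The sketch below is a generic illustration of that pattern rather than a copy of pkg/tracing; the service name and the OTLP/HTTP endpoint are assumptions:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// Init wires an OTLP/HTTP exporter (for example, pointing at the Jaeger
// collector) into a global tracer provider. The returned function flushes
// and shuts down the provider.
func Init(ctx context.Context, endpoint string) (func(context.Context) error, error) {
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint), // e.g. "jaeger:4318"
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("genai-backend"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```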
For more information, see Observability Documentation.
The application has been enhanced with specific metrics for llama.cpp models:
- Backend Integration: The Go backend collects and exposes llama.cpp-specific metrics:
  - Context window size tracking
  - Memory per token measurement
  - Token generation speed calculations
  - Thread utilization monitoring
  - Prompt evaluation timing
  - Batch size tracking
- Frontend Dashboard: A dedicated metrics panel in the UI shows:
  - Real-time token generation speed
  - Memory efficiency
  - Thread utilization with recommendations
  - Context window size visualization
  - Expandable detailed metrics view
  - Integration with the model info panel
- Prometheus Integration: All llama.cpp metrics are exposed to Prometheus for long-term storage and analysis (a sketch of such instrumentation follows this list):
  - Custom histograms for timing metrics
  - Gauges for resource utilization
  - Counters for token throughput
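As a concrete, hypothetical example of that last point, llama.cpp gauges and histograms could be declared as follows and updated whenever the backend parses the engine's timing output; the series names are illustrative, not the repository's exact ones:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical llama.cpp series; the real names are defined in pkg/metrics.
var (
	TokensPerSecond = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "llamacpp_tokens_per_second",
		Help: "Observed token generation speed.",
	})
	ContextWindow = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "llamacpp_context_window_tokens",
		Help: "Maximum number of tokens the loaded model can process.",
	})
	MemoryPerToken = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "llamacpp_memory_per_token_bytes",
		Help: "Approximate memory used per generated token.",
	})
	PromptEvalSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "llamacpp_prompt_eval_seconds",
		Help:    "Time spent evaluating the input prompt.",
		Buckets: prometheus.DefBuckets,
	})
)
```

Gauges suit values that move up and down (tokens per second, context window), while histograms suit durations such as prompt evaluation time.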
You can customize the application by:
- Changing the model in backend.env to use a different LLM
- Modifying the frontend components for a different UI experience
- Extending the backend API with additional functionality
- Customizing the Grafana dashboards for different metrics
- Adjusting llama.cpp parameters for performance optimization
The project includes integration tests using Testcontainers:
cd tests
go test -v
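As a hedged sketch of the Testcontainers pattern used by such tests (the image name, port, and /health path are placeholders, not the repository's actual test code), a test can start the backend container and probe its health endpoint:

```go
package tests

import (
	"context"
	"fmt"
	"net/http"
	"testing"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/wait"
)

func TestBackendHealth(t *testing.T) {
	ctx := context.Background()

	// Placeholder image name: build or tag the backend image before running this.
	req := testcontainers.ContainerRequest{
		Image:        "genai-backend:local",
		ExposedPorts: []string{"8080/tcp"},
		WaitingFor:   wait.ForHTTP("/health").WithPort("8080/tcp"),
	}
	backend, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
		ContainerRequest: req,
		Started:          true,
	})
	if err != nil {
		t.Fatalf("starting backend container: %v", err)
	}
	defer backend.Terminate(ctx)

	host, err := backend.Host(ctx)
	if err != nil {
		t.Fatal(err)
	}
	port, err := backend.MappedPort(ctx, "8080")
	if err != nil {
		t.Fatal(err)
	}

	resp, err := http.Get(fmt.Sprintf("http://%s:%s/health", host, port.Port()))
	if err != nil {
		t.Fatalf("calling health endpoint: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("unexpected status: %d", resp.StatusCode)
	}
}
```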
- Model not loading: Ensure you've pulled the model with docker model pull ai/llama3.2:1B-Q8_0
- Connection errors: Verify Docker network settings and that Model Runner is running
- Streaming issues: Check CORS settings in the backend code
- Metrics not showing: Verify that Prometheus can reach the backend metrics endpoint
- llama.cpp metrics missing: Confirm that your model is indeed a llama.cpp model
MIT
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request