
A modern, full-stack chat application demonstrating how to integrate a React frontend with a Go backend and run local Large Language Models (LLMs) using Docker's Model Runner. This repository also integrates the GenAI app with an observability stack that includes Prometheus, Grafana, and Jaeger.
This project showcases a complete Generative AI interface that includes:
- React/TypeScript frontend with a responsive chat UI
- Go backend server for API handling
- Integration with Docker's Model Runner to run Llama 3.2 locally
- Comprehensive observability with metrics, logging, and tracing
- NEW: llama.cpp metrics integration directly in the UI
- Interactive chat interface with message history
- Real-time streaming responses (tokens appear as they're generated)
- Light/dark mode support based on user preference
- Dockerized deployment for easy setup and portability
- Run AI models locally without cloud API dependencies
- Cross-origin resource sharing (CORS) enabled
- Integration testing using Testcontainers
- Metrics and performance monitoring
- Structured logging with zerolog
- Distributed tracing with OpenTelemetry
- Grafana dashboards for visualization
- Advanced llama.cpp performance metrics

The application consists of these main components:
```
┌──────────────┐      ┌──────────────┐      ┌───────────────┐
│   Frontend   │ ───> │   Backend    │ ───> │ Model Runner  │
│  (React/TS)  │      │     (Go)     │      │  (Llama 3.2)  │
└──────────────┘      └──────────────┘      └───────────────┘
     :3000                 :8080                 :12434
                              │
               ┌──────────────┴──────────────┐
               │                             │
┌──────────────┐      ┌──────────────┐      ┌───────────────┐
│   Grafana    │ <─── │  Prometheus  │      │    Jaeger     │
│  Dashboards  │      │   Metrics    │      │    Tracing    │
└──────────────┘      └──────────────┘      └───────────────┘
     :3001                 :9091                 :16686
```
There are two ways to connect to Model Runner.
The first method uses Docker's internal DNS resolution to connect to the Model Runner:
- Connection URL: http://model-runner.docker.internal/engines/llama.cpp/v1/
- Configuration is set in backend.env
The second method uses host-side TCP support:
- Connection URL: host.docker.internal:12434
- Requires updates to the environment configuration, as sketched below
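If you switch to the TCP method, the change typically lands in backend.env. A minimal sketch, assuming the Model Runner exposes the same OpenAI-compatible path over TCP (verify the exact path suffix for your Docker Desktop version):

```
# Illustrative only - adjust to your setup
BASE_URL=http://host.docker.internal:12434/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
```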
- Docker Desktop 4.41.0 or later
- Docker Compose 2.35 or later
- Git
- Go 1.19 or higher (for local development)
- Node.js and npm (for frontend development)
Before starting, pull the required model:
docker model pull ai/llama3.2:1B-Q8_0
- Clone this repository:
  git clone https://github.com/ajeetraina/genai-model-runner-metrics.git
  cd genai-model-runner-metrics
- Start the application using Docker Compose:
  docker compose up -d --build
- Access the frontend at http://localhost:3000
- Access the observability dashboards:
  - Grafana: http://localhost:3001 (admin/admin). When configuring the Prometheus data source in Grafana, use http://prometheus:9090 instead of http://localhost:9090 so that Grafana can reach Prometheus over the Docker network and show the metrics on the dashboard.
  - Jaeger UI: http://localhost:16686
  - Prometheus: http://localhost:9091
The frontend is built with React, TypeScript, and Vite:
cd frontend
npm install
npm run dev
This will start the development server at http://localhost:3000.
The Go backend can be run directly:
go mod download
go run main.go
Make sure to set the required environment variables from backend.env:
- BASE_URL: URL for the model runner
- MODEL: Model identifier to use
- API_KEY: API key for authentication (defaults to "ollama")
- LOG_LEVEL: Logging level (debug, info, warn, error)
- LOG_PRETTY: Whether to output pretty-printed logs
- TRACING_ENABLED: Enable OpenTelemetry tracing
- OTLP_ENDPOINT: OpenTelemetry collector endpoint
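For reference, a complete backend.env along these lines satisfies the variables above. The values are illustrative; the defaults shipped in this repository may differ:

```
BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=ollama
LOG_LEVEL=info
LOG_PRETTY=true
TRACING_ENABLED=true
OTLP_ENDPOINT=jaeger:4318
```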
- The frontend sends chat messages to the backend API
- The backend formats the messages and sends them to the Model Runner
- The LLM processes the input and generates a response
- The backend streams the tokens back to the frontend as they're generated
- The frontend displays the incoming tokens in real-time
- Observability components collect metrics, logs, and traces throughout the process
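To make the streaming step concrete, here is a minimal sketch of a Go handler that forwards the chat request to the Model Runner's OpenAI-compatible chat/completions endpoint with streaming enabled and relays the chunks back to the browser. It is not the repository's main.go: the /chat route, the payload shape, and the error handling are simplified assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

func chatHandler(w http.ResponseWriter, r *http.Request) {
	// Read the chat payload sent by the frontend.
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	// Build an OpenAI-style chat completion request for the Model Runner.
	payload := map[string]any{
		"model":    os.Getenv("MODEL"),
		"messages": json.RawMessage(body), // assumes the frontend sends a messages array
		"stream":   true,
	}
	buf, _ := json.Marshal(payload)

	upstream, err := http.Post(os.Getenv("BASE_URL")+"chat/completions", "application/json", bytes.NewReader(buf))
	if err != nil {
		http.Error(w, "model runner unreachable", http.StatusBadGateway)
		return
	}
	defer upstream.Body.Close()

	// Relay the streamed chunks to the browser as they arrive.
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Access-Control-Allow-Origin", "*") // CORS, as listed in the features
	flusher, _ := w.(http.Flusher)

	chunk := make([]byte, 4096)
	for {
		n, readErr := upstream.Body.Read(chunk)
		if n > 0 {
			w.Write(chunk[:n])
			if flusher != nil {
				flusher.Flush()
			}
		}
		if readErr != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/chat", chatHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The real backend additionally records metrics, logs, and trace spans around this flow, which is what the observability sections below describe.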
βββ compose.yaml # Docker Compose configuration
βββ backend.env # Backend environment variables
βββ main.go # Go backend server
βββ frontend/ # React frontend application
β βββ src/ # Source code
β β βββ components/ # React components
β β βββ App.tsx # Main application component
β β βββ ...
βββ pkg/ # Go packages
β βββ logger/ # Structured logging
β βββ metrics/ # Prometheus metrics
β βββ middleware/ # HTTP middleware
β βββ tracing/ # OpenTelemetry tracing
β βββ health/ # Health check endpoints
βββ prometheus/ # Prometheus configuration
βββ grafana/ # Grafana dashboards and configuration
βββ observability/ # Observability documentation
βββ ...
The application includes detailed llama.cpp metrics displayed directly in the UI:
- Tokens per Second: Real-time generation speed
- Context Window Size: Maximum tokens the model can process
- Prompt Evaluation Time: Time spent processing the input prompt
- Memory per Token: Memory usage efficiency
- Thread Utilization: Number of threads used for inference
- Batch Size: Inference batch size
These metrics help in understanding the performance characteristics of llama.cpp models and can be used to optimize configurations.
The project includes comprehensive observability features:
- Model performance (latency, time to first token)
- Token usage (input and output counts)
- Request rates and error rates
- Active request monitoring
- llama.cpp specific performance metrics
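The Prometheus side of these metrics lives in pkg/metrics. As a rough sketch of what such instrumentation usually looks like with the official Go client (the metric names here are illustrative, not the repository's actual series):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names are illustrative; the repository's pkg/metrics defines its own.
var (
	FirstTokenLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "genai_time_to_first_token_seconds",
		Help:    "Time from request start until the first token is streamed back.",
		Buckets: prometheus.DefBuckets,
	})

	OutputTokens = promauto.NewCounter(prometheus.CounterOpts{
		Name: "genai_output_tokens_total",
		Help: "Total number of tokens generated by the model.",
	})

	ActiveRequests = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "genai_active_requests",
		Help: "Number of chat requests currently in flight.",
	})
)

// Handler exposes the default registry so Prometheus can scrape it,
// conventionally mounted at /metrics.
func Handler() http.Handler {
	return promhttp.Handler()
}
```

Prometheus scrapes whatever endpoint this handler is mounted on, and Grafana queries Prometheus to render the dashboards.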
- Structured JSON logs with zerolog
- Log levels (debug, info, warn, error, fatal)
- Request logging middleware
- Error tracking
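A minimal zerolog setup along these lines (an illustration, not the exact contents of pkg/logger) produces the structured JSON logs described above, honoring the LOG_LEVEL and LOG_PRETTY variables:

```go
package logger

import (
	"io"
	"os"

	"github.com/rs/zerolog"
)

// New builds a logger from the LOG_LEVEL and LOG_PRETTY variables
// documented earlier in this README.
func New() zerolog.Logger {
	level, err := zerolog.ParseLevel(os.Getenv("LOG_LEVEL"))
	if err != nil || level == zerolog.NoLevel {
		level = zerolog.InfoLevel
	}

	var out io.Writer = os.Stdout
	if os.Getenv("LOG_PRETTY") == "true" {
		// Human-friendly console output instead of raw JSON.
		out = zerolog.ConsoleWriter{Out: os.Stdout}
	}

	return zerolog.New(out).Level(level).With().Timestamp().Logger()
}
```

With such a logger, a call like log.Info().Str("path", "/chat").Int("status", 200).Msg("request completed") emits a single JSON line with those fields attached.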
- Request flow tracing with OpenTelemetry
- Integration with Jaeger for visualization
- Span context propagation
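Tracer setup follows the standard OpenTelemetry Go SDK pattern. The sketch below is a generic illustration of that pattern rather than a copy of pkg/tracing; the service name and the OTLP/HTTP endpoint are assumptions:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// Init wires an OTLP/HTTP exporter (for example, pointing at the Jaeger
// collector) into a global tracer provider. The returned function flushes
// and shuts down the provider.
func Init(ctx context.Context, endpoint string) (func(context.Context) error, error) {
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint), // e.g. "jaeger:4318"
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("genai-backend"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```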
For more information, see Observability Documentation.
The application has been enhanced with specific metrics for llama.cpp models:
- Backend Integration: The Go backend collects and exposes llama.cpp-specific metrics:
  - Context window size tracking
  - Memory per token measurement
  - Token generation speed calculations
  - Thread utilization monitoring
  - Prompt evaluation timing
  - Batch size tracking
- Frontend Dashboard: A dedicated metrics panel in the UI shows:
  - Real-time token generation speed
  - Memory efficiency
  - Thread utilization with recommendations
  - Context window size visualization
  - Expandable detailed metrics view
  - Integration with the model info panel
- Prometheus Integration: All llama.cpp metrics are exposed to Prometheus for long-term storage and analysis (a sketch of such instrumentation follows this list):
  - Custom histograms for timing metrics
  - Gauges for resource utilization
  - Counters for token throughput
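As a concrete, hypothetical example of that last point, llama.cpp gauges and histograms could be declared as follows and updated whenever the backend parses the engine's timing output; the series names are illustrative, not the repository's exact ones:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical llama.cpp series; the real names are defined in pkg/metrics.
var (
	TokensPerSecond = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "llamacpp_tokens_per_second",
		Help: "Observed token generation speed.",
	})
	ContextWindow = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "llamacpp_context_window_tokens",
		Help: "Maximum number of tokens the loaded model can process.",
	})
	MemoryPerToken = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "llamacpp_memory_per_token_bytes",
		Help: "Approximate memory used per generated token.",
	})
	PromptEvalSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "llamacpp_prompt_eval_seconds",
		Help:    "Time spent evaluating the input prompt.",
		Buckets: prometheus.DefBuckets,
	})
)
```

Gauges suit values that move up and down (tokens per second, context window), while histograms suit durations such as prompt evaluation time.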
You can customize the application by:
- Changing the model in backend.env to use a different LLM
- Modifying the frontend components for a different UI experience
- Extending the backend API with additional functionality
- Customizing the Grafana dashboards for different metrics
- Adjusting llama.cpp parameters for performance optimization
The project includes integration tests using Testcontainers:
cd tests
go test -v
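As a hedged sketch of the Testcontainers pattern used by such tests (the image name, port, and /health path are placeholders, not the repository's actual test code), a test can start the backend container and probe its health endpoint:

```go
package tests

import (
	"context"
	"fmt"
	"net/http"
	"testing"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/wait"
)

func TestBackendHealth(t *testing.T) {
	ctx := context.Background()

	// Placeholder image name: build or tag the backend image before running this.
	req := testcontainers.ContainerRequest{
		Image:        "genai-backend:local",
		ExposedPorts: []string{"8080/tcp"},
		WaitingFor:   wait.ForHTTP("/health").WithPort("8080/tcp"),
	}
	backend, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
		ContainerRequest: req,
		Started:          true,
	})
	if err != nil {
		t.Fatalf("starting backend container: %v", err)
	}
	defer backend.Terminate(ctx)

	host, err := backend.Host(ctx)
	if err != nil {
		t.Fatal(err)
	}
	port, err := backend.MappedPort(ctx, "8080")
	if err != nil {
		t.Fatal(err)
	}

	resp, err := http.Get(fmt.Sprintf("http://%s:%s/health", host, port.Port()))
	if err != nil {
		t.Fatalf("calling health endpoint: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("unexpected status: %d", resp.StatusCode)
	}
}
```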
- Model not loading: Ensure you've pulled the model with docker model pull ai/llama3.2:1B-Q8_0
- Connection errors: Verify Docker network settings and that Model Runner is running
- Streaming issues: Check CORS settings in the backend code
- Metrics not showing: Verify that Prometheus can reach the backend metrics endpoint
- llama.cpp metrics missing: Confirm that your model is indeed a llama.cpp model
MIT
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request