A high-performance inference server for large language models with OpenAI-compatible API endpoints. Now available for both Windows and Linux systems!
- 🪟 Windows: Full support with Visual Studio and MSVC
- 🐧 Linux: Native support with GCC/Clang
- 🎮 GPU Acceleration: NVIDIA CUDA and Vulkan support
- 📦 Easy Installation: Direct binary installation or build from source
- 🚀 Fast Inference: Built with llama.cpp for optimized model inference
- 🔗 OpenAI Compatible: Drop-in replacement for OpenAI API endpoints
- 📡 Streaming Support: Real-time streaming responses for chat completions
- 🎛️ Multi-Model Management: Load and manage multiple models simultaneously
- 📊 Real-time Metrics: Monitor completion performance with TPS, TTFT, and success rates
- ⚙️ Lazy Loading: Defer model loading until the first request with load_immediately=false
- 🔧 Configurable: Flexible model loading parameters and inference settings
- 🔒 Authentication: API key and rate limiting support
- 🌐 Cross-Platform: Windows and Linux native builds
System Requirements:
- Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)
- GCC 9+ or Clang 10+
- CMake 3.14+
- Git with submodule support
- At least 4GB RAM (8GB+ recommended for larger models)
- CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)
- Vulkan SDK (optional, for alternative GPU acceleration)
Install Dependencies:
Ubuntu/Debian:
# Update package list
sudo apt update
# Install essential build tools
sudo apt install -y build-essential cmake git pkg-config
# Install required libraries
sudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev
# Optional: Install CUDA for GPU support
# Follow NVIDIA's official installation guide for your distribution
CentOS/RHEL/Fedora:
# For CentOS/RHEL 8+
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake git curl-devel yaml-cpp-devel
# For Fedora
sudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel
Arch Linux:
sudo pacman -S base-devel cmake git curl yaml-cpp
1. Clone the Repository:
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
2. Initialize Submodules:
git submodule update --init --recursive
3. Create Build Directory:
mkdir build && cd build
4. Configure Build:
Standard Build (CPU-only):
cmake -DCMAKE_BUILD_TYPE=Release ..
With CUDA Support:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..
With Vulkan Support:
cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..
Debug Build:
cmake -DCMAKE_BUILD_TYPE=Debug ..
5. Build the Project:
# Use all available CPU cores
make -j$(nproc)
# Or specify number of cores manually
make -j4
6. Verify Build:
# Check if the executable was created
ls -la kolosal-server
# Test basic functionality
./kolosal-server --help
Start the Server:
# From build directory
./kolosal-server
# Or specify a config file
./kolosal-server --config ../config.yaml
Background Service:
# Run in background
nohup ./kolosal-server > server.log 2>&1 &
# Check if running
ps aux | grep kolosal-server
Check Server Status:
# Test if server is responding
curl http://localhost:8080/v1/health
Install to System Path:
# Install binary to /usr/local/bin
sudo cp build/kolosal-server /usr/local/bin/
# Make it executable
sudo chmod +x /usr/local/bin/kolosal-server
# Now you can run from anywhere
kolosal-server --help
Install with Package Manager (Future):
# Note: Package manager installation will be available in future releases
# For now, use the build from source method above
Create Service File:
sudo tee /etc/systemd/system/kolosal-server.service > /dev/null << EOF
[Unit]
Description=Kolosal Server - LLM Inference Server
After=network.target
[Service]
Type=simple
User=kolosal
Group=kolosal
WorkingDirectory=/opt/kolosal-server
ExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
EOF
Enable and Start Service:
# Create user for service
sudo useradd -r -s /bin/false kolosal
# Install binary and config
sudo mkdir -p /opt/kolosal-server /etc/kolosal-server
sudo cp build/kolosal-server /opt/kolosal-server/
sudo cp config.example.yaml /etc/kolosal-server/config.yaml
sudo chown -R kolosal:kolosal /opt/kolosal-server
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable kolosal-server
sudo systemctl start kolosal-server
# Check status
sudo systemctl status kolosal-server
Common Build Issues:
- Missing dependencies:
  # Check for missing packages
  ldd build/kolosal-server
  # Install missing development packages
  sudo apt install -y libssl-dev libcurl4-openssl-dev
- CMake version too old:
  # Install newer CMake from the Kitware APT repository
  wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
  sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'
  sudo apt update && sudo apt install cmake
- CUDA compilation errors:
  # Verify CUDA installation
  nvcc --version
  nvidia-smi
  # Set CUDA environment variables if needed
  export CUDA_HOME=/usr/local/cuda
  export PATH=$CUDA_HOME/bin:$PATH
  export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
- Permission issues:
  # Fix ownership
  sudo chown -R $USER:$USER ./build
  # Make executable
  chmod +x build/kolosal-server
Performance Optimization:
- CPU Optimization:
  # Build with native optimizations
  cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" ..
- Memory Settings:
  # For systems with limited RAM, reduce parallel jobs
  make -j2
  # Set memory limits in config
  echo "server.max_memory_mb: 4096" >> config.yaml
- GPU Memory:
  # Monitor GPU usage
  watch nvidia-smi
  # Adjust GPU layers in model config
  # Reduce n_gpu_layers if running out of VRAM
Prerequisites:
- Windows 10/11
- Visual Studio 2019 or later
- CMake 3.20+
- CUDA Toolkit (optional, for GPU acceleration)
Building:
git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
mkdir build && cd build
cmake ..
cmake --build . --config Debug
./Debug/kolosal-server.exe
The server will start on http://localhost:8080 by default.
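Once the server is running, you can check it from any HTTP client, not just curl. Below is a minimal Python sketch (standard library only) that calls the health endpoint used elsewhere in this README; adjust the host and port if you changed them in your config.

import urllib.request

# Query the health endpoint; an HTTP 200 means the server is up
with urllib.request.urlopen("http://localhost:8080/v1/health", timeout=5) as resp:
    print(resp.status, resp.read().decode("utf-8"))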
Kolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.
Minimal configuration (YAML):
server:
  port: "8080"
models:
  - id: "my-model"
    path: "./models/model.gguf"
    load_immediately: true
A fuller example with authentication, model preloading, and metrics enabled (YAML):
server:
  port: "8080"
  max_connections: 500
  worker_threads: 8
auth:
  enabled: true
  require_api_key: true
  api_keys:
    - "sk-your-api-key-here"
models:
  - id: "gpt-3.5-turbo"
    path: "./models/gpt-3.5-turbo.gguf"
    load_immediately: true
    main_gpu_id: 0
    load_params:
      n_ctx: 4096
      n_gpu_layers: 50
features:
  metrics: true # Enable /metrics and /completion-metrics
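Because a malformed YAML file is a common source of startup problems, it can help to parse the config before launching the server. The following sketch assumes the PyYAML package is installed and only checks that the file parses, listing the sections it found; it does not validate Kolosal-specific settings.

import yaml  # PyYAML, assumed installed: pip install pyyaml

# Parse config.yaml and print the top-level sections and configured model IDs
with open("config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)
print("Top-level sections:", list(config.keys()))
print("Models configured:", [m.get("id") for m in config.get("models", [])])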
For complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the Configuration Guide.
Before using chat completions, you need to add a model engine:
curl -X POST http://localhost:8080/engines \
-H "Content-Type: application/json" \
-d '{
"engine_id": "my-model",
"model_path": "path/to/your/model.gguf",
"load_immediately": true,
"n_ctx": 2048,
"n_gpu_layers": 0,
"main_gpu_id": 0
}'
For faster startup times, you can defer model loading until first use:
curl -X POST http://localhost:8080/engines \
-H "Content-Type: application/json" \
-d '{
"engine_id": "my-model",
"model_path": "https://huggingface.co/model-repo/model.gguf",
"load_immediately": false,
"n_ctx": 4096,
"n_gpu_layers": 30,
"main_gpu_id": 0
}'
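The same registration can be sent from Python. The sketch below uses the third-party requests package (assumed installed) and mirrors the lazy-loading payload above; with load_immediately set to false, the model is loaded only when the first completion request arrives.

import requests  # assumed installed: pip install requests

# Register an engine without loading the model yet (lazy loading)
payload = {
    "engine_id": "my-model",
    "model_path": "https://huggingface.co/model-repo/model.gguf",
    "load_immediately": False,
    "n_ctx": 4096,
    "n_gpu_layers": 30,
    "main_gpu_id": 0,
}
resp = requests.post("http://localhost:8080/engines", json=payload, timeout=30)
print(resp.status_code, resp.text)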
Basic chat completion:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "Hello, how are you today?"
}
],
"stream": false,
"temperature": 0.7,
"max_tokens": 100
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
"role": "assistant"
}
}
],
"created": 1749981228,
"id": "chatcmpl-80HTkM01z7aaaThFbuALkbTu",
"model": "my-model",
"object": "chat.completion",
"system_fingerprint": "fp_4d29efe704",
"usage": {
"completion_tokens": 15,
"prompt_tokens": 9,
"total_tokens": 24
}
}
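Since the endpoint follows the OpenAI chat completions schema, existing OpenAI client libraries can usually be pointed at Kolosal Server by overriding the base URL. Below is a sketch using the official openai Python package (v1.x, assumed installed); the api_key value is a placeholder and only matters if authentication is enabled in your config.

from openai import OpenAI  # openai>=1.0, assumed installed: pip install openai

# Point the client at the local Kolosal Server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello, how are you today?"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)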
Streaming chat completion:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "Tell me a short story about a robot."
}
],
"stream": true,
"temperature": 0.8,
"max_tokens": 150
}'
Response (Server-Sent Events):
data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":"Once"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":" upon"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}
data: [DONE]
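To consume the stream programmatically, read the response line by line, parse each data: chunk as JSON, and stop at the [DONE] sentinel. The sketch below uses the third-party requests package (assumed installed) and relies only on the chunk format shown above.

import json
import requests  # assumed installed: pip install requests

payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Tell me a short story about a robot."}],
    "stream": True,
    "temperature": 0.8,
    "max_tokens": 150,
}
with requests.post("http://localhost:8080/v1/chat/completions",
                   json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # Each chunk carries an incremental piece of the reply in delta.content
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()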
Multi-turn conversation:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{
"role": "system",
"content": "You are a helpful programming assistant."
},
{
"role": "user",
"content": "How do I create a simple HTTP server in Python?"
},
{
"role": "assistant",
"content": "You can create a simple HTTP server in Python using the built-in http.server module..."
},
{
"role": "user",
"content": "Can you show me the code?"
}
],
"stream": false,
"temperature": 0.7,
"max_tokens": 200
}'
Chat completion with advanced parameters:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"stream": false,
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 50,
"seed": 42,
"presence_penalty": 0.0,
"frequency_penalty": 0.0
}'
Basic text completion:
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"prompt": "The future of artificial intelligence is",
"stream": false,
"temperature": 0.7,
"max_tokens": 100
}'
Response:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"text": " bright and full of possibilities. As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields..."
}
],
"created": 1749981288,
"id": "cmpl-80HTkM01z7aaaThFbuALkbTu",
"model": "my-model",
"object": "text_completion",
"usage": {
"completion_tokens": 25,
"prompt_tokens": 8,
"total_tokens": 33
}
}
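The same text completion from Python, using the requests package (assumed installed); the payload mirrors the curl example above, and the text of the first choice is printed.

import requests  # assumed installed: pip install requests

payload = {
    "model": "my-model",
    "prompt": "The future of artificial intelligence is",
    "stream": False,
    "temperature": 0.7,
    "max_tokens": 100,
}
resp = requests.post("http://localhost:8080/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])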
Streaming text completion:
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "my-model",
"prompt": "Write a haiku about programming:",
"stream": true,
"temperature": 0.8,
"max_tokens": 50
}'
Response (Server-Sent Events):
data: {"choices":[{"finish_reason":"","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":"Code"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"","index":0,"text":" flows"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: {"choices":[{"finish_reason":"stop","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}
data: [DONE]
Multiple prompts in a single request:
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"prompt": [
"The weather today is",
"In other news,"
],
"stream": false,
"temperature": 0.5,
"max_tokens": 30
}'
Text completion with advanced parameters:
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"prompt": "Explain quantum computing:",
"stream": false,
"temperature": 0.2,
"top_p": 0.9,
"max_tokens": 100,
"seed": 123,
"presence_penalty": 0.0,
"frequency_penalty": 0.1
}'
Engine management:
# List all loaded engines
curl -X GET http://localhost:8080/v1/engines
# Check the status of a specific engine
curl -X GET http://localhost:8080/engines/my-model/status
# Remove an engine
curl -X DELETE http://localhost:8080/engines/my-model
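The same three operations from Python, using the requests package (assumed installed); the endpoints are exactly those shown above, and the raw response text is printed because the response bodies are not documented here.

import requests  # assumed installed: pip install requests

BASE = "http://localhost:8080"

# List all engines known to the server
resp = requests.get(f"{BASE}/v1/engines", timeout=10)
print(resp.status_code, resp.text)

# Check the status of a specific engine
resp = requests.get(f"{BASE}/engines/my-model/status", timeout=10)
print(resp.status_code, resp.text)

# Remove the engine when it is no longer needed
resp = requests.delete(f"{BASE}/engines/my-model", timeout=10)
print(resp.status_code, resp.text)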
The server provides real-time completion metrics for monitoring performance and usage:
curl -X GET http://localhost:8080/completion-metrics
Response:
{
"completion_metrics": {
"summary": {
"total_requests": 15,
"completed_requests": 14,
"failed_requests": 1,
"success_rate_percent": 93.33,
"total_input_tokens": 120,
"total_output_tokens": 350,
"avg_turnaround_time_ms": 1250.5,
"avg_tps": 12.8,
"avg_output_tps": 8.4,
"avg_ttft_ms": 245.2,
"avg_rps": 0.85
},
"per_engine": [
{
"model_name": "my-model",
"engine_id": "default",
"total_requests": 15,
"completed_requests": 14,
"failed_requests": 1,
"total_input_tokens": 120,
"total_output_tokens": 350,
"tps": 12.8,
"output_tps": 8.4,
"avg_ttft": 245.2,
"rps": 0.85,
"last_updated": "2025-06-16T17:04:12.123Z"
}
],
"timestamp": "2025-06-16T17:04:12.123Z"
}
}
Alternative endpoints:
# OpenAI-style endpoint
curl -X GET http://localhost:8080/v1/completion-metrics
# Alternative path
curl -X GET http://localhost:8080/completion/metrics
| Metric | Description |
|---|---|
| total_requests | Total number of completion requests received |
| completed_requests | Number of successfully completed requests |
| failed_requests | Number of requests that failed |
| success_rate_percent | Success rate as a percentage |
| total_input_tokens | Total input tokens processed |
| total_output_tokens | Total output tokens generated |
| avg_turnaround_time_ms | Average time from request to completion (ms) |
| avg_tps | Average tokens per second (input + output) |
| avg_output_tps | Average output tokens per second |
| avg_ttft_ms | Average time to first token (ms) |
| avg_rps | Average requests per second |
# Get completion metrics
$metrics = Invoke-RestMethod -Uri "http://localhost:8080/completion-metrics" -Method GET
Write-Output "Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%"
Write-Output "Average TPS: $($metrics.completion_metrics.summary.avg_tps)"
# Check server health
curl -X GET http://localhost:8080/v1/health
Chat completion parameters (/v1/chat/completions):

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | The ID of the model to use |
| messages | array | required | List of message objects |
| stream | boolean | false | Whether to stream responses |
| temperature | number | 1.0 | Sampling temperature (0.0-2.0) |
| top_p | number | 1.0 | Nucleus sampling parameter |
| max_tokens | integer | 128 | Maximum tokens to generate |
| seed | integer | random | Random seed for reproducible outputs |
| presence_penalty | number | 0.0 | Presence penalty (-2.0 to 2.0) |
| frequency_penalty | number | 0.0 | Frequency penalty (-2.0 to 2.0) |

Text completion parameters (/v1/completions):

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | The ID of the model to use |
| prompt | string/array | required | Text prompt or array of prompts |
| stream | boolean | false | Whether to stream responses |
| temperature | number | 1.0 | Sampling temperature (0.0-2.0) |
| top_p | number | 1.0 | Nucleus sampling parameter |
| max_tokens | integer | 16 | Maximum tokens to generate |
| seed | integer | random | Random seed for reproducible outputs |
| presence_penalty | number | 0.0 | Presence penalty (-2.0 to 2.0) |
| frequency_penalty | number | 0.0 | Frequency penalty (-2.0 to 2.0) |

Message object fields:

| Field | Type | Description |
|---|---|---|
| role | string | Role: "system", "user", or "assistant" |
| content | string | The content of the message |

Engine creation parameters (/engines):

| Parameter | Type | Default | Description |
|---|---|---|---|
| engine_id | string | required | Unique identifier for the engine |
| model_path | string | required | Path to the GGUF model file or URL |
| load_immediately | boolean | true | Whether to load the model immediately or defer until first use |
| n_ctx | integer | 4096 | Context window size |
| n_gpu_layers | integer | 100 | Number of layers to offload to GPU |
| main_gpu_id | integer | 0 | Primary GPU device ID |
The server returns standard HTTP status codes and JSON error responses:
{
"error": {
"message": "Model 'non-existent-model' not found or could not be loaded",
"type": "invalid_request_error",
"param": null,
"code": null
}
}
Common error codes:
- 400 - Bad Request (invalid JSON, missing parameters)
- 404 - Not Found (model/engine not found)
- 500 - Internal Server Error (inference failures)
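Client code should check the HTTP status code and, on failure, read the error object shown above. Below is a sketch with the requests package (assumed installed) that distinguishes the common cases; the model name is deliberately invalid so that it produces the not-found error shown above.

import requests  # assumed installed: pip install requests

payload = {
    "model": "non-existent-model",
    "messages": [{"role": "user", "content": "Hello"}],
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)

if resp.status_code == 200:
    print(resp.json()["choices"][0]["message"]["content"])
else:
    # 400 = bad request, 404 = model/engine not found, 500 = inference failure
    error = resp.json().get("error", {})
    print(f"Request failed ({resp.status_code}): {error.get('message')}")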
For Windows users, here are PowerShell equivalents:
Add a model engine:
$body = @{
engine_id = "my-model"
model_path = "C:\path\to\model.gguf"
load_immediately = $true
n_ctx = 2048
n_gpu_layers = 0
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8080/engines" -Method POST -Body $body -ContentType "application/json"
Chat completion:
$body = @{
model = "my-model"
messages = @(
@{
role = "user"
content = "Hello, how are you?"
}
)
stream = $false
temperature = 0.7
max_tokens = 100
} | ConvertTo-Json -Depth 3
Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"
Text completion:
$body = @{
model = "my-model"
prompt = "The future of AI is"
stream = $false
temperature = 0.7
max_tokens = 50
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8080/v1/completions" -Method POST -Body $body -ContentType "application/json"
For developers looking to contribute to or extend Kolosal Server, comprehensive documentation is available in the docs/ directory:
- Developer Guide - Complete setup, architecture, and development workflows
- Configuration Guide - Complete server configuration in JSON and YAML formats
- Architecture Overview - Detailed system design and component relationships
- Adding New Routes - Step-by-step guide for implementing API endpoints
- Adding New Models - Guide for creating data models and JSON handling
- API Specification - Complete API reference with examples
- Documentation Index - Complete documentation overview
- Project Structure - Understanding the codebase
- Contributing Guidelines - How to contribute
Kolosal Server is built on top of excellent open-source projects and we want to acknowledge their contributions:
This project is powered by llama.cpp, developed by Georgi Gerganov and the ggml-org community. llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.
- Project: https://github.com/ggml-org/llama.cpp
- License: MIT License
- Description: Inference of Meta's LLaMA model (and others) in pure C/C++
We extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.
- yaml-cpp: YAML parsing and emitting library
- nlohmann/json: JSON library for Modern C++
- libcurl: Client-side URL transfer library
- prometheus-cpp: Prometheus metrics library for C++
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We welcome contributions! Please see our Developer Documentation for detailed guides on:
- Getting Started: Developer Guide
- Understanding the System: Architecture Overview
- Adding Features: Route and Model guides
- API Changes: API Specification
- Fork the repository
- Follow the Developer Guide for setup
- Create a feature branch
- Implement your changes following our guides
- Add tests and update documentation
- Submit a Pull Request
- Issues: Report bugs and feature requests on GitHub Issues
- Documentation: Check the docs/ directory for comprehensive guides
- Discussions: Join Kolosal AI Discord for questions and community support