Kolosal Server

A high-performance inference server for large language models with OpenAI-compatible API endpoints. Now available for both Windows and Linux systems!

Platform Support

  • 🪟 Windows: Full support with Visual Studio and MSVC
  • 🐧 Linux: Native support with GCC/Clang
  • 🎮 GPU Acceleration: NVIDIA CUDA and Vulkan support
  • 📦 Easy Installation: Direct binary installation or build from source

Features

  • 🚀 Fast Inference: Built with llama.cpp for optimized model inference
  • 🔗 OpenAI Compatible: Drop-in replacement for OpenAI API endpoints
  • 📡 Streaming Support: Real-time streaming responses for chat completions
  • 🎛️ Multi-Model Management: Load and manage multiple models simultaneously
  • 📊 Real-time Metrics: Monitor completion performance with TPS, TTFT, and success rates
  • ⚙️ Lazy Loading: Defer model loading until first request with load_immediately=false
  • 🔧 Configurable: Flexible model loading parameters and inference settings
  • 🔒 Authentication: API key and rate limiting support
  • 🌐 Cross-Platform: Windows and Linux native builds

Quick Start

Linux (Recommended)

Prerequisites

System Requirements:

  • Ubuntu 20.04+ or equivalent Linux distribution (CentOS 8+, Fedora 32+, Arch Linux)
  • GCC 9+ or Clang 10+
  • CMake 3.14+
  • Git with submodule support
  • At least 4GB RAM (8GB+ recommended for larger models)
  • CUDA Toolkit 11.0+ (optional, for NVIDIA GPU acceleration)
  • Vulkan SDK (optional, for alternative GPU acceleration)
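
Before installing anything, it can help to confirm what is already on the system. The version checks below are a quick sketch; the CUDA and Vulkan lines only matter if you plan to build with GPU support, and will simply report an error if those toolkits are absent:

# Verify compiler and build tooling
gcc --version        # or: clang --version
cmake --version
git --version

# Optional GPU toolchains
nvcc --version       # CUDA Toolkit (NVIDIA GPU builds only)
vulkaninfo           # Vulkan SDK / drivers (Vulkan builds only)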

Install Dependencies:

Ubuntu/Debian:

# Update package list
sudo apt update

# Install essential build tools
sudo apt install -y build-essential cmake git pkg-config

# Install required libraries
sudo apt install -y libcurl4-openssl-dev libyaml-cpp-dev

# Optional: Install CUDA for GPU support
# Follow NVIDIA's official installation guide for your distribution

CentOS/RHEL/Fedora:

# For CentOS/RHEL 8+
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake git curl-devel yaml-cpp-devel

# For Fedora
sudo dnf install gcc-c++ cmake git libcurl-devel yaml-cpp-devel

Arch Linux:

sudo pacman -S base-devel cmake git curl yaml-cpp

Building from Source

1. Clone the Repository:

git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server

2. Initialize Submodules:

git submodule update --init --recursive

3. Create Build Directory:

mkdir build && cd build

4. Configure Build:

Standard Build (CPU-only):

cmake -DCMAKE_BUILD_TYPE=Release ..

With CUDA Support:

cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON ..

With Vulkan Support:

cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_VULKAN=ON ..

Debug Build:

cmake -DCMAKE_BUILD_TYPE=Debug ..

5. Build the Project:

# Use all available CPU cores
make -j$(nproc)

# Or specify number of cores manually
make -j4

6. Verify Build:

# Check if the executable was created
ls -la kolosal-server

# Test basic functionality
./kolosal-server --help

Running the Server

Start the Server:

# From build directory
./kolosal-server

# Or specify a config file
./kolosal-server --config ../config.yaml

Background Service:

# Run in background
nohup ./kolosal-server > server.log 2>&1 &

# Check if running
ps aux | grep kolosal-server
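
To stop a background instance started this way (assuming only one kolosal-server process is running):

# Stop the background server
kill $(pgrep -f kolosal-server)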

Check Server Status:

# Test if server is responding
curl http://localhost:8080/v1/health

Alternative Installation Methods

Install to System Path:

# Install binary to /usr/local/bin
sudo cp build/kolosal-server /usr/local/bin/

# Make it executable
sudo chmod +x /usr/local/bin/kolosal-server

# Now you can run from anywhere
kolosal-server --help

Install with Package Manager (Future):

# Note: Package manager installation will be available in future releases
# For now, use the build from source method above

Installation as System Service

Create Service File:

sudo tee /etc/systemd/system/kolosal-server.service > /dev/null << EOF
[Unit]
Description=Kolosal Server - LLM Inference Server
After=network.target

[Service]
Type=simple
User=kolosal
Group=kolosal
WorkingDirectory=/opt/kolosal-server
ExecStart=/opt/kolosal-server/kolosal-server --config /etc/kolosal-server/config.yaml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

Enable and Start Service:

# Create user for service
sudo useradd -r -s /bin/false kolosal

# Install binary and config
sudo mkdir -p /opt/kolosal-server /etc/kolosal-server
sudo cp build/kolosal-server /opt/kolosal-server/
sudo cp config.example.yaml /etc/kolosal-server/config.yaml
sudo chown -R kolosal:kolosal /opt/kolosal-server

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable kolosal-server
sudo systemctl start kolosal-server

# Check status
sudo systemctl status kolosal-server
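
Because the unit file above sends output to the systemd journal, the service logs can be inspected with journalctl:

# Follow live service logs
sudo journalctl -u kolosal-server -f

# Show everything logged since the last boot
sudo journalctl -u kolosal-server -b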

Troubleshooting

Common Build Issues:

  1. Missing dependencies:

    # Check which shared libraries the binary links against (missing ones show as "not found")
    ldd build/kolosal-server
    
    # Install missing development packages
    sudo apt install -y libssl-dev libcurl4-openssl-dev
  2. CMake version too old:

    # Install a newer CMake from the Kitware APT repository
    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
    # Replace 'focal' with your Ubuntu release codename (e.g. 'jammy' for 22.04)
    sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main'
    sudo apt update && sudo apt install cmake
  3. CUDA compilation errors:

    # Verify CUDA installation
    nvcc --version
    nvidia-smi
    
    # Set CUDA environment variables if needed
    export CUDA_HOME=/usr/local/cuda
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
  4. Permission issues:

    # Fix ownership
    sudo chown -R $USER:$USER ./build
    
    # Make executable
    chmod +x build/kolosal-server

Performance Optimization:

  1. CPU Optimization:

    # Build with native optimizations
    cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native" ..
  2. Memory Settings:

    # For systems with limited RAM, reduce parallel jobs
    make -j2
    
    # Limit server memory by adding max_memory_mb under the server: section of config.yaml, e.g.
    #   server:
    #     max_memory_mb: 4096
  3. GPU Memory:

    # Monitor GPU usage
    watch nvidia-smi
    
    # Adjust GPU layers in model config
    # Reduce n_gpu_layers if running out of VRAM
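
Putting these options together, a typical optimized GPU build combines the flags shown above (drop -DLLAMA_CUDA=ON on CPU-only systems, and -march=native if the binary must also run on other machines):

cmake -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON -DCMAKE_CXX_FLAGS="-march=native" ..
make -j$(nproc)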

Windows

Prerequisites:

  • Windows 10/11
  • Visual Studio 2019 or later
  • CMake 3.20+
  • CUDA Toolkit (optional, for GPU acceleration)

Building:

git clone https://github.com/kolosalai/kolosal-server.git
cd kolosal-server
git submodule update --init --recursive
mkdir build && cd build
cmake ..
cmake --build . --config Debug

Running the Server

./Debug/kolosal-server.exe

The server will start on http://localhost:8080 by default.
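
For production use you will usually want an optimized binary instead of the Debug build above. With the default Visual Studio generator this is a standard multi-config Release build (output paths follow the generator's layout):

cmake --build . --config Release
./Release/kolosal-server.exe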

Configuration

Kolosal Server supports configuration through JSON and YAML files for advanced setup including authentication, logging, model preloading, and server parameters.

Quick Configuration Examples

Minimal Configuration (config.yaml)

server:
  port: "8080"

models:
  - id: "my-model"
    path: "./models/model.gguf"
    load_immediately: true

Production Configuration

server:
  port: "8080"
  max_connections: 500
  worker_threads: 8

auth:
  enabled: true
  require_api_key: true
  api_keys:
    - "sk-your-api-key-here"

models:
  - id: "gpt-3.5-turbo"
    path: "./models/gpt-3.5-turbo.gguf"
    load_immediately: true
    main_gpu_id: 0
    load_params:
      n_ctx: 4096
      n_gpu_layers: 50

features:
  metrics: true  # Enable /metrics and /completion-metrics

For complete configuration documentation including all parameters, authentication setup, CORS configuration, and more examples, see the Configuration Guide.
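
With authentication enabled as in the production example, clients must include their API key with each request. The header below assumes the usual OpenAI-style Authorization: Bearer convention; check the Configuration Guide for the exact header and key handling your deployment expects:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key-here" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}]
  }'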

API Usage

1. Add a Model Engine

Before using chat completions, you need a model engine. If you did not preload one through the config file, add one at runtime:

curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "path/to/your/model.gguf",
    "load_immediately": true,
    "n_ctx": 2048,
    "n_gpu_layers": 0,
    "main_gpu_id": 0
  }'

Lazy Loading

For faster startup times, you can defer model loading until first use:

curl -X POST http://localhost:8080/engines \
  -H "Content-Type: application/json" \
  -d '{
    "engine_id": "my-model",
    "model_path": "https://huggingface.co/model-repo/model.gguf",
    "load_immediately": false,
    "n_ctx": 4096,
    "n_gpu_layers": 30,
    "main_gpu_id": 0
  }'

2. Chat Completions

Non-Streaming Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you today?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'

Response:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "role": "assistant"
      }
    }
  ],
  "created": 1749981228,
  "id": "chatcmpl-80HTkM01z7aaaThFbuALkbTu",
  "model": "my-model",
  "object": "chat.completion",
  "system_fingerprint": "fp_4d29efe704",
  "usage": {
    "completion_tokens": 15,
    "prompt_tokens": 9,
    "total_tokens": 24
  }
}

Streaming Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story about a robot."
      }
    ],
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 150
  }'

Response (Server-Sent Events):

data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}

data: {"choices":[{"delta":{"content":"Once"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}

data: {"choices":[{"delta":{"content":" upon"},"finish_reason":null,"index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}

data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749981242,"id":"chatcmpl-1749981241-1","model":"my-model","object":"chat.completion.chunk","system_fingerprint":"fp_4d29efe704"}

data: [DONE]

Multi-Message Conversation

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful programming assistant."
      },
      {
        "role": "user",
        "content": "How do I create a simple HTTP server in Python?"
      },
      {
        "role": "assistant",
        "content": "You can create a simple HTTP server in Python using the built-in http.server module..."
      },
      {
        "role": "user",
        "content": "Can you show me the code?"
      }
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 200
  }'

Advanced Parameters

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "stream": false,
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 50,
    "seed": 42,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0
  }'

3. Completions

Non-Streaming Completion

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "The future of artificial intelligence is",
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'

Response:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "text": " bright and full of possibilities. As we continue to advance in machine learning and deep learning technologies, we can expect to see significant improvements in various fields..."
    }
  ],
  "created": 1749981288,
  "id": "cmpl-80HTkM01z7aaaThFbuALkbTu",
  "model": "my-model",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 8,
    "total_tokens": 33
  }
}

Streaming Completion

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "my-model",
    "prompt": "Write a haiku about programming:",
    "stream": true,
    "temperature": 0.8,
    "max_tokens": 50
  }'

Response (Server-Sent Events):

data: {"choices":[{"finish_reason":"","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}

data: {"choices":[{"finish_reason":"","index":0,"text":"Code"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}

data: {"choices":[{"finish_reason":"","index":0,"text":" flows"}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}

data: {"choices":[{"finish_reason":"stop","index":0,"text":""}],"created":1749981290,"id":"cmpl-1749981289-1","model":"my-model","object":"text_completion"}

data: [DONE]

Multiple Prompts

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": [
      "The weather today is",
      "In other news,"
    ],
    "stream": false,
    "temperature": 0.5,
    "max_tokens": 30
  }'

Advanced Completion Parameters

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "prompt": "Explain quantum computing:",
    "stream": false,
    "temperature": 0.2,
    "top_p": 0.9,
    "max_tokens": 100,
    "seed": 123,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.1
  }'

4. Engine Management

List Available Engines

curl -X GET http://localhost:8080/v1/engines

Get Engine Status

curl -X GET http://localhost:8080/engines/my-model/status

Remove an Engine

curl -X DELETE http://localhost:8080/engines/my-model

5. Completion Metrics and Monitoring

The server provides real-time completion metrics for monitoring performance and usage:

Get Completion Metrics

curl -X GET http://localhost:8080/completion-metrics

Response:

{
  "completion_metrics": {
    "summary": {
      "total_requests": 15,
      "completed_requests": 14,
      "failed_requests": 1,
      "success_rate_percent": 93.33,
      "total_input_tokens": 120,
      "total_output_tokens": 350,
      "avg_turnaround_time_ms": 1250.5,
      "avg_tps": 12.8,
      "avg_output_tps": 8.4,
      "avg_ttft_ms": 245.2,
      "avg_rps": 0.85
    },
    "per_engine": [
      {
        "model_name": "my-model",
        "engine_id": "default",
        "total_requests": 15,
        "completed_requests": 14,
        "failed_requests": 1,
        "total_input_tokens": 120,
        "total_output_tokens": 350,
        "tps": 12.8,
        "output_tps": 8.4,
        "avg_ttft": 245.2,
        "rps": 0.85,
        "last_updated": "2025-06-16T17:04:12.123Z"
      }
    ],
    "timestamp": "2025-06-16T17:04:12.123Z"
  }
}

Alternative endpoints:

# OpenAI-style endpoint
curl -X GET http://localhost:8080/v1/completion-metrics

# Alternative path
curl -X GET http://localhost:8080/completion/metrics

Metrics Explained

Metric                  Description
----------------------  --------------------------------------------------
total_requests          Total number of completion requests received
completed_requests      Number of successfully completed requests
failed_requests         Number of requests that failed
success_rate_percent    Success rate as a percentage
total_input_tokens      Total input tokens processed
total_output_tokens     Total output tokens generated
avg_turnaround_time_ms  Average time from request to completion (ms)
avg_tps                 Average tokens per second (input + output)
avg_output_tps          Average output tokens per second
avg_ttft_ms             Average time to first token (ms)
avg_rps                 Average requests per second

PowerShell Example

# Get completion metrics
$metrics = Invoke-RestMethod -Uri "http://localhost:8080/completion-metrics" -Method GET
Write-Output "Success Rate: $($metrics.completion_metrics.summary.success_rate_percent)%"
Write-Output "Average TPS: $($metrics.completion_metrics.summary.avg_tps)"

6. Health Check

curl -X GET http://localhost:8080/v1/health
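
In scripts, for example when waiting for a freshly started server before sending requests, the health endpoint can simply be polled; a minimal sketch:

# Wait until the server responds to the health check
until curl -sf http://localhost:8080/v1/health > /dev/null; do
  sleep 1
done
echo "Kolosal Server is up"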

Parameters Reference

Chat Completion Parameters

Parameter          Type     Default   Description
-----------------  -------  --------  -------------------------------------
model              string   required  The ID of the model to use
messages           array    required  List of message objects
stream             boolean  false     Whether to stream responses
temperature        number   1.0       Sampling temperature (0.0-2.0)
top_p              number   1.0       Nucleus sampling parameter
max_tokens         integer  128       Maximum tokens to generate
seed               integer  random    Random seed for reproducible outputs
presence_penalty   number   0.0       Presence penalty (-2.0 to 2.0)
frequency_penalty  number   0.0       Frequency penalty (-2.0 to 2.0)

Completion Parameters

Parameter          Type          Default   Description
-----------------  ------------  --------  --------------------------------
model              string        required  The ID of the model to use
prompt             string/array  required  Text prompt or array of prompts
stream             boolean       false     Whether to stream responses
temperature        number        1.0       Sampling temperature (0.0-2.0)
top_p              number        1.0       Nucleus sampling parameter
max_tokens         integer       16        Maximum tokens to generate
seed               integer       random    Random seed for reproducible outputs
presence_penalty   number        0.0       Presence penalty (-2.0 to 2.0)
frequency_penalty  number        0.0       Frequency penalty (-2.0 to 2.0)

Message Object

Field    Type    Description
-------  ------  ---------------------------------------
role     string  Role: "system", "user", or "assistant"
content  string  The content of the message

Engine Loading Parameters

Parameter         Type     Default   Description
----------------  -------  --------  --------------------------------------------------------------
engine_id         string   required  Unique identifier for the engine
model_path        string   required  Path to the GGUF model file or URL
load_immediately  boolean  true      Whether to load the model immediately or defer until first use
n_ctx             integer  4096      Context window size
n_gpu_layers      integer  100       Number of layers to offload to GPU
main_gpu_id       integer  0         Primary GPU device ID

Error Handling

The server returns standard HTTP status codes and JSON error responses:

{
  "error": {
    "message": "Model 'non-existent-model' not found or could not be loaded",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

Common error codes:

  • 400 - Bad Request (invalid JSON, missing parameters)
  • 404 - Not Found (model/engine not found)
  • 500 - Internal Server Error (inference failures)
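
When debugging, it is often useful to see the HTTP status code alongside the error body; curl can print it explicitly:

# Print the response body followed by the HTTP status code
curl -s -w "\nHTTP status: %{http_code}\n" \
  -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "non-existent-model", "messages": [{"role": "user", "content": "Hi"}]}'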

Examples with PowerShell

For Windows users, here are PowerShell equivalents:

Add Engine

$body = @{
    engine_id = "my-model"
    model_path = "C:\path\to\model.gguf"
    load_immediately = $true
    n_ctx = 2048
    n_gpu_layers = 0
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://localhost:8080/engines" -Method POST -Body $body -ContentType "application/json"

Chat Completion

$body = @{
    model = "my-model"
    messages = @(
        @{
            role = "user"
            content = "Hello, how are you?"
        }
    )
    stream = $false
    temperature = 0.7
    max_tokens = 100
} | ConvertTo-Json -Depth 3

Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method POST -Body $body -ContentType "application/json"

Completion

$body = @{
    model = "my-model"
    prompt = "The future of AI is"
    stream = $false
    temperature = 0.7
    max_tokens = 50
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://localhost:8080/v1/completions" -Method POST -Body $body -ContentType "application/json"

📚 Developer Documentation

For developers looking to contribute to or extend Kolosal Server, comprehensive documentation (getting-started material, implementation guides, and quick links) is available in the docs/ directory.


Acknowledgments

Kolosal Server is built on top of excellent open-source projects, and we want to acknowledge their contributions:

llama.cpp

This project is powered by llama.cpp, developed by Georgi Gerganov and the ggml-org community. llama.cpp provides the high-performance inference engine that makes Kolosal Server possible.

We extend our gratitude to the llama.cpp team for their incredible work on optimized LLM inference, which forms the foundation of our server's performance capabilities.


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

We welcome contributions! Please see our Developer Documentation for detailed guides on:

  1. Getting Started: Developer Guide
  2. Understanding the System: Architecture Overview
  3. Adding Features: Route and Model guides
  4. API Changes: API Specification

Quick Contributing Steps

  1. Fork the repository
  2. Follow the Developer Guide for setup
  3. Create a feature branch
  4. Implement your changes following our guides
  5. Add tests and update documentation
  6. Submit a Pull Request

Support

  • Issues: Report bugs and feature requests on GitHub Issues
  • Documentation: Check the docs/ directory for comprehensive guides
  • Discussions: Join Kolosal AI Discord for questions and community support
