AI Server Setup Ansible Project

This project automates the setup of an Ubuntu-based AI server/workstation.

Features

  • Base Configuration: Common packages, UFW firewall, system updates.
  • NVIDIA Setup: Installs NVIDIA drivers (535 server) and CUDA Toolkit.
  • Docker: Installs Docker Engine and NVIDIA Container Toolkit (enabling GPU support in containers).
  • Python: Installs Miniconda and sets up a default ai_env with PyTorch, Pandas, and JupyterLab.

Prerequisites

  • A target machine running Ubuntu 22.04 LTS or newer.
  • SSH access to the target machine.
  • Ansible installed on your control machine (the one running the playbook).

Setup

  1. Configure Inventory: Edit inventory/hosts.ini to add your target server's IP address and SSH details.

    [ai_servers]
    192.168.1.xxx ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/id_rsa
  2. Configure Variables (Optional): You can tweak roles or playbook.yml vars if needed (e.g., enable/disable driver install).

Usage

Run the playbook:

ansible-playbook playbook.yml

To limit to specific tags:

ansible-playbook playbook.yml --tags "base,docker"

Verification

After the playbook completes:

  1. SSH into the server:

    ssh ubuntu@<server-ip>
  2. Check NVIDIA Drivers:

    nvidia-smi
  3. Check Docker GPU Support:

    sudo docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
  4. Check Python Environment:

    source ~/miniconda3/bin/activate ai_env
    python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Advanced Usage

Unified Inference Launcher (Dockerized)

Run any LLM using the unified script. It wraps llama.cpp, vLLM, or custom containers:

# Run GGUF (llama.cpp)
./scripts/run_inference.py <model.gguf>

# Run Standard Model (vLLM) - OpenAI Compatible API
./scripts/run_inference.py <model_dir_or_repo> --engine vllm

# Run Custom Image
./scripts/run_inference.py --engine custom --image <image_name>

The script automatically handles GPU detection and Docker flags.
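The detection logic amounts to counting GPUs and choosing Docker flags accordingly. A simplified sketch of the idea (illustrative only, not the script's actual implementation; `gpu_docker_flags` is a hypothetical helper):

```python
import shutil
import subprocess

def count_gpus(listing: str) -> int:
    """Count GPUs in `nvidia-smi -L` output (one 'GPU N: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))

def gpu_docker_flags(n_gpus: int) -> list[str]:
    """Docker flags to expose all GPUs; empty on a CPU-only host."""
    return ["--gpus", "all"] if n_gpus > 0 else []

# Query the driver only if nvidia-smi is actually installed
if shutil.which("nvidia-smi"):
    listing = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True).stdout
    n = count_gpus(listing)
else:
    n = 0

print(gpu_docker_flags(n))
```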

Profit Switcher

The scripts/profit_switcher.py script monitors Clore.ai rental prices vs your electricity cost.

  1. Set env vars:
    export CLORE_API_KEY="your_key"
    export CLORE_SERVER_ID="your_id"
    export RIG_POWER_KW="0.8" # Approximate power consumption in kW
  2. Run the script (ideally via cron):
    sudo -E python3 scripts/profit_switcher.py
    Note: sudo -E preserves environment variables.
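The underlying profitability check boils down to comparing hourly rental revenue against hourly electricity cost. A minimal sketch (variable names are illustrative, not the script's actual API):

```python
def hourly_profit(rental_usd_per_hour: float,
                  rig_power_kw: float,
                  electricity_usd_per_kwh: float) -> float:
    """Net profit per hour: rental revenue minus electricity cost."""
    return rental_usd_per_hour - rig_power_kw * electricity_usd_per_kwh

# Example: $0.40/h rental, 0.8 kW rig, $0.25/kWh electricity
profit = hourly_profit(0.40, 0.8, 0.25)  # 0.40 - 0.20 = 0.20 USD/h
rent_out = profit > 0
```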

Provider Management

Provider Selector: switch between workloads to avoid resource conflicts:

sudo python3 scripts/select_provider.py [clore|vast|ionet|local]

Vast.ai Setup

  1. Get your unique start command from the Vast.ai dashboard.
  2. Run: sudo setup_vast.sh '<your_command>'

Telegram Bot Integration

You can chat directly with your local Ollama instance via Telegram.

Option A: OpenClaw (Recommended for Agents)

  1. Get a Token from @BotFather.
  2. Configure OpenClaw to use Telegram as a channel.

Option B: Standalone Bot (Direct Chat)

A lightweight Docker container that bridges Telegram directly to Ollama.

  1. Export your Token: export TELEGRAM_TOKEN="your_token_here"
  2. Start Ollama Service (with sudo -E to pass the token):
    sudo -E python3 scripts/select_provider.py ollama
    The script detects the token and automatically launches the telegram-bot container alongside Ollama.

Monitoring via Grafana

  • Grafana: http://<server-ip>:3000 (Default user/pass: admin/admin)
  • Prometheus: http://<server-ip>:9090
  • Alerts: Pre-configured to fire if any GPU exceeds 80°C.

PCIe x1 Optimization

For mining rigs with x1 risers:

  1. Use GGUF format: llama.cpp with GGUF leverages mmap better than other loaders, reducing initial load times.
  2. Fit Model in VRAM: Ensure the model (-ngl 99) fits entirely in VRAM. Swapping over PCIe x1 is detrimental to performance.
  3. Context Shifting: If using huge context, be aware that KV cache processing might be bottlenecked by bandwidth if split across cards. Keep batch sizes lower.
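A quick way to sanity-check point 2 before loading: estimate whether the model plus KV cache fits in total VRAM. The numbers below are rough and illustrative only:

```python
def fits_in_vram(model_gb: float, kv_cache_gb: float,
                 vram_per_gpu_gb: list[float],
                 overhead_gb: float = 1.0) -> bool:
    """Rough check that model + KV cache fit in combined VRAM,
    reserving some overhead per GPU for the CUDA context etc."""
    usable = sum(v - overhead_gb for v in vram_per_gpu_gb)
    return model_gb + kv_cache_gb <= usable

# Example: a 7B model quantized to ~4.5 GB plus 2 GB KV cache on 2x 12 GB cards
ok = fits_in_vram(4.5, 2.0, [12.0, 12.0])  # usable = 22 GB, so it fits
```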

Containerized Architecture

All agents run in Docker containers using the "Sibling Docker" pattern (binding /var/run/docker.sock).

  • Clore Agent: clore-agent container.
  • Vast Agent: vast-agent container.
  • Local AI: llama.cpp container (ephemeral).
  • Monitoring: prometheus, grafana containers.

External Access / API Gateway

For connecting multiple clients (e.g., Clawdbots), use the LiteLLM Proxy.

  • Endpoint: http://<server-ip>:4000
  • API Key: sk-aiserver-admin (Master Key)
  • Supported Models: llama3, mistral, custom (Routed to Ollama internally).

Why use this?

  • Provides a stable OpenAI-compatible endpoint.
  • Handles queuing better than raw Ollama.
  • Secured with an API Key.
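Any OpenAI-compatible client works against the gateway. A minimal sketch using only the Python standard library (endpoint and key taken from above; the model name is just an example):

```python
import json
import urllib.request

BASE_URL = "http://<server-ip>:4000"   # replace <server-ip> with your server's IP
API_KEY = "sk-aiserver-admin"

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )

req = chat_request("llama3", "Hello!")
# urllib.request.urlopen(req) would send it once <server-ip> is filled in
```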

Storage Configuration

The system is configured to use your multi-drive setup:

  • System (NVMe): OS & Docker images.
  • Models (SSD): /mnt/ssd/models (Ollama) & /mnt/ssd/cache (HuggingFace).
  • Data (HDD): /mnt/hdd/data (datasets) & /mnt/hdd/backups.
  • Swap (NVMe): 64 GB swapfile configured for AI workloads (swappiness 10).

Configuration file: ~/aiserver/config/storage.env. Edit this file if your mount points differ.

Client Data Isolation

To keep data separate for different bots/clients:

  1. Create Storage:

    ./scripts/create_client_storage.sh my-client-1

    Creates /mnt/hdd/data/clients/my-client-1.

  2. Usage in Docker: Map the volume: -v /mnt/hdd/data/clients/my-client-1:/app/data (The create_agent.sh script does this automatically for new agents).

System Architecture

For a detailed diagram of how all components (LiteLLM, Ollama, Storage, GPUs) work together, read the System Architecture Presentation.

For a slide deck summary, see the Project Slides.

Operational Procedures

  • Client Onboarding & Removal SOP: Deployment and cleanup guides.

Multi-GPU Distribution

The system automatically balances loads across all available GPUs.

Distribution methods by engine:

  • Ollama / llama.cpp (layer offloading/splitting): splits model layers across GPUs (e.g., 20 layers on GPU 0, 20 on GPU 1). Best for mixed cards (e.g., 3090 + 4090).
  • vLLM (tensor parallelism): splits mathematical operations across GPUs. Requires identical cards for best performance (e.g., 2x 3090).

Note: The scripts automatically detect your GPU count and apply the correct sharding strategy.
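For the layer-offloading case, the split is essentially proportional to each card's VRAM. A simplified sketch of that idea (not the scripts' actual code):

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign model layers to GPUs proportionally to their VRAM.
    Any remainder layers go to the largest cards first."""
    total = sum(vram_gb)
    shares = [int(n_layers * v / total) for v in vram_gb]
    leftover = n_layers - sum(shares)
    # Hand out remaining layers to the GPUs with the most VRAM
    for i in sorted(range(len(vram_gb)), key=lambda j: -vram_gb[j])[:leftover]:
        shares[i] += 1
    return shares

# e.g. 40 layers over two identical 12 GB cards -> 20 / 20,
#      40 layers over a 24 GB + 12 GB pair     -> 27 / 13
```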

Choosing Your Engine: vLLM vs Ollama

Feature            Ollama                            vLLM
Best for           Ease of use & OpenClaw            Raw performance & throughput
Model management   Automatic (ollama pull)           Manual / HuggingFace cache
Model switching    Dynamic (on-demand loading)       Manual restart required
API                OpenAI + native                   OpenAI-compatible
Speed              Good (based on llama.cpp)         Very fast (optimized PagedAttention)
Setup              One command (select_provider.py)  Script flag (--engine vllm)

Recommendation:

  • Start with Ollama. It's easier, persistent, and works great with OpenClaw.
  • Switch to vLLM only if you need higher token speeds for massive contexts or concurrent agents.

Building Custom Agents

You can easily create your own AI agents that run in Docker and connect to your local server.

  1. Create a new agent:
    ./scripts/create_agent.sh my_new_bot
    This generates a boilerplate Python project with a Dockerfile, pre-configured to talk to your local Ollama instance.
  2. Run it:
    cd ~/my_agents/my_new_bot
    ./run.sh

Fine-Tuning Your Own Models

Create custom AI models trained on your data (using Unsloth for 2x speed).

  1. Prepare Data: Create a .jsonl file (Alpaca format).
    {"instruction": "Question...", "input": "", "output": "Answer..."}
  2. Run Training:
    ./scripts/finetune.sh --data example_dataset.jsonl --name my-custom-model
  3. Use It:
    • The script automatically installs it into Ollama.
    • Run: scripts/run_inference.py my-custom-model --engine ollama
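A dataset in the expected format can be produced with a few lines of Python (field names follow the Alpaca convention shown above; the example rows are made up):

```python
import json

examples = [
    {"instruction": "What port does Ollama listen on?",
     "input": "", "output": "11434 by default."},
    {"instruction": "Summarize this log line.",
     "input": "GPU 0: temperature 81C",
     "output": "GPU 0 is running hot (81C)."},
]

with open("example_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line
```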

GPU Optimization & Power Tuning

A systemd service is installed to optimize your GPUs on boot.

To Adjust Power Limits:

  1. Edit: sudo nano /usr/local/bin/optimize_gpus.sh
  2. Uncomment: # nvidia-smi -pl 300
  3. Apply: sudo systemctl restart gpu-optimizer

OpenClaw Integration

Your server is ready for OpenClaw. The easiest method is using Ollama (Dockerized):

  1. Start Ollama Container: sudo scripts/select_provider.py ollama
  2. Run a model: scripts/run_inference.py llama3 --engine ollama
  3. Configure OpenClaw:
    • Provider: Select Custom Provider (sometimes labeled "OpenAI Compatible").
    • Base URL: http://<server-ip>:11434/v1
      • IMPORTANT: You MUST append /v1 to the URL so it treats Ollama like OpenAI.
    • Model: llama3 (or the exact name from ollama list).
    • API Key: ollama (Dummy value).
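The /v1 suffix is what selects Ollama's OpenAI-compatible routes rather than its native API. A small illustration (`base_url_for_openclaw` is a hypothetical helper, not part of the repo):

```python
# Native Ollama endpoint (its own JSON schema):
native = "http://<server-ip>:11434/api/generate"

# OpenAI-compatible endpoint OpenClaw expects (note the /v1):
openai_compat = "http://<server-ip>:11434/v1/chat/completions"

def base_url_for_openclaw(host: str, port: int = 11434) -> str:
    """Base URL to paste into OpenClaw: host:port plus the /v1 suffix."""
    return f"http://{host}:{port}/v1"
```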

Model Management

How to update or add new models to Ollama:

1. Update an Existing Model

docker exec -it ollama ollama pull llama3

Ollama checks for updates. If a new version exists, it downloads it. The next API request will use the new version.

2. Add a Completely New Model

docker exec -it ollama ollama pull mistral

It downloads the new model to disk.

3. Switch Models

  • API: Change "model": "llama3" to "model": "mistral" in your request.
  • OpenClaw: Change the Model Name in settings.
  • Ollama automatically unloads the old model and loads the new one from SSD.

Alternatively, use vLLM for OpenAI-compatible API at port 8000.
