AI Server Setup Ansible Project

This project automates the setup of an Ubuntu-based AI server/workstation.

Features

  • Base Configuration: Common packages, UFW firewall, system updates.
  • NVIDIA Setup: Installs NVIDIA drivers (535 server) and CUDA Toolkit.
  • Docker: Installs Docker Engine and NVIDIA Container Toolkit (enabling GPU support in containers).
  • Python: Installs Miniconda and sets up a default ai_env with PyTorch, Pandas, and JupyterLab.

Prerequisites

  • A target machine running Ubuntu 22.04 LTS or newer.
  • SSH access to the target machine.
  • Ansible installed on your control machine (the one running the playbook).

Setup

  1. Configure Inventory: Edit inventory/hosts.ini to add your target server's IP address and SSH details.

    [ai_servers]
    192.168.1.xxx ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/id_rsa
  2. Configure Variables (Optional): You can tweak roles or playbook.yml vars if needed (e.g., enable/disable driver install).

Usage

Run the playbook:

ansible-playbook playbook.yml

To limit to specific tags:

ansible-playbook playbook.yml --tags "base,docker"

Verification

After the playbook completes:

  1. SSH into the server:

    ssh ubuntu@<server-ip>
  2. Check NVIDIA Drivers:

    nvidia-smi
  3. Check Docker GPU Support:

    sudo docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
  4. Check Python Environment:

    source ~/miniconda3/bin/activate ai_env
    python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Advanced Usage

Unified Inference Launcher (Dockerized)

Run any LLM using the unified script. It wraps llama.cpp, vLLM, or custom containers:

# Run GGUF (llama.cpp)
./scripts/run_inference.py <model.gguf>

# Run Standard Model (vLLM) - OpenAI Compatible API
./scripts/run_inference.py <model_dir_or_repo> --engine vllm

# Run Custom Image
./scripts/run_inference.py --engine custom --image <image_name>

The script automatically handles GPU detection and Docker flags.
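The detection logic amounts to counting GPUs and choosing Docker flags accordingly. A simplified sketch of the idea (illustrative only, not the script's actual implementation; `gpu_docker_flags` is a hypothetical helper):

```python
import shutil
import subprocess

def count_gpus(listing: str) -> int:
    """Count GPUs in `nvidia-smi -L` output (one 'GPU N: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))

def gpu_docker_flags(n_gpus: int) -> list[str]:
    """Docker flags to expose all GPUs; empty on a CPU-only host."""
    return ["--gpus", "all"] if n_gpus > 0 else []

# Query the driver only if nvidia-smi is actually installed
if shutil.which("nvidia-smi"):
    listing = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True).stdout
    n = count_gpus(listing)
else:
    n = 0

print(gpu_docker_flags(n))
```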

Profit Switcher

The scripts/profit_switcher.py script monitors Clore.ai rental prices vs your electricity cost.

  1. Set env vars:
    export CLORE_API_KEY="your_key"
    export CLORE_SERVER_ID="your_id"
    export RIG_POWER_KW="0.8" # Approximate power consumption in kW
  2. Run the script (ideally via cron):
    sudo -E python3 scripts/profit_switcher.py
    Note: sudo -E preserves environment variables.
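The underlying profitability check boils down to comparing hourly rental revenue against hourly electricity cost. A minimal sketch (variable names are illustrative, not the script's actual API):

```python
def hourly_profit(rental_usd_per_hour: float,
                  rig_power_kw: float,
                  electricity_usd_per_kwh: float) -> float:
    """Net profit per hour: rental revenue minus electricity cost."""
    return rental_usd_per_hour - rig_power_kw * electricity_usd_per_kwh

# Example: $0.40/h rental, 0.8 kW rig, $0.25/kWh electricity
profit = hourly_profit(0.40, 0.8, 0.25)  # 0.40 - 0.20 = 0.20 USD/h
rent_out = profit > 0
```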

Provider Management

Provider Selector: switch between workloads to avoid resource conflicts:

sudo python3 scripts/select_provider.py [clore|vast|ionet|local]

Vast.ai Setup

  1. Get your unique start command from the Vast.ai dashboard.
  2. Run: sudo setup_vast.sh '<your_command>'

Telegram Bot Integration

You can chat directly with your local Ollama instance via Telegram.

Option A: OpenClaw (Recommended for Agents)

  1. Get a Token from @BotFather.
  2. Configure OpenClaw to use Telegram as a channel.

Option B: Standalone Bot (Direct Chat)

A lightweight Docker container that bridges Telegram directly to Ollama.

  1. Export your Token: export TELEGRAM_TOKEN="your_token_here"
  2. Start Ollama Service (with sudo -E to pass the token):
    sudo -E python3 scripts/select_provider.py ollama
    The script detects the token and automatically launches the telegram-bot container alongside Ollama.

Monitoring via Grafana

  • Grafana: http://<server-ip>:3000 (Default user/pass: admin/admin)
  • Prometheus: http://<server-ip>:9090
  • Alerts: Pre-configured to fire if any GPU exceeds 80°C.

PCIe x1 Optimization

For mining rigs with x1 risers:

  1. Use GGUF format: llama.cpp with GGUF leverages mmap better than other loaders, reducing initial load times.
  2. Fit Model in VRAM: Ensure the model (-ngl 99) fits entirely in VRAM. Swapping over PCIe x1 is detrimental to performance.
  3. Context Shifting: If using huge context, be aware that KV cache processing might be bottlenecked by bandwidth if split across cards. Keep batch sizes lower.
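A quick way to sanity-check point 2 before loading: estimate whether the model plus KV cache fits in total VRAM. The numbers below are rough and illustrative only:

```python
def fits_in_vram(model_gb: float, kv_cache_gb: float,
                 vram_per_gpu_gb: list[float],
                 overhead_gb: float = 1.0) -> bool:
    """Rough check that model + KV cache fit in combined VRAM,
    reserving some overhead per GPU for the CUDA context etc."""
    usable = sum(v - overhead_gb for v in vram_per_gpu_gb)
    return model_gb + kv_cache_gb <= usable

# Example: a 7B model quantized to ~4.5 GB plus 2 GB KV cache on 2x 12 GB cards
ok = fits_in_vram(4.5, 2.0, [12.0, 12.0])  # usable = 22 GB, so it fits
```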

Containerized Architecture

All agents run in Docker containers using the "Sibling Docker" pattern (binding /var/run/docker.sock).

  • Clore Agent: clore-agent container.
  • Vast Agent: vast-agent container.
  • Local AI: llama.cpp container (ephemeral).
  • Monitoring: prometheus, grafana containers.

External Access / API Gateway

For connecting multiple clients (e.g., Clawdbots), use the LiteLLM Proxy.

  • Endpoint: http://<server-ip>:4000
  • API Key: sk-aiserver-admin (Master Key)
  • Supported Models: llama3, mistral, custom (Routed to Ollama internally).

Why use this?

  • Provides a stable OpenAI-compatible endpoint.
  • Handles queuing better than raw Ollama.
  • Secured with an API Key.
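Any OpenAI-compatible client works against the gateway. A minimal sketch using only the Python standard library (endpoint and key taken from above; the model name is just an example):

```python
import json
import urllib.request

BASE_URL = "http://<server-ip>:4000"   # replace <server-ip> with your server's IP
API_KEY = "sk-aiserver-admin"

def chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the LiteLLM proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )

req = chat_request("llama3", "Hello!")
# urllib.request.urlopen(req) would send it once <server-ip> is filled in
```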

Storage Configuration

The system is configured to use your multi-drive setup:

  • System (NVMe): OS & Docker images.
  • Models (SSD): /mnt/ssd/models (Ollama) & /mnt/ssd/cache (HuggingFace).
  • Data (HDD): /mnt/hdd/data (datasets) & /mnt/hdd/backups.
  • Swap (NVMe): 64 GB swapfile configured for AI workloads (swappiness 10).

Configuration file: ~/aiserver/config/storage.env. Edit this file if your mount points differ.

Client Data Isolation

To keep data separate for different bots/clients:

  1. Create Storage:

    ./scripts/create_client_storage.sh my-client-1

    Creates /mnt/hdd/data/clients/my-client-1.

  2. Usage in Docker: Map the volume: -v /mnt/hdd/data/clients/my-client-1:/app/data (The create_agent.sh script does this automatically for new agents).

System Architecture

For a detailed diagram of how all components (LiteLLM, Ollama, Storage, GPUs) work together, read the System Architecture Presentation.

For a slide deck summary, see the Project Slides.

Operational Procedures

  • Client Onboarding & Removal SOP: Deployment and cleanup guides.

Multi-GPU Distribution

The system automatically balances loads across all available GPUs.

Distribution methods by engine:

  • Ollama / llama.cpp (layer offloading/splitting): splits model layers across GPUs (e.g., 20 layers on GPU 0, 20 on GPU 1). Best for mixed cards (e.g., 3090 + 4090).
  • vLLM (tensor parallelism): splits mathematical operations across GPUs. Requires identical cards for best performance (e.g., 2x 3090).

Note: The scripts automatically detect your GPU count and apply the correct sharding strategy.
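For the layer-offloading case, the split is essentially proportional to each card's VRAM. A simplified sketch of that idea (not the scripts' actual code):

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign model layers to GPUs proportionally to their VRAM.
    Any remainder layers go to the largest cards first."""
    total = sum(vram_gb)
    shares = [int(n_layers * v / total) for v in vram_gb]
    leftover = n_layers - sum(shares)
    # Hand out remaining layers to the GPUs with the most VRAM
    for i in sorted(range(len(vram_gb)), key=lambda j: -vram_gb[j])[:leftover]:
        shares[i] += 1
    return shares

# e.g. 40 layers over two identical 12 GB cards -> 20 / 20,
#      40 layers over a 24 GB + 12 GB pair     -> 27 / 13
```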

Choosing Your Engine: vLLM vs Ollama

Feature            Ollama                            vLLM
Best for           Ease of use & OpenClaw            Raw performance & throughput
Model management   Automatic (ollama pull)           Manual / HuggingFace cache
Model switching    Dynamic (on-demand loading)       Manual restart required
API                OpenAI + native                   OpenAI-compatible
Speed              Good (based on llama.cpp)         Very fast (optimized PagedAttention)
Setup              One command (select_provider.py)  Script flag (--engine vllm)

Recommendation:

  • Start with Ollama. It's easier, persistent, and works great with OpenClaw.
  • Switch to vLLM only if you need higher token speeds for massive contexts or concurrent agents.

Building Custom Agents

You can easily create your own AI agents that run in Docker and connect to your local server.

  1. Create a new agent:
    ./scripts/create_agent.sh my_new_bot
    This generates a boilerplate Python project with a Dockerfile, pre-configured to talk to your local Ollama instance.
  2. Run it:
    cd ~/my_agents/my_new_bot
    ./run.sh

Fine-Tuning Your Own Models

Create custom AI models trained on your data (using Unsloth for 2x speed).

  1. Prepare Data: Create a .jsonl file (Alpaca format).
    {"instruction": "Question...", "input": "", "output": "Answer..."}
  2. Run Training:
    ./scripts/finetune.sh --data example_dataset.jsonl --name my-custom-model
  3. Use It:
    • The script automatically installs it into Ollama.
    • Run: scripts/run_inference.py my-custom-model --engine ollama
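A dataset in the expected format can be produced with a few lines of Python (field names follow the Alpaca convention shown above; the example rows are made up):

```python
import json

examples = [
    {"instruction": "What port does Ollama listen on?",
     "input": "", "output": "11434 by default."},
    {"instruction": "Summarize this log line.",
     "input": "GPU 0: temperature 81C",
     "output": "GPU 0 is running hot (81C)."},
]

with open("example_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line
```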

GPU Optimization & Power Tuning

A systemd service is installed to optimize your GPUs on boot.

To Adjust Power Limits:

  1. Edit: sudo nano /usr/local/bin/optimize_gpus.sh
  2. Uncomment: # nvidia-smi -pl 300
  3. Apply: sudo systemctl restart gpu-optimizer

OpenClaw Integration

Your server is ready for OpenClaw. The easiest method is using Ollama (Dockerized):

  1. Start Ollama Container: sudo scripts/select_provider.py ollama
  2. Run a model: scripts/run_inference.py llama3 --engine ollama
  3. Configure OpenClaw:
    • Provider: Select Custom Provider (sometimes labeled "OpenAI Compatible").
    • Base URL: http://<server-ip>:11434/v1
      • IMPORTANT: You MUST append /v1 to the URL so it treats Ollama like OpenAI.
    • Model: llama3 (or the exact name from ollama list).
    • API Key: ollama (Dummy value).
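The /v1 suffix is what selects Ollama's OpenAI-compatible routes rather than its native API. A small illustration (`base_url_for_openclaw` is a hypothetical helper, not part of the repo):

```python
# Native Ollama endpoint (its own JSON schema):
native = "http://<server-ip>:11434/api/generate"

# OpenAI-compatible endpoint OpenClaw expects (note the /v1):
openai_compat = "http://<server-ip>:11434/v1/chat/completions"

def base_url_for_openclaw(host: str, port: int = 11434) -> str:
    """Base URL to paste into OpenClaw: host:port plus the /v1 suffix."""
    return f"http://{host}:{port}/v1"
```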

Model Management

How to update or add new models to Ollama:

1. Update an Existing Model

docker exec -it ollama ollama pull llama3

Ollama checks for updates. If a new version exists, it downloads it. The next API request will use the new version.

2. Add a Completely New Model

docker exec -it ollama ollama pull mistral

It downloads the new model to disk.

3. Switch Models

  • API: Change "model": "llama3" to "model": "mistral" in your request.
  • OpenClaw: Change the Model Name in settings.
  • Ollama automatically unloads the old model and loads the new one from SSD.

Alternatively, use vLLM for OpenAI-compatible API at port 8000.
