This project automates the setup of an Ubuntu-based AI server/workstation.
- Base Configuration: Common packages, UFW firewall, system updates.
- NVIDIA Setup: Installs NVIDIA drivers (535 server) and CUDA Toolkit.
- Docker: Installs Docker Engine and NVIDIA Container Toolkit (enabling GPU support in containers).
- Python: Installs Miniconda and sets up a default `ai_env` environment with PyTorch, Pandas, and JupyterLab.
- A target machine running Ubuntu 22.04 LTS or newer.
- SSH access to the target machine.
- Ansible installed on your control machine (the one running the playbook).
- Configure Inventory: Edit `inventory/hosts.ini` to add your target server's IP address and SSH details:

  ```ini
  [ai_servers]
  192.168.1.xxx ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/id_rsa
  ```
- Configure Variables (Optional): You can tweak roles or `playbook.yml` vars if needed (e.g., enable/disable driver install).
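As an illustration, such toggles might look like this in the play's `vars:` or a `group_vars` file. The variable names below are hypothetical; check the actual role defaults for the real ones:

```yaml
# Hypothetical variable names; verify against the repo's role defaults.
install_nvidia_driver: true      # set false to skip the 535-server driver
install_cuda_toolkit: true
install_docker: true
conda_env_name: ai_env           # default Python environment name
```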
Run the playbook:

```bash
ansible-playbook playbook.yml
```

To limit to specific tags:

```bash
ansible-playbook playbook.yml --tags "base,docker"
```

After the playbook completes:
- SSH into the server:

  ```bash
  ssh ubuntu@<server-ip>
  ```

- Check NVIDIA Drivers:

  ```bash
  nvidia-smi
  ```

- Check Docker GPU Support:

  ```bash
  sudo docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
  ```

- Check Python Environment:

  ```bash
  source ~/miniconda3/bin/activate ai_env
  python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
  ```
Run any LLM using the unified script. It wraps llama.cpp, vLLM, or custom containers:

```bash
# Run GGUF (llama.cpp)
./scripts/run_inference.py <model.gguf>

# Run Standard Model (vLLM) - OpenAI-Compatible API
./scripts/run_inference.py <model_dir_or_repo> --engine vllm

# Run Custom Image
./scripts/run_inference.py --engine custom --image <image_name>
```

The script automatically handles GPU detection and Docker flags.
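The GPU-detection step is roughly of this shape; this is an illustrative sketch, not the repo's actual code:

```python
# Sketch: derive Docker GPU flags from nvidia-smi output (hypothetical logic).
import subprocess

def count_gpus(smi_output: str) -> int:
    """Count GPUs from `nvidia-smi --query-gpu=name --format=csv,noheader` output."""
    return len([line for line in smi_output.splitlines() if line.strip()])

def docker_gpu_flags(gpu_count: int) -> list[str]:
    """Return Docker CLI flags exposing GPUs, or nothing on CPU-only hosts."""
    return ["--gpus", "all"] if gpu_count > 0 else []

def detect() -> list[str]:
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        out = ""  # No driver present: fall back to CPU-only
    return docker_gpu_flags(count_gpus(out))
```

The returned flags would then be spliced into the `docker run` invocation.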
The `scripts/profit_switcher.py` script monitors Clore.ai rental prices vs your electricity cost.

- Set env vars:

  ```bash
  export CLORE_API_KEY="your_key"
  export CLORE_SERVER_ID="your_id"
  export RIG_POWER_KW="0.8"  # Approximate power consumption in kW
  ```

- Run the script (ideally via cron):

  ```bash
  sudo -E python3 scripts/profit_switcher.py
  ```

  Note: `sudo -E` preserves environment variables.
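For unattended runs, a root crontab entry (added via `sudo crontab -e`) is one option. Cron jobs do not inherit your shell environment, so the variables go in the crontab itself. The schedule, repo path, and log path below are illustrative:

```cron
# Illustrative: check profitability every 15 minutes.
CLORE_API_KEY=your_key
CLORE_SERVER_ID=your_id
RIG_POWER_KW=0.8
*/15 * * * * python3 /home/ubuntu/aiserver/scripts/profit_switcher.py >> /var/log/profit_switcher.log 2>&1
```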
Provider Selector: Easily switch between workloads to avoid resource conflicts:

```bash
sudo python3 scripts/select_provider.py [clore|vast|ionet|local]
```

Vast.ai Setup:

- Get your unique start command from the Vast.ai dashboard.
- Run:

  ```bash
  sudo setup_vast.sh '<your_command>'
  ```
You can chat with your local Ollama instance via Telegram.

Option A: OpenClaw (Recommended for Agents)

- Get a token from `@BotFather`.
- Configure OpenClaw to use Telegram as a channel.

Option B: Standalone Bot (Direct Chat)

A lightweight Docker container that bridges Telegram directly to Ollama.

- Export your token:

  ```bash
  export TELEGRAM_TOKEN="your_token_here"
  ```

- Start the Ollama service (with `sudo -E` to pass the token):

  ```bash
  sudo -E python3 scripts/select_provider.py ollama
  ```

  The script detects the token and automatically launches the `telegram-bot` container alongside Ollama.
- Grafana: `http://<server-ip>:3000` (default user/pass: admin/admin)
- Prometheus: `http://<server-ip>:9090`
- Alerts: Pre-configured to fire if any GPU exceeds 80°C.
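The temperature alert is presumably a Prometheus rule along these lines. This sketch assumes the NVIDIA DCGM exporter's `DCGM_FI_DEV_GPU_TEMP` metric; the repo's actual rule file may use a different exporter and metric name:

```yaml
groups:
  - name: gpu_alerts
    rules:
      - alert: GPUOverTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} above 80°C"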
For mining rigs with x1 risers:
- Use GGUF format: `llama.cpp` with GGUF leverages `mmap` better than other loaders, reducing initial load times.
- Fit Model in VRAM: Ensure the model (`-ngl 99`) fits entirely in VRAM. Swapping over PCIe x1 is detrimental to performance.
- Context Shifting: With huge contexts, KV-cache processing might be bottlenecked by bandwidth if split across cards. Keep batch sizes lower.
All agents run in Docker containers using the "Sibling Docker" pattern (binding `/var/run/docker.sock`).

- Clore Agent: `clore-agent` container.
- Vast Agent: `vast-agent` container.
- Local AI: `llama.cpp` container (ephemeral).
- Monitoring: `prometheus`, `grafana` containers.
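In Compose terms, the sibling pattern boils down to one bind mount of the host's Docker socket. The service and image names in this sketch are illustrative:

```yaml
services:
  clore-agent:
    image: clore-agent:latest   # illustrative image name
    volumes:
      # "Sibling Docker": the agent talks to the host daemon, so containers
      # it launches run as siblings on the host, not nested inside the agent.
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped
```

Note that mounting the socket grants the container root-equivalent control of the host's Docker daemon, so only trusted images should get it.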
For connecting multiple clients (e.g., Clawdbots), use the LiteLLM Proxy.
- Endpoint: `http://<server-ip>:4000`
- API Key: `sk-aiserver-admin` (Master Key)
- Supported Models: `llama3`, `mistral`, `custom` (routed to Ollama internally).
Why use this?
- Provides a stable OpenAI-compatible endpoint.
- Handles queuing better than raw Ollama.
- Secured with an API Key.
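The routing above is typically declared in a LiteLLM `config.yaml`. This sketch assumes Ollama is reachable as `http://ollama:11434` inside the Docker network; names and layout may differ from the repo's actual config:

```yaml
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama:11434
  - model_name: mistral
    litellm_params:
      model: ollama/mistral
      api_base: http://ollama:11434

general_settings:
  master_key: sk-aiserver-admin
```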
The system is configured to use your multi-drive setup:
- System (NVMe): OS & Docker Images.
- Models (SSD): `/mnt/ssd/models` (Ollama) & `/mnt/ssd/cache` (HuggingFace).
- Data (HDD): `/mnt/hdd/data` (Datasets) & `/mnt/hdd/backups`.
- Swap (NVMe): 64 GB swapfile configured for AI (swappiness 10).

Configuration File: `~/aiserver/config/storage.env`
Edit this file if your mount points differ.
To keep data separate for different bots/clients:
- Create Storage:

  ```bash
  ./scripts/create_client_storage.sh my-client-1
  ```

  Creates `/mnt/hdd/data/clients/my-client-1`.

- Usage in Docker: Map the volume: `-v /mnt/hdd/data/clients/my-client-1:/app/data` (the `create_agent.sh` script does this automatically for new agents).
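The storage script amounts to a guarded directory creation. This Python sketch mirrors the likely behavior (the real script is shell, and its exact checks may differ):

```python
# Sketch of per-client storage setup; base path matches the doc's layout.
from pathlib import Path

def create_client_storage(client: str, base: str = "/mnt/hdd/data/clients") -> Path:
    """Create an isolated data directory for one client and return its path."""
    # Reject names that could escape the clients directory.
    if not client or "/" in client or ".." in client:
        raise ValueError("invalid client name")
    path = Path(base) / client
    path.mkdir(parents=True, exist_ok=True)
    path.chmod(0o770)  # owner + group only; keep client data private
    return path
```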
For a detailed diagram of how all components (LiteLLM, Ollama, Storage, GPUs) work together, read the System Architecture Presentation.
For a slide deck summary, see the Project Slides.
- Client Onboarding & Removal SOP: Deployment and cleanup guides.
The system automatically balances loads across all available GPUs.
| Engine | Distribution Method | Behavior |
|---|---|---|
| Ollama / llama.cpp | Layer Offloading/Splitting | Splits model layers across GPUs (e.g., 20 layers on GPU 0, 20 on GPU 1). Best for mixed cards (e.g., 3090 + 4090). |
| vLLM | Tensor Parallelism | Splits mathematical operations across GPUs. Requires identical cards for best performance (e.g., 2x 3090). |
Note: My scripts automatically detect your GPU count and apply the correct sharding strategy.
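As an illustration of the layer-offloading row above (not the scripts' actual code), splitting layers proportionally to each card's VRAM can be sketched as:

```python
def split_layers(total_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign model layers to GPUs proportionally to VRAM (largest-remainder)."""
    total_vram = sum(vram_gb)
    exact = [total_layers * v / total_vram for v in vram_gb]
    counts = [int(x) for x in exact]
    # Hand leftover layers to the GPUs with the largest fractional share.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[: total_layers - sum(counts)]:
        counts[i] += 1
    return counts
```

For a mixed 24 GB + 16 GB pair, a 40-layer model would land 24 layers on the larger card and 16 on the smaller one.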
| Feature | Ollama | vLLM |
|---|---|---|
| Best For | Ease of Use & OpenClaw | Raw Performance & Throughput |
| Model Management | Automatic (`ollama pull`) | Manual / HuggingFace cache |
| Model Switching | Dynamic (On-Demand Loading) | Manual Restart Required |
| API | OpenAI + Native | OpenAI Compatible |
| Speed | Good (based on llama.cpp) | Extreme (Optimized PagedAttention) |
| Setup | One command (`select_provider.py`) | Script flags (`--engine vllm`) |
Recommendation:
- Start with Ollama. It's easier, persistent, and works great with OpenClaw.
- Switch to vLLM only if you need higher token speeds for massive contexts or concurrent agents.
You can easily create your own AI agents that run in Docker and connect to your local server.
- Create a new Agent:

  ```bash
  ./scripts/create_agent.sh my_new_bot
  ```

  This generates a boilerplate Python project with a Dockerfile, pre-configured to talk to your local Ollama instance.

- Run it:

  ```bash
  cd ~/my_agents/my_new_bot
  ./run.sh
  ```
Create custom AI models trained on your data (using Unsloth for 2x speed).
- Prepare Data: Create a `.jsonl` file (Alpaca format):

  ```json
  {"instruction": "Question...", "input": "", "output": "Answer..."}
  ```

- Run Training:

  ```bash
  ./scripts/finetune.sh --data example_dataset.jsonl --name my-custom-model
  ```

- Use It:
  - The script automatically installs it into Ollama.
  - Run:

    ```bash
    scripts/run_inference.py my-custom-model --engine ollama
    ```
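A malformed line in the `.jsonl` can waste a whole training run, so a quick sanity check before launching is worthwhile. This helper is not part of the repo; it just enforces the Alpaca fields shown above:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_alpaca_jsonl(lines: list[str]) -> list[int]:
    """Return 1-based line numbers that are not valid Alpaca-format records."""
    bad = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            bad.append(n)
            continue
        if not isinstance(rec, dict) or not REQUIRED_KEYS <= rec.keys():
            bad.append(n)
    return bad
```

Run it over `open("example_dataset.jsonl").readlines()` and fix any reported line numbers before calling `finetune.sh`.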
I have installed a system service to optimize your GPUs on boot.
To Adjust Power Limits:
- Edit: `sudo nano /usr/local/bin/optimize_gpus.sh`
- Uncomment: `# nvidia-smi -pl 300`
- Apply: `sudo systemctl restart gpu-optimizer`
Your server is ready for OpenClaw. The easiest method is using Ollama (Dockerized):
- Start Ollama Container:

  ```bash
  sudo scripts/select_provider.py ollama
  ```

- Run a model:

  ```bash
  scripts/run_inference.py llama3 --engine ollama
  ```

- Configure OpenClaw:
  - Provider: Select Custom Provider (sometimes labeled "OpenAI Compatible").
  - Base URL: `http://<server-ip>:11434/v1`
    - IMPORTANT: You MUST append `/v1` to the URL so it treats Ollama like OpenAI.
  - Model: `llama3` (or the exact name from `ollama list`).
  - API Key: `ollama` (dummy value).
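Since the missing `/v1` suffix is the most common misconfiguration here, a tiny helper (illustrative, not part of the repo) captures the rule:

```python
def ollama_openai_base_url(host: str, port: int = 11434) -> str:
    """Build the OpenAI-compatible base URL for an Ollama server."""
    url = f"http://{host}:{port}"
    # Ollama serves its OpenAI-compatible API under /v1 only.
    return url if url.endswith("/v1") else url + "/v1"
```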
How to update or add new models to Ollama:
1. Update an Existing Model

   ```bash
   docker exec -it ollama ollama pull llama3
   ```

   Ollama checks for updates. If a new version exists, it downloads it. The next API request will use the new version.

2. Add a Completely New Model

   ```bash
   docker exec -it ollama ollama pull mistral
   ```

   This downloads the new model to disk.

3. Switch Models
   - API: Change `"model": "llama3"` to `"model": "mistral"` in your request.
   - OpenClaw: Change the Model Name in settings.
   - Ollama automatically unloads the old model and loads the new one from SSD.
Alternatively, use vLLM for an OpenAI-compatible API on port 8000.