Local WSL2 + vLLM setup with a simple FastAPI web UI for model selection, chat, and per-session markdown logs.
- Windows 11 (or Windows 10 22H2) with WSL2 enabled
- Latest NVIDIA Windows GPU drivers with WSL support
- VS Code + "Remote - WSL" extension
- Open PowerShell and install Ubuntu (if you have not already):

  ```powershell
  wsl --install -d Ubuntu
  ```
- Launch Ubuntu and let it finish first-run setup.
- From Windows, open this folder in VS Code, then use "Remote - WSL: Reopen Folder in WSL".
Inside WSL:
```bash
nvidia-smi
```
You should see your RTX 4070 listed. If not, update NVIDIA drivers on Windows and confirm WSL2 GPU support is enabled.
From the repo root (inside WSL):
```bash
cp .env.example .env
./scripts/setup_wsl_ubuntu.sh
./scripts/setup_webui.sh
```
Note: `zstd` is installed by `setup_wsl_ubuntu.sh` because Ollama uses it for model downloads.
If you want to use GGUF models with the Ollama backend:
- Install Ollama for Linux:

  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ```

- Start the service:

  ```bash
  ollama serve
  ```

- Pull the GGUF model you want:

  ```bash
  ollama pull QuantFactory/Qwen2.5-7B-Instruct-abliterated-v2-GGUF
  ```
Alternatively, run the helper script:
```bash
./scripts/setup_ollama.sh
```
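Before loading an Ollama-backed model, you can verify the server is actually reachable. A minimal stdlib-only check (this helper is not part of the repo; it assumes Ollama's standard `/api/tags` endpoint on the default port 11434):

```python
import json
import urllib.error
import urllib.request


def ollama_running(base_url="http://localhost:11434"):
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            json.load(resp)  # /api/tags returns {"models": [...]}
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

If this returns `False`, start the service with `ollama serve` and retry.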
Start the web UI:

```bash
./scripts/run_webui.sh
```
Open http://localhost:5000 in your browser:
- Pick a model from the dropdown.
- Click "Load Model". The UI will stop any running vLLM server and start a new one with the selected model.
- Wait for the status to show "Running".
- If the selected model uses the Ollama backend, make sure `ollama serve` is running.
Note: The web UI will attempt to start Ollama and pull the model automatically when you load an Ollama-backed model.
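The stop-and-restart behavior described above can be sketched as follows. This is not the repo's actual implementation, just a rough illustration assuming vLLM's standard `vllm.entrypoints.openai.api_server` entry point:

```python
import signal
import subprocess
import sys


def restart_vllm(model_id, proc=None, port=8000):
    """Stop a previously launched vLLM server process (if any),
    then start a new one serving model_id on the given port."""
    if proc is not None and proc.poll() is None:
        proc.send_signal(signal.SIGTERM)  # graceful shutdown
        proc.wait(timeout=30)
    return subprocess.Popen(
        [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
         "--model", model_id, "--port", str(port)],
    )
```

The handle returned by one call is passed back into the next, so only one vLLM server runs at a time.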
- Each new chat creates a new markdown file in `logs/`.
- Filename format: `YYYY-MM-DD_HHMMSS_<modelId>.md`
- Each user/assistant message is appended immediately.
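The filename convention above can be expressed as a small helper (an illustrative sketch, not code from the repo):

```python
from datetime import datetime


def log_filename(model_id, now=None):
    """Build a log filename like 2024-01-02_030405_<modelId>.md."""
    now = now or datetime.now()
    return f"{now:%Y-%m-%d_%H%M%S}_{model_id}.md"
```

For example, `log_filename("qwen", datetime(2024, 1, 2, 3, 4, 5))` yields `2024-01-02_030405_qwen.md`.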
- vLLM not running: click "Load Model" or check `./.run/vllm.log`.
- OOM errors: choose a smaller model or the 4-bit DeepSeek option.
- vLLM health check fails: confirm `VLLM_HOST` in `.env` and that the server started.
- WSL GPU missing: run `nvidia-smi` in WSL and verify the Windows driver install.
- vLLM runs at http://localhost:8000 (OpenAI-compatible).
- Ollama runs at http://localhost:11434 (GGUF backend).
- Web UI runs at http://localhost:5000.
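Because the vLLM server speaks the OpenAI-compatible protocol, you can query it directly with only the standard library. A minimal sketch (the model name is whatever the UI loaded; nothing here is repo-specific):

```python
import json
import urllib.request


def build_chat_payload(prompt, model):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, model, base_url="http://localhost:8000/v1"):
    """Send prompt to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_chat_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same request works from the shell with `curl http://localhost:8000/v1/chat/completions` and a JSON body.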
- Model list is defined in `config/models.json`.
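For orientation, an entry in `config/models.json` might pair a model ID with its backend. The field names below are assumptions for illustration, not the repo's actual schema — check the file itself for the real structure:

```json
{
  "models": [
    { "id": "Qwen/Qwen2.5-7B-Instruct", "backend": "vllm" },
    { "id": "QuantFactory/Qwen2.5-7B-Instruct-abliterated-v2-GGUF", "backend": "ollama" }
  ]
}
```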
If the scripts are not executable, make them so:

```bash
chmod +x scripts/*.sh
```