# Prompt

I have previously seen code to run a local LLM in Python, but I can't recall any of it. Can you walk me through the steps and the Python code to do that? Be sure to mention any pitfalls, and ways to avoid them, on a Windows machine.

Got you. Here’s a clean, practical path to running a local LLM on **Windows** with Python, plus the “gotchas” you’ll want to avoid.

# 3 good ways to run a local LLM (Windows)

## 1) **Ollama** (easiest end-to-end, great for beginners)

Ollama runs models locally (CPU/GPU) and exposes a simple HTTP API. There’s a tiny Python client too.

### Install

1. Install Ollama for Windows → run the installer and launch it. ([Ollama][1])
2. Pull a model (examples: `llama3`, `phi4`, `qwen2.5`, `gemma3`):

```powershell
ollama pull llama3
```

3. (Optional) Confirm GPU acceleration is available for your card. The Windows build supports NVIDIA GPUs, and AMD support followed as a preview in 2024. ([Ollama][2])

### Minimal Python (chat once)

```python
# pip install ollama
import ollama
resp = ollama.chat(model="llama3", messages=[{"role": "user", "content": "Explain RBF kernel simply."}])
print(resp["message"]["content"])
```

The official Python client just needs Ollama running and a pulled model. ([GitHub][3])
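
If you'd rather skip the client library, the HTTP API mentioned above can be called with the standard library alone. Here's a minimal sketch, assuming Ollama is listening on its default local address (`http://localhost:11434`); the helper names `build_chat_request` and `chat_once` are mine, not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a non-streaming chat request for Ollama's REST API."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(model, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["message"]["content"]


# Usage (requires Ollama running and `ollama pull llama3` already done):
# print(chat_once("llama3", "Explain RBF kernel simply."))
```

If `chat_once` raises a connection error, the usual cause is that the Ollama app or service isn't running yet.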

---

## 2) **llama.cpp via llama-cpp-python** (loads GGUF files directly in Python)

This gives you an in-process Python API (Python bindings over the C++ llama.cpp library, no server needed) for quantized **GGUF** models.

### Install (choose your path)

* **GPU (NVIDIA CUDA):** install a prebuilt wheel matching your CUDA version, e.g.:

```powershell
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
```

(Use `cu121`, `cu122`, `cu123`, `cu124`, or `cu125` to match your CUDA.) ([GitHub][4])
There are also community prebuilt wheels targeting newer CUDA (e.g., 12.8) if you need them. ([Reddit][5], [Hugging Face][6])

* **CPU-only:**

```powershell
pip install llama-cpp-python
```

### Minimal Python (load a local GGUF and generate)

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Point to your downloaded .gguf (e.g., from Hugging Face)
llm = Llama(
    model_path=r"C:\models\Llama-3-8B.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; only takes effect in a GPU-enabled build
)
out = llm("Write a one-sentence summary of PCA.\n", max_tokens=128, stop=["\n"])
print(out["choices"][0]["text"])
```
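
A half-finished or corrupted download is a common reason the load step fails with a confusing error. As a quick sanity check before calling `Llama(...)`, you can confirm the file starts with the four-byte `GGUF` marker. This is a sketch only: `looks_like_gguf` is a hypothetical helper name, and it checks the header bytes, not the integrity of the whole file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four ASCII bytes


def looks_like_gguf(path):
    """Return True if `path` exists and begins with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


# Usage with the example path from above:
# model_path = r"C:\models\Llama-3-8B.Q4_K_M.gguf"
# if not looks_like_gguf(model_path):
#     raise SystemExit(f"{model_path} is missing or not a GGUF file; re-download it")
```

A file that fails this check is often an HTML error page saved under a `.gguf` name, or a download that stopped early.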

---

## 3) **Transformers** (native PyTorch; flexible, but VRAM-hungry)

If you have a decent NVIDIA GPU, this is the most flexible route. If you’re on an **AMD/Intel GPU on Windows**, use **DirectML** for acceleration (it works with many models, but not all). ([Microsoft Learn][7], [PyPI][8], [GitHub][9])

### Install (NVIDIA CUDA)

```powershell
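# A plain `pip install torch` on Windows gives you a CPU-only build.
# Use the index URL for your CUDA version (see the selector on pytorch.org); cu124 is an example.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124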
pip install transformers accelerate
```
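
After installing, it's worth confirming that you actually got a CUDA build, since a CPU-only wheel installs without any warning. Here's a small sketch; `describe_torch` is a hypothetical helper name, and it returns a plain dict so it also works when PyTorch isn't installed at all.

```python
import importlib.util


def describe_torch():
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    info = {"installed": False, "version": None, "cuda_build": None, "cuda_available": False}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment
    import torch

    info["installed"] = True
    info["version"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None means a CPU-only wheel
    info["cuda_available"] = torch.cuda.is_available()
    return info


print(describe_torch())
```

If `cuda_build` is `None`, reinstall from the CUDA index URL. If `cuda_build` is set but `cuda_available` is `False`, the NVIDIA driver is the more likely culprit.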

### Install (AMD/Intel via DirectML)

```powershell
pip install torch-directml transformers accelerate
```

Then in Python:

```python
import torch
import torch_directml

device = torch_directml.device()  # DirectML device (not a "dml" device string)
```

([Microsoft Learn][7], [PyPI][8])

### Minimal Python (Transformers generate)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Gated repo: accept the license on Hugging Face and log in first, or swap in an open model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Pick a device: CUDA if available, else DirectML if torch-directml is installed, else CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    try:
        import torch_directml
        device = torch_directml.device()
    except ImportError:
        device = torch.device("cpu")

# float16 halves memory on GPU backends; float32 is the safer choice on CPU
dtype = torch.float32 if device.type == "cpu" else torch.float16

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)

prompt = "Explain the bias-variance tradeoff for a stats student."
inputs = tok(prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=160)
print(tok.decode(out[0], skip_special_tokens=True))
```

> Want **throughput** or an **OpenAI-style server** on your machine? Consider **vLLM**: it’s very fast but works best on Linux. On Windows, people commonly run it under **WSL2** (Ubuntu), and guides exist that walk through that setup. ([DEV Community][10], [Medium][11])

---

# Windows-specific pitfalls (and how to dodge them)

* **GPU not used / falling back to CPU:**

  * For **Ollama**, ensure the Windows build + drivers are installed; AMD acceleration is supported (preview rolled out in 2024). Check Task Manager → GPU. ([Ollama][2])
  * For **Transformers**, your **PyTorch** build must match your driver/CUDA. If you’re on **AMD/Intel**, install **torch-directml** and move the model and inputs to `torch_directml.device()`. ([Microsoft Learn][7], [PyPI][8])

* **llama-cpp-python compile errors on Windows:**
  Use a **prebuilt wheel** for your CUDA version instead of compiling; compiling requires VS “Desktop development with C++” and can be finicky. ([GitHub][4])

* **bitsandbytes (8-bit) on Windows:**
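  Older **bitsandbytes** releases did **not** support Windows, which is why the unofficial forks cited here exist; those workarounds are fragile. Newer upstream releases (0.43.0 onward) publish Windows wheels for NVIDIA GPUs, so try a current `pip install bitsandbytes` first. On AMD/Intel GPUs, or if that fails, prefer **GGUF with llama.cpp** or AWQ/GPTQ variants loaded via other backends. ([PyPI][12], [mindfiretechnology.com][13], [GitHub][14])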

* **Not enough VRAM / OOM:**

  * Pick **smaller or more aggressively-quantized** models (e.g., Q4\_K\_M GGUF).
  * Increase Windows **pagefile** size if you see CPU RAM OOMs (helps stability).
  * For Transformers, you can try `torch_dtype=torch.float16`, `max_memory`, or quantized checkpoints.

* **Long paths / antivirus interference:**
  Keep model paths short (e.g., `C:\models\...`) and whitelist your model folder in AV if downloads repeatedly fail/corrupt.

* **Licenses & model cards:**
  Some checkpoints require acceptance/sign-in (e.g., Meta Llama). Make sure you comply with licensing before scripting downloads.

* **WSL2 vs native Windows:**
  If you need **vLLM** or Linux-first tooling, WSL2 (Ubuntu) is often the smoothest path on a Windows laptop or desktop. ([DEV Community][10], [Medium][11])
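
For the VRAM/OOM pitfall above, a quick back-of-the-envelope estimate helps you decide whether a model can fit before you download it. This sketch counts the weights only (the KV cache and activations need extra room), and the 4.5 bits per weight used for the 4-bit case is an assumed round number, not an exact figure for any specific GGUF quantization.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# An 8-billion-parameter model at two precisions.
# 4.5 bits/weight is an assumed round number for a 4-bit quantization.
fp16 = weight_memory_gib(8e9, 16)
q4 = weight_memory_gib(8e9, 4.5)
print(f"fp16: {fp16:.1f} GiB, ~4-bit: {q4:.1f} GiB")
# prints: fp16: 14.9 GiB, ~4-bit: 4.2 GiB
```

So an 8B model in float16 won't fit on an 8 GB card, while a 4-bit quantized version usually leaves headroom for context.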

---

# Quick “which should I use?” cheat

* **I want the simplest local setup + a Python API.** → **Ollama** + `pip install ollama`. ([GitHub][3])
* **I want tight Python control + low RAM/VRAM usage through quantization.** → **llama-cpp-python** with **GGUF**. ([GitHub][4])
* **I want the full Hugging Face ecosystem.** → **Transformers** (CUDA or DirectML). ([Microsoft Learn][7], [PyPI][8])
* **I need high-throughput serving.** → **vLLM** (prefer **WSL2** on Windows). ([DEV Community][10])

If you tell me your GPU/VRAM and whether you prefer server-style vs. pure-Python, I’ll tailor the exact commands and a starter script for your setup.

[1]: https://ollama.com/download/windows "Download Ollama on Windows"
[2]: https://ollama.com/blog "Blog"
[3]: https://github.com/ollama/ollama-python "Ollama Python library"
[4]: https://github.com/abetlen/llama-cpp-python "Python bindings for llama.cpp"
[5]: https://www.reddit.com/r/LocalLLaMA/comments/1k8xu4d/release_llamacpppython_038_cuda_128_prebuilt/ "🚀 [Release] llama-cpp-python 0.3.8 (CUDA 12.8) Prebuilt ..."
[6]: https://huggingface.co/boneylizardwizard/llama-cpp-python-038-cu128-gemma3-wheel "boneylizardwizard/llama-cpp-python-038-cu128-gemma3- ..."
[7]: https://learn.microsoft.com/en-us/windows/ai/directml/pytorch-windows "Enable PyTorch with DirectML on Windows"
[8]: https://pypi.org/project/torch-directml/ "torch-directml"
[9]: https://github.com/microsoft/DirectML/blob/master/PyTorch/README.md "DirectML/PyTorch/README.md at master"
[10]: https://dev.to/docteurrs/making-vllm-work-on-wsl2-482e "Making VLLM work on WSL2"
[11]: https://medium.com/%40techofhp/running-vllm-locally-with-qwen-on-wsl-a-step-by-step-guide-063ad5d1d887 "Running vLLM locally with Qwen on WSL: A Step-by- ..."
[12]: https://pypi.org/project/bitsandbytes-windows/ "bitsandbytes-windows"
[13]: https://www.mindfiretechnology.com/blog/archive/installing-bitsandbtyes-for-windows-so-that-you-can-do-peft/ "Installing BitsAndBtyes for Windows - So That You Can Do ..."
[14]: https://github.com/fa0311/bitsandbytes-windows "fa0311/bitsandbytes-windows: 8-bit CUDA functions for ..."
