# Running Open Source LLMs via Ollama API on Google Colab
--- 
This notebook demonstrates how to set up an **Ollama** server within a Google Colab environment to perform API calls to Small Language Models (SLMs) for free. 

### Hardware Specs:
* **GPU:** Tesla T4 (~15GB VRAM)
* **Model Recommendation:** Use models under 4B parameters (e.g., Llama 3.2 1B, Phi-4-mini) to avoid OOM errors.
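
Before installing anything, it can help to confirm which GPU the runtime was actually assigned. The sketch below is not part of the original notebook. It assumes `nvidia-smi` is on the PATH, as it normally is on Colab GPU runtimes, and `parse_gpu_line` and `query_gpus` are hypothetical helper names chosen for illustration.

```python
import shutil
import subprocess

def parse_gpu_line(line):
    """Split one 'name, memory' CSV line from nvidia-smi into (name, MiB)."""
    name, memory = [part.strip() for part in line.rsplit(",", 1)]
    return name, int(memory.split()[0])

def query_gpus():
    """Return a list of (name, total_memory_mib); empty if no GPU info is available."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []  # nvidia-smi is present but could not talk to a driver
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]

for name, mib in query_gpus():
    print(f"{name}: {mib} MiB total")
```

If nothing is printed, the runtime probably has no GPU attached, and the runtime type may need to be changed in Colab before continuing.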

## 🛠️ Step 1: Prep the Environment
Colab's base image requires system-level utilities like `zstd` to unpack Ollama's binaries. We also install `pciutils`, which provides `lspci`, to ensure the GPU is detectable.

In [None]:
# Install dependencies
!apt-get update -qq && apt-get install -y -qq zstd pciutils

# Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh
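
A quick way to confirm that the install step worked is to check that the expected executables are on the PATH. This is a minimal sketch rather than part of the original notebook; `find_tools` is a hypothetical helper name, and `lspci` is the binary shipped by `pciutils`.

```python
import shutil

def find_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}

# Report where each tool was found, if at all
for tool, path in find_tools(["zstd", "lspci", "ollama"]).items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```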

## 🚀 Step 2: Launch the Background Daemon
Ollama must run as a background service so that the notebook remains interactive for your API calls.

In [None]:
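import subprocess
import time

# Start the Ollama server in the background, sending its output to ollama.log
with open('ollama.log', 'w') as f:
    server = subprocess.Popen(['ollama', 'serve'], stdout=f, stderr=f)

time.sleep(5)  # Allow initialization

# poll() returns None while the process is still running
if server.poll() is None:
    print("Ollama server process is running.")
else:
    print("Ollama server exited early; check ollama.log.")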
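
A fixed `time.sleep(5)` may be too short or longer than needed, depending on the runtime. One alternative, sketched here as an assumption and not the notebook's own method, is to poll the server's root URL until it answers. `wait_for_server` is a hypothetical helper name, and the standard-library `urllib` is used so the snippet has no extra dependencies.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:11434/", timeout_s=30.0, interval_s=0.5):
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server()` after `subprocess.Popen(...)` returns `True` once the server responds and `False` if it never comes up within the timeout, in which case `ollama.log` is the place to look.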

## 📥 Step 3: Pull Your Model
We are using `llama3.2:1b` for its efficiency on the T4 GPU.

In [None]:
!ollama pull llama3.2:1b
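
To confirm the pull succeeded, the server can be asked which models it has stored locally through its `/api/tags` endpoint. The following is a minimal sketch that assumes the server from Step 2 is listening on the default port; `model_names` and `list_local_models` are hypothetical helper names.

```python
import json
import urllib.request

def model_names(tags):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags.get("models", [])]

def list_local_models(base_url="http://localhost:11434"):
    """Ask the running Ollama server which models are stored locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

Once the pull has finished, `"llama3.2:1b" in list_local_models()` should evaluate to `True`.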

## 🔗 Step 4: Perform the API Call
Using the `requests` library, we send a POST request to the local `/api/generate` endpoint. Because `"stream"` is set to `False`, the server returns a single JSON object, and we extract the LLM's answer via `.json()['response']`.

In [None]:
import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.2:1b",
    "prompt": "Explain the concept of quantum entanglement in one sentence.",
    "stream": False
}

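# Generation can take a while on first use, so allow a generous timeout
response = requests.post(url, json=payload, timeout=120)

if response.status_code == 200:
    result = response.json()
    print("LLM Response:")
    print(result['response'])
else:
    # Include the response body, which usually explains the failure
    print(f"Error: {response.status_code} {response.text}")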
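
The call above sets `"stream": False`. With `"stream": True`, the endpoint instead sends newline-delimited JSON objects, each carrying a fragment of the answer in `response`, with `done` set to `true` on the final one. The sketch below shows one way to reassemble them. It is an illustration, not the notebook's method: `join_stream` and `generate_streaming` are hypothetical helper names, and `urllib` stands in for `requests` to keep the snippet dependency-free.

```python
import json
import urllib.request

def join_stream(lines):
    """Concatenate the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # Skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # The final chunk marks the end of generation
    return "".join(parts)

def generate_streaming(prompt, model="llama3.2:1b",
                       url="http://localhost:11434/api/generate"):
    """POST a streaming generate request and return the assembled text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        # Iterating over the response yields one line (one JSON chunk) at a time
        return join_stream(resp)
```

For interactive use, each fragment could be printed as it arrives instead of being collected, which makes long generations feel more responsive.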