# Setting up Ollama with your HPC Account

Ollama is an open-source tool that enables users to easily download, install, and run large language models locally on their own hardware, eliminating the need for cloud-based API calls or external services. When setting up Ollama for an HPC (High-Performance Computing) account, you're essentially configuring this lightweight runtime environment to leverage the substantial computational resources available in HPC clusters, such as powerful GPUs and high-memory nodes. The setup process typically involves downloading the Ollama binary to your HPC user directory, ensuring proper GPU drivers and CUDA compatibility, and then pulling your desired models (like Llama 2, Mistral, or CodeLlama) which will be stored locally on the cluster's storage system.

### Download Ollama (Linux version for the HPC system)

Use your own AcademicID username and HPC user ID in the paths below (the example paths use `/user/sarah.oberbichler/u18915`). You can find your account details here: https://hpcproject.gwdg.de/projects/baaa32d3-6b49-4831-8b75-87ff44056ae0/

In [None]:
# Download the Linux version for the HPC system
!wget -O /user/sarah.oberbichler/u18915/ollama-linux-amd64.tgz https://github.com/ollama/ollama/releases/download/v0.11.4/ollama-linux-amd64.tgz

# Extract it
!cd /user/sarah.oberbichler/u18915 && tar -xzf ollama-linux-amd64.tgz
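
Before moving on, it can help to confirm that the archive was unpacked where the later cells expect it. The following is a minimal sketch, assuming the tarball places the executable at `bin/ollama` below the extraction directory (the path the later cells use); the helper name `check_ollama_binary` is hypothetical.

```python
import os

def check_ollama_binary(install_dir):
    """Return the path to bin/ollama under install_dir if it is an executable file, else None."""
    candidate = os.path.join(install_dir, "bin", "ollama")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None

# Example install directory from this notebook; replace it with your own path
print(check_ollama_binary("/user/sarah.oberbichler/u18915"))
```

If this prints `None`, the download or extraction step probably did not complete, and the server cell further down would fail to start.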


### Set up your Environment

First, check which ports are free in your environment. Copy one of the reported port numbers and use it in `os.environ['OLLAMA_HOST'] = '127.0.0.1:<port>'` in the environment cell below, and in `API_URL` further down.

In [None]:
import socket

def find_free_ports(start=11434, end=11450):
    free_ports = []
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(('127.0.0.1', port)) != 0:
                free_ports.append(port)
    return free_ports

available_ports = find_free_ports()
print("✅ Available ports:", available_ports)
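
The check above only reports ports on which nothing is listening at that moment. On a shared node, another process may claim the port before Ollama starts. As an alternative sketch (not part of the original workflow), you can ask the operating system for a currently unused port by binding to port 0; the helper name `pick_free_port` is hypothetical.

```python
import socket

def pick_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))            # port 0 lets the OS choose a free port
        return s.getsockname()[1]    # the port number the OS assigned

port = pick_free_port()
print("Suggested port:", port)
```

The port is released when the socket closes, so a small race window remains. Start the server soon after picking the port.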


In [None]:
import os

# Set up your environment
os.environ['OLLAMA_HOME'] = '/user/sarah.oberbichler/u18915/.ollama'  # alternative location: '/workspace/ceph-hdd/.ollama'
os.environ['PATH'] = f"/user/sarah.oberbichler/u18915/bin:{os.environ.get('PATH', '')}"
os.environ['OLLAMA_HOST'] = '127.0.0.1:11434'
os.environ['LD_LIBRARY_PATH'] = "/opt/conda/lib/python3.12/site-packages/nvidia/cuda_runtime/lib:" + os.environ.get('LD_LIBRARY_PATH', '')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

print("Environment configured")
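
The install directory and the port appear in several cells, so it is easy for them to drift apart. One way to keep them consistent is to derive the settings from two values. This is a sketch under that assumption; `build_ollama_env` is a hypothetical helper, and it only covers `PATH` and `OLLAMA_HOST`, not the CUDA-related variables set above.

```python
import os

def build_ollama_env(install_dir, port, base_env=None):
    """Return a copy of the environment with PATH and OLLAMA_HOST set for Ollama."""
    env = dict(os.environ if base_env is None else base_env)
    # Put the extracted bin/ directory first so `ollama` resolves to this install
    env["PATH"] = os.path.join(install_dir, "bin") + os.pathsep + env.get("PATH", "")
    env["OLLAMA_HOST"] = f"127.0.0.1:{port}"
    return env

# Example values from this notebook; replace them with your own path and port
env = build_ollama_env("/user/sarah.oberbichler/u18915", 11434, base_env={"PATH": "/usr/bin"})
print(env["OLLAMA_HOST"])
print(env["PATH"])
```

A dictionary built this way can be passed as `env=` to `subprocess.Popen` instead of modifying `os.environ` globally.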

### Start the Ollama Server

In [None]:
import subprocess
import time
import os

# Kill any existing processes
!pkill -f ollama
time.sleep(2)

# Start server
process = subprocess.Popen([
    '/user/sarah.oberbichler/u18915/bin/ollama', 'serve'
], env=os.environ.copy())

print(f"Server started (PID: {process.pid})")
time.sleep(8)

# Test server
result = subprocess.run([
    '/user/sarah.oberbichler/u18915/bin/ollama', 'list'
], capture_output=True, text=True)

print(f"Server status: {result.returncode}")
print(f"Models: {result.stdout}")
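
The fixed `time.sleep(8)` may be too short on a busy node and is longer than needed on an idle one. A possible alternative is to poll the server until it answers. The sketch below assumes the server replies with HTTP 200 on its root URL once it is ready; `wait_for_server` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url, timeout=30.0, interval=0.5):
    """Poll url until it answers with HTTP 200 or until timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False

# Usage (adjust the port to the one you configured in OLLAMA_HOST):
# ready = wait_for_server("http://127.0.0.1:11434")
```

When the function returns `False`, the server did not come up within the timeout, and the server process output is the first place to look.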

### Check the GPU Availability

In [None]:
# Verify GPU is available
!nvidia-smi --query-gpu=name,memory.total --format=csv
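
If you want to use the GPU information in later cells, for example to decide which model size fits, the CSV output can be parsed in Python. This is a minimal sketch; the sample text is illustrative and not real output from the cluster, and `parse_gpu_csv` is a hypothetical helper.

```python
import csv
import io

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv` output into a list of dicts keyed by the header row."""
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    rows = [row for row in reader if row]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in data]

# Illustrative sample (assumed layout: header line followed by one line per GPU)
sample = "name, memory.total [MiB]\nNVIDIA A100-SXM4-80GB, 81920 MiB\n"
print(parse_gpu_csv(sample))
```

To parse real output, capture it with `subprocess.run([...], capture_output=True, text=True)` and pass `stdout` to the function.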

### Download a model

The cell below pulls `llama3.1:8b`, which is also the default model in the chat helper later in this notebook. To work with a different model, change the model name in both places. One candidate is the open-source OLMo2-0325-32B-Instruct, a 32-billion-parameter transformer-based language model developed by the Allen Institute for AI (AI2) and released in March 2025 as part of the OLMo 2 family. According to AI2, this instruction-tuned variant is the first fully open model to outperform GPT-3.5 Turbo and GPT-4o mini on a suite of popular academic benchmarks.

In [None]:
if result.returncode == 0:  # If server is working
    print("Downloading model...")
    download_result = subprocess.run([
        '/user/sarah.oberbichler/u18915/bin/ollama', 'pull', 'llama3.1:8b'
    ], capture_output=True, text=True)
    
    print(f"Download result: {download_result.returncode}")
    if download_result.stderr:
        print(f"Download output: {download_result.stderr}")
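
A return code of 0 indicates that the pull finished, but you may also want to confirm that the model is registered with the server. The sketch below assumes the `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field; the helper names and the sample response are illustrative.

```python
import json
import urllib.request

def fetch_tags(api_url):
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{api_url}/api/tags") as r:
        return json.loads(r.read())

def model_names(tags_response):
    """Extract the model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_response.get("models", [])]

# Illustrative response shape (not real server output)
sample = {"models": [{"name": "llama3.1:8b"}]}
print("llama3.1:8b" in model_names(sample))
```

Against a running server, the check would be `"llama3.1:8b" in model_names(fetch_tags("http://127.0.0.1:11434"))`, with the port adjusted to your `OLLAMA_HOST` setting.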

### Get the model started

In [None]:
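import json
import urllib.request

# Define the API URL for your local Ollama server (the port must match OLLAMA_HOST)
API_URL = "http://127.0.0.1:11434"

# Set the model and its sampling parameters
def ask_llama(prompt, model="llama3.1:8b"):
    """Send a single-turn prompt to the Ollama chat endpoint and return the reply text."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Sampling parameters such as temperature belong under "options"
        "options": {"temperature": 0.2},
    }
    req = urllib.request.Request(
        f"{API_URL}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["message"]["content"]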

In [None]:
# Load the model
def warm_up_model():
    """Warm up the model with simple prompts."""
    print("Warming up model...")

    for prompt in ["Hi", "What is 2+2?", "Say hello"]:
        try:
            ask_llama(prompt)
            print(".", end="", flush=True)
        except Exception as e:
            # Report failures instead of silently ignoring them
            print(f"\nWarm-up prompt failed: {e}")

    print("\nWarmup complete!")

warm_up_model()

### Chat with the model

In [None]:
ask_llama('hi')
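
Each call to `ask_llama` sends a single user message, so the model has no memory of earlier prompts. For a multi-turn conversation, the full message history has to be sent with every request. The following is a minimal sketch of that idea; `ChatSession` and `send_to_ollama` are hypothetical names, and the default URL assumes the port used earlier in this notebook.

```python
import json
import urllib.request

def send_to_ollama(model, messages, api_url="http://127.0.0.1:11434"):
    """Post the full message history to /api/chat and return the reply text."""
    payload = {"model": model, "messages": messages, "stream": False}
    req = urllib.request.Request(
        f"{api_url}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["message"]["content"]

class ChatSession:
    """Keep the conversation history and resend it with every prompt."""

    def __init__(self, send=send_to_ollama, model="llama3.1:8b"):
        self.send = send
        self.model = model
        self.messages = []

    def ask(self, prompt):
        self.messages.append({"role": "user", "content": prompt})
        reply = self.send(self.model, list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Usage against a running server:
# chat = ChatSession()
# chat.ask("My name is Ada.")
# chat.ask("What is my name?")
```

The `send` function is passed in as a parameter so the history logic can be tried without a running server, for example with a stub that returns a fixed string.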