# Setting up Ollama with your HPC Account

Ollama is an open-source tool for downloading and running large language models locally on your own hardware, without cloud-based API calls or external services. Setting it up for an HPC (High-Performance Computing) account means configuring this lightweight runtime so that it can use the cluster's computational resources, such as powerful GPUs and high-memory nodes. The setup involves three steps: downloading the Ollama binary to your HPC user directory, making sure the GPU drivers and CUDA libraries are available, and pulling the models you want (like Llama 2, Mistral, or CodeLlama), which are then stored on the cluster's storage system.

### Download Ollama (Linux version for the HPC system)

Replace the AcademicID username and HPC user ID in the paths below (here: sarah.oberbichler and u18915) with your own. You can find both in your project overview: https://hpcproject.gwdg.de/projects/baaa32d3-6b49-4831-8b75-87ff44056ae0/

In [None]:
# Download the Linux version for the HPC system
!wget -O /user/sarah.oberbichler/u18915/ollama-linux-amd64.tgz https://github.com/ollama/ollama/releases/download/v0.11.4/ollama-linux-amd64.tgz

# Extract it
!cd /user/sarah.oberbichler/u18915 && tar -xzf ollama-linux-amd64.tgz
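
The same download and extraction can be sketched in plain Python, which keeps the install directory in one place instead of repeating it in every shell command. The helper names `install_dir_for` and `download_and_extract` are hypothetical, and the `/user/<username>/<user id>` layout is an assumption taken from the paths used in the cell above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Release URL copied from the wget command above
OLLAMA_URL = "https://github.com/ollama/ollama/releases/download/v0.11.4/ollama-linux-amd64.tgz"

def install_dir_for(username, user_id):
    """Build the per-user directory, assuming the /user/<username>/<user_id> layout."""
    return Path("/user") / username / user_id

def download_and_extract(target_dir, url=OLLAMA_URL):
    """Download the Ollama archive into target_dir and unpack it there."""
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / "ollama-linux-amd64.tgz"
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    # The later cells call the binary at <target_dir>/bin/ollama
    return target_dir / "bin" / "ollama"

print(install_dir_for("sarah.oberbichler", "u18915"))
```

Calling `download_and_extract(install_dir_for("<username>", "<user id>"))` would perform the actual download; it is left out here so the sketch can be read without network access.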


### Set up your Environment

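First, check which ports are free in your environment. Copy one of the printed port numbers and use it as the port in os.environ['OLLAMA_HOST'] = '127.0.0.1:<port>' in the environment cell below. Use the same port in API_URL when you call the model later.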

In [4]:
import socket

def find_free_ports(start=11434, end=11450):
    free_ports = []
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(('127.0.0.1', port)) != 0:
                free_ports.append(port)
    return free_ports

available_ports = find_free_ports()
print("✅ Available ports:", available_ports)


✅ Available ports: [11435, 11436, 11437, 11438, 11439, 11440, 11441, 11442, 11443, 11444, 11445, 11446, 11447, 11448, 11449, 11450]
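
Instead of copying a port by hand, the host string can be built from the first free port. The following is a minimal sketch under one changed assumption: it tests a port by trying to bind to it, rather than by connecting as the cell above does. The names `first_free_port`, `ollama_host` and `api_url` are hypothetical.

```python
import socket

def first_free_port(start=11434, end=11450, host="127.0.0.1"):
    """Return the first port in [start, end] that can be bound on host, or None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    return None

port = first_free_port()
if port is None:
    print("No free port in the scanned range")
else:
    ollama_host = f"127.0.0.1:{port}"   # value for os.environ['OLLAMA_HOST']
    api_url = f"http://{ollama_host}"   # value for API_URL in the chat cells
    print(ollama_host, api_url)
```

A port that is free now can still be taken by another user before the server starts, so the check is only a snapshot.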


In [5]:
import os

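# Set up your environment
# OLLAMA_MODELS is the variable the server reports in its config log for the
# model directory (alternative location: /workspace/ceph-hdd/.ollama)
os.environ['OLLAMA_MODELS'] = '/user/sarah.oberbichler/u18915/.ollama/models'
# Make the extracted ollama binary available on PATH
os.environ['PATH'] = f"/user/sarah.oberbichler/u18915/bin:{os.environ.get('PATH', '')}"
# Use one of the free ports found in the previous cell
os.environ['OLLAMA_HOST'] = '127.0.0.1:11434'
os.environ['LD_LIBRARY_PATH'] = "/opt/conda/lib/python3.12/site-packages/nvidia/cuda_runtime/lib:" + os.environ.get('LD_LIBRARY_PATH', '')
# Expose only the first GPU to Ollama
os.environ['CUDA_VISIBLE_DEVICES'] = '0'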

print("Environment configured")

Environment configured


### Start the Ollama Server

In [6]:
import subprocess
import time
import os

# Kill any existing processes
!pkill -f ollama
time.sleep(2)

# Start server
process = subprocess.Popen([
    '/user/sarah.oberbichler/u18915/bin/ollama', 'serve'
], env=os.environ.copy())

print(f"Server started (PID: {process.pid})")
time.sleep(8)

# Test server
result = subprocess.run([
    '/user/sarah.oberbichler/u18915/bin/ollama', 'list'
], capture_output=True, text=True)

print(f"Server status: {result.returncode}")
print(f"Models: {result.stdout}")

Server started (PID: 1893630)


time=2025-09-09T11:43:43.157+02:00 level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY:*.hlrn.de,jupyter.hpc.gwdg.de,jupyter.usr.hpc.gwdg.de,localhost,127.0.0.1 OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/user/sarah.oberbichler/u18915/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app:

[GIN] 2025/09/09 - 11:43:51 | 200 |      69.239µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/09/09 - 11:43:51 | 200 |   12.124783ms |       127.0.0.1 | GET      "/api/tags"
Server status: 0
Models: NAME                                              ID              SIZE      MODIFIED     
llama3.1:8b                                       46e0c10c039e    4.9 GB    2 hours ago     
MHKetbi/allenai_OLMo2-0325-32B-Instruct:q5_K_S    7c99a6b3ca4d    22 GB     19 hours ago    
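
The fixed `time.sleep(8)` in the cell above can be too short on a busy node and needlessly long on an idle one. A possible alternative, sketched here with a hypothetical helper `wait_for_server`, is to poll the server root until it answers; the log above shows that a request to "/" returns status 200 once the server is up.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url, timeout=30.0, interval=0.5):
    """Poll base_url until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

After `subprocess.Popen(...)`, a call such as `wait_for_server("http://127.0.0.1:11434")` would return `True` as soon as the server responds and `False` if it never comes up within the timeout.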



### Check the GPU Availability

In [7]:
# Verify GPU is available
!nvidia-smi --query-gpu=name,memory.total --format=csv

name, memory.total [MiB]
NVIDIA A100-SXM4-40GB, 40960 MiB
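
To use this information in code, for example to check whether a model fits on the GPU before loading it, the CSV output can be parsed. The sketch below works on the sample output shown above; `parse_gpu_csv` is a hypothetical helper, and the 23.0 GiB requirement is the figure the server log reports further down for the OLMo2 q5_K_S model.

```python
import csv
import io

def parse_gpu_csv(text):
    """Parse the CSV printed by nvidia-smi --query-gpu=name,memory.total --format=csv."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    gpus = []
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        name, memory = row
        gpus.append((name.strip(), int(memory.split()[0])))  # "40960 MiB" -> 40960
    return gpus

sample = "name, memory.total [MiB]\nNVIDIA A100-SXM4-40GB, 40960 MiB\n"
required_mib = 23.0 * 1024  # 23.0 GiB, taken from the server log for the OLMo2 model

for name, total_mib in parse_gpu_csv(sample):
    verdict = "fits" if total_mib >= required_mib else "too small"
    print(f"{name}: {total_mib} MiB ({verdict})")
```

In the notebook, `sample` could be replaced by the `stdout` of `subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"], capture_output=True, text=True)`.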


### Download a model

We will use the open-source model OLMo2-0325-32B-Instruct. OLMo 2 32B is a 32-billion-parameter transformer-based language model developed by the Allen Institute for AI (Ai2) and released in March 2025 as part of the OLMo 2 family. According to Ai2, this instruction-tuned variant is the first fully open model to outperform GPT-3.5 Turbo and GPT-4o mini on a suite of popular academic benchmarks. The cell below pulls the smaller llama3.1:8b as an example; to download OLMo2 instead, replace the model name with MHKetbi/allenai_OLMo2-0325-32B-Instruct:q5_K_S, the tag used in the chat cells further down.

In [3]:
if result.returncode == 0:  # Only download if the server test succeeded
    print("Downloading model...")
    # Use the ollama binary from your own user directory
    download_result = subprocess.run([
        '/user/sarah.oberbichler/u18915/bin/ollama', 'pull', 'llama3.1:8b'
    ], capture_output=True, text=True)
    
    print(f"Download result: {download_result.returncode}")
    # ollama writes its progress display to stderr
    if download_result.stderr:
        print(f"Download output: {download_result.stderr}")

Downloading model...
[GIN] 2025/09/09 - 09:55:59 | 200 |      51.729µs |       127.0.0.1 | HEAD     "/"


time=2025-09-09T09:56:00.709+02:00 level=INFO source=download.go:177 msg="downloading 667b0c1932bc in 16 307 MB part(s)"
time=2025-09-09T09:56:46.867+02:00 level=INFO source=download.go:177 msg="downloading 948af2743fc7 in 1 1.5 KB part(s)"
time=2025-09-09T09:56:48.191+02:00 level=INFO source=download.go:177 msg="downloading 0ba8f0e314b4 in 1 12 KB part(s)"
time=2025-09-09T09:56:49.527+02:00 level=INFO source=download.go:177 msg="downloading 56bb8bd477a5 in 1 96 B part(s)"
time=2025-09-09T09:56:50.856+02:00 level=INFO source=download.go:177 msg="downloading 455f34728c9b in 1 487 B part(s)"


[GIN] 2025/09/09 - 09:57:04 | 200 |          1m4s |       127.0.0.1 | POST     "/api/pull"
Download result: 0
Download output: pulling manifest
pulling 667b0c1932bc:   0% ▕                  ▏ 845 KB/4.9 GB                 
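
The captured progress display is hard to read because it consists of terminal control sequences and spinner frames. A small clean-up step can make it printable. This is a sketch that assumes the raw `stderr` string still contains the ESC bytes (`\x1b`) of the control sequences; `clean_progress_output` is a hypothetical helper.

```python
import re

ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")  # sequences such as ESC[?25l or ESC[K
SPINNER = re.compile("[\u2800-\u28ff]")           # braille characters used as spinner frames
CURSOR_TO_COLUMN_1 = "\x1b[1G"                    # each redraw of the line starts with this

def clean_progress_output(raw):
    """Strip control sequences and collapse repeated redraws of the same line."""
    text = raw.replace(CURSOR_TO_COLUMN_1, "\n")  # treat every redraw as a new line
    text = ANSI_CSI.sub("", text)
    text = SPINNER.sub("", text)
    lines = []
    for line in re.split(r"[\r\n]+", text):
        line = line.strip()
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return "\n".join(lines)

raw = (
    "\x1b[?2026h\x1b[?25l\x1b[1Gpulling manifest \u280b \x1b[K\x1b[?25h\x1b[?2026l"
    "\x1b[?2026h\x1b[?25l\x1b[1Gpulling manifest \u2819 \x1b[K\x1b[?25h\x1b[?2026l"
)
print(clean_progress_output(raw))
```

In the download cell, `print(clean_progress_output(download_result.stderr))` could then replace the direct print of `stderr`.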

### Get the model started

In [8]:
import json
import urllib.request

# Define the API URL for your local Ollama server (must match OLLAMA_HOST)
API_URL = "http://127.0.0.1:11434"

# Set the model and model parameters
def ask_olmo_api(prompt, model="MHKetbi/allenai_OLMo2-0325-32B-Instruct:q5_K_S"):
    """Send a single user prompt to /api/chat and return the reply text."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Sampling parameters such as temperature belong inside "options"
        "options": {"temperature": 0.2},
    }
    req = urllib.request.Request(
        f"{API_URL}/api/chat",
        json.dumps(payload).encode(),
        {"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["message"]["content"]

In [9]:
# Load the model 
def warm_up_model():
    """Warm up the model with simple prompts."""
    print("Warming up model...")
    
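    for prompt in ["Hi", "What is 2+2?", "Say hello"]:
        try:
            ask_olmo_api(prompt)
            print(".", end="", flush=True)
        except Exception as exc:
            # Report the failure instead of hiding it silently
            print(f"\nWarm-up prompt failed: {exc}")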
    
    print("\nWarmup complete!")

warm_up_model()

Warming up model...


time=2025-09-09T11:44:13.765+02:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/user/sarah.oberbichler/u18915/.ollama/models/blobs/sha256-fd251f9b771983bcee3f8b52173daf0d5215521bb0f80478cb806100288db272 gpu=GPU-921b2ff8-6bed-4799-0364-09d8118e667a parallel=1 available=41840410624 required="23.0 GiB"
time=2025-09-09T11:44:14.087+02:00 level=INFO source=server.go:135 msg="system memory" total="503.8 GiB" free="457.2 GiB" free_swap="0 B"
time=2025-09-09T11:44:14.088+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[39.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="23.0 GiB" memory.required.partial="23.0 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[23.0 GiB]" memory.weights.total="20.4 GiB" memory.weights.repeating="20.0 GiB" memory.weights.nonrepeating="402.0 MiB" memory.graph.full="853.3 MiB" memory.gr

[GIN] 2025/09/09 - 11:46:29 | 200 |         2m16s |       127.0.0.1 | POST     "/api/chat"
.[GIN] 2025/09/09 - 11:46:39 | 200 |  9.748055322s |       127.0.0.1 | POST     "/api/chat"
.[GIN] 2025/09/09 - 11:46:39 | 200 |  359.992423ms |       127.0.0.1 | POST     "/api/chat"
.
Warmup complete!
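
The first warm-up request above takes more than two minutes because the model has to be loaded into GPU memory, and the server config shows OLLAMA_KEEP_ALIVE set to 5m0s, so an idle model is unloaded again after five minutes. The sketch below builds a request that loads the model and asks the server to keep it for longer. It rests on two assumptions that should be checked against the API documentation of your Ollama version: that `/api/chat` loads the model when `messages` is empty, and that it accepts a `keep_alive` field. The helper name `build_preload_request` and the value "30m" are hypothetical.

```python
import json
import urllib.request

def build_preload_request(api_url, model, keep_alive="30m"):
    """Build a /api/chat request without messages (assumed to only load the model)."""
    payload = {"model": model, "messages": [], "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{api_url}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

request = build_preload_request(
    "http://127.0.0.1:11434",
    "MHKetbi/allenai_OLMo2-0325-32B-Instruct:q5_K_S",
)
print(request.get_method(), request.full_url)
```

Against a running server, `urllib.request.urlopen(request).read()` would send the request; it is omitted here so the block runs without a server.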


### Chat with the model

In [10]:
ask_olmo_api('Say Hi')

[GIN] 2025/09/09 - 11:46:40 | 200 |  805.576999ms |       127.0.0.1 | POST     "/api/chat"


'Hello, how can I assist you today?\n\nIf you have any questions or need assistance with something, feel free to ask.'
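
Each call to `ask_olmo_api` sends a single user message, so the model has no memory of earlier turns. For a multi-turn conversation, the full message history has to be sent with every request. The following sketch shows only the bookkeeping; `chat_turn` is a hypothetical helper and `fake_send` stands in for a function that would post the history as `messages` to `/api/chat`.

```python
def chat_turn(history, user_text, send):
    """Append the user message, call send(history), then store and return the reply."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply

def fake_send(messages):
    """Stand-in for the real API call: echoes the latest user message."""
    return f"echo: {messages[-1]['content']}"

history = []
chat_turn(history, "Say Hi", fake_send)
chat_turn(history, "And again", fake_send)

for message in history:
    print(message["role"], "->", message["content"])
```

Replacing `fake_send` with a function that sends `{"model": ..., "messages": messages, "stream": False}` to the server would give the model the whole conversation as context.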