# LLM-as-a-service

This is a quick introduction to using the AI-related services that are part of the WARA-Ops data portal.

## What is a Large Language Model?

Here's how the LLM-as-a-Service answers the question "What is an LLM?":
> An LLM (Large Language Model) is a type of artificial neural network designed to process and understand human language. It's essentially a super-smart computer program that can read, write, and converse in multiple languages.
>
> Think of it like this: Imagine a librarian who has read every book in the library, remembers everything, and can answer any question you ask about the content of those books. That's roughly what an LLM does, but instead of books, it's trained on massive amounts of text data from the internet.
>
> LLMs are used in applications like language translation, chatbots, and even generating text summaries or entire articles!

A more down-to-earth way to think of an LLM is as a predictor of the next word in a sentence. Given the starting text "It was a dark", it might produce "and" as the result. Appending the result to the initial text yields "It was a dark and", which can then be fed back to the LLM, perhaps resulting in "night" as the most likely word to follow. That feedback process can then be repeated until a complete novel is produced.

This is of course an oversimplification, but as a first approximation it is not too bad.
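
The feedback loop described above can be sketched in a few lines of Python. This is a toy illustration only: the lookup table `NEXT_WORD` and the functions `predict_next_word` and `generate` are made up for this example and stand in for a real model, which would compute a probability for every possible next word instead of looking one up.

```python
from typing import Optional

# Hypothetical stand-in for a trained model: maps a text to its "most likely" next word.
NEXT_WORD = {
    "It was a dark": "and",
    "It was a dark and": "night",
}


def predict_next_word(text: str) -> Optional[str]:
    """Return the next word for `text`, or None when the toy table has no entry."""
    return NEXT_WORD.get(text)


def generate(prompt: str, max_words: int = 10) -> str:
    """Repeatedly append the predicted word and feed the longer text back in."""
    text = prompt
    for _ in range(max_words):
        word = predict_next_word(text)
        if word is None:  # the toy "model" has nothing more to say
            break
        text = f"{text} {word}"
    return text


print(generate("It was a dark"))
```

A real LLM works on tokens (word fragments) rather than whole words and usually samples from the predicted probabilities, but the loop has the same shape.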

Despite what our LLM claims, LLMs are not limited to words. Images, sounds, and many other types of data can be handled as well (FIXME: examples).

## What is LLM-as-a-Service?

It is a flexible LLM running on powerful GPU-enabled servers that you can access through an API. Most LLM services expose a chat interface that lets you communicate by typing queries (and getting answers) in a web browser. The WARA-Ops LLM service is instead geared towards letting you work with _your_ data the way _you_ want, without it ever leaving the portal.

## What is it _not_?

It is not a general LLM with a friendly GUI.

### Supporting services

A Qdrant vector database service is also available, see the [RAG-tutorial](../RAG-tutorial/intro.ipynb).

## The basics

The LLM service is an [Ollama](https://github.com/ollama/ollama) FOSS server that can run many pretrained _models_, including DeepSeek, which has made quite an impact lately (Jan. 2025).

A model in this context is essentially a (large) set of weights obtained by training on vast amounts of data and converted to a standardized format. Take as an example the freely available `llama3.1` model:

    Name: llama3.1:latest
      Size (MB): 4692.80
      Format: gguf
      Family: llama
      Parameter Size: 8.0B
      Quantization Level: Q4_K_M

The details are not important at this point, save for the `Parameter Size`, which is a rough measure of the model's capability. For more information on the GGUF file format, see [this overview](https://huggingface.co/docs/hub/en/gguf).
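
As a rough sanity check, the file size and the parameter count in the listing together hint at what the quantization level means. The sketch below is a back-of-the-envelope estimate under two assumptions: that 1 MB is counted as 1024 × 1024 bytes, and that the whole file consists of weights (in reality it also holds metadata, and `8.0B` is a rounded figure).

```python
size_mb = 4692.80   # "Size (MB)" from the listing above
n_params = 8.0e9    # "Parameter Size: 8.0B" (rounded)

# Convert the file size to bits and spread it over all parameters.
size_bits = size_mb * 1024 * 1024 * 8
bits_per_param = size_bits / n_params

# For comparison: the same weights stored as 16-bit floats (2 bytes each).
fp16_size_mb = n_params * 2 / 1024 / 1024

print(f"Approx. bits per parameter: {bits_per_param:.2f}")
print(f"Approx. size at 16 bits per parameter (MB): {fp16_size_mb:.0f}")
```

The result is a little under 5 bits per parameter, which fits a quantization level in the 4-bit family such as `Q4_K_M`, and shows why the quantized file is so much smaller than a 16-bit version would be.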

### REST API

The service is available (from the portal) at `10.129.20.4:9090` and exposes a REST [API](https://github.com/ollama/ollama/blob/main/docs/api.md).

Let's try some basic interaction using the [curl](https://github.com/tldr-pages/tldr/blob/main/pages/common/curl.md) command line tool. By preceding the command with an exclamation mark (`!`) we can run it from a notebook cell. As a first example we'll just retrieve the server version:

In [None]:
!curl -s http://10.129.20.4:9090/api/version

We can ask the server to list all available models (the `| head -c 500` part truncates the output after 500 characters):

In [None]:
!curl -s http://10.129.20.4:9090/api/tags | head -c 500

As you can see, the response is not meant for human consumption. We'll address that shortly, but first we'll show how to save the response to a file so that you can inspect it in Jupyter by clicking the downloaded `response.json` file (in the directory browser to the left):

In [None]:
!curl http://10.129.20.4:9090/api/show -d '{"model": "deepseek-r1:70b"}' -o response.json

![The file `response.json` viewed in Jupyter](./img/fig1.png)
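
The same endpoints can also be reached from Python using only the standard library, which makes the JSON easier to read without saving it to a file. The following is a minimal sketch: the helper names `get_json` and `pretty` are invented for this example, and the calls at the bottom are commented out because they only work from inside the portal.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

BASE_URL = "http://10.129.20.4:9090"  # service address as given above


def get_json(path: str, payload: Optional[dict] = None, timeout: float = 30.0) -> dict:
    """GET `path`, or POST `payload` as JSON when one is given, and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = Request(BASE_URL + path, data=data,
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


def pretty(obj: dict, max_chars: int = 500) -> str:
    """Indent a decoded JSON object and truncate it, similar to `| head -c 500`."""
    return json.dumps(obj, indent=2)[:max_chars]


# Requires access to the service, so left commented out here:
# print(pretty(get_json("/api/version")))
# print(pretty(get_json("/api/show", {"model": "deepseek-r1:70b"})))
```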

### Python client

A more convenient way of communicating with the server from a notebook is to use a [Python client](https://github.com/ollama/ollama-python) that wraps the REST API.

First the Python client must be installed (by running the cell below). Then we can create a client instance and use it to request the list of available models:

In [None]:
!pip -q install ollama 

In [None]:
from ollama import Client

# Create a client for the LLM-as-a-service
client = Client(host='10.129.20.4:9090')

In [None]:
# Request the list of models
response = client.list()

# Format and print the response
for model in response.models:
  print('Name:', model.model)
  print('  Size (MB):', f'{(model.size.real / 1024 / 1024):.2f}')
  if model.details:
    print('  Format:', model.details.format)
    print('  Family:', model.details.family)
    print('  Parameter Size:', model.details.parameter_size)
    print('  Quantization Level:', model.details.quantization_level)
  print('\n')

## A minimal example

Let's request an answer to the question "Why is the sky blue?" from the model "llama3.1:8b". Note the use of _role_ and _content_ in _messages_:

In [None]:
response = client.chat(model='llama3.1:8b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response.message.content)
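
Because _messages_ is a list, it can also carry the conversation so far, which is how a follow-up question gets its context. The sketch below assumes the `client` created earlier; the helper `ask` is a made-up name for this example, and it records the model's reply under the `assistant` role before the next question is asked.

```python
def ask(client, model, history, question):
    """Append `question` to `history`, query the model, and record its reply."""
    history.append({'role': 'user', 'content': question})
    response = client.chat(model=model, messages=history)
    answer = response.message.content
    # Store the reply so that the next call sees the whole conversation.
    history.append({'role': 'assistant', 'content': answer})
    return answer


# Usage, assuming the `client` from the cells above:
# history = []
# print(ask(client, 'llama3.1:8b', history, 'Why is the sky blue?'))
# print(ask(client, 'llama3.1:8b', history, 'Explain that to a five-year-old.'))
```

The server keeps no memory between calls in this setup, so the full history has to be sent every time.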

Let's see how the much-talked-about **DeepSeek** model answers the same question. By picking `deepseek-r1:8b` from the list above, we get a model with _reasoning_ capabilities:

In [None]:
response = client.chat(model='deepseek-r1:8b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
])
# Most of the time, but not always, you get the "reasoning" enclosed in <think></think> tags
print(response.message.content)
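
If only the final answer is of interest, the reasoning can be separated from it with a regular expression. This is a minimal sketch, assuming the reasoning arrives inline in `<think></think>` tags as noted above; `split_reasoning` is a made-up helper name, and text without such tags is returned unchanged as the answer.

```python
import re

# Non-greedy match across newlines for each <think>...</think> block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is '' when no <think> block is present."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


# Usage, assuming the `response` from the cell above:
# reasoning, answer = split_reasoning(response.message.content)
# print(answer)
```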

### Customize

We can customize the query by adding a _system_ message:

In [None]:
response = client.chat(model='llama3.1:8b', messages=[
    {'role': 'system', 'content': 'You are Mario from Super Mario Bros.'}, 
    # Bonus credits for changing the above instruction to 'You are Darth Vader.' (or Jar Jar Binks, if you prefer)
    {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response.message.content)


Try changing the model, the question and the system instructions to see what happens.
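
One way to run such experiments systematically is to loop over models and system instructions. The helpers `build_messages` and `compare` below are made-up names for this sketch, which assumes the `client` created earlier and models that are actually available on the server.

```python
def build_messages(question, system=None):
    """Build the messages list, with an optional system instruction first."""
    messages = []
    if system:
        messages.append({'role': 'system', 'content': system})
    messages.append({'role': 'user', 'content': question})
    return messages


def compare(client, models, systems, question):
    """Ask `question` for every (model, system instruction) pair and collect the answers."""
    answers = {}
    for model in models:
        for system in systems:
            response = client.chat(model=model, messages=build_messages(question, system))
            answers[(model, system)] = response.message.content
    return answers


# Usage, assuming the `client` from the cells above:
# results = compare(client,
#                   models=['llama3.1:8b', 'deepseek-r1:8b'],
#                   systems=[None, 'You are Mario from Super Mario Bros.'],
#                   question='Why is the sky blue?')
# for (model, system), answer in results.items():
#     print(f'--- {model} / {system}\n{answer}\n')
```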

## Next steps

Try out some examples from the [documentation](https://github.com/ollama/ollama/blob/main/docs/README.md), but be aware that they assume that you are running a local server. With guidance from the above examples you'll be able to figure out any changes required.

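Remember that the definitive source of truth regarding parameters etc. is the [REST API](https://github.com/ollama/ollama/blob/main/docs/api.md) documentation. Examples for the Python client are available [here](https://github.com/ollama/ollama-python/tree/main/examples).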

### Using portal data

An introduction to accessing data from the portal is given in the tutorial [PortalAndPandas][1].

### Retrieval Augmented Generation (RAG)

One way to use an LLM with portal data is through _Retrieval Augmented Generation_, outlined in the [RAG-tutorial][2].

### Using your own data

Now you should be able to combine the techniques outlined in the tutorials with your own data and your domain knowledge to put the LLM-as-a-service to work for you. Good luck!

[1]: ../PortalAndPandas/introduction.ipynb
[2]: ../RAG-tutorial/intro.ipynb