In [1]:
%load_ext autoreload
%autoreload 2

# API

Install a wrapper for the Ollama API, in this case for Python: `pip install ollama`

We then store our model in a variable and inspect its metadata:

In [2]:
import ollama
from models import Models, show_models, load_model

show_models()

model = load_model("llama_3_2_1b")

['llama_3_2_1b', 'llama_3_2_3b', 'llama_3_1_8b', 'deepseek_qwen_32b']


In [3]:
info = ollama.show(model).modelinfo
for k, v in info.items():
    print(f"{k}: {v}")

general.architecture: llama
general.basename: Llama-3.2
general.file_type: 18
general.finetune: Instruct
general.languages: ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th']
general.license: llama3.2
general.parameter_count: 1235814432
general.quantization_version: 2
general.size_label: 1B
general.tags: ['facebook', 'meta', 'pytorch', 'llama', 'llama-3', 'text-generation']
general.type: model
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.key_length: 64
llama.attention.layer_norm_rms_epsilon: 1e-05
llama.attention.value_length: 64
llama.block_count: 16
llama.context_length: 131072
llama.embedding_length: 2048
llama.feed_forward_length: 8192
llama.rope.dimension_count: 64
llama.rope.freq_base: 500000
llama.vocab_size: 128256
quantize.imatrix.chunks_count: 125
quantize.imatrix.dataset: /training_dir/calibration_datav3.txt
quantize.imatrix.entries_count: 112
quantize.imatrix.file: /models_out/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct.imatrix
tokenize

## Generating responses

In [4]:
from ollama import chat

prompt = "Give me a simple recipe for a delicious citrusy cake. Make sure units are in grams when it makes sense. Temperatures should be in C."

response = chat(
    model=model,
    messages=[
        {"role": "user", "content": prompt},
    ],
)
print(response.message.content)

Here's a simple recipe for a delicious citrusy cake:

**Citrus Sunrise Cake**

Ingredients:

* 250g unsalted butter, softened
* 400g granulated sugar
* 4 large eggs, at room temperature
* 200g freshly squeezed lemon juice (about 2 lemons)
* 150g grated orange zest (about 1 medium orange)
* 100g all-purpose flour
* 50g unsweetened cocoa powder
* 30g baking powder
* Salt to taste
* Optional: chopped walnuts or pecans for added texture and flavor

Instructions:

1. Preheat the oven to 175°C (350°F). Grease two 20cm (8-inch) round cake pans.
2. In a medium bowl, whisk together flour, cocoa powder, baking powder, and salt.
3. In a large mixing bowl, beat the butter until creamy, about 2 minutes.
4. Gradually add the sugar to the butter mixture and continue beating until light and fluffy, about 3 minutes.
5. Beat in the eggs one at a time, making sure each egg is fully incorporated before adding the next.
6. Add the lemon juice, orange zest, and vanilla extract (if using). Mix well.
7. Gradu

## Controlling the responses...

We can only do so much with raw text... Let's up the controllability!

First off, there are a few common parameters that can be used to tune the outputs.
Some are related to "creativity", whereas others control the predictability and determinism of the outputs.

```python
# Common generation options and what they do
option_descriptions = {
    "num_ctx": "Size of the context window in tokens (prompt plus generated output).",
    "seed": "Random seed for deterministic generation.",
    "num_predict": "Maximum number of tokens to generate in output.",
    "top_k": "Limits sampling to the top K most probable tokens.",
    "top_p": "Limits sampling to the smallest set of tokens with cumulative probability >= top_p.",
    "temperature": "Controls randomness in generation; higher values increase randomness.",
    "repeat_penalty": "Penalty for repeated tokens to reduce repetition in output.",
}
```
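To make `temperature`, `top_k` and `top_p` a bit more concrete, here is a toy sampler over a handful of made-up logits. It is only a sketch of the general idea, not how Ollama or llama.cpp implement sampling internally; the function name `sample_token`, the convention that `top_k=0` disables the filter, and the example logits are all assumptions made for illustration.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Toy sampler: pick one token from a dict of {token: logit}."""
    rng = rng or random.Random()

    # Temperature 0 is treated as greedy decoding: always take the best token.
    if temperature == 0:
        return max(logits, key=logits.get)

    # Scale logits by temperature, then softmax into probabilities.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the K most probable tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for the next token after "Zest of 1 ..."
logits = {"lemon": 2.0, "orange": 1.5, "lime": 0.5, "cocoa": -1.0}

print(sample_token(logits, temperature=0.0))           # greedy pick
print(sample_token(logits, temperature=1.0, top_k=1))  # only one candidate survives
print(sample_token(logits, temperature=0.8, top_k=3, top_p=0.9, rng=random.Random(0)))
```

With `temperature=0.0` the random generator is never consulted, which is why a seed makes no difference in that setting.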

We can also add a system prompt (see the `role` in the messages below) to guide the model in the right direction. Being explicit about its role can improve both its behavior and its output format.

In [5]:
from ollama import ChatResponse, chat
from pydantic.json_schema import JsonSchemaValue
from typing import Optional, List

system_prompt = "You are a helpful assistant that provides clear and concise answers to the user's needs. Always answer in a JSON format."

def generate(
    prompt: str,
    json_format: Optional[JsonSchemaValue] = None,
    model=model,
    system_prompt=system_prompt,
) -> str:
    response: ChatResponse = chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        options={
            "num_ctx": 4096,
            "num_predict": 1024,
            "top_k": 50,
            "top_p": 0.95,
            "temperature": 0.0,
            "seed": 0,  # this is not needed when temp is 0
            "repeat_penalty": 1.0,  # remain default for json outputs, from experience.
        },
        format=json_format,
        stream=False,
    )
    return response.message.content

# prompt = "give me 5 interesting facts about the universe"
prompt = "Give me a simple recipe for a delicious citrusy cake. Make sure units are in grams when it makes sense. Temperatures should be in C."
print(generate(prompt))

{
  "recipe": {
    "name": "Citrusy Cake",
    "ingredients": [
      {
        "name": "Flour",
        "quantity": 250g,
        "unit": "grams"
      },
      {
        "name": "Sugar",
        "quantity": 200g,
        "unit": "grams"
      },
      {
        "name": "Baking powder",
        "quantity": 5g,
        "unit": "grams"
      },
      {
        "name": "Salt",
        "quantity": 2g,
        "unit": "grams"
      },
      {
        "name": "Butter",
        "quantity": 100g,
        "unit": "grams"
      },
      {
        "name": "Eggs",
        "quantity": 4,
        "unit": "units"
      },
      {
        "name": "Milk",
        "quantity": 250g,
        "unit": "grams"
      },
      {
        "name": "Zest of 1 orange",
        "quantity": 20g,
        "unit": "grams"
      },
      {
        "name": "Zest of 1 lemon",
        "quantity": 15g,
        "unit": "grams"
      },
      {
        "name": "Juice of 1 lemon",
        "quantity": 30g,
        "unit": "gra

## Even more control
Now, the output already follows a JSON format thanks to the system prompt, but we cannot know beforehand which fields it will contain. Let's fix this by introducing a **schema**, a structured definition of our output.
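A prompt-only approach also gives no guarantee that the JSON is even valid: the output above contains `"quantity": 250g`, which a JSON parser rejects. A small sketch of how that shows up in practice, using only the standard library (the helper name `try_parse` and the two sample strings are made up for illustration):

```python
import json

def try_parse(raw: str):
    """Return (parsed_object, None) on success or (None, error_message) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as err:
        return None, f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"

# Same shape as one ingredient from the output above, with and without the stray unit.
bad = '{"name": "Flour", "quantity": 250g, "unit": "grams"}'
good = '{"name": "Flour", "quantity": 250, "unit": "grams"}'

print(try_parse(bad))   # parsed object is None, second element describes the error
print(try_parse(good))  # parsed dict, error is None
```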

We will build our schema from a typed `BaseModel` in pydantic. The schema is then converted to a grammar-like format called GBNF, which you can read about here: <https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md>

If you are not using ollama, you can pass a schema directly; it will again be converted to GBNF.

Here is an example of a schema that forces the output to contain three fields: `questions`, `score`, and `summary`, which are very useful for extracting information from a larger document. Note how you can specify the types, constrain `score` to specific values through the `enum` keyword, and bound array lengths with `minItems`/`maxItems`.

```python
schema = {
    "type": "object",
    "properties": {
        "questions": {
            "type": "array",
            "minItems": 1,
            "maxItems": 3,
            "items": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
        "score": {"type": "integer", "enum": [0, 1, 2, 3]},
        "summary": {"type": "string"},
    },
    "required": ["questions", "score", "summary"],
}
```
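To spell out what these constraints mean for a concrete response, here is a hand-rolled check that mirrors the schema above. It is only a sketch covering this one schema; `check_output` and the two sample responses are hypothetical, and a real project would more likely rely on a proper JSON Schema validator or on pydantic.

```python
import json

def check_output(raw: str) -> list:
    """Return a list of constraint violations; an empty list means the output conforms."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    # "required": all three fields must be present
    errors = [
        f"missing required field: {name}"
        for name in ("questions", "score", "summary")
        if name not in data
    ]

    # "questions": array of 1 to 3 objects, each with a string "question"
    if "questions" in data:
        questions = data["questions"]
        if not isinstance(questions, list) or not 1 <= len(questions) <= 3:
            errors.append("questions must be an array of 1 to 3 items")
        elif not all(
            isinstance(q, dict) and isinstance(q.get("question"), str)
            for q in questions
        ):
            errors.append("each question must be an object with a string 'question'")

    # "score": restricted to the enum values (booleans are excluded explicitly)
    if "score" in data:
        score = data["score"]
        if isinstance(score, bool) or score not in (0, 1, 2, 3):
            errors.append("score must be one of 0, 1, 2, 3")

    # "summary": plain string
    if "summary" in data and not isinstance(data["summary"], str):
        errors.append("summary must be a string")
    return errors

ok = '{"questions": [{"question": "What is GBNF?"}], "score": 2, "summary": "Notes on grammars."}'
bad = '{"questions": [], "score": 7}'

print(check_output(ok))   # → []
print(check_output(bad))  # three violations: summary missing, empty questions, score out of range
```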

However, the easiest and most programmatic way of handling this is to define typed models that are automatically converted to a schema before being sent to the llama.cpp API. We continue with the recipe data format!

In [6]:
from pydantic import BaseModel

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class RecipeInstruction(BaseModel):
    step: int
    description: str

class Recipe(BaseModel):
    title: str
    ingredients: List[Ingredient]
    instructions: List[RecipeInstruction]
    tools: List[str]

# Parse the JSON string into a Python object.
# `json.loads` is preferred over `eval`: it never executes code, and it
# handles JSON literals such as true/false/null, which `eval` would reject.
import json

json.loads(generate(prompt, json_format=Recipe.model_json_schema()))

{'title': 'Citrusy Cake Recipe',
 'ingredients': [{'name': 'All-purpose flour',
   'quantity': 250,
   'unit': 'grams'},
  {'name': 'Granulated sugar', 'quantity': 200, 'unit': 'grams'},
  {'name': 'Unsalted butter, softened', 'quantity': 150, 'unit': 'grams'},
  {'name': 'Egg, large', 'quantity': 2, 'unit': 'grams'},
  {'name': 'Zest of 1 lemon', 'quantity': 20, 'unit': 'grams'},
  {'name': 'Zest of 1 orange', 'quantity': 20, 'unit': 'grams'},
  {'name': 'Zest of 1 lime', 'quantity': 20, 'unit': 'grams'},
  {'name': 'Vanilla extract', 'quantity': 5, 'unit': 'grams'},
  {'name': 'Citrus juice (e.g. lemon, orange, lime)',
   'quantity': 100,
   'unit': 'grams'}],
 'instructions': [{'step': 1,
   'description': 'Preheat the oven to 180°C (350°F). Grease two 20cm (8 inch) round cake pans and line the bottoms with parchment paper.'},
  {'step': 2,
   'description': 'In a medium bowl, whisk together flour, sugar, and salt.'},
  {'step': 3,
   'description': 'In a large bowl, using an electr