# Local LLMs and controllable outputs
- download Ollama for easy serving of models; it supports macOS, Linux and Windows
	- https://ollama.com/download
	- follow the instructions and "install". Do not download any models yet.
	- this enables a CLI for running models (soon!)
	- in case the application doesn't start up properly, type `ollama serve` in your terminal.
		- if it's already running, you will see something like `Error: listen tcp 127.0.0.1:11434: bind: address already in use`
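
If you would rather check from code whether the server is up, here is a minimal sketch. It assumes the default address `127.0.0.1:11434` seen in the error message above; `is_port_open` is a hypothetical helper name, not part of Ollama.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, ...
        return False


print("Ollama reachable:", is_port_open())
```

This only tells you that *something* listens on the port, which is usually enough to decide whether `ollama serve` is needed.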

- for local hosting, we usually prefer to run *quantized* GGUF models. The "GG" comes from the initials of the developer Georgi Gerganov.
	- he is the main developer of [whisper.cpp](https://github.com/ggerganov/whisper.cpp) and [llama.cpp](https://github.com/ggerganov/llama.cpp), C++ systems to run ASR (speech recognition) and LLMs respectively.
	- nearly all LLM applications, including Ollama, are built on top of compiled binaries from llama.cpp

## GGUF
- a file format for model weights that supports quantized tensors.
- typical PyTorch models (or similar) can be converted to the `.gguf` format.
- quantized weights are lower-bit representations of the full-precision weights used when training the networks
	- e.g., from FP16 (half precision) to a ~5-bit representation, commonly denoted by "Q5" in the file name.
		- libraries like PyTorch train in FP32 by default, but training has moved towards mixed precision, which combines FP16 and FP32:
			- FP16: forward and backward passes (activations and gradients)
			- FP32: master copy of the weights and the optimizer update, for numerical stability
- can reduce a 70B model (~140GB in FP16!) to 20-30GB while still being fairly usable.
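
The sizes above follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. A rough sketch, assuming 1 GB = 1e9 bytes and ignoring file metadata; `model_size_gb` is a hypothetical helper. Real GGUF quants tend to use slightly more bits per weight than their nominal number, since each block of weights also stores scale factors.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 70e9  # a 70B-parameter model
for label, bits in [("FP16", 16), ("~5 bit", 5), ("~3 bit", 3)]:
    print(f"{label:>7}: {model_size_gb(n_params, bits):6.1f} GB")
```

The 20-30GB range mentioned above therefore corresponds to roughly 2.5 to 3.5 bits per weight.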

<p align="center"> <img src="assets/gguf-bytes.png" alt="gguf-bytes.png"> </p>

Here's a list of some quants of the Llama-3.3 70B model:

<p align="center"> <img src="assets/gguf-download.png" alt="gguf-download.png"> </p>
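
To make "lower-bit representation" concrete, here is a toy sketch of block-wise quantization: one block of weights is stored as small integers plus a single scale factor. This is a simplified illustration under my own assumptions (symmetric round-to-nearest, one scale per block), not the actual schemes used by llama.cpp, which are more elaborate. The weights are made up.

```python
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int = 4) -> Tuple[float, List[int]]:
    """Symmetric round-to-nearest quantization of one block of weights."""
    qmax = 2 ** (bits - 1) - 1  # largest integer code, e.g. 7 for 4 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Each weight becomes an integer code in [-qmax, qmax]
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate weights from the scale and the integer codes."""
    return [scale * c for c in codes]


block = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.45, 0.02]  # made-up weights
scale, codes = quantize_block(block, bits=4)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"max abs error: {max_err:.3f} (scale={scale:.3f})")
```

Fewer bits mean a coarser grid and a larger rounding error, which is the quality/size trade-off between the quants in the list above.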

## Getting started with a model
- Let's begin with `Llama-3.2 1B`, a small model just to test out our system
- typically, dedicated people in the community download the original models and quantize them with llama.cpp, so that we can download a ready-made GGUF file.
	- one of the most active ones is the user `bartowski` on Hugging Face.
		- if you don't know Hugging Face, it's basically the GitHub of AI models and datasets
- path: https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q6_K.gguf
- you can click "Use this model" -> "Ollama", which generates a runnable command:
	- `ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q6_K_L`
		- this is the highest-quality 6-bit version. It's only 1.1GB, so let's start with that.
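
The model reference in that command follows the pattern `hf.co/{user}/{repo}:{quant}`. A small sketch that assembles it, in case you want to switch between quants in code; `hf_ollama_ref` is a hypothetical helper name.

```python
def hf_ollama_ref(user: str, repo: str, quant: str = "") -> str:
    """Build an Ollama model reference for a GGUF repo on Hugging Face."""
    ref = f"hf.co/{user}/{repo}"
    # The quant tag is optional; without it no specific quant is requested
    return f"{ref}:{quant}" if quant else ref


model = hf_ollama_ref("bartowski", "Llama-3.2-1B-Instruct-GGUF", "Q6_K_L")
print(f"ollama run {model}")
```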

<p align="center"> <img src="assets/huggingface-menu.png" alt="hf dl menu.png"> </p>

Buuuut... we're not interested in talking to it through the terminal. We want to process the outputs in our code, i.e., we need an API!

## Ollama - API
Install a wrapper for the API, in this case for Python:

`pip install ollama`
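
The Python package is a thin client for the HTTP server that Ollama runs locally. As a rough sketch of what a chat call looks like without the wrapper, assuming the default address `http://localhost:11434` and the `/api/chat` endpoint; `build_chat_request` and `chat_http` are hypothetical helper names:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default address


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare (but do not send) a non-streaming chat request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_http(model: str, prompt: str) -> str:
    """Send the request; needs a running Ollama server with the model available."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.loads(resp.read())["message"]["content"]
```

In practice the wrapper is more convenient, since it handles streaming, typed responses and errors for you.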

We then define our model as a variable and inspect its metadata:

```python
import ollama

model = "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q6_K_L"
# `show` returns the model's metadata; `modelinfo` holds the GGUF key/value pairs
print(ollama.show(model).modelinfo)
```

{'general.architecture': 'llama', 'general.basename': 'Llama-3.2', 'general.file_type': 18, 'general.finetune': 'Instruct', 'general.languages': ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th'], 'general.license': 'llama3.2', 'general.parameter_count': 1235814432, 'general.quantization_version': 2, 'general.size_label': '1B', 'general.tags': ['facebook', 'meta', 'pytorch', 'llama', 'llama-3', 'text-generation'], 'general.type': 'model', 'llama.attention.head_count': 32, 'llama.attention.head_count_kv': 8, 'llama.attention.key_length': 64, 'llama.attention.layer_norm_rms_epsilon': 1e-05, 'llama.attention.value_length': 64, 'llama.block_count': 16, 'llama.context_length': 131072, 'llama.embedding_length': 2048, 'llama.feed_forward_length': 8192, 'llama.rope.dimension_count': 64, 'llama.rope.freq_base': 500000, 'llama.vocab_size': 128256, 'quantize.imatrix.chunks_count': 125, 'quantize.imatrix.dataset': '/training_dir/calibration_datav3.txt', 'quantize.imatrix.entries_count': 112, 'quanti

## Generating responses!

```python
from ollama import ChatResponse, chat

model = "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q6_K_L"

prompt = "Give me a simple recipe for a delicious citrusy cake. Make sure units are in grams when it makes sense. Temperatures should be in C."

# Send a single user message and wait for the complete reply
response: ChatResponse = chat(
    model=model,
    messages=[
        {"role": "user", "content": prompt},
    ],
)
print(response.message.content)
```

Here's a simple recipe for a delicious citrusy cake that uses metric units and Celsius temperatures:

**Lemon Blueberry Cake**

Ingredients:

* 250g all-purpose flour
* 150g granulated sugar
* 100g unsalted butter, softened (approx. 115°C)
* 2 large eggs
* 200g plain Greek yogurt
* 1 tsp baking powder
* 0.5 tsp salt
* 120g fresh blueberries

Instructions:

1. Preheat your oven to 180°C and grease two 20cm round cake pans.
2. In a medium bowl, whisk together the flour, sugar, baking powder, and salt.
3. In a large mixing bowl, use an electric mixer to cream together the butter and eggs until light and fluffy (approx. 4-5 minutes).
4. Add the yogurt and mix until well combined.
5. Gradually add the dry ingredients to the wet ingredients, alternating with the lemon juice, starting and ending with the dry ingredients. Beat just until combined.
6. Gently fold in the blueberries.
7. Divide the batter evenly between the prepared pans and smooth the tops.
8. Bake for 25-30 minutes or until a t
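
The `messages` list is the whole conversation: to ask a follow-up question, append the model's reply with the role `assistant` and then the next `user` message. A minimal sketch; `ask` is a hypothetical helper, and the chat function is passed in as `chat_fn` so that it can be `ollama.chat` or a stand-in. It assumes the reply is exposed as `response.message.content`, as in the cell above.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def ask(chat_fn: Callable, model: str, history: List[Message], prompt: str) -> str:
    """Send `prompt` together with the full history and record both turns."""
    history.append({"role": "user", "content": prompt})
    response = chat_fn(model=model, messages=history)
    reply = response.message.content
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage with the real client (needs a running Ollama server):
# from ollama import chat
# history: List[Message] = []
# ask(chat, model, history, "Give me a simple recipe for a citrusy cake.")
# ask(chat, model, history, "Now halve all the quantities.")
```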

# Controlling the responses...

Sure, getting text out of a local model is cool, but free-form text is hard to work with in code.
Time to tame it!

First off, there are a few common parameters that can be used to tune the outputs.
Some are related to "creativity", whereas some control the predictability and determinism of the outputs.

```python
# Common generation options, passed to `chat` through the `options` dict
option_descriptions = {
    "num_ctx": "Size of the context window in tokens (prompt and generated output combined).",
    "seed": "Random seed, for reproducible sampling.",
    "num_predict": "Maximum number of tokens to generate in the output.",
    "top_k": "Limits sampling to the K most probable tokens.",
    "top_p": "Limits sampling to the smallest set of tokens with cumulative probability >= top_p.",
    "temperature": "Controls randomness in generation; higher values increase randomness.",
    "repeat_penalty": "Penalty for repeated tokens, to reduce repetition in the output.",
}
```
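
To get a feel for what `temperature`, `top_k` and `top_p` do, here is a toy sketch on a made-up four-token vocabulary. It is a simplification under my own assumptions: real samplers work on the full vocabulary and may apply these steps in a different order, and a temperature of 0 is typically treated as greedy decoding (always pick the most likely token), so this sketch needs a temperature above 0.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p mass."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalize the survivors


logits = {"lemon": 2.0, "orange": 1.5, "lime": 0.5, "brick": -3.0}  # made-up scores
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    filtered = top_k_top_p(probs, top_k=3, top_p=0.95)
    print(t, {tok: round(p, 3) for tok, p in filtered.items()})
```

At low temperature nearly all probability lands on the top token, while higher temperatures spread it out; `top_k` and `top_p` then cut off the unlikely tail before a token is drawn.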

```python
from typing import List, Optional

from ollama import ChatResponse, chat
from pydantic.json_schema import JsonSchemaValue

model = "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q6_K_L"

system_prompt = "You are a helpful assistant that provides clear and concise answers to the user's needs. Always answer in a JSON format."


def generate(
    prompt: str,
    json_format: Optional[JsonSchemaValue] = None,
    model: str = model,
    system_prompt: str = system_prompt,
) -> str:
    """Send one prompt to the model and return the text of its reply."""
    response: ChatResponse = chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        options={
            "num_ctx": 4096,
            "num_predict": 1024,
            "top_k": 50,
            "top_p": 0.95,
            "temperature": 0.0,
            "seed": 0,  # this is not needed when temp is 0
            "repeat_penalty": 1.0,  # 1.0 = no penalty; works best for JSON outputs, from experience.
        },
        format=json_format,  # None = free-form text, otherwise a JSON schema to follow
        stream=False,
    )
    return response.message.content

# prompt = "give me 5 interesting facts about the universe"
prompt = "Give me a simple recipe for a delicious citrusy cake. Make sure units are in grams when it makes sense. Temperatures should be in C."
print(generate(prompt))
```

{
  "recipe": {
    "name": "Citrusy Cake",
    "ingredients": [
      {
        "name": "Flour",
        "quantity": 250g,
        "unit": "grams"
      },
      {
        "name": "Sugar",
        "quantity": 200g,
        "unit": "grams"
      },
      {
        "name": "Baking powder",
        "quantity": 5g,
        "unit": "grams"
      },
      {
        "name": "Salt",
        "quantity": 2g,
        "unit": "grams"
      },
      {
        "name": "Butter",
        "quantity": 100g,
        "unit": "grams"
      },
      {
        "name": "Eggs",
        "quantity": 4,
        "unit": "units"
      },
      {
        "name": "Milk",
        "quantity": 250g,
        "unit": "grams"
      },
      {
        "name": "Zest of 1 orange",
        "quantity": 20g,
        "unit": "grams"
      },
      {
        "name": "Zest of 1 lemon",
        "quantity": 15g,
        "unit": "grams"
      },
      {
        "name": "Juice of 1 lemon",
        "quantity": 30g,
        "unit": "gra

## More control!
Now, the output already follows a JSON-like format thanks to the system prompt, but we cannot know beforehand which fields it contains, and it is not guaranteed to be valid JSON (note the unquoted `250g` above). Let's fix this by introducing a **schema**, a structured definition of our output.
(Note also that the model can wrap its response in a triple-backtick Markdown code fence, indicating a code snippet inserted into Markdown. This can be removed by postprocessing, however.)
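
The postprocessing mentioned here can be as small as the following sketch. `strip_code_fence` is a hypothetical helper and the reply is made up; it assumes the fence wraps the entire response.

```python
import json
import re

TICKS = "`" * 3  # the triple-backtick marker
# An optional language tag after the opening fence, then the content, then the closing fence
_FENCE = re.compile(rf"^\s*{TICKS}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{TICKS}\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the content inside a surrounding Markdown code fence, if present."""
    match = _FENCE.match(text)
    return match.group(1).strip() if match else text.strip()


raw = TICKS + 'json\n{"name": "Citrusy Cake", "servings": 8}\n' + TICKS  # made-up reply
print(json.loads(strip_code_fence(raw)))
```

Text without a fence passes through unchanged (apart from surrounding whitespace), so the helper is safe to apply to every response.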

We start building our schema with a typed `BaseModel` in Pydantic (which will be converted to a grammar-like format called GBNF, as we'll see later):
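
To see what such a schema looks like, here is a small sketch with a toy model, assuming Pydantic v2; `Point` is a hypothetical example and not part of the recipe. `model_json_schema()` returns a plain dict in JSON Schema form.

```python
from pydantic import BaseModel


class Point(BaseModel):  # hypothetical toy model
    x: float
    label: str


schema = Point.model_json_schema()
print(schema["required"])                 # field names that must be present
print(schema["properties"]["x"]["type"])  # float maps to the JSON Schema type "number"
```

It is this dict, with the field names and types, that gets passed along to constrain the model's output.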

```python
import json
from typing import List

from pydantic import BaseModel

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class RecipeInstruction(BaseModel):
    step: int
    description: str

class Recipe(BaseModel):
    title: str
    ingredients: List[Ingredient]
    instructions: List[RecipeInstruction]
    tools: List[str]

# Parse the JSON string into a Python dict.
# Prefer `json.loads` over `eval`: `eval` executes arbitrary code and also fails on JSON literals such as `true` or `null`.
json.loads(generate(prompt, json_format=Recipe.model_json_schema()))
```

{'title': 'Citrusy Cake Recipe',
 'ingredients': [{'name': 'All-purpose flour',
   'quantity': 250,
   'unit': 'grams'},
  {'name': 'Granulated sugar', 'quantity': 200, 'unit': 'grams'},
  {'name': 'Unsalted butter, softened', 'quantity': 150, 'unit': 'grams'},
  {'name': 'Egg, large', 'quantity': 2, 'unit': 'grams'},
  {'name': 'Zest of 1 lemon', 'quantity': 20, 'unit': 'grams'},
  {'name': 'Zest of 1 orange', 'quantity': 20, 'unit': 'grams'},
  {'name': 'Zest of 1 lime', 'quantity': 20, 'unit': 'grams'},
  {'name': 'Vanilla extract', 'quantity': 5, 'unit': 'grams'},
  {'name': 'Citrus juice (e.g. lemon, orange, lime)',
   'quantity': 100,
   'unit': 'grams'}],
 'instructions': [{'step': 1,
   'description': 'Preheat the oven to 180°C (350°F). Grease two 20cm (8 inch) round cake pans and line the bottoms with parchment paper.'},
  {'step': 2,
   'description': 'In a medium bowl, whisk together flour, sugar, and salt.'},
  {'step': 3,
   'description': 'In a large bowl, using an electr