# LLMs: Prompt Engineering, and OLMo

## Introduction

A full introduction to large language models (LLMs) is out of scope for this tutorial. In brief, they are trained on large amounts of text available on the Internet, including books, articles, websites, and other digital content. We have added links to papers and tutorials if you'd like to understand how LLMs are trained. Do note that training LLMs is expensive; the cost can easily reach millions of dollars.

Early language models could predict the probability of a single word token or n-grams; modern large language models can predict the likelihood of sentences, paragraphs, or entire documents.

However, LLMs cannot always reliably retrieve and manipulate the knowledge they possess, which leads to issues like hallucination (i.e., generating factually incorrect information), knowledge cutoffs, and poor performance in domain-specific applications.

For this entire tutorial, we will be using [Open Language Model: OLMo](https://allenai.org/olmo), an open LLM framework built by the [Allen Institute for AI](https://allenai.org/). With this open framework, you can access its complete pretraining data ([Dolma](https://github.com/allenai/dolma)), training code, model weights, and evaluation suite. Tracking openness, transparency, accountability, and risks in LLMs is a growing research area. Check out this [tool](https://opening-up-chatgpt.github.io/) to understand the range of openness in these models.

```{note}
Throughout this tutorial, you will encounter imports from a utility library called `ssec_tutorials`. This library is a collection of utility functions that we have created to make it easier to interact with the models and datasets we use in our tutorials. You can find the source code for this library at https://github.com/uw-ssec/ssec_tutorials.
```

We will first download the model, if you haven't already, using the download script mentioned during the local setup.

In [2]:
from collections import defaultdict
import nltk
import random
import re
import numpy as np
import pandas as pd

In [3]:
from ssec_tutorials import download_olmo_model

In [4]:
OLMO_MODEL = download_olmo_model()

Model already exists at /Users/anshultambay/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf


In [5]:
OLMO_MODEL

PosixPath('/Users/anshultambay/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf')

In [6]:
# Explore the name of the model
OLMO_MODEL.name

'OLMo-7B-Instruct-Q4_K_M.gguf'

Several parts of the model name give us a lot of information about the model: `7B`, `Instruct`, `Q4_K_M`, and `.gguf`.

### `.gguf`

We will cover each of these in the following sections. For now, let's focus on the file format, which is `.gguf`.

We have chosen the [GGUF format](https://huggingface.co/ssec-uw/OLMo-7B-Instruct-GGUF) of the [`OLMo 7B-Instruct`](https://huggingface.co/allenai/OLMo-7B-Instruct-hf) model for this tutorial.

[`GGUF`](https://huggingface.co/docs/hub/en/gguf) is a file format for storing models for inference with [GGML](https://github.com/ggerganov/ggml) and executors based on GGML, a tensor library for machine learning.

This file format is optimized for fast inference on CPUs, which is why we have chosen it for this tutorial. To use the model in this format, we are utilizing [`llama.cpp`](https://github.com/ggerganov/llama.cpp), a popular C/C++ LLM inference framework. Instead of directly calling the C/C++ code, we will use the Python bindings of it called [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python).

Let's start by loading the model to memory and interacting with it using the `llama-cpp-python` library.

In [7]:
from llama_cpp import Llama

In [8]:
olmo = Llama(model_path=str(OLMO_MODEL), verbose=False)

In [9]:
olmo

<llama_cpp.llama.Llama at 0x110c64550>

With just a few lines of code, you now have access to a local LLM at your fingertips!

### `Q4_K_M`

Before moving further, let's take a look at the `Q4_K_M` part of the model name.
This signifies the model's *quantization* type, in other words, how the model has been *compressed*.

**Quantization** reduces a high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower-precision data type.
The `GGUF` format has many [quantization types](https://huggingface.co/docs/hub/en/gguf#quantization-types);
in `Q4_K_M`, most weights are reduced to a roughly 4-bit representation.
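
To build some intuition for what a 4-bit representation means, here is a minimal sketch of symmetric block-wise quantization in pure Python. It is not the actual `Q4_K_M` algorithm, which uses a more elaborate layout; the function names `quantize_block` and `dequantize_block` and the sample weights are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to small signed integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the allowed range
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [value * scale for value in quantized]


# Hypothetical weights, chosen only for illustration
block = [0.12, -0.53, 0.88, -0.07, 0.31, -0.95, 0.44, 0.02]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(q)
print(f"largest rounding error: {max_error:.4f}")
```

Each weight is now stored as a small integer, and the rounding error is bounded by half of the shared scale. That loss of precision is the price paid for the smaller file.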

If you are curious about the details of quantization, please refer to an excellent concept guide on [Quantization](https://huggingface.co/docs/optimum/en/concept_guides/quantization) by Hugging Face.

For the sake of this tutorial, we have quantized the original OLMo model to the `Q4_K_M` type.
You can explore the other types of quantization that we've done at [https://huggingface.co/ssec-uw/OLMo-7B-Instruct-GGUF/tree/main](https://huggingface.co/ssec-uw/OLMo-7B-Instruct-GGUF/tree/main).

```{tip}
If you'd like to play around with other quantization types,
you can use the `download_olmo_model` function with a specific `model_file` input argument value.

For example, to download the `Q5_K_M` model, you can use the following code:

`OLMO_MODEL_Q5_K_M = download_olmo_model(model_file="OLMo-7B-Instruct-Q5_K_M.gguf")`

```

### `7B`

**B** stands for billion, and 7B suggests that this specific model has 7 billion parameters.
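
The parameter count, together with the number of bits per weight, gives a rough idea of how much space the weights need. The sketch below is simple arithmetic under stated assumptions: exactly 7 billion parameters, a fixed number of bits per weight, and no overhead. The helper name `approx_weight_size_gb` is made up for illustration.

```python
def approx_weight_size_gb(n_parameters, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_parameters * bits_per_weight / 8 / 1e9


n_params = 7e9  # "7B", treated as exactly 7 billion for this estimate

for bits in (32, 16, 4):
    size = approx_weight_size_gb(n_params, bits)
    print(f"{bits}-bit weights: ~{size:.1f} GB")
```

This gives roughly 28 GB, 14 GB, and 3.5 GB, which is why a quantized model is practical on a laptop. Real files are somewhat larger because quantization formats also store scales and metadata.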

**Base models**, for example [AllenAI's OLMo-7B](https://huggingface.co/allenai/OLMo-7B), [AllenAI's OLMo-1B](https://huggingface.co/allenai/OLMo-1B), and [Meta's Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), are trained on billions of words of text. The training process is self-supervised, meaning the data is supplied without much annotation or labeling, but much effort is poured into improving the data quality. Training a model on such a tremendous amount of text allows it to learn language patterns and general knowledge.

When prompted, the model predicts the next tokens (roughly, words or pieces of words) that are statistically likely to follow.
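
As a toy illustration of "statistically likely to follow", the sketch below counts which word follows which in a tiny made-up corpus and turns the counts into probabilities. This is a bigram counter, far simpler than a neural LLM, and the corpus and the `predict_next` helper are invented for this example.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; real models see vastly more text
corpus = (
    "jupiter is the largest planet . "
    "the largest planet is jupiter . "
    "the largest city in washington is seattle ."
).split()

# Count how often each word is followed by each other word
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1


def predict_next(word):
    """Return a probability for every word seen after `word` in the corpus."""
    counts = following.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {candidate: n / total for candidate, n in counts.items()}


next_word_probs = predict_next("largest")
print(next_word_probs)
```

In this corpus, "largest" is followed by "planet" twice and "city" once, so "planet" is the more likely continuation.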

For example,

In [10]:
from ssec_tutorials.scipy_conf import parse_text_generation_response

In [11]:
model_response = olmo(
    prompt="Jupiter is the largest", echo=True, max_tokens=1, temperature=0.8
)  # Generate a completion, can also call olmo.create_completion

In [12]:
print(parse_text_generation_response(model_response))

Jupiter is the largest planet


But when prompted with, `What is the capital of Washington state in the USA?`, a base model **could** generate logical text that may or may not contain the right answer. 

This is when `Instruction` fine-tuning comes into play, which enhances the base model's ability to execute specific tasks.

### `Instruct`

For `Instruction` fine-tuning, we take a base model and further train it on much smaller and more specific datasets. For this tutorial, we use [OLMo-7B-Instruct](https://huggingface.co/allenai/OLMo-7B-Instruct-hf), which has been fine-tuned on the [Tulu 2 SFT Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) and [Ultrafeedback Cleaned](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) datasets. That is where the keyword **Instruct** comes from.

In [13]:
model_response = olmo(
    prompt="What is the capital of Washington state in the USA?",
    echo=True,
    temperature=0.8,
)

In [14]:
print(parse_text_generation_response(model_response))

What is the capital of Washington state in the USA?
Olympia is the capital of the U.S. state of Washington.


## LLM Parameters

We typically interact with the LLM via an API through which we can send prompts, and we can configure different parameters to get different results from LLMs. 

In [15]:
import inspect

In [16]:
inspect.signature(olmo).parameters

mappingproxy({'prompt': <Parameter "prompt: 'str'">,
              'suffix': <Parameter "suffix: 'Optional[str]' = None">,
              'max_tokens': <Parameter "max_tokens: 'Optional[int]' = 16">,
              'temperature': <Parameter "temperature: 'float' = 0.8">,
              'top_p': <Parameter "top_p: 'float' = 0.95">,
              'min_p': <Parameter "min_p: 'float' = 0.05">,
              'typical_p': <Parameter "typical_p: 'float' = 1.0">,
              'logprobs': <Parameter "logprobs: 'Optional[int]' = None">,
              'echo': <Parameter "echo: 'bool' = False">,
              'stop': <Parameter "stop: 'Optional[Union[str, List[str]]]' = []">,
              'frequency_penalty': <Parameter "frequency_penalty: 'float' = 0.0">,
              'presence_penalty': <Parameter "presence_penalty: 'float' = 0.0">,
              'repeat_penalty': <Parameter "repeat_penalty: 'float' = 1.1">,
              'top_k': <Parameter "top_k: 'int' = 40">,
              'stream': <Paramet

Some standard parameters are:

**prompt:** The prompt to generate text from.

**max_tokens:** The maximum number of tokens to generate.

**temperature:** A higher temperature produces more creative and diverse output, while a lower temperature produces more deterministic output. In practical terms, you should use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For creative tasks, it might be beneficial to increase the temperature value.

**top_p:** This parameter, in conjunction with temperature, offers a powerful tool for controlling the model's output. Known as nucleus sampling, it allows you to determine the level of determinism in the responses. By using `top_p`, you can specify that only the tokens comprising the top_p probability mass are considered for responses. A low top_p value selects the most confident responses, while a higher value prompts the model to consider more possible words, leading to more diverse outputs. The general recommendation is to alter `temperature` or `top_p` but not both.

**stop:** A list of strings to stop generation when encountered. This is another way to control the length and structure of the model's response. 

**frequency_penalty:** The frequency penalty applies a penalty on the next token based on how many times that token has already appeared in the generated response and prompt. The higher the frequency penalty, the less likely a word will reappear. This setting reduces the repetition of words in the generated response by giving tokens that appear more a higher penalty.

**presence_penalty:** The presence penalty applies the same penalty for all repeated tokens. A token that appears twice and a token that appears n times are penalized the same. You may choose a higher presence penalty if you want the model to generate diverse or creative text. 
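
To make the difference between the two penalties concrete, here is a minimal sketch of one common formulation, in which the frequency penalty scales with how often a token has appeared and the presence penalty is a flat deduction for any token that has appeared at all. The exact arithmetic inside `llama.cpp` may differ; the `apply_penalties` helper, the scores, and the token history are hypothetical.

```python
from collections import Counter


def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the score of tokens that already appeared in the text so far."""
    counts = Counter(previous_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        adjusted[token] = (
            logit
            - count * frequency_penalty  # grows with every repetition
            - (presence_penalty if count > 0 else 0.0)  # flat, applied once
        )
    return adjusted


# Hypothetical raw scores for the next token and the tokens generated so far
logits = {"rain": 2.0, "coffee": 1.5, "sun": 1.0}
history = ["rain", "rain", "rain", "coffee"]

freq_only = apply_penalties(logits, history, frequency_penalty=0.5)
presence_only = apply_penalties(logits, history, presence_penalty=0.5)

print(freq_only)
print(presence_only)
```

With the frequency penalty, "rain" (seen three times) loses three times as much as "coffee" (seen once). With the presence penalty, both lose the same amount, and "sun" is untouched in either case.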

To learn more about other parameters, refer to the [create_completion API reference](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion).
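
The effect of `temperature` and `top_p` described above can be sketched in a few lines of pure Python. This is an illustration of the general technique, not the sampler used by `llama.cpp`; the helper names and the raw scores are made up, and `temperature` is assumed to be greater than zero.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits.values()]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def top_p_filter(probs, top_p=0.95):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalize


# Hypothetical raw scores for the token after "Jupiter is the largest"
logits = {"planet": 4.0, "moon": 2.0, "star": 1.0, "city": 0.5}

sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
nucleus = top_p_filter(softmax_with_temperature(logits), top_p=0.9)

print(f"P(planet) at temperature 0.5: {sharp['planet']:.3f}")
print(f"P(planet) at temperature 1.5: {flat['planet']:.3f}")
print("tokens kept by top_p=0.9:", list(nucleus))
```

A lower temperature concentrates probability on the top token, while a higher one spreads it out. With `top_p=0.9`, only "planet" and "moon" remain candidates here, so the unlikely tokens can never be sampled.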

In [17]:
model_response = olmo(
    prompt="Write a sarcastic but nice poem about the city of Seattle",
    echo=True,
    temperature=1,
    max_tokens=500,
)

In [18]:
print(parse_text_generation_response(model_response))

Write a sarcastic but nice poem about the city of Seattle. It can be funny, satirical or just sarcastic.
Seattle is a city that's both chilly and wet
With coffee that tastes like mud, and people who are out of sorts
They claim it's because of the rain
But I think they're just plain old grumpy

The roads are more like rivers, with puddles so deep
That even Mr. T would be left standing in defeat
The traffic is slow, and the drivers are clueless
It's a city that's difficult to navigate at best

But wait! There's hope. The coffee shops do offer some solace
With their artisan brews and latte art to boot
And although the food may be pricey, it's worth every cent
Seattle may not be perfect, but it has its charms

So if you're ever in the area, don't be a stranger
Just embrace the rain, the chilly weather, and all
For it's what makes Seattle unique and unparalleled.
Go ahead, laugh at my poetic attempt,
I know I did when I heard myself humming "Seattle, oh Seattle!"
It's a city that's full of 

```{important}
Another critical concept to understand is context length: the maximum number of tokens an LLM can process at once, counting both the input prompt and the tokens it generates. You can interpret it as the model's memory or attention span.
```
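
Because the prompt and the generated tokens share the same window, it can be useful to check whether a request fits before sending it. The sketch below is plain arithmetic; the helper `fits_in_context` and the numbers are illustrative. The commented lines show how the inputs could be obtained from the `olmo` object using the `tokenize` and `n_ctx` methods of `llama-cpp-python`; check the API reference for your installed version.

```python
def fits_in_context(n_prompt_tokens, max_new_tokens, context_length):
    """Return True if the prompt plus the requested completion fit in the window."""
    return n_prompt_tokens + max_new_tokens <= context_length


# With the model loaded earlier, the inputs could be obtained like this
# (left commented out so the sketch runs without the model file):
# n_prompt_tokens = len(olmo.tokenize(b"Jupiter is the largest"))
# context_length = olmo.n_ctx()

# Illustrative numbers only
fits = fits_in_context(n_prompt_tokens=1800, max_new_tokens=500, context_length=2048)
print(fits)
```

Here 1800 prompt tokens plus 500 new tokens exceed a 2048-token window, so the request would need a shorter prompt or a smaller `max_tokens`.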

## Prompting

Prompt engineering or prompting is a discipline for developing and optimizing prompts to use LLMs for various applications. 

### Prompt Elements

In general, a prompt can contain any of the following:

**Instruction:** Text to explain a specific task or instructions for the model to perform.

**Context:** Additional context that can help the model generate better responses.

**Input Data:** The input or question a user is interested in finding a response for.

**Output Indicator:** The type or format of the output.
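
The four elements above can be assembled into a single prompt string. This is only a sketch: the `build_prompt` helper and the `Context:` / `Text:` labels are our own convention for this example, not something the model requires.

```python
def build_prompt(instruction, input_data, context=None, output_indicator=None):
    """Join the prompt elements that are present, one per line."""
    parts = [instruction]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Text: {input_data}")
    if output_indicator:
        parts.append(output_indicator)
    return "\n".join(parts)


prompt = build_prompt(
    instruction="Classify the following text into neutral, negative, or positive.",
    input_data="Today's Seattle weather is beautiful.",
    output_indicator="Sentiment:",
)
print(prompt)
```

The resulting string can be passed as the `prompt` argument when calling `olmo`. Not every prompt needs all four elements; here the context is simply omitted.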

### Chat Completion

A common use case for LLMs is chat. In a chat context, rather than prompting the LLM with a single string of text, you prompt the model with a conversation that consists of one or more messages, each of which includes a role, like `user` or `assistant`, as well as text as `content`.

`llama-cpp-python` provides a [high-level API for chat completion](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#chat-completion).

`llama-cpp-python` typically formats the messages in the conversation into a single prompt using a chat template from the `gguf` model's metadata. Chat templates are part of the tokenizer (more on that in `Module 2`). They specify how to convert a chat conversation, represented as a list of messages, into a single tokenizable string in the format that the model expects, i.e., a prompt.

For OLMo, you can see its chat template using:

In [19]:
olmo.metadata["tokenizer.chat_template"]

"{{ eos_token }}{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

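To see what this template does to a conversation, here is an approximate plain-Python rendering of it. It is a sketch, not the Jinja engine that the library uses: whitespace between blocks may differ from the real output, the `render_chat` helper is made up, and `<|endoftext|>` as the value of `eos_token` is an assumption.

```python
def render_chat(messages, eos_token="<|endoftext|>", add_generation_prompt=True):
    """Approximate the Jinja chat template shown above with string operations."""
    rendered = eos_token
    for index, message in enumerate(messages):
        if message["role"] == "user":
            rendered += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            rendered += "<|assistant|>\n" + message["content"] + eos_token + "\n"
        is_last = index == len(messages) - 1
        if is_last and add_generation_prompt:
            # Cue the model that it is the assistant's turn to write
            rendered += "<|assistant|>\n"
    return rendered


single_prompt = render_chat(
    [{"role": "user", "content": "What is the capital of Washington state?"}]
)
print(single_prompt)
```

The conversation becomes one string that ends with the `<|assistant|>` marker, so the model continues the text by writing the assistant's reply.
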
### Prompting Techniques

Prompts can help you get results on different tasks with LLMs.

##### Zero-shot Prompting

A zero-shot prompt directly instructs the model to perform a task without any examples or additional context, relying entirely on its pre-existing knowledge.

In [20]:
chat_response = olmo.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Classify the following text into neutral, negative, or positive. Today's Seattle weather is beautiful.",
        },
    ],
    temperature=0.8,
)

In [22]:
from ssec_tutorials.scipy_conf import parse_chat_completion_response

In [23]:
print(parse_chat_completion_response(chat_response))

{'role': 'assistant', 'content': 'Positive.\nThe text "Today\'s Seattle weather is beautiful." is positive because it describes the current weather conditions in a favorable way by using the word "beautiful".'}


Note that in the prompt above, we didn't provide OLMo with any additional context. OLMo already understands `sentiment`: that's zero-shot at work.

##### Prompting with Context

Although OLMo and other LLMs can demonstrate remarkable zero-shot capabilities, they can fail in more complex or specific tasks. In this case, we can introduce examples (shots) or additional context within the prompt to improve OLMo's response.
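
One way to supply examples (shots) is to write them as earlier turns of the conversation, with each example input as a `user` message and the desired answer as an `assistant` message. The sketch below only builds the `messages` list; the `build_few_shot_messages` helper and the example sentences are made up for illustration.

```python
def build_few_shot_messages(examples, query):
    """Turn (input, expected answer) pairs plus a final query into chat messages."""
    messages = []
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical examples showing the answer format we want
examples = [
    ("Classify the sentiment: The ferry was delayed for two hours.", "Negative"),
    ("Classify the sentiment: The ferry leaves at noon.", "Neutral"),
]

few_shot_messages = build_few_shot_messages(
    examples, "Classify the sentiment: Today's Seattle weather is beautiful."
)
print(few_shot_messages)
```

The resulting list can be passed as `messages=` to `olmo.create_chat_completion`. The examples show the model the expected label format, which often leads to shorter and more consistent answers.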

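Let's try zero-shot to learn more about the 2025 Schmidt Sciences AI in Science Postdoctoral Fellowship Generative AI / RAG Copilot for Scientific Software tutorial.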

In [26]:
chat_response = olmo.create_chat_completion(
    messages=[
        {"role": "user", "content": "Did you hear about the 2025 Schmidt Sciences AI in Science Postdoctoral Fellowship Generative AI / RAG Copilot for Scientific Software?"},
    ],
)

In [27]:
print(parse_chat_completion_response(chat_response))

{'role': 'assistant', 'content': "I'm not personally aware of the specific Schmidt Sciences AI in Science Postdoctoral Fellowships mentioned in the Science Postdoctoral Research Advisory Group (RAG) workshop, but I can provide some general information on postdoctoral fellowships and AI in science.\n\nPostdoctoral fellowships are designed to help recent PhD graduates gain research experience and further develop their skills before pursuing an academic or research career. These fellowships often come with a stipend and typically last one to three years.\n\nAI has the potential to revolutionize many aspects of scientific research, from data analysis and modeling to simulation and prediction. Many postdoctoral fellows are exploring how AI can be applied to various fields in science, including biology, physics, chemistry, and environmental studies. Some examples of AI-focused postdoctoral fellowships include:\n\n* National Science Foundation (NSF) Postdoctoral Fellowships in Research on Gen

Interpret the response before moving on. 

What if we provide relevant information to answer the prompt? 

In [29]:
chat_response = olmo.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "The first annual Schmidt Sciences AI in Science Postdoctoral Fellowship Generative AI / RAG Copilot for Scientific Software will be held virtually on Friday, March 7th from 9AM to 11AM Pacific Time. The initial online “Generative AI / RAG Copilot for Scientific Software” tutorial will focus on how to utilize the underlying methods in Generative AI to advance scientific research, including the basics of LLMs followed by a demo of using LLMs and RAG for creating a question answering tool based on private data. The audience for this tutorial is researchers, with some experience programming in Python, who want to use LLMs for their research.",
        },
        {"role": "user", "content": "Did you hear about the 2025 Schmidt Sciences AI in Science Postdoctoral Fellowship Generative AI / RAG Copilot for Scientific Software?"},
    ],
)

In [30]:
print(parse_chat_completion_response(chat_response))

{'role': 'assistant', 'content': "I'm not aware of any specific information about a 2025 Schmidt Sciences AI in Science Postdoctoral Fellowship RAG Workshop, as I don't have real-time access to all available event data. However, it's possible that such an event could be planned or announced in the future. If you come across any details about this workshop, please let me know, and I'd be happy to update my response accordingly.\n\nIn general, workshops focused on AI in Science (RAG) are becoming increasingly popular as researchers explore the potential of artificial intelligence (AI) techniques for enhancing scientific discovery and innovation. These workshops typically bring together postdocs with a background in science and experience programming in Python or similar languages to learn about and apply AI methods to their research areas.\n\nIf you have any more information about this specific 2025 Schmidt Sciences AI in Science Postdoctoral Fellowship RAG Workshop, please share it with

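In the run shown above, OLMo still says it is not aware of the event, but its response now echoes details from the supplied context, such as the audience's experience programming in Python. Supplying relevant context in the prompt is a step toward more helpful responses, and your own output may differ because generation is sampled.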

Many other prompting techniques (e.g., chain-of-thought and ReAct) exist. For this tutorial, we will focus on **Retrieval-Augmented Generation**, which can enhance OLMo's responses by integrating information retrieved from external sources.
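
As a rough preview of the idea, the sketch below picks the stored text that shares the most words with the question and places it in the prompt as context. Real retrieval systems usually rely on embeddings rather than word overlap; the helper names and the three-sentence document store are made up for illustration.

```python
import re


def words(text):
    """Lowercase a text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & words(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_augmented_messages(question, documents, k=1):
    """Put the retrieved text and the question into a single user message."""
    context = "\n".join(retrieve(question, documents, k))
    content = (
        "Use the following context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return [{"role": "user", "content": content}]


# A tiny stand-in for an external knowledge source
documents = [
    "Olympia is the capital of the U.S. state of Washington.",
    "Jupiter is the largest planet in the Solar System.",
    "The tutorial will be held virtually on Friday, March 7th from 9AM to 11AM Pacific Time.",
]

rag_messages = build_augmented_messages("When will the tutorial be held?", documents)
print(rag_messages[0]["content"])
```

The message produced here can be passed as `messages=` to `olmo.create_chat_completion`, so the model answers from the retrieved text instead of relying only on what it learned during training.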

#### Your turn 😎

Try different `messages` values and see how the output changes. But remember to follow the template structure:
each dictionary must contain the keys `role` and `content`, and the only allowed `role` values are `user` and `assistant`.
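
If you'd like to check your messages before sending them, a small validator like the one below can catch structural mistakes early. It is a hypothetical helper written for this exercise, not part of `llama-cpp-python` or `ssec_tutorials`.

```python
ALLOWED_ROLES = {"user", "assistant"}


def validate_messages(messages):
    """Raise ValueError if a message lacks a key or uses an unsupported role."""
    for position, message in enumerate(messages):
        missing = {"role", "content"} - set(message)
        if missing:
            raise ValueError(f"message {position} is missing keys: {sorted(missing)}")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(
                f"message {position} has unsupported role {message['role']!r}"
            )
    return messages


checked = validate_messages(
    [{"role": "user", "content": "Tell me a fun fact about Seattle."}]
)
```

The function returns the list unchanged when it is valid, so it can wrap the `messages=` argument directly.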

In [None]:
# Write your olmo.create_chat_completion code here. You can use the above example as a reference.

**References**
1. https://news.ycombinator.com/item?id=35712334
2. https://benjaminwarner.dev/2023/07/01/attention-mechanism
3. [Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators](https://dl.acm.org/doi/10.1145/3571884.3604316)
4. https://www.promptingguide.ai/