# LLMs, Prompt Engineering, and OLMo

## Introduction

Large language models (LLMs) are trained on large amounts of text available on the Internet, including books, articles, websites, and other digital content. The details of how these models are trained are out of scope for this tutorial, but we have added links to papers and tutorials if you'd like to learn more. Do note that training LLMs is expensive; the cost can easily run into millions of dollars.

Early language models could predict the probability of a single word token or n-grams; modern large language models can predict the likelihood of sentences, paragraphs, or entire documents.
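
To make the idea of an n-gram model concrete, here is a minimal sketch of a bigram (2-gram) model that estimates the probability of the next word from simple counts. The corpus and the function names `train_bigram_model` and `next_word_probabilities` are made up for illustration; real models are trained on far more text and use subword tokens rather than whitespace-separated words.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each word follows another in a whitespace-tokenized corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Estimate P(next word | previous word) from the bigram counts."""
    followers = counts[previous.lower()]
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}


# Toy corpus, invented for this example
toy_corpus = [
    "jupiter is the largest planet",
    "the largest planet is jupiter",
    "the largest city is not a planet",
]
bigram_counts = train_bigram_model(toy_corpus)
probabilities = next_word_probabilities(bigram_counts, "largest")
print(probabilities)
```

A model like this only ever looks one word back, which is one reason it cannot score whole paragraphs or documents the way modern LLMs can.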

However, LLMs cannot always reliably retrieve and manipulate the knowledge they possess. This contributes to issues like hallucination (i.e., generating factually incorrect information), knowledge cutoffs, and poor performance in domain-specific applications.

For this entire tutorial, we will be using [Open Language Model: OLMo](https://allenai.org/olmo), an open LLM framework built by the [Allen Institute for AI](https://allenai.org/). With this open framework, you can access its complete pretraining data ([Dolma](https://github.com/allenai/dolma)), training code, model weights, and evaluation suite. Tracking openness, transparency, accountability, and risks in LLMs is a growing research area. Check out this [tool](https://opening-up-chatgpt.github.io/) to understand the range of openness in these models.

We have chosen a 7B instruction-tuned OLMo model that we have compressed to speed up its inference time. 

In [14]:
import inspect
from llama_cpp import Llama # Python bindings for llama.cpp, to enable LLM inference with minimal setup
from ssec_tutorials import download_olmo_model, OLMO_MODEL
from ssec_tutorials.scipy_conf import * # Contains helper methods for tutorial

In [2]:
# Downloads the OLMo model in ~/.cache/
OLMO_MODEL = download_olmo_model()

In [3]:
OLMO_MODEL

PosixPath('/Users/a42/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf')

In [4]:
olmo = Llama(model_path=str(OLMO_MODEL), verbose=False)

In [5]:
# Explore the name of the model file (OLMO_MODEL is a pathlib path)
OLMO_MODEL.name

'OLMo-7B-Instruct-Q4_K_M.gguf'

Note the `7B`, `Instruct`, `GGUF`, and `Q4_K_M` keywords here.

**7B**: B stands for billion, and 7B suggests that this specific model has 7 billion parameters.

**Base models**, for example [Allen AI's OLMo-7B](https://huggingface.co/allenai/OLMo-7B), [Allen AI's OLMo-1B](https://huggingface.co/allenai/OLMo-1B), and [Meta's Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), are trained on billions of words of text. The training process is self-supervised, meaning the data is supplied without manual annotation or labeling, although much effort goes into improving data quality. Training on such a large amount of text allows the model to learn language patterns and general knowledge.

When prompted, a base model predicts the next tokens (words or pieces of words) that are statistically likely to follow.

For example,

In [6]:
model_response = olmo(prompt="Jupiter is the largest", echo=True, max_tokens=1, temperature=0.8) # Generate a completion, can also call olmo.create_completion

In [7]:
print(parse_text_generation_response(model_response))

Jupiter is the largest planet


But when prompted with `What is the capital of Washington state in the USA?`, a base model **could** generate plausible text that may or may not contain the right answer.

This is where instruction fine-tuning comes into play: it enhances the base model's ability to follow instructions and execute specific tasks. For instruction fine-tuning, we take a base model and further train it on much smaller and more specific datasets. For this tutorial, we are using a **quantized** (in other words, **compressed**) version of [OLMo-7B-Instruct](https://huggingface.co/allenai/OLMo-7B-Instruct), which has been fine-tuned on the [UltraFeedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). That is where the keyword **Instruct** comes from.

`GGUF` is a file format for storing models for inference with GGML and executors based on GGML, a tensor library for machine learning. 

**Quantization** reduces a high-precision representation of weights and activations (typically 16-bit or 32-bit floating point) to a lower-precision data type. In `Q4_K_M`, most weights are stored using roughly 4 bits each.
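
To build some intuition for what quantization does, here is a minimal sketch of symmetric 4-bit quantization with a single shared scale. This is a deliberately simplified scheme and not the actual `Q4_K_M` algorithm, which works on blocks of weights with per-block scales. The weight values and the function names `quantize_4bit` and `dequantize` are invented for this example.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 4-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration
weights = [0.12, -0.53, 0.98, -1.40, 0.05, 0.77]
quantized, scale = quantize_4bit(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)   # small integers that fit in 4 bits
print(max_error)   # the precision lost by compressing
```

Each weight now needs only 4 bits plus a share of the scale, at the cost of a rounding error of at most half the scale per weight. That trade between file size, speed, and accuracy is what the quantization suffix in a GGUF file name describes.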

In [8]:
model_response = olmo(prompt="What is the capital of Washington state in the USA?", echo=True, temperature=0.8)

In [9]:
print(parse_text_generation_response(model_response))

What is the capital of Washington state in the USA?
Washington, D.C. is not a state; it is the capital


## LLM Parameters

We typically interact with the LLM via an API through which we can send prompts, and we can configure different parameters to get different results from LLMs. 

In [15]:
inspect.signature(olmo).parameters

mappingproxy({'prompt': <Parameter "prompt: 'str'">,
              'suffix': <Parameter "suffix: 'Optional[str]' = None">,
              'max_tokens': <Parameter "max_tokens: 'Optional[int]' = 16">,
              'temperature': <Parameter "temperature: 'float' = 0.8">,
              'top_p': <Parameter "top_p: 'float' = 0.95">,
              'min_p': <Parameter "min_p: 'float' = 0.05">,
              'typical_p': <Parameter "typical_p: 'float' = 1.0">,
              'logprobs': <Parameter "logprobs: 'Optional[int]' = None">,
              'echo': <Parameter "echo: 'bool' = False">,
              'stop': <Parameter "stop: 'Optional[Union[str, List[str]]]' = []">,
              'frequency_penalty': <Parameter "frequency_penalty: 'float' = 0.0">,
              'presence_penalty': <Parameter "presence_penalty: 'float' = 0.0">,
              'repeat_penalty': <Parameter "repeat_penalty: 'float' = 1.1">,
              'top_k': <Parameter "top_k: 'int' = 40">,
              'stream': <Paramet

Some standard parameters are:

**prompt:** The prompt to generate text from.

**max_tokens:** The maximum number of tokens to generate.

**temperature:** A higher temperature produces more creative and diverse output, while a lower temperature produces more deterministic output. In practical terms, you should use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For creative tasks, it might be beneficial to increase the temperature value.

**top_p:** Used together with temperature, this parameter controls how deterministic the output is. Known as nucleus sampling, it restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches `top_p`. A low `top_p` value keeps only the most likely tokens, while a higher value lets the model consider more candidate tokens, leading to more diverse outputs. The general recommendation is to alter `temperature` or `top_p`, but not both.

**stop:** A list of strings to stop generation when encountered. This is another way to control the length and structure of the model's response. 

**frequency_penalty:** The frequency penalty lowers the likelihood of a token in proportion to how many times that token has already appeared in the prompt and the generated response. The higher the frequency penalty, the less likely a word is to reappear, and tokens that appear more often receive a larger penalty. This setting reduces word repetition in the generated response.

**presence_penalty:** The presence penalty applies the same penalty to every token that has already appeared, regardless of how often. A token that appears twice and a token that appears n times are penalized equally. You may choose a higher presence penalty if you want the model to generate diverse or creative text.
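
The effect of `temperature` and `top_p` can be sketched on a toy next-token distribution. The code below is an illustration of the general idea, not the sampling code used inside llama.cpp. The logits and the function names `softmax_with_temperature` and `top_p_filter` are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits.values()]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def top_p_filter(probabilities, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, probability in ranked:
        kept[token] = probability
        cumulative += probability
        if cumulative >= top_p:
            break
    total = sum(kept.values())  # renormalize over the surviving tokens
    return {token: probability / total for token, probability in kept.items()}


# Hypothetical scores for the token after "Jupiter is the largest"
toy_logits = {"planet": 4.0, "moon": 2.0, "city": 1.0, "banana": -1.0}

sharp = softmax_with_temperature(toy_logits, temperature=0.2)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
nucleus = top_p_filter(softmax_with_temperature(toy_logits, temperature=1.0), top_p=0.9)

print(sharp["planet"], flat["planet"])
print(sorted(nucleus))
```

With a low temperature almost all probability lands on the top token, while a high temperature spreads it across alternatives. Nucleus filtering then drops the unlikely tail before a token is sampled.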

To learn more about other parameters, refer to the [create_completion API reference](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion).
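
The difference between the frequency and presence penalties described above is easier to see with numbers. The sketch below uses a commonly documented formulation in which both penalties are subtracted from a token's logit; the exact implementation inside llama.cpp may differ in detail. The logits, the token history, and the function name `apply_repetition_penalties` are hypothetical.

```python
from collections import Counter


def apply_repetition_penalties(logits, previous_tokens,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that have already appeared.

    The frequency penalty grows with the number of occurrences;
    the presence penalty is applied once if the token appeared at all.
    """
    counts = Counter(previous_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        appeared = 1 if count > 0 else 0
        adjusted[token] = logit - count * frequency_penalty - appeared * presence_penalty
    return adjusted


# Hypothetical candidate tokens, all equally likely before penalties
toy_logits = {"rain": 2.0, "coffee": 2.0, "sunshine": 2.0}
history = ["rain", "rain", "rain", "coffee"]

with_frequency = apply_repetition_penalties(toy_logits, history, frequency_penalty=0.5)
with_presence = apply_repetition_penalties(toy_logits, history, presence_penalty=0.5)

print(with_frequency)
print(with_presence)
```

Under the frequency penalty, "rain" (seen three times) is pushed down further than "coffee" (seen once). Under the presence penalty, both are pushed down by the same amount, and "sunshine" is untouched in either case.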

In [36]:
model_response = olmo(prompt="Write a sarcastic but nice poem about the city of Seattle", echo=True, temperature=1, max_tokens=500)

In [37]:
print(parse_text_generation_response(model_response))

Write a sarcastic but nice poem about the city of Seattle.

Title: "A Slight Angle on the Emerald City"

The city of Seattle, with its rainy days and its grays
Where coffee shops are the norm, and its hipsters are quite the sight
It's not exactly the Sunset, but it is a place to be seen
With its rainforest-like parks and its chai teas, too

But alas, this city of ours, it's not all that
The hipster fashion, though charming, can get old fast
And while we may enjoy our drizzly afternoons in the park
Our coffee culture is still something to be desired.

Seattle, you see, are a bunch of grumpy-faced natives
With moody skies above and a love for that good brew
But deep down, there's more to this city than meets the eye
A little warmth, some sunshine, would make it all right.

So if you're ever in Seattle, be prepared for rain
And embrace the wet weather with open arms
For despite its quirks and its flaws, this city is worth a visit or two.
With its artsy scene and its quirky charm, there's 

> Another critical concept to understand is context length: the number of tokens an LLM can process at once, counting both the input prompt and the generated output. You can interpret it as the model's memory or attention span.
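
As a rough sketch of what the context length means in practice, the helper below checks whether a prompt plus the requested number of new tokens fits in the window, and trims the oldest tokens if it does not. The function names and the numbers are hypothetical, and the token list is a stand-in for real token IDs. With llama-cpp-python, you would most likely get these from the model's tokenizer (`olmo.tokenize(...)`) and read the window size with `olmo.n_ctx()`.

```python
def fits_in_context(prompt_tokens, max_new_tokens, context_length):
    """Return True if the prompt plus the generated tokens fit in the context window."""
    return prompt_tokens + max_new_tokens <= context_length


def trim_to_budget(tokens, max_new_tokens, context_length):
    """Keep only the most recent tokens so that prompt + generation fits in the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return tokens[-budget:]


# Stand-in token IDs; a real prompt would be tokenized by the model's tokenizer
tokens = list(range(10))
print(fits_in_context(len(tokens), max_new_tokens=4, context_length=8))

trimmed = trim_to_budget(tokens, max_new_tokens=4, context_length=8)
print(trimmed)
```

Dropping the oldest tokens is only one possible strategy; summarizing or selecting the most relevant passages are alternatives when the early part of the prompt matters.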

## Prompting

Prompt engineering or prompting is a discipline for developing and optimizing prompts to use LLMs for various applications. 

### Prompt Elements

In general, a prompt can contain any of the following elements:

**Instruction:** Text to explain a specific task or instructions for the model to perform.

**Context:** Additional context that can help the model generate better responses.

**Input Data:** The input or question a user is interested in finding a response for.

**Output Indicator:** The type or format of the output.
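
One way to see how these elements fit together is to assemble a prompt from them in code. The helper `build_prompt` and the `Context:` / `Text:` labels are conventions invented for this sketch; they are not required by OLMo or llama.cpp.

```python
def build_prompt(instruction, input_data, context=None, output_indicator=None):
    """Combine the prompt elements into a single string, skipping any that are missing."""
    parts = [instruction]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Text: {input_data}")
    if output_indicator:
        parts.append(output_indicator)
    return "\n".join(parts)


prompt = build_prompt(
    instruction="Classify the following text into neutral, negative, or positive.",
    input_data="Today's Seattle weather is beautiful.",
    output_indicator="Sentiment:",
)
print(prompt)
```

Not every prompt needs all four elements. Here the context is omitted, and the output indicator nudges the model to answer with a single label.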

### Chat Completion

A common use case for LLMs is chat. In a chat context, rather than prompting the LLM with a single string of text, you prompt the model with a conversation that consists of one or more messages, each of which includes a role, like `user` or `assistant`, as well as text as `content`.

The Python binding for llama.cpp provides a [high-level API for chat completion](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#chat-completion).

The library typically formats the messages in the conversation into a single prompt using a chat template from the `gguf` model's metadata. Chat templates are part of the tokenizers (more on that in `Module 2`). They specify how to convert a chat conversation, represented as a list of messages, into a single tokenizable string in the format that the model expects, i.e., a prompt.

For OLMo, you can inspect the chat template with:

In [16]:
olmo.metadata["tokenizer.chat_template"]

"{{ eos_token }}{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

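To see what this template does to a conversation, here is a plain-Python approximation of it. This is a sketch and not how llama-cpp-python renders templates: the exact whitespace produced by the Jinja template may differ, the function name `render_olmo_chat` is made up, and the end-of-sequence token `<|endoftext|>` is an assumption about OLMo's tokenizer.

```python
def render_olmo_chat(messages, eos_token="<|endoftext|>", add_generation_prompt=True):
    """Approximate the OLMo chat template with plain string operations.

    eos_token is assumed to be "<|endoftext|>"; whitespace is approximate.
    """
    pieces = [eos_token]
    for index, message in enumerate(messages):
        if message["role"] == "user":
            pieces.append("<|user|>\n" + message["content"])
        elif message["role"] == "assistant":
            pieces.append("<|assistant|>\n" + message["content"] + eos_token)
        is_last = index == len(messages) - 1
        if is_last and add_generation_prompt:
            # Cue the model to answer as the assistant
            pieces.append("<|assistant|>")
    return "\n".join(pieces)


rendered = render_olmo_chat([{"role": "user", "content": "Hello"}])
print(rendered)
```

The trailing `<|assistant|>` marker is what tells the model that it is its turn to respond, which is why the same model behaves differently under chat completion than under plain text completion.
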
### Prompting Techniques

Different prompting techniques can help you get better results from LLMs on different tasks.

##### Zero-shot Prompting

A zero-shot prompt directly instructs the model to perform a task without providing any examples; the model relies entirely on its pre-existing knowledge.

In [17]:
chat_response = olmo.create_chat_completion(
    messages=[
        {"role": "user", "content": "Classify the following text into neutral, negative, or positive. Today's Seattle weather is beautiful."},
    ],
    temperature=0.8,
)

In [18]:
print(parse_chat_completion_response(chat_response))

{'role': 'assistant', 'content': 'Positive.\nThe text "Today\'s Seattle weather is beautiful" is positive as it describes the current weather in a positive manner by saying that it is beautiful.'}


Note that in the prompt above, we didn't provide OLMo with any examples or additional context; OLMo already understands the concept of sentiment. That's zero-shot at work.

##### Few-shot Prompting

Although OLMo and other LLMs can demonstrate remarkable zero-shot capabilities, they can fail at more complex or specific tasks. In this case, we can introduce examples (shots) or additional context within the prompt to improve OLMo's response.

Let's try zero-shot to learn more about SciPy 2024.

In [19]:
chat_response = olmo.create_chat_completion(
    messages=[
        {"role": "user", "content": "Did you hear about SciPy 2024 conference?"},
    ],
)

In [20]:
print(parse_chat_completion_response(chat_response))

{'role': 'assistant', 'content': "I'm not personally aware of the specific details of the SciPy 2024 conference, as I am just an AI model designed to provide general information. However, SciPy is a community of scientists, engineers, and researchers using the Python language for scientific computing, and they organize an annual conference called SciPy Conference. The next SciPy Conference is scheduled to take place in July 2024 in New Orleans, Louisiana, USA. The conference typically features talks, tutorials, and workshops on scientific computing using Python, as well as discussions of recent advances in various scientific domains."}


Interpret the response before moving on. 

What if we provide relevant information to answer the prompt? 

In [21]:
chat_response = olmo.create_chat_completion(
    messages=[
        {"role": "user", "content": "The 23rd annual SciPy conference will be held at the Tacoma Convention Center, July 8-14, 2024. SciPy brings together attendees from industry, academia and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development. "},
        {"role": "user", "content": "Did you hear about SciPy 2024 conference?"},
    ],
)

In [22]:
print(parse_chat_completion_response(chat_response))

{'role': 'assistant', 'content': "Yes, I do have information about the 23rd annual SciPy conference, which will be held at the Tacoma Convention Center from July 8-14, 2024. The conference aims to bring together attendees from various backgrounds such as industry, academia, and government to showcase their projects, learn from skilled users and developers, and collaborate on code development.\n\nSciPy is a non-profit organization dedicated to the advancement of scientific computing in Python, an open-source programming language. The conference offers a diverse range of talks, tutorials, workshops, and networking opportunities for attendees interested in using Python for scientific research and data analysis.\n\nIf you're interested in attending SciPy 2024, be sure to mark your calendars and stay updated on the official website for registration information and program updates."}


OLMo is able to generate a response that's much more helpful to the user. 
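
The example above supplies additional context. The other form of few-shot prompting, supplying worked examples (shots), can be sketched by building the message list programmatically. The helper `build_few_shot_messages` and the example texts and labels are invented for illustration. The resulting list has the same shape as the `messages` argument passed to `olmo.create_chat_completion` earlier.

```python
def build_few_shot_messages(task, examples, query):
    """Turn (text, label) examples into alternating user/assistant messages, then add the query."""
    messages = []
    for text, label in examples:
        messages.append({"role": "user", "content": f"{task}\nText: {text}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"{task}\nText: {query}"})
    return messages


# Hypothetical shots for a sentiment classification task
shots = [
    ("The ferry was delayed for two hours.", "Negative"),
    ("The conference talks start at 9 am.", "Neutral"),
]
few_shot_messages = build_few_shot_messages(
    task="Classify the following text into neutral, negative, or positive.",
    examples=shots,
    query="Today's Seattle weather is beautiful.",
)
for message in few_shot_messages:
    print(message["role"], "->", message["content"])
```

Showing the expected answers as earlier assistant turns gives the model a pattern to imitate, including the output format (a single label rather than a full sentence).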

Many other prompting techniques (e.g., chain-of-thought, ReAct, etc.) exist. For this tutorial, we will focus on **Retrieval-Augmented Generation**, which can enhance OLMo's responses by integrating information retrieved from external sources.

**References**
1. https://news.ycombinator.com/item?id=35712334
2. https://benjaminwarner.dev/2023/07/01/attention-mechanism
3. [Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators](https://dl.acm.org/doi/10.1145/3571884.3604316)
4. https://www.promptingguide.ai/