# LLMs, Prompt Engineering, and OLMo

## Introduction

A full introduction to how large language models (LLMs) are trained is out of scope for this tutorial. In brief, they are trained on large amounts of text available on the Internet, including books, articles, websites, and other digital content. We have added links to papers and tutorials if you'd like to understand the training process in more depth. Do note that training LLMs is expensive; the cost can easily run into millions of dollars.

Early language models could predict the probability of a single word token or n-grams; modern large language models can predict the likelihood of sentences, paragraphs, or entire documents.
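
To make the n-gram idea concrete, here is a minimal sketch of a bigram model built from raw counts. The three-sentence corpus, the `<s>` start marker, and the function names are all invented for illustration; real n-gram models use far more data and add smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "jupiter is the largest planet",
    "saturn is the second largest planet",
    "jupiter is a gas giant",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_probability(prev: str, nxt: str) -> float:
    """Estimate P(nxt | prev) from raw bigram counts (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0


def sentence_probability(sentence: str) -> float:
    """Chain the bigram probabilities to score a whole sentence."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= next_word_probability(prev, nxt)
    return prob


print(next_word_probability("largest", "planet"))
print(sentence_probability("jupiter is the largest planet"))
```

A sentence containing any word pair that never occurs in the corpus gets probability zero here, which is one reason count-based models scale poorly to long text.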

However, LLMs often struggle to reliably retrieve and manipulate the knowledge they possess, which leads to issues like hallucination (i.e., generating factually incorrect information), knowledge cutoffs, and poor performance in domain-specific applications.

For this entire tutorial, we will be using [Open Language Model: OLMo](https://allenai.org/olmo), an open LLM framework built by the [Allen Institute for AI](https://allenai.org/). With this open framework, you can access its complete pretraining data ([Dolma](https://github.com/allenai/dolma)), training code, model weights, and evaluation suite. Tracking openness, transparency, accountability, and risks in LLMs is a growing research area. Check out this [tool](https://opening-up-chatgpt.github.io/) to understand the range of openness in these models.

We have chosen a 7B instruction-tuned OLMo model that we have compressed to speed up its inference time. 

In [1]:
from llama_cpp import Llama # Python bindings for llama.cpp, to enable LLM inference with minimal setup
from ssec_tutorials.scipy_conf import * # Contains helper methods for tutorial

In [2]:
from inspect import signature 

In [11]:
# Loads the model from the huggingface hub: https://huggingface.co/ssec-uw/OLMo-7B-Instruct-GGUF
# TODO: Change this and load the model locally
olmo = Llama.from_pretrained(repo_id="ssec-uw/OLMo-7B-Instruct-GGUF", filename="OLMo-7B-Instruct-Q4_K_M.gguf", verbose=False)

Note the `7B`, `Instruct`, `GGUF`, and `Q4_K_M` keywords here.

**7B**: B stands for billion; 7B means that this specific model has 7 billion parameters.

**Base models**, for example [AllenAI's OLMo-7B](https://huggingface.co/allenai/OLMo-7B), [AllenAI's OLMo-1B](https://huggingface.co/allenai/OLMo-1B), and [Meta's Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), are trained on billions of words of text. The training process is self-supervised, meaning the data is supplied without manual annotation or labeling, although much effort is poured into improving the data quality. Training the model on a tremendous amount of text allows it to learn language patterns and general knowledge.

When prompted, the model predicts the next tokens (roughly, words or pieces of words) that are statistically likely to follow.

For example,

In [17]:
model_response = olmo(prompt="Jupiter is the largest", echo=True, max_tokens=1, temperature=0.8) # Generate a completion, can also call olmo.create_completion

In [18]:
print(parse_text_generation_response(model_response))

Jupiter is the largest planet


But when prompted with `What is the capital of Washington state in the USA?`, a base model **could** generate plausible text that may or may not contain the right answer.

This is where `Instruction` fine-tuning comes into play: it enhances the base model's ability to execute specific tasks. For `Instruction` fine-tuning, we take a base model and train it further on much smaller and more specific datasets. For this tutorial, we are using a **quantized** (in other words, **compressed**) version of [OLMo-7B-Instruct](https://huggingface.co/allenai/OLMo-7B-Instruct), which has been fine-tuned on the [UltraFeedback dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). That is where the keyword **Instruct** comes from.

`GGUF` is a file format for storing models for inference with GGML and executors based on GGML, a tensor library for machine learning. 

**Quantization** reduces a high-precision representation of weights and activations (usually regular 32-bit floating point) to a lower-precision data type. In `Q4_K_M`, each weight is reduced to a roughly 4-bit representation.
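
As a rough illustration of the idea, the sketch below maps a handful of made-up floating-point weights to 4-bit integers with a single shared scale and then restores them. This is a simplified symmetric scheme chosen for readability; it is not the actual `Q4_K_M` layout, which is more elaborate and is not reproduced here. The function names are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale for the block."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0  # the only float stored per block
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]  # made-up values
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"largest round-trip error: {max_error:.3f}")
```

Each weight now needs 4 bits instead of 32, at the cost of a rounding error of at most half the scale per weight.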

In [14]:
model_response = olmo(prompt="What is the capital of Washington state in the USA?", echo=True, temperature=0.8)

In [8]:
print(parse_text_generation_response(model_response))

What is the capital of Washington state in the USA?
Olympia is the capital of the U.S. state of Washington.


## LLM Parameters

We typically interact with the LLM via an API through which we can send prompts, and we can configure different parameters to get different results from LLMs. 

In [19]:
signature(olmo).parameters

mappingproxy({'prompt': <Parameter "prompt: 'str'">,
              'suffix': <Parameter "suffix: 'Optional[str]' = None">,
              'max_tokens': <Parameter "max_tokens: 'Optional[int]' = 16">,
              'temperature': <Parameter "temperature: 'float' = 0.8">,
              'top_p': <Parameter "top_p: 'float' = 0.95">,
              'min_p': <Parameter "min_p: 'float' = 0.05">,
              'typical_p': <Parameter "typical_p: 'float' = 1.0">,
              'logprobs': <Parameter "logprobs: 'Optional[int]' = None">,
              'echo': <Parameter "echo: 'bool' = False">,
              'stop': <Parameter "stop: 'Optional[Union[str, List[str]]]' = []">,
              'frequency_penalty': <Parameter "frequency_penalty: 'float' = 0.0">,
              'presence_penalty': <Parameter "presence_penalty: 'float' = 0.0">,
              'repeat_penalty': <Parameter "repeat_penalty: 'float' = 1.1">,
              'top_k': <Parameter "top_k: 'int' = 40">,
              'stream': <Paramet

Some standard parameters are:

**prompt:** The prompt to generate text from.

**max_tokens:** The maximum number of tokens to generate.

**temperature:** A higher temperature produces more creative and diverse output, while a lower temperature produces more deterministic output. In practical terms, you should use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For creative tasks, it might be beneficial to increase the temperature value.

**top_p:** Known as nucleus sampling, this parameter works alongside `temperature` to control how deterministic the output is. With `top_p`, only the most likely tokens whose cumulative probability reaches `top_p` are considered for the response. A low `top_p` value restricts sampling to the most likely tokens, while a higher value lets the model consider more possible words, leading to more diverse outputs. The general recommendation is to alter `temperature` or `top_p` but not both.

**stop:** A list of strings to stop generation when encountered. This is another way to control the length and structure of the model's response. 

**frequency_penalty:** The frequency penalty applies a penalty on the next token based on how many times that token has already appeared in the generated response and prompt. The higher the frequency penalty, the less likely a word will reappear. This setting reduces the repetition of words in the generated response by giving a higher penalty to tokens that appear more often.

**presence_penalty:** The presence penalty applies the same penalty for all repeated tokens. A token that appears twice and a token that appears n times are penalized the same. You may choose a higher presence penalty if you want the model to generate diverse or creative text. 
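
The effect of `temperature` and `top_p` described above can be sketched on a toy distribution. The candidate tokens and their scores below are hypothetical, and the functions are a simplified illustration of temperature scaling and nucleus filtering, not the sampler llama.cpp actually runs.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature (> 0) rescales the scores first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Hypothetical candidates and scores for the prompt "Jupiter is the largest".
tokens = ["planet", "moon", "star", "banana"]
logits = [4.0, 2.0, 1.0, -1.0]

# Lower temperatures concentrate probability on the top-scoring token.
for temperature in (0.2, 0.8, 1.5):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Nucleus filtering drops the unlikely tail before sampling.
nucleus = top_p_filter(softmax(logits, 1.0), top_p=0.9)
rng = random.Random(0)  # fixed seed so the draw is repeatable
print(rng.choices(tokens, weights=nucleus, k=1)[0])
```

With these scores, a `top_p` of 0.9 keeps only the two most likely tokens, so the unlikely candidates can never be sampled.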

To learn more about other parameters, refer to the [create_completion API reference](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion).
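
The difference between the frequency and presence penalties described above can be sketched as follows. The example assumes a commonly used formulation in which both penalties are subtracted from a token's score before sampling; the scores, the token history, and the function name are made up, and the exact arithmetic inside llama.cpp may differ.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the score of tokens that already appeared.

    frequency_penalty scales with how often a token appeared;
    presence_penalty is a flat deduction for any token seen at least once.
    """
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts[token]
        adjusted[token] = (
            score
            - frequency_penalty * count
            - presence_penalty * (1 if count > 0 else 0)
        )
    return adjusted


# Made-up scores for the next token and a made-up history of generated tokens.
logits = {"rain": 3.0, "coffee": 2.5, "sun": 1.0}
history = ["rain", "rain", "rain", "coffee"]

print(apply_penalties(logits, history, frequency_penalty=0.5))
print(apply_penalties(logits, history, presence_penalty=0.5))
```

Under the frequency penalty, "rain" loses three times as much as "coffee" because it appeared three times; under the presence penalty, both lose the same amount, and "sun" is untouched in either case.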

In [36]:
model_response = olmo(prompt="Write a sarcastic but nice poem about the city of Seattle", echo=True, temperature=1, max_tokens=500)

In [37]:
print(parse_text_generation_response(model_response))

Write a sarcastic but nice poem about the city of Seattle.

Title: "A Slight Angle on the Emerald City"

The city of Seattle, with its rainy days and its grays
Where coffee shops are the norm, and its hipsters are quite the sight
It's not exactly the Sunset, but it is a place to be seen
With its rainforest-like parks and its chai teas, too

But alas, this city of ours, it's not all that
The hipster fashion, though charming, can get old fast
And while we may enjoy our drizzly afternoons in the park
Our coffee culture is still something to be desired.

Seattle, you see, are a bunch of grumpy-faced natives
With moody skies above and a love for that good brew
But deep down, there's more to this city than meets the eye
A little warmth, some sunshine, would make it all right.

So if you're ever in Seattle, be prepared for rain
And embrace the wet weather with open arms
For despite its quirks and its flaws, this city is worth a visit or two.
With its artsy scene and its quirky charm, there's 

## Prompting

In [10]:
chat_response = olmo.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are an astrophysics expert who answers questions about astrophysics.",
        },
        {"role": "user", "content": "What is dark matter?"},
    ],
    temperature=0.8
)
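
The call above returns a dictionary rather than plain text. As a minimal sketch, assuming the result follows the OpenAI-style chat completion layout, the assistant's reply could be pulled out as shown below. The example dictionary is hand-written with placeholder values, and `first_reply` is a hypothetical helper, not part of the tutorial package or of llama-cpp-python.

```python
# Hand-written stand-in for a chat completion result; the values are
# placeholders, and the layout is an assumption based on the OpenAI-style schema.
example_chat_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<assistant reply text>"},
            "finish_reason": "stop",
        }
    ],
}


def first_reply(chat_response: dict) -> str:
    """Return the assistant text from the first choice (hypothetical helper)."""
    return chat_response["choices"][0]["message"]["content"]


print(first_reply(example_chat_response))
```

If the real `chat_response` has the same layout, `first_reply(chat_response)` would return the model's answer to the dark matter question.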

In [16]:
signature(olmo.create_chat_completion).parameters

mappingproxy({'messages': <Parameter "messages: 'List[ChatCompletionRequestMessage]'">,
              'functions': <Parameter "functions: 'Optional[List[ChatCompletionFunction]]' = None">,
              'function_call': <Parameter "function_call: 'Optional[ChatCompletionRequestFunctionCall]' = None">,
              'tools': <Parameter "tools: 'Optional[List[ChatCompletionTool]]' = None">,
              'tool_choice': <Parameter "tool_choice: 'Optional[ChatCompletionToolChoiceOption]' = None">,
              'temperature': <Parameter "temperature: 'float' = 0.2">,
              'top_p': <Parameter "top_p: 'float' = 0.95">,
              'top_k': <Parameter "top_k: 'int' = 40">,
              'min_p': <Parameter "min_p: 'float' = 0.05">,
              'typical_p': <Parameter "typical_p: 'float' = 1.0">,
              'stream': <Parameter "stream: 'bool' = False">,
              'stop': <Parameter "stop: 'Optional[Union[str, List[str]]]' = []">,
              'seed': <Parameter "seed: 'Op

**References**
1. https://news.ycombinator.com/item?id=35712334
2. https://benjaminwarner.dev/2023/07/01/attention-mechanism
3. [Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators](https://dl.acm.org/doi/10.1145/3571884.3604316)