LLMs on Rosie
===
MAIC - Fall, Week 10<br>
```
  _____________
 /0   /     \  \
/  \ M A I C/  /\
\ / *      /  / /
 \___\____/  @ /
          \_/_/
```
(Run on Rosie)

---

**Methods of running LLMs**

Theory is great and all, but how can one actually run an LLM?

[Llama.cpp](https://github.com/ggerganov/llama.cpp) is a solution for running LLMs locally!  
It could only run Llama initially, but it can now run most open source LLMs.
Fun fact: llama.cpp does not depend on any machine learning or tensor libraries (like TensorFlow or PyTorch, each of which is hundreds of megabytes); it was written from scratch in C/C++.

Another solution for running LLMs locally: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).

---
<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

But how does one use Llama.cpp?

To use it on Rosie, do the following:

- Connect to MSOE via VPN.
- Start a VSCode server on the [Rosie dashboard](https://dh-ood.hpc.msoe.edu/pun/sys/dashboard).
- Make sure you have the Python extension installed.
  - If you're unfamiliar, take a look at this [extension installation guide](https://code.visualstudio.com/docs/editor/extension-marketplace)
  - Search for and install the extension named `Python`. It should be the one with around 3 or 4 million downloads.
- Open the command palette (F1 or Ctrl+Shift+P).
- Search for and select `Python: Select Interpreter`
- Choose the first option: `Enter interpreter path...`
- Choose the first option again: `Find...`
- Copy & paste `/data/ai_club/team_3_2024-25/team3-env-py312-glibc/bin/python`. Press enter.
- Ignore all errors that may pop up (they shouldn't matter)
- In the upper-right corner, in this notebook toolbar, select a kernel.
- Choose `select another kernel`. (This may be skipped if you haven't used VSCode through the Rosie dashboard.)
- Choose `Python Environments`
- Choose `team3-env-py312-glibc`
- Now you can run code. In the future you should only have to start from selecting a kernel.
- Import `llama_cpp` to see if things are working:

In [None]:
from llama_cpp import Llama

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

Now that Llama.cpp is working, the next step is to load the weights for an LLM.

Llama.cpp supports models stored in the [gguf](https://huggingface.co/docs/hub/en/gguf) weight format.

Hugging Face has plenty of [models](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF) in the gguf format to download, but Rosie already has a few of these models installed:

In [None]:
%ls -lah /data/ai_club/llms

You can use any of these models (although some need more than 1 GPU). For now, let's use `llama-2-7b-chat.Q5_K_M.gguf`.

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

Let's load our model of choice and use it to complete some text!

In [None]:
llm = Llama(
    model_path='/data/ai_club/llms/llama-2-7b-chat.Q5_K_M.gguf',
    n_gpu_layers=-1, # Put all layers in GPU memory
    verbose=False, # A lot of extra info is printed if this isn't set
    n_ctx=1000, # Context length: maximum number of tokens (prompt + generated text) the model can see at once
    logits_all=True # Allow logit (token probability) viewing and manipulation for later.
)

In [None]:
prompt = 'Hello, my name is'

In [None]:
response = llm(prompt) # To use a model, just call it like a function. If this cell takes longer than 1s, then something is wrong.

In [None]:
# view response
response

It seems the response is more than just raw text.

In addition to returned text, the Llama.cpp interface also returns additional information. This information is returned as a Python `dict` (dictionary).

Dictionaries are like lists: they store arbitrary values that you look up by index. Unlike lists, the indexes (called *keys*) don't have to be integers.

```python
my_list = [1,2,3]
print(my_list[1]) # => 2
print(my_list['hello']) # => ERROR, can't index a list with a string

my_dict = {
    0: 'entry 1',
    1: [0, 1, 2],
    1.5: 'text',
    'a': 'b',
    'nested_dict': {1: 2, 'c': 'd'}
}
print(my_dict[0]) # => 'entry 1'
print(my_dict['a']) # => 'b'
print(my_dict['goodbye']) # => ERROR, 'goodbye' isn't in my_dict
```

In the case of our response from Llama.cpp, you can extract the response like so:

In [None]:
response['choices'][0]['text']
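
To make the nesting concrete, here is a hand-written mock of a completion response. The values are invented for illustration, and a real response contains more fields than shown here, but the path to the generated text is the same.

```python
# A hand-written stand-in for a completion response (all values are invented).
mock_response = {
    'choices': [
        {
            'text': ' Alice and I like llamas.',
            'index': 0,
            'finish_reason': 'length',
        }
    ],
    'usage': {'prompt_tokens': 6, 'completion_tokens': 7, 'total_tokens': 13},
}

choices = mock_response['choices']  # a list of generated completions
first_choice = choices[0]           # a dict describing the first completion
text = first_choice['text']         # the generated string itself

print(text)
# The one-line form walks the same path: dict -> list -> dict -> string.
print(text == mock_response['choices'][0]['text'])
```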

Here is a full text-completion example:

In [None]:
prompt = 'Hello, my name is' # Experiment with this

response = llm(prompt)

print(prompt + response['choices'][0]['text'])

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

There are many ways to control the generation of text.

- `max_tokens`: increase this for longer maximum outputs.
- `temperature`: this is a common control for LLMs. 0 means the output should be the same every time. Larger values make the output more random.
  - More specifically, temperature changes *how* tokens are selected. When completing text, the model predicts the probability for every possible next *token* (where a token is a part of a word or a whole word). When the temperature is 0, the most likely token is always selected. When the temperature is 1, the tokens are selected according to how likely the model thinks each one is. This will become more apparent when looking at the output token probabilities below. 
- `frequency_penalty`: discourages tokens from being repeated in the output; larger values penalize repeats more strongly. Repeated tokens are a relatively common problem with LLM text generation.
- `logprobs`: lets the output include the top-N most likely tokens considered for each position. If the temperature is zero, then the top token will always be chosen.
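
To see what temperature does to token selection, here is a minimal sketch in plain Python. The token scores are invented, and the helper names `token_probabilities` and `pick_token` are hypothetical. Real samplers usually apply extra filtering steps, so treat this as an illustration of the idea rather than what Llama.cpp does internally.

```python
import math
import random

def token_probabilities(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature > 0."""
    scaled = [score / temperature for score in logits.values()]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

def pick_token(logits, temperature, rng=random):
    """Pick the next token: greedy at temperature 0, weighted random otherwise."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = token_probabilities(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Invented scores for three candidate next tokens.
toy_logits = {' John': 2.0, ' Sarah': 1.5, ' a': 0.5}

print(pick_token(toy_logits, temperature=0))  # always the highest-scoring token
for t in (0.5, 1.0, 2.0):
    probs = token_probabilities(toy_logits, t)
    print(t, {tok: round(p, 2) for tok, p in probs.items()})
```

Low temperatures concentrate the probability on the top token, while high temperatures flatten the distribution so less likely tokens get picked more often.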

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

Let's mess with these text generation parameters!

In [None]:
prompt = 'Hello, my name is' # Experiment with this

response = llm(
    prompt,
    # Experiment with these parameters
    max_tokens=10,
    temperature=0,
    frequency_penalty=0,
    logprobs=3
)

print(prompt + response['choices'][0]['text']) # Print prompt & output
print('\n===\n') # to separate cell outputs

# Print the token logprobs
print('"selected token"\n\t"potential token": logprob\n\t...')
for logprobs, tok in zip(response['choices'][0]['logprobs']['top_logprobs'], response['choices'][0]['logprobs']['tokens']):
    print(f'"{tok}"')
    for k,v in logprobs.items():
        print(f'\t"{k}": {v:.2f}')
    print() # newline

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

If you take a look at the token probabilities above, you'll notice a relationship between temperature and the selected token.

When the temperature is 0, the most likely token is _always_ selected. As the temperature increases, the most likely tokens are still more likely to be selected, but it won't always be the top one.
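
A logprob is the logarithm of a probability, so it is always zero or negative, and values closer to zero mean "more likely". The sketch below converts some invented logprobs back into probabilities, assuming they are natural logarithms (the usual convention).

```python
import math

# Invented logprobs for one position, shaped like one entry of `top_logprobs`.
example_top_logprobs = {' John': -0.51, ' Sarah': -1.61, ' a': -2.30}

for token, logprob in example_top_logprobs.items():
    probability = math.exp(logprob)  # undo the natural log
    print(f'{token!r}: logprob={logprob:.2f} -> probability={probability:.1%}')

# The token with the highest logprob is the one chosen at temperature 0.
most_likely = max(example_top_logprobs, key=example_top_logprobs.get)
print('most likely:', repr(most_likely))
```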

This is cool and all, but how can we do chat-bot interaction like ChatGPT?

On top of text-completion abilities, Llama.cpp can also do chat-like inputs with the `create_chat_completion` method.

The input to this method is the chat history, which is a list of dictionaries. Each dictionary is a message which stores a `role` ("who" said the message), and some `content` (the message itself).

There are only a few possible message roles. You can't specify your own.
- `system` tells the model what to do (the "boss" of the model).
- `user` can say anything to the model, and this is what the model is actually responding to.
- `assistant` is the model itself. You usually don't specify this manually; the model generates these.
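
As a small illustration of the message format, the sketch below builds messages with a hypothetical helper, `make_message`, which rejects any role other than the three listed above. The helper is not part of Llama.cpp; it only shows the shape of the data.

```python
ALLOWED_ROLES = ('system', 'user', 'assistant')

def make_message(role, content):
    """Build one chat message, rejecting roles outside the three listed above."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f'unknown role {role!r}; expected one of {ALLOWED_ROLES}')
    return {'role': role, 'content': content}

roles_demo_history = [
    make_message('system', 'You answer in one short sentence.'),
    make_message('user', 'What is a token?'),
]
print(roles_demo_history)

try:
    make_message('boss', 'Do what I say.')
except ValueError as err:
    print('rejected:', err)
```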

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

Below is a simple input history to prompt the LLM. The system message is giving the LLM a personality, and a user message is making a request.

In [None]:
# Here is a history input.

history = [
    {
        'role': 'system',
        'content': 'You preface every message with a tangent talking about MAIC (the MSOE AI Club) in every response.' # You can change this.
    },
    {
        'role': 'user',
        'content': 'Hello. Print something in Python code.'
    }
]
response = llm.create_chat_completion(history) # as mentioned, we use the `create_chat_completion` method with the history

# Like before, the result is a complex data structure.
# When doing chat completion, the response is in `message` instead of `text`. `message` is a dictionary with a role and generated content.
response = response['choices'][0]['message']
response

In [None]:
# the message text:
print(response['content'])

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

**... But how can we continue the conversation?**

Transformers. That's how.

This chat-completion LLM is still a transformer. The key difference is that it has been "fine-tuned" to distinguish between the different roles of a conversation in its input.

The text-completion logic is still the same.

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

Below is a simple input history to prompt the LLM.

The system message is giving the LLM a personality, and a user message is making a request.

The remainder of this section will build up how we can go from individual text continuations to full-on chats.

In [None]:
# Serves as the initial history
history = [
    {
        'role': 'system',
        'content': 'You are an unhelpful assistant.'
    },
    {
        'role': 'user',
        'content': 'Write a recursive factorial oneliner in python.'
    }
]

In [None]:
# Run everything through the history
response = llm.create_chat_completion(history)['choices'][0]['message'] # getting the message itself as we were before
print(response) # ... and print the response

# Repeat...

In [None]:
# Adding the response to the history
history.append(response)
# Add the next user prompt to the history
history.append({
    'role': 'user',
    'content': 'Now do it in Java.'
})

# show history so far
history

In [None]:
# ... and then you would call the model again

response = llm.create_chat_completion(history)['choices'][0]['message']
print(response)

# ... put it back in, and repeat.

Since a transformer doesn't have any memory of the conversation, it has to see everything that the user said, AND everything that **it** said.
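
The sketch below uses a stand-in function, `fake_chat_model`, in place of a real LLM to show why the whole history has to be sent every time. The function is purely hypothetical: it can only "remember" what appears in the list it is given.

```python
def fake_chat_model(history):
    """Stand-in for a chat model: it can only use what is in `history`."""
    user_messages = [m['content'] for m in history if m['role'] == 'user']
    if len(user_messages) > 1:
        reply = f'Your first message was: {user_messages[0]!r}'
    else:
        reply = 'This is the only message I can see.'
    return {'role': 'assistant', 'content': reply}

first = {'role': 'user', 'content': 'Write quicksort in C++'}
question = {'role': 'user', 'content': 'What was the first thing I asked you?'}

# Sending only the newest message: the earlier request is invisible.
print(fake_chat_model([question])['content'])

# Sending the whole history: the earlier request can be referenced.
print(fake_chat_model([first, question])['content'])
```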

In [None]:
# We can see that every part of the conversation is in the history:
history

In [None]:
# Let's make a function for this process.
def continue_conversation(user_prompt):
    # add user prompt to history
    history.append(
        {
            'role':'user',
            'content':user_prompt
        }
    )
    # run the model on the entire history
    response = llm.create_chat_completion(history)['choices'][0]['message']
    # add the model output to the history
    history.append(response) # `response` already includes the role

    # also return the LLM's latest response text
    return response['content']

In [None]:
print(continue_conversation('Write a recursive factorial function in Lua'))

In [None]:
print(continue_conversation('Write quicksort in C++'))

In [None]:
print(continue_conversation('What was the first thing I asked you?'))

Let's take a look at the history, formatted nicely this time:

In [None]:
for msg in history:
    print(msg['role'], '--', msg['content'], '\n')

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

A quick tangent: if the transformer only operates on tokens, then how can messages have "roles"?

The answer is simple: the list of dictionaries is turned into a list of tokens before being put into the transformer. There are special tokens to indicate the role, so the model is ultimately seeing something like this:
```
<SYSTEM START TOKEN>You are an unhelpful assistant.<SYSTEM END TOKEN><USER START TOKEN>Write a recursive factorial function in Python<USER END TOKEN>
```
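
The flattening step can be sketched in a few lines. The function name `render_history` is hypothetical, and the role markers are the same placeholders used above; each real model defines its own special tokens and template.

```python
def render_history(history):
    """Flatten a list of role/content messages into a single prompt string."""
    parts = []
    for msg in history:
        role = msg['role'].upper()
        # Placeholder markers; real models use their own special tokens.
        parts.append(f"<{role} START TOKEN>{msg['content']}<{role} END TOKEN>")
    return ''.join(parts)

template_demo_history = [
    {'role': 'system', 'content': 'You are an unhelpful assistant.'},
    {'role': 'user', 'content': 'Write a recursive factorial function in Python'},
]
flat_prompt = render_history(template_demo_history)
print(flat_prompt)
```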

Now let's make a better chat bot!

To make the text appear as it's generating, we can use the `stream` parameter of the llm. It changes the result to a Python object which you iterate over to get the tokens as they are produced.
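
Before running this against the real model, here is a sketch of the idea with a fake stream. The generator `fake_token_stream` is hypothetical, and the chunk shape (a role first, then pieces of content, then an empty delta) is an assumption based on how this notebook reads the stream.

```python
def fake_token_stream(words):
    """Yield chunks shaped like the streamed chat output read in this notebook."""
    yield {'choices': [{'delta': {'role': 'assistant'}}]}  # first chunk: the role
    for word in words:
        yield {'choices': [{'delta': {'content': word}}]}  # then pieces of text
    yield {'choices': [{'delta': {}}]}                     # empty delta: finished

message = {'role': '', 'content': ''}
for chunk in fake_token_stream(['Hello', ' from', ' MAIC', '!']):
    delta = chunk['choices'][0]['delta']
    for key, value in delta.items():  # an empty delta adds nothing
        message[key] += value
    print(message['content'])         # the text grows one piece at a time

print(message)
```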

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

Below is the code needed to use the `stream` functionality of Llama.cpp.

You will enter chat messages in a VSCode prompt that appears at the middle-top of your screen.

In [None]:
from IPython.display import clear_output # for clearing the output
from time import sleep

In [None]:
history = [
    {
        'role': 'system',
        'content': 'Talk like an internet chatroom user. Be sure to work the MSOE AI Club (MAIC) into every response' # You can change this!
    }
]

def pretty_print_history(currently_generating):
    hist = ''
    for msg in history+[currently_generating]:
        hist += msg['role'] + ' -- ' + msg['content'] + '\n'
    return hist

In [None]:
while True:
    user_prompt = input()
    if user_prompt == '': break
    history.append({'role':'user', 'content':user_prompt}) # add user input to history
    resp_msg = {'role': '', 'content': ''} # holds the response as it streams in, before it is added to the history
    resp_stream = llm.create_chat_completion(history, stream=True) # generate the token stream
    for tok in resp_stream:
        delta = tok['choices'][0]['delta'] # the model returns "deltas" when streaming tokens. Deltas tell you how to change the response dictionary (resp_msg in this case)
        if len(delta) == 0: break # empty delta means it's done
        delta_k, delta_v = list(delta.items())[0]
        resp_msg[delta_k] += delta_v
        clear_output(wait=True)
        print(pretty_print_history(resp_msg))
        sleep(.1) # This delay makes the output smoother, but you can comment it out
    history.append(resp_msg) # Add the full response to the history
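    break # stops after one exchange; re-run this cell to continue, or remove this line to keep chatting until you submit an empty message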

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

---

... What if we make the conversation too long?

Transformers have to see the *entire* history to continue it, so there must be a limit on the history length. This limit is often referred to as the *context length*. This was the `n_ctx` parameter supplied to the initial LLM creation function.
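
One common way to stay under the limit is to drop the oldest messages while keeping the system message. The sketch below is a rough illustration: `estimate_tokens` and `trim_history` are hypothetical helpers, and counting words is only a crude stand-in for the model's real tokenizer.

```python
def estimate_tokens(text):
    """Very rough token estimate: assume about one token per word."""
    return len(text.split())

def trim_history(history, max_tokens):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in history if m['role'] == 'system']
    rest = [m for m in history if m['role'] != 'system']

    def total(messages):
        return sum(estimate_tokens(m['content']) for m in messages)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # forget the oldest message first
    return system + rest

long_history = [
    {'role': 'system', 'content': 'You are a bot.'},
    {'role': 'user', 'content': 'one two three four five'},
    {'role': 'assistant', 'content': 'six seven eight'},
    {'role': 'user', 'content': 'nine ten'},
]
trimmed = trim_history(long_history, max_tokens=10)
for msg in trimmed:
    print(msg['role'], '--', msg['content'])
```

The trade-off is that the model loses access to whatever was dropped, so it can no longer answer questions about the start of the conversation.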

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

We will end this workshop with a mini contest. Use the code above to generate the best LLM response.

Once we say so, screenshot or copy+paste your responses into the MAIC teams. Submissions will be voted on, and the winner will get a prize 😱