<a href="https://colab.research.google.com/github/tim-a-davis/silly_little_language_modeling_thing_at_utd/blob/main/CurtGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table>
<tr>
<td style="width: 10%;">

# CurtGPT
Using Microsoft's Phi 1.5 model like it was never intended.

</td>
<td style="text-align: center;">
<img src="https://github.com/tim-a-davis/silly_little_language_modeling_thing_at_utd/blob/main/curtgpt%20logo.png?raw=true" width="300" height="auto">
</td>
</tr>
</table>


# Setup, Installs, Imports

Set up the environment, install the required packages, download the models, and import everything needed to run the notebook.

In [1]:
#@title Installing dependencies
!pip install -q trl transformers accelerate peft datasets bitsandbytes einops

In [2]:
#@title Imports & setup
from IPython.display import HTML, display, clear_output
import ipywidgets as widgets

import requests
import random
import time
import math
import numba
import numpy as np

from collections import defaultdict, Counter
from typing import List, Callable
from einops import rearrange
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig



import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

In [3]:
#@title Install Phi models and tokenizer
AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")

AutoModelForCausalLM.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True, torch_dtype=torch.bfloat16)
AutoTokenizer.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True, torch_dtype=torch.bfloat16)

torch.set_default_device('cuda')

# N-Gram Models

Starting with n-gram models will hopefully build some intuition for language modeling in general.

## Building the Trigram Model
---
Here we write all the code we'll need to ingest a corpus of text and create a representation of that text in the form of a trigram model.  The main idea behind the trigram model is fundamentally the same as with decoder-only transformer-based models: predict the next token from the tokens that came before it. Trigram models are easier to dissect, so we'll start there.
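Before the full class, here is a minimal sketch of the counting step on a toy sentence. The sentence is invented for illustration and the variable names are only assumptions that mirror the class below: every pair of consecutive words maps to a `Counter` of the words seen right after that pair.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
text = "the ship rolled and the ship pitched and the ship rolled again".split()

# Map each pair of consecutive words to a Counter of the words that follow it.
trigram_freq = defaultdict(Counter)
for first, second, third in zip(text, text[1:], text[2:]):
    trigram_freq[(first, second)][third] += 1

# Show which words followed the pair ("the", "ship") and how often.
print(dict(trigram_freq[("the", "ship")]))
```

Predicting the next word then amounts to looking up the last two words and sampling from the matching `Counter`.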

In [4]:
class TrigramModel:
    def __init__(self, url):
        self.trigram_freq = defaultdict(Counter)
        self._train(url)

    def _train(self, url):
        r = requests.get(url)
        text = r.text.lower().split()

        # Create trigrams
        for i in range(len(text) - 2):
            trigram = (text[i], text[i + 1], text[i + 2])
            self.trigram_freq[(trigram[0], trigram[1])][trigram[2]] += 1

    def _get_weighted_random_word(self, counter):
        total = sum(counter.values())
        random_choice = random.randint(1, total)

        for word, freq in counter.items():
            random_choice -= freq
            if random_choice <= 0:
                return word

    def predict(self, text, n_words):
        words = text.lower().split()
        output = words.copy()

        for _ in range(n_words):
            last_bigram = tuple(output[-2:])
            if last_bigram in self.trigram_freq:
                next_word = self._get_weighted_random_word(
                    self.trigram_freq[last_bigram]
                )
                output.append(next_word)
            else:
                break

        return " ".join(output)

    def get_frequencies_of_bigram(self, text):
        words = text.lower().split()
        bigram = tuple(words[-2:])
        return bigram, self.trigram_freq[bigram]


#### Instantiate The Model
---

Let's instantiate this trigram model with a .txt file containing the book _Billy Budd, Sailor_ by _Herman Melville_.

In [5]:
model = TrigramModel("http://gutenberg.net.au/ebooks06/0608511.txt")


---

Here we prompt this model with some starting text to see what the model thinks the next `n_words` will be.  Given the way we've tokenized the text (splitting on whitespace), the trigram model only looks at the last 2 words (in this case `the, master-at-arms`).  Then we add a small delay to make it look like a sweet streaming GPT model.

In [6]:
prompt = "as it started to sway, the master-at-arms"
n_words = 50  # Number of words ahead to predict

prediction = model.predict(prompt, n_words)
for i, letter in enumerate(prediction):
    if not i % 100: print("\n")
    print(letter, end='', flush=True)
    time.sleep(0.003)



as it started to sway, the master-at-arms and the old-fashioned sailor, the commander of the heart n

ot the less to do quite as much as he was everything that a young man if of the honest sense of fear

, his apprehension as to the gazer's professional eye it was the most important regards ceased to be

 called,

#### Dissecting one of the bigrams

---

Here we can take a look at the model's choice of words for our example bigram.  The output shows the bigram, as well as how often each word followed it in the text.

In [7]:
model.get_frequencies_of_bigram(prompt)

(('the', 'master-at-arms'),
 Counter({'of': 1,
          'was': 4,
          'has': 1,
          'in': 1,
          'noticed': 1,
          'that': 1,
          'never': 1,
          'being': 1,
          'acted': 1,
          'about': 1,
          'said.': 1,
          'said': 1,
          'as': 1,
          'and': 1}))
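These counts become next-word probabilities once they are divided by their total. A small sketch, using the counts copied from the output above (the variable names are only illustrative):

```python
from collections import Counter

# Counts copied from the output above for the bigram ('the', 'master-at-arms').
counts = Counter({'of': 1, 'was': 4, 'has': 1, 'in': 1, 'noticed': 1,
                  'that': 1, 'never': 1, 'being': 1, 'acted': 1, 'about': 1,
                  'said.': 1, 'said': 1, 'as': 1, 'and': 1})

# Normalize the counts so they sum to 1.
total = sum(counts.values())
probs = {word: freq / total for word, freq in counts.items()}

# Print the three most likely continuations.
for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{word: <10} {p:.3f}")
```

With 17 observations in total, "was" is chosen with probability 4/17 and every other word with probability 1/17, which is what `_get_weighted_random_word` samples from.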

### Confusing diagram

![overly-complicated-diagram](http://www.phon.ox.ac.uk/jcoleman/old_SLP/Lecture_6/figure7-8.png)

http://www.phon.ox.ac.uk/jcoleman/old_SLP/Lecture_6/trigram-modelling.html

# Base Transformer Model -- Phi-1.5

---

Here we'll start to look at a base transformer model to understand a little bit about how it behaves, how it's similar to a trigram model, and how we can use some tools to easily interact with these extremely large models.

### Downloading and Instantiating


---

This part is not difficult thanks to the great work at Hugging Face.  With just two lines of code we can download a 1.3-billion-parameter transformer model, map the weights onto the architecture, and load the associated configuration and tokenizer.

In [8]:
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")

### Prompting Phi 1.5


---

Here we take the same prompt and use the model's tokenizer to turn it into integers that the model can ingest.
The result is a sequence of integers, each representing a chunk of text we call a token.

In [9]:
prompt = "as it started to sway, the master-at-arms"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
print(inputs)


{'input_ids': tensor([[  292,   340,  2067,   284, 20009,    11,   262,  4958,    12,   265,
            12,  8357]], device='cuda:0')}




---

We can use the tokenizer to reverse the process and get back the strings.
You can see that the tokenizer sometimes chooses interesting places to chunk the text.
You can also see that generally speaking, more common tokens have lower integer values.

Here we show the integer values of each token and what the string representation is for each input.

In [10]:
for token_id in inputs["input_ids"][0]:
    id = token_id.item()
    token = tokenizer.decode(id)
    print(f"{id: <5} ----> {token}")

292   ----> as
340   ---->  it
2067  ---->  started
284   ---->  to
20009 ---->  sway
11    ----> ,
262   ---->  the
4958  ---->  master
12    ----> -
265   ----> at
12    ----> -
8357  ----> arms




---


Performing inference on this model is as simple as passing in the inputs from the tokenizer to the .generate method.

The tokenizer has a batch_decode method that we would generally use to get back an output text.  But in this case
we want to see each individual token and what the model output was.



In [11]:
outputs = model.generate(**inputs, max_new_tokens=11)
output_tokens = [tokenizer.decode(id) for id in outputs[0]]

In [12]:
#@title Helper function for printing token ids and tokens
def print_tokens(ids, tokens, line_size=25):
    tokens = [token.replace(" ", "·") for token in tokens]
    def chunk_list(lst, max_size):
        for i in range(0, len(lst), max_size):
            yield lst[i:i + max_size]
    id_chunks = list(chunk_list(ids, line_size))
    token_chunks = list(chunk_list(tokens, line_size))
    for ids, tokens in zip(id_chunks, token_chunks):
        max_widths = [max(len(str(id)), len(token)) for id, token in zip(ids, tokens)]
        aligned_ids = [str(id).center(max_widths[i]) for i, id in enumerate(ids)]
        aligned_arrows = ['↓'.center(max_widths[i]) for i in range(len(ids))]
        aligned_tokens = [token.center(max_widths[i]) for i, token in enumerate(tokens)]
        print(' '.join(aligned_ids))
        print(' '.join(aligned_arrows))
        print(repr(' '.join(aligned_tokens))[1:-1])
        print("\n")




### Inspecting Model Outputs


---


Here we can decode each token output from the model and map it back to the original text.  We can see that this output is much more congruous with the input prompt than the trigram model's output was.

In [13]:

print("Output:\n" + "".join(output_tokens) + "\n\nToken Mapping:")
print_tokens(outputs.cpu().tolist()[0], output_tokens)

print("\n\n* The · characters represent spaces in the token")

Output:
as it started to sway, the master-at-arms, a seasoned veteran of the battlefield, stepped forward.

Token Mapping:
292 340   2067   284 20009 11 262    4958  12 265 12 8357 11 257   29314     9298   286 262     13480     11  10764     2651   13
 ↓   ↓     ↓      ↓    ↓   ↓   ↓      ↓    ↓   ↓  ↓   ↓   ↓   ↓      ↓        ↓      ↓   ↓        ↓       ↓     ↓        ↓     ↓ 
 as ·it ·started ·to ·sway ,  ·the ·master -   at -  arms ,   ·a ·seasoned ·veteran ·of ·the ·battlefield ,  ·stepped ·forward . 




* The · characters represent spaces in the token




---

The model itself outputs a tensor of shape (..., sequence_length, vocabulary_size).  In this model, the vocabulary size is 51200.
Each token in the vocabulary is assigned a score (a logit); applying softmax to the logits at the last position gives the probability of each token being next in the sequence.
We can see the top ten next tokens for our sequence by finding the tokens with the highest values.
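As a rough sketch of that ranking step without the model, the snippet below turns a handful of logits into probabilities and keeps the top three. The five-token vocabulary and the logit values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical logits for a five-token vocabulary (not taken from Phi-1.5).
vocab = [",", " quickly", " knew", " and", " of"]
logits = [2.0, 1.5, 1.0, 0.5, -1.0]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Rank token indices by probability and keep the top three.
top_k = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)[:3]
for i in top_k:
    print(f"{vocab[i]!r: <12} {probs[i]:.4f}")
```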


In [14]:
with torch.no_grad():
  single_forward_pass = model.forward(**inputs) # perform one forward pass

print(f"Shape of outputs: {single_forward_pass.logits.shape}\n\n")

logits = single_forward_pass.logits[0, -1, :].cpu()
exp_sum = torch.exp(single_forward_pass.logits[0, -1, :].cpu()).sum().item()
top_10_token_ids = single_forward_pass.logits[0, -1, :].cpu().argsort().tolist()[-10:][::-1] # get the top 10 tokens in the vocabulary from the output tensor
top_10_tokens = [tokenizer.decode(token) for token in top_10_token_ids] # decode them back to strings
top_10_probs = (torch.exp(single_forward_pass.logits[0, -1, :].cpu()[top_10_token_ids])/exp_sum).tolist() # get their probabilities from the output of the model
top_10_probs_rounded = [round(i, 5) for i in top_10_probs]


print("Top 10 next possible tokens given our input:\n")
print("token            probability")
print("-"*28)
for token, prob in zip(top_10_tokens, top_10_probs_rounded):
    print(f"{repr(token)[1:-1]: <10} ----> {prob: >11}")

Shape of outputs: torch.Size([1, 12, 51200])


Top 10 next possible tokens given our input:

token            probability
----------------------------
,          ---->     0.04116
 quickly   ---->     0.03307
 knew      ---->     0.02873
 and       ---->     0.02575
 couldn    ---->     0.02575
 swiftly   ---->     0.02203
 skill     ---->     0.01959
 decided   ---->     0.01798
 took      ---->     0.01676
 of        ---->     0.01491


A great overview of the different sampling algorithms commonly found for LLMs can be found [here](https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc)

## Writing a custom sampler
---
Decoder-only models produce a vocabulary-length array of logits that we can transform into probabilities using a softmax function.  Softmax is defined as:

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
$$

Samplers typically use a parameter called temperature, T, to reshape the probability distribution.  Each logit is divided by T before the softmax.  With T < 1, more probable tokens become even _more_ probable and less probable tokens even _less_ probable; with T > 1 the distribution flattens, so unlikely tokens are chosen more often. The softmax function with temperature looks like this:

$$
\text{softmax}(x_i; T) = \frac{e^{x_i / T}}{\sum_{j=1}^{n} e^{x_j / T}}
$$


You can see then that the base softmax function has a "default" temperature of 1.  The following illustration shows how changing the temperature can affect the probabilities:

![temp](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*i9cXkz-TWG7-BS6CycahuQ.png)
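The same effect can be checked numerically. This is a minimal sketch with made-up logits; `softmax_with_temperature` is a helper written for this example, not part of any library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature must be positive."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up values for illustration
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the largest logit shrinks as the temperature grows, while the probabilities always sum to 1.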

In [15]:
def get_possible_tokens(prompt, temperature=1, top_p=0.9, top_k=None):
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    with torch.no_grad():
        single_forward_pass = model.forward(**inputs)
    logits = single_forward_pass.logits[0, -1, :].cpu() # get the logits from the outputs
    logits = logits - torch.max(logits) # subtract the max for numerical stability
    exp_sum = torch.exp(logits/temperature).sum().item() # find the sum of the exponentiated logits
    probs = torch.exp(logits/temperature)/exp_sum # get the probability from softmax
    sort_idx = probs.argsort().tolist()[::-1] # get the index of tokens from most probable to least probable
    probs_sorted = probs[sort_idx] # sort the probabilities
    top_k_calculated = ((torch.cumsum(probs_sorted, 0) > top_p) * 1).argmax() + 1 # count how many tokens are needed to cross the top_p threshold
    top_k = top_k or top_k_calculated # override top_k if it is passed as a kwarg
    token_ids_considered = sort_idx[:top_k] # keep the most probable tokens, up to and including the one that crosses top_p
    token_probs = (probs_sorted[:top_k] / torch.sum(probs_sorted[:top_k])).to(torch.float16)
    tokens_considered = tokenizer.batch_decode(token_ids_considered)
    return tokens_considered, token_probs


def generate_custom(prompt: str,
                    custom_sampler: Callable,
                    temperature: float=1,
                    top_p: float=0.9,
                    top_k: int=None,
                    max_new_tokens: int=20
    ):
    for _ in range(max_new_tokens):
        tokens_considered, token_probs = get_possible_tokens(prompt, temperature=temperature, top_p=top_p, top_k=top_k)
        selected_token = custom_sampler(tokens_considered, token_probs)
        prompt += selected_token
    return prompt


def try_for_certain_letter(tokens: List, token_probs: torch.Tensor, letter: str="b") -> str:
    """
    Given a list of token strings and corresponding probabilities, this function returns the first token whose
    first alphabetic character is `letter`, if any such token exists in the list. The token must also contain a
    space within its first two characters, meaning it represents the beginning of a new word. If no such token
    is found, a token is selected randomly from the list based on the given probabilities.

    Parameters:
    - tokens (list of str): A list of tokens (substrings) to search through.
    - token_probs (torch.Tensor): A tensor of probabilities corresponding to each token in `tokens`.
                                  The length of this tensor should match the length of `tokens`.
    - letter (str): The lowercase letter the selected token should start with. Defaults to "b".

    Returns:
    - str: A token string. If a token starts with `letter` and is the beginning of a new word (i.e., follows a space),
           that token is returned. Otherwise, a token is randomly selected based on `token_probs`.
    """
    for token in tokens:
        if " " in token[:2]: # only consider tokens that start a new word (a leading space)
            for char in token:
                if char.isalpha():
                    if char.lower() == letter:
                        return token
                    else:
                        break
    return np.random.choice(tokens, 1, p=token_probs.numpy())[0]


def feel_free_to_write_a_new_sampler(tokens: List, token_probs: torch.Tensor, *args, **kwargs) -> str:
    raise NotImplementedError



prompt = "As the ship started to sway, the master-at-arms"
output = generate_custom(prompt, try_for_certain_letter, temperature=1, top_p=0.9, top_k=None, max_new_tokens=50)
print(output)

# try increasing temperature > 1, or near 0

As the ship started to sway, the master-at-arms began barking orders at his crew. He barked orders about building barricades, burying bodies, and burning buildings. But despite his best efforts, the battle was taking a toll on everyone. The boat was being battered by the rough waves, and the
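The top-p (nucleus) step inside `get_possible_tokens` can be sketched without the model. The helper name, tokens, and probabilities below are all made up for illustration; the idea is to keep the smallest set of most-probable tokens whose cumulative probability crosses `top_p`, then renormalize.

```python
def top_p_filter(tokens, probs, top_p=0.9):
    """Keep the most probable tokens until their cumulative probability
    exceeds top_p, then renormalize the kept probabilities."""
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative > top_p:
            break  # the token that crosses the threshold is included
    total = sum(p for _, p in kept)
    return [(token, p / total) for token, p in kept]

# Hypothetical candidates and probabilities.
tokens = [" began", " was", " stood", " fell", " sang"]
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(tokens, probs, top_p=0.9)
print(nucleus)
```

Lowering `top_p` shrinks the candidate set, which gives a custom sampler fewer tokens to choose from.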


## Visualizing the attention weights
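The helper in the next cell computes causal scaled dot-product attention weights for one layer and head. As a rough, model-free sketch of that computation, the snippet below uses made-up 2-dimensional queries and keys. It simply skips future positions instead of adding a large negative mask, which is an assumption made for readability and leads to the same kind of weights.

```python
import math

def causal_attention_weights(queries, keys):
    """Return a matrix where row t holds the attention that position t
    pays to positions 0..t; future positions get weight 0."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for t, q in enumerate(queries):
        # Scaled dot product with every key at or before position t.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[: t + 1]]
        shift = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - t - 1)
        weights.append(row)
    return weights

# Made-up queries and keys for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention_weights(queries, keys):
    print([round(w, 3) for w in row])
```

Each row sums to 1, and the first token can only attend to itself.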

In [16]:
#@title Helper functions for getting and displaying attention weights

def get_attn_weights(inputs, layer, head):
    x = model.layers[0](**inputs)
    for i in range(1, layer):
        x = model.layers[i](x)
    x = model.layers[layer].ln(x)
    qkv = model.layers[layer].mixer.Wqkv(x)
    qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=model.layers[layer].mixer.head_dim)
    qkv = model.layers[layer].mixer.rotary_emb(qkv)
    batch_size, seqlen = qkv.shape[0], qkv.shape[1]
    q, k, v = qkv.unbind(dim=2)
    softmax_scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.einsum('bthd,bshd->bhts', q, k * softmax_scale)
    causal_mask = torch.triu(torch.full(size=(seqlen, seqlen), fill_value=-10000.0, device=scores.device), 1)
    scores = scores + causal_mask.to(dtype=scores.dtype)
    attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
    output = torch.einsum('bhts,bshd->bthd', attention, v)
    weights = attention[0, head].cpu()
    return weights


def display_attention_weights(inputs, layer, head, token_idx):
    input_tokens = [tokenizer.decode(id) for id in inputs["input_ids"][0]]
    weights = get_attn_weights(inputs, layer, head)
    with out:
        fig, ax = plt.subplots(figsize=(3, 1*(len(input_tokens)//4)))
        ax.axis('off')
        tl = len(input_tokens)
        ax.set_ylim(0, len(input_tokens))
        ax.set_xlim(0, 10)
        for i, token in enumerate(input_tokens):
            ax.text(3, len(input_tokens)-i, token, ha='right', va='top')
            ax.text(8, len(input_tokens)-i, token, ha='left', va='top')
        ax.fill_between([0, 3.3], [tl-token_idx, tl-token_idx], [tl-token_idx-0.75, tl-token_idx-0.75], color='blue', alpha=0.4)
        for i, weight in enumerate(weights[token_idx].cpu().tolist()):
            ax.fill_between([7.7, 13], [tl-i, tl-i], [tl-i-0.75, tl-i-0.75], color='blue', alpha=math.sqrt(weight)*0.7)
            ax.plot([3.35, 7.65], [tl-token_idx - 0.375, tl-i], c="blue", alpha=math.sqrt(weight)*0.7, lw=0.5)
        out.clear_output()
        plt.show()


def handler(_):
    display_attention_weights(inputs, layer.value, head.value, token_idx.value)


In [17]:
#@title Select the Layer, Attention Head, and Token to view the attention weights
layer = widgets.Dropdown(options=list(range(1, 24)), description="Layer")
head = widgets.Dropdown(options=list(range(0, 32)), description="Attn Head:")
token_idx = widgets.Dropdown(options=list(zip([tokenizer.decode(id) for id in inputs["input_ids"][0]], list(range(len(inputs["input_ids"][0]))))), description="Token:")
button = widgets.Button(description="Plot")
button.on_click(handler)

out = widgets.Output()

display(layer, head, token_idx, button)
display(out)

Dropdown(description='Layer', options=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, …

Dropdown(description='Attn Head:', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …

Dropdown(description='Token:', options=(('as', 0), (' it', 1), (' started', 2), (' to', 3), (' sway', 4), (','…

Button(description='Plot', style=ButtonStyle())

Output()

## Viewing the embedding dimension values across layers

In the following visualization, you can see the embedding values being modified and adjusted by each layer.  Here we only show the first five heads' worth of the embedding dimension (64*5 = 320 values).  The red dashed lines mark the boundaries between the slices that each head attends to.

In [18]:
#@title Code to generate the following visualization
# get the number of total heads and the head dimension
_ = """
n_head = model.layers[1].mixer.n_head
head_dim = model.layers[1].mixer.head_dim

# get a color palette and shuffle it
pal = sns.cubehelix_palette(5, rot=-.6, gamma=0.7, hue=0.7)
random.shuffle(pal)

# initialize the plot
fig, ax = plt.subplots()
_ = ax.set_ylim([-2, 2])
_ = ax.set_xlabel("Embedding dimension position")
_ = ax.set_ylabel("Embedding value")
sns.despine()

# calculate the x_values, y_values and colors for each bar initially
x_vals = range(0, 5*head_dim)
y_vals = [x[i*head_dim:i*head_dim+head_dim] for i in range(5)]
heights = [i for head in y_vals for i in head]
colors = [color for head in [[pal[i]]*head_dim for i in range(5)] for color in head]

# add the bars, text, and dashed lines
bars = plt.bar(x_vals, heights, color=colors)
text_label = ax.text(0.7, 0.9, '', transform=ax.transAxes)
for x_val in range(head_dim, 5*64, 64):
    _ = ax.axvline(x=x_val, color='r', linestyle='--', lw=0.5)


prev_heights = None
interpolation_steps = 10

def update(frame):
    global prev_heights

    real_frame = frame // interpolation_steps
    interp = frame % interpolation_steps / interpolation_steps

    x = model.layers[0](**inputs)
    for i in range(1, real_frame):
        x = model.layers[i](x)
    x = model.layers[layer].ln(x)
    x = model.layers[layer].mixer(x)
    x = x[0, -1].tolist()
    y_vals = [x[i * head_dim:i * head_dim + head_dim] for i in range(5)]
    heights = [i for head in y_vals for i in head]

    if prev_heights is not None:
        # Perform the linear interpolation between the previous and current frame.
        heights = [(1 - interp) * prev + interp * curr for prev, curr in zip(prev_heights, heights)]


    for i, bar in enumerate(bars):
        bar.set_height(heights[i])

    # Update the text label
    text_label.set_text(f'Model Layer: {real_frame + 1}')

    # Store the current heights for the next frame
    prev_heights = heights

    return bars

"""



# uncomment the following lines to run this cell
# ani = animation.FuncAnimation(fig, update, frames=24 * interpolation_steps, blit=True, interval=400//interpolation_steps)
# HTML(ani.to_html5_video())
# ani.save('animation.mp4', writer='ffmpeg', fps=30)
# from google.colab import files
# files.download('animation.mp4')



In [19]:
HTML(f"""<video src=https://github.com/tim-a-davis/silly_little_language_modeling_thing_at_utd/raw/main/animation.mp4 width=700 controls loop/>""")

## Tear Down

In [20]:
del model
del outputs
torch.cuda.empty_cache()

# Chat-tuned Models

In [21]:
model = AutoModelForCausalLM.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True, torch_dtype=torch.bfloat16)

In [22]:
sysprompt = "You are an assistant that gives helpful answers\n"
inputs = tokenizer(f"{sysprompt}USER: What is a master-at-arms?  Give me a short answer.\nASSISTANT:", return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=120, temperature=0.7, do_sample=True, use_cache=True, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]

print("\n\nOutput:\n\n", text)




The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




Output:

 You are an assistant that gives helpful answers
USER: What is a master-at-arms?  Give me a short answer.
ASSISTANT: A master-at-arms is someone who has been specially trained to use weapons or armor for defensive purposes, often in combat situations. In simpler terms, they're like the "goose" of the battlefield – always vigilant and ready to protect their wielder from harm<|endoftext|>
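The decoded text echoes the whole prompt and ends with an end-of-text marker. A small sketch of building the prompt and trimming the reply follows; `build_prompt` and `extract_reply` are hypothetical helpers written for this example, and the sketch assumes the decoded text starts with the exact prompt string, as it does in the output above.

```python
def build_prompt(system: str, user_message: str) -> str:
    """Assemble the single-turn prompt layout used in the cell above."""
    return f"{system}USER: {user_message}\nASSISTANT:"

def extract_reply(generated: str, prompt: str, eos: str = "<|endoftext|>") -> str:
    """Drop the echoed prompt and the end-of-text marker from decoded output."""
    reply = generated[len(prompt):] if generated.startswith(prompt) else generated
    return reply.split(eos)[0].strip()

sysprompt = "You are an assistant that gives helpful answers\n"
prompt = build_prompt(sysprompt, "What is a master-at-arms?  Give me a short answer.")

# Placeholder standing in for real model output.
decoded = prompt + " (model reply here)<|endoftext|>"
print(extract_reply(decoded, prompt))
```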
