<a href="https://colab.research.google.com/github/tim-a-davis/silly_little_language_modeling_thing_at_utd/blob/main/CurtGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table>
<tr>
<td style="width: 10%;">

# CurtGPT
Using Microsoft's Phi 1.5 model like it was never intended.

</td>
<td style="text-align: center;">
<img src="https://github.com/tim-a-davis/silly_little_language_modeling_thing_at_utd/blob/main/curtgpt%20logo.png?raw=true" width="300" height="auto">
</td>
</tr>
</table>


# Setup, Installs, Imports

Set up the environment, install the required packages, and import everything needed to run the notebook.

In [4]:
#@title Installing dependencies
!pip install -q trl transformers accelerate peft datasets bitsandbytes einops


In [33]:
#@title Imports & setup
from IPython.display import HTML, display, clear_output
import ipywidgets as widgets

import requests
import random
import time
import math
import numba
import numpy as np
import itertools

from collections import defaultdict, Counter
from typing import List, Callable, Dict
from pprint import pprint
from einops import rearrange
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    GenerationConfig,
    BitsAndBytesConfig,
    TrainingArguments
)
from trl import DPOTrainer
from peft import AutoPeftModelForCausalLM, LoraConfig
from datasets import load_dataset, Dataset

import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

In [1]:
!git clone https://huggingface.co/microsoft/phi-1_5
!git clone https://huggingface.co/teknium/Puffin-Phi-v2
!git clone https://huggingface.co/tim-d/CurtGPT

fatal: destination path 'phi-1_5' already exists and is not an empty directory.
fatal: destination path 'Puffin-Phi-v2' already exists and is not an empty directory.
fatal: destination path 'CurtGPT' already exists and is not an empty directory.


# N-Gram Models

Starting with n-gram models will hopefully build some intuition about language modeling in general.

## Building the Trigram Model
---
Here we write all the code we'll need to ingest a corpus of text and create a representation of that text in the form of a trigram model.  The main idea behind the trigram model (predict the next token from the tokens that came before it) is fundamentally the same as with decoder-only transformer-based models, but trigram models are easier to dissect, so we'll start there.

In [None]:
class TrigramModel:
    def __init__(self, url):
        self.trigram_freq = defaultdict(Counter)
        self._train(url)

    def _train(self, url):
        r = requests.get(url)
        text = r.text.lower().split()

        # Create trigrams
        for i in range(len(text) - 2):
            trigram = (text[i], text[i + 1], text[i + 2])
            self.trigram_freq[(trigram[0], trigram[1])][trigram[2]] += 1

    def _get_weighted_random_word(self, counter):
        total = sum(counter.values())
        random_choice = random.randint(1, total)

        for word, freq in counter.items():
            random_choice -= freq
            if random_choice <= 0:
                return word

    def predict(self, text, n_words):
        words = text.lower().split()
        output = words.copy()

        for _ in range(n_words):
            last_bigram = tuple(output[-2:])
            if last_bigram in self.trigram_freq:
                next_word = self._get_weighted_random_word(
                    self.trigram_freq[last_bigram]
                )
                output.append(next_word)
            else:
                break

        return " ".join(output)

    def get_frequencies_of_bigram(self, text):
        words = text.lower().split()
        bigram = tuple(words[-2:])
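        # use .get so that looking up an unseen bigram doesn't insert an empty Counter into the defaultdict
        return bigram, self.trigram_freq.get(bigram, Counter())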


#### Instantiate The Model
---

Let's instantiate this trigram model with a .txt file containing the book _Billy Budd, Sailor_ by _Herman Melville_.

In [None]:
model = TrigramModel("http://gutenberg.net.au/ebooks06/0608511.txt")


---

Here we prompt this model with some starting text to see what the model thinks the next n_words will be.  Given the way we've tokenized the text (splitting on whitespace), the trigram model will only be looking at the last 2 words (in this case `the` and `master-at-arms`).  Then we add a small delay to make it look like a sweet streaming GPT model.

In [None]:
prompt = "as it started to sway, the master-at-arms"
n_words = 50  # Number of words ahead to predict

prediction = model.predict(prompt, n_words)
for i, letter in enumerate(prediction):
    if not i % 100: print("\n")
    print(letter, end='', flush=True)
    time.sleep(0.003)



as it started to sway, the master-at-arms and the old-fashioned sailor, the commander of the heart n

ot the less to do quite as much as he was everything that a young man if of the honest sense of fear

, his apprehension as to the gazer's professional eye it was the most important regards ceased to be

 called,

#### Dissecting one of the bigrams

---

Here we can take a look at the model's choice of words for our example bigram.  The output shows the bigram, as well as how often each word followed it in the text.

In [None]:
model.get_frequencies_of_bigram(prompt)

(('the', 'master-at-arms'),
 Counter({'of': 1,
          'was': 4,
          'has': 1,
          'in': 1,
          'noticed': 1,
          'that': 1,
          'never': 1,
          'being': 1,
          'acted': 1,
          'about': 1,
          'said.': 1,
          'said': 1,
          'as': 1,
          'and': 1}))
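
These raw counts become a probability distribution once they're normalized, which is the same kind of object the transformer will hand us later. A minimal sketch, with the counts copied by hand from the output above (the variable names are just for illustration):

```python
from collections import Counter

# Counts copied from the output above for the bigram ('the', 'master-at-arms')
counts = Counter({
    "of": 1, "was": 4, "has": 1, "in": 1, "noticed": 1, "that": 1, "never": 1,
    "being": 1, "acted": 1, "about": 1, "said.": 1, "said": 1, "as": 1, "and": 1,
})

total = sum(counts.values())
# Relative frequency of each next word = estimated P(next word | bigram)
probs = {word: freq / total for word, freq in counts.items()}

# Show the three most likely next words
for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{word: <10} {p:.3f}")
```

Notice that `said.` and `said` are counted as different words. That's a side effect of splitting on whitespace, and one reason real tokenizers work on smaller chunks of text.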

### Confusing diagram

![overly-complicated-diagram](http://www.phon.ox.ac.uk/jcoleman/old_SLP/Lecture_6/figure7-8.png)

http://www.phon.ox.ac.uk/jcoleman/old_SLP/Lecture_6/trigram-modelling.html

# Base Transformer Model -- Phi-1.5

---

Here we'll start to look at a base transformer model to understand a little bit about how it behaves, how it's similar to a trigram model, and how we can use some tools to easily interact with these extremely large models.

### Downloading and Instantiating


---

This part is not difficult thanks to the great work at Hugging Face.  With just two lines of code we can load a 1.3 billion parameter transformer model, map the weights onto the architecture, and load the associated configuration and tokenizer.

In [10]:
model = AutoModelForCausalLM.from_pretrained("./phi-1_5", trust_remote_code=True, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./phi-1_5", trust_remote_code=True)  # torch_dtype and device_map only apply to the model, not the tokenizer

### Prompting Phi 1.5


---

Here we take our same prompt, and we use the model's tokenizer to turn it into integers that the model can ingest.
The result will be a tensor of integers that represent chunks of text we call tokens.

In [11]:
prompt = "as it started to sway, the master-at-arms"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)
print(inputs)


{'input_ids': tensor([[  292,   340,  2067,   284, 20009,    11,   262,  4958,    12,   265,
            12,  8357]], device='cuda:0')}




---

We can use the tokenizer to reverse the process and get back the strings.
You can see that the tokenizer sometimes chooses interesting places to chunk the text.
You can also see that generally speaking, more common tokens have lower integer values.

Here we show the integer values of each token and what the string representation is for each input.

In [12]:
for token_id in inputs["input_ids"][0]:
    id = token_id.item()
    token = tokenizer.decode(id)
    print(f"{id: <5} ----> {token}")

292   ----> as
340   ---->  it
2067  ---->  started
284   ---->  to
20009 ---->  sway
11    ----> ,
262   ---->  the
4958  ---->  master
12    ----> -
265   ----> at
12    ----> -
8357  ----> arms




---


Performing inference on this model is as simple as passing in the inputs from the tokenizer to the .generate method.

The tokenizer has a batch_decode method that we would generally use to get back an output text.  But in this case
we want to see each individual token and what the model output was.



In [20]:
outputs = model.generate(**inputs, max_new_tokens=11)
output_tokens = tokenizer.batch_decode(outputs[0]) # adding [0] yields a list of tokens rather than a list of batches

In [14]:
#@title Helper function for printing token ids and tokens
def print_tokens(ids, tokens, line_size=18):
    tokens = [token.replace(" ", "·") for token in tokens]
    def chunk_list(lst, max_size):
        for i in range(0, len(lst), max_size):
            yield lst[i:i + max_size]
    id_chunks = list(chunk_list(ids, line_size))
    token_chunks = list(chunk_list(tokens, line_size))
    for ids, tokens in zip(id_chunks, token_chunks):
        max_widths = [max(len(str(id)), len(token)) for id, token in zip(ids, tokens)]
        aligned_ids = [str(id).center(max_widths[i]) for i, id in enumerate(ids)]
        aligned_arrows = ['↓'.center(max_widths[i]) for i in range(len(ids))]
        aligned_tokens = [token.center(max_widths[i]) for i, token in enumerate(tokens)]
        print(' '.join(aligned_ids))
        print(' '.join(aligned_arrows))
        print(repr(' '.join(aligned_tokens))[1:-1])
        print("\n")




### Inspecting Model Outputs


---


Here we can decode each token output from the model and map it back to the original text.  We can see that this output is much more congruous with the input prompt than the trigram model's was.

In [21]:

print("Output:\n" + "".join(output_tokens) + "\n\nToken Mapping:")
print_tokens(outputs.cpu().tolist()[0], output_tokens)

print("\n\n* The · characters represent spaces in the token")

Output:
as it started to sway, the master-at-arms, a seasoned veteran of the battlefield, stepped forward.

Token Mapping:
292 340   2067   284 20009 11 262    4958  12 265 12 8357 11 257   29314     9298   286 262 
 ↓   ↓     ↓      ↓    ↓   ↓   ↓      ↓    ↓   ↓  ↓   ↓   ↓   ↓      ↓        ↓      ↓   ↓  
 as ·it ·started ·to ·sway ,  ·the ·master -   at -  arms ,   ·a ·seasoned ·veteran ·of ·the


   13480     11  10764     2651   13
     ↓       ↓     ↓        ↓     ↓ 
·battlefield ,  ·stepped ·forward . 




* The · characters represent spaces in the token




---

The model itself outputs a tensor of size (..., sequence_length, vocabulary_size).  In this model, the vocab size is 51200.
Each token in the vocabulary is assigned a score (a logit); passing the logits through a softmax turns them into the probability of each token being next in the sequence.
We can see for our sequence what the top ten next tokens were by finding the tokens with the highest values.


In [22]:
with torch.no_grad():
  single_forward_pass = model.forward(**inputs) # perform one forward pass

print(f"Shape of outputs: {single_forward_pass.logits.shape}\n\n")

logits = single_forward_pass.logits[0, -1, :].float().cpu() # logits for the last position only
probs = torch.softmax(logits, dim=-1) # turn the logits into probabilities
top_10_probs, top_10_token_ids = torch.topk(probs, k=10) # the ten most probable next tokens, highest first
top_10_tokens = [tokenizer.decode(token) for token in top_10_token_ids.tolist()] # decode them back to strings
top_10_probs_rounded = [round(i, 5) for i in top_10_probs.tolist()]


print("Top 10 next possible tokens given our input:\n")
print("token            probability")
print("-"*28)
for token, prob in zip(top_10_tokens, top_10_probs_rounded):
    print(f"{repr(token)[1:-1]: <10} ----> {prob: >11}")

Shape of outputs: torch.Size([1, 12, 51200])


Top 10 next possible tokens given our input:

token            probability
----------------------------
,          ---->     0.04119
 quickly   ---->      0.0331
 knew      ---->     0.02875
 couldn    ---->     0.02577
 and       ---->     0.02538
 swiftly   ---->     0.02205
 skill     ---->     0.01961
 decided   ---->     0.01799
 took      ---->     0.01677
 of        ---->     0.01492


A great overview of the different sampling algorithms commonly found for LLMs can be found [here](https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc)
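
One of the strategies covered there, nucleus (top-p) sampling, is also what the sampler in the next section relies on. Here's a minimal pure-Python sketch of the idea on a made-up distribution; `top_p_filter` and `toy_probs` are hypothetical names, not part of any library:

```python
def top_p_filter(probs, top_p=0.9):
    """Return (index, renormalized probability) pairs for the nucleus."""
    # Sort token indices from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > top_p:  # stop once the threshold is crossed
            break
    # Renormalize so the kept probabilities sum to 1 again
    return [(i, probs[i] / cumulative) for i in kept]


toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # made-up distribution over 5 tokens
nucleus = top_p_filter(toy_probs, top_p=0.85)
print(nucleus)
```

With `top_p=0.85` the first three tokens are kept, because the running total only crosses the threshold once the third token is added. The long tail of unlikely tokens is never sampled.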

## Writing a custom sampler
---
Decoder-only models produce a vocabulary-length array of logits that we can transform into probabilities using a softmax function.  Softmax is defined as:

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
$$

Typically, to control how adventurous the sampling is, samplers use a parameter called temperature to reshape the distribution.  Each logit is divided by T before the softmax.  For T < 1 this makes more probable tokens even _more_ probable and less probable tokens even _less_ probable; for T > 1 it does the opposite and flattens the distribution.  The softmax function with temperature looks like this:

$$
\text{softmax}(x_i; T) = \frac{e^{x_i / T}}{\sum_{j=1}^{n} e^{x_j / T}}
$$


You can see then that the base softmax function has a "default" temperature of 1.  The following illustration shows how changing the temperature can affect the probabilities:

![temp](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*i9cXkz-TWG7-BS6CycahuQ.png)
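
To get a feel for the numbers, here is a small pure-Python sketch of the temperature softmax applied to three made-up logits. The function name and the toy values are assumptions for illustration only:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    # Subtracting the max logit doesn't change the result,
    # but it keeps exp() from overflowing
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more of the probability mass ends up on the top token; the higher the temperature, the closer the distribution gets to uniform.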

In [24]:
def get_possible_tokens(prompt, temperature=1, top_p=0.9, top_k=None):
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)
    with torch.no_grad():
        single_forward_pass = model.forward(**inputs)
    logits = single_forward_pass.logits[0, -1, :].cpu() # get the logits from the outputs
    logits = logits - torch.max(logits) # subtract the max for numerical stability
    exp_sum = torch.exp(logits/temperature).sum().item() # find the sum of the exponentiated logits
    probs = torch.exp(logits/temperature)/exp_sum # get the probability from softmax
    sort_idx = probs.argsort().tolist()[::-1] # get the index of tokens from most probable to least probable
    probs_sorted = probs[sort_idx] # sort the probabilities
    top_k_calculated = ((torch.cumsum(probs_sorted, 0) > top_p) * 1).argmax().item() + 1 # number of tokens needed to cross the top_p threshold
    top_k = top_k or top_k_calculated # override top_k if it is passed as a kwarg
    token_ids_considered = sort_idx[:top_k] # keep the most probable tokens, up to and including the one that crosses top_p
    token_probs = (probs_sorted[:top_k] / torch.sum(probs_sorted[:top_k])).to(torch.float16) # renormalize so the kept probabilities sum to 1
    tokens_considered = tokenizer.batch_decode(token_ids_considered)
    return tokens_considered, token_probs


def generate_custom(prompt: str,
                    custom_sampler: Callable,
                    temperature: float=1,
                    top_p: float=0.9,
                    top_k: int=None,
                    max_new_tokens: int=20,
                    **kwargs
    ):
    for _ in range(max_new_tokens):
        tokens_considered, token_probs = get_possible_tokens(prompt, temperature=temperature, top_p=top_p, top_k=top_k)
        selected_token = custom_sampler(tokens_considered, token_probs, **kwargs)
        prompt += selected_token
    return prompt


def try_for_certain_letter(tokens: List, token_probs: torch.Tensor, letter: str="b") -> str:
    """
    Given a list of token strings and corresponding probabilities, this function returns a token that starts
    with the letter -letter- if any such token exists in the list. The token must also follow a space, meaning it should
    represent the beginning of a new word. If no such token is found, a token is selected randomly from the list
    based on the given probabilities.

    Parameters:
    - tokens (list of str): A list of tokens (substrings) to search through.
    - token_probs (torch.Tensor): A tensor of probabilities corresponding to each token in `tokens`.
                                  The length of this tensor should match the length of `tokens`.
    - letter (str): The lowercase letter that the selected token should start with. Defaults to "b".

    Returns:
    - str: A token string. If a token starts with the letter -letter- and is the beginning of a new word (i.e., follows a space),
           that token is returned. Otherwise, a token is randomly selected based on `token_probs`.
    """
    for token in tokens:
        if " " in token[:2]: # we only want to select tokens that start a new word (after a space)
            for char in token:
                if char.isalpha():
                    if char.lower() == letter:
                        return token
                    else:
                        break
    return np.random.choice(tokens, 1, p=token_probs.numpy())[0]


def feel_free_to_write_a_new_sampler(tokens: List, token_probs: torch.Tensor, *args, **kwargs) -> str:
    raise NotImplementedError



prompt = "As the ship started to sway, the master-at-arms"
output = generate_custom(prompt, try_for_certain_letter, temperature=1, top_p=0.95, top_k=None, max_new_tokens=50, letter="b")
print(output)

# try increasing temperature > 1, or near 0

As the ship started to sway, the master-at-arms began barking orders to the brave but battered boys. But before they knew it, a bolt of lightning struck a nearby building, setting the boat ablaze. The boys barely managed to escape before being blown back by the boat's blaring cannons. But
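
If you want a starting point for your own sampler, here is one possible sketch: a sampler that refuses to pick tokens containing a given letter. `avoid_letter` is a hypothetical example, written against the same `(tokens, token_probs, **kwargs)` interface as the samplers above, and the toy inputs stand in for the output of `get_possible_tokens`:

```python
import random
from typing import List, Sequence


def avoid_letter(tokens: List[str], token_probs: Sequence[float], letter: str = "e") -> str:
    """Hypothetical sampler: sample only from tokens that don't contain `letter`."""
    weights = [float(p) for p in token_probs]
    allowed = [(tok, w) for tok, w in zip(tokens, weights) if letter not in tok.lower()]
    if not allowed:  # every candidate contains the letter, so fall back to all of them
        allowed = list(zip(tokens, weights))
    candidates, candidate_weights = zip(*allowed)
    # random.choices treats the weights as relative, so no renormalizing is needed
    return random.choices(candidates, weights=candidate_weights, k=1)[0]


# Toy candidates standing in for the output of get_possible_tokens
toy_tokens = [" began", " quickly", " knew", " took"]
toy_probs = [0.4, 0.3, 0.2, 0.1]
print(avoid_letter(toy_tokens, toy_probs, letter="e"))
```

It could then be passed to the generation loop in the same way as before, for example `generate_custom(prompt, avoid_letter, letter="e")`.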


### Can't you just use prompt engineering?
Here we try to add in "use as many b-words as possible" in the prompt to see if we can get the model to do what we want.  Of course, this model is not tuned to follow instructions at all, so this won't work.

In [28]:
prompt = "Use as many B-words as possible: As it started to sway, the master-at-arms"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.5)
output_tokens = tokenizer.batch_decode(outputs)
print(output_tokens[0])

Use as many B-words as possible: As it started to sway, the master-at-arms decided to use it as a weapon. He hoped that the B-word would hit his enemy and win the battle. He felt a surge of adrenaline and fear as he aimed his blaster at the enemy. He pulled the trigger and watched the B-


In [30]:
prompt = "Can you tell me what a master-at-arms is?"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.5)
output_tokens = tokenizer.batch_decode(outputs)
print(output_tokens[0])

Can you tell me what a master-at-arms is?

John: Sure. A master-at-arms is a type of gun that is designed for close combat. It's usually smaller and lighter than a regular gun, and it's used by soldiers who need to be able to move quickly and


## Visualizing the attention weights

In [34]:
#@title Helper functions for getting and displaying attention weights

def get_attn_weights(inputs, layer, head):
    x = model.layers[0](**inputs)
    for i in range(1, layer):
        x = model.layers[i](x)
    x = model.layers[layer].ln(x)
    qkv = model.layers[layer].mixer.Wqkv(x)
    qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=model.layers[layer].mixer.head_dim)
    qkv = model.layers[layer].mixer.rotary_emb(qkv)
    batch_size, seqlen = qkv.shape[0], qkv.shape[1]
    q, k, v = qkv.unbind(dim=2)
    softmax_scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.einsum('bthd,bshd->bhts', q, k * softmax_scale)
    causal_mask = torch.triu(torch.full(size=(seqlen, seqlen), fill_value=-10000.0, device=scores.device), 1)
    scores = scores + causal_mask.to(dtype=scores.dtype)
    attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
    output = torch.einsum('bhts,bshd->bthd', attention, v)
    weights = attention[0, head].cpu()
    return weights


def display_attention_weights(inputs, layer, head, token_idx):
    input_tokens = [tokenizer.decode(id) for id in inputs["input_ids"][0]]
    weights = get_attn_weights(inputs, layer, head)
    with out:
        fig, ax = plt.subplots(figsize=(3, 1*(len(input_tokens)//4)))
        ax.axis('off')
        tl = len(input_tokens)
        ax.set_ylim(0, len(input_tokens))
        ax.set_xlim(0, 10)
        for i, token in enumerate(input_tokens):
            ax.text(3, len(input_tokens)-i, token, ha='right', va='top')
            ax.text(8, len(input_tokens)-i, token, ha='left', va='top')
        ax.fill_between([0, 3.3], [tl-token_idx, tl-token_idx], [tl-token_idx-0.75, tl-token_idx-0.75], color='blue', alpha=0.4)
        for i, weight in enumerate(weights[token_idx].cpu().tolist()):
            ax.fill_between([7.7, 13], [tl-i, tl-i], [tl-i-0.75, tl-i-0.75], color='blue', alpha=math.sqrt(weight)*0.7)
            ax.plot([3.35, 7.65], [tl-token_idx - 0.375, tl-i], c="blue", alpha=math.sqrt(weight)*0.7, lw=0.5)
        out.clear_output()
        plt.show()


def handler(_):
    display_attention_weights(inputs, layer.value, head.value, token_idx.value)


In [35]:
#@title Select the Layer, Attention Head, and Token to view the attention weights
layer = widgets.Dropdown(options=list(range(1, 24)), description="Layer")
head = widgets.Dropdown(options=list(range(0, 32)), description="Attn Head:")
token_idx = widgets.Dropdown(options=list(zip([tokenizer.decode(id) for id in inputs["input_ids"][0]], list(range(len(inputs["input_ids"][0]))))), description="Token:")
button = widgets.Button(description="Plot")
button.on_click(handler)

out = widgets.Output()

display(layer, head, token_idx, button)
display(out)

Dropdown(description='Layer', options=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, …

Dropdown(description='Attn Head:', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …

Dropdown(description='Token:', options=(('Can', 0), (' you', 1), (' tell', 2), (' me', 3), (' what', 4), (' a'…

Button(description='Plot', style=ButtonStyle())

Output()
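
Stripped of the model-specific details, the helper above computes scaled dot-product attention with a causal mask. Here is a toy pure-Python sketch of that core step for a single head. It is a simplification under stated assumptions: it skips the rotary embeddings, and it excludes future positions outright instead of adding a large negative number to their scores. The function name and toy vectors are made up:

```python
import math


def causal_attention_weights(q, k):
    """q, k: lists of per-token vectors (seq_len x head_dim) for one head."""
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for t, q_t in enumerate(q):
        # Token t may only look at positions 0..t (the causal mask)
        scores = [scale * sum(a * b for a, b in zip(q_t, k[s])) for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        # Masked (future) positions get weight 0
        weights.append([e / total for e in exps] + [0.0] * (len(q) - t - 1))
    return weights


toy_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up queries for 3 tokens
toy_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up keys for 3 tokens
for row in causal_attention_weights(toy_q, toy_k):
    print([round(w, 3) for w in row])
```

Each row is what the plot draws for one selected token: a set of weights over the current and earlier tokens that sums to 1.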

## Viewing the embedding dimension values across layers

In the following visualization, you can see the embedding dimensions being modified and adjusted over each layer.  Here we are only showing the first 5 heads' worth of embedding dimensions (64*5 = 320).  The red dashed lines mark the boundaries between the slices of the embedding that each head attends to.

In [36]:
#@title Code to generate the following visualization
# get the number of total heads and the head dimension
_ = """
n_head = model.layers[1].mixer.n_head
head_dim = model.layers[1].mixer.head_dim

# get a color palette and shuffle it
pal = sns.cubehelix_palette(5, rot=-.6, gamma=0.7, hue=0.7)
random.shuffle(pal)

# initialize the plot
fig, ax = plt.subplots()
_ = ax.set_ylim([-2, 2])
_ = ax.set_xlabel("Embedding dimension position")
_ = ax.set_ylabel("Embedding value")
sns.despine()

# calculate the x_values, y_values and colors for each bar initially
x_vals = range(0, 5*head_dim)
y_vals = [x[i*head_dim:i*head_dim+head_dim] for i in range(5)]
heights = [i for head in y_vals for i in head]
colors = [color for head in [[pal[i]]*head_dim for i in range(5)] for color in head]

# add the bars, text, and dashed lines
bars = plt.bar(x_vals, heights, color=colors)
text_label = ax.text(0.7, 0.9, '', transform=ax.transAxes)
for x_val in range(head_dim, 5*64, 64):
    _ = ax.axvline(x=x_val, color='r', linestyle='--', lw=0.5)


prev_heights = None
interpolation_steps = 10

def update(frame):
    global prev_heights

    real_frame = frame // interpolation_steps
    interp = frame % interpolation_steps / interpolation_steps

    x = model.layers[0](**inputs)
    for i in range(1, real_frame):
        x = model.layers[i](x)
    x = model.layers[layer].ln(x)
    x = model.layers[layer].mixer(x)
    x = x[0, -1].tolist()
    y_vals = [x[i * head_dim:i * head_dim + head_dim] for i in range(5)]
    heights = [i for head in y_vals for i in head]

    if prev_heights is not None:
        # Perform the linear interpolation between the previous and current frame.
        heights = [(1 - interp) * prev + interp * curr for prev, curr in zip(prev_heights, heights)]


    for i, bar in enumerate(bars):
        bar.set_height(heights[i])

    # Update the text label
    text_label.set_text(f'Model Layer: {real_frame + 1}')

    # Store the current heights for the next frame
    prev_heights = heights

    return bars

"""



# uncomment the following lines to run this cell
# ani = animation.FuncAnimation(fig, update, frames=24 * interpolation_steps, blit=True, interval=400//interpolation_steps)
# HTML(ani.to_html5_video())
# ani.save('animation.mp4', writer='ffmpeg', fps=30)
# from google.colab import files
# files.download('animation.mp4')



In [37]:
HTML(f"""<video src=https://github.com/tim-a-davis/silly_little_language_modeling_thing_at_utd/raw/main/animation.mp4 width=700 controls loop/>""")

## Tear Down

In [None]:
del model
del outputs
torch.cuda.empty_cache()

# Chat-tuned Models

In [6]:
model = AutoModelForCausalLM.from_pretrained("./Puffin-Phi-v2", trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./Puffin-Phi-v2", trust_remote_code=True)


A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-1_5:
- configuration_mixformer_sequential.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-1_5:
- modeling_mixformer_sequential.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


OSError: ignored

In [None]:
inputs = tokenizer(f"USER: why do I hate computers so much\nASSISTANT:", return_tensors="pt", return_attention_mask=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.001, do_sample=True, use_cache=True, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]

print("\n\nOutput:\n\n", text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




Output:

 USER: why are computers so annoying
ASSISTANT: Well, there's a lot going on inside them that we can't always see or understand. Computers process information very quickly and follow instructions given to them by programmers. This allows us to do many things like communicate with others, access the internet, play games, and create documents, among other tasks. However, this rapid processing also means that sometimes errors happen, which is called a bug or an "issue." When these bugs cause problems for users, it becomes frustrating and inconvenient.<|endoftext|>
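
The prompt above follows a simple `USER:` / `ASSISTANT:` turn format. If you want to try multi-turn conversations, a small helper along these lines can build the string for you. `build_chat_prompt` is a hypothetical convenience function, and the format is inferred from the prompt used in this notebook, not taken from the model's documentation:

```python
def build_chat_prompt(turns):
    """turns: list of (user_message, assistant_reply_or_None) pairs."""
    lines = []
    for user_msg, assistant_msg in turns:
        lines.append(f"USER: {user_msg}")
        # Leave the last ASSISTANT turn open so the model completes it
        lines.append("ASSISTANT:" if assistant_msg is None else f"ASSISTANT: {assistant_msg}")
    return "\n".join(lines)


single_turn = build_chat_prompt([("why do I hate computers so much", None)])
print(single_turn)
```

The resulting string can be passed to the tokenizer exactly like the hand-written prompt.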


In [None]:
inputs

{'input_ids': tensor([[29904,    25, 13786,   502,   534,  4459,    11,   644,   318,  1365,
            11, 10912, 13135,   393,  7545,    30,   198, 10705,  8808,  8643,
            25]])}

In [None]:
print(tokenizer.special_tokens_map)

{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}


{'input_ids': tensor([[ 1639,   389,   281,  8796,   326,  3607,  7613,  7429,   198, 29904,
            25,  1867,   318,   257,  4958,    12,   265,    12,  8357,    30,
         50286, 23318,   502,   257,  1790,  3280,    13,   198, 10705,  8808,
          8643,    25]], device='cuda:0')}

# Using DPO conditioning
---
We've finally gotten around to doing the fine-tuning thing.  Here we pull in our dataset, do some small transformations to get it to work with the library we're using, configure our tuning, train the model, and push it to Hugging Face.

### The Dataset
You can find the full dataset [here](https://huggingface.co/datasets/pvduy/rm_hh_helpful_only?p=2).  This dataset is a subset of Anthropic's Helpful & Harmless dataset, but with the harmless part taken out (if you so choose, you can go look at the full helpful and harmless dataset; viewer discretion is advised). The dataset consists of prompts, which in this case are bits of multi-turn conversations between a user and an assistant, and two outputs.  The idea is that one output is desirable (helpful) and one is less desirable.  In order to make our model _less_ helpful, we'll simply reverse these columns.

In [None]:
dataset = load_dataset(
    "pvduy/rm_hh_helpful_only",
    split="train[:60000]",
)

#### A sample from the dataset
Here we can see the format of the data, which consists of a prompt and two possible continuations of the conversation, with one being more desirable than the other.

In [None]:
pprint(dataset[3500])

{'prompt': 'USER: How do you make chocolate milk? ASSISTANT: I would start by '
           'pouring some chocolate syrup and milk in a mug.  If you want to '
           'add a spoonful of cocoa powder or a few drops of peppermint '
           'extract, now is a good time.  Now if you’re like me, you’ll go for '
           'a frothy top.  To make a whimsical layer of foam, add the milk '
           'to</s>USER: Would a cinnamon stick add to the flavor at all? '
           'ASSISTANT:',
 'rejected': ' I don’t think so.',
 'selected': ' I don’t think so.  Chocolate is the dominant flavor.  However, '
             'if you are talking about adding a little flavor to the milk '
             'itself,'}


In [None]:
def fix_formatting_to_match_puffin(example):
    prompt = example["prompt"]
    prompt = prompt.replace(" ASSISTANT", "\nASSISTANT")
    prompt = prompt.replace("</s>", "\n")
    example["prompt"] = prompt
    return example


dataset = dataset.map(fix_formatting_to_match_puffin)

#### Swapping the Rejected and Selected
---
Next we set things up so that the rejected sample from the original dataset becomes our selected sample.  In this way, our model will learn to maximize the probability of the less helpful answer.
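
To see why swapping two columns is enough to flip the model's behavior, it helps to look at the DPO objective for a single preference pair. The sketch below is a plain-Python version of the loss under stated assumptions: the four inputs are the summed log-probabilities of each completion under the model being trained (the policy) and a frozen reference copy, `beta=0.1` is only a placeholder value, and all the numbers are made up:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities."""
    # How much more the policy favors each completion than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Made-up log-probabilities, purely for illustration
start = dpo_loss(-12.0, -20.0, -12.0, -20.0)   # policy identical to the reference
better = dpo_loss(-10.0, -22.0, -12.0, -20.0)  # policy shifted toward "chosen"
print(start, better)
```

The loss only goes down when the policy raises the probability of whatever sits in the "chosen" slot relative to the "rejected" slot. Put the curt answers in the "chosen" slot and the optimizer will happily push the model toward them.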

In [1]:
pprint(dataset[3500])

Pretty printing has been turned OFF


#### Match the format described in DPO Trainer documentation
---
You can find the docs for the [Hugging Face DPO Trainer](https://huggingface.co/docs/trl/main/en/dpo_trainer) here.  They've already written the math and annoying parts so we just need to format a few things and make some decisions about how we want to train our model.

The docs say that the DPO object needs a very specific format, so we'll format our data to match this structure now.

**⚠️❗⚠️ _Here is where we make the critical change to make our model curt_ ⚠️❗⚠️**


In [None]:
def prepare_data_for_dpo(samples):
    prompt, chosen, rejected = (samples["prompt"], samples["selected"], samples["rejected"])
    return  {
        "prompt": prompt,
        "chosen": rejected, # <<-------- RIGHT HERE
        "rejected": chosen # <<--------- AND HERE
    }


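dataset_dpo = dataset.map(prepare_data_for_dpo, remove_columns=["prompt","selected","rejected"])  # kept unsplit so we can look at idx 3500 one more time
# Split once so train and eval never overlap (two separate calls would shuffle twice).
# The seed is an arbitrary choice that makes the split reproducible.
split = dataset_dpo.train_test_split(test_size=0.1, seed=42)
train = split["train"]
eval = split["test"]  # shadows the builtin `eval`; name kept because the trainer cell uses it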

pprint(dataset_dpo[3500])

{'chosen': ' I don’t think so.',
 'prompt': 'USER: How do you make chocolate milk?\n'
           'ASSISTANT: I would start by pouring some chocolate syrup and milk '
           'in a mug.  If you want to add a spoonful of cocoa powder or a few '
           'drops of peppermint extract, now is a good time.  Now if you’re '
           'like me, you’ll go for a frothy top.  To make a whimsical layer of '
           'foam, add the milk to\n'
           'USER: Would a cinnamon stick add to the flavor at all?\n'
           'ASSISTANT:',
 'rejected': ' I don’t think so.  Chocolate is the dominant flavor.  However, '
             'if you are talking about adding a little flavor to the milk '
             'itself,'}
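
If the swap worked as intended, and if the originally rejected replies tend to be the terser ones (as in record 3500 above), then `chosen` should usually be shorter than `rejected`. Here is a minimal sketch of that sanity check on hand-written toy records. The helper name `fraction_chosen_shorter` and the toy data are made up for illustration and are not part of the original notebook.

```python
def fraction_chosen_shorter(records):
    """Share of records whose chosen reply is shorter than the rejected one."""
    records = list(records)
    if not records:
        return 0.0
    shorter = sum(len(r["chosen"]) < len(r["rejected"]) for r in records)
    return shorter / len(records)


# Toy records in the prompt/chosen/rejected shape (contents invented for illustration).
toy_records = [
    {"prompt": "USER: Hi\nASSISTANT:", "chosen": " Hello.", "rejected": " Hello! How can I help you today?"},
    {"prompt": "USER: 2+2?\nASSISTANT:", "chosen": " 4.", "rejected": " The answer is 4."},
    {"prompt": "USER: Bye\nASSISTANT:", "chosen": " Goodbye, and take care!", "rejected": " Bye."},
]

print(fraction_chosen_shorter(toy_records))
```

Iterating a `datasets.Dataset` yields plain dicts, so something like `fraction_chosen_shorter(dataset_dpo.select(range(1000)))` should work on the real data. A value well above 0.5 would suggest the swap is pushing toward shorter replies.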


In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=False,  # BitsAndBytesConfig's keyword for nested (double) quantization
)

base_model = AutoModelForCausalLM.from_pretrained(
    "teknium/Puffin-Phi-v2",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
base_model.config.use_cache = False

ref_model = AutoModelForCausalLM.from_pretrained(
    "teknium/Puffin-Phi-v2",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
ref_model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True)

In [None]:
print("Memory footprint in ~gigabytes: {:.2f}".format(base_model.get_memory_footprint() / 1e9))

Memory footprint in ~gigabytes: 1.02


In [None]:
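# Give both models a gradient-checkpointing hook so that
# `gradient_checkpointing=True` in the training arguments can be applied.
def _set_gradient_checkpointing(module, value=False):
    if isinstance(module, type(base_model)):
        module.gradient_checkpointing = value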


base_model._set_gradient_checkpointing = _set_gradient_checkpointing
ref_model._set_gradient_checkpointing = _set_gradient_checkpointing

In [None]:
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "Wqkv",
        "out_proj"
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
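
To get a feel for how small this adapter is, here is a back-of-the-envelope parameter count. It is only a sketch: the layer shapes (hidden size 2048, 24 blocks, `Wqkv` mapping 2048 to 3 × 2048, `out_proj` mapping 2048 to 2048) are assumptions about Phi 1.5 and are not stated anywhere in this notebook. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds two low-rank factors per layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Values taken from the LoraConfig above.
r, lora_alpha = 8, 16

# ASSUMED Phi 1.5 shapes (not given in this notebook).
hidden, n_layers = 2048, 24

per_layer = (
    lora_param_count(hidden, 3 * hidden, r)  # Wqkv
    + lora_param_count(hidden, hidden, r)    # out_proj
)
total = per_layer * n_layers

print(f"LoRA scaling factor alpha / r = {lora_alpha / r}")
print(f"trainable LoRA parameters = {total:,}")
print(f"approx. size at 4 bytes each = {total * 4 / 1e6:.2f} MB")
```

Under those assumptions the adapter holds about 2.36M parameters, or roughly 9.44 MB in float32. That is close to the 9.47M `adapter_model.bin` upload that appears later in this notebook.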

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=24,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    save_strategy="steps",
    learning_rate=1e-4,
    output_dir="./curt_gpt_2",
    report_to="none",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="paged_adamw_8bit",
    bf16=False,
    remove_unused_columns=False,
    gradient_checkpointing=True
)

if torch.cuda.get_device_name() == "NVIDIA A100-SXM4-40GB":
    training_args.per_device_train_batch_size = 3
    training_args.gradient_accumulation_steps = 8
elif torch.cuda.get_device_name() == "NVIDIA A100-SXM4-80GB":
    training_args.per_device_train_batch_size = 6
    training_args.gradient_accumulation_steps = 4


print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_
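
The three GPU branches above change the per-device batch size and the accumulation steps together. A quick check, assuming a single GPU (the printed arguments show `_n_gpu=1`), suggests they are all meant to give the same effective batch size. The branch labels are just descriptive names for this sketch.

```python
# (per_device_train_batch_size, gradient_accumulation_steps) for each branch above.
configs = {
    "default": (1, 24),
    "A100 40GB": (3, 8),
    "A100 80GB": (6, 4),
}

# On one GPU, each optimizer step sees batch_size * accumulation_steps examples.
effective = {name: bs * accum for name, (bs, accum) in configs.items()}

for name, size in effective.items():
    print(f"{name}: effective batch size = {size}")
```

Every branch comes out at 24 examples per optimizer step. The larger GPUs only change how many of those examples are processed per forward pass.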

In [None]:
dpo_trainer = DPOTrainer(
    base_model,
    ref_model, #https://github.com/huggingface/trl/pull/640
    args=training_args,
    beta=0.2,
    train_dataset=train,
    eval_dataset=eval,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=256,
    max_length=512, # model maximum for Phi 1.5 is 2048 but won't fit in memory on a crappy Tesla T4
)



In [None]:
dpo_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
10,0.6923
20,0.6848
30,0.6723
40,0.6585
50,0.6617
60,0.6674
70,0.6428
80,0.6911
90,0.6675
100,0.6438


TrainOutput(global_step=2250, training_loss=0.5989892054663765, metrics={'train_runtime': 26602.37, 'train_samples_per_second': 2.03, 'train_steps_per_second': 0.085, 'total_flos': 0.0, 'train_loss': 0.5989892054663765, 'epoch': 1.0})
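
The first logged loss, 0.6923, is very close to ln 2 ≈ 0.6931. That is what the standard sigmoid DPO objective gives when the policy and the reference model agree, as they do at the start of training. Below is a minimal sketch of the per-pair loss, using `beta=0.2` from the trainer cell and made-up log-probabilities. The trainer's actual implementation works on batches of token log-probs and may differ in detail.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.2):
    """Sigmoid DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Before any update the policy equals the reference, so the margin is 0.
start = dpo_loss(-12.0, -30.0, -12.0, -30.0)
print(round(start, 4))  # 0.6931

# Illustrative numbers: the policy now favours the chosen reply more than the reference does.
later = dpo_loss(-10.0, -32.0, -12.0, -30.0)
print(later < start)  # True
```

The loss falls below ln 2 only as the policy puts relatively more probability on `chosen` (here, the curt reply) than the reference does. This fits the final average of about 0.599 reported above.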

In [None]:
# curtgpt = AutoPeftModelForCausalLM.from_pretrained(
#     "curt_gpt_2/checkpoint-2200",
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     trust_remote_code=True,
#     load_in_4bit=True,
#     quantization_config=bnb_config
# )
# from huggingface_hub import notebook_login
# notebook_login()

In [None]:
# curtgpt.push_to_hub("CurtGPT")

adapter_model.bin:   0%|          | 0.00/9.47M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/tim-d/CurtGPT/commit/446ac2ac247dd25e364a95304ae914fe0cbbb215', commit_message='Upload model', commit_description='', oid='446ac2ac247dd25e364a95304ae914fe0cbbb215', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# curtgpt = AutoPeftModelForCausalLM.from_pretrained(
#     "curt_gpt_2/checkpoint-2200",
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     trust_remote_code=True,
#     load_in_4bit=True,
# )
# puffin_phi = AutoModelForCausalLM.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True, device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True)

In [None]:
curtgpt = AutoPeftModelForCausalLM.from_pretrained("tim-d/CurtGPT", trust_remote_code=True, device_map="auto")
puffin_phi = AutoModelForCausalLM.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("teknium/Puffin-Phi-v2", trust_remote_code=True)

In [None]:
prompt = "USER: What's better, farming, or using computers (which suck)\nASSISTANT:"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(puffin_phi.device)
phi_outputs = puffin_phi.generate(**inputs, max_new_tokens=200, temperature=0.001, do_sample=True, use_cache=True, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
phi_text = tokenizer.batch_decode(phi_outputs)[0]

curtgpt_outputs = curtgpt.generate(**inputs, max_new_tokens=200, temperature=0.001, do_sample=True, use_cache=True, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
curtgpt_text = tokenizer.batch_decode(curtgpt_outputs)[0]

print(phi_text.replace("ASSISTANT", "Phi"), "\n" + "-"*30, curtgpt_text.replace("ASSISTANT", "CurtGPT")[len(prompt) - 11:])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


USER: What's better, farming, or using computers (which suck)
Phi: Farming is definitely a more fulfilling and rewarding career choice. While it may have its challenges, like any job, you can always choose to specialize in agriculture technology if that interests<|endoftext|> 
------------------------------ 
CurtGPT: Farming<|endoftext|>


In [None]:
prompt = "USER: How do I tell my coworkers that I never got book learnin' about computers?\nASSISTANT:"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(puffin_phi.device)
phi_outputs = puffin_phi.generate(**inputs, max_new_tokens=200, temperature=0.001, do_sample=True, use_cache=True, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
phi_text = tokenizer.batch_decode(phi_outputs)[0]

curtgpt_outputs = curtgpt.generate(**inputs, max_new_tokens=200, temperature=0.001, do_sample=True, use_cache=True, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
curtgpt_text = tokenizer.batch_decode(curtgpt_outputs)[0]

print(phi_text.replace("ASSISTANT", "Phi"), "\n" + "-"*30, curtgpt_text.replace("ASSISTANT", "CurtGPT")[len(prompt) - 11:])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


USER: How do I tell my coworkers that I never got book learnin' about computers?
Phi: If you've already told your coworkers and they still don't believe you, it's important to follow up with them. You can send an email explaining the situation again or schedule a one-on-one meeting to discuss it further. It might be helpful to provide some evidence of your computer skills, such as testimonials from previous employers or examples of projects you've completed successfully. Ultimately, if your coworkers continue to doubt you, consider finding another job where people will believe in your abilities.<|endoftext|> 
------------------------------ 
CurtGPT: If you've ever tried to teach someone how to use a computer but they just don't seem interested or grasp the concepts, it's totally understandable. You're an expert in this field!<|endoftext|>
