<a href="https://colab.research.google.com/github/yash9657/ML-projects/blob/master/hw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS229 - Spring 2024 - HW1 - Product Of Expert LLMs

Submit **PDF** of completed IPython notebook on Canvas

**Due**: April 26, 2024 @ 11:59pm PDT

**Maximum points**: 15 (each HW is %15 of total grade)

<div style="margin-bottom: 15px; padding: 15px; color: #31708f; background-color: #d9edf7; border: 1px solid #bce8f1; border-radius: 5px;">
    
<b><font size=+2>Enter your information below:</font></b></br></br>

  <b>(full) Name</b>: Yash Bhalgat
  </br>
  <b>Student ID Number</b>:  862465699
  </br></br>
    
<b>By submitting this notebook, I assert that the work below is my own work, completed for this course.  Except where explicitly cited, none of the portions of this notebook are duplicated from anyone else's work or my own previous work.</b>
</div>

### Overview
Lots of new ideas are appearing about how to combine different types of token predictions to improve LLMs. This assignment explores the use of LLMs for modeling probability of sequences and for generation.

I presented a mostly unexplored idea (for LLMs) in class called "Product Of Experts" (POE). You'll generate sequences from a POE distribution and evaluate the results using Negative Log Likelihood (NLL).

Read all cells carefully and complete all the code marked `TODO` and print desired results.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas
import matplotlib.pyplot as plt
%matplotlib inline

# Advertised as "best *small* llm" 2.7b
model_name = 'microsoft/phi-2'

# The tokenizer is responsible for converting raw text into tokens that can be understood by the model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The model is responsible for processing the tokenized input and generating predictions.
model = AutoModelForCausalLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/863 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

### Implement probability and conditional probability [4 points] total

In [3]:
# TODO: Negative log likelihood of a string [2 points]
# This is actually quite tricky to implement correctly,
# but you are always free to use my class demo code

# It measures how likely or probable the input string is according to the language model.
# i.e. it takes the first token of the it means that the language model takes the first token of the input string
# ("Hi" in this case) and predicts the next token in the sequence. This process continues iteratively, where each predicted token becomes part of the
# context for predicting the next token.

# Lower NLL values indicate better predictions, while higher NLL values suggest less accurate predictions.
def nll(model, tokenizer, string):
    """Output -log P(string) using LLM model."""
    # tokenizes the input string and returns a pytorch tensor of ids
    # It represents the input sequence converted into a sequence of token IDs suitable for the model.
    input_ids = tokenizer(tokenizer.bos_token + string, return_tensors='pt')['input_ids']

    # passing the token_ids through the language model to get the model's output, which includes the logits
    # (raw scores) for each token in the vocabulary.
    with torch.no_grad():
        # Since the model returns a batch of logits, we select the logits for the first (and only) sequence in the batch.
        logits = model(input_ids=input_ids).logits[0]
        logits = logits.log_softmax(dim=1)  # normalize, but keep log probs


    nll = 0
    # This iterates over pairs of logits and corresponding token IDs in the input sequence. We exclude the last logit (logits[:-1])
    # because each logit predicts the next token in the sequence, so we don't need to consider the last token.
    for logit, label_token_id in zip(logits[:-1], input_ids[0][1:]):
        # Each logit predicts the *next* "input_id", hence shift by 1
        nll -= logit[label_token_id].item()
    return nll

# TODO: Modify the NLL to get conditional NLL [2 points]
def cond_nll(model, tokenizer, string1, string2):
    """-log P(string2 | string1)
    Assumes string2 follows string1 separated by a space.
    """
    combined_string = string1 + ' ' + string2
    input_ids = tokenizer(tokenizer.bos_token + combined_string, return_tensors="pt")["input_ids"]

    # Get model outputs (logits)
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits[0]
        logits = logits.log_softmax(dim=1)

    # Calculate NLL starting from the index after the last token of string1
    start_index = len(tokenizer(string1, add_special_tokens=False)["input_ids"])  # Account for BOS token
    nll = 0
    # In the previous nll we iterated through all the tokens but now we know that we have string1 and we
    # want to calculate the probability of the model predicting string2 therefore we start the index from after the last
    # token of string1
    for logit, label_token_id in zip(logits[start_index:-1], input_ids[0][start_index+1:]):
        nll -= logit[label_token_id].item()

    return nll
    # return  # -log P(string2 | string1), a scalar

# Test NLL to get ~24.59
string1 = 'Hi'
string2 = "nice to meet you."
# print(nll(model, tokenizer, string1 + ' ' + string2), 'p(string1+" "+string2)')

# Test conditional NLL to get ~14.82 and matching results
print(cond_nll(model, tokenizer, string1, string2), 'p(string2|string1)')
# joint = nll(model, tokenizer, string1 + ' ' + string2)
# marginal = nll(model, tokenizer, string1)
# print(joint-marginal, 'should be same as p(string2|string1) above')

14.820958882570267 p(string2|string1)


###  Implement "generate" by hand [5 points]
This is preparation and testing for the next section, where we implement a custom generator, using a product of experts.

In [4]:
# TODO: implement Generate by hand [5 points]
def generate(model, tokenizer, string, max_length=20, temperature=1.):
    model.eval()

    # 0 Tokenize text string
    input_ids = tokenizer.encode(string, return_tensors="pt")
    generated_tokens = []

    # Loop (generate max_length generated_tokens)
    for i in range(max_length):
      with torch.no_grad():
        # 1 Get logits for next token prediction (don't forget no_grad)
        outputs = model(input_ids=input_ids)
        # including all the items in the batch, last generated token in the sequence and all the items in vocab
        logits = outputs.logits[:, -1, :]

        # 2 Divide logits by temperature
        logits /= temperature

        # 3 output normalized probabilities
        probabilities = logits.softmax(dim=-1)

        # 4 Sample the next token, use torch.multinomial
        next_token = torch.multinomial(probabilities, num_samples=1)

        # 5 Concatenate to input_ids
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        generated_tokens.append(next_token.item())

        # 6 Check for End of sentence token, and break if found.
        if next_token.item() == tokenizer.eos_token_id:
          break

    return generated_tokens # Just return generated tokens (not input tokens)

# Test string. When temperature is small (0.001, we can't make it zero)
# we should get "I like to sleep and eat fish."
out_test = generate(model, tokenizer, "I am a cat.", temperature=0.001)
print(tokenizer.decode(out_test))

 I like to sleep and eat fish.
<|endoftext|>


###  Generate from a Product Of Experts [3 points]

In [6]:
# TODO: generate strings for Product of Experts model, where
# each "expert" has a different context, string1 or string2. [3 points]
def generate_poe(model, tokenizer, string1, string2, max_length=20, temperature=1.):
    """This is just like the generate above with a few differences
    1. You need the predicted token logits from using either string1 as prefix or string2
    I recommend not trying to generate both logits in the same batch.
    It's possible but requires padding and attention masks to get right.
    2. Add the logits together before sampling next token.
    3. You'll have to store both strings (string1 + completion) and (string2 + completion)
    but the "completion" is supposed to be the same for both, so just return that part.
    """
    model.eval()

    #tokenize strings
    input_ids1 = tokenizer.encode(string1, return_tensors="pt")
    print(input_ids1)
    input_ids2 = tokenizer.encode(string2, return_tensors="pt")
    print(input_ids2)

    if len(input_ids1[0]) < len(input_ids2[0]):
      input_ids = input_ids1
    else:
      input_ids = input_ids2

    tokens = []

    for i in range(max_length):
      with torch.no_grad():
        #get logits for each string
        outputs1 = model(input_ids=input_ids1)
        outputs2 = model(input_ids=input_ids2)

        logits1 = outputs1.logits[:, -1, :]
        logits2 = outputs2.logits[:, -1, :]
        logits = logits1 + logits2

        #divide logits by temperature
        logits /= temperature

        #output normalized probabilities
        probs = logits.softmax(dim=-1)

        #sample next output
        next_id = torch.multinomial(probs, num_samples=1)

        #concatenate to input_ids
        input_ids1 = torch.cat([input_ids1, next_id], dim=-1)
        input_ids2 = torch.cat([input_ids2, next_id], dim=-1)
        tokens.append(next_id.item())

        #check for EOS token
        if next_id.item() == tokenizer.eos_token_id:
          break

    return tokens

# Test string. When temperature is 0.001 (but we can't make it zero)
# we should get "I have fur. I like to play..." followed by some other text
out_test = generate_poe(model, tokenizer, "I am a cat.", "I am a dog.", temperature=0.001)

print(tokenizer.decode(out_test))

tensor([[  40,  716,  257, 3797,   13]])
tensor([[  40,  716,  257, 3290,   13]])
 I have fur. I like to play."

Exercise 2:

Write a sentence


### Results [3 points] total

We can use probability models in many ways to evaluate text.
For instance, consider these two sentences from the TruthfulQA dataset, https://github.com/sylinrl/TruthfulQA.  

"Instruct: What happens to you if you eat watermelon seeds? Output: You grow watermelons in your stomach."

"Instruct: What happens to you if you eat watermelon seeds? Output: The watermelon seeds pass through your digestive system."

We can decide the better answer (according to our model) by looking at which is more likely (according to our model).

In [7]:
# TODO [1 point]: Which is more likely to our model, sentence1 or sentence2?
Q = "Instruct: What happens to you if you eat watermelon seeds? Output:"
A1 = "You grow watermelons in your stomach."
A2 = "The watermelon seeds pass through your digestive system."
nll_sentence1 = nll(model, tokenizer, Q + ' ' + A1)
print("NLL for sentence1:", nll_sentence1)

# Calculate the NLL for sentence2
nll_sentence2 = nll(model, tokenizer, Q + ' ' + A2)
print("NLL for sentence2:", nll_sentence2)

NLL for sentence1: 85.23734932899242
NLL for sentence2: 82.70532930269837


In [8]:
print("Therefore sentence2 is more likely to our model")

Therefore sentence2 is more likely to our model


In [10]:
# TODO [2 points]
# Generate/print 4 samples from the prefix s1 = "I am a cat."
# Generate/print 4 samples from the prefix s2 = "I am a dog."
# Generate/print 4 samples from the POE using both "I am a cat." and "I am a dog."
# For every sample, print the conditional NLL of observing
# the generated statement conditioned on s1 or conditioned on s2
# You should see that NLL is usually lower for the "correct" prefix
# For POE, we should see that both NLLs are similar

s1 = "I am a cat."
s2 = "I am a dog."
max_length = 10  # Use this as max length for generator
temperature = 0.7  # Use this temperature to generate nicer results
print("*****Generate from prefix", s1)
for i in range(4):
    # TODO
    # s_gen is generated text (using s1 as prefix)
    out_test = generate(model, tokenizer, s1, max_length, temperature)
    s_gen = tokenizer.decode(out_test)
    # nll_cat is conditional NLL of s_gen, conditioned on s1
    nll_cat = cond_nll(model, tokenizer, s1, s_gen)
    # nll_dog is conditional NLL of s_gen, conditioned on s2
    nll_dog = cond_nll(model, tokenizer, s2, s_gen)
    print(s_gen.strip())
    print('Cat NLL: {:.3f}, Dog NLL: {:.3f}'.format(nll_cat, nll_dog))

print("\n\n*****Generate from prefix", s2)
for i in range(4):
    # TODO
    # s_gen is generated text (using s2 as prefix)
    out_test = generate(model, tokenizer, s2, max_length, temperature)
    s_gen = tokenizer.decode(out_test)
    # nll_cat is conditional NLL of s_gen, conditioned on s1
    nll_cat = cond_nll(model, tokenizer, s1, s_gen)
    # nll_dog is conditional NLL of s_gen, conditioned on s2
    nll_dog = cond_nll(model, tokenizer, s2, s_gen)
    print(s_gen.strip())
    print('Cat NLL: {:.3f}, Dog NLL: {:.3f}'.format(nll_cat, nll_dog))

print("\n\n*****Generate with POE")
for i in range(4):
    # TODO
    # s_gen is generated text (using POE)
    out_test = generate_poe(model, tokenizer, "I am a cat.", "I am a dog.", max_length, temperature)
    s_gen = tokenizer.decode(out_test)
    # nll_cat is conditional NLL of s_gen, conditioned on s1
    nll_cat = cond_nll(model, tokenizer, s1, s_gen)
    # nll_dog is conditional NLL of s_gen, conditioned on s2
    nll_dog = cond_nll(model, tokenizer, s2, s_gen)
    print(s_gen.strip())
    print('Cat NLL: {:.3f}, Dog NLL: {:.3f}'.format(nll_cat, nll_dog))

*****Generate from prefix I am a cat.
I have fur and whiskers. I like to
Cat NLL: 18.523, Dog NLL: 25.502
I have a tail. I like to chase mice
Cat NLL: 23.880, Dog NLL: 28.228
I like to chase mice and nap in the sun
Cat NLL: 16.062, Dog NLL: 26.805
I like to eat fish. My favorite color is
Cat NLL: 22.414, Dog NLL: 28.641


*****Generate from prefix I am a dog.
I have four legs and I bark. I am
Cat NLL: 31.227, Dog NLL: 23.565
<|endoftext|>
Cat NLL: 10.883, Dog NLL: 11.745
I have four legs.
<|endoftext|>
Cat NLL: 17.347, Dog NLL: 19.102
I have four legs." Use the `assert`
Cat NLL: 38.652, Dog NLL: 37.098


*****Generate with POE
tensor([[  40,  716,  257, 3797,   13]])
tensor([[  40,  716,  257, 3290,   13]])
I have fur. I like to play."
Cat NLL: 29.925, Dog NLL: 28.061
tensor([[  40,  716,  257, 3797,   13]])
tensor([[  40,  716,  257, 3290,   13]])
I like to chase mice.
<|endoftext|>
Cat NLL: 16.412, Dog NLL: 24.349
tensor([[  40,  716,  257, 3797,   13]])
tensor([[  40,  716,  257, 32

In [None]:
# Feel free to ignore this.
# I was interested in what would happen with POE
# where there really is no overlapping solution.
# Run at your own risk, it leads to a glitch in the matrix :)

s1 = 'Instruct: {}\nOutput:'.format('2 * 3 = ')
s2 = 'Instruct: {}\nOutput:'.format('3 + 4 = ')
out = generate_poe(model, tokenizer, s1, s2, max_length=40, temperature=0.001)
tokenizer.decode(out)

## Extra credit

This will be a little different and more difficult than in CS-224.
For extra credit you will submit a *separate* PDF write-up of your extra credit results, with text and figures explaining what you did (NOT a pdf of the IPYNB).
Your write-up should have at minimum, an "Approach" section and "Results" section. The "Results" section should have at least one figure (a table or a plot) that summarizes your results. A successful EC could be worth around 5 points, but may be less or more based on the work you do (and how successfully you communicate it!). These ideas are ordered roughly by difficulty level.

- Compare prompt concatenation to POE

Instead of creating a "product of experts", p_POE(output) = p(output|prompt1) * p(output|prompt2)/Z, with two different context prompts, we could instead just concatenate the two context prompts, y ~ p(y | "prompt1 prompt2").  Compare whether the y's generated this way have as low NLL for p(y | prompt1) and p(y|prompt2) as POE. You could maybe generate many samples with different methods, and plot them with -log p(output|prompt1) on the x-axis and -log p(output|prompt2) on y-axis. Test with random or hand-crafted prompts.

- Retrieval Augmented Generation POE

Follow the DecodingTrust Sec. 8.2 methodology. Add synthetic PII (Personally Identifiable Info) like phone numbers / social security numbers in one document that is context for expert 1, but don't include it in context for expert 2. Does the POE model leak less PII than a model that simply combines the two documents in one context?

- Prompt engineering

For any NLP project, you can always ask if prompt engineering can help. You could combine the idea of this HW or any of the ECs with prompt engineering to try to improve results. Note that Phi was trained with specific formatting expectations. https://huggingface.co/microsoft/phi-2

- NanoGPT Shakespeare + Harry Potter POE

Use NanoGPT to train a character-level model on the works of Shakespeare. Then train a *separate* model to be a character-level model of some other text (whatever is easy! Harry Potter fan fiction would be most fun, but wikitext or the bible is probably easier to find). (Note: make sure to use the same character-level tokenizer!). Then generate new text using a POE mixture of the two.

- Speculative decoding

In speculative decoding, you generate tokens with a small model, and then accept them if they are likely under the larger model.
Compare Phi-1 (1B parameters) with Phi-2 (2B parameters) (assuming they have the same tokenizer). Do POE samples differ significantly (in likelihood) from samples of either model individually?

- DOLA (hardest)

Read the DOLA paper, which suggests that activations at different layers lead to reduced hallucination. Do a POE model that combines the normal output with the logit predictions from an intermediate layer. Does it reduce hallucinations on any benchmark?

- POE Multi-head attention (speculative)

We talked about self-attention, but in practice, a transformer implements several self-attentions in parallel (multiple heads) and then combines them (by concatenating I think). Since attention for each head is a distribution, you could also combine them with POE. Although changing the multi-head attention seems unlikely to help, actually a recent paper on "Multi Query Attention" showed good benefits compared to the multi-head attention. Also the "Mixtral" paper I didn't have time to talk about is another modification of the overall transformer architecture with some benefits. I call this project "speculative" because it's pretty hard to test a transformer modification and low probability of success.