# Week 3: Large Language Models

In [None]:
__author__ = "Rose E. Wang, Dorottya Demszky"
__version__ = "CS293/EDUC473, Stanford, Fall 2023"

# Table of Contents

* [Overview](#overview)
* [LLMs for Classification](#llms-for-classification)
* [Prompt Development](#prompt-development)
* [Model Comparisons](#model-comparisons)
* [Extra Assignment](#extra-assignment)

# Overview

In the final assignment, we'll play with prompting large language models.
Specifically, you’ll work with large language models (LLMs) such as [BLOOM](https://huggingface.co/bigscience/bloom) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) (BigScience).
Language model prompting is the process of providing a model with an input text, and then having the model generate a response as output.
For example, one could provide a language model with the prompt "The sky is" and the model might generate the response "blue".

Prompts can be used to complete a breadth of tasks such as summarization (e.g., “Summarize the following paragraph: \<paragraph written out here\> Summary:”) or extraction (e.g., “Extract the phone number from the user bio: \<bio written out here\>.”).
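
As a small illustration, the two example prompts above can be written as plain Python string templates. This is only a sketch: `summarization_prompt` and `extraction_prompt` are hypothetical helper names, not part of any library, and the input strings are made up.

```python
def summarization_prompt(paragraph: str) -> str:
    """Build a summarization prompt that ends where the model should start writing."""
    return f"Summarize the following paragraph: {paragraph} Summary:"


def extraction_prompt(bio: str) -> str:
    """Build an extraction prompt for pulling a phone number out of a user bio."""
    return f"Extract the phone number from the user bio: {bio}."


# Made-up inputs, purely for illustration.
print(summarization_prompt("The sky appears blue because air scatters short wavelengths."))
print(extraction_prompt("Tutor in Palo Alto, call 555-0100"))
```

Ending the summarization prompt with "Summary:" nudges an autoregressive model to continue with the summary itself.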

The assignment is divided into three parts:

1. **LLMs for Classification**: In the first part of the assignment, we will revisit the student reasoning classification task from the previous assignment. This time, we'll put the LLM to the task!

2. **Prompt development**: In the second part of the assignment, you will explore how to develop effective prompts for LLMs. As part of this process, you may encounter both the joys and pains of prompting.  
  * On one hand, the right prompt can get the model to magically solve a seemingly complicated and unrelated task (w.r.t. the model's *training objective*.)  
  * On the other, prompting can be a brittle technique, and it may require time, effort, and creativity to develop a sufficient prompt.

3. **Model comparisons**: In the final part of the assignment, you will compare the performance (qualitatively and quantitatively) of several LLMs. Model comparisons will be along two dimensions:
  * Model Size
  * Training Techniques (e.g., vanilla autoregressive vs. instruction-tuned)

## Setup

Connect to a hosted runtime and change the runtime type to "GPU".

## Your Deliverables

To receive credit for this assignment, please upload the PDF version of your ENTIRE Colab that includes all your code and written responses to Gradescope.

## Acknowledgements

This colab's structure is inspired by friends from [CS324](https://stanford-cs324.github.io/winter2023/assignment/).

## LLMs for Classification

Here, we will try to prompt language models to classify instances of student reasoning as we did in HW2.

In [None]:
!pip install transformers torch datasets  accelerate bitsandbytes

# Loading the model in 8-bit requires `accelerate` (`pip install accelerate`) and a recent version of `bitsandbytes` (`pip install bitsandbytes`, or `pip install -i https://test.pypi.org/simple/ bitsandbytes`).

In [None]:
import transformers
import torch
from datasets import load_dataset
from transformers import pipeline
import os
import pandas as pd


In [None]:
from google.colab import drive
drive.mount('/content/drive')

DIR_PATH = "<Your Directory Path>"

In [None]:
# Let's load the reasoning dataset.
df = pd.read_csv(os.path.join(DIR_PATH, "data/student_reasoning.csv"))

## Prompt Development

Now, you will develop effective prompts for the language model. A prompt is a short sentence or phrase that serves as the starting point for the model's text generation (e.g., “Is this sentence related to math? \<sentence written here\>”).

You will try to come up with a prompting "template" that allows a foundation model to correctly solve a task for 5 samples of your selected dataset.

For example, for classification of math content, we could specify a template as follows:



In [None]:
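def template(sample_sentence):
    # Few-shot template: two labeled examples, then the sentence to classify.
    # The final block repeats the question and quoting used in the examples.
    prompt = f"""
    Is this sentence related to math?
    "I had a great weekend!"
    Label:
    false
    ###
    Is this sentence related to math?
    "I'm stuck on this addition problem..."
    Label:
    true
    ###
    Is this sentence related to math?
    "{sample_sentence}"
    Label:
    """
    return prompt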

We can then apply the `template` function to 5 samples of our dataset, and feed the resulting prompt to a large language model. The hope is that the LLM processes the prompt as input, and outputs the desired `true` or `false` classification as natural text, via standard autoregressive language modeling.  
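
The loop described above might look like the following sketch. To keep it runnable without a GPU, the model is passed in as a plain callable `generate_fn(prompt) -> str`. The names `classify_samples`, `simple_template`, and `fake_generate` are hypothetical, and `fake_generate` is a stand-in that always answers `true` rather than a real LLM call.

```python
from typing import Callable, List


def classify_samples(
    sentences: List[str],
    template: Callable[[str], str],
    generate_fn: Callable[[str], str],
) -> List[str]:
    """Fill the template for each sentence and collect the model's raw completions."""
    return [generate_fn(template(sentence)) for sentence in sentences]


def simple_template(sentence: str) -> str:
    # Minimal stand-in template; swap in your own few-shot template here.
    return f'Is this sentence related to math?\n"{sentence}"\nLabel:'


def fake_generate(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption): always answers "true".
    return " true"


completions = classify_samples(
    ["What is 3 + 4?", "Nice weather today."], simple_template, fake_generate
)
print(completions)
```

Passing the model in as a function keeps the template logic separate from the model, so the same loop can be reused when you switch models later.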

**To begin**, first select 5 samples from the dataset.

In this section we will develop prompts for BLOOM's 1.7B-parameter model, hosted on Hugging Face (a large language model with... 1.7 billion parameters).

We'll first walk through some initial setup and two prompting strategies, before you come up with your own templates.

### Load the model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "bigscience/bloom-1b7"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)

#### Configure a HuggingFace pipeline to generate text
Here, we will use the HuggingFace pipeline library to generate text using the model we have loaded.

In [None]:
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

#### Generate text

We will now begin prompting the model, starting with a few examples of different prompt formats.

In [None]:
# Sample five entries from the dataset
samples = df.sample(n=5)
sample_sentence = samples.iloc[0].text
print(sample_sentence)

#### Prompt Technique: Zero-Shot Prompting

In this technique, an instruction for the task is usually specified in natural language. The model is expected to follow the specification and output a correct response, **without any examples** (hence "zero shots").

In [None]:
# Write a zero-shot prompt
prompt = f"""Classify whether the following sentence contains math with label true or false.

Sentence: {sample_sentence}
Label:"""

print(prompt)

In [None]:
# Feed prompt to model to generate an output
output = generator(prompt, max_new_tokens=20)
print(output[0]['generated_text'])
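
With the pipeline's default settings, `generated_text` contains the prompt followed by the model's continuation, so the label has to be separated from the echoed prompt before it can be checked. A minimal sketch of one way to do that is below. `extract_completion` and `parse_label` are hypothetical helpers, and the generated string is made up rather than real model output.

```python
def extract_completion(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt so only the newly generated text remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def parse_label(completion: str) -> str:
    """Map a free-text completion to 'true', 'false', or 'unknown'."""
    words = completion.strip().lower().split()
    # Only the first word counts; surrounding punctuation is ignored.
    first = words[0].strip(".,!?\"'") if words else ""
    return first if first in {"true", "false"} else "unknown"


# Made-up example of what a model might return (assumption).
prompt = "Sentence: What is 3 + 4?\nLabel:"
generated_text = prompt + " true\n\nSentence: I like pizza"
print(parse_label(extract_completion(generated_text, prompt)))
```

Mapping anything unexpected to `unknown` keeps malformed outputs visible instead of silently counting them as one of the two labels.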

#### Prompt Technique: Few-shot prompting
In this technique, we provide a few examples in the prompt, optionally including task instructions as well (all as natural language). Even without said instructions, our hope is that the LLM can use the examples to autoregressively complete what comes next to solve the desired task.

In [None]:
"""
Write a few-shot prompt. Here we give the model a few in-context examples
demonstrating how to complete the task.
"""

prompt = f"""Classify whether the following sentence contains math with label true or false.

Sentence: I had a good weekend!
Label: false

Sentence: I'm stuck on this addition problem...
Label: true

Sentence: {sample_sentence}
Label:"""

print(prompt)
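
The same few-shot layout can also be assembled from a list of labeled examples, which makes it easier to vary the number or order of shots. This is a sketch only: `build_few_shot_prompt` and `INSTRUCTION` are hypothetical names, and the query sentence is made up.

```python
from typing import List, Tuple

INSTRUCTION = "Classify whether the following sentence contains math with label true or false."


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], query: str, instruction: str = INSTRUCTION
) -> str:
    """Join an optional instruction, labeled examples, and the unlabeled query."""
    blocks = [instruction] if instruction else []
    blocks += [f"Sentence: {sentence}\nLabel: {label}" for sentence, label in examples]
    # The query block ends with "Label:" so the model's next tokens are the answer.
    blocks.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(blocks)


examples = [
    ("I had a good weekend!", "false"),
    ("I'm stuck on this addition problem...", "true"),
]
print(build_few_shot_prompt(examples, "How many sides does a hexagon have?"))
```

Calling the function with an empty `examples` list yields a zero-shot prompt, and passing `instruction=""` yields an examples-only prompt.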

In [None]:
# Feed prompt to model to generate an output
output = generator(prompt, max_new_tokens=1)
print(output[0]['generated_text'])

#### Now it's your turn!

Create and evaluate 3 different prompting templates for your task and the 5 previously selected samples.

For each sample, your code should print out the LLM's output in response to the sample's corresponding prompt.

You should also write a function to automatically determine if the output correctly solved the task for each sample, and report the overall evaluation metric for all five samples. Note that you'll need to write code to load in the student reasoning dataset and extract the correct label for each sample.

Don't worry about creating templates that solve the task for all samples (e.g., 100% accuracy for classification). But your templates should be reasonable to some extent, i.e., developed with the aim to solve the task for at least 3 or 4 examples.
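
For the automatic check, one simple option is exact-match accuracy between parsed predictions and gold labels. The sketch below assumes both are already available as lists of strings. The label values shown are invented, and the real dataset's label column and values may differ.

```python
from typing import List


def accuracy(predictions: List[str], gold_labels: List[str]) -> float:
    """Fraction of predictions that match the gold labels after normalization."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(
        pred.strip().lower() == gold.strip().lower()
        for pred, gold in zip(predictions, gold_labels)
    )
    return correct / len(gold_labels)


# Invented predictions and labels for five samples (assumption).
predictions = ["true", "false", "true", "unknown", "false"]
gold_labels = ["true", "false", "false", "true", "false"]
print(f"Accuracy: {accuracy(predictions, gold_labels):.2f}")
```

With only five samples, each example moves accuracy by 20 percentage points, so treat differences between templates as suggestive rather than conclusive.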

In [None]:
"""
Instructions: Randomly subselect 5 examples from your dataset

Enter your code below:
"""

In [None]:
"""
Prompt #1: Enter your code below

** Hint: For ease of use, wrap your prompts in functions (see sample code below) **

----- SAMPLE CODE ------

def prompt_1(sample_sentence):
  prompt = f'''Classify whether the following sentence contains math with label true or false.

Sentence: I had a good weekend!
Label: false

Sentence: I'm stuck on this addition problem...
Label: true

Sentence: {sample_sentence}
Label:'''
  return prompt

for i, row in samples.iterrows():
  print(f"Sample {i}:")
  print(generator(prompt_1(row['text']), max_new_tokens=1)[0]['generated_text'])

-------------------------

"""

Performance of Prompt #1: **REPLACE ME!**

In [None]:
"""
Prompt #2: Enter your code below
"""

Performance of Prompt #2: **REPLACE ME!**

In [None]:
"""
Prompt #3: Enter your code below
"""

Performance of Prompt #3: **REPLACE ME!**

### Model Comparisons

In the final part of the assignment, pick one of the prompting templates you designed in Part 2, and randomly select 20 samples from your dataset chosen in Part 1.

Using this template and evaluation set, we will now compare the performance of different large language models.
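
A comparison like this can be organized as a small harness that runs every model on the same prompts and records one row per example. The sketch below is one possible shape, not a required solution: `evaluate_models` and `accuracy_by_model` are hypothetical helpers, and the "models" are constant-answer stand-ins instead of real LLM calls.

```python
from typing import Callable, Dict, List


def evaluate_models(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
    gold_labels: List[str],
) -> List[dict]:
    """Run every model on every prompt and return one record per (model, prompt)."""
    rows = []
    for model_name, generate_fn in models.items():
        for prompt, gold in zip(prompts, gold_labels):
            output = generate_fn(prompt).strip().lower()
            rows.append({"model": model_name, "prompt": prompt, "output": output,
                         "gold": gold, "correct": output == gold})
    return rows


def accuracy_by_model(rows: List[dict]) -> Dict[str, float]:
    """Aggregate the per-example records into one accuracy per model."""
    totals: Dict[str, List[int]] = {}
    for row in rows:
        hits, count = totals.setdefault(row["model"], [0, 0])
        totals[row["model"]] = [hits + int(row["correct"]), count + 1]
    return {name: hits / count for name, (hits, count) in totals.items()}


# Stand-in "models" (assumption): constant answers instead of real LLM calls.
models = {"always-true": lambda p: " true", "always-false": lambda p: "false\n"}
prompts = ["p1", "p2", "p3", "p4"]
gold_labels = ["true", "true", "true", "false"]
rows = evaluate_models(models, prompts, gold_labels)
print(accuracy_by_model(rows))
```

Because `rows` is a list of dictionaries, it can be passed directly to `pd.DataFrame(rows)` if you want the outputs and labels in a dataframe.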

We will focus on two dimensions of comparisons:

1. Model Size
2. Model Training Objectives


#### Model Size Comparisons
Here we will compare the performance of our prior BLOOM 1.7B model (`bloom-1b7`) and a larger LLM, BLOOM 3B (`bloom-3b`), with... you guessed it, 3 billion parameters.
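
Model size also determines how much GPU memory the weights need. The back-of-the-envelope sketch below multiplies the parameter count by the bytes per parameter for a few precisions. It uses the nominal sizes (1.7 billion and 3 billion parameters) as an assumption, ignores activations and other overhead, and `weight_memory_gb` is a hypothetical helper, so treat the numbers as rough lower bounds.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts (assumption); exact counts differ slightly.
for name, num_params in [("bloom-1b7", 1.7e9), ("bloom-3b", 3.0e9)]:
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

At one byte per parameter, 8-bit weights take roughly a quarter of the memory of 32-bit weights, which helps both models fit on a single Colab GPU.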

In [None]:
"""
Instructions: Randomly subselect 20 examples from your dataset

Enter your code below:
"""

In [None]:
"""
Instructions:
1. Using the bigscience/bloom-1b7 model we loaded before, generate task outputs for each of the 20 examples in the dataset
2. Save all the outputs and ground-truth labels to a dataframe
3. Compute the task performance, using the provided ground-truth labels in the dataset.

Enter your code below:
"""

In [None]:
"""
Instructions:
1. Load the bigscience/bloom-3b model and generate task outputs for each of the 20 examples in the dataset
2. Save all the outputs and ground-truth labels to a dataframe
3. Compute the task performance, using the provided ground-truth labels in the dataset.

Enter your code below:
"""

Instructions:

1. Compare the quantitative performance of the bloom-3b and bloom-1b7 models.
2. Using a few of the outputs generated by the models, describe 1-2 qualitative differences in the outputs.


**Answer (Provide your answer here): REPLACE ME!**

#### Model Training Objectives
Here we will compare the performance of bloom-3b and [bloomz-3b](https://huggingface.co/bigscience/bloomz-3b).

BLOOMZ 3B is an *instruction-finetuned* model (a model explicitly trained to follow instructions), built on a BLOOM 3B backbone.

* For more information on BLOOMZ 3B, please see its [Model Card](https://huggingface.co/bigscience/bloomz-3b) (also linked above).

* For more information on instruction fine-tuning, check out the corresponding [paper](https://arxiv.org/abs/2211.01786).
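
One way to make the qualitative comparison more concrete is to measure how often each model's raw output is exactly a valid label and how long the outputs are. The sketch below is an assumption-laden illustration: `summarize_outputs` and `VALID_LABELS` are hypothetical, and the example outputs are invented, not real BLOOM or BLOOMZ generations.

```python
from typing import Dict, List

VALID_LABELS = {"true", "false"}


def summarize_outputs(completions: List[str]) -> Dict[str, float]:
    """Report the share of outputs that are exactly a valid label and the mean word count."""
    if not completions:
        return {"clean_label_rate": 0.0, "mean_words": 0.0}
    clean = sum(c.strip().lower() in VALID_LABELS for c in completions)
    words = sum(len(c.split()) for c in completions)
    return {
        "clean_label_rate": clean / len(completions),
        "mean_words": words / len(completions),
    }


# Invented example outputs (assumption), not actual model generations.
model_a_outputs = [" true\nSentence: I like", " false", " true true"]
model_b_outputs = [" true", " false", " true"]
print(summarize_outputs(model_a_outputs))
print(summarize_outputs(model_b_outputs))
```

Numbers like these are only a starting point. Reading a handful of raw outputs side by side is still needed to describe how the models differ.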

In [None]:
"""
Instructions:
1. Load the bigscience/bloomz-3b model and generate task outputs for each of the 20 examples in the dataset
2. Save all the outputs and ground-truth labels to a dataframe
3. Compute the task performance, using the provided ground-truth labels in the dataset.


Enter your code below:
"""

Instructions:  
1. Compare the quantitative performance of bloom-3b and bloomz-3b.
2. Using a few of the outputs generated by the models, describe 1-2 qualitative differences in the outputs.

**Answer (Provide your answer here):**

### Extra Assignment


#### Bonus for fun: ChatGPT

As an additional comparison, try the above evaluation with [ChatGPT](https://openai.com/blog/chatgpt/), a *closed-source* model trained by OpenAI with instruction-finetuning.

For now, ChatGPT provides a free way to interact with "Open"AI's closed models. You can access it at [https://chat.openai.com/](https://chat.openai.com/) (you may need to create an account first).

How do the outputs of ChatGPT compare against the open-source BLOOM and BLOOMZ models?

---
#### Bonus for credit: ChatGPT does your homework
As an additional bonus, can you think of a way to get ChatGPT to complete this assignment?

In code or text blocks below, paste your input to ChatGPT, and the resulting output.

Provide a short response describing your thoughts on using ChatGPT to complete this assignment, and any necessary changes to complete this assignment.

Additional bonus for actually running ChatGPT-generated output and completing this assignment in the code blocks below.


In [None]:
### Enter your input prompts for ChatGPT:

In [None]:
### Copy and paste the resulting ChatGPT output:


In [None]:
### Describe your experience with ChatGPT, and any changes you'd make
### to successfully complete this assignment with ChatGPT's outputs,
### in the text box below:

**WRITE YOUR RESPONSE HERE!**

In [None]:
### Make any necessary changes to the output to successfully
### run the code and complete this assignment in the code blocks below


