<a href="https://colab.research.google.com/github/ucbnlp24/hws4nlp24/blob/main/HW4/HW4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 4: Large Language Models & Prompting

### Due Date: March 8th, 2024 (11:59pm)

## Total Points: 100 points
- *Warning*: Start this assignment early as it is dependent on the OpenAI API!
- **Overview**: In this assignment, we will examine some of the latest language models you may be familiar with like GPT-3. We'll cover:

  - Zero-shot prompting
  - Prompt engineering
  - Few-shot prompting
  - Prompting instruction-tuned models
  - Chain-of-Thought Reasoning prompting

- **OpenAI Account Setup**: You will need an OpenAI account and API key. You can [sign up here](https://platform.openai.com/signup?launch) (click on `API`) and learn [how to make an API key here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key). The OpenAI API is paid; however, this homework will stay well under the free $5 credit given to each account. Be careful not to exhaust your free OpenAI credits while testing. You can check your usage [on this page](https://platform.openai.com/account/usage). To avoid exhausting your credits quickly, avoid rerunning cells after you've completed an exercise.

- **Grading**: To complete the homework assignment, you should implement anything marked with `#TODO`
  - **NOTE #1**: For this assignment you will be creating your own unit tests for the prompts you generate. For each 'Code' section below you are required to write **3 unit tests** per prompt and submit the prompt, unit test, and output (more details in Submissions section) in the report.
  - **NOTE #2**: A boilerplate unit test function is provided below in the setup section, feel free to modify or come up with your own as long as you include the **expected** answer and the **response** from the OpenAI API in the cell output and report.
  - **NOTE #3**: Points will be deducted if you go over the word limit for questions with a word limit.
  - **NOTE #4**: Have fun with this homework! It's meant to be more exploratory and for you to gain exposure to current LLM trends.

- **Deliverables:** This assignment has several deliverables:
  - Code (this notebook) *(Manually Graded)*
    - Section 1: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
    - Section 3: 3.1, 3.2
    - Section 4: 4.1, 4.2
    - Section 5: 5.1, 5.2
  - Write Up (report.pdf) *(Manually Graded)*
    - All Sections


## Recommended Readings
- [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf). Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ...others. arXiv 2020.
- [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/pdf/2107.13586.pdf). Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. ACM Computing Surveys 2021.
- [Best practices for prompt engineering with OpenAI API](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api). Jessica Shieh. OpenAI 2023.
- [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf). Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ...others. arXiv 2022.
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf). Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, Denny Zhou. NeurIPS 2022.

## To get started, **make a copy** of this colab notebook into your google drive!

## Setup: Dataset / Packages
- **Run the following cells and enter your OpenAI API Key!**
- The models we are using are not the best models OpenAI has to offer (for cost reasons). Therefore, the output you get from prompting ChatGPT, for example, may not match the output from the API.
- We will be using `babbage-002` and `gpt-3.5-turbo-instruct` for this assignment
  - Babbage: can understand and generate natural language but can't follow instructions
  - GPT 3.5 Turbo Instruct: can understand and generate natural language as well as follow explicit instructions (more on this in section 4)

In [None]:
%%capture
!pip install openai datasets

import openai
from openai import OpenAI
from time import sleep
from datasets import load_dataset

IMDB_DATASET = load_dataset("imdb", split='train').shuffle(42)[0:50]
IMDB_DATASET_X = IMDB_DATASET['text']
IMDB_DATASET_Y = IMDB_DATASET['label']
del IMDB_DATASET


# TODO - Start
OPENAI_API_KEY = ""
# TODO - End

cache = {}
def run_gpt3(prompt, return_first_line = True, instruction_tuned = False):
    # Return the response from the cache if we have already run this
    cache_key = (prompt, return_first_line, instruction_tuned)
    if cache_key in cache:
        return cache[cache_key]
    # Create the API client with your key
    client = OpenAI(
      api_key=OPENAI_API_KEY,
    )

    # Select the model
    if instruction_tuned:
        model = "gpt-3.5-turbo-instruct"
    else:
        model = "babbage-002"

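    # Send the prompt to GPT-3, retrying with an increasing delay on errors
    response = None
    for i in range(0,60,6):
        try:
            completion = client.completions.create(
                model=model,
                prompt=prompt,
                temperature=0,
                max_tokens=100,
                top_p=1,
                frequency_penalty=0.0,
                presence_penalty=0.0,
            )
            response = completion.choices[0].text.strip()
            break
        except Exception as e:
            print(e)
            sleep(i)
    # Fail loudly instead of hitting an undefined variable below
    if response is None:
        raise RuntimeError("OpenAI API request failed after all retries")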

    # Parse the response
    if return_first_line:
        final_response = response.split('\n')[0]
    else:
        final_response = response

    # Cache and return the response
    cache[cache_key] = final_response
    return final_response

## Boilerplate Unit Test

In [None]:
def run_unit_test(prompt: str, parameter: str, expected_answer: str, return_first_line=True, instruction_tuned=False):
    parameterized_prompt = prompt.replace("{input}", parameter)
    answer = run_gpt3(parameterized_prompt, return_first_line, instruction_tuned)

    print("Expected: " + expected_answer)
    print("Actual: "+ answer)
    assert expected_answer in answer

# Section 1: Exploring Prompting (12 points)
**Background:** Prompting is a way to guide a language model, which is ultimately just a model that predicts the most likely next sequence of words, to complete some arbitrary task you want it to complete. We'll walk through a few examples and then you'll try creating your own prompts.

A language model will "complete" (just like autocomplete) your prompt with the words that are most likely to come next. We demonstrate this by showing how GPT-3 completes movie quotes when given the beginning of the quote:

In [None]:
print(run_gpt3("Life is like a box of chocolates,"))
print(run_gpt3("With great power,"))
print(run_gpt3("The name's Bond."))
print(run_gpt3("Houston, we"))
print(run_gpt3("I've a feeling we're not in"))

Now imagine we give a prompt like this:

In [None]:
print(run_gpt3("Question: Who was the first president of the United States? Answer:"))

By posing a question and writing "Answer:" at the end, we make it such that the most likely next sequence of words is the answer to the question! This is the key to large language models being able to perform arbitrary tasks, even though they are only trained to predict the next word.

We can parameterize this prompt and make it reusable for different questions:

In [None]:
QA_PROMPT = "Question: {input} Answer:"
print(run_gpt3(QA_PROMPT.replace("{input}", "What company did Steve Jobs found?")))
print(run_gpt3(QA_PROMPT.replace("{input}", "What's the movie with Tom Cruise about fighter jets?")))
print(run_gpt3(QA_PROMPT.replace("{input}", "Are tomatoes a fruit or a vegetable?")))

Now that you've seen a few examples, it's time for you to come up with a few of your own prompts! Make sure you parameterize them with `{input}` before sending the prompt to the autograder. All your prompts should be reusable when the autograder does `.replace("{input}", ...)` on them.
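As a minimal sketch of that parameterization step, the hypothetical helper below (`fill_prompt` is not part of the assignment code) checks that a template contains the `{input}` placeholder before substituting. It uses `str.replace`, as the notebook does, rather than `str.format`, so any other braces in the template or the input are left alone.

```python
PLACEHOLDER = "{input}"

def fill_prompt(template: str, value: str) -> str:
    """Substitute value for the {input} placeholder in template."""
    if PLACEHOLDER not in template:
        raise ValueError(f"prompt template is missing {PLACEHOLDER}")
    # str.replace only touches the literal placeholder; any other braces
    # in the template or the value pass through unchanged.
    return template.replace(PLACEHOLDER, value)

qa_prompt = "Question: {input} Answer:"
print(fill_prompt(qa_prompt, "Are tomatoes a fruit or a vegetable?"))
```

A check like this catches a forgotten placeholder before any API credits are spent on a prompt that never sees its input.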

Note: These models are not easy to control. Therefore, it's okay if your prompt does not always get the answer right or also spews extra text along with the answer (as long as the answer comes first). Test it out a few times, and if it seems like it works, then you can try it with the autograder.
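The requirement that the answer come first is stricter than a plain substring check. One rough way to test for it is sketched below; `answer_comes_first` and the sample responses are made up for illustration, and no API call is made.

```python
def answer_comes_first(expected: str, response: str) -> bool:
    """True if response starts with expected, ignoring case and leading whitespace or quotes."""
    cleaned = response.lstrip(" \n\t\"'").lower()
    return cleaned.startswith(expected.lower())

# Hand-written stand-ins for model outputs
print(answer_comes_first("Apple", " Apple Inc. Question: Who founded Microsoft?"))  # True
print(answer_comes_first("Apple", "He founded many companies, including Apple."))   # False
```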

## Example Unit Test
Below is an example usage of the boilerplate unit test. Feel free to use this format in your `report.pdf`, but you are free to modify it as you see fit! As you will see, the expected output and actual output are shown in the cell output (required for submission of the notebook and `report.pdf`).

In [None]:
run_unit_test(QA_PROMPT, "What company did Steve Jobs found?", "Apple")

**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 1.1:** Write a prompt that returns the continent where a country is located.

In [None]:
# TODO
CONTINENT_OF_COUNTRY_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 1.2:** Write a prompt that returns the author of a famous book.

In [None]:
# TODO
AUTHOR_OF_BOOK_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 1.3:** Write a prompt that returns an antonym of a given word. (Hint: use `return_first_line=False` as an argument when using `run_gpt3`)

In [None]:
# TODO
ANTONYMS_OF_WORD_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 1.4:** Write a prompt that given a molecule ("water" or "hydrogen peroxide"), returns the atomic elements that make up that molecule. (Hint: use `return_first_line=False` as an argument when using `run_gpt3`)

In [None]:
# TODO
ELEMENT_OF_MOLECULE_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 1.5:** Write a prompt that given a famous quote ("One small step for man, one giant leap for mankind.", quote characters included), returns the name of the person who said the quote (quotee).

In [None]:
# TODO
QUOTEE_OF_QUOTE_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 1.6:** Extend the prompt from 1.5 by completing this one without question marks ("?") or question words ("Who", "What", etc.). You will only get credit if your prompt does not contain those. Hint: Reading Section 2 may help you with this if you can't figure it out.
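One way to check a prompt against this restriction before submitting is sketched below. The list of question words is an assumption (the problem statement only names "Who" and "What" explicitly), and `violates_question_rule` is a hypothetical helper, not the grading criterion itself.

```python
import re

# Assumed list of question words; only "Who" and "What" are named in the problem
QUESTION_WORDS = {"who", "what", "when", "where", "why", "which", "whose", "whom", "how"}

def violates_question_rule(prompt: str) -> bool:
    """True if the prompt contains a question mark or a whole-word question word."""
    if "?" in prompt:
        return True
    # Split into alphabetic tokens so "whatever" does not count as "what"
    words = re.findall(r"[a-z]+", prompt.lower())
    return any(word in QUESTION_WORDS for word in words)

print(violates_question_rule("Question: Who said {input}? Answer:"))  # True
print(violates_question_rule("The capital of {input} is"))            # False
```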

In [None]:
# TODO
EXTENDED_QUOTEE_OF_QUOTE_PROMPT = ""

In [None]:
# TODO unit tests


# Section 2: Prompt Engineering (20 points)

---



The prompts you have used up to this point have been fairly basic and straightforward to create. But what if you have a more difficult task and it seems like your prompt isn't working? *Prompt engineering* is the process of iterating on a prompt in clever ways to induce the model to produce what you want. The best way to do prompt engineering systematically, rather than randomly, is to understand how the underlying model was trained and what data it was trained on.

Imagine we want the model to generate a quote in Donald Trump's style of talking about a certain topic:

In [None]:
DONALD_TRUMP_PROMPT = "Question: What would Donald Trump say about {input}? Answer:"
DONALD_TRUMP_PROMPT_ENGINEERED_1 = 'On the topic of {input}, Donald Trump was quoted as saying "'
DONALD_TRUMP_PROMPT_ENGINEERED_2 = 'On the topic of {input}, Donald Trump expressed optimism saying "'
DONALD_TRUMP_PROMPT_ENGINEERED_3 = 'On the topic of {input}, Donald Trump expressed doubt saying "'

print(run_gpt3(DONALD_TRUMP_PROMPT.replace("{input}", 'the stock market'))) # Doesn't work
print(run_gpt3(DONALD_TRUMP_PROMPT_ENGINEERED_1.replace("{input}", 'the stock market'))) # Works!
print(run_gpt3(DONALD_TRUMP_PROMPT_ENGINEERED_2.replace("{input}", 'the stock market'))) # Works!
print(run_gpt3(DONALD_TRUMP_PROMPT_ENGINEERED_3.replace("{input}", 'the stock market'))) # Works!

The first naive prompt doesn't really work. After prompt engineering, not only do we get a much more realistic generation of his style, but we can also control whether he is talking about the topic positively or negatively.

**Please respond to the following question in your `report.pdf`**

* **Problem 2.1:** Why did the `DONALD_TRUMP_PROMPT_ENGINEERED_1` prompt work much better than the `DONALD_TRUMP_PROMPT` prompt? (Word Limit: 100 words)

A prompt that is well-engineered can effectively solve difficult NLP tasks that previously were solved by fine-tuning models. In lecture, we showed some examples of these.

**Problem 2.2:** Write a prompt that will solve the [sentiment classification task](https://en.wikipedia.org/wiki/Sentiment_analysis) and classify [movie reviews](https://ai.stanford.edu/~amaas/data/sentiment/) as *positive* or *negative*. `IMDB_DATASET_X` and `IMDB_DATASET_Y` contain 50 reviews and sentiment labels (1 = positive, 0 = negative). Get as high an accuracy as you can on these. Place your `MOVIE_SENTIMENT_PROMPT`, `POSITIVE_VERBALIZERS`, and `NEGATIVE_VERBALIZERS` in `report.pdf` for manual grading, along with your `correct` score (out of 50).

*Warning:* Be careful not to exhaust your free OpenAI credits while testing, you can check [on this page here](https://platform.openai.com/account/usage). To avoid exhausting your credits quickly, test your code on a few examples from the IMDB dataset first, and then scale up to the full 50.
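The scoring cell that follows maps the model's output to a label by checking whether any verbalizer appears as a substring of the first 20 characters, with positive verbalizers checked first. The sketch below re-implements that matching on hand-written strings (no API calls; the verbalizer lists and outputs here are illustrative only) to show two consequences worth keeping in mind when choosing verbalizers: substring matches also fire inside negated phrases, and text beyond the first 20 characters is ignored.

```python
def map_to_label(output, positive, negative, window=20):
    """Mirror of the notebook's matching: substring search in the first `window` characters."""
    head = output[:window].lower()
    for v in positive:
        if v.lower() in head:
            return 1
    for v in negative:
        if v.lower() in head:
            return 0
    return None

positive = ["good", "positive"]  # illustrative verbalizers
negative = ["bad", "negative"]

print(map_to_label("Positive", positive, negative))         # 1
print(map_to_label("Not good at all", positive, negative))  # 1: "good" matches despite the negation
# None: "negative" begins after the 20-character window
print(map_to_label("This review is clearly negative", positive, negative))
```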

In [None]:
# TODO
MOVIE_SENTIMENT_PROMPT = ""

POSITIVE_VERBALIZERS = [
    "good",
    # TODO - Add other positive verbalizers ...
]
NEGATIVE_VERBALIZERS = [
    "bad",
    # TODO - Add other negative verbalizers ...
]

def map_to_sentiment_label(gpt3_output):
    for v in POSITIVE_VERBALIZERS:
        if v.lower() in gpt3_output[:20].lower():
            return 1
    for v in NEGATIVE_VERBALIZERS:
        if v.lower() in gpt3_output[:20].lower():
            return 0
    return None

correct = 0
for review, label in zip(IMDB_DATASET_X, IMDB_DATASET_Y):
    gpt3_output = run_gpt3(MOVIE_SENTIMENT_PROMPT.replace("{input}", review))
    prediction = map_to_sentiment_label(gpt3_output)
    if prediction == label:
        correct += 1
    print(f"Prediction: {prediction}, Label: {label}")
print(f"Correct: {correct}/50")

# Section 3: Few-Shot Prompting (20 points)

The prompts you have seen up until this point are zero-shot prompts, in that we are asking the model to complete a task without any examples. By providing some examples in the prompt, the model becomes significantly more capable. We'll show an example.

Consider the task of figuring out a more complex version of a word:

In [None]:
ZERO_SHOT_COMPLEX_PROMPT = "Question: What is a more complex word for {input}? Answer:"
FEW_SHOT_COMPLEX_PROMPT = "angry : aggrieved\nsad : depressed\n{input} :"

print("Zero shot: ", run_gpt3(ZERO_SHOT_COMPLEX_PROMPT.replace("{input}", 'confused'))) # Doesn't work
print("Few shot: ", run_gpt3(FEW_SHOT_COMPLEX_PROMPT.replace("{input}", 'confused'))) # Works!

The first zero-shot prompt, where we have no examples, doesn't work at all, whereas when we give 2 examples in the few-shot prompt (a 2-shot prompt), it works.
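Few-shot prompts like this one follow a regular pattern, so they can be assembled from a list of example pairs. The sketch below is one way to do that; `build_few_shot_prompt` is a hypothetical helper, and the ` : ` separator simply copies the format used above.

```python
def build_few_shot_prompt(examples, separator=" : "):
    """Join (input, output) pairs one per line, then leave the last line open for the model."""
    lines = [f"{source}{separator}{target}" for source, target in examples]
    # The final line carries the {input} placeholder and ends right where
    # the model should start generating its answer.
    lines.append("{input}" + separator.rstrip())
    return "\n".join(lines)

prompt = build_few_shot_prompt([("angry", "aggrieved"), ("sad", "depressed")])
print(prompt.replace("{input}", "confused"))
```

Building the prompt from data makes it easy to vary the number or order of examples while keeping the format identical across runs.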

Now that you've seen an example of few-shot prompting, it's your turn to try it.

**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 3.1:** Write a few-shot prompt that translates a Spanish word to an English word.

In [None]:
# TODO
SPANISH_TO_ENGLISH_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 3.2:** Write a few-shot prompt that converts an input into a [Jeopardy! style answer](https://en.wikipedia.org/wiki/Jeopardy!#:~:text=Rather%20than%20being%20given%20questions,the%20form%20of%20a%20question.) (The Great Lakes -> "What are the Great Lakes?" or Taylor Swift -> "Who is Taylor Swift?")

In [None]:
# TODO
TO_JEOPARDY_ANSWER_PROMPT = ""

In [None]:
# TODO unit tests


**Please respond to the following question in your `report.pdf`**

**Problem 3.3:** Come up with 3 more arbitrary tasks where a zero-shot prompt might not suffice and a few-shot prompt would be required. Provide a short write-up describing what your tasks are. Provide examples of a zero-shot prompt not working for each. Then, show us your few-shot prompt and some results. Be creative and try to pick 3 tasks that are somewhat distinct from each other!

# Section 4: Prompting Instruction-Tuned Models (18 points)

Large language models can be *instruction-tuned*, that is, fine-tuned with examples of instructions and responses to those instructions, to make them easier to prompt and friendlier to humans. Instruction-tuned models can more easily be given natural language instructions describing a task you want them to complete. This makes them more performant without requiring as much prompt engineering and more likely to succeed with just zero-shot prompting. The version of GPT-3 we were working with in previous exercises was not instruction-tuned. From here on out, we will use instruction-tuned models:

In [None]:
TO_JEOPARDY_INSTRUCTION_PROMPT = "What would a Jeopardy! contestant say if the answer was \"{input}\"?"

print(run_gpt3(TO_JEOPARDY_INSTRUCTION_PROMPT.replace("{input}", 'Taylor Swift'))) # Doesn't work on non-instruction tuned model
print(run_gpt3(TO_JEOPARDY_INSTRUCTION_PROMPT.replace("{input}", 'Taylor Swift'), instruction_tuned=True)) # Works and is simpler!

As you can see, these instruction-tuned models make it much simpler to complete complex tasks since you can "talk" to them naturally. We'll now ask you to try.

**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 4.1:** Write a prompt that returns the syllables of a word (music -> mu-sic).

In [None]:
# TODO
WORD_TO_SYLLABLES_PROMPT = ""

In [None]:
# TODO unit tests


**Please include the prompt, unit tests, and output in your `report.pdf`**

**Problem 4.2:** Modify the word to syllables prompt such that the model only returns the syllables and nothing else. You will only get credit if your model returns the syllables and nothing else.

In [None]:
# TODO
MODIFIED_WORD_TO_SYLLABLES_PROMPT = ""

In [None]:
# TODO unit tests


**Please respond to the following question in your `report.pdf`**

**Problem 4.3:** Come up with 3 more arbitrary tasks, where the non-instruction-tuned model might not suffice, and an instruction-tuned model would be required. Provide a short write up describing what your tasks are. Provide examples of a prompt not working on a non-instruction-tuned model. Then, show us your instruction prompt on an instruction-tuned model and some results. Be creative and try to pick 3 tasks that are somewhat distinct from each other!

# Section 5: Chain-of-Thought Reasoning (30 points)

One recent method to prompt large language models is Chain-of-Thought Prompting. This is similar to few-shot prompting, except you not only provide a few examples, but you also provide an explanation with a reasoning chain to the model. Providing this reasoning chain has been shown to improve performance on a wide variety of tasks.

We demonstrate on a task that concatenates the last letters of the words in a two-word name:

In [None]:
FEW_SHOT_CONCATENATION_PROMPT = '''
Take the last letters of the words 'Alvin Bao' and concatenate them -> no,
Take the last letters of the words 'Otto Bot' and concatenate them -> ot,
Take the last letters of the words 'Behrang Mohit' and concatenate them -> gt,
{input}
'''
COT_CONCATENATION_PROMPT = '''
Take the last letters of the words 'Alvin Bao' and concatenate them
the last letter of 'Alvin' is 'n' the last letter of 'Bao' is o -> no,
Take the last letters of the words 'Otto Bot' and concatenate them
the last letter of 'Otto' is 'o' the last letter of 'Bot' is t -> ot,
Take the last letters of the words 'Behrang Mohit' and concatenate them
the last letter of 'Behrang' is 'g' and the last letter of 'Mohit' is 't' -> gt,
{input}
'''

print(run_gpt3(FEW_SHOT_CONCATENATION_PROMPT.replace("{input}", "Take the last letters of the words 'Stephen Curry' and concatenate them"), instruction_tuned=True)) # Doesn't work without CoT prompting
print(run_gpt3(COT_CONCATENATION_PROMPT.replace("{input}", "Take the last letters of the words 'Stephen Curry' and concatenate them"), instruction_tuned=True)) # Works!

Next, we create a dataset with 50 examples:

In [None]:
import random
import pandas as pd

# Read first name and last name csvs into dataframes
first_name_df = pd.read_csv("https://raw.githubusercontent.com/Ninble/name-census-top-100/main/first-name-database.csv", sep=';')
last_name_df = pd.read_csv("https://raw.githubusercontent.com/Ninble/name-census-top-100/main/surname-database.csv", sep=';')

# Filter only US names
first_name_df = first_name_df[first_name_df["Country code"] == "US"]
last_name_df = last_name_df[last_name_df["Country code"] == "US"]

# Convert Name column into lists
first_names = first_name_df["Name"].tolist()
last_names = last_name_df["Name"].tolist()

def create_concatenation_dataset(n_examples, seed = 42):
    random.seed(seed)

    X = []
    y = []
    for i in range(n_examples):
        first_name = random.choice(first_names)
        last_name = random.choice(last_names)
        full_name = first_name + " " + last_name
        X.append(f"Take the last letters of the words '{full_name}' and concatenate them")
        y.append(first_name[-1] + last_name[-1])
    return X, y

def parse_answer(model_output):
    '''Parses the output of the model to get the final answer (its last two characters).'''
    try:
        return model_output[-2:]
    except TypeError:
        # model_output was not a string (e.g. None)
        return None

concatenation_X, concatenation_y = create_concatenation_dataset(50)


**Please respond to the following question in your `report.pdf`**

**Problem 5.1:** Your job is to investigate how few-shot Chain-of-Thought prompting performs vs. regular few-shot prompting over the entire concatenation dataset and grade how many out of 20 are correct. Perform this experiment 5 times each with a different number of regular few-shot examples (1 example, 2 examples, 4 examples, 8 examples, 16 examples) and 5 times again each with a different number of Chain-of-Thought few-shot examples (1 CoT example, 2 CoT examples, 4 CoT examples, 8 CoT examples, 16 CoT examples).

Create a table or plot of (N examples) vs. (% questions correct by the model with a few-shot prompt with N examples) vs. (% questions correct by the model with a CoT prompt with N examples). Report this table or plot in `report.pdf` with a short write-up about your observations. Keep the code used to build your table or plot in your notebook for inspection during grading.
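A possible shape for this experiment is sketched below. It assumes nothing beyond the prompt formats shown earlier: `make_example`, `build_prompt`, and `accuracy` are hypothetical helpers, the exemplar names are arbitrary, and `fake_model` is a stand-in that always answers correctly so the sketch runs without API calls. In the real experiment the model callable would wrap `run_gpt3(..., instruction_tuned=True)`.

```python
def make_example(name, cot):
    """Render one solved exemplar, with or without a reasoning chain."""
    words = name.split()
    answer = "".join(w[-1] for w in words)
    task = f"Take the last letters of the words '{name}' and concatenate them"
    if not cot:
        return f"{task} -> {answer},"
    steps = " ".join(f"the last letter of '{w}' is '{w[-1]}'" for w in words)
    return f"{task}\n{steps} -> {answer},"

def build_prompt(names, cot):
    """Stack the exemplars and leave {input} for the question being asked."""
    return "\n".join([make_example(n, cot) for n in names] + ["{input}"])

def accuracy(model, prompt, questions, answers):
    """Percentage of questions whose output ends with the expected letters."""
    hits = sum(
        model(prompt.replace("{input}", q)).rstrip(" ,.").endswith(a)
        for q, a in zip(questions, answers)
    )
    return 100.0 * hits / len(questions)

def fake_model(full_prompt):
    # Stand-in for the API (an assumption for this sketch): reads the quoted
    # name on the last line of the prompt and answers correctly.
    name = full_prompt.splitlines()[-1].split("'")[1]
    return "".join(w[-1] for w in name.split())

exemplar_names = ["Alvin Bao", "Otto Bot", "Behrang Mohit"]
questions = ["Take the last letters of the words 'Stephen Curry' and concatenate them"]
answers = ["ny"]
for n in (1, 2):
    for cot in (False, True):
        prompt = build_prompt(exemplar_names[:n], cot)
        print(n, "CoT" if cot else "few-shot", accuracy(fake_model, prompt, questions, answers))
```

Collecting the printed values into rows of (N, few-shot %, CoT %) gives the table requested above.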

*Note #1:* Make sure you use `instruction_tuned = True`.

*Note #2:* For the purposes of grading, you are **not required** to show the 200 (5 few-shot * 20 examples + 5 CoT * 20 examples) examples in the cell output. However, please include the function you wrote to perform this evaluation in `report.pdf`.

*Hint:* You might find the `parse_answer` function helpful when grading how many of the model's outputs are correct or not.
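Note that `parse_answer` returns the last two characters of the output verbatim, so trailing punctuation counts. Since the exemplars in the prompts above end each answer with a comma, the model may do the same. The sketch below shows the difference on a hand-written output; `parse_last_two` copies the notebook's logic under a different name, and `parse_answer_stripped` is a hypothetical variant, not part of the provided code.

```python
def parse_last_two(model_output):
    """Same logic as the notebook's parse_answer: the last two characters, as is."""
    return model_output[-2:]

def parse_answer_stripped(model_output, n_letters=2):
    """Hypothetical variant: drop trailing whitespace and punctuation first."""
    return model_output.rstrip(" \n\t,.'\"")[-n_letters:]

# Hand-written stand-in for a model output
sample = "the last letter of 'Stephen' is 'n' the last letter of 'Curry' is 'y' -> ny,"
print(parse_last_two(sample))         # y,
print(parse_answer_stripped(sample))  # ny
```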

*Warning:* Be careful not to exhaust your free OpenAI credits while testing, you can check [on this page here](https://platform.openai.com/account/usage). To avoid exhausting your credits quickly, test your code on a smaller concatenation dataset first, and then scale up to the full one to report your results.

In [None]:
# TODO - Solve Problem 5.1 here


**Please respond to the following question in your `report.pdf`**

**Problem 5.2 (EXTRA CREDIT):** Your job is to investigate whether or not CoT can extend to names or phrases with more words than those seen in the chain-of-thought examples.

- Step 1: Try running the original concatenation prompt on a name or phrase with 3 words e.g. 'James Earl Jones' (try running it on at least 4-5 names or phrases). Does it work?
- Step 2: Go back and modify the prompt to get it to work
- Step 3: Now try running this on a name or phrase with 5 words (try running it on at least 4-5 names or phrases). Does it work? If not, how would you extend the prompt to get it to work?

Should you decide to do this exploration, please include in your report the prompts you tried, at least 3 examples of names you tried the prompts on (3-word and 5-word), and an explanation of why you tried what you tried.

Feel free to reference [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf), especially section 5 on Symbolic Reasoning.
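For checking outputs on longer phrases, the expected answer can be computed for any number of words. A minimal sketch, with `last_letter_concat` as a hypothetical helper:

```python
def last_letter_concat(phrase: str) -> str:
    """Concatenate the last letter of every whitespace-separated word."""
    return "".join(word[-1] for word in phrase.split())

print(last_letter_concat("James Earl Jones"))           # sls
print(last_letter_concat("The quick brown fox jumps"))  # eknxs
```

The result can be passed as the `expected_answer` argument of `run_unit_test` when testing 3-word and 5-word inputs.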

In [None]:
# TODO (EXTRA CREDIT) Modify the following prompt
MODIFIED_THREE_WORD_COT_CONCATENATION_PROMPT = ''''''

In [None]:
# TODO unit tests


In [None]:
# TODO (EXTRA CREDIT) Modify the following prompt
MODIFIED_FIVE_WORD_COT_CONCATENATION_PROMPT = ''''''

In [None]:
# TODO unit tests


# Submissions

## Submission Checklist (check if you missed anything!)
We will look for the following:
- Section 1:
  - 1.1: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 1.2: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 1.3: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 1.4: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 1.5: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 1.6: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
- Section 2:
  - 2.1: Written response in `report.pdf` (Word Limit: 100 words)
  - 2.2: Written response in `report.pdf` and verification of prompt, verbalizers, and number correct
- Section 3:
  - 3.1: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 3.2: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 3.3: Written response in `report.pdf`
- Section 4:
  - 4.1: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 4.2: Prompt, 3 unit tests, and output included in `report.pdf` and notebook
  - 4.3: Written response in `report.pdf`
- Section 5:
  - 5.1: Written response (table/plot, short write-up) in `report.pdf` and verification of prompt and evaluation function (examples from cell output not required)
  - 5.2: **OPTIONAL**: Written response in `report.pdf` and verification of 2 prompts and 6 unit tests

For the purpose of submission, feel free to complete the entire assignment in the notebook and then convert it into a pdf for the report. However, this is not required as long as all of the deliverables above are satisfied.

**REMINDER**: Include the **expected** answer and the **response** from the OpenAI API in the cell output and `report.pdf` **for all unit tests**.

## GradeScope File Submission

**MAKE SURE TO REMOVE YOUR API KEY FROM YOUR SUBMISSION, YOU WILL LOSE POINTS IF YOUR API KEY IS PRESENT IN THE NOTEBOOK SUBMISSION**

Here are the deliverables you need to submit to GradeScope:
- Write-up (`report.pdf`):
    - All Sections
    - **IMPORTANT**: You will be assigning each page of the report to a question so please make sure you have the correct page(s) assigned to each question, **you will lose 10 points on this assignment if your answers are not assigned correctly**.
- Code:
    - This notebook: make sure it is named `HW4.ipynb` before submitting. You can download the notebook by going to the top-left corner of this webpage and selecting `File -> Download -> Download .ipynb`.