## Assignment 5, Part 1, CS478 Fall 2024

### This is due on November 13th, 2024. Please read the accompanying PDF for submission instructions.
### Note that this is only Part 1 of the assignment. Full points: 40.

_This assignment includes multiple questions that ask about your observations. Please avoid vague answers._

### Task 0: Environment Configuration

#### Step 1: Set up an OpenAI API key
Set up your OpenAI API key below. If you don't have one, create one on OpenAI's platform: https://platform.openai.com/. This assignment will mainly use `gpt-4o-mini-2024-07-18`. Its pricing can be found here: https://openai.com/api/pricing/ ($0.150 / 1M input tokens at the standard rate; the $0.075 / 1M input-token figure is a discounted rate, and output tokens are priced separately on that page).

**NOTE: Please delete your key after you complete this assignment. This is your private key that should not be shared with others (including the instructor/TA).**
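
One way to keep the key out of the notebook itself is to read it from an environment variable, so nothing needs to be deleted from the file afterwards. The sketch below is a minimal illustration, not part of the assignment; the helper name `load_api_key` is hypothetical, and it assumes you have exported `OPENAI_API_KEY` in your shell (it falls back to an interactive prompt otherwise).

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from an environment variable, prompting if it is unset.

    `load_api_key` is a hypothetical helper for illustration only.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key so it never appears in the cell output.
        key = getpass(f"Enter value for {env_var}: ")
    return key


# Usage in the notebook (uncomment to run):
# OPENAI_API_KEY = load_api_key()
```

With this approach the key lives only in your environment or in memory for the session, never in a saved cell.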

#### Step 2: Install the openai Python library

To complete this notebook, we will use the "openai" library for calling OpenAI's language models.

Execute the following command to pip install the library.

In [2]:
!pip install openai 


Now, you should be able to run the following code, which gives a response to the input message "Hello!".

Specifically,
- `client = OpenAI(api_key=OPENAI_API_KEY)` creates a client with your private API key;
- `client.chat.completions.create` calls OpenAI's chat completion function (https://platform.openai.com/docs/api-reference/chat);
    - Field `model` specifies the LLM version to use, here "gpt-4o-mini-2024-07-18"
    - Field `messages` contains the chat history which is used to prompt the LLM for a response, including
        - `{"role": "system", "content": "You are a helpful assistant."}` which specifies the system description (being a helpful assistant),
        - `{"role": "user", "content": "Hello!"}` which specifies the user input "Hello!"

The returned chat completion object (https://platform.openai.com/docs/api-reference/chat/object) includes one possible response (`choices[0]`) whose message content is "Hello! How can I assist you today?"

In [3]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
  model="gpt-4o-mini-2024-07-18",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)

Hello! How can I assist you today?


In this assignment, we will use this chat completion function to prompt GPT-4o-mini for a few tasks. For convenience, let's define the following wrapper function called `ChatCompletion` on top of OpenAI's chat completion.

Note that in the function, we have included three additional arguments to the API call:
- `n_samples` is passed as the argument `n` to `client.chat.completions.create`, which specifies the number of samples requested from the LLM;
- `top_p` is passed as the argument `top_p` to `client.chat.completions.create`, which specifies the cumulative probability mass p to sample from, following the "nucleus sampling" approach (see notes in Lecture 3);
- `max_completion_tokens` is passed as the argument `max_tokens` to `client.chat.completions.create`, which specifies the maximum number of tokens the LLM is allowed to generate when it completes the chat (i.e., when it generates the response). This is implemented as a simple "cutoff" of generation.
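
The nucleus sampling idea behind `top_p` can be sketched on a toy next-token distribution. This is only an illustration of the general technique, not how OpenAI implements it internally; the function names and the toy probabilities are made up for the example.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize (illustrative sketch)."""
    assert 0 < top_p <= 1
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}


def sample_top_p(probs, top_p, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution (hypothetical numbers, for illustration only).
toy = {"Barcelona": 0.5, "Paris": 0.3, "Tokyo": 0.15, "Nairobi": 0.05}
print(top_p_filter(toy, 0.1))       # only the single most likely token remains
print(len(top_p_filter(toy, 1.0)))  # all four tokens remain
```

With a small `top_p` the candidate set collapses to the most likely token, so sampling behaves almost greedily; with `top_p=1.0` the full distribution is sampled.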

In [4]:
def ChatCompletion(prompt, n_samples=1, top_p=1.0, max_completion_tokens=500, return_object=False):
    assert n_samples >= 1
    assert top_p <= 1 and top_p > 0
    completion = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        n = n_samples,
        top_p=top_p,
        max_tokens=max_completion_tokens
    )

    if n_samples == 1:
        print("Response: ", completion.choices[0].message.content)
    else:
        print("The call returns %d responses:\n" % n_samples)
        for i in range(n_samples):
            print("*Response %d*: " % i, completion.choices[i].message.content)
            print("-" * 10)
    
    if return_object:
        return completion

### Task 1: Story Generation with Different Sampling Strategies (20 points)

In the first task, we will learn about the different generation effects of a sampling approach called "nucleus sampling". We will try different configurations of `top_p`.


### Question 1 (5 points) 

Can you use the `ChatCompletion` function, with "Generate a short story about a student studying abroad" as the prompt, to instruct GPT-4o-mini to generate a story? Please use the default settings for the other configurations.

In [5]:
# YOUR CODE HERE
ChatCompletion("Generate a short story about a student studying abroad")

Response:  **Title: A Journey Beyond Borders**

Lila Fletcher stood at the bustling airport terminal, her heart racing with anticipation and trepidation. She had dreamt of studying abroad ever since she could remember, and now, at eighteen, her dream was finally taking flight. With a one-way ticket to Barcelona, she felt a blend of excitement and nervousness wash over her.

Arriving in Spain was like stepping into a dream; the vibrant colors, the lively chatter in a language she’d only practiced in her bedroom, and the aroma of fresh churros wafting through the air captivated her senses. Lila's host family, the Garcías, greeted her with warm smiles and a feast of tapas, which she eagerly devoured, despite her half-formed Spanish.

Her classes at the University of Barcelona unfolded like the pages of a novel. Each lesson in art history felt like stepping into a brushstroke of Picasso or a whispered secret from the Sagrada Familia. Lila immersed herself entirely, scribbling notes filled 

### Question 2 (2.5 points)

Now, can you do it with the same prompt but try to get 2 generations (keep `top_p` and `max_completion_tokens` default)?

In [6]:
# YOUR CODE HERE
ChatCompletion("Generate a short story about a student studying abroad", n_samples=2)


The call returns 2 responses:

*Response 0*:  **Title: A Patchwork of Memories**

Ava had always dreamt of studying abroad, her heart set on the cobbled streets of Florence, Italy. With a suitcase packed and her mind swirling in anticipation, she boarded the plane, her excitement tempered by the faint twinge of homesickness. The moment she stepped out into the sun-kissed piazza, with its Renaissance architecture and scent of fresh espresso wafting through the air, her enthusiasm bloomed.

Her university program was two semesters long, a tapestry woven with art history, language immersion, and Italian culture. She quickly settled into a small apartment with two other students, Leo, an adventurous soul from Brazil, and Mei, a meticulous planner from Japan. Their cultural differences created an unexpected harmony, and they formed a friendship that felt like family.

Classes began with a whirlwind of colors and sounds. Ava loved her art history course, where her professor, a spirited Itali

### Question 3 (2.5 points) 

How about 2 generations with `top_p` set to be 0.1 (keep `max_completion_tokens` default)?

In [7]:
# YOUR CODE HERE
ChatCompletion("Generate a short story about a student studying abroad", n_samples=2, top_p=0.1)

The call returns 2 responses:

*Response 0*:  **Title: A Journey Beyond Borders**

Sophie had always dreamed of studying abroad. Growing up in a small town in Ohio, she often found herself lost in the pages of travel books, imagining the vibrant streets of Paris, the serene canals of Venice, and the bustling markets of Tokyo. When she received her acceptance letter to a university in Barcelona, Spain, her heart raced with excitement and a hint of anxiety.

As she stepped off the plane, the warm Mediterranean air enveloped her like a welcoming embrace. The city was alive with color and sound; the aroma of fresh paella wafted through the streets, mingling with the laughter of children playing in the plazas. Sophie felt a rush of exhilaration as she navigated her way through the bustling airport, her heart pounding with the thrill of adventure.

Her first few weeks were a whirlwind of orientation sessions, new friends, and cultural immersion. She met students from all over the world—each 

### Question 4 (5 points) 

Did the different `top_p` configurations give you the same or different results? Why do you think it could happen?

<font color='blue'>PLEASE WRITE YOUR ANSWER IN THE PDF.</font>

### Question 5 (5 points)

Let's go back and do the same as in Question 2, but set the number of generations or samples to 10 (keep `top_p` and `max_completion_tokens` at their default values). Read the 10 stories, check which country the character studies in, and report the statistics in the PDF. An example table is shown below:

| Country | Counts | 
| ----------- | ----------- | 
|    e.g., U.S.   |    e.g., 4   | 
|    e.g., China   |    e.g., 3   |  
|    e.g., India | e.g., 3 |

What did you observe? Are the generations regionally diverse? Why do you think it could happen?

<font color='blue'>PLEASE WRITE YOUR ANSWER IN THE PDF.</font>
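
Once you have read the stories and noted the destination country for each, the tally can be produced mechanically. The sketch below is one possible way to do it; `tally_table` is a hypothetical helper, and the labels reuse the placeholder values from the example table above, not real results.

```python
from collections import Counter


def tally_table(labels):
    """Build a Markdown table of country counts from hand-assigned labels."""
    counts = Counter(labels)
    lines = ["| Country | Counts |", "| ----------- | ----------- |"]
    for country, count in counts.most_common():
        lines.append(f"| {country} | {count} |")
    return "\n".join(lines)


# Placeholder labels mirroring the example table (not actual model output).
labels = ["U.S."] * 4 + ["China"] * 3 + ["India"] * 3
print(tally_table(labels))
```

Replace `labels` with the ten countries you identified, one entry per story.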

In [8]:
# YOUR CODE HERE
ChatCompletion("Generate a short story about a student studying abroad", n_samples=10)

The call returns 10 responses:

*Response 0*:  **Title: A New Horizon**

Emma Thompson felt a mix of excitement and nerves as she stepped off the plane at Barcelona International Airport. It was her first time studying abroad, and the thrill of adventure coursed through her veins. She was a sophomore at university, majoring in architecture, and had won a competitive scholarship to study at the prestigious Escola Tècnica Superior d'Arquitectura de Barcelona for the semester.

As she made her way through the bustling airport, she clutched her small suitcase tightly, her other hand gripping the map she had printed out from the internet. After a quick glance, she headed toward the train station, her heart racing as the reality of her dream began to crystallize.

Barcelona was a kaleidoscope of colors, a tapestry woven with rich histories and contemporary wonders. Emma found herself enchanted by the vibrant streets, the intricate details of Gaudí’s creations, and the lively chatter of local

### Task 2: GPT-4o-mini for Solving Mathematical Problems (20 points)

The second task we will try is about solving a math problem.

The math problem we consider is:

> Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at the green house, 2 more to the red house, and half of what was left at the orange house. If Melanie has 5 vacuum cleaners left, how many did she start with?

For your reference, the correct answer should be 18, following the reasoning chain below:

> First multiply the five remaining vacuum cleaners by two to find out how many Melanie had before she visited the orange house: 5 * 2 = 10; 
> Then add two to figure out how many vacuum cleaners she had before visiting the red house: 10 + 2 = 12;
> Now we know that 2/3 * x = 12, where x is the number of vacuum cleaners Melanie started with. We can find x by dividing each side of the equation by 2/3, which produces x = 18
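
The reasoning chain above can be checked with a few lines of exact arithmetic. This sketch works backward from the 5 remaining vacuum cleaners and then replays the sales forward as a sanity check; the function names are chosen for illustration.

```python
from fractions import Fraction


def solve_backward(left):
    """Undo each sale in reverse order to recover the starting count."""
    before_orange = Fraction(left) * 2       # half was sold at the orange house
    before_red = before_orange + 2           # 2 were sold at the red house
    return before_red / Fraction(2, 3)       # two thirds remained after the green house


def vacuums_left(start):
    """Replay the three sales forward from a given starting count."""
    remaining = Fraction(start)
    remaining -= remaining / 3               # green house: a third sold
    remaining -= 2                           # red house: 2 more sold
    remaining -= remaining / 2               # orange house: half of the rest sold
    return remaining


print(solve_backward(5))   # 18
print(vacuums_left(18))    # 5
```

The forward replay is handy when grading model outputs: any proposed answer can be plugged into `vacuums_left` to see whether it leaves exactly 5.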

### Question 6 (10 points)
Call the ChatCompletion function and prompt GPT-4o-mini to solve the problem. You can directly use `math_problem` as the prompt. To understand how reliably the LLM can solve this problem, let's generate 5 answers in a similar way as Q2 (keep `top_p` to be the default value but set `max_completion_tokens` to be `1000`).

In [9]:
math_problem = 'Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at the green house, 2 more to the red house, and half of what was left at the orange house. If Melanie has 5 vacuum cleaners left, how many did she start with?'

# YOUR CODE HERE
ChatCompletion(math_problem, n_samples=5, max_completion_tokens=1000)


The call returns 5 responses:

*Response 0*:  Let \( x \) be the number of vacuum cleaners Melanie started with. 

1. **Sales at the green house**: She sold a third of her vacuum cleaners, which is \( \frac{x}{3} \). Therefore, the number of vacuum cleaners she has left after this sale is:
   \[
   x - \frac{x}{3} = \frac{2x}{3}
   \]

2. **Sales at the red house**: She sold 2 more vacuum cleaners. After this sale, the number of vacuum cleaners remaining is:
   \[
   \frac{2x}{3} - 2
   \]

3. **Sales at the orange house**: She sold half of what was left. The amount left before this sale is \( \frac{2x}{3} - 2 \). Now, she sells half of that amount:
   \[
   \text{Vacuum cleaners sold at orange house} = \frac{1}{2} \left( \frac{2x}{3} - 2 \right)
   \]

   Thus, the number of vacuum cleaners left after this sale is:
   \[
   \left( \frac{2x}{3} - 2 \right) - \frac{1}{2} \left( \frac{2x}{3} - 2 \right)
   \]

   We can simplify this. Substituting \( y = \frac{2x}{3} - 2 \):
   \[
   y -

Are all the solutions correct? Read them carefully and report what you observed (e.g., how did the model solve the problem? did it solve the problem in the same way in its 5 solutions? anything interesting from its solutions?).

<font color='blue'>PLEASE WRITE YOUR ANSWER IN THE PDF.</font>

### Question 7 (10 points)

As you can see from the prior answer, the LLM's response can be highly unstructured, so checking its math solution requires reading the answer carefully. While the model solves the math problem step by step, it does not always indicate the step number.

To make the LLM response more structured, OpenAI has introduced the capability called "structured output" (https://platform.openai.com/docs/guides/structured-outputs). A brief guide of its usage can be found at https://platform.openai.com/docs/guides/structured-outputs/how-to-use. 

Can you complete the following code and define a `StructuredChatCompletion` function that returns a step-by-step math solution? The required structured format is as follows (the same format as `MathResponse` in the guide):
```
{
    steps: [
        {
            explanation: a string of explanation,
            output: a string of the current step's output
        },
        ...
    ],
    final_answer: a string of the final math-solving answer
}
```
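
To make the target shape concrete, here is a small standard-library sketch that parses a JSON string in this format and checks its fields. It does not call the OpenAI API; `check_math_response` is a hypothetical helper, and the example instance is hand-written from the reference reasoning chain in Task 2 rather than taken from a model response.

```python
import json


def check_math_response(raw):
    """Parse a JSON string and verify it matches the required format."""
    data = json.loads(raw)
    if not isinstance(data.get("steps"), list):
        raise ValueError("'steps' must be a list")
    for step in data["steps"]:
        for field in ("explanation", "output"):
            if not isinstance(step.get(field), str):
                raise ValueError(f"each step needs a string '{field}'")
    if not isinstance(data.get("final_answer"), str):
        raise ValueError("'final_answer' must be a string")
    return data


# Hand-written instance based on the reference solution (not model output).
example = json.dumps({
    "steps": [
        {"explanation": "Double the 5 remaining to undo the orange house.", "output": "10"},
        {"explanation": "Add back the 2 sold at the red house.", "output": "12"},
        {"explanation": "Solve 2/3 * x = 12 for x.", "output": "18"},
    ],
    "final_answer": "18",
})

parsed = check_math_response(example)
print(len(parsed["steps"]), parsed["final_answer"])  # 3 18
```

Note that every value, including numeric outputs, is a string in this format.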

In [12]:
from pydantic import BaseModel

# TODO: Define your structured object
class Step(BaseModel):
    explanation: str
    output: str

class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str

def StructuredChatCompletion(prompt, n_samples=1, top_p=1.0, max_completion_tokens=500, return_object=False):
    assert n_samples >= 1
    assert top_p <= 1 and top_p > 0

    # TODO: add a `response_format` argument and specify the structured output format
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        n = n_samples,
        top_p=top_p,
        max_tokens=max_completion_tokens,
        response_format=MathReasoning
    )

    # The following code has been modified to parse the structured output. No further edits are needed.
    if n_samples == 1:
        math_response = completion.choices[0].message.parsed
        print("Response: ")
        print(math_response.steps)
        print(math_response.final_answer)
    else:
        print("The call returns %d responses:\n" % n_samples)
        for i in range(n_samples):
            math_response = completion.choices[i].message.parsed
            print("*Response %d*: " % i)
            print(math_response.steps)
            print(math_response.final_answer)
            print("-" * 10)
    
    if return_object:
        return completion

Use this `StructuredChatCompletion` function and prompt GPT-4o-mini to generate 5 solutions following the same setting as in Q6 (set `max_completion_tokens` to be `1000`). 

In [13]:
# YOUR CODE HERE
StructuredChatCompletion(math_problem, n_samples=5, max_completion_tokens=1000)

The call returns 5 responses:

*Response 0*: 
[Step(explanation='Let the total number of vacuum cleaners Melanie started with be x.', output='x'), Step(explanation='At the green house, Melanie sold a third of her vacuum cleaners, so she sold x/3 at the green house. After this, the number of vacuum cleaners left is x - x/3 = (2/3)x.', output='(2/3)x'), Step(explanation='At the red house, she sold 2 more, so after selling that, the number of vacuum cleaners left is (2/3)x - 2.', output='(2/3)x - 2'), Step(explanation='At the orange house, she sold half of what was left. We need to calculate how many were left before selling at the orange house: (2/3)x - 2. Half of this amount is (1/2)((2/3)x - 2) = (1/3)x - 1.', output='(1/3)x - 1'), Step(explanation='After selling at the orange house, the number of vacuum cleaners left is: ((2/3)x - 2) - ((1/3)x - 1). This simplifies to (2/3)x - (1/3)x - 2 + 1 = (1/3)x - 1.', output='(1/3)x - 1'), Step(explanation='We know that after all these sales, Me

Include one response in the report PDF and describe what you observed (e.g., did the model follow the structured format? did forcing it to structure its output hurt the model performance? share anything you found!)

<font color='blue'>PLEASE WRITE YOUR ANSWER IN THE PDF.</font>

#### Acknowledgement: The math problems used in this notebook come from the GSM8k dataset: Training Verifiers to Solve Math Word Problems, Cobbe et al., 2021. https://huggingface.co/datasets/gsm8k 