# Accessing OpenAI Like a Developer

- 🤝 Breakout Room #1:
  1. Getting Started
  2. Setting Environment Variables
  3. Using the OpenAI Python Library
  4. Prompt Engineering Principles
  5. Testing Your Prompt

# How AIM Does Assignments

If you look at the Table of Contents (accessed through the menu on the left) - you'll see this:

![image](https://i.imgur.com/I8iDTUO.png)

Or this if you're in Colab:

![image](https://i.imgur.com/0rHA1yF.png)

You'll notice during assignments that we have the following two categories:

1. ❓ - Questions. These will involve...answering questions!
2. 🏗️ - Activities. These will involve writing code, or modifying text.

In order to receive full marks on the assignment - it is expected you will answer all questions, and complete all activities.

## 1. Getting Started

The first thing we'll do is load the [OpenAI Python Library](https://github.com/openai/openai-python/tree/main)!

In [8]:
!pip install openai -q

[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: pip install --upgrade pip


## 2. Setting Environment Variables

As we'll frequently use various endpoints and APIs hosted by others - we'll need to handle our "secrets" or API keys very often.

We'll use the following pattern throughout this bootcamp - but you can use whichever method you're most familiar with.

In [105]:
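import os
from dotenv import find_dotenv, dotenv_values

# Load every key/value pair from the nearest .env file.
keys = list(dotenv_values(find_dotenv('.env')).items())
# NOTE: this uses the value of the FIRST entry, so OPENAI_API_KEY must be the first line in .env.
os.environ['OPENAI_API_KEY'] = keys[0][1]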
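
The pattern above depends on the order of entries in `.env`. As a minimal sketch of a more defensive check, the snippet below looks a variable up by name and fails with a readable message when it is missing. It uses only the standard library; `require_env` and `DEMO_API_KEY` are hypothetical names for illustration, not part of `python-dotenv` or `openai`.

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, or raise a clear error if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return value

# Hypothetical usage with a placeholder value (never hard-code a real key).
os.environ.setdefault("DEMO_API_KEY", "sk-placeholder")
print(require_env("DEMO_API_KEY")[:3])
```

Calling `require_env("OPENAI_API_KEY")` right after loading `.env` would surface a missing key immediately, rather than later as an authentication error.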


## 3. Using the OpenAI Python Library

Let's jump right into it!

> NOTE: You can, and should, reference OpenAI's [documentation](https://platform.openai.com/docs/api-reference/authentication?lang=python) whenever you get stuck, have questions, or want to dive deeper.

### Creating a Client

The core feature of the OpenAI Python Library is the `OpenAI()` client. It's how we're going to interact with OpenAI's models, and it's under the hood of a lot of what we'll touch on throughout this course.

> NOTE: We could manually provide our API key here, but we're going to instead rely on the fact that we put our API key into the `OPENAI_API_KEY` environment variable!

In [107]:
import os
from openai import OpenAI

openai_client = OpenAI()
openai_client.api_key = os.getenv('OPENAI_API_KEY')  # optional: OpenAI() already reads OPENAI_API_KEY from the environment


### Using the Client

Now that we have our client - we're going to use the `.chat.completions.create` method to interact with the `gpt-3.5-turbo` model.

There are a few things we'll get out of the way first, however, the first being the idea of "roles".

First it's important to understand the object that we're going to use to interact with the endpoint. It expects us to send an array of objects of the following format:

```python
{"role" : "ROLE", "content" : "YOUR CONTENT HERE", "name" : "THIS IS OPTIONAL"}
```

Second, there are three "roles" available to use to populate the `"role"` key:

- `system`
- `assistant`
- `user`

OpenAI provides some context for these roles [here](https://help.openai.com/en/articles/7042661-moving-from-completions-to-chat-completions-in-the-openai-api).

We'll explore these roles in more depth as they come up - but for now we're going to just stick with the basic role `user`. The `user` role is, as it would seem, the user!
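
To make the message format concrete, here is a small sketch that builds message dictionaries and rejects roles outside the three listed above. `make_message` and `VALID_ROLES` are hypothetical names used only for illustration; the API itself simply expects a list of dictionaries in this shape.

```python
from typing import Optional

# The three roles described above.
VALID_ROLES = ("system", "assistant", "user")

def make_message(role: str, content: str, name: Optional[str] = None) -> dict:
    """Build one chat message dict, including the optional "name" key only when given."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role {role!r}; expected one of {VALID_ROLES}")
    message = {"role": role, "content": content}
    if name is not None:
        message["name"] = name
    return message

# A conversation is an ordered list of these messages.
messages = [
    make_message("system", "You are a helpful assistant."),
    make_message("user", "Hello, how are you?"),
]
print(messages)
```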

Thirdly, it expects us to specify a model!

We'll use the `gpt-3.5-turbo` model as stated above.

Let's look at an example!



In [108]:
response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role" : "user", "content" : "Hello, how are you?"}]
)

Let's look at the response object.

In [109]:
response

ChatCompletion(id='chatcmpl-9Ulcols0xLUQVmnjPvZqHz2Kx994I', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Hello! I'm just a computer program, so I don't have feelings in the same way humans do. How can I assist you today?", role='assistant', function_call=None, tool_calls=None))], created=1717119938, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=29, prompt_tokens=13, total_tokens=42))

> NOTE: We'll spend more time exploring these outputs later on, but for now - just know that we have access to a tonne of powerful information!
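
As a rough sketch of how to pull the useful fields out of a response like the one above, the snippet below reads the reply text, finish reason, model name, and token count by attribute access. To stay runnable without an API call it uses a stand-in object built with `SimpleNamespace` that only mimics the fields shown in the printed `ChatCompletion`; `summarize_response` and `fake_response` are hypothetical names.

```python
from types import SimpleNamespace

def summarize_response(response) -> dict:
    """Collect the fields we usually care about from a chat completion-shaped object."""
    choice = response.choices[0]  # the first (and here, only) generated reply
    return {
        "content": choice.message.content,
        "finish_reason": choice.finish_reason,
        "model": response.model,
        "total_tokens": response.usage.total_tokens,
    }

# Stand-in object (an assumption for illustration) mirroring the output printed above.
fake_response = SimpleNamespace(
    model="gpt-3.5-turbo-0125",
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(role="assistant", content="Hello!"),
        )
    ],
    usage=SimpleNamespace(completion_tokens=29, prompt_tokens=13, total_tokens=42),
)

summary = summarize_response(fake_response)
print(summary["content"], summary["total_tokens"])
```

The same attribute paths (`response.choices[0].message.content`, `response.usage.total_tokens`) apply to the real `response` object from the cell above.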

### Helper Functions

We're going to create some helper functions to aid in using the OpenAI API - just to make our lives a bit easier.

> NOTE: Take some time to understand these functions between class!

In [111]:
from IPython.display import display, Markdown
from openai.types.chat import ChatCompletion

def get_response(client: OpenAI, messages: list, model: str = "gpt-3.5-turbo") -> ChatCompletion:
    # Returns the full ChatCompletion object, not just the reply text.
    return client.chat.completions.create(
        model=model,
        messages=messages
    )

def system_prompt(message: str) -> dict:
    return {"role": "system", "content": message}

def assistant_prompt(message: str) -> dict:
    return {"role": "assistant", "content": message}

def user_prompt(message: str) -> dict:
    return {"role": "user", "content": message}

def pretty_print(message: ChatCompletion) -> None:
    # Render the first choice's reply as Markdown in the notebook.
    display(Markdown(message.choices[0].message.content))

### Testing Helper Functions

Let's see how we can use these to help us!

In [112]:
YOUR_PROMPT = "Hello, how are you?"
messages_list = [user_prompt(YOUR_PROMPT)]

chatgpt_response = get_response(openai_client, messages_list)

pretty_print(chatgpt_response)

Hello! I'm just a computer program, so I don't have feelings, but I'm here to assist you. How can I help you today?

### System Role

Now we can extend our prompts to include a system prompt.

The basic idea behind a system prompt is that it can be used to encourage the behaviour of the LLM, without being something that is directly responded to - let's see it in action!

In [115]:
list_of_prompts = [
    system_prompt("You are irate and extremely hungry. Feel free to express yourself using PG-13 language."),
    user_prompt("Do you prefer crushed ice or cubed ice?")
]

irate_response = get_response(openai_client, list_of_prompts)
pretty_print(irate_response)

I couldn't give a damn about the shape of my ice as long as it chills my drink! Just give me some damn ice before I lose my mind from this hunger!

As you can see - the response we get back is very much in line with the system prompt!

Let's try the same user prompt, but with a different system prompt to see the difference.

In [116]:
list_of_prompts = [
    system_prompt("You are joyful and having the best day. Please act like a person in that state of mind."),
    user_prompt("Do you prefer crushed ice or cubed ice?")
]

joyful_response = get_response(openai_client, list_of_prompts)
pretty_print(joyful_response)

Oh, I love crushed ice! It just makes any drink feel extra refreshing and fancy, don't you think? Today is such a great day - perfect for enjoying a cold drink with some crushed ice. What about you, what's your go-to ice preference?

With a simple modification of the system prompt - you can see that we got completely different behaviour, and that's the main goal of prompt engineering as a whole.

Also, congrats, you just engineered your first prompt!

### Few-shot Prompting

Now that we have a basic handle on the `system` role and the `user` role - let's examine what we might use the `assistant` role for.

The most common usage pattern is to "pretend" that we're answering our own questions. This helps us further guide the model toward our desired behaviour. While this is an oversimplification - it's conceptually well aligned with few-shot learning.

First, we'll try and "teach" `gpt-3.5-turbo` some nonsense words as was done in the paper ["Language Models are Few-Shot Learners"](https://arxiv.org/abs/2005.14165).

In [117]:
list_of_prompts = [
    user_prompt("Please use the words 'stimple' and 'falbean' in a sentence.")
]

stimple_response = get_response(openai_client, list_of_prompts)
pretty_print(stimple_response)

I was trying to cook a fancy meal, but ended up making a stimple dish of falbean stew instead.

As you can see, the model is unsure what to do with these made-up words.

Let's see if we can use the `assistant` role to show the model what these words mean.

In [118]:
list_of_prompts = [
    user_prompt("Something that is 'stimple' is said to be good, well functioning, and high quality. An example of a sentence that uses the word 'stimple' is:"),
    assistant_prompt("'Boy, that there is a stimple drill'."),
    user_prompt("A 'falbean' is a tool used to fasten, tighten, or otherwise is a thing that rotates/spins. An example of a sentence that uses the words 'stimple' and 'falbean' is:")
]

stimple_response = get_response(openai_client, list_of_prompts)
pretty_print(stimple_response)

"Using a stimple screwdriver and a falbean wrench, the mechanic was able to quickly repair the car engine."

As you can see, leveraging the `assistant` role makes for a stimple experience!
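
The alternating `user`/`assistant` pattern above can be generated from a list of example pairs. The following is a minimal sketch under that assumption; `few_shot_messages` is a hypothetical helper, and the example strings are shortened versions of the prompts used above.

```python
from typing import List, Tuple

def few_shot_messages(examples: List[Tuple[str, str]], final_prompt: str) -> List[dict]:
    """Turn (user text, assistant text) pairs plus a final question into a messages list."""
    messages = []
    for user_text, assistant_text in examples:
        # Each example is a user turn followed by the answer we "pretend" the model gave.
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real question always goes last, as a user turn.
    messages.append({"role": "user", "content": final_prompt})
    return messages

demo = few_shot_messages(
    [("Use the word 'stimple' in a sentence.", "'Boy, that there is a stimple drill'.")],
    "Use the words 'stimple' and 'falbean' in a sentence.",
)
print([m["role"] for m in demo])
```

The resulting list can be passed to `get_response` in place of a hand-written `list_of_prompts`.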

### 🏗️ Activity #1:

Use few-shot prompting to build a movie-review sentiment classifier!

A few examples:

INPUT: "I hated the hulk!"
OUTPUT: {"sentiment" : "negative"}

INPUT: "I loved The Marvels!"
OUTPUT: {"sentiment" : "positive"}

In [123]:
list_of_prompts = [
    user_prompt("'Man, the new Star Wars movie really sucked!' conveys a sentiment that can be analyzed. An example of this analysis would be:"),
    assistant_prompt("'sentiment : negative'"),
    user_prompt("'I really loved the Lego Movie!'")
]

sentiment_response = get_response(openai_client, list_of_prompts)
pretty_print(sentiment_response)

sentiment: positive
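
The activity's examples use a JSON-shaped output, so one way to check a classifier's reply is to parse it as JSON and validate the label. This is only a sketch: `parse_sentiment` and `ALLOWED_SENTIMENTS` are hypothetical names, and it assumes the few-shot examples were written to return JSON such as `{"sentiment" : "positive"}`.

```python
import json

ALLOWED_SENTIMENTS = ("positive", "negative")

def parse_sentiment(reply: str) -> str:
    """Parse a JSON reply like {"sentiment": "positive"} and return the label."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"Reply is not valid JSON: {reply!r}") from err
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object, got: {reply!r}")
    sentiment = data.get("sentiment")
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"Unexpected sentiment value: {sentiment!r}")
    return sentiment

print(parse_sentiment('{"sentiment" : "positive"}'))
```

A plain-text reply such as `sentiment: positive` would raise a `ValueError` here, which is a quick signal that the few-shot examples need to demonstrate the JSON shape more explicitly.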

### Chain of Thought Prompting

We'll head one level deeper and explore the world of Chain of Thought prompting (CoT).

This is a process by which we can encourage the LLM to handle slightly more complex tasks.

Let's look at a simple reasoning based example without CoT.

> NOTE: With improvements to `gpt-3.5-turbo`, this example might actually result in the correct response some percentage of the time!

In [124]:
reasoning_problem = """
Billy wants to get home from San Fran. before 7PM EDT.

It's currently 1PM local time.

Billy can either fly (3hrs), and then take a bus (2hrs), or Billy can take the teleporter (0hrs) and then a bus (1hrs).

Does it matter which travel option Billy selects?
"""

list_of_prompts = [
    user_prompt(reasoning_problem)
]

reasoning_response = get_response(openai_client, list_of_prompts)
pretty_print(reasoning_response)

Since Billy wants to get home before 7PM EDT, the most important factor to consider is the time zone difference between San Francisco and EDT. San Francisco is in the Pacific Time Zone (PDT), which is 3 hours behind EDT. Therefore, 7PM EDT is 4PM PDT.

If Billy takes the flight and bus option, it will take a total of 5 hours, meaning he will arrive home at 6PM PDT, still before his desired arrival time.

If Billy takes the teleporter and bus option, it will take a total of 1 hour, meaning he will arrive home at 2PM PDT, well before his desired arrival time.

Therefore, in this scenario, it doesn't matter which travel option Billy selects as both options will get him home before 7PM EDT. However, the teleporter and bus option will get him home earlier.

As humans, we can reason through the problem and pick up on the potential "trick" that the LLM fell for: 1PM *local time* in San Fran. is 4PM EDT. This means the cumulative travel time of 5hrs. for the plane/bus option would not get Billy home in time.
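
The arithmetic behind that reasoning can be checked with a short sketch. It uses fixed UTC offsets for PDT (UTC-7) and EDT (UTC-4) and an arbitrary date, both of which are assumptions made for illustration; only the clock times matter.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset time zones (assumed for this sketch): PDT is UTC-7, EDT is UTC-4.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")

# The date is arbitrary; Billy leaves at 1PM local (Pacific) time.
departure = datetime(2024, 6, 1, 13, 0, tzinfo=PDT)
deadline = datetime(2024, 6, 1, 19, 0, tzinfo=EDT)  # must be home before 7PM EDT

options = {
    "fly + bus": timedelta(hours=3 + 2),
    "teleporter + bus": timedelta(hours=0 + 1),
}

results = {}
for name, travel_time in options.items():
    arrival = (departure + travel_time).astimezone(EDT)
    results[name] = arrival < deadline
    print(f"{name}: arrives {arrival:%H:%M} EDT, on time: {results[name]}")
```

The fly-and-bus option lands at 9PM EDT, after the deadline, while the teleporter-and-bus option lands at 5PM EDT, so the choice does matter.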

Let's see if we can leverage a simple CoT prompt to improve our model's performance on this task:

In [125]:
list_of_prompts = [
    user_prompt(reasoning_problem + " Think through your response step by step.")
]

reasoning_response = get_response(openai_client, list_of_prompts)
pretty_print(reasoning_response)

In this scenario, it does matter which travel option Billy selects because he wants to arrive home before 7PM EDT and it is currently 1PM local time. 

If Billy chooses to fly and then take a bus, it will take him a total of 5 hours (3 hours of flying + 2 hours of bus travel). If he leaves San Francisco at 1PM local time, he will arrive at his destination at 6PM local time. Since there is a 3-hour time difference between San Francisco and EDT, Billy will arrive at 9PM EDT, which is after his desired arrival time of before 7PM EDT.

On the other hand, if Billy chooses to take the teleporter and then a bus, it will only take him a total of 1 hour (0 hours of teleportation + 1 hour of bus travel). If he leaves San Francisco at 1PM local time, he will arrive at his destination at 2PM local time. Considering the time difference, he will arrive at 5PM EDT, which is before his desired arrival time of before 7PM EDT.

Therefore, Billy should choose the option of taking the teleporter and then a bus in order to arrive home before 7PM EDT.

With the addition of a single phrase `"Think through your response step by step."` we're able to completely turn the response around.

## 4. Prompt Engineering Principles

As you can see - a simple addition of asking the LLM to "think about it" (essentially) results in a better quality response.

There's a [great paper](https://arxiv.org/pdf/2312.16171v1.pdf) that dives into some principles for effective prompt generation.

Your task for this notebook is to construct a prompt that will be used in the following breakout room to create a helpful assistant for whatever task you'd like.

### 🏗️ Activity #2:

There are two subtasks in this activity:

1. Write a `system_template` that leverages 2-3 of the principles from [this paper](https://arxiv.org/pdf/2312.16171v1.pdf)

2. Modify the `user_template` to improve the quality of the LLM's responses.

> NOTE: PLEASE DO NOT MODIFY THE `{input}` in the `user_template`.
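
To see why the placeholder matters, here is a small sketch of how Python's `str.format` fills a template. The template strings and the names `demo_template`, `broken_template`, and `json_hint_template` are illustrative assumptions, not part of the assignment.

```python
# A template with the {input} placeholder intact.
demo_template = """{input}
Respond in no more than seven paragraphs.
"""
filled = demo_template.format(input="What is the meaning of life?")
print(filled)

# If the placeholder is removed, .format() silently ignores the unused keyword,
# so the user's query never reaches the model.
broken_template = "Respond in no more than seven paragraphs."
dropped = broken_template.format(input="What is the meaning of life?")
print(dropped)

# Literal braces (for example, a JSON hint) must be doubled in a template passed to .format().
json_hint_template = 'Answer as JSON like {{"answer": "..."}}. Question: {input}'
escaped = json_hint_template.format(input="Is water wet?")
print(escaped)
```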

In [127]:
# I figured this would be a fun, meta twist :)
system_template = """\
###Instruction### You MUST critique the user input based on how well an LLM would be able to 
interpret, correctly respond, or execute actions based on their prompt. Use a scoring system. 
You can create your own scoring system, but you must provide a detailed, step by step explanation 
as to how you arrived at your score. Scoring must be repeatable and fair.

###Example### The user prompt is 'Can you show me your smile?'. This would receive a poor score within the
scoring system. The explanation would detail the limitations that LLM's have regarding smiling.
"""

In [131]:
user_template = """{input}
Respond in no more than seven paragraphs.
"""

## 5. Testing Your Prompt

Now we can test the prompt you made using an LLM-as-a-judge to see what happens to your score as you modify the prompt.

In [136]:
query = """
I want to make a peanut butter and jelly sandwich. The items I have at my disposal are raspberry jelly, 
extra chunky peanut butter, a spoon, a butter knife, a loaf of multi-grain bread, napkins, and a plate.
Provide me with step-by-step instructions to assemble the sandwich in the cleanest and most efficient way,
given the tools at my disposal.
"""

list_of_prompts = [
    system_prompt(system_template),
    user_prompt(user_template.format(input=query))
]

test_response = get_response(openai_client, list_of_prompts)

pretty_print(test_response)

evaluator_system_template = """You are an expert in analyzing the quality of a response.

You should be hyper-critical.

Provide scores (out of 10) for the following attributes:

1. Clarity - how clear is the response
2. Faithfulness - how related to the original query is the response
3. Correctness - was the response correct?

Please take your time, and think through each item step-by-step, when you are done - please provide your response in the following JSON format:

{"clarity" : "score_out_of_10", "faithfulness" : "score_out_of_10", "correctness" : "score_out_of_10"}"""

evaluation_template = """Query: {input}
Response: {response}"""

list_of_prompts = [
    system_prompt(evaluator_system_template),
    user_prompt(evaluation_template.format(
        input=query,
        response=test_response.choices[0].message.content
    ))
]

evaluator_response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=list_of_prompts,
    response_format={"type" : "json_object"}
)

The user input is well-structured and clearly specifies the available items for making a peanut butter and jelly sandwich. It outlines the desired outcome, which is to assemble the sandwich cleanly and efficiently. However, there are limitations in the AI's capabilities to provide a step-by-step guide for physical tasks like assembling a sandwich. Below is a detailed breakdown of the scoring based on various criteria:

1. **Clarity and Specificity of Request (4/5):** The user clearly states the goal of making a peanut butter and jelly sandwich and lists the available items. This specificity is helpful for the AI to understand the task at hand.

2. **Feasibility for AI (1/5):** The task of physically assembling a sandwich requires tactile abilities and understanding of physical objects, which LLM lacks. While it can provide theoretical steps, it cannot physically interact with the items to assemble the sandwich.

3. **Utilization of Available Tools (5/5):** The user has provided a good variety of tools for making the sandwich. They have at their disposal raspberry jelly, extra chunky peanut butter, a spoon, a butter knife, multi-grain bread, napkins, and a plate, which covers all necessary components for the sandwich.

4. **Efficiency and Cleanliness (4/5):** The user specifies that they want the sandwich to be made in the cleanest and most efficient way. LLM could provide steps theoretically, but it may not account for nuances like spreading the right amount of peanut butter or jelly.

5. **Overall Task Relevance (2/5):** While the task itself is clear and relevant, the limitations of LLM's physical capabilities make providing step-by-step instructions for assembling a sandwich challenging.

Overall, this user input would score a **2.8 out of 5**. The clarity and specificity of the request, as well as the inclusion of available tools, contribute positively to the score. However, the feasibility for the AI to execute the task and the overall limitations in physically interacting with objects significantly impact the score.

In [135]:
pretty_print(evaluator_response)

{
"clarity" : "4",
"faithfulness" : "2",
"correctness" : "3"
}
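
Because the evaluator returns its scores as a JSON string with string values, comparing runs is easier after converting them to numbers. The snippet below is a minimal sketch of that step; `parse_scores` is a hypothetical helper, and `raw_scores` is copied from the output above rather than fetched from the API.

```python
import json

def parse_scores(raw: str) -> dict:
    """Parse the evaluator's JSON reply and convert each score to an int."""
    data = json.loads(raw)
    return {name: int(value) for name, value in data.items()}

# Copied from the evaluator output shown above.
raw_scores = '{"clarity" : "4", "faithfulness" : "2", "correctness" : "3"}'

scores = parse_scores(raw_scores)
average = sum(scores.values()) / len(scores)
print(scores, average)
```

With the real response, the string to parse would be `evaluator_response.choices[0].message.content`.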

  

#### ❓Question #1:

How did your prompting strategies change the evaluation scores? What does this tell you/what did you learn?

*I iterated through several questions to test my system prompt. The questions, as well as their scores, are listed below:*

| Query | Response from System Prompt | Score |
| -------- | ------------------------ | ----- |
| What is the meaning of life? | The user input asks the LLM to define the meaning of life in no more than seven paragraphs. This prompt is highly abstract and philosophical, requiring deep, subjective reflection and opinions, which are beyond the capabilities of an LLM. The limitations of LLMs are that they are primarily based on patterns in existing data and text, rather than having personal experiences, emotions, or beliefs that would allow them to contemplate complex philosophical questions. Therefore, an LLM would struggle to provide a coherent and meaningful response to such a profound question. <br><br> ###Scoring Criteria:  <br><br> The prompt asks the LLM to define the meaning of life, a deeply philosophical and subjective topic. LLMs lack personal experiences, emotions, beliefs, and consciousness needed to deeply ponder and articulate the meaning of life. LLMs operate based on data patterns, not introspection or abstract concepts like the meaning of life. LLMs are not equipped to handle existential or philosophical questions that require human understanding and reasoning. The complexity and abstract nature of the prompt exceed the LLM's capabilities. The response requires subjective interpretation and personal beliefs which LLMs do not possess. <br><br> ###Final Score: <br><br> Given the limitations of LLMs in understanding and responding to abstract, philosophical inquiries like the meaning of life, this prompt would receive a score of 1 out of 10. The LLM would struggle to provide a meaningful response within the confines of its programming. | {"clarity" : 7, "faithfulness" : 4, "correctness" : 5} |
| How could I get to space without a rocket? | This user input would receive a low score. LLMs are not capable of providing accurate information about traveling to space without a rocket in a coherent and helpful manner. Here's a breakdown of the score: <br><br> Relevance to LLM capabilities: LLMs are limited in their knowledge scope and are not designed to provide instructions on space travel without rockets. 0/5 <br><br>Complexity: The question is highly complex and outside the scope of what an LLM can effectively respond to. 1/5 <br><br>Clarity of the Prompt: The prompt is clear in what it's asking, but the expectations for a coherent response within the limitations of an LLM are unrealistic. 2/5 <br><br>Limitations: LLMs lack the understanding of unconventional space travel methods, making it impossible for them to provide a meaningful response. 0/5 <br><br>Feasibility of Implementation: Given the current technology and capabilities of LLMs, it's not feasible for them to provide accurate or reliable information on space travel without rockets. 0/5 <br><br>Creativity: While the prompt itself is creative, it asks for a response that an LLM cannot effectively generate. 2/5 <br><br>Overall Score: 5/35 <br><br>In conclusion, the input would receive a low score due to its complexity, lack of relevance to LLM capabilities, and the impossibility of providing a meaningful response within the constraints of LLM technology. | { "clarity" : "4", "faithfulness" : "2", "correctness" : "3" } |
| I want to make a peanut butter and jelly sandwich. The items I have at my disposal are raspberry jelly, extra chunky peanut butter, a spoon, a butter knife, a loaf of multi-grain bread, napkins, and a plate. Provide me with step-by-step instructions to assemble the sandwich in the cleanest and most efficient way, given the tools at my disposal. | The user input is well-structured and clearly specifies the available items for making a peanut butter and jelly sandwich. It outlines the desired outcome, which is to assemble the sandwich cleanly and efficiently. However, there are limitations in the AI's capabilities to provide a step-by-step guide for physical tasks like assembling a sandwich. Below is a detailed breakdown of the scoring based on various criteria:<br><br>Clarity and Specificity of Request (4/5): The user clearly states the goal of making a peanut butter and jelly sandwich and lists the available items. This specificity is helpful for the AI to understand the task at hand.<br><br>Feasibility for AI (1/5): The task of physically assembling a sandwich requires tactile abilities and understanding of physical objects, which LLM lacks. While it can provide theoretical steps, it cannot physically interact with the items to assemble the sandwich.<br><br>Utilization of Available Tools (5/5): The user has provided a good variety of tools for making the sandwich. They have at their disposal raspberry jelly, extra chunky peanut butter, a spoon, a butter knife, multi-grain bread, napkins, and a plate, which covers all necessary components for the sandwich.<br><br>Efficiency and Cleanliness (4/5): The user specifies that they want the sandwich to be made in the cleanest and most efficient way. LLM could provide steps theoretically, but it may not account for nuances like spreading the right amount of peanut butter or jelly.<br><br>Overall Task Relevance (2/5): While the task itself is clear and relevant, the limitations of LLM's physical capabilities make providing step-by-step instructions for assembling a sandwich challenging.<br><br>Overall, this user input would score a 2.8 out of 5. The clarity and specificity of the request, as well as the inclusion of available tools, contribute positively to the score. However, the feasibility for the AI to execute the task and the overall limitations in physically interacting with objects significantly impact the score. | { "clarity" : "4", "faithfulness" : "2", "correctness" : "3" } |

*Based on the variation in evaluation scores, it is hard to determine what the model defines as clear, faithful, or correct. What I found interesting was that the responses that scored the best were anecdotally most similar to the response that ChatGPT would give ***without*** a system prompt. For example, the first question, which was generally the highest scoring based on the evaluation, subjectively did the worst job of adhering to the system prompt I had defined. However, it explicitly stated the user input, and only provided a single, overall numerical score. Ironically, this was also the only response that leveraged one of the principles defined in the paper.*<br><br>*The takeaway from this, in my opinion, is that there is an optimal way to ***communicate*** with a natural language processing model that may ***not*** be conducive to naturally spoken or written language. Based on the ***evaluation of an evaluation*** task that was performed as a result of the system prompt, it seems that the model prefers explicit language to numerical definition. I would argue that the inverse is true for humans. While we can be convinced by explicit language, we are never more certain than when there is statistical evidence to support a claim.*<br><br>*Ultimately, it appears that explicit language and properly identifying/delimiting key words/phrases are the hallmarks of effective prompting with LLMs.*