# Build a Custom OpenAI Chatbot with ML-Driven Prompt Engineering

The code below is designed to run as-is with one exception: **[you must set up the OpenAI API key in your system](https://platform.openai.com/docs/quickstart?context=python)**. 

Then, to execute each code cell, click on it and press `Shift` + `Enter` on your keyboard.

## Step 0: Inspecting Non-Customized Results

Before we perform any prompt engineering, **let's ask the OpenAI model some questions and see how it answers**.

(If you encounter an `AuthenticationError` when running this code, make sure that a valid API key is configured in your environment, as described above.)
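The `OpenAI()` client reads its key from the `OPENAI_API_KEY` environment variable by default, so a quick check before the first request can save some confusion. Below is a minimal sketch; the helper name `api_key_is_configured` is hypothetical and not part of the `openai` library.

```python
import os

def api_key_is_configured(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not api_key_is_configured():
    print("OPENAI_API_KEY is not set; set it before creating the client.")
```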

In [1]:
!pip3 show openai

Name: openai
Version: 1.12.0
Summary: The official Python library for the openai API
Home-page: 
Author: 
Author-email: OpenAI <support@openai.com>
License: 
Location: C:\Users\yychiang\anaconda3\Lib\site-packages
Requires: anyio, distro, httpx, pydantic, sniffio, tqdm, typing-extensions
Required-by: 


In [2]:
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

This is a test

In [3]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "When did Russia invade Ukraine?"}
  ]
)
initial_ukraine_answer = completion.choices[0].message.content.strip()
print(initial_ukraine_answer)

Russia's invasion of Ukraine began in February 2014 when Russian troops occupied Crimea, a region of Ukraine. This marked the start of a conflict that has continued since then, with ongoing fighting in Eastern Ukraine.


In [4]:
from openai import OpenAI
client = OpenAI()

twitter_prompt = """
Question: "Who owns Twitter?"
Answer:
"""


completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    #{"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": twitter_prompt}
  ]
)

initial_twitter_answer = completion.choices[0].message.content.strip()
print(initial_twitter_answer)


Twitter Inc. is a publicly traded company, meaning it is owned by shareholders who own stock in the company. The largest shareholders of Twitter Inc. are institutional investors such as mutual funds, pension funds, and hedge funds, as well as individual investors who own shares of the company. The co-founders and early investors of Twitter also own a significant portion of the company's stock.


The model is answering this way because the training data ends in 2021. **Our task will be to provide context from 2022 to help the model answer these questions correctly.**

## Step 1: Prepare Dataset

### Loading and Wrangling Data

**The data should be loaded into a pandas `DataFrame` called `df` where each row represents a text sample, and there is only one column, `"text"`, which contains the raw text data.**

In this particular case we are collecting data from [the Wikipedia page for the year 2022](https://en.wikipedia.org/wiki/2022) and performing some data wrangling to get it into the appropriate format. Don't worry too much about the details here, since data wrangling looks different for every dataset!

In [5]:
from dateutil.parser import parse
import pandas as pd
import requests

# Get the Wikipedia page for "2022" since OpenAI's models stop in 2021
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2022&explaintext=1&formatversion=2&format=json")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year 2022 saw the removal of nearly all...
2,– 2022 was also dominated by wars and armed c...
9,January 1 – The Regional Comprehensive Econom...
10,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
239,December 21–December 26 – A major winter storm...
240,December 24 – 2022 Fijian general election: Th...
241,December 29 – Brazilian football legend Pelé d...
242,December 31 – Former Pope Benedict XVI dies at...


In [6]:
df["text"][10]

'January 2 – Abdalla Hamdok resigns as Prime Minister of Sudan amid deadly protests.'
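The date-prefix step in the wrangling cell is easier to follow on a toy list. The sketch below mirrors that logic with the standard library only: it swaps `dateutil.parser.parse` for a stricter `datetime.strptime` check (an assumption made for illustration, so it only recognizes headings like "January 2"), and the names `is_date_heading` and `prefix_dates` are hypothetical.

```python
from datetime import datetime

def is_date_heading(line, year=2022):
    """Return True if `line` looks like 'January 2' (month name + day)."""
    try:
        # A year is appended so the parse is unambiguous
        datetime.strptime(f"{line} {year}", "%B %d %Y")
        return True
    except ValueError:
        return False

def prefix_dates(lines):
    """Attach the most recent date heading to undated lines."""
    prefix = ""
    out = []
    for line in lines:
        if " – " in line:
            # Already of the form "date – event": keep as-is
            out.append(line)
        elif is_date_heading(line):
            # A bare date heading becomes the prefix for following lines
            prefix = line
        else:
            out.append(f"{prefix} – {line}")
    return out

sample = [
    "January 1 – Event A happens.",
    "January 2",
    "Event B happens.",
    "Event C happens.",
]
result = prefix_dates(sample)
for line in result:
    print(line)
```

Every line in the result carries its date, so each row can stand alone as a piece of context later on.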

### Generating Embeddings

We'll use the embeddings tooling from OpenAI ([documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)) to create vectors representing each row of our custom dataset.

To avoid a `RateLimitError`, we'll send our data to the `client.embeddings.create` function in batches.
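The batching itself is independent of the API call. Here is a small sketch of the slicing pattern on a toy list; `iter_batches` is a hypothetical helper name and the 250 rows are made up for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"row {i}" for i in range(250)]  # toy data, not the real dataset
batch_lengths = [len(batch) for batch in iter_batches(texts, 100)]
print(batch_lengths)  # [100, 100, 50]
```

The last batch is simply shorter, so no special handling is needed when the row count is not a multiple of the batch size.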

In [7]:
from openai import OpenAI
client = OpenAI()

EMBEDDING_MODEL_NAME = "text-embedding-3-small"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = client.embeddings.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[0.01483198069036007, -0.019463684409856796, 0..."
1,– The year 2022 saw the removal of nearly all...,"[-0.004621672909706831, 0.022132715210318565, ..."
2,– 2022 was also dominated by wars and armed c...,"[-0.055712491273880005, 0.008474607020616531, ..."
9,January 1 – The Regional Comprehensive Econom...,"[-0.036633022129535675, 0.017915261909365654, ..."
10,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[0.027240121737122536, -0.024213440716266632, ..."
...,...,...
239,December 21–December 26 – A major winter storm...,"[-0.002059194026514888, 0.020098183304071426, ..."
240,December 24 – 2022 Fijian general election: Th...,"[0.027269190177321434, -0.0103622917085886, 0...."
241,December 29 – Brazilian football legend Pelé d...,"[0.04175658896565437, -0.005483540706336498, -..."
242,December 31 – Former Pope Benedict XVI dies at...,"[-0.011245762929320335, -0.03364294394850731, ..."


In [8]:
df.to_csv("embeddings.csv")

In [9]:
!ls

# !dir  # use this instead on Windows

'ls' is not recognized as an internal or external command, operable program or batch file.


If you want to stop the tutorial here and come back, you can reload `df` using this code (with your API key still configured in the environment) rather than generating the embeddings again:

In [10]:
# import ast
# import numpy as np
# import pandas as pd
# df = pd.read_csv("embeddings.csv", index_col=0)
# # Each embedding was saved as a string, so parse it back into an array
# df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
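A CSV file stores each embedding list as its string representation, which is why the column has to be parsed when it is reloaded. A minimal sketch of that parsing step on one made-up cell, using `ast.literal_eval`, which only accepts Python literals and is therefore a safer choice than `eval` for data read from disk:

```python
import ast

stored = "[0.0148, -0.0194, 0.5]"  # how one cell looks after a CSV round trip (made-up values)
vector = ast.literal_eval(stored)  # parses literals only; does not execute code

print(type(vector).__name__, len(vector))  # list 3
```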

In [11]:
df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[0.01483198069036007, -0.019463684409856796, 0..."
1,– The year 2022 saw the removal of nearly all...,"[-0.004621672909706831, 0.022132715210318565, ..."
2,– 2022 was also dominated by wars and armed c...,"[-0.055712491273880005, 0.008474607020616531, ..."
9,January 1 – The Regional Comprehensive Econom...,"[-0.036633022129535675, 0.017915261909365654, ..."
10,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[0.027240121737122536, -0.024213440716266632, ..."
...,...,...
239,December 21–December 26 – A major winter storm...,"[-0.002059194026514888, 0.020098183304071426, ..."
240,December 24 – 2022 Fijian general election: Th...,"[0.027269190177321434, -0.0103622917085886, 0...."
241,December 29 – Brazilian football legend Pelé d...,"[0.04175658896565437, -0.005483540706336498, -..."
242,December 31 – Former Pope Benedict XVI dies at...,"[-0.011245762929320335, -0.03364294394850731, ..."


## Step 2: Create a Function that Finds Related Pieces of Text for a Given Question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from most relevant to least relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.
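To make the comparison concrete: cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing the same way score 0 and unrelated (orthogonal) vectors score 1. The sketch below computes it in plain Python on toy 2-D vectors and ranks them; the vectors and labels are invented for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],   # distance 0
    "orthogonal": [0.0, 3.0],       # distance 1
    "opposite": [-1.0, 0.0],        # distance 2
}

# Smaller distance means more relevant, so sort ascending
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Vector length does not matter here, only direction, which is why `[2.0, 0.0]` is a perfect match for `[1.0, 0.0]`.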

In [12]:
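# openai.embeddings_utils is no longer available in openai>=1.0, so the
# helpers it used to provide are defined by hand below:
# from openai.embeddings_utils import get_embedding, distances_from_embeddings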


from openai import OpenAI
from scipy.spatial.distance import cosine
client = OpenAI()
EMBEDDING_MODEL_NAME = "text-embedding-3-small"


def get_embedding(text, model=EMBEDDING_MODEL_NAME):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding


def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = df_copy["embeddings"].apply(lambda x: cosine(question_embeddings, x))
    
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

Let's test that out for a couple different questions:

In [13]:
get_rows_sorted_by_relevance("When did Russia invade Ukraine?", df)

Unnamed: 0,text,embeddings,distances
51,March 2 – Russian invasion of Ukraine: Russia ...,"[-0.02846437133848667, 0.0017101424746215343, ...",0.356966
194,October 8 – Russian invasion of Ukraine: An ex...,"[-0.028934694826602936, 0.0471748411655426, 0....",0.358846
177,September 21 – Russian invasion of Ukraine: Fo...,"[-0.05519803613424301, 0.018655989319086075, 0...",0.366684
49,March 1 – Russian invasion of Ukraine: In an e...,"[-0.03092069923877716, 0.012936010956764221, 0...",0.390020
204,October 29 – Russian invasion of Ukraine: In r...,"[-0.08261626213788986, 0.014528372325003147, 0...",0.390550
...,...,...,...
224,November 19–November 26 – The 2022 Central Ame...,"[0.011738448403775692, -0.036297183483839035, ...",0.979640
213,November 1 – 2022 Danish general election: A b...,"[-0.011684820055961609, -0.002833489328622818,...",0.983876
81,April 4 – The Intergovernmental Panel on Clima...,"[0.042045507580041885, 0.007798945065587759, 0...",0.984768
242,December 31 – Former Pope Benedict XVI dies at...,"[-0.011245762929320335, -0.03364294394850731, ...",0.995849


In [14]:
get_rows_sorted_by_relevance("Who owns Twitter?", df)

Unnamed: 0,text,embeddings,distances
199,October 28 – Elon Musk completes his $44 billi...,"[-0.015506506897509098, -0.06498636305332184, ...",0.429940
97,April 25 – Elon Musk reaches an agreement to a...,"[0.01243569515645504, -0.044628191739320755, -...",0.439272
217,"November 11 – The cryptocurrency exchange FTX,...","[-0.04518360644578934, -0.025047961622476578, ...",0.764849
228,"November 30 – OpenAI releases ChatGPT, an arti...","[0.0053804293274879456, 0.02005067467689514, 0...",0.851584
147,"July 31 – Ayman al-Zawahiri, the Egyptian terr...","[0.0044547636061906815, -0.025621656328439713,...",0.853518
...,...,...,...
88,April 8 – Global food prices increase to their...,"[-0.053561531007289886, 0.00931905210018158, 0...",1.005764
227,November 21 – A 5.6 earthquake strikes near Ci...,"[-0.045198049396276474, 0.04281920567154884, 0...",1.013621
59,March 5 – Researchers in the Antarctic find En...,"[0.031153880059719086, 0.026408296078443527, 0...",1.013955
224,November 19–November 26 – The 2022 Central Ame...,"[0.011738448403775692, -0.036297183483839035, ...",1.020315


## Step 3: Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to create a text prompt that provides context to the model in order to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the model's token limit (4,096 tokens for the original `gpt-3.5-turbo`; check the documentation for the model you use). So we'll loop over the rows in order of relevance, counting tokens as we go, and stop when we hit the limit. Then we'll join the selected rows into a single string and add it to the prompt.
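The packing loop can be sketched without any tokenizer installed. In the example below, `count_tokens` is a stand-in that counts whitespace-separated words, which is only a rough proxy for real model tokens; the template is shortened and `pack_context` is a hypothetical name.

```python
PROMPT_TEMPLATE = """Context:

{}

---

Question: {}
Answer:"""

def count_tokens(text):
    # Stand-in tokenizer: whitespace words, NOT real model tokens
    return len(text.split())

def pack_context(rows, question, max_token_count):
    """Greedily add rows (already sorted by relevance) until the budget is hit."""
    used = count_tokens(PROMPT_TEMPLATE) + count_tokens(question)
    context = []
    for text in rows:
        used += count_tokens(text)
        if used > max_token_count:
            break  # this row would exceed the budget, so stop here
        context.append(text)
    return PROMPT_TEMPLATE.format("\n\n###\n\n".join(context), question)

rows = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota kappa"]
prompt = pack_context(rows, "Which letters appear?", 15)
print(prompt)
```

With a budget of 15 "tokens", the template and question use 9, the first two rows bring the total to 14, and the third row is left out.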

In [15]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

Now let's test that out! We'll use a `max_token_count` below the actual limit just to keep the output shorter and more readable.

In [16]:
print(create_prompt("When did Russia invade Ukraine?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

March 2 – Russian invasion of Ukraine: Russia captures its first large city, the Black Sea port of Kherson, as shelling intensifies across many parts of Ukraine, including civilian areas.

###

October 8 – Russian invasion of Ukraine: An explosion occurs on the Crimean Bridge connecting Crimea and Russia, killing three and causing a partial collapse of the only road bridge between the Crimean Peninsula and the Russian mainland. Two days later, retaliatory missile strikes are conducted by Russia across Ukraine, the most widespread since the start of the invasion, notably including attacks on Kyiv.

---

Question: When did Russia invade Ukraine?
Answer:


In [17]:
print(create_prompt("Who owns Twitter?", df, 100))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

October 28 – Elon Musk completes his $44 billion acquisition of Twitter.

###

April 25 – Elon Musk reaches an agreement to acquire the social media network Twitter (which he later rebrands as X) for $44 billion USD, which later closes in October.

---

Question: Who owns Twitter?
Answer:


## Step 4: Create a Function that Answers a Question

Our final step is to send that text prompt to a `Completion` model and parse the model output!
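The chat completions endpoint used throughout this notebook takes a list of messages, each a dictionary with a `role` and `content`. A small sketch of building that list, with an optional system message as in the earlier Ukraine example; `build_messages` is a hypothetical helper, not part of the `openai` library.

```python
def build_messages(prompt, system_message=None):
    """Wrap a text prompt in the role/content structure used for chat requests."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})
    return messages

messages = build_messages("Question: Who owns Twitter?\nAnswer:")
print(messages)
```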

In [18]:
from openai import OpenAI
client = OpenAI()


COMPLETION_MODEL_NAME = "gpt-3.5-turbo"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    #print(prompt)
    
    try:
        response = client.chat.completions.create(
            model=COMPLETION_MODEL_NAME,
            messages=[{"role": "user", "content": prompt}],
            # Cap the length of the answer
            max_tokens=max_answer_tokens,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(e)
        return ""
    
        

In [19]:
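# "請用中文回答" means "Please answer in Chinese"
custom_ukraine_answer = answer_question("請用中文回答：When did Russia invade Ukraine?", df)
print(custom_ukraine_answer)

2022年2月21日至2月24日俄羅斯入侵烏克蘭。
(English: Russia invaded Ukraine from February 21 to February 24, 2022.)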


In [20]:
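# "請用中文回答：誰擁有推特?" means "Please answer in Chinese: Who owns Twitter?"
custom_twitter_answer = answer_question("請用中文回答：誰擁有推特?", df)
print(custom_twitter_answer)

Elon Musk擁有推特。
(English: Elon Musk owns Twitter.)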


Below we compare answers with and without our custom prompt:

In [21]:
print(f"""
When did Russia invade Ukraine?

Original Answer: {initial_ukraine_answer}

Custom Answer:   {custom_ukraine_answer}

Who owns Twitter?

Original Answer: {initial_twitter_answer}

Custom Answer:   {custom_twitter_answer}
""")


When did Russia invade Ukraine?

Original Answer: Russia's invasion of Ukraine began in February 2014 when Russian troops occupied Crimea, a region of Ukraine. This marked the start of a conflict that has continued since then, with ongoing fighting in Eastern Ukraine.

Custom Answer:   2022年2月21日至2月24日俄羅斯入侵烏克蘭。

Who owns Twitter?

Original Answer: Twitter Inc. is a publicly traded company, meaning it is owned by shareholders who own stock in the company. The largest shareholders of Twitter Inc. are institutional investors such as mutual funds, pension funds, and hedge funds, as well as individual investors who own shares of the company. The co-founders and early investors of Twitter also own a significant portion of the company's stock.

Custom Answer:   Elon Musk擁有推特。

