# Build a Custom OpenAI Chatbot with ML-Driven Prompt Engineering

The code below is designed to run as-is with one exception: **[you must set up the OpenAI API key in your system](https://platform.openai.com/docs/quickstart?context=python)**. 

Then, to execute each code cell, click on it and press `Shift` + `Enter` on your keyboard.

## Step 0: Inspecting Non-Customized Results

Before we perform any prompt engineering, **let's ask the OpenAI model some questions and see how it answers**.

(If you encounter an `AuthenticationError` when running this code, make sure that you have set up a valid API key as described above.)

In [1]:
# If OPENAI_API_KEY is set correctly, the following command prints it (Windows syntax;
# on macOS/Linux use `!echo $OPENAI_API_KEY`). Never share or publish the printed key.
# Setup instructions: https://platform.openai.com/docs/quickstart?context=python
!echo %OPENAI_API_KEY%

sk-...(redacted)
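
Printing the full key is easy to leak by accident when a notebook is shared. As an alternative, here is a minimal sketch that only confirms the variable is set and shows its last few characters. The helper name `mask_key` is hypothetical and not part of the `openai` library.

```python
import os


def mask_key(key: str, visible: int = 4) -> str:
    """Hide everything except the last `visible` characters of the key."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# os.environ.get returns None when the variable is not set
api_key = os.environ.get("OPENAI_API_KEY")
if api_key:
    print("OPENAI_API_KEY is set:", mask_key(api_key))
else:
    print("OPENAI_API_KEY is not set")
```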


In [2]:
# I use openai Version: 1.12.0
!pip3 show openai

Name: openai
Version: 1.12.0
Summary: The official Python library for the openai API
Home-page: 
Author: 
Author-email: OpenAI <support@openai.com>
License: 
Location: C:\Users\yychiang\anaconda3\Lib\site-packages
Requires: anyio, distro, httpx, pydantic, sniffio, tqdm, typing-extensions
Required-by: 


In [18]:
from openai import OpenAI
client = OpenAI()

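# The question is asked in Traditional Chinese:
# "What was the result of Taiwan's presidential election?"
Taiwan_prompt = """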
Question: "請問台灣總統大選結果如何?"
Answer:
"""


completion = client.chat.completions.create(
  model="gpt-4-0125-preview",
  messages=[
    #{"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": Taiwan_prompt}
  ]
)

initial_Taiwan_answer = completion.choices[0].message.content.strip()
print(initial_Taiwan_answer)


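(Output translated from Chinese.)

Sorry, I can't provide real-time information or forecasts, because my data was only updated through April 2023. If you are asking about the most recent Taiwanese presidential election, I can tell you the result of the 2020 election. In the 2020 election, Tsai Ing-wen won and was re-elected President of Taiwan. She ran as the candidate of the Democratic Progressive Party (DPP) and defeated her opponents by a wide margin. If you are asking about a future election, such as the 2024 election, I can't provide predictions or updated information. I suggest checking the latest news reports for the most recent results.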


In [19]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-4-0125-preview",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    # The question is asked in Traditional Chinese: "What wars has Israel been involved in?"
    {"role": "user", "content": "以色列有甚麼戰爭?"}
  ]
)
initial_Israel_answer = completion.choices[0].message.content.strip()
print(initial_Israel_answer)

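(Output translated from Chinese.)

Since its founding in 1948, Israel has gone through many wars and conflicts because of its geographic location and its relations with neighboring countries. The main wars are:

1. **1948 Arab–Israeli War (War of Independence)**: After Israel declared independence, the neighboring Arab states went to war in an attempt to destroy the newly founded Jewish state. Israel ultimately won and expanded its territory.

2. **1956 Second Middle East War (Suez Crisis)**: In response to Egypt's nationalization of the Suez Canal and its blockade of Israeli shipping, Israel joined Britain and France in attacking Egypt.

3. **1967 Third Middle East War (Six-Day War)**: Israel launched a preemptive strike on military bases in Egypt, Jordan, Syria, and Iraq and rapidly expanded its territory, including the Gaza Strip, the West Bank, East Jerusalem, the Golan Heights, and the Sinai Peninsula.

4. **1973 Fourth Middle East War (Yom Kippur War)**: On Yom Kippur, the holiest day in Judaism, Egypt and Syria jointly launched a surprise attack on Israel. They had some initial success, but Israel, with American assistance, gradually regained the lost ground.

5. **1982 Lebanon War**: Israel invaded Lebanon to eliminate the bases of the Palestine Liberation Organization (PLO) there. The war caused heavy casualties, including civilians, and in 1985 Israel established a security buffer zone in southern Lebanon.

Beyond these wars, Israel has an ongoing conflict with the Palestinians, including the Second Intifada (Al-Aqsa Intifada) of 2000 and multiple Gaza conflicts. Tensions between Israel and neighboring countries and territories remain unresolved to this day.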


The model answers this way because its training data ends before the events we are asking about (in the first answer above it reports a cutoff of April 2023). **Our task will be to provide context from 2024 to help the model answer these questions correctly.**

## Step 1: Prepare Dataset

### Loading and Wrangling Data

**The data should be loaded into a pandas `DataFrame` called `df` where each row represents a text sample, and there is only one column, `"text"`, which contains the raw text data.**

In this particular case we are collecting data from [the Wikipedia page for the year 2024](https://en.wikipedia.org/wiki/2024) and performing some data wrangling to get it into the appropriate format. Don't worry too much about the details here, since data wrangling looks different for every dataset!

In [24]:
from dateutil.parser import parse
import pandas as pd
import requests



# Get the Wikipedia page for "2024", which covers events after the model's training data ends
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2024",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)



# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – " (en dash), it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix. Write through
            # df.loc because the row yielded by iterrows() may be a copy.
            df.loc[i, "text"] = prefix + " – " + row["text"]
# Keep only the rows that now carry a date prefix (this drops the bare date headings)
df = df[df["text"].str.contains(" – ")]
df
df

Unnamed: 0,text
0,"– 2024 (MMXXIV) is the current year, and is a..."
1,"– So far, this year has witnessed the continu..."
2,"– Approximately 76 countries, representing ar..."
10,"January 1 – Egypt, Ethiopia, Iran and the Unit..."
11,January 1 – The Republic of Artsakh is formall...
...,...
105,November – 2024 Namibian general election.
106,November – 2024 Romanian presidential election.
107,"November – Lee Hsien Loong, Prime Minister of ..."
109,December – 2024 Algerian presidential election.


In [25]:
df["text"][10]

'January 1 – Egypt, Ethiopia, Iran and the United Arab Emirates become BRICS members.'
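
To make the date-prefix step easier to follow, here is a small sketch of the same idea on a hand-written list instead of the Wikipedia page. It is an illustration under assumptions: `is_date_heading` is a hypothetical, much stricter stand-in for `dateutil`'s `parse`, and the sample lines are invented.

```python
MONTHS = set(
    "January February March April May June July "
    "August September October November December".split()
)


def is_date_heading(line: str) -> bool:
    """Rough stand-in for dateutil's parse(): accepts "January" or "January 13"."""
    parts = line.split()
    if not parts or len(parts) > 2 or parts[0] not in MONTHS:
        return False
    return len(parts) == 1 or parts[1].isdigit()


def add_date_prefixes(lines):
    """Attach the most recent date heading to each undated line."""
    prefix = ""
    result = []
    for line in lines:
        if " – " in line:            # already starts with its own date
            result.append(line)
        elif is_date_heading(line):  # remember the heading, drop the row itself
            prefix = line
        else:                        # undated event: prepend the last heading
            result.append(prefix + " – " + line)
    return result


sample = [
    "January 1 – First event.",
    "January 13",
    "Second event.",
    "Third event.",
]
for line in add_date_prefixes(sample):
    print(line)
```

Lines that appear before any date heading get an empty prefix, which is why the first rows of `df` above start with a bare " – ".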

### Generating Embeddings

We'll use OpenAI's embeddings endpoint ([documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)) to create vectors representing each row of our custom dataset.

To avoid a `RateLimitError`, we'll send our data in batches to the `client.embeddings.create` function.
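
The batching itself is plain list slicing and can be checked without calling the API. The sketch below uses a hypothetical helper named `batches` and made-up row texts; it only shows how 250 rows would be split with a batch size of 100.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up stand-ins for the rows of df["text"]
texts = [f"row {n}" for n in range(250)]
sizes = [len(batch) for batch in batches(texts, 100)]
print(sizes)  # [100, 100, 50]
```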

In [26]:
from openai import OpenAI
client = OpenAI()

EMBEDDING_MODEL_NAME = "text-embedding-3-large"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = client.embeddings.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,"– 2024 (MMXXIV) is the current year, and is a...","[-0.019873449578881264, 0.05013955011963844, -..."
1,"– So far, this year has witnessed the continu...","[-0.02778734639286995, 0.008994106203317642, -..."
2,"– Approximately 76 countries, representing ar...","[-0.013988791033625603, 0.017387699335813522, ..."
10,"January 1 – Egypt, Ethiopia, Iran and the Unit...","[-0.04913155362010002, 0.04088009148836136, -0..."
11,January 1 – The Republic of Artsakh is formall...,"[-0.02843601070344448, 0.005592299159616232, -..."
...,...,...
105,November – 2024 Namibian general election.,"[-0.019638201221823692, 0.01643790304660797, -..."
106,November – 2024 Romanian presidential election.,"[-0.031672440469264984, -0.019287370145320892,..."
107,"November – Lee Hsien Loong, Prime Minister of ...","[-0.0034571336582303047, 0.02480880357325077, ..."
109,December – 2024 Algerian presidential election.,"[-0.014009836129844189, 0.0376632995903492, -0..."


In [27]:
df.to_csv("embeddings.csv")

In [28]:
#!ls

#!dir # or this one in Windows system

If you want to stop the tutorial here and come back later, you can reload `df` from `embeddings.csv` rather than generating the embeddings again. The `embeddings` column is saved as text, so it has to be converted back into lists of floats after loading.
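
The original reload cell is not included here, so the following is only a sketch of one way to do it, assuming the file was written by `df.to_csv("embeddings.csv")` as above. The helper `parse_embedding` is hypothetical; it relies on the fact that a Python list of floats is written to CSV in a form that `json.loads` can read back.

```python
import json


def parse_embedding(cell: str) -> list:
    """Convert a CSV cell such as "[-0.01, 0.05]" back into a list of floats."""
    return [float(value) for value in json.loads(cell)]


print(parse_embedding("[-0.019873449578881264, 0.05013955011963844]"))

# Possible usage with pandas (not executed here):
# import pandas as pd
# df = pd.read_csv("embeddings.csv", index_col=0)
# df["embeddings"] = df["embeddings"].apply(parse_embedding)
```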

## Step 2: Create a Function that Finds Related Pieces of Text for a Given Question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from most relevant to least relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.
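
As a concrete illustration of that comparison, here is cosine distance written out by hand on tiny two-dimensional vectors. The vectors and labels are invented; the tutorial itself uses `scipy.spatial.distance.cosine` on the real embeddings.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more relevant, so sort in ascending order
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

Vector length does not matter here, only direction, which is why `[2.0, 0.0]` is at distance 0 from the question vector.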

In [29]:
#from openai.embeddings_utils import get_embedding, distances_from_embeddings


from openai import OpenAI
from scipy.spatial.distance import cosine
client = OpenAI()
EMBEDDING_MODEL_NAME = "text-embedding-3-large"


def get_embedding(text, model=EMBEDDING_MODEL_NAME):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding


def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = df_copy["embeddings"].apply(lambda x: cosine(question_embeddings, x))
    
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

Let's test that out for a couple different questions:

In [30]:
get_rows_sorted_by_relevance("請問台灣總統大選結果如何?", df)

Unnamed: 0,text,embeddings,distances
20,January 13 – 2024 Taiwanese presidential elect...,"[-0.018739350140094757, -0.016289688646793365,...",0.498769
42,February 14 – 2024 Indonesian general election...,"[-0.02494579181075096, -0.019460968673229218, ...",0.718588
69,May 12 – 2024 Lithuanian presidential election.,"[-0.014449754729866982, 0.013429048471152782, ...",0.724994
23,January 14 – 2024 Comorian presidential electi...,"[0.020452503114938736, 0.0021525847259908915, ...",0.725913
14,January 2 – 2023 Marshallese general election:...,"[0.007827282883226871, -0.03010089322924614, -...",0.728693
...,...,...,...
81,August 17 – Nusantara will become the new capi...,"[-0.011016898788511753, -0.013277658261358738,...",0.995419
51,March 7 – 287 students are abducted by gunmen ...,"[-0.005382429342716932, 0.004495635163038969, ...",0.999019
43,February 22 – American company Intuitive Machi...,"[-0.016431834548711777, 0.00820942223072052, -...",1.003312
61,March 31 – Bulgaria and Romania are to become ...,"[-0.03798238933086395, -0.03057060018181801, -...",1.005208


In [32]:
get_rows_sorted_by_relevance("以色列有甚麼戰爭?", df)

Unnamed: 0,text,embeddings,distances
28,January 26 – Israel–Hamas war: The UN's Intern...,"[-0.053747620433568954, -0.01800781674683094, ...",0.631508
45,February 29 – Israel–Hamas war: Soldiers of th...,"[-0.04343355447053909, 0.00023973017232492566,...",0.631601
1,"– So far, this year has witnessed the continu...","[-0.02778734639286995, 0.008994106203317642, -...",0.652609
15,January 3 – 2024 Kerman bombings: An Islamic S...,"[-0.02560318075120449, 0.03410275653004646, -0...",0.786379
19,January 12 – Operation Prosperity Guardian: A ...,"[-0.01950954832136631, 0.025931064039468765, 0...",0.808699
...,...,...,...
62,March 31 – 2024 Turkish local elections.,"[-0.032939378172159195, -0.015575882978737354,...",0.990563
22,January 14 – Margrethe II formally abdicates a...,"[0.0029969520401209593, -2.3851542209740728e-0...",1.002294
36,February 4 – President of Namibia Hage Geingob...,"[-0.030395137146115303, 0.012127822265028954, ...",1.004783
43,February 22 – American company Intuitive Machi...,"[-0.016431834548711777, 0.00820942223072052, -...",1.006122


## Step 3: Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to create a text prompt that provides context to a chat completion model in order to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding a token budget, which must stay within the context window of the model we call. So we'll loop over the dataset from most to least relevant, counting the tokens as we go, and stop when we hit the limit. Then we'll join that list of text data into a single string and add it to the prompt.
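
The budget loop can be tried on its own before involving a real tokenizer. In this sketch, `select_within_budget` and `word_count` are hypothetical names, and counting whitespace-separated words is only a stand-in for `tiktoken`; the real function also starts the counter at the token count of the template and question.

```python
def select_within_budget(texts, max_tokens, count_tokens):
    """Take texts in order until adding the next one would exceed max_tokens."""
    used = 0
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > max_tokens:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected


def word_count(text):
    """Stand-in tokenizer (an assumption): one token per word."""
    return len(text.split())


ranked_rows = [
    "most relevant row",              # 3 "tokens"
    "second most relevant row here",  # 5 "tokens", running total 8
    "least relevant",                 # would bring the total to 10
]
print(select_within_budget(ranked_rows, 8, word_count))
```

Because the loop stops at the first row that does not fit, a shorter row further down the ranking is never considered, which keeps the context ordered strictly by relevance.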

In [33]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a chat completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

Now let's test that out! We'll use a `max_token_count` below the actual limit just to keep the output shorter and more readable.

In [34]:
print(create_prompt("以色列有甚麼戰爭?", df,200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

January 26 – Israel–Hamas war: The UN's International Court of Justice rules that Israel must take all measures to prevent genocidal acts in Gaza, but stops short of ordering an immediate halt to operations.

###

February 29 – Israel–Hamas war: Soldiers of the Israel Defense Forces open fire on a crowd of civilians in Gaza City, killing more than a hundred people, as the Palestinian casualties of the war exceed 30,000.

---

Question: 以色列有甚麼戰爭?
Answer:


In [35]:
print(create_prompt("請問台灣總統大選結果如何?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

January 13 – 2024 Taiwanese presidential election: Lai Ching-te of the ruling Democratic Progressive Party wins with 40% of the vote.

###

February 14 – 2024 Indonesian general election: Official quick counts by government tabulators establish former military officer Prabowo Subianto as the winner of the presidential election pending final results that will be released in March.

###

May 12 – 2024 Lithuanian presidential election.

###

January 14 – 2024 Comorian presidential election: Amid an opposition boycott, incumbent president Azali Assoumani wins re-election with 62.9% of the vote and only 16.3% voter turnout.

---

Question: 請問台灣總統大選結果如何?
Answer:


## Step 4: Create a Function that Answers a Question

Our final step is to send that text prompt to a chat completion model and parse the model output!

In [36]:
from openai import OpenAI
client = OpenAI()


COMPLETION_MODEL_NAME = "gpt-4-0125-preview"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI chat completion model
    
    If the API call raises an error, print it and return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = client.chat.completions.create(
            model=COMPLETION_MODEL_NAME,
            messages=[{"role": "user", "content": prompt}],
            # Cap the length of the generated answer
            max_tokens=max_answer_tokens,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(e)
        return ""
    
        

In [38]:
custom_Taiwan_answer = answer_question("請問台灣總統大選結果如何?", df)
print(custom_Taiwan_answer)

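(Output translated from Chinese.)

In the 2024 Taiwanese presidential election, Lai Ching-te of the Democratic Progressive Party won with 40% of the vote.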


In [39]:
custom_Israel_answer = answer_question("以色列有甚麼戰爭?", df)
print(custom_Israel_answer)

(Output translated from Chinese.)

The Israel–Hamas war.


Below we compare answers with and without our custom prompt:

In [40]:
print(f"""
What are the latest results of the Taiwan presidential election?

Original Answer: {initial_Taiwan_answer}

Custom Answer:   {custom_Taiwan_answer}

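What recent wars is Israel involved in?

Original Answer: {initial_Israel_answer}

Custom Answer:   {custom_Israel_answer}
""")

(Output translated from Chinese.)

What are the latest results of the Taiwan presidential election?

Original Answer: Sorry, I can't provide real-time information or forecasts, because my data was only updated through April 2023. If you are asking about the most recent Taiwanese presidential election, I can tell you the result of the 2020 election. In the 2020 election, Tsai Ing-wen won and was re-elected President of Taiwan. She ran as the candidate of the Democratic Progressive Party (DPP) and defeated her opponents by a wide margin. If you are asking about a future election, such as the 2024 election, I can't provide predictions or updated information. I suggest checking the latest news reports for the most recent results.

Custom Answer:   In the 2024 Taiwanese presidential election, Lai Ching-te of the Democratic Progressive Party won with 40% of the vote.

What recent wars is Israel involved in?

Original Answer: Since its founding in 1948, Israel has gone through many wars and conflicts because of its geographic location and its relations with neighboring countries. The main wars are:

1. **1948 Arab–Israeli War (War of Independence)**: After Israel declared independence, the neighboring Arab states went to war in an attempt to destroy the newly founded Jewish state. Israel ultimately won and expanded its territory.

2. **1956 Second Middle East War (Suez Crisis)**: In response to Egypt's nationalization of the Suez Canal and its blockade of Israeli shipping, Israel joined Britain and France in attacking Egypt.

3. **1967 Third Middle East War (Six-Day War)**: Israel launched a preemptive strike on military bases in Egypt, Jordan, Syria, and Iraq and rapidly expanded its territory, including the Gaza Strip, the West Bank, East Jerusalem, the Golan Heights, and the Sinai Peninsula.

4. **1973 Fourth Middle East War (Yom Kippur War)**: On Yom Kippur, the holiest day in Judaism, Egypt and Syria jointly launched a surprise attack on Israel. They had some initial success, but Israel, with American assistance, gradually regained the lost ground.

5. **1982 Lebanon War**: Israel invaded Lebanon to eliminate the bases of the Palestine Liberation Organization (PLO) there. The war caused heavy casualties, including civilians, and in 1985 Israel established a security buffer zone in southern Lebanon.

Beyond these wars, Israel has an ongoing conflict with the Palestinians, including the Second Intifada (Al-Aqsa Intifada) of 2000 and multiple Gaza conflicts. Tensions between Israel and neighboring countries and territories remain unresolved to this day.

Custom Answer:   The Israel–Hamas war.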

## Explanation


In this project, we posed two questions. The first is about the results of the Taiwanese presidential election, and the second is about wars involving Israel. Taiwan held its quadrennial presidential election in January 2024, which is after the model's training data ends (the model itself reported a cutoff of April 2023). In addition, the conflict between Israel and Hamas that began in 2023 continued into 2024 without resolution. Both questions therefore required data from 2024. We did not train or fine-tune the model: we took entries from the Wikipedia page for 2024, retrieved the ones most relevant to each question using embeddings, and supplied them as context in the prompt. With that context, the model gave satisfactory answers.