# Custom Chatbot Project

The completion model used in this project (text-davinci-003) has training data that extends only through 2021. To customize the chatbot so that it can answer questions about more recent events, the Wikipedia page "2023 in Denmark" has been used as the dataset. This page lists notable events in Denmark during that year. For the questions, I picked two significant events from 2023: first, Volodymyr Zelensky's visit to the country, and second, Danish astronaut Andreas Mogensen's mission to space. With the context retrieved from the Wikipedia page added to the prompt, the model gave appropriate answers to both questions.

## Data Wrangling

In the cells below, the chosen dataset is loaded into a `pandas` dataframe with a column named `"text"`. This column contains the text of the chosen Wikipedia page, one line per row. Some code is reused from the course materials.

In [1]:
import openai
openai.api_key = "YOUR API KEY"

In [2]:
from dateutil.parser import parse
import pandas as pd
import requests

# Get the Wikipedia page for "2023 in Denmark" since OpenAI's models stop in 2021
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2023_in_Denmark&explaintext=1&formatversion=2&format=json")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – " (en dash), it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except Exception:
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df

Unnamed: 0,text
0,– Events in the year 2023 in Denmark.
4,Monarch – Margrethe II
5,Prime Minister – Mette Frederiksen
6,– Government: Frederiksen II Cabinet
7,– Folketing: 2022–2026 session (elected 1 Nov...
...,...
197,"9 July – Asbjørn Sennels, footballer (born 1979)"
198,"27 August – Eddie Skoller, entertainer (born 1..."
202,"8 November – Søren Krarup, pastor, author and ..."
203,"9 November – Jørgen Reenberg, actor (born 1927)"
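
The date-prefix step above can be tried without a network call. The following is a minimal sketch of the same idea on a plain list of strings. The helper names `is_date_heading` and `add_date_prefixes` are hypothetical, the month lookup is a simplified stand-in for `dateutil.parser.parse` (which accepts many more formats), and the sample lines are illustrative rather than the page's exact layout.

```python
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}

def is_date_heading(text):
    """Return True for a bare 'day month' line such as '26 August'.

    Simplified stand-in for dateutil.parser.parse (assumption for this sketch).
    """
    parts = text.split()
    return len(parts) == 2 and parts[0].isdigit() and parts[1] in MONTHS

def add_date_prefixes(lines):
    """Prefix undated lines with the most recent date heading."""
    prefix = ""
    result = []
    for text in lines:
        if " – " in text:
            # Already in "date – event" form, keep unchanged
            result.append(text)
        elif is_date_heading(text):
            # Remember the heading; it is not kept as a row of its own
            prefix = text
        else:
            result.append(prefix + " – " + text)
    return result

# Illustrative sample lines (not the exact page layout)
sample = [
    "Monarch – Margrethe II",
    "26 August",
    "Astronaut Andreas Mogensen begins his second mission.",
    "6 September – Mette Frederiksen speaks in Kyiv.",
]
prefixed = add_date_prefixes(sample)
for line in prefixed:
    print(line)
```

Lines that appear before any date heading receive an empty prefix, which is why some rows in the dataframe above start with " – ".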


## Custom Query Completion

In the cells below, a custom query is composed using the chosen dataset. The model `text-embedding-ada-002` creates the embeddings, and the model `text-davinci-003` performs the completion. Some code is reused from the course materials.

In [3]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,– Events in the year 2023 in Denmark.,"[-0.008608013391494751, -0.0423879437148571, 0..."
4,Monarch – Margrethe II,"[-0.011767682619392872, -0.014944956637918949,..."
5,Prime Minister – Mette Frederiksen,"[0.0049454038962721825, -0.021792452782392502,..."
6,– Government: Frederiksen II Cabinet,"[-0.002783924574032426, -0.021326791495084763,..."
7,– Folketing: 2022–2026 session (elected 1 Nov...,"[-0.009420907124876976, -0.04695134982466698, ..."
...,...,...
197,"9 July – Asbjørn Sennels, footballer (born 1979)","[-0.012380163185298443, -0.01688493601977825, ..."
198,"27 August – Eddie Skoller, entertainer (born 1...","[0.0024474533274769783, -0.01328253373503685, ..."
202,"8 November – Søren Krarup, pastor, author and ...","[0.0005325953243300319, -0.03166608884930611, ..."
203,"9 November – Jørgen Reenberg, actor (born 1927)","[-0.014470694586634636, -0.03745207563042641, ..."
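
The batching loop above sends at most `batch_size` texts per request and appends the results in order, so the final list lines up with the dataframe rows. Below is a minimal sketch of that pattern with the API call replaced by a hypothetical `fake_embed_batch` function, so it runs offline. The names `batched` and `embed_all` are also assumptions made for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch, preserving the input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        # embed_batch must return one vector per input text, in order
        embeddings.extend(embed_batch(batch))
    return embeddings

def fake_embed_batch(batch):
    # Hypothetical stand-in for the embeddings API: one tiny vector per text
    return [[float(len(text)), 0.0] for text in batch]

texts = [f"row {n}" for n in range(250)]
vectors = embed_all(texts, fake_embed_batch, batch_size=100)
batch_sizes = [len(batch) for batch in batched(texts, 100)]
print(batch_sizes, len(vectors))
```

Passing the embedding function as an argument keeps the batching logic testable without an API key.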


In [4]:
df.to_csv("embeddings.csv")

In [6]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
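
To make the "distances" column more concrete, here is a minimal pure-Python sketch of cosine distance, taken as one minus the cosine similarity of two vectors, and of ranking rows by it. The two-dimensional vectors and the names `cosine_distance` and `rank_by_distance` are assumptions for illustration only; real embeddings from the model have many more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_by_distance(question_vec, rows):
    """Sort (text, vector) pairs into (distance, text) tuples, closest first."""
    return sorted(
        (cosine_distance(question_vec, vec), text) for text, vec in rows
    )

# Toy vectors: the first axis stands for "visits", the second for "football"
toy_rows = [
    ("visit", [1.0, 0.0]),
    ("football", [0.0, 1.0]),
    ("mixed", [1.0, 1.0]),
]
ranked = rank_by_distance([1.0, 0.1], toy_rows)
for distance, text in ranked:
    print(f"{distance:.3f}  {text}")
```

A distance near 0 means the row points in almost the same direction as the question, which is why ascending order puts the most relevant rows first.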


In [7]:
get_rows_sorted_by_relevance("What year did Volodymyr Zelensky last visit Denmark?", df)

Unnamed: 0,text,embeddings,distances
57,20–21 August – Ukrainian president Volodymyr Z...,"[-0.01600203476846218, -0.03360169753432274, 0...",0.115764
62,6 September – Danish Prime Minister Mette Fred...,"[-0.008525527082383633, -0.01251987088471651, ...",0.155667
70,"6–8 November – Felipe VI, King of Spain, goes ...","[-0.012014003470540047, -0.01815221831202507, ...",0.177423
0,– Events in the year 2023 in Denmark.,"[-0.008608013391494751, -0.0423879437148571, 0...",0.189012
9,Prime minister of the Faroe Islands – Aksel V....,"[0.00227158865891397, -0.01553389709442854, -0...",0.194143
...,...,...,...
77,– Mia Wagner resigns as minister of digitalis...,"[-0.016111699864268303, -0.010903680697083473,...",0.276564
154,3 September – Bastian Buus wins the 2023 Porsc...,"[0.0059269871562719345, 0.005085361190140247, ...",0.279971
187,"1 April – Dario Campeotto, singer and entertai...","[-0.009725860320031643, -0.014588790945708752,...",0.282382
43,3 May – A new national warning system is teste...,"[-0.017625609412789345, 0.0018384989816695452,...",0.283283


In [8]:
get_rows_sorted_by_relevance("Describe Danish astronaut Andreas Mogensen's latest space mission?", df)

Unnamed: 0,text,embeddings,distances
58,23 August – Jakob Ellemann-Jensen switches min...,"[0.0004370354872662574, -0.03705993667244911, ...",0.094210
204,"1 December – Jørn Mader (da), sports journalis...","[-0.018169252201914787, -0.030160438269376755,...",0.180232
5,Prime Minister – Mette Frederiksen,"[0.0049454038962721825, -0.021792452782392502,...",0.183841
9,Prime minister of the Faroe Islands – Aksel V....,"[0.00227158865891397, -0.01553389709442854, -0...",0.187534
10,Prime minister of Greenland – Múte Bourup Egede,"[-0.009652459993958473, -0.016679128631949425,...",0.187832
...,...,...,...
77,– Mia Wagner resigns as minister of digitalis...,"[-0.016111699864268303, -0.010903680697083473,...",0.259645
198,"27 August – Eddie Skoller, entertainer (born 1...","[0.0024474533274769783, -0.01328253373503685, ...",0.265619
8,– Leaders of the constituent countries,"[-0.00870521366596222, 3.951273174607195e-05, ...",0.278299
187,"1 April – Dario Campeotto, singer and entertai...","[-0.009725860320031643, -0.014588790945708752,...",0.283433


In [9]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
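
The core of `create_prompt` is a greedy token budget: rows are taken in order of relevance until the next one would exceed the limit. The sketch below isolates that step. `pack_context` and `count_words` are hypothetical names, and counting whitespace-separated words is only a stand-in for `len(tokenizer.encode(text))`, since real tokenizers count subword tokens.

```python
def pack_context(rows, budget, count_tokens):
    """Greedily keep rows (most relevant first) while they fit the budget."""
    used = 0
    selected = []
    for text in rows:
        cost = count_tokens(text)
        if used + cost > budget:
            # Stop at the first row that does not fit, as create_prompt does
            break
        used += cost
        selected.append(text)
    return selected, used

def count_words(text):
    # Hypothetical stand-in for a real tokenizer
    return len(text.split())

rows = ["a b c", "d e", "f g h i", "j"]
selected, used = pack_context(rows, budget=6, count_tokens=count_words)
print(selected, used)
```

Because the loop breaks at the first row that does not fit, a later and shorter row (here `"j"`) is not considered. Note also that `create_prompt` does not count the `###` separators between rows, so the final prompt can be slightly longer than `max_token_count`.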

In [10]:
print(create_prompt("What year did Volodymyr Zelensky last visit Denmark?", df, 300))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

20–21 August – Ukrainian president Volodymyr Zelenskyy visits Denmark in connection with its donation of 19 F-16s to Ukraine.

###

6 September – Danish Prime Minister Mette Frederiksen speaks before the Ukrainian parliament in Kyiv.

###

6–8 November – Felipe VI, King of Spain, goes on a state visit to Denmark

###

 – Events in the year 2023 in Denmark.

###

Prime minister of the Faroe Islands – Aksel V. Johannesen

###

11 February – The 53rd edition of Dansk Melodi Grand Prix 2023 is held in Arena Næstved in Næstved with Faroese singer Reiley being selected as Denmark's entry for the Eurovision Song Contest

###

6 December – Sanjay Shah is extradited to Denmark from the United Arab Emirates.

###

29 May – F.C. Copenhagen secures the Danish football championship by defeating Viborg FF 2–1 in the second last round of the 2022–23 Danish Superl

In [11]:
print(create_prompt("Describe Danish astronaut Andreas Mogensen's latest space mission?", df, 300))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

23 August – Jakob Ellemann-Jensen switches ministry with Troels Lund Poulsen, going from Minister of Defence to Minister for Economic Affairs.26 August – Astronaut Andreas Mogensen begins his second mission to the International Space Station, arriving the following day. As pilot of SpaceX Crew-7, he becomes the first non-American to pilot a SpaceX Dragon 2 or any other SpaceX vehicle. Mogensen's personal mission is the Huginn mission, where he will conduct over 30 European experiments on "climate, health, and space for Earth" for the European Space Agency. Joined by three other astronauts, he is taking part in Expedition 69 and 70, and they are expected to return to Earth in February 2024.

###

1 December – Jørn Mader (da), sports journalist norn 1947)

###

Prime Minister – Mette Frederiksen

###

Prime minister of the Faroe Islands – Aksel V. Jo

In [12]:
COMPLETION_MODEL_NAME = "text-davinci-003"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
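
The cell above calls `openai.Completion.create`, the module-level interface of the `openai` Python package before version 1.0. In versions 1.0 and later, requests go through a client object instead. The following is a minimal sketch of the same error-handling pattern written against that client shape. `complete_with_client` is a hypothetical helper, and the fake client exists only so the example runs offline. With the real package, the client would be created with `from openai import OpenAI; client = OpenAI()`.

```python
from types import SimpleNamespace

def complete_with_client(client, prompt, model, max_tokens=150):
    """Return the stripped completion text, or "" if the request fails."""
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

class FakeCompletions:
    """Offline stand-in that mimics the shape of a completions response."""
    def create(self, model, prompt, max_tokens):
        return SimpleNamespace(choices=[SimpleNamespace(text=" 2023\n")])

fake_client = SimpleNamespace(completions=FakeCompletions())
demo_answer = complete_with_client(
    fake_client, "Question: ...\nAnswer:", model="hypothetical-model"
)
print(demo_answer)
```

Model availability changes over time, so the model name is left as a parameter rather than fixed in the helper.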
        

## Custom Performance Demonstration

In the cells below, the performance of the custom query is demonstrated using two questions. For each question, the notebook shows the answer from a basic completion model query as well as the answer from the custom query.

### Question 1

In [13]:
zelensky_prompt = """
Question: "What year did Volodymyr Zelensky last visit Denmark?"
Answer:
"""
initial_zelensky_answer = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=zelensky_prompt,
    max_tokens=200
)["choices"][0]["text"].strip()
print(initial_zelensky_answer)

Volodymyr Zelensky last visited Denmark in 2019.


In [14]:
custom_zelensky_answer = answer_question("What year did Volodymyr Zelensky last visit Denmark?", df)
print(custom_zelensky_answer)

2023


### Question 2

In [15]:
astronaut_prompt = """
Question: "Describe Danish astronaut Andreas Mogensen's latest space mission?"
Answer:
"""
initial_astronaut_answer = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=astronaut_prompt,
    max_tokens=200
)["choices"][0]["text"].strip()
print(initial_astronaut_answer)

Danish astronaut Andreas Mogensen's latest space mission was a 2015 mission to the International Space Station. The mission lasted nine days, during which he conducted experiments related to human performance, robotics, educational outreach, and questions of intercultural collaboration. During his mission, Andreas became the first Dane to venture into space, and even spent time remotely controlling a rover on Earth from the ISS. He also made history by becoming the first person to use a 3D printer in space.


In [16]:
custom_astronaut_answer = answer_question("Describe Danish astronaut Andreas Mogensen's latest space mission?", df)
print(custom_astronaut_answer)

Andreas Mogensen is taking part in Expedition 69 and 70 of the International Space Station, beginning on 26 August 2023. His mission is named Huginn and it involves conducting over 30 European experiments on "climate, health, and space for Earth" for the European Space Agency. He is accompanied by three other astronauts, and they are expected to return to Earth in February 2024.
