# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

In [1]:
import openai
openai.api_key = "YOUR API KEY"

In [2]:
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
batch_size = 100
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

### Step 1: Prepare a Dataset with Embeddings

In [3]:
import requests

# Get the Wikipedia page for "2023", since the completion model's training data ends in 2021
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
response_dict["query"]["pages"][0]["extract"].split("\n")


['2023 (MMXXIII) was a common year starting on Sunday of the Gregorian calendar, the 2023rd year of the Common Era (CE) and Anno Domini (AD) designations, the 23rd  year of the 3rd millennium and the 21st century, and the  4th   year of the 2020s decade.  ',
 'The year 2023 saw the decline in severity of the COVID-19 pandemic, with the WHO (World Health Organization) ending its global health emergency status in May. Catastrophic natural disasters included the fifth-deadliest earthquake of the 21st century striking Turkey and Syria, leaving up to 62,000 people dead, Cyclone Freddy – the longest-lasting recorded tropical cyclone in history – leading to over 1,400 deaths in Malawi and Mozambique, Storm Daniel, which became the deadliest cyclone worldwide since Cyclone Nargis after killing at least 11,000 people in Libya, a major 6.8 magnitude earthquake striking western Morocco, killing 2,960 people, and a 6.3 magnitude quadruple earthquake striking western Afghanistan, killing over 1,400

In [4]:
import pandas as pd

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")


In [5]:
from dateutil.parser import parse

# Clean up text to remove empty lines and headings
# (.copy() so the edits below modify this dataframe, not a view of the original)
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.loc because assigning to the iterrows() row may not update df
            df.loc[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
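
The heading-propagation step above can be tried on a small list without pandas. This is a minimal sketch, assuming date headings look like "February 6". `is_date_heading` and `prefix_with_dates` are hypothetical helpers, the sample lines are made up, and `datetime.strptime` is swapped in for `dateutil.parser.parse`, which accepts far more date formats.

```python
from datetime import datetime

SEPARATOR = " – "  # en dash surrounded by spaces, as in the notebook


def is_date_heading(line):
    """Return True if the line is only a month and a day, e.g. 'February 6'."""
    try:
        # A year is appended so the day of month is parsed with a full date
        datetime.strptime(line + " 2023", "%B %d %Y")
        return True
    except ValueError:
        return False


def prefix_with_dates(lines):
    """Prepend the most recent date heading to lines that lack a separator."""
    prefix = ""
    result = []
    for line in lines:
        if SEPARATOR in line:
            result.append(line)   # already carries its own prefix
        elif is_date_heading(line):
            prefix = line         # remember the heading and drop the line
        else:
            result.append(prefix + SEPARATOR + line)
    return result


# Made-up sample lines (not taken from the Wikipedia page)
sample = ["January 1 – First event.", "February 6", "Second event."]
print(prefix_with_dates(sample))
# ['January 1 – First event.', 'February 6 – Second event.']
```

As in the notebook cell, a line that appears before any date heading gets an empty prefix and therefore starts with the bare separator.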


In [6]:
df

Unnamed: 0,text
0,– 2023 (MMXXIII) was a common year starting o...
1,The year 2023 saw the decline in severity of t...
2,– The Russian invasion of Ukraine and Myanmar...
3,– A banking crisis resulted in the collapse o...
4,"– In the realm of technology, 2023 saw the co..."
...,...
211,"Economics – Claudia Goldin, for her empirical ..."
212,"Literature – Jon Fosse, for his innovative pla..."
213,"Peace – Narges Mohammadi, for her works on the..."
214,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."


In [7]:
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
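
To see how the loop above slices the data: the dataframe printed earlier has 215 rows (last index 214), so a `batch_size` of 100 results in three embedding requests. The sketch below only computes the slice bounds and makes no API call. `batch_bounds` is a hypothetical helper introduced for illustration.

```python
def batch_bounds(n_rows, batch_size):
    """Return the (start, stop) bounds that the batching loop would slice."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


bounds = batch_bounds(215, 100)
print(bounds)  # [(0, 100), (100, 200), (200, 215)]
```

The explicit `min` mirrors what `df.iloc[i:i+batch_size]` does implicitly: a slice that runs past the end is simply cut short, so the last batch holds the remaining 15 rows.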


In [8]:
df

Unnamed: 0,text,embeddings
0,– 2023 (MMXXIII) was a common year starting o...,"[0.011596864089369774, -0.03158478066325188, 0..."
1,The year 2023 saw the decline in severity of t...,"[0.018211673945188522, 0.038739945739507675, 0..."
2,– The Russian invasion of Ukraine and Myanmar...,"[-0.06628242135047913, 0.005743596237152815, 0..."
3,– A banking crisis resulted in the collapse o...,"[-0.013324891217052937, -0.030914733186364174,..."
4,"– In the realm of technology, 2023 saw the co...","[0.026940442621707916, 0.0006948678055778146, ..."
...,...,...
211,"Economics – Claudia Goldin, for her empirical ...","[0.012391716241836548, 0.0376654677093029, 0.0..."
212,"Literature – Jon Fosse, for his innovative pla...","[-0.03738046810030937, 0.013680481351912022, 0..."
213,"Peace – Narges Mohammadi, for her works on the...","[0.033590178936719894, -0.0021617852617055178,..."
214,"Physics – Pierre Agostini, Ferenc Krausz & Ann...","[-0.037235796451568604, 0.007389205042272806, ..."


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Step 2: Find Relevant Data with Unsupervised Machine Learning

In [9]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
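
For intuition about the `distances` column, here is a minimal standard-library sketch of cosine distance, assuming the `"cosine"` metric is the usual one minus cosine similarity. `cosine_distance` is a hypothetical stand-in written for illustration, not the helper from `openai.embeddings_utils`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite -> 2.0
```

Under this definition the distance depends only on direction, not vector length, and ranges from 0 to 2. Values near or slightly above 1 therefore indicate text that is essentially unrelated to the question.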


In [10]:
question = "When did the Turkey–Syria earthquakes occur?"
df_sorted_by_relevance = get_rows_sorted_by_relevance(question, df)

In [11]:
df_sorted_by_relevance

Unnamed: 0,text,embeddings,distances
30,February 6 – A 7.8 Mww earthquake strikes sout...,"[-0.015422926284372807, 0.02969459258019924, 0...",0.309362
166,October 7 – A series of earthquakes occur in H...,"[-0.07051253318786621, -0.02271047793328762, 0...",0.542580
1,The year 2023 saw the decline in severity of t...,"[0.018211673945188522, 0.038739945739507675, 0...",0.604222
153,September 8 – 2023 Marrakesh–Safi earthquake: ...,"[0.0174105204641819, 0.0006235209293663502, 0....",0.642041
87,May 7 – Syria is readmitted into the Arab Leag...,"[-0.04098215326666832, 0.029430272057652473, 0...",0.661789
...,...,...,...
72,April 14 – Jupiter Icy Moons Explorer (JUICE) ...,"[-0.0424734465777874, 0.015120547264814377, 0....",0.972254
86,May 6 – The coronation of Charles III and Cami...,"[0.06027829274535179, 0.0025942244101315737, 0...",0.981037
4,"– In the realm of technology, 2023 saw the co...","[0.026940442621707916, 0.0006948678055778146, ...",1.001220
140,"August 10 – Tapestry, the holding company of C...","[0.018905624747276306, -0.003249945119023323, ...",1.011993


### Step 3: Compose a Custom Text Prompt

In [12]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""

    current_token_count = (
        len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    )

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
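
The context-selection loop in `create_prompt` is a greedy fill against a token budget. The sketch below isolates that logic so it can be run without `tiktoken` or an API key. `pack_context` and `count_words` are hypothetical names, and the whitespace word count is only a stand-in for a real tokenizer.

```python
def pack_context(texts, budget, count_tokens):
    """Greedily keep texts, in order, until the next one would exceed budget."""
    used = 0
    kept = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break
        kept.append(text)
    return kept


def count_words(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word
    return len(text.split())


# Rows are assumed to be sorted from most to least relevant already
ranked = ["most relevant row here", "second row", "a much longer third row of text"]
print(pack_context(ranked, budget=6, count_tokens=count_words))
# ['most relevant row here', 'second row']
```

Here `budget` plays the role of `max_token_count` minus the tokens already used by the template and the question. Like the notebook function, the sketch stops at the first row that does not fit instead of skipping it and trying shorter, less relevant rows.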


In [13]:
def answer_question_custom(question, df, max_prompt_tokens=1800, max_answer_tokens=150):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)
    print("prompt: {}".format(prompt))

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


In [14]:
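def answer_question(question, max_answer_tokens=150):
    """
    Baseline: send only the question, with no retrieved context, to the
    same Completion model so its answer can be compared with the custom one

    If the model produces an error, return an empty string
    """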
    prompt = """
        Question: {}
        Answer:
        """.format(question)
    print("prompt: {}".format(prompt))

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [15]:
question_1 = "When did the Turkey–Syria earthquakes occur?"
# Answer_1 = "February 6th, 2023"

In [16]:
custom_answer = answer_question_custom(question_1, df)
print("Answer: {}".format(custom_answer))


prompt: 
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        February 6 – A 7.8 Mww earthquake strikes southern and central Turkey and northern and western Syria followed by a 7.7 Mww  aftershock on the same day, causing widespread damage and at more than 59,000 fatalities and 121,000 injured.

###

October 7 – A series of earthquakes occur in Herat Province in Afghanistan, killing over 1,000 people and injuring nearly 2,000, with tremors felt in Iran and Turkmenistan. The earthquakes are the deadliest in the country since 1998.

###

The year 2023 saw the decline in severity of the COVID-19 pandemic, with the WHO (World Health Organization) ending its global health emergency status in May. Catastrophic natural disasters included the fifth-deadliest earthquake of the 21st century striking Turkey and Syria, leaving up to 62,000 people dead, Cyclone Freddy – the longest-

Answer: February 6


In [17]:
answer = answer_question(question_1)
print(answer)


prompt: 
        Question: When did the Turkey–Syria earthquakes occur?
        Answer:
        
The Turkey-Syria earthquakes occurred on October 23, 2017.


### Question 2

In [18]:
question_2 = "When was GPT-4 announced?"
# Answer_2 = "March 14th, 2023"

In [19]:
custom_answer = answer_question_custom(question_2, df)
print(custom_answer)


prompt: 
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        March 14 – OpenAI launches GPT-4, a large language model for ChatGPT, which can respond to images and can process up to 25,000 words.

###

December 6 – Google DeepMind releases the Gemini Language Model. Gemini will act as a foundational model integrated into Google's existing tools, including Search and Bard.

###

 – In the realm of technology, 2023 saw the continued rise of generative AI models, with increasing applications across various industries. These models, leveraging advancements in machine learning and natural language processing, had become capable of creating realistic and coherent text, images, and music. An AI arms race between private companies has continued since the late 2010s, with Microsoft-backed OpenAI and Google owner Alphabet today most dominant among firms.

###

November 1 – The fi

March 14


In [20]:
answer = answer_question(question_2)
print(answer)


prompt: 
        Question: When was GPT-4 announced?
        Answer:
        
There is no information available on when GPT-4 was announced at this time. It is currently still in development and there has been no official announcement regarding its release.
