<a href="https://colab.research.google.com/github/seobando/UDACITY_GenerativeAI/blob/main/Custom_Chatbot_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install / Load required libraries

In [76]:
!pip install openai



In [77]:
!pip install tiktoken



# Custom Chatbot Project



TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

- This notebook adapts OpenAI's cookbook example on question answering using embeddings-based search. It can be pointed at other Wikipedia articles by changing the <b>"article"</b> and <b>"instruction"</b> variables.
- The dataset is the Wikipedia article on cross-country skiing at the 2022 Winter Olympics, retrieved through the Wikipedia API.
- Official documentation from [OpenAI](https://cookbook.openai.com/examples/question_answering_using_embeddings).

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [131]:
from dateutil.parser import parse
import pandas as pd
import requests

# Get the Wikipedia page
article = "Cross-country_skiing_at_the_2022_Winter_Olympics"
url = f"https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=100&titles={article}&explaintext=1&formatversion=2&format=json"
resp = requests.get(url)

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.loc because iterrows() may yield copies of the rows
            df.loc[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df

| | text |
|---|---|
| 0 | – Cross-country skiing at the 2022 Winter Oly... |
| 1 | – A total of 296 quota spots (148 per gender)... |
| 6 | – A maximum of 296 quota spots will be availa... |
| 10 | – The following was the competition schedule ... |
| 11 | – All times are (UTC+8). |
| 22 | – a This event was shortened to 28.4 km due t... |
| 29 | – A total of 296 athletes from 52 nations (in... |
| 36 | Official Results Book – Cross-country Skiing |
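
The date-prefix step above can be tried on a handful of made-up lines without calling Wikipedia. The sketch below is only an illustration: it uses the standard library's `datetime.strptime` with an assumed `"%d %B %Y"` format in place of `dateutil.parser.parse`, and the helper names `is_date` and `prefix_with_dates` are hypothetical.

```python
from datetime import datetime


def is_date(text: str, fmt: str = "%d %B %Y") -> bool:
    """Return True if text matches the assumed date format, e.g. '5 February 2022'."""
    try:
        datetime.strptime(text.strip(), fmt)
        return True
    except ValueError:
        return False


def prefix_with_dates(lines: list[str], sep: str = " – ") -> list[str]:
    """Drop date-only lines and prepend the most recent date to the lines after it."""
    prefix = ""
    out = []
    for line in lines:
        if sep in line:
            out.append(line)  # already carries a prefix, keep as-is
        elif is_date(line):
            prefix = line  # remember the date heading, do not emit it
        else:
            out.append(prefix + sep + line)
    return out


# Made-up sample lines, not taken from the article
sample = [
    "Intro sentence.",
    "5 February 2022",
    "First event.",
    "6 February 2022",
    "Second event.",
]
result = prefix_with_dates(sample)
for line in result:
    print(repr(line))
```

Lines that appear before any date heading get an empty prefix, which is why several rows in the dataframe above start with a bare " – ".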


In [132]:
# imports
import ast  # for converting embeddings saved as strings back to arrays
from openai import OpenAI # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
import os # for getting API token from env variable OPENAI_API_KEY
from scipy import spatial  # for calculating vector similarities for search

# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

# Read the API key from the OPENAI_API_KEY environment variable;
# never hard-code a secret key in the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


In [133]:
def get_embedding(text, model=EMBEDDING_MODEL):
    """Return the embedding vector for a single string."""
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Embed every text sample with the same model that is later used for queries
df["embedding"] = df["text"].apply(get_embedding)
df

| | text | embedding |
|---|---|---|
| 0 | – Cross-country skiing at the 2022 Winter Oly... | [-0.0022337331902235746, -0.011550666764378548... |
| 1 | – A total of 296 quota spots (148 per gender)... | [-0.014492054469883442, 0.00184702652040869, 0... |
| 6 | – A maximum of 296 quota spots will be availa... | [-0.0025350630749017, 0.0031326254829764366, 0... |
| 10 | – The following was the competition schedule ... | [-0.022179607301950455, 0.004352595191448927, ... |
| 11 | – All times are (UTC+8). | [-0.009887934662401676, -0.00954884197562933, ... |
| 22 | – a This event was shortened to 28.4 km due t... | [-0.01556546799838543, -0.005157393869012594, ... |
| 29 | – A total of 296 athletes from 52 nations (in... | [-0.009800657629966736, -0.010591244325041771,... |
| 36 | Official Results Book – Cross-country Skiing | [-0.01191276777535677, 0.013225248083472252, 0... |
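
Embeddings cost an API call per row, so it can help to save them once and reload them later. When a list of floats is written to CSV it becomes a string, and `ast.literal_eval` turns it back into a list. The sketch below shows that round trip with made-up three-dimensional vectors and an in-memory buffer standing in for a file. It uses only the standard library and is not the notebook's own code.

```python
import ast
import csv
import io

# Made-up rows; real embeddings from the model have far more dimensions
rows = [
    {"text": "first sample", "embedding": [0.1, -0.2, 0.3]},
    {"text": "second sample", "embedding": [0.0, 0.5, -0.5]},
]

# Write: each embedding list is stored as its string representation
buffer = io.StringIO()  # stand-in for an open file on disk
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row["text"], "embedding": str(row["embedding"])})

# Read: literal_eval safely parses the string back into a list of floats
buffer.seek(0)
restored = [
    {"text": r["text"], "embedding": ast.literal_eval(r["embedding"])}
    for r in csv.DictReader(buffer)
]
print(restored[0]["embedding"])
```

With pandas the same idea would apply to a column read back from CSV, by mapping `ast.literal_eval` over the `embedding` column.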


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [134]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
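
To see what the ranking does without calling the embeddings API, the same cosine-similarity ordering can be run on hand-made two-dimensional vectors. This is a sketch under stated assumptions: the vectors and labels are invented, and `cosine_similarity` and `rank_by_relatedness` are hypothetical stand-ins for `1 - spatial.distance.cosine` and the search function above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_relatedness(query_vec, items, top_n=3):
    """Return (text, score) pairs sorted from most to least related."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented toy "embeddings": first axis ~ skiing, second axis ~ curling
toy_items = [
    ("about skiing", [1.0, 0.0]),
    ("about curling", [0.0, 1.0]),
    ("mostly skiing", [0.8, 0.6]),
]
ranked = rank_by_relatedness([1.0, 0.0], toy_items, top_n=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

The item pointing in the same direction as the query ranks first, and the orthogonal one is cut off by `top_n`.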


In [135]:
# examples
strings, relatednesses = strings_ranked_by_relatedness("cross-country skiing", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.868


'Official Results Book – Cross-country Skiing'

relatedness=0.862


' – Cross-country skiing at the 2022 Winter Olympics was held at the Kuyangshu Nordic Center and Biathlon Center in Zhangjiakou, China.'

relatedness=0.782


' – a This event was shortened to 28.4 km due to high winds and freezing temperatures.'

relatedness=0.775


' – A total of 296 quota spots (148 per gender) were distributed to the sport, a decline of 14 from the 2018 Winter Olympics. A total of 12 events were contested, six each for men and women. '

relatedness=0.763


' – A total of 296 athletes from 52 nations (including the ROC) were scheduled to participate, with the numbers of athletes are shown in parentheses.'

In [136]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    instruction:str,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    question = f"\n\nQuestion: {query}"
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(instruction + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            instruction += next_article
    return instruction + question


def ask(
    query: str,
    instruction:str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, instruction, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content
    return response_message
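
The token-budget loop in `query_message` can be illustrated in isolation. In the sketch below, `count_tokens` is a deliberately crude stand-in that counts whitespace-separated words instead of real `tiktoken` tokens, and `pack_sections` is a hypothetical helper that mirrors the loop: sections are appended in ranked order until the next one would exceed the budget.

```python
def count_tokens(text: str) -> int:
    """Stand-in token counter: counts words, NOT real model tokens."""
    return len(text.split())


def pack_sections(instruction: str, sections: list[str], question: str, token_budget: int) -> str:
    """Append sections to the instruction until the next one would exceed the budget."""
    message = instruction
    for section in sections:
        candidate = f'\n\nWikipedia article section:\n"""\n{section}\n"""'
        if count_tokens(message + candidate + question) > token_budget:
            break  # stop at the first section that does not fit
        message += candidate
    return message + question


# Invented sections, assumed to be already sorted from most to least related
sections = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
packed = pack_sections(
    "Answer from the sections.",
    sections,
    "\n\nQuestion: what?",
    token_budget=20,
)
print(packed)
```

Because the loop breaks at the first section that does not fit, a shorter section further down the ranking is never tried. That keeps the most related text first at the cost of sometimes leaving budget unused.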



## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [137]:
def get_completition(client, prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

In [127]:
instruction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'

### Question 1

In [138]:
question_1 = "Which sports were played at the 2022 Winter Olympics?"

In [139]:
prompt_1 = f"""
Question: '{question_1}'
Answer:
"""
print(get_completition(client, prompt_1))

The sports played at the 2022 Winter Olympics were alpine skiing, biathlon, bobsleigh, cross-country skiing, curling, figure skating, freestyle skiing, ice hockey, luge, Nordic combined, short track speed skating, skeleton, ski jumping, snowboarding, speed skating, and the new addition of monobob.


In [140]:
ask(question_1,instruction)

'The sports played at the 2022 Winter Olympics were cross-country skiing.'

### Question 2

In [150]:
question_2 = "Which countries participated in cross-country skiing at the 2022 Winter Olympics?"

In [151]:
prompt_2 = f"""
Question: '{question_2}'
Answer:
"""
print(get_completition(client, prompt_2))

Some of the countries that participated in cross-country skiing at the 2022 Winter Olympics included Norway, Sweden, Russia, Finland, and the United States.


In [152]:
ask(question_2,instruction)

'A total of 52 nations participated in cross-country skiing at the 2022 Winter Olympics.'
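
The side-by-side comparison used for both questions can be wrapped in a small helper. The sketch below is one possible shape, not part of the original notebook: `compare_answers` is a hypothetical name, and the two answer functions are stubs so the example runs without an API key.

```python
from typing import Callable


def compare_answers(
    questions: list[str],
    baseline: Callable[[str], str],
    custom: Callable[[str], str],
) -> list[dict]:
    """Return one dict per question with the baseline and custom answers."""
    return [
        {"question": q, "baseline": baseline(q), "custom": custom(q)}
        for q in questions
    ]


# Stub answer functions; in the notebook these would call the OpenAI API
def fake_baseline(question: str) -> str:
    return f"baseline answer to: {question}"


def fake_custom(question: str) -> str:
    return f"custom answer to: {question}"


results = compare_answers(["Q1?", "Q2?"], fake_baseline, fake_custom)
for r in results:
    print(r["question"])
    print("  baseline:", r["baseline"])
    print("  custom:  ", r["custom"])
```

In this notebook the stubs could be replaced with `lambda q: get_completition(client, f"Question: '{q}'\nAnswer:")` for the baseline and `lambda q: ask(q, instruction)` for the custom query.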