# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

Chosen dataset: https://en.wikipedia.org/wiki/2024_NBA_playoffs

This dataset is appropriate because the 2024 NBA playoffs took place after the model's training data cutoff, so the model has no reliable knowledge of the results, the updated statistics, or who won the championship.

In [2]:
import openai
openai.api_key = "YOUR API KEY"

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [3]:
import pandas as pd
import requests

# Fetch the plain-text extract of the Wikipedia page "2024 NBA playoffs",
# which covers events after the model's training data cutoff
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2024_NBA_playoffs&explaintext=1&formatversion=2&format=json")
resp

<Response [200]>
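
The query string in the request above can also be assembled from a parameter dictionary, which makes each option easier to read and change. The following is a minimal sketch using only the standard library; the names `API_ENDPOINT` and `build_extract_url` are chosen for this example and do not appear in the notebook.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the URL used in the request above.
# API_ENDPOINT and build_extract_url are illustrative names for this sketch.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2024_NBA_playoffs",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json",
}


def build_extract_url(params):
    """Return the endpoint URL with the parameters encoded as a query string."""
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_url(params)
print(url)
```

Passing the same dictionary as `params=params` to `requests.get(API_ENDPOINT, ...)` should produce an equivalent request.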

In [4]:
resp.json()["query"]["pages"][0]["extract"].split("\n")

["The 2024 NBA playoffs was the postseason tournament of the National Basketball Association's (NBA) 2023–24 season. The playoffs began on April 20 and concluded on June 17 with the Boston Celtics winning the 2024 NBA Finals for their 18th championship, the most in NBA history.",
 '',
 '',
 '== Overview ==',
 '',
 '',
 '=== Updates to playoff appearances ===',
 'The Boston Celtics entered the playoffs for the tenth consecutive season, the longest present streak in the NBA. They also won the Maurice Podoloff Trophy for clinching the best record in the NBA for the first time since 2008.',
 'The Oklahoma City Thunder entered the playoffs for the first time since 2020 and also clinched the number one seed in the Western Conference for the first time since 2013.',
 'The Milwaukee Bucks entered the playoffs for the eighth consecutive season.',
 'The Philadelphia 76ers entered the playoffs for the seventh consecutive season.',
 'The Denver Nuggets entered the playoffs for the sixth consecutiv

In [5]:
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")
df

Unnamed: 0,text
0,The 2024 NBA playoffs was the postseason tourn...
1,
2,
3,== Overview ==
4,
...,...
361,
362,
363,== External links ==
364,


In [6]:
# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
df

Unnamed: 0,text
0,The 2024 NBA playoffs was the postseason tourn...
7,The Boston Celtics entered the playoffs for th...
8,The Oklahoma City Thunder entered the playoffs...
9,The Milwaukee Bucks entered the playoffs for t...
10,The Philadelphia 76ers entered the playoffs fo...
...,...
333,This was the first playoff meeting between the...
346,"ABC, ESPN, TNT, and NBA TV broadcast the playo..."
350,This was the first playoffs in which the strea...
354,"For the third straight year, the playoffs is o..."
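
The cleaning rule used above (keep a row only if it is non-empty and does not start with `==`) can be illustrated on a plain list. This is only a sketch of the same condition outside pandas; the sample lines are taken from the extract shown earlier, and `keep_line` is a name introduced for this example.

```python
# Sample lines taken from the split extract shown earlier in the notebook
lines = [
    "",
    "== Overview ==",
    "=== Updates to playoff appearances ===",
    "The Milwaukee Bucks entered the playoffs for the eighth consecutive season.",
    "The Philadelphia 76ers entered the playoffs for the seventh consecutive season.",
]


def keep_line(line):
    """Mirror the dataframe filter: non-empty and not a '==' heading."""
    return len(line) > 0 and not line.startswith("==")


kept = [line for line in lines if keep_line(line)]
print(kept)
```

Note that this rule keeps every non-heading line, so short entries from sections such as "External links" also remain in the dataset.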


In [7]:
df.head(20)

Unnamed: 0,text
0,The 2024 NBA playoffs was the postseason tourn...
7,The Boston Celtics entered the playoffs for th...
8,The Oklahoma City Thunder entered the playoffs...
9,The Milwaukee Bucks entered the playoffs for t...
10,The Philadelphia 76ers entered the playoffs fo...
11,The Denver Nuggets entered the playoffs for th...
12,The Miami Heat entered the playoffs for the fi...
13,The Phoenix Suns entered the playoffs for the ...
14,The Minnesota Timberwolves entered the playoff...
15,"The Los Angeles Clippers, New York Knicks, Cle..."


In [8]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe. df is a filtered slice of the original
# dataframe, so take an explicit copy first to avoid SettingWithCopyWarning
df = df.copy()
df["embeddings"] = embeddings
df


Unnamed: 0,text,embeddings
0,The 2024 NBA playoffs was the postseason tourn...,"[-0.009210898540914059, -0.02458782307803631, ..."
7,The Boston Celtics entered the playoffs for th...,"[-0.006648636423051357, -0.027243826538324356,..."
8,The Oklahoma City Thunder entered the playoffs...,"[-0.001295544090680778, -0.02951119840145111, ..."
9,The Milwaukee Bucks entered the playoffs for t...,"[-0.013931754045188427, -0.03255034610629082, ..."
10,The Philadelphia 76ers entered the playoffs fo...,"[-0.012970644049346447, -0.020927319303154945,..."
...,...,...
333,This was the first playoff meeting between the...,"[-0.007964164018630981, -0.02083410508930683, ..."
346,"ABC, ESPN, TNT, and NBA TV broadcast the playo...","[-0.019491789862513542, -0.036794938147068024,..."
350,This was the first playoffs in which the strea...,"[-0.0038187759928405285, -0.030471941456198692..."
354,"For the third straight year, the playoffs is o...","[-0.007075726520270109, -0.02635233663022518, ..."
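
The embedding loop above sends the rows in slices of `batch_size` so that no single request carries the whole dataset. The slicing pattern on its own can be sketched as follows; `iter_batches` is a hypothetical helper, and the 250 placeholder rows are made up for illustration and are not the size of the real dataframe.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder data: 250 made-up rows, not the notebook's actual text
texts = [f"row {i}" for i in range(250)]

batch_lengths = [len(batch) for batch in iter_batches(texts, 100)]
print(batch_lengths)  # [100, 100, 50]
```

The last batch is simply shorter when the row count is not a multiple of the batch size, which is also how `df.iloc[i:i+batch_size]` behaves.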


In [9]:
df.to_csv("embeddings.csv")

In [10]:
! ls

data  embeddings.csv  project.ipynb


In [11]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [12]:
get_rows_sorted_by_relevance("How many championships have the boston celtics won?", df)

Unnamed: 0,text,embeddings,distances
312,This was the seventh playoff meeting between t...,"[-0.0034784337040036917, -0.008470291271805763...",0.140090
7,The Boston Celtics entered the playoffs for th...,"[-0.006648636423051357, -0.027243826538324356,...",0.140365
30,The Boston Celtics finished first in the Easte...,"[-0.008426252752542496, -0.002618970349431038,...",0.145432
76,The Celtics entered the Eastern Conference fin...,"[-0.0025162517558783293, -0.001177837024442851...",0.154632
82,Game 1 vs the Pacers was the first time in Cel...,"[-0.010314851067960262, -0.02202703431248665, ...",0.160475
...,...,...,...
259,After missing the entire second quarter due t...,"[-0.012129021808505058, -0.009399819187819958,...",0.279047
234,Game 5 was the last Clippers home game played ...,"[0.0019048801623284817, -0.02574249543249607, ...",0.280708
277,After averaging just 22 points on 39% shootin...,"[-0.01934794709086418, 0.002907851245254278, 0...",0.282291
230,"Luka Dončić had 32 points and nine assists, K...","[-0.013986761681735516, -0.005226943641901016,...",0.282844
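
The `distances` column above holds cosine distances, so smaller values mean the row points in a direction closer to the question. As a rough illustration, assuming cosine distance is defined as one minus cosine similarity, the ranking can be reproduced on toy two-dimensional vectors. The vectors and the `cosine_distance` helper below are made up for this sketch and are not the notebook's embeddings or the library's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for the question and row embeddings
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending by distance: most relevant first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not affect the result, only direction, which is why `[2.0, 0.0]` is at distance zero from `[1.0, 0.0]`.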


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [13]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [14]:
print(create_prompt("Who won the 2024 NBA Championship?", df, 500))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

The 2024 NBA playoffs was the postseason tournament of the National Basketball Association's (NBA) 2023–24 season. The playoffs began on April 20 and concluded on June 17 with the Boston Celtics winning the 2024 NBA Finals for their 18th championship, the most in NBA history.

###

Basketball – Reference.com's 2024 Playoffs section

###

With the defending champion Nuggets losing to the Minnesota Timberwolves, the 2024 playoffs marked the fifth straight year where the defending champion was eliminated before the conference finals.

###

The Cavaliers won a playoff series without LeBron James for the first time since 1993.

###

The Nuggets' elimination also confirmed there would be unique NBA champions across a six-year span for the first time since 1975–1980.

###

The Pacers advanced to the Eastern Conference finals for the first time since 2014.
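
The prompt above is built by adding rows in relevance order until the token budget runs out. The sketch below shows that greedy packing in isolation. It is not the notebook's function: `count_tokens` is a whitespace word count standing in for the `tiktoken` encoder, `pack_context` is a hypothetical helper, and unlike `create_prompt` it also charges the budget for the `###` separators.

```python
def count_tokens(text):
    """Stand-in tokenizer: whitespace word count (the notebook uses tiktoken)."""
    return len(text.split())


def pack_context(rows, budget, separator="\n\n###\n\n"):
    """Greedily keep rows, in order, while the running cost fits the budget."""
    selected = []
    used = 0
    for row in rows:
        cost = count_tokens(row)
        if selected:
            # Every row after the first also pays for one separator
            cost += count_tokens(separator)
        if used + cost > budget:
            break
        selected.append(row)
        used += cost
    return selected, used


# Made-up rows, already sorted from most to least relevant
rows = ["a b c", "d e", "f g h i", "j"]
selected, used = pack_context(rows, budget=7)
print(selected, used)  # ['a b c', 'd e'] 6
```

Stopping at the first row that does not fit keeps the context in strict relevance order, at the cost of skipping shorter rows further down that might still have fit.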

In [15]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, use_custom_prompt=True, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    if use_custom_prompt:
        prompt = create_prompt(question, df, max_prompt_tokens)
    else:
        prompt = question
        
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

In [16]:
response = answer_question("Who won the 2024 NBA Championship?", df)
response

'The Boston Celtics won the 2024 NBA Championship.'

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

https://en.wikipedia.org/wiki/2024_NBA_playoffs

### Question 1

Question: Who won the 2024 NBA Championship?

Correct answer: The Boston Celtics

In [17]:
question = "Who won the 2024 NBA Championship?"

response = answer_question(question, df, use_custom_prompt=False)
response

'It is impossible to determine the winner of the 2024 NBA Championship as it has not yet taken place.'

In [18]:
response = answer_question(question, df)
response

'The Boston Celtics won the 2024 NBA Championship.'

### Question 2

Question: Who became the fourth player in NBA history to hit two go-ahead shots inside the final five seconds in the same postseason?

Correct answer: Jamal Murray

In [19]:
question = "Who became the fourth player in NBA history to hit two go-ahead shots inside the final five seconds in the same postseason?"

response = answer_question(question, df, use_custom_prompt=False)
response

'LeBron James.'

In [20]:
response = answer_question(question, df)
response

'Jamal Murray'

### Question 3

Question: Who joined Kobe Bryant as the only players aged 22 or younger to have consecutive 40-point playoff games?

Correct answer: Anthony Edwards

In [21]:
question = "Who joined Kobe Bryant as the only players aged 22 or younger to have consecutive 40-point playoff games?"

response = answer_question(question, df, use_custom_prompt=False)
response

'LeBron James'

In [22]:
response = answer_question(question, df)
response

'Anthony Edwards'

### Question 4

Question: How many championships have the Boston Celtics won?

Correct answer: 18

In [23]:
question = "How many championships have the Boston Celtics won?"

response = answer_question(question, df, use_custom_prompt=False)
response

'The Boston Celtics have won 17 NBA championships.'

In [24]:
response = answer_question(question, df)
response

'18'
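
The four comparisons above can be summarised with a simple substring check against the expected answers. The sketch below uses the responses recorded in this notebook; the `is_correct` rule (case-insensitive containment) is an assumption made for this example and would be too lenient for a larger evaluation.

```python
# Expected answers and recorded responses copied from the cells above
results = [
    {
        "expected": "Boston Celtics",
        "basic": "It is impossible to determine the winner of the 2024 NBA "
                 "Championship as it has not yet taken place.",
        "custom": "The Boston Celtics won the 2024 NBA Championship.",
    },
    {"expected": "Jamal Murray", "basic": "LeBron James.", "custom": "Jamal Murray"},
    {"expected": "Anthony Edwards", "basic": "LeBron James", "custom": "Anthony Edwards"},
    {
        "expected": "18",
        "basic": "The Boston Celtics have won 17 NBA championships.",
        "custom": "18",
    },
]


def is_correct(answer, expected):
    """Assumed scoring rule: the expected text appears in the answer."""
    return expected.lower() in answer.lower()


basic_score = sum(is_correct(r["basic"], r["expected"]) for r in results)
custom_score = sum(is_correct(r["custom"], r["expected"]) for r in results)
print(f"basic: {basic_score}/{len(results)}, custom: {custom_score}/{len(results)}")
```

On these four questions the basic query scores 0 and the custom query scores 4.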

## Enhancements

In [25]:
while True:
    question = input("Please enter your question (or 'exit' to quit): ")
    if question.lower() == 'exit':
        break
    result = answer_question(question, df)
    print(f"Answer: {result}")

Please enter your question (or 'exit' to quit): Who won the 2024 NBA Championship?
Answer: The Boston Celtics won the 2024 NBA Championship.
Please enter your question (or 'exit' to quit): Who became the fourth player in NBA history to hit two go-ahead shots inside the final five seconds in the same postseason?
Answer: Jamal Murray
Please enter your question (or 'exit' to quit): Who joined Kobe Bryant as the only players aged 22 or younger to have consecutive 40-point playoff games?
Answer: Anthony Edwards
Please enter your question (or 'exit' to quit): How many championships have the Boston Celtics won?
Answer: 18 championships.
Please enter your question (or 'exit' to quit): exit
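
One possible refinement of the loop above is to strip whitespace, skip empty input, and pass the input and output functions as arguments so the loop can be exercised without a keyboard. This is a sketch only; `chat_loop` and its parameters are hypothetical names, and the lambda stands in for `answer_question`.

```python
def chat_loop(answer_fn, read_input=input, write=print):
    """Ask for questions until the user types 'exit', ignoring blank input."""
    while True:
        question = read_input("Please enter your question (or 'exit' to quit): ").strip()
        if question.lower() == "exit":
            break
        if not question:
            # Nothing typed: ask again instead of querying the model
            continue
        write(f"Answer: {answer_fn(question)}")


# Scripted run: a blank line, one question, then an exit command
scripted = iter(["  ", "Who won?", "EXIT"])
outputs = []
chat_loop(
    lambda q: f"echo: {q}",  # stand-in for answer_question(question, df)
    read_input=lambda prompt: next(scripted),
    write=outputs.append,
)
print(outputs)
```

In the notebook, the call would look like `chat_loop(lambda q: answer_question(q, df))`.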
