# Custom Chatbot Project

The dataset I have chosen is the **character_descriptions** dataset, provided as a .csv file. It suits the task because it contains specific information about a set of characters that a general-purpose LLM does not have, so it is easy to compare the model's answers before and after the relevant context is supplied.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

Load the dataset into a pandas DataFrame

In [2]:
import pandas as pd

df = pd.DataFrame()
df_origin = pd.read_csv("./data/character_descriptions.csv")
df["text"] = df_origin[["Name", "Medium", "Setting", "Description"]].agg(' - '.join, axis=1)
df["text"]

0     Emily - Play - England - A young woman in her ...
1     Jack - Play - England - A middle-aged man in h...
2     Alice - Play - England - A woman in her late 3...
3     Tom - Play - England - A man in his 50s, Tom i...
4     Sarah - Play - England - A woman in her mid-20...
5     George - Play - England - A man in his early 3...
6     Rachel - Play - England - A woman in her late ...
7     John - Play - England - A man in his 60s, John...
8     Maria - Movie - Texas - A middle-aged Latina w...
9     Caleb - Movie - Texas - A young African Americ...
10    Tyler - Movie - Texas - A white man in his mid...
11    Sonya - Movie - Texas - A white woman in her l...
12    Manuel - Movie - Texas - A middle-aged Hispani...
13    Will - Movie - Texas - A white man in his earl...
14    Mia - Limited Series - Australia - A young Aus...
15    Lucas - Limited Series - Australia - A middle-...
16    Tahlia - Limited Series - Australia - A young ...
17    Max - Limited Series - Australia - A white
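The `' - '.join` aggregation above assumes every cell is a string; `str.join` raises a `TypeError` when it meets a non-string value such as a missing cell. The following is a minimal, dependency-free sketch of the same join that treats missing cells as empty strings. The sample CSV text and the helper name `rows_to_text` are hypothetical and are not part of the project data or code.

```python
import csv
import io

# Hypothetical two-row sample using the same column names as the cell above.
# The second row has an empty Description to show how gaps are handled.
SAMPLE_CSV = """Name,Description,Medium,Setting
Emily,A young woman in her early 20s.,Play,England
Jack,,Play,England
"""

COLUMNS = ["Name", "Medium", "Setting", "Description"]


def rows_to_text(csv_text, columns=COLUMNS, sep=" - "):
    """Join the selected columns of each CSV row into one text string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    texts = []
    for row in reader:
        # Treat missing or empty cells as empty strings so join never fails
        parts = [(row.get(col) or "").strip() for col in columns]
        texts.append(sep.join(parts))
    return texts


texts = rows_to_text(SAMPLE_CSV)
for text in texts:
    print(text)
```

In the notebook itself, a comparable guard would be to fill missing values and cast to strings before aggregating.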

Create an embedding for each row of text

In [3]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 10
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

| | text | embeddings |
|---|---|---|
| 0 | Emily - Play - England - A young woman in her ... | [-0.01285545527935028, -0.011429925449192524, ... |
| 1 | Jack - Play - England - A middle-aged man in h... | [0.011086059734225273, -0.02076270990073681, 0... |
| 2 | Alice - Play - England - A woman in her late 3... | [0.009414365515112877, -0.007939904928207397, ... |
| 3 | Tom - Play - England - A man in his 50s, Tom i... | [0.021002890542149544, -0.014546399004757404, ... |
| 4 | Sarah - Play - England - A woman in her mid-20... | [-0.0065544648095965385, -0.02564818598330021,... |
| 5 | George - Play - England - A man in his early 3... | [-0.01701248623430729, -0.011330811306834221, ... |
| 6 | Rachel - Play - England - A woman in her late ... | [-0.00021242130605969578, -0.00926116947084665... |
| 7 | John - Play - England - A man in his 60s, John... | [0.022466149181127548, -0.013257226906716824, ... |
| 8 | Maria - Movie - Texas - A middle-aged Latina w... | [-0.013499060645699501, -0.02056492306292057, ... |
| 9 | Caleb - Movie - Texas - A young African Americ... | [0.0057533192448318005, -0.03238619118928909, ... |
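The batching loop in the embedding cell can be separated from the API call so that it can be checked offline. This is a sketch under assumptions: `iter_batches`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the stand-in embedder returns a one-element vector instead of calling OpenAI.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_batch, batch_size=10):
    """Embed texts batch by batch, keeping results in the original order."""
    embeddings = []
    for batch in iter_batches(texts, batch_size):
        vectors = embed_batch(batch)
        # One vector per input is required to align with the dataframe rows
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


def fake_embed_batch(batch):
    """Stand-in embedder (assumption): a 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


texts = ["row %d" % i for i in range(23)]
vectors = embed_all(texts, fake_embed_batch, batch_size=10)
print(len(vectors))
```

In the notebook, `embed_batch` would wrap the `openai.Embedding.create` call and return the `"embedding"` field of each item in `response["data"]`.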


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
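To make the ranking step concrete, here is a small pure-Python sketch of cosine distance, taken here to be one minus cosine similarity, and of sorting rows by it. It is an illustration under that assumption, not the implementation behind `distances_from_embeddings`; the names `cosine_distance` and `rank_by_distance` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Sort (text, vector) pairs by distance to query_vec, closest first."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[0])


# Toy 2-dimensional "embeddings" (assumption, for illustration only)
rows = [
    ("opposite", [-1.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("same direction", [2.0, 0.0]),
]
ranked = rank_by_distance([1.0, 0.0], rows)
for distance, text in ranked:
    print(round(distance, 3), text)
```

A smaller distance means the row points in a direction closer to the question's embedding, which is why the dataframe is sorted in ascending order.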

In [6]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
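The token-budget loop in `create_prompt` can be tried without `tiktoken` by passing in the counting function. The sketch below mirrors that loop under assumptions: `pack_context` and `whitespace_token_count` are hypothetical names, and counting whitespace-separated words is only a rough stand-in for a real tokenizer. Like the cell above, it does not count the tokens used by the separator between context rows.

```python
def pack_context(ranked_texts, count_tokens, base_tokens, max_tokens):
    """Keep adding rows, most relevant first, until the budget is exceeded."""
    used = base_tokens  # tokens already spent on the template and question
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_tokens:
            break
        context.append(text)
    return context


def whitespace_token_count(text):
    """Stand-in token counter (assumption): one token per word."""
    return len(text.split())


ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
selected = pack_context(
    ranked, whitespace_token_count, base_tokens=5, max_tokens=10
)
print(selected)
```

With a budget of 10 and 5 tokens already used, the first two rows fit exactly (3 + 2 tokens) and the third is dropped.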

In [7]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [11]:
test_prompt = """
Question: "Who is Soifa's mother and where is she from?"
Answer:
"""

answer_question(test_prompt, df, max_prompt_tokens=1800, max_answer_tokens=150)

"Maria is Soifa's mother and she is from Texas."

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1
Who is always worrying her children? How many children does she have?

In [12]:
Q1_prompt = """
Question: "Who is always worrying her children? How many children does she have?"
Answer:
"""

Q1_initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Q1_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print("Q1 initial answer:\n{}".format(Q1_initial_answer))

Q1 initial answer:
I'm sorry, I cannot answer this question as it is not specified who "she" refers to in the question. Please provide more context or information.


In [13]:
Q1_custom_answer = answer_question(Q1_prompt, df, max_prompt_tokens=1800, max_answer_tokens=150)
print("Q1 custom answer:\n{}".format(Q1_custom_answer))

Q1 custom answer:
Alice is always worrying about her children and she has two children, including Emily.


### Question 2
What do Ava's husband do?

In [14]:
Q2_prompt = """
Question: "What do Ava's husband do?"
Answer:
"""

Q2_initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Q2_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print("Q2 initial answer:\n{}".format(Q2_initial_answer))

Q2 initial answer:
There is not enough information provided in the question to accurately determine what Ava's husband does. We would need more context or information to answer this question accurately.


In [15]:
Q2_custom_answer = answer_question(Q2_prompt, df, max_prompt_tokens=1800, max_answer_tokens=150)
print("Q2 custom answer:\n{}".format(Q2_custom_answer))

Q2 custom answer:
Based on the given information, Ava's husband Lucas is successful businessman and CEO of a major tech company. However, their marriage is on the rocks due to his infidelity.
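The two comparisons above repeat the same pattern, so they could be driven from a list of questions. This is a hedged sketch: `compare_answers`, `format_comparison`, and the stub answer functions are hypothetical, and no model is called here.

```python
def compare_answers(questions, baseline_fn, custom_fn):
    """Collect the baseline and custom answer for each question."""
    results = []
    for question in questions:
        results.append({
            "question": question,
            "baseline": baseline_fn(question),
            "custom": custom_fn(question),
        })
    return results


def format_comparison(results):
    """Render the collected answers as plain text, one block per question."""
    blocks = []
    for item in results:
        blocks.append(
            "Question: {question}\n"
            "Baseline answer: {baseline}\n"
            "Custom answer: {custom}".format(**item)
        )
    return "\n\n".join(blocks)


# Stub answer functions (assumption) so the sketch runs without an API key
def stub_baseline(question):
    return "not enough context"


def stub_custom(question):
    return "answer using retrieved context"


results = compare_answers(["Example question?"], stub_baseline, stub_custom)
print(format_comparison(results))
```

In the notebook, `baseline_fn` would wrap the plain `openai.Completion.create` call and `custom_fn` would be something like `lambda q: answer_question(q, df)`.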
