# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

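The `character_descriptions.csv` dataset contains made-up character descriptions, so an OpenAI model cannot have seen it during training and will not be able to answer user queries about these characters from its own knowledge. That makes it appropriate for this task: a correct answer has to come from the context supplied in the custom prompt.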


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

#### Importing the necessary libraries



In [62]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = ""

In [76]:
import pandas as pd
import numpy as np

In [66]:
# read the CSV file from the data folder
df = pd.read_csv('data/character_descriptions.csv')
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [67]:
# Check if there are any missing values in the data
df.isnull().sum()

Name           0
Description    0
Medium         0
Setting        0
dtype: int64

In [68]:
# check if all the names in the data are unique
df['Name'].nunique() == df.shape[0]

True

In [69]:
# Normalize all the text in all the columns including the column names
df.columns = df.columns.str.lower()
df = df.apply(lambda x: x.str.lower().str.strip() if x.dtype == "object" else x)
df.head()

Unnamed: 0,name,description,medium,setting
0,emily,"a young woman in her early 20s, emily is an as...",play,england
1,jack,"a middle-aged man in his 40s, jack is a succes...",play,england
2,alice,"a woman in her late 30s, alice is a warm and n...",play,england
3,tom,"a man in his 50s, tom is a retired soldier and...",play,england
4,sarah,"a woman in her mid-20s, sarah is a free-spirit...",play,england


In [70]:
# Flatten the dataframe

df['text'] = df.apply(lambda row: f"Actor: {row['name']}\nRole Description: {row['description']}\nMedium: {row['medium']}\nSetting: {row['setting']}", axis=1)

In [71]:
# drop the columns that are no longer needed
df = df.drop(columns=['name', 'description', 'medium', 'setting'])
df.head()

Unnamed: 0,text
0,Actor: emily\nRole Description: a young woman ...
1,Actor: jack\nRole Description: a middle-aged m...
2,Actor: alice\nRole Description: a woman in her...
3,Actor: tom\nRole Description: a man in his 50s...
4,Actor: sarah\nRole Description: a woman in her...
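
The flattening step above can be illustrated on a single record without pandas. This is a minimal sketch, assuming a record shaped like one normalized row; `flatten_record` and the shortened sample record are hypothetical and exist only for illustration.

```python
def flatten_record(record):
    """Join one character record into a single labelled text block."""
    return (
        f"Actor: {record['name']}\n"
        f"Role Description: {record['description']}\n"
        f"Medium: {record['medium']}\n"
        f"Setting: {record['setting']}"
    )

# Hypothetical record, shortened for illustration
sample = {
    "name": "emily",
    "description": "a young woman in her early 20s",
    "medium": "play",
    "setting": "england",
}

flattened = flatten_record(sample)
print(flattened)
```

Keeping the field labels inside the text means each row carries its own context when it is later embedded and pasted into a prompt.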


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [64]:
query_prompt_one = """
Question: "What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?"
Answer:
"""
initial_query_answer_one = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=query_prompt_one,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_query_answer_one)

The name of the teenage daughter is not provided and can vary.


In [65]:
query_prompt_two = """
Question: "What is the name of the young woman who Max considers like a second mother?"
Answer:
"""
initial_query_answer_two = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=query_prompt_two,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_query_answer_two)

The name of the young woman is Eddie.


# Generate Embeddings

In [74]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 10
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,Actor: emily\nRole Description: a young woman ...,"[-0.017513729631900787, -0.015827419236302376,..."
1,Actor: jack\nRole Description: a middle-aged m...,"[0.0034573425073176622, -0.026015250012278557,..."
2,Actor: alice\nRole Description: a woman in her...,"[0.0008556708344258368, -0.014851642772555351,..."
3,Actor: tom\nRole Description: a man in his 50s...,"[0.010522659868001938, -0.0246538408100605, -0..."
4,Actor: sarah\nRole Description: a woman in her...,"[-0.02520163170993328, -0.029377901926636696, ..."
5,Actor: george\nRole Description: a man in his ...,"[-0.020184945315122604, -0.017219319939613342,..."
6,Actor: rachel\nRole Description: a woman in he...,"[-0.006879925727844238, -0.01771548204123974, ..."
7,Actor: john\nRole Description: a man in his 60...,"[0.008729702793061733, -0.02020994946360588, -..."
8,Actor: maria\nRole Description: a middle-aged ...,"[-0.015555947087705135, -0.015542348846793175,..."
9,Actor: caleb\nRole Description: a young africa...,"[0.0028194563928991556, -0.03043750301003456, ..."
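
The batching loop above slices the rows into groups of `batch_size` so that each API request carries several texts at once. A minimal sketch of just the slicing logic, with no API call; `iter_batches` and the toy `texts` list are hypothetical names used for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy data: 25 placeholder strings standing in for rows of text
texts = [f"row {i}" for i in range(25)]
batch_sizes = [len(batch) for batch in iter_batches(texts, 10)]
print(batch_sizes)  # [10, 10, 5]
```

The last batch is simply shorter when the row count is not a multiple of the batch size, so no row is dropped.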


In [75]:
# Save the dataframe to a CSV file
df.to_csv('data/character_descriptions_embeddings.csv', index=False)

# Retrieve related documents for a given query

In [114]:
import ast

df = pd.read_csv("data/character_descriptions_embeddings.csv")
# The CSV stores each embedding as a string, so parse it back into a list of floats
# (ast.literal_eval only accepts Python literals, unlike eval)
df["embeddings"] = df["embeddings"].apply(ast.literal_eval)
df.head()

Unnamed: 0,text,embeddings
0,Actor: emily\nRole Description: a young woman ...,"[-0.017513729631900787, -0.015827419236302376,..."
1,Actor: jack\nRole Description: a middle-aged m...,"[0.0034573425073176622, -0.026015250012278557,..."
2,Actor: alice\nRole Description: a woman in her...,"[0.0008556708344258368, -0.014851642772555351,..."
3,Actor: tom\nRole Description: a man in his 50s...,"[0.010522659868001938, -0.0246538408100605, -0..."
4,Actor: sarah\nRole Description: a woman in her...,"[-0.02520163170993328, -0.029377901926636696, ..."
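
A CSV file stores every cell as text, so a list of floats written to disk comes back as a string and has to be parsed before it can be used as a vector. The round trip can be sketched with the standard library alone; the toy row and three-element embedding below are made up for illustration.

```python
import ast
import csv
import io

# Toy row with a made-up three-element embedding
rows = [{"text": "Actor: emily", "embeddings": [0.25, -0.5, 1.0]}]

# Write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embedding is now a string such as "[0.25, -0.5, 1.0]"
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embeddings"]
print(type(raw).__name__)

# ast.literal_eval parses Python literals only, so it cannot run arbitrary code
parsed = ast.literal_eval(raw)
print(parsed)
```

For larger datasets, a binary format that preserves list or array columns would avoid the parsing step, at the cost of a file that is no longer human-readable.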


In [115]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_embeddings = df_copy["embeddings"].values
    distances = distances_from_embeddings(
        question_embeddings,
        df_embeddings,
        distance_metric="cosine"
    )
    df_copy["distances"] = distances
     
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
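
To make the ranking concrete, here is a minimal pure-Python sketch of cosine distance, assuming the `"cosine"` metric means one minus the cosine similarity of the two vectors. `cosine_distance` and the two-dimensional toy vectors are hypothetical; real embeddings have many more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "question" vector and three toy "row" vectors
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending: the smallest distance is the most relevant row
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not affect the result, only direction, which is why the row pointing the same way as the question ranks first even though it is twice as long.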


In [116]:
get_rows_sorted_by_relevance("What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?", df)

Unnamed: 0,text,embeddings,distances
8,Actor: maria\nRole Description: a middle-aged ...,"[-0.015555947087705135, -0.015542348846793175,...",0.122748
11,Actor: sonya\nRole Description: a white woman ...,"[0.00023430812871083617, -0.025631748139858246...",0.192026
12,Actor: manuel\nRole Description: a middle-aged...,"[-0.01788993552327156, -0.02293337509036064, -...",0.204589
0,Actor: emily\nRole Description: a young woman ...,"[-0.017513729631900787, -0.015827419236302376,...",0.223454
20,Actor: johnny\nRole Description: a young up-an...,"[-0.02860119938850403, -0.022719714790582657, ...",0.225987
49,Actor: abigail\nRole Description: a plucky and...,"[-0.024748265743255615, -0.023868681862950325,...",0.230496
13,Actor: will\nRole Description: a white man in ...,"[-0.009753977879881859, -0.03488609567284584, ...",0.230636
16,Actor: tahlia\nRole Description: a young indig...,"[-0.014769287779927254, -0.008230112493038177,...",0.230714
14,Actor: mia\nRole Description: a young australi...,"[-0.012495279312133789, -0.021629804745316505,...",0.235506
10,Actor: tyler\nRole Description: a white man in...,"[0.007938436232507229, -0.03638394922018051, -...",0.235701


In [117]:
get_rows_sorted_by_relevance("What is the name of the young woman who Max considers like a second mother?", df)

Unnamed: 0,text,embeddings,distances
14,Actor: mia\nRole Description: a young australi...,"[-0.012495279312133789, -0.021629804745316505,...",0.201946
17,Actor: max\nRole Description: a white australi...,"[-0.00013563783431891352, -0.04051500931382179...",0.202072
16,Actor: tahlia\nRole Description: a young indig...,"[-0.014769287779927254, -0.008230112493038177,...",0.206297
2,Actor: alice\nRole Description: a woman in her...,"[0.0008556708344258368, -0.014851642772555351,...",0.209874
8,Actor: maria\nRole Description: a middle-aged ...,"[-0.015555947087705135, -0.015542348846793175,...",0.211579
0,Actor: emily\nRole Description: a young woman ...,"[-0.017513729631900787, -0.015827419236302376,...",0.216113
6,Actor: rachel\nRole Description: a woman in he...,"[-0.006879925727844238, -0.01771548204123974, ...",0.222232
4,Actor: sarah\nRole Description: a woman in her...,"[-0.02520163170993328, -0.029377901926636696, ...",0.224329
11,Actor: sonya\nRole Description: a white woman ...,"[0.00023430812871083617, -0.025631748139858246...",0.227549
28,Actor: maya\nRole Description: a kind and comp...,"[-0.022137047722935677, -0.026917563751339912,...",0.230456


# Compose a text prompt

In [118]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
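
The context-selection loop in `create_prompt` is a greedy fill: rows are taken in relevance order until the next one would push the running total past the token budget. The sketch below isolates that logic, assuming a whitespace word count as a stand-in for `len(tokenizer.encode(text))`; `select_context` and `count_words` are hypothetical helpers, not part of any library.

```python
def count_words(text):
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())

def select_context(rows, base_token_count, max_token_count, count_tokens):
    """Take rows in order until adding the next one would exceed the budget."""
    selected = []
    total = base_token_count
    for text in rows:
        total += count_tokens(text)
        if total > max_token_count:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected

# Rows are assumed to be sorted from most to least relevant already
rows = ["a b c", "d e", "f g h i"]
context = select_context(rows, base_token_count=2, max_token_count=8,
                         count_tokens=count_words)
print(context)  # ['a b c', 'd e']
```

Because the loop stops at the first row that does not fit, a shorter but less relevant row further down the list is never considered. That keeps the context strictly ordered by relevance.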

In [121]:
print(create_prompt("What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?", df, 200))


        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        Actor: maria
Role Description: a middle-aged latina woman in her 40s, maria is a hard-working single mother who owns a small family-run diner in a small texas town. she's fiercely protective of her teenage daughter, sofia, and is always trying to balance work and family.
Medium: movie
Setting: texas

        ---

        Question: What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?
        Answer:


In [123]:
print(create_prompt("What is the name of the young woman who Max considers like a second mother?", df, 200))


        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        Actor: mia
Role Description: a young australian woman in her mid-20s, mia is a driven and ambitious lawyer who's just landed her dream job at a top law firm in sydney. she's the younger sister of max, a former soldier who's struggling with ptsd, and is trying to help him navigate his challenges while also balancing her demanding career.
Medium: limited series
Setting: australia

        ---

        Question: What is the name of the young woman who Max considers like a second mother?
        Answer:


# Answer the Question

In [124]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [125]:
custom_query_answer_one = answer_question("What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?", df)
print(custom_query_answer_one)

Sofia


In [128]:
custom_query_answer_two = answer_question("What is the name of the young woman who Max considers like a second mother?", df)
print(custom_query_answer_two)

Ava


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [127]:
print(f"""
"What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?"

Original Answer: {initial_query_answer_one}
Custom Answer:   {custom_query_answer_one}

""")


"What is the name of the teenage daughter of the middle-aged Latina woman who owns a small family-run diner in a small Texas town?"

Original Answer: The name of the teenage daughter is not provided and can vary.
Custom Answer:   Sofia




### Question 2

In [129]:
print(f"""
"What is the name of the young woman who Max considers like a second mother?"
      
Original Answer: {initial_query_answer_two}
Custom Answer:   {custom_query_answer_two}
""")


"What is the name of the young woman who Max considers like a second mother?"
      
Original Answer: The name of the young woman is Eddie.
Custom Answer:   Ava

