# Custom Chatbot Project

Dataset: Character_description.csv

This dataset contains character descriptions from multiple mediums, organized into four columns: "Name," which lists the names of the characters; "Description," which provides detailed descriptions of each character; "Medium," which specifies the type of medium (such as play, movie, musical, or reality show) in which the character appears; and "Setting," which indicates the geographical or cultural location associated with the character, such as England, Texas, Australia, or the USA.
This dataset is chosen because we aim to develop a tool that acts as an expert for choosing characters from shows.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import numpy as np
import pandas as pd
import openai
import os
from openai.embeddings_utils import distances_from_embeddings

In [2]:
#import openai api key
OPENAI_API_KEY = ""
openai.api_key = OPENAI_API_KEY
MAX_TOKENS = 1000
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

In [3]:
df = pd.read_csv('data/character_descriptions.csv')
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [4]:
df['Medium'].unique()

array(['Play', 'Movie', 'Limited Series', 'Musical', 'Reality Show',
       'Opera', 'Sitcom'], dtype=object)

In [5]:
df['Setting'].unique()

array(['England', 'Texas', 'Australia', 'USA', 'Italy', 'Ancient Greece'],
      dtype=object)

In [6]:
#concat these four columns
df['text'] = 'The name of the character is ' + df['Name'] + '. ' + df['Description'] + 'The character are likely to act in a' + df['Medium'] + ' and lives in ' + df['Setting']
df['text'][0]

"The name of the character is Emily. A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.The character are likely to act in aPlay and lives in England"

In [7]:
#create the embeddings
response = openai.Embedding.create(
    input=df['text'].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

embeddings = [data['embedding'] for data in response['data']]
df['embeddings'] = embeddings
df.head()


Unnamed: 0,Name,Description,Medium,Setting,text,embeddings
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,The name of the character is Emily. A young wo...,"[-0.011719183064997196, -0.017467858269810677,..."
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,The name of the character is Jack. A middle-ag...,"[0.007291092071682215, -0.02629977837204933, -..."
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,The name of the character is Alice. A woman in...,"[0.00823904573917389, -0.013944988138973713, -..."
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,The name of the character is Tom. A man in his...,"[0.0169515460729599, -0.021636711433529854, -0..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,The name of the character is Sarah. A woman in...,"[-0.011701270937919617, -0.02911296673119068, ..."


In [8]:
df.to_csv('character_embeddings.csv')

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [9]:
df = pd.read_csv('character_embeddings.csv')
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)
df.head()

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings
0,0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,The name of the character is Emily. A young wo...,"[-0.011719183064997196, -0.017467858269810677,..."
1,1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,The name of the character is Jack. A middle-ag...,"[0.007291092071682215, -0.02629977837204933, -..."
2,2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,The name of the character is Alice. A woman in...,"[0.00823904573917389, -0.013944988138973713, -..."
3,3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,The name of the character is Tom. A man in his...,"[0.0169515460729599, -0.021636711433529854, -0..."
4,4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,The name of the character is Sarah. A woman in...,"[-0.011701270937919617, -0.02911296673119068, ..."


In [10]:
def question_embeddings(question):
    response = openai.Embedding.create(
        input=question,
        engine=EMBEDDING_MODEL_NAME
    )
    return response['data'][0]['embedding']

In [11]:
question1 = "Who would be an ideal choice for an athlete role in an American reality show?"
question2 = "Can you name two male characters who are entrepreneurs in reality show?"

In [12]:
question1_embedding = question_embeddings(question1)
question2_embedding = question_embeddings(question2)
question1_embedding[:3]

[-0.007107841782271862, -0.010892109014093876, -0.0037842674646526575]

In [13]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings
q1_distance = distances_from_embeddings(question1_embedding, df['embeddings'].tolist(), distance_metric = 'cosine')
q2_distance = distances_from_embeddings(question2_embedding, df['embeddings'].tolist(), distance_metric = 'cosine')

In [14]:
df['q1_distance'] = q1_distance
df['q2_distance'] = q2_distance
df.head()

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance
0,0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,The name of the character is Emily. A young wo...,"[-0.011719183064997196, -0.017467858269810677,...",0.25384,0.277849
1,1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,The name of the character is Jack. A middle-ag...,"[0.007291092071682215, -0.02629977837204933, -...",0.259168,0.227894
2,2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,The name of the character is Alice. A woman in...,"[0.00823904573917389, -0.013944988138973713, -...",0.275121,0.284619
3,3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,The name of the character is Tom. A man in his...,"[0.0169515460729599, -0.021636711433529854, -0...",0.247994,0.249355
4,4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,The name of the character is Sarah. A woman in...,"[-0.011701270937919617, -0.02911296673119068, ...",0.26822,0.268956


In [15]:
df.to_csv('character_distances.csv')

In [16]:
df1 = df.sort_values(by=['q1_distance'], ascending = True)
df2 = df.sort_values(by=['q2_distance'], ascending = True)
df1.head()

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance
29,29,James,"A handsome and athletic personal trainer, Jame...",Reality Show,USA,The name of the character is James. A handsome...,"[-0.016125567257404327, -0.010917950421571732,...",0.161294,0.198671
32,32,Chloe,"A driven and ambitious attorney, Chloe is alwa...",Reality Show,USA,The name of the character is Chloe. A driven a...,"[-0.0008281273767352104, -0.006286941468715668...",0.19765,0.222138
27,27,Marcus,"A charming and successful entrepreneur, Marcus...",Reality Show,USA,The name of the character is Marcus. A charmin...,"[-0.0027859227266162634, -0.03991147503256798,...",0.202401,0.171335
26,26,Olivia,A confident and charismatic marketing executiv...,Reality Show,USA,The name of the character is Olivia. A confide...,"[-0.0032992709893733263, -0.019392164424061775...",0.206977,0.22686
33,33,Jake,"A laid-back and easygoing firefighter, Jake is...",Reality Show,USA,The name of the character is Jake. A laid-back...,"[-0.01259734109044075, -0.020525598898530006, ...",0.210442,0.225137


In [17]:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say
"I don't know"

Context:

{}

---

Question: {}
Answer:"""

In [20]:
def get_prompt(question, df):
    token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context_list = []
    for text in df["text"].values:
        token_count += len(tokenizer.encode(text))
        if token_count <= MAX_TOKENS:
            context_list.append(text)
        else:
            break
    prompt = prompt_template.format("\n\n###\n\n".join(context_list), question)
    return prompt

In [21]:
q1_prompt = get_prompt(question1, df1)
q2_prompt = get_prompt(question2, df2)
print(q2_prompt)


Answer the question based on the context below, and if the
question can't be answered based on the context, say
"I don't know"

Context:

The name of the character is Marcus. A charming and successful entrepreneur, Marcus is used to getting what he wants. He's a smooth talker with a magnetic personality, but can sometimes come across as a bit too self-centered. He's looking for someone who can challenge him and keep him on his toes.The character are likely to act in aReality Show and lives in USA

###

The name of the character is James. A handsome and athletic personal trainer, James is always up for a challenge. He's looking for someone who is as passionate about fitness as he is, and who can keep up with his intense workout regimen. He can sometimes come across as a bit too competitive, but his heart is always in the right place.The character are likely to act in aReality Show and lives in USA

###

The name of the character is Lucas. A middle-aged Australian man in his 40s, Lucas 

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [22]:
question1

'Who would be an ideal choice for an athlete role in an American reality show?'

In [23]:
basic_q1_response = openai.Completion.create(
    engine="gpt-3.5-turbo-instruct",
    prompt=question1,
    temperature = 0.5
)
basic_q1_response['choices'][0]['text']

'\n\nSerena Williams would be an ideal choice for an athlete role in an American'

In [24]:
custom_q1_response = openai.Completion.create(
    engine="gpt-3.5-turbo-instruct",
    prompt=q1_prompt,
    temperature = 0.5
)
custom_q1_response['choices'][0]['text']

' James'

In [32]:
df[df['Name'].str.lower().str.contains('williams')] 
#no one named Serena Williams, thus the second prompt is better.

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance


In [33]:
df[df['Name'].str.lower().str.contains('james')]

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance
29,29,James,"A handsome and athletic personal trainer, Jame...",Reality Show,USA,The name of the character is James. A handsome...,"[-0.016125567257404327, -0.010917950421571732,...",0.161294,0.198671
52,52,Captain James,The charismatic and dashing captain of the loc...,Sitcom,USA,The name of the character is Captain James. Th...,"[-0.008761197328567505, -0.019974468275904655,...",0.24796,0.24089


### Question 2

In [34]:
question2

'Can you name two male characters who are entrepreneurs in reality show?'

In [35]:
basic_q2_response = openai.Completion.create(
    engine="gpt-3.5-turbo-instruct",
    prompt=question2,
    temperature = 0.5
)
basic_q2_response['choices'][0]['text']

'\n\n1. Daymond John from "Shark Tank"\n2. Marcus Lemon'

In [37]:
df[df['Name'].str.lower().str.contains('daymond')] 
#no one named Daymond John

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance


In [38]:
df[df['Name'].str.lower().str.contains('lemon')]
#no one named Marcus Lemon

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance


In [39]:
custom_q2_response = openai.Completion.create(
    engine="gpt-3.5-turbo-instruct",
    prompt=q2_prompt,
    temperature = 0.5
)
custom_q2_response['choices'][0]['text']

' Marcus and George'

In [40]:
df[df['Name'].str.lower().str.contains('marcus')]

Unnamed: 0.1,Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,q1_distance,q2_distance
27,27,Marcus,"A charming and successful entrepreneur, Marcus...",Reality Show,USA,The name of the character is Marcus. A charmin...,"[-0.0027859227266162634, -0.03991147503256798,...",0.202401,0.171335


In [43]:
pd.set_option('display.max_colwidth', None)
df[df['Name'].str.lower().str.contains('george')]['Description']

5    A man in his early 30s, George is a charming and charismatic businessman who is in a relationship with Emily. He's ambitious, confident, and always looking for the next big opportunity. However, he's also prone to bending the rules to get what he wants.
Name: Description, dtype: object

In [None]:
#both Marcus and George are entrepreneurs. Thus second prompt is better.