# Custom Chatbot Project

### Dataset: character_descriptions.csv

This dataset contains 55 character descriptions from theater, television, and film productions. Each row holds the character's name, description, medium, and setting. I will transform the dataset by combining all four columns into a single custom text column written as a short narrative.

For example:

**text** = Name of the character is **Name**. **Description** Lives in **Setting**. **Name** usually likes to act in a **Medium**.

The **text** column from the transformed dataset will be used to build the custom chatbot.

### Use case scenario

When casting a play or a movie, you often need a character that fits the narrative. You also need to know where the character lives in order to match culture, accents, and audience preferences. This custom chatbot makes it easier to find a character for a creative endeavour given a few traits.

## Data Wrangling



In [65]:
!pip install openai==0.28
!pip install tiktoken



In [66]:
import pandas as pd
import numpy as np
import openai
import os
from openai.embeddings_utils import distances_from_embeddings

In [67]:
from google.colab import userdata
openai.api_key = userdata.get('OPENAI_API_KEY')
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
MAX_TOKENS = 1000

In [68]:
df = pd.read_csv('/content/character_descriptions.csv')
df.describe()

| | Name | Description | Medium | Setting |
|---|---|---|---|---|
| count | 55 | 55 | 55 | 55 |
| unique | 55 | 55 | 7 | 6 |
| top | Emily | A young woman in her early 20s, Emily is an as... | Play | USA |
| freq | 1 | 1 | 18 | 21 |


In [69]:
df['text'] = 'Name of the character is ' + df['Name'] + '. ' + df['Description'] + ' Lives in ' + df['Setting'] + '. ' + df['Name'] + ' usually likes to act in a ' + df['Medium'] + '.'
df["text"][0]

"Name of the character is Emily. A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. Lives in England. Emily usually likes to act in a Play."

In [70]:
embeddings = []
for index, row in df.iterrows():
  response = openai.Embedding.create(
      input=row["text"],
      engine=EMBEDDING_MODEL_NAME
  )
  embeddings.extend([data["embedding"] for data in response["data"]])
df["embeddings"] = embeddings
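
The loop above sends one request per row. As a hedged alternative, the sketch below groups the texts into batches before calling an embedding function. The names `embed_in_batches`, `embed_batch`, `openai_embed_batch`, and `batch_size` are hypothetical and not part of this notebook. The wrapper assumes that the `openai==0.28` embeddings call accepts a list of strings as `input` and returns one entry per string; sorting by the reported `index` guards the ordering.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,
) -> List[List[float]]:
    """Embed texts in batches.

    embed_batch is any callable that maps a list of strings to a list of
    vectors in the same order (hypothetical interface for this sketch).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embedding function returned the wrong number of vectors")
        vectors.extend(result)
    return vectors


def openai_embed_batch(batch: List[str]) -> List[List[float]]:
    """Assumed wrapper around the openai==0.28 call used in this notebook."""
    import openai  # imported lazily so embed_in_batches works without it

    response = openai.Embedding.create(input=batch, engine="text-embedding-ada-002")
    # Sort by the reported index so vectors line up with the inputs.
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]
```

In this notebook the call would look like `df["embeddings"] = embed_in_batches(df["text"].tolist(), openai_embed_batch)`. With 55 short rows the per-row loop is also fine; batching mainly reduces the number of requests.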


In [71]:
df[["text", "embeddings"]].to_csv("character_descriptions_embeddings.csv")

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [72]:
import ast  # literal_eval parses the stored lists without executing arbitrary code
df = pd.read_csv('/content/character_descriptions_embeddings.csv', index_col=0)
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df.head()

| | text | embeddings |
|---|---|---|
| 0 | Name of the character is Emily. A young woman ... | [-0.01664789207279682, -0.011548190377652645, ... |
| 1 | Name of the character is Jack. A middle-aged m... | [0.004231372848153114, -0.024164393544197083, ... |
| 2 | Name of the character is Alice. A woman in her... | [0.005832457449287176, -0.00886272918432951, -... |
| 3 | Name of the character is Tom. A man in his 50s... | [0.015731092542409897, -0.017077621072530746, ... |
| 4 | Name of the character is Sarah. A woman in her... | [-0.01772937923669815, -0.028103776276111603, ... |


In [73]:
def question_embeddings(question):
  response = openai.Embedding.create(
      input=question,
      engine=EMBEDDING_MODEL_NAME
  )
  return response["data"][0]["embedding"]

In [82]:
q1 = "Who would be a good fit for a retiree role in an american sitcom?"
q2 = "Provide atleast two names of young female actors that can act in reality series?"
q1_embeddings = question_embeddings(q1)
q2_embeddings = question_embeddings(q2)

df['q1_distances'] = distances_from_embeddings(
  q1_embeddings,
  df['embeddings'].values,
  distance_metric="cosine"
)
df['q2_distances'] = distances_from_embeddings(
  q2_embeddings,
  df['embeddings'].values,
  distance_metric="cosine"
)
# Sort by distance so the closest matches to each question come first.
dfq1 = df.sort_values(by=["q1_distances"], ascending=True)
dfq2 = df.sort_values(by=["q2_distances"], ascending=True)
dfq1.head()


| | text | embeddings | q1_distances | q2_distances |
|---|---|---|---|---|
| 7 | Name of the character is John. A man in his 60... | [0.019908253103494644, -0.018194466829299927, ... | 0.20373 | 0.282429 |
| 3 | Name of the character is Tom. A man in his 50s... | [0.015731092542409897, -0.017077621072530746, ... | 0.207847 | 0.2567 |
| 33 | Name of the character is Jake. A laid-back and... | [-0.01604432985186577, -0.018637219443917274, ... | 0.214507 | 0.242602 |
| 52 | Name of the character is Captain James. The ch... | [-0.011573380790650845, -0.017701251432299614,... | 0.216664 | 0.263715 |
| 54 | Name of the character is Mr. Mercer. The bumbl... | [-0.005636273883283138, -0.014090684242546558,... | 0.217119 | 0.295443 |


In [83]:
dfq2.head()

| | text | embeddings | q1_distances | q2_distances |
|---|---|---|---|---|
| 0 | Name of the character is Emily. A young woman ... | [-0.01664789207279682, -0.011548190377652645, ... | 0.228764 | 0.208033 |
| 26 | Name of the character is Olivia. A confident a... | [-0.006731816567480564, -0.014614651910960674,... | 0.237679 | 0.2131 |
| 32 | Name of the character is Chloe. A driven and a... | [-0.0058455332182347775, -0.006102945189923048... | 0.235235 | 0.213282 |
| 30 | Name of the character is Sophia. A fun-loving ... | [0.02071801759302616, -0.013325332663953304, -... | 0.25064 | 0.214358 |
| 14 | Name of the character is Mia. A young Australi... | [-0.010081450454890728, -0.016398638486862183,... | 0.259318 | 0.217592 |
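
To make the distance columns concrete: cosine distance is one minus the cosine similarity of two vectors, so a smaller value means the row is closer to the question. The NumPy sketch below illustrates the metric on toy vectors. `cosine_distances` is a hypothetical helper written for this illustration; it is not the implementation of `distances_from_embeddings`.

```python
import numpy as np


def cosine_distances(query, embeddings):
    """Return 1 - cosine similarity between query and each row of embeddings."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(embeddings, dtype=float)
    # Dot product of each row with the query, divided by the product of norms.
    similarities = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


# Toy data: the first row points the same way as the query, the second is orthogonal.
toy_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
distances = cosine_distances([1.0, 0.0], toy_rows)
ranking = np.argsort(distances)  # row indices from nearest to farthest
```

On the real data, the rows would first need stacking into a 2-D array, for example with `np.stack(df["embeddings"].values)`.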


In [84]:
dfq1["text"].head(10).values

array(["Name of the character is John. A man in his 60s, John is a retired professor and Tom's father. He has a dry wit and a love of intellectual debate, but can also be stubborn and set in his ways. Lives in England. John usually likes to act in a Play.",
       "Name of the character is Tom. A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel. Lives in England. Tom usually likes to act in a Play.",
       "Name of the character is Jake. A laid-back and easygoing firefighter, Jake is the quintessential good guy. He's looking for someone who shares his values of honesty and integrity, and who is looking for a stable and committed relationship. He's a bit of a hopeless romantic, and is always looking for ways to sweep his partner off their feet. Lives in USA. Jake usually likes to act in a Reality Show.",
       "Name of the characte

In [85]:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say
"I don't know"

Context:

{}

---

Question: {}
Answer:"""
def get_prompt(question, df):
  # Start with the tokens used by the template and by the question itself.
  token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
  context_list = []
  # df is expected to be sorted by distance; add the closest rows until the budget is spent.
  for text in df["text"].head(10).values:
    token_count += len(tokenizer.encode(text))
    if token_count > MAX_TOKENS:
      break
    context_list.append(text)
  prompt = prompt_template.format("\n\n###\n\n".join(context_list), question)
  return prompt

In [86]:
q1_prompt=get_prompt(q1, dfq1)
q2_prompt=get_prompt(q2, dfq2)

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [87]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q1_prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Mr. Mercer.


### Question 2

In [88]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q2_prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Olivia and Sophia
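
The answers above come from the custom prompts only. A baseline for comparison would send the bare question to the same model. The sketch below is one hedged way to collect both answers side by side. `compare_answers` and `openai_complete` are hypothetical helpers, and no baseline output is shown because that query was not run in this notebook.

```python
from typing import Callable, Dict


def compare_answers(
    question: str,
    custom_prompt: str,
    complete: Callable[[str], str],
) -> Dict[str, str]:
    """Return the model's answer to the bare question and to the custom prompt.

    complete is any callable that maps a prompt to the generated text.
    """
    return {
        "basic": complete(question).strip(),
        "custom": complete(custom_prompt).strip(),
    }


def openai_complete(prompt: str) -> str:
    """Assumed wrapper around the openai==0.28 Completion call used above."""
    import openai  # imported lazily so compare_answers works without it

    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=150,
    )
    return response["choices"][0]["text"]
```

For example, `compare_answers(q1, q1_prompt, openai_complete)` would return a dictionary with one answer per approach for Question 1.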
