# Custom Chatbot Project

* For this Custom Chatbot Project, I chose the `character_descriptions.csv` file.
This file contains character descriptions from theater, television, and film productions.
Each row contains the character's name, description, medium, and setting.

* I chose this dataset because it lets the chatbot give plausible answers,
grounded in the character descriptions, to many kinds of casting questions
(e.g. "Who would be the most appropriate actress for an '80s romantic movie?" or
"Provide two names of old male actors who can play in an american sitcom.").

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [14]:
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings

In [2]:
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [3]:
df = pd.read_csv("data/character_descriptions.csv")
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


### Create a column "text".

In [5]:
# Combine name, setting, medium, and description into one sentence per row
df['text'] = (
    df['Name'] + ', living in ' + df['Setting']
    + ', who usually plays in a ' + df['Medium'] + ', ' + df['Description']
)
df['text'][0]

"Emily, living in England, who usually plays in a Play, A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George."

### Let's generate Embeddings

In [6]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []

for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])
    
# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,Name,Description,Medium,Setting,text,embeddings
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,"Emily, living in England, who usually plays in...","[-0.015151883475482464, -0.017195267602801323,..."
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,"Jack, living in England, who usually plays in ...","[0.00807875394821167, -0.026968302205204964, 0..."
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,"Alice, living in England, who usually plays in...","[0.003437075298279524, -0.01474058162420988, -..."
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,"Tom, living in England, who usually plays in a...","[0.01544149685651064, -0.017899269238114357, 0..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,"Sarah, living in England, who usually plays in...","[-0.009714245796203613, -0.028793489560484886,..."
5,George,"A man in his early 30s, George is a charming a...",Play,England,"George, living in England, who usually plays i...","[-0.019317220896482468, -0.015670737251639366,..."
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England,"Rachel, living in England, who usually plays i...","[-0.003012859495356679, -0.015269828028976917,..."
7,John,"A man in his 60s, John is a retired professor ...",Play,England,"John, living in England, who usually plays in ...","[0.02007964253425598, -0.016262318938970566, -..."
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas,"Maria, living in Texas, who usually plays in a...","[-0.017404617741703987, -0.018720706924796104,..."
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas,"Caleb, living in Texas, who usually plays in a...","[0.004072345327585936, -0.030203767120838165, ..."
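
The loop above sends the rows to the embedding endpoint in slices of `batch_size`. The slicing logic can be checked without an API key, as in this minimal sketch. `batched` and `fake_embed` are hypothetical names introduced here for illustration; `fake_embed` is only a stand-in for the real API call and returns meaningless vectors.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    # Hypothetical stand-in for the embedding API: one tiny vector per text.
    return [[float(len(text)), 0.0] for text in texts]


texts = [f"character {i}" for i in range(250)]
embeddings = []
batch_sizes = []

for batch in batched(texts, 100):
    batch_sizes.append(len(batch))
    # extend (not append) keeps one embedding per input row, in row order
    embeddings.extend(fake_embed(batch))

print(batch_sizes, len(embeddings))
```

The last batch is simply shorter than the others, so the number of embeddings always matches the number of rows, which is what lets the list be assigned directly as a new column.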


In [9]:
# Save only necessary columns
df[["text", "embeddings"]].to_csv("data/character_descriptions_embeddings.csv")

## Custom Query Completion

Compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [10]:
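# Reload the saved embeddings (approach from the course case study)
# so that they do not have to be generated again.
import ast

df = pd.read_csv("data/character_descriptions_embeddings.csv", index_col=0)
# The CSV stores each embedding as a string; ast.literal_eval parses it
# back into a list without executing arbitrary code, unlike eval
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df.head()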

Unnamed: 0,text,embeddings
0,"Emily, living in England, who usually plays in...","[-0.015151883475482464, -0.017195267602801323,..."
1,"Jack, living in England, who usually plays in ...","[0.00807875394821167, -0.026968302205204964, 0..."
2,"Alice, living in England, who usually plays in...","[0.003437075298279524, -0.01474058162420988, -..."
3,"Tom, living in England, who usually plays in a...","[0.01544149685651064, -0.017899269238114357, 0..."
4,"Sarah, living in England, who usually plays in...","[-0.009714245796203613, -0.028793489560484886,..."
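
The embeddings come back from the CSV file as strings, which is why they have to be parsed before use. The following is a minimal sketch of that round trip using only the standard library (`csv`, `io`, `ast`) instead of `pandas`; the three-number vector is a shortened, made-up example and not a real embedding.

```python
import ast
import csv
import io

rows = [{"text": "Emily, living in England", "embeddings": [-0.015, -0.017, 0.02]}]

# Write: each list of floats is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row["text"], "embeddings": str(row["embeddings"])})

# Read: every field is a string, so the vector has to be parsed back
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    restored.append({
        "text": record["text"],
        # literal_eval accepts only Python literals, never function calls
        "embeddings": ast.literal_eval(record["embeddings"]),
    })

print(type(restored[0]["embeddings"]), restored[0]["embeddings"])
```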


### Custom Queries


In [25]:
query1 = "What are some common characteristics of characters played in Musical?"
query2 = "Provide two names of old male actors who can play in an american sitcom."

### Create a Function that Finds Related Pieces of Text for a Given Question
In this part, the `get_rows_sorted_by_relevance` function from the course is used.
It uses the embeddings generated earlier to compare the vectorized question against the vectorized rows of the dataset.

In [12]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
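
As a rough illustration of what the `"cosine"` distance measures, here is a minimal sketch in plain Python. It assumes the usual definition, one minus the cosine similarity; the two-dimensional vectors are toy values chosen for readability, not real embeddings, and `cosine_distance` is a helper written for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance puts the most relevant row first
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

Only the direction of the vectors matters, not their length, which is why the row pointing the same way as the question has distance 0 even though it is twice as long.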

In [26]:
get_rows_sorted_by_relevance(query1, df).head(5)

Unnamed: 0,text,embeddings,distances
22,"Dolly, living in USA, who usually plays in a M...","[-0.031535618007183075, -0.0211494043469429, 0...",0.183321
20,"Johnny, living in USA, who usually plays in a ...","[-0.029325218871235847, -0.020784892141819, 0....",0.186419
19,"Donna, living in USA, who usually plays in a M...","[-0.02699066512286663, -0.01740950532257557, 0...",0.187792
38,"Don Carlo, living in Italy, who usually plays ...","[-0.010799444280564785, -0.00947028212249279, ...",0.195284
25,"Crystal, living in USA, who usually plays in a...","[-0.005196164827793837, -0.013521763496100903,...",0.196625


In [17]:
get_rows_sorted_by_relevance(query2, df).head(5)

Unnamed: 0,text,embeddings,distances
7,"John, living in England, who usually plays in ...","[0.02007964253425598, -0.016262318938970566, -...",0.194401
54,"Mr. Mercer, living in USA, who usually plays i...","[-0.007667117286473513, -0.011930243112146854,...",0.208679
52,"Captain James, living in USA, who usually play...","[-0.007886371575295925, -0.020474107936024666,...",0.209077
50,"Thomas, living in USA, who usually plays in a ...","[-0.014383263885974884, -0.008755599148571491,...",0.214836
3,"Tom, living in England, who usually plays in a...","[0.01544149685651064, -0.017899269238114357, 0...",0.215556


### Create a Function that Composes a Text Prompt
We also use the `create_prompt` function from the course.

In [18]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [27]:
print(create_prompt(query1, df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Dolly, living in USA, who usually plays in a Musical, A bubbly and vivacious performer, Dolly is a fan favorite for her infectious personality and comedic performances. She's known for her campy looks and over-the-top antics, but can struggle with self-doubt and insecurity off stage. She's also a good friend of Johnny, often offering her words of encouragement.

###

Johnny, living in USA, who usually plays in a Musical, A young up-and-coming performer, Johnny is full of energy and enthusiasm. She's known for her edgy and unconventional looks, but can be a bit scatterbrained at times. She looks up to Donna as a mentor and hopes to follow in her footsteps.

---

Question: What are some common characteristics of characters played in Musical?
Answer:
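
Only two rows fit in the prompt above because of the 200-token budget. The budgeting logic can be sketched on its own as follows. This is a simplified illustration: `count_tokens` is a hypothetical helper that splits on whitespace as a stand-in for the real tokenizer, and `select_context` is a name introduced here, not part of the course code.

```python
def count_tokens(text):
    # Assumption: whitespace splitting stands in for a real tokenizer
    return len(text.split())


def select_context(ranked_texts, question, template_tokens, max_token_count):
    """Take rows in relevance order until the token budget is exhausted."""
    used = template_tokens + count_tokens(question)
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            # Stop at the first row that does not fit; later rows are
            # less relevant, so they are not considered either
            break
        context.append(text)
    return context


ranked_texts = ["a b c d", "e f g", "h i j k l"]
selected = select_context(
    ranked_texts, "what is x", template_tokens=2, max_token_count=12
)
print(selected)
```

The template and the question are counted first, so a long question leaves less room for context rows.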


## Custom Performance Demonstration

Demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Create a Function that Answers a Question
We also use the `answer_question` function defined in our course.

In [20]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

### Question 1
"What are some common characteristics of characters played in Musical?"

In [30]:
def initial_answer(query):
    """
    Return the answer from a basic Completion query, sent without any
    context from the dataset
    """
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=query,
        max_tokens=150
    )
    return response["choices"][0]["text"].strip()

In [31]:
# basic answer
print(initial_answer(query1))

1. Strong passion and emotion: Musical characters are often portrayed as having a deep desire or intense emotion that drives their actions and motivations.

2. Larger than life personality: Musical characters tend to be larger than life, with exaggerated personalities and dramatic expressions.

3. Confidence and bravado: Many musical characters exude confidence and bravado, often using their charisma and charm to win over others.

4. Creative and expressive: Musical characters are often artistic and expressive, using music and dance as their main means of communication.

5. Love for music: Most musical characters have a deep love and appreciation for music, which is portrayed through their singing and dancing.

6. Struggle and triumph: Many musical characters have a personal struggle or conflict they must overcome


In [28]:
# custom answer
print(answer_question(query1, df))

Bubbly, vivacious, energetic, unconventional, creative, chameleon-like performer with infectious personalities.


### Question 2
"Provide two names of old male actors who can play in an american sitcom."

In [32]:
# basic answer
print(initial_answer(query2))

1. Tom Hanks
2. Morgan Freeman


In [23]:
# custom answer
print(answer_question(query2, df))

1. Mr. Mercer
2. Captain James
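
To view the basic and custom answers side by side, a small helper along these lines could be used. This is only a sketch: `compare_answers` is a hypothetical name, and the two stub functions stand in for the real model calls so that the example runs without an API key.

```python
def compare_answers(question, basic_fn, custom_fn):
    """Collect the basic and the custom answer for one question."""
    return {
        "question": question,
        "basic": basic_fn(question),
        "custom": custom_fn(question),
    }


def stub_basic(question):
    # Stand-in for a plain Completion query
    return "basic answer"


def stub_custom(question):
    # Stand-in for a query with dataset context in the prompt
    return "custom answer"


result = compare_answers("Example question?", stub_basic, stub_custom)
for key, value in result.items():
    print(f"{key}: {value}")
```

In the notebook, the stubs would be replaced by `initial_answer` and `lambda q: answer_question(q, df)`.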
