# Custom Chatbot Project

In [135]:
import os
import re
import tiktoken
import numpy as np
import pandas as pd
from openai import OpenAI
pd.set_option('display.max_colwidth', None)

In [136]:
df = pd.read_csv('data/character_descriptions.csv')

In [137]:
df.head(2)

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England


## 1. Choose a Dataset and Explain the Scenario

For this project, I chose a dataset of fictional characters. Each row gives a character's name, a description, the medium the character belongs to, and the setting of the story. This dataset is appropriate for the following reasons:

1. **Diversity and Variety of Characters**: The dataset covers characters from different media and various geographic and cultural environments, including a range of ages, professions, and personal characteristics. This diversity allows for comparative and contextual analysis of how characters and their descriptions vary by medium and setting.

2. **Detailed Descriptions**: Each character has a detailed description that includes demographic aspects and personality traits.

3. **Unique Context**: The characters are specific to this dataset, so a model such as GPT-3.5 Turbo, which is trained on generic data, is unlikely to answer questions about them correctly on its own. This makes the dataset a good test of whether supplying relevant rows as context improves the model's answers to specific questions.

In summary, this dataset suits the task because of its diversity, its detailed descriptions, and its unique context. It supports questions from several angles, such as relationships between characters, personality traits, and differences across media and settings.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [138]:
df['text'] = df['Name'] + ' - ' + df['Description'] + ' - ' + df['Medium'] + ' - ' + df['Setting']

In [139]:
df.head(2)

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England,"Emily - A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. - Play - England"
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England,"Jack - A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice. - Play - England"


In [179]:
df.shape

(55, 7)

In [180]:
# API settings: read the key from the OPENAI_API_KEY environment variable
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [79]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo"

def get_chat_response(client, prompt):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=COMPLETION_MODEL_NAME,
    )
    
    response = chat_completion.choices[0].message.content
    formatted_response = response.replace('\\n', '\n')
    return formatted_response

In [35]:
def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input = [text], model=model).data[0].embedding

def cosine_similarity(embedding1, embedding2):
    """Compute the cosine similarity between two embeddings."""
    embedding1 = np.array(embedding1)
    embedding2 = np.array(embedding2)
    dot_product = np.dot(embedding1, embedding2)
    norm1 = np.linalg.norm(embedding1)
    norm2 = np.linalg.norm(embedding2)
    return dot_product / (norm1 * norm2)

def get_rows_sorted_by_relevance(df, question, n=10, pprint=True):
    embedding = get_embedding(question, model='text-embedding-ada-002')
    df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
    res = df.sort_values('similarities', ascending=False).head(n)
    return res

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(df,question)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
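To make the retrieval step easier to check by hand, here is a minimal sketch of the same idea (score every row against the question, sort, keep the top `n`) on toy 2-D vectors. The vectors, labels, and the helper name `rank_by_similarity` are illustrative assumptions, not part of the notebook; real `text-embedding-ada-002` vectors have 1536 dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, rows, n=2):
    """Return the n (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine(vec, query_vec)) for label, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical 2-D "embeddings" chosen so the ranking is obvious.
toy_rows = [
    ("same direction", [2.0, 0.0]),
    ("45 degrees off", [1.0, 1.0]),
    ("orthogonal", [0.0, 3.0]),
]
top = rank_by_similarity([1.0, 0.0], toy_rows, n=2)
print(top)
```

Vector length does not affect the score: `[2.0, 0.0]` and `[1.0, 0.0]` point the same way, so their similarity is 1, while the orthogonal row scores 0 and is dropped.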

In [140]:
df['ada_embedding'] = df.text.apply(lambda x: get_embedding(x, model="text-embedding-ada-002"))
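The cell above sends one request per row. Since the embeddings endpoint accepts a list of inputs, a possible alternative is to embed several rows per request. The sketch below is only an illustration under that assumption: the helper name `embed_in_batches` and the default `batch_size=20` are hypothetical choices, not part of the course materials.

```python
def embed_in_batches(client, texts, model="text-embedding-ada-002", batch_size=20):
    """Embed texts with one API request per batch instead of one per row."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        response = client.embeddings.create(input=batch, model=model)
        # Sort by the returned index so results line up with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings

# Possible usage in this notebook (not executed here):
# df['ada_embedding'] = embed_in_batches(client, df['text'].tolist())
```

With 55 rows and a batch size of 20, this would issue 3 requests instead of 55.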

In [142]:
print(create_prompt("Who is Emely's Family?", df, 1000))


    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    Emily - A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. - Play - England

###

Alice - A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack. - Play - England

###

George - A man in his early 30s, George is a charming and charismatic businessman who is in a relationship with Emily. He's ambitious, confident, and always looking for the next big opportunity. However, he's also prone to bending the rules to get what he wants. - Play - England

###

Maria - A middle-aged Latina woman in her 40s, Maria is a hard-wor
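The context-building loop in `create_prompt` can be read as a greedy selection under a token budget: rows are taken in relevance order until the next one would exceed the limit. The sketch below isolates that logic. To stay self-contained it counts whitespace-separated words by default, which is an assumption standing in for the `tiktoken` encoder; the helper name `select_context` and the sample rows are also illustrative.

```python
def select_context(rows, budget, count_tokens=lambda text: len(text.split())):
    """Greedily keep rows, in relevance order, while the total stays within budget."""
    selected, used = [], 0
    for row in rows:
        cost = count_tokens(row)
        if used + cost > budget:
            break  # stop at the first row that does not fit, as create_prompt does
        selected.append(row)
        used += cost
    return selected, used

# Illustrative rows, already sorted from most to least relevant.
rows = [
    "Emily is an actress",        # 4 words
    "Alice is a mother of two",   # 6 words
    "George is a businessman",    # 4 words
]
selected, used = select_context(rows, budget=10)
print(selected, used)
```

In the notebook, the counter would be something like `lambda t: len(tokenizer.encode(t))`. Neither this sketch nor `create_prompt` counts the `###` separators between rows, so the final prompt can run slightly over the nominal budget.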

In [143]:
def get_custom_query_response(question, df, max_prompt_tokens=1800, max_answer_tokens=400):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    prompt = create_prompt(question, df, max_prompt_tokens)
    try:
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model=COMPLETION_MODEL_NAME,
            # Cap the length of the generated answer
            max_tokens=max_answer_tokens,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1: Who is Emely's Family?

In [144]:
question1 = "Who is Emely's Family?"
prompt = f"""
    Question: "{question1}"
    Answer:
    """

#### Default Response

In [145]:
get_chat_response(client,prompt)

"Emely's family includes her parents, siblings, grandparents, aunts, uncles, cousins, and any other relatives related to her through blood or marriage."

#### Custom Prompt Response

In [147]:
get_custom_query_response(question1, df)

"Emily's family consists of her mother, Alice, and her father is not specifically mentioned, but she is in a relationship with George."

### Question 2: Who are the most intelligent characters and describe them?

In [172]:
question2 = "Who are the most intelligent characters and describe them?"
prompt = f"""
    Question: "{question2}"
    Answer:
    """

#### Default Response

In [173]:
response = get_chat_response(client,prompt)

In [174]:
print(response)

1. Sherlock Holmes - Sherlock Holmes is a consulting detective known for his keen observation and deduction skills. He is highly analytical and logical, often able to solve complex cases using his intelligence and attention to detail.

2. Lisbeth Salander - Lisbeth Salander is a hacker and investigator with a photographic memory and a talent for uncovering secrets. She is resourceful, independent, and incredibly intelligent, often using her skills to outsmart her adversaries.

3. Dr. Gregory House - Dr. House is a brilliant diagnostician with a knack for solving unusual medical cases. He is known for his unconventional methods and ability to think outside the box when it comes to diagnosing his patients.

4. Hermione Granger - Hermione Granger is a witch with exceptional intelligence and a love for learning. She is known for her encyclopedic knowledge and quick thinking, often using her intelligence to help her friends out of difficult situations.

5. Tyrion Lannister - Tyrion Lanniste

#### Custom Prompt Response

In [177]:
response = get_custom_query_response(question2, df)

In [178]:
print(response)

Feste and John are the most intelligent characters. Feste, a jester and musician from the play set in Ancient Greece, uses his wit and intelligence to comment on the actions of other characters. John, a retired professor from the play set in England, has a dry wit and a love of intellectual debate. Both characters showcase their intelligence through their interactions and observations within their respective plays.
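The two demonstrations above follow the same pattern: ask the question without context, then ask it again with retrieved context. A small helper could collect both answers side by side for any list of questions. The sketch below is one way to do that. The name `compare_answers` is hypothetical, and the two stub functions are placeholders so the example runs without an API key.

```python
def compare_answers(questions, basic_fn, custom_fn):
    """Collect the basic and context-augmented answers for each question."""
    results = []
    for question in questions:
        results.append({
            "question": question,
            "basic": basic_fn(question),
            "custom": custom_fn(question),
        })
    return results

# Placeholder stubs; in the notebook these would wrap the real functions, e.g.
#   basic_fn=lambda q: get_chat_response(client, f'Question: "{q}"\nAnswer:')
#   custom_fn=lambda q: get_custom_query_response(q, df)
def stub_basic(question):
    return "basic answer to: " + question

def stub_custom(question):
    return "custom answer to: " + question

rows = compare_answers(["Q1", "Q2"], stub_basic, stub_custom)
for row in rows:
    print(row["question"], "|", row["basic"], "|", row["custom"])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` for display.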
