# Custom Chatbot Project

This project customizes a chatbot based on the GPT-3.5-Turbo model. The idea was suggested by Udacity as part of an online course on generative AI. The customization is achieved by providing context information to the model with each question the user asks the chatbot.
The context we wish to provide consists of a list of character descriptions from theater, television, and film productions. Each row of this list contains a name, description, medium, and setting. The list is made available by Udacity, and all characters were invented by an OpenAI model.
We chose a character dataset in order to study the model's bias towards names and how that bias can be influenced by external sources.
For this purpose, we ask the chatbot about the likely personality and appearance of persons with certain names, and then repeat the same question while providing the character descriptions as context.

## 1 Libraries and display settings

In [1]:
import pandas as pd
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken

In [2]:
pd.set_option('display.max_colwidth', None) # allows for displaying broad text lines of pandas' dataframe objects

## 2 Data Wrangling

Read character list into a dataframe

In [3]:
character_df=pd.read_csv("data/character_descriptions.csv")
character_df.head(5)

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.",Play,England


Create a new dataframe containing just one attribute "text"

In [4]:
# character_df["text"]="A character named "+character_df["Name"]+ " in a "+ character_df["Medium"]+" with setting in "+ character_df["Setting"]+" is characterized as: "+character_df["Description"]+"."
character_df["text"]="A character named "+character_df["Name"]+ " in a "+ character_df["Medium"]+" is characterized as: "+character_df["Description"]+"."
df=character_df[["text"]].copy()
pd.set_option('display.max_colwidth', None)
df.head(5)

Unnamed: 0,text
0,"A character named Emily in a Play is characterized as: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.."
1,"A character named Jack in a Play is characterized as: A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.."
2,"A character named Alice in a Play is characterized as: A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.."
3,"A character named Tom in a Play is characterized as: A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.."
4,"A character named Sarah in a Play is characterized as: A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.."



## 3 Custom Query Completion

In the cells below, we compose a custom query using the chosen dataset and retrieve results from the OpenAI completion model `gpt-3.5-turbo-instruct` (referred to as GPT-3.5-Turbo in this notebook).

#### Get access to OpenAI

In [5]:
openai.api_base = "https://openai.vocareum.com/v1"
# name of the text file holding the API key
filename = "user_key.txt"

# read the key from the file and strip surrounding whitespace (e.g. a trailing newline)
with open(filename, "r", encoding="utf-8") as user_key_file:
    openai.api_key = user_key_file.read().strip()

#### Attach new column "embeddings" comprising the embedding vector for each text

In [6]:

#  Transform the text of each context line into embedding vectors
batch_size = 200
embeddings = []

for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine="text-embedding-ada-002"
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df.loc[:,"embeddings"] = embeddings

df.head(2)

Unnamed: 0,text,embeddings
0,"A character named Emily in a Play is characterized as: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George..","[-0.02483072690665722, -0.012601914815604687, -0.007610038388520479, -0.023582756519317627, -0.0424824133515358, 0.02802141010761261, -0.01650664396584034, 0.02143419161438942, -0.0036827954463660717, -0.014666853472590446, 0.0034994599409401417, -0.001333204098045826, 0.008002441376447678, -0.01051767822355032, -0.0006199640338309109, 0.012363900430500507, 0.0255512036383152, -0.0021196177694946527, 0.010967976413667202, -0.03015710972249508, -0.005866741295903921, -0.009289007633924484, 0.005030473694205284, -0.0005930265760980546, -0.027661170810461044, 0.0013806462520733476, 0.017214255407452583, -0.018114851787686348, 0.005950368475168943, -0.039368923753499985, -0.0033515046816319227, -0.012376765720546246, -0.01294285524636507, -0.032653048634529114, -0.01923416368663311, -0.009674977511167526, -0.01235103514045477, -0.02403305470943451, 0.026258815079927444, 0.014525331556797028, 0.007423486560583115, 0.018449358642101288, -0.012286706827580929, -0.01301361620426178, -0.007983142510056496, 0.0038596983067691326, -0.03615250810980797, -0.007803023327142, -0.01527153979986906, 0.03247292712330818, 0.019993238151073456, 0.017613090574741364, -0.01696980744600296, -0.011321782134473324, -0.01467971969395876, 0.013264496810734272, 0.005126966163516045, 0.0006477056303992867, 0.0005785527173429728, 0.0011892695911228657, 0.010504812002182007, -0.0002756065805442631, -0.009456261061131954, -0.035097524523735046, -0.00449011567980051, -0.0020311663392931223, -0.032884631305933, 0.013689063489437103, 0.0012447526678442955, 0.01790899969637394, 0.02342836931347847, 0.016287926584482193, -0.0017288231756538153, 0.012389631941914558, 0.01776747778058052, 0.004937197547405958, 
-0.020687982439994812, 0.01722712069749832, -0.013882048428058624, 0.00908315647393465, 0.005583696998655796, -0.025409681722521782, -0.0005600583390332758, -0.00036707340041175485, 0.011881438083946705, 0.027969947084784508, -0.002428393578156829, 0.02275935374200344, -0.02076517790555954, 0.012132318690419197, 0.02202601172029972, 0.034788746386766434, 0.03628116473555565, 0.020404938608407974, 0.011694885790348053, -0.007134009152650833, 0.002090669935569167, 0.02843311056494713, 0.00332898972555995, -0.01856514997780323, ...]"
1,"A character named Jack in a Play is characterized as: A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice..","[-0.0025852853432297707, -0.02566424198448658, 0.0044226269237697124, -0.0349387526512146, -0.0384768508374691, 0.017989683896303177, -0.01618161052465439, 0.008767207153141499, 0.0014698730083182454, -0.014490606263279915, 0.01193458866328001, 0.006181921809911728, 0.005895751528441906, -0.009391577914357185, 0.01635071076452732, 0.0038665463216602802, 0.030568154528737068, -0.009437104687094688, 0.023804137483239174, -0.0384768508374691, -0.012741067446768284, 0.005609581712633371, -0.006045340560376644, -0.02146274782717228, -0.025547172874212265, 0.005443733185529709, 0.0041039371863007545, -0.011063070967793465, 0.020682282745838165, -0.029293397441506386, -0.00023657800920773298, 0.005473000463098288, -0.007355868816375732, -0.02168387919664383, -0.0283048115670681, 0.006731498055160046, -0.008045278489589691, -0.01900428719818592, 0.0010333012323826551, 0.0037202094681560993, -0.00032417691545560956, 0.010328134521842003, 0.0023706580977886915, -0.021072516217827797, 0.0029007229022681713, 0.01450361404567957, -0.017820583656430244, -0.034652579575777054, -0.015830401331186295, 0.009599701501429081, -0.0011251682881265879, 0.030203938484191895, -0.014555645175278187, -0.010906977578997612, 0.011206155642867088, -0.00014105252921581268, -0.003934836946427822, 0.009859856218099594, -0.003677934408187866, -0.020161975175142288, 0.015466185286641121, -0.012532943859696388, -0.026847945526242256, -0.014243459329009056, 0.0004788468941114843, 0.003938089124858379, -0.01226628478616476, -0.0016633629566058517, -0.0025673997588455677, 0.026418691501021385, 0.013788188807666302, 0.032883528620004654, -0.01536212395876646, 0.02409030683338642, 0.011602891609072685, -0.013697135262191296, -0.02634064480662346, 
0.019966859370470047, -0.023439921438694, 0.024675656110048294, 0.00905337743461132, -0.03231118991971016, 0.008916796185076237, -0.00425677839666605, 0.020383106544613838, 0.01049073040485382, -0.005450237076729536, 0.02455858513712883, -0.029241366311907768, 0.012669525109231472, 0.008806230500340462, 0.019966859370470047, 0.016428757458925247, 0.013332918286323547, -0.010861450806260109, 0.008903788402676582, -0.00838998332619667, 0.01203864999115467, 0.0013219100655987859, -0.02208711765706539, ...]"


#### Define function for: Sorting embedding vectors from a given dataframe with respect to closeness to an embedding vector of a given question

In [7]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine="text-embedding-ada-002")
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
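
To make the ranking step concrete, the following is a minimal sketch of how cosine distance orders rows by relevance. It does not call the OpenAI API: the three-dimensional vectors are hypothetical toy values standing in for real embeddings, and `cosine_distance` is an illustrative helper, not the `distances_from_embeddings` function used above.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not real model output)
question_vec = [1.0, 0.0, 0.0]
row_vecs = {
    "row_a": [0.9, 0.1, 0.0],   # points almost the same way as the question
    "row_b": [0.0, 1.0, 0.0],   # orthogonal to the question
    "row_c": [-1.0, 0.0, 0.0],  # points the opposite way
}

distances = {name: cosine_distance(question_vec, vec) for name, vec in row_vecs.items()}

# Ascending sort: the smallest distance (most relevant row) comes first
ranked = sorted(distances, key=distances.get)
print(ranked)  # ['row_a', 'row_b', 'row_c']
```

A distance near 0 means the row points in nearly the same direction as the question, which is why sorting in ascending order puts the most relevant rows first.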


#### Define function for creating a custom prompt based on a question; the context given to this prompt consists of text lines sorted by relevance (the closer the text is to the question, the earlier it appears)

In [8]:
def create_custom_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below with no more than 20 words

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:  # loop for building the context row by row from the context file in the right order as long as the max_token_count is not reached
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
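
The effect of the token budget can be illustrated in isolation. The sketch below mirrors the loop in `create_custom_prompt` under simplifying assumptions: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of using `tiktoken`, and `select_context` and `fixed_cost` (the tokens already taken by the template and the question) are illustrative names that do not appear in the notebook.

```python
def count_tokens(text):
    """Stand-in token counter (whitespace split); the notebook uses tiktoken instead."""
    return len(text.split())

def select_context(rows, fixed_cost, budget):
    """Keep rows, in the given order, until the next row would exceed the budget."""
    used = fixed_cost
    selected = []
    for row in rows:
        used += count_tokens(row)
        if used > budget:
            break
        selected.append(row)
    return selected

# Rows are assumed to be sorted by relevance already (most relevant first)
rows = ["one two three", "four five", "six seven eight nine"]

tiny = select_context(rows, fixed_cost=4, budget=5)    # no row fits
small = select_context(rows, fixed_cost=4, budget=8)   # only the first row fits
large = select_context(rows, fixed_cost=4, budget=20)  # all rows fit
print(len(tiny), len(small), len(large))  # 0 1 3
```

Because the template and the question count against the same budget, a small budget may admit very few context rows, or none at all.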

#### Define function for executing prompts; the "max_tokens" attribute of the completion call limits the number of tokens generated in the answer

In [9]:
def execute_prompt(prompt):
    """
    Given a text prompt, execute it using an OpenAI completion model and print the answer
    """
    answer = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=150
    )["choices"][0]["text"].strip()
    
    print("\033[34m"+answer+"\033[0m") # print answer in blue

#### Define an example question about typical names appearing in a play and ask GPT-3.5-Turbo, the OpenAI completion model used here

In [10]:
question="Name five typical names appearing in a play?"
print("Question: "+question)

print("")
print("Answer from GPT3-Turbo with given context:")
execute_prompt(create_custom_prompt(question, df, 800))

Question: Name five typical names appearing in a play?

Answer from GPT3-Turbo with given context:
John, Tom/Malvolio, Jack, Viola/Cesario, Feste


In all scenarios tried, the answer consists of names that appear in the context with medium="Play". The provided context was therefore given priority over the built-in knowledge of the GPT-3.5-Turbo model.

## 4 Custom Performance Demonstration

In this section, we demonstrate the performance of custom queries using two questions with multiple settings. For each question, we show the answer from a basic `Completion` model query as well as the answers from our custom query using two settings for the prompt token budget (the `max_token_count` argument of `create_custom_prompt`, printed as "max_tokens" in the output below).

In [11]:
def create_basic_prompt(question):
    """
    Given a question, return a simple prompt without context
    """
    prompt = """
    Question: """+question+"""
    Answer:
    """
    return prompt

### Question 1

In the first question, we would like to find out whether the chatbot can retrieve information from the context and pick the precise entry.
For this purpose, we ask about the personality, character, and status of a person named "Olivia", a name that appears twice in the context: Olivia, who appears in a sitcom, and Lady Olivia, who appears in an ancient Greek play.
We ask using three different prompts. The first has no context, whereas the other two contain context but use different token budgets for the prompt (the third argument of `create_custom_prompt`, printed as "max_tokens" in the output).

In [12]:
question1="What would a fictive person in a play with first name being Olivia likely be characeterized in terms of personality, character and status? Please provide a short answer and do not mention relationship information."

# prompt without context
print("Question 1: "+question1)
print("")
print("Answer from GPT3-Turbo:")
execute_prompt(create_basic_prompt(question1))

# prompt with context and maxtoken=100
print("")
print("Answer from GPT3-Turbo with given context and max_tokens=100:")
execute_prompt(create_custom_prompt(question1, df, 100))

# prompt with context and maxtoken=1000
print("")
print("Answer from GPT3-Turbo with given context and max_tokens=1000:")
execute_prompt(create_custom_prompt(question1, df, 1000))

Question 1: What would a fictive person in a play with first name being Olivia likely be characeterized in terms of personality, character and status? Please provide a short answer and do not mention relationship information.

Answer from GPT3-Turbo:
A fictive person named Olivia in a play may be characterized as intelligent, confident, and charming. She likely has a strong sense of morality and is driven by her emotions. In terms of character, she may be described as spirited, independent, and impulsive. As for status, she could be seen as coming from a well-respected and affluent family, or possibly holding a high position in society.

Answer from GPT3-Turbo with given context and max_tokens=100:
Independent, strong-willed and confident protagonist with a high social status and a complex and dynamic personality.

Answer from GPT3-Turbo with given context and max_tokens=1000:
A wealthy and beautiful noblewoman who is melancholy, withdrawn, and not interested in

In the scenarios tried, the custom prompts mostly return character information from the provided context, so providing context helps steer the replies in the right direction. In many runs, the custom prompt also leads to the correct character, "Lady Olivia" from the ancient Greek play, rather than Olivia from the sitcom. This effect became more pronounced the more tokens we allowed for the prompt: with a budget of about 1000 tokens (the larger setting used here), the replies more often contain more properties of the correct Olivia.

### Question 2

In the second question, we would like to find out whether the chatbot is able to retrieve indirect information from the context. For this purpose, we ask the model about the likely love preferences of the character "Malvolio" in terms of attributes such as personality, character, or status. From the context, we know that Malvolio is secretly in love with Lady Olivia, but her attributes do not appear directly in the text line describing Malvolio. The model therefore has to make the link to Lady Olivia's text line.
Again, we ask using three different prompts. The first has no context, whereas the other two contain context but use different token budgets for the prompt (printed as "max_tokens" in the output).

In [13]:
question2="Whith what type of human personality, character or status would a fictive person called 'Malvolio' probably be attracted to? Please provide a short answer."

# prompt without context
print("Question 2: "+question2)
print("")
print("Answer from GPT3-Turbo:")
execute_prompt(create_basic_prompt(question2))

# prompt with context and maxtoken=100
print("")
print("Answer from GPT3-Turbo with given context and max_tokens=100:")
execute_prompt(create_custom_prompt(question2, df, 100))

# prompt with context and maxtoken=1000
print("")
print("Answer from GPT3-Turbo with given context and max_tokens=1000:")
execute_prompt(create_custom_prompt(question2, df, 1000))

Question 2: Whith what type of human personality, character or status would a fictive person called 'Malvolio' probably be attracted to? Please provide a short answer.

Answer from GPT3-Turbo:
Based on the name and origin of the name 'Malvolio,' which comes from the play "Twelfth Night" by William Shakespeare, a fictive person named Malvolio would most likely be attracted to someone who is arrogant, self-righteous, and pretentious, similar to the character of Malvolio in the play. He may also be attracted to someone with a similar high social status or someone who is ambitious and seeks power and prestige.

Answer from GPT3-Turbo with given context and max_tokens=100:
Malvolio would likely be attracted to individuals who exhibit dominant or authoritative personalities or higher social statuses.

Answer from GPT3-Turbo with given context and max_tokens=1000:
A high-status, wealthy and beautiful woman who is uninterested in him, such as Lady Olivia.


Comparing the replies above, we see again that context information is used when provided. Executing the cell multiple times also shows that the indirect information about Malvolio's crush, "Lady Olivia", is delivered more often when the prompt token budget (printed as "max_tokens") is set to a higher value such as 1000.
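
The "more often" observation comes from re-running the cell by hand. One way it could be quantified is sketched below, under assumptions: `mention_rate` is a hypothetical helper, and `fake_ask` is a random stand-in for a model call so that the sketch runs offline. To use it with the notebook, `execute_prompt` would first have to be adapted to return the answer string instead of only printing it.

```python
import random

def mention_rate(ask, keyword, trials):
    """Fraction of `trials` calls to `ask()` whose answer contains `keyword`."""
    hits = sum(keyword.lower() in ask().lower() for _ in range(trials))
    return hits / trials

# Stand-in for a model call (hypothetical answers, not real model output)
rng = random.Random(0)

def fake_ask():
    return rng.choice([
        "A noblewoman such as Lady Olivia.",
        "Someone of high social status.",
    ])

rate = mention_rate(fake_ask, "Lady Olivia", trials=20)
print(f"'Lady Olivia' mentioned in {rate:.0%} of answers")
```

Computing this rate once per token budget would turn the qualitative comparison into two numbers that can be compared directly.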

## 5 Conclusion

From this survey we can conclude that providing context to a completion model helps considerably in getting the desired answers. The token budget for the prompt (the `max_token_count` argument of `create_custom_prompt`, printed as "max_tokens" above), which sets the upper limit on prompt tokens and hence on how much context is included, turned out to be quite important for getting a precise answer. The precision in finding the right context line was investigated with question 1, whereas the precision in finding indirect information was checked with question 2. In both cases, allowing more tokens led to more precision.