# Custom Chatbot Project

I chose the `character_descriptions.csv` dataset because its characters were invented by an OpenAI model, so an LLM is unlikely to have seen them during training. It is also a good test for a RAG system: each row describes a different character, and many characters are related to other characters in the same production.

character_descriptions.csv - this file contains character descriptions from theater, television, and film productions. Each row contains the name, description, medium, and setting. All characters were invented by an OpenAI model.

In [35]:
import pandas
import openai
import tiktoken
import numpy as np
from openai.embeddings_utils import distances_from_embeddings, get_embedding

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [49]:
openai.api_key =  "YOUR API KEY"
openai.api_base = "https://openai.vocareum.com/v1"
LLM_NAME = "gpt-3.5-turbo-instruct"

In [None]:
df_context = pandas.read_csv("../rag_data/character_descriptions.csv")

In [11]:
df_context.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [15]:
def get_context_text(row):
    name = f"name: {row['Name']}"
    description = f"description: {row['Description']}"    
    medium = f"medium: {row['Medium']}"    
    setting = f"setting: {row['Setting']}"
    context_list = [name, description, medium, setting]
    text = " - ".join(context_list)
    
    return text

df_context["text"] = df_context.apply(get_context_text, axis = 1)

In [16]:
df_context.head()

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,name: Emily - description: A young woman in he...
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,name: Jack - description: A middle-aged man in...
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,name: Alice - description: A woman in her late...
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,"name: Tom - description: A man in his 50s, Tom..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,name: Sarah - description: A woman in her mid-...


In [17]:
df_context.shape

(55, 5)

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

#### Creating embeddings for each text

In [20]:
EMBEDDING_MODEL_NAME = "text-embedding-3-large"
batch_size = 20  # number of texts sent per embeddings request
total_batch_size = df_context.shape[0]  # total number of rows to embed

In [30]:
embeddings = []
for i in range(0, total_batch_size, batch_size):
    batch_text_list = df_context.iloc[i:i+batch_size]["text"].tolist()
    
    emb_response = openai.Embedding.create(
        input=batch_text_list,
        engine=EMBEDDING_MODEL_NAME
    )

    embeddings.extend(
        [data["embedding"] for data in emb_response["data"]]
    )

df_context["embeddings"] = embeddings
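
The batching loop above can be checked without calling the API. The sketch below is a minimal illustration, assuming a hypothetical `fake_embed` function as a stand-in for `openai.Embedding.create` and a hypothetical `batched` helper; neither is part of the notebook. It only shows that slicing in steps of `batch_size` visits every row exactly once and keeps the original order.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Hypothetical stand-in for the embeddings endpoint: one vector per text."""
    return [[float(len(text))] for text in texts]


texts = [f"character {i}" for i in range(55)]  # same row count as df_context

embeddings = []
batch_sizes = []
for batch in batched(texts, batch_size=20):
    batch_sizes.append(len(batch))
    embeddings.extend(fake_embed(batch))

print(batch_sizes)      # [20, 20, 15]
print(len(embeddings))  # 55
```

With 55 rows and a batch size of 20, the last batch is smaller than the others, which the slice handles without any special case.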

In [None]:
# Checkpoint: save the embeddings so they do not have to be recomputed

df_context.to_csv("../rag_data/character_descriptions_embeddings.csv", index = False)

In [None]:
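import ast

df_context = pandas.read_csv("../rag_data/character_descriptions_embeddings.csv")
# Embeddings are stored as stringified lists; parse them with literal_eval instead of eval
df_context["embeddings"] = df_context["embeddings"].apply(ast.literal_eval).apply(np.array)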
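
The parsing step after `read_csv` is needed because a CSV file stores each embedding list as plain text. Here is a minimal sketch of that round trip, using only the standard library and a made-up vector (the values are illustrative, not real embeddings). `ast.literal_eval` is one way to parse the string back, and it only accepts Python literals.

```python
import ast
import csv
import io

rows = [{"Name": "Emily", "embeddings": [0.1, -0.2, 0.3]}]  # made-up vector

# Write: the list is serialised to its string form inside a single CSV field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Name", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read: the field comes back as a plain string, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embeddings"]
print(type(raw).__name__, raw)        # str [0.1, -0.2, 0.3]

# Parse the string back into a list of floats.
vector = ast.literal_eval(raw)
print(type(vector).__name__, vector)  # list [0.1, -0.2, 0.3]
```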

In [55]:
def get_most_similars(query, data_frame, engine = EMBEDDING_MODEL_NAME):
    embeddings_column = "embeddings"
    if embeddings_column not in data_frame.columns:
        raise ValueError(f"The column {embeddings_column} must exist in the input dataframe!")
    
    # Embed the query with the same model used for the context rows
    query_embedding = get_embedding(query, engine=engine)
    
    # Cosine distance between the query and every row (smaller means more similar)
    query_distances = distances_from_embeddings(
        query_embedding,
        data_frame[embeddings_column].values,
        distance_metric="cosine"
    )
    
    # Return a copy sorted from most to least similar
    df_query = data_frame.copy()
    df_query["distances"] = query_distances
    df_query.sort_values(["distances"], ascending = True, inplace = True)
    df_query.reset_index(drop=True, inplace=True)
    
    return df_query
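
To make the ranking step concrete, here is a small self-contained sketch of cosine distance (1 minus cosine similarity) computed by hand. The three-dimensional vectors and the helper name `cosine_distance` are made up for illustration; real `text-embedding-3-large` vectors are much longer, and the notebook itself relies on `distances_from_embeddings` for this calculation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
documents = {
    "close": [0.9, 0.1, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

# Sort document names from smallest to largest distance to the query.
ranked = sorted(documents, key=lambda name: cosine_distance(query, documents[name]))
print(ranked)  # ['close', 'orthogonal', 'opposite']
```

Sorting in ascending order of distance puts the most relevant rows first, which is why the notebook sorts with `ascending = True`.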

In [56]:
def create_prompt(question, df, max_token_count):
    # Tokenizer used to count prompt tokens; cl100k_base is the encoding
    # used by both gpt-3.5-turbo-instruct and text-embedding-3-large
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Prompt template with placeholders for the retrieved context and the question
    prompt_template = """ Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"
    Context: 
    {}
    ---
    
    Question: {}
    Answer:
    """


    context = []
    full_context_ordered = get_most_similars(question, df)["text"].values
    
    for ordered_text_context in full_context_ordered:
        current_prompt = prompt_template.format("\n\n###\n\n".join(context), question)
        
        ordered_text_context_token_count = len(tokenizer.encode(ordered_text_context))
        current_prompt_token_count = len(tokenizer.encode(current_prompt))
        current_token_count = ordered_text_context_token_count + current_prompt_token_count

        if current_token_count <= max_token_count:
            context.append(ordered_text_context)
        else:
            break

    full_prompt = prompt_template.format("\n\n###\n\n".join(context), question)
    
    return full_prompt
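
The greedy packing in `create_prompt` can be illustrated without `tiktoken` or the API. The sketch below is a simplified version under stated assumptions: `count_tokens` is a hypothetical whitespace-based counter standing in for the real tokenizer, and `pack_context` is an illustrative helper, not part of the notebook. It adds candidate texts in ranked order until the next one would push the prompt over the budget.

```python
def count_tokens(text):
    """Hypothetical token counter: splits on whitespace (NOT a real tokenizer)."""
    return len(text.split())


def pack_context(ranked_texts, question, max_tokens, separator=" ### "):
    """Greedily keep the best-ranked texts while the prompt fits in max_tokens."""
    template = "Context: {} --- Question: {} Answer:"
    kept = []
    for text in ranked_texts:
        candidate = template.format(separator.join(kept + [text]), question)
        if count_tokens(candidate) > max_tokens:
            break  # stop at the first text that does not fit
        kept.append(text)
    return template.format(separator.join(kept), question)


ranked_texts = ["a b c", "d e f g", "h i"]
prompt = pack_context(ranked_texts, "who is a ?", max_tokens=16)
print(prompt)  # Context: a b c ### d e f g --- Question: who is a ? Answer:
```

Unlike the notebook's version, this sketch measures the full candidate prompt on each iteration, so the separator tokens are included in the count.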

In [67]:
def open_ai_completion(llm_name, prompt, max_answer_tokens):
    try:
        response = openai.Completion.create(
            model=llm_name,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
                
        answer = response["choices"][0]["text"].strip()
        
        return answer
    
    except Exception as e:
        print(e)
        
        return ""

def get_chat_answer(question, 
                    dataframe=df_context, 
                    max_prompt_tokens=3800,
                    max_answer_tokens=200,
                    llm_name=LLM_NAME
                   ):
    
    prompt = create_prompt(question, dataframe, max_prompt_tokens)
    answer = open_ai_completion(llm_name, prompt, max_answer_tokens)
    
    return answer
    

In [68]:
def get_basic_chat_asnwer(question,
                          max_answer_tokens=200,
                          llm_name=LLM_NAME
                         ):
    prompt = f"""
    Question: {question}
    Answer:
    """
    
    answer = open_ai_completion(llm_name, prompt, max_answer_tokens)
    
    return answer

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [73]:
question1 = "Who is Emily?"

In [70]:
## Basic

get_basic_chat_asnwer(question1)

"I'm not sure! Could you provide some context? Thanks!"

In [71]:
## Contextual from RAG

get_chat_answer(question1)

"Emily is an aspiring actress and Alice's daughter. She is also in a relationship with George and is in her early 20s."

### Question 2

In [74]:
question2 = "Who is well-traveled and cultured, with a beautiful voice?"

In [75]:
## Basic

get_basic_chat_asnwer(question2)

'Michael Palin'

In [76]:
## Contextual from RAG

get_chat_answer(question2)

'Prince Lorenzo'