# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [10]:
from pathlib import Path
import pandas as pd
import math
import time
import collections
import numpy as np
from tqdm.notebook import tqdm
import os
import openai
openai.api_base = "https://openai.vocareum.com/v1"
# Read the key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [11]:
df = pd.read_csv('data/character_descriptions.csv')
df['text'] = df.apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df = df[['text']]
df

Unnamed: 0,text
0,"Emily A young woman in her early 20s, Emily is..."
1,"Jack A middle-aged man in his 40s, Jack is a s..."
2,"Alice A woman in her late 30s, Alice is a warm..."
3,"Tom A man in his 50s, Tom is a retired soldier..."
4,"Sarah A woman in her mid-20s, Sarah is a free-..."
5,"George A man in his early 30s, George is a cha..."
6,"Rachel A woman in her late 20s, Rachel is a sh..."
7,"John A man in his 60s, John is a retired profe..."
8,"Maria A middle-aged Latina woman in her 40s, M..."
9,Caleb A young African American man in his earl...
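
As a rough illustration of the wrangling step above, the sketch below joins every column of each row into a single `"text"` column. The column names and the two rows are hypothetical stand-ins, not the contents of `data/character_descriptions.csv`, and `to_text_frame` is a helper name introduced only for this example.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself reads data/character_descriptions.csv.
raw = pd.DataFrame({
    "Name": ["Emily", "Jack"],
    "Description": ["A young woman in her early 20s.", "A middle-aged man in his 40s."],
    "Medium": ["Play", "Play"],
})


def to_text_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Join all columns of each row into one space-separated 'text' column."""
    text = frame.astype(str).apply(" ".join, axis=1)
    return pd.DataFrame({"text": text})


text_df = to_text_frame(raw)
print(text_df["text"].tolist())
```

On the real dataset, a check such as `len(text_df) >= 20` would confirm the row-count requirement stated in the task.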


In [12]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=df["text"].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

embeddings = [data["embedding"] for data in response["data"]]

In [13]:
df["embeddings"] = embeddings
df.to_csv("embeddings.csv")

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [14]:
def get_df(question, embedding_model_name):
    """Return a copy of df sorted by cosine distance to the question."""
    from openai.embeddings_utils import distances_from_embeddings

    # Generate the embedding for the question
    response = openai.Embedding.create(
        input=question,
        engine=embedding_model_name
    )

    # Extract the embedding from the response
    question_embedding = response["data"][0]["embedding"]

    # Cosine distance between the question and every row embedding
    distances = distances_from_embeddings(
        question_embedding,
        df["embeddings"].tolist(),
        distance_metric="cosine"
    )

    # Work on a copy so the global dataframe is not reordered in place
    sorted_df = df.copy()
    sorted_df["distances"] = distances
    return sorted_df.sort_values(by="distances", ascending=True)
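
To make the ranking step concrete, here is a minimal sketch of cosine distance (one minus cosine similarity) in plain Python. It is not the implementation inside `openai.embeddings_utils`; the vectors and labels are made up for illustration, and `cosine_distance` is a helper defined only for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings"; real ada-002 embeddings have many more dimensions.
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more similar, so sort ascending as the notebook does.
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not affect the result, which is why `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.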

In [15]:
import tiktoken
# Create a tokenizer that is designed to align with our embeddings
tokenizer = tiktoken.get_encoding("cl100k_base")
token_limit = 1000

In [16]:
def prompt(question):
    """Build a prompt from the rows closest to the question, within token_limit."""
    prompt_template = """
    Answer the question based on the context below, and if the
    question can't be answered based on the context, say
    "I don't know"

    Context:

    {}

    ---

    Question: {}
    Answer:"""
    # Count the number of tokens in the prompt template and question
    token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    # Create a list to store text for context
    context_list = []

    # Rows sorted by distance to the question, closest first
    sorted_df = get_df(question, EMBEDDING_MODEL_NAME)
    print(sorted_df)  # Print once, not on every loop iteration

    # Loop over rows of the sorted dataframe
    for text in sorted_df['text'].values:
        token_count += len(tokenizer.encode(text))
        # Append text to context_list if there is enough room
        if token_count <= token_limit:
            context_list.append(text)
        else:
            break

    # Use string formatting to complete the prompt
    return prompt_template.format(
        "\n\n###\n\n".join(context_list),
        question
    )
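
The token-budget loop above can be sketched in isolation as follows. This is an illustrative version only: `select_context` and `count_words` are hypothetical names, and a whitespace word count stands in for the `tiktoken` tokenizer so the example runs without any downloads.

```python
def select_context(texts, budget, base_cost, count_tokens):
    """Take texts in order until adding the next one would exceed the budget."""
    selected = []
    used = base_cost  # tokens already spent on the template and question
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break
        selected.append(text)
    return selected


def count_words(text):
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


texts = ["a b c", "d e", "f g h i", "j"]
print(select_context(texts, budget=8, base_cost=2, count_tokens=count_words))
```

Note that the loop stops at the first text that does not fit, so the short final entry `"j"` is skipped even though it would fit on its own. That matches the notebook's behavior and keeps the context limited to the closest rows.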

In [17]:
def question(prompt, completion_model, max_tokens):
    response = openai.Completion.create(
        model=completion_model,
        prompt=prompt,
        max_tokens=max_tokens
    )
    print('response:', response)
    return response["choices"][0]["text"].strip()

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
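
One way to show both answers side by side is a small helper that sends the bare question and the context-augmented prompt through the same completion function. The sketch below is an assumption about how this could be organized, not part of the course materials: `compare_answers` is a hypothetical helper, and the two `fake_*` functions are stubs that call no model and return placeholder strings.

```python
def compare_answers(question_text, build_prompt, complete):
    """Return the basic answer and the custom (context-augmented) answer."""
    return {
        "basic": complete(question_text),
        "custom": complete(build_prompt(question_text)),
    }


def fake_build_prompt(question_text):
    """Stub prompt builder: wraps the question in a placeholder context."""
    return f"Context:\n\n<retrieved rows>\n\n---\n\nQuestion: {question_text}\nAnswer:"


def fake_complete(prompt_text):
    """Stub completion: only reports whether a context section was supplied."""
    if "Context:" in prompt_text:
        return "answered with context"
    return "answered without context"


result = compare_answers("What is Emily age?", fake_build_prompt, fake_complete)
print(result)
```

In this notebook, the stubs could be swapped for `prompt` and a wrapper such as `lambda p: question(p, "gpt-3.5-turbo-instruct", 150)`.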

### Question 1

In [18]:
Question_1 = """What is Emily age?"""
question(prompt(Question_1), "gpt-3.5-turbo-instruct", 150)

                                                 text  \
0   Emily A young woman in her early 20s, Emily is...   
2   Alice A woman in her late 30s, Alice is a warm...   
5   George A man in his early 30s, George is a cha...   
4   Sarah A woman in her mid-20s, Sarah is a free-...   
6   Rachel A woman in her late 20s, Rachel is a sh...   
3   Tom A man in his 50s, Tom is a retired soldier...   
16  Tahlia A young Indigenous Australian woman in ...   
14  Mia A young Australian woman in her mid-20s, M...   
18  Ava A middle-aged Australian woman in her 50s,...   
1   Jack A middle-aged man in his 40s, Jack is a s...   
7   John A man in his 60s, John is a retired profe...   
8   Maria A middle-aged Latina woman in her 40s, M...   
30  Sophia A fun-loving and adventurous travel blo...   
26  Olivia A confident and charismatic marketing e...   
49  Abigail A plucky and resourceful young woman w...   
11  Sonya A white woman in her late 20s, Sonya is ...   
40  Lady Olivia A wealthy and b

response: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " I don't know"
    }
  ],
  "created": 1724763084,
  "id": "cmpl-A0pxMcJQbota8kwoE9S0tj76zcNlB",
  "model": "gpt-3.5-turbo-instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 4,
    "prompt_tokens": 967,
    "total_tokens": 971
  }
}


"I don't know"

### Question 2