# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

Choose `nyc_food_scrap_drop_off_sites.csv` dataset. It contains the details info about the nyc food scape drop off which likely not included in the foundation model pretrained dataset, so it's easy to verify it if our custom chatbot actually works with the RAG dataset.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [59]:
import openai
import pandas as pd
from pathlib import Path

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [60]:
filename = "./data/nyc_food_scrap_drop_off_sites.csv"
df = pd.read_csv(filename)

text_columns = [
    'borough',
    'ntaname',
    'food_scrap_drop_off_site',
    'location',
    'hosted_by',
    'open_months',
    'operation_day_hours',
    'website',
    'notes',
]

df[text_columns] = df[text_columns].fillna('')

df['text'] = df.apply(
    lambda row: f"Location: {row['food_scrap_drop_off_site']} in {row['ntaname']}, {row['borough']}. "
                f"Address: {row['location']}. "
                f"Hosted by: {row['hosted_by']}. "
                f"Schedule: Open {row['open_months']}, {row['operation_day_hours']}. "
                f"Website: {row['website']}"
                f"Notes: {row['notes']}",
    axis=1  # 'axis=1' tells pandas to apply the function to each row.
)

text_df = pd.DataFrame(df['text'])

In [61]:
text_df.head(10).to_string()

'                                                                                                                                                                                                                                                                                                                                                                                               text\n0                                                                                                                         Location: South Beach in Grasmere-Arrochar-South Beach-Dongan Hills, Staten Island. Address: 21 Robin Road, Staten Island NY. Hosted by: Snug Harbor Youth. Schedule: Open Year Round, Friday (Start Time: 1:30 PM - End Time:  4:30 PM). Website: snug-harbor.orgNotes: \n1                                                                   Location: SE Corner of Broadway & Academy Street in Inwood, Manhattan. Address: . Hosted by: Department of Sanitation. Schedule: Open Year Round, 24/7. W

In [62]:
text_df

Unnamed: 0,text
0,Location: South Beach in Grasmere-Arrochar-Sou...
1,Location: SE Corner of Broadway & Academy Stre...
2,Location: Old Stone House Brooklyn in Park Slo...
3,Location: SE Corner of Pleasant Avenue & E 116...
4,"Location: Malcolm X FSDO in Corona, Queens. Ad..."
...,...
571,Location: Albemarle Road and McDonald Avenue i...
572,Location: NW Corner of 21st Street & 30th Driv...
573,Location: Rochester Avenue & St. Johns Place i...
574,Location: *CLOSED FOR THE SEASON* East 4th Str...


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [63]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(text_df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
text_df["embeddings"] = embeddings
text_df

Unnamed: 0,text,embeddings
0,Location: South Beach in Grasmere-Arrochar-Sou...,"[0.008268075995147228, -0.013524645939469337, ..."
1,Location: SE Corner of Broadway & Academy Stre...,"[0.011617721058428288, 0.004530502017587423, -..."
2,Location: Old Stone House Brooklyn in Park Slo...,"[0.017113445326685905, -0.03115324303507805, 0..."
3,Location: SE Corner of Pleasant Avenue & E 116...,"[0.016948765143752098, 0.0033342379610985518, ..."
4,"Location: Malcolm X FSDO in Corona, Queens. Ad...","[0.002668992383405566, -0.023985659703612328, ..."
...,...,...
571,Location: Albemarle Road and McDonald Avenue i...,"[0.024264298379421234, -0.007152635138481855, ..."
572,Location: NW Corner of 21st Street & 30th Driv...,"[0.020397387444972992, -0.0006399444537237287,..."
573,Location: Rochester Avenue & St. Johns Place i...,"[0.01732088439166546, -0.0033952868543565273, ..."
574,Location: *CLOSED FOR THE SEASON* East 4th Str...,"[0.012309753336012363, -0.020065300166606903, ..."


In [64]:
text_df.to_csv("./data/embeddings.csv")

In [65]:
! ls ./data

2023_fashion_trends.csv     embeddings.csv
character_descriptions.csv  nyc_food_scrap_drop_off_sites.csv


In [66]:
# find related pieces of the text for a given question
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, text_df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from least to most relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    text_df_copy = text_df.copy()
    text_df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        text_df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    text_df_copy.sort_values("distances", ascending=True, inplace=True)
    return text_df_copy

In [67]:
custom_anwsers = get_rows_sorted_by_relevance("I live in Manhattan area, want to find food drop off sites that operate 24/7", text_df)


In [68]:
custom_anwsers.head(5).to_string()
# print(custom_anwsers.head(5))

'                                                                                                                                                                                                                                                                                                                                                                                  text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [69]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [70]:
print(create_prompt("I want to find the food scrape dropoff site hosted by Snug Harbor Youth in Staten Island borough, what's the location and if they accept meat and diary?", text_df, 50))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 



---

Question: I want to find the food scrape dropoff site hosted by Snug Harbor Youth in Staten Island borough, what's the location and if they accept meat and diary?
Answer:


In [71]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [72]:
question1 = "I want to find the food scrape dropoff site hosted by Snug Harbor Youth in Staten Island borough, what's the location and if they accept meat and diary?"
custom_answer1 = answer_question(question1, text_df)
print(custom_answer1)

There are six common locations for food scrap dropoff sites hosted by Snug Harbor Youth, but most of them are closed for the season. One of the open year round locations is at the Venture House in Port Richmond, Staten Island at 1442 Castleton Avenue. However, they do not accept meat or dairy.


In [73]:
original_prompt = f"""
Question: ":{question1}"
Answer:
"""
original_answer1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=original_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(original_answer1)

Unfortunately, I was unable to find information specifically about a food scrape dropoff site hosted by Snug Harbor Youth in Staten Island borough. Here are a few options for finding this information:

1. Contact Snug Harbor Youth directly. The organization's contact information can likely be found on their website or social media pages. They will be able to provide you with more information about their food scrape dropoff site.

2. Check the Staten Island borough government website. They may have information about food scrape dropoff sites in the area, including any hosted by Snug Harbor Youth.

3. Reach out to local community organizations or composting companies. They may be aware of any food scrape dropoff sites in the area and their policies on accepting meat and dairy.


### Question 2

In [74]:
question2 = "I want to find the 'East New York Farms: UCC Youth Farm' food scrape dropoff site, what's the location and if they accept meat and diary?"
custom_answer2 = answer_question(question2, text_df)
print(custom_answer2)

The East New York Farms: UCC Youth Farm is located at 613 New Lots Avenue in East New York-New Lots, Brooklyn. According to the notes, they do not accept meat, bones, or dairy.


In [76]:
original_prompt2 = f"""
Question: ":{question2}"
Answer:
"""
original_answer2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=original_prompt2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(original_answer2)
# it doesn't accept meat

According to the East New York Farms website, the UCC Youth Farm food scrap drop-off site is located at 613 Aberdeen Street, Brooklyn, NY. It is open from May to November on Saturdays from 9am-1pm. It accepts all food scraps, including meat and dairy.
