# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

* I chose the `nyc_food_scrap_drop_off_sites.csv` dataset because it contains recent, concrete, and verifiable information (from early 2023) about specific locations, which makes it easy to show the value of a custom prompt with additional context for an OpenAI model.
* With `2023_fashion_trends.csv` or `character_descriptions.csv`, it is hard to tell whether a response comes from the provided context or is a hallucination. The food scrap drop-off data postdates the model's knowledge cutoff (September 2021 for GPT-3.5), so correct site-specific answers can be attributed to the supplied context.
* This clear separation makes the dataset a good choice for demonstrating how a custom prompt with additional context can reduce hallucinations and improve accuracy.



## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
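import os
import openai
openai.api_base = "https://openai.vocareum.com/v1"
# Read the API key from the environment rather than hard-coding a secret
# (the variable name OPENAI_API_KEY is an assumption; use whatever your setup defines)
openai.api_key = os.environ["OPENAI_API_KEY"]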

import numpy as np
import pandas as pd

In [2]:
# Load the data
df = pd.read_csv('./data/nyc_food_scrap_drop_off_sites.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,website,borocd,...,location_point,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,notes,ct2010,bbl,bin
0,0,Staten Island,Grasmere-Arrochar-South Beach-Dongan Hills,South Beach,"21 Robin Road, Staten Island NY",Snug Harbor Youth,Year Round,Friday (Start Time: 1:30 PM - End Time: 4:30 PM),snug-harbor.org,502,...,"{'type': 'Point', 'coordinates': [-74.062991, ...",1.0,14.0,76.0,10692.0,30.0,,,,
1,1,Manhattan,Inwood,SE Corner of Broadway & Academy Street,,Department of Sanitation,Year Round,24/7,www.nyc.gov/smartcomposting,112,...,,,,,,,Download the app to access bins. Accepts all f...,,,
2,2,Brooklyn,Park Slope,Old Stone House Brooklyn,"336 3rd St, Brooklyn, NY 11215",Old Stone House Brooklyn,Year Round,24/7 (Start Time: 24/7 - End Time: 24/7),,306,...,"{'type': 'Point', 'coordinates': [-73.984731, ...",2.0,27.0,50.0,17617.0,14.0,,,,
3,3,Manhattan,East Harlem (North),SE Corner of Pleasant Avenue & E 116 Street,,Department of Sanitation,Year Round,24/7,www.nyc.gov/smartcomposting,111,...,,,,,,,Download the app to access bins. Accepts all f...,,,
4,4,Queens,Corona,Malcolm X FSDO,"111-26 Northern Blvd, Flushing, NY 11368",NYC Compost Project Hosted by Big Reuse,Year Round,Tuesdays (Start Time: 12:00 PM - End Time: 2:...,,404,...,"{'type': 'Point', 'coordinates': [-73.8630721,...",3.0,21.0,68.0,14510.0,66.0,,,,


In [3]:
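# Keep only the columns needed for the text; copy so that adding
# columns later does not trigger pandas' SettingWithCopyWarning
df = df[['borough', 'ntaname', 'food_scrap_drop_off_site',
       'location', 'hosted_by', 'open_months', 'operation_day_hours']].copy()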

In [4]:
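import json

# Serialize each row to a JSON string. json.dumps quotes values correctly
# (plain string replacement breaks on apostrophes and on words containing "nan"),
# and missing values (NaN) are mapped to null explicitly
records = df.to_dict(orient='records')
df['text'] = [json.dumps({k: (None if pd.isna(v) else v) for k, v in r.items()}, ensure_ascii=False)
              for r in records]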

In [5]:
df.head()

Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,text
0,Staten Island,Grasmere-Arrochar-South Beach-Dongan Hills,South Beach,"21 Robin Road, Staten Island NY",Snug Harbor Youth,Year Round,Friday (Start Time: 1:30 PM - End Time: 4:30 PM),"{""borough"": ""Staten Island"", ""ntaname"": ""Grasm..."
1,Manhattan,Inwood,SE Corner of Broadway & Academy Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Manhattan"", ""ntaname"": ""Inwood"", ..."
2,Brooklyn,Park Slope,Old Stone House Brooklyn,"336 3rd St, Brooklyn, NY 11215",Old Stone House Brooklyn,Year Round,24/7 (Start Time: 24/7 - End Time: 24/7),"{""borough"": ""Brooklyn"", ""ntaname"": ""Park Slope..."
3,Manhattan,East Harlem (North),SE Corner of Pleasant Avenue & E 116 Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Manhattan"", ""ntaname"": ""East Harl..."
4,Queens,Corona,Malcolm X FSDO,"111-26 Northern Blvd, Flushing, NY 11368",NYC Compost Project Hosted by Big Reuse,Year Round,Tuesdays (Start Time: 12:00 PM - End Time: 2:...,"{""borough"": ""Queens"", ""ntaname"": ""Corona"", ""fo..."


## Generate Embeddings

Here we use OpenAI's `Embedding` endpoint to create a vector representing each row of our custom dataset. Rows are sent in batches so that the whole dataset needs only a few requests.

In [6]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME,
        encoding_format="float"
    )
        
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head()

Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,text,embeddings
0,Staten Island,Grasmere-Arrochar-South Beach-Dongan Hills,South Beach,"21 Robin Road, Staten Island NY",Snug Harbor Youth,Year Round,Friday (Start Time: 1:30 PM - End Time: 4:30 PM),"{""borough"": ""Staten Island"", ""ntaname"": ""Grasm...","[0.0017317323, 0.00020214643, 0.014749031, -0...."
1,Manhattan,Inwood,SE Corner of Broadway & Academy Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Manhattan"", ""ntaname"": ""Inwood"", ...","[-0.0030509813, 0.008481867, -0.007852299, 0.0..."
2,Brooklyn,Park Slope,Old Stone House Brooklyn,"336 3rd St, Brooklyn, NY 11215",Old Stone House Brooklyn,Year Round,24/7 (Start Time: 24/7 - End Time: 24/7),"{""borough"": ""Brooklyn"", ""ntaname"": ""Park Slope...","[0.0029240174, -0.008304901, 0.002275197, -0.0..."
3,Manhattan,East Harlem (North),SE Corner of Pleasant Avenue & E 116 Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Manhattan"", ""ntaname"": ""East Harl...","[0.0037774695, 0.00606674, -0.012817154, -0.00..."
4,Queens,Corona,Malcolm X FSDO,"111-26 Northern Blvd, Flushing, NY 11368",NYC Compost Project Hosted by Big Reuse,Year Round,Tuesdays (Start Time: 12:00 PM - End Time: 2:...,"{""borough"": ""Queens"", ""ntaname"": ""Corona"", ""fo...","[-0.00012270009, 0.0049185203, -0.015425154, -..."


## Create a Function that Finds Related Pieces of Text for a Given Question

In [7]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
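
To make the ranking step concrete, here is a minimal pure-Python sketch of sorting by cosine distance. It assumes the usual definition (1 minus cosine similarity) and uses made-up 2-D vectors in place of real embeddings; `cosine_distance` and `rank_by_distance` are illustrative names, not part of the `openai` library.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_by_distance(query, vectors):
    """Return (index, distance) pairs sorted from closest to farthest."""
    distances = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(distances, key=lambda pair: pair[1])

# Toy 2-D "embeddings" (made up for illustration only)
query = [1.0, 0.0]
rows = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranking = rank_by_distance(query, rows)
print([i for i, _ in ranking])  # [2, 1, 0]
```

Row 2 points in the same direction as the query, so its distance is 0 and it ranks first even though its length differs. Cosine distance depends only on direction, not on magnitude.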

In [8]:
results = get_rows_sorted_by_relevance("List 5 different food scrap drop off sites in Brooklyn", df)

In [9]:
results.head()

Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,text,embeddings,distances
7,Brooklyn,Bedford-Stuyvesant (East),NW Corner of Malcolm X Boulevard & Bainbridge ...,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Brooklyn"", ""ntaname"": ""Bedford-St...","[0.003943995, -0.0038571993, -0.015192714, -0....",0.173307
407,Brooklyn,Bedford-Stuyvesant (East),NE Corner of Throop Avenue & Madison Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Brooklyn"", ""ntaname"": ""Bedford-St...","[0.00093096285, -0.0019967728, -0.015268829, -...",0.174045
516,Brooklyn,Bedford-Stuyvesant (East),SE Corner of Throop Avenue & Hart Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Brooklyn"", ""ntaname"": ""Bedford-St...","[0.0010837029, -0.0027712947, -0.012501194, -0...",0.174263
210,Brooklyn,Bedford-Stuyvesant (East),SW Corner of Albany Avenue & Herkimer Street,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Brooklyn"", ""ntaname"": ""Bedford-St...","[0.005686767, 0.001479421, -0.013620669, -0.01...",0.174746
339,Brooklyn,Bedford-Stuyvesant (East),NW Corner of Malcolm X Boulevard & Gates Avenue,,Department of Sanitation,Year Round,24/7,"{""borough"": ""Brooklyn"", ""ntaname"": ""Bedford-St...","[0.0024571805, -0.0052703237, -0.015952665, -0...",0.174811


## Create a Function that Composes a Text Prompt

In [27]:
def create_prompt(question, df, number_of_closest_match=10):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    
    # Prompt template with two placeholders: the retrieved context and the question
    prompt_template = """
    You will be provided the information of the food scrap drop-off locations in New York City in JSON format with the following fields:
    'borough': NYC Borough where vendor is located
    'ntaname': Neighborhood Tabulation Area Name
    'food_scrap_drop_off_site': Name of food scrap drop-off location
    'location': Street address or cross streets associated with food scrap drop-off location
    'hosted_by': Name of the organization that services the food scraps that are dropped off.
    'open_months': Months when food scraps can be dropped off at the location.
    'operation_day_hours': Days and hours when food scraps can be dropped off.
    
Answer the question based on the provided information, and if the question
can't be answered based on the context, say "I don't know"

Here are the location information: 

{}

---

Question: {}
Answer:"""
    
    
    closest_match = get_rows_sorted_by_relevance(question, df)
    concatenated_results = '\n\n###\n\n'.join(closest_match['text'].head(number_of_closest_match).astype(str))

    return prompt_template.format(concatenated_results, question)
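
The function above always includes a fixed number of rows. If the prompt length needs to be capped instead, one option is to add rows until a budget is reached. The sketch below is hypothetical: `approx_token_count` and `select_context` are names introduced here, and whitespace-separated words are only a rough stand-in for model tokens (a real tokenizer could be passed in as `count_tokens`).

```python
def approx_token_count(text):
    # Assumption: whitespace-separated words as a rough proxy for tokens
    return len(text.split())

def select_context(texts, budget, separator="\n\n###\n\n",
                   count_tokens=approx_token_count):
    """Keep texts in order until adding the next one would exceed the budget."""
    selected = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        if selected:
            # Every text after the first also pays for one separator
            cost += count_tokens(separator)
        if used + cost > budget:
            break
        selected.append(text)
        used += cost
    return separator.join(selected)

# Toy example: the third text would push the total past the budget
context = select_context(["a b c", "d e", "f g h i"], budget=6)
print(repr(context))  # 'a b c\n\n###\n\nd e'
```

Because the rows arrive sorted by relevance, stopping at the first row that does not fit keeps the most relevant context and drops the rest.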
    

In [28]:
prompt = create_prompt("List 5 different food scrap drop off sites in Brooklyn", df)

In [29]:
print(prompt)


    You will be provided the information of the food scrap drop-off locations in New York City in JSON format with the following fields:
    'borough': NYC Borough where vendor is located
    'ntaname': Neighborhood Tabulation Area Name
    'food_scrap_drop_off_site': Name of food scrap drop-off location
    'location': Street address or cross streets associated with food scrap drop-off location
    'hosted_by': Name of the organization that services the food scraps that are dropped off.
    'open_months': Months when food scraps can be dropped off at the location.
    'operation_day_hours': Days and hours when food scraps can be dropped off.
    
Answer the question based on the provided information, and if the question
can't be answered based on the context, say "I don't know"

Here are the location information: 

{"borough": "Brooklyn", "ntaname": "Bedford-Stuyvesant (East)", "food_scrap_drop_off_site": "NW Corner of Malcolm X Boulevard & Bainbridge Street", "location": null, "host

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [22]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question, df, max_answer_tokens=500):
    """
    Given a question, a dataframe containing rows of text and their
    embeddings, and a maximum number of tokens for the response, return
    the answer to the question according to an OpenAI Completion model
    
    If the API call raises an error, print it and return an empty string
    """
    
    prompt = create_prompt(question, df)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"]
    except Exception as e:
        print(e)
        return ""

In [23]:
answer = answer_question("List 5 different food scrap drop off sites in Brooklyn", df)
print(answer)


1. NW Corner of Malcolm X Boulevard & Bainbridge Street (Bedford-Stuyvesant East)
2. NE Corner of Throop Avenue & Madison Street (Bedford-Stuyvesant East)
3. SW Corner of Patchen Avenue & Macdonough Street (Bedford-Stuyvesant East)
4. NE Corner of Bedford Avenue & Hancock Street (Bedford-Stuyvesant West)
5. NE Corner of Bedford Avenue & Herkimer Street (Bedford-Stuyvesant West)


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [37]:
question = 'What is the opening hour of the food scrap drop-off at 4th Avenue Presbyterian Church'

In [43]:
df[df['food_scrap_drop_off_site']=='4th Avenue Presbyterian Church']

Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,text,embeddings
459,Brooklyn,Bay Ridge,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",4th Avenue Presbyterian Church,Year Round,Every day (Start Time: Dawn - End Time: Dusk),"{""borough"": ""Brooklyn"", ""ntaname"": ""Bay Ridge""...","[-0.007592776, -0.016335765, -0.017335355, -0...."


From the data, we know the answer is `Every day (Start Time: Dawn - End Time: Dusk)`

In [41]:
# Answer of LLM without additional context
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=question,
    max_tokens=500)
print(response["choices"][0]["text"]) 



Based on the church's website, the opening hour of the food scrap drop-off at 4th Avenue Presbyterian Church is 8:30am on Saturdays.


The response does not match the dataset (every day, dawn to dusk), so it is a hallucination.

In [39]:
# Answer of custom prompt
answer = answer_question(question, df)
print(answer)

 Every day at dawn.


With the custom prompt and additional context, the LLM gives the opening time recorded in the dataset (dawn), although it omits the closing time (dusk).

### Question 2

In [44]:
question = 'Which organization host the food scrap drop-off in Bay Ridge area?'

In [46]:
df[df['ntaname']=='Bay Ridge']

Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,text,embeddings
329,Brooklyn,Bay Ridge,Bay Ridge,3rd Ave & 95th Street,GrowNYC,Year Round,Saturdays (Start Time: 8:00 AM - End Time: 12...,"{""borough"": ""Brooklyn"", ""ntaname"": ""Bay Ridge""...","[-0.009802423, -0.016344251, 0.0027962702, -0...."
459,Brooklyn,Bay Ridge,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",4th Avenue Presbyterian Church,Year Round,Every day (Start Time: Dawn - End Time: Dusk),"{""borough"": ""Brooklyn"", ""ntaname"": ""Bay Ridge""...","[-0.007592776, -0.016335765, -0.017335355, -0...."


From the data, the answer is `GrowNYC` or `4th Avenue Presbyterian Church`.

In [47]:
# Answer of LLM without additional context
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=question,
    max_tokens=500)
print(response["choices"][0]["text"]) 



There are several organizations in the Bay Ridge area that host food scrap drop-off programs. Here are a few examples:

1. GrowNYC - This organization hosts monthly food scrap drop-offs at Third Avenue and Ovington Avenue every second Saturday of the month.

2. Owl's Head Park Horticulture Group - This group hosts weekly food scrap drop-offs at Owl's Head Park every Saturday from 10am-12pm.

3. Union Church of Bay Ridge - This church hosts monthly food scrap drop-offs in their parking lot at the corner of Ridge Boulevard and 80th Street.

4. Bay Ridge Food Scrap Composting Initiative - This group hosts monthly food scrap drop-offs at various locations in Bay Ridge, including the Greenmarket on 3rd Avenue and Shore Road Farmers Market at Shore Road and 97th Street.

It's important to note that some of these drop-offs may have specific guidelines or requirements, such as only accepting food scraps from residents of certain zip codes or only accepting certain types of food scraps. It's 

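The LLM answered from the data it was trained on, which likely explains why GrowNYC appears in the response. The details do not match the dataset, however: GrowNYC's Bay Ridge site is listed at 3rd Ave & 95th Street on Saturdays, the other three organizations do not appear in the Bay Ridge rows, and the output ends mid-sentence.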

In [45]:
# Answer of custom prompt
answer = answer_question(question, df)
print(answer)

 The 4th Avenue Presbyterian Church or GrowNYC.


The answer with the custom prompt is more accurate: it names exactly the two hosts listed for Bay Ridge in the dataset.
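
The comparisons above were judged by eye. As a rough sketch of how such checks could be automated, the hypothetical helper below tests whether an answer mentions every expected term. Substring matching is crude (it cannot tell a correct mention from an incidental one), so treat it as a first-pass filter rather than a real evaluation. The strings are the answers recorded in this notebook.

```python
def contains_all(answer, expected_terms):
    """Return True if every expected term appears in the answer (case-insensitive)."""
    normalized = answer.lower()
    return all(term.lower() in normalized for term in expected_terms)

# Answers recorded above
baseline_q1 = ("Based on the church's website, the opening hour of the food scrap "
               "drop-off at 4th Avenue Presbyterian Church is 8:30am on Saturdays.")
custom_q1 = " Every day at dawn."
custom_q2 = " The 4th Avenue Presbyterian Church or GrowNYC."

print(contains_all(baseline_q1, ["dawn"]))  # False
print(contains_all(custom_q1, ["dawn"]))    # True
print(contains_all(custom_q2, ["GrowNYC", "4th Avenue Presbyterian Church"]))  # True
```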