# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

**Dataset Selection: NYC Food Scrap Drop-off Sites**

For this custom chatbot project, I have selected the NYC Food Scrap Drop-off Sites dataset. It describes food scrap drop-off sites in New York City, including each site's borough, neighborhood, location, host organization, open months, operating days and hours, and website. After cleaning, it yields 296 rows of text, well above the 20-row minimum required for this task.

**Use Case:**

The custom chatbot will be developed to provide users with accurate and current information regarding food scrap drop-off sites in New York City. This will be particularly useful for individuals interested in composting and supporting a more sustainable urban environment. Leveraging this dataset, the chatbot will be able to answer questions about site locations, hours of operation, and other relevant details.

This customization will benefit NYC residents and businesses seeking to responsibly dispose of their food scraps, as well as tourists who wish to maintain eco-friendly practices during their visit. By offering precise and helpful information on food scrap drop-off sites, the chatbot can assist users in adopting sustainable habits and contribute to reducing overall waste in New York City.

Additionally, by providing this service, the chatbot promotes a culture of caring and environmental stewardship among its users. It encourages individuals to make conscious, eco-friendly decisions and fosters a community spirit centered on sustainability and responsibility. Through this initiative, the chatbot not only aids in practical waste disposal but also inspires a deeper commitment to caring for our planet.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
!pip list | grep "openai" || pip install openai # find if exists else install openai package

openai                           1.40.3


In [2]:
!pip list | grep "tiktoken" || pip install tiktoken # find if exists else install tiktoken package

tiktoken                         0.7.0


In [3]:
import os

import openai
import pandas as pd
from openai import OpenAI

# TODO: replace the API key before submission
os.environ["OPENAI_API_KEY"] = ""
openai.api_key = os.getenv("OPENAI_API_KEY")

# Initialize the OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


In [4]:
import pandas as pd

# Define the dataset path for Google Colab
dataset_path = "/content/data/nyc_food_scrap_drop_off_sites.csv"

# Load the dataset
df = pd.read_csv(dataset_path)

# Combine relevant columns into a single "text" column.
# Note: string concatenation propagates NaN, so a row with any missing field gets a NaN "text".
df['text'] = (
    "Borough: " + df['borough'] + ", "
    + "Site Name: " + df['ntaname'] + ", "
    + "Address: " + df['food_scrap_drop_off_site'] + ", "
    + "Location: " + df['location'] + ", "
    + "Hosted By: " + df['hosted_by'] + ", "
    + "Open Months: " + df['open_months'] + ", "
    + "Days and Hours: " + df['operation_day_hours'] + ", "
    + "Website: " + df['website']
)

# Filter out rows with empty 'text' column
df = df[df['text'].str.strip().str.len() > 0]

# Drop unnecessary columns and keep only the 'text' column
df = df[['text']].dropna()

# Reset the index
df.reset_index(drop=True, inplace=True)

# Display the dataframe
df


Unnamed: 0,text
0,"Borough: Staten Island, Site Name: Grasmere-Ar..."
1,"Borough: Queens, Site Name: Astoria (North)-Di..."
2,"Borough: Queens, Site Name: Astoria (Central),..."
3,"Borough: Bronx, Site Name: Mount Eden-Claremon..."
4,"Borough: Brooklyn, Site Name: Crown Heights (N..."
...,...
292,"Borough: Brooklyn, Site Name: Bushwick (West),..."
293,"Borough: Queens, Site Name: Astoria (North)-Di..."
294,"Borough: Brooklyn, Site Name: Kensington, Addr..."
295,"Borough: Brooklyn, Site Name: Windsor Terrace-..."
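
In pandas, concatenating string columns with `+` yields NaN whenever one operand is NaN, so the cell above leaves `text` empty for any site with a missing field, and those rows are then dropped. As a hedged alternative, the sketch below builds the text per record and skips missing fields instead. The helper names `is_missing` and `row_to_text`, the shortened `FIELDS` list, and the sample records are illustrative assumptions, not part of the original notebook.

```python
import math

# Illustrative subset of (label, column) pairs, mirroring the labels used above
FIELDS = [
    ("Borough", "borough"),
    ("Site Name", "ntaname"),
    ("Website", "website"),
]

def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()

def row_to_text(row):
    """Join the labeled fields of one record, skipping missing values."""
    parts = [
        f"{label}: {row[column]}"
        for label, column in FIELDS
        if not is_missing(row.get(column))
    ]
    return ", ".join(parts)

# Made-up sample records: one complete, one with a missing website
complete = {"borough": "Queens", "ntaname": "Astoria (Central)", "website": "grownyc.org/compost"}
partial = {"borough": "Brooklyn", "ntaname": "Kensington", "website": float("nan")}

print(row_to_text(complete))
print(row_to_text(partial))
```

Applied to a dataframe, something like `df.apply(lambda r: row_to_text(r.to_dict()), axis=1)` would likely work, at the cost of being slower than vectorized concatenation. Whether to keep partially described sites is a design choice: they add coverage but may produce incomplete answers.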


In [5]:
import time

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []

for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]["text"].tolist()

    # Request embeddings for this batch, retrying the same batch on rate limits
    while True:
        try:
            response = client.embeddings.create(
                input=batch,
                model=EMBEDDING_MODEL_NAME
            )
            break
        except openai.RateLimitError:
            print("Rate limit exceeded. Sleeping for 60 seconds.")
            time.sleep(60)

    # Add this batch's embeddings to the list, in input order
    embeddings.extend(data.embedding for data in response.data)

    # Sleep between batches to avoid RateLimitError
    time.sleep(10)  # Adjust the sleep time as needed

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df


Unnamed: 0,text,embeddings
0,"Borough: Staten Island, Site Name: Grasmere-Ar...","[0.014997861348092556, -0.0038276498671621084,..."
1,"Borough: Queens, Site Name: Astoria (North)-Di...","[0.004852685611695051, 0.005155562423169613, 2..."
2,"Borough: Queens, Site Name: Astoria (Central),...","[0.013145772740244865, -0.001231801463291049, ..."
3,"Borough: Bronx, Site Name: Mount Eden-Claremon...","[0.019648607820272446, 0.006201674696058035, -..."
4,"Borough: Brooklyn, Site Name: Crown Heights (N...","[0.02360513061285019, -0.021993588656187057, -..."
...,...,...
292,"Borough: Brooklyn, Site Name: Bushwick (West),...","[-0.003895373083651066, -0.031505435705184937,..."
293,"Borough: Queens, Site Name: Astoria (North)-Di...","[0.01247752271592617, 0.006023860070854425, 0...."
294,"Borough: Brooklyn, Site Name: Kensington, Addr...","[0.021796265617012978, -0.0018805923173204064,..."
295,"Borough: Brooklyn, Site Name: Windsor Terrace-...","[0.010278010740876198, -0.014185795560479164, ..."
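
The rate-limit handling in the cell above can be factored into a reusable helper. The following is a minimal sketch, assuming a hypothetical `with_retries` wrapper with exponential backoff; neither the helper nor `FakeRateLimitError` comes from the original notebook or the OpenAI SDK, and the stand-in call simulates failures so the example runs without network access.

```python
import time

def with_retries(call, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retry_on exception, back off exponentially and try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)

# Stand-in exception and call that fails twice before succeeding
class FakeRateLimitError(Exception):
    pass

attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("simulated rate limit")
    return "ok"

# Record the delays instead of actually sleeping
delays = []
result = with_retries(flaky_call, FakeRateLimitError, sleep=delays.append)
print(result, delays)
```

In the notebook, `call` could wrap the `client.embeddings.create(...)` request for one batch, with `retry_on=openai.RateLimitError`. Capping the number of attempts means a persistent failure raises instead of looping indefinitely.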


In [6]:
df.to_csv("/content/data/embeddings.csv")


In [7]:
!ls /content/data | grep "embeddings"

embeddings.csv


In [8]:
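import ast

import numpy as np

df = pd.read_csv("/content/data/embeddings.csv", index_col=0)
# The CSV stores each embedding as a string; parse it safely with
# ast.literal_eval (rather than eval) and convert it to a NumPy array
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df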

Unnamed: 0,text,embeddings
0,"Borough: Staten Island, Site Name: Grasmere-Ar...","[0.014997861348092556, -0.0038276498671621084,..."
1,"Borough: Queens, Site Name: Astoria (North)-Di...","[0.004852685611695051, 0.005155562423169613, 2..."
2,"Borough: Queens, Site Name: Astoria (Central),...","[0.013145772740244865, -0.001231801463291049, ..."
3,"Borough: Bronx, Site Name: Mount Eden-Claremon...","[0.019648607820272446, 0.006201674696058035, -..."
4,"Borough: Brooklyn, Site Name: Crown Heights (N...","[0.02360513061285019, -0.021993588656187057, -..."
...,...,...
292,"Borough: Brooklyn, Site Name: Bushwick (West),...","[-0.003895373083651066, -0.031505435705184937,..."
293,"Borough: Queens, Site Name: Astoria (North)-Di...","[0.01247752271592617, 0.006023860070854425, 0...."
294,"Borough: Brooklyn, Site Name: Kensington, Addr...","[0.021796265617012978, -0.0018805923173204064,..."
295,"Borough: Brooklyn, Site Name: Windsor Terrace-...","[0.010278010740876198, -0.014185795560479164, ..."
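
Because CSV is a text format, each embedding is written out as the string form of a list and has to be parsed back into numbers on load. The sketch below illustrates that round trip with the standard library only; the three-element vector and the in-memory buffer are made-up stand-ins for the real embeddings file.

```python
import ast
import csv
import io
import json

# Made-up, tiny stand-in for an embedding vector
vector = [0.25, -0.5, 1.0]

# Write one row to an in-memory CSV, storing the vector as a JSON string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["example row", json.dumps(vector)])

# Read it back: the embeddings cell is a plain string at this point
buffer.seek(0)
row = next(csv.DictReader(buffer))

# Either parser recovers the list of floats from the string
from_json = json.loads(row["embeddings"])
from_literal = ast.literal_eval(row["embeddings"])
print(from_json == vector, from_literal == vector)
```

Both `json.loads` and `ast.literal_eval` accept only data literals, unlike `eval`, which executes arbitrary code found in the file. For larger datasets, a binary format that preserves arrays natively may be a better fit than CSV.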


In [9]:
# openai.embeddings_utils (get_embedding, distances_from_embeddings) is not
# available in this version of the openai package, so this cell uses:
#   - get_embedding(): a custom function defined below
#   - scipy.spatial.distance.cosine in place of distances_from_embeddings

from scipy.spatial.distance import cosine

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding

def sort_rows_by_relevance(question, df):
    """
    Sorts a dataframe by relevance to a given question based on precomputed text embeddings.

    Args:
        question (str): The question text to compare against.
        df (pandas.DataFrame): The dataframe containing text rows and their associated embeddings.

    Returns:
        pandas.DataFrame: A dataframe sorted from most to least relevant to the question.
    """

    # Generate embeddings for the input question
    question_embedding = get_embedding(question)

    # Copy the dataframe and compute the cosine distance between the question and each row embedding
    df_sorted = df.copy()
    df_sorted["distances"] = df_sorted["embeddings"].apply(lambda x: cosine(question_embedding, x))
    # Sort the dataframe by distance in ascending order (shorter distance = more relevant)
    return df_sorted.sort_values("distances", ascending=True)
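
To make the ranking criterion concrete: cosine distance is one minus the cosine of the angle between two vectors, so it depends on direction and not on length. The sketch below reimplements it in plain Python on made-up two-dimensional vectors; `cosine_distance` is an illustrative helper, not the SciPy function used above, though it follows the same definition.

```python
import math

def cosine_distance(a, b):
    """Return 1 - (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up "question" and "row" vectors
question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Rank rows from most to least relevant (smallest distance first)
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

The distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite), which is why sorting in ascending order puts the most relevant rows first.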


In [10]:
# Define prompts for each question
prompt_1 = """
Question: "What is the Name for this site: Newkirk Ave & Nostrand Ave"
Answer:
"""

prompt_2 = """
Question: "What is the address for this site: Bushwick (east) ?"
Answer:
"""

# Generate responses using the OpenAI API
response_1 = client.completions.create(
    model = "gpt-3.5-turbo-instruct",
    prompt = prompt_1,
    max_tokens = 150
).choices[0].text.strip()
print(response_1)

response_2 = client.completions.create(
    model = "gpt-3.5-turbo-instruct",
    prompt = prompt_2,
    max_tokens = 150
).choices[0].text.strip()
print(response_2)

# Example usage of sorting rows by relevance
# sorted_df_1 = sort_rows_by_relevance("What is the total weight of food scraps collected in Manhattan in 2019?", df)
# sorted_df_2 = sort_rows_by_relevance("Which borough had the highest food scrap collection in 2020?", df)


The name for this site is the intersection of Newkirk Ave and Nostrand Ave.
The website address for Bushwick (east) is likely different depending on the specific resource or organization you are looking for. Please provide more context or specific information so that I can provide the correct website address.


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [11]:
import tiktoken

def generate_prompt(question, df, max_tokens):
    """
    Generates a text prompt for a Completion model based on a given question
    and a dataframe containing rows of text and their embeddings.

    Args:
        question (str): The question to be answered.
        df (pandas.DataFrame): The dataframe with text and embeddings.
        max_tokens (int): The maximum number of tokens allowed for the prompt.

    Returns:
        str: A formatted prompt ready for submission to the Completion model.
    """
    # Initialize the tokenizer (cl100k_base is the encoding used by the embedding and completion models here)
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Define the prompt template
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""

    # Calculate the initial token count from the template and the question
    initial_token_count = len(tokenizer.encode(prompt_template)) + \
                          len(tokenizer.encode(question))

    context_texts = []
    current_token_count = initial_token_count

    # Loop through sorted text rows to build the context until max_tokens is reached
    for text in sort_rows_by_relevance(question, df)["text"].values:
        text_token_count = len(tokenizer.encode(text))
        if current_token_count + text_token_count <= max_tokens:
            context_texts.append(text)
            current_token_count += text_token_count
        else:
            break

    # Return the final formatted prompt
    return prompt_template.format("\n\n###\n\n".join(context_texts), question)
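
The context-building loop above is a greedy packing step: take rows in relevance order and stop at the first one that would exceed the token budget. The sketch below isolates that step. `pack_context` and `word_count` are hypothetical names, and `word_count` is a whitespace-based stand-in that does not produce real model token counts.

```python
def pack_context(texts, budget, count_tokens):
    """Greedily keep texts, in order, until the next one would exceed budget."""
    chosen, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first text that does not fit
        chosen.append(text)
        used += cost
    return chosen, used

def word_count(text):
    # Stand-in tokenizer: whitespace-separated words, NOT real model tokens
    return len(text.split())

# Made-up texts, assumed to be sorted from most to least relevant
ranked_texts = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
chosen, used = pack_context(ranked_texts, budget=6, count_tokens=word_count)
print(chosen, used)
```

With `tiktoken`, `count_tokens` could be `lambda t: len(tokenizer.encode(t))`. This sketch does not count the separators joined between texts, so a real prompt may run slightly over the budget unless some headroom is reserved.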


In [12]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def execute_query(question, df, max_prompt_tokens=3000, max_answer_tokens=750):
    """
    Generates an answer to a given question using an OpenAI Completion model.

    Args:
        question (str): The question to be answered.
        df (pandas.DataFrame): The dataframe containing relevant text data.
        max_prompt_tokens (int): The maximum number of tokens for the prompt.
        max_answer_tokens (int): The maximum number of tokens for the model's response.

    Returns:
        str: The model's response text. If an error occurs, returns an empty string.
    """
    # Generate the prompt using the provided question and dataframe
    prompt = generate_prompt(question, df, max_prompt_tokens)

    try:
        # Request completion from the OpenAI model
        response = client.completions.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        # Log the error and return an empty string
        print(e)
        return ""


In [13]:
# Generate and print the prompt for a given question
question = "In which borough can you find the most food scrap drop-off sites that are open on Saturdays?"
max_tokens = 1000

# Generate the prompt using the specified question and token limit
generated_prompt = generate_prompt(question, df, max_tokens)
print(generated_prompt)



Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Borough: Brooklyn, Site Name: Kensington, Address: Kensington Food Scrap Drop-off, Location: McDonald Ave & Albemarle Rd, Hosted By: GrowNYC, Open Months: Year Round, Days and Hours: Saturdays (Start Time: 8:30 AM - End Time:  11:30 AM), Website: grownyc.org/compost

###

Borough: Brooklyn, Site Name: Flatbush, Address: Flatbush Junction Food Scrap Drop-off, Location: Hillel Pl & Flatbush Ave, Hosted By: GrowNYC, Open Months: Year Round, Days and Hours: Fridays (Start Time: 8:30 AM - End Time:  2:30 PM), Website: grownyc.org/compost

###

Borough: Manhattan, Site Name: Upper East Side-Carnegie Hill, Address: East 96th Street Food Scrap Drop-off, Location: 96th St & Lexington Ave, Hosted By: GrowNYC, Open Months: Year Round, Days and Hours: Fridays (Start Time: 7:30 AM - End Time:  11:30 AM), Website: grownyc.org/compost

###

Borough: Brooklyn, Sit

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [14]:
# Print the result of the standard query for Question 1
print(f"Result for Question 1: {response_1}")


Result for Question 1: The name for this site is the intersection of Newkirk Ave and Nostrand Ave.


In [15]:
# Define Question 1
question_1 = "What is the name of the site at Newkirk Ave & Nostrand Ave?"

# Execute custom query for Question 1
result_q1 = execute_query(question_1, df)

# Print the result of the custom query for Question 1
print(f"Result for Question 1: {result_q1}")


Result for Question 1: I don't know


### Question 2

In [16]:
# Print the result of the standard query for Question 2
print(f"Result for Question 2: {response_2}")

Result for Question 2: The website address for Bushwick (east) is likely different depending on the specific resource or organization you are looking for. Please provide more context or specific information so that I can provide the correct website address.


In [17]:
# Define Question 2
question_2 = "What is the address of the site in Jackson Heights ?"

# Execute custom query for Question 2
result_q2 = execute_query(question_2, df)

# Print the result of the custom query for Question 2
print(f"Result for Question 2: {result_q2}")


Result for Question 2: JH Scraps, 35-20 69th Street, Jackson Heights, NY 11372
