In [1]:
!python --version

Python 3.12.7


# Build Your Own Custom Chatbot

Write an explanation of which dataset you have chosen and why it is appropriate for this task.

<font color="red"> 

I chose the **"Food Scrap Drop-Off Locations in NYC"** dataset because it is a good fit for a chatbot that helps people find food scrap drop-off points around the city. It contains the details needed to answer questions about where and when people can drop off their food scraps: addresses, boroughs, neighborhoods, and hours of operation. It also includes extra information, such as contact details and site-specific instructions, which makes the chatbot's responses more complete. Because the data is structured, the chatbot can ground its answers in specific, location-based records.

</font>

**PS:** The latest version of this dataset was downloaded from [**this link**](https://dev.socrata.com/foundry/data.cityofnewyork.us/if26-z6xq) and is used in this project under the same filename as the old version.

In [2]:
import openai

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [3]:
import pandas as pd

import tiktoken
from openai.embeddings_utils import get_embedding, distances_from_embeddings

pd.set_option("display.max_columns", 8)

In [4]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

## Helpful Functions

In [5]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy


def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"
    
    Context: 
    
    {}
    
    ---
    
    Question: {}
    Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)



def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
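
To make the ranking step concrete, here is a minimal, dependency-free sketch of what sorting by cosine distance means. It is not the implementation inside `openai.embeddings_utils`; the helper names `cosine_distance` and `rank_by_distance` and the toy 2-D vectors are assumptions made purely for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Sort (text, vector) pairs so the closest vector comes first."""
    return sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))


# Toy 2-D "embeddings" (hypothetical values, not real model output)
rows = [
    ("site A", [0.0, 1.0]),
    ("site B", [1.0, 0.0]),
    ("site C", [1.0, 1.0]),
]
ranked = rank_by_distance([1.0, 0.1], rows)
print([text for text, _ in ranked])
```

The smallest distance corresponds to the most relevant row, which is why the helper above sorts in ascending order.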

## Data Wrangling

In [6]:
df_all = pd.read_csv("data/nyc-food-scrap-drop-off-sites.csv")
df_all.head()

Unnamed: 0,Borough,NTAName,SiteName,SiteAddr,...,DSNY District,DSNY Section,DSNY Zone,Senate District
0,Brooklyn,Bay Ridge,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",...,BKS10,BKS101,BKS,17
1,Manhattan,East Midtown-Turtle Bay,Dag Hammarskjold Plaza Greenmarket,E 47th St & 2nd Ave,...,MN06,MN063,MN,28
2,Manhattan,Hell's Kitchen,Hudson River Park's Pier 84 at W. 44th St.,Pier 84 at W. 44th St. near dog park,...,MN04,MN043,MN,47
3,Manhattan,East Midtown-Turtle Bay,58th Street Library FSDO,127 East 58th Street,...,MN05,MN052,MN,28
4,Manhattan,Tribeca-Civic Center,Tribeca Greenmarket,Greenwich St. & Duane St,...,MN01,MN013,MN,27


In [7]:
# The size of the dataset
df_all.shape

(591, 27)

The dataset has 27 columns, so building a `text` column for embeddings means deciding how to combine the relevant information. After considering different options, I decided to use **Formatted Text**: each row becomes a sequence of `column: value` pairs. Keeping the column names next to the values adds context, which makes it easier for the chatbot to retrieve the relevant record and generate more accurate responses.

In [8]:
# Create formatted text ("column: value" pairs joined by ". ") using all columns
df_all["text"] = df_all.apply(
    lambda row: ". ".join(f"{col}: {row[col]}" for col in df_all.columns),
    axis=1,
)
df_all.head()

Unnamed: 0,Borough,NTAName,SiteName,SiteAddr,...,DSNY Section,DSNY Zone,Senate District,text
0,Brooklyn,Bay Ridge,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",...,BKS101,BKS,17,Borough: Brooklyn. NTAName: Bay Ridge. SiteNam...
1,Manhattan,East Midtown-Turtle Bay,Dag Hammarskjold Plaza Greenmarket,E 47th St & 2nd Ave,...,MN063,MN,28,Borough: Manhattan. NTAName: East Midtown-Turt...
2,Manhattan,Hell's Kitchen,Hudson River Park's Pier 84 at W. 44th St.,Pier 84 at W. 44th St. near dog park,...,MN043,MN,47,Borough: Manhattan. NTAName: Hell's Kitchen. S...
3,Manhattan,East Midtown-Turtle Bay,58th Street Library FSDO,127 East 58th Street,...,MN052,MN,28,Borough: Manhattan. NTAName: East Midtown-Turt...
4,Manhattan,Tribeca-Civic Center,Tribeca Greenmarket,Greenwich St. & Duane St,...,MN013,MN,27,Borough: Manhattan. NTAName: Tribeca-Civic Cen...
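
Because the `text` column is truncated in the output above, the following sketch shows what a fully formatted row could look like. It uses a plain dictionary with a few values from the Corona Greenmarket record rather than the real dataframe. The `format_row` helper and its `skip_missing` option are hypothetical and not part of the notebook, which keeps missing values in the text as `nan`.

```python
import math


def format_row(row, skip_missing=False):
    """Join "column: value" pairs into one sentence-like string."""
    parts = []
    for col, value in row.items():
        is_missing = isinstance(value, float) and math.isnan(value)
        if skip_missing and is_missing:
            continue  # optional variation: leave out empty fields
        parts.append(f"{col}: {value}")
    return ". ".join(parts)


# A subset of one record; BBL is missing for this site
row = {
    "Borough": "Queens",
    "SiteName": "Corona Greenmarket",
    "Notes": "No meat, bones, or dairy",
    "BBL": float("nan"),
}

print(format_row(row))
print(format_row(row, skip_missing=True))
```

Dropping missing fields is only one possible refinement; it shortens the text sent for embedding but is not what the cell above does.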


In [9]:
# Extract the "text" column for embeddings
df = df_all[["text"]].copy()
df

Unnamed: 0,text
0,Borough: Brooklyn. NTAName: Bay Ridge. SiteNam...
1,Borough: Manhattan. NTAName: East Midtown-Turt...
2,Borough: Manhattan. NTAName: Hell's Kitchen. S...
3,Borough: Manhattan. NTAName: East Midtown-Turt...
4,Borough: Manhattan. NTAName: Tribeca-Civic Cen...
...,...
586,Borough: Brooklyn. NTAName: Bedford-Stuyvesant...
587,Borough: Brooklyn. NTAName: Bushwick (West). S...
588,Borough: Manhattan. NTAName: Washington Height...
589,Borough: Manhattan. NTAName: Murray Hill-Kips ...


## Generating Embeddings

I use the `Embedding` tooling from OpenAI to create vectors representing each row of the custom dataset.

In [10]:
batch_size = 100
embeddings = []

for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,Borough: Brooklyn. NTAName: Bay Ridge. SiteNam...,"[0.0006570421974174678, -0.016451654955744743,..."
1,Borough: Manhattan. NTAName: East Midtown-Turt...,"[0.0023886493872851133, -0.013192134909331799,..."
2,Borough: Manhattan. NTAName: Hell's Kitchen. S...,"[0.02169746719300747, -0.019935237243771553, 0..."
3,Borough: Manhattan. NTAName: East Midtown-Turt...,"[0.0057630883529782295, -0.0025220056995749474..."
4,Borough: Manhattan. NTAName: Tribeca-Civic Cen...,"[0.006703699007630348, -0.011230423115193844, ..."
...,...,...
586,Borough: Brooklyn. NTAName: Bedford-Stuyvesant...,"[0.015958093106746674, 0.0048733665607869625, ..."
587,Borough: Brooklyn. NTAName: Bushwick (West). S...,"[0.011209751479327679, 0.0020601514261215925, ..."
588,Borough: Manhattan. NTAName: Washington Height...,"[0.013083045370876789, 0.010069809854030609, -..."
589,Borough: Manhattan. NTAName: Murray Hill-Kips ...,"[0.01667860336601734, 0.0027420930564403534, 0..."
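
The batching pattern used above can be illustrated without calling the API. In this sketch, `fake_embed` is a hypothetical stand-in for the embedding request (it is not an OpenAI function) and `iter_batches` is an assumed helper name. The point is only that slicing in steps of `batch_size` visits every row once and preserves order, so the collected vectors line up with the dataframe rows.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Hypothetical stand-in for an embedding call: one vector per text."""
    return [[float(len(text))] for text in texts]


texts = [f"row {i}" for i in range(250)]
vectors = []
batch_sizes = []
for batch in iter_batches(texts, batch_size=100):
    batch_sizes.append(len(batch))
    vectors.extend(fake_embed(batch))

print(batch_sizes)
print(len(vectors) == len(texts))
```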


In [11]:
# Save the generated embeddings as a CSV file
df.to_csv("embeddings.csv")
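
One thing to keep in mind when reusing the saved file: a CSV stores each embedding as the string form of a list, so it has to be parsed back into numbers after loading. The sketch below shows one way to do that with the standard library; `parse_embedding` is a hypothetical helper and the sample rows are made-up values in the shape such a file could have.

```python
import ast
import csv
import io


def parse_embedding(cell):
    """Convert the string form of a list, e.g. "[0.1, 0.2]", back to floats."""
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


# Two rows in the shape the saved CSV could have (values are made up)
sample_csv = io.StringIO(
    ',text,embeddings\n'
    '0,Borough: Brooklyn,"[0.1, -0.2, 0.3]"\n'
    '1,Borough: Queens,"[0.4, 0.5, -0.6]"\n'
)
rows = list(csv.DictReader(sample_csv))
embeddings = [parse_embedding(row["embeddings"]) for row in rows]
print(embeddings[0])
```

With pandas, the same helper could be applied after `pd.read_csv("embeddings.csv", index_col=0)` via `df["embeddings"].apply(parse_embedding)`.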

## Custom Query Completion

Compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model.

<font color="red"> 

The functions were taken from the course materials and are provided in the "Helpful Functions" section of this notebook.

</font>
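
As a rough illustration of how the prompt context is assembled, the sketch below packs ranked texts into a fixed token budget. It is a simplified stand-in, not the notebook's `create_prompt`: `count_tokens` is a hypothetical whitespace-based counter used in place of `tiktoken`, and `pack_context`, the sample texts, and the budget values are assumptions.

```python
def count_tokens(text):
    """Hypothetical token counter: whitespace-separated words."""
    return len(text.split())


def pack_context(ranked_texts, question, template_tokens, max_tokens):
    """Greedily add texts in ranked order until the budget would be exceeded."""
    used = template_tokens + count_tokens(question)
    chosen = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_tokens:
            break  # stop at the first text that does not fit
        chosen.append(text)
    return chosen


# Texts are assumed to be sorted from most to least relevant
ranked_texts = [
    "Corona Greenmarket. No meat, bones, or dairy",
    "Tribeca Greenmarket. Greenwich St. & Duane St",
    "58th Street Library FSDO",
]
chosen = pack_context(
    ranked_texts,
    question="Where can I drop off scraps?",
    template_tokens=5,
    max_tokens=23,
)
print(chosen)
```

Note the design choice this mirrors: the loop stops at the first text that overflows the budget, even if a shorter, less relevant text further down the ranking would still fit.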

## Custom Performance Demonstration

Demonstrate the performance of your custom query using at least 2 questions.

### Question 1

In [12]:
q1_basic = "What is the website for Crown Heights at 1107 Bergen street and when is it open?"
q1_basic_answer = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q1_basic,
    max_tokens=150
)["choices"][0]["text"].strip()

print(q1_basic_answer)

I cannot provide the exact website for Crown Heights at 1107 Bergen Street without more information about what type of establishment it is. If it is a specific business or organization, you can try searching for it on a search engine such as Google. If it is a residential building or address, you can try searching for it on a real estate website.

As for opening hours, I cannot provide that information as it will depend on the specific business or organization at that address. You can try contacting them directly or checking their website for their hours of operation.


In [13]:
q1_custom = "What is the website for Crown Heights at 1107 Bergen street and when is it open?"
q1_custom_answer = answer_question(q1_custom, df)

print(q1_custom_answer)

https://bqlt.org/garden/1100-block-bergen-st-community-garden. Year Round from Sunday 10:00 AM to 3:00 PM.


In [14]:
df_all.iloc[7]

Borough                                                        Brooklyn
NTAName                                           Crown Heights (North)
SiteName                            1100 Bergen Street Community Garden
SiteAddr                         1107 Bergen Street, Brooklyn, NY 11216
Hosted_By             Volunteers at 1100 Bergen Street Community Garden
Open_Month                                                   Year Round
Day_Hours             Sunday (Start Time: 10:00 AM - End Time:  3:00...
Notes                                                               NaN
Website               https://bqlt.org/garden/1100-block-bergen-st-c...
BoroCD                                                              308
CouncilDis                                                           36
ct2010                                                          3031500
BBL                                                                 NaN
BIN                                                             

### Question 2

In [15]:
q2_basic = "Are there any precautions for food scrap drop-off for someone living in Corona Greenmarket, Queens?"
q2_basic_answer = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=q2_basic,
    max_tokens=150
)["choices"][0]["text"].strip()

print(q2_basic_answer)

When dropping off food scraps at a Greenmarket in Corona, Queens, there are a few precautions to keep in mind:

1. Bring a container for your food scraps: It is important to bring a container or bag to hold your food scraps, as most greenmarkets do not provide them. Make sure the container is leak-proof and sturdy enough to hold your scraps.

2. Keep your scraps separated and clean: To avoid contamination, keep different types of food scraps separated. You can also line your container with newspaper or a compostable bag to make cleaning easier.

3. Be mindful of what you are composting: Only compost food scraps and other organic materials that are accepted by the greenmarket. These can include fruit and vegetable scraps, coffee grounds,


In [16]:
q2_custom = "Are there any precautions for food scrap drop-off for someone living in Corona Greenmarket, Queens?"
q2_custom_answer = answer_question(q2_custom, df)

print(q2_custom_answer)

No meat, bones, or dairy.


In [17]:
df_all.iloc[191]

Borough                                                          Queens
NTAName                                                          Corona
SiteName                                             Corona Greenmarket
SiteAddr                            Roosevelt Ave at 103 Street, Queens
Hosted_By                                                       GrowNYC
Open_Month                                                   Year Round
Day_Hours             Friday (Start Time: 8:00 AM - End Time:  1:30 PM)
Notes                                          No meat, bones, or dairy
Website                                             grownyc.org/compost
BoroCD                                                              404
CouncilDis                                                           21
ct2010                                                          4040501
BBL                                                                 NaN
BIN                                                             