# NYC Food Scrap Dropoff Chatbot


<img src="doc/food_scrap_dropoff.jpeg" width="640" height="480">

This notebook uses data from the [NYC Food Scrap Dropoff](https://dev.socrata.com/foundry/data.cityofnewyork.us/if26-z6xq) dataset.

**nyc_food_scrap_drop_off_sites.csv** contains locations, hours, and other information about food scrap drop-off sites in New York City. The data was retrieved in early 2023; the dataset can be refreshed or expanded from the link above.

I chose this dataset because in our country, Panama, there are waste management systems and companies that could benefit from this kind of information.

#### Importing Libraries

In [1]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import pandas as pd
import tiktoken
import openai
import os

#### Setting Variables and Constants

In [2]:
openai.api_key = "" # YOUR API KEY
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "text-davinci-003"
BATCH_SIZE = 128

## Step 1 - Data Wrangling

Our first step is to load the dataset and prepare the text column.  **By the end of this step, the data is in a pandas `DataFrame` called `df` where each row represents a text sample, and there is only one column, `"text"`, which contains the raw text data.**

In [3]:
df = pd.read_csv(os.path.join('data', 'nyc_food_scrap_drop_off_sites.csv'), index_col=0)
df.drop([
    'object_id', 
    'location_point',
    ':@computed_region_yeji_bk3q',
    ':@computed_region_92fq_4b7q',
    ':@computed_region_sbqj_enih',
    ':@computed_region_efsh_h5xi',
    ':@computed_region_f5dn_yrer',
    ], 
    axis=1,
    inplace=True)

df.head()

Unnamed: 0,borough,ntaname,food_scrap_drop_off_site,location,hosted_by,open_months,operation_day_hours,website,borocd,councildist,latitude,longitude,precinct,notes,ct2010,bbl,bin
0,Staten Island,Grasmere-Arrochar-South Beach-Dongan Hills,South Beach,"21 Robin Road, Staten Island NY",Snug Harbor Youth,Year Round,Friday (Start Time: 1:30 PM - End Time: 4:30 PM),snug-harbor.org,502,50,40.595579,-74.062991,122,,,,
1,Manhattan,Inwood,SE Corner of Broadway & Academy Street,,Department of Sanitation,Year Round,24/7,www.nyc.gov/smartcomposting,112,10,,,34,Download the app to access bins. Accepts all f...,,,
2,Brooklyn,Park Slope,Old Stone House Brooklyn,"336 3rd St, Brooklyn, NY 11215",Old Stone House Brooklyn,Year Round,24/7 (Start Time: 24/7 - End Time: 24/7),,306,39,40.672712,-73.984731,78,,,,
3,Manhattan,East Harlem (North),SE Corner of Pleasant Avenue & E 116 Street,,Department of Sanitation,Year Round,24/7,www.nyc.gov/smartcomposting,111,8,,,25,Download the app to access bins. Accepts all f...,,,
4,Queens,Corona,Malcolm X FSDO,"111-26 Northern Blvd, Flushing, NY 11368",NYC Compost Project Hosted by Big Reuse,Year Round,Tuesdays (Start Time: 12:00 PM - End Time: 2:...,,404,21,40.749685,-73.863072,110,,,,


Looking at the columns, one thing I can do is impute a *'<column_name> missing'* placeholder wherever a value is absent and convert all numeric values to strings.

In [4]:
def transform(row):
    if row.location != row.location:
        row.location = 'location missing'
    if row.hosted_by != row.hosted_by:
        row.hosted_by = 'host missing'
    if row.website != row.website:
        row.website = 'website missing'
    if row.latitude != row.latitude:
        row.latitude = 'latitude missing'
    if row.longitude != row.longitude:
        row.longitude = 'longitude missing'
    if row.notes != row.notes:
        row.notes = 'notes missing'
    if row.ct2010 != row.ct2010:
        row.ct2010 = '2010 census missing'
    if row.bbl != row.bbl:
        row.bbl = 'borough block lot missing'
    if row.bin != row.bin:
        row.bin = 'building identification number missing'
    return row
    
df = df.apply(transform, axis=1).astype(str)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 576 entries, 0 to 575
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   borough                   576 non-null    object
 1   ntaname                   576 non-null    object
 2   food_scrap_drop_off_site  576 non-null    object
 3   location                  576 non-null    object
 4   hosted_by                 576 non-null    object
 5   open_months               576 non-null    object
 6   operation_day_hours       576 non-null    object
 7   website                   576 non-null    object
 8   borocd                    576 non-null    object
 9   councildist               576 non-null    object
 10  latitude                  576 non-null    object
 11  longitude                 576 non-null    object
 12  precinct                  576 non-null    object
 13  notes                     576 non-null    object
 14  ct2010                    576 n
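
The `value != value` comparisons in `transform` rely on the fact that NaN, which pandas uses for missing cells, is the only float that is not equal to itself. A minimal, pandas-free sketch of the same check follows; `fill_missing` is a hypothetical helper name used only for illustration, not part of the notebook.

```python
import math


def fill_missing(value, placeholder):
    """Return `placeholder` for a NaN cell, otherwise the value as a string.

    `fill_missing` is a hypothetical helper used only for illustration.
    """
    if isinstance(value, float) and math.isnan(value):
        return placeholder
    return str(value)


nan = float("nan")
print(nan != nan)                                   # NaN never equals itself
print(fill_missing(nan, "notes missing"))           # placeholder is returned
print(fill_missing(40.595579, "latitude missing"))  # real value becomes a string
```

Checking with `math.isnan` says the same thing as `value != value`, but it states the intent more directly.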

In [5]:
df.rename(
    columns={
        'ntaname':'neighborhood tabulation area name',
        'food_scrap_drop_off_site': 'food scrap dropoff site',
        'hosted_by': 'hosted by',
        'open_months': 'open months',
        'operation_day_hours': 'operation day hours',
        'borocd': 'borough and community district',
        'councildist': 'nyc council district number',
        'precinct': 'police precinct',
        'ct2010': '2010 cencus tract',
        'bbl': 'borough block lot',
        'bin': 'building identification number'
    }, 
    inplace = True
)

In [6]:
df['text'] = ''
for n, row in df.iterrows():
    df.loc[n, 'text'] = '\n\n'.join([col + ': ' + row[col] for col in df.columns[:-1]])

In [7]:
print(df.text[0])

borough: Staten Island

neighborhood tabulation area name: Grasmere-Arrochar-South Beach-Dongan Hills

food scrap dropoff site: South Beach

location: 21 Robin Road, Staten Island NY

hosted by: Snug Harbor Youth

open months: Year Round

operation day hours: Friday (Start Time: 1:30 PM - End Time:  4:30 PM)

website: snug-harbor.org

borough and community district: 502

nyc council district number: 50

latitude: 40.595579

longitude: -74.062991

police precinct: 122

notes: notes missing

2010 cencus tract: 2010 census missing

borough block lot: borough block lot missing

building identification number: building identification number missing
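
The loop in cell 6 turns every column into a `name: value` pair and separates the pairs with blank lines. The same formatting can be sketched on a plain dictionary; `row_to_text` and the sample row below are illustrative assumptions, not code from the notebook.

```python
def row_to_text(row):
    """Format a mapping of column name -> string value as 'name: value' blocks."""
    return "\n\n".join(f"{column}: {value}" for column, value in row.items())


# Made-up subset of one row, already converted to strings
sample_row = {
    "borough": "Staten Island",
    "food scrap dropoff site": "South Beach",
    "open months": "Year Round",
}
print(row_to_text(sample_row))
```

Because a pandas `Series` also exposes `items()`, a helper like this should work row-wise through `df.apply(row_to_text, axis=1)` as an alternative to the explicit `iterrows` loop.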


In [8]:
df = df[['text']].copy()  # explicit copy so later column assignments act on an independent DataFrame
df

Unnamed: 0,text
0,borough: Staten Island\n\nneighborhood tabulat...
1,borough: Manhattan\n\nneighborhood tabulation ...
2,borough: Brooklyn\n\nneighborhood tabulation a...
3,borough: Manhattan\n\nneighborhood tabulation ...
4,borough: Queens\n\nneighborhood tabulation are...
...,...
571,borough: Brooklyn\n\nneighborhood tabulation a...
572,borough: Queens\n\nneighborhood tabulation are...
573,borough: Brooklyn\n\nneighborhood tabulation a...
574,borough: Brooklyn\n\nneighborhood tabulation a...


In [9]:
df.to_csv('text.csv', index=False)

## Step 2 - Generating Embeddings

We'll use the Embedding tooling from OpenAI ([documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)) to create vectors representing each row of our custom dataset.

To avoid a `RateLimitError`, we send our data to the `Embedding.create` function in batches.

In [10]:
embeddings = []
for i in range(0, len(df), BATCH_SIZE):
    response = openai.Embedding.create(
        input=df.iloc[i:i+BATCH_SIZE]["text"].tolist(), # set text of embeddings
        engine=EMBEDDING_MODEL_NAME # set model
    )

    embeddings.extend([data["embedding"] for data in response["data"]]) # extend the list


df["embeddings"] = embeddings # set the embeddings to the dataframe
df

Unnamed: 0,text,embeddings
0,borough: Staten Island\n\nneighborhood tabulat...,"[0.012136772274971008, -0.023499606177210808, ..."
1,borough: Manhattan\n\nneighborhood tabulation ...,"[0.0019428444793447852, -0.003585195867344737,..."
2,borough: Brooklyn\n\nneighborhood tabulation a...,"[0.011733850464224815, -0.024186652153730392, ..."
3,borough: Manhattan\n\nneighborhood tabulation ...,"[0.0023042100947350264, -0.004690106026828289,..."
4,borough: Queens\n\nneighborhood tabulation are...,"[0.0020352608989924192, -0.01488337479531765, ..."
...,...,...
571,borough: Brooklyn\n\nneighborhood tabulation a...,"[0.012538803741335869, -0.013083376921713352, ..."
572,borough: Queens\n\nneighborhood tabulation are...,"[0.011331940069794655, -0.005489879287779331, ..."
573,borough: Brooklyn\n\nneighborhood tabulation a...,"[0.009677416644990444, -0.013440472073853016, ..."
574,borough: Brooklyn\n\nneighborhood tabulation a...,"[0.006138375028967857, -0.023190181702375412, ..."
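
The batching logic in cell 10 can be checked without calling the API. The sketch below slices a stand-in list the same way; `batched` is a hypothetical helper and the `texts` list only imitates `df["text"].tolist()`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"row {i}" for i in range(576)]  # stand-in for df["text"].tolist()
sizes = [len(batch) for batch in batched(texts, 128)]
print(sizes)  # [128, 128, 128, 128, 64]
```

With 576 rows and a batch size of 128, the last batch is smaller than the others; slicing past the end of a list is safe in Python, so no special case is needed.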


In [11]:
df.to_csv('embeddings.csv', index=False)
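
One thing to keep in mind when reusing `embeddings.csv`: a CSV cell can only hold text, so each embedding list is written as its string representation and has to be parsed again after loading. The sketch below shows the round trip with the standard library only; with pandas, the same parsing would be applied to the `embeddings` column after `pd.read_csv`. The in-memory buffer and the two short vectors are made up for illustration.

```python
import ast
import csv
import io

# Write two made-up rows to an in-memory CSV, storing each list as a string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["borough: Queens", str([0.25, -0.5, 0.125])])
writer.writerow(["borough: Bronx", str([1.0, 0.0, -1.0])])

# Read the rows back: the embeddings column now contains strings
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(type(rows[0]["embeddings"]))

# Parse each string back into a list of floats
for row in rows:
    row["embeddings"] = ast.literal_eval(row["embeddings"])
print(rows[0]["embeddings"])
```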

## Step 3: Create a Function that Finds Related Pieces of Text for a Given Question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from most relevant to least relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.

In [12]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
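
To make the ranking concrete, here is a small pure-Python sketch of cosine distance, taken as 1 minus the cosine similarity, on made-up two-dimensional vectors. It is an illustration of the metric, not the implementation inside `distances_from_embeddings`; `cosine_distance` and the sample vectors are assumptions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance puts the most relevant row first
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

The vector length does not matter, only the direction, which is why `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.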

In [13]:
get_rows_sorted_by_relevance('Which dropoff will occur on Staten Island?', df)

Unnamed: 0,text,embeddings,distances
325,borough: Staten Island\n\nneighborhood tabulat...,"[0.014562509953975677, -0.022218944504857063, ...",0.164173
261,borough: Staten Island\n\nneighborhood tabulat...,"[0.0008209944935515523, -0.01647314243018627, ...",0.171165
175,borough: Staten Island\n\nneighborhood tabulat...,"[0.012782229110598564, -0.034262508153915405, ...",0.172738
29,borough: Staten Island\n\nneighborhood tabulat...,"[0.008558251895010471, -0.020226025953888893, ...",0.174355
0,borough: Staten Island\n\nneighborhood tabulat...,"[0.012136772274971008, -0.023499606177210808, ...",0.174878
...,...,...,...
331,borough: Bronx\n\nneighborhood tabulation area...,"[0.007381598465144634, -0.023643212392926216, ...",0.243543
310,borough: Bronx\n\nneighborhood tabulation area...,"[0.0027965803164988756, -0.02742883376777172, ...",0.245591
280,borough: Queens\n\nneighborhood tabulation are...,"[0.002954320516437292, -0.00679408572614193, -...",0.245820
555,borough: Bronx\n\nneighborhood tabulation area...,"[-0.013415813446044922, -0.025753941386938095,...",0.245970


In [14]:
get_rows_sorted_by_relevance('Which ones will be 24/7 and have a website with open months a year round?', df)

Unnamed: 0,text,embeddings,distances
261,borough: Staten Island\n\nneighborhood tabulat...,"[0.0008209944935515523, -0.01647314243018627, ...",0.271415
138,borough: Manhattan\n\nneighborhood tabulation ...,"[0.012686249800026417, -0.02198585867881775, -...",0.275597
315,borough: Manhattan\n\nneighborhood tabulation ...,"[-0.006918120663613081, -0.021630633622407913,...",0.277526
569,borough: Manhattan\n\nneighborhood tabulation ...,"[0.0021176361478865147, -0.021013598889112473,...",0.277684
54,borough: Manhattan\n\nneighborhood tabulation ...,"[0.007022544275969267, -0.02087709866464138, 0...",0.278722
...,...,...,...
549,borough: Bronx\n\nneighborhood tabulation area...,"[0.005494710989296436, -0.014104428701102734, ...",0.309473
534,borough: Bronx\n\nneighborhood tabulation area...,"[-0.003292226232588291, -0.009777220897376537,...",0.309866
4,borough: Queens\n\nneighborhood tabulation are...,"[0.0020352608989924192, -0.01488337479531765, ...",0.311333
522,borough: Bronx\n\nneighborhood tabulation area...,"[0.013702640309929848, -0.013000293634831905, ...",0.312321


## Custom Query Completion


## Step 4: Create a Function that Composes a Text Prompt

Building on that sorted list of rows, here we compose a custom prompt from this dataset and send it to an OpenAI `Completion` model in order to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the `Completion` model, which is roughly 4,000 for `text-davinci-003` (prompt and answer combined). So we'll loop over the rows in order of relevance, counting the tokens as we go, and stop as soon as the next row would exceed the limit. Then we'll join that list of text data into a single string and add it to the prompt.

In [15]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
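
The budget logic in `create_prompt` can be tried in isolation. The sketch below uses a whitespace split as a stand-in token counter (the notebook uses `tiktoken`), so the counts are not real token counts; `count_tokens` and `pack_context` are hypothetical names.

```python
def count_tokens(text):
    """Stand-in token counter (whitespace split); the notebook uses tiktoken."""
    return len(text.split())


def pack_context(chunks, budget):
    """Keep chunks in order until adding the next one would exceed `budget`."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected, used


chunks = ["a b c", "d e", "f g h i", "j"]  # costs: 3, 2, 4, 1
selected, used = pack_context(chunks, budget=6)
print(selected, used)  # ['a b c', 'd e'] 5
```

Like the notebook's loop, this stops at the first row that does not fit, even though a later, shorter row (`"j"` here) would still fit. That keeps the context strictly in relevance order at the cost of a few unused tokens.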

In [16]:
print(create_prompt("Mention one of these boroughs to have operation days between monday to wednesday?", df, 600))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

borough: Brooklyn

neighborhood tabulation area name: Downtown Brooklyn-DUMBO-Boerum Hill

food scrap dropoff site: Brooklyn Borough Hall Saturday Greenmarket

location: Court St. and Montague St.

hosted by: GrowNYC

open months: Year Round

operation day hours: Saturdays (Start Time: 8:00 AM - End Time:  12:00 PM)

website: grownyc.org/compost

borough and community district: 302

nyc council district number: 33

latitude: 40.693609

longitude: -73.990261

police precinct: 84

notes: Not accepted: meat, bones, or dairy

2010 cencus tract: 9.0

borough block lot: borough block lot missing

building identification number: building identification number missing

###

borough: Brooklyn

neighborhood tabulation area name: Williamsburg

food scrap dropoff site: Domino Park

location: 15 River Street Brooklyn, NY 11249

hosted by: Staff at Domino Park



## Step 5: Create a Function that Answers a Question

Our final step is to send that text prompt to a `Completion` model and parse the model output!

In [17]:
def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens=600):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
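
The parsing step only reads the `text` field of the first choice. The sketch below runs that extraction on a hand-written dictionary that mimics the shape `answer_question` expects, so no API call is made; `fake_response` and the `extract_answer` helper are assumptions for illustration.

```python
def extract_answer(response):
    """Return the stripped text of the first choice, or "" if there is none."""
    choices = response.get("choices", [])
    if not choices:
        return ""
    return choices[0].get("text", "").strip()


# Hand-written stand-in for a Completion response, not real model output
fake_response = {"choices": [{"text": "\n\nI don't know", "index": 0}]}
print(extract_answer(fake_response))
print(repr(extract_answer({"choices": []})))
```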

In [18]:
custom_nyc_answer = answer_question("How many nyc food scrap dropoff have are on Staff at Domino Park hours in Manhattan, also mention them?", df)
print(custom_nyc_answer)

Two NYC food scrap dropoff locations hosted by Staff at Domino Park in Manhattan are Domino Park (Pier 66 at W 26 St) and Hudson River Park Pier 46 at Charles St. The operation day hours are every day (Start Time: 7:00 AM - End Time:  7:00 PM).


## Default Prompting

We also define a function that sends a plain prompt, without any context from the dataset, so that we can compare its performance later.

In [19]:
nyc_default_prompt = """
Question: "How many nyc food scrap dropoff have are on Staff at Domino Park hours in Manhattan, also mention them?"
Answer:
"""

def default_response(prompt, model=COMPLETION_MODEL_NAME, max_tokens=600):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens
    )
    return response["choices"][0]["text"].strip()

print(default_response(nyc_default_prompt, model=COMPLETION_MODEL_NAME, max_tokens=500))

There are three NYC Food Scrap Dropoff locations at Domino Park in Manhattan:

1. Williamsburg Greenmarket – Saturdays, 8am to 4pm
2. East River Park Greenmarket – Thursday 10am to 6pm
3. Clinton Street Greenmarket – Saturday 8am to 4pm

All three locations are staffed with NYC Parks PlaNYC Reps to help answer questions and direct park participants to the most convenient NYC Food Scrap Dropoff locations.


## Custom Performance Demonstration

In the cells below, we demonstrate the performance of our custom query using two questions.

For each question, we show the answer from the plain `Completion` model query (no context from the dataset) as well as the answer from our custom query.

### Question 1

In [20]:
prompt = """
Question: "food scrap drop off sites between friday and sunday."
Answer:
"""

In [21]:
# Q1 - Default
print(default_response(prompt, model=COMPLETION_MODEL_NAME, max_tokens=500))

Food scrap drop off sites are typically open during normal business hours throughout the week. On Fridays and Sundays, many community recycling centers and waste disposal centers offer food scrap drop off services. Additionally, some cities hold special weekend food scrap drop off events.


In [23]:
# Q1 - Embeddings
print(answer_question(prompt, df))

Tompkins Square Greenmarket (Sundays, start time 8:00 AM - end time 5:00 PM), Q Gardens (Tuesdays, Fridays, Saturdays, Sundays, start time Tuesday 6pm; Friday - Sunday dawn - end time Tuesday 8pm; Fridays + Saturdays all night; Sundays until 4:00PM), Madison Square Park Food Scrap Drop-off (Wednesdays, start time 8:00 AM - end time 1:00 PM), Brooklyn Borough Hall Saturday Greenmarket (Saturdays, start time 8:00 AM - end time 12:00 PM).


### Question 2

In [24]:
prompt = """
Question: "Which 5 drop off sites will have open months year round between brooklyn and manhattan?"
Answer:
"""

In [25]:
# Q2 - Default
print(default_response(prompt, model=COMPLETION_MODEL_NAME, max_tokens=500))

The 5 drop off sites that will be open year round between Brooklyn and Manhattan are the Brooklyn Bridge Park Drop-Off Site, Times Square Drop-Off Site, East River State Park Drop-Off Site, Williamsburg Bridge Drop-Off Site, and the Coney Island Drop-Off Site.


In [26]:
# Q2 - Embeddings
print(answer_question(prompt, df))

The five drop off sites that will have open months year round between Brooklyn and Manhattan are Brooklyn Borough Hall Saturday Greenmarket, Domino Park, McCarren Park Greenmarket, Rockaway Parkway, and Big Reuse Warehouse.


In [27]:
# Mention food scrap dropoffs near Queens that have open months year round.
# Give me Brooklyn food scrap dropoffs hosted by.
# Where is located Old Stone House Brooklyn? Give me the latitude and Longitude
while True:
    prompt = input()
    if prompt=="END":
        break
    print(answer_question(prompt, df))
    print('Write END to finish\n')

 Mention food scrap dropoffs near Queens that have open months year round.


SE Corner of 31st Ave & Crescent St, NW Corner of Queens Plaza North & 21 Street, SE Corner of Crescent St & 30th Dr, 33 St between Broadway and 31 Ave, 34-04  24 Street.
Write END to finish



 Give me Brooklyn food scrap dropoffs hosted by.


GrowNYC, Sure We Can, Farm to People, and Department of Sanitation
Write END to finish



 Where is located Old Stone House Brooklyn? Give me the latitude and Longitude


Latitude: 40.6727118, Longitude: -73.984731
Write END to finish



 END
