# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## 🗽 Compost in the City 🌱

Composting turns organic waste into nutrient-rich fertilizer, improving soil health, reducing landfill methane, conserving water, and cutting household food waste. Yet in New York City, where most residents live in small apartments without outdoor space, home composting is often impractical. 

**CompostPal** solves this challenge. Powered by a 2025 dataset (`Food_Scrap_Drop-Off_Locations_in_NYC2025.csv`) from NYC’s open data portal, it delivers up-to-date details on compost drop-off sites across all five boroughs—addresses, hours, and restrictions—so residents can easily find nearby locations. This makes zero-waste living more accessible in one of the world’s most densely populated cities.

---

### Dataset Choice & Rationale

I chose the **Food Scrap Drop-Off Locations in NYC 2025** dataset because it provides up-to-date, hyperlocal information (site names, addresses, hours, and restrictions) that is essential for practical composting guidance.

Pre-trained models can offer general composting advice but lack this current, borough-specific data. Integrating it allows the chatbot to give actionable answers like “Drop off at XYZ Park, Saturday 10 AM–12 PM. No meat or dairy accepted,” turning abstract advice into concrete action.

This addresses a real urban challenge: most NYC residents can’t compost at home and face fragmented, outdated information online. With semantic search, the chatbot can match queries by location, time, or materials, delivering relevant results instantly.

By embedding this dataset, the project demonstrates how domain-specific knowledge transforms a generic AI into a specialized, impactful tool that supports a healthier, more sustainable city, one banana peel at a time.

---

This notebook demonstrates how semantic search combined with current, localized data can produce more useful answers than a generic language model alone.

## Key Features:

- **Dataset & Data Processing**: Uses the 2025 NYC food scrap drop-off data, cleaned from 591 rows down to 201 locations with addresses, then sampled to 20 sites, with notes standardized via `summarize_note()`.
- **Embeddings Generation**: Uses OpenAI's text-embedding-ada-002 model to create semantic vectors
- **Semantic Search**: Implements cosine similarity-based relevance ranking via get_rows_sorted_by_relevance()
- **Smart Prompt Engineering**: Token-aware prompt composition with create_prompt() function
- **Context Management**: Balances relevant information inclusion with API token limits
- **Performance Comparison**: Compares baseline vs. custom chatbot, showing localized data enables actionable, location-specific answers absent in generic AI responses.
- **Live Chatbot Interface**: Interactive chatbot with continuous dialogue, commands, UX enhancements, and robust error handling for API failures and edge cases.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [6]:
import pandas as pd
import openai
import tiktoken

In [2]:
# Read Dataset
df = pd.read_csv("data/Food_Scrap_Drop-Off_Locations_in_NYC2025.csv")

In [3]:
# Check first rows
df.head()

Unnamed: 0,Borough,NTAName,SiteName,SiteAddr,Hosted_By,Open_Month,Day_Hours,Notes,Website,BoroCD,...,Object ID,Location Point,App Android,App iOS,Assembly District,Congress District,DSNY District,DSNY Section,DSNY Zone,Senate District
0,Brooklyn,Bay Ridge,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",4th Avenue Presbyterian Church,Year Round,Every day (Start Time: Dawn - End Time: Dusk),"No meat, bones, or dairy.",,310,...,47811,POINT (-74.022767 40.635514),,,51,10,BKS10,BKS101,BKS,17
1,Manhattan,East Midtown-Turtle Bay,Dag Hammarskjold Plaza Greenmarket,E 47th St & 2nd Ave,GrowNYC,Year Round,Wednesday (Start Time: 8:00 AM - End Time: 12...,,grownyc.org/compost,106,...,47671,POINT (-73.969036 40.752606),,,74,12,MN06,MN063,MN,28
2,Manhattan,Hell's Kitchen,Hudson River Park's Pier 84 at W. 44th St.,Pier 84 at W. 44th St. near dog park,Staff at Hudson River Park,Year Round,Every day (Start Time: 7:00 AM - End Time: 7:...,,https://hudsonriverpark.org/the-park/sustainab...,104,...,47639,POINT (-74.00025 40.76346),,,67,12,MN04,MN043,MN,47
3,Manhattan,East Midtown-Turtle Bay,58th Street Library FSDO,127 East 58th Street,GrowNYC,Year Round,Wednesdays (Start Time: 7:30 AM - End Time: 1...,,grownyc.org/compost,105,...,47632,POINT (-73.9693 40.76198),,,73,12,MN05,MN052,MN,28
4,Manhattan,Tribeca-Civic Center,Tribeca Greenmarket,Greenwich St. & Duane St,GrowNYC,Year Round,Saturday (Start Time: 8:00 AM - End Time: 1:0...,,grownyc.org/compost,101,...,47544,POINT (-74.010793 40.717424),,,66,10,MN01,MN013,MN,27


In [4]:
# Check shape 
df.shape

(591, 27)

In [5]:
# Check columns
df.columns

Index(['Borough', 'NTAName', 'SiteName', 'SiteAddr', 'Hosted_By', 'Open_Month',
       'Day_Hours', 'Notes', 'Website', 'BoroCD', 'CouncilDis', 'ct2010',
       'BBL', 'BIN', 'Latitude', 'Longitude', 'PolicePrec', 'Object ID',
       'Location Point', 'App Android', 'App iOS', ' Assembly District',
       ' Congress District', 'DSNY District', ' DSNY Section', 'DSNY Zone',
       'Senate District'],
      dtype='object')

## 💬 Observations
The columns most likely to matter for **CompostPal** are:

| Column         | Use in Chatbot                                                |
| -------------- | ------------------------------------------------------------- |
| **Borough**    | Filters by user location (e.g., "in Brooklyn")                |
| **NTAName**    | Adds more precise neighborhood filtering (e.g., "in Astoria") |
| **SiteName**   | Names the drop-off site clearly                               |
| **SiteAddr**   | Tells the user exactly where to go                            |
| **Day\_Hours** | Lets users know *when* they can go                            |
| **Notes**      | Extra info like ADA access or seasonal limitations            |


In [6]:
# Keep only the key columns 
df = df[["SiteName", "SiteAddr", "Borough", "NTAName", "Day_Hours", "Notes"]]

In [7]:
# Check how many nulls in each key column
df[["SiteName", "SiteAddr", "Borough", "NTAName", "Day_Hours", "Notes"]].isnull().sum()

SiteName       0
SiteAddr     390
Borough        0
NTAName        0
Day_Hours      0
Notes        170
dtype: int64

## 💬 Observations

390 rows are missing `SiteAddr` (address information). These rows can't serve **CompostPal**'s main purpose: telling users where to drop off compost. Since the whole point is location-based guidance, dropping those rows is justified.


In [8]:
# Drop rows where SiteAddr is missing
df = df.dropna(subset=["SiteAddr"])

In [9]:
# Check new shape
df.shape

(201, 6)

In [10]:
# Check nulls in clean dataframe
df[["SiteName", "SiteAddr", "Borough", "NTAName", "Day_Hours", "Notes"]].isnull().sum()

SiteName       0
SiteAddr       0
Borough        0
NTAName        0
Day_Hours      0
Notes        170
dtype: int64

## 💬 Observations

After dropping rows missing `SiteAddr`, we’re left with 201 usable locations. Since most rows lack notes (170), I decided to only include this field when it's meaningful.


Let's check whether the `Notes` column contains meaningful information.

In [11]:
# Filter out whitespace-only notes. Caveat: NaN != "" evaluates to True, so rows with missing Notes still appear in this preview
df_notes = df[df["Notes"].str.strip() != ""]

# Preview a few
df_notes[["SiteName", "Notes"]].head(10)

Unnamed: 0,SiteName,Notes
0,4th Avenue Presbyterian Church,"No meat, bones, or dairy."
1,Dag Hammarskjold Plaza Greenmarket,
2,Hudson River Park's Pier 84 at W. 44th St.,
3,58th Street Library FSDO,
4,Tribeca Greenmarket,
5,St. George Greenmarket,
6,St. Mary's Harlem,"No meat, bones, or dairy"
7,1100 Bergen Street Community Garden,
8,Nurture BK,
9,BK Rot,
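
The preview above still lists sites with empty notes because a missing value compared against `""` counts as "not equal". A minimal sketch of a stricter mask, using a toy `notes` Series (the values are illustrative stand-ins, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Notes column (illustrative values only)
notes = pd.Series(["No meat, bones, or dairy.", np.nan, "   ", "Bins at gate"])

# Comparing against "" alone does not exclude missing values
loose_mask = notes.str.strip() != ""

# Requiring notna() as well drops both missing and whitespace-only notes
strict_mask = notes.notna() & (notes.str.strip() != "")

print(notes[strict_mask].tolist())
```

Applied to the real dataframe, the same idea would be `df[df["Notes"].notna() & (df["Notes"].str.strip() != "")]`.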


In [12]:
# Check unique notes
df["Notes"].unique()

array(['No meat, bones, or dairy.', nan, 'No meat, bones, or dairy',
       'Please carefully read instructions here: https://bit.ly/laplazadropoff',
       'Garden staff will be available for questions or concerns from 10am-6pm.',
       'Not accepted: meat, bones, or dairy',
       'Please bring your food scraps to the “Compost Here” bin at the back of the garden Sat-Sun 12-5 and whenever the garden gate is open. We will have brown bins out on the sidewalk on other days and times when we have capacity to haul these food scraps to ou',
       'Visit the Center Farm stand to drop off your food scraps.',
       'Bins at gate during winter months',
       'Year Round: check social media for open hour updates + November - March only open on Saturday 11-12pm. "WEATHER\xa0PERMITTING! Check IG/FB for weekend updates.\xa0 Please DO NOT leave bags of food scraps\xa0by gate when the\xa0garden is closed.\xa0 Remember to remove p',
       'Keep food scraps and plant material separated. This site 

## 💬 Observations

The `Notes` column provides meaningful additional information, such as composting restrictions, site instructions, and access details that NYC residents need. Although 170 rows have no note, I decided to keep this column and apply a **summarization function**, `summarize_note`, that maps each note to a short standardized message or its first sentence. This way **CompostPal**'s outputs stay concise while preserving helpful content when available.

In [13]:
def summarize_note(note):
    """
    This function standardizes and condenses the 'Notes' field for each compost drop-off site.

    It identifies key phrases to extract actionable information such as:
    - Restrictions on accepted materials (e.g., no meat or dairy)
    - Weather-dependent availability
    - Drop-off instructions or entrance details
    - Volunteer opportunities
    - Warnings about removing non-organic materials

    If no meaningful information is found, or if the note is too short or generic, the function returns an empty string.
    Otherwise, it returns a brief, readable summary suitable for inclusion in chatbot responses.
    """

    if pd.isna(note) or not note.strip():
        return ""

    note = note.strip()

    # Standardize for matching
    note_lower = note.lower()

    # Most useful types of notes
    if "no meat" in note_lower or "not accepted" in note_lower:
        return "This site does NOT accept meat, bones, or dairy."
    
    if "accepts meat" in note_lower or "accepts bones" in note_lower:
        return "This site DOES accept meat, bones, and dairy."

    if "weather" in note_lower or "check social media" in note_lower:
        return "This site may close during bad weather—check social media for updates."

    if "entrance" in note_lower or "gate" in note_lower:
        return "Pay attention to entrance instructions or garden gate access."

    if "remove plastic" in note_lower:
        return "Remove all plastic and non-organic material before drop-off."

    if "volunteer" in note_lower:
        return "Volunteers are welcome—check the note for how to help."

    # If it's very short or generic, skip it
    if len(note) < 50 or "visit the website" in note_lower:
        return ""

    # Fallback: first full sentence. Splitting on ". " (period + space) keeps URLs such as bit.ly links intact
    return note.split(". ")[0].rstrip(".") + "."

# Apply the summarizer
df = df.assign(Note_Summary=df["Notes"].apply(summarize_note))


In [14]:
# Display the results
df["Note_Summary"].head(20)

0      This site does NOT accept meat, bones, or dairy.
1                                                      
2                                                      
3                                                      
4                                                      
5                                                      
6      This site does NOT accept meat, bones, or dairy.
7                                                      
8                                                      
9                                                      
10                                                     
21                                                     
23                                                     
33    Please carefully read instructions here: https...
34                                                     
42                                                     
44     This site does NOT accept meat, bones, or dairy.
50                                              

## 💬 Observation

Although our dataframe is formatted correctly, embedding and prompting with all 201 entries would use more of the token budget than this project needs. To meet project requirements while staying within limits, I'll randomly sample 4 sites per NYC borough, for a manageable total of 20 entries, the minimum required for this project.

In [15]:
# Sample 4 entries per borough (stratified sampling)
df = df.groupby("Borough").apply(lambda x: x.sample(4, random_state=42)).reset_index(drop=True)

# Check the result
df["Borough"].value_counts()


Bronx            4
Brooklyn         4
Manhattan        4
Queens           4
Staten Island    4
Name: Borough, dtype: int64

In [16]:
df.shape

(20, 7)

In [17]:
# Reset index
df = df.reset_index(drop=True)

In [18]:
df.head(10)

Unnamed: 0,SiteName,SiteAddr,Borough,NTAName,Day_Hours,Notes,Note_Summary
0,Spuyten Duyvil PreSchool,3041 Kingsbridge Avenue,Bronx,Kingsbridge-Marble Hill,Friday (Start Time: 8:00 AM - End Time: 12:00...,"Remove all plastic, stickers and zip ties befo...","Remove all plastic, stickers and zip ties befo..."
1,Riverdale Neighborhood House,"5521 Mosholu Ave Bronx, NY 10471",Bronx,Riverdale-Spuyten Duyvil,Thursday (Start Time: 1:00 PM - End Time: 6:0...,,
2,BronxWorks Carolyn McLaughlin Community Center,1130 Grand Concourse,Bronx,Concourse-Concourse Village,Thursday (Start Time: 10:00 AM - End Time: 1:...,,
3,Lehman College,Gate 8: Bedford Park Blvd West & Paul Ave Bron...,Bronx,Bedford Park,Monday (Start Time: 9:00 AM - End Time: 12:00...,,
4,East New York Farms: Success Community Garden,461 Williams Avenue,Brooklyn,East New York-New Lots,"Mondays, Wednesdays, and Sundays (Start Time: ...",,
5,Rogers / Tilden / Veronica Place Garden,"2601 - 2603 Tilden Avenue, Brooklyn 11226",Brooklyn,East Flatbush-Erasmus,Saturday (Start Time: 10:00 AM - End Time: 12...,,
6,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",Brooklyn,Bay Ridge,Every day (Start Time: Dawn - End Time: Dusk),"No meat, bones, or dairy.","This site does NOT accept meat, bones, or dairy."
7,Carroll Gardens Greenmarket,Smith St and 1st Pl,Brooklyn,Carroll Gardens-Cobble Hill-Gowanus-Red Hook,Sunday (Start Time: 8:00 AM - End Time: 12:00...,,
8,Dag Hammarskjold Plaza Greenmarket,E 47th St & 2nd Ave,Manhattan,East Midtown-Turtle Bay,Wednesday (Start Time: 8:00 AM - End Time: 12...,,
9,La Plaza Cultural Community Garden,674 East 9th Street,Manhattan,East Village,Saturday and Sunday (Start Time: 2:00 PM - End...,Please carefully read instructions here: https...,Please carefully read instructions here: https...


## 💬 Observation

Now we're ready to create the `text` column required for this project. This column will serve as the core context **CompostPal** uses to answer user questions. To generate it, I'll build a `build_text` function that combines the **key columns** (`SiteName`, `SiteAddr`, `Borough`, `NTAName`, `Day_Hours`, and `Note_Summary` when it is non-empty) into a single, searchable entry.

In [19]:
# Create function to build text column
def build_text(row):
    """
    Constructs a formatted text entry for a compost drop-off location.

    This function combines key fields from a row — including site name, address, borough,
    neighborhood, hours, and (if available) summarized notes — into a single, readable
    string. The result is used as input context for the CompostPal chatbot.

    Line breaks and icons are used to enhance readability in chatbot or Markdown-based UIs.
    """
    entry = (
        f"📍 {row['SiteName']}\n"
        f"Address: {row['SiteAddr']}\n"
        f"Borough: {row['Borough']} | Neighborhood: {row['NTAName']}\n"
        f"Hours: {row['Day_Hours']}"
    )
    if row["Note_Summary"]:
        entry += f"\nNote: {row['Note_Summary']}"
    return entry


# Apply the function
df["text"] = df.apply(build_text, axis=1)

In [63]:
# Print every entry to check the format (with and without notes)
for i, entry in enumerate(df["text"]):
    print(f"\nEntry {i+1}:\n{entry}")


Entry 1:
📍 Spuyten Duyvil PreSchool
Address: 3041 Kingsbridge Avenue
Borough: Bronx | Neighborhood: Kingsbridge-Marble Hill
Hours: Friday (Start Time: 8:00 AM - End Time:  12:00 PM)
Note: Remove all plastic, stickers and zip ties before dropping off food scraps.

Entry 2:
📍 Riverdale Neighborhood House
Address: 5521 Mosholu Ave Bronx, NY 10471
Borough: Bronx | Neighborhood: Riverdale-Spuyten Duyvil
Hours: Thursday (Start Time: 1:00 PM - End Time:  6:00 PM)

Entry 3:
📍 BronxWorks Carolyn McLaughlin Community Center
Address: 1130 Grand Concourse
Borough: Bronx | Neighborhood: Concourse-Concourse Village
Hours: Thursday (Start Time: 10:00 AM - End Time:  1:00 PM)

Entry 4:
📍 Lehman College
Address: Gate 8: Bedford Park Blvd West & Paul Ave Bronx, NY 10468
Borough: Bronx | Neighborhood: Bedford Park
Hours: Monday (Start Time: 9:00 AM - End Time:  12:00 PM)

Entry 5:
📍 East New York Farms: Success Community Garden
Address: 461 Williams Avenue
Borough: Brooklyn | Neighborhood: East New York

## 💬 Observation

Now that the `text` column contains all the necessary information in the correct format, it's time to reduce the DataFrame to just this column. This prepares the data for use in **CompostPal**, where each text entry will serve as a searchable reference for answering user questions.

In [21]:
# Reduce the dataframe to just the text column
df = df[["text"]].copy()
df.head()

Unnamed: 0,text
0,📍 Spuyten Duyvil PreSchool\nAddress: 3041 King...
1,📍 Riverdale Neighborhood House\nAddress: 5521 ...
2,📍 BronxWorks Carolyn McLaughlin Community Cent...
3,📍 Lehman College\nAddress: Gate 8: Bedford Par...
4,📍 East New York Farms: Success Community Garde...


## Generating Embeddings

Now that our custom dataset is clean and consolidated, we’ll use OpenAI’s Embedding API to convert each text entry into a numerical vector. These embeddings capture the semantic meaning of each compost drop-off site description and will later allow **CompostPal** to find the most relevant locations based on user questions.
- We will need API authentication to generate embeddings
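
To make "finding the most relevant locations" concrete, here is a minimal sketch of ranking by cosine distance. The three-dimensional vectors and the `site_a`/`site_b`/`site_c` names are made up for illustration; real `text-embedding-ada-002` vectors are much longer, and `cosine_distance` is a hypothetical helper rather than part of the OpenAI library.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (illustrative only)
query = [1.0, 0.0, 1.0]
site_vectors = {
    "site_a": [1.0, 0.0, 1.0],  # same direction as the query
    "site_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "site_c": [1.0, 1.0, 1.0],  # partially aligned with the query
}

# Rank sites from most to least similar to the query
ranked = sorted(site_vectors, key=lambda name: cosine_distance(query, site_vectors[name]))
print(ranked)
```

The site pointing in the same direction as the query ranks first and the orthogonal one ranks last, which is the behavior the relevance ranking later in this notebook relies on.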

In [31]:
# 🔐 API Authentication: Add your OpenAI API key here (Replace <YOUR API KEY> with your actual API key)
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "<YOUR API KEY>"

In [51]:
import openai

MODEL = "text-embedding-ada-002"
batch_size = 100
embeddings = []

# Process in batches
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]["text"].tolist()
    
    try:
        response = openai.Embedding.create(
            input=batch,
            engine=MODEL  # ✅ correct for v0 API
        )
        batch_embeddings = [item["embedding"] for item in response["data"]]
        embeddings.extend(batch_embeddings)
        print(f"✅ Batch {i} to {i + batch_size} processed")
        
    except Exception as e:
        print(f"❌ Failed to fetch embeddings for batch {i}–{i + batch_size}: {e}")
        break

# Verify and assign embeddings
if len(embeddings) == len(df):
    df["embeddings"] = embeddings
    print(f"✅ Successfully added {len(embeddings)} embeddings to dataframe")
else:
    print(f"❌ Embedding count mismatch: got {len(embeddings)} for {len(df)} rows")


✅ Batch 0 to 100 processed
✅ Successfully added 20 embeddings to dataframe


In [52]:
# Save embeddings
df.to_csv("compostpal_embeddings.csv", index=False)
print("CompostPal embeddings saved!")

# Check it's there
print(f"Saved {len(df)} locations to compostpal_embeddings.csv")

CompostPal embeddings saved!
Saved 20 locations to compostpal_embeddings.csv


In [20]:
import ast
MODEL = "text-embedding-ada-002"

def load_embeddings_from_csv(filename="compostpal_embeddings.csv"):
    """
    Safely load compost embeddings from CSV file.
    """
    try:
        df_loaded = pd.read_csv(filename)
        
        if "embeddings" not in df_loaded.columns:
            raise ValueError("Missing 'embeddings' column in the CSV file")
            
        # Convert stringified lists to Python lists
        df_loaded["embeddings"] = df_loaded["embeddings"].apply(ast.literal_eval)
        
        print(f"Loaded {len(df_loaded)} CompostPal locations with embeddings")
        return df_loaded
    
    except FileNotFoundError:
        print(f"❌ File '{filename}' not found.")
    except Exception as e:
        print(f"❌ Failed to load embeddings: {e}")
    
    return None  # Return None if loading fails
    

In [26]:
df = load_embeddings_from_csv()
print(df.columns)        # Check column names
print(df["embeddings"][0][:5])  # Show first 5 dimensions of the first embedding
print(df.head())         # Show the full first few rows

Loaded 20 CompostPal locations with embeddings
Index(['text', 'embeddings'], dtype='object')
[0.008538631722331047, 0.008754117414355278, -0.01644427329301834, -0.028067046776413918, -0.02048463560640812]
                                                text  \
0  📍 Spuyten Duyvil PreSchool\nAddress: 3041 King...   
1  📍 Riverdale Neighborhood House\nAddress: 5521 ...   
2  📍 BronxWorks Carolyn McLaughlin Community Cent...   
3  📍 Lehman College\nAddress: Gate 8: Bedford Par...   
4  📍 East New York Farms: Success Community Garde...   

                                          embeddings  
0  [0.008538631722331047, 0.008754117414355278, -...  
1  [-0.005164614878594875, -0.013079347088932991,...  
2  [-0.003902007592841983, 0.0070097134448587894,...  
3  [0.005837489850819111, 0.005582678597420454, -...  
4  [-0.013651822693645954, -0.02643309347331524, ...  


## 💬 Observation

Now we have stored reusable, numerical representations of data from compost drop-off sites across NYC that can be retrieved and used later without recomputing.

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [32]:
# get_embedding is defined in the next cell, so only the distance helper is imported here
from openai.embeddings_utils import distances_from_embeddings

In [33]:
def get_embedding(text, engine=MODEL):
    """
    Safely fetch an embedding for the given text using the OpenAI v0 API.

    Args:
        text (str): Input text to embed.
        engine (str): Embedding engine to use (default: text-embedding-ada-002)

    Returns:
        list[float] or None: The embedding vector, or None if the request fails.
    """
    try:
        response = openai.Embedding.create(
            input=[text],
            engine=engine
        )
        return response["data"][0]["embedding"]

    except Exception as e:
        print(f"❌ Failed to get embedding: {e}")
        return None


In [42]:
def get_rows_sorted_by_relevance(question, df, top_k=None):
    """
    Given a user question, returns the DataFrame sorted by semantic relevance.

    This function:
    - Embeds the input question using the OpenAI embedding model.
    - Calculates cosine distance between the question's embedding and each row's embedding.
    - Sorts the DataFrame from most relevant (lowest distance) to least relevant (highest distance).

    Args:
        question (str): The user's natural language query.
        df (pd.DataFrame): A DataFrame containing an 'embeddings' column with list[float] vectors.
        top_k (int, optional): If provided, returns only the top K most relevant rows.

    Returns:
        pd.DataFrame: A sorted copy of the DataFrame with a 'distances' column added.
    """

    # Fetch the embedding for the user's question
    question_embedding = get_embedding(question, engine=MODEL)

    if question_embedding is None:
        print("❌ Could not get embedding for the question. Returning unsorted DataFrame.")
        return df.copy()

    # Compute cosine distances from the question to each row's embedding
    df_copy = df.copy()
    try:
        df_copy["distances"] = distances_from_embeddings(
            question_embedding,
            df_copy["embeddings"].tolist(),
            distance_metric="cosine"
        )
    except Exception as e:
        print(f"❌ Failed to compute distances: {e}")
        return df.copy()

    # Sort by distance (ascending = more relevant)
    df_copy.sort_values("distances", ascending=True, inplace=True)

    return df_copy.head(top_k) if top_k is not None else df_copy


In [66]:
get_rows_sorted_by_relevance("Where can I compost in Brooklyn?", df)

Unnamed: 0,text,embeddings,distances
10,📍 Governors Island Compost Learning Center\nAd...,"[0.0068593802861869335, -0.006057052873075008,...",0.171364
0,📍 Spuyten Duyvil PreSchool\nAddress: 3041 King...,"[0.008538631722331047, 0.008754117414355278, -...",0.178703
5,📍 Rogers / Tilden / Veronica Place Garden\nAdd...,"[-0.007566023617982864, -0.04361746460199356, ...",0.185445
4,📍 East New York Farms: Success Community Garde...,"[-0.013651822693645954, -0.02643309347331524, ...",0.187542
6,📍 4th Avenue Presbyterian Church\nAddress: 675...,"[0.008020597510039806, -0.02139265462756157, -...",0.195717
9,📍 La Plaza Cultural Community Garden\nAddress:...,"[-0.006474181544035673, -0.00914041604846716, ...",0.201417
8,📍 Dag Hammarskjold Plaza Greenmarket\nAddress:...,"[-0.00811375118792057, -0.014799588359892368, ...",0.202676
12,📍 Forest Hills Greenmarket\nAddress: MacDonald...,"[0.0028155113104730844, -0.009783780202269554,...",0.205492
7,📍 Carroll Gardens Greenmarket\nAddress: Smith ...,"[0.01781637594103813, -0.012823293916881084, 0...",0.210848
14,📍 CPF Liberty Collective Learning Garden\nAddr...,"[-0.004798536188900471, 0.004939177073538303, ...",0.21356


## 💬 Observation

The semantic search did not fully prioritize geography: only **3 of the top 5 results** are in Brooklyn, and the two closest matches (Governors Island Compost Learning Center and Spuyten Duyvil PreSchool) are in Manhattan and the Bronx. The results are still useful, since several relevant Brooklyn sites rank near the top.

Semantic similarity weighs all of the text in an entry, not just its borough, so strict borough filtering would likely require an additional processing step.
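
One possible form of that additional processing is a keyword pre-filter applied before ranking. The sketch below is an assumption, not part of the notebook's pipeline: `detect_borough` and `filter_by_borough` are hypothetical helpers, the toy rows are invented, and the filter relies on the `Borough: ...` line that `build_text` writes into each entry.

```python
import pandas as pd

BOROUGHS = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]

def detect_borough(question):
    """Return the first borough named in the question, or None."""
    lowered = question.lower()
    for borough in BOROUGHS:
        if borough.lower() in lowered:
            return borough
    return None

def filter_by_borough(question, df):
    """Keep rows whose text names the asked-for borough; fall back to all rows."""
    borough = detect_borough(question)
    if borough is None:
        return df
    mask = df["text"].str.contains(f"Borough: {borough}", regex=False)
    return df[mask] if mask.any() else df

# Toy rows with precomputed distances (illustrative values only)
toy = pd.DataFrame({
    "text": [
        "📍 Site A\nBorough: Manhattan | Neighborhood: X",
        "📍 Site B\nBorough: Brooklyn | Neighborhood: Y",
        "📍 Site C\nBorough: Bronx | Neighborhood: Z",
    ],
    "distances": [0.17, 0.19, 0.18],
})

result = filter_by_borough("Where can I compost in Brooklyn?", toy).sort_values("distances")
print(result["text"].tolist())
```

Falling back to the full dataframe when no borough is detected keeps general questions working as before.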


### Create a Function that Composes a Text Prompt with Context-Aware Token Management

The semantic search using cosine distance surfaces relevant compost locations, but a ranked list of rows alone does not give **CompostPal** the structured context it needs to generate helpful responses. To bridge this gap, we need a prompt composition function that provides:

- Structured context that frames the information appropriately
- Token budget management to stay within API limits
- Relevant information prioritization based on user queries
- Clear instructions that guide the model's response style
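
The token budget idea can be sketched in isolation as a greedy loop: add the most relevant snippets in order until the next one would exceed the limit. This is an illustrative sketch only. `pack_context` is a hypothetical helper, and `word_count` is a crude stand-in for a real tokenizer such as `tiktoken`, so the counts are not real token counts.

```python
def pack_context(snippets, base_tokens, max_tokens, count_tokens):
    """Greedily keep snippets (already sorted by relevance) that fit the budget."""
    selected = []
    used = base_tokens  # tokens already spent on the instructions and question
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > max_tokens:
            break  # stop at the first snippet that does not fit
        selected.append(snippet)
        used += cost
    return selected, used

def word_count(text):
    """Stand-in token counter: one 'token' per whitespace-separated word."""
    return len(text.split())

snippets = [
    "site one open saturday",
    "site two open sunday morning",
    "site three open daily",
]
selected, used = pack_context(snippets, base_tokens=3, max_tokens=10, count_tokens=word_count)
print(selected, used)
```

Stopping at the first snippet that does not fit, rather than skipping it and trying shorter ones, keeps the context in strict relevance order at the cost of sometimes leaving part of the budget unused.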

In [56]:
def create_prompt(question, df, max_tokens):
    """
    Create a context-aware prompt for a Completion model based on user input and relevant compost data.

    This function:
    - Tokenizes the base prompt and question using tiktoken's tokenizer.
    - Selects the most relevant rows (sorted by semantic similarity) from the DataFrame.
    - Adds text from the most relevant rows to the context until `max_tokens` is reached.
    - Returns a fully assembled prompt ready for completion or chat models.

    Args:
        question (str): The user's query about compost drop-off sites.
        df (pd.DataFrame): A DataFrame containing a 'text' column and corresponding embeddings.
        max_tokens (int): Maximum token count allowed for the final prompt.

    Returns:
        str: A prompt formatted with contextual information and the user's question.
    """
    try:
        # Load the tokenizer aligned with OpenAI's models
        tokenizer = tiktoken.get_encoding("cl100k_base")

        # Base prompt template; the two {} placeholders take the context and the question
        prompt_template = """
        You are CompostPal, a helpful NYC composting assistant. 
        Answer the question based on the compost drop-off locations below. 
        If you can't find relevant information, say "I don't have information about that specific location."

        NYC Compost Drop-Off Locations:

        {}

        ---

        Question: {}
        Answer:"""

    
        # Initial token count from prompt template and question
        current_token_count = len(tokenizer.encode(prompt_template.format("", question)))

        context = []

        # Get the most relevant rows based on semantic similarity
        sorted_rows = get_rows_sorted_by_relevance(question, df)

        for text in sorted_rows["text"].values:
            # Tokenize each candidate row and track total token count
            text_token_count = len(tokenizer.encode(text))
            if current_token_count + text_token_count <= max_tokens:
                context.append(text)
                current_token_count += text_token_count
            else:
                break  # Stop adding more context once limit is reached

        # Join selected rows and insert into prompt template
        full_prompt = prompt_template.format("\n\n###\n\n".join(context), question)
        return full_prompt

    except Exception as e:
        print(f"❌ Failed to create prompt: {e}")
        return "There was an error generating the prompt."

In [57]:
create_prompt("Where can I compost in Brooklyn?", df, 200)

'\n        You are CompostPal, a helpful NYC composting assistant. \n        Answer the question based on the compost drop-off locations below. \n        If you can\'t find relevant information, say "I don\'t have information about that specific location."\n\n        NYC Compost Drop-Off Locations:\n\n        📍 Governors Island Compost Learning Center\nAddress: 758 Enright Rd\nBorough: Manhattan | Neighborhood: The Battery-Governors Island-Ellis Island-Liberty Island\nHours: Saturday and Sunday (Start Time: 12:00 PM - End Time:  4:00 PM)\n\n        ---\n\n        Question: Where can I compost in Brooklyn?\n        Answer:'

### Now we create answer functions

Final step: generate answers using the OpenAI Completions model. We'll create two outputs, one using a custom prompt enriched with retrieved context and another using only the bare question, to compare semantic search plus tailored instructions against plain input.


In [61]:
def get_baseline_answer(question):
    """Get answer from GPT without custom context for comparison"""
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=f"Question: {question}\nAnswer:",
            max_tokens=150
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(f"Error getting baseline: {e}")
        return "Unable to get baseline answer"

In [58]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens=150):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
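
The two functions above call `openai.Completion.create`, which is the interface of the `openai` package before version 1.0. As a minimal sketch, assuming a newer installation where a client object is used instead, the same request could look like the following. The helper name `complete_text` and the `client` parameter are illustrative and not part of the notebook; `client` is assumed to be an `openai.OpenAI()` instance.

```python
def complete_text(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Return the stripped completion text, or "" if the request fails.

    Assumption: `client` is an `openai.OpenAI()` instance (openai>=1.0).
    """
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        # In the 1.x client the response is an object, not a dict.
        return response.choices[0].text.strip()
    except Exception as e:
        print(f"Completion failed: {e}")
        return ""
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object, so the error path can be checked without a network call.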

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [62]:
# Question 1: Brooklyn composting locations
question1 = "Where can I compost in Brooklyn?"
baseline_answer1 = get_baseline_answer(question1)
custom_answer1 = answer_question(question1, df)

In [63]:
print("=== QUESTION 1 COMPARISON ===")
print(f"Question: {question1}\n")
print("BASELINE ANSWER (No Custom Data):")
print(baseline_answer1)
print("\nCOMPOSTPAL ANSWER (With NYC Data):")
print(custom_answer1)
print("\n" + "="*50 + "\n")

=== QUESTION 1 COMPARISON ===
Question: Where can I compost in Brooklyn?

BASELINE ANSWER (No Custom Data):
There are several places to compost in Brooklyn including community gardens, local farms, and compost drop-off locations. Some community gardens, such as the Brooklyn Botanic Garden's compost project, allow individuals to drop off their compost materials for processing. Local farms, like Red Hook Community Farm and Brooklyn Grange, may also accept compost materials. Additionally, the NYC Department of Sanitation has a list of drop-off locations where residents can drop off their compost for processing. These may include farmer's markets, greenmarkets, community compost sites, and food scrap drop-off locations. It is recommended to contact the specific location beforehand to confirm their composting policies and procedures.

COMPOSTPAL ANSWER (With NYC Data):
📍 Rogers / Tilden / Veronica Place Garden
 Address: 2601 - 2603 Tilden Avenue, Brooklyn 11226
 Neighborhood: East Flatbush-
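
The CompostPal answer above stops partway through the neighborhood field. One possible cause is the `max_tokens` limit, though the output alone does not confirm that. A small sketch, assuming the legacy dict-style response used in this notebook, can report whether the model stopped because it ran out of tokens. The helper name `text_and_truncation` is hypothetical.

```python
def text_and_truncation(response):
    """Return (text, was_truncated) from a legacy Completion-style response.

    `finish_reason` is "length" when generation stopped at the token limit
    and "stop" when the model ended the answer on its own.
    """
    choice = response["choices"][0]
    was_truncated = choice.get("finish_reason") == "length"
    return choice["text"].strip(), was_truncated
```

If the flag comes back true, raising `max_answer_tokens` in `answer_question` is the first thing to try.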

### Question 2

In [64]:
# Question 2: Manhattan composting hours
question2 = "What are the composting hours in Manhattan?"
baseline_answer2 = get_baseline_answer(question2)
custom_answer2 = answer_question(question2, df)

In [65]:
print("=== QUESTION 2 COMPARISON ===")
print(f"Question: {question2}\n")
print("BASELINE ANSWER (No Custom Data):")
print(baseline_answer2)
print("\nCOMPOSTPAL ANSWER (With NYC Data):")
print(custom_answer2)

=== QUESTION 2 COMPARISON ===
Question: What are the composting hours in Manhattan?

BASELINE ANSWER (No Custom Data):
The composting hours in Manhattan are not universally defined, as they can vary by location. Some composting sites may be open 24 hours a day, while others may have restricted hours. It is best to check with specific composting sites in Manhattan to find out their hours of operation.

COMPOSTPAL ANSWER (With NYC Data):
Saturdays and Sundays, 12:00 PM - 4:00 PM.


## 💬 Observation


The comparison suggests that, for these location-specific queries, semantic search plus custom context gives more useful answers than the generic language model responses. The key observations are:

### Actionability

**Baseline**: Generic advice requiring users to do additional research
<br>
**CompostPal**: Immediately actionable information with addresses and specific times

### Data Currency

**Baseline**: References potentially outdated or non-existent programs
<br>
**CompostPal**: Uses current 2025 NYC open data, ensuring relevance

### User Experience

**Baseline**: Forces users to "contact locations beforehand" or do their own research
<br>
**CompostPal**: Eliminates friction by providing all necessary details upfront

### Precision vs. Generalization

**Baseline**: Broad generalizations that may not apply to the user's specific situation
<br>
**CompostPal**: Targeted responses based on semantic similarity to the user's query

<mark>Note that although the baseline answers were generic or off-topic, they were still successful API responses: the `except` branch in `get_baseline_answer` never ran, so its fallback string was never returned. This highlights the difference between a call that succeeds and an answer that is useful.</mark>
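
The gap between a successful call and a useful answer can be made checkable. The sketch below is one rough heuristic, not part of the notebook: it treats an answer as grounded if it names at least one site from the dataset. The function name `mentions_known_site` is hypothetical, and the list of site names is assumed to come from whichever dataframe column holds them.

```python
def mentions_known_site(answer, site_names):
    """Return True if the answer names at least one known drop-off site.

    Matching is a case-insensitive substring test, so it is only a coarse
    signal and will miss paraphrased or abbreviated site names.
    """
    answer_lower = answer.lower()
    return any(name.lower() in answer_lower for name in site_names if name)


# Site names taken from the outputs shown in this notebook.
sites = [
    "Governors Island Compost Learning Center",
    "4th Avenue Presbyterian Church",
]
print(mentions_known_site("📍 4th Avenue Presbyterian Church", sites))
print(mentions_known_site("It is best to check with specific composting sites.", sites))
```

A check like this passes for the CompostPal answers and fails for the baseline answers above, even though every one of them came from a successful request.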

## Try the CompostPal Interactive Version!

### Features:

**Continuous Conversation**

- Users can ask multiple questions without rerunning code
- Natural conversation flow with prompts

**User-Friendly Commands**

- `quit`, `exit`, `bye`: end the session
- `help`: show example questions
- Empty input handling with helpful prompts

**Enhanced User Experience**

- Welcome message explaining how to use CompostPal
- Clear formatting with emojis and separators
- Loading indicators while processing

**Optional Comparison Mode**

- After each answer, users can choose to see baseline vs. custom comparison
- Side-by-side results showing the difference in quality

In [68]:
def interactive_compostpal():
    """
    Interactive CompostPal chatbot that allows users to ask questions repeatedly.
    
    This function creates a continuous loop where users can:
    - Ask questions about NYC compost drop-off locations
    - Get answers based on the custom dataset and embeddings
    - Continue asking questions until they choose to quit
    - Compare baseline vs. custom answers (optional)
    """
    
    print("🌱 Welcome to CompostPal! 🌱")
    print("Your friendly NYC composting assistant")
    print("=" * 50)
    print("Ask me about compost drop-off locations in NYC!")
    print("Type 'quit', 'exit', or 'bye' to end the conversation.")
    print("Type 'help' for example questions.")
    print("=" * 50)
    
    while True:
        try:
            # Get user input
            user_question = input("\n🗽 Your question: ").strip()
            
            # Check for exit commands
            if user_question.lower() in ['quit', 'exit', 'bye', 'q']:
                print("\n🌱 Thanks for using CompostPal! Keep composting! 🌱")
                break
            
            # Check for empty input
            if not user_question:
                print("Please enter a question, or type 'quit' to exit.")
                continue
            
            # Help command
            if user_question.lower() == 'help':
                print("\n📚 Example questions you can ask:")
                print("• Where can I compost in Brooklyn?")
                print("• What are the hours for Manhattan compost sites?")
                print("• Are there any compost locations in Queens?")
                print("• Which sites accept meat and dairy?")
                print("• Show me weekend compost drop-off locations")
                print("• What compost sites are open every day?")
                continue
            
            # Process the question
            print(f"\n🔍 Searching for: '{user_question}'")
            print("Processing...")
            
            # Get the custom answer using your existing function
            custom_answer = answer_question(user_question, df)
            
            if custom_answer:
                print("\n📍 CompostPal says:")
                print("-" * 30)
                print(custom_answer)
            else:
                print("\n❌ Sorry, I couldn't find information about that.")
                print("Try rephrasing your question or ask about:")
                print("• Specific boroughs (Brooklyn, Manhattan, Queens, Bronx, Staten Island)")
                print("• Hours and schedules")
                print("• Site locations and addresses")
            
            # Optional: Ask if user wants to see baseline comparison
            compare_choice = input("\n🤔 Want to see how I compare to a basic search? (y/n): ").strip().lower()
            
            if compare_choice in ['y', 'yes']:
                print("\n📊 COMPARISON:")
                print("=" * 40)
                
                baseline_answer = get_baseline_answer(user_question)
                
                print("🤖 Basic Search Result:")
                print(baseline_answer if baseline_answer else "No baseline answer available")
                
                print("\n🌱 CompostPal Result:")
                print(custom_answer if custom_answer else "No custom answer available")
                print("=" * 40)
        
        except KeyboardInterrupt:
            print("\n\n🌱 Thanks for using CompostPal! Keep composting! 🌱")
            break
        except Exception as e:
            print(f"\n❌ Oops! Something went wrong: {e}")
            print("Please try asking your question again.")

# Function to start the interactive session
def start_compostpal():
    """
    Wrapper function to start CompostPal with all necessary checks.
    """
    
    # Check if required data and functions exist
    try:
        # Verify dataframe exists
        if 'df' not in globals():
            print("❌ Error: Dataframe 'df' not found. Please load your embeddings first.")
            return
        
        # Verify required functions exist
        required_functions = ['answer_question', 'get_baseline_answer']
        missing_functions = [func for func in required_functions if func not in globals()]
        
        if missing_functions:
            print(f"❌ Error: Missing functions: {missing_functions}")
            print("Please make sure all required functions are defined.")
            return
        
        # Start the interactive session
        interactive_compostpal()
        
    except Exception as e:
        print(f"❌ Error starting CompostPal: {e}")

# Example usage:
# To start the interactive chatbot, simply call:
# start_compostpal()

# Alternative: Direct start function if everything is already loaded
def quick_start():
    """Quick start function for when all dependencies are confirmed to be loaded."""
    interactive_compostpal()

print("🌱 CompostPal Interactive Mode Ready!")
print("Run start_compostpal() to begin chatting, or quick_start() if everything is loaded.")

🌱 CompostPal Interactive Mode Ready!
Run start_compostpal() to begin chatting, or quick_start() if everything is loaded.
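
The command handling inside `interactive_compostpal()` is mixed in with `input()` and `print()` calls, which makes it awkward to check without typing into the prompt. As a sketch only, the same decisions can be pulled into a small pure function. The names `classify_input` and `EXIT_COMMANDS` are illustrative and do not appear in the notebook; the accepted commands mirror those in the loop above.

```python
EXIT_COMMANDS = {"quit", "exit", "bye", "q"}


def classify_input(raw):
    """Classify one line of user input as 'exit', 'empty', 'help' or 'question'."""
    text = raw.strip()
    if not text:
        return "empty"
    lowered = text.lower()
    if lowered in EXIT_COMMANDS:
        return "exit"
    if lowered == "help":
        return "help"
    return "question"


for sample in ["  QUIT ", "", "help", "Where can I compost in Brooklyn?"]:
    print(repr(sample), "->", classify_input(sample))
```

With this in place, the loop only has to branch on the returned label, and each branch can be verified without running an interactive session.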


In [69]:
start_compostpal()

🌱 Welcome to CompostPal! 🌱
Your friendly NYC composting assistant
Ask me about compost drop-off locations in NYC!
Type 'quit', 'exit', or 'bye' to end the conversation.
Type 'help' for example questions.

🗽 Your question: Which sites accept meat and dairy?

🔍 Searching for: 'Which sites accept meat and dairy?'
Processing...

📍 CompostPal says:
------------------------------
📍 4th Avenue Presbyterian Church

🤔 Want to see how I compare to a basic search? (y/n): y

📊 COMPARISON:
🤖 Basic Search Result:
Some popular sites that accept meat and dairy include Blue Apron, Butcher Box, Thrive Market, and FreshDirect. Additionally, many local and regional grocers and specialty food stores may also offer online ordering and delivery options for meats and dairy products. It is always recommended to check the specific website or store's policies and options before placing an order.

🌱 CompostPal Result:
📍 4th Avenue Presbyterian Church

🗽 Your question: quit

🌱 Thanks for using CompostPal! Keep com