# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task


For this task, I chose the `nyc_food_scrap_drop_off_sites` CSV dataset, for the following reasons:

1. I want to see how the LLM analyzes CSV data.
2. I want to test whether the LLM's answers improve when the CSV format is described in the prompt.
3. It exposes the limit on how many tokens can be supplied in the prompt.
4. It tests whether the LLM handles values such as times and days of the week well.
5. This kind of data changes over time, so it is critical to always supply the latest data.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [32]:
import pandas as pd

# Load the dataset from the data folder
df = pd.read_csv('data/nyc_food_scrap_drop_off_sites.csv')


In [33]:
# Check if the required columns are present
required_columns = ['borough', 'food_scrap_drop_off_site', 'location', 'open_months', 'operation_day_hours']
if not all(col in df.columns for col in required_columns):
    raise ValueError("CSV file must contain the following columns: " + ", ".join(required_columns))

# Combine the specified columns into a single 'text' column
df['text'] = df[required_columns].astype(str).agg('|'.join, axis=1)

df = df[['text']]

#df = df[:1]

#df
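
Note that `astype(str)` renders missing values as the literal string `nan`, which then becomes part of the `text` column and of any prompt built from it. The following is a minimal sketch of one way to avoid that, by filling missing values with an empty string before joining. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only; the notebook reads these columns from the CSV
columns = ["borough", "food_scrap_drop_off_site", "location",
           "open_months", "operation_day_hours"]
sample = pd.DataFrame(
    [
        ["Bronx", "Example Site A", "123 Example Street", "Year Round", "24/7"],
        ["Bronx", "Example Site B", None, "Year Round", "24/7"],
    ],
    columns=columns,
)

# Replace missing values with "" so they are not stringified as "nan"
sample["text"] = sample[columns].fillna("").astype(str).agg("|".join, axis=1)

print(sample["text"].tolist())
```

An empty field keeps the column positions intact, so the five-column layout described in the prompt still holds for every row.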

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [8]:
import openai
import tiktoken

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"



In [9]:
def basic_prompt(question):
    return question


In [10]:
def create_custom_prompt_tokenized(question, df, max_token_count):

    # Create a tokenizer; cl100k_base is the encoding used by gpt-3.5-turbo-instruct
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
    
Answer the question based on the context below and if the question
can't be answered based on the context, say "I don't know". 

The context contains information about food scrap drop-off sites with the following columns delimited by |:
1. borough: String - The name of the borough.
2. food_scrap_drop_off_site: String - The name of the food scrap drop-off site.
3. location: String - The address or location of the drop-off site.
4. open_months: String - The months during which the site is open.
5. operation_day_hours: String - The days and hours during which the site operates.
\n\n

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in df["text"].values:
        
        #print(text)
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
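
The budgeting loop in `create_custom_prompt_tokenized` can be illustrated in isolation. The following is a minimal sketch, not the notebook's implementation: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words so the example runs without `tiktoken`, `select_rows` is a hypothetical helper, and the rows are made up. Unlike the function above, it also charges each row for the separator placed between rows.

```python
def count_tokens(text):
    # Hypothetical stand-in for len(tokenizer.encode(text)):
    # counts whitespace-separated words instead of real tokens
    return len(text.split())


def select_rows(rows, question, template_cost, max_token_count,
                separator="\n\n###\n\n"):
    """Return the leading rows that fit within the token budget."""
    used = template_cost + count_tokens(question)
    separator_cost = count_tokens(separator)
    selected = []
    for row in rows:
        # A separator is only needed between rows, not before the first one
        cost = count_tokens(row) + (separator_cost if selected else 0)
        if used + cost > max_token_count:
            break
        used += cost
        selected.append(row)
    return selected


# Made-up rows in the notebook's |-delimited layout
rows = [
    "Bronx|Example Site A|123 Example Street|Year Round|24/7",
    "Queens|Example Site B|456 Example Avenue|Year Round|Fridays 8 AM to 12 PM",
    "Brooklyn|Example Site C|789 Example Road|Year Round|Saturdays 9 AM to 1 PM",
]
question = "Which sites are open on Friday?"

print(select_rows(rows, question, template_cost=10, max_token_count=36))
```

With this budget only the first two rows fit. Rows are taken in file order, so anything past the cutoff never reaches the model regardless of how relevant it is to the question.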




In [11]:
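def create_custom_prompt_not_tokenized(df, question):
    prompt = "Based on the following food scrap drop-off site information:\n\n"

    # Add every row as its own bullet; no token limit is applied here,
    # so a large DataFrame can exceed the model's context window
    for _, row in df.iterrows():
        prompt += f"- {row['text']}\n"
    prompt += "\n"
    prompt += question
    return prompt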



In [15]:
def answer_question(
    prompt, max_answer_tokens=150
):

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
    


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [13]:
question = """
Question: "What are the drop off locations and drop off time for scrap food in Bronx?"
Answer:
"""

In [16]:
print("\n\n\nAnswer-Basic Prompt:\n")        
print(answer_question(basic_prompt(question)))  




Answer-Basic Prompt:

I am unable to locate specific drop off locations and drop off times for scrap food in the Bronx. It is recommended to contact local organizations or composting facilities for more information on their specific drop off locations and hours. Additionally, some farmers markets and community gardens may also accept food scraps for composting.


In [18]:
print("\n\n\nAnswer-Custom Prompt:\n")        
print(answer_question(create_custom_prompt_tokenized(question, df, 1800)))  




Answer-Custom Prompt:

Bronx|SE Corner of Eastburn Avenue & East 174th Street|SE Eastburn Avenue & East 174th Street|Year Round|24/7 and Bronx|SE Corner of Field Place & Morris Avenue|nan|Year Round|24/7


### Question 2

In [39]:
question = """
Question: "What boroughs are available for drop off of scrap food on Friday?"
Answer:
"""

In [30]:
print("\n\n\nAnswer-Basic Prompt:\n")        
print(answer_question(basic_prompt(question)))  




Answer-Basic Prompt:

I'm sorry, I cannot provide specific information on drop off locations. Please contact your local waste management or recycling center for more information on their food scrap collection programs and drop off locations for Fridays.


In [40]:
print("\n\n\nAnswer-Custom Prompt:\n")        
print(answer_question(create_custom_prompt_tokenized(question, df, 1800)))  




Answer-Custom Prompt:

Staten Island.
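
Because the custom prompt only contains the rows that fit in the token budget, it is worth checking the model's answer against the data directly. For instance, the Bronx sites returned in Question 1 are listed as `24/7`, which suggests the answer above may be incomplete. The following is a minimal sketch of such a check on the same `|`-delimited text format. The function `boroughs_open_on_friday` and the sample rows are hypothetical, and the matching rule (treat `24/7`, `daily`, or any mention of `fri` as open on Friday) is an assumption about how `operation_day_hours` is written, not something verified against the full dataset.

```python
def boroughs_open_on_friday(text_rows):
    """Collect boroughs with at least one site whose hours cover Friday."""
    boroughs = set()
    for text in text_rows:
        # Column order follows the notebook:
        # borough|site|location|open_months|operation_day_hours
        fields = text.split("|")
        if len(fields) != 5:
            continue  # skip rows that do not match the expected layout
        borough, hours = fields[0], fields[4].lower()
        # Assumed matching rule; adjust to the real wording in the data
        if "24/7" in hours or "daily" in hours or "fri" in hours:
            boroughs.add(borough)
    return sorted(boroughs)


# Made-up rows for illustration
sample_rows = [
    "Bronx|Example Site A|123 Example Street|Year Round|24/7",
    "Staten Island|Example Site B|nan|Year Round|Fridays (10 AM to 2 PM)",
    "Queens|Example Site C|456 Example Avenue|Year Round|Saturdays (9 AM to 1 PM)",
]

print(boroughs_open_on_friday(sample_rows))
```

Running the same function on `df["text"]` would give a reference answer to compare with the model's output.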
