# Custom Chatbot Project

In this project we use the retrieval-augmented generation (RAG) approach to build a custom chatbot that answers questions about one specific dataset.

The dataset selected for this project is NYC food scrap drop-off sites. The CSV file contains locations, hours, and other information about food scrap drop-off sites in New York City. This information was retrieved in early 2023; the latest version is available here: https://dev.socrata.com/foundry/data.cityofnewyork.us/if26-z6xq. The data source holds the specific details a user is likely to ask about, such as a site's location, the type of food accepted, and its opening and closing times. This data is dynamic and can change often. A pretrained LLM cannot keep up with such changes, because retraining is costly and happens at long intervals. With RAG, importing the new CSV file is enough to have an up-to-date custom chatbot within minutes, at a fraction of the cost of retraining an LLM.


## Data Wrangling

In the cells below, the chosen dataset is loaded into a `pandas` dataframe with a column named `"text"`. This column contains all of the text data, separated into at least 20 rows.

In [121]:
import pandas as pd
df = pd.read_csv('nyc_food_scrap_drop_off_sites.csv')
df = df.fillna("unknown")

##### In the 'text' field we add the description of each column followed by the corresponding value. This gives the model the context it needs to interpret each value in the dataset we feed in.

In [122]:
df['text'] = 'NYC Borough where vendor is located: '+ df['borough'] +' .Neighborhood Tabulation Area Name: ' + df['ntaname']+' .Name of food scrap drop-off location: '+df['food_scrap_drop_off_site']+' .Street address or cross streets associated with food scrap drop-off location: '+df['location']+ ' .Name of the organization that services the food scraps that are dropped off: '+df['hosted_by']+ ' .Months when food scraps can be dropped off at the location: ' +df['open_months']+ ' .Days and hours when food scraps can be dropped off: ' +df['operation_day_hours']+ ' .Website associated with food scrap drop-off location:' +df['website']
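
The long concatenation above pairs each column with a description. As a rough sketch of the same idea, the pairs can be kept in a mapping and joined in a loop. The names `COLUMN_LABELS` and `row_to_text` are hypothetical, the sample record is made up, and only a few of the columns are shown. Plain dictionaries stand in for dataframe rows, and the separator is `". "` rather than the `" ."` used above.

```python
# Hypothetical mapping from column name to the label used in the notebook.
COLUMN_LABELS = {
    "borough": "NYC Borough where vendor is located",
    "ntaname": "Neighborhood Tabulation Area Name",
    "food_scrap_drop_off_site": "Name of food scrap drop-off location",
    "open_months": "Months when food scraps can be dropped off at the location",
}


def row_to_text(row, labels=COLUMN_LABELS, missing="unknown"):
    """Render one record as 'label: value' sentences joined by '. '."""
    parts = []
    for column, label in labels.items():
        value = row.get(column)
        # Mirror df.fillna("unknown"): treat None or empty strings as missing.
        if value is None or value == "":
            value = missing
        parts.append(f"{label}: {value}")
    return ". ".join(parts)


# Made-up sample record for illustration, not taken from the dataset.
sample_row = {
    "borough": "Manhattan",
    "ntaname": "Example Neighborhood",
    "food_scrap_drop_off_site": "Example Drop-off Site",
    "open_months": None,
}
print(row_to_text(sample_row))
```

With pandas, the same function could likely be applied row by row with `df.apply(row_to_text, axis=1)`, since a row `Series` also supports `.get`.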



In [123]:
df = df[['text']]
df.head()

| | text |
|---|---|
| 0 | NYC Borough where vendor is located: Staten Is... |
| 1 | NYC Borough where vendor is located: Manhattan... |
| 2 | NYC Borough where vendor is located: Brooklyn ... |
| 3 | NYC Borough where vendor is located: Manhattan... |
| 4 | NYC Borough where vendor is located: Queens .N... |


In [137]:
# openai API key here
import openai
openai.api_key = "YOUR API KEY"


In [125]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
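
The loop above sends the rows to the embedding endpoint in slices of `batch_size`. The sketch below isolates that batching pattern so it can be checked without an API call. `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` returns a text length instead of a real embedding.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on consecutive slices of texts and collect results in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_embeddings = embed_fn(batch)
        # Guard against a provider returning fewer vectors than inputs.
        if len(batch_embeddings) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        embeddings.extend(batch_embeddings)
    return embeddings


def fake_embed(batch):
    """Stand-in for the API call: a one-element 'vector' holding the text length."""
    return [[float(len(text))] for text in batch]


texts = [f"row {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors))  # one vector per input text, from batches of 100, 100 and 50
```

In the notebook, `embed_fn` would wrap the `openai.Embedding.create` call and return the `"embedding"` entries from `response["data"]`.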

In [126]:
import tiktoken
import numpy as np
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
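
To make the ranking step concrete, here is a small sketch of cosine distance, assuming the usual definition of one minus the cosine similarity. The helper names and the two-dimensional vectors are invented for illustration; real embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_relevance(question_vec, rows):
    """Sort (text, vector) pairs so the smallest distance comes first."""
    scored = [(cosine_distance(question_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0])
    return scored


# Toy vectors, invented for illustration only.
question_vec = [1.0, 0.0]
rows = [
    ("orthogonal", [0.0, 1.0]),
    ("same direction", [2.0, 0.0]),
    ("opposite", [-1.0, 0.0]),
]
for distance, text in rank_by_relevance(question_vec, rows):
    print(f"{distance:.2f}  {text}")
```

The vector pointing the same way as the question has distance 0 and is ranked first, regardless of its length. This is why sorting in ascending order puts the most relevant rows at the top.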



## Custom Query Completion

Compose a custom query using the chosen dataset and retrieve results from an OpenAI `Completion` model.

In [127]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know". Be as specific as possible, for example about the type of food accepted,
the available drop-off times, and the organization that hosts the site.

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
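
The context selection in `create_prompt` is a greedy fill: rows are added in ranked order until the next one would exceed the token budget. The sketch below shows that behavior with a stand-in tokenizer that counts whitespace-separated words, so the numbers are not real token counts. `count_tokens` and `select_context` are hypothetical names.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def select_context(ranked_texts, fixed_cost, max_token_count):
    """Keep rows in ranked order until the next one would exceed the budget."""
    used = fixed_cost  # cost of the prompt template plus the question
    selected = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_token_count:
            break  # stop at the first row that does not fit
        used += cost
        selected.append(text)
    return selected, used


ranked_texts = [
    "site A open all day",           # 5 words
    "site B open weekday mornings",  # 5 words
    "site C",                        # 2 words
]
selected, used = select_context(ranked_texts, fixed_cost=4, max_token_count=12)
print(selected, used)
```

Only the first row is kept here: the second would bring the total to 14, and the loop stops there even though the shorter third row would still fit. The notebook's loop behaves the same way. It also does not count the `###` separators inserted between rows, so its token count is an approximation.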

## Custom Performance Demonstration

In the cells below, the performance of the custom query is demonstrated with 2 questions. For each question, we show the answer from a basic `Completion` model query as well as the answer from the custom query.

### Question 1

In [105]:
question_1 = """
Question: I live in Manhattan. Where can I drop off my meat scrapes?
Answer: 
"""

#### Not customized answer

In [106]:
answer1_not_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_1,
    max_tokens=150
)
answer1_not_customized["choices"][0]["text"]

"There are several options for dropping off meat scrapes in Manhattan. You can contact your local community composting program to see if they accept meat scrapes for composting. You can also check with your local butcher or grocery store to see if they have a food scrap drop-off program. Additionally, some farmers' markets in Manhattan may accept meat scrapes for composting."

#### Customized answer

In [107]:
answer1_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=create_prompt(question_1, df, 1800),
    max_tokens=150
)
answer1_customized["choices"][0]["text"]

' The Department of Sanitation services food scraps dropped off at the NE Corner of Amsterdam Avenue & W 133 Street in Manhattan, which is available 24/7.'

### Question 2

In [134]:
question_2 = """
Question: I live in Manhattan. Where can I drop off my scrapes after 10 pm?
Answer: 
"""

#### Not customized answer

In [135]:
answer2_not_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_2,
    max_tokens=150
)
answer2_not_customized["choices"][0]["text"]

'Unfortunately, most public recycling centers and scrap yards in Manhattan are typically closed after 10 pm. You may need to wait until the next morning to drop off your scrapes at a designated recycling center or scrap yard. Alternatively, you can check with private scrap buyers or junk removal companies in the area to see if they offer after-hours drop-off options.'

#### Customized answer

In [136]:
answer2_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=create_prompt(question_2, df, 1800),
    max_tokens=150
)
answer2_customized["choices"][0]["text"]

" I don't know, as none of the given contexts specify drop-off locations that are open after 10 pm. All given locations have a 24/7 drop-off policy, but it is unclear if the bins are accessible after 10 pm. It is best to check the specific websites or contact the organization for more information on drop-off hours."