# Custom Chatbot Project

## Why Using Course Notes on Dieting Is Appropriate for a Custom OpenAI Chatbot

- **Relevance and Specificity**: 
    Course notes, by their nature, are structured and condensed summaries of comprehensive topics. Using course notes on dieting ensures the chatbot will be equipped with information that is both relevant and focused on key aspects of dieting.

- **Authority and Reliability**: 
    Assuming these course notes come from a recognized educational institution or a reputable course, they represent a trusted source of information. A chatbot's answers are only as reliable as the data it draws on.

- **Complexity and Depth**: 
    Dieting is not just about losing weight; it's about nutrition, understanding different body types, metabolic rates, the role of exercise, and more. Course notes are likely to provide a broad yet detailed overview, making them well suited as the knowledge base for a chatbot that must answer diverse questions on the topic.

- **Structure and Organization**: 
    Course notes are typically well-organized, often broken down into sections, sub-sections, and bullet points. This organized structure can help the model generate clear and concise answers based on the way the information is presented.

- **Updated Information**: 
    Academic courses are often revised to include new research findings and methodologies, so their notes are more likely to reflect recent and relevant information.

- **Consistency**: 
    Course notes maintain a consistent style and tone. This helps keep the chatbot's responses uniform in tone and style, making interactions feel more coherent.


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [4]:
import re
import requests
import openai
import tiktoken
import pandas as pd

openai.api_key = "YOUR API KEY"

### 0. Getting the data

In [5]:
with requests.get("https://raw.githubusercontent.com/yfe404/nutrition-course-notes/master/readme.org", stream=True) as response:
    response.raise_for_status()  # Raise an exception for HTTP errors
    
    with open("notes.org", 'wb') as fd:
        for chunk in response.iter_content(chunk_size=8192):
            fd.write(chunk)

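with open("notes.org", "r", encoding="utf-8") as fd:
    # read() keeps the file's own line breaks; joining readlines() with "\n"
    # would double every newline because each line already ends with one
    file_content = fd.read()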

### 1. Preparing the data

In [6]:
# Regular expression pattern to capture titles and the chunks between them
pattern = r'\*\* (.*?)\n(.*?)(?=\*\* |\Z)'

# Find all matches
matches = re.findall(pattern, file_content, re.DOTALL)

# Create dataframe
df = pd.DataFrame(matches, columns=['title', 'text'])

df['text'] = df['title'] + '\n' + df['text']

# Drop the individual 'title' and 'text' columns
df = df[['text']]
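
To make the splitting step easier to follow, here is a minimal sketch that applies the same pattern to a tiny, made-up Org-mode sample. The sample text and the `chunks` name are assumptions for illustration only and are not taken from the actual notes.

```python
import re

# Hypothetical sample in the same Org-mode style as the notes:
# level-2 headings start with "** ".
sample = (
    "* Nutrition\n"
    "** Calorie balance\n"
    "Calories in vs. calories out.\n"
    "** Macronutrients\n"
    "Protein, carbohydrates and fats.\n"
)

# Same pattern as in the cell above: a title line, then everything up to
# the next "** " marker or the end of the string.
pattern = r'\*\* (.*?)\n(.*?)(?=\*\* |\Z)'
matches = re.findall(pattern, sample, re.DOTALL)

# Each match is a (title, body) pair; join them the way the dataframe step does.
chunks = [title + "\n" + body for title, body in matches]
for chunk in chunks:
    print(repr(chunk))
```

Note that the pattern is not anchored to the start of a line, so a deeper heading such as `*** ` also contains `** ` and would start a new chunk. Whether that matters depends on how the notes are structured.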

### 2. Add embeddings

In [7]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
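
The batching logic in the loop above can be checked without calling the API. The sketch below isolates it in a helper; `batched` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in that returns a dummy vector per text.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    # Hypothetical stand-in for the embeddings request:
    # returns one 2-number vector per input text.
    return [[float(len(text)), 1.0] for text in texts]


texts = [f"row {i}" for i in range(250)]

embeddings = []
for batch in batched(texts, 100):
    embeddings.extend(fake_embed(batch))

# 250 rows with a batch size of 100 give two full batches and one partial batch.
batch_sizes = [len(batch) for batch in batched(texts, 100)]
print(batch_sizes)
print(len(embeddings) == len(texts))
```

The final check matters because assigning the list to a dataframe column requires exactly one embedding per row.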

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [8]:
# Find the rows of data with the shortest cosine distance from the query

from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
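
The ranking above depends on cosine distance. As a rough illustration, the sketch below computes it by hand on toy 2-D vectors, assuming the usual definition of cosine distance as one minus cosine similarity. The `cosine_distance` helper and the toy vectors are illustrative and are not part of the `openai` library.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings": only the direction matters, not the length.
query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance puts the most relevant row first.
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
print(ranked)
```

Under this definition the distance ranges from 0 (same direction) to 2 (opposite direction), which is why an ascending sort puts the best match first.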

In [22]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
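
The context-packing loop can be tried in isolation. The sketch below swaps the real tokenizer for a word counter so it runs without `tiktoken`; `count_tokens` and `pack_context` are hypothetical helpers, and word counts only approximate real token counts.

```python
def count_tokens(text):
    # Hypothetical stand-in for len(tokenizer.encode(text)): counts words.
    return len(text.split())


def pack_context(rows, budget, used=0):
    """Add rows in order until the next one would exceed `budget` tokens.

    `used` is the number of tokens already spent on the template and question.
    """
    context = []
    for text in rows:
        used += count_tokens(text)
        if used > budget:
            break
        context.append(text)
    return context


# Rows are assumed to be sorted from most to least relevant already.
rows = [
    "calories in versus calories out",                          # 5 words
    "protein carbohydrates and fats",                           # 4 words
    "a much longer row about nutrient timing around workouts",  # 9 words
    "supplements",                                              # 1 word
]

selected = pack_context(rows, budget=12, used=2)
print(selected)
```

With a budget of 12 and 2 tokens already used, the first two rows fit and the third does not. The loop stops there, so the short fourth row is never considered even though it would have fit on its own. This keeps the context in strict relevance order at the cost of sometimes leaving budget unused.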

In [36]:
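def ask_question(question: str, prompt_args: dict = None, prompt_function=lambda x: x):
    # Use None instead of a mutable default argument; fall back to an empty dict
    prompt_args = prompt_args or {}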
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_function(question, **prompt_args),
        max_tokens=150
    )

    return response["choices"][0]["text"].strip()
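
How `ask_question` switches between a basic and a custom query comes down to the `prompt_function(question, **prompt_args)` call. The sketch below mirrors only that dispatch, with no API request; `build_prompt` and `toy_prompt` are hypothetical names used for illustration.

```python
def build_prompt(question, prompt_args=None, prompt_function=lambda q: q):
    """Mirror how the question and extra arguments become the final prompt."""
    return prompt_function(question, **(prompt_args or {}))


def toy_prompt(question, notes):
    # Hypothetical prompt builder: the question comes first, and the
    # remaining parameters are filled from the prompt_args dictionary.
    return f"Context: {notes}\nQuestion: {question}"


# Basic query: the default function returns the question unchanged.
basic = build_prompt("What is fiber?")

# Custom query: the dictionary keys must match the builder's parameter names.
custom = build_prompt(
    "What is fiber?",
    {"notes": "Fiber is a carbohydrate."},
    toy_prompt,
)

print(basic)
print(custom)
```

Because the dictionary is unpacked as keyword arguments, a key that does not match a parameter of the prompt function raises a `TypeError`.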

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [37]:
question="What is more important for weight loss ?"
ask_question(question)

'Exercise and diet are equally important for weight loss. You need to create a calorie deficit by both eating fewer calories and burning more calories through physical activity in order to lose weight. Making lifestyle changes such as increasing your physical activity and eating healthy, balanced meals can help you achieve your weight loss goals.'

In [38]:
ask_question(question, {"df": df, "max_token_count": 150}, create_prompt)

'Calories in vs. calories out.'

### Question 2

In [39]:
question="Rank the most important factors in successful dieting"
ask_question(question)

'1. Exercise \n2. Healthy Eating Habits \n3. Discipline \n4. Support System \n5. Achievable Goals \n6. Tracking Progress \n7. Avoiding Temptations \n8. Positive Self-Talk'

In [40]:
ask_question(question, {"df": df, "max_token_count": 150}, create_prompt)

'1. Calorie Balance 2. Macronutrient Amounts 3. Nutrient Timing 4. Food Composition 5. Supplements'

### Interactive Mode

In [None]:
while True:
    question = input("Please type your question (or type 'exit' to quit): ")
    
    if question.lower() == 'exit':
        break
    
    # Build a prompt from the most relevant notes and print the model's answer.
    print(ask_question(question, {"df": df, "max_token_count": 150}, create_prompt))


Please type your question (or type 'exit' to quit):  When should I eat carbs, when shouldn't I? 


According to your workout schedule.


Please type your question (or type 'exit' to quit):  What can I eat before a workout?


It depends on what type of workout you will be doing and your personal dietary choices. It is generally recommended to eat a snack with complex carbohydrates and proteins about an hour or two before your workout for sustained energy.


Please type your question (or type 'exit' to quit):  What should I eat during the workout?


I don't know.


Please type your question (or type 'exit' to quit):  What should I eat after the workout?


I don't know.


Please type your question (or type 'exit' to quit):  What is vitamin A useful for? 


Vitamin A is useful for maintaining eyesight and preventing night blindness, improving the immune system, helping with fat storage and protecting against infections, growing new cells, lowering cholesterol and reducing the risk of heart disease.


Please type your question (or type 'exit' to quit):  How to know if I need more vitamin A ?


Consult a healthcare professional to know if you need more vitamin A.
