##### Vanessa Trujillo
###### Submission date 3/21/24
###### Udacity Generative AI Nanodegree project: Build Your Own Custom Chatbot

# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

**Explanation:**
Wikipedia is a strong data source for a custom chatbot. It covers a very wide range of topics, from historical events and scientific advances to pop culture, and each article is organized into paragraphs that are easy to extract and pass to the model as context. With that text available, the chatbot can give informative, grounded answers to a wide variety of questions, much like having an encyclopedia on hand during the conversation.

I made my custom chatbot dynamic to support nearly any topic you can directly look up on Wikipedia! :)

I have hardcoded my 2 questions about the **Moon**, so ***please use "Moon" as your initial search***, although any topic may be searched.
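
As a rough illustration of how a free-form search term can map onto an article URL, here is a minimal sketch. It mirrors the `string.capwords` and underscore-joining approach used later in this notebook; the helper name `build_wikipedia_url` and the percent-encoding step are assumptions added for illustration, not part of the original code.

```python
import string
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # same base URL the notebook uses


def build_wikipedia_url(search_input: str) -> str:
    """Turn free-form search text into a Wikipedia article URL (hypothetical helper)."""
    # capwords trims extra whitespace and capitalizes each word;
    # spaces then become underscores, as in Wikipedia article paths.
    title = string.capwords(search_input).replace(" ", "_")
    # Percent-encode anything that is not safe in a URL path (assumed extra step).
    return WIKI_BASE + quote(title, safe="_()")


print(build_wikipedia_url("moon"))
print(build_wikipedia_url("  apollo   11 "))
```

Because `capwords` also lower-cases the rest of each word, a search such as "NASA" becomes "Nasa", so this sketch relies on Wikipedia redirects in the same way the notebook does.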

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.
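
One way to meet the 20-row minimum when an article yields only a few paragraphs is to repeat the existing rows in order. The sketch below is a plain-Python illustration under that assumption; `pad_rows` and `MIN_ROWS` are hypothetical names, and unlike the notebook's own function it keeps every row when there are already 20 or more.

```python
from itertools import cycle, islice
from typing import List

MIN_ROWS = 20  # minimum row count from the project instructions


def pad_rows(texts: List[str], min_rows: int = MIN_ROWS) -> List[str]:
    """Repeat existing rows in order until there are at least `min_rows` of them."""
    if not texts:
        # Nothing to repeat: fail loudly instead of looping forever.
        raise ValueError("no text rows were extracted")
    if len(texts) >= min_rows:
        return list(texts)
    return list(islice(cycle(texts), min_rows))


rows = pad_rows(["first paragraph", "second paragraph", "third paragraph"])
print(len(rows))
print(rows[:4])
```

The resulting list can be passed to `pd.DataFrame({"text": rows})` to produce the required `"text"` column.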

In [None]:
# Dependencies (OpenAI dependencies are included in requirements.txt)
import requests
import string
from bs4 import BeautifulSoup
import pandas as pd
from pathlib import Path 
from openai import OpenAI
from typing import List, Union
from scipy.spatial.distance import cosine

In [None]:
# Define file paths
CSV_FILEPATH_WITH_EMBEDDINGS = 'details_with_embeddings.csv'

# OpenAI API Key
OPENAI_API_KEY = 'YOUR API KEY'
client = OpenAI(api_key=OPENAI_API_KEY)

# OpenAI Models
EMBEDDING_MODEL = 'text-embedding-3-small'
COMPLETION_MODEL = 'gpt-3.5-turbo'

In [None]:
# Wikipedia scraping function, Attributed to the Udacity course code & Wikipedia
def fetch_wikipedia_page(url: str) -> str:
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'}
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        return r.text
    else:
        print(r.status_code)
        return ""

def wikiscrape(url: str) -> None:
    """Print the infobox rows and up to 99 paragraphs of a page, then save its HTML."""
    # Reuse the fetch helper above so the request carries the same headers
    soup = BeautifulSoup(fetch_wikipedia_page(url), 'html.parser')

    # Print each infobox row as "heading  ::  value"
    for table in soup('table', {'class': 'infobox'}):
        for row in table.find_all('tr'):
            headings = row.find_all('th')
            values = row.find_all('td')
            for heading, value in zip(headings, values):
                print("{}  ::  {}".format(heading.text, value.text))
                print("~~~~~~~~~~~~~~~~~~~~~~")

    # Print the available <p> tags, limited to a maximum of 99 paragraphs
    for paragraph in soup('p')[:99]:
        print(paragraph.text)

    # Save the Wikiscrape results as an HTML file
    with open("wikiscrape_results.html", "w", encoding="utf-8") as file:
        file.write(str(soup))

In [None]:
# Function to extract text data from HTML content, Attributed to the Udacity course code
def extract_text_data(html_content: str) -> pd.DataFrame:
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')

    data = []
    for paragraph in paragraphs:
        text_sample = paragraph.text.strip()
        if text_sample:
            data.append({"text": text_sample})

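    # Nothing was extracted (e.g. the request failed): return an empty frame
    # instead of looping forever in the duplication step below
    if not data:
        return pd.DataFrame(columns=["text"])

    # If there are fewer than 20 samples, duplicate them to reach 20 rows
    while len(data) < 20:
        data += data

    # Create a DataFrame from the first 20 rows
    df = pd.DataFrame(data[:20])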

    return df

# Function to save HTML content to a file
def save_to_html(html_content: str, filepath: str):
    with open(filepath, "w", encoding="utf-8") as file:
        file.write(html_content)

In [None]:
# Prompt the user for search input
search_input = input("Search: ")
search_query = string.capwords(search_input)
search_words = search_query.split()
search_term = "_".join(search_words)

# Construct Wikipedia URL
url = "https://en.wikipedia.org/wiki/" + search_term

# Fetch page content
page_content = fetch_wikipedia_page(url)

# Save the Wikiscrape results as an HTML file
html_file_path = "wikiscrape_results.html"
save_to_html(page_content, html_file_path)

# Extract data from HTML
result_df = extract_text_data(page_content)

In [None]:
# Reviewing data collected
wikiscrape(url)  # prints its own output and returns None, so no outer print() is needed

In [None]:
# Reset display options for pandas DataFrame
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

df = pd.DataFrame(result_df, columns=['text'])
df

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.
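
The retrieval step in the cells below ranks rows by cosine distance between each row's embedding and the question's embedding. The sketch here shows that ranking in plain Python, without any API calls. The three-dimensional vectors are made up for illustration only (real embeddings are much longer and come from the embedding model), and `cosine_distance` and `top_n` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_n(question_vec: Sequence[float],
          rows: List[Tuple[str, Sequence[float]]],
          n: int = 2) -> List[str]:
    """Return the texts of the n rows closest to the question vector."""
    ranked = sorted(rows, key=lambda row: cosine_distance(row[1], question_vec))
    return [text for text, _ in ranked[:n]]


# Toy "embeddings", invented for this example.
rows = [
    ("The Moon formed about 4.5 billion years ago.", [0.9, 0.1, 0.0]),
    ("Lunar phases repeat roughly every 29.5 days.", [0.1, 0.9, 0.0]),
    ("The Moon's surface is covered in regolith.", [0.0, 0.1, 0.9]),
]
question_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of "How old is the Moon?"
print(top_n(question_vec, rows))
```

The closest texts are what get joined into the context passed to the model alongside the question.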

In [None]:
# Function to get embeddings
def get_embeddings(prompt: Union[str, List[str]], embedding_model: str) -> List[List[float]]:
    response = client.embeddings.create(
            input=prompt if type(prompt) is list else [prompt],
            model=embedding_model
    )
    return [row.embedding for row in response.data]

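# Function to create embeddings for a Series of texts, one API call per batch
def create_embeddings(texts: pd.Series, embedding_model_name: str = EMBEDDING_MODEL, batch_size: int = 25) -> List[List[float]]:
    output = []
    # Use the batch_size argument rather than a global defined in a later cell
    for idx in range(0, len(texts), batch_size):
        batch = texts.iloc[idx:idx+batch_size].tolist()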
        embeddings = get_embeddings(batch, embedding_model_name)
        output.extend(embeddings)

    return output

In [None]:
BATCH_SIZE = 50

# Load DataFrame with text data
df = pd.DataFrame(result_df, columns=['text'])

# Create embeddings for text data
df['embedding'] = create_embeddings(df['text'], batch_size=BATCH_SIZE)

# Save DataFrame to CSV with embeddings
df.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS, sep=',', index=False)

In [None]:
# Reset display options for pandas DataFrame
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [None]:
# Function to build basic prompt
def build_basic_prompt(question: str):
    return [
        {
            'role': 'user',
            'content': question
        }
    ]

# Function to build custom prompt
def build_custom_prompt(question: str, database_df):
    return [
        {
            'role': 'system',
            'content': """
            Context: 
                {}
            """.format('\n\n'.join(build_custom_context(question, database_df)))
        },
        {
            'role': 'user',
            'content': question
        }
    ]

In [None]:
# Function to build custom context
def build_custom_context(question: str, database_df: pd.DataFrame, n: int = 5) -> List[str]:
    question_embedding = get_embeddings(question, EMBEDDING_MODEL)[0]
    
    df = database_df.copy()
    df["distances"] = df['embedding'].apply(lambda embedding: cosine(embedding, question_embedding))

    df.sort_values("distances", ascending=True, inplace=True)
    return df.iloc[:n]['text'].tolist()

# Function to handle question
def handle_question(prompt, model_name: str = COMPLETION_MODEL):
    response = client.chat.completions.create(
        model=model_name,
        messages=prompt,
        max_tokens=100
    )
    return response.choices[0].message.content

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
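
To keep the basic and custom answers side by side for each question, a small comparison helper can be useful. The sketch below is one possible shape for it, not part of the original notebook: `compare_answers` is a hypothetical name, and the two stand-in functions replace the real model calls so the example runs without an API key.

```python
from typing import Callable, Dict


def compare_answers(
    question: str,
    ask_basic: Callable[[str], str],
    ask_custom: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the same question both ways and return the answers keyed by label."""
    return {
        "question": question,
        "basic": ask_basic(question),
        "custom": ask_custom(question),
    }


# Stand-in callables so the sketch runs without calling the API.
def fake_basic(question: str) -> str:
    return "basic answer to: " + question


def fake_custom(question: str) -> str:
    return "custom answer to: " + question


result = compare_answers("How old is the Moon?", fake_basic, fake_custom)
for label, text in result.items():
    print(f"{label}: {text}")
```

In this notebook the stand-ins could be swapped for `lambda q: handle_question(build_basic_prompt(q))` and `lambda q: handle_question(build_custom_prompt(q, df))`.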

### Question 1

In [None]:
question = 'How old is the Moon?'
print('Basic Completion Model Answer:', handle_question(build_basic_prompt(question)))


In [None]:
print('Custom Query Answer:', handle_question(build_custom_prompt(question, df)))

### Question 2

In [None]:
question = 'What are the cycles of the Moon?'
print('Basic Completion Model Answer:', handle_question(build_basic_prompt(question)))

In [None]:
print('Custom Query Answer:', handle_question(build_custom_prompt(question, df)))

### Additional Questions, Dynamic Prompting 

In [None]:
# Loop for asking questions
while True:
    # Prompt the user for a question
    user_question = input("Ask a question: ")

    # Get context for the user's question
    context = build_custom_context(user_question, df)

    if not context:
        print("I'm sorry, I don't have enough information to answer that question.")
    else:
        # Call the function to handle the user's question
        print('Basic Completion Model Answer:', handle_question(build_basic_prompt(user_question)))
        print('Custom Query Answer:', handle_question(build_custom_prompt(user_question, df)))

    # Ask if the user wants to ask another question
    additional_question = input("Would you like to ask another question? (yes/no): ")
    if additional_question.strip().lower() in ('no', 'n'):
        print("Thank you for your inquiries!")
        break  # Break out of the loop only if the user says "no" (or "n") to additional questions