# Custom Chatbot Project

### Explanation of why this data was selected

ChatGPT took the world by storm and demonstrated what LLMs can do. OpenAI, the company behind ChatGPT, has a remarkable history. As ChatGPT became more famous and widely adopted, OpenAI came into the limelight for both good and bad reasons. So it is natural to want to follow what has happened at OpenAI and what the company is currently working on.

However, the current LLM will not provide these details, because it was not trained on the most recent data. Hence, I am using the OpenAI Wikipedia page, which is updated as new information about OpenAI is published. Since the page will keep being updated as more OpenAI news comes out, using it lets us keep following OpenAI along its journey.


https://en.wikipedia.org/wiki/OpenAI

## Importing all the required libraries

In [1]:
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import pandas as pd
from dateutil.parser import parse
import tiktoken


openai.api_key = "YOUR API KEY"

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [2]:
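def fetch_page(url: str):
    """Fetch a page with a GET request and return its text on success."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    }
    # Send the browser-like headers along with the GET request
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        return r.text
    else:
        # Return the status code so the caller can see why the fetch failed
        return r.status_code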

In [3]:
# Get the plain-text extract of the OpenAI Wikipedia page via the MediaWiki API
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "OpenAI",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

In [4]:
text_data = response_dict['query']['pages'][0]['extract'].split("\n")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = text_data

# Clean up dataframe to remove empty lines and headings
df = df[(
    (df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))
)].reset_index(drop=True)

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because the row returned by iterrows() may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df

print('There are ' + str(len(df)) + ' rows in the dataset') 
print('')

df.head()

There are 146 rows in the dataset



| Unnamed: 0 | text |
|---|---|
| 0 | – OpenAI is an American artificial intelligen... |
| 1 | – The organization consists of the non-profit... |
| 2 | – In November 2023, OpenAI's board removed Sa... |
| 3 | – In December 2015, OpenAI was founded by Sam... |
| 4 | – According to Wired, Brockman met with Yoshu... |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Generating Embeddings

We'll use OpenAI's `Embedding` API, as described in the OpenAI documentation, to create a vector representing each row of our custom dataset.

To avoid a `RateLimitError`, we'll send our data to the `Embedding.create` function in batches.

In [5]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.to_csv("embeddings.csv")

df

| Unnamed: 0 | text | embeddings |
|---|---|---|
| 0 | – OpenAI is an American artificial intelligen... | [-0.019855355843901634, -0.023019928485155106,... |
| 1 | – The organization consists of the non-profit... | [-0.006303194910287857, -0.03769782558083534, ... |
| 2 | – In November 2023, OpenAI's board removed Sa... | [-0.008517647162079811, -0.04329528659582138, ... |
| 3 | – In December 2015, OpenAI was founded by Sam... | [0.00842413678765297, -0.029996534809470177, -... |
| 4 | – According to Wired, Brockman met with Yoshu... | [-0.011349588632583618, -0.01666202023625374, ... |
| ... | ... | ... |
| 141 | – OpenAI on X | [0.018376905471086502, -0.02012915536761284, -... |
| 142 | – OpenAI on Instagram | [-0.024862777441740036, -0.007472761906683445,... |
| 143 | – OpenAI on YouTube | [-0.016021838411688805, -0.023299461230635643,... |
| 144 | – "What OpenAI Really Wants" by Wired | [-0.00038224150193855166, -0.00625555776059627... |
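
Since the dataframe is written to `embeddings.csv`, it is worth noting that a CSV file stores each embedding list as a string. The sketch below is not part of the original notebook; it shows one possible way to restore the lists with `ast.literal_eval`, using an in-memory buffer and toy two-dimensional vectors as stand-ins for the real file and the real embeddings.

```python
import ast
import io

import pandas as pd

# Toy dataframe standing in for the real one (hypothetical 2-d vectors)
toy_df = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
})

# Write to an in-memory buffer instead of "embeddings.csv"
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of adding an "Unnamed: 0" column
loaded_df = pd.read_csv(buffer, index_col=0)

# After loading, each embedding is a string such as "[0.1, 0.2]";
# literal_eval turns it back into a list of floats
loaded_df["embeddings"] = loaded_df["embeddings"].apply(ast.literal_eval)
```

With the real file, the same two steps (`pd.read_csv` followed by `apply(ast.literal_eval)`) would presumably be enough before passing the column to a distance function.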


### Create a function to find the text closest to the question

In [6]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
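
To make the ranking step concrete, here is a minimal sketch of cosine distance computed directly with NumPy. The helper name `cosine_distance` and the toy vectors are illustrative assumptions and are not part of `openai.embeddings_utils`; cosine distance is taken here as 1 minus cosine similarity.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)

# Toy 2-d vectors standing in for real embedding vectors
question_vec = [1.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more relevant, so sort in ascending order
ranked = sorted(
    row_vecs,
    key=lambda name: cosine_distance(question_vec, row_vecs[name]),
)
```

Vector length does not affect the result, which is why the row pointing in the same direction as the question ranks first even though it is twice as long.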

### Create a function for the text prompt

In [7]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
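
The token budget logic in `create_prompt` can be illustrated in isolation. The sketch below uses a whitespace word count as a rough stand-in for the `tiktoken` tokenizer, so its counts are only an approximation; `count_tokens` and `pack_context` are hypothetical names introduced for this example.

```python
def count_tokens(text):
    """Rough stand-in for len(tokenizer.encode(text)): counts words."""
    return len(text.split())

def pack_context(rows, base_token_count, max_token_count):
    """Greedily keep rows (already sorted by relevance) that fit the budget."""
    context = []
    current_token_count = base_token_count
    for text in rows:
        current_token_count += count_tokens(text)
        if current_token_count > max_token_count:
            # Stop at the first row that would exceed the budget
            break
        context.append(text)
    return context

rows = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]

# Budget of 10 with 4 already used by the template and question:
# 4 + 3 = 7 fits, 7 + 2 = 9 fits, 9 + 4 = 13 does not
selected = pack_context(rows, base_token_count=4, max_token_count=10)
```

Because the loop stops at the first row that does not fit, a long but highly relevant row can prevent shorter rows further down the ranking from being included; that is a property of this greedy approach.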

### Create Function that answers questions

In [8]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
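
The `openai.Completion.create` call above belongs to the pre-1.0 interface of the `openai` Python package. As a hedged sketch, assuming version 1.0 or later is installed, a similar request would go through a client object. The function name `complete_with_client` is hypothetical, and the client is passed in as a parameter so that the example can be exercised with a stand-in object rather than a real API call.

```python
def complete_with_client(
    client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150
):
    """
    Send a prompt through an openai>=1.0 style client and return the text.

    `client` is assumed to be created elsewhere, for example:
        from openai import OpenAI
        client = OpenAI(api_key="YOUR API KEY")

    If the request raises an error, print it and return an empty string.
    """
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        # In the 1.0+ client, responses are objects rather than dicts
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""
```

A call such as `complete_with_client(client, create_prompt(question, df, 1800))` would then play the same role as `answer_question`, under the assumptions stated above.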

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [9]:
altman_departure_question = "Why was Altman removed as the CEO?"

altman_departure_prompt = """
Question: {}
Answer:
""".format(altman_departure_question)

### Model's answer without context

In [11]:
initial_altman_departure_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=altman_departure_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_altman_departure_answer)

There is no specific reason mentioned for Altman's removal as CEO. It is possible that it was due to his performance or a decision made by the company's board of directors. Other factors such as conflict of interest, financial troubles, or lack of leadership skills may have also played a role. Without more information, it is impossible to determine the exact reason for his removal.


### Model's answer with context from custom dataset

In [12]:
custom_altman_departure_answer = answer_question(altman_departure_question, df)
print(custom_altman_departure_answer)

The board of directors cited a lack of confidence in him.


### Question 2

In [13]:
apple_intelligence_question = "Which company did OpenAI partner at WWDC 2024 and for what reason?"

apple_intelligence_prompt = """
Question: {}
Answer:
""".format(apple_intelligence_question)

### Model's answer without context

In [14]:
initial_apple_intelligence_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=apple_intelligence_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_apple_intelligence_answer)

At WWDC 2024, OpenAI announced a partnership with Apple. The collaboration was focused on developing advanced artificial intelligence systems that could be integrated into Apple's products and services. This partnership aimed to enhance user experience and provide more personalized and intelligent features in Apple's devices and applications. The two companies also planned to work together on ethical and responsible AI practices, ensuring the technology is used for the benefit of society.


### Model's answer with context from custom dataset

In [15]:
custom_apple_intelligence_answer = answer_question(apple_intelligence_question, df)
print(custom_apple_intelligence_answer)

Apple Inc., to bring ChatGPT features to Apple Intelligence.


### Enhanced performance post RAG

As we can see from both questions, `gpt-3.5-turbo-instruct` could not answer either question correctly on its own, as it was not trained on the newly available data. Without context, the model either speculates or hallucinates.

1. For the first question, about why Sam Altman was ousted:
    a. In the initial answer (without context), the model has no idea why Sam Altman was removed, so it speculates that it could have been due to his performance.
    b. However, once we provided the context, the model was able to find the correct answer: he was removed because the board lacked confidence in him.
    
2. For the second question, about which company OpenAI partnered with at WWDC and for what reason:
    a. The model correctly predicted Apple. However, WWDC is Apple's own conference, so Apple was the most probable guess. The model then started hallucinating about the reason for the partnership: the two companies never discussed ethical use of AI or the other reasons given.
    b. Once the context was provided, the model correctly answered that the partnership was to bring ChatGPT features to Apple Intelligence.
    
So we can see clearly from the above examples that the model answers both questions correctly once RAG supplies the relevant context.