Assignment 3

# Build a Question-and-Answer ChatBot from PDF Documents

* Arxiv AI papers 


## Task 1: Embedding on Arxiv AI papers

* The pre-processing and embedding of the papers was done in '2.3.1_DataPreparationEmbedding.ipynb'



#### Set up Azure OpenAI with API version "2023-09-15-preview" (used here for Embedding only; ChatCompletion is configured separately in Task 4)

In [1]:
import os
import openai
from dotenv import load_dotenv

# Set up Azure OpenAI
load_dotenv()

# openai.api_type = "azure"
# openai.api_base = "" # Api base is the 'Endpoint' which can be found in Azure Portal where Azure OpenAI is created. It looks like https://xxxxxx.openai.azure.com/
# openai.api_version = "2023-07-01-preview"
# openai.api_key = "" # Or os.getenv("OPENAI_API_KEY") using local .env file. For more details, please see https://github.com/theskumar/python-dotenv

openai.api_type = "azure"
# openai.api_version = "2023-07-01-preview"
openai.api_version = "2023-09-15-preview"
API_KEY = os.getenv("OPENAI_API_KEY","").strip()
assert API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = API_KEY
RESOURCE_ENDPOINT = os.getenv("OPENAI_API_ENDPOINT","").strip()
assert RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = RESOURCE_ENDPOINT
# Deployment for embedding
DEPLOYMENT_NAME_EMBEDDING = os.getenv('DEPLOYMENT_NAME_EMBEDDING')
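
The setup cell validates its environment variables with `assert`, which Python skips when run with the `-O` flag. As a minimal sketch of an alternative, the hypothetical helper `require_env` below (not part of the original notebook) raises an explicit `ValueError` instead. The variable names in the usage lines are made up for the demonstration.

```python
import os


def require_env(name, must_contain=None):
    """Return the stripped value of an environment variable, or raise ValueError.

    `require_env` and `must_contain` are illustrative names, not part of the notebook.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise ValueError(f"Environment variable {name} is missing or empty")
    if must_contain and must_contain not in value.lower():
        raise ValueError(f"Environment variable {name} should contain '{must_contain}'")
    return value


# Demonstration with a made-up variable; the notebook reads OPENAI_API_ENDPOINT instead.
os.environ["DEMO_ENDPOINT"] = "  https://example.openai.azure.com/  "
endpoint = require_env("DEMO_ENDPOINT", must_contain="openai.azure.com")
print(endpoint)
```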

## Task 2: Load all embeddings and get ready for the cosine_similarity comparison

#### Load all embeddings - we have already processed all the embeddings and saved them into .csv files

In [2]:
# Load small embeddings dataset
import os
import pandas as pd

# Specify the path to your folder containing .csv files
folder_path = './data_source/arxiv.org/AI/embedding_output'

# Initialize an empty list to store DataFrames
dfs = []

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    # Check if the file has a .csv extension
    if filename.endswith('.csv'):
        # Create the full file path
        file_path = os.path.join(folder_path, filename)
        
        # Read the CSV file into a DataFrame
        df_per_file = pd.read_csv(file_path, delimiter='\t')  # Adjust the delimiter if necessary
       
        # Append the DataFrame to the list
        dfs.append(df_per_file)

# Concatenate all DataFrames into a single DataFrame
df = pd.concat(dfs, ignore_index=True)

# Check each column of the concatenated DataFrame: df.all() reports whether every value in the column is truthy
print(df.all())

filename        True
page_number     True
page_content    True
embedding       True
dtype: bool
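
The loading cell relies on `os.listdir`, which returns files in arbitrary order, and `pd.concat` raises an error when given an empty list. The following is a sketch of the same loading step with a deterministic file order and an explicit check for an empty folder. The function name `load_embedding_frames` is hypothetical, and the tab separator is taken from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_embedding_frames(folder, sep="\t"):
    """Concatenate every .csv file in `folder` into one DataFrame, in sorted file order."""
    paths = sorted(Path(folder).glob("*.csv"))
    if not paths:
        raise FileNotFoundError(f"No .csv files found in {folder}")
    frames = [pd.read_csv(path, sep=sep) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

With the notebook's layout this would be called as `load_embedding_frames('./data_source/arxiv.org/AI/embedding_output')`.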


#### Convert string representation of an array to NumPy arrays

In [3]:
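import ast
import numpy as np

# Convert each stringified list into a NumPy array
# (ast.literal_eval only parses Python literals, so it is safer than eval on file contents)
df["embedding"] = df['embedding'].apply(ast.literal_eval).apply(np.array)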
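
For the similarity search it can be convenient to hold all embeddings in one 2-D array rather than a column of separate arrays. The sketch below assumes each cell is a stringified Python list, as in the .csv files above. The helper name `embeddings_to_matrix` and the toy data are illustrative only.

```python
import ast

import numpy as np
import pandas as pd


def embeddings_to_matrix(series):
    """Parse stringified lists and stack them into a 2-D float array (one row per record)."""
    vectors = [np.asarray(ast.literal_eval(text), dtype=float) for text in series]
    return np.vstack(vectors)


# Toy data standing in for the real embedding column
toy = pd.Series(["[1.0, 0.0]", "[0.0, 2.0]"])
matrix = embeddings_to_matrix(toy)
print(matrix.shape)  # (2, 2)
```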

In [45]:
df["embedding"]
# Printing all members in df["embedding"]
for embed in df["embedding"]:
    print(embed)

[-0.0121987  -0.0319206   0.01821196 ...  0.00610266  0.00498677
 -0.02164243]
[-0.00444533 -0.02816371  0.0076111  ...  0.00585418 -0.0036398
 -0.02948968]
[ 2.48839566e-03 -3.82125872e-05  1.97755042e-02 ... -9.17019881e-03
 -1.04341458e-03 -2.44363081e-02]
[ 0.00171012 -0.01008363  0.02309742 ...  0.00328956  0.00269286
 -0.03097999]
[-0.01112995 -0.03496641  0.01722827 ...  0.01701359 -0.00748705
 -0.04138005]
[-0.00427608 -0.01750331  0.02493823 ...  0.00921052 -0.00648062
 -0.03149532]
[-0.0059232  -0.01444069  0.0191263  ...  0.01421568  0.00407675
 -0.03285224]
[-0.01414606 -0.02144012  0.02106713 ... -0.0195199   0.00549472
 -0.03594536]
[ 0.00452031 -0.03532026  0.03015807 ... -0.00830705  0.01104437
 -0.03556479]
[-0.02255906 -0.02277108 -0.01040318 ... -0.02039645 -0.03918153
 -0.0073218 ]
[-0.02307236 -0.01539069 -0.01627913 ... -0.01960057 -0.02663983
 -0.00425431]
[-0.01230936 -0.03176699 -0.01748448 ... -0.0054244  -0.01869225
 -0.02078477]
[-0.00501806 -0.01657354 -0.0

## Task 3: Create functions

* Embedding for the user question

* Similarity comparison between the user input embedding and the previously processed embeddings (Arxiv papers in AI) stored in the DataFrame

#### Embedding user Question input text

In [5]:
# Embed the user input text so that we can use it later for the similarity search
def user_input_text_embedding(user_input):
    user_input_embedding = openai.Embedding.create(input=[user_input], deployment_id=DEPLOYMENT_NAME_EMBEDDING)
    return user_input_embedding['data'][0]['embedding']

# text = 'the quick brown fox jumped over the lazy dog'
# user_input_text_embedding(text)

#### Similarity search and find the DataFrame record with the highest similarity score

In [40]:
from openai.embeddings_utils import cosine_similarity
debug = False
def cosine_similarity_search(question_embedding, df):
    # Cosine similarity can be negative, so start below any possible score
    highest_score = float("-inf")
    embedding_record_index = 0
    # question_embedding = df['embedding'][0]
    for i in range(len(df)):
        # df['embedding'][i]
        if debug: print(i)
        score = cosine_similarity(question_embedding, df['embedding'][i])
        if (score > highest_score):
            if debug: print(f"highest_score: {highest_score}, score: {score}")
            highest_score = score
            embedding_record_index = i
    
    # Return the index of the record with the highest similarity
    # return df['page_content'][embedding_record_index]
    print(">>> internal msg: cosine_similarity_search(question_embedding, df): highest_score: ", highest_score)
    return embedding_record_index

# text = 'the quick brown fox jumped over the lazy dog'
# user_question_embedding = user_input_text_embedding(text)
# relevant_content = cosine_similarity_search(user_question_embedding, df)
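
The search above scores one record at a time and keeps only the single best match. As a sketch of an alternative, the function below computes all cosine similarities in one NumPy expression and returns the top `k` indices. The name `top_k_cosine` and the parameter `k` are hypothetical, and the code assumes no embedding is the zero vector.

```python
import numpy as np


def top_k_cosine(question_embedding, embedding_matrix, k=3):
    """Return (indices, scores) of the k rows most similar to the question embedding."""
    q = np.asarray(question_embedding, dtype=float)
    m = np.asarray(embedding_matrix, dtype=float)
    # cosine similarity = dot product divided by the product of the vector norms
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return order, scores[order]


# Toy 2-D vectors standing in for real embeddings
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = top_k_cosine([1.0, 0.0], toy_matrix, k=3)
print(indices, scores)
```

With the notebook's DataFrame, the matrix could be built once with `np.vstack(df["embedding"].tolist())` and reused for every question.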

## Task 4: Build a Q&A ChatBot based on the content stored in DataFrame

#### Set up Azure OpenAI

* Note: the settings are configured again because 'gpt-35-turbo-instruct-0914' supports Completion but not ChatCompletion, so 'gpt-35-turbo' is used for ChatCompletion instead.

In [7]:
#Note: The openai-python library support for Azure OpenAI is in preview.

import os
import openai
from dotenv import load_dotenv
# Set up Azure OpenAI
load_dotenv()

openai.api_type = "azure"
openai.api_version = "2023-07-01-preview"

CHAT_API_KEY = os.getenv("OPENAI_API_CHAT_KEY","").strip()
assert CHAT_API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = CHAT_API_KEY

CHAT_RESOURCE_ENDPOINT = os.getenv("OPENAI_API_CHAT_ENDPOINT","").strip()
assert CHAT_RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in CHAT_RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = CHAT_RESOURCE_ENDPOINT

# Deployment for Chat
# DEPLOYMENT_NAME_CHAT = os.getenv('DEPLOYMENT_NAME_CHAT')
DEPLOYMENT_NAME_CHAT = os.getenv('DEPLOYMENT_NAME_CHAT_16K')

#### Prompt options ...

* Build different prompts based on strategies, their tactics, and the user's own data (here, the result of the cosine_similarity comparison)

* There are plenty of prompt examples: https://platform.openai.com/docs/guides/prompt-engineering

For instance:

Tactic: Instruct the model to answer with citations from a reference text
If the input has been supplemented with relevant knowledge, it's straightforward to request that the model add citations to its answers by referencing passages from provided documents. Note that citations in the output can then be verified programmatically by string matching within the provided documents.

SYSTEM
You will be provided with a document delimited by triple quotes and a question. Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. If the document does not contain the information needed to answer this question then simply write: "Insufficient information." If an answer to the question is provided, it must be annotated with a citation. Use the following format to cite relevant passages ({"citation": …}).
USER
"""<insert document here>"""

Question: <insert question here>
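
The tactic above notes that citations can be verified programmatically by string matching. A minimal sketch of that check follows, assuming the model emits citations in the `({"citation": "..."})` form with a quoted passage. The names `CITATION_PATTERN` and `verify_citations` are hypothetical, and the matching ignores case and whitespace differences.

```python
import re

# Matches {"citation": "passage"} with either quote style around the key and value
CITATION_PATTERN = re.compile(
    r"""\{\s*["']citation["']\s*:\s*["'](.*?)["']\s*\}""", re.DOTALL
)


def _normalize(text):
    """Collapse runs of whitespace and lowercase, so line breaks do not break matching."""
    return " ".join(text.split()).lower()


def verify_citations(answer, document):
    """Return a list of (passage, found_in_document) pairs for each citation in `answer`."""
    normalized_document = _normalize(document)
    return [
        (passage, _normalize(passage) in normalized_document)
        for passage in CITATION_PATTERN.findall(answer)
    ]


document = ("Minimum Bayes Risk (MBR) decoding has been shown to be a powerful "
            "alternative to beam search decoding.")
answer = ('MBR decoding can replace beam search '
          '({"citation": "a powerful alternative to  beam search decoding"}). '
          'It has no cost ({"citation": "MBR decoding is free"}).')
results = verify_citations(answer, document)
print(results)
```

A citation that fails the check could be dropped or flagged before the answer is shown to the user.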


"What significant paradigm shifts have occurred in the history of artificial intelligence."

In [43]:
# Build different prompts based on strategies, their tactics, and the user's own data (here, the result of the cosine_similarity comparison)
# TODO: more experiments are needed to 'fine-tune' the prompt to get better results ...

def prompt_engineering(tactic, relevant_content_index, user_question):
    system_content = ''
    user_content = ''
    relevant_content = df['page_content'][relevant_content_index]

    if (tactic == 1):
        system_content = \
        """You will be provided with a document delimited by triple quotes and a question. 
            Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. 
            If the document does not contain the information needed to answer this question then simply write: 'Insufficient information.' 
            If an answer to the question is provided, it must be annotated with a citation. Use the following format for citing relevant passages ({'citation': …})."""        
        user_content = f'"""{relevant_content}"""\n\nQuestion: {user_question}'
    elif (tactic == 2):
        system_content = "You are an AI assistant that helps people find information. "
        user_content = f"Using below context: \n{relevant_content}\nAnswer the following question: {user_question}"
    elif (tactic == 3):
        system_content = "You are an AI assistant that helps people find information. If the document does not contain the information needed to answer this question then simply write: 'Insufficient information.' "
        user_content = f"Using below context: \n{relevant_content}\nAnswer the following question: {user_question}"       
        
    message_text = [
        {
            "role": "system",
            "content": system_content
        },
        {
            "role": "user",
            "content": user_content
        }
    ]
    print(">>> internal msg: prompt_engineering(tactic, relevant_content_index, user_question): message_text: ", message_text)
    return message_text


#### This is the main loop for the ChatBot conversation; it runs until the user enters 'quit' (or an empty line)

In [44]:
import openai

# Chat history
message_history = [
    {
        "role": "system",
        "content": "You are an AI assistant that helps people find information."
    },
    {
        "role": "user",
        "content": "\"Use below context:...[your provided message context here]"
    }
]

# Function to handle bot responses
def chat_with_bot():
    while True:
        # 0. Get user input
        user_input = input("You: ")
        print("Me: ", user_input)
        if user_input == '' or user_input.lower() == 'quit':
            print("Exiting conversation.")
            break  # Exit the loop on empty input or 'quit'

        # Switch to the right Azure OpenAI resource: gpt-35-turbo-instruct-0914; this one supports Completion and Embedding, but not ChatCompletion
        openai.api_type = "azure"
        openai.api_version = "2023-09-15-preview"
        openai.api_key = API_KEY
        openai.api_base = RESOURCE_ENDPOINT
        # Deployment for embedding
        # DEPLOYMENT_NAME_EMBEDDING = os.getenv('DEPLOYMENT_NAME_EMBEDDING')
        
        # 1. Now we have user input, let's do embedding on it first
        user_question_embedding = user_input_text_embedding(user_input)
        
        # 2. Let's do similarity search first using our own data (Arxiv papers in AI) stored in DataFrame
        relevant_content_index = cosine_similarity_search(user_question_embedding, df)
        # print("relevant_content_index: ", relevant_content_index)

        # 3. Build up prompt ....

        # Build the prompt using f-string
        # prompt = f"Using below context: \n{df['page_content'][relevant_content_index]}\nAnswer the following question: {user_input}"
        # print("Prompt(internal):", prompt)
        # # Append user input to message history
        # message_history.append({"role": "user", "content": prompt})

        message_text = prompt_engineering(3, relevant_content_index, user_input)
        # Append user input to message history
        message_history.append(message_text[1])

        # Using the chat history may help preserve the context of a conversation
        prompt_message =''
        use_chat_history = False
        if use_chat_history:
            prompt_message = message_history
        else:
            prompt_message = message_text

        # Switch to the right Azure OpenAI resource: gpt-35-turbo, which supports ChatCompletion
        openai.api_type = "azure"
        openai.api_version = "2023-07-01-preview"
        openai.api_key = CHAT_API_KEY
        openai.api_base = CHAT_RESOURCE_ENDPOINT
        # Deployment for Chat
        # DEPLOYMENT_NAME_CHAT = os.getenv('DEPLOYMENT_NAME_CHAT')

        # 4. Create chat completion using OpenAI API
        completion = openai.ChatCompletion.create(
            engine=DEPLOYMENT_NAME_CHAT,
            # messages=message_history,
            messages=prompt_message,
            temperature=0.1,
            max_tokens=400,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=0.0,
            stop=None
        )

        # 5. Get bot response
        bot_response = completion['choices'][0]['message']['content']
        # print("Bot:", bot_response)
        # TODO: add where info coming from which doc and page number ...
        citation = f"citation: file name: {df['filename'][relevant_content_index]}, page number: {df['page_number'][relevant_content_index]}"
        bot_response_with_citation = f"{bot_response}\n{citation}"        
        print("Bot: ", bot_response_with_citation)

        # 6. Append bot response to message history
        message_history.append({"role": "assistant", "content": bot_response})
        # print("message_history:", message_history)

# Start the conversation
print("Start chatting with the bot ('quit' to exit):")
chat_with_bot()
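
The loop keeps `use_chat_history = False`. If it were enabled, `message_history` would grow with every turn, and each stored user message carries a full page of context, so a request could eventually exceed the model's context window. The sketch below shows one possible way to bound the history by keeping the system messages plus the most recent turns. `trim_history` and `max_turns` are hypothetical names, and counting turns rather than tokens is a simplifying assumption.

```python
def trim_history(history, max_turns=3):
    """Keep all system messages plus the last `max_turns` user/assistant pairs."""
    system = [message for message in history if message["role"] == "system"]
    dialogue = [message for message in history if message["role"] != "system"]
    # Each turn is one user message followed by one assistant message
    recent = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + recent


# Toy conversation standing in for message_history
history = [{"role": "system", "content": "You are an AI assistant that helps people find information."}]
for turn in range(1, 4):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

trimmed = trim_history(history, max_turns=2)
print([message["content"] for message in trimmed])
```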


Start chatting with the bot ('quit' to exit):
Me:  "Model-Based Minimum Bayes Risk Decoding"
>>> internal msg: cosine_similarity_search(question_embedding, df): highest_score:  0.9016991761938595
>>> internal msg: prompt_engineering(tactic, relevant_content_index, user_question): message_text:  [{'role': 'system', 'content': "You are an AI assistant that helps people find information. If the document does not contain the information needed to answer this question then simply write: 'Insufficient information.' "}, {'role': 'user', 'content': 'Using below context: \nModel-Based Minimum Bayes Risk Decoding Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe CyberAgent {jinnai_yu, morimura_tetsuro, honda_ukyo,kaito_ariu, abe_kenshi}@cyberagent.co.jp Abstract Minimum Bayes Risk (MBR) decoding has been shown to be a powerful alternative to beam search decoding in a variety of text gener- ation tasks. MBR decoding selects a hypothesis from a pool of hypotheses that has the least 