## Retrieval-Based Chatbot with Vector Database Integration

This program demonstrates how OpenAI's ChatGPT language model can be used to answer questions in specific domain areas, using a process called retrieval augmentation (commonly known as retrieval-augmented generation, or RAG) and supported by a vector database.

With this approach, companies can take advantage of ChatGPT's natural language capabilities while restricting its answers to company-specific documents and information. The vector database makes this practical by efficiently storing, managing and querying the vectors (or "embeddings") that represent passages of a company's knowledge base. Such vectors also play a critical role inside large language models such as ChatGPT.

The program includes integration with OpenAI and Pinecone, a cloud-based vector database provider, via their APIs.  

The program asks the user for a question in a prescribed domain area. It then compares the user's query against pre-loaded domain content to identify and retrieve the most relevant sections. Finally, it answers the question by combining ChatGPT's general capabilities with the retrieved domain content. Such an approach might be used, for example, to give an insurance company's agents the ability to answer customer questions based on the company's policy materials.
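
The retrieval step described above can be illustrated with a minimal, self-contained sketch. It ranks a few passages by cosine similarity to a query vector. The three-dimensional vectors and the helper names `cosine_similarity` and `top_k_sections` are made up for illustration; the program itself uses 1536-dimensional OpenAI embeddings and delegates the search to Pinecone.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_sections(query_vector, section_vectors, k=2):
    """Return the ids of the k sections most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), section_id)
        for section_id, vector in section_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [section_id for _, section_id in scored[:k]]

# Toy, hand-made vectors (hypothetical values, not real embeddings)
toy_sections = {
    "inflation": [0.9, 0.1, 0.0],
    "oil": [0.1, 0.9, 0.1],
    "equities": [0.5, 0.4, 0.6],
}
toy_query = [1.0, 0.2, 0.1]

print(top_k_sections(toy_query, toy_sections, k=2))
```

A vector database performs the same kind of ranking, but over many more vectors and with indexing structures that avoid comparing the query against every stored item.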

For this example, I used the 2023 investment outlook summaries of leading Wall Street banks as the domain-specific content.   The summaries were drawn from the websites of 
Morgan Stanley ([here](https://www.morganstanley.com/ideas/global-investment-strategy-outlook-2023)), 
JPMorgan ([here](https://www.jpmorgan.com/insights/research/market-outlook)) and 
Goldman Sachs ([here](https://www.goldmansachs.com/insights/pages/gs-research/macro-outlook-2023-this-cycle-is-different/report.pdf)).  Questions such as "What is the outlook for inflation?" and "What will happen to the price of oil?" can be posed and answered via the chatbot. 

See the article ["Scaling Company Chatbots with Vector Databases"](retrieval_based_chatbot_article.md) for a more in-depth discussion of this program and vector databases. 


### Libraries and Imports

In [None]:
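! pip install "openai<1.0"            # The code uses the pre-1.0 interface (openai.Embedding, openai.Completion)
! pip install transformers 
! pip install gradio 
! pip install python-docx 
! pip install pandas 
! pip install "pinecone-client<3.0"   # The code uses pinecone.init(), which was removed in version 3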

In [None]:
import docx
import pandas as pd
import numpy as np
import openai
import gradio as gr
import pickle
import os
from transformers import GPT2TokenizerFast
import pinecone
import time

### Variables

In [None]:
openai.api_key = "YOUR_OPENAI_API_KEY"   # Your API keys go here 
pinecone_key= "YOUR_PINECONE_API_KEY"  
DOC_FILEPATH = 'Compilation_investment_outlook_2023.docx' # Path to document containing domain content; update as needed  
COMPLETIONS_MODEL = "text-davinci-003"  
DOC_EMBEDDINGS_MODEL = "text-embedding-ada-002"
QUERY_EMBEDDINGS_MODEL = "text-embedding-ada-002"
MAX_SECTION_LEN = 1100  # Token budget for the domain content placed in the prompt; leaves room within the model's context limit for the question and the answer (see max_tokens below).
SEPARATOR = "\n* "  # Placed before each document section in the prompt: a newline, an asterisk and a space.
TOKENIZER = GPT2TokenizerFast.from_pretrained("gpt2")
SEPARATOR_LEN = len(TOKENIZER.tokenize(SEPARATOR))
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}
EMBEDDING_DIMENSION = 1536  # The dimensionality of the document embeddings
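
The cell above stores the API keys directly in the notebook. One alternative is to read them from environment variables so they are not saved with the code. The sketch below assumes the helper name `read_api_key` and the variable names `OPENAI_API_KEY` and `PINECONE_API_KEY`; none of these come from the original program.

```python
import os

def read_api_key(variable_name):
    """Return the named environment variable, or raise a clear error if it is unset or empty."""
    value = os.environ.get(variable_name)
    if not value:
        raise RuntimeError(f"Environment variable {variable_name} is not set")
    return value

# Hypothetical usage in this notebook (variable names are assumptions):
#   openai.api_key = read_api_key("OPENAI_API_KEY")
#   pinecone_key = read_api_key("PINECONE_API_KEY")
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by the API call.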

### Functions: Initial Text Processing 

In [None]:
# FUNCTIONS: INITIAL TEXT PROCESSING 

def load_text(filepath):
  """
  Loads a Microsoft Word document and returns a DataFrame containing the text of each non-empty paragraph in the document.

  Input:
    filepath (str): the filepath to the Microsoft Word document.
    
  Returns:
    df (pandas.DataFrame): a DataFrame containing the 'content' column with the text of each paragraph in the document.
  """
  # Open the Word document
  doc = docx.Document(filepath)

  # Collect the text of every paragraph, skipping empty ones
  paragraphs = [p.text for p in doc.paragraphs if p.text != '']

  # Build the DataFrame in one step (faster than adding rows one at a time)
  df = pd.DataFrame({'content': paragraphs})

  return df
    
def count_tokens(row):
    """count the number of tokens in a string"""
    return len(TOKENIZER.encode(row))

def truncate_text(df, max_tokens=590):
    """
    Truncates the text in the 'content' column of the input DataFrame wherever the number of
    tokens exceeds max_tokens, and updates the 'content' and 'tokens' columns accordingly.

    Input:
    df (pandas.DataFrame): a DataFrame containing the 'content' and 'tokens' columns
    max_tokens (int): the maximum number of tokens to keep per row (default 590)

    Returns:
    df (pandas.DataFrame): the input DataFrame with modified 'content' and 'tokens' columns.

    """
    for i in range(len(df)):
        if df.at[i, 'tokens'] > max_tokens:
            tokens = TOKENIZER.encode(df.at[i, 'content'])
            truncated_tokens = tokens[:max_tokens]
            df.at[i, 'content'] = TOKENIZER.decode(truncated_tokens)
            df.at[i, 'tokens'] = len(truncated_tokens)
    return df
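
The truncation logic can be tried in isolation with a small sketch that needs no model download. It uses a stand-in tokenizer (one token per whitespace-separated word) purely for illustration; `truncate_to_token_limit`, `toy_encode` and `toy_decode` are hypothetical names, and real GPT-2 token counts will differ from word counts.

```python
def truncate_to_token_limit(text, max_tokens, encode, decode):
    """Return (text, token_count), truncating the text if it has more than max_tokens tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text, len(tokens)
    kept = tokens[:max_tokens]
    return decode(kept), len(kept)

# Stand-in tokenizer for illustration only: one token per whitespace-separated word
def toy_encode(text):
    return text.split()

def toy_decode(tokens):
    return " ".join(tokens)

short_text, short_count = truncate_to_token_limit(
    "rates may fall", 5, toy_encode, toy_decode
)
long_text, long_count = truncate_to_token_limit(
    "inflation is expected to cool over the year", 4, toy_encode, toy_decode
)
print(short_text, short_count)
print(long_text, long_count)
```

With the notebook's tokenizer, `TOKENIZER.encode` and `TOKENIZER.decode` would be passed in place of the toy functions.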



### Functions: Embeddings and Pinecone Integration

In [None]:
# FUNCTIONS: EMBEDDING/PINECONE VECTOR DB

def compute_doc_embeddings(df):
    """
    Generates embeddings for each row in a Pandas DataFrame using the OpenAI document embeddings model,
    and uploads the embeddings to Pinecone.

    Args:
        df (pandas.DataFrame): The DataFrame for which to generate embeddings.
    
    Returns:
        dict: A dictionary that maps the indices of the rows to their corresponding embedding vectors.
    """
    embeddings = {}
    for idx, r in df.iterrows():
        embedding = get_doc_embedding(r.content.replace("\n", " "))
        embeddings[str(idx)] = embedding
    # items = [(key, embeddings[key][:3]) for key in list(embeddings.keys())[:3]] # For Testing Purposes ........
    # for key, values in items:
    #     print(key, values)    
    upload_embeddings_to_pinecone(embeddings)
    return embeddings

def get_doc_embedding(text):
    """
    Generates an embedding for the given text using the OpenAI document embeddings model.

    Args:
        text (str): The text to generate an embedding for.
    
    Returns:
        list[float]: The embedding vector for the given text.
    """
    # print(text) # For debugging

    # Call the OpenAI API to generate the embedding
    result = openai.Embedding.create(
        model=DOC_EMBEDDINGS_MODEL,
        input=[text]
    )

    # Extract the embedding vector from the API response
    # print(result["data"][0]["embedding"][:3])  # For debugging

    return result["data"][0]["embedding"]


def upload_embeddings_to_pinecone(embeddings):
    """
    Uploads the provided embeddings to Pinecone.
    
    Args:
        embeddings (dict): A dictionary mapping document indices to their embeddings.
    """
    # Transform the dictionary to a list of tuples
    transformed_list = [(str(key), value) for key, value in embeddings.items()]
    pinecone_client.upsert(transformed_list)


def fetch_embeddings_from_pinecone(df, pinecone_client):
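    """
    Fetches the embeddings for every row of the DataFrame from a Pinecone index.
    
    Args:
        df (pandas.DataFrame): The DataFrame whose index values were used as vector ids.
        pinecone_client (pinecone.Index): The Pinecone index connection to fetch from.
    
    Returns:
        The Pinecone fetch response, whose 'vectors' entry maps each id to its stored vector.
    """
    # Build the list of vector ids from the DataFrame index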
    item_ids = [str(i) for i in df.index]
    # Fetch the vectors for all items
    document_embeddings = pinecone_client.fetch(ids=item_ids)
    
    return document_embeddings
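
`upload_embeddings_to_pinecone` sends every embedding in a single request, which is fine for the roughly 30 paragraphs used here. For larger document sets, a common pattern is to upload in batches. The sketch below shows only the batching logic; the helper name `batched` and the batch size of 100 are assumptions rather than values taken from this program.

```python
def batched(items, batch_size=100):
    """Yield successive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy (id, vector) tuples in the same shape that upload_embeddings_to_pinecone builds
toy_vectors = [(str(i), [0.0, 0.0, 0.0]) for i in range(250)]
batch_sizes = [len(batch) for batch in batched(toy_vectors, batch_size=100)]
print(batch_sizes)

# Hypothetical usage with the notebook's index connection:
#   for batch in batched(transformed_list):
#       pinecone_client.upsert(vectors=batch)
```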
    

### Functions: Question Answering

In [None]:
# FUNCTIONS: QUESTION ANSWERING 

def get_embedding(text, model): 
    """
    Generates an embedding for the given text using the specified OpenAI model.
    
    Args:
        text (str): The text for which to generate an embedding.
        model (str): The name of the OpenAI model to use for generating the embedding.
    
    Returns:
        list[float]: The embedding for the given text.
    """
    result = openai.Embedding.create(
      model=model,
      input=[text]
    )
    return result["data"][0]["embedding"]


def get_query_embedding(text):
    """
    Generates an embedding for the given text using the OpenAI query embeddings model.
    
    Args:
        text (str): The text for which to generate an embedding.
    
    Returns:
        list[float]: The embedding for the given text.
    """
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL)


def answer_query_with_context(query, df, document_embeddings, show_prompt: bool = False):
    """
    Answer a query using relevant context from a DataFrame.
    
    Args:
        query (str): The query to answer.
        df (pandas.DataFrame): A DataFrame containing the document sections.
        document_embeddings (dict): The document embeddings keyed by section index. Currently unused here, because construct_prompt queries Pinecone directly.
        show_prompt (bool, optional): If `True`, print the prompt before generating a response.
    
    Returns:
        str: The generated response to the query.
    """
    # print("STARTING ANSWER QUERY WITH CONTEXT..............................") # -- FOR TESTING PURPOSES
    prompt = construct_prompt(query, df)

    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
                )

    return response["choices"][0]["text"].strip(" \n")

def construct_prompt(question, df):
    """
    Construct a prompt for answering a question using the most relevant document sections.
    
    Args:
      question (str): The question to answer.
      df (pandas.DataFrame): A DataFrame containing the document sections.
    
    Returns:
      str: The prompt, including the question and the relevant context.
    """
  
    # print("STARTING CONSTRUCT PROMPT..............................") # -- FOR TESTING PURPOSES
    # Get the query embedding from the OpenAI API
    xq = get_query_embedding(question)

    # Get the top 5 document sections most similar to the query from the Pinecone index
    res = pinecone_client.query(vector=xq, top_k=5, include_metadata=True)

    # Extract the section indexes for the top n sections
    most_relevant_document_sections = [int(match['id']) for match in res['matches']]

    # print(f"TESTING... Most relevant document sections: {most_relevant_document_sections}") # -- FOR TESTING PURPOSES

    # Select sections in order of relevance until the token budget (MAX_SECTION_LEN) is used up
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + SEPARATOR_LEN
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
  
    # print(f"Selected {len(chosen_sections)} document sections:") # -- FOR TESTING PURPOSES
    # print("\n".join(chosen_sections_indexes)) # -- FOR TESTING PURPOSES
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "Sorry, I don't know."\n\nContext:\n"""

    full_prompt = header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"

    # print(full_prompt) # FOR TESTING PURPOSES

    return full_prompt
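
The section-selection loop inside `construct_prompt` can be exercised on its own, without calling OpenAI or Pinecone. The sketch below mirrors that loop with made-up token counts; `select_sections`, the toy ids and the separator length of 3 are hypothetical values chosen for illustration.

```python
def select_sections(ranked_ids, token_counts, max_len, separator_len):
    """Pick section ids in ranked order, stopping at the first one that would exceed max_len."""
    chosen = []
    used = 0
    for section_id in ranked_ids:
        used += token_counts[section_id] + separator_len
        if used > max_len:
            break
        chosen.append(section_id)
    return chosen

# Made-up token counts per section id, listed in order of relevance
toy_token_counts = {7: 500, 2: 400, 9: 300, 4: 100}
toy_ranked_ids = [7, 2, 9, 4]

print(select_sections(toy_ranked_ids, toy_token_counts, max_len=1100, separator_len=3))
```

Note that the loop stops at the first section that does not fit, so a shorter but less relevant section further down the ranking (section 4 in this toy data) is not considered. This matches the behavior of `construct_prompt` as written.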


### Main Program

#### Initial Text Processing

In [None]:
# Load the text into dataframe 
df = load_text(DOC_FILEPATH)

# Count the tokens in each paragraph
df['tokens'] = df['content'].apply(count_tokens)

# Truncate paragraphs that exceed the per-paragraph token limit
df = truncate_text(df)

# print(df.head(10))   # FOR TESTING PURPOSES 
# print(df['content'][3])   # FOR TESTING PURPOSES

#### Document Embeddings and Pinecone Integration

In [None]:
# Initialize Pinecone
pinecone.init(api_key=pinecone_key, environment="asia-southeast1-gcp-free")

# Create Pinecone index -- use only if first time setting up Pinecone index for a specific project  
# pinecone.create_index(name="docembeddings", dimension=1536, metric="cosine", shards=1)

# Connect to Pinecone service
pinecone_client = pinecone.Index(index_name="docembeddings")

# Use code below if calculating the embeddings in real time via OpenAI API...
# document_embeddings = compute_doc_embeddings(df) 

# OR ... use code below if importing previously-loaded embeddings from Pinecone
document_embeddings = fetch_embeddings_from_pinecone(df, pinecone_client)

# # FOR TESTING PURPOSES - PRINT FIRST FIVE VECTOR VALUES
# my_vectors = document_embeddings # For Testing Only .................
# test = [item['values'] for item in my_vectors['vectors'].values()] # For Testing Only .................
# print(f"TESTING...First 5 vector values are: {test[0][:5]}") # For Testing Only .................
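
Judging from the commented-out test code in the cell above, the fetched result holds a `'vectors'` mapping whose entries each carry a `'values'` list. Under that assumption, the sketch below flattens such a response into a plain dictionary of id to vector. The helper name `vectors_by_id` and the mock response are illustrative only, and the real response object may carry additional fields.

```python
def vectors_by_id(fetch_response):
    """Map each id in a fetch-style response to its list of vector values."""
    return {
        item_id: item["values"]
        for item_id, item in fetch_response["vectors"].items()
    }

# Mock response with made-up values, shaped after the commented-out test code
mock_response = {
    "vectors": {
        "0": {"id": "0", "values": [0.1, 0.2, 0.3]},
        "1": {"id": "1", "values": [0.4, 0.5, 0.6]},
    }
}

embeddings = vectors_by_id(mock_response)
print(embeddings["1"])
```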

#### Question Answering and Interface

In [None]:
use_interface = True # Change to False if you want to run the code without the Gradio interface, and instead see a single pre-supplied question 

if use_interface:
    demo = gr.Interface(
    fn=lambda query: answer_query_with_context(query, df, document_embeddings),
    inputs=gr.Textbox(lines=2,  label="Query", placeholder="Type Question Here..."),
    outputs=gr.Textbox(lines=2, label="Answer"),
    description="Example of a domain-specific chatbot, using ChatGPT with supplemental content added.<br>\
                  Here, the content relates to the investment outlook for 2023, according to Morgan Stanley, JPMorgan and Goldman Sachs.<br>\
                  Sample queries: What is Goldman's outlook for inflation? What about the bond market? What does JPMorgan think about 2023?<br>\
                  NOTE: High-level demo only. Supplemental content used here limited to about 30 paragraphs, due to limits on free-of-charge usage of ChatGPT.<br>\
                  ",
    title="Domain-Specific Chatbot",)
    demo.launch(debug=True)  # Set debug=True in launch() to see testing/debugging output
else:
    sample_question = 'What is the outlook for inflation?'
    print(answer_query_with_context(sample_question, df, document_embeddings)) 

