<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_gen_ai_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.2

Description:
    This notebook contains a class-based implementation of a Retrieval Augmented Generation (RAG) engine
    designed to analyze bank quarterly earnings call transcripts (in PDF format) stored on Google Drive.
    The code performs the following tasks:

    1. Configures an LLM (GPT-3.5-turbo via ChatOpenAI) for question answering over the transcripts.
    2. Sets up sentence-transformer based embeddings for document vectorization.
    3. Loads and splits PDF documents from one or more specified directories.
    4. Chunks the documents and builds a vector index using Chroma, persisting the index to Google Drive.
    5. Optionally loads an existing persisted vector index to avoid re-indexing, via the 'rebuild_index' parameter.
    6. Retrieves context relevant to user queries from the vector index with token truncation to enforce input limits.
    7. Maintains conversation memory for interactive sessions.
    8. Supports both interactive and programmatic prompt-based querying.
    9. Includes a 'test_mode' option for quick testing with a single PDF.
===================================================
"""



In [2]:
!pip install langchain openai chromadb sentence-transformers pypdf datasets rouge-score  > /dev/null 2>&1
!pip install --upgrade langchain_community   > /dev/null 2>&1
!pip install -U langchain-huggingface  > /dev/null 2>&1


In [5]:
import os
import re
import pandas as pd
from datasets import Dataset
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings  # provided by the langchain-huggingface package installed above
from langchain_community.vectorstores import Chroma
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_core.documents import Document
from langchain.agents import Tool, initialize_agent, AgentType
from langchain_community.chat_models import ChatOpenAI
import warnings
from google.colab import drive, userdata  # userdata reads values stored in Colab Secrets

In [23]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Mount Google Drive to the root location with force_remount
drive.mount('/content/drive', force_remount=True)
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')  # Reads the key stored in Colab Secrets

Mounted at /content/drive


# A class-based implementation of an LLM Retrieval Augmented Generation (RAG) engine
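The engine below only splits a page when it exceeds `chunk_threshold` tokens, and then cuts it into pieces of `chunk_size` characters that overlap by `chunk_overlap` characters. The following is a minimal, dependency-free sketch of that overlap idea. It uses a plain sliding window rather than `RecursiveCharacterTextSplitter` (which also tries to break on separators), and `sliding_chunks` is a hypothetical helper introduced here for illustration, not part of LangChain or of this notebook.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for a recursive splitter: it ignores separators
    and always cuts at fixed character offsets.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 2500
print([len(piece) for piece in sliding_chunks(sample)])  # [1000, 1000, 700]
```

With the defaults from `CONFIG`, each window starts 900 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next.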

In [8]:
# Toggle flag: True for development (verbose output) and False for production (minimal output)
DEV_MODE = False

In [22]:
# Optionally, suppress warnings in production mode
if not DEV_MODE:
    warnings.filterwarnings("ignore")

# --- Configuration ---
CONFIG = {
    "pdf_folders": [
        "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan",
        "/content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs"
    ],
    "persist_directory": "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs",
    "llm_model_name": "gpt-3.5-turbo",  # Using GPT-3.5-turbo
    "embedding_model_name": "sentence-transformers/all-mpnet-base-v2",
    "max_length": 1024,
    "temperature": 0.1,
    "top_p": 0.8,
    "batch_size": 8,
    "chunk_size": 1000,
    "chunk_overlap": 100,
    "chunk_threshold": 1024,
    "memory_window_k": 10,
    "retriever_search_k": 5
}

# --- Master Agent Prompt (Refined) ---
MASTER_AGENT_PROMPT = (
    "You are a highly accurate and detail-oriented assistant specialized in analyzing bank earnings call transcripts.\n"
    "ONLY use the information from the retrieved transcript context. Your final answer must be presented as a bullet-point list.\n\n"
    "Follow EXACTLY this format:\n"
    "Thought: Briefly explain which transcript sections are relevant.\n"
    "Action: Use the appropriate tool (JP_Morgan_RAG or UBS_RAG) by writing e.g. JP_Morgan_RAG(\"<query>\").\n"
    "Observation: Summarize the retrieved context in a few words.\n"
    "Final Answer: Provide a concise bullet-point list with the key sentiments and takeaways.\n\n"
    "Now, answer the following question:\n"
    "{input}\n"
    "Begin!"
)

# --- RAGChatbot Class ---
class RAGChatbot:
    """
    A RAG chatbot that ingests PDF earnings call transcripts, builds a vector store,
    and uses a ConversationalRetrievalChain for Q&A.
    """
    def __init__(self, config, bank: str):
        self.config = config
        # For GPT-3.5-turbo via ChatOpenAI, the API key must be set as OPENAI_API_KEY.
        self.api_key = os.environ.get("OPENAI_API_KEY")
        self.pdf_folders = config["pdf_folders"]
        self.persist_directory = config["persist_directory"]
        self.max_length = config["max_length"]
        self.batch_size = config["batch_size"]
        self.chunk_size = config["chunk_size"]
        self.chunk_overlap = config["chunk_overlap"]
        self.chunk_threshold = config["chunk_threshold"]
        self.memory_window_k = config["memory_window_k"]
        self.retriever_search_k = config["retriever_search_k"]
        self.bank = bank

        self._setup_llm()
        self._setup_embeddings()
        # Create a separate tokenizer for splitting (using GPT-2 as a proxy)
        from transformers import AutoTokenizer
        self.split_tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self._load_documents()
        self._build_vector_store()
        self._build_summary_index()
        self._setup_retrieval_chain()

    def _setup_llm(self):
        # Initialize GPT-3.5-turbo via ChatOpenAI with top_p passed in model_kwargs.
        self.llm = ChatOpenAI(
            model_name=self.config["llm_model_name"],
            temperature=self.config["temperature"],
            max_tokens=self.max_length,
            openai_api_key=self.api_key,
            model_kwargs={"top_p": self.config["top_p"]}
        )

    def _setup_embeddings(self):
        emb_model = self.config["embedding_model_name"]
        self.embeddings = HuggingFaceEmbeddings(model_name=emb_model)

    def _load_documents(self):
        self.documents = []
        for folder in self.pdf_folders:
            bank = os.path.basename(folder).lower()
            files = [f for f in os.listdir(folder) if f.endswith(".pdf")]
            for file in files:
                path = os.path.join(folder, file)
                try:
                    loader = PyPDFLoader(path, extract_images=False)
                    docs = loader.load_and_split()
                    for doc in docs:
                        doc.metadata["bank"] = bank
                        doc.metadata["source_pdf"] = file
                        doc.metadata["year_quarter"] = "Unknown"  # Temporarily set to "Unknown"
                    self.documents.extend(docs)
                    if DEV_MODE:
                        print(f"Loaded: {file} from {folder} -> Unknown")
                except Exception as e:
                    if DEV_MODE:
                        print(f"Error loading {file}: {e}")

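    def _chunk_document(self, doc: Document) -> list[Document]:
        # Only split pages longer than chunk_threshold tokens (GPT-2 token count).
        # Note: chunk_size and chunk_overlap are measured in characters, not tokens.
        tokens = self.split_tokenizer.encode(doc.page_content)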
        if len(tokens) > self.chunk_threshold:
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)
            chunks = splitter.split_documents([doc])
            return self._remove_duplicates(chunks)
        return [doc]

    @staticmethod
    def _remove_duplicates(chunks: list[Document]) -> list[Document]:
        seen = set()
        unique = []
        for chunk in chunks:
            text = chunk.page_content.strip()
            if text not in seen:
                seen.add(text)
                unique.append(chunk)
        return unique

    def _build_vector_store(self):
        all_chunks = []
        for doc in self.documents:
            all_chunks.extend(self._chunk_document(doc))
        self.raw_db = Chroma.from_documents(
            all_chunks, embedding=self.embeddings, persist_directory=self.persist_directory)
        if DEV_MODE:
            print(f"Built raw vector store with {len(all_chunks)} chunks.")

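    def _build_summary_index(self):
        # NOTE: despite its name, this index holds the same raw chunks as raw_db
        # (no summarization step is applied); it is persisted in a "summaries" subfolder.
        all_chunks = []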
        for doc in self.documents:
            all_chunks.extend(self._chunk_document(doc))
        self.summary_db = Chroma.from_documents(
            all_chunks, embedding=self.embeddings,
            persist_directory=os.path.join(self.persist_directory, "summaries"))
        if DEV_MODE:
            print(f"Built summary vector index with {len(all_chunks)} chunks.")

    def _setup_retrieval_chain(self):
        memory = ConversationBufferWindowMemory(
            k=self.memory_window_k, memory_key="chat_history", return_messages=True)
        self.retrieval_chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.summary_db.as_retriever(
                search_kwargs={"k": self.retriever_search_k, "filter": {"bank": self.bank}}
            ),
            memory=memory,
            verbose=DEV_MODE
        )

    def answer_query(self, query: str) -> str:
        # invoke() replaces the deprecated direct call on the chain object
        response = self.retrieval_chain.invoke({"question": query})
        return response.get("answer", "").strip()

# --- Create Multi-Agent Instances ---
jpm_folders = [folder for folder in CONFIG["pdf_folders"] if "jpmorgan" in folder.lower()]
ubs_folders = [folder for folder in CONFIG["pdf_folders"] if "ubs" in folder.lower()]

CONFIG_JPM = CONFIG.copy()
CONFIG_JPM["pdf_folders"] = jpm_folders

CONFIG_UBS = CONFIG.copy()
CONFIG_UBS["pdf_folders"] = ubs_folders

jpm_chatbot = RAGChatbot(CONFIG_JPM, bank="jpmorgan")
ubs_chatbot = RAGChatbot(CONFIG_UBS, bank="ubs")

def jpm_tool(query: str) -> str:
    return jpm_chatbot.answer_query(query)

def ubs_tool(query: str) -> str:
    return ubs_chatbot.answer_query(query)

jpm_tool_instance = Tool(
    name="JP_Morgan_RAG",
    func=jpm_tool,
    description="Answers questions about JP Morgan earnings call transcripts."
)

ubs_tool_instance = Tool(
    name="UBS_RAG",
    func=ubs_tool,
    description="Answers questions about UBS earnings call transcripts."
)

master_agent = initialize_agent(
    [jpm_tool_instance, ubs_tool_instance],
    jpm_chatbot.llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=DEV_MODE,
    handle_parsing_errors=True,
    agent_kwargs={"prefix": MASTER_AGENT_PROMPT}
)

# --- Chatbot Loop (Master Agent) with CSV Logging ---
def run_master_agent():
    print("Master Agent Chatbot (type 'exit' to quit)")
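    csv_file = "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs/gpt/master_agent_results.csv"
    # Make sure the output folder exists before the first write
    os.makedirs(os.path.dirname(csv_file), exist_ok=True)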
    if os.path.exists(csv_file):
        df = pd.read_csv(csv_file)
    else:
        if DEV_MODE:
            df = pd.DataFrame(columns=["Year/Quarter", "Question", "Master Answer", "Full Output"])
        else:
            df = pd.DataFrame(columns=["Year/Quarter", "Question", "Master Answer"])

    while True:
        user_q = input("You: ")
        if user_q.lower() == "exit":
            print("Exiting Master Agent Chatbot. Goodbye!")
            break

        # Check if the question has already been asked
        if user_q in df["Question"].values:
            # Always tell the user why no answer is shown, even in production mode
            print("This question has already been asked. Skipping...")
            continue

        answer = master_agent.run(user_q)
        year_quarter = "Unknown"  # Year/Quarter extraction is disabled for now.
        if DEV_MODE:
            full_output = "Full chain output logged in console."
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": user_q, "Master Answer": answer, "Full Output": full_output}])
        else:
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": user_q, "Master Answer": answer}])
        df = pd.concat([df, new_row], ignore_index=True)
        df.to_csv(csv_file, index=False)
        # Always display the master answer on screen
        print(f"\nMaster Agent Answer:\n{answer}\n")
        print(f"Results saved to {csv_file}")


Master Agent Chatbot (type 'exit' to quit)
You: What are the key market and credit risk factors highlighted in the latest JP Morgan earnings call?

Master Agent Answer:
{
  "action": "Final Answer",
  "action_input": "The key market and credit risk factors highlighted in the latest JP Morgan earnings call are related to market competition, changes in market structure, the impact of new competitors like Jane Street, disruption to bank lending caused by private credit, the need for innovation and adaptation, and the implications of capital and liquidity regulations on the U.S. capital markets ecosystem."
}

Results saved to master_agent_results.csv
You: In your final answer, please include an overall sentiment (e.g., positive, negative, or neutral) based on the tone and language used in the transcripts

Master Agent Answer:
Please include the overall sentiment in the final response.

Results saved to master_agent_results.csv
You: Identify the overall sentiment (e.g., positive, negative, 

KeyboardInterrupt: Interrupted by user
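
In the first exchange above, the agent's reply came back as a raw JSON blob with `action` and `action_input` keys rather than as plain text. One possible way to tidy this before printing or logging is sketched below. `unwrap_final_answer` is a hypothetical helper, not a LangChain function, and it assumes the blob has the shape seen in that output.

```python
import json


def unwrap_final_answer(answer: str) -> str:
    """Return the 'action_input' text if answer is a Final Answer JSON blob.

    Any other string (plain text, malformed JSON, other JSON) is returned unchanged.
    """
    text = answer.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return answer  # not JSON at all, so leave it as it is
    if isinstance(parsed, dict) and parsed.get("action") == "Final Answer":
        value = parsed.get("action_input")
        if isinstance(value, str):
            return value.strip()
    return answer


raw = '{"action": "Final Answer", "action_input": "Key risks: competition."}'
print(unwrap_final_answer(raw))           # Key risks: competition.
print(unwrap_final_answer("Plain text"))  # Plain text
```

It could be applied to the value returned by `master_agent.run(user_q)` before the row is written to the CSV.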

In [None]:
if __name__ == "__main__":
    run_master_agent()
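
Both the document metadata and the CSV log currently record the year/quarter as "Unknown". If the transcript file names encode the reporting period, it could be recovered with a regular expression. The sketch below assumes file names containing patterns such as `4q24` or `2024_Q3`. These naming conventions, the example file names, and the `extract_year_quarter` helper are assumptions made for illustration and are not taken from the project data.

```python
import re

# Assumed patterns (hypothetical file names, not taken from the project data):
#   quarter first: "4q24", "4Q2024"
#   year first:    "2024_q4", "2024-Q4"
_QUARTER_FIRST = re.compile(r"(?<!\d)([1-4])q[ _-]?(\d{4}|\d{2})(?!\d)", re.IGNORECASE)
_YEAR_FIRST = re.compile(r"(?<!\d)(\d{4})[ _-]?q([1-4])(?!\d)", re.IGNORECASE)


def extract_year_quarter(filename: str) -> str:
    """Return a label such as '2024_Q4', or 'Unknown' if no pattern matches."""
    match = _YEAR_FIRST.search(filename)
    if match:
        year, quarter = match.group(1), match.group(2)
        return f"{year}_Q{quarter}"
    match = _QUARTER_FIRST.search(filename)
    if match:
        quarter, year = match.group(1), match.group(2)
        if len(year) == 2:
            year = "20" + year  # assumes two-digit years fall in the 2000s
        return f"{year}_Q{quarter}"
    return "Unknown"


for name in ["jpm_4q24_transcript.pdf", "ubs_2024_Q3.pdf", "annual_report.pdf"]:
    print(name, "->", extract_year_quarter(name))
```

A helper like this could replace the hard-coded "Unknown" assigned to `doc.metadata["year_quarter"]` in `_load_documents`, provided the real file names follow a convention it recognises.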