<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_gen_ai_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.2

Description:
    This notebook contains a class-based implementation of a Retrieval Augmented Generation (RAG) engine
    designed to analyze bank quarterly earnings call transcripts (in PDF format) stored on Google Drive.
    The code performs the following tasks:

    1. Configures GPT-3.5-turbo (via ChatOpenAI) as the LLM for question answering.
    2. Sets up sentence-transformer based embeddings for document vectorization.
    3. Loads and splits PDF documents from one or more specified directories.
    4. Chunks the documents and builds a vector index using Chroma, persisting the index to Google Drive.
    5. Tags each chunk with bank, source file and year/quarter metadata parsed from the filename.
    6. Retrieves context relevant to user queries from the vector index, filtered by bank.
    7. Maintains conversation memory for interactive sessions.
    8. Supports both interactive and programmatic prompt-based querying.
    9. Routes questions to per-bank RAG tools through a master agent and logs answers to CSV.
===================================================
"""



In [2]:
!pip install langchain openai chromadb sentence-transformers pypdf datasets rouge-score  > /dev/null 2>&1
!pip install --upgrade langchain_community   > /dev/null 2>&1
!pip install -U langchain-huggingface  > /dev/null 2>&1


In [5]:
import os
import re
import warnings
from datasets import Dataset
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.docstore.document import Document
from langchain.agents import Tool, initialize_agent, AgentType
# Use the new ChatOpenAI wrapper instead of OpenAI
from langchain_community.chat_models import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
# `userdata` exposes Colab secrets (used below for HF_TOKEN and OPENAI_API_KEY)
from google.colab import drive, userdata

In [6]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Mount Google Drive to the root location with force_remount
drive.mount('/content/drive', force_remount=True)
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')  # Read from Colab secrets; add HF_TOKEN there first

Mounted at /content/drive
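
The cell above reads `HF_TOKEN` from Colab's secret store, which is only available inside a Colab runtime. A minimal sketch of a fallback for running the same code elsewhere, assuming a hypothetical helper `get_secret` (not part of the notebook) that tries `google.colab.userdata` first and then the process environment:

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's secret store, else from the environment.

    `get_secret` is a hypothetical helper used for illustration only.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        value = None
    value = value or os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or environment")
    return value
```

With this in place, the assignment could read `os.environ["HF_TOKEN"] = get_secret("HF_TOKEN")`, and a missing token fails with an explicit message.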


# A class-based implementation of an LLM Retrieval Augmented Generation (RAG) engine
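
As a warm-up for the implementation, here is a toy sketch of the retrieve-then-filter step that the class delegates to Chroma. It is not the notebook's code: word-count vectors stand in for the sentence-transformer embeddings, `embed`, `cosine` and `retrieve` are hypothetical names, and the chunk texts are invented placeholders rather than transcript quotes.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[dict], bank: str, k: int = 2) -> list[dict]:
    """Return the k chunks tagged with `bank` that are most similar to `query`."""
    q = embed(query)
    # Metadata filter first, similarity ranking second.
    candidates = [c for c in chunks if c["bank"] == bank]
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


# Invented placeholder chunks, tagged the same way as the notebook's metadata.
chunks = [
    {"bank": "jpmorgan", "text": "credit costs rose on card net charge-offs"},
    {"bank": "jpmorgan", "text": "investment banking fees were up"},
    {"bank": "ubs", "text": "credit loss expenses stayed low"},
]
top = retrieve("credit costs", chunks, bank="jpmorgan", k=1)
```

The `bank` argument plays the role of the `{"bank": ...}` filter passed to the retriever, so each chatbot only sees chunks from its own bank.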

In [8]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')  # Read from Colab secrets; add OPENAI_API_KEY there first

In [16]:
import os
import re
import pandas as pd
from datasets import Dataset
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Use the langchain_huggingface package; the langchain.embeddings import path is deprecated
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.docstore.document import Document
from langchain.agents import Tool, initialize_agent, AgentType
from langchain_community.chat_models import ChatOpenAI

# --- Helper Function to Extract Year/Quarter from Filename ---
def extract_year_quarter(filename: str) -> str:
    """
    Extracts a year/quarter label from a filename.
    Matches either a numeric format (e.g. "2q23" or "3Q2024")
    or a spelled-out quarter format (e.g. "third-quarter-2024").
    Returns the label as "<quarter>Q<year>" in upper case (e.g. "2Q23"), or "Unknown" if not found.
    The year keeps the digit count used in the filename, so "3Q24" and "3Q2024" are both possible.
    """
    pattern = re.compile(
        r'(?i)'                             # case-insensitive
        r'([1-4]q\d{2,4})'                  # group(1): numeric format like "2q23"
        r'|'
        r'((first|second|third|fourth)[-_\s]?quarter[-_\s]?(\d{2,4}))'  # group(2): spelled-out format
    )
    match = pattern.search(filename)
    if not match:
        return "Unknown"

    # If numeric format matched:
    if match.group(1):
        return match.group(1).upper()  # e.g., "2Q23"
    else:
        quarter_text = match.group(3).lower()  # e.g., "third"
        year_text = match.group(4)             # e.g., "2024" or "23"
        mapping = {"first": "1", "second": "2", "third": "3", "fourth": "4"}
        quarter_num = mapping.get(quarter_text, "?")
        return f"{quarter_num}Q{year_text}"

# --- Configuration ---
CONFIG = {
    "pdf_folders": [
        "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan",
        "/content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs"
    ],
    "persist_directory": "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs",
    "llm_model_name": "gpt-3.5-turbo",  # Using GPT-3.5-turbo
    "embedding_model_name": "sentence-transformers/all-mpnet-base-v2",
    "max_length": 1024,
    "temperature": 0.1,
    "top_p": 0.8,
    "batch_size": 8,
    "chunk_size": 1000,
    "chunk_overlap": 100,
    "chunk_threshold": 1024,
    "memory_window_k": 10,
    "retriever_search_k": 5
}

# --- Master Agent Prompt (Refined) ---
MASTER_AGENT_PROMPT = (
    "You are a highly accurate and detail-oriented assistant specialized in analyzing bank earnings call transcripts.\n"
    "ONLY use the information from the retrieved transcript context. Your final answer must be presented as a bullet-point list.\n\n"
    "Follow EXACTLY this format:\n"
    "Thought: Briefly explain which transcript sections are relevant.\n"
    "Action: Use the appropriate tool (JP_Morgan_RAG or UBS_RAG) by writing e.g. JP_Morgan_RAG(\"<query>\").\n"
    "Observation: Summarize the retrieved context in a few words.\n"
    "Final Answer: Provide a concise bullet-point list with the key sentiments and takeaways.\n\n"
    "Now, answer the following question:\n"
    "{input}\n"
    "Begin!"
)

# --- RAGChatbot Class ---
class RAGChatbot:
    """
    A RAG chatbot that ingests PDF earnings call transcripts, builds a vector store,
    and uses a ConversationalRetrievalChain for Q&A.
    """
    def __init__(self, config, bank: str):
        self.config = config
        # For GPT-3.5-turbo via ChatOpenAI, the API key must be set as OPENAI_API_KEY.
        self.api_key = os.environ.get("OPENAI_API_KEY")
        self.pdf_folders = config["pdf_folders"]
        self.persist_directory = config["persist_directory"]
        self.max_length = config["max_length"]
        self.batch_size = config["batch_size"]
        self.chunk_size = config["chunk_size"]
        self.chunk_overlap = config["chunk_overlap"]
        self.chunk_threshold = config["chunk_threshold"]
        self.memory_window_k = config["memory_window_k"]
        self.retriever_search_k = config["retriever_search_k"]
        self.bank = bank

        self._setup_llm()
        self._setup_embeddings()
        # Create a separate tokenizer for splitting (using GPT-2 as a proxy)
        from transformers import AutoTokenizer
        self.split_tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self._load_documents()
        self._build_vector_store()
        self._build_summary_index()
        self._setup_retrieval_chain()

    def _setup_llm(self):
        # Initialize GPT-3.5-turbo via ChatOpenAI.
        self.llm = ChatOpenAI(
            model_name=self.config["llm_model_name"],
            temperature=self.config["temperature"],
            top_p=self.config["top_p"],
            max_tokens=self.max_length,
            openai_api_key=self.api_key
        )

    def _setup_embeddings(self):
        emb_model = self.config["embedding_model_name"]
        self.embeddings = HuggingFaceEmbeddings(model_name=emb_model)

    def _load_documents(self):
        self.documents = []
        for folder in self.pdf_folders:
            bank = os.path.basename(folder).lower()
            files = [f for f in os.listdir(folder) if f.endswith(".pdf")]
            for file in files:
                path = os.path.join(folder, file)
                try:
                    loader = PyPDFLoader(path, extract_images=False)
                    docs = loader.load_and_split()
                    # Extract year/quarter from the filename and add to metadata
                    yq = extract_year_quarter(file)
                    for doc in docs:
                        doc.metadata["bank"] = bank
                        doc.metadata["source_pdf"] = file
                        doc.metadata["year_quarter"] = yq
                    self.documents.extend(docs)
                    print(f"Loaded: {file} from {folder} -> {yq}")
                except Exception as e:
                    print(f"Error loading {file}: {e}")

    def _chunk_document(self, doc: Document) -> list[Document]:
        tokens = self.split_tokenizer.encode(doc.page_content)
        if len(tokens) > self.chunk_threshold:
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)
            chunks = splitter.split_documents([doc])
            return self._remove_duplicates(chunks)
        return [doc]

    @staticmethod
    def _remove_duplicates(chunks: list[Document]) -> list[Document]:
        seen = set()
        unique = []
        for chunk in chunks:
            text = chunk.page_content.strip()
            if text not in seen:
                seen.add(text)
                unique.append(chunk)
        return unique

    def _build_vector_store(self):
        # Chunk once and cache the result so the summary index can reuse it.
        self.chunks = []
        for doc in self.documents:
            self.chunks.extend(self._chunk_document(doc))
        self.raw_db = Chroma.from_documents(
            self.chunks, embedding=self.embeddings, persist_directory=self.persist_directory)
        print(f"Built raw vector store with {len(self.chunks)} chunks.")

    def _build_summary_index(self):
        # NOTE: no summarization happens here; this index holds the same chunks as raw_db,
        # persisted under a separate "summaries" sub-directory.
        self.summary_db = Chroma.from_documents(
            self.chunks, embedding=self.embeddings,
            persist_directory=os.path.join(self.persist_directory, "summaries"))
        print(f"Built summary vector index with {len(self.chunks)} chunks.")

    def _setup_retrieval_chain(self):
        memory = ConversationBufferWindowMemory(
            k=self.memory_window_k, memory_key="chat_history", return_messages=True)
        self.retrieval_chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.summary_db.as_retriever(
                search_kwargs={"k": self.retriever_search_k, "filter": {"bank": self.bank}}),
            memory=memory,
            verbose=True
        )

    def answer_query(self, query: str) -> str:
        response = self.retrieval_chain({"question": query})
        return response.get("answer", "").strip()


# --- Create Multi-Agent Instances ---
# Filter PDF folders for each bank.
jpm_folders = [folder for folder in CONFIG["pdf_folders"] if "jpmorgan" in folder.lower()]
ubs_folders = [folder for folder in CONFIG["pdf_folders"] if "ubs" in folder.lower()]

# Create separate configurations.
CONFIG_JPM = CONFIG.copy()
CONFIG_JPM["pdf_folders"] = jpm_folders

CONFIG_UBS = CONFIG.copy()
CONFIG_UBS["pdf_folders"] = ubs_folders

# Initialize separate RAGChatbot instances.
jpm_chatbot = RAGChatbot(CONFIG_JPM, bank="jpmorgan")
ubs_chatbot = RAGChatbot(CONFIG_UBS, bank="ubs")

# --- Define Tools for Each Agent ---
def jpm_tool(query: str) -> str:
    return jpm_chatbot.answer_query(query)

def ubs_tool(query: str) -> str:
    return ubs_chatbot.answer_query(query)

jpm_tool_instance = Tool(
    name="JP_Morgan_RAG",
    func=jpm_tool,
    description="Answers questions about JP Morgan earnings call transcripts."
)

ubs_tool_instance = Tool(
    name="UBS_RAG",
    func=ubs_tool,
    description="Answers questions about UBS earnings call transcripts."
)

# --- Master Agent Integration ---
master_agent = initialize_agent(
    [jpm_tool_instance, ubs_tool_instance],
    jpm_chatbot.llm,  # Using the same GPT-3.5-turbo LLM for all agents.
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True,
    agent_kwargs={"prefix": MASTER_AGENT_PROMPT}
)

# --- Chatbot Loop (Master Agent) with CSV Logging ---
def run_master_agent():
    print("Master Agent Chatbot (type 'exit' to quit)")
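    csv_file = "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs/gpt/master_agent_results.csv"
    # Make sure the output folder exists before the first df.to_csv call.
    os.makedirs(os.path.dirname(csv_file), exist_ok=True)
    if os.path.exists(csv_file):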
        df = pd.read_csv(csv_file)
    else:
        df = pd.DataFrame(columns=["Year/Quarter", "Question", "Master Answer"])

    while True:
        user_q = input("You: ").strip()  # strip so " exit " and trailing spaces are handled
        if user_q.lower() == "exit":
            print("Exiting Master Agent Chatbot. Goodbye!")
            break

        # Check if the question has already been asked
        if user_q in df["Question"].values:
            print("This question has already been asked. Skipping...")
            continue

        answer = master_agent.run(user_q)
        # Year/Quarter is always logged as "Unknown" for now.
        # It could instead be derived from the year_quarter metadata of the retrieved documents.
        year_quarter = "Unknown"
        new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": user_q, "Master Answer": answer}])
        df = pd.concat([df, new_row], ignore_index=True)
        df.to_csv(csv_file, index=False)
        print(f"\nMaster Agent Answer:\n{answer}\n")
        print(f"Results saved to {csv_file}")

if __name__ == "__main__":
    run_master_agent()


                    top_p was transferred to model_kwargs.
                    Please confirm that top_p is what you intended.


Loaded: 1q23-earnings-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 1Q23
Loaded: 2q23-earnings-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 2Q23
Loaded: 4q24-earnings-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 4Q24
Loaded: jpm-1q24-earnings-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 1Q24
Loaded: jpm-2q24-earnings-call-transcript-final.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 2Q24
Loaded: jpm-3q23-earnings-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 3Q23
Loaded: jpm-4q23-earnings-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 4Q23
Loaded: jpmc-third-quarter-2024-earnings-conference-call-transcript.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan -> 3Q2024
Built raw vector store with 252 chunks.

                    top_p was transferred to model_kwargs.
                    Please confirm that top_p is what you intended.


Built summary vector index with 252 chunks.
Loaded: 1q23-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 1Q23
Loaded: 1q24-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 1Q24
Loaded: 2q23-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 2Q23
Loaded: 2q24-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 2Q24
Loaded: 3q23-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 3Q23
Loaded: 3q24-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 3Q24
Loaded: 4q23-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 4Q23
Loaded: 4q24-earnings-call-remarks.pdf from /content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs -> 4Q24
Built raw vector store with 228 chunks.
Built summary vector index with 228 chunks.
Master Agent Chatbot (type 'exit' to quit)

KeyboardInterrupt: Interrupted by user
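
The load log above shows one label, `3Q2024`, with a four-digit year, while every other file yields a two-digit year such as `3Q24`. If the labels are later used for grouping or filtering, they may need a single form. A minimal sketch of one way to do that, assuming a hypothetical post-processing helper `normalize_year_quarter` that is not part of the notebook:

```python
import re

# <quarter>Q<year> with a two- or four-digit year, e.g. "3Q24" or "3Q2024".
_LABEL = re.compile(r"^([1-4])Q(\d{2}|\d{4})$", re.IGNORECASE)


def normalize_year_quarter(label: str) -> str:
    """Map labels such as "3Q2024" or "3q24" to the two-digit-year form "3Q24".

    Labels that do not match <quarter>Q<year> (e.g. "Unknown") are returned unchanged.
    """
    match = _LABEL.match(label.strip())
    if not match:
        return label
    quarter, year = match.groups()
    # Keep only the last two digits of the year.
    return f"{quarter}Q{year[-2:]}"
```

It could be applied to the return value of `extract_year_quarter` before the label is written to `doc.metadata["year_quarter"]`.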

## For JP Morgan:
### Market & Credit Risk:
*  "What are the key market and credit risk factors highlighted in the latest JP Morgan earnings call?"

### Balance Sheet Exposures:
*  "How are JP Morgan's balance sheet exposures and risk-weighted assets affecting its risk profile?"

### Expense & Margin Pressures:
*  "What cost pressures or margin compressions were mentioned that could pose risks to JP Morgan's performance?"

## For UBS:
### Operating and Strategic Challenges:
*  "What are the primary challenges or risk factors highlighted in UBS's earnings call, particularly in relation to its operating profit and cost structure?"

### Revenue Mix & Expense Growth:
*  "How have changes in UBS's revenue mix and expense growth impacted its overall risk profile?"

### Risk Mitigation Measures:
*  "What steps is UBS taking to manage or mitigate its identified risks, especially those related to technology costs and recruitment expenses?"
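
The sample questions above can also be run as a batch instead of being typed into the interactive loop. A rough sketch, assuming a hypothetical helper `run_question_batch` and a stand-in `fake_answer` function so that it runs without an API key; neither is part of the notebook:

```python
from typing import Callable, Iterable


def run_question_batch(
    questions: Iterable[str],
    answer_fn: Callable[[str], str],
    already_asked: Iterable[str] = (),
) -> list[dict]:
    """Answer each new question once and return rows shaped like the CSV log."""
    seen = set(already_asked)
    rows = []
    for question in questions:
        if question in seen:
            continue  # mirror the interactive loop: skip repeated questions
        seen.add(question)
        rows.append({
            "Year/Quarter": "Unknown",
            "Question": question,
            "Master Answer": answer_fn(question),
        })
    return rows


def fake_answer(question: str) -> str:
    """Stand-in for a real answer function; returns a placeholder string."""
    return f"(answer to: {question})"


jpm_questions = [
    "What are the key market and credit risk factors highlighted in the latest JP Morgan earnings call?",
    "How are JP Morgan's balance sheet exposures and risk-weighted assets affecting its risk profile?",
]
rows = run_question_batch(jpm_questions, fake_answer)
```

In the notebook, `answer_fn` could be `jpm_chatbot.answer_query`, `ubs_chatbot.answer_query` or `master_agent.run`, and the rows could be appended to the existing log with `pd.DataFrame(rows)`.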

In [None]:
if __name__ == "__main__":
    import pandas as pd
    run_master_agent()
