<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_gen_ai_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.2

Description:
    This module implements a Retrieval-Augmented Generation (RAG) engine for analyzing bank
    quarterly earnings call transcripts (PDF format) stored on Google Drive. It uses LangChain
    for document loading, chunking, retrieval, and question answering, with Hugging Face
    sentence-transformer embeddings and an OpenAI chat model.

    Key features:
    - Configures an LLM pipeline using GPT-3.5-turbo for interactive Q&A.
    - Utilizes sentence-transformer embeddings for semantic vectorization of documents.
    - Loads and processes PDF documents from specified Google Drive directories.
    - Intelligently chunks documents using RecursiveCharacterTextSplitter with an adjustable
      token threshold to avoid splitting when documents already fit within a larger context window.
    - Builds and persists two Chroma vector store indexes (raw and summary) in a dedicated folder
      ("gpt_chatbot"), which also stores the master CSV log file.
    - Supports reusing existing persisted indexes to avoid unnecessary re-indexing.
    - Implements context retrieval with token truncation to ensure LLM input limits are maintained.
    - Manages conversation history using ConversationBufferWindowMemory for interactive sessions.
    - Provides both batch and interactive modes for prompt-based querying and logging of Q&A sessions.
    - Integrates a ROUGE evaluation method to quantitatively measure summary quality.
===================================================
"""
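
The token-truncation feature listed above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the notebook's implementation: `truncate_context` is a hypothetical helper, and the default whitespace token counter is a stand-in for a real tokenizer.

```python
from typing import Callable, List


def truncate_context(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep retrieved chunks, in rank order, until the token budget is spent."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept


# Hypothetical retrieved chunks, ordered from most to least relevant.
retrieved = ["net income rose", "EPS was higher than expected", "revenue grew"]
print(truncate_context(retrieved, max_tokens=8))
```

Dropping whole chunks keeps each retained passage intact; cutting the last chunk mid-sentence is an alternative when the budget is tight.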



In [2]:
!pip install -q langchain openai chromadb sentence-transformers pypdf datasets rouge-score  > /dev/null 2>&1
!pip install --upgrade langchain_community   > /dev/null 2>&1
!pip install -U langchain-huggingface  > /dev/null 2>&1
!pip install --upgrade openai > /dev/null 2>&1


In [3]:
import os
import re
import shutil
import warnings

import pandas as pd
from datasets import Dataset
from google.colab import drive, userdata
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chains import ConversationalRetrievalChain
from langchain.docstore.document import Document
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferWindowMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_community.chat_models import ChatOpenAI
from langchain_community.llms import HuggingFacePipeline
from rouge_score import rouge_scorer
from transformers import AutoTokenizer

In [4]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Mount Google Drive at /content/drive, forcing a remount if it is already mounted
drive.mount('/content/drive', force_remount=True)
# Read the OpenAI API key from Colab secrets (add OPENAI_API_KEY under the Secrets panel)
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

Mounted at /content/drive


# A class-based implementation of an LLM Retrieval Augmented Generation (RAG) engine
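
The module description lists reuse of persisted indexes as a feature. One way such a check could look is sketched below. The helper name `has_persisted_index` is hypothetical, and the marker file `chroma.sqlite3` is an assumption about how recent Chroma versions lay out a persisted store; verify it against the installed version.

```python
import os


def has_persisted_index(persist_directory: str, marker: str = "chroma.sqlite3") -> bool:
    """Return True if the folder already appears to hold a persisted index."""
    return os.path.isfile(os.path.join(persist_directory, marker))


# Hypothetical use inside a build step (names other than Chroma are placeholders):
# if has_persisted_index(folder):
#     db = Chroma(persist_directory=folder, embedding_function=embeddings)
# else:
#     db = Chroma.from_documents(chunks, embedding=embeddings, persist_directory=folder)
```

Checking for a marker file rather than a non-empty folder matters here because the same folder also holds the CSV log.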

In [5]:
# Suppress all warnings unless DEV_MODE is enabled
DEV_MODE = False
if not DEV_MODE:
    warnings.filterwarnings("ignore")

In [6]:
# --- Configuration ---
CONFIG = {
    "pdf_folders": [
        "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan",
        "/content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs"
    ],
    "persist_directory": "/content/drive/MyDrive/BOE/bank_of_england/data/model_outputs",
    "llm_model_name": "gpt-3.5-turbo",  # Using GPT-3.5-turbo
    "embedding_model_name": "sentence-transformers/all-mpnet-base-v2",
    "max_length": 1024,
    "temperature": 0.1,
    "top_p": 0.8,
    "batch_size": 8,
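    "chunk_size": 1000,  # Characters per chunk (the splitter measures length in characters)
    "chunk_overlap": 100,  # Characters shared between consecutive chunks
    "chunk_threshold": 3000,  # GPT-2 token count above which a document is split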
    "memory_window_k": 10,
    "retriever_search_k": 5
}

# We'll use a dedicated folder "gpt_chatbot" under the persist directory for saving vector stores and CSV logs.
GPT_FOLDER = os.path.join(CONFIG["persist_directory"], "gpt_chatbot")
os.makedirs(GPT_FOLDER, exist_ok=True)

MASTER_AGENT_PROMPT = (
    "You are a highly accurate and detail-oriented assistant specialized in analyzing bank earnings call transcripts.\n"
    "ONLY use the information from the retrieved transcript context. Your final answer must be presented as a bullet-point list.\n\n"
    "Follow EXACTLY this format:\n"
    "Thought: Briefly explain which transcript sections are relevant.\n"
    "Action: Use the appropriate tool (JP_Morgan_RAG or UBS_RAG) by writing e.g. JP_Morgan_RAG(\"<query>\").\n"
    "Observation: Summarize the retrieved context in a few words.\n"
    "Final Answer: Provide a concise bullet-point list with the key sentiments and takeaways.\n\n"
    "Now, answer the following question:\n"
    "{input}\n"
    "Begin!"
)

# --- RAGChatbot Class ---
class RAGChatbot:
    """
    A RAG chatbot that ingests PDF earnings call transcripts, builds vector stores,
    and uses a ConversationalRetrievalChain for Q&A.
    """
    def __init__(self, config, bank: str):
        self.config = config
        self.api_key = os.environ.get("OPENAI_API_KEY")
        self.pdf_folders = config["pdf_folders"]
        self.persist_directory = config["persist_directory"]
        self.max_length = config["max_length"]
        self.batch_size = config["batch_size"]
        self.chunk_size = config["chunk_size"]
        self.chunk_overlap = config["chunk_overlap"]
        self.chunk_threshold = config["chunk_threshold"]
        self.memory_window_k = config["memory_window_k"]
        self.retriever_search_k = config["retriever_search_k"]
        self.bank = bank

        self._setup_llm()
        self._setup_embeddings()
        # The GPT-2 tokenizer is used only to count tokens for the chunking threshold.
        self.split_tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self._load_documents()
        self._build_vector_store()
        self._build_summary_index()
        self._setup_retrieval_chain()

    def _setup_llm(self):
        self.llm = ChatOpenAI(
            model_name=self.config["llm_model_name"],
            temperature=self.config["temperature"],
            max_tokens=self.max_length,
            openai_api_key=self.api_key,
            model_kwargs={"top_p": self.config["top_p"]}
        )

    def _setup_embeddings(self):
        emb_model = self.config["embedding_model_name"]
        self.embeddings = HuggingFaceEmbeddings(model_name=emb_model)

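    def _load_documents(self):
        """Load every PDF in the configured folders and tag each resulting document with its bank and file name."""
        self.documents = []
        for folder in self.pdf_folders:
            # The folder name (e.g. "jpmorgan") doubles as the bank label used to filter retrieval.
            bank = os.path.basename(folder).lower()
            files = [f for f in os.listdir(folder) if f.lower().endswith(".pdf")]
            for file in files:
                path = os.path.join(folder, file)
                try:
                    loader = PyPDFLoader(path, extract_images=False)
                    docs = loader.load_and_split()
                    for doc in docs:
                        doc.metadata["bank"] = bank
                        doc.metadata["source_pdf"] = file
                        doc.metadata["year_quarter"] = "Unknown"  # Placeholder; not parsed from the file
                    self.documents.extend(docs)
                    if DEV_MODE:
                        print(f"Loaded: {file} from {folder}")
                except Exception as e:
                    # Failures are reported only in DEV_MODE; otherwise the file is skipped silently.
                    if DEV_MODE:
                        print(f"Error loading {file}: {e}")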

    def _chunk_document(self, doc: Document) -> list[Document]:
        """Split a document only if its GPT-2 token count exceeds chunk_threshold."""
        tokens = self.split_tokenizer.encode(doc.page_content)
        if len(tokens) > self.chunk_threshold:
            # chunk_size and chunk_overlap are measured in characters, not tokens.
            splitter = RecursiveCharacterTextSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)
            chunks = splitter.split_documents([doc])
            return self._remove_duplicates(chunks)
        return [doc]

    @staticmethod
    def _remove_duplicates(chunks: list[Document]) -> list[Document]:
        seen = set()
        unique = []
        for chunk in chunks:
            text = chunk.page_content.strip()
            if text not in seen:
                seen.add(text)
                unique.append(chunk)
        return unique

    def _build_vector_store(self):
        """Chunk all loaded documents and persist them to a Chroma store in GPT_FOLDER."""
        all_chunks = []
        for doc in self.documents:
            all_chunks.extend(self._chunk_document(doc))
        self.raw_db = Chroma.from_documents(
            all_chunks, embedding=self.embeddings, persist_directory=GPT_FOLDER)
        if DEV_MODE:
            print(f"Built raw vector store with {len(all_chunks)} chunks.")

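    def _build_summary_index(self):
        """Build a second Chroma store under GPT_FOLDER/summaries; it currently indexes the same chunks as the raw store."""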
        all_chunks = []
        for doc in self.documents:
            all_chunks.extend(self._chunk_document(doc))
        self.summary_db = Chroma.from_documents(
            all_chunks, embedding=self.embeddings,
            persist_directory=os.path.join(GPT_FOLDER, "summaries"))
        if DEV_MODE:
            print(f"Built summary vector index with {len(all_chunks)} chunks.")

    def _setup_retrieval_chain(self):
        memory = ConversationBufferWindowMemory(
            k=self.memory_window_k, memory_key="chat_history", return_messages=True)
        self.retrieval_chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.summary_db.as_retriever(
                search_kwargs={"k": self.retriever_search_k, "filter": {"bank": self.bank}}
            ),
            memory=memory,
            verbose=DEV_MODE
        )

    def answer_query(self, query: str) -> str:
        """Run the query through the retrieval chain and return the stripped answer text."""
        response = self.retrieval_chain.invoke({"question": query})
        return response.get("answer", "").strip()


# --- Multi-Agent Instances ---
jpm_folders = [folder for folder in CONFIG["pdf_folders"] if "jpmorgan" in folder.lower()]
ubs_folders = [folder for folder in CONFIG["pdf_folders"] if "ubs" in folder.lower()]

CONFIG_JPM = CONFIG.copy()
CONFIG_JPM["pdf_folders"] = jpm_folders

CONFIG_UBS = CONFIG.copy()
CONFIG_UBS["pdf_folders"] = ubs_folders

jpm_chatbot = RAGChatbot(CONFIG_JPM, bank="jpmorgan")
ubs_chatbot = RAGChatbot(CONFIG_UBS, bank="ubs")

def jpm_tool(query: str) -> str:
    return jpm_chatbot.answer_query(query)

def ubs_tool(query: str) -> str:
    return ubs_chatbot.answer_query(query)

jpm_tool_instance = Tool(
    name="JP_Morgan_RAG",
    func=jpm_tool,
    description="Answers questions about JP Morgan earnings call transcripts."
)

ubs_tool_instance = Tool(
    name="UBS_RAG",
    func=ubs_tool,
    description="Answers questions about UBS earnings call transcripts."
)

master_agent = initialize_agent(
    [jpm_tool_instance, ubs_tool_instance],
    jpm_chatbot.llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=DEV_MODE,
    handle_parsing_errors=True,
    agent_kwargs={"prefix": MASTER_AGENT_PROMPT}
)


# --- ROUGE Evaluation Method ---
def calculate_rouge_scores(generated_summary: str, reference_summary: str) -> dict:
    """
    Calculate and print ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L) between a generated summary and a reference summary.
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference_summary, generated_summary)
    print("ROUGE Scores:")
    for key, score in scores.items():
        print(f"{key}: precision={score.precision:.2f}, recall={score.recall:.2f}, fmeasure={score.fmeasure:.2f}")
    return scores


# --- Batch Mode Chatbot Loop with CSV Logging ---
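def run_master_agent_batch(questions: list[str]):
    """
    Run each question through the master agent and append the result to the master CSV log.
    Questions already present in the log are skipped.
    """
    print("Master Agent Chatbot Batch Mode (processing multiple questions)")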
    csv_file = os.path.join(GPT_FOLDER, "master_agent_results.csv")
    if os.path.exists(csv_file):
        df = pd.read_csv(csv_file)
    else:
        cols = ["Year/Quarter", "Question", "Master Answer", "Full Output"] if DEV_MODE else ["Year/Quarter", "Question", "Master Answer"]
        df = pd.DataFrame(columns=cols)

    for question in questions:
        if question in df["Question"].values:
            if DEV_MODE:
                print(f"Skipping already processed question: {question}")
            continue

        answer = master_agent.run(question)
        year_quarter = "Unknown"
        if DEV_MODE:
            full_output = "Full chain output logged in console."
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": question, "Master Answer": answer, "Full Output": full_output}])
        else:
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": question, "Master Answer": answer}])

        df = pd.concat([df, new_row], ignore_index=True)
        df.to_csv(csv_file, index=False)
        print(f"\nQuestion: {question}")
        print(f"Master Agent Answer:\n{answer}\n")
        print(f"Results saved to {csv_file}\n{'-'*60}")


# --- Interactive Mode Chatbot Loop with Follow-up Option ---
def run_master_agent_interactive():
    """
    Runs an interactive chatbot loop where you can ask a question and then add follow-up questions
    based on the output of previous prompts.
    """
    print("Master Agent Chatbot Interactive Mode (type 'exit' to quit)")
    csv_file = os.path.join(GPT_FOLDER, "master_agent_results.csv")
    if os.path.exists(csv_file):
        df = pd.read_csv(csv_file)
    else:
        cols = ["Year/Quarter", "Question", "Master Answer", "Full Output"] if DEV_MODE else ["Year/Quarter", "Question", "Master Answer"]
        df = pd.DataFrame(columns=cols)

    while True:
        user_q = input("You: ")
        if user_q.lower() == "exit":
            print("Exiting Master Agent Chatbot. Goodbye!")
            break

        answer = master_agent.run(user_q)
        year_quarter = "Unknown"
        if DEV_MODE:
            full_output = "Full chain output logged in console."
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": user_q, "Master Answer": answer, "Full Output": full_output}])
        else:
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": user_q, "Master Answer": answer}])
        df = pd.concat([df, new_row], ignore_index=True)
        df.to_csv(csv_file, index=False)
        print(f"\nMaster Agent Answer:\n{answer}\n")

        follow = input("Would you like to ask a follow-up question? (yes/no): ")
        while follow.lower() == "yes":
            followup_q = input("Follow-up question: ")
            followup_answer = master_agent.run(followup_q)
            new_row = pd.DataFrame([{"Year/Quarter": year_quarter, "Question": followup_q, "Master Answer": followup_answer}])
            df = pd.concat([df, new_row], ignore_index=True)
            df.to_csv(csv_file, index=False)
            print(f"\nFollow-up Answer:\n{followup_answer}\n")
            follow = input("Would you like to ask another follow-up question? (yes/no): ")


# --- Example Usage: Batch Mode and ROUGE Evaluation ---
if __name__ == "__main__":
    # Run batch mode processing
    questions = [
        "Summarise JPMorgan’s 3Q23 results including net income, EPS, revenue, ROTCE, First Republic impact, and Basel III strategy."
    ]
    run_master_agent_batch(questions)

    # For demonstration: evaluate ROUGE score for a generated summary vs. a reference.
    # Replace these with your actual generated and reference summaries.
    generated_summary = "• Net income: $13.2B • EPS: $4.33 • Revenue: $40.7B • ROTCE: 22% • First Republic: $2.2B revenue, $858M expenses, $1.1B net income • Basel III Endgame: ~30% RWA increase (~$500B), capital up ~25% • Strategic: conservative capital, moderate buybacks"
    reference_summary = (
        "• Net income: $13.2B\n"
        "• EPS: $4.33\n"
        "• Revenue: $40.7B\n"
        "• ROTCE: 22%\n"
        "• First Republic: revenue $2.2B, expenses $858M, net income $1.1B\n"
        "• Basel III Endgame: ~30% RWA increase (~$500B), capital up ~25%\n"
        "• Strategic: conservative capital, moderate buybacks"
    )
    calculate_rouge_scores(generated_summary, reference_summary)


Master Agent Chatbot Batch Mode (processing multiple questions)

Question: Summarise JPMorgan’s 3Q23 results including net income, EPS, revenue, ROTCE, First Republic impact, and Basel III strategy.
Master Agent Answer:
{
  "action": "JP_Morgan_RAG",
  "action_input": "Summarize JPMorgan’s 3Q23 results including net income, EPS, revenue, ROTCE, First Republic impact, and Basel III strategy."
}

Results saved to /content/drive/MyDrive/BOE/bank_of_england/data/model_outputs/gpt_chatbot/master_agent_results.csv
------------------------------------------------------------
ROUGE Scores:
rouge1: precision=1.00, recall=1.00, fmeasure=1.00
rouge2: precision=0.81, recall=0.81, fmeasure=0.81
rougeL: precision=0.89, recall=0.89, fmeasure=0.89
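
The gap between a perfect ROUGE-1 and lower ROUGE-2 and ROUGE-L scores comes from word order: the generated summary lists the First Republic figures in a different order from the reference. The sketch below is a simplified ROUGE-1 that shows why unigram overlap is blind to ordering. It is not the `rouge_score` implementation, which also normalizes tokens and applies stemming; `rouge1` here is a hypothetical helper using plain whitespace tokens.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 on lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f1}


# Same words in a different order: every unigram still matches.
print(rouge1("revenue $2.2B expenses $858M", "expenses $858M revenue $2.2B"))
```

Bigram and longest-common-subsequence scores drop for reordered text because adjacent pairs and the in-order sequence no longer line up.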


# Run the Chatbot interactively

---



In [7]:
# run_master_agent_interactive()
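
The batch output earlier in this notebook logs a tool-call payload (a JSON object with `action` and `action_input`) as the master answer instead of a bullet-point summary. A small guard along the following lines could flag such rows before they are written to the CSV in either mode. It is a sketch only: `is_unexecuted_action` is a hypothetical helper, and it assumes the unexecuted payload is valid JSON.

```python
import json


def is_unexecuted_action(answer: str) -> bool:
    """Return True if the text parses as a JSON object with an 'action' key."""
    try:
        payload = json.loads(answer)
    except (TypeError, ValueError):
        return False  # not JSON, so treat it as an ordinary text answer
    return isinstance(payload, dict) and "action" in payload


logged = '{"action": "JP_Morgan_RAG", "action_input": "Summarize 3Q23 results."}'
print(is_unexecuted_action(logged))
```

A flagged answer could be retried or sent straight to the named tool rather than being stored as a final answer.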