<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_RAG_jpmorgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook implements a Retrieval-Augmented Generation (RAG) system using JP Morgan
    earnings transcripts as the source data. It builds on our existing data engineering pipeline
    by reading raw PDF files stored in Google Drive, extracting text using LangChain’s PyPDFLoader,
    and indexing the content with Chroma and Sentence Transformer embeddings. A text generation model
    (Flan-T5) is then used to answer queries based on the retrieved context, and the functionality
    is wrapped as a tool for a LangChain agent to handle more complex interactions.

===================================================
"""




# Step 1: Environment Setup & Library Imports

## Import required libraries

In [19]:
# Install necessary packages (uncomment if needed)
# !pip install -U langchain-community
# !pip install pypdf
# !pip install chromadb

import os
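# The loader, embedding, and vector store integrations are provided by langchain-community.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
from langchain.agents import initialize_agent, Tool
from langchain_community.vectorstores import Chroma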
from google.colab import drive
from google.colab import userdata

# Set your Hugging Face API token via Colab user secrets or replace as needed.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HF')  # <-- Ensure you have your token stored


# Step 2: Mount Google Drive and Identify Data

In [20]:
# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Define the directory containing your PDFs
raw_dir = "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan"

# List all PDF files in the raw directory
pdf_files = [os.path.join(raw_dir, file) for file in os.listdir(raw_dir) if file.endswith(".pdf")]
print(f"Found {len(pdf_files)} PDF files in {raw_dir}")


Mounted at /content/drive
Found 8 PDF files in /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan


# Step 3: Load and Process PDF Documents

In [21]:
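# Collect the loaded pages from all PDFs. PyPDFLoader returns one Document per page,
# so the "chunks" reported below are pages, not smaller text splits.
documents = []

# Loop over each PDF file, load its pages, and drop the first page when it looks like a title page.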
for pdf_file in pdf_files:
    try:
        loader = PyPDFLoader(pdf_file)
        docs = loader.load()
        # Drop the first page if it contains both title-page marker strings
        if docs and "JPMorgan Chase" in docs[0].page_content and "Earnings Call Transcript" in docs[0].page_content:
            docs = docs[1:]
        documents.extend(docs)
        print(f"Loaded {len(docs)} chunks from {os.path.basename(pdf_file)}")
    except Exception as e:
        print(f"Error processing {pdf_file}: {e}")


Loaded 18 chunks from 2q23-earnings-transcript.pdf
Loaded 18 chunks from jpm-3q23-earnings-call-transcript.pdf
Loaded 14 chunks from jpm-4q23-earnings-call-transcript.pdf
Loaded 17 chunks from 1q23-earnings-transcript.pdf
Loaded 18 chunks from jpm-1q24-earnings-call-transcript.pdf
Loaded 15 chunks from jpm-2q24-earnings-call-transcript-final.pdf
Loaded 16 chunks from 4q24-earnings-transcript.pdf
Loaded 19 chunks from jpmc-third-quarter-2024-earnings-conference-call-transcript.pdf
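
Each "chunk" counted above is a whole PDF page, so the retriever later works with page-sized documents. If finer-grained retrieval is wanted, pages can be split into smaller overlapping windows before indexing. The following is a minimal sketch using plain Python; `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names and untuned defaults, not part of this notebook's pipeline. LangChain also ships its own text splitters (for example `RecursiveCharacterTextSplitter`), which would be the more usual choice in practice.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words.

    chunk_size and overlap are hypothetical defaults, not tuned values.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Example: a synthetic 450-word "page" split into overlapping windows.
page_text = " ".join(f"w{i}" for i in range(450))
chunks = split_into_chunks(page_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means a sentence that straddles a window boundary still appears intact in at least one chunk, at the cost of some duplicated text in the index.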


# Step 4: Initialize Embeddings & Build the Vector Store

In [22]:
# Initialize Sentence Transformer embeddings (using the all-MiniLM-L6-v2 model)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create a Chroma vector store from the processed documents
vectorstore = Chroma.from_documents(documents, embeddings, collection_name="jpm_transcripts")
print("Chroma vector store successfully created from the JP Morgan transcripts.")


Chroma vector store successfully created from the JP Morgan transcripts.
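
Conceptually, the vector store holds one embedding per document and answers a query by ranking documents by how close their embeddings are to the query embedding. The sketch below illustrates that idea with cosine similarity on toy vectors. It is an illustration only: the metric Chroma actually uses depends on how the collection is configured, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; all-MiniLM-L6-v2 produces 384-dimensional vectors.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.05, 0.0]
for index, score in top_k(query_vec, doc_vecs, k=2):
    print(index, round(score, 3))
```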


# Step 5: Configure the Text Generation Model

In [23]:
# Load the text generation model (Flan-T5 Base) on the first CUDA GPU (device=0); use device=-1 to run on CPU
qa_model = pipeline("text2text-generation", model="google/flan-t5-base", device=0)


Device set to use cuda:0


# Step 6: Define the RAG Function

In [35]:
def generate_answer(query: str) -> str:
    # Retrieve the 10 most similar documents (each indexed document is one PDF page).
    retrieved_docs = vectorstore.similarity_search(query, k=10)

    # Filter out very short excerpts to ensure quality context.
    informative_docs = [doc for doc in retrieved_docs if len(doc.page_content.split()) > 20]

    # Remove exact duplicates by keying on the page text; near-duplicates are not detected.
    unique_docs = {doc.page_content: doc for doc in informative_docs}.values()

    # Assemble context with clear delimiters and optional metadata (e.g., source information).
    context = "\n---\n".join([
        f"{doc.metadata.get('source', 'Transcript')}: {doc.page_content}"
        for doc in unique_docs
    ])

    # Build the prompt: task instructions, the user's question, then the retrieved excerpts.
    prompt = (
        "Below are excerpts from various JP Morgan earnings call transcripts. "
        "Analyze these excerpts and provide a detailed summary of key insights regarding quarterly performance. "
        "Include specific performance metrics, trends, improvements or declines, and any challenges or notable observations mentioned. "
        "Compare and contrast perspectives from different transcripts to highlight emerging patterns and unique differences, "
        "ensuring that the summary is comprehensive and avoids redundancy.\n\n"
        f"Question: {query}\n\n"
        "Excerpts:\n" + context + "\n\nSummary:"
    )

    # Invoke the text generation model with sampling enabled. Optionally, increase max_length for more detailed responses.
    result = qa_model(prompt, max_length=768, temperature=0.8, do_sample=True)
    return result[0]['generated_text']

# Test the RAG function with a sample query.
test_query = "What insights about quarterly performance are highlighted in the transcripts?"
print("RAG Answer:", generate_answer(test_query))


RAG Answer: JP Morgan earnings call transcripts.
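
The one-line answer above suggests the model is not making use of the retrieved excerpts. One plausible cause (an assumption, not verified here) is prompt length: ten full transcript pages run to several thousand words, well beyond the roughly 512-token inputs that T5-family models such as Flan-T5 are usually run with. A possible mitigation is to pack only as many top-ranked excerpts as fit a budget and to place the question near the end of the prompt. The sketch below uses whitespace word counts as a rough stand-in for tokens; `pack_context`, `build_prompt`, and the `max_words` value are hypothetical.

```python
from typing import List


def pack_context(excerpts: List[str], max_words: int = 300, separator: str = "\n---\n") -> str:
    """Keep excerpts in ranked order until the word budget is used up.

    Word counts are only a rough stand-in for model tokens; max_words is a
    hypothetical budget, not a measured limit.
    """
    selected = []
    used = 0
    for excerpt in excerpts:
        remaining = max_words - used
        if remaining <= 0:
            break
        words = excerpt.split()
        if len(words) > remaining:
            words = words[:remaining]  # truncate the last excerpt to fit the budget
        selected.append(" ".join(words))
        used += len(words)
    return separator.join(selected)


def build_prompt(query: str, excerpts: List[str], max_words: int = 300) -> str:
    """Assemble a short prompt: instruction, budgeted excerpts, then the question."""
    context = pack_context(excerpts, max_words=max_words)
    return (
        "Answer the question using only the excerpts below.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Three synthetic 200-word excerpts, already in ranked order.
ranked = ["alpha " * 200, "beta " * 200, "gamma " * 200]
prompt = build_prompt("What changed quarter over quarter?", ranked)
print(len(prompt.split()))
```

A real implementation would count tokens with the model's own tokenizer rather than splitting on whitespace, since token counts are usually higher than word counts.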


# Step 7: Build a Custom LLM Wrapper

In [30]:
from langchain.llms.base import LLM

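class CustomLLM(LLM):
    """A custom LLM wrapper for our Flan-T5 pipeline."""

    def _call(self, prompt: str, stop=None):
        # `stop` is accepted for interface compatibility but is not applied to the output.
        # `temperature` is only used when do_sample=True is also passed.
        result = qa_model(prompt, max_length=512, temperature=0.7)
        return result[0]['generated_text']

    @property
    def _identifying_params(self):
        return {"name": "CustomFlanT5"}

    @property
    def _llm_type(self) -> str:
        # The LLM base class declares _llm_type as a property.
        return "custom"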

# Initialize our custom LLM
custom_llm = CustomLLM()
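
The wrapper's `_call` receives a `stop` argument but does not use it. ReAct-style agents typically pass a stop sequence such as `"\nObservation:"` so that the model does not invent tool results itself. One way to honour it is to cut the generated text at the first stop sequence before returning it. The helper below is a minimal sketch; `truncate_at_stop` is a hypothetical name and is not wired into the class above.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    if not stop:
        return text
    cut = len(text)
    for sequence in stop:
        if not sequence:
            continue  # ignore empty stop strings
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generated = "Action: JP Morgan RAG\nAction Input: revenue trends\nObservation: made-up result"
print(truncate_at_stop(generated, stop=["\nObservation:"]))
```

Inside `_call`, this would be applied to `result[0]['generated_text']` just before the return.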


# Step 8: Initialize and Query the LangChain Agent

In [31]:
def rag_tool(query: str) -> str:
    return generate_answer(query)

tools = [
    Tool(
        name="JP Morgan RAG",
        func=rag_tool,
        description="Provides detailed insights using JP Morgan earnings transcripts."
    )
]

# Initialize the LangChain agent using our custom LLM and the defined tool
agent = initialize_agent(tools, custom_llm, agent="zero-shot-react-description", verbose=True)

# Query the agent with a sample question
agent_query = "Summarize the key performance trends mentioned in the JP Morgan earnings transcripts."
agent_answer = agent.run(agent_query)
print("Agent Answer:", agent_answer)




> Entering new AgentExecutor chain...




ValueError: An output parsing error occurred. In order to pass this error back to the agent and have it try again, pass `handle_parsing_errors=True` to the AgentExecutor. This is the error: Could not parse LLM output: `The following are key performance trends mentioned in the JP Morgan earnings transcripts:`
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE 
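
The error shows that the model replied with free text, whereas a zero-shot ReAct agent expects each reply to follow a fixed text format: either an `Action:` / `Action Input:` pair or a `Final Answer:` line. A small model such as Flan-T5 Base may not follow that format reliably. The sketch below is a simplified stand-in for such a parser (it is not LangChain's implementation; `parse_react_output` is a hypothetical name) and shows why the reply above cannot be interpreted. As the error message itself notes, passing `handle_parsing_errors=True` to the AgentExecutor lets the agent retry instead of raising.

```python
import re
from typing import Dict

FINAL_ANSWER_PREFIX = "Final Answer:"
ACTION_PATTERN = re.compile(
    r"Action\s*:\s*(.*?)\s*Action\s*Input\s*:\s*(.*)", re.DOTALL
)


def parse_react_output(text: str) -> Dict[str, str]:
    """Classify an LLM reply as a final answer or a tool call, else raise."""
    if FINAL_ANSWER_PREFIX in text:
        answer = text.split(FINAL_ANSWER_PREFIX, 1)[1].strip()
        return {"type": "finish", "output": answer}
    match = ACTION_PATTERN.search(text)
    if match:
        return {
            "type": "action",
            "tool": match.group(1).strip(),
            "tool_input": match.group(2).strip(),
        }
    raise ValueError(f"Could not parse LLM output: `{text}`")


# A reply in the expected format parses into a tool call.
good = "Thought: I should look this up.\nAction: JP Morgan RAG\nAction Input: key performance trends"
print(parse_react_output(good))

# The reply from the failed run has neither marker, so parsing fails.
bad = "The following are key performance trends mentioned in the JP Morgan earnings transcripts:"
try:
    parse_react_output(bad)
except ValueError as err:
    print(err)
```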