<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_RAG_jpmorgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook implements a Retrieval-Augmented Generation (RAG) system using JP Morgan
    earnings transcripts as the source data. It builds on our existing data engineering pipeline
    by reading raw PDF files stored in Google Drive, extracting text using LangChain’s PyPDFLoader,
    and indexing the content with CHROMA and Sentence Transformer embeddings. A text generation model
    (Flan-T5) is then used to answer queries based on the retrieved context, and the functionality
    is wrapped as a tool for a LangChain agent to handle more complex interactions.

===================================================
"""




## Set Up and Import Libraries

In [8]:
!pip install -U langchain-community
!pip install pypdf
!pip install chromadb



## Import required libraries

In [9]:
import os
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
from langchain.agents import initialize_agent, Tool
from langchain.llms import HuggingFaceHub
from langchain.vectorstores import Chroma
from google.colab import drive
from langchain.agents import initialize_agent, Tool

## Mount Google Drive

In [10]:
drive.mount('/content/drive', force_remount=True)

# Set the raw directory (adjust if necessary)
raw_dir = "/content/drive/MyDrive/BOE/bank_of_england/data/raw/"

# List all PDF files in the raw directory
pdf_files = [os.path.join(raw_dir, file) for file in os.listdir(raw_dir) if file.endswith(".pdf")]
print(f"Found {len(pdf_files)} PDF files in {raw_dir}")


Mounted at /content/drive
Found 8 PDF files in /content/drive/MyDrive/BOE/bank_of_england/data/raw/


## Load and Process the PDFs

In [11]:
# Initialize an empty list for documents
documents = []

# Loop over each PDF file and load its content
for pdf_file in pdf_files:
    try:
        loader = PyPDFLoader(pdf_file)
        docs = loader.load()
        documents.extend(docs)
        print(f"Loaded {len(docs)} chunks from {os.path.basename(pdf_file)}")
    except Exception as e:
        print(f"Error processing {pdf_file}: {e}")

print(f"\nTotal document chunks loaded: {len(documents)}")


Loaded 16 chunks from 4q24-earnings-transcript.pdf
Loaded 17 chunks from 1q23-earnings-transcript.pdf
Loaded 18 chunks from 2q23-earnings-transcript.pdf
Loaded 18 chunks from jpm-1q24-earnings-call-transcript.pdf
Loaded 15 chunks from jpm-2q24-earnings-call-transcript-final.pdf
Loaded 18 chunks from jpm-3q23-earnings-call-transcript.pdf
Loaded 14 chunks from jpm-4q23-earnings-call-transcript.pdf
Loaded 19 chunks from jpmc-third-quarter-2024-earnings-conference-call-transcript.pdf

Total document chunks loaded: 135


## Initialize Sentence Transformer embeddings

In [12]:
# Initialize Sentence Transformer embeddings
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## Create a Chroma vector store from the loaded documents

In [13]:
# Create a Chroma vector store from the loaded documents.
vectorstore = Chroma.from_documents(documents, embeddings, collection_name="jpm_transcripts")
print("Chroma vector store created from the JP Morgan transcripts.")

Chroma vector store created from the JP Morgan transcripts.


## Create a Chroma vector store from the loaded documents

In [14]:
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents, embeddings, collection_name="jpm_transcripts")
print("Chroma vector store created from the JP Morgan transcripts.")


Chroma vector store created from the JP Morgan transcripts.


## Load the Flan-T5 model for text generation using Hugging Face's transformers pipeline

In [15]:
from transformers import pipeline
qa_model = pipeline("text2text-generation", model="google/flan-t5-base", device=0)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Device set to use cpu


## Define a Retrieval-Augmented Generation (RAG) function

In [25]:
def generate_answer(query):
    # Retrieve more document chunks for a broader context
    docs = vectorstore.similarity_search(query, k=6)
    # Filter out chunks that are very short (e.g., less than 20 words)
    informative_docs = [doc for doc in docs if len(doc.page_content.split()) > 20]
    context = " ".join([doc.page_content for doc in informative_docs])

    # Create a detailed prompt instructing the model to synthesize insights on performance
    prompt = (
        "Below are excerpts from JP Morgan earnings call transcripts. "
        "Please analyze these excerpts and provide a detailed summary of the key insights regarding quarterly performance. "
        "Include any performance metrics, trends, improvements or declines, and significant factors mentioned. "
        "If available, mention any specific challenges or notable observations about quarterly performance.\n\n"
        "Transcript Excerpts:\n"
        f"{context}\n\n"
        "Summary:"
    )

    # Optionally, increase max_length to allow a longer response
    result = qa_model(prompt, max_length=512, temperature=0.7)
    return result[0]['generated_text']

## Test the RAG function with an example query

In [26]:
# Test the updated RAG function with your query
test_query = "What insights about quarterly performance are highlighted in the transcripts?"
print("RAG Answer:", generate_answer(test_query))



RAG Answer: JP Morgan earnings call transcripts


## Wrap the RAG function as a tool for a LangChain agent

In [19]:
# IMPORTANT: Replace 'YOUR_HF_API_TOKEN' with your actual Hugging Face API token or set it as an environment variable.
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "YOUR_HF_API_TOKEN"  # <-- Replace with your token

from langchain_huggingface import HuggingFaceEndpoint
from langchain.agents import initialize_agent, Tool

# Initialize the LLM endpoint using the updated HuggingFaceEndpoint.
llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/google/flan-t5-base",
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
    model_kwargs={"temperature": 0.7}
)

# Define a tool that wraps our RAG function.
def rag_tool(query: str) -> str:
    return generate_answer(query)

tools = [
    Tool(
        name="JP Morgan RAG",
        func=rag_tool,
        description="Answers questions using JP Morgan earnings transcripts as context."
    )
]

# Create the agent using the defined tool.
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

# Use the agent to answer a sample query.
agent_query = "Summarize the key performance trends mentioned in the JP Morgan earnings transcripts."
agent_answer = agent.run(agent_query)
print("Agent Answer:", agent_answer)

ModuleNotFoundError: No module named 'langchain_huggingface'