<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_RAG_jpmorgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook implements a Retrieval-Augmented Generation (RAG) system using JP Morgan
    earnings transcripts as the source data. It builds on our existing data engineering pipeline
    by reading raw PDF files stored in Google Drive, extracting text using LangChain’s PyPDFLoader,
    and indexing the content with Chroma and Sentence Transformer embeddings. A text generation model
    (Flan-T5) is then used to answer queries based on the retrieved context, and the functionality
    is wrapped as a tool for a LangChain agent to handle more complex interactions.

===================================================
"""




## Set Up and Import Libraries

In [2]:
# !pip install -U langchain-community
# !pip install pypdf
# !pip install chromadb

## Import required libraries

In [3]:
import os

from google.colab import drive
from langchain.agents import initialize_agent, Tool
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from transformers import pipeline

## Mount Google Drive

In [4]:
drive.mount('/content/drive', force_remount=True)

# Set the raw directory (adjust if necessary)
raw_dir = "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan"

# List all PDF files in the raw directory
pdf_files = [os.path.join(raw_dir, file) for file in os.listdir(raw_dir) if file.endswith(".pdf")]
print(f"Found {len(pdf_files)} PDF files in {raw_dir}")


Mounted at /content/drive
Found 8 PDF files in /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan
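
The listing above matches only a lowercase `.pdf` suffix and inherits whatever order `os.listdir` returns, which is not guaranteed. The following is a minimal sketch of a more defensive listing. The helper name `list_pdfs` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_pdfs(directory: str) -> list[str]:
    """Return sorted paths of PDF files in `directory` (case-insensitive suffix match)."""
    root = Path(directory)
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Sorting makes the load order reproducible between runs, which helps when comparing the per-file counts printed in the next step.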


## Load and Process the PDFs

In [5]:
# Initialize an empty list for documents
documents = []

# Loop over each PDF file and load its content. PyPDFLoader returns one Document per page,
# so the "chunks" reported below are pages; the first page is skipped when it looks like a title page.
for pdf_file in pdf_files:
    try:
        loader = PyPDFLoader(pdf_file)
        docs = loader.load()
        # Skip the first page if it contains both title-page markers
        if docs and "JPMorgan Chase" in docs[0].page_content and "Earnings Call Transcript" in docs[0].page_content:
            docs = docs[1:]
        documents.extend(docs)
        print(f"Loaded {len(docs)} chunks from {os.path.basename(pdf_file)} (header skipped if detected)")
    except Exception as e:
        print(f"Error processing {pdf_file}: {e}")




Loaded 18 chunks from 2q23-earnings-transcript.pdf (header skipped if detected)
Loaded 18 chunks from jpm-3q23-earnings-call-transcript.pdf (header skipped if detected)
Loaded 14 chunks from jpm-4q23-earnings-call-transcript.pdf (header skipped if detected)
Loaded 17 chunks from 1q23-earnings-transcript.pdf (header skipped if detected)
Loaded 18 chunks from jpm-1q24-earnings-call-transcript.pdf (header skipped if detected)
Loaded 15 chunks from jpm-2q24-earnings-call-transcript-final.pdf (header skipped if detected)
Loaded 16 chunks from 4q24-earnings-transcript.pdf (header skipped if detected)
Loaded 19 chunks from jpmc-third-quarter-2024-earnings-conference-call-transcript.pdf (header skipped if detected)
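
Most of the filenames above carry a quarter tag such as `2q23`, which could be attached to each page as metadata so that retrieved excerpts can be traced to a reporting period. The sketch below is one possible way to parse that tag. The helper `parse_quarter` is hypothetical, and it assumes two-digit years belong to the 2000s. Names without such a tag (for example the `third-quarter-2024` file) return `None` and would need separate handling.

```python
import re
from typing import Optional, Tuple

# Matches short quarter tags such as "2q23" or "4Q24" anywhere in a filename.
_QUARTER_TAG = re.compile(r"(?<!\d)([1-4])q(\d{2})(?!\d)", re.IGNORECASE)


def parse_quarter(filename: str) -> Optional[Tuple[int, int]]:
    """Return (year, quarter) parsed from a tag like '2q23', or None if no tag is present."""
    match = _QUARTER_TAG.search(filename)
    if match is None:
        return None
    quarter = int(match.group(1))
    year = 2000 + int(match.group(2))  # assumption: two-digit years are in the 2000s
    return year, quarter
```

The returned tuple could be written into each page's `metadata` dictionary before indexing.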


## Initialize Sentence Transformer embeddings

In [6]:
# Initialize Sentence Transformer embeddings
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
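
The `all-MiniLM-L6-v2` model maps each piece of text to a 384-dimensional vector, and retrieval works by comparing the query vector with the stored page vectors. Cosine similarity is one common comparison. The metric Chroma actually applies depends on how the collection is configured, so the sketch below only illustrates the idea, using toy three-dimensional vectors in place of real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only).
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]
far_vec = [0.0, 1.0, 0.0]
```

A page whose vector points in nearly the same direction as the query (like `close_vec`) scores higher than an unrelated one (like `far_vec`).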



## Create a Chroma vector store from the loaded documents

In [7]:
# Create a Chroma vector store from the loaded documents.
vectorstore = Chroma.from_documents(documents, embeddings, collection_name="jpm_transcripts")
print("Chroma vector store created from the JP Morgan transcripts.")

Chroma vector store created from the JP Morgan transcripts.
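
Re-running the indexing cell against the same `collection_name` can add the same pages a second time, so a similarity search may return duplicates. One way to guard against this is to derive a deterministic ID from each page's text. This is a sketch under assumptions: the helpers `content_ids` and `drop_duplicates` are hypothetical, and how Chroma treats an ID that already exists should be checked against the installed version.

```python
import hashlib
from typing import Iterable, List


def content_ids(texts: Iterable[str]) -> List[str]:
    """Return one deterministic ID per text, derived from a SHA-256 hash of its content."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


def drop_duplicates(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique
```

`Chroma.from_documents` takes an optional `ids` argument, so IDs built from `doc.page_content` could be passed there after de-duplicating the pages.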



## Load the Flan-T5 model for text generation using Hugging Face's transformers pipeline

In [9]:
from transformers import pipeline
qa_model = pipeline("text2text-generation", model="google/flan-t5-base", device=0)


Device set to use cuda:0
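
Passing `device=0` assumes that a CUDA GPU is attached to the runtime. A small fallback, sketched below with a hypothetical helper `pick_device`, chooses the CPU when no GPU is visible. It relies on the `transformers` pipeline convention that `-1` means CPU and a non-negative integer is a CUDA device index.

```python
def pick_device() -> int:
    """Return 0 when a CUDA GPU is visible to PyTorch, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # without PyTorch there is no GPU to select
    return 0 if torch.cuda.is_available() else -1
```

The result could then be passed as `device=pick_device()` when creating the pipeline, at the cost of slower inference on CPU.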


## Define a Retrieval-Augmented Generation (RAG) function

In [10]:
def generate_answer(query):
    # Retrieve the six most similar pages to give the model broader context
    docs = vectorstore.similarity_search(query, k=6)
    # Drop pages with 20 words or fewer; they rarely carry useful content
    informative_docs = [doc for doc in docs if len(doc.page_content.split()) > 20]
    context = " ".join(doc.page_content for doc in informative_docs)

    # Put the instruction and the user's question before the excerpts so they
    # survive when an over-long prompt is truncated from the end.
    prompt = (
        "Below are excerpts from JP Morgan earnings call transcripts. "
        "Please analyze these excerpts and provide a detailed summary of the key insights regarding quarterly performance. "
        "Include any performance metrics, trends, improvements or declines, and significant factors mentioned. "
        "If available, mention any specific challenges or notable observations about quarterly performance.\n\n"
        f"Question: {query}\n\n"
        "Transcript Excerpts:\n"
        f"{context}\n\n"
        "Summary:"
    )

    # do_sample=True is required for temperature to take effect;
    # truncation=True clips prompts that exceed the model's input limit.
    result = qa_model(prompt, max_length=512, do_sample=True, temperature=0.7, truncation=True)
    return result[0]['generated_text']
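
Six full transcript pages are likely to exceed the input window of `flan-t5-base`, whose tokenizer defaults to a 512-token limit. That may leave the model with little usable context. The sketch below caps the context by word count before the prompt is built. The helper `build_context` is hypothetical, word count is only a rough proxy for tokens, and the default budget of 300 words is an illustrative assumption rather than a tuned value.

```python
from typing import List


def build_context(chunks: List[str], max_words: int = 300) -> str:
    """Join chunks in order until the word budget is used up, trimming the last one to fit."""
    selected = []
    used = 0
    for chunk in chunks:
        remaining = max_words - used
        if remaining <= 0:
            break
        words = chunk.split()
        if len(words) > remaining:
            # Trim the final chunk rather than dropping it entirely.
            words = words[:remaining]
        selected.append(" ".join(words))
        used += len(words)
    return "\n\n".join(selected)
```

Because similarity search returns the closest matches first, trimming from the end keeps the most relevant text.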

## Test the RAG function with an example query

In [11]:
# Test the updated RAG function with your query
test_query = "What insights about quarterly performance are highlighted in the transcripts?"
print("RAG Answer:", generate_answer(test_query))



RAG Answer: JP Morgan earnings call transcripts.


## Wrap the RAG function as a tool for a LangChain agent

In [13]:
!pip install langchain-huggingface


In [14]:
from google.colab import userdata


In [17]:
# Read the Hugging Face API token from the Colab secret named 'HF' and expose it as an environment variable.
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HF')

from langchain_huggingface import HuggingFaceEndpoint
from langchain.agents import initialize_agent, Tool

# Initialize the LLM endpoint with HuggingFaceEndpoint,
# passing 'temperature' and 'task' directly as keyword arguments.
llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/google/flan-t5-base",
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
    temperature=0.7,
    task="text2text-generation"
)

# Define a tool that wraps our RAG function.
def rag_tool(query: str) -> str:
    return generate_answer(query)

tools = [
    Tool(
        name="JP Morgan RAG",
        func=rag_tool,
        description="Answers questions using JP Morgan earnings transcripts as context."
    )
]

# Create the agent using the defined tool.
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

# Use the agent to answer a sample query.
agent_query = "Summarize the key performance trends mentioned in the JP Morgan earnings transcripts."
agent_answer = agent.run(agent_query)
print("Agent Answer:", agent_answer)



> Entering new AgentExecutor chain...




ValueError: Task text2text-generation has no recommended model. Please specify a model explicitly. Visit https://huggingface.co/tasks for more info.
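
Independently of the endpoint error above, the tool wrapper itself can be checked without any LLM. Conceptually, the agent picks a tool by name and passes it a string. The sketch below is a plain-Python stand-in for that dispatch step, not LangChain's implementation. `SimpleTool`, `run_tool`, and the stub `fake_rag` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SimpleTool:
    name: str
    func: Callable[[str], str]
    description: str


def run_tool(tools: Dict[str, SimpleTool], tool_name: str, tool_input: str) -> str:
    """Look up a tool by name and call it, returning a readable message on failure."""
    tool = tools.get(tool_name)
    if tool is None:
        return f"Unknown tool: {tool_name}"
    try:
        return tool.func(tool_input)
    except Exception as exc:
        return f"Tool '{tool_name}' failed: {exc}"


def fake_rag(query: str) -> str:
    # Stub standing in for generate_answer so the sketch runs without a model.
    return f"(stub answer for: {query})"


registry = {
    "JP Morgan RAG": SimpleTool(
        name="JP Morgan RAG",
        func=fake_rag,
        description="Answers questions using JP Morgan earnings transcripts as context.",
    )
}
```

Swapping `fake_rag` for the real `generate_answer` would exercise the retrieval path directly, which helps separate tool problems from endpoint problems.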

In [None]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 2.0

Description:
    This notebook implements a Retrieval-Augmented Generation (RAG) system using cleaned
    data from Bank of England projects. The data is stored in two CSV files – one containing the management
    discussion and one containing the questions and answers. The notebook loads these CSV files from Google Drive,
    converts each row into a Document object, and indexes them using a Chroma vector store with Sentence Transformer embeddings.
    A text generation model (Flan-T5) is then used to answer queries based on the retrieved context.
===================================================
"""

# Step 1: Mount Google Drive (if running in Google Colab)
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Step 2: Import necessary libraries
import pandas as pd
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from transformers import pipeline

# Step 3: Define paths to your cleaned CSV files
management_path = "/content/drive/MyDrive/BOE/bank_of_england/data/processed/management_discussion.csv"
qa_path = "/content/drive/MyDrive/BOE/bank_of_england/data/processed/qa_section.csv"

# Load the CSV files
management_df = pd.read_csv(management_path)
qa_df = pd.read_csv(qa_path)

print("Management Discussion CSV shape:", management_df.shape)
print("Q&A CSV shape:", qa_df.shape)

# Step 4: Create Document objects from each CSV row.
# Assumption: Each CSV has a column named "content" that holds the cleaned text.
docs = []

# Process management discussion data, skipping rows with missing text
for content in management_df["content"].dropna():  # adjust this if your column name is different
    docs.append(Document(page_content=str(content), metadata={"source": "management_discussion"}))

# Process Q&A data, skipping rows with missing text
for content in qa_df["content"].dropna():
    docs.append(Document(page_content=str(content), metadata={"source": "qa_section"}))

print("Total documents created:", len(docs))

# Step 5: Initialize Sentence Transformer embeddings and create a Chroma vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(docs, embeddings, collection_name="boe_cleaned_data")
print("Chroma vector store created from cleaned BOE data.")

# Step 6: Load the text generation model (Flan-T5) using Hugging Face's transformers pipeline.
# Note: Ensure that your runtime is set to GPU for faster inference.
qa_model = pipeline("text2text-generation", model="google/flan-t5-base", device=0)

# Step 7: Define the Retrieval-Augmented Generation (RAG) function
def generate_answer(query):
    # Retrieve the top 6 most similar documents from the vector store
    retrieved_docs = vectorstore.similarity_search(query, k=6)
    # Combine the content from the retrieved documents
    context = " ".join(doc.page_content for doc in retrieved_docs)

    # Put the instruction and the user's question before the excerpts so they
    # survive when an over-long prompt is truncated from the end.
    prompt = (
        "Below are excerpts from the Bank of England management discussion and Q&A sections. "
        "Please analyze these excerpts and provide a detailed summary of the key insights regarding quarterly performance. "
        "Include performance metrics, trends, improvements or declines, and any notable observations.\n\n"
        f"Question: {query}\n\n"
        "Excerpts:\n"
        f"{context}\n\n"
        "Summary:"
    )

    # do_sample=True is required for temperature to take effect;
    # truncation=True clips prompts that exceed the model's input limit.
    result = qa_model(prompt, max_length=512, do_sample=True, temperature=0.7, truncation=True)
    return result[0]['generated_text']

# Step 8: Test the RAG function with a sample query
test_query = "What insights about quarterly performance are highlighted in the processed data?"
print("RAG Answer:", generate_answer(test_query))
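
The row-to-document step above assumes that every row has usable text in a `content` column. The sketch below shows one way to validate that assumption. It uses the standard-library `csv` module so it runs without pandas, and plain dictionaries stand in for LangChain `Document` objects. The helper `rows_to_records` and the sample text are hypothetical.

```python
import csv
import io
from typing import Dict, List


def rows_to_records(csv_text: str, source: str, column: str = "content") -> List[Dict[str, str]]:
    """Turn CSV rows into {'page_content', 'source'} records, skipping blank text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column '{column}' not found in CSV header")
    records = []
    for row in reader:
        text = (row.get(column) or "").strip()
        if text:
            records.append({"page_content": text, "source": source})
    return records


# Illustrative input: two real rows, one empty line, one whitespace-only row.
sample_csv = "content\nNet income rose.\n\n   \nCredit costs were higher.\n"
records = rows_to_records(sample_csv, source="management_discussion")
```

Failing early on a missing column gives a clearer error than a `KeyError` raised midway through the loop.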
