<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/sk_RAG_jpmorgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook implements a Retrieval-Augmented Generation (RAG) system using JP Morgan
    earnings transcripts as the source data. It builds on our existing data engineering pipeline
    by reading raw PDF files stored in Google Drive, extracting text using LangChain’s PyPDFLoader,
    and indexing the content with CHROMA and Sentence Transformer embeddings. A text generation model
    (Flan-T5) is then used to answer queries based on the retrieved context, and the functionality
    is wrapped as a tool for a LangChain agent to handle more complex interactions.

===================================================
"""




# Step 1: Environment Setup & Library Imports

## Import required libraries

In [2]:
# Install necessary packages (uncomment if needed)
!pip install PyPDF2 google-generativeai chromadb



In [3]:


import os
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
from langchain.agents import initialize_agent, Tool
from langchain.vectorstores import Chroma
from google.colab import drive
from google.colab import userdata
import google.generativeai as genai
import pandas as pd

# Set your Hugging Face API token via Colab user secrets or replace as needed.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HF')  # <-- Ensure you have your token stored
os.environ["GEMINI_API_TOKEN"] = userdata.get('GOOGLE_API_KEY')  # <-- Ensure you have your token stored



# Step 2: Mount Google Drive and Identify Data

In [4]:
# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Define the directory containing your PDFs
raw_dir = "/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan"

# List all PDF files in the raw directory
pdf_files = [os.path.join(raw_dir, file) for file in os.listdir(raw_dir) if file.endswith(".pdf")]
print(f"Found {len(pdf_files)} PDF files in {raw_dir}")


Mounted at /content/drive
Found 8 PDF files in /content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan


In [5]:

# Install required libraries
!pip install pdfplumber transformers datasets faiss-cpu




In [None]:
/content/drive/MyDrive/BOE/bank_of_england/data/preprocessed_data/Archived/jpmorgan_management_discussion_preprocessed.csv"

In [41]:
import re
import ast
import pandas as pd
import torch
from transformers import (
    RagTokenizer, RagRetriever, RagTokenForGeneration,
    DPRContextEncoder, DPRContextEncoderTokenizerFast,
    pipeline, logging as hf_logging
)
from datasets import Dataset
import faiss
import numpy as np

# Suppress transformer warnings to reduce noise
hf_logging.set_verbosity_error()

#############################################
# 1. Load Preprocessed Data from CSV
#############################################

csv_path = "/content/drive/MyDrive/BOE/bank_of_england/data/preprocessed_data/Archived/jpmorgan_management_discussion_preprocessed.csv"
df = pd.read_csv(csv_path)

# Build passages from CSV rows.
# Prefer the 'cleaned_data' column; if not available, fallback to 'chunk_text'.
passages = []
for idx, row in df.iterrows():
    if "cleaned_data" in df.columns and pd.notna(row["cleaned_data"]):
        try:
            tokens = ast.literal_eval(row["cleaned_data"])
        except Exception:
            tokens = row["cleaned_data"].split()
    else:
        tokens = row["chunk_text"].split()
    passage_text = " ".join(tokens)
    passages.append({
        "title": f"{row['filename']}_{row['chunk_index']}",
        "text": passage_text,
        "financial_quarter": row["financial_quarter"],
        "call_date": row["call_date"]
    })

# Create a Hugging Face dataset from the passages.
passages_dataset = Dataset.from_list(passages)
passages_dataset.save_to_disk("my_passages_preprocessed")
print("Columns in dataset:", passages_dataset.column_names)

#############################################
# 2. Compute Embeddings for Each Passage
#############################################

context_tokenizer = DPRContextEncoderTokenizerFast.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def embed_passage(example):
    inputs = context_tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt"
    )
    with torch.no_grad():
        embedding = context_encoder(**inputs).pooler_output.squeeze(0).tolist()
    return {"embedding": embedding}

passages_dataset = passages_dataset.map(embed_passage, batched=False)
passages_dataset = passages_dataset.rename_column("embedding", "embeddings")
passages_dataset.save_to_disk("my_passages_preprocessed_updated")

embeddings = np.stack(passages_dataset["embeddings"])
dimension = embeddings.shape[1]

#############################################
# 3. Build and Save a FAISS Index from the Embeddings
#############################################

index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
faiss.write_index(index, "my_faiss_index")

#############################################
# 4. Initialize RAG with the Custom Retriever
#############################################

rag_tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-base")
rag_model = RagTokenForGeneration.from_pretrained("facebook/rag-token-base")

retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-base",
    index_name="custom",
    passages_path="my_passages_preprocessed_updated",
    index_path="my_faiss_index"
)

#############################################
# Helper: Extract a Numeric Value from Text
#############################################

def extract_numeric(text):
    text = text.strip()
    # Look for a dollar amount first (e.g., "$11 billion")
    dollar_vals = re.findall(r"\$([\d,]+(?:\.\d+)?)", text)
    if dollar_vals:
        return dollar_vals[0]
    # Then look for a percentage (e.g., "21%")
    perc_vals = re.findall(r"([\d,]+(?:\.\d+)?)\s*%", text)
    if perc_vals:
        return perc_vals[0]
    # Otherwise, find any numeric value
    plain_vals = re.findall(r"([\d,]+(?:\.\d+)?)", text)
    if plain_vals:
        return plain_vals[0]
    return text

#############################################
# 5. Extract Key Metrics using Refined RAG-based QA
#############################################

# Refined prompts that explicitly request a single number.
questions = {
    "Net Income": (
        "Extract only the net income value as a single number. "
        "For example, if the transcript states 'net income of $11 billion', return '11'."
    ),
    "EPS": (
        "Extract only the EPS value as a single number. "
        "For example, if the transcript states 'EPS of $4.81', return '4.81'."
    ),
    "Revenue": (
        "Extract only the revenue value as a single number. "
        "For example, if the transcript states 'revenue of $43.7 billion', return '43.7'."
    ),
    "ROTCE": (
        "Extract only the ROTCE value as a single number. "
        "For example, if the transcript states 'ROTCE of 21%', return '21'."
    )
}

def rag_qa(question, num_docs=5):
    input_dict = rag_tokenizer(question, return_tensors="pt")
    input_ids = input_dict["input_ids"]

    with torch.no_grad():
        question_embedding = rag_model.question_encoder(input_ids)[0].detach().cpu().numpy()

    retrieved_docs = retriever(
        input_ids.numpy(), question_embedding, return_tensors="pt", n_docs=num_docs
    )

    if "doc_scores" not in retrieved_docs or retrieved_docs["doc_scores"] is None:
        batch_size = input_ids.shape[0]
        retrieved_docs["doc_scores"] = torch.zeros(batch_size, num_docs)

    generated = rag_model.generate(
        input_ids=input_ids,
        context_input_ids=retrieved_docs["context_input_ids"],
        context_attention_mask=retrieved_docs["context_attention_mask"],
        doc_scores=retrieved_docs["doc_scores"]
    )
    answer = rag_tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
    return extract_numeric(answer)

print("Extracted Key Metrics using RAG:")
key_metrics = {}
for metric, question in questions.items():
    answer = rag_qa(question)
    key_metrics[metric] = answer if answer else "Not Found"
    print(f"{metric}: {key_metrics[metric]}")

#############################################
# 6. Overall Sentiment Analysis using a Production-Grade Pipeline (Batch Processed)
#############################################

sentiment_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
label_mapping = {"LABEL_0": "NEGATIVE", "LABEL_1": "NEUTRAL", "LABEL_2": "POSITIVE"}
sentiment_scores = {"NEGATIVE": 0.0, "NEUTRAL": 0.0, "POSITIVE": 0.0}
count = 0

sentiment_results = sentiment_pipeline([row["text"] for row in passages])
for result in sentiment_results:
    mapped_label = label_mapping.get(result["label"], result["label"])
    sentiment_scores[mapped_label] += result["score"]
    count += 1

averages = {k: sentiment_scores[k] / count for k in sentiment_scores}
overall_sentiment = max(averages, key=averages.get)

print("\nSentiment Analysis Summary (Overall):")
print(f"Average NEGATIVE Score: {averages['NEGATIVE']:.2f}")
print(f"Average NEUTRAL Score: {averages['NEUTRAL']:.2f}")
print(f"Average POSITIVE Score: {averages['POSITIVE']:.2f}")
print(f"Overall Sentiment: {overall_sentiment}")

#############################################
# 7. Sentiment Analysis by Quarter and Year
#############################################

# Convert call_date to datetime and extract the year.
df["call_date"] = pd.to_datetime(df["call_date"])
df["year"] = df["call_date"].dt.year

# Create a full_text column using 'cleaned_data' if available; otherwise, use 'chunk_text'.
def join_tokens(data_str):
    try:
        tokens = ast.literal_eval(data_str)
    except Exception:
        tokens = data_str.split()
    return " ".join(tokens)

if "cleaned_data" in df.columns and df["cleaned_data"].notna().all():
    df["full_text"] = df["cleaned_data"].apply(join_tokens)
elif "chunk_text" in df.columns:
    df["full_text"] = df["chunk_text"]
else:
    raise ValueError("No suitable text data found in the CSV.")

# Group by financial_quarter and year.
grouped = df.groupby(["financial_quarter", "year"])
group_sentiments = {}

print("\nSentiment Analysis by Quarter and Year:")
for (quarter, year), group in grouped:
    aggregated_text = " ".join(group["full_text"].tolist())
    # Use truncation and a defined max_length to ensure input is within model limits.
    sentiment_result = sentiment_pipeline(aggregated_text, truncation=True, max_length=512)
    mapped_label = label_mapping.get(sentiment_result[0]["label"], sentiment_result[0]["label"])
    group_sentiments[(quarter, year)] = (mapped_label, sentiment_result[0]["score"])
    print(f"Quarter: {quarter}, Year: {year}, Sentiment: {mapped_label}, Score: {sentiment_result[0]['score']:.2f}")

Saving the dataset (0/1 shards):   0%|          | 0/239 [00:00<?, ? examples/s]

Columns in dataset: ['title', 'text', 'financial_quarter', 'call_date']


Map:   0%|          | 0/239 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/239 [00:00<?, ? examples/s]

Extracted Key Metrics using RAG:
Net Income: 1
EPS: 11
Revenue: 1
ROTCE: http://www.google.com/google.com/google.com/google.

Sentiment Analysis Summary (Overall):
Average NEGATIVE Score: 0.01
Average NEUTRAL Score: 0.50
Average POSITIVE Score: 0.21
Overall Sentiment: NEUTRAL

Sentiment Analysis by Quarter and Year:
Quarter: 1Q23, Year: 2023, Sentiment: NEUTRAL, Score: 0.80
Quarter: 1Q24, Year: 2024, Sentiment: NEUTRAL, Score: 0.66
Quarter: 2Q23, Year: 2023, Sentiment: NEUTRAL, Score: 0.66
Quarter: 2Q24, Year: 2024, Sentiment: NEUTRAL, Score: 0.73
Quarter: 3Q23, Year: 2023, Sentiment: NEUTRAL, Score: 0.75
Quarter: 3Q24, Year: 2024, Sentiment: NEUTRAL, Score: 0.79
Quarter: 4Q23, Year: 2024, Sentiment: NEUTRAL, Score: 0.82
Quarter: 4Q24, Year: 2025, Sentiment: NEUTRAL, Score: 0.75
