<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
"""
===================================================
Team Name: Quant Collective
Author/s: Sheldon Kemper, Rita, Kasia, Chiaki, Oscar, Arijit

LinkedIn Profiles:
    Sheldon: https://www.linkedin.com/in/sheldon-kemper
    Rita: [Insert Rita's LinkedIn URL]
    Kasia: [Insert Kasia's LinkedIn URL]
    Chiaki: [Insert Chiaki's LinkedIn URL]
    Oscar: [Insert Oscar's LinkedIn URL]
    Arijit: [Insert Arijit's LinkedIn URL]

Date: 2025-02-04
Version: 1.1

Description:
    This notebook orchestrates our end-to-end NLP pipeline, which transforms
    unstructured quarterly announcements and Q&A transcripts from Global Systemically
    Important Banks (G-SIBs) into actionable insights for the Bank of England.
    The pipeline has four stages:

    1. Data Collection & Preprocessing:
       - Ingestion of raw data from multiple sources (e.g., PDFs, HTML, transcripts)
       - Data cleaning, noise removal, and formatting standardisation
       - Initial Exploratory Data Analysis (EDA) to understand data characteristics

    2. Methodology & Modelling:
       - Topic Modelling with BERTopic to extract latent themes and topics
       - Sentiment Analysis with FinBERT to gauge market sentiment within transcript segments
       - Summarisation Pipeline to generate concise summaries from lengthy texts

    3. Integration & Pipeline Development:
       - Sequential execution of the above processes to ensure a cohesive workflow
       - Handling of inter-process dependencies and data hand-offs
       - Iterative refinements based on challenges and model performance evaluations

    4. Results and Reporting:
       - Aggregation of model outputs, key findings, and visualisations
       - Generation of actionable insights and business recommendations for risk assessment

This collaborative effort demonstrates the combined expertise of Quant Collective in building robust,
scalable data engineering solutions tailored for complex financial datasets.

===================================================
"""



In [4]:
!wget -q https://raw.githubusercontent.com/sheldonkemper/bank_of_england/refs/heads/main/notebooks/import/sk_import_PDF.ipynb
!wget -q https://raw.githubusercontent.com/sheldonkemper/bank_of_england/refs/heads/main/notebooks/cleansed/sk_processed_ubs.ipynb
!wget -q https://raw.githubusercontent.com/sheldonkemper/bank_of_england/refs/heads/main/notebooks/cleansed/sk_processed_jpmorgan.ipynb
!wget -q https://raw.githubusercontent.com/sheldonkemper/bank_of_england/refs/heads/main/notebooks/cleansed/kk_eda.ipynb
!wget -q https://raw.githubusercontent.com/sheldonkemper/bank_of_england/refs/heads/main/notebooks/modelling/kk_mvp_modelling.ipynb
!wget -q https://raw.githubusercontent.com/sheldonkemper/bank_of_england/refs/heads/main/notebooks/modelling/ob_flan_t5_sentiment_jpm.ipynb


In [5]:
import os
from google.colab import drive
# Mount Google Drive at /content/drive; force_remount=True remounts even if a mount already exists
drive.mount('/content/drive', force_remount=True)

# Assuming 'BOE' folder is in 'MyDrive' and already shared
BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data'

# List the directory contents to confirm the shared folder is reachable
print(os.listdir(BOE_path))

Mounted at /content/drive
['raw', 'jpm_presentation_final.csv', 'cleansed', 'model_outputs', 'preprocessed_data']


# 1. Data Collection & Preprocessing

## Ingestion of raw data from multiple sources

In [6]:
sk_bank_17 = "ubs"
%run sk_import_PDF.ipynb

Mounted at /content/drive
The value of my_variable is: ubs

Summary of Downloads:
('https://www.ubs.com/global/en/investor-relations/financial-information/quarterly-reporting/qr-shared/2023/2q23/_jcr_content/mainpar/toplevelgrid_copy_co/col1/linklistreimagined_c/link_2038370922_copy.1634234040.file/PS9jb250ZW50L2RhbS9hc3NldHMvY2MvaW52ZXN0b3ItcmVsYXRpb25zL3F1YXJ0ZXJsaWVzLzIwMjMvMnEyMy8ycTIzLWVhcm5pbmdzLWNhbGwtcmVtYXJrcy5wZGY=/2q23-earnings-call-remarks.pdf', 'exists', '/content/drive/MyDrive/BOE/bank_of_england/data/raw/ubs/2q23-earnings-call-remarks.pdf')
('https://www.ubs.com/global/en/investor-relations/financial-information/quarterly-reporting/qr-shared/2023/1q23/_jcr_content/mainpar/toplevelgrid_copy_co/col1/linklistreimagined_c/link_2038370922.1996821412.file/PS9jb250ZW50L2RhbS9hc3NldHMvY2MvaW52ZXN0b3ItcmVsYXRpb25zL3F1YXJ0ZXJsaWVzLzIwMjMvMXEyMy8xcTIzLWVhcm5pbmdzLWNhbGwtcmVtYXJrcy5wZGY=/1q23-earnings-call-remarks.pdf', 'exists', '/content/drive/MyDrive/BOE/bank_of_england/data/raw/

In [7]:
sk_bank_17 = "jpmorgan"
%run sk_import_PDF.ipynb

Mounted at /content/drive
The value of my_variable is: jpmorgan

Summary of Downloads:
('https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relations/documents/quarterly-earnings/2023/3rd-quarter/jpm-3q23-earnings-call-transcript.pdf', 'exists', '/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan/jpm-3q23-earnings-call-transcript.pdf')
('https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relations/documents/quarterly-earnings/2023/1st-quarter/1q23-earnings-transcript.pdf', 'exists', '/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan/1q23-earnings-transcript.pdf')
('https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relations/documents/quarterly-earnings/2023/2nd-quarter/2q23-earnings-transcript.pdf', 'exists', '/content/drive/MyDrive/BOE/bank_of_england/data/raw/jpmorgan/2q23-earnings-transcript.pdf')
('https://www.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/investor-relat
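The download summaries above are long lists of `(url, status, local_path)` tuples, which are hard to scan by eye. The following sketch condenses such a list into counts per status and picks out the entries that need attention. Only the `'exists'` status appears in the output above; the other status values (`'downloaded'`, `'failed'`) and both function names are assumptions made for illustration.

```python
from collections import Counter


def summarise_downloads(results):
    """Count download results by status.

    Each result is a (url, status, local_path) tuple, the shape printed in
    the "Summary of Downloads" output above.
    """
    return Counter(status for _url, status, _path in results)


def needs_attention(results, ok_statuses=("exists", "downloaded")):
    """Return the tuples whose status is not in ok_statuses.

    "downloaded" is a hypothetical status; only "exists" is seen above.
    """
    return [r for r in results if r[1] not in ok_statuses]
```

If the import notebook exposed its summary list as a variable, passing it to `summarise_downloads` would give a one-line overview per bank.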

## Data cleaning, noise removal, and formatting standardisation

## Initial Data cleaning

In [None]:
# JPMorgan
%run sk_processed_jpmorgan.ipynb

In [8]:
# UBS
%run sk_processed_ubs.ipynb

Mounted at /content/drive
Processing file: 1q23-earnings-call-remarks.pdf
Processing file: 1q24-earnings-call-remarks.pdf
Processing file: 2q23-earnings-call-remarks.pdf
Processing file: 2q24-earnings-call-remarks.pdf
Processing file: 3q23-earnings-call-remarks.pdf
Processing file: 3q24-earnings-call-remarks.pdf
Processing file: 4q23-earnings-call-remarks.pdf
Processing file: 4q24-earnings-call-remarks.pdf
Management announcements saved to: /content/drive/MyDrive/BOE/bank_of_england/data/cleansed/ubs_management_discussion.csv
Q&A section saved to: /content/drive/MyDrive/BOE/bank_of_england/data/cleansed/ubs_qna_section.csv


## Preprocessing

## Initial Exploratory Data Analysis (EDA) to understand data characteristics

In [10]:
%run kk_eda.ipynb

Mounted at /content/drive
['JPMorgan_QNA_processed_data.csv', 'jpmorgan_qna_df_preprocessed_final.csv', 'jpmorgan_management_discussion.csv', 'jpmorgan_qna preprocessed.csv', 'archived', 'ubs_qa_df_preprocessed_ver2.csv', 'jpmorgan_qna preprocessed (1).gsheet', 'jpmorgan_qna preprocessed.gsheet', 'ubs_qa_df_preprocessed_ver2.gsheet']

📌 **File: jpmorgan_qna_df_preprocessed_final.csv**
['Index', 'Quarter-Year', 'Question', 'Question_cleaned', 'Asked By', 'Role of the person asked the question', 'Answer', 'Answer_cleaned', 'Answered By', 'Role of the person answered the question']

⚠️ Could not read jpmorgan_management_df_preprocessed_final.csv: [Errno 2] No such file or directory: '/content/drive/MyDrive/BOE/bank_of_england/data/preprocessed_data/jpmorgan_management_df_preprocessed_final.csv'


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/BOE/bank_of_england/data/preprocessed_data/jpmorgan_management_df_preprocessed_final.csv'
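The directory listing above contains `jpmorgan_management_discussion.csv` but not the `jpmorgan_management_df_preprocessed_final.csv` that the EDA notebook tries to read, which suggests the upstream step that writes it had not run. A pre-flight check before `%run` would report this up front. The sketch below is one possible form; `missing_inputs` and `require_inputs` are hypothetical helpers, not functions from the project.

```python
from pathlib import Path


def missing_inputs(directory, required):
    """Return the names in `required` that are not files in `directory`."""
    directory = Path(directory)
    return [name for name in required if not (directory / name).is_file()]


def require_inputs(directory, required):
    """Raise one FileNotFoundError that lists every missing input."""
    missing = missing_inputs(directory, required)
    if missing:
        raise FileNotFoundError(
            f"Missing in {directory}: {', '.join(missing)}"
        )
```

For the EDA stage, the check could be called with `f"{BOE_path}/preprocessed_data"` and the two JPMorgan file names that appear in the output above.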


# 2. Methodology & Modelling

## Topic Modelling with BERTopic to extract latent themes and topics

In [11]:
%run kk_mvp_modelling.ipynb

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloa

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Mounted at /content/drive
['JPMorgan_QNA_processed_data.csv', 'jpmorgan_qna_df_preprocessed_final.csv', 'jpmorgan_management_discussion.csv', 'jpmorgan_qna preprocessed.csv', 'archived', 'ubs_qa_df_preprocessed_ver2.csv', 'jpmorgan_qna preprocessed (1).gsheet', 'jpmorgan_qna preprocessed.gsheet', 'ubs_qa_df_preprocessed_ver2.gsheet']


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/BOE/bank_of_england/data/preprocessed_data/jpmorgan_management_df_preprocessed_final.csv'


## Sentiment Analysis with FinBERT to gauge market sentiment within transcript segments

## Flan T5 Sentiment JPM


In [12]:
%run ob_flan_t5_sentiment_jpm.ipynb

ERROR: Could not find a version that satisfies the requirement datsets (from versions: none)
ERROR: No matching distribution found for datsets
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Mounted at /content/drive


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Colab Notebooks/DS_CA/BOE/jpmorgan_qna_df_preprocessed_final.csv'
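The path in this error (`/content/drive/MyDrive/Colab Notebooks/DS_CA/BOE/`) differs from the shared `BOE_path` used earlier, and the earlier listing shows a `jpmorgan_qna_df_preprocessed_final.csv` in what appears to be the shared `preprocessed_data` folder. One way to avoid per-user paths is to build every path from a single base directory. The sketch below assumes a helper named `data_path` and an environment variable named `BOE_DATA_DIR`; both are hypothetical.

```python
import os
from pathlib import PurePosixPath

# Default taken from BOE_path in the mount cell above.
DEFAULT_DATA_DIR = "/content/drive/MyDrive/BOE/bank_of_england/data"


def data_path(*parts, base=None):
    """Build a path under the shared data directory.

    Resolution order: explicit `base`, then the BOE_DATA_DIR environment
    variable (a hypothetical name), then DEFAULT_DATA_DIR.
    """
    root = base or os.environ.get("BOE_DATA_DIR", DEFAULT_DATA_DIR)
    return str(PurePosixPath(root, *parts))
```

A notebook would then read its input from `data_path("preprocessed_data", "jpmorgan_qna_df_preprocessed_final.csv")` regardless of who runs it.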


## Summarisation Pipeline to generate concise summaries from lengthy texts

# 3. Integration & Pipeline Development

## Sequential execution of the above processes to ensure a cohesive workflow
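In the runs above, later stages were executed after an earlier one had already raised `FileNotFoundError`. A small runner that stops at the first failure makes the dependency order explicit. This is a minimal sketch: `run_stages` is a hypothetical helper, stages are plain Python callables, and how each child notebook is wrapped into a callable is left open.

```python
def run_stages(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Returns a list of (name, status) pairs, where status is "ok",
    "failed: <ExceptionName>" or "skipped".
    """
    report = []
    failed = False
    for name, stage in stages:
        if failed:
            # An upstream stage failed, so do not run dependants.
            report.append((name, "skipped"))
            continue
        try:
            stage()
            report.append((name, "ok"))
        except Exception as exc:
            failed = True
            report.append((name, f"failed: {type(exc).__name__}"))
    return report
```

The returned report can be printed at the end of the notebook to show which stage broke the chain and which stages were never attempted.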

## Handling of inter-process dependencies and data hand-offs

## Iterative refinements based on challenges and model performance evaluations

# 4. Results and Reporting

## Aggregation of model outputs, key findings, and visualisations

## Generation of actionable insights and business recommendations for risk assessment