<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/kk_mvp_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
"""
===================================================
Author: Kasia Kirby
Role: Reporting Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/kasia-kirby
Date: 2025-02-17
Version: 1.0

Description:
    This notebook builds an end-to-end NLP pipeline that runs topic modelling,
    sentiment analysis, and summarisation on JPM quarterly call transcripts from 2023-2024.

===================================================
"""



**Task**

Model the transcript data by analyst and by selected quarters using BERTopic, FinBERT and an LLM (Flan-T5).

**Data**

- JPMorgan (JPM), 2023-2024, Management & Q&A transcripts
- preprocessed with PDF parsing, regex-based segmentation, and structured extraction with GPT-4, then converted into a dataframe
- cleaned columns added for questions, answers and management text

**Requirements**

1. FinBERT returns accurate sentiment
2. The LLM returns more accurate sentiment than FinBERT
3. BERTopic performs accurate topic extraction
4. The LLM is able to capture more topic context and summarise those topics
5. Analyst topics of interest and sentiment vary widely quarter by quarter
6. Topics and sentiment can be compared between analysts

**Approach**

Topic model -> Sentiment analysis -> Insight + Comparison

- First, topics are identified freely by the model on each transcript
- Second, the model is trained with G-SIB assessment topics
- Third, sentiment analysis is run on the resulting four files (Q&A and management, each with free topics and with trained topics)
- Fourth, all are compared against each other to find best insights

**Benchmark analysis**

- **comparison between model types and their results** (apply like-for-like rules when comparing the models' results, i.e. if we gather insights on a specific analyst in a specific quarter (Q4 2024), compare the models' results and their ability to capture correct information using the same analyst and quarter across all models),
- **comparison between Q4 2024 analyst insights and other quarters' insights** (apply the full model flow across at least two different periods, e.g. Q4 2024 and Q2 2024, then choose a specific analyst and check what major topics and sentiment they cover in the two periods: do they have similar sentiment over time? Do they focus on topic areas that are of interest in one quarter but not the other?),
- **comparison between two different analysts in the same quarter** (are analysts interested in different topics? What can we extract at analyst level with regard to sentiment, topics of interest and conversation summarisation?)



# 1. Import libraries and files

In [3]:
!pip install bertopic
!pip install pyLDAvis

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloadin

In [11]:
import os
import re
import sys

import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
from IPython.display import display, HTML

from bertopic import BERTopic

import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [5]:
# Mount Google Drive to the root location
drive.mount('/content/drive', force_remount=True)
BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data/preprocessed_data'
print(os.listdir(BOE_path))

Mounted at /content/drive
['ubs_management_discussion_preprocessed.csv', 'ubs_qna_df_preprocessed.csv', 'Archived', 'JP Mogran processed thru OpenAI', 'jpmorgan_qna_df_preprocessed_final.csv', 'jpmorgan_management_df_preprocessed_final.csv']


In [30]:
# Load data

qna_file = os.path.join(BOE_path, 'jpmorgan_qna_df_preprocessed_final.csv')
management_file = os.path.join(BOE_path, 'jpmorgan_management_df_preprocessed_final.csv')

df_qna = pd.read_csv(qna_file, encoding='utf-8')
df_mgmt = pd.read_csv(management_file, encoding='utf-8')

print("Q&A DataFrame:")
display(df_qna.head())

print("\nManagement Discussion DataFrame:")
display(df_mgmt.head())

print("Q&A DataFrame Overview:")
df_qna.info()  # info() prints its report directly and returns None

print("\nManagement Discussion DataFrame Overview:")
df_mgmt.info()

Q&A DataFrame:


Unnamed: 0,Index,Quarter-Year,Question,Question_cleaned,Asked By,Role of the person asked the question,Answer,Answer_cleaned,Answered By,Role of the person answered the question
0,1,1Q23,"So, Jamie, I was actually hoping to get your p...",['so jamie actually hoping get perspective see...,Steven Chubak,"Analyst, Wolfe Research LLC","Well, I think you were already kind of complet...",['well think already kind complete answering q...,Jamie Dimon,"Chairman & Chief Executive Officer, JPMorgan C..."
1,2,1Q23,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",['hey thanks good morning hey jeremy wondering...,Ken Usdin,"Analyst, Jefferies LLC","Yeah, sure. So let me just summarize the drive...",['yeah sure let summarize drivers change outlo...,Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
2,3,1Q23,"Hi, thanks. Jeremy, wanted to follow up again ...",['hi thanks jeremy wanted follow drivers nii r...,John McDonald,"Analyst, Autonomous Research","Yeah. John, it's a really good question, and w...",['yeah john really good question weve obviousl...,Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
3,4,1Q23,My first question is you mentioned that your r...,['first question mentioned reserve build drive...,Erika Najarian,"Analyst, UBS Securities LLC","Yeah. So, Erika, as you know, we take \n not g...",['yeah so erika know take going go lot detail ...,Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
4,5,1Q23,Hey. Good morning. Maybe just a little bit on ...,['hey good morning maybe little bit deposit th...,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Yeah. A couple things there. So, first of all,...",['yeah couple things there so first all know r...,"Jeremy Barnum, Jamie Dimon","Chief Financial Officer, JPMorgan Chase & Co.;..."



Management Discussion DataFrame:


Unnamed: 0,Index,Quarter-Year,Text,Text_cleaned
0,,4Q24,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...,['management discussion section operator : goo...
1,,3Q24,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...
2,,2Q24,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...
3,,1Q24,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...
4,,4Q23,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...


Q&A DataFrame Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 10 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Index                                     91 non-null     int64 
 1   Quarter-Year                              91 non-null     object
 2   Question                                  90 non-null     object
 3   Question_cleaned                          91 non-null     object
 4   Asked By                                  90 non-null     object
 5   Role of the person asked the question     90 non-null     object
 6   Answer                                    89 non-null     object
 7   Answer_cleaned                            91 non-null     object
 8   Answered By                               89 non-null     object
 9   Role of the person answered the question  89 non-null     object
dtypes: int64(1), object(9)
memor

**Q&A Data**
- Structured into **questions, answers, analysts, and executives**.

**Management Discussion Data**
- Contains Quarter-Year, Text and Text_cleaned columns.
- Index column is entirely NaN and can be dropped.

# 2. Data preparation

Drop unnecessary columns and missing entries, standardise column names, format quarter properly, add Type column.

In [31]:
# Drop Unnecessary Columns
df_qna.drop(columns=["Index"], inplace=True, errors='ignore')
df_mgmt.drop(columns=["Index"], inplace=True, errors='ignore')

# Standardize Column Names
df_qna.rename(columns={
    "Quarter-Year": "Quarter",
    "Asked By": "Analyst",
    "Answer": "Response",
    "Answered By": "Executive",
    "Role of the person asked the question": "Analyst Role",
    "Role of the person answered the question": "Executive Role"
}, inplace=True)

df_mgmt.rename(columns={
    "Quarter-Year": "Quarter",
    "Text": "Transcript"
}, inplace=True)

# Drop Missing Q&A Entries (2 rows in the Q&A transcript)
df_qna.dropna(subset=["Question", "Response"], inplace=True)

# Format `Quarter` Properly
def format_quarter(quarter_str):
    match = re.search(r'(\d)Q(\d{2})', quarter_str)
    if match:
        return f"20{match.group(2)}-Q{match.group(1)}"
    return quarter_str

df_qna["Quarter"] = df_qna["Quarter"].astype(str).apply(format_quarter)
df_mgmt["Quarter"] = df_mgmt["Quarter"].astype(str).apply(format_quarter)

# Standardize Executive & Analyst Roles
role_mapping = {
    "Chief Executive Officer": "CEO",
    "Chairman & Chief Executive Officer": "CEO",
    "Chief Financial Officer": "CFO",
    "Chief Operating Officer": "COO",
    "President": "President",
    "Vice Chairman": "Vice Chairman",
    "Head of Investor Relations": "Head of IR",
    "Managing Director": "Managing Director",
    "Analyst, Wolfe Research LLC": "Analyst",
    "Analyst, Jefferies LLC": "Analyst",
    "Analyst, Autonomous Research": "Analyst",
    "Analyst, UBS Securities LLC": "Analyst",
    "Analyst, Seaport Global Securities LLC": "Analyst"
}

# Apply role mapping by substring match; when several roles are listed, the first matching key in role_mapping wins
def standardize_role(role):
    if pd.isna(role):
        return None
    for key, value in role_mapping.items():
        if key.lower() in role.lower():
            return value
    return role

df_qna["Executive Role"] = df_qna["Executive Role"].apply(standardize_role)
df_qna["Analyst Role"] = df_qna["Analyst Role"].apply(standardize_role)

# Add `Type` Column
df_qna["Type"] = "Q&A"
df_mgmt["Type"] = "Management Discussion"

print("Q&A DataFrame:")
display(df_qna.head())

print("\nManagement Discussion DataFrame:")
display(df_mgmt.head())

Q&A DataFrame:


Unnamed: 0,Quarter,Question,Question_cleaned,Analyst,Analyst Role,Response,Answer_cleaned,Executive,Executive Role,Type
0,2023-Q1,"So, Jamie, I was actually hoping to get your p...",['so jamie actually hoping get perspective see...,Steven Chubak,Analyst,"Well, I think you were already kind of complet...",['well think already kind complete answering q...,Jamie Dimon,CEO,Q&A
1,2023-Q1,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",['hey thanks good morning hey jeremy wondering...,Ken Usdin,Analyst,"Yeah, sure. So let me just summarize the drive...",['yeah sure let summarize drivers change outlo...,Jeremy Barnum,CFO,Q&A
2,2023-Q1,"Hi, thanks. Jeremy, wanted to follow up again ...",['hi thanks jeremy wanted follow drivers nii r...,John McDonald,Analyst,"Yeah. John, it's a really good question, and w...",['yeah john really good question weve obviousl...,Jeremy Barnum,CFO,Q&A
3,2023-Q1,My first question is you mentioned that your r...,['first question mentioned reserve build drive...,Erika Najarian,Analyst,"Yeah. So, Erika, as you know, we take \n not g...",['yeah so erika know take going go lot detail ...,Jeremy Barnum,CFO,Q&A
4,2023-Q1,Hey. Good morning. Maybe just a little bit on ...,['hey good morning maybe little bit deposit th...,Jim Mitchell,Analyst,"Yeah. A couple things there. So, first of all,...",['yeah couple things there so first all know r...,"Jeremy Barnum, Jamie Dimon",CEO,Q&A



Management Discussion DataFrame:


Unnamed: 0,Quarter,Transcript,Text_cleaned,Type
0,2024-Q4,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...,['management discussion section operator : goo...,Management Discussion
1,2024-Q3,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
2,2024-Q2,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
3,2024-Q1,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
4,2023-Q4,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
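
A quick sanity check of the two helpers above, written as a standalone sketch. The role mapping is abbreviated to three entries for illustration, the combined role string is an assumed example of a cell that lists two executives (the real cell is truncated in the output), and the Morgan Stanley role is a hypothetical input that is not in the mapping.

```python
import re

def format_quarter(quarter_str):
    """Convert labels such as '1Q23' to '2023-Q1'; leave anything else unchanged."""
    match = re.search(r'(\d)Q(\d{2})', quarter_str)
    if match:
        return f"20{match.group(2)}-Q{match.group(1)}"
    return quarter_str

# Abbreviated copy of role_mapping, for illustration only
role_mapping = {
    "Chief Executive Officer": "CEO",
    "Chief Financial Officer": "CFO",
    "Analyst, Wolfe Research LLC": "Analyst",
}

def standardize_role(role):
    """Return the label of the first mapping key found inside the role string."""
    if role is None:
        return None
    for key, value in role_mapping.items():
        if key.lower() in role.lower():
            return value
    return role

# Assumed example of a cell listing two executives
combined = (
    "Chief Financial Officer, JPMorgan Chase & Co.; "
    "Chairman & Chief Executive Officer, JPMorgan Chase & Co."
)

print(format_quarter("4Q24"))
print(format_quarter("FY2024"))
print(standardize_role(combined))
print(standardize_role("Analyst, Morgan Stanley & Co. LLC"))
```

Because matching is by substring and follows the order of the dictionary, a cell that lists both the CFO and the CEO is labelled CEO, which is consistent with row 4 of the Q&A table. An analyst firm that is not listed in the mapping keeps its full role string.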


In [32]:
# Recheck for short, non-substantive responses as indicated by EDA (separate notebook)

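# Answer_cleaned is stored as the string form of a list; split it on whitespace into tokens
# (the first and last tokens keep the bracket and quote characters)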
df_qna["Answer_cleaned"] = df_qna["Answer_cleaned"].apply(lambda x: str(x).lower().split() if isinstance(x, str) else [])

# define a threshold for what is considered a "short" response
SHORT_RESPONSE_THRESHOLD = 5

# filter for responses that contain very few words
short_responses = df_qna[df_qna["Answer_cleaned"].apply(lambda x: isinstance(x, list) and len(x) <= SHORT_RESPONSE_THRESHOLD)]

print("Examples of Short Responses:")
print(short_responses[["Quarter", "Answer_cleaned"]].head())

print(f"\nTotal number of short responses: {len(short_responses)}")

Examples of Short Responses:
    Quarter                       Answer_cleaned
11  2023-Q1  [['excellent, folks, thank, much']]
25  2023-Q2               [['thank, you, guys']]
37  2023-Q3                    [['thank, much']]
48  2023-Q4   [['okay, thanks, much, everyone']]
79  2024-Q3      [['yeah, hear, you, hear, us']]

Total number of short responses: 5


In [33]:
# Remove short, non-informative responses

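# unwrap a list that holds a single inner list; flat token lists pass through unchanged
def flatten_list(nested_list):
    if isinstance(nested_list, list) and len(nested_list) == 1 and isinstance(nested_list[0], list):
        return nested_list[0]
    return nested_list

df_qna["Answer_cleaned"] = df_qna["Answer_cleaned"].apply(flatten_list)
# Keep responses with at least SHORT_RESPONSE_THRESHOLD tokens. The check in the previous cell
# flags len <= 5 while this filter keeps len >= 5, so a response of exactly 5 tokens is flagged but not removed.
df_qna_filtered = df_qna[df_qna["Answer_cleaned"].apply(lambda x: isinstance(x, list) and len(x) >= SHORT_RESPONSE_THRESHOLD)]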

print(f"Removed {len(df_qna) - len(df_qna_filtered)} short non-informative responses.")
df_qna = df_qna_filtered

Removed 4 short non-informative responses.
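
The `Answer_cleaned` column is read from the CSV as the string form of a one-element list (for example `"['thank you guys']"`), so a plain `split()` leaves bracket and quote characters attached to the first and last tokens. Below is a minimal sketch of an alternative parser, assuming every cell has that shape. `parse_cleaned` and `is_short` are hypothetical helpers and are not part of the notebook.

```python
import ast

SHORT_RESPONSE_THRESHOLD = 5

def parse_cleaned(cell):
    """Turn "['some cleaned text']" into ['some', 'cleaned', 'text']."""
    if not isinstance(cell, str):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to splitting the raw string
        return cell.lower().split()
    if isinstance(parsed, str):
        parsed = [parsed]
    if not isinstance(parsed, (list, tuple)):
        return cell.lower().split()
    tokens = []
    for item in parsed:
        tokens.extend(str(item).lower().split())
    return tokens

def is_short(tokens, threshold=SHORT_RESPONSE_THRESHOLD):
    """True when a response has at most `threshold` tokens."""
    return len(tokens) <= threshold

print(parse_cleaned("['thank you guys']"))
print(is_short(parse_cleaned("['yeah hear you hear us']")))
```

In the notebook this could be applied as `df_qna["Answer_cleaned"].apply(parse_cleaned)`. Keeping only the rows where `is_short` is false would make the removal step the exact complement of the short-response check.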


Note: I will keep filler words, even if they might seem non-informative (e.g. "think," "little bit," and "obviously"), as they might be useful for sentiment analysis.

# 3. Identify themes with topic modeling (BERTopic)

- Unsupervised BERTopic (extracts natural topics)
- BERTopic trained with G-SIB topics (aligns with regulatory focus)

In line with EDA findings, we will model analyst questions, executive answers and the management discussion separately (to achieve distinct themes that are not inflated by management discussion positivity).
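
The two BERTopic variants listed above can be sketched as follows. This is one possible reading of "trained with G-SIB topics", using BERTopic's guided mode via `seed_topic_list`. The category names follow the G-SIB assessment categories, but the seed words, the `min_topic_size=5` value and the helper names are illustrative assumptions rather than the project's configuration.

```python
# Hypothetical seed words per G-SIB assessment category (illustrative only)
GSIB_SEED_TOPICS = {
    "size": ["assets", "exposures", "leverage", "balance"],
    "interconnectedness": ["counterparty", "funding", "interbank", "securities"],
    "substitutability": ["payments", "custody", "underwriting", "clearing"],
    "complexity": ["derivatives", "trading", "otc", "illiquid"],
    "cross_jurisdictional": ["international", "foreign", "global", "jurisdiction"],
}

def build_seed_topic_list(seed_topics):
    """BERTopic expects a list of keyword lists, one list per seeded topic."""
    return [list(words) for words in seed_topics.values()]

def fit_topic_models(docs, min_topic_size=5):
    """Fit a free model and a seeded model on the same documents."""
    from bertopic import BERTopic  # lazy import; requires `pip install bertopic`

    free_model = BERTopic(min_topic_size=min_topic_size)
    free_topics, _ = free_model.fit_transform(docs)

    guided_model = BERTopic(
        seed_topic_list=build_seed_topic_list(GSIB_SEED_TOPICS),
        min_topic_size=min_topic_size,
    )
    guided_topics, _ = guided_model.fit_transform(docs)
    return free_model, free_topics, guided_model, guided_topics

print(build_seed_topic_list(GSIB_SEED_TOPICS)[0])
```

A call such as `fit_topic_models(df_qna["Question"].tolist())` would return both models, and `get_topic_info()` on each gives a table that can be compared side by side. With fewer than a hundred Q&A rows, a small `min_topic_size` is probably needed to obtain more than a handful of topics.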

## 3a) LDA test (on uncleaned data)

Run LDA first to confirm topic modelling is feasible on this dataset.

In [65]:
# Run LDA on Q&A dataset

qna_texts = pd.concat([df_qna["Question"].dropna(), df_qna["Response"].dropna()])

# remove very short text entries
qna_texts = qna_texts[qna_texts.str.len() > 15].drop_duplicates()

# preprocess for LDA
qna_processed = qna_texts.apply(preprocess_string)
dictionary = corpora.Dictionary(qna_processed)
corpus = [dictionary.doc2bow(text) for text in qna_processed]

num_topics = 10
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)

print("\nExtracted Topics from LDA:")
for idx, topic in lda_model.show_topics(num_topics=num_topics, formatted=True):
    print(f"Topic {idx}: {topic}")


Extracted Topics from LDA:
Topic 0: 0.016*"think" + 0.014*"mean" + 0.014*"capit" + 0.010*"like" + 0.009*"product" + 0.009*"term" + 0.009*"demand" + 0.009*"price" + 0.009*"economi" + 0.008*"loan"
Topic 1: 0.025*"think" + 0.012*"right" + 0.011*"like" + 0.010*"question" + 0.009*"thing" + 0.009*"point" + 0.009*"deposit" + 0.009*"capit" + 0.009*"actual" + 0.008*"bit"
Topic 2: 0.015*"think" + 0.014*"capit" + 0.013*"rate" + 0.011*"sort" + 0.010*"differ" + 0.010*"chang" + 0.009*"yeah" + 0.009*"know" + 0.009*"lot" + 0.008*"obvious"
Topic 3: 0.022*"bit" + 0.021*"littl" + 0.018*"think" + 0.018*"rate" + 0.013*"expect" + 0.010*"market" + 0.010*"sort" + 0.010*"environ" + 0.009*"term" + 0.009*"yeah"
Topic 4: 0.013*"year" + 0.012*"question" + 0.011*"driver" + 0.010*"quarter" + 0.010*"invest" + 0.009*"bear" + 0.009*"nii" + 0.008*"busi" + 0.008*"sequenti" + 0.008*"good"
Topic 5: 0.025*"year" + 0.015*"look" + 0.013*"want" + 0.013*"question" + 0.012*"good" + 0.012*"credit" + 0.012*"think" + 0.011*"like" 

In [68]:
# Run LDA on Management dataset

management_texts = df_mgmt["Transcript"].dropna()

# remove very short text entries
management_texts = management_texts[management_texts.str.len() > 15].drop_duplicates()

# preprocess for LDA
management_processed = management_texts.apply(preprocess_string)
dictionary_mgmt = corpora.Dictionary(management_processed)
corpus_mgmt = [dictionary_mgmt.doc2bow(text) for text in management_processed]

num_topics = 6
lda_model_mgmt = LdaModel(corpus=corpus_mgmt, id2word=dictionary_mgmt, num_topics=num_topics, passes=10)

print("\nExtracted Topics from LDA (Management Discussion):")
for idx, topic in lda_model_mgmt.show_topics(num_topics=num_topics, formatted=True):
    print(f"Topic {idx}: {topic}")


Extracted Topics from LDA (Management Discussion):
Topic 0: 0.058*"year" + 0.037*"billion" + 0.025*"net" + 0.020*"revenu" + 0.020*"driven" + 0.019*"quarter" + 0.019*"higher" + 0.016*"market" + 0.015*"million" + 0.011*"deposit"
Topic 1: 0.002*"year" + 0.002*"billion" + 0.001*"net" + 0.001*"revenu" + 0.001*"million" + 0.001*"driven" + 0.001*"quarter" + 0.001*"higher" + 0.001*"expens" + 0.001*"page"
Topic 2: 0.003*"year" + 0.002*"billion" + 0.002*"net" + 0.002*"quarter" + 0.002*"market" + 0.002*"million" + 0.002*"higher" + 0.001*"revenu" + 0.001*"driven" + 0.001*"deposit"
Topic 3: 0.004*"year" + 0.002*"billion" + 0.002*"quarter" + 0.002*"driven" + 0.002*"net" + 0.002*"revenu" + 0.002*"higher" + 0.002*"million" + 0.001*"expens" + 0.001*"market"
Topic 4: 0.058*"year" + 0.032*"billion" + 0.023*"quarter" + 0.020*"net" + 0.017*"driven" + 0.017*"revenu" + 0.015*"higher" + 0.014*"million" + 0.014*"market" + 0.011*"deposit"
Topic 5: 0.004*"year" + 0.004*"billion" + 0.003*"net" + 0.003*"quarter" 
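
The management topics printed above share most of their top terms. A minimal sketch for quantifying that overlap from the formatted topic strings follows. The helper names are hypothetical, and Jaccard similarity is used here as a simple redundancy check, not as a method taken from the notebook.

```python
import re
from itertools import combinations

def top_words(topic_string):
    """Extract the quoted terms from a gensim formatted topic string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

def jaccard(a, b):
    """Share of terms two sets have in common (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_overlap(topic_strings):
    """Average Jaccard similarity over all pairs of topics."""
    sets = [top_words(t) for t in topic_strings]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Topics 0 and 4 copied from the Management Discussion output above
topic_0 = ('0.058*"year" + 0.037*"billion" + 0.025*"net" + 0.020*"revenu" + 0.020*"driven" + '
           '0.019*"quarter" + 0.019*"higher" + 0.016*"market" + 0.015*"million" + 0.011*"deposit"')
topic_4 = ('0.058*"year" + 0.032*"billion" + 0.023*"quarter" + 0.020*"net" + 0.017*"driven" + '
           '0.017*"revenu" + 0.015*"higher" + 0.014*"million" + 0.014*"market" + 0.011*"deposit"')

print(jaccard(top_words(topic_0), top_words(topic_4)))
```

Topics 0 and 4 contain the same ten terms, so their overlap is 1.0. That suggests `num_topics=6` may be more than this small set of management transcripts supports.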

In [67]:
# Visualise LDA topics

print("\nGenerating LDA visualization for Q&A...")
lda_vis_qna = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(lda_vis_qna, "LDA_QnA_Visualization.html")

print("\nGenerating LDA visualization for Management Discussion...")
lda_vis_mgmt = gensimvis.prepare(lda_model_mgmt, corpus_mgmt, dictionary_mgmt)
pyLDAvis.save_html(lda_vis_mgmt, "LDA_Management_Visualization.html")

print("\nDisplaying LDA Visualization for Q&A:")
display(HTML(filename="LDA_QnA_Visualization.html"))  # pass filename=; a positional string is treated as raw HTML

print("\nDisplaying LDA Visualization for Management Discussion:")
display(HTML(filename="LDA_Management_Visualization.html"))


Generating LDA visualization for Q&A...

Generating LDA visualization for Management Discussion...

Displaying LDA Visualization for Q&A:



Displaying LDA Visualization for Management Discussion:


# 4. Measure tone in analyst discussions with sentiment analysis (FinBERT, Flan-T5)

- Apply FinBERT to extract sentiment scores.
- Apply Flan-T5 (or another LLM) to re-evaluate sentiment and check whether it improves on FinBERT's accuracy.
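
The two sentiment steps above could be sketched as follows. The checkpoint names `ProsusAI/finbert` and `google/flan-t5-base`, the prompt wording and the helper names are assumptions made for illustration, since the notebook does not yet specify them.

```python
LABELS = ("positive", "neutral", "negative")

def build_sentiment_prompt(text):
    """Instruction prompt asking the LLM for one of three labels."""
    return (
        "Classify the sentiment of this earnings call excerpt as "
        "positive, neutral or negative.\n\n"
        f"Excerpt: {text}\n\nSentiment:"
    )

def normalise_label(raw):
    """Map free-form model output onto one of LABELS, defaulting to 'neutral'."""
    lowered = raw.strip().lower()
    for label in LABELS:
        if label in lowered:
            return label
    return "neutral"

def score_sentiment(texts):
    """Return (finbert_labels, flan_labels) for a list of texts."""
    # Lazy import; requires `pip install transformers torch`
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

    # FinBERT: truncate inputs to the model's maximum length
    finbert = pipeline("text-classification", model="ProsusAI/finbert")
    finbert_labels = [normalise_label(r["label"]) for r in finbert(texts, truncation=True)]

    # Flan-T5: prompt the model and decode a short answer
    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
    prompts = [build_sentiment_prompt(t) for t in texts]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**inputs, max_new_tokens=5)
    decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
    flan_labels = [normalise_label(s) for s in decoded]
    return finbert_labels, flan_labels

print(normalise_label(" Positive "))
```

Mapping both models onto the same three labels keeps the later like-for-like comparison simple. Judging which model is more accurate would still need a small hand-labelled sample to compare against.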

# 5. Summarisation & Context Extraction

- Use Flan-T5 to summarize extracted topics.
- Compare free topics vs G-SIB-trained topics.
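
A minimal sketch of topic-level summarisation with Flan-T5, assuming each text already carries a BERTopic topic id. The checkpoint name, the prompt wording, the `max_chars` cut-off and the helper names are illustrative assumptions.

```python
from collections import defaultdict

def group_by_topic(texts, topics):
    """Collect texts per topic id, skipping BERTopic's outlier topic (-1)."""
    grouped = defaultdict(list)
    for text, topic in zip(texts, topics):
        if topic != -1:
            grouped[topic].append(text)
    return dict(grouped)

def build_summary_prompt(texts, max_chars=2000):
    """Join the texts of one topic into a single, length-capped prompt."""
    joined = " ".join(t.strip() for t in texts)
    return "Summarize the following earnings call discussion:\n\n" + joined[:max_chars]

def summarise_topics(grouped, model_name="google/flan-t5-base", max_new_tokens=80):
    """Return {topic_id: summary} for the grouped texts."""
    # Lazy import; requires `pip install transformers torch`
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    summaries = {}
    for topic, texts in grouped.items():
        inputs = tokenizer(build_summary_prompt(texts), return_tensors="pt", truncation=True)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        summaries[topic] = tokenizer.decode(output[0], skip_special_tokens=True)
    return summaries

print(group_by_topic(["a", "b", "c"], [0, -1, 0]))
```

Running the same function on the topic assignments from the free model and from the G-SIB-seeded model would give two sets of summaries to compare.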

# 6. Comparative Analysis

Compare analyst topics & sentiment per quarter

Compare different analysts in the same quarter

Compare different model outputs for accuracy

1. Model comparisons
   - Compare BERTopic, FinBERT, and LLM results for accuracy.
   - Ensure like-for-like comparison across different model outputs.
2. Quarter-on-quarter analyst comparison
   - Select at least two quarters (e.g., Q4 2024 vs. Q2 2024).
   - Compare topics and sentiment shifts for a chosen analyst.
3. Analyst comparisons (same quarter)
   - Identify topic differences between analysts within the same quarter.
   - Check sentiment variation across analysts.
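
The three comparisons above could be computed with a few pandas helpers once topics and sentiment labels sit in one table. This is a sketch under assumptions: the column names `Topic`, `finbert` and `flan`, the helper names, and every row in the example frame are made up for illustration and are not results from the transcripts.

```python
import pandas as pd

# Illustrative rows only; analysts, topics and labels are invented
results = pd.DataFrame({
    "Quarter": ["2024-Q2", "2024-Q2", "2024-Q4", "2024-Q4", "2024-Q4"],
    "Analyst": ["Analyst A", "Analyst B", "Analyst A", "Analyst A", "Analyst B"],
    "Topic": ["capital", "deposits", "capital", "nii", "capital"],
    "finbert": ["neutral", "negative", "positive", "neutral", "neutral"],
    "flan": ["neutral", "neutral", "positive", "neutral", "negative"],
})

def model_agreement(df, col_a="finbert", col_b="flan"):
    """1. Share of rows on which the two sentiment models agree."""
    return float((df[col_a] == df[col_b]).mean())

def topic_counts(df, analyst):
    """2. Topic counts per quarter for one analyst."""
    subset = df[df["Analyst"] == analyst]
    return pd.crosstab(subset["Quarter"], subset["Topic"])

def analysts_in_quarter(df, quarter, label_col="finbert"):
    """3. Sentiment label counts per analyst within one quarter."""
    subset = df[df["Quarter"] == quarter]
    return pd.crosstab(subset["Analyst"], subset[label_col])

print(model_agreement(results))
print(topic_counts(results, "Analyst A"))
print(analysts_in_quarter(results, "2024-Q4"))
```

Filtering to the same analyst and quarter before calling each helper keeps the comparison like-for-like across models.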