<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/Preprocessing/notebooks/processed/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Create a function that preprocesses text as follows:

1. Lowercasing letters
2. Removing stop words
3. Stemming words
4. Tokenizing
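The four steps listed above can be sketched on a single string using only the standard library. This is an illustration under stated assumptions, not the notebook's implementation: `STOP_WORDS` is a tiny hypothetical list (the notebook uses NLTK's English list), `crude_stem` is a hypothetical suffix-stripping stand-in for a real stemmer, and `preprocess_text` is a name introduced here.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "are", "a", "an", "and", "of", "to", "in", "on"}


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text(text):
    lowered = text.lower()                              # lowercase
    tokens = re.findall(r"[a-z]+", lowered)             # tokenize on runs of letters
    kept = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [crude_stem(t) for t in kept]                # stem


example_tokens = preprocess_text("The banks are lending and loans slowed.")
print(example_tokens)
```

Tokenizing before stop-word removal means a stop word followed by punctuation (for example "you.") is still recognised as a stop word.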




# **Library**

In [111]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize
from collections import Counter
import regex as re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# **Function**

`preprocessor` function: given a DataFrame and a column name, it preprocesses the values in that column in place.

In [112]:
# Preprocess a text column of a DataFrame in place
def preprocessor(data, col):
  # Lowercase the text; treat missing values as empty strings
  data[col] = data[col].fillna("").str.lower()

  # Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col] = data[col].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  # Tokenize the text
  data[col] = data[col].apply(word_tokenize)

  # Remove tokens that consist only of digits
  data[col] = data[col].apply(lambda x: [word for word in x if not word.isdigit()])

  # Remove short tokens (two characters or fewer)
  data[col] = data[col].apply(lambda x: [word for word in x if len(word) > 2])

  # Strip every character that is not a lowercase letter
  data[col] = data[col].apply(lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  # Drop tokens left empty by the previous step
  data[col] = data[col].apply(lambda x: [word for word in x if word != ""])
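One subtlety in the cleaning steps is worth a small illustration: stripping non-letter characters can turn a token into an empty or very short string, so any length filter is most effective when it runs after the stripping. A minimal sketch on a hypothetical token list (the tokens are made up for illustration):

```python
import re

# Hypothetical tokens, as a tokenizer might produce them.
tokens = ["q4", "2024", "...", "net-interest", "income", "%"]

# Strip everything that is not a lowercase letter; some tokens become "".
stripped = [re.sub(r"[^a-z]", "", token) for token in tokens]

# Filtering on length afterwards removes the empty and very short leftovers.
cleaned = [token for token in stripped if len(token) > 2]

print(stripped)
print(cleaned)
```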

`remove_freq_words` function: removes the most frequent words from a tokenized column. `percent` is a fraction of the distinct vocabulary (for example, 0.05 drops the top 5% most frequent words). It returns one space-joined string per row.

In [113]:
def remove_freq_words(data, col, percent):
  # Flatten the list of tokens
  all_tokens = [token for sublist in data[col] for token in sublist]

  # Calculate word frequencies
  word_freq = Counter(all_tokens)

  # Convert the counts to a DataFrame
  word_freq_df = pd.DataFrame(word_freq.items(), columns=["word", "freq"])

  # Identify the top `percent` fraction of most frequent words
  top_words = set(word_freq_df.nlargest(int(len(word_freq_df) * percent), "freq")["word"])

  # Drop those words and join the remaining tokens back into one string per row
  filtered_data = []
  for sentence in data[col]:
    filtered_sentence = [word for word in sentence if word not in top_words]
    filtered_data.append(" ".join(filtered_sentence))

  return filtered_data
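The frequency cut-off can be illustrated on toy data with the standard library alone. This is a sketch of the same idea, not the notebook's function: `drop_most_frequent` and the `docs` list are hypothetical names and data introduced here.

```python
from collections import Counter


def drop_most_frequent(docs, fraction):
    """Remove the top `fraction` of distinct words, ranked by frequency."""
    counts = Counter(token for doc in docs for token in doc)
    # int() truncates, so a small vocabulary may lose no words at all.
    n_drop = int(len(counts) * fraction)
    frequent = {word for word, _ in counts.most_common(n_drop)}
    return [[token for token in doc if token not in frequent] for doc in docs]


# Hypothetical tokenized documents.
docs = [["bank", "rates", "bank"], ["bank", "loans"], ["rates", "growth"]]

print(drop_most_frequent(docs, 0.25))  # 4 distinct words -> 1 word dropped
print(drop_most_frequent(docs, 0.05))  # int(4 * 0.05) == 0 -> nothing dropped
```

Because the cut-off is a fraction of the distinct vocabulary rather than of all tokens, the number of removed words grows with vocabulary size.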

In [114]:
# Clone the repository and move into data/processed, where qa_section.csv lives
!git clone https://github.com/sheldonkemper/bank_of_england.git
%ls
%cd bank_of_england/
%ls
%cd data
%ls
%cd processed/
%ls

#Defining qa_data
qa_data = pd.read_csv("qa_section.csv")


Cloning into 'bank_of_england'...
remote: Enumerating objects: 408, done.
remote: Counting objects: 100% (176/176), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 408 (delta 90), reused 54 (delta 34), pack-reused 232 (from 1)
Receiving objects: 100% (408/408), 3.38 MiB | 16.42 MiB/s, done.
Resolving deltas: 100% (171/171), done.
bank_of_england/  management_discussion.csv  qa_section.csv
/content/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england
[0m[01;34mdata[0m/  [01;34mdocum

In [115]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","So, you'll stay around maybe for a few more ye...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC",All right. Thank you.,4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,Operator,,,Thank you. Our next question comes from Jim Mi...,4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [116]:
#Checking the type of data
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 739 entries, 0 to 738
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   speaker            739 non-null    object
 1   marker             637 non-null    object
 2   job_title          636 non-null    object
 3   utterance          738 non-null    object
 4   filename           739 non-null    object
 5   financial_quarter  739 non-null    object
 6   call_date          739 non-null    object
dtypes: object(7)
memory usage: 40.5+ KB
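The summary above reports 738 non-null values in `utterance` out of 739 rows, so one utterance is missing. A minimal sketch of why that matters for text cleaning: converting a missing value with `str()` yields the literal text "nan", which then survives as a token. The helper `safe_text` and the sample list are hypothetical, introduced only for this illustration.

```python
import math


def safe_text(value):
    """Return '' for missing values (None or float NaN), else str(value)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)


# Hypothetical column values: one real utterance and one missing value.
utterances = ["All right. Thank you.", float("nan")]

naive = [str(u).lower().split() for u in utterances]
safe = [safe_text(u).lower().split() for u in utterances]

print(naive[1])  # the missing value became a spurious token
print(safe[1])   # the missing value became an empty token list
```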


In [117]:
preprocessor(qa_data, "utterance")

In [118]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","[yeah, think, conventional, wisdom, pretending...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","[stay, around, maybe, years, base, case, right...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","[right, thank, you]",4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,Operator,,,"[thank, you, next, question, comes, jim, mitch...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","[hey, good, morning, maybe, regulation, new, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [119]:
# Remove Operator
qa_data = qa_data[qa_data["speaker"] != "Operator"]

#Filtering only on Answers
#qa_data = qa_data[qa_data["marker"]=="A"]

In [121]:
remove_freq_words(qa_data, "utterance", 0.05)

['conventional wisdom pretending add conventional wisdom other tapering complete and therefore sometime middle seems consensus step h data flow funds models type peers behaving evolution expectations economywide cetera impact systemwide consistent story telling background plus minus happens policy stabilizing growing half', 'stay base case', '', 'regulation administration soontobe head regulation about again areas regulatory structure impactful areas requirements down story requirements simply stop', 'jim deep rabbit holes speculating parts framework evolve productive attempt backing read quotes consistent long coherent rational holisticallyassessed regulatory framework allows job supporting reflexively antibank default every', 'everything liquidity uses data obvious goal safe sound system recognizing play critical role supporting hope aspects supervisory framework bureaucratic adversarial substantive management focus matter most goes down stays flat complicated iii endgame gsib factor

In [106]:
!pip install tensorflow
!pip install numpy
!pip install bertopic


import random

import numpy as np
import tensorflow as tf
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP



In [107]:
# Define a function to reset the session.
def reset_session():
    tf.keras.backend.clear_session()
    np.random.seed(42)
    random.seed(42)
    tf.random.set_seed(42)
#reset_session()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
###vectorizer_model = CountVectorizer(max_df=0.8, min_df=2)
umap_model = UMAP(n_neighbors=20, min_dist=0.1)

# Fit BERTopic on the filtered documents, then assign a topic to each one
model2 = BERTopic(embedding_model=embedding_model, umap_model=umap_model, verbose=True)
model2.fit(filtered_data)
topic, probabilities = model2.transform(filtered_data)

2025-02-08 08:32:09,075 - BERTopic - Embedding - Transforming documents to embeddings.


2025-02-08 08:32:39,344 - BERTopic - Embedding - Completed ✓
2025-02-08 08:32:39,353 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-02-08 08:32:41,513 - BERTopic - Dimensionality - Completed ✓
2025-02-08 08:32:41,515 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-02-08 08:32:41,542 - BERTopic - Cluster - Completed ✓
2025-02-08 08:32:41,548 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-02-08 08:32:41,582 - BERTopic - Representation - Completed ✓


2025-02-08 08:33:02,701 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-02-08 08:33:02,707 - BERTopic - Dimensionality - Completed ✓
2025-02-08 08:33:02,709 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-02-08 08:33:02,732 - BERTopic - Cluster - Completed ✓


In [108]:
model2.get_topic_freq().head(10)

Unnamed: 0,Topic,Count
0,0,421
1,1,112
4,2,47
3,3,29
6,4,13
2,5,11
5,-1,7
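To read the table above as proportions, the counts can be converted to shares of all documents. A short sketch using the counts exactly as shown; the variable names are introduced here, and topic -1 is BERTopic's bucket for outlier documents.

```python
# Topic counts copied from the table above; topic -1 holds outliers.
topic_counts = {0: 421, 1: 112, 2: 47, 3: 29, 4: 13, 5: 11, -1: 7}

total_docs = sum(topic_counts.values())
shares = {topic: count / total_docs for topic, count in topic_counts.items()}

for topic, share in shares.items():
    print(f"topic {topic:>2}: {share:.1%}")
```

On these counts, topic 0 covers roughly two thirds of the documents, which suggests the clustering is dominated by a single broad topic.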


In [109]:
model2.get_topic(0)

[('function', 0.009271714048439546),
 ('republic', 0.009271714048439546),
 ('case', 0.009033923530997684),
 ('reprice', 0.009033923530997684),
 ('pricing', 0.009033923530997684),
 ('investor', 0.009033923530997684),
 ('else', 0.009033923530997684),
 ('looking', 0.009033923530997684),
 ('impact', 0.008793668524299044),
 ('happens', 0.008793668524299044)]

In [110]:
import plotly.io as pio

fig = model2.visualize_barchart(top_n_topics=10, n_words=5)
fig.update_layout(
    autosize=False,
    width=1000,
    height=800,
    margin=dict(l=50, r=50, t=100, b=50),
    font=dict(size=12),
    title=dict(
        text="Top 10 Topics and Their Key Words",
        font=dict(size=16),
        x=0.5,
        y=0.98,
        xanchor="center",
        yanchor="top"
    )
)

fig.show()