<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/Preprocessing/notebooks/processed/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Create a function that preprocesses the text as follows:


  1. Lowercasing letters

  2. Removing stop words

  3. Stemming words

  4. Tokenizing
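
The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation: `STOP_WORDS` is a small hypothetical subset of a stop-word list, `naive_stem` is a crude suffix stripper standing in for a real stemmer such as NLTK's `PorterStemmer`, and `preprocess_text` is a made-up helper name. Tokenizing happens before stop-word removal here so that attached punctuation cannot hide a stop word.

```python
import re

# Hypothetical, heavily shortened stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "we"}


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text(text):
    """Lowercase, tokenize, drop stop words, and stem one string."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)  # keep alphabetic runs only
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return [naive_stem(tok) for tok in kept]


print(preprocess_text("We are raising rates and the banks reported gains."))
# → ['rais', 'rat', 'bank', 'report', 'gain']
```

The over-aggressive stems (`rais`, `rat`) show why a proper stemmer or lemmatizer is usually preferable to simple suffix stripping.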




# **Library**

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize
from collections import Counter
import regex as re


# **Function**

`preprocessor` function: given a DataFrame and a column name, it preprocesses the values in that column in place.

In [25]:
"""
# Create a function to preprocess a text column in place
def preprocessor(data, col):
  # Lowercase the text
  data[col] = data[col].str.lower()

  # Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col] = data[col].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  # Tokenize the text
  data[col] = data[col].apply(word_tokenize)

  # Remove purely numeric tokens
  data[col] = data[col].apply(lambda x: [word for word in x if not word.isdigit()])

  # Remove short tokens (two characters or fewer)
  data[col] = data[col].apply(lambda x: [word for word in x if len(word) > 2])

  # Strip every character that is not a lowercase letter
  data[col] = data[col].apply(lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  # Drop tokens left empty after stripping symbols
  data[col] = data[col].apply(lambda x: [word for word in x if word != ""])
  return
  """

`remove_freq_words` function: removes the most frequent words from the column, where `percent` is the fraction of distinct words to drop (for example, `0.05` for the top 5%).

In [26]:
"""
def remove_freq_words(data, col, percent):
  # Flatten the list of tokens
  all_tokens = [token for sublist in data[col] for token in sublist]

  # Calculate word frequencies
  word_freq = Counter(all_tokens)

  # Convert the frequencies to a DataFrame
  word_freq_df = pd.DataFrame(word_freq.items(), columns=["word", "freq"])

  # Identify the top `percent` fraction of most frequent words
  top_words = set(word_freq_df.nlargest(int(len(word_freq_df) * percent), "freq")["word"])

  # Rebuild each document as a string without the frequent words
  filtered_data = []
  for sentence in data[col]:
    filtered_sentence = [word for word in sentence if word not in top_words]
    filtered_data.append(" ".join(filtered_sentence))

  return filtered_data
  """

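To see how the frequency cutoff behaves, here is a standard-library sketch of the same idea on toy data. The helper name `drop_top_fraction` and the sample documents are hypothetical. Because the number of words to drop is computed with `int(...)`, it truncates toward zero, so a small vocabulary combined with a 5% cutoff removes nothing.

```python
from collections import Counter


def drop_top_fraction(docs, fraction):
    """Remove the most frequent `fraction` of distinct words from tokenized docs."""
    counts = Counter(token for doc in docs for token in doc)
    n_drop = int(len(counts) * fraction)  # truncates toward zero
    frequent = {word for word, _ in counts.most_common(n_drop)}
    return [" ".join(tok for tok in doc if tok not in frequent) for doc in docs]


# Hypothetical tokenized documents with 4 distinct words.
docs = [["rates", "rise"], ["rates", "fall"], ["rates", "hold"]]

print(drop_top_fraction(docs, 0.25))  # 1 of 4 words dropped: "rates"
print(drop_top_fraction(docs, 0.05))  # int(4 * 0.05) == 0, nothing dropped
```

On a real corpus with thousands of distinct words, a 5% cutoff drops many words at once, so it is worth inspecting the dropped set before modelling.
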
# **Adding Data**

In [38]:
#Obtaining the Q&A data from the repository
!git clone https://github.com/sheldonkemper/bank_of_england.git
#Enter the repository before switching branch
%cd bank_of_england
!git switch Preprocessing
%cd data/processed
%ls

#Defining qa_data
qa_data = pd.read_csv("qa_section.csv")


Cloning into 'bank_of_england'...
remote: Enumerating objects: 425, done.

In [40]:
%cd ..
%cd ..
%cd notebooks
%ls

/bank_of_england/notebooks/bank_of_england/notebooks/bank_of_england/notebooks/bank_of_england
/bank_of_england/notebooks/bank_of_england/notebooks/bank_of_england/notebooks
[Errno 2] No such file or directory: 'notebooks'
/bank_of_england/notebooks/bank_of_england/notebooks/bank_of_england/notebooks
[0m[01;34mbank_of_england[0m/  [01;34mimport[0m/  [01;34mlibrary[0m/  [01;34mmodelling[0m/  [01;34mprocessed[0m/


In [41]:
from library import quant as qu

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [29]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","So, you'll stay around maybe for a few more ye...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC",All right. Thank you.,4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,Operator,,,Thank you. Our next question comes from Jim Mi...,4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [30]:
#Checking the type of data
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 739 entries, 0 to 738
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   speaker            739 non-null    object
 1   marker             637 non-null    object
 2   job_title          636 non-null    object
 3   utterance          738 non-null    object
 4   filename           739 non-null    object
 5   financial_quarter  739 non-null    object
 6   call_date          739 non-null    object
dtypes: object(7)
memory usage: 40.5+ KB


In [42]:
qu.preprocessor(qa_data, "utterance")

In [32]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","[yeah, think, conventional, wisdom, pretending...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","[stay, around, maybe, years, base, case, right...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","[right, thank, you]",4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,Operator,,,"[thank, you, next, question, comes, jim, mitch...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","[hey, good, morning, maybe, regulation, new, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [33]:
# Remove Operator
qa_data = qa_data[qa_data["speaker"] != "Operator"]

#Filtering only on Answers
#qa_data = qa_data[qa_data["marker"]=="A"]

In [46]:
#Drop the 5% most frequent words; the resulting strings feed the topic model
filtered_data = qu.remove_freq_words(qa_data, "utterance", 0.05)

In [36]:
!pip install tensorflow
!pip install numpy
!pip install bertopic


import tensorflow as tf
import numpy as np
import random
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP



In [47]:
# Define a function to reset the session.
def reset_session():
    tf.keras.backend.clear_session()
    np.random.seed(42)
    random.seed(42)
    tf.random.set_seed(42)
#reset_session()

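# Sentence embedding model used to vectorize the documents
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
###vectorizer_model = CountVectorizer(max_df=0.8, min_df=2)
# UMAP reduces the dimensionality of the embeddings before clustering
umap_model = UMAP(n_neighbors=20, min_dist=0.1)

# Fit the topic model, then assign a topic to each document
model2 = BERTopic(embedding_model=embedding_model, umap_model=umap_model, verbose=True)
model2.fit(filtered_data)
topic, probabilities = model2.transform(filtered_data)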

2025-02-08 22:16:41,657 - BERTopic - Embedding - Transforming documents to embeddings.


2025-02-08 22:16:55,058 - BERTopic - Embedding - Completed ✓
2025-02-08 22:16:55,062 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-02-08 22:17:12,615 - BERTopic - Dimensionality - Completed ✓
2025-02-08 22:17:12,617 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-02-08 22:17:12,653 - BERTopic - Cluster - Completed ✓
2025-02-08 22:17:12,663 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-02-08 22:17:12,715 - BERTopic - Representation - Completed ✓


2025-02-08 22:17:25,885 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2025-02-08 22:17:25,892 - BERTopic - Dimensionality - Completed ✓
2025-02-08 22:17:25,894 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2025-02-08 22:17:25,922 - BERTopic - Cluster - Completed ✓


In [48]:
model2.get_topic_freq().head(10)

Unnamed: 0,Topic,Count
0,0,424
1,1,113
2,2,82
4,3,44
5,4,21
7,5,18
3,6,14
6,-1,12
8,7,11
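
In BERTopic, topic `-1` collects outlier documents that were not assigned to any cluster. One quick way to read the table is to convert the counts into shares of the corpus. The sketch below hard-codes the counts shown above rather than calling the model, so the dictionary is a copy of the output, not a BERTopic API.

```python
# Topic counts copied from the get_topic_freq() output above.
topic_counts = {0: 424, 1: 113, 2: 82, 3: 44, 4: 21, 5: 18, 6: 14, -1: 12, 7: 11}

total = sum(topic_counts.values())
shares = {topic: count / total for topic, count in topic_counts.items()}

# Print topics from largest to smallest share.
for topic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    label = "outliers" if topic == -1 else f"topic {topic}"
    print(f"{label:>9}: {share:6.1%}")
```

Topic 0 holds more than half of the 739 documents, which may indicate that the clustering is too coarse to separate distinct themes.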


In [49]:
model2.get_topic(0)

[('increase', 0.009021775852286288),
 ('function', 0.008801750563518052),
 ('second', 0.008801750563518052),
 ('returns', 0.008801750563518052),
 ('republic', 0.008801750563518052),
 ('stuff', 0.008801750563518052),
 ('looking', 0.008579362789563552),
 ('investor', 0.008579362789563552),
 ('reprice', 0.008579362789563552),
 ('case', 0.008579362789563552)]

In [50]:
fig = model2.visualize_barchart(top_n_topics=10, n_words=5)
fig.update_layout(
    autosize=False,
    width=1000,
    height=800,
    margin=dict(l=50, r=50, t=100, b=50),
    font=dict(size=12),
    title=dict(
        text="Top 10 Topics and Their Key Words",
        font=dict(size=16),
        x=0.5,
        y=0.98,
        xanchor="center",
        yanchor="top"
    )
)

fig.show()