<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/Preprocessing/notebooks/processed/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Create function to preprocess as follows:


  1, Lowercasing letters

  2, Removing stop words

  3, Stemming words

  4, Tokenizing




Importing library

In [70]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize
from collections import Counter
import regex as re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Creating Function

In [63]:
#create function to preprocess data
def preprocessor (data, col):
  #Lower the lettercase
  data[col] = data[col].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col] = data[col].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col] = data[col].apply(word_tokenize)

  #Remove numbers
  data[col] = data[col].apply(lambda x: [word for word in x if not word.isdigit()])

  #remove symbol from comments
  data[col] = data[col].apply(lambda x: [word for word in x if x!=""])

  #remove short word
  data[col] = data[col].apply(lambda x: [word for word in x if len(word)>2])

  #remove symbols
  data[col] = data[col].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])
  return

In [64]:
#Obtaining management discussion / git bash
!git clone https://github.com/sheldonkemper/bank_of_england.git
%ls
%cd bank_of_england/
%ls
%cd data
%ls
%cd processed/
%ls

#Defining qa_data
qa_data = pd.read_csv("qa_section.csv")


Cloning into 'bank_of_england'...
remote: Enumerating objects: 403, done.[K
remote: Counting objects: 100% (171/171), done.[K
remote: Compressing objects: 100% (131/131), done.[K
remote: Total 403 (delta 88), reused 54 (delta 34), pack-reused 232 (from 1)[K
Receiving objects: 100% (403/403), 3.34 MiB | 27.15 MiB/s, done.
Resolving deltas: 100% (169/169), done.
[0m[01;34mbank_of_england[0m/  management_discussion.csv  qa_section.csv
/content/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england/data/processed/bank_of_england
[0m[01;34mdata[0m/  [01;34mdocuments[0m/  [01;34mnotebooks[0m/  README.md  [01;34msrc[0m/
/content/bank_of_england/data/processed/bank_of_england/data/

In [34]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","So, you'll stay around maybe for a few more ye...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC",All right. Thank you.,4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,Operator,,,Thank you. Our next question comes from Jim Mi...,4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [15]:
#Checking the type of data
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 739 entries, 0 to 738
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   speaker            739 non-null    object
 1   marker             637 non-null    object
 2   job_title          636 non-null    object
 3   utterance          738 non-null    object
 4   filename           739 non-null    object
 5   financial_quarter  739 non-null    object
 6   call_date          739 non-null    object
dtypes: object(7)
memory usage: 40.5+ KB


In [65]:
preprocessor(qa_data, "utterance")

In [66]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","[yeah, think, conventional, wisdom, pretending...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","[stay, around, maybe, years, base, case, right...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","[right, thank, you]",4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,Operator,,,"[thank, you, next, question, comes, jim, mitch...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","[hey, good, morning, maybe, regulation, new, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [72]:
#Flatten the list of tokens
all_tokens = [token for sublist in qa_data["utterance"] for token in sublist]

#calculate word frequencies
word_freq = Counter(all_tokens)

#Convert it to DF
word_freq_df = pd.DataFrame(word_freq.items(), columns = ["word","freq"])

#Identify the to 5% most frequent words
top_5_percent = word_freq_df.nlargest(int(len(word_freq_df)*0.05), "freq")["word"]

filtered_data = []
for sentence in qa_data["utterance"]:
  filtered_sentence = [word for word in sentence if word not in top_5_percent.values]
  filtered_data.append(" ".join(filtered_sentence))

print(filtered_data)


['conventional wisdom pretending add conventional wisdom other tapering complete and therefore sometime middle seems consensus step h data flow funds models type stuff peers behaving evolution expectations economywide cetera impact systemwide consistent story telling background plus minus happens policy stabilizing growing second half', 'stay base case', '', 'jim mitchell seaport global securities', 'regulation administration soontobe head regulation about again areas regulatory structure impactful areas requirements down story requirements simply stop', 'jim deep rabbit holes speculating parts framework evolve productive attempt backing second read quotes consistent long coherent rational holisticallyassessed regulatory framework allows job supporting reflexively antibank default every', 'everything liquidity uses data obvious goal safe sound system recognizing play critical role supporting hope aspects supervisory framework bureaucratic adversarial substantive management focus matter