
In [57]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [58]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

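`preprocessor` function: modifies the DataFrame `data` in place, adding two new columns (`col1` and `col2`) that hold the preprocessed text. It returns nothing.


Input:
  - `data`: the DataFrame to preprocess
  - `col`: name of the existing column that contains the text to clean
  - `col1`: name of the new column for the tokenized text (a list of lowercased, lemmatized tokens per row)
  - `col2`: name of the new column for the cleaned text (a one-element list holding the cleaned string)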

In [148]:
#create function to preprocess data
def preprocessor(data, col, col1, col2):
  #Copy the source column into the two new columns
  data[col1]=data[col]
  data[col2]=data[col]


  #Adding column1
  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(nltk.word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #remove empty tokens
  data[col1] = data[col1].apply(lambda x: [word for word in x if word!=""])

  #remove short words (fewer than 3 characters)
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #remove every character that is not a lowercase letter
  data[col1] = data[col1].apply(lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #lemmatization
  lemmatizer = WordNetLemmatizer()
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



  #Adding column2
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words (reuses stop_words defined above)
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  #remove punctuation; each cell becomes a one-element list holding the cleaned string
  data[col2] = data[col2].apply(lambda x: [re.sub(r"[.,'?]", "", x)])

  #data is modified in place, so nothing is returned
  return
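
One consequence of the step order in `preprocessor` is that tokens are filtered by length before symbols are stripped. A token such as `'re` (from "we're") passes the length check and ends up as `re`, which is why `re` appears in the tokenised output further down, and a token made only of punctuation can end up as an empty string. The sketch below is an illustration only, not part of the notebook's pipeline: it strips symbols first and drops short tokens afterwards. The helper name `clean_tokens` and its `min_len` parameter are hypothetical, and it uses the standard-library `re` module rather than `regex`.

```python
import re

def clean_tokens(tokens, min_len=3):
    """Hypothetical helper: strip non a-z characters, then drop short tokens."""
    # Strip symbols and digits first so the length check sees the final token.
    stripped = [re.sub(r"[^a-z]", "", tok.lower()) for tok in tokens]
    # Empty strings have length 0, so they are dropped by the same check.
    return [tok for tok in stripped if len(tok) >= min_len]

tokens = ["we", "'re", "thinking", "...", "2024", "about"]
print(clean_tokens(tokens))  # ['thinking', 'about']
```

Note that switching to this order would change the token counts used later for the length filter, so it would need to be applied consistently before re-exporting any data.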


## **Data**

In [60]:
#drive.mount('/content/drive')

In [61]:
#!ls"/content/bank_of_england/data/preprocessed_data/Archived/jpmorgan_qa_section_preprocessed.csv"

JP Morgan QA section

In [108]:
#Clone the repository and move to the archived preprocessed data (Colab shell commands)
!git clone https://github.com/sheldonkemper/bank_of_england.git
!git switch Preprocessing
%cd bank_of_england/data/preprocessed_data/archived
%ls

Cloning into 'bank_of_england'...
remote: Enumerating objects: 1126, done.
remote: Counting objects: 100% (280/280), done.
remote: Compressing objects: 100% (215/215), done.
remote: Total 1126 (delta 177), reused 85 (delta 65), pack-reused 846 (from 2)
Receiving objects: 100% (1126/1126), 10.17 MiB | 21.69 MiB/s, done.
Resolving deltas: 100% (545/545), done.
fatal: invalid reference: Preprocessing
/content/bank_of_england/data/cleansed/bank_of_england/data/cleansed/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived
jpmorgan_preprocessed_transcript.csv  santander_management_discussion_preprocessed.csv
jpmorgan_qa_section_preprocessed.csv  unfiltered_preprocessed_JP_qa_sec.csv
preprocessed_santander.csv


In [179]:
#Defining qa_data
qa_data = pd.read_csv("jpmorgan_qa_section_preprocessed.csv")
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,tokenised_data,cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['yeah', 'think', 'conventional', 'wisdom', 'p...",['yeah think conventional wisdom qt im pretend...
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","So, you'll stay around maybe for a few more ye...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['stay', 'around', 'maybe', 'years', 'base', '...",['so stay around maybe years base case right n...
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC",All right. Thank you.,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['right', 'thank', 'you']",['right thank you']
3,Operator,,,Thank you. Our next question comes from Jim Mi...,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['thank', 'you', 'next', 'question', 'comes', ...",['thank you next question comes jim mitchell s...
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['hey', 'good', 'morning', 'maybe', 'regulatio...",['hey good morning maybe regulation new admini...


In [180]:
#preprocessing data
preprocessor(qa_data, "utterance", "question_tokenised_data", "question_cleaned_data")
#preprocessor(qa_data,"answer","answer_tokenised_data","answer_cleaned_data")

In [181]:
#remove operator turns; .copy() makes the result an independent DataFrame,
#so adding a column below does not raise SettingWithCopyWarning
qa_data = qa_data.loc[qa_data["speaker"]!="Operator"].copy()

#keep only utterances with more than 20 tokens
qa_data["count"] = qa_data["question_tokenised_data"].apply(len)
qa_data = qa_data.loc[qa_data["count"]>20]
qa_data.head()


Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,tokenised_data,cleaned_data,question_tokenised_data,question_cleaned_data,count
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['yeah', 'think', 'conventional', 'wisdom', 'p...",['yeah think conventional wisdom qt im pretend...,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...,80
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['hey', 'good', 'morning', 'maybe', 'regulatio...",['hey good morning maybe regulation new admini...,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...,39
5,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['hey', 'jim', 'mean', 'obviously', 'something...",['hey jim mean obviously something were thinki...,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...,60
6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['everything', 'capital', 'liquidity', 'uses',...",['everything capital liquidity uses data balan...,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,74
7,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['add', 'that', 'great', 'jeremy', 'gave', 'al...",['add thats great jeremy gave all let add thre...,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,80


In [156]:
print(len(qa_data["question_tokenised_data"][2]))

3
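
The value 3 is the token count of the row whose index label is 2 ("All right. Thank you."). `qa_data["question_tokenised_data"][2]` looks a row up by label, not by position, so once rows have been filtered out the same expression can raise `KeyError` or point at a different row than expected. A minimal sketch on a toy DataFrame (the data and the column name `tokens` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"tokens": [["a"] * 30, ["b"] * 25, ["c"] * 3, ["d"] * 40]})

# Keep rows with more than 20 tokens; index labels 0, 1 and 3 survive.
filtered = df.loc[df["tokens"].apply(len) > 20]

# Label 2 was filtered out, so filtered["tokens"][2] would raise KeyError.
label_present = 2 in filtered.index

# Positional access with .iloc still returns the third remaining row.
third_by_position = filtered["tokens"].iloc[2]

# After reset_index(drop=True), labels and positions agree again.
renumbered = filtered.reset_index(drop=True)
third_by_label = renumbered["tokens"][2]

print(label_present, len(third_by_position), len(third_by_label))
```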


In [182]:
qa_data.drop(columns=["count","tokenised_data","cleaned_data"], inplace=True)
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
5,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...
6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...
7,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...


In [196]:
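qa_data.reset_index(drop=True, inplace=True)
#drop() returns a new DataFrame, so assign the result; errors="ignore" covers the case where no "index" column exists
qa_data = qa_data.drop(columns=["index"], errors="ignore")
qa_data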

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...
1,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
2,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...
3,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...
4,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...
...,...,...,...,...,...,...,...,...,...
348,Jeremy Barnum,A,"Chief Financial Officer, JPMorgan Chase & Co.","Sure. So, Betsy, your question is very good. A...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"[sure, betsy, question, good, would, say, look...",[sure so betsy question good would say look ev...
349,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorgan C...","Yeah. So I'll just add, so next quarter we all...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"[yeah, ll, add, next, quarter, kind, know, alr...",[yeah ill add next quarter kind know already t...
350,Betsy L. Graseck,Q,"Analyst, Morgan Stanley & Co. LLC","Yeah. Got it. Okay, that's super helpful to un...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"[yeah, got, okay, that, super, helpful, unders...",[yeah got it okay thats super helpful understa...
351,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorgan C...",Yeah. No. We're not on pause now. We're doing ...,1q23-earnings-transcript.pdf,1Q23,2023-04-14,"[yeah, re, pause, now, re, little, bit, now, o...",[yeah no were pause now were little bit now ob...


In [197]:
qa_data.head()

Unnamed: 0,index,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data
0,0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...
1,4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
2,5,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...
3,6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...
4,7,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...


JP Morgan management discussion

In [None]:
%ls

In [None]:
#defining JP Morgan management discussion dataframe
jpmorgan_body_df=pd.read_csv("jpmorgan_management_discussion.csv")
jpmorgan_body_df.head()

In [None]:
#preprocess data
preprocessor(jpmorgan_body_df, "chunk_text", "tokenized_data","cleaned_data")

In [None]:
jpmorgan_body_df.head()

UBS QA section

In [None]:
%ls

In [None]:
#define ubs q&a data
ubs_qna_df=pd.read_csv("ubs_qna_section.csv")

In [None]:
#preprocessing ubs Q&A data
preprocessor(ubs_qna_df, "utterance", "tokenized_data","cleaned_data")

In [None]:
ubs_qna_df.head()

UBS management discussion

In [None]:
%ls

In [None]:
#defining ubs management discussion
ubs_manag_df=pd.read_csv("ubs_management_discussion.csv")
ubs_manag_df.head()

In [None]:
#preprocessing ubs management discussion
preprocessor(ubs_manag_df,"utterance", "tokenized_data","cleaned_data")
ubs_manag_df.head()

# **Export the output as a csv file**

JP Morgan QA section

In [189]:
#export preprocessed data
preprocessed_qa_csv_path1 = "/content/bank_of_england/data/preprocessed_data/jpmorgan_qna_df_preprocessed_ver4.csv"
qa_data.to_csv(preprocessed_qa_csv_path1, index=False)

JP Morgan management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path2 = "/content/sample_data/jpmorgan_management_df_preprocessed.csv"
jpmorgan_body_df.to_csv(preprocessed_qa_csv_path2, index=False)

UBS QA section

In [None]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/sample_data/ubs_qa_df_preprocessed.csv"
ubs_qna_df.to_csv(preprocessed_qa_csv_path3, index=False)

UBS management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)
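
Because `to_csv` writes list-valued cells as their string representation, the token columns come back as plain strings when an exported file is read again. This is why `tokenised_data` appears as a quoted string in the first `head()` output of the JP Morgan QA section. A minimal sketch of restoring the lists with `ast.literal_eval`, using an in-memory buffer and made-up rows instead of the project files:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "speaker": ["A", "B"],
    "question_tokenised_data": [["good", "morning"], ["thank", "you"]],
})

# Round-trip through CSV text; each list is written as a string such as "['good', 'morning']".
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
type_before = type(restored["question_tokenised_data"][0])  # str

# Parse the string representation back into Python lists.
restored["question_tokenised_data"] = restored["question_tokenised_data"].apply(ast.literal_eval)
type_after = type(restored["question_tokenised_data"][0])  # list

print(type_before.__name__, type_after.__name__)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.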