<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-27
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [None]:
import pandas as pd
import numpy as np
import nltk
import regex as re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from google.colab import drive

# Download the NLTK resources used by the preprocessor
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

`preprocessor` function: modifies the DataFrame `data` in place, adding two new columns (`col1` and `col2`) that hold the preprocessed text. It returns nothing.


Input:
  - `data`: the DataFrame to modify
  - `col`: name of the column that contains the text to clean
  - `col1`: name of the new column that will hold the tokenized text
  - `col2`: name of the new column that will hold the cleaned text

In [None]:
# Create a function to preprocess the data
def preprocessor(data, col, col1, col2):
  # Copy the source column into the two output columns
  data[col1] = data[col]
  data[col2] = data[col]

  # Shared resources
  stop_words = set(stopwords.words("english"))
  lemmatizer = WordNetLemmatizer()


  # Column 1: tokenized text
  # Lowercase the text
  data[col1] = data[col1].str.lower()

  # Remove stop words (matched on whitespace-separated words)
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  # Tokenize into words
  data[col1] = data[col1].apply(nltk.word_tokenize)

  # Remove purely numeric tokens
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  # Remove short tokens (two characters or fewer)
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word) > 2])

  # Strip every character that is not a lowercase letter
  data[col1] = data[col1].apply(lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  # Drop tokens left empty by the previous step
  data[col1] = data[col1].apply(lambda x: [word for word in x if word != ""])

  # Lemmatize
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])


  # Column 2: cleaned text
  # Lowercase the text
  data[col2] = data[col2].str.lower()

  # Remove stop words
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  # Remove selected punctuation; each value is stored as a one-element list
  data[col2] = data[col2].apply(lambda x: [re.sub(r"[.,'?]", "", x)])

  return
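
For quick checks without NLTK data or a DataFrame, the token-level steps of `preprocessor` can be approximated on a single string. The following is a minimal sketch, not the notebook's implementation: `SAMPLE_STOP_WORDS` and `clean_tokens` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, whitespace splitting replaces `word_tokenize`, and lemmatization is omitted.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption)
SAMPLE_STOP_WORDS = {"so", "i", "was", "to", "get", "your", "the", "on"}


def clean_tokens(text):
    """Approximate the tokenized-column steps for one string."""
    # Lowercase and drop stop words (matched on whitespace-separated words)
    words = [w for w in text.lower().split() if w not in SAMPLE_STOP_WORDS]
    # Drop purely numeric tokens
    words = [w for w in words if not w.isdigit()]
    # Drop tokens of two characters or fewer
    words = [w for w in words if len(w) > 2]
    # Strip everything except lowercase letters
    words = [re.sub(r"[^a-z]", "", w) for w in words]
    # Drop tokens that the previous step emptied
    return [w for w in words if w != ""]


print(clean_tokens("So, Jamie, I was hoping to get your view on the 2023 outlook: 1.5%?"))
```

This prints `['so', 'jamie', 'hoping', 'view', 'outlook']`. The word "so" survives because stop words are matched before punctuation is stripped, so "so," with its comma is not recognised as a stop word. The token "1.5%?" is reduced to an empty string by the symbol step, which is why empty tokens need to be dropped afterwards.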


# **Data**

In [None]:
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


JP Morgan QA section

In [None]:
#Defining qa_data
qa_data = pd.read_csv("/content/drive/MyDrive/jpmorgan_qna_df_preprocessed_ver7 .csv")
qa_data.head()

Unnamed: 0,filename,financial_quarter,call_date,speaker,marker,question_num,job_title,metadata,answer_cleaned,analyst,analyst_title,metadata_question,question_cleaned
0,1q23-earnings-transcript.pdf,1Q23,2023-04-14,Steven Chubak,Q,[],"Analyst, Wolfe Research LLC","So, Jamie, I was actually hoping to get your p...",['so jamie actually hoping get perspective see...,[],[],[],['so jamie actually hoping get perspective see...
1,1q23-earnings-transcript.pdf,1Q23,2023-04-14,Jamie Dimon,A,['0'],"Chairman & Chief Executive Officer, JPMorgan C...","Well, I think you were already kind of complet...",['well think already kind complete answering q...,"['Steven Chubak, Analyst, Wolfe Research LLC']","['Analyst, Wolfe Research LLC']","[""So, Jamie, I was actually hoping to get your...",['well think already kind complete answering q...
2,1q23-earnings-transcript.pdf,1Q23,2023-04-14,Jamie Dimon,A,['0'],"Chairman & Chief Executive Officer, JPMorgan C...","Well, we've told you that we're kind of pencil...",['well weve told were kind penciling $12 billi...,"['Steven Chubak, Analyst, Wolfe Research LLC']","['Analyst, Wolfe Research LLC']","[""So, Jamie, I was actually hoping to get your...",['well weve told were kind penciling $12 billi...
3,1q23-earnings-transcript.pdf,1Q23,2023-04-14,Ken Usdin,Q,[],"Analyst, Jefferies LLC","Hey, thanks. Good morning. Hey, Jeremy, I was ...",['hey thanks good morning hey jeremy wondering...,[],[],[],['hey thanks good morning hey jeremy wondering...
4,1q23-earnings-transcript.pdf,1Q23,2023-04-14,Jeremy Barnum,A,['1'],"Chief Financial Officer, JPMorgan Chase & Co.","Yeah, sure. So let me just summarize the drive...",['yeah sure let summarize drivers change outlo...,"['Ken Usdin, Analyst, Jefferies LLC']","['Analyst, Jefferies LLC']","['Hey, thanks. Good morning. Hey, Jeremy, I wa...",['yeah sure let summarize drivers change outlo...


In [None]:
# Preprocess the data
preprocessor(qa_data, "utterance", "question_tokenised_data", "question_cleaned")
preprocessor(qa_data, "utterance", "answer_tokenised_data", "answer_cleaned")

# Keep only rows with more than 20 tokens
qa_data["count"] = qa_data["question_tokenised_data"].apply(len)
qa_data = qa_data.loc[qa_data["count"] > 20]

# Reset the index
qa_data = qa_data.reset_index(drop=True)

# Reorganise the columns (copy so that later cells can assign into the result safely)
qa_data = qa_data[["filename","Quarter","Question","Question_cleaned","Analyst","Analyst Role","Response","Response_cleaned","Executive","Executive Role Type"]].copy()
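
The length filter uses a strict comparison, so a row with exactly 20 tokens is dropped. A minimal sketch of that boundary on plain Python lists; `MIN_TOKENS` and `long_enough` are hypothetical names introduced only for illustration:

```python
MIN_TOKENS = 20  # threshold taken from the notebook's filter


def long_enough(tokens, min_tokens=MIN_TOKENS):
    """Return True when a token list is strictly longer than the threshold."""
    return len(tokens) > min_tokens


# One row with exactly 20 tokens and one with 21
rows = [["word"] * 20, ["word"] * 21]
kept = [tokens for tokens in rows if long_enough(tokens)]
print(len(kept))
```

Only the 21-token row is kept.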

In [None]:
# Standardize roles
# Substitutions run in order, so longer titles come before the shorter titles they contain
role_patterns = [
  (r"Chairman & Chief Executive Officer, JPMorgan Chase", "CEO"),
  (r"Chief Executive Officer, JPMorgan Chase", "CEO"),
  (r"Vice Chairman, JPMorgan Chase", "Vice President"),
  (r"Head of Investor Relations, JPMorgan Chase", "Head of IR"),
  (r"Chief Financial Officer, JPMorgan Chase", "CFO"),
  (r"Chief Operating Officer, JPMorgan Chase", "COO"),
  (r"Chief Financial Officer, JPMorganChase", "CFO"),
  (r"Chairman & Chief Executive Officer, JPMorganChase", "CEO"),
]

def standardise_role(role):
  # Leave non-string values (such as NaN) untouched
  if not isinstance(role, str):
    return role
  for pattern, label in role_patterns:
    role = re.sub(pattern, label, role)
  return role

qa_data["Executive Role Type"] = qa_data["Executive Role Type"].apply(standardise_role)
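
The substitutions above replace only the matched text, so a title such as "Chief Financial Officer, JPMorgan Chase & Co." becomes "CFO & Co." rather than "CFO". If a bare label is wanted, one option is to match on the start of the title and return the label alone. This is a sketch of that alternative, not the notebook's behaviour; `ROLE_PREFIXES` and `role_label` are hypothetical names, and the labels are copied from the cell above.

```python
import re

# Hypothetical prefix-to-label table mirroring the notebook's role labels
ROLE_PREFIXES = [
    (r"Chairman & Chief Executive Officer\b", "CEO"),
    (r"Chief Executive Officer\b", "CEO"),
    (r"Chief Financial Officer\b", "CFO"),
    (r"Chief Operating Officer\b", "COO"),
    (r"Vice Chairman\b", "Vice President"),
    (r"Head of Investor Relations\b", "Head of IR"),
]


def role_label(title):
    """Return a short label when the title starts with a known role."""
    if not isinstance(title, str):
        return title  # leave missing values untouched
    for prefix, label in ROLE_PREFIXES:
        # re.match anchors at the start of the string
        if re.match(prefix, title):
            return label
    return title  # unknown titles pass through unchanged


print(role_label("Chief Financial Officer, JPMorgan Chase & Co."))
```

This prints `CFO`. On the notebook's data it could be applied with `qa_data["Executive Role Type"].apply(role_label)`.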


In [None]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,tokenised_data,cleaned_data,question_tokenised_data,question_cleaned,answer_tokenised_data,answer_cleaned,count,question_number_inline
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorgan Chase & Co.","Yeah. A couple things there. So, first of all,...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"['yeah', 'couple', 'things', 'there', 'first',...",['yeah couple things there so first all know r...,"[yeah, couple, thing, there, first, all, know,...",[yeah couple things there so first all know ri...,"[yeah, couple, thing, there, first, all, know,...",[yeah couple things there so first all know ri...,55,
1,Jeremy Barnum,A,"Chief Financial Officer, JPMorgan Chase & Co.","Yeah. And we always say, right, we underwrite ...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"['yeah', 'always', 'say', 'right', 'underwrite...",['yeah always say right underwrite cycle think...,"[yeah, always, say, right, underwrite, cycle, ...",[yeah always say right underwrite cycle think ...,"[yeah, always, say, right, underwrite, cycle, ...",[yeah always say right underwrite cycle think ...,44,
2,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","No, fair – all fair points. And maybe just a f...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"['fair', 'fair', 'points', 'maybe', 'followup'...",['no fair – fair points maybe follow-up johns ...,"[fair, fair, point, maybe, followup, john, que...",[no fair – fair points maybe follow-up johns q...,"[fair, fair, point, maybe, followup, john, que...",[no fair – fair points maybe follow-up johns q...,31,0.0
3,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorgan C...","If I add, I would say, categorically, there's ...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"['add', 'would', 'say', 'categorically', 'ther...",['add would say categorically theres pricing p...,"[add, would, say, categorically, there, pricin...",[add would say categorically theres pricing po...,"[add, would, say, categorically, there, pricin...",[add would say categorically theres pricing po...,47,
4,Jeremy Barnum,A,"Chief Financial Officer, JPMorgan Chase & Co.","Yeah. So a few things on there, Gerard. So we ...",1q23-earnings-transcript.pdf,1Q23,2023-04-14,"['yeah', 'things', 'there', 'gerard', 'previou...",['yeah things there gerard previously said tar...,"[yeah, thing, there, gerard, previously, said,...",[yeah things there gerard previously said targ...,"[yeah, thing, there, gerard, previously, said,...",[yeah things there gerard previously said targ...,106,


In [None]:
# Check for null values
print(f"Null values per column:\n{qa_data.isnull().sum()}")

JP Morgan management discussion

In [None]:
# Define the JP Morgan management discussion DataFrame
jpmorgan_body_df=pd.read_csv("jpmorgan_management_discussion.csv")
jpmorgan_body_df.head()

In [None]:
#Cleaning transcript
preprocessor(jpmorgan_body_df, "chunk_text", "tokenized_data","cleaned_data")

In [None]:
jpmorgan_body_df.head()

# **Export the output as a csv file**

JP Morgan QA section

In [None]:
# Export the preprocessed data
preprocessed_qa_csv_path1 = "/content/drive/MyDrive/bank_of_england/data/preprocessed_data/jpmorgan_qna_df_preprocessed_ver7.csv"
qa_data.to_csv(preprocessed_qa_csv_path1, index=False)
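
`DataFrame.to_csv` does not create missing directories, so writing to a nested Drive path fails if the folder is absent. A minimal sketch of a guard using only the standard library; `ensure_parent_dir` is a hypothetical helper, and the demonstration writes to a temporary directory rather than to Google Drive:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a temporary directory rather than on Google Drive
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "preprocessed_data" / "example.csv")
    parent_exists = target.parent.is_dir()

print(parent_exists)
```

In the notebook the helper would wrap the export path, for example `qa_data.to_csv(ensure_parent_dir(preprocessed_qa_csv_path1), index=False)`.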

JP Morgan management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path2 = "/content/sample_data/jpmorgan_management_df_preprocessed.csv"
jpmorgan_body_df.to_csv(preprocessed_qa_csv_path2, index=False)
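
Because the tokenized and cleaned columns hold Python lists, `to_csv` writes them as their string representation, and reading the file back yields strings such as `"['yeah', 'couple']"` rather than lists. A minimal sketch of recovering them, assuming the stored strings are valid Python literals; `parse_list_cell` is a hypothetical helper name:

```python
import ast


def parse_list_cell(cell):
    """Turn a string such as "['yeah', 'couple']" back into a list."""
    if not isinstance(cell, str):
        return cell  # already a list, or a missing value
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell  # not a Python literal, keep the raw text
    # Only accept lists; other literals are returned as the original string
    return value if isinstance(value, list) else cell


print(parse_list_cell("['yeah', 'couple', 'thing']"))
```

When reloading an exported file, the helper could be passed to `pd.read_csv` through its `converters` argument, for example `converters={"tokenized_data": parse_list_cell}`.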