<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing_jpmorgan_gp4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [100]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing Excel files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [101]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from google.colab import drive


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

preprocessor: Modifies the DataFrame `data` in place by adding two new columns, `col1` (tokenised text) and `col2` (cleaned text), both derived from `col`.


Input:
  - data: the DataFrame to modify
  - col: name of the existing column containing the text to clean
  - col1: name of the new column that receives the tokenised text
  - col2: name of the new column that receives the cleaned text

In [102]:
#create function to preprocess data
def preprocessor (data, col, col1, col2):
  #Copy column
  data[col2]=data[col]
  data[col1]=data[col]

  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #Remove empty tokens
  data[col1] = data[col1].apply(lambda x: [word for word in x if word != ""])

  #remove short word
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #remove symbols
  data[col1] = data[col1].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #Adding column2
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Remove punctuation (. , ' ?); each cleaned string is stored in a one-element list
  data[col2] = data[col2].apply (lambda x: [re.sub(r"[.,'?]", "", x)])

  return
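
The two branches of `preprocessor` can be traced on a single string without pandas or NLTK. The following is a minimal sketch under stated assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list, whitespace splitting stands in for `word_tokenize`, and `clean_text` and `tokenise_text` are illustrative names that do not appear in the notebook. Unlike the function above, the sketch strips symbols before the length filter so that no empty strings survive, and it returns the cleaned text as a plain string rather than a one-element list.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"i", "to", "the", "a", "about", "and", "is", "it"}


def clean_text(text):
    """Approximate the col2 branch: lower-case, drop stop words, strip . , ' ?"""
    words = [w for w in str(text).lower().split() if w not in STOP_WORDS]
    return re.sub(r"[.,'?]", "", " ".join(words))


def tokenise_text(text):
    """Approximate the col1 branch with whitespace tokens (not word_tokenize)."""
    words = [w for w in str(text).lower().split() if w not in STOP_WORDS]
    # Drop pure numbers, then keep only the letters a-z in each token.
    tokens = [re.sub(r"[^a-z]", "", w) for w in words if not w.isdigit()]
    # Length filter runs last, so tokens emptied by the step above are removed too.
    return [t for t in tokens if len(t) > 2]


sample = "Hi. I wanted to ask about the 2024 capital plan?"
print(clean_text(sample))     # hi wanted ask 2024 capital plan
print(tokenise_text(sample))  # ['wanted', 'ask', 'capital', 'plan']
```

The cleaned text keeps numbers and short words such as "hi", while the token list drops both, which matches the difference visible between the `*_cleaned` and `*_tokenised_data` columns in the outputs further down.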


find_row: Searches upwards from `current_row_num` in the DataFrame `df` for the nearest row whose value in the column passed as `col` (for example "M") is "A", and returns that row's index. If no such row is found, it returns 0. Row 0 itself is never checked, so a return value of 0 does not guarantee a match.

In [103]:
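def find_row(df, col, current_row_num):
  #Start at the row directly above current_row_num and walk upwards
  i = current_row_num - 1
  while i > 0:
    if df[col][i] == "A":
      break
    i -= 1
  #i is 0 here when no matching row was found (row 0 is not checked)
  return i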

find_row_empty: This function searches upwards from the given current_row_num to find the first row where col1 has the value "A" and col2 is not an empty list. It returns the index of that row.
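
No code cell for `find_row_empty` is included here, so the following is only a sketch of what the description suggests, not the original implementation. It assumes the same conventions as `find_row`: the search starts one row above `current_row_num`, row 0 is not checked, and 0 is returned when nothing matches. The column names "M" and "tokens" in the demo are hypothetical, and a plain dict of lists stands in for a DataFrame with a default integer index.

```python
def find_row_empty(df, col1, col2, current_row_num):
    """Return the nearest row above current_row_num where col1 == "A"
    and col2 holds a non-empty list; return 0 if none is found."""
    i = current_row_num - 1
    while i > 0:
        if df[col1][i] == "A" and len(df[col2][i]) > 0:
            break
        i -= 1
    return i


# A plain dict of lists stands in for a DataFrame with a default integer index.
demo = {
    "M": ["Q", "A", "A", "Q", "Q"],
    "tokens": [["hello"], ["rates", "outlook"], [], ["thanks"], ["next"]],
}

# Row 2 has "A" but an empty token list, so the search continues up to row 1.
print(find_row_empty(demo, "M", "tokens", 4))  # 1
```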

# **Data**

In [104]:
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


JP Morgan QA section

In [105]:
#Defining qa_data
qa_data = pd.read_excel("/content/drive/MyDrive/bank_of_england/data/preprocessed_data/JPMorgan_QNA_processed_data.xlsx")
qa_data.head()

Unnamed: 0,Index,Quarter-Year,Asked By,Role of the person Asked the question,Question,Answered By,Role of the person answered the question,Answer
0,1,4Q24,John McDonald,"Analyst, Truist Securities, Inc.","Hi. Good morning. Jeremy, I wanted to ask abou...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Yeah. Good question, John, and welcome back, b..."
1,2,4Q24,Mike Mayo,"Analyst, Wells Fargo Securities LLC","Hi. Simple and then more difficult, I guess. J...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorganChase",I do love what I do. And answering the second ...
2,3,4Q24,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'..."
3,4,4Q24,Erika Najarian,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m..."
4,5,4Q24,Erika,Unknown,"Does that conclude your question, Erika?",Jeremy Barnum,"Chief Financial Officer, JPMorganChase",Very good. We can go to the next question. Tha...


In [106]:
#preprocessing data
preprocessor(qa_data, "Question", "question_tokenised_data", "question_cleaned")
preprocessor(qa_data,"Answer","answer_tokenised_data","answer_cleaned")

In [107]:
#Keep only questions with more than 10 tokens
qa_data["count"] = qa_data["question_tokenised_data"].apply(lambda x: len(x))
qa_data = qa_data.loc[qa_data["count"]>10]

In [108]:
#reset index
qa_data.reset_index(drop=True, inplace=True)
qa_data.head()

Unnamed: 0,Index,Quarter-Year,Asked By,Role of the person Asked the question,Question,Answered By,Role of the person answered the question,Answer,question_cleaned,question_tokenised_data,answer_cleaned,answer_tokenised_data,count
0,1,4Q24,John McDonald,"Analyst, Truist Securities, Inc.","Hi. Good morning. Jeremy, I wanted to ask abou...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Yeah. Good question, John, and welcome back, b...",[hi good morning jeremy wanted ask capital kno...,"[good, morning, jeremy, wanted, ask, capital, ...",[yeah good question john welcome back way so y...,"[yeah, good, question, john, welcome, back, wa...",76
1,2,4Q24,Mike Mayo,"Analyst, Wells Fargo Securities LLC","Hi. Simple and then more difficult, I guess. J...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorganChase",I do love what I do. And answering the second ...,[hi simple difficult guess jamie whos successo...,"[simple, difficult, guess, jamie, who, success...",[love do answering second question first look ...,"[love, answering, second, question, first, loo...",91
2,3,4Q24,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",[hey good morning maybe regulation new adminis...,"[hey, good, morning, maybe, regulation, new, a...",[hey jim mean obviously something thinking lot...,"[hey, jim, mean, obviously, something, thinkin...",66
3,4,4Q24,Erika Najarian,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",[yes hi good morning wanted follow questions c...,"[yes, good, morning, wanted, follow, questions...",[right erika okay tempting many rabbit holes d...,"[right, erika, okay, tempting, many, rabbit, h...",141
4,6,4Q24,Matt O'Connor,"Analyst, Deutsche Bank Securities, Inc.",Good morning. It seems like you guys have back...,Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Yeah, it's a good question, Matt. I guess mayb...",[good morning seems like guys backed view mate...,"[good, morning, seems, like, guys, backed, vie...",[yeah good question matt guess maybe let try f...,"[yeah, good, question, matt, guess, maybe, let...",78


In [109]:
#Remove tokenised columns
qa_data = qa_data.drop(columns=["question_tokenised_data", "answer_tokenised_data", "count"])


In [110]:
#rename column names
qa_data.rename(columns={"Quarter-Year":"Quarter", "Asked By":"Analyst", "Role of the person Asked the question":"Analyst Role","Answer":"Response", "Answered By": "Executive", "Role of the person answered the question":"Executive Role Type" }, inplace=True)


In [111]:
qa_data.head()

Unnamed: 0,Index,Quarter,Analyst,Analyst Role,Question,Executive,Executive Role Type,Response,question_cleaned,answer_cleaned
0,1,4Q24,John McDonald,"Analyst, Truist Securities, Inc.","Hi. Good morning. Jeremy, I wanted to ask abou...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Yeah. Good question, John, and welcome back, b...",[hi good morning jeremy wanted ask capital kno...,[yeah good question john welcome back way so y...
1,2,4Q24,Mike Mayo,"Analyst, Wells Fargo Securities LLC","Hi. Simple and then more difficult, I guess. J...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorganChase",I do love what I do. And answering the second ...,[hi simple difficult guess jamie whos successo...,[love do answering second question first look ...
2,3,4Q24,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",[hey good morning maybe regulation new adminis...,[hey jim mean obviously something thinking lot...
3,4,4Q24,Erika Najarian,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",[yes hi good morning wanted follow questions c...,[right erika okay tempting many rabbit holes d...
4,6,4Q24,Matt O'Connor,"Analyst, Deutsche Bank Securities, Inc.",Good morning. It seems like you guys have back...,Jeremy Barnum,"Chief Financial Officer, JPMorganChase","Yeah, it's a good question, Matt. I guess mayb...",[good morning seems like guys backed view mate...,[yeah good question matt guess maybe let try f...


In [112]:
# Standardize Roles
for i in range(len(qa_data)):
  if isinstance(qa_data.loc[i, "Executive Role Type"], str):
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Chairman & Chief Executive Officer, JPMorgan Chase & Co.","CEO", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Chief Executive Officer, JPMorgan Chase & Co.","CEO", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Vice Chairman, JPMorgan Chase & Co.","Vice President", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Managing Director$","Managing Director", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Head of Investor Relations, JPMorgan Chase & Co.","Head of IR", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Chief Financial Officer, JPMorgan Chase & Co.","CFO", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Chief Operating Officer, JPMorgan Chase & Co.","COO", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Chief Financial Officer, JPMorganChase","CFO", qa_data.loc[i, "Executive Role Type"])
    qa_data.loc[i, "Executive Role Type"] = re.sub(r"Chairman & Chief Executive Officer, JPMorganChase","CEO", qa_data.loc[i, "Executive Role Type"])
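
The same standardisation can be expressed as a lookup table, which keeps every title and its short label in one place. This is a minimal sketch rather than the notebook's method: `ROLE_MAP` and `standardise_role` are hypothetical names, the titles are copied from the loop above, and `str.replace` is swapped in for `re.sub` so that characters such as "." are matched literally.

```python
# Titles copied from the loop above; the short labels are the ones it assigns.
ROLE_MAP = {
    "Chairman & Chief Executive Officer, JPMorgan Chase & Co.": "CEO",
    "Chairman & Chief Executive Officer, JPMorganChase": "CEO",
    "Chief Executive Officer, JPMorgan Chase & Co.": "CEO",
    "Chief Financial Officer, JPMorgan Chase & Co.": "CFO",
    "Chief Financial Officer, JPMorganChase": "CFO",
    "Chief Operating Officer, JPMorgan Chase & Co.": "COO",
    "Head of Investor Relations, JPMorgan Chase & Co.": "Head of IR",
}


def standardise_role(role):
    """Replace any known long title with its short label; pass non-strings through."""
    if not isinstance(role, str):
        return role
    # Longest titles first, so a title that contains a shorter one is matched whole.
    for title in sorted(ROLE_MAP, key=len, reverse=True):
        role = role.replace(title, ROLE_MAP[title])
    return role


print(standardise_role("Chief Financial Officer, JPMorganChase"))                    # CFO
print(standardise_role("Chairman & Chief Executive Officer, JPMorgan Chase & Co."))  # CEO
```

On the DataFrame, a function like this could be applied in one step with `qa_data["Executive Role Type"].map(standardise_role)` instead of looping over row positions.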


In [113]:
qa_data.head()

Unnamed: 0,Index,Quarter,Analyst,Analyst Role,Question,Executive,Executive Role Type,Response,question_cleaned,answer_cleaned
0,1,4Q24,John McDonald,"Analyst, Truist Securities, Inc.","Hi. Good morning. Jeremy, I wanted to ask abou...",Jeremy Barnum,CFO,"Yeah. Good question, John, and welcome back, b...",[hi good morning jeremy wanted ask capital kno...,[yeah good question john welcome back way so y...
1,2,4Q24,Mike Mayo,"Analyst, Wells Fargo Securities LLC","Hi. Simple and then more difficult, I guess. J...",Jamie Dimon,CEO,I do love what I do. And answering the second ...,[hi simple difficult guess jamie whos successo...,[love do answering second question first look ...
2,3,4Q24,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",Jeremy Barnum,CFO,"Hey, Jim. I mean, it's obviously something we'...",[hey good morning maybe regulation new adminis...,[hey jim mean obviously something thinking lot...
3,4,4Q24,Erika Najarian,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",Jeremy Barnum,CFO,"Right, Erika. Okay. You are tempting me with m...",[yes hi good morning wanted follow questions c...,[right erika okay tempting many rabbit holes d...
4,6,4Q24,Matt O'Connor,"Analyst, Deutsche Bank Securities, Inc.",Good morning. It seems like you guys have back...,Jeremy Barnum,CFO,"Yeah, it's a good question, Matt. I guess mayb...",[good morning seems like guys backed view mate...,[yeah good question matt guess maybe let try f...


In [114]:
#Split the Analyst Role column into role and organisation
qa_data["Organisation"] = qa_data["Analyst Role"].apply(lambda x: x.split(",")[1])
qa_data["Analyst Role"] = qa_data["Analyst Role"].apply(lambda x: x.split(",")[0])
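
Splitting on every comma and taking the second piece has two side effects: an organisation such as "Truist Securities, Inc." loses its "Inc." suffix and keeps a leading space, and a value without a comma (such as "Unknown") raises an `IndexError`. A possible alternative, sketched here with a hypothetical helper `split_role`, splits only at the first comma and trims whitespace.

```python
def split_role(value):
    """Split 'Role, Organisation' at the first comma only.

    Returns (role, organisation); organisation is '' when there is no comma.
    """
    role, _, organisation = str(value).partition(",")
    return role.strip(), organisation.strip()


print(split_role("Analyst, Truist Securities, Inc."))  # ('Analyst', 'Truist Securities, Inc.')
print(split_role("Unknown"))                           # ('Unknown', '')
```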

In [115]:
#Relocate organisation column
col=qa_data.pop("Organisation")
qa_data.insert(5, col.name, col)

In [116]:
qa_data.isnull().sum()

Unnamed: 0,0
Index,0
Quarter,0
Analyst,0
Analyst Role,0
Question,0
Organisation,0
Executive,0
Executive Role Type,0
Response,11
question_cleaned,0
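
The 11 missing values in Response are worth a second look because `preprocessor` calls `str(x)` on each cell. Assuming the empty cells were loaded as float NaN, which is how pandas usually represents them, `str(x)` turns each one into the literal text "nan" in the cleaned column. The sketch below illustrates the effect and one possible guard; `is_missing` and `clean_or_blank` are hypothetical helpers, not part of the notebook.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the usual marker for an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_or_blank(value):
    """Lower-case the text, or return an empty string for a missing value."""
    return "" if is_missing(value) else str(value).lower()


print(str(float("nan")).lower())            # nan  (a missing answer becomes a word)
print(repr(clean_or_blank(float("nan"))))   # ''
print(clean_or_blank("Very good."))         # very good.
```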


JP Morgan Management Discussion

In [None]:
#Defining the JPMorgan management discussion dataframe
jpmorgan_body_df=pd.read_excel("/content/drive/MyDrive/bank_of_england/data/preprocessed_data/JP Mogran processed thru OpenAI/JPMorgan_Management_Discussion_processed_data.xlsx")
jpmorgan_body_df.head()

Unnamed: 0,Index,Quarter-Year,Text
0,,4Q24,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...
1,,3Q24,MANAGEMENT DISCUSSION SECTION \n \n...
2,,2Q24,MANAGEMENT DISCUSSION SECTION \n \n...
3,,1Q24,MANAGEMENT DISCUSSION SECTION \n \n...
4,,4Q23,MANAGEMENT DISCUSSION SECTION \n \n...


In [None]:
#Drop the Index column because it contains no values
jpmorgan_body_df.drop(columns=["Index"], inplace=True)
jpmorgan_body_df.head()

Unnamed: 0,Quarter-Year,Text
0,4Q24,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...
1,3Q24,MANAGEMENT DISCUSSION SECTION \n \n...
2,2Q24,MANAGEMENT DISCUSSION SECTION \n \n...
3,1Q24,MANAGEMENT DISCUSSION SECTION \n \n...
4,4Q23,MANAGEMENT DISCUSSION SECTION \n \n...


In [None]:
#cleaning text
preprocessor(jpmorgan_body_df, "Text", "text_tokenised_data", "text_cleaned")

In [None]:
jpmorgan_body_df.head()


Unnamed: 0,Quarter-Year,Text,text_tokenised_data,text_cleaned
0,4Q24,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...,"[management, discussion, section, operator, go...",[management discussion section operator : good...
1,3Q24,MANAGEMENT DISCUSSION SECTION \n \n...,"[management, discussion, section, operator, go...",[management discussion section operator : good...
2,2Q24,MANAGEMENT DISCUSSION SECTION \n \n...,"[management, discussion, section, operator, go...",[management discussion section operator : good...
3,1Q24,MANAGEMENT DISCUSSION SECTION \n \n...,"[management, discussion, section, operator, go...",[management discussion section operator : good...
4,4Q23,MANAGEMENT DISCUSSION SECTION \n \n...,"[management, discussion, section, operator, go...",[management discussion section operator : good...


In [None]:
jpmorgan_body_df.isnull().sum()

Unnamed: 0,0
Quarter-Year,0
Text,0
text_tokenised_data,0
text_cleaned,0


# **Export the output as a CSV file**

JP Morgan QA section

In [118]:
##export preprocessed data
#preprocessed_qa_csv_path1 = "/content/drive/MyDrive/bank_of_england/data/preprocessed_data/jpmorgan_qna_df_preprocessed_ver9.csv"
#qa_data.to_csv(preprocessed_qa_csv_path1, index=False)
qa_data.to_csv("preprocessed_jp_qna.csv", index=False)

JP Morgan Management Discussion

In [None]:
#export preprocessed data
#preprocessed_qa_csv_path2 = "/content/drive/MyDrive/bank_of_england/data/preprocessed_data/jpmorgan_management_df_preprocessed_ver8.csv"
#jpmorgan_body_df.to_csv(preprocessed_qa_csv_path2, index=False)