<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing_jpmorgan_gp4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [52]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing Excel and CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [53]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from google.colab import drive


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

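`preprocessor`: modifies the DataFrame `data` in place, adding two new columns (`col1` and `col2`) that hold preprocessed versions of the text in `col`. It returns nothing.


Input:
  - `data`: the DataFrame to modify
  - `col`: name of the existing column that contains the text to clean
  - `col1`: name of the new column that will hold the tokenized, lemmatized words
  - `col2`: name of the new column that will hold the cleaned text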

In [54]:
#create function to preprocess data
def preprocessor (data, col, col1,col2):
  #Copy the source column into both new columns
  data[col1]=data[col]
  data[col2]=data[col]


  #Adding column1
  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(nltk.word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #drop empty tokens
  data[col1] = data[col1].apply(lambda x: [word for word in x if word!=""])

  #remove short word
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #remove symbols
  data[col1] = data[col1].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #lemmatization
  lemmatizer = WordNetLemmatizer()
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



  #Adding column2
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #remove symbols (note: the cleaned string is wrapped in a one-element list)
  data[col2] = data[col2].apply (lambda x: [re.sub(r"[.,'?]", "", x)])

  return
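
The token-level steps above can be sketched on a single string without NLTK. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list standing in for NLTK's English stop words, `clean_tokens` is a hypothetical helper, whitespace splitting replaces `word_tokenize`, and lemmatization is omitted. It also shows that stripping non-letters after the length filter can leave empty strings, so the sketch filters them out at the end.

```python
import re

# Hypothetical, tiny stop-word list for illustration only
# (the notebook itself uses NLTK's English stop words).
STOP_WORDS = {"the", "was", "to", "a", "of"}


def clean_tokens(text):
    """Lower-case, drop stop words, digits and short tokens, then strip non-letters."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    words = [w for w in words if not w.isdigit()]
    words = [w for w in words if len(w) > 2]
    words = [re.sub(r"[^a-z]", "", w) for w in words]
    # Tokens such as "13.8%" or "..." are now empty strings; drop them.
    return [w for w in words if w]


print(clean_tokens("The CET1 ratio was 13.8% ... up 50 bps"))
# ['cet', 'ratio', 'bps']
```

Without the final filter, the result for this input would contain two empty strings, which would also inflate any token count computed from the list.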


`find_row`: searches upwards from `current_row_num` in the DataFrame `df` for the nearest earlier row whose value in column `col` is "A", and returns that row's index. Row 0 itself is never checked; if no match is found, the function returns 0.

In [55]:
def find_row (df, col, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col][i] == "A":
      break
    else:
      i-=1
  return i
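
To make the return value concrete, here is a plain-list sketch of the same upward search. `find_previous_marker` and the sample `markers` list are hypothetical stand-ins for `find_row` and a DataFrame column. Because the loop only runs while `i > 0`, a result of 0 can mean either "no match" or simply "stopped at the top", so a caller may want to check row 0 separately.

```python
def find_previous_marker(markers, current_row_num, target="A"):
    """Walk upwards from current_row_num - 1 and return the index of the first match."""
    i = current_row_num - 1
    while i > 0:
        if markers[i] == target:
            break
        i -= 1
    return i


markers = ["Q", "A", "A", "Q", "A"]
print(find_previous_marker(markers, 4))  # 2: nearest "A" above row 4
print(find_previous_marker(markers, 1))  # 0: row 0 holds "Q", yet 0 is returned
```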

`find_row_empty`: searches upwards from `current_row_num` for the nearest earlier row where `col1` has the value "A" and `col2` is not an empty list, and returns that row's index. As with `find_row`, it returns 0 if no such row is found.

In [56]:
def find_row_empty (df, col1, col2, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col1][i] == "A" and df[col2][i] != []:
      break
    else:
      i-=1
  return i

In [57]:
def create_ques_num_column (data, new_col,marker_col):
  #Create the question number column, initially empty
  data[new_col]=None
  #local counter holding the next question number (starts at 0)
  num = 0
  #rows marked "Q" get the current number, then the counter increases; other rows stay None
  for i in data.index:
    if data.loc[i,marker_col]=="Q":
      data.at[i,new_col]=num
      num +=1
    else:
      continue
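
As a rough illustration of the numbering this produces, the same logic can be written over a plain list. `number_questions` is a hypothetical helper, not part of the notebook; note that numbering starts at 0 and that non-question rows are left as `None`.

```python
def number_questions(markers):
    """Return a list with a running number for each "Q" entry and None elsewhere."""
    numbers = []
    num = 0
    for marker in markers:
        if marker == "Q":
            numbers.append(num)
            num += 1
        else:
            numbers.append(None)
    return numbers


print(number_questions(["Q", "A", "A", "Q", "A"]))
# [0, None, None, 1, None]
```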

In [58]:
# Function to extract names
def extract_name(full_string):
    return full_string.split(',')[0]
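
A short usage sketch, assuming the function is meant for comma-separated values such as those in the "Answered By" and role columns shown further down; how it is actually applied in the project is not shown in this notebook. The function is repeated here so the snippet runs on its own.

```python
def extract_name(full_string):
    # Keep only the text before the first comma.
    return full_string.split(',')[0]


print(extract_name("Jeremy Barnum, Jamie Dimon"))   # Jeremy Barnum
print(extract_name("Analyst, Wolfe Research LLC"))  # Analyst
print(extract_name("Jamie Dimon"))                  # Jamie Dimon (no comma: unchanged)
```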

In [59]:
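#Search upwards from the current location for the nearest "Q" row and return its index (0 if none)
#Note: the inner loop walks past the "A" rows above that "Q", but its result j is not used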
def find_last_a (df, col, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col][i] == "Q":
      j = i-1
      while j > 0:
        if df[col][j]=="A":
          pass
        else:
          break
        j-=1
      break
    else:
      i-=1
  return i
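
Reading the function as written, the inner loop's counter `j` is never used after the loop, so the function appears to return the index of the nearest earlier "Q" row (or 0). The sketch below checks that reading on a plain list; both helper names are hypothetical, and whether this behaviour matches the author's intent is an open question.

```python
def find_last_a_list(markers, current_row_num):
    """Close list-based port of find_last_a."""
    i = current_row_num - 1
    while i > 0:
        if markers[i] == "Q":
            j = i - 1
            while j > 0:
                if markers[j] != "A":
                    break
                j -= 1
            break  # j is discarded here
        i -= 1
    return i


def nearest_q(markers, current_row_num):
    """Simplified search that leaves out the inner loop entirely."""
    i = current_row_num - 1
    while i > 0 and markers[i] != "Q":
        i -= 1
    return i


markers = ["Q", "A", "A", "Q", "A", "A"]
# Both versions agree for every starting row of this sample.
for row in range(1, len(markers) + 1):
    assert find_last_a_list(markers, row) == nearest_q(markers, row)
print(find_last_a_list(markers, 5))  # 3
```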

# **Data**

In [60]:
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


JP Morgan QA section

In [61]:
#Defining qa_data
qa_data = pd.read_excel("/content/drive/MyDrive/bank_of_england/data/preprocessed_data/JP Mogran processed thru OpenAI/JPMorgan_QNA_Consildated_processed_data.xlsx")
qa_data.head()

Unnamed: 0,Index,Quarter-Year,Question,Asked By,Role of the person asked the question,Answer,Answered By,Role of the person answered the question
0,1,1Q23,"So, Jamie, I was actually hoping to get your p...",Steven Chubak,"Analyst, Wolfe Research LLC","Well, I think you were already kind of complet...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorgan C..."
1,2,1Q23,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",Ken Usdin,"Analyst, Jefferies LLC","Yeah, sure. So let me just summarize the drive...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
2,3,1Q23,"Hi, thanks. Jeremy, wanted to follow up again ...",John McDonald,"Analyst, Autonomous Research","Yeah. John, it's a really good question, and w...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
3,4,1Q23,My first question is you mentioned that your r...,Erika Najarian,"Analyst, UBS Securities LLC","Yeah. So, Erika, as you know, we take \n not g...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
4,5,1Q23,Hey. Good morning. Maybe just a little bit on ...,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Yeah. A couple things there. So, first of all,...","Jeremy Barnum, Jamie Dimon","Chief Financial Officer, JPMorgan Chase & Co.;..."


In [62]:
#preprocessing data
preprocessor(qa_data, "Question", "question_tokenised_data", "question_cleaned")
preprocessor(qa_data,"Answer","answer_tokenised_data","answer_cleaned")

In [63]:
#remove operator
#qa_data = qa_data.loc[qa_data["speaker"]!="Operator"]

#count the tokens in each question (used below to drop very short questions)
qa_data["count"] = qa_data["question_tokenised_data"].apply(len)


In [64]:
#qa_data.to_csv("qa_data.csv")

In [65]:
#keep only questions with more than 5 tokens
qa_data = qa_data.loc[qa_data["count"]>5]
qa_data.head()

Unnamed: 0,Index,Quarter-Year,Question,Asked By,Role of the person asked the question,Answer,Answered By,Role of the person answered the question,question_tokenised_data,question_cleaned,answer_tokenised_data,answer_cleaned,count
0,1,1Q23,"So, Jamie, I was actually hoping to get your p...",Steven Chubak,"Analyst, Wolfe Research LLC","Well, I think you were already kind of complet...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorgan C...","[jamie, actually, hoping, get, perspective, se...",[so jamie actually hoping get perspective see ...,"[well, think, already, kind, complete, answeri...",[well think already kind complete answering qu...,54
1,2,1Q23,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",Ken Usdin,"Analyst, Jefferies LLC","Yeah, sure. So let me just summarize the drive...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.","[hey, thanks, good, morning, hey, jeremy, wond...",[hey thanks good morning hey jeremy wondering ...,"[yeah, sure, let, summarize, driver, change, o...",[yeah sure let summarize drivers change outloo...,35
2,3,1Q23,"Hi, thanks. Jeremy, wanted to follow up again ...",John McDonald,"Analyst, Autonomous Research","Yeah. John, it's a really good question, and w...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.","[thanks, jeremy, wanted, follow, driver, nii, ...",[hi thanks jeremy wanted follow drivers nii re...,"[yeah, john, really, good, question, obviously...",[yeah john really good question obviously thou...,45
3,4,1Q23,My first question is you mentioned that your r...,Erika Najarian,"Analyst, UBS Securities LLC","Yeah. So, Erika, as you know, we take \n not g...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.","[first, question, mentioned, reserve, build, d...",[first question mentioned reserve build driven...,"[yeah, erika, know, take, going, lot, detail, ...",[yeah so erika know take going go lot detail t...,17
4,5,1Q23,Hey. Good morning. Maybe just a little bit on ...,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Yeah. A couple things there. So, first of all,...","Jeremy Barnum, Jamie Dimon","Chief Financial Officer, JPMorgan Chase & Co.;...","[hey, good, morning, maybe, little, bit, depos...",[hey good morning maybe little bit deposit tho...,"[yeah, couple, thing, there, first, all, know,...",[yeah couple things there so first all know ri...,31


In [66]:
#reset index
qa_data.reset_index(drop=True, inplace=True)
qa_data.head()

Unnamed: 0,Index,Quarter-Year,Question,Asked By,Role of the person asked the question,Answer,Answered By,Role of the person answered the question,question_tokenised_data,question_cleaned,answer_tokenised_data,answer_cleaned,count
0,1,1Q23,"So, Jamie, I was actually hoping to get your p...",Steven Chubak,"Analyst, Wolfe Research LLC","Well, I think you were already kind of complet...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorgan C...","[jamie, actually, hoping, get, perspective, se...",[so jamie actually hoping get perspective see ...,"[well, think, already, kind, complete, answeri...",[well think already kind complete answering qu...,54
1,2,1Q23,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",Ken Usdin,"Analyst, Jefferies LLC","Yeah, sure. So let me just summarize the drive...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.","[hey, thanks, good, morning, hey, jeremy, wond...",[hey thanks good morning hey jeremy wondering ...,"[yeah, sure, let, summarize, driver, change, o...",[yeah sure let summarize drivers change outloo...,35
2,3,1Q23,"Hi, thanks. Jeremy, wanted to follow up again ...",John McDonald,"Analyst, Autonomous Research","Yeah. John, it's a really good question, and w...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.","[thanks, jeremy, wanted, follow, driver, nii, ...",[hi thanks jeremy wanted follow drivers nii re...,"[yeah, john, really, good, question, obviously...",[yeah john really good question obviously thou...,45
3,4,1Q23,My first question is you mentioned that your r...,Erika Najarian,"Analyst, UBS Securities LLC","Yeah. So, Erika, as you know, we take \n not g...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.","[first, question, mentioned, reserve, build, d...",[first question mentioned reserve build driven...,"[yeah, erika, know, take, going, lot, detail, ...",[yeah so erika know take going go lot detail t...,17
4,5,1Q23,Hey. Good morning. Maybe just a little bit on ...,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Yeah. A couple things there. So, first of all,...","Jeremy Barnum, Jamie Dimon","Chief Financial Officer, JPMorgan Chase & Co.;...","[hey, good, morning, maybe, little, bit, depos...",[hey good morning maybe little bit deposit tho...,"[yeah, couple, thing, there, first, all, know,...",[yeah couple things there so first all know ri...,31


In [67]:
qa_data = qa_data.drop(columns=["question_tokenised_data", "answer_tokenised_data", "count"])


In [68]:
qa_data.rename(columns={"Quarter-Year":"Quarter", "Asked By":"Analyst", "Role of the person asked the question":"Analyst Role","Answer":"Response", "Answered By": "Executive", "Role of the person answered the question":"Executive Role Type" }, inplace=True)


In [69]:
qa_data.isnull().sum()

Unnamed: 0,0
Index,0
Quarter,0
Question,0
Analyst,0
Analyst Role,0
Response,0
Executive,0
Executive Role Type,0
question_cleaned,0
answer_cleaned,0


In [74]:
qa_data.head(25)

Unnamed: 0,Index,Quarter,Question,Analyst,Analyst Role,Response,Executive,Executive Role Type,question_cleaned,answer_cleaned
0,1,1Q23,"So, Jamie, I was actually hoping to get your p...",Steven Chubak,"Analyst, Wolfe Research LLC","Well, I think you were already kind of complet...",Jamie Dimon,"Chairman & Chief Executive Officer, JPMorgan C...",[so jamie actually hoping get perspective see ...,[well think already kind complete answering qu...
1,2,1Q23,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",Ken Usdin,"Analyst, Jefferies LLC","Yeah, sure. So let me just summarize the drive...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[hey thanks good morning hey jeremy wondering ...,[yeah sure let summarize drivers change outloo...
2,3,1Q23,"Hi, thanks. Jeremy, wanted to follow up again ...",John McDonald,"Analyst, Autonomous Research","Yeah. John, it's a really good question, and w...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[hi thanks jeremy wanted follow drivers nii re...,[yeah john really good question obviously thou...
3,4,1Q23,My first question is you mentioned that your r...,Erika Najarian,"Analyst, UBS Securities LLC","Yeah. So, Erika, as you know, we take \n not g...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[first question mentioned reserve build driven...,[yeah so erika know take going go lot detail t...
4,5,1Q23,Hey. Good morning. Maybe just a little bit on ...,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Yeah. A couple things there. So, first of all,...","Jeremy Barnum, Jamie Dimon","Chief Financial Officer, JPMorgan Chase & Co.;...",[hey good morning maybe little bit deposit tho...,[yeah couple things there so first all know ri...
5,6,1Q23,"In your comments about your CET1 ratio, obviou...",Gerard Cassidy,"Analyst, RBC Capital Markets LLC","Yeah. So a few things on there, Gerard. So we ...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[comments cet1 ratio obviously came strong 138...,[yeah things there gerard previously said targ...
6,7,1Q23,How much of a issue is oversupply in the marke...,Ebrahim H. Poonawala,"Analyst, Bank of America Merrill Lynch","Yeah, so Ebrahim let me sort of respond narrow...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[much issue oversupply market think next years...,[yeah ebrahim let sort respond narrowly connec...
7,8,1Q23,"Hey Jeremy, you mentioned a degree of reinterm...",Mike Mayo,"Analyst, Wells Fargo Securities LLC","Yeah, Mike. So I think, yeah, you're referring...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[hey jeremy mentioned degree reintermediation ...,[yeah mike think yeah referring comments made ...
8,9,1Q23,I do want to unpack the question here on the p...,Betsy L. Graseck,"Analyst, Morgan Stanley & Co. LLC","Sure. So, Betsy, your question is very good. A...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[want unpack question possibility higher longe...,[sure so betsy question good would say look ev...
9,10,1Q23,Have you seen big changes in how corporate tre...,Glenn Schorr,"Analyst, Evercore ISI","Yeah, Glenn, in short, we really haven't seen ...",Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co.",[seen big changes corporate treasurers cfos ad...,[yeah glenn short really seen big changes spea...


In [76]:
print(qa_data["Question"][18])

Wanted to ask on the deposit outlook – just with signs that recent liquidity drawdown has come predominantly out of RRP versus industry deposits, just wanted to get your thoughts on what expectations you have for deposit growth in the second half, both for you and even the broader industry, especially as Treasury issuance really begins to ramp in earnest.



In [None]:
#Defining the JPMorgan management discussion dataframe
jpmorgan_body_df=pd.read_csv("jpmorgan_management_discussion.csv")
jpmorgan_body_df.head()

In [73]:
jpmorgan_body_df.head(25)

NameError: name 'jpmorgan_body_df' is not defined

# **Export the output as a csv file**

JP Morgan QA section

In [71]:
#export preprocessed data
preprocessed_qa_csv_path1 = "/content/drive/MyDrive/bank_of_england/data/preprocessed_data/jpmorgan_qna_df_preprocessed_ver8.csv"
qa_data.to_csv(preprocessed_qa_csv_path1, index=False)
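
One thing to keep in mind when exporting: the `question_cleaned` and `answer_cleaned` cells shown earlier hold one-element lists, and a CSV file can only store them as text. The sketch below uses the standard-library `csv` module rather than pandas, with a made-up row, to show the effect and one way to recover the list with `ast.literal_eval`. `DataFrame.to_csv` is expected to stringify list cells in a similar way, but that is an assumption worth verifying on the real file.

```python
import ast
import csv
import io

# Made-up row mimicking the shape of the cleaned columns.
row = {"Question": "Hi, thanks.", "question_cleaned": ["hi thanks"]}

# Write the row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the list comes back as its string representation.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw = loaded["question_cleaned"]
print(repr(raw))  # "['hi thanks']"

# Parse the string back into a Python list.
restored = ast.literal_eval(raw)
print(restored)  # ['hi thanks']
```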

JP Morgan management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path2 = "/content/sample_data/jpmorgan_management_df_preprocessed.csv"
jpmorgan_body_df.to_csv(preprocessed_qa_csv_path2, index=False)

UBS QA section

In [None]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/sample_data/ubs_qa_df_preprocessed.csv"
ubs_qna_df.to_csv(preprocessed_qa_csv_path3, index=False)

UBS management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)