<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [185]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [186]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

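preprocessor function: modifies the DataFrame `data` in place, adding two new columns (`col1` and `col2`) with preprocessed text. `col1` holds a list of lower-cased, lemmatized tokens; `col2` holds the lower-cased text with stop words and basic punctuation removed, wrapped in a single-element list.


Input:
  - data: the DataFrame to modify
  - col: name of the column which contains the text to clean
  - col1: name of the new column for the tokenized text
  - col2: name of the new column for the cleaned text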

In [187]:
#create function to preprocess data
def preprocessor(data, col, col1, col2):
  #Copy the source column into both new columns
  data[col1]=data[col]
  data[col2]=data[col]


  #Adding column1
  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(nltk.word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #remove empty tokens (the original test compared the whole list x, not each word)
  data[col1] = data[col1].apply(lambda x: [word for word in x if word!=""])

  #remove short word
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #remove symbols
  data[col1] = data[col1].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #lemmatization
  lemmatizer = WordNetLemmatizer()
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



  #Adding column2
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #remove symbols
  data[col2] = data[col2].apply (lambda x: [re.sub(r"[.,'?]", "", x)])

  return
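
To make the two branches of `preprocessor` easier to follow, here is a minimal, dependency-free sketch on a single string. It is not the notebook's implementation: `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's English list, plain `split()` replaces `word_tokenize`, lemmatization is skipped, and symbols are stripped before the length filter so that no empty tokens survive.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's English stop words instead.
STOP_WORDS = {"i", "the", "on", "a", "and", "is", "it", "to"}

def clean_text(text):
    """Return (tokens, cleaned_string) for one utterance."""
    words = [w for w in str(text).lower().split() if w not in STOP_WORDS]
    # Token branch: drop pure numbers, strip non-letters, drop short/empty tokens.
    tokens = [re.sub(r"[^a-z]", "", w) for w in words if not w.isdigit()]
    tokens = [t for t in tokens if len(t) > 2]
    # Cleaned branch: keep the word order and remove only . , ' ?
    cleaned = re.sub(r"[.,'?]", "", " ".join(words))
    return tokens, cleaned

tokens, cleaned = clean_text("Yeah. I think the outlook on QT is fine, 2025?")
print(tokens)
print(cleaned)
```

The token branch is the lossy one (short words such as "qt" disappear), while the cleaned branch keeps every non-stop word in order.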


find_row: This function searches upwards from the row just above `current_row_num` in the DataFrame `df` to find the nearest row where the value in column `col` is "A". It returns the index of that row. Row 0 is never tested: if no match is found above it, the function returns 0 (or -1 when `current_row_num` is 0).

In [188]:
def find_row (df, col, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col][i] == "A":
      break
    else:
      i-=1
  return i
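
A small sketch of how `find_row` behaves on toy data may help here. The function is repeated so the block runs on its own; the `toy` frame and its `marker` values are made up for illustration.

```python
import pandas as pd

def find_row(df, col, current_row_num):
    # Same logic as the notebook's function: walk upwards until an "A" row is met.
    i = current_row_num - 1
    while i > 0:
        if df[col][i] == "A":
            break
        i -= 1
    return i

# Hypothetical marker column: answers (A) and questions (Q).
toy = pd.DataFrame({"marker": ["A", "Q", "A", "Q", "Q", "A"]})

prev_a = find_row(toy, "marker", 5)    # nearest "A" above row 5
fallback = find_row(toy, "marker", 2)  # loop ends at 0 without testing row 0
first = find_row(toy, "marker", 0)     # loop never runs
print(prev_a, fallback, first)
```

Because the loop stops before testing row 0, a return value of 0 does not by itself say whether row 0 is an "A" row.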

In [189]:
def find_row_empty (df, col1, col2, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col1][i] == "A" and df[col2][i] != []:
      break
    else:
      i-=1
  return i
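
`find_row_empty` has no description above; reading the code, it appears to search upwards for the nearest "A" row whose `col2` value is not an empty list. A minimal sketch under that reading, with the function repeated so the block is self-contained and a made-up `toy` frame:

```python
import pandas as pd

def find_row_empty(df, col1, col2, current_row_num):
    # Same logic as the notebook's function: nearest "A" row above
    # whose col2 entry is not an empty list.
    i = current_row_num - 1
    while i > 0:
        if df[col1][i] == "A" and df[col2][i] != []:
            break
        i -= 1
    return i

# Hypothetical data: row 1 already carries the analyst, rows 2 and 3 are still empty.
toy = pd.DataFrame({
    "marker": ["Q", "A", "A", "A"],
    "AS": ["x", ["Jim Mitchell"], [], []],
})

source_row = find_row_empty(toy, "marker", "AS", 3)
print(source_row)
```

Row 2 is skipped because its `AS` entry is still `[]`, so row 3 would be filled from row 1.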

## **Data**

In [190]:
#drive.mount('/content/drive')

In [191]:
#!ls"/content/bank_of_england/data/preprocessed_data/Archived/jpmorgan_qa_section_preprocessed.csv"

JP Morgan QA section

In [192]:
#Clone the repository and move to the archived preprocessed data
!git clone https://github.com/sheldonkemper/bank_of_england.git
!git switch Preprocessing
%cd bank_of_england/data/preprocessed_data/archived
%ls

Cloning into 'bank_of_england'...
remote: Enumerating objects: 1162, done.
remote: Counting objects: 100% (316/316), done.
remote: Compressing objects: 100% (249/249), done.
remote: Total 1162 (delta 201), reused 86 (delta 67), pack-reused 846 (from 2)
Receiving objects: 100% (1162/1162), 10.19 MiB | 11.99 MiB/s, done.
Resolving deltas: 100% (569/569), done.
fatal: invalid reference: Preprocessing
/content/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived/bank_of_england/data/preprocessed_data/archived
jpmorgan_preprocessed_transcript.csv  santander_management_discussion_preprocessed.csv
jpmorgan_qa_section_preprocessed.csv  unfiltered_preprocess

In [193]:
#Defining qa_data
qa_data = pd.read_csv("jpmorgan_qa_section_preprocessed.csv")
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,tokenised_data,cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['yeah', 'think', 'conventional', 'wisdom', 'p...",['yeah think conventional wisdom qt im pretend...
1,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC","So, you'll stay around maybe for a few more ye...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['stay', 'around', 'maybe', 'years', 'base', '...",['so stay around maybe years base case right n...
2,Mike Mayo,Q,"Analyst, Wells Fargo Securities LLC",All right. Thank you.,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['right', 'thank', 'you']",['right thank you']
3,Operator,,,Thank you. Our next question comes from Jim Mi...,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['thank', 'you', 'next', 'question', 'comes', ...",['thank you next question comes jim mitchell s...
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['hey', 'good', 'morning', 'maybe', 'regulatio...",['hey good morning maybe regulation new admini...


In [194]:
#preprocessing data
preprocessor(qa_data, "utterance", "question_tokenised_data", "question_cleaned_data")
#preprocessor(qa_data,"answer","answer_tokenised_data","answer_cleaned_data")

In [195]:
#remove operator rows (copy so the new column is added to an independent DataFrame)
qa_data = qa_data.loc[qa_data["speaker"]!="Operator"].copy()

#keep only utterances with more than 20 tokens
qa_data["count"] = qa_data["question_tokenised_data"].apply(lambda x: len(x))
qa_data = qa_data.loc[qa_data["count"]>20]
qa_data.head()


Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,tokenised_data,cleaned_data,question_tokenised_data,question_cleaned_data,count
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['yeah', 'think', 'conventional', 'wisdom', 'p...",['yeah think conventional wisdom qt im pretend...,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...,80
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['hey', 'good', 'morning', 'maybe', 'regulatio...",['hey good morning maybe regulation new admini...,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...,39
5,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['hey', 'jim', 'mean', 'obviously', 'something...",['hey jim mean obviously something were thinki...,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...,60
6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['everything', 'capital', 'liquidity', 'uses',...",['everything capital liquidity uses data balan...,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,74
7,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"['add', 'that', 'great', 'jeremy', 'gave', 'al...",['add thats great jeremy gave all let add thre...,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,80
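
The filtering step above can be sketched on toy data as follows. This is an illustration only: the `toy` frame, the `tokens` column and the name `MIN_TOKENS` are assumptions, and the explicit `.copy()` is there so that adding a column acts on an independent DataFrame rather than on a slice of the original.

```python
import pandas as pd

# Hypothetical rows standing in for the transcript data.
toy = pd.DataFrame({
    "speaker": ["Operator", "Analyst A", "Executive B"],
    "tokens": [["thank", "you"], ["short", "question"], ["word"] * 25],
})
MIN_TOKENS = 20  # same threshold as the cell above

# Boolean-mask filter, then an explicit copy before adding a column.
no_operator = toy.loc[toy["speaker"] != "Operator"].copy()
no_operator["count"] = no_operator["tokens"].str.len()
long_enough = no_operator.loc[no_operator["count"] > MIN_TOKENS]
print(long_enough[["speaker", "count"]])
```

`Series.str.len()` also works on list-valued cells, so it can stand in for `apply(lambda x: len(x))`.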


In [196]:
qa_data.drop(columns=["count","tokenised_data","cleaned_data"], inplace=True)
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...
4,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
5,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...
6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...
7,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...


In [197]:
qa_data.reset_index(drop=True, inplace=True)
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...
1,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
2,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...
3,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...
4,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...


In [198]:
#Add question number
qa_data["question_number_inline"]=None
num = 0
for i in qa_data.index:
  if qa_data.loc[i,"marker"]=="Q":
    qa_data.at[i,"question_number_inline"]=num
    num +=1
  else:
    continue
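
The numbering loop above can also be written without an explicit loop. The sketch below is an alternative, not the notebook's method; the `toy` frame is made up, and non-question rows end up as NaN rather than None.

```python
import pandas as pd

# Hypothetical marker column.
toy = pd.DataFrame({"marker": ["A", "Q", "A", "Q", "Q"]})

is_q = toy["marker"].eq("Q")
# Running count of questions, shifted to start at 0 and kept only on Q rows.
toy["question_number_inline"] = (is_q.cumsum() - 1).where(is_q)
print(toy)
```

Masking with `where` turns the column into floats, which matches the `0.0`, `1.0` values seen in the notebook output.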

In [199]:
qa_data.head(15)

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data,question_number_inline
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...,
1,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...,0.0
2,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...,
3,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,
4,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,
5,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Yeah. No, that makes sense. And maybe just as ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, make, sense, maybe, followup, loan, gro...",[yeah no makes sense maybe follow-up loan grow...,1.0
6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah, it's a good question. And I think given ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, good, question, think, given, significa...",[yeah good question think given significant im...,
7,Erika Najarian,Q,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yes, good, morning, wanted, follow, question,...",[yes hi good morning wanted follow questions c...,2.0
8,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[right, erika, okay, tempting, many, rabbit, h...",[right erika okay tempting many rabbit holes d...,
9,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Oh, yeah. This is an unfortunate thing for any...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, unfortunate, thing, big, company, like,...",[oh yeah unfortunate thing big company like pe...,


In [200]:
qa_data["AS"]=None
qa_data["title"]=None
qa_data["question"]=None
qa_data["question_num"]=None

for i in qa_data.index:
  if qa_data.loc[i,"marker"]=="Q":
    qa_data.at[i,"AS"]="x"
    qa_data.at[i,"title"]="x"
    qa_data.at[i,"question"]="x"
    qa_data.at[i,"question_num"]="x"
  else:
    last_a = find_row(qa_data,"marker", i)+ 1
    name_list=[]
    title_list=[]
    question_list=[]
    quest_num_list=[]
    for j in range(last_a, i):
      name_list.append(qa_data["speaker"][j])
      title_list.append(qa_data["job_title"][j])
      question_list.append(qa_data["utterance"][j])
      quest_num_list.append(str(qa_data["question_number_inline"][j]))
    print(quest_num_list)


    qa_data.at[i,"AS"]=name_list
    qa_data.at[i,"title"]=title_list
    qa_data.at[i,"question"]=question_list
    qa_data.at[i,"question_num"]=quest_num_list

[]
['0']
[]
[]
['1']
['2']
[]
['3']
['4']
[]
['5']
['6', '7']
['8']
['9']
[]
['10']
['11']
['12', '13']
['14']
[]
['15', '16']
[]
['17']
['18']
[]
['19']
['20']
[]
[]
[]
[]
[]
[]
['21']
[]
[]
['22']
[]
['23']
[]
['24', '25']
['26']
['27']
[]
[]
['28', '29']
[]
[]
[]
['30', '31']
[]
['32']
[]
[]
['33']
['34']
['35', '36', '37']
[]
[]
[]
['38']
['39', '40']
['41']
[]
['42']
[]
['43']
[]
['44']
['45']
['46']
['47']
['48']
[]
['49']
['50']
[]
['51']
['52', '53', '54']
[]
[]
['55']
['56']
[]
[]
['57']
['58']
['59']
[]
[]
['None', '60', '61', '62']
['63']
['64']
['65']
['66']
['66', 'None']
[]
[]
['67']
[]
[]
['68']
['69']
[]
['70']
['71']
['72']
['73', '74']
['75']
['76', '77']
[]
[]
['78']
['79', '80']
[]
[]
['81']
[]
['82', '83']
[]
['84']
['85', '86']
[]
['87']
[]
['88']
[]
['89']
[]
['90']
['91']
['92']
[]
[]
['93']
[]
[]
['94']
[]
[]
['95']
['96']
['97']
['98', '99']
[]
['100']
['101', '102', '103']
['104']
['105']
[]
['106']
[]
['107']
['108']
[]
['109']
['110']
['111']
[]
['112']
['1

In [201]:
qa_data.head(50)

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data,question_number_inline,AS,title,question,question_num
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...,,[],[],[],[]
1,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...,0.0,x,x,x,x
2,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","[Hey. Good morning. Maybe just on regulation, ...",[0]
3,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,,[],[],[],[]
4,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,,[],[],[],[]
5,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Yeah. No, that makes sense. And maybe just as ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, make, sense, maybe, followup, loan, gro...",[yeah no makes sense maybe follow-up loan grow...,1.0,x,x,x,x
6,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah, it's a good question. And I think given ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, good, question, think, given, significa...",[yeah good question think given significant im...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","[Yeah. No, that makes sense. And maybe just as...",[1]
7,Erika Najarian,Q,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yes, good, morning, wanted, follow, question,...",[yes hi good morning wanted follow questions c...,2.0,x,x,x,x
8,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[right, erika, okay, tempting, many, rabbit, h...",[right erika okay tempting many rabbit holes d...,,[Erika Najarian],"[Analyst, UBS Securities LLC]","[Yes. Hi, good morning. Wanted to follow up on...",[2]
9,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Oh, yeah. This is an unfortunate thing for any...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, unfortunate, thing, big, company, like,...",[oh yeah unfortunate thing big company like pe...,,[],[],[],[]


In [202]:
for i in range(len(qa_data)):
  if i ==0 and qa_data["marker"][i]=="A":
    qa_data.at[i,"AS"] ="x"
    qa_data.at[i,"title"] ="x"
    qa_data.at[i,"question"] ="x"
    qa_data.at[i,"question_num"]="x"
  elif qa_data["marker"][i]=="A" and qa_data["AS"][i]==[]:
    a = find_row_empty(qa_data,"marker","AS",i)
    qa_data.at[i,"AS"] = qa_data.loc[a,"AS"]
    qa_data.at[i,"title"] = qa_data.loc[a,"title"]
    qa_data.at[i,"question"] = qa_data.loc[a,"question"]
    qa_data.at[i,"question_num"] =qa_data.loc[a,"question_num"]
  else:
    continue
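
The two loops above (collect the questions that precede an answer, then copy them down to follow-on answers) can be approximated in one vectorised pass. This is a hedged alternative sketch, not the notebook's method: the `toy` frame and the names `is_q`, `block` and `questions` are assumptions, and questions are joined into a single string per block instead of being kept as lists.

```python
import pandas as pd

# Hypothetical transcript: markers and utterances.
toy = pd.DataFrame({
    "marker":    ["A",  "Q",  "A",  "A",  "Q",  "Q",  "A"],
    "utterance": ["a0", "q1", "a2", "a3", "q4", "q5", "a6"],
})

is_q = toy["marker"].eq("Q")
# A new question block starts whenever a Q row follows a non-Q row.
block = (is_q & ~is_q.shift(fill_value=False)).cumsum()
# Join the question text of each block.
questions = toy[is_q].groupby(block[is_q])["utterance"].agg(" ".join)
# Attach the block's question text to answer rows only.
toy["question"] = block.map(questions).where(~is_q)
print(toy)
```

Answers that come before any question (block 0) are left as NaN, which plays the role of the "x" placeholder used above.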


In [203]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data,question_number_inline,AS,title,question,question_num
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...,,x,x,x,x
1,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...,0.0,x,x,x,x
2,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","[Hey. Good morning. Maybe just on regulation, ...",[0]
3,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","[Hey. Good morning. Maybe just on regulation, ...",[0]
4,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","[Hey. Good morning. Maybe just on regulation, ...",[0]


In [204]:
qa_data["question"] = qa_data["question"].apply(lambda x: " ".join(x))

In [205]:
preprocessor(qa_data,"question","ques_tokenised_data","ques_cleaned_data")

In [206]:
qa_data.head()

Unnamed: 0,speaker,marker,job_title,utterance,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data,question_number_inline,AS,title,question,question_num,ques_tokenised_data,ques_cleaned_data
0,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, think, conventional, wisdom, pretending...",[yeah think conventional wisdom qt im pretendi...,,x,x,x,x,[],[x]
1,Jim Mitchell,Q,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...,0.0,x,x,x,x,[],[x]
2,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[hey, jim, mean, obviously, something, re, thi...",[hey jim mean obviously something were thinkin...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Hey. Good morning. Maybe just on regulation, w...",[0],"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
3,Jeremy Barnum,A,"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Hey. Good morning. Maybe just on regulation, w...",[0],"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...
4,Jamie Dimon,A,"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Hey. Good morning. Maybe just on regulation, w...",[0],"[hey, good, morning, maybe, regulation, new, a...",[hey good morning maybe regulation new adminis...


In [207]:
"""
filename
financial_quarter
call_date
speaker
marker
question number: please add to each marker the question number (so we keep track of the order) of the question that each marker refers to
job_title
utterance: can you please rename this as "metatext" ---> answer
question_cleaned_data: can you please rename this as "metatext_cleaned" ---> answer_cleaned
AS: can you please name this as "analyst"
title: can you please name this as "analyst_title"
question: can you please name this as "metatext_question"
ques_cleaned_data: can you please name this as "metatext_quest_cleaned"
"""
#run preprocessor on answer column
preprocessor(qa_data,"utterance","answer_tokenised_data","answer_cleaned_data")

#drop tokenised column
qa_data.drop(columns=["answer_tokenised_data","question_tokenised_data","question_cleaned_data"], inplace=True)

#rename column
qa_data.rename(columns={"utterance":"metadata_answer","answer_cleaned_data":"answer_cleaned", "AS":"analyst","title":"analyst_title","question":"metadata_question","ques_cleaned_data":"question_cleaned"},inplace=True)

#Reorder df
columns_in_order=["filename","financial_quarter","call_date","speaker","marker","question_num","job_title","metadata_answer", "answer_cleaned","analyst","analyst_title","metadata_question","question_cleaned"]


In [208]:
qa_data=qa_data[["filename","financial_quarter","call_date","speaker","marker","question_num","job_title","metadata_answer", "answer_cleaned","analyst","analyst_title","metadata_question","question_cleaned"]]
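
Since the reorder step depends on every listed name existing after the rename, a quick check before selecting can catch a mismatched column name early. A minimal sketch on made-up data; the `toy` frame and the `missing` list are illustrative only.

```python
import pandas as pd

# Hypothetical one-row frame with a few of the notebook's column names.
toy = pd.DataFrame({
    "utterance": ["answer text"],
    "AS": [["Jim Mitchell"]],
    "question_num": [["0"]],
})

toy = toy.rename(columns={"utterance": "metadata_answer", "AS": "analyst"})
columns_in_order = ["question_num", "metadata_answer", "analyst"]

# Fail early, with a readable message, if a requested column is absent.
missing = [c for c in columns_in_order if c not in toy.columns]
if missing:
    raise KeyError(f"columns not found: {missing}")

toy = toy[columns_in_order]
print(list(toy.columns))
```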

In [209]:
qa_data.head(20)

Unnamed: 0,filename,financial_quarter,call_date,speaker,marker,question_num,job_title,metadata_answer,answer_cleaned,analyst,analyst_title,metadata_question,question_cleaned
0,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jeremy Barnum,A,x,"Chief Financial Officer, JPMorganChase","Yeah. I think the conventional wisdom on QT, a...",[yeah think conventional wisdom qt im pretendi...,x,x,x,[x]
1,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jim Mitchell,Q,x,"Analyst, Seaport Global Securities LLC","Hey. Good morning. Maybe just on regulation, w...",[hey good morning maybe regulation new adminis...,x,x,x,[x]
2,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jeremy Barnum,A,[0],"Chief Financial Officer, JPMorganChase","Hey, Jim. I mean, it's obviously something we'...",[hey jim mean obviously something were thinkin...,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Hey. Good morning. Maybe just on regulation, w...",[hey good morning maybe regulation new adminis...
3,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jeremy Barnum,A,[0],"Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",[everything capital liquidity uses data balanc...,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Hey. Good morning. Maybe just on regulation, w...",[hey good morning maybe regulation new adminis...
4,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jamie Dimon,A,[0],"Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",[add thats great jeremy gave all let add three...,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Hey. Good morning. Maybe just on regulation, w...",[hey good morning maybe regulation new adminis...
5,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jim Mitchell,Q,x,"Analyst, Seaport Global Securities LLC","Yeah. No, that makes sense. And maybe just as ...",[yeah no makes sense maybe follow-up loan grow...,x,x,x,[x]
6,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jeremy Barnum,A,[1],"Chief Financial Officer, JPMorganChase","Yeah, it's a good question. And I think given ...",[yeah good question think given significant im...,[Jim Mitchell],"[Analyst, Seaport Global Securities LLC]","Yeah. No, that makes sense. And maybe just as ...",[yeah no makes sense maybe follow-up loan grow...
7,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Erika Najarian,Q,x,"Analyst, UBS Securities LLC","Yes. Hi, good morning. Wanted to follow up on ...",[yes hi good morning wanted follow questions c...,x,x,x,[x]
8,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jeremy Barnum,A,[2],"Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",[right erika okay tempting many rabbit holes d...,[Erika Najarian],"[Analyst, UBS Securities LLC]","Yes. Hi, good morning. Wanted to follow up on ...",[yes hi good morning wanted follow questions c...
9,4q24-earnings-transcript.pdf,4Q24,2025-01-15,Jamie Dimon,A,[2],"Chairman & Chief Executive Officer, JPMorganChase","Oh, yeah. This is an unfortunate thing for any...",[oh yeah unfortunate thing big company like pe...,[Erika Najarian],"[Analyst, UBS Securities LLC]","Yes. Hi, good morning. Wanted to follow up on ...",[yes hi good morning wanted follow questions c...


JP Morgan management discussion

In [None]:
%ls

In [None]:
#defining JP Morgan management discussion dataframe
jpmorgan_body_df=pd.read_csv("jpmorgan_management_discussion.csv")
jpmorgan_body_df.head()

In [None]:
#preprocess data
preprocessor(jpmorgan_body_df, "chunk_text", "tokenized_data","cleaned_data")

In [None]:
jpmorgan_body_df.head()

UBS QA section

In [None]:
%ls

In [None]:
#define ubs q&a data
ubs_qna_df=pd.read_csv("ubs_qna_section.csv")

In [None]:
#preprocessing ubs Q&A data
preprocessor(ubs_qna_df, "utterance", "tokenized_data","cleaned_data")

In [None]:
ubs_qna_df.head()

UBS management discussion

In [None]:
%ls

In [None]:
#defining ubs management discussion
ubs_manag_df=pd.read_csv("ubs_management_discussion.csv")
ubs_manag_df.head()

In [None]:
#preprocessing ubs management discussion
preprocessor(ubs_manag_df,"utterance", "tokenized_data","cleaned_data")
ubs_manag_df.head()

# **Export the output as a csv file**

JP Morgan QA section

In [210]:
#export preprocessed data
preprocessed_qa_csv_path1 = "/content/bank_of_england/data/preprocessed_data/jpmorgan_qna_df_preprocessed_ver6.csv"
qa_data.to_csv(preprocessed_qa_csv_path1, index=False)
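
One thing to keep in mind when exporting: list-valued columns are written to CSV as their string representation, which is why the token columns come back as text such as "['yeah', 'think', ...]" when a preprocessed file is read again. A minimal round-trip sketch, using an in-memory buffer and a made-up `tokens` column, with `ast.literal_eval` as one possible way to restore the lists:

```python
import ast
import io

import pandas as pd

# Hypothetical frame with a list-valued column.
df = pd.DataFrame({"tokens": [["good", "morning"], ["thank", "you"]]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip each cell is a string, not a list.
restored = loaded["tokens"].apply(ast.literal_eval)
print(type(loaded["tokens"][0]).__name__, restored[0])
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this kind of column.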

JP Morgan management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path2 = "/content/sample_data/jpmorgan_management_df_preprocessed.csv"
jpmorgan_body_df.to_csv(preprocessed_qa_csv_path2, index=False)

UBS QA section

In [None]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/sample_data/ubs_qa_df_preprocessed.csv"
ubs_qna_df.to_csv(preprocessed_qa_csv_path3, index=False)

UBS management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)