<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing_ubs_original.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1125]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [1126]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from google.colab import drive


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

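preprocessor: This function modifies the DataFrame `data` in place, adding two new columns with preprocessed text: `col1` holds a list of lemmatized tokens and `col2` holds the cleaned text as a single string wrapped in a list. It returns nothing.


Input:
  - data: the DataFrame to modify
  - col: name of the existing column that contains the text to clean
  - col1: name of the new column that receives the tokenized text
  - col2: name of the new column that receives the cleaned text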

In [1127]:
#create function to preprocess data
def preprocessor (data, col, col1,col2):
  #Copy the source column into both new columns
  data[col1]=data[col]
  data[col2]=data[col]


  #Adding column1
  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(nltk.word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #remove short words
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #remove symbols
  data[col1] = data[col1].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #remove tokens left empty once the symbols are stripped
  data[col1] = data[col1].apply(lambda x: [word for word in x if word!=""])

  #lemmatization
  lemmatizer = WordNetLemmatizer()
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



  #Adding column2
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #remove symbols
  data[col2] = data[col2].apply (lambda x: [re.sub(r"[.,'?]", "", x)])

  return
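
The order of the token-cleaning steps matters, because stripping symbols can leave empty tokens behind. The sketch below traces one sentence through the same sequence of steps using only the standard library. It is an illustration under stated assumptions, not the notebook's function: `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK stop-word list, a whitespace split replaces `word_tokenize`, and lemmatization is omitted.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption).
STOP_WORDS = {"the", "we", "you", "when", "so", "a", "of", "to"}

def clean_tokens(text):
    """Trace one string through the token-cleaning steps (simplified sketch)."""
    # Lowercase, then drop stop words using a whitespace split.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    # Drop tokens that consist only of digits.
    words = [w for w in words if not w.isdigit()]
    # Drop tokens of one or two characters.
    words = [w for w in words if len(w) > 2]
    # Strip every character that is not a lowercase letter.
    words = [re.sub(r"[^a-z]", "", w) for w in words]
    # Drop tokens that became empty after stripping.
    return [w for w in words if w != ""]

tokens = clean_tokens("So, when we give you the 74%, we focused on 2023 capital.")
print(tokens)  # ['so', 'give', 'focused', 'capital']
```

In this simplified version "so," survives the stop-word step because the comma is still attached when the stop words are checked, and "74%," disappears only at the final empty-token filter.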


find_row: This function searches upwards from `current_row_num - 1` in the DataFrame `df` for the nearest row where the value in column `col` is "A" and returns the index of that row. The loop stops before row 0, so a return value of 0 means no such row was found among the rows above.

In [1128]:
def find_row (df, col, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col][i] == "A":
      break
    else:
      i-=1
  return i
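
Because 0 is also a valid row index, a caller cannot tell "not found" apart from "use row 0" without an extra check. The sketch below shows one possible variant of the upward search that also inspects row 0 and returns `None` when nothing matches. `find_marker_above` is a hypothetical name, and a plain list stands in for the DataFrame column.

```python
def find_marker_above(values, marker, current_row_num):
    """Return the index of the nearest earlier entry equal to marker, else None."""
    # Scan upwards from the row just above current_row_num, including row 0.
    for i in range(current_row_num - 1, -1, -1):
        if values[i] == marker:
            return i
    return None

markers = ["A", "Q", "A", "Q", "Q"]
print(find_marker_above(markers, "A", 4))  # 2
print(find_marker_above(markers, "A", 0))  # None
```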

find_row_empty: This function searches upwards from `current_row_num - 1` for the nearest row where `col1` has the value "A" and `col2` is not an empty list. It returns the index of that row, or 0 if no such row is found.

In [1129]:
def find_row_empty (df, col1, col2, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col1][i] == "A" and df[col2][i] != []:
      break
    else:
      i-=1
  return i

In [1130]:
def create_ques_num_column (data, new_col,marker_col):
  #Create question number column
  data[new_col]=None
  #local counter for the question number
  num = 0
  #number each row whose marker column is not missing; other rows stay None
  for i in data.index:
    if pd.notna(data.loc[i,marker_col]):
      data.at[i,new_col]=num
      num +=1
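
NaN never compares equal to itself, so missing markers are more safely detected with a predicate such as `pd.notna` or `math.isnan` than with `==` or `is`. The sketch below reproduces the numbering logic on a plain list. `is_missing` and `number_questions` are hypothetical helper names, and the list stands in for the marker column.

```python
import math

def is_missing(value):
    """True for None and for any float NaN, regardless of object identity."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def number_questions(markers):
    """Give each non-missing marker a running number; missing entries get None."""
    numbers = []
    count = 0
    for marker in markers:
        if is_missing(marker):
            numbers.append(None)
        else:
            numbers.append(count)
            count += 1
    return numbers

nan = float("nan")
print(nan == nan)  # False
numbered = number_questions(["Goldman", nan, "Goldman Sachs", "JPMorgan", nan])
print(numbered)  # [0, None, 1, 2, None]
```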

In [1131]:
# Function to extract names
def extract_name(full_string):
    return full_string.split(',')[0]

In [1132]:
#find the nearest row above the current location whose value in col is "Q"
def find_last_a (df, col, current_row_num):
  i = current_row_num-1
  #row 0 is never checked; 0 is returned when no "Q" is found
  while i > 0:
    if df[col][i] == "Q":
      break
    i-=1
  return i

## **Data**

In [1133]:
#drive.mount('/content/drive')

In [1134]:
#!ls"/content/bank_of_england/data/preprocessed_data/Archived/jpmorgan_qa_section_preprocessed.csv"

In [1135]:
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


UBS QA section

In [1136]:
#Defining qa_data
qa_data = pd.read_csv("/content/drive/MyDrive/bank_of_england/data/cleansed/ubs_qna_section.csv")
qa_data.head()

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf


In [1137]:
pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+, [A-Z][A-Za-z]+'
qa_data["dummy"]=None
for i in range(len(qa_data)):
  matches = re.findall(pattern, str(qa_data['utterance'][i]))
  if matches:
    qa_data.at[i, 'dummy'] = matches
  else:
    continue
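
To see what the pattern captures, the sketch below runs it on three strings shaped like the utterances above. It uses the standard-library `re` module, which should treat this particular pattern the same way as `regex`; the sample strings are shortened for illustration.

```python
import re

pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+, [A-Z][A-Za-z]+'

samples = [
    "Chis Hallam, Goldman Sachs Yes. Good morning, everyone.",
    "Very clear. Thanks. Kian Abouhossein, JPMorgan Yeah. Thanks.",
    "So, Sarah, take the first question.",
]
results = [re.findall(pattern, text) for text in samples]
for found in results:
    print(found)
# ['Chis Hallam, Goldman']
# ['Kian Abouhossein, JPMorgan']
# []
```

The firm part of the pattern stops at the first space, so a two-word firm such as "Goldman Sachs" is captured as "Goldman" only.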


In [1138]:
qa_data.head()

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"


In [1139]:
additional_row = len(qa_data) - qa_data["dummy"].isnull().sum()
print(additional_row)
total = additional_row + len(qa_data)
print(total)

56
365


In [1140]:
for  i in qa_data.index:
  matches = re.findall(pattern, str(qa_data['utterance'][i]))
  if matches and i==0:
    #first row: only relabel the speaker, the utterance is left unchanged
    qa_data.at[i, 'speaker'] = matches
  elif matches:
    #split the utterance at the first "Name, Firm" tag
    parts1 = [part.strip() for part in qa_data['utterance'][i].split(matches[0])]
    qa_data.at[i, 'utterance'] = parts1[0]
    #store the text after the tag as a new row; the fractional index puts it directly below row i once the index is sorted
    new_index=i+0.5
    qa_data.loc[new_index] = {
        "speaker":matches,
        "job_title":matches,
        "utterance":parts1[1],
        "call_date":qa_data["call_date"][i],
        "financial_quarter":qa_data["financial_quarter"][i],
        "source_file":qa_data["source_file"][i],
        "dummy":None,
    }
    print(parts1[1])

Yeah. Thanks. Just two questions. The first one is related to the book value per share guidance that you gave of the 74% increase on day one. Just wondering if that still holds, first of all, at this point, or has that changed post your further analysis. And in that context, should we assume that the purchase accounting, restructuring, asset marks, etcetera, should be captured by the reserves that is being built off what we estimate 21 billion. The second question is in relation to more top down. How we should think about the impact on clients at this point and also retention of clients, especially in wealth management at this point already. And how we should think about retention of clients post the event, the clients that have left. How should we think about your position after that?
Sorry if I may. The 74% also includes - but does clearly not include additional charges to the P&L and the rest comes to the P&L, we should assume.
Ok.
Yeah. Thank you very much for taking my question. T

In [1141]:
qa_data=qa_data.sort_index().reset_index(drop=True)
qa_data.head(10)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,"[Chis Hallam, Goldman]",,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,"[Kian Abouhossein, JPMorgan]","[Kian Abouhossein, JPMorgan]",Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
5,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
6,"[Kian Abouhossein, JPMorgan]","[Kian Abouhossein, JPMorgan]",Sorry if I may. The 74% also includes - but do...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
7,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
8,"[Kian Abouhossein, JPMorgan]","[Kian Abouhossein, JPMorgan]",Ok.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
9,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
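
The row insertion above relies on a fractional index label: a row written to `i+0.5` sorts between rows `i` and `i+1`, and `reset_index` then restores consecutive integers. A minimal sketch on a toy DataFrame follows; the column values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "speaker": ["Analyst A", "Executive B"],
    "utterance": ["Thanks. Next question please.", "Sure."],
})

# Write a new row under a label halfway between 0 and 1.
df.loc[0.5] = ["Analyst C", "Next question please."]

# Sorting by label moves the new row between the original rows,
# and reset_index renumbers the rows 0, 1, 2.
df = df.sort_index().reset_index(drop=True)
print(df["speaker"].tolist())  # ['Analyst A', 'Analyst C', 'Executive B']
```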


In [1142]:
for i in range(len(qa_data)):
  if isinstance(qa_data['speaker'][i], list):
    parts = [part.strip() for part in qa_data['speaker'][i][0].split(',')]
    qa_data.at[i, 'speaker'] = parts[0]
    qa_data.at[i, 'job_title'] = parts[1]

  else:
    continue
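
The pattern guarantees a comma in every match, so `parts[1]` always exists in the loop above. For tags that come from another source, a slightly more defensive sketch uses `str.partition`, which yields an empty firm instead of raising when the comma is missing. `split_tag` is a hypothetical helper, not part of the notebook.

```python
def split_tag(tag):
    """Split 'Name, Firm' into (name, firm); firm is '' when no comma is present."""
    name, _, firm = tag.partition(",")
    return name.strip(), firm.strip()

print(split_tag("Kian Abouhossein, JPMorgan"))  # ('Kian Abouhossein', 'JPMorgan')
print(split_tag("Operator"))                    # ('Operator', '')
```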


In [1143]:
qa_data.head(60)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,Chis Hallam,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,Kian Abouhossein,JPMorgan,Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
5,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
6,Kian Abouhossein,JPMorgan,Sorry if I may. The 74% also includes - but do...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
7,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
8,Kian Abouhossein,JPMorgan,Ok.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
9,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,


In [1144]:
#Manual adjustments!!
text = qa_data["speaker"][55] + qa_data["job_title"][55]+qa_data["utterance"][55]
qa_data.at[54,"utterance"]=text
qa_data=qa_data.drop(index=55)
qa_data.reset_index(drop=True, inplace=True)
qa_data=qa_data.drop(columns="dummy")

In [1145]:
qa_data.head(60)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file
0,Chis Hallam,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
2,Chris Hallam,Goldman Sachs,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
3,Kian Abouhossein,JPMorgan,Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
4,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
5,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
6,Kian Abouhossein,JPMorgan,Sorry if I may. The 74% also includes - but do...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
7,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
8,Kian Abouhossein,JPMorgan,Ok.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
9,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf


In [1146]:
qa_data.to_csv("/content/ubs_qna_section.csv", index=False)

In [1064]:
"""
#ascend df by financial_quarter
quarter_order = ["1Q23", "2Q23","3Q23","4Q23","1Q24","2Q24","3Q24","4Q24"]
qa_data["financial_quarter"] = pd.Categorical(qa_data["financial_quarter"], categories=quarter_order, ordered=True)
qa_data = qa_data.sort_values("financial_quarter", kind="mergesort")
"""

'\n#ascend df by financial_quarter\nquarter_order = ["1Q23", "2Q23","3Q23","4Q23","1Q24","2Q24","3Q24","4Q24"]\nqa_data["financial_quarter"] = pd.Categorical(qa_data["financial_quarter"], categories=quarter_order, ordered=True)\nqa_data = qa_data.sort_values("financial_quarter", kind="mergesort")\n'
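
The commented-out cell orders rows by a custom quarter sequence with a stable sort, so rows within one quarter keep their original order. The same idea can be sketched in plain Python with made-up rows: `sorted` is stable, and a lookup table supplies the rank of each quarter. A plain string sort would not work here, because "1Q24" sorts before "2Q23".

```python
quarter_order = ["1Q23", "2Q23", "3Q23", "4Q23", "1Q24", "2Q24", "3Q24", "4Q24"]
rank = {quarter: position for position, quarter in enumerate(quarter_order)}

# Made-up rows: (financial_quarter, speaker)
rows = [("1Q24", "a"), ("2Q23", "b"), ("1Q24", "c"), ("1Q23", "d")]

# sorted() is stable, so ("1Q24", "a") stays ahead of ("1Q24", "c").
ordered = sorted(rows, key=lambda row: rank[row[0]])
print(ordered)
# [('1Q23', 'd'), ('2Q23', 'b'), ('1Q24', 'a'), ('1Q24', 'c')]
```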

In [1065]:
#preprocessing data
#preprocessor(qa_data, "utterance", "question_tokenised_data", "question_cleaned")
#preprocessor(qa_data,"utterance","answer_tokenised_data","answer_cleaned")

In [1066]:
#remove operator
#qa_data = qa_data.loc[qa_data["speaker"]!="Operator"]
"""
#remove less than 20 words
qa_data["count"] = qa_data["question_tokenised_data"].apply(lambda x: len(x))
qa_data = qa_data.loc[qa_data["count"]>20]
qa_data.head()
"""

'\n#remove less than 20 words\nqa_data["count"] = qa_data["question_tokenised_data"].apply(lambda x: len(x))\nqa_data = qa_data.loc[qa_data["count"]>20]\nqa_data.head()\n'

In [1068]:
#reset index
qa_data.reset_index(drop=True, inplace=True)
qa_data.head(20)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,Chis Hallam,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,Kian Abouhossein,JPMorgan,Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
5,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
6,Kian Abouhossein,JPMorgan,Sorry if I may. The 74% also includes - but do...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
7,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
8,Kian Abouhossein,JPMorgan,Ok.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
9,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,


In [1069]:
qa_data.to_csv("/content/ubs_qa_df_preprocessed.csv", index=False)

In [1024]:
#Create question number column
create_ques_num_column(qa_data,"question_number_inline","job_title")

In [1025]:
qa_data.head()

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy,question_tokenised_data,question_cleaned,answer_tokenised_data,answer_cleaned,question_number_inline
0,Chis Hallam,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]","[chi, hallam, goldman, sachs, yes, good, morni...",[chis hallam goldman sachs yes good morning ev...,"[chi, hallam, goldman, sachs, yes, good, morni...",[chis hallam goldman sachs yes good morning ev...,0.0
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"[okay, thank, you, capital, requirement, know,...",[okay thank you capital requirements know situ...,"[okay, thank, you, capital, requirement, know,...",[okay thank you capital requirements know situ...,
2,Chris Hallam,Goldman Sachs,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]","[clear, thanks]",[clear thanks],"[clear, thanks]",[clear thanks],1.0
3,Kian Abouhossein,JPMorgan,Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"[yeah, thanks, two, question, first, one, rela...",[yeah thanks two questions first one related b...,"[yeah, thanks, two, question, first, one, rela...",[yeah thanks two questions first one related b...,2.0
4,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"[sarah, take, first, question, take, second]",[so sarah take first question take second],"[sarah, take, first, question, take, second]",[so sarah take first question take second],


In [1026]:
#adding new columns
qa_data["analyst"]=None
qa_data["analyst_title"]=None
qa_data["metadata_question"]=None
qa_data["question_num"]=None

#Attach the preceding analyst question to each answer row

for i in qa_data.index:
    name_list=[]
    title_list=[]
    question_list=[]
    quest_num_list=[]
    #Rows with a job_title are analyst (question) rows: mark their "analyst", "analyst_title", "metadata_question", and "question_num" columns with "x".
    if pd.notna(qa_data.loc[i,"job_title"]):
      qa_data.at[i,"analyst"]="x"
      qa_data.at[i,"analyst_title"]="x"
      qa_data.at[i,"metadata_question"]="x"
      qa_data.at[i,"question_num"]="x"
    #Answer row directly below a question row: collect that question's speaker, job_title, utterance, and number.
    elif find_last_a(qa_data,"job_title",i) >=0 and pd.notna(qa_data.loc[i-1,"job_title"]):
      name_list.append(qa_data["speaker"][i-1])
      title_list.append(qa_data["job_title"][i-1])
      question_list.append(qa_data["utterance"][i-1])
      quest_num_list.append(str(qa_data["question_number_inline"][i-1]))
    #Answer row directly below another answer row: copy the question details from the row above.
    elif find_last_a(qa_data,"job_title",i) >=0 and pd.isna(qa_data.loc[i-1,"job_title"]):
      qa_data.at[i,"analyst"] = qa_data.at[i-1,"analyst"]
      qa_data.at[i, "analyst_title"] = qa_data.at[i-1,"analyst_title"]
      qa_data.at[i,"metadata_question"]=qa_data.at[i-1, "metadata_question"]
      qa_data.at[i,"question_num"] = qa_data.at[i-1, "question_num"]

    #Fallback, reached only when row 0 has no job_title: collect "speaker", "job_title", "utterance", and "question_number_inline" from the rows between last_a and i.
    else:
      last_a = find_row(qa_data,"job_title", i)+ 1
      for j in range(last_a, i):
        name_list.append(qa_data["speaker"][j])
        title_list.append(qa_data["job_title"][j])
        question_list.append(qa_data["utterance"][j])
        quest_num_list.append(str(qa_data["question_number_inline"][j]))
    #Assign the lists to row i. This runs for every row, so it overwrites the values set in the first and third branches; answer rows left with empty lists are back-filled in a later cell.
    qa_data.at[i,"analyst"]=name_list
    qa_data.at[i,"analyst_title"]=title_list
    qa_data.at[i,"metadata_question"]=question_list
    qa_data.at[i,"question_num"]=quest_num_list

In [1036]:
qa_data.head()

Unnamed: 0,source_file,financial_quarter,call_date,Exective,question_num,metadata_question,question_cleaned,metadata_answer,answer_cleaned,analyst,Bank
0,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Chis Hallam,x,x,[chis hallam goldman sachs yes good morning ev...,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",[chis hallam goldman sachs yes good morning ev...,Chis Hallam,x
1,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sergio P. Ermotti,[0],"[Chis Hallam, Goldman Sachs Yes. Good morning,...",[okay thank you capital requirements know situ...,"Okay. Thank you. On capital requirements, you ...",[okay thank you capital requirements know situ...,[Chis Hallam],[Goldman]
2,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Chris Hallam,[],[],[clear thanks],Very clear. Thanks.,[clear thanks],[],[]
3,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Kian Abouhossein,[],[],[yeah thanks two questions first one related b...,Yeah. Thanks. Just two questions. The first on...,[yeah thanks two questions first one related b...,[],[]
4,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sergio P. Ermotti,[2],[Yeah. Thanks. Just two questions. The first o...,[so sarah take first question take second],"So, Sarah, take the first question. I'll take ...",[so sarah take first question take second],[Kian Abouhossein],[JPMorgan]


In [1028]:
for i in range(len(qa_data)):
  #Row 0 takes its own speaker as the analyst and "x" for the other question columns. An answer row whose analyst list is empty copies the question details from the row returned by find_row_empty. All other rows are left unchanged.
  if i ==0:
    qa_data.at[i,"analyst"] =qa_data.loc[i,"speaker"]
    qa_data.at[i,"analyst_title"] ="x"
    qa_data.at[i,"metadata_question"] ="x"
    qa_data.at[i,"question_num"]="x"
  elif pd.isna(qa_data["job_title"][i]) and qa_data["analyst"][i]==[]:
    a = find_row_empty(qa_data,"job_title","analyst",i)
    qa_data.at[i,"analyst"] = qa_data.loc[a,"analyst"]
    qa_data.at[i,"analyst_title"] = qa_data.loc[a,"analyst_title"]
    qa_data.at[i,"metadata_question"] = qa_data.loc[a,"metadata_question"]
    qa_data.at[i,"question_num"] =qa_data.loc[a,"question_num"]
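
The two loops above pair each executive answer with the analyst question that precedes it. As a compact illustration of that pairing idea (a simplified alternative, not the notebook's implementation), the sketch below walks a list of turns once and carries the most recent question forward. The `turns` data and the `pair_answers` name are made up, and a turn is assumed to be a question when its firm is not `None`.

```python
def pair_answers(turns):
    """Attach the most recent analyst question to each following answer."""
    pairs = []
    current = None  # last (analyst, firm, question) seen so far
    for speaker, firm, text in turns:
        if firm is not None:
            # Analyst turn: remember it as the current question.
            current = (speaker, firm, text)
        elif current is not None:
            # Executive turn: pair it with the current question.
            analyst, bank, question = current
            pairs.append({"analyst": analyst, "bank": bank,
                          "question": question, "executive": speaker,
                          "answer": text})
    return pairs

# Made-up turns: (speaker, firm or None, utterance)
turns = [
    ("Analyst A", "Bank X", "First question."),
    ("Executive B", None, "First part of the answer."),
    ("Executive C", None, "Second part of the answer."),
]
pairs = pair_answers(turns)
for pair in pairs:
    print(pair["executive"], "->", pair["analyst"])
```

Carrying the current question in a variable avoids searching backwards through the rows for every answer.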


In [1029]:
#rename column
qa_data.rename(columns={"utterance":"metadata_answer","speaker":"Exective", "analyst_title":"Bank"},inplace=True)
qa_data.drop(columns=["job_title"],inplace=True)

In [1030]:
#reorganise column
qa_data=qa_data[["source_file","financial_quarter","call_date","Exective","question_num","metadata_question","question_cleaned","metadata_answer", "answer_cleaned","analyst","Bank"]]

In [1037]:
print(qa_data["metadata_question"][2])

[]


In [1042]:
qa_data = qa_data[(qa_data["metadata_question"]!="x") | (qa_data["metadata_answer"]==[])]
#

ValueError: ('Lengths must match to compare', (364,), (0,))
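
The error arises because pandas treats the empty list on the right-hand side of `==` as an array-like of length 0 and tries to match it against the 364-row column. One way to test each cell against an empty list is to apply the check per element. The sketch below uses a toy Series; which column the filter should inspect is left open, since the intended condition is not stated.

```python
import pandas as pd

column = pd.Series([["Ok."], [], "x", []])

# Check each cell individually instead of comparing the whole Series with [].
is_empty_list = column.apply(lambda value: isinstance(value, list) and len(value) == 0)
print(is_empty_list.tolist())  # [False, True, False, True]
```

The resulting boolean Series has the same length as the column, so it can be combined with other masks using `|` or `&`.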

In [1043]:
qa_data.head(20)

Unnamed: 0,source_file,financial_quarter,call_date,Exective,question_num,metadata_question,question_cleaned,metadata_answer,answer_cleaned,analyst,Bank
0,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Chis Hallam,x,x,[chis hallam goldman sachs yes good morning ev...,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",[chis hallam goldman sachs yes good morning ev...,Chis Hallam,x
1,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sergio P. Ermotti,[0],"[Chis Hallam, Goldman Sachs Yes. Good morning,...",[okay thank you capital requirements know situ...,"Okay. Thank you. On capital requirements, you ...",[okay thank you capital requirements know situ...,[Chis Hallam],[Goldman]
2,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Chris Hallam,[],[],[clear thanks],Very clear. Thanks.,[clear thanks],[],[]
3,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Kian Abouhossein,[],[],[yeah thanks two questions first one related b...,Yeah. Thanks. Just two questions. The first on...,[yeah thanks two questions first one related b...,[],[]
4,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sergio P. Ermotti,[2],[Yeah. Thanks. Just two questions. The first o...,[so sarah take first question take second],"So, Sarah, take the first question. I'll take ...",[so sarah take first question take second],[Kian Abouhossein],[JPMorgan]
5,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sarah Youngwood,x,x,[so give 74% focused intentionally viewed econ...,"So, when we give you the 74%, we focused inten...",[so give 74% focused intentionally viewed econ...,Chis Hallam,x
6,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Kian Abouhossein,[],[],[sorry may 74% also includes - clearly include...,Sorry if I may. The 74% also includes - but do...,[sorry may 74% also includes - clearly include...,[],[]
7,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sarah Youngwood,[3],[Sorry if I may. The 74% also includes - but d...,[thats right initial ppa comes cet1 tangible b...,That's right. The initial PPA comes into the C...,[thats right initial ppa comes cet1 tangible b...,[Kian Abouhossein],[JPMorgan]
8,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Kian Abouhossein,[],[],[ok],Ok.,[ok],[],[]
9,1q23-earnings-call-remarks.pdf,1Q23,25 April 2023,Sergio P. Ermotti,[4],[Ok.],[client retention – maybe let reiterate look i...,"So on client retention, I – maybe let me reite...",[client retention – maybe let reiterate look i...,[Kian Abouhossein],[JPMorgan]


# **Export the output as a csv file**

UBS QA section

In [816]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/ubs_qa_df_preprocessed.csv"
qa_data.to_csv(preprocessed_qa_csv_path3, index=False)

UBS management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)