<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing_ubs_pairing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [408]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [409]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from google.colab import drive


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

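preprocessor: modifies the DataFrame `data` in place, adding two new columns with preprocessed text. `col1` receives a list of lemmatised tokens; `col2` receives the lower-cased text with stop words and basic punctuation removed, wrapped in a single-element list.


Input:
  - data: the DataFrame to modify
  - col: name of the existing column that contains the text to clean
  - col1: name of the new column for the tokenised text
  - col2: name of the new column for the cleaned text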

In [410]:
#create function to preprocess data
def preprocessor (data, col, col1,col2):
  #Copy the source column into both new columns
  data[col1]=data[col]
  data[col2]=data[col]


  #Adding column1
  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(nltk.word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #remove empty tokens
  data[col1] = data[col1].apply(lambda x: [word for word in x if word!=""])

  #remove short word
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #remove symbols
  data[col1] = data[col1].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #lemmatization
  lemmatizer = WordNetLemmatizer()
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



  #Adding column2
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #remove symbols
  data[col2] = data[col2].apply (lambda x: [re.sub(r"[.,'?]", "", x)])

  return
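
To see what the token column (`col1`) ends up holding, the steps can be traced on a single sentence. The sketch below is a dependency-free approximation, not the function above: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, a regular expression stands in for `nltk.word_tokenize`, and lemmatisation is left out.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"you", "on", "in", "the", "is"}

def clean_tokens(text, stop_words=STOP_WORDS):
    # 1. lower-case, then drop stop words from the whitespace-split text
    words = [w for w in str(text).lower().split() if w not in stop_words]
    # 2. tokenise: runs of letters/digits, or single punctuation marks
    #    (a rough stand-in for nltk.word_tokenize)
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", " ".join(words))
    # 3. drop pure numbers
    tokens = [t for t in tokens if not t.isdigit()]
    # 4. drop tokens of one or two characters (this also removes punctuation)
    tokens = [t for t in tokens if len(t) > 2]
    # 5. strip any remaining non-letter characters
    return [re.sub(r"[^a-z]", "", t) for t in tokens]

print(clean_tokens("Thank you. On capital requirements in 2023, the CET1 ratio is 14%."))
# ['thank', 'you', 'capital', 'requirements', 'cet', 'ratio']
```

Note that `you` survives even though it is in the stop-word set: stop words are removed before tokenisation, while the word is still `you.` with its full stop attached. The same effect is visible in the notebook's own token output.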


find_row: This function searches upwards from the row above current_row_num in the DataFrame df to find the nearest row where the value in column col is "A". It returns the index of that row. If no such row is found it returns 0; row 0 itself is never tested, because the loop stops as soon as the index reaches 0.

In [411]:
def find_row (df, col, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col][i] == "A":
      break
    else:
      i-=1
  return i
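
The upward search is easiest to follow on a plain list. The sketch below mirrors the loop with a hypothetical helper, `find_row_in_list`, that takes a list of markers instead of a DataFrame column; the `target` parameter is an addition for illustration.

```python
def find_row_in_list(markers, current_row_num, target="A"):
    # start one row above the current position and walk upwards
    i = current_row_num - 1
    while i > 0:
        if markers[i] == target:
            break
        i -= 1
    return i

markers = ["Q", "A", "Q", "Q", "A", "Q"]
print(find_row_in_list(markers, 5))  # 4: the row directly above is an "A"
print(find_row_in_list(markers, 4))  # 1: skips the two "Q" rows at 3 and 2

# Row 0 is never tested, so these two cases both return 0:
print(find_row_in_list(["A", "Q", "Q"], 3))  # 0 (row 0 holds "A", but it was not checked)
print(find_row_in_list(["Q", "Q", "Q"], 3))  # 0 (no "A" anywhere)
```

Because 0 is returned in both of the last two cases, a caller that needs to tell them apart has to inspect row 0 separately.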

find_row_empty: This function searches upwards from the row above current_row_num to find the nearest row where col1 has the value "A" and col2 is not an empty list. It returns the index of that row. As with find_row, it returns 0 if no such row is found, and row 0 itself is never tested.

In [412]:
def find_row_empty (df, col1, col2, current_row_num):
  #list_name=[]
  i = current_row_num-1
  while i > 0:
    if df[col1][i] == "A" and df[col2][i] != []:
      break
    else:
      i-=1
  return i

In [413]:
def create_ques_num_column (data, new_col,marker_col):
  #Create question number column
  data[new_col]=None
  #local counter for the question number
  num = 0
  #if a marker is present, assign the current number and increment; otherwise leave None
  for i in data.index:
    if pd.notna(data.loc[i,marker_col]):
      data.at[i,new_col]=num
      num +=1
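
The numbering rule is: every row with a marker gets the next integer, and rows without one stay empty. The sketch below shows it on a plain list, using a hypothetical `is_missing` helper. An explicit missing-value test is used because an identity comparison against a NaN constant only matches that one object, and a NaN produced elsewhere (for example `float("nan")`) would slip through.

```python
import math

def is_missing(value):
    # treat both None and any float NaN as "no marker"
    return value is None or (isinstance(value, float) and math.isnan(value))

def number_marked_rows(markers):
    numbers = []
    num = 0
    for marker in markers:
        if is_missing(marker):
            numbers.append(None)
        else:
            numbers.append(num)
            num += 1
    return numbers

print(number_marked_rows(["Q", None, "Q", float("nan"), "Q"]))
# [0, None, 1, None, 2]
```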

In [414]:
# Function to extract names
def extract_name(full_string):
    return full_string.split(',')[0]

In [415]:
#find the nearest row above the current location where col equals "UBS"; returns 0 if none is found
def find_last_a (df, col, current_row_num):
  i = current_row_num-1
  while i > 0:
    if df[col][i] =="UBS":
      break
    else:
      i-=1
  return i

# **Data**

In [416]:
#drive.mount('/content/drive')

In [417]:
#!ls"/content/bank_of_england/data/preprocessed_data/Archived/jpmorgan_qa_section_preprocessed.csv"

In [418]:
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


UBS QA section

In [419]:
#Defining qa_data
qa_data = pd.read_csv("/content/ubs_data.csv")
qa_data.head()

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy,ex_dummy,category
0,Chis Hallam,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"['Chis Hallam, Goldman']",,Goldman
1,Sergio P. Ermotti,UBS,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,,Goldman
2,Chris Hallam,Goldman,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"['Kian Abouhossein, JPMorgan']",,Goldman
3,Kian Abouhossein,JPMorgan,Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,,JPMorgan
4,Sergio P. Ermotti,UBS,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,,JPMorgan


In [420]:
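#ex_dummy holds the next row's financial_quarter, so a mismatch with the row's own quarter marks the last row of a call
for i in range(len(qa_data)-1):
  qa_data.at[i,"ex_dummy"]=qa_data.at[i+1,"financial_quarter"]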
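
The look-ahead column can be pictured as the quarter list shifted up by one position. A minimal sketch on a plain list with toy values (not taken from the data):

```python
quarters = ["1Q23", "1Q23", "2Q23", "2Q23"]

# next row's quarter; the last row has no successor
next_quarter = quarters[1:] + [None]

# a row closes its call when the next row belongs to a different quarter
is_last_row_of_call = [nxt != cur for cur, nxt in zip(quarters, next_quarter)]

print(next_quarter)         # ['1Q23', '2Q23', '2Q23', None]
print(is_last_row_of_call)  # [False, True, False, True]
```

In the DataFrame the final row's `ex_dummy` is left empty by the loop, so it also compares as unequal to its own quarter. In pandas the same column can be produced without a loop via `qa_data["financial_quarter"].shift(-1)`.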

In [421]:
qa_data[10:60]

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy,ex_dummy,category
10,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Bank of America
11,Sergio P. Ermotti,UBS,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Bank of America
12,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Bank of America
13,Flora Bocahut,Jefferies,Yes. Good morning. The first question I wanted...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies
14,Sarah Youngwood,UBS,"So, on the first question in terms of the tren...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies
15,Sergio P. Ermotti,UBS,"Yeah. On the client side. I mean, if you look ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies
16,Flora Bocahut,Jefferies,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies
17,Andrew Coombs,Citi,"Good morning. Two questions. Firstly, just on ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Citi
18,Sarah Youngwood,UBS,"So, on the first quarter or the first question...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Citi
19,Andrew Coombs,Citi,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Citi


In [423]:
qa_data["question"] = None
qa_data["answer"] = None
qa_data["analyst"] = None
qa_data["exective"] = None
current_bank=qa_data["category"][0]
analyst=[]
exective=[]
question=""
answer=""
for i in range(len(qa_data)):
  if qa_data["ex_dummy"][i]!=qa_data["financial_quarter"][i] and qa_data["job_title"][i]=="UBS":
      exective.append(str(qa_data["speaker"][i]))
      answer += str(qa_data["utterance"][i])
      qa_data.at[i,"question"]=question
      qa_data.at[i,"answer"]=answer
      qa_data.at[i,"analyst"]=analyst
      qa_data.at[i,"exective"]=exective
      exective=[]
      analyst=[]
      answer = ""
      question = ""
      qa_data.at[i,"category"]=current_bank
      if i<len(qa_data)-1:
        current_bank = qa_data["job_title"][i+1]
      print(qa_data["question"][i])
      print(qa_data["answer"][i])
      print(qa_data["analyst"][i])
      print(qa_data["exective"][i])
  else:
    if qa_data["job_title"][i]!="UBS" and current_bank==qa_data["category"][i]:
      question += str(qa_data["utterance"][i])
      analyst.append(str(qa_data["speaker"][i]))
    elif qa_data["job_title"][i]=="UBS" and current_bank==qa_data["category"][i]:
      answer += str(qa_data["utterance"][i])
      exective.append(str(qa_data["speaker"][i]))
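    elif qa_data["job_title"][i]!="UBS" and current_bank!=qa_data["category"][i]:
      #an analyst from a new bank starts here: the finished pair is written to the previous row
      #note: the incoming speaker is appended before that write, so this name is stored with the previous pair
      current_bank=qa_data["job_title"][i]
      analyst.append(str(qa_data["speaker"][i]))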
      qa_data.at[i-1,"question"]=question
      qa_data.at[i-1,"answer"]=answer
      qa_data.at[i-1, "analyst"]=analyst
      qa_data.at[i-1,"exective"]=exective
      question = str(qa_data["utterance"][i])
      answer = ""
      analyst = []
      exective = []
    else:
      continue


Hi. Good morning. And welcome back, Sergio, from me as well. So, my first question is on NII. Appreciate the more integrated disclosure on the beta and the migration. I was wondering if you could talk more specifically about how that plays out on a geographical basis. Is that the greater pressure on the beta migration mainly from the U.S. side or are we seeing any changes in Europe and Asia as well? And maybe you can give some color on how your NII should pan out based on forward curves into 2024. And then my next question is on the IB. Appreciate you've yet to outline the perimeter of what mean – of what goes into non-core. But at the same time, you've been very specific about your IB DNA for the past few years now. And so, there's a question mark over the lev fin business which CS lumps with IBD. I guess the question is how do you see that business within your IB structure?
Okay. Thank you, Andrew. I'll take the second question. I think that, yes, the DNA won't change what we do is p
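
The pairing idea is that consecutive rows sharing a quarter and an analyst bank form one exchange, with non-UBS utterances making up the question and UBS utterances the answer. It can be sketched more compactly with `itertools.groupby`. This is a simplified alternative under stated assumptions, not the loop above: the rows are toy tuples, `pair_turns` is a hypothetical helper, utterances are joined with a space, and two analysts from the same bank speaking back to back would be merged into one pair.

```python
from itertools import groupby

# (financial_quarter, category, job_title, speaker, utterance) -- toy rows, not real data
turns = [
    ("1Q23", "Citi", "Citi", "Analyst A", "Two questions."),
    ("1Q23", "Citi", "UBS", "Exec X", "On the first one."),
    ("1Q23", "Citi", "UBS", "Exec Y", "On the second one."),
    ("1Q23", "Citi", "Citi", "Analyst A", "Thank you."),
    ("1Q23", "KBW", "KBW", "Analyst B", "One question."),
    ("1Q23", "KBW", "UBS", "Exec X", "Sure."),
]

def pair_turns(turns, house="UBS"):
    pairs = []
    # groupby only merges *consecutive* rows with the same (quarter, bank) key
    for (quarter, bank), block in groupby(turns, key=lambda t: (t[0], t[1])):
        block = list(block)
        pairs.append({
            "quarter": quarter,
            "analyst_bank": bank,
            "question": " ".join(t[4] for t in block if t[2] != house),
            "answer": " ".join(t[4] for t in block if t[2] == house),
            "analyst": sorted({t[3] for t in block if t[2] != house}),
            "executive": sorted({t[3] for t in block if t[2] == house}),
        })
    return pairs

for pair in pair_turns(turns):
    print(pair["analyst_bank"], "|", pair["question"], "|", pair["answer"])
```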

In [424]:
qa_data[10:60]

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy,ex_dummy,category,question,answer,analyst,exective
10,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Bank of America,,,,
11,Sergio P. Ermotti,UBS,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Bank of America,,,,
12,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...","Thank you, Ryan. It is good to be back to inte...","[Alastair Ryan, Flora Bocahut]",[Sergio P. Ermotti]
13,Flora Bocahut,Jefferies,Yes. Good morning. The first question I wanted...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies,,,,
14,Sarah Youngwood,UBS,"So, on the first question in terms of the tren...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies,,,,
15,Sergio P. Ermotti,UBS,"Yeah. On the client side. I mean, if you look ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies,,,,
16,Flora Bocahut,Jefferies,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Jefferies,Yes. Good morning. The first question I wanted...,"So, on the first question in terms of the tren...","[Flora Bocahut, Andrew Coombs]","[Sarah Youngwood, Sergio P. Ermotti]"
17,Andrew Coombs,Citi,"Good morning. Two questions. Firstly, just on ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Citi,,,,
18,Sarah Youngwood,UBS,"So, on the first quarter or the first question...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Citi,,,,
19,Andrew Coombs,Citi,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Citi,"Good morning. Two questions. Firstly, just on ...","So, on the first quarter or the first question...","[Andrew Coombs, Adam Terelak]",[Sarah Youngwood]


In [425]:
#De-duplicate executive names in each paired row (set() does not preserve the original order)
for i in range(len(qa_data)):
  if isinstance(qa_data["exective"][i], list):
    qa_data.at[i,"exective"]=list(set(qa_data["exective"][i]))
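
De-duplicating with `set()` discards the order in which the executives spoke. If that order matters downstream, `dict.fromkeys` is one way to keep the first occurrence of each name; a minimal sketch on a toy list:

```python
names = ["Todd Tuckner", "Sergio P. Ermotti", "Todd Tuckner"]

# dict keys are unique and keep insertion order
unique_in_order = list(dict.fromkeys(names))

print(unique_in_order)
# ['Todd Tuckner', 'Sergio P. Ermotti']
```

When writing the result back into the DataFrame, a single-step setter such as `qa_data.at[i, "exective"] = ...` avoids the chained-assignment pattern `qa_data["exective"][i] = ...`.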

In [426]:
qa_data.head()

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy,ex_dummy,category,question,answer,analyst,exective
0,Chis Hallam,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"['Chis Hallam, Goldman']",1Q23,Goldman,,,,
1,Sergio P. Ermotti,UBS,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,Goldman,,,,
2,Chris Hallam,Goldman,Very clear. Thanks.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"['Kian Abouhossein, JPMorgan']",1Q23,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...","Okay. Thank you. On capital requirements, you ...","[Chis Hallam, Chris Hallam, Kian Abouhossein]",[Sergio P. Ermotti]
3,Kian Abouhossein,JPMorgan,Yeah. Thanks. Just two questions. The first on...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,JPMorgan,,,,
4,Sergio P. Ermotti,UBS,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,1Q23,JPMorgan,,,,


In [428]:
qa_data[130:150]

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy,ex_dummy,category,question,answer,analyst,exective
130,Todd Tuckner,UBS,Yeah. That was our understanding from Credit S...,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Mediobanca,,,,
131,Adam Terelak,Mediobanca,Okay. Thank you.,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Mediobanca,Morning. Thank you for the questions. I want t...,"Thanks, Adam. On the second one, would just sa...","[Adam Terelak, Adam Terelak, Adam Terelak, And...","[Todd Tuckner, Serio P. Ermotti]"
132,Andrew Coombs,Citigroup,Good morning it’s Andrew Coombs from Citi and ...,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Citigroup,,,,
133,Todd Tuckner,UBS,"So, Andrew, in terms of – I’ll take the second...",31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Citigroup,,,,
134,Andrew Coombs,UBS,Thank you.,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Citigroup,Good morning it’s Andrew Coombs from Citi and ...,"So, Andrew, in terms of – I’ll take the second...",[Vishal Shah],"[Andrew Coombs, Todd Tuckner]"
135,Vishal Shah,Morgan Stanley,Hi. Thank you so much for your questions. My f...,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Morgan Stanley,,,,
136,Todd Tuckner,UBS,"Hey, Vishal. I mean I think on this on that se...",31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Morgan Stanley,,,,
137,Vishal Shah,Morgan Stanley,Okay. Thank you so much.,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q23,Morgan Stanley,,,,
138,Sergio P. Ermotti,UBS,Okay. The last answer and questions and I'm su...,31 August 2023,2Q23,2q23-earnings-call-remarks.pdf,,2Q24,Morgan Stanley,Hi. Thank you so much for your questions. My f...,"Hey, Vishal. I mean I think on this on that se...",[Vishal Shah],"[Todd Tuckner, Sergio P. Ermotti]"
139,Giulia Miotto,Morgan Stanley,Hi. Good morning. Thank you for taking my ques...,14 August 2024,2Q24,2q24-earnings-call-remarks.pdf,,2Q24,Morgan Stanley,,,,


In [429]:
qa_data= qa_data.dropna(subset=["question"])

In [430]:
qa_data.reset_index(drop=True, inplace=True)

In [431]:
qa_data=qa_data.drop(columns="ex_dummy")

In [432]:
qa_data=qa_data.drop(columns="dummy")

In [433]:
qa_data=qa_data.drop(columns="job_title")

In [434]:
qa_data=qa_data.drop(columns="utterance")

In [435]:
qa_data=qa_data.drop(columns="speaker")

In [436]:
qa_data.head(50)

Unnamed: 0,call_date,financial_quarter,source_file,category,question,answer,analyst,exective
0,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...","Okay. Thank you. On capital requirements, you ...","[Chis Hallam, Chris Hallam, Kian Abouhossein]",[Sergio P. Ermotti]
1,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,JPMorgan,Yeah. Thanks. Just two questions. The first on...,"So, Sarah, take the first question. I'll take ...","[Kian Abouhossein, Kian Abouhossein, Alastair ...","[Sergio P. Ermotti, Sarah Youngwood]"
2,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...","Thank you, Ryan. It is good to be back to inte...","[Alastair Ryan, Flora Bocahut]",[Sergio P. Ermotti]
3,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Jefferies,Yes. Good morning. The first question I wanted...,"So, on the first question in terms of the tren...","[Flora Bocahut, Andrew Coombs]","[Sergio P. Ermotti, Sarah Youngwood]"
4,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Citi,"Good morning. Two questions. Firstly, just on ...","So, on the first quarter or the first question...","[Andrew Coombs, Adam Terelak]",[Sarah Youngwood]
5,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Mediobanca,Morning. I've got two. One is a bit of a follo...,So on the LCR and more generally the funding p...,"[Adam Terelak, Adam Terelak, Jeremy Sigee]",[Sarah Youngwood]
6,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Exane,Morning. Thank you and welcome back to Sergio ...,"Thank you, Jeremy. Look, you know, the base pl...","[Jeremy Sigee, Anke Reingen]",[Sergio P. Ermotti]
7,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,RBC,Yeah. Thank you very much for taking my questi...,"So on the treasury share, what happened is we ...","[Anke Reingen, RBC, Amit Goel, Barclays]",[Sarah Youngwood]
8,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Barclays,"Hi. Thank you and welcome back, Sergio, too fr...","Sure. So, in terms of the cost savings and EPS...","[Amit Goel, Amit Goel, Tom Hallet]","[Sergio Ermotti, Sarah Youngwood]"
9,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,KBW,"Good morning, everyone. So, just a couple for ...","Thank you, Tom. Now, look, of course, the reve...","[Tom Hallet, Nicholas Payen]",[Sergio P. Ermotti]


In [437]:
#preprocessing data
preprocessor(qa_data, "question", "question_tokenised_data", "question_cleaned")
preprocessor(qa_data,"answer","answer_tokenised_data","answer_cleaned")

In [438]:
qa_data.head()

Unnamed: 0,call_date,financial_quarter,source_file,category,question,answer,analyst,exective,question_tokenised_data,question_cleaned,answer_tokenised_data,answer_cleaned
0,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ...","Okay. Thank you. On capital requirements, you ...","[Chis Hallam, Chris Hallam, Kian Abouhossein]",[Sergio P. Ermotti],"[chi, hallam, goldman, sachs, yes, good, morni...",[chis hallam goldman sachs yes good morning ev...,"[okay, thank, you, capital, requirement, know,...",[okay thank you capital requirements know situ...
1,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,JPMorgan,Yeah. Thanks. Just two questions. The first on...,"So, Sarah, take the first question. I'll take ...","[Kian Abouhossein, Kian Abouhossein, Alastair ...","[Sergio P. Ermotti, Sarah Youngwood]","[yeah, thanks, two, question, first, one, rela...",[yeah thanks two questions first one related b...,"[sarah, take, first, question, take, secondso,...",[so sarah take first question take secondso gi...
2,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...","Thank you, Ryan. It is good to be back to inte...","[Alastair Ryan, Flora Bocahut]",[Sergio P. Ermotti],"[yeah, thank, you, good, morning, welcome, bac...",[yeah thank you good morning welcome back serg...,"[thank, you, ryan, good, back, interact, well,...",[thank you ryan good back interact well look y...
3,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Jefferies,Yes. Good morning. The first question I wanted...,"So, on the first question in terms of the tren...","[Flora Bocahut, Andrew Coombs]","[Sergio P. Ermotti, Sarah Youngwood]","[yes, good, morning, first, question, wanted, ...",[yes good morning first question wanted ask re...,"[first, question, term, trend, april, really, ...",[so first question terms trends april really s...
4,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Citi,"Good morning. Two questions. Firstly, just on ...","So, on the first quarter or the first question...","[Andrew Coombs, Adam Terelak]",[Sarah Youngwood],"[good, morning, two, question, firstly, slide,...",[good morning two questions firstly slide 10 r...,"[first, quarter, first, question, seen, signif...",[so first quarter first question seen signific...


In [439]:
#rename column
qa_data.rename(columns={"financial_quarter":"Quarter","question":"Question", "category":"Analyst_Bank", "answer": "Response", "exective":"Executive", "question_cleaned": "Question_cleaned", "answer_cleaned": "Response_cleaned", "source_file":"filename"},inplace=True)

In [440]:
#reorganise column
qa_data=qa_data[["filename","Quarter","Question","Question_cleaned","Analyst_Bank","Response","Response_cleaned", "Executive"]]

In [441]:
qa_data.head()

Unnamed: 0,filename,Quarter,Question,Question_cleaned,Analyst_Bank,Response,Response_cleaned,Executive
0,1q23-earnings-call-remarks.pdf,1Q23,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",[chis hallam goldman sachs yes good morning ev...,Goldman,"Okay. Thank you. On capital requirements, you ...",[okay thank you capital requirements know situ...,[Sergio P. Ermotti]
1,1q23-earnings-call-remarks.pdf,1Q23,Yeah. Thanks. Just two questions. The first on...,[yeah thanks two questions first one related b...,JPMorgan,"So, Sarah, take the first question. I'll take ...",[so sarah take first question take secondso gi...,"[Sergio P. Ermotti, Sarah Youngwood]"
2,1q23-earnings-call-remarks.pdf,1Q23,"Yeah. Thank you. Good morning. Welcome back, S...",[yeah thank you good morning welcome back serg...,Bank of America,"Thank you, Ryan. It is good to be back to inte...",[thank you ryan good back interact well look y...,[Sergio P. Ermotti]
3,1q23-earnings-call-remarks.pdf,1Q23,Yes. Good morning. The first question I wanted...,[yes good morning first question wanted ask re...,Jefferies,"So, on the first question in terms of the tren...",[so first question terms trends april really s...,"[Sergio P. Ermotti, Sarah Youngwood]"
4,1q23-earnings-call-remarks.pdf,1Q23,"Good morning. Two questions. Firstly, just on ...",[good morning two questions firstly slide 10 r...,Citi,"So, on the first quarter or the first question...",[so first quarter first question seen signific...,[Sarah Youngwood]


# **Export the output as a csv file**

UBS QA section

In [442]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/ubs_qa_df_preprocessed.csv"
qa_data.to_csv(preprocessed_qa_csv_path3, index=False)
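
One thing to keep in mind when exporting: columns that hold Python lists (such as `Executive`) are written to CSV as their string representation, so they come back as text when the file is read again. The sketch below illustrates this with the standard-library `csv` module and toy values, on the assumption that `DataFrame.to_csv` stringifies list cells in the same way; `ast.literal_eval` is one way to restore them.

```python
import ast
import csv
import io

# toy row with a list-valued column (hypothetical names)
rows = [{"Quarter": "1Q23", "Executive": ["Exec X", "Exec Y"]}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Quarter", "Executive"])
writer.writeheader()
writer.writerows(rows)  # the list is written via str()

buffer.seek(0)
loaded = list(csv.DictReader(buffer))

raw = loaded[0]["Executive"]      # a string, no longer a list
restored = ast.literal_eval(raw)  # parsed back into a Python list

print(type(raw).__name__, raw)
print(type(restored).__name__, restored)
```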

UBS management discussion

In [443]:
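#export preprocessed data
#note: ubs_manag_df is not created anywhere in this notebook, so this cell fails unless the management discussion DataFrame is loaded first
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)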

NameError: name 'ubs_manag_df' is not defined
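
The error arises because the export refers to a variable that was never assigned. One way to make such a cell safe to run is to check for the name first. The helper below, `export_if_defined`, is hypothetical and only a sketch; it works with any object that offers a `to_csv(path, index=...)` method.

```python
def export_if_defined(namespace, name, path):
    """Write namespace[name] to CSV if it exists; report and skip otherwise."""
    frame = namespace.get(name)
    if frame is None:
        print(f"Skipping export: {name!r} is not defined")
        return False
    frame.to_csv(path, index=False)
    return True

# In the notebook this might be called as (hypothetical usage):
# export_if_defined(globals(), "ubs_manag_df", preprocessed_qa_csv_path4)
print(export_if_defined({}, "ubs_manag_df", "unused.csv"))
```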