<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing_ubs_original.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [145]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-26
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Segmenting text by bank name and analyst name.
    - Pairing questions and answers with GPT-4.
    - Cleaning the text.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [146]:
!pip install openai==0.28



In [147]:
!pip install python-dotenv



In [148]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from google.colab import drive
import openai
import json
from dotenv import load_dotenv
import os
from google.colab import userdata
import time

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

`preprocessor` function: modifies the DataFrame `data` in place, adding two new columns (`col1` and `col2`) that hold preprocessed versions of the text in `col`.


Input:
  - `data`: the DataFrame to modify
  - `col`: name of the column which contains the text to clean
  - `col1`: name of the new column which holds the tokenized text
  - `col2`: name of the new column which holds the cleaned text

In [149]:
#create function to preprocess data
def preprocessor (data, col, col1,col2):
  #Copy the source column into the two output columns
  data[col1]=data[col]
  data[col2]=data[col]


  #Adding column1 (tokenized text)
  #Lower the lettercase
  data[col1] = data[col1].str.lower()

  #Remove stop words
  stop_words = set(stopwords.words("english"))
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #Tokenize the word
  data[col1] = data[col1].apply(nltk.word_tokenize)

  #Remove numbers
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #remove symbols
  data[col1] = data[col1].apply (lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #remove empty and short tokens (done after symbol removal, which can leave empty strings)
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word)>2])

  #lemmatization
  lemmatizer = WordNetLemmatizer()
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



  #Adding column2 (cleaned text)
  #Lower the lettercase
  data[col2] = data[col2].str.lower()

  #Remove stop words
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in (stop_words)]))

  #remove symbols (keep the result as a string instead of wrapping it in a list)
  data[col2] = data[col2].apply (lambda x: re.sub(r"[.,'?]", "", x))

  return
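
As a rough illustration of the token-cleaning steps in `preprocessor`, the sketch below applies comparable steps to a single string using only the standard library. The `clean_tokens` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook itself relies on NLTK's English stop-word list, tokenizer and lemmatizer), so this is a sketch under those assumptions rather than the notebook's implementation.

```python
import re

# Tiny illustrative subset, NOT NLTK's English stop-word list (assumption for this sketch)
STOP_WORDS = {"the", "on", "we", "you", "a", "to"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, drop stop words, strip non-letters, then drop short tokens."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in stop_words]
    # Stripping symbols and digits can leave empty strings such as "" from "74%!"
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
    # Filtering by length last also removes those empty strings
    return [t for t in tokens if len(t) > 2]

print(clean_tokens("On capital requirements, we give you the 74%!"))
```

Running the length filter after the symbol removal is a design choice in this sketch: it means tokens that become empty once symbols are stripped do not survive into the output.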


In [150]:
# Function to extract names
def extract_name(full_string):
    return full_string.split(',')[0]
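
A quick check of what `extract_name` returns may help here. The sample strings below are illustrative inputs modelled on the "Name, Bank" markers in the transcript, not values taken from a specific row.

```python
def extract_name(full_string):
    # Same logic as the notebook helper: keep everything before the first comma
    return full_string.split(',')[0]

print(extract_name("Kian Abouhossein, JPMorgan"))  # text before the comma
print(extract_name("Unknown"))                     # no comma, so the input comes back unchanged
```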

## **Pairing Question and Answer**

In [151]:
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [152]:
#Defining qa_data
qa_data = pd.read_csv("/content/drive/MyDrive/bank_of_england/data/processed/ubs_qna_section.csv")
qa_data.head()

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf


In [153]:
"""
Defining the list of executive names from UBS
"""
exective_name = ["Sergio P. Ermotti","Sarah Youngwood"]
exect_name = r'\b(Sergio P\. Ermotti|Sarah Youngwood)\b'


In [154]:
"""
This code searches the 'utterance' column of a DataFrame for an analyst or executive name followed by a bank name (for example "Kian Abouhossein, JPMorgan") and stores any matches in a new 'dummy' column. If no matches are found, the 'dummy' column remains None for that row.

"""
pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+, [A-Z][A-Za-z]+'
qa_data["dummy"]=None
for i in range(len(qa_data)):
  matches = re.findall(pattern, str(qa_data['utterance'][i]))
  if matches:
    qa_data.at[i, 'dummy'] = matches
  else:
    continue
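
To make the behaviour of `pattern` concrete, the sketch below runs it against a few short strings modelled on the utterances shown in this notebook. It uses the standard-library `re` module (the notebook imports `regex as re`; for this pattern the two should behave the same), and the sample strings are shortened for illustration.

```python
import re

pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+, [A-Z][A-Za-z]+'

samples = [
    "Chis Hallam, Goldman Sachs Yes. Good morning, everybody.",
    "Very clear. Thanks. Kian Abouhossein, JPMorgan Yeah. Thanks.",
    "Okay. Thank you. On capital requirements, nothing has changed.",
]

# One list of matches per sample string
found = [re.findall(pattern, s) for s in samples]
for sample, matches in zip(samples, found):
    print(matches, "<-", sample)
```

Note that the last group of the pattern matches a single word, so only the first word of a multi-word bank name is captured ("Goldman" rather than "Goldman Sachs"), and a name with a middle initial such as "Sergio P. Ermotti" is not matched at all.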


In [155]:
qa_data.head(50)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
5,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
6,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
7,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
8,Sergio P. Ermotti,,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
9,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,


In [156]:
"""
This code counts the rows with non-null values in the 'dummy' column (each of which is expected to be split into an additional row) and the total number of rows expected in the DataFrame after splitting.
"""
additional_row = len(qa_data) - qa_data["dummy"].isnull().sum()
print("The number of new rows: ", additional_row)
total = additional_row + len(qa_data)
print("Total number of rows: ", total)

The number of new rows:  56
Total number of rows:  365


In [157]:
"""
This code searches for Analyst/Executive name and their bank name in the 'utterance' column of a DataFrame, updates the 'speaker' column for the first row with matches, and splits the 'utterance' and inserts a new row for subsequent matches. If no matches are found, the loop continues to the next row.
"""
# I have to loop over parts1[0] so that it catches the second bank as well.


for  i in qa_data.index:
  # NOTE: the analyst-pattern result below is overwritten by the executive-name search on the next line,
  # so only executive names drive the branches that follow.
  matches = re.findall(pattern, str(qa_data['utterance'][i]))
  matches = re.findall(exect_name, str(qa_data['utterance'][i]))
  if matches and i==0:
    qa_data.at[i, 'speaker'] = matches
  elif matches:
    # Keep only the text before the first matched name
    new_index=i+0.5
    parts1 = [part.strip() for part in qa_data['utterance'][i].split(matches[0])]
    qa_data.at[i, 'utterance'] = parts1[0]
  elif matches:
    # NOTE: unreachable as written, because the branch above has the same condition,
    # so the row insertion below never runs.
    new_index=i+0.5
    parts1 = [part.strip() for part in qa_data['utterance'][i].split(matches[0])]
    qa_data.at[i, 'utterance'] = parts1[0]
    qa_data.loc[new_index] = {"speaker":matches,"job_title":matches,"utterance":parts1[1], "call_date":qa_data["call_date"][i], "financial_quarter":qa_data["financial_quarter"][i],"source_file":qa_data["source_file"][i], "dummy":None}
  else:
    continue
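
The split-and-insert step described in the docstring above can be sketched on plain Python data. This is a hypothetical illustration, not the notebook's method: `split_embedded_speakers` is an invented helper, `rows` is a list of dicts standing in for `qa_data`, and it assumes every "Name, Bank" marker found inside an utterance starts a new speaker turn.

```python
import re

# Same analyst pattern as the notebook: "First Last, Bank"
PATTERN = r'\b[A-Z][a-z]+ [A-Z][a-z]+, [A-Z][A-Za-z]+'

def split_embedded_speakers(rows, pattern=PATTERN):
    """Return a new list in which each marker inside an utterance starts a new row."""
    out = []
    for row in rows:
        text = str(row["utterance"])
        current = dict(row)          # copy so the input rows are not modified
        cursor = 0
        for m in re.finditer(pattern, text):
            before = text[cursor:m.start()].strip()
            if before:
                # Text before the marker stays with the current speaker
                current["utterance"] = before
                out.append(current)
            name, bank = [p.strip() for p in m.group().split(",", 1)]
            # Text after the marker belongs to the newly named speaker
            current = dict(row, speaker=name, job_title=bank)
            cursor = m.end()
        current["utterance"] = text[cursor:].strip()
        out.append(current)
    return out

rows = [{
    "speaker": "Chris Hallam",
    "job_title": "Goldman Sachs",
    "utterance": "Very clear. Thanks. Kian Abouhossein, JPMorgan Yeah. Thanks. Just two questions.",
}]
for r in split_embedded_speakers(rows):
    print(r["speaker"], "|", r["job_title"], "|", r["utterance"])
```

Building a new list avoids inserting rows into a DataFrame while iterating over it, which is why the sketch does not need fractional index values such as `i+0.5`.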

In [158]:
#Reset index due to new rows
qa_data=qa_data.sort_index().reset_index(drop=True)
qa_data.head(50)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
5,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
6,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
7,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
8,Sergio P. Ermotti,,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
9,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,


In [159]:
"""
This code checks if the 'speaker' column contains a list, splits the first element of the list at the comma, and updates the 'speaker' and 'job_title' columns accordingly. If the 'speaker' is not a list, the loop continues to the next row.
"""

for i in range(len(qa_data)):
  if isinstance(qa_data['speaker'][i], list):
    parts = [part.strip() for part in qa_data['speaker'][i][0].split(',')]
    qa_data.at[i, 'speaker'] = parts[0]
    qa_data.at[i, 'job_title'] = parts[1]

  else:
    continue


In [160]:
qa_data.head(60)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,dummy
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Chis Hallam, Goldman]"
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
5,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,"[Kian Abouhossein, JPMorgan]"
6,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
7,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
8,Sergio P. Ermotti,,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
9,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,


In [165]:
"""
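#Manual adjustments!!
"""
#Rebuild the full text of row 55 (speaker + job_title + utterance), store it as the utterance of row 54, then drop row 55
text = str(qa_data["speaker"][55]) + str(qa_data["job_title"][55])+str(qa_data["utterance"][55])
qa_data.at[54,"utterance"]=text
qa_data=qa_data.drop(index=55)
qa_data.reset_index(drop=True, inplace=True)
#The helper column 'dummy' is no longer needed
qa_data=qa_data.drop(columns="dummy")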


In [166]:
qa_data.head(50)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file
0,Unknown,,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
1,Sergio P. Ermotti,,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
3,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
4,Sarah Youngwood,,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
5,Sarah Youngwood,,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
6,Sergio P. Ermotti,,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
7,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
8,Sergio P. Ermotti,,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf
9,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf


In [167]:
qa_data["category"]=None
current_bank=qa_data["job_title"][0]
for i in range(len(qa_data)):
  job_title=qa_data["job_title"][i]
  #A non-missing job_title that differs from the current bank starts a new analyst block;
  #rows with a missing job_title (UBS speakers) inherit the current bank.
  #pd.isna replaces the identity check against np.NaN, an alias that was removed in NumPy 2.0.
  if not pd.isna(job_title) and job_title!=current_bank:
    current_bank=job_title
  qa_data.at[i,"category"]=current_bank
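
The category assignment above is essentially a forward fill of the analyst's bank over the UBS rows that follow. A minimal sketch on a plain list, with `None` standing in for a missing `job_title` and a hypothetical `propagate_bank` helper, looks like this:

```python
def propagate_bank(job_titles):
    """Carry the most recent non-missing job_title forward over missing entries."""
    categories = []
    current = None
    for job_title in job_titles:
        if job_title is not None and job_title != current:
            current = job_title      # a new analyst bank starts a new block
        categories.append(current)   # missing entries inherit the current bank
    return categories

titles = [None, None, "Goldman Sachs", None, None, "Bank of America", None, "Bank of America"]
print(propagate_bank(titles))
```

In pandas, `qa_data["job_title"].ffill()` should give a comparable result on the real column; that equivalence is an assumption of this note and has not been checked against the notebook's data.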


In [168]:
#Manual adjust
qa_data.at[2,"category"]="Goldman"

In [169]:
#Fill missing job titles with "UBS"
qa_data["job_title"] = qa_data["job_title"].fillna("UBS")

In [170]:
qa_data.head(50)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,category
0,Unknown,UBS,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
1,Sergio P. Ermotti,UBS,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman
3,Sergio P. Ermotti,UBS,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs
4,Sarah Youngwood,UBS,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs
5,Sarah Youngwood,UBS,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs
6,Sergio P. Ermotti,UBS,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs
7,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America
8,Sergio P. Ermotti,UBS,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America
9,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America


In [171]:
qa_data["Text"]=None
text=""
for i in range(len(qa_data)):
  if i==0:
    text += str(qa_data["utterance"][i])
  elif qa_data["category"][i] != qa_data["category"][i-1]:
    #Category changed: store the finished block on the last row of the previous category
    qa_data.at[i-1,"Text"]=text
    text=""
    text += str(qa_data["speaker"][i]) + ", "
    text += str(qa_data["job_title"][i]) +" "
    text += str(qa_data["utterance"][i])
  else:
    text += str(qa_data["speaker"][i]) + ", "
    text += str(qa_data["job_title"][i]) + " "
    text += str(qa_data["utterance"][i])
#Store the final block, which the loop itself never writes
if len(qa_data) > 0:
  qa_data.at[len(qa_data)-1,"Text"]=text
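
The block-building loop above groups consecutive rows that share a category and joins them into one text. A minimal sketch of the same idea with `itertools.groupby` follows. The `build_blocks` helper and the sample rows are hypothetical, and two details differ from the notebook: the sketch joins turns with a space, and rows whose category is `None` are grouped together, whereas NaN categories in the DataFrame compare as unequal and so each form their own block.

```python
from itertools import groupby

def build_blocks(rows):
    """Join consecutive rows with the same category into one 'speaker, job_title utterance' text."""
    blocks = []
    for category, group in groupby(rows, key=lambda r: r["category"]):
        parts = [f'{r["speaker"]}, {r["job_title"]} {r["utterance"]}' for r in group]
        blocks.append({"category": category, "Text": " ".join(parts)})
    return blocks

rows = [
    {"speaker": "Alastair Ryan", "job_title": "Bank of America", "utterance": "Yeah. Thank you.", "category": "Bank of America"},
    {"speaker": "Sergio P. Ermotti", "job_title": "UBS", "utterance": "Thank you, Ryan.", "category": "Bank of America"},
    {"speaker": "Flora Bocahut", "job_title": "Jefferies", "utterance": "Yes. Good morning.", "category": "Jefferies"},
]
for block in build_blocks(rows):
    print(block["category"], "->", block["Text"])
```

Because `groupby` emits every group, including the last one, this form does not need a separate step to store the final block.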

In [172]:
qa_data.head(50)

Unnamed: 0,speaker,job_title,utterance,call_date,financial_quarter,source_file,category,Text
0,Unknown,UBS,"Chis Hallam, Goldman Sachs Yes. Good morning, ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"Chis Hallam, Goldman Sachs Yes. Good morning, ..."
1,Sergio P. Ermotti,UBS,"Okay. Thank you. On capital requirements, you ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"Sergio P. Ermotti, UBS Okay. Thank you. On cap..."
2,Chris Hallam,Goldman Sachs,"Very clear. Thanks. Kian Abouhossein, JPMorgan...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman,"Chris Hallam, Goldman Sachs Very clear. Thanks..."
3,Sergio P. Ermotti,UBS,"So, Sarah, take the first question. I'll take ...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs,
4,Sarah Youngwood,UBS,"So, when we give you the 74%, we focused inten...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs,
5,Sarah Youngwood,UBS,That's right. The initial PPA comes into the C...,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs,
6,Sergio P. Ermotti,UBS,"So on client retention, I – maybe let me reite...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs,"Sergio P. Ermotti, UBS So, Sarah, take the fir..."
7,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,
8,Sergio P. Ermotti,UBS,"Thank you, Ryan. It is good to be back to inte...",25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,
9,Alastair Ryan,Bank of America,Thank you.,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,"Alastair Ryan, Bank of America Yeah. Thank you..."


In [173]:
qa_data.drop(columns=["job_title","utterance"],inplace=True)
filtered_data = qa_data.dropna(subset="Text")

In [174]:
filtered_data.head()

Unnamed: 0,speaker,call_date,financial_quarter,source_file,category,Text
0,Unknown,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"Chis Hallam, Goldman Sachs Yes. Good morning, ..."
1,Sergio P. Ermotti,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"Sergio P. Ermotti, UBS Okay. Thank you. On cap..."
2,Chris Hallam,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman,"Chris Hallam, Goldman Sachs Very clear. Thanks..."
6,Sergio P. Ermotti,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs,"Sergio P. Ermotti, UBS So, Sarah, take the fir..."
9,Alastair Ryan,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,"Alastair Ryan, Bank of America Yeah. Thank you..."


In [175]:
filtered_data.reset_index(drop=True, inplace=True)

In [176]:
filtered_data.head(20)

Unnamed: 0,speaker,call_date,financial_quarter,source_file,category,Text
0,Unknown,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"Chis Hallam, Goldman Sachs Yes. Good morning, ..."
1,Sergio P. Ermotti,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,,"Sergio P. Ermotti, UBS Okay. Thank you. On cap..."
2,Chris Hallam,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman,"Chris Hallam, Goldman Sachs Very clear. Thanks..."
3,Sergio P. Ermotti,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman Sachs,"Sergio P. Ermotti, UBS So, Sarah, take the fir..."
4,Alastair Ryan,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,"Alastair Ryan, Bank of America Yeah. Thank you..."
5,Flora Bocahut,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Jefferies,"Flora Bocahut, Jefferies Yes. Good morning. Th..."
6,Andrew Coombs,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Citi,"Andrew Coombs, Citi Good morning. Two question..."
7,Adam Terelak,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Mediobanca,"Adam Terelak, Mediobanca Morning. I've got two..."
8,Sergio Ermotti,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Exane,"Jeremy Sigee, Exane Morning. Thank you and wel..."
9,Tom Hallet,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,KBW,"Tom Hallet, KBW Good morning, everyone. So, ju..."


In [119]:
filtered_data.to_csv("/content/ubs_qna_sectionver1.csv", index=False)

In [55]:
openai.api_key = userdata.get('Openai_key')

In [114]:
def extract_info(text):
    """
    This function sends a prompt to the GPT-4 Turbo model asking it to extract
    specific fields from the provided text. The model is expected to return a JSON
    with the following keys:
    - Name of the first person
    - Name of bank
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said
    """
    prompt = f"""
    The text is a conversation between two people. Please extract the following information from the text below:


    - Name of the first person
    - Name of bank
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said


    The output should contain all the text that both people said.

    Provide the response in JSON format with keys exactly as:
    "Name of the first person", "Name of bank", "All text that the first person said", "Name of the second person", "Role of the second person", "All text that the second person said".

    Text: {text}
    """
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts structured information from text."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}, # Set output to JSON format
            max_tokens=4000,  # Adjust tokens based on your text size
            temperature=0  # Keep it deterministic
        )
        content = response['choices'][0]['message']['content']
        # Attempt to parse the JSON response
        result = json.loads(content)
    except Exception as e:
        print(f"Error processing text: {e}")
        # Return a dictionary with None values in case of error
        result = {
            "Name of the first person": None,
            "Name of bank": None,
            "All text that the first person said": None,
            "Name of the second person": None,
            "Role of the second person": None,
            "All text that the second person said": None
        }
    return result
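
Since the model's JSON may omit a key or fail to parse, it can help to normalise each response before building the DataFrame. The sketch below is a hypothetical helper (`parse_extraction` is not part of the notebook) that uses only the standard library and the six key names requested in the prompt. It makes no API call.

```python
import json

# The six keys requested in the prompt of extract_info
EXPECTED_KEYS = [
    "Name of the first person",
    "Name of bank",
    "All text that the first person said",
    "Name of the second person",
    "Role of the second person",
    "All text that the second person said",
]

def parse_extraction(content):
    """Parse a JSON string and return a dict with exactly the expected keys.

    Missing keys, invalid JSON and non-object JSON all yield None values.
    """
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: data.get(key) for key in EXPECTED_KEYS}

print(parse_extraction('{"Name of the first person": "Chris Hallam"}'))
print(parse_extraction("not json"))
```

Returning the same keys for every row keeps the columns of the resulting DataFrame consistent, even when some responses are incomplete.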

In [116]:
# List to store processed results
processed_results = []

# Loop through each row in filtered_data
for idx, row in filtered_data.iterrows():
    text = row['Text']
    info = extract_info(text)
    processed_results.append(info)
    # Optional: sleep to respect rate limits (adjust the delay as needed)
    time.sleep(1)

# Convert the list of dictionaries to a DataFrame
processed_df = pd.DataFrame(processed_results)

# Display the processed DataFrame
processed_df.head(20)

Error processing text: Incorrect API key provided: sk-proj-********************************************************************************************************************************************************fd8A. You can find your API key at https://platform.openai.com/account/api-keys.

Unnamed: 0,Name of the first person,All text that the first person said,Name of the second person,Name of bank,All text that the second person said,Role of the second person
0,Chris Hallam,"Yes. Good morning, everybody. Firstly, on the ...",Sergio P. Ermotti,UBS,"Okay. Thank you. On capital requirements, you ...",
1,Kian Abouhossein,Yeah. Thanks. Just two questions. The first on...,Sergio P. Ermotti,UBS,"So, Sarah, take the first question. I'll take ...",
2,Alastair Ryan,"Yeah. Thank you. Good morning. Welcome back, S...",Sergio P. Ermotti,Bank of America,"Thank you, Ryan. It is good to be back to inte...",UBS
3,Flora Bocahut,Yes. Good morning. The first question I wanted...,Sarah Youngwood,UBS,"So, on the first question in terms of the tren...",
4,Andrew Coombs,"Good morning. Two questions. Firstly, just on ...",Sarah Youngwood,Citi,"So, on the first quarter or the first question...",UBS
5,Adam Terelak,Morning. I've got two. One is a bit of a follo...,Sarah Youngwood,UBS,So on the LCR and more generally the funding p...,UBS
6,Jeremy Sigee,Morning. Thank you and welcome back to Sergio ...,Sergio P. Ermotti,UBS,"Thank you, Jeremy. Look, you know, the base pl...",
7,Anke Reingen,Yeah. Thank you very much for taking my questi...,Sarah Youngwood,UBS,"So on the treasury share, what happened is we ...",
8,Amit Goel,"Okay. Thanks. And just on the cost savings, is...",Benjamin Goy,Barclays,"Yes. Hi. Good morning. Two questions, please. ...",Deutsche Bank
9,Tom Hallet,"Good morning, everyone. So, just a couple for ...",Sergio P. Ermotti,Credit Suisse,"Thank you, Tom. Now, look, of course, the reve...",UBS
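
The processing loop above waits a fixed second between requests. One common alternative for transient failures is to retry with an increasing delay. The sketch below is a generic, hypothetical helper (`call_with_retry` is not part of the notebook or of the OpenAI library), and it assumes the wrapped function raises on failure. `extract_info` as written catches its own exceptions, so it would need to re-raise for a wrapper like this to have any effect.

```python
import time

def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Waits base_delay * 2**attempt seconds between attempts and re-raises
    the last error once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Illustrative use with a function that fails twice before succeeding
attempts = {"count": 0}

def flaky_upper(value):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return value.upper()

# A short base_delay keeps the demonstration fast
print(call_with_retry(flaky_upper, "ok", base_delay=0.01))
```

The `sleep` parameter is injectable so the delay can be replaced in tests. A permanent error such as an invalid API key would still fail after the final attempt.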


# **Export the output as a csv file**

UBS QA section

In [117]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/ubs_qa_df_preprocessed.csv"
processed_df.to_csv(preprocessed_qa_csv_path3, index=False)

UBS management discussion

In [None]:
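#export preprocessed data
#NOTE: ubs_manag_df is not defined in any cell of this notebook, so this cell raises a NameError unless the management-discussion DataFrame is created first.
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)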