<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_nlp_pipeline_v1_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Libraries**

In [140]:
!pip install PyPDF2



In [141]:
!pip install openai==0.28



In [142]:
!pip install python-dotenv



In [143]:
import re
import PyPDF2
import openai
import json
import pandas as pd
import time
import os
from dotenv import load_dotenv
from google.colab import userdata
from google.colab import drive

In [144]:
load_dotenv()
openai_api_key = os.getenv("Openai_key")

# **1. Data extraction**

In [145]:
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [146]:
def extract_sections_from_pdf(pdf_path):
    """
    Reads a PDF file and extracts text from all pages.
    Then segregates the text into two sections based on markers:
    - 'Management Discussion' section (md_section)
    - 'Question and Answer' section (QNA_section)

    Parameters:
        pdf_path (str): The file path to the PDF.

    Returns:
        md_section (str): Extracted text for the Management Discussion section.
        QNA_section (str): Extracted text for the Question and Answer section.
    """
    # Open the PDF file in binary mode
    with open(pdf_path, 'rb') as pdf_file:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        num_pages = len(pdf_reader.pages)

        # Extract text from all pages
        full_text = ""
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            full_text += page_text + "\n"

    # Optionally, clean up extra whitespace/newlines
    full_text = re.sub(r'\n+', '\n', full_text)
    print(full_text)

    # Convert text to lowercase for marker searching (you can retain original text for extraction)
    text_lower = full_text.lower()

    # Define markers for splitting sections
    # Adjust these markers if your PDF uses slightly different headings
    #md_marker = "First quarter 2024 results"
    md2_marker = "First quarter 2024 results"
    qna_marker = "Analyst Q&A (CEO and CFO)"

    # Find the starting index of each section in the text
    md_start = full_text.find(md2_marker)
    qna_start = full_text.find(qna_marker)

    if md_start == -1:
        raise ValueError("First quarter 2024 results marker not found in the PDF.")
    if qna_start == -1:
        raise ValueError("Analyst Q&A (CEO and CFO) marker not found in the PDF.")

    # Extract the sections based on the identified markers
    # We assume that the Management Discussion section comes first
    md_section = full_text[md_start:qna_start].strip()
    QNA_section = full_text[qna_start:].strip()

    return md_section, QNA_section




In [147]:
if __name__ == "__main__":
    # Specify the path to your PDF file
    pdf_path = "/content/drive/MyDrive/bank_of_england/data/raw/ubs/1q24-earnings-call-remarks.pdf"

    # Extract the sections
    md_section, QNA_section= extract_sections_from_pdf(pdf_path)

    # Optionally, convert the extracted text into datasets (e.g., as a list of lines)
    md_dataset = md_section.split("\n")
    qna_dataset = QNA_section.split("\n")

 
Page 1 of 11 
First quarter 2024 results   
7 May 2024 
 
Speeches  by Sergio P. Ermotti , Group  Chief  Executive  Officer , and Todd Tuckner , 
Group  Chief  Financial  Officer  
 
Including analyst Q&A session  
 
Transcript.  
Numbers for slides refer  to the first quarter 2024 results presentation . Materials and a webcast 
replay are available at www.ubs.com/investors   
 
 
Sergio P. Ermotti  
 
Slide 3 – Key messages  
Thank you, Sarah and good morning, everyone.  
A little over a year ago, we were asked to play a critical role in stabilizing the Swiss and global financial systems 
through the acquisition of Credit Suisse and we are delivering on our commitments.  
This quarter marks the return to reported net profits and capital accretion – a testament to the strength of our 
client franchises and significant progress on our integration plans.  
Reported net profit was 1.8 billion, with underlying PBT of 2.6 billion and an underlying return on CET1 capital 
of 9.6%.  
Our co

In [148]:
    # Display a sample from each dataset
    print("=== Management Discussion Section Sample ===")
    for line in md_dataset[:5]:
        print(line)

=== Management Discussion Section Sample ===
First quarter 2024 results   
7 May 2024 
 
Speeches  by Sergio P. Ermotti , Group  Chief  Executive  Officer , and Todd Tuckner , 
Group  Chief  Financial  Officer  


In [149]:
    print("\n=== Question and Answer Section Sample ===")
    for line in qna_dataset[:5]:
        print(line)


=== Question and Answer Section Sample ===
Analyst Q&A (CEO and CFO)  
 
Alastair Ryan, Bank of America  
Yeah. Thank you. Good morning. A billion dollar beat in the quarter. I never did quite get the hang of 
forecasting lark. Just on that then -- so non -core, very strong performance and appreciate the updated runoff 


In [150]:
# Convert md_dataset to DataFrame before saving
md_df = pd.DataFrame(md_dataset, columns=['Text'])
md_df.to_csv("MD_1Q23.csv", index=False)

In [151]:
qna_dataset

['Analyst Q&A (CEO and CFO)  ',
 ' ',
 'Alastair Ryan, Bank of America  ',
 'Yeah. Thank you. Good morning. A billion dollar beat in the quarter. I never did quite get the hang of ',
 'forecasting lark. Just on that then -- so non -core, very strong performance and appreciate the updated runoff ',
 "profile you give us on slide 6. Is there any reason that you're just reverting to natural runoff or can we expect ",
 "continued sales if markets stay favorable bec ause clearly there's quite a meaningful driver of the very ",
 'favorable capital ratio and the interactions of all of those.  ',
 'And then secondly, the project to improve the revenue to risk -weighted assets in wealth management, are ',
 "presumably, you wouldn't represent the Q1 performances kind of the payoff of that project is? It's too early ",
 "but just what's the profile of that project? How long is that repricing sitting on the net new asset generation ",
 'and has it started ? Thank you.  ',
 'Sergio P. Ermotti  ',
 

In [152]:
openai.api_key = userdata.get('Openai_key')

In [153]:
def convert_qna_dataset_to_text(qna_dataset):
    """
    Converts a list of Q&A pair strings into a single text string.

    Parameters:
        qna_dataset (list): List of strings, where each string represents a Q&A pair.

    Returns:
        str: A single string containing all Q&A pairs separated by newlines.
    """
    # Join the list elements using a newline separator
    text = "\n".join(qna_dataset)
    return text

# Example usage:
if __name__ == "__main__":
    # Convert the qna_dataset to a single text string
    qna_text = convert_qna_dataset_to_text(qna_dataset)
    print("Converted Q&A Text:")
    print(qna_text)

Converted Q&A Text:
Analyst Q&A (CEO and CFO)  
 
Alastair Ryan, Bank of America  
Yeah. Thank you. Good morning. A billion dollar beat in the quarter. I never did quite get the hang of 
forecasting lark. Just on that then -- so non -core, very strong performance and appreciate the updated runoff 
profile you give us on slide 6. Is there any reason that you're just reverting to natural runoff or can we expect 
continued sales if markets stay favorable bec ause clearly there's quite a meaningful driver of the very 
favorable capital ratio and the interactions of all of those.  
And then secondly, the project to improve the revenue to risk -weighted assets in wealth management, are 
presumably, you wouldn't represent the Q1 performances kind of the payoff of that project is? It's too early 
but just what's the profile of that project? How long is that repricing sitting on the net new asset generation 
and has it started ? Thank you.  
Sergio P. Ermotti  
Alastair, before I pass to Todd, 

In [154]:
sib_list = [
    "Agricultural Bank of China",
    "Bank of America",
    "Bank of China",
    "Bank of New York Mellon",
    "Barclays",
    "BNP Paribas",
    "China Construction Bank",
    "Citi",
    "Credit Suisse",
    "Deutsche Bank",
    "Goldman Sachs",
    "Groupe BPCE",
    "Group Crédit Agricole",
    "HSBC",
    "Industrial and Commercial Bank of China",
    "ING Bank",
    "JPMorgan",
    "Mitsubishi UFJ FG",
    "Mizuho FG",
    "Morgan Stanley",
    "Royal Bank of Canada",
    "Santander",
    "Société Générale",
    "Standard Chartered",
    "State Street",
    "Sumitomo Mitsui FG",
    "Toronto Dominion",
    "UBS",
    "Unicredit Group",
    "Wells Fargo"
]

In [155]:
def extract_Page_segments(qna_dataset):
    """
    Processes the qna_dataset to extract text between consecutive occurrences of the word Page.

    Steps:
      1. Check if the word Page exists in the text.
      2. Find all occurrences of Page.
      3. Extract text between each consecutive pair of Page occurrences.
      4. Create a DataFrame with two columns:
         - 'Question_Number': a count starting at 1.
         - 'Text': the text between consecutive occurrences.

    Parameters:
        qna_dataset (str): The input text containing multiple occurrences of Page.

    Returns:
        pd.DataFrame: A DataFrame with the extracted segments.
    """
    # Step 1: Check if 'Page' exists in the dataset.
    if "Page" not in qna_dataset:
        print("The word 'Page' is not found in the dataset.")
        return pd.DataFrame(columns=["Question_Number", "Text"])

    # Step 2: Find all occurrences of 'Page'
    matches = list(re.finditer(r"Page", qna_dataset))

    # Check if there are at least two occurrences to form a segment.
    if len(matches) < 2:
        print("Not enough occurrences of 'Page' to extract segments.")
        return pd.DataFrame(columns=["Question_Number", "Text"])

    segments = []
    # Step 3: Extract text between consecutive occurrences.
    for i in range(len(matches) - 1):
        # Get the end index of the current occurrence and start index of the next occurrence.
        start = matches[i].end()
        end = matches[i+1].start()
        segment_text = qna_dataset[start:end].strip()
        segments.append(segment_text)

    # Step 4: Create the DataFrame.
    df = pd.DataFrame({
        "Question_Number": list(range(1, len(segments) + 1)),
        "Text": segments
    })
    return df

In [156]:
if __name__ == "__main__":
    result_df = extract_Page_segments(qna_text)
    print(result_df)

   Question_Number                                               Text
0                1  14 of 24 \n And, and so, we would expect that ...
1                2  15 of 24 \n the return on CET1, but I would st...
2                3  16 of 24 \n Sergio P. Ermotti  \nThank you. Ye...
3                4  17 of 24 \n And of course, we also hope that t...
4                5  18 of 24 \n And then my second question is sor...
5                6  19 of 24 \n And as I highlighted, that's not a...
6                7  20 of 24 \n And then on the net new assets, th...
7                8  21 of 24 \n Benjamin Goy , Deutsche Bank  \nHi...
8                9  22 of 24 \n Todd Tuckner  \nHey Piers. In term...
9               10  23 of 24 \n that would be corrected to be upsi...


In [157]:
result_df

Unnamed: 0,Question_Number,Text
0,1,"14 of 24 \n And, and so, we would expect that ..."
1,2,"15 of 24 \n the return on CET1, but I would st..."
2,3,16 of 24 \n Sergio P. Ermotti \nThank you. Ye...
3,4,"17 of 24 \n And of course, we also hope that t..."
4,5,18 of 24 \n And then my second question is sor...
5,6,"19 of 24 \n And as I highlighted, that's not a..."
6,7,"20 of 24 \n And then on the net new assets, th..."
7,8,"21 of 24 \n Benjamin Goy , Deutsche Bank \nHi..."
8,9,22 of 24 \n Todd Tuckner \nHey Piers. In term...
9,10,23 of 24 \n that would be corrected to be upsi...


In [158]:
result_df.to_csv("QNA_output.csv", index=False)

In [189]:
def extract_bank(text):
    """
    This function sends a prompt to the GPT-4 Turbo model asking it to extract
    specific fields from the provided text. The model is expected to return a JSON
    with the following keys:
    - Name of the first person
    - Role of the first person
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said
    """
    prompt = f"""
    The text is conversation between two people. Please Extract the following information from the text below:

    - lists of bank name except "UBS"


    The output should have all text both the persons said in the text.

    Provide the response in JSON format with keys exactly as:
    "Name of bank"
    Text: {text}
    """
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts structured information from text."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}, # Set output to JSON format
            max_tokens=4000,  # Adjust tokens based on your text size
            temperature=0  # Keep it deterministic
        )
        content = response['choices'][0]['message']['content']
        # Attempt to parse the JSON response
        result = json.loads(content)
    except Exception as e:
        print(f"Error processing text: {e}")
        # Return a dictionary with None values in case of error
        result = {
            "Name of bank": None
        }
    return result

In [131]:
def extract_info(text):
    """
    This function sends a prompt to the GPT-4 Turbo model asking it to extract
    specific fields from the provided text. The model is expected to return a JSON
    with the following keys:
    - Name of the first person
    - Role of the first person
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said
    """
    prompt = f"""
    The text is conversation between two people. Please Extract the following information from the text below:

    - Name of the first person
    - Role of the first person
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said


    The output should have all text both the persons said in the text.

    Provide the response in JSON format with keys exactly as:
    "Name of the first person", "Role of the first person", "All text that the first person said", "Name of the second person", "Role of the second person", "All text that the second person said".

    Text: {text}
    """
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts structured information from text."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}, # Set output to JSON format
            max_tokens=4000,  # Adjust tokens based on your text size
            temperature=0  # Keep it deterministic
        )
        content = response['choices'][0]['message']['content']
        # Attempt to parse the JSON response
        result = json.loads(content)
    except Exception as e:
        print(f"Error processing text: {e}")
        # Return a dictionary with None values in case of error
        result = {
            "Name of the first person": None,
            "Role of the first person": None,
            "All text that the first person said": None,
            "Name of the second person": None,
            "Role of the second person": None,
            "All text that the second person said": None
        }
    return result

In [137]:
# List to store processed results
processed_results = []

# Loop through each row in result_df
for idx, row in result_df.iterrows():
    text = row['Text']
    info = extract_info(text)
    processed_results.append(info)
    # Optional: sleep to respect rate limits (adjust the delay as needed)
    time.sleep(1)

# Convert the list of dictionaries to a DataFrame
processed_df = pd.DataFrame(processed_results)

# Display the processed DataFrame
processed_df.head()

Unnamed: 0,Name of the first person,Role of the first person,All text that the first person said,Name of the second person,Role of the second person,All text that the second person said
0,Alastair Ryan,Bank of America,"Okay. Thank you . And Sergio, thank you.",Sergio P. Ermotti,,Sure. Pleasure.
1,Sergio P. Ermotti,Speaker,"Yes, Chris, first of all, of course, you know,...",Chris Hallam,Goldman Sachs,Right. Thank you.
2,Sergio P. Ermotti,Speaker,"Thank you. Yeah, very good question. Yeah, I t...",Giulia Aurora Miotto,Morgan Stanley,Yeah. Hi. Good morning. So two questions from ...
3,Giulia Miotto,Morgan Stanley,"June 2025 I meant, sorry.\nThanks.",Sergio P. Ermotti,,"That one is, I don't know about June 2025. I t..."
4,Sergio P. Ermotti,Speaker,"Well, let me take the first question is very ...",Andrew Coombs,Citigroup,"Good morning. Two questions please, bas ic fol..."


In [161]:
# List to store processed results
processed_results = []

# Loop through each row in result_df
for idx, row in result_df.iterrows():
    text = row['Text']
    info = extract_bank(text)
    processed_results.append(info)
    # Optional: sleep to respect rate limits (adjust the delay as needed)
    time.sleep(1)

# Convert the list of dictionaries to a DataFrame
processed_df = pd.DataFrame(processed_results)

# Display the processed DataFrame
processed_df.head()

Unnamed: 0,Name of bank
0,"[Bank of America, Goldman Sachs]"
1,"[Goldman Sachs, JP Morgan, UBS]"
2,"[UBS, Credit Suisse, JP Morgan, Morgan Stanley]"
3,"[Credit Suisse, Morgan Stanley, UBS, BNP Parib..."
4,"[BNP Paribas Exane, Citigroup]"


In [162]:
lists_of_banks = processed_df["Name of bank"].tolist()

In [167]:
flat_list_bank = [item for sublist in lists_of_banks for item in sublist]
removed_duplicate_bank = list(set(flat_list_bank))
removed_single_letter_bank = [item for item in removed_duplicate_bank if len(item) > 1]
print(removed_single_letter_bank)

['KBW', 'JP Morgan', 'BNP Paribas Exane', 'Deutsche Bank', 'HSBC', 'RBC', 'Bank of America', 'Credit Suisse', 'UBS', 'Goldman Sachs', 'Morgan Stanley', 'Citigroup']


In [187]:
def extract_bank_segments(qna_dataset):
    """
    Processes the qna_dataset to extract text between consecutive occurrences of the word Page.

    Steps:
      1. Check if the word Page exists in the text.
      2. Find all occurrences of Page.
      3. Extract text between each consecutive pair of Page occurrences.
      4. Create a DataFrame with two columns:
         - 'Question_Number': a count starting at 1.
         - 'Text': the text between consecutive occurrences.

    Parameters:
        qna_dataset (str): The input text containing multiple occurrences of Page.

    Returns:
        pd.DataFrame: A DataFrame with the extracted segments.
    """

    # Step 1: Check if 'Page' exists in the dataset.
    for bank in removed_single_letter_bank:
      print(bank)
      if bank not in qna_dataset:
          print("The word 'Bank' is not found in the dataset.")
          return pd.DataFrame(columns=["Question_Number", "Text"])

    # Step 2: Find all occurrences of 'Bank'
    matches = list(qna_dataset.find(bank))
    print(len(matches))

    # Check if there are at least two occurrences to form a segment.
    if len(matches) < 2:
        print("Not enough occurrences of 'Bank' to extract segments.")
        return pd.DataFrame(columns=["Question_Number", "Text"])

    segments = []
    # Step 3: Extract text between consecutive occurrences.
    for i in range(len(matches) - 1):
        # Get the end index of the current occurrence and start index of the next occurrence.
        start = matches[i].end()
        end = matches[i+1].start()
        segment_text = qna_dataset[start:end].strip()
        segments.append(segment_text)

    # Step 4: Create the DataFrame.
    df = pd.DataFrame({
        "Question_Number": list(range(1, len(segments) + 1)),
        "Text": segments
    })
    return df
    if len(matches) < 2:
        print("Not enough occurrences of 'Bank' to extract segments.")
        return pd.DataFrame(columns=["Question_Number", "Text"])

    segments = []
    # Step 3: Extract text between consecutive occurrences.
    for i in range(len(matches) - 1):
        # Get the end index of the current occurrence and start index of the next occurrence.
        start = matches[i].end()
        end = matches[i+1].start()
        segment_text = qna_dataset[start:end].strip()
        segments.append(segment_text)

    # Step 4: Create the DataFrame.
    df = pd.DataFrame({
        "Question_Number": list(range(1, len(segments) + 1)),
        "Text": segments
    })
    return df
    if len(matches) < 2:
        print("Not enough occurrences of 'Bank' to extract segments.")
        return pd.DataFrame(columns=["Question_Number", "Text"])

    segments = []
    # Step 3: Extract text between consecutive occurrences.
    for i in range(len(matches) - 1):
        # Get the end index of the current occurrence and start index of the next occurrence.
        start = matches[i].end()
        end = matches[i+1].start()
        segment_text = qna_dataset[start:end].strip()
        segments.append(segment_text)

    # Step 4: Create the DataFrame.
    df = pd.DataFrame({
        "Question_Number": list(range(1, len(segments) + 1)),
        "Text": segments
    })
    return df

In [174]:
qna_text

"Analyst Q&A (CEO and CFO)  \n \nAlastair Ryan, Bank of America  \nYeah. Thank you. Good morning. A billion dollar beat in the quarter. I never did quite get the hang of \nforecasting lark. Just on that then -- so non -core, very strong performance and appreciate the updated runoff \nprofile you give us on slide 6. Is there any reason that you're just reverting to natural runoff or can we expect \ncontinued sales if markets stay favorable bec ause clearly there's quite a meaningful driver of the very \nfavorable capital ratio and the interactions of all of those.  \nAnd then secondly, the project to improve the revenue to risk -weighted assets in wealth management, are \npresumably, you wouldn't represent the Q1 performances kind of the payoff of that project is? It's too early \nbut just what's the profile of that project? How long is that repricing sitting on the net new asset generation \nand has it started ? Thank you.  \nSergio P. Ermotti  \nAlastair, before I pass to Todd, I want

In [188]:
extract_bank_segments(qna_text)

KBW
JP Morgan
BNP Paribas Exane
Deutsche Bank
HSBC
RBC
Bank of America
Credit Suisse
UBS
Goldman Sachs
The word 'Bank' is not found in the dataset.


Unnamed: 0,Question_Number,Text


In [171]:
processed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name of bank  10 non-null     object
dtypes: object(1)
memory usage: 212.0+ bytes


In [None]:
processed_df.to_csv("QNA_4Q24.csv", index=False)