<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing_ubs_gpt4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Libraries**

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [2]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.61.1
    Uninstalling openai-1.61.1:
      Successfully uninstalled openai-1.61.1
Successfully installed openai-0.28.0


In [3]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [4]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [5]:
import re
import PyPDF2
import openai
import json
import pandas as pd
import time
import os
from dotenv import load_dotenv
from google.colab import userdata
from google.colab import drive

In [6]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process




In [7]:
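# Load variables from a local .env file, if one is present
load_dotenv()
# Read the OpenAI key from the environment
openai_api_key = os.getenv("Openai_key")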

# **1. Data extraction**

In [8]:
drive.mount("/content/drive")

Mounted at /content/drive


In [18]:
qa_data = pd.read_csv("/content/ubs_qna_sec.csv")

In [19]:
qa_data.head()

Unnamed: 0,speaker,call_date,financial_quarter,source_file,category,Text
0,Chris Hallam,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Goldman,"Chis Hallam, Goldman Sachs Yes. Good morning, ..."
1,Sergio P. Ermotti,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,JPMorgan,"Kian Abouhossein, JPMorgan Yeah. Thanks. Just ..."
2,Alastair Ryan,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Bank of America,"Alastair Ryan, Bank of America Yeah. Thank you..."
3,Flora Bocahut,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Jefferies,"Flora Bocahut, Jefferies Yes. Good morning. Th..."
4,Andrew Coombs,25 April 2023,1Q23,1q23-earnings-call-remarks.pdf,Citi,"Andrew Coombs, Citi Good morning. Two question..."


In [13]:
openai.api_key = userdata.get('Openai_key')

In [21]:
def extract_info(text):
    """
    Send a prompt to the GPT-4 Turbo model asking it to extract specific fields
    from the provided text. The model is expected to return a JSON object with
    the following keys:
    - Name of the first person
    - Name of bank
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said

    On any error, a dictionary of None values is returned instead.
    """
    prompt = f"""
    The text is a conversation between two people. Please extract the following information from the text below:

    - Name of the first person
    - Name of bank
    - All text that the first person said
    - Name of the second person
    - Role of the second person
    - All text that the second person said

    The output should include all of the text that both people said.

    Provide the response in JSON format with keys exactly as:
    "Name of the first person", "Name of bank", "All text that the first person said", "Name of the second person", "Role of the second person", "All text that the second person said".

    Text: {text}
    """
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts structured information from text."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"}, # Set output to JSON format
            max_tokens=4000,  # Adjust tokens based on your text size
            temperature=0  # Keep it deterministic
        )
        content = response['choices'][0]['message']['content']
        # Attempt to parse the JSON response
        result = json.loads(content)
    except Exception as e:
        print(f"Error processing text: {e}")
        # Return a dictionary with None for every requested key in case of error
        result = {
            "Name of the first person": None,
            "Name of bank": None,
            "All text that the first person said": None,
            "Name of the second person": None,
            "Role of the second person": None,
            "All text that the second person said": None
        }
    return result
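
`extract_info` catches every exception and returns a row of `None` values, so a transient rate-limit or network error silently costs a row. One possible mitigation, sketched here under assumptions, is to retry the raw API call with exponential backoff before giving up. The helper name `call_with_backoff` and its parameters (`max_retries`, `base_delay`, `sleep`) are hypothetical and not part of the notebook or of the `openai` package.

```python
import random
import time


def call_with_backoff(func, *args, max_retries=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    Hypothetical helper: the delay doubles after each failed attempt and a
    small random jitter is added so that retries do not line up exactly.
    """
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Re-raise once the final attempt has failed.
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

To have any effect, the wrapper would need to go around the `openai.ChatCompletion.create(...)` call inside the `try` block rather than around `extract_info` itself, because `extract_info` never raises.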

In [22]:
# List to store processed results
processed_results = []

# Loop through each row in qa_data
for idx, row in qa_data.iterrows():
    text = row['Text']
    info = extract_info(text)
    processed_results.append(info)
    # Optional: sleep to respect rate limits (adjust the delay as needed)
    time.sleep(1)

# Convert the list of dictionaries to a DataFrame
processed_df = pd.DataFrame(processed_results)

# Display the processed DataFrame
processed_df.head(20)

Unnamed: 0,Name of the first person,Name of bank,All text that the first person said,Name of the second person,Role of the second person,All text that the second person said
0,Chris Hallam,Goldman Sachs,"Yes. Good morning, everybody. Firstly, on the ...",Sergio P. Ermotti,UBS,"Okay. Thank you. On capital requirements, you ..."
1,Kian Abouhossein,UBS,Yeah. Thanks. Just two questions. The first on...,Sergio P. Ermotti,,"So, Sarah, take the first question. I'll take ..."
2,Alastair Ryan,Bank of America,"Yeah. Thank you. Good morning. Welcome back, S...",Sergio P. Ermotti,UBS,"Thank you, Ryan. It is good to be back to inte..."
3,Flora Bocahut,UBS,Yes. Good morning. The first question I wanted...,Sarah Youngwood,,"So, on the first question in terms of the tren..."
4,Andrew Coombs,Citi,"Good morning. Two questions. Firstly, just on ...",Sarah Youngwood,UBS,"So, on the first quarter or the first question..."
5,Adam Terelak,Mediobanca,Morning. I've got two. One is a bit of a follo...,Sarah Youngwood,UBS,So on the LCR and more generally the funding p...
6,Jeremy Sigee,UBS,Morning. Thank you and welcome back to Sergio ...,Sergio P. Ermotti,,"Thank you, Jeremy. Look, you know, the base pl..."
7,Anke Reingen,UBS,Yeah. Thank you very much for taking my questi...,Sarah Youngwood,,"So on the treasury share, what happened is we ..."
8,Amit Goel,UBS,"Okay. Thanks. And just on the cost savings, is...",Sarah Youngwood,,"So, the 8 billion-dollars-plus was done based ..."
9,Tom Hallet,Credit Suisse,"Good morning, everyone. So, just a couple for ...",Sergio P. Ermotti,UBS,"Thank you, Tom. Now, look, of course, the reve..."
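
The preview above has empty cells in "Role of the second person", and nothing guarantees that the model returns every requested key. A minimal sketch of a normalisation step follows; it forces each record to the same six keys before the DataFrame is built. `EXPECTED_KEYS` and `normalize_record` are illustrative names, not part of the original notebook.

```python
# Keys requested in the prompt; anything else the model returns is dropped.
EXPECTED_KEYS = [
    "Name of the first person",
    "Name of bank",
    "All text that the first person said",
    "Name of the second person",
    "Role of the second person",
    "All text that the second person said",
]


def normalize_record(record):
    """Return a dict with exactly EXPECTED_KEYS, filling gaps with None."""
    if not isinstance(record, dict):
        record = {}
    normalized = {}
    for key in EXPECTED_KEYS:
        value = record.get(key)
        # Treat empty or whitespace-only strings as missing.
        if isinstance(value, str) and not value.strip():
            value = None
        normalized[key] = value
    return normalized
```

In the loop this could be applied as `processed_results.append(normalize_record(info))`, so that every row has the same columns regardless of what the model returned.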


In [23]:
processed_df.to_csv("ubs_qna_ver1.csv", index=False)
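
Several rows in the preview report "UBS" as the bank even though the `category` column of `qa_data` names a different firm (row 1 is listed under JPMorgan, for example). The sketch below flags such rows for review. It assumes that `category` holds the analyst's firm, as the preview suggests, and it uses the standard library's `difflib` as a stand-in for `fuzzywuzzy` so that it runs without extra installs. The names `bank_matches` and `flag_mismatches` and the 0.8 threshold are illustrative choices.

```python
from difflib import SequenceMatcher


def bank_matches(extracted, expected, threshold=0.8):
    """Return True when two bank names look like the same institution."""
    # Non-string values (None, NaN) can never match.
    if not isinstance(extracted, str) or not isinstance(expected, str):
        return False
    a, b = extracted.strip().lower(), expected.strip().lower()
    if not a or not b:
        return False
    # Containment covers short forms such as "Goldman" vs "Goldman Sachs".
    if a in b or b in a:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


def flag_mismatches(extracted_banks, expected_banks):
    """Return the positions where the extracted bank disagrees with the expected one."""
    return [
        i
        for i, (got, want) in enumerate(zip(extracted_banks, expected_banks))
        if not bank_matches(got, want)
    ]
```

A call such as `flag_mismatches(processed_df["Name of bank"], qa_data["category"])` would return the row positions to inspect by hand.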