<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/cleansed/sk_processed_jpmorgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [19]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook contains data engineering functions to process earnings call transcripts (in PDF format)
    from JPMorgan and similar sources stored on Google Drive. The code performs the following tasks:

    1. Reads raw PDF files from a specified directory using pdfplumber.
    2. Cleans and preprocesses the transcript text via regular expressions.
    3. Extracts metadata such as the financial quarter and call date from the transcript.
    4. Splits the transcript into two sections:
         - Management Discussion: Contains the prepared remarks and discussion from management.
         - Q&A Section: Contains the questions and answers from the call.
    5. Parses the Q&A section:
         - Extracts individual Q&A entries (one row per speaker turn, with an optional Q/A marker,
           job title, and utterance; the optional debug export is qa_section_v2.csv).
         - Pairs each question entry with the answer entry that follows it to form Q&A pairs
           (saved as jpmorgan_qa_section.csv).
    6. Aggregates the management discussion and Q&A data (both individual entries and paired Q&A) into Pandas DataFrames.
    7. Formats and sorts the DataFrames (e.g., converts call dates to datetime objects).
    8. Saves the processed results as CSV files to a specified directory on Google Drive.

===================================================
"""

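Steps 3 and 7 of the description can be illustrated on the first lines of the 2Q23 transcript. The sketch below is a minimal, standard-library illustration only: it assumes the header layout seen in that file (quarter label first, call date on its own line) and uses `datetime.strptime` in place of pandas, with the same `'%B %d, %Y'` format string. The names `sample_header` and `parse_header` are hypothetical.

```python
import re
from datetime import datetime

# Header lines as they appear at the top of the 2Q23 transcript.
sample_header = "2Q23 FINANCIAL RESULTS\nEARNINGS CALL TRANSCRIPT\nJuly 14, 2023"

def parse_header(text):
    """Return (financial_quarter, call_date); a part that is not found is None."""
    quarter = re.search(r'(\dQ\s*\d{2})', text)
    date = re.search(r'([A-Za-z]+\s+\d{1,2},\s+\d{4})', text)
    financial_quarter = quarter.group(1).replace(" ", "") if quarter else None
    # '%B %d, %Y' matches dates written like "July 14, 2023".
    call_date = datetime.strptime(date.group(1), '%B %d, %Y') if date else None
    return financial_quarter, call_date

quarter, call_date = parse_header(sample_header)
print(quarter, call_date.date())  # 2Q23 2023-07-14
```

Because both searches return the first match in the text, this approach relies on the quarter label and call date appearing before any other quarter or date mentioned in the transcript.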



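Step 5 pairs each question with the answer that follows it. The sketch below shows that idea on three hand-written entries. The entries and the helper name `pair_next_answer` are hypothetical, and this is a simplified illustration (each `Q` is matched to the next `A`, everything else is ignored) rather than the notebook's full pairing logic.

```python
def pair_next_answer(entries):
    """Match every 'Q' entry with the first 'A' entry that follows it."""
    pairs = []
    for i, entry in enumerate(entries):
        if entry["marker"] != "Q":
            continue
        # Find the first answer after this question, if any.
        answer = next((e for e in entries[i + 1:] if e["marker"] == "A"), None)
        pairs.append({
            "question_speaker": entry["speaker"],
            "question": entry["utterance"],
            "answer_speaker": answer["speaker"] if answer else "",
            "answer": answer["utterance"] if answer else "",
        })
    return pairs

# Hand-written entries for illustration only (not taken from a transcript).
toy_entries = [
    {"speaker": "Analyst A", "marker": "Q", "utterance": "What drove expenses?"},
    {"speaker": "Operator", "marker": None, "utterance": "One moment, please."},
    {"speaker": "Executive B", "marker": "A", "utterance": "Mostly investment spend."},
]

for pair in pair_next_answer(toy_entries):
    print(pair["question_speaker"], "->", pair["answer_speaker"])
```

If two questions occur before an answer, this simplified version attaches the same answer to both, whereas the notebook's `pair_qa_entries` advances past the matched answer.
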
Modules

In [20]:
!pip install pdfplumber  # Install pdfplumber library



In [21]:
# Step 1: Import required libraries
import pdfplumber
import re
import os
from google.colab import drive
import pandas as pd

In [27]:
# Mount Google Drive to the root location with force_remount
drive.mount('/content/drive', force_remount=True)

# Assuming 'BOE' folder is in 'MyDrive' and already shared
BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data'

# Now you (and others with access) can work with files in this directory
# For example, you can list the contents:
print(os.listdir(BOE_path))

Mounted at /content/drive
['raw', 'preprocessed_data', 'cleansed']

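Given the subfolders listed above, input and output locations can be derived from a single base path instead of being repeated as string literals. This is a minimal sketch, assuming inputs live under `raw/<bank>` and outputs under `cleansed`. The helper `build_data_paths` is hypothetical, and a temporary directory stands in for the Drive folder so the example runs anywhere.

```python
import os
import tempfile

def build_data_paths(base_dir, bank="jpmorgan"):
    """Derive the raw input and cleansed output directories from one base path."""
    return {
        "raw": os.path.join(base_dir, "raw", bank),
        "cleansed": os.path.join(base_dir, "cleansed"),
    }

# A temporary directory stands in for the mounted Drive folder.
with tempfile.TemporaryDirectory() as base_dir:
    paths = build_data_paths(base_dir)
    for path in paths.values():
        os.makedirs(path, exist_ok=True)
    # Report any expected folder that does not exist.
    missing = [name for name, path in paths.items() if not os.path.isdir(path)]
    print("Missing folders:", missing)  # Missing folders: []
```

Checking for missing folders before processing gives a clearer error than a failed `os.listdir` or `to_csv` call later on.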

In [None]:
import os
import re
import pdfplumber
import pandas as pd

# -------------------------------
# 1. Define the path to your raw folder on Google Drive
# -------------------------------
raw_dir = "/content/drive/My Drive/BOE/bank_of_england/data/raw/jpmorgan"

# -------------------------------
# 2. Define speaker corrections for known names
# -------------------------------
SPEAKER_CORRECTIONS = {
    "Mike Mayo": "Mike Mayo, Analyst, Wells Fargo Securities LLC",
    "Jim Mitchell": "Jim Mitchell, Analyst, Seaport Global Securities LLC",
    "Operator": "Operator",
    "Jeremy Barnum": "Jeremy Barnum, Chief Financial Officer, JPMorgan Chase",
    "Jamie Dimon": "Jamie Dimon, Chairman & Chief Executive Officer, JPMorgan Chase",
    "Erika Najarian": "Erika Najarian, Analyst, UBS Securities LLC",
    "John McDonald": "John McDonald, Analyst, Autonomous Research",
    "Ken Usdin": "Ken Usdin, Analyst, Jefferies LLC",
    "Gerard Cassidy": "Gerard Cassidy, Analyst, RBC Capital Markets LLC",
    "Steven Chubak": "Steven Chubak, Analyst, Wolfe Research LLC",
    "Matt O’Connor": "Matt O’Connor, Analyst, Deutsche Bank Securities, Inc.",
    "Betsy L. Graseck": "Betsy L. Graseck, Analyst, Morgan Stanley & Co. LLC",
    "Saul Martinez": "Saul Martinez, Analyst, HSBC Securities (USA), Inc.",
    "Ebrahim H. Poonawala": "Ebrahim H. Poonawala, Analyst, Bank of America Merrill Lynch",
}

# -------------------------------
# 3. Define helper functions for processing
# -------------------------------

def clean_transcript(text):
    """Cleans the raw transcript text."""
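    # Drop separator lines made of ten or more dots.
    text = re.sub(r'\n\s*\.{10,}\s*\n', '\n', text)
    # Drop lines that contain only a number (page numbers).
    text = re.sub(r'\n\d+\n', '\n', text)
    # Remove page cross-references such as "On page 3" and "Starting on page 3".
    text = re.sub(r'On page \d+', '', text)
    text = re.sub(r'Starting on page \d+', '', text)
    # Repair stray ". ," sequences.
    text = re.sub(r'\.\s*,', '.', text)
    text = text.replace('%. ,', '%.')
    # Strip trailing whitespace on each line and collapse blank lines.
    text = re.sub(r'\s+\n', '\n', text)
    text = re.sub(r'\n+', '\n', text).strip()
    # Discard everything from the first occurrence of "Disclaimer" onwards.
    if "Disclaimer" in text:
        text = text.split("Disclaimer")[0].strip()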
    return text

def extract_metadata(text):
    """Extracts the financial quarter and call date from the transcript text."""
    quarter_match = re.search(r'(\dQ\s*\d{2})', text)
    financial_quarter = quarter_match.group(1).replace(" ", "") if quarter_match else None
    date_match = re.search(r'([A-Za-z]+\s+\d{1,2},\s+\d{4})', text)
    call_date = date_match.group(1) if date_match else None
    return financial_quarter, call_date

def split_sections(transcript):
    """
    Splits the transcript into Management Discussion and Q&A sections.
    Assumes that the Q&A section is introduced by a marker like "QUESTION AND ANSWER" (case-insensitive).
    Returns:
        tuple: (management_discussion, qa_section)
    """
    qa_marker = re.search(r'(?i)(QUESTION\s+AND\s+ANSWER)', transcript)
    if qa_marker:
        management_discussion = transcript[:qa_marker.start()].strip()
        qa_section = transcript[qa_marker.start():].strip()
    else:
        management_discussion = transcript
        qa_section = ""
    return management_discussion, qa_section

def parse_qa_section(qa_text, job_role_word_threshold=10):
    """
    Parses the Q&A section of the transcript into a list of dictionaries.
    Each dictionary contains 'speaker', 'marker', 'job_title', and 'utterance'.
    It now checks for multiple formats:
      - "Name Marker" (e.g., "John McDonald Q")
      - "Name, Title" (e.g., "Betsy L. Graseck, Analyst, Morgan Stanley & Co. LLC")
      - "Speaker: Utterance" (e.g., "Operator: Thank you...")
    For lines appended after a speaker header, if the first line is short (fewer than job_role_word_threshold words)
    and contains a comma, it is assumed to be the job role.
    """
    entries = []
    current_entry = None
    lines = qa_text.split('\n')
    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Try matching the "Name Marker" format (e.g., "John McDonald Q")
        m1 = re.match(r'^(?P<speaker>.+?)\s+(?P<marker>[QA])$', line)
        if m1:
            if current_entry is not None:
                entries.append(current_entry)
            current_entry = {
                'speaker': m1.group('speaker').strip(),
                'marker': m1.group('marker'),
                'job_title': "",
                'utterance': ""
            }
            continue

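        # Try matching the "Name, Title" format (e.g., "Betsy L. Graseck, Analyst, Morgan Stanley & Co. LLC").
        # Only short lines are considered, so that ordinary sentences containing a comma are not
        # mistaken for speaker headers. A line that directly follows a bare speaker header is
        # left for the job-title check further down.
        awaiting_title = (current_entry is not None
                          and not current_entry['job_title']
                          and not current_entry['utterance'])
        m3 = None
        if len(line.split()) < job_role_word_threshold and not awaiting_title:
            m3 = re.match(r'^(?P<speaker>.+?)\s*,\s*(?P<title>.+?)$', line)
        if m3: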
            if current_entry is not None:
                entries.append(current_entry)
            current_entry = {
                'speaker': m3.group('speaker').strip(),
                'marker': None,
                'job_title': m3.group('title').strip(),
                'utterance': ""
            }
            continue

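        # Next, try matching the "Speaker: Utterance" format (e.g., "Operator: Thank you...").
        # The part before the colon must be short, so that a sentence containing a colon
        # is not mistaken for a speaker label.
        m2 = re.match(r'^(?P<speaker>[^:]+):\s*(?P<utterance>.+)$', line)
        if m2 and len(m2.group('speaker').split()) <= 5: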
            if current_entry is not None:
                entries.append(current_entry)
            current_entry = {
                'speaker': m2.group('speaker').strip(),
                'marker': None,
                'job_title': "",
                'utterance': m2.group('utterance').strip()
            }
            continue

        # Otherwise, this line is a continuation of the current entry.
        if current_entry is not None:
            # If current entry has no job_title and no utterance yet, and the line is short with a comma, treat it as job title.
            if not current_entry['job_title'] and not current_entry['utterance']:
                words = line.split()
                if len(words) < job_role_word_threshold and ',' in line:
                    current_entry['job_title'] = line.strip()
                    continue
            # Append the line to the current entry's utterance.
            if current_entry['utterance']:
                current_entry['utterance'] += " " + line
            else:
                current_entry['utterance'] = line
        else:
            current_entry = {'speaker': 'Unknown', 'marker': None, 'job_title': "", 'utterance': line}
    if current_entry is not None:
        entries.append(current_entry)

    # ---- Post-process to merge consecutive entries from the same speaker and marker ----
    merged_entries = []
    prev = None
    for entry in entries:
        # Correct the speaker name if needed
        corrected_speaker = SPEAKER_CORRECTIONS.get(entry['speaker'], entry['speaker'])
        entry['speaker'] = corrected_speaker

        if prev and prev['speaker'] == entry['speaker'] and prev.get('marker') == entry.get('marker'):
            # Merge utterances, avoiding a stray space when either side is empty
            prev['utterance'] = (prev['utterance'] + " " + entry['utterance']).strip()
            # If previous job_title is empty and entry has one, copy it
            if not prev['job_title'] and entry['job_title']:
                prev['job_title'] = entry['job_title']
        else:
            if prev:
                merged_entries.append(prev)
            prev = entry
    if prev:
        merged_entries.append(prev)
    return merged_entries

def pair_qa_entries(entries):
    """
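    Pairs each question entry (marker 'Q') with the next answer entry (marker 'A').
    Returns a list of dictionaries with keys:
      'question_speaker', 'question_job_title', 'question',
      'answer_speaker', 'answer_job_title', 'answer'

    This heuristic assumes that Q and A entries alternate: entries between a question and
    its answer are skipped, and any entry not consumed by a pair (for example an operator
    remark or a follow-up answer) is kept as a standalone row in the 'question' columns
    with empty answer fields.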
    """
    paired = []
    i = 0
    while i < len(entries):
        if entries[i].get('marker', '') == 'Q':
            question = entries[i]
            # Look ahead for the next 'A'
            j = i + 1
            answer = None
            while j < len(entries):
                if entries[j].get('marker', '') == 'A':
                    answer = entries[j]
                    break
                j += 1
            paired.append({
                'question_speaker': question.get('speaker'),
                'question_job_title': question.get('job_title'),
                'question': question.get('utterance'),
                'answer_speaker': answer.get('speaker') if answer else "",
                'answer_job_title': answer.get('job_title') if answer else "",
                'answer': answer.get('utterance') if answer else ""
            })
            i = j + 1 if answer else i + 1
        else:
            # If no marker Q, add standalone entry.
            paired.append({
                'question_speaker': entries[i].get('speaker'),
                'question_job_title': entries[i].get('job_title'),
                'question': entries[i].get('utterance'),
                'answer_speaker': "",
                'answer_job_title': "",
                'answer': ""
            })
            i += 1
    return paired

def parse_management_discussion_section(md_text, speaker_corrections):
    """
    Parses the Management Discussion section into segments.
    Each segment is a dictionary with 'speaker' and 'utterance' fields.
    It detects lines that start with a known speaker name (using speaker_corrections)
    and treats those lines as markers for a new speaker's segment.
    """
    segments = []
    current_speaker = None
    current_utterance = []
    lines = md_text.split('\n')
    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Check if the line starts with any known speaker name.
        found = None
        for short_name, full_name in speaker_corrections.items():
            # Allow the speaker's name to be followed by punctuation and extra text.
            pattern = r'^' + re.escape(short_name) + r'([\s,:-].*)?$'
            if re.match(pattern, line, re.IGNORECASE):
                found = full_name
                break

        if found:
            # Save the previous segment if it exists.
            if current_speaker is not None or current_utterance:
                segments.append({
                    'speaker': current_speaker,
                    'utterance': ' '.join(current_utterance).strip()
                })
            current_speaker = found
            # If there's a colon and text after the speaker name, start the utterance with it.
            if ':' in line:
                parts = line.split(':', 1)
                current_utterance = [parts[1].strip()]
            else:
                current_utterance = []
        else:
            current_utterance.append(line)
    if current_speaker is not None or current_utterance:
        segments.append({
            'speaker': current_speaker,
            'utterance': ' '.join(current_utterance).strip()
        })
    return segments

# -------------------------------
# 4. Process each PDF in the raw folder and aggregate results
# -------------------------------
all_qa_entries = []      # List to store parsed Q&A entries for all transcripts
all_md_segments = []     # List to store management discussion segments (with speaker) for all transcripts

for filename in os.listdir(raw_dir):
    if filename.lower().endswith(".pdf"):
        file_path = os.path.join(raw_dir, filename)
        print("Processing file:", file_path)

        # Extract text from PDF
        transcript_text = ""
        with pdfplumber.open(file_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    transcript_text += page_text + "\n"
        print("Extracted text preview for", filename, ":", transcript_text[:1000])

        # Clean the transcript text
        transcript_clean = clean_transcript(transcript_text)
        print("Cleaned text preview for", filename, ":", transcript_clean[:1000])

        # Extract metadata
        financial_quarter, call_date = extract_metadata(transcript_clean)
        print("Extracted Financial Quarter:", financial_quarter)
        print("Extracted Call Date:", call_date)

        # Split transcript into sections
        management_discussion, qa_section = split_sections(transcript_clean)
        # Remove the Q&A header if present
        qa_section = re.sub(r'(?i)^QUESTION\s+AND\s+ANSWER\s+SECTION\s*', '', qa_section, count=1).strip()

        # Parse the management discussion into segments with speaker names
        md_segments = parse_management_discussion_section(management_discussion, SPEAKER_CORRECTIONS)
        for seg in md_segments:
            seg['filename'] = filename
            seg['financial_quarter'] = financial_quarter
            seg['call_date'] = call_date
            all_md_segments.append(seg)

        # Parse the Q&A section and add metadata to each entry
        qa_entries = parse_qa_section(qa_section)
        for entry in qa_entries:
            entry['filename'] = filename
            entry['financial_quarter'] = financial_quarter
            entry['call_date'] = call_date
        all_qa_entries.extend(qa_entries)

        print("Processed file:", filename)

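# Pair Q&A entries one transcript at a time, so that a question is never matched with an
# answer from a different file, and carry the file-level metadata onto each pair.
qa_pairs = []
for fname in dict.fromkeys(e['filename'] for e in all_qa_entries):
    file_entries = [e for e in all_qa_entries if e['filename'] == fname]
    for pair in pair_qa_entries(file_entries):
        pair['filename'] = fname
        pair['financial_quarter'] = file_entries[0]['financial_quarter']
        pair['call_date'] = file_entries[0]['call_date']
        qa_pairs.append(pair)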

# Convert aggregated lists to DataFrames
df_qa_all = pd.DataFrame(all_qa_entries)
df_md_segments = pd.DataFrame(all_md_segments)  # New DataFrame for management discussion segments
df_qa_pairs = pd.DataFrame(qa_pairs)

# -------------------------------
# 5. Format 'call_date' as datetime and sort descending by call_date
# -------------------------------
for df in [df_qa_all, df_md_segments, df_qa_pairs]:
    if 'call_date' not in df.columns:
        df['call_date'] = pd.NaT
    else:
        df['call_date'] = pd.to_datetime(df['call_date'], format='%B %d, %Y', errors='coerce')
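    # Use a stable sort so rows from the same call keep their original speaking order.
    df.sort_values(by='call_date', ascending=False, inplace=True, kind='mergesort')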

print("\nCombined Parsed Q&A Section Preview (Sorted):")
print(df_qa_all.head(10))
print("\nCombined Paired Q&A Preview (Sorted):")
print(df_qa_pairs.head(10))
print("\nCombined Management Discussion Segments Preview (Sorted):")
print(df_md_segments.head())

# -------------------------------
# 6. Save the DataFrames as CSV Files
# -------------------------------
# qa_csv_path = "/content/drive/My Drive/BOE/bank_of_england/data/cleansed/qa_section_v2.csv"  # Useful for debugging
md_csv_path = "/content/drive/My Drive/BOE/bank_of_england/data/cleansed/jpmorgan_management_discussion.csv"
qa_pairs_csv_path = "/content/drive/My Drive/BOE/bank_of_england/data/cleansed/jpmorgan_qa_section.csv"

# df_qa_all.to_csv(qa_csv_path, index=False)
# print("\nQ&A DataFrame saved to:", qa_csv_path)

df_md_segments.to_csv(md_csv_path, index=False)
print("Management Discussion Segments DataFrame saved to:", md_csv_path)

df_qa_pairs.to_csv(qa_pairs_csv_path, index=False)
print("Paired Q&A DataFrame saved to:", qa_pairs_csv_path)


Processing file: /content/drive/My Drive/BOE/bank_of_england/data/raw/jpmorgan/2q23-earnings-transcript.pdf
Extracted text preview for 2q23-earnings-transcript.pdf : 2Q23 FINANCIAL RESULTS
EARNINGS CALL TRANSCRIPT
July 14, 2023
MANAGEMENT DISCUSSION SECTION
......................................................................................................................................................................................................................................................
Operator: Good morning, ladies and gentlemen. Welcome to JPMorgan Chase's Second Quarter 2023 Earnings Call. This call is being
recorded. Your line will be muted for the duration of the call. We will now go to the live presentation. Please stand by.
At this time, I would like to turn the call over to JPMorgan Chase's Chairman and CEO, Jamie Dimon and Chief Financial Officer, Jeremy
Barnum. Mr. Barnum, please go ahead.
.........................................................................