<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/cleansed/sk_processed_jpmorgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
"""
===================================================
Author: Sheldon Kemper
Role: Data Engineering Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/sheldon-kemper
Date: 2025-02-04
Version: 1.1

Description:
    This notebook contains data engineering functions to process earnings call transcripts (in PDF format)
    from JPMorgan and similar sources stored on Google Drive. The code performs the following tasks:

    1. Reads raw PDF files from a specified directory using PyMuPDF (imported as fitz).
    2. Cleans and preprocesses the transcript text via regular expressions.
    3. Extracts metadata such as the financial quarter and call date from the transcript.
    4. Splits the transcript into two sections:
         - Management Discussion: Contains the prepared remarks and discussion from management.
         - Q&A Section: Contains the questions and answers from the call.
    5. Parses the Q&A section:
         - Extracts individual Q&A entries (each entry holds a single speaker's name, job title,
           Q/A marker, and utterance).
         - Pairs question entries with their corresponding answer entries to form coherent Q&A pairs
           (saved as jpmorgan_qa_section.csv).
    6. Aggregates the management discussion and Q&A data (both individual entries and paired Q&A) into Pandas DataFrames.
    7. Formats and sorts the DataFrames (e.g., converts call dates to datetime objects).
    8. Saves the processed results as CSV files to a specified directory on Google Drive.

===================================================
"""
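
Steps 3 and 4 above can be tried out on a small hand-written string before running the full pipeline. The sketch below is illustrative only: the sample text is made up for the example, and the regular expressions mirror the ones used later in this notebook rather than anything guaranteed to fit every transcript layout.

```python
import re

# Hypothetical, hand-written sample standing in for cleaned transcript text.
sample = (
    "2Q23 FINANCIAL RESULTS\n"
    "EARNINGS CALL TRANSCRIPT\n"
    "July 14, 2023\n"
    "MANAGEMENT DISCUSSION SECTION\n"
    "Prepared remarks go here.\n"
    "QUESTION AND ANSWER SECTION\n"
    "Questions and answers go here."
)

# Step 3: the quarter looks like "2Q23"; the call date looks like "July 14, 2023".
quarter_match = re.search(r"(\dQ\s*\d{2})", sample)
date_match = re.search(r"([A-Za-z]+\s+\d{1,2},\s+\d{4})", sample)
quarter = quarter_match.group(1).replace(" ", "") if quarter_match else None
call_date = date_match.group(1) if date_match else None

# Step 4: everything before the Q&A marker is treated as management discussion.
marker = re.search(r"(?i)QUESTION\s+AND\s+ANSWER", sample)
if marker:
    management = sample[:marker.start()].strip()
    qa = sample[marker.start():].strip()
else:
    management, qa = sample, ""

print(quarter, "|", call_date)
print(qa.splitlines()[0])
```

If either pattern fails to match, the corresponding value is `None`, which is worth checking before relying on the metadata downstream.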




Modules

In [11]:
!pip install pymupdf  # Install the PyMuPDF library, which includes the 'fitz' module



In [12]:
# Step 1: Import required libraries
import pymupdf
import re
import os
from google.colab import drive
import pandas as pd

In [None]:
# # Mount Google Drive to the root location with force_remount
# drive.mount('/content/drive', force_remount=True)

# # Assuming 'BOE' folder is in 'MyDrive' and already shared
# BOE_path = '/content/drive/MyDrive/BOE/bank_of_england/data'

# # Now you (and others with access) can work with files in this directory
# # For example, you can list the contents:
# print(os.listdir(BOE_path))

Mounted at /content/drive
['raw', 'preprocessed_data', 'cleansed']
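
The listing above shows `raw`, `preprocessed_data`, and `cleansed` under the shared data folder. One way to keep input and output locations consistent is to derive them from a single base path. The sketch below is an illustration under assumptions, not the notebook's own code: the helper name `build_paths` is hypothetical, and the CSV file names simply mirror the ones used later in this notebook.

```python
from pathlib import Path

def build_paths(base_dir, bank="jpmorgan"):
    """Hypothetical helper: derive input/output locations from one base directory."""
    base = Path(base_dir)
    return {
        "raw_dir": base / "raw" / bank,
        "md_csv": base / "cleansed" / f"{bank}_management_discussion.csv",
        "qa_csv": base / "cleansed" / f"{bank}_qa_section.csv",
    }

# Base path taken from the commented-out mount cell above.
paths = build_paths("/content/drive/MyDrive/BOE/bank_of_england/data")
for name, path in paths.items():
    print(f"{name}: {path}")
```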


In [25]:
import os
import re
import fitz  # PyMuPDF
import pandas as pd

# -------------------------------
# 1. Define the path to your raw folder on Google Drive
# -------------------------------
raw_dir = "/content/drive/My Drive/BOE/bank_of_england/data/raw/jpmorgan"

# -------------------------------
# 2. Define speaker corrections for known names
# -------------------------------
SPEAKER_CORRECTIONS = {
    "Operator": "Operator",  # We'll filter these out later.
    "Mike Mayo": "Mike Mayo, Analyst, Wells Fargo Securities LLC",
    "Jim Mitchell": "Jim Mitchell, Analyst, Seaport Global Securities LLC",
    "Jeremy Barnum": "Jeremy Barnum, Chief Financial Officer, JPMorgan Chase",
    "Jamie Dimon": "Jamie Dimon, Chairman & Chief Executive Officer, JPMorgan Chase",
    "Erika Najarian": "Erika Najarian, Analyst, UBS Securities LLC",
    "John McDonald": "John McDonald, Analyst, Autonomous Research",
    "Ken Usdin": "Ken Usdin, Analyst, Jefferies LLC",
    "Gerard Cassidy": "Gerard Cassidy, Analyst, RBC Capital Markets LLC",
    "Steven Chubak": "Steven Chubak, Analyst, Wolfe Research LLC",
    # Straight apostrophe: clean_transcript() normalizes curly quotes before names are matched.
    "Matt O'Connor": "Matt O'Connor, Analyst, Deutsche Bank Securities, Inc.",
    "Betsy L. Graseck": "Betsy L. Graseck, Analyst, Morgan Stanley & Co. LLC",
    "Saul Martinez": "Saul Martinez, Analyst, HSBC Securities (USA), Inc.",
    "Ebrahim H. Poonawala": "Ebrahim H. Poonawala, Analyst, Bank of America Merrill Lynch",
}

# -------------------------------
# 3. Define helper functions for processing
# -------------------------------

def extract_text_from_pdf(file_path):
    """Extracts text from a PDF file using PyMuPDF."""
    # Open the document in a context manager so the file handle is released afterwards.
    with fitz.open(file_path) as doc:
        return "".join(page.get_text() + "\n" for page in doc)

def clean_transcript(text):
    """Cleans the raw transcript text."""
    # Normalize quotes.
    text = text.replace("’", "'").replace("“", '"').replace("”", '"')
    text = re.sub(r'\n\s*\.{10,}\s*\n', '\n', text)
    text = re.sub(r'\n\d+\n', '\n', text)
    text = re.sub(r'On page \d+', '', text)
    text = re.sub(r'Starting on page \d+', '', text)
    text = re.sub(r'\.\s*,', '.', text)
    text = text.replace('%. ,', '%.')
    text = re.sub(r'\s+\n', '\n', text)
    text = re.sub(r'\n+', '\n', text).strip()
    if "Disclaimer" in text:
        text = text.split("Disclaimer")[0].strip()
    return text

def extract_metadata(text):
    """Extracts the financial quarter and call date from the transcript text."""
    quarter_match = re.search(r'(\dQ\s*\d{2})', text)
    financial_quarter = quarter_match.group(1).replace(" ", "") if quarter_match else None
    date_match = re.search(r'([A-Za-z]+\s+\d{1,2},\s+\d{4})', text)
    call_date = date_match.group(1) if date_match else None
    return financial_quarter, call_date

def split_sections(transcript):
    """
    Splits the transcript into Management Discussion and Q&A sections.
    Assumes that the Q&A section is introduced by a marker like "QUESTION AND ANSWER" (case-insensitive).
    Returns a tuple: (management_discussion, qa_section)
    """
    qa_marker = re.search(r'(?i)(QUESTION\s+AND\s+ANSWER)', transcript)
    if qa_marker:
        management_discussion = transcript[:qa_marker.start()].strip()
        qa_section = transcript[qa_marker.start():].strip()
    else:
        management_discussion = transcript
        qa_section = ""
    return management_discussion, qa_section

def parse_management_discussion_section(md_text, speaker_corrections):
    """
    Parses the Management Discussion section into segments.
    Each segment is a dictionary with keys 'speaker' and 'utterance'.
    It detects lines that begin with one of the known speaker names and starts a new segment.
    """
    segments = []
    current_speaker = None
    current_utterance = []
    lines = md_text.split('\n')
    for line in lines:
        line = line.strip()
        if not line:
            continue
        found = None
        for short_name, full_name in speaker_corrections.items():
            pattern = r'^' + re.escape(short_name) + r'([\s,:-].*)?$'
            if re.match(pattern, line, re.IGNORECASE):
                found = full_name
                break
        if found:
            if current_speaker or current_utterance:
                segments.append({
                    'speaker': current_speaker,
                    'utterance': ' '.join(current_utterance).strip()
                })
            current_speaker = found
            if ':' in line:
                parts = line.split(':', 1)
                current_utterance = [parts[1].strip()]
            else:
                current_utterance = []
        else:
            current_utterance.append(line)
    if current_speaker or current_utterance:
        segments.append({
            'speaker': current_speaker,
            'utterance': ' '.join(current_utterance).strip()
        })
    return segments

def parse_qa_section(qa_text):
    """
    Parses the Q&A section from the transcript.
    Expects a layout where each speaker block has:
      - Line 1: Speaker name (exact match in SPEAKER_CORRECTIONS)
      - Line 2: Job title (free text)
      - Line 3: Marker ('Q' or 'A')
      - Line 4+: Utterance (until the next speaker is detected)
    Returns a list of dictionaries with keys: 'speaker', 'job_title', 'marker', 'utterance'
    """
    lines = qa_text.split('\n')
    entries = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line in SPEAKER_CORRECTIONS:
            speaker = SPEAKER_CORRECTIONS[line]
            job_title = lines[i+1].strip() if i+1 < len(lines) else ""
            marker = lines[i+2].strip() if i+2 < len(lines) else ""
            utterance_lines = []
            i += 3
            while i < len(lines) and (lines[i].strip() not in SPEAKER_CORRECTIONS):
                utterance_lines.append(lines[i].strip())
                i += 1
            utterance = " ".join(utterance_lines)
            entries.append({
                'speaker': speaker,
                'job_title': job_title,
                'marker': marker,
                'utterance': utterance
            })
        else:
            i += 1
    return entries

def add_metadata_to_entries(entries, filename, financial_quarter, call_date):
    """Adds filename, financial_quarter, and call_date metadata to each entry."""
    for entry in entries:
        entry['filename'] = filename
        entry['financial_quarter'] = financial_quarter
        entry['call_date'] = call_date
    return entries

def pair_qa_entries(entries):
    """
    Pairs Q&A entries assuming they alternate (Q then A).
    Returns a list of dictionaries with keys:
      'question_speaker', 'question_job_title', 'question',
      'answer_speaker', 'answer_job_title', 'answer',
      'filename', 'financial_quarter', and 'call_date'
    """
    paired = []
    i = 0
    while i < len(entries):
        if entries[i].get('marker', '') == 'Q':
            question = entries[i]
            j = i + 1
            answer = None
            while j < len(entries):
                if entries[j].get('marker', '') == 'A' or (not entries[j].get('marker') and j == i+1):
                    answer = entries[j]
                    break
                j += 1
            paired_entry = {
                'question_speaker': question.get('speaker', ''),
                'question_job_title': question.get('job_title', ''),
                'question': question.get('utterance', ''),
                'answer_speaker': answer.get('speaker', '') if answer else "",
                'answer_job_title': answer.get('job_title', '') if answer else "",
                'answer': answer.get('utterance', '') if answer else ""
            }
            for key in ['filename', 'financial_quarter', 'call_date']:
                paired_entry[key] = question.get(key, "")
            paired.append(paired_entry)
            i = j + 1 if answer else i + 1
        else:
            paired_entry = {
                'question_speaker': entries[i].get('speaker', ''),
                'question_job_title': entries[i].get('job_title', ''),
                'question': entries[i].get('utterance', ''),
                'answer_speaker': "",
                'answer_job_title': "",
                'answer': ""
            }
            for key in ['filename', 'financial_quarter', 'call_date']:
                paired_entry[key] = entries[i].get(key, "")
            paired.append(paired_entry)
            i += 1
    return paired

def filter_operator_qa_pairs(df):
    """
    Removes rows from the paired Q&A DataFrame where the text "Operator:" (case-insensitive)
    appears in either the 'question' or 'answer' field.
    """
    if 'question' not in df.columns or 'answer' not in df.columns:
        return df
    mask = ~(df['question'].str.contains(r'(?i)Operator:', na=False) |
             df['answer'].str.contains(r'(?i)Operator:', na=False))
    return df[mask]

# -------------------------------
# 4. Process each PDF in the raw folder and aggregate results
# -------------------------------
all_qa_entries = []      # To store parsed Q&A entries from all transcripts.
all_md_segments = []     # To store management discussion segments from all transcripts.

for filename in os.listdir(raw_dir):
    if filename.lower().endswith(".pdf"):
        file_path = os.path.join(raw_dir, filename)
        print("Processing file:", file_path)

        # Extract text using PyMuPDF.
        transcript_text = extract_text_from_pdf(file_path)
        print("Extracted text preview for", filename, ":", transcript_text[:1000])

        # Clean transcript.
        transcript_clean = clean_transcript(transcript_text)
        print("Cleaned text preview for", filename, ":", transcript_clean[:1000])

        # Extract metadata.
        financial_quarter, call_date = extract_metadata(transcript_clean)
        print("Extracted Financial Quarter:", financial_quarter)
        print("Extracted Call Date:", call_date)

        # Split transcript into sections.
        management_discussion, qa_section = split_sections(transcript_clean)
        qa_section = re.sub(r'(?i)^QUESTION\s+AND\s+ANSWER\s+SECTION\s*', '', qa_section, count=1).strip()

        # Parse Management Discussion.
        md_segments = parse_management_discussion_section(management_discussion, SPEAKER_CORRECTIONS)
        # Filter out segments with missing speaker or where speaker is "Operator"
        md_segments = [seg for seg in md_segments if seg.get('speaker') and seg.get('speaker').strip().lower() != "operator"]
        md_segments = add_metadata_to_entries(md_segments, filename, financial_quarter, call_date)
        all_md_segments.extend(md_segments)

        # Parse Q&A section.
        qa_entries = parse_qa_section(qa_section)
        # Filter out Q&A entries with missing speaker or where speaker is "Operator"
        qa_entries = [entry for entry in qa_entries if entry.get('speaker') and entry.get('speaker').strip().lower() != "operator"]
        qa_entries = add_metadata_to_entries(qa_entries, filename, financial_quarter, call_date)
        all_qa_entries.extend(qa_entries)
        print("Processed file:", filename)

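# Pair Q&A entries one transcript at a time so a question is never paired
# with an answer from a different file.
qa_pairs = []
for fname in dict.fromkeys(entry['filename'] for entry in all_qa_entries):
    qa_pairs.extend(pair_qa_entries([e for e in all_qa_entries if e['filename'] == fname]))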

# --- New Step: Remove Q&A pairs with an empty question field ---
qa_pairs = [pair for pair in qa_pairs if str(pair.get('question', '')).strip() != ""]

# --- New Step: Filter out Q&A pairs where "Operator:" appears in question or answer ---
df_qa_pairs = pd.DataFrame(qa_pairs)
df_qa_pairs = filter_operator_qa_pairs(df_qa_pairs)

# --- New Step: Remove Q&A pairs where the answer_speaker is null or empty ---
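# .copy() gives an independent DataFrame, so the later call_date assignment does not act on a slice.
df_qa_pairs = df_qa_pairs[df_qa_pairs['answer_speaker'].astype(str).str.strip() != ""].copy()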

# Convert aggregated lists to DataFrames.
df_qa_all = pd.DataFrame(all_qa_entries)
df_md_segments = pd.DataFrame(all_md_segments)

# -------------------------------
# 5. Format 'call_date' as datetime and sort descending by call_date
# -------------------------------
for df in [df_qa_all, df_md_segments, df_qa_pairs]:
    if 'call_date' not in df.columns:
        df['call_date'] = pd.NaT
    else:
        df['call_date'] = pd.to_datetime(df['call_date'], format='%B %d, %Y', errors='coerce')
    df.sort_values(by='call_date', ascending=False, inplace=True)

print("\nCombined Parsed Q&A Section Preview (Sorted):")
print(df_qa_all.head(10))
print("\nCombined Paired Q&A Preview (Sorted):")
print(df_qa_pairs.head(10))
print("\nCombined Management Discussion Segments Preview (Sorted):")
print(df_md_segments.head())

# -------------------------------
# 6. Save the DataFrames as CSV Files
# -------------------------------
md_csv_path = "/content/drive/My Drive/BOE/bank_of_england/data/cleansed/jpmorgan_management_discussion.csv"
qa_pairs_csv_path = "/content/drive/My Drive/BOE/bank_of_england/data/cleansed/jpmorgan_qa_section.csv"

df_md_segments.to_csv(md_csv_path, index=False)
print("Management Discussion Segments DataFrame saved to:", md_csv_path)
df_qa_pairs.to_csv(qa_pairs_csv_path, index=False)
print("Paired Q&A DataFrame saved to:", qa_pairs_csv_path)

# -------------------------------
# 7. Data Validation
# -------------------------------

# 7.1. Check that all PDF files have been processed.
pdf_files = [f for f in os.listdir(raw_dir) if f.lower().endswith(".pdf")]
processed_files = set()
if 'filename' in df_qa_all.columns:
    processed_files = processed_files.union(set(df_qa_all['filename'].unique()))
if 'filename' in df_md_segments.columns:
    processed_files = processed_files.union(set(df_md_segments['filename'].unique()))
print(f"\nTotal PDF files in directory: {len(pdf_files)}")
print(f"Unique filenames in parsed data: {processed_files}")
missing_files = set(pdf_files) - processed_files
if missing_files:
    print("Warning: The following PDF files were not processed:", missing_files)
else:
    print("All PDF files have been processed.")

# 7.2. Check for missing metadata in key columns.
for name, df in [("Q&A entries", df_qa_all), ("Management Discussion segments", df_md_segments), ("Paired Q&A", df_qa_pairs)]:
    missing_filename = df['filename'].isnull().sum() if 'filename' in df.columns else 0
    missing_call_date = df['call_date'].isnull().sum() if 'call_date' in df.columns else 0
    missing_quarter = df['financial_quarter'].isnull().sum() if 'financial_quarter' in df.columns else 0
    print(f"\nFor {name}:")
    print(f"  Missing filename: {missing_filename}")
    print(f"  Missing call_date: {missing_call_date}")
    print(f"  Missing financial_quarter: {missing_quarter}")

# 7.3. Verify that no management discussion segments have "Operator" as the speaker.
if (df_md_segments['speaker'].str.lower() == "operator").any():
    print("\nError: 'Operator' entries found in management discussion segments!")
else:
    print("\nNo 'Operator' entries found in management discussion segments.")

# 7.4. Display summary statistics.
print("\nTotal Q&A entries:", len(df_qa_all))
print("Total management discussion segments:", len(df_md_segments))
print("Total paired Q&A entries:", len(df_qa_pairs))

# 7.5. Display sample data for manual review.
print("\nSample Paired Q&A entries:")
print(df_qa_pairs.head(5))
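
The pairing step above is easiest to sanity-check on a few hand-written entries. The following is a simplified standalone sketch, not the `pair_qa_entries` function itself: `pair_simple` is a hypothetical name, the entries are invented, and it covers only the basic rule of matching each 'Q' entry with the next 'A' entry.

```python
def pair_simple(entries):
    """Hypothetical simplified pairing: each 'Q' entry is matched with the next 'A' entry."""
    pairs = []
    i = 0
    while i < len(entries):
        if entries[i]["marker"] != "Q":
            i += 1  # Skip anything that is not a question (e.g. a stray answer).
            continue
        # Look ahead for the first answer after this question.
        j = i + 1
        while j < len(entries) and entries[j]["marker"] != "A":
            j += 1
        if j < len(entries):
            pairs.append((entries[i]["utterance"], entries[j]["utterance"]))
            i = j + 1
        else:
            i += 1  # No answer found; leave the question unpaired.
    return pairs

# Invented toy entries; real entries also carry speaker, job title, and file metadata.
toy = [
    {"marker": "Q", "utterance": "How is loan growth?"},
    {"marker": "A", "utterance": "Loan growth was modest."},
    {"marker": "Q", "utterance": "Any update on expenses?"},
    {"marker": "A", "utterance": "Expense guidance is unchanged."},
    {"marker": "A", "utterance": "One more point on that."},
]
pairs = pair_simple(toy)
print(pairs)
```

In this sketch the second consecutive answer is dropped rather than attached to the preceding question, so multi-part answers would need extra handling if they matter for the analysis.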


Processing file: /content/drive/My Drive/BOE/bank_of_england/data/raw/jpmorgan/2q23-earnings-transcript.pdf
Extracted text preview for 2q23-earnings-transcript.pdf :  
2Q23 FINANCIAL RESULTS  
EARNINGS CALL TRANSCRIPT 
July 14, 2023 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
2 
 
MANAGEMENT DISCUSSION SECTION 
 ......................................................................................................................................................................................................................................................  
 
Operator: Good morning, ladies and gentlemen. Welcome to JPMorgan Chase's Second Quarter 2023 Earnings Call. This call is being 
recorded. Your line will be muted for the duration of the call. We will now go to the live presentation. Please stand by. 
 
At this time, I would like to turn the call over to JPMorgan Chase's Chairman and CEO, Jamie Dimon and Chief Financial Officer, Jeremy 
Barnum. Mr. Barnum, please go ahead. 
 