# Preprocessing Earnings Call Transcripts

Before I can score sentiment on these transcripts, I need to strip the noise that would muddy the signal. My training in close reading is actually useful here — earnings calls are carefully crafted texts where word choice is deliberate, but they're wrapped in layers of formulaic boilerplate (operator greetings, participant lists, Motley Fool footers) that carry no sentiment at all. I also want to segment each transcript into prepared remarks and Q&A, since the tone often shifts between the two — the scripted section is tightly controlled by IR teams, while the Q&A is more spontaneous.

In [1]:
import sys
sys.path.insert(0, "..")

from pathlib import Path
import pandas as pd
from src.preprocessing import (
    remove_boilerplate,
    segment_transcript,
    tokenise_and_clean,
    process_transcript,
)

RAW_DIR = Path("../data/raw/transcripts")
PROCESSED_DIR = Path("../data/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

transcript_files = sorted(RAW_DIR.glob("*.txt"))
print(f"{len(transcript_files)} transcripts found")

56 transcripts found


## Looking at a raw transcript

I want to see the actual text before writing any cleaning rules. The start and end of the file are where most of the boilerplate lives, so I'll print those sections of an Apple earnings call.

In [2]:
sample_path = RAW_DIR / "AAPL_2022_Q1.txt"
raw_text = sample_path.read_text(encoding="utf-8")

print(f"Total length: {len(raw_text):,} characters, {len(raw_text.split()):,} words")
print("\n" + "=" * 60)
print("FIRST 500 CHARACTERS:")
print("=" * 60)
print(raw_text[:500])
print("\n" + "=" * 60)
print("LAST 400 CHARACTERS:")
print("=" * 60)
print(raw_text[-400:])

Total length: 49,027 characters, 8,470 words

FIRST 500 CHARACTERS:
Prepared Remarks:
Operator
Good day, and welcome to the Apple Q1 FY 2022 earnings conference call. Today's call is being recorded. At this time, for opening remarks and introductions, I would like to turn the call over to Tejas Gala, director of investor relations and corporate finance. Please go ahead. 
Tejas Gala -- Director, Investor Relations, and Corporate Finance
Thank you. Good afternoon and thank you for joining us. Speaking first today is Apple's CEO, Tim Cook; and he'll be followed by 

LAST 400 CHARACTERS:
 Jefferies -- Analyst
Shannon Cross -- Cross Research -- Analyst
Katy Huberty -- Morgan Stanley -- Analyst
Amit Daryanani -- Evercore ISI -- Analyst
David Vogt -- UBS -- Analyst
Samik Chatterjee -- J.P. Morgan -- Analyst
Chris Caso -- Raymond James -- Analyst
Ben Bollin -- Cleveland Research Company -- Analyst
Harsh Kumar -- Piper Sandler -- Analyst
More AAPL analysis
All earnings call transcripts


## What I'm removing

Having worked in financial services, I know these calls follow a rigid format: an operator opens with a scripted greeting, IR hands off to the CEO and CFO for prepared remarks, then the operator opens the line for analyst questions. The substantive content, the part that carries sentiment, starts with the first named speaker and runs through management's sign-off after the last Q&A exchange. Everything else is noise: the `Prepared Remarks:` header, operator greetings, `[Operator instructions]` tags, call duration, participant lists, and Motley Fool's own footer. I strip all of that but keep the safe harbour disclaimers in place, because they're embedded in the prepared remarks and too interleaved with real content to extract cleanly.
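
To make the rules above concrete, here is a minimal sketch of how this kind of stripping could work. It is not the code in `src.preprocessing`: the function name `strip_boilerplate_sketch` is hypothetical, and the `Duration:` and `Call participants:` footer markers are my assumption about how the footer is laid out (only `Prepared Remarks:` and `[Operator instructions]` are confirmed by the text above).

```python
import re

# Assumed footer markers: the exact spellings are a guess at the footer layout.
FOOTER_START = re.compile(r"^(Duration:|Call participants:)", re.MULTILINE)
# Inline tag named in the text above, plus any whitespace that follows it.
OPERATOR_TAG = re.compile(r"\[Operator instructions\]\s*", re.IGNORECASE)
# A speaker line looks like "Name -- Title".
SPEAKER_LINE = re.compile(r"^.+ -- .+$")


def strip_boilerplate_sketch(raw_text: str) -> str:
    """Illustrative only: keep text from the first named speaker to the footer."""
    # Cut the footer: drop everything from the first footer marker onward.
    match = FOOTER_START.search(raw_text)
    body = raw_text[: match.start()] if match else raw_text

    # Remove inline operator tags.
    body = OPERATOR_TAG.sub("", body)

    # Skip the header and the operator's greeting by starting at the first
    # line that looks like a named speaker.
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if SPEAKER_LINE.match(line.strip()):
            return "\n".join(lines[i:]).strip()
    return body.strip()
```

This sketch leaves operator turns in the middle of the call untouched and, like the real pipeline, makes no attempt to remove the safe harbour language.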

In [3]:
cleaned_text = remove_boilerplate(raw_text)

raw_words = len(raw_text.split())
cleaned_words = len(cleaned_text.split())
removed = raw_words - cleaned_words

print(f"Raw:     {raw_words:,} words")
print(f"Cleaned: {cleaned_words:,} words")
print(f"Removed: {removed:,} words ({100 * removed / raw_words:.1f}%)")
print("\n" + "=" * 60)
print("CLEANED — FIRST 300 CHARACTERS:")
print("=" * 60)
print(cleaned_text[:300])
print("\n" + "=" * 60)
print("CLEANED — LAST 300 CHARACTERS:")
print("=" * 60)
print(cleaned_text[-300:])

Raw:     8,470 words
Cleaned: 8,312 words
Removed: 158 words (1.9%)

CLEANED — FIRST 300 CHARACTERS:
Tejas Gala -- Director, Investor Relations, and Corporate Finance
Thank you. Good afternoon and thank you for joining us. Speaking first today is Apple's CEO, Tim Cook; and he'll be followed by CFO, Luca Maestri. After that, we'll open the call to questions from analysts.
Please note that some of th

CLEANED — LAST 300 CHARACTERS:
mation code 3599903.
These replays will be available by approximately 5:00 p.m. Pacific Time today. Members of the press with additional questions can contact Josh Rosenstock at 408-862-1142. Financial analysts can contact me with additional questions at 669-227-2402.
Thank you again for joining us.


In [4]:
segments = segment_transcript(cleaned_text)

for section, content in segments.items():
    word_count = len(content.split())
    print(f"{section}: {word_count:,} words")

prepared_remarks: 3,175 words
qa: 5,134 words
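
For readers who want to see the idea behind the split, the following is a rough sketch of marker-based segmentation. It is an illustration under assumptions, not the actual `segment_transcript`: the `Questions & Answers:` heading is my guess at the Q&A marker (the notebook only shows the `Prepared Remarks:` header), and the `full` fallback key simply mirrors the `section_type` label used later in the summary.

```python
import re

# Assumed Q&A heading; the real marker used by the pipeline may differ.
QA_MARKER = re.compile(
    r"^Questions (?:&|and) Answers:?\s*$", re.MULTILINE | re.IGNORECASE
)


def segment_sketch(cleaned_text: str) -> dict[str, str]:
    """Split on the first Q&A marker; fall back to a single 'full' section."""
    match = QA_MARKER.search(cleaned_text)
    if match is None:
        # No marker found: keep the transcript as one undivided section.
        return {"full": cleaned_text.strip()}
    return {
        "prepared_remarks": cleaned_text[: match.start()].strip(),
        "qa": cleaned_text[match.end():].strip(),
    }
```

Splitting on the first marker only means a later mention of the same phrase inside the Q&A would not create a third section.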


## Running the full pipeline

Now I'll process all 56 transcripts through the pipeline. Each one gets boilerplate stripped, segmented into prepared remarks and Q&A, and metadata extracted from the filename.

In [5]:
results = [process_transcript(fpath) for fpath in transcript_files]

print(f"Processed {len(results)} transcripts")

Processed 56 transcripts
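
The metadata step relies on the `TICKER_YEAR_QUARTER.txt` naming seen in `AAPL_2022_Q1.txt`. A minimal sketch of that parsing might look like the following; `parse_filename_sketch` is a hypothetical helper, and returning the year as a string is my assumption rather than what `process_transcript` necessarily does.

```python
import re
from pathlib import Path

# Matches stems like "AAPL_2022_Q1": ticker, four-digit year, quarter Q1-Q4.
FILENAME_PATTERN = re.compile(
    r"^(?P<ticker>[A-Z]+)_(?P<year>\d{4})_(?P<quarter>Q[1-4])$"
)


def parse_filename_sketch(path: Path) -> dict[str, str]:
    """Extract ticker, year and quarter from a transcript filename."""
    match = FILENAME_PATTERN.match(path.stem)
    if match is None:
        # Fail loudly rather than guess at malformed names.
        raise ValueError(f"Unexpected transcript filename: {path.name}")
    return match.groupdict()


print(parse_filename_sketch(Path("../data/raw/transcripts/AAPL_2022_Q1.txt")))
```

The pattern only accepts tickers made of capital letters, which covers the seven companies here but would need loosening for symbols containing dots or digits.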


In [6]:
summary_df = pd.DataFrame([
    {
        "ticker": r["ticker"],
        "quarter": f"{r['year']}-{r['quarter']}",
        "raw_words": r["raw_word_count"],
        "cleaned_words": r["cleaned_word_count"],
        "pct_removed": round(100 * (1 - r["cleaned_word_count"] / r["raw_word_count"]), 1),
        "section_type": r["section_type"],
    }
    for r in results
]).sort_values(["ticker", "quarter"]).reset_index(drop=True)

pd.set_option("display.max_rows", 60)
summary_df

  ticker  quarter  raw_words  cleaned_words  pct_removed         section_type
0   AAPL  2021-Q1       9197           9040          1.7  prepared_remarks+qa
1   AAPL  2021-Q2       8892           8730          1.8  prepared_remarks+qa
2   AAPL  2021-Q3       9007           8846          1.8  prepared_remarks+qa
3   AAPL  2021-Q4       9279           9118          1.7  prepared_remarks+qa
4   AAPL  2022-Q1       8470           8312          1.9  prepared_remarks+qa
5   AAPL  2022-Q2       8617           8457          1.9  prepared_remarks+qa
6   AAPL  2022-Q3       8238           8078          1.9  prepared_remarks+qa
7   AAPL  2022-Q4       8303           8157          1.8  prepared_remarks+qa
8   AAPL  2023-Q1       8424           8266          1.9  prepared_remarks+qa
9   AMZN  2021-Q1       5623           5490          2.4  prepared_remarks+qa


## Saving processed transcripts

I'll save the cleaned text to `data/processed/` using the same naming convention. These files are what the sentiment analysis notebook will read — boilerplate-free and ready to score.

In [7]:
for r in results:
    out_path = PROCESSED_DIR / f"{r['ticker']}_{r['year']}_{r['quarter']}.txt"
    out_path.write_text(r["cleaned_text"], encoding="utf-8")

saved_files = sorted(PROCESSED_DIR.glob("*.txt"))
print(f"Saved {len(saved_files)} processed transcripts to {PROCESSED_DIR}")

Saved 56 processed transcripts to ../data/processed
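
Counting files in the output folder would not catch a stale file left over from an earlier run, so a name-by-name comparison is a slightly stronger check. The helper below is a hypothetical sketch that assumes, as stated above, that processed files reuse the raw filenames.

```python
from pathlib import Path


def missing_outputs(raw_dir: Path, processed_dir: Path) -> list[str]:
    """Return raw transcript filenames with no matching processed file."""
    raw_names = {p.name for p in raw_dir.glob("*.txt")}
    processed_names = {p.name for p in processed_dir.glob("*.txt")}
    # Set difference: names present in raw but absent from processed.
    return sorted(raw_names - processed_names)
```

In this notebook it could be called as `missing_outputs(RAW_DIR, PROCESSED_DIR)`, where an empty list means every raw transcript has a processed counterpart.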


## Summary statistics

In [8]:
segmented = summary_df[summary_df["section_type"] == "prepared_remarks+qa"].shape[0]
full_only = summary_df[summary_df["section_type"] == "full"].shape[0]

print(f"Total transcripts: {len(summary_df)}")
print(f"Successfully segmented (prepared_remarks + qa): {segmented}")
print(f"Single section (no Q&A marker found): {full_only}")
print(f"\nBoilerplate removed: {summary_df['pct_removed'].mean():.1f}% on average")
print("\nWord counts by company (cleaned):")
print(summary_df.groupby("ticker")["cleaned_words"].agg(["mean", "median", "min", "max"]).round(0).astype(int).to_string())

Total transcripts: 56
Successfully segmented (prepared_remarks + qa): 56
Single section (no Q&A marker found): 0

Boilerplate removed: 1.8% on average

Word counts by company (cleaned):
         mean  median    min    max
ticker                             
AAPL     8556    8457   8078   9118
AMZN     6125    5922   5027   7386
BAC     12992   13270  11428  14549
GOOGL    7858    8001   5593   9028
JNJ     10077    9329   8777  12143
JPM     11073   10705   8143  16137
MSFT     8874    9028   7413   9444


## Observations

All 56 transcripts processed cleanly and every one had a recognisable Q&A marker, so the segmentation worked across the board. The boilerplate removal is modest (1.8% of words on average) because these are long documents and I'm only stripping a few lines of operator text and metadata; the safe harbour disclaimers stay in since they're woven into the prepared remarks. In the Apple sample the Q&A section (5,134 words) is clearly longer than the prepared remarks (3,175 words), which makes sense: the scripted portion is tight and rehearsed, while analyst questions and management responses run longer. I'd expect the same pattern in most of the other transcripts, though this notebook only prints the split for one call. On overall length, BAC has the longest transcripts on average (mean 12,992 words), while JPM has the widest spread and the single longest call (16,137 words). One plausible explanation is that the breadth of a large bank's business lines generates more analyst questions, but I haven't tested that here.
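
The comparison between Q&A and prepared remarks can be reduced to a single number per transcript. As a small worked example using the Apple counts printed earlier, the hypothetical helper `qa_share` below computes the fraction of segmented words that fall in the Q&A section.

```python
def qa_share(prepared_words: int, qa_words: int) -> float:
    """Fraction of segmented words that fall in the Q&A section."""
    total = prepared_words + qa_words
    if total == 0:
        raise ValueError("Transcript has no words in either section")
    return qa_words / total


# Word counts printed above for AAPL 2022 Q1.
share = qa_share(prepared_words=3175, qa_words=5134)
print(f"Q&A share: {share:.1%}")  # prints "Q&A share: 61.8%"
```

Applying the same calculation to every transcript and grouping by ticker would show whether the pattern holds across companies.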