# Multi-Ticker Earnings Dataset ‚Äî Construction Notebook

This notebook programmatically builds a **multi-year, multi-ticker dataset** covering S&P 500 companies from 2004 to 2025.
It consolidates historical market data, engineered technical indicators, and earnings-surprise information into a single feature table suitable for supervised learning.

**Workflow overview:**

1. Fetch the S&P 500 ticker universe from Wikipedia.
2. Download historical OHLCV data from Yahoo Finance.
3. Collect and merge earnings events (with Surprise %).
4. Engineer predictive features such as RSI, ATR, momentum, and MA ratios.
5. Generate forward-looking labels for model training.
6. Export the finalized dataset as `multi_ticker_earnings_dataset.csv`.

---

## Imports and Config

In [2]:
import yfinance as yf
import pandas as pd
import numpy as np
from ta.momentum import RSIIndicator
from ta.volatility import AverageTrueRange
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup

sns.set(style="whitegrid")

# ‚îÄ‚îÄ‚îÄ CONFIG ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
start_date = "2004-01-01"
end_date   = "2025-12-31"
horizon    = 1    # days ahead for the target


## 1Ô∏è‚É£ Fetching S&P 500 Ticker List

We begin by scraping the current S&P 500 constituents directly from Wikipedia using `requests` and `BeautifulSoup`.
This avoids stale or incomplete local copies and ensures reproducibility of the ticker universe.

**Notes:**

* Headers are spoofed to bypass 403 HTTP restrictions.
* Symbols are normalized (e.g. BRK.B ‚Üí BRK-B).
* The function returns a list of ~500 tickers for downstream looping.

---

In [4]:
def get_sp500_tickers():
    """
    Scrape the current list of S&P 500 tickers from Wikipedia.
    Uses requests + BeautifulSoup to avoid HTTP 403 errors.
    """
    url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

    # Add browser headers to bypass bot protection
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0 Safari/537.36"
        )
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()   # will show any real connection error

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, "lxml")
    table = soup.find("table", {"id": "constituents"})

    if table is None:
        raise ValueError("Could not find the constituents table on the Wikipedia page.")

    # Extract the table using pandas
    df = pd.read_html(str(table))[0]

    tickers = [t.replace(".", "-") for t in df["Symbol"].astype(str).tolist()]
    return tickers

# Usage
sp500_tickers = get_sp500_tickers()
print(f"‚úÖ Fetched {len(sp500_tickers)} tickers, e.g.: {sp500_tickers[:10]}")


‚úÖ Fetched 503 tickers, e.g.: ['MMM', 'AOS', 'ABT', 'ABBV', 'ACN', 'ADBE', 'AMD', 'AES', 'AFL', 'A']


  df = pd.read_html(str(table))[0]


## 2Ô∏è‚É£ Initialize Ticker Universe

We call `get_sp500_tickers()` to populate the `tickers` list.
This provides the master universe for subsequent price and earnings retrieval.

---

In [5]:
tickers = get_sp500_tickers()

  df = pd.read_html(str(table))[0]


## 3Ô∏è‚É£ Download Historical Price Data

We use `yfinance.download()` to pull OHLCV data for all S&P 500 tickers spanning **2004 ‚Äì 2025**.
To avoid throttling, tickers are split into batches of 50 and fetched concurrently.

**Post-processing:**

* Concatenate batches horizontally into a single DataFrame.
* Drop holiday rows where all tickers are NaN.

This step produces a multi-indexed DataFrame of prices ready for feature engineering.

---

In [6]:
# ‚îÄ‚îÄ Step 1: Fetch price data for the entire S&P 500 ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

print("Fetching S&P 500 ticker list‚Ä¶")
tickers = get_sp500_tickers()
print(f"Got {len(tickers)} tickers. Date range: {start_date} ‚Üí {end_date}")

batch_size = 50
chunks = [tickers[i:i + batch_size] for i in range(0, len(tickers), batch_size)]

frames = []
for chunk in tqdm(chunks, desc="üì• Downloading Price Data (batched)", ncols=100):
    try:
        df_chunk = yf.download(
            chunk,
            start=start_date,
            end=end_date,
            group_by="ticker",
            auto_adjust=False,
            threads=True
        )
        frames.append(df_chunk)
    except Exception as e:
        print(f"‚ö†Ô∏è Batch failed ({chunk[0]} - {chunk[-1]}): {e}")
        continue

raw = pd.concat(frames, axis=1)
raw.dropna(how="all", inplace=True)

print("‚úÖ Price download complete. Raw shape:", raw.shape)
display(raw.head())


Fetching S&P 500 ticker list‚Ä¶
Got 503 tickers. Date range: 2004-01-01 ‚Üí 2025-12-31


  df = pd.read_html(str(table))[0]


üì• Downloading price data (batched):   0%|                                   | 0/11 [00:00<?, ?it/s]

[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  50 of 50 completed
[*********************100%***********************]  3 of 3 completed


‚úÖ Price download complete. Raw shape: (5503, 3018)


Ticker,ABT,ABT,ABT,ABT,ABT,ABT,MO,MO,MO,MO,...,ZBRA,ZBRA,ZBRA,ZBRA,ZTS,ZTS,ZTS,ZTS,ZTS,ZTS
Price,Open,High,Low,Close,Adj Close,Volume,Open,High,Low,Close,...,Low,Close,Adj Close,Volume,Open,High,Low,Close,Adj Close,Volume
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2004-01-02,20.918781,21.165676,20.802067,20.986116,11.766494,6894172,54.669998,55.0,54.529999,54.650002,...,43.406666,43.586666,43.586666,339900,,,,,,
2004-01-05,21.14323,21.210567,20.739222,20.986116,11.766494,14395828,54.57,54.619999,53.59,54.240002,...,43.393333,43.666668,43.666668,640950,,,,,,
2004-01-06,20.851446,20.959183,20.721266,20.83349,11.680915,7790584,54.240002,54.299999,53.599998,53.830002,...,43.366669,43.766666,43.766666,311700,,,,,,
2004-01-07,20.851446,21.053452,20.716776,21.053452,11.804256,7774322,53.759998,53.759998,52.509998,53.119999,...,43.5,44.5,44.5,495150,,,,,,
2004-01-08,20.42499,20.447435,20.24992,20.433968,11.456921,17053876,53.029999,53.279999,52.599998,53.099998,...,44.133331,44.599998,44.599998,290850,,,,,,


## 4Ô∏è‚É£ Collect Earnings Event Dates

Each ticker‚Äôs earnings calendar is queried using `yf.Ticker(t).earnings_dates`.
We aggregate all available earnings dates within the specified range and normalize timezones.

**Error handling & validation:**

* Tickers with no data are tracked in `bad_tickers`.
* Valid events are merged into a master `earnings_dates` table.

The output contains timestamped earnings announcements per ticker ‚Äî the foundation for label creation.

---

In [7]:
bad_tickers = []
events = []

print(f"Collecting earnings dates for {len(tickers)} tickers...")

for t in tqdm(tickers, desc="üìä Fetching earnings calendars", ncols=100):
    try:
        ticker_obj = yf.Ticker(t)
        ed = ticker_obj.earnings_dates

        if ed is None or ed.empty:
            bad_tickers.append(t)
            continue

        ed = ed.reset_index()
        ed.columns = ["Date", "Estimate", "Reported", "Surprise_%"]
        ed["Date"] = ed["Date"].dt.tz_localize(None).dt.normalize()
        ed["Ticker"] = t
        events.append(ed[["Date", "Ticker"]])

    except Exception as e:
        bad_tickers.append(t)
        continue

if events:
    earnings_dates = (
        pd.concat(events, ignore_index=True)
        .query("@start_date <= Date <= @end_date")
        .sort_values(["Date", "Ticker"])
        .reset_index(drop=True)
    )
else:
    earnings_dates = pd.DataFrame(columns=["Date", "Ticker"])

print(f"‚úÖ Total valid earnings events: {len(earnings_dates)}")
print(f"üö´ Skipped {len(bad_tickers)} tickers (no data)")

Collecting earnings dates for 503 tickers...


üìä Fetching earnings calendars:   0%|                                       | 0/503 [00:00<?, ?it/s]

  df['Earnings Date'] = pd.to_datetime(df['Event Start Date'])
FOX: $FOX: possibly delisted; no earnings dates found
NWS: $NWS: possibly delisted; no earnings dates found
PSKY: $PSKY: possibly delisted; no earnings dates found
Q: $Q: possibly delisted; no earnings dates found
SOLS: $SOLS: possibly delisted; no earnings dates found


‚úÖ Total valid earnings events: 5893
üö´ Skipped 5 tickers (no data)


## 5Ô∏è‚É£ Enrich Earnings Events with Surprise %

We rebuild the `earnings_dates` DataFrame to explicitly include **Earnings Surprise %**, which quantifies how reported earnings differed from analyst expectations.

**Key notes:**

* Only ‚Äúgood‚Äù tickers from the previous step are processed.
* Missing Surprise % values are handled gracefully.
* Output is sorted chronologically and indexed by `Date` and `Ticker`.

This forms a clean event dataset alignable with price data for feature merging.

---

In [13]:
# Step 2 ‚Äî Build unified earnings_dates with Surprise_% included
events = []

# Filter tickers to exclude those with no data (collected earlier)
good_tickers = [t for t in tickers if t not in bad_tickers]
print(f"Processing {len(good_tickers)} valid tickers (skipping {len(bad_tickers)} bad ones)")

for t in good_tickers:
    try:
        ed = yf.Ticker(t).earnings_dates
        if ed is None or ed.empty:
            print(f"‚ö†Ô∏è  No earnings data for {t} ‚Äî skipping")
            bad_tickers.append(t)
            continue

        ed = ed.reset_index()
        ed.columns = ["Date", "Earnings_Estimate", "Reported_Earnings", "Surprise_%"]
        ed["Date"] = ed["Date"].dt.tz_localize(None).dt.normalize()
        ed["Ticker"] = t
        events.append(ed[["Date", "Ticker", "Surprise_%"]])

    except Exception as e:
        print(f"‚ùå {t}: {e}")
        bad_tickers.append(t)
        continue

earnings_dates = (
    pd.concat(events, ignore_index=True)
      .query("@start_date <= Date <= @end_date")
      .sort_values(["Date", "Ticker"])
      .reset_index(drop=True)
)

print(f"‚úÖ Total earnings events (with Surprise_%): {len(earnings_dates)}")
display(earnings_dates.head())

Processing 498 valid tickers (skipping 5 bad ones)


  df['Earnings Date'] = pd.to_datetime(df['Event Start Date'])


‚úÖ Total earnings events (with Surprise_%): 5893


Unnamed: 0,Date,Ticker,Surprise_%
0,2020-08-05,FISV,-0.09
1,2020-10-27,FISV,3.62
2,2021-02-09,FISV,0.76
3,2021-04-27,FISV,3.71
4,2021-07-27,FISV,7.04


## 6Ô∏è‚É£ Feature Engineering

For each ticker:

1. Extract and clean the OHLCV data.
2. Compute technical indicators:
   ‚ÄÉ- **Return**, **Volatility**, **RSI**, **Moving Averages (5 & 10 days)**, **MA Ratio**
   ‚ÄÉ- **Volume metrics** (20-day average and spike ratio)
   ‚ÄÉ- **Momentum (3-day change)** and **ATR14** for true range volatility
   ‚ÄÉ- **Temporal features** (weekday, month)
3. Merge earnings **Surprise %** to capture event context and set non-event days to zero.

After rolling-window cleanup, all features are concatenated across tickers into `features_df`.

**Outcome:**
A comprehensive multi-ticker DataFrame spanning price, momentum, volatility, volume, and event-driven signals.

---

In [14]:
feature_list = []
skipped = []

good_tickers = [t for t in tickers if t not in bad_tickers]
print(f"Engineering features for {len(good_tickers)} tickers...")

for t in tqdm(good_tickers, desc="‚öôÔ∏è Feature engineering", ncols=100):
    try:
        # ‚îÄ‚îÄ Data Validation ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
        if t not in raw.columns.get_level_values(0):
            skipped.append((t, "missing_in_raw"))
            continue

        df_t = raw[t].copy()

        # Flatten MultiIndex if present
        if isinstance(df_t.columns, pd.MultiIndex):
            df_t.columns = df_t.columns.get_level_values(0)

        # Clean numeric columns
        df_t.drop(columns=[c for c in ["Price"] if c in df_t], inplace=True, errors="ignore")
        df_t["Close"]  = pd.to_numeric(df_t["Close"],  errors="coerce")
        df_t["Volume"] = pd.to_numeric(df_t["Volume"], errors="coerce")
        df_t.dropna(subset=["Close", "Volume"], inplace=True)
        if df_t.empty:
            skipped.append((t, "empty_after_clean"))
            continue

        # ‚îÄ‚îÄ Core Baseline Features ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
        df_t["Return"]      = df_t["Close"].pct_change()
        df_t["Volatility"]  = df_t["Return"].rolling(5).std()
        df_t["RSI"]         = RSIIndicator(close=df_t["Close"], window=14).rsi()
        df_t["MA5"]         = df_t["Close"].rolling(5).mean()
        df_t["MA10"]        = df_t["Close"].rolling(10).mean()
        df_t["MA_ratio"]    = df_t["MA5"] / df_t["MA10"] - 1
        df_t["Volume_Avg20"]= df_t["Volume"].rolling(20).mean()
        df_t["Volume_Spike"]= df_t["Volume"] / df_t["Volume_Avg20"] - 1
        df_t["Momentum3"]   = df_t["Close"].pct_change(3)
        atr = AverageTrueRange(high=df_t["High"], low=df_t["Low"], close=df_t["Close"], window=14)
        df_t["ATR14"]       = atr.average_true_range()
        df_t["DayOfWeek"]   = df_t.index.dayofweek
        df_t["Month"]       = df_t.index.month

        # ‚îÄ‚îÄ Merge Earnings Surprise % ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
        ed_t = earnings_dates.loc[earnings_dates["Ticker"] == t, ["Date", "Surprise_%"]].set_index("Date")
        df_t = df_t.join(ed_t, how="left")
        df_t["Surprise_%"] = df_t["Surprise_%"].fillna(0)

        # ‚îÄ‚îÄ Additional Derived Features ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

        ## üîπ Trend Persistence & Regime
        df_t["CumulativeReturn5"]  = df_t["Return"].rolling(5).sum()
        df_t["CumulativeReturn20"] = df_t["Return"].rolling(20).sum()
        df_t["MA_CrossGap"]        = (df_t["MA5"] - df_t["MA10"]) / df_t["Close"]
        df_t["PositiveDays_5"]     = df_t["Return"].rolling(5).apply(lambda x: (x > 0).mean(), raw=True)

        ## üîπ Volatility Asymmetry & Structure
        df_t["VolatilityChange"]   = df_t["Volatility"].pct_change()
        df_t["Skew5"]              = df_t["Return"].rolling(5).skew()
        df_t["HighVol"]            = (df_t["Volatility"] > df_t["Volatility"].rolling(60).mean()).astype(int)
        df_t["VolatilityJump"]     = (
            df_t["Volatility"].rolling(3).mean().shift(-1) / df_t["Volatility"].rolling(3).mean() - 1
        )

        ## üîπ Volume‚ÄìPrice Interactions
        df_t["VolPriceCorr20"] = df_t["Return"].rolling(20).corr(df_t["Volume"])
        df_t["VolumeDivergence"] = df_t["Volume_Spike"] * np.sign(df_t["Return"])

        ## üîπ Event-Driven Dynamics
        df_t["SurpriseDecay5"] = df_t["Surprise_%"].rolling(5, min_periods=1).mean()
        df_t["EventShock"] = (df_t["Surprise_%"].abs() > df_t["Surprise_%"].abs().median() * 3).astype(int)

        # Compute Days Since Last Earnings
        df_t["DaysSinceEarnings"] = np.nan
        earnings_idx = ed_t.index
        if not earnings_idx.empty:
            last_event_date = None
            days_since = []
            for current_date in df_t.index:
                valid_past = earnings_idx[earnings_idx <= current_date]
                if not valid_past.empty:
                    last_event_date = valid_past[-1]
                days_since.append((current_date - last_event_date).days if last_event_date else np.nan)
            df_t["DaysSinceEarnings"] = days_since
        df_t["DaysSinceEarnings"].fillna(df_t["DaysSinceEarnings"].max() or 0, inplace=True)

        ## üîπ Composite Ratios
        df_t["MomentumVolRatio"] = df_t["Momentum3"] / (df_t["Volatility"] + 1e-9)
        df_t["EarningsImpactMag"] = df_t["Surprise_%"].abs() * df_t["Volatility"]

        # ‚îÄ‚îÄ Cleanup & Filtering ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
        df_t.dropna(subset=[
            "Return","Volatility","RSI","MA5","MA10","MA_ratio",
            "Volume_Avg20","Volume_Spike","Momentum3","ATR14"
        ], inplace=True)

        if df_t.empty:
            skipped.append((t, "empty_after_features"))
            continue

        # ‚îÄ‚îÄ Final Column Selection ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
        keep = [
            "Close","Volume","Return","Volatility","RSI",
            "MA5","MA10","MA_ratio","Volume_Avg20","Volume_Spike",
            "Momentum3","ATR14","DayOfWeek","Month","Surprise_%",
            "CumulativeReturn5","CumulativeReturn20","MA_CrossGap","PositiveDays_5",
            "VolatilityChange","Skew5","HighVol","VolatilityJump",
            "VolPriceCorr20","VolumeDivergence",
            "SurpriseDecay5","EventShock","DaysSinceEarnings",
            "MomentumVolRatio","EarningsImpactMag"
        ]

        feats = df_t[keep].copy()
        feats["Ticker"] = t
        feature_list.append(feats)

    except Exception as e:
        skipped.append((t, f"error:{str(e)[:80]}"))
        continue

# ‚îÄ‚îÄ Combine Results ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
if feature_list:
    features_df = pd.concat(feature_list)
    features_df.index.name = "Date"
    features_df = features_df.sort_index()
    print(f"‚úÖ Feature engineering complete: {len(feature_list)} tickers succeeded.")
    print(f"üìä Combined dataset shape: {features_df.shape}")
    display(features_df.head())
else:
    print("‚ùå No ticker produced valid feature data. Check 'skipped' list below.")
    features_df = pd.DataFrame()

if skipped:
    print(f"\n‚ö†Ô∏è Skipped {len(skipped)} tickers (sample):")
    print(skipped[:10])


Engineering features for 498 tickers...


‚öôÔ∏è Feature engineering:   0%|                                               | 0/498 [00:00<?, ?it/s]

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_t["DaysSinceEarnings"].fillna(df_t["DaysSinceEarnings"].max() or 0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_t["DaysSinceEarnings"].fillna(df_t["DaysSinceEarnings"].max() or 0, inplace=True)
The behavior will change in pandas 3.0. This inplace method wi

‚úÖ Feature engineering complete: 498 tickers succeeded.
üìä Combined dataset shape: (2480382, 31)


Unnamed: 0_level_0,Close,Volume,Return,Volatility,RSI,MA5,MA10,MA_ratio,Volume_Avg20,Volume_Spike,...,HighVol,VolatilityJump,VolPriceCorr20,VolumeDivergence,SurpriseDecay5,EventShock,DaysSinceEarnings,MomentumVolRatio,EarningsImpactMag,Ticker
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2004-01-30,66.128761,4054081.0,-0.005407,0.011509,33.131979,67.421405,68.134616,-0.010468,4487015.3,-0.096486,...,0,-0.104918,,0.096486,0.0,0,205.0,-3.185526,0.0,MMM
2004-01-30,6.9625,2629600.0,0.005415,0.022642,55.191102,6.889,6.98275,-0.013426,4145740.0,-0.36571,...,0,0.254908,,-0.36571,0.0,0,175.0,0.692594,0.0,ROST
2004-01-30,20.485001,2386600.0,-0.022196,0.029585,60.164472,20.423,20.344,0.003883,1408260.0,0.694715,...,0,0.215869,,-0.694715,0.0,0,196.0,0.240958,0.0,EL
2004-01-30,6.641667,327600.0,-0.001753,0.008481,58.705176,6.610333,6.568667,0.006343,618720.0,-0.47052,...,0,-0.210176,,0.47052,0.0,0,196.0,0.774351,0.0,CHD
2004-01-30,20.6625,2914400.0,0.000363,0.007324,46.023753,20.649,20.83425,-0.008892,2749340.0,0.060036,...,0,-0.024483,,0.060036,0.0,0,211.0,-0.575775,0.0,PGR


## 7Ô∏è‚É£ Target Label Generation

We define a labeling function that assigns a binary target ( `Target = 1` if the price rises within the next *horizon* days, else 0 ).

**Mechanism:**

* Shift the closing price forward by the chosen horizon ( e.g. 1‚Äì3 days ).
* Compare future returns at each earnings date to classify directional movement.

This creates a balanced event-based label dataset linking post-earnings performance to the feature window.

---

In [15]:
# Step 4: Label creation (target variable) for multi‚Äêticker MultiIndex

def create_labels(event_dates, price_df, horizon=3):
    """
    event_dates: DataFrame with ['Date','Ticker'] columns of pd.Timestamps
    price_df:    DataFrame with a MultiIndex (Date, Ticker) and at least a 'Close' column
    horizon:     how many trading days ahead to look
    """
    labels = []
    # 1) pre‚Äêshift the Close series within each ticker
    future_close = price_df['Close'].groupby(level='Ticker').shift(-horizon)
    
    for _, ev in event_dates.iterrows():
        dt, tkr = ev['Date'], ev['Ticker']
        key = (dt, tkr)
        # 2) skip if that (Date, Ticker) combo isn't in your features
        if key not in price_df.index:
            continue
        
        past = price_df.at[key, 'Close']
        fut  = future_close.at[key]
        # 3) skip if we ran off the end
        if pd.isna(fut):
            continue
        
        ret = (fut - past) / past
        labels.append({
          'Date':   dt,
          'Ticker': tkr,
          'Target': int(ret > 0)
        })
    
    return pd.DataFrame(labels)


# ‚Äî how to call it ‚Äî
# make sure features_df is a MultiIndexed DF: index names must be ['Date','Ticker']
features_df = features_df.reset_index().set_index(['Date','Ticker'])

labels_df = create_labels(earnings_dates, features_df, horizon=horizon)
print(f"Labeled {len(labels_df)} events:")
display(labels_df)

Labeled 5876 events:


Unnamed: 0,Date,Ticker,Target
0,2020-08-05,FISV,1
1,2020-10-27,FISV,0
2,2021-02-09,FISV,0
3,2021-04-27,FISV,1
4,2021-07-27,FISV,0
...,...,...,...
5871,2025-06-25,GIS,0
5872,2025-06-25,MU,0
5873,2025-06-25,PAYX,1
5874,2025-06-26,MKC,0


## 8Ô∏è‚É£ Assemble Final Dataset and Export

Finally, we join `features_df` and `labels_df` on their shared MultiIndex (`Date`, `Ticker`).
Only entries with valid targets are retained for supervised learning.

**Output:**

* A fully aligned feature-label table ready for model training.
* Saved as `multi_ticker_earnings_dataset.csv`.

---

In [16]:
# 1) Ensure the feature and label DataFrames share the same MultiIndex
#    (Date,Ticker) before joining:

# features_df should already be indexed by (Date,Ticker)
# if not, do it explicitly:
features_df = features_df.reset_index().set_index(['Date','Ticker'])

# labels_df just needs to have the same index
labels_df = labels_df.set_index(['Date','Ticker'])

# 2) Join on that MultiIndex, pulling in only the 'Target' column from labels_df
final_df = features_df.join(
    labels_df[['Target']],
    how='inner'
).reset_index()

# 3) Inspect & save
print("Final dataset shape:", final_df.shape)
display(final_df.head())

final_df.to_csv("multi_ticker_earnings_dataset.csv", index=False)
print("‚úÖ Saved to multi_ticker_earnings_dataset.csv")

Final dataset shape: (5876, 33)


Unnamed: 0,Date,Ticker,Close,Volume,Return,Volatility,RSI,MA5,MA10,MA_ratio,...,HighVol,VolatilityJump,VolPriceCorr20,VolumeDivergence,SurpriseDecay5,EventShock,DaysSinceEarnings,MomentumVolRatio,EarningsImpactMag,Target
0,2020-08-05,FISV,98.0,5283200.0,-0.016558,0.008797,43.912461,99.686,100.367001,-0.006785,...,0,0.403601,0.249003,-0.301167,-0.018,1,0.0,-2.039029,0.000792,1
1,2020-10-27,FISV,96.610001,6889300.0,-0.017991,0.017073,38.957956,99.281999,100.041999,-0.007597,...,0,0.107696,-0.423413,-0.79027,0.724,1,0.0,-2.479234,0.061804,0
2,2021-02-09,FISV,113.449997,3774700.0,0.000441,0.008138,59.238398,112.301999,108.918999,0.03106,...,0,0.087538,-0.31973,-0.283251,0.152,1,0.0,1.324784,0.006185,0
3,2021-04-27,FISV,121.660004,9036900.0,-0.038641,0.019613,44.317565,124.736,124.783999,-0.000385,...,1,0.489307,-0.664602,-1.334438,0.742,1,0.0,-1.215045,0.072764,1
4,2021-07-27,FISV,114.68,8285000.0,0.029906,0.013433,63.271989,111.582001,110.515,0.009655,...,1,-0.190961,0.208774,1.158586,1.408,1,0.0,3.131975,0.094568,0


‚úÖ Saved to multi_ticker_earnings_dataset.csv


## ‚úÖ Summary

This dataset-creation pipeline automates the integration of market and fundamental signals for hundreds of equities over two decades.

**Key benefits:**

* Unified multi-ticker structure for cross-sectional analysis.
* Rich feature space combining momentum, risk, and surprise factors.
* Forward-looking labels supporting classification or forecasting tasks.

The resulting dataset serves as a robust foundation for volatility modeling, event prediction, and machine-learning experiments in financial time series research.
