# Event Detection, Triple-Barrier Labeling, and Meta-Labeling (based on López de Prado, AFML)

This notebook implements the full event-based labeling pipeline introduced by Marcos López de Prado in *Advances in Financial Machine Learning* (2018), Chapters 2–4, applied cross-sectionally to the universe of U.S. equities from the CRSP (Center for Research in Security Prices) database. The objective of this workflow is to generate financially meaningful supervised-learning labels for return prediction, while avoiding the pitfalls of time-bar sampling, look-ahead bias, and label noise. Following AFML, the notebook proceeds through all major components of the triple-barrier methodology:

1. **Event Detection (CUSUM Filter, Ch. 2)**
   First, we identify candidate event start times by applying a symmetric CUSUM filter with a dynamic, volatility-scaled threshold. Unlike fixed-interval sampling, CUSUM focuses attention on *informational events*: periods in which the price moves far enough to justify re-evaluating the model. This produces a set of timestamps representing plausible trading opportunities that are less affected by microstructural noise.

2. **Volatility Estimation (EWMA of Daily Returns, Ch. 3)**
   For each event, the notebook computes an EWMA-based volatility estimate using López de Prado’s recommended two-day lagged log-return formulation. This volatility-based measure serves as the “base width” for setting adaptive barrier distances in the triple-barrier method.

3. **Triple-Barrier Labeling (Ch. 3)**
   Once events and volatility targets are defined, the notebook constructs three types of barriers:

   * **Profit-taking barrier**,
   * **Stop-loss barrier**, and
   * **Vertical time barrier** (maximum holding period).

   Following the methodology in AFML, the labeling routine checks which barrier is breached first after an event starts. This yields an unbiased, path-dependent labeling system that encodes not only the direction of returns but also the uncertainty and duration of each event. Unlike fixed-horizon labels, triple-barrier labels naturally incorporate information about realized price paths and event-specific volatility.

4. **Primary Label Construction (Side Prediction)**
   Based on the price change between the event’s start and the first barrier hit, the notebook assigns the primary label, called the side label in AFML. This label indicates the realized direction of the price movement and is expressed as:

   * **+1** for positive outcomes,
   * **–1** for negative outcomes, and
   * **0** for small or economically insignificant returns, as determined by the minimum-return threshold (`return_min`), which is set to 0.02.

   These labels form the training targets for a directional prediction model. By incorporating volatility-adjusted barrier distances, path-dependent evaluation, and event-specific holding periods, the resulting dataset provides cleaner, less noisy supervised-learning targets than standard return-based labeling.

5. **Note on Meta-Labeling (Applied in a Later Stage)**
   Although this notebook focuses exclusively on constructing the **primary side labels** using the triple-barrier method, the resulting labels provide the foundation for **meta-labeling**, as described in López de Prado’s *Advances in Financial Machine Learning* (Chapter 4). Meta-labeling is applied *after* training the primary directional model. In that later step, the model produces a predicted side (+1 or –1), and a meta-label is assigned based on whether this predicted direction matches the realized primary label **and** the event’s realized return exceeds a specified performance threshold. Thus, the meta-label evaluates the *quality* of the primary model’s signal rather than the direction of the price itself. While meta-labeling is not implemented in this notebook, the event structure and primary labels produced here are designed to directly support that later enhancement.

6. **Event Enrichment (Trading Days, Daily Return, etc.)**
   The notebook also computes event features such as the number of trading days between entry and exit and the event’s average daily return. These features reinforce the economic interpretability of labels and support the meta-model in distinguishing high-quality signals from noisy or ambiguous ones.

This notebook implements the AFML event-based labeling process up to the construction of the primary side labels. The labeling methods used in this project are implemented in the file **Labeling.py**. The pipeline offers a principled way to convert raw price series into structured, unbiased, and finance-aware labels suitable for directional machine learning models.
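
As a rough illustration of steps 1 and 2, the sketch below pairs an EWMA volatility estimate with a symmetric CUSUM filter that accepts a per-observation threshold. It is not the code in **Labeling.py**: the function names `ewma_volatility` and `symmetric_cusum_events`, the `span` parameter, and the toy price series are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def ewma_volatility(close: pd.Series, span: int = 30) -> pd.Series:
    """EWMA standard deviation of one-step log returns (illustrative helper)."""
    log_ret = np.log(close).diff()
    return log_ret.ewm(span=span).std()


def symmetric_cusum_events(log_price: pd.Series, threshold: pd.Series) -> pd.DatetimeIndex:
    """Timestamps at which the upward or downward cumulative sum crosses its threshold."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for t, change in log_price.diff().dropna().items():
        h = threshold.get(t, np.nan)
        if pd.isna(h):
            continue  # no threshold available yet (e.g. volatility warm-up)
        s_pos = max(0.0, s_pos + change)  # accumulates upward moves, floored at 0
        s_neg = min(0.0, s_neg + change)  # accumulates downward moves, capped at 0
        if s_pos > h:
            s_pos = 0.0  # reset after an upward event
            events.append(t)
        elif s_neg < -h:
            s_neg = 0.0  # reset after a downward event
            events.append(t)
    return pd.DatetimeIndex(events)


# Toy series (hypothetical prices) with one sharp rise and one sharp drop.
dates = pd.bdate_range("2023-01-02", periods=8)
close = pd.Series([100.0, 100.5, 100.2, 104.0, 104.1, 103.9, 99.0, 99.2], index=dates)

# A constant 3% threshold keeps the example easy to follow by hand.
events = symmetric_cusum_events(np.log(close), pd.Series(0.03, index=dates))
vol = ewma_volatility(close)
print(events)
```

In the notebook itself the threshold is `filter_threshold * volatility` rather than a constant, so the filter becomes less sensitive when a stock is more volatile.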


In [1]:
import numpy as np
import pandas as pd
from typing import List
import pickle

## Data Preprocessing

In [2]:
with open('./Data/feature_scores.pkl', 'rb') as file:
    features = pickle.load(file)

In [3]:
features = features.reset_index()

In [4]:
data_final = features.copy()

In [5]:
data_final

Unnamed: 0,Date,PERMNO,ma_5,ema_5,slope_ma_5,ma_10,ema_10,slope_ma_10,ma_20,ema_20,...,vol,adv,dvol,vol_z,turnover_proxy,vwap_proxy,pv_corr,pv_divergence,label,DlyClose
0,2013-04-01,10026,1.171313,1.161774,0.541080,1.141680,1.141524,1.749915,1.092093,1.109808,...,-0.449389,-0.453514,0.669874,0.502964,0.290945,1.171328,-0.817272,0.817272,-0.022231,75.12
1,2013-04-01,10145,1.149054,1.142536,-0.859211,1.146074,1.138618,0.566247,1.117035,1.122651,...,1.285469,0.981474,0.136404,0.078073,-0.002715,1.134350,-0.045139,0.045139,-0.010494,74.33
2,2013-04-01,10158,-0.801484,-0.802927,0.206227,-0.806610,-0.801662,-0.207468,-0.795113,-0.794578,...,-0.411934,-0.404023,-0.343869,-0.318661,-0.525320,-0.802024,-2.006186,2.006186,-0.019499,7.18
3,2013-04-01,10207,-0.814302,-0.813424,-0.077890,-0.813643,-0.812927,-0.040022,-0.811231,-0.811746,...,-0.457803,-0.453907,1.305792,-0.616476,-0.580649,-0.813009,-1.320348,1.320348,-0.020349,6.88
4,2013-04-01,10252,-0.077493,-0.081945,-0.362007,-0.078820,-0.081338,-0.028474,-0.085185,-0.082850,...,-0.437524,-0.445881,0.222168,0.348070,0.388319,-0.088090,0.144052,-0.144052,-0.020997,31.91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2655554,2023-12-21,93380,-0.368517,-0.368925,0.673617,-0.373438,-0.370163,-0.090457,-0.369716,-0.366960,...,-0.116259,-0.236797,1.154491,0.825230,0.758403,-0.366640,-0.337685,0.337685,0.049730,35.19
2655555,2023-12-21,93415,-0.333871,-0.332995,-0.029628,-0.332091,-0.330406,-0.549918,-0.321932,-0.325615,...,-0.485686,-0.518971,-1.034074,-0.880380,-0.998471,-0.333002,1.030022,-1.030022,0.064242,37.67
2655556,2023-12-21,93419,-0.603996,-0.605305,-0.068915,-0.609252,-0.609151,-0.320187,-0.618376,-0.615913,...,0.526555,0.665130,-0.541223,-0.493217,-0.274511,-0.603613,0.499878,-0.499878,0.041387,8.94
2655557,2023-12-21,93427,1.105182,1.094618,1.702947,1.054806,1.067444,3.346722,1.010483,1.043644,...,-0.403387,-0.430691,-0.773993,-0.262785,-0.202414,1.110732,1.394271,-1.394271,-0.005694,191.42


## Event-Based Side Labeling Using Triple-Barrier Labeling (Long/Short Prediction)

In [6]:
from Labeling import *

In [7]:
permno_list = data_final["PERMNO"].unique()

In [10]:
def apply_triple_barrier_labeling(
        data: pd.DataFrame,
        filter_threshold: float = 1.5,
        return_min: float = 0.02) -> pd.DataFrame:
    """Detect events for one PERMNO and attach triple-barrier side labels."""

    data = data.sort_values('Date').copy()
    data = data.set_index("Date")

    prices = data["DlyClose"]

    # Volatility estimate based on log returns (helper from Labeling.py, parameter 30).
    volatility = daily_volatility_with_log_returns(prices, 30)

    # Detect events with a per-observation threshold (filter_threshold * volatility)
    # so that only meaningful market movements are flagged.
    # Returns the datetime index of the events.
    molecules = cusum_filter_events_dynamic_threshold(np.log(prices), filter_threshold * volatility)

    # For each detected event, return the end time of the maximum holding period of 20 days.
    vertical_barriers = vertical_barrier(prices, molecules, 20)

    # Build the events DataFrame containing the start date, end time, and return of each event.
    # The end time is set by the triple-barrier method: whichever barrier is hit first.
    triple_barrier_events, _ = meta_events(prices, vertical_barriers.index, [1, 1], volatility, 0, 1, vertical_barriers)

    # Convert event returns into three-class side labels (+1, 0, -1).
    side = triple_barrier_labeling(triple_barrier_events, prices, return_min=return_min, three_class=True)

    # Keep only the rows (dates) on which an event starts.
    return pd.merge(data, side, left_index=True, right_index=True)
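
To make the "whichever barrier is hit first" rule concrete, here is a minimal sketch for a single event. It is a simplified stand-in and not the `meta_events` / `triple_barrier_labeling` code from **Labeling.py**: the function `first_touch_label`, its parameters, and the toy prices are hypothetical, and it assumes simple returns measured from the event start with symmetric barrier multipliers.

```python
import numpy as np
import pandas as pd


def first_touch_label(close, start, vertical_end, target, pt=1.0, sl=1.0, return_min=0.02):
    """Label one event by the first barrier touched on the path after `start`."""
    path = close.loc[start:vertical_end]
    returns = path / close.loc[start] - 1.0        # simple return since event start
    upper, lower = pt * target, -sl * target       # horizontal barriers
    hit_up = returns[returns >= upper].index.min()    # NaT if never touched
    hit_down = returns[returns <= lower].index.min()  # NaT if never touched
    # The earliest touch wins; the vertical barrier is the fallback.
    end = pd.Series([hit_up, hit_down, path.index[-1]]).dropna().min()
    ret = returns.loc[end]
    # Three-class label: 0 when the realized return is below the minimum threshold.
    side = 0 if abs(ret) < return_min else int(np.sign(ret))
    return end, ret, side


# Toy price path (hypothetical values).
dates = pd.bdate_range("2023-01-02", periods=6)
close = pd.Series([100.0, 101.0, 99.5, 103.5, 104.0, 101.0], index=dates)

# 3% barriers: the profit-taking barrier is touched on the fourth day.
end_a, ret_a, side_a = first_touch_label(close, dates[0], dates[5], target=0.03)

# 10% barriers: nothing is touched, so the vertical barrier closes the event.
end_b, ret_b, side_b = first_touch_label(close, dates[0], dates[5], target=0.10)
print(end_a, side_a, end_b, side_b)
```

The second call shows why the zero class exists: an event that expires at the vertical barrier with a return below `return_min` is treated as having no meaningful direction.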

In [11]:
df_events = (
        data_final.groupby('PERMNO', group_keys=False)
          .apply(
              apply_triple_barrier_labeling,
          )
    )


In [12]:
df_events

Unnamed: 0,PERMNO,ma_5,ema_5,slope_ma_5,ma_10,ema_10,slope_ma_10,ma_20,ema_20,slope_ma_20,...,pv_corr,pv_divergence,label,DlyClose,timestamp,End Time,Return,trade_days,Return_dly,Side
2013-04-05,10026,1.151538,1.150492,-1.564549,1.160615,1.146049,0.384526,1.116281,1.123467,1.734052,...,0.603224,-0.603224,0.029078,73.94,2013-04-05,2013-04-11,0.018799,5,0.003760,0.0
2013-04-10,10026,1.128310,1.128761,-0.233729,1.148390,1.133827,-0.757602,1.127092,1.122325,1.362931,...,0.457884,-0.457884,-0.009893,74.80,2013-04-10,2013-04-12,0.017246,3,0.005749,0.0
2013-04-15,10026,1.141294,1.151321,1.066644,1.143037,1.145207,-0.029673,1.141710,1.131694,1.077090,...,0.539036,-0.539036,0.014664,74.33,2013-04-15,2013-04-16,0.026773,2,0.013386,1.0
2013-04-16,10026,1.158194,1.166039,2.238585,1.145787,1.154400,0.390880,1.147882,1.137863,1.558358,...,0.176387,-0.176387,-0.005045,76.32,2013-04-16,2013-04-17,-0.029612,2,-0.014806,-1.0
2013-04-17,10026,1.164892,1.164092,0.197541,1.146782,1.155495,0.002511,1.151120,1.140057,0.658019,...,0.292920,-0.292920,0.018363,74.06,2013-04-17,2013-04-19,0.018904,3,0.006301,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-11-06,93429,1.098561,1.107769,1.194385,1.109865,1.103695,1.630473,1.088649,1.085479,1.606221,...,1.899549,-1.899549,0.025276,174.08,2023-11-06,2023-11-09,0.013040,4,0.003260,0.0
2023-11-09,93429,1.138164,1.143867,3.970029,1.116274,1.127564,1.474208,1.105721,1.102399,2.554780,...,0.515106,-0.515106,0.008166,176.35,2023-11-09,2023-11-13,0.012078,3,0.004026,0.0
2023-11-13,93429,1.153401,1.155605,1.385615,1.128295,1.140842,1.597454,1.114112,1.114957,2.110616,...,0.382692,-0.382692,-0.009413,178.48,2023-11-13,2023-11-28,0.013503,11,0.001228,0.0
2023-11-22,93429,1.113766,1.116453,1.050906,1.129846,1.120802,0.594064,1.127156,1.116272,0.842195,...,0.212575,-0.212575,0.011829,180.06,2023-11-22,2023-11-30,0.011829,6,0.001972,0.0
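
The `trade_days` and `Return_dly` values displayed above appear consistent with an inclusive count of trading days between the event start and `End Time`, and with `Return_dly = Return / trade_days`. The sketch below reproduces the first displayed row under that reading. This is an inference from the output, not a confirmed description of **Labeling.py**; the helper `enrich_event` is hypothetical, and a business-day range stands in for the actual trading calendar.

```python
import pandas as pd


def enrich_event(price_index: pd.DatetimeIndex, start, end, event_return: float):
    """Inclusive trading-day count and average daily return for one event."""
    # Positions in the index of traded days; +1 counts both endpoints.
    trade_days = price_index.get_loc(pd.Timestamp(end)) - price_index.get_loc(pd.Timestamp(start)) + 1
    return trade_days, event_return / trade_days


# Business days as a stand-in for the real trading calendar (assumption).
calendar = pd.bdate_range("2013-04-01", "2013-04-30")

# Values taken from the first displayed row: 2013-04-05 -> 2013-04-11, Return 0.018799.
days, daily = enrich_event(calendar, "2013-04-05", "2013-04-11", 0.018799)
print(days, round(daily, 6))
```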


In [13]:
df_events["Side"].value_counts()

Side
 0.0    242774
 1.0    226943
-1.0    218507
Name: count, dtype: int64

In [46]:
# Optionally drop events labeled 0, i.e. events whose return stays below the minimum-return threshold.
#df_events = df_events[df_events["Side"] != 0]

In [14]:
df_events = df_events.reset_index(names="Date")

In [15]:
with open('./Data/feature_scores_events.pkl', 'wb') as file:
    pickle.dump(df_events, file)
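
Meta-labeling is left to a later stage, but the rule described in the introduction can be sketched in a few lines. The example below is only one possible reading of that rule: the function `meta_labels`, the `min_return` parameter, and the toy predictions are assumptions, and the predicted sides would in practice come from a trained primary model.

```python
import pandas as pd


def meta_labels(predicted_side: pd.Series, realized_return: pd.Series,
                min_return: float = 0.0) -> pd.Series:
    """1 if acting on the predicted side earned more than `min_return`, else 0."""
    # A positive signed return means the predicted direction matched the realized one.
    signed_return = predicted_side * realized_return
    return (signed_return > min_return).astype(int)


# Hypothetical primary-model predictions and realized event returns.
predicted = pd.Series([1, -1, 1, -1])
realized = pd.Series([0.03, 0.025, -0.03, -0.01])

meta = meta_labels(predicted, realized, min_return=0.02)
print(meta.tolist())
```

Only the first event receives a meta-label of 1: the second and third predictions point the wrong way, and the fourth is correct in direction but its return stays below the threshold.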