[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)
[![Open In Kaggle](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-kaggle.svg)](https://www.kaggle.com/code/crunchdao/structural-break-baseline)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [None]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break Cb7h3ABFcv8KpU5wFV65q7dM

crunch-cli, version 8.0.0
you appear to have never submitted code before
data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
                                
---
Success! Your environment has been correctly setup.
Next recommended actions:
1. Load the Crunch Toolings: `crunch = crunch.load_notebook()`
2. Execute the cells with your code
3. Run a test: `crunch.test()`
4. Download and submit your code to t

# Your model

## Setup

In [None]:
!pip install -q pandas numpy scikit-learn scipy catboost lightgbm xgboost statsmodels joblib tqdm pyarrow

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m99.2/99.2 MB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
import os
import typing
import warnings

# Import your dependencies
import joblib
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import sklearn.metrics

from scipy.fft import fft
from scipy.signal import find_peaks
from scipy.stats import skew, kurtosis, ks_2samp, wasserstein_distance, entropy, linregress
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import VotingClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from statsmodels.tsa.stattools import acf

# Speed and warning configs
warnings.filterwarnings("ignore")

In [None]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 8.0.0
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you setup your local environment and is now available in the `data/` directory.

In [None]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


### Understanding `X_train`

The training data is structured as a pandas DataFrame with a MultiIndex:

**Index Levels:**
- `id`: Identifies the unique time series
- `time`: (arbitrary) The time step within each time series, which is regularly sampled

**Columns:**
- `value`: The values of the time series at each given time step
- `period`: whether you are in the first part of the time series (`0`), before the presumed break point, or in the second part (`1`), after the break point

In [None]:
X_train

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,-0.005564,0
0,1,0.003705,0
0,2,0.013164,0
0,3,0.007151,0
0,4,-0.009979,0
...,...,...,...
10000,2134,0.001137,1
10000,2135,0.003526,1
10000,2136,0.000687,1
10000,2137,0.001640,1


### Understanding `y_train`

This is a simple `pandas.Series` that tells if a time series id has a structural break, or not, from the presumed break point on.

**Index:**
- `id`: the ID of the time series

**Value:**
- `structural_breakpoint`: Boolean indicating whether a structural break occurred (`True`) or not (`False`)

In [None]:
y_train

Unnamed: 0_level_0,structural_breakpoint
id,Unnamed: 1_level_1
0,False
1,False
2,True
3,False
4,False
...,...
9996,False
9997,False
9998,False
9999,False


### Understanding `X_test`

The test data is provided as a **`list` of `pandas.DataFrame`s** with the same format as [`X_train`](#understanding-X_test).

It is structured as a list to encourage processing records one by one, which will be mandatory in the `infer()` function.

In [None]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [None]:
X_test[0]

Unnamed: 0_level_0,Unnamed: 1_level_0,value,period
id,time,Unnamed: 2_level_1,Unnamed: 3_level_1
10001,0,0.010753,0
10001,1,-0.031915,0
10001,2,-0.010989,0
10001,3,-0.011111,0
10001,4,0.011236,0
10001,...,...,...
10001,2774,-0.013937,1
10001,2775,-0.015649,1
10001,2776,-0.009744,1
10001,2777,0.025375,1


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

The baseline implementation below uses a simple statistical approach: a t-test to compare the distributions before and after the boundary point.

# Feature Engineering Function

In [None]:
def create_features(data: pd.DataFrame) -> pd.DataFrame:
    """
    Generates a comprehensive set of features from the raw time series data.
    """
    # Define helper functions for feature extraction
    def summary_stats(group, prefix):
        s = group['value']
        return pd.Series({
            f"{prefix}_mean": s.mean(), f"{prefix}_median": s.median(), f"{prefix}_max": s.max(),
            f"{prefix}_min": s.min(), f"{prefix}_std": s.std(), f"{prefix}_var": s.var(),
            f"{prefix}_skew": s.skew(), f"{prefix}_kurt": s.kurt(),
            f"{prefix}_q25": s.quantile(0.25), f"{prefix}_q75": s.quantile(0.75),
            f"{prefix}_iqr": s.quantile(0.75) - s.quantile(0.25),
            f"{prefix}_rms": np.sqrt(np.mean(s**2)), f"{prefix}_ptp": np.ptp(s)
        })

    def dist_features(group):
        before = group[group['period']==0]['value']
        after = group[group['period']==1]['value']
        eps = 1e-5
        if len(before) > 1 and len(after) > 1:
            p_hist, _ = np.histogram(before, bins=20, density=True)
            q_hist, _ = np.histogram(after, bins=20, density=True)
            return pd.Series({
                "ks_stat": ks_2samp(before, after).statistic,
                "wasserstein": wasserstein_distance(before, after),
                "kl_div": entropy(p_hist + eps, q_hist + eps)
            })
        return pd.Series({"ks_stat": np.nan, "wasserstein": np.nan, "kl_div": np.nan})

    def autocorr_features(x, prefix, lags=[1, 5, 10]):
        if len(x) <= max(lags): return pd.Series({f"{prefix}_acf_lag_{lag}": np.nan for lag in lags})
        acf_vals = acf(x, nlags=max(lags), fft=True)
        return pd.Series({f"{prefix}_acf_lag_{lag}": acf_vals[lag] for lag in lags})

    def slope_feat(x):
        return linregress(np.arange(len(x)), x).slope if len(x) > 1 else np.nan

    def fft_features(x):
        if len(x) < 2: return pd.Series({"fft_power": np.nan, "fft_dominant_freq": np.nan})
        fft_vals = np.abs(fft(x.values))[:len(x)//2]
        return pd.Series({
            "fft_power": np.sum(fft_vals**2),
            "fft_dominant_freq": np.argmax(fft_vals) if len(fft_vals) > 0 else 0
        })

    def shape_features(x):
        if len(x) < 2: return pd.Series({"mean_abs_change": np.nan, "num_peaks": np.nan, "binned_entropy": np.nan})
        return pd.Series({
            "mean_abs_change": np.mean(np.abs(np.diff(x.values))),
            "num_peaks": len(find_peaks(x.values)[0]),
            "binned_entropy": entropy(np.histogram(x, bins=10)[0])
        })

    # --- Feature Generation ---
    # 1. Basic Summary Stats
    stats_before = data[data['period']==0].groupby("id").apply(lambda g: summary_stats(g, "before"))
    stats_after  = data[data['period']==1].groupby("id").apply(lambda g: summary_stats(g, "after"))
    features = pd.concat([stats_before, stats_after], axis=1)

    # 2. Delta & Ratio Stats
    eps = 1e-5
    for stat in ["mean","median","max","min","std","var","skew","kurt","iqr","rms","ptp"]:
        features[f"delta_{stat}"] = features[f"after_{stat}"] - features[f"before_{stat}"]
        features[f"ratio_{stat}"] = features[f"after_{stat}"] / (features[f"before_{stat}"] + eps)

    # 3. Distribution Comparison Features
    dist_feats = data.groupby("id").apply(dist_features)
    features = features.join(dist_feats)

    # 4. Autocorrelation Features
    acf_before = data[data['period']==0].groupby('id')['value'].apply(lambda s: autocorr_features(s, "before")).unstack()
    acf_after = data[data['period']==1].groupby('id')['value'].apply(lambda s: autocorr_features(s, "after")).unstack()
    features = features.join(acf_before).join(acf_after)

    # 5. Slope Features
    slope_before = data[data['period']==0].groupby("id")['value'].apply(slope_feat)
    slope_after = data[data['period']==1].groupby("id")['value'].apply(slope_feat)
    features['delta_slope'] = slope_after - slope_before

    # print(features.head(2))
    # 6. FFT Features
    fft_b = data[data['period']==0].groupby('id')['value'].apply(fft_features).add_prefix("before_").unstack().reset_index(drop=True)
    fft_a = data[data['period']==1].groupby('id')['value'].apply(fft_features).add_prefix("after_").unstack().reset_index(drop=True)
    fft_b.index.name='id'
    fft_a.index.name='id'
    features = features.join(fft_b).join(fft_a)
    features["delta_fft_power"] = features["after_fft_power"] - features["before_fft_power"]
    features["delta_dominant_freq"] = features["after_fft_dominant_freq"] - features["before_fft_dominant_freq"]

    # 7. Shape & Complexity Features
    shape_b = data[data['period']==0].groupby('id')['value'].apply(shape_features).add_prefix("before_").unstack().reset_index(drop=True)
    shape_a = data[data['period']==1].groupby('id')['value'].apply(shape_features).add_prefix("after_").unstack().reset_index(drop=True)
    shape_b.index.name='id'
    shape_a.index.name='id'
    features = features.join(shape_b).join(shape_a)
    features["delta_mean_abs_change"] = features["after_mean_abs_change"] - features["before_mean_abs_change"]
    features["delta_num_peaks"] = features["after_num_peaks"] - features["before_num_peaks"]
    features["delta_binned_entropy"] = features["after_binned_entropy"] - features["before_binned_entropy"]

    # --- Final Cleanup ---
    features.replace([np.inf, -np.inf], np.nan, inplace=True)
    return features

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The baseline implementation below doesn't require a pre-trained model, as it uses a statistical test that will be computed at inference time.

In [2]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    """
    Trains the model pipeline and saves it to a file.
    """
    print("üîπ Starting model training...")

    # 1. Generate features from the training data
    print("   - Generating features...")
    features = create_features(X_train)

    # Align target variable with the generated features index
    y_train_aligned = y_train.loc[features.index]

    lgbm_best_params=lgbm_params = {
        'n_estimators': 3631,
        'learning_rate': 0.006825835793978852,
        'num_leaves': 133,
        'max_depth': 10,
        'min_child_samples': 48,
        'feature_fraction': 0.5341049053295885,
        'bagging_fraction': 0.9843374214084879,
        'bagging_freq': 1,
        'lambda_l1': 0.9230400919559729,
        'lambda_l2': 1.5880313004821216,
        # 'objective': 'binary',
        # 'metric': 'auc',
        # 'random_state': 42,
        # 'verbose': -1
    }

    xgb_best_params = {
        'n_estimators': 1323,
        'learning_rate': 0.004254091716023784,
        'max_depth': 11,
        'subsample': 0.8031404416363784,
        'colsample_bytree': 0.5218353889087578,
        'gamma': 0.00014732440323895037,
        'reg_lambda': 5.5547604991334626e-05,  # instead of "lambda"
        'reg_alpha': 0.7350577325691311,       # instead of "alpha"
        # 'objective': 'binary:logistic',
        # 'eval_metric': 'auc',
        # 'tree_method': 'hist',
        # 'device': 'cuda',
        # 'random_state': 42,
        # 'verbosity': 0
    }

    catboost_best_params = {
        'iterations': 1500,                        # similar to n_estimators
        'learning_rate': 0.005,                    # in between your tuned LGBM/XGB values
        'depth': 10,                               # maps to max_depth
        'l2_leaf_reg': 1.0,                        # like reg_lambda / lambda_l2
        'subsample': 0.8,                          # like bagging_fraction / subsample
        'colsample_bylevel': 0.5,                  # like colsample_bytree
        'random_strength': 0.8,                    # like gamma (regularization strength)
        'border_count': 128,                       # number of splits per feature (usually 128/254)
        # 'loss_function': 'Logloss',                # binary classification
        # 'eval_metric': 'AUC',
        # 'task_type': 'GPU',                        # use GPU
        # 'random_seed': 42,
        # 'verbose': 200
    }
    
    
    # 2. Define models
    print("   - Defining models...")
    xgb = XGBClassifier(**xgb_best_params,random_state=42, eval_metric="logloss", n_jobs=-1)
    lgbm = LGBMClassifier(**lgbm_best_params,random_state=42, verbose=-1, n_jobs=-1)
    cat = CatBoostClassifier(**catboost_best_params,random_seed=42, verbose=0, thread_count=-1)

    # 3. Create the full pipeline with scaling and voting classifier
    model = Pipeline([
        ('scaler', MinMaxScaler()),
        ('vote', VotingClassifier(
            estimators=[('xgb', xgb), ('lgbm', lgbm), ('cat', cat)],
            voting='soft',
            n_jobs=-1
        ))
    ])

    # 4. Fit the pipeline
    print("   - Fitting the model pipeline...")
    model.fit(features, y_train_aligned)

    # 5. Save the trained pipeline
    model_path = os.path.join(model_directory_path, 'model.joblib')
    joblib.dump(model, model_path)
    print(f"‚úÖ Model saved to {model_path}")

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!

In [None]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    """
    Loads the model and generates predictions for the test set.
    """
    # 1. Load the trained model pipeline from the specified directory
    model_path = os.path.join(model_directory_path, 'model.joblib')
    try:
        model = joblib.load(model_path)
    except FileNotFoundError:
        print(f"Error: Model file not found at {model_path}. Please train the model first.")
        return

    # 2. Signal that the model is ready to receive data
    yield

    # 3. Process each test dataset instance as it arrives
    print("üîπ Starting inference...")
    for dataset in X_test:
        # Generate the same set of features as used in training
        features = create_features(dataset)

        # Predict the probability of a structural breakpoint (class 1)
        # The model expects a 2D array, and `features` is a 1-row DataFrame
        prediction_proba = model.predict_proba(features)[0, 1]

        # Yield the prediction for the current dataset
        yield prediction_proba

## Local testing

To make sure your `train()` and `infer()` function are working properly, you can call the `crunch.test()` function that will reproduce the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [None]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

08:07:08 
08:07:08 started
08:07:08 running local test
08:07:08 internet access isn't restricted, no check will be done
08:07:08 
08:07:09 starting unstructured loop...
08:07:09 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match
üîπ Starting model training...
   - Generating features...
   - Defining models...
   - Fitting the model pipeline..

08:09:42 executing - command=infer


‚úÖ Model saved to resources/model.joblib
üîπ Starting inference...


08:09:47 checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)
08:09:47 executing - command=infer


üîπ Starting inference...


08:09:49 determinism check: passed
08:09:49 save prediction - path=data/prediction.parquet
08:09:49 ended
08:09:49 duration - time=00:02:41
08:09:49 memory - before="2.01 GB" after="1.61 GB" consumed="-399720448 bytes"


## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

Unnamed: 0_level_0,prediction
id,Unnamed: 1_level_1
10001,0.107940
10002,0.127877
10003,0.307071
10004,0.118086
10005,0.215866
...,...
10097,0.254205
10098,0.160773
10099,0.291095
10100,0.279144


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

np.float64(0.7009389671361502)

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)