[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/baseline/baseline.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# ADIA Lab Structural Break Challenge

## Challenge Overview

Welcome to the ADIA Lab Structural Break Challenge! In this challenge, you will analyze univariate time series data to determine whether a structural break has occurred at a specified boundary point.

### What is a Structural Break?

A structural break occurs when the process governing the data generation changes at a certain point in time. These changes can be subtle or dramatic, and detecting them accurately is crucial across various domains such as climatology, industrial monitoring, finance, and healthcare.

![Structural Break Example](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/competitions/structural-break/quickstarters/baseline/images/example.png)

### Your Task

For each time series in the test set, you need to predict a score between `0` and `1`:
- Values closer to `0` indicate no structural break at the specified boundary point;
- Values closer to `1` indicate a structural break did occur.

### Evaluation Metric

The evaluation metric is [ROC AUC (Area Under the Receiver Operating Characteristic Curve)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), which measures the performance of detection algorithms regardless of their specific calibration.

- ROC AUC around `0.5`: No better than random chance;
- ROC AUC approaching `1.0`: Perfect detection.
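
Because ROC AUC depends only on how the scores rank the series, any strictly increasing transformation of the scores leaves it unchanged. The short sketch below illustrates this; the labels and scores are made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = break) and scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

auc_raw = roc_auc_score(y_true, scores)

# A strictly increasing transformation keeps the ranking, hence the AUC
auc_shifted = roc_auc_score(y_true, 10 * scores - 3)

# Reversing the ranking mirrors the AUC around 0.5
auc_reversed = roc_auc_score(y_true, -scores)

print(auc_raw, auc_shifted, auc_reversed)
```

This is why a score does not need to be a calibrated probability to do well on this metric; only the ordering matters.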

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [1]:
%pip install crunch-cli --upgrade --quiet --progress-bar off
!crunch setup-notebook structural-break JmkqCcYe1p9Bi5N91CaA9RMV

crunch-cli, version 6.7.0
you appear to have never submitted code before
data/X_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
                                
---
Success! Your environment has been correctly setup.
Next recommended actions:
1. Load the Crunch Toolings: `crunch = crunch.load_notebook()`
2. Execute the cells with your code
3. Run a test: `crunch.test()`
4. Download and submit your code to the p

# Your model

## Setup

In [23]:
# Import your dependencies
import os
import pickle
import typing

import joblib
import lightgbm as lgb
import numpy as np
import pandas as pd
import scipy.stats  # imported explicitly so that scipy.stats.ttest_ind resolves
import sklearn.metrics
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

In [3]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 6.7.0
available ram: 12.67 gb
available cpu: 2 core
----


## Understanding the Data

The dataset consists of univariate time series, each containing ~2,000-5,000 values with a designated boundary point. For each time series, you need to determine whether a structural break occurred at this boundary point.

The data was downloaded when you set up your local environment and is now available in the `data/` directory.
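
The code later in this notebook groups the data with `groupby(level=0)` and reads the `value` and `period` columns, which suggests a frame indexed by `(id, time)`. The toy frame below follows that assumed layout (synthetic numbers, not the competition data) and shows how one series is split at its boundary.

```python
import numpy as np
import pandas as pd

# Assumed layout: MultiIndex (id, time) with columns `value` and `period`,
# where period is 0 before the boundary point and 1 after it.
rng = np.random.default_rng(0)
frames = []
for series_id, (n_before, n_after) in enumerate([(6, 4), (5, 5)]):
    n = n_before + n_after
    frames.append(pd.DataFrame({
        "id": series_id,
        "time": np.arange(n),
        "value": rng.normal(size=n),
        "period": [0] * n_before + [1] * n_after,
    }))
X_toy = pd.concat(frames).set_index(["id", "time"])

# Select one series and split it at the boundary
series = X_toy.loc[0]
before = series.loc[series["period"] == 0, "value"]
after = series.loc[series["period"] == 1, "value"]
print(len(before), len(after))  # 6 4
```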

In [4]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https://crunchdao--competition--production.s3.eu-west-1.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


## Strategy Implementation

There are multiple approaches you can take to detect structural breaks:

1. **Statistical Tests**: Compare distributions before and after the boundary point;
2. **Feature Engineering**: Extract features from both segments for comparison;
3. **Time Series Modeling**: Detect deviations from expected patterns;
4. **Machine Learning**: Train models to recognize break patterns from labeled examples.

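The official baseline uses a simple statistical approach: a t-test comparing the values before and after the boundary point. It is still what `infer()` computes further down. The `train()` function in this notebook takes the feature-engineering and machine-learning route instead: it summarizes both segments of each series with a few statistics and fits a LightGBM classifier on them.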
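
A t-test only reacts to a shift in the mean. As an illustrative alternative for the first approach (not part of the baseline), the two-sample Kolmogorov-Smirnov statistic from `scipy.stats.ks_2samp` also reacts to changes in spread or shape. The helper name `ks_score` and the synthetic series are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_score(series: pd.DataFrame) -> float:
    """Score a series with the KS statistic between its two segments (0 to 1)."""
    before = series.loc[series["period"] == 0, "value"]
    after = series.loc[series["period"] == 1, "value"]
    return float(ks_2samp(before, after).statistic)

rng = np.random.default_rng(42)
period = np.repeat([0, 1], 500)

# Same distribution on both sides of the boundary
no_break = pd.DataFrame({"value": rng.normal(0, 1, 1000), "period": period})

# Standard deviation triples after the boundary while the mean stays the same
with_break = pd.DataFrame({
    "value": np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)]),
    "period": period,
})

print(ks_score(no_break), ks_score(with_break))
```

The statistic already lies between `0` and `1`, with larger values meaning the two segments look more different, so it can be used directly as a score.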

### The `train()` Function

In this function, you build and train your model for making inferences on the test data. Your model must be stored in the `model_directory_path`.

The t-test baseline needs no pre-trained model, because the statistic is computed at inference time. The LightGBM model below, by contrast, has to be fitted here and saved so that `infer()` can load it.
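
Persisting a model could look like the sketch below. It assumes the file name `model.joblib`, which is the one `infer()` loads later in this notebook. `save_to_model_directory` is a hypothetical helper, a plain dictionary stands in for a fitted estimator, and a temporary directory stands in for `model_directory_path`.

```python
import os
import tempfile

import joblib

def save_to_model_directory(model, model_directory_path: str) -> str:
    """Hypothetical helper: write `model` to <model_directory_path>/model.joblib."""
    os.makedirs(model_directory_path, exist_ok=True)
    path = os.path.join(model_directory_path, "model.joblib")
    joblib.dump(model, path)
    return path

# Stand-in for a fitted estimator
fake_model = {"threshold": 0.5}

with tempfile.TemporaryDirectory() as model_directory_path:
    saved_path = save_to_model_directory(fake_model, model_directory_path)
    restored = joblib.load(saved_path)

print(restored == fake_model)  # True
```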

In [19]:
def predict_single_dataset(model, X_test_sample):
    """
    Make a prediction for a single dataset.
    X_test_sample should be a DataFrame holding the data of one id.

    Returns (prediction, probability, features), or (None, None, None)
    when no features can be extracted for the series.
    """
    # Extract features for this single series
    features = extract_features_for_single_series(X_test_sample)

    if features is None:
        # No boundary found: keep the same return arity as the normal path
        return None, None, None

    # Convert to a one-row DataFrame for prediction
    features_df = pd.DataFrame([features])

    # Predicted class and per-class probabilities for that row
    prediction = model.predict(features_df)[0]
    probability = model.predict_proba(features_df)[0]

    return prediction, probability, features

In [20]:
def extract_features_for_single_series(df):
    """
    Extract features for a single time series (one id).
    df should be a DataFrame with 'value' and 'period' columns,
    where period is 0 before the boundary and 1 after it.
    """
    # Split the series with boolean masks, so the result does not depend
    # on the index being a plain integer time index
    before = df.loc[df['period'] == 0, 'value']
    after = df.loc[df['period'] == 1, 'value']

    if before.empty or after.empty:
        # No boundary in this series (one side is missing): nothing to compare
        return None

    # Rolling features (window of 3 points, averaged over the segment)
    rolling_mean_before = before.rolling(window=3, min_periods=1).mean().mean()
    rolling_std_after = after.rolling(window=3, min_periods=1).std().mean()

    # Extract features (same as in training)
    features = {
        # Fraction of the series that lies before the boundary
        'boundary_loc': len(before) / len(df),
        'mean_before': before.mean(),
        'mean_after': after.mean(),
        'std_before': before.std(),
        'std_after': after.std(),
        'mean_change': after.mean() - before.mean(),
        'std_change': after.std() - before.std(),
        'rolling_mean_before': rolling_mean_before,
        'rolling_std_after': rolling_std_after,
    }

    return features

In [22]:
def save_model(model, filename='lightgbm_structural_break_model.pkl'):
    """Save the trained model"""
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    print(f"Model saved to {filename}")

In [25]:
# Custom model
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str = "resources",  # assumed default directory name
):
    # Build one feature row per series with the same extractor used for prediction
    features = []

    for id_, group in X_train.groupby(level=0):
        # Drop the id level so that only the time index remains
        df = group.reset_index(level=0, drop=True)
        feat = extract_features_for_single_series(df)
        if feat is None:
            continue  # no boundary in this series, skip it
        feat['id'] = id_
        features.append(feat)

    X = pd.DataFrame(features).set_index('id')
    # y_train is a Series indexed by id, so a single indexer is enough.
    # Passing a column label as well raises "IndexingError: Too many indexers".
    y = y_train.loc[X.index].astype(int).to_numpy()

    # Define the parameter distributions to sample from
    param_dist = {
        'num_leaves': randint(10, 100),
        'learning_rate': uniform(0.005, 0.2),   # uniform on [0.005, 0.205]
        'n_estimators': randint(50, 300),
        'feature_fraction': uniform(0.7, 0.3),  # uniform on [0.7, 1.0]
        'bagging_fraction': uniform(0.7, 0.3),  # uniform on [0.7, 1.0]
        'bagging_freq': [1],  # bagging_fraction only takes effect when bagging_freq > 0
        'min_child_samples': randint(5, 50),
        'reg_alpha': uniform(0, 1),
        'reg_lambda': uniform(0, 1),
        'seed': [123]
    }

    # Create the classifier
    lgbm = lgb.LGBMClassifier(objective='binary', verbose=-1)

    # Set up RandomizedSearchCV
    random_search = RandomizedSearchCV(
        lgbm,
        param_distributions=param_dist,
        n_iter=100,  # Try 100 different combinations
        scoring='roc_auc',
        cv=5,        # 5-fold cross-validation
        verbose=2,
        n_jobs=-1,
        random_state=42
    )

    # Fit the randomized search
    random_search.fit(X, y)

    print("Best parameters found:", random_search.best_params_)
    print("Best ROC AUC:", random_search.best_score_)

    # Save the best model where infer() expects to find it
    best_model = random_search.best_estimator_
    os.makedirs(model_directory_path, exist_ok=True)
    joblib.dump(best_model, os.path.join(model_directory_path, 'model.joblib'))

    return best_model

In [26]:
model = train(X_train, y_train)

### The `infer()` Function

In the inference function, your trained model (if any) is loaded and used to make predictions on test data.

**Important workflow:**
1. Load your model;
2. Use the `yield` statement to signal readiness to the runner;
3. Process each dataset one by one within the for loop;
4. For each dataset, use `yield prediction` to return your prediction.

**Note:** The datasets can only be iterated once!
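
The handshake can be pictured with a toy driver. `toy_runner` and `toy_infer` are hypothetical stand-ins written for this sketch, not the actual Crunch runner: the first `next()` call consumes the bare `yield`, and each later step hands over one prediction before the next dataset is fetched.

```python
import typing

def toy_infer(X_test: typing.Iterable[typing.List[float]]):
    # 1. Load the model here (nothing to load in this toy example)
    yield  # 2. Signal readiness

    # 3. and 4. Predict each dataset in turn
    for dataset in X_test:
        yield sum(dataset) / len(dataset)

def toy_runner(infer_fn, datasets):
    """Hypothetical driver mimicking the ready/predict handshake."""
    consumed = []

    def one_shot():
        # Datasets are handed out one at a time and cannot be replayed
        for dataset in datasets:
            consumed.append(dataset)
            yield dataset

    generator = infer_fn(one_shot())
    ready = next(generator)  # runs up to the bare `yield`
    assert ready is None

    predictions = list(generator)  # one prediction per dataset
    assert len(predictions) == len(consumed)
    return predictions

print(toy_runner(toy_infer, [[1.0, 3.0], [2.0, 2.0, 5.0]]))  # [2.0, 3.0]
```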

In [15]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    # Load the model saved by train(); the t-test baseline below does not use it yet
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))

    # Baseline approach: t-test between the values before and after the boundary point.
    # The negated p-value is the score. A smaller p-value means stronger evidence
    # that the two segments have different means, and it gives a score closer to 0,
    # which is a higher score. Scores fall in [-1, 0]; ROC AUC only depends on their ranking.
    def t_test(u: pd.DataFrame):
        return -scipy.stats.ttest_ind(
            u["value"][u["period"] == 0],  # Values before boundary point
            u["value"][u["period"] == 1],  # Values after boundary point
        ).pvalue

    yield  # Mark as ready

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for dataset in X_test:
        prediction = t_test(dataset)
        yield prediction  # Send the prediction for the current dataset
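
If you prefer to score with a fitted classifier rather than the t-test, the per-dataset step could look like the sketch below. It rests on assumptions: `model` is any object exposing scikit-learn's `predict_proba`, `extract` is a feature function returning a dict or `None`, and series without usable features fall back to a neutral `0.5`. `StubModel` and `mean_change_only` exist only to make the example runnable.

```python
import numpy as np
import pandas as pd

def score_with_model(model, extract, dataset: pd.DataFrame) -> float:
    """Return the predicted probability of a break for one dataset."""
    features = extract(dataset)
    if features is None:
        return 0.5  # assumption: neutral score when no features are available
    features_df = pd.DataFrame([features])
    # For a 0/1 target, the second column of predict_proba belongs to label 1
    return float(model.predict_proba(features_df)[0, 1])

class StubModel:
    """Stand-in for a fitted classifier: larger mean shifts score higher."""
    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        p = 1.0 - np.exp(-np.abs(X["mean_change"].to_numpy()))
        return np.column_stack([1.0 - p, p])

def mean_change_only(dataset: pd.DataFrame):
    """Toy feature function with a single feature."""
    before = dataset.loc[dataset["period"] == 0, "value"]
    after = dataset.loc[dataset["period"] == 1, "value"]
    if before.empty or after.empty:
        return None
    return {"mean_change": after.mean() - before.mean()}

toy = pd.DataFrame({"value": [0.0, 0.0, 2.0, 2.0], "period": [0, 0, 1, 1]})
print(score_with_model(StubModel(), mean_change_only, toy))
```

Inside `infer()`, the value returned by such a function would be the one passed to `yield` for each dataset.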

## Local testing

To make sure your `train()` and `infer()` functions work properly, call `crunch.test()`, which reproduces the cloud environment locally. <br />
Even if the reproduction is not perfect, it should quickly tell you whether your model works.

In [None]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [16]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

FileNotFoundError: [Errno 2] No such file or directory: 'data/prediction.parquet'

### Local scoring

You can estimate your score locally by calling the same function that the system uses.
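
`roc_auc_score` pairs targets and predictions by position, so both must be in the same order. The sketch below uses hypothetical toy Series indexed by series id (an assumption about the layout, not the actual contents of the parquet files) and aligns them before scoring.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical targets and predictions indexed by series id, in different orders
target_toy = pd.Series([True, False, True, False], index=[10, 11, 12, 13])
prediction_toy = pd.Series([0.2, 0.9, 0.1, 0.8], index=[13, 12, 11, 10])

# Reorder the predictions to match the targets before scoring
aligned = prediction_toy.reindex(target_toy.index)
print(roc_auc_score(target_toy, aligned))  # 1.0
```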

In [12]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"]

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

NameError: name 'sklearn' is not defined

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)