[![Open In Colab](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/badge/open-in-colab.svg)](https://colab.research.google.com/github/crunchdao/quickstarters/blob/master/competitions/structural-break/quickstarters/random-submission/random-submission.ipynb)

![Banner](https://raw.githubusercontent.com/crunchdao/quickstarters/refs/heads/master/competitions/structural-break/assets/banner.webp)

# Setup

The first steps to get started are:
1. Get the setup command
2. Execute it in the cell below

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Reveal token](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/reveal-token.gif)

In [2]:
# Install the Crunch CLI
%pip install --upgrade crunch-cli

# Set up your local environment
!crunch setup --notebook structural-break hello --token U7jQcdgS3U3PuGpaB3DcHfvA

crunch-cli, version 7.3.0
main.py: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/25817/main.py (1676 bytes)
notebook.ipynb: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/25817/notebook.ipynb (79774 bytes)
requirements.txt: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/25817/requirements.original.txt (154 bytes)
data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_test.reduced.parquet: download from https:crunchdao--c

# Your model

## Setup

In [3]:
import os
import random
import typing

# Import your dependencies
import joblib
import pandas as pd
import sklearn.metrics

In [4]:
import crunch

# Load the Crunch Toolings
crunch = crunch.load_notebook()

loaded inline runner with module: <module '__main__'>

cli version: 7.3.0
available ram: 12.67 gb
available cpu: 2 core
----


## Data

The data was downloaded when you set up your local environment and is now available in the `data/` directory.

In [5]:
# Load the data simply
X_train, y_train, X_test = crunch.load_data()

data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


### `X_train`

Index:
- `id`: the ID of the dataset
- `time`: arbitrary time steps, sampled regularly

Columns:
- `value`: the timeseries data
- `period`: whether the row belongs to the **initial segment** (0) or the **extension segment** (1)
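
As a small illustration of this layout, the sketch below builds a toy frame with the same index and columns, then splits one dataset into its two segments. The values are made up, and the names `toy`, `initial` and `extension` are only illustrative.

```python
import pandas as pd

# Toy stand-in for one dataset of X_train: an (id, time) MultiIndex
# with `value` and `period` columns. The numbers are made up.
index = pd.MultiIndex.from_product([[0], range(6)], names=["id", "time"])
toy = pd.DataFrame(
    {
        "value": [0.01, -0.02, 0.00, 0.03, 0.05, 0.04],
        "period": [0, 0, 0, 0, 1, 1],
    },
    index=index,
)

# Select a single dataset by id, then split it on `period`.
dataset = toy.loc[0]
initial = dataset.loc[dataset["period"] == 0, "value"]
extension = dataset.loc[dataset["period"] == 1, "value"]

print(len(initial), len(extension))
```

Comparing statistics of `initial` and `extension` is one simple way to look for a change between the two segments.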

In [6]:
X_train

| id | time | value | period |
|---|---|---|---|
| 0 | 0 | -0.005564 | 0 |
| 0 | 1 | 0.003705 | 0 |
| 0 | 2 | 0.013164 | 0 |
| 0 | 3 | 0.007151 | 0 |
| 0 | 4 | -0.009979 | 0 |
| ... | ... | ... | ... |
| 10000 | 2134 | 0.001137 | 1 |
| 10000 | 2135 | 0.003526 | 1 |
| 10000 | 2136 | 0.000687 | 1 |
| 10000 | 2137 | 0.001640 | 1 |


### `y_train`

This is a simple `pandas.Series` that indicates whether each dataset `id` has a structural breakpoint.

Index:
- `id`: the ID of the dataset

Value:
- `structural_breakpoint`: the value you need to predict

In [7]:
y_train

| id | structural_breakpoint |
|---|---|
| 0 | False |
| 1 | False |
| 2 | True |
| 3 | False |
| 4 | False |
| ... | ... |
| 9996 | False |
| 9997 | False |
| 9998 | False |
| 9999 | False |


### `X_test`

This is a **`list` of `pandas.DataFrame`** that have the same format as [`X_train`](#X_train).

It is provided as a list to encourage you to read the records **one by one**, __as this will be mandatory in the [`infer()`](#infer) function__.

In [8]:
print("Number of datasets:", len(X_test))

Number of datasets: 101


In [9]:
X_test[77].period.value_counts()

| period | count |
|---|---|
| 0 | 1629 |
| 1 | 709 |


In [10]:
X_test[0].head()

| id | time | value | period |
|---|---|---|---|
| 10001 | 0 | 0.010753 | 0 |
| 10001 | 1 | -0.031915 | 0 |
| 10001 | 2 | -0.010989 | 0 |
| 10001 | 3 | -0.011111 | 0 |
| 10001 | 4 | 0.011236 | 0 |


In [11]:
X_train.groupby('period')['value'].agg(['mean', 'std', 'min', 'max'])

| period | mean | std | min | max |
|---|---|---|---|---|
| 0 | 0.000571 | 0.108955 | -2.284821 | 412.5 |
| 1 | 0.000578 | 0.035626 | -0.966667 | 48.5 |


In [12]:
features = X_train.groupby(['id', 'period'])['value'].agg(['mean', 'std', 'min', 'max']).unstack()
features.columns = ['mean_pre', 'mean_post', 'std_pre', 'std_post', 'min_pre', 'min_post', 'max_pre', 'max_post']

# Add delta features
features['mean_diff'] = features['mean_post'] - features['mean_pre']
features['std_diff'] = features['std_post'] - features['std_pre']
features['min_diff'] = features['min_post'] - features['min_pre']
features['max_diff'] = features['max_post'] - features['max_pre']

In [18]:
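# Combine with the labels.
# Run this cell only once: each extra call to reset_index() adds a stray
# column (such as `index` or `level_0`) to `features`.
features = features.reset_index()
features['target'] = y_train.loc[features['id']].values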

In [19]:
features

|  | level_0 | index | id | mean_pre | mean_post | std_pre | std_post | min_pre | min_post | max_pre | max_post | mean_diff | std_diff | min_diff | max_diff | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.000015 | 0.000006 | 0.006987 | 0.006877 | -0.022088 | -0.019765 | 0.028202 | 0.017056 | -0.000008 | -0.000111 | 0.002323 | -0.011145 | False |
| 1 | 1 | 1 | 1 | 0.000128 | -0.000090 | 0.002524 | 0.002036 | -0.017693 | -0.014168 | 0.021874 | 0.007764 | -0.000218 | -0.000489 | 0.003525 | -0.014110 | False |
| 2 | 2 | 2 | 2 | 0.000389 | 0.001790 | 0.017221 | 0.022900 | -0.085878 | -0.083094 | 0.087720 | 0.130874 | 0.001400 | 0.005678 | 0.002784 | 0.043154 | True |
| 3 | 3 | 3 | 3 | 0.000381 | 0.000326 | 0.008388 | 0.009286 | -0.043547 | -0.031330 | 0.064906 | 0.048893 | -0.000055 | 0.000898 | 0.012217 | -0.016013 | False |
| 4 | 4 | 4 | 4 | -0.000016 | 0.000024 | 0.003314 | 0.003408 | -0.010066 | -0.008397 | 0.009546 | 0.010102 | 0.000040 | 0.000094 | 0.001669 | 0.000556 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9996 | 9996 | 9996 | 9996 | 0.000151 | -0.000058 | 0.007718 | 0.004914 | -0.036333 | -0.037155 | 0.069722 | 0.070474 | -0.000209 | -0.002804 | -0.000822 | 0.000752 | False |
| 9997 | 9997 | 9997 | 9997 | 0.000152 | 0.000529 | 0.006089 | 0.006142 | -0.017788 | -0.014769 | 0.020002 | 0.021399 | 0.000377 | 0.000053 | 0.003019 | 0.001397 | False |
| 9998 | 9998 | 9998 | 9998 | -0.000007 | 0.000045 | 0.007290 | 0.007327 | -0.027689 | -0.022663 | 0.026602 | 0.025205 | 0.000052 | 0.000037 | 0.005026 | -0.001398 | False |
| 9999 | 9999 | 9999 | 9999 | 0.000070 | 0.000065 | 0.001116 | 0.000922 | -0.008782 | -0.008344 | 0.006775 | 0.005395 | -0.000006 | -0.000193 | 0.000437 | -0.001380 | False |


## Implementation

### `train()`

In the training function, you build and train the model that will make inferences on the test data. <br />
Your model must be stored in the `model_directory_path`.

In [None]:
def train(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    model_directory_path: str,
):
    model = ...

    joblib.dump(model, os.path.join(model_directory_path, 'model.joblib'))
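
To show what a filled-in `train()` could look like, here is a minimal sketch that fits a logistic regression on per-period summary statistics. It is only one possible approach, not the competition's reference model: the helper `build_features`, the choice of `LogisticRegression`, and the synthetic toy data are all assumptions made for illustration, and `y_train` is assumed to be a `pandas.Series` indexed by `id`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def build_features(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-id statistics for each period, plus their differences."""
    stats = X.groupby(["id", "period"])["value"].agg(["mean", "std"]).unstack()
    stats.columns = ["mean_pre", "mean_post", "std_pre", "std_post"]
    stats["mean_diff"] = stats["mean_post"] - stats["mean_pre"]
    stats["std_diff"] = stats["std_post"] - stats["std_pre"]
    return stats


def train(X_train: pd.DataFrame, y_train: pd.Series, model_directory_path: str):
    features = build_features(X_train)
    # Align the labels with the feature rows by id.
    target = y_train.loc[features.index].astype(int)

    model = LogisticRegression(max_iter=1000)
    model.fit(features, target)

    joblib.dump(model, os.path.join(model_directory_path, "model.joblib"))


# Synthetic toy data (made up): even ids get a mean shift in the extension segment.
rng = np.random.default_rng(0)
frames, labels = [], {}
for dataset_id in range(20):
    has_break = dataset_id % 2 == 0
    values = rng.normal(0.0, 1.0, size=50)
    if has_break:
        values[30:] += 3.0
    frames.append(pd.DataFrame({
        "id": dataset_id,
        "time": range(50),
        "value": values,
        "period": [0] * 30 + [1] * 20,
    }))
    labels[dataset_id] = has_break

X_toy = pd.concat(frames).set_index(["id", "time"])
y_toy = pd.Series(labels, name="structural_breakpoint").rename_axis("id")

model_dir = tempfile.mkdtemp()
train(X_toy, y_toy, model_dir)

model = joblib.load(os.path.join(model_dir, "model.joblib"))
probabilities = model.predict_proba(build_features(X_toy))[:, 1]
print(probabilities.shape)
```

At inference time each dataset arrives on its own, so the same features would have to be computed from a single frame rather than through a `groupby` on `id`.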

### `infer()`

In the inference function, the trained model is loaded and used to make inferences on a sample of data that matches the characteristics of the training set.

#### Setup

Once your model is loaded, you must `yield` once, without a value, to signal to the runner that you are ready. <br />
After that, you can start reading data from `X_test`.

#### Iteration

The datasets must be read **one by one** and each value must be returned with a `yield <value>`. <br />
If you try to skip this, you will get an error. <br />
All values are then concatenated into a prediction file.

**Warning: The datasets can only be iterated once!**

#### Cleanup

Code can be executed after the `for` loop if you need to persist state or do some cleanup.

In [None]:
def infer(
    X_test: typing.Iterable[pd.DataFrame],
    model_directory_path: str,
):
    model = joblib.load(os.path.join(model_directory_path, 'model.joblib'))

    yield  # mark as ready

    # X_test can only be iterated once.
    # Before getting the next dataset, you must predict the current one.
    for dataset in X_test:
        # prediction = model.predict(dataset)
        prediction = round(random.random(), 2)

        yield prediction  # send the prediction for the current dataset
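
The generator protocol described above can be tried out without the platform. The sketch below is a hypothetical stand-in for the runner, written only to illustrate the order of events: `toy_runner` and the mean-based toy prediction are assumptions, and the real Crunch runner is not implemented this way as far as this notebook shows.

```python
import typing


def infer(X_test: typing.Iterable[list], model_directory_path: str):
    # Setup: a real implementation would load the model here.
    yield  # mark as ready

    for dataset in X_test:
        # Toy prediction: the mean of the dataset.
        yield sum(dataset) / len(dataset)


def toy_runner(datasets: list) -> list:
    """Hypothetical stand-in for the runner, NOT the real Crunch implementation."""
    predictions = []

    def stream():
        # Hand out datasets one at a time and check that the previous
        # dataset was answered before the next one is requested.
        for position, dataset in enumerate(datasets):
            assert len(predictions) == position, "predict before reading the next dataset"
            yield dataset

    generator = infer(stream(), "unused")

    ready = next(generator)  # runs the setup, up to the first bare `yield`
    assert ready is None

    for prediction in generator:
        predictions.append(prediction)

    return predictions


print(toy_runner([[1.0, 3.0], [2.0, 2.0, 5.0]]))  # → [2.0, 3.0]
```

Because `stream()` is itself a generator, it can only be consumed once, which mirrors the warning above about iterating over the datasets a single time.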

## Local testing

To make sure your `train()` and `infer()` functions are working properly, you can call the `crunch.test()` function, which will reproduce the cloud environment locally. <br />
Even if it is not perfect, it should give you a quick idea if your model is working properly.

In [None]:
crunch.test(
    # Uncomment to disable the train
    # force_first_train=False,

    # Uncomment to disable the determinism check
    # no_determinism_check=True,
)

00:51:39 no forbidden library found
00:51:39 
00:51:39 started
00:51:39 running local test
00:51:39 internet access isn't restricted, no check will be done
00:51:39 
00:51:40 starting unstructured loop...
00:51:40 executing - command=train


data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_train.parquet (204327238 bytes)
data/X_train.parquet: already exists, file length match
data/X_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/X_test.reduced.parquet (2380918 bytes)
data/X_test.reduced.parquet: already exists, file length match
data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_train.parquet (61003 bytes)
data/y_train.parquet: already exists, file length match
data/y_test.reduced.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/146/y_test.reduced.parquet (2655 bytes)
data/y_test.reduced.parquet: already exists, file length match


00:51:42 executing - command=infer
00:51:42 checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)
00:51:42 executing - command=infer
00:51:42 determinism check: failed
00:51:42 save prediction - path=data/prediction.parquet
00:51:42 ended
00:51:42 duration - time=00:00:03
00:51:42 memory - before="794.68 MB" after="818.32 MB" consumed="23.63 MB"
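
The log above reports `determinism check: failed`. That is consistent with the baseline `infer()`, which draws its predictions from the unseeded `random` module, so a second run produces different numbers. As a minimal sketch, assuming a fixed seed is acceptable for a baseline, a private seeded generator makes repeated runs agree. `SEED`, `toy_infer` and `run` are hypothetical names used only here.

```python
import random

SEED = 42  # hypothetical value; any fixed integer works


def toy_infer(datasets, seed=SEED):
    # Private generator, re-created with the same seed on every call.
    rng = random.Random(seed)

    yield  # mark as ready

    for _ in datasets:
        yield round(rng.random(), 2)


def run(datasets):
    generator = toy_infer(datasets)
    next(generator)  # consume the "ready" signal
    return list(generator)


first = run(range(5))
second = run(range(5))
print(first == second)  # → True
```

Seeding only repeats the same sequence of draws. A real model whose prediction depends solely on the content of each dataset is deterministic regardless of how many datasets are replayed or in which order.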


## Results

Once the local tester is done, you can preview the result stored in `data/prediction.parquet`.

In [None]:
prediction = pd.read_parquet("data/prediction.parquet")
prediction

| id | prediction |
|---|---|
| 10001 | 0.88 |
| 10002 | 0.23 |
| 10003 | 0.03 |
| 10004 | 0.95 |
| 10005 | 0.11 |
| ... | ... |
| 10097 | 0.54 |
| 10098 | 0.94 |
| 10099 | 0.08 |
| 10100 | 0.61 |


### Local scoring

You can call the function that the system uses to estimate your score locally.

In [None]:
# Load the targets
target = pd.read_parquet("data/y_test.reduced.parquet")["structural_breakpoint"].astype(float)

# Call the scoring function
sklearn.metrics.roc_auc_score(
    target,
    prediction,
)

np.float64(0.5586854460093896)
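
For intuition about this metric, ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as one half. The sketch below checks that on four made-up points. The numbers are illustrative and unrelated to the competition data.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only.
target = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(target, scores)

# Same quantity computed by counting correctly ordered pairs.
positives = [s for t, s in zip(target, scores) if t == 1]
negatives = [s for t, s in zip(target, scores) if t == 0]
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(positives, negatives)
)
pairwise = wins / (len(positives) * len(negatives))

print(auc, pairwise)  # → 0.75 0.75
```

A score of 0.5 corresponds to random ranking, so the value of about 0.56 obtained above from random predictions is best read as noise around that baseline.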

# Submit your Notebook

To submit your work, you must:
1. Download your Notebook from Colab
2. Upload it to the platform
3. Create a run to validate it

### >> https://hub.crunchdao.com/competitions/structural-break/submit/notebook

![Download and Submit Notebook](https://raw.githubusercontent.com/crunchdao/competitions/refs/heads/master/documentation/animations/download-and-submit-notebook.gif)