In [1]:
! pip install openstef==3.4.72 jupyter==1.0


# Workshop part 1 | Learn how to train a model
In this first part of the workshop, you prepare everything needed to train a model and then run the actual training.

The learning points are:
- How a prediction job works, and what the most important parameters mean;
- What data is required;
- Experience with the train model pipeline;
- How the model gets automatically stored and loaded;
- How to get info on the trained model.

In [1]:
# Import all required packages.
from openstef.data_classes.prediction_job import PredictionJobDataClass
from openstef.pipeline.train_model import train_model_pipeline
from IPython.display import IFrame
import pandas as pd

# Set plotly as the default pandas plotting backend.
pd.options.plotting.backend = 'plotly'

# Check if running in Google Colab.
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

## Define the prediction job

OpenSTEF uses prediction jobs to define the properties of training and prediction.

- model: xgboost
    - This is the (open-source) machine learning model type that we train to make the forecasts.
- quantiles: 10, 30, 50, 70 and 90 percent
    - These provide a confidence interval around the forecast within OpenSTEF, based on the standard deviation.
- forecast_type: demand
    - What we are actually forecasting. This can be demand (load on the grid), wind or basecase.
- latitude: 52.0, longitude: 5.0
    - This is used to calculate the derived solar features (direct normal irradiance and the global tilted irradiance).*
    - Also used to retrieve weather data in openstef-dbc (database connector).
- horizon_minutes: 15
    - The horizon of the desired forecast in minutes: how far into the future we want to predict. A value of 15 means that, at the moment of prediction, you predict 15 minutes ahead. So if you make a prediction at 13:00, the prediction is for 13:15.
- resolution_minutes: 15
    - Resolution of the forecasts in minutes: the number of minutes between consecutive samples in the prediction.
- name: workshop_exercise_1
    - Name you give to the prediction job.
- save_train_forecasts: true
    - Indicates whether the forecasts produced during the training process should be saved.
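
The relation between the moment of prediction, the horizon and the resolution described above can be sketched with plain pandas. This is only an illustration: the timestamp and the variable names `forecast_made_at`, `target_time` and `forecast_index` are assumptions, and the code does not call OpenSTEF.

```python
import pandas as pd

# Hypothetical moment at which the forecast is made.
forecast_made_at = pd.Timestamp("2023-01-01 13:00", tz="UTC")

horizon_minutes = 15     # how far ahead the forecast looks
resolution_minutes = 15  # spacing between consecutive forecast samples

# The moment the forecast applies to: 13:00 plus 15 minutes.
target_time = forecast_made_at + pd.Timedelta(minutes=horizon_minutes)

# Four consecutive forecast timestamps, spaced by the resolution.
forecast_index = pd.date_range(
    start=target_time,
    periods=4,
    freq=pd.Timedelta(minutes=resolution_minutes),
)

print(target_time)
print(list(forecast_index.strftime("%H:%M")))
```

Changing `horizon_minutes` moves the first timestamp, while changing `resolution_minutes` changes the spacing between the timestamps.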


Bonus: look at the documentation [here](https://openstef.github.io/openstef/openstef.data_classes.html#module-openstef.data_classes.prediction_job).

*Curious about how the latitude and longitude are used to calculate derived weather features? See [here](https://github.com/OpenSTEF/openstef/blob/main/openstef/feature_engineering/weather_features.py)
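
To get a feeling for how quantiles can follow from a standard deviation, here is a minimal sketch. It assumes normally distributed forecast errors, and both the helper `quantiles_from_stdev` and the numbers are hypothetical. It is not OpenSTEF's own implementation.

```python
from statistics import NormalDist


def quantiles_from_stdev(point_forecast, stdev, quantiles):
    """Return {quantile: value}, assuming normally distributed forecast errors."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return {
        q: point_forecast + standard_normal.inv_cdf(q) * stdev
        for q in quantiles
    }


# Hypothetical numbers: a forecast of 700 with an error standard deviation of 50.
bands = quantiles_from_stdev(700.0, 50.0, [0.10, 0.30, 0.50, 0.70, 0.90])
for q, value in bands.items():
    print(f"P{int(round(q * 100)):02d}: {value:.1f}")
```

Under this assumption the 50 percent quantile coincides with the point forecast, and the other quantiles lie symmetrically around it.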

In [2]:
# Define properties of training/prediction. We call this a 'prediction_job'.
pj = dict(id=288,
        model='xgb',
        quantiles=[0.10,0.30,0.50,0.70,0.90],
        forecast_type="demand",
        # lat=52.0,
        # lon=5.0,
        horizon_minutes=60, # 2 or 24
        resolution_minutes=60,
        name="workshop_exercise_1",
        save_train_forecasts=True,
       )

pj=PredictionJobDataClass(**pj)

In [3]:
# Inspect your prediction job here.
display(pj)

PredictionJobDataClass(id=288, model='xgb', model_kwargs=None, forecast_type='demand', horizon_minutes=15, resolution_minutes=15, lat=52.0, lon=5.0, name='workshop_exercise_1', electricity_bidding_zone=<BiddingZone.NL: 'NL'>, train_components=None, description=None, quantiles=[0.1, 0.3, 0.5, 0.7, 0.9], train_split_func=None, backtest_split_func=None, train_horizons_minutes=None, default_modelspecs=None, save_train_forecasts=True, completeness_threshold=0.5, minimal_table_length=100, flatliner_threshold_minutes=1440, detect_non_zero_flatliner=False, data_balancing_ratio=None, rolling_aggregate_features=[], depends_on=[], sid=None, turbine_type=None, n_turbines=None, hub_height=None, pipelines_to_run=[<PipelineType.TRAIN: 'train'>, <PipelineType.HYPER_PARMATERS: 'hyper_parameters'>, <PipelineType.FORECAST: 'forecast'>], alternative_forecast_model_pid=None, data_prep_class=None)

## Prepare and analyse the input data
OpenSTEF requires a certain input format: a dataframe with specific columns.

Exercise: look at the table and plots below and try to answer the following questions:
- What type of features do you see in the input data?
- How much time is there between two data points?
- Look at the plots for radiation and windspeed. Do you see any patterns?
    - Hint: do you see something happening to the load when there is a peak in either radiation or wind speed? Can you explain why?
    - Note: in these plots we zoomed in on a random week, for visibility purposes.

Hint: you can zoom in on the plots to see more details.
Hint 2: the 'load' is the target that we want to forecast. So it is not a feature.

If you are working with Google Colab, upload the data in the 'Files' section, which you can find in the left toolbar (the fifth item from the top).
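
One way to check the time between two data points is to look at the differences between consecutive index values. The sketch below uses a small made-up hourly frame instead of the workshop file, so the values and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the workshop data: hourly timestamps and a load column.
index = pd.date_range(
    "2023-01-01 06:00", periods=6, freq=pd.Timedelta(hours=1), tz="UTC"
)
example = pd.DataFrame(
    {"load": [834.0, 736.0, 720.0, 690.0, 668.0, 655.0]}, index=index
)

# Difference between consecutive timestamps; the first entry has no predecessor.
steps = example.index.to_series().diff().dropna()

# The most common step is the sampling interval of the data.
sampling_interval = steps.mode().iloc[0]
print(sampling_interval)

# Steps that deviate from the sampling interval point to gaps or duplicates.
irregular_steps = steps[steps != sampling_interval]
print(len(irregular_steps))
```

The same three lines after the frame definition can be applied to `input_data` once it has been loaded.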

In [45]:
if IN_COLAB:
    input_data=pd.read_csv("/content/master_data_with_forecasted.csv", index_col=0, parse_dates=True)
else:
    input_data=pd.read_csv("../data/master_data_with_forecasted.csv", index_col=0, parse_dates=True)

In [46]:
if isinstance(input_data, pd.DataFrame):
    print("The variable is a Pandas DataFrame.")

The variable is a Pandas DataFrame.


In [55]:
# Inspect all column names of the input data.
print(input_data.columns)

Index(['load', 'date_time_com', 'Holiday', 'Holiday_Type', 'temp', 'rhum',
       'prcp', 'wdir', 'wspd', 'pres', 'cldc', 'coco', 'forecasted_load'],
      dtype='object')


In [34]:
input_data = input_data.drop(columns=["date_time_com", "Holiday", "Holiday_Type", "forecasted_load"])
print(input_data.columns)

Index(['load', 'temp', 'rhum', 'prcp', 'wdir', 'wspd', 'pres', 'cldc', 'coco'], dtype='object')


In [35]:
pd.options.display.max_columns = None
display(input_data.head())

date_time,load,temp,rhum,prcp,wdir,wspd,pres,cldc,coco
2023-01-01 06:00:00+00:00,834.0,22.0,60.0,0.0,340.0,7.6,1020.2,1.0,1.0
2023-01-01 07:00:00+00:00,736.0,22.7,53.0,0.0,9.0,1.8,1018.2,1.0,1.0
2023-01-01 08:00:00+00:00,720.0,23.4,49.0,0.0,354.0,1.8,1017.3,1.0,1.0
2023-01-01 09:00:00+00:00,690.0,23.7,51.0,0.0,0.0,0.0,1017.2,0.0,1.0
2023-01-01 10:00:00+00:00,668.0,22.0,59.0,0.0,302.0,1.8,1016.9,0.0,1.0


In [48]:
# The model should only be trained on the training part of the input data. Therefore, the input data should be split.
train_data=input_data.iloc[:15000] # first 15000 rows for training
test_data=input_data.iloc[-3000:,:] # final 3000 rows for testing

In [49]:
fig_load=input_data["load"].iloc[57:729].plot()
fig_load.update_layout(
    xaxis_title='Timestamp',
    yaxis_title="Load [MW]"
)
fig_load.show()

In [50]:
# Plot the wind speed column ('wspd') for the same week as the load plot.
fig_windspeed=input_data["wspd"].iloc[57:729].plot()
fig_windspeed.update_layout(
    xaxis_title='Timestamp',
    yaxis_title="Windspeed"
)
fig_windspeed.show()

## Training the model
After defining the prediction job and preparing the input data, the model can be trained.

Exercise:
- Find out what happens in the 'train_model_pipeline'. More specifically, what are the inputs and outputs?
- Why do we only use the train_data?

Hint: find the pipeline in the list on the OpenSTEF website [here](https://openstef.github.io/openstef/user_guides.html), and click on openstef.pipeline.train_model to open its documentation.
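
As background for the question about train_data: with time series, the split is usually chronological, so that the model is evaluated on data that comes after everything it was trained on. A minimal sketch under that assumption, with a hypothetical helper `chronological_split` and made-up hourly data:

```python
import pandas as pd


def chronological_split(data: pd.DataFrame, n_test: int):
    """Split time-ordered data into a training part and the final n_test rows."""
    data = data.sort_index()
    if not 0 < n_test < len(data):
        raise ValueError("n_test must be between 1 and len(data) - 1")
    return data.iloc[:-n_test], data.iloc[-n_test:]


# Hypothetical hourly data, only to demonstrate the split.
index = pd.date_range(
    "2023-01-01", periods=240, freq=pd.Timedelta(hours=1), tz="UTC"
)
example = pd.DataFrame({"load": range(240)}, index=index)

train_part, test_part = chronological_split(example, n_test=48)

# Every training timestamp lies before every test timestamp.
print(train_part.index.max() < test_part.index.min())
```

Slicing from both ends with a single `n_test` keeps the two parts from overlapping, whatever the length of the data.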

In [51]:
# Remove duplicate index values from train_data
train_data = train_data[~train_data.index.duplicated(keep='first')]

# Remove rows with NaT in the index
train_data = train_data[train_data.index.notna()]

import os

mlflow_dir = "./mlflow_trained_models"
mlflow_tracking_uri = os.path.abspath(mlflow_dir)

# Note: the returned datasets overwrite the train_data and test_data variables defined earlier.
train_data, validation_data, test_data = train_model_pipeline(
    pj,
    train_data,
    check_old_model_age=False,
    mlflow_tracking_uri=mlflow_tracking_uri,
    artifact_folder="./mlflow_artifacts",
)

2025-11-01 15:28:49 [info     ] Model successfully loaded with MLflow
[0]	validation_0-rmse:212.78424	validation_1-rmse:206.59128
[1]	validation_0-rmse:168.95006	validation_1-rmse:165.46323
[2]	validation_0-rmse:140.26169	validation_1-rmse:139.68018
[3]	validation_0-rmse:121.79463	validation_1-rmse:123.89191
[4]	validation_0-rmse:109.51374	validation_1-rmse:114.22657
[5]	validation_0-rmse:101.42751	validation_1-rmse:108.41332
[6]	validation_0-rmse:95.29169	validation_1-rmse:104.42303
[7]	validation_0-rmse:91.49938	validation_1-rmse:101.42075
[8]	validation_0-rmse:86.74534	validation_1-rmse:98.18483
[9]	validation_0-rmse:83.81821	validation_1-rmse:96.90697
[10]	validation_0-rmse:80.01682	validation_1-rmse:94.64630
[11]	validation_0-rmse:77.88905	validation_1-rmse:94.12478
[12]	validation_0-rmse:74.93637	validation_1-rmse:92.90545
[13]	validation_0-rmse:73.10584	validation_1-rmse:91.89874
[14]	validation_0-rmse:71.18287	validation_1-rmse:91.47304
[15]	validation_0-rmse:70.07516	validatio

## Analyse the trained model
Now that the model has been trained, you can inspect the results.

Exercise: answer the following questions.
- Are all of the features in the feature importance plot in the input data? Why?
    - What are the most important features?
- Which time horizon is more accurate?
    - Hint: zoom in on the same day for both the Predictor0.25 and Predictor47.0 and examine them next to each other.
- Where is my trained model?




The first two plots are the 'predictor in action' plots for the two time horizons (0.25 means fifteen minutes ahead, 47.0 means 47 hours ahead). In these plots you can see three different data outputs: train, validation and test. For each of these, you can see an '_actual' and a '_predict' series: the measured value and the value predicted by OpenSTEF. Thus 'train_predict' is the prediction by OpenSTEF on the train data.
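
Besides comparing the plots by eye, the accuracy of two horizons can be compared with an error metric such as the RMSE that also appears in the training log. The sketch below uses made-up numbers and a hypothetical helper `rmse`. It does not read the workshop results.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between measured and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical values, not taken from the workshop results.
actual = [700.0, 720.0, 690.0, 650.0]
predicted_short_horizon = [705.0, 715.0, 695.0, 645.0]
predicted_long_horizon = [730.0, 690.0, 720.0, 620.0]

print(rmse(actual, predicted_short_horizon))
print(rmse(actual, predicted_long_horizon))
```

A lower RMSE means the predictions are, on average, closer to the measured values.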

The last plot is the feature importance plot. It shows all of your input features (radiation, windspeed, lagged load, etc.) and how much they influence the forecast. If a block is relatively large, the feature is relatively important for the forecast: large changes in the value of this feature result in a large difference in the forecast.

Note: These IFrames do not work in Google Colab. The images can be found in the folder 'mlflow_artifacts' and opened in your browser.
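
The 'lagged load' mentioned above is an example of a feature that is derived from the input data rather than being a column in it. A minimal sketch of how such a feature can be built with pandas is shown below. The column names `load_lag_1h` and `load_lag_24h` and the data are hypothetical, and OpenSTEF's own feature names and lags may differ.

```python
import pandas as pd

# Hypothetical hourly load series, only for illustration.
index = pd.date_range(
    "2023-01-01", periods=72, freq=pd.Timedelta(hours=1), tz="UTC"
)
example = pd.DataFrame({"load": [float(i) for i in range(72)]}, index=index)

# Shift by time rather than by row count, so gaps in the index do not
# silently misalign the lagged values.
example["load_lag_1h"] = example["load"].shift(freq=pd.Timedelta(hours=1))
example["load_lag_24h"] = example["load"].shift(freq=pd.Timedelta(hours=24))

# The first rows have no earlier load to look back to, so they are NaN.
print(example.iloc[[0, 1, 24, 25]])
```

Each lagged column holds the load from one hour or one day earlier, which the model can use as an input when predicting the current load.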

In [44]:
if not IN_COLAB:
    # Inspect local files.
    display(IFrame('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']), width=900, height=400))
    display(IFrame('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']), width=800, height=400))
    display(IFrame('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']), width=800, height=400))


## Visual Studio Code has difficulties with displaying HTML files. If you are working with VS Code and are not able to inspect the plots,
## uncomment the code below to open the plots in your browser.

# import webbrowser
# webbrowser.open('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']))
# webbrowser.open('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']))
# webbrowser.open('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']))

In [54]:

import numpy as np
from openstef.pipeline.create_forecast import create_forecast_pipeline

# Prepare data to make the forecast.
realised=input_data.loc[test_data.index, 'load'].copy(deep=True)
to_forecast_data=input_data.copy(deep=True)
to_forecast_data.loc[test_data.index, 'load']=np.nan #clear the load data for the part you want to forecast