<h1>Prophet with SCADA Dataset</h1>

Prophet fits one time series at a time (here, one turbine). It cannot model several time series jointly, so each turbine needs its own model. Prediction does not require passing the original dataset again. Once the model has been trained on historical data with the columns ds (timestamps) and y (target), you only specify the forecast horizon (periods) and the frequency (freq). As long as the model uses no extra regressors, Prophet then generates the future predictions without any additional input data.
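
To make the last point concrete, the sketch below rebuilds by hand the timestamps a forecast covers, using only the end of the training data, the number of periods, and a fixed step. It is an illustration, not Prophet's own code; in Prophet, `make_future_dataframe(periods=..., freq=...)` produces the corresponding ds column. The helper name `future_timestamps` and the example dates are hypothetical.

```python
from datetime import datetime, timedelta


def future_timestamps(last_train_ts, periods, step):
    """Return `periods` timestamps that follow `last_train_ts` at a fixed `step`.

    Hypothetical helper for illustration; not part of Prophet or RTDIP.
    """
    return [last_train_ts + step * i for i in range(1, periods + 1)]


# Assumed example values: training data ends at midnight, readings arrive every 10 minutes.
last_seen = datetime(2021, 1, 1, 0, 0)
horizon = future_timestamps(last_seen, periods=3, step=timedelta(minutes=10))

for ts in horizon:
    print(ts.isoformat())
```

The forecast grid depends on nothing but the last training timestamp, the horizon, and the frequency, which is why no test data has to be handed to the model at prediction time.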

In [2]:
# Set up Spark

from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("SCADA-Forecasting")
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.maxResultSize", "2g")
    .config("spark.sql.shuffle.partitions", "50")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Load Data

import pandas as pd
from sktime.forecasting.model_selection import temporal_train_test_split

pdf = pd.read_parquet(r"scada_prepro.parquet")

pdf_turbine1 = pdf[pdf["item_id"] == "1_Kelmarsh"].copy()
pdf_turbine1 = pdf_turbine1.drop(columns=["item_id"])
pdf_turbine1["ds"] = pd.to_datetime(pdf_turbine1["timestamp"])
pdf_turbine1["y"] = pdf_turbine1["target"]
pdf_turbine1 = pdf_turbine1.drop(columns=["timestamp", "target"])
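# The temporal split below relies on the rows already being in chronological order.
# Skip the first three rows of this turbine's series.
pdf_turbine1 = pdf_turbine1[3:]
# Drop every column that contains at least one NaN (axis=1 removes columns, not rows).
pdf_turbine1_no_NaN = pdf_turbine1.dropna(axis=1)

# Prophet only needs the timestamp column ds and the target column y.
pdf_turbine1_no_NaN = pdf_turbine1_no_NaN[["ds", "y"]]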

trainset, testset = temporal_train_test_split(pdf_turbine1_no_NaN, test_size=0.2)

scada_spark_trainset = spark.createDataFrame(trainset)
scada_spark_testset = spark.createDataFrame(testset)
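
The split in this cell is temporal: the test set is the final stretch of the series, not a random sample, so the model is never evaluated on observations that precede its training data. A minimal sketch of the idea on a plain list follows. The helper `temporal_split` is hypothetical, and rounding the test size up is an assumption made here; sktime's exact rounding may differ.

```python
import math


def temporal_split(rows, test_size=0.2):
    """Split an ordered sequence into (train, test) without shuffling.

    Hypothetical helper; assumes `rows` is already sorted by time.
    """
    n_test = math.ceil(len(rows) * test_size)  # assumed rounding rule
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


observations = list(range(10))  # stand-in for time-ordered readings
train, test = temporal_split(observations, test_size=0.2)
print(train)
print(test)
```

Because the cut is purely positional, the input must be in chronological order before it is split.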

In [4]:
# Import ProphetForecaster, train the model on the SCADA training set, and evaluate it on the test set

import sys
sys.path.append('/workspaces/amos2025ws03-rtdip-timeseries-forecasting/src/sdk/python')
from rtdip_sdk.pipelines.forecasting.spark.prophet import ProphetForecaster

pf = ProphetForecaster(scaling="absmax")
pf.train(scada_spark_trainset)
metrics = pf.evaluate(scada_spark_testset, "10min")

20:57:46 - cmdstanpy - INFO - Chain [1] start processing
20:57:57 - cmdstanpy - INFO - Chain [1] done processing



Prophet Metrics:
--------------------------------------------------------------------------------
MAE                 : 10497.5296
RMSE                : 11276.5351
MAPE                : 1245.2644
MASE                : 110.0565
SMAPE               : 199.4585
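
As an aid for reading the table above, the sketch below computes the five metrics under their conventional textbook definitions. That the forecaster's evaluate method uses exactly these formulas is an assumption. In particular, MAPE and SMAPE are taken here as percentages, SMAPE uses the variant bounded by 200, and MASE scales by the in-sample MAE of a one-step naive forecast. The function name `forecast_metrics` and the toy numbers are hypothetical.

```python
import math


def forecast_metrics(y_true, y_pred, y_train):
    """Conventional point-forecast error metrics (assumed definitions).

    Assumes y_true contains no zeros and y_train is not constant.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    smape = 100 * sum(
        2 * abs(e) / (abs(t) + abs(p))
        for e, t, p in zip(errors, y_true, y_pred)
    ) / n
    # MASE: MAE relative to a one-step naive forecast on the training series.
    naive_mae = sum(
        abs(b - a) for a, b in zip(y_train, y_train[1:])
    ) / (len(y_train) - 1)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape,
            "SMAPE": smape, "MASE": mae / naive_mae}


# Toy numbers, not taken from the SCADA data.
result = forecast_metrics(
    y_true=[100.0, 200.0], y_pred=[110.0, 180.0], y_train=[90.0, 100.0, 120.0]
)
for name, value in result.items():
    print(f"{name:<6}: {value:.4f}")
```

If the evaluation follows these definitions, SMAPE cannot exceed 200, and a value close to that bound means that for most points the forecast and the actual value have opposite signs or differ greatly in magnitude.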
