Enhance spark API. #114

aadyotb · 2022-07-14T01:40:02Z

We refactor the spark apps and Dockerfile so that Merlion can be deployed with the spark-on-k8s-operator. To this end, the apps now accept individual, documented arguments instead of a single opaque config file. Additionally, we add tests which follow the same workflow as the pyspark apps. The tests are run on a subset of an open source dataset released under a CC0 license and cover hierarchical time series forecasting.

Next, we further improve the functionality of the forecasting spark app in 4 ways:

1. We allow return_prev=True to be passed to model.forecast() in case one wants to obtain a model's historical predictions on the train data. This can be done by supplying the --predict_on_train argument. Notably, hierarchical reconciliation is skipped for historical timestamps which do not have sampled values from all time series.
2. We allow the user to customize the aggregation used for different data columns when aggregating for hierarchical time series. This can be done by specifying the --agg_dict argument as an appropriate JSON string. Previously, all data columns were summed. Moreover, if a data column is not specified in the aggregation dictionary, that column will not be used to model any aggregated time series. This matters, for example, for categorical columns that are useful at the base of the hierarchy but not at higher levels.
3. We force the index columns of a dataset containing multiple time series to be interpreted as strings. This allows us to use a reserved "__aggregated__" keyword to indicate that a particular time series (in the output) has been aggregated.
4. We fix a bug with time series reconciliation which caused an exception to be thrown if some models produced stderr estimates but others didn't.
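
As a rough illustration of the custom aggregation and string index columns described above, here is a minimal pandas sketch. It is not the code from this PR: the column names (store, sales, temperature, is_holiday) are hypothetical, and it assumes the aggregation dictionary maps each data column to a pandas aggregation name.

```python
import json

import pandas as pd

# Hypothetical data: one index column ("store") and three data columns.
df = pd.DataFrame({
    "time": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"]),
    "store": [1, 2, 1, 2],
    "sales": [10.0, 20.0, 30.0, 40.0],
    "temperature": [50.0, 60.0, 70.0, 80.0],
    "is_holiday": [0, 0, 1, 1],
})

# Stand-in for the JSON string given to --agg_dict (assumed format:
# data column name -> pandas aggregation name).
agg_dict = json.loads('{"sales": "sum", "temperature": "mean"}')

# Treat the index column as strings so that a reserved string label can
# mark aggregated series without mixing NA values into an integer column.
df["store"] = df["store"].astype(str)

# Aggregate over all stores at each timestamp. "is_holiday" is not in
# agg_dict, so it does not appear in the aggregated series.
agg = df.groupby("time", as_index=False).agg(agg_dict)
agg["store"] = "__aggregated__"

# Base-level and aggregated series side by side; the aggregated rows have
# no value for the omitted "is_holiday" column.
combined = pd.concat([df, agg], ignore_index=True)
print(combined)
```

In this sketch, sales is summed and temperature is averaged across stores, while is_holiday stays available only for the base-level series.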

Finally, we enhance tree models so that they can accept max_forecast_steps=None and return_prev=True.

aadyotb added 4 commits July 5, 2022 13:28

Update spark apps to use spark-operator.

b73449a

Refactor spark apps directory.

86af0c6

Allow spark.forecast to predict on train data.

15e38c6

Allow custom aggregations for hierarchical data.

e57c4cb

aadyotb requested a review from huan-dec July 14, 2022 01:40

Update version to 1.2.4.

376302e

huan-dec approved these changes Jul 14, 2022

aadyotb added 6 commits July 14, 2022 00:13

Merge branch 'main' into aadyot/spark-operator

8c53070

Always use strings for index columns.

fa18260

This avoids problems with aggregation when attempting to use NA values for integer columns.

Merge branch 'aadyot/spark-operator' of https://github.com/salesforce…

200ae01

…/Merlion into aadyot/spark-operator

Ensure consistent schema for stderr.

2d6b8f7

Fix some bugs with tree models.

daf506b

Tree models can now accept max_forecast_steps=None and return_prev=True

Fix bug in time series reconciliation code.

47054d4

aadyotb changed the title ~~Refactor spark apps to work with spark operator.~~ Enhance spark API. Jul 14, 2022

aadyotb added 2 commits July 14, 2022 16:13

Added dataset for testing hierarchical time series

9949048

Derived from https://www.kaggle.com/datasets/manjeetsingh/retaildataset which is released under a CC0 license.

Add test coverage for pyspark code.

e543a7b

aadyotb merged commit 638529e into main Jul 15, 2022

aadyotb deleted the aadyot/spark-operator branch July 15, 2022 00:38

aadyotb mentioned this pull request Jul 15, 2022

Various bugfixes & enhancements. #115

Merged
