Enhance spark API. #114

Merged
merged 13 commits into from
Jul 15, 2022

@aadyotb aadyotb commented Jul 14, 2022

We refactor the spark apps and Dockerfile so that Merlion can be deployed with the spark-on-k8s-operator. To this end, the apps now accept individual, documented arguments instead of a single opaque config file. Additionally, we add tests that follow the same workflow as the pyspark apps. The tests run on a subset of an open-source dataset released under a CC0 license and cover hierarchical time series forecasting.

Next, we further improve the functionality of the forecasting spark app in 4 ways:

  1. We allow return_prev=True to be passed to model.forecast() in case one wants to obtain a model's historical predictions on the train data. This can be done by supplying the --predict_on_train argument. Notably, hierarchical reconciliation is skipped for historical timestamps which do not have sampled values from all time series.
  2. We allow the user to customize the aggregation used for different data columns when aggregating for hierarchical time series. This can be done by specifying the --agg_dict argument as an appropriate JSON string. Previously, all data columns were summed. Moreover, if a data column is not specified in the aggregation dictionary, that column will not be used to model any aggregated time series. This can be important, for example, for categorical columns that are useful at the base of the hierarchy but not at higher levels.
  3. We force the index columns of a dataset containing multiple time series to be interpreted as strings. This allows us to use a reserved "__aggregated__" keyword to indicate that a particular time series (in the output) has been aggregated.
  4. We fix a bug in time series reconciliation that caused an exception to be thrown if some models produced stderr estimates but others did not.
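
To make items 2 and 3 concrete, here is a minimal sketch of how a JSON-valued --agg_dict could drive per-column aggregation. It is written in plain Python rather than pyspark so that it stays self-contained, and it is not the app's actual code. Only the flag names --agg_dict and --predict_on_train and the "__aggregated__" keyword come from the description above; the helper names (parse_args, aggregate, AGG_FUNCS), the supported aggregation names, the empty default, and the sample columns are assumptions made for illustration.

```python
import argparse
import json
from collections import defaultdict
from statistics import mean

# Reserved keyword from the PR description. Index values are kept as strings
# so that this marker can share a column with ordinary series identifiers.
AGGREGATED = "__aggregated__"

# Hypothetical mapping from aggregation names to functions (an assumption).
AGG_FUNCS = {"sum": sum, "mean": mean, "max": max, "min": min}


def parse_args(argv):
    """Parse the two flags named in the PR description (illustrative only)."""
    parser = argparse.ArgumentParser()
    # The JSON string is decoded into a dict of {column: aggregation name}.
    # The empty default is an assumption, not the app's documented default.
    parser.add_argument("--agg_dict", type=json.loads, default={})
    parser.add_argument("--predict_on_train", action="store_true")
    return parser.parse_args(argv)


def aggregate(rows, index_col, time_col, agg_dict):
    """Aggregate rows over all values of index_col, one output row per timestamp.

    Columns that are missing from agg_dict are left out of the aggregated rows.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col in agg_dict:
            grouped[row[time_col]][col].append(row[col])

    out = []
    for t in sorted(grouped):
        agg_row = {index_col: AGGREGATED, time_col: t}
        for col, how in agg_dict.items():
            agg_row[col] = AGG_FUNCS[how](grouped[t][col])
        out.append(agg_row)
    return out


# Made-up sample data: two stores observed at one timestamp.
args = parse_args(["--agg_dict", '{"sales": "sum", "price": "mean"}'])
rows = [
    {"store": "1", "date": "2022-01-01", "sales": 10, "price": 2.0, "promo": "A"},
    {"store": "2", "date": "2022-01-01", "sales": 5, "price": 4.0, "promo": "B"},
]
result = aggregate(rows, "store", "date", args.agg_dict)
print(result)
```

In this sketch the categorical "promo" column is absent from the aggregation dictionary, so it does not appear in the aggregated row, which mirrors the behavior described in item 2.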

Finally, we enhance tree models so that they can accept max_forecast_steps=None and return_prev=True.
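
The toy class below sketches what these two options mean for a caller. It is a hypothetical stand-in and not Merlion's tree model or its forecast() signature. It assumes that max_forecast_steps=None means the horizon is not capped, and that return_prev=True returns the in-sample predictions on the train data ahead of the future forecast.

```python
from typing import List, Optional


class ToyLastValueForecaster:
    """Hypothetical forecaster that always predicts the last value it has seen."""

    def __init__(self, max_forecast_steps: Optional[int] = None):
        # None is taken to mean "no cap on the horizon" (an assumption).
        self.max_forecast_steps = max_forecast_steps
        self.train_values: List[float] = []

    def train(self, values: List[float]) -> None:
        self.train_values = list(values)

    def forecast(self, n_steps: int, return_prev: bool = False) -> List[float]:
        if self.max_forecast_steps is not None and n_steps > self.max_forecast_steps:
            raise ValueError("n_steps exceeds max_forecast_steps")
        # Future forecast: repeat the last training value.
        future = [self.train_values[-1]] * n_steps
        if not return_prev:
            return future
        # One-step-ahead predictions on the train data: the prediction for
        # time t is the value at t-1. The first point has no history, so it
        # is predicted as itself.
        prev = [self.train_values[0]] + self.train_values[:-1]
        return prev + future


model = ToyLastValueForecaster(max_forecast_steps=None)
model.train([1.0, 2.0, 4.0])
print(model.forecast(2))                    # future forecast only
print(model.forecast(2, return_prev=True))  # train predictions, then forecast
```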

@aadyotb aadyotb requested a review from huan-dec July 14, 2022 01:40
@aadyotb aadyotb changed the title from "Refactor spark apps to work with spark operator." to "Enhance spark API." Jul 14, 2022
@aadyotb aadyotb merged commit 638529e into main Jul 15, 2022
@aadyotb aadyotb deleted the aadyot/spark-operator branch July 15, 2022 00:38