This repository contains neural network models for predicting the execution time of Spark applications, based on the paper:
Tariq, H., & Das, O. (2023). Execution Time Prediction Model that Considers Dynamic Allocation of Spark Executors.
Published in: EPEW/ASMTA 2023, Lecture Notes in Computer Science (LNCS).
DOI: https://doi.org/10.1007/978-3-031-43185-2_23
The goal is to accurately predict the total runtime of Spark applications affected by dynamic executor behavior. Two types of neural network models are implemented:
- Black-box model: Uses summary-level features (e.g., input size, total stages, executor scaling stats).
- White-box model: Uses detailed stage-level features including shuffle read/write, task metrics, and executor timelines.
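To make the difference between the two feature sets concrete, the sketch below builds one vector of each kind from a single application run. All field names (`input_size_gb`, `shuffle_read_mb`, `mean_executors`, and so on) are hypothetical placeholders, the executor timeline is reduced to a per-stage average for brevity, and the zero-padding of short runs is an assumption; the actual columns are defined by the CSV files in this repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageMetrics:
    """Per-stage measurements (hypothetical field names)."""
    num_tasks: int
    shuffle_read_mb: float
    shuffle_write_mb: float
    mean_executors: float  # average number of executors alive during the stage


@dataclass
class AppRun:
    """One Spark application run (hypothetical field names)."""
    input_size_gb: float
    max_executors: int
    stages: List[StageMetrics]


def blackbox_features(run: AppRun) -> List[float]:
    """Summary-level view: one short fixed-length vector per application."""
    return [run.input_size_gb, float(len(run.stages)), float(run.max_executors)]


def whitebox_features(run: AppRun, max_stages: int) -> List[float]:
    """Stage-level view: the summary features plus four values per stage.

    Runs with fewer than max_stages stages are zero-padded (an assumption)
    so that every vector has the same length.
    """
    if len(run.stages) > max_stages:
        raise ValueError("run has more stages than max_stages")
    features = blackbox_features(run)
    for stage in run.stages:
        features += [float(stage.num_tasks), stage.shuffle_read_mb,
                     stage.shuffle_write_mb, stage.mean_executors]
    padding = 4 * (max_stages - len(run.stages))
    return features + [0.0] * padding
```

With `max_stages=3`, every white-box vector has 3 + 4 × 3 = 15 entries, while the black-box vector always has 3.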
Workloads include:
- TPC-DS SQL queries: Q26, Q52, Q70
- KMeans clustering
Files:
- `km_nn_blackbox.txt`: NN model using blackbox features for KMeans.
- `km_nn_whitebox.txt`: NN model using whitebox features for KMeans.
- `query26_nn_blackbox.txt`: NN model using blackbox features for Query-26.
- `query26_nn_whitebox.txt`: NN model using whitebox features for Query-26.
- `q52_NN_black box.ipynb`: Blackbox NN model for Query-52.
- `q52_NN_whitebox.ipynb`: Whitebox NN model for Query-52.
- `q70_NN_black box.ipynb`: Blackbox NN model for Query-70.
- `q70_NN_whitebox.ipynb`: Whitebox NN model for Query-70.
- `kmeansdata.csv`: Input data for KMeans models.
- `query26_train_blackbox.csv`: Blackbox feature data for Query-26.
- `query26_train_whitebox.csv`: Whitebox feature data for Query-26.
- `query52train.csv`: Blackbox feature data for Query-52.
- `query52train1.csv`: Whitebox feature data for Query-52.
- `query70train.csv`: Blackbox feature data for Query-70.
- `query70train1.csv`: Whitebox feature data for Query-70.
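A minimal sketch of loading one of the CSV feature files into feature rows and target values, using only the standard library. The target column name `execution_time` is an assumption, as is the premise that all columns are numeric; check the header of the actual file and adjust accordingly.

```python
import csv
from typing import Iterable, List, Tuple


def load_dataset(lines: Iterable[str],
                 target_column: str = "execution_time"  # assumed column name
                 ) -> Tuple[List[str], List[List[float]], List[float]]:
    """Split CSV rows into feature names, feature rows, and target values."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or target_column not in reader.fieldnames:
        raise KeyError(f"column {target_column!r} not found in CSV header")
    # Every column except the target is treated as a numeric feature.
    feature_names = [name for name in reader.fieldnames if name != target_column]
    features, targets = [], []
    for row in reader:
        features.append([float(row[name]) for name in feature_names])
        targets.append(float(row[target_column]))
    return feature_names, features, targets
```

The function accepts any iterable of text lines, so it can be called with an open file handle, for example one obtained from `open("query26_train_blackbox.csv", newline="")`.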
Usage:
- Open a Jupyter notebook inside the `NN` folder.
- Run the notebook to view predictions and plots.
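When reading the predictions, a single error figure can be handy alongside the plots. The sketch below computes the mean absolute percentage error (MAPE) between measured and predicted runtimes. MAPE is a common choice for runtime prediction, but using it here is an assumption and it is not necessarily the metric the notebooks report.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual runtimes must be non-zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

For example, measured runtimes of 100 s and 200 s with predictions of 110 s and 180 s are each off by 10%, so the MAPE is 10%.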
Future work:
- Integration with Spark UI for real-time feature extraction
- Coupling Dynamic Allocation Model (DAM) with an optimization framework for executor recommendation
- Extending DAM for multi-job workloads or streaming scenarios
If you use this code or build upon it, please cite the original paper:
@inproceedings{tariq2023execution,
title={Execution Time Prediction Model that Considers Dynamic Allocation of Spark Executors},
author={Tariq, Hina and Das, Olivia},
booktitle={Computer Performance Engineering (EPEW/ASMTA)},
series={Lecture Notes in Computer Science},
volume={14231},
pages={340--352},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-43185-2_23}
}