This is the executable artifact of the ICSE'24 paper: Predicting Performance and Accuracy of Mixed-Precision Programs for Precision Tuning. The artifact is intended to provide the users with raw data descriptions and a set of guidelines to reproduce experiment results.
The raw data description details the MixBench dataset used to train and test our prediction models, which were created from the MixBench programs. In addition, we provide a description of the running results from the model's training, testing, and case studies.
Instructions are provided to play with our data, get statistics of our dataset, reproduce the testing scores of our pre-trained models on the dataset, as well as an optional step to train the prediction models from scratch. We also present the commands to reproduce the results of the case studies on the four target benchmarks: CG, MG, LULESH, and LBM.
Our MixBench datasets, one for accuracy prediction (raw/MixBench/error_root/processed) and the other for performance prediction on floating-point programs (raw/MixBench/time_root/processed), are provided to reproduce the experiment results reported in the paper.
Each data object ("data_idx.pt") is a graph representation
containing nodes, edges, and label information
for a corresponding mixed-precision floating-point program.
The MixBench dataset involves five benchmarks:
BlackScholes,
CFD,
Hotspot,
HPCCG,
and LavaMD.
The source code of these five benchmarks in their original precision can be found in raw/MixBench/orig_files.
Dataset | Total | Label 0 | Label 1 | Avg. node count | Avg. edge count |
---|---|---|---|---|---|
Accuracy Prediction | 600 | 300 | 300 | 3191 | 11597 |
Performance Prediction | 628 | 314 | 314 | 3195 | 11487 |
The table above shows the statistics of the two balanced datasets. Note that for accuracy prediction, label 0 refers to "program not within error threshold", and label 1 refers to "program within error threshold"; while for performance prediction, label 0 refers to "program with no speedup", and label 1 refers to "program with speedup".
In raw/model, the artifact provides the two trained FPLearner models, error_AST_CFG_PDG_CAST_DEP_checkpoint.pt (accuracy prediction model) and time_AST_CFG_PDG_CAST_DEP_checkpoint.pt (performance prediction model), which were trained and tested on the provided MixBench dataset. Both models were trained on the composite graph representation (including AST, Control Flow, Program Dependence, TypeCasting, and AssignedFrom edges).
This directory also contains all the other pre-trained models used for the edge ablation study. For example, error_CFG_PDG_CAST_DEP_checkpoint.pt is the accuracy prediction model trained with graphs_no_AST (graphs without AST edges).
In raw/log, the artifact provides training and testing logs for both accuracy and performance prediction models on different combinations of edges. The logs provide the full set of results shown in the ablation study of edges in the paper.
For example, the testing results (accuracy, precision, recall, and F1 score) for the accuracy prediction model trained on the composite graph can be found in raw/log/error_AST_CFG_PDG_CAST_DEP/test.log, which has the following content:
2023-03-01 15:13:31,038 Savedir: MixBench/error_allpurpose_root_AST_CFG_PDG_CAST_DEP
2023-03-01 15:13:31,050 Test dataset size: 120
2023-03-01 15:13:31,050 Split:
test -> 120
2023-03-01 15:13:31,051 Edges: ['AST', 'CFG', 'PDG', 'CAST', 'DEP']
2023-03-01 15:13:32,777 Model (model/MixBench/error_allpurpose_root_AST_CFG_PDG_CAST_DEP/checkpoint.pt) is loaded.
2023-03-01 15:13:33,911 ============> Testing mode start...
2023-03-01 15:14:14,466 Test Loss: 1.850765 | Test Acc: 96.875% | Test Pre: 97.24% | Test Rec: 96.82% | Test Fsc: 97.03%
2023-03-01 15:14:14,466 Acc0: 92.98% | Acc1: 100.00% | Pre0: 100.00% | Pre1: 94.49% | Rec0: 93.63% | Rec1: 100.00%
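To compare scores across the logs of different edge combinations, the metric lines can be parsed programmatically. The following is a minimal sketch, assuming the `Name: value%` layout of the two metric lines shown above; `parse_metrics` is a hypothetical helper and is not part of the artifact.

```python
import re

# Matches "Name: number" pairs such as "Test Acc: 96.875%" or "Acc0: 92.98%".
# Names must start with a letter, so the leading timestamp is skipped.
_METRIC = re.compile(r"([A-Za-z][A-Za-z0-9 ]*?):\s*(-?\d+(?:\.\d+)?)%?")


def parse_metrics(line: str) -> dict:
    """Return {metric name: numeric value} for one metric line of test.log."""
    return {name.strip(): float(value) for name, value in _METRIC.findall(line)}


# Metric line copied from the log excerpt above.
line = ("2023-03-01 15:14:14,466 Test Loss: 1.850765 | Test Acc: 96.875% | "
        "Test Pre: 97.24% | Test Rec: 96.82% | Test Fsc: 97.03%")
metrics = parse_metrics(line)
print(metrics["Test Acc"], metrics["Test Fsc"])
```

Percent signs are dropped, so percentages are returned as plain numbers (for example, 96.875 rather than 0.96875).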
The running results from all case studies, including Precimonious+FPLearner and HiFPTuner+FPLearner on the four target benchmarks CG, MG, LULESH, and LBM, are provided in raw/case-study/.
Each folder (e.g., raw/case-study/HiFPTuner+FPLearner/cg-results/) consists of the following files:
- *.json: the precision configurations for all mixed-precision floating-point programs in the search
- dd2-20230608-210241.log: the log file which contains both the prediction results and the ground truth for each candidate program
- df-configs.csv: the csv file which contains both the prediction results and the ground truth for each candidate program
- dd2_valid_cg_395.json: the final precision configuration found in the search
- best_speedup_cg_395.txt: the corresponding best speedup
Total time for environment setup: approx. 20min
The user has two options to run the tool: the CPU Only option, and the GPU option.
Difference between two options: If the user aims to replicate the experiment results by using the pre-trained model, they can choose the CPU Only option. Otherwise, if they want to train models by themselves, it's recommended to choose the GPU option.
- 40GB free disk space recommended (At least 30GB is needed to download the Docker image and the GitHub repo.)
- Ubuntu
- Recommended version: 20.04 with kernel version 5.14.0 (The reproduction package has not been tested on other operating systems.)
- Docker
- Recommended version: 23.0.1 (The reproduction package has not been tested with other Docker versions.)
- An NVIDIA GPU with 48GB memory is recommended (The reproduction package is tested on the Nvidia RTX A6000, with the driver version 525.)
- NVIDIA Container Toolkit installed with the instructions from here in the Setting up NVIDIA Container Toolkit section (The toolkit allows you to use GPUs in the Docker container.)
- Make sure the Nvidia driver and library versions match in order to use GPU in the docker container. The tool is tested on the Nvidia Kernel Version and API version 525.89.02.
  - Check the run-time driver information: cat /proc/driver/nvidia/version
  - Compare against the version for drivers installed: dpkg -l | grep nvidia-driver
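The comparison between the two driver checks can also be scripted. The sketch below is only an illustration: the helper names are hypothetical, and the two sample strings are assumed examples of what `/proc/driver/nvidia/version` and `dpkg -l | grep nvidia-driver` typically print, not output captured from the tested machine.

```python
import re


def kernel_module_version(proc_text: str):
    """Extract the driver version from the 'NVRM version' line, or None."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version"):
            match = re.search(r"\b(\d+\.\d+(?:\.\d+)?)\b", line)
            return match.group(1) if match else None
    return None


def installed_driver_matches(dpkg_text: str, version: str) -> bool:
    """True if any nvidia-driver package line mentions the given version."""
    return any(
        "nvidia-driver" in line and version in line
        for line in dpkg_text.splitlines()
    )


# Assumed sample text (illustrative only).
proc_sample = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  525.89.02  "
    "Wed Feb  1 23:23:25 UTC 2023\n"
    "GCC version:  gcc version 9.4.0\n"
)
dpkg_sample = "ii  nvidia-driver-525  525.89.02-0ubuntu0.20.04.1  amd64  NVIDIA driver metapackage\n"

version = kernel_module_version(proc_sample)
print(version, installed_driver_matches(dpkg_sample, version))
```

On a real machine, the two sample strings would be replaced by the actual output of the two commands.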
In the following steps, please replace <YOUR LOCAL PATH TO THIS REPO> with the local path of your clone of this GitHub repository.
Clone this GitHub repository to your local directory.
git clone https://github.com/ucd-plse/FPLearner.git <YOUR LOCAL PATH TO THIS REPO>
Please note that the docker image size is 28.3GB.
docker pull ucdavisplse/fplearner
(The CPU option)
docker run -v <YOUR LOCAL PATH TO THIS REPO>:/root/home -ti --name fplearner ucdavisplse/fplearner
(Or the GPU option)
docker run -v <YOUR LOCAL PATH TO THIS REPO>:/root/home -ti --gpus all --name fplearner ucdavisplse/fplearner
If necessary, you can also change the container's name.
When reproducing the experiments below, the first step is always to make sure you are inside the docker container. If you are not, please run the following command (approx. a few secs):
docker start -i fplearner
To exit and stop the container, press Ctrl+D.
Step 4: (For the GPU option) Check if CUDA is currently available in your container (approx. a few secs)
When you are inside the container,
python3 -c 'import torch; print(torch.cuda.is_available())'
The expected terminal output is "True" when CUDA is available in your container. If torch.cuda.is_available() returns "False", first check the PyTorch version and the CUDA version in the running environment, then consult pytorch.org to make sure the PyTorch version matches the CUDA version.
In the setup stage, after the user runs or starts a Docker container using the commands mentioned above (i.e., docker run or docker start) and checks pwd and ls, the user will see the following messages indicating that they are inside the container and the installation is successful:
==========
== CUDA ==
==========
CUDA Version 11.7.1
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
root@<CONTAINER_ID>:~/home# pwd
/root/home
root@<CONTAINER_ID>:~/home# ls
LICENSE README.md case-study docker figures raw scripts
If the user doesn't select the GPU option, a warning like this is expected and will not affect the following run:
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
In the following commands, the user can use -perf to specify the performance prediction task, and use -accr to specify the accuracy prediction task. If neither flag is used, the default task is performance prediction.
The artifact provides the instructions to get statistics of our MixBench dataset.
For the performance prediction task:
cd /root/home/scripts
python3 main.py -data -perf
Expected terminal output:
BATCH SIZE = 16
GRAPHS = AST_CFG_PDG_CAST_DEP
Finish dataset building.
Dataset size is: 628
100%|████████████████████| 628/628 [00:04<00:00, 153.55it/s]
in all:
# runtime == 0: 314 # runtime == 1: 314
# error == 0: 398 # error == 1: 230
Average edge number per graph = 11486.964968152866, average node number per graph = 3195.353503184713
For the accuracy prediction task:
cd /root/home/scripts
python3 main.py -data -accr
Expected terminal output:
BATCH SIZE = 16
GRAPHS = AST_CFG_PDG_CAST_DEP
Finish dataset building.
Dataset size is: 600
100%|████████████████████| 600/600 [00:25<00:00, 23.57it/s]
in all:
# runtime == 0: 372 # runtime == 1: 228
# error == 0: 300 # error == 1: 300
Average edge number per graph = 11596.956666666667, average node number per graph = 3191.036666666667
The output messages describe the dataset for the selected prediction task. The information involves:
- Dataset size
- Label distribution
- The average number of edges and nodes per graph in the dataset
The artifact provides three options to reproduce the testing scores of our model. To simplify, the instructions below use the performance prediction task as an illustration. Alternatively, you can opt for -accr to assess the accuracy prediction task.
By default, you're recommended to reproduce the testing scores reported in the paper using the pre-trained models.
cd /root/home/scripts
python3 main.py -test -perf -b 16
In this command, we are testing the performance prediction model with all types of edges by default, under the batch size 16.
Expected terminal output:
BATCH SIZE = 16
GRAPHS = AST_CFG_PDG_CAST_DEP
Savedir: runtime_AST_CFG_PDG_CAST_DEP
Test dataset size: 127
Split:
test -> 127
Edges: ['AST', 'CFG', 'PDG', 'CAST', 'DEP']
Model (time_AST_CFG_PDG_CAST_DEP_checkpoint.pt) is loaded.
100%|████████████████████| 8/8 [00:44<00:00, 5.56s/it]
Confusion_matrix:
[[60 3]
[ 2 62]]
Test Loss: 1.957213 | Test Acc: 96.094% | Test Pre: 96.72% | Test Rec: 95.96% | Test Fsc: 96.34%
Acc0: 95.24% | Acc1: 96.88% | Pre0: 97.19% | Pre1: 96.25% | Rec0: 95.09% | Rec1: 96.83%
The output message contains the following information:
- Testing dataset size
- Edges extracted in the Precision Interaction Graph (PIG)
- The confusion matrix on the testing dataset
- Metric scores to reflect the model's performance (accuracy, precision, recall and f1 score)
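As an aid to reading the confusion matrix, the sketch below derives accuracy and per-class precision, recall, and F1 from it. It is a generic illustration, not the artifact's own metric code, and it assumes that rows are true labels and columns are predicted labels; `metrics_from_confusion` is a hypothetical name.

```python
def metrics_from_confusion(cm):
    """Compute accuracy and per-class precision/recall/F1.

    Assumes cm[i][j] counts samples with true label i predicted as label j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    per_class = []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = cm[k][k] / predicted_k if predicted_k else 0.0
        recall = cm[k][k] / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, per_class


# Confusion matrix from the expected output above.
accuracy, per_class = metrics_from_confusion([[60, 3], [2, 62]])
print(f"accuracy = {accuracy:.2%}")
for label, m in enumerate(per_class):
    print(f"label {label}: precision = {m['precision']:.2%}, "
          f"recall = {m['recall']:.2%}, f1 = {m['f1']:.2%}")
```

Under this assumption, the per-class recalls (60/63 ≈ 95.24% and 62/64 ≈ 96.88%) coincide with the Acc0 and Acc1 values in the output. The aggregate scores in the log may be computed differently (for example, averaged over batches), so small differences from values derived directly from the matrix are possible.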
If the user wants to replicate the results of the edge ablation study, they can use the -graph option to specify different edge types. For instance, set CFG_PDG_CAST_DEP in this option to test the graph "No AST" for the accuracy prediction model:
cd /root/home/scripts
python3 main.py -test -accr -b 16 -graph CFG_PDG_CAST_DEP
Expected terminal output:
BATCH SIZE = 16
GRAPHS = CFG_PDG_CAST_DEP
Savedir: runtime_CFG_PDG_CAST_DEP
Test dataset size: 120
Split:
test -> 120
Edges: ['CFG', 'PDG', 'CAST', 'DEP']
Model (error_CFG_PDG_CAST_DEP_checkpoint.pt) is loaded.
100%|████████████████████| 8/8 [00:57<00:00, 7.20s/it]
Confusion_matrix:
[[46 11]
[ 3 60]]
Test Loss: 4.462965 | Test Acc: 89.062% | Test Pre: 90.21% | Test Rec: 90.16% | Test Fsc: 90.19%
Acc0: 80.70% | Acc1: 95.24% | Pre0: 94.79% | Pre1: 85.63% | Rec0: 84.21% | Rec1: 96.11%
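The -graph values follow a simple naming scheme: the edge types that are kept, joined by underscores. The sketch below generates the leave-one-out variants together with the matching test command and checkpoint filename. The helper names are hypothetical, the filename pattern is inferred from the two checkpoint names mentioned in this README, and whether a checkpoint exists for each variant should be confirmed by listing raw/model.

```python
# Edge types as printed in the "Edges:" line of the expected output.
EDGE_TYPES = ["AST", "CFG", "PDG", "CAST", "DEP"]


def leave_one_out_graphs(edge_types=EDGE_TYPES):
    """Map each removed edge type to the -graph value built from the rest."""
    return {
        removed: "_".join(e for e in edge_types if e != removed)
        for removed in edge_types
    }


def checkpoint_name(task_prefix: str, graph: str) -> str:
    """Assumed pattern: 'error' or 'time', then the graph name."""
    return f"{task_prefix}_{graph}_checkpoint.pt"


for removed, graph in leave_one_out_graphs().items():
    command = f"python3 main.py -test -accr -b 16 -graph {graph}"
    print(f"no {removed}: {command}  ({checkpoint_name('error', graph)})")
```

For the performance prediction task, the same loop applies with -perf in place of -accr and the prefix `time` in place of `error`.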
We trained our models on the MixBench dataset from scratch using Nvidia RTX A6000 with a batch size of 16 data instances.
If you have the GPU with the required memory size 48GB, you could run this step to train from scratch.
cd /root/home/scripts
python3 main.py -train -b 16
The default total number of training epochs is set to be 500. The early-stopping approach is used with a patience of 30 epochs. The training process is tested on our GPU, Nvidia RTX A6000, where each epoch is expected to take around 2 minutes.
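For readers unfamiliar with early stopping, the sketch below shows the general idea with the two settings mentioned above (at most 500 epochs, patience of 30). It is a generic illustration, not the training loop in main.py; in particular, monitoring a validation loss is an assumption, and `epochs_until_stop` is a hypothetical helper that replays a list of already-recorded losses.

```python
def epochs_until_stop(val_losses, max_epochs=500, patience=30):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch losses.

    Training stops once the loss has not improved for `patience`
    consecutive epochs, or after `max_epochs` epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0  # epochs since the last improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run, best_epoch


# Synthetic losses: improvement for two epochs, then a plateau.
losses = [1.0, 0.5] + [0.6] * 100
print(epochs_until_stop(losses))
```

At roughly 2 minutes per epoch, a run that uses all 500 epochs would take about 1000 minutes (around 16.7 hours), while early stopping can end it sooner.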
The training log and model checkpoints are automatically saved and updated under the scripts/log and scripts/model folders during the training process.
If you have a GPU with a smaller memory size, you could train from scratch by decreasing the batch size, e.g. batch size = 1, but this will lead to a longer training time.
cd /root/home/scripts
python3 main.py -train -b 1
In this section, our artifact presents instructions to reproduce the case studies on four target benchmarks: CG, MG, LULESH, and LBM. The user has the option to incorporate FPLearner into two different precision tuners: Precimonious and HiFPTuner.
Please note that results of our case studies are non-deterministic, and thus are expected to vary from what is reported. The execution time of each benchmark might differ across various machines. As a result, the approximate duration of running the following commands could also vary on different machines.
Before reproducing case studies, we first provide a toy example to show how to use Precimonious to tune the precision of a small benchmark, funarc.
cd /root/home/case-study/Precimonious
python3 run.py funarc 10
Note that the second argument 10 indicates the timeout in seconds to run the benchmark funarc once.
To get an idea of how to compile and run funarc, check the file /root/home/case-study/Precimonious/funarc/scripts/Makefile and run make under the same directory.
cd /root/home/case-study/Precimonious
python3 run.py cg 10
cd /root/home/case-study/Precimonious-plugin
python3 run.py cg 10
cd /root/home/case-study/HiFPTuner-plugin
python3 run.py cg 10
In the command python3 run.py cg 10, we execute the script called run.py to start dynamic precision tuning with the plugin on the benchmark CG. The second argument 10 indicates the maximum time in seconds to run the benchmark CG once.
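To illustrate what a per-run timeout means in practice, the sketch below runs a command once and reports whether it finished within the limit. It is not how run.py is implemented; `run_once` is a hypothetical helper built on the standard `subprocess` module, and a short Python sleep stands in for the benchmark binary.

```python
import subprocess
import sys
import time


def run_once(cmd, timeout_s):
    """Run cmd once; return (status, elapsed_seconds).

    status is "ok", "failed" (non-zero exit code), or "timeout".
    """
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            check=True,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        status = "ok"
    except subprocess.TimeoutExpired:
        status = "timeout"  # the child process is killed by subprocess.run
    except subprocess.CalledProcessError:
        status = "failed"
    return status, time.perf_counter() - start


# Stand-in for a benchmark: a process that sleeps for 5 seconds.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_once(slow, timeout_s=0.5))
print(run_once([sys.executable, "-c", "pass"], timeout_s=10))
```

A candidate configuration whose run exceeds the limit can then be treated as unsuccessful instead of stalling the search.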
The beginning of the sample terminal output:
include.json is generated.
map.json is generated.
rm -f *.out *config_*.json *.txt
Runtime time_predictor ../src/time_finetune.pt is loaded.
Error error_predictor ../src/error_finetune.pt is loaded.
Rootnode is 1534.
One time preloading...
** Searching for valid configuration using delta-debugging algorithm
cp config.json results-eps=4-A/VALID_config_cg_0.json
-------- running config 1 --------
mv config_temp.json results-eps=4-A/INVALID_config_cg_1.json
-------- running config 2 --------
mv config_temp.json results-eps=4-A/INVALID_config_cg_2.json
-------- running config 3 --------
mv config_temp.json results-eps=4-A/INVALID_config_cg_3.json
Please ignore the following error messages, which may appear during the search but do not affect the run:
- "pandas FutureWarning: df.append"
- "clang-12: error: linker command failed with exit code 1"
- "fatal error: 'npbparams.h' file not found"
Expected results:
After the precision tuning is done, you can find a folder in case-study/<TunerName>-plugin/cg/run/results-eps=4-A which contains the following files (<TunerName> is either Precimonious or HiFPTuner):
- *.json: all precision configurations in the search
- .log: a log file containing model prediction results for each configuration and the corresponding verification results
- .csv: a csv file containing model prediction results for each configuration and the corresponding verification results
- dd2_valid_{BENCH}_{IDX}.json: the best precision configuration found by our tool
- best_speedup_{BENCH}_{IDX}.txt: the corresponding best speedup
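A quick way to summarize a finished search is to count the configuration files by their filename prefix. The sketch below assumes the `VALID_config_*` / `INVALID_config_*` naming seen in the sample terminal output; `summarize_configs` is a hypothetical helper, and the directory passed to it is only an example path.

```python
from collections import Counter
from pathlib import Path


def summarize_configs(results_dir) -> Counter:
    """Count files named <STATUS>_config_<...>.json by their STATUS prefix."""
    counts = Counter()
    root = Path(results_dir)
    if not root.is_dir():
        return counts  # nothing to count if the folder does not exist
    for path in root.glob("*_config_*.json"):
        status = path.name.split("_config_", 1)[0]
        counts[status] += 1
    return counts


# Example path; adjust to the results folder of the benchmark that was tuned.
print(summarize_configs("results-eps=4-A"))
```

Files such as dd2_valid_{BENCH}_{IDX}.json do not contain `_config_` in their names and are therefore not counted.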
cd /root/home/case-study/Precimonious-plugin
python3 run.py mg 10
cd /root/home/case-study/HiFPTuner-plugin
python3 run.py mg 10
For MG, the timeout is 10s. The expected terminal output and results are similar to CG.
cd /root/home/case-study/Precimonious-plugin
python3 run.py lulesh 30
cd /root/home/case-study/HiFPTuner-plugin
python3 run.py lulesh 30
For LULESH, the timeout is 30s. The expected terminal output and results are similar to CG.
The LBM benchmark from SPEC CPU 2017 is proprietary and requires a license from SPEC to use. We offer the instructions below and scripts for running our tool on LBM, but we don't provide the source code of LBM or any scripts to run SPEC benchmarks. If you have a license for the SPEC CPU 2017 benchmark suites, please follow the instructions to run our tool:
Step 1: Download and collect the source code of the LBM benchmark.
Step 2: Make sure you are able to compile and run LBM with specmake provided by SPEC. For more information or instructions, please refer to the official SPEC website.
Step 3: Install SPEC CPU 2017 in the docker container.
Step 4: Copy the source code of LBM to the paths /root/home/case-study/<TunerName>-plugin/scripts and /root/home/case-study/<TunerName>-plugin/tempscripts. (Note that <TunerName> is either Precimonious or HiFPTuner. The following instructions use Precimonious as an example.)
Step 5: Run the case study with the following commands
cd /root/home/case-study/Precimonious-plugin
python3 run.py lbm 300
For LBM, the timeout is 300s. The expected terminal output and results are similar to CG.