FLASH is a package for Bayesian optimization of data analytic pipelines. Specifically, it is a two-layer Bayesian optimization framework: the first layer uses a parametric model to select promising algorithms, and the second layer uses a nonparametric model to fine-tune the hyperparameters of those algorithms.
Details of FLASH are described in the paper:
FLASH: Fast Bayesian Optimization for Data Analytic Pipelines [arXiv]
Yuyu Zhang, Mohammad Taha Bahadori, Hang Su, Jimeng Sun
FLASH is licensed under the GPL; the license text is included in the package.
FLASH is developed on top of HPOlib, a general platform for hyperparameter optimization. Since HPOlib was developed on Ubuntu and currently only supports Linux distributions, FLASH also only works on Linux (we developed and tested our package on Ubuntu 14.04.4 LTS).
1. Clone repository
git clone https://github.com/yuyuz/FLASH.git
2. Install Miniconda
To avoid a variety of potential problems with environment settings, we highly recommend using Miniconda (Python 2.7).
If you are using a 64-bit Linux system (recommended):
wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh
If you are using a 32-bit Linux system:
wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86.sh
bash Miniconda-latest-Linux-x86.sh
Answer yes when the installer asks whether to prepend the Miniconda2 install location to PATH in your .bashrc.
After the installation completes, restart your terminal or run source ~/.bashrc to make sure that conda has taken charge of your Python environment.
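To verify that conda is now in charge, a quick check like the following should show a Python interpreter under your Miniconda install and report Python 2.7.x (the exact paths will vary by machine):
which python
conda --version
python --version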
3. Install dependencies
Now install the dependencies within the conda environment:
easy_install -U distribute
conda install -y openblas numpy scipy matplotlib scikit-learn==0.16.1
pip install hyperopt liac-arff
4. Install package
Install HPOlib and some required packages (pymongo, protobuf, networkx). During the installation, keep your system connected to the Internet so that setup.py can download the optimizer code packages.
cd /path/to/FLASH
python setup.py install
All the benchmark datasets are publicly available here. These datasets were first introduced by Auto-WEKA and have been widely used to evaluate Bayesian optimization methods.
Due to the file size limit, we are not able to provide all of those datasets in our GitHub repository; only the madelon dataset is included as an example. To deploy a new benchmark dataset, download its zip file from here and uncompress it. You will get a dataset folder containing two files, train.arff and test.arff. Move this folder into the data directory, just like the madelon dataset folder we already put there.
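As a rough sketch, deploying a new dataset looks like the following; the URL and the dataset name are placeholders, so substitute the actual download link and folder name:
cd /path/to/FLASH/data
wget <dataset-download-url>    # placeholder for the actual link above
unzip <dataset-name>.zip       # yields a folder containing train.arff and test.arff
ls <dataset-name>              # should list train.arff and test.arff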
For benchmark datasets, we build a general data analytic pipeline based on scikit-learn, following the pipeline design of auto-sklearn. We have 4 computational steps with 33 algorithms in this pipeline. Details are discussed in the paper.
To run this pipeline on a specific dataset, first set the configuration file (/path/to/FLASH/benchmarks/sklearn/config.cfg) correctly:
- In the HPOLIB section, change the function path according to your local setting.
- In the HPOLIB section, change the data_path according to your local setting.
- In the HPOLIB section, change the dataset name to whichever dataset you have deployed as the input of the pipeline.
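For example, the relevant lines of the HPOLIB section might look like the following; the function script path, data path, and dataset name below are illustrative placeholders, not the shipped defaults:
[HPOLIB]
function = /path/to/FLASH/benchmarks/sklearn/<pipeline_script>.py
data_path = /path/to/FLASH/data
dataset = madelon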
Now you can tune the pipeline using different Bayesian optimization methods. For each method, we provide a Python script to run the tuning process.
Our method currently comes in two versions, which differ in the optimizer used in the last phase: FLASH and FLASH*.
To run FLASH:
cd /path/to/FLASH/benchmarks/sklearn
python run_flash.py
To run FLASH*:
cd /path/to/FLASH/benchmarks/sklearn
python run_flash_star.py
For other methods (SMAC, TPE, Random Search), we use the implementations in HPOlib and also provide Python scripts.
To run SMAC:
cd /path/to/FLASH/benchmarks/sklearn
python run_smac.py
To run TPE:
cd /path/to/FLASH/benchmarks/sklearn
python run_tpe.py
To run Random Search:
cd /path/to/FLASH/benchmarks/sklearn
python run_random.py
In the configuration file (/path/to/FLASH/benchmarks/sklearn/config.cfg), you can set quite a few advanced options.
The configuration items in the HPOLIB section are effective for all of the optimization methods above:
- Set use_caching to 1 to enable pipeline caching, or 0 to disable caching.
- cv_folds specifies the number of cross-validation folds used during the optimization.
- For other items such as number_of_jobs and result_on_terminate, refer to the HPOlib manual.
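For instance, enabling caching with 10-fold cross validation would look roughly like this in the HPOLIB section (the fold count is just an example):
[HPOLIB]
use_caching = 1
cv_folds = 10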
The configuration items in the LR section are only effective for FLASH and FLASH*:
- Set use_optimal_design to 1 to enable optimal design for initialization, or 0 to use random initialization.
- init_budget specifies the number of iterations for Phase 1 (initialization).
- ei_budget specifies the number of iterations for Phase 2 (pruning).
- bopt_budget specifies the number of iterations for Phase 3 (fine-tuning).
- ei_xi is the trade-off parameter ξ in the EI and EIPS functions, which balances exploitation and exploration.
- top_k_pipelines specifies the number of best pipeline paths to select at the end of Phase 2 (pruning).
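Putting these together, an LR section might look roughly like the following; the budget values, ei_xi, and top_k_pipelines shown here are illustrative examples, not recommended settings:
[LR]
use_optimal_design = 1
init_budget = 10
ei_budget = 20
bopt_budget = 50
ei_xi = 0.01
top_k_pipelines = 3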