Continuous benchmark suite for the scikit-learn project.
In order to run the benchmarks on your own machine, please follow these steps.
Clone the repository somewhere, for example
Extract the datasets:

    cd ~/code/scikit-learn-speed/benchmarks
    tar jxvf data.tar.bz2

Create the configuration file
~/.vbench-skl. For example:

    [setup]
    repo_path = /Users/vene/code/scikit-learn
    repo_url = email@example.com:scikit-learn/scikit-learn.git
    db_path = /Users/vene/code/scikit-learn-speed/benchmarks/benchmarks.db
    tmp_dir = /tmp/vb_sklearn

The values shown above are the hardcoded defaults. Any setting that is
missing from the configuration file, or left out of it, falls back to its default.
Specifically, this means you don't have to bother to set
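
The fallback to defaults can be pictured with a short sketch. This is not the suite's actual loading code: the names `DEFAULTS` and `load_settings` are hypothetical, and the only assumptions taken from this README are the `[setup]` section, its four keys, and the `~/.vbench-skl` location.

```python
import configparser
import os

# Defaults copied from the example configuration in this README.
DEFAULTS = {
    "repo_path": "/Users/vene/code/scikit-learn",
    "repo_url": "email@example.com:scikit-learn/scikit-learn.git",
    "db_path": "/Users/vene/code/scikit-learn-speed/benchmarks/benchmarks.db",
    "tmp_dir": "/tmp/vb_sklearn",
}


def load_settings(path="~/.vbench-skl"):
    """Hypothetical helper: read [setup] and fill gaps from DEFAULTS."""
    parser = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read() silently ignores files that do not exist.
    parser.read(os.path.expanduser(path))

    settings = dict(DEFAULTS)
    if parser.has_section("setup"):
        # Values present in the file override the defaults.
        settings.update(parser.items("setup"))
    return settings
```

With a file that only sets `repo_path`, the other three keys would keep their default values under this scheme.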

    python run_suite.py           # This runs the entire suite, ~10min on my machine
    python generate_rst_files.py  # This prepares the rst documentation

To actually generate the HTML files, change to the
scikit-learn-speed folder and execute::
You can view the results by opening
The following datasets are available:
- arcene: train: (100, 10000), test: (100, 10000)
- madelon: train: (2000, 500), test: (600, 500)
- minimadelon: train: (30, 500), test: (20, 500), 10 outputs
- blobs: train: (300, 50), test: (200, 50), 10 tight centers
- newsgroups: sparse, train: (11214, 130088), test: (7432, 130088)
In addition, you can append the following options to any dataset's name:
-oney: Only keeps the first output, i.e. y = y[:, 0]. Necessary for estimators that don't support multidimensional output arrays.
-semi: Unlabels samples at random by setting the corresponding output to -1. Useful for semi-supervised algorithms.
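
As an illustration of what the two options do to a target array, here is a minimal NumPy sketch. It is not the suite's own implementation: the function names `apply_oney` and `apply_semi` are hypothetical, and the unlabeled fraction and random seed are assumptions, since this README does not say how many samples get unlabeled.

```python
import numpy as np


def apply_oney(y):
    """Keep only the first output column, as in y = y[:, 0]."""
    return y[:, 0]


def apply_semi(y, fraction=0.5, seed=0):
    """Unlabel a random subset of samples by setting them to -1.

    `fraction` and `seed` are assumed parameters, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    n_samples = y.shape[0]
    n_unlabeled = int(fraction * n_samples)
    # Pick distinct sample indices to unlabel.
    idx = rng.choice(n_samples, size=n_unlabeled, replace=False)
    y_semi = y.copy()  # leave the caller's array untouched
    y_semi[idx] = -1
    return y_semi


# Toy targets shaped like minimadelon's training labels: 30 samples, 10 outputs.
y = np.random.default_rng(42).integers(0, 3, size=(30, 10))
y_one = apply_oney(y)        # shape (30,)
y_semi = apply_semi(y_one)   # half of the entries replaced by -1
```

Using -1 as the marker matches the convention described above for semi-supervised estimators, so it only works when -1 is not itself a valid label.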