Automated ensembling (#43)
* add new structure JsonEncodedList and new Column base_learner_ids in StackedEnsemble

* add filter so stacked ensembles won't be repeated

* fix some bugs

* add unfinished start_greedy_ensemble_search

* finish start_greedy_ensemble_search

* fix some bugs

* fix bug

* add unique constraint to association table

* make search greedier

* add filtering to base learners

* add some functions to create greedy run in ContainerBaseLearner

* Fa Cogs are cool

* add UI for creating greedy run

* add docs for gfms
reiinakano committed Jun 23, 2017
1 parent 977f68c commit 8c07dc4
Showing 13 changed files with 335 additions and 14 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -43,3 +43,4 @@ coverage.xml
# misc
.DS_Store
.env
*~
39 changes: 39 additions & 0 deletions docs/advanced.rst
@@ -174,3 +174,42 @@ The ``pbounds`` variable is a dictionary that maps the hyperparameters to tune w
}

For more info on setting ``maximize_config``, please see the :func:`maximize` method of the :class:`bayes_opt.BayesianOptimization` class in the `BayesianOptimization source code <https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py>`_. Seeing this `notebook example <https://github.com/fmfn/BayesianOptimization/blob/master/examples/exploitation%20vs%20exploration.ipynb>`_ will also give you some intuition on how the different acquisition function parameters ``acq``, ``kappa``, and ``xi`` affect the Bayesian search.

Greedy Forward Model Selection
------------------------------

Stacking is usually reserved as the last step of the Xcessiv process, after you've squeezed out all you can from pipeline and hyperparameter optimization. When you create a stacked ensemble, you can usually expect its performance to be better than that of any single base learner in the ensemble.

The problem here lies in figuring out which base learners to include in your ensemble. Stacking together the top N base learners is a good first strategy, but not always optimal. Even if a base learner doesn't perform that well on its own, it could still provide brand new information to the secondary learner, thereby boosting the entire ensemble's performance even further. One way to look at it is that it provides the secondary learner a *new angle* to look at the problem and make better judgments moving forward.

Figuring out which base learners to add to a stacked ensemble is much like hyperparameter optimization. You can't really be sure if something will work until you try it. Unfortunately, trying out every possible combination of base learners is infeasible when you have hundreds of base learners to choose from.

Xcessiv provides an automated ensemble construction method based on a heuristic process called **greedy forward model selection**. This method is adapted from `Ensemble Selection from Libraries of Models <http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf>`_ by Caruana et al.

In a nutshell, the algorithm is as follows:

1) Start with the empty ensemble.
2) Add to the ensemble the model in the library that maximizes the ensemble's performance on the error metric.
3) Repeat step 2 for a fixed number of iterations or until all models have been used.

That's it!
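Stripped of Xcessiv's database bookkeeping, the loop above can be sketched in a few lines of plain Python. Here ``score_ensemble`` is a hypothetical stand-in for fitting the secondary learner on a candidate ensemble's meta-features and computing the chosen metric (higher is assumed better):

```python
def greedy_forward_selection(library, score_ensemble, max_size):
    """Greedy forward model selection: grow the ensemble one model at a
    time, always adding whichever model improves the score the most."""
    best_ensemble = []
    for _ in range(max_size):
        best_score, best_model = float('-inf'), None
        for model in library:
            if model in best_ensemble:
                continue  # each model may appear in the ensemble at most once
            score = score_ensemble(best_ensemble + [model])
            if score > best_score:
                best_score, best_model = score, model
        if best_model is None:
            break  # every model in the library has been used
        best_ensemble.append(best_model)
    return best_ensemble
```

Note that, as in Xcessiv's implementation, each round keeps its best candidate even if that round's score is lower than the previous round's; you inspect the evaluated ensembles afterwards and keep the one you like best.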

To perform greedy forward model selection in Xcessiv, simply click on the **Automated ensemble search** button in the Stacked Ensemble section.

Select your secondary learner in the configuration modal (Logistic Regression is a good first choice for classification tasks), copy the following code into the code box, and click **Go** to start your automated run::

secondary_learner_hyperparameters = {} # hyperparameters of secondary learner

metric_to_optimize = 'Accuracy' # metric to optimize

    invert_metric = False  # Whether or not to invert the metric, e.g. when optimizing a loss

max_num_base_learners = 6 # Maximum size of ensemble to consider (the higher this is, the longer the run will take)

``secondary_learner_hyperparameters`` is a dictionary containing the hyperparameters for your chosen secondary learner. Again, an empty dictionary signifies default parameters.

``metric_to_optimize`` and ``invert_metric`` mean the same things as they do in :ref:`Bayesian Hyperparameter Search`.

``max_num_base_learners`` refers to the total number of iterations of the algorithm. As such, this also signifies the maximum number of base learners that a stacked ensemble found through this automated run can contain. Please note that the higher this number is, the longer the search will run.

Unlike TPOT pipeline construction and Bayesian optimization, which both have an element of randomness, greedy forward model selection will always explore the same ensembles if the pool of base learners remains unchanged.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -21,6 +21,7 @@ Features
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
* Stacked ensemble creation in a few clicks
* Automated ensemble construction through greedy forward model selection
* Export your stacked ensemble as a standalone Python file to support multiple levels of stacking

----------------
157 changes: 150 additions & 7 deletions xcessiv/automatedruns.py
@@ -206,10 +206,6 @@ def start_naive_bayes(automated_run, session, path):

bo.maximize(**module.maximize_config)

automated_run.job_status = 'finished'
session.add(automated_run)
session.commit()


def start_tpot(automated_run, session, path):
"""Starts a TPOT automated run that exports directly to base learner setup
@@ -248,8 +244,155 @@ def start_tpot(automated_run, session, path):
meta_feature_generator='predict'
)

automated_run.job_status = 'finished'

session.add(blo)
session.add(automated_run)
session.commit()


def eval_stacked_ensemble(stacked_ensemble, session, path):
"""Evaluate stacked ensemble
Args:
stacked_ensemble (xcessiv.models.StackedEnsemble)
session: Valid SQLAlchemy session
path (str, unicode): Path to project folder
Returns:
stacked_ensemble (xcessiv.models.StackedEnsemble)
"""
try:
meta_features_list = []
for base_learner in stacked_ensemble.base_learners:
mf = np.load(base_learner.meta_features_path(path))
if len(mf.shape) == 1:
mf = mf.reshape(-1, 1)
meta_features_list.append(mf)

secondary_features = np.concatenate(meta_features_list, axis=1)

# Get data
extraction = session.query(models.Extraction).first()
return_splits_iterable = functions.import_object_from_string_code(
extraction.meta_feature_generation['source'],
'return_splits_iterable'
)
X, y = extraction.return_train_dataset()

# We need to retrieve original order of meta-features
indices_list = [test_index for train_index, test_index in return_splits_iterable(X, y)]
indices = np.concatenate(indices_list)
X, y = X[indices], y[indices]

est = stacked_ensemble.return_secondary_learner()

return_splits_iterable_stacked_ensemble = functions.import_object_from_string_code(
extraction.stacked_ensemble_cv['source'],
'return_splits_iterable'
)
preds = []
trues_list = []
for train_index, test_index in return_splits_iterable_stacked_ensemble(secondary_features, y):
X_train, X_test = secondary_features[train_index], secondary_features[test_index]
y_train, y_test = y[train_index], y[test_index]
est = est.fit(X_train, y_train)
preds.append(
getattr(est, stacked_ensemble.base_learner_origin.
meta_feature_generator)(X_test)
)
trues_list.append(y_test)
preds = np.concatenate(preds, axis=0)
y_true = np.concatenate(trues_list)

for key in stacked_ensemble.base_learner_origin.metric_generators:
metric_generator = functions.import_object_from_string_code(
stacked_ensemble.base_learner_origin.metric_generators[key],
'metric_generator'
)
stacked_ensemble.individual_score[key] = metric_generator(y_true, preds)

stacked_ensemble.job_status = 'finished'
session.add(stacked_ensemble)
session.commit()
return stacked_ensemble

except:
session.rollback()
stacked_ensemble.job_status = 'errored'
stacked_ensemble.description['error_type'] = repr(sys.exc_info()[0])
stacked_ensemble.description['error_value'] = repr(sys.exc_info()[1])
stacked_ensemble.description['error_traceback'] = \
traceback.format_exception(*sys.exc_info())
session.add(stacked_ensemble)
session.commit()
raise


def start_greedy_ensemble_search(automated_run, session, path):
"""Starts an automated ensemble search using greedy forward model selection.
    The steps for this search are adapted from "Ensemble Selection from Libraries of Models" by
    Caruana et al.
    1. Start with the empty ensemble
    2. Add to the ensemble the model in the library that maximizes the ensemble's
    performance on the error metric.
    3. Repeat step 2 for a fixed number of iterations or until all models have been used.
Args:
automated_run (xcessiv.models.AutomatedRun): Automated run object
session: Valid SQLAlchemy session
path (str, unicode): Path to project folder
"""
module = functions.import_string_code_as_module(automated_run.source)
assert module.metric_to_optimize in automated_run.base_learner_origin.metric_generators

best_ensemble = [] # List containing IDs of best performing ensemble for the last round

secondary_learner = automated_run.base_learner_origin.return_estimator()
secondary_learner.set_params(**module.secondary_learner_hyperparameters)

for i in range(module.max_num_base_learners):
best_score = -float('inf') # Best metric for this round (not in total!)
current_ensemble = best_ensemble[:] # Shallow copy of best ensemble
for base_learner in session.query(models.BaseLearner).filter_by(job_status='finished').all():
            if base_learner in current_ensemble:  # Skip learners already in the ensemble
continue
current_ensemble.append(base_learner)

            # Check if this candidate ensemble has already been created and evaluated
existing_ensemble = session.query(models.StackedEnsemble).\
filter_by(base_learner_origin_id=automated_run.base_learner_origin.id,
secondary_learner_hyperparameters=secondary_learner.get_params(),
base_learner_ids=sorted([bl.id for bl in current_ensemble])).first()

if existing_ensemble and existing_ensemble.job_status == 'finished':
score = existing_ensemble.individual_score[module.metric_to_optimize]

elif existing_ensemble and existing_ensemble.job_status != 'finished':
eval_stacked_ensemble(existing_ensemble, session, path)
score = existing_ensemble.individual_score[module.metric_to_optimize]

else:
stacked_ensemble = models.StackedEnsemble(
secondary_learner_hyperparameters=secondary_learner.get_params(),
base_learners=current_ensemble,
base_learner_origin=automated_run.base_learner_origin,
job_status='started'
)
session.add(stacked_ensemble)
session.commit()
eval_stacked_ensemble(stacked_ensemble, session, path)
score = stacked_ensemble.individual_score[module.metric_to_optimize]

score = -score if module.invert_metric else score

if best_score < score:
best_score = score
best_ensemble = current_ensemble[:]

current_ensemble.pop()
19 changes: 17 additions & 2 deletions xcessiv/models.py
@@ -3,7 +3,7 @@
import random
import string
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Text, Integer, Boolean, TypeDecorator, ForeignKey, Table
from sqlalchemy import Column, Text, Integer, Boolean, TypeDecorator, ForeignKey, Table, UniqueConstraint
from sqlalchemy.orm import relationship
from sqlalchemy.ext import mutable
import numpy as np
@@ -30,6 +30,18 @@ def process_result_value(self, value, dialect):
return json.loads(value)


class JsonEncodedList(TypeDecorator):
"""Enables JSON storage of list by encoding and decoding on the fly.
No need for mutability tracking"""
impl = Text

def process_bind_param(self, value, dialect):
return json.dumps(value, sort_keys=True)

def process_result_value(self, value, dialect):
return json.loads(value)
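As a quick illustration of the round trip this decorator performs, the two hooks above reduce to plain ``json`` calls (sketched here in isolation, outside SQLAlchemy; ``bind`` and ``result`` mirror ``process_bind_param`` and ``process_result_value``):

```python
import json

# What JsonEncodedList does on write (bind) and read (result), in isolation.
def bind(value):
    return json.dumps(value, sort_keys=True)  # Python list -> JSON text for the Text column

def result(value):
    return json.loads(value)  # JSON text -> Python list again

ids = [2, 7, 5]
assert result(bind(ids)) == ids
# sort_keys only affects dict keys; a list's element order is preserved as stored.
assert bind([3, 1, 2]) == '[3, 1, 2]'
```

In ``StackedEnsemble`` below, ``base_learner_ids`` is stored pre-sorted (see its ``__init__``), which is what lets the greedy search look up an existing ensemble by a canonical list of IDs.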


mutable.MutableDict.associate_with(JsonEncodedDict)


@@ -257,7 +269,8 @@ def serialize(self):
association_table = Table(
'association', Base.metadata,
Column('baselearner_id', Integer, ForeignKey('baselearner.id')),
Column('stackedensemble_id', Integer, ForeignKey('stackedensemble.id'))
Column('stackedensemble_id', Integer, ForeignKey('stackedensemble.id')),
UniqueConstraint('baselearner_id', 'stackedensemble_id', name='UC_baselearner_id_stackedensemble_id')
)


@@ -366,6 +379,7 @@ class StackedEnsemble(Base):
secondary=association_table,
back_populates='stacked_ensembles'
)
base_learner_ids = Column(JsonEncodedList)
base_learner_origin_id = Column(Integer, ForeignKey('baselearnerorigin.id'))
base_learner_origin = relationship('BaseLearnerOrigin', back_populates='stacked_ensembles')
secondary_learner_hyperparameters = Column(JsonEncodedDict)
@@ -383,6 +397,7 @@ def __init__(self, secondary_learner_hyperparameters, base_learners,
self.job_status = job_status
self.job_id = None
self.description = dict()
self.base_learner_ids = sorted([bl.id for bl in base_learners])

def return_secondary_learner(self):
"""Returns secondary learner using its origin and the given hyperparameters
7 changes: 7 additions & 0 deletions xcessiv/rqtasks.py
@@ -199,9 +199,16 @@ def start_automated_run(path, automated_run_id):
elif automated_run.category == 'tpot':
automatedruns.start_tpot(automated_run, session, path)

elif automated_run.category == 'greedy_ensemble_search':
automatedruns.start_greedy_ensemble_search(automated_run, session, path)

else:
raise Exception('Something went wrong. Invalid category for automated run')

automated_run.job_status = 'finished'
session.add(automated_run)
session.commit()

except:
session.rollback()
automated_run.job_status = 'errored'
3 changes: 3 additions & 0 deletions xcessiv/ui/src/AutomatedRuns/AutomatedRunsDisplay.js
@@ -9,6 +9,7 @@ import FaTrash from 'react-icons/lib/fa/trash';
import FaSpinner from 'react-icons/lib/fa/spinner';
import FaExclamationCircle from 'react-icons/lib/fa/exclamation-circle'
import FaInfo from 'react-icons/lib/fa/info';
import FaCogs from 'react-icons/lib/fa/cogs';
import { DetailsModal, DeleteModal } from './Modals'

class AutomatedRunsDisplay extends Component {
@@ -26,6 +27,8 @@
<table><tbody><tr>
<td>
<Button onClick={() => this.setState({open: !this.state.open})}>
<FaCogs />
{' '}
{(this.state.open ? 'Hide' : 'Show') + ' Automated Runs'}
</Button>
</td>
4 changes: 3 additions & 1 deletion xcessiv/ui/src/BaseLearnerOrigin/BaseLearnerOrigin.js
@@ -12,6 +12,7 @@ import $ from 'jquery';
import FaCheck from 'react-icons/lib/fa/check';
import FaSpinner from 'react-icons/lib/fa/spinner';
import FaExclamationCircle from 'react-icons/lib/fa/exclamation-circle';
import FaCogs from 'react-icons/lib/fa/cogs';
import { Button, ButtonToolbar, Glyphicon, Alert, Panel as BsPanel,
Form, FormGroup, ControlLabel, FormControl, DropdownButton,
MenuItem } from 'react-bootstrap';
@@ -417,7 +418,8 @@
<Button
disabled={!this.props.data.final}
onClick={() => this.setState({showBayesianRunModal: true})}>
Bayesian Optimization
<FaCogs />
{' Bayesian Optimization'}
</Button>

</ButtonToolbar>
3 changes: 2 additions & 1 deletion xcessiv/ui/src/BaseLearnerOrigin/ListBaseLearnerOrigin.js
@@ -3,6 +3,7 @@ import './BaseLearnerOrigin.css';
import BaseLearnerOrigin from './BaseLearnerOrigin';
import { Button, Glyphicon, ButtonGroup } from 'react-bootstrap';
import { TpotModal } from './BaseLearnerOriginModals'
import FaCogs from 'react-icons/lib/fa/cogs';

class ListBaseLearnerOrigin extends Component {

@@ -47,7 +48,7 @@
{' Add new base learner origin'}
</Button>
<Button href="#" onClick={() => this.setState({showTpotModal: true})}>
<Glyphicon glyph="plus" />
<FaCogs />
{' Automated base learner generation with TPOT'}
</Button>
</ButtonGroup>
17 changes: 16 additions & 1 deletion xcessiv/ui/src/Ensemble/EnsembleBuilder.js
@@ -9,6 +9,8 @@ import CodeMirror from 'react-codemirror';
import 'codemirror/lib/codemirror.css';
import 'codemirror/mode/python/python';
import { Button, Glyphicon } from 'react-bootstrap';
import FaCogs from 'react-icons/lib/fa/cogs';
import { GreedyRunModal } from './EnsembleMoreDetailsModal'


const defaultSourceParams = [
@@ -24,7 +26,8 @@
super(props);
this.state = {
selectedValue: null,
source: defaultSourceParams
source: defaultSourceParams,
showGreedyModal: false
};
}

@@ -67,13 +70,25 @@
/>
<Button
block
href="#"
disabled={buttonDisabled}
bsStyle='primary'
onClick={() => this.props.createStackedEnsemble(
this.state.selectedValue.value, this.state.source)}>
<Glyphicon glyph="plus" />
{' Create new ensemble'}
</Button>
<Button
block
onClick={() => this.setState({showGreedyModal: true})}
>
<FaCogs />
{' Automated ensemble search'}
</Button>
<GreedyRunModal isOpen={this.state.showGreedyModal}
onRequestClose={() => this.setState({showGreedyModal: false})}
handleYes={(id, source) => this.props.startGreedyRun(id, source)}
optionsBaseLearnerOrigins={this.props.optionsBaseLearnerOrigins}/>
</div>
)
}
