Automated ensembling (#43)
* add new structure JsonEncodedList and new Column base_learner_ids in StackedEnsemble

* add filter so stacked ensembles won't be repeated

* fix some bugs

* add unfinished start_greedy_ensemble_search

* finish start_greedy_ensemble_search

* fix some bugs

* fix bug

* add unique constraint to association table

* make search greedier

* add filtering to base learners

* add some functions to create greedy run in ContainerBaseLearner

* Fa Cogs are cool

* add UI for creating greedy run

* add docs for gfms
reiinakano committed Jun 23, 2017
1 parent 977f68c commit 8c07dc4
Showing 13 changed files with 335 additions and 14 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -43,3 +43,4 @@ coverage.xml
# misc
.DS_Store
.env
*~
39 changes: 39 additions & 0 deletions docs/advanced.rst
@@ -174,3 +174,42 @@ The ``pbounds`` variable is a dictionary that maps the hyperparameters to tune w
}

For more info on setting ``maximize_config``, please see the :func:`maximize` method of the :class:`bayes_opt.BayesianOptimization` class in the `BayesianOptimization source code <https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py>`_. Seeing this `notebook example <https://github.com/fmfn/BayesianOptimization/blob/master/examples/exploitation%20vs%20exploration.ipynb>`_ will also give you some intuition on how the different acquisition function parameters ``acq``, ``kappa``, and ``xi`` affect the Bayesian search.

Greedy Forward Model Selection
------------------------------

Stacking is usually reserved as the last step of the Xcessiv process, after you've squeezed out all you can from pipeline and hyperparameter optimization. When you create a stacked ensemble, you can usually expect its performance to be better than that of any single base learner in the ensemble.

The problem here lies in figuring out which base learners to include in your ensemble. Stacking together the top N base learners is a good first strategy, but not always optimal. Even if a base learner doesn't perform that well on its own, it could still provide brand new information to the secondary learner, thereby boosting the entire ensemble's performance even further. One way to look at it is that it provides the secondary learner a *new angle* to look at the problem and make better judgments moving forward.

Figuring out which base learners to add to a stacked ensemble is much like hyperparameter optimization. You can't really be sure if something will work until you try it. Unfortunately, trying out every possible combination of base learners is infeasible when you have hundreds of base learners to choose from.

Xcessiv provides an automated ensemble construction method based on a heuristic process called **greedy forward model selection**. This method is adapted from `Ensemble Selection from Libraries of Models <http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf>`_ by Caruana et al.

In a nutshell, the algorithm is as follows:

1) Start with the empty ensemble.
2) Add to the ensemble the model in the library that maximizes the ensemble's performance on the error metric.
3) Repeat step 2 for a fixed number of iterations or until all models have been used.

That's it!
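Stripped of Xcessiv's database bookkeeping, the loop above can be sketched in a few lines of plain Python. Here ``score_ensemble`` is a hypothetical stand-in for fitting the secondary learner on a candidate ensemble's meta-features and computing the chosen metric (higher is assumed better):

```python
def greedy_forward_selection(library, score_ensemble, max_size):
    """Greedy forward model selection: grow the ensemble one model at a
    time, always adding whichever model improves the score the most."""
    best_ensemble = []
    for _ in range(max_size):
        best_score, best_model = float('-inf'), None
        for model in library:
            if model in best_ensemble:
                continue  # each model may appear in the ensemble at most once
            score = score_ensemble(best_ensemble + [model])
            if score > best_score:
                best_score, best_model = score, model
        if best_model is None:
            break  # every model in the library has been used
        best_ensemble.append(best_model)
    return best_ensemble
```

Note that, as in Xcessiv's implementation, each round keeps its best candidate even if that round's score is lower than the previous round's; you inspect the evaluated ensembles afterwards and keep the one you like best.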

To perform greedy forward model selection in Xcessiv, simply click on the **Automated ensemble search** button in the Stacked Ensemble section.

Select your secondary learner in the configuration modal (Logistic Regression is a good first choice for classification tasks), copy the following code into the code box, and click **Go** to start your automated run::

secondary_learner_hyperparameters = {} # hyperparameters of secondary learner

metric_to_optimize = 'Accuracy' # metric to optimize

    invert_metric = False  # Whether or not to invert the metric, e.g. when optimizing a loss

max_num_base_learners = 6 # Maximum size of ensemble to consider (the higher this is, the longer the run will take)

``secondary_learner_hyperparameters`` is a dictionary containing the hyperparameters for your chosen secondary learner. Again, an empty dictionary signifies default parameters.

``metric_to_optimize`` and ``invert_metric`` mean the same things as they do in :ref:`Bayesian Hyperparameter Search`.

``max_num_base_learners`` refers to the total number of iterations of the algorithm. As such, this also signifies the maximum number of base learners that a stacked ensemble found through this automated run can contain. Please note that the higher this number is, the longer the search will run.

Unlike TPOT pipeline construction and Bayesian optimization, which both have an element of randomness, greedy forward model selection will always explore the same ensembles if the pool of base learners remains unchanged.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -21,6 +21,7 @@ Features
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
* Stacked ensemble creation in a few clicks
* Automated ensemble construction through greedy forward model selection
* Export your stacked ensemble as a standalone Python file to support multiple levels of stacking

----------------
157 changes: 150 additions & 7 deletions xcessiv/automatedruns.py
@@ -206,10 +206,6 @@ def start_naive_bayes(automated_run, session, path):

bo.maximize(**module.maximize_config)

automated_run.job_status = 'finished'
session.add(automated_run)
session.commit()


def start_tpot(automated_run, session, path):
"""Starts a TPOT automated run that exports directly to base learner setup
@@ -248,8 +244,155 @@ def start_tpot(automated_run, session, path):
meta_feature_generator='predict'
)

automated_run.job_status = 'finished'

session.add(blo)
session.add(automated_run)
session.commit()


def eval_stacked_ensemble(stacked_ensemble, session, path):
"""Evaluate stacked ensemble
Args:
stacked_ensemble (xcessiv.models.StackedEnsemble)
session: Valid SQLAlchemy session
path (str, unicode): Path to project folder
Returns:
stacked_ensemble (xcessiv.models.StackedEnsemble)
"""
try:
meta_features_list = []
for base_learner in stacked_ensemble.base_learners:
mf = np.load(base_learner.meta_features_path(path))
if len(mf.shape) == 1:
mf = mf.reshape(-1, 1)
meta_features_list.append(mf)

secondary_features = np.concatenate(meta_features_list, axis=1)

# Get data
extraction = session.query(models.Extraction).first()
return_splits_iterable = functions.import_object_from_string_code(
extraction.meta_feature_generation['source'],
'return_splits_iterable'
)
X, y = extraction.return_train_dataset()

# We need to retrieve original order of meta-features
indices_list = [test_index for train_index, test_index in return_splits_iterable(X, y)]
indices = np.concatenate(indices_list)
X, y = X[indices], y[indices]

est = stacked_ensemble.return_secondary_learner()

return_splits_iterable_stacked_ensemble = functions.import_object_from_string_code(
extraction.stacked_ensemble_cv['source'],
'return_splits_iterable'
)
preds = []
trues_list = []
for train_index, test_index in return_splits_iterable_stacked_ensemble(secondary_features, y):
X_train, X_test = secondary_features[train_index], secondary_features[test_index]
y_train, y_test = y[train_index], y[test_index]
est = est.fit(X_train, y_train)
preds.append(
getattr(est, stacked_ensemble.base_learner_origin.
meta_feature_generator)(X_test)
)
trues_list.append(y_test)
preds = np.concatenate(preds, axis=0)
y_true = np.concatenate(trues_list)

for key in stacked_ensemble.base_learner_origin.metric_generators:
metric_generator = functions.import_object_from_string_code(
stacked_ensemble.base_learner_origin.metric_generators[key],
'metric_generator'
)
stacked_ensemble.individual_score[key] = metric_generator(y_true, preds)

stacked_ensemble.job_status = 'finished'
session.add(stacked_ensemble)
session.commit()
return stacked_ensemble

except:
session.rollback()
stacked_ensemble.job_status = 'errored'
stacked_ensemble.description['error_type'] = repr(sys.exc_info()[0])
stacked_ensemble.description['error_value'] = repr(sys.exc_info()[1])
stacked_ensemble.description['error_traceback'] = \
traceback.format_exception(*sys.exc_info())
session.add(stacked_ensemble)
session.commit()
raise


def start_greedy_ensemble_search(automated_run, session, path):
"""Starts an automated ensemble search using greedy forward model selection.
    The steps for this search are adapted from "Ensemble Selection from Libraries of Models" by
    Caruana et al.
    1. Start with the empty ensemble
    2. Add to the ensemble the model in the library that maximizes the ensemble's
    performance on the error metric.
    3. Repeat step 2 for a fixed number of iterations or until all models have been used.
Args:
automated_run (xcessiv.models.AutomatedRun): Automated run object
session: Valid SQLAlchemy session
path (str, unicode): Path to project folder
"""
module = functions.import_string_code_as_module(automated_run.source)
assert module.metric_to_optimize in automated_run.base_learner_origin.metric_generators

best_ensemble = [] # List containing IDs of best performing ensemble for the last round

secondary_learner = automated_run.base_learner_origin.return_estimator()
secondary_learner.set_params(**module.secondary_learner_hyperparameters)

for i in range(module.max_num_base_learners):
best_score = -float('inf') # Best metric for this round (not in total!)
current_ensemble = best_ensemble[:] # Shallow copy of best ensemble
for base_learner in session.query(models.BaseLearner).filter_by(job_status='finished').all():
            if base_learner in current_ensemble:  # Skip learners already in the ensemble
continue
current_ensemble.append(base_learner)

            # Check if this candidate ensemble has already been created and evaluated
existing_ensemble = session.query(models.StackedEnsemble).\
filter_by(base_learner_origin_id=automated_run.base_learner_origin.id,
secondary_learner_hyperparameters=secondary_learner.get_params(),
base_learner_ids=sorted([bl.id for bl in current_ensemble])).first()

if existing_ensemble and existing_ensemble.job_status == 'finished':
score = existing_ensemble.individual_score[module.metric_to_optimize]

elif existing_ensemble and existing_ensemble.job_status != 'finished':
eval_stacked_ensemble(existing_ensemble, session, path)
score = existing_ensemble.individual_score[module.metric_to_optimize]

else:
stacked_ensemble = models.StackedEnsemble(
secondary_learner_hyperparameters=secondary_learner.get_params(),
base_learners=current_ensemble,
base_learner_origin=automated_run.base_learner_origin,
job_status='started'
)
session.add(stacked_ensemble)
session.commit()
eval_stacked_ensemble(stacked_ensemble, session, path)
score = stacked_ensemble.individual_score[module.metric_to_optimize]

score = -score if module.invert_metric else score

if best_score < score:
best_score = score
best_ensemble = current_ensemble[:]

current_ensemble.pop()
19 changes: 17 additions & 2 deletions xcessiv/models.py
@@ -3,7 +3,7 @@
import random
import string
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Text, Integer, Boolean, TypeDecorator, ForeignKey, Table
from sqlalchemy import Column, Text, Integer, Boolean, TypeDecorator, ForeignKey, Table, UniqueConstraint
from sqlalchemy.orm import relationship
from sqlalchemy.ext import mutable
import numpy as np
@@ -30,6 +30,18 @@ def process_result_value(self, value, dialect):
return json.loads(value)


class JsonEncodedList(TypeDecorator):
"""Enables JSON storage of list by encoding and decoding on the fly.
No need for mutability tracking"""
impl = Text

def process_bind_param(self, value, dialect):
return json.dumps(value, sort_keys=True)

def process_result_value(self, value, dialect):
return json.loads(value)
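As a quick illustration of the round trip this decorator performs, the two hooks above reduce to plain ``json`` calls (sketched here in isolation, outside SQLAlchemy; ``bind`` and ``result`` mirror ``process_bind_param`` and ``process_result_value``):

```python
import json

# What JsonEncodedList does on write (bind) and read (result), in isolation.
def bind(value):
    return json.dumps(value, sort_keys=True)  # Python list -> JSON text for the Text column

def result(value):
    return json.loads(value)  # JSON text -> Python list again

ids = [2, 7, 5]
assert result(bind(ids)) == ids
# sort_keys only affects dict keys; a list's element order is preserved as stored.
assert bind([3, 1, 2]) == '[3, 1, 2]'
```

In ``StackedEnsemble`` below, ``base_learner_ids`` is stored pre-sorted (see its ``__init__``), which is what lets the greedy search look up an existing ensemble by a canonical list of IDs.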


mutable.MutableDict.associate_with(JsonEncodedDict)


@@ -257,7 +269,8 @@ def serialize(self):
association_table = Table(
'association', Base.metadata,
Column('baselearner_id', Integer, ForeignKey('baselearner.id')),
Column('stackedensemble_id', Integer, ForeignKey('stackedensemble.id'))
Column('stackedensemble_id', Integer, ForeignKey('stackedensemble.id')),
UniqueConstraint('baselearner_id', 'stackedensemble_id', name='UC_baselearner_id_stackedensemble_id')
)


@@ -366,6 +379,7 @@ class StackedEnsemble(Base):
secondary=association_table,
back_populates='stacked_ensembles'
)
base_learner_ids = Column(JsonEncodedList)
base_learner_origin_id = Column(Integer, ForeignKey('baselearnerorigin.id'))
base_learner_origin = relationship('BaseLearnerOrigin', back_populates='stacked_ensembles')
secondary_learner_hyperparameters = Column(JsonEncodedDict)
@@ -383,6 +397,7 @@ def __init__(self, secondary_learner_hyperparameters, base_learners,
self.job_status = job_status
self.job_id = None
self.description = dict()
self.base_learner_ids = sorted([bl.id for bl in base_learners])

def return_secondary_learner(self):
"""Returns secondary learner using its origin and the given hyperparameters
7 changes: 7 additions & 0 deletions xcessiv/rqtasks.py
@@ -199,9 +199,16 @@ def start_automated_run(path, automated_run_id):
elif automated_run.category == 'tpot':
automatedruns.start_tpot(automated_run, session, path)

elif automated_run.category == 'greedy_ensemble_search':
automatedruns.start_greedy_ensemble_search(automated_run, session, path)

else:
raise Exception('Something went wrong. Invalid category for automated run')

automated_run.job_status = 'finished'
session.add(automated_run)
session.commit()

except:
session.rollback()
automated_run.job_status = 'errored'
3 changes: 3 additions & 0 deletions xcessiv/ui/src/AutomatedRuns/AutomatedRunsDisplay.js
@@ -9,6 +9,7 @@ import FaTrash from 'react-icons/lib/fa/trash';
import FaSpinner from 'react-icons/lib/fa/spinner';
import FaExclamationCircle from 'react-icons/lib/fa/exclamation-circle'
import FaInfo from 'react-icons/lib/fa/info';
import FaCogs from 'react-icons/lib/fa/cogs';
import { DetailsModal, DeleteModal } from './Modals'

class AutomatedRunsDisplay extends Component {
@@ -26,6 +27,8 @@
<table><tbody><tr>
<td>
<Button onClick={() => this.setState({open: !this.state.open})}>
<FaCogs />
{' '}
{(this.state.open ? 'Hide' : 'Show') + ' Automated Runs'}
</Button>
</td>
4 changes: 3 additions & 1 deletion xcessiv/ui/src/BaseLearnerOrigin/BaseLearnerOrigin.js
@@ -12,6 +12,7 @@ import $ from 'jquery';
import FaCheck from 'react-icons/lib/fa/check';
import FaSpinner from 'react-icons/lib/fa/spinner';
import FaExclamationCircle from 'react-icons/lib/fa/exclamation-circle';
import FaCogs from 'react-icons/lib/fa/cogs';
import { Button, ButtonToolbar, Glyphicon, Alert, Panel as BsPanel,
Form, FormGroup, ControlLabel, FormControl, DropdownButton,
MenuItem } from 'react-bootstrap';
@@ -417,7 +418,8 @@
<Button
disabled={!this.props.data.final}
onClick={() => this.setState({showBayesianRunModal: true})}>
Bayesian Optimization
<FaCogs />
{' Bayesian Optimization'}
</Button>

</ButtonToolbar>
3 changes: 2 additions & 1 deletion xcessiv/ui/src/BaseLearnerOrigin/ListBaseLearnerOrigin.js
@@ -3,6 +3,7 @@ import './BaseLearnerOrigin.css';
import BaseLearnerOrigin from './BaseLearnerOrigin';
import { Button, Glyphicon, ButtonGroup } from 'react-bootstrap';
import { TpotModal } from './BaseLearnerOriginModals'
import FaCogs from 'react-icons/lib/fa/cogs';

class ListBaseLearnerOrigin extends Component {

@@ -47,7 +48,7 @@
{' Add new base learner origin'}
</Button>
<Button href="#" onClick={() => this.setState({showTpotModal: true})}>
<Glyphicon glyph="plus" />
<FaCogs />
{' Automated base learner generation with TPOT'}
</Button>
</ButtonGroup>
17 changes: 16 additions & 1 deletion xcessiv/ui/src/Ensemble/EnsembleBuilder.js
@@ -9,6 +9,8 @@ import CodeMirror from 'react-codemirror';
import 'codemirror/lib/codemirror.css';
import 'codemirror/mode/python/python';
import { Button, Glyphicon } from 'react-bootstrap';
import FaCogs from 'react-icons/lib/fa/cogs';
import { GreedyRunModal } from './EnsembleMoreDetailsModal'


const defaultSourceParams = [
@@ -24,7 +26,8 @@
super(props);
this.state = {
selectedValue: null,
source: defaultSourceParams
source: defaultSourceParams,
showGreedyModal: false
};
}

@@ -67,13 +70,25 @@
/>
<Button
block
href="#"
disabled={buttonDisabled}
bsStyle='primary'
onClick={() => this.props.createStackedEnsemble(
this.state.selectedValue.value, this.state.source)}>
<Glyphicon glyph="plus" />
{' Create new ensemble'}
</Button>
<Button
block
onClick={() => this.setState({showGreedyModal: true})}
>
<FaCogs />
{' Automated ensemble search'}
</Button>
<GreedyRunModal isOpen={this.state.showGreedyModal}
onRequestClose={() => this.setState({showGreedyModal: false})}
handleYes={(id, source) => this.props.startGreedyRun(id, source)}
optionsBaseLearnerOrigins={this.props.optionsBaseLearnerOrigins}/>
</div>
)
}
