
Contributing a new estimator


Basic concept

REP estimators (classifiers and regressors) are (almost) scikit-learn compatible, which means they can be used (under the restrictions listed below) like ordinary scikit-learn estimators.

The scikit-learn API description and the scikit-learn guide "Rolling your own estimator" should be read before proceeding.

Expected behavior

  1. estimator.__init__(features=None, <other parameters>)
    Initialization works in general the same way as in sklearn, but there is one mandatory argument, features, which takes a list of strings: the names of the features to be used in training. Its default value is None, in which case all of the training features are used.

  2. All the arguments in __init__ should have default values (no mutable types here: no lists or user-defined types, otherwise the same parameter value will be shared between all instances of the classifier).

  3. An estimator with default parameters should work and should provide reasonable quality.

  4. Parameter validation should be done inside the fit method; the constructor just saves everything to attributes.

  5. estimator.fit(X, y, <optional arguments>) or estimator.fit(X, y, sample_weight=None, <optional arguments>).

  6. If the classifier supports per-sample weights, they are passed as the sample_weight argument.

  7. fit should return the estimator itself.

  8. NB: X can be a pandas.DataFrame; self.features contains the names of the columns used from that DataFrame. If a numpy.array is passed, it should be treated as a pandas.DataFrame with default column names: Feature_0, Feature_1, etc. This provides compatibility with sklearn (see the sketch after this list).

  9. If self.features was None, the names of all columns in the DataFrame are saved to self.features.

  10. NB: For classification, y is array-like with integers from [0, 1, ..., n_classes - 1], whereas sklearn supports arbitrary labels. This restriction simplifies the usage of predict_proba and the preparation of reports.

  11. Fitting again erases the results of previous training (unless the user explicitly requests otherwise).

  12. Pickle: a classifier is assumed to be picklable at any moment (before or after training); this is the default mechanism to save/load a classifier or transfer it from one process/host to another.

  13. estimator.predict(X), estimator.staged_predict(X), estimator.predict_proba(X) and estimator.staged_predict_proba(X) work the same way as in sklearn (and should return numpy.arrays), but X may have a different number or order of columns (though all variables listed in self.features must be present).
    Arguments passed to the classifier (X, y, sample_weight) in any of these methods must never be modified by the classifier. Create copies if needed.

  14. estimator.get_params / estimator.set_params follow sklearn's interface completely; make sure that after cloning there are no shared objects between the original and the clone (i.e. no mutable objects are transferred directly).

  • set_params should be able to set any parameter named in the constructor.
  • If there is some composite parameter and it is reasonable to change its parts independently, this should be possible. For instance, if there is a layers argument for neural networks, we should be able to use
    network.set_params(layers=[5, 7, 2])
    network.set_params(layers__0=3)
    since it may be useful to modify the layers independently.

  15. self.classes_ for classifiers has the same meaning as for sklearn classifiers (so it will always be equal to numpy.arange(n_classes)).

  16. self.feature_importances_ (if implemented) is expected to return a numpy.array with the importances of the used features (so its length and the order of its components correspond to self.features).

  17. self.get_feature_importances() (if implemented) should return a pandas.DataFrame with index=self.features. Usually the DataFrame has only one column named effect, but if the classifier supports several ways of computing importances, the DataFrame may contain several columns.
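
To illustrate points 8 and 9 above, here is a minimal sketch of how incoming data might be converted to a DataFrame with default column names and restricted to self.features. This is only an illustration under the assumptions stated in the list; the helper names to_dataframe and select_features are hypothetical and are not part of REP.

import numpy
import pandas

def to_dataframe(X):
    # hypothetical helper (illustration only): wrap a numpy.array into a
    # pandas.DataFrame with default column names Feature_0, Feature_1, ...
    if isinstance(X, pandas.DataFrame):
        return X
    X = numpy.asarray(X)
    columns = ['Feature_{}'.format(index) for index in range(X.shape[1])]
    return pandas.DataFrame(X, columns=columns)

def select_features(X, features):
    # hypothetical helper (illustration only): keep only the columns named in
    # `features`; if features is None, all columns are used and their names returned
    X = to_dataframe(X)
    if features is None:
        features = list(X.columns)
    return X[list(features)], features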

A minimalistic example of a classifier

Let's implement the simplest possible classifier, which predicts the same probabilities for all events (equal to the class proportions observed in the training dataset).

from rep.estimators.interface import Classifier
from rep.estimators.utils import check_inputs
import numpy

# we derive from `rep.estimators.Classifier`;
# Classifier is itself derived from sklearn's BaseEstimator and ClassifierMixin,
# so we meet the expectations of sklearn.
class BasicClassifier(Classifier):
    """
    This dummy classifier returns the same probabilities for all events

    Parameters:
    -----------
    :param features: features used in training
    :type features: list[str] or None
    :param regularization: regularization term added to the number of observed events in each class.
    :type regularization: float
    """

    def __init__(self, regularization=5., features=None):
        # __init__ simply saves everything to attributes (attributes must have the same names as the parameters, that's important!)
        self.regularization = regularization
        Classifier.__init__(self, features=features)

    def fit(self, X, y, sample_weight=None):
        # perform parameter validation
        assert isinstance(self.regularization, float), 'Regularization in BasicClassifier should be float!'
        # check the inputs and sanitize them
        X, y, sample_weight = check_inputs(X, y, sample_weight=sample_weight, allow_none_weights=True)
        # taking only those features named in self.features
        # this function sets self.features if it was None
        X = self._get_train_features(X)
        # set self.classes_ and control that classes are enumerated as [0, 1, ...]
        self._set_classes(y)
        self._probabilities = numpy.bincount(y, weights=sample_weight) + self.regularization
        self._probabilities /= numpy.sum(self._probabilities)
        # features are not used, so: 
        self.feature_importances_ = numpy.zeros(len(self.features))
        # don't forget to return self
        return self

    def predict_proba(self, X):
        # if this were a real classifier, selecting the training features would be the first step
        X = self._get_train_features(X)
        # we return the same probabilities for all events
        result = numpy.zeros([len(X), len(self._probabilities)])
        result += self._probabilities[numpy.newaxis, :]
        return result
    
    
    # get_state, set_state are not implemented here because they are already present in Classifier.
    # predict is implemented in the Classifier base class and uses predict_proba, so there is no need to override it.
    # get_feature_importances is implemented in Classifier and uses feature_importances_, so it works out of the box.
    # Since all fields are simple, this classifier is picklable.

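Assuming REP is installed and BasicClassifier is defined as above, a quick sanity check might look like the sketch below (the dataset and column names here are arbitrary and chosen only for illustration):

import pickle
import numpy
import pandas
from sklearn.base import clone

# toy dataset: two features, three events
data = pandas.DataFrame({'mass': [1.0, 2.0, 3.0], 'pt': [0.5, 0.1, 0.9]})
labels = numpy.array([0, 0, 1])

clf = BasicClassifier(regularization=1.)
clf.fit(data, labels)
print(clf.features)             # ['mass', 'pt'], filled in during fit (point 9)
print(clf.predict_proba(data))  # identical rows; with regularization=1. this is [0.6, 0.4]

# the classifier should survive pickling and cloning
clf_restored = pickle.loads(pickle.dumps(clf))
clf_untrained = clone(clf)      # fresh copy with the same parameters, not yet fitted
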
Special notes on implementing neural networks

We are implementing multilayer networks, so the following additional expectations apply:

  1. Each network should have a layers parameter, which corresponds only to the hidden layers; the number of units in the input and output layers should be detected automatically.
  2. The user should be able to set the activation functions of all layers (for classification the output activation should be softmax, for regression the identity).
  3. NB: the estimator should have a scaler parameter, which can be any sklearn transformer. By default this should be StandardScaler, but the user should be able to state explicitly that no scaler is needed. Three scenarios are proposed: network(), network(scaler=MinMaxScaler()), network(scaler=False); in the last case no scaling is used (see the sketch after this list). [TODO: think this over]
    Parameters of the scaler should be accessible, like this: network.set_params(scaler__with_mean=False); this is implemented by the default set_params.
  4. All the standard tests should pass.
  5. Stacking with BaggingClassifier, AdaBoostClassifier (if weights are supported) and Pipeline should work.
  6. Do as expected or fail. If some argument or input cannot be processed (i.e. it is not supported), or there is no such argument, throw an exception during fit. It is better for the user to know that something does not work than for it to work differently from what the user expects.
  7. Cloning should work; note that the new object should not contain any references shared with the original object (lists, dicts, and so on). Use deep copying in get_params if needed.
  8. Tests should contain examples with different numbers of hidden layers (0, 1, 2), different activation functions, and different trainers (if supported).
  9. All the parameters should be explained; for complex cases (such as trainer parameters, which differ greatly between trainers) a link to the relevant page of the library's documentation should be given.
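
As an illustration of point 3, a minimal sketch of handling the scaler parameter inside fit could look as follows. This is an assumption about how it might be done, not a prescribed implementation: the helper name _prepare_scaler is hypothetical, and encoding the default as True (rather than a StandardScaler() instance) is a design choice that avoids the mutable-default problem from point 2 of the expected behavior list.

from copy import deepcopy
from sklearn.preprocessing import StandardScaler

def _prepare_scaler(scaler):
    # hypothetical helper (illustration only): resolve the `scaler` parameter
    #   scaler=True (default)    -> StandardScaler()
    #   scaler=False or None     -> no scaling
    #   any sklearn transformer  -> a private copy, so the parameter object
    #                               passed by the user is never modified
    if scaler is False or scaler is None:
        return None
    if scaler is True:
        return StandardScaler()
    return deepcopy(scaler)

# inside fit, one could then write:
#     self._scaler = _prepare_scaler(self.scaler)
#     if self._scaler is not None:
#         X = self._scaler.fit_transform(X)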