Can a vector of weights be specified in param_grid within GASearchCV (somehow)? #91

Closed
sgbaird opened this issue Feb 26, 2022 · 3 comments
Labels: question (Further information is requested)

Comments

sgbaird commented Feb 26, 2022

The idea is to take in predictions from an arbitrary number of models and find the optimal weights that maximize the accuracy of the ensemble.

Here's the estimator that I wrote:

from typing import List, Optional
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_X_y, check_array
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_is_fitted
from sklearn.metrics import mean_absolute_error


class WeightedAverageEnsemble(BaseEstimator, RegressorMixin):
    """
    
    >>> wae = WeightedAverageEnsemble()
    >>> X = np.random.rand(20, 5)
    >>> y = np.random.rand(20, 1)
    >>> wae.fit(X, y)
    >>> wae.predict(X)
    
    >>> wae = WeightedAverageEnsemble(weights=[0.25, 0.75])
    >>> X = np.random.rand(20, 2)
    >>> y = np.random.rand(20, 1)
    >>> wae.fit(X, y)
    >>> wae.predict(X)

    Parameters
    ----------
    BaseEstimator : _type_
        _description_
    RegressorMixin : _type_
        _description_
    """

    def __init__(self, weights: Optional[List[float]] = None):
        if weights is not None:
            assert np.isclose(sum(weights), 1.0)
        self.weights = weights

    def fit(self, X, y):
        # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
        X, y = check_X_y(X, y, accept_sparse=False)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        if self.weights is None:
            self._mod_weights = np.ones(self.n_features_in_) / self.n_features_in_
            # equivalent to:
            # w = np.ones(self.n_features_in_).reshape(1, -1)
            # w = sklearn.preprocessing.normalize(w, norm="l1", axis=1)
        else:
            self._mod_weights = self.weights
        return self

    def predict(self, X):
        # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
        X = check_array(X, accept_sparse=False)
        check_is_fitted(self, "is_fitted_")
        W = np.tile(self._mod_weights, (X.shape[0], 1))
        y = np.einsum("ij, ij->i", W, X)
        # equivalent to: y = np.sum(W * X, axis=1)
        # loop with np.dot might also be fast due to BLAS compatibility
        # https://stackoverflow.com/a/26168677/13697228
        # https://stackoverflow.com/a/39657770/13697228
        return y

    def score(self, X, y, **kwargs):
        # NOTE: scikit-learn (and GASearchCV) treat higher scores as better, so
        # returning the raw MAE inverts the objective; returning
        # -mean_absolute_error(y, y_pred, **kwargs), or passing
        # scoring="neg_mean_absolute_error" to the search, matches the convention.
        y_pred = self.predict(X)
        return mean_absolute_error(y, y_pred, **kwargs)


check_estimator(WeightedAverageEnsemble())

Related: https://machinelearningmastery.com/weighted-average-ensemble-with-python

How would you suggest optimizing weights since it's a vector that can change in size based on the size of the input data?

rodrigo-arenas commented Feb 26, 2022

Hi @sgbaird, currently the package only accepts hyperparameters that are integers, floats, or categorical (see the example just below), so an array-valued parameter like your weights vector is not natively supported.
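For reference, a param_grid built from the natively supported space types looks something like this (the hyperparameter names are just illustrative, not from your estimator):

from sklearn_genetic.space import Continuous, Categorical, Integer

param_grid = {"alpha": Continuous(1e-3, 1.0, distribution="log-uniform"),
              "max_depth": Integer(2, 10),
              "criterion": Categorical(["squared_error", "absolute_error"])}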
One workaround I can think of: instead of taking the parameter weights as a vector, use the **kwargs syntax in your __init__ method to accept an arbitrary number of extra parameters, each of which represents one weight; your example would change to:

wae = WeightedAverageEnsemble(w1=0.25, w2=0.75)

This way, you can define the param grid as:

param_grid =  {'w1': Continuous(0.01, 0.99, distribution='log-uniform'),
               'w2': Continuous(0.01, 0.99, distribution='log-uniform')}

The main issue with this is that you can't guarantee that all the weights will add up to one, so a normalization step might be required. That makes the optimization problem harder: even if you set w1 to a fixed number, the value actually used after normalization is w1/(w1+w2), which is a function of w2 (or a multivariate function if you have more weights), and vice versa for w2. So even though it can be optimized, it will probably take longer to converge, since the normalization is a little misleading to the algorithm. A rough end-to-end sketch of this approach follows.
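Here is a minimal, non-authoritative sketch of how this could fit together (the class and variable names are illustrative, not from your code). It names the weights explicitly as w1/w2 so that scikit-learn's get_params/clone, which introspect the __init__ signature, can see them (with a bare **kwargs you would also need to override get_params/set_params), and it normalizes inside fit so the effective weights always sum to one:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_X_y, check_array
from sklearn.utils.validation import check_is_fitted
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous


class TwoWeightEnsemble(BaseEstimator, RegressorMixin):
    # Illustrative adaptation: one explicitly named weight per base-model column.
    def __init__(self, w1=0.5, w2=0.5):
        self.w1 = w1
        self.w2 = w2

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        w = np.array([self.w1, self.w2])
        self._mod_weights = w / w.sum()  # normalization step discussed above
        self.is_fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self, "is_fitted_")
        X = check_array(X)
        return X @ self._mod_weights  # row-wise weighted average


param_grid = {"w1": Continuous(0.01, 0.99, distribution="log-uniform"),
              "w2": Continuous(0.01, 0.99, distribution="log-uniform")}

evolved = GASearchCV(estimator=TwoWeightEnsemble(),
                     param_grid=param_grid,
                     scoring="neg_mean_absolute_error",
                     cv=3)
# evolved.fit(X, y)  # X holds one column of predictions per base model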

I hope it makes sense.

rodrigo-arenas added the question label Feb 26, 2022
@rodrigo-arenas
Owner

I'm closing this issue, but feel free to raise more questions if needed

sgbaird commented Mar 7, 2022

@rodrigo-arenas thank you! Good point about the normalization.

rodrigo-arenas mentioned this issue Apr 22, 2022