Can a vector of weights be specified in param_grid within GASearchCV (somehow)? #91

Closed
sgbaird opened this issue Feb 26, 2022 · 3 comments
Labels: question (Further information is requested)

Comments

sgbaird commented Feb 26, 2022

The idea is to take in predictions from an arbitrary number of models and find the optimal weights that maximize the accuracy of the ensemble.

Here's the estimator that I wrote:

from typing import List, Optional
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_X_y, check_array
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_is_fitted
from sklearn.metrics import mean_absolute_error


class WeightedAverageEnsemble(BaseEstimator, RegressorMixin):
    """
    
    >>> wae = WeightedAverageEnsemble()
    >>> X = np.random.rand(20, 5)
    >>> y = np.random.rand(20, 1)
    >>> wae.fit(X, y)
    >>> wae.predict(X)
    
    >>> wae = WeightedAverageEnsemble(weights=[0.25, 0.75])
    >>> X = np.random.rand(20, 2)
    >>> y = np.random.rand(20, 1)
    >>> wae.fit(X, y)
    >>> wae.predict(X)

    Parameters
    ----------
    BaseEstimator : _type_
        _description_
    RegressorMixin : _type_
        _description_
    """

    def __init__(self, weights: Optional[List[float]] = None):
        if weights is not None:
            assert np.isclose(sum(weights), 1.0)
        self.weights = weights

    def fit(self, X, y):
        # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
        X, y = check_X_y(X, y, accept_sparse=False)
        self.is_fitted_ = True
        self.n_features_in_ = X.shape[1]
        if self.weights is None:
            self._mod_weights = np.ones(self.n_features_in_) / self.n_features_in_
            # equivalent to:
            # w = np.ones(self.n_features_in_).reshape(1, -1)
            # w = sklearn.preprocessing.normalize(w, norm="l1", axis=1)
        else:
            self._mod_weights = self.weights
        return self

    def predict(self, X):
        # TODO: deal with sparse inputs (i.e. mask `W` and convert to sparse)
        X = check_array(X, accept_sparse=False)
        check_is_fitted(self, "is_fitted_")
        W = np.tile(self._mod_weights, (X.shape[0], 1))
        y = np.einsum("ij, ij->i", W, X)
        # equivalent to: y = np.sum(W * X, axis=1)
        # loop with np.dot might also be fast due to BLAS compatibility
        # https://stackoverflow.com/a/26168677/13697228
        # https://stackoverflow.com/a/39657770/13697228
        return y

    def score(self, X, y, **kwargs):
        # NOTE: scikit-learn (and GASearchCV) treat higher scores as better, so
        # returning the raw MAE inverts the objective; returning
        # -mean_absolute_error(y, y_pred, **kwargs), or passing
        # scoring="neg_mean_absolute_error" to the search, matches the convention.
        y_pred = self.predict(X)
        return mean_absolute_error(y, y_pred, **kwargs)


check_estimator(WeightedAverageEnsemble())

Related: https://machinelearningmastery.com/weighted-average-ensemble-with-python

How would you suggest optimizing weights since it's a vector that can change in size based on the size of the input data?

rodrigo-arenas commented Feb 26, 2022

Hi @sgbaird, currently the package only accepts hyperparameters that are integers, floats, or categorical (see the example just below), so an array-valued parameter like your weights vector is not natively supported.
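For reference, a param_grid built from the natively supported space types looks something like this (the hyperparameter names are just illustrative, not from your estimator):

from sklearn_genetic.space import Continuous, Categorical, Integer

param_grid = {"alpha": Continuous(1e-3, 1.0, distribution="log-uniform"),
              "max_depth": Integer(2, 10),
              "criterion": Categorical(["squared_error", "absolute_error"])}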
One workaround I can think of: instead of taking the parameter weights as a vector, use the **kwargs syntax in your __init__ method to accept an arbitrary number of extra parameters, each of which represents one weight; your example would change to:

wae = WeightedAverageEnsemble(w1=0.25, w2=0.75)

This way, you can define the param grid as:

param_grid =  {'w1': Continuous(0.01, 0.99, distribution='log-uniform'),
               'w2': Continuous(0.01, 0.99, distribution='log-uniform')}

The main issue with this is that you can't guarantee that all the weights will add up to one, so a normalization step might be required. That makes the optimization problem harder: even if you set w1 to a fixed number, the value actually used after normalization is w1/(w1+w2), which is a function of w2 (or a multivariate function if you have more weights), and vice versa for w2. So even though it can be optimized, it will probably take longer to converge, since the normalization is a little misleading to the algorithm. A rough end-to-end sketch of this approach follows.
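Here is a minimal, non-authoritative sketch of how this could fit together (the class and variable names are illustrative, not from your code). It names the weights explicitly as w1/w2 so that scikit-learn's get_params/clone, which introspect the __init__ signature, can see them (with a bare **kwargs you would also need to override get_params/set_params), and it normalizes inside fit so the effective weights always sum to one:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_X_y, check_array
from sklearn.utils.validation import check_is_fitted
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous


class TwoWeightEnsemble(BaseEstimator, RegressorMixin):
    # Illustrative adaptation: one explicitly named weight per base-model column.
    def __init__(self, w1=0.5, w2=0.5):
        self.w1 = w1
        self.w2 = w2

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        w = np.array([self.w1, self.w2])
        self._mod_weights = w / w.sum()  # normalization step discussed above
        self.is_fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self, "is_fitted_")
        X = check_array(X)
        return X @ self._mod_weights  # row-wise weighted average


param_grid = {"w1": Continuous(0.01, 0.99, distribution="log-uniform"),
              "w2": Continuous(0.01, 0.99, distribution="log-uniform")}

evolved = GASearchCV(estimator=TwoWeightEnsemble(),
                     param_grid=param_grid,
                     scoring="neg_mean_absolute_error",
                     cv=3)
# evolved.fit(X, y)  # X holds one column of predictions per base model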

I hope it makes sense.

rodrigo-arenas added the question label Feb 26, 2022
@rodrigo-arenas
Owner

I'm closing this issue, but feel free to raise more questions if needed

sgbaird commented Mar 7, 2022

@rodrigo-arenas thank you! Good point about the normalization.

rodrigo-arenas mentioned this issue Apr 22, 2022