
Add Penalty factors for each coefficient in enet (similar to the R's glmnet library) #11566

Open
doaa-altarawy opened this issue Jul 16, 2018 · 17 comments
Labels
API help wanted Moderate Anything that requires some knowledge of conventions and best practices module:linear_model Needs Decision Requires decision

Comments

@doaa-altarawy

doaa-altarawy commented Jul 16, 2018

Description

This is a feature request to allow more flexibility in elastic net: let the user apply a separate penalty factor to each coefficient of the L1 term. The default value of this penalty factor is 1, which reproduces the regular elastic net behavior. A penalty factor of zero means the feature is not penalized at all, i.e. the user wants this feature to always remain in the model.

This feature is very useful in bioinformatics and systems biology (that's why it is in Stanford's R package glmnet). With it, the user can run feature selection on a set of genes while making sure some genes stay in the model and are not penalized (because of prior knowledge that they are involved in the system).

Here is the glmnet documentation explaining the penalty factor; mainly, it controls the selection weight on the lasso term:

https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin
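For concreteness, the glmnet-style objective can be sketched in NumPy. In glmnet the penalty factor multiplies the whole per-coefficient penalty, so a factor of 0 removes both the L1 and the L2 term for that coefficient (the function and variable names below are illustrative, not an existing API):

```python
import numpy as np

def enet_objective(coef, X, y, alpha, l1_ratio, penalty_factor):
    """Elastic-net objective with per-coefficient penalty factors
    (glmnet-style: a factor of 1 gives the default behavior, a factor
    of 0 leaves that coefficient completely unpenalized)."""
    n = X.shape[0]
    resid = y - X @ coef
    mse = 0.5 * (resid @ resid) / n
    l1 = alpha * l1_ratio * np.sum(penalty_factor * np.abs(coef))
    l2 = 0.5 * alpha * (1 - l1_ratio) * np.sum(penalty_factor * coef ** 2)
    return mse + l1 + l2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
coef_true = np.array([1.0, 0.0, -2.0])
y = X @ coef_true  # noise-free, so the data term vanishes at coef_true
pf = np.array([1.0, 1.0, 0.0])  # third coefficient unpenalized
obj = enet_objective(coef_true, X, y, alpha=0.1, l1_ratio=0.5, penalty_factor=pf)
# only the first coefficient contributes: 0.05 * |1.0| + 0.025 * 1.0**2
```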

This feature is used in several papers; I've implemented it in scikit-learn for one of my own. I've had requests from other biologists who want to use this method in Python without having to recompile scikit-learn. I can open a pull request with this feature.

Papers using this feature:

  • Altarawy, Doaa, Fatma-Elzahraa Eid, and Lenwood S. Heath. "PEAK: Integrating Curated and Noisy Prior Knowledge in Gene Regulatory Network Inference." Journal of Computational Biology 24.9 (2017): 863-873.
  • Greenfield, Alex, Christoph Hafemeister, and Richard Bonneau. "Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks." Bioinformatics 29.8 (2013): 1060-1067.
  • Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. "Regularization paths for generalized linear models via coordinate descent." Journal of statistical software 33.1 (2010): 1.
@doaa-altarawy doaa-altarawy changed the title Add Penalty factors for each coefficient (similar to the R's glmnet library) Add Penalty factors for each coefficient in enet (similar to the R's glmnet library) Jul 16, 2018
doaa-altarawy added a commit to doaa-altarawy/scikit-learn that referenced this issue Jul 16, 2018
@amueller
Member

Sounds like a reasonable addition, please open the PR. I can't guarantee that it'll get accepted, but it makes sense to me.

doaa-altarawy added a commit to doaa-altarawy/scikit-learn that referenced this issue Jul 19, 2018
doaa-altarawy added a commit to doaa-altarawy/scikit-learn that referenced this issue Jul 24, 2018
doaa-altarawy added a commit to doaa-altarawy/scikit-learn that referenced this issue Jul 24, 2018
doaa-altarawy added a commit to doaa-altarawy/scikit-learn that referenced this issue Jul 24, 2018
doaa-altarawy added a commit to doaa-altarawy/scikit-learn that referenced this issue Jul 24, 2018
@lorentzenchr
Member

This feature is also implemented in #9405, where you can exclude coefficients from the L1 as well as from the L2 penalty term.

@hermidalc
Contributor

If possible, could penalty factors also be added to LogisticRegression (which is also part of glmnet)? They would be just as useful as for ElasticNet, for classification problems where you need to include unpenalized covariates.

@cmarmo cmarmo added module:linear_model help wanted Needs Benchmarks A tag for the issues and PRs which require some benchmarks labels Jan 17, 2022
@lorentzenchr lorentzenchr added Moderate Anything that requires some knowledge of conventions and best practices Needs Decision - API and removed Needs Benchmarks A tag for the issues and PRs which require some benchmarks labels Feb 1, 2022
@lorentzenchr
Member

How do we want the API?

  1. Allow specifying an array-like for the penalization strength? Same for l1_ratio?
  2. Could we already permit working with feature names? Then a dict could work: {"feature_1": 0.5, "feature_2": 2.5}

Among the estimators are:

  • LogisticRegression(C=, l1_ratio=)
  • Lasso(alpha=)
  • ElasticNet(alpha=, l1_ratio=)
  • Ridge(alpha=)

What do we do with the CV variants? I would leave them untouched for the moment.
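Until such a parameter exists, a per-feature L1 penalty with strictly positive factors can be emulated with stock Lasso by rescaling columns: substituting c_j = v_j * b_j turns alpha * sum(v_j * |b_j|) into a plain L1 penalty on c. A sketch (the helper name is mine, and the trick is exact only for the pure L1 case, not for elastic net's L2 term):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_with_penalty_factors(X, y, alpha, penalty_factor):
    """Emulate per-feature L1 penalty factors with stock Lasso by
    rescaling columns. Requires strictly positive factors (a factor
    of 0, i.e. an unpenalized feature, cannot be expressed this way)."""
    pf = np.asarray(penalty_factor, dtype=float)
    # column j divided by its factor; the fitted coef_ is c = pf * b
    est = Lasso(alpha=alpha, fit_intercept=False).fit(X / pf, y)
    return est.coef_ / pf  # map c back to the original coefficients b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.5]) + 0.01 * rng.normal(size=100)
coef = lasso_with_penalty_factors(X, y, alpha=0.05, penalty_factor=[1.0, 1.0, 1.0])
```

With all factors equal to 1 this reduces to plain Lasso; a factor of 5 on a feature shrinks its coefficient as if alpha were five times larger for that feature only.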

@scikit-learn/core-devs as info

@agramfort
Member

agramfort commented Feb 1, 2022 via email

@xiaowei1234

Can I get a review on my pull request that addresses this issue?

@thomasjpfan
Member

Ridge already accepts alpha as an array of (n_targets,). If GLMs support an array of (n_features,) the two APIs would be inconsistent.
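For reference, the existing per-target behavior looks like this: an alpha of shape (n_targets,) applies a different strength to each output, which is exactly the shape collision with a hypothetical (n_features,) array:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, n_targets=2, random_state=0)

# One regularization strength per *target*, not per feature:
reg = Ridge(alpha=np.array([0.5, 100.0])).fit(X, y)
print(reg.coef_.shape)  # (2, 5): one row of coefficients per target
```

Each target is effectively fitted with its own alpha, so row k of coef_ matches a single-target Ridge fit on y[:, k].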

@jnothman
Member

jnothman commented Mar 6, 2022

Can we accept shape (n_targets,), (1, n_targets), (n_features, 1) or (n_features, n_targets)?

@lorentzenchr
Member

Personally, I find the multi-output story for penalties a bit unfortunate (maybe I'm blind or unaware of good use cases). Given that we can't (easily) change that, what about introducing new parameters P1 and P2, as in https://glum.readthedocs.io/en/latest/glm.html (and as in #9405), for the L1 and L2 penalty matrices?

In particular, P2 would be nice to have, as it generalizes the penalty to coef @ P2 @ coef. This can come in very handy, e.g. for penalizing differences of coefficients, or simply for a different penalty strength per feature.
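To illustrate why a full P2 matrix is attractive: the generalized ridge problem ||y - X @ b||^2 + b @ P2 @ b has the closed form b = (X'X + P2)^{-1} X'y, so a difference penalty is just one choice of P2. A NumPy sketch (illustrative, not a proposed scikit-learn API):

```python
import numpy as np

def ridge_general_penalty(X, y, P2):
    """Minimize ||y - X @ b||^2 + b @ P2 @ b in closed form:
    b = (X'X + P2)^{-1} X'y, for P2 symmetric positive semidefinite."""
    return np.linalg.solve(X.T @ X + P2, X.T @ y)

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = X @ np.linspace(1.0, 2.0, p) + 0.1 * rng.normal(size=n)

# First-difference matrix D: row i is e_{i+1} - e_i, so ||D @ b||^2
# penalizes differences of adjacent coefficients (a smoothness penalty) ...
D = np.diff(np.eye(p), axis=0)
b_smooth = ridge_general_penalty(X, y, 10.0 * D.T @ D)

# ... while a diagonal P2 is just a per-feature ridge strength.
b_perfeat = ridge_general_penalty(X, y, np.diag([0.0, 0.0, 5.0, 5.0, 5.0, 5.0]))
```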

Further advantages:

  • unified way to specify feature-wise penalties across all (linear) estimators: always the same parameter name
  • clear distinction between l1 and l2
  • no mixing/confusion with n_targets

@xiaowei1234

What is n_targets? Is that not the number of features?

@thomasjpfan
Member

It's the number of targets in y:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_targets=2, random_state=0)

print(y.shape)
# (100, 2)

reg = Ridge().fit(X, y)
print(reg.coef_.shape)
# (2, 100)

@xiaowei1234

Ah ok, thanks. I haven't ever had the need to fit a regression with multiple targets before.

@agramfort
Member

agramfort commented Mar 8, 2022 via email

@lorentzenchr
Member

But to me P1 cannot be a generic matrix as otherwise the solvers will be a lot more complicated.

Correct, a 1d array for the diagonals of a P1 matrix suffices and keeps things solvable within our codebase.
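To illustrate why a diagonal P1 keeps the solver simple: in coordinate descent, a per-feature L1 weight only changes the soft-thresholding level for that coordinate. A minimal sketch (not scikit-learn's actual Cython solver):

```python
import numpy as np

def cd_lasso_penalty_factors(X, y, alpha, penalty_factor, n_iter=500):
    """Coordinate descent for (1/2n)*||y - X @ b||^2 + alpha * sum(v_j * |b_j|).
    The only change vs plain Lasso is the per-feature soft-threshold
    level alpha * v_j, which is why a 1d (diagonal) P1 stays tractable."""
    n, p = X.shape
    v = np.asarray(penalty_factor, dtype=float)
    b = np.zeros(p)
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the partial residual
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha * v[j], 0.0) / col_norm[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.5, 0.0, -1.0, 0.2]) + 0.05 * rng.normal(size=80)
b = cd_lasso_penalty_factors(X, y, alpha=0.1, penalty_factor=[1.0, 1.0, 1.0, 1.0])
# with all factors equal to 1 this coincides with plain Lasso(alpha=0.1)
```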

@lorentzenchr
Member

As discussed in the dev meeting 28 March 2022, new parameter(s) seems fine. The questions are:

  1. One or two parameters?
  2. Naming

One vs two

Having one parameter like alpha_features would work for pretty much all linear models; it keeps the L1 and L2 penalties in sync.
Having two individual ones, like P1 and P2 for ||P1 * w||_1 and w' @ diag(P2) @ w, would allow more control. It also opens the opportunity to later allow P2 to be a 2d array.
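The two options can be written down directly as penalty functions. A sketch using the names from this thread (P1, P2, and alpha_features are proposals here, not existing scikit-learn parameters):

```python
import numpy as np

def penalty_two_params(coef, P1, P2):
    """Two-parameter variant: sum(P1 * |w|) for L1 plus
    0.5 * w @ diag(P2) @ w for L2, with P1 and P2 1d per-feature arrays."""
    coef = np.asarray(coef, dtype=float)
    return np.sum(P1 * np.abs(coef)) + 0.5 * np.sum(P2 * coef ** 2)

def penalty_one_param(coef, alpha, l1_ratio, alpha_features):
    """One-parameter variant: a single per-feature factor scales both
    penalty terms in sync, i.e. the special case P1 and P2 proportional."""
    af = np.asarray(alpha_features, dtype=float)
    return penalty_two_params(coef, alpha * l1_ratio * af,
                              alpha * (1 - l1_ratio) * af)
```

The one-parameter form is the special case where P1 and P2 share the same per-feature profile; two parameters decouple them, e.g. L1 on everything but no L2 on main effects.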

@agramfort
Member

agramfort commented Mar 29, 2022 via email

@lorentzenchr
Member

There are several use cases:

  • I might not want to penalize some coefficients at all, in particular very strong main effects.
  • I might consider adding L1 to all coefficients, but excluding main effects from L2, which acts more strongly on large coefficients.
  • With two different parameters for L1 and L2, we could later allow the L2 parameter to be a 2d array. With this one can construct:


9 participants