Issue #2559: Added normalize option to LogisticRegression #2567

Open · wants to merge 7 commits into scikit-learn:master from ilblackdragon:logit-normalize

And to BaseLibLinear, which is the base class for LogisticRegression and does all the computations.

Question: should center_data be moved to a general utils module, since it is now used from both the linear_model and svm packages?

@ilblackdragon ilblackdragon Issue #2559: Added normalize option to LogisticRegression (and BaseLibLinear, which is base class for LogisticRegression and does all computations)
cfdd9da

@agramfort agramfort commented on an outdated diff Nov 3, 2013

sklearn/svm/base.py
@@ -677,6 +681,12 @@ def fit(self, X, y):
# LibLinear wants targets as doubles, even for classification
y = np.asarray(y, dtype=np.float64).ravel()
+
+ # Center data if self.normalize
+ X, y, X_mean, y_mean, X_std = self._center_data(X, y,
+ self.fit_intercept,
+ self.normalize)

agramfort Nov 3, 2013

Owner

this will not work: at predict time you want to normalize X_test using the X_mean and X_std computed during fit
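A minimal sketch of the point being made here (the class and attribute names below are illustrative, not part of this PR): whatever mean and std are computed in fit have to be stored on the estimator and reapplied to X_test in predict.

import numpy as np

class NormalizedEstimator:
    """Illustrative sketch: keep the training statistics for predict time."""

    def __init__(self, estimator):
        self.estimator = estimator  # e.g. a plain LogisticRegression

    def fit(self, X, y):
        X = np.asarray(X, dtype=np.float64)
        self.X_mean_ = X.mean(axis=0)
        self.X_std_ = X.std(axis=0)
        self.X_std_[self.X_std_ == 0] = 1.0  # guard constant features
        self.estimator.fit((X - self.X_mean_) / self.X_std_, y)
        return self

    def predict(self, X):
        # X_test is shifted/scaled with the *training* mean and std
        X = (np.asarray(X, dtype=np.float64) - self.X_mean_) / self.X_std_
        return self.estimator.predict(X)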

Owner

agramfort commented Nov 3, 2013

travis is not happy.

Owner

amueller commented Nov 7, 2013

Do we want this? I am actually not happy that there is such an option (with differing default parameters iirc) in the linear models.

@amueller Do you think it would be better to make the user call preprocessing.scale explicitly and correct the coefficients manually?

Btw, are any of you at PyData in NYC this weekend?

Coverage Status

Coverage remained the same when pulling 76f31ca on ilblackdragon:logit-normalize into 8ad2f5e on scikit-learn:master.

Coverage Status

Coverage remained the same when pulling 4fe4255 on ilblackdragon:logit-normalize into 8ad2f5e on scikit-learn:master.

@ilblackdragon ilblackdragon Issue #2559: Modify test_logisitc:test_normalize to check that it wor…
…ks correctly with fit_intercept=False and non-default scaling_intercept
1b3c783

Coverage Status

Coverage remained the same when pulling 1b3c783 on ilblackdragon:logit-normalize into 8ad2f5e on scikit-learn:master.

@agramfort agramfort commented on an outdated diff Nov 10, 2013

sklearn/svm/base.py
@@ -677,6 +678,12 @@ def fit(self, X, y):
# LibLinear wants targets as doubles, even for classification
y = np.asarray(y, dtype=np.float64).ravel()
+
+ # Center data if self.normalize
+ if self.normalize:
+ X_mean, X_std = np.mean(X, axis=0), np.std(X, axis=0)
+ X = (X - X_mean) / X_std

agramfort Nov 10, 2013

Owner

I am not sure there will be a consensus to merge this, but let me give feedback to help

doing this will break on sparse data. Maybe preventing normalize on sparse data is the way to go.
Also, if X has already been copied, you should write

X -= X_mean
X /= X_std

this will prevent a reallocation of the size of X.
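For illustration, a rough sketch combining both suggestions (the helper name is hypothetical): refuse sparse input and update an already-copied dense X in place.

import numpy as np
from scipy import sparse

def _center_scale_inplace(X):
    """Hypothetical helper: X must be a dense float array that has already
    been copied; it is modified in place to avoid reallocating an array of
    the same size."""
    if sparse.issparse(X):
        raise ValueError("normalize=True is not supported with sparse input")
    X_mean = X.mean(axis=0)
    X_std = X.std(axis=0)
    X_std[X_std == 0] = 1.0  # keep constant features from producing NaNs
    X -= X_mean              # in place: no new array of X's size
    X /= X_std
    return X, X_mean, X_std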

@agramfort agramfort commented on an outdated diff Nov 10, 2013

sklearn/linear_model/tests/test_logistic.py
@@ -155,3 +155,21 @@ def test_liblinear_random_state():
lr2 = logistic.LogisticRegression(random_state=0)
lr2.fit(X, y)
assert_array_almost_equal(lr1.coef_, lr2.coef_)
+
+def test_normalize():
+ """Test for normalize option in LogisticRegression
+ to verify that prediction of array that already normalize is same as if normalize option is enabled

agramfort Nov 10, 2013

Owner

line too long. Consider a pep8 checker in your editor.

Also it should be:

to verify that the prediction with an array that has already been normalized is the same as if normalize option is enabled.

Thanks for the feedback.

Regarding @amueller's comment about having normalize in the first place: on the one hand it's useful to have it inside LinearModel, since it also adjusts the coefficients to work with unnormalized data; on the other hand, maybe a pipeline of Scale + Model is a better approach, because it works with any arbitrary model.

What about adding some helper function normalized(estimator) that would wrap this together as a pipeline?
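A rough sketch of what such a helper could look like (purely hypothetical, not an existing scikit-learn function):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def normalized(estimator):
    """Hypothetical helper: wrap an arbitrary estimator with a StandardScaler."""
    return Pipeline([("scaler", StandardScaler()), ("est", estimator)])

# usage (illustrative): clf = normalized(LogisticRegression()); clf.fit(X, y)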

Coverage Status

Coverage remained the same when pulling 669c3b0 on ilblackdragon:logit-normalize into 8ad2f5e on scikit-learn:master.

Owner

amueller commented Nov 11, 2013

I don't have a strong opinion on this; maybe we should get some other people's feedback, too?
I think I wouldn't want a normalized helper. Why one for StandardScaler and not for MinMaxScaler and Normalizer?
Do you feel writing
estimator = Pipeline([("scaler", StandardScaler()), ("est", old_estimator)]) is too complicated?
Adding a new function (or rather one for every preprocessor) seems like adding a lot of interface for little gain.

I thought a little more about the original normalize option in LinearRegression. The beauty of it is that the coefficients and intercept are adjusted, which means it's not just a Pipeline of estimators you can score/predict with. You retain the ability to analyze the coefficients and intercept themselves (e.g. feature importance, or testing the null hypothesis that coefficients equal zero). The same goes for any other machine learning algorithm: if the scale in the main estimator is not restored, it will be very hard to analyze its internals.

Yes, writing a pipeline isn't that complicated. Apparently I'm not yet very proficient with scikit-learn, and when I needed to build a normalized LogisticRegression, writing a pipeline didn't come to mind. Maybe it would be helpful to add it as an example in the manual: http://scikit-learn.org/stable/modules/preprocessing.html

Owner

larsmans commented Nov 25, 2013

This is pushing a problem-specific assumption, that the data is not normalized and performance will benefit from one specific form of normalization, into the estimator. It violates the principle that estimators should be orthogonal components as much as possible.

As for inspectability, I don't really understand the reasoning: if you put LR in a pipeline with a StandardScaler up front, you'll get the same coefficients that you get from this solution, right?

(I know we've done this in a bunch of other places, and I didn't like it there either.)

@larsmans No, the coefficients will differ, because LR in a pipeline will have coefficients that work with the already scaled data, while in this solution the coefficients are rescaled internally back to work with the original data (i.e. without scaling first, you can just use lr.score(original_x)).
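As a small self-check of that claim (illustrative only, using the current StandardScaler attribute names mean_ and scale_): coefficients fit on scaled data can be mapped back so they apply to the raw, unscaled X, and the two decision functions then agree.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 3) * [1.0, 10.0, 100.0] + [0.0, 5.0, -3.0]
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)

scaler = StandardScaler().fit(X)
lr = LogisticRegression().fit(scaler.transform(X), y)

# coefficients for the *scaled* features ...
coef_scaled = lr.coef_.ravel()
# ... rescaled back so they can be applied to the raw, unscaled X
coef_orig = coef_scaled / scaler.scale_
intercept_orig = lr.intercept_ - np.dot(coef_orig, scaler.mean_)

d_pipeline = lr.decision_function(scaler.transform(X))
d_rescaled = X @ coef_orig + intercept_orig
assert np.allclose(d_pipeline, d_rescaled)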

Owner

agramfort commented Nov 27, 2013

indeed it avoids a copy of X at predict time.

Owner

larsmans commented Nov 27, 2013

Ok, I see what you mean. But StandardScaler has a copy option, so it doesn't need to copy either.
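For reference, a minimal example of that option (assuming X is already a float array so it can be scaled without a copy):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 20)        # already float64
scaler = StandardScaler(copy=False)  # scale in place, no extra copy of X
X = scaler.fit_transform(X)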

Owner

agramfort commented Nov 27, 2013

ok, but you agree that with a pipeline you have to center/scale X_test?
in place if you can, indeed, but still...

Owner

larsmans commented Nov 27, 2013

Pipelines do the transform when you predict with them. That's what they're for: to run a bunch of transformations automatically as if their effects were merged into the estimator.

Actually, about pipelines: it makes sense not to use the full X when predicting, but to do it row by row.
If you do something like http://www-stat.stanford.edu/~jhf/R-RuleFit.html (a pipeline where each leaf of a decision tree from gradient boosting is transformed into a variable that is passed to logistic regression), the memory expansion of each row is enormous. It makes sense to do it row by row to minimize the overall memory footprint.
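A short sketch of that idea (the helper name and batch size are illustrative): feed a fitted pipeline small batches so the expanded intermediate representation never holds all rows at once.

import numpy as np

def predict_in_batches(pipeline, X, batch_size=1000):
    """Illustrative helper: predict in small chunks to limit peak memory."""
    preds = [pipeline.predict(X[start:start + batch_size])
             for start in range(0, X.shape[0], batch_size)]
    return np.concatenate(preds)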

@larsmans larsmans commented on an outdated diff Nov 27, 2013

sklearn/linear_model/logistic.py
@@ -36,6 +36,10 @@ class LogisticRegression(BaseLibLinear, LinearClassifierMixin,
Specifies if a constant (a.k.a. bias or intercept) should be
added the decision function.
+ normalize : boolean, optional, default False
+ If True, the regressors X will be normalized before
+ logistic regression.

larsmans Nov 27, 2013

Owner

(Minor point:) we never (I hope) call training data "the regressors". This is very confusing, given that LR is a classification model and the term "regressors" is not commonly used in applied ML.

Owner

larsmans commented Nov 27, 2013

(Forget the previous comment, I didn't read yours correctly.)

When predicting, you have full control over how many samples you feed in, regardless of the estimator type. That's tangential to this PR.

Coverage Status

Coverage remained the same when pulling 9e4458a on ilblackdragon:logit-normalize into 8ad2f5e on scikit-learn:master.
