
[MRG + 1] add sample_weight into LinearRegression #4881

Merged
merged 1 commit into scikit-learn:master from sonnyhu:weighted_least_squares on Aug 30, 2015

Conversation

7 participants
@sonnyhu
Contributor

sonnyhu commented Jun 21, 2015

Working on #4735

@sonnyhu

Contributor

sonnyhu commented Jun 21, 2015

I have a question about testing. https://github.com/scikit-learn/scikit-learn/pull/4881/files#diff-f556973f37707f94a71f0ad38243dcf3R54 I copied this line from test_ridge.py, and 0.47 seems like a magic number to me. Is there a reason for this number, or did it come from experimentation? Thanks.

@sonnyhu

Contributor

sonnyhu commented Jun 21, 2015

Travis build failed because of this error:

======================================================================
ERROR: sklearn.ensemble.tests.test_weight_boosting.test_sample_weight_missing
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python2.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/tests/test_weight_boosting.py", line 263, in test_sample_weight_missing
    assert_raises(ValueError, clf.fit, X, y_regr)
  File "/home/travis/miniconda/envs/testenv/lib/python2.6/unittest.py", line 336, in failUnlessRaises
    callableObj(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/weight_boosting.py", line 409, in fit
    return super(AdaBoostClassifier, self).fit(X, y, sample_weight)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/weight_boosting.py", line 139, in fit
    sample_weight)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/weight_boosting.py", line 467, in _boost
    return self._boost_discrete(iboost, X, y, sample_weight)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/weight_boosting.py", line 546, in _boost_discrete
    self.n_classes_ = len(self.classes_)
TypeError: object of type 'NoneType' has no len()

I cannot reproduce it on my computer, and I'm not quite sure why my change breaks this test.

@agramfort

Member

agramfort commented Jun 21, 2015

@TomDLT you might want to have a look.

    Returns
    -------
    self : returns an instance of self.
    """
    from ridge import _rescale_data

@mblondel

mblondel Jun 22, 2015

Member

Move _rescale_data to base.py instead.

assert_raise_message(ValueError,
                     "Sample weights must be 1D array or scalar",
                     fit_linearregression_not_ok)

@TomDLT

TomDLT Jun 22, 2015

Member

You can give arguments to the function in assert_raise_message.
example:

assert_raise_message(ValueError, "Sample weights must be 1D array or scalar",
                     clf.fit, X, y, sample_weights_not_OK)
@TomDLT

Member

TomDLT commented Jun 22, 2015

The error comes from test_sample_weight_missing() in test_weight_boosting.py.
Since LinearRegression now supports sample_weight, you have to remove it from this test.

0.47 seems to be a magic number to me

0.47 is a lower bound on the correct behavior. Once you think your method is correct, you take a value slightly below the actual score. The test will then check that the score does not decrease with future changes.
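The pattern TomDLT describes can be sketched as follows (a hypothetical illustration, not the actual test in this PR; the data and the 0.47 bound are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, nearly noiseless linear data.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.randn(50)
sample_weight = 1.0 + rng.rand(50)

clf = LinearRegression()
clf.fit(X, y, sample_weight=sample_weight)

# Pin the score to a lower bound comfortably below the actual value;
# a future change that degrades the fit will make this assertion fail.
assert clf.score(X, y) > 0.47
```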

@sonnyhu

Contributor

sonnyhu commented Jun 23, 2015

Ok, thanks. Will fix it later this week.

@sonnyhu sonnyhu changed the title from [WIP] add sample_weight into LinearRegression to add sample_weight into LinearRegression Jun 27, 2015

@sonnyhu

Contributor

sonnyhu commented Jul 10, 2015

This is ready for code review. Thanks.

@amueller amueller changed the title from add sample_weight into LinearRegression to [MRG] add sample_weight into LinearRegression Jul 11, 2015

clf.fit(X, y, sample_weights_OK_1)
clf.fit(X, y, sample_weights_OK_2)
def fit_linearregression_not_ok():

@TomDLT

TomDLT Jul 28, 2015

Member

you can delete this now

@TomDLT

Member

TomDLT commented Jul 28, 2015

LGTM

@@ -112,6 +112,19 @@ def center_data(X, y, fit_intercept, normalize=False, copy=True,
return X, y, X_mean, y_mean, X_std
def rescale_data(X, y, sample_weight):

@amueller

amueller Jul 28, 2015

Member

I think this method should stay private

@ndawe

ndawe Jul 31, 2015

Member

rescale_data -> _rescale_data

@sonnyhu

sonnyhu Aug 1, 2015

Contributor

rescale_data is used in ridge.py as well. Though in some sense, it is private to linear_model.

@mblondel

mblondel Aug 1, 2015

Member

+1 for making it private

def rescale_data(X, y, sample_weight):
    """Rescale data so as to support sample_weight"""
    n_samples = X.shape[0]
    sample_weight = sample_weight * np.ones(n_samples)

@ndawe

ndawe Jul 31, 2015

Member

Do we need to support a scalar sample_weight? Does that really change the result or does it wash out?

@amueller

amueller Jul 31, 2015

Member

It should rescale the penalty, so there is not really a point, right?

@amueller

amueller Jul 31, 2015

Member

never mind, there is no penalty here.

@sonnyhu

sonnyhu Aug 1, 2015

Contributor

This function is used in ridge.py as well. For Ridge, does it make sense to have a scalar sample_weight? I ran some simple tests and it did not change the output. If scalar sample_weight can go away in both cases, this function will be easier to write.

@mblondel

mblondel Aug 1, 2015

Member

I don't think it harms to keep it that way.

@sonnyhu

sonnyhu Aug 1, 2015

Contributor

Keep scalar sample_weight for both regressors?
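For context, the trick implemented by this helper is the standard reduction of weighted least squares to ordinary least squares: scaling row i of X and y by sqrt(w_i) makes the OLS objective on the rescaled data equal the weighted objective on the original data. A dense-only sketch (the real helper presumably also has to handle sparse input, which is omitted here):

```python
import numpy as np

def rescale_data(X, y, sample_weight):
    """Rescale data so as to support sample_weight (dense-only sketch)."""
    # Scaling row i by sqrt(w_i) turns min sum_i w_i * (y_i - x_i @ b)**2
    # into an ordinary least-squares problem on the rescaled data.
    n_samples = X.shape[0]
    sample_weight = np.sqrt(sample_weight * np.ones(n_samples))  # scalar broadcasts
    return X * sample_weight[:, np.newaxis], y * sample_weight

rng = np.random.RandomState(0)
X, y = rng.randn(20, 2), rng.randn(20)
w = 0.5 + rng.rand(20)

Xw, yw = rescale_data(X, y, w)
beta = np.linalg.lstsq(Xw, yw, rcond=None)[0]

# Matches the weighted normal equations (X^T W X) beta = X^T W y.
expected = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
assert np.allclose(beta, expected)
```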

@@ -363,16 +376,29 @@ def fit(self, X, y):
y : numpy array of shape [n_samples, n_targets]
Target values
sample_weight : float or numpy array of shape [n_samples]

@ndawe

ndawe Jul 31, 2015

Member

Same comment here about a scalar sample_weight. Sorry if I'm missing something here, but intuitively I would expect that the result would not change if you weighted all samples by the same amount.

@ndawe

ndawe Jul 31, 2015

Member

From some quick experiments, weighting all samples by an equal amount does not change anything, as I expected. So I don't think we need to support a scalar sample_weight.

@sonnyhu

sonnyhu Aug 1, 2015

Contributor

See above

@mblondel

mblondel Aug 1, 2015

Member

I agree that we don't need to add it to the doc.
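ndawe's observation is easy to reproduce: with no penalty term, a constant weight rescales the least-squares objective by a constant and leaves the minimizer unchanged. A quick check (assuming scalar weights are accepted, as in the merged code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X, y = rng.randn(30, 2), rng.randn(30)

plain = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(X, y, sample_weight=2.0)  # uniform scalar weight

# A constant weight multiplies the objective by a constant,
# so the fitted coefficients and intercept are unchanged.
assert np.allclose(plain.coef_, scaled.coef_)
assert np.allclose(plain.intercept_, scaled.intercept_)
```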

@ndawe

Member

ndawe commented Jul 31, 2015

Please clean up your commit message too. Thanks!

@sonnyhu

Contributor

sonnyhu commented Jul 31, 2015

Thanks for the feedback. Will clean up the PR this weekend!

@sonnyhu

Contributor

sonnyhu commented Aug 1, 2015

Hey all, the only question left now is whether we want to keep a scalar sample_weight. I agree that a scalar does not really make sense, so I am thinking of removing scalar sample_weight from LinearRegression and Ridge, as well as from their docs. _rescale_data would then only process non-scalar sample_weight. Does this work? Please let me know. Thanks!

@amueller

Member

amueller commented Aug 1, 2015

Was it possible to use scalar sample weight in Ridge before? If so, we would at least need to deprecate it before removing it.

@sonnyhu

@amueller

Member

amueller commented Aug 1, 2015

Then I'd leave it for the time being.

@sonnyhu

Contributor

sonnyhu commented Aug 1, 2015

Hmm, ok. So leave _rescale_data as it is, but change the doc for LinearRegression to say it only accepts an array for sample_weight, even though a scalar would actually work? Or should I add a check for sample_weight in LinearRegression? I am not sure I like this inconsistency.

@sonnyhu

Contributor

sonnyhu commented Aug 2, 2015

Hey, I would appreciate any input on this; I want to close it out this weekend. Thanks!

@mblondel

Member

mblondel commented Aug 2, 2015

I would leave it as is but remove it from the docstring. It's not worth going through a deprecation cycle. You can think of it as a broadcasting rule, like in NumPy.
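The NumPy analogy: a scalar weight simply broadcasts to a constant weight vector, exactly as in the `sample_weight * np.ones(n_samples)` line of the helper:

```python
import numpy as np

n_samples = 5
# A scalar broadcasts against an array, yielding a constant weight vector.
expanded = 2.0 * np.ones(n_samples)

assert np.array_equal(expanded, np.full(n_samples, 2.0))
```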

@sonnyhu

Contributor

sonnyhu commented Aug 2, 2015

Ok, thanks!

@sonnyhu

Contributor

sonnyhu commented Aug 2, 2015

Ok, done.

@TomDLT

Member

TomDLT commented Aug 3, 2015

LGTM 👍

@@ -286,15 +286,8 @@ def test_base_estimator():
def test_sample_weight_missing():
    from sklearn.linear_model import LinearRegression

@amueller

amueller Aug 3, 2015

Member

can you add another model here that has no sample weights?

@sonnyhu

sonnyhu Aug 4, 2015

Contributor

Added LogisticRegression

        n_jobs_ = self.n_jobs
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                         y_numeric=True, multi_output=True)

        if ((sample_weight is not None) and
                np.atleast_1d(sample_weight).ndim > 1):
            raise ValueError("Sample weights must be 1D array")

@amueller

amueller Aug 3, 2015

Member

Why not use column_or_1d?

@sonnyhu

sonnyhu Aug 4, 2015

Contributor

Good call. Done.

@amueller

Member

amueller commented Aug 3, 2015

looks good apart from minor nitpicks

@sonnyhu

Contributor

sonnyhu commented Aug 4, 2015

Cool. Fixed. Thanks!

@amueller amueller changed the title from [MRG] add sample_weight into LinearRegression to [MRG + 1] add sample_weight into LinearRegression Aug 4, 2015

@amueller

Member

amueller commented Aug 4, 2015

The AppVeyor failure is related. And I'm not sure why Travis is not running :-/

@amueller amueller changed the title from [MRG + 1] add sample_weight into LinearRegression to [MRG] add sample_weight into LinearRegression Aug 4, 2015

@amueller

Member

amueller commented Aug 4, 2015

maybe column_or_1d can't handle scalars :-/

@sonnyhu

Contributor

sonnyhu commented Aug 5, 2015

Hmm, ok. I changed it back to np.atleast_1d(sample_weight).ndim > 1.
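amueller's suspicion is easy to verify: np.atleast_1d promotes a scalar to a 1-D array, while (at least in the scikit-learn versions I checked) column_or_1d rejects 0-d input, so the atleast_1d-based check is the one that tolerates scalar weights:

```python
import numpy as np
from sklearn.utils import column_or_1d

# np.atleast_1d promotes a scalar to a valid 1-D array ...
assert np.atleast_1d(2.0).ndim == 1

# ... while column_or_1d raises on a 0-d input.
try:
    column_or_1d(2.0)
    scalar_rejected = False
except ValueError:
    scalar_rejected = True
assert scalar_rejected

# The atleast_1d-based check still catches genuinely 2-D weights.
assert np.atleast_1d(np.ones((3, 1))).ndim > 1
```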

@sonnyhu

Contributor

sonnyhu commented Aug 7, 2015

@amueller Hey, could you take a look at the final change please? Thanks!

sample_weights_OK = rng.randn(n_samples) ** 2 + 1
sample_weights_OK_1 = 1.
sample_weights_OK_2 = 2.
sample_weights_not_OK = sample_weights_OK[:, np.newaxis]

@amueller

amueller Aug 13, 2015

Member

these are not used anywhere, right?

@sonnyhu

sonnyhu Aug 13, 2015

Contributor

Yeah. Thanks.

@amueller

Member

amueller commented Aug 13, 2015

apart from the nitpick in the testing, this looks good.
@mblondel @agramfort ?

@amueller amueller changed the title from [MRG] add sample_weight into LinearRegression to [MRG + 1] add sample_weight into LinearRegression Aug 13, 2015

@GaelVaroquaux

Member

GaelVaroquaux commented Aug 30, 2015

LGTM. Thank you! Merging

GaelVaroquaux added a commit that referenced this pull request Aug 30, 2015

Merge pull request #4881 from sonnyhu/weighted_least_squares
[MRG + 1] add sample_weight into LinearRegression

@GaelVaroquaux GaelVaroquaux merged commit 190abde into scikit-learn:master Aug 30, 2015

2 checks passed

continuous-integration/appveyor: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@sonnyhu

Contributor

sonnyhu commented Aug 30, 2015

Thanks!

@sonnyhu sonnyhu deleted the sonnyhu:weighted_least_squares branch Sep 5, 2015
