[WIP] FIX normalize to unit variance in [sparse_]center_data #3005

Closed · wants to merge 1 commit

Conversation

Owner

jnothman commented Mar 26, 2014

It appears center_data has been normalizing to variance=1/n_samples (not unit variance) since 295dc3c#diff-7b11510996b18ecc1e4bfd4f4ec850b4R67

At master:

>>> X1 = np.random.normal(size=(1000, 1))
>>> X2 = np.random.normal(size=(10000, 1))
>>> center_data(X1, np.zeros(X1.shape[0]), True, True)[0].std()
0.031622776601683784
>>> center_data(X2, np.zeros(X2.shape[0]), True, True)[0].std()
0.0099999999999999794

This also deletes sparse_std, cleans up and tests the centering code.
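
For context, a minimal sketch of the scaling at issue (my own illustration, not from the thread; it assumes normalize=True divides each centered column by its L2 norm, which is consistent with the output above):

import numpy as np

n_samples = 1000
x = np.random.normal(size=n_samples)
x -= x.mean()                  # center the column
x /= np.sqrt(np.sum(x ** 2))   # divide by its L2 norm, as master appears to do
print(x.std())                 # ~0.0316 == 1 / sqrt(n_samples), not 1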

Owner

jnothman commented Mar 26, 2014

Repushing without recythonised files...

Owner

MechCoder commented Mar 26, 2014

Doesn't this fail any tests? I tried the same thing, and it looked as if it failed a few tests (Travis is not running, btw).

Owner

MechCoder commented Mar 26, 2014

Do you need to rebase?

Owner

jnothman commented Mar 26, 2014

Certainly do. And it will probably fail tests; I've only tested linear_model.tests.test_base locally.

Owner

MechCoder commented Mar 26, 2014

I was just about to push in a commit. You beat me to it :(

Owner

jnothman commented Mar 26, 2014

You intended to push a commit to my branch? Or you'd prepared a similar patch?

Owner

MechCoder commented Mar 26, 2014

A commit to this, #3004. We'll still have to fix tests though.

Owner

MechCoder commented Mar 26, 2014

@jnothman Do you want me to take a shot at fixing the tests? Or are you doing it?

Owner

jnothman commented Mar 26, 2014

Well, I'm still not sure what the correct behaviour is. The behaviour at master seems strange to me, but I don't know much about these normalisation processes and what they're useful for.

The test failures seem varied, and some of them are non-trivial. But I've not looked at them properly. Do you have a good idea of what's going on, and whether it's correct to fix this or whether this patch is misguided?

Owner

jnothman commented Mar 26, 2014

Ahh. I just realised I'd made a rebase error which meant sparse_center_data triggered an error.

Owner

MechCoder commented Mar 26, 2014

I'm also not sure why the normalization is done that way. Maybe @JeanKossaifi would be free enough to explain why? All I wanted was for sparse_center_data to support CSR matrices for the normalize case, which was broken. Maybe the test failures will give an idea of why this is being done.

Owner

jnothman commented Mar 26, 2014

All I wanted was for sparse_center_data to support CSR matrices for the normalize case, which was broken.

I still gather this should be possible without sparse_std, even if you emulate the behaviour of returning data with variance=1/n_samples rather than 1.
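
For what it's worth, a hedged sketch of one way to get per-column standard deviations from sparse input without a dedicated sparse_std helper, via E[x^2] - E[x]^2 (the function name is illustrative, not the PR's code):

import numpy as np
import scipy.sparse as sp

def sparse_column_std(X):
    # X: sparse matrix of shape (n_samples, n_features); CSR or CSC both work
    X = sp.csc_matrix(X)
    mean = np.asarray(X.mean(axis=0)).ravel()
    mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()
    var = mean_sq - mean ** 2              # E[x^2] - E[x]^2, per column
    return np.sqrt(np.maximum(var, 0.0))   # clip tiny negatives from round-off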

Owner

jnothman commented Mar 26, 2014

@manoj-kumar-s, if you do feel like looking through the test failures, that would be nice ;) I did this while sick and off work, but won't have much time to spend on it.

MechCoder and 2 others commented on an outdated diff Mar 26, 2014

sklearn/linear_model/base.py
if normalize:
- X_std = sparse_std(
- X.shape[0], X.shape[1],
- X_data, X.indices, X.indptr, X_mean)
+ X_std = np.sqrt(X_var, X_var)
+ del X_var
@MechCoder

MechCoder Mar 26, 2014

Owner

Is it better to write this as just X_std = np.sqrt(X_var)?

@jnothman

jnothman Mar 26, 2014

Owner

Probably.

@GaelVaroquaux

GaelVaroquaux Mar 26, 2014

Owner
    X_std = np.sqrt(X_var, X_var)
    del X_var

Is it better to write this as just X_std = np.sqrt(X_var)?

No, your proposal creates a copy, whereas the above does not.
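
For illustration (my addition, not part of the diff): the two-argument call passes X_var as numpy's out parameter, so the square root overwrites X_var's buffer instead of allocating a new array:

import numpy as np

X_var = np.array([4.0, 9.0, 16.0])
X_std = np.sqrt(X_var, X_var)  # second argument is out=: written in place
print(X_std is X_var)          # True: same buffer, no copy
# by contrast, X_std = np.sqrt(X_var) would allocate a fresh array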

Owner

jnothman commented Mar 26, 2014

Perhaps it's not worth having unit variance... :s

Owner

MechCoder commented Mar 26, 2014

Please wait before pushing anything to this branch. I have a commit almost ready. I want to add a few more things. Thanks.

Owner

GaelVaroquaux commented Mar 26, 2014

Perhaps it's not worth having unit variance... :s

In general, it does not matter. The only thing that it sets is the lambda.

However, if we are indeed solving the standard lasso objective, with the right normalization factors (which we should be, I believe), I think that it could matter.

I haven't had time to follow this issue, so I can't tell. I'll try to do this, while still listening to the guy currently speaking ... :$
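
As a quick numerical check of the "it only sets the lambda" point (my own sketch, not from the thread): for the standard objective 1/(2n) ||y - Xw||^2 + alpha ||w||_1, scaling X by c while also scaling alpha by c gives the same model, with coefficients rescaled by 1/c:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
c = 10.0

w1 = Lasso(alpha=0.1, tol=1e-10).fit(X, y).coef_
w2 = Lasso(alpha=0.1 * c, tol=1e-10).fit(c * X, y).coef_
print(np.abs(w1 - c * w2).max())  # ~0: same predictions, rescaled coefficients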

Owner

MechCoder commented Mar 26, 2014

@jnothman and @GaelVaroquaux: I sent a PR against your branch, jnothman#3. I made some changes to sparsefuncs.pyx too. Please tell me how it looks.

Owner

MechCoder commented Mar 28, 2014

@jnothman Do you have any objections to merging my PR (jnothman#3)? For now I suppose it would be best to leave it as is, i.e. without unit variance. Also, removing sparse_std would mean sparse_center_data would support CSR matrices, which means I can resume my work. I would like to get this PR in. Thanks.

Owner

jnothman commented Mar 29, 2014

@MechCoder, the point of this PR is to fix -- or determine whether we should fix -- the scale of variance. I agree that sparse_std should be removed, and it can be removed without the remainder of this PR. I'll try to pull that together.

@GaelVaroquaux it seems the scale of the variance has a big impact on OMP, and some on Lasso.
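
To illustrate why scale matters for OMP (my own sketch, not from the thread): OMP's greedy step selects argmax_j |x_j^T r|, the column most correlated with the current residual, so inflating the scale of an irrelevant column can sway the selection:

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

def first_omp_pick(X, y):
    # OMP's first greedy step: column most correlated with the residual (= y)
    return np.argmax(np.abs(X.T @ y))

print(first_omp_pick(X, y))  # 0: the truly relevant column
X[:, 5] *= 100               # inflate an irrelevant column's scale
print(first_omp_pick(X, y))  # most likely 5: scale alone sways the pick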

@jnothman jnothman changed the title from [MRG] FIX normalize to unit variance in [sparse_]center_data to [WIP] FIX normalize to unit variance in [sparse_]center_data Mar 29, 2014

Owner

jnothman commented Mar 30, 2014

I've rebased this as a minimal changeset following #3019. It's here to highlight what difference feature scale makes so we can decide if it's worth fixing it to unit variance.

Owner

jnothman commented Aug 10, 2014

I'm closing this, presuming Not Fixing.

@jnothman jnothman closed this Aug 10, 2014
