Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) #6395

Conversation

yenchenlin
Copy link
Contributor

This is a follow up PR from #6302

Can @TomDLT please check it?
Thanks!

@TomDLT
Copy link
Member

TomDLT commented Feb 18, 2016

What is the purpose of supporting sparse y if we make it dense?

You should adapt the checking to handle sparse y.

@yenchenlin
Copy link
Contributor Author

@TomDLT Sorry for my misunderstanding.
Do you mean I need to remove

    y = np.asarray(y)
    if y.ndim != 1 and not multilabel:
        raise ValueError("expected y of shape (n_samples,), got %r"
                         % (y.shape,))

and do the check which original code do to X at the following line?
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/svmlight_format.py#L393

@yenchenlin yenchenlin changed the title Make dump_svmlight_file support sparse y (fixes #6301) [WIP] Make dump_svmlight_file support sparse y (fixes #6301) Feb 18, 2016
@TomDLT
Copy link
Member

TomDLT commented Feb 18, 2016

You need to remove y = np.asarray(y) and use check_array with ensure_2d=False.
But we have to keep the check of y.ndim if not multilabel.

You also have to sort indices (as for X), to modify _dump_svmlight to handle CSR y, and to update the tests (sklearn/datasets/tests/test_svmlight_format.py)

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch 4 times, most recently from 2101e1e to 4cc198c Compare February 20, 2016 04:24
@yenchenlin
Copy link
Contributor Author

Hello @TomDLT ,

I've modified the code.
Could you please have a look?

@yenchenlin yenchenlin changed the title [WIP] Make dump_svmlight_file support sparse y (fixes #6301) [MRG] Make dump_svmlight_file support sparse y (fixes #6301) Feb 20, 2016
@@ -262,6 +263,16 @@ def test_dump_multilabel():
assert_equal(f.readline(), b("0,2 \n"))
assert_equal(f.readline(), b("0,1 1:5 3:1\n"))

# test if y is sparse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can avoid code duplication with:

y_dense = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
y_csr = sp.csr_matrix(y_dense)
for y in [y_dense, y_csr]:
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!
Thanks for the review.

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from 4cc198c to 295be2b Compare February 22, 2016 10:30
@TomDLT
Copy link
Member

TomDLT commented Feb 22, 2016

Could you also adapt test_dump() ?
(with something like for X, y in zip((Xs, Xd, Xsliced), (ys, yd, ysliced)):)

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch 2 times, most recently from cc68e62 to 3a14be5 Compare February 22, 2016 11:46
@yenchenlin
Copy link
Contributor Author

Hello @TomDLT ,

Done!
Please have a look.

# default anymore.

# make y conforms to shape: (n_samples, n_labels)
if (sp.issparse(y) and y.shape[0] == 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it just for y_sliced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean

if (sp.issparse(y) and y.shape[0] == 1):

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes
is it for y_sparse, y_dense or y_sliced ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because if y_dense is a 1d array which has shape (n_samples, ), turning it into a csr_matrix will make its shape become (1, n_samples).
However, sparse matrix passed into dump_svmlight_file must have shape (n_samples, n_labels).
Therefore I add

if (sp.issparse(y) and y.shape[0] == 1):
    y = y.T

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from 3a14be5 to a4ddee8 Compare February 22, 2016 14:55
assert_array_almost_equal(
X_dense.astype(dtype), X2_dense, 4)
assert_array_almost_equal(
y_dense.astype(dtype), yd, 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably want to compare yd and y2, or y_dense and y2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to compare a dense array transformed from sparse array, which is yd in this case, and y2.
y_dense is dense but is not transformed from sparse array. So I declare yd here.

Do you think I need to rename yd to something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you don't compare with y2 right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Thanks for pointing out the mistakes.

I just read the code again ...
Maybe we can just test y_dense.astype(dtype) and y2 here, and don't need a dense matrix which is transformed from sparse matrix like yd ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified the code.

A dense array transformed from sparse array is not needed ...
I misunderstand the code here - I thought there will be a rounding error when transforming a sparse matrix into dense matrix.

Would @TomDLT please check again? 🙏

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch 2 times, most recently from 81d9e28 to c27969d Compare February 22, 2016 16:20
assert_array_almost_equal(
y_dense.astype(dtype), y2, 15)

if not sp.issparse(y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean remove

if not sp.issparse(y):
    assert_array_equal(y, y2)

?

I do this because y is not equal to y2 when y refers to sparse array (y_sparse and y_sliced)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this line is to verify that y is correctly dumped and reloaded (into y2).
So we check that y and y2 are equal, but you added this verification above (between y2 and y_dense).
So you can remove both lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I got it!
Done!

@TomDLT
Copy link
Member

TomDLT commented Feb 22, 2016

LGTM

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from c27969d to a2048bb Compare February 23, 2016 02:40
@yenchenlin
Copy link
Contributor Author

Great thanks to @TomDLT 's explanation and review.

@yenchenlin yenchenlin changed the title [MRG] Make dump_svmlight_file support sparse y (fixes #6301) [MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) Feb 23, 2016
y : array-like, shape = [n_samples] or [n_samples, n_labels]
Target values. Class labels must be an integer or float, or array-like
objects of integer or float for multilabel classifications.
y : {array-like, sparse matrix}, shape = [n_samples] or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do y : {array-like, sparse matrix}, shape = [n_samples (, n_labels)]

to keep in in a single line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out!
Code updated.

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from a2048bb to 7bc17da Compare February 25, 2016 18:02
@tguillemot
Copy link
Contributor

LGTM

@raghavrv
Copy link
Member

@TomDLT merge?

y = y.T

dump_svmlight_file(X.astype(dtype), y, f, comment="test",
zero_based=zero_based)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just align this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@yenchenlin yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from 7bc17da to 22d7cd5 Compare March 23, 2016 17:23
@TomDLT
Copy link
Member

TomDLT commented Mar 23, 2016

Thanks @yenchenlin1994 !

@TomDLT TomDLT merged commit eed5fc5 into scikit-learn:master Mar 23, 2016
@yenchenlin yenchenlin deleted the make-dump_svmlight_file-support-sparse-y branch March 23, 2016 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants