[MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) #6395

yenchenlin · 2016-02-18T10:38:29Z

This is a follow-up PR to #6302.

Can @TomDLT please check it?
Thanks!

TomDLT · 2016-02-18T10:54:16Z

What is the purpose of supporting sparse y if we make it dense?

You should adapt the checking to handle sparse y.

yenchenlin · 2016-02-18T12:15:29Z

@TomDLT Sorry for my misunderstanding.
Do you mean I need to remove

    y = np.asarray(y)
    if y.ndim != 1 and not multilabel:
        raise ValueError("expected y of shape (n_samples,), got %r"
                         % (y.shape,))

and apply to y the same check that the original code applies to X at the following line?
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/svmlight_format.py#L393

TomDLT · 2016-02-18T12:45:49Z

You need to remove y = np.asarray(y) and use check_array with ensure_2d=False.
But we have to keep the check of y.ndim if not multilabel.

You also have to sort indices (as for X), to modify _dump_svmlight to handle CSR y, and to update the tests (sklearn/datasets/tests/test_svmlight_format.py)
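
The validation requested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the merged code: `validate_y` is a hypothetical helper name, and returning a sorted copy via `sorted_indices()` (rather than sorting in place) is a choice made here so the caller's matrix is left untouched.

```python
import scipy.sparse as sp
from sklearn.utils import check_array


def validate_y(y, multilabel=False):
    """Hypothetical helper: validate y without converting it to dense."""
    # ensure_2d=False lets a 1-D dense y through; sparse input stays CSR.
    y = check_array(y, accept_sparse="csr", ensure_2d=False)
    if sp.issparse(y):
        # Sparse matrices are always 2-D, so a single-label y is a column.
        if y.shape[1] != 1 and not multilabel:
            raise ValueError("expected y of shape (n_samples, 1), got %r"
                             % (y.shape,))
        # Sort column indices within each row, as is done for X.
        y = y.sorted_indices()
    elif y.ndim != 1 and not multilabel:
        raise ValueError("expected y of shape (n_samples,), got %r"
                         % (y.shape,))
    return y
```

The `y.ndim` check is kept for dense input only, because a sparse matrix always reports two dimensions.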

yenchenlin · 2016-02-20T05:30:06Z

Hello @TomDLT ,

I've modified the code.
Could you please have a look?

TomDLT · 2016-02-22T10:24:42Z

sklearn/datasets/tests/test_svmlight_format.py

@@ -262,6 +263,16 @@ def test_dump_multilabel():
    assert_equal(f.readline(), b("0,2 \n"))
    assert_equal(f.readline(), b("0,1 1:5 3:1\n"))

+    # test if y is sparse


you can avoid code duplication with:

    y_dense = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
    y_csr = sp.csr_matrix(y_dense)
    for y in [y_dense, y_csr]:
        ...

Done!
Thanks for the review.
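
A self-contained version of the loop suggested in this thread might look like the sketch below. The `X` values are illustrative and were chosen for this example; only the `y_dense` and `y_csr` definitions come from the review comment. It assumes a scikit-learn version that includes this change, so that a sparse `y` is accepted.

```python
from io import BytesIO

import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file

# Illustrative feature matrix (an assumption of this sketch).
X = [[1, 0, 3], [0, 0, 0], [0, 5, 0]]
y_dense = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
y_csr = sp.csr_matrix(y_dense)

dumps = []
for y in [y_dense, y_csr]:
    f = BytesIO()
    # multilabel=True writes the indices of the nonzero labels per row.
    dump_svmlight_file(X, y, f, multilabel=True)
    dumps.append(f.getvalue())
```

Comparing the two entries of `dumps` is one way to check that the dense and sparse code paths write the same file.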

TomDLT · 2016-02-22T10:32:34Z

Could you also adapt test_dump() ?
(with something like for X, y in zip((Xs, Xd, Xsliced), (ys, yd, ysliced)):)

yenchenlin · 2016-02-22T12:48:55Z

Hello @TomDLT ,

Done!
Please have a look.

TomDLT · 2016-02-22T13:12:50Z

sklearn/datasets/tests/test_svmlight_format.py

+                    # default anymore.
+
+                    # make y conforms to shape: (n_samples, n_labels)
+                    if (sp.issparse(y) and y.shape[0] == 1):


is it just for y_sliced?

you mean

if (sp.issparse(y) and y.shape[0] == 1):

?

yes
is it for y_sparse, y_dense or y_sliced ?

This is needed because if y_dense is a 1-D array of shape (n_samples,), converting it to a csr_matrix gives it shape (1, n_samples).
However, a sparse matrix passed to dump_svmlight_file must have shape (n_samples, n_labels).
Therefore I added

    if (sp.issparse(y) and y.shape[0] == 1):
        y = y.T
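
The shape behaviour described here can be reproduced with SciPy alone. The variable names `y_row` and `y_col` below are introduced for this sketch and do not appear in the PR.

```python
import numpy as np
import scipy.sparse as sp

y_dense = np.array([1.0, 0.0, 2.0])   # shape (n_samples,)
y_row = sp.csr_matrix(y_dense)        # 1-D input becomes a single row
# Transpose so that each sample gets its own row.
y_col = y_row.T if y_row.shape[0] == 1 else y_row

print(y_row.shape, y_col.shape)
```

Note that a check on `shape[0] == 1` cannot tell a transposed 1-D target apart from a genuine single-sample multilabel target, so it is only suitable where the data is known, as in a test.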

TomDLT · 2016-02-22T15:53:10Z

sklearn/datasets/tests/test_svmlight_format.py

+                        assert_array_almost_equal(
+                            X_dense.astype(dtype), X2_dense, 4)
+                        assert_array_almost_equal(
+                            y_dense.astype(dtype), yd, 4)


you probably want to compare yd and y2, or y_dense and y2

I want to compare y2 with a dense array converted from a sparse array, which is yd in this case.
y_dense is dense but was not converted from a sparse array, so I declared yd here.

Do you think I need to rename yd to something else?

but you don't compare with y2 right?

💡 Thanks for pointing out the mistakes.

I just read the code again ...
Maybe we can just compare y_dense.astype(dtype) with y2 here, without needing a dense matrix converted from a sparse one like yd?

I've modified the code.

A dense array converted from a sparse array is not needed.
I misunderstood the code here: I thought there would be a rounding error when converting a sparse matrix to a dense matrix.

Would @TomDLT please check again? 🙏

TomDLT · 2016-02-22T16:31:26Z

sklearn/datasets/tests/test_svmlight_format.py

+                        assert_array_almost_equal(
+                            y_dense.astype(dtype), y2, 15)
+
+                    if not sp.issparse(y):


I think you can remove this

You mean remove

    if not sp.issparse(y):
        assert_array_equal(y, y2)

?

I did this because y is not equal to y2 when y is a sparse array (y_sparse and y_sliced).

The goal of this line is to verify that y is correctly dumped and reloaded (into y2).
So we check that y and y2 are equal, but you added this verification above (between y2 and y_dense).
So you can remove both lines.

Yeah I got it!
Done!
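
The dump-and-reload check discussed in this thread can be illustrated with a small round trip. This is a sketch under assumptions, not the test from the PR: the data is made up, the labels are kept nonzero to keep the example simple, and it requires a scikit-learn version that includes this change.

```python
from io import BytesIO

import numpy as np
import scipy.sparse as sp
from numpy.testing import assert_array_almost_equal
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Illustrative data (an assumption of this sketch).
X = np.array([[1.0, 0.0, 3.0], [0.0, 5.0, 0.0], [2.0, 0.0, 0.0]])
y_dense = np.array([1.0, 2.0, 3.0])
y_sparse = sp.csr_matrix(y_dense).T   # shape (n_samples, 1)

f = BytesIO()
dump_svmlight_file(X, y_sparse, f)
f.seek(0)
X2, y2 = load_svmlight_file(f)

# A single comparison against the dense reference covers the sparse case,
# so no separate check guarded by sp.issparse(y) is needed.
assert_array_almost_equal(X, X2.toarray())
assert_array_almost_equal(y_dense, y2)
```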

TomDLT · 2016-02-22T16:31:36Z

LGTM

yenchenlin · 2016-02-23T02:42:14Z

Many thanks to @TomDLT for the explanation and review.

raghavrv · 2016-02-25T17:27:08Z

sklearn/datasets/svmlight_format.py

-    y : array-like, shape = [n_samples] or [n_samples, n_labels]
-        Target values. Class labels must be an integer or float, or array-like
-        objects of integer or float for multilabel classifications.
+    y : {array-like, sparse matrix}, shape = [n_samples] or


I'd do y : {array-like, sparse matrix}, shape = [n_samples (, n_labels)]

to keep it on a single line

Thanks for pointing this out!
Code updated.

tguillemot · 2016-03-23T15:07:07Z

LGTM

raghavrv · 2016-03-23T15:11:40Z

@TomDLT merge?

TomDLT · 2016-03-23T17:18:22Z

sklearn/datasets/tests/test_svmlight_format.py

+                        y = y.T
+
+                    dump_svmlight_file(X.astype(dtype), y, f, comment="test",
+                                    zero_based=zero_based)


can you just align this?

TomDLT · 2016-03-23T18:19:39Z

Thanks @yenchenlin1994 !

yenchenlin changed the title ~~Make dump_svmlight_file support sparse y (fixes #6301)~~ [WIP] Make dump_svmlight_file support sparse y (fixes #6301) Feb 18, 2016

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch 4 times, most recently from 2101e1e to 4cc198c Compare February 20, 2016 04:24

yenchenlin changed the title ~~[WIP] Make dump_svmlight_file support sparse y (fixes #6301)~~ [MRG] Make dump_svmlight_file support sparse y (fixes #6301) Feb 20, 2016

TomDLT reviewed Feb 22, 2016

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from 4cc198c to 295be2b Compare February 22, 2016 10:30

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch 2 times, most recently from cc68e62 to 3a14be5 Compare February 22, 2016 11:46

TomDLT reviewed Feb 22, 2016

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from 3a14be5 to a4ddee8 Compare February 22, 2016 14:55

TomDLT reviewed Feb 22, 2016

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch 2 times, most recently from 81d9e28 to c27969d Compare February 22, 2016 16:20

TomDLT reviewed Feb 22, 2016

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from c27969d to a2048bb Compare February 23, 2016 02:40

yenchenlin changed the title ~~[MRG] Make dump_svmlight_file support sparse y (fixes #6301)~~ [MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) Feb 23, 2016

raghavrv reviewed Feb 25, 2016

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from a2048bb to 7bc17da Compare February 25, 2016 18:02

TomDLT reviewed Mar 23, 2016

Make dump_svmlight_file support sparse y

22d7cd5

yenchenlin force-pushed the make-dump_svmlight_file-support-sparse-y branch from 7bc17da to 22d7cd5 Compare March 23, 2016 17:23

TomDLT merged commit eed5fc5 into scikit-learn:master Mar 23, 2016

yenchenlin mentioned this pull request Mar 23, 2016

[MRG+1] Raise error when y is passed as a sparse matrix into dump_svmlight_file (fixes #6301) #6302

Closed

yenchenlin deleted the make-dump_svmlight_file-support-sparse-y branch March 23, 2016 18:23
