-
-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) #6395
[MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) #6395
Conversation
What is the purpose of supporting sparse You should adapt the checking to handle sparse |
@TomDLT Sorry for my misunderstanding. y = np.asarray(y)
if y.ndim != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples,), got %r"
% (y.shape,)) and do the check which original code do to X at the following line? |
You need to remove You also have to sort indices (as for X), to modify |
2101e1e
to
4cc198c
Compare
Hello @TomDLT , I've modified the code. |
@@ -262,6 +263,16 @@ def test_dump_multilabel(): | |||
assert_equal(f.readline(), b("0,2 \n")) | |||
assert_equal(f.readline(), b("0,1 1:5 3:1\n")) | |||
|
|||
# test if y is sparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can avoid code duplication with:
y_dense = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
y_csr = sp.csr_matrix(y_dense)
for y in [y_dense, y_csr]:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Thanks for the review.
4cc198c
to
295be2b
Compare
Could you also adapt |
cc68e62
to
3a14be5
Compare
Hello @TomDLT , Done! |
# default anymore. | ||
|
||
# make y conforms to shape: (n_samples, n_labels) | ||
if (sp.issparse(y) and y.shape[0] == 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it just for y_sliced
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean
if (sp.issparse(y) and y.shape[0] == 1):
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
is it for y_sparse, y_dense or y_sliced ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is because if y_dense
is a 1d array which has shape (n_samples, ), turning it into a csr_matrix will make its shape become (1, n_samples).
However, sparse matrix passed into dump_svmlight_file
must have shape (n_samples, n_labels).
Therefore I add
if (sp.issparse(y) and y.shape[0] == 1):
y = y.T
3a14be5
to
a4ddee8
Compare
assert_array_almost_equal( | ||
X_dense.astype(dtype), X2_dense, 4) | ||
assert_array_almost_equal( | ||
y_dense.astype(dtype), yd, 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably want to compare yd
and y2
, or y_dense
and y2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to compare a dense array transformed from sparse array, which is yd
in this case, and y2
.
y_dense
is dense but is not transformed from sparse array. So I declare yd
here.
Do you think I need to rename yd
to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but you don't compare with y2
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Thanks for pointing out the mistakes.
I just read the code again ...
Maybe we can just test y_dense.astype(dtype)
and y2
here, and don't need a dense matrix which is transformed from sparse matrix like yd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've modified the code.
A dense array transformed from sparse array is not needed ...
I misunderstand the code here - I thought there will be a rounding error when transforming a sparse matrix into dense matrix.
Would @TomDLT please check again? 🙏
81d9e28
to
c27969d
Compare
assert_array_almost_equal( | ||
y_dense.astype(dtype), y2, 15) | ||
|
||
if not sp.issparse(y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean remove
if not sp.issparse(y):
assert_array_equal(y, y2)
?
I do this because y
is not equal to y2
when y
refers to sparse array (y_sparse
and y_sliced
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this line is to verify that y
is correctly dumped and reloaded (into y2
).
So we check that y
and y2
are equal, but you added this verification above (between y2
and y_dense
).
So you can remove both lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I got it!
Done!
LGTM |
c27969d
to
a2048bb
Compare
Great thanks to @TomDLT 's explanation and review. |
y : array-like, shape = [n_samples] or [n_samples, n_labels] | ||
Target values. Class labels must be an integer or float, or array-like | ||
objects of integer or float for multilabel classifications. | ||
y : {array-like, sparse matrix}, shape = [n_samples] or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd do y : {array-like, sparse matrix}, shape = [n_samples (, n_labels)]
to keep in in a single line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out!
Code updated.
a2048bb
to
7bc17da
Compare
LGTM |
@TomDLT merge? |
y = y.T | ||
|
||
dump_svmlight_file(X.astype(dtype), y, f, comment="test", | ||
zero_based=zero_based) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just align this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
7bc17da
to
22d7cd5
Compare
Thanks @yenchenlin1994 ! |
This is a follow up PR from #6302
Can @TomDLT please check it?
Thanks!