[MRG] Modifies T-SNE for sparse matrix #10206
Conversation
I think you should just be going through each row, raising an error if the number of nonzeros is less than n_neighbors, otherwise argsorting, and taking neigh_ind[i] = row.indices[order][:n]. Is that not right?
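For concreteness, a minimal sketch of that per-row approach (the function name and error text are mine, not the PR's code; assumes dist is a CSR distance matrix):

import numpy as np

def sparse_kneighbors(dist, n_neighbors):
    # Only explicitly stored entries are treated as known distances.
    n_samples = dist.shape[0]
    neigh_dist = np.zeros((n_samples, n_neighbors))
    neigh_ind = np.zeros((n_samples, n_neighbors), dtype=int)
    for i in range(n_samples):
        row = dist.getrow(i)
        if row.nnz < n_neighbors:
            raise ValueError("row %d stores fewer than %d neighbors"
                             % (i, n_neighbors))
        order = np.argsort(row.data)      # sort the stored distances only
        neigh_ind[i] = row.indices[order][:n_neighbors]
        neigh_dist[i] = row.data[order][:n_neighbors]
    return neigh_dist, neigh_ind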
sklearn/neighbors/base.py
Outdated
row = dist.getrow(i)
non_zero = row.size
j_ = 0
for j in range(0, n_neighbors+non_zero):
Spaces around + please
But I don't really get what this is doing (the names j_ and j don't help!)
sklearn/neighbors/base.py
Outdated
non_zero = row.size
j_ = 0
for j in range(0, n_neighbors+non_zero):
    if j not in row.indices:
this is slow...
@jnothman Why do we raise an error if the number of nonzeros is less than n_neighbors? Also, what I'm attempting with j and j_ is to build a list of the indices whose value is 0. Is there a simple function that does the same? I couldn't find one. The tests will be modified once this is approved. Thanks for the review.
We want to ignore implicit 0 distances. We are trying to support sparse precomputed matrix semantics similar to DBSCAN's, like the output of kneighbors_graph. There, the "zeros" implied by the sparsity structure do not mean "zero distance" (i.e. identical points) but "farther than the number of required nearest neighbours". Thus, if the matrix does not report enough neighbours to calculate tSNE, an error should be raised.
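To illustrate with a hypothetical three-point example (values are mine, not from the thread):

import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [10.0]])
G = kneighbors_graph(X, n_neighbors=1, mode="distance")
# Each row stores only its single nearest neighbour; all other cells
# are implicit zeros meaning "not among the k nearest", not "distance 0".
print(G.toarray())
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 9. 0.]]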
@jnothman I guess this is what you intended. But after this change, the snippet given in the original issue itself fails with n_neighbors=40. Can you please check if this is in line with what you said?
Is this the failure you're talking about?
ValueError Traceback (most recent call last)
<ipython-input-1-6e15fc3677d3> in <module>()
5 bt = BallTree(X, leaf_size=300)
6 distances = kneighbors_graph(bt, n_neighbors=40, mode="distance", metric="cosine")
----> 7 X_embedded = TSNE(n_components=2, metric="precomputed").fit_transform(distances)
/Users/joel/repos/scikit-learn/sklearn/manifold/t_sne.py in fit_transform(self, X, y)
845 Embedding of the training data in low-dimensional space.
846 """
--> 847 embedding = self._fit(X)
848 self.embedding_ = embedding
849 return self.embedding_
/Users/joel/repos/scikit-learn/sklearn/manifold/t_sne.py in _fit(self, X, skip_num_points)
712 t0 = time()
713 distances_nn, neighbors_nn = knn.kneighbors(
--> 714 None, n_neighbors=k)
715 duration = time() - t0
716 if self.verbose:
/Users/joel/repos/scikit-learn/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
367 non_zero = row.size
368 if non_zero < n_neighbors:
--> 369 raise ValueError("Invalid Format")
370 else:
371 neigh_ind[i][:n_neighbors] = row.indices[np.argsort(row.data)][:n_neighbors]
ValueError: Invalid Format
Yes, this Invalid Format error was added by me; it is raised when the number of non-zeros is less than n_neighbors.
Yes, the default perplexity is too high for 40 neighbors to be sufficient. It is correct to throw an error, but the error message is not appropriate atm. perplexity=12 is the maximum for that data.
Yeah, I wasn't sure what the error was here, so I left the error message as-is for the time being. Is the implementation correct? Please suggest an error message. Also, if this is correct, should I move on to fixing the tests?
Well, the error message in NearestNeighbors is "Not enough neighbors in sparse precomputed matrix to get {n_neighbors} nearest". But you might want a different one in TSNE that says "Reduce the perplexity".
Adding a TSNE-specific error message would require raising the error and message within TSNE itself. Is this what you want?
Either you perform the check in TSNE (negligible cost, fine by me) or detect the error from neighbors in TSNE and raise with a different message.
@jnothman I have added what you suggested, but I doubt it is correct. Even when n_neighbors is set to 1 in the sample code given in the issue, the number of non-zeros is less than n_neighbors. Am I missing something?
"when the neighbour is set to 1" I'm not sure what you're referring to. Link to the relevant issue/comment? |
Please provide runnable snippets, or failing unit tests if you want me to understand what behaviour you think is incorrect.
You need to add tests anyway.
sklearn/neighbors/base.py
Outdated
if non_zero < n_neighbors:
    raise ValueError("Invalid Format")
else:
    neigh_ind[i][:n_neighbors] = row.indices[np.argsort(row.data)][:n_neighbors]
surely you need to update the distances too.
Needs tests.
I am not aware of how the distances are to be updated. Can you give me an example of an existing implementation?
Also, tests will be attempted once the implementation is correct.
sklearn/neighbors/base.py
Outdated
    sample_range, np.argsort(dist[sample_range, neigh_ind])]
if issparse(dist):
    neigh_ind = np.zeros((dist.shape[0], n_neighbors))
    for i in range(0, dist.shape[0]):
You still need to raise an error here, if we're to implement this in sklearn.neighbors and not just in TSNE. Perhaps use if np.any(dist.getnnz(axis=1) < n_neighbors): raise ...?
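Spelled out on a hypothetical toy matrix (the message wording follows the one adopted later in this thread):

import numpy as np
from scipy.sparse import csr_matrix

dist = csr_matrix(np.array([[0., 2., 0.],
                            [1., 0., 0.],
                            [3., 4., 0.]]))
n_neighbors = 2
# getnnz(axis=1) counts the stored entries per row: here [1, 1, 2],
# so rows 0 and 1 cannot supply 2 neighbours and the check fires.
if np.any(dist.getnnz(axis=1) < n_neighbors):
    raise ValueError("Not enough neighbors in sparse precomputed "
                     "matrix to get %d nearest neighbors" % n_neighbors)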
sklearn/manifold/t_sne.py
Outdated
@@ -714,6 +699,13 @@ def _fit(self, X, skip_num_points=0):
    if self.verbose:
        print("[t-SNE] Computing {} nearest neighbors...".format(k))

    for i in range(0, X.shape[0]):
Use if np.any(dist.getnnz(axis=1) < n_neighbors): raise ...?
@jnothman This is what I am referring to. Here I changed n_neighbors from 40 to 1, and the error is still raised because the number of non-zeros is less than n_neighbors. Is this normal, or is there something wrong with my implementation?
import numpy as np
from sklearn.neighbors import BallTree, kneighbors_graph
from sklearn.manifold import TSNE
X = np.random.randn(100, 10)
bt = BallTree(X, leaf_size=300)
distances = kneighbors_graph(bt, n_neighbors=1, mode="distance", metric="cosine")
X_embedded = TSNE(n_components=2, metric="precomputed").fit_transform(distances)
You will have a better idea of what a correct implementation is by writing the tests, I would think. Reducing the number of neighbours to 1 will only exacerbate the problem. Try with more neighbours or with higher perplexity. Look in the tSNE implementation for how n_neighbours is calculated for a given perplexity.
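For reference, the relevant computation in the scikit-learn t-SNE source at the time was roughly the following (treat the exact formula as an assumption of this sketch):

# t-SNE requests about three times as many neighbours as the perplexity.
n_samples = 100
perplexity = 30  # the TSNE default
k = min(n_samples - 1, int(3. * perplexity + 1))
print(k)  # 91 -- far more than the 40 neighbours stored in the graph above
# With perplexity=12, k = 37 <= 40, so the precomputed graph would suffice.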
The tests have been added; kindly review. Also, @jnothman, the tests are failing with the error getnnz() got an unexpected keyword argument 'axis', though this works fine locally.
Regarding explicit/implicit zeros, you might find http://scikit-learn.org/dev/glossary.html#term-sparse-matrix helpful.
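A small illustration of the distinction, with hypothetical values (note that getnnz(axis=...) needs a SciPy version recent enough to support the axis keyword, which is exactly the CI failure above):

import numpy as np
from scipy.sparse import csr_matrix

# Row 0 stores an explicit zero at the diagonal; row 1 stores nothing.
data = np.array([0.0, 0.5])
indices = np.array([0, 1])
indptr = np.array([0, 2, 2])
D = csr_matrix((data, indices, indptr), shape=(2, 2))
print(D.getnnz(axis=1))  # [2 0] -- explicit zeros count as stored entries
print(D.toarray())       # both kinds of zero look identical once densified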
sklearn/manifold/tests/test_t_sne.py
Outdated
dist = np.zeros((3, 3))
dist_csr = sp.csr_matrix(dist)
tsne = TSNE(metric="precomputed")
X_transformed_dense = tsne.fit_transform(dist)
Check that tSNE results for sparse X are the same as for dense X where X is a feature matrix and where X is a precomputed distance matrix with all zeros explicit
Does this properly implement what you intended by this statement? I am not sure, since the output of neighbors inside TSNE.fit gives the same result. I guess this function needs modification.
No, I don't mean that the entire matrix is zeros. I mean that the zero distances on the diagonal should be stored explicitly in X.data, as opposed to elements not explicitly stored (and hence assumed to have a value of zero).
You can construct an all-explicit CSR matrix with:
def array_to_explicit_csr(X):
    return csr_matrix((X.ravel(),
                       np.tile(np.arange(X.shape[1]), X.shape[0]),
                       np.arange(0, X.size + 1, X.shape[1])))
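For example, applied to a hypothetical 2x2 distance matrix, every cell (including the zero diagonal) ends up in .data:

import numpy as np
from scipy.sparse import csr_matrix  # used by array_to_explicit_csr above

X = np.array([[0., 1.],
              [1., 0.]])
X_csr = array_to_explicit_csr(X)  # the function defined just above
print(X_csr.data)            # [0. 1. 1. 0.] -- diagonal zeros stored explicitly
print(X_csr.getnnz(axis=1))  # [2 2], even though two stored values are zero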
Would you like me to write the tests for you, and you fix up the implementation? Let me know.
@jnothman Yeah, I need help with the tests. The implementation can be corrected using these explicit zeros; I didn't realize this was possible, which is why I used the current workaround.
You shouldn't do that kind of thing in the implementation, as it would mean the data is no longer sparse. But yes, the user can produce an array where some 0s are implicit and some explicit. I'll try to come back with tests soon.
sklearn/utils/fixes.py
Outdated
def getnnz(X, axis=None):
    if axis is None:
please reduce indent
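For context, a backport for SciPy versions whose getnnz() lacks the axis keyword could look something like this for CSR input (a sketch under that assumption, not the PR's exact fixes.py code):

import numpy as np

def getnnz(X, axis=None):
    if axis is None:
        return X.nnz
    if axis == 1:
        # For CSR, the index pointer delimits each row's stored entries,
        # so consecutive differences give the per-row counts.
        return np.diff(X.indptr)
    raise NotImplementedError("axis=%r is not supported" % (axis,))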
sklearn/neighbors/base.py
Outdated
raise ValueError("Not enough neighbors in sparse " | ||
"precomputed matrix to get {} " | ||
"nearest neighbors".format(n_neighbors)) | ||
if dist.diagonal().min() < 0 or dist.diagonal().max() > 0 or \ |
The requirement that the diagonal is 0 only applies if X == self._fit_X. The requirement of symmetry should not exist. Precomputed sparse distance matrices (and hacked dense ones) -- even for X == self._fit_X -- are not necessarily symmetric in all their cells, only in those cells that are present (and finite/reachable in the case of dense) on both sides of the diagonal. Just remove this validation.
if issparse(dist):
    print "Dist being printed \n"
    print dist.toarray()
    print dist.indices
https://gist.github.com/thechargedneutron/c2f95ac46a9ab61beb0de8a91a9a533f
Running the above code (tests provided by @jnothman) on this branch produces the following dist matrix:
[[0. 0. 0. 0. 0. 0. 0.56968608 0. 0.44086692 0. ]
[0. 0. 0.71768226 0. 0. 0.71600003 0. 0. 0. 0. ]
[0. 0.71768226 0. 0. 0. 0. 0.51642623 0. 0. 0. ]
[0. 0. 0. 0. 0. 0.31561065 0. 0.51627111 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.44107968 0. 0. 0.51790786]
[0. 0. 0. 0.31561065 0. 0. 0. 0.38322922 0. 0.]
[0. 0. 0.51642623 0. 0.44107968 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.38322922 0. 0. 0. 0.40501049]
[0.44086692 0. 0. 0. 0.88847197 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.50529526 0. 0.40501049 0. 0.]]
And the corresponding indices of dist are:
[0 8 6 1 5 2 2 6 1 3 5 7 4 6 9 5 3 7 6 4 2 7 5 9 8 0 4 9 7 5]
And the expected result for the indices is:
[[8 6 4]
[5 2 9]
[6 1 4]
[5 7 4]
[6 9 5]
[3 7 9]
[4 2 0]
[5 9 3]
[0 4 6]
[7 5 4]]
But in row 1, for example, how are we supposed to get 4 as an index when it is passed as a sparse array with a zero at that position (visible from the indices output)? Am I missing something?
I know this issue is taking too long. Sorry for this.
This dist matrix gives you 2 neighbors, or 3 if you include self as explicit zeros.
The expected result of indices corresponds to 3 neighbors, without self.
The fix is replacing nn.kneighbors_graph(X_train.copy(), mode='distance') in the gist with nn.kneighbors_graph(None, mode='distance'), which is the correct syntax to get 3 neighbors without self.
In base.py you also have to uncomment:
if query_is_train:
    # this is done to add self as nearest neighbor
    neigh_ind = np.concatenate((sample_range, neigh_ind), axis=1)
    neigh_ind = neigh_ind[:, :-1]
oh my mistake, you are testing the explicit diagonal case...
The expected result of indices is wrong
@TomDLT Is the fix you provided above for the gist still valid, now that you have acknowledged the fault in your earlier comment? I am not familiar with generating a CSR matrix from kneighbors_graph.
I might have made a mistake in the tests for the X is not None case. In the case where the diagonal zeros are left implicit, and the prediction data is being passed in explicitly, that should not be the expected output.
Basically, if the sparse matrix does not leave its zero diagonal implicit (i.e. the diagonal zeros are stored explicitly), a query with X = the training data should return results like when X=None. I suppose this is best done by adding an implicit_diagonal parameter to the check function.
@thechargedneutron, I suspect we are going to absorb this PR into #10482.
We might end up not supporting the explicit diagonal case.
I'm not sure reviewing this altogether is a great use of my time, which is becoming much more limited for the next five months. I may try to look at some of the recent changes, but can I suggest that we do the following: constrain this PR (or a new one in its place) to the case of allowing sparse matrices in tSNE without metric='precomputed'. That should be a straightforward change. And you take a good look at #10482, and either collaborate with @TomDLT, or ask if he wants you to take over its primary development. It's an extension of what you have here in terms of generalised support for sparse precomputed input to nearest neighbors-based estimators. Although, by describing the API more formally, we can get away with simplifying the implementation by excluding some annoying edge cases, such as saying "We're going to assume that the input is like the output from
Soo... are we closing this? Is it appropriate for #10482 to include sparse support in BHTSNE in the non-precomputed case?
This work has been merged as part of #10482.
Reference Issues/PRs
Fixes #9691
What does this implement/fix? Explain your changes.
Modifies sklearn/neighbors/base.py
Any other comments?
Not sure about the complexity of the modification.