
[MRG] Modifies T-SNE for sparse matrix #10206

Closed

Conversation

@thechargedneutron (Contributor)

Reference Issues/PRs

Fixes #9691

What does this implement/fix? Explain your changes.

Modifies sklearn/neighbors/base.py

Any other comments?

Not sure about the complexity of the modification.

@jnothman (Member) left a comment

I think you should just be going through each row, raising an error if the number of nonzeros is less than n_neighbors, otherwise argsorting, and taking neigh_ind[i] = row.indices[order][:n]. Is that not right?
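For illustration, here is a rough, self-contained sketch of that per-row approach on toy data (names such as neigh_dist, and the NearestNeighbors setup, are illustrative assumptions, not the PR's actual code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
dist = NearestNeighbors(n_neighbors=5).fit(X).kneighbors_graph(mode='distance')  # CSR matrix
n_neighbors = 3

neigh_ind = np.zeros((dist.shape[0], n_neighbors), dtype=int)
neigh_dist = np.zeros((dist.shape[0], n_neighbors))
for i in range(dist.shape[0]):
    row = dist.getrow(i)
    if row.nnz < n_neighbors:            # not enough stored distances in this row
        raise ValueError("Row %d has only %d stored distances." % (i, row.nnz))
    order = np.argsort(row.data)         # argsort the stored (nonzero) distances
    neigh_ind[i] = row.indices[order][:n_neighbors]
    neigh_dist[i] = row.data[order][:n_neighbors]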

row = dist.getrow(i)
non_zero = row.size
j_ = 0
for j in range(0, n_neighbors+non_zero):
Member:

Spaces around + please

Member:

But I don't really get what this is doing (the names j_ and j don't help!)

non_zero = row.size
j_ = 0
for j in range(0, n_neighbors+non_zero):
    if j not in row.indices:
Member:

this is slow...

@thechargedneutron (Contributor, Author)

@jnothman Why do we raise an error if the number of nonzeros is less than n_neighbors?

Also, what I am attempting with those j and j_ is to get a list of the indices whose value is 0. Is there a simple function that does this? I couldn't find one.

The tests will be modified once this is approved. Thanks for the review.

@jnothman (Member) commented Nov 27, 2017 via email

@thechargedneutron (Contributor, Author)

@jnothman I guess this is what you intended. But after this change, the snippet given in the original issue itself fails with n_neighbors=40. Can you please check whether this is in line with what you said?

@jnothman (Member)

Is this the failure you're talking about?

ValueError                                Traceback (most recent call last)
<ipython-input-1-6e15fc3677d3> in <module>()
      5 bt = BallTree(X, leaf_size=300)
      6 distances = kneighbors_graph(bt, n_neighbors=40, mode="distance", metric="cosine")
----> 7 X_embedded = TSNE(n_components=2, metric="precomputed").fit_transform(distances)

/Users/joel/repos/scikit-learn/sklearn/manifold/t_sne.py in fit_transform(self, X, y)
    845             Embedding of the training data in low-dimensional space.
    846         """
--> 847         embedding = self._fit(X)
    848         self.embedding_ = embedding
    849         return self.embedding_

/Users/joel/repos/scikit-learn/sklearn/manifold/t_sne.py in _fit(self, X, skip_num_points)
    712             t0 = time()
    713             distances_nn, neighbors_nn = knn.kneighbors(
--> 714                 None, n_neighbors=k)
    715             duration = time() - t0
    716             if self.verbose:

/Users/joel/repos/scikit-learn/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    367                     non_zero = row.size
    368                     if non_zero < n_neighbors:
--> 369                         raise ValueError("Invalid Format")
    370                     else:
    371                         neigh_ind[i][:n_neighbors] = row.indices[np.argsort(row.data)][:n_neighbors]

ValueError: Invalid Format

@thechargedneutron (Contributor, Author)

Yes, this Invalid Format error was added by me; it is raised when the number of non-zeros is less than n_neighbors.

@jnothman (Member)

Yes, the default perplexity is too high for 40 neighbors to be sufficient. It is correct to throw an error, but the error message is not appropriate at the moment. perplexity=12 is the maximum for that data.
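For context, a hedged sketch of the arithmetic behind this, assuming the t_sne.py of that era requested k = min(n_samples - 1, int(3. * perplexity + 1)) neighbors:

# Assumption: the neighbor count was chosen roughly as below in t_sne.py.
n_samples = 100                                # X = np.random.randn(100, 10) in the issue snippet
perplexity = 30.0                              # TSNE default
k = min(n_samples - 1, int(3. * perplexity + 1))
print(k)                                       # 91 > 40 stored distances per row -> the error above
print(min(n_samples - 1, int(3. * 12 + 1)))    # 37 <= 40, so perplexity=12 fits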

@thechargedneutron (Contributor, Author)

Yeah, I was not sure what the error actually was, so I left the error message as a placeholder for the time being. Is the implementation correct? Could you suggest an error message? Also, if this is correct, should I move on to fixing the tests?

@jnothman (Member) commented Nov 28, 2017 via email

@thechargedneutron (Contributor, Author)

Adding an error message for TSNE would require me to add the error and its message inside TSNE itself. Is this what you want?

@jnothman (Member) commented Nov 28, 2017 via email

@thechargedneutron (Contributor, Author)

@jnothman I have added what you suggested, but I doubt whether this is correct. Even when n_neighbors is set to 1, as in the sample code given in the issue, the number of non-zeros is less than n_neighbors. Am I missing something?

@jnothman (Member)

"when the neighbour is set to 1" I'm not sure what you're referring to. Link to the relevant issue/comment?

@jnothman (Member) left a comment

Please provide runnable snippets, or failing unit tests if you want me to understand what behaviour you think is incorrect.

You need to add tests anyway.

if non_zero < n_neighbors:
    raise ValueError("Invalid Format")
else:
    neigh_ind[i][:n_neighbors] = row.indices[np.argsort(row.data)][:n_neighbors]
Member:

surely you need to update the distances too.

Needs tests.

Contributor Author:

I am not sure how the distances are to be updated. Can you give me an example of an existing implementation that does this?

Also, tests will be attempted once the implementation is correct.

sample_range, np.argsort(dist[sample_range, neigh_ind])]
if issparse(dist):
    neigh_ind = np.zeros((dist.shape[0], n_neighbors))
    for i in range(0, dist.shape[0]):
Member:

You still need to raise an error here, if we're to implement this in sklearn.neighbors and not just in TSNE.

Perhaps use if np.any(dist.getnnz(axis=1) < n_neighbors): raise ...?
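A small self-contained sketch of that check (assuming dist is a CSR matrix and a SciPy version whose sparse getnnz accepts the axis keyword):

import numpy as np
from scipy.sparse import csr_matrix

dist = csr_matrix(np.array([[0., 1., 2.],
                            [3., 0., 4.],
                            [5., 6., 0.]]))
n_neighbors = 2
if np.any(dist.getnnz(axis=1) < n_neighbors):  # per-row count of stored entries
    raise ValueError("Not enough neighbors in sparse precomputed matrix "
                     "to get {} nearest neighbors".format(n_neighbors))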

@@ -714,6 +699,13 @@ def _fit(self, X, skip_num_points=0):
if self.verbose:
print("[t-SNE] Computing {} nearest neighbors...".format(k))

for i in range(0, X.shape[0]):
Member:

Use if np.any(dist.getnnz(axis=1) < n_neighbors): raise ...?

@thechargedneutron (Contributor, Author)

@jnothman This is what I am referring to. Here I changed n_neighbors from 40 to 1. The error is still raised because the number of non-zeros is less than n_neighbors. Is this normal, or is there something wrong with my implementation?

import numpy as np
from sklearn.neighbors import BallTree, kneighbors_graph
from sklearn.manifold import TSNE
X = np.random.randn(100, 10)

bt = BallTree(X, leaf_size=300)
distances = kneighbors_graph(bt, n_neighbors=1, mode="distance", metric="cosine")
X_embedded = TSNE(n_components=2, metric="precomputed").fit_transform(distances)

@jnothman (Member) commented Nov 30, 2017 via email

@thechargedneutron (Contributor, Author)

The tests have been added, kindly review. Also, @jnothman, the tests are failing with the error

getnnz() got an unexpected keyword argument 'axis'

This is working fine locally. Can you please look into this problem? A similar error was reported at the link below when I tried finding a solution for it: lyst/lightfm#87
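One possible workaround, sketched here as an assumption rather than what the PR ultimately did: for CSR input the per-row count of stored entries can be read from indptr, which avoids sparse getnnz(axis=...) on older SciPy versions that lack the axis argument.

import numpy as np
from scipy.sparse import csr_matrix

dist = csr_matrix(np.array([[0., 1., 2.],
                            [3., 0., 0.],
                            [4., 5., 6.]]))
row_nnz = np.diff(dist.indptr)          # stored entries per row
print(row_nnz)                          # [2 1 3]
n_neighbors = 2
print(np.any(row_nnz < n_neighbors))    # True: row 1 has too few stored distances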

@jnothman (Member) commented Jan 10, 2018 via email

dist = np.zeros((3, 3))
dist_csr = sp.csr_matrix(dist)
tsne = TSNE(metric="precomputed")
X_transformed_dense = tsne.fit_transform(dist)
Contributor Author:

@jnothman

Check that tSNE results for sparse X are the same as for dense X where X is a feature matrix and where X is a precomputed distance matrix with all zeros explicit

Does this properly implement what you intended by that statement? I am not sure, since the output of the neighbors computation inside TSNE.fit gives the same result. I guess this function needs modification.

Member:

No, I don't mean that the entire matrix is zeros. I mean that the zero distances on the diagonal should be stored explicitly in X.data, as opposed to being elements that are not explicitly stored (and hence assumed to have a value of zero).

You can construct an all-explicit CSR matrix with:

import numpy as np
from scipy.sparse import csr_matrix

def array_to_explicit_csr(X):
    # data, column indices and row pointers covering every cell of X
    return csr_matrix((X.ravel(),
                       np.tile(np.arange(X.shape[1]), X.shape[0]),
                       np.arange(0, X.size + 1, X.shape[1])))
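For example, a quick usage sketch continuing from the helper above (the toy matrix is illustrative): every cell, including the zero diagonal, ends up stored explicitly in .data.

X = np.array([[0.0, 1.0],
              [1.0, 0.0]])
X_csr = array_to_explicit_csr(X)
print(X_csr.nnz)             # 4: all cells stored, zeros included
print(csr_matrix(X).nnz)     # 2: the plain constructor drops the zeros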

@jnothman (Member)

Would you like me to write the tests for you, and you fix up the implementation? Let me know.

@thechargedneutron (Contributor, Author)

@jnothman Yeah, I need help with the tests. The implementation can be corrected by using these explicit zeros. I never thought this could be achieved; that's why I used the current workaround.

@jnothman (Member) commented Jan 15, 2018 via email



def getnnz(X, axis=None):
    if axis is None:
Member:

please reduce indent

raise ValueError("Not enough neighbors in sparse "
                 "precomputed matrix to get {} "
                 "nearest neighbors".format(n_neighbors))
if dist.diagonal().min() < 0 or dist.diagonal().max() > 0 or \
Member:

The requirement that the diagonal is 0 only applies if X == self._fit_X. The requirement of symmetry should not exist. Precomputed sparse distance matrices (and hacked dense ones) -- even for X == self._fit_X -- are not necessarily symmetric in all their cells, only in those cells that are present (and finite/reachable in the case of dense) on both sides of the diagonal. Just remove this validation.

@jnothman (Member)

See thechargedneutron#4

if issparse(dist):
    print "Dist being printed \n"
    print dist.toarray()
    print dist.indices
Contributor Author:

https://gist.github.com/thechargedneutron/c2f95ac46a9ab61beb0de8a91a9a533f
Running the above code (tests provided by @jnothman ) on this branch produces the following dist matrix:

[[0.    0.    0.    0.    0.    0.    0.56968608    0.    0.44086692    0. ]
 [0.    0.    0.71768226    0.    0.    0.71600003    0.    0.    0.    0. ]
 [0.    0.71768226    0.    0.    0.    0.    0.51642623    0.    0.    0. ]
 [0.    0.    0.    0.    0.    0.31561065    0.    0.51627111    0.    0. ]
 [0.    0.    0.    0.    0.    0.    0.44107968    0.    0.    0.51790786]
 [0.    0.    0.    0.31561065    0.    0.    0.    0.38322922    0.    0.]
 [0.    0.    0.51642623    0.    0.44107968    0.    0.    0.    0.    0.]
 [0.    0.    0.    0.    0.    0.38322922    0.    0.    0.    0.40501049]
 [0.44086692    0.    0.    0.    0.88847197    0.    0.    0.    0.    0.]
 [0.    0.    0.    0.    0.    0.50529526    0.    0.40501049    0.    0.]]

And the corresponding indices of dist are:

[0 8 6 1 5 2 2 6 1 3 5 7 4 6 9 5 3 7 6 4 2 7 5 9 8 0 4 9 7 5]

And the expected result of indices are:

[[8 6 4]
 [5 2 9]
 [6 1 4]
 [5 7 4]
 [6 9 5]
 [3 7 9]
 [4 2 0]
 [5 9 3]
 [0 4 6]
 [7 5 4]]

But in row 1, for example, how are we supposed to get 4 as an index when it is passed as a sparse array with a zero at that position (visible from the indices output)? Am I missing something?
I know this issue is taking too long. Sorry for that.

@TomDLT (Member) commented Jan 23, 2018:

This dist matrix gives you 2 neighbors, or 3 if you include self as explicit zeros.
The expected result of indices corresponds to 3 neighbors, without self.

The fix is to replace, in the gist, nn.kneighbors_graph(X_train.copy(), mode='distance') with nn.kneighbors_graph(None, mode='distance'), which is the correct syntax to get 3 neighbors without self.

In base.py you also have to uncomment:

                if query_is_train:
                    # this is done to add self as nearest neighbor
                    neigh_ind = np.concatenate((sample_range, neigh_ind),
                                               axis=1)
                    neigh_ind = neigh_ind[:, :-1]
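A hedged, self-contained illustration of the syntax difference described above (toy data; this only demonstrates the self-inclusion behaviour, not the PR's code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.randn(10, 3)
nn = NearestNeighbors(n_neighbors=3).fit(X_train)

with_self = nn.kneighbors_graph(X_train.copy(), mode='distance')
without_self = nn.kneighbors_graph(None, mode='distance')

print(with_self.getrow(0).indices)      # index 0 (the sample itself) appears among its neighbors
print(without_self.getrow(0).indices)   # three genuine neighbors, self excluded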

@TomDLT (Member) commented Jan 23, 2018:

oh my mistake, you are testing the explicit diagonal case...

The expected result of indices is wrong

Contributor Author:

@TomDLT Is the fix that you provided above for the gist still valid, given the correction in your comment above? I am not familiar with generating a CSR matrix from kneighbors_graph.

@jnothman (Member) commented Jan 22, 2018 via email

@jnothman (Member) commented Jan 22, 2018 via email

@jnothman (Member) commented Jan 24, 2018 via email

@thechargedneutron (Contributor, Author)

@jnothman Cool. Can you review the PR again with the desired features? If this looks good, I'll open a new PR against #10482's branch.

@jnothman (Member)

I'm not sure reviewing this altogether is a great use of my time, which is becoming much more limited for the next five months. I may try to look at some of the recent changes, but can I suggest that we do the following: constrain this PR (or a new one in its place) to the case of allowing sparse matrices in tSNE without metric='precomputed'. That should be a straightforward change.

And you take a good look at #10482, and either collaborate with @TomDLT, or ask if he wants you to take over its primary development. It's an extension of what you have here in terms of generalised support for sparse precomputed input to nearest neighbors-based estimators. Although, by describing the API more formally, we can get away with simplifying the implementation by excluding some annoying edge cases, such as saying "We're going to assume that the input is like the output from kneighbors_graph(X, mode='distance') with the zeros on the diagonal implicit."
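A hedged illustration of that assumed input format (relying on kneighbors_graph's default include_self=False): the zero self-distances are implicit, i.e. no diagonal entry is stored.

import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.randn(10, 3)
G = kneighbors_graph(X, n_neighbors=3, mode='distance')   # CSR, 3 stored entries per row

rows = np.repeat(np.arange(G.shape[0]), np.diff(G.indptr))
print(G.nnz)                        # 30 stored distances in total
print(np.any(rows == G.indices))    # False: nothing stored on the diagonal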

@TomDLT (Member) commented Jan 26, 2018

I squashed your commits into 96c9d94 and cherry-picked it into #10482.

@thechargedneutron (Contributor, Author)

I squashed your commits into 96c9d94 and cherry-picked it into #10482.

Thanks a lot!! 😃

@jnothman (Member)

Soo... are we closing this? Is it appropriate for #10482 to include sparse support in BHTSNE in the non-precomputed case?

@amueller added the Needs Decision label on Aug 5, 2019
@TomDLT (Member) commented Sep 18, 2019

This work has been merged as part of #10482

@TomDLT closed this on Sep 18, 2019
Labels
Needs Decision (Requires decision)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

T-SNE fails for CSR matrix
4 participants