
[MRG] Modifies T-SNE for sparse matrix #10206

Closed

Conversation

@thechargedneutron (Contributor)

Reference Issues/PRs

Fixes #9691

What does this implement/fix? Explain your changes.

Modifies sklearn/neighbors/base.py

Any other comments?

Not sure about the complexity of the modification.

@jnothman (Member) left a comment

I think you should just be going through each row, raising an error if the number of nonzeros is less than n_neighbors, otherwise argsorting, and taking neigh_ind[i] = row.indices[order][:n]. Is that not right?
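For illustration, here is a rough, self-contained sketch of that per-row approach on toy data (names such as neigh_dist, and the NearestNeighbors setup, are illustrative assumptions, not the PR's actual code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
dist = NearestNeighbors(n_neighbors=5).fit(X).kneighbors_graph(mode='distance')  # CSR matrix
n_neighbors = 3

neigh_ind = np.zeros((dist.shape[0], n_neighbors), dtype=int)
neigh_dist = np.zeros((dist.shape[0], n_neighbors))
for i in range(dist.shape[0]):
    row = dist.getrow(i)
    if row.nnz < n_neighbors:            # not enough stored distances in this row
        raise ValueError("Row %d has only %d stored distances." % (i, row.nnz))
    order = np.argsort(row.data)         # argsort the stored (nonzero) distances
    neigh_ind[i] = row.indices[order][:n_neighbors]
    neigh_dist[i] = row.data[order][:n_neighbors]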

row = dist.getrow(i)
non_zero = row.size
j_ = 0
for j in range(0, n_neighbors+non_zero):
Member:

Spaces around + please

Member:

But I don't really get what this is doing (the names j_ and j don't help!)

non_zero = row.size
j_ = 0
for j in range(0, n_neighbors+non_zero):
    if j not in row.indices:
Member:

this is slow...

@thechargedneutron (Contributor, Author)

@jnothman Why do we raise an error if the number of nonzeros is less than n_neighbors?

Also, what I am attempting with those j and j_ is to get a list of the indices whose value is 0. Is there a simple function that does this? I couldn't find one.

The tests will be modified once this is approved. Thanks for the review.

@jnothman (Member) commented Nov 27, 2017 via email

@thechargedneutron (Contributor, Author)

@jnothman I guess this is what you intended. But after this change, the snippet given in the original issue itself fails with n_neighbors=40. Can you please check whether this is in line with what you said?

@jnothman (Member)

Is this the failure you're talking about?

ValueError                                Traceback (most recent call last)
<ipython-input-1-6e15fc3677d3> in <module>()
      5 bt = BallTree(X, leaf_size=300)
      6 distances = kneighbors_graph(bt, n_neighbors=40, mode="distance", metric="cosine")
----> 7 X_embedded = TSNE(n_components=2, metric="precomputed").fit_transform(distances)

/Users/joel/repos/scikit-learn/sklearn/manifold/t_sne.py in fit_transform(self, X, y)
    845             Embedding of the training data in low-dimensional space.
    846         """
--> 847         embedding = self._fit(X)
    848         self.embedding_ = embedding
    849         return self.embedding_

/Users/joel/repos/scikit-learn/sklearn/manifold/t_sne.py in _fit(self, X, skip_num_points)
    712             t0 = time()
    713             distances_nn, neighbors_nn = knn.kneighbors(
--> 714                 None, n_neighbors=k)
    715             duration = time() - t0
    716             if self.verbose:

/Users/joel/repos/scikit-learn/sklearn/neighbors/base.py in kneighbors(self, X, n_neighbors, return_distance)
    367                     non_zero = row.size
    368                     if non_zero < n_neighbors:
--> 369                         raise ValueError("Invalid Format")
    370                     else:
    371                         neigh_ind[i][:n_neighbors] = row.indices[np.argsort(row.data)][:n_neighbors]

ValueError: Invalid Format

@thechargedneutron (Contributor, Author)

Yes, this Invalid Format error was added by me; it is raised when the number of non-zeros is less than n_neighbors.

@jnothman (Member)

Yes, the default perplexity is too high for 40 neighbors to be sufficient. It is correct to throw an error, but the error message is not appropriate at the moment. perplexity=12 is the maximum for that data.
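For context, a hedged sketch of the arithmetic behind this, assuming the t_sne.py of that era requested k = min(n_samples - 1, int(3. * perplexity + 1)) neighbors:

# Assumption: the neighbor count was chosen roughly as below in t_sne.py.
n_samples = 100                                # X = np.random.randn(100, 10) in the issue snippet
perplexity = 30.0                              # TSNE default
k = min(n_samples - 1, int(3. * perplexity + 1))
print(k)                                       # 91 > 40 stored distances per row -> the error above
print(min(n_samples - 1, int(3. * 12 + 1)))    # 37 <= 40, so perplexity=12 fits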

@thechargedneutron (Contributor, Author)

Yeah, I was not sure what the error actually was, so I left the error message as a placeholder for the time being. Is the implementation correct? Could you suggest an error message? Also, if this is correct, should I move on to fixing the tests?

@jnothman (Member) commented Nov 28, 2017 via email

@thechargedneutron (Contributor, Author)

Adding an error message for TSNE would require me to add the error and its message inside TSNE itself. Is this what you want?

@jnothman (Member) commented Nov 28, 2017 via email

@thechargedneutron (Contributor, Author)

@jnothman I have added what you suggested, but I doubt whether this is correct. Even when n_neighbors is set to 1, as in the sample code given in the issue, the number of non-zeros is less than n_neighbors. Am I missing something?

@jnothman (Member)

"when the neighbour is set to 1" I'm not sure what you're referring to. Link to the relevant issue/comment?

@jnothman (Member) left a comment

Please provide runnable snippets, or failing unit tests if you want me to understand what behaviour you think is incorrect.

You need to add tests anyway.

if non_zero < n_neighbors:
    raise ValueError("Invalid Format")
else:
    neigh_ind[i][:n_neighbors] = row.indices[np.argsort(row.data)][:n_neighbors]
Member:

surely you need to update the distances too.

Needs tests.

Contributor Author:

I am not sure how the distances are to be updated. Can you give me an example of an existing implementation that does this?

Also, tests will be attempted once the implementation is correct.

sample_range, np.argsort(dist[sample_range, neigh_ind])]
if issparse(dist):
    neigh_ind = np.zeros((dist.shape[0], n_neighbors))
    for i in range(0, dist.shape[0]):
Member:

You still need to raise an error here, if we're to implement this in sklearn.neighbors and not just in TSNE.

Perhaps use if np.any(dist.getnnz(axis=1) < n_neighbors): raise ...?
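A small self-contained sketch of that check (assuming dist is a CSR matrix and a SciPy version whose sparse getnnz accepts the axis keyword):

import numpy as np
from scipy.sparse import csr_matrix

dist = csr_matrix(np.array([[0., 1., 2.],
                            [3., 0., 4.],
                            [5., 6., 0.]]))
n_neighbors = 2
if np.any(dist.getnnz(axis=1) < n_neighbors):  # per-row count of stored entries
    raise ValueError("Not enough neighbors in sparse precomputed matrix "
                     "to get {} nearest neighbors".format(n_neighbors))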

@@ -714,6 +699,13 @@ def _fit(self, X, skip_num_points=0):
if self.verbose:
print("[t-SNE] Computing {} nearest neighbors...".format(k))

for i in range(0, X.shape[0]):
Member:

Use if np.any(dist.getnnz(axis=1) < n_neighbors): raise ...?

@thechargedneutron (Contributor, Author)

@jnothman This is what I am referring to. Here I changed n_neighbors from 40 to 1. The error is still raised because the number of non-zeros is less than n_neighbors. Is this normal, or is there something wrong with my implementation?

import numpy as np
from sklearn.neighbors import BallTree, kneighbors_graph
from sklearn.manifold import TSNE
X = np.random.randn(100, 10)

bt = BallTree(X, leaf_size=300)
distances = kneighbors_graph(bt, n_neighbors=1, mode="distance", metric="cosine")
X_embedded = TSNE(n_components=2, metric="precomputed").fit_transform(distances)

@jnothman (Member) commented Nov 30, 2017 via email

@thechargedneutron (Contributor, Author)

The tests have been added, kindly review. Also, @jnothman, the tests are failing with the error

getnnz() got an unexpected keyword argument 'axis'

This is working fine locally. Can you please look into this problem? A similar error was reported at the link below when I tried finding a solution for it: lyst/lightfm#87
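One possible workaround, sketched here as an assumption rather than what the PR ultimately did: for CSR input the per-row count of stored entries can be read from indptr, which avoids sparse getnnz(axis=...) on older SciPy versions that lack the axis argument.

import numpy as np
from scipy.sparse import csr_matrix

dist = csr_matrix(np.array([[0., 1., 2.],
                            [3., 0., 0.],
                            [4., 5., 6.]]))
row_nnz = np.diff(dist.indptr)          # stored entries per row
print(row_nnz)                          # [2 1 3]
n_neighbors = 2
print(np.any(row_nnz < n_neighbors))    # True: row 1 has too few stored distances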

@jnothman (Member) commented Jan 10, 2018 via email

dist = np.zeros((3, 3))
dist_csr = sp.csr_matrix(dist)
tsne = TSNE(metric="precomputed")
X_transformed_dense = tsne.fit_transform(dist)
Contributor Author:

@jnothman

Check that tSNE results for sparse X are the same as for dense X where X is a feature matrix and where X is a precomputed distance matrix with all zeros explicit

Does this properly implement what you intended by that statement? I am not sure, since the output of the neighbors computation inside TSNE.fit gives the same result. I guess this function needs modification.

Member:

No, I don't mean that the entire matrix is zeros. I mean that the zero distances on the diagonal should be stored explicitly in X.data, as opposed to being elements that are not explicitly stored (and hence assumed to have a value of zero).

You can construct an all-explicit CSR matrix with:

import numpy as np
from scipy.sparse import csr_matrix

def array_to_explicit_csr(X):
    # data, column indices and row pointers covering every cell of X
    return csr_matrix((X.ravel(),
                       np.tile(np.arange(X.shape[1]), X.shape[0]),
                       np.arange(0, X.size + 1, X.shape[1])))
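For example, a quick usage sketch continuing from the helper above (the toy matrix is illustrative): every cell, including the zero diagonal, ends up stored explicitly in .data.

X = np.array([[0.0, 1.0],
              [1.0, 0.0]])
X_csr = array_to_explicit_csr(X)
print(X_csr.nnz)             # 4: all cells stored, zeros included
print(csr_matrix(X).nnz)     # 2: the plain constructor drops the zeros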

@jnothman (Member)

Would you like me to write the tests for you, and you fix up the implementation? Let me know.

@thechargedneutron (Contributor, Author)

@jnothman Yeah, I need help with the tests. The implementation can be corrected by using these explicit zeros. I never thought this could be achieved; that's why I used the current workaround.

@jnothman (Member) commented Jan 15, 2018 via email



def getnnz(X, axis=None):
    if axis is None:
Member:

please reduce indent

raise ValueError("Not enough neighbors in sparse "
                 "precomputed matrix to get {} "
                 "nearest neighbors".format(n_neighbors))
if dist.diagonal().min() < 0 or dist.diagonal().max() > 0 or \
Member:

The requirement that the diagonal is 0 only applies if X == self._fit_X. The requirement of symmetry should not exist. Precomputed sparse distance matrices (and hacked dense ones) -- even for X == self._fit_X -- are not necessarily symmetric in all their cells, only in those cells that are present (and finite/reachable in the case of dense) on both sides of the diagonal. Just remove this validation.

@jnothman (Member)

See thechargedneutron#4

if issparse(dist):
    print "Dist being printed \n"
    print dist.toarray()
    print dist.indices
Contributor Author:

https://gist.github.com/thechargedneutron/c2f95ac46a9ab61beb0de8a91a9a533f
Running the above code (tests provided by @jnothman ) on this branch produces the following dist matrix:

[[0.    0.    0.    0.    0.    0.    0.56968608    0.    0.44086692    0. ]
 [0.    0.    0.71768226    0.    0.    0.71600003    0.    0.    0.    0. ]
 [0.    0.71768226    0.    0.    0.    0.    0.51642623    0.    0.    0. ]
 [0.    0.    0.    0.    0.    0.31561065    0.    0.51627111    0.    0. ]
 [0.    0.    0.    0.    0.    0.    0.44107968    0.    0.    0.51790786]
 [0.    0.    0.    0.31561065    0.    0.    0.    0.38322922    0.    0.]
 [0.    0.    0.51642623    0.    0.44107968    0.    0.    0.    0.    0.]
 [0.    0.    0.    0.    0.    0.38322922    0.    0.    0.    0.40501049]
 [0.44086692    0.    0.    0.    0.88847197    0.    0.    0.    0.    0.]
 [0.    0.    0.    0.    0.    0.50529526    0.    0.40501049    0.    0.]]

And the corresponding indices of dist are:

[0 8 6 1 5 2 2 6 1 3 5 7 4 6 9 5 3 7 6 4 2 7 5 9 8 0 4 9 7 5]

And the expected result of indices are:

[[8 6 4]
 [5 2 9]
 [6 1 4]
 [5 7 4]
 [6 9 5]
 [3 7 9]
 [4 2 0]
 [5 9 3]
 [0 4 6]
 [7 5 4]]

But in row 1, for example, how are we supposed to get 4 as an index when it is passed as a sparse array with a zero at that position (visible from the indices output)? Am I missing something?
I know this issue is taking too long. Sorry for that.

@TomDLT (Member) commented Jan 23, 2018:

This dist matrix gives you 2 neighbors, or 3 if you include self as explicit zeros.
The expected result of indices corresponds to 3 neighbors, without self.

The fix is to replace, in the gist, nn.kneighbors_graph(X_train.copy(), mode='distance') with nn.kneighbors_graph(None, mode='distance'), which is the correct syntax to get 3 neighbors without self.

In base.py you also have to uncomment:

                if query_is_train:
                    # this is done to add self as nearest neighbor
                    neigh_ind = np.concatenate((sample_range, neigh_ind),
                                               axis=1)
                    neigh_ind = neigh_ind[:, :-1]
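A hedged, self-contained illustration of the syntax difference described above (toy data; this only demonstrates the self-inclusion behaviour, not the PR's code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.randn(10, 3)
nn = NearestNeighbors(n_neighbors=3).fit(X_train)

with_self = nn.kneighbors_graph(X_train.copy(), mode='distance')
without_self = nn.kneighbors_graph(None, mode='distance')

print(with_self.getrow(0).indices)      # index 0 (the sample itself) appears among its neighbors
print(without_self.getrow(0).indices)   # three genuine neighbors, self excluded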

@TomDLT (Member) commented Jan 23, 2018:

oh my mistake, you are testing the explicit diagonal case...

The expected result of indices is wrong

Contributor Author:

@TomDLT Is the fix that you provided above for the gist still valid, given the correction in your comment above? I am not familiar with generating a CSR matrix from kneighbors_graph.

@jnothman (Member) commented Jan 22, 2018 via email

@jnothman (Member) commented Jan 22, 2018 via email

@jnothman (Member) commented Jan 24, 2018 via email

@thechargedneutron (Contributor, Author)

@jnothman Cool. Can you review the PR again with the desired features? If this looks good, I'll open a new PR against #10482's branch.

@jnothman (Member)

I'm not sure reviewing this altogether is a great use of my time, which is becoming much more limited for the next five months. I may try to look at some of the recent changes, but can I suggest that we do the following: constrain this PR (or a new one in its place) to the case of allowing sparse matrices in tSNE without metric='precomputed'. That should be a straightforward change.

And you take a good look at #10482, and either collaborate with @TomDLT, or ask if he wants you to take over its primary development. It's an extension of what you have here in terms of generalised support for sparse precomputed input to nearest neighbors-based estimators. Although, by describing the API more formally, we can get away with simplifying the implementation by excluding some annoying edge cases, such as saying "We're going to assume that the input is like the output from kneighbors_graph(X, mode='distance') with the zeros on the diagonal implicit."
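A hedged illustration of that assumed input format (relying on kneighbors_graph's default include_self=False): the zero self-distances are implicit, i.e. no diagonal entry is stored.

import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.randn(10, 3)
G = kneighbors_graph(X, n_neighbors=3, mode='distance')   # CSR, 3 stored entries per row

rows = np.repeat(np.arange(G.shape[0]), np.diff(G.indptr))
print(G.nnz)                        # 30 stored distances in total
print(np.any(rows == G.indices))    # False: nothing stored on the diagonal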

@TomDLT (Member) commented Jan 26, 2018

I squashed your commits into 96c9d94 and cherry-picked it into #10482.

@thechargedneutron (Contributor, Author)

I squashed your commits into 96c9d94 and cherry-picked it into #10482.

Thanks a lot!! 😃

@jnothman (Member)

Soo... are we closing this? Is it appropriate for #10482 to include sparse support in BHTSNE in the non-precomputed case?

@amueller added the Needs Decision label on Aug 5, 2019
@TomDLT (Member) commented Sep 18, 2019

This work has been merged as part of #10482

@TomDLT closed this on Sep 18, 2019
Labels
Needs Decision (Requires decision)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

T-SNE fails for CSR matrix
4 participants