Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] DOC Fix error in documentation of trustworthiness #9800

Merged
merged 6 commits into from Sep 25, 2017

Conversation

wdevazelhes
Copy link
Contributor

Reference Issue

Fixes #9799

What does this implement/fix? Explain your changes.

It fixes an error in the docstring of function manifold.t_sne.trustworthiness: the rank in the formula should be the rank in the original input space.

@lesteve
Copy link
Member

lesteve commented Sep 19, 2017

@tomMoral any opinion on this since you have worked on t-SNE recently?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key to get across is that any unexpected nearest neighbors in the embedded space are penalised in proportion to their rank in the original space.

@@ -387,10 +387,10 @@ def trustworthiness(X, X_embedded, n_neighbors=5, precomputed=False):
T(k) = 1 - \frac{2}{nk (2n - 3k - 1)} \sum^n_{i=1}
\sum_{j \in U^{(k)}_i} (r(i, j) - k)

where :math:`r(i, j)` is the rank of the embedded datapoint j
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some confusion in "according to", and r is not described as a function of i. Can you write it in your own words? Use multiple sentences. Perhaps describe U before you describe r. You can also say "for each sample i" to make it simpler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that U should be move before r and the "for each" should help defining things.
Perhaps you can use "input space" instead of original space and insist on the fact that the data sample j is the r(i,j)-th neighbors of the data sample i

@wdevazelhes
Copy link
Contributor Author

Thanks for the comments. I tried to include them in the new commit. I also added @jnothman sum-up of the function in the docstring, tell me if it seems good to you.

Copy link
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM It is way clearer with this phrasing.

is the set of points that are in the k nearest neighbors in the embedded
space but not in the original space.
where for each sample i, :math:`U^{(k)}_i` are all samples j that are in
the k-nearest neighbor of i in the embedded space but are the :math:`r(i,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use the "output space" would be clearer? input/output seem clearer when talking about mappings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, I will do so.

the k-nearest neighbor of i in the embedded space but are the :math:`r(i,
j)`-th nearest neighbor of i in the input space with r(i, j) > k. In other
words, any unexpected nearest neighbors in the embedded space are penalised
in proportion to their rank in the input space.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this sum up! 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 👍

@wdevazelhes wdevazelhes changed the title DOC Fix error in documentation of trustworthiness [MRG] DOC Fix error in documentation of trustworthiness Sep 20, 2017
:math:`U^{(k)}_i` is the set of points that are in the k nearest
neighbors in the embedded space but not in the original space.
where for each sample i, :math:`U^{(k)}_i` are all samples j that are in
the k-nearest neighbor of i in the output space but are the :math:`r(i,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find this difficult. Can we just get rid of U from above and just have max(0, r(i, j) - k) or max(k, r(i, j)) - k?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be annoying. I'm now thinking that the problem with U was that it expressed two things: being in i's embedded neighborhood and being outside of i's original neighborhood. It's now a bit weird that we don't have a function for "the k-neighborhood of i in embedded space".

Otherwise, I think this is a vast improvement. Thank you.

according to the pairwise distances between the embedded datapoints,
:math:`U^{(k)}_i` is the set of points that are in the k nearest
neighbors in the embedded space but not in the original space.
where for each sample i, j is among its k nearest neighbors in the output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output -> embedded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change according to @tomMoral comment:

Maybe use the "output space" would be clearer? input/output seem clearer when talking about mappings.

Both ways seem clear to me: "embedded" is more precise, but maybe "output" is clear enough in this case and maybe simpler ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too fussed either way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too fussed either way

@wdevazelhes
Copy link
Contributor Author

No pb, I agree, I will fix that

@jnothman
Copy link
Member

wonderful

@jnothman jnothman merged commit 8de1844 into scikit-learn:master Sep 25, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
@wdevazelhes wdevazelhes deleted the i/9799 branch November 27, 2017 15:15
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC Error in documentation of trustworthiness
4 participants