New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

t-SNE has inefficient memory structure #7089

Closed
amueller opened this Issue Jul 26, 2016 · 19 comments

Comments

Projects
None yet
6 participants
@amueller
Member

amueller commented Jul 26, 2016

The barnes-hut implementation of t-SNE currently uses a dense matrix representation for the distances, but should be using a sparse matrix.
Discussion see #4025

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Jul 30, 2016

Does this still need a contributor? I'll work on this issue.

shanglun commented Jul 30, 2016

Does this still need a contributor? I'll work on this issue.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 3, 2016

Member

It does. It's somewhat non-trivial but you're very welcome to give it a go!

Member

amueller commented Aug 3, 2016

It does. It's somewhat non-trivial but you're very welcome to give it a go!

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 4, 2016

Ok, I will give it a go then. Glad to be helping on a 1.0 milestone!

shanglun commented Aug 4, 2016

Ok, I will give it a go then. Glad to be helping on a 1.0 milestone!

@zhexuany

This comment has been minimized.

Show comment
Hide comment
@zhexuany

zhexuany Aug 8, 2016

@shanglun Have you made any progress? I am also willing to contribute some code for this issue. :)

zhexuany commented Aug 8, 2016

@shanglun Have you made any progress? I am also willing to contribute some code for this issue. :)

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 8, 2016

Yes, I am still looking into it and I am making progress. I will reach out
on this thread if I need help. Thank you!

On Aug 8, 2016 11:31 AM, "Zhexuan Zachary Yang" notifications@github.com
wrote:

@shanglun https://github.com/shanglun Have you made any progress? I am
also willing to contribute some code for this issue. :)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_OTWUqjHenuZQ1XTY6MMQGhrtYNIks5qd0vEgaJpZM4JVevI
.

shanglun commented Aug 8, 2016

Yes, I am still looking into it and I am making progress. I will reach out
on this thread if I need help. Thank you!

On Aug 8, 2016 11:31 AM, "Zhexuan Zachary Yang" notifications@github.com
wrote:

@shanglun https://github.com/shanglun Have you made any progress? I am
also willing to contribute some code for this issue. :)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_OTWUqjHenuZQ1XTY6MMQGhrtYNIks5qd0vEgaJpZM4JVevI
.

@zhexuany

This comment has been minimized.

Show comment
Hide comment
@zhexuany

zhexuany Aug 8, 2016

Great. Please let me know if you need any help. :)

On Aug 8, 2016, at 10:41 AM, Sean Wang notifications@github.com wrote:

Yes, I am still looking into it and I am making progress. I will reach out
on this thread if I need help. Thank you!

On Aug 8, 2016 11:31 AM, "Zhexuan Zachary Yang" notifications@github.com
wrote:

@shanglun https://github.com/shanglun Have you made any progress? I am
also willing to contribute some code for this issue. :)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_OTWUqjHenuZQ1XTY6MMQGhrtYNIks5qd0vEgaJpZM4JVevI
.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub #7089 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AMtCBIeecpI5SngxTwEJ9kN6dDnjo_4eks5qd04MgaJpZM4JVevI.

zhexuany commented Aug 8, 2016

Great. Please let me know if you need any help. :)

On Aug 8, 2016, at 10:41 AM, Sean Wang notifications@github.com wrote:

Yes, I am still looking into it and I am making progress. I will reach out
on this thread if I need help. Thank you!

On Aug 8, 2016 11:31 AM, "Zhexuan Zachary Yang" notifications@github.com
wrote:

@shanglun https://github.com/shanglun Have you made any progress? I am
also willing to contribute some code for this issue. :)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_OTWUqjHenuZQ1XTY6MMQGhrtYNIks5qd0vEgaJpZM4JVevI
.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub #7089 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AMtCBIeecpI5SngxTwEJ9kN6dDnjo_4eks5qd04MgaJpZM4JVevI.

@csanhuezalobos

This comment has been minimized.

Show comment
Hide comment
@csanhuezalobos

csanhuezalobos Aug 10, 2016

In this work, the authors proposed a new method for visualization of high-dimensional data (LargeVis) using ideas of a previous study (LINE). I did some test and the method works really fast. They proposed an interesting algorithm to build a pretty accurate kNN graph. In the experiment sections, the authors mentioned they parallelize parts of t-SNE, so maybe you could ask the authors to contribute.

BTW, I'm looking forward to see LINE and LargeVis in future versions ;)

csanhuezalobos commented Aug 10, 2016

In this work, the authors proposed a new method for visualization of high-dimensional data (LargeVis) using ideas of a previous study (LINE). I did some test and the method works really fast. They proposed an interesting algorithm to build a pretty accurate kNN graph. In the experiment sections, the authors mentioned they parallelize parts of t-SNE, so maybe you could ask the authors to contribute.

BTW, I'm looking forward to see LINE and LargeVis in future versions ;)

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 10, 2016

Interesting. I think these studies might be out of scope for this
particular ticket, but it would be interesting to investigate further and
include additional methods in the future. Delving into the implementation
details of the t-Sne in the past few days has been quite interesting.

Was your benchmarking written in Python? Or account language?

On Aug 10, 2016 1:39 AM, "Claudio Sanhueza" notifications@github.com
wrote:

In this work http://dl.acm.org/citation.cfm?id=2883041, the authors
proposed a new method for visualization of high-dimensional data (LargeVis
https://github.com/lferry007/LargeVis) using ideas of a previous study
https://arxiv.org/abs/1503.03578 (LINE
https://github.com/tangjianpku/LINE). I did some test and the method
works really fast. They proposed an interesting algorithm to build a pretty
accurate kNN graph. In the experiment sections, the authors mentioned they
parallelize parts of t-SNE, so maybe you could ask the authors to
contribute.

BTW, I'm looking forward to see LINE and LargeVis in future versions ;)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_JIt5qV51j87yQ8B5qA1jKW2ih3Aks5qeWP3gaJpZM4JVevI
.

shanglun commented Aug 10, 2016

Interesting. I think these studies might be out of scope for this
particular ticket, but it would be interesting to investigate further and
include additional methods in the future. Delving into the implementation
details of the t-Sne in the past few days has been quite interesting.

Was your benchmarking written in Python? Or account language?

On Aug 10, 2016 1:39 AM, "Claudio Sanhueza" notifications@github.com
wrote:

In this work http://dl.acm.org/citation.cfm?id=2883041, the authors
proposed a new method for visualization of high-dimensional data (LargeVis
https://github.com/lferry007/LargeVis) using ideas of a previous study
https://arxiv.org/abs/1503.03578 (LINE
https://github.com/tangjianpku/LINE). I did some test and the method
works really fast. They proposed an interesting algorithm to build a pretty
accurate kNN graph. In the experiment sections, the authors mentioned they
parallelize parts of t-SNE, so maybe you could ask the authors to
contribute.

BTW, I'm looking forward to see LINE and LargeVis in future versions ;)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_JIt5qV51j87yQ8B5qA1jKW2ih3Aks5qeWP3gaJpZM4JVevI
.

@csanhuezalobos

This comment has been minimized.

Show comment
Hide comment
@csanhuezalobos

csanhuezalobos Aug 11, 2016

I did the tests in C++.

What are you specifically modifying in t-SNE? Memory management?

csanhuezalobos commented Aug 11, 2016

I did the tests in C++.

What are you specifically modifying in t-SNE? Memory management?

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 11, 2016

Yeah, still investigating and experimenting, but based on discussions in
linked thread we will probably move to a sparse matrix implementation and
make sure that we are smart enough about the data to handle the edge cases.

If you have the C++ code I'd very much love to collaborate on a new
implementation. If you are already good with Python I think there wouldn't
be any barrier to you implementing it and issuing a PR, however.

On Aug 10, 2016 10:02 PM, "Claudio Sanhueza" notifications@github.com
wrote:

I did the tests in C++.

What are you specifically modifying in t-SNE? Memory management?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_ECUJcqkoC_cxmglkc9Et2AOQIRjks5qeoKkgaJpZM4JVevI
.

shanglun commented Aug 11, 2016

Yeah, still investigating and experimenting, but based on discussions in
linked thread we will probably move to a sparse matrix implementation and
make sure that we are smart enough about the data to handle the edge cases.

If you have the C++ code I'd very much love to collaborate on a new
implementation. If you are already good with Python I think there wouldn't
be any barrier to you implementing it and issuing a PR, however.

On Aug 10, 2016 10:02 PM, "Claudio Sanhueza" notifications@github.com
wrote:

I did the tests in C++.

What are you specifically modifying in t-SNE? Memory management?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_ECUJcqkoC_cxmglkc9Et2AOQIRjks5qeoKkgaJpZM4JVevI
.

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 11, 2016

If you just want to see another implementation option in t-Sne you can
point me to your C++ code. I'll read the papers, look at the C++ code and
implement the python version and issue a PR if it's a good fit for the
library. Always look forward to write interesting code.

On Aug 10, 2016 10:11 PM, "Shanglun Wang" shanglunwang@gmail.com wrote:

Yeah, still investigating and experimenting, but based on discussions in
linked thread we will probably move to a sparse matrix implementation and
make sure that we are smart enough about the data to handle the edge cases.

If you have the C++ code I'd very much love to collaborate on a new
implementation. If you are already good with Python I think there wouldn't
be any barrier to you implementing it and issuing a PR, however.

On Aug 10, 2016 10:02 PM, "Claudio Sanhueza" notifications@github.com
wrote:

I did the tests in C++.

What are you specifically modifying in t-SNE? Memory management?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_ECUJcqkoC_cxmglkc9Et2AOQIRjks5qeoKkgaJpZM4JVevI
.

shanglun commented Aug 11, 2016

If you just want to see another implementation option in t-Sne you can
point me to your C++ code. I'll read the papers, look at the C++ code and
implement the python version and issue a PR if it's a good fit for the
library. Always look forward to write interesting code.

On Aug 10, 2016 10:11 PM, "Shanglun Wang" shanglunwang@gmail.com wrote:

Yeah, still investigating and experimenting, but based on discussions in
linked thread we will probably move to a sparse matrix implementation and
make sure that we are smart enough about the data to handle the edge cases.

If you have the C++ code I'd very much love to collaborate on a new
implementation. If you are already good with Python I think there wouldn't
be any barrier to you implementing it and issuing a PR, however.

On Aug 10, 2016 10:02 PM, "Claudio Sanhueza" notifications@github.com
wrote:

I did the tests in C++.

What are you specifically modifying in t-SNE? Memory management?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_ECUJcqkoC_cxmglkc9Et2AOQIRjks5qeoKkgaJpZM4JVevI
.

@csanhuezalobos

This comment has been minimized.

Show comment
Hide comment
@csanhuezalobos

csanhuezalobos Aug 11, 2016

The original implementation of t-SNE was done in C++. You can find the original sources here.

I just adapt:

  1. Include parameters in command line.
  2. Read data points from a text file.
  3. Small memory management issues.

Originally, all the input data is contained in a specific formatted binary file created by the wrappers.

csanhuezalobos commented Aug 11, 2016

The original implementation of t-SNE was done in C++. You can find the original sources here.

I just adapt:

  1. Include parameters in command line.
  2. Read data points from a text file.
  3. Small memory management issues.

Originally, all the input data is contained in a specific formatted binary file created by the wrappers.

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 11, 2016

Oh, I see, the improvements you mentioned in the original post was just
about parallelization of the t-Sne, and the paper wasn't really about that.

I was under the impression that there was some new implementation of the
t-Sne that was faster.

Let's follow up when I finish this ticket, and we can look into optimizing
the t-Sne and implementing other algorithms that might be useful, such as
largeVis.

On Aug 10, 2016 10:39 PM, "Claudio Sanhueza" notifications@github.com
wrote:

The original implementation of t-SNE was done in C++. You can find the
original sources here https://github.com/lvdmaaten/bhtsne/.

I just adapt:

  1. Include parameters in command line.
  2. Read data points from a text file.
  3. Small memory management issues.

Originally, all the input data is contained in a specific formatted binary
file created by the wrappers.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_BjICwiF1f49DgJZ6vCOg6Ct83eCks5qeotJgaJpZM4JVevI
.

shanglun commented Aug 11, 2016

Oh, I see, the improvements you mentioned in the original post was just
about parallelization of the t-Sne, and the paper wasn't really about that.

I was under the impression that there was some new implementation of the
t-Sne that was faster.

Let's follow up when I finish this ticket, and we can look into optimizing
the t-Sne and implementing other algorithms that might be useful, such as
largeVis.

On Aug 10, 2016 10:39 PM, "Claudio Sanhueza" notifications@github.com
wrote:

The original implementation of t-SNE was done in C++. You can find the
original sources here https://github.com/lvdmaaten/bhtsne/.

I just adapt:

  1. Include parameters in command line.
  2. Read data points from a text file.
  3. Small memory management issues.

Originally, all the input data is contained in a specific formatted binary
file created by the wrappers.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_BjICwiF1f49DgJZ6vCOg6Ct83eCks5qeotJgaJpZM4JVevI
.

@shanglun

This comment has been minimized.

Show comment
Hide comment
@shanglun

shanglun Aug 18, 2016

Not very happy about this but I have hit a busy period with work and will
not be able to devote enough time to ticket for a short bit. If someone
would like to work on this ticket please feel free.

On Aug 10, 2016 10:44 PM, "Shanglun Wang" shanglunwang@gmail.com wrote:

Oh, I see, the improvements you mentioned in the original post was just
about parallelization of the t-Sne, and the paper wasn't really about that.

I was under the impression that there was some new implementation of the
t-Sne that was faster.

Let's follow up when I finish this ticket, and we can look into optimizing
the t-Sne and implementing other algorithms that might be useful, such as
largeVis.

On Aug 10, 2016 10:39 PM, "Claudio Sanhueza" notifications@github.com
wrote:

The original implementation of t-SNE was done in C++. You can find the
original sources here https://github.com/lvdmaaten/bhtsne/.

I just adapt:

  1. Include parameters in command line.
  2. Read data points from a text file.
  3. Small memory management issues.

Originally, all the input data is contained in a specific formatted
binary file created by the wrappers.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_BjICwiF1f49DgJZ6vCOg6Ct83eCks5qeotJgaJpZM4JVevI
.

shanglun commented Aug 18, 2016

Not very happy about this but I have hit a busy period with work and will
not be able to devote enough time to ticket for a short bit. If someone
would like to work on this ticket please feel free.

On Aug 10, 2016 10:44 PM, "Shanglun Wang" shanglunwang@gmail.com wrote:

Oh, I see, the improvements you mentioned in the original post was just
about parallelization of the t-Sne, and the paper wasn't really about that.

I was under the impression that there was some new implementation of the
t-Sne that was faster.

Let's follow up when I finish this ticket, and we can look into optimizing
the t-Sne and implementing other algorithms that might be useful, such as
largeVis.

On Aug 10, 2016 10:39 PM, "Claudio Sanhueza" notifications@github.com
wrote:

The original implementation of t-SNE was done in C++. You can find the
original sources here https://github.com/lvdmaaten/bhtsne/.

I just adapt:

  1. Include parameters in command line.
  2. Read data points from a text file.
  3. Small memory management issues.

Originally, all the input data is contained in a specific formatted
binary file created by the wrappers.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7089 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKNQ_BjICwiF1f49DgJZ6vCOg6Ct83eCks5qeotJgaJpZM4JVevI
.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 24, 2017

Member

More discussion of this issue and its potential solution over at #8582

Member

jnothman commented May 24, 2017

More discussion of this issue and its potential solution over at #8582

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 27, 2017

Member

I'm so annoyed by this false advertising that I'm tempted to fix it myself. But given my lack of availability, I'm going to mark it for the sprint and hope that someone in Paris can give it a go.

Member

jnothman commented May 27, 2017

I'm so annoyed by this false advertising that I'm tempted to fix it myself. But given my lack of availability, I'm going to mark it for the sprint and hope that someone in Paris can give it a go.

@csanhuezalobos

This comment has been minimized.

Show comment
Hide comment
@csanhuezalobos

csanhuezalobos commented May 28, 2017

Maybe this can help to improve things.
https://github.com/DmitryUlyanov/Multicore-TSNE

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman May 28, 2017

Member
Member

jnothman commented May 28, 2017

@lesteve lesteve added the Sprint label Jun 1, 2017

@tomMoral tomMoral referenced this issue Jun 7, 2017

Merged

[MRG+1] Reducing t-SNE memory usage #9032

5 of 5 tasks complete
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 13, 2017

Member

Fixed in #9032

Member

jnothman commented Jul 13, 2017

Fixed in #9032

@jnothman jnothman closed this Jul 13, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment