Multithreaded kneighbors_graph in TSNE #15082
On the first 20k samples of MNIST
```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_openml
from joblib import Memory

memory = Memory('/tmp/joblib_cache')
mnist = memory.cache(fetch_openml)(data_id=41063)

TSNE(verbose=5, n_jobs=8).fit_transform(mnist.data[:20000])
```
we can get up to a 2.5x runtime speedup when running with `n_jobs=8`.
The strong scaling limit seems to be around 10 CPU cores in this example.
For the full MNIST (60k samples), the k-neighbors search takes proportionally even longer, so multi-core scalability is better, with a 3.0x speedup.
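The scaling behaviour above can be probed in isolation by timing just the k-neighbors graph construction at different `n_jobs` settings. A minimal sketch, using small synthetic data as a stand-in for MNIST so it runs quickly (the sizes and the `n_neighbors=91` value, roughly `3 * perplexity + 1` for the default perplexity of 30, are illustrative assumptions):

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(2000, 50)  # small stand-in for high-dimensional image data

for n_jobs in (1, 2):
    # TSNE needs roughly 3 * perplexity + 1 neighbors per sample
    nn = NearestNeighbors(n_neighbors=91, n_jobs=n_jobs).fit(X)
    t0 = time.perf_counter()
    graph = nn.kneighbors_graph(X, mode='distance')
    print(f"n_jobs={n_jobs}: {time.perf_counter() - t0:.3f}s")
```

At these toy sizes the parallel overhead can dominate; the speedups quoted above were measured on 20k+ samples.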
#10044 (comment) proposed smarter improvements that are still worth investigating, but in any case the NN search is a bottleneck and we can't use
Not sure if this needs additional tests given that
In the end using the
Still, this PR remains relevant: for lower-dimensional data, `kd_tree` or `ball_tree` could be the right choice.
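For reference, a sketch of what that choice looks like through the `NearestNeighbors` API. The explicit `algorithm='kd_tree'` and the data shape are illustrative assumptions; in practice `algorithm='auto'` picks a tree for low-dimensional dense data on its own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_low = rng.rand(5000, 3)  # low-dimensional: a tree can prune the search

nn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree', n_jobs=2).fit(X_low)
distances, indices = nn.kneighbors(X_low)
print(indices.shape)  # (5000, 10); each query's first hit is itself
```

For high-dimensional data like raw MNIST pixels, tree pruning degrades and brute force (which is what parallelizes well here) tends to win.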
You mean with precomputed sparse distance matrices? #10482 added
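The precomputed path could look roughly like this: build a sparse k-neighbors distance graph first, then hand it to TSNE with `metric='precomputed'`. Whether a given scikit-learn version accepts a sparse matrix here is version-dependent, so treat this as a hedged sketch rather than a guaranteed recipe; the sizes are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(200, 20)

# The graph must contain at least ~3 * perplexity + 1 neighbors per sample,
# so 100 neighbors comfortably covers the default perplexity of 30.
graph = kneighbors_graph(X, n_neighbors=100, mode='distance', n_jobs=2)

# init='random' because PCA initialization is unavailable with precomputed distances.
emb = TSNE(metric='precomputed', init='random').fit_transform(graph)
print(emb.shape)  # (200, 2)
```

This decouples the neighbors search (which this PR parallelizes) from the embedding optimization entirely.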
The point of this PR, however, is that we can make TSNE significantly faster than it currently is, even without external dependencies.