Early stopping criteria in TSNE set too high #24776

Closed
pavlin-policar opened this issue Oct 28, 2022 · 10 comments


@pavlin-policar

As I was benchmarking the scikit-learn TSNE implementation, I ran across a strange problem. I run my benchmarks on a large data set of ~1.3 million data points, which can be downloaded from http://file.biolab.si/opentsne/benchmark/10x_mouse_zheng.pkl.gz and opened with

import gzip, pickle
from os import path

with utils.Timer("Loading data...", verbose=True):  # utils.Timer is my own timing helper
    with gzip.open(path.join("data", "10x_mouse_zheng.pkl.gz"), "rb") as f:
        data = pickle.load(f)

However, when I ran TSNE on this data, I was surprised that the results weren't good at all. For instance,

from sklearn.manifold import TSNE
tsne_embedding_sklearn = TSNE(verbose=2).fit_transform(data["pca_50"])

outputs

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1306127 samples in 0.260s...
[t-SNE] Computed neighbors for 1306127 samples in 864.820s...
...
[t-SNE] Computed conditional probabilities for sample 1306127 / 1306127
[t-SNE] Mean sigma: 1.052610
[t-SNE] Computed conditional probabilities in 18.266s
[t-SNE] Iteration 50: error = 141.9723663, gradient norm = 0.0000003 (50 iterations in 209.344s)
[t-SNE] Iteration 100: error = 141.9723663, gradient norm = 0.0000000 (50 iterations in 215.698s)
[t-SNE] Iteration 100: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 100 iterations with early exaggeration: 141.972366
[t-SNE] Iteration 150: error = 9.5632687, gradient norm = 0.0000000 (50 iterations in 223.293s)
[t-SNE] Iteration 150: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 150 iterations: 9.563269

and produces
[figure: resulting t-SNE embedding, default parameters]
which has clearly not converged.

The output indicates that the optimization ran for only about 150 iterations in total because the gradient norm dropped below the min_grad_norm threshold. I can fix this by setting the min_grad_norm parameter to zero:

tsne_embedding_sklearn_fixed = TSNE(min_grad_norm=0, verbose=2).fit_transform(data["pca_50"])

which outputs

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1306127 samples in 0.097s...
[t-SNE] Computed neighbors for 1306127 samples in 1547.009s...
...
[t-SNE] Mean sigma: 1.052610
[t-SNE] Computed conditional probabilities in 19.482s
[t-SNE] Iteration 50: error = 141.9723663, gradient norm = 0.0000003 (50 iterations in 212.290s)
[t-SNE] Iteration 100: error = 141.9723663, gradient norm = 0.0000000 (50 iterations in 237.889s)
[t-SNE] Iteration 150: error = 141.9723663, gradient norm = 0.0000000 (50 iterations in 255.057s)
[t-SNE] Iteration 200: error = 141.9723663, gradient norm = 0.0000001 (50 iterations in 252.916s)
[t-SNE] Iteration 250: error = 141.9723663, gradient norm = 0.0000009 (50 iterations in 342.999s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 141.972366
[t-SNE] Iteration 300: error = 9.5632687, gradient norm = 0.0000020 (50 iterations in 405.740s)
[t-SNE] Iteration 350: error = 9.5632391, gradient norm = 0.0000139 (50 iterations in 425.176s)
[t-SNE] Iteration 400: error = 9.5540562, gradient norm = 0.0002201 (50 iterations in 403.761s)
[t-SNE] Iteration 450: error = 9.0204744, gradient norm = 0.0007621 (50 iterations in 256.870s)
[t-SNE] Iteration 500: error = 8.2073746, gradient norm = 0.0004778 (50 iterations in 217.761s)
[t-SNE] Iteration 550: error = 7.7878709, gradient norm = 0.0003390 (50 iterations in 199.893s)
[t-SNE] Iteration 600: error = 7.5136371, gradient norm = 0.0002662 (50 iterations in 202.268s)
[t-SNE] Iteration 650: error = 7.3095994, gradient norm = 0.0002202 (50 iterations in 217.566s)
[t-SNE] Iteration 700: error = 7.1467400, gradient norm = 0.0001886 (50 iterations in 202.001s)
[t-SNE] Iteration 750: error = 7.0109062, gradient norm = 0.0001652 (50 iterations in 200.462s)
[t-SNE] Iteration 800: error = 6.8946157, gradient norm = 0.0001469 (50 iterations in 201.288s)
[t-SNE] Iteration 850: error = 6.7932062, gradient norm = 0.0001322 (50 iterations in 196.971s)
[t-SNE] Iteration 900: error = 6.7032018, gradient norm = 0.0001203 (50 iterations in 213.613s)
[t-SNE] Iteration 950: error = 6.6225247, gradient norm = 0.0001104 (50 iterations in 219.994s)
[t-SNE] Iteration 1000: error = 6.5494680, gradient norm = 0.0001020 (50 iterations in 225.912s)
[t-SNE] KL divergence after 1000 iterations: 6.549468

and produces
[figure: resulting t-SNE embedding with min_grad_norm=0]

This works correctly, but the end result still hasn't converged, which is to be expected with the standard learning_rate=200. I was also pleasantly surprised to find that there is now a learning_rate="auto" option, which also solves the early stopping issue.

tsne_embedding_sklearn_lr_auto = TSNE(learning_rate="auto", verbose=2).fit_transform(data["pca_50"])

outputs

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1306127 samples in 0.100s...
[t-SNE] Computed neighbors for 1306127 samples in 864.156s...
...
[t-SNE] Mean sigma: 1.052610
[t-SNE] Computed conditional probabilities in 18.788s
[t-SNE] Iteration 50: error = 141.9426575, gradient norm = 0.0005064 (50 iterations in 511.803s)
[t-SNE] Iteration 100: error = 126.3311920, gradient norm = 0.0012780 (50 iterations in 237.508s)
[t-SNE] Iteration 150: error = 122.2045288, gradient norm = 0.0006919 (50 iterations in 219.642s)
[t-SNE] Iteration 200: error = 120.6939392, gradient norm = 0.0005690 (50 iterations in 641.196s)
[t-SNE] Iteration 250: error = 119.9162292, gradient norm = 0.0005069 (50 iterations in 1209.537s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 119.916229
[t-SNE] Iteration 300: error = 6.5854406, gradient norm = 0.0001135 (50 iterations in 270.180s)
[t-SNE] Iteration 350: error = 6.0168037, gradient norm = 0.0000586 (50 iterations in 243.136s)
[t-SNE] Iteration 400: error = 5.7149744, gradient norm = 0.0000381 (50 iterations in 192.304s)
[t-SNE] Iteration 450: error = 5.5248947, gradient norm = 0.0000279 (50 iterations in 191.463s)
[t-SNE] Iteration 500: error = 5.3935366, gradient norm = 0.0000221 (50 iterations in 1107.546s)
[t-SNE] Iteration 550: error = 5.2953587, gradient norm = 0.0000185 (50 iterations in 1211.873s)
[t-SNE] Iteration 600: error = 5.2193165, gradient norm = 0.0000160 (50 iterations in 3095.981s)
[t-SNE] Iteration 650: error = 5.1583886, gradient norm = 0.0000142 (50 iterations in 1824.597s)
[t-SNE] Iteration 700: error = 5.1083159, gradient norm = 0.0000128 (50 iterations in 3901.780s)
[t-SNE] Iteration 750: error = 5.0662918, gradient norm = 0.0000117 (50 iterations in 1222.378s)
[t-SNE] Iteration 800: error = 5.0304165, gradient norm = 0.0000108 (50 iterations in 240.268s)
[t-SNE] Iteration 850: error = 4.9990072, gradient norm = 0.0000100 (50 iterations in 260.818s)
[t-SNE] Iteration 900: error = 4.9712958, gradient norm = 0.0000094 (50 iterations in 268.655s)
[t-SNE] Iteration 950: error = 4.9466047, gradient norm = 0.0000088 (50 iterations in 273.766s)
[t-SNE] Iteration 1000: error = 4.9244399, gradient norm = 0.0000083 (50 iterations in 312.489s)
[t-SNE] KL divergence after 1000 iterations: 4.924440

producing
[figure: resulting t-SNE embedding with learning_rate="auto"]

This is very similar to what I get with openTSNE using default parameters:

[figure: openTSNE embedding, default parameters]

The differences likely stem from the initialization (which I'm also glad to see is going to default to "pca" in upcoming versions).

Notice that in this last example, I didn't have to set min_grad_norm to zero. However, the default behaviour shown in the first example is wrong and should probably be fixed. Perhaps setting min_grad_norm to a lower default would be a solution, or removing it altogether wouldn't hurt either. In my experience with t-SNE, I've never come across any meaningful example where the min_grad_norm criterion was actually met.
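For context, here is a minimal sketch of the stopping rule at play. This is an illustration of the behaviour, not scikit-learn's actual code; the threshold of 1e-7 is the documented default for min_grad_norm.

import numpy as np

# Illustrative gradient-descent loop: optimization halts as soon as the
# gradient norm drops below min_grad_norm, even if the embedding is still
# far from converged, which is what the logs above show happening.
def optimize(embedding, compute_grad, n_iter=1000, learning_rate=200.0,
             min_grad_norm=1e-7):
    for _ in range(n_iter):
        grad = compute_grad(embedding)
        if np.linalg.norm(grad) <= min_grad_norm:
            break  # the criterion that fires after ~150 iterations above
        embedding -= learning_rate * grad
    return embedding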

Scikit-learn versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ]
executable: /Users/pavlin/miniconda3/envs/ml/bin/python
   machine: macOS-12.6-arm64-arm-64bit

Python dependencies:
      sklearn: 1.1.2
          pip: 22.1.2
   setuptools: 63.4.1
        numpy: 1.23.1
        scipy: 1.9.1
       Cython: 0.29.32
       pandas: 1.4.3
   matplotlib: 3.5.3
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/pavlin/miniconda3/envs/ml/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 10

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/pavlin/miniconda3/envs/ml/lib/libopenblasp-r0.3.20.dylib
        version: 0.3.20
threading_layer: pthreads
   architecture: armv8
    num_threads: 10

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/pavlin/miniconda3/envs/ml/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.18
threading_layer: pthreads
   architecture: armv8
    num_threads: 10
The github-actions bot added the Needs Triage label on Oct 28, 2022.
@TomDLT (Member) commented Oct 28, 2022

Thanks for your detailed explanation. I would be curious to know more about your benchmark results.

For the convergence issue you are reporting, it seems that updating scikit-learn to version 1.2 would solve the problem, because it changes the default parameter to learning_rate="auto". Do you think it is still necessary to change the default value of min_grad_norm? (also ping @dkobak if available)

@dkobak (Contributor) commented Oct 29, 2022

Hi Pavlin! And thanks, Tom, for pinging me.

Several important changes are scheduled to become default in version 1.2, in particular PCA init (correctly scaled) and O(n) learning rate (see #18018). I have no opinion about the min_grad_norm parameter: I think it's probably not really needed, but after the changes in 1.2 it won't really hurt either.
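For reference, the O(n) heuristic (Belkina et al., 2019) scales the learning rate linearly with the sample size. A minimal sketch of what learning_rate="auto" computes follows; the division by 4 and the floor of 50 reflect my reading of the implementation, so treat them as assumptions rather than a spec.

def auto_learning_rate(n_samples, early_exaggeration=12.0):
    # the learning rate grows with n instead of staying fixed at 200
    return max(n_samples / early_exaggeration / 4, 50.0)

auto_learning_rate(1_306_127)  # ~27211 for the dataset above, vs. the old default of 200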

@dkobak (Contributor) commented Oct 29, 2022

By the way, I checked now, and it seems the 1.2 release is expected in the "coming weeks", which is great!

@pavlin-policar (Author)

Hey Dmitry, I'm glad to see these changes finally make their way into scikit-learn!

> Thanks for your detailed explanation. I would be curious to know more about your benchmark results.

Sure thing, I'm putting these benchmarks together for openTSNE, so I can ping you once they're finalized. But the benchmarks are pretty much the same as they've always been. For instance, for 1 million data points, using 8 cores, openTSNE (FFT) and FIt-SNE take about 15 minutes, openTSNE (BH) roughly 60 minutes, MulticoreTSNE roughly 95 minutes, and scikit-learn roughly 2 hours. From what I can remember, the scikit-learn implementation was particularly slow in the past, so this is a wonderful improvement.

Regarding the actual issue at hand, I removed the min_grad_norm parameter from openTSNE a while back (pavlin-policar/openTSNE#113) and it has never caused any issues. I had never come across the bug described here when using openTSNE, but since this is an issue in scikit-learn, I'd probably recommend removing it altogether. As I've said, I've never actually seen the stopping condition met in any realistic example, so it seems like dead code that can potentially be harmful (as in this example).

@dkobak (Contributor) commented Oct 30, 2022

> for 1 million data points, using 8 cores, openTSNE (FFT) and FIt-SNE take about 15 minutes, openTSNE (BH) roughly 60 minutes, MulticoreTSNE roughly 95 minutes, and scikit-learn roughly 2 hours.

Would it make sense to separately profile openTSNE BH with exact nearest neighbors? Sklearn uses exact kNN, whereas openTSNE/FIt-SNE/etc. use approximate kNN, which for 1 million points is of course faster.

> since this is an issue in scikit-learn, I'd probably recommend removing it altogether.

I agree that it can be removed, but do not personally feel strongly about it; any API change to sklearn would need to go through a long deprecation cycle, which is quite a bit of hassle...

@pavlin-policar (Author)

> Would it make sense to separately profile openTSNE BH with exact nearest neighbors? Sklearn uses exact kNN, whereas openTSNE/FIt-SNE/etc. use approximate kNN, which for 1 million points is of course faster.

Yes, that would definitely make a lot of sense. But here, I was more interested in the overall runtime than in how fast particular optimization schemes are.

> any API change to sklearn would need to go through a long deprecation cycle, which is quite a bit of hassle...

Yeah, I suppose that would be quite a hassle, and I guess it may not be worth it then. I don't have strong feelings on this either, though I would still recommend removing this parameter down the road; I suppose it's up to the core developers to decide what they want to do. Perhaps this issue itself can be useful to future users who run into the same problem, and maybe that's enough.

@TomDLT (Member) commented Oct 31, 2022

> Sklearn uses exact kNN, whereas openTSNE/FIt-SNE/etc. use approximate kNN, which for 1 million points is of course faster.

FYI (not necessarily for your benchmark), here is an example of how to precompute approximate nearest neighbors and use them in scikit-learn. Recently, the example has not been showing a big difference compared with exact nearest neighbors, due to large speedups in scikit-learn's exact nearest neighbors, but with more data points it could still be useful.
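In outline, that approach looks like this. This is a minimal sketch: the exact KNeighborsTransformer stands in for whatever approximate backend the example plugs in, and n_neighbors=91 is an assumption matching the "Computing 91 nearest neighbors" line in the logs above (3 * perplexity + 1 for the default perplexity of 30).

from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

# Any transformer producing a sparse distance graph works as the first step;
# the linked example swaps in approximate backends with the same interface.
# Note that init="random" is required when metric="precomputed".
tsne = make_pipeline(
    KNeighborsTransformer(n_neighbors=91, mode="distance"),
    TSNE(metric="precomputed", init="random", verbose=2),
)
embedding = tsne.fit_transform(data["pca_50"])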

@betatim (Member) commented Nov 17, 2022

The issue should stop happening thanks to #19491, which will be part of v1.2.

If you/someone finds a dataset where the new behaviour is still problematic please open a new issue.

Let's close this issue for the time being.

betatim removed the Needs Triage label on Nov 17, 2022.
@pavlin-policar (Author)

Great, glad to hear it!

@ogrisel (Member) commented Nov 17, 2022

Actually, the change that makes learning_rate="auto" the default in 1.2 was merged as part of #24389.
