A few docs take orders of magnitude longer than others #3448

Open
kastnerkyle opened this Issue Jul 20, 2014 · 6 comments

4 participants

@kastnerkyle
scikit-learn member
Various Agglomerative Clustering on a 2D embedding of digits
SNIP
ward : 96.95s
average : 96.07s
complete : 97.23s
 - time elapsed : 3.2e+02 sec

Is there any way we can reduce the time it takes to make this doc? It is way, way slower than the rest. For this plot, it is 320 seconds, versus approximately 90 seconds for the next worst plot (LARS image denoising... which is next on my list to look at). All others besides a few seem to be in the 2 to 10 second range.

@kastnerkyle
scikit-learn member

Fuel for the fire...
[Figure: example_time_hist (histogram of example times)]
[Figure: cumulative_sum_times (cumulative sum of example times)]

make html 2>&1 | tee log.log
grep 'time elapsed' log.log | cut -d : -f 2 | cut -d ' ' -f 2 | cut -d 's' -f 1 > times.txt
grep 'plot_' log.log | grep -v '\[' | grep -v example | grep -v File | cut -d ' ' -f 2 > names.txt

Then run this script:

import numpy as np
import matplotlib.pyplot as plt

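# One elapsed time (in seconds) per line, extracted from the build log
times = np.loadtxt('times.txt')
# One example name per line, in the same order as times.txt
with open('names.txt') as f:
    names = [line.strip() for line in f]
sorted_indices = np.argsort(times)
# Histogram of per-example times; the slowest example lands in the last bin
n, bins, patches = plt.hist(times, bins=100, color='steelblue')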
plt.annotate(names[sorted_indices[-1]],
             xy=(bins[-1], n[-1]),
             xytext=(-25, 25),
             xycoords='data',
             textcoords='offset points',
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3,rad=0.3"))
plt.title("Histogram of document/example times")
plt.xlabel("Time (s)")
plt.ylabel("Count")
plt.figure()
plt.plot(np.cumsum(np.sort(times)), color='steelblue')
plt.title("Cumulative sum of document/example times")
plt.xlabel("Test count")
plt.ylabel("Time (s), total %i seconds" % times.sum())
plt.show()
@kastnerkyle kastnerkyle changed the title from Agglomerative clustering doc takes orders of magnitude longer than others to A few docs take orders of magnitude longer than others Jul 20, 2014
@arjoly
scikit-learn member
arjoly commented Jul 20, 2014

Wow, 20 of the 140 examples account for around 1000 s (71%) of the total time.
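
A rough way to check a figure like this is to rank the examples by time and compute the share of the total taken by the k slowest ones. This is only a sketch: `slowest_share` is a hypothetical helper, not part of the script above, and the names and most of the timings in the demo are made up for illustration (only the 320 s and 90 s values come from this thread). In practice the inputs would be the contents of times.txt and names.txt.

```python
def slowest_share(times, names, k=20):
    """Return the k slowest (name, seconds) pairs and their share of the total time.

    Hypothetical helper for illustration; `times` and `names` are parallel lists.
    """
    if len(times) != len(names):
        raise ValueError("times and names must have the same length")
    # Rank examples from slowest to fastest
    ranked = sorted(zip(names, times), key=lambda pair: pair[1], reverse=True)
    top = ranked[:k]
    total = sum(times)
    # Fraction of the whole build spent in the k slowest examples
    share = sum(t for _, t in top) / total if total else 0.0
    return top, share


# Made-up data: two slow examples and a handful of fast ones
demo_names = ["plot_slow_a", "plot_slow_b", "plot_c", "plot_d", "plot_e"]
demo_times = [320.0, 90.0, 5.0, 3.0, 2.0]

top, share = slowest_share(demo_times, demo_names, k=2)
for name, seconds in top:
    print("%-12s %7.1f s" % (name, seconds))
print("share of total: %.0f%%" % (100 * share))
```

Sorting once and slicing keeps the names attached to their times, which makes it easy to see which examples to look at first.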

@GaelVaroquaux
scikit-learn member
@kastnerkyle
scikit-learn member

Well, my laptop is pretty old, but even 90s seems long to me :). It would be interesting to see on another box whether the distribution is still the same, or whether that particular test is just the result of poor model optimization on old hardware.

@larsmans
scikit-learn member

Bumming the actual code is an option too ;)

@larsmans
scikit-learn member

I just checked and the example is spending ca. 100% of its time in scipy.cluster. Different SciPy versions can explain the timing differences. If we want to speed this stuff up, we have to fork scipy.cluster.
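
For anyone who wants to repeat this kind of check, one possible approach (not necessarily how the numbers above were obtained) is to run the example's main computation under `cProfile` and sort by cumulative time. The sketch below uses only the standard library; `profile_report` and `stand_in_workload` are hypothetical names, and the workload is a placeholder for the real example code.

```python
import cProfile
import io
import pstats


def profile_report(func, limit=10):
    """Run func under cProfile and return a text report of the top entries.

    Hypothetical helper for illustration; entries are sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


def stand_in_workload():
    # Placeholder for the real example body (e.g. the clustering calls)
    return sum(i * i for i in range(200000))


report = profile_report(stand_in_workload)
print(report)
```

In the report, the file paths in the rightmost column show which package each function lives in, so time spent under a given module can be read off the top few rows.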