A few docs take orders of magnitude longer than others #3448

Closed
kastnerkyle opened this Issue Jul 20, 2014 · 7 comments

5 participants

kastnerkyle commented Jul 20, 2014

Various Agglomerative Clustering on a 2D embedding of digits
SNIP
ward : 96.95s
average : 96.07s
complete : 97.23s
 - time elapsed : 3.2e+02 sec

Is there any way we can reduce the time it takes to make this doc? It is way, way slower than the rest. For this plot, it is 320 seconds, versus approximately 90 seconds for the next worst plot (LARS image denoising... which is next on my list to look at). All others besides a few seem to be in the 2 to 10 second range.
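For context, the linkage comparison in that example boils down to calls like the following. This is a minimal sketch using `scipy.cluster.hierarchy.linkage` on synthetic 2D points (an assumption: the real example first computes a spectral embedding of the digits dataset):

```python
import time

import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic 2D points standing in for the digits embedding
rng = np.random.RandomState(0)
X = rng.rand(1000, 2)

for method in ("ward", "average", "complete"):
    start = time.time()
    Z = linkage(X, method=method)  # (n_samples - 1) x 4 linkage matrix
    print("%s : %.2fs" % (method, time.time() - start))
```

The absolute timings will differ from the doc build, but the relative cost of the three linkage methods should be comparable.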

kastnerkyle commented Jul 20, 2014

Fuel for the fire...
[attached figures: example_time_hist, cumulative_sum_times]

make html 2>&1 | tee log.log
grep 'time elapsed' log.log | cut -d : -f 2 | cut -d ' ' -f 2 | cut -d 's' -f 1 > times.txt
grep 'plot_' log.log | grep -v '\[' | grep -v example | grep -v File | cut -d ' ' -f 2 > names.txt

then run this script

import numpy as np
import matplotlib.pyplot as plt

# Load the per-example times and names extracted above
times = np.loadtxt('times.txt')
with open('names.txt') as f:
    names = [l.strip() for l in f]
sorted_indices = np.argsort(times)

# Histogram of example build times, annotating the slowest example
n, bins, patches = plt.hist(times, bins=100, color='steelblue')
plt.annotate(names[sorted_indices[-1]],
             xy=(bins[-1], n[-1]),
             xytext=(-25, 25),
             xycoords='data',
             textcoords='offset points',
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3,rad=0.3"))
plt.title("Histogram of document/example times")
plt.xlabel("Time (s)")
plt.ylabel("Count")

# Cumulative sum of the sorted times
plt.figure()
plt.plot(np.cumsum(np.sort(times)), color='steelblue')
plt.title("Cumulative sum of document/example times")
plt.xlabel("Example count")
plt.ylabel("Time (s), total %i seconds" % times.sum())
plt.show()
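The grep/cut pipeline above is fragile across log formats; the same extraction can be done in pure Python (a sketch, assuming each example logs a line containing `time elapsed : <value> sec`, as in the build output quoted earlier):

```python
import re

# Hypothetical log excerpt in the format the doc build produces
log = """plot_foo.py
 - time elapsed : 3.2e+02 sec
plot_bar.py
 - time elapsed : 1.5 sec"""

times = [float(m.group(1))
         for m in re.finditer(r"time elapsed : ([0-9.e+-]+) sec", log)]
print(times)  # [320.0, 1.5]
```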

@kastnerkyle kastnerkyle changed the title from Agglomerative clustering doc takes orders of magnitude longer than others to A few docs take orders of magnitude longer than others Jul 20, 2014

arjoly commented Jul 20, 2014

Wow, 20 examples out of 140 account for around 1000s (71%) of the total time.
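That fraction is easy to recompute from the `times.txt` produced earlier; a sketch with synthetic numbers chosen to mimic the distribution (an assumption, not the real data):

```python
import numpy as np

# Synthetic stand-in for times.txt: 120 fast examples plus 20 slow ones
times = np.concatenate([np.full(120, 3.0), np.full(20, 45.0)])

# Fraction of total build time consumed by the 20 slowest examples
top20 = np.sort(times)[-20:]
frac = top20.sum() / times.sum()
print("top 20 of %d examples: %.0f%% of total time" % (len(times), 100 * frac))
```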


GaelVaroquaux commented Jul 23, 2014
  • time elapsed : 3.2e+02 sec

Interesting, it is much slower for you than on my box, where it takes 90s:
more acceptable, but still slow.

Is there any way we can reduce the time it takes to make this doc?

Reducing it (for instance by reducing the number of points) would no longer
show the percolation behavior, and would thus reduce the difference
between the various clustering algorithms.

That said, 320s is too long. We might have to do something, if other
people can confirm that it takes so long.


kastnerkyle commented Jul 23, 2014

Well, my laptop is pretty old, but even 90s seems long to me :) . It would be interesting to see on another box whether the distribution is still the same, or whether that particular example is just a result of poor model optimization on old hardware.


larsmans commented Jul 30, 2014

Bumming the actual code is an option too ;)


larsmans commented Oct 19, 2014

I just checked and the example is spending ca. 100% of its time in scipy.cluster. Different SciPy versions can explain the timing differences. If we want to speed this stuff up, we have to fork scipy.cluster.
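A claim like that can be checked with `cProfile`; a minimal sketch (profiling a small `ward` linkage call, with sizes chosen arbitrarily rather than taken from the example) would be:

```python
import cProfile
import io
import pstats

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(0)
X = rng.rand(500, 2)

# Profile a single ward linkage call
profiler = cProfile.Profile()
profiler.enable()
linkage(X, method="ward")
profiler.disable()

# Report the ten most expensive calls by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

If nearly all the cumulative time sits under `scipy.cluster.hierarchy` frames, that confirms the bottleneck is in SciPy rather than in the example's own code.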


amueller commented Oct 27, 2016

takes <1s on master.


@amueller amueller closed this Oct 27, 2016
