## Performance Measurement and Improvement Techniques

#### Goals
<p> In image processing you deal with a large number of operations per second, so your code must not only produce
the correct result but also do so quickly. You will learn:
<p>• To measure the performance of your code.
<p>• Some tips to improve the performance of your code.
<p>• You will see these functions: cv2.getTickCount, cv2.getTickFrequency, etc.
<p>Apart from OpenCV, Python provides the time module, which is helpful for measuring execution time. Another
module, profile, produces a detailed report on the code, such as how much time each function took and how many
times it was called. If you are using IPython, all of these features are integrated in a user-friendly
manner. We will see some important ones; for more details, check the links in the Additional Resources section.
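
As a rough illustration of the kind of report the profiling modules produce, here is a minimal sketch. It uses cProfile, the C implementation that exposes the same interface as profile, together with pstats; the workload functions `slow_square_sum` and `run_workload` are hypothetical stand-ins for real image-processing code.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Hypothetical workload: sum of squares computed in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_workload():
    """Call the workload several times so the call count shows in the report."""
    return [slow_square_sum(50_000) for _ in range(5)]


# Profile only the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()

# Write the report into a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists, for each function, the number of calls and the time spent in it, which is usually enough to spot where a script spends most of its time.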


#### The Functions

<p>**cv2.getTickCount** returns the number of clock cycles (ticks) elapsed between a reference event (such as the
moment the machine was switched on) and the moment the function is called. Calling it before and after a piece of
code therefore gives the number of clock cycles that code took to run.
<p>**cv2.getTickFrequency** returns the frequency of the clock cycles, that is, the number of clock cycles per second.
To find the execution time in seconds, you can do the following:

In [47]:
import cv2
import numpy as np

img1 = cv2.imread('img/neymar.jpg')
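e1 = cv2.getTickCount()  # tick count before the work
# Apply median blurs with odd kernel sizes 5, 7, ..., 47
for i in range(5, 49, 2):
    img1 = cv2.medianBlur(img1, i)
e2 = cv2.getTickCount()  # tick count after the work
t = (e2 - e1) / cv2.getTickFrequency()  # elapsed time in seconds
print(t)
# The result varies by machine; the run shown below took about 1.34 seconds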

1.33977951444


The example above applies median filtering repeatedly, with odd kernel sizes ranging from 5 to 47 (the upper
bound 49 is excluded), and reports the total elapsed time.

**Note:** You can do the same with the time module. Instead of cv2.getTickCount, call time.time() before and
after the code and take the difference of the two values, which is already in seconds.
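
The note above can be sketched as a small reusable helper. This is only an illustration under stated assumptions: it uses time.perf_counter() rather than time.time(), since perf_counter is a high-resolution clock intended for measuring intervals, and the helper name `time_call` is hypothetical.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Time a simple stand-in workload; replace it with your own function call.
result, elapsed = time_call(sum, range(1_000_000))
print(f"sum took {elapsed:.6f} s")
```

A single measurement like this is noisy, so for short operations it is usually better to repeat the call several times and look at the best or average time.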

## Default Optimization in OpenCV

<p>Many OpenCV functions are optimized using SSE2, AVX, etc., and the library also contains unoptimized fallback
code. If your system supports these instruction sets, you should exploit them (almost all modern processors do).
Optimization is enabled by default at compile time, so OpenCV runs the optimized code when it is enabled and the
unoptimized code otherwise. Use cv2.useOptimized() to check whether it is enabled and cv2.setUseOptimized() to
enable or disable it. Let's see a simple example.

In [48]:
# check if optimization is enabled
cv2.useOptimized()

%timeit res = cv2.medianBlur(img1,49)

# Disable it
cv2.setUseOptimized(False)
cv2.useOptimized()

%timeit res = cv2.medianBlur(img1,49)


10 loops, best of 3: 57.6 ms per loop
10 loops, best of 3: 58.7 ms per loop


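The two timings above are nearly identical (57.6 ms versus 58.7 ms), so on this run the flag made little
difference. The gain depends on the OpenCV build and the hardware; on other setups optimized median filtering can
be around 2x faster than the unoptimized version. If you check its source, you can see that median filtering is
SIMD optimized. You can call cv2.setUseOptimized(True) at the top of your code to make sure optimization is on
(remember that it is enabled by default).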

## Measuring Performance in IPython

<p>Sometimes you may need to compare the performance of two similar operations. IPython provides the magic command
%timeit for this. It runs the code several times to get more accurate results, and it is best suited to
measuring single-line statements.
<p>For example, do you know which of the following squaring operations is fastest: x = 5; y = x**2, x =
5; y = x*x, x = np.uint8([5]); y = x*x, or y = np.square(x)? We will find out with %timeit in the
IPython shell.

In [49]:
x = 5
%timeit y=x**2

%timeit y=x*x

z = np.uint8([5])
%timeit y=z*z

%timeit y=np.square(z)


The slowest run took 16.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 64 ns per loop
10000000 loops, best of 3: 74.3 ns per loop
The slowest run took 24.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 786 ns per loop
The slowest run took 10.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.29 µs per loop


You can see that the plain Python operations, x = 5; y = x**2 and y = x*x, are the fastest (64 ns and 74.3 ns),
roughly 10x to 20x faster than the Numpy versions (786 ns and 1.29 µs). If you also count the array creation,
the gap may reach up to 100x. Cool, right? (Numpy devs are working on this issue.)

<p>**Note:** Python scalar operations are faster than Numpy scalar operations. So for operations involving one or two
elements, Python scalars are better than Numpy arrays. Numpy gains the advantage once the array is a little bigger.
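
The note above can be checked outside IPython with the standard timeit module. The following is a minimal sketch, assuming Numpy is installed; the helper `best_time` and the chosen array sizes are illustrative assumptions, and the actual numbers vary by machine and Numpy version, so no expected output is shown.

```python
import timeit

import numpy as np


def best_time(stmt, names, number=10_000, repeat=3):
    """Best per-execution time in seconds for stmt, run with the given names."""
    runs = timeit.repeat(stmt, globals=names, number=number, repeat=repeat)
    return min(runs) / number


# Scalar comparison, mirroring the %timeit cell above.
names = {"x": 5, "z": np.uint8([5]), "np": np}
scalar_times = {
    stmt: best_time(stmt, names)
    for stmt in ("x**2", "x*x", "z*z", "np.square(z)")
}
for stmt, seconds in scalar_times.items():
    print(f"{stmt:>14}: {seconds * 1e9:8.1f} ns")

# Squaring every element: a Python list comprehension versus a Numpy array.
size_times = []
for size in (1, 10, 1000):
    values = list(range(size))
    arr = np.arange(size)
    t_list = best_time("[v * v for v in values]", {"values": values}, number=2000)
    t_numpy = best_time("arr * arr", {"arr": arr}, number=2000)
    size_times.append((size, t_list, t_numpy))
    print(f"size {size:>5}: list {t_list * 1e9:10.1f} ns, numpy {t_numpy * 1e9:10.1f} ns")
```

Comparing the two columns as the size grows shows roughly where Numpy's fixed per-call overhead stops dominating on a given machine.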

