These examples look at two simple techniques that can be used to improve the performance of your pandas code.

The first is categoricals, for improving data efficiency and processing; the second is numexpr, for improving the performance of expression evaluation.
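As a rough illustration of the data-efficiency point, here is a minimal sketch that compares the memory footprint of a string column before and after conversion to a categorical. The data is made up for the example (three hypothetical labels repeated many times), and the exact byte counts will vary with your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 labels drawn from only three distinct values.
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(["red", "green", "blue"], size=100_000))

# A categorical stores each distinct label once, plus a compact
# integer code per row, instead of one string per row.
as_category = labels.astype("category")

string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"string column:      {string_bytes / 1e6:.2f} MB")
print(f"categorical column: {category_bytes / 1e6:.2f} MB")
```

The saving depends on how repetitive the column is: with few distinct values and many rows, as here, the categorical should be considerably smaller.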

## Making Expressions Faster

The following is a basic example of expression evaluation in pure Python. Note the timing: the list comprehension calls `f` once per element, roughly 2.5 million times.

In [1]:
import math
loops = 2500000
a = range(1, loops)

def f(x):
    return 3 * math.log(x) + math.cos(x) ** 2

%timeit [f(x) for x in a]

1.27 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


As already seen when looking at vectorization, we can immediately improve performance by moving away from Python's own math module and using NumPy's vectorized math functions instead.

In [2]:
import numpy as np
a = np.arange(1, loops)

%timeit 3 * np.log(a) + np.cos(a) ** 2

58.7 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
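Before trusting a faster version, it is worth confirming that it computes the same values. The following is a minimal sketch of such a check on a smaller range; the size `n` is an arbitrary choice for illustration, not a value from the timings above.

```python
import math
import numpy as np

def f(x):
    # Same expression as the pure Python example.
    return 3 * math.log(x) + math.cos(x) ** 2

n = 10_000  # small, illustrative size so the check runs quickly

# Pure Python: one function call per element.
python_result = [f(x) for x in range(1, n)]

# NumPy: the same expression applied to the whole array at once.
a = np.arange(1, n)
numpy_result = 3 * np.log(a) + np.cos(a) ** 2

# Compare with a floating-point tolerance rather than exact equality.
print(np.allclose(python_result, numpy_result))
```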


That is already roughly a 20-fold improvement (1.27 s down to 58.7 ms), but we can do better.

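Numexpr is a package that lets us 'compile' an expression, supplied as a string, and then evaluate it in its own native-code virtual machine. It works through the arrays in chunks rather than building a full-size temporary array for each intermediate result.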

In [5]:
import numexpr as ne
ne.set_num_threads(1)

f = '3 * log(a) + cos(a) ** 2'
%timeit ne.evaluate(f)

15.2 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
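In the cell above, `ne.evaluate` finds `a` by looking up the name in the calling scope. The sketch below passes the array explicitly through the `local_dict` argument instead, and checks the result against NumPy. The array size and the `float64` dtype are choices made for this illustration.

```python
import numexpr as ne
import numpy as np

# Illustrative input; float64 is an assumption made for this sketch.
a = np.arange(1, 10_000, dtype=np.float64)

expr = "3 * log(a) + cos(a) ** 2"

# Bind the name `a` in the expression to our array explicitly,
# rather than relying on lookup in the calling scope.
result = ne.evaluate(expr, local_dict={"a": a})

# The same expression evaluated with NumPy, for comparison.
expected = 3 * np.log(a) + np.cos(a) ** 2

print(np.allclose(result, expected))
```

Passing `local_dict` can make it clearer which variables an expression depends on, particularly when the expression string is built elsewhere in the program.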


And it gets better. If more cores are available for the computation, we can increase the number of threads that numexpr will use.

In [6]:
ne.set_num_threads(4)
%timeit ne.evaluate(f)

7.29 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
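Rather than hard-coding the thread count, one option is to derive it from the machine. This is a minimal sketch, assuming a cap of 4 threads to match the example above; the cap is an arbitrary choice, and the best value depends on your hardware and workload.

```python
import numexpr as ne

# Ask numexpr how many cores it detects, capped at 4 for this sketch.
threads = min(ne.detect_number_of_cores(), 4)

# set_num_threads returns the previous setting, so we can restore it later.
previous = ne.set_num_threads(threads)
print(f"using {threads} thread(s), previously {previous}")

# ... run ne.evaluate(...) calls here ...

# Restore the earlier setting; the call returns the value we had just set.
restored_from = ne.set_num_threads(previous)
```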


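With four threads, the evaluation takes about half the single-threaded time (7.29 ms versus 15.2 ms). That is roughly 8 times faster than the NumPy version and more than 170 times faster than the pure Python loop.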