### Introduction to NumExpr
- One interesting way of achieving parallelism in Python is through NumExpr, in which a symbolic evaluator transforms numerical Python expressions into high-performance, vectorized code.
- NumExpr achieves this by vectorizing in chunks of elements instead of compiling everything at once, thus creating accelerated object kernels that are usable from Python code.

### Make code run faster using NumExpr

In [1]:
# pip install numexpr

import numexpr as ne

In [2]:
# version of numexpr 
ne.__version__

'2.7.1'

**NumExpr** is a fast numerical expression evaluator for NumPy. 
- With it, expressions that operate on arrays are accelerated and use less memory than doing the same calculation in Python.
- It is multi-threaded, so it can make use of all your CPU cores.


### Why NumExpr works better 
The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general. Due to this, NumExpr works best with large arrays.
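To make the idea of avoiding intermediate results concrete, here is a minimal NumPy-only sketch of block-wise evaluation. It is not NumExpr's actual implementation: the helper name `chunked_eval` and the chunk size of 4096 are assumptions chosen for illustration. The point is the memory pattern, with two small scratch buffers reused for every block instead of full-size temporaries.

```python
import numpy as np

def chunked_eval(a, b, chunk=4096):
    """Compute 10*a - 5*b block by block (hypothetical helper, not part of NumExpr)."""
    out = np.empty_like(a)
    # Two small scratch buffers are reused for every block, so no
    # full-size temporary array is allocated for 10*a or 5*b.
    t1 = np.empty(chunk, dtype=a.dtype)
    t2 = np.empty(chunk, dtype=a.dtype)
    for start in range(0, a.size, chunk):
        stop = min(start + chunk, a.size)
        n = stop - start
        np.multiply(a[start:stop], 10, out=t1[:n])
        np.multiply(b[start:stop], 5, out=t2[:n])
        np.subtract(t1[:n], t2[:n], out=out[start:stop])
    return out

a = np.arange(1e6)
b = np.arange(1e6)
result = chunked_eval(a, b)

# The block-wise result should agree with the plain NumPy expression.
print(np.allclose(result, 10*a - 5*b))
```

The Python-level loop here is for illustration only and is not expected to be fast; NumExpr runs a comparable loop in compiled code and can spread the blocks across threads.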

To boost performance further, NumExpr can use the optimized Intel® Vector Mathematical Function Library (Intel® VML), included in Intel® Math Kernel Library (Intel® MKL). This makes it possible to accelerate the evaluation of mathematical functions, in particular transcendental ones such as exp, log, and sin.
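Whether VML is used depends on how the installed NumExpr was built. A quick check, assuming a reasonably recent NumExpr release that exposes `use_vml` and `get_vml_version()`, might look like this:

```python
import numexpr as ne

# use_vml is True only when this NumExpr build was linked against Intel VML.
if ne.use_vml:
    print("VML version:", ne.get_vml_version())
else:
    print("This NumExpr build does not use Intel VML")
```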

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### A simple vector expression


In [4]:
a = np.arange(1e6)
b = np.arange(1e6)

In [7]:
%timeit 10*a - 5*b

20.4 ms ± 734 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [8]:
%timeit ne.evaluate("10*a - 5*b")

5.58 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
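Before trusting the speedup, it is worth confirming that both approaches give the same answer. The sketch below also shows the `local_dict` argument of `ne.evaluate()`, which passes the operands explicitly instead of having them looked up in the calling scope; the names `p` and `q` are placeholders chosen for this example.

```python
import numpy as np
import numexpr as ne

a = np.arange(1e6)
b = np.arange(1e6)

# By default, evaluate() looks up a and b in the calling scope.
implicit = ne.evaluate("10*a - 5*b")

# Passing local_dict makes the operands explicit (p and q are placeholder names).
explicit = ne.evaluate("10*p - 5*q", local_dict={"p": a, "q": b})

print(np.allclose(implicit, 10*a - 5*b))
print(np.allclose(explicit, implicit))
```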


#### Boolean filtering

In [11]:
x1 = np.random.random(1000000)
x2 = np.random.random(1000000)
y1 = np.random.random(1000000)
y2 = np.random.random(1000000)

In [12]:
%%timeit -n100 -r10

c = np.sqrt((x1 - x2)**2 + (y1-y2)**2 ) > 0.5

38.1 ms ± 1.08 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)


In [14]:
%%timeit -n100 -r10
c = ne.evaluate("sqrt((x1 - x2)**2 + (y1-y2)**2 ) > 0.5")

7.56 ms ± 351 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
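The result of a comparison in NumExpr is an ordinary boolean NumPy array, so it can be used directly as a mask. A small sketch of that, using a seeded generator only so the example is reproducible (the variable names `mask` and `far_x1` are illustrative):

```python
import numpy as np
import numexpr as ne

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
x1, x2, y1, y2 = rng.random((4, 1_000_000))

# Evaluate the distance test in one pass; the result is a boolean array.
mask = ne.evaluate("sqrt((x1 - x2)**2 + (y1 - y2)**2) > 0.5")
print(mask.dtype)

# Use the mask to keep only the x1 values whose point pair is far apart.
far_x1 = x1[mask]
print(far_x1.size == np.count_nonzero(mask))
```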


#### A calculation with three large vectors

In [18]:
a, b, c = np.random.rand(3, 1000000)

In [20]:
%timeit a + (b**2 + (c*a + 1)*3)

34.9 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [21]:
%timeit ne.evaluate("a + (b**2 + (c*a + 1)*3)")

6.81 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
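When the same expression is evaluated repeatedly, the output array itself can also be reused through the `out` argument of `ne.evaluate()`. This is a sketch of the idea rather than a measured optimization; the buffer name `result` is an assumption for the example.

```python
import numpy as np
import numexpr as ne

a, b, c = np.random.rand(3, 1_000_000)

# Preallocate the output once and let evaluate() write into it.
result = np.empty_like(a)
ne.evaluate("a + (b**2 + (c*a + 1)*3)", out=result)

print(np.allclose(result, a + (b**2 + (c*a + 1)*3)))
```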


#### Number of cores to use
With NumExpr, we can specify the number of threads, and therefore how many cores are used, with the set_num_threads() function.

In [25]:
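# Note: range(4) starts at 0, which is not a valid thread count, so the
# first timing presumably runs with the thread setting already in effect.
for i in range(4):
    ne.set_num_threads(i)
    %timeit ne.evaluate('a + (b**2 + (c*a + 1)*3)')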

8.04 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.6 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.2 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.23 ms ± 524 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
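Outside a notebook, a similar comparison can be sketched with the standard `timeit` module. The version below starts at one thread, caps the count at four (an arbitrary choice for this example), and restores the previous setting afterwards, relying on `set_num_threads()` returning the value that was in effect. The `timings` dictionary and the repeat counts are assumptions, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np
import numexpr as ne

a, b, c = np.random.rand(3, 1_000_000)
operands = {"a": a, "b": b, "c": c}
expr = "a + (b**2 + (c*a + 1)*3)"

# Try 1..N threads, capped at 4 to keep the run short (arbitrary cap).
max_threads = min(ne.detect_number_of_cores(), 4)

# set_num_threads() returns the setting that was in effect before the call.
previous = ne.set_num_threads(1)

timings = {}
for n in range(1, max_threads + 1):
    ne.set_num_threads(n)
    runs = timeit.repeat(lambda: ne.evaluate(expr, local_dict=operands),
                         number=10, repeat=3)
    timings[n] = min(runs) / 10  # best time per call, in seconds
    print(f"{n} thread(s): {timings[n] * 1e3:.2f} ms per call")

ne.set_num_threads(previous)  # restore the original setting
```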
