# High Performance with `eval()` and `query()`

As we've seen in previous sections, both NumPy and Pandas provide an intuitive syntax for pushing basic operations into C code for efficiency. While these abstractions are effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can add overhead in both computation time and memory use.

However, Pandas includes some tools that allow direct access to operations at the speed of C without allocation of intermediate arrays. These are the `eval()` and `query()` functions, which rely on the [Numexpr](https://github.com/pydata/numexpr) package.

## Motivation: Compound Expressions

We've seen previously that NumPy and Pandas support fast vectorized operations (e.g. adding the elements of two arrays).

However, this abstraction can become less efficient when computing compound expressions. For example, consider the following expression:

In [1]:
import numpy as np

rng = np.random.RandomState(42)

x = rng.rand(1000000)
y = rng.rand(1000000)

# Compound expression
mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression separately, this is roughly equivalent to the following:

In [2]:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

That is, every intermediate step is _explicitly allocated in memory_. Depending on the size of `x` and `y`, this can lead to very significant memory and computational overhead. The Numexpr library gives us the ability to compute this type of expression element by element, without the need to allocate full intermediate arrays. Let's take a look:

In [3]:
import numexpr

mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
# Verify that we get the same result
np.array_equal(mask, mask_numexpr)

True

The main benefit here is that Numexpr evaluates the expression in a way that does not require full-sized temporary arrays, and thus can be much more efficient than NumPy (especially for large arrays). The Pandas `eval()` and `query()` functions that we will discuss here are similar, and depend on the Numexpr package.
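
To make the idea of avoiding full-sized temporaries concrete, here is a minimal pure-NumPy sketch that evaluates the same mask one chunk at a time. The helper name `chunked_mask` and the `chunk_size` value are assumptions made for illustration; this is not how Numexpr is implemented internally, only a simplified picture of why working on small blocks keeps the temporaries small.

```python
import numpy as np


def chunked_mask(x, y, chunk_size=100_000):
    """Evaluate (x > 0.5) & (y < 0.5) one chunk at a time.

    Hypothetical helper for illustration only; not Numexpr's
    actual implementation.
    """
    out = np.empty(x.shape, dtype=bool)
    for start in range(0, len(x), chunk_size):
        stop = start + chunk_size
        # The temporaries created on this line hold at most
        # chunk_size elements, never the full array
        out[start:stop] = (x[start:stop] > 0.5) & (y[start:stop] < 0.5)
    return out


rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

mask = (x > 0.5) & (y < 0.5)
print(np.array_equal(mask, chunked_mask(x, y)))
```

The only full-sized array allocated by the helper is the output itself. The Python-level loop makes this sketch slow in practice; Numexpr does the block-wise work in compiled code.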

## `eval()` for Efficient Operations

The `eval()` function in Pandas uses string expressions to efficiently compute operations using `DataFrame`s. For example, consider the following:

In [4]:
import pandas as pd

nrows, ncols = 100000, 100

df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))

Let's compute the sum of all four `DataFrame`s using the typical approach by just writing the sum:

In [5]:
%timeit df1 + df2 + df3 + df4

73.8 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Now let's compute the same result using `pd.eval()` by constructing the expression as a string:

In [6]:
%timeit pd.eval('df1 + df2 + df3 + df4')

30.4 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


That cuts the run time by about 60% (roughly 2.4 times faster) and uses much less memory, while giving the same result:

In [7]:
np.array_equal(df1 + df2 + df3 + df4,
              pd.eval('df1 + df2 + df3 + df4'))

True
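
As a small aside on how `pd.eval()` chooses its backend, the sketch below uses the `engine` argument. With `engine='python'` the expression is evaluated with ordinary Python operators, which can be handy for checking results; the frame sizes here are arbitrary values chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df1, df2 = (pd.DataFrame(rng.rand(1000, 5)) for _ in range(2))

# engine='python' evaluates the parsed expression with ordinary Python
# operators, so it does not need Numexpr at all
via_python = pd.eval('df1 + df2', engine='python')

# With no engine argument, recent Pandas versions use Numexpr when it
# is installed and fall back to the Python engine otherwise
via_default = pd.eval('df1 + df2')

print(np.allclose(via_python, via_default))
```

The Python engine brings no speed or memory benefit over writing `df1 + df2` directly; it is mainly useful as a reference when comparing against the Numexpr-backed result.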

### Supported operations

As of Pandas v1.1.0, `pd.eval()` supports a wide range of operations. Let's go over these using the following integer `DataFrame`s:

In [8]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for i in range(5))

#### Arithmetic operators

`pd.eval()` supports all arithmetic operators. For example:

In [9]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.array_equal(result1, result2)

True

#### Comparison operators

`pd.eval()` supports all comparison operators, including chained expressions:

In [10]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.array_equal(result1, result2)

True

#### Bitwise operators

`pd.eval()` supports the `&`, `|` and `~` bitwise operators:

In [11]:
result1 = (df1 < 0.5) & (df2 < 0.5) | ~ (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | ~ (df3 < df4)')
np.array_equal(result1, result2)

True

Additionally, it supports the use of the literals `and`, `or` and `not` in Boolean expressions:

In [12]:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or not (df3 < df4)')
np.array_equal(result1, result3)

True

#### Object attributes and indices

`pd.eval()` supports access to object attributes via the `obj.attr` syntax and indexes via the `obj[index]` syntax:

In [13]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.array_equal(result1, result2)

True

#### Other operations

Operations such as most function calls, conditional statements, loops and other more involved constructs are currently not implemented in `pd.eval()`. For executing these more complicated types of expressions, we could use the Numexpr library itself.
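
As a rough sketch of what using Numexpr directly can look like, the example below combines an element-wise conditional (`where`) with two math functions (`sqrt` and `sin`) in one expression string. The array sizes and the `try`/`except` guard are assumptions made so that the snippet still runs where Numexpr is not installed.

```python
import numpy as np

try:
    import numexpr
except ImportError:  # Numexpr is an optional dependency
    numexpr = None

rng = np.random.RandomState(0)
x = rng.rand(1000)
y = rng.rand(1000)

# Reference result computed with plain NumPy
expected = np.where(x > 0.5, np.sqrt(x), np.sin(y))

if numexpr is not None:
    # where(cond, a, b) picks a where cond is true and b elsewhere
    result = numexpr.evaluate('where(x > 0.5, sqrt(x), sin(y))',
                              local_dict={'x': x, 'y': y})
    print(np.allclose(result, expected))
```

Passing `local_dict` explicitly is optional, since Numexpr can also pick up the variables from the calling scope, but it makes clear which arrays the expression string refers to.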

## `DataFrame.eval()` for Column-Wise Operations

`DataFrame`s have an `eval()` method that works similarly to `pd.eval()`. The main benefit of this method is that columns can be referred to by name. Let's see some examples with this labeled array:

In [14]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 0.979625 | 0.648023 | 0.613852 |
| 1 | 0.34833 | 0.460903 | 0.147246 |
| 2 | 0.121401 | 0.582239 | 0.031329 |
| 3 | 0.570329 | 0.324969 | 0.03716 |
| 4 | 0.322832 | 0.704335 | 0.300646 |


We could use `pd.eval()` to compute expressions with the three columns:

In [15]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.array_equal(result1, result2)

True

The `DataFrame.eval()` method, however, allows for a much more succinct expression by treating the column names as variables within the expression:

In [16]:
result3 = df.eval('(A + B) / (C - 1)')
np.array_equal(result1, result3)

True

### Assignment with `DataFrame.eval()`

In addition to what was just discussed, `DataFrame.eval()` also allows assignment to any column. Let's use the `DataFrame` from before as an example.

We can use `df.eval()` to create a new column `'D'` and assign to it a value computed from the other columns:

In [17]:
df.eval('D = (A + B) / C', inplace=True)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.979625 | 0.648023 | 0.613852 | 2.651533 |
| 1 | 0.34833 | 0.460903 | 0.147246 | 5.495802 |
| 2 | 0.121401 | 0.582239 | 0.031329 | 22.459391 |
| 3 | 0.570329 | 0.324969 | 0.03716 | 24.09276 |
| 4 | 0.322832 | 0.704335 | 0.300646 | 3.416529 |
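
Two variations on assignment may be worth a quick sketch: leaving out `inplace=True`, so that `eval()` returns a new `DataFrame`, and passing several assignments as one multi-line string. The small frame and the column name `E` below are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])

# Without inplace=True, eval() returns a new DataFrame and leaves df alone
with_d = df.eval('D = (A + B) / C')
print(list(df.columns))      # ['A', 'B', 'C']
print(list(with_d.columns))  # ['A', 'B', 'C', 'D']

# Several assignments can be given as a multi-line string; later lines
# may use columns created by earlier ones
with_de = df.eval('''
D = (A + B) / C
E = D - A
''')
print(list(with_de.columns))  # ['A', 'B', 'C', 'D', 'E']
```

Returning a new object makes it easier to chain further operations, at the cost of copying the frame.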


### Local variables in `DataFrame.eval()`

The `DataFrame.eval()` method supports an additional syntax that lets it work with local Python variables:

In [18]:
# Mean taken across the columns, giving one value per row
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.array_equal(result1, result2)

True

The `@` character here marks `column_mean` as a variable name rather than a column name.
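
The distinction matters most when a local variable happens to share its name with a column. The following small sketch, with made-up values, shows the two namespaces side by side. Note that the `@` prefix belongs to `DataFrame.eval()` and `DataFrame.query()`; the top-level `pd.eval()` function has only one namespace and does not accept it.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})
B = 100.0  # local Python variable that shares its name with column 'B'

col_sum = df.eval('A + B')     # B refers to the column
local_sum = df.eval('A + @B')  # @B refers to the local variable

print(col_sum.tolist())    # [11.0, 22.0, 33.0]
print(local_sum.tolist())  # [101.0, 102.0, 103.0]
```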

## `DataFrame.query()` Method

The `DataFrame` has another method based on evaluating strings: the `query()` method. Consider the following:

In [19]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.array_equal(result1, result2)

True

As with the examples used previously, this is an expression that involves columns of the `DataFrame`. It cannot be expressed using the `DataFrame.eval()` syntax, however! 

Instead, for this type of filtering operation, we can use the `query()` method:

In [20]:
result2 = df.query('A < 0.5 and B < 0.5')
np.array_equal(result1, result2)

True
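
One way to picture the relationship between the two methods, sketched here on a small random frame: `DataFrame.eval()` on a Boolean expression returns a mask with one entry per row, while `query()` evaluates the same expression and also selects the matching rows.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

expr = 'A < 0.5 and B < 0.5'

mask = df.eval(expr)        # Boolean Series, one entry per row
via_mask = df[mask]         # filter with ordinary Boolean indexing
via_query = df.query(expr)  # evaluate and filter in one step

print(via_mask.equals(via_query))
```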

In addition to being a more efficient computation, this is also much easier to read and understand. Note that the `query()` method also accepts the `@` flag to mark local variables:

In [21]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.array_equal(result1, result2)

True
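
Because `@` names are looked up in the scope where `query()` is called, the same syntax works inside a function. The helper `rows_below` and its `threshold` parameter are hypothetical names introduced only for this sketch.

```python
import pandas as pd


def rows_below(frame, threshold):
    """Hypothetical helper: keep rows where both A and B are below threshold."""
    # @threshold is looked up among the local variables of this function
    return frame.query('A < @threshold and B < @threshold')


df = pd.DataFrame({'A': [0.1, 0.6, 0.3], 'B': [0.2, 0.1, 0.9]})
print(rows_below(df, 0.5))
```

Only the first row has both values below 0.5, so it is the only one kept.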