### High-Performance Pandas: eval() and query()

• Pandas includes some experimental tools that give you direct access to C-speed operations without the costly allocation of intermediate arrays.

• These are the eval() and query() functions, which rely on the Numexpr package.
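• To make "intermediate arrays" concrete, here is a minimal NumPy sketch of what a compound expression costs when it is evaluated the usual way. The names x, y, tmp1, tmp2, and mask_explicit are illustrative only and do not appear elsewhere in these notes.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(100_000)
y = rng.rand(100_000)

# A compound expression written the usual way...
mask = (x > 0.5) & (y < 0.5)

# ...is evaluated roughly like this: each subexpression is materialized
# as a full-size temporary array before the results are combined.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask_explicit = tmp1 & tmp2

print(np.array_equal(mask, mask_explicit))
```

• The temporaries tmp1 and tmp2 are each as large as the inputs, which is the memory and time overhead that eval() and query() aim to avoid.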

### pandas.eval() for Efficient Operations

• The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames.

• For example, consider the following DataFrames:

In [1]:
import numpy as np
import pandas as pd

In [2]:
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))

• To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

In [3]:
%timeit df1 + df2 + df3 + df4

88.5 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


• We can compute the same result via pd.eval() by constructing the expression as a string.

• In the timings shown here, the eval() version is roughly 1.6 times faster (55.8 ms versus 88.5 ms per loop) and uses much less memory, while giving the same result:

In [4]:
%timeit pd.eval('df1 + df2 + df3 + df4')

55.8 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
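• The claim that both forms give the same result can be checked directly. The sketch below uses smaller, hypothetical frames (small1 to small4) so that it runs quickly. It also passes engine='python', an option of pd.eval() that evaluates the expression without Numexpr, to show that the result does not depend on the engine.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
small1, small2, small3, small4 = (
    pd.DataFrame(rng.rand(1000, 10)) for _ in range(4)
)

# Ordinary Pandas arithmetic
expected = small1 + small2 + small3 + small4

# The same expression as a string, with the default engine
via_eval = pd.eval('small1 + small2 + small3 + small4')

# The same expression again, forcing the pure-Python engine
via_python_engine = pd.eval('small1 + small2 + small3 + small4',
                            engine='python')

same = bool(np.allclose(expected, via_eval)
            and np.allclose(expected, via_python_engine))
print(same)
```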


### Comparison operators

• pd.eval() supports all comparison operators, including chained expressions.

In [8]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True
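• A chained comparison is easier to follow on a tiny input. This is a sketch with three hypothetical single-column frames (a, b, c) whose values are chosen so the result can be checked by hand.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 5, 3]})
b = pd.DataFrame({'x': [2, 4, 3]})
c = pd.DataFrame({'x': [3, 4, 5]})

# Chained form inside pd.eval()
chained = pd.eval('a < b <= c')

# Equivalent explicit form in ordinary Pandas
explicit = (a < b) & (b <= c)

print(chained['x'].tolist())   # [True, False, False]
print(explicit['x'].tolist())  # [True, False, False]
```

• Only the first row satisfies both 1 < 2 and 2 <= 3. The other two rows fail the first comparison.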

### Bitwise operators

• pd.eval() supports the & and | bitwise operators:

In [9]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

• In addition, it supports the literal and and or keywords in Boolean expressions:

In [10]:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True
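• The and/or support is specific to the string expression. In plain Python the same keywords raise an error on Pandas objects. The sketch below contrasts the two, using hypothetical Series named left and right.

```python
import pandas as pd

left = pd.Series([0.2, 0.7, 0.4])
right = pd.Series([0.1, 0.3, 0.9])

# Plain Python: `and` needs a single truth value, which a Series
# cannot provide, so this raises ValueError.
try:
    (left < 0.5) and (right < 0.5)
    plain_python_ok = True
except ValueError:
    plain_python_ok = False

# Inside pd.eval(), the keyword is applied element-wise.
with_and = pd.eval('(left < 0.5) and (right < 0.5)')

# Ordinary Pandas code has to use the bitwise operator instead.
with_amp = (left < 0.5) & (right < 0.5)

print(plain_python_ok)    # False
print(with_and.tolist())  # [True, False, False]
```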

### Object attributes and indices

• pd.eval() supports access to object attributes via the obj.attr syntax, and indexes via the obj[index] syntax.

In [11]:
result1 = df1.T[0] + df2.iloc[1]
result2 = pd.eval('df1.T[0] + df2.iloc[1]')
np.allclose(result1, result2)

True

### DataFrame.eval() for Column-Wise Operations

• Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar ways.

• The benefit of the eval() method is that columns can be referred to by name.

• We’ll use this labeled array as an example.

In [12]:
df = pd.DataFrame(rng.rand(1000, 3), columns = ['A', 'B', 'C'])
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 |
| 1 | 0.330858 | 0.412879 | 0.441564 |
| 2 | 0.689047 | 0.559068 | 0.23035 |
| 3 | 0.290486 | 0.695479 | 0.852587 |
| 4 | 0.42428 | 0.534344 | 0.245216 |


• Using pd.eval() as above, we can compute expressions with the three columns like this.

In [13]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval('(df.A + df.B) / (df.C - 1)')
np.allclose(result1, result2)

True

• The DataFrame.eval() method allows much more succinct evaluation of expressions with the columns.

In [14]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True
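• The same column-wise expression can be worked through by hand on a small frame. This is a sketch with a hypothetical two-row DataFrame named toy.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1.0, 2.0],
                    'B': [3.0, 4.0],
                    'C': [3.0, 5.0]})

# Column names are used directly inside the string:
# row 0: (1 + 3) / (3 - 1) = 2.0
# row 1: (2 + 4) / (5 - 1) = 1.5
ratio = toy.eval('(A + B) / (C - 1)')

print(ratio.tolist())  # [2.0, 1.5]
```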

### Assignment in DataFrame.eval()

• In addition to the options just discussed, DataFrame.eval() also allows assignment to any column.

• Let’s use the DataFrame from before, which has columns 'A', 'B', and 'C'.

• We can use df.eval() to create a new column 'D' and assign to it a value computed from the other columns.

In [15]:
df.eval('D = (A + B) / C', inplace = True)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 | 24.095868 |
| 1 | 0.330858 | 0.412879 | 0.441564 | 1.684325 |
| 2 | 0.689047 | 0.559068 | 0.23035 | 5.418335 |
| 3 | 0.290486 | 0.695479 | 0.852587 | 1.156439 |
| 4 | 0.42428 | 0.534344 | 0.245216 | 3.909296 |


• In the same way, any existing column can be modified:

In [19]:
df.eval('D = (A - B) / C', inplace = True)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 | 1.915527 |
| 1 | 0.330858 | 0.412879 | 0.441564 | -0.185752 |
| 2 | 0.689047 | 0.559068 | 0.23035 | 0.564268 |
| 3 | 0.290486 | 0.695479 | 0.852587 | -0.475016 |
| 4 | 0.42428 | 0.534344 | 0.245216 | -0.448844 |
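• Both assignments above pass inplace=True. The sketch below, on a hypothetical frame named toy, shows what changes when that argument is left out: the assignment is applied to a returned copy and the original frame is untouched.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1.0, 2.0],
                    'B': [3.0, 4.0],
                    'C': [2.0, 4.0]})

# Without inplace=True, eval() returns a new DataFrame with column D.
with_d = toy.eval('D = (A + B) / C')
columns_before = list(toy.columns)

print(columns_before)        # ['A', 'B', 'C']
print(with_d['D'].tolist())  # [2.0, 1.5]

# With inplace=True, toy itself gains the column and None is returned.
returned = toy.eval('D = (A - B) / C', inplace=True)

print(returned)           # None
print(toy['D'].tolist())  # [-1.0, -0.5]
```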


### Local variables in DataFrame.eval()

• The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables.

• Consider the following, where column_mean holds the mean taken across the columns of each row (axis=1):

In [30]:
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

• The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two “namespaces”: the namespace of columns, and the namespace of Python objects.

• Notice that this @ character is only supported by the DataFrame.eval() method, not by the pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace.
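• The difference between the two namespaces can be seen side by side. This is a sketch with a hypothetical frame toy and a local variable offset.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 2, 3]})
offset = 10  # an ordinary Python variable, not a column

# DataFrame.eval(): bare names are columns, @names are Python variables.
via_method = toy.eval('A + @offset')

# pd.eval(): every name is a Python variable, so no @ prefix is used
# and the column is reached through the DataFrame itself.
via_function = pd.eval('toy.A + offset')

print(via_method.tolist())    # [11, 12, 13]
print(via_function.tolist())  # [11, 12, 13]
```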

### DataFrame.query() Method

• The DataFrame has another method based on evaluated strings, called the query() method.

• Consider the following:

In [33]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

• As with the example used in our discussion of DataFrame.eval(), this is an expression involving columns of the DataFrame.

• It cannot be expressed using the DataFrame.eval() syntax, however! Instead, for this type of filtering operation, you can use the query() method:

In [34]:
result3 = df.query('A < 0.5 & B < 0.5')
np.allclose(result1, result3)

True
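• The query string above omits the parentheses that the masking expression needs. According to the Pandas documentation, the default parser gives & and | the same precedence as and and or, so the comparisons are evaluated first. A minimal sketch on a hypothetical frame toy:

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 5, 2],
                    'B': [4, 1, 2]})

# Masking in ordinary Pandas requires parentheses around each comparison.
masked = toy[(toy.A < 3) & (toy.B < 3)]

# Inside query(), both spellings select the same rows without them.
with_amp = toy.query('A < 3 & B < 3')
with_and = toy.query('A < 3 and B < 3')

print(list(masked.index))    # [2]
print(list(with_amp.index))  # [2]
print(list(with_and.index))  # [2]
```

• Only the last row has both A and B below 3, so all three forms return that single row.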

• Compared to the masking expression, this is not only a more efficient computation but also much easier to read and understand.

• Note that the query() method also accepts the @ flag to mark local variables:

In [36]:
cmean = df['C'].mean()
result1 = df[(df.A < cmean) & (df.B < cmean)]
result2 = df.query('A < @cmean & B < @cmean')
np.allclose(result1, result2)

True
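• To see why the @ prefix matters, the sketch below runs the same filter with and without it on a hypothetical frame toy. Without the prefix, query() looks for a column called limit and fails with an error that derives from NameError.

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 5, 2],
                    'B': [4, 1, 2]})
limit = 3  # a local Python variable, not a column

# With @, the local variable is found.
selected = toy.query('A < @limit and B < @limit')
print(list(selected.index))  # [2]

# Without @, `limit` is treated as a column name and cannot be resolved.
try:
    toy.query('A < limit')
    bare_name_ok = True
except NameError:
    bare_name_ok = False

print(bare_name_ok)  # False
```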