# High-Performance Pandas

Based on Chapter 3 from *Python for Data Science Handbook*, by Jake VanderPlas.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn; seaborn.set()

#import warnings
#warnings.filterwarnings("ignore")

In [2]:
class display(object):
    
    """Display HTML representation of multiple objects"""
    
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

As we've already seen in previous sections, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.

### Motivating `query()` and `eval()`: Compound Expressions

In [3]:
rng = np.random.RandomState(42)

x = rng.rand(1000000)
y = rng.rand(1000000)

%timeit x + y

3.53 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


This is much faster than doing the addition via a Python loop or comprehension.

In [4]:
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

385 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
mask = (x > 0.5) & (y < 0.5)

In [6]:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

In [7]:
import numexpr

mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')

np.allclose(mask, mask_numexpr)

True

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. The Pandas `eval()` and `query()` tools that we will discuss here are conceptually similar, and depend on the Numexpr package.
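To make the idea of avoiding full-sized temporaries concrete, here is a minimal sketch in plain NumPy that evaluates the same mask block by block. It is only an illustration of the principle, not a description of how Numexpr is implemented; the helper name `chunked_mask` and the `chunk_size` value are assumptions made for this example.

```python
import numpy as np

def chunked_mask(x, y, chunk_size=65536):
    """Compute (x > 0.5) & (y < 0.5) one block at a time (hypothetical helper)."""
    out = np.empty(x.shape[0], dtype=bool)
    for start in range(0, x.shape[0], chunk_size):
        stop = start + chunk_size
        # The temporaries created on this line are at most chunk_size long,
        # instead of being as long as the full input arrays.
        out[start:stop] = (x[start:stop] > 0.5) & (y[start:stop] < 0.5)
    return out

rng = np.random.RandomState(0)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

mask_full = (x > 0.5) & (y < 0.5)     # allocates two full-sized temporaries
mask_chunked = chunked_mask(x, y)     # allocates only small temporaries

print(np.array_equal(mask_full, mask_chunked))
```

The Python-level loop makes this sketch slower than a compiled implementation would be; the point is only that the result is identical while the intermediate arrays stay small.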

# `pandas.eval()` for Efficient Operations

In [8]:
nrows, ncols = 100000, 100

rng = np.random.RandomState(42)

df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))

display('df1.head()', 'df2.head()', 'df3.head()', 'df4.head()')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.37454,0.950714,0.731994,0.598658,0.156019,0.155995,0.058084,0.866176,0.601115,0.708073,...,0.119594,0.713245,0.760785,0.561277,0.770967,0.493796,0.522733,0.427541,0.025419,0.107891
1,0.031429,0.63641,0.314356,0.508571,0.907566,0.249292,0.410383,0.755551,0.228798,0.07698,...,0.093103,0.897216,0.900418,0.633101,0.33903,0.34921,0.725956,0.89711,0.887086,0.779876
2,0.642032,0.08414,0.161629,0.898554,0.606429,0.009197,0.101472,0.663502,0.005062,0.160808,...,0.0305,0.037348,0.822601,0.360191,0.127061,0.522243,0.769994,0.215821,0.62289,0.085347
3,0.051682,0.531355,0.540635,0.63743,0.726091,0.975852,0.5163,0.322956,0.795186,0.270832,...,0.990505,0.412618,0.372018,0.776413,0.340804,0.930757,0.858413,0.428994,0.750871,0.754543
4,0.103124,0.902553,0.505252,0.826457,0.32005,0.895523,0.389202,0.010838,0.905382,0.091287,...,0.455657,0.620133,0.277381,0.188121,0.463698,0.353352,0.583656,0.077735,0.974395,0.986211

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.926538,0.382461,0.871469,0.761471,0.328826,0.988821,0.120738,0.358905,0.954462,0.004711,...,0.310465,0.816988,0.930747,0.111477,0.772517,0.801181,0.466825,0.005912,0.70511,0.487674
1,0.715167,0.490948,0.904532,0.319521,0.582585,0.98033,0.019068,0.089363,0.281105,0.143648,...,0.433028,0.13254,0.263659,0.339079,0.234842,0.507921,0.544545,0.197424,0.432392,0.218104
2,0.975796,0.049902,0.092684,0.158453,0.858309,0.65255,0.681106,0.360168,0.843117,0.619341,...,0.156821,0.772316,0.412088,0.796167,0.54858,0.722526,0.141587,0.459266,0.128221,0.661666
3,0.369458,0.911366,0.892686,0.763454,0.581681,0.207756,0.024249,0.92586,0.191849,0.047043,...,0.313598,0.566552,0.844425,0.079068,0.33843,0.921877,0.856621,0.285027,0.505441,0.571166
4,0.794953,0.714644,0.652743,0.639999,0.801813,0.223324,0.468607,0.409739,0.846211,0.488558,...,0.349061,0.986111,0.389271,0.42801,0.645183,0.998789,0.805533,0.310009,0.876316,0.946936

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.381785,0.88428,0.45058,0.889203,0.400178,0.329899,0.37492,0.289165,0.856012,0.170531,...,0.232548,0.42708,0.687908,0.990223,0.532107,0.29157,0.604532,0.510344,0.178462,0.816248
1,0.575967,0.057404,0.320802,0.174745,0.708598,0.165073,0.8523,0.841379,0.810541,0.867123,...,0.205949,0.18309,0.481792,0.47993,0.36037,0.920427,0.515166,0.698365,0.925812,0.272917
2,0.553476,0.657017,0.72186,0.058866,0.818086,0.882324,0.633707,0.786487,0.107093,0.659608,...,0.725832,0.627375,0.387747,0.20446,0.973627,0.26253,0.912395,0.852041,0.050451,0.668992
3,0.84181,0.738977,0.768721,0.352721,0.454399,0.91565,0.164899,0.872948,0.419942,0.671492,...,0.917378,0.928159,0.034869,0.679377,0.351755,0.23352,0.620001,0.338868,0.797963,0.447284
4,0.069417,0.37045,0.329881,0.88214,0.688254,0.393034,0.288496,0.248113,0.835122,0.668993,...,0.723569,0.378604,0.294903,0.595871,0.940018,0.544825,0.030322,0.157838,0.364742,0.932007

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.773222,0.02287,0.135256,0.547153,0.112734,0.382484,0.282446,0.47939,0.973054,0.968159,...,0.742549,0.014689,0.638707,0.557382,0.935098,0.161364,0.792444,0.789514,0.522443,0.575358
1,0.608169,0.141071,0.560629,0.028672,0.017801,0.92825,0.939959,0.865063,0.125569,0.062302,...,0.563912,0.085168,0.545653,0.062591,0.079648,0.904816,0.570289,0.112442,0.18727,0.167751
2,0.79028,0.450114,0.316514,0.443655,0.961636,0.18353,0.092308,0.563372,0.137717,0.493172,...,0.943914,0.999072,0.656912,0.87979,0.801385,0.020247,0.27461,0.013139,0.884154,0.128746
3,0.062328,0.129402,0.951153,0.674908,0.706534,0.06913,0.331226,0.421508,0.578126,0.67481,...,0.026716,0.690321,0.373365,0.361318,0.044817,0.219551,0.684745,0.104272,0.996603,0.25626
4,0.13967,0.010372,0.683865,0.662876,0.593358,0.27329,0.001748,0.173523,0.578557,0.084194,...,0.472835,0.914288,0.634008,0.922544,0.063849,0.203463,0.805392,0.09748,0.733605,0.278122


To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

In [9]:
%timeit df1 + df2 + df3 + df4

89.6 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The same result can be computed via `pd.eval()` by constructing the expression as a string:

In [10]:
%timeit pd.eval('df1 + df2 + df3 + df4')

49.9 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The `eval()` version of this expression is faster (and uses much less memory), while giving the same result:

In [11]:
np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))

True
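As a small sketch of how the evaluation backend can be chosen, `pd.eval()` accepts an `engine` argument and a `local_dict` of names. The frames `a`, `b`, `c` and the dictionary `frames` below are made up for this example; passing `local_dict` is simply one way to make the names explicit rather than relying on the calling scope.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = pd.DataFrame(rng.rand(1000, 5))
b = pd.DataFrame(rng.rand(1000, 5))
c = pd.DataFrame(rng.rand(1000, 5))

# Names used in the expression, passed explicitly instead of being
# looked up in the calling scope.
frames = {"a": a, "b": b, "c": c}

# engine="python" evaluates the expression with ordinary pandas operations.
result_python = pd.eval("a + b + c", engine="python", local_dict=frames)

# With engine left unset, pandas picks Numexpr when it is installed.
result_default = pd.eval("a + b + c", local_dict=frames)

print(np.allclose(result_python, a + b + c))
print(np.allclose(result_default, a + b + c))
```

Timing the two calls with `%timeit` on your own data is a reasonable way to see whether the Numexpr engine actually helps for your array sizes.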

# Operations supported by `pd.eval()`

In [12]:
# 'for i in range(5)' ensures that 5 different dataframes are produced.
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for i in range(5)) 

display('df1.head(3)', 'df2.head(3)', 'df3.head(3)', 'df4.head(3)', 'df5.head(3)')

Unnamed: 0,0,1,2
0,180,112,748
1,447,205,487
2,656,100,98

Unnamed: 0,0,1,2
0,75,15,719
1,741,587,37
2,879,695,688

Unnamed: 0,0,1,2
0,912,97,806
1,766,714,218
2,502,508,541

Unnamed: 0,0,1,2
0,461,139,372
1,487,838,959
2,97,231,638

Unnamed: 0,0,1,2
0,369,17,867
1,60,582,857
2,148,139,351


In [13]:
df1

Unnamed: 0,0,1,2
0,180,112,748
1,447,205,487
2,656,100,98
3,90,450,613
4,529,224,530
...,...,...,...
95,31,787,643
96,984,624,352
97,283,543,751
98,5,142,278


### Arithmetic operators

In [14]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')

np.allclose(result1, result2)

True

In [15]:
display('result1.head(3)', 'result2.head(3)')

Unnamed: 0,0,1,2
0,-378.832484,-24.118644,-1323.546689
1,-324.347167,-659.535438,-872.309261
2,-1110.644407,-233.046008,-408.187447

Unnamed: 0,0,1,2
0,-378.832484,-24.118644,-1323.546689
1,-324.347167,-659.535438,-872.309261
2,-1110.644407,-233.046008,-408.187447


### Comparison operators

In [16]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')

np.allclose(result1, result2)

True

In [17]:
display('result1.head(3)', 'result2.head(3)')

Unnamed: 0,0,1,2
0,False,False,False
1,True,True,False
2,False,False,False

Unnamed: 0,0,1,2
0,False,False,False
1,True,True,False
2,False,False,False
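The chained comparison used above can be checked by hand on a tiny example. This is a sketch with made-up frames `a`, `b`, and `c`; it shows that the chained form inside `pd.eval()` matches the explicit element-wise form.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 5, 3]})
b = pd.DataFrame({"x": [2, 4, 3]})
c = pd.DataFrame({"x": [2, 6, 1]})

names = {"a": a, "b": b, "c": c}

# Inside pd.eval, a chained comparison is applied element-wise.
chained = pd.eval("a < b <= c", local_dict=names)

# The explicit equivalent in ordinary pandas code.
explicit = (a < b) & (b <= c)

print(chained)
print(chained.equals(explicit))
```

Outside of `pd.eval()`, writing `a < b <= c` directly on DataFrames raises a `ValueError`, because Python tries to convert the intermediate DataFrame to a single truth value.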


### Bitwise operators

In [18]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')

np.allclose(result1, result2)

True

In [19]:
display('result1.head(3)', 'result2.head(3)')

Unnamed: 0,0,1,2
0,False,True,False
1,False,True,True
2,False,False,True

Unnamed: 0,0,1,2
0,False,True,False
1,False,True,True
2,False,False,True


In [20]:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')

np.allclose(result1, result3)

True

In [21]:
display('result1.head(3)', 'result3.head(3)')

Unnamed: 0,0,1,2
0,False,True,False
1,False,True,True
2,False,False,True

Unnamed: 0,0,1,2
0,False,True,False
1,False,True,True
2,False,False,True
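As a minimal sketch of the literal `and`/`or` form, the following compares it with the symbolic operators on two short Series. The names `s1`, `s2`, and `names` are invented for this example.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([4, 3, 2, 1])
names = {"s1": s1, "s2": s2}

# Inside pd.eval, `and` is treated as an element-wise operation.
with_words = pd.eval("s1 < 3 and s2 > 2", local_dict=names)

# The same mask written with the bitwise operator in ordinary pandas code;
# here the parentheses are required.
with_symbols = (s1 < 3) & (s2 > 2)

print(with_words.tolist())
print(with_symbols.tolist())
```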


### Object attributes and indices

In [22]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')

np.allclose(result1, result2)

True

In [23]:
result1.head(3)

0    841
1    729
2    937
dtype: int32

In [24]:
result2.head(3)

0    841
1    729
2    937
dtype: int32

# `DataFrame.eval()` for Column-Wise Operations

In [25]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

df.head()

Unnamed: 0,A,B,C
0,0.375506,0.406939,0.069938
1,0.069087,0.235615,0.154374
2,0.677945,0.433839,0.652324
3,0.264038,0.808055,0.347197
4,0.589161,0.252418,0.557789


In [26]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")

np.allclose(result1, result2)

True

In [27]:
result1.head(3)

0   -0.841283
1   -0.360327
2   -3.197758
dtype: float64

In [28]:
result2.head(3)

0   -0.841283
1   -0.360327
2   -3.197758
dtype: float64

In [29]:
result3 = df.eval('(A + B) / (C - 1)')

np.allclose(result1, result3)

True

In [30]:
result1.head(3)

0   -0.841283
1   -0.360327
2   -3.197758
dtype: float64

In [31]:
result3.head(3)

0   -0.841283
1   -0.360327
2   -3.197758
dtype: float64

### Assignment in `DataFrame.eval()`

In [32]:
df.head()

Unnamed: 0,A,B,C
0,0.375506,0.406939,0.069938
1,0.069087,0.235615,0.154374
2,0.677945,0.433839,0.652324
3,0.264038,0.808055,0.347197
4,0.589161,0.252418,0.557789


We can use `df.eval()` to create a new column 'D' and assign to it a value computed from the other columns:

In [33]:
df.eval('D = (A + B) / C', inplace=True)

df.head()

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,11.18762
1,0.069087,0.235615,0.154374,1.973796
2,0.677945,0.433839,0.652324,1.704344
3,0.264038,0.808055,0.347197,3.087857
4,0.589161,0.252418,0.557789,1.508776


In [34]:
df.eval('D = (A - B) / C', inplace=True)

df.head()

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
2,0.677945,0.433839,0.652324,0.374209
3,0.264038,0.808055,0.347197,-1.566886
4,0.589161,0.252418,0.557789,0.603708
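The cells above modify `df` directly because of `inplace=True`. As a hedged sketch on a small made-up frame (`df_small`, with hypothetical column names `total` and `delta`), the same assignment syntax can also return a new DataFrame, and several assignments can be written one per line:

```python
import pandas as pd

df_small = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Without inplace=True, eval() returns a new DataFrame and leaves the
# original untouched.
with_total = df_small.eval("total = A + B")

# Several assignments can be given, one per line.
with_both = df_small.eval(
    """
    total = A + B
    delta = B - A
    """
)

print(list(df_small.columns))
print(with_both)
```

Returning a copy is convenient when the original frame is still needed afterwards, at the cost of holding both frames in memory.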


### Local variables in `DataFrame.eval()`

In [35]:
# Despite its name, this is the mean across the columns: one value per row
column_mean = df.mean(axis=1)

result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')

np.allclose(result1, result2)

True

The `@` character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects. Notice that this `@` character is only supported by the `DataFrame.eval()` method, not by the `pandas.eval()` function, because the `pandas.eval()` function only has access to the one (Python) namespace.

In [36]:
result1.head(3)

0    0.476246
1   -0.085826
2    1.212524
dtype: float64

In [37]:
result2.head(3)

0    0.476246
1   -0.085826
2    1.212524
dtype: float64
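The two namespaces can be seen side by side in a short sketch. The function `add_offset`, the frame `df_small`, and the variable `offset` are hypothetical names chosen for this example.

```python
import pandas as pd

def add_offset(frame, offset):
    # "A" is resolved as a column of `frame`; "@offset" is resolved as the
    # local Python variable `offset`.
    return frame.eval("A + @offset")

df_small = pd.DataFrame({"A": [1.0, 2.0, 3.0]})
shifted = add_offset(df_small, 10.0)

# pandas.eval() has no column namespace, so both the frame and the
# variable are referred to by their plain Python names.
offset = 10.0
shifted_top = pd.eval("df_small.A + offset")

print(shifted.tolist())
print(shifted_top.tolist())
```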

### `DataFrame.query()` Method

In [38]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')

np.allclose(result1, result2)

True

In [39]:
display('result1.head(3)', 'result2.head(3)')

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
7,0.406639,0.128631,0.160742,1.729526

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
7,0.406639,0.128631,0.160742,1.729526


In [40]:
result2 = df.query('A < 0.5 and B < 0.5')

np.allclose(result1, result2)

True

In [41]:
display('result1.head(3)', 'result2.head(3)')

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
7,0.406639,0.128631,0.160742,1.729526

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
7,0.406639,0.128631,0.160742,1.729526


In [42]:
result1.shape

(231, 4)

In [43]:
Cmean = df['C'].mean()

result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')

np.allclose(result1, result2)

True

In [44]:
display('result1.head(3)', 'result2.head(3)')

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
7,0.406639,0.128631,0.160742,1.729526

Unnamed: 0,A,B,C,D
0,0.375506,0.406939,0.069938,-0.449425
1,0.069087,0.235615,0.154374,-1.078728
7,0.406639,0.128631,0.160742,1.729526


In [45]:
result1.shape

(235, 4)
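The same filtering pattern can be verified by eye on a frame small enough to read. This is a sketch with invented names (`rows_below`, `df_small`, `limit`), showing that `query()` with an `@` variable selects the same rows as the explicit mask.

```python
import pandas as pd

def rows_below(frame, limit):
    # Column names are used bare; the Python variable is prefixed with "@".
    return frame.query("A < @limit and B < @limit")

df_small = pd.DataFrame({
    "A": [0.1, 0.9, 0.4, 0.2],
    "B": [0.3, 0.1, 0.8, 0.2],
})

selected = rows_below(df_small, 0.5)
masked = df_small[(df_small.A < 0.5) & (df_small.B < 0.5)]

print(selected)
print(selected.equals(masked))
```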

# Performance: When to Use These Functions

When deciding whether to use these functions, there are two things to weigh: computation time and memory use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in the implicit creation of temporary arrays. For example, this:

In [46]:
x = df[(df.A < 0.5) & (df.B < 0.5)]

is roughly equivalent to this:

In [47]:
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5

tmp3 = tmp1 & tmp2

x = df[tmp3]

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes) then it's a good idea to use an `eval()` or `query()` expression. You can check the approximate size of your array in bytes using this:

In [48]:
df.values.nbytes

32000
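To put rough numbers on the temporaries, the following sketch measures a frame of the same shape as `df` (1000 rows, four float columns) and one of the boolean masks from the expression above. The frame `df_small` is a stand-in created for this example.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.zeros((1000, 4)), columns=list("ABCD"))

# Bytes held by the column data (index excluded): 1000 rows x 4 columns
# x 8 bytes per float64 value.
data_bytes = df_small.memory_usage(index=False).sum()

# Each boolean temporary in a mask expression stores one byte per row.
mask_bytes = (df_small.A < 0.5).to_numpy().nbytes

print(data_bytes, mask_bytes)
```

For a frame this small the temporaries are negligible; the same arithmetic scaled to hundreds of millions of rows is where `eval()` and `query()` start to matter.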

On the performance side, `eval()` can be faster even when you are not maxing out your system memory. The issue is how the size of your temporary DataFrames compares to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then `eval()` can avoid some potentially slow movement of values between the different memory caches. In practice, I find that the difference in computation time between the traditional methods and the `eval/query` method is usually not significant; if anything, the traditional method is faster for smaller arrays! The benefit of `eval/query` is mainly in the saved memory, and the sometimes cleaner syntax they offer.