---
# Lecture notes - High performance Pandas
---

These lecture notes cover **high performance Pandas**. They build on the pandas material and on the previous course:

- Python programming

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives only a brief introduction to high-performance pandas. I encourage you to read further on the topic.</p>

Read more:

- [Enhancing performance](https://pandas.pydata.org/docs/user_guide/enhancingperf.html)
- [pandas eval()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html?highlight=eval#pandas.DataFrame.eval)
- [pandas query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html?highlight=query#pandas.DataFrame.query)
- [Scaling to large datasets](https://pandas.pydata.org/docs/user_guide/scale.html?highlight=efficency)

---


## Eval

We use a compound expression to motivate `eval()`:

```python
mask = (x > 0.5) & (y < 0.5)
```
is evaluated in the following steps, and each intermediate result is allocated in memory as a full array:

```python
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2
```

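`eval()` evaluates the whole expression elementwise in one pass, without allocating the full intermediate arrays. It uses numexpr as its backend when that library is installed.

For small DataFrames, `eval()` can be slower than a normal pandas expression, because parsing the expression string adds overhead. Rule of thumb: use `eval()` when the DataFrame has more than about 10,000 rows, otherwise use normal DataFrame expressions.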
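
As a minimal sketch of the idea above, the compound mask can be written both as a plain expression and as a `pd.eval()` string, and the two results compared. The Series `x` and `y`, the seed, and the size are assumptions made for illustration. `local_dict` is passed explicitly so that the example does not depend on where it is run.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): two random Series with values in [0, 1)
rng = np.random.default_rng(42)
x = pd.Series(rng.random(100_000))
y = pd.Series(rng.random(100_000))

# Plain expression: each comparison first creates its own temporary boolean Series
mask_plain = (x > 0.5) & (y < 0.5)

# pd.eval(): the whole expression is handed over as one string
mask_eval = pd.eval("(x > 0.5) & (y < 0.5)", local_dict={"x": x, "y": y})

# Both approaches should select exactly the same elements
print(mask_plain.equals(mask_eval))
```

Whether the `eval()` version is actually faster depends on the data size and on whether numexpr is installed, so it is worth timing both on your own data.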


In [3]:
import numpy as np 
import pandas as pd 

nrows, ncols = 1000000, 100

df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
df1.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.270727 | 0.01864 | -0.625415 | -2.232702 | -0.299816 | 0.657887 | -0.637094 | -0.396985 | -0.032817 | -0.39366 | ... | 0.534807 | -1.04397 | -0.169893 | -0.07732 | 0.084369 | 0.01181 | -0.024487 | 2.444251 | 1.402581 | 1.346007 |
| 1 | -0.126252 | -0.205095 | 0.31174 | -0.516885 | 1.586192 | -1.604354 | 0.09114 | -0.591145 | 0.742003 | -0.024304 | ... | -0.481786 | -1.980397 | 0.275345 | -0.083718 | -0.878676 | 0.137975 | 0.151696 | 0.848355 | 0.99176 | -1.75532 |
| 2 | -1.865572 | 1.185869 | -1.25238 | 0.578595 | 0.792551 | 0.352381 | 1.153259 | 0.579484 | 0.212804 | -0.302899 | ... | 1.765283 | -1.063014 | -0.567347 | 1.037695 | -0.544713 | -0.010489 | -0.374798 | -0.015899 | -0.096799 | -1.913347 |
| 3 | 0.098595 | -0.120241 | 0.134475 | -0.244167 | 0.053739 | 2.047715 | 0.992032 | 1.541574 | -0.517581 | 1.394895 | ... | -0.518481 | 0.22018 | -0.77385 | 1.496664 | 1.326732 | -0.02714 | 1.663377 | 0.357079 | 0.531195 | 0.097199 |
| 4 | 0.048917 | -1.600566 | 2.07018 | -0.524443 | -0.118089 | -0.290112 | 1.365544 | 0.929064 | 1.902168 | 0.151379 | ... | -2.038821 | 0.066976 | -0.94713 | -1.149153 | -0.389273 | -0.653003 | -0.528338 | 0.171273 | 1.755204 | 0.776575 |


In [4]:
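%timeit df1 + df2 + df3 + df4

# names assigned inside %timeit are not kept, so compute the result separately
sum_plain = df1 + df2 + df3 + df4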

The slowest run took 5.02 times longer than the fastest. This could mean that an intermediate result is being cached.
3.09 s ± 2.52 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
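%timeit pd.eval("df1 + df2 + df3 + df4") # pd.eval()

# compute the result outside %timeit so that sum_eval exists afterwards
sum_eval = pd.eval("df1 + df2 + df3 + df4")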

773 ms ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
sum_plain == sum_eval

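|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 1 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 2 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 3 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 4 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 999996 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 999997 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
| 999998 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |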


In [7]:
sum_plain.equals(sum_eval)

True
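
The two comparisons above answer different questions: `==` compares element by element and returns a DataFrame of booleans, while `equals()` returns a single boolean and treats NaN values in the same position as equal. A small sketch with made-up frames `a` and `b` (assumed names and values) shows the difference:

```python
import numpy as np
import pandas as pd

# Two small frames with identical content, including one missing value (made-up data)
a = pd.DataFrame({"u": [1.0, np.nan], "v": [3.0, 4.0]})
b = a.copy()

elementwise = a == b                  # DataFrame of booleans; NaN == NaN is False
all_equal = elementwise.all().all()   # collapse the boolean frame to a single value
same = a.equals(b)                    # single bool; NaNs in the same place count as equal

print(elementwise)
print(all_equal, same)
```

For checking that two ways of computing the same result agree, `equals()` is usually the more convenient of the two.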

In [8]:
# df.eval()
# the upper bound of randint is exclusive, so 7 is needed to include 6
rolls = pd.DataFrame(np.random.randint(1, 7, (6, 3)), columns = ["Die1", "Die2", "Die3"])
rolls.eval("Sum = Die1 + Die2 + Die3", inplace = True)
rolls

|   | Die1 | Die2 | Die3 | Sum |
|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 10 |
| 1 | 1 | 1 | 2 | 4 |
| 2 | 5 | 1 | 2 | 8 |
| 3 | 5 | 4 | 5 | 14 |
| 4 | 1 | 3 | 2 | 6 |
| 5 | 1 | 2 | 5 | 8 |


In [17]:
# reference a local Python variable inside the expression with the @ prefix
high = 10 
rolls.eval("High = Sum > @high", inplace = True)
rolls

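|   | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 10 | False |
| 1 | 1 | 1 | 2 | 4 | False |
| 2 | 5 | 1 | 2 | 8 | False |
| 3 | 5 | 4 | 5 | 14 | True |
| 4 | 1 | 3 | 2 | 6 | False |
| 5 | 1 | 2 | 5 | 8 | False |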
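
Without `inplace=True`, `DataFrame.eval()` returns a new DataFrame and leaves the original untouched, and one call can hold several assignments on separate lines. The sketch below uses a small fixed table so that the result is reproducible. The names `fixed_rolls`, `limit`, and `result` are assumptions for illustration.

```python
import pandas as pd

# Small fixed table instead of random rolls, so the result is reproducible
fixed_rolls = pd.DataFrame({"Die1": [4, 1, 5], "Die2": [4, 1, 4], "Die3": [2, 2, 5]})
limit = 10  # local Python variable, referenced as @limit inside the expression

# Two assignments in one call; without inplace=True a new DataFrame is returned
result = fixed_rolls.eval(
    """
Sum = Die1 + Die2 + Die3
High = Sum > @limit
"""
)

print(result)
print(list(fixed_rolls.columns))  # the original still has only the three dice columns
```

Returning a copy makes it easier to chain operations and to keep the original data intact, at the cost of some extra memory.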


In [20]:
# filter rows the traditional way, with a boolean mask
rolls[rolls["Sum"] <= high]

|   | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 10 | False |
| 1 | 1 | 1 | 2 | 4 | False |
| 2 | 5 | 1 | 2 | 8 | False |
| 4 | 1 | 3 | 2 | 6 | False |
| 5 | 1 | 2 | 5 | 8 | False |


## Query

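`query()` gives a cleaner syntax for selecting rows. It is often faster on larger datasets and compound expressions, but it can be slower on simple ones, as the timings in this section show.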

In [21]:
rolls.query("Sum <= @high")

|   | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 10 | False |
| 1 | 1 | 1 | 2 | 4 | False |
| 2 | 5 | 1 | 2 | 8 | False |
| 4 | 1 | 3 | 2 | 6 | False |
| 5 | 1 | 2 | 5 | 8 | False |
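
A query string can combine conditions with `and`/`or`, reference Python variables with `@`, and refer to column names that are not valid identifiers by wrapping them in backticks. The sketch below uses a made-up table; `athletes`, `min_height`, and the column `Height cm` are assumptions for illustration only.

```python
import pandas as pd

# Made-up data; "Height cm" contains a space, so it needs backticks inside query()
athletes = pd.DataFrame(
    {
        "Name": ["Ann", "Bo", "Cia", "Dan"],
        "Sex": ["F", "M", "F", "M"],
        "Height cm": [183.0, 178.0, 169.0, 191.0],
    }
)
min_height = 180

# Query string: backticks for the column name, @ for the Python variable
tall_query = athletes.query("`Height cm` > @min_height and Sex == 'F'")

# The same selection written as a boolean mask
tall_mask = athletes[(athletes["Height cm"] > min_height) & (athletes["Sex"] == "F")]

print(tall_query)
print(tall_query.equals(tall_mask))
```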


In [10]:
low = 10
small_plain = rolls[rolls["Sum"] < low]
small_plain

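|   | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 7 | False |
| 1 | 3 | 1 | 1 | 5 | False |
| 4 | 1 | 1 | 3 | 5 | False |

Note that this output appears to come from a different random draw of `rolls`, so its values differ from the tables above.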


In [11]:
small_query = rolls.query("Sum < @low")
small_query

|   | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 7 | False |
| 1 | 3 | 1 | 1 | 5 | False |
| 4 | 1 | 1 | 3 | 5 | False |


In [13]:
os = pd.read_csv("Data/athlete_events.csv")
os.head()

|   | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball |  |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight |  |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 |  |  | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football |  |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 |  |  | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres |  |


In [17]:
%timeit os[os["Season"] == "Winter"]
%timeit os.query("Season == 'Winter'")

plain = os[os["Season"] == "Winter"]
query = os.query("Season == 'Winter'")

plain.equals(query)

28.7 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13.1 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


True

In [18]:
%timeit os[os["Height"] > 180]
%timeit os.query("Height > 180") # note that query is slower here

plain = os[os["Height"] > 180]
query = os.query("Height > 180") 

plain.equals(query)

8.05 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.77 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


True

In [19]:
# query is faster on this compound expression, as it avoids storing every intermediate mask in memory
%timeit os[(os["Sex"] == "F") & (os["Height"] > 180) & (os["NOC"] == "SWE")]
%timeit os.query("Sex == 'F' & Height > 180 & NOC == 'SWE'") 

plain = os[(os["Sex"] == "F") & (os["Height"] > 180) & (os["NOC"] == "SWE")]
query = os.query("Sex == 'F' & Height > 180 & NOC == 'SWE'")

plain.equals(query)

41.8 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
18.4 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


True
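
`%timeit` only exists inside IPython and Jupyter. As a rough sketch of how the same comparison could be run from a plain Python script, the standard-library `timeit` module can be used instead. The data here is synthetic: the frame `people`, its size, and the helper functions `with_mask` and `with_query` are assumptions, not the athlete dataset used above. No timings are shown because they depend on the machine and on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the athlete data (assumed columns and distributions)
rng = np.random.default_rng(0)
n = 200_000
people = pd.DataFrame(
    {
        "Sex": rng.choice(["F", "M"], size=n),
        "Height": rng.normal(175, 10, size=n),
        "NOC": rng.choice(["SWE", "NOR", "DEN", "FIN"], size=n),
    }
)


def with_mask():
    # Boolean-mask selection: every comparison builds a temporary boolean Series
    return people[(people["Sex"] == "F") & (people["Height"] > 180) & (people["NOC"] == "SWE")]


def with_query():
    # The same selection written as a query string
    return people.query("Sex == 'F' and Height > 180 and NOC == 'SWE'")


# Check that both approaches select the same rows before timing them
print(with_mask().equals(with_query()))

for func in (with_mask, with_query):
    seconds = timeit.timeit(func, number=5) / 5  # average over 5 calls
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per call")
```

Checking that both versions return the same rows before timing them guards against comparing the speed of two different selections.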