<a href="https://colab.research.google.com/github/kokchun/Databehandling-AI22/blob/main/Lectures/L7-high-performance.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code


---
# Lecture notes - High performance Pandas
---

These lecture notes cover **high performance pandas**. They build on the earlier pandas material and on the previous course:

- Python programming

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives only a brief introduction to high performance pandas. I encourage you to read further on the topic.</p>

Read more

- [Enhancing performance](https://pandas.pydata.org/docs/user_guide/enhancingperf.html)
- [pandas eval()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html?highlight=eval#pandas.DataFrame.eval)
- [pandas query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html?highlight=query#pandas.DataFrame.query)
- [Scaling to large datasets](https://pandas.pydata.org/docs/user_guide/scale.html?highlight=efficency)

---


## Eval

We use a compound expression to motivate `eval()`:

```python
mask = (x > 0.5) & (y < 0.5)
```
This is evaluated in the following steps, each of which allocates an intermediate array in memory:

```python
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2
```

`eval()` instead evaluates the whole expression elementwise in one pass, without allocating the full intermediate arrays. It uses numexpr for this when numexpr is installed.

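As a minimal sketch of the comparison above, the same mask can be built both ways and checked for equality. The DataFrame `df`, its columns `x` and `y`, and the random data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative data: two columns of uniform random numbers in [0, 1)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(1_000), "y": rng.random(1_000)})

# Plain pandas: each comparison creates a temporary boolean Series
mask_plain = (df["x"] > 0.5) & (df["y"] < 0.5)

# df.eval: the whole expression is passed to the evaluation engine as one string
mask_eval = df.eval("(x > 0.5) & (y < 0.5)")

# Both approaches should select exactly the same rows
same_mask = np.array_equal(mask_plain.to_numpy(), mask_eval.to_numpy())
```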
For small DataFrames, `eval()` can be slower than normal pandas expressions. Rule of thumb:
use `eval()` when the DataFrame has more than roughly 10,000 rows, otherwise use normal DataFrame expressions.

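The rule of thumb could be wrapped in a small helper, sketched below. The function name `add_frames` and the constant `EVAL_MIN_ROWS` are hypothetical, and the threshold is only the rough figure quoted above, not a measured cutoff.

```python
import numpy as np
import pandas as pd

EVAL_MIN_ROWS = 10_000  # assumed threshold, taken from the rule of thumb


def add_frames(a, b, c):
    """Add three equally shaped DataFrames, using pd.eval only for large inputs."""
    if len(a) > EVAL_MIN_ROWS:
        # pd.eval resolves a, b and c from the local scope of this function
        return pd.eval("a + b + c")
    return a + b + c


small = pd.DataFrame(np.ones((5, 2)))
big = pd.DataFrame(np.ones((20_000, 2)))

small_sum = add_frames(small, small, small)  # takes the plain pandas branch
big_sum = add_frames(big, big, big)          # takes the pd.eval branch
```

In practice it is worth timing both branches on your own data before settling on a threshold.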

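Note that Python's built-in `eval()` can be a security risk when used together with user input. pandas' `eval()` only accepts a restricted expression syntax rather than arbitrary Python code, but it is still safest not to pass it untrusted strings.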


```python
import numpy as np
import pandas as pd

nrows, ncols = 1000000, 100

# four large DataFrames filled with standard normal random numbers
df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
```

```python
# pd.eval()
%timeit sum_plain = df1+df2+df3+df4
%timeit sum_eval = pd.eval("df1 + df2 + df3 + df4")

# check that both approaches give the same result
sum_plain = df1+df2+df3+df4
sum_eval = pd.eval("df1 + df2 + df3 + df4")
sum_plain.equals(sum_eval)
```

```
896 ms ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
302 ms ± 9.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

True
```

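```python
# df.eval()
# the upper bound of randint is exclusive, so 7 is needed to include 6
rolls = pd.DataFrame(np.random.randint(1,7,(6,3)), columns = ["Die1", "Die2", "Die3"])
rolls.eval("Sum = Die1 + Die2 + Die3", inplace = True)
rolls
```

| | Die1 | Die2 | Die3 | Sum |
|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 7 |
| 1 | 3 | 1 | 1 | 5 |
| 2 | 3 | 5 | 3 | 11 |
| 3 | 5 | 5 | 1 | 11 |
| 4 | 1 | 1 | 3 | 5 |
| 5 | 4 | 2 | 5 | 11 |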


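```python
# use variables: prefix a local Python variable with @
high = 10
rolls.eval("High = Sum > @high", inplace = True)
rolls
```

| | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 7 | False |
| 1 | 3 | 1 | 1 | 5 | False |
| 2 | 3 | 5 | 3 | 11 | True |
| 3 | 5 | 5 | 1 | 11 | True |
| 4 | 1 | 1 | 3 | 5 | False |
| 5 | 4 | 2 | 5 | 11 | True |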

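The two cells above modify `rolls` in place. As a sketch of the alternative, `df.eval()` without `inplace=True` returns a new DataFrame and leaves the original untouched. The three rows below are copied from the example output above, and the name `scored` is introduced here only for illustration.

```python
import pandas as pd

# First three rows of the dice example, typed in by hand
rolls = pd.DataFrame({"Die1": [5, 3, 3], "Die2": [1, 1, 5], "Die3": [1, 1, 3]})
high = 10

# Without inplace=True, eval returns a new DataFrame with the extra column
scored = rolls.eval("Sum = Die1 + Die2 + Die3")

# @high refers to the Python variable defined above
scored = scored.eval("High = Sum > @high")

print(scored)
print(list(rolls.columns))  # the original frame still has only the dice columns
```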

## Query

`query()` gives a cleaner syntax for selecting rows. It tends to be faster than boolean-mask indexing on larger datasets and on compound expressions.

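```python
# selection with a boolean mask
low = 10
small_plain = rolls[rolls["Sum"] < low]
small_plain
```

| | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 7 | False |
| 1 | 3 | 1 | 1 | 5 | False |
| 4 | 1 | 1 | 3 | 5 | False |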


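```python
# the same selection with query
small_query = rolls.query("Sum < @low")
small_query
```

| | Die1 | Die2 | Die3 | Sum | High |
|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 7 | False |
| 1 | 3 | 1 | 1 | 5 | False |
| 4 | 1 | 1 | 3 | 5 | False |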

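The readability gain is clearer once several conditions are combined. The sketch below uses a small hand-written frame and the hypothetical bounds `low` and `high`. Inside a query string the keywords `and` and `or` can be used, and no parentheses are needed around each comparison.

```python
import pandas as pd

# Small hand-written example frame, not the random rolls from above
rolls = pd.DataFrame(
    {
        "Die1": [5, 3, 3, 5],
        "Die2": [1, 1, 5, 5],
        "Die3": [1, 1, 3, 1],
        "Sum": [7, 5, 11, 11],
    }
)
low, high = 6, 10

# Boolean masks: every condition needs its own parentheses
mask_rows = rolls[(rolls["Sum"] > low) & (rolls["Sum"] <= high) & (rolls["Die1"] == 5)]

# query: the same selection written as one readable string
query_rows = rolls.query("Sum > @low and Sum <= @high and Die1 == 5")

print(query_rows)
```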

```python
os = pd.read_csv("Data/athlete_events.csv")
os.head()
```

| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | | | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | | | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | |


```python
%timeit os[os["Season"] == "Winter"]
%timeit os.query("Season == 'Winter'")

plain = os[os["Season"] == "Winter"]
query = os.query("Season == 'Winter'")

plain.equals(query)
```

```
28.7 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13.1 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

True
```

```python
%timeit os[os["Height"] > 180]
%timeit os.query("Height > 180") # note that query is slower here

plain = os[os["Height"] > 180]
query = os.query("Height > 180")

plain.equals(query)
```

```
8.05 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.77 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

True
```

```python
# query is faster on compound expressions as it doesn't have to save intermediate results in memory
%timeit os[(os["Sex"] == "F") & (os["Height"] > 180) & (os["NOC"] == "SWE")]
%timeit os.query("Sex == 'F' & Height > 180 & NOC == 'SWE'")

plain = os[(os["Sex"] == "F") & (os["Height"] > 180) & (os["NOC"] == "SWE")]
query = os.query("Sex == 'F' & Height > 180 & NOC == 'SWE'")

plain.equals(query)
```

```
41.8 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
18.4 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

True
```

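The `%timeit` magic used above only works in IPython and Jupyter. In a plain Python script, a similar comparison can be sketched with the standard-library `timeit` module. The synthetic `athletes` frame below is an assumption standing in for `athlete_events.csv`, so the timings it prints are not comparable to the figures above and will vary between machines.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the athlete data (assumed columns and values)
rng = np.random.default_rng(42)
n = 50_000
athletes = pd.DataFrame(
    {
        "Sex": rng.choice(["F", "M"], size=n),
        "Height": rng.normal(175, 10, size=n),
        "NOC": rng.choice(["SWE", "NOR", "USA"], size=n),
    }
)


def select_plain():
    # compound boolean mask, one temporary Series per condition
    return athletes[
        (athletes["Sex"] == "F") & (athletes["Height"] > 180) & (athletes["NOC"] == "SWE")
    ]


def select_query():
    # the same selection expressed as a query string
    return athletes.query("Sex == 'F' and Height > 180 and NOC == 'SWE'")


runs = 5
t_plain = timeit.timeit(select_plain, number=runs) / runs
t_query = timeit.timeit(select_query, number=runs) / runs
print(f"plain: {t_plain * 1e3:.2f} ms, query: {t_query * 1e3:.2f} ms")

# Both selections should return identical rows
results_match = select_plain().equals(select_query())
```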
---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---
