These two examples look at some simple techniques that can be used to improve the performance of your pandas code: categoricals for improving data efficiency and processing, and numexpr for improving the performance of expression evaluation.

## Categorical Data

Columns in a table often contain repeated instances of a smaller set of distinct values. Functions like unique and value_counts enable us to extract the distinct values from an array and compute their frequencies.

In [9]:
import numpy as np
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [10]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [11]:
values.value_counts()

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, etc.) have developed specialised approaches for representing data with repeated values for more efficient storage and computation.

In data warehousing, a best practice is to use a dimension table containing the distinct values and to store the primary observations as integer keys referencing the dimension table.

In [12]:
values = pd.Series([0, 1, 0, 0] * 2)

dim = pd.Series(['apple', 'orange'])

In [13]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [14]:
dim

0     apple
1    orange
dtype: object

In [15]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called a categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary or levels of the data. The integer values that reference the categories are called category codes.
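Going the other way, from raw values to codes plus categories, can be sketched with pd.factorize. This is only an illustration of the encoding idea, not necessarily how the Categorical type builds its codes internally; the variable names are chosen for this example.

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns integer codes and the distinct values,
# numbered in order of first appearance
codes, uniques = pd.factorize(raw)

# Looking the codes up in the distinct values recovers the original data
decoded = uniques.take(codes)

print(codes)
print(list(uniques))
print(list(decoded))
```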

This representation can yield significant performance improvements when doing analytics. You can perform transformations on the categories while leaving the codes unmodified. For example:

* Renaming categories
* Appending a new category without changing the order or position of the existing categories
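As a small sketch of these two transformations, the example below uses the rename_categories and add_categories methods of a Categorical. The replacement names and the extra 'banana' category are made up for illustration.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'])

# Rename the categories; only the lookup table changes
renamed = c.rename_categories(['Apple', 'Orange'])

# Append a new, currently unused category at the end
extended = c.add_categories(['banana'])

print(renamed.codes)
print(list(extended.categories))
```

In both cases the codes array is identical to the original, which is why these operations are cheap compared with rewriting every string in a column.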
    
Pandas has a special Categorical type for holding data that uses the integer-based categorical representation.

In [17]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

N = len(fruits)

df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                   columns=['basket_id', 'fruit', 'count', 'weight'])

df

   basket_id   fruit  count    weight
0          0   apple     14   2.51821
1          1  orange      9   3.24909
2          2   apple      8  1.677339
3          3   apple     11  2.877242
4          4   apple      8  0.281992
5          5  orange     10  0.181555
6          6   apple      5   1.10224
7          7   apple     14  2.886661


In [18]:
c = df['fruit']
c

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: object

In [20]:
type(c)

pandas.core.series.Series

df['fruit'] is a Series of Python string objects. We can convert it to a categorical by calling astype('category'):

In [21]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [22]:
c = fruit_cat.values
c

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

In [23]:
type(c)

pandas.core.arrays.categorical.Categorical

The values attribute of the Series is a pandas Categorical object, which exposes the categories and codes attributes:

In [24]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [25]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
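One way to see how the codes and categories fit together is to build the code-to-label mapping explicitly and then rebuild the data from the two pieces. This is a minimal sketch; it recreates the Categorical locally so that it runs on its own.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'] * 2)

# Position in the categories index is the code for that label
code_to_label = dict(enumerate(c.categories))

# from_codes reassembles a Categorical from codes plus categories
rebuilt = pd.Categorical.from_codes(c.codes, categories=c.categories)

print(code_to_label)
print(rebuilt)
```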

In [26]:
df['fruit'] = df['fruit'].astype('category')
df

   basket_id   fruit  count    weight
0          0   apple     14   2.51821
1          1  orange      9   3.24909
2          2   apple      8  1.677339
3          3   apple     11  2.877242
4          4   apple      8  0.281992
5          5  orange     10  0.181555
6          6   apple      5   1.10224
7          7   apple     14  2.886661


Categorical objects in pandas generally behave the same way as non-encoded versions such as arrays of strings. Some parts of pandas, like the groupby function, perform better when working with categoricals, and some functions can make use of the ordered flag. As an example, we can generate some random numeric data and bin it with the pandas.qcut function. Notice that it returns a categorical!

In [27]:
np.random.seed(12345)

draws = np.random.randn(1000)

draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

Bin the values into four quartile bins and label them Q1 to Q4:

In [29]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

bins

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q3', 'Q2', 'Q1', 'Q3', 'Q4']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [31]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)
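The `<` signs in the qcut output indicate that the categories are ordered. As a rough sketch of what the ordered flag enables, the example below builds a small ordered categorical by hand; the 'small', 'medium' and 'large' labels are invented for illustration and are not part of the dataset above.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True))

# Comparisons and min/max follow the category order, not alphabetical order
below_large = sizes < 'large'
smallest, largest = sizes.min(), sizes.max()

# Sorting also follows the category order
in_order = sizes.sort_values()

print(below_large.tolist())
print(smallest, largest)
print(in_order.tolist())
```

Without ordered=True, the comparison and the min/max calls would raise an error, because an unordered categorical has no notion of one category being less than another.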

In [32]:
bins = pd.Series(bins, name='quartile')
bins

0      Q2
1      Q3
2      Q2
3      Q2
4      Q4
       ..
995    Q3
996    Q2
997    Q1
998    Q3
999    Q4
Name: quartile, Length: 1000, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [33]:
result = pd.Series(draws).groupby(bins).agg(['count','min','max']).reset_index()

result

   quartile  count        min        max
0        Q1    250  -2.949343  -0.685484
1        Q2    250  -0.683066  -0.010115
2        Q3    250  -0.010032   0.628894
3        Q4    250   0.634238   3.927528


If you are doing a lot of analytics on a dataset, converting to categorical can yield substantial overall performance gains, and the categorical version will often use significantly less memory too.

For example, let's assume we have a Series with 10 million elements and a small number of distinct categories.

In [34]:
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo','bar','baz','qux'] * (N // 4))

In [36]:
categories = labels.astype('category')

In [37]:
labels.memory_usage()

80000128

In [39]:
categories.memory_usage()

10000332
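The figure for the string version counts only the 8-byte object pointers (plus the index), not the Python strings they point to, so the real gap is larger than it looks. The sketch below compares the shallow and deep measurements on a smaller Series. It assumes an explicit dtype=object so that the result does not depend on which string dtype your pandas version uses by default; exact byte counts will vary by platform.

```python
import pandas as pd

N = 100_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4), dtype=object)
categories = labels.astype('category')

shallow_bytes = labels.memory_usage()        # pointers only
deep_bytes = labels.memory_usage(deep=True)  # pointers plus the string objects
cat_bytes = categories.memory_usage(deep=True)

print(shallow_bytes, deep_bytes, cat_bytes)
```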

GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array rather than an array of strings.
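A rough way to check this on your own machine is to time the same aggregation with both kinds of group key. This is only a sketch: the timed helper is a hypothetical convenience defined here, N is reduced to one million to keep it quick, and the measured difference will depend on your hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

N = 1_000_000
rng = np.random.default_rng(0)
draws = pd.Series(rng.standard_normal(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')


def timed(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


by_label, t_label = timed(lambda: draws.groupby(labels).sum())
# observed=True keeps only categories that actually occur in the data
by_cat, t_cat = timed(lambda: draws.groupby(categories, observed=True).sum())

print(f"string keys: {t_label:.4f}s, categorical keys: {t_cat:.4f}s")
```

Both calls produce the same sums per group; only the type of the resulting index differs. For more stable numbers, repeat each measurement several times or use the %timeit magic in a notebook.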