# Chapter 12 - Advanced pandas

The preceding chapters have focused on introducing different types of data wrangling
workflows and features of NumPy, pandas, and other libraries. Over time, pandas has
developed a depth of features for power users. This chapter digs into a few more
advanced feature areas to help you deepen your expertise as a pandas user.

## 12.1 Categorical Data

This section introduces the pandas Categorical type. I will show how you can
achieve better performance and memory use in some pandas operations by using it. I
also introduce some tools for using categorical data in statistics and machine learning
applications.

### Background and Motivation

Frequently, a column in a table may contain repeated instances of a smaller set of
distinct values. We have already seen functions like unique and value_counts, which
enable us to extract the distinct values from an array and compute their frequencies,
respectively:

In [2]:
import numpy as np; import pandas as pd

In [4]:
values = pd.Series(['apple', 'orange', 'apple','apple']*2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [5]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [6]:
pd.value_counts(values)

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have
developed specialized approaches for representing data with repeated values for more
efficient storage and computation. In data warehousing, a best practice is to use
so-called dimension tables, which contain the distinct values, and to store the primary
observations as integer keys referencing the dimension table:

In [7]:
values = pd.Series([0, 1, 0, 0]*2)
dim = pd.Series(['apple','orange'])

We can use the take method to restore the original Series of strings:

In [8]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded
representation. The array of distinct values can be called the categories, dictionary, or levels
of the data. In this book we will use the terms categorical and categories. The integer
values that reference the categories are called the category codes or simply codes.
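As a small sketch of how the codes and categories relate, pandas.factorize can derive both from the raw strings, and take reverses the encoding. The variable names here are illustrative only:

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes plus the distinct values,
# in order of first appearance
codes, categories = pd.factorize(values)

# Look up each code in the categories to rebuild the original strings
restored = pd.Series(categories.take(codes))

print(codes)
print(categories)
```

Here codes plays the role of the integer keys and categories the role of the dimension table from the earlier example.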

The categorical representation can yield significant performance improvements when
you are doing analytics. You can also perform transformations on the categories while
leaving the codes unmodified. Some example transformations that can be made at
relatively low cost are:

• Renaming categories

• Appending a new category without changing the order or position of the existing
categories
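The two transformations listed above can be sketched with the rename_categories and add_categories methods of pandas.Categorical. The data and the new category name 'pear' are made up for illustration:

```python
import pandas as pd

cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'])

# Renaming touches only the categories array
renamed = cat.rename_categories({'apple': 'APPLE'})

# A new category is appended at the end, so existing positions are unchanged
extended = cat.add_categories(['pear'])

print(cat.codes)
print(renamed.codes)        # same codes as before the rename
print(extended.categories)
```

In both cases the codes array is left as it was, which is why these operations are cheap compared with rewriting every string in the data.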

### Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based
categorical representation or encoding. Let’s consider the example Series from before:

In [9]:
fruits = ['apple', 'orange', 'apple', 'apple']*2

In [10]:
N = len(fruits)

In [11]:
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple      3  2.688002
1          1  orange      3  1.982690
2          2   apple      9  0.788339
3          3   apple      4  2.043525
4          4   apple      7  1.637106
5          5  orange      3  2.534282
6          6   apple     14  0.350466
7          7   apple      4  1.609160


Here, df['fruit'] is an array of Python string objects. We can convert it to categorical by calling:

In [12]:
fruit_cat = df['fruit'].astype('category')

In [13]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

The values for fruit_cat are not a NumPy array, but an instance of
pandas.Categorical:

In [14]:
c = fruit_cat.values

In [15]:
type(c)

pandas.core.arrays.categorical.Categorical

The Categorical object has categories and codes attributes:

In [16]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [17]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

You can convert a DataFrame column to categorical by assigning the converted result:

In [18]:
df['fruit'] = df['fruit'].astype('category')

In [19]:
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

You can also create pandas.Categorical directly from other types of Python
sequences:

In [20]:
my_cat = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_cat

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

If you have obtained categorical encoded data from another source, you can use the
alternative from_codes constructor:

In [21]:
categories = ['foo', 'bar', 'baz']

In [22]:
codes = [0, 1, 2, 0, 0, 1]

In [23]:
my_cats2 = pd.Categorical.from_codes(codes, categories)

In [24]:
my_cats2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

Unless explicitly specified, categorical conversions assume no specific ordering of the
categories. So the categories array may be in a different order depending on the
ordering of the input data. When using from_codes or any of the other constructors,
you can indicate that the categories have a meaningful ordering:

In [25]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

In [26]:
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering,
and so on. An unordered categorical instance can be made ordered with as_ordered:

In [27]:
my_cats2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
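To give a feel for what the ordering enables, here is a minimal sketch with hypothetical size labels. With ordered=True, comparisons, min, and sorting follow the category order rather than alphabetical order:

```python
import pandas as pd

# Hypothetical data: the categories argument fixes the order small < medium < large
sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

smallest = sizes.min()                 # uses the category order
at_least_medium = sizes >= 'medium'    # elementwise comparison against a category
sorted_sizes = sizes.sort_values()     # sorted by category order, not alphabetically

print(smallest)
print(at_least_medium.tolist())
print(sorted_sizes.tolist())
```

On an unordered categorical, the same comparison and min would raise an error, since pandas has no order to rely on.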

As a last note, categorical data need not be strings, even though I have only shown
string examples. A categorical array can consist of any immutable value types.
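For instance, a short sketch with integers (the ratings values are invented for illustration):

```python
import pandas as pd

# Hypothetical integer data; the inferred categories are the sorted distinct values
ratings = pd.Categorical([3, 1, 2, 3, 3, 1])

print(ratings.categories)
print(ratings.codes)
```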

### Computations with Categoricals

Using Categorical in pandas generally behaves the same way as using the non-encoded
version (like an array of strings). Some parts of pandas, like the groupby
function, perform better when working with categoricals. There are also some
functions that can utilize the ordered flag.

Let’s consider some random numeric data, and use the pandas.qcut binning
function. This returns a pandas.Categorical; we used pandas.cut earlier in the book but
glossed over the details of how categoricals work:

In [28]:
np.random.seed(12345)

In [29]:
draws = np.random.randn(1000)

In [30]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

Let’s compute a quartile binning of this data and extract some statistics:

In [32]:
bins = pd.qcut(draws, 4)

In [33]:
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

Although the exact sample quartiles are informative, quartile names may be more
useful for producing a report. We can achieve this with the labels argument to qcut:

In [34]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [35]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

The labeled bins categorical does not contain information about the bin edges in the
data, so we can use groupby to extract some summary statistics:

In [36]:
bins = pd.Series(bins, name='quartile')

In [37]:
result = (pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index())

In [38]:
result

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


The 'quartile' column in the result retains the original categorical information,
including ordering, from bins:

In [39]:
result['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

#### Better performance with categoricals

If you do a lot of analytics on a particular dataset, converting to categorical can yield
substantial overall performance gains. A categorical version of a DataFrame column
will often use significantly less memory, too. Let’s consider some Series with 10
million elements and a small number of distinct categories:

In [41]:
N = 10000000

In [42]:
draws = pd.Series(np.random.randn(N))

In [43]:
labels = pd.Series(['foo', 'bar', 'baz', 'qux']*(N//4))

Now we convert labels to categorical:

In [44]:
categories = labels.astype('category')

Now we note that labels uses significantly more memory than categories:

In [45]:
labels.memory_usage()

80000080

In [46]:
categories.memory_usage()

10000272
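One caveat worth sketching: when a Series has object dtype, memory_usage() counts only the 8-byte references to the Python strings, not the strings themselves. Passing deep=True asks pandas to inspect the objects as well. The example below uses a smaller N than above to keep it quick, and assumes the strings are stored with object dtype:

```python
import pandas as pd

N = 100_000  # smaller than the 10 million used above, chosen only for speed
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

shallow = labels.memory_usage()            # references (plus index) only, for object dtype
deep = labels.memory_usage(deep=True)      # also counts the string objects
cat_deep = categories.memory_usage(deep=True)

print(shallow, deep, cat_deep)
```

With the deep measurement, the gap between the string and categorical versions is typically wider than the shallow figures suggest.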

The conversion to category is not free, of course, but it is a one-time cost:

In [47]:
%time _ = labels.astype('category')

Wall time: 944 ms


GroupBy operations can be significantly faster with categoricals because the
underlying algorithms use the integer-based codes array instead of an array of strings.
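A rough way to check this on your own machine is to time the same aggregation with string keys and with categorical keys. This is only a sketch: the timed helper is hypothetical, N is reduced to one million, and the timings will vary by machine and pandas version:

```python
import time

import numpy as np
import pandas as pd

N = 1_000_000  # reduced from 10 million to keep the sketch fast
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')


def timed(func):
    """Hypothetical helper: run func once, return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Same aggregation, once keyed by strings and once by the categorical
by_label, t_label = timed(lambda: draws.groupby(labels).mean())
by_cat, t_cat = timed(lambda: draws.groupby(categories, observed=True).mean())

print(f"string keys: {t_label:.3f}s, categorical keys: {t_cat:.3f}s")
```

Both calls produce the same group means; only the type of the grouping key differs.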

### Categorical Methods

Series containing categorical data have several special methods similar to the
Series.str specialized string methods. These methods are reached through the cat
accessor, which also provides convenient access to the categories and codes. Consider the Series:

In [48]:
s = pd.Series(['a', 'b', 'c', 'd']*2)

In [49]:
cat_s = s.astype('category')

In [51]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special attribute cat provides access to categorical methods:

In [55]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [56]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the
four values observed in the data. We can use the set_categories method to change
them:

In [57]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

In [58]:
cat_s2 = cat_s.cat.set_categories(actual_categories)

In [59]:
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

While it appears that the data is unchanged, the new categories will be reflected in
operations that use them. For example, value_counts respects the categories, if
present:

In [60]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [61]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings
and better performance. After you filter a large DataFrame or Series, many of the
categories may not appear in the data. To help with this, we can use the
remove_unused_categories method to trim unobserved categories:

In [63]:
cat_s3 = cat_s[cat_s.isin(['a','b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [65]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

See Table 12-1 for a listing of available categorical methods.

![](cat_series.jpg)
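As a brief sketch of a few other methods reachable through cat (add_categories, remove_categories, and reorder_categories), applied to the same Series as above:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Append a new, unobserved category at the end
added = cat_s.cat.add_categories(['e'])

# Remove a category; values that belonged to it become missing
removed = cat_s.cat.remove_categories(['d'])

# Put the same categories in a new order and mark them as ordered
reordered = cat_s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

print(added.cat.categories)
print(removed.isna().sum())
print(reordered.cat.codes.tolist())
```

Each call returns a new Series and leaves cat_s itself unchanged.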

#### Creating dummy variables for modeling

When you’re using statistics or machine learning tools, you’ll often transform
categorical data into dummy variables, also known as one-hot encoding. This involves
creating a DataFrame with a column for each distinct category; these columns contain 1s
for occurrences of a given category and 0s otherwise.

Consider the previous example:

In [66]:
cat_s = pd.Series(['a', 'b', 'c', 'd']*2, dtype='category')

As mentioned previously in Chapter 7, the pandas.get_dummies function converts
this one-dimensional categorical data into a DataFrame containing the dummy
variable:

In [67]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
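Two details are worth a quick sketch. First, get_dummies creates a column for every category of a categorical Series, including ones that never occur in the data. Second, newer pandas releases (2.0 and later, to my knowledge) return boolean columns by default, so passing dtype=int is one way to get the 0/1 output shown above. The prefix 'category' is an arbitrary choice for illustration:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Declare an extra category 'e' that is never observed
cat_s2 = cat_s.cat.set_categories(['a', 'b', 'c', 'd', 'e'])

# dtype=int requests 0/1 integers; prefix is prepended to each column name
dummies = pd.get_dummies(cat_s2, prefix='category', dtype=int)

print(dummies.columns.tolist())
print(dummies['category_e'].sum())
```

Keeping the unobserved category as an all-zero column can be convenient when training and test data must share the same set of columns.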
