# pandas - Advanced pandas

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## Categorical Data

In [2]:
values = Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [3]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [4]:
values.value_counts()  # the top-level pd.value_counts is deprecated in recent pandas

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables that contain the distinct values, and to store the primary observations as integer keys referencing the dimension table:

In [5]:
values = Series([0, 1, 0, 0] * 2)
dim = Series(['apple', 'orange'])

In [6]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [7]:
dim

0     apple
1    orange
dtype: object

In [8]:
dim.take(values) # restore the original Series of strings

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. In this book we will use the terms categorical and categories. The integer values that reference the categories are called the category codes or simply codes.
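The encoding described above can be reproduced with `pd.factorize`, which returns the integer codes together with the distinct values. This is a minimal sketch; the variable names are illustrative and not taken from the text.

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes and the distinct values,
# with the distinct values in order of first appearance
codes, categories = pd.factorize(values)

# Looking up each code in the distinct values restores the original strings
restored = pd.Series(categories.take(codes))
```

Here `codes` plays the role of the integer keys and `categories` the role of the dimension table.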

### Categorical Type in pandas

In [9]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = DataFrame({'fruit': fruits,
                'basket_id': np.arange(N),
                'count': np.random.randint(3, 15, size=N),
                'weight': np.random.uniform(0, 4, size=N)},
                columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple     10  1.503373
1          1  orange     10  1.858161
2          2   apple      5  3.117123
3          3   apple     14  1.212262
4          4   apple      8  3.530910
5          5  orange     12  0.845803
6          6   apple     13  2.175598
7          7   apple     10  1.222217


In [10]:
fruit_cat = df['fruit'].astype('category') # fruits are categorical
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [11]:
c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

In [12]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [13]:
c.codes # categorical data has codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [14]:
df.fruit = df.fruit.astype('category')
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

You can also create a `pandas.Categorical` object directly from other types of Python sequences:

In [15]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [16]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

my_cats_2 = pd.Categorical.from_codes(codes, categories) # use from_codes if you already have categorical encoded data
my_cats_2

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']

In [17]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True) # indicates that categories have a meaningful ordering
ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

In [18]:
my_cats_2.as_ordered() # an unordered categorical instance can be made ordered with as_ordered

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

As a last note, categorical data need not be strings, even though I have only shown string examples. A categorical array can consist of any immutable value types.
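As a small illustration of non-string categories, the sketch below builds a categorical from integers. The name `int_cat` and the values are made up for the example.

```python
import pandas as pd

# Integer values work as categories just as strings do
int_cat = pd.Categorical([10, 20, 10, 30, 20])

int_cat.categories  # the distinct values, sorted: 10, 20, 30
int_cat.codes       # positions into the categories: 0, 1, 0, 2, 1
```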

### Computations with Categoricals

A Categorical in pandas generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.
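To show what the ordered flag enables, here is a hedged sketch with a hypothetical `sizes` Series. With `ordered=True`, minimum, maximum, comparisons, and sorting follow the category order rather than alphabetical order.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['low', 'high', 'medium', 'low'],
    categories=['low', 'medium', 'high'],
    ordered=True))

smallest = sizes.min()          # 'low'
largest = sizes.max()           # 'high'
above_low = sizes > 'low'       # element-wise comparison by category order
in_order = sizes.sort_values()  # low, low, medium, high
```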

In [19]:
np.random.seed(42)

In [20]:
draws = np.random.randn(1000)
draws[:5]

array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337])

In [21]:
bins = pd.qcut(draws, 4)
bins

[(0.0253, 0.648], (-0.648, 0.0253], (0.0253, 0.648], (0.648, 3.853], (-0.648, 0.0253], ..., (-0.648, 0.0253], (0.648, 3.853], (0.0253, 0.648], (-0.648, 0.0253], (0.0253, 0.648]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.242, -0.648] < (-0.648, 0.0253] < (0.0253, 0.648] < (0.648, 3.853]]

In [22]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4']) # add labels for useful reports
bins

['Q3', 'Q2', 'Q3', 'Q4', 'Q2', ..., 'Q2', 'Q4', 'Q3', 'Q2', 'Q3']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [23]:
bins.codes[:10]

array([2, 1, 2, 3, 1, 1, 3, 3, 1, 2], dtype=int8)

In [24]:
bins = Series(bins, name='quartile')
results = pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index()
results

  quartile  count       min       max
0       Q1    250 -3.241267 -0.650643
1       Q2    250 -0.646573  0.024510
2       Q3    250  0.026091  0.647689
3       Q4    250  0.648710  3.852731


In [25]:
results['quartile'] # the column retains the original categorical information, including ordering

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

#### Better performance with categoricals

In [26]:
N = 10000000 # 10 mil elements
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
labels

0          foo
1          bar
2          baz
3          qux
4          foo
          ... 
9999995    qux
9999996    foo
9999997    bar
9999998    baz
9999999    qux
Length: 10000000, dtype: object

In [27]:
categories = labels.astype('category') # convert labels to categorical
categories

0          foo
1          bar
2          baz
3          qux
4          foo
          ... 
9999995    qux
9999996    foo
9999997    bar
9999998    baz
9999999    qux
Length: 10000000, dtype: category
Categories (4, object): ['bar', 'baz', 'foo', 'qux']

In [28]:
labels.memory_usage() # labels uses significantly more memory than categories

80000128

In [29]:
categories.memory_usage()

10000332

In [30]:
%time _ = labels.astype('category')

CPU times: user 396 ms, sys: 19.3 ms, total: 416 ms
Wall time: 419 ms
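The cells above compare memory and conversion cost. The sketch below extends the comparison to a GroupBy operation, under stated assumptions: it uses a smaller `N` than the 10 million rows above, relies on `time.perf_counter` instead of `%time` so that it runs outside IPython, and makes no claim about the timings you will see. It also passes `deep=True` to `memory_usage`, since the default call on object-dtype data counts only the references and not the strings themselves.

```python
import time

import numpy as np
import pandas as pd

N = 100_000  # smaller than the 10 million rows above, to keep the sketch quick
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# deep=True also counts the Python string objects, which the default
# memory_usage() call does not for object-dtype data
labels_bytes = labels.memory_usage(deep=True)
categories_bytes = categories.memory_usage(deep=True)

start = time.perf_counter()
by_label = draws.groupby(labels).mean()
label_seconds = time.perf_counter() - start

start = time.perf_counter()
# observed=True limits the result to categories present in the data
by_category = draws.groupby(categories, observed=True).mean()
category_seconds = time.perf_counter() - start

print(f'object labels: {label_seconds:.4f}s, categorical: {category_seconds:.4f}s')
```

Both groupings produce the same means; only the cost of getting there differs.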


### Categorical Methods

Series containing categorical data have several special methods, exposed through the `cat` accessor, similar to the specialized string methods of `Series.str`. The accessor also provides convenient access to the categories and codes.

In [31]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: object

In [32]:
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [33]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [34]:
# Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data
actual_categories = ['a', 'b', 'c', 'd', 'e']

# We can use the set_categories method to change them:
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [35]:
cat_s.value_counts()

a    2
b    2
c    2
d    2
dtype: int64

In [36]:
cat_s2.value_counts()

a    2
b    2
c    2
d    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After you filter a large DataFrame or Series, many of the categories may not appear in the data. To help with this, we can use the `remove_unused_categories` method to trim unobserved categories:

In [37]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [38]:
cat_s3.cat.remove_unused_categories() # remove unused categories

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
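One place where unused categories become visible is `value_counts`, which reports them with a count of zero. A minimal sketch, rebuilding `cat_s3` from scratch so that it runs on its own:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

# Unused categories still show up, with a count of zero
counts_before = cat_s3.value_counts()

# After trimming, only the observed categories remain
counts_after = cat_s3.cat.remove_unused_categories().value_counts()
```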

#### Creating dummy variables for modeling

When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as ***one-hot encoding***. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.

In [39]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [40]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
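To make the link between category codes and dummy variables concrete, the following sketch builds the same indicator matrix by hand. It assumes there are no missing values (a code of -1 would need separate handling), and the names `identity` and `manual_dummies` are illustrative.

```python
import numpy as np
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Row i of the identity matrix is the indicator vector for category code i,
# so indexing it with the codes yields one indicator row per observation
identity = np.eye(len(cat_s.cat.categories), dtype=int)
manual_dummies = pd.DataFrame(identity[cat_s.cat.codes.to_numpy()],
                              columns=cat_s.cat.categories,
                              index=cat_s.index)

# dtype=int requests 0/1 integers; newer pandas versions return booleans by default
builtin_dummies = pd.get_dummies(cat_s, dtype=int)
```

In practice `pd.get_dummies` is the simpler choice; the manual version is only meant to show what it computes.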


## Advanced GroupBy Use

### Group Transforms and “Unwrapped” GroupBys

In [41]:
df = DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
df

   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0


In [42]:
g = df.groupby('key').value
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

Suppose instead we wanted to produce a Series of the same shape as `df['value']` but with values replaced by the average grouped by `'key'`. We can pass the function `lambda x: x.mean()` to `transform`:

In [43]:
g.transform(lambda x: x.mean()) # replace each value with the mean of its 'key' group

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [44]:
g.transform('mean') # built-in aggregation functions can be passed by name

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [45]:
g.transform(lambda x: x * 2) # multiply each value by 2; the result keeps the original shape

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

In [46]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [47]:
def normalize(x):
    return (x - x.mean()) / x.std()

g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [48]:
g.apply(normalize) # apply also works with group transformation functions

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
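The heading mentions "unwrapped" group operations, which the cells above do not show. One reading, offered here as an assumption rather than the author's exact method, is to compute the group statistics with built-in aggregations via `transform` and then do the arithmetic on whole columns, instead of calling a Python function once per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key').value

def normalize(x):
    return (x - x.mean()) / x.std()

# "Unwrapped" form: compute group statistics with built-in aggregations,
# then do the arithmetic on whole columns
normalized = (df['value'] - g.transform('mean')) / g.transform('std')

# Check that both approaches agree
same = np.allclose(normalized, g.transform(normalize))
```

This performs several group aggregations, but each one uses pandas' built-in implementation, which tends to pay off on larger data.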

### Grouped Time Resampling

In [49]:
N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = DataFrame({'time': times, 'value': np.arange(N)})
df

                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14


In [50]:
df.set_index('time').resample('5min').count()

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5


In [51]:
# Suppose that a DataFrame contains multiple time series, marked by an additional group key column...
df2 = DataFrame({'time': times.repeat(3), 'key': np.tile(['a', 'b', 'c'], N), 'value': np.arange(N * 3.)})
df2.head(7)

                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0
