# 7.5 Categorical Data

1. [Background and Motivation](#background)
1. [Categorical Extension Type in pandas](#extension)
1. [Computations with Categoricals](#computation)
1. [Categorical Methods](#methods)

<a name="background"></a>
# Background and Motivation

This is conceptually similar to R's factors.

If an array contains many repeats of a small set of distinct values, it can be more efficient to store each element as an integer code (like R's "factor levels") and to keep a separate array of the distinct values that the codes refer to.

This is called the `categorical` representation. The array of distinct values is referred to as the `categories`, and the integer values that reference the categories are the `category codes` (or just `codes`).

First, we'll look at the non-categorical way of doing it: a Series with 6 of one value and 2 of another.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Make a series
values = pd.Series(['apple', 'orange', 'apple',
                    'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
# View the uniques
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [5]:
# Get counts of each
values.value_counts()

apple     6
orange    2
Name: count, dtype: int64

Now look at the same data in encoded form: the distinct values `apple` and `orange` are stored once in a Series of categories (`dim`), and the data itself is stored as a Series of integer codes (`values`) that index into it:

In [6]:
# Codes
values = pd.Series([0, 1, 0, 0] * 2)
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [7]:
# Categories
dim = pd.Series(['apple', 'orange'])
dim

0     apple
1    orange
dtype: object

In [8]:
# Map the codes back to their categories with take
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
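The codes and categories above were typed in by hand. As a minimal sketch (not part of the original notebook), `pd.factorize` can derive both from the raw strings. The variable names `raw`, `codes`, and `uniques` are illustrative.

```python
import pandas as pd

# Raw, non-categorical data: 6 apples and 2 oranges
raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns one integer code per element plus the distinct
# values, numbered in order of first appearance
codes, uniques = pd.factorize(raw)

print(codes)    # integer code for each element
print(uniques)  # the distinct values that the codes point to
```

Here `apple` appears first, so it receives code 0 and `orange` receives code 1, matching the hand-written `values` and `dim` above.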

<a name="extension"></a>
# Categorical Extension Type in pandas

pandas has a special `Categorical` extension type for holding data that uses this integer-based categorical encoding.

In [9]:
# List of fruits:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
fruits

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']

In [10]:
# Length
N = len(fruits)

In [11]:
# DataFrame of fruits and some information about them
rng = np.random.default_rng(seed=12345) # set seed
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': rng.integers(3, 15, size=N),
                   'weight': rng.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple     11  1.564438
1          1  orange      5  1.331256
2          2   apple     12  2.393235
3          3   apple      6  0.746937
4          4   apple      5  2.691024
5          5  orange     12  3.767211
6          6   apple     10  0.992983
7          7   apple     11  3.795525


In [12]:
# Look at the fruit column - a Series of Python string objects
print(df['fruit'])
print("\n")
print(type(df['fruit']))
print("\n")
print(type(df['fruit'][0]))

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: object


<class 'pandas.core.series.Series'>


<class 'str'>


In [13]:
# Make it a categorical
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [14]:
type(fruit_cat)

pandas.core.series.Series

In [15]:
fruit_cat[1]

'orange'

The values of `fruit_cat` are now an instance of `pandas.Categorical`, which can be accessed via the `.array` attribute.

In [16]:
c = fruit_cat.array
c

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

In [17]:
type(c)

pandas.core.arrays.categorical.Categorical

In [18]:
c[1]

'orange'

Categorical objects have two key attributes, `categories` and `codes`:

In [19]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [20]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [21]:
# Map codes and categories
dict(enumerate(c.categories))

{0: 'apple', 1: 'orange'}
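To tie the two attributes together, the sketch below decodes a Categorical by hand: each code is used as a position in `categories`, just as `dim.take(values)` did in the background section. The name `decoded` is illustrative.

```python
import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
c = pd.Categorical(fruits)

# Each code is a position in c.categories
decoded = c.categories.take(c.codes)
print(list(decoded))

# The decoded labels match the values the Categorical reports
print(list(decoded) == list(c))
```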

In [22]:
# Update the original DataFrame fruit column with the categorical version
df['fruit'] = df['fruit'].astype('category')
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [23]:
# Make a categorical out of a list
myList = ['foo', 'bar', 'baz', 'foo', 'bar']
print(myList)
myCats = pd.Categorical(myList)
myCats

['foo', 'bar', 'baz', 'foo', 'bar']


['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [24]:
# Make a categorical out of categories and codes
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
print(f"Categories:\n{categories}\n\nCodes:\n{codes}")
myNewCats = pd.Categorical.from_codes(codes, categories)
myNewCats


Categories:
['foo', 'bar', 'baz']

Codes:
[0, 1, 2, 0, 0, 1]


['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo', 'bar', 'baz']
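One detail worth knowing when building codes by hand: `from_codes` treats the code `-1` as a missing value. A small sketch, with the illustrative name `codes_with_missing`:

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
# -1 is the sentinel code for a missing value
codes_with_missing = [0, -1, 2, 1]

cat = pd.Categorical.from_codes(codes_with_missing, categories)
print(cat)         # the second element shows up as NaN
print(cat.isna())  # boolean mask marking the missing element
```

Any other code outside the range of `categories` is rejected with an error rather than silently mapped.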

Categories don't have an order (levels) by default, but you can specify one:

In [25]:
pd.Categorical(myList, ordered=True)

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar' < 'baz' < 'foo']

In [26]:
pd.Categorical.from_codes(codes, categories, ordered=True)

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']

In [27]:
myNewCats.as_ordered()

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
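The practical effect of an ordering is that sorting, `min`/`max`, and comparisons follow the category order rather than alphabetical order. A minimal sketch, using a made-up `sizes` Series:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],  # explicit order: small < medium < large
    ordered=True,
))

# Sorting follows the category order, not the alphabet
print(sizes.sort_values().tolist())

# min/max and comparisons are only meaningful because ordered=True
print(sizes.min(), sizes.max())
print((sizes > 'small').tolist())
```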

<a name="computation"></a>
# Computations with Categoricals

Categoricals generally behave the same as their nonencoded (e.g. string) counterparts; they are just faster for some operations and have some extra functionality.

In [28]:
# Make an array of random data
rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)
draws[:5]

array([-1.42382504,  1.26372846, -0.87066174, -0.25917323, -0.07534331])

In [29]:
# Make quartiles (which happen to be a categorical)
bins = pd.qcut(draws, 4)
bins

[(-3.121, -0.675], (0.687, 3.211], (-3.121, -0.675], (-0.675, 0.0134], (-0.675, 0.0134], ..., (0.0134, 0.687], (0.0134, 0.687], (-0.675, 0.0134], (0.0134, 0.687], (-0.675, 0.0134]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.121, -0.675] < (-0.675, 0.0134] < (0.0134, 0.687] < (0.687, 3.211]]

In [30]:
# Also label them for easier reading
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

['Q1', 'Q4', 'Q1', 'Q2', 'Q2', ..., 'Q3', 'Q3', 'Q2', 'Q3', 'Q2']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [31]:
# View the codes
bins.codes[:10]

array([0, 3, 0, 1, 1, 0, 0, 2, 2, 0], dtype=int8)

In [32]:
# Convert to a series
bins = pd.Series(bins, name='quartile')
bins

0      Q1
1      Q4
2      Q1
3      Q2
4      Q2
       ..
995    Q3
996    Q3
997    Q2
998    Q3
999    Q2
Name: quartile, Length: 1000, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [33]:
# Convert draws to a series also
draws = pd.Series(draws)

In [34]:
# Do a bunch of stuff in one command
# First group, then aggregate
results = (draws.groupby(bins).agg(['count', 'min', 'max']).reset_index())
results

  quartile  count       min       max
0       Q1    250 -3.119609 -0.678494
1       Q2    250 -0.673305  0.008009
2       Q3    250  0.018753  0.686183
3       Q4    250  0.688282  3.211418


In [35]:
# The quartile column keeps the categorical dtype and ordering from bins
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
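When the grouping key is a categorical, the `observed` argument of `groupby` controls whether categories that never occur in the data still get a row in the result. The sketch below uses small made-up data (`values`, `keys`) to show the difference.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0])
# 'c' is a declared category that never occurs in the data
keys = pd.Series(pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']))

# observed=False: one row per category, including the unused 'c'
all_groups = values.groupby(keys, observed=False).sum()
print(all_groups)

# observed=True: only categories that actually appear
seen_groups = values.groupby(keys, observed=True).sum()
print(seen_groups)
```

Passing `observed` explicitly also keeps the result independent of the pandas version's default.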

## Better performance with categoricals

In order to see the performance improvements, we need a larger example.

Look at the memory usage of a Series of 10 million elements with a few different categories. One version is a standard Series of strings, the other is a categorical.

In [36]:
# Make Series of strings
N = 10_000_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
labels

0          foo
1          bar
2          baz
3          qux
4          foo
          ... 
9999995    qux
9999996    foo
9999997    bar
9999998    baz
9999999    qux
Length: 10000000, dtype: object

In [37]:
# Convert to categories
categories = labels.astype('category')
categories

0          foo
1          bar
2          baz
3          qux
4          foo
          ... 
9999995    qux
9999996    foo
9999997    bar
9999998    baz
9999999    qux
Length: 10000000, dtype: category
Categories (4, object): ['bar', 'baz', 'foo', 'qux']

In [38]:
# Compare memory usage (in bytes)
print(labels.memory_usage(deep=True))
print(categories.memory_usage(deep=True))

600000128
10000540
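The saving comes from the codes: with only four categories, pandas can store each element as a one-byte integer, so the categorical needs roughly one byte per element plus a small fixed cost for the category labels. A scaled-down sketch (N is reduced here so that it runs quickly):

```python
import numpy as np
import pandas as pd

N = 100_000  # smaller than the 10 million above so it runs quickly
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

codes = categories.cat.codes
print(codes.dtype)              # smallest integer dtype that fits the codes
print(codes.to_numpy().nbytes)  # bytes used by the codes alone

# Total memory of each version, in bytes
print(labels.memory_usage(deep=True))
print(categories.memory_usage(deep=True))
```

The exact byte counts depend on the pandas version and on how strings are stored, so only the relative difference should be read from this.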


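GroupBy operations are also faster with categoricals, because they can work on the integer codes instead of the strings. Here we time `value_counts`, which uses the GroupBy machinery internally: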

In [39]:
%timeit labels.value_counts()

191 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [40]:
%timeit categories.value_counts()

25 ms ± 161 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
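`%timeit` only works inside IPython or Jupyter. As a rough sketch for a plain Python script, the standard-library `timeit` module gives a comparable measurement. The size `N` and the repeat count are arbitrary choices, and the timings will vary by machine.

```python
import timeit

import pandas as pd

N = 1_000_000  # arbitrary size, smaller than above to keep the run short
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

repeats = 5
t_labels = timeit.timeit(labels.value_counts, number=repeats)
t_cats = timeit.timeit(categories.value_counts, number=repeats)

print(f"strings:     {t_labels / repeats:.4f} s per call")
print(f"categorical: {t_cats / repeats:.4f} s per call")

# Both versions report the same counts
same_counts = (labels.value_counts().sort_index().tolist()
               == categories.value_counts().sort_index().tolist())
print(same_counts)
```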


<a name="methods"></a>
# Categorical Methods

Similar to the specialized string methods under `Series.str`, a Series containing categorical data has several special methods.

`cat` is an **accessor** attribute that gives access to these categorical methods.

<img src="./myImages/table7.7_seriesCatMethods.png" width = 600>

In [41]:
# Make a series and convert to categorical
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [42]:
# View the codes
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [43]:
# View the categories
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [44]:
# Add other levels even if not present in current data
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [45]:
# Value counts on original:
cat_s.value_counts()

a    2
b    2
c    2
d    2
Name: count, dtype: int64

In [46]:
# Value counts with extra category
cat_s2.value_counts()

a    2
b    2
c    2
d    2
e    0
Name: count, dtype: int64

In [47]:
# Subset maintains all categories
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [48]:
# Drop unused levels after a subset
cat_s3.cat.remove_unused_categories()


0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
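A few more methods on the `cat` accessor follow the same pattern: each returns a new Series and leaves the original unchanged. This is a short sketch with illustrative variable names, not an exhaustive list.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Append a new category at the end of the existing ones
added = s.cat.add_categories(['e'])

# Rename categories; a dict renames only the listed ones
renamed = s.cat.rename_categories({'a': 'A'})

# Remove a category; values that used it become missing
removed = s.cat.remove_categories(['d'])

# Change the category order and mark the result as ordered
reordered = s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

print(added.cat.categories)
print(renamed.cat.categories)
print(removed.isna().sum())
print(reordered.min())
```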

## Creating dummy variables for modeling

Categorical data is often transformed into dummy variables (also known as one-hot encoding) for statistics and machine learning.

In [49]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [50]:
pd.get_dummies(cat_s, dtype=float)

     a    b    c    d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
3  0.0  0.0  0.0  1.0
4  1.0  0.0  0.0  0.0
5  0.0  1.0  0.0  0.0
6  0.0  0.0  1.0  0.0
7  0.0  0.0  0.0  1.0
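For models that include an intercept, the full set of dummy columns is redundant, since the columns in each row always sum to 1. A common workaround is to drop one category. The sketch below uses the `drop_first` and `prefix` arguments of `pd.get_dummies`; the prefix `letter` is an arbitrary label.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# prefix labels the columns; drop_first omits the first category ('a'),
# so rows where the value is 'a' are all zeros
dummies = pd.get_dummies(cat_s, prefix='letter', drop_first=True, dtype=float)

print(dummies.columns.tolist())
print(dummies.head(4))
```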
