The categorical data type is useful in the following cases:

- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will **save some memory**.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will **use the logical order instead of the lexical order**.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
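
The second case above (logical versus lexical order) can be illustrated with a minimal sketch. The variable names (`sizes`, `size_type`) are made up for this illustration, and the snippet only assumes that pandas is installed:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sizes = pd.Series(["three", "one", "two", "one"])

# Plain strings sort lexically: "one" < "three" < "two".
lexical = sizes.sort_values().tolist()

# An ordered categorical sorts by the declared category order instead.
size_type = CategoricalDtype(categories=["one", "two", "three"], ordered=True)
sizes_cat = sizes.astype(size_type)
logical = sizes_cat.sort_values().tolist()

print(lexical)  # ['one', 'one', 'three', 'two']
print(logical)  # ['one', 'one', 'two', 'three']
print(sizes_cat.min(), sizes_cat.max())  # one three
```

The same declared order also drives min() and max(), so the largest value is "three" rather than the lexically last "two".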

CategoricalAccessor attributes and methods (source: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py):
- s.cat.categories
- s.cat.categories = list('abc')
- s.cat.rename_categories(list('cab'))
- s.cat.reorder_categories(list('cab'))
- s.cat.add_categories(['d','e'])
- s.cat.remove_categories(['d'])
- s.cat.remove_unused_categories()
- s.cat.set_categories(list('abcde'))
- s.cat.as_ordered()
- s.cat.as_unordered()
- **s.cat.codes**

- s.cat.ordered: boolean



1. `pd.Categorical(values, categories=, ordered=, dtype=)` creates a categorical array from a list of values.
2. `pd.api.types.CategoricalDtype(categories=, ordered=)` creates a categorical dtype that can be passed as `dtype=` or to `.astype()`.
3. Grouping by categorical data also shows the unused categories (see cells In [30] and In [31]).
4. `s.cat.codes`: missing data gets a code of -1.


In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" 

In [2]:
import numpy as np
import pandas as pd

### Object creation:
#### Series
- create a categorical Series directly by passing dtype="category"

In [3]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

- convert an existing Series or column with astype('category')

In [4]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype('category')
df.dtypes

A      object
B    category
dtype: object

- by some special functions, such as cut()

In [5]:
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(5)
df.dtypes

   value    group
0     20  20 - 29
1     90  90 - 99
2     42  40 - 49
3     77  70 - 79
4     50  50 - 59


value       int64
group    category
dtype: object

- passing a pandas.Categorical object to a Series or a DataFrame

In [6]:
raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False)
s = pd.Series(raw_cat)
s

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

In [7]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = raw_cat
df.dtypes

A      object
B    category
dtype: object

#### DataFrame

In [8]:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")
df.dtypes

A    category
B    category
dtype: object

In [9]:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
df_cat = df.astype('category') # transform all columns
df_cat.dtypes

A    category
B    category
dtype: object

#### Controlling the categorical types
In the examples above where we passed dtype='category', we used the default behavior:

- Categories are inferred from the data.
- Categories are unordered.

To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.


In [10]:
from pandas.api.types import CategoricalDtype
s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True) # it has order now
s_cat = s.astype(cat_type)
s_cat

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b < c < d]

In [11]:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
cat_type = CategoricalDtype(categories=list('abcd'), ordered=True)
df_cat = df.astype(cat_type)
df_cat['A']
df_cat['B']

0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): [a < b < c < d]

0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): [a < b < c < d]

In [12]:
splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
s

0     test
1    train
2    train
3     test
4    train
dtype: category
Categories (2, object): [train, test]

#### Regaining Original Data
To get back to the original data, use Series.astype(original_dtype) or np.asarray(categorical).

In [13]:
s = pd.Series(["a", "b", "c", "a"])
s2 = s.astype('category')
# get back to original dtype
s2.astype(str)
np.asarray(s2)

0    a
1    b
2    c
3    a
dtype: object

array(['a', 'b', 'c', 'a'], dtype=object)

### CategoricalDtype
A categorical’s type is fully described by:
- categories: a sequence of unique values, with no missing values
- ordered: a boolean

This information can be stored in a CategoricalDtype. The categories argument is optional, which implies that the actual categories should be inferred from whatever is present in the data when the pandas.Categorical is created. The categories are assumed to be unordered by default.

In [14]:
# A CategoricalDtype can be used in any place pandas expects a dtype. 
# For example pandas.read_csv(), pandas.DataFrame.astype(), or in the Series constructor.
from pandas.api.types import CategoricalDtype
CategoricalDtype(['a', 'b', 'c'])
CategoricalDtype(['a', 'b', 'c'], ordered=True)
CategoricalDtype()

CategoricalDtype(categories=['a', 'b', 'c'], ordered=None)

CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

CategoricalDtype(categories=None, ordered=None)

### Description
Using describe() on categorical data will produce similar output to a Series or DataFrame of type string.

In [15]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
df.describe()
df["cat"].describe()

       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2


count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

### Working with categories


In [16]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s.cat.categories
s.cat.ordered

Index(['a', 'b', 'c'], dtype='object')

False

In [17]:
# It’s also possible to pass in the categories in a specific order:
# New categorical data are not automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.
s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
s.cat.categories
s.cat.ordered
s.unique() # unique values in order of appearance, which differs from .cat.categories

Index(['c', 'b', 'a'], dtype='object')

False

[a, b, c]
Categories (3, object): [a, b, c]

#### Renaming categories
- assigning new values to the Series.cat.categories property 
- using the rename_categories() method
- passing a dict-like object to map the renaming

In [18]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
s.cat.categories = ["Group %s" % g for g in s.cat.categories]; s
s = s.cat.rename_categories([1, 2, 3]); s
s = s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'}); s
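# Be aware that assigning new categories is an in-place operation,
# while most other operations under Series.cat return a new Series of dtype category by default.
# Note: pandas 2.0 and later no longer allow assigning to s.cat.categories;
# on those versions, use rename_categories() instead.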

0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): [x, y, z]

#### Appending and removing categories

In [19]:
s = s.cat.add_categories([4])
s.cat.categories

Index(['x', 'y', 'z', 4], dtype='object')

In [20]:
s = s.cat.remove_categories([4])
s.cat.categories

Index(['x', 'y', 'z'], dtype='object')

Removing unused categories:

In [21]:
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
s.cat.remove_unused_categories()

0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]

#### Setting categories
If you want to remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use set_categories(). Values whose category is dropped become NaN:

In [22]:
s = pd.Series(["one", "two", "four", "-"], dtype="category")
s = s.cat.set_categories(["one", "two", "three", "four"])
s

0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): [one, two, three, four]

### Sorting and order

In [23]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
s.sort_values(inplace=True)
s
s.min(), s.max()

0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

('a', 'c')

In [24]:
# You can set categorical data to be ordered by using as_ordered() or unordered by using as_unordered(). 
# These will by default return a new object.
s.cat.as_ordered()
s.cat.as_unordered()

0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

Sorting will use the order defined by categories, not any lexical order present on the data type. This is even true for strings and numeric data:

In [25]:
s = pd.Series([1, 2, 3, 1], dtype="category")
s = s.cat.set_categories([2, 3, 1], ordered=True)
s
s.sort_values(inplace=True)
s
s.min(), s.max()

0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

(2, 1)

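Reordering the categories is possible via Categorical.reorder_categories() and Categorical.set_categories(). With reorder_categories(), all old categories must be included in the new categories and no new categories are allowed.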

In [26]:
s = pd.Series([1, 2, 3, 1], dtype="category")
s = s.cat.reorder_categories([2, 3, 1], ordered=True)
s
s.sort_values(inplace=True)
s
s.min(), s.max()

0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

(2, 1)

Multi Column Sorting:

A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns. The ordering of the categorical is determined by the categories of that column.

In [27]:
dfs = pd.DataFrame({'A': pd.Categorical(list('bbeebbaa'), categories=['e', 'a', 'b'], ordered=True), 'B': [1, 2, 1, 2, 2, 1, 2, 1]})
dfs.sort_values(by=['A', 'B'])

   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2


In [28]:
# Reordering the categories changes a future sort
dfs['A'] = dfs['A'].cat.reorder_categories(['a', 'b', 'e'])
dfs.sort_values(by=['A', 'B'])

   A  B
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
2  e  1
3  e  2


### Comparisons
Comparing categorical data with other objects is possible in three cases:

- Comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data.
- All comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and **the categories are the same.**
- All comparisons of a categorical data to a scalar.
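
A minimal sketch of the three cases, plus the failure mode when the categories differ, might look like the following. The names (`level_type`, `a`, `b`, `c`) are invented for this illustration:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

level_type = CategoricalDtype(categories=["low", "mid", "high"], ordered=True)
a = pd.Series(["low", "high", "mid"]).astype(level_type)
b = pd.Series(["mid", "mid", "mid"]).astype(level_type)

# Case 1: equality against a list-like of the same length.
eq_list = (a == ["low", "low", "mid"]).tolist()

# Case 2: ordered comparison between categoricals with identical categories.
gt_other = (a > b).tolist()

# Case 3: comparison against a scalar that is one of the categories.
gt_scalar = (a > "low").tolist()

# Comparing categoricals whose categories differ raises a TypeError.
other_type = CategoricalDtype(categories=["low", "high"], ordered=True)
c = pd.Series(["low", "high", "low"]).astype(other_type)
try:
    a > c
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

print(eq_list)    # [True, False, True]
print(gt_other)   # [False, True, False]
print(gt_scalar)  # [False, True, True]
print(type(mismatch_error).__name__)  # TypeError
```

The comparisons follow the declared order low < mid < high, not the alphabetical order of the strings.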

### Operations
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:

In [29]:
s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
s.value_counts()

c    2
b    1
a    1
d    0
dtype: int64

In [30]:
# Groupby will also show “unused” categories:
cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]) 
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
df.groupby("cats").mean()

      values
cats
a        1.0
b        2.0
c        4.0
d        NaN


In [31]:
cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame({"cats": cats2, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
df2.groupby(["cats", "B"]).mean()

        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
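
Whether groupby shows unused categories is controlled by its `observed` parameter. Since the default can differ between pandas versions, passing it explicitly is probably the safer choice. A small sketch with made-up data:

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 4, 3]})

# observed=False keeps every declared category, including the unused "d".
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True keeps only the categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(list(all_groups.index))   # ['a', 'b', 'c', 'd']
print(list(seen_groups.index))  # ['a', 'b', 'c']
```

The unused group "d" has no rows, so its mean is NaN in the first result.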


### Data munging
The optimized pandas data access methods .loc, .iloc, .at, and .iat, work as normal. The only difference is the return type (for getting) and that only values already in categories can be assigned.
#### getting

In [32]:
idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
values = [1, 2, 2, 2, 3, 4, 5]
df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
df.iloc[2:4, :]
df.loc["h":"j", "cats"]
df[df["cats"] == "b"]

  cats  values
j    b       2
k    b       2


h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): [a, b, c]

  cats  values
i    b       2
j    b       2
k    b       2


#### String and datetime accessors
The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:

In [33]:
str_s = pd.Series(list('aabb'))
str_cat = str_s.astype('category')
str_cat
str_cat.str.contains("a")

date_s = pd.Series(pd.date_range('1/1/2015', periods=5))
date_cat = date_s.astype('category')
date_cat

date_cat.dt.day

0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]

0     True
1     True
2    False
3    False
dtype: bool

0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

0    1
1    2
2    3
3    4
4    5
dtype: int64

#### setting
Setting values in a categorical column (or Series) works as long as the value is included in the categories; assigning any other value raises an error.
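
A minimal sketch of this behavior, assuming a small example Series. The exception type has differed between pandas versions, so the sketch catches both TypeError and ValueError:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Assigning an existing category works.
s.iloc[0] = "b"

# Assigning a value that is not among the categories raises an error.
try:
    s.iloc[1] = "z"
    error = None
except (TypeError, ValueError) as exc:
    error = exc

# Adding the category first makes the assignment valid.
s = s.cat.add_categories(["z"])
s.iloc[1] = "z"

print(type(error).__name__)
print(s.tolist())  # ['b', 'z', 'a']
```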

#### Merging
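You can concat two DataFrames containing categorical data together, but the categories of these categoricals need to be the same for the result to keep the category dtype. If they differ, recent pandas versions fall back to a non-categorical dtype (typically object).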
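
The following sketch shows both situations with made-up frames (`same_a`, `same_b`, `other`). It assumes a reasonably recent pandas, where mismatched categories give a non-categorical result rather than an error:

```python
import pandas as pd

same_a = pd.DataFrame({"cats": pd.Categorical(["a", "b"], categories=["a", "b"])})
same_b = pd.DataFrame({"cats": pd.Categorical(["b", "a"], categories=["a", "b"])})
other = pd.DataFrame({"cats": pd.Categorical(["b", "c"], categories=["b", "c"])})

# Identical categories: the result keeps the category dtype.
kept = pd.concat([same_a, same_b], ignore_index=True)

# Different categories: the result is no longer categorical.
lost = pd.concat([same_a, other], ignore_index=True)

print(kept["cats"].dtype)
print(lost["cats"].dtype)
```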

#### Unioning
If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function will combine a list-like of categoricals. The new categories will be the union of the categories being combined.
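
A short sketch of union_categoricals(), using two small made-up categoricals. By default the resulting categories follow their order of appearance in the data, and sort_categories=True sorts them instead:

```python
import pandas as pd
from pandas.api.types import union_categoricals

left = pd.Categorical(["b", "c"])
right = pd.Categorical(["a", "b"])

# Default: categories are ordered as they appear in the combined data.
combined = union_categoricals([left, right])

# sort_categories=True sorts the resulting categories.
combined_sorted = union_categoricals([left, right], sort_categories=True)

print(list(combined))                    # ['b', 'c', 'a', 'b']
print(list(combined.categories))         # ['b', 'c', 'a']
print(list(combined_sorted.categories))  # ['a', 'b', 'c']
```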
#### Concatenation
This section describes concatenations specific to category dtype. See Concatenating objects for general description.

### Missing Data

Missing values should not be included in the Categorical’s categories, only in the values.

Instead, it is understood that NaN is different, and is always a possibility. 

**When working with the Categorical’s codes, missing values will always have a code of -1.**

In [34]:
s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
s
s.cat.codes

0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): [a, b]

0    0
1    1
2   -1
3    0
dtype: int8

### Memory Usage
The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, the memory usage of an object dtype is a constant times the length of the data.
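
A rough way to see the difference is to compare memory_usage(deep=True) for the same data in both representations. This is a sketch with made-up data, and the exact byte counts depend on the pandas version and platform:

```python
import pandas as pd

# Many rows, only two distinct values: a good fit for a categorical.
obj = pd.Series(["foo", "bar"] * 1000)
cat = obj.astype("category")

obj_bytes = obj.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)

print(obj_bytes, cat_bytes)
print(cat_bytes < obj_bytes)  # True
```

The saving shrinks as the number of categories approaches the length of the data. In that situation a categorical can use about as much memory as the plain representation, or more.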