## Chapter 5: Categorical data

**Categoricals** are a pandas data type corresponding to categorical variables in statistics: variables
that can take on only a limited, and usually fixed, number of possible values. Examples include gender, social class, blood type, and country affiliation.

Example: pandas Series.

In [1]:
import pandas as pd

In [2]:
s = pd.Series(["a","b","c","a","c"], dtype="category")

In [3]:
s

0    a
1    b
2    c
3    a
4    c
dtype: category
Categories (3, object): [a, b, c]
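
To see what the category dtype stores, the following sketch inspects the same Series through the `.cat` accessor: the distinct categories, and the integer code that stands in for each element. The Series is rebuilt so the snippet runs on its own; the names `categories` and `codes` are chosen here for illustration only.

```python
import pandas as pd

# Same data as the Series example above.
s = pd.Series(["a", "b", "c", "a", "c"], dtype="category")

# The distinct values, stored once.
categories = list(s.cat.categories)

# One small integer per element, indexing into the categories.
codes = s.cat.codes.tolist()

print(categories)  # ['a', 'b', 'c']
print(codes)       # [0, 1, 2, 0, 2]
```

Each element is held as a code pointing into the list of categories, so a repeated value such as "a" is stored as a string only once.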

Example: pandas DataFrame, converting a column with `astype` or with `pd.Categorical`.

In [4]:
df = pd.DataFrame({"A":["a","b","c","a", "c"]})

In [5]:
df["B"] = df["A"].astype('category')

In [6]:
df["C"] = pd.Categorical(df["A"])

In [7]:
df

   A  B  C
0  a  a  a
1  b  b  b
2  c  c  c
3  a  a  a
4  c  c  c


In [8]:
df.dtypes

A      object
B    category
C    category
dtype: object
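
Columns B and C are built in two different ways, so it may be worth confirming that they end up identical. The sketch below is one way to check this, assuming default settings (categories inferred from the data, unordered); the names `same_values` and `same_dtype` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a", "c"]})
df["B"] = df["A"].astype("category")   # convert an existing column
df["C"] = pd.Categorical(df["A"])      # build a Categorical directly

# Series.equals compares both the values and the dtype.
same_values = df["B"].equals(df["C"])

# Two categorical dtypes are equal when their categories and ordering match.
same_dtype = df["B"].dtype == df["C"].dtype

print(same_values, same_dtype)  # True True
```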

Example: creating a large random DataFrame of categoricals.

In [9]:
import numpy as np

In [10]:
df = pd.DataFrame(np.random.choice(['foo','bar','baz'], size=(100000,3)))
df = df.astype('category')

In [11]:
df.head()

     0    1    2
0  bar  foo  bar
1  baz  foo  bar
2  baz  foo  foo
3  foo  bar  foo
4  foo  bar  foo


In [12]:
df.dtypes

0    category
1    category
2    category
dtype: object

In [13]:
df.shape

(100000, 3)
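
A common reason to convert a large, repetitive DataFrame like this one to categoricals is memory. The sketch below compares the footprint of the raw string columns with the categorical version. It is an illustration rather than a benchmark: it uses a seeded `np.random.default_rng` instead of `np.random.choice` so the data is reproducible, the names `raw`, `cat`, `raw_bytes` and `cat_bytes` are made up for this example, and the exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, used here only for reproducibility.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.choice(["foo", "bar", "baz"], size=(100_000, 3)))

# Convert every column to the category dtype.
cat = raw.astype("category")

# deep=True counts the string contents, not just the pointers to them.
raw_bytes = int(raw.memory_usage(deep=True).sum())
cat_bytes = int(cat.memory_usage(deep=True).sum())

print(f"strings:     {raw_bytes:,} bytes")
print(f"categorical: {cat_bytes:,} bytes")
```

With only three distinct values per column, each categorical column needs one small integer code per row plus a single copy of the three strings, which is why the second figure should be much smaller.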