In [1]:
%pylab inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Populating the interactive namespace from numpy and matplotlib


Categoricals are a pandas data type that corresponds to categorical variables in statistics: variables that can take on only a **limited, and usually fixed, number of possible values** (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time, or ratings on Likert scales.

Unlike statistical categorical variables, categorical data may have **an order** (e.g. ‘strongly agree’ vs. ‘agree’, or ‘first observation’ vs. ‘second observation’). All values of categorical data are either **in categories or np.nan. Order is defined by the order of categories, not the lexical order of the values**.
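As a minimal sketch of the two rules above, the example below builds an ordered Categorical from hypothetical Likert answers (the level names and the stray value "maybe" are made up for illustration). It shows that a value outside the declared categories is stored as NaN, and that comparisons follow the category order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical Likert levels, listed from lowest to highest.
levels = ["disagree", "neutral", "agree", "strongly agree"]
answers = pd.Series(
    pd.Categorical(
        ["agree", "strongly agree", "maybe", "disagree"],
        categories=levels,
        ordered=True,
    )
)

# "maybe" is not a declared category, so it is stored as NaN.
print(answers.isna().tolist())          # [False, False, True, False]

# Comparisons use the category order: "agree" ranks above "neutral"
# here, although it comes first alphabetically.
print((answers > "neutral").tolist())   # [True, True, False, False]
```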

<span style="color:red;font-weight:bold">One reason to use a Categorical to represent symbols is performance</span>. When such data is stored as text, pandas uses the object dtype, which holds ordinary Python strings. This is a common culprit for slow code because object dtypes run at Python speed, not at pandas' normal C speed.

Pandas categoricals are a new and powerful feature that **encodes categorical data numerically so that we can leverage Pandas' fast C code on this kind of text data**.
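One way to see the numeric encoding at work is to compare memory footprints. The sketch below uses made-up data (100,000 strings drawn from three hypothetical labels) and measures memory only; it is not a timing benchmark, and the exact byte counts depend on the pandas and Python versions in use.

```python
import numpy as np
import pandas as pd

# Hypothetical data: many rows, only three distinct string values.
rng = np.random.default_rng(0)
text = pd.Series(rng.choice(["low", "medium", "high"], size=100_000), dtype=object)
cat = text.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them.
object_bytes = text.memory_usage(deep=True)
category_bytes = cat.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(cat.cat.codes.dtype)  # small integer codes stand in for the strings
```

With so few distinct values, the categorical column stores one small integer per row plus a single copy of each label, which is why it should come out much smaller than the object column.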

## Index
* [Object Creation](#Object-Creation)
    * [ordered categorical](#ordered-categorical)
* [Working with categories](#Working-with-categories)
* [Sorting and Order](#Sorting-and-Order)
    * [reorder](#reorder)

## Object Creation

In [2]:
# By specifying dtype="category" when constructing a Series:
s = pd.Series(["a","b","c","a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [3]:
# By converting an existing Series or column to a category dtype:
df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = df["A"].astype('category')
df

Unnamed: 0,A,B
0,a,a
1,b,b
2,c,c
3,a,a


In [4]:
# By binning numeric values with pd.cut(), which produces categorical data:
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(10)

Unnamed: 0,value,group
0,30,30 - 39
1,2,0 - 9
2,52,50 - 59
3,16,10 - 19
4,62,60 - 69
5,48,40 - 49
6,83,80 - 89
7,46,40 - 49
8,2,0 - 9
9,69,60 - 69


#### ordered categorical

In [5]:
# By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.
# New categorical data are NOT automatically ordered: ordered=False is the default, so passing it is optional.
# You must explicitly pass ordered=True to get an ordered Categorical.
# Values that are not in `categories` ("a" here) become NaN.
raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"], ordered=False)
raw_cat

[NaN, b, c, NaN]
Categories (3, object): [b, c, d]

In [6]:
# Note: pd.Categorical on its own is not a Series; pass it to pd.Series to get a categorical Series
s = pd.Series(raw_cat)
s

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

## Working with categories

Categorical data has a **categories** and an **ordered** property:
* **categories** lists the possible values. Use "**series.cat.categories**".
* **ordered** indicates whether the ordering matters. Use "**series.cat.ordered**".

If you don’t manually specify categories and ordering, they are inferred from the passed-in values.
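A small sketch of that inference, using an arbitrary list of letters: when nothing is specified, the categories are the sorted unique values and the result is unordered.

```python
import pandas as pd

# No categories or ordering given, so pandas infers both.
inferred = pd.Series(["b", "c", "a", "b"], dtype="category")

print(list(inferred.cat.categories))  # ['a', 'b', 'c']
print(inferred.cat.ordered)           # False
print(inferred.cat.codes.tolist())    # [1, 2, 0, 1]
```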

In [7]:
s = pd.Series(pd.Categorical(["a","b","c","a"], categories=["c","b","a"], ordered=True))
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [c < b < a]

In [8]:
s.cat.categories  # note that an Index is returned

Index([u'c', u'b', u'a'], dtype='object')

In [9]:
s.cat.ordered

True

In [10]:
s.cat.codes

0    2
1    1
2    0
3    2
dtype: int8
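The codes returned by `s.cat.codes` are integer positions into `s.cat.categories`, and missing values are encoded as -1. The sketch below rebuilds the same Series as above and decodes it by hand; the second Series, with the value "z", is a made-up example added only to show the -1 code.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"], ordered=True))

# Each code is the position of the value within the categories Index.
codes = s.cat.codes.to_numpy()
decoded = s.cat.categories.take(codes)
print(list(decoded))  # ['a', 'b', 'c', 'a']

# A value outside the categories becomes NaN, which is stored as code -1.
missing = pd.Series(pd.Categorical(["a", "z"], categories=["a"]))
print(missing.cat.codes.tolist())  # [0, -1]
```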

In [11]:
s.min(),s.max()

('c', 'a')

In [12]:
s.mode()

0    a
dtype: category
Categories (3, object): [c < b < a]

In [13]:
s.value_counts()

a    2
b    1
c    1
dtype: int64

## Sorting and Order

In [14]:
s_unorder = pd.Series(pd.Categorical(["a","b","c","b","c","b","a"], ordered=False))
s_unorder

0    a
1    b
2    c
3    b
4    c
5    b
6    a
dtype: category
Categories (3, object): [a, b, c]

In [15]:
s_unorder.sort_values()  # sorts by category order; the inferred categories happen to be in lexical order

0    a
6    a
1    b
3    b
5    b
2    c
4    c
dtype: category
Categories (3, object): [a, b, c]

In [16]:
# specify the order, and then set "ordered=True"
s_order = pd.Series(pd.Categorical(["a","b","c","b","c","b","a"], categories=["b","c","a"], ordered=True))
s_order

0    a
1    b
2    c
3    b
4    c
5    b
6    a
dtype: category
Categories (3, object): [b < c < a]

In [17]:
s_order.sort_values()

1    b
3    b
5    b
2    c
4    c
0    a
6    a
dtype: category
Categories (3, object): [b < c < a]

In [18]:
# switch from unordered to ordered: first, the unordered sort for comparison
s_unorder.sort_values()

0    a
6    a
1    b
3    b
5    b
2    c
4    c
dtype: category
Categories (3, object): [a, b, c]

In [19]:
new_s = s_unorder.cat.set_categories(["b","c","a"],ordered=True)
new_s.sort_values(inplace=True)
new_s

1    b
3    b
5    b
2    c
4    c
0    a
6    a
dtype: category
Categories (3, object): [b < c < a]

In [20]:
new_s.min(),new_s.max()

('b', 'a')

<a id="reorder"></a>
#### reorder
Reordering the categories is possible via two methods:
* Categorical.reorder_categories()
* Categorical.set_categories()

Both methods can set the order, but for Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. This necessarily makes the sort order the same as the categories order.
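The difference between the two methods can be sketched as follows. The list `["c", "b"]` deliberately leaves out "a" to trigger the restriction: `reorder_categories()` rejects it with a `ValueError`, while `set_categories()` accepts it and turns the values that lost their category into NaN.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# reorder_categories() requires exactly the existing categories.
try:
    s.cat.reorder_categories(["c", "b"], ordered=True)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# set_categories() accepts a different set; values without a category become NaN.
subset = s.cat.set_categories(["c", "b"], ordered=True)
print(subset.isna().tolist())       # [True, False, False, True]
print(list(subset.cat.categories))  # ['c', 'b']
```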

In [21]:
s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.reorder_categories(["c","b","a"], ordered=True)
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [c < b < a]

In [23]:
s.sort_values(inplace=True)
s

2    c
1    b
0    a
3    a
dtype: category
Categories (3, object): [c < b < a]