![rmotr](https://i.imgur.com/jiPp4hj.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39119047-31551228-46eb-11e8-9cfb-d6ca05a05fe4.jpeg"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Categorical Data and Sorting

Categorical data is a special data type: a field that can take only a limited number of distinct values. Examples include Sex (`M`, `F`) and football player positions (`GK`, `DF`, `MF`, `FW`).

Sometimes categorical data has an associated order ("Please rate our service: `Bad`, `Good`, `Excellent`"). Categories are important to statistical analysis, but they don't support arithmetic (you can't multiply categories, for example). Some categorical fields accept "empty values" (represented with `np.nan`), while others don't allow them.

To save memory space and speed up computations, categories are "coded". For example, Sex `M`, `F` can be represented as `0`, `1` internally.

![separator2](https://i.imgur.com/4gX5WFr.png)

## Hands on! 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Categorical data lives in a `Series` or in a `DataFrame` column (which is also a `Series`). To create a categorical `Series`, pass `dtype='category'`:

In [2]:
s = pd.Series(['M', 'F', 'F', 'M', 'F', 'M', 'M'], dtype='category')

In [3]:
s

0    M
1    F
2    F
3    M
4    F
5    M
6    M
dtype: category
Categories (2, object): [F, M]

The values we created are:

In [4]:
s.values

[M, F, F, M, F, M, M]
Categories (2, object): [F, M]

These values are drawn from the following categories:

In [5]:
s.values.categories

Index(['F', 'M'], dtype='object')

Internally, they are encoded as:

In [7]:
s.values.codes

array([1, 0, 0, 1, 0, 1, 1], dtype=int8)
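The introduction mentioned that categorical data may contain empty values and cannot be used in arithmetic. Here is a minimal sketch of both behaviours, using a made-up `Series` called `s_missing`. pandas does not add `NaN` to the categories; missing entries get the code `-1` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing answer.
s_missing = pd.Series(['M', 'F', np.nan, 'M'], dtype='category')

# NaN is not a category; missing entries get the sentinel code -1.
categories = list(s_missing.cat.categories)
codes = list(s_missing.cat.codes)
print(categories)
print(codes)

# Arithmetic is not defined for categorical data, so this raises TypeError.
try:
    s_missing * 2
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True
print(arithmetic_failed)
```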

![separator1](https://i.imgur.com/ZUWYTii.png)

### Advanced usage: order

As we mentioned, categories can be created with an inherent order. To do that, we need to manually create a `pandas.Categorical` object:

In [8]:
service_ratings = pd.Categorical([
    'Very satisfied',
    'Neither satisfied nor dissatisfied',
    'Very satisfied',
    'Somewhat satisfied',
    'Very dissatisfied',
    'Neither satisfied nor dissatisfied',
], categories=[
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
], ordered=True)

In [9]:
service_ratings.as_ordered()

[Very satisfied, Neither satisfied nor dissatisfied, Very satisfied, Somewhat satisfied, Very dissatisfied, Neither satisfied nor dissatisfied]
Categories (5, object): [Very dissatisfied < Somewhat dissatisfied < Neither satisfied nor dissatisfied < Somewhat satisfied < Very satisfied]

In [10]:
service_ratings.categories

Index(['Very dissatisfied', 'Somewhat dissatisfied',
       'Neither satisfied nor dissatisfied', 'Somewhat satisfied',
       'Very satisfied'],
      dtype='object')

In [11]:
service_ratings.codes

array([4, 2, 4, 3, 0, 2], dtype=int8)

In [12]:
np.asarray(service_ratings)  # get_values() was removed in pandas 1.0

array(['Very satisfied', 'Neither satisfied nor dissatisfied',
       'Very satisfied', 'Somewhat satisfied', 'Very dissatisfied',
       'Neither satisfied nor dissatisfied'], dtype=object)

The most common approach is to construct a series from the `pandas.Categorical` object and then use the `cat` accessor to reference the categorical info:

In [12]:
s = pd.Series(service_ratings)

In [13]:
s.cat.codes

0    4
1    2
2    4
3    3
4    0
5    2
dtype: int8

In [14]:
s.sort_values()

4                     Very dissatisfied
1    Neither satisfied nor dissatisfied
5    Neither satisfied nor dissatisfied
3                    Somewhat satisfied
0                        Very satisfied
2                        Very satisfied
dtype: category
Categories (5, object): [Very dissatisfied < Somewhat dissatisfied < Neither satisfied nor dissatisfied < Somewhat satisfied < Very satisfied]

You can also create a `Series` from the raw values and pass a `CategoricalDtype` object that describes the categories and their order:

In [15]:
from pandas.api.types import CategoricalDtype

In [16]:
s = pd.Series([
    'Very satisfied',
    'Neither satisfied nor dissatisfied',
    'Very satisfied',
    'Somewhat satisfied',
    'Very dissatisfied',
    'Neither satisfied nor dissatisfied',
], dtype=CategoricalDtype(categories=[
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
], ordered=True))

In [17]:
s

0                        Very satisfied
1    Neither satisfied nor dissatisfied
2                        Very satisfied
3                    Somewhat satisfied
4                     Very dissatisfied
5    Neither satisfied nor dissatisfied
dtype: category
Categories (5, object): [Very dissatisfied < Somewhat dissatisfied < Neither satisfied nor dissatisfied < Somewhat satisfied < Very satisfied]

In [18]:
s.sort_values()

4                     Very dissatisfied
1    Neither satisfied nor dissatisfied
5    Neither satisfied nor dissatisfied
3                    Somewhat satisfied
0                        Very satisfied
2                        Very satisfied
dtype: category
Categories (5, object): [Very dissatisfied < Somewhat dissatisfied < Neither satisfied nor dissatisfied < Somewhat satisfied < Very satisfied]

In [19]:
s.cat.codes

0    4
1    2
2    4
3    3
4    0
5    2
dtype: int8

In [20]:
s.cat.categories

Index(['Very dissatisfied', 'Somewhat dissatisfied',
       'Neither satisfied nor dissatisfied', 'Somewhat satisfied',
       'Very satisfied'],
      dtype='object')
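Besides sorting, an ordered categorical also supports comparisons and `min`/`max`. The sketch below rebuilds the survey data under the illustrative names `rating_type` and `ratings`, and shows that the same comparison is rejected once the order is dropped:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative names; the data mirrors the survey example above.
rating_type = CategoricalDtype(categories=[
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
], ordered=True)

ratings = pd.Series([
    'Very satisfied',
    'Neither satisfied nor dissatisfied',
    'Very satisfied',
    'Somewhat satisfied',
    'Very dissatisfied',
    'Neither satisfied nor dissatisfied',
], dtype=rating_type)

# min/max follow the category order, not alphabetical order.
lowest, highest = ratings.min(), ratings.max()
print(lowest, '|', highest)

# Element-wise comparison against one of the categories.
at_least_somewhat = (ratings >= 'Somewhat satisfied').tolist()
print(at_least_somewhat)

# Without an order, only equality comparisons are allowed.
unordered = ratings.cat.as_unordered()
try:
    unordered >= 'Somewhat satisfied'
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)
```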

![separator1](https://i.imgur.com/ZUWYTii.png)

### Categories in DataFrames

Categorical data in `DataFrame`s behaves in the same way. After all, each column is a `Series`.

In [21]:
df = pd.DataFrame({
    "Name": [
        "John",
        "Robert",
        "Jane",
        "Mary",
        "Rose"
    ],
    "Sex": [
        "M","M","F","F", "F"
    ],
    
})

In [22]:
df

     Name Sex
0    John   M
1  Robert   M
2    Jane   F
3    Mary   F
4    Rose   F


In [23]:
df['Sex'] = df['Sex'].astype('category')

In [24]:
df['Sex'].values

[M, M, F, F, F]
Categories (2, object): [F, M]

In [25]:
df['Sex'].values.categories

Index(['F', 'M'], dtype='object')

In [26]:
df['Sex'].value_counts()

F    3
M    2
Name: Sex, dtype: int64

In [27]:
df['Sex'].values.codes

array([1, 1, 0, 0, 0], dtype=int8)

Ordering also works in `DataFrame`s. Here we first reset the column to the `object` type:

In [28]:
df['Sex'] = df['Sex'].astype('object')

And then set the type again:

In [29]:
df['Sex'] = df['Sex'].astype(CategoricalDtype(['F', 'M'], ordered=True))

In [30]:
df['Sex'].cat.ordered

True

In [31]:
df.sort_values('Sex')

     Name Sex
2    Jane   F
3    Mary   F
4    Rose   F
0    John   M
1  Robert   M
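As far as we can tell, recent pandas versions can also convert a categorical column straight to another `CategoricalDtype`, with no detour through `object`. The sketch below does that with a hypothetical `people` frame and puts `M` before `F` to show that the sort follows the declared order rather than the alphabet:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical frame mirroring the example above.
people = pd.DataFrame({
    'Name': ['John', 'Robert', 'Jane', 'Mary', 'Rose'],
    'Sex': pd.Series(['M', 'M', 'F', 'F', 'F'], dtype='category'),
})

# Convert the unordered categorical directly to an ordered one: M < F.
people['Sex'] = people['Sex'].astype(CategoricalDtype(['M', 'F'], ordered=True))
is_ordered = people['Sex'].cat.ordered

# mergesort is stable, so rows with equal categories keep their original order.
sorted_names = people.sort_values('Sex', kind='mergesort')['Name'].tolist()
print(is_ordered)
print(sorted_names)
```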


![separator1](https://i.imgur.com/ZUWYTii.png)

### Encoding Categories

Storing categories by their codes is more space-efficient than storing their actual values. We can translate categories to codes and back. Suppose the results of a survey like the one in our previous example arrive as numeric codes:

In [32]:
results = pd.Series([4, 2, 3, 1, 2, 3, 3, 4])

In [33]:
catd = CategoricalDtype(categories=[
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
], ordered=True)

In [34]:
labels = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])

In [35]:
labels.take(results)

4                        Very satisfied
2    Neither satisfied nor dissatisfied
3                    Somewhat satisfied
1                 Somewhat dissatisfied
2    Neither satisfied nor dissatisfied
3                    Somewhat satisfied
3                    Somewhat satisfied
4                        Very satisfied
dtype: object

To create a `Categorical` object combining codes and labels you can use the `from_codes` class method:

In [36]:
pd.Series(
    pd.Categorical.from_codes(
        results, labels, ordered=True))

0                        Very satisfied
1    Neither satisfied nor dissatisfied
2                    Somewhat satisfied
3                 Somewhat dissatisfied
4    Neither satisfied nor dissatisfied
5                    Somewhat satisfied
6                    Somewhat satisfied
7                        Very satisfied
dtype: category
Categories (5, object): [Very dissatisfied < Somewhat dissatisfied < Neither satisfied nor dissatisfied < Somewhat satisfied < Very satisfied]
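`from_codes` also accepts a `dtype` argument, so a `CategoricalDtype` such as the `catd` defined earlier can supply both the labels and the order. A small sketch, with `codes`, `decoded` and `round_trip` as illustrative names, that decodes the survey codes and then recovers them through the `cat` accessor:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

catd = CategoricalDtype(categories=[
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
], ordered=True)

# Each code is a position in catd.categories (0-based).
codes = [4, 2, 3, 1, 2, 3, 3, 4]

# Codes -> labels.
decoded = pd.Series(pd.Categorical.from_codes(codes, dtype=catd))
print(decoded)

# Labels -> codes again.
round_trip = decoded.cat.codes.tolist()
print(round_trip)
```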

![separator1](https://i.imgur.com/ZUWYTii.png)

### Dummy Variables

Categorical data can also be "expanded" into what are called "dummy variables". This works by creating a new column for each possible value and marking each row's matching column with `1` and the others with `0`. Let's see an example:

In [37]:
df = pd.DataFrame({
    "Name": [
        "John",
        "Robert",
        "Jane",
        "Mary",
        "Rose"
    ],
    "Sex": pd.Series([
        "M","M","F","F", "F"
    ], dtype='category'),   
})

In [38]:
df

     Name Sex
0    John   M
1  Robert   M
2    Jane   F
3    Mary   F
4    Rose   F


In [39]:
pd.get_dummies(df['Sex'])

   F  M
0  0  1
1  0  1
2  1  0
3  1  0
4  1  0


In [40]:
pd.concat([df, pd.get_dummies(df['Sex'])], axis=1)

     Name Sex  F  M
0    John   M  0  1
1  Robert   M  0  1
2    Jane   F  1  0
3    Mary   F  1  0
4    Rose   F  1  0
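`get_dummies` can also do the expansion and the concatenation in one step when given the whole `DataFrame` and a `columns` list. The sketch below uses a hypothetical `people` frame and passes `dtype=int` explicitly. To our knowledge, newer pandas releases return boolean dummy columns by default, so the explicit `dtype` keeps the `0`/`1` encoding shown above.

```python
import pandas as pd

# Hypothetical frame mirroring the example above.
people = pd.DataFrame({
    'Name': ['John', 'Robert', 'Jane', 'Mary', 'Rose'],
    'Sex': pd.Series(['M', 'M', 'F', 'F', 'F'], dtype='category'),
})

# Replace the 'Sex' column with one 0/1 column per category.
# New columns are named <column>_<category>, e.g. Sex_F and Sex_M.
dummies = pd.get_dummies(people, columns=['Sex'], dtype=int)
print(dummies)
```

Unlike the `pd.concat` version, this drops the original `Sex` column from the result.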


![separator1](https://i.imgur.com/ZUWYTii.png)

### Comparing Memory Usage

It's easy to see the efficiency of categorical types. We'll create two `Series`: `s_cat` (containing a `Categorical` type) and `s_obj` (containing strings, stored as objects). Both will hold the same 1000 randomly generated values:

In [41]:
values = np.random.randint(5, size=1000)

In [42]:
labels = pd.Series([
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
])

In [43]:
s_cat = pd.Series(
    pd.Categorical.from_codes(
        values, labels, ordered=True))

In [44]:
s_obj = labels.take(values)

In [45]:
s_cat.value_counts()

Somewhat satisfied                    216
Neither satisfied nor dissatisfied    207
Very dissatisfied                     206
Somewhat dissatisfied                 190
Very satisfied                        181
dtype: int64

The total space taken by our `s_obj` series:

In [46]:
s_obj.nbytes

8000

The total space taken by our `s_cat` series:

In [47]:
s_cat.nbytes

1040

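`s_cat` is more than 7 times smaller in bytes (values stored). Note that for `s_obj`, `nbytes` counts only the 8-byte references to the strings, not the strings themselves. Total memory usage is smaller too: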

In [48]:
s_cat.memory_usage(index=False)

1200

In [49]:
s_obj.memory_usage(index=False)

8000
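For an object `Series`, the figures above count only the references to the strings. A rough sketch of a fuller comparison passes `deep=True` so that pandas also measures the string objects themselves. The names `rng`, `codes`, `deep_cat` and `deep_obj` are ours, and the exact byte counts depend on the Python and pandas versions, so none are quoted here.

```python
import numpy as np
import pandas as pd

labels = [
    'Very dissatisfied',
    'Somewhat dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Somewhat satisfied',
    'Very satisfied'
]

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(0)
codes = rng.integers(5, size=1000)

# Same 1000 values, stored as a categorical and as plain Python strings.
s_cat = pd.Series(
    pd.Categorical.from_codes(codes, categories=labels, ordered=True))
s_obj = pd.Series(np.array(labels, dtype=object)[codes], dtype=object)

# deep=True also counts the memory held by the string objects.
deep_cat = s_cat.memory_usage(index=False, deep=True)
deep_obj = s_obj.memory_usage(index=False, deep=True)
print(deep_cat, deep_obj)
```

With the strings included, the gap between the two representations should come out considerably wider than the `nbytes` comparison suggests.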

![separator2](https://i.imgur.com/4gX5WFr.png)