# Categorical Data in Pandas

<br>
<p style="color: green; line-height: 0.8;"><strong>&#9650; Memory Usage</strong> &#9650;</p>
<p style="color: green; line-height: 0.8;"><strong>&#9650; Performance</strong> &#9650;</p>
<p style="color: green; line-height: 0.8;"><strong>&#9650; Library Integration</strong> &#9650;</p>
<p style="color: green; line-height: 0.8;"><strong>&#9650; Added Functionality</strong> &#9650;</p>

In [1]:
import pandas as pd
import numpy as np

pd.Series(['a', 'b', 'b', 'a', 'c', 'c'], dtype='category')

0    a
1    b
2    b
3    a
4    c
5    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
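The output above hints at how a categorical series is stored: the unique values are kept once as categories, and each row holds only a small integer code pointing at one of them. A minimal sketch of inspecting both parts through the `.cat` accessor (the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["a", "b", "b", "a", "c", "c"], dtype="category")

# The unique values are stored once, as the categories
categories = list(s.cat.categories)

# Each row only stores an integer code that points into the categories
codes = s.cat.codes.tolist()

print(categories)         # ['a', 'b', 'c']
print(codes)              # [0, 1, 1, 0, 2, 2]
print(s.cat.codes.dtype)  # int8
```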

### Memory Usage 


In [2]:
df_size = 100_000

df1 = pd.DataFrame(
    {
        "float_1": np.random.rand(df_size),
        "food": np.random.choice(["pizza", "burger", "sushi", "tacos"], size=df_size),
    }
)

In [3]:
df1.memory_usage(deep=True)

Index          128
float_1     800000
food       6225064
dtype: int64

**100,000** floats use only **0.8 MB**, while the same number of strings uses about **6.2 MB**.

#### Now let's make the data categorical!

In [4]:
df1_cat = df1.astype({"food": "category"})

df1_cat.memory_usage(deep=True)

Index         128
float_1    800000
food       100421
dtype: int64

We reduced the memory usage of the `food` column from about **6.2 MB** down to about **0.1 MB** :D
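Where does the saving come from? As a rough sketch (using a freshly generated series rather than `df1`, with a fixed seed as an assumption for reproducibility), the categorical column needs roughly one byte per row for the integer codes plus a few hundred bytes for the four category strings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
food = pd.Series(rng.choice(["pizza", "burger", "sushi", "tacos"], size=100_000))
food_cat = food.astype("category")

# Total bytes for the plain string column and for the categorical column
string_bytes = food.memory_usage(deep=True, index=False)
category_bytes = food_cat.memory_usage(deep=True, index=False)

# The codes are int8 here, so they take one byte per row
codes_bytes = food_cat.cat.codes.memory_usage(index=False)

print(string_bytes, category_bytes, codes_bytes)
```

The difference between `category_bytes` and `codes_bytes` is the cost of storing the category labels once. The exact size of the string column depends on the pandas version and string backend.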


<br>

### Performance

In [5]:
%%timeit 
df1["food"].str.upper()

10.8 ms ± 49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%%timeit 
df1_cat["food"].str.upper()

675 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


<br>

### BUT

In [7]:
df1_cat["food"].str.upper().memory_usage(deep=True)

6225192

Memory usage went back to about **6.2 MB**,<br>
because the resulting series is no longer of type *categorical*.

In [8]:
str(df1_cat["food"].str.upper().dtype)

'object'

<br>

### Solution
We don't want to operate on the whole series but rather on the individual categories.

In [9]:
%%timeit 
df1_cat["food"].cat.rename_categories(str.upper)

51.1 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


This also makes the operation more than **10x** faster again compared to our previous approach (51.1 µs vs. 675 µs).

In [10]:
str(df1_cat["food"].cat.rename_categories(str.upper).dtype)

'category'

And our series keeps its *category* dtype.
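The two approaches can be compared side by side on a tiny series. This is only a sketch with made-up data. Both calls produce the same values, but only `rename_categories` keeps the categorical dtype:

```python
import pandas as pd

food = pd.Series(["pizza", "sushi", "pizza", "tacos"], dtype="category")

# Operates on every row and returns a plain (non-categorical) string series
via_str = food.str.upper()

# Operates on the three categories only and keeps the categorical dtype
via_cat = food.cat.rename_categories(str.upper)

print(via_cat.dtype)                         # category
print(list(via_cat.cat.categories))          # ['PIZZA', 'SUSHI', 'TACOS']
print(via_str.tolist() == via_cat.tolist())  # True
```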

There are a few more pitfalls that you might encounter when working with categorical series.<br>
[Here](https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a) you can find some information on how to handle these tricky situations.

<br>

## Ordered Categorical Data

In [11]:
s1 = pd.Series(['Low', 'Medium', 'High', 'Low', 'High'])

# Define the ordered categories and their order
ordered_categories = ['Low', 'Medium', 'High']

# Create an ordered categorical series
s1_ord = pd.Series(pd.Categorical(s1, categories=ordered_categories, ordered=True))

s1_ord

0       Low
1    Medium
2      High
3       Low
4      High
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']

In [12]:
s2_ord = pd.Series(pd.Categorical(['Medium', 'Medium', 'Low', 'High', 'High'], categories=['Low', 'Medium', 'High'], ordered=True))
s2_ord

0    Medium
1    Medium
2       Low
3      High
4      High
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']

<br>

#### Operations with ordered categorical data
With ordered categories we can use the comparison operators `<`, `>`, `<=`, `>=`<br>
and methods like `min()`, `max()`, and `sort_values()`, which all follow the category order rather than the alphabetical order.

In [13]:
s1_ord < s2_ord

0     True
1    False
2    False
3     True
4    False
dtype: bool

In [14]:
s1_ord.min(), s1_ord.max()

('Low', 'High')

In [15]:
s1_ord.sort_values()

0       Low
3       Low
1    Medium
2      High
4      High
dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']
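Comparisons also work against a single category, which is handy for filtering. The sketch below rebuilds a small ordered series (the names are illustrative) and also shows that the same comparison on an unordered categorical raises a `TypeError`:

```python
import pandas as pd

levels = ["Low", "Medium", "High"]
s_ord = pd.Series(
    pd.Categorical(["Low", "Medium", "High", "Low", "High"], categories=levels, ordered=True)
)

# Compare every element against one category, using the category order
at_least_medium = s_ord >= "Medium"
print(at_least_medium.tolist())  # [False, True, True, False, True]

# Without an order, only == and != are allowed
s_unordered = s_ord.cat.as_unordered()
try:
    s_unordered < "Medium"
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```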

<br>

## Discretizing continuous values (Binning)
Sometimes it makes sense to convert numeric data into categorical data. For example, for some problems the exact age of a person might not matter, but only whether the person is a minor or not. This conversion is called binning.

In [16]:
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head()

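|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |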


In [17]:
titanic['age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

In [18]:
titanic['age'].head(5)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

### `cut`

In [19]:
pd.cut(titanic['age'], bins=3).head(5)

0      (0.34, 26.947]
1    (26.947, 53.473]
2      (0.34, 26.947]
3    (26.947, 53.473]
4    (26.947, 53.473]
Name: age, dtype: category
Categories (3, interval[float64, right]): [(0.34, 26.947] < (26.947, 53.473] < (53.473, 80.0]]

When `bins` is an integer, `cut` splits the range of the data into that many intervals of equal width. We can also set the bin edges ourselves.

In [20]:
pd.cut(titanic['age'], bins=[0, 18, 67, 80], include_lowest=True).head(5)

0    (18.0, 67.0]
1    (18.0, 67.0]
2    (18.0, 67.0]
3    (18.0, 67.0]
4    (18.0, 67.0]
Name: age, dtype: category
Categories (3, interval[float64, right]): [(-0.001, 18.0] < (18.0, 67.0] < (67.0, 80.0]]

If you set the bin edges manually, be sure to cover the whole range, because values that do not fall into any interval are set to `NaN`.

In [21]:
pd.cut(titanic['age'], 
       bins=[64, 66, 67, 80],
       labels=['child', 'grown-up', 'senior']).head(5)

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: age, dtype: category
Categories (3, object): ['child' < 'grown-up' < 'senior']

In [23]:
titanic['Age_coarse'] = pd.cut(titanic['age'], bins=[0, 18, 67, 80], labels=['child', 'grown-up', 'senior'])
titanic['Age_coarse']

0      grown-up
1      grown-up
2      grown-up
3      grown-up
4      grown-up
         ...   
886    grown-up
887    grown-up
888         NaN
889    grown-up
890    grown-up
Name: Age_coarse, Length: 891, dtype: category
Categories (3, object): ['child' < 'grown-up' < 'senior']
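Coming back to the minor-or-not example from the start of this section: which bin the boundary value lands in depends on the `right` parameter. A minimal sketch with made-up ages, assuming 120 as an arbitrary upper bound:

```python
import pandas as pd

ages = pd.Series([0.5, 17, 18, 35, 80])
edges = [0, 18, 120]  # 120 is an assumed upper bound, not taken from the data
labels = ["minor", "adult"]

# Default right=True: intervals are (0, 18] and (18, 120], so 18 counts as "minor"
right_closed = pd.cut(ages, bins=edges, labels=labels)
print(right_closed.tolist())  # ['minor', 'minor', 'minor', 'adult', 'adult']

# right=False: intervals are [0, 18) and [18, 120), so 18 counts as "adult"
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)
print(left_closed.tolist())   # ['minor', 'minor', 'adult', 'adult', 'adult']
```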

### `qcut`
`qcut` cuts at quantiles, meaning it tries to create n bins that each contain roughly the same number of observations.

In [25]:
pd.qcut(titanic['age'], 4).head()

0    (20.125, 28.0]
1      (28.0, 38.0]
2    (20.125, 28.0]
3      (28.0, 38.0]
4      (28.0, 38.0]
Name: age, dtype: category
Categories (4, interval[float64, right]): [(0.419, 20.125] < (20.125, 28.0] < (28.0, 38.0] < (38.0, 80.0]]
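The difference between `cut` and `qcut` is easiest to see on skewed data. In the sketch below (made-up values with one outlier), `cut` produces bins of equal width, most of which stay empty, while `qcut` produces bins with equal counts:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Four bins of equal width across the range 1..100
equal_width = pd.cut(values, bins=4)
width_counts = equal_width.value_counts().sort_index().tolist()

# Four bins with (roughly) the same number of observations each
equal_count = pd.qcut(values, q=4)
count_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts)  # [7, 0, 0, 1]
print(count_counts)  # [2, 2, 2, 2]
```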

## Sources

[Using pandas categories properly is tricky, here’s why…](https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a)



[Here you can find an exercise](optional_exercises.ipynb#exe01)
<img src="pictures/optex1.png" width="50" style="float: right;"/>