In [1]:
import pandas as pd
import numpy as np
print(pd.__version__)

2.0.2


Categorical data classifies observations into a limited set of groups based on some attribute. It is also known as qualitative data. The categories may be unordered (nominal, such as blood type) or ordered (ordinal, such as a rating scale), but in either case arithmetic on them is not meaningful.

To facilitate computer processing, categorical data is often represented using numeric codes. For example, using 1 to represent "male" and 0 to represent "female". However, these numeric codes are just symbolic representations of the categories and do not imply any numerical relationships or differences between them.
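As a small illustration of this idea, pandas stores a categorical Series as integer codes plus a lookup table of categories. The sketch below uses a made-up `sex` Series that is not part of the original notebook. Note that pandas assigns codes by the sorted order of the inferred categories, so the numbers need not match the 1/0 convention mentioned above.

```python
import pandas as pd

# Hypothetical example data, used only for illustration
sex = pd.Series(["male", "female", "female", "male"], dtype="category")

# Lookup table of categories (sorted lexically when inferred from the data)
print(sex.cat.categories.tolist())   # ['female', 'male']

# Integer code of each element: its position in the lookup table
print(sex.cat.codes.tolist())        # [1, 0, 0, 1]
```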

Categorical data is commonly used in various fields such as demographics, market research, and data analysis. It allows for the analysis and comparison of different groups or categories based on their qualitative attributes rather than numerical values.

# Create Categorical Data
Categorical data in pandas provides a more efficient and effective way to represent and manipulate data with discrete categories. It can be created in a Series or DataFrame column using various methods in pandas.

## Series
To create a categorical object for blood types, we can use the `dtype = "category"` parameter when creating a Series or assigning values to a column in a DataFrame.

In [2]:
blood = pd.Series(data = ["A", "AB", np.nan, "AB", "O", "B"], name = "blood_type", dtype = "category")
blood

0      A
1     AB
2    NaN
3     AB
4      O
5      B
Name: blood_type, dtype: category
Categories (4, object): ['A', 'AB', 'B', 'O']

Use the `.astype('category')` method to convert a column or a Series to categorical data.

In [3]:
blood = pd.Series(data = ["A", "AB", np.nan, "AB", "O", "B"], name = "blood_type")
blood.astype('category')

0      A
1     AB
2    NaN
3     AB
4      O
5      B
Name: blood_type, dtype: category
Categories (4, object): ['A', 'AB', 'B', 'O']

Some pandas functions return categorical data directly:
* The `cut()` function: When binning numerical data into discrete intervals, `cut()` returns an ordered categorical whose categories are the bins (or the `labels` you pass).

* The `qcut()` function: Works like `cut()`, but bins by quantiles; the result is also an ordered categorical.

Note that `groupby()` and `value_counts()` do not convert a column to a categorical data type. They only carry the categorical dtype over to the resulting index when the input is already categorical.

In [4]:
bins = [0,20,30,40,100]
labels = ['Age below 21','From 21 to 30','From 31 to 40','Above 41']
age = pd.Series([18, 30, 35, 18, np.nan, 30, 37, 25], name = 'age')
pd.cut(age, bins = bins, labels = labels) 

0     Age below 21
1    From 21 to 30
2    From 31 to 40
3     Age below 21
4              NaN
5    From 21 to 30
6    From 31 to 40
7    From 21 to 30
Name: age, dtype: category
Categories (4, object): ['Age below 21' < 'From 21 to 30' < 'From 31 to 40' < 'Above 41']

## `pd.Categorical`
The `pandas.Categorical` class creates a categorical array, which can be used on its own or passed as the data of a Series or a DataFrame column.

In [5]:
pd.Categorical(["A", "AB", np.nan, "AB", "O", "B"])

['A', 'AB', NaN, 'AB', 'O', 'B']
Categories (4, object): ['A', 'AB', 'B', 'O']

`pd.Categorical` keeps missing values (`np.nan`) as missing; `NaN` is never a category itself.

When `categories` is passed explicitly, any value that is not in that list (here `"O"`) is also replaced with `NaN`.

In [6]:
categories = pd.Categorical(["A", "AB", np.nan, "AB", "O", "B"], 
                            categories = ["A", "B", "AB"])

print(categories)

['A', 'AB', NaN, 'AB', NaN, 'B']
Categories (3, object): ['A', 'B', 'AB']


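## DataFrame
Passing `dtype = "category"` to the DataFrame constructor converts every column to a categorical data type, while `astype('category')` on a selection converts only the chosen columns.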

In [7]:
data = {
    "blood": ["A", "AB","AB", "O", "B"],
    'sex': ['Male', 'Male', 'Female', 'Male', 'Male']
}
user_info = pd.DataFrame(data, dtype = "category")
user_info.dtypes

blood    category
sex      category
dtype: object

In [8]:
data = {
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)

# Convert specific columns to categorical data type
df[['Name', 'City']] = df[['Name', 'City']].astype('category')

print(df.dtypes)

Name    category
Age        int64
City    category
dtype: object


## `CategoricalDtype`
`CategoricalDtype` is a pandas data type object that can be created with the following parameters:

* `categories`: A sequence of unique values without missing values.

* `ordered`: A boolean value indicating whether the categories have a meaningful order. By default, it is set to False, meaning the categories are unordered.

`CategoricalDtype` is useful when you want to define the categories and order explicitly for a categorical data type column. It allows for better control and handling of categorical data in pandas.

In [9]:
pd.CategoricalDtype(['a', 'b', 'c'])

CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

In [10]:
pd.CategoricalDtype(['a', 'b', 'c'], ordered = True)

CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

`CategoricalDtype` can be specified in various places in pandas where you need to specify a data type, such as in `pandas.read_csv()`, `pandas.DataFrame.astype()`, or the Series constructor.
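A minimal sketch of two of these entry points follows. The CSV content, the `size` column, and the variable names are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, inlined so the example is self-contained
csv_text = "size\nsmall\nlarge\nmedium\nsmall\n"

size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# Option 1: parse the column as categorical while reading
df = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

# Option 2: pass the dtype to the Series constructor
s = pd.Series(["small", "large"], dtype=size_type)

print(df["size"].dtype == size_type)       # True
print(df["size"].min(), df["size"].max())  # small large
```

Defining the dtype once and reusing it keeps the categories and their order identical wherever the data enters the program.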

For convenience, when you want the default behavior of categories to be unordered and equal to the values set in the array, you can use the string 'category' as a shorthand for `CategoricalDtype()`. In other words, `dtype = 'category'` is equivalent to `dtype = CategoricalDtype()`.

When comparing two instances of `CategoricalDtype`, they are considered equal as long as they have the same categories and order. When comparing two unordered categories, the order of the categories is not considered.
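The equality rules can be checked directly. This is a short sketch with illustrative variable names:

```python
import pandas as pd

# Unordered dtypes: the order in which categories are listed does not matter
u1 = pd.CategoricalDtype(["a", "b", "c"], ordered=False)
u2 = pd.CategoricalDtype(["c", "b", "a"], ordered=False)
print(u1 == u2)  # True

# Ordered dtypes: the category order is part of the dtype
o1 = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
o2 = pd.CategoricalDtype(["c", "b", "a"], ordered=True)
print(o1 == o2)  # False

# An ordered dtype never equals an unordered one
print(u1 == o1)  # False
```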

### Controlling Behavior 

We used `dtype ='category'` to specify the categorical data type. In that case:

* The specific categories were inferred from the data.

* The specific categorical data had no order.

We can also use a `CategoricalDtype` instance to define the categories and their order explicitly. Values that are not among the listed categories (here `"a"`) become `NaN`:

In [11]:
s = pd.Series(["a", "b", "c", "a"])
cat_type = pd.CategoricalDtype(categories = ["b", "c", "d"], ordered = True)
s_cat = s.astype(cat_type)
s_cat

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

Similarly, `CategoricalDtype` can be used together with a DataFrame to ensure consistency of categories across all columns.

In [12]:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
cat_type = pd.CategoricalDtype(categories = list('abcd'),
                            ordered = True)
df_cat = df.astype(cat_type)
df_cat['A']

0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

### Retrieve the original Series
If you want to retrieve the original Series or NumPy array, you can use `Series.astype(original_dtype)` or `np.asarray(categorical)`. Note that `astype(str)` turns missing values into the string `'nan'`:

In [13]:
blood = pd.Series(data = ["A", "AB", np.nan, "AB", "O", "B"], dtype = "category")
blood

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['A', 'AB', 'B', 'O']

In [14]:
blood.astype(str)

0      A
1     AB
2    nan
3     AB
4      O
5      B
dtype: object

In [15]:
np.asarray(blood)

array(['A', 'AB', nan, 'AB', 'O', 'B'], dtype=object)
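If the missing values should stay missing, one option is to convert to `object` instead of `str`. This is a minimal sketch on the same blood-type values; the name `restored` is illustrative.

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# Converting to object keeps NaN as a real missing value
restored = blood.astype(object)
print(restored.dtype)         # object
print(restored.isna().sum())  # 1
```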

# The Use of Categorical Data 
##  Categorical Data Structure
A categorical variable consists of three components: the element values, the categories, and whether it is ordered or not. From the above, we can see that categorical variables created with the `cut` function are ordered by default, while those created with `dtype = "category"` are unordered. Now let's explore the attributes and methods available through the `.cat` accessor for viewing and manipulating categorical data:

In [16]:
[i for i in dir(blood.cat) if not i.startswith('_')]

['add_categories',
 'as_ordered',
 'as_unordered',
 'categories',
 'codes',
 'ordered',
 'remove_categories',
 'remove_unused_categories',
 'rename_categories',
 'reorder_categories',
 'set_categories']

### `describe()` method
The `describe()` method provides a summary of a categorical series, including the count of non-missing values, the number of unique element values (not the number of categories), the most frequently occurring element and its frequency.

In [17]:
blood

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['A', 'AB', 'B', 'O']

In [18]:
blood.describe()

count      5
unique     4
top       AB
freq       2
dtype: object

### `categories` property
To view the categories of a categorical series:

In [19]:
blood.cat.categories

Index(['A', 'AB', 'B', 'O'], dtype='object')

### `ordered` property
To check if a categorical series is ordered:

In [20]:
blood.cat.ordered

False

## Modify the categories
### `set_categories`
The `set_categories` method replaces the categories of a categorical series with a new list. Values whose category is not in the new list become `NaN`, and new categories that do not occur in the data (here `'RH'`) are added as unused categories. The method returns a new Series and leaves the original unchanged.

In [21]:
blood.cat.set_categories(['RH', 'A', 'O'])

0      A
1    NaN
2    NaN
3    NaN
4      O
5    NaN
dtype: category
Categories (3, object): ['RH', 'A', 'O']

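### `rename_categories`
The `rename_categories` method changes the category labels while keeping the values assigned to them. The new names can be given as a list with the same length as the current categories, or as a dictionary that maps old names to new ones.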

In [22]:
blood.cat.rename_categories(['new_%s'%i for i in blood.cat.categories])

0     new_A
1    new_AB
2       NaN
3    new_AB
4     new_O
5     new_B
dtype: category
Categories (4, object): ['new_A', 'new_AB', 'new_B', 'new_O']

In [23]:
# using a dictionary
blood.cat.rename_categories({"A": 'a', "B": 'b'})

0      a
1     AB
2    NaN
3     AB
4      O
5      b
dtype: category
Categories (4, object): ['a', 'AB', 'b', 'O']

## Add new categories
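### `add_categories()`
The `add_categories()` method appends new categories after the existing ones without changing any values. In the example below, `"RH"` is not among the categories when the Series is created, so it is stored as `NaN` and stays missing even after the category is added.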

In [24]:
blood = pd.Series(data=pd.Categorical(["A", "AB", "RH", "AB", "O", "B"], categories=['A', 'B', 'AB', 'O']))
blood

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['A', 'B', 'AB', 'O']

In [25]:
blood.cat.add_categories(['RH'])

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (5, object): ['A', 'B', 'AB', 'O', 'RH']
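One practical reason to add a category is that pandas rejects assignments of values that are not yet categories. The sketch below works on a `pd.Categorical` directly; the name `blood_cat` is illustrative, and the exception type is caught broadly because it is assumed to differ between pandas versions.

```python
import pandas as pd

blood_cat = pd.Categorical(["A", "AB", "O"], categories=["A", "B", "AB", "O"])

# Assigning a value that is not a category is rejected
rejected = False
try:
    blood_cat[0] = "RH"
except (TypeError, ValueError) as exc:  # TypeError in recent pandas versions
    rejected = True
    print("rejected:", exc)

# After adding the category, the same assignment works
blood_cat = blood_cat.add_categories(["RH"])
blood_cat[0] = "RH"
print(list(blood_cat))  # ['RH', 'AB', 'O']
```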

## Delete the categories
### `remove_categories`

In [26]:
blood.cat.remove_categories(['A'])

0    NaN
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (3, object): ['AB', 'B', 'O']

### `remove_unused_categories()`
To remove the categories that do not have any corresponding values in a categorical variable, we can use the `remove_unused_categories()` method:

In [27]:
blood = pd.Series(data = pd.Categorical(["AB", "AB", "O", "B"], categories = ['A', 'B', 'AB', 'O']))
blood

0    AB
1    AB
2     O
3     B
dtype: category
Categories (4, object): ['A', 'B', 'AB', 'O']

In [28]:
blood.cat.remove_unused_categories()

0    AB
1    AB
2     O
3     B
dtype: category
Categories (3, object): ['B', 'AB', 'O']
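A common situation where unused categories appear is filtering: the rows are gone, but the categories remain. A minimal sketch with illustrative data:

```python
import pandas as pd

blood = pd.Series(["A", "B", "O", "A"], dtype="category")

# Filtering removes the rows but keeps all three categories
subset = blood[blood != "O"]
print(subset.cat.categories.tolist())   # ['A', 'B', 'O']

# Dropping the unused category leaves only those still present
trimmed = subset.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())  # ['A', 'B']
```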

## Order

Categorical data is unordered by default. To make the categories ordered, we must explicitly pass `ordered=True` or call `as_ordered()`.



In [29]:
# view the order of categories in categorical data
blood = pd.Series(data = pd.Categorical(["A", "AB", "RH", "AB", "O", "B"], categories = ['A', 'B', 'AB', 'O']))
blood.cat.categories

Index(['A', 'B', 'AB', 'O'], dtype='object')

In [30]:
blood.cat.ordered

False

### `as_ordered`
To convert an unordered categorical into an ordered one while keeping the current category order, we can use the `as_ordered` method; `as_unordered` reverses the conversion.


In [31]:
blood.cat.as_ordered()

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['A' < 'B' < 'AB' < 'O']

In [32]:
blood.cat.as_ordered().cat.ordered

True

In [33]:
blood.cat.as_ordered().cat.as_unordered()

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['A', 'B', 'AB', 'O']

### `set_categories`
The `set_categories` method can also set the order at the same time: it defines the categories explicitly and, with `ordered = True`, marks them as ordered. It returns a new object rather than modifying the categorical variable in place.

In [34]:
blood.cat.set_categories(['RH', 'A', 'O'], ordered = True) # drops 'B' and 'AB', adds 'RH'

0      A
1    NaN
2    NaN
3    NaN
4      O
5    NaN
dtype: category
Categories (3, object): ['RH' < 'A' < 'O']

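### `reorder_categories`
The `reorder_categories` method changes the order of the existing categories. The new list must contain exactly the same categories as the old one.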

In [35]:
blood

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['A', 'B', 'AB', 'O']

In [36]:
blood.cat.reorder_categories(['AB', 'A', 'B', 'O'], ordered = True)

0      A
1     AB
2    NaN
3     AB
4      O
5      B
dtype: category
Categories (4, object): ['AB' < 'A' < 'B' < 'O']
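The following sketch shows what the new order affects and what happens when the list does not match the existing categories. The chosen order and the variable names are illustrative.

```python
import pandas as pd

blood = pd.Series(pd.Categorical(["A", "AB", "O", "B"], categories=["A", "B", "AB", "O"]))

# Same categories, new order: allowed
reordered = blood.cat.reorder_categories(["O", "A", "B", "AB"], ordered=True)
print(reordered.min(), reordered.max())  # O AB

# Dropping a category is not a reordering, so pandas raises ValueError
error_raised = False
try:
    blood.cat.reorder_categories(["A", "B", "AB"])
except ValueError as exc:
    error_raised = True
    print("ValueError:", exc)
```

To change the set of categories rather than only their order, `set_categories` is the method to use.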

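## Sorting
Sorting categorical data uses the order of the categories, not the lexical order of the values. In the example below the two happen to coincide, because the reversed list passed to `set_categories` is already in alphabetical order.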


In [37]:
np.random.seed(1234)
s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
s.head(15)

0         bad
1       awful
2       awful
3     perfect
4        good
5        good
6        good
7        fair
8         bad
9       awful
10      awful
11       fair
12       fair
13    perfect
14    perfect
dtype: category
Categories (5, object): ['awful', 'bad', 'fair', 'good', 'perfect']

In [38]:
s1 = s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True)
s1.head(15)

0         bad
1       awful
2       awful
3     perfect
4        good
5        good
6        good
7        fair
8         bad
9       awful
10      awful
11       fair
12       fair
13    perfect
14    perfect
dtype: category
Categories (5, object): ['awful' < 'bad' < 'fair' < 'good' < 'perfect']

In [39]:
s.sort_values(ascending = False).head()

16    perfect
13    perfect
47    perfect
3     perfect
29    perfect
dtype: category
Categories (5, object): ['awful', 'bad', 'fair', 'good', 'perfect']

In [40]:
df_sort = pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
df_sort.head()

            value
cat
bad      1.014849
awful   -0.557025
awful   -0.424606
perfect  0.137496
good    -0.070513


In [41]:
df_sort.sort_index().head()

          value
cat
awful  0.231711
awful -0.642475
awful  1.804172
awful  0.571988
awful  1.413138
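The effect of category order on sorting is easier to see when it differs from alphabetical order. A minimal sketch with hypothetical size labels:

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Plain strings sort alphabetically
print(sizes.sort_values().tolist())
# ['large', 'medium', 'small', 'small']

# An ordered categorical sorts by the declared category order
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_type)
print(sizes_cat.sort_values().tolist())
# ['small', 'small', 'medium', 'large']
```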


## Comparisons
Categorical data can be compared with other objects in three cases:

* Equality (`==` and `!=`) with list-like objects (lists, Series, arrays) that have the same length as the categorical data.
* All comparisons (`==`, `!=`, `>`, `>=`, `<`, `<=`) between categorical data and another categorical Series when `ordered == True` and the categories are the same.
* All comparisons between categorical data and a scalar value. Ordering comparisons (`>`, `>=`, `<`, `<=`) additionally require the data to be ordered and the scalar to be one of the categories.

All other comparisons raise a `TypeError`, in particular ordering comparisons between two categoricals with different categories, or between a categorical and any list-like object such as a Series, `np.array`, or list. pandas refuses these because it cannot tell whether the comparison should follow the custom category order or the natural order of the values.

### Compared with scalar values or objects 
Categorical data can be compared for equality with a scalar value or with a list-like object of the same length (e.g., lists, Series, arrays):

In [42]:
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'

0     True
1    False
2    False
3     True
dtype: bool

In [43]:
s == list('abcd')

0     True
1    False
2     True
3    False
dtype: bool

### Comparing two categorical variables

When comparing two categorical variables, the following conditions apply:

* Both categorical variables must have the same categories; for ordered variables, the categories must also be in the same order.
* Equality (`==`) and inequality (`!=`) work for both ordered and unordered categoricals.
* Greater than (`>`), greater than or equal to (`>=`), less than (`<`), and less than or equal to (`<=`) additionally require both variables to be ordered (`ordered = True`).

In [44]:
categories1 = pd.Categorical(['A', 'B', 'C'], categories = ['A', 'B', 'C'], ordered = True)
categories2 = pd.Categorical(['B', 'B', 'C'], categories = ['A', 'B', 'C'], ordered = True)


print(categories1 == categories2)  
print(categories1 != categories2)  
print(categories1 > categories2)  
print(categories1 >= categories2) 
print(categories1 < categories2)   
print(categories1 <= categories2) 

[False  True  True]
[ True False False]
[False False False]
[False  True  True]
[ True False False]
[ True  True  True]
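The cases that raise a `TypeError` can be sketched as follows. The `levels` list and the helper `raises_type_error` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

levels = ["low", "mid", "high"]
a = pd.Series(pd.Categorical(["low", "high"], categories=levels, ordered=True))
b = pd.Series(pd.Categorical(["low", "high"], categories=levels[::-1], ordered=True))
u = a.cat.as_unordered()

# Ordered categorical vs. a scalar that is one of the categories: allowed
print((a > "low").tolist())  # [False, True]

def raises_type_error(compare):
    """Return True if calling compare() raises TypeError."""
    try:
        compare()
    except TypeError:
        return True
    return False

# Same categories but a different order: not comparable
print(raises_type_error(lambda: a > b))      # True
# Unordered categoricals support only == and !=
print(raises_type_error(lambda: u > "low"))  # True
```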


# Why Use Categorical Variables?


The use of categorical data offers several advantages over other data types, especially when dealing with large datasets. Here are some reasons why categorical data is beneficial:

* **Reduced memory usage**: Categorical data is stored internally as an array of integer codes plus a single copy of each category. This usually takes far less memory than the object data type, where every element stores its own string. The memory usage of categorical data is proportional to the number of categories plus the length of the data, so the saving is largest when there are few categories relative to the number of rows. If almost every value is unique, a categorical can use as much memory as the object data type, or more.

* **Faster performance**: Because operations such as grouping, sorting, and comparison can work on the small integer codes instead of on strings, they are often faster than on object columns. This is particularly advantageous when working with large datasets or performing repetitive operations.

* **Improved functionality**: Categorical data has built-in functionality specifically designed for categorical variables. This includes methods for sorting, ordering, and comparing categories, as well as handling missing values in a consistent manner. Categorical data also supports efficient groupby operations and facilitates data analysis and exploration.

* **Encoding and storage efficiency**: Categorical data can be written and read back without losing the categorical information in formats that store the data type, such as pickle or HDF5. Text formats such as CSV store only the values, so a column read from CSV has to be converted back to categorical.

Overall, the use of categorical data helps optimize memory usage, enhances computational efficiency, and provides specialized functionality for handling categorical variables. It is particularly useful when working with large datasets or when memory constraints are a concern.
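The memory claim can be checked with `memory_usage(deep=True)`. This is a rough sketch on made-up data with only three distinct values; the exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

# Hypothetical data: 30,000 strings drawn from only three distinct values
colors_obj = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)
colors_cat = colors_obj.astype("category")

# deep=True also counts the memory held by the Python string objects
obj_bytes = colors_obj.memory_usage(deep=True)
cat_bytes = colors_cat.memory_usage(deep=True)

print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
print(cat_bytes < obj_bytes)  # True
```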
