## 11. Why and how to use 'category' data type?
The 'category' data type can reduce the memory a series consumes and speed up computations that use the series. It also gives us access to a number of category-handling methods. We will also learn about ordered categories.

### 11.1. Using 'category' data type to save memory and perform operations faster

We will use the 'drinks' dataset which contains alcohol consumption by country.

In [1]:
import pandas as pd

In [2]:
drinks = pd.read_csv("http://bit.ly/drinksbycountry")
drinks.head()

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
0  Afghanistan              0                0              0                           0.0      Asia
1      Albania             89              132             54                           4.9    Europe
2      Algeria             25                0             14                           0.7    Africa
3      Andorra            245              138            312                          12.4    Europe
4       Angola            217               57             45                           5.9    Africa


We will use the DataFrame’s ‘info( )’ method, which reports the number of non-null values in each series, the data type of each series, and the memory used by the DataFrame. Notice that, in my case, it says ‘memory usage: 9.2+ KB’. The ‘+’ means the DataFrame uses at least 9.2 KB; this is a lower bound, not the actual memory used. Although we have treated the ‘object’ data type as a string so far, an ‘object’ column can hold strings, lists, dictionaries, or any other Python objects. Pandas stores references to the actual data in an ‘object’ column and follows those references when it needs the data, for example to display it. By default, ‘info( )’ counts only the memory used by the references, not by the data they point to, so the reported figure is lower than the true memory usage of the DataFrame.

In [3]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


We can get the accurate figure by passing ‘memory_usage="deep"’ to the ‘info( )’ method. I got ‘memory usage: 30.5 KB’, which is more than three times the earlier value.

In [4]:
drinks.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 30.5 KB


We can also calculate the memory used with the ‘memory_usage( )’ method. With default parameters it returns a series giving the bytes used by each column; for object columns this counts only the references, not the data. Passing ‘deep=True’ asks pandas to report the real memory usage. Since the result is a series, we can get the total number of bytes with the ‘sum( )’ method. Notice that the total, 31224 bytes, matches the 30.5 KB reported by ‘info( )’ (31224 / 1024 ≈ 30.5).

In [5]:
drinks.memory_usage()

Index                            128
country                         1544
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                       1544
dtype: int64

In [6]:
drinks.memory_usage(deep=True)

Index                             128
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64

In [7]:
drinks.memory_usage(deep=True).sum()

31224
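The difference between the shallow and deep figures can be reproduced on a tiny frame. The following is a minimal sketch, assuming a made-up DataFrame called `toy` (not the drinks dataset); `dtype=object` is set explicitly so that the string column is stored as Python object references regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration.
# dtype=object forces the labels to be stored as references to Python strings.
toy = pd.DataFrame({
    "label": pd.Series(["Asia", "Europe", "Africa", "Europe"], dtype=object),
    "value": [1, 2, 3, 4],
})

shallow = toy.memory_usage()        # object column: size of the references only
deep = toy.memory_usage(deep=True)  # object column: references plus the strings

print(shallow)
print(deep)
```

The numeric column should report the same size in both results, while the object column should be larger in the deep result because the strings themselves are now counted.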

We can check the unique values in the ‘continent’ series with the ‘unique( )’ series method. It shows that the number of unique values, i.e. the number of continents (6), is very small compared to the number of rows (193). It is a good idea to change the data type of such columns to ‘category’ because this reduces the memory used by the DataFrame and also speeds up computations involving that column. So we will change the data type of the ‘continent’ column to ‘category’.

In [8]:
sorted(drinks.continent.unique())

['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']

In [9]:
drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

In [10]:
drinks["continent"] = drinks.continent.astype("category")

In [11]:
drinks.dtypes

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

When we display the converted series, notice that it says ‘dtype: category’ and lists all the categories. Behind the scenes, the values of a category column are stored as integers, and a lookup table maps each integer to its value. We can use the ‘codes’ attribute of the category accessor (‘.cat’, analogous to ‘.str’ for strings) to see the integer that each value corresponds to (e.g. Africa: 0, Asia: 1, Europe: 2).



In [12]:
drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]

In [13]:
drinks.continent.cat.codes.head()

0    1
1    2
2    0
3    2
4    0
dtype: int8
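To make the lookup-table idea concrete, here is a small sketch that rebuilds the original labels from the integer codes and the categories. The series `continents` is a made-up five-row example mirroring the first rows of the drinks data; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical five-row series, mirroring the first rows shown above.
continents = pd.Series(["Asia", "Europe", "Africa", "Europe", "Africa"]).astype("category")

codes = continents.cat.codes        # one small integer per row
lookup = continents.cat.categories  # lookup table: position -> label

# Index the lookup table with the codes to recover the labels.
rebuilt = lookup.take(codes.to_numpy())

print(list(codes))    # [1, 2, 0, 2, 0]
print(list(lookup))   # ['Africa', 'Asia', 'Europe']
print(list(rebuilt))  # ['Asia', 'Europe', 'Africa', 'Europe', 'Africa']
```

Each label is stored only once, in the lookup table, and the rows hold just the small integer codes.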

We will check the memory used by the DataFrame once again to see whether it was reduced. In my case, it dropped from 30.5 KB to 19.2 KB, and the ‘continent’ column alone shrank from 12332 bytes to 744 bytes.

In [14]:
drinks.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   country                       193 non-null    object  
 1   beer_servings                 193 non-null    int64   
 2   spirit_servings               193 non-null    int64   
 3   wine_servings                 193 non-null    int64   
 4   total_litres_of_pure_alcohol  193 non-null    float64 
 5   continent                     193 non-null    category
dtypes: category(1), float64(1), int64(3), object(1)
memory usage: 19.2 KB


In [15]:
drinks.memory_usage(deep=True)

Index                             128
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [16]:
drinks.memory_usage(deep=True).sum()

19636
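The effect on computation speed can be checked with a rough timing sketch like the one below. It uses synthetic data (200,000 rows, six labels) rather than the drinks dataset, and the column names `continent_obj`, `continent_cat`, and `beer` are assumptions made up for this example. Timings depend on the machine and pandas version, so no particular numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: many rows, few distinct labels.
rng = np.random.default_rng(0)
labels = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]
n_rows = 200_000
big = pd.DataFrame({
    "continent_obj": pd.Series(rng.choice(labels, size=n_rows), dtype=object),
    "beer": rng.integers(0, 400, size=n_rows),
})
big["continent_cat"] = big["continent_obj"].astype("category")

# The same aggregation on the object column and on the category column.
by_obj = big.groupby("continent_obj")["beer"].mean()
by_cat = big.groupby("continent_cat", observed=True)["beer"].mean()

t_obj = timeit.timeit(lambda: big.groupby("continent_obj")["beer"].mean(), number=5)
t_cat = timeit.timeit(
    lambda: big.groupby("continent_cat", observed=True)["beer"].mean(), number=5
)
print(f"object: {t_obj:.3f}s  category: {t_cat:.3f}s")
```

Both aggregations return the same group means; only the time taken to compute them may differ.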

Be aware that misusing the category data type can increase memory usage instead of reducing it. To test this, we will change the data type of the ‘country’ column to ‘category’. In my case, the memory used by the country column increased from 12588 bytes to 18094 bytes. This is expected because every value in the country column is unique: the lookup table has to store all 193 values, and the integer codes that reference them are stored on top of that, so the column takes more memory than the original object column.

In [17]:
drinks["country"] = drinks.country.astype("category")

In [18]:
drinks.memory_usage(deep=True)

Index                             128
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [19]:
drinks.country.cat.categories

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)
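One way to avoid this pitfall is to measure before converting. The following is a minimal sketch, assuming a hypothetical helper named `to_category_if_smaller` (it is not part of pandas) that keeps the conversion only when it actually reduces the deep memory usage. The two test series are made up for illustration.

```python
import pandas as pd


def to_category_if_smaller(series: pd.Series) -> pd.Series:
    """Hypothetical helper: convert to 'category' only when it saves memory."""
    as_cat = series.astype("category")
    if as_cat.memory_usage(deep=True) < series.memory_usage(deep=True):
        return as_cat
    return series


# Few distinct values repeated many times: a good candidate.
few_unique = pd.Series(["Asia", "Europe", "Africa"] * 100, dtype=object)
# Every value distinct, like the 'country' column: a poor candidate.
all_unique = pd.Series([f"country_{i}" for i in range(300)], dtype=object)

print(to_category_if_smaller(few_unique).dtype)  # category
print(to_category_if_smaller(all_unique).dtype)  # object
```

The helper converts the column twice in the worst case, which is acceptable for a one-off check but may be wasteful on very large columns.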

### 11.2. Using 'category' data type to highlight order of ordered categories

To learn about using the ‘category’ data type with ordered categories, we will create a small DataFrame called df. Notice that the categories ‘poor’, ‘good’, ‘very good’, ‘excellent’ have a natural order. If we sort the ‘quality’ column with the ‘sort_values( )’ method, pandas sorts it alphabetically and not by the level of quality the labels express.

 

In [20]:
df = pd.DataFrame({"ID":[100,101,102,103,104,105], "quality":["good", "poor", "very good", "good", "poor", "excellent"]})
df

    ID    quality
0  100       good
1  101       poor
2  102  very good
3  103       good
4  104       poor
5  105  excellent


In [21]:
df.sort_values("quality")

    ID    quality
5  105  excellent
0  100       good
3  103       good
1  101       poor
4  104       poor
2  102  very good


We can tell pandas about the ordering, which extends the operations available on the column. First, we import ‘CategoricalDtype’. We instantiate it with the list of categories in order and the ‘ordered=True’ parameter, and assign it to a variable, here ‘quality_cat’. We then change the data type of the ‘quality’ column, not to ‘category’, but to ‘quality_cat’. The ‘quality’ column is now an ordered category column. Notice that when we display the ‘quality’ series, the categories are listed in order, separated by ‘<’, instead of as a plain list.

In [22]:
from pandas.api.types import CategoricalDtype
quality_cat = CategoricalDtype(["poor", "good", "very good", "excellent"], ordered=True)
df["quality"] = df.quality.astype(quality_cat)
df.quality

0         good
1         poor
2    very good
3         good
4         poor
5    excellent
Name: quality, dtype: category
Categories (4, object): [poor < good < very good < excellent]

In [23]:
df.sort_values("quality")

    ID    quality
1  101       poor
4  104       poor
0  100       good
3  103       good
2  102  very good
5  105  excellent
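Sorting is not the only operation that respects the declared order. As a small sketch, the ‘min( )’ and ‘max( )’ methods on an ordered category series also follow it. The series `quality` below is a standalone copy of the column, built here so the example runs on its own.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quality_cat = CategoricalDtype(["poor", "good", "very good", "excellent"], ordered=True)

# Standalone copy of the 'quality' column, built only for this example.
quality = pd.Series(
    ["good", "poor", "very good", "good", "poor", "excellent"]
).astype(quality_cat)

print(quality.cat.ordered)  # True
print(quality.min())        # poor
print(quality.max())        # excellent
```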


One use of an ordered category is comparison. Since we have specified the order, pandas now compares the values according to that order and not alphabetically.

In [24]:
df.loc[df.quality > "poor", :]

    ID    quality
0  100       good
2  102  very good
3  103       good
5  105  excellent
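The sketch below shows the boolean mask behind this kind of filter, and contrasts it with an unordered category column, where pandas refuses greater-than and less-than comparisons by raising a TypeError. The names `ordered`, `unordered`, and `unordered_error` are assumptions made up for this example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

values = ["good", "poor", "very good", "good", "poor", "excellent"]

# Same data twice: once as a plain category, once as an ordered category.
unordered = pd.Series(values).astype("category")
ordered = pd.Series(values).astype(
    CategoricalDtype(["poor", "good", "very good", "excellent"], ordered=True)
)

# With an ordered category, '>' follows the declared order.
mask = ordered > "poor"
print(mask.tolist())  # [True, False, True, True, False, True]

# With an unordered category, only equality comparisons are allowed.
try:
    unordered > "poor"
    unordered_error = None
except TypeError as exc:
    unordered_error = exc
print(type(unordered_error).__name__)  # TypeError
```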
