<div style="text-align:center; background-color:#4CAF50; color:white; padding:15px; border-radius:10px;">
<h1>🐼 Working with Categorical Data in Pandas</h1>
</div>

### ✨ Benefits of Categorical Data:
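- **Memory Efficiency**: Each distinct value is stored once, and the column itself holds small integer codes, which reduces memory usage when values repeat often.
- **Faster Performance**: Sorting, grouping, and filtering can be faster because they operate on the integer codes rather than on strings.
- **Logical Comparisons**: With `ordered=True`, enables ordered comparisons (e.g., "small" < "medium" < "large").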
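
The memory benefit can be checked directly. The following is a minimal sketch, assuming a hypothetical Series with many rows but only three distinct labels; the exact byte counts depend on the pandas version and platform, so only the relative sizes matter.

```python
import pandas as pd

# Hypothetical data: 30,000 rows drawn from only three distinct labels
as_object = pd.Series(['small', 'medium', 'large'] * 10_000, dtype=object)
as_category = as_object.astype('category')

# deep=True also counts the memory held by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, so a column of mostly unique strings gains little from conversion.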

In [10]:
import pandas as pd

<div style="background-color:#2196F3; color:white; padding:10px; border-radius:10px;">
<h2>🌟 1. Creating Categorical Data</h2>
</div>

You can create categorical data either by directly defining categories or by converting existing columns to a `category` data type.

In [11]:
# Creating a Categorical array
data = ['Male', 'Female', 'Female', 'Male', 'Other']
categories = pd.Categorical(data)
print(categories)

# Creating a DataFrame with a categorical column
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Gender': pd.Categorical(['Female', 'Male', 'Male', 'Male', 'Female'])
})
print(df)

['Male', 'Female', 'Female', 'Male', 'Other']
Categories (3, object): ['Female', 'Male', 'Other']
      Name  Gender
0    Alice  Female
1      Bob    Male
2  Charlie    Male
3    David    Male
4      Eva  Female


In [12]:
# An existing column can also be converted with .astype('category')

<div style="background-color:#FF9800; color:white; padding:10px; border-radius:10px;">
<h2>🔍 2. Checking for Categorical Data</h2>
</div>

Check if a column is of type `category` using the `.dtype` attribute.

In [13]:
# Check the data type
print(df['Gender'].dtype)

category
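
Printing the dtype is convenient interactively; in code, a programmatic check is often more useful. This is a small sketch on a hypothetical two-row frame (`df_demo` is an illustrative name, not part of the notebook above), using `isinstance` against `pd.CategoricalDtype` and `select_dtypes` to list every categorical column.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Gender': pd.Categorical(['Female', 'Male']),
})

# True when the column's dtype is categorical
is_cat = isinstance(df_demo['Gender'].dtype, pd.CategoricalDtype)

# Names of all categorical columns in the frame
cat_columns = df_demo.select_dtypes(include='category').columns.tolist()

print(is_cat)
print(cat_columns)
```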


<div style="background-color:#673AB7; color:white; padding:10px; border-radius:10px;">
<h2>🛠️ 3. Adding Categories and Managing Levels</h2>
</div>

You can customize categories by adding, removing, or reordering them.

In [14]:
# Setting categories and order
categories = pd.Categorical(data, categories=['Male', 'Female', 'Other'], ordered=True)
print(categories)

# Adding new categories
categories = categories.add_categories(['Non-binary'])
print(categories)

# Removing a category
categories = categories.remove_categories(['Other'])
print(categories)

['Male', 'Female', 'Female', 'Male', 'Other']
Categories (3, object): ['Male' < 'Female' < 'Other']
['Male', 'Female', 'Female', 'Male', 'Other']
Categories (4, object): ['Male' < 'Female' < 'Other' < 'Non-binary']
['Male', 'Female', 'Female', 'Male', NaN]
Categories (3, object): ['Male' < 'Female' < 'Non-binary']
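
The cell above adds and removes categories but does not reorder them. As a sketch of that remaining case, assuming a hypothetical `sizes_demo` array, `reorder_categories` changes the order of the levels without touching the values, and `remove_unused_categories` drops levels that never occur.

```python
import pandas as pd

# Hypothetical array whose levels start in an unhelpful order, with one unused level
sizes_demo = pd.Categorical(
    ['Large', 'Small', 'Medium'],
    categories=['Large', 'Medium', 'Small', 'XL'],
    ordered=True,
)

# Reorder the levels; the new list must contain exactly the existing categories
reordered = sizes_demo.reorder_categories(
    ['Small', 'Medium', 'Large', 'XL'], ordered=True
)

# Drop levels that do not appear in the data ('XL' here)
trimmed = reordered.remove_unused_categories()

print(list(trimmed.categories))
print(list(trimmed))
```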


<div style="background-color:#E91E63; color:white; padding:10px; border-radius:10px;">
<h2>📋 4. Sorting and Comparing Categorical Data</h2>
</div>

Ordered categories allow for sorting and logical comparisons.

In [15]:
# Ordered categorical data
sizes = pd.Categorical(['Small', 'Large', 'Medium'], categories=['Small', 'Medium', 'Large'], ordered=True)

# Sorting
print(sizes.sort_values())

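# Comparing
# Note: indexing returns plain Python strings, so this compares 'Small' < 'Large'
# alphabetically (hence False) rather than by category order
print(sizes[0] < sizes[1])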

['Small', 'Medium', 'Large']
Categories (3, object): ['Small' < 'Medium' < 'Large']
False
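
The `False` above comes from comparing two plain strings, not from the category order. To compare by category order, keep the comparison on the categorical object itself. The sketch below uses a hypothetical `sizes_demo` array with the same values; comparing the whole array against a category label is element-wise, and `min`/`max` on a Series respect the declared order.

```python
import pandas as pd

sizes_demo = pd.Categorical(
    ['Small', 'Large', 'Medium'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
)

# Element-wise comparison that uses the category order
below_large = sizes_demo < 'Large'

# min/max on an ordered categorical Series follow the category order
s = pd.Series(sizes_demo)
smallest, largest = s.min(), s.max()

# Scalar indexing yields Python strings, so this is an alphabetical comparison
string_compare = sizes_demo[0] < sizes_demo[1]

print(below_large)
print(smallest, largest)
print(string_compare)
```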


<div style="background-color:#3F51B5; color:white; padding:10px; border-radius:10px;">
<h2>🧮 5. Encoding and Decoding Categorical Data</h2>
</div>

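Each category is represented by an integer code internally. You can access both the codes and the category names. A code of -1 marks a missing value, such as the entry whose category ('Other') was removed in section 3.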

In [16]:
# Accessing codes
print(categories.codes)

# Accessing categories
print(categories.categories)

[ 0  1  1  0 -1]
Index(['Male', 'Female', 'Non-binary'], dtype='object')
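
The cell above covers encoding; decoding goes the other way. A minimal sketch, assuming a hypothetical `cat_demo` array that mirrors the one above: `pd.Categorical.from_codes` rebuilds the labels from the integer codes and the category list, treating -1 as a missing value.

```python
import pandas as pd

# Hypothetical array mirroring the example above, with one missing entry
cat_demo = pd.Categorical(
    ['Male', 'Female', 'Female', 'Male', None],
    categories=['Male', 'Female', 'Non-binary'],
)

codes = cat_demo.codes  # integer codes; -1 marks the missing entry

# Decode: map the codes back to labels using the same category list
rebuilt = pd.Categorical.from_codes(codes, categories=cat_demo.categories)

print(codes)
print(rebuilt)
```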


<div style="background-color:#009688; color:white; padding:10px; border-radius:10px;">
<h2>🔄 6. Converting Columns to Categorical</h2>
</div>

Convert an existing column to a categorical data type for better memory and computational efficiency.

In [17]:
# Convert an existing column to categorical
# (Gender was already created as categorical above, so this call leaves it unchanged)
df['Gender'] = df['Gender'].astype('category')
df.info()  # info() prints the summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Name    5 non-null      object  
 1   Gender  5 non-null      category
dtypes: category(1), object(1)
memory usage: 301.0+ bytes
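
`astype('category')` infers the categories from the data and leaves them unordered. When the levels and their order are known in advance, one option is to pass an explicit `pd.CategoricalDtype`. The sketch below assumes a hypothetical `Size` column that is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical column of plain strings
df_demo = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# Declare the levels and their order up front
size_dtype = pd.CategoricalDtype(
    categories=['Small', 'Medium', 'Large'], ordered=True
)
df_demo['Size'] = df_demo['Size'].astype(size_dtype)

# Sorting now follows the declared order rather than alphabetical order
sorted_sizes = df_demo['Size'].sort_values().tolist()
print(sorted_sizes)
```

Values that are not in the declared category list become `NaN` during the conversion, so the list should cover every expected label.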


<div style="background-color:#FF5722; color:white; padding:10px; border-radius:10px;">
<h2>📊 7. Grouping and Aggregating Categorical Data</h2>
</div>

Categorical columns group and aggregate efficiently because pandas works on the integer codes rather than comparing strings.

In [18]:
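# Grouping by a categorical column
# (observed=False keeps unused categories in the result; passing it explicitly
# avoids the FutureWarning that recent pandas versions raise for the default)
grouped = df.groupby('Gender', observed=False).size()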
print(grouped)

Gender
Female    2
Male      3
dtype: int64
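
When a categorical column has levels that never occur in the data, the `observed` argument of `groupby` decides whether those levels appear in the result. A small sketch, assuming a hypothetical frame with an unused `'Other'` level:

```python
import pandas as pd

# Hypothetical frame: 'Other' is a defined category but never occurs
df_demo = pd.DataFrame({
    'Gender': pd.Categorical(
        ['Female', 'Male', 'Male'],
        categories=['Female', 'Male', 'Other'],
    ),
})

# observed=False: every defined category gets a row, unused ones with size 0
all_levels = df_demo.groupby('Gender', observed=False).size()

# observed=True: only categories that actually appear are kept
seen_only = df_demo.groupby('Gender', observed=True).size()

print(all_levels)
print(seen_only)
```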




# Working with a Real-Life Example

In [19]:
df = pd.read_csv(r'../Datasets/Property_Crimes.csv')
df.head()

| | Area_Name | Year | Group_Name | Sub_Group_Name | Cases_Property_Recovered | Cases_Property_Stolen | Value_of_Property_Recovered | Value_of_Property_Stolen |
|---|---|---|---|---|---|---|---|---|
| 0 | Andaman & Nicobar Islands | 2001 | Burglary - Property | 3. Burglary | 27 | 64 | 755858 | 1321961 |
| 1 | Andhra Pradesh | 2001 | Burglary - Property | 3. Burglary | 3321 | 7134 | 51483437 | 147019348 |
| 2 | Arunachal Pradesh | 2001 | Burglary - Property | 3. Burglary | 66 | 248 | 825115 | 4931904 |
| 3 | Assam | 2001 | Burglary - Property | 3. Burglary | 539 | 2423 | 3722850 | 21466955 |
| 4 | Bihar | 2001 | Burglary - Property | 3. Burglary | 367 | 3231 | 2327135 | 17023937 |


In [20]:
# Convert the Group_Name column to a categorical column

In [21]:
# Insert a placeholder column at position 3, temporarily filled with the index values
df.insert(3, "Group_Name_cat", df.index)
df.head()

| | Area_Name | Year | Group_Name | Group_Name_cat | Sub_Group_Name | Cases_Property_Recovered | Cases_Property_Stolen | Value_of_Property_Recovered | Value_of_Property_Stolen |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Andaman & Nicobar Islands | 2001 | Burglary - Property | 0 | 3. Burglary | 27 | 64 | 755858 | 1321961 |
| 1 | Andhra Pradesh | 2001 | Burglary - Property | 1 | 3. Burglary | 3321 | 7134 | 51483437 | 147019348 |
| 2 | Arunachal Pradesh | 2001 | Burglary - Property | 2 | 3. Burglary | 66 | 248 | 825115 | 4931904 |
| 3 | Assam | 2001 | Burglary - Property | 3 | 3. Burglary | 539 | 2423 | 3722850 | 21466955 |
| 4 | Bihar | 2001 | Burglary - Property | 4 | 3. Burglary | 367 | 3231 | 2327135 | 17023937 |


In [29]:
df['Group_Name_cat'] = df['Group_Name'].astype('category')
df.head()


| | Area_Name | Year | Group_Name | Group_Name_cat | Sub_Group_Name | Cases_Property_Recovered | Cases_Property_Stolen | Value_of_Property_Recovered | Value_of_Property_Stolen |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Andaman & Nicobar Islands | 2001 | Burglary - Property | Burglary - Property | 3. Burglary | 27 | 64 | 755858 | 1321961 |
| 1 | Andhra Pradesh | 2001 | Burglary - Property | Burglary - Property | 3. Burglary | 3321 | 7134 | 51483437 | 147019348 |
| 2 | Arunachal Pradesh | 2001 | Burglary - Property | Burglary - Property | 3. Burglary | 66 | 248 | 825115 | 4931904 |
| 3 | Assam | 2001 | Burglary - Property | Burglary - Property | 3. Burglary | 539 | 2423 | 3722850 | 21466955 |
| 4 | Bihar | 2001 | Burglary - Property | Burglary - Property | 3. Burglary | 367 | 3231 | 2327135 | 17023937 |


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2449 entries, 0 to 2448
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   Area_Name                    2449 non-null   object  
 1   Year                         2449 non-null   int64   
 2   Group_Name                   2449 non-null   object  
 3   Group_Name_cat               2449 non-null   category
 4   Sub_Group_Name               2449 non-null   object  
 5   Cases_Property_Recovered     2449 non-null   int64   
 6   Cases_Property_Stolen        2449 non-null   int64   
 7   Value_of_Property_Recovered  2449 non-null   int64   
 8   Value_of_Property_Stolen     2449 non-null   int64   
dtypes: category(1), int64(5), object(3)
memory usage: 155.9+ KB


In [33]:
# The dtype of the new column is categorical

In [32]:
df['Group_Name_cat'].dtype

CategoricalDtype(categories=['Burglary - Property', 'Criminal Breach of Trust - Property',
                  'Dacoity -Property', 'Other heads of Property',
                  'Robbery - Property', 'Theft - Property', 'Total Property'],
, ordered=False, categories_dtype=object)

We can also create categorical data with `pd.Categorical`.

In [34]:
cat = pd.Categorical(df['Group_Name'], categories=df['Group_Name'].unique())
cat

['Burglary - Property', 'Burglary - Property', 'Burglary - Property', 'Burglary - Property', 'Burglary - Property', ..., 'Total Property', 'Total Property', 'Total Property', 'Total Property', 'Total Property']
Length: 2449
Categories (7, object): ['Burglary - Property', 'Criminal Breach of Trust - Property', 'Dacoity -Property', 'Other heads of Property', 'Robbery - Property', 'Theft - Property', 'Total Property']

In [36]:
# Categories are always stored as integer codes; here 'Burglary - Property' has code 0

In [35]:
cat.codes

array([0, 0, 0, ..., 6, 6, 6], dtype=int8)

In [37]:
cat.categories

Index(['Burglary - Property', 'Criminal Breach of Trust - Property',
       'Dacoity -Property', 'Other heads of Property', 'Robbery - Property',
       'Theft - Property', 'Total Property'],
      dtype='object')

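`cat.categories` and `cat.unique()` do not always return the same values: `cat.categories` lists every defined category, including ones that never occur in the data, while `cat.unique()` returns only the values actually present.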
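
A short sketch of that difference, assuming a hypothetical `cat_demo` array with one unused category (`'c'`) and values that appear in a different order from the category list:

```python
import pandas as pd

# Hypothetical array: 'c' is defined but never used
cat_demo = pd.Categorical(['b', 'a', 'b'], categories=['a', 'b', 'c'])

# All defined categories, in category order
defined = list(cat_demo.categories)

# Only the values present in the data, in order of first appearance
present = list(cat_demo.unique())

print(defined)
print(present)
```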