# What is Categorical Data?

Categorical data in machine learning refers to data that consists of categories or labels rather than numerical values. These categories may be nominal, meaning that there is no inherent order or ranking between them (e.g., color, gender), or ordinal, meaning that there is a natural ordering between the categories (e.g., education level, income bracket).

Categorical data is often represented using discrete values, such as integers or strings, and is frequently encoded as one-hot vectors before being used as input to machine learning models. One-hot encoding involves creating a binary vector for each category, where the vector has a 1 in the position corresponding to the category and 0s in all other positions.
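
The nominal/ordinal distinction can be made explicit in pandas. The following is a minimal sketch, assuming pandas is available; the variable names and the education levels are made up for illustration.

```python
import pandas as pd

# Nominal: the categories have no order
colors = pd.Series(['red', 'green', 'blue', 'red'], dtype='category')

# Ordinal: declare the order explicitly, from lowest to highest
# (these education levels are illustrative values, not from a real dataset)
level_type = pd.CategoricalDtype(
    categories=['high school', 'bachelor', 'master'], ordered=True)
education = pd.Series(['bachelor', 'high school', 'master', 'bachelor'],
                      dtype=level_type)

print(colors.cat.ordered)             # False
print(education.cat.ordered)          # True

# Integer codes follow the declared order: high school=0, bachelor=1, master=2
print(education.cat.codes.tolist())   # [1, 0, 2, 1]

# Comparisons are only meaningful for ordered categoricals
print((education > 'high school').tolist())  # [True, False, True, True]
```

Declaring the order up front keeps later encodings from silently falling back to alphabetical order.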

# Techniques for Handling Categorical Data

Handling categorical data is an important part of machine learning preprocessing, because many algorithms require numerical input. Depending on the algorithm and the nature of the categorical data, different encoding techniques may be used, such as label encoding, ordinal encoding, or binary encoding. The techniques covered here are:

1. One-Hot Encoding

2. Label Encoding

3. Frequency Encoding

4. Target Encoding

5. Binary Encoding

### 1. One-Hot Encoding

One-hot encoding is a popular technique for handling categorical data in machine learning. It involves creating a binary vector for each category, where each element of the vector represents the presence or absence of the category. For example, if we have a categorical variable for color with values red, blue, and green, one-hot encoding would create three binary vectors: [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.

In [9]:
import pandas as pd

# Creating a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
print("-----------df ----- \n", df)

# Performing one-hot encoding
# 0-index red --- color red is True remaining all are False
# 1-index green --- color green is True remaining all are False
# 2-index blue --- color blue is True remaining all are False
# 3-index red --- color red is True remaining all are False
# 4-index green --- color green is True remaining all are False
one_hot_encoded = pd.get_dummies(df['color'], prefix='color')

print("--------------one_hot_encoded ------\n", one_hot_encoded)
# Combining the encoded data with the original data
df = pd.concat([df, one_hot_encoded], axis=1)
print("-------------df-after-concat ----- \n", df)

# Drop the original categorical variable
df = df.drop('color', axis=1)

# Print the encoded data
print("df-after-dropping-color-column \n", df)


-----------df ----- 
    color
0    red
1  green
2   blue
3    red
4  green
--------------one_hot_encoded ------
    color_blue  color_green  color_red
0       False        False       True
1       False         True      False
2        True        False      False
3       False        False       True
4       False         True      False
-------------df-after-concat ----- 
    color  color_blue  color_green  color_red
0    red       False        False       True
1  green       False         True      False
2   blue        True        False      False
3    red       False        False       True
4  green       False         True      False
df-after-dropping-color-column 
    color_blue  color_green  color_red
0       False        False       True
1       False         True      False
2        True        False      False
3       False        False       True
4       False         True      False


### 2. Label Encoding

Label encoding is another technique for handling categorical data in machine learning. It assigns a unique integer to each category in a categorical variable. With scikit-learn's LabelEncoder, the integers follow the sorted (alphabetical) order of the category names, not any natural ranking of the categories.

For example, suppose we have a categorical variable "Size" with three categories: "small," "medium," and "large." One might expect the values 0, 1, and 2 in that order, but because LabelEncoder sorts the names, it assigns 0 to "large," 1 to "medium," and 2 to "small," as the output below shows.

In [16]:
from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable
data = ['small', 'medium', 'large', 'small', 'large']
print("-------data--------\n", data)

# create a label encoder object
label_encoder = LabelEncoder()
print("-------label_encoder--------\n", label_encoder)

# fit and transform the data using the label encoder
# large is assigned 0
# medium is assigned 1
# small is assigned 2
encoded_data = label_encoder.fit_transform(data)
print("-------encoded_data--------\n", encoded_data)


-------data--------
 ['small', 'medium', 'large', 'small', 'large']
-------label_encoder--------
 LabelEncoder()
-------encoded_data--------
 [2 1 0 2 0]
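
When the categories have a natural order, the integer codes should usually follow it. The following is a minimal sketch, assuming scikit-learn is installed, that uses `OrdinalEncoder` with an explicit category list so that small < medium < large; the variable names are chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same sample values as above, held in a one-column DataFrame
sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'large']})

# Pass the categories in their natural order: small=0, medium=1, large=2
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded_sizes = ordinal_encoder.fit_transform(sizes[['size']])

print(encoded_sizes.ravel())  # [0. 1. 2. 0. 2.]
```

Unlike the alphabetical codes above, these values preserve the ranking, which matters for models that treat the encoded column as a magnitude.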


### 3. Frequency Encoding

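Frequency encoding is another technique for handling categorical data in machine learning. It replaces each category in a categorical variable with its frequency (either the raw count or the proportion) in the dataset. The idea behind frequency encoding is that categories that appear more frequently may be more important or informative for the machine learning algorithm. One limitation is that categories with the same frequency receive the same value and become indistinguishable; in the example below, red and green both map to 0.4.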

In [19]:
import pandas as pd

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
print("-----------df ----- \n", df)

# calculate the frequency of each category in the categorical variable
# red occurs 2 times out of 5 ---- 2/5 --- 0.4
# green occurs 2 times out of 5 ---- 2/5 --- 0.4
# blue occurs 1 time out of 5 ---- 1/5 --- 0.2
freq = df['color'].value_counts(normalize=True)
print("-----------freq ----- \n", freq)

# replace each category with its frequency
df['color_freq'] = df['color'].map(freq)

# drop the original categorical variable
df = df.drop('color', axis=1)
print("-----------final-df ----- \n", df)


-----------df ----- 
    color
0    red
1  green
2   blue
3    red
4  green
-----------freq ----- 
 color
red      0.4
green    0.4
blue     0.2
Name: proportion, dtype: float64
-----------final-df ----- 
    color_freq
0         0.4
1         0.4
2         0.2
3         0.4
4         0.4
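
The frequencies are usually learned from the training data and then reused on new data. The sketch below shows one way to do that, assuming a simple policy of assigning 0.0 to categories that were never seen; the `train` and `test` frames and the 'purple' value are hypothetical.

```python
import pandas as pd

# Hypothetical split: 'purple' appears only in the test data
train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
test = pd.DataFrame({'color': ['blue', 'red', 'purple']})

# Learn the proportions from the training data only
freq_map = train['color'].value_counts(normalize=True)

# Apply the same mapping to both frames
train['color_freq'] = train['color'].map(freq_map)

# Categories missing from freq_map map to NaN; treat them as frequency 0.0
# (this fallback is an assumption, other policies are possible)
test['color_freq'] = test['color'].map(freq_map).fillna(0.0)

print(test)
```

Here blue maps to 0.2, red to 0.4, and the unseen purple to 0.0.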


### 4. Target Encoding

Target encoding is another technique for handling categorical data in machine learning. It replaces each category in a categorical variable with the mean (or another aggregation) of the target variable (i.e., the variable you want to predict) for that category. The idea behind target encoding is that it captures the relationship between the categorical variable and the target variable, which can improve the predictive performance of the model. Because the encoding is computed from the target itself, it should be fitted on the training data only; otherwise information about the target leaks into the features and the model may overfit.

In [21]:
import pandas as pd

# create a sample dataset with a categorical variable and a target variable
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
        'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# compute the mean of the target for each category
# red: (1 + 0) / 2 = 0.5, green: (0 + 1) / 2 = 0.5, blue: 1 / 1 = 1.0
category_means = df.groupby('color')['target'].mean()

# replace each category with the mean target of that category
df['color_encoded'] = df['color'].map(category_means)

# print the encoded data
print(df)

   color  target  color_encoded
0    red       1            0.5
1  green       0            0.5
2   blue       1            1.0
3    red       0            0.5
4  green       1            0.5
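
In the output above, blue is encoded as 1.0 on the strength of a single row. A common remedy for such rare categories is to shrink each category mean toward the overall mean. The following is a minimal sketch of that idea, not a definitive implementation; the smoothing weight `m` and its value are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                   'target': [1, 0, 1, 0, 1]})

m = 2  # hypothetical smoothing weight: larger values pull harder toward the global mean
global_mean = df['target'].mean()  # 3 / 5 = 0.6

# Per-category mean and row count
stats = df.groupby('color')['target'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['color_smoothed'] = df['color'].map(smoothed)
print(df)
```

With these numbers, blue moves from 1.0 to about 0.73 and red and green move from 0.5 to 0.55, so a category seen once no longer receives an extreme value.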


### 5. Binary Encoding

Binary encoding is another technique for encoding categorical variables in machine learning. Each category is first assigned an integer, and that integer is then written in binary, with one column per binary digit. A variable with n categories therefore needs only about log2(n) columns, compared with n columns for one-hot encoding. In the example below, the integers follow the order in which the categories first appear (red = 1, green = 2, blue = 3), which gives the binary codes 01, 10, and 11.

In [5]:
import pandas as pd
import category_encoders as ce

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
print("-----------df ----- \n", df)

# create a binary encoder object for the 'color' column
binary_encoder = ce.BinaryEncoder(cols=['color'])
print("-----------binary_encoder ----- \n", binary_encoder)

# fit the encoder and transform the categorical variable
encoded_data = binary_encoder.fit_transform(df['color'])
print("-----------encoded_data ----- \n", encoded_data)

# merge the encoded variable with the original dataframe
df = pd.concat([df, encoded_data], axis=1)
print("-----------final-df ----- \n", df)


-----------df ----- 
    color
0    red
1  green
2   blue
3    red
4  green
-----------binary_encoder ----- 
 BinaryEncoder(cols=['color'])
-----------encoded_data ----- 
    color_0  color_1
0        0        1
1        1        0
2        1        1
3        0        1
4        1        0
-----------final-df ----- 
    color  color_0  color_1
0    red        0        1
1  green        1        0
2   blue        1        1
3    red        0        1
4  green        1        0
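
To see where the two columns come from, the same codes can be reproduced by hand. This is a sketch using only pandas, assuming integer codes that start at 1 and follow the order of first appearance, which matches the output shown above; it is not the library's actual implementation.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# Step 1: integer code per category, in order of first appearance, starting at 1
codes = pd.factorize(df['color'])[0] + 1   # red=1, green=2, blue=3

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()     # 3 is 0b11, so 2 bits

# Step 3: one column per binary digit, most significant digit first
for i in range(n_bits):
    shift = n_bits - 1 - i
    df[f'color_{i}'] = (codes >> shift) & 1

print(df)
```

Red (1) becomes 01, green (2) becomes 10, and blue (3) becomes 11, so three categories fit in two columns instead of the three that one-hot encoding needs.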
