# 6. Preprocessing Part 2: Handling Categorical Data

Up to this point, we have focused on preprocessing numeric data. However, real-world datasets often contain categorical features, such as a risk level recorded as "low", "medium", or "high". To use these categorical variables in machine learning models, we need to convert them into a numeric format. This notebook demonstrates how to preprocess categorical data using one-hot encoding.

In [2]:
import numpy as np

arr = np.array(["low", "low", "high", "medium"]).reshape(-1, 1)
arr


array([['low'],
       ['low'],
       ['high'],
       ['medium']], dtype='<U6')

A common technique for converting categorical data into a numeric format is one-hot encoding. The `OneHotEncoder` from scikit-learn transforms categorical values into a binary matrix, where each category is represented by a separate column. This allows machine learning algorithms to interpret categorical variables numerically.
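To make the idea concrete, here is a minimal NumPy-only sketch of what one-hot encoding computes. It illustrates the result, not how `OneHotEncoder` is implemented internally, and the variable names (`labels`, `category_index`, `one_hot`) are chosen for this example only.

```python
import numpy as np

# Same labels as above, kept 1-D for simplicity.
labels = np.array(["low", "low", "high", "medium"])

# Sorted unique categories, plus the category index of every sample.
categories, category_index = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[category_index]

print(categories)
print(one_hot)
```

Each row of `one_hot` contains a single 1, placed in the column of that sample's category.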

In [3]:
from sklearn.preprocessing import OneHotEncoder


By default, `OneHotEncoder` returns a SciPy sparse matrix, which stores only the non-zero entries and is therefore an efficient way to hold large binary matrices that are mostly zeros. The drawback is that the notebook displays only a summary of the matrix rather than the encoded values.

In [5]:
enc = OneHotEncoder()
enc.fit_transform(arr)


<4x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

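To make the encoded output easier to inspect, ask for a dense result when creating the encoder: pass `sparse_output=False` (the parameter was named `sparse` before scikit-learn 1.2, and the old name was removed in 1.4). The encoder then returns a regular NumPy array instead of a sparse matrix, so you can see the actual encoded values.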

In [6]:
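# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
enc = OneHotEncoder(sparse_output=False)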
enc.fit_transform(arr)


array([[0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

In the output, each row corresponds to a sample and each column to a category. The columns follow the sorted order of the categories (`high`, `low`, `medium`), so the first two rows ('low') have their 1 in the second column, 'high' has it in the first, and 'medium' in the third. This numeric representation is especially useful when the categorical variable is the target label (`y`) and the model needs it in numeric form.

However, there are some important behaviors to consider. If you attempt to transform a category that was not present during fitting, the encoder will raise an error by default. This is relevant when encoding features for prediction, as new categories may appear in future data.

In [7]:
enc.transform([["zero"]])  # "zero" was not seen during fit, so this raises an error


ValueError: Found unknown categories ['zero'] in column 0 during transform

The `ValueError` names the offending category and the column where it appeared. Failing loudly by default keeps unseen categories from slipping silently into model training or prediction.
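If the error should be reported rather than stop the program, one option is to catch it around the `transform` call. This is only a sketch of that pattern; the `caught` flag and the printed message are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

arr = np.array(["low", "low", "high", "medium"]).reshape(-1, 1)
enc = OneHotEncoder().fit(arr)

try:
    # "zero" was not among the categories seen during fit.
    enc.transform([["zero"]])
    caught = False
except ValueError as err:
    caught = True
    print(f"Could not encode new data: {err}")
```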

To handle unknown categories gracefully, you can set the `handle_unknown` parameter to `'ignore'`. This will encode any unknown category as a row of zeros, indicating that it does not match any of the known categories.

In [8]:
# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
enc.fit(arr)

enc.transform([["zero"]])


array([[0., 0., 0.]])

The all-zero row signals that the value matches none of the known categories. This setting is useful when encoding feature columns (`X`), where new categories may appear in future data. For target labels (`y`), it is generally best to keep the default so that an unexpected class raises an error instead of being silently encoded as zeros.
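Because unknown values become all-zero rows, they can be found afterwards by checking the row sums. The sketch below assumes a hypothetical batch of new data (`new_data`) that contains one unseen value, and again uses `.toarray()` to get a dense result.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array(["low", "low", "high", "medium"]).reshape(-1, 1)
# Hypothetical new batch; "zero" was never seen during fit.
new_data = np.array(["low", "zero", "high"]).reshape(-1, 1)

enc = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_new = enc.transform(new_data).toarray()

# A row that sums to 0 matched none of the categories seen during fit.
is_unknown = encoded_new.sum(axis=1) == 0

print(encoded_new)
print(new_data[is_unknown].ravel())
```

Counting such rows is a cheap way to notice that incoming data has drifted away from the categories the encoder was fitted on.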

To further explore preprocessing techniques, you can use the website [drawdata.xyz](https://drawdata.xyz) to create and download custom datasets. You can then load these datasets into your notebook using `pandas.read_clipboard(sep=",")` after copying the CSV data, making it easy to experiment with different preprocessing steps in scikit-learn.
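As a rough sketch of that workflow, the example below one-hot encodes a label column from CSV data. Since a clipboard is not available in every environment, it reads a small made-up CSV string instead. The column names (`x`, `y`, `label`) and the values are assumptions for illustration, and the columns of a real export may differ.

```python
import io

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up CSV text standing in for copied data; real column names may differ.
csv_text = """x,y,label
1.0,2.0,a
1.5,1.8,a
5.0,8.0,b
"""

# With real copied data this line would be: df = pd.read_clipboard(sep=",")
df = pd.read_csv(io.StringIO(csv_text))

enc = OneHotEncoder(handle_unknown="ignore")
# Double brackets keep the selection 2-D, which the encoder expects.
label_encoded = enc.fit_transform(df[["label"]]).toarray()

print(enc.categories_)
print(label_encoded)
```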