# Working with Categorical Data

Now we look at categorical data, which can include numbers that behave like categories. To learn more, see the scikit-learn User Guide: [7.3.4. Encoding categorical features](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).

- **Nominal** (no inherent order): Gender, Marital Status, Color, Brand, Favorite Sport
- **Ordinal** (naturally ordered): Education Level, Customer Rating, Income Level

**Categorical** data is often non-numerical; it has to be encoded into numerical feature vectors so that ML models can learn from it.

In [2]:
import pandas as pd

df = pd.DataFrame({
    'risk': ['low', 'medium', 'ZZZZZZ', 'low', 'low', 'high'],  # 'ZZZZZZ' is not a valid risk level
    'class': ['1st', '3rd', '2nd', '1st', '3rd', '000000'],     # '000000' is not a valid class
})
df

| | risk | class |
|---|---|---|
| 0 | low | 1st |
| 1 | medium | 3rd |
| 2 | ZZZZZZ | 2nd |
| 3 | low | 1st |
| 4 | low | 3rd |
| 5 | high | 000000 |


## OrdinalEncoder

Ordinal encoding maps each category to an integer that reflects its rank, which gives the model information about the ordering of the feature's values, such as shirt size: `xs < s < m < l`.

In [3]:
from sklearn.preprocessing import OrdinalEncoder

# Specify the order of categories
categories = [
    ['low', 'medium', 'high'], # <-- categories of first feature
    ['1st', '2nd', '3rd'],     # <-- categories of second feature
] # <-- this is just a list of lists of strings (i.e., list[list[str]])

encoder = OrdinalEncoder(
    categories=categories, # <-- specify the categories
    handle_unknown='use_encoded_value', # <-- do not raise on values missing from `categories`
    unknown_value=-1                    # <-- encode those unknown values as -1
)
# We want a pandas DataFrame as output rather than a NumPy array (default)
encoder = encoder.set_output(transform='pandas')

# Maps each listed category to its position in the list (0, 1, 2, ...)
encoder.fit(df)

encoded_data = encoder.transform(df)
encoded_data

| | risk | class |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 1.0 | 2.0 |
| 2 | -1.0 | 1.0 |
| 3 | 0.0 | 0.0 |
| 4 | 0.0 | 2.0 |
| 5 | 2.0 | -1.0 |

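The shirt-size ordering mentioned earlier can be sketched the same way. This is a minimal illustration, not part of the original notebook: the `sizes` DataFrame and the `size_encoder` name are made up for the example, and it assumes scikit-learn and pandas are installed. It also shows that `inverse_transform` recovers the original labels from the encoded integers.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one ordinal feature with a known ranking
sizes = pd.DataFrame({'size': ['m', 'xs', 'l', 's', 'xs']})

# The position of each label in the list becomes its encoded value:
# xs -> 0, s -> 1, m -> 2, l -> 3
size_encoder = OrdinalEncoder(categories=[['xs', 's', 'm', 'l']])

# Default output is a float NumPy array of shape (n_samples, n_features)
size_encoded = size_encoder.fit_transform(sizes)
print(size_encoded.ravel().tolist())  # [2.0, 0.0, 3.0, 1.0, 0.0]

# Map the integers back to the original labels
restored = size_encoder.inverse_transform(size_encoded)
print(restored.ravel().tolist())  # ['m', 'xs', 'l', 's', 'xs']
```

Because the encoded values preserve the ranking, comparisons such as "larger than `s`" carry over to the encoded feature as "greater than 1".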

## OneHotEncoder

**One-hot encoding**, implemented in [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), is another way to encode categorical features. Each category gets its own binary column that denotes:

- the **existence** of a feature as 1 and
- the **absence** of the feature as 0

Example: a `color` feature with 4 possible values: `Red`, `Green`, `Blue`, and `Orange`

In [6]:
df = pd.DataFrame({
    'color': ['Red', 'Red', 'Red', 'Green', 'Blue', 'Blue', 'Orange']
})

In [8]:
from sklearn.preprocessing import OneHotEncoder

# Create a OneHotEncoder instance
encoder = OneHotEncoder(
    sparse_output=False,    # <-- output is a dense array
)
encoder.set_output(transform='pandas')

# Fit and transform the data
encoded_data = encoder.fit_transform(df)
encoded_data

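| | color_Blue | color_Green | color_Orange | color_Red |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 1.0 | 0.0 |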


## Extra: Geo-Encoding

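For location-based data such as addresses or city names, **geocoding**, which converts each location into latitude and longitude coordinates, is often a better choice than ordinal or one-hot encoding because it preserves how close locations are to one another.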
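
A rough sketch of the idea follows. A real project would typically call a geocoding service or library; here a small hard-coded lookup table stands in for one so the example runs offline. The `CITY_COORDS` dictionary and the `places` DataFrame are hypothetical, and the coordinates are approximate values for illustration only.

```python
import pandas as pd

# Hypothetical lookup table standing in for a real geocoding service.
# Coordinates are approximate (latitude, longitude) pairs.
CITY_COORDS = {
    'Paris': (48.86, 2.35),
    'Tokyo': (35.68, 139.65),
    'Sydney': (-33.87, 151.21),
}

places = pd.DataFrame({'city': ['Tokyo', 'Paris', 'Sydney', 'Tokyo']})

# Replace the single categorical column with two numerical columns
coords = pd.DataFrame(
    [CITY_COORDS[city] for city in places['city']],
    columns=['latitude', 'longitude'],
    index=places.index,
)
geo_encoded = pd.concat([places, coords], axis=1)
print(geo_encoded)
```

Unlike one-hot encoding, this adds only two columns no matter how many distinct locations appear in the data.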