# Encode Categorical Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This module covers techniques for encoding categorical data into numerical formats for machine learning models. We'll explore different encoding methods including ordinal encoding, one-hot encoding, and dummy variable encoding using a breast cancer dataset as a practical example.

## Learning Objectives

- Learn why encoding is required for preprocessing categorical data in machine learning algorithms
- Understand how to use ordinal encoding for categorical variables with natural rank ordering
- Understand one-hot encoding techniques for categorical variables without natural rank ordering
- Apply encoding techniques to real medical data for breast cancer prediction

## Prerequisites

- Basic understanding of Python programming
- Familiarity with the NumPy library
- Knowledge of basic statistical concepts

## Get Started

To start, we install required packages, import the necessary libraries, and define a helper function to download data using the `requests` library.

### Install required packages

In [None]:
%pip install numpy pandas requests scikit-learn

### Import necessary libraries

In [None]:
from pathlib import Path

import requests
from numpy import asarray
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

### Define utility functions

Define a helper function for downloading example datasets.  

*Note!* It is not essential that you understand the following code.  It is just for getting the example data.

In [None]:
def download(url, to_file):
    """Download content from the given URL and save it to a file.

    Args:
        url (str): The URL to download the content from.
        to_file (str): The name of the file to save the downloaded content to.

    """
    response = requests.get(url, timeout=10)
    # stop early on an HTTP error instead of saving an error page to disk
    response.raise_for_status()
    Path(to_file).write_bytes(response.content)
    print(f"downloaded file '{to_file}'")

## Breast Cancer Categorical Dataset

The breast cancer dataset classifies each breast cancer
patient as having either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))

### Download Breast Cancer data files

In [None]:
download(
    url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv",
    to_file="breast-cancer.csv",
)

download(
    url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names",
    to_file="breast-cancer.names",
)

### Load and summarize the dataset

In [None]:
# load the dataset
dataset = read_csv("breast-cancer.csv", header=None)
print(dataset.head())

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# summarize
print("Input", X.shape)
print("Output", y.shape)

We can see that we have 286 examples and nine input variables.



## Nominal And Ordinal Variables

* **Nominal Variable**. Variable comprises a finite set of discrete values with no rank-order
relationship between values.
* **Ordinal Variable**. Variable comprises a finite set of discrete values with a ranked
ordering between values.

Some algorithms can work with categorical data directly. For example, a decision tree can
be learned directly from categorical data with no data transform required (this depends on
the specific implementation). Many machine learning algorithms, however, cannot operate on label data
directly. They require all input variables and output variables to be numeric. In general, this is
mostly a constraint of the efficient implementation of machine learning algorithms rather than
a hard limitation of the algorithms themselves.

For these implementations, categorical data must be converted
to a numerical form. If the categorical variable is an output variable, you may also want to
convert predictions by the model back into a categorical form in order to present them or use
them in some application.
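
As a minimal sketch of converting predictions back into categorical form, the example below uses the `inverse_transform()` method of scikit-learn's `LabelEncoder`. The label array and the "predicted" integers are hypothetical values made up for illustration; they do not come from a fitted model.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder

# hypothetical target labels, for illustration only
labels = asarray(
    ["no-recurrence-events", "recurrence-events", "no-recurrence-events"]
)

# learn the label-to-integer mapping and encode the labels
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)
print(encoded)

# pretend these integers are predictions made by a model (an assumption)
predicted = asarray([1, 0, 1])

# map the integer predictions back to the original string labels
decoded = label_encoder.inverse_transform(predicted)
print(decoded)
```

The same fitted encoder must be used in both directions, because it holds the mapping between labels and integers.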

## Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical
values. They are:
* Ordinal Encoding
* One-Hot Encoding
* Dummy Variable Encoding

### Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value. An integer ordinal encoding is a natural encoding for ordinal variables. For nominal
variables, it imposes an ordinal relationship where no such relationship may exist. This can
cause problems, in which case a one-hot encoding may be used instead.

In [None]:
# example of an ordinal encoding

# Encode categorical features as an integer array

# define data
data = asarray([["red"], ["green"], ["blue"]])
print("Original data: \n", data)

# define ordinal encoding
encoder = OrdinalEncoder()

# Fit OrdinalEncoder to data, then transform data.
result = encoder.fit_transform(data)
print("Encoded data: \n", result)

We can see that the numbers are assigned to the labels as we expected. The categories are sorted alphabetically, so blue is 0, green is 1, and red is 2.

This **OrdinalEncoder** class is intended for input variables that are organized into rows and
columns, e.g. a matrix. If a categorical target variable needs to be encoded for a classification
problem, then the **LabelEncoder** class can be used. It does the same
thing as the **OrdinalEncoder**, although it expects a one-dimensional input for the single target
variable.
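
The difference in expected input shape can be sketched as follows. This is only an illustration on the same three colors used above: `LabelEncoder` is given a one-dimensional array, while `OrdinalEncoder` is given the same values reshaped into a single column.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# a one-dimensional array of labels
colors = asarray(["red", "green", "blue"])

# LabelEncoder takes the one-dimensional array as-is
label_codes = LabelEncoder().fit_transform(colors)

# OrdinalEncoder expects rows and columns, so reshape to one column
ordinal_codes = OrdinalEncoder().fit_transform(colors.reshape(-1, 1))

# the shapes differ, but the integer codes agree
print(label_codes.shape, ordinal_codes.shape)
print(label_codes, ordinal_codes.ravel())
```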

### One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may be
insufficient, or even misleading, to the model. Forcing an ordinal relationship via an ordinal encoding
and allowing the model to assume a natural ordering between categories may result in poor
performance or unexpected results (predictions halfway between categories). In this case, a
one-hot encoding can be applied to the ordinal representation. This is where the integer encoded
variable is removed and one new binary variable is added for each unique integer value in the
variable.

In [None]:
# example of a one-hot encoding

# Encode categorical features as a one-hot numeric array.

# define data
data = asarray([["red"], ["green"], ["blue"]])
print(data)

# define one-hot encoding
# sparse_output=False returns a dense array; the default (True) returns a sparse matrix.
encoder = OneHotEncoder(sparse_output=False)

# Fit OneHotEncoder to data, then transform data.
onehot = encoder.fit_transform(data)
print(onehot)

We can see the one-hot encoding
matching our expectation of 3 binary variables in the order blue, green and red.

### Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category. The problem is that this
representation includes redundancy. For example, if we know that `[1, 0, 0]` represents blue and
`[0, 1, 0]` represents green, we don't need another binary variable to represent red. Instead, we
could drop the third variable and represent red with 0 values alone, e.g. `[0, 0]`. This is called a dummy variable encoding, and it always
represents `C` categories with `C - 1` binary variables.

We can use the `OneHotEncoder` class to implement a dummy encoding as well as a one-hot
encoding. The `drop` argument can be set to indicate which category will become the one that is
assigned all zero values, called the baseline. We can set this to `"first"` so that the first category is
used. Because the labels are sorted alphabetically, the blue label will be the first and will become
the baseline.

In [None]:
# example of a dummy variable encoding

# define data
data = asarray([["red"], ["green"], ["blue"]])
print(data)

# define dummy variable encoding
# drop="first" drops the first category in each feature. If only one category
# is present, the feature will be dropped entirely.
# sparse_output=False returns a dense array instead of a sparse matrix.
encoder = OneHotEncoder(drop="first", sparse_output=False)

# Fit OneHotEncoder to data, then transform data.
onehot = encoder.fit_transform(data)
print(onehot)

### `OrdinalEncoder` Transform

An ordinal encoding involves mapping each unique label to an integer value. This type of
encoding is really only appropriate if there is a known relationship between the categories. This
relationship does exist for some of the variables in our dataset, and ideally, this should be
harnessed when preparing the data. In this case, we will ignore any possible existing ordinal
relationship and assume all variables are categorical. It can still be helpful to use an ordinal
encoding, at least as a point of reference with other encoding schemes.
We can use the `OrdinalEncoder` from scikit-learn to encode each variable to integers.
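
For a variable that does have a known order, the default alphabetical sorting may not match that order. The sketch below shows one way the order could be supplied through the `categories` argument of `OrdinalEncoder`. The size ranges are simplified, hypothetical values in the style of the `tumor-size` attribute, not rows read from the dataset file.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical size ranges, for illustration only
sizes = asarray([["5-9"], ["0-4"], ["15-19"], ["10-14"]])

# default: categories are sorted as strings, so "10-14" comes before "5-9"
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()
print(default_codes)

# explicit order: one list of categories per input column, smallest first
size_order = ["0-4", "5-9", "10-14", "15-19"]
ranked_encoder = OrdinalEncoder(categories=[size_order])
ranked_codes = ranked_encoder.fit_transform(sizes).ravel()
print(ranked_codes)
```

With the explicit order, the integer codes increase with tumor size, which is the relationship an ordinal encoding is meant to preserve.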

#### Ordinal Encode The Breast Cancer Dataset


In [None]:
# load the dataset
dataset = read_csv("breast-cancer.csv", header=None)

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)

# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# summarize the transformed data
print("Input", X.shape)
print(X[:5, :])
print("Output", y.shape)
print(y[:5])

We would expect the number of rows and, in this case, the number of columns to be unchanged,
with every string value replaced by an integer code. As expected, we can see that the
number of variables is unchanged, and all values are now ordinal encoded integers (stored as floats by `OrdinalEncoder`).

Next, let's evaluate a machine learning model on this dataset with this encoding. The best practice
when encoding variables is to fit the encoding on the training dataset, then apply it to the train
and test datasets. We will first split the dataset, then fit the encoding on the training set,
and apply it to both the train and test sets.
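
One consequence of fitting only on the training set is that the test set may contain a category the encoder has never seen, which raises an error by default. The sketch below shows one possible way to handle this, assuming the `handle_unknown` and `unknown_value` arguments of `OrdinalEncoder`. The toy train and test arrays are hypothetical, and this option is not used in the examples in this module.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical train and test columns; "purple" appears only in the test set
train = asarray([["red"], ["green"], ["blue"]])
test = asarray([["green"], ["purple"]])

# map any category not seen during fit to the placeholder value -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# fit on the training data only, then apply to the test data
encoder.fit(train)
test_codes = encoder.transform(test)
print(test_codes)
```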

#### Logistic Regression With Ordinal Encoding

Next, we evaluate logistic regression on the breast cancer dataset with an ordinal encoding.

In [None]:
# load the dataset
dataset = read_csv("breast-cancer.csv", header=None)

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)

# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)

# define the model
model = LogisticRegression()

# fit on the training set
model.fit(X_train, y_train)

# predict on test set
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print("Accuracy: %.2f" % (accuracy * 100))

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a
reasonable score.

### `OneHotEncoder` Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between
categories. The scikit-learn library provides the `OneHotEncoder` class to automatically one-hot
encode one or more variables. By default the `OneHotEncoder` will output data with a sparse
representation, which is efficient given that most values are 0 in the encoded representation.
We will disable this feature by setting the `sparse_output` argument to `False` so that we can review the
effect of the encoding. Once defined, we can call the `fit_transform()` function and pass it
our dataset to create a one-hot encoded version of our dataset.

#### One-hot Encode The Breast Cancer Dataset

In [None]:
# load the dataset
dataset = read_csv("breast-cancer.csv", header=None)

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# one-hot encode input variables
onehot_encoder = OneHotEncoder(sparse_output=False)
X = onehot_encoder.fit_transform(X)

# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# summarize the transformed data
print("Input", X.shape)
print(X[:5, :])

We would expect the number of rows to remain the same, but the number of columns to
dramatically increase. As expected, in this case, we can see that the number of variables has
leaped up from 9 to 43 and all values are now binary values 0 or 1.

Next, let's evaluate a machine learning model on this dataset with this encoding, as we did in the
previous section. The encoding is fit on the training set, then applied to both the train and test sets
as before.

#### Logistic Regression With One-Hot Encoding

Next, we evaluate logistic regression on the breast cancer dataset with a one-hot encoding.

In [None]:
# load the dataset
dataset = read_csv("breast-cancer.csv", header=None)

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)

# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)

# define the model
model = LogisticRegression()

# fit on the training set
model.fit(X_train, y_train)

# predict on test set
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print("Accuracy: %.2f" % (accuracy * 100))

In this case, the model achieved a classification accuracy of about 70.53 percent, which is
worse than the ordinal encoding in the previous section.

## Conclusion

In this module, we explored different techniques for encoding categorical data into numerical formats suitable for machine learning models. Keep in mind that the choice of encoding method can significantly impact model performance, and that some categorical variables may have natural relationships that should be considered when choosing an encoding method.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.