# Encode Categorical Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This module covers techniques for encoding categorical data into numerical formats for machine learning models. We'll explore different encoding methods including ordinal encoding, one-hot encoding, and dummy variable encoding using a breast cancer dataset as a practical example.

## Learning Objectives

- Learn why encoding is required for preprocessing categorical data in machine learning algorithms
- Understand how to use ordinal encoding for categorical variables with natural rank ordering
- Understand one-hot encoding techniques for categorical variables without natural rank ordering
- Apply encoding techniques to real medical data for breast cancer prediction

## Prerequisites

- Basic understanding of Python programming
- Familiarity with the NumPy library
- Knowledge of basic statistical concepts

## Get Started

To start, we install the required packages and import the necessary libraries.

### Install required packages

In [1]:
# Install the numpy, pandas, and scikit-learn Python libraries using pip.
# numpy:  A fundamental package for numerical computation in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
# pandas: A powerful data manipulation and analysis library. It offers data structures like DataFrames for efficiently handling and analyzing structured data (tabular data, time series, etc.).
# scikit-learn: A widely used machine learning library in Python. It provides simple and efficient tools for data mining and data analysis, including various classification, regression, clustering algorithms, model selection, and preprocessing tools.
%pip install numpy pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


### Import necessary libraries

In [2]:
# Import the 'asarray' function from the NumPy library.
# This function is used to convert input to a NumPy array.
from numpy import asarray

# Import the 'read_csv' function from the pandas library.
# This function is used to read data from a CSV file into a pandas DataFrame.
from pandas import read_csv

# Import the 'LogisticRegression' class from the scikit-learn library (sklearn).
# This class is used to create a logistic regression model for classification tasks.
from sklearn.linear_model import LogisticRegression

# Import the 'accuracy_score' function from the scikit-learn metrics module.
# This function is used to calculate the accuracy of a classification model.
from sklearn.metrics import accuracy_score

# Import the 'train_test_split' function from the scikit-learn model_selection module.
# This function is used to split a dataset into training and testing sets.
from sklearn.model_selection import train_test_split

# Import 'LabelEncoder', 'OneHotEncoder', and 'OrdinalEncoder' classes from scikit-learn preprocessing module.
# These classes are used for encoding categorical variables into numerical representations.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

## Breast Cancer Categorical Dataset

The breast cancer dataset classifies each breast cancer patient as either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))

### Load and summarize the dataset

In [3]:
# Define a variable 'breast_cancer_csv' that stores the file path to the breast cancer dataset CSV file.
# The file path is relative to the current script's location and points to a file named "breast-cancer.csv" in the "../../Data/" directory.
breast_cancer_csv = "../../Data/breast-cancer.csv"

# Load the dataset from the CSV file into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Print the first few rows of the DataFrame to inspect the loaded data.
# 'dataset.head()' displays the first 5 rows by default, allowing for a quick data preview.
print(dataset.head())

# Retrieve the underlying NumPy array from the pandas DataFrame.
# 'dataset.values' converts the DataFrame into a NumPy array for numerical operations and indexing.
data = dataset.values

# Separate the data into input features (X) and output labels (y).
# 'data[:, :-1]' selects all rows (:) and all columns except the last one (:-1) as input features (X).
# '.astype(str)' converts the input features to string type, assuming the features are meant to be strings.
X = data[:, :-1].astype(str)

# 'data[:, -1]' selects all rows (:) and only the last column (-1) as output labels (y).
# '.astype(str)' converts the output labels to string type, assuming the labels are meant to be strings.
y = data[:, -1].astype(str)

# Summarize the shape of the input features (X) and output labels (y).
# 'X.shape' returns a tuple representing the dimensions of the input feature array (rows, columns).
# 'y.shape' returns a tuple representing the dimensions of the output label array (rows,).
print("Input", X.shape)
print("Output", y.shape)

         0          1        2      3      4    5        6           7      8  \
0  '40-49'  'premeno'  '15-19'  '0-2'  'yes'  '3'  'right'   'left_up'   'no'   
1  '50-59'     'ge40'  '15-19'  '0-2'   'no'  '1'  'right'   'central'   'no'   
2  '50-59'     'ge40'  '35-39'  '0-2'   'no'  '2'   'left'  'left_low'   'no'   
3  '40-49'  'premeno'  '35-39'  '0-2'  'yes'  '3'  'right'  'left_low'  'yes'   
4  '40-49'  'premeno'  '30-34'  '3-5'  'yes'  '2'   'left'  'right_up'   'no'   

                        9  
0     'recurrence-events'  
1  'no-recurrence-events'  
2     'recurrence-events'  
3  'no-recurrence-events'  
4     'recurrence-events'  
Input (286, 9)
Output (286,)


We can see that we have 286 examples and nine input variables.



## Nominal And Ordinal Variables

* **Nominal Variable**. Variable comprises a finite set of discrete values with no rank-order
relationship between values.
* **Ordinal Variable**. Variable comprises a finite set of discrete values with a ranked
ordering between values.

Some algorithms can work with categorical data directly. For example, a decision tree can
be learned directly from categorical data with no data transform required (this depends on
the specific implementation). Many machine learning algorithms, however, cannot operate on
label data directly. They require all input variables and output variables to be numeric. In
general, this is mostly a constraint of the efficient implementation of machine learning
algorithms rather than a hard limitation of the algorithms themselves.

This means that categorical data must be converted to a numerical form before it is passed
to such implementations. If the categorical variable is an output variable, you may also want to
convert predictions by the model back into a categorical form in order to present them or use
them in some application.
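
As a minimal sketch of converting predictions back into a categorical form, the example below uses scikit-learn's `LabelEncoder` and its `inverse_transform()` method. The `predictions` array is made up for illustration and stands in for the output of a real model.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder

# Toy target labels that reuse the class names of the breast cancer dataset.
labels = asarray(["no-recurrence-events", "recurrence-events", "no-recurrence-events"])

# Fit the encoder and convert the string labels to integers.
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)
print(encoded)

# Hypothetical integer predictions (not produced by a real model).
predictions = asarray([1, 0, 1])

# Map the integers back to the original string labels.
decoded = label_encoder.inverse_transform(predictions)
print(decoded)
```

The same fitted encoder must be used for both directions, since the integer assigned to each label is learned during fitting.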

## Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical
values. They are:
* Ordinal Encoding
* One-Hot Encoding
* Dummy Variable Encoding

### Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value. An integer ordinal encoding is a natural encoding for ordinal variables. For categorical
variables, it imposes an ordinal relationship where no such relationship may exist. This can
cause problems and a one-hot encoding may be used instead.

In [None]:
# example of an ordinal encoding

# Encode categorical features as an integer array

# Define the categorical data as a NumPy array.
# In this example, the data consists of color names: 'red', 'green', 'blue'.
data = asarray([["red"], ["green"], ["blue"]])
# Print the original categorical data to show the input before encoding.
print("Original data: \n", data)

# Define the ordinal encoder.
# OrdinalEncoder is used to convert categorical data into numerical ordinal data.
# By default, it sorts the unique categories (alphabetically for strings) and assigns integers in that sorted order.
encoder = OrdinalEncoder()

# Fit OrdinalEncoder to the data and then transform the data.
# .fit(data) learns the unique categories from the input data. In this case, it learns 'red', 'green', 'blue'.
# .transform(data) then replaces each category in the data with its corresponding ordinal integer.
result = encoder.fit_transform(data)
# Print the encoded data to show the numerical representation after ordinal encoding.
# The output will be a NumPy array where each color is replaced by an integer.
print("Encoded data: \n", result)

We can see that the numbers are assigned to the labels as we expected: the categories are sorted alphabetically, so blue is 0, green is 1, and red is 2.

This **OrdinalEncoder** class is intended for input variables that are organized into rows and
columns, e.g. a matrix. If a categorical target variable needs to be encoded for a classification
problem, then the **LabelEncoder** class can be used. It does the same
thing as the **OrdinalEncoder**, although it expects a one-dimensional input for the single target
variable.
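
The difference in expected input shape can be illustrated with a short sketch. It reuses the toy color data from the example above; the variable names are chosen here for illustration only.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# The same three colors, once as a column (2-D) and once as a flat vector (1-D).
colors_2d = asarray([["red"], ["green"], ["blue"]])
colors_1d = asarray(["red", "green", "blue"])

# OrdinalEncoder expects rows and columns and returns a 2-D float array.
ordinal_result = OrdinalEncoder().fit_transform(colors_2d)
print(ordinal_result.shape, ordinal_result.ravel())

# LabelEncoder expects a 1-D array and returns a 1-D integer array.
label_encoder = LabelEncoder()
label_result = label_encoder.fit_transform(colors_1d)
print(label_result.shape, label_result)

# Both encoders sort the categories, so blue=0, green=1, red=2.
print(label_encoder.classes_)
```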

### One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may not be
enough, or may even be misleading to the model. Forcing an ordinal relationship via an ordinal
encoding and allowing the model to assume a natural ordering between categories may result in poor
performance or unexpected results (predictions halfway between categories). In this case, a
one-hot encoding can be applied to the ordinal representation. This is where the integer encoded
variable is removed and one new binary variable is added for each unique integer value in the
variable.

In [None]:
# example of a one-hot encoding

# Encode categorical features as a one-hot numeric array.

# define data
# Define a NumPy array named 'data' containing categorical data.
# In this case, it's a list of color names, each as a single-element list (representing a feature for each sample).
data = asarray([["red"], ["green"], ["blue"]])
# Print the original categorical data array.
print(data)

# define one-hot encoding
# Initialize a OneHotEncoder object named 'encoder'.
# sparse_output=False argument ensures that the output of the encoder will be a NumPy array, not a sparse matrix.
# Sparse matrices are memory-efficient for high-dimensional data with many zeros, but for this example, a dense array is easier to understand.
encoder = OneHotEncoder(sparse_output=False)

# Fit OneHotEncoder to data, then transform data.
# Fit the OneHotEncoder to the 'data' array. This learns the unique categories in each feature.
# Then, transform the 'data' array into a one-hot encoded numerical array.
onehot = encoder.fit_transform(data)
# Print the resulting one-hot encoded array.
# Each row now represents a sample, and each column represents a unique category.
# A '1' indicates the presence of that category for the sample, and '0' indicates absence.
print(onehot)

We can see the one-hot encoding matches our expectation of three binary variables in the order blue, green, and red.

### Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category. The problem is that this
representation includes redundancy. For example, if we know that `[1, 0, 0]` represents blue and
`[0, 1, 0]` represents green, we don't need another binary variable to represent red. Instead, we
could use 0 values alone, e.g. `[0, 0]`. This is called a dummy variable encoding, and it always
represents `C` categories with `C - 1` binary variables.

We can use the `OneHotEncoder` class to implement a dummy encoding as well as a one-hot
encoding. The `drop` argument can be set to indicate which category will become the one that is
assigned all zero values, called the baseline. We can set this to `"first"` so that the first
category is used. When the labels are sorted alphabetically, the blue label will be the first and
will become the baseline.

In [None]:
# example of a dummy variable encoding

# define data
# Create a NumPy array named 'data' containing categorical strings: "red", "green", and "blue".
# Each category is placed in its own row, creating a column vector-like structure.
data = asarray([["red"], ["green"], ["blue"]])
# Print the original categorical data array to the console.
print(data)

# define the dummy variable encoding
# Initialize the OneHotEncoder.
# drop="first":  Specifies to drop the first category in each feature to avoid multicollinearity in some models.
#                If only one category is present for a feature, the feature will be dropped entirely.
# sparse_output=False:  Sets the encoder to return a NumPy array instead of a sparse matrix.
#                       Sparse matrices are efficient for datasets with many zeros, but arrays are often easier to work with directly.
encoder = OneHotEncoder(drop="first", sparse_output=False)

# Fit OneHotEncoder to data, then transform data.
# Fit the OneHotEncoder to the 'data' array.
# 'fit' learns the unique categories present in the data.
# 'transform' then applies the one-hot encoding to the data based on the learned categories.
# 'fit_transform' combines both steps: it fits the encoder to the data and then transforms the data in a single step.
onehot = encoder.fit_transform(data)
# Print the resulting dummy encoded array to the console.
# Each row has C - 1 = 2 binary values; the baseline category (blue) is encoded as all zeros.
print(onehot)

### `OrdinalEncoder` Transform

An ordinal encoding involves mapping each unique label to an integer value. This type of
encoding is really only appropriate if there is a known relationship between the categories. This
relationship does exist for some of the variables in our dataset, and ideally, this should be
harnessed when preparing the data. In this case, we will ignore any possible existing ordinal
relationship and assume all variables are categorical. It can still be helpful to use an ordinal
encoding, at least as a point of reference with other encoding schemes.
We can use the `OrdinalEncoder` from scikit-learn to encode each variable to integers.
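
If a known ordering were to be harnessed instead of ignored, one possible approach is the `categories` argument of `OrdinalEncoder`. The sketch below uses a few tumor-size style ranges as toy values (written without the quote characters that appear in the CSV file) and is an illustration rather than part of the original workflow.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# A few tumor-size style ranges used as toy data.
sizes = asarray([["10-14"], ["0-4"], ["5-9"]])

# Default behavior: categories are sorted as strings, so "10-14" lands before "5-9".
default_encoder = OrdinalEncoder()
default_result = default_encoder.fit_transform(sizes).ravel()
print(default_result)

# Explicit order: pass one list of categories per column, smallest to largest.
size_order = ["0-4", "5-9", "10-14"]
ordered_encoder = OrdinalEncoder(categories=[size_order])
ordered_result = ordered_encoder.fit_transform(sizes).ravel()
print(ordered_result)
```

With the explicit list, the integer codes follow the real-world order of the ranges rather than their string sort order.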

#### Ordinal Encode The Breast Cancer Dataset


In [None]:
# load the dataset
# Load the dataset from the CSV path stored in 'breast_cancer_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# retrieve the array of data
# Retrieve the underlying NumPy array from the pandas DataFrame 'dataset'.
# The values are still strings at this point; they become numeric only after encoding.
data = dataset.values

# separate into input and output columns
# Separate the dataset into input features (X) and the target variable (y).
# X is assigned all columns except the last one (':-1').
# y is assigned only the last column ('-1').
# '.astype(str)' ensures that the data is treated as strings initially, which is important for ordinal encoding if the data is mixed type.
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# ordinal encode input variables
# Initialize an OrdinalEncoder object from scikit-learn.
# Ordinal encoding converts each categorical value into an integer code.
# By default the categories are sorted as strings, so a real-world order is not automatically preserved.
ordinal_encoder = OrdinalEncoder()
# Fit the OrdinalEncoder to the input features X and then transform X.
# 'fit_transform' learns the unique categories in each feature and replaces them with ordinal integers.
X = ordinal_encoder.fit_transform(X)

# ordinal encode target variable
# Initialize a LabelEncoder object from scikit-learn.
# Label encoding converts categorical labels into numerical values.
label_encoder = LabelEncoder()
# Fit the LabelEncoder to the target variable y and then transform y.
# 'fit_transform' learns the unique labels in y and replaces them with integers starting from 0.
y = label_encoder.fit_transform(y)

# summarize the transformed data
# Print a summary of the transformed input features X.
# "Input", X.shape prints the string "Input" followed by the shape (number of rows and columns) of X.
print("Input", X.shape)
# Print the first 5 rows and all columns of the transformed input features X.
# X[:5, :] selects the first 5 rows and all columns.
print(X[:5, :])
# Print a summary of the transformed target variable y.
# "Output", y.shape prints the string "Output" followed by the shape (number of elements) of y.
print("Output", y.shape)
# Print the first 5 elements of the transformed target variable y.
# y[:5] selects the first 5 elements of y.
print(y[:5])

We would expect the number of rows and, in this case, the number of columns to be unchanged,
with all string values replaced by integer codes. As expected, the number of variables is
unchanged, and all values are now ordinal encoded integers (stored as floating-point numbers by `OrdinalEncoder`).

Next, let's evaluate machine learning on this dataset with this encoding. The best practice
when encoding variables is to fit the encoding on the training dataset, then apply it to the train
and test datasets. We will first split the dataset, then prepare the encoding on the training set,
and apply it to the test set.
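
One consequence of fitting on the training set only is that the test set may contain a category the encoder has never seen, which raises an error with the default settings. A hedged sketch of one way to handle this uses the `handle_unknown` and `unknown_value` arguments of `OrdinalEncoder`; the toy split and the choice of `-1` are assumptions made for illustration.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# Toy split: "blue" appears only in the test rows.
X_train = asarray([["red"], ["green"], ["red"]])
X_test = asarray([["green"], ["blue"]])

# Fit on the training rows only, mapping unseen categories to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

# Apply the same fitted encoder to both splits.
train_encoded = encoder.transform(X_train).ravel()
test_encoded = encoder.transform(X_test).ravel()
print(train_encoded)
print(test_encoded)
```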

#### Logistic Regression With Ordinal Encoding

Next, we evaluate logistic regression on the breast cancer dataset with an ordinal encoding.

In [None]:
# load the dataset from the CSV path stored in 'breast_cancer_csv'.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# retrieve the array of data from the pandas DataFrame.
data = dataset.values

# separate the data into input (X) and output (y) columns.
# X is assigned all columns except the last one (features).
# [:, :-1] selects all rows and all columns except the last one.
# .astype(str) casts the input features to string type.
X = data[:, :-1].astype(str)
# y is assigned the last column (target variable).
# [:, -1] selects all rows and only the last column.
# .astype(str) casts the target variable to string type.
y = data[:, -1].astype(str)

# split the dataset into training and testing sets.
# X_train, y_train will be used for training the model.
# X_test, y_test will be used for evaluating the model's performance.
# test_size=0.33 means 33% of the data will be used for testing, and 67% for training.
# random_state=1 ensures that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# initialize OrdinalEncoder to convert categorical input features to numerical.
ordinal_encoder = OrdinalEncoder()
# fit the OrdinalEncoder on the training input data to learn the categories.
ordinal_encoder.fit(X_train)
# transform the training input data using the fitted OrdinalEncoder.
X_train = ordinal_encoder.transform(X_train)
# transform the testing input data using the fitted OrdinalEncoder.
X_test = ordinal_encoder.transform(X_test)

# initialize LabelEncoder to convert categorical target variable to numerical labels.
label_encoder = LabelEncoder()
# fit the LabelEncoder on the training target variable to learn the unique classes.
label_encoder.fit(y_train)
# transform the training target variable into numerical labels using the fitted LabelEncoder.
y_train = label_encoder.transform(y_train)
# transform the testing target variable into numerical labels using the fitted LabelEncoder.
y_test = label_encoder.transform(y_test)

# define the logistic regression model.
model = LogisticRegression()

# fit the logistic regression model to the training data.
# X_train is the training input features, and y_train is the training target variable.
model.fit(X_train, y_train)

# make predictions on the test set using the trained model.
yhat = model.predict(X_test)

# evaluate the model's predictions by calculating the accuracy score.
# accuracy_score compares the true labels (y_test) with the predicted labels (yhat).
accuracy = accuracy_score(y_test, yhat)
# print the accuracy of the model in percentage format, rounded to two decimal places.
print("Accuracy: %.2f" % (accuracy * 100))

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a
reasonable score.

### `OneHotEncoder` Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between
categories. The scikit-learn library provides the `OneHotEncoder` class to automatically one-hot
encode one or more variables. By default the `OneHotEncoder` will output data with a sparse
representation, which is efficient given that most values are 0 in the encoded representation.
We will disable this feature by setting the `sparse_output` argument to `False` so that we can
review the effect of the encoding. Once defined, we can call the `fit_transform()` function and
pass it our dataset to create a one-hot encoded version of our dataset.
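
The difference between the sparse and dense outputs can be seen on the toy color data. This sketch relies on the default `OneHotEncoder()` settings and on SciPy's `issparse()` helper; it is meant as an illustration only.

```python
from numpy import asarray
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

data = asarray([["red"], ["green"], ["blue"]])

# With default settings the encoder returns a SciPy sparse matrix.
sparse_result = OneHotEncoder().fit_transform(data)
print(issparse(sparse_result), sparse_result.shape)

# toarray() converts the sparse matrix to a dense NumPy array for inspection.
dense_result = sparse_result.toarray()
print(dense_result)
```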

#### One-hot Encode The Breast Cancer Dataset

In [None]:
# Load the dataset from the CSV path stored in 'breast_cancer_csv' into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Retrieve the array of data from the pandas DataFrame.
# '.values' attribute returns a NumPy array representation of the DataFrame.
data = dataset.values

# Separate the data into input (X) and output (y) columns.
# 'data[:, :-1]' selects all rows and all columns except the last one for input features (X).
# '.astype(str)' converts the input features to string type, which is often necessary before one-hot encoding.
X = data[:, :-1].astype(str)
# 'data[:, -1]' selects all rows and only the last column for the target variable (y).
# '.astype(str)' converts the target variable to string type, which is suitable for label encoding.
y = data[:, -1].astype(str)

# One-hot encode the input variables (X).
# Initialize OneHotEncoder with 'sparse_output=False' to return a dense NumPy array instead of a sparse matrix.
onehot_encoder = OneHotEncoder(sparse_output=False)
# Fit the OneHotEncoder to the input data X and then transform X into one-hot encoded features.
# 'fit_transform' learns the unique categories in each feature and then transforms the data.
X = onehot_encoder.fit_transform(X)

# Ordinal encode the target variable (y). Although named 'LabelEncoder', it performs ordinal encoding for multiple classes.
# Initialize LabelEncoder.
label_encoder = LabelEncoder()
# Fit the LabelEncoder to the target variable y and then transform y into label-encoded integers.
# 'fit_transform' learns the unique classes in y and then transforms them into numerical labels.
y = label_encoder.fit_transform(y)

# Summarize the transformed data.
# Print the shape of the input data X after one-hot encoding.
# 'X.shape' returns a tuple representing the dimensions of X (number of rows, number of columns).
print("Input", X.shape)
# Print the first 5 rows and all columns of the transformed input data X.
# 'X[:5, :]' selects the first 5 rows and all columns.
print(X[:5, :])

We would expect the number of rows to remain the same, but the number of columns to
dramatically increase. As expected, in this case, we can see that the number of variables has
leaped up from 9 to 43 and all values are now binary values 0 or 1.

Next, let's evaluate machine learning on this dataset with this encoding as we did in the
previous section. The encoding is fit on the training set then applied to both train and test sets
as before.

#### Logistic Regression With One-Hot Encoding 

Next, we evaluate logistic regression on the breast cancer dataset with a one-hot encoding.

In [None]:
# load the dataset
# Load the dataset from the CSV path stored in 'breast_cancer_csv' into a pandas DataFrame.
# 'read_csv' was imported from pandas at the top of the notebook; 'header=None' indicates there is no header row.
dataset = read_csv(breast_cancer_csv, header=None)

# retrieve the array of data
# Extract the values from the pandas DataFrame and convert it into a NumPy array.
# This is often done to work with scikit-learn functions which often expect NumPy arrays.
data = dataset.values

# separate into input and output columns
# Separate the dataset into input features (X) and the target variable (y).
# X is assigned all columns except the last one ([:-1]), and y is assigned the last column ([-1]).
# .astype(str) converts the data type to string, likely for handling categorical data before encoding.
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# split the dataset into train and test sets
# Split the dataset into training and testing sets using the train_test_split function.
# X_train, y_train will be used for training the model.
# X_test, y_test will be used for evaluating the model's performance on unseen data.
# test_size=0.33 specifies that 33% of the data will be used for testing, and the rest for training.
# random_state=1 ensures that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# one-hot encode input variables
# Initialize a OneHotEncoder object to perform one-hot encoding on categorical features.
onehot_encoder = OneHotEncoder()
# Fit the OneHotEncoder on the training input data (X_train).
# This learns the categories to be encoded from the training set.
onehot_encoder.fit(X_train)
# Transform the training input data (X_train) using the fitted OneHotEncoder.
# This converts categorical features into numerical one-hot encoded features.
X_train = onehot_encoder.transform(X_train)
# Transform the testing input data (X_test) using the fitted OneHotEncoder.
# It's important to use the encoder fitted on the training data to ensure consistency.
X_test = onehot_encoder.transform(X_test)

# ordinal encode target variable
# Initialize a LabelEncoder object to perform ordinal encoding on the target variable.
# LabelEncoder is used here to convert string labels into numerical labels.
label_encoder = LabelEncoder()
# Fit the LabelEncoder on the training target variable (y_train).
# This learns the unique classes in the training target.
label_encoder.fit(y_train)
# Transform the training target variable (y_train) using the fitted LabelEncoder.
# This converts string labels in y_train to numerical labels.
y_train = label_encoder.transform(y_train)
# Transform the testing target variable (y_test) using the fitted LabelEncoder.
# Use the same fitted encoder from training data for consistent encoding.
y_test = label_encoder.transform(y_test)

# define the model
# Define a Logistic Regression model.
# Logistic Regression is a linear model used for binary and multiclass classification.
model = LogisticRegression()

# fit on the training set
# Train the Logistic Regression model using the training data (X_train, y_train).
# The model learns the relationship between the features and the target variable from the training data.
model.fit(X_train, y_train)

# predict on test set
# Use the trained Logistic Regression model to make predictions on the test input data (X_test).
# yhat will contain the predicted class labels for the test set.
yhat = model.predict(X_test)

# evaluate predictions
# Evaluate the performance of the model by calculating the accuracy score.
# accuracy_score function compares the true labels (y_test) with the predicted labels (yhat).
accuracy = accuracy_score(y_test, yhat)
# Print the accuracy of the model in percentage format, rounded to two decimal places.
print("Accuracy: %.2f" % (accuracy * 100))

In this case, the model achieved a classification accuracy of about 70.53 percent, which is
worse than the ordinal encoding in the previous section.

## Conclusion

In this module, we explored different techniques for encoding categorical data into numerical formats suitable for machine learning models. Keep in mind that the choice of encoding method can significantly impact model performance, and that some categorical variables have natural relationships that should be considered when choosing an encoding method.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.