# Encode Categorical Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This module covers techniques for encoding categorical data into numerical formats for machine learning models. We'll explore different encoding methods including ordinal encoding, one-hot encoding, and dummy variable encoding using a breast cancer dataset as a practical example.

## Learning Objectives

- Learn why encoding is required for preprocessing categorical data in machine learning algorithms
- Understand how to use ordinal encoding for categorical variables with natural rank ordering
- Understand one-hot encoding techniques for categorical variables without natural rank ordering
- Apply encoding techniques to real medical data for breast cancer prediction

## Prerequisites

- Basic understanding of Python programming
- Familiarity with the NumPy library
- Knowledge of basic statistical concepts

## Get Started

To start, we install the required packages and import the necessary libraries.

### Install required packages

In [None]:
# Install the numpy, pandas, and scikit-learn Python libraries using pip.
# numpy:  A fundamental package for numerical computation in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
# pandas: A powerful data manipulation and analysis library. It offers data structures like DataFrames for efficiently handling and analyzing structured data (tabular data, time series, etc.).
# scikit-learn: A widely used machine learning library in Python. It provides simple and efficient tools for data mining and data analysis, including various classification, regression, clustering algorithms, model selection, and preprocessing tools.
%pip install numpy pandas scikit-learn

### Import necessary libraries

In [None]:
# Import the 'asarray' function from the NumPy library.
# This function is used to convert input to a NumPy array.
from numpy import asarray

# Import the 'read_csv' function from the pandas library.
# This function is used to read data from a CSV file into a pandas DataFrame.
from pandas import read_csv

# Import the 'LogisticRegression' class from the scikit-learn library (sklearn).
# This class is used to create a logistic regression model for classification tasks.
from sklearn.linear_model import LogisticRegression

# Import the 'accuracy_score' function from the scikit-learn metrics module.
# This function is used to calculate the accuracy of a classification model.
from sklearn.metrics import accuracy_score

# Import the 'train_test_split' function from the scikit-learn model_selection module.
# This function is used to split a dataset into training and testing sets.
from sklearn.model_selection import train_test_split

# Import 'LabelEncoder', 'OneHotEncoder', and 'OrdinalEncoder' classes from scikit-learn preprocessing module.
# These classes are used for encoding categorical variables into numerical representations.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

## Breast Cancer Categorical Dataset

The breast cancer dataset classifies each breast cancer patient as either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))

### Load and summarize the dataset

In [None]:
# Define a variable 'breast_cancer_csv' that stores the file path to the breast cancer dataset CSV file.
# The file path is relative to the current script's location and points to a file named "breast-cancer.csv" in the "../../Data/" directory.
breast_cancer_csv = "../../Data/breast-cancer.csv"

# Load the dataset from the CSV file into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Print the first few rows of the DataFrame to inspect the loaded data.
# 'dataset.head()' displays the first 5 rows by default, allowing for a quick data preview.
print(dataset.head())

# Retrieve the underlying NumPy array from the pandas DataFrame.
# 'dataset.values' returns the DataFrame contents as a NumPy array, which makes slicing and indexing straightforward.
data = dataset.values

# Separate the data into input features (X) and output labels (y).
# 'data[:, :-1]' selects all rows (:) and all columns except the last one (:-1) as input features (X).
# '.astype(str)' converts the input features to string type, assuming the features are meant to be strings.
X = data[:, :-1].astype(str)

# 'data[:, -1]' selects all rows (:) and only the last column (-1) as output labels (y).
# '.astype(str)' converts the output labels to string type, assuming the labels are meant to be strings.
y = data[:, -1].astype(str)

# Summarize the shape of the input features (X) and output labels (y).
# 'X.shape' returns a tuple representing the dimensions of the input feature array (rows, columns).
# 'y.shape' returns a tuple representing the dimensions of the output label array (rows,).
print("Input", X.shape)
print("Output", y.shape)

We can see that we have 286 examples and nine input variables.



## Nominal and Ordinal Variables

### Nominal Variable
- A **nominal variable** consists of a finite set of discrete values with **no rank-order relationship** between them. Examples include categories like colors (e.g., red, blue, green) or types of fruit (e.g., apple, banana, orange).

### Ordinal Variable
- An **ordinal variable** consists of a finite set of discrete values with a **ranked ordering** between them. Examples include education levels (e.g., high school, bachelor’s, master’s) or survey ratings (e.g., poor, fair, good, excellent).
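
The distinction can be made concrete with pandas, which lets a categorical column be declared ordered or unordered. The following is a minimal sketch; the color and rating values are illustrative and do not come from the breast cancer dataset.

```python
import pandas as pd

# Nominal: no ranking between the values (illustrative data).
colors = pd.Categorical(["red", "blue", "green", "blue"])

# Ordinal: the ranking is stated explicitly through 'categories' and 'ordered=True'.
ratings = pd.Categorical(
    ["fair", "poor", "good"],
    categories=["poor", "fair", "good", "excellent"],
    ordered=True,
)

# An unordered categorical reports ordered == False.
print(colors.ordered)

# Integer codes follow the declared ranking: poor=0, fair=1, good=2, excellent=3.
print(ratings.codes)

# Order-based operations such as max() are only meaningful for the ordinal variable.
print(ratings.max())
```

Declaring the order up front keeps the ranking attached to the data rather than leaving it implicit in the encoding step.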


### Handling Categorical Data in Machine Learning

Some algorithms, such as **decision trees**, can work directly with categorical data without requiring any transformation. However, this depends on the specific implementation of the algorithm. Many machine learning algorithms, on the other hand, require all input and output variables to be **numeric**. This is often a constraint of the efficient implementation of these algorithms rather than a fundamental limitation of the algorithms themselves.


### Converting Categorical Data to Numerical Form

When working with algorithms that require numerical data, categorical variables must be converted into a numerical format. Common techniques include:
- **Label Encoding**: Assigning a unique integer to each category.
- **One-Hot Encoding**: Creating binary columns for each category.

If the categorical variable is an **output variable**, you may also need to convert the model’s numerical predictions back into their original categorical form in order to present them or use them in some application.
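
As a minimal sketch of that round trip, scikit-learn's `LabelEncoder` provides `inverse_transform()` to map integers back to the original labels. The label strings below reuse the class names from the dataset description, but the rows and the "predictions" are made up for illustration.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder

# Illustrative target values (made-up rows, real class names).
labels = asarray(["no-recurrence-events", "recurrence-events", "no-recurrence-events"])

# Learn the label-to-integer mapping and encode the labels.
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# Stand-in for the integer output of a model's predict() call.
predicted = asarray([1, 0])

# Map the integers back to the original class names.
decoded = label_encoder.inverse_transform(predicted)

print(encoded)
print(decoded)
```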

## Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:
* Ordinal Encoding
* One-Hot Encoding
* Dummy Variable Encoding

### Ordinal Encoding

In **ordinal encoding**, each unique category value is assigned an integer value. This encoding is a natural fit for **ordinal variables**, where the categories have an inherent order or ranking (e.g., education levels: high school = 1, bachelor’s = 2, master’s = 3).

However, for **nominal variables** (categories with no inherent order), using ordinal encoding can introduce an artificial ordinal relationship where none exists. This can mislead machine learning models by implying a false hierarchy or ranking among categories.

To avoid this issue, **one-hot encoding** is often used for nominal variables. One-hot encoding creates binary columns for each category, ensuring that no unintended ordinal relationship is imposed.

In [None]:
# Example of an ordinal encoding

# Encode categorical features as an integer array

# Define the categorical data as a NumPy array.
# In this example, the data consists of color names: 'red', 'green', 'blue'.
data = asarray([["red"], ["green"], ["blue"]])

# Print the original categorical data to show the input before encoding.
print("Original data: \n", data)

# Define the ordinal encoder.
# OrdinalEncoder is used to convert categorical data into numerical ordinal data.
# By default, it sorts the unique categories of each feature (alphabetically for strings) and numbers them from 0.
encoder = OrdinalEncoder()

# Fit OrdinalEncoder to the data and then transform the data.
# .fit(data) learns the unique categories from the input data. In this case, it learns 'red', 'green', 'blue'.
# .transform(data) then replaces each category in the data with its corresponding ordinal integer.
result = encoder.fit_transform(data)

# Print the encoded data to show the numerical representation after ordinal encoding.
# The output will be a NumPy array where each color is replaced by an integer.
print("Encoded data: \n", result)

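We can observe that each label is assigned an integer. By default, `OrdinalEncoder` sorts the categories (alphabetically for strings), so **blue** is encoded as 0, **green** as 1, and **red** as 2. Colors have no natural ranking, so these integers are arbitrary codes; for a truly ordinal variable, the intended order has to be supplied explicitly.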

The **OrdinalEncoder** class is designed for encoding input variables organized in a **matrix format** (rows and columns). By default it numbers each variable's categories in sorted order; to encode a genuine ranking, the desired order can be passed through its `categories` argument.
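
The sketch below contrasts the default sorted order with an explicit ranking passed via `categories`. The feature values (`low`, `medium`, `high`) and their ranking are assumptions made for illustration.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# Illustrative ordinal feature; values and ranking are assumed for this sketch.
sizes = asarray([["medium"], ["low"], ["high"]])

# Default: categories are sorted alphabetically (high, low, medium).
default_encoded = OrdinalEncoder().fit_transform(sizes)

# Explicit ranking: low < medium < high, one list of categories per input column.
ranked_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ranked_encoded = ranked_encoder.fit_transform(sizes)

print(default_encoded.ravel())  # [2. 1. 0.]
print(ranked_encoded.ravel())   # [1. 0. 2.]
```

Only the second encoding reflects the intended ranking; the default codes happen to place `high` below `low`.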

For encoding a **categorical target variable** in classification problems, the **LabelEncoder** class is used. It performs a similar function to the `OrdinalEncoder`, but it is specifically tailored for **one-dimensional input**, such as a single target variable.

### One-Hot Encoding

For **categorical variables** where no ordinal relationship exists, using an integer encoding can be insufficient or even misleading. Imposing an artificial ordinal relationship through ordinal encoding may cause the model to incorrectly assume a natural ordering between categories. This can lead to **poor performance** or **unexpected results**, such as predictions that fall between categories.

To address this issue, **one-hot encoding** can be applied. In this approach:
1. The integer-encoded variable is removed.
2. A new **binary variable** is added for each unique integer value (category).
3. Each binary variable indicates the presence (1) or absence (0) of a specific category.

This ensures that no false ordinal relationship is introduced, allowing the model to interpret each category independently.
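
The steps above can be traced by hand with NumPy before turning to scikit-learn. This is only a sketch of the idea, not how `OneHotEncoder` is implemented internally.

```python
import numpy as np

colors = np.array(["red", "green", "blue"])

# Integer-encode: np.unique returns the sorted categories and, with
# return_inverse=True, the integer index of each original value.
categories, integer_codes = np.unique(colors, return_inverse=True)

# Replace each integer with a row that has a 1 in the matching column
# and 0 everywhere else (one binary column per category).
onehot = np.eye(len(categories), dtype=int)[integer_codes]

print(categories)     # ['blue' 'green' 'red']
print(integer_codes)  # [2 1 0]
print(onehot)
```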

In [None]:
# Example of a one-hot encoding

# Encode categorical features as a one-hot numeric array.

# Define data
# Define a NumPy array named 'data' containing categorical data.
# In this case, it's a list of color names, each as a single-element list (representing a feature for each sample).
data = asarray([["red"], ["green"], ["blue"]])
# Print the original categorical data array.
print(data)

# Define one-hot encoding
# Initialize a OneHotEncoder object named 'encoder'.
# sparse_output=False argument ensures that the output of the encoder will be a NumPy array, not a sparse matrix.
# Sparse matrices are memory-efficient for high-dimensional data with many zeros, but for this example, a dense array is easier to understand.
encoder = OneHotEncoder(sparse_output=False)

# Fit OneHotEncoder to data, then transform data.
# Fit the OneHotEncoder to the 'data' array. This learns the unique categories in each feature.
# Then, transform the 'data' array into a one-hot encoded numerical array.
onehot = encoder.fit_transform(data)

# Print the resulting one-hot encoded array.
# Each row now represents a sample, and each column represents a unique category.
# A '1' indicates the presence of that category for the sample, and '0' indicates absence.
print(onehot)

We can observe that the one-hot encoding aligns with our expectations, creating **3 binary variables** corresponding to the categories: **blue**, **green**, and **red**.

### Dummy Variable Encoding

One-hot encoding creates one binary variable for each category, but this representation includes **redundancy**. For example:
- If `[1, 0, 0]` represents **blue** and `[0, 1, 0]` represents **green**, we don’t need a third binary variable to represent **red**. Instead, we can use `[0, 0]` to implicitly represent red.

This simplified approach is called **dummy variable encoding**. It represents `C` categories using only `C - 1` binary variables, eliminating redundancy while retaining all necessary information.

We can use the **OneHotEncoder** class to implement both **one-hot encoding** and **dummy variable encoding**. The `drop` argument allows us to specify which category will serve as the **baseline** (assigned all zero values). For example, setting `drop='first'` ensures that the first category becomes the baseline.

Because the encoder sorts the labels alphabetically, **blue** comes first and becomes the baseline in the example below: blue is encoded as `[0, 0]`, green as `[1, 0]`, and red as `[0, 1]`. This approach reduces redundancy while maintaining the necessary information for encoding categorical variables.

In [None]:
# Example of a dummy variable encoding

# Define data
# Create a NumPy array named 'data' containing categorical strings: "red", "green", and "blue".
# Each category is placed in its own row, creating a column vector-like structure.
data = asarray([["red"], ["green"], ["blue"]])

# Print the original categorical data array to the console.
print(data)

# Define one-hot encoding
# Initialize the OneHotEncoder.
# drop="first":  Specifies to drop the first category in each feature to avoid multicollinearity in some models.
#                If only one category is present for a feature, the feature will be dropped entirely.
# sparse_output=False:  Sets the encoder to return a NumPy array instead of a sparse matrix.
#                       Sparse matrices are efficient for datasets with many zeros, but arrays are often easier to work with directly.
encoder = OneHotEncoder(drop="first", sparse_output=False)

# Fit OneHotEncoder to data, then transform data.
# Fit the OneHotEncoder to the 'data' array.
# 'fit' learns the unique categories present in the data.
# 'transform' then applies the one-hot encoding to the data based on the learned categories.
# 'fit_transform' combines both steps: it fits the encoder to the data and then transforms the data in a single step.
onehot = encoder.fit_transform(data)

# Print the resulting one-hot encoded array to the console.
# Each category from the original data is now represented by a binary vector.
print(onehot)

### `OrdinalEncoder` Transform

**Ordinal encoding** involves mapping each unique label to an integer value. This encoding is most appropriate when there is a **known ordinal relationship** between the categories (e.g., low, medium, high). While some variables in our dataset may have such a relationship, we will assume all variables are **categorical** for this example, ignoring any inherent ordinal structure.

Even in this case, using ordinal encoding can still be useful as a **baseline** or point of comparison with other encoding schemes. We can implement this using the `OrdinalEncoder` class from scikit-learn, which encodes each categorical variable into integers.

#### Ordinal Encode The Breast Cancer Dataset


In [None]:
# Load the dataset from a CSV file defined by a variable named 'breast_cancer_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Retrieve the underlying NumPy array from the pandas DataFrame 'dataset'.
# This gives array-style access to the raw values for slicing and further processing.
data = dataset.values

# Separate the dataset into input features (X) and the target variable (y).
# X is assigned all columns except the last one (':-1').
# y is assigned only the last column ('-1').
# '.astype(str)' ensures that the data is treated as strings initially, which is important for ordinal encoding if the data is mixed type.
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# Initialize an OrdinalEncoder object from scikit-learn.
# Ordinal encoding converts categorical variables into integer codes; by default the codes follow sorted category order, not any domain ranking.
ordinal_encoder = OrdinalEncoder()

# Fit the OrdinalEncoder to the input features X and then transform X.
# 'fit_transform' learns the unique categories in each feature and replaces them with ordinal integers.
X = ordinal_encoder.fit_transform(X)

# Initialize a LabelEncoder object from scikit-learn.
# Label encoding converts categorical labels into numerical values.
label_encoder = LabelEncoder()

# Fit the LabelEncoder to the target variable y and then transform y.
# 'fit_transform' learns the unique labels in y and replaces them with integers starting from 0.
y = label_encoder.fit_transform(y)

# Print a summary of the transformed input features X.
# "Input", X.shape prints the string "Input" followed by the shape (number of rows and columns) of X.
print("Input", X.shape)

# Print the first 5 rows and all columns of the transformed input features X.
# X[:5, :] selects the first 5 rows and all columns.
print(X[:5, :])

# Print a summary of the transformed target variable y.
# "Output", y.shape prints the string "Output" followed by the shape (number of elements) of y.
print("Output", y.shape)

# Print the first 5 elements of the transformed target variable y.
# y[:5] selects the first 5 elements of y.
print(y[:5])

After applying the encoding, we would expect the **number of rows** and **number of columns** to remain unchanged. The only difference is that all **string values** are now replaced with **integer values**. As anticipated, we can see that the number of variables stays the same, but all categorical values have been transformed into ordinal-encoded integers.

Next, let's evaluate the performance of a machine learning model on this dataset using the applied encoding. A **best practice** when encoding variables is to:
1. **Fit the encoding** on the training dataset.
2. **Apply the encoding** to both the training and test datasets.

To follow this approach, we will:
1. **Split the dataset** into training and test sets.
2. **Prepare the encoding** using only the training set.
3. **Transform both the training and test sets** using the fitted encoding.
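
One consequence of fitting only on the training set is that the test set may contain a category the encoder never saw; by default, `OrdinalEncoder` raises an error in that situation. The sketch below shows one way to handle it, using the `handle_unknown` and `unknown_value` arguments. The rows are made up for illustration, and the choice of `-1` as the placeholder code is an assumption.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows; "central" appears only in the test data.
X_train = asarray([["left"], ["right"], ["left"]])
X_test = asarray([["right"], ["central"]])

# Map categories unseen during fit to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then transform both sets.
encoder.fit(X_train)
train_encoded = encoder.transform(X_train)
test_encoded = encoder.transform(X_test)

print(train_encoded.ravel())  # [0. 1. 0.]
print(test_encoded.ravel())   # [ 1. -1.]
```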

#### Logistic Regression With Ordinal Encoding

Next, we evaluate logistic regression on the breast cancer dataset with an ordinal encoding.

In [None]:
# Load the dataset from a CSV file defined by a variable named 'breast_cancer_csv'.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Retrieve the array of data from the pandas DataFrame.
data = dataset.values

# Separate the data into input (X) and output (y) columns.
# X is assigned all columns except the last one (features).
# [:, :-1] selects all rows and all columns except the last one.
# .astype(str) casts the input features to string type.
X = data[:, :-1].astype(str)

# y is assigned the last column (target variable).
# [:, -1] selects all rows and only the last column.
# .astype(str) casts the target variable to string type.
y = data[:, -1].astype(str)

# Split the dataset into training and testing sets.
# X_train, y_train will be used for training the model.
# X_test, y_test will be used for evaluating the model's performance.
# test_size=0.33 means 33% of the data will be used for testing, and 67% for training.
# random_state=1 ensures that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Initialize OrdinalEncoder to convert categorical input features to numerical.
ordinal_encoder = OrdinalEncoder()

# Fit the OrdinalEncoder on the training input data to learn the categories.
ordinal_encoder.fit(X_train)

# Transform the training input data using the fitted OrdinalEncoder.
X_train = ordinal_encoder.transform(X_train)

# Transform the testing input data using the fitted OrdinalEncoder.
X_test = ordinal_encoder.transform(X_test)

# Initialize LabelEncoder to convert categorical target variable to numerical labels.
label_encoder = LabelEncoder()

# Fit the LabelEncoder on the training target variable to learn the unique classes.
label_encoder.fit(y_train)

# Transform the training target variable into numerical labels using the fitted LabelEncoder.
y_train = label_encoder.transform(y_train)

# Transform the testing target variable into numerical labels using the fitted LabelEncoder.
y_test = label_encoder.transform(y_test)

# Define the logistic regression model.
model = LogisticRegression()

# Fit the logistic regression model to the training data.
# X_train is the training input features, and y_train is the training target variable.
model.fit(X_train, y_train)

# Make predictions on the test set using the trained model.
yhat = model.predict(X_test)

# Evaluate the model's predictions by calculating the accuracy score.
# accuracy_score compares the true labels (y_test) with the predicted labels (yhat).
accuracy = accuracy_score(y_test, yhat)

# Print the accuracy of the model in percentage format, rounded to two decimal places.
print("Accuracy: %.2f" % (accuracy * 100))

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a reasonable score.

### `OneHotEncoder` Transform

A **one-hot encoding** is suitable for categorical data where there is **no inherent relationship** between categories. The scikit-learn library provides the `OneHotEncoder` class to automatically perform one-hot encoding on one or more variables. 

By default, the `OneHotEncoder` outputs data in a **sparse representation**, which is memory-efficient since most values in the encoded representation are 0. For clarity, we will disable this feature by setting the `sparse_output` argument to `False` (this argument was named `sparse` before scikit-learn 1.2), allowing us to inspect the encoded data more easily.

Once configured, we can apply the encoding by calling the `fit_transform()` function and passing our dataset to it. This creates a one-hot encoded version of the dataset, transforming categorical variables into binary columns.

#### One-hot Encode The Breast Cancer Dataset

In [None]:
# Load the dataset from a CSV file defined by a variable named 'breast_cancer_csv' into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Retrieve the array of data from the pandas DataFrame.
# '.values' attribute returns a NumPy array representation of the DataFrame.
data = dataset.values

# Separate the data into input (X) and output (y) columns.
# 'data[:, :-1]' selects all rows and all columns except the last one for input features (X).
# '.astype(str)' converts the input features to string type, which is often necessary before one-hot encoding.
X = data[:, :-1].astype(str)

# 'data[:, -1]' selects all rows and only the last column for the target variable (y).
# '.astype(str)' converts the target variable to string type, which is suitable for label encoding.
y = data[:, -1].astype(str)

# One-hot encode the input variables (X).
# Initialize OneHotEncoder with 'sparse_output=False' to return a dense NumPy array instead of a sparse matrix.
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit the OneHotEncoder to the input data X and then transform X into one-hot encoded features.
# 'fit_transform' learns the unique categories in each feature and then transforms the data.
X = onehot_encoder.fit_transform(X)

# Label-encode the target variable (y). 'LabelEncoder' maps each class label to an integer, much like an ordinal encoding of a single column.
# Initialize LabelEncoder.
label_encoder = LabelEncoder()

# Fit the LabelEncoder to the target variable y and then transform y into label-encoded integers.
# 'fit_transform' learns the unique classes in y and then transforms them into numerical labels.
y = label_encoder.fit_transform(y)

# Summarize the transformed data.
# Print the shape of the input data X after one-hot encoding.
# 'X.shape' returns a tuple representing the dimensions of X (number of rows, number of columns).
print("Input", X.shape)

# Print the first 5 rows and all columns of the transformed input data X.
# 'X[:5, :]' selects the first 5 rows and all columns.
print(X[:5, :])

After applying one-hot encoding, we would expect the **number of rows** to remain unchanged, but the **number of columns** to increase significantly. As anticipated, we can see that the number of variables has jumped from **9 to 43**, and all values are now binary (`0` or `1`).

Next, let's evaluate the performance of a machine learning model on this encoded dataset, following the same process as in the previous section. The encoding is **fit on the training set** and then **applied to both the training and test sets**, ensuring consistency and avoiding data leakage.
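
One way to keep the fit-on-train, apply-to-both discipline automatic is to bundle the encoder and the model in a scikit-learn `Pipeline`. The sketch below is an alternative arrangement, not the approach used in the rest of this module; the rows are made up for illustration, and `handle_unknown="ignore"` is an assumed setting that encodes unseen test categories as all zeros.

```python
from numpy import asarray
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only (not taken from the breast cancer data).
X_train = asarray([["left", "yes"], ["right", "no"], ["left", "no"], ["right", "yes"]])
y_train = asarray([1, 0, 0, 1])
X_test = asarray([["left", "yes"], ["right", "no"]])

# The pipeline fits the encoder on whatever data is passed to fit(),
# then reuses that fitted encoder inside predict().
pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("model", LogisticRegression()),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```

Because the encoder only ever sees the data passed to `fit()`, the test set cannot influence the learned categories.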

#### Logistic Regression With One-Hot Encoding 

Next, we evaluate logistic regression on the breast cancer dataset with a one-hot encoding.

In [None]:
# Load the dataset from a CSV file defined by a variable named 'breast_cancer_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = read_csv(breast_cancer_csv, header=None)

# Extract the values from the pandas DataFrame and convert it into a NumPy array.
# This is often done to work with scikit-learn functions which often expect NumPy arrays.
data = dataset.values

# Separate the dataset into input features (X) and the target variable (y).
# X is assigned all columns except the last one ([:-1]), and y is assigned the last column ([-1]).
# .astype(str) converts the data type to string, likely for handling categorical data before encoding.
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# Split the dataset into training and testing sets using the train_test_split function.
# X_train, y_train will be used for training the model.
# X_test, y_test will be used for evaluating the model's performance on unseen data.
# test_size=0.33 specifies that 33% of the data will be used for testing, and the rest for training.
# random_state=1 ensures that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Initialize a OneHotEncoder object to perform one-hot encoding on categorical features.
onehot_encoder = OneHotEncoder()

# Fit the OneHotEncoder on the training input data (X_train).
# This learns the categories to be encoded from the training set.
onehot_encoder.fit(X_train)

# Transform the training input data (X_train) using the fitted OneHotEncoder.
# This converts categorical features into numerical one-hot encoded features.
X_train = onehot_encoder.transform(X_train)

# Transform the testing input data (X_test) using the fitted OneHotEncoder.
# It's important to use the encoder fitted on the training data to ensure consistency.
X_test = onehot_encoder.transform(X_test)

# Initialize a LabelEncoder object to perform ordinal encoding on the target variable.
# LabelEncoder is used here to convert string labels into numerical labels.
label_encoder = LabelEncoder()

# Fit the LabelEncoder on the training target variable (y_train).
# This learns the unique classes in the training target.
label_encoder.fit(y_train)

# Transform the training target variable (y_train) using the fitted LabelEncoder.
# This converts string labels in y_train to numerical labels.
y_train = label_encoder.transform(y_train)

# Transform the testing target variable (y_test) using the fitted LabelEncoder.
# Use the same fitted encoder from training data for consistent encoding.
y_test = label_encoder.transform(y_test)

# Define a Logistic Regression model.
# Logistic Regression is a linear model used for binary and multiclass classification.
model = LogisticRegression()

# Train the Logistic Regression model using the training data (X_train, y_train).
# The model learns the relationship between the features and the target variable from the training data.
model.fit(X_train, y_train)

# Use the trained Logistic Regression model to make predictions on the test input data (X_test).
# yhat will contain the predicted class labels for the test set.
yhat = model.predict(X_test)

# Evaluate the performance of the model by calculating the accuracy score.
# accuracy_score function compares the true labels (y_test) with the predicted labels (yhat).
accuracy = accuracy_score(y_test, yhat)

# Print the accuracy of the model in percentage format, rounded to two decimal places.
print("Accuracy: %.2f" % (accuracy * 100))

In this case, the model achieved a classification accuracy of about 70.53 percent, which is worse than the ordinal encoding in the previous section.

## Conclusion

In this module, we explored various techniques for **encoding categorical data** into numerical formats that are suitable for machine learning models. It’s important to remember that the choice of encoding method can **significantly impact model performance**. Additionally, some categorical variables may have **natural relationships** (e.g., ordinality) that should be taken into account when selecting the appropriate encoding method.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.