#**Encode Categorical Data**



Machine learning models require all input and output variables to be numeric. This means
that if your data contains categorical data, you must encode it to numbers before you can fit
and evaluate a model. The two most popular techniques are an Ordinal encoding and a One
Hot encoding.

In this tutorial, you will learn:

* Why encoding is a required pre-processing step when working with categorical data for machine
learning algorithms.
* How to use ordinal encoding for categorical variables that have a natural rank ordering.
* How to use one hot encoding for categorical variables that do not have a natural rank
ordering.

Credit: Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

##Nominal and Ordinal Variables

* **Nominal Variable**. Variable comprises a finite set of discrete values with no rank-order
relationship between values.
* **Ordinal Variable**. Variable comprises a finite set of discrete values with a ranked
ordering between values.
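The distinction can be made concrete with a small sketch. It uses the pandas `Categorical` type, which is not part of the original tutorial, and hypothetical colour and size values chosen only for illustration.

```python
import pandas as pd

# nominal variable: colours have no inherent rank order
colour = pd.Categorical(['red', 'green', 'blue'], ordered=False)

# ordinal variable: sizes have a natural rank, declared explicitly
size = pd.Categorical(['medium', 'small', 'large'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)

print(colour.ordered)       # False
print(size.ordered)         # True
# integer codes follow the declared category order
print(size.codes.tolist())  # [1, 0, 2]
# min and max are only meaningful for the ordered variable
print(size.min(), size.max())
```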

Some algorithms can work with categorical data directly. For example, a decision tree can
be learned directly from categorical data with no data transform required (this depends on
the specific implementation). Many machine learning algorithms cannot operate on label data
directly. They require all input variables and output variables to be numeric. In general, this is
mostly a constraint of the efficient implementation of machine learning algorithms rather than
hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. This means that categorical data must be converted
to a numerical form. If the categorical variable is an output variable, you may also want to
convert predictions by the model back into a categorical form in order to present them or use
them in some application.
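Converting predictions back into categorical form might look like the following sketch. The label strings and the predicted integers are hypothetical placeholders, not output from a real model.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder

# hypothetical target labels for a binary classification problem
y = asarray(['no-recurrence', 'recurrence', 'no-recurrence'])
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print(y_encoded.tolist())  # [0, 1, 0]

# pretend these integers were returned by model.predict()
yhat = asarray([1, 0, 0])
# map the integer predictions back to the original labels
yhat_labels = label_encoder.inverse_transform(yhat)
print(yhat_labels.tolist())
```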

##Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical
values. They are:
* Ordinal Encoding
* One Hot Encoding
* Dummy Variable Encoding

###Ordinal Encoding
In ordinal encoding, each unique category value is assigned an integer value. An integer ordinal encoding is a natural encoding for ordinal variables. For nominal
variables, it imposes an ordinal relationship where no such relationship exists. This can
cause problems, and a one hot encoding may be used instead.

In [2]:
# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]


We can see that the labels are sorted alphabetically and assigned integers in that order: blue is 0, green is 1, and red is 2.
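When a variable does have a natural rank ordering, alphabetical order is usually not the order we want. A minimal sketch, assuming a hypothetical low/medium/high variable, passes the desired order through the `categories` argument of OrdinalEncoder:

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordinal variable with a natural rank ordering
data = asarray([['medium'], ['low'], ['high']])
# one list of categories per input column, given in rank order
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
result = encoder.fit_transform(data)
print(result.ravel().tolist())  # [1.0, 0.0, 2.0]
```

Without the `categories` argument, the same data would be encoded alphabetically (high, low, medium), which does not reflect the rank.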

This **OrdinalEncoder** class is intended for input variables that are organized into rows and
columns, e.g. a matrix. If a categorical target variable needs to be encoded for a classification
problem, then the **LabelEncoder** class can be used. It does the same
thing as the **OrdinalEncoder**, although it expects a one-dimensional input for the single target
variable.
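The difference in expected input shape can be sketched as follows, reusing the colour labels from the example above. The variable names are illustrative only.

```python
from numpy import asarray
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

labels = asarray(['red', 'green', 'blue'])

# LabelEncoder takes the one-dimensional array directly
y = LabelEncoder().fit_transform(labels)
print(y.tolist())  # [2, 1, 0]

# OrdinalEncoder needs a two-dimensional array: one row per sample, one column
X = OrdinalEncoder().fit_transform(labels.reshape(-1, 1))
print(X.ravel().tolist())  # [2.0, 1.0, 0.0]
```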

###One Hot Encoding
For categorical variables where no ordinal relationship exists, the integer encoding may be
insufficient or even misleading to the model. Forcing an ordinal relationship via an ordinal encoding
and allowing the model to assume a natural ordering between categories may result in poor
performance or unexpected results (predictions halfway between categories). In this case, a one
hot encoding can be applied to the ordinal representation. This is where the integer encoded
variable is removed and one new binary variable is added for each unique integer value in the
variable.

In [None]:
# example of a one hot encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
# request dense output; this argument was named sparse before scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


We can see that the one hot encoding matches our expectation of three binary variables in the order blue, green, and red.
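The column order can be checked rather than assumed. The sketch below inspects the fitted `categories_` attribute and maps the binary rows back to labels with `inverse_transform()`. It calls `toarray()` on the default sparse output instead of passing a dense-output argument, which is only a convenience choice for this illustration.

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])
encoder = OneHotEncoder()
# the default output is sparse; toarray() converts it to a dense array
onehot = encoder.fit_transform(data).toarray()

# learned categories for the first input column, in output column order
print(encoder.categories_[0].tolist())  # ['blue', 'green', 'red']

# map the binary rows back to the original labels
restored = encoder.inverse_transform(onehot)
print(restored.ravel().tolist())
```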

###Dummy Variable Encoding
The one hot encoding creates one binary variable for each category. The problem is that this
representation includes redundancy. For example, if we know that [1, 0, 0] represents blue and
[0, 1, 0] represents green, we don't need another binary variable to represent red; instead we
could use 0 values for both blue and green, e.g. [0, 0]. This is called a dummy variable encoding, and always
represents C categories with C - 1 binary variables.

We can use the OneHotEncoder class to implement a dummy encoding as well as a one hot
encoding. The drop argument can be set to indicate which category will become the one that is
assigned all zero values, called the baseline. We can set this to 'first' so that the first category is
used. When the labels are sorted alphabetically, the blue label will be the first and will become
the baseline.

In [None]:
# example of a dummy variable encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
# request dense output; this argument was named sparse before scikit-learn 1.2
encoder = OneHotEncoder(drop='first', sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 1.]
 [1. 0.]
 [0. 0.]]
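The redundancy that the dummy variable encoding removes can be checked directly. In this sketch, the dropped blue column is reconstructed from the two remaining columns; the variable names are illustrative.

```python
from numpy import asarray, allclose
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])
# full one hot encoding: columns are blue, green, red
full = OneHotEncoder().fit_transform(data).toarray()
# dummy encoding: the first category (blue) is dropped
dummy = OneHotEncoder(drop='first').fit_transform(data).toarray()
print(full.shape, dummy.shape)  # (3, 3) (3, 2)

# a row is blue exactly when all remaining columns are zero
recovered_blue = 1.0 - dummy.sum(axis=1)
print(allclose(recovered_blue, full[:, 0]))  # True
```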


##Breast Cancer Categorical Dataset
The breast cancer dataset classifies breast cancer
patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine
input variables. It is a binary classification problem. A naive model can achieve an accuracy
of 70 percent on this dataset. A good score is about 76 percent. 
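A naive model is commonly taken to mean one that always predicts the most frequent class. The sketch below assumes that interpretation and uses hypothetical labels with a 70/30 class split, not the real dataset, to show how such a baseline could be computed with scikit-learn's DummyClassifier.

```python
from numpy import asarray, zeros
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# hypothetical labels with a 70/30 class split (not the real dataset)
y = asarray(['no'] * 7 + ['yes'] * 3)
# the dummy model ignores the inputs, so placeholder values suffice
X = zeros((len(y), 1))

# always predict the most frequent class seen during fit
naive = DummyClassifier(strategy='most_frequent')
naive.fit(X, y)
accuracy = accuracy_score(y, naive.predict(X))
print('Accuracy: %.2f' % (accuracy * 100))  # Accuracy: 70.00
```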

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))

###Download Breast Cancer data files

In [3]:
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv" -O breast-cancer.csv
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names" -O breast-cancer.names
!head breast-cancer.csv

In [4]:
# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize
print('Input', X.shape)
print('Output', y.shape)

Input (286, 9)
Output (286,)


We can see that we have 286 examples and nine input variables.



###OrdinalEncoder Transform
An ordinal encoding involves mapping each unique label to an integer value. This type of
encoding is really only appropriate if there is a known relationship between the categories. This
relationship does exist for some of the variables in our dataset, and ideally, this should be
harnessed when preparing the data. In this case, we will ignore any possible existing ordinal
relationship and assume all variables are categorical. It can still be helpful to use an ordinal
encoding, at least as a point of reference with other encoding schemes.
We can use the OrdinalEncoder from scikit-learn to encode each variable to integers.

In [5]:
# ordinal encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
 [3. 0. 2. 0. 0. 0. 1. 0. 0.]
 [3. 0. 6. 0. 0. 1. 0. 1. 0.]
 [2. 2. 6. 0. 1. 2. 1. 1. 1.]
 [2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]


We would expect the number of rows, and in this case, the number of columns, to be unchanged,
except all string values are now integer values. As expected, in this case, we can see that the
number of variables is unchanged, but all values are now ordinal encoded integers.

Next, let's evaluate machine learning on this dataset with this encoding. The best practice
when encoding variables is to fit the encoding on the training dataset, then apply it to the train
and test datasets. We will first split the dataset, then prepare the encoding on the training set,
and apply it to the test set.

In [6]:
# evaluate logistic regression on the breast cancer dataset with an ordinal encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 75.79


In this case, the model achieved a classification accuracy of about 75.79 percent, which is a
reasonable score.
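One risk of fitting the encoder on the training set alone is that the test set may contain a category the encoder never saw, which raises an error by default. A minimal sketch of one way to handle this, using hypothetical colour data and an arbitrary placeholder value of -1, relies on the `handle_unknown` and `unknown_value` arguments of OrdinalEncoder:

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical split where 'purple' appears only in the test set
X_train = asarray([['red'], ['green'], ['blue']])
X_test = asarray([['green'], ['purple']])

# unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(X_train)
X_test_encoded = encoder.transform(X_test)
print(X_test_encoded.ravel().tolist())  # [1.0, -1.0]
```

Whether -1 is a sensible stand-in depends on the model; it is used here only to show the mechanism.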

###OneHotEncoder Transform
A one hot encoding is appropriate for categorical data where no relationship exists between
categories. The scikit-learn library provides the OneHotEncoder class to automatically one hot
encode one or more variables. By default the OneHotEncoder will output data with a sparse
representation, which is efficient given that most values are 0 in the encoded representation.
We will disable this feature by setting the sparse argument (renamed sparse_output in
scikit-learn 1.2) to False so that we can review the effect of the encoding. Once defined, we can
call the fit_transform() function and pass it our dataset to create a one hot encoded version
of our dataset.
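The difference between the sparse and dense representations can be seen in a small sketch. It uses the three-colour toy data from earlier and SciPy's `issparse()` helper, which is not used elsewhere in this tutorial.

```python
from numpy import asarray
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])
# default behaviour: the result is a SciPy sparse matrix
encoded = OneHotEncoder().fit_transform(data)
print(issparse(encoded))  # True
# only the non-zero entries are stored: one per row here
print(encoded.nnz)        # 3

# convert to a regular dense array for inspection
dense = encoded.toarray()
print(dense.shape)        # (3, 3)
```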

In [7]:
# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# one hot encode input variables
# request dense output; this argument was named sparse before scikit-learn 1.2
onehot_encoder = OneHotEncoder(sparse_output=False)
X = onehot_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])

Input (286, 43)
[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]]


We would expect the number of rows to remain the same, but the number of columns to
dramatically increase. As expected, in this case, we can see that the number of variables has
leaped up from 9 to 43 and all values are now binary values 0 or 1.
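The number of output columns is the sum of the number of unique categories in each input column. The sketch below illustrates this on a hypothetical two-column dataset rather than the breast cancer data.

```python
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

# hypothetical data: 2 unique colours and 3 unique sizes
X = asarray([['red', 'small'],
             ['blue', 'large'],
             ['red', 'medium']])
encoder = OneHotEncoder()
encoded = encoder.fit_transform(X).toarray()

# count the categories learned for each input column
n_per_column = [len(c) for c in encoder.categories_]
print(n_per_column)   # [2, 3]
print(encoded.shape)  # (3, 5)
```

Applying the same check to the fitted encoder for the breast cancer data should account for the 43 columns reported above.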

Next, let's evaluate machine learning on this dataset with this encoding as we did in the
previous section. The encoding is fit on the training set then applied to both train and test sets
as before.

In [8]:
# evaluate logistic regression on the breast cancer dataset with a one-hot encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 70.53


In this case, the model achieved a classification accuracy of about 70.53 percent, which is
worse than the ordinal encoding in the previous section.
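A comparison based on a single train/test split can be sensitive to how the split falls. One possible alternative, sketched here on synthetic stand-in data rather than the breast cancer dataset, wraps each encoder and the model in a scikit-learn Pipeline and compares them with k-fold cross-validation. The data, the target rule, and the choice of five folds are all assumptions made for illustration.

```python
from numpy import asarray
from numpy.random import default_rng
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# synthetic stand-in data: 200 rows, 3 categorical columns
rng = default_rng(1)
X = rng.choice(['a', 'b', 'c', 'd'], size=(200, 3))
# hypothetical target: depends on the first column in a non-ordinal way
y = asarray([1 if row[0] in ('b', 'd') else 0 for row in X])

encoders = [('ordinal', OrdinalEncoder()),
            ('one-hot', OneHotEncoder(handle_unknown='ignore'))]
for name, encoder in encoders:
    # the pipeline refits the encoder on the training folds only
    pipeline = make_pipeline(encoder, LogisticRegression())
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    print('%s: %.2f' % (name, scores.mean() * 100))
```

Because the encoder is part of the pipeline, it is fit only on the training portion of each fold, in line with the best practice described earlier.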