**Basics**

A set of categories with no intrinsic ordering is
called **nominal**. Examples of nominal categories include:
• Blue, Red, Green

• Man, Woman

• Banana, Strawberry, Apple

In contrast, when a set of categories has some natural ordering, we refer to it as **ordinal**.
For example:
• Low, Medium, High

• Young, Old

• Agree, Neutral, Disagree

**Encoding Nominal Categorical Features**

In [1]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

In [2]:
# Create feature
feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

# Create one-hot encoder
one_hot = LabelBinarizer()

In [3]:
one_hot.fit_transform(feature)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

In [5]:
# View feature classes
one_hot.classes_

array(['California', 'Delaware', 'Texas'], dtype='<U10')

If we want to reverse the one-hot encoding, we can use inverse_transform:

In [6]:
one_hot.inverse_transform(one_hot.transform(feature))

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')

We can even use pandas to one-hot encode the feature:

In [7]:
# Import library
import pandas as pd
# Create dummy variables from feature (dtype=int yields 0/1 instead of booleans)
pd.get_dummies(feature[:,0], dtype=int)

   California  Delaware  Texas
0           0         0      1
1           1         0      0
2           0         0      1
3           0         1      0
4           0         0      1


One helpful ability of scikit-learn is to handle a situation where each observation lists
multiple classes:

In [8]:
# Create multiclass feature
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

In [9]:
# Create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()

In [11]:
# One-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])

In [12]:
# View classes
one_hot_multiclass.classes_

array(['Alabama', 'California', 'Delaware', 'Florida', 'Texas'],
      dtype=object)

Finally, after one-hot encoding a feature, it is often recommended to drop one of the
resulting columns to avoid linear dependence. Because every row contains exactly one 1,
any single column can be reconstructed from the others and carries no extra information.
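
One way to drop a column is sketched below using the `drop_first` parameter of `pd.get_dummies`. The variable names `states` and `dummies` are chosen here for illustration only, and dropping the first category is one convention among several.

```python
import pandas as pd

# Same state values as the nominal feature used earlier
states = pd.Series(["Texas", "California", "Texas", "Delaware", "Texas"])

# drop_first=True removes the first category (alphabetically, "California"),
# leaving k-1 columns for k categories
dummies = pd.get_dummies(states, drop_first=True, dtype=int)

print(dummies)
```

A row with 0 in both remaining columns then stands for the dropped category, California.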

**Encoding Ordinal Categorical Features**

In [14]:
import pandas as pd
# Create features
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

In [15]:
# Create mapper
scale_mapper = {"Low": 1,
                "Medium": 2,
                "High": 3}

In [16]:
# Replace feature values with scale
dataframe["Score"].replace(scale_mapper)

0    1
1    1
2    2
3    2
4    3
Name: Score, dtype: int64

Often we have a feature with classes that have some kind of natural ordering. A
famous example is the Likert scale:
• Strongly Agree

• Agree

• Neutral

• Disagree

• Strongly Disagree


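When encoding an ordinal feature, we make an assumption about the intervals between
classes. If we mapped Low, Medium, Barely More Than Medium, and High to 1, 2, 3, and 4,
the distance between Low and Medium would be the same as the distance between Medium
and Barely More Than Medium, which is almost certainly not accurate. The best approach
is to choose the numerical values mapped to classes deliberately: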

In [17]:
scale_mapper = {"Low": 1,
                "Medium": 2,
                "Barely More Than Medium": 2.1,
                "High": 3}

In [19]:
dataframe = pd.DataFrame({"Score": ["Low",
                                    "Low",
                                    "Medium",
                                    "Medium",
                                    "High",
                                    "Barely More Than Medium"]})

In [20]:
dataframe["Score"].replace(scale_mapper)

0    1.0
1    1.0
2    2.0
3    2.0
4    3.0
5    2.1
Name: Score, dtype: float64
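
When only the order of the classes matters and not the spacing between them, an ordered pandas categorical is an alternative to a hand-written mapper. This is a sketch of a different technique from the `replace` approach above; the names `scores` and `ordered` are illustrative. Note that the resulting integer codes are equally spaced, so this does not address the unequal-interval concern.

```python
import pandas as pd

scores = pd.Series(["Low", "Low", "Medium", "Medium", "High"])

# Declare the category order explicitly; pandas would otherwise sort alphabetically
ordered = pd.Categorical(scores,
                         categories=["Low", "Medium", "High"],
                         ordered=True)

# Integer codes follow the declared order: Low=0, Medium=1, High=2
codes = ordered.codes

# Ordered categoricals also support comparisons against a category
below_high = ordered < "High"

print(codes)
print(below_high)
```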

**Encoding Dictionaries of Features**

In [21]:
# Import library
from sklearn.feature_extraction import DictVectorizer

In [22]:
# Create dictionary
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

In [24]:
# Create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)

In [25]:
# Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

In [26]:
features

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

By default DictVectorizer outputs a sparse matrix that only stores elements with a
value other than 0. This can be very helpful when we have massive matrices (often
encountered in natural language processing) and want to minimize the memory
requirements. We can force DictVectorizer to output a dense matrix using
sparse=False.
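
For comparison, the sketch below leaves `sparse` at its default and inspects the result. The names `sparse_vectorizer` and `sparse_features` are illustrative; `scipy.sparse.issparse` is used only to confirm the return type.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# Default behavior (sparse=True): returns a SciPy sparse matrix
sparse_vectorizer = DictVectorizer()
sparse_features = sparse_vectorizer.fit_transform(data_dict)

print(sparse.issparse(sparse_features))  # whether the result is sparse
print(sparse_features.nnz)               # number of explicitly stored values

# Convert to a dense array only when the matrix is small enough to fit in memory
dense_features = sparse_features.toarray()
```

Only the eight non-zero entries are stored, rather than all twelve cells of the 4 x 3 matrix.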

In [27]:
# Get feature names (get_feature_names was removed in scikit-learn 1.2)
feature_names = dictvectorizer.get_feature_names_out()

In [28]:
feature_names

array(['Blue', 'Red', 'Yellow'], dtype=object)

**Imputing Missing Class Values**

The ideal solution is to train a machine learning classifier algorithm to predict the
missing values, commonly a k-nearest neighbors (KNN) classifier:

In [29]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

In [30]:
# Create feature matrix with categorical feature
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])

In [31]:
# Create feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])

In [32]:
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])

In [33]:
imputed_values = trained_model.predict(X_with_nan[:,1:])

In [34]:
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))

In [35]:
np.vstack((X_with_imputed, X))

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

When we have missing values in a categorical feature, our best solution is to open our
toolbox of machine learning algorithms to predict the values of the missing observations.
We can accomplish this by treating the feature with the missing values as the
target vector and the other features as the feature matrix. A commonly used algorithm
is KNN (discussed in depth later in this book), which assigns to the missing
value the most common class among the k nearest observations (weighted by distance
in the code above).
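
To make the voting step concrete, the sketch below looks up the three nearest neighbors of the first incomplete observation and counts their classes by hand. It assumes uniform weights for simplicity, whereas the solution uses `weights='distance'`; the names `query` and `majority_class` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Complete observations: column 0 is the categorical feature
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])

# Known features of the observation whose class is missing
query = np.array([[0.87, 1.31]])

# Uniform weights: every neighbor gets one vote (an assumption for this sketch)
clf = KNeighborsClassifier(n_neighbors=3, weights="uniform")
clf.fit(X[:, 1:], X[:, 0])

# Find the 3 nearest complete observations and look up their classes
distances, indices = clf.kneighbors(query)
neighbor_classes = X[indices[0], 0]

# Majority vote among the neighbors
values, counts = np.unique(neighbor_classes, return_counts=True)
majority_class = values[np.argmax(counts)]

print(neighbor_classes, majority_class, clf.predict(query)[0])
```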

**Handling Imbalanced Classes**

In [39]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

In [40]:
# Load iris data
iris = load_iris()

In [41]:
# Create feature matrix
features = iris.data

In [43]:
# Create target vector
target = iris.target

In [45]:
# Remove first 40 observations
features = features[40:,:]
target = target[40:]

In [46]:
# Create binary target vector indicating if class 0
target = np.where((target == 0), 0, 1)

In [50]:
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Many algorithms in scikit-learn offer a parameter to weight classes during training to
counteract the effect of their imbalance.

In [51]:
# Create weights
weights = {0: 0.9, 1: 0.1}

In [52]:
# Create random forest classifier with weights
RandomForestClassifier(class_weight=weights)

RandomForestClassifier(class_weight={0: 0.9, 1: 0.1})

Or you can pass "balanced", which automatically creates weights inversely proportional
to class frequencies:

In [53]:
# Create a random forest classifier with balanced class weights
RandomForestClassifier(class_weight="balanced")

RandomForestClassifier(class_weight='balanced')
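
To see what "balanced" produces for a target like the one above (10 observations of class 0 and 100 of class 1), the sketch below calls scikit-learn's `compute_class_weight` helper and compares it with the formula n_samples / (n_classes * class_count). The target is rebuilt by hand here rather than loaded from iris, and the name `manual_weights` is illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same class balance as the imbalanced iris target: 10 zeros, 100 ones
target = np.array([0] * 10 + [1] * 100)
classes = np.unique(target)

# Weights scikit-learn derives for class_weight="balanced"
balanced_weights = compute_class_weight(class_weight="balanced",
                                        classes=classes,
                                        y=target)

# The same weights computed directly from the class counts
manual_weights = len(target) / (len(classes) * np.bincount(target))

print(dict(zip(classes, balanced_weights)))
```

The minority class receives a weight ten times larger than the majority class, mirroring the 1:10 ratio of their frequencies.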

Alternatively, we can downsample the majority class or upsample the minority class.
In downsampling, we randomly sample without replacement from the majority class
(i.e., the class with more observations) to create a new subset of observations equal in
size to the minority class.

For example, if the minority class has 10 observations, we
will randomly select 10 observations from the majority class and use those 20 observations
as our data.
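
A minimal sketch of downsampling on the imbalanced iris target follows. The index names (`i_class0`, `i_class1_downsampled`) and the fixed seed are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

# Indices of each class's observations
i_class0 = np.where(target == 0)[0]  # minority class
i_class1 = np.where(target == 1)[0]  # majority class

# Seeded generator so the sample is reproducible
rng = np.random.default_rng(0)

# Sample without replacement from the majority class, as many as the minority has
i_class1_downsampled = rng.choice(i_class1, size=len(i_class0), replace=False)

# Combine minority observations with the downsampled majority observations
keep = np.concatenate((i_class0, i_class1_downsampled))
features_downsampled = features[keep]
target_downsampled = target[keep]

print(np.bincount(target_downsampled))
```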

In the real world, imbalanced classes are everywhere: most visitors don't click the
buy button, and many types of cancer are thankfully rare. For this reason, handling
imbalanced classes is a common activity in machine learning.

Our best strategy is simply to collect more observations, especially observations
from the minority class. However, this is often just not possible, so we have to resort
to other options.

A second strategy is to use a model evaluation metric better suited to imbalanced
classes. Accuracy is often used as a metric for evaluating the performance of a model,
but when imbalanced classes are present accuracy can be ill suited. For example, if
only 0.5% of observations have some rare cancer, then even a naive model that
predicts nobody has cancer will be 99.5% accurate. Clearly this is not ideal. Some better
metrics we discuss in later chapters are confusion matrices, precision, recall, F1
scores, and ROC curves.
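
The accuracy pitfall can be reproduced with synthetic labels. In the sketch below, 5 of 1,000 observations (0.5%) are positive and the "model" predicts the negative class for everyone; the arrays `y_true` and `y_pred` are made up for illustration and do not come from a real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 5 positive cases (1) and 995 negative cases (0)
y_true = np.array([1] * 5 + [0] * 995)

# Naive model: predict "no cancer" for every observation
y_pred = np.zeros_like(y_true)

# Accuracy looks excellent because the majority class dominates
accuracy = accuracy_score(y_true, y_pred)

# Recall exposes the problem: none of the positive cases are found
recall = recall_score(y_true, y_pred)

print(accuracy, recall)
```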

A third strategy is to use the class weighting parameters included in implementations
of some models. This allows us to have the algorithm adjust for imbalanced classes.
Fortunately, many scikit-learn classifiers have a class_weight parameter, making it a
good option.

The fourth and fifth strategies are related: downsampling and upsampling. In downsampling
we create a random subset of the majority class of equal size to the minority
class. In upsampling we repeatedly sample with replacement from the minority class
until it is equal in size to the majority class. The decision between using downsampling
and upsampling is context-specific, and in general we should try both to see
which produces better results.
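
Upsampling can be sketched the same way, again on the imbalanced iris target. The index names and the fixed seed are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

# Indices of each class's observations
i_class0 = np.where(target == 0)[0]  # minority class
i_class1 = np.where(target == 1)[0]  # majority class

# Seeded generator so the sample is reproducible
rng = np.random.default_rng(0)

# Sample with replacement from the minority class until it matches the majority
i_class0_upsampled = rng.choice(i_class0, size=len(i_class1), replace=True)

# Combine the upsampled minority observations with all majority observations
keep = np.concatenate((i_class0_upsampled, i_class1))
features_upsampled = features[keep]
target_upsampled = target[keep]

print(np.bincount(target_upsampled))
```

Because the minority observations are repeated, any resampling of this kind is best applied to training data only, so that duplicates do not leak into the evaluation set.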