# Categorical Data

All of the features examined thus far have been numeric. There are other features in the dataset that have string values. We ignored these at the time, because all data passed to a scikit-learn estimator must be numeric. Let's select some string and numeric columns as our input data and attempt to fit a machine learning model with it.

In [None]:
import pandas as pd
import numpy as np
housing = pd.read_csv('../data/housing_sample.csv')
X = housing[['Neighborhood', 'Exterior1st', 'GrLivArea', 'GarageArea', 'HeatingQC']]
y = housing['SalePrice']
X.head()

### Attempt to fit the model

scikit-learn machine learning estimators only work with numeric data, so the following raises an error.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

## Encoding

If we wish to use string columns in our dataset, we will need to **encode** them as numeric in some fashion. As with missing value imputation, there are different encoding strategies available. 

Columns containing categorical (discrete) values can be divided into two types: **ordinal** and **nominal**. Ordinal categorical columns are those where the column values have some inherent **order**. Restaurant ratings ('very good', 'good', 'average', etc.) are one example. Nominal categorical columns are those where there is no inherent ordering of the column values. Neighborhood and house exterior from the housing dataset are examples of those.

### Numeric columns may be categorical

Categorical data is not limited to just strings. Numeric columns can represent categories such as zip code, room number, or stage of disease such as cancer.

### One-hot encoding

One-hot encoding is a strategy used primarily for nominal categorical columns, though it can be used for ordinal columns as well. It works by first finding the unique values in a column. It then creates a new array with one column for each unique value. The number of rows stays the same. Each row of the new array is composed entirely of 0's except for the column corresponding to the original value, which is encoded as 1.
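
The steps described above can be sketched by hand with numpy. This is only an illustration of the idea, not how scikit-learn implements it, and the `values` array is a made-up toy column rather than data from the housing dataset.

```python
import numpy as np

# Hypothetical toy column of exterior materials (not from the housing data)
values = np.array(['Wood', 'Brick', 'Vinyl', 'Brick', 'Wood'])

# Find the sorted unique values and, for each row, the index of its value
categories, codes = np.unique(values, return_inverse=True)

# Start with all zeros: one row per original value, one column per category
one_hot = np.zeros((len(values), len(categories)), dtype=int)

# Place a single 1 in each row at the column matching that row's category
one_hot[np.arange(len(values)), codes] = 1

print(categories)  # ['Brick' 'Vinyl' 'Wood']
print(one_hot)
```

Each row sums to exactly 1, which is where the name "one-hot" comes from.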

### Easy to see with pandas

One-hot encoding is more easily explained with a simple example using pandas. The `get_dummies` function performs one-hot encoding, and we will use it on the `Exterior1st` column. Let's begin by outputting the first few values so that we can verify the encoding afterwards.

In [None]:
housing['Exterior1st'].head()

We can now complete the one-hot encoding with the `get_dummies` function and highlight where the 1 is located in each row.

In [None]:
pd.get_dummies(housing['Exterior1st']).head().style.highlight_max(axis=1)

### One-hot encoding in scikit-learn

One-hot encoding is accomplished with the `OneHotEncoder` transformer of the `preprocessing` module. By default, it returns a sparse matrix, a special object from the scipy library that saves memory by storing only the nonzero entries. This suits one-hot encoded data well, since most of its values are 0. We'll set the `sparse` parameter to `False` so that we get back a normal numpy array allowing us to see the actual values (scikit-learn 1.2 renamed this parameter to `sparse_output`, and version 1.4 removed the old name). Let's complete the three-step process and assign the encoded array to a separate variable name.

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
X_encode = ohe.fit_transform(X)
X_encode[:5]

That's quite difficult to interpret. Let's get the shape of the returned array to help understand what's going on.

In [None]:
X_encode.shape

### That's a lot of features - what happened?

We wanted to encode just the string columns. By default, `OneHotEncoder` encodes each column of data regardless of its type. It treats every unique value as a category to be encoded. Let's verify that there are a total of 1347 unique values. We can get the number of unique values in each column with the `nunique` method.

In [None]:
X.nunique()

Summing all the values in this Series verifies that we do indeed have a total of 1347 combined unique values.

In [None]:
X.nunique().sum()
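
The same behavior can be reproduced on a tiny made-up DataFrame, where it is easier to count the categories by hand. The column values below are hypothetical, and `.toarray()` is used instead of the `sparse` parameter so that the sketch does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string column and one numeric column
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'A', 'C'],
    'GrLivArea': [1500, 1720, 1500, 2100],
})

# The default output is a scipy sparse matrix; convert it to a dense array
toy_encoded = OneHotEncoder().fit_transform(toy).toarray()

# Every distinct number is treated as its own category, just like the strings
n_expected = toy.nunique().sum()

print(toy_encoded.shape)  # (4, 6)
print(n_expected)         # 6
```

Three unique neighborhoods plus three unique living areas give six encoded columns, even though only one of the original columns held strings.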

### Only use the string columns

Instead of transforming all five columns, we can transform just the three string columns.

In [None]:
X_encode2 = ohe.fit_transform(X[['Neighborhood', 'Exterior1st', 'HeatingQC']])
X_encode2

From above, there are 45 total unique values between the three columns, so we expect the number of columns in the returned array to be the same.

In [None]:
X_encode2.shape

### Get the feature names

scikit-learn returns a numpy array, which is devoid of column names, so it is not easy to decipher which category each new column references. The `get_feature_names` method returns the feature names, allowing us to know the exact encoding (in scikit-learn 1.2 and later, this method has been replaced by `get_feature_names_out`).

In [None]:
ohe.get_feature_names()

Notice how each feature name begins with 'x0', 'x1', or 'x2'. This prefix references the original column. The first column, `x0_Blmgtn`, equals 1 whenever the `Neighborhood` value is 'Blmgtn'. The column `x1_WdShing` is 1 whenever the `Exterior1st` value is 'WdShing', and the column `x2_Fa` is 1 whenever `HeatingQC` is 'Fa'. The unique values for each feature may be accessed with the `categories_` attribute. Below, a list of three arrays is returned, one for each column.

In [None]:
ohe.categories_
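
Since `categories_` holds one array per input column, in column order, it can also be used to build readable labels for the encoded columns. The sketch below does this on a made-up two-column array. The names in `toy_columns` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two string columns
X_toy = np.array([['North', 'Wood'],
                  ['South', 'Brick'],
                  ['North', 'Brick']])
toy_columns = ['Region', 'Exterior']  # assumed names for the two columns

ohe_toy = OneHotEncoder()
ohe_toy.fit(X_toy)

# Pair each original column name with each of its learned categories.
# The encoded columns appear in exactly this order.
labels = [f'{name}_{cat}'
          for name, cats in zip(toy_columns, ohe_toy.categories_)
          for cat in cats]

print(labels)
# ['Region_North', 'Region_South', 'Exterior_Brick', 'Exterior_Wood']
```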

### Values that only appear in the test set

If a value appears in the test set that was not present during training, you will get an error when attempting to encode it. Let's see this with a simple example using the following array containing vehicle makes.

In [None]:
X1 = np.array([['Toyota'], ['Kia'], ['Ford'], ['Ford'], ['Kia'], ['Kia']])
X1

We instantiate a new `OneHotEncoder` and transform this single column of data.

In [None]:
ohe2 = OneHotEncoder(sparse=False)
ohe2.fit_transform(X1)

There are only three unique values in this column.

In [None]:
ohe2.categories_

If new data arrives that needs to be transformed using the same mapping, it is only possible if it contains categories found in the training set. The following array doesn't introduce any new values, so the transformation happens without error.

In [None]:
X1_new = np.array([['Kia'], ['Kia'], ['Toyota']])

Call the `transform` method to use the same encoding that was learned during training.

In [None]:
ohe2.transform(X1_new)

If new data arrives that contains a category not present during training, an error will be raised by default. Here, the value 'Honda' is new and responsible for the error.

In [None]:
X1_new2 = np.array([['Kia'], ['Kia'], ['Honda']])
ohe2.transform(X1_new2)
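
If we want the notebook to keep running past this point, the error can be caught. The following is a small sketch on made-up arrays, assuming the encoder keeps its default `handle_unknown='error'` setting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy arrays mirroring the vehicle example
makes_train = np.array([['Toyota'], ['Kia'], ['Ford']])
makes_new = np.array([['Kia'], ['Honda']])  # 'Honda' was never seen during fit

enc_strict = OneHotEncoder()  # handle_unknown defaults to 'error'
enc_strict.fit(makes_train)

try:
    enc_strict.transform(makes_new)
    raised = False
except ValueError as err:
    # The message names the unknown category and the column it appeared in
    raised = True
    print(err)
```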

### Handling values unseen during training

The `OneHotEncoder` provides two ways to handle values that are unseen during training, but that appear later. The first involves the use of the `categories` parameter. If the full universe of distinct values for the feature is known, you can create a list of these categories and pass it to `categories`.

You need to use a list of lists, where each column is given its own list of categories. Here, we only have a single column we are transforming, so there is only one inner list. The encoded columns follow the order of the list. String categories do not have to be sorted (only numeric categories do), but we list them alphabetically to match the default ordering. Here, we instantiate a new `OneHotEncoder` passing it a list of five categories.

In [None]:
categories = [['Ford', 'Honda','Kia', 'Tesla', 'Toyota']]
ohe3 = OneHotEncoder(categories=categories, sparse=False)
ohe3.fit_transform(X1)

Transforming this yields a similar mapping as before, but with two columns of all zeros for 'Honda' and 'Tesla'.

In [None]:
ohe3.get_feature_names()

Now, the one-hot encoder can transform the array `X1_new2` which contains 'Honda' as its last value. Let's output the array again before transforming it.

In [None]:
X1_new2

In [None]:
ohe3.transform(X1_new2)

The other way to handle values unseen during training is to set the `handle_unknown` parameter to 'ignore'. By default, this value is 'error'. Let's instantiate the encoder one more time and fit and transform the original data.

In [None]:
ohe4 = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe4.fit_transform(X1)

Now that `handle_unknown` has been set to 'ignore', no error will be raised when an unknown value is encountered during transformation of a future dataset. Instead, every encoded column for that feature will be 0 in that row.

In [None]:
ohe4.transform(X1_new2)
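
Because an unknown value becomes a row with no 1 in it, such rows are easy to count after the transformation. Here is a minimal sketch on made-up arrays. The variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy arrays mirroring the vehicle example
makes_train = np.array([['Toyota'], ['Kia'], ['Ford']])
makes_new = np.array([['Kia'], ['Honda']])  # 'Honda' is unknown

enc_ignore = OneHotEncoder(handle_unknown='ignore')
enc_ignore.fit(makes_train)

# The default output is sparse; convert to a dense array to inspect it
encoded_new = enc_ignore.transform(makes_new).toarray()
print(encoded_new)

# Rows with no 1 at all correspond to categories unseen during training
n_unknown = int((encoded_new.sum(axis=1) == 0).sum())
print(n_unknown)  # 1
```

A model trained on the encoded data receives no information about the make for such rows, so it is worth checking how many of them appear in new data.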

## Inverting the encoding

The `inverse_transform` method is available to take an array of the one-hot encoded values and return the original data. Let's output the original array of data first.

In [None]:
X1

Now, we encode the data and assign the result to the variable name 'X1_transformed'.

In [None]:
X1_transformed = ohe2.transform(X1)
X1_transformed

Calling the `inverse_transform` method returns the original input data.

In [None]:
ohe2.inverse_transform(X1_transformed)
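
Conceptually, inverting a one-hot encoding amounts to finding the position of the 1 in each row and looking up the category stored at that position. The sketch below checks this against `inverse_transform` on a made-up array. It is an illustration of the idea, not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column of vehicle makes
makes = np.array([['Toyota'], ['Kia'], ['Ford'], ['Ford']])

enc_inv = OneHotEncoder()
makes_encoded = enc_inv.fit_transform(makes).toarray()

# The position of the 1 in each row selects the matching category
positions = makes_encoded.argmax(axis=1)
manual = enc_inv.categories_[0][positions].reshape(-1, 1)

builtin = enc_inv.inverse_transform(makes_encoded)
print(np.array_equal(manual, builtin))  # True
```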

## Ordinal Encoding

One-hot encoding is a standard encoding procedure for nominal categorical variables, those that have no inherent ordering, such as `Neighborhood` and `Exterior1st`. The feature `HeatingQC`, however, does have a clear ordering, so we have the option to encode it differently. Ordinal encoding returns just a single column, encoding each value as an integer. The lowest category becomes 0 and the highest becomes `n - 1`, where `n` is the number of unique categories.

scikit-learn provides the `OrdinalEncoder` transformer to make this transformation. By default, it will use the alphabetic ordering as the natural inherent order. This isn't likely to be the case for most ordinal features. Instead, you need to supply the first parameter, `categories`, with the exact order as a list. 

Let's begin our three-step process by importing the `OrdinalEncoder` and instantiating it with the correct order of categories. Each feature requires its own list of ordered categories, even if it has the same categories in the same order. Therefore, scikit-learn requires that you give it a list of lists, where each inner list corresponds to one feature being transformed. In our example, we are only transforming a single column, so we only have one inner list.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]
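# Pass the order by keyword; recent scikit-learn versions reject positional arguments here
oe = OrdinalEncoder(categories=order)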

We are now set up to transform `HeatingQC`, our ordinal feature containing string data. We need to pass the `fit_transform` method a two-dimensional array. The first five values are output.

In [None]:
X_heating_transformed = oe.fit_transform(X[['HeatingQC']])
X_heating_transformed[:5]

Let's verify that the encoding happened correctly by comparing it with the original values. The lowest category, 'Po', corresponds to 0 and the highest category, 'Ex', corresponds to 4.

In [None]:
X['HeatingQC'].head()

As with the `OneHotEncoder`, the categories are stored in the `categories_` attribute.

In [None]:
oe.categories_
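
To see why the explicit order matters, we can compare the default alphabetical encoding with the ordered one on a small made-up column. The `quality` array below is hypothetical and contains each rating exactly once.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with one of each heating quality rating
quality = np.array([['Gd'], ['Po'], ['Ex'], ['TA'], ['Fa']])

# Default behavior: categories are sorted alphabetically (Ex, Fa, Gd, Po, TA)
default_codes = OrdinalEncoder().fit_transform(quality).ravel()

# Explicit order from poor to excellent
quality_order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]
ordered_codes = OrdinalEncoder(categories=quality_order).fit_transform(quality).ravel()

print(default_codes)  # [2. 3. 0. 4. 1.]
print(ordered_codes)  # [3. 0. 4. 2. 1.]
```

With the default ordering, 'Ex' (excellent) is encoded as 0 and falls below 'Po' (poor), which reverses the true ranking of those two categories.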

## Machine learning with categorical data

Once we have applied an encoding strategy to categorical data, we can use scikit-learn machine learning estimators to build models that learn from it. Let's build a model from our two nominal categorical features, `Neighborhood` and `Exterior1st`. We begin by selecting these features as their own DataFrame.

In [None]:
X_nom = housing[['Neighborhood', 'Exterior1st']]
X_nom.head()

Let's now learn the categories and transform the strings using one-hot encoding.

In [None]:
ohe = OneHotEncoder(sparse=False)
X_nom_t = ohe.fit_transform(X_nom)
X_nom_t.shape

We can take this array of transformed data and pass it to any of the regression estimators. Here, we choose to model the data using k-nearest neighbors.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=5)
knr.fit(X_nom_t, y)

In [None]:
knr.predict(X_nom_t)
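
When predicting for new observations, the new data must be encoded with the already fitted encoder by calling `transform`, not `fit_transform`, so that the columns line up with those seen during training. The following is a minimal sketch on made-up data. The prices and neighborhood labels are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy training data: a neighborhood label and a sale price
X_toy_train = np.array([['A'], ['A'], ['B'], ['B'], ['C'], ['C']])
y_toy_train = np.array([100.0, 110.0, 200.0, 210.0, 300.0, 310.0])

# Learn the categories from the training data only
enc_toy = OneHotEncoder()
X_toy_train_t = enc_toy.fit_transform(X_toy_train).toarray()

knr_toy = KNeighborsRegressor(n_neighbors=2)
knr_toy.fit(X_toy_train_t, y_toy_train)

# Encode new rows with the SAME fitted encoder before predicting
X_toy_new = np.array([['B'], ['C']])
preds = knr_toy.predict(enc_toy.transform(X_toy_new).toarray())

print(preds)  # [205. 305.]
```

Each prediction is the average price of the two training rows from the same neighborhood, since those are the nearest neighbors in the encoded space.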

## Using features with different transformations

Simultaneously using continuous, nominal, and ordinal features in our model requires the use of the `ColumnTransformer`, the subject of the next chapter.

## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Find the cross-validated mean score using linear regression with `HeatingQC` encoded as ordinal. Repeat this, but use one-hot encoding instead. Which encoding produces a better result?</span>