Moving beyond the [simplest model](../02_baseline/baseline_classifier.ipynb), the next step is a deep neural network. Just kidding, the next step is something slightly more complex but still very simple. OneR [@Holte1993] is such a model.

## The OneR algorithm

Conceptually, it works on the same principle as the baseline model, but on feature _categories_ instead of the whole dataset. For each category in the feature, predict the most frequent label. To implement this, we iterate over each unique category and train a baseline classifier for just that category. When predicting on new data, we look up the baseline classifier for that category and return its prediction.
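
Before the real implementation, here is a minimal sketch of that idea in plain Python. The genres and labels are made-up toy data, and the dictionary named `rule` is only an illustration of "most frequent label per category", not the class we build below.

```python
from collections import Counter, defaultdict

# Toy data (invented): one categorical feature and a binary label.
genres = ["horror", "comedy", "horror", "action", "comedy", "horror"]
labels = [0, 1, 0, 1, 1, 1]

# Group the labels by category.
labels_by_genre = defaultdict(list)
for genre, label in zip(genres, labels):
    labels_by_genre[genre].append(label)

# The "rule" is the most frequent label within each category.
rule = {
    genre: Counter(group).most_common(1)[0][0]
    for genre, group in labels_by_genre.items()
}

# Predicting is a lookup by category.
predictions = [rule[genre] for genre in ["comedy", "horror"]]
print(rule)         # {'horror': 0, 'comedy': 1, 'action': 1}
print(predictions)  # [1, 0]
```

Note that a category the rule has never seen would raise a `KeyError` here.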

### Categorical features

A categorical feature takes its values from an unordered set. Movie genres are a good example: horror, comedy, action, etc. There is no order to a movie genre, but each value represents a type of movie. We can treat our movie reviews the same way. We'll represent each unique review as a category.

## Rolling our own

There is no easy button for this model, so we'll go straight to the implementation.

In [1]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.dummy import DummyClassifier
from sklearn.utils.validation import validate_data


class OneR(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        """Find the most predictive rule."""
        # Sanity check on `X` and `y`.
        X, y = validate_data(self, X, y)
        predictors = {}
        # Get the unique categories from the first column.
        categories = np.unique(X[:, 0])
        for category in categories:
            # Create a conditional array where each index
            # is a boolean indicating if that index in the
            # first column of `X` is the category we're iterating
            # over.
            is_category = X[:, 0] == category
            # Grab all data points and labels in this category.
            _X = X[is_category]
            _y = y[is_category]
            # Train a baseline classifier on the category.
            predictors[category] = DummyClassifier().fit(_X, _y)
        self.predictors_ = predictors
        return self

    def predict(self, X):
        """Predict the labels for inputs `X`."""
        # Sanity check on `X`.
        # `reset` should be `True` in `fit` and `False` everywhere else.
        X = validate_data(self, X, reset=False)
        # Create an empty array that will hold the predictions.
        rv = np.zeros(X.shape[0])
        # Get the unique categories from the first column.
        categories = np.unique(X[:, 0])
        for category in categories:
            # Create a conditional array where each index
            # is a boolean indicating if that index in the
            # first column of `X` is the category we're iterating
            # over.
            is_category = X[:, 0] == category
            # Grab all data points in this category.
            _X = X[is_category]
            # Predict the label for all datapoints in `_X`.
            predictions = self.predictors_[category].predict(_X)
            # Assign the prediction for this category to
            # the corresponding indices in `rv`.
            rv[is_category] = predictions
        return rv

There's a lot going on here. We've added a `scikit-learn` validation function, `validate_data`. Since we're building a `sklearn` classifier we might as well take advantage of the validation functions the framework provides. While it's not necessary to use this function, I highly recommend it for a few reasons.

- It checks that our inputs and outputs have the right shape: a matrix and an array, respectively.

    We should always be operating on these shapes and raising an error otherwise. The input, `X`, passed to `predict` also needs to have the same number of columns as the input passed to `fit`. If it doesn't, this will raise an error.

- It converts the type to a `numpy` or `scipy` array.

    This gives consistency. Arguments to `fit` and `predict` could be `numpy` arrays, `scipy` arrays, `pandas` dataframes, `pandas` series, lists, or any number of container types. By converting them to `numpy`/`scipy` arrays we are guaranteed a predictable API for `X` and `y`.
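
A rough illustration of both points on made-up inputs. It uses `check_array` from `sklearn.utils`, on the assumption that it is a fair stand-in for the array checks `validate_data` runs on `X`. It does not cover the column-count check against `fit`.

```python
import numpy as np
from sklearn.utils import check_array

# A nested list is converted to a 2D `numpy` array.
X_list = [[1, 2], [3, 4], [5, 6]]
X_checked = check_array(X_list)
print(type(X_checked).__name__, X_checked.shape)

# A flat list is not a matrix, so validation raises a `ValueError`.
try:
    check_array([1, 2, 3])
    raised = False
except ValueError:
    raised = True
print(raised)
```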

Let's try it out!

In [2]:
from nlpbook import get_train_test_data

train_df, test_df = get_train_test_data()
features = ["review"]
label = "label"

X, y = train_df[features], train_df[label]
X

oner = OneR().fit(X, y)
oner.score(X, y)

ValueError: could not convert string to float: '"National Lampoon Goes to the Movies" (1981) is absolutely the worst movie ever made, surpassing even the witless "Plan 9 from Outer Space." The Lampoon film unreels in three separate and unconnected vignettes, each featuring different performers. The only common thread is the total lack of any redeeming qualities.<br /><br />Well, maybe there is one. Another reviewer on this site has said that the fleeting nude shots are nice, and he\'s right. Misses Ganzel and Dusenberry flash their assets prettily, in part one and part two, respectively. But their glamorous displays are, alas, wasted. The directors seem to have forgotten that even T&A needs a credible story to surround it, and there\'s none in sight.<br /><br />The third segment, starring Robby Benson and Richard Widmark, is the most disgusting of the three, and an unfortunate choice as the windup of this film. Benson plays an eager-beaver young policeman, brightly reporting for his first day of duty, ready to rid the streets of evil. He is paired with an old, cynical cop played by Widmark, and when these oil-and-water partners set out on their first patrol together, we sense a possible redemption of the film\'s earlier failures. Maybe, just maybe, the cynical old-timer will be reformed by his new partner\'s stalwart sense of duty and loyalty. Maybe all will end happily after all. But alas, this movie heads straight for the toilet, with no redemption, no happy ending, no coherent story of any kind.<br /><br />Before "National Lampoon Goes to the Movies," I thought I had already seen the worst schlock that Hollywood could possibly turn out. Unfortunately, I hadn\'t seen the half of it.<br /><br />'

*\*Surprised pikachu face\** An error!? `validate_data` didn't like our inputs. Turns out `validate_data` also tries to convert `X` and `y` to numeric values and fails hard. But why?

Well, this is because machine learning works on numbers, not text. We've written our algorithm to work on numbers or text, but in general the `str` type isn't usable by most machine learning algorithms, and `validate_data` enforces that up front. Fortunately, it's not hard to convert text categories to numbers: we just assign a unique number to each category. `sklearn` even has a class to do just that.
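
As a quick sketch of what "assign a unique number to each category" means, `np.unique` with `return_inverse=True` does it in one call. The genre values are made-up toy data, and this is only an illustration of the idea, not a claim about how the `sklearn` class works internally.

```python
import numpy as np

# Toy categorical column (invented values).
genres = np.array(["horror", "comedy", "action", "comedy"])

# `categories` holds the sorted unique values; `codes` holds, for each
# input, the index of its value in `categories`.
categories, codes = np.unique(genres, return_inverse=True)
print(categories)  # ['action' 'comedy' 'horror']
print(codes)       # [2 1 0 1]
```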

## Transformers

What we've worked with so far in `scikit-learn` are estimators. These are machine learning models. `scikit-learn` has another common type of class called transformers, which transform data as the name implies. These classes are commonly used for preprocessing tasks. They do not have a `predict` method, but instead have a `transform` method. We "train" them with `fit` just like we would a model, but this is for consistent preprocessing and not prediction.
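
To make the `fit`/`transform` split concrete, here is a toy sketch. `MeanCenterer` is a hypothetical class invented for illustration, not part of `sklearn`. It "trains" by remembering the column means, then applies those same means to any data it is given.

```python
import numpy as np


class MeanCenterer:
    """Hypothetical toy transformer: subtract the means seen in `fit`."""

    def fit(self, X, y=None):
        # "Training" just records a statistic of the data.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the stored statistic; nothing is re-learned here.
        return np.asarray(X, dtype=float) - self.mean_


train = [[1.0], [2.0], [3.0]]
test = [[4.0]]

centerer = MeanCenterer().fit(train)
train_centered = centerer.transform(train)
test_centered = centerer.transform(test)
print(train_centered.ravel())  # [-1.  0.  1.]
print(test_centered.ravel())   # [2.]
```

The test value is shifted by the mean of the train data, not its own, which is the "consistent preprocessing" we want.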

### Transforming text to numbers

`OrdinalEncoder` is the `sklearn` class for transforming categories to numbers.

In [3]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
encoder.fit(X)
encoder.transform(X)

array([[  300.],
       [22675.],
       [22956.],
       ...,
       [16257.],
       [ 4455.],
       [ 2065.]])

Turns out transforming data is so common, there's a shorthand method for fitting and transforming in one go, `fit_transform`.

In [4]:
encoder = OrdinalEncoder()
encoder.fit_transform(X)

array([[  300.],
       [22675.],
       [22956.],
       ...,
       [16257.],
       [ 4455.],
       [ 2065.]])

Let's give our model another try.

In [5]:
X_ordinal = encoder.fit_transform(X)
oner = OneR().fit(X_ordinal, y)
oner.score(X_ordinal, y)

1.0

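That's 100% accurate! Well, it's 100% accurate on our _training data_, which we don't really care about. Since every unique review is its own category, the model has simply memorized the label for each one. How does it perform on our test data?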

### Do unto the test data as you have done unto the train data

Before we can score the model on the test data, we need to apply the same transforms. Again, `sklearn` has some niceties that make it easy to enforce the same process on all data.

#### Pipelines

[Pipelines](https://scikit-learn.org/stable/modules/compose.html#combining-estimators) offer convenient ways to string preprocessing and models together into one object that can be trained end to end and then make predictions on different data following the same process. The input to pipelines is a sequence of transformers with an optional predictor as the last element. Once the pipeline is set up it has the same methods for training and predicting as a regular `sklearn` model.

In [6]:
# | output: false
from sklearn.pipeline import Pipeline

# Lay out the sequence of operations. The first element transforms
# the categories to numbers. The second element is our model. Each
# element is a tuple where the first element of the tuple is the
# name of the step and the second element is the transformer or model.
steps = [
    ("categorical_transform", OrdinalEncoder()),
    ("model", OneR()),
]
pipeline = Pipeline(steps)
# Now train the model with the original data as input.
# No need to fit the encoder ourselves!
pipeline.fit(X, y)
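
To see what "trained end to end" buys us, here is a small sketch that does the same two steps by hand and with a pipeline. It uses made-up toy data and swaps in `DummyClassifier(strategy="most_frequent")` for our model so the example stands on its own. The step names are arbitrary.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up toy data: one categorical column and a binary label.
X_toy = np.array([["horror"], ["comedy"], ["horror"], ["action"]])
y_toy = np.array([0, 1, 0, 0])

# By hand: fit the transformer, transform, then fit the model.
encoder = OrdinalEncoder().fit(X_toy)
model = DummyClassifier(strategy="most_frequent")
model.fit(encoder.transform(X_toy), y_toy)
manual = model.predict(encoder.transform(X_toy))

# With a pipeline: one object runs the same steps in order.
pipe = Pipeline(
    [
        ("encode", OrdinalEncoder()),
        ("model", DummyClassifier(strategy="most_frequent")),
    ]
)
piped = pipe.fit(X_toy, y_toy).predict(X_toy)

print(np.array_equal(manual, piped))
```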

Now let's try predicting on one review. Our pipeline will handle the preprocessing and prediction for us.

In [7]:
X_test, y_test = test_df[features], test_df[label]
pipeline.predict(X_test.head(1))

ValueError: Found unknown categories ["1st watched 2/9/2008, 4 out of 10(Dir-J.S. Cardone): Sexual political thriller that doesn't really succeed in any of these areas very well except early on where there are some interesting soft-core scenes. The movie starts off portraying a couple exploring their sexual fantasies amidst their work environments or wherever and whatever suits their fancy. The couple takes an excursion to a retreat and bathhouse where they run into a woman that's willing to be a part of a three-some and fulfill some of their fantasies. At this point, we only know that this couple is well off but we don't know until they return that the fiancé is part of a well-to-do political family. The man hopes to be on the rise to the point of possibly getting a congressional seat after the marriage. They then receive a package in the mail from an anonymous source with explicit pictures of their encounter at the bath house and their qwest begins as to how and why they were filmed, who sent the package, what they want, and how to clear their names before any of this gets out. This qwest becomes an obsession that leads them deeper into seedier worlds and takes a lot of their time, to the point where their friends & family wonder what they're doing all day and why they look rundown all the time. This movie is interesting at times but drifts into ridiculousness as they personally seek out the problem instead of getting the police involved early on because of their pride. This mistake, of course, keeps the movie going. The performances are fine despite the no-name cast but the lunacy of the situation overrides and the movie starts to become ho-hum about ½ the way through. And of course, they throw in a twist at the end that defies and challenges everything that happened prior(as is the norm these days when they don't know what else to do to spice up the movie). This doesn't help this movie one bit, though."] in column 0 during transform

More errors. This is becoming a theme for this chapter. The error says the `transform` method found an unknown category. It turns out there are reviews in the test set that don't appear in the train set.

So when `OrdinalEncoder` tries to transform unseen reviews in the test set, it doesn't know what category to assign them, since it was only fitted on the train set. The `sklearn` developers have thought of a way to handle this, though: assigning all unknown categories to one unknown value.

In [8]:
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
encoder.fit(X)
encoder.transform(X_test.head(1))

array([[-1.]])

Voila, it's fixed. But this raises another problem. Our encoder can handle unknown categories, but what about our model? It can't, so we need to provide a fallback prediction for unknown categories. The simple solution is to use the baseline classifier trained on all the train data to predict unknown categories.

In [9]:
class OneR(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        """Find the most predictive rule."""
        # Sanity check on `X` and `y`.
        X, y = validate_data(self, X, y)
        predictors = {}
        # Get the unique categories from the first column.
        categories = np.unique(X[:, 0])
        for category in categories:
            # Create a conditional array where each index
            # is a boolean indicating if that index in the
            # first column of `X` is the category we're iterating
            # over.
            is_category = X[:, 0] == category
            # Grab all data points and labels in this category.
            _X = X[is_category]
            _y = y[is_category]
            # Train a baseline classifier on the category.
            predictors[category] = DummyClassifier().fit(_X, _y)
        self.predictors_ = predictors
        # Create a fallback predictor for unknown categories.
        self.unknown_predictor_ = DummyClassifier().fit(X, y)
        return self

    def predict(self, X):
        """Predict the labels for inputs `X`."""
        # Sanity check on `X`.
        # `reset` should be `True` in `fit` and `False` everywhere else.
        X = validate_data(self, X, reset=False)
        # Create an empty array that will hold the predictions.
        rv = np.zeros(X.shape[0])
        # Get the unique categories from the first column.
        categories = np.unique(X[:, 0])
        for category in categories:
            # Create a conditional array where each index
            # is a boolean indicating if that index in the
            # first column of `X` is the category we're iterating
            # over.
            is_category = X[:, 0] == category
            # Grab all data points in this category.
            _X = X[is_category]
            # Predict the label for all datapoints in `_X`.
            try:
                predictions = self.predictors_[category].predict(_X)
            except KeyError:
                # Fallback to the predictor for unknown categories.
                predictions = self.unknown_predictor_.predict(_X)
            # Assign the prediction for this category to
            # the corresponding indices in `rv`.
            rv[is_category] = predictions
        return rv

Now we'll recreate the pipeline with our new model.

In [10]:
steps = [
    (
        "categorical_transform",
        OrdinalEncoder(
            handle_unknown="use_encoded_value", unknown_value=-1
        ),
    ),
    ("model", OneR()),
]
pipeline = Pipeline(steps)
pipeline.fit(X, y)
pipeline.predict(X_test.head(1))

array([1.])

Our model predicted the first review in the test set has a positive sentiment, which means it's predicting something, so it must be working!

## Evaluating our model

Our preprocessing is locked in and our model is trained. Now we can score our model on the test set.

In [11]:
pipeline.score(X_test, y_test)

0.5011190233977619

That's a lot worse than the 100% accuracy we got on the train data. How does this compare to the baseline model?

In [12]:
from nlpbook import model_results

model_results["baseline"]

0.5011190233977619

Damn, that's the same accuracy as the baseline model. Why is that?

Turns out all the reviews in the test set are different from the reviews in the train set, so this model devolves to the baseline model, giving us the same result. 

In [13]:
set(train_df["review"]) & set(test_df["review"])

set()

It's unfortunate we didn't see an improvement in performance, but because we are comparing to a baseline, we know where we stand. In the next chapter we'll mostly ignore the OneR algorithm and focus on improving performance by creating a richer representation of the input data.

## Rolling your own transformer

Much like our model, we can write transformers that slot into the `sklearn` API. All we need to do is implement `fit` and `transform` methods. These follow all the same guidelines outlined in @sec-scikit-learnifying-our-model. Let's make our own `OrdinalEncoder` called `CategoricalEncoder`.[The `transform` method heavily uses [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html), which is one way `numpy` speeds up operations on arrays.]{.aside}

In [102]:
from sklearn.base import BaseEstimator, TransformerMixin


class CategoricalEncoder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        """Generate numeric categories from `X`.

        Note: All `fit` methods must accept a `y` argument whether they
              use it or not. Transformers typically ignore this argument
              whether it's passed in or not.
        """
        # Validate the data as before. `skip_check_array=True` tells
        # `validate_data` not to convert `X` to a numeric array type.
        # This is important since we have to deal with numeric or text
        # types.
        X = validate_data(self, X, skip_check_array=True)
        try:
            # Since `validate_data` did not convert `X` to a numeric
            # array, we need to convert it to a matrix if it's still
            # a `DataFrame`.
            X = X.to_numpy()
        except AttributeError:
            # This is not a `DataFrame`. Assume it's a `numpy` or
            # `scipy` array.
            pass

        categories = []
        # Iterate over each column in `X`.
        for column in X.T:
            # Get all unique values in the column.
            values = np.unique(column)
            # Store the unique values as the ith element in the array.
            categories.append(values)
        # Save the categories on the transformer.
        self.categories_ = categories
        return self

    def transform(self, X):
        """Return the categorical values."""
        # Validate the data as before. `skip_check_array=True` tells
        # `validate_data` not to convert `X` to a numeric array type.
        # This is important since we have to deal with numeric or text
        # types.
        X = validate_data(self, X, skip_check_array=True, reset=False)
        try:
            # Since `validate_data` did not convert `X` to a numeric
            # array, we need to convert it to a matrix if it's still
            # a `DataFrame`.
            X = X.to_numpy()
        except AttributeError:
            # This is not a `DataFrame`. Assume it's a `numpy` or
            # `scipy` array.
            pass

        # Create an array with the same shape as `X` to store the
        # categorical values. An unknown category, `-1`, is used as
        # the default value.
        rv = np.full(X.shape, -1)
        # Iterate over each column in `X`.
        for i, x in enumerate(X.T):
            # Grab the categories for the ith column of `X`.
            categories = self.categories_[i]
            # Reshape the column to be an Nx1 matrix. Using a matrix
            # instead of an array allows us to leverage `numpy`
            # broadcasting.
            x = x.reshape(-1, 1)
            # Create a boolean matrix where `True` values indicate the
            # index of `categories` that equals `x` for each row.
            is_category = x == categories
            # Find the indices of `x` that contain known categories.
            # This tells us which rows have known categories.
            known_category = is_category.any(axis=1)
            # Get the index of the `True` value in each row. This is
            # the numeric value for the category.
            category_value = np.where(is_category)[1]
            # Assign the category index to the appropriate rows.
            rv[known_category, i] = category_value
        return rv

Our very first transformer! Let's run it.

In [98]:
cat_encoder = CategoricalEncoder()
cat_encoder.fit_transform(X)

array([[  300],
       [22675],
       [22956],
       ...,
       [16257],
       [ 4455],
       [ 2065]])

Beautiful, we've successfully recreated `OrdinalEncoder`. And we should get `-1` for the test data.

In [99]:
cat_encoder.transform(X_test)

array([[-1],
       [-1],
       [-1],
       ...,
       [-1],
       [-1],
       [-1]])
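
To see why unseen values come out as `-1`, here is the broadcasting comparison from `transform` on a single toy column. The category values are made up, and the variable names mirror the ones in `CategoricalEncoder` only for readability.

```python
import numpy as np

# Categories learned in `fit` (toy values) and a new column to encode.
categories = np.array(["action", "comedy", "horror"])
x = np.array(["horror", "western", "action"]).reshape(-1, 1)

# Broadcasting compares the (3, 1) column against the (3,) categories,
# producing a (3, 3) boolean matrix: rows are inputs, columns are
# categories.
is_category = x == categories
known_category = is_category.any(axis=1)

# Start from -1 (unknown) and fill in the rows that matched a category.
encoded = np.full(x.shape[0], -1)
encoded[known_category] = np.where(is_category)[1]
print(encoded)  # [ 2 -1  0]
```

"western" matches no column, so its row is all `False` and it keeps the default `-1`.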

Let's use it in a pipeline!

In [101]:
steps = [
    ("categorical_transform", CategoricalEncoder()),
    ("model", OneR()),
]
pipeline = Pipeline(steps)
pipeline.fit(X, y)
pipeline.score(X_test, y_test)

0.5011190233977619

Heck yeah it worked.

## Multiclass classification

We know every review in the test data will have an unknown category, so the OneR model will fall back to the baseline classifier for every review, meaning we'll get the same result we did in the last chapter.