Moving beyond the [simplest model](../02_baseline/baseline_classifier.ipynb), the next step is a deep neural network. Just kidding, the next step is something slightly more complex but still very simple. OneR [@Holte1993] is such a model, but before we get to the nitty-gritty details, we need to learn about features and feature engineering, concepts that are integral to machine learning.

## Features

Features are just inputs to our model, so our data naturally contains features. Our data also contains observations that we want the model to predict; these are not features. There may also be additional information in the data that isn't used by our model.

In [1]:
from nlpbook import get_train_test_data

train_df, test_df = get_train_test_data()
train_df.columns

Index(['id', 'movie_id', 'rating', 'review', 'label'], dtype='object')

There are five columns in our dataframe. Which ones are features? Which ones are observations? Which are additional information that won't be used for training?

Let's break it down column by column.

### `id`

This is the review ID. It is not useful as a feature. Since each ID is unique, there is no pattern a model can learn from it that carries over to new reviews.

### `movie_id`

Here things get tricky. The movie ID indicates which movie the review is for. Depending on your goal, this could be a feature. Recommendation systems may include a bias for each movie since not all movies are created equal. For example, _Die Hard_ is a highly rated action movie, and it makes sense to recommend that movie to fans of the action genre over other, lower-rated action movies.

We are interested in classifying the sentiment based on the content of the review. In this case, the movie ID can unfairly influence the model to ignore the review and classify the sentiment based on the movie (e.g. all _Die Hard_ reviews are positive). In theory that sounds fine, but how will your model handle a movie it wasn't trained on? If the model learns to ignore the review and base its output on the movie ID, then it will be terrible at predicting the sentiment of reviews for movies not seen in the training data. We will exclude `movie_id` as a feature for the sentiment classification task.

### `rating`

For multiclass classification, this is the value the model is predicting, making it an observation, not a feature.

But what about when we are trying to predict `label`? Could `rating` be a feature then? Unfortunately not: the `label` column is computed from the rating. Any model worth its salt will figure that out and make predictions based on the rating instead of the review.
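
One way to spot this kind of leakage is to check whether each rating maps to exactly one label. The sketch below uses a toy dataframe with made-up values; the relationship between rating and label shown here is an assumption for illustration, not the rule used to build the real dataset.

```python
import pandas as pd

# Toy data with made-up values; the real rating-to-label rule may differ.
toy_df = pd.DataFrame(
    {
        "rating": [1, 2, 3, 8, 9, 10, 2, 9],
        "label": [0, 0, 0, 1, 1, 1, 0, 1],
    }
)

# Count the distinct labels seen for each rating.
labels_per_rating = toy_df.groupby("rating")["label"].nunique()

# If every rating has exactly one label, the rating alone determines the label.
rating_determines_label = bool((labels_per_rating == 1).all())
print(rating_determines_label)
```

Running the same check on `train_df` should flag `rating` as a column to keep away from a model that predicts `label`.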

### `review`

This is our input, therefore it's a feature.

### `label`

In the binary classification problem, this is what we expect the model to predict, so it is an observation, not a feature.

## Feature Types

### Categorical

Categorical features can be grouped by a value. Movie genre is such an example, where the genre falls into categories like action, horror, comedy, etc. Order doesn't matter, uniqueness does. We can think of the `review` column as such a feature. It's a `str` type so each unique review is a category value. Integers can also be used as category values.
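
As a small illustration, pandas can represent a categorical feature and assign each category an integer code. The genre values below are made up for the example, and the codes are arbitrary identifiers: their order carries no meaning.

```python
import pandas as pd

# Made-up genre values for illustration.
genres = pd.Series(["action", "horror", "comedy", "action", "comedy"])

# Convert the strings to a categorical dtype.
genre_cat = genres.astype("category")

# The distinct category values (pandas sorts them when it infers them).
categories = genre_cat.cat.categories.tolist()

# One integer code per row, pointing into `categories`.
codes = genre_cat.cat.codes.tolist()

print(categories)
print(codes)
```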

### Numeric

As the name suggests, these features are numbers. They are typically continuous values where order matters, like age or income.

## OneR Algorithm

Now we can get down to business and build a new model! In the last chapter, the model always predicted the most common label in the training set. This model instead predicts the most common label for each _category_ of a feature.
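
To make the idea concrete, here is a minimal sketch of the per-category rule in plain Python. It uses a hypothetical `genre` feature with made-up labels rather than the review data, and it is only meant to show the counting step, not a full implementation.

```python
from collections import Counter, defaultdict

# Toy (category, label) pairs with made-up values.
genres = ["action", "action", "action", "horror", "horror", "comedy"]
labels = [1, 1, 0, 0, 0, 1]

# Tally the labels observed for each category.
counts = defaultdict(Counter)
for genre, label in zip(genres, labels):
    counts[genre][label] += 1

# Each category predicts its own most common label.
rule = {genre: tally.most_common(1)[0][0] for genre, tally in counts.items()}
print(rule)
```

Here `action` maps to 1 because two of its three examples are labeled 1, while `horror` maps to 0.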

In [2]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.dummy import DummyClassifier

class OneR(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        """Train the model with inputs `X` on labels `y`."""
        self.predictors_ = {}
        X, y = np.array(X), np.array(y)
        # Fit one most-common-label predictor per unique input value.
        for x in np.unique(X):
            is_x = X == x
            self.predictors_[x] = DummyClassifier().fit(X[is_x], y[is_x])

        return self

    def predict(self, X):
        """Predict the labels for inputs `X`."""
        X = np.array(X)
        return np.array([self.predictors_[x].predict([x])[0] for x in X])

OK, our model is ready to train. Let's grab our feature and label and get to it.

In [3]:
feature = "review"
label = "label"

X_train, y_train = train_df[feature], train_df[label]

oner = OneR().fit(X_train, y_train)
oner.score(X_train, y_train)

1.0

That's 100% accurate! Well, it's 100% accurate on our _training data_, which we don't really care about. How does it perform on our test data?

In [4]:
X_test, y_test = test_df[feature], test_df[label]
oner.score(X_test, y_test)

KeyError: "1st watched 2/9/2008, 4 out of 10(Dir-J.S. Cardone): Sexual political thriller that doesn't really succeed in any of these areas very well except early on where there are some interesting soft-core scenes. The movie starts off portraying a couple exploring their sexual fantasies amidst their work environments or wherever and whatever suits their fancy. The couple takes an excursion to a retreat and bathhouse where they run into a woman that's willing to be a part of a three-some and fulfill some of their fantasies. At this point, we only know that this couple is well off but we don't know until they return that the fiancé is part of a well-to-do political family. The man hopes to be on the rise to the point of possibly getting a congressional seat after the marriage. They then receive a package in the mail from an anonymous source with explicit pictures of their encounter at the bath house and their qwest begins as to how and why they were filmed, who sent the package, what they want, and how to clear their names before any of this gets out. This qwest becomes an obsession that leads them deeper into seedier worlds and takes a lot of their time, to the point where their friends & family wonder what they're doing all day and why they look rundown all the time. This movie is interesting at times but drifts into ridiculousness as they personally seek out the problem instead of getting the police involved early on because of their pride. This mistake, of course, keeps the movie going. The performances are fine despite the no-name cast but the lunacy of the situation overrides and the movie starts to become ho-hum about ½ the way through. And of course, they throw in a twist at the end that defies and challenges everything that happened prior(as is the norm these days when they don't know what else to do to spice up the movie). This doesn't help this movie one bit, though."

Uh oh, there's an error! We've treated each unique review as a category and there are reviews in the test set that aren't in the train set. We need a way to handle missing categories. Let's use the baseline classifier as a fallback.

In [5]:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.dummy import DummyClassifier

class OneR(ClassifierMixin, BaseEstimator):
    def fit(self, X, y):
        """Train the model with inputs `X` on labels `y`."""
        self.predictors_ = {}
        # Added fallback for missing categories.
        self.fallback_ = DummyClassifier().fit(X, y)
        X, y = np.array(X), np.array(y)
        # Fit one most-common-label predictor per unique input value.
        for x in np.unique(X):
            is_x = X == x
            self.predictors_[x] = DummyClassifier().fit(X[is_x], y[is_x])

        return self

    def predict(self, X):
        """Predict the labels for inputs `X`."""
        X = np.array(X)
        rv = []
        for x in X:
            try:
                rv.append(self.predictors_[x].predict([x])[0])
            except KeyError:
                # Use the fallback when the category isn't
                # in `self.predictors_`.
                rv.append(self.fallback_.predict([x])[0])
        return np.array(rv)

In [6]:
oner = OneR().fit(X_train, y_train)
oner.score(X_test, y_test)

0.5011190233977619

That's a lot worse than the 100% accuracy we got on the training data. How does this compare to the baseline model?

In [7]:
from nlpbook import model_results

model_results["baseline"]

0.5011190233977619

Damn, that's the same accuracy as the baseline model. Why is that?

Turns out all the reviews in the test set are different from the reviews in the train set, so this model devolves to the baseline model, giving us the same result. It's unfortunate we didn't see an improvement in performance, but because we are comparing to a baseline, we know where we stand. In the next chapter we'll mostly ignore the OneR algorithm and focus on improving performance by creating a richer representation of the input data.
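
A quick way to confirm this kind of claim is to measure how many test values never appear in the training set. The helper below is a hypothetical sketch (the name `unseen_fraction` is ours, not part of `nlpbook`), shown here on made-up strings.

```python
def unseen_fraction(train_values, test_values):
    """Return the fraction of test values that never appear in training."""
    seen = set(train_values)
    test_values = list(test_values)
    unseen = sum(1 for value in test_values if value not in seen)
    return unseen / len(test_values)


# Made-up reviews for illustration.
toy_train = ["great movie", "terrible plot", "loved it"]
toy_test = ["great film", "awful", "loved it", "boring"]

# Three of the four toy test reviews are absent from the toy training set.
toy_result = unseen_fraction(toy_train, toy_test)
print(toy_result)
```

Given the result above, calling `unseen_fraction(X_train, X_test)` on the chapter's data would be expected to return 1.0, meaning every test review falls through to the fallback.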