We'll start simple. So simple you may wonder if it's AI at all. When building models from scratch, I like to start with the easiest thing possible and iterate. Before we write any code, let's define what machine learning and machine learning models are.

## What is machine learning?

Machine learning is a field of artificial intelligence focused on algorithms that 1) learn from data and 2) generalize to unseen data. There are three main components to machine learning: data, models, and algorithms. Models are the things that do the learning, and algorithms direct the model's learning. The lines between models and algorithms blur at times, and some algorithms only work for some models. At their core, models are functions. They take an input, process it in some way, and return an output. Models have learnable parameters, and machine learning algorithms focus on adjusting these parameters to make the model produce better outputs.

Conceptually, models are simple. Take the equation for a line, `f(x) = m*x + b`. This is a model where `m` and `b` are constant values indicating the slope and y-intercept of the line. `x` is an input value and the output is `f(x) = y`. If we know `m` and `b`, we can compute `y` for any `x`. If we don't know `m` and `b`, that is where machine learning comes in. Given a bunch of `(x, y)` points, a machine learning algorithm can estimate good values for `m` and `b` from those points. Then we freeze those values and the model can predict `y` for any `x`.
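To make the line example concrete, here is a minimal sketch of one way to estimate `m` and `b` from points: ordinary least squares, written in plain Python. The helper names `fit_line` and `predict` and the sample points are made up for illustration, and least squares is only one of many algorithms that could do this job.

```python
def fit_line(points):
    """Estimate slope m and intercept b by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: how x and y vary together, divided by how much x varies.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    m = cov_xy / var_x
    # Intercept: the fitted line passes through the mean point.
    b = mean_y - m * mean_x
    return m, b


def predict(x, m, b):
    """The model itself: f(x) = m*x + b with frozen m and b."""
    return m * x + b


# Made-up points that lie exactly on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
m, b = fit_line(points)
print(m, b)                 # 2.0 1.0
print(predict(10.0, m, b))  # 21.0
```

Once `m` and `b` are estimated, `predict` is nothing more than the original equation with those values plugged in.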

Models can be giant equations and it's easy to get lost in the details, but remember _it's still just an equation!_ The learning algorithm will find the right constant terms for us. It's our job to set up the problem for the algorithm and model and then get out of the way so the machine can learn.

## The learning process

1. Prepare data
2. Make predictions
3. Compare predictions to targets
4. Update model based on comparison
5. Go back to step 2 and repeat until satisfied.
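The five steps above can be sketched as a small loop. This example fits the line model `f(x) = m*x + b` with gradient descent on mean squared error. The data, the learning rate, the step limit, and the stopping threshold are all assumptions picked by hand for illustration, and gradient descent is just one possible way to do the update in step 4.

```python
# Step 1: prepare data. Made-up points on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

m, b = 0.0, 0.0        # learnable parameters, starting from a blind guess
learning_rate = 0.05   # assumed step size, chosen by hand
n = len(xs)

for step in range(2000):
    # Step 2: make predictions with the current parameters.
    preds = [m * x + b for x in xs]

    # Step 3: compare predictions to targets (mean squared error).
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / n

    # Step 4: update the model by nudging m and b against the gradient.
    grad_m = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2 * sum(errors) / n
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

    # Step 5: repeat until satisfied (here: the loss is tiny).
    if loss < 1e-12:
        break

print(round(m, 3), round(b, 3))
```

The loop never sees the "true" slope and intercept. It only sees how wrong its predictions are and adjusts the parameters to be a little less wrong each time.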


```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_array

from nlpbook import get_train_test_data


class BaselineClassifier(ClassifierMixin, BaseEstimator):
    """Always predict the most frequent class seen during training."""

    def fit(self, X, y):
        assert len(X) == len(y)
        y = check_array(y, ensure_2d=False)
        # Record each class and how often it appears in the training labels.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.proba_ = counts / len(y)
        return self

    def predict(self, X):
        # Ignore the inputs and return the most frequent class for every row.
        return np.full(len(X), self.classes_[np.argmax(self.proba_)])


train_df, test_df = get_train_test_data()
e = BaselineClassifier().fit(train_df[['review']], train_df['label'])
# `score` comes from ClassifierMixin and reports mean accuracy on the test set.
e.score(test_df[['review']], test_df['label'])
```

```
0.5011190233977619
```
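One way to read that score: a classifier that always predicts the same label is correct exactly as often as that label appears in the test set. The sketch below shows this with plain Python on made-up labels. The helper `majority_baseline_accuracy` and the label lists are hypothetical and are not part of `nlpbook`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Pick the most common training label and score it on the test labels."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    # Every prediction is majority_label, so a hit is any matching test label.
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up labels for illustration only.
train_labels = [1, 0, 1, 1, 0, 1]
test_labels = [1, 0, 0, 1]

label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(label, accuracy)  # 1 0.5
```

Any model we build later should beat this number, otherwise it has learned nothing beyond which label is most common.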