We'll start simple. So simple you may wonder if it's AI at all. When building models from scratch, I like to start with the easiest thing possible and iterate. Before we write any code, let's define what machine learning and machine learning models are.

## What is machine learning?

Machine learning is a field of artificial intelligence focused on algorithms that 1) learn from data and 2) generalize to unseen data. There are three main components to machine learning: data, models, and algorithms. Models are the things that do the learning, and algorithms direct the model's learning. The lines between models and algorithms blur at times, and some algorithms only work for some models. At their core, models are functions. They take an input, process it in some way, and return an output. Models have learnable parameters, and machine learning algorithms focus on adjusting these parameters to make the model produce better outputs.

Conceptually, models are simple. Take the equation for a line, `f(x) = m*x + b`. This is a model where `m` and `b` are constant values indicating the slope and y-intercept of the line. `x` is the input value and `y = f(x)` is the output. If we know `m` and `b`, we can compute `y` for any `x`. If we don't know `m` and `b`, machine learning comes in. Assuming we have a bunch of `(x, y)` points, a machine learning algorithm can estimate good values for `m` and `b` based on those points. Then we freeze those values and the model can predict `y` for any `x`.
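To make this concrete, here is a minimal sketch of "learning" `m` and `b` from points. The data is made up for illustration (five points on the line `y = 2x + 1`), and I'm using NumPy's least-squares `np.polyfit` as a stand-in for the learning algorithm; it is just one of many ways to fit a line.

```python
import numpy as np

# Hypothetical data: five points that lie exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a line) by least squares.
# np.polyfit returns coefficients from highest degree to lowest: [m, b].
m, b = np.polyfit(x, y, deg=1)


def f(x_new):
    """The frozen model: m and b are now fixed constants."""
    return m * x_new + b


# Predict y for an x that was not in the data.
prediction = f(10.0)
print(m, b, prediction)
```

Once `m` and `b` are estimated, `f` is an ordinary equation again and we can evaluate it anywhere.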

Models can be giant equations and it's easy to get lost in the details, but remember _it's still just an equation!_ The learning algorithm will find the right constant terms for us. It's our job to set up the problem for the algorithm and model and then get out of the way so the machine can learn.

::: {.callout-note}
In machine learning, the constant terms of the equation are called parameters or weights. During training they are not constant as the learning algorithm is trying to find their optimal values, but once training is done these values become constant and you're left with a normal equation.
:::

### The learning process

At its core, the learning process iterates through three steps: predicting, comparing, and tweaking the model parameters. The training data is used throughout. We initialize the model, then use it as is to predict the outputs on the training data. These outputs are compared to the actual outputs, or targets, of the training data. If we are happy with the comparison, training is done. Otherwise, the model parameters are adjusted and the process begins again. This whole process is outlined below.

```{mermaid}
flowchart TB
  B[Predict output for train data] --> C[Compare predictions to targets]
  C --> D{Predictions good enough?}
  D -- No --> E[Update model weights]
  D -- Yes --> F[Freeze model for production]
  E --> B
```
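The loop in the diagram can be sketched in a few lines for the line model `f(x) = m*x + b`. This is only an illustration under some assumptions of mine: the data is made up, the comparison step uses mean squared error, the update step uses plain gradient descent, and the learning rate and "good enough" threshold are arbitrary choices.

```python
import numpy as np

# Hypothetical training data on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, b = 0.0, 0.0        # initialize the model parameters
learning_rate = 0.05   # assumed step size
tolerance = 1e-6       # assumed "good enough" threshold

for step in range(10_000):
    pred = m * x + b               # predict output for the train data
    error = pred - y
    loss = np.mean(error ** 2)     # compare predictions to targets (MSE)
    if loss < tolerance:           # predictions good enough?
        break                      # yes: stop and freeze the model
    # No: update the weights by stepping against the gradient of the MSE.
    m -= learning_rate * 2 * np.mean(error * x)
    b -= learning_rate * 2 * np.mean(error)

print(step, m, b, loss)
```

Each pass through the loop is one trip around the flowchart: predict, compare, and either stop or update.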

## The simplest of models

Turning our attention back to our model, the simplest thing we can do is predict the same thing for every input. This may seem silly, but it gives us a measuring stick for comparing other models. If we use the latest and greatest techniques in deep learning, the resulting model should outperform this baseline. The only way we'll know that it does is by building the baseline and testing it. If the fancier model doesn't outperform the baseline, that is cause for concern and the results should be investigated. So we start simple and incrementally improve until we are satisfied.

### Train vs. test datasets

Now that we have a model in mind, we need to train it and test it. While training a model is the fun part, that's not really what we care about. We want to know how well our model works, which is why we have a test set. The test set is used exclusively to benchmark the performance of the model and is not used in the training process. This comes back to making models that generalize to unseen data. Models generally perform better on the data they're trained on because that data is what the model is optimized to predict. If we use that same data to test the model, we'll likely be overconfident in our model's predictions. When we deploy the model and it is used on new data it hasn't seen, we'll be in for a rude awakening. So we separate our dataset into training and testing datasets. In practice, care should go into how these datasets are curated[I highly recommend this [blog post](https://www.fast.ai/posts/2017-11-13-validation-sets.html) on the dangers of blindly splitting your data. It is about validation sets, but the same concepts apply to test sets. It's ok if some of the content is over your head as we'll revisit this in a later chapter.]{.aside}, but we won't worry about that here since I've already cleaned our datasets.
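For intuition only, here is a sketch of the most naive way to split a dataset: shuffle the row indices and hold out a fraction for testing. The dataset size, the 20% holdout, and the seed are assumptions for illustration. This is not how the datasets in this chapter were prepared, and it is exactly the kind of blind split the post in the aside warns about.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is repeatable

n_samples = 10                       # hypothetical dataset size
indices = rng.permutation(n_samples) # shuffled row indices 0..9

n_test = n_samples // 5              # hold out 20% of the rows for testing
test_idx = indices[:n_test]
train_idx = indices[n_test:]

# Every row lands in exactly one of the two sets.
print(sorted(train_idx), sorted(test_idx))
```

The key property is that the two index sets never overlap, so the model is never tested on a row it was trained on.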

Let's train our model and predict the labels of the test set.

```python
import numpy as np
from nlpbook import get_train_test_data

train_df, test_df = get_train_test_data()
X_train, y_train = train_df[['review']], train_df['label']
predicted_output = y_train.value_counts().idxmax()
# `predicted_output` is the most common label in the train set.
pred = np.full(len(test_df), predicted_output)
pred
```

```
array([1, 1, 1, ..., 1, 1, 1])
```

Our predictions on the test set are all 1 because that's the most common label in the train set. Now how do we measure this model's performance?

## Metrics, metrics, metrics



```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_array

from nlpbook import get_train_test_data


class BaselineClassifier(ClassifierMixin, BaseEstimator):
    """Predicts the most common training label for every input."""

    def fit(self, X, y):
        assert len(X) == len(y)
        y = check_array(y, ensure_2d=False)
        # Record each label and the fraction of the train set it makes up.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.proba_ = counts / len(y)
        return self

    def predict(self, X):
        # Ignore the inputs and return the most frequent label for each row.
        return np.full(len(X), self.classes_[np.argmax(self.proba_)])


train_df, test_df = get_train_test_data()
e = BaselineClassifier().fit(train_df[['review']], train_df['label'])
e.score(test_df[['review']], test_df['label'])
```

```
0.5011190233977619
```
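The `score` method used above is inherited from scikit-learn's `ClassifierMixin`, which reports mean accuracy: the fraction of predictions that match the true labels. Here is a minimal sketch of that calculation by hand. The labels below are made up for illustration and are not from our dataset.

```python
import numpy as np

# Hypothetical true labels for five test examples.
y_true = np.array([1, 0, 1, 1, 0])

# A baseline that predicts 1 for every example.
y_pred = np.full(len(y_true), 1)

# Accuracy: the fraction of predictions that equal the true label.
accuracy = np.mean(y_pred == y_true)
print(accuracy)
```

Read this way, a score of about 0.50 means the baseline is right on roughly half of the test examples.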