# Evaluating machine learning models

Machine learning involves automatically learning how to compute functions from examples.  There are several ways that this process can go wrong, including:

0. Overfitting the training examples,
1. Optimizing for the wrong objective,
2. Starting with the wrong features,
3. Data drift (treated in another notebook),
4. ...


In [2]:
import pandas as pd
data = pd.read_parquet("data/training.parquet")

# Overfitting

The first concern we'd like to address is _overfitting_, in which we have a model whose performance on training examples is materially better than its performance on data it hasn't seen, such as the data it encounters in production.  We'll see that in action with a simple example.  First, let's choose some of our data as training examples:

In [7]:
overfit_training = data.sample(1000)
overfit_training.head(10)

| | index | label | text |
|---|---|---|---|
| 7149 | 7149 | legitimate | Never had any day passed so quickly! Smells ba... |
| 14554 | 14554 | legitimate | Their respectability was as dear to her of all... |
| 10999 | 10999 | legitimate | Lady Bertram, who assured him, as soon as he k... |
| 5891 | 5891 | legitimate | It doesn't deserve one or any part of what had... |
| 27436 | 7436 | spam | Try it for yourself in this case. I can't reco... |
| 25857 | 5857 | spam | Not to mention a lot healthier all the way for... |
| 17794 | 17794 | legitimate | It will last you for yeeeaars.With a good popc... |
| 14613 | 14613 | legitimate | She wondered, and questioned him eagerly; but ... |
| 11113 | 11113 | legitimate | Her report was highly favourable. Crawford's i... |
| 16898 | 16898 | legitimate | Mrs. Norris could tolerate its being for you h... |


We'll "train" a very simple "model" from these examples:  we'll just memorize hashes of every example so we can look up whether a given example is legitimate or not.  We'll program defensively, too:  if we don't find an example in either set, we'll call it legitimate if it has an even number of characters.

In [45]:
class OverfittingSpamModel(object):
    def __init__(self):
        self.legit = set()
        self.spam = set()
    
    def fit(self, df):
        for tup in df.itertuples():
            if tup.label == "legitimate":
                self.legit.add(hash(tup.text))
            else:
                self.spam.add(hash(tup.text))
    
    def predict(self, text):
        h = hash(text)
        if h in self.legit:
            return "legitimate"
        elif h in self.spam:
            return "spam"
        else:
            # Fallback for unseen text: guess from the parity of its length.
            return "legitimate" if len(text) % 2 == 0 else "spam"

osm = OverfittingSpamModel()
osm.fit(overfit_training)

We can try this out with some of our training examples to see how well it works:

In [49]:
for row in overfit_training.sample(10).itertuples():
    print("text is '%s...': actual label is %s; predicted label is %s" % (row.text[0:20], row.label, osm.predict(row.text)))

text is 'I've given him a mix...': actual label is spam; predicted label is spam
text is 'Much too little caff...': actual label is legitimate; predicted label is legitimate
text is 'Even if they are Pro...': actual label is spam; predicted label is spam
text is 'I wish that big jug ...': actual label is spam; predicted label is spam
text is 'He dined with us the...': actual label is legitimate; predicted label is legitimate
text is 'I promised them so f...': actual label is legitimate; predicted label is legitimate
text is 'This product provide...': actual label is spam; predicted label is spam
text is 'As soon as we came t...': actual label is legitimate; predicted label is legitimate
text is 'How they could get t...': actual label is legitimate; predicted label is legitimate
text is 'Friday Evening Lady ...': actual label is legitimate; predicted label is legitimate


This model appears to work really well!  We can test it on the whole set of training examples:

In [54]:
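def model_accuracy(model, df):
    """Return the percentage of rows in df whose label the model predicts correctly."""
    correct = 0
    incorrect = 0
    for row in df.itertuples():
        if row.label == model.predict(row.text):
            correct += 1
        else:
            incorrect += 1
    
    # With no rows to score, report perfect accuracy rather than divide by zero.
    if correct + incorrect == 0:
        return 100.0
    
    return (float(correct) / float(correct + incorrect) * 100)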

In [55]:
model_accuracy(osm, overfit_training)

100.0

Our model is enormously successful!  It has one hundred percent accuracy.  We probably expected this result, but it's always nice when things work out as you expected they would.  Let's see how well our model generalizes by testing it on our full dataset, which includes 39,000 examples it _hasn't_ seen.

In [57]:
model_accuracy(osm, data)

51.6725

Uh oh!  It appears that our model is not much better than a coin toss once it's running on data it hasn't already seen.  If we had put this model in production, our application surely wouldn't have performed well.
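
Note that `model_accuracy(osm, data)` scores every row in `data`, including the rows that were sampled for training.  To score only unseen rows, one option is to drop the sampled rows by index first.  The following is a minimal sketch of that idea, not part of the original notebook:  it uses a small synthetic frame as a stand-in for `data` and assumes the index is unique.

```python
import pandas as pd

# Synthetic stand-in for the notebook's `data` frame (assumption: unique index).
data = pd.DataFrame({
    "label": ["legitimate", "spam"] * 50,
    "text": ["example message number %d" % i for i in range(100)],
})

# Draw a training sample, then keep only the rows that were NOT sampled.
training = data.sample(10, random_state=0)
held_out = data.drop(training.index)

print(len(training), len(held_out))  # prints: 10 90
```

With the names used in this notebook, the equivalent call would presumably be `model_accuracy(osm, data.drop(overfit_training.index))`.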

We want a way to identify this problem before we put a model into production, and you've essentially seen it already:  when we train a model, we don't use all of the data we have available to us.  Instead, we divide our examples into disjoint _training_ and _test_ sets, usually with about 70% of the examples in the former and 30% in the latter.  The training algorithm only considers the examples in the training set.  After training our model, we can evaluate its performance on both the training set (which it saw during training) and the test set (which it didn't).  If performance on the test set is materially worse than on the training set, that's a strong sign that we've overfit the data when training our model.
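
The split-and-compare procedure described here can be sketched as follows.  This is an illustration under stated assumptions rather than the notebook's own code:  the frame `examples` is synthetic, the seed and the helper names `predict` and `accuracy` are hypothetical, and the "model" is a simplified memorizer (a plain lookup table with a constant fallback) standing in for `OverfittingSpamModel`.

```python
import pandas as pd

# Synthetic stand-in for the notebook's data; the labels here are arbitrary.
examples = pd.DataFrame({
    "label": ["legitimate" if i % 3 else "spam" for i in range(200)],
    "text": ["message body %d" % i for i in range(200)],
})

# Shuffle once with a fixed seed, then cut at 70% to get disjoint sets.
shuffled = examples.sample(frac=1.0, random_state=42)
cut = len(shuffled) * 7 // 10
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]

# A memorizing "model": look up texts seen in training,
# and fall back to a constant guess for anything unseen.
lookup = dict(zip(train["text"], train["label"]))

def predict(text):
    return lookup.get(text, "legitimate")

def accuracy(df):
    hits = sum(predict(t) == l for t, l in zip(df["text"], df["label"]))
    return 100.0 * hits / len(df)

train_acc, test_acc = accuracy(train), accuracy(test)
print("train: %.1f%%  test: %.1f%%  gap: %.1f points"
      % (train_acc, test_acc, train_acc - test_acc))
```

Because every training text is memorized, the training accuracy is 100% by construction, while the test accuracy only reflects how often the fallback guess happens to be right.  A large gap between the two numbers is the overfitting signal.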

Next up, we'll deal with the question of what metrics we should use to evaluate our performance.  We used accuracy above, but is it always the best option?

# Evaluation metrics and types of error