# Evaluating machine learning models

Machine learning involves automatically learning how to compute functions from examples.  There are several ways that this process can go wrong, including:

0. Overfitting the training examples,
1. Optimizing for the wrong objective,
2. Starting with the wrong features,
3. Data drift (treated in another notebook),
4. ...


In [59]:
import pandas as pd
data = pd.read_parquet("data/training.parquet")

# Overfitting

The first concern we'd like to address is _overfitting_, in which we have a model whose performance on training examples is materially different from its performance in production.  We'll see that in action with a simple example.  First, let's choose some of our data as training examples:

In [60]:
overfit_training = data.sample(1000)
overfit_training.head(10)

Unnamed: 0,index,label,text
33630,13630,spam,It ain't healthy but darn is it good. The soup...
14768,14768,legitimate,"No second attachment, the only thoroughly natu..."
34240,14240,spam,It's possible that I'll eventually figure out ...
30075,10075,spam,IT IS THE SKELETON PAPER CUPS! I absolutely sw...
24878,4878,spam,CANNOT BEAT IT! Some years ago I switched to t...
27309,7309,spam,When I got this item today and I received it o...
33151,13151,spam,And they don't leave stains and three the pric...
1232,1232,legitimate,I can not express how much I felt. She could s...
16593,16593,legitimate,They remained together at the Park soon after ...
38255,18255,spam,The vanilla cookie/cracker with creme de cocoa...


We'll "train" a very simple "model" from these examples:  we'll just memorize hashes of every example so we can look up whether a given example is legitimate or not.  We'll program defensively, too:  if we don't find an example in either set, we'll call it legitimate if it has an even number of characters.

In [61]:
class OverfittingSpamModel(object):
    def __init__(self):
        self.legit = set()
        self.spam = set()
    
    def fit(self, df):
        for tup in df.itertuples():
            if tup.label == "legitimate":
                self.legit.add(hash(tup.text))
            else:
                self.spam.add(hash(tup.text))
    
    def predict(self, text):
        h = hash(text)
        if h in self.legit:
            return "legitimate"
        elif h in self.spam:
            return "spam"
        else:
            return (len(text) % 2 == 0) and "legitimate" or "spam"

osm = OverfittingSpamModel()
osm.fit(overfit_training)

We can try this out with some of our training examples to see how well it works:

In [62]:
for row in overfit_training.sample(10).itertuples():
    print("text is '%s...': actual label is %s; predicted label is %s" % (row.text[0:20], row.label, osm.predict(row.text)))

text is 'I have two dogs and ...': actual label is legitimate; predicted label is legitimate
text is 'This is my new hambu...': actual label is spam; predicted label is spam
text is 'Well, I was so aston...': actual label is legitimate; predicted label is legitimate
text is 'You were disgusted w...': actual label is legitimate; predicted label is legitimate
text is 'On this product they...': actual label is legitimate; predicted label is legitimate
text is 'The two friends who ...': actual label is legitimate; predicted label is legitimate
text is 'And keeps coffee tab...': actual label is spam; predicted label is spam
text is 'They were too much i...': actual label is legitimate; predicted label is legitimate
text is 'But they have a rich...': actual label is legitimate; predicted label is legitimate
text is 'I used these k-cups....': actual label is spam; predicted label is spam


This model appears to work really well!  We can test it on the whole set of examples:

In [63]:
def model_accuracy(osm, df):
    correct = 0
    incorrect = 0
    for row in df.itertuples():
        if row.label == osm.predict(row.text):
            correct += 1
        else:
            incorrect += 1
    
    if correct + incorrect == 0:
        return 100
    
    return (float(correct) / float(correct + incorrect) * 100)

In [64]:
model_accuracy(osm, overfit_training)

100.0

Our model is enormously successful!  It has one hundred percent accuracy.  We probably expected this result, but it's always nice when things work out as you expected they would.  Let's see how well our model has generalized to data it _hasn't_ seen by testing it on the rest of our dataset (39,000 more examples).

In [65]:
model_accuracy(osm, data)

51.60249999999999

Uh oh!  It appears that our model is not much better than a coin toss once it's running on data it hasn't already seen.  If we had put this model in production, our application surely wouldn't have performed well.

We want a way to identify this problem before we put a model into production, and you've essentially seen it already:  when we train a model, we don't use all of the data we have available to us.  Instead, we divide our examples into distinct _training_ and _test_ sets, usually with about 70% of the examples in the former and 30% in the latter.  
The training algorithm only considers the examples in the training set.  After training our model, we can evaluate its performance on both the training set (which it saw during training) and the test set (which it didn't).  If the performance is materially different on the different sets, we know that we've overfit the data when training our model.

Next up, we'll deal with the question of what metrics we should use to evaluate our performance.  We used accuracy above, but is it always the best option?

# Evaluation metrics and types of error

Our training data set is _balanced_ between classes -- there are equal numbers of legitimate and spam documents.  But data in the real world are typically not balanced, and are often wildly unbalanced.  For example:

- The worldwide incidence of Rh-negative blood types is approximiately 6 percent;
- Between one and three percent of actual consumer payments transactions are fraudulent; and
- A rare disease may have an incidence rate on the order of one in ten thousand per year.

In cases like these, it would be possible to develop an accurate model that wouldn't produce meaningful results; for example:

- A blood type tester that always returned "Rh-positive" would be accurate roughly 94% of the time on a sufficiently diverse population;
- A fraud detector that always returned "not fradulent" would be accurate between 97-99% of the time -- until, that is, fraudsters determined that their charges would likely go through, increasing the rate of fraudulent charges; and
- A technique to screen for a very rare disease could be quite accurate by simply never identifying disease.

In many applications, we're not only interested in correctly identifying members of one class, we're interested in correctly identifying members of both classes.  We can capture this behavior by using better metrics than accuracy.

To learn about these metrics, let's start with an unbalanced data set, in which 90% of the messages are spam.

In [66]:
legit_sample = data[data.label == 'legitimate'].sample(2000)
spam_sample = data[data.label == 'spam'].sample(18000)
unbalanced = pd.DataFrame.append(legit_sample, spam_sample)

To avoid overfitting, we'll split the unbalanced data set into training and test sets, using functionality from scikit-learn.

In [67]:
from sklearn.model_selection import train_test_split
unbalanced_train, unbalanced_test = train_test_split(unbalanced, test_size=0.3)

We'll now create a simple model that should work pretty well for spam messages but not necessarily as well for legitimate ones.  

In [68]:
from collections import defaultdict
import re
    
class SensitiveSpamModel(object):
    
    def __init__(self):
        self.legit = set()
        self.spam = set()
    
    def fit(self, df):
        """ Train a model based on the most frequent unique 
            words in each class of documents """
        legit_words = defaultdict(lambda: 0)
        spam_words = defaultdict(lambda: 0)
        
        for tup in df.itertuples():
            target = spam_words
            if tup.label == "legitimate":
                target = legit_words
            for word in re.split(r"\W+", tup.text):
                if len(word) > 0:
                    target[word.lower()] += 1
        
        # remove words common to both classes
        for word in set(legit_words.keys()).intersection(set(spam_words.keys())):
            del legit_words[word]
            del spam_words[word]
        
        top_legit_words = sorted(legit_words.items(), key=lambda kv: kv[1], reverse=True)
        top_spam_words = sorted(spam_words.items(), key=lambda kv: kv[1], reverse=True)
        
        # store ten times as many words from the spam set
        self.legit = set([t[0] for t in top_legit_words[:100]])
        self.spam = set([t[0] for t in top_spam_words[:1000]])
            
    def predict(self, text):
        legit_score = 0
        spam_score = 0
        
        for word in re.split(r"\W+", text):
            w = word.lower()
            if word in self.legit:
                legit_score = legit_score + 1
            elif word in self.spam:
                spam_score = spam_score + 1
        
        # bias results towards spam in the event of ties
        return (legit_score > spam_score) and "legitimate" or "spam"

ssm = SensitiveSpamModel()
ssm.fit(unbalanced_train)

Let's check the accuracy on our training sample.

In [69]:
model_accuracy(ssm, unbalanced_train)

94.21428571428572

To make sure we've not overfit our training sample, let's check the accuracy on the test sample.

In [70]:
model_accuracy(ssm, unbalanced_test)

92.33333333333333

That's not quite as good as the results on our training sample, but it's still pretty decent (that is, it's better than just always returning "spam" would be given the balance of the classes).  

However, we get a different picture if we look at our model's performance on the balanced data.

In [71]:
model_accuracy(ssm, data)

63.5375

And the accuracy is even worse if we look at a sample where the label balance is reversed (i.e., only 10% of documents are spam):

In [72]:
legit_sample = data[data.label == 'legitimate'].sample(900)
spam_sample = data[data.label == 'spam'].sample(100)
legit_biased = pd.DataFrame.append(legit_sample, spam_sample)
model_accuracy(ssm, legit_biased)

32.6

We'd like to understand the performance of our model with some metric that captures not only the overall accuracy but the accuracy for positive cases and the accuracy for negative cases.  That is, if we assume that our goal is to identify spam documents, we care about:

- _true positives_, which are spam documents that our model predicts as spam;
- _true negatives_, which are legitimate documents that our model predicts as legitimate;
- _false positives_, which are legitimate documents that our model predicts as spam; and
- _false negatives_, which are spam documents that our model predicts as legitimate

The proportions between these quantities can provide interesting metrics.  For example, the ratio of true positives to actual positives (that is, true positives + false negatives) is called _recall_, which indicates the percentage of spam documents we've selected.  The ratio of true positives to all predicted positives (that is, true positives + false positives) is called _precision_, which indicates the percentage of predicted spam documents that are actually spam.  Ideally, a good classifier would have both high precision and high recall, but in some applications either precision or recall is more important.