# Model overfitting, validation, process models

At this point, we have had our first experience with analysing a data set with a machine learning method (e.g. kNN)
and evaluating the results.
In this module, we will discuss the problem of model overfitting and how to validate the model.
We will also introduce the CRISP-DM process model, as well as other process models, which bring much-needed structure to the data analysis process.

## Model overfitting

Whenever we apply a machine learning method to a data set, the goal is to construct a trained model that generalizes well to new, unseen data. For example, decision trees and trained neural networks are models that can be used to predict the class of a new data point. The kNN algorithm that we previously studied is conceptually slightly different: there the model is the training data itself, and the prediction is made by comparing the new data point to the training data.

Irrespective of the method, the predictions rely on the training data, or, expressed differently, on a model that grasps the essence of the training data.


The problem of model overfitting stems from the fact that a constructed model, as it grasps the characteristics of the training data, always adapts to the specific peculiarities of the training data. These peculiarities can be noise in the data, or they can be patterns that are not generalizable to new data.

Consider an extreme example where we gather a group of people (say, 30 persons) and try to predict which of them are left-handed. As explanatory variables, we might use a large number of easily measurable quantities, such as height, weight, and age. We might also include some more exotic variables, such as the number of freckles on the person's face. If we have enough variables, we can construct a model that predicts the left-handedness of every person in the group with 100% accuracy. The model might state, for example, that if you are 1.75 meters tall, weigh 70 kg, and have 10 freckles on your face, you are left-handed. This model is overfitted, as it is based on the peculiarities of the training data, and not on generalizable patterns.

Even though this example is extreme, the danger of model overfitting is real in all machine learning applications. The goal of the data analyst is to construct a model that generalizes well to new data, and does not overfit to the training data.

In our extreme example, we can easily test whether the model generalizes well to new data by gathering a new group of people and testing the model on them. Almost inevitably, the model will fail to predict the left-handedness of the new group of people, as the model is based on the peculiarities of the training data. The accuracy of the classifier on this new data would probably be comparable to random guessing.
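The left-handedness thought experiment can be imitated with synthetic data. The following is a minimal sketch, not a real study: the 20 "measurements" are random numbers, the assumption that 3 of the 30 people are left-handed is made up for illustration, and a decision tree is used simply because an unrestricted tree can memorize a small training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(12345)
n_people, n_features = 30, 20

# Training group: 30 people described by 20 random "measurements"
# that have no real connection to handedness (synthetic data).
X_train = rng.normal(size=(n_people, n_features))
y_train = np.zeros(n_people, dtype=int)
y_train[:3] = 1  # assumption for illustration: 3 of the 30 are left-handed

# An unrestricted decision tree keeps splitting until it has
# memorized every person in the training group.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, tree.predict(X_train))

# New group: the same kind of random measurements, about 10% left-handed.
X_new = rng.normal(size=(1000, n_features))
y_new = (rng.random(1000) < 0.1).astype(int)
new_acc = accuracy_score(y_new, tree.predict(X_new))

# Baseline that ignores the measurements: always guess "right-handed".
baseline_acc = accuracy_score(y_new, np.zeros_like(y_new))

print(f"Accuracy on the training group: {train_acc:.2f}")
print(f"Accuracy on the new group:      {new_acc:.2f}")
print(f"Always guessing right-handed:   {baseline_acc:.2f}")
```

The perfect score on the training group says nothing about the new group, where the memorized rules should do no better than a guess that ignores the measurements altogether.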

Testing the model on data that was not used for training is called validation. Collecting a new group of people is one way to obtain such data. In the following sections, we will discuss different ways to validate a model.

## Validation

Validation means exposing the model to new data to test its generalization capabilities. The measures of goodness for a classifier, such as accuracy, precision, and recall, should always be estimated from data that has not been used to train the model. That way, the estimates reflect how the model performs on new data, and overfitting shows up as a gap between training and validation performance.

There are various ways to validate a model. In the following sections, we will discuss some of the most common ones.

### Split validation

Split validation means that a part of the data is used to train the model, and another part is used to test the model. The simplest way to do this is to split the data into two parts: a training set and a test set. The training set is used to train the model, and the test set is used to evaluate the model. In most cases, the split is done randomly.

It is a common practice to use roughly two-thirds of the data for training and one-third for testing. The exact ratio depends on the size of the data set and the problem at hand.
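To see what a random split does to the data, the sketch below splits the Iris data 70/30 and counts the species in the test set. It assumes the copy of Iris bundled with scikit-learn (`load_iris`) rather than a CSV file, and it also shows the optional `stratify` argument of `train_test_split`, which is not discussed in this text but keeps the class proportions equal in both parts.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris as bundled with scikit-learn: 150 rows, 50 per species.
X, y = load_iris(return_X_y=True)

# Plain random split: the species counts in the test set may be uneven.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=12345
)
random_counts = Counter(y_te.tolist())
print(f"Training rows: {len(X_tr)}, test rows: {len(X_te)}")
print(f"Species counts in the test set (random split): {dict(random_counts)}")

# Stratified split: each species keeps its share in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.3, random_state=12345, stratify=y
)
strat_counts = Counter(y_te_s.tolist())
print(f"Species counts in the test set (stratified):   {dict(strat_counts)}")
```

With small data sets, an uneven test set can noticeably shift the measured accuracy, which is one reason the exact split matters.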

Let us use the Iris data set to illustrate split validation, in conjunction with the kNN classifier. The following code snippet demonstrates how to split the data into training and test sets, and how to train and evaluate the kNN classifier.


```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = pd.read_csv('datasets/iris/iris.csv').drop(columns=['sepal_length', 'sepal_width'])

X = iris.drop(columns='species')
y = iris['species']

# Split the dataset into a training set and a testing set
# 70% of the data will be used for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

# Create a kNN classifier
# n_neighbors parameter specifies the number of neighbors to use (k)
knn = KNeighborsClassifier(n_neighbors=3)

# Train the kNN classifier on the training data
knn.fit(X_train, y_train)

# Use the trained classifier to predict labels for the test set
y_pred = knn.predict(X_test)

# Calculate and print the accuracy of the classifier on the test set
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Accuracy of kNN classifier on the test set: {accuracy_test:.2f}")
```

```
Accuracy of kNN classifier on the test set: 0.98
```
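Precision and recall, mentioned earlier as measures of goodness, can be estimated from the same kind of held-out test set. The sketch below is one possible way to do it: it assumes the copy of Iris bundled with scikit-learn (`load_iris`) instead of the CSV file, keeps only the two petal features to mirror the example above, and uses macro averaging (an assumption) to combine the per-species values into one number.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Columns 2 and 3 are petal length and petal width.
X, y = load_iris(return_X_y=True)
X = X[:, 2:]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345
)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Macro averaging: compute the measure per species, then take the plain mean.
precision_test = precision_score(y_test, y_pred, average="macro")
recall_test = recall_score(y_test, y_pred, average="macro")

print(f"Macro-averaged precision on the test set: {precision_test:.2f}")
print(f"Macro-averaged recall on the test set:    {recall_test:.2f}")
```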


For comparison, the following code block calculates the accuracy of the kNN classifier on the training set. The accuracy on the training set is usually higher than the accuracy on the test set, because the model has already seen the training data and has adapted to it. Do not rely on the training accuracy as a measure of the model's performance!

```python
# This is just to show the difference between training and test accuracy
y_train_pred = knn.predict(X_train)
accuracy_train = accuracy_score(y_train, y_train_pred)
print(f"Accuracy of kNN classifier on the training set: {accuracy_train:.2f} (EXERCISE CAUTION!)")
```

```
Accuracy of kNN classifier on the training set: 0.99 (EXERCISE CAUTION!)
```
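Why the training accuracy is misleading becomes clearer when the number of neighbors is varied. With k = 1, every training point is its own nearest neighbor, so the training accuracy is perfect or nearly so no matter how well the model generalizes. The sketch below assumes the copy of Iris bundled with scikit-learn (`load_iris`) with all four features, and the values of k are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345
)

# Map each k to a (training accuracy, test accuracy) pair.
results = {}
for k in (1, 3, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, knn.predict(X_train))
    test_acc = accuracy_score(y_test, knn.predict(X_test))
    results[k] = (train_acc, test_acc)
    print(f"k={k:2d}  training accuracy: {train_acc:.2f}  test accuracy: {test_acc:.2f}")
```

Only the test column tells us anything about how the classifier behaves on unseen flowers.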


> The example above fixes `random_state=12345`, so it gives the same accuracy on every run. If you change or remove that value, the data is split differently and the accuracy values will vary. As the Iris data set is relatively small and easy for the kNN classifier, the accuracy is expected to be high even for the test set, but a single misclassified flower in a test set of 45 rows already changes the accuracy by about two percentage points.
>
> Generally, model overfitting becomes worse as the number of features and, consequently, the model complexity increase. The Iris data set has only four features, and the example above uses just two of them (the petal measurements), which makes the problem relatively simple. In more complex problems, the risk of overfitting is higher.

In the scikit-learn (`sklearn`) library, the `train_test_split` function is used to split the data into training and test sets. The `test_size` parameter specifies the proportion of the data that should be used for testing. The rows of the data frame are split randomly, so each run of the code may give slightly different results unless the random seed is fixed. The optional `random_state` parameter fixes the seed, which ensures that the data is split in the same way each time the code is run. This is useful for reproducibility. As the parameter value, just choose an integer, e.g. 12345. Never tune the value to get favorable results, as this would be a form of malpractice called cherry picking.
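The effect of `random_state` can be checked directly. The sketch below assumes the copy of Iris bundled with scikit-learn (`load_iris`), and `holdout_accuracy` is a hypothetical helper written for this illustration. It confirms that the same seed reproduces the same result, and then reports the spread over 20 seeds instead of picking the most flattering one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)


def holdout_accuracy(seed):
    """Split with the given seed, train 3-NN, and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    return accuracy_score(y_test, knn.predict(X_test))


# The same seed gives the same split, and therefore the same accuracy.
same_seed_repeat = holdout_accuracy(12345) == holdout_accuracy(12345)
print(f"Same seed reproduces the result: {same_seed_repeat}")

# Different seeds give different splits; report the spread honestly.
accuracies = [holdout_accuracy(seed) for seed in range(20)]
print(f"Mean accuracy over 20 seeds: {np.mean(accuracies):.3f}")
print(f"Lowest: {min(accuracies):.3f}, highest: {max(accuracies):.3f}")
```

Reporting only the highest value from such a loop would be exactly the cherry picking warned against above.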