# Introduction

You may have come across situations where a model performs well on training data but not on test data. You may also have been unsure which model to use for a given problem. For example, by now you have learned many classification models. Given a problem that requires classification, how would you decide which model to go with? Questions like these arise frequently, irrespective of the choice of model, the data or the problem itself.

The central issue in all of machine learning is “how do we extrapolate learnings from a finite amount of available data to all possible inputs ‘of the same kind’?” Training data is always finite, yet the model is supposed to learn everything about the task at hand from it and perform well on unseen data.

How can you be confident that a model is as good as it seems on the training data before you deploy it to make predictions on real, unseen data?

It is often mistakenly assumed that if a model performs well on the training data, it will produce good results on test data as well. That is not always the case.

Let's understand how a model extracts generalizable information from the finite amount of data it is trained on so that it performs well on unseen data.

When we give data to a machine learning algorithm, it produces an ML model. A model can be a function, a logical rule or a data structure that takes a set of inputs, processes it and gives an output. To make good predictions, it should be neither too simple nor too complex. You need to strike the right balance between the two to come up with a model that makes correct predictions even on unseen data.

**Occam's razor** is perhaps the most important rule of thumb in machine learning, and incredibly 'simple' at the same time: when in doubt, choose the simpler model. The question then is 'how do we define simplicity?'.

The definition of simplicity varies with the type of model under consideration. For a tree model, simplicity would mean a reduced depth or size; for a linear model, it can be expressed in terms of the number of bits required to represent the model.

## Model and Algorithms

A basic property of a learning algorithm is that it can only produce models of a certain kind. This means that an algorithm designed to produce a linear class of models, like linear or logistic regression, will never be able to produce a decision tree or a neural network. The choice of model class is critical because a wrong class will yield a sub-optimal model.

**Hypothesis and Hypothesis Class**

A hypothesis is the same as a model, and a hypothesis class is the class of models that you are going to consider for a given problem. Every algorithm has its own limitations: it works within the boundary of a certain class of models. Take random forest as an example of a learning algorithm. Every model that the random forest algorithm produces is going to be a forest, that is, a collection of decision trees. Those are the only kind of models that a random forest will ever produce.

Now suppose that the learning algorithm is linear regression, logistic regression or a linear SVM. In this case, the learning algorithm will only produce linear models and will not consider any other kind of model.

![pasted%20image%200.png](attachment:pasted%20image%200.png)

When you fit a linear regression on this type of data, you get something like the line shown in the image above, which is definitely not the best model because it is unable to capture the underlying structure of the data. A linear regression model will produce the line that best fits the dataset among all possible straight lines. It will not consider anything other than a straight line for this dataset.
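The effect of the hypothesis class can be sketched with a small experiment. The data in the figure is not available here, so the snippet below assumes a made-up dataset with a quadratic relationship as a stand-in. It compares the best straight line with the best parabola. The training error is used only to show that no straight line can match the curve, not as a model evaluation.

```python
import numpy as np

# Synthetic stand-in for the plotted data (an assumption): y depends on x quadratically.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.5, size=x.size)

def fit_and_score(degree):
    """Fit the least-squares polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    predictions = np.polyval(coeffs, x)
    return float(np.mean((y - predictions) ** 2))

mse_line = fit_and_score(1)      # hypothesis class: straight lines
mse_parabola = fit_and_score(2)  # hypothesis class: parabolas

print(f"best straight line, training MSE: {mse_line:.2f}")
print(f"best parabola, training MSE:      {mse_parabola:.2f}")
```

Even the best member of the straight-line class leaves a large error on this data, because the class itself does not contain a curve of the right shape.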

So a learning algorithm, when given data, produces a model, which could be a linear regression model, a decision tree or any other model. The learning algorithm puts a boundary around the class of models that it is ever going to consider, and among those models it tries to find the one that best fits the training data. That model is the output of the learning algorithm. Once you have this model, you can go ahead and use it for making predictions.

### Complexity and Overfitting

Every class of models has its own strengths and weaknesses. Depending on the computational resources and the kind of data that you have, you need to shortlist the models that you can consider for a given problem. Out of these models, how do you pick one? This is where model evaluation comes in, and one of the most important rules to keep in mind is to never evaluate your model on the training data.

**First Basic Rule:** Keep your model as simple as possible. The simpler, the better. Always start with a simple model to set the baseline. Try fitting more complex models on your dataset only if the simple models do not produce the desired results.

A complex model makes far too many assumptions about data it has not seen before. Such assumptions may not hold for all the unseen data that it encounters later. Simple models stand out in comparison because they keep extra assumptions about unseen data to a minimum. Another advantage of simple models is that they require fewer samples to train. Complex models require far more training data to ensure that they capture the information well and perform reliably.

**Advantages of a Simpler Model:**

- A simpler model is usually more generic than a complex model. This is important because generic models tend to perform better on unseen datasets.
- A simpler model requires fewer training data points. This becomes extremely important because in many cases one has to work with limited data.
- A simple model is more robust and does not change significantly if the training data points undergo small changes.
- A simple model may make more errors in the training phase, but it is likely to outperform an overly complex model when it sees new data. This happens because of **`overfitting`.**

**Overfitting**

Overfitting is a phenomenon where a model becomes too specific to the data it is trained on and fails to generalise to other unseen data points in the larger domain. A model that has become too specific to a training dataset has actually ‘learnt’ not just the hidden patterns in the data but also the noise and the inconsistencies in the data. In a typical case of overfitting, the model performs very well on the training data but fails miserably on the test data. 
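A small experiment can make this concrete. The sketch below assumes a made-up ground truth (a sine curve with added noise) and uses polynomial degree as a stand-in for model complexity. Neither comes from the text above; they are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples from a sine curve (a made-up ground truth for illustration)."""
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # unseen data from the same source

def train_test_mse(degree):
    """Fit a polynomial on the training set and report (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    return float(train_mse), float(test_mse)

results = {degree: train_test_mse(degree) for degree in (1, 3, 12)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The degree-12 polynomial has almost as many coefficients as there are training points, so it can chase the noise: its training error is the lowest of the three, while its error on the unseen points is typically far worse than that of the cubic.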

### Bias-Variance Tradeoff

Consider a model that memorizes the entire training dataset. If you change the dataset a little, this model will need to change drastically. The model is, therefore, unstable and sensitive to changes in training data, and this is called high variance.

The `variance` of a model is the variance in its output on some test data with respect to the changes in the training data. In other words, variance here refers to the degree of changes in the model itself with respect to changes in training data.

Bias quantifies how accurate the model is likely to be on future (test) data. Extremely simple models are likely to fail in predicting complex real world phenomena. Simplicity has its own disadvantages.

Imagine solving digital image processing problems using simple linear regression when much more complex models like neural networks are typically successful in these problems. We say that the linear model has a high bias since it is way too simple to be able to learn the complexity involved in the task.

In an ideal case, we want to reduce both the bias and the variance, because the expected total error of a model is made up of both: under squared-error loss it is the sum of the squared bias, the variance and an irreducible noise term.

![Bias_variance.png](attachment:Bias_variance.png)

In practice, however, we often cannot have a model with both low bias and low variance. As the model complexity goes up, the bias reduces while the variance increases, hence the trade-off.
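The trade-off can be estimated numerically. The sketch below is one possible setup, not the only one: it assumes a known ground truth (a sine curve), keeps the inputs fixed, redraws only the noise in the labels to simulate "changes in training data", and uses polynomial degree as the measure of complexity. Squared bias and variance are measured against the noise-free truth, so the irreducible noise term does not appear.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(2 * np.pi * x)  # assumed ground truth, for illustration only

x = np.linspace(0, 1, 30)  # fixed inputs shared by every simulated training set
NOISE_STD = 0.3

def bias_variance(degree, n_repeats=300):
    """Estimate squared bias, variance and total error of polynomial fits."""
    predictions = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        # A fresh training set: same inputs, newly drawn label noise.
        y = true_function(x) + rng.normal(scale=NOISE_STD, size=x.size)
        coeffs = np.polyfit(x, y, deg=degree)
        predictions[i] = np.polyval(coeffs, x)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - true_function(x)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    total = np.mean((predictions - true_function(x)) ** 2)
    return float(bias_sq), float(variance), float(total)

estimates = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance, total) in estimates.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, "
          f"variance = {variance:.4f}, total = {total:.4f}")
```

In this setup the straight line has the largest squared bias and the smallest variance, and the degree-9 polynomial the reverse. The total error against the noise-free truth equals the sum of the two terms.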

### Regularization

Having established that we need to find the right balance between model bias and variance, or simplicity and complexity, we need tools which can reduce or increase the complexity.

Regularization is the process of deliberately simplifying models to achieve the right balance between keeping the model simple and yet not too naive. It is a part of the learning algorithm with some explicit steps to control the model complexity. Recall that there are a few objective ways of measuring simplicity: choosing simpler functions, using fewer model parameters, using lower-degree polynomials, etc.

Regularization discourages the model from becoming too complex even if the model explains the (training) observations better. It is used to find the optimal point between extreme complexity and simplicity.
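As one concrete illustration, the sketch below uses ridge regression, which adds an L2 penalty on the weights to the least-squares objective. Ridge is not named in the text above; it is picked here only as a common example of regularization. The data and the penalty strengths (`alpha`) are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 30 samples, 10 features, only the first two actually matter.
X = rng.normal(size=(30, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + rng.normal(scale=0.5, size=30)

def ridge_fit(X, y, alpha):
    """Minimise ||y - Xw||^2 + alpha * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = (0.0, 1.0, 10.0, 100.0)
norms, train_errors = [], []
for alpha in alphas:
    w = ridge_fit(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    train_errors.append(float(np.mean((y - X @ w) ** 2)))
    print(f"alpha = {alpha:6.1f}: ||w|| = {norms[-1]:.3f}, "
          f"train MSE = {train_errors[-1]:.3f}")
```

A larger `alpha` pulls the weights toward zero (a simpler model) at the cost of a higher training error, which is exactly the trade the paragraph above describes.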

### Hyperparameters and Cross Validation

**Hyperparameters** are parameters that are passed on to the learning algorithm to control the complexity and performance of the final model. These are choices that the algorithm designer makes to `tune` the behavior of the learning algorithm. The choice of hyperparameters, therefore, has a lot of bearing on the final model produced by the learning algorithm.

Now, how do you decide the optimum values of the hyperparameters to regularize and evaluate your model so as to avoid these problems?

When you define a set of hyperparameter values to tune your model and check if it is optimum or not by evaluating its performance on the test set, the model is actually taking a sneak preview of the unseen data which is kept aside for the final model evaluation. The model is repeatedly exposed to the test set for examining each set of hyperparameter values before it is finalized for the actual test. This way of choosing the best set of hyperparameters for a model is not acceptable and this is where the notion of validation set comes in.

The original data is divided into three parts: train, validation and test sets. Each model is trained on the train set with a specific set of hyperparameters and then evaluated on the validation set. The hyperparameter values that perform best on the validation set are kept, and the model can revisit the validation set any number of times. After the hyperparameters are finalized, the model's performance is evaluated once on the test set.
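The three-way split can be sketched as follows. The dataset, the 60/20/20 proportions and the use of polynomial degree as the hyperparameter are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical dataset: noisy samples from a sine curve.
x = rng.uniform(0, 1, size=300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=300)

# Shuffle once, then split 60% / 20% / 20% (the proportions are an assumption).
indices = rng.permutation(x.size)
train_idx, val_idx, test_idx = np.split(indices, [180, 240])

def mse(coeffs, idx):
    """Mean squared error of a fitted polynomial on the rows selected by idx."""
    return float(np.mean((y[idx] - np.polyval(coeffs, x[idx])) ** 2))

# The hyperparameter being tuned here is the polynomial degree.
val_errors = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    val_errors[degree] = mse(coeffs, val_idx)  # validation set: reused freely

best_degree = min(val_errors, key=val_errors.get)

# The test set is touched exactly once, after the hyperparameter is fixed.
final_coeffs = np.polyfit(x[train_idx], y[train_idx], deg=best_degree)
test_error = mse(final_coeffs, test_idx)

print(f"best degree on validation set: {best_degree}")
print(f"test MSE of the final model:   {test_error:.4f}")
```

Because the test rows play no part in choosing `best_degree`, `test_error` remains an honest estimate of performance on unseen data.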

The key thing to remember is that a model should never be evaluated on data it has already seen. With that in mind, you will be in one of two cases:

1. The training data is abundant.
2. The training data is limited.

The first case of abundant training data is straightforward because you can use as many observations as you like to both train and test the model. However, in the second case where the training data is limited, you will need to find some hack so that the model can be evaluated on unseen data and at the same time doesn’t eat up the data available for training. This hack is called cross-validation.

In the cross-validation technique, you split the data into train and test sets and train multiple models by sampling the train set. Finally, you can use the test set to test the final model once.

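Specifically, you can apply the k-fold cross-validation technique, where you divide the training data into k folds (groups of samples). You use k-1 folds to build the model and test it on the remaining fold, and repeat this k times so that each fold serves as the held-out fold exactly once. If k = 5, for example, each model is built on 4 folds and tested on the 5th, and the 5 scores are averaged.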

It is important to remember that k-fold cross-validation is applied only to the train data. The test data is reserved for the final evaluation. The one extra step in cross-validation is that the train data itself is divided into train and validation parts, and the validation part rotates across the k folds, which gives a more reliable estimate of how well the model generalises.
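The fold rotation described above can be written out by hand in a few lines. This is a minimal sketch under assumptions: the training data is synthetic, the hyperparameter is a polynomial degree, and the helper names `make_folds` and `cross_val_mse` are hypothetical. Libraries such as scikit-learn provide ready-made utilities for the same job.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data; the held-out test set is assumed to live elsewhere.
x_train = rng.uniform(0, 1, size=100)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=100)

def make_folds(n_samples, k):
    """Shuffle the row indices and cut them into k roughly equal folds."""
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_mse(x, y, degree, folds):
    """Average validation MSE over the folds for one hyperparameter value."""
    fold_errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[fit_idx], y[fit_idx], deg=degree)
        residuals = y[val_idx] - np.polyval(coeffs, x[val_idx])
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

folds = make_folds(x_train.size, k=5)  # the same folds are reused for every degree
cv_scores = {d: cross_val_mse(x_train, y_train, d, folds) for d in (1, 3, 5, 9)}
best_degree = min(cv_scores, key=cv_scores.get)

for degree, score in cv_scores.items():
    print(f"degree {degree}: mean validation MSE over 5 folds = {score:.4f}")
print(f"selected degree: {best_degree}")
```

Every training row is used for validation exactly once and for fitting k-1 times, so no data is permanently set aside for validation.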