# Model Selection and Evaluation - Part 1

When building supervised machine learning models, we need to solve two
problems:

1. **Model selection** - Finding the model that does as well as possible
on our learning task.

2. **Model evaluation** - Estimating the **generalization error**, i.e. the expected error of our
model on unseen data.

Both are critical.  Without the first we can't have an effective model, and
without the second we can't *know* whether we have an effective model.

Once we have picked a particular machine learning algorithm, model
selection comes down to the problem of **hyperparameter** tuning.
Hyperparameters are parameters of our learning models that need to be
selected before the model can be learned.  Examples include the
maximum number of leaves in a decision tree or the number of hidden
units in a neural network classifier.
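
To make the distinction concrete, here is a minimal sketch that contrasts a hyperparameter with the parameters a model learns during fitting. The data is synthetic and every name ending in `_demo` is an assumption made for illustration; it is not the data set used in the exercises.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(50, 1))
y_demo = np.sin(2 * np.pi * X_demo[:, 0]) + rng.normal(0, 0.1, size=50)

# Hyperparameter: chosen by us BEFORE the model is fit.
tree_demo = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)

# Learned parameters: the split thresholds and leaf values are set by fit().
tree_demo.fit(X_demo, y_demo)

n_leaves = tree_demo.get_n_leaves()
# Internal (splitting) nodes have a non-negative feature index; leaves do not.
thresholds = tree_demo.tree_.threshold[tree_demo.tree_.feature >= 0]

print("Leaves in fitted tree:", n_leaves)
print("Learned split thresholds:", np.sort(thresholds))
```

Changing `max_leaf_nodes` changes how many thresholds the tree is allowed to learn, but the threshold values themselves always come from the training data.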

**WARNING:**  In the exercises below we will try several *BAD* approaches to model selection and evaluation.  These examples are not meant to illustrate the correct way of doing things; they are meant to show the consequences of doing things incorrectly.


## Exercise 1 - Naive Model Selection

For now, let's focus entirely on model selection and disregard model
evaluation.  The following cell will load a data set and use a
decision tree regressor to fit a decision tree to the data. Try adjusting the `max_leaf_nodes` hyperparameter in order to minimize the MSE on the training set.
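
For reference, the MSE computed in these exercises (`np.sum((y - y_predict)**2) / y.size`) corresponds to the usual definition below. The symbols $n$ (the number of examples, `y.size`) and $\hat{y}_i$ (the $i$-th entry of `y_predict`) are notation introduced for this note.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```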

In [None]:
%matplotlib qt
import numpy as np
import matplotlib.pyplot as plt
import datasource

from sklearn.tree import DecisionTreeRegressor

# Grab our training data
source = datasource.DataSource()
X, y = source.gen_data(100, seed=100)

# Build a decision tree regressor
tree = DecisionTreeRegressor(max_leaf_nodes=3)
tree.fit(X, y)

# Evaluate the MSE of our decision tree on the training set
y_predict = tree.predict(X)
mse = np.sum((y - y_predict)**2) / y.size
print("MSE: {:.4f}".format(mse))

# Plot the fit.
plt.plot(X, y, '*')
x_plt = np.linspace(0, 1, 400).reshape(400, 1)
plt.plot(x_plt, tree.predict(x_plt))
plt.show()


### Questions

* What value of the hyperparameter resulted in the lowest MSE?
* Do you think that this MSE reflects how well this model will do on unseen data?  Why or why not?

### Answers: 

* 
* 

In the exercise above, you were able to tune the hyperparameter so that the tree fit the training data *perfectly*.  Now let's see what happens when we use this model on some new data drawn from the same underlying distribution:

In [None]:
X_new, y_new = source.gen_data(1000, seed=200)
y_new_predict = tree.predict(X_new)
mse = np.sum((y_new - y_new_predict)**2) / y_new.size
print("MSE: {:.4f}".format(mse))

## Exercise 2 - Using a Test Set for Hyperparameter Tuning and Evaluation

In the exercise above, you were able to perfectly fit a training data set, but that didn't tell you anything about how well your model would perform on unseen data.  We might address this by splitting our limited data into a training set and a test set.  This is illustrated in the cell below.

In [None]:
# Split our data into a training and testing set...
split_point = int(X.shape[0] * .8) # Use 80% of the data to train the model

X_train = X[0:split_point, :]
y_train = y[0:split_point]

X_test = X[split_point:, :]
y_test = y[split_point:]

# Build a decision tree regressor using the TRAINING set
tree = DecisionTreeRegressor(max_leaf_nodes=3)
tree.fit(X_train, y_train)

# Evaluate the MSE of our decision tree on the TESTING set 
y_test_predict = tree.predict(X_test)
mse = np.sum((y_test - y_test_predict)**2) / y_test.size
print("MSE: {:.4f}".format(mse))
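
The cell above splits the data by position: the first 80% of the rows are used for training and the rest for testing. That is only safe if the rows are in random order, which we have not checked for `gen_data`. As a hedged sketch, the block below uses synthetic data that is deliberately sorted (an assumption made for illustration, as are all `_demo` names) to compare a positional split with a shuffled one from scikit-learn's `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, deliberately sorted by X so that row order matters.
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 1, size=(100, 1)), axis=0)
y_demo = np.sin(2 * np.pi * X_demo[:, 0]) + rng.normal(0, 0.1, size=100)

# Positional split: the test set contains only the largest X values.
split_point = int(X_demo.shape[0] * 0.8)
X_test_sliced = X_demo[split_point:, :]

# Shuffled split: rows are assigned to train/test at random.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print("Sliced test X range:   {:.2f} to {:.2f}".format(
    X_test_sliced.min(), X_test_sliced.max()))
print("Shuffled test X range: {:.2f} to {:.2f}".format(
    X_test_demo.min(), X_test_demo.max()))
```

With sorted rows, the positional test set covers only one end of the input range, while the shuffled test set is spread across it.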

### Questions
* What hyperparameter setting gives us the lowest MSE on the testing data?  What is the MSE?  (I suggest writing a loop to systematically check all of the possible hyperparameter values. Bonus points for creating a plot with number of leaves vs. MSE.)
* Do you think *this* MSE will be reflective of how well our model will perform on unseen data? 
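
One possible shape for the loop suggested in the first question is sketched below. The helper `sweep_max_leaf_nodes` and the synthetic `_demo` data are hypothetical names introduced here; to answer the questions you would still need to apply the idea to `X_train`, `y_train`, `X_test` and `y_test` and interpret the result yourself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sweep_max_leaf_nodes(X_fit, y_fit, X_eval, y_eval, leaf_counts):
    """Fit one tree per leaf count; return each tree's MSE on (X_eval, y_eval)."""
    errors = []
    for n_leaves in leaf_counts:
        model = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
        model.fit(X_fit, y_fit)
        residuals = y_eval - model.predict(X_eval)
        errors.append(np.mean(residuals ** 2))
    return np.array(errors)


# Synthetic stand-in data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 1))
y_demo = np.sin(2 * np.pi * X_demo[:, 0]) + rng.normal(0, 0.2, size=100)

# max_leaf_nodes must be at least 2; 80 is the number of fitting examples here.
leaf_counts = list(range(2, 81))
errors = sweep_max_leaf_nodes(X_demo[:80], y_demo[:80],
                              X_demo[80:], y_demo[80:], leaf_counts)

best = leaf_counts[int(np.argmin(errors))]
print("Lowest held-out MSE {:.4f} at max_leaf_nodes={}".format(errors.min(), best))
```

For the bonus plot, `plt.plot(leaf_counts, errors)` with labeled axes should be enough.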

### Answers:

* 
* 

Notice that in this example we are using our test set for *both* model selection and model evaluation.  We used it for model selection by searching for the hyperparameter setting that minimizes error on the test set.  We used it for model evaluation by treating that same test set error as an estimate of the expected error on unobserved data.

Let's see how our model does on some new, unobserved data drawn from the same distribution.  This cell will give us a good estimate of our *actual* generalization error.  (Note that in real-world problems we can't run a test like this because we don't have unlimited access to extra data that we can use to check our work.)

In [None]:
tree = DecisionTreeRegressor(max_leaf_nodes=????) # Put your best hyperparameter here!
tree.fit(X_train, y_train)

# Let's see how we do on unobserved data... 
X_new, y_new = source.gen_data(1000, seed=200)
y_new_predict = tree.predict(X_new)
mse = np.sum((y_new - y_new_predict)**2) / y_new.size
print("MSE: {:.4f}".format(mse))

### Questions

* Relative to Exercise 1, where we just looked for the model that best fit our training data, would you say that our train/test split was beneficial in terms of model selection, i.e. did we end up with a better model? Justify your answer.
* Would you say that our train/test split was beneficial in terms of model evaluation, i.e. were we able to make a better prediction of our generalization error?  Justify your answer.
* Do you see any problems here in terms of model selection or evaluation?  How accurate was our prediction of generalization error?

### Answers:

* 
* 
* 

### Click [here](model_selection_2.ipynb) to open the next page of exercises...