# Objective 01 - Implement a train-validate-test split
## Overview
In the previous module, we used a train-test split where we hold back a subset of the data to use for testing the model. When we train a model, we also need to evaluate the model. Recall that if we evaluate on the training data, we're not getting an accurate estimate of the true performance of the model. For this reason, we need to use test data that the model has not yet seen.

But, sometimes it's useful to be able to have an intermediate step where the model can be evaluated without using the set-aside test set. This is where a validation set is useful. Consider the situation where we take a subset of our data and set it aside as the test set - we won't touch this data until we're ready to evaluate a final model.

With the remaining data, we can divide it into training and validation sets. We then train the model on the training data and evaluate it on the validation data. Another advantage of using a validation set it that is can be used to tune the model or adjust the hyperparameters. Iterations of tuning and model fitting are used to find the final model, which is then evaluated using the test set.

## Train-validate-test
Some general definitions are:

- training dataset: the sample of data used to fit the model
- validation dataset: the sample of data used to evaluate the model and possibly to adjust the hyperparameters
- testing dataset the sample of data used for final model testing; not to be used for anything other than testing so that the result is unbiased

One last point to make is that sometimes you won't even have access to a test set! If you are participating in a Kaggle competition, for example, you will have training data and use the test data for your submission. The number of test submissions might be limited, or you might not want to make numerous test submissions just to evaluate your model.

In this next section, we'll create our own train-validation-test data sets. We'll follow the guideline of using 60% for training, 20% for validation, and 20% for testing.

## Follow Along
As we haven't yet worked with the Iris dataset in this module, we'll start there. In the following example, we load the data and then separate out the feature (petal_width) and the target (petal_length). Having plotted this data earlier, we know there is a linear relationship between the petal width and length: the wider the petal, the greater the length. So we'll use a linear model to predict our target.

In [1]:
import numpy as np 
import seaborn as sns 

iris = sns.load_dataset('iris')
display(iris.head() )

x= iris['petal_width']
X = x[:, np.newaxis]
y = iris['petal_width']

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


  X = x[:, np.newaxis]


First, we'll hold back a subset of the data just for the test data. We'll do this with the scikit-learn utility; we'll call it something different from "train" so that we don't confuse it with the actual training data later.  

In [2]:
# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the "remaining" and test datasets
X_remain, X_test, y_remain, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Then we'll create a training set and validation set from the remaining data. We could have done this in one step but we're breaking it down here so it's easier to see that we removed a test subset and aren't accidentally going to use it for evaluation until we're ready to test.



In [3]:
# Create the train and validation datasets

X_train, X_val, y_train, y_val = train_test_split(
    X_remain, y_remain, test_size=0.25, random_state=42)

# Print out sizes of train, validate, test datasets

print('Training data set samples:', len(X_train))
print('Validation data set samples:', len(X_val))
print('Test data set samples:', len(X_test))

Training data set samples: 90
Validation data set samples: 30
Test data set samples: 30


Now we can fit our model and evaluate it on our validation set.

In [4]:
# Import the predictor and instantiate the class
from sklearn.linear_model import LinearRegression

# Instantiate the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Use the VALIDATION set for prediction
y_predict = model.predict(X_val)

# Calculate the accuracy score
from sklearn.metrics import r2_score
r2_score(y_val, y_predict)

1.0

Well, that's a pretty good model score (R-squared), which we expect because we know the Iris dataset has a strong linear trend between the petal width and petal length. Now would be the time to change any of the model hyperparameters and evaluate on the validation set again. We'll continue with the default model parameters for now; hyperparameter tuning is something that will be introduced in Sprint 2 of this unit.

Now, let's use the test set we held back above.

In [16]:
# Use the TEST set for prediction
y_predict_test = model.predict(X_test)

# Calculate the accuracy score

r2_score(y_test, y_predict_test)

1.0

The R-squared score is a little lower than it was for the validate set. But, if we were to run the model and test again, with a different random seed, the scores would be different and the test score might be higher.

Challenge
Using the same data set as in the example, try changing the random_state parameter to see how the validate and test model scores change.

Additional Resources  
What is the Difference Between Test and Validation Datasets?['https://machinelearningmastery.com/difference-test-validation-datasets/]