# Validation Sets and Test Sets
In this Colab, we will experiment with validation sets and test sets.






## Learning objectives

We will learn how to:

  * Split a training set into a smaller training set and a validation set.
  * Analyze the differences between training set and validation set results.
  * Use a test set to check whether the trained model is overfitting.
  * Detect and fix a common training problem.

## The dataset

We will use the **California Housing Dataset** to predict the `median_house_value`. 

* The training set is in `california_housing_train.csv`.
* The test set is in `california_housing_test.csv`.

We will create the validation set by dividing the training set into two parts:

* a training set  
* a validation set
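
As a rough sketch of what this division looks like when done by hand, the snippet below splits a small made-up DataFrame (`toy_df` is a stand-in, not the real housing data) into the two parts. It takes the validation rows from the end of the data without shuffling, which is how Keras' `validation_split` argument behaves.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are illustrative, not real housing data.
toy_df = pd.DataFrame({
    "median_income": np.arange(10, dtype=float),
    "median_house_value": np.arange(10, dtype=float) * 50.0,
})

validation_fraction = 0.2  # plays the same role as validation_split=0.2

# Number of rows kept for training; the remaining rows form the validation set.
n_train = int(len(toy_df) * (1 - validation_fraction))

# Take the validation rows from the end, without shuffling.
train_part = toy_df.iloc[:n_train]
validation_part = toy_df.iloc[n_train:]

print(len(train_part), len(validation_part))
```

Because the split is positional, the order of the rows determines which examples end up in each part.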

In [None]:
# Select TensorFlow 2.x (this magic command only works in Colab)
%tensorflow_version 2.x

In [None]:
#Import relevant modules
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt

pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

In [None]:
# Load the datasets
train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")

## Scale the label values

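The following code cells divide `median_house_value` by 1000 in both the training set and the test set, so the label is expressed in thousands. This keeps the loss values in a smaller, more readable range.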

In [None]:
#train_df.head(5)

In [None]:
# Scale the training set's label.
train_df["median_house_value"] /= 1000

# Scale the test set's label
test_df["median_house_value"] /= 1000
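
One consequence of this scaling is that the root mean squared error is reported in the same scaled units. The sketch below uses made-up labels and predictions (not values from the dataset) and a small helper, `rmse`, written only for this illustration, to show that dividing both by 1000 divides the RMSE by 1000 as well.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up house values and made-up predictions, both unscaled.
labels_raw = np.array([100_000.0, 250_000.0, 400_000.0])
preds_raw = np.array([120_000.0, 240_000.0, 370_000.0])

rmse_raw = rmse(labels_raw, preds_raw)
rmse_scaled = rmse(labels_raw / 1000, preds_raw / 1000)

# The two values differ by exactly the scaling factor.
print(rmse_raw, rmse_scaled)
```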

In [None]:
#train_df.head(5)

In [None]:
# Instantiate a linear regression model with a single input feature
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.08),
    loss="mean_squared_error",
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
)
model.summary()

In [None]:
# Train the model; validation_split=0.2 holds out the last 20% of the rows (unshuffled) for validation
history = model.fit(x=train_df['median_income'], y=train_df['median_house_value'], verbose=0, batch_size=100, validation_split=0.2, epochs=100)

In [None]:
#Plot the loss curve
plt.figure()
plt.xlabel("Epoch")
plt.ylabel("Root Mean Squared Error")


plt.plot(history.history['root_mean_squared_error'], label="Training Loss")
plt.plot(history.history['val_root_mean_squared_error'], label="Validation Loss")
plt.legend()




If the data in the training set is similar to the data in the validation set, then the two loss curves and the final loss values should be almost identical. However, the loss curves and final loss values are **not** almost identical.

Experimenting with different values of `validation_split` will not fix the problem.

Evidently, the data in the training set isn't similar enough to the data in the validation set. The cause is that the original training set is sorted by longitude, and `validation_split` takes the validation examples from the end of the data without shuffling. Apparently, longitude influences the relationship of `median_income` to `median_house_value`.
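
To see why a sorted file causes trouble, here is a small sketch on synthetic data. The DataFrame `toy_df`, the descending sort order, and the invented drift of house values with longitude are all assumptions made for illustration. It checks whether the rows are sorted and compares the mean label of the first 80% of the rows with the mean label of the last 20%.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df, sorted by longitude (descending in this toy).
# The numbers are invented so that house values drift with longitude.
rng = np.random.default_rng(0)
longitude = np.sort(rng.uniform(-124.0, -114.0, size=1000))[::-1]
toy_df = pd.DataFrame({
    "longitude": longitude,
    "median_house_value": 200.0 + 20.0 * (longitude + 119.0)
                          + rng.normal(0.0, 5.0, size=1000),
})

# Positional 80/20 split, the same way validation_split=0.2 would cut the rows.
split_at = int(len(toy_df) * 0.8)
head, tail = toy_df.iloc[:split_at], toy_df.iloc[split_at:]

print("sorted by longitude:", toy_df["longitude"].is_monotonic_decreasing)
print("mean label, first 80%:", head["median_house_value"].mean())
print("mean label, last 20%: ", tail["median_house_value"].mean())
```

When the two means differ substantially, the training and validation parts are not drawn from similar data, and their loss curves should not be expected to match.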

In [None]:
# Examine examples 0 through 4 and examples 95 through 99 of the training set
# (display.max_rows = 10 truncates the output to the first and last five rows)
train_df.head(100)

## Fixing the problem

To fix the problem, we need to shuffle the examples in the training set before splitting the examples into a training set and validation set. To do so, add the following line anywhere before calling `model.fit`:

```
shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index))
```

Then take the feature and label columns passed to `model.fit` from `shuffled_train_df` instead of `train_df`.
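
The sketch below runs the same kind of shuffle on a small made-up DataFrame and also shows `DataFrame.sample(frac=1)` as an alternative way to get all rows in random order. The toy data and the fixed seeds are assumptions added here for reproducibility; the notebook itself shuffles without a seed.

```python
import numpy as np
import pandas as pd

# Toy data in which every label is 10x its feature, so broken pairings are easy to spot.
toy_df = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "median_house_value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Option 1: reindex with a permuted index, as in this notebook, but seeded.
rng = np.random.default_rng(42)
shuffled_a = toy_df.reindex(rng.permutation(toy_df.index.to_numpy()))

# Option 2: sample all rows (frac=1) without replacement, in random order.
shuffled_b = toy_df.sample(frac=1, random_state=42)

# Both approaches reorder whole rows, so each feature stays with its label.
print(shuffled_a)
print(shuffled_b)
```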

In [None]:
# Re-instantiate the model so that training starts from fresh weights
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.08),
    loss="mean_squared_error",
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
)
model.summary()

In [None]:
#Train the model
#Shuffle the examples, and use 'shuffled_train_df' instead of train_df
shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index)) 
history = model.fit(x=shuffled_train_df['median_income'], y=shuffled_train_df['median_house_value'], verbose=0, batch_size=100, validation_split=0.2, epochs=100)

In [None]:
#Plot the loss curve
plt.figure()
plt.xlabel("Epoch")
plt.ylabel("Root Mean Squared Error")

plt.plot(history.history['root_mean_squared_error'], label="Training Loss")
plt.plot(history.history['val_root_mean_squared_error'], label="Validation Loss")
plt.legend()

## Finally, evaluate the model's performance on the test dataset

In [None]:
x_test = test_df['median_income']
y_test = test_df['median_house_value']

results = model.evaluate(x_test, y_test, batch_size=100)
results

Compare the root mean squared error of the model when evaluated on each of the three datasets:

* training set
* validation set
* test set

Ideally, the root mean squared error of all three sets should be similar. **Success!**
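
One possible way to make "similar" concrete is to compare the validation and test RMSE with the training RMSE as ratios. The helpers `rmse_ratios` and `looks_consistent`, the 10% tolerance, and the numbers passed to them below are all hypothetical choices for illustration, not results from this notebook.

```python
def rmse_ratios(train_rmse, val_rmse, test_rmse):
    """Return validation and test RMSE relative to training RMSE (1.0 means identical)."""
    return {
        "validation/train": val_rmse / train_rmse,
        "test/train": test_rmse / train_rmse,
    }

def looks_consistent(train_rmse, val_rmse, test_rmse, tolerance=0.10):
    """True if validation and test RMSE are both within `tolerance` of training RMSE."""
    ratios = rmse_ratios(train_rmse, val_rmse, test_rmse)
    return all(abs(ratio - 1.0) <= tolerance for ratio in ratios.values())

# Placeholder numbers only, chosen to show one passing and one failing case.
print(looks_consistent(84.0, 85.0, 83.0))
print(looks_consistent(84.0, 120.0, 83.0))
```

In this notebook, the three inputs would come from `history.history['root_mean_squared_error'][-1]`, `history.history['val_root_mean_squared_error'][-1]`, and `results[1]` (the metric that `model.evaluate` returns after the loss).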