### Validation Sets 

###### The training set alone does not provide an accurate signal of the quality of your model. In this section, a validation set is used to check the model's accuracy on data it was not trained on, giving an additional layer of data for tuning the model.

##### Dataset - we will use the California housing dataset to predict the median_house_value at the city block level.
##### The California housing dataset comes as two separate files, one for the training set and one for the test set:
* california_housing_train.csv
* california_housing_test.csv

##### The validation set is created by dividing the training set into two parts:
* Smaller training set
* Validation set
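
As a rough sketch of that split, assuming a tiny made-up DataFrame in place of the real CSV (the `toy_df` name and its values are illustrative only), the last 20% of the rows can be held out by position. This mirrors how the `validation_split` argument of Keras's `fit` takes the validation examples from the end of the data.

```python
import pandas as pd

# Hypothetical stand-in for the training set: 10 rows instead of the real CSV.
toy_df = pd.DataFrame({
    "median_income": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0],
    "median_house_value": [90.0, 110.0, 130.0, 150.0, 170.0,
                           190.0, 210.0, 230.0, 250.0, 270.0],
})

validation_split = 0.2

# Number of rows held out for validation; the rest stay in the training set.
n_validation = int(len(toy_df) * validation_split)
split_at = len(toy_df) - n_validation

smaller_train_df = toy_df.iloc[:split_at]  # first 80% of the rows
validation_df = toy_df.iloc[split_at:]     # last 20% of the rows

print(len(smaller_train_df), len(validation_df))
```

Because the split is purely positional, the order of the rows decides which examples end up in the validation set.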

In [None]:
#@title Import modules
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt

pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

In [None]:
## Load the California housing dataset

train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")

In [None]:
## Scale the label values down by a factor of 1000.
scale_factor = 1000.0

# Scale the training set's label.
train_df["median_house_value"] /= scale_factor 

# Scale the test set's label
test_df["median_house_value"] /= scale_factor

#### Load the functions to build and train the model.

##### build_model -> defines the model's topography.
##### train_model -> trains the model built by build_model on a given dataset.

In [None]:
#@title Define the functions that build and train a model
def build_model(my_learning_rate):
  ## Create and compile a simple linear regression model.
  ## Most simple tf.keras models are sequential.
  model = tf.keras.models.Sequential()

  ## Add one linear layer to the model to yield a simple linear regressor.
  model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))

  ## Compile the model topography into code that TensorFlow can efficiently
  ## execute. Configure training to minimize the model's mean squared error. 
  model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])

  return model  

In [None]:
def train_model(model, df, feature, label, my_epochs, 
                my_batch_size=None, my_validation_split=0.1):
  """Feed a dataset into the model in order to train it."""

  history = model.fit(x=df[feature],
                      y=df[label],
                      batch_size=my_batch_size,
                      epochs=my_epochs,
                      validation_split=my_validation_split)

  ## Gather the model's trained weight and bias.
  trained_weight = model.get_weights()[0]
  trained_bias = model.get_weights()[1]

  ## The list of epochs is stored separately from the 
  ## rest of history.
  epochs = history.epoch
  
  ## Isolate the root mean squared error for each epoch.
  hist = pd.DataFrame(history.history)
  rmse = hist["root_mean_squared_error"]

  return epochs, rmse, history.history   


In [None]:
### Plot the training and validation loss curves.

#@title Define the plotting function

def plot_the_loss_curve(epochs, rmse_training, rmse_validation):
  """Plot curves of training and validation RMSE vs. epoch."""

  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Root Mean Squared Error")

  ## We're not going to plot the first epoch, since the loss on the first epoch
  ## is often substantially greater than the loss for other epochs.
  plt.plot(epochs[1:], rmse_training[1:], label="Training Loss")
  plt.plot(epochs[1:], rmse_validation[1:], label="Validation Loss")
  plt.legend()

  ## Both arguments are Python lists, so + concatenates them. The combined
  ## range is used to pad the y-axis by 5% above and below the curves.
  merged_rmse_lists = rmse_training[1:] + rmse_validation[1:]
  highest_loss = max(merged_rmse_lists)
  lowest_loss = min(merged_rmse_lists)
  delta = highest_loss - lowest_loss

  top_of_y_axis = highest_loss + (delta * 0.05)
  bottom_of_y_axis = lowest_loss - (delta * 0.05)

  plt.ylim([bottom_of_y_axis, top_of_y_axis])
  plt.show()

print("Defined the plot_the_loss_curve function.")

##### We will experiment with the validation set

###### We will set the validation_split variable to 0.2, which means 20% of the data will be used for the validation set and 80% for the training set.

###### We will pass the hyperparameters to the build_model function, use the resulting model in train_model, and then plot the loss curves.

In [None]:
# The following variables are the hyperparameters.
learning_rate = 0.08
epochs = 30
batch_size = 100

# Split the original training set into a reduced training set and a
# validation set. 
validation_split=0.2

# Identify the feature and the label.
my_feature="median_income"  # the median income on a specific city block.
my_label="median_house_value" # the median value of a house on a specific city block.
# That is, you're going to create a model that predicts house value based 
# solely on the neighborhood's median income.  

# Discard any pre-existing version of the model.
my_model = None

# Invoke the functions to build and train the model.
my_model = build_model(learning_rate)
epochs, rmse, history = train_model(my_model, train_df, my_feature, 
                                    my_label, epochs, batch_size, 
                                    validation_split)

plot_the_loss_curve(epochs, history["root_mean_squared_error"], 
                    history["val_root_mean_squared_error"])

<b> The graph shows that the training loss and the validation loss differ greatly, which means the data in the validation set is not similar to the data in the training set. This is a bit counter-intuitive, since the validation set was carved out of the same training data and is meant to give an independent check on model quality. But there is a reason for this gap, and there is also a fix.
As with most such issues, the cause lies in the data itself. Keras takes the validation examples from the end of the data it is given, before any shuffling, so if the rows are stored in some sorted order, the last 20% can differ systematically from the first 80%. We need to look into the data and identify what is different about the training set compared to the validation set.
</b>
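
One way to check whether the two parts resemble each other is to compare their summary statistics. The sketch below uses synthetic data rather than the real CSV: the `sorted_df` frame, its `longitude` sort column, and the way the label drifts with that column are all assumptions made for illustration. It holds out the last 20% of the rows by position, as `validation_split=0.2` does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic stand-in: 1,000 rows sorted by a hypothetical "longitude" column,
# with a label that drifts along with that column.
longitude = np.sort(rng.uniform(-124.0, -114.0, size=1000))
label = 200.0 + 10.0 * (longitude + 119.0) + rng.normal(0.0, 5.0, size=1000)
sorted_df = pd.DataFrame({"longitude": longitude,
                          "median_house_value": label})

# Hold out the last 20% of the rows by position.
split_at = len(sorted_df) - int(len(sorted_df) * 0.2)
train_part = sorted_df.iloc[:split_at]
validation_part = sorted_df.iloc[split_at:]

# On sorted data the two parts cover different ranges, so their means differ.
print(train_part["median_house_value"].mean())
print(validation_part["median_house_value"].mean())
```

On the real data, the same comparison (for example with `describe()` on each part) would be a reasonable first step before deciding how to fix the split.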

#### Fix for the validation set data issue.

<b> To fix this problem, shuffle the examples in the training set before splitting it into a training set and a validation set. This spreads the examples across the entire dataset, so both parts are drawn from the same distribution.
We use the reindex method from the pandas library, together with a random permutation of the index, to shuffle the input data.
We then pass the shuffled input to the train_model function.
</b>
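
A minimal sketch of that shuffle on a small made-up frame follows. The `ordered_df` name, its values, and the fixed seed are assumptions for illustration; the notebook itself does not set a seed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose rows arrive sorted by the label.
ordered_df = pd.DataFrame({"median_house_value": np.arange(100, dtype=float)})

# Fix the seed so the permutation is repeatable in this sketch.
np.random.seed(42)
shuffled_df = ordered_df.reindex(np.random.permutation(ordered_df.index))

# Same rows, new order: each row keeps its original index label.
print(shuffled_df.head())

# Hold out the last 20% of each version and compare the validation means.
print(ordered_df.iloc[80:]["median_house_value"].mean())   # tail of sorted data
print(shuffled_df.iloc[80:]["median_house_value"].mean())  # tail after shuffling
```

`reindex` only reorders rows, so no example is lost or duplicated. `DataFrame.sample(frac=1)` is another pandas call that returns the rows in random order, if a single call is preferred.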

In [None]:
#@title Train on the shuffled training set

# The following variables are the hyperparameters.
learning_rate = 0.08
epochs = 70
batch_size = 100

# Split the original training set into a reduced training set and a
# validation set. 
validation_split=0.2

# Identify the feature and the label.
my_feature="median_income"  # the median income on a specific city block.
my_label="median_house_value" # the median value of a house on a specific city block.
# That is, you're going to create a model that predicts house value based 
# solely on the neighborhood's median income.  

# Discard any pre-existing version of the model.
my_model = None

# Shuffle the examples.
shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index)) 

# Invoke the functions to build and train the model. Train on the shuffled
# training set.
my_model = build_model(learning_rate)
epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature, 
                                    my_label, epochs, batch_size, 
                                    validation_split)

plot_the_loss_curve(epochs, history["root_mean_squared_error"], 
                    history["val_root_mean_squared_error"])

In [None]:
#### Use the test set to evaluate the model's performance.

x_test = test_df[my_feature]
y_test = test_df[my_label]

results = my_model.evaluate(x_test, y_test, batch_size=batch_size)