#  Neural Networks Training with Keras

In [None]:
    ##################   ##########  Open Source Deep Learning Frameworks   #############    ##################
    
    1. DL4J

         * JVM-based
         * Distributed
         * Integrates with Hadoop and Spark
   
   
     2. Theano

         * Very popular in Academia
         * Fairly low level
         * Accessed via Python and NumPy


     3. Torch

         * Lua based
         * In house versions used by Facebook and Twitter
         * Contains pretrained models


      4. TensorFlow

         * Google written successor to Theano
         * Accessed via Python and NumPy
         * Highly parallel
         * Can be somewhat slow for certain problem sets



      5. Caffe

         * Not general purpose. Focuses on machine-vision problems
         * Implemented in C++ and is very fast
         * Not easily extensible
         * Has a Python interface

#  Implementing MLPs with Keras

In [None]:
       ##############   #################   Introduction to Keras   ###################   #############
 
 =>  Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate and execute all sorts 
      of neural networks. It quickly gained popularity owing to its ease-of-use, flexibility and beautiful design. 
        
 =>  Its documentation (or specification) is available at https://keras.io. The reference implementation is 
       also simply called Keras, so to avoid any confusion we will call it keras-team; it is available 
         at https://github.com/keras-team/keras.
                
 =>  It was developed by François Chollet as part of a research project and released as an open source project
       in March 2015. [Project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System)] 
    
 =>  To perform the heavy computations required by neural networks, keras-team relies on a computation backend. 
     At present, Keras can rely on 3 different Deep Learning libraries.
                         
        =>  TensorFlow, Microsoft Cognitive Toolkit (CNTK) or Theano.

    
 =>  Moreover, since late 2016, other implementations have been released. We can now run Keras on Apache MXNet, 
      Apple’s Core ML, JavaScript or TypeScript (to run Keras code in a web browser), or PlaidML (which can run on
      all sorts of GPU devices, not just Nvidia). In addition, TensorFlow itself now comes bundled with its own 
      Keras implementation called tf.keras. 
                         
 =>  tf.keras only supports TensorFlow as the backend, but it has the advantage of offering some very useful extra 
       features. For example, it supports TensorFlow’s Data API, which makes it quite easy to load and preprocess 
       data efficiently. 

 =>  To use the TensorFlow version of Keras, we install TensorFlow and import Keras directly from it.
                         
        >>>  from tensorflow import keras
        >>>  keras.__version__

<img src='images/keras.jpg' width='480px'>

#  Building an Image Classifier Using the Sequential API

In [None]:
     #############   ###############  Working with Fashion MNIST dataset   #################   ###############
     
 =>  We will tackle the Fashion MNIST dataset, which has the exact same format as MNIST 
      (70,000 grayscale images of 28×28 pixels each, with 10 classes), but the images represent fashion items
      rather than handwritten digits, so each class is more diverse and the problem turns out to be significantly 
      more challenging than MNIST.
   
 =>  For example, a simple linear model reaches about 92% accuracy on MNIST, but only about 83% on Fashion MNIST.

In [None]:
#  Importing the Required libraries

import tensorflow as tf
from tensorflow import keras

#  Getting the versions 
print ('TensorFlow version is -> ', tf.__version__)
print ('Keras version is -> ', keras.__version__)

### Loading the fashion MNIST Dataset with Keras

In [None]:
 =>  There is an important difference between loading the Fashion MNIST dataset with Keras and loading it with 
       Scikit-Learn.
     
 =>  Every image is represented as a 28x28 array rather than a 1D array of size 784, and the pixel intensities are 
       represented as integers (from 0 to 255) rather than floats (from 0.0 to 255.0).

In [None]:
#  Loading the Fashion MNIST Dataset using Keras API

fashion_mnist = keras.datasets.fashion_mnist
(X_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

In [None]:
#  Getting the information about size and dimension and Dtypes of dataset
print ('shape of dataset -> ', X_train.shape)

X_train.dtype

In [None]:
 =>  Note that the dataset is already split into a training set and a test set, but there is no validation set, 
       so let’s create one.
       
 =>  Moreover, since we are going to train the neural network using Gradient Descent, we must scale the input 
      features. For simplicity, we just scale the pixel intensities down to the 0-1 range by dividing them by 255.0
      (this also converts them to floats).

In [None]:
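#  Creating a validation set and scaling the pixel intensities down to the 0-1 range

X_valid, x_train = X_train[:5000] / 255.0, X_train[5000:] / 255.0
y_valid, y_train = y_train[:5000], y_train[5000:]

#  The test set must be scaled the same way before it is passed to evaluate() or predict()
x_test = x_test / 255.0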

In [None]:
 =>  With MNIST, when the label is equal to 5, it means that the image represents the handwritten digit 5.
       However, for Fashion MNIST, we need the list of 'class names' to know what we are dealing with.

In [None]:
#  Creating the list of class names (the index in the list is the label)
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat","Sandal", "Shirt", "Sneaker", "Bag",
               "Ankle boot"]

#  Creating the Model Using the Sequential API

In [None]:
# Let’s build the Neural network! 
#  Here is a classification MLP with two hidden layers

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape = [28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

## Let’s go through this code line by line

In [None]:

 •  The first line creates a Sequential model. This is the simplest kind of Keras model, for neural networks that 
      are just composed of a single stack of layers connected sequentially. This is called the Sequential API.

 •  Next, we build the first layer and add it to the model. It is a Flatten layer whose role is simply to convert
       each input image into a 1D array. If it receives input data X, it computes X.reshape(-1, 28*28). 
     This layer does not have any parameters; it is just there to do some simple preprocessing. 
    Since it is the first layer in the model, you should specify the input_shape, which does not include the
        batch size, only the shape of the instances.
    Alternatively, you could add a keras.layers.InputLayer as the first layer, setting its input shape to [28,28].
    
 •  Next we add a Dense hidden layer with 300 neurons. It will use the ReLU activation function. Each Dense layer 
      manages its own weight matrix, containing all the connection weights between the neurons and their inputs. 
    It also manages a vector of bias terms (one per neuron). When it receives some input data X, 
     it computes activation(XW + b), where W is the weight matrix and b the bias vector (Equation 10-2).
     
 •  Next we add a second Dense hidden layer with 100 neurons, also using the ReLU activation function.
 
 •  Finally, we add a Dense output layer with 10 neurons (one per class), using the softmax activation function
     (because the classes are exclusive).
    
 =>  Specifying activation="relu" is equivalent to activation=keras.activations.relu. 
     
 =>  Other activation functions are available in keras.activations package, 
        visit https://keras.io/activations/ for the full list.
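
To make the layer-by-layer description above more concrete, here is a minimal NumPy sketch of the forward pass such a stack performs. It is only an illustration, not how Keras implements it: the weights are random placeholders, the input is a fake batch, and the helper names `relu`, `softmax` and `dense` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dense(X, W, b, activation):
    # One Dense layer: activation(XW + b)
    return activation(X @ W + b)

# A fake batch of 4 "images" with values already in the 0-1 range
X = rng.random((4, 28, 28))

# Flatten: (4, 28, 28) -> (4, 784)
X_flat = X.reshape(-1, 28 * 28)

# Placeholder parameters with the same shapes as the three Dense layers
W1, b1 = rng.normal(scale=0.05, size=(784, 300)), np.zeros(300)
W2, b2 = rng.normal(scale=0.05, size=(300, 100)), np.zeros(100)
W3, b3 = rng.normal(scale=0.05, size=(100, 10)), np.zeros(10)

h1 = dense(X_flat, W1, b1, relu)
h2 = dense(h1, W2, b2, relu)
probas = dense(h2, W3, b3, softmax)

print(X_flat.shape, h1.shape, h2.shape, probas.shape)
print(probas.sum(axis=1))  # each row sums to 1 (up to rounding)
```

Softmax fits the output layer here because each row of `probas` is a probability distribution over the 10 mutually exclusive classes.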

In [None]:
<img src = 'images/keras.io.png' width='640px'>

In [None]:
###  Getting Model's summary

In [None]:
 =>  The model’s summary() method displays all the model’s layers, including each layer’s name, its output shape, 
      and its number of parameters.
 
      *  Layer names are automatically generated, or we can set them manually.
      *  In the output shape, None means the batch size can be anything.
      *  The summary also reports the total number of parameters, split into trainable and non-trainable ones.

In [None]:
model.summary()

In [None]:
     #############   ###############   Getting names of layers of a Model   ################   ##############

 =>  Dense layers often have a lot of parameters. For example, the first hidden layer has 784 × 300 connection 
      weights, plus 300 bias terms, which adds up to 235,500 parameters. This gives the model quite a lot of 
      flexibility to fit the training data, but it also means that the model runs the risk of overfitting, 
      especially when you do not have a lot of training data. 
     
 =>  You can easily get a model’s list of layers, fetch a layer by its index, or fetch it by name.
      Layer names are generated automatically, so the exact name (e.g. 'dense_3') depends on your session.

       >>>  model.layers      >>>  model.layers[1].name     >>>  model.get_layer('dense_3').name

 =>  All the parameters of a layer can be accessed using its get_weights() and set_weights() methods. 
       For a Dense layer, this includes both the connection weights and the bias terms.
       
       >>>  hidden1 = model.layers[1]
       >>>  weights, biases = hidden1.get_weights()
       
 =>  Accessing the weights and biases and their shapes or dtypes.
 
       >>>  weights.shape    >>> weights.dtype    or     >>> biases.shape 
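
The parameter arithmetic above can be checked for every Dense layer of the classifier with a few lines of plain Python. The helper name `dense_param_count` is made up for this sketch; the totals should match what `model.summary()` reports for this architecture.

```python
def dense_param_count(n_inputs, n_units):
    # One weight per (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units

layer_sizes = [28 * 28, 300, 100, 10]  # flattened input, two hidden layers, output

counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)

print(counts)  # [235500, 30100, 1010]
print(total)   # 266610
```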

In [None]:
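#  Getting the layers of a model, by index and by name

model.layers
hidden1 = model.layers[1]
hidden1.name                                  # auto-generated, so it depends on the session (e.g. 'dense')
model.get_layer(hidden1.name) is hidden1      # fetching by name returns the same layer object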

In [None]:
#  You can also generate an image of your model.
keras.utils.plot_model(model)

In [None]:
 =>  A Dense layer initializes its connection weights randomly, and its biases are initialized to zeros.
     
 =>  If you ever want to use a different initialization method, you can set kernel_initializer 
       (kernel is another name for the matrix of connection weights) or bias_initializer when creating the layer.
       
 =>  For the full list of initializers, visit ->  https://keras.io/initializers/.

 =>  The shape of the weight matrix depends on the number of inputs. This is why it is recommended to specify 
       the input_shape when creating the first layer in a Sequential model. 
 
 =>  However, if you do not specify the input shape, Keras will simply wait until it knows the input shape
      before it actually builds the model. This will happen either when you feed it actual data 
       (e.g., during training), or when you call its build() method. 
       
 =>  Until the model is really built, the layers will not have any weights, and you will not be able to do certain 
       things (such as print the model summary or save the model), so if you know the input shape when creating the 
       model, it is best to specify it.
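
As a rough picture of what "random weights, zero biases" means, here is a NumPy sketch of Glorot uniform initialization, which the Keras documentation lists as the default `kernel_initializer` for Dense layers. The function `glorot_uniform` below is a hand-written stand-in, not the Keras implementation. It also shows why the number of inputs must be known before the weights can be created.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Glorot (Xavier) uniform: sample from U(-limit, limit)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)

# The kernel shape is (number of inputs, number of neurons),
# so it cannot be allocated until the input shape is known
kernel = glorot_uniform(784, 300, rng)
bias = np.zeros(300)  # biases start at zero

limit = np.sqrt(6.0 / (784 + 300))
print(kernel.shape, bias.shape)
print(bool(np.abs(kernel).max() <= limit))
```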

In [None]:
##  Compiling the Model

In [None]:
 =>  After a model is created, you must call its compile() method to specify the loss function and the optimizer
      to use. Optionally, you can also specify a list of extra metrics to compute during training and evaluation.
        
      >>>  model.compile(loss = "sparse_categorical_crossentropy", optimizer = "sgd", metrics = ["accuracy"])
    
    
 =>  For the full lists of losses, optimizers, and metrics, visit -
    
          https://keras.io/losses/, https://keras.io/optimizers/ and https://keras.io/metrics/.
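
For intuition about the loss and metric named in the compile() call, here is a small NumPy sketch of what they measure. The "sparse" variant of the loss expects integer class indices as targets (which is what Fashion MNIST provides) rather than one-hot vectors. The functions and the toy numbers below are illustrative assumptions, not Keras code.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probas, eps=1e-7):
    # y_true holds integer class indices; probas holds one probability row per instance
    probas = np.clip(probas, eps, 1.0)
    picked = probas[np.arange(len(y_true)), y_true]  # probability given to the true class
    return float(-np.log(picked).mean())

def accuracy(y_true, probas):
    # Fraction of instances whose most probable class is the true class
    return float((probas.argmax(axis=1) == y_true).mean())

y_true = np.array([0, 2, 1])
probas = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.5, 0.4, 0.1]])

loss = sparse_categorical_crossentropy(y_true, probas)
acc = accuracy(y_true, probas)
print(round(loss, 4), round(acc, 4))
```

The third instance is misclassified (class 0 gets the highest probability while the label is 1), so it lowers the accuracy and contributes the largest term to the loss.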
       

In [None]:
#  Compiling the model (it has not been trained yet)

model.compile(loss = "sparse_categorical_crossentropy",optimizer = "sgd", metrics = ["accuracy"])

In [None]:
<img src='images/model-compile.png'>

In [None]:
<img src='images/model-explaination.png'>

In [None]:
## Training and Evaluating the Model

 =>  Now the model is ready to be trained. For this we simply need to call its fit() method.
 
 =>  We pass it the input features (x_train) and the target classes (y_train), as well as the number of epochs to
      train (otherwise it defaults to just 1, which would definitely not be enough to converge to a good solution).
 
 =>  We also pass a validation set (this is optional). Keras will measure the loss and the extra metrics on this 
      set at the end of each epoch, which is very useful to see how well the model really performs.
  
 =>  If the performance on the training set is much better than on the validation set, your model is probably 
      overfitting the training set (or there is a bug, such as a data mismatch between the training set and the 
      validation set).

In [None]:
#  Training the Model
history = model.fit(x_train, y_train, epochs=30, validation_data=(X_valid, y_valid))

 =>  And that’s it! The neural network is trained. 
 
 =>  At each epoch during training, Keras displays the number of instances processed so far (along with a progress 
      bar), the mean training time per sample, and the loss and accuracy (or any other extra metrics you asked for), 
      both on the training set and the validation set. 
    
 =>  You can see that the training loss went down, which is a good sign, and the validation accuracy reached 87.28% 
      after 30 epochs, not too far from the training accuracy, so there does not seem to be much overfitting going 
      on.

 =>  If the training set was very skewed, with some classes being overrepresented and others underrepresented, 
      it would be useful to set the class_weight argument when calling the fit() method, giving a larger weight to 
      underrepresented classes and a lower weight to overrepresented classes. 
      These weights would be used by Keras when computing the loss.
 
 =>  If you need per-instance weights instead, you can set the sample_weight argument (it supersedes class_weight). 
      This could be useful, for example, if some instances were labeled by experts while others were labeled using 
      a crowdsourcing platform: you might want to give more weight to the former. You can also provide sample 
      weights (but not class weights) for the validation set by adding them as a third item in the validation_data 
      tuple.
     
 =>  The fit() method returns a History object containing the training parameters (history.params), the list of 
      epochs it went through (history.epoch), and most importantly a dictionary (history.history) containing the 
      loss and extra metrics it measured at the end of each epoch on the training set and on the validation set 
      (if any). 
   
 =>  If you create a Pandas DataFrame using this dictionary and call its plot() method, you get the learning curves 
      plotted in the next cell.
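
The class_weight argument mentioned above takes a dictionary that maps each class index to a weight. One common heuristic, assumed here purely for illustration, is to make each weight inversely proportional to the class frequency. The helper `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight = n_samples / (n_classes * class_count): rarer classes get larger weights
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# A toy, skewed label set: class 0 is heavily overrepresented
labels = [0] * 80 + [1] * 15 + [2] * 5
class_weight = balanced_class_weights(labels)
print(class_weight)

# The resulting dictionary has the form fit() expects, e.g.:
# model.fit(x_train, y_train, epochs=30, class_weight=class_weight)
```

Fashion MNIST itself is balanced, so this is only relevant when you apply the same workflow to a skewed dataset.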

In [None]:
# Plotting the curves

import pandas as pd
import matplotlib.pyplot as plt

pd.DataFrame(history.history).plot(figsize=(8,5))
plt.grid(True)

#  Setting the vertical range (0 - 1)
plt.gca().set_ylim(0,1)
plt.show()

In [None]:
 =>  You can see that both the training and validation accuracy steadily increase during training, while the 
      training and validation loss decrease. Good! 
 
 =>  Moreover, the validation curves are quite close to the training curves, which means that there is not too much 
      overfitting.
 
 =>  In this particular case, the model performed better on the validation set than on the training set at the 
      beginning of training. This sometimes happens by chance (especially when the validation set is fairly small). 
      However, the training set performance ends up beating the validation performance, as is generally the case 
      when you train for long enough. You can tell that the model has not quite converged yet, as the validation 
      loss is still going down, so you should probably continue training. 
    
 =>  It’s as simple as calling the fit() method again, since Keras just continues training where it left off (you 
      should be able to reach close to 89% validation accuracy).
 
 =>  If you are not satisfied with the performance of your model, you should go back and tune the model’s 
      hyperparameters, for example the number of layers, the number of neurons per layer, the types of activation 
      functions used for each hidden layer, the number of training epochs, or the batch size (it can be set in 
      the fit() method using the batch_size argument, which defaults to 32). 
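
To see what the batch size changes in practice, this short calculation works out how many batches (and therefore weight updates) one epoch contains for the training set used above, assuming the default batch_size of 32.

```python
import math

n_train = 55_000   # 60,000 training images minus the 5,000 held out for validation
batch_size = 32    # default value of fit()'s batch_size argument

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 1719
print(last_batch_size)   # 24
```

A larger batch size means fewer, smoother updates per epoch; a smaller one means more, noisier updates.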


In [None]:
# Evaluation of model on Test dataset
model.evaluate(x_test, y_test)

In [None]:
## Using the Model to Make Predictions

In [None]:
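 =>  The predict() method is used to get predictions from the trained model. For a classifier with a softmax 
      output layer, it returns one probability per class for each instance.
 
      >>> model.predict(data)
      
 =>  To get the index of the most likely class instead, take the argmax of these probabilities. Older versions 
      of Keras offered a predict_classes() method for this, but it was removed in TensorFlow 2.6.
      
      >>> np.argmax(model.predict(data), axis=-1)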
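
The step from class probabilities to class names can be tried without a trained model. In this sketch the probabilities are made up and simply stand in for the output of predict(); only the argmax and the lookup in class_names are the point.

```python
import numpy as np

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Made-up probabilities for 3 instances, standing in for the output of model.predict()
y_proba = np.zeros((3, 10))
y_proba[0, [7, 9]] = [0.05, 0.95]
y_proba[1, [2, 6]] = [0.80, 0.20]
y_proba[2, 1] = 1.00

y_classes = y_proba.argmax(axis=-1)                  # index of the most probable class
predicted_names = np.array(class_names)[y_classes]   # map indices to readable names

print(y_classes)
print(predicted_names)
```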

In [None]:
#  Getting predictions for a few instances from the test set
#  Preparing data for prediction
x_new = x_test[:3]

#  Predicting with the new data (one probability per class for each instance)
y_predict = model.predict(x_new)
y_predict


In [None]:
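#  Getting the predicted class names

import numpy as np

#  predict_classes() was removed in TensorFlow 2.6; the argmax of the predicted probabilities is equivalent
y_predict = np.argmax(model.predict(x_new), axis=-1)

np.array(class_names)[y_predict]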

In [None]:
# Saving  and Restoring a Model

        ##############   ###############   Saving the Trained model   ###############    ###############

 =>  Keras will save both the model’s architecture (including every layer’s hyperparameters) and the value of all 
      the model parameters for every layer (e.g., connection weights and biases), using the HDF5 format. 
      It also saves the optimizer (including its hyperparameters and any state it may have).
    
 =>  You will typically have a script that trains a model and saves it, and one or more scripts (or web services) 
      that load the model and use it to make predictions. Loading the model is just as easy.

     >>>  model.save("my_keras_model.h5")

     >>>  model = keras.models.load_model("my_keras_model.h5")

 =>  This will work when using the Sequential API or the Functional API, but unfortunately not when using Model 
      subclassing. However, you can use save_weights() and load_weights() to at least save and restore the model 
      parameters (but you will need to save and restore everything else yourself).

In [None]:
#  Saving the Trained Model

model.save("my_keras_model.h5")


In [None]:
       #############   #############   Using Callbacks to Save Models   ###############   ##############

 =>  The fit() method accepts a callbacks argument that lets you specify a list of objects that Keras will call 
      during training: at the start and end of training, at the start and end of each epoch, and even before and 
      after processing each batch. 
 
     For example, the ModelCheckpoint callback saves checkpoints of your model at regular intervals during 
      training, by default at the end of each epoch.

   
   >>>  checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5")
   >>>  history = model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint_cb])


 =>  Moreover, if you use a validation set during training, you can set 'save_best_only=True'
      when creating the ModelCheckpoint. In this case, it will only save your model when its performance on the 
      validation set is the best so far. This way, you do not need to worry about training for too long and 
      overfitting the training set. 
     Simply restore the last model saved after training, and this will be the best model on the validation set. 
     This is a simple way to implement early stopping.
    
    
    >>>  checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)
    >>>  history = model.fit(x_train, y_train, epochs=10, validation_data=(X_valid, y_valid),
                              callbacks=[checkpoint_cb])
                              
    >>>  model = keras.models.load_model("my_keras_model.h5")        # rollback to best model

 =>  Another way to implement early stopping is to simply use the EarlyStopping callback. It will interrupt 
      training when it measures no progress on the validation set for a number of epochs (defined by 
      the patience argument), and it will optionally roll back to the best model.
     You can combine both callbacks to save checkpoints of your model (in case your computer crashes) and to 
      interrupt training early when there is no more progress (to avoid wasting time and resources).
     
     >>>  early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
     >>>  history = model.fit(x_train, y_train, epochs=100, validation_data=(X_valid, y_valid),
                               callbacks=[checkpoint_cb, early_stopping_cb])

 =>  The number of epochs can be set to a large value since training will stop automatically when there is no more 
      progress. Moreover, there is no need to restore the best model saved in this case, since the EarlyStopping 
      callback will keep track of the best weights and restore them for us at the end of training.
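
The patience logic described above can be simulated on a plain list of validation losses. This is a simplified sketch of the idea, not the Keras implementation (it ignores options such as a minimum improvement threshold), and the function name and loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs bring no improvement
    over the best validation loss seen so far.
    """
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.59, 0.58]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With patience=3 the run stops at epoch 5 and would roll back to the weights from epoch 2, even though later epochs in this made-up series would have improved further. That is the trade-off the patience value controls.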


In [None]:
 =>  There are many other callbacks available in the keras.callbacks package.
        
         visit ->  https://keras.io/callbacks/.

 =>  If you need extra control, you can easily write your own custom callbacks.
      For example, the following custom callback will display the ratio between the validation loss and the 
      training loss during training (e.g., to detect overfitting).

       >>>  class PrintValTrainRatioCallback(keras.callbacks.Callback):
       >>>      def on_epoch_end(self, epoch, logs):
       >>>          print("\nval/train: {:.2f}".format(logs["val_loss"] / logs["loss"]))


In [None]:
# Building a Regression MLP Using the Sequential API

In [None]:
 =>  Let’s switch to the California housing problem and tackle it using a regression neural network.
 
 =>  For simplicity, we will use Scikit-Learn’s fetch_california_housing() function to load the data.
 
 =>  This dataset contains only numerical features (there is no ocean_proximity feature), and there are no
       missing values. After loading the data, we split it into a training set, a validation set and a test set, 
       and we scale all the features.
            
 =>  Building, training, evaluating and using a regression MLP with the Sequential API to make predictions is 
       quite similar to what we did for classification. 
     The main differences are the fact that the output layer has a single neuron (since we only want to predict 
       a single value) and uses no activation function, and the loss function is the mean squared error.
        
 =>  Since the dataset is quite noisy, we just use a single hidden layer with fewer neurons than before, 
       to avoid overfitting:
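
Two ingredients of the regression setup above, feature scaling and the mean squared error, are easy to spell out in NumPy. The sketch below uses synthetic data and hand-written code as a stand-in for StandardScaler and the Keras loss; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic stand-in features
X_valid_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit the scaling statistics on the training set only...
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

# ...then apply the same transformation to every split
X_train_std = (X_train_demo - mean) / std
X_valid_std = (X_valid_demo - mean) / std

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

mse = mean_squared_error(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0]))

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
print(mse)
```

Reusing the training mean and standard deviation for the validation and test sets keeps all splits on the same scale, which is why the notebook calls fit_transform() once and transform() afterwards.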

In [None]:
#  Importing the Required packages

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Loading the Housing dataset
housing = fetch_california_housing()

# Splitting the Dataset
X_train_full, X_test, y_train_full, y_test = train_test_split(
 housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(
 X_train_full, y_train_full)

# Scaling the Dataset
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

# Creating the Model
model = keras.models.Sequential([keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
                                 keras.layers.Dense(1)])

#  Compiling the Model
model.compile(loss="mean_squared_error", optimizer="sgd")

# Training the Model (on the scaled features computed above)
history = model.fit(X_train_scaled, y_train, epochs=2, validation_data=(X_valid_scaled, y_valid))

# Evaluating the Model
mse_test = model.evaluate(X_test_scaled, y_test)

# Getting predictions with the Model
X_new = X_test_scaled[:3]                 # pretend these are new instances
y_pred = model.predict(X_new)

In [None]:
# Building Complex Models Using the Functional API

In [None]:
     #############    ###############   Introduction to Functional API  ###############   ############

 =>  One example of a Non Sequential Neural Network is a Wide & Deep Neural Network.
 
 =>  This neural network architecture was introduced in a 2016 paper by Heng-Tze Cheng et al. It connects all or
      part of the inputs directly to the output layer. This architecture makes it possible for the neural network
      to learn both deep patterns (using the deep path) and simple rules (through the short path).
       
 =>  In contrast, a regular MLP forces all the data to flow through the full stack of layers, thus simple patterns 
        in the data may end up being distorted by this sequence of transformations.

In [None]:
<img src='images/wide-deep-nn.png' width='420px'>

In [None]:
###  Let’s build such a neural network to tackle the California housing problem

In [None]:
input_ = keras.layers.Input(shape=X_train.shape[1:])     # trailing underscore avoids shadowing Python's input()
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])   # create the layer, then call it on a list of tensors
output = keras.layers.Dense(1)(concat)
model = keras.models.Model(inputs=[input_], outputs=[output])

In [None]:
 **  Let’s go through each line of this code.
    
 =>  First, we need to create an Input object. This is needed because we may have multiple inputs, as we will
      see later.
 
 =>  Next, we create a Dense layer with 30 neurons and using the ReLU activation function. 
       As soon as it is created, notice that we call it like a function, passing it the input. 
     This is why this is called the Functional API. Note that we are just telling Keras how it should connect 
        the layers together, no actual data is being processed yet.
        
 =>  We then create a second hidden layer, and again we use it as a function. Note however that we pass it the 
       output of the first hidden layer.

 =>  Next, we create a Concatenate() layer, and once again we immediately use it like a function, to concatenate 
       the input and the output of the second hidden layer (you may prefer the keras.layers.concatenate() function,
      which creates a Concatenate layer and immediately calls it with the given inputs).

 =>  Then we create the output layer, with a single neuron and no activation function, and we call it like
       a function, passing it the result of the concatenation.

 =>  Lastly, we create a Keras Model, specifying which inputs and outputs to use.
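
The data flow that the Functional API code wires together can be traced with plain NumPy. This is a sketch under assumptions: the weights are random placeholders and the batch is fake, but the shapes follow the architecture described above (8 input features, two hidden layers of 30 neurons, one output neuron).

```python
import numpy as np

rng = np.random.default_rng(7)

def relu(z):
    return np.maximum(0.0, z)

n_features = 8                          # the California housing data has 8 numerical features
X = rng.normal(size=(5, n_features))    # a fake batch of 5 instances

# Placeholder weights with the shapes the layers above would have
W1, b1 = rng.normal(size=(n_features, 30)), np.zeros(30)
W2, b2 = rng.normal(size=(30, 30)), np.zeros(30)
W_out, b_out = rng.normal(size=(n_features + 30, 1)), np.zeros(1)

hidden1 = relu(X @ W1 + b1)                       # deep path, first hidden layer
hidden2 = relu(hidden1 @ W2 + b2)                 # deep path, second hidden layer
concat = np.concatenate([X, hidden2], axis=1)     # raw inputs (wide path) joined with the deep path
output = concat @ W_out + b_out                   # single neuron, no activation

print(concat.shape, output.shape)  # (5, 38) (5, 1)
```

The output layer therefore sees 38 values per instance: the 8 raw features plus the 30 outputs of the second hidden layer.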


In [None]:
 =>  Once you have built the Keras model, everything is exactly like earlier, so there is no need to repeat it 
       here: you must compile the model, train it, evaluate it and use it to make predictions.

 =>  But what if you want to send a subset of the features through the wide path, and a different subset 
      (possibly overlapping) through the deep path? 
     Suppose we want to send 5 features through the wide path (features 0 to 4), and 6 features through the
       deep path (features 2 to 7).

In [None]:
input_A = keras.layers.Input(shape=[5])
input_B = keras.layers.Input(shape=[6])
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.models.Model(inputs=[input_A, input_B], outputs=[output])

In [None]:
<img src='images/multiple-inputs.png' width='420px'>

In [None]:
 =>  The code is self-explanatory. 
     Note that we specified inputs = [input_A, input_B] when creating the model. 
            
 =>  Now we can compile the model as usual, but when we call the fit() method, instead of passing a single input 
     matrix X_train, we must pass a pair of matrices (X_train_A, X_train_B): one per input. 
     The same is true for X_valid, and also for X_test and X_new when you call evaluate() or predict().
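
The pair of matrices is obtained by slicing columns. In the model above, input_A (5 features) feeds the concatenation directly and input_B (6 features) feeds the hidden layers. The following NumPy sketch, using a small made-up array, shows the slices and their overlap.

```python
import numpy as np

X = np.arange(3 * 8).reshape(3, 8)   # 3 fake instances with 8 features

X_A = X[:, :5]    # features 0 to 4, matching input_A (shape=[5])
X_B = X[:, 2:]    # features 2 to 7, matching input_B (shape=[6])

print(X_A.shape, X_B.shape)   # (3, 5) (3, 6)

# Features 2, 3 and 4 appear in both subsets
overlap = sorted(set(range(0, 5)) & set(range(2, 8)))
print(overlap)                # [2, 3, 4]
```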

In [None]:
model.compile(loss="mse", optimizer="sgd")

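# Input A takes features 0 to 4, input B takes features 2 to 7 (using the scaled features computed earlier)
X_train_A, X_train_B = X_train_scaled[:, :5], X_train_scaled[:, 2:]
X_valid_A, X_valid_B = X_valid_scaled[:, :5], X_valid_scaled[:, 2:]
X_test_A, X_test_B = X_test_scaled[:, :5], X_test_scaled[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]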

history = model.fit((X_train_A, X_train_B), y_train, epochs=20, validation_data=((X_valid_A, X_valid_B), y_valid))

mse_test = model.evaluate((X_test_A, X_test_B), y_test)

y_pred = model.predict((X_new_A, X_new_B))

In [None]:
 =>  There are also many use cases in which you may want to have multiple outputs.
    
   •  The task may demand it, for example you may want to locate and classify the main object in a picture. 
        This is both a regression task (finding the coordinates of the object’s center, as well as its width and
          height) and a classification task.
        
   •  Similarly, you may have multiple independent tasks to perform based on the same data. 
       Sure, you could train one neural network per task, but in many cases you will get better results on all 
         tasks by training a single neural network with one output per task.
        This is because the neural network can learn features in the data that are useful across tasks.

   •  Another use case is as a regularization technique (i.e., a training constraint whose objective is to reduce 
        overfitting and thus improve the model’s ability to generalize). 

       For example, you may want to add some auxiliary outputs in a neural network architecture 
        (see the figure below) to ensure that the underlying part of the network learns something useful on its 
        own, without relying on the rest of the network.

In [None]:
<img src='images/multiple-outputs.png' width='420px'>

In [None]:
<img src='images/gb.png'>
<img src='images/gb2.png'>


In [None]:
# Building Dynamic Models Using the Subclassing API

 =>  Both the Sequential API and the Functional API are declarative: you start by declaring which layers you want to use and how they should be connected, and only then can you start feeding the model some data for training or inference. This has many advantages: the model can easily be saved, cloned and shared, its structure can be displayed and analyzed, and the framework can infer shapes and check types, so errors can be caught early (i.e., before any data ever goes through the model). It’s also fairly easy to debug, since the whole model is just a static graph of layers. But the flip side is just that: it’s static. Some models involve loops, varying shapes, conditional branching, and other dynamic behaviors. For such cases, or simply if you prefer a more imperative programming style, the Subclassing API is for you.

Simply subclass the Model class, create the layers you need in the constructor, and use them to perform the computations you want in the call() method. For example, creating an instance of the following WideAndDeepModel class gives us an equivalent model to the one we just built with the Functional API. You can then compile it, evaluate it and use it to make predictions, exactly like we just did.

In [None]:
<img src='images/gb3.png'>

This example looks very much like the Functional API, except we do not need to create the inputs; we just use the input argument to the call() method, and we separate the creation of the layers in the constructor from their usage in the call() method. However, the big difference is that you can do pretty much anything you want in the call() method: for loops, if statements, low-level TensorFlow operations, your imagination is the limit (see Chapter 12)! This makes it a great API for researchers experimenting with new ideas.

However, this extra flexibility comes at a cost: your model’s architecture is hidden within the call() method, so Keras cannot easily inspect it, it cannot save or clone it, and when you call the summary() method, you only get a list of layers, without any information on how they are connected to each other. Moreover, Keras cannot check types and shapes ahead of time, and it is easier to make mistakes. So unless you really need that extra flexibility, you should probably stick to the Sequential API or the Functional API.

Keras models have anoutputattribute, so we cannot use that name for the main output layer, which is whywe renamed it tomain_output.

Keras models can be used just like regular layers, so you can easilycompose them to build complex architectures.

In [None]:
# Visualization Using TensorBoard

In [None]:
<img src='images/tb.png'>
<img src='images/tb1.png'>

And that’s all there is to it! It could hardly be easier to use. If you run this code, the TensorBoard callback will take care of creating the log directory for you (along with its parent directories if needed), and during training it will create event files and write summaries to them. After running the program a second time (perhaps changing some hyperparameter value), you will end up with a directory structure similar to this one:

<images/tb-logs>
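One common way to end up with one subdirectory per run is to derive the directory name from the current date and time. The sketch below assumes a root log directory named my_logs; the helper name get_run_logdir and the exact naming pattern are hypothetical choices for illustration, and the code shown in the figure may differ.

```python
import os
import time


def get_run_logdir(root_logdir="my_logs"):
    # Build a name such as my_logs/run_2019_06_07-15_15_22 from the current time,
    # so that every training run writes its event files to a separate directory.
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)


run_logdir = get_run_logdir()
print(run_logdir)

# The resulting path would then be handed to the callback, for example:
#   tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
#   model.fit(..., callbacks=[tensorboard_cb])
```

Because each run gets its own directory, TensorBoard can display the learning curves of several runs side by side.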

Next you need to start the TensorBoard server. If you installed TensorFlow within a virtualenv, you should activate it. Next, run the following command at the root of the project (or from anywhere else as long as you point to the appropriate log directory). If your shell cannot find the tensorboard script, then you must update your PATH environment variable so that it contains the directory in which the script was installed (alternatively, you can just replace tensorboard with python3 -m tensorboard.main).
    
    $ tensorboard --logdir=./my_logs --port=6006
    
Finally, open up a web browser to http://localhost:6006. You should see TensorBoard’s web interface. Click on the SCALARS tab to view the learning curves (see Figure 10-16). Notice that the training loss went down nicely during both runs, but the second run went down much faster. Indeed, we used a larger learning rate by setting optimizer=keras.optimizers.SGD(lr=0.05) (the argument is named learning_rate in recent versions) instead of optimizer="sgd", which defaults to a learning rate of 0.01.

In [None]:
<img src='images/tb-logs.png'>
<img src='images/tensorboard.png'>

Unfortunately, at the time of writing, no other data is exported by the TensorBoard callback, but this issue will probably be fixed by the time you read these lines. In TensorFlow 1, this callback exported the computation graph and many useful statistics: type help(keras.callbacks.TensorBoard) to see all the options.

Let’s summarize what you learned so far in this chapter: we saw where neural nets came from, what an MLP is and how you can use it for classification and regression, how to build MLPs using tf.keras’s Sequential API, or more complex architectures using the Functional API or Model Subclassing. You learned how to save and restore a model, use callbacks for checkpointing, early stopping, and more, and finally how to use TensorBoard for visualization. You can already go ahead and use neural networks to tackle many problems! However, you may wonder how to choose the number of hidden layers, the number of neurons in the network, and all the other hyperparameters. Let’s look at this now.

In [None]:
# Fine-Tuning Neural Network Hyperparameters

The KerasRegressor object is a thin wrapper around the Keras model built using build_model(). Since we did not specify any hyperparameter when creating it, it will just use the default hyperparameters we defined in build_model(). Now we can use this object like a regular Scikit-Learn regressor: we can train it using its fit() method, then evaluate it using its score() method, and use it to make predictions using its predict() method. Note that any extra parameter you pass to the fit() method will simply get passed to the underlying Keras model. Also note that the score will be the opposite of the MSE because Scikit-Learn wants scores, not losses (i.e., higher should be better).

    keras_reg.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])
    mse_test = keras_reg.score(X_test, y_test)
    y_pred = keras_reg.predict(X_new)
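The EarlyStopping callback passed to fit() halts training once the monitored validation loss has stopped improving for `patience` consecutive epochs. The framework-free sketch below mimics that rule on a made-up list of validation losses; the function name and the loss values are hypothetical, and the real callback has more options (such as a minimum improvement threshold) that are ignored here.

```python
def simulate_early_stopping(val_losses, patience=10):
    """Return (epoch at which training stops, epoch with the best loss), both 0-based."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The loss list ran out before patience was exhausted.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is reached at epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With patience=3, training stops at epoch 5 because epochs 3, 4, and 5 all fail to beat the loss reached at epoch 2.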


However, we do not actually want to train and evaluate a single model like this; we want to train hundreds of variants and see which one performs best on the validation set. Since there are many hyperparameters, it is preferable to use a randomized search rather than grid search (as we discussed in Chapter 2). Let’s try to explore the number of hidden layers, the number of neurons, and the learning rate.

    from scipy.stats import reciprocal
    from sklearn.model_selection import RandomizedSearchCV

    param_distribs = {
        "n_hidden": [0, 1, 2, 3],
        "n_neurons": np.arange(1, 100),
        "learning_rate": reciprocal(3e-4, 3e-2),
    }

    rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
    rnd_search_cv.fit(X_train, y_train, epochs=100,
                      validation_data=(X_valid, y_valid),
                      callbacks=[keras.callbacks.EarlyStopping(patience=10)])

As you can see, this is identical to what we did in Chapter 2, with the exception that we pass extra parameters to the fit() method: they simply get relayed to the underlying Keras models. Note that RandomizedSearchCV uses K-fold cross-validation, so it does not use X_valid and y_valid. These are just used for early stopping.

The exploration may last many hours depending on the hardware, the size of the dataset, the complexity of the model, and the values of n_iter and cv. When it is over, you can access the best parameters found, the best score, and the trained Keras model like this:

>>> rnd_search_cv.best_params_
>>> rnd_search_cv.best_score_
>>> model = rnd_search_cv.best_estimator_.model
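The reciprocal(3e-4, 3e-2) distribution used for the learning rate in the search samples uniformly on a logarithmic scale, so each order of magnitude in the range is equally likely to be explored. The sketch below reproduces that behaviour with the standard library only; the helper name, the seed, and the sample count are arbitrary choices for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    # Draw uniformly between log(low) and log(high), then map back with exp().
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(3e-4, 3e-2, rng) for _ in range(1000)]

# The geometric mean of the bounds (3e-3) splits the range into two decades,
# so roughly half of the samples should fall on each side of it.
fraction_below = sum(s < 3e-3 for s in samples) / len(samples)
print(min(samples), max(samples), fraction_below)
```

A plain uniform distribution over the same interval would instead place about 90% of its samples above 3e-3, leaving the small learning rates almost unexplored.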

You can now save this model, evaluate it on the test set, and if you are satisfied with its performance, deploy it to production. Using randomized search is not too hard, and it works well for many fairly simple problems. However, when training is slow (e.g., for more complex problems with larger datasets), this approach will only explore a tiny portion of the hyperparameter space. You can partially alleviate this problem by assisting the search process manually: first run a quick random search using wide ranges of hyperparameter values, then run another search using smaller ranges of values centered on the best ones found during the first run, and so on. This will hopefully zoom in to a good set of hyperparameters. However, this is very time consuming, and probably not the best use of your time.

Fortunately, there are many techniques to explore a search space much more efficiently than randomly. Their core idea is simple: when a region of the space turns out to be good, it should be explored more. This takes care of the “zooming” process for you and leads to much better solutions in much less time. Here are a few Python libraries you can use to optimize hyperparameters:

In [None]:
<img src='images/python-libraries-for-optimisation.png'>
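To make the "zooming" idea concrete, here is a small sketch of the manual coarse-to-fine procedure described above, applied to a single hyperparameter. Everything in it is an assumption made for illustration: toy_validation_loss is a hypothetical stand-in for "train a model and measure its validation loss" (with its minimum placed arbitrarily at 1e-2), and the starting range, number of rounds, and shrink rule are arbitrary. It is not how any of the libraries listed above work internally.

```python
import math
import random


def toy_validation_loss(learning_rate):
    # Hypothetical stand-in for a real training run; lowest at learning_rate = 1e-2.
    return (math.log10(learning_rate) + 2.0) ** 2


def random_search(low, high, n_iter, rng):
    # Sample candidates log-uniformly in [low, high] and keep the best one.
    candidates = [math.exp(rng.uniform(math.log(low), math.log(high)))
                  for _ in range(n_iter)]
    return min(candidates, key=toy_validation_loss)


def coarse_to_fine_search(low=1e-5, high=1.0, rounds=3, n_iter=10, seed=0):
    rng = random.Random(seed)
    best = None
    ranges, best_losses = [], []
    for _ in range(rounds):
        ranges.append((low, high))
        candidate = random_search(low, high, n_iter, rng)
        if best is None or toy_validation_loss(candidate) < toy_validation_loss(best):
            best = candidate
        best_losses.append(toy_validation_loss(best))
        # Halve the width of the range on a log scale and center it on the best value.
        half_width = (high / low) ** 0.25
        low, high = best / half_width, best * half_width
    return best, ranges, best_losses


best_lr, ranges, best_losses = coarse_to_fine_search()
print(best_lr, best_losses)
```

Each round searches a narrower window around the best value found so far, which is the manual process that the smarter search techniques automate.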

Moreover, many companies offer services for hyperparameter optimization. For example, Google Cloud ML Engine has a hyperparameter tuning service. Other companies provide APIs for hyperparameter optimization, such as Arimo, SigOpt, Oscar, and many more.

Hyperparameter tuning is still an active area of research. Evolutionary algorithms are making a comeback lately. For example, check out DeepMind’s excellent 2017 paper, where they jointly optimize a population of models and their hyperparameters. Google also used an evolutionary approach, not just to search for hyperparameters, but also to look for the best neural network architecture for the problem. They call this AutoML, and it is already available as a cloud service. Perhaps the days of building neural networks manually will soon be over? Check out Google’s post on this topic. In fact, evolutionary algorithms have also been used successfully to train individual neural networks, replacing the ubiquitous Gradient Descent! See this 2017 post by Uber where they introduce their Deep Neuroevolution technique.

Despite all this exciting progress, and all these tools and services, it still helps to have an idea of what values are reasonable for each hyperparameter, so you can build a quick prototype and restrict the search space. Here are a few guidelines for choosing the number of hidden layers and neurons in an MLP, and selecting good values for some of the main hyperparameters.

In [None]:
###  Number of Hidden Layers

For many problems, you can just begin with a single hidden layer and you will get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions provided it has enough neurons. For a long time, these facts convinced researchers that there was no need to investigate any deeper neural networks. But they overlooked the fact that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data.

To understand why, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy/paste. You would have to draw each tree individually, branch per branch, leaf per leaf. If you could instead draw one leaf, copy/paste it to draw a branch, then copy/paste that branch to create a tree, and finally copy/paste this tree to make a forest, you would be finished in no time. Real-world data is often structured in such a hierarchical way and Deep Neural Networks automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good solution, it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces in pictures, and you now want to train a new neural network to recognize hairstyles, then you can kickstart training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the value of the weights and biases of the lower layers of the first network. This way the network will not have to learn from scratch all the low-level structures that occur in most pictures; it will only have to learn the higher-level structures (e.g., hairstyles). This is called transfer learning.

In summary, for many problems you can start with just one or two hidden layers and it will work just fine (e.g., you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons, and above 98% accuracy using two hidden layers with the same total amount of neurons, in roughly the same amount of training time). For more complex problems, you can gradually ramp up the number of hidden layers, until you start overfitting the training set. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers (or even hundreds, but not fully connected ones, as we will see in Chapter 14), and they need a huge amount of training data. However, you will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will be a lot faster and require much less data.

In [None]:
### Number of Neurons per Hidden Layer

Obviously the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 x 28 = 784 input neurons and 10 output neurons.

As for the hidden layers, it used to be a common practice to size them to form a pyramid, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. For example, a typical neural network for MNIST may have three hidden layers, the first with 300 neurons, the second with 200, and the third with 100. However, this practice has been largely abandoned now, as it seems that simply using the same number of neurons in all hidden layers performs just as well in most cases, or even better, and there is just one hyperparameter to tune instead of one per layer (for example, all hidden layers could simply have 150 neurons). However, depending on the dataset, it can sometimes help to make the first hidden layer bigger than the others.

Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In general you will get more bang for the buck by increasing the number of layers than the number of neurons per layer. Unfortunately, as you can see, finding the perfect amount of neurons is still somewhat of a dark art.

A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting (and other regularization techniques, such as dropout, as we will see in Chapter 11). This has been dubbed the “stretch pants” approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.
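To get a feel for the two MNIST layouts mentioned above (the 300/200/100 pyramid and three hidden layers of 150 neurons each), it can help to count their trainable parameters. The sketch below assumes plain fully connected layers, each with one weight per input-output pair plus one bias per output; the helper name is hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of an MLP given its layer sizes, input layer first."""
    total = 0
    for n_inputs, n_outputs in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_inputs * n_outputs + n_outputs  # weight matrix + bias vector
    return total


pyramid = count_dense_parameters([784, 300, 200, 100, 10])
uniform = count_dense_parameters([784, 150, 150, 150, 10])
print(pyramid, uniform)  # 316810 164560
```

Most of the parameters sit in the first hidden layer, since it is connected to all 784 inputs. The counts say nothing about accuracy; they only show how the layer sizes translate into model size.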

In [None]:
### Learning Rate, Batch Size and Other Hyperparameters

The number of hidden layers and neurons are not the only hyperparameters you can tweak in an MLP. Here are some of the most important ones, and some tips on how to set them:

• The learning rate is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate (i.e., the learning rate above which the training algorithm diverges, as we saw in Chapter 4). So a simple approach for tuning the learning rate is to start with a large value that makes the training algorithm diverge, then divide this value by 3 and try again, and repeat until the training algorithm stops diverging. At that point, you generally won’t be too far from the optimal learning rate. That said, it is sometimes useful to reduce the learning rate during training.

• Choosing a better optimizer than plain old Mini-batch Gradient Descent (and tuning its hyperparameters) is also quite important. We will discuss this in Chapter 11.

• The batch size can also have a significant impact on your model’s performance and the training time. In general the optimal batch size will be lower than 32 (in April 2018, Yann LeCun even tweeted “Friends don’t let friends use mini-batches larger than 32”). A small batch size ensures that each training iteration is very fast, and although a large batch size will give a more precise estimate of the gradients, in practice this does not matter much since the optimization landscape is quite complex and the true gradients do not point precisely in the direction of the optimum. However, having a batch size greater than 10 helps take advantage of hardware and software optimizations, in particular for matrix multiplications, so it will speed up training. Moreover, if you use Batch Normalization (see Chapter 11), the batch size should not be too small (in general no less than 20).

• We discussed the choice of the activation function earlier in this chapter: in general, the ReLU activation function will be a good default for all hidden layers. For the output layer, it really depends on your task.

• In most cases, the number of training iterations does not actually need to be tweaked: just use early stopping instead.
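The divide-by-3 procedure for the learning rate from the first tip can be sketched on a toy problem. The example below runs plain Gradient Descent on the hypothetical loss 0.5 * CURVATURE * w**2, for which training diverges above a learning rate of 2 / CURVATURE = 0.2. The loss function, the starting value of 10, and the divergence test are all assumptions chosen so the example stays self-contained; with a real model you would watch the training loss instead.

```python
import math

CURVATURE = 10.0  # hypothetical toy loss: loss(w) = 0.5 * CURVATURE * w**2


def diverges(learning_rate, n_steps=50, w0=1.0):
    # Run a few Gradient Descent steps and report whether w moved away from the optimum (w = 0).
    w = w0
    for _ in range(n_steps):
        gradient = CURVATURE * w
        w -= learning_rate * gradient
        if not math.isfinite(w):
            return True
    return abs(w) > abs(w0)


def find_learning_rate(start=10.0, factor=3.0, max_tries=20):
    # Start with a value that makes training diverge, then divide by `factor` until it no longer does.
    learning_rate = start
    for _ in range(max_tries):
        if not diverges(learning_rate):
            return learning_rate
        learning_rate /= factor
    raise RuntimeError("no non-diverging learning rate found")


learning_rate = find_learning_rate()
print(learning_rate)  # about 0.123, i.e. 10 / 3**4
```

On this toy problem the search stops at roughly 0.123, not far from 0.1, which is half of the maximum learning rate of 0.2.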

For more best practices, make sure to read Yoshua Bengio’s great 2012 paper, which presents many practical recommendations for deep networks.

This concludes this introduction to artificial neural networks and their implementation with Keras. In the next few chapters, we will discuss techniques to train very deep nets, we will see how to customize your models using TensorFlow’s lower-level API and how to load and preprocess data efficiently using the Data API, and we will dive into other popular neural network architectures: convolutional neural networks for image processing, recurrent neural networks for sequential data, autoencoders for representation learning, and generative adversarial networks to model and generate data.

In [None]:
<img src='images/qa.png'>
<img src='images/qa1.png'>

In [None]:
# Plotting the curves

import pandas as pd
import matplotlib.pyplot as plt

pd.DataFrame(history.history).plot(figsize=(8,5))
plt.grid(True)

#  Setting the vertical range (0 - 1)
plt.gca().set_ylim(0,1)
plt.show()

 =>  You can see that both the training and validation accuracy steadily increase during training, while the
      training and validation loss decrease. Good!
 
 =>  Moreover, the validation curves are quite close to the training curves, which means that there is not too much
      overfitting.
 
 =>  In this particular case, the model performed better on the validation set than on the training set at the
      beginning of training. This sometimes happens by chance (especially when the validation set is fairly small).
      However, the training set performance ends up beating the validation performance, as is generally the case
      when you train for long enough. You can tell that the model has not quite converged yet, as the validation
      loss is still going down, so you should probably continue training.
    
 =>  It’s as simple as calling the fit() method again, since Keras just continues training where it left off (you
      should be able to reach close to 89% validation accuracy).
 
 =>  If you are not satisfied with the performance of your model, you should go back and tune the model’s
      hyperparameters, for example the number of layers, the number of neurons per layer, the types of activation
      functions used for each hidden layer, the number of training epochs, and the batch size (it can be set in
      the fit() method using the batch_size argument, which defaults to 32).
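The batch_size argument determines how many gradient updates happen in each epoch. A quick sketch of that arithmetic follows; the training set size of 55,000 is an assumption made for illustration and should be replaced by the length of your own training set.

```python
import math


def steps_per_epoch(n_samples, batch_size=32):
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_samples / batch_size)


n_train = 55_000  # assumed training set size
print(steps_per_epoch(n_train))                  # 1719 with the default batch size of 32
print(steps_per_epoch(n_train, batch_size=64))   # 860
```

Doubling the batch size roughly halves the number of updates per epoch, which is one reason the batch size and the number of epochs are usually tuned together.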


In [None]:
         ############   #############   Evaluating the Trained Model   ##############   ############

 =>  The evaluate() method returns the loss value and metric values for the model in test mode. Computation is
      done in batches.

       >>> model.evaluate(x, y, batch_size=None, sample_weight=None, steps=None, callbacks=None, verbose=1,
                          workers=1, use_multiprocessing=False, return_dict=False, max_queue_size=10)

In [None]:
# Evaluation of model on Test dataset
model.evaluate(x_test, y_test)