# Training and Saving Models in TF

We don't want to retrain a neural network every time we spin up a new server. Instead, we want to load a pretrained model from a file (which could live in Amazon S3, another cloud storage service, or a blob in a database). The following code would be written in standard Python files, versioned with `git` or another version control system, and run on a powerful machine with a good GPU, or on a cluster.

In [1]:
## Simple neural network example.
## So far this should all look very familiar.
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

num_classes = 10 
image_size = 784

(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
training_data = training_images.reshape(training_images.shape[0], image_size) 
test_data = test_images.reshape(test_images.shape[0], image_size)

training_labels = to_categorical(training_labels, num_classes)
test_labels = to_categorical(test_labels, num_classes)

model = Sequential([
    Dense(units=512, activation='relu', input_shape=(image_size,)),
    Dense(units=256, activation='relu'),
    Dense(units=128, activation='relu'),
    
    Dense(units=64, activation='relu'),
    Dropout(rate=.3),
    
    Dense(units=32, activation='relu'),
    Dropout(rate=.3),
    
    Dense(units=num_classes, activation='softmax')
])

model.compile(optimizer="adam", loss='categorical_crossentropy', metrics=['accuracy'])

# Note: No validation data. In a go-to-production setting, you'd already be confident this model will generalize
# so there's no point in validating it. Instead, use all the available data to train!
model.fit(training_data, training_labels, batch_size=128, epochs=20, verbose=True) 

# You can save the file as an .h5, which is specific to the Keras frontend for TF
model.save('save_files/mnist-model.h5', save_format='h5')

# You can also save the file in a tensorflow format that is slightly more generic
model.save('save_files/mnist-model-generic', save_format='tf')


2023-02-03 16:20:15.347822: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/20


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20




INFO:tensorflow:Assets written to: save_files/mnist-model-generic/assets




## Loading Models

The result of your training on the GPU is a file. Part of your service deployment is now fetching the latest version of that file and putting it in the right place. Part of your server or application code now has to load the saved model into its memory and run it.
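
As a rough sketch of the "fetch the latest version" step, the helper below picks the most recently modified model file in a local directory. The function name `latest_model_path` and the file names `mnist-model-v1.h5` and `mnist-model-v2.h5` are hypothetical, and real deployments often pin an explicit version instead of relying on modification time.

```python
import os
import tempfile
from pathlib import Path


def latest_model_path(directory, pattern="*.h5"):
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Demo with empty placeholder files (hypothetical names) in a temporary directory.
with tempfile.TemporaryDirectory() as model_dir:
    older = Path(model_dir) / "mnist-model-v1.h5"
    newer = Path(model_dir) / "mnist-model-v2.h5"
    older.touch()
    newer.touch()
    # Set modification times explicitly so the ordering is unambiguous.
    os.utime(older, (1_000, 1_000))
    os.utime(newer, (2_000, 2_000))
    chosen_name = latest_model_path(model_dir).name
    missing = latest_model_path(model_dir, pattern="*.onnx")
```

The path returned by a helper like this is what the server would then hand to `load_model`.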

This **does require** a significant degree of integration. Specifically, your server code now has to be in Python and must depend on Keras. In some cases this is not a problem. In others, it might require standing up a standalone API server in Python and having your (say) Ruby on Rails webserver make web requests to that Python server, which runs the model and returns the predictions.
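
To make the standalone API server idea concrete, here is a minimal sketch that uses only Python's standard library. The JSON shape (`images` in, `predictions` out), the names `predict_digits`, `handle_payload`, `PredictionHandler`, and `serve`, and port 8000 are all assumptions for illustration. The model call is replaced with a stand-in so the sketch runs without TensorFlow, and there is no error handling or authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_digits(pixel_rows):
    """Hypothetical stand-in for the model call.

    Real code would run something like
    `np.argmax(model.predict(np.array(pixel_rows)), axis=1).tolist()`.
    """
    return [0 for _ in pixel_rows]


def handle_payload(body):
    """Parse a JSON request body and return a JSON response body."""
    request = json.loads(body)
    predictions = predict_digits(request["images"])
    return json.dumps({"predictions": predictions})


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client said it sent.
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Block forever, answering POST requests with predictions."""
    HTTPServer(("", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the server. The Rails application (or any other client) would then POST a JSON body such as `{"images": [[...784 pixel values...]]}` and read the `predictions` list from the response.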

In [2]:
# Loading models from save files is pretty easy. 
from tensorflow.keras.models import load_model
import numpy as np

trained_loaded_model = load_model('save_files/mnist-model.h5')
tf_trained_loaded_model = load_model('save_files/mnist-model-generic')

# Loss, Accuracy
# They'll be the same, since it's the same model being restored from two different formats.
a = trained_loaded_model.evaluate(test_data, test_labels, verbose=False)
print(a)

b = tf_trained_loaded_model.evaluate(test_data, test_labels, verbose=False)
print(b)

[0.14336198568344116, 0.975600004196167]
[0.14336198568344116, 0.975600004196167]
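
Beyond `evaluate`, a server mostly needs to run the model on new inputs. The sketch below shows the two pieces of glue around that call: flattening images to the shape used in training, and turning softmax rows into digit labels. The helper names `prepare_batch` and `decode_predictions` are hypothetical, and the probabilities are hand-written stand-ins rather than real model output, so the sketch runs without TensorFlow.

```python
import numpy as np

IMAGE_SIZE = 784  # 28 * 28, matching `image_size` in the training code


def prepare_batch(images):
    """Flatten (n, 28, 28) images to (n, 784), the shape the model was trained on."""
    images = np.asarray(images)
    return images.reshape(images.shape[0], IMAGE_SIZE)


def decode_predictions(probabilities):
    """Turn (n, 10) softmax rows into n integer digit labels."""
    return np.argmax(probabilities, axis=1)


# Two blank stand-in images; real code would pass actual MNIST-style images.
batch = prepare_batch(np.zeros((2, 28, 28)))

# With a real model: probabilities = trained_loaded_model.predict(batch)
# Here a hand-written stand-in keeps the sketch runnable.
probabilities = np.array([
    [0.0, 0.0, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.1, 0.1],
])
digits = decode_predictions(probabilities)
```

Whatever preprocessing the training code applied has to be repeated exactly at prediction time, which is why the reshape mirrors the one used on `training_images`.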


## Create Checkpoints While Training

Your code or computer could crash for any number of reasons at any time. If you've been training for 10 hours and the server goes down before the results have been persisted to disk, you're going to be very sad. Instead of training with `.fit` and `epochs=999999`, we want to make sure the model is saved periodically.

Keras provides a helpful callback class that can automatically persist the model during training, based on the results so far. For example, this callback makes it easy to checkpoint the model every time validation accuracy improves, rather than after a fixed number of epochs. It can also be configured to save only the weights; see the [ModelCheckpoint Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint).

In [5]:
from tensorflow.keras.callbacks import ModelCheckpoint

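# This template uses the same placeholder syntax as Python's str.format (and f-strings).
# Keras fills in `epoch` and the logged metrics (such as `val_loss`) when it saves.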
filename_format = 'save_files/model-checkpoint.{epoch:02d}-{val_loss:.2f}.h5'

model_checkpointer = ModelCheckpoint(
    filename_format,
    monitor='val_accuracy', 
    verbose=1, 
    save_best_only=True,     # If True, a checkpoint is written only when val_accuracy improves on the best value so far.
    save_weights_only=False, # If True the saved files will be the weights only, not the whole model.
    mode='auto', 
    save_freq='epoch' # Check at the end of every epoch. (`period` is deprecated; an integer `save_freq` counts batches, not epochs.)
)

fresh_model = Sequential([
    Dense(units=512, activation='relu', input_shape=(image_size,)),
    Dense(units=256, activation='relu'),
    Dense(units=128, activation='relu'),
    
    Dense(units=64, activation='relu'),
    Dropout(rate=.3),
    
    Dense(units=32, activation='relu'),
    Dropout(rate=.3),
    
    Dense(units=num_classes, activation='softmax')
])

fresh_model.compile(optimizer="adam", loss='categorical_crossentropy', metrics=['accuracy'])
fresh_model.fit(
    training_data, 
    training_labels, 
    batch_size=128, 
    epochs=30, 
    verbose=False, 
    validation_split=.1,
    callbacks=[model_checkpointer] # Here's our checkpointer!
)






Epoch 1: val_accuracy improved from -inf to 0.92517, saving model to save_files/model-checkpoint.01-0.29.h5

Epoch 2: val_accuracy improved from 0.92517 to 0.95417, saving model to save_files/model-checkpoint.02-0.19.h5

Epoch 3: val_accuracy improved from 0.95417 to 0.95767, saving model to save_files/model-checkpoint.03-0.18.h5

Epoch 4: val_accuracy improved from 0.95767 to 0.96533, saving model to save_files/model-checkpoint.04-0.15.h5

Epoch 5: val_accuracy improved from 0.96533 to 0.97083, saving model to save_files/model-checkpoint.05-0.13.h5

Epoch 6: val_accuracy did not improve from 0.97083

Epoch 7: val_accuracy did not improve from 0.97083

Epoch 8: val_accuracy improved from 0.97083 to 0.97117, saving model to save_files/model-checkpoint.08-0.14.h5

Epoch 9: val_accuracy improved from 0.97117 to 0.97450, saving model to save_files/model-checkpoint.09-0.13.h5

Epoch 10: val_accuracy improved from 0.97450 to 0.97500, saving model to save_files/model-checkpoint.10-0.13.h5

<keras.callbacks.History at 0x7fab41e57190>
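
After a crash, the checkpoint files are all that is left, so it helps to be able to find the newest one. The sketch below shows how the filename template expands and how to parse the resulting names back into epoch numbers. The helper names `checkpoint_name` and `latest_checkpoint` and the unrounded loss values are hypothetical, and the regex assumes the exact template used above.

```python
import re

# Same template as `filename_format` in the cell above.
FILENAME_FORMAT = 'save_files/model-checkpoint.{epoch:02d}-{val_loss:.2f}.h5'

# Regex mirroring the template: zero-padded epoch, then a decimal loss.
CHECKPOINT_PATTERN = re.compile(r'model-checkpoint\.(\d+)-(\d+\.\d+)\.h5$')


def checkpoint_name(epoch, val_loss):
    """Expand the template the way str.format does."""
    return FILENAME_FORMAT.format(epoch=epoch, val_loss=val_loss)


def latest_checkpoint(filenames):
    """Return (epoch, filename) for the highest-epoch checkpoint, or None."""
    parsed = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            parsed.append((int(match.group(1)), name))
    return max(parsed) if parsed else None


# Hypothetical unrounded losses; the template rounds them to two decimals.
names = [
    checkpoint_name(1, 0.2931),
    checkpoint_name(9, 0.1312),
    checkpoint_name(10, 0.1349),
]
latest = latest_checkpoint(names)
```

Because `save_best_only=True` only writes a file when `val_accuracy` improves, the highest-epoch checkpoint is also the best one seen so far. One way to resume would be to pass that filename to `load_model` and continue with `fit(..., initial_epoch=latest[0])`, so that epoch numbering picks up where it left off.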