# Exercise 10: Neural networks with Keras II

## GRA 4160

Work through the examples below and experiment with the code. Try to understand what each example does, then change the code and observe how the results change.

1. Example of using different weight initialisations

2. Example of using different optimisers

# 1. Example of using different weight initialisations

In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load and preprocess the data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape((60000, 28 * 28))
x_train = x_train.astype("float32") / 255

x_test = x_test.reshape((10000, 28 * 28))
x_test = x_test.astype("float32") / 255

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

In [None]:
from tensorflow.keras import layers, models

# Define a function to create the model with the specified weight initialization
def create_model(initializer):
    model = models.Sequential()
    model.add(layers.Dense(128, activation="relu", kernel_initializer=initializer, input_shape=(28 * 28,)))
    model.add(layers.Dense(64, activation="relu", kernel_initializer=initializer))
    model.add(layers.Dense(10, activation="softmax", kernel_initializer=initializer))

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

TensorFlow provides several weight initializers in the tf.keras.initializers module.

- Zeros: Initializes the weights with all zeros.
- Ones: Initializes the weights with all ones.
- Constant: Initializes the weights with a constant value.
- RandomNormal: Initializes the weights with a normal distribution.
- RandomUniform: Initializes the weights with a uniform distribution.
- TruncatedNormal: Initializes the weights with a truncated normal distribution, where any value beyond two standard deviations from the mean is discarded and resampled.
- VarianceScaling: Initializes the weights from a distribution whose variance is scaled by the number of input units, the number of output units, or their average, depending on the mode argument.
- GlorotNormal (Xavier normal): Initializes the weights with a truncated normal distribution centered on 0 with standard deviation sqrt(2 / (input_units + output_units)).
- GlorotUniform (Xavier uniform): Initializes the weights with a uniform distribution within the range [-limit, limit], where limit = sqrt(6 / (input_units + output_units)).
- LecunNormal: Initializes the weights with a truncated normal distribution centered on 0 with standard deviation sqrt(1 / input_units).
- LecunUniform: Initializes the weights with a uniform distribution within the range [-limit, limit], where limit = sqrt(3 / input_units).
- HeNormal: Initializes the weights with a truncated normal distribution centered on 0 with standard deviation sqrt(2 / input_units).
- HeUniform: Initializes the weights with a uniform distribution within the range [-limit, limit], where limit = sqrt(6 / input_units).
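The scaling formulas in the list above can be checked by hand. The sketch below uses only the Python standard library, not Keras itself. The helper names are hypothetical, and the layer size (784 inputs, 128 outputs) is borrowed from the first Dense layer of the model above. It also draws samples to illustrate that a uniform distribution on [-limit, limit] has variance limit^2 / 3, so GlorotUniform and GlorotNormal target the same variance.

```python
import math
import random

# Hypothetical helpers that restate the formulas from the list above.
def glorot_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))

def glorot_normal_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def lecun_normal_std(fan_in):
    return math.sqrt(1.0 / fan_in)

def he_normal_std(fan_in):
    return math.sqrt(2.0 / fan_in)

def he_uniform_limit(fan_in):
    return math.sqrt(6.0 / fan_in)

def empirical_uniform_variance(limit, n=100_000, seed=0):
    """Estimate the variance of U(-limit, limit) from n random samples."""
    rng = random.Random(seed)
    samples = [rng.uniform(-limit, limit) for _ in range(n)]
    mean = sum(samples) / n
    return sum((s - mean) ** 2 for s in samples) / n

# Assumed layer size: 28 * 28 inputs feeding 128 units.
fan_in, fan_out = 28 * 28, 128

limit = glorot_uniform_limit(fan_in, fan_out)
print(f"Glorot uniform limit:   {limit:.5f}")
print(f"Glorot normal stddev:   {glorot_normal_std(fan_in, fan_out):.5f}")
print(f"LeCun normal stddev:    {lecun_normal_std(fan_in):.5f}")
print(f"He normal stddev:       {he_normal_std(fan_in):.5f}")
print(f"He uniform limit:       {he_uniform_limit(fan_in):.5f}")

# A uniform distribution on [-limit, limit] has variance limit^2 / 3.
print(f"Theoretical variance:   {limit ** 2 / 3:.6f}")
print(f"Empirical variance:     {empirical_uniform_variance(limit):.6f}")
```

Try changing fan_in and fan_out to the sizes of the other layers and see how the scale of the initial weights shrinks as layers get wider.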

In [None]:
# Create models with different weight initializations
initializers = {
    "Random Normal": tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None),
    "Glorot Uniform": tf.keras.initializers.GlorotUniform(seed=None),
    "He Normal": tf.keras.initializers.HeNormal(seed=None),
    "LeCun Normal": tf.keras.initializers.LecunNormal(seed=None),
}

for name, initializer in initializers.items():
    model = create_model(initializer)
    print(f"Training model with {name} initialization")
    history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2, verbose=0)
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test accuracy with {name} initialization: {test_acc:.4f}\n")

# 2. Example of using different optimisers

- Stochastic Gradient Descent (SGD): A basic optimization algorithm that repeatedly moves the weights against the gradient of the loss on a mini-batch, using the same learning rate for every weight. It supports optional momentum and Nesterov momentum for improved convergence.
- Adaptive Moment Estimation (Adam): A popular optimization algorithm that adapts the step size for each weight individually using running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. It usually converges quickly and requires less tuning of the learning rate.
- RMSprop: An adaptive learning rate optimization algorithm that divides the learning rate for each weight by the square root of a running average of that weight's recent squared gradients. It is well-suited for problems with non-stationary objectives, such as online learning and noisy gradients.
- Adagrad: An adaptive learning rate optimization algorithm that divides the learning rate for each weight by the square root of the sum of that weight's squared past gradients. It is particularly useful for sparse data and often achieves good results in natural language processing and recommendation systems.
- Adadelta: An extension of Adagrad that counters its monotonically decreasing learning rate by replacing the full sum of squared gradients with a decaying average, and by scaling each update with the ratio of the running RMS of past updates to the running RMS of gradients.
- Adamax: A variant of Adam that replaces the second-moment estimate with an exponentially weighted infinity norm (a running maximum of gradient magnitudes), which can make the step sizes more stable for some models.
- Nadam: A combination of Adam and Nesterov momentum, which incorporates Nesterov's accelerated gradient into the Adam optimizer.
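To make the update rules above concrete, here is a minimal sketch of SGD (with optional momentum), RMSprop and Adam applied to a toy one-dimensional loss f(w) = (w - 3)^2. It uses plain Python rather than Keras. The function names, the toy loss and the hyperparameter values are all assumptions chosen for illustration, and the real Keras optimizers include options that are not shown here.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, which is minimal at w = 3."""
    return 2.0 * (w - 3.0)

def run_sgd(w, lr=0.1, momentum=0.0, steps=2000):
    # Same learning rate for every step; momentum accumulates past updates.
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w

def run_rmsprop(w, lr=0.01, rho=0.9, eps=1e-7, steps=2000):
    # Divide the step by the root of a running average of squared gradients.
    avg_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1.0 - rho) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7, steps=2000):
    # Running first moment (m) and second moment (v), both bias-corrected.
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

results = {
    "SGD": run_sgd(0.0),
    "SGD + momentum": run_sgd(0.0, momentum=0.9),
    "RMSprop": run_rmsprop(0.0),
    "Adam": run_adam(0.0),
}

for name, w_final in results.items():
    print(f"{name}: final w = {w_final:.4f}")
```

Try reducing steps or changing the learning rates to see how quickly each rule approaches w = 3. With a constant learning rate, RMSprop tends to hover near the minimum rather than settle on it exactly, because its steps are normalized by the recent gradient size.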

In [None]:
from tensorflow.keras import layers, models
from tensorflow.keras.utils import to_categorical

def create_simple_nn():
    model = models.Sequential()
    model.add(layers.Flatten(input_shape=(28, 28)))
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    return model

In [None]:
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images / 255.0
test_images = test_images / 255.0

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

In [None]:
# Stochastic Gradient Descent (SGD):
model_sgd = create_simple_nn()
model_sgd.compile(optimizer='sgd',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

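# Note: the test set is passed as validation data, so the validation curves
# plotted later show performance on the test set.
history_sgd = model_sgd.fit(train_images, train_labels, epochs=10, batch_size=32,
                            validation_data=(test_images, test_labels))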

In [None]:
# Adam:
model_adam = create_simple_nn()
model_adam.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

history_adam = model_adam.fit(train_images, train_labels, epochs=10, batch_size=32,
                              validation_data=(test_images, test_labels))

In [None]:
# RMSprop:
model_rmsprop = create_simple_nn()
model_rmsprop.compile(optimizer='rmsprop',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])

history_rmsprop = model_rmsprop.fit(train_images, train_labels, epochs=10, batch_size=32,
                                    validation_data=(test_images, test_labels))

In [None]:
import matplotlib.pyplot as plt

def plot_history(optimizer_name, history):
    plt.figure(figsize=(12, 4))

    # Plot accuracy
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training')
    plt.plot(history.history['val_accuracy'], label='Validation')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.title(f'{optimizer_name} - Accuracy')

    # Plot loss
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training')
    plt.plot(history.history['val_loss'], label='Validation')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.title(f'{optimizer_name} - Loss')

    plt.show()

In [None]:
plot_history('SGD', history_sgd)

In [None]:
plot_history('Adam', history_adam)

In [None]:
plot_history('RMSprop', history_rmsprop)