#1. Deep Learning.

#a. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

"""Here's an example of how you can build a Deep Neural Network (DNN) with five hidden layers, each consisting of
   100 neurons. We'll use the He initialization method and the ELU activation function.
   
   import tensorflow as tf

# Define the DNN architecture
model = tf.keras.models.Sequential()

# First hidden layer (its input_shape argument declares the number of input features)
model.add(tf.keras.layers.Dense(units=100, activation='elu', kernel_initializer='he_normal', input_shape=(input_shape,)))

# Remaining four hidden layers
for _ in range(4):
    model.add(tf.keras.layers.Dense(units=100, activation='elu', kernel_initializer='he_normal'))

# Output layer
model.add(tf.keras.layers.Dense(units=output_units, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print the summary of the model
model.summary()

  In this example, you need to replace input_shape with the number of input features (for example, 784 for flattened
  28x28 images) and output_units with the number of units/neurons in your output layer.

  The Dense layers are used to create fully connected layers. The kernel_initializer='he_normal' argument initializes 
  the weights using the He initialization method, which is a popular initialization technique for deep neural networks. 
  The activation='elu' argument sets the activation function of each layer to the Exponential Linear Unit (ELU) activation
  function.
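
  As a rough illustration of the two ingredients named above, the following standalone sketch (plain Python, independent
  of Keras) computes the ELU function and the standard deviation that He initialization targets, sqrt(2 / fan_in). The
  helper names are hypothetical, and the sketch samples from an ordinary Gaussian, whereas Keras's he_normal uses a
  truncated normal distribution.

```python
import math
import random

def elu(x, alpha=1.0):
    # ELU is the identity for positive inputs and alpha * (exp(x) - 1) otherwise
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def he_normal_std(fan_in):
    # He initialization scales the weights so that their variance is 2 / fan_in
    return math.sqrt(2.0 / fan_in)

def he_normal_weights(fan_in, fan_out, seed=0):
    # Hypothetical helper: a fan_in x fan_out matrix sampled from N(0, 2 / fan_in)
    rng = random.Random(seed)
    std = he_normal_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = he_normal_weights(fan_in=100, fan_out=100)
print(round(he_normal_std(100), 4))
print(elu(2.0), round(elu(-1.0), 4))
```

  For a layer fed by 100 neurons, the target standard deviation is about 0.14, which is why deeper stacks of
  100-unit layers start with fairly small weights.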

  The model is compiled with the Adam optimizer, categorical cross-entropy loss function (assuming multi-class 
  classification), and accuracy as the evaluation metric.

  Finally, calling model.summary() will provide a summary of the model's architecture, showing the number of 
  parameters in each layer.
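
  To make the parameter counts reported by model.summary() less opaque, here is a small sketch that reproduces them by
  hand. The layer sizes are assumptions chosen for illustration (784 inputs and 5 output units, as in the MNIST
  exercise that follows); dense_param_counts is a hypothetical helper, not a Keras function.

```python
def dense_param_counts(layer_sizes):
    # Each Dense layer has (inputs * units) weights plus one bias per unit
    return [n_in * n_out + n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Assumed sizes: 784 inputs, five hidden layers of 100 units, 5 output units
layer_sizes = [784, 100, 100, 100, 100, 100, 5]
counts = dense_param_counts(layer_sizes)

print(counts)       # parameters per layer
print(sum(counts))  # total number of trainable parameters
```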

  Feel free to modify the code based on your specific requirements and data."""

#b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use 
transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, 
and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.

"""Here's an example of how you can train a Deep Neural Network (DNN) with Adam optimization and early stopping on 
   the MNIST dataset, specifically on digits 0 to 4. The model will have a softmax output layer with five neurons.
   
   import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Filter digits 0 to 4
train_filter = y_train <= 4
test_filter = y_test <= 4
X_train, y_train = X_train[train_filter], y_train[train_filter]
X_test, y_test = X_test[test_filter], y_test[test_filter]

# Normalize input data
X_train = X_train / 255.0
X_test = X_test / 255.0

# Convert labels to one-hot encoding
y_train = tf.keras.utils.to_categorical(y_train, num_classes=5)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=5)

# Define the DNN architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(5, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define checkpoints and early stopping (both callbacks monitor val_loss by default)
checkpoint = ModelCheckpoint('mnist_dnn_checkpoint.h5', save_best_only=True)
early_stopping = EarlyStopping(patience=5, restore_best_weights=True)

# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test),
                    callbacks=[checkpoint, early_stopping])

# Save the final model
model.save('mnist_dnn_final_model.h5')


   In this example, we first load the MNIST dataset using mnist.load_data() and filter the data to only include 
   digits 0 to 4. The input data is then normalized by dividing by 255.0.

   Next, the labels are converted to one-hot encoding using tf.keras.utils.to_categorical() to match the softmax 
   output layer's requirements.

   The DNN architecture consists of five hidden layers, each with 100 neurons, using the ELU activation function
   and He initialization. The output layer has five neurons with a softmax activation function.

   The model is compiled with the Adam optimizer, categorical cross-entropy loss function, and accuracy as the
   evaluation metric.

   The code includes the ModelCheckpoint callback, which saves the model each time the validation loss improves
   (val_loss is the quantity both callbacks monitor by default). The EarlyStopping callback stops training early if
   the validation loss does not improve for five consecutive epochs, and it restores the weights of the best model.

   During training, the checkpoints are saved with the name "mnist_dnn_checkpoint.h5", and the final model is saved as
   "mnist_dnn_final_model.h5".
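
   The early-stopping rule described above can be sketched in a few lines of plain Python. This is only an
   illustration of the patience logic, not the Keras implementation; the function name and the list of validation
   losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    # Returns (epoch at which training stops, epoch with the best validation loss)
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ends at the last epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the best value is reached at epoch 2
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8, 0.6]
stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=5)
print(stopped_at, best_epoch)
```

   With restore_best_weights=True, the weights kept after stopping are those of the best epoch rather than of the
   last one.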

   Feel free to modify the code based on your specific requirements and preferences."""

#c. Tune the hyperparameters using cross-validation and see what precision you can achieve.

"""To tune the hyperparameters of the Deep Neural Network (DNN) and evaluate its precision using cross-validation, 
   you can use Scikit-learn's GridSearchCV or RandomizedSearchCV classes. Here's an example of how you can perform 
   hyperparameter tuning and evaluate precision using cross-validation:
   
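   import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
# Note: this wrapper module was deprecated and has been removed from recent TensorFlow releases;
# on those versions, the separate SciKeras package provides a replacement KerasClassifier
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier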

# Define a function to create the DNN model
def create_model(learning_rate=0.001, num_neurons=100, activation='elu', kernel_initializer='he_normal'):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(num_neurons, activation=activation, kernel_initializer=kernel_initializer),
        tf.keras.layers.Dense(num_neurons, activation=activation, kernel_initializer=kernel_initializer),
        tf.keras.layers.Dense(num_neurons, activation=activation, kernel_initializer=kernel_initializer),
        tf.keras.layers.Dense(num_neurons, activation=activation, kernel_initializer=kernel_initializer),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

# Create the KerasClassifier for Scikit-learn compatibility
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define the hyperparameter grid for search
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'num_neurons': [50, 100, 150],
    'activation': ['elu', 'relu'],
    'kernel_initializer': ['he_normal', 'glorot_uniform']
}

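# Perform grid search with cross-validation
# (scikit-learn's precision scorer expects integer class labels, so undo the one-hot encoding first)
y_train_labels = np.argmax(y_train, axis=1)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='precision_macro')
grid_result = grid_search.fit(X_train, y_train_labels)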

# Print the best precision and hyperparameters
print("Best Precision: {:.4f}".format(grid_result.best_score_))
print("Best Hyperparameters:", grid_result.best_params_)
   
   
   In this example, we define a function create_model() that builds the DNN model with the specified hyperparameters. 
   The hyperparameters to be tuned include the learning rate, number of neurons, activation function, and kernel initializer.

   We then create a KerasClassifier object from the function create_model() to make it compatible with Scikit-learn's 
   cross-validation utilities.

   The param_grid dictionary defines the grid of hyperparameters to search over. You can modify this grid to include
   more hyperparameters or adjust the ranges.
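
   Before launching the search, it can help to estimate its cost. The sketch below enumerates the same grid with
   itertools.product (standard library only) and counts the model fits implied by 3-fold cross-validation; the
   variable names are illustrative.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'num_neurons': [50, 100, 150],
    'activation': ['elu', 'relu'],
    'kernel_initializer': ['he_normal', 'glorot_uniform'],
}

# Every combination of one value per hyperparameter is a candidate model
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 3
print(len(combinations))             # number of candidate configurations
print(len(combinations) * cv_folds)  # model fits during cross-validation
```

   Each of these fits trains a full network, so a randomized search over fewer candidates may be a more practical
   starting point.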

   The GridSearchCV class is used to perform the grid search with cross-validation. We specify the number of cross-validation
   folds (cv=3) and the scoring metric (scoring='precision_macro') to evaluate the precision. You can adjust these parameters 
   as needed.

   After performing the grid search, the best precision score and corresponding hyperparameters are printed.
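
   For reference, macro-averaged precision can be computed by hand as follows. This is a minimal sketch of the metric
   selected by scoring='precision_macro', written without scikit-learn; the function name and the toy labels are
   assumptions made for the example.

```python
def macro_precision(y_true, y_pred, classes):
    precisions = []
    for c in classes:
        # True labels of all the instances that were predicted as class c
        predicted_as_c = [t for t, p in zip(y_true, y_pred) if p == c]
        if not predicted_as_c:
            precisions.append(0.0)  # convention when a class is never predicted
            continue
        true_positives = sum(1 for t in predicted_as_c if t == c)
        precisions.append(true_positives / len(predicted_as_c))
    # Macro averaging gives every class the same weight
    return sum(precisions) / len(precisions)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_precision(y_true, y_pred, classes=[0, 1, 2]))
```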

   Feel free to modify the code to include additional hyperparameters, adjust the ranges, or change the scoring metric
   based on your specific requirements."""

#d. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it
produce a better model?

"""To compare the learning curves and assess the impact of Batch Normalization on convergence speed and model performance,
   we can modify the Deep Neural Network (DNN) by adding Batch Normalization layers. Here's an example:
   
   import tensorflow as tf

# Define the DNN architecture with Batch Normalization
model_bn = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(5, activation='softmax')
])

# Compile the model
model_bn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with Batch Normalization
history_bn = model_bn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Define the DNN architecture without Batch Normalization
model_no_bn = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(5, activation='softmax')
])

# Compile the model
model_no_bn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model without Batch Normalization
history_no_bn = model_no_bn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))
  
  
  In this example, we create two models: model_bn with Batch Normalization layers and model_no_bn without Batch 
  Normalization layers.

  For model_bn, Batch Normalization layers are added after each dense layer. Batch Normalization helps normalize the
  inputs to each layer, which can accelerate training and improve generalization.
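
  The normalization step itself is simple. The sketch below applies it to one feature across a small batch in plain
  Python; gamma and beta stand for the learned scale and shift, and epsilon=1e-3 matches the Keras default. It only
  covers training-time behaviour (at inference time, Keras uses moving averages instead of batch statistics).

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    # Standardize the batch, then apply the learned scale (gamma) and shift (beta)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta for v in values]

activations = [2.0, 4.0, 6.0, 8.0]  # one feature across a batch of four instances
normalized = batch_normalize(activations)
print([round(v, 3) for v in normalized])
```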

  Both models are trained using the same training data, epochs, batch size, and validation data. The training process
  and learning curves are captured in the history_bn and history_no_bn variables, respectively.

  To compare the learning curves, you can plot the training and validation loss and accuracy for both models:
  
  import matplotlib.pyplot as plt

# Plot the learning curves
def plot_learning_curves(history, title):
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title(title + ' - Loss')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.title(title + ' - Accuracy')
    plt.legend()

    plt.tight_layout()
    plt.show()

# Plot learning curves for models with and without Batch Normalization
plot_learning_curves(history_bn, 'Model with Batch Normalization')
plot_learning_curves(history_no_bn, 'Model without Batch Normalization')


   This code generates one figure per model, each with two subplots: one for loss and another for accuracy.

  By comparing the learning curves, you can assess if the model with Batch Normalization converges faster and produces
  a better model in terms of training and validation performance.

  Feel free to modify the code to adjust the model architecture or training parameters according to your need."""

#e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

"""To address the issue of overfitting in the model, we can add dropout regularization to every layer of the Deep Neural 
   Network (DNN). Dropout randomly sets a fraction of input units to 0 at each update during training, which helps prevent 
   overfitting. Here's an example of how you can add dropout to each layer:
   
   import tensorflow as tf

# Define the DNN architecture with Dropout regularization
model_dropout = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation='softmax')
])

# Compile the model
model_dropout.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with Dropout regularization
history_dropout = model_dropout.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))


  In this example, we create a new model called model_dropout by adding Dropout layers after each dense layer. 
  The Dropout layers have a dropout rate of 0.2, meaning 20% of the input units will be randomly set to 0 during training.
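
  As an illustration of what a rate of 0.2 means in practice, here is a minimal plain-Python sketch of inverted
  dropout, the variant Keras applies during training: dropped units become 0 and the survivors are scaled by
  1 / (1 - rate) so that the expected sum is unchanged. The function name and the toy inputs are assumptions.

```python
import random

def dropout(inputs, rate, rng):
    # Drop each unit with probability `rate`; scale the survivors by 1 / (1 - rate)
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() >= rate else 0.0 for x in inputs]

rng = random.Random(42)
inputs = [1.0] * 10000
outputs = dropout(inputs, rate=0.2, rng=rng)

dropped_fraction = outputs.count(0.0) / len(outputs)
mean_output = sum(outputs) / len(outputs)
print(round(dropped_fraction, 3), round(mean_output, 3))
```

  At inference time no units are dropped and no scaling is applied.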

  The rest of the code remains the same as in the previous examples, compiling the model with Adam optimizer and 
  training the model with dropout regularization.

  To compare the learning curves between the model with dropout and the previous models, you can reuse the
  plot_learning_curves() function defined in the previous exercise:
  
  import matplotlib.pyplot as plt

# Plot learning curves for the model with Dropout
plot_learning_curves(history_dropout, 'Model with Dropout')


   By plotting the learning curves, you can assess if the addition of dropout regularization helps in reducing overfitting
   and improving the model's generalization performance.

  Feel free to modify the dropout rate or other model parameters to experiment and find the optimal settings for your 
  specific task."""

#2. Transfer learning.

#a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the 
softmax output layer with a new one.

"""To create a new Deep Neural Network (DNN) that reuses the pretrained hidden layers from a previous model, freezes them, 
   and replaces the softmax output layer with a new one, you can follow these steps:
   
   import tensorflow as tf

# Load the pretrained model
pretrained_model = tf.keras.models.load_model('mnist_dnn_final_model.h5')

# Freeze the pretrained hidden layers
for layer in pretrained_model.layers:
    layer.trainable = False

# Create a new DNN by reusing the pretrained hidden layers
new_model = tf.keras.models.Sequential()
for layer in pretrained_model.layers[:-1]:  # Exclude the output layer
    new_model.add(layer)

# Add a new softmax output layer for the new task
new_model.add(tf.keras.layers.Dense(10, activation='softmax'))  # Assuming 10 classes for the new task

# Compile the new model (the sparse loss accepts integer digit labels directly, without one-hot encoding)
new_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  
  
  In this example, we first load the pretrained model using tf.keras.models.load_model() with the filename of the 
  saved model file (mnist_dnn_final_model.h5 in this case).

  Next, we iterate over each layer of the pretrained model and set their trainable attribute to False. This freezes
  the weights of the pretrained hidden layers, preventing them from being updated during training.
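
  What freezing means for the optimizer can be pictured with a toy gradient step: frozen layers are simply skipped
  when the weights are updated. The dictionaries below are a hypothetical stand-in for Keras layers, not the actual
  Keras data structures.

```python
def sgd_step(layers, gradients, learning_rate=0.1):
    # Frozen layers keep their weights; trainable layers take a gradient step
    for layer, grads in zip(layers, gradients):
        if layer["trainable"]:
            layer["weights"] = [w - learning_rate * g for w, g in zip(layer["weights"], grads)]
    return layers

layers = [
    {"name": "hidden", "weights": [0.5, -0.5], "trainable": False},  # pretrained and frozen
    {"name": "output", "weights": [1.0, 2.0], "trainable": True},    # new layer, trained
]
gradients = [[1.0, 1.0], [1.0, -1.0]]

sgd_step(layers, gradients)
print(layers[0]["weights"], layers[1]["weights"])
```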

  Then, we create a new model called new_model and add all the layers from the pretrained model except the last layer
  (output layer) using a loop. This effectively reuses the pretrained hidden layers.

  Finally, we add a new softmax output layer with the appropriate number of units (assuming 10 classes for the new task)
  and compile the new model with the desired optimizer, loss function, and metrics."""

#b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small
number of examples, can you achieve high precision?

"""Training a new Deep Neural Network (DNN) using transfer learning on a small number of examples can be challenging, 
   but it's still possible to achieve high precision. However, it's important to note that the performance may vary 
   depending on the complexity of the task and the amount of available data. Here's an example of how you can train 
   the new DNN on digits 5 to 9 using only 100 images per digit:
   
   import time
import numpy as np

# Load the data for digits 5 to 9
X_train_transfer = ...
y_train_transfer = ...
X_test_transfer = ...
y_test_transfer = ...

# Select 100 images per digit for training
num_examples_per_digit = 100
X_train_transfer_subset = []
y_train_transfer_subset = []

for digit in range(5, 10):
    digit_indices = np.where(y_train_transfer == digit)[0]
    np.random.shuffle(digit_indices)
    selected_indices = digit_indices[:num_examples_per_digit]
    X_train_transfer_subset.append(X_train_transfer[selected_indices])
    y_train_transfer_subset.append(y_train_transfer[selected_indices])

X_train_transfer_subset = np.concatenate(X_train_transfer_subset)
y_train_transfer_subset = np.concatenate(y_train_transfer_subset)

# Train the new DNN on the subset of data
start_time = time.time()
new_model.fit(X_train_transfer_subset, y_train_transfer_subset, epochs=50, batch_size=32, validation_data=(X_test_transfer,
y_test_transfer))
end_time = time.time()
training_time = end_time - start_time

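# Evaluate the model: with metrics=['accuracy'], evaluate() returns [loss, accuracy],
# so the "precision" reported below is the classification accuracy on the test set
precision = new_model.evaluate(X_test_transfer, y_test_transfer)[1]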

print("Training time: {:.2f} seconds".format(training_time))
print("Precision: {:.4f}".format(precision))

  In this example, you need to replace X_train_transfer, y_train_transfer, X_test_transfer, and y_test_transfer with the 
  corresponding data for digits 5 to 9. Ensure that the data is appropriately preprocessed and split into training and 
  testing sets.

  The code randomly selects 100 images per digit from the training set and creates a subset for training. This ensures
  that you have a balanced subset of data with 100 examples per digit.

  The new_model is then trained on this subset using the fit function. Adjust the number of epochs and batch size as needed.

  After training, the model is evaluated on the testing set, and the precision is calculated and printed.

  Please note that achieving high precision with only 100 examples per digit may be challenging, as the model might not 
  have sufficient data to generalize well. Increasing the number of training examples or employing techniques like data 
  augmentation can help improve the performance."""

#c. Try caching the frozen layers, and train the model again: how much faster is it now?

"""Caching the frozen layers can significantly speed up the training process in transfer learning, as the frozen layers' 
   outputs can be precomputed and reused for multiple epochs. Here's an example of how you can cache the frozen layers 
   and train the model again:
   
   
   import time
import numpy as np

# Load the data for digits 5 to 9
X_train_transfer = ...
y_train_transfer = ...
X_test_transfer = ...
y_test_transfer = ...

# Select 100 images per digit for training
num_examples_per_digit = 100
X_train_transfer_subset = []
y_train_transfer_subset = []

for digit in range(5, 10):
    digit_indices = np.where(y_train_transfer == digit)[0]
    np.random.shuffle(digit_indices)
    selected_indices = digit_indices[:num_examples_per_digit]
    X_train_transfer_subset.append(X_train_transfer[selected_indices])
    y_train_transfer_subset.append(y_train_transfer[selected_indices])

X_train_transfer_subset = np.concatenate(X_train_transfer_subset)
y_train_transfer_subset = np.concatenate(y_train_transfer_subset)

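# Cache the outputs of the frozen layers (every layer of new_model except its trainable output layer)
frozen_layers = tf.keras.models.Sequential(new_model.layers[:-1])
X_train_frozen = frozen_layers.predict(X_train_transfer_subset)
X_test_frozen = frozen_layers.predict(X_test_transfer)

# Wrap the output layer in its own model so that it can be trained directly on the cached outputs
output_model = tf.keras.models.Sequential([new_model.layers[-1]])
output_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the output layer using the cached frozen layer outputs
start_time = time.time()
output_model.fit(X_train_frozen, y_train_transfer_subset, epochs=50, batch_size=32,
                 validation_data=(X_test_frozen, y_test_transfer))
end_time = time.time()
training_time = end_time - start_time

# Evaluate the model (evaluate() returns [loss, accuracy], so this "precision" is the test accuracy)
precision = output_model.evaluate(X_test_frozen, y_test_transfer)[1]

print("Training time (with caching): {:.2f} seconds".format(training_time))
print("Precision: {:.4f}".format(precision))

  In this example, after selecting the subset of data for training, we precompute the outputs of the frozen layers for 
  both the training subset and the test set, using a model made of new_model's layers minus its output layer. These
  outputs (X_train_frozen and X_test_frozen) are then used as inputs to train the output layer on its own. Because
  output_model wraps the very same output layer object, new_model benefits from this training as well.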

  By caching the frozen layer outputs, we avoid recomputing them in each epoch, leading to faster training.
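
  The saving can be made concrete by counting how often the frozen part of the network has to run. The sketch below
  uses a stand-in function for the frozen layers and assumes 500 training images (100 per digit) and 50 epochs; all
  names are hypothetical.

```python
calls = {"frozen": 0}

def frozen_layers(x):
    # Stand-in for the frozen hidden layers: a fixed transformation that is never updated
    calls["frozen"] += 1
    return 2.0 * x + 1.0

def run_epochs(inputs, num_epochs, use_cache):
    calls["frozen"] = 0
    # With caching, the frozen layers process every input exactly once, up front
    cache = [frozen_layers(x) for x in inputs] if use_cache else None
    for _ in range(num_epochs):
        features = cache if use_cache else [frozen_layers(x) for x in inputs]
        # ... the trainable output layer would be updated from `features` here ...
    return calls["frozen"]

inputs = [float(i) for i in range(500)]
without_cache = run_epochs(inputs, num_epochs=50, use_cache=False)
with_cache = run_epochs(inputs, num_epochs=50, use_cache=True)
print(without_cache, with_cache)
```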

  The rest of the code remains the same, evaluating the precision of the model on the testing set and printing the training
  time and precision.

  Please note that building the cache is a one-time cost: the frozen layers still have to process every image once.
  After that, each epoch only trains on the cached outputs, so the speedup over training without caching grows with
  the number of epochs.

  Ensure that you have the necessary data and adapt the code accordingly to run the training and evaluation successfully."""

#d. Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?

"""When reusing a smaller number of hidden layers instead of all the layers in transfer learning, the performance can
   vary depending on the specific task and dataset. It's possible to achieve a higher precision in some cases, as 
   reducing the complexity of the model may help prevent overfitting or improve generalization. Here's an example of 
   reusing only four hidden layers:
   
   import tensorflow as tf

# Load the pretrained model
pretrained_model = tf.keras.models.load_model('mnist_dnn_final_model.h5')

# Freeze the reused hidden layers
# (layers[:5] is the Flatten layer followed by the first four hidden layers)
for layer in pretrained_model.layers[:5]:
    layer.trainable = False

# Create a new DNN by reusing four pretrained hidden layers
new_model = tf.keras.models.Sequential(pretrained_model.layers[:5])

# Add a new softmax output layer for the new task
new_model.add(tf.keras.layers.Dense(10, activation='softmax'))  # Assuming 10 classes for the new task

# Compile the new model (the sparse loss accepts integer digit labels directly)
new_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
  
  
  In this example, we modify the code from the previous step so that only the first four hidden layers of the 
  pretrained model are reused and frozen. We achieve this by selecting pretrained_model.layers[:5], which keeps the
  Flatten layer and the four hidden layers that follow it, and drops everything above them.

  The remaining steps, including adding the new softmax output layer and compiling the model, remain the same.

  By reusing a smaller number of hidden layers, you introduce more flexibility to the model's architecture, which
  can potentially improve its performance. However, it's important to note that the optimal configuration depends
  on the specific task and dataset, and experimentation is often required to find the best combination of layers 
  for your particular use case.

  After making these modifications, you can proceed to train and evaluate the new model using the training and testing 
  data specific to your task."""

#e. Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?

"""Unfreezing the top two hidden layers and continuing the training can allow the model to adapt and fine-tune these 
   layers to the new task, potentially leading to improved performance. Here's an example of how you can unfreeze the 
   top two hidden layers and continue training:
   
   # Unfreeze the top two hidden layers (the last entry in new_model.layers is the output layer)
for layer in new_model.layers[-3:-1]:
    layer.trainable = True

# Compile the model after unfreezing the layers
# (the sparse loss accepts integer digit labels directly)
new_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Continue training the model
history = new_model.fit(X_train_transfer, y_train_transfer, epochs=50, batch_size=32,
                        validation_data=(X_test_transfer, y_test_transfer))


    In this example, we unfreeze the top two hidden layers of the new_model by setting their trainable attribute to True. 
    This allows these layers to be updated during training.

    After unfreezing the layers, we compile the model again, which is required for a change to the trainable attribute
    to take effect.

   Finally, we continue training the model using the training and testing data (X_train_transfer, y_train_transfer,
   X_test_transfer, y_test_transfer) for additional epochs.

   By fine-tuning the top layers, the model can learn task-specific patterns and potentially improve its performance on 
   the new task.

   Feel free to adjust the number of epochs, batch size, and other training parameters according to your specific requirements.

  Please note that fine-tuning the top layers may require careful consideration, as too much modification to the pretrained
  layers can lead to overfitting or loss of previously learned representations. It's recommended to monitor the validation
  performance and consider early stopping or regularization techniques if needed."""

#3. Pretraining on an auxiliary task.

#a. In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the 
same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little 
training data. Start by building two DNNs (let’s call them DNN A and B), both similar to the one you built earlier but 
without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU
activation. Next, add one more hidden layer with 10 units on top of both DNNs. To do this, you should use
TensorFlow’s concat() function with axis=1 to concatenate the outputs of both DNNs for each instance, then feed the
result to the hidden layer. Finally, add an output layer with a single neuron using the logistic activation function.

"""To build DNN A and DNN B for comparing MNIST digit images and predicting whether they represent the same digit or not,
   you can follow these steps:
   
   
   import tensorflow as tf

# Create DNN A
dnn_a = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal', input_shape=(784,)),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
])

# Create DNN B
dnn_b = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal', input_shape=(784,)),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
])

# Concatenate outputs of DNN A and DNN B
concatenated = tf.keras.layers.Concatenate(axis=1)([dnn_a.output, dnn_b.output])

# Add an additional hidden layer
hidden_layer = tf.keras.layers.Dense(10, activation='elu', kernel_initializer='he_normal')(concatenated)

# Output layer
output = tf.keras.layers.Dense(1, activation='sigmoid')(hidden_layer)

# Create the final model
model = tf.keras.models.Model(inputs=[dnn_a.input, dnn_b.input], outputs=output)

  
  Each of DNN A and DNN B consists of five hidden layers with 100 neurons each, ELU activation, and He initialization.

  To combine the outputs of DNN A and DNN B, we use the Concatenate layer from TensorFlow, specifying axis=1 to 
  concatenate along the feature axis.
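
  Concatenating along axis=1 joins the two feature vectors of each instance end to end, leaving the batch dimension
  untouched. A minimal plain-Python sketch with made-up outputs (4 instances, 100 features from each DNN):

```python
def concat_features(outputs_a, outputs_b):
    # Join the two feature vectors of each instance end to end (axis=1)
    return [row_a + row_b for row_a, row_b in zip(outputs_a, outputs_b)]

outputs_a = [[0.1] * 100 for _ in range(4)]  # stand-in for DNN A's outputs
outputs_b = [[0.2] * 100 for _ in range(4)]  # stand-in for DNN B's outputs
merged = concat_features(outputs_a, outputs_b)

print(len(merged), len(merged[0]))  # instances, features per instance
```

  The hidden layer with 10 units therefore receives 200 input features per pair of images.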

  Next, an additional hidden layer with 10 units is added on top of the concatenated outputs using the Dense layer.

  Finally, an output layer with a single neuron and a sigmoid activation function is added to predict whether the
  two digit images represent the same digit or not.

  This configuration allows us to compare two digit images by passing them as inputs to DNN A and DNN B and then 
  combining their outputs to make the final prediction."""

#b. Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 
5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images 
picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while 
the other half should be images from different classes. For each pair, the training label should be 0 if the images 
are from the same class, or 1 if they are from different classes.

"""To generate a training batch with pairs of MNIST images from split #1, where half of the instances have images 
   from the same class and the other half have images from different classes, you can use the following function:
   
   import numpy as np

def generate_training_batch(X_train, y_train, batch_size=32):
    num_samples = batch_size // 2
    num_classes = 10
    
    # Select random indices for same-class pairs
    same_class_indices = np.random.choice(len(X_train), size=num_samples, replace=False)
    
    # Generate same-class pairs
    same_class_pairs = []
    same_class_labels = []
    
    for idx in same_class_indices:
        class_label = y_train[idx]
        class_indices = np.where(y_train == class_label)[0]
        other_idx = np.random.choice(class_indices)
        
        same_class_pairs.append([X_train[idx], X_train[other_idx]])
        same_class_labels.append(0)  # Label 0 for same-class pairs
    
    # Generate different-class pairs
    different_class_pairs = []
    different_class_labels = []
    
    for _ in range(num_samples):
        class_labels = np.random.choice(num_classes, size=2, replace=False)
        class_indices_1 = np.where(y_train == class_labels[0])[0]
        class_indices_2 = np.where(y_train == class_labels[1])[0]
        
        idx_1 = np.random.choice(class_indices_1)
        idx_2 = np.random.choice(class_indices_2)
        
        different_class_pairs.append([X_train[idx_1], X_train[idx_2]])
        different_class_labels.append(1)  # Label 1 for different-class pairs
    
    # Combine same-class and different-class pairs
    pairs = same_class_pairs + different_class_pairs
    labels = same_class_labels + different_class_labels
    
    # Shuffle the pairs and labels
    indices = np.random.permutation(len(pairs))
    pairs = np.array(pairs)[indices]
    labels = np.array(labels)[indices]
    
    # Split the pairs into two separate arrays
    pairs_1 = pairs[:, 0]
    pairs_2 = pairs[:, 1]
    
    return pairs_1, pairs_2, labels

   This function takes the training data X_train and corresponding labels y_train from split #1 as input and generates 
   a batch of pairs of MNIST images along with their labels. The batch_size determines the number of pairs to generate.

   Half of the instances in the batch are pairs of images that belong to the same class (same_class_pairs), and the 
   other half are pairs of images from different classes (different_class_pairs). The labels for same-class pairs are 
   set to 0, and the labels for different-class pairs are set to 1.

   The function returns pairs_1 and pairs_2, which hold the first and second image of each pair, and the matching 
   labels array."""

#c. Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and 
the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.

"""Pretraining on an auxiliary task is a technique commonly used in deep learning to enhance the performance of a neural
   network on the main task. Here the auxiliary task is the pair comparison itself: the network is trained on split #1
   to decide whether two images belong to the same class, and no separate pretraining of DNN A or DNN B is needed first.

   The process involves the following steps:

   1. Training data: Draw batches of image pairs from split #1 with the function from part (b). Each pair is labeled 0
      if the two images belong to the same class and 1 if they do not.

   2. Architecture setup: Use two DNNs, referred to as DNN A and DNN B. They process the first and second image of each
      pair, respectively, and the layers on top combine their outputs into a single same/different prediction.

   3. Forward pass: For each image pair in the batch, simultaneously feed the first image to DNN A and the second image
      to DNN B.

   4. Loss function: Because the label is binary, binary cross-entropy on the network's prediction is a suitable loss.
      It guides training so that the network learns to discriminate between same-class and different-class pairs.

   5. Backpropagation and optimization: Compute the gradients of the loss with respect to the parameters of both DNN A
      and DNN B using backpropagation, and update them with an optimizer such as stochastic gradient descent (SGD) or
      Adam.

   6. Iterative training: Repeat over many batches or epochs, ideally monitoring accuracy on held-out pairs so that
      training can stop when it no longer improves.

  By training this way, the whole network learns to extract features that capture which digit an image shows, since
  that is what it takes to predict class similarity. Reusing the hidden layers of DNN A then gives the next model useful
  knowledge, leading to improved performance on the main task."""
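
The two-branch setup described above could be sketched in Keras roughly as follows. This is a minimal illustration, not a reference solution: the helper names (build_dnn, build_pair_model), the flattened 784-pixel input, and the 10-unit hidden layer that joins the two branches are assumptions. Only the five 100-neuron ELU layers with He initialization are taken from the first exercise.

```python
import numpy as np
import tensorflow as tf


def build_dnn(name, n_hidden=5, n_neurons=100):
    """Five Dense/ELU hidden layers with He initialization, no output layer."""
    return tf.keras.Sequential(
        [
            tf.keras.layers.Dense(
                n_neurons, activation="elu", kernel_initializer="he_normal"
            )
            for _ in range(n_hidden)
        ],
        name=name,
    )


def build_pair_model(input_dim=784):
    """Two-branch network that predicts 0 (same class) or 1 (different class)."""
    input_a = tf.keras.Input(shape=(input_dim,), name="image_a")
    input_b = tf.keras.Input(shape=(input_dim,), name="image_b")

    dnn_a = build_dnn("dnn_a")  # processes the first image of each pair
    dnn_b = build_dnn("dnn_b")  # processes the second image of each pair

    # Assumption: the branch outputs are concatenated and passed through
    # one small hidden layer before the single sigmoid output neuron.
    merged = tf.keras.layers.Concatenate()([dnn_a(input_a), dnn_b(input_b)])
    hidden = tf.keras.layers.Dense(
        10, activation="elu", kernel_initializer="he_normal"
    )(merged)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

    model = tf.keras.Model(inputs=[input_a, input_b], outputs=output)
    model.compile(
        optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
    )
    # dnn_a is returned as well so its hidden layers can be reused later
    return model, dnn_a
```

With the batch function from part (b), a training step would then look something like model.fit([pairs_1, pairs_2], labels, ...), assuming the images have been flattened to 784 values and scaled to [0, 1] beforehand.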

#d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 
10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.

"""Pretraining on an auxiliary task followed by fine-tuning is a common approach to achieve high performance on a main 
   task, even with limited data. In this case, the hidden layers of DNN A, already trained on the pair-comparison task
   from part (c), are reused and frozen, and a new classifier is trained on split #2 with only 500 images per class:

   1. Reuse pretrained layers: Take the hidden layers of DNN A as trained in part (c) on pairs from split #1. They
      already encode representations that can be transferred to the main task.

   2. Freeze hidden layers: Freeze the parameters of the hidden layers of DNN A. This prevents their weights from
      being updated during the subsequent training process.

   3. Softmax output layer: Add a new softmax output layer on top of the frozen hidden layers. This layer has 10
      neurons, one per digit class.

   4. Split #2 data: Train on split #2, which contains 5,000 images (500 per class). Keep a separate validation or
      test set for evaluation.

   5. Training: Train the new DNN on split #2. Since the hidden layers are frozen, only the weights of the new softmax
      output layer are updated during training.

   6. Loss function and optimization: Use a multi-class loss such as categorical cross-entropy (the sparse variant if
      the labels are integer digits rather than one-hot vectors) and an optimization algorithm like SGD or Adam.

   7. Iterative training: Repeat the training process for multiple iterations or epochs, adjusting the weights of the
      output layer to minimize the loss on the training data.

  By reusing and freezing the hidden layers of DNN A, the new DNN starts with prelearned features, which is especially
  beneficial when the available data is limited. Training its output layer on split #2 adapts it to digit
  classification, even though only 500 images per class are available.

  Despite having a smaller dataset, fine-tuning the pretrained model on split #2 can potentially lead to high performance,
  as the network can leverage the knowledge and representations learned from the auxiliary task. However, the exact 
  performance achieved will depend on the complexity of the main task, the quality of the pretrained model, and the 
  similarity between the auxiliary and main tasks."""
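
The reuse-and-freeze steps above might look like the following in Keras. This is a sketch under stated assumptions: build_digit_classifier is a hypothetical helper, the 784-pixel input is assumed, and the dnn_a defined here is an untrained stand-in so that the snippet runs on its own. In practice it would be the DNN A trained on the pair task in part (c).

```python
import tensorflow as tf


def build_digit_classifier(pretrained_dnn, input_dim=784, n_classes=10):
    """Put a new softmax layer on top of frozen, previously trained hidden layers."""
    # Freeze the reused hidden layers so training cannot change their weights
    pretrained_dnn.trainable = False

    inputs = tf.keras.Input(shape=(input_dim,))
    features = pretrained_dnn(inputs, training=False)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(features)

    model = tf.keras.Model(inputs, outputs)
    # The sparse loss expects integer labels (0-9), as MNIST provides them
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Stand-in for DNN A (assumption: untrained here, only for illustration)
dnn_a = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal")
        for _ in range(5)
    ],
    name="dnn_a",
)

classifier = build_digit_classifier(dnn_a)
```

Only the kernel and bias of the new output layer remain trainable, so a call such as classifier.fit(X_split2, y_split2, ...) (hypothetical variable names) updates just those two weight arrays.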