### Part 1: Understanding Weight Initialization

In [None]:
"""1. Explain the importance of weight initialization in artificial neural networks. Why is it necessary to initialize the weights carefully?
    Ans: Weight initialization is crucial in artificial neural networks as it sets the starting values for the model's weights. Careful initialization 
          ensures that the network begins training with reasonable weights, preventing issues like vanishing/exploding gradients. 
          Proper initialization improves convergence and overall performance during training.

2. Describe the challenges associated with improper weight initialization. How do these issues affect model training and convergence?
    Ans:  Improper weight initialization can lead to challenges like vanishing or exploding gradients. Vanishing gradients make the network hard to 
          train as updates become negligible, while exploding gradients cause instability and divergence. Both issues hinder model convergence, 
          resulting in slow learning or failure to learn altogether.

3. Discuss the concept of variance and how it relates to weight initialization. Why is it crucial to consider the variance of weights during initialization?
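    Ans: Variance measures how widely the initial weights are spread around their mean (usually zero). For zero-mean weights, each layer's output variance 
          is roughly the input variance multiplied by fan_in times the weight variance, so weights that are too large inflate activations and gradients 
          layer by layer, while weights that are too small shrink them toward zero. Choosing the weight variance in proportion to 1 / fan_in keeps the 
          signal at a similar scale across layers, which is what stable and effective learning requires."""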
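
The effect described in answer 3 can be checked numerically. The following is a minimal sketch, not part of the original exercise: the helper `last_layer_std`, the width of 64, the depth of 10, and the three standard deviations are assumed values chosen only for illustration. It pushes one random input through a stack of tanh layers and reports the spread of the final pre-activations.

```python
import math
import random
import statistics


def last_layer_std(weight_std, width=64, depth=10, seed=0):
    """Return the standard deviation of the final layer's pre-activations
    after pushing one random input through `depth` tanh layers.

    Hypothetical helper for illustration; weights are drawn from
    N(0, weight_std^2) and biases are omitted.
    """
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    pre_activations = activations
    for _ in range(depth):
        # Each output unit gets its own freshly sampled row of weights.
        pre_activations = [
            sum(rng.gauss(0.0, weight_std) * a for a in activations)
            for _ in range(width)
        ]
        activations = [math.tanh(z) for z in pre_activations]
    return statistics.pstdev(pre_activations)


WIDTH = 64  # assumed layer width
too_small = last_layer_std(0.01, WIDTH)
too_large = last_layer_std(1.0, WIDTH)
scaled = last_layer_std(1.0 / math.sqrt(WIDTH), WIDTH)

print(f"std = 0.01           -> {too_small:.2e}")
print(f"std = 1.0            -> {too_large:.2e}")
print(f"std = 1/sqrt(width)  -> {scaled:.2e}")
```

Under these assumptions, the smallest standard deviation should drive the signal toward zero, the largest should push tanh into saturation, and the 1/sqrt(width) scaling should keep the signal at a moderate size.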


### Part 2: Weight Initialization Techniques

In [None]:
"""4. Explain the concept of zero initialization. Discuss its potential limitations and when it can be appropriate to use.
    Ans: Zero initialization sets all the weights in a neural network to zero. Every neuron in a layer then computes the same output and receives the 
          same gradient, so the neurons stay identical throughout training: the symmetry is never broken and the layer learns no more than a single 
          neuron would. Zero initialization is a common and safe choice for biases, but it is generally not recommended for weights in hidden layers.

5. Describe the process of random initialization. How can random initialization be adjusted to mitigate potential issues like saturation or vanishing/exploding gradients? 
    Ans: Random initialization draws each weight independently from a distribution, typically uniform or normal and centered on zero. To mitigate saturation 
          or vanishing/exploding gradients, scale the spread of that distribution to the layer size. For example, Xavier initialization scales the weights 
          by the number of input and output units so that activations and gradients keep a similar magnitude from layer to layer.

6. Discuss the concept of Xavier/Glorot initialization. Explain how it addresses the challenges of improper weight initialization and the underlying theory behind it. 
    Ans: Xavier/Glorot initialization draws weights with variance 2 / (fan_in + fan_out), where fan_in and fan_out are the numbers of input and output units of the layer. 
          This choice is a compromise between keeping the variance of activations constant in the forward pass and keeping the variance of gradients constant in the 
          backward pass, which avoids vanishing/exploding gradients and leads to more stable and efficient learning, particularly with tanh and sigmoid activations.

7. Explain the concept of He initialization. How does it differ from Xavier initialization, and when is it preferred?
    Ans: He initialization is a weight initialization method designed for ReLU activation functions. ReLU zeroes roughly half of its inputs, which reduces the 
          signal variance, so He initialization compensates with a larger weight variance of 2 / fan_in. Unlike Xavier, which uses 2 / (fan_in + fan_out), it depends 
          only on the number of input units, and it is preferred for networks, especially deep ones, that use ReLU or its variants."""
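
To make answers 6 and 7 concrete, the sketch below computes the scale each scheme would use for the two hidden layers trained later in this notebook (784 to 300 and 300 to 100 units). The formulas follow the ones documented for the Keras `GlorotNormal`, `GlorotUniform`, `HeNormal`, and `HeUniform` initializers; the helper function names are hypothetical and exist only for this illustration.

```python
import math


def glorot_normal_std(fan_in, fan_out):
    # Glorot/Xavier normal: variance 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))


def glorot_uniform_limit(fan_in, fan_out):
    # Glorot/Xavier uniform: samples from [-limit, limit]
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    # He normal: variance 2 / fan_in
    return math.sqrt(2.0 / fan_in)


def he_uniform_limit(fan_in):
    # He uniform: samples from [-limit, limit]
    return math.sqrt(6.0 / fan_in)


# Layer sizes taken from the model defined in Part 3.
layers = [("hiddenLayer1", 784, 300), ("hiddenLayer2", 300, 100)]
for name, fan_in, fan_out in layers:
    print(
        f"{name}: "
        f"Glorot std={glorot_normal_std(fan_in, fan_out):.4f}, "
        f"Glorot limit={glorot_uniform_limit(fan_in, fan_out):.4f}, "
        f"He std={he_normal_std(fan_in):.4f}, "
        f"He limit={he_uniform_limit(fan_in):.4f}"
    )
```

A uniform distribution on [-limit, limit] has variance limit^2 / 3, so the uniform and normal variants of each scheme target the same variance. For the first hidden layer the He standard deviation is sqrt(2/784), roughly 0.05, which happens to be close to the `stddev=0.05` used for the plain random normal initializer in Part 3.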


### Part 3: Applying Weight Initialization

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Flatten


def model(INITIALIZER, AF):
    """Train a two-hidden-layer MLP on MNIST with the given kernel initializer
    and activation function, and return its best validation accuracy."""
    # Load the dataset
    (X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

    # Pixel values are unsigned integers in the 0-255 range; divide by 255 to scale them to [0, 1]
    X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
    y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

    # Scale the test set as well (unused below)
    X_test = X_test / 255.

    # Use a local name that does not shadow this function
    net = tf.keras.models.Sequential()
    net.add(Flatten(input_shape=[28, 28], name="inputLayer"))
    net.add(Dense(300, activation=AF, name="hiddenLayer1", kernel_initializer=INITIALIZER))
    net.add(Dense(100, activation=AF, name="hiddenLayer2", kernel_initializer=INITIALIZER))
    # The output layer keeps the Keras default kernel initializer
    net.add(Dense(10, activation="softmax", name="outputLayer"))

    net.compile(loss='sparse_categorical_crossentropy',
                optimizer='adam',
                metrics=["accuracy"])
    history = net.fit(X_train, y_train, epochs=25,
                      validation_data=(X_valid, y_valid), batch_size=1000)

    # Best validation accuracy over the 25 epochs
    return max(history.history['val_accuracy'])

In [4]:
# 8. Implement different weight initialization techniques (zero initialization, random initialization, Xavier initialization, and He initialization) in a neural network using a framework of your choice. Train the model on a suitable dataset and compare the performance of the initialized models. 

# Best validation accuracy for each initializer/activation pair
val_accuracies = {
    'zero_initializers' : model(tf.keras.initializers.Zeros(), 'relu'),
    # Note: this range gives all-positive weights in [0, 1]
    'random_Uniform_initializers' : model(tf.keras.initializers.RandomUniform(minval=0., maxval=1.), 'relu'),
    'random_Normal_initializers' : model(tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None), 'relu'),
    'Xavier_Normal_initializers' : model(tf.keras.initializers.GlorotNormal(seed=None), 'tanh'),
    'Xavier_Uniform_initializers' : model(tf.keras.initializers.GlorotUniform(seed=None), 'tanh'),
    'He_Normal_initializers' : model(tf.keras.initializers.HeNormal(), 'relu'),
    'He_Uniform_initializers' : model(tf.keras.initializers.HeUniform(), 'relu')
}
print('\n\n\n')
for name, acc in val_accuracies.items():
    print(f'{name} : {acc}')

Epoch 1/25 ... Epoch 25/25 (log repeated for each of the seven runs)

zero_initializers : 0.11259999871253967
random_Uniform_initializers : 0.8813999891281128
random_Normal_initializers : 0.9832000136375427
Xavier_Normal_initializers : 0.9819999933242798
Xavier_Uniform_initializers : 0.980400025844574
He_Normal_initializers : 0.9824000000953674
He_Uniform_initializers : 0.9828000068664551


### Best validation accuracy per run (rounded to four decimals)

| Run | Activation | Best validation accuracy |
|---|---|---|
| zero_initializers | relu | 0.1126 |
| random_Uniform_initializers | relu | 0.8814 |
| random_Normal_initializers | relu | 0.9832 |
| Xavier_Normal_initializers | tanh | 0.9820 |
| Xavier_Uniform_initializers | tanh | 0.9804 |
| He_Normal_initializers | relu | 0.9824 |
| He_Uniform_initializers | relu | 0.9828 |
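
The zero-initialized run stays near 0.11, about what always predicting a single digit would score on ten roughly balanced classes. One plausible explanation is sketched below under stated assumptions: a tiny hand-written network (3 inputs, 4 hidden ReLU units, 2 outputs) with made-up numbers, no biases, and the convention that the ReLU derivative at exactly 0 is 0. With zero hidden weights the hidden activations are all zero, so no weight in the network receives a gradient.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def relu_grad(values):
    # Assumed convention: the derivative at exactly 0 is taken to be 0.
    return [1.0 if v > 0.0 else 0.0 for v in values]


def matvec(weights, x):
    # `weights` is a list of rows, one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


x = [0.5, -1.0, 2.0]                      # made-up input
W1 = [[0.0] * 3 for _ in range(4)]        # hidden layer, zero-initialized
W2 = [[0.3, -0.2, 0.1, 0.4],              # output layer, nonzero weights
      [-0.1, 0.2, 0.5, -0.3]]
upstream = [1.0, -1.0]                    # made-up dLoss/dlogits

# Forward pass
z1 = matvec(W1, x)
h1 = relu(z1)
logits = matvec(W2, h1)

# Backward pass
grad_W2 = [[u * h for h in h1] for u in upstream]
grad_h1 = [sum(W2[k][j] * upstream[k] for k in range(len(upstream)))
           for j in range(len(h1))]
grad_z1 = [g * d for g, d in zip(grad_h1, relu_grad(z1))]
grad_W1 = [[g * xi for xi in x] for g in grad_z1]

print("hidden activations all zero:", all(h == 0.0 for h in h1))
print("grad_W1 all zero:", all(g == 0.0 for row in grad_W1 for g in row))
print("grad_W2 all zero:", all(g == 0.0 for row in grad_W2 for g in row))
```

In a network like this only bias terms could still learn, which would be consistent with a model that predicts the same class for every input. This is an interpretation of the result, not something the training log itself confirms.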

In [None]:
"""9. Discuss the considerations and tradeoffs when choosing the appropriate weight initialization technique for a given neural network architecture and task.
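    Ans: Match the initializer to the activation function first: Xavier/Glorot initialization suits tanh and sigmoid activations, while He initialization suits ReLU-based activations. 
            Depth matters as well, because a variance mismatch compounds from layer to layer and shows up as vanishing or exploding gradients. In the shallow network above, 
            a small random normal initializer did about as well as Glorot and He (all near 0.98), whereas zero initialization stayed near 0.11 and uniform weights in [0, 1] reached only about 0.88. 
            A sensible approach is to start from the standard choice for your activation and then compare alternatives on a validation set for your specific model and problem.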
"""
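
One quick check before committing to an initializer is to look at the pre-activation statistics it produces on the first layer. The sketch below is an illustration under assumptions, not a measurement on MNIST: the inputs are random values in [0, 1] standing in for scaled pixels, `preactivation_stats` is a hypothetical helper, and 100 units are sampled. It compares uniform weights in [0, 1] with a Glorot-style uniform range for a 784-to-300 layer.

```python
import math
import random
import statistics


def preactivation_stats(sample_weight, fan_in=784, units=100, seed=0):
    """Return (mean, std) of `units` pre-activations for one random input.

    Hypothetical helper; `sample_weight(rng)` draws a single weight.
    """
    rng = random.Random(seed)
    # Stand-in for one flattened image with pixel values scaled to [0, 1].
    x = [rng.random() for _ in range(fan_in)]
    z = [sum(sample_weight(rng) * xi for xi in x) for _ in range(units)]
    return statistics.fmean(z), statistics.pstdev(z)


fan_in, fan_out = 784, 300
limit = math.sqrt(6.0 / (fan_in + fan_out))  # Glorot uniform limit

uniform01_mean, uniform01_std = preactivation_stats(
    lambda rng: rng.uniform(0.0, 1.0))
glorot_mean, glorot_std = preactivation_stats(
    lambda rng: rng.uniform(-limit, limit))

print(f"uniform [0, 1]:  mean={uniform01_mean:.2f}, std={uniform01_std:.2f}")
print(f"Glorot uniform:  mean={glorot_mean:.2f}, std={glorot_std:.2f}")
```

With all-positive weights every unit starts with a large positive pre-activation, while the zero-centered range keeps pre-activations small and centered near zero. Real MNIST images contain many zero pixels, so the actual magnitudes would differ from this stand-in.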