'''
### Part 1: Understanding Weight Initialization

**Q1a. Explain the Importance of Weight Initialization in Artificial Neural Networks. Why is it Necessary to Initialize the Weights Carefully?**

Weight initialization is critical in neural networks because it significantly influences the learning process and convergence rate. Proper weight initialization helps:
- Ensure that the gradients do not vanish or explode during training, leading to more stable and faster convergence.
- Facilitate symmetry breaking so that different neurons can learn different features.

**Q1b. Describe the Challenges Associated with Improper Weight Initialization. How Do These Issues Affect Model Training and Convergence?**

Improper weight initialization can lead to several issues:
- **Vanishing Gradients**: When weights are too small, the backpropagated gradient shrinks at every layer and approaches zero in the early layers, making learning extremely slow.
- **Exploding Gradients**: When weights are too large, gradients grow exponentially with depth, causing numerical instability and divergence.
- **Symmetry Problem**: If all weights are initialized to the same value, all neurons in a layer will learn the same features, preventing the network from learning diverse representations.
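
The first two failure modes can be illustrated with a small experiment. This is a minimal sketch, not part of the original exercise: it assumes a stack of purely linear layers (no activation) and tracks only the forward signal, whose scale changes by roughly `weight_std * sqrt(width)` per layer. The backward pass multiplies by the same matrices, so gradients scale in a similar way. The function name `signal_scale_after` and the chosen sizes are illustrative.

```python
import math
import random

def signal_scale_after(depth, width, weight_std, seed=0):
    """Push one random vector through `depth` linear layers and return its RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each layer is a width x width matrix with i.i.d. N(0, weight_std^2) entries.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)

width, depth = 64, 20
too_small = signal_scale_after(depth, width, weight_std=0.01)
too_large = signal_scale_after(depth, width, weight_std=1.0)
balanced = signal_scale_after(depth, width, weight_std=1.0 / math.sqrt(width))

print(f"std=0.01          -> RMS {too_small:.3e}")  # collapses toward zero
print(f"std=1.0           -> RMS {too_large:.3e}")  # blows up
print(f"std=1/sqrt(width) -> RMS {balanced:.3e}")   # stays on the order of 1
```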

**Q1c. Discuss the Concept of Variance and How It Relates to Weight Initialization. Why Is It Crucial to Consider the Variance of Weights During Initialization?**

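Variance in weight initialization determines the spread of the initial weight values, and therefore how much each layer amplifies or attenuates the signal passing through it. Proper variance ensures that: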
- The output of neurons remains in a reasonable range, preventing saturation of activation functions.
- Gradients are maintained at appropriate magnitudes, avoiding vanishing or exploding gradients.
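
The link between weight variance and output range can be checked numerically. For a single pre-activation `y = sum_i w_i * x_i` with independent zero-mean weights and inputs, the variance is approximately `fan_in * Var(w) * Var(x)`. The sketch below is an illustration under those assumptions (unit-variance Gaussian inputs, weights resampled for every draw); `preactivation_variance` is a hypothetical helper name.

```python
import random
import statistics

def preactivation_variance(fan_in, weight_var, n_samples=2000, seed=0):
    """Empirical variance of y = sum_i w_i * x_i with x_i ~ N(0, 1)."""
    rng = random.Random(seed)
    w_std = weight_var ** 0.5
    ys = []
    for _ in range(n_samples):
        ys.append(sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, 1.0) for _ in range(fan_in)))
    return statistics.pvariance(ys)

fan_in = 100
unscaled = preactivation_variance(fan_in, weight_var=1.0)         # expected near fan_in
scaled = preactivation_variance(fan_in, weight_var=1.0 / fan_in)  # expected near 1

print(f"Var(w)=1        -> Var(y) ~ {unscaled:.1f}")
print(f"Var(w)=1/fan_in -> Var(y) ~ {scaled:.2f}")
```

With unit-variance weights the pre-activation variance grows with the number of inputs, which pushes sigmoid or tanh units into saturation. Dividing the weight variance by `fan_in` keeps it near 1.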

### Part 2: Weight Initialization Techniques

**Q2a. Explain the Concept of Zero Initialization. Discuss Its Potential Limitations and When It Can Be Appropriate to Use.**

Zero initialization sets all weights to zero. While it is simple, it has significant limitations:
- **Symmetry Problem**: All neurons in a layer compute the same output and receive the same gradient, so they learn the same features. With zero weights, the gradients reaching the hidden layers can themselves be zero, so those weights may not update at all.
- **When it is appropriate**: Zero initialization is a reasonable default for bias terms, but not for weights.
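
The symmetry problem can be made concrete with a hand-computed gradient. The sketch below assumes a tiny bias-free network (2 inputs, 3 tanh hidden units, 1 linear output) with a squared-error loss. The architecture, the input values, and the helper name `hidden_weight_grads` are illustrative choices, not taken from the exercise.

```python
import math

def hidden_weight_grads(w_hidden, w_out, x, target):
    """Gradients of 0.5 * (y - target)^2 w.r.t. the hidden weights of a tanh MLP."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(wo * hj for wo, hj in zip(w_out, h))
    err = y - target
    # Chain rule: dL/dW1[j][i] = err * w_out[j] * (1 - h_j^2) * x_i
    return [[err * wo * (1.0 - hj * hj) * xi for xi in x] for wo, hj in zip(w_out, h)]

x, target = [1.0, -2.0], 1.0

# Every weight set to the same constant: each hidden unit gets the same gradient row.
const_grads = hidden_weight_grads([[0.5, 0.5]] * 3, [0.5] * 3, x, target)

# Every weight set to zero: the gradient for the hidden weights is zero everywhere.
zero_grads = hidden_weight_grads([[0.0, 0.0]] * 3, [0.0] * 3, x, target)

print("constant init:", const_grads)
print("zero init:    ", zero_grads)
```

Because the rows are identical, a gradient step keeps the hidden units identical, so the layer behaves like a single neuron no matter how wide it is.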

**Q2b. Describe the Process of Random Initialization. How Can Random Initialization Be Adjusted to Mitigate Potential Issues Like Saturation or Vanishing/Exploding Gradients?**

Random initialization assigns random values to weights from a certain distribution (e.g., uniform or normal). To mitigate potential issues:
- Scale the random values based on the number of input and output units (fan-in and fan-out) to keep the variance under control.
- Use techniques like Xavier or He initialization to adapt the scaling factor appropriately.
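
As a rough sketch of what the fan-in/fan-out scaling looks like in numbers, the helpers below compute the target standard deviations commonly given for the normal variants of the two schemes: `sqrt(2 / (fan_in + fan_out))` for Xavier (Glorot) and `sqrt(2 / fan_in)` for He. The function names are hypothetical, and the layer shapes are those of the MNIST model used in Part 3.

```python
import math

def glorot_normal_std(fan_in, fan_out):
    """Target standard deviation for Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Target standard deviation for He normal initialization."""
    return math.sqrt(2.0 / fan_in)

# Dense layer shapes of the Part 3 model: 784 -> 128 -> 64 -> 10
for fan_in, fan_out in [(784, 128), (128, 64), (64, 10)]:
    print(
        f"{fan_in:>3} -> {fan_out:>3}: "
        f"glorot std = {glorot_normal_std(fan_in, fan_out):.4f}, "
        f"he std = {he_normal_std(fan_in):.4f}"
    )
```

For the first layer the He value is close to 0.05, the fixed `stddev` used by the `RandomNormal` baseline in Part 3. A fixed value, however, does not adapt to the narrower layers that follow.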

**Q2d. Explain the Concept of He Initialization. How Does It Differ from Xavier Initialization, and When Is It Preferred?**
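
As commonly defined, Xavier (Glorot) initialization targets a weight variance of `2 / (fan_in + fan_out)`, while He initialization uses `2 / fan_in` to compensate for ReLU zeroing roughly half of the pre-activations. The sketch below is an illustrative experiment under simple assumptions (square layers, no biases, one random input vector). It shows how the forward signal behaves in a deep ReLU stack under each scale. `relu_stack_rms` is a hypothetical helper name.

```python
import math
import random

def relu_stack_rms(depth, width, weight_std, seed=0):
    """RMS of the activations after `depth` ReLU layers with N(0, weight_std^2) weights."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        x = [max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x)) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)

width, depth = 64, 20
xavier_rms = relu_stack_rms(depth, width, math.sqrt(2.0 / (width + width)))  # fan_in == fan_out
he_rms = relu_stack_rms(depth, width, math.sqrt(2.0 / width))

print(f"Xavier scale, 20 ReLU layers: RMS {xavier_rms:.3e}")  # signal fades with depth
print(f"He scale,     20 ReLU layers: RMS {he_rms:.3e}")      # signal roughly preserved
```

Under these assumptions the Xavier scale loses about half of the signal energy at every ReLU layer, which is the usual argument for preferring He initialization with ReLU-family activations.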

### Part 3: Applying Weight Initialization

**Q3a. Implement Different Weight Initialization Techniques**
'''

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.initializers import Zeros, RandomNormal, GlorotNormal, HeNormal
import matplotlib.pyplot as plt

# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize the data
X_train, X_test = X_train / 255.0, X_test / 255.0

# Function to create model with a given initializer
def create_model(initializer):
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(128, activation='relu', kernel_initializer=initializer),
        Dense(64, activation='relu', kernel_initializer=initializer),
        Dense(10, activation='softmax')  # output layer keeps the Keras default initializer (Glorot uniform)
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Initializers to compare
initializers = {
    'Zeros': Zeros(),
    'RandomNormal': RandomNormal(mean=0.0, stddev=0.05),
    'GlorotNormal': GlorotNormal(),
    'HeNormal': HeNormal()
}

histories = {}

for name, initializer in initializers.items():
    print(f"\nTraining with {name} initializer")
    model = create_model(initializer)
    history = model.fit(X_train, y_train, validation_split=0.2, epochs=10, batch_size=64)
    histories[name] = history

'''
**Q3b. Discuss Considerations and Tradeoffs When Choosing the Appropriate Weight Initialization Technique**

When choosing a weight initialization technique, consider:
- **Activation Function**: Use He initialization for ReLU and its variants, and Xavier initialization for sigmoid and tanh.
- **Network Depth**: Deeper networks benefit more from proper initialization to prevent vanishing/exploding gradients.
- **Task and Architecture**: Some tasks may benefit from specific initialization strategies depending on the data distribution and network design.
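
The activation-function rule of thumb above can be written as a small lookup. This is a hypothetical helper, not a Keras API: it only returns the string identifiers `"he_normal"`, `"glorot_normal"`, and `"glorot_uniform"`, which Keras accepts for `kernel_initializer`. Treating ELU and leaky ReLU as ReLU variants is an assumption of this sketch.

```python
def suggest_initializer(activation):
    """Map an activation name to a Keras initializer identifier (illustrative helper)."""
    relu_family = {"relu", "leaky_relu", "elu"}
    saturating = {"sigmoid", "tanh"}
    name = activation.lower()
    if name in relu_family:
        return "he_normal"
    if name in saturating:
        return "glorot_normal"
    # Fall back to the default that Keras uses for Dense layers.
    return "glorot_uniform"

for act in ["relu", "tanh", "sigmoid", "softmax"]:
    print(f"{act:>8} -> {suggest_initializer(act)}")
```

A layer could then be built as `Dense(128, activation="relu", kernel_initializer=suggest_initializer("relu"))`.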

**Comparing the Performance of Different Initializers**
'''

# Plot the training and validation accuracy
plt.figure(figsize=(14, 7))
for name, history in histories.items():
    plt.plot(history.history['val_accuracy'], label=f'{name} Validation Accuracy')
    plt.plot(history.history['accuracy'], label=f'{name} Training Accuracy')

plt.title('Comparison of Weight Initializers')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
