### Q1. Theory and Concepts

1. Explain the concept of batch normalization in the context of artificial neural networks.

Ans=  Batch Normalization is a technique used in neural networks to stabilize and accelerate training. It normalizes a layer's activations over each mini-batch during training, which was originally motivated as reducing internal covariate shift (the change in the distribution of a layer's inputs as the parameters of earlier layers are updated). It helps networks converge faster, allows for higher learning rates, acts as a mild regularizer, and often improves generalization. Batch Normalization is applied per layer and involves three steps: normalizing, scaling, and shifting the activations.

## Q2. Describe the benefits of using batch normalization during training

Ans = The benefits of using batch normalization during training in neural networks include:

1. **Faster Convergence:** Batch normalization stabilizes training, allowing for quicker convergence. Networks reach their desired accuracy in fewer training iterations.

2. **Higher Learning Rates:** It enables the use of higher learning rates, which accelerates training without causing instability. This speeds up the optimization process.

3. **Reduced Internal Covariate Shift:** Batch normalization mitigates the problem of internal covariate shift by normalizing activations within mini-batches. This results in more stable gradients and faster training.

4. **Regularization:** It acts as a form of regularization, reducing the need for dropout or L2 regularization. This helps prevent overfitting.

5. **Improved Generalization:** Batch Normalization often leads to models that generalize better to unseen data, resulting in better test performance.

6. **Robustness to Initialization:** Networks with batch normalization are less sensitive to weight initialization, making it easier to train deep architectures.

7. **Compatibility with Various Architectures:** Batch Normalization can be used with fully connected and convolutional layers. Applying it to recurrent layers is less straightforward, and layer normalization is often preferred there.

8. **Differentiable Operation:** It is designed to be differentiable, allowing gradients to be computed efficiently during backpropagation.

## Q3. Discuss the working principle of batch normalization, including the normalization step and the learnable parameters.

Ans = Batch Normalization (BatchNorm) works by normalizing the activations within mini-batches during training to address the problem of internal covariate shift. Here's an overview of its working principle, including the normalization step and the learnable parameters:

**Normalization Step:**
1. **Mini-Batch Statistics:** For each mini-batch during training, BatchNorm calculates two statistics for every activation (feature): its mean μ and its variance σ^2 across the examples in the mini-batch. These statistics provide an estimate of the distribution of that activation for the batch.

2. **Normalization:** The activations within the mini-batch are then normalized using the calculated mean and variance. This is done element-wise for each activation x within the mini-batch:

   x' =  (x - μ) / √(σ^2 + ε)

   Here, x' is the normalized activation, μ is the mean, σ^2 is the variance, and ε is a small constant (e.g., 1e-5) added to the denominator for numerical stability.

3. **Scaling and Shifting:** After normalization, the activations are scaled by a learnable parameter γ and shifted by another learnable parameter β:

   y = γ x' + β

   The γ parameter allows the network to adjust the scale of the normalized activations, and the β parameter allows it to adjust the shift. These parameters are learned during training.
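
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration for a 2-D `(batch, features)` input, not the Keras implementation; the function name `batch_norm_forward` and the random test data are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for a (batch, features) array."""
    mu = x.mean(axis=0)                    # step 1: per-feature mini-batch mean
    var = x.var(axis=0)                    # step 1: per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 2: normalize
    y = gamma * x_hat + beta               # step 3: scale and shift
    return y, x_hat, mu, var

# Hypothetical mini-batch: 32 examples, 4 features, far from zero mean / unit variance
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

gamma = np.ones(4)   # scale initialized to 1
beta = np.zeros(4)   # shift initialized to 0
y, x_hat, mu, var = batch_norm_forward(x, gamma, beta)

print(y.mean(axis=0))  # each entry is very close to 0
print(y.std(axis=0))   # each entry is very close to 1
```

With γ = 1 and β = 0 the output is simply the normalized activation, so each feature ends up with approximately zero mean and unit variance regardless of the input's original scale.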

#### Learnable Parameters:

The key components of BatchNorm are the γ and β parameters:

- **γ (Scale Parameter):** It allows the network to control the scale or magnitude of the normalized activations. If γ is close to 1, the unit variance produced by normalization is preserved. If γ is less than 1, it scales down the activations, and if it's greater than 1, it scales them up.

- **β (Shift Parameter):** It allows the network to control the shift or translation of the normalized activations. It can shift the activations away from the standard normal distribution achieved through normalization.
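
A small NumPy sketch may make the role of γ and β more concrete. It assumes made-up input data and hand-picked parameter values (in a real network these are learned), and shows that particular choices of γ and β can set any output mean and spread, or even undo the normalization entirely.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=4.0, size=(64, 3))  # hypothetical activations

# Normalization step
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# gamma = 2, beta = 3: output has mean 3 and standard deviation close to 2
y_custom = 2.0 * x_hat + 3.0

# gamma = sqrt(var + eps), beta = mu: scale and shift exactly undo the normalization
y_identity = np.sqrt(var + eps) * x_hat + mu

print(y_custom.mean(axis=0), y_custom.std(axis=0))
print(np.allclose(y_identity, x))
```

The second case is why the learnable parameters matter: the layer can still represent the original, unnormalized activations if that turns out to be what training favors.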


During training, the mean and variance are calculated for each mini-batch. However, during inference or when making predictions on a single example, the statistics used for normalization may be calculated differently. Typically, a moving average of the statistics from all mini-batches seen during training is used for normalization during inference.
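
The difference between training and inference can be sketched as follows. This is a simplified illustration, not the Keras layer: the class name `SimpleBatchNorm`, the `momentum=0.9` value, and the synthetic data are assumptions chosen for the example.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch norm layer that tracks moving statistics (illustrative only)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use the statistics of the current mini-batch ...
            mu, var = x.mean(axis=0), x.var(axis=0)
            # ... and fold them into the exponential moving averages
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mu
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # At inference, rely on the accumulated moving averages instead
            mu, var = self.moving_mean, self.moving_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=2)

# Simulate training on 200 mini-batches drawn from N(5, 2^2)
for _ in range(200):
    bn(rng.normal(loc=5.0, scale=2.0, size=(32, 2)), training=True)

# A single example can now be normalized without any batch statistics
single = np.array([[5.0, 7.0]])
out = bn(single, training=False)
print(bn.moving_mean, bn.moving_var)
print(out)
```

Normalizing the single example with its own batch statistics would not work, since the variance of one sample is zero; the moving averages give a stable estimate of the training distribution instead.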

# **IMPLEMENTATION**

In [1]:
# Importing necessary libraries
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from keras.datasets import fashion_mnist
plt.style.use("fivethirtyeight")
%load_ext tensorboard

In [2]:
if tf.config.list_physical_devices('GPU'):
    print('Running on GPU')
else:
    print('Running on CPU')


Running on CPU


In [5]:
# Loading the Fashion-MNIST dataset
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
X_train_full = X_train_full / 255.0  # Scale pixel values to [0, 1] (also converts to float)
X_test = X_test / 255.0  # Scale pixel values to [0, 1] (also converts to float)
# Hold out the first 5,000 training examples as a validation set
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [6]:
X_train_full.shape , y_train_full.shape

((60000, 28, 28), (60000,))

In [7]:
X_train.shape , y_train.shape

((55000, 28, 28), (55000,))

In [8]:
X_test.shape , y_test.shape

((10000, 28, 28), (10000,))

In [9]:
X_valid.shape , y_valid.shape

((5000, 28, 28), (5000,))

In [10]:
# Creating the layers of the model

# Setting seeds for code reproducibility
tf.random.set_seed(42)
np.random.seed(42)

LAYERS = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax"),
]

model = tf.keras.models.Sequential(LAYERS)



In [11]:
# Compiling the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [12]:
model.summary()

In [13]:
#Training and Calculating the training time

#Starting time
start = time.time()

history = model.fit(X_train,
                    y_train,
                    epochs=15,
                    validation_data=(X_valid,y_valid),
                    verbose = 2
                    )

#Ending time
end = time.time()

#Total time taken
print(f"Runtime of the program is {end - start:.2f} seconds")

Epoch 1/15
1719/1719 - 6s - 3ms/step - accuracy: 0.6145 - loss: 1.2406 - val_accuracy: 0.7232 - val_loss: 0.8659
Epoch 2/15
1719/1719 - 4s - 3ms/step - accuracy: 0.7449 - loss: 0.7863 - val_accuracy: 0.7724 - val_loss: 0.7033
Epoch 3/15
1719/1719 - 4s - 2ms/step - accuracy: 0.7782 - loss: 0.6772 - val_accuracy: 0.7974 - val_loss: 0.6301
Epoch 4/15
1719/1719 - 6s - 3ms/step - accuracy: 0.7981 - loss: 0.6188 - val_accuracy: 0.8112 - val_loss: 0.5860
Epoch 5/15
1719/1719 - 4s - 2ms/step - accuracy: 0.8090 - loss: 0.5809 - val_accuracy: 0.8208 - val_loss: 0.5559
Epoch 6/15
1719/1719 - 5s - 3ms/step - accuracy: 0.8166 - loss: 0.5538 - val_accuracy: 0.8254 - val_loss: 0.5339
Epoch 7/15
1719/1719 - 6s - 4ms/step - accuracy: 0.8219 - loss: 0.5333 - val_accuracy: 0.8310 - val_loss: 0.5170
Epoch 8/15
1719/1719 - 4s - 3ms/step - accuracy: 0.8260 - loss: 0.5172 - val_accuracy: 0.8340 - val_loss: 0.5035
Epoch 9/15
1719/1719 - 7s - 4ms/step - accuracy: 0.8294 - loss: 0.5041 - val_accuracy: 0.8372 - 


### ***After Batch Normalization***

In [14]:
# delete the previous model
del model

In [15]:
# Defining a new model with batch normalization

tf.random.set_seed(42)  # Setting seeds for code reproducibility
np.random.seed(42)

LAYERS_BN = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN)

In [16]:
model.summary()

In [17]:
bn1 = model.layers[1]

In [18]:
for variable in bn1.variables:
    print(variable.name, variable.trainable)

gamma True
beta True
moving_mean False
moving_variance False
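
The output above shows that each batch normalization layer holds four vectors, of which only `gamma` and `beta` are trained by backpropagation. As a cross-check, the parameter counts of the model defined in `LAYERS_BN` can be worked out by hand. The helper names below are hypothetical and the sketch uses plain Python rather than Keras.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """Return (trainable, non_trainable) counts for one batch norm layer."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving_mean and moving_variance
    return trainable, non_trainable

# Flatten(28x28) -> BN -> Dense(300) -> BN -> Dense(100) -> BN -> Dense(10)
layer_sizes = [28 * 28, 300, 100, 10]

trainable = 0
non_trainable = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    bn_trainable, bn_frozen = batch_norm_params(n_in)  # BN sits before each Dense
    trainable += bn_trainable + dense_params(n_in, n_out)
    non_trainable += bn_frozen

print(trainable)                  # 268978
print(non_trainable)              # 2368
print(trainable + non_trainable)  # 271346
```

The non-trainable parameters are the moving statistics: they are updated during training, but by the running-average rule rather than by gradient descent.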


In [19]:
# Compiling the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [20]:
#Training and Calculating the training time

#Starting time
start = time.time()

history = model.fit(X_train,
                    y_train,
                    epochs=15,
                    validation_data=(X_valid,y_valid),
                    verbose = 2
                    )

#Ending time
end = time.time()

#Total time taken
print(f"Runtime of the program is {end - start:.2f} seconds")

Epoch 1/15
1719/1719 - 8s - 5ms/step - accuracy: 0.7186 - loss: 0.8556 - val_accuracy: 0.8080 - val_loss: 0.5574
Epoch 2/15
1719/1719 - 11s - 6ms/step - accuracy: 0.8045 - loss: 0.5731 - val_accuracy: 0.8374 - val_loss: 0.4781
Epoch 3/15
1719/1719 - 11s - 6ms/step - accuracy: 0.8235 - loss: 0.5105 - val_accuracy: 0.8498 - val_loss: 0.4421
Epoch 4/15
1719/1719 - 11s - 6ms/step - accuracy: 0.8363 - loss: 0.4741 - val_accuracy: 0.8560 - val_loss: 0.4205
Epoch 5/15
1719/1719 - 7s - 4ms/step - accuracy: 0.8443 - loss: 0.4484 - val_accuracy: 0.8626 - val_loss: 0.4051
Epoch 6/15
1719/1719 - 11s - 6ms/step - accuracy: 0.8513 - loss: 0.4283 - val_accuracy: 0.8644 - val_loss: 0.3938
Epoch 7/15
1719/1719 - 11s - 7ms/step - accuracy: 0.8565 - loss: 0.4118 - val_accuracy: 0.8674 - val_loss: 0.3847
Epoch 8/15
1719/1719 - 8s - 5ms/step - accuracy: 0.8611 - loss: 0.3975 - val_accuracy: 0.8704 - val_loss: 0.3777
Epoch 9/15
1719/1719 - 8s - 5ms/step - accuracy: 0.8653 - loss: 0.3850 - val_accuracy: 0.87

### Observation:
- Runtime of the program is 145.52 sec
- accuracy: 0.8794