# Activation Function Assignment Answer

### 1. Explain the role of activation functions in neural networks. Compare and contrast linear and nonlinear activation functions. Why are nonlinear activation functions preferred in hidden layers?


##### --> Activation functions play a crucial role in neural networks by determining the output of each neuron, effectively influencing the network's ability to learn and make complex decisions. Specifically, activation functions introduce non-linearity to the model, enabling it to approximate complex patterns and relationships between the input and output data.
* **Linear activation Function :-**
Linear activations are useful in the output layer of some neural networks, especially when predicting continuous values in regression tasks. However, they are generally avoided in hidden layers: a composition of linear layers is itself a single linear map, so stacking them adds no representational power. A network that uses only linear activations cannot model non-linear relationships, no matter how many layers it has.
* **Non Linear activation Function :-**
Nonlinear activation functions introduce non-linearity into the model, which allows the neural network to approximate complex relationships. Common nonlinear activations include ReLU (Rectified Linear Unit), Sigmoid, Tanh, and Leaky ReLU. Some nonlinear functions (e.g., Sigmoid, Tanh) can suffer from vanishing gradients, especially in deep networks. Nonlinearity sometimes adds computational cost, but this is generally manageable with modern hardware and optimizations.

##### Nonlinear activation functions are indispensable for deep learning, enabling neural networks to generalize across complex data patterns. They allow neural networks to make use of depth, with each layer learning unique, hierarchical features essential for tasks like image recognition, language processing, and beyond. In contrast, linear functions lack the power to model non-linear relationships, significantly limiting their utility in deep architectures.
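
The claim that linear hidden layers add no power can be checked numerically. The sketch below is illustrative only: the weights `W1`, `b1`, `W2`, `b2` and the input `x` are arbitrary values chosen for this example, and the small matrix helpers are hypothetical stand-ins for a real linear algebra library. It shows that two stacked linear layers equal one linear layer with `W = W2 W1` and `b = W2 b1 + b2`, and that inserting a ReLU between them breaks this equivalence.

```python
# Sketch: two stacked linear layers (no activation) collapse into one linear layer.
# All weights and inputs below are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

def two_linear_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed single layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

def two_layers_with_relu(x):
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

x = [1.0, -2.0]
print("stacked linear :", two_linear_layers(x))     # [-6.0, 6.5]
print("single linear  :", one_linear_layer(x))      # [-6.0, 6.5]
print("with ReLU      :", two_layers_with_relu(x))  # [-1.0, 9.0]
```

The same algebra applies to any number of linear layers, which is why depth only pays off once a nonlinearity sits between them.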

# -------------------------------------------

#### 2. Describe the Sigmoid activation function. What are its characteristics, and in what type of layers is it commonly used? Explain the Rectified Linear Unit (ReLU) activation function. Discuss its advantages and potential challenges. What is the purpose of the Tanh activation function? How does it differ from the Sigmoid activation function?


##### ---> **Sigmoid activation Function**
The sigmoid activation function is a mathematical function commonly used in neural networks to transform input values into a range between 0 and 1. It is defined by the formula: $$f(x) = \frac{1}{1 + e^{-x}}$$
where $x$ is the input to the neuron.

* **Characteristics of Sigmoid:-**
* Output Range: 0 to 1. This bounded range is useful for probabilities or binary classifications where the result can represent a probability-like interpretation.
* Smooth Gradient: The function has a smooth gradient, making it continuous and differentiable.
* Non-linearity: Sigmoid is non-linear, but its gradient is at most 0.25 (reached at x = 0), which can cause problems in deep networks.
* Vanishing Gradient Problem: When the input is large in magnitude (strongly positive or strongly negative), the gradient of the sigmoid function becomes extremely small (close to 0), which slows learning in deep networks because backpropagation has little gradient to propagate.
##### **Common Usage:-**
The sigmoid function was initially used across all layers but is now generally limited to the output layer of binary classification models, where its output is interpreted as the probability of the positive class. It is less commonly used in hidden layers of deep networks because of its vanishing gradient issue.
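
The saturation behaviour described above can be seen by evaluating the sigmoid and its derivative, $f'(x) = f(x)(1 - f(x))$, at a few points. This is a minimal sketch using only the standard library; the sample inputs are arbitrary and the function names are chosen for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

At x = 0 the gradient is 0.25, its maximum; at |x| = 10 it is already below 1e-4, which is the saturation that slows learning.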



##### **Rectified Linear Unit(ReLU) Activation Function:-**
The ReLU activation function is defined as: $$f(x) = \max(0, x)$$
This means it outputs zero if the input is negative, and outputs the input value itself if it’s positive.
* **Characteristics of ReLU:-**
* Output Range: 0 to infinity for positive inputs, and exactly 0 for negative inputs.
* Non-linearity: ReLU introduces non-linearity, allowing the network to learn complex functions.
* Computationally Efficient: ReLU is simple and fast to compute, making it suitable for large networks.
* Sparsity: Many neurons in a layer output exactly zero when using ReLU, which creates sparse activations and can make computation more efficient.
* **Advantages:-**
* ReLU is widely used in hidden layers of deep neural networks because of its efficiency and effectiveness in learning representations.
* It mitigates the vanishing gradient problem: its gradient is exactly 1 for positive inputs, so gradients do not shrink as they pass through active units.
* Since negative inputs are zeroed out, ReLU creates sparse representations, meaning fewer neurons activate simultaneously. This sparsity often leads to faster computation and can act as a form of regularization.
* **Challenges:-**
* ReLU can sometimes lead to neurons permanently dying, where they output zero for all inputs and stop contributing to learning. This occurs if the weights are updated in such a way that the neuron’s input is always negative. Once a neuron "dies," it is unlikely to recover, especially if it continues to receive negative inputs.
* The gradient of ReLU is undefined at zero, but in practice, this is rarely an issue as deep learning frameworks manage it effectively.
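
The dying-neuron problem can be illustrated with a single hypothetical unit. In the sketch below, the weights, the strongly negative bias, and the sample inputs are made-up values; using 0 as the gradient at exactly x = 0 is a common convention assumed here, and `alpha=0.01` for Leaky ReLU is likewise an assumed typical value.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; 0 is used at x == 0 by convention (an assumption)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU; alpha is an assumed small slope for x <= 0."""
    return 1.0 if x > 0 else alpha

def preactivation(inputs, weights, bias):
    """Weighted sum of inputs plus bias for one neuron."""
    return sum(w * i for w, i in zip(weights, inputs)) + bias

# Hypothetical neuron whose bias was pushed strongly negative during training.
weights, bias = [0.5, -0.3], -10.0
samples = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.9], [1.0, 0.0]]  # inputs scaled to [0, 1]

z_values = [preactivation(s, weights, bias) for s in samples]
outputs = [relu(z) for z in z_values]
grads = [relu_grad(z) for z in z_values]
leaky_grads = [leaky_relu_grad(z) for z in z_values]

print("ReLU outputs      :", outputs)      # all zero: the neuron is inactive
print("ReLU gradients    :", grads)        # all zero: no weight update reaches it
print("Leaky ReLU grads  :", leaky_grads)  # small but non-zero
```

Because every gradient is zero, gradient descent cannot move this neuron's weights back into the active region, whereas the small negative-side slope of Leaky ReLU keeps some gradient flowing.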


##### **Tanh Activation Function :-**
The Tanh activation function (hyperbolic tangent) is given by: $$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
It maps inputs to a range between -1 and 1.
* **Characteristics of Tanh:-**
* Output Range: -1 to 1, which centers the data around zero.
* Gradient Range: Similar to sigmoid, the gradients of tanh saturate (become small) at large positive and negative inputs, which can lead to the vanishing gradient problem.
* Symmetry: Tanh is zero-centered, which often leads to faster convergence compared to sigmoid. It balances positive and negative values around zero, which can help in certain types of optimization.
* **Comparison with Sigmoid:-**
* Range: Sigmoid ranges between 0 and 1, while tanh ranges between -1 and 1. This makes tanh zero-centered, which is often beneficial for training convergence.
* Gradient Saturation: Both functions can suffer from vanishing gradients for inputs of large magnitude, but tanh has larger gradients near zero (a maximum of 1, versus 0.25 for sigmoid), so it generally performs better than sigmoid in hidden layers.
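
The two functions are closely related: tanh is a rescaled and shifted sigmoid, $\tanh(x) = 2\,\sigma(2x) - 1$, where $\sigma$ is the sigmoid defined earlier. The sketch below checks this identity numerically and compares the gradients at a few arbitrary sample points; the function names are chosen for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
    print(
        f"x={x:5.1f}  tanh={math.tanh(x):+.5f}  2*sigmoid(2x)-1={rescaled_sigmoid:+.5f}  "
        f"tanh'={tanh_grad(x):.5f}  sigmoid'={sigmoid_grad(x):.5f}"
    )
```

At x = 0 the tanh gradient is 1 while the sigmoid gradient is 0.25; both gradients shrink toward zero as |x| grows.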

#### 3. Discuss the significance of activation functions in the hidden layers of a neural network.
Activation functions play a fundamental role in the hidden layers of a neural network, as they are key to enabling the network to learn complex patterns, hierarchies, and non-linear relationships in the data.

Significance:-
* Activation functions introduce non-linearity to the network, allowing it to model non-linear relationships between inputs and outputs.
* Activation functions in hidden layers enable each layer to learn new representations and abstractions of the input data. This process is crucial for building a hierarchy of features — from simple, low-level features in initial layers to complex, high-level representations in deeper layers.
* Non-linear activation functions allow neural networks to stack many hidden layers, each capturing new levels of abstraction. This enables the model to develop a sophisticated understanding of the input data, increasing its representational power.
* Activation functions in hidden layers are what make deep networks useful for tasks that involve highly complex mappings, such as translation, object detection, and speech recognition.
* Activation functions help control how much of the input signal flows to the next layer. ReLU, for example, passes positive pre-activations unchanged and blocks negative ones entirely.
* Activation functions like Tanh and Sigmoid can saturate for very large or small inputs, meaning their gradients can approach zero in these regions, which slows down learning.
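
The effect of saturation on depth can be sketched with a toy chain of one-unit layers, where each layer computes `a = act(weight * a_prev)`. This is a deliberately simplified assumption (real layers use weight matrices and biases, and `weight=1.0` is chosen only for illustration). By the chain rule, the gradient of the final output with respect to the input is the product of the per-layer factors `weight * act'(z)`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def chain_gradient(x, depth, act, act_grad, weight=1.0):
    """d(output)/d(input) for a chain of `depth` one-unit layers (toy model)."""
    grad, a = 1.0, x
    for _ in range(depth):
        z = weight * a
        grad *= weight * act_grad(z)  # chain rule: multiply by this layer's factor
        a = act(z)
    return grad

depth, x0 = 10, 0.5
print("sigmoid:", chain_gradient(x0, depth, sigmoid, sigmoid_grad))
print("tanh   :", chain_gradient(x0, depth, math.tanh, tanh_grad))
print("relu   :", chain_gradient(x0, depth, relu, relu_grad))
```

Each sigmoid factor is at most 0.25, so after 10 layers the gradient is at most 0.25^10 (under 1e-6) in this toy setting, while the ReLU chain with a positive input keeps a gradient of exactly 1.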

#### 4. Explain the choice of activation functions for different types of problems (e.g., classification, regression) in the output layer.

##### ---> 
| Problem Type                    | Activation Function   | Reason                                           |
|---------------------------------|-----------------------|--------------------------------------------------|
| Binary Classification           | Sigmoid              | Outputs probability between 0 and 1              |
| Multi-Class Classification       | Softmax             | Provides probability distribution across classes |
| Multi-Label Classification       | Sigmoid (per output)| Independent probabilities for each label         |
| Regression (Single Output)       | Linear              | Unbounded continuous output                      |
| Regression (Multiple Output)     | Linear (per output) | Unbounded continuous output for each target      |
| Regression (Bounded Range)       | Sigmoid or Tanh     | Constrains output to (0, 1) or (-1, 1); rescale to the target range |
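
The rows of the table can be made concrete by applying each output activation to the same raw scores. The sketch below is illustrative: the `logits` values and the bounds `low` and `high` are made up for this example, and `bounded_output` is a hypothetical helper, not a library function.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bounded_output(z, low, high):
    """Map a raw score into (low, high) by rescaling a sigmoid."""
    return low + (high - low) * sigmoid(z)

logits = [2.0, 1.0, -1.0]  # hypothetical raw outputs of the last layer

multi_class = softmax(logits)               # one distribution over classes, sums to 1
multi_label = [sigmoid(z) for z in logits]  # independent per-label probabilities
regression = list(logits)                   # linear (identity) activation: unbounded
bounded = bounded_output(logits[0], low=0.0, high=100.0)  # value in (0, 100)

print("softmax :", [round(p, 4) for p in multi_class], "sum =", round(sum(multi_class), 4))
print("sigmoid :", [round(p, 4) for p in multi_label], "sum =", round(sum(multi_label), 4))
print("linear  :", regression)
print("bounded :", round(bounded, 2))
```

Softmax outputs compete with each other and sum to 1, which suits mutually exclusive classes; per-output sigmoids do not sum to 1, which is what multi-label problems need.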


#### 5. Experiment with different activation functions (e.g., ReLU, Sigmoid, Tanh) in a simple neural network architecture. Compare their effects on convergence and performance.

```python
# Step 1: Set up the Experiment
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
import numpy as np

# Load and preprocess MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Scale pixel values to [0, 1]
x_train = x_train.reshape(-1, 28 * 28).astype("float32")  # Flatten 28x28 images
x_test = x_test.reshape(-1, 28 * 28).astype("float32")

# Hyperparameters
input_size = 784
hidden_size = 128
num_classes = 10
num_epochs = 5
batch_size = 64
learning_rate = 0.01

# Step 2: Define a Function to Create the Model
def create_model(activation_fn):
    """Build a one-hidden-layer classifier using the given hidden activation."""
    model = models.Sequential([
        # layers.Input(shape=...) replaces InputLayer(input_shape=...),
        # whose input_shape argument is deprecated in recent Keras versions.
        layers.Input(shape=(input_size,)),
        layers.Dense(hidden_size, activation=activation_fn),
        layers.Dense(num_classes, activation='softmax')  # Softmax for multi-class classification
    ])

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Step 3: Define a Function to Train and Evaluate the Model
def train_and_evaluate(activation_fn, activation_name):
    """Train a fresh model and return (test accuracy in %, per-epoch training loss)."""
    print(f"Training with {activation_name} Activation:")

    model = create_model(activation_fn)

    # Train the model
    history = model.fit(x_train, y_train, epochs=num_epochs, batch_size=batch_size, verbose=1)

    # Evaluate the model
    test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
    print(f"{activation_name} Activation - Test Accuracy: {test_accuracy * 100:.2f}%\n")

    return test_accuracy * 100, history.history['loss']

# Step 4: Run Experiments with Different Activation Functions
# Experiment with ReLU, Sigmoid, and Tanh activation functions
activations = {
    "ReLU": 'relu',
    "Sigmoid": 'sigmoid',
    "Tanh": 'tanh'
}

results = {}
for name, activation in activations.items():
    accuracy, loss_history = train_and_evaluate(activation, name)
    results[name] = {
        "accuracy": accuracy,
        "loss_history": loss_history
    }

# Summary of Results
print("\nSummary of Results:")
for name, result in results.items():
    print(f"{name} Activation - Test Accuracy: {result['accuracy']:.2f}%")
```


```text
Training with ReLU Activation:
Epoch 1/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.6529 - loss: 1.3263
Epoch 2/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8816 - loss: 0.4465
Epoch 3/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.9035 - loss: 0.3569
Epoch 4/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.9132 - loss: 0.3164
Epoch 5/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.9148 - loss: 0.2977
ReLU Activation - Test Accuracy: 92.41%

Training with Sigmoid Activation:
Epoch 1/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.4788 - loss: 2.0200
Epoch 2/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.7989 - loss: 1.1823
Epoch 3/5
938/938 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accur
```

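**Conclusion:-**
ReLU gives the highest test accuracy (92.41%), followed by Tanh and then Sigmoid. The training logs also show Sigmoid converging more slowly than ReLU: after the first epoch its training accuracy is 0.4788, versus 0.6529 for ReLU.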