Q1. What is an activation function in the context of artificial neural networks?

In [1]:
"""In artificial neural networks (ANNs), an activation function is a mathematical function applied to the weighted sum of a
neuron's inputs (plus its bias) to produce the neuron's output. It introduces non-linearity into the network, allowing it to
learn complex patterns and relationships in the data.

Loosely speaking, the activation function determines how strongly a neuron "fires" for a given weighted sum of its inputs.
Without activation functions, every layer would be an affine map, and a stack of affine maps is itself a single affine map,
so the network could only learn linear relationships between inputs and outputs, no matter how many layers it has.

Commonly used activation functions include:

1. **Sigmoid Function**: This function maps the input to the range (0, 1), which is useful for the output layer in binary
classification tasks. However, it saturates for large positive or negative inputs, which causes vanishing gradients during
backpropagation.

2. **Hyperbolic Tangent (Tanh) Function**: Similar to the sigmoid function, but maps the input to the range (-1, 1). Its
zero-centered output and steeper slope near zero ease the vanishing gradient problem somewhat, although it still saturates.

3. **Rectified Linear Unit (ReLU)**: This function returns the input if it is positive, and zero otherwise. ReLU has become
popular due to its simplicity and effectiveness in training deep neural networks.

4. **Leaky ReLU**: A variant of ReLU that allows a small, non-zero gradient when the input is negative. It addresses the
"dying ReLU" problem, where a neuron whose input is consistently negative always outputs zero, receives zero gradient,
and stops learning.

5. **Softmax Function**: Typically used in the output layer of a neural network for multi-class classification tasks.
It squashes the outputs of a network into probabilities that sum up to one, facilitating interpretation as class probabilities.

Activation functions play a crucial role in the training and performance of neural networks. Choosing the appropriate
activation function depends on the nature of the problem, the architecture of the neural network, and considerations such 
as avoiding vanishing gradients and ensuring efficient training."""
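
The claim that a network without activation functions can only represent linear relationships can be checked with a small sketch. The parameters W1, B1, W2, B2 below are arbitrary illustrative values (assumptions, not taken from any trained model), and the one-neuron-per-layer "network" is chosen only to keep the arithmetic visible.

```python
def relu(z):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

# Arbitrary illustrative parameters (assumed, not from a trained model).
W1, B1 = 2.0, -1.0
W2, B2 = -3.0, 0.5

def two_layer(x, activation=None):
    """Two stacked one-neuron layers, optionally with an activation in between."""
    h = W1 * x + B1
    if activation is not None:
        h = activation(h)
    return W2 * h + B2

def collapsed(x):
    """The single affine map that the activation-free stack is equivalent to."""
    return (W2 * W1) * x + (W2 * B1 + B2)

def is_affine_on(f, xs):
    """Check the midpoint property f(a) + f(b) == 2 * f((a + b) / 2) on all pairs."""
    return all(
        abs(f(a) + f(b) - 2 * f((a + b) / 2)) < 1e-9
        for a in xs for b in xs
    )

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Without an activation, the two layers collapse into one affine map.
print(all(abs(two_layer(x) - collapsed(x)) < 1e-9 for x in xs))
print(is_affine_on(two_layer, xs))
# With ReLU in between, the midpoint property fails, so the map is no longer affine.
print(is_affine_on(lambda x: two_layer(x, relu), xs))
```

The midpoint check is just one convenient way to detect non-linearity on a handful of points; it is not a general proof.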


Q2. What are some common types of activation functions used in neural networks?

In [2]:
"""In neural networks, activation functions introduce non-linearity, enabling the network to learn complex patterns in data.
Here are some common types of activation functions used in neural networks:

1. **Sigmoid Function**: Also known as the logistic function, it squashes the input values to the range (0, 1).
It's useful for binary classification tasks in the output layer, but it suffers from the vanishing gradient problem,
especially during backpropagation in deep networks.

2. **Hyperbolic Tangent Function (Tanh)**: Similar to the sigmoid function, but it squashes the input values to the range
(-1, 1). Tanh is zero-centered and has a steeper slope near zero, so it is often preferred over sigmoid in hidden layers,
although it still saturates and can suffer from vanishing gradients.

 
3. **Rectified Linear Unit (ReLU)**: ReLU sets all negative input values to zero and leaves positive values unchanged.
It's computationally efficient and has been widely adopted in deep learning due to its simplicity and effectiveness.


4. **Leaky ReLU**: A variant of ReLU that introduces a small slope for negative input values instead of setting them to
zero completely. It addresses the "dying ReLU" problem, where neurons that only receive negative inputs output zero, get
zero gradient, and become permanently inactive during training.

5. **Parametric ReLU (PReLU)**: Similar to Leaky ReLU, but the slope for negative input values is learned during training
instead of being fixed. It allows the network to adapt the slope according to the data.

6. **Exponential Linear Unit (ELU)**: ELU is the identity for positive input values and equals alpha * (exp(x) - 1) for
negative input values, so it saturates smoothly toward -alpha. The non-zero gradient for negative inputs mitigates the
"dying ReLU" problem, and the negative outputs push mean activations closer to zero.

   

7. **Softmax Function**: Commonly used in the output layer for multi-class classification tasks. It squashes the outputs of
a network into probabilities that sum up to one, facilitating interpretation as class probabilities.

These are some of the common activation functions used in neural networks, each with its own characteristics and suitability 
for different types of problems and architectures. Choosing the appropriate activation function is an important consideration
in designing and training neural networks."""
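
The functions listed above can be written down directly. The following is a minimal scalar sketch using only the standard library; the default values alpha=0.01 for Leaky ReLU and alpha=1.0 for ELU are common choices assumed here for illustration, and in a real PReLU layer alpha would be a learned parameter rather than an argument.

```python
import math

def sigmoid(x):
    """Logistic function, evaluated in a way that avoids overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but with a small fixed slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

def prelu(x, alpha):
    """Same formula as Leaky ReLU; alpha stands in for a learned parameter."""
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    """Identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to one."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x), elu(x))
print(softmax([1.0, 2.0, 3.0]))
```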


Q3. How do activation functions affect the training process and performance of a neural network?


In [3]:
"""Activation functions play a crucial role in the training process and performance of a neural network. Here's how activation functions affect various aspects of neural network training and performance:

1. **Introduction of Non-Linearity**: Activation functions introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Without activation functions, the network would be limited to only learning linear transformations of the input data.

2. **Gradient Flow**: Activation functions influence the flow of gradients during backpropagation, which is essential for updating the weights of the network during training. The choice of activation function affects how gradients are propagated through the network layers.

3. **Vanishing and Exploding Gradients**: Saturating activation functions, such as sigmoid and tanh, are prone to the vanishing gradient problem: their derivatives are at most 0.25 and 1 respectively, and close to zero for large-magnitude inputs, so gradients shrink as they propagate backward through many layers. This can hinder the training of deep networks. ReLU avoids saturation for positive inputs, but because its output is unbounded it does nothing to damp large activations, so with poorly scaled weights gradients can still explode and make training unstable.

4. **Training Speed**: Activation functions can impact the convergence speed of the training process. Activation functions that allow gradients to flow more easily, such as ReLU, often result in faster convergence compared to activation functions that suffer from vanishing gradients.

5. **Activation Saturation**: Activation functions can saturate, meaning that inputs of large magnitude fall in a flat region of the function where the gradient is close to zero, causing the network to learn slowly. ReLU addresses this issue to some extent by avoiding saturation for positive inputs, although its gradient is exactly zero for negative inputs.

6. **Model Capacity**: The choice of activation function can affect the capacity of the model to represent complex functions. Some activation functions, like ReLU variants, are computationally efficient and allow for deeper networks with more parameters, enabling them to capture intricate patterns in the data.

7. **Stability and Robustness**: Activation functions can affect the stability and robustness of the trained model. Choosing appropriate activation functions can help prevent issues like vanishing or exploding gradients, which can lead to more stable and reliable models.

In summary, activation functions are a critical component of neural networks that influence training dynamics, convergence speed, stability, and the network's capacity to learn complex patterns. It's essential to choose activation functions carefully based on the characteristics of the data and the requirements of the task at hand to achieve optimal performance in neural network training."""
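
The effect of depth on gradient flow can be illustrated with a deliberately simplified sketch. It multiplies one local derivative per layer and ignores the weights entirely, assuming every layer sees the same pre-activation z; both are simplifying assumptions made for illustration, not a model of a real network.

```python
import math

def sigmoid(z):
    """Logistic function (fine for the moderate z values used here)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights are ignored)."""
    return local_grad(z) ** depth

# Best case for sigmoid (z = 0, derivative 0.25) versus an active ReLU unit (z = 1).
for depth in (1, 5, 10, 20):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),
          chained_gradient(relu_grad, 1.0, depth))
```

Even at the sigmoid's most favourable input, the factor shrinks geometrically with depth, while the active ReLU path keeps a factor of exactly 1.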

Q4. How does the sigmoid activation function work? What are its advantages and disadvantages?

In [4]:
r"""The sigmoid activation function, also known as the logistic function, is a widely used non-linear activation function in artificial neural networks. Here's how the sigmoid activation function works:

### Sigmoid Activation Function:
The sigmoid function is defined as:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
where \( z \) is the weighted sum of the inputs to the neuron (plus its bias), also known as the pre-activation.

### Working of Sigmoid Activation Function:
1. The sigmoid function takes an input \( z \) and maps it to a value between 0 and 1.
2. For large positive values of \( z \), the sigmoid function outputs a value close to 1.
3. For large negative values of \( z \), the sigmoid function outputs a value close to 0.
4. For \( z = 0 \), the sigmoid function outputs 0.5.

### Advantages of Sigmoid Activation Function:
1. **Output Range**: The sigmoid function squashes its input values to the range (0, 1), making it suitable for tasks where the output needs to be interpreted as a probability. It's commonly used in the output layer of binary classification tasks.
2. **Smooth Gradient**: The sigmoid function has a smooth derivative, which makes it well-suited for gradient-based optimization algorithms like gradient descent during training.

### Disadvantages of Sigmoid Activation Function:
1. **Vanishing Gradient**: The derivative of the sigmoid is \( \sigma'(z) = \sigma(z)(1 - \sigma(z)) \), which is at most 0.25 (at \( z = 0 \)) and becomes very small for large positive or negative inputs. Multiplying many such factors during backpropagation shrinks the gradient, which can slow down training, especially in deep neural networks.
2. **Output Saturation**: For extreme input values, the output of the sigmoid function saturates, meaning it becomes very close to 0 or 1. This is the cause of the vanishing gradient described above: in the saturated regions the gradient is close to zero, which "kills" the learning signal.
3. **Not Zero-Centered**: The sigmoid function is not zero-centered, meaning its output is always positive. As a result, the gradients with respect to all incoming weights of a neuron in the next layer share the same sign, which can cause inefficient zig-zagging weight updates.

### Summary:
The sigmoid activation function has the advantage of producing outputs in the range (0, 1) and having a smooth gradient, making it suitable for certain tasks like binary classification. However, it suffers from the vanishing gradient problem, output saturation, and lack of zero-centeredness, which can limit its effectiveness, especially in deep neural networks. As a result, alternative activation functions like ReLU and its variants are often preferred in modern neural network architectures."""
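
The formula and its saturation behaviour can be checked numerically. This is a minimal sketch; the branch on the sign of z is one common way to avoid overflow in `math.exp` and is an implementation choice assumed here, not something the definition requires.

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), evaluated without overflowing exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z, exp(z) is small and safe
    return e / (1.0 + e)

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses toward zero as |z| grows.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.6f}  derivative={sigmoid_derivative(z):.6f}")
```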


Q5. What is the rectified linear unit (ReLU) activation function? How does it differ from the sigmoid function?

In [5]:
r"""The Rectified Linear Unit (ReLU) activation function is a non-linear activation function widely used in artificial neural networks, particularly in deep learning architectures. Here's how the ReLU activation function works and how it differs from the sigmoid function:

### Rectified Linear Unit (ReLU) Activation Function:
The ReLU activation function is defined as:
\[ f(x) = \max(0, x) \]
where \( x \) is the input to the neuron.

### Working of ReLU Activation Function:
1. The ReLU function outputs the input value \( x \) if it is positive, and 0 otherwise.
2. Its derivative is 1 for \( x > 0 \) and 0 for \( x < 0 \); at \( x = 0 \) the derivative is undefined, and implementations commonly use 0 there.
3. The function is unbounded above: there is no upper limit on the output for positive \( x \).

### Advantages of ReLU Activation Function:
1. **Sparsity**: ReLU introduces sparsity in the network by setting negative values to zero. Where the implementation exploits it, this sparsity can improve the efficiency of the network during training and inference.
2. **Efficiency**: ReLU is computationally efficient compared to activation functions like sigmoid and tanh, as it involves simple thresholding operations.
3. **Avoids Vanishing Gradient**: ReLU helps to alleviate the vanishing gradient problem, which can occur with activation functions like sigmoid and tanh. By allowing gradients to flow more freely for positive inputs, ReLU facilitates training of deep neural networks.

### Differences from Sigmoid Function:
1. **Output Range**: The sigmoid function squashes its input values to the range (0, 1), while ReLU outputs the input value for positive inputs and 0 for negative inputs. ReLU does not have an upper bound on its output, unlike sigmoid.
2. **Non-Linearity**: Both sigmoid and ReLU are non-linear activation functions, but ReLU introduces a piecewise linear non-linearity, while sigmoid introduces a smooth non-linearity.
3. **Gradient Properties**: ReLU has a constant gradient of 1 for positive inputs, making it less susceptible to the vanishing gradient problem compared to sigmoid, which has gradients that become very small for large positive or negative inputs.

### Summary:
The ReLU activation function is a simple yet powerful activation function widely used in modern neural networks. It offers advantages such as sparsity, computational efficiency, and mitigation of the vanishing gradient problem. Compared to sigmoid, ReLU has a different output range, non-linearity, and gradient properties, making it well-suited for training deep neural networks."""
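
The differences in output range and gradient can be seen side by side in a short sketch. The sample inputs are arbitrary, and setting the ReLU derivative to 0 at exactly x = 0 is a convention assumed here.

```python
import math

def sigmoid(x):
    """Logistic function, evaluated without overflowing exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

# Sigmoid stays below 1 and its gradient fades; ReLU grows without bound
# and keeps a gradient of 1 for every positive input.
for x in (-5.0, -1.0, 0.5, 5.0, 50.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f} (grad {sigmoid_grad(x):.4f})"
          f"  relu={relu(x):.1f} (grad {relu_grad(x):.1f})")
```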

Q6. What are the benefits of using the ReLU activation function over the sigmoid function?

In [6]:
"""Using the Rectified Linear Unit (ReLU) activation function over the sigmoid function offers several benefits, especially in the context of training deep neural networks. Here are some of the key advantages of ReLU over sigmoid:

1. **Avoids Vanishing Gradient Problem**: ReLU helps to mitigate the vanishing gradient problem, which occurs when gradients become very small during backpropagation, especially in deep networks. Sigmoid activation functions tend to saturate for large positive or negative inputs, leading to vanishing gradients. ReLU, on the other hand, has a constant gradient of 1 for positive inputs, allowing gradients to flow more freely and facilitating training of deep networks.

2. **Promotes Sparse Activation**: ReLU introduces sparsity in the network by setting negative values to zero. This sparsity can reduce computation and memory usage when the implementation takes advantage of it, as fewer neurons are active for a given input.

3. **Computational Efficiency**: ReLU is computationally more efficient than sigmoid and tanh activation functions. This efficiency is due to the simple thresholding operation of ReLU, which involves no exponentiation or division operations required by sigmoid and tanh functions.

4. **Faster Convergence**: ReLU activation functions often lead to faster convergence during training compared to sigmoid. The lack of saturation for positive inputs and the avoidance of vanishing gradients allow ReLU networks to learn more efficiently in practice, although, as with any activation, convergence to an optimal solution is not guaranteed.

5. **Better Representation Learning**: A ReLU network computes a piecewise linear function, and stacking layers increases the number of linear pieces it can represent. In practice this lets deep ReLU networks capture complex patterns and relationships in the data, whereas deep sigmoid networks are often harder to train to the same quality.

6. **Improved Gradient Flow**: ReLU has a more stable gradient flow compared to sigmoid. This stability helps in training deeper networks without the risk of gradients becoming too small to update the network parameters effectively.

7. **Caveat: Dead Neurons**: The non-saturation of ReLU applies only to positive inputs. A ReLU neuron whose input is negative for every training example outputs zero and receives zero gradient, so it stops learning (the "dying ReLU" problem). Sigmoid neurons do not die in this sense, although they learn very slowly once saturated. Variants such as Leaky ReLU are designed to mitigate this issue.

In summary, using ReLU activation functions over sigmoid offers advantages such as mitigating the vanishing gradient problem, promoting sparsity, improving computational efficiency, enabling faster convergence, and facilitating better representation learning, at the cost of possible dead neurons. These benefits make ReLU a popular choice in modern neural network architectures, especially for deep learning tasks."""
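
The sparsity point can be illustrated with a small experiment. The sketch below assumes pre-activations drawn from a standard normal distribution with a fixed seed, which is an illustrative assumption rather than a property of any particular network.

```python
import math
import random

def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function, evaluated without overflowing exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def zero_fraction(values):
    """Share of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

relu_out = [relu(z) for z in pre_activations]
sigmoid_out = [sigmoid(z) for z in pre_activations]

# Roughly half of the ReLU outputs are exactly zero; no sigmoid output is.
print("ReLU zero fraction:   ", zero_fraction(relu_out))
print("Sigmoid zero fraction:", zero_fraction(sigmoid_out))
```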


Q7. Explain the concept of "leaky ReLU" and how it addresses the vanishing gradient problem.

In [None]:
r"""Leaky ReLU (Leaky Rectified Linear Unit) is a variant of the standard ReLU activation function that addresses one of its limitations, the "dying ReLU" problem. In standard ReLU, the output and the gradient are both zero whenever the input is negative, so a neuron whose input stays negative receives no weight updates and stops learning. Leaky ReLU introduces a small slope for negative inputs, so that such neurons still receive a gradient. Here's how Leaky ReLU works and how it addresses this form of the vanishing gradient problem:

### Leaky ReLU Activation Function:
The Leaky ReLU function is defined as:
\[ f(x) = \begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0 
\end{cases} \]
where \( \alpha \) is a small constant slope (typically a small positive value, e.g., 0.01) applied to negative inputs.

### Working of Leaky ReLU:
1. For positive inputs (\( x > 0 \)), Leaky ReLU behaves the same as the standard ReLU function, outputting the input value.
2. For negative inputs (\( x \leq 0 \)), Leaky ReLU outputs \( \alpha x \) instead of zero, so its gradient there is \( \alpha \) rather than 0. The slope parameter \( \alpha \) is usually a small positive value (e.g., 0.01).

### Addressing the Vanishing Gradient Problem:
The Leaky ReLU activation function addresses a specific form of the vanishing gradient problem: gradients that are exactly zero for negative inputs. In standard ReLU, a neuron whose input is negative outputs zero and passes back a zero gradient during backpropagation, so its weights are not updated. If this happens for all training inputs, the neuron never recovers, which can hinder the training of deep neural networks.

By introducing a small slope for negative inputs, Leaky ReLU ensures that gradients are not completely zero for negative inputs, allowing for some flow of information and gradient updates during backpropagation. This property helps to prevent neurons from becoming completely inactive and addresses the issue of "dead neurons" or the "dying ReLU" problem observed in standard ReLU activation functions.

### Advantages of Leaky ReLU:
1. **Prevents Neuron Inactivity**: Leaky ReLU prevents neurons from becoming completely inactive for negative inputs, ensuring that they continue to contribute to the learning process.
2. **Mitigates Vanishing Gradient**: By introducing a small slope for negative inputs, Leaky ReLU ensures that gradients do not vanish completely, facilitating more stable training. Note that the gradient for negative inputs is scaled by \( \alpha \), so it is small rather than zero.
3. **Easy to Implement**: Leaky ReLU is simple to implement and computationally efficient, making it a popular choice in practice.

In summary, Leaky ReLU is a variant of the ReLU activation function that addresses the vanishing gradient problem by introducing a small slope for negative inputs, ensuring that neurons remain active and allowing for more stable training of deep neural networks."""
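
The difference between a dead ReLU neuron and its leaky counterpart can be sketched with a single neuron. The weight, bias, and inputs below are hypothetical values chosen so that the pre-activation is negative for every input; alpha=0.01 follows the example value given above.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if z > 0 else alpha

# Hypothetical "dead" neuron: w and b chosen so z = w * x + b < 0 for every input.
w, b = -1.0, -0.5
inputs = [0.5, 1.0, 2.0, 3.0]

def weight_gradients(act_grad):
    """d(output)/d(w) = act'(z) * x for each input x, with z = w * x + b."""
    return [act_grad(w * x + b) * x for x in inputs]

relu_grads = weight_gradients(relu_grad)
leaky_grads = weight_gradients(leaky_relu_grad)

# ReLU gives no learning signal at all; Leaky ReLU gives a small, non-zero one.
print("ReLU gradients:      ", relu_grads)
print("Leaky ReLU gradients:", leaky_grads)
```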

Q8. What is the purpose of the softmax activation function? When is it commonly used?

In [7]:
r"""The softmax activation function is commonly used in the output layer of neural networks, especially in multi-class classification tasks. Its primary purpose is to transform the raw output scores of a neural network into probabilities, with each probability representing the likelihood of the input belonging to a particular class. Here's a detailed explanation of the purpose and common usage of the softmax activation function:

### Purpose of Softmax Activation Function:

1. **Probabilistic Interpretation**: The softmax function converts the raw output scores (also known as logits) produced by the neural network into probabilities. These probabilities represent the likelihood or confidence of the input belonging to each class.

2. **Normalization**: Softmax normalizes the output scores so that they sum up to 1. This normalization ensures that the output values represent valid probability distributions, making it easier to interpret and compare the model's predictions across different classes.

3. **Multi-Class Classification**: Softmax is particularly useful in multi-class classification tasks, where the goal is to classify inputs into one of multiple mutually exclusive classes. By converting the raw output scores into probabilities, softmax enables the model to make probabilistic predictions across multiple classes.

### Working of Softmax Activation Function:

The softmax function is defined as follows for a vector of raw scores \( z \) (logits) for \( K \) classes:

\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

1. **Exponentiation**: The softmax function exponentiates each raw score \( z_i \) to make it positive. This ensures that all class probabilities are non-negative.

2. **Normalization**: The exponentiated scores are then divided by the sum of all exponentiated scores across all classes. This step ensures that the output probabilities sum up to 1, creating a valid probability distribution.

3. **Probabilistic Interpretation**: The resulting values represent the probabilities of the input belonging to each class. The class with the highest probability is typically chosen as the predicted class label.

### Common Usage of Softmax Activation Function:

The softmax activation function is commonly used in the following scenarios:

1. **Multi-Class Classification**: Softmax is extensively used in multi-class classification tasks, where the goal is to classify inputs into one of multiple mutually exclusive classes. It allows the model to produce probabilistic predictions across multiple classes.

2. **Output Layer of Neural Networks**: Softmax is typically applied in the output layer of neural networks, especially when the network is designed for multi-class classification tasks. It transforms the raw output scores of the network into probability distributions over the classes.

3. **Training and Evaluation**: Softmax probabilities are the input to the cross-entropy loss used to train classifiers, and they can be used to gauge the confidence of the model's predictions. Accuracy, by contrast, depends only on which class has the highest score, which softmax does not change.

In summary, the softmax activation function serves the purpose of transforming raw output scores into probabilities, enabling the model to make probabilistic predictions across multiple classes. It is commonly used in multi-class classification tasks and is applied in the output layer of neural networks to produce valid probability distributions over the classes."""
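
The exponentiate-then-normalize steps described above can be sketched in a few lines. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged; it is an implementation detail assumed here, not part of the definition. The logits are made-up example values.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)  # shift for numerical stability; does not change the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

logits = [1.0, 2.0, 3.0]  # example scores for K = 3 classes
probs = softmax(logits)

print("probabilities:  ", [round(p, 4) for p in probs])
print("sum:            ", sum(probs))
print("predicted class:", probs.index(max(probs)))
print("loss if class 2 is correct:", cross_entropy(probs, 2))

# Very large logits would overflow a naive exp(); the shifted version does not.
print(softmax([1000.0, 1001.0, 1002.0]))
```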

Q9. What is the hyperbolic tangent (tanh) activation function? How does it compare to the sigmoid function?

In [9]:
r"""The hyperbolic tangent (tanh) activation function is a non-linear activation function commonly used in artificial neural
networks. It is similar to the sigmoid function but has a range between -1 and 1, making it symmetric around the origin. 
Here's a detailed explanation of the tanh activation function and how it compares to the sigmoid function:

### Hyperbolic Tangent (tanh) Activation Function:
The hyperbolic tangent function, denoted as \( \tanh(x) \), is defined as:
\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

### Working of Tanh Activation Function:
1. The tanh function maps the input \( x \) to a value between -1 and 1 and is symmetric around the origin.
2. For large positive values of \( x \), the tanh function outputs values close to 1.
3. For large negative values of \( x \), the tanh function outputs values close to -1.
4. For \( x = 0 \), the tanh function outputs 0.

### Comparison with Sigmoid Function:
1. **Range**: The main difference between tanh and sigmoid functions is their output range. While sigmoid outputs values
between 0 and 1, tanh outputs values between -1 and 1. In fact, tanh is a rescaled and shifted sigmoid,
\( \tanh(x) = 2\sigma(2x) - 1 \), so the two functions have the same S-shape.

2. **Symmetry**: Tanh is an odd function: it is symmetric around the origin, with negative values for negative inputs,
positive values for positive inputs, and zero at the origin. In contrast, the sigmoid function is centered on the point
(0, 0.5) and approaches zero as the input becomes more negative.

3. **Saturation**: Both tanh and sigmoid functions suffer from saturation for extreme input values, leading to vanishing
gradients. Tanh reaches its flat regions at about half the input magnitude that sigmoid does, but its maximum slope is 1
(at \( x = 0 \)) versus 0.25 for sigmoid, so gradients passing through unsaturated tanh units shrink less.

4. **Zero-Centered**: One advantage of tanh over sigmoid is that it is zero-centered: for inputs distributed symmetrically
around zero, its outputs have a mean of zero. This property can aid optimization algorithms by making the gradient updates
more consistent and reducing the risk of gradient descent being biased in one direction.

### Summary:
The tanh activation function is a non-linear activation function commonly used in neural networks, similar to the sigmoid
function but with a wider output range between -1 and 1. It is symmetric around the origin, zero-centered, and has a
steeper slope near zero than sigmoid. While both tanh and sigmoid functions suffer from saturation and vanishing gradient
problems, tanh's zero-centered property and larger gradients make it a preferred choice in certain scenarios, especially
in recurrent neural networks."""
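
The relationship between tanh and sigmoid can be checked numerically. The sketch below compares the standard-library `math.tanh` with the exponential definition and with the identity tanh(x) = 2 * sigmoid(2x) - 1, and evaluates both derivatives at zero; the sample points are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic function, evaluated without overflowing exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh_from_definition(x):
    """(e^x - e^-x) / (e^x + e^-x); fine for moderate x, overflows for huge x."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh expressed as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

points = (-3.0, -1.0, 0.0, 1.0, 3.0)
for x in points:
    print(x, math.tanh(x), tanh_from_definition(x), tanh_from_sigmoid(x))

# Maximum slopes, both attained at x = 0.
print("tanh'(0) =", tanh_grad(0.0), " sigmoid'(0) =", sigmoid_grad(0.0))
```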