Activation functions play a crucial role in neural networks by introducing non-linearity into the model. Without activation functions, any stack of layers would reduce to a single linear (affine) transformation, limiting the network's ability to learn complex patterns and relationships in the data. Here’s a breakdown of why activation functions are needed and the types commonly used:

### Why Activation Functions are Needed

1. **Non-Linearity**: Many real-world problems are non-linear. Activation functions allow the model to learn these non-linear relationships by applying non-linear transformations to the input data.

2. **Complexity**: They enable the network to combine multiple features and learn complex patterns. Without them, stacking layers would be equivalent to a single linear layer, because a composition of linear maps is itself linear, so extra depth would add no representational capacity.

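3. **Gradient Descent**: Training relies on backpropagation, which needs each activation to be differentiable (at least almost everywhere) so that gradients can flow back through the network and update the weights and biases. The choice of activation also determines how well those gradients survive across many layers: functions that saturate shrink them, while functions like ReLU pass them through unchanged for positive inputs.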

4. **Output Control**: Certain activation functions constrain the output to a specific range, making them suitable for specific tasks (e.g., probabilities in classification).
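
The claim in point 2, that stacked layers without activations collapse into one, can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1` and `W2`, the input `x`, and the helper names are illustrative assumptions, and biases are omitted for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights for two layers (not taken from any real model).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two linear layers with no activation in between...
stacked = matvec(W2, matvec(W1, x))
# ...give the same result as a single layer whose weights are W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)

# With ReLU between the layers, the result differs from the collapsed layer.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` agree, while `with_relu` does not, because the ReLU zeroes the negative hidden value before the second layer sees it.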

### Types of Activation Functions

1. **Sigmoid Function**
   - **Formula**: \( f(x) = \frac{1}{1 + e^{-x}} \)
   - **Range**: (0, 1)
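   - **Use**: Primarily used in the output layer for binary classification, where the output is read as a probability.
   - **Pros**: Smooth gradient, output range between 0 and 1.
   - **Cons**: Saturates for inputs of large magnitude (positive or negative), where the gradient approaches zero and can cause vanishing gradients. The output is also not zero-centered.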

2. **Hyperbolic Tangent (Tanh)**
   - **Formula**: \( f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
   - **Range**: (-1, 1)
   - **Use**: Often used in hidden layers of neural networks.
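   - **Pros**: Zero-centered, and its gradient is stronger than sigmoid's (the derivative peaks at 1, versus 0.25 for sigmoid).
   - **Cons**: Still saturates for large-magnitude inputs, so it still suffers from vanishing gradients.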

3. **ReLU (Rectified Linear Unit)**
   - **Formula**: \( f(x) = \max(0, x) \)
   - **Range**: [0, ∞)
   - **Use**: Widely used in hidden layers of deep networks.
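   - **Pros**: Computationally efficient, and mitigates the vanishing gradient problem because the gradient is exactly 1 for positive inputs.
   - **Cons**: Can suffer from the "dying ReLU" problem: a neuron whose input is negative for every example outputs zero, receives zero gradient, and stops learning.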

4. **Leaky ReLU**
   - **Formula**: \( f(x) = \max(0.01x, x) \)
   - **Range**: (-∞, ∞)
   - **Use**: A variant of ReLU that allows a small gradient when the input is negative.
   - **Pros**: Addresses the dying ReLU problem.
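   - **Cons**: Not differentiable at zero, so it still lacks the smoothness of sigmoid or tanh. The negative slope (0.01 here) is an extra hyperparameter to choose.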

5. **Softmax**
   - **Formula**: \( f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \)
   - **Range**: (0, 1) for all outputs, sums to 1.
   - **Use**: Commonly used in multi-class classification problems.
   - **Pros**: Outputs can be interpreted as probabilities.
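   - **Cons**: Rarely used as a hidden-layer activation (attention mechanisms are a notable exception), and a naive implementation can overflow for large inputs unless the maximum is subtracted first.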

6. **Swish**
   - **Formula**: \( f(x) = x \cdot \text{sigmoid}(x) \)
   - **Range**: approximately [-0.278, ∞); the function is bounded below, with its minimum near \( x \approx -1.278 \)
   - **Use**: Used in some deep learning models as an alternative to ReLU.
   - **Pros**: Smooth and non-monotonic; has been reported to outperform ReLU in certain cases, particularly in deeper models.
   - **Cons**: More computationally intensive than ReLU.
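
The formulas above can be written out as a minimal sketch in plain Python, using only the standard library. The function names, the `slope` parameter, and the sample inputs are illustrative choices rather than any particular library's API. The sign branch in `sigmoid` and the max-subtraction in `softmax` are common numerical-stability tricks that the formulas themselves do not mention.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, output in (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU; slope=0.01 matches the formula above."""
    return x if x > 0 else slope * x

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)

def softmax(xs):
    """Softmax over a list of scores; subtracting the max avoids overflow."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Compare gradients and outputs across a few sample inputs.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}  "
          f"relu={relu(x):.2f}  leaky={leaky_relu(x):.2f}  swish={swish(x):.4f}")

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Running the loop shows the sigmoid and tanh derivatives shrinking toward zero at x = ±10, which is the saturation behind the vanishing gradient problem, while ReLU and Leaky ReLU pass positive inputs through unchanged.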

### Summary

Choosing the right activation function depends on the specific problem and the architecture of the neural network. While ReLU and its variants are generally favored for deep learning models due to their simplicity and efficiency, other functions like sigmoid, tanh, and softmax are also essential for particular tasks.
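
As an illustration of how these choices typically combine, here is a minimal sketch of a forward pass with ReLU in the hidden layer and softmax at the output. All weights, biases, and layer sizes are hand-picked, hypothetical values rather than a trained model.

```python
import math

def dense(W, b, x):
    """Affine layer: W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]

def softmax(v):
    """Softmax with max-subtraction for numerical stability."""
    m = max(v)
    exps = [math.exp(vi - m) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 3 classes.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, 0.1, -0.2]
W2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]
b2 = [0.0, 0.0, 0.0]

def forward(x):
    hidden = relu(dense(W1, b1, x))        # ReLU in the hidden layer
    return softmax(dense(W2, b2, hidden))  # softmax turns scores into class probabilities

probs = forward([1.0, 2.0])
print(probs)
```

For a binary classifier, the softmax output would usually be replaced by a single sigmoid unit.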