# Building a Feedforward Network

1. The Input Layer
    - receives the data and just passes the values along.

2. The Hidden Layers
    - the network's computational engine
    - they learn increasingly abstract features

3. The Output Layer
    - produces the final result

Feedforward flow: information moves in one direction only, from input to output (no loops).

Fully Connected: every neuron connects to every neuron in the next layer.
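
To make "fully connected" concrete, here is a minimal sketch that counts the weights and biases such a network needs. The helper name `count_parameters` and the layer sizes `[4, 8, 8, 3]` are illustrative assumptions, not something from these notes.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feedforward network.

    layer_sizes is a hypothetical list such as [4, 8, 8, 3]:
    4 inputs, two hidden layers of 8 neurons each, and 3 outputs.
    """
    total = 0
    # Pair each layer with the next one: (4, 8), (8, 8), (8, 3).
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection between the two layers
        biases = n_out          # one bias per neuron in the receiving layer
        total += weights + biases
    return total


print(count_parameters([4, 8, 8, 3]))  # 139
```

Because every neuron connects to every neuron in the next layer, the weight count between two layers is the product of their sizes, which is why parameter counts grow quickly with layer width.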

## The Standard Forward Model: A neuron's calculation

Step 1: The Linear Part (Weighted Sum)

    - a weighted sum of all inputs is calculated
    - a bias term is added to this sum
    - result: z = sum_i(w_i * x_i) + b

Step 2: The Non-Linear Part (Activation)
    - the result z is passed through a non-linear activation function g.
    - this lets the network learn non-linear patterns that a weighted sum alone cannot represent.
    - final output: a = g(z)
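
The two steps above can be sketched for a single neuron as follows. This is a minimal illustration, assuming NumPy and using sigmoid as the activation g; the function name `neuron_forward` and the input values are made up for the example.

```python
import numpy as np


def sigmoid(z):
    """Example activation g; any non-linear function could be used instead."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron_forward(x, w, b, g=sigmoid):
    """Compute one neuron's output a = g(z), with z = sum_i(w_i * x_i) + b."""
    z = np.dot(w, x) + b  # Step 1: weighted sum of the inputs plus the bias
    a = g(z)              # Step 2: non-linear activation
    return z, a


# Illustrative values only.
x = np.array([1.0, 2.0, 3.0])   # inputs
w = np.array([0.5, -1.0, 0.5])  # one weight per input
b = 0.0                         # bias

z, a = neuron_forward(x, w, b)
print(z, a)
```

With these values the weighted sum is 0.5 - 2.0 + 1.5 = 0, so z is 0 and the sigmoid output a is 0.5.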


## The Forward Pass: The vectorized layer view

- Problem: calculating the output neuron by neuron is inefficient.
- Solution: compute an entire layer at once using vectorized operations (i.e., matrix multiplication).

![image.png](../public/images/image.png)

![image2.png](../public/images/image2.png)
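
The contrast between the neuron-by-neuron loop and the vectorized layer can be sketched as below. This assumes NumPy and one common convention, which may differ from the one in the images: the weight matrix `W` has shape (neurons, inputs), so the layer computes g(Wx + b). The function names are illustrative.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def layer_forward_loop(x, W, b, g):
    """Neuron-by-neuron version: one dot product per neuron."""
    out = np.empty(W.shape[0])
    for j in range(W.shape[0]):
        out[j] = g(np.dot(W[j], x) + b[j])
    return out


def layer_forward(x, W, b, g):
    """Vectorized version: the whole layer in one matrix-vector product."""
    return g(W @ x + b)


def layer_forward_batch(X, W, b, g):
    """Batched version: each row of X is one example."""
    return g(X @ W.T + b)


rng = np.random.default_rng(0)
x = rng.normal(size=4)       # 4 input features
W = rng.normal(size=(3, 4))  # 3 neurons, each with 4 weights
b = rng.normal(size=3)       # one bias per neuron

same = np.allclose(layer_forward_loop(x, W, b, relu), layer_forward(x, W, b, relu))
print(same)  # True

X = rng.normal(size=(5, 4))  # a batch of 5 examples
print(layer_forward_batch(X, W, b, relu).shape)  # (5, 3)
```

Both versions compute the same numbers; the vectorized form hands the work to optimized linear algebra routines instead of a Python loop, and the batched form extends the same idea to many examples at once.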




# Activation Functions

### The Problem: Stacking Linear Layers is Useless

- Without a non-linear activation function, a deep network simply collapses into a single linear model, no matter how many layers it has.
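
The collapse can be checked numerically. For two layers without an activation, W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2), which is a single linear layer with combined weights and bias. The sketch below assumes NumPy; the helper name `collapse_linear_layers` and the layer shapes are illustrative.

```python
import numpy as np


def collapse_linear_layers(W1, b1, W2, b2):
    """Return (W, b) of the single linear layer equal to two stacked linear layers."""
    W = W2 @ W1
    b = W2 @ b1 + b2
    return W, b


rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)  # layer 1: 3 inputs, 5 neurons
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)  # layer 2: 5 inputs, 2 neurons
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer.
W, b = collapse_linear_layers(W1, b1, W2, b2)
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

The same argument applies repeatedly, so any number of stacked linear layers reduces to one matrix and one bias vector.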

- With non-linearity (g):
    - the equation becomes: g(W2 g(W1x + b1) + b2)
    - this function cannot be collapsed into a single linear layer.
    - allows the network to learn arbitrarily complex patterns.

    - Traditional Activations: Sigmoid
        - sigma(z) = 1/(1+e^(-z))
        - squeezes any real number into the range (0, 1).
        - historically popular for its interpretation as a neuron's firing rate.
        - useful in output layers for binary classification (predicting probabilities).

        **Problems**:
            - Vanishing gradients: the function is flat at both ends, so the gradient is near zero for large positive or negative inputs (it never exceeds 0.25, reached at z = 0). Multiplied across many layers, this effectively stops learning in deep networks.
            - Not zero-centered: outputs are always positive, which can slow down learning.

    - Tanh
        - tanh(z) = (e^z - e^(-z))/(e^z + e^(-z))
        - squeezes any real number into the range (-1, 1).
        - **Zero-centered**: its major advantage over sigmoid. Outputs are centered around zero for the next layer, which often speeds up convergence.

        **Problems**:
            - Vanishing gradients: like sigmoid, it saturates at both ends, which makes it difficult to use in deep networks.

    - ReLU
        - ReLU(z) = max(0,z)
        - computationally efficient (just a threshold)
        - no vanishing gradient for z > 0: the gradient is a constant 1 for positive inputs, allowing learning signals to propagate deep into the network.
        - Sparsity: by outputting 0 for negative inputs, it makes many activations exactly zero, so the network's activations become sparse.

        **Problems**:
            - The Dying ReLU problem: if a neuron's weights are updated such that its pre-activation z is always negative, it will always output 0. The gradient will also be 0, so the weights stop updating and the neuron can never recover. It effectively "dies".

        **Fixing The Dying Problem**:
            - Leaky ReLU:
                - f(z) = z if z > 0, alpha*z if z <= 0, where alpha is a small constant like 0.01
                - introduces a small, non-zero gradient for negative inputs.
                - allows "dead" neurons to be revived; a very common and effective choice.

        **GELU (Gaussian Error Linear Unit)**:
            - a smoother, probabilistic alternative to ReLU: GELU(z) = z * Phi(z), where Phi is the standard normal CDF.
            - became the standard in state-of-the-art Transformer models (e.g., GPT, BERT).
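
The activations discussed above, together with their gradients, can be sketched in NumPy as follows. This is an illustrative implementation, not a reference one: the function names are assumptions, the ReLU gradient at exactly z = 0 is set to 0 by convention, and GELU uses the exact form z * Phi(z) with Phi the standard normal CDF.

```python
import math

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0


def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1 when z = 0


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    # 1 for z > 0, 0 for z < 0; 0 is used at z == 0 by convention here.
    return (np.asarray(z) > 0).astype(float)


def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)


def leaky_relu_grad(z, alpha=0.01):
    # Small but non-zero slope for negative inputs.
    return np.where(z > 0, 1.0, alpha)


def gelu(z):
    # z * Phi(z), where Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))


z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid grad:   ", sigmoid_grad(z))
print("tanh grad:      ", tanh_grad(z))
print("relu grad:      ", relu_grad(z))
print("leaky relu grad:", leaky_relu_grad(z))
print("gelu:           ", gelu(z))
```

At z = -10 and z = 10 the sigmoid and tanh gradients are close to zero (saturation), the ReLU gradient is exactly 0 on the negative side and 1 on the positive side, and the Leaky ReLU gradient stays at alpha for negative inputs, which is what lets a "dead" neuron keep receiving updates.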

- Choosing activations:
    **For Hidden Layers**
    - start with ReLU.
    - for issues with dying neurons, switch to Leaky ReLU or Parametric ReLU.
    - for Transformer-based models, consider GELU or Swish.

    **For Output Layer (Task Dependent)**
    
        - Binary Classification: Sigmoid (outputs a probability between 0 and 1)
        - Multiclass Classification: Softmax (outputs a probability distribution over all classes; all outputs sum to 1)
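
The two output-layer choices can be sketched as below, assuming NumPy. The logit values are illustrative, and subtracting the maximum inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import numpy as np


def sigmoid(z):
    """Map a single logit to a probability in (0, 1) for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Map a vector of logits to a probability distribution over classes."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Binary classification: one output neuron, one probability.
print(sigmoid(np.array(0.0)))  # 0.5

# Multiclass classification: one output neuron per class.
class_logits = np.array([2.0, 1.0, 0.1])
probs = softmax(class_logits)
print(probs, probs.sum())
```

The softmax outputs are all positive and sum to 1, and the largest logit receives the largest probability, so the predicted class is the index of the maximum.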