<a href="https://colab.research.google.com/github/wekann/Assignment/blob/main/Neural_Network_A_Simples_Perception.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''
Q1. What is Deep Learning and how is it connected to Artificial Intelligence?
Definition of Deep Learning:

Deep Learning is a subset of Machine Learning, which itself is a branch of Artificial Intelligence (AI). Deep learning uses neural networks with many layers (hence “deep”) to learn patterns and representations from large amounts of data.
Connection to Artificial Intelligence:

| Concept                          | Description                                                                                                              |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| Artificial Intelligence (AI)     | Broad field focused on building machines that can mimic human intelligence — e.g., decision making, reasoning, learning. |
| Machine Learning (ML)            | A subfield of AI where systems learn from data without being explicitly programmed.                                      |
| Deep Learning (DL)               | A subfield of ML that uses **deep neural networks** to automatically extract features and make predictions.              |

So, deep learning is a tool used to achieve AI.
Example Hierarchy:

```
Artificial Intelligence
│
└── Machine Learning
    ├── Classical algorithms: Decision Trees, SVMs, k-NN, etc.
    │
    └── Deep Learning
        └── Neural Networks: CNNs, RNNs, Transformers, etc.
```
Why Deep Learning Matters in AI:

* High accuracy on complex tasks: vision, speech, NLP.
* Learns features automatically, reducing the need for manual feature engineering.
* Powers AI applications like:

  * Self-driving cars
  * Chatbots (like ChatGPT)
  * Face recognition
  * Language translation

In [None]:
'''Q2. What is a neural network, and what are the different types of neural networks?
What is a Neural Network?

A Neural Network is a computational model inspired by the human brain that is designed to recognize patterns and relationships in data. It consists of layers of interconnected neurons (or nodes), where each neuron processes input and passes the output to the next layer.

Each connection has a weight, and each neuron has an activation function that determines its output.
Basic Structure of a Neural Network:

* Input Layer: Receives the input data
* Hidden Layers: Perform computations and feature extraction
* Output Layer: Produces the final result or prediction

How It Works (Forward Propagation):
1. Inputs are fed to the input layer.
2. Each neuron computes a weighted sum of inputs and applies an activation function.
3. The result is passed to the next layer.
4. This continues until the output layer produces the result.

Types of Neural Networks:

| Type                                        | Description                                                      | Common Use Cases                       |
| ------------------------------------------- | ---------------------------------------------------------------- | -------------------------------------- |
| 1. Feedforward Neural Network (FNN)         | Basic architecture where data flows one way from input to output | Classification, regression             |
| 2. Convolutional Neural Network (CNN)       | Uses convolution layers to process spatial data                  | Image recognition, video analysis      |
| 3. Recurrent Neural Network (RNN)           | Has loops to process sequential data                             | Time series, speech, text              |
| 4. Long Short-Term Memory (LSTM)            | A gated RNN that mitigates the vanishing gradient problem        | Language modeling, machine translation |
| 5. Gated Recurrent Unit (GRU)               | A simplified LSTM, faster to train                               | Sequence data with limited memory      |
| 6. Autoencoder                              | Learns efficient data encoding/decoding                          | Anomaly detection, compression         |
| 7. Generative Adversarial Network (GAN)     | Consists of a generator and discriminator in a game              | Image generation, deepfakes            |
| 8. Radial Basis Function Network (RBFN)     | Uses radial basis functions as activation                        | Function approximation                 |
| 9. Transformer Networks                     | Uses attention mechanism for handling sequences                  | NLP (e.g., ChatGPT, BERT)              |

Visualization (Simple):

```
Input → [Hidden Layer 1] → [Hidden Layer 2] → Output
```

In CNN:

```
Image → [Convolution + Pooling] → [Dense Layers] → Output
```

In [None]:
'''Q3. What is the mathematical structure of a neural network?
A neural network can be described mathematically as a composition of linear algebra operations (like matrix multiplication) followed by non-linear activation functions, organized in layers.
1. Basic Structure (One Neuron)
A single neuron performs the following computation:

$$
z = w^T x + b
$$

$$
a = \phi(z)
$$

Where:

* $x \in \mathbb{R}^n$ is the input vector
* $w \in \mathbb{R}^n$ is the weight vector
* $b \in \mathbb{R}$ is the bias term
* $\phi$ is an activation function (like ReLU, sigmoid)
* $a$ is the neuron's output
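
As a minimal sketch of the neuron defined above, assuming a sigmoid activation and made-up values for $x$, $w$, and $b$ (the numbers and function names are illustrative, not from the assignment):

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, phi=sigmoid):
    """Compute a = phi(w^T x + b) for a single neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # weighted sum plus bias
    return phi(z)

# Illustrative values only
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2
a = neuron(x, w, b)  # here z = 0.5 - 0.5 + 0.3 + 0.2 = 0.5
```

Swapping `phi` for another function (for example ReLU) changes only the last step; the weighted sum is the same.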

2. Layer-wise Computation

For a layer with multiple neurons:

$$
Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}
$$

$$
A^{[l]} = \phi(Z^{[l]})
$$

Where:

* $l$ = layer index
* $A^{[0]} = X$ is the input
* $W^{[l]}$ = weight matrix of shape $(n_l, n_{l-1})$
* $b^{[l]}$ = bias vector of shape $(n_l, 1)$
* $Z^{[l]}$ = pre-activation vector
* $A^{[l]}$ = output (activation) of the layer

3. Forward Propagation

Given input $X$, the network computes:

$$
A^{[1]} = \phi(W^{[1]} X + b^{[1]})
$$

$$
A^{[2]} = \phi(W^{[2]} A^{[1]} + b^{[2]})
$$

$$
\cdots
$$

$$
\hat{Y} = A^{[L]} = \phi(W^{[L]} A^{[L-1]} + b^{[L]})
$$

Where $L$ is the total number of layers and $\hat{Y}$ is the output.
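
The layer-by-layer recursion can be sketched in plain Python as follows. The 2 → 3 → 1 architecture, the hand-picked parameters, and the choice of ReLU for the hidden layer with an identity output are all assumptions made for illustration:

```python
def relu(v):
    return [max(0.0, t) for t in v]

def identity(v):
    return list(v)

def dense(W, a_prev, b, phi):
    """One layer: A = phi(W a_prev + b), with W of shape (n_l, n_{l-1})."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return phi(z)

def forward(x, layers):
    """Apply each (W, b, phi) layer in order, starting from A^[0] = x."""
    a = x
    for W, b, phi in layers:
        a = dense(W, a, b, phi)
    return a

# Hypothetical 2 -> 3 -> 1 network
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.5], relu),
    ([[1.0, 2.0, -1.0]], [0.5], identity),
]
y_hat = forward([2.0, 1.0], layers)  # hidden activations are [1.0, 1.6, 0.0]
```

Each tuple in `layers` plays the role of $(W^{[l]}, b^{[l]}, \phi)$, so adding a layer means appending one more tuple.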

4. Loss Function

To train the network, define a loss function $\mathcal{L}(\hat{Y}, Y)$ to measure the error between predicted and actual outputs.

Common examples:

* MSE for regression:

  $$
  \mathcal{L} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2
  $$
* Binary Crossentropy for classification:

  $$
  \mathcal{L} = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]
  $$
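
Both formulas translate almost directly into code. The sketch below uses plain Python lists; the `eps` clipping is an added safeguard against `log(0)` and is not part of the formula itself, and the sample values are made up:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over m samples."""
    m = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / m

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; predictions are clipped to [eps, 1 - eps]."""
    m = len(y_true)
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / m

reg_loss = mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])     # (0.25 + 0 + 1) / 3
clf_loss = binary_crossentropy([1, 0], [0.9, 0.2])   # -(log 0.9 + log 0.8) / 2
```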

5. Backpropagation (Gradient Descent)

Backpropagation computes the derivatives of the loss with respect to each parameter; gradient descent then uses them to update the weights (the biases are updated the same way):

$$
W^{[l]} := W^{[l]} - \alpha \frac{\partial \mathcal{L}}{\partial W^{[l]}}
$$

Where $\alpha$ is the learning rate.

In [None]:
'''Q4. What is an activation function, and why is it essential in neural networks?
What is an Activation Function?

An activation function is a mathematical function applied to the output of a neuron after computing the weighted sum of its inputs. It introduces non-linearity into the neural network.
Mathematically:

$$
a = \phi(z), \quad \text{where } z = w^T x + b
$$

* $\phi$ = activation function
* $z$ = linear combination of inputs
* $a$ = output after activation

Why is it Essential?

1. Introduces Non-Linearity

   * Without activation functions, the neural network would be just a linear model, no matter how many layers you add.
   * Non-linearity allows networks to learn complex patterns.

2. Enables Deep Learning

   * Activation functions make it possible to stack multiple layers and learn hierarchical features.

3. Helps in Feature Transformation

   * Bounded activations such as sigmoid and tanh squash values into a fixed range (between 0 and 1, or -1 and 1), which can make learning more efficient.
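
The first point can be checked numerically: two stacked layers with no activation give exactly the same output as one layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$. A small sketch with made-up parameters (all names and numbers are illustrative):

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical parameters for two layers with NO activation function
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[2.0, 1.0]], [3.0]
x = [1.0, 4.0]

two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)
```

Both `two_layers` and `one_layer` evaluate to the same vector, so the extra layer adds no expressive power until a non-linear activation is placed between the two.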

Common Activation Functions:

| Function       | Formula                             | Output Range | Use Case                     |
| -------------- | ----------------------------------- | ------------ | ---------------------------- |
| Sigmoid        | $\frac{1}{1 + e^{-z}}$              | (0, 1)       | Binary classification        |
| Tanh           | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | (−1, 1)      | Hidden layers in RNNs        |
| ReLU           | $\max(0, z)$                        | \[0, ∞)      | Most common in hidden layers |
| Leaky ReLU     | $\max(0.01z, z)$                    | (−∞, ∞)      | Fixes dying ReLU problem     |
| Softmax        | $\frac{e^{z_i}}{\sum_j e^{z_j}}$    | (0, 1)       | Output layer for multi-class |

Example:

Suppose a neuron calculates $z = 1.2$

* Without activation: output = 1.2 (linear)
* With ReLU: output = 1.2
* With Sigmoid: output ≈ 0.769
* With Tanh: output ≈ 0.834
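
These outputs can be reproduced with the standard library. This is only a quick check for $z = 1.2$, printed to three decimals:

```python
import math

z = 1.2
relu_out = max(0.0, z)
sigmoid_out = 1.0 / (1.0 + math.exp(-z))
tanh_out = math.tanh(z)

print(f"ReLU: {relu_out:.3f}, Sigmoid: {sigmoid_out:.3f}, Tanh: {tanh_out:.3f}")
```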

In [None]:
'''Q5. Could you list some common activation functions used in neural networks?
Here are the most commonly used activation functions, along with their characteristics, formulas, and typical use cases:
1. Sigmoid (Logistic)

$$
\phi(z) = \frac{1}{1 + e^{-z}}
$$

* Output Range: (0, 1)
* Pros: Good for binary classification.
* Cons: Vanishing gradient problem, slow convergence.

Use in: Output layer for binary classification.

2. Tanh (Hyperbolic Tangent)

$$
\phi(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
$$

* Output Range: (−1, 1)
* Pros: Zero-centered output (better than sigmoid).
* Cons: Still suffers from vanishing gradients.

Use in: Hidden layers, especially in RNNs.

3. ReLU (Rectified Linear Unit)

$$
\phi(z) = \max(0, z)
$$

* Output Range: \[0, ∞)
* Pros: Fast training, sparse activation.
* Cons: "Dying ReLU" problem (some neurons become inactive).

Use in: Hidden layers in CNNs, DNNs.

4. Leaky ReLU

$$
\phi(z) = \begin{cases}
z & \text{if } z > 0 \\
\alpha z & \text{if } z \leq 0
\end{cases}
\quad \text{(typically } \alpha = 0.01\text{)}
$$

Fixes ReLU by allowing a small gradient when $z \leq 0$.

Use in: Deep neural networks to avoid dead neurons.

5. Parametric ReLU (PReLU)

Similar to Leaky ReLU, but $\alpha$ is learned during training.

Use in: Custom or advanced networks needing more flexibility.

6. ELU (Exponential Linear Unit)

$$
\phi(z) = \begin{cases}
z & \text{if } z > 0 \\
\alpha (e^z - 1) & \text{if } z \leq 0
\end{cases}
$$

Smoother output than ReLU, avoids zero gradient.

Use in: Deep models where ReLU struggles.

7. Softmax

$$
\phi(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
$$

* Output Range: (0, 1), all values sum to 1.
* Converts logits into probability distribution.

Use in: Output layer of multi-class classification models.

Summary Table:

| Activation | Range   | Use Case                            |
| ---------- | ------- | ----------------------------------- |
| Sigmoid    | (0, 1)  | Binary classification (output)      |
| Tanh       | (−1, 1) | RNNs, hidden layers                 |
| ReLU       | \[0, ∞) | Hidden layers (CNNs, MLPs)          |
| Leaky ReLU | (−∞, ∞) | Avoid dying neurons in deep nets    |
| PReLU      | (−∞, ∞) | Learnable variant of Leaky ReLU     |
| ELU        | (−α, ∞) | Smooth version of ReLU              |
| Softmax    | (0, 1)  | Multi-class classification (output) |
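
A few of the less common functions above, sketched in plain Python. The defaults `alpha=0.01` for Leaky ReLU and `alpha=1.0` for ELU are assumptions (common choices, not prescribed by the text), and the max-subtraction in softmax is a standard numerical-stability trick that does not change the result:

```python
import math

def leaky_relu(z, alpha=0.01):
    """z for positive inputs, a small slope alpha * z otherwise."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """z for positive inputs, alpha * (e^z - 1) otherwise (bounded below by -alpha)."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # three probabilities that sum to 1
```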

In [None]:
'''Q6. What is a multilayer neural network?

Definition:
A Multilayer Neural Network (also known as a Multilayer Perceptron or MLP) is a type of feedforward artificial neural network that contains one or more hidden layers between the input and output layers.

These layers enable the model to learn complex, non-linear relationships in data.
Structure of a Multilayer Neural Network:

1. Input Layer

   * Receives the raw data (features).

2. Hidden Layers

   * One or more layers that process inputs using weights, biases, and activation functions.
   * These layers extract features and learn representations.

3. Output Layer

   * Produces the final prediction or classification.

Mathematical Representation:
For layer $l$:

$$
Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}
$$

$$
A^{[l]} = \phi(Z^{[l]})
$$

Where:

* $A^{[0]}$ = Input
* $W^{[l]}$, $b^{[l]}$ = Weights and biases
* $\phi$ = Activation function
* $A^{[l]}$ = Output (activation) of layer $l$

Visualization:

```
Input → [Hidden Layer 1] → [Hidden Layer 2] → ... → Output
```

Each arrow represents weighted connections with an activation function.

Key Characteristics:

* Deep learning starts when a neural network has multiple hidden layers.
* Backpropagation is used to train the network by minimizing the loss.
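* With a non-linear activation and enough hidden units, multilayer networks can approximate any continuous function on a bounded input domain to arbitrary accuracy (Universal Approximation Theorem).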

Applications:

* Classification (e.g., spam detection, image recognition)
* Regression (e.g., predicting house prices)
* Time series forecasting
* Pattern recognition and NLP (with added architectures)

In [None]:
'''Q7. What is a loss function, and why is it crucial for neural network training?
What is a Loss Function?

A loss function (also called a cost function or objective function) is a mathematical function that measures the difference between the predicted output of a neural network and the actual target value (ground truth).

It quantifies how wrong the model's prediction is.

Mathematical Form:
For a single training example:

$$
\mathcal{L}(\hat{y}, y)
$$

* $\hat{y}$ = predicted output
* $y$ = actual output
* $\mathcal{L}$ = loss value (scalar)

For all training examples (m samples):

$$
\text{Total Loss} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})
$$

Why is it Crucial?

1. Guides Learning:

   * The loss function tells the model how well it is performing.
   * Training algorithms (like gradient descent) use the loss to adjust weights in the network.

2. Objective for Optimization:

   * The goal of training is to minimize the loss.
   * The lower the loss, the better the model is performing on the task.

3. Backpropagation Depends on It:

   * Loss values are used to compute gradients that flow backward through the network during training.

Common Loss Functions:

| Loss Function                 | Formula                                      | Use Case                           |
| ----------------------------- | -------------------------------------------- | ---------------------------------- |
| Mean Squared Error (MSE)      | $\frac{1}{m} \sum (y - \hat{y})^2$           | Regression problems                |
| Mean Absolute Error (MAE)     | $\frac{1}{m} \sum \lvert y - \hat{y} \rvert$ | Regression (robust to outliers)    |
| Binary Crossentropy           | $-[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$   | Binary classification              |
| Categorical Crossentropy      | $-\sum y_i \log(\hat{y}_i)$                  | Multi-class classification         |
| Hinge Loss                    | $\max(0, 1 - y\hat{y})$                      | SVM-like classification            |
| Huber Loss                    | Mix of MSE and MAE                           | Regression with outlier robustness |


In [None]:
'''Q8. What are some common types of loss functions?
Loss functions differ depending on the type of task: regression, binary classification, or multi-class classification.
1. Regression Loss Functions

Used when the output is continuous (e.g., predicting price, temperature).

| Loss Function                 | Formula                                          | Description                                         |
| ----------------------------- | ------------------------------------------------ | --------------------------------------------------- |
| Mean Squared Error (MSE)      | $\frac{1}{n} \sum (y_i - \hat{y}_i)^2$           | Penalizes large errors more (quadratic loss).       |
| Mean Absolute Error (MAE)     | $\frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert$ | Treats all errors equally; more robust to outliers. |
| Huber Loss                    | Combines MSE and MAE for robustness.             | Smooth for small errors, MAE-like for large ones.   |
| Log-Cosh Loss                 | $\sum \log(\cosh(\hat{y}_i - y_i))$              | Less sensitive to outliers than MSE.                |
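
The table gives no formula for Huber loss, so here is a sketch of its usual definition: quadratic for errors up to a threshold `delta`, linear beyond it. The default `delta=1.0` and the averaging over samples are assumptions; Log-Cosh follows the summed form shown in the table:

```python
import math

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5 * e^2 if |e| <= delta, else delta * (|e| - 0.5 * delta)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)

def log_cosh(y_true, y_pred):
    """Sum of log(cosh(error)), as in the table above."""
    return sum(math.log(math.cosh(p - t)) for t, p in zip(y_true, y_pred))

# One small error (0.5) and one outlier-sized error (3.0); values are made up
h = huber([0.0, 0.0], [0.5, 3.0])   # (0.125 + 2.5) / 2
```

On the same two errors MSE would be (0.25 + 9) / 2 = 4.625, which shows how much less weight Huber gives the large error.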

2. Classification Loss Functions

Used for discrete outputs, such as class labels.

a) Binary Classification (2 classes):

| Loss Function           | Formula                                        | Description                                                   |
| ----------------------- | ---------------------------------------------- | ------------------------------------------------------------- |
| **Binary Crossentropy** | $-[y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})]$ | Measures difference between predicted prob. and actual class. |

b) Multi-Class Classification:

| Loss Function                                   | Formula                                                  | Description                                                      |
| ----------------------------------------------- | -------------------------------------------------------- | ---------------------------------------------------------------- |
| Categorical Crossentropy                        | $-\sum y_i \log(\hat{y}_i)$                              | Use when labels are one-hot encoded.                             |
| Sparse Categorical Crossentropy                 | Similar to above, but labels are integers (not one-hot). | Use when labels are not one-hot encoded.                         |
| Kullback-Leibler Divergence (KL Divergence)     | $\sum y_i \log\left(\frac{y_i}{\hat{y}_i}\right)$        | Measures how one probability distribution diverges from another. |
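
The difference between the categorical and sparse variants is only the label format. A minimal per-sample sketch (function names and the probability vector are illustrative):

```python
import math

def categorical_crossentropy(y_onehot, y_prob):
    """Cross-entropy with a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_prob) if t > 0)

def sparse_categorical_crossentropy(label, y_prob):
    """Same loss, but the target is the integer index of the true class."""
    return -math.log(y_prob[label])

def kl_divergence(p, q):
    """KL(p || q) = sum p_i * log(p_i / q_i), skipping terms where p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

y_prob = [0.7, 0.2, 0.1]                                   # predicted distribution
loss_onehot = categorical_crossentropy([0, 1, 0], y_prob)  # target given as one-hot
loss_sparse = sparse_categorical_crossentropy(1, y_prob)   # target given as index 1
```

Both calls return $-\log(0.2)$. For a one-hot target, the KL divergence gives the same number as well, because the target's own entropy is zero.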

3. Other Specialized Losses

| Loss Function        | Use Case                                                                  |
| -------------------- | ------------------------------------------------------------------------- |
| Hinge Loss           | Used in SVMs and "max-margin" classifiers.                                |
| Contrastive Loss     | Used in Siamese networks and similarity tasks.                            |
| Triplet Loss         | Used in face recognition and embedding tasks.                             |
| Dice Loss            | Used in image segmentation (especially medical imaging).                  |
| Focal Loss           | Used for imbalanced classification problems (e.g., rare class detection). |

Summary Table

| Task                  | Loss Function                   |
| --------------------- | ------------------------------- |
| Regression            | MSE, MAE, Huber, Log-Cosh       |
| Binary Classification | Binary Crossentropy             |
| Multi-Class           | Categorical/Sparse Crossentropy |
| Probabilistic Models  | KL Divergence                   |
| Similarity/Ranking    | Contrastive, Triplet, Hinge     |
| Image Segmentation    | Dice, Focal                     |

In [None]:
'''Q9. How does a neural network learn?
A neural network learns by adjusting its weights and biases to minimize the loss (error between predicted output and actual output) through a process called training.
The Learning Process (Step-by-Step):

1. Forward Propagation
* The input data is passed through the network layer by layer.
* Each neuron computes a weighted sum and applies an **activation function**.
* The final output $\hat{y}$ is generated.

$$
z = w^T x + b,\quad a = \phi(z)
$$

2. Loss Calculation

* The output $\hat{y}$ is compared with the true label $y$.
* A **loss function** computes the error.

$$
\text{Loss} = \mathcal{L}(\hat{y}, y)
$$

3. Backpropagation

* The network computes gradients of the loss with respect to each weight using the chain rule (in practice, via automatic differentiation).
* This determines how much each weight contributed to the error.

4. Weight Update (Gradient Descent)

* Weights are updated in the opposite direction of the gradient to minimize the loss.

$$
w := w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}
$$

* $\eta$ = learning rate (controls step size)
* Repeat this process for all weights and biases.

Repeated Over Epochs:

* One full pass over the dataset = 1 epoch.
* Network iteratively reduces error and improves predictions.
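
The four steps can be put together in a short training loop. The sketch below trains a single sigmoid neuron with full-batch gradient descent on a tiny made-up dataset; the data, the learning rate of 0.5, and the 100 epochs are all illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (made up): the label is 1 when x is positive
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0   # parameters of a single sigmoid neuron
eta = 0.5         # learning rate (illustrative value)
losses = []

for epoch in range(100):
    # 1. Forward propagation
    y_hat = [sigmoid(w * x + b) for x in xs]
    # 2. Loss calculation (binary cross-entropy, averaged over the dataset)
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, y_hat)) / len(xs)
    losses.append(loss)
    # 3. Backpropagation: for sigmoid + cross-entropy, dL/dz = y_hat - y
    dz = [p - y for p, y in zip(y_hat, ys)]
    dw = sum(d * x for d, x in zip(dz, xs)) / len(xs)
    db = sum(dz) / len(xs)
    # 4. Weight update (gradient descent)
    w -= eta * dw
    b -= eta * db
```

The first recorded loss is $\log 2$ (the neuron predicts 0.5 for everything), and `losses` shrinks from there as `w` grows positive.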

Example Summary:

| Step                 | What Happens                             |
| -------------------- | ---------------------------------------- |
| Forward Pass         | Compute outputs from inputs              |
| Loss Computation     | Measure how wrong predictions are        |
| Backpropagation      | Compute gradients of loss w.r.t. weights |
| Weight Update        | Update weights to reduce loss            |

Optimization Techniques:

* Optimizers: SGD, Adam, RMSProp
* Regularization: Dropout, L2 penalty
* Normalization: Batch normalization
* Data Handling: Batching, shuffling


In [None]:
'''Q10. What is an optimizer in neural networks, and why is it necessary?

What Is an Optimizer?
An optimizer is an algorithm used to adjust the weights and biases of a neural network during training to minimize the loss function.

In simple terms:
The optimizer helps the network learn faster and better by deciding how to update the model's parameters.

Why Is It Necessary?
* Neural networks learn by reducing the error (loss).
* To do that, they must tweak their internal parameters (weights).
* The optimizer guides this update process using gradients computed through backpropagation.

Without an optimizer:
The network cannot improve — it will never learn from its mistakes.

How It Works:
1. Compute gradients of the loss w.r.t. weights:

   $$
   \frac{\partial \text{Loss}}{\partial w}
   $$

2. Update the weights using an optimizer-specific rule (plain gradient descent is shown here):

   $$
   w := w - \eta \cdot \frac{\partial \text{Loss}}{\partial w}
   $$

   Where $\eta$ is the learning rate.

Common Optimizers in Deep Learning:

| Optimizer                             | Description                                               |
| ------------------------------------- | --------------------------------------------------------- |
| SGD (Stochastic Gradient Descent)     | Basic; updates weights from one example or mini-batch.    |
| SGD with Momentum                     | Adds momentum to dampen oscillations and speed up descent. |
| RMSProp                               | Scales learning rate based on recent gradient magnitudes. |
| Adam (Adaptive Moment Estimation)     | Combines momentum + RMSProp. Most commonly used today.    |
| Adagrad                               | Adjusts learning rate for each parameter individually.    |
| Adadelta                              | An improvement over Adagrad.                              |

Example (Adam in Keras):

```python
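from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Placeholder model so the snippet runs on its own (assumed: 20 features, 3 classes)
model = Sequential([Input(shape=(20,)), Dense(3, activation='softmax')])

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
'''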

In [None]:
'''Q11. Could you briefly describe some common optimizers?
Optimizers help improve a neural network's performance by minimizing the loss function through weight updates. Below are some of the most widely used optimizers, each with unique characteristics:

1. SGD (Stochastic Gradient Descent)
* Description: Updates weights using gradients from each mini-batch of data.
* Formula:

  $$
  w := w - \eta \cdot \nabla L
  $$
* Pros: Simple and effective.
* Cons: Slow convergence; can get stuck in local minima.
* Use Case: Works well with convex problems or large datasets.

2. SGD with Momentum
* Description: Adds a velocity term to remember the previous update direction, helping to smooth and speed up convergence.
* Formula:

  $$
  v := \gamma v + \eta \nabla L,\quad w := w - v
  $$
* Pros: Reduces oscillations, accelerates convergence.
* Use Case: Useful in deep networks and sparse gradients.
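
To see the effect of the velocity term, the sketch below runs plain SGD and SGD with momentum on the toy loss $L(w) = w^2$. The objective, the starting point, and the hyperparameters ($\eta = 0.01$, $\gamma = 0.9$, 50 steps) are assumptions chosen for illustration:

```python
def grad(w):
    """Gradient of the toy loss L(w) = w**2."""
    return 2.0 * w

def sgd(w, eta, steps):
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def sgd_momentum(w, eta, gamma, steps):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + eta * grad(w)   # velocity accumulates past gradients
        w = w - v
    return w

w_plain = sgd(5.0, eta=0.01, steps=50)
w_momentum = sgd_momentum(5.0, eta=0.01, gamma=0.9, steps=50)
```

With these settings the momentum run ends noticeably closer to the minimum at $w = 0$ than plain SGD, because consecutive gradients point the same way and their contributions add up in `v`.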

3. RMSProp (Root Mean Square Propagation)
* Description: Scales the learning rate for each parameter based on the moving average of squared gradients.
* Formula:

  $$
  E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta)g_t^2
  $$

  $$
  w := w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t
  $$
* Pros: Handles non-stationary objectives well.
* Use Case: Recommended for RNNs and time series problems.

4. Adam (Adaptive Moment Estimation)

* Description: Combines the advantages of Momentum and RMSProp. Maintains running averages of both gradients and squared gradients.
* Formula:

  * $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
  * $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
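  * Bias correction: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
  * Weight update: $w := w - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
* Pros: Fast convergence, modest memory overhead (two moment estimates per parameter), good default choice.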
* Use Case: Works well in most deep learning applications.
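
A scalar sketch of the Adam update, including the bias-correction step. The function name `adam_minimize`, the toy objective $L(w) = (w - 3)^2$, and the settings in the final call are assumptions for illustration; the defaults for `beta1`, `beta2`, and `eps` are the commonly used values:

```python
import math

def adam_minimize(grad, w, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # running average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction (m and v start at 0)
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective: L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, eta=0.1, steps=500)
```

Because the step is normalized by $\sqrt{\hat{v}_t}$, early updates have a size close to `eta` regardless of the gradient's scale.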

5. Adagrad (Adaptive Gradient)
* Description: Adapts learning rate to parameters, performing larger updates for infrequent parameters.
* Cons: Learning rate decreases too much over time.
* Use Case: Sparse data problems like NLP or recommender systems.

6. Adadelta
* Description: Improvement over Adagrad that limits aggressive decay in learning rate.
* Pros: Better for longer training periods.
* Use Case: Same settings as Adagrad, when training needs to keep making progress over long runs.

Summary Table

| Optimizer    | Highlights                          | Best For                        |
| ------------ | ----------------------------------- | ------------------------------- |
| SGD          | Simple, slow but steady             | Large datasets, basic tasks     |
| Momentum     | Faster convergence                  | Deep networks                   |
| RMSProp      | Adaptive learning rate              | RNNs, non-stationary data       |
| Adam         | Most popular, fast + adaptive       | General-purpose deep learning   |
| Adagrad      | Good for sparse data                | Text/NLP, sparse feature spaces |
| Adadelta     | Fixes Adagrad’s learning rate decay | Longer training runs            |

In [None]:
'''Q12. Can you explain forward and backward propagation?
Forward and Backward Propagation are the two main phases in training a neural network.

They work together to compute predictions, evaluate errors, and adjust weights so the network learns over time.
1. Forward Propagation (Prediction Phase)

> Data moves forward through the network, from input to output, to generate a prediction.

Steps:

1. Input: Feed the input vector $x$ to the network.
2. Linear Transformation:

   $$
   z = w^T x + b
   $$
3. Activation: Apply activation function $a = \phi(z)$ (e.g., ReLU, sigmoid).
4. Repeat for all hidden layers.
5. Output: Get final prediction $\hat{y}$.
6. Loss: Compare $\hat{y}$ to actual label $y$ using a loss function $\mathcal{L}(\hat{y}, y)$.

2. Backward Propagation (Learning Phase)
> Error is propagated backward from output to input to update weights using gradients.

Goal:
Minimize the loss $\mathcal{L}$ by adjusting the weights and biases using gradient descent.

Steps:
1. Compute Gradients:
   Use the chain rule to calculate how the loss changes with respect to each parameter:

   $$
   \frac{\partial \mathcal{L}}{\partial w}
   $$

2. Update Weights:
   Using gradient descent:

   $$
   w := w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}
   $$

   * $\eta$ = learning rate

3. Repeat the process for each layer backward (from output to input).

Example:

For one hidden layer neural network:

* Forward:

  * $z_1 = W_1x + b_1$
  * $a_1 = \text{ReLU}(z_1)$
  * $z_2 = W_2a_1 + b_2$
  * $\hat{y} = \text{Softmax}(z_2)$

* Backward:

  * Compute $\frac{\partial \mathcal{L}}{\partial W_2}, \frac{\partial \mathcal{L}}{\partial W_1}$
  * Update weights with optimizer
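
The example above can be written out by hand for one training sample. This is a sketch under stated assumptions: a 2 → 2 → 2 network, made-up parameters, a one-hot target, and categorical cross-entropy as the loss, for which the output gradient simplifies to $\hat{y} - y$:

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W1, b1, W2, b2, x):
    z1 = [z + b for z, b in zip(matvec(W1, x), b1)]
    a1 = [max(0.0, z) for z in z1]                        # ReLU
    z2 = [z + b for z, b in zip(matvec(W2, a1), b2)]
    return z1, a1, softmax(z2)

def loss_fn(W1, b1, W2, b2, x, y):
    """Categorical cross-entropy for a one-hot target y."""
    y_hat = forward(W1, b1, W2, b2, x)[2]
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))

def backward(W1, b1, W2, b2, x, y):
    z1, a1, y_hat = forward(W1, b1, W2, b2, x)
    dz2 = [p - t for p, t in zip(y_hat, y)]               # softmax + cross-entropy
    dW2 = [[d * a for a in a1] for d in dz2]              # dz2 a1^T
    da1 = [sum(W2[k][j] * dz2[k] for k in range(len(dz2)))
           for j in range(len(a1))]                       # W2^T dz2
    dz1 = [d if z > 0 else 0.0 for d, z in zip(da1, z1)]  # ReLU derivative
    dW1 = [[d * v for v in x] for d in dz1]               # dz1 x^T
    return dW1, dW2

# Hypothetical parameters and a single training example
W1, b1 = [[0.5, 0.3], [-0.8, 0.2]], [0.1, -0.1]
W2, b2 = [[1.0, -1.0], [0.5, 0.7]], [0.0, 0.0]
x, y = [1.0, 2.0], [1, 0]
dW1, dW2 = backward(W1, b1, W2, b2, x, y)
```

The second hidden unit has a negative pre-activation here, so ReLU blocks its gradient and the second row of `dW1` is all zeros. A finite-difference check on `loss_fn` is a simple way to confirm gradients like these.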

Process Overview Diagram:

```
Input → [Layer 1] → [Layer 2] → ... → Output (ŷ)
   ↓                             ↑
Forward Propagation         Backward Propagation
```
Summary:

| Phase             | Purpose                           | Direction      |
| ----------------- | --------------------------------- | -------------- |
| Forward Pass      | Predict output, compute loss      | Input → Output |
| Backward Pass     | Compute gradients, update weights | Output → Input |


In [None]:
'''Q13. What is weight initialization, and how does it impact training?

What Is Weight Initialization?
Weight initialization refers to the process of setting the initial values of the weights (and sometimes biases) of a neural network before training begins.

Why Is It Important?

Neural networks learn by adjusting weights, so their initial values can greatly affect:

* Training speed
* Convergence to a good solution
* Avoidance of vanishing or exploding gradients
* Model accuracy

Bad initialization = slow or failed learning
Good initialization = faster, stable convergence

Common Weight Initialization Methods

| Method                             | Description                                                       |
| ---------------------------------- | ----------------------------------------------------------------- |
| Zero Initialization                | All weights = 0. Not recommended (symmetry problem).              |
| Random Initialization              | Random small values (helps break symmetry).                       |
| Xavier (Glorot) Initialization     | For tanh or sigmoid activations. Balances variance across layers. |
| He Initialization                  | For ReLU or variants. Deals with variance better in deep nets.    |
| LeCun Initialization               | Good for SELU activation.                                         |

1. Xavier (Glorot) Initialization

* Designed to keep the variance of activations and gradients the same across layers.
* Formula:

  $$
  W \sim \mathcal{U} \left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}} \right)
  $$
* Best for: Sigmoid, Tanh

2. He Initialization

* Focuses on maintaining variance for ReLU-type activations.
* Formula:

  $$
  W \sim \mathcal{N}(0, \frac{2}{n_{in}})
  $$
* Best for: ReLU, Leaky ReLU

Impact on Training

| Initialization                | Effect on Training                        |
| ----------------------------- | ----------------------------------------- |
| Poor (e.g., all 0s)           | No learning, neurons become identical     |
| Too large                     | Exploding gradients                       |
| Too small                     | Vanishing gradients                       |
| Well-chosen (e.g., He/Xavier) | Fast, stable convergence, better accuracy |
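
The two formulas above can be sampled with the standard library alone. In this sketch the helper names, the layer sizes (64 inputs, 32 outputs), and the fixed seed are assumptions; $\frac{2}{n_{in}}$ is treated as the variance of the normal distribution, so the standard deviation is its square root:

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_out, n_in) matrix from U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def he_normal(n_in, n_out, rng):
    """Sample an (n_out, n_in) matrix from N(0, 2 / n_in), where 2 / n_in is the variance."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
W_xavier = xavier_uniform(64, 32, rng)
W_he = he_normal(64, 32, rng)

flat = [w for row in W_he for w in row]
he_variance = sum(w * w for w in flat) / len(flat)   # should be close to 2 / 64
```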

Keras Example:

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import HeNormal, GlorotUniform

# Xavier/Glorot for sigmoid/tanh
Dense(64, activation='tanh', kernel_initializer=GlorotUniform())

# He for ReLU
Dense(64, activation='relu', kernel_initializer=HeNormal())
```

In [None]:
'''Q14. What is the vanishing gradient problem in deep learning?
Definition:
The vanishing gradient problem occurs when the gradients of the loss function become extremely small as they are backpropagated through a deep neural network, especially in very deep architectures.

When Does It Happen?
During backpropagation, the gradient (partial derivative of the loss) is calculated layer by layer. In very deep networks, these gradients get multiplied many times by small derivative values (especially from activation functions like sigmoid or tanh), causing them to shrink exponentially.

What’s the Problem?
* Weights in earlier layers update very little (or not at all).
* These layers learn very slowly, or stop learning entirely.
* The model may get stuck and fail to converge to a good solution.
* It leads to poor performance, especially in deep networks like RNNs or deep CNNs.

Mathematical View (Simplified):

Each layer:

$$
z = w^T x + b, \qquad a = \phi(z)
$$

Backpropagation of gradient:

$$
\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}
$$

For sigmoid:

$$
\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) \in (0, 0.25]
$$

Multiplying these small values across layers:

$$
\text{gradient} \approx 0
$$
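
A quick numeric check of this argument, as a sketch only: it takes the most favorable case for sigmoid (every pre-activation is exactly 0, so each layer contributes the maximum derivative of 0.25) and ignores the weight terms that also appear in the chain rule.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Best case: each layer multiplies the gradient by sigmoid_grad(0) = 0.25
for depth in (5, 10, 20):
    factor = sigmoid_grad(0.0) ** depth
    print(f"depth {depth:2d}: gradient scaled by {factor:.3e}")
```

Even in this best case the factor drops below one millionth after 10 layers.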

Example Activation Functions That Cause It:

* Sigmoid
* Tanh

Solutions to Vanishing Gradient:

| Solution                           | Description                                   |
| ---------------------------------- | --------------------------------------------- |
| ReLU Activation                    | Avoids saturation → gradient is 1 for $z > 0$ |
| He/Xavier Initialization           | Keeps variance of gradients consistent        |
| Batch Normalization                | Normalizes activations → stabilizes gradients |
| Residual Connections (ResNets)     | Shortcuts help gradient flow                  |
| LSTM/GRU (for RNNs)                | Designed to handle long-term dependencies     |
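
To make the first two rows concrete, here is a hedged NumPy sketch that backpropagates through a random fully connected stack and compares the gradient that reaches the input. The depth (20), width (64), and random weights are assumptions for illustration, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient_norm(activation, depth=20, width=64, seed=0):
    """Norm of d(sum of outputs)/d(input) for a random dense stack."""
    rng = np.random.default_rng(seed)
    # He scaling for ReLU, Xavier scaling (n_in == n_out) for sigmoid
    std = np.sqrt(2.0 / width) if activation == "relu" else np.sqrt(1.0 / width)

    a = rng.normal(size=width)
    cache = []
    for _ in range(depth):                       # forward pass
        W = rng.normal(0.0, std, size=(width, width))
        z = W @ a
        a = np.maximum(0.0, z) if activation == "relu" else sigmoid(z)
        cache.append((W, z))

    grad = np.ones(width)                        # gradient of sum(outputs)
    for W, z in reversed(cache):                 # backward pass (chain rule)
        if activation == "relu":
            local = (z > 0).astype(float)        # ReLU derivative: 0 or 1
        else:
            s = sigmoid(z)
            local = s * (1.0 - s)                # sigmoid derivative: at most 0.25
        grad = W.T @ (grad * local)
    return np.linalg.norm(grad)

relu_norm = input_gradient_norm("relu")
sigmoid_norm = input_gradient_norm("sigmoid")
print("ReLU + He:       ", relu_norm)
print("Sigmoid + Xavier:", sigmoid_norm)
```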


In [None]:
'''Q15. What is the exploding gradient problem?
Definition:

The exploding gradient problem occurs when the gradients during backpropagation become excessively large, causing unstable training, `NaN` values, or even divergence (the model fails to learn).

When Does It Happen?

In very deep networks or RNNs, backpropagation involves multiplying many gradient terms. If those terms are consistently larger than 1 in magnitude, the gradient can grow exponentially as it moves backward through the layers.

What’s the Problem?

* Weights get huge, leading to:
  * Loss becoming `NaN`
  * Model diverging
  * Training instability
* Network may never converge to a solution
* Makes training unreliable or unusable

Mathematical Intuition:
During backpropagation:

$$
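\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial a_L} \left( \prod_{i=2}^{L} \frac{\partial a_i}{\partial a_{i-1}} \right) \frac{\partial a_1}{\partial w_1}
$$

Here $a_i$ is the activation of layer $i$ and $w_1$ are the weights of the first layer. If every factor in the product has magnitude greater than 1, the gradient grows exponentially with the depth $L$:

$$
\left| \frac{\partial \mathcal{L}}{\partial w_1} \right| \rightarrow \infty \quad \text{as } L \rightarrow \infty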
$$
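
The effect of the repeated product can be checked with plain arithmetic. In this sketch the per-layer factors 1.5 and 0.5 are made-up values, used only to contrast growth with decay.

```python
def backprop_factor(per_layer_factor, depth):
    """Overall scale of the gradient after `depth` identical factors."""
    return per_layer_factor ** depth

for depth in (10, 50, 100):
    exploding = backprop_factor(1.5, depth)   # each factor slightly above 1
    vanishing = backprop_factor(0.5, depth)   # each factor below 1
    print(f"depth {depth:3d}: 1.5^L = {exploding:.3e}, 0.5^L = {vanishing:.3e}")
```

A factor only modestly above 1 is enough: at 50 layers the gradient is already hundreds of millions of times larger.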

Symptoms of Exploding Gradients:

* Loss spikes to infinity or NaN
* Model accuracy jumps randomly
* Weights have abnormally large values
* Training crashes or becomes very slow

Solutions to Exploding Gradients:

| Solution                              | Description                               |
| ------------------------------------- | ----------------------------------------- |
| Gradient Clipping                     | Caps the gradient to a maximum value      |
| Proper Weight Initialization          | Use He/Xavier to stabilize variance       |
| Batch Normalization                   | Reduces internal covariate shift          |
| Smaller Learning Rate                 | Prevents drastic weight updates           |
| Gated RNNs (LSTM/GRU)                 | Handles gradient better than vanilla RNNs |

Example: Gradient Clipping in Keras

```python
from tensorflow.keras.optimizers import Adam

# Clip gradients to max norm of 1.0
optimizer = Adam(clipnorm=1.0)

model.compile(optimizer=optimizer, loss='mse')
```

PRACTICAL QUESTION & ANSWER

In [None]:
'''
Q1. How do you create a simple perceptron for basic binary classification?

import numpy as np

# Step function (activation)
def step_function(x):
    return 1 if x >= 0 else 0

# Perceptron class
class Perceptron:
    def __init__(self, input_size, learning_rate=0.1):
        self.weights = np.zeros(input_size)
        self.bias = 0
        self.lr = learning_rate

    def predict(self, x):
        total = np.dot(x, self.weights) + self.bias
        return step_function(total)

    def train(self, X, y, epochs=10):
        for _ in range(epochs):
            for xi, target in zip(X, y):
                pred = self.predict(xi)
                error = target - pred
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

# Example usage
# Input data (AND gate)
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([0, 0, 0, 1])  # AND output

# Train perceptron
p = Perceptron(input_size=2)
p.train(X, y)

# Test
print("Predictions:")
for xi in X:
    print(f"{xi} => {p.predict(xi)}")


In [None]:
'''Q2. How can you build a neural network with one hidden layer using Keras?
To build a neural network with one hidden layer using Keras, follow these simple steps. Keras is a high-level API in TensorFlow that makes building neural networks easy and readable.

Step-by-Step Guide: One Hidden Layer Neural Network
Step 1: Import Libraries

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
```
Step 2: Build the Model

```python
# Define the model
model = Sequential()

# Input layer + Hidden layer (e.g., 8 neurons, ReLU activation)
model.add(Dense(8, input_dim=2, activation='relu'))

# Output layer (1 neuron for binary classification, sigmoid activation)
model.add(Dense(1, activation='sigmoid'))
```
Step 3: Compile the Model

```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```
Step 4: Train the Model

```python
# Example data: XOR problem (you can use your dataset)
import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,0])

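# Train the model (100 epochs at Adam's default learning rate is often too few to fit XOR; raise epochs if accuracy stays low)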
model.fit(X, y, epochs=100, verbose=1)
```
Step 5: Evaluate / Predict

```python
# Evaluate the model
loss, acc = model.evaluate(X, y)
print(f"Accuracy: {acc:.2f}")

# Predict new values
predictions = model.predict(X)
print("Predictions:\n", predictions)
```
Network Architecture Summary:

* Input layer: 2 input neurons (from `input_dim=2`)
* Hidden layer: 8 neurons, `ReLU` activation
* Output layer: 1 neuron, `sigmoid` activation (for binary classification)
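
As a sanity check on this architecture, the number of trainable parameters can be worked out by hand. The helper name `dense_params` is hypothetical. It counts one weight per input-unit pair plus one bias per unit, which is how a fully connected layer is parameterized.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(2, 8)   # 2*8 weights + 8 biases = 24
output = dense_params(8, 1)   # 8*1 weights + 1 bias   = 9

print(hidden, output, hidden + output)  # 24 9 33
```

The total should match the parameter count reported by `model.summary()` for the model built above.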

In [None]:
'''Q3. How do you initialize weights using the Xavier (Glorot) initialization method in Keras?
To initialize weights using the Xavier (Glorot) initialization method in Keras, you can use:
```python
kernel_initializer='glorot_uniform'
```

or

```python
kernel_initializer='glorot_normal'
```

---
What is Xavier (Glorot) Initialization?

Xavier initialization is designed to maintain the variance of activations across layers:

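* `glorot_uniform`: weights are sampled from a uniform distribution on $[-\text{limit}, \text{limit}]$ with $\text{limit} = \sqrt{6 / (n_{in} + n_{out})}$.
* `glorot_normal`: weights are sampled from a truncated normal distribution with standard deviation $\sqrt{2 / (n_{in} + n_{out})}$.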

Both are suitable for activations like `tanh` or `sigmoid`.
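
For a concrete sense of scale, the sketch below computes the sampling range and standard deviation for a layer with 4 inputs and 8 units. The helper names are hypothetical, and the formulas assumed here are `limit = sqrt(6 / (fan_in + fan_out))` for the uniform variant and `stddev = sqrt(2 / (fan_in + fan_out))` for the normal variant.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the uniform sampling range."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def glorot_normal_stddev(fan_in, fan_out):
    """Standard deviation used by the normal variant."""
    return math.sqrt(2.0 / (fan_in + fan_out))

fan_in, fan_out = 4, 8  # same sizes as Dense(8, input_dim=4) in the example below

limit = glorot_uniform_limit(fan_in, fan_out)
stddev = glorot_normal_stddev(fan_in, fan_out)
print(round(limit, 4))   # 0.7071
print(round(stddev, 4))  # 0.4082
```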

Example: Using Xavier Initialization in Keras

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

# Hidden layer with Xavier uniform initialization
model.add(Dense(8, input_dim=4, activation='relu', kernel_initializer='glorot_uniform'))

# Output layer
model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform'))
```
Customizing Further (Optional)

If you want to import and use explicitly:

```python
from tensorflow.keras.initializers import GlorotUniform

model.add(Dense(8, input_dim=4, activation='relu', kernel_initializer=GlorotUniform()))
```
When to Use It:

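* `glorot_uniform`: the default `kernel_initializer` for `Dense` and most other Keras layers; a good match for `tanh` and `sigmoid`.
* `glorot_normal`: same variance target, drawn from a truncated normal instead of a uniform distribution. For `relu` layers, He initialization (`he_normal`, `he_uniform`) is usually preferred.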

In [None]:
'''Q4. How can you apply different activation functions in a neural network in Keras?
To apply different activation functions in a neural network using Keras, you simply specify the desired activation function using the `activation` parameter in each layer (typically `Dense`, `Conv2D`, etc.).

Syntax

```python
Dense(units, activation='activation_name')
```
or by passing the activation function itself:

```python
from tensorflow.keras.activations import relu, sigmoid
Dense(units, activation=relu)
```
Common Activation Functions in Keras

| Activation                      | Use Case                                  |
| ------------------------------- | ----------------------------------------- |
| `'relu'`                        | Most common for hidden layers             |
| `'sigmoid'`                     | Binary classification (output layer)      |
| `'tanh'`                        | Hidden layers (range: -1 to 1)            |
| `'softmax'`                     | Multi-class classification (output layer) |
| `'linear'`                      | Regression (output layer)                 |
| `'elu'`, `'selu'`, `'softplus'` | Advanced cases                            |

Example: Using Different Activation Functions in Keras

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, input_dim=4, activation='tanh'),       # hidden layer with tanh
    Dense(8, activation='relu'),                     # hidden layer with relu
    Dense(1, activation='sigmoid')                   # output layer for binary classification
])
```
Using Functional API for More Control

```python
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense

inputs = Input(shape=(4,))
x = Dense(16, activation='relu')(inputs)
x = Dense(8, activation='tanh')(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=inputs, outputs=outputs)
```

In [None]:
'''Q5. How do you add dropout to a neural network model to prevent overfitting?
To add dropout to a neural network in Keras, you use the `Dropout` layer from `tensorflow.keras.layers`. Dropout helps prevent overfitting by randomly "dropping out" (setting to zero) a fraction of the input units during training.

Step-by-Step: Adding Dropout

Import the Layer

```python
from tensorflow.keras.layers import Dropout
```
Syntax

```python
Dropout(rate)
```

* `rate`: fraction of input units to drop (e.g., 0.5 means 50%).

---
Example: Add Dropout to a Model

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(64,)),   # First hidden layer
    Dropout(0.5),                                        # Dropout applied here (50%)
    Dense(64, activation='relu'),                        # Second hidden layer
    Dropout(0.3),                                        # Dropout applied here (30%)
    Dense(1, activation='sigmoid')                       # Output layer
])
```

Why Use Dropout?

* Helps prevent overfitting by making the network less reliant on specific neurons.
* Forces the network to learn more robust features.
* Only active during training, automatically disabled during inference.
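
The training and inference behavior described above can be sketched in NumPy. This is a simplified version of "inverted" dropout: the kept units are scaled by `1 / (1 - rate)` so the expected activation stays the same. The function name and the `training` flag are illustrative choices.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x                               # inference: input passes through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob     # True for units that are kept
    return x * mask / keep_prob                # rescale so the expected value is preserved

rng = np.random.default_rng(0)
x = np.ones((1000, 100))

train_out = dropout(x, rate=0.5, training=True, rng=rng)
infer_out = dropout(x, rate=0.5, training=False, rng=rng)

print("fraction dropped:", (train_out == 0).mean())   # close to 0.5
print("mean in training:", train_out.mean())          # close to 1.0
print("unchanged at inference:", np.array_equal(infer_out, x))
```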

Notes

* Typical values: `0.2` to `0.5`.
* Place it after Dense or Conv layers, not before.
* Don’t use it in the **output layer**.

---
Optional (Functional API version):

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Dropout

inputs = Input(shape=(64,))
x = Dense(128, activation='relu')(inputs)
x = Dropout(0.5)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.3)(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs, outputs)
```

In [None]:
'''Q6. How do you manually implement forward propagation in a simple neural network?
To manually implement forward propagation in a simple neural network, you simulate how data flows from input to output by calculating each layer’s output step by step using weights, biases, and activation functions, without using libraries like Keras or PyTorch.

What is Forward Propagation?

Forward propagation is the process of computing the output of a neural network from given input by applying:

1. Weighted sum: $z = w \cdot x + b$
2. Activation: $a = \text{activation}(z)$

Example: One Hidden Layer Neural Network

Network Architecture:

* Input: 2 neurons
* Hidden Layer: 2 neurons, ReLU activation
* Output Layer: 1 neuron, Sigmoid activation

Step-by-Step Python Code

```python
import numpy as np

# Activation functions
def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Input (1 sample, 2 features)
x = np.array([0.5, 0.8])

# Weights and biases
# Hidden layer: 2 neurons
W1 = np.array([[0.2, 0.4],     # row i = weights from input i, column j = hidden neuron j
               [0.3, 0.7]])
b1 = np.array([0.1, 0.2])      # bias for hidden neurons

# Output layer: 1 neuron
W2 = np.array([0.6, 0.9])      # weights from hidden to output
b2 = 0.3                       # bias for output neuron

# Forward pass
# Step 1: Input to Hidden
z1 = np.dot(x, W1) + b1        # shape: (2,)
a1 = relu(z1)                  # activation of hidden layer

# Step 2: Hidden to Output
z2 = np.dot(a1, W2) + b2       # scalar
a2 = sigmoid(z2)               # final output

# Output
print("Final Output:", a2)
```

Explanation:

1. `np.dot(x, W1)` computes input → hidden layer.
2. `relu(z1)` applies ReLU activation.
3. `np.dot(a1, W2)` computes hidden → output.
4. `sigmoid(z2)` gives final prediction between 0 and 1.
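
The same four steps extend to a whole batch of inputs by stacking samples as rows. The sketch below reuses the weights from the code above; the function name `forward` and the two extra input rows are assumptions added for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    """Forward pass for a batch X of shape (n_samples, n_features)."""
    a1 = relu(X @ W1 + b1)          # hidden activations, shape (n_samples, 2)
    return sigmoid(a1 @ W2 + b2)    # one probability per sample, shape (n_samples,)

# Same weights and biases as in the single-sample example
W1 = np.array([[0.2, 0.4],
               [0.3, 0.7]])
b1 = np.array([0.1, 0.2])
W2 = np.array([0.6, 0.9])
b2 = 0.3

# First row is the original sample; the other two are made-up inputs
X = np.array([[0.5, 0.8],
              [0.0, 0.0],
              [1.0, -1.0]])

out = forward(X, W1, b1, W2, b2)
print(out)
```

The first entry reproduces the single-sample result: `z2 = 0.44 * 0.6 + 0.96 * 0.9 + 0.3 = 1.428`, and `sigmoid(1.428)` is about 0.807.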

Output:

The final result is a predicted probability (like in binary classification). With the weights above, `z2 = 1.428` and the script prints a value of about `0.807`.


In [None]:
'''Q7. How do you add batch normalization to a neural network in Keras?
To add Batch Normalization to a neural network in Keras, you use the `BatchNormalization` layer from `tensorflow.keras.layers`.

What is Batch Normalization?

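Batch Normalization normalizes the inputs of each layer over the current mini-batch (and then applies a learned scale and shift), so that before rescaling they have: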

* Mean ≈ 0
* Standard deviation ≈ 1

This:
* Speeds up training
* Stabilizes learning
* Reduces dependence on weight initialization
* Can help reduce overfitting
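
The normalization step itself is short enough to write out. The following NumPy sketch covers only the training-time forward pass (batch statistics, then the learned scale `gamma` and shift `beta`). It leaves out the moving averages a real layer keeps for inference. The default `eps=1e-3` is an assumption chosen to mirror the Keras layer's default `epsilon`.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                         # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)     # roughly zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # batch of 64 samples, 4 features

out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print("mean per feature:", out.mean(axis=0))   # close to 0
print("std per feature: ", out.std(axis=0))    # close to 1
```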

How to Use in Keras

Import the Layer:

```python
from tensorflow.keras.layers import BatchNormalization
```
Example: Adding Batch Normalization in a Dense Network

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(64, input_shape=(100,)),
    BatchNormalization(),           # Add BatchNorm after Dense
    Activation('relu'),             # Then apply activation
    Dense(32),
    BatchNormalization(),
    Activation('relu'),
    Dense(1, activation='sigmoid')  # Output layer
])
```

---
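Alternative: Adding Layers with `model.add()`

You can also build the same Dense, BatchNorm, Activation pattern one layer at a time (assuming `model = Sequential()`):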

```python
model.add(Dense(64, input_shape=(100,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
```

Avoid using `activation='relu'` inside `Dense` if you want BatchNorm to run before the activation, because the activation would then be applied first. The examples here follow a common convention:

> BatchNorm is applied before the activation function.

Using with Functional API:

```python
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

inputs = Input(shape=(100,))
x = Dense(64)(inputs)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(32)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs, outputs)
```
Tips:
* Works well with deep networks
* Use with or without Dropout (depends on use case)
* Often placed before activation functions (not after)

In [None]:
'''Q8. How can you visualize the training process with accuracy and loss curves?
To visualize the training process in Keras with accuracy and loss curves, you can use Matplotlib to plot the values stored in the `History` object returned by `model.fit()`.

Step-by-Step: Plot Accuracy & Loss

Step 1: Train the Model and Store History

```python
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50,
                    batch_size=32)
```
Step 2: Plot Accuracy and Loss Curves

```python
import matplotlib.pyplot as plt

# Accuracy plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

# Loss plot
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()
```
What `history.history` Contains:

```python
print(history.history.keys())
```

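Typical keys (assuming the model was compiled with `metrics=['accuracy']` and validation data was passed to `fit()`):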

* `'loss'`: training loss
* `'val_loss'`: validation loss
* `'accuracy'`: training accuracy
* `'val_accuracy'`: validation accuracy
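
Because `history.history` is an ordinary dictionary of per-epoch lists, it can also be inspected without plotting. The sketch below uses made-up numbers in place of a real training run (`history_dict` stands in for `history.history`) and finds the epoch with the lowest validation loss.

```python
# Hypothetical values standing in for history.history from a real run
history_dict = {
    "loss":         [0.69, 0.52, 0.41, 0.33, 0.27],
    "val_loss":     [0.70, 0.55, 0.48, 0.50, 0.53],
    "accuracy":     [0.55, 0.74, 0.82, 0.87, 0.90],
    "val_accuracy": [0.54, 0.71, 0.78, 0.77, 0.76],
}

val_loss = history_dict["val_loss"]

# Index of the smallest validation loss (0-based), reported as a 1-based epoch
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
print("best epoch:", best_index + 1)
print("val_loss at best epoch:", val_loss[best_index])
print("val_accuracy at best epoch:", history_dict["val_accuracy"][best_index])
```

In these illustrative numbers the validation loss rises after epoch 3 while the training loss keeps falling, which is the pattern to look for in the curves as a sign of overfitting.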

Tips:

* Use `validation_split=0.2` in `fit()` if no separate validation set.
* You can add `EarlyStopping` and see when training stopped.
* Use `seaborn` for prettier plots (optional).

In [None]:
'''Q9. How can you use gradient clipping in Keras to control the gradient size and prevent exploding gradients?

To use gradient clipping in Keras, you specify clipping options in the optimizer when compiling the model. Gradient clipping helps prevent the exploding gradient problem by capping gradients during backpropagation.

Types of Gradient Clipping in Keras:

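1. Clip by value: Restrict each gradient component to the range `[-threshold, threshold]`
   `clipvalue=threshold`

2. Clip by norm: Rescale the gradient of each weight tensor so that its L2 norm does not exceed the threshold (use `global_clipnorm` to cap the norm across all weights together)
   `clipnorm=threshold`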

How to Apply Gradient Clipping

Example: Clipping by Norm (Most Common)

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, clipnorm=1.0)  # clip norm to max 1.0

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
```

Example: Clipping by Value

```python
optimizer = Adam(learning_rate=0.001, clipvalue=0.5)  # clip gradients between -0.5 and 0.5

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
```
When to Use Gradient Clipping?

Use it when:

* Training deep or recurrent networks
* Seeing NaNs or Inf in training
* Experiencing instability in early epochs

Best Practices:

| Type        | Use When...                                        |
| ----------- | -------------------------------------------------- |
| `clipnorm`  | You want to cap the gradient norm of each weight   |
| `clipvalue` | You want to cap individual gradient values         |
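
The difference between the two options is easiest to see on a single gradient vector. This NumPy sketch shows the arithmetic only. It is not the optimizer's internal code, and the function names are illustrative.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (direction is preserved)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def clip_by_value(grad, limit):
    """Clamp every component to [-limit, limit] (direction may change)."""
    return np.clip(grad, -limit, limit)

g = np.array([3.0, -4.0])             # L2 norm is 5

by_norm = clip_by_norm(g, 1.0)        # approximately [0.6, -0.8], norm 1
by_value = clip_by_value(g, 0.5)      # [0.5, -0.5]

print("clip by norm: ", by_norm, "norm:", np.linalg.norm(by_norm))
print("clip by value:", by_value)
```

Clipping by norm keeps the ratio between components, while clipping by value flattens large components independently.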

In [None]:
'''Q10. How can you create a custom loss function in Keras?
To create a custom loss function in Keras, you define a Python function that takes the true labels (`y_true`) and the predictions (`y_pred`) as inputs and returns a loss tensor (a scalar, or one value per sample that Keras then averages).

You can then pass this function to the `loss` parameter in `model.compile()`.

Step-by-Step: Custom Loss Function in Keras

Basic Format:

```python
def custom_loss(y_true, y_pred):
    # compute loss
    return loss_value
```

* `y_true` and `y_pred` are tensors.
* Use TensorFlow operations (not NumPy).

Example 1: Mean Squared Error (custom version)

```python
import tensorflow as tf

def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
```

Use it in model:

```python
model.compile(optimizer='adam', loss=custom_mse, metrics=['mae'])
```
Example 2: Weighted Binary Crossentropy

```python
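def weighted_binary_crossentropy(y_true, y_pred):
    weight = 2.0  # apply higher penalty to positive class
    y_true = tf.cast(y_true, y_pred.dtype)
    # Clip predictions so log() never sees exactly 0 or 1
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
    # Element-wise BCE keeps each label aligned with its own weight
    bce = -(weight * y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(bce)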
```
Example 3: Huber Loss (robust to outliers)

```python
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = tf.abs(error) <= delta
    squared_loss = 0.5 * tf.square(error)
    linear_loss = delta * (tf.abs(error) - 0.5 * delta)
    return tf.where(is_small, squared_loss, linear_loss)
```
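
The piecewise definition can be checked by hand with a NumPy mirror of the same formula. The function `huber_np` and the sample values are illustrative only. With `delta = 1.0`, an error of 0.5 falls in the quadratic branch and an error of -3.0 falls in the linear branch.

```python
import numpy as np

def huber_np(y_true, y_pred, delta=1.0):
    """NumPy version of the Huber formula, returning one loss per element."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    squared_loss = 0.5 * error ** 2                    # used when |error| <= delta
    linear_loss = delta * (abs_error - 0.5 * delta)    # used when |error| > delta
    return np.where(abs_error <= delta, squared_loss, linear_loss)

y_true = np.array([1.0, 2.0, 0.0])
y_pred = np.array([0.5, 5.0, 0.0])   # errors: 0.5, -3.0, 0.0

losses = huber_np(y_true, y_pred)
print(losses)         # [0.125 2.5   0.   ]
print(losses.mean())  # 0.875
```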
Using Custom Loss with Additional Arguments

If your loss function requires extra arguments (e.g., weight, delta), wrap it using a function:

```python
def make_custom_loss(weight):
    def loss_fn(y_true, y_pred):
        return tf.reduce_mean(weight * tf.square(y_true - y_pred))
    return loss_fn

# Compile with it
model.compile(optimizer='adam', loss=make_custom_loss(1.5))
```

In [None]:
'''Q11. How can you visualize the structure of a neural network model in Keras?
To visualize the structure of a neural network model in Keras, you can use tools like:

1. `model.summary()` (Text-based View)

```python
model.summary()
```
* Displays a layer-wise table: layer names, output shapes, and parameter counts.
* Best for quick overview in code or console.

2. `plot_model()` from `tensorflow.keras.utils`

This gives you a diagram view of the model.

Example:

```python
from tensorflow.keras.utils import plot_model

plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
```

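* `to_file`: Saves the model architecture as an image file (e.g. `model.png`)
* `show_shapes=True`: Displays input/output shape per layer
* `show_layer_names=True`: Shows layer names (the default differs between Keras versions, so set it explicitly)
* Requires the `pydot` package and Graphviz to be installed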

Output: A graphical diagram of the model (saved image)

3. Visualize with **TensorBoard** (Advanced & Interactive)

If you're using the Functional API or a complex model:

```python
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir='./logs')
model.fit(X_train, y_train, epochs=10, callbacks=[tensorboard_callback])
```

Then in terminal:

```bash
tensorboard --logdir=./logs
```

Open a browser at `http://localhost:6006` to see the model graph and metrics.

Pro Tip: For Functional or Subclassed Models

If using the Functional API, diagrams will show branches, skip connections, etc.

Summary Table:

| Method            | Output         | Best For                        |
| ----------------- | -------------- | ------------------------------- |
| `model.summary()` | Text           | Console view of architecture    |
| `plot_model()`    | Image (.png)   | Static diagram for reports/docs |
| TensorBoard       | Interactive UI | Debugging + detailed graph view |
