###1. Assume that the inputs X to some scalar function f are n x m matrices. What is the dimensionality of the gradient of f with respect to X? Give reasons to justify your answer.


Let $f(X)$ be a scalar-valued function, where the input $X$ is an $n \times m$ matrix. The gradient of $f$ with respect to $X$, denoted $\frac{\partial f}{\partial X}$, has the same dimensionality as $X$, i.e., it is an $n \times m$ matrix.
###Reasoning:

A gradient represents how the output of a function changes with respect to each input variable. Since $X$ contains $n \times m$ individual elements $x_{ij}$, the function $f$ depends on each of these elements. Therefore, we compute partial derivatives with respect to every element in the matrix:

$$
\frac{\partial f}{\partial X} =
\begin{bmatrix}
\frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial x_{n1}} & \cdots & \frac{\partial f}{\partial x_{nm}}
\end{bmatrix}
$$



Each partial derivative measures how a small change in one matrix element affects the scalar output. Since there are $n \times m$ such elements, the gradient must also contain $n \times m$ components.
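
The shape argument above can be checked numerically. The following is a minimal sketch, assuming a hypothetical scalar function $f(X) = \sum_{ij} x_{ij}^2$ (not part of the question) and a central-difference approximation of each partial derivative; the helper names are illustrative only.

```python
def numerical_gradient(f, X, eps=1e-6):
    """Approximate df/dX by central differences; the result has the shape of X."""
    n, m = len(X), len(X[0])
    grad = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            original = X[i][j]
            X[i][j] = original + eps
            f_plus = f(X)
            X[i][j] = original - eps
            f_minus = f(X)
            X[i][j] = original  # restore the entry before moving on
            grad[i][j] = (f_plus - f_minus) / (2 * eps)
    return grad


def sum_of_squares(X):
    # Hypothetical scalar function; its exact gradient is 2 * X.
    return sum(x * x for row in X for x in row)


X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # n = 2, m = 3
grad = numerical_gradient(sum_of_squares, X)
grad_shape = (len(grad), len(grad[0]))
print(grad_shape)  # one partial derivative per entry of X
```

Because one partial derivative is computed per entry of `X`, the result necessarily has the same number of rows and columns as `X`.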


###2.Define the evaluation metrics Accuracy, Precision, Recall, and F1 score.

Definitions:

- Accuracy: Proportion of all predictions (positive and negative) that are correct.

- Precision: Proportion of predicted positives that are actually positive.

- Recall (Sensitivity): Proportion of actual positives that are correctly identified.

- F1 Score: Harmonic mean of Precision and Recall; it is high only when both are high.

Given Confusion Matrix:

|                      | Actual: Cancer | Actual: No Cancer |
|----------------------|----------------|-------------------|
| Predicted: Cancer    | 80             | 80                |
| Predicted: No Cancer | 20             | 820               |

TP = 80

FP = 80

FN = 20

TN = 820

Total = 1000

Calculations:

**Accuracy**

$$
\text{Accuracy} = \frac{TP + TN}{\text{Total}}
= \frac{80 + 820}{1000}
= 0.90 \; (90\%)
$$

**Precision**

$$
\text{Precision} = \frac{TP}{TP + FP}
= \frac{80}{80 + 80}
= 0.50 \; (50\%)
$$

**Recall**

$$
\text{Recall} = \frac{TP}{TP + FN}
= \frac{80}{80 + 20}
= 0.80 \; (80\%)
$$

**F1 Score**

$$
\text{F1 Score}
= 2 \times \frac{\text{Precision} \times \text{Recall}}
{\text{Precision} + \text{Recall}}
= 2 \times \frac{0.5 \times 0.8}{0.5 + 0.8}
= 0.615 \; (61.5\%)
$$
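
The calculations above can be reproduced with a few lines of code. This is a minimal sketch; the function name `classification_metrics` is illustrative, and it assumes the denominators are non-zero (as they are for the counts given here).

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # assumes at least one predicted positive
    recall = tp / (tp + fn)     # assumes at least one actual positive
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = classification_metrics(tp=80, fp=80, fn=20, tn=820)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Note that the accuracy is high mainly because true negatives dominate the dataset, while the precision shows that only half of the predicted cancer cases are real.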



###3.What is a confusion matrix ?

A confusion matrix is a table used to evaluate the performance of a classification model by comparing predicted labels with actual labels. It provides a detailed breakdown of correct and incorrect predictions, allowing deeper insight than accuracy alone.

The matrix consists of four key components:

- **True Positive (TP):** Positive cases correctly predicted as positive  
- **False Positive (FP):** Negative cases incorrectly predicted as positive  
- **False Negative (FN):** Positive cases incorrectly predicted as negative  
- **True Negative (TN):** Negative cases correctly predicted as negative  


Confusion matrices are especially useful for imbalanced datasets, such as medical diagnosis, where false negatives and false positives carry different costs. They form the basis for computing metrics like Precision, Recall, F1 score, and Specificity.
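
As an illustration, the four components can be counted directly from paired labels. This is a minimal sketch with made-up labels (`y_true`, `y_pred`) and an illustrative helper name; `1` is assumed to denote the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive and actual != positive:
            fp += 1
        elif predicted != positive and actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```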

###4.What is overfitting and underfitting ?

Overfitting occurs when a model learns the training data too well, including noise and irrelevant patterns. Such a model performs very well on training data but poorly on unseen test data. Overfitting is common in highly complex models trained on small datasets.

Underfitting happens when a model is too simple to capture the underlying structure of the data. It performs poorly on both training and testing data, indicating that it has not learned the patterns effectively.

Key Difference:

- Overfitting → High variance, low bias

- Underfitting → High bias, low variance

Balancing model complexity is crucial to achieving good generalization.
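
The two extremes can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not a benchmark: the data (`y = 2x` plus Gaussian noise) is synthetic, a 1-nearest-neighbour "memorizer" stands in for an overly flexible model, and a constant predictor stands in for an overly simple one.

```python
import random

rng = random.Random(0)


def make_data(n):
    # Hypothetical data: y = 2x plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(20)
test_x, test_y = make_data(200)


def memorizer(x):
    # 1-nearest-neighbour: returns the label of the closest training point,
    # so it reproduces the training set exactly, noise included.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


mean_y = sum(train_y) / len(train_y)


def constant(x):
    # Ignores the input entirely: too simple to capture the trend.
    return mean_y


memorizer_train_mse = mse(memorizer, train_x, train_y)
constant_train_mse = mse(constant, train_x, train_y)
print("memorizer train/test MSE:", memorizer_train_mse, mse(memorizer, test_x, test_y))
print("constant  train/test MSE:", constant_train_mse, mse(constant, test_x, test_y))
```

The memorizer has zero training error by construction, so any error it shows on the test set is a generalization gap, whereas the constant predictor has a non-zero error even on the data it was fitted to.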

###5.Explain vanishing and exploding gradients, give reasons why they occur and how they can be prevented.

Vanishing and exploding gradients are common problems encountered while training deep neural networks, especially during backpropagation.

###Vanishing Gradient:

Occurs when gradients become extremely small as they propagate backward through many layers. This leads to negligible weight updates, causing early layers to learn very slowly. It commonly arises from repeated multiplication of small derivative values during backpropagation, for example the derivatives of saturating activations such as sigmoid (at most 0.25) or tanh.

###Exploding Gradient:

Occurs when gradients grow exponentially large, leading to unstable training and numerical overflow. This happens when large weights or gradients are repeatedly multiplied.
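
Both effects come from the same repeated multiplication. The sketch below is a simplification that assumes every layer scales the gradient by the same factor: 0.25 (the maximum slope of the sigmoid) for the vanishing case and a hypothetical factor of 1.5 for the exploding case.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backprop_factor(per_layer_factor, depth):
    """Gradient scaling after `depth` layers that each multiply by the same factor."""
    return per_layer_factor ** depth


max_sigmoid_slope = sigmoid_derivative(0.0)  # the sigmoid is steepest at z = 0
for depth in (5, 20, 50):
    vanishing = backprop_factor(max_sigmoid_slope, depth)
    exploding = backprop_factor(1.5, depth)  # 1.5 is an assumed per-layer gain
    print(f"depth={depth:2d} vanishing={vanishing:.3e} exploding={exploding:.3e}")
```

Even in the best case for the sigmoid, the factor shrinks geometrically with depth, while any per-layer gain above 1 grows geometrically.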

###Why They Occur:

- Deep networks

- Poor weight initialization

- Inappropriate activation functions

###Prevention Methods:

- Proper weight initialization (Xavier, He initialization)

- Using ReLU or its variants

- Gradient clipping (for exploding gradients)

- Batch normalization

- Residual connections (skip connections)

These techniques help stabilize training and enable effective learning in deep neural networks.
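
Of the methods listed, gradient clipping is the simplest to show in isolation. The following is a minimal sketch of clipping by global L2 norm; the function name and the threshold `max_norm` are illustrative choices, not a specific library's API.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough; leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)  # original norm is 50
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)
```

Clipping changes only the length of the gradient, not its direction, which is why it limits exploding updates without altering where the step points.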

###6.What is regularization ? Explain L1 and L2 regularization

Regularization is a technique used in machine learning to prevent overfitting by discouraging overly complex models. Overfitting occurs when a model learns noise and details from training data instead of the underlying pattern. Regularization addresses this problem by adding a penalty term to the loss function, which restricts the magnitude of model parameters (weights).

###L1 Regularization (Lasso)

L1 regularization adds the sum of absolute values of weights to the loss function:

$$
\text{Loss} = \text{Original Loss} + \lambda \sum_i |W_i|
$$


It encourages sparsity, meaning some weights become exactly zero. This effectively performs feature selection, making the model simpler and easier to interpret. L1 is especially useful when many input features are irrelevant.

###L2 Regularization (Ridge)

L2 regularization adds the sum of squared weights to the loss function:

$$
\text{Loss} = \text{Original Loss} + \lambda \sum_i W_i^2
$$

Instead of forcing weights to zero, L2 regularization shrinks weights smoothly, distributing importance across features. It improves numerical stability and is widely used in deep learning.

Conclusion

- L1 → Feature selection, sparse model

- L2 → Weight shrinkage, smoother learning

Both methods help improve model generalization.
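
The two penalty terms and their gradients can be sketched as follows. The weights and the value of `lam` (standing for $\lambda$) are made up for illustration, and the function names are hypothetical.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)


def l1_gradient(weights, lam):
    # Subgradient: constant magnitude lam regardless of how small w is
    # (0 is used at w == 0), which is what pushes weights to exactly zero.
    return [lam * ((w > 0) - (w < 0)) for w in weights]


def l2_gradient(weights, lam):
    # Proportional to w, so the pull weakens as a weight approaches zero.
    return [2.0 * lam * w for w in weights]


weights = [0.5, -1.5, 0.0, 2.0]  # hypothetical weights
lam = 0.01                       # hypothetical regularization strength
print("L1 penalty:", l1_penalty(weights, lam), "gradient:", l1_gradient(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam), "gradient:", l2_gradient(weights, lam))
```

The gradients make the difference in behaviour visible: the L1 term pulls every non-zero weight with the same force, while the L2 term pulls large weights harder and small weights only gently.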

###7.What is a Dropout layer, and to which part of a neural network is dropout generally applied? How does it prevent overfitting?

A Dropout layer is a regularization technique used in neural networks to reduce overfitting. During training, dropout randomly deactivates (drops) a fraction of neurons in a layer with a predefined probability $p$.

Where Dropout is Applied

Dropout is generally applied to:

- Hidden layers of a neural network

It is usually not applied to the output layer and is disabled during testing/inference.

How Dropout Prevents Overfitting

- Prevents neurons from becoming overly dependent on specific other neurons

- Forces the network to learn robust and redundant representations

- Acts like training an ensemble of multiple smaller networks
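
A dropout layer can be sketched in a few lines. The version below is one common formulation ("inverted" dropout), which is an assumption beyond the description above: surviving activations are scaled by `1 / (1 - p)` during training so that nothing needs to change at inference time. The function name and signature are illustrative.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Drop each activation with probability p (requires 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at inference
    keep = 1.0 - p
    # Kept units are rescaled so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [1.0, 2.0, 3.0, 4.0]  # hypothetical hidden-layer activations
print(dropout(hidden, p=0.5, rng=random.Random(0)))  # training mode
print(dropout(hidden, p=0.5, training=False))        # inference mode
```

Because a fresh random mask is drawn on every call in training mode, each forward pass effectively trains a different sub-network.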


###8.What will be the output activations of this layer after applying dropout? (Show the final activation vector)

Given:

Activations:

   a = [2.0, 5.0, 7.12, 4.5, 6.0]

Dropout mask:

m = [1, 0, 0, 1, 1]

Applying Dropout

Dropout is applied by element-wise multiplication of activations and the mask:

$$
a' = a \odot m
$$

$$
a' = [2.0 \times 1,\; 5.0 \times 0,\; 7.12 \times 0,\; 4.5 \times 1,\; 6.0 \times 1]
$$

Final Output Activations:

[2.0, 0, 0, 4.5, 6.0]
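
The same element-wise product can be checked in code. This sketch applies only the given mask; no rescaling of the surviving activations is performed, since the question specifies a mask but no dropout probability.

```python
a = [2.0, 5.0, 7.12, 4.5, 6.0]  # activations from the question
m = [1, 0, 0, 1, 1]             # dropout mask from the question

# Element-wise (Hadamard) product of activations and mask.
a_dropped = [ai * mi for ai, mi in zip(a, m)]
print(a_dropped)  # [2.0, 0.0, 0.0, 4.5, 6.0]
```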

###9.Explain the difference between Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent

###Gradient Descent (Batch Gradient Descent)

Uses the entire training dataset to compute gradients for each update.

- Stable convergence

- Computationally expensive

- Slow for large datasets

###Stochastic Gradient Descent (SGD)

Uses one training example at a time for each update.

- Cheap, frequent updates

- High variance in the gradient estimates, giving a noisy loss curve

- The noise can help escape shallow local minima

###Mini-Batch Gradient Descent

Uses a small batch of samples (e.g., 32, 64) per update.

- Balance between speed and stability

- Efficient on GPUs

- Most commonly used in practice
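
The three variants differ only in how many examples contribute to each update, so a single training loop with a `batch_size` parameter covers all of them. The sketch below assumes a toy one-parameter model `y ≈ w * x` on noise-free synthetic data; the learning rate, epoch count, and function name are illustrative choices.

```python
import random


def train(xs, ys, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w * x - y)^2 over the current batch.
            grad = sum(2.0 * xs[i] * (w * xs[i] - ys[i]) for i in batch) / len(batch)
            w -= lr * grad
    return w


# Hypothetical noise-free data generated with a true weight of 3.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x for x in xs]

w_batch = train(xs, ys, batch_size=len(xs))  # batch gradient descent
w_sgd = train(xs, ys, batch_size=1)          # stochastic gradient descent
w_mini = train(xs, ys, batch_size=5)         # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)
```

With `batch_size=len(xs)` there is one update per epoch, with `batch_size=1` there is one update per example, and intermediate values trade off between the two.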

###10.What are optimizers ? Explain Adam, RMS prop and Momentum in detail

Optimizers are algorithms used to update model parameters in order to minimize the loss function efficiently during training.

###Momentum

Momentum accelerates gradient descent by adding a fraction of the previous update to the current update.

- Reduces oscillations

- Speeds up convergence in consistent directions

###RMSprop

RMSprop adapts the learning rate for each parameter by dividing the gradient by a moving average of squared gradients.

- Prevents large updates

- Works well for non-stationary objectives

- Common in recurrent neural networks

###Adam (Adaptive Moment Estimation)

Adam combines the benefits of Momentum and RMSprop.

- Maintains moving averages of both gradients and squared gradients

- Adaptive learning rate

- Fast convergence and stable training
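
The update rules of the three optimizers can be sketched for a single scalar parameter. The formulations below are the commonly used ones and the hyperparameter defaults (`lr`, `beta`, `rho`, `beta1`, `beta2`, `eps`) are typical values chosen for illustration; the objective $f(w) = (w - 3)^2$ is a hypothetical example.

```python
import math


def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    # Keep a fraction beta of the previous update and add the new gradient step.
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity


def rmsprop_step(w, grad, sq_avg, lr=0.1, rho=0.9, eps=1e-8):
    # Moving average of squared gradients scales the step per parameter.
    sq_avg = rho * sq_avg + (1.0 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum-like) and second moment (RMSprop-like) averages.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def grad_f(w):
    # Gradient of the hypothetical objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


# One step of each optimizer from w = 1 with zero-initialized state.
w_mom, vel = momentum_step(1.0, grad_f(1.0), 0.0)
w_rms, sq = rmsprop_step(1.0, grad_f(1.0), 0.0)
w_adam, m1, v1 = adam_step(1.0, grad_f(1.0), 0.0, 0.0, t=1)
print(w_mom, w_rms, w_adam)
```

In this sketch the first Adam step has magnitude close to `lr` regardless of the gradient's scale, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is approximately the sign of the gradient.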

