# 10-714 Homework 2

In this homework, you will be implementing a neural network library in the needle framework. Reminder: __you must save a copy in drive__.

In [None]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir -p 10714
%cd /content/drive/MyDrive/10714
!git clone https://github.com/dlsys10714/hw2.git
%cd /content/drive/MyDrive/10714/hw2

!pip3 install --upgrade --no-deps git+https://github.com/dlsys10714/mugrade.git

### Setting some variables

In [26]:
MY_API_KEY = "qTskW8hPqLZXWgkH0eHH"
HW2_NAME = "hw2"

## Question 0

This homework builds off of Homework 1.

**First, in your Homework 2 directory, copy the files `python/needle/autograd.py` and `python/needle/ops/ops_mathematic.py` from your Homework 1.**

***NOTE***: The default data type for the tensor is `float32`. If you want to change the data type, you can do so by setting the `dtype` parameter in the `Tensor` constructor. For example, `Tensor([1, 2, 3], dtype='float64')` will create a tensor with `float64` data type. 
In this homework, **make sure any tensor you create has `float32` data type to avoid any issues with the autograder**.

In [None]:
import sys
sys.path.append('./python')
sys.path.append('./apps')

## Question 1

In this first question, you will implement a few different methods for weight initialization. This will be done in the `python/needle/init/init_initializers.py` file, which contains a number of routines for initializing needle Tensors using various random and constant initializations. Following the same methodology as the existing initializers (you will want to call e.g. `init.rand` or `init.randn`, implemented in `python/needle/init/init_basic.py`, from your functions below), implement the following common initialization methods. In all cases, the functions should return `fan_in` by `fan_out` 2D tensors (extensions to other sizes can be done via e.g. reshaping).


### Xavier uniform
`xavier_uniform(fan_in, fan_out, gain=1.0, **kwargs)`

Fills the input Tensor with values according to the method described in [Understanding the difficulty of training deep feedforward neural networks](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), using a uniform distribution. The resulting Tensor will have values sampled from $\mathcal{U}(-a, a)$ where 

$$a = \mathrm{gain} \times \sqrt{\frac{6}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}}$$

Pass remaining `**kwargs` parameters to the corresponding `init` random call.

##### Parameters
- `fan_in` - dimensionality of input
- `fan_out` - dimensionality of output
- `gain` - optional scaling factor
___

### Xavier normal
`xavier_normal(fan_in, fan_out, gain=1.0, **kwargs)`

Fills the input Tensor with values according to the method described in [Understanding the difficulty of training deep feedforward neural networks](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), using a normal distribution. The resulting Tensor will have values sampled from $\mathcal{N}(0, \text{std}^2)$ where 

$$\mathrm{std} = \mathrm{gain} \times \sqrt{\frac{2}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}}$$

##### Parameters
- `fan_in` - dimensionality of input
- `fan_out` - dimensionality of output
- `gain` - optional scaling factor
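
As a quick check on the two Xavier scale factors, the sketch below evaluates $a$ and $\mathrm{std}$ with the standard library only. The helper names `xavier_uniform_bound` and `xavier_normal_std` are hypothetical and not part of needle; in the assignment these values would be handed to the `init` random routines instead.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    # a = gain * sqrt(6 / (fan_in + fan_out))
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

def xavier_normal_std(fan_in, fan_out, gain=1.0):
    # std = gain * sqrt(2 / (fan_in + fan_out))
    return gain * math.sqrt(2.0 / (fan_in + fan_out))

a = xavier_uniform_bound(3, 5)
std = xavier_normal_std(3, 5)
print(a, std)
```

Both variants target the same variance: a uniform distribution on $(-a, a)$ has variance $a^2/3$, which equals $\mathrm{std}^2$ here.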
___

### Kaiming uniform
`kaiming_uniform(fan_in, fan_out, nonlinearity="relu", **kwargs)`

Fills the input Tensor with values according to the method described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf), using a uniform distribution. The resulting Tensor will have values sampled from $\mathcal{U}(-\text{bound}, \text{bound})$ where 

$$\mathrm{bound} = \mathrm{gain} \times \sqrt{\frac{3}{\mathrm{fan}_{in}}}$$

Use the recommended gain value for ReLU: $\text{gain}=\sqrt{2}$.

##### Parameters
- `fan_in` - dimensionality of input
- `fan_out` - dimensionality of output
- `nonlinearity` - the non-linear function
___

### Kaiming normal
`kaiming_normal(fan_in, fan_out, nonlinearity="relu", **kwargs)`

Fills the input Tensor with values according to the method described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf), using a normal distribution. The resulting Tensor will have values sampled from $\mathcal{N}(0, \text{std}^2)$ where 

$$\mathrm{std} = \frac{\mathrm{gain}}{\sqrt{\mathrm{fan}_{in}}}$$

Use the recommended gain value for ReLU: $\text{gain}=\sqrt{2}$.

##### Parameters
- `fan_in` - dimensionality of input
- `fan_out` - dimensionality of output
- `nonlinearity` - the non-linear function
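
The Kaiming scale factors depend only on `fan_in` and the gain. A minimal sketch, assuming the ReLU gain of $\sqrt{2}$ stated above; the helper names are hypothetical and exist only for this illustration.

```python
import math

RELU_GAIN = math.sqrt(2.0)  # recommended gain for ReLU

def kaiming_uniform_bound(fan_in, gain=RELU_GAIN):
    # bound = gain * sqrt(3 / fan_in)
    return gain * math.sqrt(3.0 / fan_in)

def kaiming_normal_std(fan_in, gain=RELU_GAIN):
    # std = gain / sqrt(fan_in)
    return gain / math.sqrt(fan_in)

bound = kaiming_uniform_bound(6)  # sqrt(2) * sqrt(1/2)
std = kaiming_normal_std(8)       # sqrt(2) / sqrt(8)
print(bound, std)
```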

In [73]:
!python3 -m pytest -v -k "test_init"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 89 deselected / 4 selected

tests/hw2/test_nn_and_optim.py::test_init_kaiming_uniform PASSED         [ 25%]
tests/hw2/test_nn_and_optim.py::test_init_kaiming_normal PASSED          [ 50%]
tests/hw2/test_nn_and_optim.py::test_init_xavier_uniform PASSED          [ 75%]
tests/hw2/test_nn_and_optim.py::test_init_xavier_normal PASSED           [100%]



In [4]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "init"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting init...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
.



## Question 2

In this question, you will implement additional modules in `python/needle/nn/nn_basic.py`. Specifically, for the modules described below, initialize any variables of the module in the constructor, and fill out the `forward` method. **Note:** Be sure that you are using the `init` functions that you just implemented to initialize the parameters, and don't forget to pass the `dtype` argument.
___

### Linear
`needle.nn.Linear(in_features, out_features, bias=True, device=None, dtype="float32")`

Applies a linear transformation to the incoming data: $y = xA^T + b$. The input shape is $(N, H_{\text{in}})$ where $H_{\text{in}}=\text{in\_features}$. The output shape is $(N, H_{\text{out}})$ where $H_{\text{out}}=\text{out\_features}$.

**Be careful to explicitly broadcast the bias term to the correct shape -- Needle does not support implicit broadcasting.**

**Note: for all layers including this one, you should initialize the weight Tensor before the bias Tensor, and should initialize all Parameters using only functions from `init`**. This initialization order requirement exists because the test case answers on mugrade were prepared assuming that weights are initialized before bias parameters. While initializing bias before weights would still be algorithmically correct, the model solutions and expected test outputs were generated using the weights-first convention. Therefore, to pass the automated tests on mugrade, you must follow this specific initialization order. If you encounter any ambiguity about which layer or parameter should be initialized first in other parts of this assignment, please raise a request on Ed for clarification. 

##### Parameters
- `in_features` - size of each input sample
- `out_features` - size of each output sample
- `bias` - If set to `False`, the layer will not learn an additive bias.

##### Variables
- `weight` - the learnable weights of shape (`in_features`, `out_features`). The values should be **initialized with the Kaiming Uniform initialization** with `fan_in = in_features`
- `bias` - the learnable bias of shape (`out_features`). The values should be initialized with the Kaiming Uniform initialization with `fan_in = out_features`. **Note the difference in fan_in choice, due to their relative sizes**. 


**NOTE:** Make sure to enclose all necessary variables (e.g. `weight`, `bias`) in the **`Parameter` class** so that they are visible to the optimizers that you will implement next. **You can read `class Parameter` and the function `unpack_params` in `python/needle/nn/nn_basic.py` to understand more.**
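
The shape bookkeeping for the bias can be sketched in NumPy, used here as a stand-in for needle Tensors (an assumption for illustration only). With the weight stored as (`in_features`, `out_features`), the product is `x @ weight`, and the bias is reshaped and then broadcast explicitly, which is the step needle requires you to spell out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, in_features, out_features = 4, 3, 2

# Stand-ins for weight (in_features, out_features) and bias (out_features,).
weight = rng.standard_normal((in_features, out_features)).astype("float32")
bias = rng.standard_normal(out_features).astype("float32")
x = rng.standard_normal((N, in_features)).astype("float32")

# Explicit broadcast: (out_features,) -> (1, out_features) -> (N, out_features).
bias_b = np.broadcast_to(bias.reshape(1, out_features), (N, out_features))
y = x @ weight + bias_b
print(y.shape, y.dtype)
```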

In [23]:
!python3 -m pytest -v -k "test_nn_linear" -s

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 85 deselected / 8 selected

tests/hw2/test_nn_and_optim.py::test_nn_linear_weight_init_1 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_linear_bias_init_1 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_linear_forward_1 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_linear_forward_2 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_linear_forward_3 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_linear_backward_1 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_linear_backward_2 [[ 24.54879958   8.77534697   4.38789841 -21.24851486  -3.96693733
   24.25676758   6.31711124   6.02977716   0.88099365   3.59951611]
 [ 12.23374473  -3.79264582  -4.1903892   -5.10671942 -12.00426875


In [29]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_linear"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_linear...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
.



### ReLU
`needle.nn.ReLU()`

Applies the rectified linear unit function element-wise:
$\text{ReLU}(x) = \max(0, x)$.

If you have previously implemented ReLU's backwards pass in terms of itself, note that this is numerically unstable and will likely cause problems
down the line.
Instead, consider that we could write the derivative of ReLU as $I\{x>0\}$, where, for this assignment, we arbitrarily decide and fix the convention that we will consider the derivative at $x=0$ to be 0.
(This is a _subdifferentiable_ function.)
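
The indicator form of the gradient can be sketched with NumPy arrays in place of needle Tensors (an assumption for illustration); the function names are hypothetical. The backward pass uses the input `x`, not the output, and the strict comparison gives a derivative of 0 at $x=0$.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0)

def relu_backward(out_grad, x):
    # Indicator I{x > 0}; the derivative at x == 0 is taken to be 0.
    return out_grad * (x > 0)

x = np.array([-2.0, 0.0, 3.0], dtype="float32")
g = relu_backward(np.ones_like(x), x)
print(relu_forward(x), g)
```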

___

In [27]:
!python3 -m pytest -v -k "test_nn_relu"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 91 deselected / 2 selected

tests/hw2/test_nn_and_optim.py::test_nn_relu_forward_1 PASSED            [ 50%]
tests/hw2/test_nn_and_optim.py::test_nn_relu_backward_1 PASSED           [100%]



In [28]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_relu"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_relu...
Grader test 1 passed
Grader test 2 passed
.



### Sequential
`needle.nn.Sequential(*modules)`

Applies a sequence of modules to the input (in the order that they were passed to the constructor) and returns the output of the last module.
These should be kept in a `.module` property: you should _not_ redefine any magic methods like `__getitem__`, as this may not be compatible with our tests.

##### Parameters
- `*modules` - any number of modules of type `needle.nn.Module`
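
The forward pass is a simple left-to-right fold over the modules. A minimal sketch, assuming plain Python callables as stand-ins for `needle.nn.Module` instances; `apply_sequential` is a hypothetical helper, not the class you are asked to write.

```python
def apply_sequential(modules, x):
    # Feed the output of each module into the next, in constructor order.
    for module in modules:
        x = module(x)
    return x

out = apply_sequential([lambda v: v + 1, lambda v: v * 2], 3)  # (3 + 1) * 2
print(out)
```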

___

In [46]:
!python3 -m pytest -v -k "test_nn_sequential"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 91 deselected / 2 selected

tests/hw2/test_nn_and_optim.py::test_nn_sequential_forward_1 PASSED      [ 50%]
tests/hw2/test_nn_and_optim.py::test_nn_sequential_backward_1 PASSED     [100%]



In [47]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_sequential"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_sequential...
Grader test 1 passed
Grader test 2 passed
.



### LogSumExp

`needle.ops.LogSumExp(axes)`

Applies a numerically stable log-sum-exp function to the input by subtracting off the maximum elements. You will need to implement this and the next operation in file `python/needle/ops/ops_logarithmic.py`.

\begin{equation}
\text{LogSumExp}(z) = \log (\sum_{i} \exp (z_i - \max{z})) + \max{z}
\end{equation}

##### Parameters
- `axes` - Tuple of axes to sum and take the maximum element over. This uses the same conventions as `needle.ops.Summation()`

##### Why This Formulation? Handling Numerical Stability

- **Naive definition:**  
  The most direct way to define LogSumExp is  

  $$
  \log \left(\sum_i \exp(z_i)\right)
  $$

- **The problem:**  
  This naive computation is prone to numerical instability.  
  - If some $z_i$ is very large (e.g. $1000$), then $\exp(z_i)$ will overflow to $\infty$.  
  - If some $z_i$ is very small (e.g. $-1000$), then $\exp(z_i)$ will underflow to $0$.  
  Both cases can distort the result in floating-point arithmetic.  

- **The fix:**  
  To avoid this, we factor out the maximum element $M = \max(z)$:  

  $$
  \log \left(\sum_i \exp(z_i)\right)
  = \log \left(\exp(M)\sum_i \exp(z_i - M)\right)
  = M + \log \left(\sum_i \exp(z_i - M)\right).
  $$

  Now all exponentials are at most $\exp(0) = 1$, so overflow is completely avoided.  

- **What about underflow?**  
  Underflow can still occur if $z_i - M$ is very negative (e.g. $-1000$), since $\exp(-1000) \approx 0$ in floating-point.  
  However, this is not a problem: such terms are already negligible compared to the maximum and do not meaningfully affect the sum.  


The following blog post is also a good reference: https://indii.org/blog/gradients-of-softmax-and-logsumexp/
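
The overflow and underflow cases described above can be reproduced directly. This is a NumPy sketch of the forward computation only, not the needle op; the function name `logsumexp` and the use of `keepdims` are choices made for this illustration.

```python
import numpy as np

def logsumexp(z, axes=None):
    # Subtract the max over `axes` so that every exponent is <= 0.
    m = np.max(z, axis=axes, keepdims=True)
    out = np.log(np.sum(np.exp(z - m), axis=axes, keepdims=True)) + m
    return np.squeeze(out, axis=axes)

z = np.array([[1000.0, 1000.0], [-1000.0, -1000.0]])
with np.errstate(over="ignore", divide="ignore"):
    naive = np.log(np.sum(np.exp(z), axis=1))  # overflows / underflows
stable = logsumexp(z, axes=(1,))
print(naive, stable)
```

The naive version returns infinities for both rows, while the shifted version returns $\pm 1000 + \log 2$.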
___

In [70]:
!python3 -m pytest -v -k "test_op_logsumexp" -s

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 83 deselected / 10 selected

tests/hw2/test_nn_and_optim.py::test_op_logsumexp_forward_1 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_forward_2 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_forward_3 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_forward_4 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_forward_5 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_backward_1 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_backward_2 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_backward_3 PASSED
tests/hw2/test_nn_and_optim.py::test_op_logsumexp_backward_5 PASSED
tests/hw2/test_nn

In [50]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "op_logsumexp"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting op_logsumexp...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
Grader test 7 passed
Grader test 8 passed
Grader test 9 passed
Grader test 10 passed
.



### LogSoftmax

`needle.ops.LogSoftmax(axes)`

Applies a numerically stable logsoftmax function to the input by subtracting off the maximum elements.
For this question, you can assume the input NDArray is 2 dimensional and we are doing softmax over `axis=1`.

\begin{equation}
\text{LogSoftmax}(z)_i = \log \left(\frac{\exp(z_i - \max z)}{\sum_{j}\exp(z_j - \max z)}\right) = z_i - \text{LogSumExp}(z)
\end{equation}
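
A forward-only NumPy sketch of the identity above, assuming a 2D input normalized over `axis=1` as the question states; `log_softmax` is a hypothetical name. Exponentiating the result should give rows that sum to 1, even for very large logits.

```python
import numpy as np

def log_softmax(z):
    # z is 2D; normalize over axis=1.
    m = np.max(z, axis=1, keepdims=True)
    lse = np.log(np.sum(np.exp(z - m), axis=1, keepdims=True)) + m
    return z - lse

z = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
out = log_softmax(z)
row_sums = np.exp(out).sum(axis=1)
print(out, row_sums)
```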
___

In [24]:
!python3 -m pytest -v -k "test_op_logsoftmax"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 90 deselected / 3 selected

tests/hw2/test_nn_and_optim.py::test_op_logsoftmax_forward_1 PASSED      [ 33%]
tests/hw2/test_nn_and_optim.py::test_op_logsoftmax_stable_forward_1 PASSED [ 66%]
tests/hw2/test_nn_and_optim.py::test_op_logsoftmax_backward_1 PASSED     [100%]



In [27]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "op_logsoftmax"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting op_logsoftmax...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
.



### SoftmaxLoss

`needle.nn.SoftmaxLoss()` in `python/needle/nn/nn_basic.py`

Applies the softmax loss as defined below (and as implemented in Homework 1), taking in as input a Tensor of logits and a Tensor of the true labels (expressed as a list of numbers, *not* one-hot encoded).

Note that you can use the `init.one_hot` function now instead of writing this yourself.  Note: You will need to use the numerically stable logsumexp operator you just implemented for this purpose.

\begin{equation}
\ell_\text{softmax}(z,y) = \log \sum_{i=1}^k \exp z_i - z_y
\end{equation}
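
The role of the one-hot mask is to pick out $z_y$ using only elementwise products and sums. The NumPy sketch below illustrates this; averaging the per-example losses over the batch is an assumption carried over from the usual convention, and `softmax_loss` is a hypothetical function name.

```python
import numpy as np

def softmax_loss(logits, y):
    k = logits.shape[1]
    one_hot = np.eye(k, dtype=logits.dtype)[y]            # (n, k) mask
    m = logits.max(axis=1, keepdims=True)
    lse = np.log(np.exp(logits - m).sum(axis=1)) + m[:, 0]  # stable logsumexp
    z_y = (logits * one_hot).sum(axis=1)                    # logit of the true class
    return (lse - z_y).mean()                               # assumed batch average

logits = np.zeros((2, 4), dtype="float32")
loss = softmax_loss(logits, np.array([0, 3]))  # uniform logits over 4 classes
print(loss)
```

With all-zero logits over four classes, every example contributes $\log 4$.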

___

In [37]:
!python3 -m pytest -v -k "test_nn_softmax_loss" -s

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 93 items / 89 deselected / 4 selected

tests/hw2/test_nn_and_optim.py::test_nn_softmax_loss_forward_1 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_softmax_loss_forward_2 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_softmax_loss_backward_1 PASSED
tests/hw2/test_nn_and_optim.py::test_nn_softmax_loss_backward_2 PASSED



In [38]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_softmax_loss"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_softmax_loss...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
.



### LayerNorm1d
`needle.nn.LayerNorm1d(dim, eps=1e-5, device=None, dtype="float32")`

Applies layer normalization over a mini-batch of inputs as described in the paper [Layer Normalization](https://arxiv.org/abs/1607.06450).

\begin{equation}
y = w \circ \frac{x_i - \textbf{E}[x]}{((\textbf{Var}[x]+\epsilon)^{1/2})} + b
\end{equation}

where $\textbf{E}[x]$ denotes the empirical mean of the inputs, $\textbf{Var}[x]$ denotes their empirical variance (note that here we are using the "biased" estimate of the variance, i.e., dividing by $N$ rather than by $N-1$), and $w$ and $b$ denote learnable scalar weights and biases respectively.  Note you can assume the input to this layer is a 2D tensor, with batches in the first dimension and features in the second. You might need to broadcast the weight and bias before applying them.

##### Parameters
- `dim` - number of channels
- `eps` - a value added to the denominator for numerical stability.

##### Variables
- `weight` - the learnable weights of size `dim`, elements initialized to 1.
- `bias` - the learnable bias of shape `dim`, elements initialized to 0.
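
A NumPy sketch of the forward computation, assuming a 2D input with features on `axis=1`; `layer_norm` is a hypothetical name. NumPy broadcasts the per-row statistics and the `dim`-sized weight and bias implicitly, whereas in needle each of those broadcasts has to be written out.

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-5):
    mean = x.mean(axis=1, keepdims=True)  # one mean per example
    var = x.var(axis=1, keepdims=True)    # biased estimate: divides by N
    return weight * (x - mean) / np.sqrt(var + eps) + bias

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]], dtype="float32")
y = layer_norm(x, np.ones(3, dtype="float32"), np.zeros(3, dtype="float32"))
print(y.mean(axis=1), y.var(axis=1))
```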
___

In [83]:
!python3 -m pytest -v -k "test_nn_layernorm" -s

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 87 items / 86 deselected / 1 selected

tests/hw2/test_nn_and_optim.py::test_nn_layernorm_backward_1 PASSED



In [84]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_layernorm"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_layernorm...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
Grader test 7 passed
Grader test 8 passed
.




### Flatten
`needle.nn.Flatten()`

Takes in a tensor of shape `(B,X_0,X_1,...)`, and flattens all non-batch dimensions so that the output is of shape `(B, X_0 * X_1 * ...)`
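
The target shape can be computed explicitly from the input shape. A small NumPy sketch, with `flatten` as a hypothetical helper name:

```python
import math
import numpy as np

def flatten(x):
    # Keep the batch dimension and merge all remaining dimensions.
    flat = math.prod(x.shape[1:])
    return x.reshape(x.shape[0], flat)

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)
out = flatten(x)
print(out.shape)
```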

In [85]:
!python3 -m pytest -v -k "test_nn_flatten"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 87 items / 78 deselected / 9 selected

tests/hw2/test_nn_and_optim.py::test_nn_flatten_forward_1 PASSED         [ 11%]
tests/hw2/test_nn_and_optim.py::test_nn_flatten_forward_2 PASSED         [ 22%]
tests/hw2/test_nn_and_optim.py::test_nn_flatten_forward_3 PASSED         [ 33%]
tests/hw2/test_nn_and_optim.py::test_nn_flatten_forward_4 PASSED         [ 44%]
tests/hw2/test_nn_and_optim.py::test_nn_flatten_backward_1 PASSED        [ 55%]
tests/hw2/test_nn_and_optim.py::test_nn_flatten_backward_2 PASSED        [ 66%]
tests/hw2/test_nn_and_optim.py::test_nn_flatten_backward_3 PASSED        [ 77%]
tests/hw2/test_nn_a

In [86]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_flatten"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_flatten...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
Grader test 7 passed
Grader test 8 passed
.



### BatchNorm1d
`needle.nn.BatchNorm1d(dim, eps=1e-5, momentum=0.1, device=None, dtype="float32")`

Applies batch normalization over a mini-batch of inputs as described in the paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167).

\begin{equation}
y = w \circ \frac{x_i - \textbf{E}[x]}{((\textbf{Var}[x]+\epsilon)^{1/2})} + b
\end{equation}

but here the mean and variance are computed over the _batch_ dimension. The function also computes a running average of the mean and variance for all features at each layer, $\hat{\mu}, \hat{\sigma}^2$, and at test time normalizes by these quantities:

\begin{equation}
y = \frac{x - \hat{\mu}}{(\hat{\sigma}^2+\epsilon)^{1/2}}
\end{equation}


BatchNorm uses the running estimates of mean and variance instead of batch statistics at test time, i.e.,
after `model.eval()` has been called and the BatchNorm layer's `training` flag is false.

To compute the running estimates, you can use the equation $$\hat{x}_{new} = (1 - m) \hat{x}_{old} + m x_{observed},$$
where $m$ is momentum.

##### Parameters
- `dim` - input dimension
- `eps` - a value added to the denominator for numerical stability.
- `momentum` - the value used for the running mean and running variance computation.

##### Variables
- `weight` - the learnable weights of size `dim`, elements initialized to 1.
- `bias` - the learnable bias of size `dim`, elements initialized to 0.
- `running_mean` - the running mean used at evaluation time, elements initialized to 0.
- `running_var` - the running (unbiased) variance used at evaluation time, elements initialized to 1. 
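
The split between training and evaluation behaviour can be sketched in NumPy. The function names are hypothetical, and the learnable `weight` and `bias` are left out to keep the focus on the statistics. As an assumption, this sketch feeds the batch variance exactly as computed (dividing by $N$) into the running average; check the tests for the convention the assignment expects.

```python
import numpy as np

def batch_norm_train(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    mean = x.mean(axis=0)  # per-feature statistics over the batch
    var = x.var(axis=0)    # divides by N
    new_mean = (1 - momentum) * running_mean + momentum * mean
    new_var = (1 - momentum) * running_var + momentum * var
    y = (x - mean) / np.sqrt(var + eps)
    return y, new_mean, new_var

def batch_norm_eval(x, running_mean, running_var, eps=1e-5):
    # At evaluation time the running estimates replace the batch statistics.
    return (x - running_mean) / np.sqrt(running_var + eps)

x = np.array([[1.0, 2.0], [3.0, 6.0]])
y, rm, rv = batch_norm_train(x, np.zeros(2), np.ones(2))
y_eval = batch_norm_eval(x, rm, rv)
print(rm, rv)
```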

___

In [96]:
!python3 -m pytest -v -k "test_nn_batchnorm"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 87 items / 79 deselected / 8 selected

tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_check_model_eval_switches_training_flag_1 PASSED [ 12%]
tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_forward_1 PASSED       [ 25%]
tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_forward_affine_1 PASSED [ 37%]
tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_backward_1 PASSED      [ 50%]
tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_backward_affine_1 PASSED [ 62%]
tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_running_mean_1 PASSED  [ 75%]
tests/hw2/test_nn_and_optim.py::test_nn_batchnorm_running_var_1 PASSED   [

In [97]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_batchnorm"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_batchnorm...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
Grader test 6 passed
Grader test 7 passed
Grader test 8 passed
Grader test 9 passed
.



### Dropout
`needle.nn.Dropout(p = 0.5)`

During training, randomly zeroes some of the elements of the input tensor with probability `p` using samples from a Bernoulli distribution. This has proven to be an effective technique for regularization and preventing the co-adaptation of neurons as described in the paper [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580). During evaluation the module simply computes an identity function. 

\begin{equation}
\hat{z}_{i+1} = \sigma_i (W_i^T z_i + b_i) \\
(z_{i+1})_j = 
    \begin{cases}
    (\hat{z}_{i+1})_j /(1-p) & \text{with probability } 1-p \\
    0 & \text{with probability } p \\
    \end{cases}
\end{equation}

The division by $1-p$ keeps the expected activation unchanged, since

$$
\mathbb{E}\big[\operatorname{Dropout}((\hat{z}_{i+1})_j)\big]
= (1-p)\,\frac{(\hat{z}_{i+1})_j}{1-p} + p \cdot 0
= (\hat{z}_{i+1})_j
$$

**Important**: If the Dropout module has the flag `training=False`, you shouldn't "dropout" any activations. That is, dropout applies during training only, not during evaluation. Note that `training` is a flag in `nn.Module`.

##### Parameters
- `p` - the probability of an element to be zeroed.

Utils in `python/needle/init/init_basic.py` might be helpful when implementing Dropout.
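
The masking and rescaling can be checked empirically. This is a NumPy sketch with a hypothetical `dropout` function and an explicit random generator (both assumptions for illustration), not the needle module.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    if not training:
        return x  # identity at evaluation time
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape) >= p  # True with probability 1 - p
    return x * keep / (1 - p)        # rescale the surviving activations

x = np.ones((1000, 100), dtype="float32")
y = dropout(x, p=0.25, rng=np.random.default_rng(0))
print(y.mean())
```

Every output entry is either 0 or $1/(1-p)$, and the sample mean stays close to the input mean of 1.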
___

In [99]:
!python3 -m pytest -v -k "test_nn_dropout"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 87 items / 85 deselected / 2 selected

tests/hw2/test_nn_and_optim.py::test_nn_dropout_forward_1 PASSED         [ 50%]
tests/hw2/test_nn_and_optim.py::test_nn_dropout_backward_1 PASSED        [100%]



In [100]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_dropout"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_dropout...
Grader test 1 passed
Grader test 2 passed
.



### Residual
`needle.nn.Residual(fn: Module)`

Applies a residual or skip connection given module $\mathcal{F}$ and input Tensor $x$, returning $\mathcal{F}(x) + x$.
##### Parameters
- `fn` - module of type `needle.nn.Module`
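
As a minimal sketch, with a plain callable standing in for the wrapped module (an assumption for illustration; `residual` is a hypothetical helper):

```python
def residual(fn, x):
    # fn(x) must have the same shape as x for the sum to be defined.
    return fn(x) + x

out = residual(lambda v: 2 * v, 5.0)  # 2 * 5 + 5
print(out)
```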

In [101]:
!python3 -m pytest -v -k "test_nn_residual"

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 87 items / 85 deselected / 2 selected

tests/hw2/test_nn_and_optim.py::test_nn_residual_forward_1 PASSED        [ 50%]
tests/hw2/test_nn_and_optim.py::test_nn_residual_backward_1 PASSED       [100%]



In [102]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "nn_residual"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py
Submitting nn_residual...
Grader test 1 passed
Grader test 2 passed
.



## Question 3

Implement the `step` function of the following optimizers in `python/needle/optim.py`.
Make sure that your optimizers _don't_ modify the gradients of tensors in-place.

We have included some tests to ensure that you are not consuming excessive memory, which can happen if you are
not using `.data` or `.detach()` in the right places, thus building an increasingly large computational graph
(not just in the optimizers, but in the previous modules as well).
You can ignore these tests, which include the string `memory_check`, at your own discretion.

___

### SGD
`needle.optim.SGD(params, lr=0.01, momentum=0.0, weight_decay=0.0)`

Implements stochastic gradient descent (optionally with momentum, shown as $\beta$ below). 

\begin{equation}
\begin{split}
    u_{t+1} &= \beta u_t + (1-\beta) \nabla_\theta f(\theta_t) \\
    \theta_{t+1} &= \theta_t - \alpha u_{t+1}
\end{split}
\end{equation}

##### Parameters
- `params` - iterable of parameters of type `needle.nn.Parameter` to optimize
- `lr` (*float*) - learning rate
- `momentum` (*float*) - momentum factor
- `weight_decay` (*float*) - weight decay (L2 penalty)

Implementation of `clip_grad_norm` can be skipped for this homework.
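
One update of the momentum rule can be traced on a single scalar parameter. In this sketch `sgd_step` is a hypothetical helper, and folding weight decay into the gradient as an L2 penalty term is an assumption, since the equations above do not show where it enters.

```python
def sgd_step(theta, grad, u, lr=0.01, momentum=0.0, weight_decay=0.0):
    # Assumption: weight decay is added to the gradient as an L2 penalty term.
    grad = grad + weight_decay * theta
    u = momentum * u + (1 - momentum) * grad
    theta = theta - lr * u
    return theta, u

theta, u = sgd_step(theta=1.0, grad=2.0, u=0.0, lr=0.1, momentum=0.9)
print(theta, u)
```

Starting from $u_0 = 0$, the first step gives $u_1 = 0.1 \cdot 2 = 0.2$ and $\theta_1 = 1 - 0.1 \cdot 0.2 = 0.98$.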
___

In [166]:
!python3 -m pytest -v -k "test_optim_sgd" -s

platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0 -- /opt/homebrew/anaconda3/envs/ml_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 87 items / 81 deselected / 6 selected

tests/hw2/test_nn_and_optim.py::test_optim_sgd_vanilla_1 FAILED
tests/hw2/test_nn_and_optim.py::test_optim_sgd_momentum_1 FAILED
tests/hw2/test_nn_and_optim.py::test_optim_sgd_weight_decay_1 FAILED
tests/hw2/test_nn_and_optim.py::test_optim_sgd_momentum_weight_decay_1 FAILED
tests/hw2/test_nn_and_optim.py::test_optim_sgd_layernorm_residual_1 FAILED
tests/hw2/test_nn_and_optim.py::test_optim_sgd_z_memory_check_1 PASSED

___________________________ test_optim_sgd_vanilla_1 ___________________________

    def test_optim_sgd_vanilla_1():
        np.testing.assert_all

In [147]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "optim_sgd"

submit
platform darwin -- Python 3.12.7, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/shreyasridhar/Desktop/DL Systems/hw2
plugins: anyio-4.8.0
collected 19 items / 18 deselected / 1 selected

tests/hw2/test_nn_and_optim.py 
Submitting optim_sgd...
Grader test 1 passed
Grader test 2 passed
Grader test 3 passed
Grader test 4 passed
Grader test 5 passed
[32m.[0m



### Adam
`needle.optim.Adam(params, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0)`

Implements the Adam algorithm, proposed in [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980).

\begin{equation}
\begin{split}
u_{t+1} &= \beta_1 u_t + (1-\beta_1) \nabla_\theta f(\theta_t) \\
v_{t+1} &= \beta_2 v_t + (1-\beta_2) (\nabla_\theta f(\theta_t))^2 \\
\hat{u}_{t+1} &= u_{t+1} / (1 - \beta_1^t) \quad \text{(bias correction)} \\
\hat{v}_{t+1} &= v_{t+1} / (1 - \beta_2^t) \quad \text{(bias correction)}\\
\theta_{t+1} &= \theta_t - \alpha \hat{u}_{t+1}/(\hat{v}_{t+1}^{1/2}+\epsilon)
\end{split}
\end{equation}

**Important:** Pay attention to whether or not you are applying bias correction.

##### Parameters
- `params` - iterable of parameters of type `needle.nn.Parameter` to optimize
- `lr` (*float*) - learning rate
- `beta1` (*float*) - coefficient used for computing running average of gradient
- `beta2` (*float*) - coefficient used for computing running average of square of gradient
- `eps` (*float*) - term added to the denominator to improve numerical stability
- `weight_decay` (*float*) - weight decay (L2 penalty)

**Hint**: To help deal with memory issues, try to understand how to use `.data` or `.detach()`
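
To make the bias-correction step concrete, here is a minimal sketch of one Adam update on a single scalar in plain Python (not needle code). The name `adam_step` is hypothetical. Two assumptions are made: the step counter `t` is 1 on the first update (so the denominators `1 - beta**t` are nonzero), and weight decay is added to the gradient as `weight_decay * theta`.

```python
import math


def adam_step(theta, grad, u, v, t, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One bias-corrected Adam update for a single scalar parameter.

    Assumptions: t is a 1-based step count, and the L2 penalty is applied
    by adding weight_decay * theta to the gradient.
    """
    g = grad + weight_decay * theta
    u_next = beta1 * u + (1.0 - beta1) * g        # running average of the gradient
    v_next = beta2 * v + (1.0 - beta2) * g * g    # running average of its square
    u_hat = u_next / (1.0 - beta1 ** t)           # bias correction
    v_hat = v_next / (1.0 - beta2 ** t)           # bias correction
    theta_next = theta - lr * u_hat / (math.sqrt(v_hat) + eps)
    return theta_next, u_next, v_next


# First step from zero-initialized moments with gradient 2.0.
theta, u, v = 1.0, 0.0, 0.0
theta, u, v = adam_step(theta, grad=2.0, u=u, v=v, t=1, lr=0.1)
```

With bias correction, the first step has `u_hat = 2.0` and `v_hat = 4.0`, so the parameter moves by almost exactly `lr` regardless of the gradient's magnitude. Without the correction, the same step would be much smaller, which is one way to check whether the correction is being applied.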

In [None]:
!python3 -m pytest -v -k "test_optim_adam"

In [None]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "optim_adam"

## Question 4

In this question, you will implement two data primitives: `needle.data.DataLoader` and `needle.data.Dataset`. `Dataset` stores the samples and their corresponding labels, and `DataLoader` wraps an iterable around the `Dataset` to enable easy access to the samples. 

For this question, you will be working in the `python/needle/data` directory. 

### Transformations

First we will implement a few transformations that are helpful when working with images. We will stick with a horizontal flip and a random crop for now. Fill out the following functions in `needle/data/data_transforms.py`.
___ 

#### RandomFlipHorizontal
`needle.data.RandomFlipHorizontal(p = 0.5)`

Flips the image horizontally, with probability `p`.

##### Parameters
- `p` (*float*) - The probability of flipping the input image.
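
A minimal NumPy sketch of the idea, assuming images are stored as `H x W x C` arrays so that the horizontal axis is axis 1. The function name and the `rng` argument are hypothetical; the sketch draws from its own generator, so it will likely not reproduce the random sequence the provided tests expect.

```python
import numpy as np


def random_flip_horizontal(img, p=0.5, rng=None):
    """Flip an H x W x C image left-to-right with probability p (layout assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < p:          # rng.random() is uniform on [0, 1)
        return img[:, ::-1, :]    # reverse the width axis
    return img


img = np.arange(6).reshape(1, 3, 2)              # H=1, W=3, C=2
flipped = random_flip_horizontal(img, p=1.0)     # always flips
unchanged = random_flip_horizontal(img, p=0.0)   # never flips
```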
___

#### RandomCrop
`needle.data.RandomCrop(padding=3)`

Padding is added to all sides of the image, and then the image is cropped back to its original size at a random location. Returns an image the same size as the original image.

##### Parameters
- `padding` (*int*) - The padding on each border of the image.
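
The pad-then-crop procedure can be sketched in NumPy as follows, assuming `H x W x C` images and zero padding. The function name, the `rng` argument, and the way the offsets are drawn are illustrative assumptions; the provided tests may expect a specific sequence of random calls that this sketch does not reproduce.

```python
import numpy as np


def random_crop(img, padding=3, rng=None):
    """Zero-pad an H x W x C image, then crop an H x W window at a random offset."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = img.shape
    # Shifts are drawn uniformly from {-padding, ..., padding}.
    shift_x, shift_y = rng.integers(-padding, padding + 1, size=2)
    padded = np.pad(img, ((padding, padding), (padding, padding), (0, 0)))
    top, left = padding + shift_x, padding + shift_y   # both lie in [0, 2 * padding]
    return padded[top:top + h, left:left + w, :]


img = np.ones((4, 5, 1), dtype=np.float32)
cropped = random_crop(img, padding=2, rng=np.random.default_rng(0))
identity = random_crop(img, padding=0)   # no padding means the image is returned unchanged
```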

In [None]:
!python3 -m pytest -v -k "flip_horizontal"
!python3 -m pytest -v -k "random_crop"

In [None]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "flip_horizontal"
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "random_crop"

The Dataset is in charge of what your data is (e.g., an image/label pair) and how to get a single sample. The Dataloader is in charge of how your data is fed into training (e.g., batching, shuffling, iterating over epochs).

### Dataset

Each **subclass** of the `Dataset` class must implement three functions: `__init__`, `__len__`, and `__getitem__`.
* The `__init__` function initializes the images, labels, and transforms.
* The `__len__` function returns the number of samples in the dataset.
* The `__getitem__` function retrieves a sample from the dataset at a given index `idx`, **calls the transform functions on the image (if applicable)**, and converts the image and label to numpy arrays (the data will be converted to Tensors elsewhere). The output of `__getitem__` and `__next__` should be NDArrays, and their shapes should follow the layout (Datapoint Number, Feature Dim 1, Feature Dim 2, ...).

Fill out these functions in the `MNISTDataset` class in `needle/data/datasets/mnist_dataset.py`. You can use your solution to `parse_mnist` from the previous homework for the `__init__` function.
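
The three-method contract can be sketched with a small in-memory dataset. This is not the `MNISTDataset` solution: `ToyImageDataset` is a hypothetical class that takes arrays directly instead of parsing files, and it handles only a single integer index.

```python
import numpy as np


class ToyImageDataset:
    """Hypothetical in-memory dataset illustrating __init__/__len__/__getitem__."""

    def __init__(self, images, labels, transforms=None):
        self.images = np.asarray(images, dtype=np.float32)   # (N, H, W, C)
        self.labels = np.asarray(labels, dtype=np.uint8)     # (N,)
        self.transforms = transforms or []

    def __len__(self):
        return self.images.shape[0]

    def __getitem__(self, idx):
        img = self.images[idx]
        for transform in self.transforms:   # apply transforms in order, if any
            img = transform(img)
        return img, self.labels[idx]


images = np.zeros((5, 28, 28, 1))
images[:, 0, 0, 0] = 1.0                     # mark the top-left pixel of every image
flip = lambda x: x[:, ::-1, :]               # stand-in for a horizontal-flip transform
dataset = ToyImageDataset(images, labels=[0, 1, 2, 3, 4], transforms=[flip])
img, label = dataset[2]
```

After the flip, the marked pixel appears in the top-right corner of `img`, which shows that the transform ran inside `__getitem__` rather than at construction time.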

### MNISTDataset (subclass of Dataset)
`needle.data.MNISTDataset(image_filename, label_filename, transforms)`

##### Parameters
- `image_filename` - path of file containing images
- `label_filename` - path of file containing labels
- `transforms` - an optional list of transforms to apply to data


In [None]:
!python3 -m pytest -v -k "test_mnist_dataset"

In [None]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "mnist_dataset"

### Dataloader

`needle.data.DataLoader(dataset: Dataset, batch_size: Optional[int] = 1, shuffle: bool = False)`


In `needle/data/data_basic.py`, the `DataLoader` class provides an interface for assembling mini-batches of examples suitable for training using SGD-based approaches, backed by a `Dataset` object.  In order to build the typical dataloader interface (allowing users to iterate over all the mini-batches in the dataset), you will need to implement the `__iter__()` and `__next__()` methods in the class:
* `__iter__()` is called at the start of a new epoch (i.e., whenever you begin looping over the dataloader).
* `__next__()` is called once per batch until the epoch is finished.

Note that each call to `__next__()` must return the batch that follows the previous one, so `__next__()` is stateful rather than a pure function.

##### Purpose

Combines a dataset and a sampler, and provides an iterable over the given dataset. 

##### Parameters
- `dataset` - `needle.data.Dataset` - a dataset 
- `batch_size` - `int` - what batch size to serve the data in 
- `shuffle` - `bool` - set to ``True`` to reshuffle the data **at the beginning of every epoch**, default ``False``.
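
The iteration protocol described above can be sketched as follows. `ToyDataLoader` is a hypothetical stand-in that returns NumPy arrays rather than needle Tensors, and its batching strategy (slicing an index ordering into chunks of `batch_size`) is one possible choice, not necessarily the one in the starter code.

```python
import numpy as np


class ToyDataLoader:
    """Hypothetical loader: __iter__ starts an epoch, __next__ yields one batch."""

    def __init__(self, dataset, batch_size=1, shuffle=False, rng=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng() if rng is None else rng

    def __iter__(self):
        n = len(self.dataset)
        # Reshuffle at the start of every epoch when shuffle=True.
        order = self.rng.permutation(n) if self.shuffle else np.arange(n)
        self.batches = [order[i:i + self.batch_size]
                        for i in range(0, n, self.batch_size)]
        self.pos = 0
        return self

    def __next__(self):
        if self.pos >= len(self.batches):
            raise StopIteration              # epoch finished
        indices = self.batches[self.pos]
        self.pos += 1                        # state that makes __next__ non-pure
        samples = [self.dataset[int(i)] for i in indices]
        # Stack each field (features, labels) along a new batch axis.
        return tuple(np.stack(field) for field in zip(*samples))


toy_data = [(np.full((2,), i, dtype=np.float32), i) for i in range(5)]
loader = ToyDataLoader(toy_data, batch_size=2)
shapes = [x.shape for x, _ in loader]                     # first epoch
labels = [int(v) for _, y in loader for v in y]           # second epoch
```

With five samples and `batch_size=2`, the last batch holds a single sample, and looping over `loader` a second time works because `__iter__` resets the position.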
___ 





In [None]:
!python3 -m pytest -v -k "test_dataloader"

In [None]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "dataloader"

## Question 5

Now that you have implemented all the necessary components for our neural network library, let's build and train an MLP ResNet. For this question, you will be working in `apps/mlp_resnet.py`. First, fill out the functions `ResidualBlock` and `MLPResNet` as described below:

### ResidualBlock
`ResidualBlock(dim, hidden_dim, norm=nn.BatchNorm1d, drop_prob=0.1)`

Implements a residual block as follows:

<p align="center">
  <img src="https://github.com/dlsyscourse/hw2/blob/f4c994506f2c76d7fdcc5a711a483e31b189afaa/figures/residualblock.png?raw=true" alt="Residual Block"/>
</p>

**NOTE**: if the figure does not render, please see the figure in the `figures` directory.

where the first linear layer has `in_features=dim` and `out_features=hidden_dim`, and the last linear layer has `out_features=dim`. Returns the block as type `nn.Module`. 

##### Parameters
- `dim` (*int*) - input dim
- `hidden_dim` (*int*) - hidden dim
- `norm` (*nn.Module*) - normalization method
- `drop_prob` (*float*) - dropout probability

___

### MLPResNet
`MLPResNet(dim, hidden_dim=100, num_blocks=3, num_classes=10, norm=nn.BatchNorm1d, drop_prob=0.1)`

Implements an MLP ResNet as follows:

<p align="center">
  <img src="https://github.com/dlsyscourse/hw2/blob/f4c994506f2c76d7fdcc5a711a483e31b189afaa/figures/mlp_resnet.png?raw=true" alt="MLP Resnet"/>
</p>

where the first linear layer has `in_features=dim` and `out_features=hidden_dim`, and each ResidualBlock has `dim=hidden_dim` and `hidden_dim=hidden_dim//2`. Returns a network of type `nn.Module`.

##### Parameters
- `dim` (*int*) - input dim
- `hidden_dim` (*int*) - hidden dim
- `num_blocks` (*int*) - number of ResidualBlocks
- `num_classes` (*int*) - number of classes
- `norm` (*nn.Module*) - normalization method
- `drop_prob` (*float*) - dropout probability (0.1)

**Note**: Modules should be initialized to match the order of execution in the Resnet.
___ 

Once you have the deep learning model architecture correct, let's train the network using our new neural network library components. Specifically, implement the functions `epoch` and `train_mnist`.

### Epoch

`epoch(dataloader, model, opt=None)`

Executes one epoch of training or evaluation, iterating over the entire training dataset once (just like `nn_epoch` from previous homeworks). Returns the average error rate (as a *float*) and the average loss over all samples (as a *float*). Set the model to `training` mode at the beginning of the function if `opt` is given; set the model to `eval` if `opt` is not given (i.e. `None`). When setting the modes, use `.train()` and `.eval()` instead of modifying the training attribute.

##### Parameters
- `dataloader` (*`needle.data.DataLoader`*) - dataloader returning samples from the training dataset
- `model` (*`needle.nn.Module`*) - neural network
- `opt` (*`needle.optim.Optimizer`*) - optimizer instance, or `None`
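
Because the returned error rate and loss are averages over all samples, they should be accumulated as totals and divided by the sample count at the end; averaging per-batch means would over-weight a smaller final batch. The NumPy sketch below shows only this bookkeeping, with hand-written logits standing in for model outputs. The helper names are hypothetical, and the model, mode switching, and optimizer calls are deliberately omitted.

```python
import numpy as np


def batch_stats(logits, y):
    """Return (number of misclassified samples, summed softmax loss) for one batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # for numerical stability
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))
    losses = log_sum_exp - shifted[np.arange(len(y)), y]    # per-sample softmax loss
    errors = int((logits.argmax(axis=1) != y).sum())
    return errors, float(losses.sum())


def aggregate_epoch(batches):
    """Average error rate and loss over all samples, not over batches."""
    total_errors, total_loss, n = 0, 0.0, 0
    for logits, y in batches:
        errors, loss = batch_stats(logits, y)
        total_errors += errors
        total_loss += loss
        n += len(y)
    return total_errors / n, total_loss / n


toy_batches = [
    (np.array([[2.0, 0.0], [0.0, 2.0]]), np.array([0, 0])),  # one right, one wrong
    (np.array([[0.0, 0.0]]), np.array([0])),                 # smaller final batch
]
avg_error, avg_loss = aggregate_epoch(toy_batches)
```

Here one of three samples is misclassified, so `avg_error` is 1/3, whereas averaging the per-batch error rates (0.5 and 0.0) would give 0.25.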

___

### Train MNIST

`train_mnist(batch_size=100, epochs=10, optimizer=ndl.optim.Adam, lr=0.001, weight_decay=0.001, hidden_dim=100, data_dir="data")`
                
Initializes a training dataloader (with `shuffle` set to `True`) and a test dataloader for MNIST data, and trains an `MLPResNet` using the given optimizer (if `opt` is not None) and the softmax loss for a given number of epochs. Returns a tuple of the training error, training loss, test error, test loss computed in the last epoch of training. If any parameters are not specified, use the default parameters.

##### Parameters
- `batch_size` (*int*) - batch size to use for train and test dataloader
- `epochs` (*int*) - number of epochs to train for
- `optimizer` (*`needle.optim.Optimizer` type*) - optimizer type to use
- `lr` (*float*) - learning rate 
- `weight_decay` (*float*) - weight decay
- `hidden_dim` (*int*) - hidden dim for `MLPResNet`
- `data_dir` (*str*) - directory containing MNIST image/label files


In [None]:
!python3 -m pytest -v -k "test_mlp"

In [None]:
!python3 -m mugrade submit "qTskW8hPqLZXWgkH0eHH" "$HW2_NAME" -k "mlp_resnet"

We encourage you to experiment with the `mlp_resnet.py` training script.
You can investigate the effect of using different initializers on the Linear layers,
increasing the dropout probability,
or adding transforms (via a list to the `transforms=` keyword argument of Dataset)
such as random cropping.