Trey Tuscai and Gordon Doore

Spring 2025

CS 444: Deep Learning

Project 1: Deep Neural Networks 

#### Week 2: Training deeper networks with blocks

The focus this week is on using block design to organize deeper neural networks.

In [3]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=3)

# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

## Task 5: Building deeper neural networks with blocks

In the quest to classify CIFAR-10 images with the highest accuracy possible, we would like to build a neural network that is deeper than VGG4 and has a greater capacity to learn more complex, nonlinear patterns in the images. Let's focus on designing a slightly deeper network than VGG4 that we will call VGG6 that has the following architecture:

Conv2D → Conv2D → MaxPool2D → **Conv2D → Conv2D → MaxPool2D** → Flatten → *Dense → Dropout* → Dense

Notice how the bold set of `Conv2D`/`MaxPool2D` layers are repeats of the layers to their left. It turns out, it may be beneficial to replicate the `Dense`/`Dropout` layers (italicized) toward the end of the network multiple times as well in deeper versions.

Review your code for assembling `VGG4`. Building `VGG6` would require some copy-pasting of layer creation code. Imagine building even deeper versions with even more layers (e.g. `VGG9`): this copy-paste process would get tedious, unwieldy, and potentially error-prone as the network grows!

For this reason, modern deep neural networks are often built using **blocks**: sequences of layers that repeat over and over again as you get farther into the network. For example, imagine replacing the layers **Conv2D → Conv2D → MaxPool2D** with a SINGLE new object that represents performing that sequence of those 3 layers. If we also do this for the `Dense`/`Dropout` layers, the architecture would look like:

VGGConvBlock_0 → **VGGConvBlock_1** → Flatten → *VGGDenseBlock_0* → Dense

Much simpler, more manageable, and easier to scale up to deeper nets!
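
As a rough illustration of the idea (not the course's `Block` class), a block can be thought of as an object that stores an ordered list of layers and applies them in sequence. In the sketch below, `ToyBlock` and the stand-in "layers" `scale2` and `relu` are hypothetical names made up for this example, and they operate on plain Python lists rather than tensors.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: an ordered sequence of callables."""

    def __init__(self, name, layers):
        self.name = name
        self.layers = list(layers)

    def __call__(self, x):
        # Forward pass: feed the output of each layer into the next one.
        for layer in self.layers:
            x = layer(x)
        return x


def scale2(xs):
    """Toy 'layer' that doubles every value."""
    return [2.0 * v for v in xs]


def relu(xs):
    """Toy 'layer' that clips negative values to zero."""
    return [max(0.0, v) for v in xs]


# Two identical blocks followed by a regular (non-block) layer, here `sum`.
block_0 = ToyBlock('block_0', [scale2, relu])
block_1 = ToyBlock('block_1', [scale2, relu])
toy_net = ToyBlock('toy_net', [block_0, block_1, sum])

toy_out = toy_net([-1.0, 0.5, 2.0])
print(toy_out)
```

Because a block is called the same way as a single layer, blocks and regular layers can sit side by side in the same sequence, which is what makes deeper variants easy to assemble.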



### 5a. Build and test `VGG` blocks

The file `block.py` contains both the `Block` class and the `VGGConvBlock` and `VGGDenseBlock` classes referenced above. The `Block` class is the parent class to all `Block` classes (*both ones you write this week and for the rest of the semester!*) and is designed to work with `DeepNetwork`. Just like `DeepNetwork`, it contains all the "boilerplate" code that needs to be written for ANY block.

Aside from the constructor, I am providing you with the `Block` class fully implemented :) You only need to write code that assembles the layers that belong to a block and specify how the forward pass thru them is done. Blocks can be mixed-and-matched and interspersed with regular layers! Nice!

Implement and test the following classes and methods.

**Block:**
- constructor.

**VGGConvBlock:**
- constructor: What layers belong to a `VGGConvBlock` block?
- `__call__`: How do we perform the forward pass thru the block?

**VGGDenseBlock:**
- constructor: What layers belong to a `VGGDenseBlock` block?
- `__call__`: How do we perform the forward pass thru the block?


In [4]:
from block import VGGConvBlock, VGGDenseBlock

#### Test: `VGGConvBlock` part 1/2

In [5]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 4, 4, 3))
conv_block = VGGConvBlock('TestBlock', units=5, prev_layer_or_block=None, wt_scale=1e-1)
conv_block(x_test_1)
print(conv_block)

TestBlock:
	MaxPool2D layer output(TestBlock/maxpool2) shape: [1, 2, 2, 5]
	Conv2D layer output(TestBlock/conv_1) shape: [1, 4, 4, 5]
	Conv2D layer output(TestBlock/conv_0) shape: [1, 4, 4, 5]


The above should print (naming might be different):

```
TestBlock:
	MaxPool2D layer output(TestBlock/maxpool2) shape: [1, 2, 2, 5]
	Conv2D layer output(TestBlock/conv1) shape: [1, 4, 4, 5]
	Conv2D layer output(TestBlock/conv0) shape: [1, 4, 4, 5]
```

In [6]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(2, 4, 4, 3))
acts = conv_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[[[0.372 0.487 0.071 0.158 0.   ]
   [0.156 0.412 0.175 0.116 0.019]]

  [[0.51  0.548 0.085 0.299 0.   ]
   [0.375 0.327 0.169 0.209 0.   ]]]


 [[[0.25  0.551 0.    0.321 0.022]
   [0.461 0.47  0.116 0.132 0.044]]

  [[0.37  0.546 0.003 0.221 0.009]
   [0.37  0.486 0.123 0.054 0.   ]]]]''')

Your block net_acts are
[[[[0.372 0.487 0.071 0.158 0.   ]
   [0.156 0.412 0.175 0.116 0.019]]

  [[0.51  0.548 0.085 0.299 0.   ]
   [0.375 0.327 0.169 0.209 0.   ]]]


 [[[0.25  0.551 0.    0.321 0.022]
   [0.461 0.47  0.116 0.132 0.044]]

  [[0.37  0.546 0.003 0.221 0.009]
   [0.37  0.486 0.123 0.054 0.   ]]]]
and they should be:
[[[[0.372 0.487 0.071 0.158 0.   ]
   [0.156 0.412 0.175 0.116 0.019]]

  [[0.51  0.548 0.085 0.299 0.   ]
   [0.375 0.327 0.169 0.209 0.   ]]]


 [[[0.25  0.551 0.    0.321 0.022]
   [0.461 0.47  0.116 0.132 0.044]]

  [[0.37  0.546 0.003 0.221 0.009]
   [0.37  0.486 0.123 0.054 0.   ]]]]


#### Test: `VGGConvBlock` part 2/2

In [17]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 4, 4, 3))
conv_block = VGGConvBlock('TestBlock', units=7, prev_layer_or_block=None, dropout=True)
conv_block(x_test_1)
print(conv_block)

TestBlock:
	Dropout layer output(TestBlock/dropout) shape: [1, 2, 2, 7]
	MaxPool2D layer output(TestBlock/maxpool2) shape: [1, 2, 2, 7]
	Conv2D layer output(TestBlock/conv_1) shape: [1, 4, 4, 7]
	Conv2D layer output(TestBlock/conv_0) shape: [1, 4, 4, 7]


The above should print (naming might be different):

```
TestBlock:
	Dropout layer output(TestBlock/dropout) shape: [1, 2, 2, 7]
	MaxPool2D layer output(TestBlock/maxpool2) shape: [1, 2, 2, 7]
	Conv2D layer output(TestBlock/conv1) shape: [1, 4, 4, 7]
	Conv2D layer output(TestBlock/conv0) shape: [1, 4, 4, 7]
```

In [18]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(1, 4, 4, 3))
acts = conv_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[[[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]

  [[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]]]''')

Your block net_acts are
[[[[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]

  [[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]]]
and they should be:
[[[[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]

  [[0.002 0.002 0.    0.    0.    0.    0.   ]
   [0.002 0.002 0.    0.    0.    0.    0.   ]]]]


#### Test: `VGGDenseBlock` part 1/2

In [19]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 6))
dense_block = VGGDenseBlock('TestDenseBlock', units=(2,), prev_layer_or_block=None, wt_scale=1e-1)
dense_block(x_test_1)
print(dense_block)

TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 2]
	Dense layer output(TestDenseBlock/dense_0) shape: [1, 2]


The above should print (naming might be different):

```
TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 2]
	Dense layer output(TestDenseBlock/dense0) shape: [1, 2]
```

In [20]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(3, 6))
acts = dense_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[0.   0.  ]
 [0.   0.  ]
 [0.   0.07]]''')

Your block net_acts are
[[0.   0.  ]
 [0.   0.  ]
 [0.   0.07]]
and they should be:
[[0.   0.  ]
 [0.   0.  ]
 [0.   0.07]]


#### Test: `VGGDenseBlock` part 2/2

In [21]:
tf.random.set_seed(0)
x_test_1 = tf.random.normal(shape=(1, 7))
dense_block = VGGDenseBlock('TestDenseBlock', units=(4,5), prev_layer_or_block=None, num_dense_blocks=2)
dense_block(x_test_1)
print(dense_block)

TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 5]
	Dense layer output(TestDenseBlock/dense_1) shape: [1, 5]
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 4]
	Dense layer output(TestDenseBlock/dense_0) shape: [1, 4]


The above should print (naming might be different):

```
TestDenseBlock:
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 5]
	Dense layer output(TestDenseBlock/dense1) shape: [1, 5]
	Dropout layer output(TestDenseBlock/dropout) shape: [1, 4]
	Dense layer output(TestDenseBlock/dense0) shape: [1, 4]
```

In [22]:
tf.random.set_seed(1)
x_test_2 = tf.random.normal(shape=(2, 7))
acts = dense_block(x_test_2)
print(f'Your block net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[0.002 0.002 0.    0.    0.   ]
 [0.002 0.002 0.    0.    0.   ]]''')

Your block net_acts are
[[0.002 0.002 0.    0.    0.   ]
 [0.002 0.002 0.    0.    0.   ]]
and they should be:
[[0.002 0.002 0.    0.    0.   ]
 [0.002 0.002 0.    0.    0.   ]]


### 5b. Build `VGG6`

Now that you have both types of VGG blocks implemented and tested, make use of them to write the `VGG6` constructor and `__call__` methods in `vgg_nets.py`. This should be a quick process.

In [23]:
from vgg_nets import VGG6

#### Test: `VGG6`

In [24]:
test_vgg6_0 = VGG6(C=5, input_feats_shape=(8, 8, 3))
test_vgg6_0.compile()

---------------------------------------------------------------------------
Dense layer output(output_layer) shape: [1, 5]
DenseBlock1:
	Dropout layer output(DenseBlock1/dropout) shape: [1, 256]
	Dense layer output(DenseBlock1/dense_0) shape: [1, 256]
Flatten layer output(flat) shape: [1, 512]
ConvBlock2:
	MaxPool2D layer output(ConvBlock2/maxpool2) shape: [1, 2, 2, 128]
	Conv2D layer output(ConvBlock2/conv_1) shape: [1, 4, 4, 128]
	Conv2D layer output(ConvBlock2/conv_0) shape: [1, 4, 4, 128]
ConvBlock1:
	MaxPool2D layer output(ConvBlock1/maxpool2) shape: [1, 4, 4, 64]
	Conv2D layer output(ConvBlock1/conv_1) shape: [1, 8, 8, 64]
	Conv2D layer output(ConvBlock1/conv_0) shape: [1, 8, 8, 64]
---------------------------------------------------------------------------


The above should print something like (*layer/block names may be different and that's ok*):

```
---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 5]
DenseBlock1:
	Dropout layer output(DenseBlock1/dropout) shape: [1, 256]
	Dense layer output(DenseBlock1/dense0) shape: [1, 256]
Flatten layer output(flat) shape: [1, 512]
ConvBlock2:
	MaxPool2D layer output(ConvBlock2/maxpool2) shape: [1, 2, 2, 128]
	Conv2D layer output(ConvBlock2/conv1) shape: [1, 4, 4, 128]
	Conv2D layer output(ConvBlock2/conv0) shape: [1, 4, 4, 128]
ConvBlock1:
	MaxPool2D layer output(ConvBlock1/maxpool2) shape: [1, 4, 4, 64]
	Conv2D layer output(ConvBlock1/conv1) shape: [1, 8, 8, 64]
	Conv2D layer output(ConvBlock1/conv0) shape: [1, 8, 8, 64]
---------------------------------------------------------------------------
```
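
The shapes in this printout can be reproduced by hand. The sketch below assumes, based only on the printed shapes, that each `Conv2D` layer preserves the spatial size, that each `MaxPool2D` layer halves it, and that the blocks use 64, 128, and 256 units. The helper names are hypothetical and are not part of `vgg_nets.py`.

```python
import math


def conv2d_shape(shape, units):
    # Assumption: the conv layer keeps height and width, only channels change.
    n, h, w, _ = shape
    return [n, h, w, units]


def maxpool2d_shape(shape):
    # Assumption: 2x2 pooling with stride 2 halves height and width.
    n, h, w, c = shape
    return [n, h // 2, w // 2, c]


def conv_block_shape(shape, units):
    # Conv2D -> Conv2D -> MaxPool2D
    shape = conv2d_shape(shape, units)
    shape = conv2d_shape(shape, units)
    return maxpool2d_shape(shape)


def vgg6_shape_trace(input_feats_shape, C):
    """Return the output shape of each top-level VGG6 stage for one sample."""
    shape = [1, *input_feats_shape]
    trace = {}
    shape = conv_block_shape(shape, 64)
    trace['ConvBlock1'] = shape
    shape = conv_block_shape(shape, 128)
    trace['ConvBlock2'] = shape
    shape = [shape[0], math.prod(shape[1:])]
    trace['flat'] = shape
    shape = [shape[0], 256]  # Dense then Dropout; Dropout keeps the shape
    trace['DenseBlock1'] = shape
    trace['output'] = [shape[0], C]
    return trace


for name, shape in vgg6_shape_trace((8, 8, 3), C=5).items():
    print(name, shape)
```

For the `(8, 8, 3)` test input, the flattened size is 2 × 2 × 128 = 512, which matches the `Flatten` line above.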

In [25]:
tf.random.set_seed(0)
x_test_3 = tf.random.normal(shape=(6, 8, 8, 3))

tf.random.set_seed(1)
test_vgg6 = VGG6(C=5, input_feats_shape=(8, 8, 3), wt_scale=1e-1)
acts = test_vgg6(x_test_3)
print(f'Your VGG6 output layer net_acts are\n{acts.numpy()}')
print('and they should be:')
print('''[[0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    0.999 0.    0.001]
 [0.    0.    0.652 0.003 0.345]]''')

Your VGG6 output layer net_acts are
[[0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    0.999 0.    0.001]
 [0.    0.    0.652 0.003 0.345]]
and they should be:
[[0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    1.    0.    0.   ]
 [0.    0.    0.999 0.    0.001]
 [0.    0.    0.652 0.003 0.345]]


### 5c. Train `VGG6` on CIFAR-10 with the default learning rate

In the cells below:
1. Load in the CIFAR-10 dataset.
2. Train for `25` epochs with default lr and other default hyperparameters. Your initial training and val losses should be 2.30 and should hold steady.
3. Print out the final test accuracy.
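
The value 2.30 is not arbitrary. Assuming the loss is the usual mean cross-entropy on a softmax output, a network that assigns equal probability to all 10 classes scores ln(10) on every sample, whatever the true label is. A quick check:

```python
import math

C = 10  # number of CIFAR-10 classes

# A network that has learned nothing predicts 1/C for every class.
uniform_probs = [1.0 / C] * C

# Cross-entropy for one sample is -log(probability of the true class).
chance_loss = -math.log(uniform_probs[0])
chance_acc = 1.0 / C

print(f'chance-level loss: {chance_loss:.4f}')
print(f'chance-level accuracy: {chance_acc:.2f}')
```

A loss that holds steady near this value, with accuracy near 10%, therefore suggests the network is doing no better than guessing.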

#### Important notes

#### 1. Running on CoCalc and GPU

You should do this training session (and all subsequent "real" training sessions this semester) on the GPU in CoCalc. Training at this point on your CPU is basically infeasible (*feel free to try it!*).

#### 2. JIT compiling the train and test steps

While training VGG6 on the GPU should take ~15 secs per epoch, which is not too bad, soon deeper networks and larger datasets will make the training too slow for us (*even on the GPU!*). To speed things up considerably now and going forward, use the process we discussed in class to decorate `train_step` and `test_step` with `@tf.function(jit_compile=True)`. The 1st epoch might be a little slow, but subsequent epochs should now fly by.

**Note:**
- If you have trouble just-in-time (JIT) compiling the train and test steps, you should be able to decorate with `@tf.function` to statically compile the network (non-JIT). This may be slower than JIT compiling the network, but should still be faster than no compilation. **If JIT compiling does not work, please seek help. JIT compiling on CoCalc will be very helpful going forward.**
- If you are training locally on macOS, JIT compiling will not work, but falling back to `@tf.function` should work fine.
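
As a minimal sketch of what compiling a training step looks like, the example below wraps a tiny softmax classifier's update in `tf.function`. The variables `w`, `b`, and `lr`, the function `train_step_eager`, and the `USE_JIT` flag are hypothetical stand-ins for this illustration and are not the course's `DeepNetwork.train_step`. The flag defaults to `False` (plain `@tf.function` behavior) so the sketch also runs where XLA is unavailable.

```python
import tensorflow as tf

# Hypothetical toy parameters: 4 input features, 3 classes.
w = tf.Variable(tf.zeros((4, 3)))
b = tf.Variable(tf.zeros((3,)))
lr = 0.1

# Set to True on machines where XLA works (e.g. the CoCalc GPU).
USE_JIT = False


def train_step_eager(x, y):
    """One gradient descent step on the mean cross-entropy loss."""
    with tf.GradientTape() as tape:
        logits = x @ w + b
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    d_w, d_b = tape.gradient(loss, [w, b])
    w.assign_sub(lr * d_w)
    b.assign_sub(lr * d_b)
    return loss


# Equivalent to decorating the function with @tf.function(jit_compile=USE_JIT).
train_step = tf.function(train_step_eager, jit_compile=USE_JIT)

tf.random.set_seed(0)
x = tf.random.normal((8, 4))
y = tf.constant([0, 1, 2, 0, 1, 2, 0, 1])

# The first call traces (and optionally JIT compiles) the function, so it is the slow one.
losses = [float(train_step(x, y)) for _ in range(5)]
print(losses)
```

The first call pays the tracing and compilation cost, which is why the first epoch tends to be slower than the rest.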

In [28]:
from datasets import get_dataset
x_train, y_train, x_val, y_val, x_test, y_test, classnames = get_dataset("cifar10")
print(f'Your training set data have shape {x_train.shape}')
print(f'Your training set labels have shape {y_train.shape}')
print(f'Your val set data have shape {x_val.shape}')
print(f'Your val set labels have shape {y_val.shape}')
print(f'Your test set data have shape {x_test.shape}')
print(f'Your test set labels have shape {y_test.shape}')

Your training set data have shape (45000, 32, 32, 3)
Your training set labels have shape (45000,)
Your val set data have shape (5000, 32, 32, 3)
Your val set labels have shape (5000,)
Your test set data have shape (10000, 32, 32, 3)
Your test set labels have shape (10000,)


In [29]:
# KEEP ME
tf.random.set_seed(0)

model = VGG6(10, (32, 32, 3))
model.compile()
model.fit(x_train, y_train, x_val, y_val, max_epochs = 25)

---------------------------------------------------------------------------
Dense layer output(output_layer) shape: [1, 10]
DenseBlock1:
	Dropout layer output(DenseBlock1/dropout) shape: [1, 256]
	Dense layer output(DenseBlock1/dense_0) shape: [1, 256]
Flatten layer output(flat) shape: [1, 8192]
ConvBlock2:
	MaxPool2D layer output(ConvBlock2/maxpool2) shape: [1, 8, 8, 128]
	Conv2D layer output(ConvBlock2/conv_1) shape: [1, 16, 16, 128]
	Conv2D layer output(ConvBlock2/conv_0) shape: [1, 16, 16, 128]
ConvBlock1:
	MaxPool2D layer output(ConvBlock1/maxpool2) shape: [1, 16, 16, 64]
	Conv2D layer output(ConvBlock1/conv_1) shape: [1, 32, 32, 64]
	Conv2D layer output(ConvBlock1/conv_0) shape: [1, 32, 32, 64]
---------------------------------------------------------------------------



validation accuracy: 0.09715545177459717
validation loss: 2.3027117252349854
the epoch 0 took 23350198415 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.302881956100464
the epoch 1 took 4937084258 nanoseconds


validation accuracy: 0.09755609184503555
validation loss: 2.302534580230713
the epoch 2 took 4897686598 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.302851438522339
the epoch 3 took 4989851733 nanoseconds


validation accuracy: 0.09575320780277252
validation loss: 2.302898645401001
the epoch 4 took 4958404346 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.3029656410217285
the epoch 5 took 4935547380 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.303056478500366
the epoch 6 took 4922609156 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.3029212951660156
the epoch 7 took 4951683236 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.302774667739868
the epoch 8 took 4909023395 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.3027126789093018
the epoch 9 took 4957690068 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.3029086589813232
the epoch 10 took 4905419408 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.302946090698242
the epoch 11 took 4907441085 nanoseconds


validation accuracy: 0.09835737198591232
validation loss: 2.3030107021331787
the epoch 12 took 4912170999 nanoseconds


validation accuracy: 0.09835737198591232
validation loss: 2.3027966022491455
the epoch 13 took 4899162383 nanoseconds


validation accuracy: 0.09575320780277252
validation loss: 2.3029539585113525
the epoch 14 took 4920430558 nanoseconds


validation accuracy: 0.09575320780277252
validation loss: 2.3027641773223877
the epoch 15 took 4908937900 nanoseconds


validation accuracy: 0.09575320780277252
validation loss: 2.3030130863189697
the epoch 16 took 4960987977 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.3028368949890137
the epoch 17 took 4917212569 nanoseconds


validation accuracy: 0.09575320780277252
validation loss: 2.303051233291626
the epoch 18 took 4971395169 nanoseconds


validation accuracy: 0.09575320780277252
validation loss: 2.3027613162994385
the epoch 19 took 4916185756 nanoseconds


validation accuracy: 0.09755609184503555
validation loss: 2.303297758102417
the epoch 20 took 4903836955 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.303030014038086
the epoch 21 took 4929036646 nanoseconds


validation accuracy: 0.09515224397182465
validation loss: 2.30277943611145
the epoch 22 took 4898157674 nanoseconds


validation accuracy: 0.09835737198591232
validation loss: 2.3026821613311768
the epoch 23 took 4920220455 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.302874803543091
the epoch 24 took 4939137438 nanoseconds
Finished training after 25 epochs!


([2.302648,
  2.3026395,
  2.3026783,
  2.3026333,
  2.3026297,
  2.3026268,
  2.3025768,
  2.3026707,
  2.3026257,
  2.3026733,
  2.3025904,
  2.3026392,
  2.3026583,
  2.3026316,
  2.3026192,
  2.3026462,
  2.3026388,
  2.3025906,
  2.3026145,
  2.3026922,
  2.3025148,
  2.3026054,
  2.3027303,
  2.3026686,
  2.302591],
 [2.3027117,
  2.302882,
  2.3025346,
  2.3028514,
  2.3028986,
  2.3029656,
  2.3030565,
  2.3029213,
  2.3027747,
  2.3027127,
  2.3029087,
  2.302946,
  2.3030107,
  2.3027966,
  2.302954,
  2.3027642,
  2.303013,
  2.302837,
  2.3030512,
  2.3027613,
  2.3032978,
  2.30303,
  2.3027794,
  2.3026822,
  2.3028748],
 [0.09715545,
  0.09715545,
  0.09755609,
  0.09715545,
  0.09575321,
  0.095152244,
  0.095152244,
  0.09715545,
  0.09715545,
  0.095152244,
  0.095152244,
  0.09715545,
  0.09835737,
  0.09835737,
  0.09575321,
  0.09575321,
  0.09575321,
  0.095152244,
  0.09575321,
  0.09575321,
  0.09755609,
  0.095152244,
  0.095152244,
  0.09835737,
  0.10396635],

### 5d. Train `VGG6` on CIFAR-10 with a smaller learning rate

In the cells below, repeat what you did in the previous subtask, but this time change the learning rate to `1e-5`. You should get a very different result.

In [31]:
# KEEP ME
tf.random.set_seed(0)

model = VGG6(10, (32, 32, 3))
model.compile(lr=1e-5)
model.fit(x_train, y_train, x_val, y_val, max_epochs = 25)


---------------------------------------------------------------------------
Dense layer output(output_layer) shape: [1, 10]
DenseBlock1:
	Dropout layer output(DenseBlock1/dropout) shape: [1, 256]
	Dense layer output(DenseBlock1/dense_0) shape: [1, 256]
Flatten layer output(flat) shape: [1, 8192]
ConvBlock2:
	MaxPool2D layer output(ConvBlock2/maxpool2) shape: [1, 8, 8, 128]
	Conv2D layer output(ConvBlock2/conv_1) shape: [1, 16, 16, 128]
	Conv2D layer output(ConvBlock2/conv_0) shape: [1, 16, 16, 128]
ConvBlock1:
	MaxPool2D layer output(ConvBlock1/maxpool2) shape: [1, 16, 16, 64]
	Conv2D layer output(ConvBlock1/conv_1) shape: [1, 32, 32, 64]
	Conv2D layer output(ConvBlock1/conv_0) shape: [1, 32, 32, 64]
---------------------------------------------------------------------------


validation accuracy: 0.10396634787321091
validation loss: 2.302565574645996
the epoch 0 took 8830110753 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.3025691509246826
the epoch 1 took 4995276702 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.302568197250366
the epoch 2 took 5011688533 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.3025717735290527
the epoch 3 took 5028981951 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.3025777339935303
the epoch 4 took 4971059802 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.302582025527954
the epoch 5 took 4997673595 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.3025903701782227
the epoch 6 took 5004457564 nanoseconds


validation accuracy: 0.10396634787321091
validation loss: 2.3025975227355957
the epoch 7 took 4997436096 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.3025996685028076
the epoch 8 took 5008217629 nanoseconds


validation accuracy: 0.09715545177459717
validation loss: 2.3025991916656494
the epoch 9 took 4952340532 nanoseconds


validation accuracy: 0.09455128014087677
validation loss: 2.276357650756836
the epoch 10 took 4962681541 nanoseconds


validation accuracy: 0.2628205120563507
validation loss: 1.9976333379745483
the epoch 11 took 4974535459 nanoseconds


validation accuracy: 0.30588942766189575
validation loss: 1.9358278512954712
the epoch 12 took 5054310517 nanoseconds


validation accuracy: 0.32131409645080566
validation loss: 1.889448881149292
the epoch 13 took 4982013493 nanoseconds


validation accuracy: 0.33673879504203796
validation loss: 1.8528594970703125
the epoch 14 took 4976507921 nanoseconds


validation accuracy: 0.34955930709838867
validation loss: 1.8209173679351807
the epoch 15 took 4971879783 nanoseconds


validation accuracy: 0.36097756028175354
validation loss: 1.7885019779205322
the epoch 16 took 4964907210 nanoseconds


validation accuracy: 0.36117789149284363
validation loss: 1.762065052986145
the epoch 17 took 5057479777 nanoseconds


validation accuracy: 0.3743990361690521
validation loss: 1.740895390510559
the epoch 18 took 5006281792 nanoseconds


validation accuracy: 0.37900641560554504
validation loss: 1.7199517488479614
the epoch 19 took 4984648170 nanoseconds


validation accuracy: 0.3832131326198578
validation loss: 1.7026828527450562
the epoch 20 took 4989012054 nanoseconds


validation accuracy: 0.38401442766189575
validation loss: 1.69154691696167
the epoch 21 took 5028140707 nanoseconds


validation accuracy: 0.3908253312110901
validation loss: 1.676349401473999
the epoch 22 took 4997381782 nanoseconds


validation accuracy: 0.39162659645080566
validation loss: 1.6668856143951416
the epoch 23 took 4970390090 nanoseconds


validation accuracy: 0.39923879504203796
validation loss: 1.6538937091827393
the epoch 24 took 5005588019 nanoseconds
Finished training after 25 epochs!


([2.3025854,
  2.3025942,
  2.3025808,
  2.3025882,
  2.3025885,
  2.3025892,
  2.302584,
  2.3025858,
  2.3025892,
  2.3025887,
  2.3016465,
  2.1040637,
  1.9683406,
  1.9169444,
  1.8890338,
  1.8498698,
  1.8247643,
  1.7920324,
  1.7797607,
  1.7540473,
  1.7399731,
  1.7273189,
  1.7178975,
  1.6994426,
  1.6917915],
 [2.3025656,
  2.3025692,
  2.3025682,
  2.3025718,
  2.3025777,
  2.302582,
  2.3025904,
  2.3025975,
  2.3025997,
  2.3025992,
  2.2763577,
  1.9976333,
  1.9358279,
  1.8894489,
  1.8528595,
  1.8209174,
  1.788502,
  1.762065,
  1.7408954,
  1.7199517,
  1.7026829,
  1.6915469,
  1.6763494,
  1.6668856,
  1.6538937],
 [0.10396635,
  0.10396635,
  0.10396635,
  0.10396635,
  0.10396635,
  0.10396635,
  0.10396635,
  0.10396635,
  0.09715545,
  0.09715545,
  0.09455128,
  0.2628205,
  0.30588943,
  0.3213141,
  0.3367388,
  0.3495593,
  0.36097756,
  0.3611779,
  0.37439904,
  0.37900642,
  0.38321313,
  0.38401443,
  0.39082533,
  0.3916266,
  0.3992388],
 24)

### 5e. Questions

**Question 4:** How does the modified learning rate compare with the default? Why do you think you observed what you did for VGG6 and not VGG4 trained on CIFAR-10 with the default lr?

**Answer 4:** 
With the default learning rate (1e-3), VGG6 never left chance level: the training and validation losses stayed at about 2.30 and validation accuracy hovered around 10% for all 25 epochs. With the smaller learning rate (1e-5), the loss sat on the same plateau for roughly the first 10 epochs and then began to fall, reaching a validation loss of 1.65 and a validation accuracy of about 40% by the final epoch. VGG6 is deeper and has more parameters than VGG4, so its training is more sensitive to the learning rate; the more aggressive updates that the shallower VGG4 tolerated likely kept VGG6 from making progress. In other words, for a deeper network like VGG6, a smaller learning rate may be needed to achieve higher accuracy and lower loss.

## Task 6: Early stopping and He/Kaiming initialization

The experiment that you just ran illuminates two major issues with our training workflow:
1. Training for a fixed, prespecified number of epochs is frustrating: after a long wait, training may be cut off while the net is still learning. It would be nice not to have to set the number of training epochs manually, and instead to keep training as long as the net is making progress.
2. Picking the right lr, which can make or break training, is frustrating. It would be nice for the net to train well across a wide range of lr choices and numbers of layers.

In this section, we will introduce the following techniques to combat these respective issues:
1. Early stopping.
2. He/Kaiming weight initialization (*next week*).

In [0]:
from network import DeepNetwork

### 6a. Implement early stopping

Implement the `early_stopping` method in `DeepNetwork` to determine the appropriate conditions to stop during training.
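
One plausible reading of the expected output of Test 1 in the cell below is sketched here as a standalone function. It assumes the history is a rolling window of the `patience` most recent validation losses, and that training stops once the window is full and no newer loss in it improves on the oldest one. The name `early_stopping_sketch` and the tie-handling rule are assumptions, not the required `DeepNetwork.early_stopping` implementation.

```python
def early_stopping_sketch(recent_val_losses, curr_val_loss, patience):
    """Return the updated rolling loss window and whether to stop training."""
    recent = list(recent_val_losses)  # copy so the caller's list is not mutated

    if len(recent) < patience:
        # Window not yet full: record the loss and keep training.
        recent.append(curr_val_loss)
        return recent, False

    # Window full: drop the oldest loss and add the newest one.
    recent = recent[1:] + [curr_val_loss]

    # Stop if no newer loss in the window is lower than the oldest one.
    # Assumption: ties count as "no improvement".
    stop = all(loss >= recent[0] for loss in recent[1:])
    return recent, stop


# Steadily increasing losses with patience=5, mirroring Test 1.
history = []
for i in range(10):
    history, stop = early_stopping_sketch(history, float(i), patience=5)
    if stop:
        break

print(i, history)
```

With steadily decreasing losses the same function never signals a stop, since the newest loss always improves on the oldest one in the window.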

#### Test: `early_stopping`

In [0]:
dn = DeepNetwork((1,), 0.)

# Test 1
patience_1 = 5
es_lost_hist_1 = []
for iter in range(10):
    curr_loss = float(iter)
    es_lost_hist_1, stop = dn.early_stopping(es_lost_hist_1, curr_loss, patience=patience_1)

    if stop:
        break
print(f'Early stopping Test 1 ({patience_1=}):\n Stopped after {iter} iterations (should be 5 iterations).')
print(f' Recent loss history is {es_lost_hist_1} and should be [1.0, 2.0, 3.0, 4.0, 5.0]')
print()

# Test 2
tf.random.set_seed(1)
patience_2 = 3
es_lost_hist_2 = []
test_2_loss_vals = list(tf.random.uniform(shape=(20,)).numpy())
for iter in range(30):
    curr_loss = test_2_loss_vals[iter]
    es_lost_hist_2, stop = dn.early_stopping(es_lost_hist_2, curr_loss, patience=patience_2)

    if stop:
        break
print(f'Early stopping Test 2 ({patience_2=}):\n Stopped after {iter} iterations (should be 6 iterations).')
print(f' Recent loss history is {es_lost_hist_2} and should be [0.29193902, 0.64250207, 0.9757855]')
print()

# Test 3
tf.random.set_seed(1)
patience_3 = 6
es_lost_hist_3 = []
test_3_loss_vals = list(tf.random.uniform(shape=(20,)).numpy())
for iter in range(30):
    curr_loss = test_3_loss_vals[iter]
    es_lost_hist_3, stop = dn.early_stopping(es_lost_hist_3, curr_loss, patience=patience_3)

    if stop:
        break
print(f'Early stopping Test 3 ({patience_3=}):\n Stopped after {iter} iterations (should be 9 iterations).')
print(f' Recent loss history is\n {es_lost_hist_3}\n and should be')
print(' [0.29193902, 0.64250207, 0.9757855, 0.43509948, 0.6601019, 0.60489583]')
print()



### 6b. Integrate early stopping into training

Modify your `fit` function to support early stopping. Here are the changes to make:

1. Before the training loop, create an empty list to record the rolling list of recent validation loss values within the patience window of epochs.
2. Each time the validation loss is computed, update and check the early stopping conditions. If the conditions are met, end the training early, before `max_epochs` epochs are reached.
3. Make sure the 4th return value is the number of epochs completed before training ended.
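
The three changes above can be sketched with a stripped-down training loop. Everything here is a hypothetical stand-in: `fit_sketch` is not the real `fit`, `train_one_epoch` and `validate` are callbacks that fake the training and validation work, and `stop_check` is a compact placeholder for `early_stopping` that assumes a rolling-window rule. Counting epochs from 1 and validating when `epoch % val_every == 0` are also assumptions.

```python
def stop_check(recent, curr, patience):
    """Compact placeholder for early_stopping (rolling-window assumption)."""
    was_full = len(recent) >= patience
    recent = (recent + [curr])[-patience:]
    stop = was_full and all(loss >= recent[0] for loss in recent[1:])
    return recent, stop


def fit_sketch(train_one_epoch, validate, max_epochs, patience, val_every):
    train_loss_hist, val_loss_hist, val_acc_hist = [], [], []
    recent_val_losses = []  # change 1: rolling list created before the loop
    epochs_run = 0

    for epoch in range(1, max_epochs + 1):
        train_loss_hist.append(train_one_epoch(epoch))
        epochs_run = epoch

        if epoch % val_every == 0:
            val_loss, val_acc = validate(epoch)
            val_loss_hist.append(val_loss)
            val_acc_hist.append(val_acc)

            # change 2: update the window and check whether to stop early
            recent_val_losses, stop = stop_check(recent_val_losses, val_loss, patience)
            if stop:
                break

    # change 3: the 4th return value is the number of epochs that ran
    return train_loss_hist, val_loss_hist, val_acc_hist, epochs_run


# Fake validation loss that bottoms out at epoch 200 and then rises.
def fake_validate(epoch):
    return (epoch - 200) ** 2 / 1e4 + 0.5, 0.0


def fake_train(epoch):
    return 1.0 / epoch


_, val_losses, _, e = fit_sketch(fake_train, fake_validate,
                                 max_epochs=5000, patience=3, val_every=100)
print(e, val_losses)
```

In this toy run the validation loss is only checked every 100 epochs, so the stop can only trigger at a multiple of `val_every`, well before `max_epochs` is reached.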

#### Test: `fit` with early stopping

The following test should end:
- in about 10 secs.
- after 300 epochs.
- with final training loss of 0.04, Val loss of 0.06, Val acc of 96.00%

In [0]:
from layers import Dense

In [0]:
# Quickly make a mock network for testing
class SoftmaxNet(DeepNetwork):
    def __init__(self, input_feats_shape, C, reg=0):
        super().__init__(input_feats_shape, reg)
        self.output_layer = Dense('TestDense', units=C, activation='softmax', prev_layer_or_block=None)

    def __call__(self, x):
        return self.output_layer(x)

# Load in Iris train/validation sets
train_samps = tf.constant(np.load('data/iris/iris_train_samps.npy'), dtype=tf.float32)
train_labels = tf.constant(np.load('data/iris/iris_train_labels.npy'), dtype=tf.int32)
val_samps = tf.constant(np.load('data/iris/iris_val_samps.npy'), dtype=tf.float32)
val_labels = tf.constant(np.load('data/iris/iris_val_labels.npy'), dtype=tf.int32)

# Set some vars
C = 3
M = train_samps.shape[1]
mini_batch_sz = 25
lr = 1e-1
max_epochs = 5000
patience = 3
val_every = 100  # how often (in epochs) we check the val loss/acc/early stopping

# Create our test net
tf.random.set_seed(0)
slnet = SoftmaxNet((M,), C)
slnet.compile(lr=lr)

_, val_loss_hist, val_acc_hist, e = slnet.fit(train_samps, train_labels, val_samps, val_labels,
                                              batch_size=mini_batch_sz,
                                              max_epochs=max_epochs,
                                              patience=patience,
                                              val_every=val_every)

print(75*'-')
print(f'Iris test ended after {e} epochs with final val loss/acc of {val_loss_hist[-1]:.2f}/{val_acc_hist[-1]:.2f}')
print(75*'-')