<a href="https://colab.research.google.com/github/quyettranvu/deep_learning_hands_on/blob/main/chapter_multilayer-perceptrons/mlp-implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The following additional libraries are needed to run this
notebook. Note that running on Colab is experimental; please report a GitHub
issue if you have any problems.

In [54]:
# keep pip up to date
%pip install -U pip

# install d2l but skip its (too strict) dependencies
%pip install d2l==1.0.3 --no-deps

# install dependencies compatible with Python 3.12
# NumPy >= 1.26 has Py3.12 wheels
%pip install "numpy>=1.26,<2" matplotlib pandas jupyter

# Choose the right index for your runtime (CPU vs CUDA). Example for CUDA 12.4:
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Or CPU-only:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

import d2l, numpy as np
print("d2l OK, numpy:", np.__version__)

Looking in indexes: https://download.pytorch.org/whl/cpu
d2l OK, numpy: 1.26.4


# Implementation of Multilayer Perceptrons
:label:`sec_mlp-implementation`

Multilayer perceptrons (MLPs) are not much more complex to implement than simple linear models. The key conceptual
difference is that we now concatenate multiple layers.


In [83]:
import tensorflow as tf
from d2l import tensorflow as d2l

## Implementation from Scratch

Let's begin again by implementing such a network from scratch.

### Initializing Model Parameters

Recall that Fashion-MNIST contains 10 classes,
and that each image consists of a $28 \times 28 = 784$
grid of grayscale pixel values.
As before we will disregard the spatial structure
among the pixels for now,
so we can think of this as a classification dataset
with 784 input features and 10 classes.
To begin, we will [**implement an MLP
with one hidden layer and 256 hidden units.**]
Both the number of layers and their width are adjustable
(they are considered hyperparameters).
Typically, we choose the layer widths to be divisible by larger powers of 2.
This is computationally efficient due to the way
memory is allocated and addressed in hardware.

Again, we will represent our parameters with several tensors.
Note that *for every layer*, we must keep track of
one weight matrix and one bias vector.
As always, we allocate memory
for the gradients of the loss with respect to these parameters.


In the code below we use `tf.Variable`
to define the model parameters.


In [118]:
class MLPScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W1 = tf.Variable(
            tf.random.normal((num_inputs, num_hiddens)) * sigma)
        self.b1 = tf.Variable(tf.zeros(num_hiddens))
        self.W2 = tf.Variable(
            tf.random.normal((num_hiddens, num_outputs)) * sigma)
        self.b2 = tf.Variable(tf.zeros(num_outputs))
        self.loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

    def evaluate_accuracy(self, y_hat, y):
        """Compute accuracy for a multiclass classification problem."""
        if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
            y_hat = tf.argmax(y_hat, axis=1)
        cmp = tf.cast(y_hat, y.dtype) == y
        return tf.reduce_sum(tf.cast(cmp, y.dtype)) / len(y)
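
As a quick check on the shapes used above, the following plain-Python sketch counts the weights and biases of a fully connected network. The helper name `count_mlp_params` is hypothetical and not part of `d2l`; it only restates the rule of one weight matrix and one bias vector per layer.

```python
def count_mlp_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix of shape (fan_in, fan_out)
        total += fan_out           # bias vector of length fan_out
    return total

# 784 inputs -> 256 hidden units -> 10 outputs, as in MLPScratch
print(count_mlp_params([784, 256, 10]))  # 203530
```

Almost all of these parameters sit in the first weight matrix ($784 \times 256 = 200704$), which is why the hidden width dominates the model size.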

### Model

To make sure we know how everything works,
we will [**implement the ReLU activation**] ourselves
rather than invoking the built-in `relu` function directly.


In [119]:
def relu(X):
    return tf.math.maximum(X, 0)

Since we are disregarding spatial structure,
we `reshape` each two-dimensional image into
a flat vector of length  `num_inputs`.
Finally, we (**implement our model**)
with just a few lines of code. Since we use the framework's built-in autograd, this is all that it takes.


In [120]:
@d2l.add_to_class(MLPScratch)
def forward(self, X):
    X = tf.reshape(X, (-1, self.num_inputs))
    H = relu(tf.matmul(X, self.W1) + self.b1)
    return tf.matmul(H, self.W2) + self.b2
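
To make the tensor shapes in the forward pass explicit, here is a minimal NumPy sketch of the same computation. NumPy stands in for TensorFlow, the input batch is random stand-in data, and the names `relu_np` and `forward_np` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inputs, num_hiddens, num_outputs, sigma = 784, 256, 10, 0.01

# Same shapes and initialization scale as MLPScratch
W1 = rng.normal(size=(num_inputs, num_hiddens)) * sigma
b1 = np.zeros(num_hiddens)
W2 = rng.normal(size=(num_hiddens, num_outputs)) * sigma
b2 = np.zeros(num_outputs)

def relu_np(X):
    return np.maximum(X, 0)

def forward_np(X):
    X = X.reshape(-1, num_inputs)   # flatten each 28x28 image to 784 values
    H = relu_np(X @ W1 + b1)        # hidden layer, shape (batch, 256)
    return H @ W2 + b2              # logits, shape (batch, 10)

images = rng.random(size=(4, 28, 28))  # stand-in batch, not real data
logits = forward_np(images)
print(logits.shape)  # (4, 10)
```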

### Training

Fortunately, [**the training loop for MLPs
is exactly the same as for softmax regression.**] We define the model, data, and trainer, then finally invoke the `fit` method on model and data.


In [124]:
model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)

AttributeError: 'numpy.float32' object has no attribute 'numpy'

In [121]:
@d2l.add_to_class(d2l.Classifier)
def evaluate_step(self, batch):
    X, y = batch[:-1][0], batch[-1]
    y_hat = self(X, training=False)
    l = self.loss(tf.one_hot(y, depth=y_hat.shape[-1]), y_hat)
    self.plot('loss', l, train=False)
    self.plot('acc', self.evaluate_accuracy(y_hat, y), train=False)
    return l

In [122]:
@d2l.add_to_class(MLPScratch)
def backward(self, loss, tape):
    params = [self.W1, self.b1, self.W2, self.b2]
    grads = tape.gradient(loss, params)
    self.trainer.optim.apply_gradients(zip(grads, params))

In [123]:
@d2l.add_to_class(d2l.Classifier)
def training_step(self, batch):
    X, y = batch[:-1][0], batch[-1]
    with tf.GradientTape() as tape:
        y_hat = self(X, training=True)

        # Convert y to one-hot encoding
        y_one_hot = tf.one_hot(y, depth=y_hat.shape[-1])

        l = self.loss(y_one_hot, y_hat)
    self.backward(l, tape)
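    # Pass the tensor itself: d2l's `plot` appears to call `.numpy()` on its
    # argument, so passing `l.numpy()` (a numpy.float32) raises the AttributeError
    self.plot('loss', l, train=True)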
    self.plot('acc', self.evaluate_accuracy(y_hat, y), train=True)
    return l
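
The `training_step` above converts the integer labels to one-hot vectors because `CategoricalCrossentropy` expects one-hot targets. The NumPy sketch below (NumPy standing in for TensorFlow, with made-up logits and hypothetical helper names) illustrates that this gives the same value as picking out the log-probability of the true class directly.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def one_hot(y, depth):
    return np.eye(depth)[y]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y = np.array([0, 2])  # integer class labels

log_p = log_softmax(logits)
# Cross-entropy with one-hot targets, averaged over the batch
ce_one_hot = -(one_hot(y, 3) * log_p).sum(axis=1).mean()
# Same quantity computed directly from the integer labels
ce_sparse = -log_p[np.arange(len(y)), y].mean()
print(np.isclose(ce_one_hot, ce_sparse))  # True
```

The two forms differ only in how the labels are encoded, so the choice is a matter of which loss class is used.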

## Concise Implementation

As you might expect, by relying on the high-level APIs, we can implement MLPs even more concisely.

### Model

Compared with our concise implementation
of softmax regression
(:numref:`sec_softmax_concise`),
the only difference is that we add
*two* fully connected layers where we previously added only *one*.
The first is [**the hidden layer**],
the second is the output layer.


In [62]:
class MLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(num_hiddens, activation='relu'),
            tf.keras.layers.Dense(num_outputs)])

Previously, we defined `forward` methods for models to transform input using the model parameters.
These operations are essentially a pipeline:
you take an input and
apply a transformation (e.g.,
matrix multiplication with weights followed by bias addition),
then repetitively use the output of the current transformation as
input to the next transformation.
However, you may have noticed that
no `forward` method is defined here.
In fact, `MLP` inherits the `forward` method from the `Module` class (:numref:`subsec_oo-design-models`) to
simply invoke `self.net(X)` (`X` is input),
which is now defined as a sequence of transformations
via the `Sequential` class.
The `Sequential` class abstracts the forward process
enabling us to focus on the transformations.
We will further discuss how the `Sequential` class works in :numref:`subsec_model-construction-sequential`.
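
The pipeline idea can be sketched in a few lines of plain Python. The class `TinySequential` below is a hypothetical stand-in written for illustration; it is not how `tf.keras.models.Sequential` is implemented, only a picture of chaining transformations.

```python
class TinySequential:
    """Hypothetical stand-in for a sequential container."""

    def __init__(self, layers):
        self.layers = list(layers)

    def __call__(self, X):
        # Feed the output of each transformation into the next one
        for layer in self.layers:
            X = layer(X)
        return X

# Three toy "layers" acting on a list of numbers
scale = lambda X: [2 * x for x in X]
shift = lambda X: [x + 1 for x in X]
relu = lambda X: [max(x, 0) for x in X]

net = TinySequential([scale, shift, relu])
print(net([-3, 0, 2]))  # [0, 1, 5]
```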


### Training

[**The training loop**] is exactly the same
as when we implemented softmax regression.
This modularity enables us to separate
matters concerning the model architecture
from orthogonal considerations.


In [63]:
model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data)

TypeError: 'NoneType' object is not callable

## Summary

Now that we have more practice in designing deep networks, the step from a single to multiple layers of deep networks does not pose such a significant challenge any longer. In particular, we can reuse the training algorithm and data loader. Note, though, that implementing MLPs from scratch is nonetheless messy: naming and keeping track of the model parameters makes it difficult to extend models. For instance, imagine wanting to insert another layer between layers 42 and 43. This might now be layer 42b, unless we are willing to perform sequential renaming. Moreover, if we implement the network from scratch, it is much more difficult for the framework to perform meaningful performance optimizations.

Nonetheless, you have now reached the state of the art of the late 1980s when fully connected deep networks were the method of choice for neural network modeling. Our next conceptual step will be to consider images. Before we do so, we need to review a number of statistical basics and details on how to compute models efficiently.


## Exercises

1. Change the number of hidden units `num_hiddens` and plot how its number affects the accuracy of the model. What is the best value of this hyperparameter?
1. Try adding a hidden layer to see how it affects the results.
1. Why is it a bad idea to insert a hidden layer with a single neuron? What could go wrong?
1. How does changing the learning rate alter your results? With all other parameters fixed, which learning rate gives you the best results? How does this relate to the number of epochs?
1. Let's optimize over all hyperparameters jointly, i.e., learning rate, number of epochs, number of hidden layers, and number of hidden units per layer.
    1. What is the best result you can get by optimizing over all of them?
    1. Why is it much more challenging to deal with multiple hyperparameters?
    1. Describe an efficient strategy for optimizing over multiple parameters jointly.
1. Compare the speed of the framework and the from-scratch implementation for a challenging problem. How does it change with the complexity of the network?
1. Measure the speed of tensor--matrix multiplications for well-aligned and misaligned matrices. For instance, test for matrices with dimension 1024, 1025, 1026, 1028, and 1032.
    1. How does this change between GPUs and CPUs?
    1. Determine the memory bus width of your CPU and GPU.
1. Try out different activation functions. Which one works best?
1. Is there a difference between weight initializations of the network? Does it matter?


In [125]:
import tensorflow as tf
import numpy as np, time, itertools, math
from tensorflow.keras import layers, models

# --- Data ---
(xtr, ytr), (xte, yte) = tf.keras.datasets.fashion_mnist.load_data()
xtr = (xtr.astype("float32")/255.0).reshape(-1, 28*28)
xte = (xte.astype("float32")/255.0).reshape(-1, 28*28)

def build_mlp(
    input_dim=784,
    num_classes=10,
    hidden_layers=(256,),
    activation="relu",
    kernel_init="he_normal",   # "glorot_uniform" for tanh/sigmoid
    use_bn=False,
    dropout=0.0
):
    m = models.Sequential()
    m.add(layers.Input(shape=(input_dim,)))
    for h in hidden_layers:
        m.add(layers.Dense(h, activation=None, kernel_initializer=kernel_init))
        if use_bn:
            m.add(layers.BatchNormalization())
        m.add(layers.Activation(activation))
        if dropout>0:
            m.add(layers.Dropout(dropout))
    m.add(layers.Dense(num_classes, activation="softmax"))
    return m

def train_eval(
    hidden_layers=(256,),
    lr=1e-2,
    epochs=10,
    batch_size=256,
    activation="relu",
    kernel_init="he_normal",
    use_bn=False,
    dropout=0.0,
    verbose=0
):
    m = build_mlp(
        hidden_layers=hidden_layers, activation=activation,
        kernel_init=kernel_init, use_bn=use_bn, dropout=dropout
    )
    m.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    hist = m.fit(xtr, ytr, validation_split=0.1, epochs=epochs, batch_size=batch_size, verbose=verbose)
    test_loss, test_acc = m.evaluate(xte, yte, verbose=0)
    return test_acc, hist.history

[Discussions](https://discuss.d2l.ai/t/227)


In [126]:
# Vary the number of hidden units
units_grid = [32, 64, 128, 256, 512, 1024]
results_grid = {}

# kernel_init: how the weights are initialized, chosen so the model learns more effectively
# use_bn: use BatchNorm to improve convergence and reduce overfitting
# verbose=0: print nothing during training
for u in units_grid:
    acc, _ = train_eval(hidden_layers=(u,), lr=0.05, epochs=10, activation="relu",
                        kernel_init="he_normal", use_bn=True, dropout=0.0, verbose=0)
    results_grid[u] = acc

print(results_grid)  # 128 and 256 units converge better and do not overfit

{32: 0.8525999784469604, 64: 0.8680999875068665, 128: 0.8707000017166138, 256: 0.8751999735832214, 512: 0.8575000166893005, 1024: 0.8436999917030334}


In [127]:
# Compare one hidden layer with two hidden layers
best_grid = 256
acc_1, _ = train_eval(hidden_layers=(best_grid,), lr=0.05, epochs=10, activation="relu",
                      kernel_init="he_normal", use_bn=True, dropout=0.0, verbose=0)
acc_2, _ = train_eval(hidden_layers=(best_grid, 128), lr=0.05, epochs=10, use_bn=True)
print("Accuracy 1: {:.4f}, Accuracy 2: {:.4f}".format(acc_1, acc_2))

Accuracy 1: 0.8722, Accuracy 2: 0.8729


A single hidden neuron creates a severe bottleneck: the network must compress the 784-dimensional input into one scalar before classifying 10 classes, which discards most of the information. Every logit then becomes an affine function of that one scalar, so the predicted class depends only on where the scalar falls along a line, and accuracy typically drops sharply.
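
The bottleneck can be made concrete with a small NumPy sketch (NumPy standing in for TensorFlow, with random stand-in inputs and weights). With one hidden unit, the part of the logits that depends on the input has rank at most 1, no matter how many examples or classes there are.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random(size=(100, 784))        # stand-in inputs, not real images

# One hidden unit: W1 is (784, 1) and W2 is (1, 10)
W1, b1 = rng.normal(size=(784, 1)), np.zeros(1)
W2, b2 = rng.normal(size=(1, 10)), rng.normal(size=10)

H = np.maximum(X @ W1 + b1, 0)         # shape (100, 1): one scalar per example
scores = H @ W2                        # input-dependent part of the logits
logits = scores + b2                   # shape (100, 10)

# Every row of `scores` is a multiple of the single row of W2
rank = np.linalg.matrix_rank(scores)
print(H.shape, logits.shape, rank)
```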

In [128]:
lrs = [0.001, 0.01, 0.05, 0.1, 0.2]
res_lr = {}
for lr in lrs:
    acc, _ = train_eval(hidden_layers=(256,), lr=lr, epochs=10, use_bn=True, verbose=0)
    res_lr[lr] = acc
res_lr

{0.001: 0.8418999910354614,
 0.01: 0.8673999905586243,
 0.05: 0.8705000281333923,
 0.1: 0.8705000281333923,
 0.2: 0.8723999857902527}

In [129]:
import random
search_space = {
    "lr": [0.005, 0.01, 0.02, 0.05, 0.1],
    "epochs": [5, 10, 15],
    "layers": [(128,), (256,), (512,), (256,128), (512,256)],
    "activation": ["relu", "gelu", "tanh"],
    "kernel_init": ["he_normal", "glorot_uniform"],
    "use_bn": [False, True],
    "dropout": [0.0, 0.2]
}
best = (-1, None)
for i in range(20):  # increase for a deeper search
    cfg = {k: random.choice(v) for k,v in search_space.items()}
    acc, _ = train_eval(hidden_layers=cfg["layers"], lr=cfg["lr"], epochs=cfg["epochs"],
                        activation=cfg["activation"], kernel_init=cfg["kernel_init"],
                        use_bn=cfg["use_bn"], dropout=cfg["dropout"], verbose=0)
    if acc > best[0]:
        best = (acc, cfg)
best

(0.8769000172615051,
 {'lr': 0.02,
  'epochs': 15,
  'layers': (256, 128),
  'activation': 'relu',
  'kernel_init': 'glorot_uniform',
  'use_bn': False,
  'dropout': 0.0})

Keras and TensorFlow layers dispatch to fused kernels and, on GPUs, to cuDNN/cuBLAS, so they are much faster than manual NumPy or eager-mode TensorFlow loops. As network complexity grows (more layers, parameters, or batches), the gap widens because the framework makes better use of vectorization and GPUs.

In [None]:
import contextlib

def bench_matmul(n, device=None, iters=50, warmup=10):
    """Return the average time in seconds of one n x n matrix multiplication."""
    # Use the requested device ('/CPU:0' or '/GPU:0'), else TensorFlow's default
    ctx = tf.device(device) if device else contextlib.nullcontext()
    with ctx:
        # Create the operands on the target device so no transfer is timed
        A = tf.random.normal((n, n))
        B = tf.random.normal((n, n))
        for _ in range(warmup):
            _ = tf.linalg.matmul(A, B)
        tf.experimental.sync_devices()  # finish the warmup before timing
        start = time.perf_counter()
        for _ in range(iters):
            _ = tf.linalg.matmul(A, B)
        tf.experimental.sync_devices()  # wait for asynchronous GPU work
        elapsed = time.perf_counter() - start
    return elapsed / iters


sizes = [1024, 1025, 1026, 1028, 1032]
cpu_times = {n: bench_matmul(n, device="/CPU:0") for n in sizes}
# Benchmark the GPU only if TensorFlow can see one
gpu_times = {}
if tf.config.list_physical_devices('GPU'):
    gpu_times = {n: bench_matmul(n, device="/GPU:0") for n in sizes}

cpu_times, gpu_times


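GPUs are optimized for large, batched dense operations (matmul, convolution, ...), whereas CPUs remain competitive on small models and batches, where GPU kernel-launch overheads dominate.

Memory bus width is a hardware property (for example 128, 256, 320, or 384 bits on GPUs) that is read from the device specification rather than measured. What can be measured is the effective bandwidth: time large tensor reads and writes and compute the throughput in GB/s.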
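
As a rough illustration of the bandwidth estimate, the sketch below times a large in-memory copy on the host using only the standard library. The helper `copy_bandwidth_gbs` is hypothetical, and the figure it reports is an achieved copy throughput that also includes Python and allocation overhead, so treat it as a lower bound rather than a hardware specification.

```python
import time

def copy_bandwidth_gbs(num_bytes=64 * 1024**2, repeats=5):
    """Estimate host memory copy bandwidth in GB/s (hypothetical helper)."""
    src = bytearray(num_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)              # one full read of src and one write of dst
        best = min(best, time.perf_counter() - start)
    assert len(dst) == num_bytes
    # Each copy reads num_bytes and writes num_bytes
    return 2 * num_bytes / best / 1e9

print(f"{copy_bandwidth_gbs():.1f} GB/s")
```

Taking the best of several repeats reduces the influence of one-off effects such as page faults on the first copy.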

In [130]:
acts = ["relu", "gelu", "tanh", "elu", "selu"]
act_res = {}
for a in acts:
    acc, _ = train_eval(hidden_layers=(256,128), lr=0.02, epochs=15,
                        activation=a, kernel_init="glorot_uniform", use_bn=False, verbose=0)
    act_res[a] = acc
act_res

{'relu': 0.883400022983551,
 'gelu': 0.8830000162124634,
 'tanh': 0.8813999891281128,
 'elu': 0.8756999969482422,
 'selu': 0.8723999857902527}

ReLU, GELU, ELU: use He (Kaiming) initialization for the kernel (`he_normal` or `he_uniform`).
tanh: use Xavier (Glorot) initialization for the kernel together with a low learning rate. SELU: use LeCun normal initialization without BatchNorm. In practice, light dropout should be applied in all of these cases.
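
The initializers named above differ mainly in the weight variance they target. The sketch below computes the target standard deviation for each scheme from the layer's fan-in and fan-out; the function `init_std` is a hypothetical helper, and the formulas are the standard definitions of He, Glorot (Xavier), and LeCun normal initialization.

```python
import math

def init_std(scheme, fan_in, fan_out):
    """Target standard deviation of common weight initializers."""
    if scheme == "he":       # suited to ReLU-like activations
        return math.sqrt(2.0 / fan_in)
    if scheme == "glorot":   # also called Xavier; suited to tanh/sigmoid
        return math.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "lecun":    # commonly paired with SELU
        return math.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

# First layer of the MLP in this section: 784 inputs, 256 hidden units
for scheme in ("he", "glorot", "lecun"):
    print(scheme, round(init_std(scheme, 784, 256), 4))
```

For this layer all three values are several times larger than the fixed `sigma=0.01` used in the from-scratch implementation.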

Practical tips and likely best settings (Fashion-MNIST, fast runs):

- One hidden layer with 256–512 ReLU units, He initialization, BatchNorm on, SGD with momentum 0.9, learning rate ≈ 0.05, and 10–15 epochs is a very solid baseline (~0.88–0.90+ test accuracy).
- Add a second layer (e.g., 256, 128) for a bit more accuracy if you can afford a few more epochs or add dropout of 0.2.