# Plumbing: A look under the hood of ``mxnet.gluon``

In the previous tutorials, we taught you about linear regression and softmax regression. We explained how these models work in principle, showed you how to implement them from scratch, and presented a compact implementation using ``mxnet.gluon``. And since our focus was on modeling, we showed 

We explained *how to do things* in ``gluon`` but didn't really explain *how they work*. We relied on ``nn.Sequential``, a syntactically convenient shorthand for ``nn.Block``, but didn't peek under the hood. And while each notebook presented a working, trained model, we didn't show you how to introspect its parameters, save and load models, etc. In this chapter, we'll take a break from modeling to explore the gory details of ``mxnet.gluon``.

## Load up the data
First, let's get the preliminaries out of the way.

In [2]:
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
# Run everything on the CPU
ctx = mx.cpu()
batch_size = 64
# Fetch MNIST as a dict of NumPy arrays
mnist = mx.test_utils.get_mnist()
# Wrap the arrays in iterators that yield mini-batches
train_data = mx.io.NDArrayIter(
    mnist["train_data"],
    mnist["train_label"],
    batch_size,
    shuffle=True)
test_data = mx.io.NDArrayIter(
    mnist["test_data"],
    mnist["test_label"],
    batch_size,
    shuffle=True)

## Peeling away the abstraction of ``nn.Sequential``
Now you might remember that we defined a multilayer perceptron in gluon thusly:

In [4]:
net1 = gluon.nn.Sequential()
with net1.name_scope():
    net1.add(gluon.nn.Dense(128, activation="relu"))
    net1.add(gluon.nn.Dense(64, activation="relu"))
    net1.add(gluon.nn.Dense(10))

In just 5 lines and 183 characters, we defined a multilayer perceptron with three fully-connected layers, each parametrized by a weight matrix and a bias term. We also specified the ReLU activation function for the hidden layers. The first time I had to implement a multilayer perceptron for a university machine learning course, it took considerably more code. To enable such concise code, there's a bit of magic going on here.
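To make concrete what each fully-connected layer computes, here is a minimal from-scratch sketch in plain Python (no MXNet required). The helper names `dense` and `relu`, the tiny layer sizes, and the hand-picked parameter values are all illustrative assumptions, not part of ``gluon``. It is a toy stand-in for the idea, not how ``gluon`` implements ``Dense``.

```python
def relu(v):
    """Elementwise max(0, x) over a list of floats."""
    return [max(0.0, x) for x in v]

def dense(x, weight, bias, activation=None):
    """Compute activation(weight @ x + bias) for a single input vector x.

    weight is a list of rows with shape (units, in_units); bias has length units.
    """
    out = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
           for row, b in zip(weight, bias)]
    return activation(out) if activation else out

# Toy 3 -> 2 -> 1 network with hand-picked parameters (illustrative values only).
w1 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b1 = [0.0, -1.0]
w2 = [[2.0, -1.0]]
b2 = [0.5]

x = [1.0, 2.0, 3.0]
hidden = dense(x, w1, b1, activation=relu)   # hidden layer with ReLU
output = dense(hidden, w2, b2)               # output layer, no activation
print(hidden, output)
```

Each layer needs a weight with one row per output unit and one column per input unit, which is why the number of inputs has to be known before the weights can be allocated.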

## Shape inference
One of the first things you might notice is that for each layer, we specified only the number of output nodes; we never specified the number of input nodes! You might wonder how ``gluon`` knows that the first weight matrix should be $784 \times 128$ and not $42 \times 128$. In fact, it doesn't. We can see this by accessing the network's parameters.

In [9]:
print(net1.collect_params())

sequential1_ (
  Parameter sequential1_dense0_bias (shape=(128,), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense2_weight (shape=(10, 0), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense2_bias (shape=(10,), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense1_weight (shape=(64, 0), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense1_bias (shape=(64,), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense0_weight (shape=(128, 0), dtype=<class 'numpy.float32'>)
)


Take a look at the shapes of the weight matrices: (128, 0), (64, 0), (10, 0). What does it mean for a matrix to have a dimension of zero? This is ``gluon``'s way of marking that the shape of these matrices is not yet known. The shape will be inferred on the fly once the network is provided with some input.

So when we initialize our parameters, you might wonder, what precisely is happening?

In [13]:
net1.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

In this situation, ``gluon`` is not actually initializing any parameters! Instead, it's making a note of which initializer to associate with each parameter, even though its shape is not yet known. The parameters are instantiated and the initializer is called once we provide the network with some input.
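The deferred behavior described here can be mimicked in a few lines of plain Python. The sketch below is a toy model of the idea, not ``gluon``'s actual implementation: the class `LazyDense` and its attributes are hypothetical names invented for illustration. It records the initializer when `initialize` is called and only allocates the weights on the first forward pass, once the input size is known.

```python
import random

class LazyDense:
    """Toy dense layer that postpones weight allocation until it sees input."""

    def __init__(self, units, in_units=0):
        self.units = units
        self.in_units = in_units      # 0 means "not yet known"
        self.init_fn = None           # remembered by initialize(), used later
        self.weight = None
        self.bias = None

    @property
    def weight_shape(self):
        return (self.units, self.in_units)

    def initialize(self, init_fn):
        # Only record the initializer; allocate now only if the shape is known.
        self.init_fn = init_fn
        if self.in_units > 0:
            self._allocate()

    def _allocate(self):
        if self.init_fn is None:
            raise RuntimeError("call initialize() before using the layer")
        self.weight = [[self.init_fn() for _ in range(self.in_units)]
                       for _ in range(self.units)]
        self.bias = [0.0] * self.units

    def __call__(self, x):
        if self.weight is None:
            # First forward pass: infer in_units from the input, then allocate.
            self.in_units = len(x)
            self._allocate()
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

layer = LazyDense(4)
layer.initialize(lambda: random.uniform(-0.1, 0.1))
shape_before = layer.weight_shape     # in_units still unknown
layer([0.5] * 7)                      # a 7-dimensional input fixes the shape
shape_after = layer.weight_shape
print(shape_before, shape_after)
```

Before the first call the toy layer reports a zero in the unknown dimension, and afterwards it reports the size taken from the input, mirroring the pattern in the parameter listings in this section.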

In [14]:
data = train_data.next().data[0]
net1(data)
print(net1.collect_params())

sequential1_ (
  Parameter sequential1_dense0_bias (shape=(128,), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense2_weight (shape=(10, 64), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense2_bias (shape=(10,), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense1_weight (shape=(64, 128), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense1_bias (shape=(64,), dtype=<class 'numpy.float32'>)
  Parameter sequential1_dense0_weight (shape=(128, 784), dtype=<class 'numpy.float32'>)
)


This shape inference can be extremely useful at times. For example, when working with convnets, it can be quite a pain to calculate the shapes of the various hidden layers. They depend on the kernel size, the number of filters, the stride, and the precise padding scheme used, which can vary in subtle ways from library to library.
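To give a sense of the bookkeeping that shape inference spares us, here is a small sketch of the per-dimension arithmetic for a convolution. It assumes the common convention of symmetric zero padding, no dilation, and rounding down; the function name `conv_output_size` and the layer settings are illustrative, and other libraries or padding modes may give different results.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    """Output length along one spatial dimension of a convolution.

    Assumes the floor((in + 2*pad - kernel) / stride) + 1 convention;
    some libraries round or pad differently.
    """
    return (in_size + 2 * pad - kernel) // stride + 1

# A 28x28 input through two hypothetical conv layers (settings are illustrative).
size = 28
size = conv_output_size(size, kernel=5)                   # 28 -> 24
size = conv_output_size(size, kernel=3, stride=2, pad=1)  # 24 -> 12
print(size)
```

Doing this by hand for every layer, and then multiplying by the number of filters to size the first dense layer, is exactly the kind of error-prone calculation that deferred shape inference avoids.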

## Specifying shape manually

If we want to specify the shape manually, that's always an option. We accomplish this by using the ``in_units`` argument when adding each layer.

In [25]:
net2 = gluon.nn.Sequential()
with net2.name_scope():
    net2.add(gluon.nn.Dense(128, in_units=784, activation="relu"))
    net2.add(gluon.nn.Dense(64, in_units=128, activation="relu"))
    net2.add(gluon.nn.Dense(10, in_units=64))

Note that the parameters from this network can be initialized before we see any real data.

In [26]:
net2.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
print(net2.collect_params())

sequential8_ (
  Parameter sequential8_dense1_weight (shape=(64, 128), dtype=<class 'numpy.float32'>)
  Parameter sequential8_dense1_bias (shape=(64,), dtype=<class 'numpy.float32'>)
  Parameter sequential8_dense0_bias (shape=(128,), dtype=<class 'numpy.float32'>)
  Parameter sequential8_dense2_bias (shape=(10,), dtype=<class 'numpy.float32'>)
  Parameter sequential8_dense0_weight (shape=(128, 784), dtype=<class 'numpy.float32'>)
  Parameter sequential8_dense2_weight (shape=(10, 64), dtype=<class 'numpy.float32'>)
)


## What's the deal with ``name_scope()``?
The next thing you might have noticed is that we added all of our layers inside a ``with net1.name_scope():`` block. This prompts ``gluon`` to give each parameter an appropriate name that indicates which model it belongs to, e.g. ``sequential8_dense2_weight``. Keeping these names straight makes our lives much easier once we start writing more complex code, where we might be working with multiple models and saving and loading the parameters of each. It helps us make sure that we associate each weight with the right model.
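The value of prefixed names is easy to see in a toy setting. The sketch below is plain Python and only imitates the naming pattern seen in the parameter listings; `NameManager` and `build_mlp_param_names` are hypothetical helpers invented for illustration and say nothing about how ``gluon`` generates names internally.

```python
from collections import defaultdict

class NameManager:
    """Toy counter-based namer; a hypothetical stand-in, not gluon's internals."""

    def __init__(self):
        self._counts = defaultdict(int)

    def unique(self, kind, prefix=""):
        # Produces e.g. "sequential0_", then "sequential1_", or "sequential0_dense0_".
        key = prefix + kind
        name = "{}{}_".format(key, self._counts[key])
        self._counts[key] += 1
        return name

names = NameManager()

def build_mlp_param_names(num_layers):
    """Return parameter names for one model, all sharing the model's prefix."""
    model_prefix = names.unique("sequential")
    params = []
    for _ in range(num_layers):
        layer_prefix = names.unique("dense", prefix=model_prefix)
        params += [layer_prefix + "weight", layer_prefix + "bias"]
    return params

model_a = build_mlp_param_names(2)
model_b = build_mlp_param_names(2)
print(model_a)
print(model_b)
```

Because every name carries its model's prefix, the two lists never collide, so parameters saved from one model cannot be silently matched to the other when they are loaded back.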


## Behind ``Sequential``'s syntactic sugar