### Deep Learning

This chapter covers the important topic of deep learning. At the time of
writing (2020), deep learning is a very active area of research in the machine
learning and artificial intelligence communities. The cornerstone of deep
learning is the neural network.

Neural networks rose to fame in the late 1980s. There was a lot of excitement
and a certain amount of hype associated with this approach, and they
were the impetus for the popular Neural Information Processing Systems
meetings (NeurIPS, formerly NIPS) held every year, typically in exotic
places like ski resorts. This was followed by a synthesis stage, where the
properties of neural networks were analyzed by machine learners, mathematicians
and statisticians; algorithms were improved, and the methodology
stabilized. Then along came SVMs, boosting, and random forests,
and neural networks fell somewhat from favor. Part of the reason was that
neural networks required a lot of tinkering, while the new methods were
more automatic. Also, on many problems the new methods outperformed
poorly-trained neural networks. This was the status quo for the first decade
in the new millennium.

All the while, though, a core group of neural-network enthusiasts were
pushing their technology harder on ever-larger computing architectures and
data sets. Neural networks resurfaced after 2010 with the new name deep
learning, with new architectures, additional bells and whistles, and a string
of success stories on some niche problems such as image and video classification,
speech and text modeling. Many in the field believe that the major
reason for these successes is the availability of ever-larger training datasets,
made possible by the wide-scale use of digitization in science and industry.

In this chapter we discuss the basics of neural networks and deep learning,
and then go into some of the specializations for specific problems, such
as convolutional neural networks (CNNs) for image classification, and
recurrent neural networks (RNNs) for time series and other sequences. We
will also demonstrate these models using the Python torch package, along
with a number of helper packages.

The material in this chapter is slightly more challenging than elsewhere
in this book.

####  Single Layer Neural Networks

A neural network takes an input vector of $ p $ variables $ X = (X_1, X_2, \ldots, X_p) $ and builds a nonlinear function $ f(X) $ to predict the response $ Y $. We have built nonlinear prediction models in earlier chapters, using trees, boosting, and generalized additive models. What distinguishes neural networks from these methods is the particular structure of the model. 

Figure 10.1 shows a simple feed-forward neural network for modeling a quantitative response using $ p = 4 $ predictors. In the terminology of neural networks, the four features $ X_1, \ldots, X_4 $ make up the units in the input layer. The arrows indicate that each of the inputs from the input layer feeds into each of the $ K $ hidden units (we get to pick $ K $; here we chose 5). The neural network model has the form

$$
f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X) = \beta_0 + \sum_{k=1}^K \beta_k g\left(w_{k0} + \sum_{j=1}^p w_{kj} X_j\right).
\tag{10.1}
$$

It is built up here in two steps. First, the $ K $ activations $ A_k $, $ k=1, \ldots, K $, in the hidden layer are computed as functions of the input features $ X_1, \ldots, X_p $,

$$
A_k = h_k(X) = g\left(w_{k0} + \sum_{j=1}^p w_{kj} X_j\right),
\tag{10.2}
$$

where $ g(z) $ is a nonlinear activation function that is specified in advance. We can think of each $ A_k $ as a different transformation $ h_k(X) $ of the original features, much like the basis functions of Chapter 7. These $ K $ activations from the hidden layer then feed into the output layer, resulting in

$$
f(X) = \beta_0 + \sum_{k=1}^K \beta_k A_k,
\tag{10.3}
$$

a linear regression model in the $ K = 5 $ activations. All the parameters $ \beta_0, \ldots, \beta_K $ and $ w_{10}, \ldots, w_{Kp} $ need to be estimated from data. In the early instances of neural networks, the sigmoid activation function was favored,

$$
g(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}},
\tag{10.4}
$$

which is the same function used in logistic regression to convert a linear function into probabilities between zero and one (see Figure 10.2). The preferred choice in modern neural networks is the ReLU (rectified linear unit) activation function, which takes the form

$$
g(z) = 
\begin{cases} 
0 & \text{if } z < 0 \\ 
z & \text{otherwise}
\end{cases}
\tag{10.5}
$$

A ReLU activation can be computed and stored more efficiently than a sigmoid activation. Although it thresholds at zero, because we apply it to a linear function (10.2), the constant term $ w_{k0} $ will shift this inflection point. 

So in words, the model depicted in Figure 10.1 derives five new features by computing five different linear combinations of $ X $, and then squashes each through an activation function $ g(\cdot) $ to transform it. The final model is linear in these derived variables.
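
The two-step computation in (10.1) and (10.2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights, the input vector, and the function name `single_layer_forward` are hypothetical values chosen so the arithmetic is easy to follow, not fitted parameters or part of any library.

```python
def relu(z):
    """ReLU activation: 0 for negative z, z otherwise."""
    return max(0.0, z)


def single_layer_forward(x, W, beta, g=relu):
    """Return (f(x), activations) for a network with one hidden layer.

    x    : list of p input features
    W    : list of K rows, each [w_k0, w_k1, ..., w_kp]
    beta : list [beta_0, beta_1, ..., beta_K]
    g    : activation function applied to each linear combination
    """
    activations = []
    for w_k in W:
        # Linear combination w_k0 + sum_j w_kj * x_j, then the nonlinearity.
        z = w_k[0] + sum(w_kj * x_j for w_kj, x_j in zip(w_k[1:], x))
        activations.append(g(z))
    # Output layer: a linear model in the K activations.
    f = beta[0] + sum(b_k * a_k for b_k, a_k in zip(beta[1:], activations))
    return f, activations


# Hypothetical example with p = 4 inputs and K = 5 hidden units.
x = [1.0, 2.0, 3.0, 4.0]
W = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [-10, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, -1, 0, 0],
]
beta = [0.5, 1, 1, 1, 1, 1]

f, activations = single_layer_forward(x, W, beta)
print(activations)  # [1.0, 2.0, 0.0, 5.0, 0.0]
print(f)            # 8.5
```

Two of the five hidden units receive a negative linear combination and are switched off by the ReLU, so only three derived features contribute to the final linear model.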

The name neural network originally derived from thinking of these hidden units as analogous to neurons in the brain — values of the activations $ A_k = h_k(X) $ close to one are firing, while those close to zero are silent (using the sigmoid activation function).

The nonlinearity in the activation function $ g(\cdot) $ is essential, since without it the model $ f(X) $ in (10.1) would collapse into a simple linear model in $X_1, \ldots, X_p$. Moreover, having a nonlinear activation function allows the model to capture complex nonlinearities and interaction effects. Consider a very simple example with $ p = 2 $ input variables $ X = (X_1, X_2) $, and $ K = 2 $ hidden units $ h_1(X) $ and $ h_2(X) $ with $ g(z) = z^2 $. We specify the other parameters as

$$
\begin{aligned}
\beta_0 &= 0, & \beta_1 &= \tfrac{1}{4}, & \beta_2 &= -\tfrac{1}{4}, \\
w_{10} &= 0, & w_{11} &= 1, & w_{12} &= 1, \\
w_{20} &= 0, & w_{21} &= 1, & w_{22} &= -1.
\end{aligned}
\tag{10.6}
$$

From (10.2), this means that

$$
\begin{aligned}
h_1(X) &= (0 + X_1 + X_2)^2, \\
h_2(X) &= (0 + X_1 - X_2)^2.
\end{aligned}
\tag{10.7}
$$

Then plugging (10.7) into (10.1), we get

$$
\begin{aligned}
f(X) &= 0 + \frac{1}{4} \cdot (0 + X_1 + X_2)^2 - \frac{1}{4} \cdot (0 + X_1 - X_2)^2 \\
&= \frac{1}{4} \left[ (X_1 + X_2)^2 - (X_1 - X_2)^2 \right] \\
&= X_1 X_2.
\end{aligned}
\tag{10.8}
$$

So the sum of two nonlinear transformations of linear functions can give us an interaction! In practice, we would not use a quadratic function for $ g(z) $, since we would always get a second-degree polynomial in the original coordinates $ X_1, \ldots, X_p $. The sigmoid or ReLU activations do not have such a limitation.
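
The interaction result is easy to check numerically. The sketch below hard-codes the quadratic activation $ g(z) = z^2 $ together with the parameter values $ \beta_1 = \tfrac{1}{4} $, $ \beta_2 = -\tfrac{1}{4} $, $ w_1 = (0, 1, 1) $ and $ w_2 = (0, 1, -1) $, and compares the network output with the product $ X_1 X_2 $ at a few arbitrarily chosen points (the test points are assumptions for illustration).

```python
def g(z):
    """Quadratic activation used only in this toy example."""
    return z ** 2


def f(x1, x2):
    """Two-unit network with beta = (0, 1/4, -1/4), w1 = (0, 1, 1), w2 = (0, 1, -1)."""
    h1 = g(0 + 1 * x1 + 1 * x2)
    h2 = g(0 + 1 * x1 - 1 * x2)
    return 0 + 0.25 * h1 - 0.25 * h2


# The network output should agree with the pure interaction x1 * x2.
for x1, x2 in [(1.0, 2.0), (-3.0, 0.5), (4.0, -2.5)]:
    assert abs(f(x1, x2) - x1 * x2) < 1e-12
```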

Fitting a neural network requires estimating the unknown parameters in (10.1). For a quantitative response, typically squared-error loss is used, so that the parameters are chosen to minimize

$$
\sum_{i=1}^{n} (y_i - f(x_i))^2.
\tag{10.9}
$$

Details about how to perform this minimization are provided in Section 10.7.
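
As a small illustration of the squared-error criterion, the following sketch evaluates the sum of squared residuals for a handful of made-up responses and predictions. The helper name `squared_error_loss` and the numbers are hypothetical.

```python
def squared_error_loss(y, predictions):
    """Sum over observations of (y_i - f(x_i))^2."""
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, predictions))


y = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]   # stand-ins for f(x_1), f(x_2), f(x_3)
loss = squared_error_loss(y, predictions)
print(loss)  # 1.25
```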


####  Multilayer Neural Networks

Modern neural networks typically have more than one hidden layer, and often many units per layer. In theory a single hidden layer with a large number of units has the ability to approximate most functions. However, the learning task of discovering a good solution is made much easier with multiple layers each of modest size.

We will illustrate a large dense network on the famous and publicly available MNIST handwritten digit dataset. Figure 10.3 shows examples of these digits. The idea is to build a model to classify the images into their correct digit class $ 0 $–$ 9 $. Every image has $ p = 28 \times 28 = 784 $ pixels, each of which is an eight-bit grayscale value between $ 0 $ and $ 255 $ representing the relative amount of the written digit in that tiny square. These pixels are stored in the input vector $ X $ (in, say, column order). The output is the class label, represented by a vector $ Y = (Y_0, Y_1, \ldots, Y_9) $ of 10 dummy variables, with a one in the position corresponding to the label, and zeros elsewhere. In the machine learning community, this is known as one-hot encoding. There are 60,000 training images, and 10,000 test images.
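
The input and output encodings described above can be sketched as follows. The helper names `flatten_column_order` and `one_hot` are hypothetical, and the all-zero image is a placeholder rather than real MNIST data.

```python
def flatten_column_order(image):
    """Stack the columns of a 2-D image (a list of rows) into one long vector."""
    n_rows, n_cols = len(image), len(image[0])
    return [image[r][c] for c in range(n_cols) for r in range(n_rows)]


def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single one at position `label`."""
    y = [0] * num_classes
    y[label] = 1
    return y


# A placeholder 28 x 28 image of zeros becomes a vector of 784 pixel values.
blank_image = [[0] * 28 for _ in range(28)]
x = flatten_column_order(blank_image)
print(len(x))       # 784

# The digit class 3 becomes a vector of 10 dummy variables.
print(one_hot(3))   # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```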

On a historical note, digit recognition problems were the catalyst that accelerated the development of neural network technology in the late 1980s at AT&T Bell Laboratories and elsewhere. Pattern recognition tasks of this kind are relatively simple for humans. Our visual system occupies a large fraction of our brains, and good recognition is an evolutionary force for survival. These tasks are not so simple for machines, and it has taken more than 30 years to refine the neural-network architectures to match human performance.

Figure 10.4 shows a multilayer network architecture that works well for solving the digit-classification task. It differs from Figure 10.1 in several ways:

- It has two hidden layers $ L_1 $ (256 units) and $ L_2 $ (128 units) rather than one. Later we will see a network with seven hidden layers.
- It has ten output variables, rather than one. In this case, the ten variables really represent a single qualitative variable and so are quite dependent. (We have indexed them by the digit class $ 0 $–$ 9 $ rather than $ 1 $–$ 10 $, for clarity.) More generally, in multi-task learning one can predict different responses simultaneously with a single network; they all have a say in the formation of the hidden layers.
- The loss function used for training the network is tailored for the multiclass classification task.

The first hidden layer is as in (10.2), with

$$
A^{(1)}_k = h^{(1)}_k(X) = g\left(w^{(1)}_{k0} + \sum_{j=1}^{p} w^{(1)}_{kj} X_j\right) 
\tag{10.10}
$$

for $ k = 1, \ldots, K_1 $. The second hidden layer treats the activations $ A^{(1)}_k $ of the first hidden layer as inputs and computes new activations

$$
A^{(2)}_\ell = h^{(2)}_\ell(X) = g\left(w^{(2)}_{\ell 0} + \sum_{k=1}^{K_1} w^{(2)}_{\ell k} A^{(1)}_k\right)
\tag{10.11}
$$

for $ \ell = 1, \ldots, K_2 $. Notice that each of the activations in the second layer $ A^{(2)}_\ell = h^{(2)}_\ell(X) $ is a function of the input vector $ X $. This is the case because while they are explicitly a function of the activations $ A^{(1)}_k $ from layer $ L_1 $, these in turn are functions of $ X $. This would also be the case with more hidden layers. Thus, through a chain of transformations, the network is able to build up fairly complex transformations of $ X $ that ultimately feed into the output layer as features.

We have introduced additional superscript notation such as $ h^{(2)}_\ell(X) $ and $ w^{(2)}_{\ell j} $ in (10.10) and (10.11) to indicate to which layer the activations and weights (coefficients) belong, in this case layer 2. The notation $ W_1 $ in Figure 10.4 represents the entire matrix of weights that feed from the input layer to the first hidden layer $ L_1 $. This matrix will have $ 785 \times 256 = 200{,}960 $ elements; there are 785 rather than 784 because we must account for the intercept or bias term.

Each element $ A^{(1)}_k $ feeds to the second hidden layer $ L_2 $ via the matrix of weights $ W_2 $ of dimension $ 257 \times 128 = 32,896 $.
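
The chaining of layers can be illustrated with a deliberately tiny network. In this sketch the dimensions ($ p = 3 $, $ K_1 = 2 $, $ K_2 = 2 $), the weights, and the helper name `dense_layer` are all hypothetical; each weight row stores the bias first, mirroring the extra intercept row counted in $ W_1 $ and $ W_2 $.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(inputs, weights, g):
    """Apply g to each linear combination; weights[k] = [bias, w_1, ..., w_n]."""
    return [
        g(w[0] + sum(w_j * a_j for w_j, a_j in zip(w[1:], inputs)))
        for w in weights
    ]


x = [1.0, -2.0, 0.5]

# First hidden layer: K_1 = 2 units, each with a bias plus p = 3 weights.
W1 = [[0.0, 1.0, 0.0, 2.0],
      [1.0, 0.0, 1.0, 0.0]]

# Second hidden layer: K_2 = 2 units, each with a bias plus K_1 = 2 weights.
W2 = [[0.5, 1.0, 1.0],
      [-3.0, 1.0, 5.0]]

A1 = dense_layer(x, W1, relu)    # activations of layer L_1, functions of x
A2 = dense_layer(A1, W2, relu)   # activations of layer L_2, functions of A1
print(A1)  # [2.0, 0.0]
print(A2)  # [2.5, 0.0]
```

Because `A2` is computed from `A1`, which is computed from `x`, every second-layer activation is ultimately a function of the original inputs.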

We now get to the output layer, where we now have ten responses rather than one. The first step is to compute ten different linear models similar to our single model (10.1),

$$
Z_m = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell} h^{(2)}_\ell(X) = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell} A^{(2)}_\ell,
\tag{10.12}
$$

for $ m = 0, 1, \ldots, 9 $. The matrix $ B $ stores all $ 129 \times 10 = 1,290 $ of these weights.

If these were all separate quantitative responses, we would simply set each $ f_m(X) = Z_m $ and be done. However, we would like our estimates to represent class probabilities $ f_m(X) = Pr(Y = m | X) $, just like in multinomial logistic regression in Section 4.3.5. So we use the special softmax activation function (see (4.13) on page 145),

$$
f_m(X) = Pr(Y = m | X) = \frac{e^{Z_m}}{\sum_{j=0}^{9} e^{Z_j}},
\tag{10.13}
$$

for $ m = 0, 1, \ldots, 9 $. This ensures that the 10 numbers behave like probabilities (non-negative and sum to one). Even though the goal is to build a classifier, our model actually estimates a probability for each of the 10 classes. The classifier then assigns the image to the class with the highest probability.
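
A minimal sketch of the softmax step and the resulting classifier is shown below. Subtracting the largest $ Z_m $ before exponentiating is a common numerical-stability device that is not part of the text; it leaves the ratios in (10.13) unchanged. The function names and the example scores are hypothetical.

```python
import math


def softmax(z):
    """Convert scores Z_0, ..., Z_{M-1} into probabilities."""
    shift = max(z)  # stability trick (an addition here); cancels in the ratio
    exps = [math.exp(z_m - shift) for z_m in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z):
    """Assign the class with the highest estimated probability."""
    probs = softmax(z)
    return max(range(len(probs)), key=probs.__getitem__)


scores = [0.1, 2.0, -1.0]
probs = softmax(scores)
predicted_class = classify(scores)   # index of the largest probability
```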

To train this network, since the response is qualitative, we look for coefficient estimates that minimize the negative multinomial log-likelihood 

$$
-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i)),
\tag{10.14}
$$

also known as the cross-entropy. This is a generalization of the criterion (4.5) for two-class logistic regression. Details on how to minimize this objective are given in Section 10.7. If the response were quantitative, we would instead minimize squared-error loss as in (10.9).
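
The cross-entropy in (10.14) can be computed directly from one-hot responses and predicted probabilities. The sketch below uses three classes instead of ten to keep the example short; the data and the function name `cross_entropy` are hypothetical.

```python
import math


def cross_entropy(Y, F):
    """Negative multinomial log-likelihood.

    Y : list of one-hot response vectors y_i
    F : list of predicted probability vectors (f_0(x_i), ..., f_{M-1}(x_i))
    """
    total = 0.0
    for y_i, f_i in zip(Y, F):
        for y_im, f_im in zip(y_i, f_i):
            if y_im:  # only the true class contributes; skips log(0) elsewhere
                total += y_im * math.log(f_im)
    return -total


Y = [[1, 0, 0],
     [0, 1, 0]]
F = [[0.5, 0.25, 0.25],
     [0.1, 0.8, 0.1]]

loss = cross_entropy(Y, F)   # equals -(log 0.5 + log 0.8)
```

Because each $ y_i $ is one-hot, the double sum reduces to the negative log of the probability assigned to the true class of each observation.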

Table 10.1 compares the test performance of the neural network with two simple models presented in Chapter 4 that make use of linear decision boundaries: multinomial logistic regression and linear discriminant analysis. The improvement of neural networks over both of these linear methods is dramatic: the network with dropout regularization achieves a test error rate below 2% on the 10,000 test images. (We describe dropout regularization in Section 10.7.3.) In Section 10.9.2 of the lab, we present the code for fitting this model, which runs in just over two minutes on a laptop computer.

| Method                                         | Test Error |
|------------------------------------------------|------------|
| Neural Network + Ridge Regularization          | 2.3%      |
| Neural Network + Dropout Regularization        | 1.8%      |
| Multinomial Logistic Regression                 | 7.2%      |
| Linear Discriminant Analysis                    | 12.7%     |

Table 10.1: Test error rate on the MNIST data, for neural networks with two forms of regularization, as well as multinomial logistic regression and linear discriminant analysis. In this example, the extra complexity of the neural network leads to a marked improvement in test error.



Adding the number of coefficients in $ W_1, W_2, $ and $ B $, we get 235,146 in all, more than 33 times the number $ 785 \times 9 = 7,065 $ needed for multinomial logistic regression. Recall that there are 60,000 images in the training set. While this might seem like a large training set, there are almost four times as many coefficients in the neural network model as there are observations in the training set! To avoid overfitting, some regularization is needed. In this example, we used two forms of regularization: ridge regularization, which is similar to ridge regression from Chapter 6, and dropout regularization. We discuss both forms of regularization in Section 10.7.
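
The coefficient counts quoted above follow from simple arithmetic: each unit in a dense layer has one weight per input plus an intercept. The helper name `dense_param_count` below is hypothetical.

```python
def dense_param_count(n_inputs, n_units):
    """Weights in a fully-connected layer, including one bias per unit."""
    return (n_inputs + 1) * n_units


w1 = dense_param_count(784, 256)   # input layer -> L_1
w2 = dense_param_count(256, 128)   # L_1 -> L_2
b = dense_param_count(128, 10)     # L_2 -> output layer
total = w1 + w2 + b

logistic = 785 * 9                 # multinomial logistic regression

print(w1, w2, b, total)            # 200960 32896 1290 235146
print(total / logistic)            # a little over 33
print(total / 60000)               # roughly 3.9 coefficients per training image
```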


####  Convolutional Neural Networks

Neural networks rebounded around 2010 with big successes in image classification. Around that time, massive databases of labeled images were being accumulated, with ever-increasing numbers of classes. One such database, CIFAR100, consists of 60,000 images labeled according to 20 superclasses (e.g. aquatic mammals), with five classes per superclass (beaver, dolphin, otter, seal, whale). Each image has a resolution of $ 32 \times 32 $ pixels, with three eight-bit numbers per pixel representing red, green, and blue. The numbers for each image are organized in a three-dimensional array called a feature map. The first two axes are spatial (both are 32-dimensional), and the third is the channel axis, representing the three colors. There is a designated training set of 50,000 images, and a test set of 10,000.

A special family of convolutional neural networks (CNNs) has evolved for 
classifying images such as these, and has shown spectacular success on a
wide range of problems. CNNs mimic to some degree how humans classify
images, by recognizing specific features or patterns anywhere in the image
that distinguish each particular object class. In this section we give a brief
overview of how they work.

Figure 10.6 illustrates the idea behind a convolutional neural network on
a cartoon image of a tiger.

The network first identifies low-level features in the input image, such
as small edges, patches of color, and the like. These low-level features are
then combined to form higher-level features, such as parts of ears, eyes,
and so on. Eventually, the presence or absence of these higher-level features
contributes to the probability of any given output class.

How does a convolutional neural network build up this hierarchy? It combines
two specialized types of hidden layers, called convolution layers and
pooling layers. Convolution layers search for instances of small patterns in
the image, whereas pooling layers downsample these to select a prominent
subset. In order to achieve state-of-the-art results, contemporary neural
network architectures make use of many convolution and pooling layers.
We describe convolution and pooling layers next.

##### Convolution Layers

A convolution layer is made up of a large number of convolution filters, each of which is a template that determines whether a particular local feature is
present in an image. A convolution filter relies on a very simple operation,
called a convolution, which basically amounts to repeatedly multiplying
matrix elements and then adding the results.

To understand how a convolution filter works, consider a very simple
example of a $4 \times 3$ image:

$$\text{Original Image} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{bmatrix} \quad $$

Now consider a $2 \times 2$ filter of the form

$$\text{Convolution Filter} = \begin{bmatrix} \alpha & \beta \\ \gamma & \delta \end{bmatrix}$$


When we convolve the image with the filter, we get the result:

$$\text{Convolved Image} = \begin{bmatrix} (a\alpha + b\beta + d\gamma + e\delta) & (b\alpha + c\beta + e\gamma + f\delta) \\ (d\alpha + e\beta + g\gamma + h\delta) & (e\alpha + f\beta + h\gamma + i\delta) \\ (g\alpha + h\beta + j\gamma + k\delta) & (h\alpha + i\beta + k\gamma + l\delta) \end{bmatrix}$$


For instance, the top-left element comes from multiplying each element in the $2 \times 2$ filter by the corresponding element in the top-left $2 \times 2$ portion of the image, and adding the results. The other elements are obtained in a similar way: the convolution filter is applied to every $2 \times 2$ submatrix of the original image in order to obtain the convolved image. If a $2 \times 2$ submatrix of the original image resembles the convolution filter, then it will have a large value in the convolved image; otherwise, it will have a small value. Thus, the convolved image highlights regions of the original image that resemble the convolution filter. We have used $2 \times 2$ as an example; in general, convolution filters are small $l_1 \times l_2$ arrays, with $l_1$ and $l_2$ small positive integers that are not necessarily equal.
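
The operation just described can be written as a pair of nested loops. In this sketch the symbols $a, \ldots, l$ are replaced by the numbers 1 to 12 and $\alpha, \beta, \gamma, \delta$ by 1 to 4; these values and the function name `convolve2d` are assumptions for illustration. As in the text, the filter is slid over the image without being flipped and without padding, which is also the convention deep learning libraries use for their convolution layers.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over every same-sized submatrix of `image`.

    Each output element is the sum of elementwise products between the
    kernel and the submatrix whose top-left corner is at (r, c).
    """
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(n_rows - k_rows + 1):
        row = []
        for c in range(n_cols - k_cols + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows)
                           for j in range(k_cols)))
        out.append(row)
    return out


image = [[1, 2, 3],      # a b c
         [4, 5, 6],      # d e f
         [7, 8, 9],      # g h i
         [10, 11, 12]]   # j k l
kernel = [[1, 2],        # alpha beta
          [3, 4]]        # gamma delta

convolved = convolve2d(image, kernel)
print(convolved)  # [[37, 47], [67, 77], [97, 107]]
```

The $4 \times 3$ image and $2 \times 2$ filter give a $3 \times 2$ convolved image, matching the layout of the symbolic result above.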

Figure 10.7 illustrates the application of two convolution filters to a $192 \times 179$ image of a tiger, shown on the left-hand side. Each convolution filter is a $15 \times 15$ image containing mostly zeros (black), with a narrow strip of ones (white) oriented either vertically or horizontally within the image. When each filter is convolved with the image of the tiger, areas of the tiger that resemble the filter (i.e., that have either horizontal or vertical stripes or edges) are given large values, and areas of the tiger that do not resemble the feature are given small values. The convolved images are displayed on the right-hand side. We see that the horizontal stripe filter picks out horizontal stripes and edges in the original image, whereas the vertical stripe filter picks out vertical stripes and edges in the original image.

We have used a large image and two large filters in Figure 10.7 for illustration.
For the CIFAR100 database there are $32 \times 32$ color pixels per image,
and we use $3 \times 3$ convolution filters.

In a convolution layer, we use a whole bank of filters to pick out a variety
of differently-oriented edges and shapes in the image. Using predefined
filters in this way is standard practice in image processing. By contrast,
with CNNs the filters are learned for the specific classification task. We can
think of the filter weights as the parameters going from an input layer to a
hidden layer, with one hidden unit for each pixel in the convolved image.
This is in fact the case, though the parameters are highly structured and
constrained (see Exercise 4 for more details). They operate on localized
patches in the input image (so there are many structural zeros), and the
same weights in a given filter are reused for all possible patches in the image
(so the weights are constrained).

We now give some additional details.

- Since the input image is in color, it has three channels represented
by a three-dimensional feature map (array). Each channel is a two-dimensional
($32 \times 32$) feature map: one for red, one for green, and
one for blue. A single convolution filter will also have three channels,
one per color, each of dimension $3 \times 3$, with potentially different filter
weights. The results of the three convolutions are summed to form
a two-dimensional output feature map. Note that at this point the
color information has been used, and is not passed on to subsequent
layers except through its role in the convolution.

- If we use $K$ different convolution filters at this first hidden layer, we get $K$ two-dimensional output feature maps, which together are treated as a single three-dimensional feature map. We view each of the $K$ output feature maps as a separate channel of information, so now we have $K$ channels in contrast to the three color channels of the original input feature map. The three-dimensional feature map is just like the activations in a hidden layer of a simple neural network, except organized and produced in a spatially structured way.

- We typically apply the ReLU activation function (10.5) to the convolved image. This step is sometimes viewed as a separate layer in the convolutional neural network, in which case it is referred to as a detector layer.
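
The three details above can be combined into a small sketch of a convolution layer: each filter has one kernel per input channel, the per-channel convolutions are summed into a single two-dimensional map, ReLU is applied, and $K$ filters yield $K$ output channels. The tiny $3 \times 3$ three-channel image, the $2 \times 2$ kernels, and the function names are hypothetical and chosen only to keep the arithmetic readable.

```python
def convolve2d(image, kernel):
    """Unpadded convolution of one channel with one kernel (no kernel flip)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k_rows) for j in range(k_cols))
             for c in range(len(image[0]) - k_cols + 1)]
            for r in range(len(image) - k_rows + 1)]


def apply_filter(channels, filter_kernels):
    """Convolve each channel with its own kernel, sum the results, apply ReLU."""
    per_channel = [convolve2d(ch, k) for ch, k in zip(channels, filter_kernels)]
    n_rows, n_cols = len(per_channel[0]), len(per_channel[0][0])
    return [[max(0, sum(m[r][c] for m in per_channel))
             for c in range(n_cols)]
            for r in range(n_rows)]


def conv_layer(channels, filters):
    """K filters produce K two-dimensional output feature maps (channels)."""
    return [apply_filter(channels, f) for f in filters]


red = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
green = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
blue = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]

diag = [[1, 0], [0, 1]]
anti = [[0, 1], [1, 0]]
neg = [[-1, 0], [0, 0]]

filters = [
    [diag, anti, neg],   # filter 1: a different kernel for each color
    [neg, neg, neg],     # filter 2
]

output = conv_layer([red, green, blue], filters)
print(output[0])  # [[2, 0], [0, 2]]
print(output[1])  # [[0, 0], [0, 0]]
```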

##### Pooling Layers

A pooling layer provides a way to condense a large image into a smaller summary image. While there are a number of possible ways to perform pooling, the max pooling operation summarizes each non-overlapping 2 × 2 block of pixels in an image using the maximum value in the block. This reduces the size of the image by a factor of two in each direction, and it also provides some location invariance: i.e. as long as there is a large value in one of the four pixels in the block, the whole block registers as a large value in the reduced image.

Here is a simple example of max pooling:

$$\text{Max pool:} \quad \begin{bmatrix} 1 & 2 & 5 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 1 & 1 & 2 & 0\end{bmatrix} \quad \to \quad \begin{bmatrix} 3 & 5 \\ 2 & 4\end{bmatrix}$$
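
The same example can be reproduced with a short function. This is a minimal sketch that assumes non-overlapping $2 \times 2$ blocks; if a dimension is odd, this version simply drops the last row or column. The function name `max_pool_2x2` is hypothetical.

```python
def max_pool_2x2(image):
    """Summarize each non-overlapping 2 x 2 block by its maximum value."""
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, len(image[0]) - 1, 2)]
            for r in range(0, len(image) - 1, 2)]


image = [[1, 2, 5, 3],
         [3, 0, 1, 2],
         [2, 1, 3, 4],
         [1, 1, 2, 0]]

pooled = max_pool_2x2(image)
print(pooled)  # [[3, 5], [2, 4]]
```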


##### Architecture of a Convolutional Neural Network



So far we have defined a single convolution layer — each filter produces a new two-dimensional feature map. The number of convolution filters in a convolution layer is akin to the number of units at a particular hidden layer in a fully-connected neural network of the type we saw in Section 10.2. This number also defines the number of channels in the resulting three-dimensional feature map. We have also described a pooling layer, which reduces the first two dimensions of each three-dimensional feature map. Deep CNNs have many such layers. Figure 10.8 shows a typical architecture for a CNN for the CIFAR100 image classification task.

At the input layer, we see the three-dimensional feature map of a color image, where the channel axis represents each color by a 32 × 32 two-dimensional feature map of pixels. Each convolution filter produces a new channel at the first hidden layer, each of which is a 32 × 32 feature map (after some padding at the edges). After this first round of convolutions, we now have a new “image”: a feature map with considerably more channels than the three color input channels (six in the figure, since we used six convolution filters).

This is followed by a max-pool layer, which reduces the size of the feature map in each channel by a factor of four: two in each dimension. This convolve-then-pool sequence is now repeated for the next two layers. Some details are as follows:

- Each subsequent convolve layer is similar to the first. It takes as input the three-dimensional feature map from the previous layer and treats it like a single multi-channel image. Each convolution filter learned has as many channels as this feature map.
- Since the channel feature maps are reduced in size after each pool layer, we usually increase the number of filters in the next convolve layer to compensate.
- Sometimes we repeat several convolve layers before a pool layer. This effectively increases the dimension of the filter.

These operations are repeated until the pooling has reduced each channel feature map down to just a few pixels in each dimension. At this point the three-dimensional feature maps are flattened — the pixels are treated as separate units — and fed into one or more fully-connected layers before reaching the output layer, which is a softmax activation for the 100 classes (as in (10.13)).
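
To see how the feature-map dimensions evolve through such an architecture, the sketch below traces shapes through repeated convolve-then-pool stages. It assumes padded convolutions that preserve the spatial size and $2 \times 2$ max pooling. Only the six filters in the first stage come from the text; the number of stages and the later filter counts (12 and 24) are assumptions for illustration, as is the function name `trace_shapes`.

```python
def trace_shapes(height, width, channels, filters_per_stage):
    """Return the (height, width, channels) shape after each conv and pool step."""
    shapes = [(height, width, channels)]
    for n_filters in filters_per_stage:
        # Padded convolution: spatial size unchanged, one channel per filter.
        channels = n_filters
        shapes.append((height, width, channels))
        # 2 x 2 max pool: halve each spatial dimension.
        height, width = height // 2, width // 2
        shapes.append((height, width, channels))
    return shapes


# Filter counts after the first stage are hypothetical.
shapes = trace_shapes(32, 32, 3, [6, 12, 24])
for shape in shapes:
    print(shape)

# Flattening the final feature map gives the inputs to the dense layers.
final_h, final_w, final_c = shapes[-1]
flattened_units = final_h * final_w * final_c
print(flattened_units)  # 384
```

Doubling the number of filters after each pool layer, as done here, is one way to compensate for the shrinking spatial size mentioned in the list above.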

There are many tuning parameters to be selected in constructing such a network, apart from the number, nature, and sizes of each layer. Dropout learning can be used at each layer, as well as lasso or ridge regularization (see Section 10.7). The details of constructing a convolutional neural network can seem daunting. Fortunately, terrific software is available, with extensive examples and vignettes that provide guidance on sensible choices for the parameters. For the CIFAR100 official test set, the best accuracy as of this writing is just above 75%, but undoubtedly this performance will continue to improve.





##### Data Augmentation


An additional important trick used with image modeling is data augmentation. Essentially, each training image is replicated many times, with each replicate randomly distorted in a natural way such that human recognition is unaffected. Figure 10.9 shows some examples. Typical distortions are