### Deep Learning

This chapter covers the important topic of deep learning. At the time of
writing (2020), deep learning is a very active area of research in the machine
learning and artificial intelligence communities. The cornerstone of deep
learning is the neural network.

Neural networks rose to fame in the late 1980s. There was a lot of excitement
and a certain amount of hype associated with this approach, and they
were the impetus for the popular Neural Information Processing Systems
meetings (NeurIPS, formerly NIPS) held every year, typically in exotic
places like ski resorts. This was followed by a synthesis stage, where the
properties of neural networks were analyzed by machine learners, mathematicians
and statisticians; algorithms were improved, and the methodology
stabilized. Then along came SVMs, boosting, and random forests,
and neural networks fell somewhat from favor. Part of the reason was that
neural networks required a lot of tinkering, while the new methods were
more automatic. Also, on many problems the new methods outperformed
poorly-trained neural networks. This was the status quo for the first decade
in the new millennium.

All the while, though, a core group of neural-network enthusiasts were
pushing their technology harder on ever-larger computing architectures and
data sets. Neural networks resurfaced after 2010 with the new name deep
learning, with new architectures, additional bells and whistles, and a string
of success stories on some niche problems such as image and video classification,
speech and text modeling. Many in the field believe that the major
reason for these successes is the availability of ever-larger training datasets,
made possible by the wide-scale use of digitization in science and industry.

In this chapter we discuss the basics of neural networks and deep learning,
and then go into some of the specializations for specific problems, such
as convolutional neural networks (CNNs) for image classification, and recurrent
neural networks (RNNs) for time series and other sequences. We will
also demonstrate these models using the Python torch package, along
with a number of helper packages.

The material in this chapter is slightly more challenging than elsewhere
in this book.

#### Single Layer Neural Networks

A neural network takes an input vector of $ p $ variables $ X = (X_1, X_2, \ldots, X_p) $ and builds a nonlinear function $ f(X) $ to predict the response $ Y $. We have built nonlinear prediction models in earlier chapters, using trees, boosting, and generalized additive models. What distinguishes neural networks from these methods is the particular structure of the model. 

Figure 10.1 shows a simple feed-forward neural network for modeling a quantitative response using $ p = 4 $ predictors. In the terminology of neural networks, the four features $ X_1, \ldots, X_4 $ make up the units in the input layer. The arrows indicate that each of the inputs from the input layer feeds into each of the $ K $ hidden units (we get to pick $ K $; here we chose 5). The neural network model has the form

$$
f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X) = \beta_0 + \sum_{k=1}^K \beta_k \, g\left(w_{k0} + \sum_{j=1}^p w_{kj} X_j\right). \tag{10.1}
$$

It is built up here in two steps. First, the $ K $ activations $ A_k $, $ k=1, \ldots, K $, in the hidden layer are computed as functions of the input features $ X_1, \ldots, X_p $,

$$
A_k = h_k(X) = g\left(w_{k0} + \sum_{j=1}^p w_{kj} X_j\right), \tag{10.2}
$$

where $ g(z) $ is a nonlinear activation function that is specified in advance. We can think of each $ A_k $ as a different transformation $ h_k(X) $ of the original features, much like the basis functions of Chapter 7. These $ K $ activations from the hidden layer then feed into the output layer, resulting in

$$
f(X) = \beta_0 + \sum_{k=1}^K \beta_k A_k,
$$

a linear regression model in the $ K = 5 $ activations. All the parameters $ \beta_0, \ldots, \beta_K $ and $ w_{10}, \ldots, w_{Kp} $ need to be estimated from data. In the early instances of neural networks, the sigmoid activation function was favored,

$$
g(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}},
$$

which is the same function used in logistic regression to convert a linear function into probabilities between zero and one (see Figure 10.2). The preferred choice in modern neural networks is the ReLU (rectified linear unit) activation function, which takes the form

$$
g(z) = 
\begin{cases} 
0 & \text{if } z < 0 \\ 
z & \text{otherwise}
\end{cases}
$$

A ReLU activation can be computed and stored more efficiently than a sigmoid activation. Although it thresholds at zero, because we apply it to a linear function (10.2), the constant term $ w_{k0} $ will shift this inflection point. 
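
As a minimal sketch in plain Python, the two activation functions can be written as below. The function names and the weights `w0 = -2`, `w1 = 1` are our own illustrative choices, not values from the text; they are there only to show how the constant term moves the point at which a ReLU unit switches on.

```python
import math

def sigmoid(z):
    """Sigmoid activation: e^z / (1 + e^z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """ReLU activation: 0 for z < 0, z otherwise."""
    return max(0.0, z)

def hidden_unit(x, w0, w1, g):
    """One hidden unit acting on a single input x: g(w0 + w1 * x)."""
    return g(w0 + w1 * x)

# Assumed weights for illustration: with w0 = -2 and w1 = 1 the unit
# stays at zero until x exceeds 2, instead of switching on at x = 0.
for x in [0.0, 1.0, 2.0, 3.0]:
    print(x, hidden_unit(x, w0=-2.0, w1=1.0, g=relu))
```

Only the last input, $ x = 3 $, produces a nonzero activation here, since $ -2 + x $ is positive only for $ x > 2 $.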

So in words, the model depicted in Figure 10.1 derives five new features by computing five different linear combinations of $ X $, and then squashes each through an activation function $ g(\cdot) $ to transform it. The final model is linear in these derived variables.
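
The two-step construction can be sketched in a few lines of plain Python. The function names and all weight values below are hypothetical, chosen so the arithmetic is easy to follow by hand; the sketch uses $ p = 4 $ inputs but only $ K = 2 $ hidden units to keep the listing short, and nothing here is fitted to data.

```python
def relu(z):
    return max(0.0, z)

def hidden_activations(x, W, w0, g):
    """A_k = g(w_k0 + sum_j w_kj * x_j) for each hidden unit k."""
    return [g(b + sum(w * xj for w, xj in zip(row, x)))
            for row, b in zip(W, w0)]

def predict(x, W, w0, beta, beta0, g=relu):
    """f(x) = beta_0 + sum_k beta_k * A_k."""
    A = hidden_activations(x, W, w0, g)
    return beta0 + sum(bk * ak for bk, ak in zip(beta, A))

# Hypothetical weights: one row of W per hidden unit, one column per input.
W = [[1.0, 0.0, -1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
w0 = [0.0, -1.0]      # hidden-layer intercepts w_k0
beta = [2.0, 3.0]     # output-layer coefficients beta_k
beta0 = 0.5           # output-layer intercept

x = [1.0, 2.0, 3.0, 4.0]
A = hidden_activations(x, W, w0, relu)
print(A)                                # [0.0, 5.0]
print(predict(x, W, w0, beta, beta0))   # 15.5
```

For this input the first linear combination is $ 1 - 3 = -2 $, which ReLU sets to zero, and the second is $ -1 + 2 + 4 = 5 $; the prediction is then $ 0.5 + 3 \cdot 5 = 15.5 $.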

The name neural network originally derived from thinking of these hidden units as analogous to neurons in the brain — values of the activations $ A_k = h_k(X) $ close to one are firing, while those close to zero are silent (using the sigmoid activation function).

The nonlinearity in the activation function $ g(\cdot) $ is essential, since without it the model $ f(X) $ in (10.1) would collapse into a simple linear model in $X_1, \ldots, X_p$. Moreover, having a nonlinear activation function allows the model to capture complex nonlinearities and interaction effects. Consider a very simple example with $ p = 2 $ input variables $ X = (X_1, X_2) $, and $ K = 2 $ hidden units $ h_1(X) $ and $ h_2(X) $ with $ g(z) = z^2 $. We specify the other parameters as

$$
\beta_0 = 0, \quad \beta_1 = \frac{1}{4}, \quad \beta_2 = -\frac{1}{4},
$$
$$
w_{10} = 0, \quad w_{11} = 1, \quad w_{12} = 1,
$$
$$
w_{20} = 0, \quad w_{21} = 1, \quad w_{22} = -1.
$$

From (10.2), this means that

$$
\begin{aligned}
h_1(X) &= (0 + X_1 + X_2)^2, \\
h_2(X) &= (0 + X_1 - X_2)^2.
\end{aligned} \tag{10.7}
$$

Then plugging (10.7) into (10.1), we get

$$
\begin{aligned}
f(X) &= 0 + \frac{1}{4} \cdot (0 + X_1 + X_2)^2 - \frac{1}{4} \cdot (0 + X_1 - X_2)^2 \\
&= \frac{1}{4} \left[ (X_1 + X_2)^2 - (X_1 - X_2)^2 \right] \\
&= X_1 X_2.
\end{aligned}
$$

So the sum of two nonlinear transformations of linear functions can give us an interaction! In practice, we would not use a quadratic function for $ g(z) $, since we would always get a second-degree polynomial in the original coordinates $ X_1, \ldots, X_p $. The sigmoid or ReLU activations do not have such a limitation.
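
The identity behind this example is $ (X_1 + X_2)^2 - (X_1 - X_2)^2 = 4 X_1 X_2 $, so the output weights $ \tfrac{1}{4} $ and $ -\tfrac{1}{4} $ leave exactly $ X_1 X_2 $. A quick numerical check, written as a small Python sketch with $ g(z) = z^2 $, hidden-unit weights $ (1, 1) $ and $ (1, -1) $, and all intercepts zero (the test points are arbitrary):

```python
def g(z):
    """Quadratic activation used only for this illustration."""
    return z ** 2

def f(x1, x2):
    """Two hidden units with beta = (1/4, -1/4), w_1 = (1, 1), w_2 = (1, -1)."""
    h1 = g(0 + 1 * x1 + 1 * x2)
    h2 = g(0 + 1 * x1 - 1 * x2)
    return 0 + 0.25 * h1 - 0.25 * h2

# Arbitrary test points: the network output should equal the product x1 * x2.
for x1, x2 in [(1.0, 2.0), (3.0, -1.0), (0.5, 4.0)]:
    print(x1, x2, f(x1, x2), x1 * x2)
```

At each point the third and fourth printed values agree.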

Fitting a neural network requires estimating the unknown parameters in (10.1). For a quantitative response, typically squared-error loss is used, so that the parameters are chosen to minimize

$$
\sum_{i=1}^{n} (y_i - f(x_i))^2.
$$

Details about how to perform this minimization are provided in Section 10.7.
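
Before turning to the minimization, it may help to see the criterion evaluated on its own. The sketch below only computes the squared-error loss; it does not perform any fitting. The toy data, the helper names `squared_error_loss` and `make_network`, and the choice to leave only the two output weights free are all assumptions made for illustration, reusing the quadratic-activation example from earlier in this section.

```python
def squared_error_loss(f, X, y):
    """Sum over observations of (y_i - f(x_i))**2."""
    return sum((yi - f(xi)) ** 2 for xi, yi in zip(X, y))

# Toy data (assumed): the response is exactly the product x1 * x2.
X = [(1.0, 2.0), (3.0, -1.0), (0.5, 4.0)]
y = [x1 * x2 for x1, x2 in X]

def make_network(beta1, beta2):
    """Quadratic-activation network with hidden weights (1, 1) and (1, -1)
    held fixed; only the two output weights are left free."""
    def f(x):
        x1, x2 = x
        return beta1 * (x1 + x2) ** 2 + beta2 * (x1 - x2) ** 2
    return f

loss_exact = squared_error_loss(make_network(0.25, -0.25), X, y)
loss_worse = squared_error_loss(make_network(0.25, 0.0), X, y)
print(loss_exact)   # 0.0
print(loss_worse)   # 25.44140625
```

The output weights $ (\tfrac{1}{4}, -\tfrac{1}{4}) $ reproduce the response exactly and give zero loss; any other choice, such as $ (\tfrac{1}{4}, 0) $, gives a strictly larger value.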


#### Multilayer Neural Networks

Modern neural networks typically have more than one hidden layer, and often many units per layer. In theory a single hidden layer with a large number of units has the ability to approximate most functions. However, the learning task of discovering a good solution is made much easier with multiple layers each of modest size.

We will illustrate a large dense network on the famous and publicly available MNIST handwritten digit dataset. Figure 10.3 shows examples of these digits. The idea is to build a model to classify the images into their correct digit class $ 0 $–$ 9 $. Every image has $ p = 28 \times 28 = 784 $ pixels, each of which is an eight-bit grayscale value between $ 0 $ and $ 255 $ representing the relative amount of the written digit in that tiny square. These pixels are stored in the input vector $ X $ (in, say, column order). The output is the class label, represented by a vector $ Y = (Y_0, Y_1, \ldots, Y_9) $ of 10 dummy variables, with a one in the position corresponding to the label, and zeros elsewhere. In the machine learning community, this is known as one-hot encoding. There are 60,000 training images, and 10,000 test images.
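
The two encodings just described, stacking a $ 28 \times 28 $ image into a vector of length 784 and turning a digit label into 10 dummy variables, can be sketched in plain Python as follows. The helper names are hypothetical, and the image is a made-up blank canvas with one bright pixel rather than an actual MNIST digit.

```python
def one_hot(label, n_classes=10):
    """Vector of n_classes dummy variables with a one at position `label`."""
    return [1 if k == label else 0 for k in range(n_classes)]

def flatten_column_order(image):
    """Stack the columns of a 2-D image (given as a list of rows) into one vector."""
    n_rows, n_cols = len(image), len(image[0])
    return [image[i][j] for j in range(n_cols) for i in range(n_rows)]

# Made-up example: a blank 28 x 28 image with a single bright pixel.
image = [[0] * 28 for _ in range(28)]
image[2][1] = 255             # row 2, column 1 (zero-based)

x = flatten_column_order(image)
print(len(x))                 # 784
print(x.index(255))           # 30, i.e. column 1 * 28 rows + row 2
print(one_hot(3))             # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Row order would work equally well, as long as the same ordering is used for every image.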

On a historical note, digit recognition problems were the catalyst that accelerated the development of neural network technology in the late 1980s at AT&T Bell Laboratories and elsewhere. Pattern recognition tasks of this kind are relatively simple for humans. Our visual system occupies a large fraction of our brains, and good recognition is an evolutionary force for survival. These tasks are not so simple for machines, and it has taken more than 30 years to refine the neural-network architectures to match human performance.

Figure 10.4 shows a multilayer network architecture that works well for solving the digit-classification task. It differs from Figure 10.1 in several ways:

- It has two hidden layers $ L_1 $ (256 units) and $ L_2 $ (128 units) rather than one. Later we will see a network with seven hidden layers.
- It has ten output variables, rather than one. In this case, the ten variables really represent a single qualitative variable and so are quite dependent. (We have indexed them by the digit class $ 0 $–$ 9 $ rather than $ 1 $–$ 10 $, for clarity.) More generally, in multi-task learning one can predict different responses simultaneously with a single network; they all have a say in the formation of the hidden layers.
- The loss function used for training the network is tailored for the multiclass classification task.
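
To get a feel for the size of such a network, the layer widths quoted above (784 inputs, 256 and 128 hidden units, 10 outputs) can be turned into a parameter count. This is a bookkeeping sketch only: it assumes every layer is fully connected with one intercept per unit, as in the single-layer model, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights plus intercepts for a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += (n_in + 1) * n_out   # +1 for each unit's intercept
    return total

sizes = [784, 256, 128, 10]           # input, L1, L2, output
print(count_parameters(sizes))        # 235146
print(count_parameters([4, 5, 1]))    # 31, the p = 4, K = 5 single-layer network
```

Under these assumptions the count is $ 785 \times 256 + 257 \times 128 + 129 \times 10 = 235{,}146 $, several times larger than the 60,000 training images.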