# Neural nets

Sources: 

* Tuca's spin glass course....._6/3/19 lecture_

* Gabrie 2020....._Mean-field inference methods for neural networks_

* Andrew Ng....._Stanford video lectures_

* Goodfellow et al. 2016....._Deep Learning_

In [1]:
# LIBRARIES TO CALL IN
#
#
import numpy as np
#
#
#

## 1.1 ..... _Goodfellow Ch. 1_

---
* "Inventors have long dreamed of creating machines that think"
---
* "The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally."
---
* "This book is about a solution to these more intuitive problems. The solution is to allow the computers to learn from experience and understand the world in terms of a hierarchy of concepts."
---
* "This dependence on representations ... appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently" 
   *  _More generally a comment on features, their relation to the notion of a coordinate system, and the fact that they don't come pre-specified for all tasks, e.g. recognition tasks_ .
---
* "One solution to this problem is to use machine learning to discover not only the mapping from representation to output, but also the representation itself. This approach is known as **representation learning**. Learned representations often result in much better performance..."
---
* "When designing features or algorithms for learning features, our goal is usually to separate the **factors of variation** that explain the observed data... A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones we do not care about."
---
* "The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel."
---
* "There are two main ways of measuring the depth of a model. The first view is based on _(essentially the number of layers of the computation graph)_ ... Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other"

In [2]:
### The subject switches to history:

---
* "Broadly speaking, there have been three waves of development of deep learning: deep learning known as **cybernetics** in the 1940s - 1960s, deep learning known as **connectionism** in the 1980s-1990s, and the current resurgence under the name deep learning beginning in 2006."
    * _The first wave started with the development of theories of biological learning_ (McCulloch-Pitts 1943, Hebb 1949) _and implementations of the first models such as the perceptron_ (Rosenblatt 1958, 1962). _The second wave, connectionism, started with backpropagation_ (Rumelhart et al. 1986) _to train neural networks with hidden layers. The current wave started with_ (Hinton et al. 2006, Bengio et al. 2007, Ranzato et al. 2007).
---
* "it would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications"
---
* **Stochastic gradient descent** was employed as a method to train even some early linear models in the 1950s.
--- 
* "The Neocognitron _(Fukushima, 1980)_ introduced a powerful model architecture for processing images that was inspired by the structure of the mammalian visual system and later became the basis for the modern convolutional neural network _(LeCun et al. 1998)_."
---
* "While neuroscience has successfully inspired several neural network _architectures_, we do not yet know enough about biological learning for neuroscience to offer much guidance for the _learning algorithms_ we use to train these architectures."
---
* "The connectionists began to study models of cognition that could actually be grounded in neural implementations _(Touretzky and Minton, 1985)_, reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s _(Hebb, 1949)_."
---
* "The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together."
---
* Other key concepts in the field are attributed to connectionists: "that of **distributed representation** _(Hinton et al., 1986)_. This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs."
    * I felt like this was a statement of some kind of mutual absolute continuity for a pair of measures, in that it indicates that the features are not too singularly concentrated.
---
* "Another major accomplishment of the connectionist movement was the successful use of back-propagation to train deep neural networks with internal representations and the popularization of the back-propagation algorithm _(Rumelhart et al. 1986, LeCun 1987)_"
---
* "At this point in time, deep networks were generally believed to be very difficult to train. We now know that algorithms that have existed since the 1980s work quite well, but this was not apparent circa 2006"

In [3]:
# There is a shorter comment on the 3rd wave, and then a discussion of other aspects of the evolution of ML

---
* "Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining _(Hinton et al. 2006)_ ... This wave of neural networks research popularized the use of the term 'deep learning' to emphasize that researchers were now able to train deeper neural networks than had been possible before."
---
* It was mentioned that good algorithms to train deep nets have existed since the 1980s: "The most important new development is that today we can provide these algorithms with the resources they need to succeed... As more and more of our activities take place on computers, more and more of what we do is recorded."
---
* "As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples."
--- 
* "The largest contest in object recognition is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) held each year. A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1\% to 15.3\% _(Krizhevsky et al. 2012)_... as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6\%."

## 1.2 ..... _Goodfellow Ch. 4_

---
* "The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns... _(leading to)_ approximation error, which in many cases is just rounding error."
---
* "Rounding error is problematic, especially when it compounds across many operations"
---
* "One form of rounding error that is particularly devastating is **underflow**. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number."
---
* "Another highly damaging form of numerical error is **overflow**". Overflow occurs when numbers of large magnitude are approximated as $\pm \infty$. 
---
* "One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be
$$
\text{softmax}(x)_i = \frac{ \exp(x_i) }{ \sum_{j=1}^n \exp(x_j) }
$$
---
* Both of _(these difficulties, overflow in the exponentials and an undefined result from dividing by an underflowed zero)_ can be resolved by evaluating 
$$
\text{softmax}(z) 
$$
where $z = x - \max_i x_i$. Underflow in the numerator can still make an individual output evaluate to zero, which is a problem if we then take the logarithm of the result, so there is a separate method for computing $\log ( \text{softmax}(\cdot) )$ in a numerically stable way.
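
A minimal sketch of the shift-by-the-max trick in NumPy. The function names (`softmax_naive`, `softmax_stable`, `log_softmax_stable`) and the test vector are my own choices for illustration, not anything from the book. The log version works directly with $z_i - \log \sum_j \exp(z_j)$, so it never takes the log of an output that has underflowed to zero.

```python
import numpy as np

def softmax_naive(x):
    # Direct transcription of the definition; exp overflows for large entries.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift by the max so the largest argument to exp is 0.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def log_softmax_stable(x):
    # log softmax(x)_i = z_i - log(sum_j exp(z_j)); the sum is at least 1,
    # so the log is always finite.
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(x))      # nan entries: exp overflows to inf, then inf / inf
print(softmax_stable(x))
print(log_softmax_stable(x))
```

Softmax is unchanged by adding a constant to every entry of $x$, which is why the shifted version returns the same probabilities whenever the naive one is well defined.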

In [None]:
# Moving on to 4.2, "Poor Conditioning"

---
* "Conditioning refers to how rapidly a function changes with respect to small changes in its inputs."
    * I see it as describing modulus of continuity, or the support of the function in frequency space.
---
* "Consider the function $f(x) = A^{-1}x$, where $A$ is a matrix and $x$ is a compatible vector. When $A \in \mathbb{R}^{n \times n}$ has an eigenvalue decomposition, its **condition number** is 
$$
\max_{i,\,j} \left| \frac{ \lambda_i }{ \lambda_j } \right|
$$
---
* "When this number is large, matrix inversion is particularly sensitive to error in the input. This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify _pre-existing errors_ when we multiply by the true matrix inverse."
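
A small numerical illustration of this amplification, as a sketch under my own assumptions: the diagonal matrix, the input `x`, and the perturbation `dx` are made-up values chosen so that the input lies along the large-eigenvalue direction and the error lies along the small-eigenvalue direction. The helper `eig_condition_number` is a hypothetical name that just implements the ratio above.

```python
import numpy as np

def eig_condition_number(A):
    # max_{i,j} |lambda_i / lambda_j| = max |lambda| / min |lambda|
    lam = np.abs(np.linalg.eigvals(A))
    return lam.max() / lam.min()

A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
kappa = eig_condition_number(A)

x = np.array([1.0, 0.0])     # input along the large-eigenvalue direction
dx = np.array([0.0, 1e-8])   # tiny pre-existing error along the small one

y = np.linalg.solve(A, x)            # A^{-1} x
y_err = np.linalg.solve(A, x + dx)   # A^{-1} (x + dx)

rel_in = np.linalg.norm(dx) / np.linalg.norm(x)
rel_out = np.linalg.norm(y_err - y) / np.linalg.norm(y)

print("condition number:     ", kappa)
print("relative input error: ", rel_in)
print("relative output error:", rel_out)
print("amplification factor: ", rel_out / rel_in)
```

For this worst-case alignment the amplification factor matches the condition number; other choices of `x` and `dx` are amplified less.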

In [1]:
# Moving to 4.3: Gradient-Based Optimization

---
* "Most deep learning algorithms involve optimization of some sort... The function we want to minimize or maximize is called the **objective function** or **criterion**. When we are minimizing it, we may also call it the **cost function**, **loss function**, or **error function**. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms."
---
* "In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions. All of this makes optimization very difficult, especially when the input to the function is multidimensional."
---
* "We can decrease $f$ by moving in the direction of the negative gradient. This is known as the **method of steepest descent** or **gradient descent**. Steepest descent proposes a new point
$$
x' = x - \epsilon \nabla_x f(x) 
$$
    * I wondered in the margin if there was an algorithm that tracked the top five largest coordinates, assumed continuity, and coasts for a bit (reducing computation cost?). If this does poorly, reduce step-size... not sure if this is an idea really
 ---
* Above, $\epsilon$ is the **learning rate**, "a positive scalar determining the size of the step. We can choose $\epsilon$ in several different ways. A popular approach is to set $\epsilon$ to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate
 $$
 f( x - \epsilon \nabla_x f(x) )
 $$
 for several values of $\epsilon$ and choose the one that results in the smallest objective function value. This last strategy is called a **line search**.
    * I really like this idea. I wonder if it lays down a mesh and looks at evenly spaced points, or if it tests steps at different scales. 
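
A minimal sketch of gradient descent with this kind of line search. The candidate step sizes are an assumption on my part: I try a logarithmic grid (steps at several scales), which is one reasonable choice and not something the book specifies. The function name `gradient_descent_line_search` and the quadratic test objective are hypothetical.

```python
import numpy as np

def gradient_descent_line_search(f, grad_f, x0, candidates, n_steps=100, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        # Evaluate f(x - eps * g) for each candidate eps and keep the best one.
        trial_values = [f(x - eps * g) for eps in candidates]
        best = int(np.argmin(trial_values))
        if trial_values[best] >= f(x):
            break  # no candidate step improves the objective
        x = x - candidates[best] * g
    return x

# Hypothetical test objective: an anisotropic quadratic bowl with its minimum at 0.
D = np.array([1.0, 10.0])

def f(x):
    return 0.5 * np.sum(D * x**2)

def grad_f(x):
    return D * x

candidates = np.logspace(-4, 0, 9)   # step sizes from 1e-4 up to 1
x_min = gradient_descent_line_search(f, grad_f, [3.0, -2.0], candidates)
print(x_min, f(x_min))
```

Each iteration costs one gradient evaluation plus one function evaluation per candidate, so the grid size is a trade-off between step quality and cost.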

In [2]:
# 4.3.1: Beyond the gradient: Jacobian and Hessian

---
* "Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors. The second derivative in a specific direction represented by a unit vector $e$ is given by $e^T He$. When $e$ is an eigenvector of $H$, the second derivative in that direction is given by the corresponding eigenvalue."
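
A quick numerical check of this statement, as a sketch: the symmetric matrix `H` below is an arbitrary example of mine, not from the book. For a unit vector that is not an eigenvector, $e^T H e$ is a weighted average of the eigenvalues, with weights given by the squared components of $e$ in the eigenbasis.

```python
import numpy as np

# Arbitrary real symmetric matrix standing in for a Hessian.
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh is meant for symmetric matrices: real eigenvalues in ascending order,
# orthonormal eigenvectors in the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(H)

# Along each eigenvector, e^T H e recovers the corresponding eigenvalue.
for lam, e in zip(eigvals, eigvecs.T):
    print(lam, e @ H @ e)

# Along a generic unit direction, e^T H e lies between the extreme eigenvalues.
d = np.array([1.0, 1.0]) / np.sqrt(2.0)
second_derivative = d @ H @ d
weights = (eigvecs.T @ d) ** 2       # squared components of d in the eigenbasis
print(second_derivative, weights @ eigvals)
```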
---
* "The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function $f(x)$ around the current point $x^{(0)}$:
$$
f(x) \approx
    f(x^{(0)})
    + (x - x^{(0)})^T 
    g\\
    + \frac{1}{2} (x - x^{(0)})^T
    H 
    (x - x^{(0)})
$$
where $g$ is the gradient and $H$ is the Hessian, both at $x^{(0)}$. "
---
* "If we use a learning rate $\epsilon$, the new point $x$ will be given by $x^{(0)} - \epsilon g$. Substituting this into our approximation, we obtain
$$
f( x^{(0)} - \epsilon g ) \approx \\
f(x^{(0)} ) - \epsilon g^T g + \frac{1}{2} \epsilon^2 g^T Hg 
$$
---
* "There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill."
---
* "When $g^T Hg$ is positive, solving for the optimal step size that decreases the Taylor series approximation of the function most yields
$$
\epsilon^* = \frac{ g^T g }{ g^T Hg } 
$$
In the worst case, when $g$ aligns with the eigenvector of $H$ corresponding to the maximal eigenvalue $\lambda_{\max}$, this optimal step size is given by $\frac{1}{\lambda_{\max}}$. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate."
---
* "When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer. It also makes it difficult to choose a good step size."
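
A sketch that checks the step-size formulas on a poorly conditioned quadratic. The Hessian `H`, the starting point `x0`, and the helper `taylor_prediction` are my own illustrative choices. Since the objective here is exactly quadratic, the second-order Taylor prediction agrees with the true value, and a step a few times larger than $\epsilon^*$ moves uphill.

```python
import numpy as np

H = np.diag([1.0, 100.0])   # hypothetical Hessian with condition number 100

def f(x):
    return 0.5 * x @ H @ x

def grad(x):
    return H @ x

x0 = np.array([1.0, 1.0])
g = grad(x0)

def taylor_prediction(eps):
    # f(x0) - eps * g^T g + 0.5 * eps^2 * g^T H g
    return f(x0) - eps * (g @ g) + 0.5 * eps**2 * (g @ H @ g)

# Optimal step size for the quadratic model (valid here because g^T H g > 0).
eps_star = (g @ g) / (g @ H @ g)

for eps in [0.5 * eps_star, eps_star, 3.0 * eps_star]:
    print(eps, f(x0 - eps * g), taylor_prediction(eps))

# Worst case: a gradient aligned with the eigenvector of lambda_max = 100.
v_max = np.array([0.0, 1.0])
eps_worst = (v_max @ v_max) / (v_max @ H @ v_max)
print(eps_worst)
```

In this example the gradient is dominated by the high-curvature direction, so $\epsilon^*$ is close to $1/\lambda_{\max}$ and progress along the low-curvature direction is slow, which is the poor-conditioning problem described above.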

In [3]:
# Newton's method

---
* "The issue _(of step-size choice)_ can be resolved using information from the Hessian matrix to guide the search. The simplest method for doing so is known as Newton's method. Newton's method is based on using a second order Taylor series expansion to approximate $f(x)$ near some point $x^{(0)} \equiv \tilde{x}$:
$$
f(x) \approx f(\tilde{x}) + (x- \tilde{x})^T g( \tilde{x}) \\
+ \frac{1}{2} (x- \tilde{x} )^T H (x - \tilde{x} )
$$
If we then solve for the critical point of this function, we obtain:
$$
x^* = \tilde{x} - H ( \tilde{x} )^{-1} g (\tilde{x} ) 
$$
* When $f$ is a positive definite quadratic function, Newton's method consists of applying the above equation once to jump to the minimum of the function directly. When $f$ is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying the above equation multiple times ... _(which)_ can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point."
---
* Newton's method seems to be deployed locally when the algorithm is confident it is near a desired local minimum.
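
A minimal sketch of a single Newton update, under my own assumptions: `newton_step` is a hypothetical helper, and both test functions (a positive definite quadratic and the saddle $x^2 - y^2$) are made up for illustration. It shows the one-step jump to the minimum of a quadratic, and also that the same update lands exactly on a saddle point.

```python
import numpy as np

def newton_step(grad, hess, x):
    # x* = x - H(x)^{-1} g(x), computed by solving a linear system
    # instead of forming the inverse explicitly.
    return x - np.linalg.solve(hess(x), grad(x))

# Positive definite quadratic f(x) = 0.5 x^T A x - b^T x, minimized where A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x_start = np.array([10.0, -7.0])
x_newton = newton_step(lambda x: A @ x - b, lambda x: A, x_start)
x_exact = np.linalg.solve(A, b)
print(x_newton, x_exact)   # one step reaches the minimizer

# Saddle f(x, y) = x^2 - y^2: the origin is a critical point but not a minimum.
def saddle_grad(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

def saddle_hess(p):
    return np.diag([2.0, -2.0])

p_newton = newton_step(saddle_grad, saddle_hess, np.array([1.0, 1.0]))
print(p_newton)            # the update jumps straight to the saddle
```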
---
* "The optimization algorithms employed in most contexts in this book are applicable to a wide variety of functions, but come with almost no guarantees. Deep learning algorithms tend to lack guarantees because the family of functions used in deep learning is quite complicated. In many other fields, the dominant approach is to design optimization algorithms for a limited family of functions."
---
* "In the context of deep learning, we gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives... many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications."
---
* Regularization seems to lead to considering constrained optimization problems. There is a generalization of Lagrange multipliers to constraints which are defined by equalities or inequalities. This is called the **Karush-Kuhn-Tucker (KKT)** approach. 
---
* "The inequality constraints are particularly interesting. We say an inequality constraint is **active** if it is satisfied at the boundary."

## 1.3 ..... 4.5 Linear least squares

Suppose we want to find the value of $x$ that minimizes
$$
f(x) 
    = 
        \frac{1}{2} 
        \| \mathbf{A} x - b \|_2^2
$$
There are linear algebra algorithms to compute the solution (the argmin(s) of the problem), but let us also explore how to solve this using gradient descent. 

The gradient is
$$
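\nabla_x f(x) = \frac{1}{2} \nabla_x \| \mathbf{A} x - b \|_2^2 = \mathbf{A}^T ( \mathbf{A} x - b ) = \mathbf{A}^T \mathbf{A} x - \mathbf{A}^T b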
$$
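
A minimal sketch of gradient descent on this objective, using the gradient $\mathbf{A}^T(\mathbf{A}x - b)$. Everything specific here is an assumption of mine: the function name `least_squares_gd`, the random test problem, the tolerance `delta`, and the choice of step size $1/\lambda_{\max}(\mathbf{A}^T\mathbf{A})$, which follows the step-size discussion in 4.3.1 since $\mathbf{A}^T\mathbf{A}$ is the Hessian of $f$. The result is compared against NumPy's direct least squares solver.

```python
import numpy as np

def least_squares_gd(A, b, eps, delta=1e-8, max_iters=100_000):
    x = np.zeros(A.shape[1])
    for _ in range(max_iters):
        g = A.T @ (A @ x - b)            # gradient of 0.5 * ||A x - b||^2
        if np.linalg.norm(g) <= delta:   # stop once the gradient is tiny
            break
        x = x - eps * g
    return x

# Hypothetical overdetermined test problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

# Step size 1 / lambda_max of the Hessian A^T A.
eps = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

x_gd = least_squares_gd(A, b, eps)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_gd)
print(x_ref)
```

The number of iterations needed grows with the condition number of $\mathbf{A}^T\mathbf{A}$, which ties this example back to the conditioning notes in 4.2 and 4.3.1.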