<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Initialization-cells----run-this-first!" data-toc-modified-id="Initialization-cells----run-this-first!-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Initialization cells -- run this first!</a></span></li><li><span><a href="#Learning-to-fit--scalar-valued-functions-with-a-neural-network" data-toc-modified-id="Learning-to-fit--scalar-valued-functions-with-a-neural-network-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Learning to fit  scalar valued functions with a neural network</a></span><ul class="toc-item"><li><span><a href="#Modeling-the-input-output-relation" data-toc-modified-id="Modeling-the-input-output-relation-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Modeling the input-output relation</a></span></li><li><span><a href="#The-mean-squared-error-loss-function-and-its-gradient" data-toc-modified-id="The-mean-squared-error-loss-function-and-its-gradient-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>The mean-squared error loss function and its gradient</a></span></li><li><span><a href="#Using-the-network-to-learn-a-univariate,-scalar-valued-function" data-toc-modified-id="Using-the-network-to-learn-a-univariate,-scalar-valued-function-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Using the network to learn a univariate, scalar valued function</a></span></li><li><span><a href="#Understanding-how-a-neural-network-tells-apart-data-from-two-categories" data-toc-modified-id="Understanding-how-a-neural-network-tells-apart-data-from-two-categories-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Understanding how a neural network tells apart data from two categories</a></span></li><li><span><a href="#Linear-separability-of-two-classes" data-toc-modified-id="Linear-separability-of-two-classes-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Linear separability of two classes</a></span></li><li><span><a 
href="#Separating-linearly-non-separable-classes-with-a-neural-network" data-toc-modified-id="Separating-linearly-non-separable-classes-with-a-neural-network-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Separating linearly non-separable classes with a neural network</a></span></li></ul></li><li><span><a href="#Fitting-vector-valued-multivariate-functions" data-toc-modified-id="Fitting-vector-valued-multivariate-functions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Fitting vector valued multivariate functions</a></span><ul class="toc-item"><li><span><a href="#Application:-Handwriting-recognition----more-than-2-digits-at-a-time" data-toc-modified-id="Application:-Handwriting-recognition----more-than-2-digits-at-a-time-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Application: Handwriting recognition -- more than 2 digits at a time</a></span></li><li><span><a href="#Recognizing-your-handwriting" data-toc-modified-id="Recognizing-your-handwriting-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Recognizing <em>your</em> handwriting</a></span></li></ul></li><li><span><a href="#Summary-and-where-to-next" data-toc-modified-id="Summary-and-where-to-next-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Summary and where to next</a></span></li></ul></div>

# Initialization cells -- run this first!

In [0]:
using Plots, Interact
gr()

# Learning to fit  scalar valued functions with a neural network

We saw in the previous chapters how neural networks can be used to learn a scalar valued function (Chapter 1, where we fit $f(x)$) and a multivariate function (Chapter 2, where we trained a network to distinguish between two digits).

Our goal in this chapter is to get a deeper understanding and appreciation of the power of non-linear activation functions. By the end of this chapter you will know how to train a more general network than we have encountered so far -- this network will contain an arbitrary number of input neurons, an arbitrary number of neurons in a so-called hidden layer, and an output layer with either a single neuron or as many neurons as the hidden layer.


We will use this network to:

- learn univariate and multivariate scalar valued functions, including the univariate functions in Chapter 1

- revisit how to discriminate between two classes, and demonstrate that a `tanh` network is able to classify a challenging dataset that a `linear` network is unable to

- learn how to tell apart more than two classes

- learn to recognize all 10 digits (0-9) in your handwriting!

Let us begin!

## Modeling the input-output relation

We begin with a network that can (we hope) learn a multivariate scalar valued function.

<img src="singleclass_neuron.png" alt="Drawing" style="width: 500px;"/>

The network we will consider has $d$ inputs (or input neurons), corresponding to the inputs $x_1, \ldots, x_d$, one hidden layer consisting of $n$ neurons, and an output layer consisting of a single output neuron. The figure above shows such a configuration for $d = 4$ and $n = 5$.

Let $w_{ij}$ denote the weight of the edge from the input neuron $j$ to the neuron $i$ in the hidden layer and let $b_1, \ldots, b_n$ denote the respective biases of the hidden-layer neurons.


The output of the first neuron in the hidden layer is given by

\begin{equation}
g_{\color{blue}{1}}(x) = f_{\rm a}(w_{\color{blue}{1}1} x_1 + w_{\color{blue}{1} 2} x_2 + \ldots + w_{\color{blue}{1} d} x_d + b_{\color{blue}{1}}).
\end{equation}


Similarly, the output of the second neuron in the hidden layer is given by

\begin{equation}
g_{\color{blue}{2}}(x) = f_{\rm a}(w_{\color{blue}{2}1} x_1 + w_{\color{blue}{2} 2} x_2 + \ldots + w_{\color{blue}{2} d} x_d + b_{\color{blue}{2}}).
\end{equation}

More generally, if we define $x$ to be the vector of inputs

$$ x = \begin{bmatrix} x_1 & \ldots & x_{d} \end{bmatrix}^{T},$$

then the output of the $i$-th neuron in the hidden layer is given by

\begin{equation}
g_{\color{blue}{i}}(x) = f_{\rm a}(w_{\color{blue}{i}1} x_1 + w_{\color{blue}{i} 2} x_2 + \ldots + w_{\color{blue}{i} d} x_d + b_{\color{blue}{i}}).
\end{equation}
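
As a minimal sketch of the formula for a single hidden neuron, the snippet below evaluates $g_1(x)$ for $d = 4$ inputs. It assumes `tanh` as the activation function $f_{\rm a}$; the helper name `hidden_neuron` and the numerical values are made up for illustration and are not part of the original notebook.

```julia
using LinearAlgebra  # for dot

# Hypothetical helper: output of one hidden neuron with weight vector w_i and bias b_i.
# The activation f_a defaults to tanh (an assumption for this sketch).
hidden_neuron(w_i, b_i, x; f_a = tanh) = f_a(dot(w_i, x) + b_i)

x   = [1.0, 2.0, 3.0, 4.0]       # d = 4 inputs (illustrative values)
w_1 = [0.5, -0.5, 0.25, 0.0]     # weights w_11, ..., w_14 (illustrative values)
b_1 = 0.25                       # bias of hidden neuron 1

g_1 = hidden_neuron(w_1, b_1, x) # equals tanh(0.25 + 0.25) = tanh(0.5)
```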

The output of the neural network is given by

\begin{equation}
g(x) = \sum_{i=1}^n g_i(x)
\end{equation}

which is equivalent to the expression  

\begin{equation}
g(x) = \sum_{i=1}^{n} f_{\rm a}\left(\sum_{j=1}^{d} w_{ij} x_j + b_i\right),
\end{equation}
where $f_{\rm a}(z)$ is the activation function.
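
The double sum can be transcribed directly into two nested loops. The following is a sketch under the assumption that the weights are stored in an $n \times d$ array `W` indexed as `W[i, j]`; the function name `network_output_loops` and the example values are hypothetical.

```julia
# Hypothetical helper: evaluates g(x) = sum_i f_a(sum_j w_ij * x_j + b_i) with explicit loops.
function network_output_loops(W, b, x; f_a = tanh)
    n, d = size(W)            # n hidden neurons, d inputs
    g = 0.0
    for i in 1:n
        z = b[i]              # start from the bias of neuron i
        for j in 1:d
            z += W[i, j] * x[j]
        end
        g += f_a(z)           # add the output of hidden neuron i
    end
    return g
end

W = [1.0 0.0; 0.0 1.0; 1.0 1.0]  # n = 3, d = 2 (illustrative values)
b = [0.0, 0.0, -1.0]
x = [0.5, 0.5]

g = network_output_loops(W, b, x)  # tanh(0.5) + tanh(0.5) + tanh(0.0)
```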


Let us rewrite this expression using a matrix-vector product and the "dot" operator, which applies a function to each element of a vector.

Define the $n \times d$ matrix

$$W = \begin{bmatrix} w_{11} & \ldots & w_{1d}  \\
                      w_{21} & \ldots & w_{2d} \\
                      \vdots & \ddots & \vdots \\
                      w_{n1} & \ldots & w_{nd} \end{bmatrix}.$$
                      
Then, for the vector of biases

$$ b = \begin{bmatrix} b_1 & \ldots & b_{n} \end{bmatrix}^{T},$$

we have that

\begin{equation}
\begin{bmatrix} g_{1}(x) \\ g_{2}(x) \\ \vdots \\ g_{n}(x) \end{bmatrix} =  f_{\rm a}.(W x + b).
\end{equation}
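
In Julia, the notation $f_{\rm a}.(W x + b)$ corresponds to broadcasting the activation function over the vector `W * x + b`. A small sketch, assuming `tanh` as the activation and made-up values for `W`, `b` and `x`:

```julia
f_a = tanh                        # assumed activation function for this sketch

W = [1.0 0.0; 0.0 1.0; 1.0 1.0]   # n = 3 hidden neurons, d = 2 inputs (illustrative values)
b = [0.0, 0.0, -1.0]
x = [0.5, 0.5]

# The dot applies f_a to each entry of the length-n vector W * x + b,
# giving the hidden-layer outputs [g_1(x), ..., g_n(x)].
hidden = f_a.(W * x + b)
```

The result `hidden` is a vector with one entry per hidden neuron, so its length equals the number of rows of `W`.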


Thus, we can express the output of this neural network as

\begin{equation}
g(x) = \textrm{sum}\left( f_{\rm a}.(W x + b) \right),
\end{equation}

where the operator $\textrm{sum}(\cdot)$ adds up the elements of the vector.
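
The compact expression maps to a one-line Julia function. The sketch below (the name `network` and all numerical values are assumptions for illustration) also evaluates the same network with the identity activation, in which case the output collapses to a single affine function of $x$, namely the column sums of $W$ applied to $x$ plus the sum of the biases. This is one way to see why a non-linear activation is needed for the hidden layer to add expressive power.

```julia
using LinearAlgebra  # for dot

# Hypothetical helper: g(x) = sum(f_a.(W x + b)), with tanh as the assumed default activation.
network(W, b, x; f_a = tanh) = sum(f_a.(W * x + b))

W = [1.0 0.0; 0.0 1.0; 1.0 1.0]   # n = 3, d = 2 (illustrative values)
b = [0.0, 0.0, -1.0]
x = [0.5, 0.5]

g_tanh   = network(W, b, x)                  # non-linear hidden layer
g_linear = network(W, b, x; f_a = identity)  # linear hidden layer

# With the identity activation, sum(W x + b) = (column sums of W) . x + sum(b).
collapsed = dot(vec(sum(W, dims = 1)), x) + sum(b)
```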