## Three types of ML

- Supervised learning: learn to predict an output given an input vector
 - Regression: the target output is a real number or a vector of real numbers
 - Classification: the target output is a class label (binary or multiclass)
 
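 Start by choosing a model class $y = f(\mathbf x, \mathbf w)$, which maps an input vector $\mathbf x$ to a predicted output $y$ using parameters $\mathbf w$.
 
 Fit: adjust $\mathbf w$ to minimize some discrepancy between the target output and the actual output produced by the model.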
- Reinforcement learning: learn to take an action to maximize payoff
  - The output is an action or sequence of actions, and the only supervisory signal is an occasional scalar reward
  - The goal in selecting each action is to maximize the expected sum of the future rewards
  - The reward is typically delayed, so it is hard to know where we went wrong
  - A scalar reward does not supply much information
  
- Unsupervised learning: learn to discover a good internal representation of the input
  - Many researchers thought that clustering was the only form of unsupervised learning
  - The aim of unsupervised learning is not clear. One major aim is to discover an internal representation of the input
  - Other goals: dimensionality reduction, providing an economical representation of the input in terms of learned features, and finding sensible clusters in the input
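
As a rough illustration of the supervised "choose a model class, then fit" recipe above, the sketch below uses a linear model class and squared error as the discrepancy. Both choices, the gradient descent settings (`lr`, `steps`) and the toy data are assumptions made for this example, not something the notes prescribe.

```python
def predict(x, w):
    """Linear model class y = f(x, w); w[0] is the bias."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def squared_error(data, w):
    """Discrepancy between target outputs and model outputs, summed over cases."""
    return sum((t - predict(x, w)) ** 2 for x, t in data)


def fit(data, n_features, lr=0.05, steps=2000):
    """Minimize the mean squared error with plain gradient descent."""
    w = [0.0] * (n_features + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in data:
            err = predict(x, w) - t
            grad[0] += 2 * err
            for i, xi in enumerate(x):
                grad[i + 1] += 2 * err * xi
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w


# Hypothetical toy data generated from t = 1 + 2x.
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w = fit(data, n_features=1)
```

After fitting, `w` should be close to `[1.0, 2.0]` and `squared_error(data, w)` close to zero, because the toy targets are exactly linear in the input.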

## Reasons to study neural computation

- To understand how the brain actually works
- To understand a style of parallel computation inspired by neurons and their adaptive connections
- To solve practical problems by using novel learning algorithms inspired by the brain

### A typical cortical neuron

- Gross physical structure: one axon that branches, a dendritic tree that collects input from other neurons
- Axons typically contact dendritic trees at synapses: a spike of activity in the axon causes charge to be injected into the post-synaptic neuron
- There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.
- The visual cortex, which handles visual recognition, is organized as a hierarchy of areas. Each area receives its input representation through signals from the other areas it is connected to.

### Synapses

- When a spike of activity travels along an axon and arrives at a synapse it causes vesicles of transmitter chemical to be released
- The transmitter molecules diffuse across the synaptic cleft and bind to receptor molecules in the membrane of the post-synaptic neuron thus changing their shape.
- The effectiveness of the synapse can be changed:
    - vary the number of vesicles of transmitter
    - vary the number of receptor molecules
- Synapses are slow, but they have advantages over RAM: they are very small, use very little power, and adapt using locally available signals
- The effect of each input line on the neuron is controlled by a synaptic weight, which can be positive or negative
- The synaptic weights adapt so that the whole network learns to perform useful computations. We have about $10^{11}$ neurons, each with about $10^4$ weights.
- Different bits of the cortex do different things.


## Simple models of neurons

- Linear neuron
$$
y = w_0 + \sum_{i=1}^D x_i w_i
$$
($w_0$: bias; $w_i$: weight on the $i^{th}$ feature, $i = 1, \ldots, D$).

- Binary threshold neuron: computes a weighted sum of the inputs and sends out a fixed-size spike of activity if the weighted sum exceeds a threshold.
$$
y = 1_{w_0 + \sum_{i=1}^D x_i w_i \geq 0}
$$

- Rectified linear neuron (linear threshold neuron)
$$
y = (w_0 + \sum_{i=1}^D x_i w_i) 1_{w_0 + \sum_{i=1}^D x_i w_i \geq 0}
$$

- Sigmoid neurons
$$
y = \frac1{1+ \exp(-w_0 - \sum_{i=1}^D x_i w_i)}
$$

- Stochastic binary neurons
$$
P(y = 1) = \frac1{1+ \exp(-w_0 - \sum_{i=1}^D x_i w_i)}
$$
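
The five neuron models above differ only in what they do with the total input $w_0 + \sum_i x_i w_i$. The sketch below spells that out in plain Python; the function names, the convention of storing the bias as `w[0]`, and the example values are choices made for this illustration.

```python
import math
import random


def total_input(x, w):
    """z = w_0 + sum_i x_i w_i, with the bias stored as w[0]."""
    return w[0] + sum(xi * wi for xi, wi in zip(x, w[1:]))


def linear(x, w):
    return total_input(x, w)


def binary_threshold(x, w):
    return 1 if total_input(x, w) >= 0 else 0


def rectified_linear(x, w):
    z = total_input(x, w)
    return z if z >= 0 else 0.0


def sigmoid(x, w):
    return 1.0 / (1.0 + math.exp(-total_input(x, w)))


def stochastic_binary(x, w, rng=random):
    """Emit 1 with probability given by the sigmoid of the total input."""
    return 1 if rng.random() < sigmoid(x, w) else 0


x = [1.0, 2.0]
w = [-0.5, 1.0, -1.0]  # total input: -0.5 + 1.0 - 2.0 = -1.5
```

With these example values the total input is negative, so the binary threshold and rectified linear units output zero, while the sigmoid unit outputs a small positive value that the stochastic unit treats as its firing probability.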

## Types of NN

- Feed-forward: First layer is input, last layer is output, one or more hidden layers. They compute a series of transformations that change the similarities between cases.

<img src="F1.png"></img>

- Recurrent: Have directed cycles in their connection graph. They can have complicated dynamics, which makes them very difficult to train.

<img src="F2.png"></img>

Recurrent networks are a natural way to model sequential data. Unrolled in time, they are equivalent to very deep nets with one hidden layer per time slice, with the same weights used at every time slice. They can remember information in their hidden state for a long time.

<img src="F3.png"></img>

- Symmetrically connected network: Like recurrent networks but the connections between units are symmetrical (same weight in both directions). More restricted in what they can do because they obey an energy function. For example, they cannot model cycles.
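
To make the "one hidden layer per time slice" view of the recurrent case concrete, here is a minimal sketch of a recurrent net with a single hidden unit, unrolled over a short sequence. The `tanh` nonlinearity, the scalar weights and the input sequence are assumptions chosen for illustration only.

```python
import math


def rnn_forward(xs, w_in, w_rec, b):
    """Run a one-unit recurrent net over a sequence; return every hidden state.

    The same weights (w_in, w_rec, b) are reused at every time step, so the
    loop behaves like a deep feed-forward net with one layer per time slice.
    """
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h + b)
        states.append(h)
    return states


# A single non-zero input at the first time step, then silence.
states = rnn_forward([1.0, 0.0, 0.0, 0.0], w_in=1.0, w_rec=0.9, b=0.0)
```

Every later hidden state stays non-zero even though the remaining inputs are zero, which is a small-scale picture of the hidden state carrying information forward in time.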

### What recurrent neural nets can do

- I. Sutskever (2011) trained a special type of RNN to predict the next character in a sequence. After training for a long time on a string of $5\cdot 10^8$ characters, he got it to generate new text.

## Perceptron - The First Generation of Neural Networks

**Paradigm:**

- Learn how to weight each of the feature activations to get a single scalar quantity

- If this quantity is above some threshold, decide that the input vector is of class 1, otherwise class 0.

**Binary threshold neurons**
$$
z = w_0 + \sum_{i=1}^D x_i w_i
$$

Decision rule
$$
y = \begin{cases} 1 \qquad \textrm{ if } z \geq 0; \\ 0 \qquad \textrm{ if } z < 0 \end{cases}
$$

In other words,
$$
y = \mathbf 1_{w_0 + \sum_{i=1}^D x_i w_i \geq 0}
$$

**The perceptron convergence procedure**

- Add an extra component with value 1 to each input vector.
- Pick training cases using any policy that ensures every training case will keep getting picked.
 - If the output unit is correct, do nothing
 - If the output unit incorrectly outputs a 0, add the input vector to the weight vector.
 - If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.
- This is guaranteed to find a set of weights that gets the right answer for all the training cases if any such set exists.
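
The procedure above can be sketched in a few lines of Python. Cycling through the cases in order is just one policy that keeps picking every case, and `max_epochs` is a safety cap added for this sketch (the procedure itself has no such limit); the AND data set is a hypothetical example of a linearly separable problem.

```python
def predict(w, x):
    """Binary threshold unit; x already includes the constant 1 component."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


def train_perceptron(cases, max_epochs=100):
    """Return weights that classify every case correctly, or None if the
    epoch cap is reached first."""
    # Add an extra component with value 1 to each input vector.
    augmented = [([1.0] + list(x), t) for x, t in cases]
    w = [0.0] * len(augmented[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in augmented:
            if predict(w, x) == t:
                continue  # correct output: do nothing
            mistakes += 1
            # Wrongly output 0: add the input; wrongly output 1: subtract it.
            sign = 1.0 if t == 1 else -1.0
            w = [wi + sign * xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            return w
    return None


and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_cases)
```

The first entry of the returned `w` plays the role of the bias $w_0$, since it multiplies the constant 1 component.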

**Weight space**

- The weight space is a vector space with one dimension per weight. A point in the space represents a particular setting of all the weights.

- Assume we have eliminated the threshold by treating the bias as an extra weight. Each training case $x$ can then be represented as a hyperplane through the origin, $w \cdot x = 0$, and the weight vector must lie on the correct side of this hyperplane to get the answer right: $w \cdot x \geq 0$ for a positive case and $w \cdot x < 0$ for a negative case.

- So we need to find a vector $w$ that satisfies the system of inequalities $w \cdot x \geq 0$ for every positive case and $w \cdot x < 0$ for every negative case.

- In other words, we have $N$ half-spaces (one per training case) and wish to find a point in their intersection, if it is non-empty. When it exists, this intersection is a hyper-cone with its apex at 0, and it is a convex set.
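
The cone and convexity claims above can be checked numerically on a small example. In the sketch below, the AND cases (with the constant 1 prepended so the bias is an ordinary weight) and the two weight vectors `w_a` and `w_b` are hypothetical values picked for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def is_feasible(w, cases):
    """True if w.x >= 0 for every positive case and w.x < 0 for every negative one."""
    return all((dot(w, x) >= 0) == (t == 1) for x, t in cases)


# AND, with the constant 1 prepended to each input vector.
cases = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]

w_a = [-3.0, 2.0, 2.0]
w_b = [-5.0, 3.0, 3.0]

# Cone: scaling a feasible vector by a positive constant keeps it feasible.
scaled = [2.5 * wi for wi in w_a]
# Convexity: a convex combination of two feasible vectors is feasible.
mixed = [0.3 * a + 0.7 * b for a, b in zip(w_a, w_b)]
```

Calling `is_feasible` on `w_a`, `w_b`, `scaled` and `mixed` returns `True` in each case, while a vector such as `[1.0, 1.0, 1.0]` fails because it puts the negative cases on the wrong side of their hyperplanes.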


**Why the algorithm works**

- Consider the squared distance between any feasible weight vector $\tilde w$ and the current weight vector $w$: $\Vert w - \tilde  w\Vert^2$. A feasible weight vector is one that solves the problem, i.e. gets every training case right.

- We decompose this squared distance into the squared distance along the hyperplane of the misclassified training case ($d_b^2$) and the squared distance perpendicular to that hyperplane ($d_a^2$).
<img src="F4.png" width=400></img>

- We would like every update on a misclassified training case to move the weight vector closer to every feasible weight vector. This does not hold in general: a feasible vector that lies very close to a constraint plane can end up farther away after the update.

- So we consider "generously feasible" weight vectors: those that lie within the feasible region by a margin at least as great as the length of the input vector that defines each constraint plane.

- This time, the idea works. Every time the perceptron makes a mistake, the squared distance to every generously feasible weight vector decreases by at least the squared length of the update vector.
<img src="F5.png" width=400></img>

**Informal sketch of proof**

- Each time the perceptron makes a mistake, the current weight vector moves to decrease its squared distance from every weight vector in the "generously feasible" region.
- The squared distance decreases by at least the squared length of the input vector.
- So after a finite number of mistakes, the weight vector must lie in the feasible region if this region exists.
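
The key step of the sketch can be written out as follows, under two stated assumptions: the mistake is on a positive case $x$ (so $w \cdot x < 0$ and the update is $w' = w + x$), and "generously feasible" is read as $\tilde w \cdot x \geq \Vert x \Vert^2$, i.e. the distance from $\tilde w$ to the constraint plane, $\tilde w \cdot x / \Vert x \Vert$, is at least $\Vert x \Vert$.

$$
\begin{aligned}
\Vert w' - \tilde w\Vert^2 &= \Vert w - \tilde w\Vert^2 + 2\,(w - \tilde w)\cdot x + \Vert x\Vert^2 \\
&< \Vert w - \tilde w\Vert^2 - 2\Vert x\Vert^2 + \Vert x\Vert^2 \\
&= \Vert w - \tilde w\Vert^2 - \Vert x\Vert^2
\end{aligned}
$$

The inequality uses $w \cdot x < 0$ and $\tilde w \cdot x \geq \Vert x \Vert^2$. A mistake on a negative case is handled the same way with $x$ replaced by $-x$. Since the squared distance cannot go below zero and drops by at least $\Vert x \Vert^2$ per mistake, only finitely many mistakes are possible.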

**What perceptrons can't do**

- XOR problem
- Discriminating between patterns that have the same number of on pixels, when the pixels are used as the features and the patterns are allowed to translate with wrap-around.
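
For the XOR case, a short derivation shows why no weights exist. It assumes two binary inputs, the decision rule above ($y = 1$ when $w_0 + w_1 x_1 + w_2 x_2 \geq 0$), and XOR targets: class 1 for $(0,1)$ and $(1,0)$, class 0 for $(0,0)$ and $(1,1)$. The four training cases require

$$
\begin{aligned}
(0,1)&: \; w_0 + w_2 \geq 0, & (1,0)&: \; w_0 + w_1 \geq 0, \\
(0,0)&: \; w_0 < 0, & (1,1)&: \; w_0 + w_1 + w_2 < 0.
\end{aligned}
$$

Adding the two inequalities in the first row gives $2 w_0 + w_1 + w_2 \geq 0$, while adding the two in the second row gives $2 w_0 + w_1 + w_2 < 0$. Both cannot hold, so the feasible region is empty and the convergence guarantee does not apply.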