# Hello, _nbpresent_!

In [1]:
import nbpresent
nbpresent.__version__

'3.0.2'

# Deep Learning in Action

> In the early days of artificial intelligence, the field rapidly tackled and solved
problems that are intellectually difficult for human beings but relatively straightforward
for computers—problems that can be described by a list of formal, mathematical
rules. The true challenge to artificial intelligence proved to be solving
the tasks that are easy for people to perform but hard for people to describe
formally—problems that we solve intuitively, that feel automatic, like recognizing
spoken words or faces in images.

Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)

## Easy for us. Difficult for computers

- object recognition
- speech recognition


## Representations matter

<figure>
    <img src='coords.png' alt='missing' />
    <figcaption>Source: Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)</figcaption>
</figure>


## Just feed the network the right features?

- What are the correct pixel values for a "bike" feature?
  - race bike, mountain bike, e-bike?
  - pixels in the shadow may be _much_ darker
  - what if bike is mostly obscured by rider standing in front?

## Let the network pick the features

### a layer at a time

<figure>
    <img src='features.png' alt='missing' />
    <figcaption>Source: Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)</figcaption>
</figure>

# Deep Learning, 2 ways to think about it

- hierarchical feature extraction (start simple, end complex)
- function composition (see http://colah.github.io/posts/2015-09-NN-Types-FP/)

# A Short History of (Deep) Learning 

## The first wave: cybernetics (1940s - 1960s)

- neuroscientific motivation
- linear models

## McCulloch-Pitts Neuron (MCP, 1943, a.k.a. Logic Circuit)

- binary output (0 or 1)
- neurons may have inhibiting (negative) and excitatory (positive) inputs
- each neuron has a threshold that the sum of its input activations must surpass for the neuron to become active (output 1)
- if just one input is inhibitory, the neuron will not activate
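
The rules above can be sketched in a few lines of Python. This is only an illustration: the function name `mcp_neuron`, the gate helpers, and the threshold values are made up for this example, and the code treats *reaching* the threshold as sufficient (`>=`), which is the usual textbook convention.

```python
def mcp_neuron(excitatory, inhibitory, threshold):
    """Return 1 if the neuron fires, else 0. All inputs are 0 or 1."""
    # A single active inhibitory input vetoes the neuron.
    if any(inhibitory):
        return 0
    # Otherwise the neuron fires once enough excitatory inputs are active.
    # Assumption: reaching the threshold counts as firing.
    return 1 if sum(excitatory) >= threshold else 0


# Logic gates as special cases (thresholds chosen for illustration).
def and_gate(a, b):
    return mcp_neuron([a, b], [], threshold=2)


def or_gate(a, b):
    return mcp_neuron([a, b], [], threshold=1)


def not_gate(a):
    return mcp_neuron([], [a], threshold=0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b))
```

Because the weights are fixed rather than learned, each gate has to be designed by hand through its threshold and its choice of inhibitory inputs.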

<figure>
    <img src='mcp.png' alt='missing' />
    <figcaption>Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf</figcaption>
</figure>

## Perceptron (Rosenblatt, 1958): the great hope

- compute linear combination of inputs
- return one of two classes, depending on whether the result exceeds a threshold

<figure>
    <img src='perceptron.png' alt='missing' />
    <figcaption>Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf</figcaption>
</figure>
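
A minimal perceptron sketch, under stated assumptions: outputs are 0/1, the threshold sits at 0 and is absorbed into a bias term, and training uses the classic error-driven update rule. The function names and the AND toy data are made up for this example.

```python
def predict(weights, bias, x):
    # Linear combination of the inputs, followed by a hard threshold at 0.
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in samples:
            # error is -1, 0 or +1; weights only change on a mistake.
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# AND is linearly separable, so the update rule converges.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
predictions = [predict(weights, bias, x) for x, _ in and_data]
print(predictions)  # [0, 0, 0, 1]
```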

## Minsky & Papert (1969), "Perceptrons": the great disappointment

- Perceptrons can only solve linearly separable problems
- Big loss of interest in neural networks



## The second wave: Connectionism (1980s - mid-1990s)

- distributed representations
- popularization of backpropagation (Rumelhart et al., 1986a; LeCun, 1987)


## The magic ingredient: backpropagation

- Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550
- Paul Werbos (1974)
- David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1986)

## Backprop: How could the magic fail?

- Only applicable in case of supervised learning
- Doesn't scale well to multiple layers
- Can converge to poor local minima

## Backprop: the return

- much increased computing power because of GPUs

# The third wave: Deep Learning

- everything starts with Hinton 2006

## The algorithms en vogue now have mostly been around since the 1980s/1990s

- convolutional neural networks: 
- recurrent neural networks:
- LSTM

### So why the hype / success NOW?

# Big data

> It is true
that some skill is required to get good performance from a deep learning algorithm.
Fortunately, the amount of skill required reduces as the amount of training data
increases. The learning algorithms reaching human performance on complex tasks
today are nearly identical to the learning algorithms that struggled to solve toy
problems in the 1980s, though the models we train with these algorithms have
undergone changes that simplify the training of very deep architectures.

Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)

## Rule of thumb

> As of 2016, a rough rule of thumb
is that a supervised deep learning algorithm will generally achieve acceptable
performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10
million labeled examples.

Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)

## Big models

thanks to faster/better
- hardware (CPUs, GPUs)
- network infrastructure
- software implementations

> Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years.

Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)

# Big impact

- deep networks consistently win prestigious competitions (e.g., ImageNet)
- deep learning solves increasingly complex problems (e.g., sequence-to-sequence learning)
- deep learning has started to fuel _other research areas_ 

and most importantly: Deep learning is _highly profitable_

> Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.

Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)

# Deep Learning Architectures

## Feedforward Deep Neural Network


<figure>
    <img src='deep_nn.png' alt='missing' />
    <figcaption>Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf</figcaption>
</figure>

# Multi-layer Perceptron (MLP)

a.k.a. [deep] feedforward neural network


## Learning XOR

We want to predict
- 0 from [0,0]
- 0 from [1,1]
- 1 from [0,1]
- 1 from [1,0]


<figure>
    <img src='xor1.png' alt='missing' width=70% />
    <figcaption></figcaption>
</figure>

## Trying a linear model 

$f(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b$

(x = vector)

- with MSE cost, this leads to: $\mathbf{w}=0, b=0.5$
- mapping every point to 0.5!
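
The result can be checked numerically. The sketch below assumes NumPy is available and solves the least-squares problem directly (the minimizer of the MSE cost) instead of running gradient descent; the variable names are illustrative.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so that the bias b is fitted as an extra weight.
X_aug = np.hstack([X, np.ones((4, 1))])

# The least-squares solution minimizes the mean squared error.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:2], theta[2]

predictions = X @ w + b
print(w, b)          # w is (numerically) zero, b is 0.5
print(predictions)   # every input is mapped to 0.5
```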

## Introduce hidden layer with nonlinear activation function

$f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max\{0, \mathbf{W}^T\mathbf{x} + \mathbf{c}\} + b$

(x = vector)

<figure>
    <img src='xor4.png' alt='missing' />
    <figcaption>Source: Goodfellow et al. 2016, [Deep Learning](http://www.deeplearningbook.org/)</figcaption>
</figure>

## Calculation with hidden layer


- Design matrix: $\mathbf{X} = \begin{bmatrix}0 & 0 \\0 & 1\\1 & 0 \\1 & 1\end{bmatrix}$

- Parameters: $\mathbf{W} = \begin{bmatrix}1 & 1 \\1 & 1\end{bmatrix}$, $\mathbf{c} = \begin{bmatrix}0 \\ -1 \end{bmatrix}$, $\mathbf{w} = \begin{bmatrix}1 \\ -2 \end{bmatrix}$, $b = 0$

- Input to hidden layer: $\mathbf{X}\mathbf{W} = \begin{bmatrix}0 & 0 \\1 & 1\\1 & 1 \\2 & 2\end{bmatrix}$, add $\mathbf{c}$ to every row ==> $\begin{bmatrix}0 & -1 \\1 & 0\\1 & 0 \\2 & 1\end{bmatrix}$


## Which gives us...

<figure>
    <img src='xor2.png' alt='missing' width=70% />
    <figcaption></figcaption>
</figure>


## Introducing nonlinearity

<figure>
    <img src='xor3.png' alt='missing' width=70% />
    <figcaption></figcaption>
</figure>

voilà!
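
The whole forward pass can be replayed in NumPy using the matrices from the calculation slide. The output bias is taken to be $b = 0$, as in Goodfellow et al.'s version of this example; treat that as an assumption if you use other parameters.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # design matrix
W = np.array([[1, 1], [1, 1]])                   # hidden-layer weights
c = np.array([0, -1])                            # hidden-layer bias
w = np.array([1, -2])                            # output weights
b = 0                                            # output bias (assumed to be 0)

hidden_input = X @ W + c                 # c is broadcast to every row
hidden = np.maximum(0, hidden_input)     # rectified linear activation
output = hidden @ w + b

print(hidden_input)
print(hidden)
print(output)  # [0 1 1 0]
```

After the rectifier, the points [0,1] and [1,0] collapse onto the same hidden representation, which is what makes the classes linearly separable for the output layer.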

## How to learn: Gradient Descent

## How to learn: backprop

## Feedforward networks: Lessons learned

- use the right loss function (cross entropy instead of mean squared error)
- use the right activation function (rectified linear instead of sigmoid)
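
One common argument for both choices is about gradient size. The sketch below is an illustration under assumptions, not a proof: it assumes a single sigmoid output unit with pre-activation `z` and target `y`, and all function names are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_mse(z, y):
    # d/dz of (sigmoid(z) - y)^2
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


def grad_cross_entropy(z, y):
    # d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z)
    return sigmoid(z) - y


def grad_sigmoid(z):
    p = sigmoid(z)
    return p * (1.0 - p)


def grad_relu(z):
    return 1.0 if z > 0 else 0.0


# A confidently wrong prediction: target 1, but z is very negative.
z, y = -10.0, 1.0
print(grad_mse(z, y))            # tiny: the unit is saturated and barely learns
print(grad_cross_entropy(z, y))  # close to -1: a strong correction signal

# Activation derivatives for a large positive input.
print(grad_sigmoid(10.0))        # tiny
print(grad_relu(10.0))           # 1.0
```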

## Convolutional Neural Networks

<figure>
    <img src='convnet.jpeg' alt='missing' />
    <figcaption>Source: http://cs231n.github.io/convolutional-networks/</figcaption>
</figure>

## Why ConvNets?

## The Convolution Operation

<figure>
    <img src='convolution_demo.png' alt='missing' />
    <figcaption>Source: http://cs231n.github.io/convolutional-networks/ (Live Demo on website!)</figcaption>
</figure>

## Convolution - the math

Notice that if all neurons in a single depth slice use the same weight vector, then the forward pass of the CONV layer can be computed, in each depth slice, as a convolution of the neurons' weights with the input volume.
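
A minimal sketch of this weight sharing, simplified to a 2-D input with a single channel, a square kernel, and no padding. The function name `conv2d_single_slice` and the toy data are made up. As in most deep learning code, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np


def conv2d_single_slice(x, kernel, bias=0.0, stride=1):
    """Slide one shared square kernel over a 2-D input (no padding)."""
    f = kernel.shape[0]
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = x[r:r + f, c:c + f]
            # Every output neuron uses the same weights on its own patch.
            out[i, j] = np.sum(patch * kernel) + bias
    return out


x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
out = conv2d_single_slice(x, kernel)
print(out)  # a 3x3 array filled with -5.0
```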

## Convolution hyperparameters

- number of filters $K$
- their spatial extent $F$
- the stride $S$
- the amount of zero padding $P$
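
Together with the input size, these hyperparameters fix the size of the output volume. The sketch below uses the formula $(W - F + 2P)/S + 1$ from the cs231n notes cited above and assumes square inputs and filters; the input width `w` and both function names are introduced here for illustration.

```python
def conv_output_size(w, f, s, p):
    """Spatial output size for input width w, filter size f, stride s, padding p."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("filter does not tile the padded input evenly with this stride")
    return span // s + 1


def conv_output_shape(w, k, f, s, p):
    """Output volume (height, width, depth) for k filters on a square input."""
    n = conv_output_size(w, f, s, p)
    return (n, n, k)


print(conv_output_shape(w=227, k=96, f=11, s=4, p=0))  # (55, 55, 96)
print(conv_output_size(w=32, f=5, s=1, p=2))           # 32
```

The second call shows how a padding of $P = (F - 1)/2$ with stride 1 preserves the spatial size.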

# Language Modeling

- predict next word given preceding ones
- based on statistical properties of the distribution of sequences of words 

## Distributional hypothesis: linguistic items with similar distributions have similar meanings

- n-gram/count-based (e.g., LSA)
- predictive (neural network language models, e.g., word2vec)

## n-gram-based

- choose n-gram size n
- estimate the probability $P(w_{t+1} \mid w_1, \ldots, w_{t-2}, w_{t-1}, w_t)$ by ignoring context beyond the last $n-1$ words: divide the count of the full n-gram by the count of its $(n-1)$-word context
- e.g., with bigrams: $P(w_{t+1} \mid w_t) = \frac{\mathrm{count}(w_t, w_{t+1})}{\mathrm{count}(w_t)}$
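
A minimal sketch of the bigram estimate using only the standard library. The toy corpus and the function name are made up, and there is no smoothing, so unseen bigrams simply have no entry.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Maximum-likelihood estimates of P(next | prev) from a token list."""
    # Count each word only where it has a successor, so that the
    # probabilities for a given context sum to 1.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (prev, nxt): count / context_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }


tokens = "the cat sat on the mat the cat ran".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "cat")])  # 2 of the 3 occurrences of "the" are followed by "cat"
print(probs[("cat", "sat")])  # 0.5
```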

## neural network example (Bengio et al 2001, Bengio et al 2003)

- choose a context size n, as in ngrams
- map each word $w_{t-i}$ in the $(n-1)$-word context to an associated $d$-dimensional feature vector $C_{w_{t-i}}$
- predict the next word using a standard NN architecture with tanh (hidden layer) and softmax (output layer) activation functions
- train the network to maximize the log-likelihood $L(\theta) = \sum_t \log P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$ using stochastic gradient descent
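
The forward pass of such a model can be sketched in NumPy. This is a simplified illustration, not the published architecture in full: biases and any direct input-to-output connections are left out, the weights are random and untrained, and the names `H`, `U`, the toy vocabulary, and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {word: i for i, word in enumerate(vocab)}

n, d, hidden_units = 3, 4, 8   # context size n, feature dimension d (illustrative)
V = len(vocab)

C = rng.normal(scale=0.1, size=(V, d))                        # one feature vector per word
H = rng.normal(scale=0.1, size=((n - 1) * d, hidden_units))   # input -> hidden
U = rng.normal(scale=0.1, size=(hidden_units, V))             # hidden -> output


def next_word_distribution(context):
    """P(w_t | previous n-1 words) over the whole vocabulary."""
    ids = [word_to_id[word] for word in context]
    x = C[ids].reshape(-1)                 # concatenate the n-1 feature vectors
    hidden = np.tanh(x @ H)                # tanh hidden layer
    logits = hidden @ U
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()


probs = next_word_distribution(["the", "cat"])
# One term of the log-likelihood L(theta): log P("sat" | "the", "cat").
log_likelihood_term = np.log(probs[word_to_id["sat"]])
print(probs, log_likelihood_term)
```

Training would sum such terms over the corpus and follow the gradient with respect to `C`, `H`, and `U`.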

## word2vec

In [2]:
# cbow

In [3]:
# skip-gram
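
The two word2vec variants differ in what is predicted from what: CBOW predicts the center word from its surrounding context, while skip-gram predicts each context word from the center word. The sketch below only builds the training pairs and trains nothing; the function name and the window size are made up for illustration.

```python
def training_pairs(tokens, window=2):
    """Build (input, target) pairs for CBOW and skip-gram."""
    cbow, skip_gram = [], []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: all context words together predict the center word.
        cbow.append((context, center))
        # Skip-gram: the center word predicts each context word separately.
        skip_gram.extend((center, word) for word in context)
    return cbow, skip_gram


tokens = "the quick brown fox".split()
cbow, skip_gram = training_pairs(tokens, window=1)
print(cbow)
print(skip_gram)
```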