## 4a. DAG networks, autograd, convolution layers

In [139]:
"""
    Initialization
"""


import torch
from torch import Tensor
from torch.autograd import Variable
from torch.nn import Parameter

### 1. DAG Networks
- **Writing a large neural network from scratch is complex and error-prone**
- Frameworks that automate it: PyTorch, Caffe2, TensorFlow, MXNet, CNTK, Torch, Theano, Caffe

### 2. Autograd
- Gradients are constructed automatically from the forward computation
- **The specification of the graph (DAG) looks a lot like the forward pass, and the operations of the forward pass define the backward pass**
- Benefits of autograd:
    1. Simpler syntax: only the forward pass has to be written; the backward pass is constructed automatically
    2. Greater flexibility: since the graph is not static, the forward pass can be dynamically modulated
- To use autograd, use `torch.autograd.Variable` instead of `torch.Tensor` (from PyTorch 0.4 on, `Variable` and `Tensor` are merged, so a `Tensor` created with `requires_grad=True` works the same way)
- `Variable`
> - `data` : `Tensor`
> - `grad` : `Variable`
> - `requires_grad` : `Boolean`
- `Parameter` is a `Variable` with `requires_grad` set to `True` by default
- Usage:
    1. `torch.autograd.grad(outputs, inputs)`
        - Pass `create_graph=True` to also build the graph of the gradient computation itself, which is needed for **higher-order derivatives**
    2. `torch.autograd.backward(variables)` or `Variable.backward()`

- **Example: (Partial Derivative)**
    - $(x_{1}, x_{2}, x_{3}) = (1, 2, 2)$
    - $l = \mathrm{norm}(x) = \| x \| = \sqrt{x_{1}^{2} + x_{2}^{2} + x_{3}^{2}} = 3$  
    $\rightarrow \frac{\partial l}{\partial x_{i}} = \frac{x_{i}}{\| x \|}$, so here $\nabla l = \left(\frac{1}{3}, \frac{2}{3}, \frac{2}{3}\right)$

In [140]:
"""
    Example of Autograd using `torch.autograd.grad()`
"""
x = Variable(Tensor([1, 2, 2]), requires_grad = True)
l = x.norm()
print(l)

g = torch.autograd.grad(l, x)
print(g)

tensor(3.)
(tensor([ 0.3333,  0.6667,  0.6667]),)
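
The `create_graph=True` option of `torch.autograd.grad` can be illustrated with a second derivative. This is a minimal sketch, assuming PyTorch 0.4 or later (where a plain tensor created with `requires_grad=True` plays the role of `Variable`); the function $f(x) = x^3$ is chosen only for illustration and is not part of the original notes.

```python
import torch

# f(x) = x^3, so f'(x) = 3x^2 and f''(x) = 6x
x = torch.tensor(3.0, requires_grad=True)
f = x ** 3

# create_graph=True records the gradient computation itself,
# so the first derivative can be differentiated again
(df,) = torch.autograd.grad(f, x, create_graph=True)
(d2f,) = torch.autograd.grad(df, x)

print(df.item())   # 27.0  (3 * 3^2)
print(d2f.item())  # 18.0  (6 * 3)
```

Without `create_graph=True`, `df` would carry no graph and the second call to `torch.autograd.grad` would raise an error.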


In [141]:
"""
    Example of Autograd using `torch.autograd.backward()`
"""
x = Variable(Tensor([1, 2, 2]), requires_grad = True)
l = x.norm()
print(l)

l.backward()
print(x.grad)

tensor(3.)
tensor([ 0.3333,  0.6667,  0.6667])
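
The flexibility benefit listed earlier (the graph is rebuilt at every forward pass) can be sketched with ordinary Python control flow. The function `forward` and its stopping rule are hypothetical, chosen only to show that the number of operations may depend on the data; PyTorch 0.4 or later is assumed.

```python
import torch

def forward(x, w, max_steps=20):
    # Plain Python loop: how many times w is applied depends on the data,
    # so the recorded graph can differ from one call to the next
    h = x
    steps = 0
    while h.norm() < 10 and steps < max_steps:
        h = w * h
        steps += 1
    return h.sum(), steps

x = torch.tensor([1.0, 2.0, 2.0])          # norm 3
w = torch.tensor(2.0, requires_grad=True)

out, steps = forward(x, w)
out.backward()

print(steps)          # 2: norms 3 -> 6 -> 12
print(out.item())     # 20.0 = w^2 * sum(x)
print(w.grad.item())  # 20.0 = 2 * w * sum(x)
```

Autograd differentiates exactly the operations that were executed, here two multiplications by `w`.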


- **Example: run forward/backward pass**
<img width=60% src="images/4a-1.png">
Architecture:  
    - $\phi^{(1)}(x^{(0)}; w^{(1)}) = w^{(1)}x^{(0)}$  
    - $\phi^{(2)}(x^{(0)}, x^{(1)}; w^{(2)}) = x^{(0)} + w^{(2)}x^{(1)}$  
    - $\phi^{(3)}(x^{(1)}, x^{(2)}; w^{(1)}) = w^{(1)}(x^{(1)} + x^{(2)})$

In [142]:
"""
    Example of running forward/backward pass
    Note: `mv` on a Parameter/Variable returns a result that is already part
    of the graph. Re-wrapping it in `Variable(...)` would detach it, and the
    gradient would no longer reach w1, w2 and x
"""
w1 = Parameter(Tensor(5, 5).normal_())
w2 = Parameter(Tensor(5, 5).normal_())
x = Variable(Tensor(5).normal_(), requires_grad = True)

x0 = x
x1 = w1.mv(x0)                # phi1
x2 = x0 + w2.mv(x1)           # phi2
x3 = w1.mv(x1 + x2)           # phi3, reuses w1
x3.retain_grad()              # x3 is not a leaf: keep its gradient for printing

q = x3.norm()
q.backward()

print(q)
print(x3)
print(x3.grad)

tensor(11.6648)
tensor([ 2.8595,  5.9130,  5.7412, -7.1607,  2.9478])
tensor([ 0.2451,  0.5069,  0.4922, -0.6139,  0.2527])


### 3. Weight sharing
- In the example above, both $\phi^{(1)}$ and $\phi^{(3)}$ use the same weight $w^{(1)}$. This is called **weight sharing**
- It allows building **siamese networks**, where the same sub-network (with the same weights) processes several inputs
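
With a shared weight, the gradient computed by autograd is the sum of the contributions from each place the weight is used. The sketch below checks this on the three-node architecture above by comparing against a version where the two uses of $w^{(1)}$ are untied into separate copies. The names `a` and `b` are hypothetical, and PyTorch 0.4 or later is assumed.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(5)
w1 = torch.randn(5, 5, requires_grad=True)
w2 = torch.randn(5, 5, requires_grad=True)

# Shared weights: w1 appears in both phi1 and phi3
x1 = w1.mv(x0)
x2 = x0 + w2.mv(x1)
x3 = w1.mv(x1 + x2)
x3.norm().backward()

# Same computation, but with two independent copies of w1
a = w1.detach().clone().requires_grad_()  # used by phi1
b = w1.detach().clone().requires_grad_()  # used by phi3
y1 = a.mv(x0)
y2 = x0 + w2.detach().mv(y1)
y3 = b.mv(y1 + y2)
y3.norm().backward()

# The gradient of the shared weight is the sum over its uses
same = torch.allclose(w1.grad, a.grad + b.grad, atol=1e-5)
print(same)  # True
```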

### 4. Convolutional layers (Stride = 1)
- **A representation meaningful at a certain location can/should be used everywhere**
- Main idea:
<img width=60% src="images/4a-2.png">
- Usages:
    1. Differential operator:
        <img width=60% src="images/4a-3.png">
    2. Template matcher:
        <img width=60% src="images/4a-4.png">
- Higher dimensions:
> - `C` : number of channels
---
<img width=60% src="images/4a-5.png">  
<img width=60% src="images/4a-6.png">
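
The two usages above and the multi-channel case can be tried with `torch.nn.functional`. This is a minimal sketch with hand-picked kernels and a toy signal that are assumptions for illustration, not the ones shown in the figures. With stride 1 and no padding, an input of width $W$ and a kernel of width $k$ give an output of width $W - k + 1$.

```python
import torch
import torch.nn.functional as F

# 1D signal shaped (batch, channels, width), as conv1d expects
x = torch.tensor([[[0., 0., 1., 1., 1., 0., 0.]]])

# Differential operator: kernel (-1, 1) responds where the signal changes
diff_kernel = torch.tensor([[[-1., 1.]]])
edges = F.conv1d(x, diff_kernel)
print(edges)  # tensor([[[ 0.,  1.,  0.,  0., -1.,  0.]]])

# Template matcher: the response peaks where the signal matches the kernel
template = torch.tensor([[[1., 1., 1.]]])
match = F.conv1d(x, template)
print(match)  # tensor([[[1., 2., 3., 2., 1.]]])

# Higher dimensions: 3 input channels, 4 output channels, 5x5 kernels
img = torch.randn(1, 3, 8, 8)
weight = torch.randn(4, 3, 5, 5)
y = F.conv2d(img, weight)
print(y.shape)  # torch.Size([1, 4, 4, 4]) since 8 - 5 + 1 = 4
```

Note that `conv1d` and `conv2d` compute a cross-correlation (the kernel is not flipped), which is why the template matcher peaks exactly where the pattern sits.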

### 5. Pooling (Down-scaling, Stride = Kernel_Size)
- `max-pooling`: compute the maximum value of each block
    <img width=60% src="images/4a-7.png">
- `average-pooling`: compute the average value of each block
- Higher dimensions:
> - `C` : number of channels
---
<img width=60% src="images/4a-8.png">  
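
Both pooling variants can be checked on a small input. This sketch assumes a 4x4 toy image and 2x2 blocks, which are illustrative choices; in `torch.nn.functional`, the stride defaults to the kernel size, matching the setting in the heading.

```python
import torch
import torch.nn.functional as F

# One image, one channel, 4x4 pixels with values 0..15 in row-major order
x = torch.arange(16.).reshape(1, 1, 4, 4)

mx = F.max_pool2d(x, kernel_size=2)  # stride defaults to kernel_size
av = F.avg_pool2d(x, kernel_size=2)

print(mx)  # tensor([[[[ 5.,  7.], [13., 15.]]]])
print(av)  # tensor([[[[ 2.5000,  4.5000], [10.5000, 12.5000]]]])

# With C channels, pooling is applied to each channel independently
xc = torch.randn(1, 3, 4, 4)
print(F.max_pool2d(xc, kernel_size=2).shape)  # torch.Size([1, 3, 2, 2])
```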
