# Learning PyTorch with Examples

**Author:** Justin Johnson  
    
This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.  
    
At its core, PyTorch provides two main features:  
    
- An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks
    
We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will  
be trained with gradient descent to fit random data by minimizing the squared Euclidean distance between the network output and  
the true output.  

# Tensors

## Warm-up: numpy
    
Before introducing PyTorch, we will first implement the network using numpy.  
    
Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic   
framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients.   
However we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and   
backward passes through the network using numpy operations:  

Implementing a neural network in numpy means deriving and coding the entire backpropagation pass by hand.
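The backward pass in the numpy code below is the chain rule applied to each step of the forward pass. As a sketch, writing $W_1$, $W_2$ for `w1`, `w2`, $\hat{y}$ for `y_pred`, and $\odot$ for the element-wise product (notation introduced here for illustration, not taken from the original):

$$
\begin{aligned}
h &= x W_1, \qquad h_{\text{relu}} = \max(h, 0), \qquad \hat{y} = h_{\text{relu}} W_2, \qquad L = \sum_{i,j} (\hat{y}_{ij} - y_{ij})^2 \\
\frac{\partial L}{\partial \hat{y}} &= 2(\hat{y} - y) \\
\frac{\partial L}{\partial W_2} &= h_{\text{relu}}^{\top} \, \frac{\partial L}{\partial \hat{y}} \\
\frac{\partial L}{\partial h_{\text{relu}}} &= \frac{\partial L}{\partial \hat{y}} \, W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial h_{\text{relu}}} \odot \mathbb{1}[h \ge 0] \\
\frac{\partial L}{\partial W_1} &= x^{\top} \, \frac{\partial L}{\partial h}
\end{aligned}
$$

The five gradient lines correspond, in order, to `grad_y_pred`, `grad_w2`, `grad_h_relu`, `grad_h`, and `grad_w1`. The indicator uses $\ge$ to match the code, which zeroes the gradient only where `h < 0`.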

In [1]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 32961279.135018658
1 30914339.51684713
2 32662970.29168381
3 32240412.392855972
4 26620882.448648505
5 17490891.300347164
6 9596446.996476237
7 4884561.624152671
8 2627121.4491464756
9 1602930.1189009012
10 1113880.3353748447
11 849203.4597342282
12 683831.6241760388
13 567626.8116254846
14 479535.641137847
15 409373.1146758629
16 352116.7068102377
17 304659.3666679672
18 264872.2560685585
19 231225.97084823716
20 202606.87063343954
21 178147.4537628517
22 157152.71154935574
23 139010.35937124724
24 123303.56639149974
25 109608.97633162275
26 97653.95910868404
27 87184.08072742059
28 77982.04117152696
29 69873.16866214982
30 62713.38241587822
31 56371.79262216594
32 50751.294732700364
33 45760.24724192398
34 41312.07776734623
35 37341.00836158061
36 33789.8111986051
37 30610.825336117865
38 27757.461547979707
39 25193.269111158505
40 22885.370252244735
41 20807.896688134962
42 18935.039982474264
43 17243.297711933268
44 15714.439220433049
45 14331.733437674862
46 13080.541684111591
4

413 1.5832053918759046e-06
414 1.5039752650185448e-06
415 1.4287746507084363e-06
416 1.3573922811491318e-06
417 1.2896250395967108e-06
418 1.2252954727668922e-06
419 1.1642272299241968e-06
420 1.1062452604034163e-06
421 1.0511949417866742e-06
422 9.989301026435214e-07
423 9.492997788967473e-07
424 9.02171550979012e-07
425 8.574145228827746e-07
426 8.149143849527874e-07
427 7.745516160057761e-07
428 7.362158137452974e-07
429 6.998100860030828e-07
430 6.652300845488988e-07
431 6.323844409556842e-07
432 6.011832534223755e-07
433 5.715453617773559e-07
434 5.433886959238016e-07
435 5.16639396663349e-07
436 4.91228103603147e-07
437 4.670825933880496e-07
438 4.4414067864343934e-07
439 4.223442083361367e-07
440 4.016348302813607e-07
441 3.8195337916070754e-07
442 3.6325090341400564e-07
443 3.454770565186155e-07
444 3.285844183646509e-07
445 3.1252940659422985e-07
446 2.972714437660141e-07
447 2.827675658729933e-07
448 2.6898199127681237e-07
449 2.5587951844557415e-07
450 2.4342374961016806e-07

# PyTorch: Tensors
  
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural   
networks, GPUs often provide speedups of **50x or greater**, so unfortunately numpy won’t be enough for modern deep   
learning.  
    
Here we introduce the most fundamental PyTorch concept: the **Tensor**. A PyTorch Tensor is conceptually identical to a   
numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors.   
Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic   
tool for scientific computing.  
    
Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on  
GPU, you simply need to create it on, or move it to, the appropriate device.  
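As a minimal sketch of the two points above (Tensors behave like numpy arrays, and they can live on a GPU), the snippet below converts a numpy array to a Tensor and places it on a device. In current PyTorch releases placement is controlled with a `torch.device`. The fallback to the CPU and the variable names are assumptions made for this illustration.

```python
import numpy as np
import torch

# Pick a GPU when one is visible to PyTorch, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# torch.from_numpy wraps the array without copying: on the CPU, the Tensor
# and the numpy array share the same memory.
a = np.ones(3, dtype=np.float32)
t = torch.from_numpy(a)
a[0] = 5.0
print(t)  # the change made through the numpy array is visible in the Tensor

# .to(device) returns a Tensor on the target device
# (a copy when the target device differs from the current one).
t_dev = t.to(device)
result = (t_dev * 2).sum()
print(result.item(), t_dev.device)
```

The same code runs unchanged on a machine with or without a GPU, which is why the examples in this tutorial pass `device=device` when creating Tensors.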
    
Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above we need to manually   
implement the forward and backward passes through the network:  

Using PyTorch Tensors:

- You can use GPUs.
- Tensors can keep track of a computational graph and gradients.
    
This example does not use `Autograd`, so the backward pass still has to be implemented by hand.

In [2]:
# -*- coding: utf-8 -*-

import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 24956294.0
1 18028082.0
2 15463613.0
3 14656433.0
4 14371805.0
5 13770768.0
6 12519631.0
7 10552262.0
8 8295674.0
9 6109200.5
10 4324451.0
11 2984467.0
12 2057443.0
13 1432476.875
14 1021166.25
15 748562.5
16 566606.4375
17 441996.6875
18 354707.03125
19 291666.125
20 244773.109375
21 208798.3125
22 180470.453125
23 157639.984375
24 138845.859375
25 123107.5
26 109732.96875
27 98241.5
28 88278.4609375
29 79578.6953125
30 71933.1953125
31 65189.6875
32 59206.58984375
33 53874.171875
34 49107.05078125
35 44837.21484375
36 41005.3828125
37 37554.4765625
38 34441.3359375
39 31627.783203125
40 29081.3515625
41 26771.49609375
42 24673.24609375
43 22765.6015625
44 21027.96484375
45 19442.5625
46 17994.978515625
47 16673.0
48 15463.5732421875
49 14355.4638671875
50 13338.1845703125
51 12404.107421875
52 11544.7353515625
53 10753.5791015625
54 10024.6318359375
55 9352.662109375
56 8732.70703125
57 8159.7822265625
58 7629.71875
59 7139.16796875
60 6684.83203125
61 6263.640625
62 5873.057617187

# Autograd

## PyTorch: Tensors and autograd
    
In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually   
implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large   
complex networks.  
    
Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The   
**autograd** package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your   
network  will define a **computational graph**; nodes in the graph will be Tensors, and edges will be functions that   
produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute   
gradients.  
    
This sounds complicated, but it is pretty simple to use in practice. Each Tensor represents a node in a computational graph.  
If `x` is a Tensor that has `x.requires_grad=True` then, after the backward pass, `x.grad` is another Tensor holding the gradient of  
some scalar value with respect to `x`.  
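A small sketch of this behaviour, assuming a toy scalar `s = sum(x ** 2)` chosen only for illustration: after `backward()`, `x.grad` holds `ds/dx = 2 * x`. It also shows that gradients accumulate across backward passes, which is why training loops zero them after each update.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The forward pass builds the graph; s is a scalar Tensor.
s = (x ** 2).sum()

# The backward pass fills x.grad with ds/dx = 2 * x.
s.backward()
first_grad = x.grad.clone()
print(first_grad)

# Gradients accumulate: a second backward pass adds to x.grad
# instead of overwriting it.
s2 = (x ** 2).sum()
s2.backward()
accumulated_grad = x.grad.clone()
print(accumulated_grad)

# Reset the accumulated gradient in place.
x.grad.zero_()
```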
    
Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement   
the backward pass through the network:  

## Autograd

- Building a neural network from scratch means implementing both the forward pass and the backward pass.
- That is easy for a small network but hard for a large, complex one.
- PyTorch has an *automatic differentiation engine*, the `Autograd` package.
- The `Autograd` package computes the backward pass automatically.
- To use `Autograd`, set the `requires_grad` option of a `torch.Tensor` to `True`.
- When `requires_grad = True`, the gradient for that Tensor is stored in its `.grad` attribute.

In [3]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
# Leaving requires_grad at its default of False indicates that we do not need
# to compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor holding a single value;
    # loss.item() returns that value as a Python number.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 27139842.0
1 21072698.0
2 17358676.0
3 14104916.0
4 10928735.0
5 8034668.5
6 5685836.0
7 3956639.25
8 2768000.25
9 1978532.875
10 1459844.125
11 1115101.625
12 880406.1875
13 715506.0625
14 595181.5625
15 503970.78125
16 432614.25
17 375283.21875
18 328190.21875
19 288906.625
20 255657.53125
21 227230.15625
22 202710.21875
23 181393.484375
24 162748.25
25 146350.1875
26 131879.640625
27 119069.84375
28 107693.2109375
29 97558.2890625
30 88516.3515625
31 80441.0234375
32 73190.8984375
33 66685.2109375
34 60828.140625
35 55546.2265625
36 50778.68359375
37 46472.39453125
38 42572.33984375
39 39033.44921875
40 35822.171875
41 32904.5078125
42 30250.06640625
43 27831.767578125
44 25625.51171875
45 23611.75
46 21773.3828125
47 20091.859375
48 18552.64453125
49 17141.986328125
50 15848.4189453125
51 14661.458984375
52 13571.5068359375
53 12570.48046875
54 11649.95703125
55 10802.7861328125
56 10022.71875
57 9303.669921875
58 8640.732421875
59 8029.2216796875
60 7464.6943359375
61 6943.28662

# PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The **forward**   
function computes output Tensors from input Tensors. The **backward** function receives the gradient of the output   
Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of `torch.autograd.Function` and  
implementing the `forward` and `backward` functions. We can then use our new autograd operator by calling its `apply` method like a function, passing Tensors containing input data.
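One way to gain confidence in a hand-written `backward` is to compare it against numerical gradients. The sketch below defines a hypothetical `Square` operator (not part of the original tutorial) and checks it with `torch.autograd.gradcheck`, which expects double-precision inputs that require gradients. The tolerance values are illustrative choices.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example operator computing x ** 2 element-wise."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; backward needs it to compute d(x^2)/dx = 2x.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: upstream gradient times the local derivative 2x.
        return grad_output * 2 * x

# gradcheck compares the analytical backward against finite differences.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Square.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

`gradcheck` raises an error describing the mismatch when the analytical and numerical gradients disagree, so a wrong sign or a missing factor in `backward` shows up immediately.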

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

## Defining new autograd functions

- An autograd operator consists of two functions that operate on Tensors (**forward** and **backward**).
- PyTorch makes it easy to define your own autograd operator.
- To do so, subclass `torch.autograd.Function` and implement the `forward` and `backward` functions.

<br/>

- For details, see the implementation of `torch.autograd.Function` in the PyTorch docs.
- `ctx` is short for **context**; it appears to refer to the `_ContextMethodMixin` class.
- In other words, **forward** stores values on `ctx`, and **backward** retrieves them.

<br/>

```python
class _ContextMethodMixin(object):

    def save_for_backward(self, *tensors):
        r"""Saves given tensors for a future call to :func:`~Function.backward`.

        **This should be called at most once, and only from inside the**
        :func:`forward` **method.**

        Later, saved tensors can be accessed through the :attr:`saved_tensors`
        attribute. Before returning them to the user, a check is made to ensure
        they weren't used in any in-place operation that modified their content.

        Arguments can also be ``None``.
        """
        self.to_save = tensors
        
    ...
        
```
<br/>
```python
class NestedIOFunction(Function):
    ...
    
    @property
    def saved_tensors(self):
        flat_tensors = super(NestedIOFunction, self).saved_tensors
        return _unflatten(flat_tensors, self._to_save_nested)
    
    ...
```


In [5]:
# -*- coding: utf-8 -*-
import torch

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
    
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use the Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply
    
    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass.
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 25707416.0
1 20607294.0
2 20175412.0
3 21407194.0
4 22018038.0
5 20464560.0
6 16450541.0
7 11466001.0
8 7143251.5
9 4220328.5
10 2495853.5
11 1551153.125
12 1036630.375
13 748254.375
14 575864.75
15 464597.0625
16 386898.53125
17 328834.59375
18 283220.71875
19 246150.0
20 215337.203125
21 189309.203125
22 167115.40625
23 148068.375
24 131624.078125
25 117367.984375
26 104954.171875
27 94095.46875
28 84620.5703125
29 76279.8515625
30 68909.578125
31 62378.93359375
32 56580.1875
33 51431.84375
34 46860.796875
35 42770.7578125
36 39100.671875
37 35802.5390625
38 32832.2421875
39 30147.740234375
40 27717.392578125
41 25518.2890625
42 23524.98046875
43 21712.615234375
44 20060.908203125
45 18554.1328125
46 17177.7734375
47 15918.3994140625
48 14765.1162109375
49 13706.64453125
50 12734.650390625
51 11841.5537109375
52 11018.927734375
53 10260.9794921875
54 9562.1259765625
55 8916.869140625
56 8320.7255859375
57 7769.60009765625
58 7259.5537109375
59 6787.1513671875
60 6348.90478515625
61

490 0.00041363819036632776
491 0.0004049352719448507
492 0.00039615685818716884
493 0.0003879879368469119
494 0.0003795434604398906
495 0.00037182075902819633
496 0.00036430617910809815
497 0.00035711179953068495
498 0.0003488950605969876
499 0.0003419207932893187


# TensorFlow: Static Graphs
    
PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic   
differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs  
are **static** and PyTorch uses **dynamic** computational graphs.
    
In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly   
feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.
    
Static graphs are nice because you can optimize the graph up front; for example a framework might decide to fuse some   
graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many   
machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be   
amortized as the same graph is rerun over and over.  
    
One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different   
computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps   
for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a   
part of the graph; for this reason TensorFlow provides operators such as `tf.scan` for embedding loops into the graph. With   
dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative   
flow control to perform computation that differs for each input.  
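To make the dynamic-graph point concrete, here is a minimal sketch in which the number of operations depends on a plain Python argument. The function `repeated_scale` and its values are hypothetical, chosen so the result is easy to check by hand: with `w = 2` and three steps, the output is `3 * w**3 = 24` and its derivative with respect to `w` is `9 * w**2 = 36`.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

def repeated_scale(x, n_steps):
    """Multiply x by w a caller-chosen number of times using a normal loop."""
    h = x
    for _ in range(n_steps):
        # Each iteration adds one multiplication node to the graph.
        h = h * w
    return h

# The graph is built on the fly, so n_steps can differ on every call.
out = repeated_scale(torch.tensor(3.0), n_steps=3)
out.backward()
print(out.item(), w.grad.item())
```

Autograd records whichever operations actually ran, so no special loop operator is needed.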
    
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:

## TensorFlow

- PyTorch and TensorFlow both have a computational graph and an automatic differentiation engine.
- The difference is that TensorFlow uses a **static** computational graph and PyTorch a **dynamic** one.
- A **static** graph can be optimized up front and is well suited to reuse.
- Control flow differs between the two: in an RNN, for example, the unrolling has to be expressed as a loop. With a **static** graph that loop is built into the graph with operators such as `tf.scan`, whereas with a **dynamic** graph an ordinary `for` loop is enough.

In [11]:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of 
# the computational graph; in PyTorch this happens outside the computational
# graph

learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2
    sess.run(tf.global_variables_initializer())
    
    # Create numpy arrays holding the actual data for the inputs x and targets
    # y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays
        loss_value, _, _ = sess.run([loss, new_w1, new_w2], feed_dict={x: x_value, y: y_value})
        print(loss_value)

32744972.0
28156976.0
30076872.0
33209836.0
32567224.0
26014596.0
16458524.0
8852785.0
4677963.0
2712606.0
1692502.0
1159787.9
865512.56
686700.1
566524.1
478770.72
410789.72
355815.47
310272.0
271972.47
239373.97
211435.62
187328.97
166469.95
148289.39
132413.06
118507.44
106283.836
95504.66
85975.94
77540.33
70045.56
63371.457
57418.86
52097.29
47329.414
43054.906
39212.832
35756.562
32640.479
29828.752
27287.22
24990.68
22909.129
21020.473
19304.54
17743.76
16322.562
15027.07
13845.809
12767.816
11782.615
10880.914
10055.206
9298.32
8604.84
7968.15
7383.291
6845.5146
6350.4814
5894.7617
5474.698
5087.4146
4730.1816
4400.454
4095.7588
3814.0356
3553.3828
3312.0518
3088.5566
2881.3953
2689.2827
2511.0547
2345.5994
2191.9397
2049.168
1916.4546
1793.0359
1678.1565
1571.2012
1471.5842
1378.7476
1292.2324
1211.5525
1136.2793
1066.012
1000.39905
939.1093
881.80756
828.2389
778.1276
731.2507
687.3773
646.2969
607.83026
571.7838
538.0106
506.35217
476.66855
448.8162
422.68994
398.16724
375.1

# PyTorch: nn
    
Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking   
derivatives; however for large neural networks raw autograd can be a bit too low-level.  
    
When building neural networks we frequently think of arranging the computation into layers, some of which have learnable   
parameters which will be optimized during learning.  
    
In TensorFlow, packages like **Keras**, **TensorFlow-Slim**, and **TFLearn** provide higher-level abstractions over raw computational   
graphs that are useful for building neural networks.  
    
In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of **Modules**, which are roughly   
equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold   
internal state such as Tensors containing learnable parameters. The `nn` package also defines a set of useful loss   
functions that are commonly used when training neural networks.  
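As a short sketch of both claims, the snippet below inspects the learnable parameters held by a single `Linear` Module and checks that `MSELoss(reduction='sum')` matches the hand-written sum of squared errors used in the earlier examples. The layer sizes are arbitrary values picked for illustration.

```python
import torch

# A Linear Module holds its weight and bias as learnable parameters.
layer = torch.nn.Linear(4, 3)
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(shapes)

x = torch.randn(5, 4)
y = torch.randn(5, 3)
y_pred = layer(x)

# With reduction='sum', MSELoss is the sum of squared differences.
loss_fn = torch.nn.MSELoss(reduction='sum')
loss_module = loss_fn(y_pred, y)
loss_manual = (y_pred - y).pow(2).sum()
print(torch.allclose(loss_module, loss_manual))
```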
    
In this example we use the `nn` package to implement our two-layer network:

## PyTorch: nn

- Just as **Keras**, **TensorFlow-Slim**, and **TFLearn** make TensorFlow easier to use, PyTorch provides the same convenience through the `nn` package.

<br/>

## The `torch.nn` package is very important!

- It is used as a container class in a great many places.

![total](https://user-images.githubusercontent.com/13328380/53549728-53d39400-3b78-11e9-92b0-75d2d3174eee.png)


In [13]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a 
# linear function, and holds internal Tensors for its weights and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data
    y_pred = model(x)
    
    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    
    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad


0 667.9082641601562
1 618.4788208007812
2 575.7503662109375
3 538.4057006835938
4 505.0133972167969
5 474.92205810546875
6 447.71917724609375
7 422.76318359375
8 399.8777160644531
9 378.74603271484375
10 358.9127197265625
11 340.2683410644531
12 322.7906799316406
13 306.343994140625
14 290.8363342285156
15 276.1964416503906
16 262.2452087402344
17 249.01559448242188
18 236.44271850585938
19 224.5404052734375
20 213.22763061523438
21 202.48086547851562
22 192.24960327148438
23 182.52938842773438
24 173.27993774414062
25 164.46875
26 156.08570861816406
27 148.0727081298828
28 140.4662628173828
29 133.25320434570312
30 126.39643096923828
31 119.8735122680664
32 113.68558502197266
33 107.81700134277344
34 102.24555206298828
35 96.95899200439453
36 91.94905090332031
37 87.1937255859375
38 82.68177032470703
39 78.40486145019531
40 74.36045837402344
41 70.53807067871094
42 66.92025756835938
43 63.49578857421875
44 60.25206756591797
45 57.185367584228516
46 54.27803421020508
47 51.521755218505

# PyTorch: optim
    
Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters   
(with `torch.no_grad()` or `.data` to avoid tracking history in autograd). This is not a huge burden for simple optimization   
algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated   
optimizers like AdaGrad, RMSProp, Adam, etc.  
    
The `optim` package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly   
used optimization algorithms.  
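To show what an optimizer does in the simplest case, the sketch below runs one step of `torch.optim.SGD` on a single parameter and reproduces the manual rule `p -= lr * p.grad`. The toy loss `p ** 2` and the learning rate are assumptions chosen so the arithmetic is easy to follow.

```python
import torch

# One learnable parameter, starting at 1.0.
p = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()
optimizer.zero_grad()  # clear any previously accumulated gradient
loss.backward()        # p.grad = 2 * p = 2.0
optimizer.step()       # p <- p - lr * p.grad = 1.0 - 0.1 * 2.0
print(p.item())
```

Switching to another algorithm such as Adam only changes the line that constructs the optimizer; the `zero_grad`, `backward`, `step` sequence stays the same.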
    
In this example we will use the `nn` package to define our model as before, but we will optimize the model using the Adam   
algorithm provided by the `optim` package:  

## PyTorch: optim

- Until now we applied the gradient descent update by hand; here we use the `optim` package to handle it instead.

In [17]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)
    
    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()
    
    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
    

0 639.4940795898438
1 622.8134155273438
2 606.5660400390625
3 590.7906494140625
4 575.5263671875
5 560.6886596679688
6 546.2153930664062
7 532.1458129882812
8 518.5551147460938
9 505.408203125
10 492.73468017578125
11 480.45953369140625
12 468.5809326171875
13 457.0497131347656
14 445.8498840332031
15 434.94976806640625
16 424.33282470703125
17 414.03509521484375
18 404.0030822753906
19 394.2696533203125
20 384.7678527832031
21 375.5148620605469
22 366.5043640136719
23 357.6905822753906
24 349.0854187011719
25 340.73321533203125
26 332.5883483886719
27 324.6191711425781
28 316.8465881347656
29 309.2675476074219
30 301.8817138671875
31 294.69696044921875
32 287.6822204589844
33 280.826416015625
34 274.13238525390625
35 267.5810852050781
36 261.1506042480469
37 254.84512329101562
38 248.67291259765625
39 242.61068725585938
40 236.6603240966797
41 230.8264923095703
42 225.11386108398438
43 219.51109313964844
44 214.01609802246094
45 208.63955688476562
46 203.37310791015625
47 198.21614074

414 7.542935782112181e-05
415 7.151439785957336e-05
416 6.780189869459718e-05
417 6.427102925954387e-05
418 6.0921367548871785e-05
419 5.7734283473109826e-05
420 5.471093027153984e-05
421 5.183945540920831e-05
422 4.911499490845017e-05
423 4.652592178899795e-05
424 4.406947482493706e-05
425 4.173494744463824e-05
426 3.952248880523257e-05
427 3.742347325896844e-05
428 3.543187995092012e-05
429 3.353979627718218e-05
430 3.174766243319027e-05
431 3.004764766956214e-05
432 2.8437114451662637e-05
433 2.6905863705906086e-05
434 2.5457773517700844e-05
435 2.4083603420876898e-05
436 2.27797372645e-05
437 2.154518551833462e-05
438 2.0374918676679954e-05
439 1.926890399772674e-05
440 1.8217604520032182e-05
441 1.7221927919308655e-05
442 1.627951132832095e-05
443 1.5387418898171745e-05
444 1.4540911251970101e-05
445 1.374070780002512e-05
446 1.2983027772861533e-05
447 1.2266241355973762e-05
448 1.1588484994717874e-05
449 1.0945201211143285e-05
450 1.0337546882510651e-05
451 9.762185072759166e-06
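
The comment above `optimizer.zero_grad()` says that gradients accumulate across calls to `.backward()` unless they are cleared. The following is a minimal sketch of that behavior on a single scalar; the names `w`, `first`, `accumulated`, and `after_zero` are illustrative assumptions and do not appear in the tutorial code.

```python
import torch

# A single learnable scalar standing in for a model weight (illustrative).
w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(3 * w)/dw = 3.
(3 * w).backward()
first = w.grad.item()

# Second backward pass without clearing: the new gradient is added
# to the one already stored in w.grad.
(3 * w).backward()
accumulated = w.grad.item()

# Clear the stored gradient through an optimizer, then run backward again.
optimizer = torch.optim.SGD([w], lr=0.1)
optimizer.zero_grad()
(3 * w).backward()
after_zero = w.grad.item()

print(first, accumulated, after_zero)  # 3.0 6.0 3.0
```

Without the call to `zero_grad()`, each training step would apply the sum of all gradients computed so far rather than the gradient of the current loss.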


# PyTorch: Custom nn Modules
    
Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you   
can define your own Modules by subclassing `nn.Module` and defining a `forward` which receives input Tensors and produces   
output Tensors using other modules or other autograd operations on Tensors.  
    
In this example we implement our two-layer network as a custom Module subclass:

- Let's build a custom module.

In [19]:
# -*- coding: utf-8 -*-
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operations on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
    
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())
    
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 645.5746459960938
1 598.080322265625
2 556.5667114257812
3 519.7637329101562
4 487.1322937011719
5 457.4517822265625
6 430.4448547363281
7 405.7486572265625
8 382.833984375
9 361.5015869140625
10 341.41973876953125
11 322.6589660644531
12 305.0655517578125
13 288.4554138183594
14 272.6799621582031
15 257.7122497558594
16 243.48106384277344
17 229.94039916992188
18 217.05506896972656
19 204.81373596191406
20 193.16676330566406
21 182.11843872070312
22 171.5971221923828
23 161.51345825195312
24 151.92919921875
25 142.798583984375
26 134.13284301757812
27 125.92477416992188
28 118.14436340332031
29 110.76834869384766
30 103.8010025024414
31 97.25489044189453
32 91.10405731201172
33 85.32078552246094
34 79.87596893310547
35 74.76617431640625
36 69.97187805175781
37 65.45935821533203
38 61.23240280151367
39 57.26137161254883
40 53.532005310058594
41 50.05124282836914
42 46.7999153137207
43 43.76258850097656
44 40.928016662597656
45 38.285072326660156
46 35.82404327392578
47 33.52575302124

349 0.000992960180155933
350 0.0009702860261313617
351 0.000948130851611495
352 0.0009264888940379024
353 0.0009053675457835197
354 0.0008847453282214701
355 0.0008645952329970896
356 0.0008449341985397041
357 0.0008257362060248852
358 0.0008069683681242168
359 0.0007886288221925497
360 0.0007707438780926168
361 0.0007532642339356244
362 0.0007361879688687623
363 0.0007195221842266619
364 0.0007032437133602798
365 0.0006873499369248748
366 0.0006718135555274785
367 0.0006566253723576665
368 0.0006418133853003383
369 0.0006273258477449417
370 0.0006131840054877102
371 0.0005993665545247495
372 0.0005858710501343012
373 0.0005726947565563023
374 0.0005598116549663246
375 0.000547235831618309
376 0.0005349348648451269
377 0.0005229245871305466
378 0.0005111933569423854
379 0.0004997272044420242
380 0.0004885290400125086
381 0.00047758128494024277
382 0.0004668989568017423
383 0.000456450623460114
384 0.00044624428846873343
385 0.000436268252087757
386 0.00042652161209844053
387 0.00041700
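
The comment before the optimizer is constructed notes that `model.parameters()` contains the learnable parameters of the two `nn.Linear` members. As a rough illustration of how that works, the sketch below uses a smaller stand-in class (`TinyNet`, a hypothetical name, with made-up sizes 4, 3, 2) and lists what gets registered when sub-modules are assigned as attributes. It also checks that `clamp(min=0)` matches `torch.relu` on a random tensor.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical, smaller version of TwoLayerNet for inspection."""

    def __init__(self, d_in, h, d_out):
        super().__init__()
        # Assigning Modules as attributes registers their parameters.
        self.linear1 = torch.nn.Linear(d_in, h)
        self.linear2 = torch.nn.Linear(h, d_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TinyNet(4, 3, 2)

# Each nn.Linear contributes a weight and a bias, named after the attribute.
names = [name for name, _ in model.named_parameters()]
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
total = sum(p.numel() for p in model.parameters())
print(names)
print(shapes)
print(total)

# clamp(min=0) and torch.relu agree elementwise on ordinary inputs.
h = torch.randn(5, 3)
same_as_relu = torch.equal(h.clamp(min=0), torch.relu(h))
print(same_as_relu)
```

Because registration happens through attribute assignment, the optimizer sees every weight and bias without the model having to list them by hand.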

# PyTorch: Control Flow + Weight Sharing
    
As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network   
that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same   
weights multiple times to compute the innermost hidden layers.  
    
For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the   
innermost layers by simply reusing the same Module multiple times when defining the forward pass.  
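
Before the full model, here is a minimal sketch of weight sharing in isolation. It assumes a hypothetical `SharedNet` class with a single `middle` layer and a `repeats` argument, neither of which is part of the tutorial. It is meant only to show that reusing one Module several times does not add parameters, and that one backward pass still produces a gradient for the shared weights.

```python
import torch

class SharedNet(torch.nn.Module):
    """Hypothetical module that applies one Linear layer several times."""

    def __init__(self, h):
        super().__init__()
        self.middle = torch.nn.Linear(h, h)

    def forward(self, x, repeats):
        # Ordinary Python loop; the graph depth is decided on each call.
        for _ in range(repeats):
            x = self.middle(x).clamp(min=0)
        return x

torch.manual_seed(0)
net = SharedNet(5)
x = torch.randn(2, 5)

# The parameter count is fixed by the constructor, not by how often
# the layer is reused: 5 * 5 weights + 5 biases.
n_params = sum(p.numel() for p in net.parameters())

# Two forward passes with different depths through the same weights.
out1 = net(x, repeats=1)
out3 = net(x, repeats=3)

# Backpropagating through three uses yields a single gradient tensor
# for the shared weight, summed over all of its uses.
out3.sum().backward()
grad_shape = tuple(net.middle.weight.grad.shape)

print(n_params, tuple(out1.shape), tuple(out3.shape), grad_shape)
```

The depth is passed in explicitly here so the behavior is easy to inspect; the tutorial's model draws it at random inside `forward` instead.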
    
We can easily implement this model as a Module subclass:

- Control flow and weight sharing here seem to mean using a single layer repeatedly.

In [20]:
# -*- coding: utf-8 -*-
import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.
        
        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.
        
        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
            
        y_pred = self.output_linear(h_relu)
        return y_pred
    
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())
    
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 637.712158203125
1 646.6466064453125
2 674.0325317382812
3 627.07568359375
4 581.3355712890625
5 600.9844970703125
6 460.945068359375
7 625.039306640625
8 353.77508544921875
9 304.08575439453125
10 620.8109130859375
11 627.003662109375
12 615.2634887695312
13 610.1180419921875
14 622.4473876953125
15 534.4255981445312
16 586.059814453125
17 611.860595703125
18 131.89810180664062
19 549.2403564453125
20 533.5958862304688
21 430.4943542480469
22 570.360107421875
23 469.52630615234375
24 441.0391540527344
25 408.0569152832031
26 474.10955810546875
27 437.95892333984375
28 403.67889404296875
29 230.3914337158203
30 276.4176330566406
31 253.50587463378906
32 259.07818603515625
33 153.6182098388672
34 178.03848266601562
35 265.9148864746094
36 184.09298706054688
37 182.1201629638672
38 207.7692108154297
39 213.28843688964844
40 138.57534790039062
41 109.9355239868164
42 440.67535400390625
43 235.50271606445312
44 155.27581787109375
45 858.1804809570312
46 257.5084533691406
47 224.758911132

396 5.0004706382751465
397 38.468833923339844
398 20.90216636657715
399 20.977420806884766
400 7.331823348999023
401 13.436382293701172
402 5.854316711425781
403 18.006113052368164
404 30.38141441345215
405 11.325276374816895
406 2.821300983428955
407 11.237908363342285
408 6.065662384033203
409 47.42743682861328
410 3.158426523208618
411 3.2920918464660645
412 10.065847396850586
413 11.243875503540039
414 12.734029769897461
415 8.973952293395996
416 9.224218368530273
417 3.2511367797851562
418 1.5244874954223633
419 2.6307175159454346
420 10.352783203125
421 3.351327419281006
422 1.7879525423049927
423 1.5237019062042236
424 2.7636454105377197
425 1.2188994884490967
426 1.7651878595352173
427 3.648951768875122
428 1.9037529230117798
429 1.563104510307312
430 1.2309794425964355
431 0.9091571569442749
432 4.177083492279053
433 0.7424383163452148
434 3.112476110458374
435 2.3282365798950195
436 1.9935998916625977
437 1.0427247285842896
438 1.4635499715805054
439 1.0395424365997314
440 1.