<a href="https://colab.research.google.com/github/ratnaan23/ds_class/blob/main/eng_datascience_presentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 3 (라티나 아스투티) 

## 5.5 PyTorch's autograd: Backpropagating all things

Backpropagation in a nutshell: compute the gradient of a composition of functions with respect to its innermost parameters (w and b) by propagating derivatives backward using the chain rule.

Basic requirement: every function in the composition can be differentiated analytically.
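To make the chain rule concrete, here is a minimal sketch that differentiates a mean squared error through a linear model by hand and compares the result with what autograd reports. The toy data and the starting values of `w` and `b` are made up for illustration; they are not the thermometer readings.

```python
import torch

# Toy data (hypothetical values, not the thermometer readings).
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 7.0])

w = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Forward pass: loss = mean((w * x + b - y) ** 2)
pred = w * x + b
loss = ((pred - y) ** 2).mean()
loss.backward()  # autograd fills w.grad and b.grad

# Chain rule by hand:
#   dloss/dpred = 2 * (pred - y) / N
#   dpred/dw = x and dpred/db = 1
with torch.no_grad():
    dloss_dpred = 2.0 * (pred - y) / x.numel()
    manual_dw = (dloss_dpred * x).sum()
    manual_db = dloss_dpred.sum()

print(w.grad, manual_dw)
print(b.grad, manual_db)
```

Both pairs should agree up to floating-point rounding, which is the whole point: autograd applies the same chain rule, just without us writing the derivatives.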

### **Computing the gradient automatically**


---

PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them. They can also provide the chain of derivatives of such operations with respect to their inputs.

With PyTorch’s autograd, given a forward expression, PyTorch will automatically provide the gradient of that expression with respect to its input parameters.



#### **Applying autograd**

Let's rewrite our thermometer calibration code using autograd:

In [None]:
# In[5]:
params = torch.tensor([1.0, 0.0], requires_grad=True)

The requires_grad=True argument tells PyTorch to track the entire family tree of tensors resulting from operations on params.
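The "family tree" can be inspected directly. The sketch below uses the standard tensor attributes `is_leaf`, `requires_grad`, and `grad_fn`; the intermediate names `scaled`, `total`, and `untracked` are made up for this illustration.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
scaled = params * 2.0      # child of params
total = scaled.sum()       # grandchild of params

# params was created directly by us, so it is a leaf with no grad_fn.
print(params.is_leaf, params.grad_fn)

# Tensors derived from params inherit requires_grad and are not leaves.
print(scaled.requires_grad, scaled.is_leaf)

# grad_fn records the operation that produced the tensor.
print(total.grad_fn)

# The same operation on a tensor without requires_grad is not tracked.
untracked = torch.tensor([1.0, 0.0]) * 2.0
print(untracked.requires_grad)
```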

#### **Using the grad attribute**

In general, all PyTorch tensors have an attribute named grad. Normally, it's None.

To use the grad attribute, start with a tensor with requires_grad set to True, then call the model and compute the loss, and then call backward on the loss tensor.

In [None]:
# In[7]:
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()

params.grad

```
# Out[7]:
tensor([4517.2969,   82.6000])
```

The grad attribute of params contains the derivatives of the loss with respect to each element of params.


When calculating `loss`, PyTorch creates the autograd graph with the operations as nodes.



When we call `loss.backward()`, PyTorch traverses this graph in the reverse direction to compute gradients, as shown by the arrows in the bottom row of the figure.

![Figure 5.10](https://drive.google.com/uc?export=view&id=1xlHFCoLoqeu3RVtrVlK-_hBBTnIqHvCl)

#### **Accumulating the grad functions**


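Calling backward causes derivatives to accumulate at leaf nodes: each call adds the new gradient to whatever is already stored in grad instead of overwriting it.


We therefore need to zero the gradient explicitly at each iteration, using the in-place zero_ method: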

In [None]:
# In[8]:
if params.grad is not None:
    params.grad.zero_()
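The accumulation is easy to observe in isolation. This is a minimal sketch with a made-up loss (`toy_loss`, whose gradient is simply `2 * p`), not the thermometer loss:

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)

def toy_loss(p):
    # Hypothetical loss; its gradient with respect to p is 2 * p.
    return (p ** 2).sum()

toy_loss(params).backward()
first = params.grad.clone()

toy_loss(params).backward()   # no zeroing in between: gradients add up
second = params.grad.clone()

params.grad.zero_()           # reset in place
toy_loss(params).backward()
after_zero = params.grad.clone()

print(first, second, after_zero)
```

Without the zeroing step, `second` is twice `first`, so an update based on it would take a step twice as large as intended.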

Our autograd-enabled training code will look like this:

In [None]:
# In[9]:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:      # can go anywhere before the loss.backward() call
            params.grad.zero_()
        
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        loss.backward()

        with torch.no_grad():
            params -= learning_rate * params.grad
        
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))

    return params

In [None]:
# In[10]: 
training_loop(
    n_epochs = 5000,
    learning_rate = 1e-2,
    params = torch.tensor([1.0, 0.0], requires_grad=True),
    t_u = t_un,      # again using the normalized t_un
    t_c = t_c)

```
# Out[10]:
Epoch 500, Loss 7.860116 
Epoch 1000, Loss 3.828538 
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648 
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679 
Epoch 4500, Loss 2.927652 
Epoch 5000, Loss 2.927647
```


We get the same result as before, but we no longer need to calculate derivatives by hand.

### **Optimizers a la carte**


---

The torch module has an optim submodule where we can find classes implementing different optimization algorithms.

In [None]:
# In[5]:
import torch.optim as optim

dir(optim)

```
# Out[5]:
['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'Adamax',
 'LBFGS',
 'Optimizer',
 'RMSprop',
 'Rprop',
 'SGD',
 'SparseAdam',
...
]
```

Every optimizer constructor takes a list of parameters (that is, PyTorch tensors, typically with requires_grad set to True) as the first input.

![Figure 5.11](https://drive.google.com/uc?export=view&id=1d1PWcpufgDU_bqcOll0YqMb7mXTJfJ7g)

---



(A) Conceptual representation of how an optimizer holds a reference to parameters.

(B) After a loss is computed from inputs,

(C) a call to .backward leads to .grad being populated on parameters.

(D) At that point, the optimizer can access .grad and compute the parameter updates.

#### **Using a gradient descent optimizer**

Let's create params and instantiate a gradient descent optimizer:

In [None]:
# In[6]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr = learning_rate)

SGD stands for stochastic gradient descent. The term stochastic comes from the fact that the gradient is typically obtained by averaging over a random subset of all input samples, called a minibatch.
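To illustrate what "a random subset of all input samples" means, the sketch below compares the gradient computed on a full data set with the gradient computed on a random minibatch. The data set, the helper `mse`, and the batch size of 10 are all assumptions made for this example.

```python
import torch

torch.manual_seed(0)

# Hypothetical data set: 100 samples following y = 3x + 1 exactly.
x = torch.linspace(0.0, 1.0, 100)
y = 3.0 * x + 1.0

params = torch.tensor([1.0, 0.0], requires_grad=True)

def mse(w, b, xs, ys):
    # Mean squared error of the linear model w * xs + b.
    return ((w * xs + b - ys) ** 2).mean()

# Gradient evaluated on the full data set.
mse(*params, x, y).backward()
full_grad = params.grad.clone()
params.grad.zero_()

# Gradient evaluated on a random minibatch of 10 samples.
batch = torch.randperm(x.shape[0])[:10]
mse(*params, x[batch], y[batch]).backward()
batch_grad = params.grad.clone()

print(full_grad, batch_grad)
```

The minibatch gradient is a noisy estimate of the full gradient. The optimizer itself only reads `params.grad`, so it behaves the same way whichever of the two produced it.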

In [None]:
# In[7]:
t_p = model(t_u, *params)
loss = loss_fn(t_p, t_c)
loss.backward()

optimizer.step()

params

```
# Out[7]:
tensor([9.5483e-01, -8.2600e-04], requires_grad=True)
```

The value of params is updated upon calling step, because the optimizer looks into params.grad and updates params, subtracting learning_rate times grad from it.
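As a sanity check, the update performed by `step` can be reproduced by hand. This sketch uses a made-up quadratic loss and assumes plain `optim.SGD` with no momentum or weight decay, which is what the default arguments give:

```python
import torch
import torch.optim as optim

params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

# The optimizer keeps a reference to the very same tensor object.
print(optimizer.param_groups[0]["params"][0] is params)

# Hypothetical loss, used only to produce a gradient.
loss = ((params - torch.tensor([3.0, -1.0])) ** 2).sum()
loss.backward()

# What plain gradient descent should produce, computed by hand.
expected = params.detach() - learning_rate * params.grad

optimizer.step()
print(params, expected)
```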

But this code is not loop-ready yet, since we haven't zeroed out the gradients. Here is the loop-ready version, with the extra zero_grad call right before the call to backward:

In [None]:
# In[8]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr = learning_rate)

t_p = model(t_un, *params)
loss = loss_fn(t_p, t_c)

optimizer.zero_grad()
loss.backward()
optimizer.step()

params

```
# Out[8]:
tensor([1.7761, 0.1064], requires_grad=True)
```

Let's update our training loop accordingly:

In [None]:
# In[9]:
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    
    return params

In [None]:
# In[10]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)      # must be the same params object that is passed to training_loop

training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    params = params,
    t_u = t_un,
    t_c = t_c)

```
# Out[10]:
Epoch 500, Loss 7.860118
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957697
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927680
Epoch 4500, Loss 2.927651
Epoch 5000, Loss 2.927648

tensor([  5.3671, -17.3012], requires_grad=True)
```
#### **Testing other optimizers**

To test other optimizers, all we have to do is instantiate a different one, say Adam instead of SGD.

With the Adam optimizer, the learning rate is set adaptively, and training is a lot less sensitive to the scaling of the parameters. We can therefore go back to the non-normalized input t_u and increase the learning rate to 1e-1.

In [None]:
# In[11]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-1
optimizer = optim.Adam([params], lr=learning_rate)

training_loop(
    n_epochs = 2000,
    optimizer = optimizer,
    params = params,
    t_u = t_u,
    t_c = t_c)


```
# Out[11]:
Epoch 500, Loss 7.612903
Epoch 1000, Loss 3.086700
Epoch 1500, Loss 2.928578
Epoch 2000, Loss 2.927646

tensor([  0.5367, -17.3021], requires_grad=True)
```

### **Training, Validation, and Overfitting**

---

A highly adaptable model will tend to use its many parameters to make sure the loss is minimal at the data points, but we'll have no guarantee that the model behaves well away from or in between the data points.


To check for overfitting, we must take a few data points out of our dataset (the validation set) and only fit our model on the remaining data points (the training set), as shown below.

![Figure 5.12](https://drive.google.com/uc?export=view&id=1OLFLenUe0a0qX_NDI06kbFXYBumYjlJo)


We can evaluate the loss on the training set and on the validation set, and look at both results to decide whether we've done a good job fitting our model.

#### **Evaluating the training loss**


If the training loss is not decreasing, chances are:
*   the model is too simple for the data; or
*   our data doesn't contain enough meaningful information to explain the output.


#### **Generalizing to the validation set**


If the training loss and the validation loss diverge, we're overfitting.



In our thermometer case, if we decided to fit the data with a more complicated function, for example a piecewise polynomial, it could produce a model that meanders its way through the data points, as in the figure below.

![Figure 5.13](https://drive.google.com/uc?export=view&id=1OgexRkWqoEa7LXE9Z6Ri4LLvKhiRQn-m)



The model tries to push the loss very close to zero, but the behaviour of the function away from the data points does not increase the loss, so we have nothing to keep the model in check for inputs away from the training data points.

What can we do about overfitting?
*   make sure we get enough data for the process
*   make sure the model capable of fitting the training data is as regular as possible in between them (we can add penalization terms to the loss function, or add noise to the input sample)
*   make our model simpler (a simpler model may not fit the training data perfectly but it will likely behave more regularly in between data points)
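One of the remedies above is adding a penalization term to the loss. A minimal sketch of that idea follows, using an L2 penalty on the parameters. The function name `penalized_loss` and the `weight` hyperparameter are hypothetical, and the toy data is chosen so that the unpenalized loss is exactly zero.

```python
import torch

def penalized_loss(t_p, t_c, params, weight=0.1):
    # Mean squared error plus an L2 penalty on the parameters.
    # `weight` is a hypothetical hyperparameter setting the penalty strength.
    mse = ((t_p - t_c) ** 2).mean()
    penalty = weight * (params ** 2).sum()
    return mse + penalty

params = torch.tensor([2.0, -1.0], requires_grad=True)
t_u = torch.tensor([1.0, 2.0, 3.0])   # toy inputs
t_c = torch.tensor([1.0, 3.0, 5.0])   # toy targets, exactly 2 * t_u - 1

t_p = params[0] * t_u + params[1]
loss = penalized_loss(t_p, t_c, params)
loss.backward()

print(loss)          # only the penalty remains, since the fit is perfect
print(params.grad)   # the penalty pulls both parameters toward zero
```

Even with a perfect fit the gradient is nonzero, so the penalty trades a little training accuracy for smaller, more regular parameters.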

#### **Splitting a dataset**


Let's go back to our example to see how we can split the data into a training set and a validation set. 



We'll do it by shuffling t_u and t_c the same way and then splitting the resulting shuffled tensors into two parts.



Shuffling the elements of a tensor amounts to finding a permutation of its indices. The randperm function does exactly this.

In [None]:
# In[12]:
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices     # random, so your values may differ when you run this

```
# Out[12]:
(tensor([9, 6, 5, 8, 4, 7, 0, 1, 3]), tensor([ 2, 10]))
```
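If a repeatable split is needed, `torch.randperm` accepts a `generator` argument. The sketch below uses an arbitrary seed and also checks that the two index sets together cover every sample exactly once; the sample count of 11 matches the output above.

```python
import torch

n_samples = 11
n_val = int(0.2 * n_samples)          # 2 validation samples

# A seeded generator makes the shuffle repeatable (the seed is arbitrary).
gen = torch.Generator().manual_seed(42)
shuffled_indices = torch.randperm(n_samples, generator=gen)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

# Every index appears exactly once across the two splits.
combined = torch.cat([train_indices, val_indices]).sort().values
print(len(train_indices), len(val_indices))
print(torch.equal(combined, torch.arange(n_samples)))
```

One caveat with this slicing pattern: for fewer than five samples `n_val` becomes 0, and `shuffled_indices[:-0]` is an empty tensor, so very small data sets need a guard.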


We now have index tensors that we can use to build training and validation sets starting from the data tensors:

In [None]:
# In[13]:
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]

val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]

train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u

Our training loop doesn’t really change. We just want to additionally evaluate the validation loss at every epoch, to have a chance to recognize whether we’re overfitting:

In [None]:
# In[14]:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
              train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        val_t_p = model(val_t_u, *params)
        val_loss = loss_fn(val_t_p, val_t_c)
        optimizer.zero_grad()
        train_loss.backward()    # no val_loss.backward() here, since we must not train on the validation data
        optimizer.step()

        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f},"
                  f" Validation loss {val_loss.item():.4f}")
    
    return params

In [None]:
# In[15]:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

training_loop(
    n_epochs = 3000,
    optimizer = optimizer,
    params = params,
    # we use the normalized inputs, since we are training with SGD
    train_t_u = train_t_un,
    val_t_u = val_t_un,
    train_t_c = train_t_c,
    val_t_c = val_t_c)

```
# Out[15]:
Epoch 1, Training loss 66.5811, Validation loss 142.3890
Epoch 2, Training loss 38.8626, Validation loss 64.0434
Epoch 3, Training loss 33.3475, Validation loss 39.4590
Epoch 500, Training loss 7.1454, Validation loss 9.1252
Epoch 1000, Training loss 3.5940, Validation loss 5.3110
Epoch 1500, Training loss 3.0942, Validation loss 4.1611
Epoch 2000, Training loss 3.0238, Validation loss 3.7693
Epoch 2500, Training loss 3.0139, Validation loss 3.6279
Epoch 3000, Training loss 3.0125, Validation loss 3.5756

tensor([  5.1964, -16.7512], requires_grad=True)
```

Our validation set is really small, so the validation loss will only be meaningful up to a point.


*   We expect a model to perform better on the training set, since its parameters are fitted to that data.
*   Our main goal is to see both the training loss and the validation loss decreasing.

![Figure 5.14](https://drive.google.com/uc?export=view&id=1NOdCPVz0Y8sNSbbtljhnj2Zwkc03V2UQ)

In the figure above (solid line = training; dotted line = validation):

*   In case A, the model does not seem to be learning at all.
*   In case B, the training loss is decreasing but the validation loss is increasing, which indicates overfitting.
*   Case C is ideal: both the training and the validation loss are decreasing.
*   In case D, the training and validation loss follow a similar trend, which is an acceptable scenario.
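The cases above can be turned into a crude automatic check. The helper below is a hypothetical sketch, not a standard diagnostic: it only compares the first and last recorded loss values, and `tol` is an assumed threshold below which a change counts as flat. It is applied to the losses printed in Out[15].

```python
def diagnose(train_losses, val_losses, tol=1e-3):
    """Crude trend check comparing the first and last recorded losses.

    `tol` is a hypothetical threshold below which a change counts as flat.
    """
    train_delta = train_losses[-1] - train_losses[0]
    val_delta = val_losses[-1] - val_losses[0]

    if train_delta > -tol:
        return "not learning"        # case A
    if val_delta > tol:
        return "overfitting"         # case B
    if val_delta < -tol:
        return "both decreasing"     # cases C and D
    return "validation flat"

# Losses printed by the training run above (Out[15]).
train_losses = [66.5811, 38.8626, 33.3475, 7.1454, 3.5940,
                3.0942, 3.0238, 3.0139, 3.0125]
val_losses = [142.3890, 64.0434, 39.4590, 9.1252, 5.3110,
              4.1611, 3.7693, 3.6279, 3.5756]

print(diagnose(train_losses, val_losses))
```

A real check would look at the whole curve rather than two endpoints, since a validation loss can dip and then rise again.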

### Autograd nits and switching it off


---


In our training loop, our model is evaluated twice, once on train_t_u and once on val_t_u and then backward is called. The first line in the training loop evaluates model on train_t_u to produce train_t_p. Then train_loss is evaluated from train_t_p. This creates a computation graph that links train_t_u to train_t_p to train_loss. 


When model is evaluated again on val_t_u, it produces val_t_p and val_loss. In this case, a separate computation graph will be created that links val_t_u to val_t_p to val_loss.


Separate tensors have been run through the same functions, model and loss_fn, generating separate computation graphs, as shown in the figure below.


![Figure 5.15](https://drive.google.com/uc?export=view&id=1pmDLKyqQbx1vA52-6ndT_5fi5CKbTxll)



The only tensors these two graphs have in common are the parameters. When we call backward on train_loss, we run backward on the first graph. In other words, we accumulate the derivatives of train_loss with respect to the parameters based on the computation generated from train_t_u. 
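Because the parameters are the shared leaf of both graphs, calling `backward` on the validation loss as well would add its derivatives to the same `grad`. The sketch below demonstrates this with toy data; `model` and `loss_fn` are assumed here to be a linear model and a mean squared error, which may differ in detail from the definitions used earlier in the notebook.

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)

def model(t_u, w, b):
    return w * t_u + b                   # assumed linear model

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()     # assumed mean squared error

# Two separate graphs built from different (toy) data, same params.
train_loss = loss_fn(model(torch.tensor([1.0, 2.0]), *params),
                     torch.tensor([3.0, 5.0]))
val_loss = loss_fn(model(torch.tensor([4.0]), *params),
                   torch.tensor([9.0]))

train_loss.backward()
train_only = params.grad.clone()

val_loss.backward()                      # second graph, same leaf tensor
both = params.grad.clone()

print(train_only, both)
```

After the second call, `params.grad` holds the sum of both gradients, which would effectively train the model on the validation data too.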


In addition, since we never call backward on val_loss, we could switch off autograd for this part of the loop.



PyTorch allows us to switch off autograd when we don't need it, using the torch.no_grad context manager.


We can make sure this works by checking the value of the requires_grad attribute on the val_loss tensor:

In [None]:
# In[16]:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        
        with torch.no_grad():
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            assert not val_loss.requires_grad    # no graph is built inside no_grad
        
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()



Using the related set_grad_enabled context, we can also condition the code to run with autograd enabled or disabled, according to a Boolean expression.

We could define a calc_forward function that takes data as input and runs model and loss_fn with or without autograd according to a Boolean is_train argument.

In [None]:
# In[17]:
def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss
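A short usage sketch of `calc_forward` follows. To keep it self-contained, `model`, `loss_fn`, the global `params`, and the toy data are all defined here as assumptions (a linear model with a mean squared error).

```python
import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)

def model(t_u, w, b):
    return w * t_u + b                    # assumed linear model

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()      # assumed mean squared error

def calc_forward(t_u, t_c, is_train):
    # Autograd is on only when is_train is True.
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss

t_u = torch.tensor([1.0, 2.0, 3.0])       # toy data
t_c = torch.tensor([2.0, 4.0, 6.0])

train_loss = calc_forward(t_u, t_c, is_train=True)
val_loss = calc_forward(t_u, t_c, is_train=False)

print(train_loss.requires_grad, val_loss.requires_grad)
```

The two calls return the same loss value, but only the first carries a graph that `backward` can traverse.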

## 5.6 Conclusion

*   A model can be optimized to fit the data.
*   Linear models are the simplest reasonable model to use to fit data.
*   Data is often split into separate sets of training samples and validation samples.



