# Automatic Differentiation with torch.autograd

- The most commonly used algorithm for training neural networks is back propagation.
- In back propagation, the parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

- To compute these gradients, PyTorch has a built-in differentiation engine called torch.autograd.
  - It supports automatic computation of gradients for any computational graph.



# Tensors, Functions and Computational Graph

In [None]:
import torch

x = torch.ones(5) # input tensor
y = torch.zeros(3) # expected output


w = torch.randn(5, 3, requires_grad=True)  # parameter 1
b = torch.randn(3, requires_grad=True)  # parameter 2
# w, b: parameters that need to be optimized
# -> we need the gradients of the loss function with respect to w and b
# -> to enable this, set the [requires_grad] property of those tensors

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)


In [None]:
# Each function applied to tensors to construct the computational graph is
# an object of class Function, which knows how to compute the function in the
# forward direction, and also how to compute its derivative during the
# backward propagation step.

# A reference to the backward propagation function is stored in [grad_fn]
print(f'Gradient function for z = {z.grad_fn}')
print(f'Gradient function for loss = {loss.grad_fn}')

Gradient function for z = <AddBackward0 object at 0x7fbc4cc279d0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7fbc4cc27c90>
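
The graph behind `grad_fn` can also be inspected one step at a time. The sketch below rebuilds the same tensors (the fixed seed is an addition, used only for reproducibility) and reads the `is_leaf` attribute and `grad_fn.next_functions`. The node names that get printed can differ between PyTorch versions, so no output is shown.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only so reruns give the same values

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

# Tensors created directly by the user are leaves and have no grad_fn
print(w.is_leaf, w.grad_fn)

# Tensors produced by an operation are not leaves and carry a grad_fn
print(z.is_leaf, z.grad_fn)

# next_functions lists the backward nodes that feed into this one:
# here, one for the matmul result and one for the parameter b
for fn, input_index in z.grad_fn.next_functions:
    print(fn, input_index)
```

Following `next_functions` repeatedly from `loss.grad_fn` walks the whole graph back to the leaf tensors `w` and `b`.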


# Computing Gradients

In [None]:
# To optimize the parameter weights, we need the derivatives of the loss
# function with respect to the parameters
loss.backward()  # compute these derivatives
print(w.grad)  # retrieve the values
print(b.grad)  # retrieve the values

tensor([[0.2836, 0.0166, 0.1387],
        [0.2836, 0.0166, 0.1387],
        [0.2836, 0.0166, 0.1387],
        [0.2836, 0.0166, 0.1387],
        [0.2836, 0.0166, 0.1387]])
tensor([0.2836, 0.0166, 0.1387])
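
As a sanity check, these gradients can be reproduced by hand. With the default mean reduction, the derivative of the loss with respect to each logit is `(sigmoid(z) - y) / N`, where `N` is the number of outputs. That vector is the gradient for `b`, and since `x` is all ones, every row of `w.grad` equals it too, which is why the rows printed above are identical. A minimal sketch, assuming a fixed seed (an addition) so the values are reproducible:

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Analytic gradient of the mean-reduced loss with respect to the logits z
dz = (torch.sigmoid(z.detach()) - y) / y.numel()

manual_b_grad = dz                         # d(loss)/db = d(loss)/dz
manual_w_grad = x[:, None] * dz[None, :]   # d(loss)/dw[i, j] = x[i] * dz[j]

print(torch.allclose(b.grad, manual_b_grad, atol=1e-6))
print(torch.allclose(w.grad, manual_w_grad, atol=1e-6))
```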


# Disabling Gradient Tracking

- By default, all tensors with requires_grad=True track their computational history and support gradient computation.
- However, there are cases where we do not need this, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do [forward] computations through the network.
- We can stop tracking computations by surrounding our computation code with a [torch.no_grad()] block.

In [None]:
z = torch.matmul(x, w) + b
print(z.requires_grad)

# When gradients are not needed (i.e. only [forward] computation), wrap the
# computation code in no_grad() to stop tracking
with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)

True
False


In [None]:
# Another way to achieve the same result: [detach()]
z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)

False


In [None]:
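# Reasons to disable gradient tracking:
#   1. To mark some parameters of the neural network as frozen - a common case when fine-tuning a pretrained network
#   2. To speed up computation when only doing a forward pass - computations without gradient tracking are more efficient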
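
The first reason, freezing parameters, can be sketched as follows. The two-layer `nn.Sequential` model here is hypothetical and only stands in for a pretrained network; the freezing itself uses `requires_grad_(False)` on the parameters of the layer to be frozen.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility

# Hypothetical model standing in for a pretrained network
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer: autograd stops tracking these parameters
for param in model[0].parameters():
    param.requires_grad_(False)

out = model(torch.ones(2, 5))
out.sum().backward()

print(model[0].weight.grad)        # None: frozen parameters receive no gradient
print(model[2].weight.grad.shape)  # torch.Size([3, 4]): still trainable
```

When fine-tuning, only the parameters that still require gradients would then be passed to the optimizer.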

# More on Computational Graphs

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects.

In this DAG, 
- leaves: input tensors
- roots: output tensors. 
- By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:
- run the requested operation to compute a resulting tensor
- maintain the operation's gradient function in the DAG

The backward pass kicks off when .backward() is called on the DAG root. autograd then:
- computes the gradients from each .grad_fn
- accumulates them in the respective tensor's .grad attribute
- using the chain rule, propagates all the way to the leaf tensors
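
The accumulation step has a practical consequence: calling `.backward()` twice on the same graph adds the second gradient onto the first instead of replacing it. A small sketch, using a toy loss chosen only for illustration (`retain_graph=True` keeps the graph alive so the second call is possible):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()  # toy loss: d(loss)/dw = 2 * w

loss.backward(retain_graph=True)  # keep the graph so backward can run again
first = w.grad.clone()            # tensor([2., 4.])

loss.backward()                   # the new gradient is added to the stored one
second = w.grad.clone()           # tensor([4., 8.])

w.grad.zero_()                    # reset before the next backward pass

print(first, second, w.grad)
```

This is why training loops clear the gradients before each backward pass, typically through the optimizer.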

An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
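
A short sketch of what this allows in practice: the hypothetical `forward` function below takes a different branch depending on its input, so each iteration builds a different graph and produces a different gradient formula.

```python
import torch

def forward(x, w):
    # Hypothetical function with data-dependent control flow
    if x.sum() > 0:
        return (w * x).sum()   # gradient with respect to w is x
    return (w ** 2).sum()      # gradient with respect to w is 2 * w

w = torch.tensor([1.0, -3.0], requires_grad=True)
inputs = (torch.tensor([2.0, 5.0]), torch.tensor([-1.0, -1.0]))

grads = []
for x in inputs:
    w.grad = None              # clear the gradient from the previous iteration
    forward(x, w).backward()   # a fresh graph is built on every call
    grads.append(w.grad.clone())

print(grads[0])  # tensor([2., 5.])
print(grads[1])  # tensor([ 2., -6.])
```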