Computational graphs and autograd are a powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be too low-level. When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that are optimized during training. The example below uses PyTorch's torch.nn package, which provides such layers as modules, to define a two-layer network and train it with manual gradient descent.

import torch

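# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension
N, D_in, H, D_out = 64, 1000, 100, 10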

# create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# define the model as a sequence of layers; Sequential applies them in order
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# mean squared error loss; reduction='sum' sums the squared errors instead of averaging them
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4

for t in range(500):
    # forward pass
    y_pred = model(x)
    
    # compute the loss and print it every 100 iterations
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    
    # zero the gradients before running the backward pass
    model.zero_grad()
    
    # backward pass
    loss.backward()
    
    # update the weights using gradient descent
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
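
To make concrete which layers carry learnable parameters, the following minimal sketch rebuilds the same architecture and lists its parameters by name and shape. It assumes the layer sizes used above; the variable names param_shapes and total_params are introduced here for illustration only.

```python
import torch

# Same architecture and layer sizes as the model in this section.
D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Sequential names its children by position ("0", "1", "2"), so parameter
# names look like "0.weight". The ReLU at index 1 has no parameters.
# param_shapes is a hypothetical helper name, not part of the original code.
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)

# Count every scalar that the gradient-descent loop would update.
total_params = sum(p.numel() for p in model.parameters())
print("total learnable parameters:", total_params)
```

Only the two Linear layers appear in the listing, each with a weight matrix and a bias vector. These are exactly the tensors that model.parameters() yields in the update loop above.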