<a href="https://colab.research.google.com/github/ten-jampa/pytorch-grind/blob/main/pytorch_from_blogs/pytorch_in_1hr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs
Following the [blog post](https://sebastianraschka.com/blog/2023/pytorch-in-one-hour.html) by Sebastian Raschka.

## 1. What is PyTorch

PyTorch is an open-source, Python-based deep learning library that has been growing in scope in both academic research and industrial development and operations. One of the reasons why PyTorch is so popular is its user-friendly interface and efficiency. However, despite its accessibility, it doesn't compromise on flexibility, providing advanced users the ability to tweak lower-level aspects of their models for customization and optimization.

### 1.1 The three core components of PyTorch

There are three broad components of PyTorch:

1. Tensor Library - extends the concept of array-oriented programming (as in the NumPy library) with features for accelerated computation on GPUs.
2. Automatic Differentiation Engine - autograd, which enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.
3. Deep Learning Library - offers modular, flexible, and efficient building blocks (pre-trained models, neurons, loss functions, and optimizers).
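
As a rough illustration of the three components listed above, the following minimal sketch touches each one in turn. The variable names (`doubled`, `w`, `layer`) are made up for this example and do not appear in the blog post.

```python
import torch

# 1. Tensor library: array-style computation on a tensor
x = torch.tensor([1.0, 2.0, 3.0])
doubled = x * 2

# 2. Autograd: compute d(w^2)/dw at w = 3, which is 2 * w = 6
w = torch.tensor(3.0, requires_grad=True)
y = w ** 2
y.backward()

# 3. Deep learning library: a ready-made fully connected layer
layer = torch.nn.Linear(in_features=3, out_features=1)
out = layer(x)

print(doubled)     # element-wise result
print(w.grad)      # gradient stored by autograd
print(out.shape)   # one output value
```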

### 1.2 Installing PyTorch

In [1]:
!pip install torch==2.4.1

Collecting torch==2.4.1
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.4.1)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.4.1)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.4.1)
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-many

In [2]:
# Import PyTorch and show the installed version
import torch
torch.__version__

'2.4.1+cu121'

In [3]:
# check if NVIDIA GPU is available
torch.cuda.is_available()

True

## 2. Understanding Tensors

Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. In other words, tensors are mathematical objects that can be characterized by their order (or rank), which provides the number of dimensions. For example, a scalar (just a number) is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix is a tensor of rank 2.

From a computational perspective, tensors serve as data containers. For instance, they hold multi-dimensional data, where each dimension represents a different feature. Tensor libraries, such as PyTorch, can create, manipulate, and compute with these multi-dimensional arrays efficiently. In this context, a tensor library functions as an array library.

PyTorch tensors are similar to NumPy arrays but have several additional features important for deep learning. For example, PyTorch adds an automatic differentiation engine, which simplifies computing gradients, as discussed later in section 4. PyTorch tensors also support GPU computations to speed up deep neural network training.

### 2.1 Scalars, vectors, matrices, and Tensors

In [4]:
import torch



# create a 0d tensor (scalar)
tensor0d = torch.tensor(0)

# create a 1d tensor (vector)
tensor1d = torch.tensor([1,2,3])

# create a 2d tensor (matrix)
tensor2d = torch.tensor([[1,2],[3,4]])

# create a 3d tensor (from a nested python list)
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# rank of tensor is len(tensor.shape)
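
To make the note on rank concrete, here is a small sketch that reads the number of dimensions from `.ndim` and compares it with `len(tensor.shape)`. The dictionary `tensors` is only an illustrative container, not part of the original notebook.

```python
import torch

# the same four tensors as above, collected in a dictionary for inspection
tensors = {
    "scalar": torch.tensor(0),
    "vector": torch.tensor([1, 2, 3]),
    "matrix": torch.tensor([[1, 2], [3, 4]]),
    "3d": torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]),
}

# .ndim gives the rank directly; it equals len(tensor.shape)
ranks = {name: t.ndim for name, t in tensors.items()}

for name, t in tensors.items():
    print(name, "shape:", tuple(t.shape), "rank:", t.ndim)
```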

### 2.2 Tensor data types

In [5]:
tensor1d = torch.tensor([1,2,3])

print(tensor1d.dtype) # Python integers become 64-bit integers (torch.int64) by default

torch.int64


In [6]:
# tensors created from Python floats default to 32-bit precision (torch.float32)

floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


In [7]:
# .to() changes the data type, here from 64-bit integers to 32-bit floats

floatvec = tensor1d.to(torch.float32)
print(floatvec)

tensor([1., 2., 3.])


### 2.3 Common PyTorch tensor operations

In [8]:
tensor2d = torch.tensor(
    [
        [1,2,3],
        [4,5,6]
    ]
)
tensor2d.shape # the shape attribute of the tensor

torch.Size([2, 3])

In [9]:
# to reshape the tensor
tensor2d.reshape(3,2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

In [10]:
# the more common way to reshape a tensor is the view() method
tensor2d.view(3,2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

In [11]:
tensor2d #the methods above haven't changed the original tensor

tensor([[1, 2, 3],
        [4, 5, 6]])
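
One difference between `view()` and `reshape()` that the cells above do not show: `view()` needs a compatible memory layout and shares memory with the original tensor, while `reshape()` falls back to copying when a view is not possible. The sketch below illustrates this on a transposed tensor; the names `t`, `v`, and `flat` are made up for this example.

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]])

# a view shares memory with the original tensor
v = t.view(3, 2)
v[0, 0] = 100
shares_memory = t[0, 0].item() == 100

# the transpose is not contiguous in memory, so view() cannot flatten it
t_T = t.T
try:
    t_T.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() copies the data when a view is not possible
flat = t_T.reshape(6)

print(shares_memory, view_failed)
print(flat)
```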

In [12]:
# transpose a tensor which means flipping it across its diagonal

tensor2d.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])

In [13]:
# .T on tensors with more than 2 dimensions is deprecated (PyTorch emits a warning);
# permute(2, 1, 0) reverses the dimensions explicitly and gives the same result
tensor3d.permute(2, 1, 0)

tensor([[[1, 5],
         [3, 7]],

        [[2, 6],
         [4, 8]]])

In [14]:
# a common way to multiply two matrices is the matmul() method
tensor2d.matmul(tensor2d.T)

tensor([[14, 32],
        [32, 77]])

In [15]:
# more common way, which achieves the same thing more compactly.

tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])

## 3. Seeing models as computation graphs

PyTorch’s autograd system provides functions to compute gradients in dynamic computational graphs automatically.

A computational graph is a directed graph that allows us to express and visualize mathematical expressions. In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network.


In [16]:
import torch.nn.functional as F

y = torch.tensor([1.0])  # true label
x1 = torch.tensor([1.1]) # input feature
w1 = torch.tensor([2.2]) # weight parameter
b = torch.tensor([0.0])  # bias unit

z = x1 * w1 + b          # net input
a = torch.sigmoid(z)     # activation & output

loss = F.binary_cross_entropy(a, y)
print(loss)

tensor(0.0852)
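
The printed loss can be reproduced without PyTorch by walking through the same graph by hand: net input, sigmoid activation, then binary cross-entropy. This is only a sanity-check sketch using plain Python floats.

```python
import math

x1, w1, b, y = 1.1, 2.2, 0.0, 1.0

z = x1 * w1 + b                     # net input
a = 1.0 / (1.0 + math.exp(-z))      # sigmoid activation

# binary cross-entropy for a single example
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))

print(round(loss, 4))
```

The rounded value should agree with the `tensor(0.0852)` printed by `F.binary_cross_entropy` above.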


The computational graph

In [17]:
from IPython.display import Image, display

# Display an image from the blogpost
display(Image(url="https://sebastianraschka.com/images/teaching/pytorch-1h/figure_07.webp"))

## 4. Automatic Differentiation made easy

PyTorch builds computation graphs by default if one of its terminal nodes has the ```requires_grad``` attribute set to True. This is useful if we want to compute gradients. Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the chain rule from calculus for neural networks.

In [18]:
from IPython.display import Image, display

# display an image
display(Image(url = "https://sebastianraschka.com/images/teaching/pytorch-1h/figure_08.webp"))

In [19]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)


In [20]:
print(grad_L_w1)
print(grad_L_b)


(tensor([-0.0898]),)
(tensor([-0.0817]),)


In [21]:
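# above we called grad() once per parameter; more commonly we call .backward() on the loss,
# which computes the gradients for all leaf tensors with requires_grad=True and stores them in .grad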

loss.backward()

print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])
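
The values above can be cross-checked with the chain rule. For a sigmoid output `a` with binary cross-entropy loss, the gradient with respect to the net input `z` simplifies to `a - y`, so the gradient for `b` is `a - y` and the gradient for `w1` is `(a - y) * x1`. The sketch below compares this hand-derived result with autograd; `manual_grad_w1` and `manual_grad_b` are names introduced only for this check.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)
loss.backward()  # autograd fills w1.grad and b.grad

# hand-derived gradients via the chain rule (no graph tracking needed here)
with torch.no_grad():
    manual_grad_b = a - y
    manual_grad_w1 = (a - y) * x1

print(w1.grad, manual_grad_w1)
print(b.grad, manual_grad_b)
```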


## 5. Implementing Multilayer Neural Networks

When implementing a neural network in PyTorch, we typically subclass the ```torch.nn.Module``` class to define our own custom network architecture. This ```Module``` base class provides a lot of functionality, making it easier to build and train models. For instance, it allows us to encapsulate layers and operations and keep track of the model’s parameters.

Within this subclass, we define the network layers in the ```__init__``` constructor and specify how they interact in the ```forward``` method. The ```forward``` method describes how the input data passes through the network and comes together as a computation graph.

In contrast, the backward method, which we typically do not need to implement ourselves, is used during training to compute gradients of the loss function with respect to the model parameters, as we will see in section 7, A Typical Training Loop.

The following code implements a classic multilayer perceptron with two hidden layers to illustrate a typical usage of the ```Module``` class:

In [22]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(
            # 1st layer
            torch.nn.Linear(num_inputs, 30), # 30 hidden units node
            torch.nn.ReLU(),

            # 2nd Layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs)
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits





In [23]:
model = NeuralNetwork(50, 3)

print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


Note that we used the ```Sequential``` class when we implemented the ```NeuralNetwork``` class. Using ```Sequential``` is not required, but it can make our life easier if we have a series of layers that we want to execute in a specific order, as is the case here.
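
For comparison, here is one way the same architecture might be written without ```Sequential```, storing each layer as its own attribute and chaining them in ```forward```. The class name `NeuralNetworkExplicit` and the attribute names are hypothetical choices for this sketch.

```python
import torch

class NeuralNetworkExplicit(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        # each layer is registered as a separate attribute
        self.fc1 = torch.nn.Linear(num_inputs, 30)
        self.fc2 = torch.nn.Linear(30, 20)
        self.out = torch.nn.Linear(20, num_outputs)

    def forward(self, x):
        # the order of operations is spelled out by hand
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.out(x)

model_explicit = NeuralNetworkExplicit(50, 3)
logits = model_explicit(torch.rand(4, 50))
num_params_explicit = sum(p.numel() for p in model_explicit.parameters())

print(logits.shape)
print(num_params_explicit)
```

Writing the layers out like this is more verbose, but it leaves room for non-sequential data flow such as skip connections.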

In [24]:
# total number of trainable parameters of this model

num_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)

print('Total number of trainable model parameters: ', num_params)

Total number of trainable model parameters:  2213


In the case of our neural network model with the two hidden layers above, these trainable parameters are contained in the ```torch.nn.Linear``` layers. A linear layer multiplies the inputs with a weight matrix and adds a bias vector. This is sometimes also referred to as a feedforward or fully connected layer.
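
The total of 2213 can be verified by hand: each linear layer contributes `inputs * outputs` weights plus `outputs` biases. A short arithmetic sketch, using the layer sizes of the model above:

```python
# (in_features, out_features) of the three linear layers
layer_sizes = [(50, 30), (30, 20), (20, 3)]

# weights plus biases per layer
per_layer = [n_in * n_out + n_out for n_in, n_out in layer_sizes]
total = sum(per_layer)

print(per_layer)  # parameters per layer
print(total)      # total number of trainable parameters
```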


In [25]:
print(model.layers)

Sequential(
  (0): Linear(in_features=50, out_features=30, bias=True)
  (1): ReLU()
  (2): Linear(in_features=30, out_features=20, bias=True)
  (3): ReLU()
  (4): Linear(in_features=20, out_features=3, bias=True)
)


In [26]:
# we can print the weights of each layer

print(model.layers[0].weight)

Parameter containing:
tensor([[-0.1136, -0.0423, -0.0922,  ..., -0.0141, -0.0605,  0.0980],
        [-0.0966, -0.0946, -0.0097,  ...,  0.0683,  0.1382, -0.0259],
        [-0.0043,  0.1131,  0.0089,  ...,  0.0789, -0.1252, -0.0931],
        ...,
        [-0.0827,  0.0314,  0.0436,  ...,  0.0404, -0.1295,  0.0118],
        [ 0.0867,  0.0217,  0.1398,  ..., -0.0560, -0.1069,  0.1034],
        [-0.0395,  0.0778,  0.1309,  ...,  0.0416,  0.0698,  0.1371]],
       requires_grad=True)


In [27]:
# to get the shape of the large weights matrix

print(model.layers[0].weight.shape)

torch.Size([30, 50])


In [28]:
# we can get the bias vector

print(model.layers[0].bias)
print(model.layers[0].bias.shape)

Parameter containing:
tensor([-0.0755,  0.0843, -0.0975,  0.0156, -0.1404,  0.1349, -0.0643, -0.0745,
        -0.0517,  0.0967, -0.0285, -0.1308,  0.0306,  0.0543,  0.1243, -0.1403,
        -0.0986, -0.0761, -0.0456,  0.0312,  0.0408, -0.0600,  0.1393,  0.0653,
        -0.0293, -0.0007, -0.0749,  0.1156, -0.0832, -0.0652],
       requires_grad=True)
torch.Size([30])


In [29]:
# forward pass
torch.manual_seed(123)

X = torch.rand((1, 50))
print(X)# note the shape, this is a row vector
out = model(X)
print(out)
print(out.shape) #another row vector



tensor([[0.2961, 0.5166, 0.2517, 0.6886, 0.0740, 0.8665, 0.1366, 0.1025, 0.1841,
         0.7264, 0.3153, 0.6871, 0.0756, 0.1966, 0.3164, 0.4017, 0.1186, 0.8274,
         0.3821, 0.6605, 0.8536, 0.5932, 0.6367, 0.9826, 0.2745, 0.6584, 0.2775,
         0.8573, 0.8993, 0.0390, 0.9268, 0.7388, 0.7179, 0.7058, 0.9156, 0.4340,
         0.0772, 0.3565, 0.1479, 0.5331, 0.4066, 0.2318, 0.4545, 0.9737, 0.4606,
         0.5159, 0.4220, 0.5786, 0.9455, 0.8057]])
tensor([[0.1446, 0.0256, 0.2507]], grad_fn=<AddmmBackward0>)
torch.Size([1, 3])


Here, grad_fn=<AddmmBackward0> represents the last-used function to compute a variable in the computational graph. In particular, grad_fn=<AddmmBackward0> means that the tensor we are inspecting was created via a matrix multiplication and addition operation. PyTorch will use this information when it computes gradients during backpropagation. The <AddmmBackward0> part of grad_fn=<AddmmBackward0> specifies the operation that was performed. In this case, it is an Addmm operation. Addmm stands for matrix multiplication (mm) followed by an addition (Add).

If we just want to use a network without training or backpropagation, for example, if we use it for prediction after training, constructing this computational graph for backpropagation can be wasteful as it performs unnecessary computations and consumes additional memory. So, when we use a model for inference (for instance, making predictions) rather than training, it is a best practice to use the torch.no_grad() context manager, as shown below. This tells PyTorch that it doesn’t need to keep track of the gradients, which can result in significant savings in memory and computation.



In [30]:
with torch.no_grad():
    # don't keep the compute graph in memory
    X_tens = torch.rand((10, 50))
    out = model(X_tens)
    print(out)
    print(out.shape)

tensor([[0.1780, 0.0409, 0.2768],
        [0.1568, 0.0643, 0.2248],
        [0.1621, 0.0524, 0.2411],
        [0.1547, 0.0537, 0.2410],
        [0.1406, 0.0730, 0.2086],
        [0.1495, 0.0386, 0.2437],
        [0.1804, 0.0594, 0.2459],
        [0.1551, 0.0822, 0.1894],
        [0.1605, 0.0519, 0.2508],
        [0.1410, 0.0543, 0.2469]])
torch.Size([10, 3])
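
To see the effect of ```torch.no_grad()``` directly, the following sketch runs the same layer with and without the context manager and inspects the `requires_grad` and `grad_fn` attributes of the outputs. A single `Linear` layer stands in for the full model here.

```python
import torch

layer = torch.nn.Linear(50, 3)
x = torch.rand(1, 50)

# regular forward pass: the output is part of a computation graph
out_tracked = layer(x)

# forward pass without graph construction
with torch.no_grad():
    out_untracked = layer(x)

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn)
```

The numerical values of both outputs are the same; only the bookkeeping for backpropagation differs.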


## 6. Setting up Efficient data loaders

In [31]:
# overall idea behind data loading in PyTorch

display(Image(url = 'https://sebastianraschka.com/images/teaching/pytorch-1h/figure_10.webp'))

PyTorch implements a ```Dataset``` and a ```DataLoader``` class. The ```Dataset``` class is used to instantiate objects that define how each data record is loaded. The ```DataLoader``` handles how the data is shuffled and assembled into batches.

In [32]:
# code to create a dataset

X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])


In [33]:
X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])


In [34]:
# we create a custom dataset class.
from torch.utils.data import Dataset

class ToyDataset(Dataset):

    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]


In [35]:
train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)
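
When the features and labels are already tensors in memory, as they are here, PyTorch's built-in `TensorDataset` offers similar indexing behavior without a custom class. This is an alternative sketch, not what the blog post uses; a custom ```Dataset``` remains the more flexible option, for example when records are loaded lazily from disk.

```python
import torch
from torch.utils.data import TensorDataset

X = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5],
])
y = torch.tensor([0, 0, 0, 1, 1])

# wraps the tensors and indexes them along the first dimension
ds = TensorDataset(X, y)
features, label = ds[0]

print(len(ds))
print(features, label)
```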

In [36]:
# we can use PyTorch's DataLoader class to sample from the dataset

from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset = train_ds,
    batch_size= 2,
    shuffle=True,
    num_workers=0  # number of subprocesses used for data loading; 0 loads the data in the main process
)

In [37]:
test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,  # shuffling is not needed for test data
    num_workers=0
)

In [38]:
# iterating over a DataLoader works like iterating over any other Python iterable

for idx, (X, y) in enumerate(train_loader):
    print(f'Batch {idx + 1}: ', X, y)

Batch 1:  tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2:  tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3:  tensor([[ 2.7000, -1.5000]]) tensor([1])


In [39]:
# a second pass yields the batches in a different order: with shuffle=True the data is
# reshuffled every epoch, so the network sees the examples in a different sequence

for idx, (X, y) in enumerate(train_loader):
    print(f'Batch {idx + 1}: ', X, y)

Batch 1:  tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2:  tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 3:  tensor([[ 2.7000, -1.5000]]) tensor([1])


In [40]:
# the last batch contains only one example because the 5 training examples are not evenly
# divisible by batch_size=2; a much smaller last batch can disturb convergence during
# training, so it is common to drop it with drop_last=True

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle= True,
    num_workers=0,
    drop_last=True
)

In [41]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)


Batch 1: tensor([[-0.9000,  2.9000],
        [ 2.3000, -1.1000]]) tensor([0, 1])
Batch 2: tensor([[ 2.7000, -1.5000],
        [-0.5000,  2.6000]]) tensor([1, 0])
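
The number of batches per epoch follows from the dataset size and the batch size: rounding up when the last partial batch is kept, rounding down when it is dropped. A minimal sketch with a placeholder dataset of 5 zero-valued examples (the data itself is irrelevant here):

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset with 5 examples
ds = TensorDataset(torch.zeros(5, 2), torch.zeros(5, dtype=torch.long))

keep_last = DataLoader(ds, batch_size=2, drop_last=False)
drop_last = DataLoader(ds, batch_size=2, drop_last=True)

print(len(keep_last), math.ceil(5 / 2))  # partial batch kept
print(len(drop_last), 5 // 2)            # partial batch dropped

# batch sizes actually produced when the last batch is kept
sizes = [len(xb) for xb, _ in keep_last]
print(sizes)
```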


## 7. A Typical Training Loop

In [42]:
import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2) # instantiating the model
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5) # instantiating the optimizer we are using

num_epochs = 3 # number of full passes over the training set

for epoch in range(num_epochs):

    model.train() # set in training mode
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)

        loss = F.cross_entropy(logits, labels) # Loss function

        optimizer.zero_grad() # clean the previous results
        loss.backward() # compute the gradients
        optimizer.step() # use the optimizer to step along the gradients

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval() # eval mode
    # Optional model evaluation
    with torch.no_grad():
        outputs = model(X_train)

print(outputs)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=2, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=2, bias=True)
  )
)
Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00
tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])
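
Note that the model returns raw logits and that `F.cross_entropy` is applied to them directly. This works because `cross_entropy` applies the log-softmax internally. The sketch below checks this equivalence on two made-up logit rows.

```python
import torch
import torch.nn.functional as F

# made-up logits for two examples and two classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
labels = torch.tensor([0, 1])

# cross_entropy on raw logits ...
loss_a = F.cross_entropy(logits, labels)

# ... matches log-softmax followed by the negative log-likelihood loss
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss_a, loss_b)
```

For this reason the model's ```forward``` method should not apply softmax itself when it is trained with `F.cross_entropy`.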


In [43]:
torch.set_printoptions(sci_mode=False) # disable scientific notation when printing tensors
probas = torch.softmax(outputs, dim=1)
print(probas)



tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])


In [44]:
predictions = torch.argmax(probas, dim=1) # index of the largest value along dim 1, i.e., the predicted class for each row
print(predictions)

tensor([0, 0, 0, 1, 1])


In [45]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)


tensor([0, 0, 0, 1, 1])
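
Both cells give the same predictions because softmax preserves the ordering of the values within each row, so the largest logit always maps to the largest probability. A small sketch that computes softmax by hand for two of the logit rows printed above and confirms this:

```python
import torch

# two rows taken from the model outputs above
logits = torch.tensor([[2.8569, -4.1618], [-1.4814, 1.4816]])

# softmax by hand: exponentiate, then normalize each row to sum to 1
exp = torch.exp(logits)
manual = exp / exp.sum(dim=1, keepdim=True)

probas = torch.softmax(logits, dim=1)

same_pred = torch.equal(
    torch.argmax(logits, dim=1), torch.argmax(probas, dim=1)
)

print(manual)
print(same_pred)
```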


In [46]:
predictions == y_train

tensor([True, True, True, True, True])

In [47]:
# using torch.sum, we can count the number of correct predictions as follows:

torch.sum(predictions == y_train)

tensor(5)

In [48]:
# To generalize the computation of the prediction accuracy, we implement a helper function

def compute_accuracy(model, dataloader):

    # switch the model to evaluation mode
    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):
        with torch.no_grad():
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)  # number of correct predictions in this batch
        total_examples += len(compare)

    return (correct / total_examples).item()  # convert the 0-d tensor to a Python float

In [49]:

accuracy = compute_accuracy(model, train_loader)

In [50]:
accuracy

1.0

## 8. Saving and Loading models

In [51]:
# The recommended way to save and load models in PyTorch

torch.save(model.state_dict(), 'model.pth')

In [52]:
torch.save(model.state_dict(),  'model.pkl')

The model's ```state_dict``` is a Python dictionary object that maps each layer in the model to its trainable parameters (weights and biases). Note that "model.pth" is an arbitrary filename for the model file saved to disk. We can give it any name and file ending we like; however, .pth and .pt are the most common conventions.
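
To get a feel for what a ```state_dict``` contains, the sketch below prints the keys and tensor shapes for a small stand-in network. It uses a bare `Sequential` rather than the `NeuralNetwork` class, so the keys start with the layer index; for `NeuralNetwork` they would carry an additional `layers.` prefix.

```python
import torch

# small stand-in network (not the model from section 5)
net = torch.nn.Sequential(
    torch.nn.Linear(2, 30),
    torch.nn.ReLU(),
    torch.nn.Linear(30, 2),
)

# keys are parameter names, values are the parameter tensors
for name, tensor in net.state_dict().items():
    print(name, tuple(tensor.shape))

keys = list(net.state_dict().keys())
```

The ReLU layer does not appear because it has no parameters.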

Once we have saved the model, we can restore it from disk as follows:


In [53]:
model = NeuralNetwork(2, 2) # needs to match the original model exactly
model.load_state_dict(torch.load('model.pth', weights_only=True))

<All keys matched successfully>

In [54]:
model2 = NeuralNetwork(2,2)
model2.load_state_dict(torch.load('model.pkl', weights_only=True))

<All keys matched successfully>

## 9. Optimizing training performance with GPUs

### 9.1 PyTorch computations on GPU devices

Modifying the training loop above to optionally run on a GPU is relatively simple and only requires changing three lines of code.
Before we make the modifications, it’s crucial to understand the main concept behind GPU computations within PyTorch.

First, we need to introduce the notion of devices. In PyTorch, a device is where computations occur, and data resides. The CPU and the GPU are examples of devices. A PyTorch tensor resides in a device, and its operations are executed on the same device.

Let’s see how this works in action. Assuming that you installed a GPU-compatible version of PyTorch as explained in section 1.2, Installing PyTorch, we can double-check that our runtime indeed supports GPU computing via the following code:

In [55]:
print(torch.cuda.is_available())

True


In [57]:
# by default, the computations run on CPU
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])

print(tensor_1 + tensor_2)


tensor([5., 7., 9.])


In [62]:
tensor_1 = tensor_1.to("cuda") # transfer these tensors onto a GPU and perform the addition there
tensor_2 = tensor_2.to("cuda")

print(tensor_1 + tensor_2)


tensor([5., 7., 9.], device='cuda:0')


Notice that the resulting tensor now includes the device information, device='cuda:0', which means that the tensors reside on the first GPU. If your machine hosts multiple GPUs, you have the option to specify which GPU you’d like to transfer the tensors to. You can do this by indicating the device ID in the transfer command. For instance, you can use .to("cuda:0"), .to("cuda:1"), and so on.

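However, it is important to note that all tensors involved in an operation must be on the same device. Otherwise, the computation will fail. The two cells below show two failure cases: first, requesting a GPU that does not exist (`cuda:1` on a single-GPU machine), and second, adding a tensor that resides on the CPU to one that resides on the GPU: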

In [64]:
tensor_1 = tensor_1.to("cuda:1") # try to move the tensors to a second GPU
tensor_2 = tensor_2.to("cuda:1")

print(tensor_1 + tensor_2)

# this fails because the machine has only one GPU (cuda:0)

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [65]:
# the tensors have to be on the same device
tensor_1 = tensor_1.to("cpu")
print(tensor_1 + tensor_2)


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
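
One way to avoid this error is to check the `.device` attribute and move one tensor to the device of the other before the operation. The following sketch does that and also runs on a machine without a GPU, where both tensors simply stay on the CPU.

```python
import torch

# pick the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t1 = torch.tensor([1.0, 2.0, 3.0])                  # created on the CPU
t2 = torch.tensor([4.0, 5.0, 6.0], device=device)   # created on the chosen device

# move t1 to wherever t2 lives before combining them
t1 = t1.to(t2.device)
result = t1 + t2

print(result.device)
print(result.cpu())
```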

### 9.2 Single-GPU Training


In [70]:
torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)

# New: Define a device variable that defaults to GPU but falls back to cpu if not available

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# same loop

optimizer = torch.optim.SGD(model.parameters(), lr = 0.5)

num_epochs =  3

for epoch in range(num_epochs):
  model.train()
  for batch_idx, (features, labels) in enumerate(train_loader):
    features, labels = features.to(device), labels.to(device)
    logits = model(features)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad() # clean gradients from last round
    loss.backward()
    optimizer.step()

    ## Logging
    print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
                f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
                f" | Train/Val Loss: {loss:.2f}")

  model.eval()


Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00
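
The `compute_accuracy` helper from section 7 passes batches to the model without moving them, so it would likely hit the device mismatch error from section 9.1 once the model lives on the GPU. A possible device-aware variant is sketched below; the function name `compute_accuracy_on_device` and the random demo data are assumptions made for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def compute_accuracy_on_device(model, dataloader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for features, labels in dataloader:
            # move each batch to the same device as the model
            features, labels = features.to(device), labels.to(device)
            predictions = torch.argmax(model(features), dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.shape[0]
    return correct / total

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# untrained stand-in model and random data, only to exercise the function
demo_model = torch.nn.Linear(2, 2).to(device)
demo_data = TensorDataset(torch.rand(6, 2), torch.randint(0, 2, (6,)))
demo_loader = DataLoader(demo_data, batch_size=2)

acc = compute_accuracy_on_device(demo_model, demo_loader, device)
print(acc)
```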
