## Acknowledgement

This notebook is based on the book **["Build a Large Language Model (From Scratch)"](https://www.manning.com/books/build-a-large-language-model-from-scratch)** by **Sebastian Raschka**, published by Manning Publications.

- [Book on Manning](https://www.manning.com/books/build-a-large-language-model-from-scratch)
- [GitHub Repository](https://github.com/rasbt/LLMs-from-scratch)

---

# Basics of PyTorch

In [None]:
import torch
print(torch.__version__)

2.5.1+cu121


In [None]:
import torch
torch.cuda.is_available()

True

## Understanding tensors

In [None]:
import torch
tensor0d = torch.tensor(1)
tensor1d = torch.tensor([1,2,3])
tensor2d = torch.tensor([[1, 2, 3], [3, 4, 5]])
tensor3d = torch.tensor([[[1,2,3,4], [5,6,7,8], [1,1,1,1]],
                         [[7,8,9,10], [11,12,13,14], [2,2,2,2]]])

In [None]:
print("Shape of tensor2d:", tensor2d.shape)

Shape of tensor2d: torch.Size([2, 3])


In [None]:
print("Shape of tensor3d:", tensor3d.shape)

Shape of tensor3d: torch.Size([2, 3, 4])


In [None]:
print(tensor0d.dtype)

torch.int64


In [None]:
print(tensor1d.dtype)

torch.int64


In [None]:
print(tensor3d.dtype)

torch.int64


"If we create tensors from Python **floats**, PyTorch creates tensors with a 32-bit precision by default:"

In [None]:
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


A 32-bit floating-point number offers sufficient precision for most deep learning tasks. "It is possible to change the precision using a tensor’s **.to** method":

In [None]:
floatvec = tensor1d.to(torch.float32)
print(floatvec.dtype)

torch.float32


##  Common PyTorch tensor operations

In [None]:
print(tensor2d)

tensor([[1, 2, 3],
        [3, 4, 5]])


We can reshape the tensor into a 3 × 2 tensor.

In [None]:
print(tensor2d.reshape(3, 2))

tensor([[1, 2],
        [3, 3],
        [4, 5]])


A more common way to reshape tensors in PyTorch is **.view()**.

In [None]:
print(tensor2d.view(3, 2))

tensor([[1, 2],
        [3, 3],
        [4, 5]])


".view() requires the original data to be contiguous and will fail if it isn’t, whereas .reshape() will work regardless, copying the data if necessary to ensure the desired shape"

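The difference between the two methods only shows up on non-contiguous tensors. The following is a minimal sketch, assuming a transposed copy of `tensor2d` as the non-contiguous input; the variable names `transposed`, `flat`, and `flat_view` are illustrative and not part of the original notebook.

```python
import torch

tensor2d = torch.tensor([[1, 2, 3], [3, 4, 5]])

# .T returns a view with swapped strides, so the result is not contiguous
transposed = tensor2d.T
print(transposed.is_contiguous())  # False

# .view() cannot reinterpret non-contiguous memory and raises a RuntimeError
try:
    transposed.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print("view failed:", view_failed)  # view failed: True

# .reshape() copies the data when it has to, so it succeeds
flat = transposed.reshape(6)
print(flat)  # tensor([1, 3, 2, 4, 3, 5])

# Making the copy explicit with .contiguous() lets .view() work again
flat_view = transposed.contiguous().view(6)
```

In practice, `.view()` is a reasonable default when you want a guarantee that no copy is made, and `.reshape()` when you only care about the resulting shape.
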
### Transpose -- .T

In [None]:
print(tensor2d)

tensor([[1, 2, 3],
        [3, 4, 5]])


In [None]:
print(tensor2d.shape)

torch.Size([2, 3])


In [None]:
print(tensor2d.T)

tensor([[1, 3],
        [2, 4],
        [3, 5]])


In [None]:
print(tensor2d.shape)

torch.Size([2, 3])


**.T** returns the transposed matrix but does not change the original tensor unless you assign the result back to it.

### Matrix multiplication -- .matmul()

In [None]:
print(tensor2d.matmul(tensor2d.T))

tensor([[14, 26],
        [26, 50]])


### Matrix multiplication -- **@**

In [None]:
print(tensor2d @ tensor2d.T)

tensor([[14, 26],
        [26, 50]])


## Computation Graph

### A logistic regression forward pass

In [None]:
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2])
b = torch.tensor([0.0])
z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)
print(loss)

tensor(0.0852)


### Computing gradients via autograd

In [None]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)


In [None]:
print(grad_L_w1[0])
print(grad_L_b[0])

tensor([-0.0898])
tensor([-0.0817])
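
As a sanity check, the same gradients can be derived by hand with the chain rule. This is a sketch in plain Python (no autograd), assuming the standard identity that for a sigmoid followed by binary cross-entropy, the derivative of the loss with respect to `z` is `a - y`; the names `dL_dz`, `grad_w1`, and `grad_b` are illustrative.

```python
import math

y, x1, w1, b = 1.0, 1.1, 2.2, 0.0

# Forward pass, mirroring the PyTorch code
z = x1 * w1 + b
a = 1.0 / (1.0 + math.exp(-z))                          # sigmoid
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # binary cross-entropy

# Backward pass via the chain rule
dL_dz = a - y          # derivative of BCE(sigmoid(z), y) with respect to z
grad_w1 = dL_dz * x1   # dz/dw1 = x1
grad_b = dL_dz         # dz/db = 1

print(f"loss={loss:.4f}  dL/dw1={grad_w1:.4f}  dL/db={grad_b:.4f}")
# loss=0.0852  dL/dw1=-0.0898  dL/db=-0.0817
```

The values agree with what `grad` returned, which suggests that autograd is applying the same chain rule over the computation graph.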


 "we can call **.backward** on the loss, and PyTorch will compute the gradients of all the leaf nodes in the graph, which will be stored via the tensors’ **.grad** attributes:"

In [None]:
loss.backward()

In [None]:
print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])


## Implementing multilayer neural networks

### A multilayer perceptron with two hidden layers

In [None]:
class NeuralNetwork(torch.nn.Module):
  def __init__(self, num_inputs, num_outputs):
    super().__init__()

    self.layers = torch.nn.Sequential(
        # 1st hidden layer -- 30 nodes in the hidden layer
        torch.nn.Linear(num_inputs, 30),
        torch.nn.ReLU(),

        # 2nd hidden layer -- 20 nodes or hidden units
        torch.nn.Linear(30, 20),
        torch.nn.ReLU(),

        # output layer
        torch.nn.Linear(20, num_outputs)
    )

  def forward(self, x):
    logits = self.layers(x)
    return logits

Instantiate a new neural network object

In [None]:
model = NeuralNetwork(50, 3) # Number of inputs is 50 and number of outputs is 3

In [None]:
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


### Total number of trainable parameters of this model

In [None]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable parameters:", num_params)

Total number of trainable parameters: 2213
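
The total can be checked by hand: each `Linear` layer holds an `out_features × in_features` weight matrix plus one bias per output unit. A short sketch in plain Python, using the layer sizes of the model above (the names `layer_sizes` and `per_layer` are illustrative):

```python
# (in_features, out_features) for the three Linear layers of NeuralNetwork(50, 3)
layer_sizes = [(50, 30), (30, 20), (20, 3)]

# weights (n_in * n_out) plus biases (n_out) for each layer
per_layer = [n_in * n_out + n_out for n_in, n_out in layer_sizes]

print(per_layer)       # [1530, 620, 63]
print(sum(per_layer))  # 2213
```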


### Weight of layers

#### Weight of the first Linear layer

In [None]:
print(model.layers[0].weight)

Parameter containing:
tensor([[ 0.0902, -0.0872, -0.0647,  ...,  0.0258, -0.0511,  0.0559],
        [ 0.0193,  0.0288, -0.0526,  ...,  0.0612,  0.0223, -0.0825],
        [-0.0870, -0.0267, -0.1318,  ...,  0.1013,  0.0416, -0.0316],
        ...,
        [-0.0869,  0.1370, -0.1022,  ...,  0.1283, -0.0957, -0.0624],
        [ 0.0624,  0.0079, -0.0554,  ...,  0.0250, -0.0850, -0.1238],
        [ 0.0598, -0.0201, -0.1021,  ..., -0.0553, -0.0375,  0.0615]],
       requires_grad=True)


In [None]:
print(model.layers[0].weight.shape)

torch.Size([30, 50])


"We can make the random number initialization reproducible by seeding PyTorch’s random number generator via **manual_seed**"

In [None]:
torch.manual_seed(123)
model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)


In [None]:
torch.manual_seed(123)
X = torch.rand((1, 50))  # shape (1, 50); each element is sampled uniformly
# from [0, 1), i.e., 0 is included and 1 is excluded
out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


In [None]:
X

tensor([[0.2961, 0.5166, 0.2517, 0.6886, 0.0740, 0.8665, 0.1366, 0.1025, 0.1841,
         0.7264, 0.3153, 0.6871, 0.0756, 0.1966, 0.3164, 0.4017, 0.1186, 0.8274,
         0.3821, 0.6605, 0.8536, 0.5932, 0.6367, 0.9826, 0.2745, 0.6584, 0.2775,
         0.8573, 0.8993, 0.0390, 0.9268, 0.7388, 0.7179, 0.7058, 0.9156, 0.4340,
         0.0772, 0.3565, 0.1479, 0.5331, 0.4066, 0.2318, 0.4545, 0.9737, 0.4606,
         0.5159, 0.4220, 0.5786, 0.9455, 0.8057]])

"In the preceding code, we generated a single random training example X as a toy input (note that our network expects 50-dimensional feature vectors) and fed it to the model, returning three scores. When we call model(x), it will automatically execute the forward pass of the model."

"When we use a model for inference (for instance, making predictions) rather than training, the best practice is to use the **torch.no_grad()** context manager. This tells PyTorch that it doesn’t need to keep track of the gradients, which can result in significant savings in memory and computation:"

In [None]:
with torch.no_grad():
  out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])


"So, if we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly:"

In [None]:
with torch.no_grad():
  out = torch.softmax(model(X), dim=1)
print(out)

tensor([[0.3113, 0.3934, 0.2952]])


#### Why dim=1 in This Code?
##### Input to the Model
- X has a shape of (1, 50), meaning it has 1 row and 50 features (or inputs).
- model(X) produces logits, likely for a classification problem.

##### Output of model(X)
- Assuming model(X) outputs a tensor of shape (batch_size, num_classes):

  - batch_size: Number of input samples (1 in this case).
  - num_classes: Number of possible classes for classification.
- For example, if model(X) outputs a tensor of shape (1, 10), this corresponds to:

  - 1 sample in the batch.
  - 10 raw scores (logits) for 10 possible classes.
##### dim=1 Meaning
- Dimension 1 corresponds to the second axis (columns) in a 2D tensor. Applying softmax along dim=1 means:
  - For each row (sample) in the tensor, the raw scores (logits) in all columns (classes) are normalized to sum to 1.

##### Why Normalize Along dim=1?
- In classification problems, the softmax is applied across the class dimension (columns) to produce probabilities for each class.
- Each row in the output tensor will represent the probability distribution for one sample.
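
The effect of the `dim` argument can be seen on a small example. This is a sketch assuming a hypothetical batch of two samples and three classes; the tensor values and the names `probs_rows` and `probs_cols` are made up for illustration.

```python
import torch

# Hypothetical logits with shape (batch_size=2, num_classes=3)
logits = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])

probs_rows = torch.softmax(logits, dim=1)  # normalize across classes, per sample
probs_cols = torch.softmax(logits, dim=0)  # normalize across samples, per class

print(probs_rows.sum(dim=1))  # each row sums to 1
print(probs_cols.sum(dim=0))  # each column sums to 1
```

Only `dim=1` yields one probability distribution per sample. With `dim=0`, the values would be normalized across the batch, which is rarely what a classifier needs.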


## Setting up efficient data loaders

Create a toy training set of five observations with two features each, along with a tensor containing the corresponding class labels: three examples belong to class 0 and two examples belong to class 1. In addition, make a test set consisting of two entries.

In [None]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])
y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor([[-0.8, 2.8],
                       [2.6, -1.6],])
y_test = torch.tensor([0, 1])

**PyTorch requires that class labels start with label 0**, and the largest label value must not exceed the number of output nodes minus 1.
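
If a dataset uses labels that do not start at 0, they can be remapped before training. The following is one possible sketch, assuming hypothetical raw labels 1, 2, and 3; it uses `torch.unique` with `return_inverse=True`, and the names `raw_labels`, `classes`, and `labels` are illustrative.

```python
import torch

# Hypothetical labels that start at 1 instead of 0
raw_labels = torch.tensor([1, 3, 2, 3, 1])

# classes: sorted unique values; labels: index of each raw label within classes
classes, labels = torch.unique(raw_labels, return_inverse=True)

print(classes)  # tensor([1, 2, 3])
print(labels)   # tensor([0, 2, 1, 2, 0])
```

Keeping `classes` around makes it possible to map predictions back to the original label values later via `classes[predictions]`.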

#### Create a custom dataset class

In [None]:
from torch.utils.data import Dataset

class ToyDataset(Dataset):
  def __init__(self, X, y):
    self.features = X
    self.labels = y

  def __getitem__(self, index):
    one_x = self.features[index]
    one_y = self.labels[index]
    return one_x, one_y

  def __len__(self):
    return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

In [None]:
print(len(train_ds))

5


After defining the custom Dataset class and wrapping our toy data in it, we can use PyTorch’s **DataLoader** class to sample from it.

#### Instantiating data loaders

In [None]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(dataset=train_ds,
                          batch_size = 2,
                          shuffle=True,
                          num_workers=0
                          )
# The num_workers parameter in the DataLoader specifies the number of worker
# processes used for loading data. It controls how many separate subprocesses
# are used to fetch data in parallel while training or testing a model.

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

Iterating over data

In [None]:
for idx, (x, y) in enumerate(train_loader):
  print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])


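Set `drop_last=True` to drop the last batch in each epoch when it is smaller than `batch_size` (here, the single leftover example). A substantially smaller final batch can disturb convergence during training.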

In [None]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

In [None]:
for idx, (x, y) in enumerate(train_loader):
  print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])


### A typical training loop

In [None]:
import torch.nn.functional as F

torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3
for epoch in range(num_epochs):
  model.train()
  for batch_idx, (features, labels) in enumerate(train_loader):
    logits = model(features)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Logging
    print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
          f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
          f" | Train Loss: {loss:.2f}")
  model.eval()  # after the batch loop; optional evaluation code would go here

Epoch: 001/003 | Batch 000/002 | Train Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train Loss: 0.00


In [None]:
# Instantiate the model with 2 input features and 2 output classes
model = NeuralNetwork(num_inputs=2, num_outputs=2)

# Define an optimizer (Stochastic Gradient Descent) with a learning rate of 0.5
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# Number of epochs to train the model
num_epochs = 3

# train_loader (defined above) yields batches of (features, labels)
for epoch in range(num_epochs):  # Loop over each epoch
    model.train()  # Set the model to training mode (important for layers like dropout, batch norm)
    for batch_idx, (features, labels) in enumerate(train_loader):  # Loop over batches of data
        # Forward pass: compute predicted logits (unnormalized scores)
        logits = model(features)

        # Compute the loss using cross-entropy
        # Cross-entropy is suitable for classification tasks
        loss = F.cross_entropy(logits, labels)

        # Zero out gradients to prevent accumulation from previous steps
        optimizer.zero_grad()

        # Backward pass: compute gradients of the loss w.r.t model parameters
        loss.backward()

        # Update model parameters based on computed gradients
        optimizer.step()
        #  The optimizer.step() method will use the gradients to update the model
        # parameters to minimize the loss. In the case of the SGD optimizer, this
        # means multiplying the gradients with the learning rate and adding the
        # scaled negative gradient to the parameters.

        # Logging: Print training progress
        # f-strings are used for formatting the output to display relevant info clearly
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"  # Current epoch and total epochs
              f" | Batch {batch_idx+1:03d}/{len(train_loader):03d}"  # Current batch and total batches
              f" | Train Loss: {loss.item():.2f}")  # Training loss for the current batch

    model.eval()  # After the batch loop: switch to evaluation mode before any validation code


Epoch: 001/003 | Batch 001/002 | Train Loss: 0.69
Epoch: 001/003 | Batch 002/002 | Train Loss: 0.49
Epoch: 002/003 | Batch 001/002 | Train Loss: 0.34
Epoch: 002/003 | Batch 002/002 | Train Loss: 0.20
Epoch: 003/003 | Batch 001/002 | Train Loss: 0.04
Epoch: 003/003 | Batch 002/002 | Train Loss: 0.22


1. Format Specifiers Used in Logging:

  - `f"{epoch+1:03d}"`: Formats the epoch number as a 3-digit number with leading zeros, e.g., "001".
  - `len(train_loader)`: Gives the total number of batches in the data loader.
  - `loss.item()`: Converts the single-element loss tensor to a Python float for display.
  - `.2f`: Formats the loss to two decimal places for readability.

2. Why `model.train()` and `model.eval()` Are Used:

  - `model.train()`: Puts the model in training mode, which matters for layers such as dropout and batch normalization that behave differently during training.
  - `model.eval()`: Puts the model in evaluation mode, disabling dropout and using running statistics for batch normalization. It belongs after the inner batch loop, right before validation or testing, not inside the loop over batches. The `NeuralNetwork` class here contains neither layer type, so both calls have no effect on its outputs, but including them is good practice.

3. Why `optimizer.zero_grad()` Is Used:

  - Gradients accumulate in PyTorch by default. Before computing the new gradients, we need to zero them out to prevent incorrect updates.

4. Purpose of Cross-Entropy Loss:

  - Cross-entropy loss is suitable for classification tasks, measuring the difference between predicted probabilities and actual class labels.

5. Learning Rate:

  - The learning rate (`lr=0.5`) determines the step size for parameter updates. A high learning rate can cause the model to converge quickly or fail to converge, while a small one may slow training.
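
The update rule described in the training-loop comments (scale the gradient by the learning rate and subtract it from the parameter) can be verified against `optimizer.step()`. This is a minimal sketch, assuming plain SGD without momentum or weight decay and a single hypothetical `Linear` layer standing in for the full model; `expected_weight` is an illustrative name.

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Linear(2, 1)  # stand-in for the full model
optimizer = torch.optim.SGD(layer.parameters(), lr=0.5)

x = torch.tensor([[1.0, 2.0]])
loss = layer(x).sum()  # toy loss, only used to produce gradients

optimizer.zero_grad()
loss.backward()

# Manual SGD update: w_new = w - lr * grad
expected_weight = layer.weight.detach() - 0.5 * layer.weight.grad

optimizer.step()
print(torch.allclose(layer.weight.detach(), expected_weight))  # True
```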

#### Why optimizer.zero_grad() Is Used
Before computing gradients for a new batch, you need to clear the old gradients to ensure they don’t interfere with the new ones.

**Without** `optimizer.zero_grad()`:
  - Gradients will accumulate over multiple forward-backward passes, leading to incorrect parameter updates.
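
The accumulation is easy to observe on a single parameter. In this sketch, `w.grad.zero_()` stands in for `optimizer.zero_grad()` because no optimizer is involved; the tensors and the names `first`, `accumulated`, and `after_reset` are made up for illustration.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

# First backward pass: d(w * x)/dw = x = 3
(w * x).sum().backward()
first = w.grad.clone()
print(first)  # tensor([3.])

# Second backward pass without clearing: the new gradient is added to the old one
(w * x).sum().backward()
accumulated = w.grad.clone()
print(accumulated)  # tensor([6.])

# Clearing the gradient first restores the expected value
w.grad.zero_()
(w * x).sum().backward()
after_reset = w.grad.clone()
print(after_reset)  # tensor([3.])
```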

### Making Predictions

After we have trained the model, we can use it to make predictions:

In [None]:
model.eval()
with torch.no_grad():
  outputs = model(X_train)
print(outputs)

tensor([[ 2.4929, -1.8742],
        [ 2.2958, -1.7236],
        [ 1.9967, -1.4961],
        [-1.5813,  1.3215],
        [-1.7966,  1.5170]])


In [None]:
# Set the model to evaluation mode
# This ensures that certain layers, like dropout and batch normalization, behave appropriately during evaluation.
model.eval()

# Disable gradient calculations using torch.no_grad()
# Explanation:
# - Gradient calculations are not needed during evaluation or inference.
# - Disabling gradients reduces memory usage and speeds up computation.
# - This is especially useful when you are only interested in forward pass outputs.
with torch.no_grad():
    # Perform a forward pass through the model using the training data (X_train)
    # Explanation:
    # - The model takes X_train as input and produces outputs.
    # - This step generates predictions or activations from the model without updating weights.
    outputs = model(X_train)

# Print the outputs generated by the model
# Explanation:
# - This displays the predictions or activations generated from the forward pass.
# - Useful for verifying the model's output during evaluation.
print(outputs)

tensor([[ 2.4929, -1.8742],
        [ 2.2958, -1.7236],
        [ 1.9967, -1.4961],
        [-1.5813,  1.3215],
        [-1.7966,  1.5170]])


In [None]:
# Set print options to avoid scientific notation for better readability
# Explanation:
# - By default, PyTorch prints tensors in scientific notation for small or large values.
# - Setting `sci_mode=False` ensures the output is displayed in standard decimal format.
torch.set_printoptions(sci_mode=False)

# Apply the softmax function to the outputs along the specified dimension (dim=1)
# Explanation:
# - Softmax converts raw model outputs (logits) into probabilities.
# - It normalizes the logits into a range between 0 and 1, and the sum of values across the specified dimension equals 1.
# - `dim=1` specifies that softmax operates across the second dimension (columns) of the tensor.
probas = torch.softmax(outputs, dim=1)

# Print the probabilities generated by the softmax function
# Explanation:
# - This displays the normalized probabilities for each input in the batch.
# - Each row in the output tensor represents the probabilities for the two classes (binary classification in this case).
print(probas)


tensor([[0.9875, 0.0125],
        [0.9824, 0.0176],
        [0.9705, 0.0295],
        [0.0520, 0.9480],
        [0.0351, 0.9649]])


### Obtain class membership probabilities using `softmax`

In [None]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

tensor([[0.9875, 0.0125],
        [0.9824, 0.0176],
        [0.9705, 0.0295],
        [0.0520, 0.9480],
        [0.0351, 0.9649]])


1. Softmax Probabilities:

  - Each row corresponds to a single input sample in the batch.
  - Each element in a row represents the predicted probability of the sample belonging to a specific class.
  - The sum of probabilities in each row is always 1 due to the normalization property of softmax.

2. Example Analysis:

  - Row 1: [0.9875, 0.0125]:
    - The model is 98.75% confident that the first sample belongs to Class 0.
    - The probability for Class 1 is only 1.25%, suggesting a strong preference for Class 0.
  - Row 4: [0.0520, 0.9480]:
    - The model is 94.80% confident that the fourth sample belongs to Class 1.
    - It predicts Class 1 with high certainty compared to Class 0.

3. Interpreting the Results:

  - High Values: When one value in a row is much larger than the other, the model is confident about its prediction.
  - Low Certainty: If the values in a row are closer (e.g., [0.6, 0.4]), the model is less confident about its prediction.

#### Why Use Softmax:
Softmax is often used in classification problems because it provides interpretable probabilities, allowing you to assess the model's confidence in its predictions. These probabilities can be used to make decisions or evaluate model performance.

"We can convert these values into class label predictions using PyTorch’s argmax function, which returns the index position of the highest value in each row if we set `dim=1`"

In [None]:
predictions = torch.argmax(probas, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


"Note that it is unnecessary to compute softmax probabilities to obtain the class labels. We could also apply the argmax function to the logits (outputs) directly:"

In [None]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


 "Since the training dataset is relatively small, we could compare it to the true training labels by eye and see that the model is 100% correct."

In [None]:
predictions == y_train

tensor([True, True, True, True, True])

#### Number of correct predictions -- using `torch.sum()`

In [None]:
torch.sum(predictions == y_train)

tensor(5)

#### Generalize the computation of the prediction accuracy

In [None]:
def compute_accuracy(model, dataloader):
  model = model.eval()
  correct = 0.0
  total_examples = 0

  for idx, (features, labels) in enumerate(dataloader):
    with torch.no_grad():
      logits = model(features)

    predictions  = torch.argmax(logits, dim=1)
    compare = labels == predictions
    correct += torch.sum(compare)
    total_examples += len(compare)

  return (correct / total_examples).item()

In [None]:
print(compute_accuracy(model, train_loader))

1.0


We can apply the function to the test set:

In [None]:
print(compute_accuracy(model, test_loader))

1.0


### Saving and loading models

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = "/content/drive/My Drive/LLM from Scratch/model.pth"

In [None]:
torch.save(model.state_dict(), path)

`.pth` and `.pt` are the most common file extensions for saved PyTorch models. Note that `state_dict()` contains only the model's parameters, not the model class itself.
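
To restore a saved model, create an instance with the same architecture and load the saved `state_dict` into it. The sketch below is self-contained under two assumptions: a single `Linear` layer stands in for `NeuralNetwork`, and a temporary directory replaces the Google Drive path. The names `restored` and `same` are illustrative.

```python
import os
import tempfile

import torch

torch.manual_seed(123)
model = torch.nn.Linear(2, 2)  # stand-in for NeuralNetwork(num_inputs=2, num_outputs=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")  # temporary stand-in path
    torch.save(model.state_dict(), path)

    # The new instance must match the architecture of the saved model
    restored = torch.nn.Linear(2, 2)
    restored.load_state_dict(torch.load(path, weights_only=True))

restored.eval()

# Every parameter of the restored model equals the original
same = all(torch.equal(p, q)
           for p, q in zip(model.parameters(), restored.parameters()))
print(same)  # True
```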

###  Optimizing training performance with GPUs

In [None]:
print(torch.cuda.is_available())

True


In [None]:
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
print(tensor_1 + tensor_2)

tensor([5., 7., 9.])


 Transfer these tensors onto a GPU and perform the addition.

In [None]:
tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")
print(tensor_1 + tensor_2)

tensor([5., 7., 9.], device='cuda:0')


### Single-GPU training

In [None]:
torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)

device = torch.device("cuda")
model = model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):

  model.train()
  for batch_idx, (features, labels) in enumerate(train_loader):
    # Move each batch to the same device as the model
    features, labels = features.to(device), labels.to(device)
    logits = model(features)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Logging
    print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
          f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
          f" | Train/Val Loss: {loss:.2f}")
  model.eval()


Passing the string `"cuda"` directly, as in `.to("cuda")`, works just as well as creating `torch.device("cuda")` first and is shorter (see section A.9.1 of the book). We can also modify the statement so that the same code runs on a CPU if a GPU is not available:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
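
With the device chosen this way, the rest of the code stays identical on both kinds of hardware. This is a small sketch, assuming a single `Linear` layer as a stand-in for the model and one made-up input row:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model and its inputs must live on the same device
model = torch.nn.Linear(2, 2).to(device)           # stand-in for NeuralNetwork
features = torch.tensor([[-1.2, 3.1]]).to(device)  # made-up input row

with torch.no_grad():
    logits = model(features)

print(logits.device)  # cuda:0 on a GPU machine, cpu otherwise
```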