In [10]:
import torch


# Creating a multilayer perceptron with two hidden layers

class NeuralNetwork(torch.nn.Module):
  def __init__(self, num_inputs, num_outputs):
    super().__init__()
    self.layers = torch.nn.Sequential(
        # First hidden layer
        torch.nn.Linear(num_inputs, 30),
        torch.nn.ReLU(),

        # Second hidden layer
        torch.nn.Linear(30, 20),
        torch.nn.ReLU(),

        # Output layer
        torch.nn.Linear(20, num_outputs)
    )

  def forward(self, x):
    logits = self.layers(x)
    return logits


In [11]:
# Creating a small toy dataset
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

x_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

In [12]:
# Defining a custom Dataset class
from torch.utils.data import Dataset

class ToyDataset(Dataset):
  def __init__(self, X, y):
    self.features = X
    self.labels = y

  def __getitem__(self, index):
    one_x = self.features[index]
    one_y = self.labels[index]
    return one_x, one_y

  def __len__(self):
    return self.labels.shape[0]


train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(x_test, y_test)

In [13]:
# Instantiating data loaders
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0)


test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

### A.7 A typical training loop

Let’s now train a neural network on the toy dataset. The following listing shows the training code.

In [14]:
# Neural Network training in PyTorch
import torch.nn.functional as F

torch.manual_seed(123)

model = NeuralNetwork(num_inputs=2, num_outputs=2) # The dataset has two features and two classes.
optimiser = torch.optim.SGD(
    model.parameters(), lr=0.5  # The optimiser needs to know which parameters to optimise.
)

num_epochs = 3  # Three complete passes through the entire training dataset.
for epoch in range(num_epochs):
    model.train()  # Puts the model in training mode (this matters for layers such as dropout and batch normalisation).
    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model(features) # The model makes its predictions based on the input features. These raw predictions (logits) haven't been converted to probabilities yet.

        loss = F.cross_entropy(logits, labels)

        optimiser.zero_grad() # Resets the gradients from the previous iteration to zero to prevent unintended gradient accumulation.
        loss.backward()    # Computes the gradients of the loss with respect to the model parameters by traversing the computation graph that PyTorch constructed in the background. Each gradient indicates how a small change in that parameter would change the loss.
        optimiser.step()   # Uses the gradients to update the model parameters so as to reduce the loss. For plain SGD, this means multiplying each gradient by the learning rate and subtracting the result from the corresponding parameter.

        # Logging
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f"| Batch {batch_idx+1:03d}/{len(train_loader):03d}"
              f"| Train Loss: {loss:.2f}")

    model.eval()  # Switches to evaluation mode; optional evaluation code would go here.

Epoch: 001/003| Batch 001/003| Train Loss: 0.75
Epoch: 001/003| Batch 002/003| Train Loss: 0.65
Epoch: 001/003| Batch 003/003| Train Loss: 0.42
Epoch: 002/003| Batch 001/003| Train Loss: 0.05
Epoch: 002/003| Batch 002/003| Train Loss: 0.13
Epoch: 002/003| Batch 003/003| Train Loss: 0.00
Epoch: 003/003| Batch 001/003| Train Loss: 0.01
Epoch: 003/003| Batch 002/003| Train Loss: 0.00
Epoch: 003/003| Batch 003/003| Train Loss: 0.02


As we can see, the loss drops close to 0 within three epochs, a sign that the model has converged on the training set. Here, we initialise a model with two inputs and two outputs because our toy dataset has two input features and two class labels to predict. We used a `stochastic gradient descent (SGD)` optimiser with a `learning rate (lr)` of 0.5. The learning rate is a `hyperparameter`, meaning it’s a tunable setting that we must experiment with based on observing the loss. Ideally, we want to choose a learning rate such that the loss converges after a certain number of epochs. The number of epochs is another hyperparameter to choose.

Note about PyTorch optimisers:
The optimiser is a crucial component that helps your neural network learn effectively.
In the code snippet above `SGD` is `Stochastic Gradient Descent`, one of the most fundamental optimisation algorithms in deep learning. Imagine you're trying to find the lowest point in a valley while blindfolded. SGD is like taking steps in the direction where the ground feels like it's sloping downward. The "stochastic" part means you're taking these steps based on looking at just a small part of your data at a time (a batch), rather than the whole dataset at once.

Understanding `lr=0.5` (`Learning Rate`):
The learning rate is like the size of the steps you take while searching for that lowest point. A learning rate of 0.5 means:
- Each update moves a parameter by 0.5 times its gradient (slope), in the direction that lowers the loss
- A larger learning rate (like 1.0) means bigger steps - faster learning but risk of overshooting
- A smaller learning rate (like 0.01) means smaller steps - more precise but slower learning
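The update rule described above can be checked by hand. The following is a minimal sketch, assuming a single `torch.nn.Linear` layer as a stand-in for the full model and two made-up training examples; it compares a manual "parameter minus learning rate times gradient" computation with what `optimiser.step()` produces for plain SGD (no momentum or weight decay).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

# A single linear layer stands in for the full model (illustrative only).
layer = torch.nn.Linear(2, 2)
lr = 0.5
optimiser = torch.optim.SGD(layer.parameters(), lr=lr)

# Two made-up examples with two features each.
features = torch.tensor([[-1.2, 3.1], [2.3, -1.1]])
labels = torch.tensor([0, 1])

loss = F.cross_entropy(layer(features), labels)
optimiser.zero_grad()
loss.backward()

# What plain SGD should produce: parameter minus lr times gradient.
expected_weight = layer.weight.detach() - lr * layer.weight.grad
expected_bias = layer.bias.detach() - lr * layer.bias.grad

optimiser.step()

weights_match = torch.allclose(layer.weight, expected_weight)
bias_matches = torch.allclose(layer.bias, expected_bias)
print(weights_match, bias_matches)
```

Changing `lr` in this sketch scales the size of the step without changing its direction, which is the sense in which the learning rate controls the step size.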

Note about Epochs and Batches:
1. Epoch
- Definition: An epoch is one complete pass through the entire training dataset. During an epoch, the model sees every training example exactly once.
- Purpose: Repeated passes (multiple epochs) allow the model to learn and refine its parameters. Typically, you train a model for many epochs (e.g., 10, 100, or more), depending on the dataset size and learning requirements.
2. Batch
- Definition: A batch is a subset of the training data. Instead of processing the entire dataset at once, the data is divided into smaller chunks (batches), and the model updates its parameters after processing each batch.
- Why use batches?
    - Efficiency: It’s computationally cheaper and faster to process smaller chunks of data, especially with large datasets.
    - Stability: Helps smooth out updates to the model by averaging gradients over the batch.
    - Memory Management: Allows training on large datasets that don’t fit into memory.
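To make the relationship between dataset size, batch size, and batches per epoch concrete, here is a small sketch. It assumes five placeholder examples wrapped in a `TensorDataset` (the feature values are made up), mirroring the five-example training set and `batch_size=2` used in this section. It also shows the `drop_last` option of `DataLoader`, which discards a final incomplete batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Five placeholder examples with two features each (values are arbitrary).
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)
labels = torch.tensor([0, 0, 0, 1, 1])
dataset = TensorDataset(features, labels)

# 5 examples with batch_size=2 give ceil(5 / 2) = 3 batches per epoch.
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_sizes = [len(batch_labels) for _, batch_labels in loader]
print(len(loader), batch_sizes)  # 3 [2, 2, 1]

# drop_last=True discards the final, smaller batch.
loader_drop = DataLoader(dataset, batch_size=2, shuffle=False, drop_last=True)
batch_sizes_drop = [len(batch_labels) for _, batch_labels in loader_drop]
print(len(loader_drop), batch_sizes_drop)  # 2 [2, 2]
```

This is why the training log above reports three batches per epoch: the last batch holds only a single example. With very small final batches, the gradient estimate for that update can be noisy, which is one reason `drop_last=True` is sometimes used during training.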

In practice, we often use a third dataset, a so-called `validation dataset`, to find the optimal hyperparameter settings. A validation dataset is similar to a test set. However, while we only want to **use a test set precisely once** to avoid biasing the evaluation, we usually use the validation set multiple times to tweak the model settings.
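As a rough illustration of how a validation set can guide hyperparameter choices, the sketch below holds out part of a dataset with `torch.utils.data.random_split` and compares a few learning rates by their validation loss. Everything specific here is an assumption made for the example: the synthetic data, the 30/10 split, the candidate learning rates, the single linear layer used as a stand-in model, and the 20 full-batch updates.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(123)

# Synthetic two-feature data (made up): class 1 when the first feature is positive.
features = torch.randn(40, 2)
labels = (features[:, 0] > 0).long()
full_ds = TensorDataset(features, labels)

# Hold out 10 of the 40 examples as a validation set.
train_part, val_part = random_split(
    full_ds, [30, 10], generator=torch.Generator().manual_seed(123)
)
X_tr, y_tr = features[train_part.indices], labels[train_part.indices]
X_val, y_val = features[val_part.indices], labels[val_part.indices]

candidate_lrs = [0.01, 0.1, 0.5]  # hypothetical search grid
val_losses = {}
for lr in candidate_lrs:
    torch.manual_seed(123)  # same initial weights for every candidate
    model = torch.nn.Linear(2, 2)  # small stand-in model
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(20):  # full-batch updates to keep the sketch short
        loss = F.cross_entropy(model(X_tr), y_tr)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    with torch.no_grad():  # no gradients needed for evaluation
        val_losses[lr] = F.cross_entropy(model(X_val), y_val).item()

# Pick the learning rate with the lowest validation loss.
best_lr = min(val_losses, key=val_losses.get)
print(val_losses)
print("Selected learning rate:", best_lr)
```

The test set plays no part in this selection; it would be used once, after the learning rate has been fixed.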

After we have trained the model, we can use it to make predictions:

In [15]:
with torch.no_grad():
    outputs = model(X_train)
print(outputs)

tensor([[ 2.9320, -4.2563],
        [ 2.6045, -3.8389],
        [ 2.1484, -3.2514],
        [-2.1461,  2.1496],
        [-2.5004,  2.5210]])


To obtain the class membership probabilities, we can then use PyTorch’s `softmax` function:

In [17]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1) # Computes softmax for each row independently.
print(probas)

tensor([[    0.9992,     0.0008],
        [    0.9984,     0.0016],
        [    0.9955,     0.0045],
        [    0.0134,     0.9866],
        [    0.0066,     0.9934]])


Consider the first row, [0.9992, 0.0008]. The first value means that the training example has a 99.92% probability of belonging to class 0, and the second value means that it has a 0.08% probability of belonging to class 1.
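To see where these probabilities come from, the softmax can be computed by hand: exponentiate each logit and divide by the sum of the exponentials. The sketch below does this for the first row of logits printed earlier (the values are copied from that output, so they are rounded to four decimals) and compares the result with `torch.softmax`.

```python
import torch

# First row of the model outputs shown earlier (rounded values).
logits = torch.tensor([2.9320, -4.2563])

# Softmax by hand: exponentiate, then normalise so the entries sum to 1.
exp_logits = torch.exp(logits)
manual = exp_logits / exp_logits.sum()

builtin = torch.softmax(logits, dim=0)

print(manual)  # approximately [0.9992, 0.0008]
same_result = torch.allclose(manual, builtin)
print(same_result)
```

In practice, `torch.softmax` is preferable to the manual version because it is implemented in a numerically stable way for large logits.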

We can convert these values into class label predictions using PyTorch’s `argmax` function, which returns the index position of the highest value in each row if we set `dim=1` (setting `dim=0` would return the index position of the highest value in each column instead):

In [18]:
predictions = torch.argmax(probas, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])
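The effect of the `dim` argument is easiest to see side by side. This short sketch reuses the probability values printed above (copied by hand into a new tensor) and calls `torch.argmax` along each dimension.

```python
import torch

# Probabilities copied from the softmax output above.
probas = torch.tensor([
    [0.9992, 0.0008],
    [0.9984, 0.0016],
    [0.9955, 0.0045],
    [0.0134, 0.9866],
    [0.0066, 0.9934],
])

# dim=1: one index per row, i.e. the predicted class for each example.
per_row = torch.argmax(probas, dim=1)
print(per_row)  # tensor([0, 0, 0, 1, 1])

# dim=0: one index per column, i.e. the row holding each column's maximum.
per_col = torch.argmax(probas, dim=0)
print(per_col)  # tensor([0, 4])
```

With `dim=0`, the result has one entry per class rather than one per example, so it identifies which example is most confidently assigned to each class rather than giving class predictions.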


We could also apply the argmax function to the logits (outputs) directly:

In [19]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])
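Both approaches agree because softmax preserves the ordering of the values within each row, so the largest logit always maps to the largest probability. The sketch below checks this on the logits printed earlier (copied by hand, so rounded) and, as one possible next step, compares the predictions with the training labels to obtain an accuracy; the variable names are illustrative.

```python
import torch

# Logits and labels copied from the earlier cells.
outputs = torch.tensor([
    [ 2.9320, -4.2563],
    [ 2.6045, -3.8389],
    [ 2.1484, -3.2514],
    [-2.1461,  2.1496],
    [-2.5004,  2.5210],
])
y_train = torch.tensor([0, 0, 0, 1, 1])

from_logits = torch.argmax(outputs, dim=1)
from_probas = torch.argmax(torch.softmax(outputs, dim=1), dim=1)

# Softmax is monotonic within a row, so both routes give the same labels.
same_predictions = torch.equal(from_logits, from_probas)
print(same_predictions)  # True

# Fraction of training examples predicted correctly.
accuracy = (from_logits == y_train).float().mean().item()
print(accuracy)  # 1.0
```

Skipping the softmax saves a little computation when only the predicted labels are needed.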
