**Tensor**

Tensors are similar to NumPy ndarrays, but they can also run on GPUs and are optimized for automatic differentiation.

In [None]:
import torch

data = [[1, 2], [3, 4]]
# From a (nested) Python list
x_data = torch.tensor(data)
# From another tensor
x_ones = torch.ones_like(x_data) # retains the properties of x_data
x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data

In [None]:
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

In [None]:
# Operations

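# To device: start on the CPU and move to the GPU only if one is available,
# so that `tensor` is defined either way
tensor = rand_tensor
if torch.cuda.is_available():
    tensor = tensor.to("cuda")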

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

# Deep Copy
# Slices are views (not copies), so changes to a slice affect the original tensor; use .clone().detach() for an independent copy
tensor_copy = tensor.clone().detach()

# Index and slice
# Syntax: start (inclusive) : end (exclusive) : step. For multi-dimensional tensors, slicing is done per axis.
tensor[:,1] = 0 # This combines slice and index: [slice all rows, index 1]

print(tensor)
print(tensor_copy)
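
The view-versus-copy behaviour noted in the comments above can be checked on a small CPU tensor. This is a minimal sketch; the names `base`, `column_view` and `independent` are illustrative and not part of the original notebook.

```python
import torch

base = torch.arange(6, dtype=torch.float).reshape(2, 3)

column_view = base[:, 1]              # basic slicing returns a view sharing storage with base
independent = base.clone().detach()   # clone() copies the data; detach() drops autograd history

column_view[:] = 0                    # writing through the view also changes base

print(base)          # column 1 is now zero
print(independent)   # still holds the original values
```

Only `base` changes, because `independent` owns its own memory.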

In [None]:
# Concatenate a sequence of tensors along a given dimension: dim=0 stacks along rows, dim=1 along columns
t1 = torch.cat([tensor, tensor, tensor], dim=0)
print(t1)

In [None]:
# Arithmetic operations: y1, y2, y3 will have the same value
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)
y3 = torch.rand_like(y1)
torch.matmul(tensor, tensor.T, out=y3)

# This computes the element-wise product. z1, z2, z3 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)
z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)

tensor([[0.9383, 0.0000, 0.7632],
        [0.3839, 0.0000, 0.4835]], device='cuda:0')

In [None]:
agg = tensor.sum()
agg_item = agg.item()

**Datasets & DataLoaders**

 PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
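
As a rough illustration of how the two primitives fit together, the sketch below defines a tiny in-memory Dataset and wraps it in a DataLoader. `ToyDataset` and its made-up contents are hypothetical; a map-style Dataset only needs `__len__` and `__getitem__`.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: 10 samples with 4 features each."""

    def __init__(self):
        self.features = torch.arange(40, dtype=torch.float).reshape(10, 4)
        self.labels = torch.arange(10) % 2  # alternating 0/1 labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # One (sample, label) pair; the DataLoader stacks these into batches
        return self.features[idx], self.labels[idx]

dataset = ToyDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_shapes = [(X.shape, y.shape) for X, y in loader]
for X_shape, y_shape in batch_shapes:
    print(X_shape, y_shape)
```

With 10 samples and batch_size=4, the last batch holds the remaining 2 samples.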


**Transform Data**
All TorchVision datasets have two parameters, transform to modify the features and target_transform to modify the labels, that accept callables containing the transformation logic.

The FashionMNIST features are in PIL Image format, and the labels are integers. For training, we need the features as **normalized tensors**, and the labels as **one-hot encoded tensors**. To make these transformations, we use ToTensor and Lambda.

In [None]:
import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

# ToTensor converts a PIL Image or numpy.ndarray of shape (H x W x C) with values in [0, 255]
# to a torch.FloatTensor of shape (C x H x W) with values in [0.0, 1.0].


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

import matplotlib.pyplot as plt
labels_map = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}
sample_idx = torch.randint(len(training_data), size=(1,)).item()
img, label = training_data[sample_idx]
img.shape # (C, H, W) = (1, 28, 28). C is the number of channels: 1 for grayscale here, 3 for RGB.
plt.title(labels_map[label])
plt.axis("off")
# squeeze() removes dimensions of size 1 from an array.
# For example, if your image array has a shape like (1, 224, 224, 3), applying squeeze() will result in an array with shape (224, 224, 3).
plt.imshow(img.squeeze(), cmap="gray")
plt.show()


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
    target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1))
)
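
To see what the two transforms produce without downloading anything, the sketch below applies the same scatter_ call to a single label and mimics the ToTensor scaling by hand on a made-up image. The label value and the random pixels are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

# One-hot encode an integer label with the same scatter_ call used in the Lambda above.
label = 3  # hypothetical class index
one_hot = torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(label), value=1)
print(one_hot)

# torch.nn.functional.one_hot gives the same encoding (as int64, hence .float()).
alt = F.one_hot(torch.tensor(label), num_classes=10).float()

# Manual version of the scaling ToTensor performs on a uint8 H x W x C image.
fake_image = torch.randint(0, 256, (28, 28, 1), dtype=torch.uint8)  # made-up pixels
scaled = fake_image.permute(2, 0, 1).float() / 255  # C x H x W, values in [0, 1]
print(scaled.shape, scaled.min().item(), scaled.max().item())
```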

Once the dataset is loaded into the DataLoader, we can iterate through it as needed. Each iteration below returns a batch of train_features and train_labels (containing batch_size=64 features and labels respectively). Because we specified shuffle=True, **the data is reshuffled every time we start a new pass over all batches**.

In [None]:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

train_features, train_labels = next(iter(train_dataloader))

print(f"Feature batch shape: {train_features.size()}") # B, C, H, W
print(f"Labels batch shape: {train_labels.size()}")

Feature batch shape: torch.Size([64, 1, 28, 28])
Labels batch shape: torch.Size([64, 10])
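
The effect of shuffle=True can be observed on a toy dataset whose samples are just the integers 0 to 7. This is a sketch under that assumption; the seed is set only so that repeated runs print the same orders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # only so the printed orders are repeatable

dataset = TensorDataset(torch.arange(8))  # toy dataset: the samples are 0..7
loader = DataLoader(dataset, batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(2):
    # Each `for` over the loader creates a fresh iterator and draws a fresh permutation.
    order = [int(i) for (batch,) in loader for i in batch]
    epoch_orders.append(order)
    print(f"epoch {epoch}: {order}")
```

Every pass visits each sample exactly once; only the order changes between passes.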


### Build the Neural Network

torch.nn has all the building blocks needed. Every module is a subclass of nn.Module. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.

In [None]:
from torch import nn

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

# Linear + ReLU is a common pairing.
# ReLU introduces sparsity: it outputs zero for negative inputs, so only a subset of neurons in each layer is active.
# ReLU is piecewise linear and cheap to compute, and in practice it often trains faster than saturating activations such as sigmoid.
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten() #  convert each 2D 28x28 image into a contiguous array of 784 pixel values
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(), # ReLU Introduces Non-Linearity. Without non-linearity, stacking multiple linear layers is equivalent to a single linear transformation
            nn.Linear(512, 512), # 512 is the dimensionality of the hidden layers
            nn.ReLU(), # ReLU mitigates the vanishing gradient problem that is common with saturating activations like sigmoid and tanh
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)

Using cuda device
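
The comment about non-linearity in the class above can be verified numerically: two Linear layers with nothing in between collapse into a single affine map. This is a small sketch with arbitrary layer sizes chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

first = nn.Linear(4, 8)
second = nn.Linear(8, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    # Two stacked linear layers with no activation in between...
    stacked = second(first(x))

    # ...equal one linear map with W = W2 @ W1 and b = W2 @ b1 + b2.
    W = second.weight @ first.weight
    b = second.weight @ first.bias + second.bias
    collapsed = x @ W.T + b

    # Inserting ReLU breaks this equivalence in general.
    with_relu = second(torch.relu(first(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))
```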


In [None]:
X = train_features[0].to(device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")

Predicted class: tensor([1], device='cuda:0')


In [None]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")
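
The parameter sizes printed by the loop above can be tallied per layer. The sketch rebuilds the same layer sizes in a standalone nn.Sequential so that it runs on its own; `stack`, `per_layer` and `total` are illustrative names.

```python
from torch import nn

# Same layer sizes as the network above, rebuilt here so the snippet is self-contained.
stack = nn.Sequential(
    nn.Linear(28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Each nn.Linear(in, out) holds an (out x in) weight matrix plus `out` biases.
# ReLU layers have no parameters.
per_layer = {
    name: sum(p.numel() for p in layer.parameters())
    for name, layer in stack.named_children()
}
total = sum(p.numel() for p in stack.parameters())

print(per_layer)
print(f"total trainable parameters: {total}")
```

For example, the first layer holds 784 * 512 weights plus 512 biases, i.e. 401920 parameters.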

**Autograd**

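When training neural networks, parameters are adjusted according to the gradient of the loss function with respect to each parameter. To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.

By default, all tensors with requires_grad=True track their computational history and support gradient computation. In some cases this is not needed, for example when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding the computation code with a `with torch.no_grad():` block.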

There are reasons you might want to disable gradient tracking:
- To mark some parameters in your neural network as frozen parameters.
- To speed up computations when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.
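
Both use cases can be sketched in a few lines. The tensors and the small layer below are made up for illustration; the calls shown (`torch.no_grad()`, `detach()`, `requires_grad_()`) are standard PyTorch.

```python
import torch
from torch import nn

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)

z = x @ w
print(z.requires_grad)             # tracked, because z depends on w

with torch.no_grad():
    z_inference = x @ w
print(z_inference.requires_grad)   # not tracked inside the block

z_detached = z.detach()            # same values, cut off from the graph

# Freezing: parameters with requires_grad=False receive no gradients.
layer = nn.Linear(3, 2)
for param in layer.parameters():
    param.requires_grad_(False)
```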

**DAGs are dynamic in PyTorch**

An important thing to note is that the graph is recreated from scratch; **after each .backward() call**, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
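
A minimal sketch of this dynamic behaviour: the hypothetical function below uses an ordinary Python loop, so the recorded graph (and therefore the gradient) depends on how many times the loop runs.

```python
import torch

def repeated_product(x, steps):
    """Multiply x by itself `steps` times; the loop length shapes the graph."""
    y = x
    for _ in range(steps):
        y = y * x
    return y  # equals x ** (steps + 1)

x = torch.tensor(2.0, requires_grad=True)

grads = []
for steps in (1, 2):
    y = repeated_product(x, steps)        # a new graph is recorded on every call
    (grad,) = torch.autograd.grad(y, x)   # derivative of x ** (steps + 1) at x = 2
    grads.append(grad.item())

print(grads)
```

With steps=1 the function is x^2 (derivative 2x = 4 at x = 2); with steps=2 it is x^3 (derivative 3x^2 = 12).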


PyTorch accumulates gradients: each newly computed gradient is added to the grad property of every leaf node of the computational graph. To compute correct gradients, you need to zero out the grad property beforehand. In real-life training, an optimizer helps us do this.
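
The accumulation is easy to observe on a single scalar. This sketch resets the gradient by hand with `grad.zero_()`; in a training loop the optimizer's zero_grad() plays that role.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()         # d(w^2)/dw = 2w = 4

(w ** 2).backward()
accumulated = w.grad.item()   # 4 + 4 = 8: the new gradient was added to the old one

w.grad.zero_()                # manual reset of the accumulated gradient
(w ** 2).backward()
after_reset = w.grad.item()   # back to 4

print(first, accumulated, after_reset)
```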



**Optimizing Model Parameters**



In [3]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        # After flattening, the input has shape (64, 784): batch size B=64, feature dimension 784

        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()

### Hyperparameters

Hyperparameters are adjustable parameters that let you control the model optimization process:
- Number of Epochs - the number of times to iterate over the dataset
- Batch Size - the number of data samples propagated through the network before the parameters are updated
- Learning Rate - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning, while large values may result in unpredictable behavior during training.
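
The learning-rate trade-off can be seen on a one-dimensional toy problem. The sketch below minimizes f(w) = w^2 with plain gradient descent; the function, the three learning rates and the step count are assumptions chosen only to show slow convergence, fast convergence and divergence.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; returns the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w                   # derivative of w**2
        w = w - learning_rate * grad   # each step multiplies w by (1 - 2 * learning_rate)
    return w

# Hypothetical learning rates chosen to show three regimes.
results = {lr: run_gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, final_w in results.items():
    print(f"lr={lr}: final w = {final_w:.4g}")
```

The smallest rate is still far from the minimum at 0 after 20 steps, the middle one is essentially there, and the largest overshoots further on every step.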

In [19]:
learning_rate = 1e-3
batch_size = 64
epochs = 5

# Common loss functions: nn.MSELoss for regression; nn.NLLLoss and nn.CrossEntropyLoss for classification
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch_num, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward() # Calculates the gradients of the loss with respect to the model's parameters.
        optimizer.step() # Updates the model's parameters based on the calculated gradients.
        optimizer.zero_grad()

        if batch_num % 100 == 0:
            # Use the dataloader's own batch size rather than relying on a global variable
            loss, current = loss.item(), batch_num * dataloader.batch_size + len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

train_loop(train_dataloader, model, loss_fn, optimizer)

loss: 1.675389  [   64/60000]
loss: 1.643331  [ 6464/60000]
loss: 1.502533  [12864/60000]
loss: 1.585111  [19264/60000]
loss: 1.454386  [25664/60000]
loss: 1.435264  [32064/60000]
loss: 1.456058  [38464/60000]
loss: 1.368090  [44864/60000]
loss: 1.400835  [51264/60000]
loss: 1.310243  [57664/60000]


Inside the training loop, optimization happens in three steps:
- Call optimizer.zero_grad() to reset the gradients of model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
- Backpropagate the prediction loss with a call to loss.backward(). PyTorch deposits the gradients of the loss w.r.t. each parameter.
- Once we have our gradients, call optimizer.step() to adjust the parameters by the gradients collected in the backward pass.

The loop above calls zero_grad() at the end of each iteration rather than at the start, which has the same effect: the gradients are cleared before the next backward pass.
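
To make the three steps concrete, the sketch below runs one update on a tiny made-up regression problem and checks it against the plain SGD rule w <- w - lr * grad. The model, data and learning rate are illustrative assumptions, and the check relies on SGD being used without momentum or weight decay.

```python
import torch
from torch import nn

torch.manual_seed(0)

tiny_model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
learning_rate = 0.1
optimizer = torch.optim.SGD(tiny_model.parameters(), lr=learning_rate)

X = torch.randn(8, 3)   # made-up batch
y = torch.randn(8, 1)

optimizer.zero_grad()                  # 1. clear gradients left from earlier steps
loss = loss_fn(tiny_model(X), y)
loss.backward()                        # 2. fill .grad for every parameter

# What plain SGD is expected to do: w <- w - lr * grad
expected = [p.detach() - learning_rate * p.grad for p in tiny_model.parameters()]

optimizer.step()                       # 3. apply the update in place

matches = all(torch.allclose(p, e) for p, e in zip(tiny_model.parameters(), expected))
print(matches)
```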

In [21]:
def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

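For a vector-valued function f: R^n -> R^m, the Jacobian is the m x n matrix of partial derivatives of each output with respect to each input. The gradient is the special case of the Jacobian when m=1, i.e. when the output is a scalar such as the loss.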
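
The relationship can be checked with torch.autograd.functional.jacobian. The two functions below are made up for illustration: one maps R^3 to R^2, the other maps R^3 to a scalar.

```python
import torch
from torch.autograd.functional import jacobian

def vector_fn(x):
    # f: R^3 -> R^2, so the Jacobian has shape (2, 3)
    return torch.stack([x[0] * x[1], x[2] ** 2])

def scalar_fn(x):
    # g: R^3 -> R (m = 1), so the Jacobian collapses to the gradient
    return (x ** 2).sum()

point = torch.tensor([1.0, 2.0, 3.0])

J = jacobian(vector_fn, point)
g_jac = jacobian(scalar_fn, point)

# The same gradient obtained the usual way, through backward()
x = point.clone().requires_grad_(True)
scalar_fn(x).backward()

print(J)
print(g_jac, x.grad)
```

For the scalar function, the Jacobian and the gradient from backward() agree entry by entry.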

In [None]:
epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")

PyTorch models store the learned parameters in an internal state dictionary, called state_dict. These can be persisted via the torch.save method.

Be sure to call the model.eval() method before inference to set the dropout and batch normalization layers to evaluation mode. Failing to do this will yield inconsistent inference results.

In [None]:
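torch.save(model.state_dict(), 'model.pth')  # persist only the learned parameters
model = NeuralNetwork()  # rebuild the architecture, then load the saved parameters into it
model.load_state_dict(torch.load('model.pth', weights_only=True))
model.eval()  # evaluation mode before inference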
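
A round trip through state_dict can be sketched without touching the disk. The example uses an in-memory buffer in place of a file and a small stand-in architecture built by the hypothetical helper `make_model`; it assumes a PyTorch version in which torch.load accepts the `weights_only` argument.

```python
import io

import torch
from torch import nn

torch.manual_seed(0)

def make_model():
    # Small stand-in architecture; any nn.Module is handled the same way.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

original = make_model()

buffer = io.BytesIO()                       # in-memory stand-in for a file such as 'model.pth'
torch.save(original.state_dict(), buffer)   # persist only the learned parameters
buffer.seek(0)

restored = make_model()                     # rebuild the architecture first
restored.load_state_dict(torch.load(buffer, weights_only=True))
restored.eval()                             # evaluation mode before inference

x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.allclose(original(x), restored(x))
print(same_output)
```

Saving the state_dict rather than the whole model object keeps the file independent of how the class is pickled, at the cost of having to rebuild the architecture before loading.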