In [1]:
import torch
import numpy as np

torch.cuda.is_available()

False

In [9]:
x = torch.tensor([[1., -1.], [1., 1.]], requires_grad=True)
out = x.pow(2).sum()
out.backward()
x.grad

tensor([[ 2., -2.],
        [ 2.,  2.]])

### Tensors

Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and NumPy arrays can often share the same underlying memory, eliminating the need to copy data (see Bridge with NumPy). Tensors are also optimized for automatic differentiation (covered in the Automatic Differentiation section).

In [2]:
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)
x_data


tensor([[1, 2],
        [3, 4]])

In [5]:
np_array = np.array(data)
x_np = torch.from_numpy(np_array)

In [6]:
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")
x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

Ones Tensor: 
 tensor([[1, 1],
        [1, 1]]) 

Random Tensor: 
 tensor([[0.9496, 0.0637],
        [0.3262, 0.2992]]) 



In [7]:
tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


### Operations on Tensors

More than 1200 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing, indexing, slicing), sampling, and more, are described in the PyTorch `torch` API documentation.

Each of these operations can be run on the CPU or on an accelerator such as CUDA, MPS, MTIA, or XPU. If you’re using Colab, allocate an accelerator by going to Runtime > Change runtime type > GPU.

By default, tensors are created on the CPU. We need to explicitly move tensors to the accelerator using the `.to` method (after checking that an accelerator is available). Keep in mind that copying large tensors across devices can be expensive in terms of time and memory!

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [9]:
# We move our tensor to the current accelerator if available
if torch.accelerator.is_available():
    tensor = tensor.to(torch.accelerator.current_accelerator())

### Arithmetic operations

Let $A$ be a matrix of size $m \times n$.

Then:

-   $A^T$ is size $n \times m$
    
-   So, two possible multiplications:
    
    1.  $A^T A$ — **results in a square matrix of size $n \times n$**
        
    2.  $A A^T$ — **results in a square matrix of size $m \times m$**
        

---

### 🧠 What does it *mean*?

#### 🔷 1. $A^T A$: Inner product / Gram matrix

-   Each entry is:
    
    $$
    (A^T A)_{ij} = \langle \text{col}_i, \text{col}_j \rangle
    $$
    
-   It represents **dot products between columns** of $A$.
    
-   The result is **symmetric** and **positive semi-definite**.
    
-   Common in **least squares**, **PCA**, **SVD**, and **machine learning** kernels.
    

#### 🔷 2. $A A^T$: Gram matrix of the rows

-   Each entry is:
    
    $$
    (A A^T)_{ij} = \langle \text{row}_i, \text{row}_j \rangle
    $$
    
-   Represents **dot products between rows** of $A$.
    
-   Also **symmetric** and **positive semi-definite**.
    
-   Shows up in **covariance matrices**, **projections**, etc.
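
The shape, symmetry, and positive semi-definiteness claims above can be checked numerically. This is a minimal sketch, assuming a random $3 \times 4$ matrix; the names `gram_cols` and `gram_rows` are made up for this illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the check is repeatable

m, n = 3, 4
A = torch.rand(m, n, dtype=torch.float64)

gram_cols = A.T @ A  # n x n: dot products between columns of A
gram_rows = A @ A.T  # m x m: dot products between rows of A
print(gram_cols.shape, gram_rows.shape)

# Symmetry: each product equals its own transpose.
print(torch.allclose(gram_cols, gram_cols.T))
print(torch.allclose(gram_rows, gram_rows.T))

# Positive semi-definite: eigenvalues are non-negative up to round-off.
eig_cols = torch.linalg.eigvalsh(gram_cols)
eig_rows = torch.linalg.eigvalsh(gram_rows)
print(eig_cols.min().item() >= -1e-10, eig_rows.min().item() >= -1e-10)
```

Because $A$ here has only 3 rows, $A^T A$ has rank at most 3, so its smallest eigenvalue is zero up to round-off. That is why the check uses a small negative tolerance rather than a strict `>= 0`.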
    

---

### 🧪 Example:

Let:

$$
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix}
$$

Then:

**Transpose:**

$$
A^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \\ \end{bmatrix}
$$

**Multiply:**

$$
A^T A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \\ \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} = \begin{bmatrix} 10 & 14 \\ 14 & 20 \\ \end{bmatrix}
$$

**Notice**:

-   $A^T A$ is square, symmetric.
    
-   $A A^T$ would be $2 \times 2$ as well, and also symmetric.
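
The worked example can be reproduced in PyTorch. The sketch below also computes $A A^T$, which the example mentions but does not write out; the variable names are illustrative.

```python
import torch

A = torch.tensor([[1., 2.],
                  [3., 4.]])

AtA = A.T @ A  # dot products between columns
AAt = A @ A.T  # dot products between rows

print(AtA)  # entries: 10, 14 / 14, 20
print(AAt)  # entries: 5, 11 / 11, 25

# Both results are symmetric, but they are different matrices.
print(torch.equal(AtA, AtA.T), torch.equal(AAt, AAt.T), torch.equal(AtA, AAt))
```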
    

---

### 🧭 Geometrically:

-   $A^T A$ gives you a **Gram matrix** that captures the **lengths** of the column vectors and the **angles** between them, which is essential in measuring **linear dependence**.
    
-   If columns are orthonormal ⇒ $A^T A = I$
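
One way to see the orthonormal case is to build a matrix with orthonormal columns and multiply it by its transpose. The sketch below obtains such a matrix from a reduced QR decomposition of a random matrix, which is just a convenient construction chosen for this illustration.

```python
import torch

torch.manual_seed(0)

A = torch.rand(4, 3, dtype=torch.float64)
Q, R = torch.linalg.qr(A)  # reduced QR: Q is 4 x 3 with orthonormal columns

# Columns are orthonormal, so Q^T Q is the 3 x 3 identity.
cols_give_identity = torch.allclose(Q.T @ Q, torch.eye(3, dtype=torch.float64))

# Q Q^T is a 4 x 4 projection onto the column space, not the identity.
rows_give_identity = torch.allclose(Q @ Q.T, torch.eye(4, dtype=torch.float64))

print(cols_give_identity, rows_give_identity)
```

Note the asymmetry: for a tall matrix with orthonormal columns, only $Q^T Q$ is the identity.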
    

---

### 🚀 Applications:

-   **Machine Learning**: $X^T X$ in linear regression.
    
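-   **PCA**: For mean-centered $A$ (rows as samples), the eigenvectors of $A^T A$ give the principal directions, and the eigenvalues give the variance along each direction (up to a $1/(m-1)$ factor).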
    
-   **Signal Processing**: Autocorrelation.
    
-   **Optimization**: Hessians and quadratic forms.
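
As a concrete instance of the linear regression item, the normal equations $(X^T X)\,w = X^T y$ can be solved directly. This is a sketch under simple assumptions: synthetic noise-free data, with made-up names `X`, `w_true`, and `w_hat`.

```python
import torch

torch.manual_seed(0)

# Synthetic data: 50 samples, 3 features, targets generated without noise.
X = torch.rand(50, 3, dtype=torch.float64)
w_true = torch.tensor([2.0, -1.0, 0.5], dtype=torch.float64)
y = X @ w_true

# Normal equations: (X^T X) w = X^T y
w_hat = torch.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```

For ill-conditioned problems, `torch.linalg.lstsq` is usually a safer choice than forming $X^T X$ explicitly, since forming the product squares the condition number.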
    

---

So, multiplying a matrix with its transpose is like asking:

> "How do the rows or columns of this matrix relate to each other in space?"

It exposes symmetry, structure, and often, the soul of the data.

In [10]:
# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value
# ``tensor.T`` returns the transpose of a tensor
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y3 = torch.rand_like(y1)
torch.matmul(tensor, tensor.T, out=y3)


# This computes the element-wise product. z1, z2, z3 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)

tensor([[0.3254, 0.0096, 0.0222, 0.7229],
        [0.0344, 0.3484, 0.1626, 0.1651],
        [0.4874, 0.8650, 0.1118, 0.0982]])

### Bridge with NumPy

Tensors on the CPU and NumPy arrays can share their underlying memory locations, and changing one will change the other.

In [11]:
t = torch.ones(5)
print(f"t: {t}")
n = t.numpy()
print(f"n: {n}")


t: tensor([1., 1., 1., 1., 1.])
n: [1. 1. 1. 1. 1.]


In [12]:
t.add_(1)
print(f"t: {t}")
print(f"n: {n}")

t: tensor([2., 2., 2., 2., 2.])
n: [2. 2. 2. 2. 2.]
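
The sharing works in the other direction as well. The sketch below starts from a NumPy array, and contrasts `torch.from_numpy` (shares memory) with `torch.tensor` (copies); the name `t_copy` is illustrative.

```python
import numpy as np
import torch

n = np.ones(5)
t = torch.from_numpy(n)   # shares memory with n
t_copy = torch.tensor(n)  # independent copy of the data

np.add(n, 1, out=n)       # in-place update of the NumPy array

print(f"t: {t}")            # reflects the change
print(f"t_copy: {t_copy}")  # unchanged
print(f"n: {n}")
```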


### Build the Neural Network

Neural networks consist of layers/modules that perform operations on data. The torch.nn namespace provides all the building blocks you need to build your own neural network. Every module in PyTorch subclasses nn.Module. A neural network is itself a module that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.

In [13]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

### Define the Class

We define our neural network by subclassing nn.Module, and initialize the neural network layers in __init__. Every nn.Module subclass implements the operations on input data in the forward method.

In PyTorch, when you create a custom model by subclassing nn.Module, you must override the forward() method. It tells PyTorch how data flows through your network.

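Behind the scenes, calling `model(input)` invokes `model.__call__(input)`, which runs any registered hooks and then calls `model.forward(input)`. For that reason, call the model itself rather than calling `model.forward()` directly.

```python
# model(input) is shorthand for model.__call__(input),
# which in turn calls model.forward(input)
```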
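
The difference between calling the module and calling `forward` directly becomes visible once a hook is registered. The following is a small sketch with a hypothetical `Doubler` module, invented for this illustration; `register_forward_hook` is the standard `nn.Module` method.

```python
import torch
from torch import nn


class Doubler(nn.Module):
    """Toy module used only for this illustration."""

    def forward(self, x):
        return 2 * x


calls = []
model = Doubler()
# The hook records each time it fires; it returns None, so the output is unchanged.
handle = model.register_forward_hook(lambda module, args, output: calls.append("hook"))

x = torch.ones(2)
out_call = model(x)             # goes through __call__, so the hook fires
out_forward = model.forward(x)  # bypasses __call__, so the hook does not fire
handle.remove()

print(calls)                                # the hook fired once
print(torch.equal(out_call, out_forward))   # same numerical result
```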


In [15]:
from torch import nn


class NeuralNetwork(nn.Module):
  def __init__(self):
    super().__init__()
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
      nn.Linear(28*28, 512),
      nn.ReLU(),
      nn.Linear(512, 512),
      nn.ReLU(),
      nn.Linear(512, 10)
    )

  def forward(self, x):
    x = self.flatten(x)
    logits = self.linear_relu_stack(x)
    return logits

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
model = NeuralNetwork().to(device)
print(model)


NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


In [16]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

X = torch.rand(1, 28, 28, device=device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")

Predicted class: tensor([5])


### Model Layers

Let’s break down the layers in the FashionMNIST model. To illustrate it, we will take a sample minibatch of 3 images of size 28x28 and see what happens to it as we pass it through the network.

In [17]:
input_image = torch.rand(3,28,28)
print(input_image.size())

torch.Size([3, 28, 28])


### nn.Flatten

`nn.Flatten()` is a PyTorch module that **reshapes** input tensors by flattening all dimensions **except the batch dimension**.

So, if you have an input like:

```python
x.shape = (batch_size, channels, height, width)
```

After `nn.Flatten()`, it becomes:

```python
x.shape = (batch_size, channels * height * width)
```

---

### 🔍 Why flatten?

Before feeding image data into a `Linear` (fully connected) layer, you must collapse the spatial dimensions (like width and height) into a single vector, because a `Linear` layer acts only on the last dimension of its input, which must have size `in_features`. The usual input shape is `(batch_size, features)`.

We initialize the nn.Flatten layer to convert each 2D 28x28 image into a contiguous array of 784 pixel values (the minibatch dimension, at dim=0, is maintained).

In [18]:
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())

torch.Size([3, 784])
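
For a batch with a channel dimension, the same layer collapses channels, height, and width together. The sketch below uses a hypothetical batch of 8 three-channel images and compares `nn.Flatten` with two equivalent tensor-level calls.

```python
import torch
from torch import nn

x = torch.rand(8, 3, 28, 28)  # hypothetical batch: 8 images, 3 channels each

flat = nn.Flatten()(x)
print(flat.shape)  # torch.Size([8, 2352]), since 3 * 28 * 28 = 2352

# Equivalent ways to get the same result without the module:
same_as_reshape = torch.equal(flat, x.reshape(x.size(0), -1))
same_as_flatten = torch.equal(flat, torch.flatten(x, start_dim=1))
print(same_as_reshape, same_as_flatten)
```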


### nn.Linear

The linear layer is a module that applies a linear transformation on the input using its stored weights and biases.

In [19]:
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.size())

torch.Size([3, 20])
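
To make "stored weights and biases" concrete, the layer output can be reproduced by hand as $y = x W^T + b$. This is a minimal sketch; `manual` is an illustrative name.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=28 * 28, out_features=20)
x = torch.rand(3, 28 * 28)

out = layer(x)
# weight has shape (out_features, in_features), hence the transpose.
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(out, manual, atol=1e-5))
```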


### nn.ReLU

Non-linear activations are what create the complex mappings between the model’s inputs and outputs. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.

In this model, we use nn.ReLU between our linear layers, but there are other activations that introduce non-linearity in your model.

In [20]:
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

Before ReLU: tensor([[ 0.1788,  0.4263, -0.2719, -0.4069,  0.4450, -0.3410,  0.1791,  0.0007,
          0.4072,  0.3851, -0.2438,  0.2442, -0.2389,  0.1168, -0.0740,  0.0526,
         -0.2256,  0.2876, -0.0852, -0.1439],
        [ 0.4297,  0.3652,  0.1637, -0.4328,  0.1047, -0.0369, -0.0022,  0.2297,
          0.3424,  0.1223,  0.2359,  0.1289, -0.3273, -0.1755, -0.3796,  0.1514,
         -0.2080,  0.4252,  0.1097,  0.2704],
        [-0.0137,  0.1504, -0.2537, -0.4160,  0.1492, -0.2969,  0.0397, -0.0317,
          0.1794,  0.4852,  0.2180,  0.3335, -0.6271,  0.0311, -0.4816,  0.1684,
          0.0594,  0.4597, -0.0965, -0.1409]], grad_fn=<AddmmBackward0>)


After ReLU: tensor([[0.1788, 0.4263, 0.0000, 0.0000, 0.4450, 0.0000, 0.1791, 0.0007, 0.4072,
         0.3851, 0.0000, 0.2442, 0.0000, 0.1168, 0.0000, 0.0526, 0.0000, 0.2876,
         0.0000, 0.0000],
        [0.4297, 0.3652, 0.1637, 0.0000, 0.1047, 0.0000, 0.0000, 0.2297, 0.3424,
         0.1223, 0.2359, 0.1289, 0.0000, 0.0000, 0.00
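
On a small hand-picked input the effect is easier to read: ReLU computes $\max(0, x)$ element-wise. The sketch below also applies two other built-in activations, `nn.Tanh` and `nn.Sigmoid`, for comparison; the input values are arbitrary.

```python
import torch
from torch import nn

x = torch.tensor([-1.5, -0.2, 0.0, 0.3, 2.0])

relu_out = nn.ReLU()(x)
print(relu_out)  # negative entries become 0; 0.3 and 2.0 pass through

# ReLU is the same as clamping at zero from below.
print(torch.equal(relu_out, torch.clamp(x, min=0)))

# Other activations squash the input instead of clipping it.
tanh_out = nn.Tanh()(x)        # values in (-1, 1)
sigmoid_out = nn.Sigmoid()(x)  # values in (0, 1)
print(tanh_out)
print(sigmoid_out)
```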

### nn.Softmax

The last linear layer of the neural network returns logits, raw values in $(-\infty, \infty)$, which are passed to the nn.Softmax module. The logits are scaled to values in $[0, 1]$ representing the model’s predicted probabilities for each class. The `dim` parameter indicates the dimension along which the values must sum to 1.

In [21]:
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
pred_probab

tensor([[0.1104, 0.0905, 0.0868, 0.1055, 0.0861, 0.1160, 0.0944, 0.1024, 0.1034,
         0.1045]], grad_fn=<SoftmaxBackward0>)
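
The scaling can be written out by hand as $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below uses made-up logits for two samples and three classes, and compares the manual formula with `nn.Softmax`.

```python
import torch
from torch import nn

# Hypothetical logits: 2 samples, 3 classes.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 3.0, -1.0]])

probs = nn.Softmax(dim=1)(logits)
manual = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)

print(probs.sum(dim=1))              # each row sums to 1
print(torch.allclose(probs, manual))

# Softmax preserves ordering, so the predicted class is unchanged.
print(probs.argmax(dim=1), logits.argmax(dim=1))
```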

### Model Parameters

Many layers inside a neural network are parameterized, i.e., they have associated weights and biases that are optimized during training. Subclassing nn.Module automatically tracks all fields defined inside your model object, and makes all parameters accessible using your model’s parameters() or named_parameters() methods.

In this example, we iterate over each parameter, and print its size and a preview of its values.

In [22]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Model structure: NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[ 0.0070,  0.0112,  0.0121,  ..., -0.0317,  0.0171, -0.0028],
        [-0.0024, -0.0303,  0.0152,  ...,  0.0244,  0.0247, -0.0324]],
       grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([-0.0054,  0.0305], grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[ 0.0313, -0.0302, -0.0215,  ...,  0.0061,  0.0328, -0.0425],
        [-0.0018,  0.0051,  0.0399,  ...,  0.0174,  0.0170, -0.0050]],
       grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.bias | 
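
A common use of `parameters()` is counting how many values a model has to learn. The sketch below rebuilds the same layer stack as a plain `nn.Sequential` so that it runs on its own; the name `stack` is illustrative.

```python
from torch import nn

# Same layers as linear_relu_stack in the model above.
stack = nn.Sequential(
    nn.Linear(28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

for name, param in stack.named_parameters():
    print(f"{name}: {tuple(param.shape)} -> {param.numel()} values")

total = sum(p.numel() for p in stack.parameters())
trainable = sum(p.numel() for p in stack.parameters() if p.requires_grad)
print(total, trainable)  # 669706 669706
```

The total breaks down as $(784 \cdot 512 + 512) + (512 \cdot 512 + 512) + (512 \cdot 10 + 10) = 669706$. The ReLU layers contribute no parameters.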

### Automatic Differentiation

When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:

![image info](../images/comp-graph.png)

In [3]:
x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
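
Each tensor produced by an operation on tracked inputs records the function that created it in its `grad_fn` attribute, which is how autograd knows how to walk the graph backward. The sketch below repeats the setup so that it is self-contained, and inspects a few attributes.

```python
import torch

x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

# w and b are leaves created by the user; z and loss are computed nodes.
print(w.is_leaf, b.is_leaf, z.is_leaf, loss.is_leaf)

# x was created without requires_grad, so no gradient is tracked for it.
print(x.requires_grad, z.requires_grad)
```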

### Computing Gradients

To optimize the parameters of the neural network, we need to compute the derivatives of our loss function with respect to those parameters, namely $\frac{\partial \text{loss}}{\partial w}$ and $\frac{\partial \text{loss}}{\partial b}$, under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad:

In [24]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.3055, 0.0330, 0.1962],
        [0.3055, 0.0330, 0.1962],
        [0.3055, 0.0330, 0.1962],
        [0.3055, 0.0330, 0.1962],
        [0.3055, 0.0330, 0.1962]])
tensor([0.3055, 0.0330, 0.1962])
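
These values can be checked against a hand derivation. For binary cross-entropy with logits and the default mean reduction over $N$ outputs, $\frac{\partial \text{loss}}{\partial z} = \frac{\sigma(z) - y}{N}$, so $\frac{\partial \text{loss}}{\partial b}$ equals that vector and $\frac{\partial \text{loss}}{\partial w}$ is the outer product of $x$ with it. The sketch below repeats the setup with a fixed seed; the `*_manual` names are illustrative.

```python
import torch

torch.manual_seed(0)

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

with torch.no_grad():
    dz = (torch.sigmoid(z) - y) / y.numel()  # d loss / d z
    b_grad_manual = dz                       # z depends on b with coefficient 1
    w_grad_manual = torch.outer(x, dz)       # d z_j / d w_ij = x_i

print(torch.allclose(w.grad, w_grad_manual, atol=1e-6))
print(torch.allclose(b.grad, b_grad_manual, atol=1e-6))
```

Since `x` is all ones, every row of `w.grad` equals `b.grad`, which matches the pattern in the output above.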


Note:

- We can only obtain the `grad` properties for the leaf nodes of the computational graph, which have `requires_grad` property set to `True`. For all other nodes in our graph, gradients will not be available.

- We can only perform gradient calculations using `backward` once on a given graph, for performance reasons. If we need to do several `backward` calls on the same graph, we need to pass `retain_graph=True` to the `backward` call.
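
The second note can be demonstrated on a tiny graph. The sketch below uses a made-up two-element parameter. It shows the failing second `backward` call, then the `retain_graph=True` variant, where gradients from repeated calls accumulate in `.grad`.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Without retain_graph, the saved buffers are freed after the first call.
loss = (w ** 2).sum()
loss.backward()
try:
    loss.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print(second_call_failed)

# With retain_graph=True on the first call, a second call is allowed.
w.grad.zero_()  # reset the gradient accumulated so far
loss = (w ** 2).sum()
loss.backward(retain_graph=True)
first = w.grad.clone()   # 2 * w
loss.backward()
second = w.grad.clone()  # gradients accumulate: 4 * w
print(first, second)
```

Because gradients accumulate across calls, training loops typically zero them between steps.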