```
Introduction to PyTorch
├── 1 What is PyTorch
│   ├── 1.1 The three core components of PyTorch
│   ├── 1.2 Defining deep learning
│   └── 1.3 Installing PyTorch
├── 2 Understanding tensors
│   ├── 2.1 Scalars, vectors, matrices, and tensors
│   ├── 2.2 Tensor data types
│   └── 2.3 Common PyTorch tensor operations
├── 3 Seeing models as computation graphs
├── 4 Automatic differentiation made easy
├── 5 Implementing multilayer neural networks
├── 6 Setting up efficient data loaders
├── 7 A typical training loop
└── 8 Saving and loading models
```

# Introduction to PyTorch

This lesson is designed to equip readers with the necessary skills and knowledge to implement neural networks and apply deep learning in practice. PyTorch, a popular Python-based deep learning library, serves as the primary tool. The lesson guides users through setting up a deep learning environment with PyTorch and GPU support.

It covers essential concepts such as tensors and their usage within PyTorch. The material also explores PyTorch's automatic differentiation engine, which enables efficient implementation of backpropagation for neural network training.

Designed as an introductory resource for newcomers to PyTorch, this lesson explains foundational concepts without providing exhaustive library coverage. The focus remains on core PyTorch fundamentals required for implementing neural networks.

## 1 What is PyTorch

*   **PyTorch:**
    *   Open-source Python deep learning library ([pytorch.org](https://pytorch.org/)).
    *   Widely used in research since 2019 ([Papers With Code](https://paperswithcode.com/trends)).
    *   Growing adoption: ~40% of respondents in Kaggle survey ([Kaggle Survey 2022](https://www.kaggle.com/c/kaggle-survey-2022)).
*   **Key Advantages:**
    *   Combines a user-friendly interface with efficient execution.
    *   Flexible for advanced customization.
    *   Balances usability and features for researchers and practitioners.

### 1.1 The three core components of PyTorch

PyTorch is a relatively comprehensive library, and one way to approach it is to focus on its three broad components, summarized in figure 1.

<img src="images2/Figure1.webp" width="600px">

> **Figure 1** PyTorch’s three main components include a tensor library as a fundamental building block for computing, automatic differentiation for model optimization, and deep learning utility functions, making it easier to implement and train deep neural network models.

*   **Tensor Library:**
    *   Extends NumPy with GPU acceleration.
    *   Seamless CPU/GPU switching.
*   **Autograd Engine:**
    *   Automatic differentiation for tensor operations.
    *   Simplifies backpropagation and model optimization.
*   **Deep Learning Library:**
    *   Modular, flexible, and efficient building blocks.
    *   Pretrained models, loss functions, and optimizers.
    *   Supports a wide range of deep learning models.
    *   Caters to researchers and developers.

### 1.2 Defining deep learning
Confused about AI, Machine Learning, Deep Learning, and LLMs? Let's clarify:
 
*   **AI (Artificial Intelligence):**
    *   Creating computer systems to perform tasks requiring human intelligence.
    *   Examples: Natural language understanding, pattern recognition, decision-making.
    *   Still far from achieving general intelligence.

*   **Machine Learning (ML):**
    *   A subfield of AI (see Figure 2).
    *   Focuses on developing learning algorithms.
    *   Enables computers to learn from data and make predictions without explicit programming.
    *   Involves algorithms that:
        *   Identify patterns.
        *   Learn from data.
        *   Improve with more data and feedback.


<img src="images2/Figure2.webp" width="500px">

> **Figure 2** Deep learning is a subcategory of machine learning focused on implementing deep neural networks. Machine learning is a subcategory of AI that is concerned with algorithms that learn from data. AI is the broader concept of machines being able to perform tasks that typically require human intelligence.


**Machine Learning (ML) in practice:**
*   Integral to AI evolution, powering advancements like LLMs.
*   Behind technologies:
    *   Recommendation systems.
    *   Spam filtering.
    *   Voice recognition.
    *   Self-driving cars.
*   Enhances AI capabilities: Adapts to new inputs, moving beyond rule-based systems.

**Deep Learning (DL):**
*   A subcategory of ML using deep neural networks.
*   Inspired by the human brain's neuron interconnections.
*   "Deep" refers to multiple hidden layers modeling complex relationships.
*   Excels at unstructured data (images, audio, text), which makes it well suited for LLMs.

**Predictive Modeling Workflow:**
*   Supervised learning in ML and DL (summarized in Figure 3).



<img src="images2/Figure3.webp" width="600px">

> **Figure 3** The supervised learning workflow for predictive modeling consists of a training stage where a model is trained on labeled examples in a training dataset. The trained model can then be used to predict the labels of new observations.


**Supervised Learning Workflow:**

*   **Training:**
    *   Model trained on a dataset of examples and labels.
    *   Example: Spam classifier (emails labeled as "spam" or "not spam").
*   **Inference:**
    *   Trained model predicts labels for new, unseen data.
    *   Example: Classifying new emails as "spam" or "not spam".
*   **Evaluation:**
    *   Model evaluation ensures performance criteria are met.

**LLMs Training:**

*   **Classification:** Similar workflow to Figure 3.
*   **Text Generation:**
    *   Labels derived from the text itself (next-word prediction).
    *   During inference, the LLM generates new text from a prompt.
    *   Figure 3 still applies.
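
To make "labels derived from the text itself" concrete, the following minimal sketch builds input/target pairs by shifting a token sequence by one position. The whitespace split and the variable names are illustrative assumptions; an actual LLM uses a proper tokenizer and integer token IDs.

```python
# Illustrative only: a whitespace split stands in for a real tokenizer.
text = "the model predicts the next word"
tokens = text.split()

# Inputs are all tokens except the last; targets are the same
# sequence shifted one position to the left.
inputs = tokens[:-1]
targets = tokens[1:]

# Each growing context is paired with the word that follows it.
for context_end, target in enumerate(targets, start=1):
    print(tokens[:context_end], "->", target)
```

Every target here comes from the text itself, so no manual labeling is needed, yet the setup still matches the labeled-example workflow in Figure 3.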

### 1.3 Installing PyTorch
PyTorch can be installed just like any other Python library or package. However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible code, the installation may require additional explanation.

For instance, there are two versions of PyTorch: a leaner version that only supports CPU computing and a full version that supports both CPU and GPU computing. If your machine has a CUDA-compatible GPU that can be used for deep learning (ideally, an NVIDIA T4, RTX 2080 Ti, or newer), I recommend installing the GPU version. Regardless, the default command for installing PyTorch in a code terminal is:

In [None]:
pip install torch

If your computer has a CUDA-compatible GPU, this command will automatically install the PyTorch version that supports GPU acceleration via CUDA, assuming the Python environment you’re working in has the necessary dependencies (like pip) installed.

To explicitly install the CUDA-compatible version of PyTorch, it’s often better to specify the CUDA version you want PyTorch to be compatible with. PyTorch’s official website (https://pytorch.org) provides the commands to install PyTorch with CUDA support for different operating systems. Figure 4 shows a command that installs PyTorch along with the torchvision and torchaudio libraries, which are optional for this lesson.


<img src="images2/Figure4.webp" width="700px">

> **Figure 4** Access the PyTorch installation recommendation on https://pytorch.org to customize and select the installation command for your system.

I use PyTorch 2.7.0 for the examples, so I recommend that you use the following command to install the exact version to guarantee compatibility with this lesson:

`pip install torch==2.7.0`

However, as mentioned earlier, given your operating system, the installation command might differ slightly from the one shown here. Thus, I recommend that you visit https://pytorch.org and use the installation menu (see figure 4) to select the installation command for your operating system. Remember to replace `torch` with `torch==2.7.0` in the command.

To check the version of PyTorch, execute the following code in Python:

In [1]:
import torch

print(torch.__version__)

2.7.0+cpu


After installing PyTorch, you can check whether your installation recognizes your built-in NVIDIA GPU by running the following code in Python:

In [None]:
print(torch.cuda.is_available())

False


If the command returns `True`, you are all set. If the command returns `False`, your computer may not have a compatible GPU, or PyTorch does not recognize it. While a GPU is not required for this lesson, it can significantly speed up deep learning–related computations.
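
A common way to write code that runs with or without a GPU is to pick the device once and move tensors to it. The snippet below is a minimal sketch of that pattern; the variable names `device` and `x` are illustrative choices.

```python
import torch

# Fall back to the CPU when no CUDA-capable GPU is visible to PyTorch.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# .to(device) returns the tensor on the chosen device
# (the same tensor if it already lives there).
x = torch.tensor([1.0, 2.0, 3.0]).to(device)
print(x.device)
```

On a machine where `torch.cuda.is_available()` returns `False`, this prints `cpu` and the rest of the code runs unchanged.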

If you don’t have access to a GPU, several cloud computing providers let users run GPU computations for an hourly fee. A popular Jupyter notebook–like environment is Google Colab (https://colab.research.google.com), which provides time-limited access to GPUs as of this writing. Using the Runtime menu, it is possible to select a GPU, as shown in the screenshot in figure 5.


<img src="images2/Figure5.webp" width="700px">

> **Figure 5** Select a GPU device for Google Colab under the Runtime/Change Runtime Type menu.


## 2 Understanding tensors

Tensors: Mathematical objects generalizing vectors and matrices to higher dimensions.

*   Characterized by their **order** (or rank) = number of dimensions.
*   Examples:
    *   Scalar (number): rank 0
    *   Vector: rank 1
    *   Matrix: rank 2
*   See Figure 6.

<img src="images2/Figure6.webp" width="700px">

> **Figure 6** Tensors with different ranks. Here 0D corresponds to rank 0, 1D to rank 1, and 2D to rank 2. A three-dimensional vector, which consists of three elements, is still a rank 1 tensor.
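
The distinction between rank and number of elements can be checked directly in PyTorch. In this short sketch, the `.ndim` attribute reports the rank and `.shape` reports the size along each dimension.

```python
import torch

scalar = torch.tensor(1)                   # rank 0
vector = torch.tensor([1, 2, 3])           # rank 1, three elements
matrix = torch.tensor([[1, 2], [3, 4]])    # rank 2

print(scalar.ndim, vector.ndim, matrix.ndim)
print(vector.shape)  # one dimension of size 3
```

The three-element vector has a single dimension of size 3, which is why it is still a rank 1 tensor.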

### Tensors: Data Containers and Array Libraries

*   **What are Tensors?**
    *   From a computational view, tensors are data containers.
    *   They hold multidimensional data (each dimension can represent a feature).

*   **Tensor Libraries (like PyTorch):**
    *   Efficiently create, manipulate, and compute with these arrays.
    *   Function essentially as array libraries.

*   **PyTorch Tensors vs. NumPy Arrays:**
    *   Similar to NumPy arrays.
    *   **Key additional features for Deep Learning:**
        *   Automatic differentiation engine (simplifies computing gradients - see section 4).
        *   Support for GPU computations (speeds up deep neural network training).

### 2.1 Scalars, vectors, matrices, and tensors

PyTorch Tensors: Data containers for array-like structures.

*   **Scalar:** 0-dimensional tensor (e.g., a number).
*   **Vector:** 1-dimensional tensor.
*   **Matrix:** 2-dimensional tensor.
*   Higher-dimensional tensors: Referred to as 3D tensor, 4D tensor, etc.
*   Creation: Use the `torch.tensor` function. See [PyTorch documentation](https://pytorch.org/docs/stable/tensors.html) for details.

In [3]:
import torch
import numpy as np

# create a 0D tensor (scalar) from a Python integer
tensor0d = torch.tensor(1)

# create a 1D tensor (vector) from a Python list
tensor1d = torch.tensor([1, 2, 3])

# create a 2D tensor from a nested Python list
tensor2d = torch.tensor([[1, 2], 
                         [3, 4]])

# create a 3D tensor from a nested Python list
tensor3d_1 = torch.tensor([[[1, 2], [3, 4]], 
                           [[5, 6], [7, 8]]])

# create a 3D tensor from NumPy array
ary3d = np.array([[[1, 2], [3, 4]], 
                  [[5, 6], [7, 8]]])
tensor3d_2 = torch.tensor(ary3d)  # Copies NumPy array
tensor3d_3 = torch.from_numpy(ary3d)  # Shares memory with NumPy array

In [4]:
ary3d[0, 0, 0] = 999
print(tensor3d_2) # remains unchanged

tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [5]:
print(tensor3d_3) # changes because of memory sharing

tensor([[[999,   2],
         [  3,   4]],

        [[  5,   6],
         [  7,   8]]])


> **Exercise 1:** Create the following tensors:
> 1. A 1D tensor (vector) containing the integers 5, 6, 7.
> 2. A 2x3 tensor (matrix) containing floating-point numbers of your choice. Check its `shape` and `dtype`.
> 3. A 3D tensor with shape (2, 2, 2) initialized with zeros, using `torch.zeros()`.

### 2.2 Tensor data types

PyTorch Tensor Data Types:
* Tensors created from Python integers use the 64-bit integer type (`torch.int64`) by default
* Access data type using the `.dtype` attribute
* Example: `tensor.dtype`

In [6]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)

torch.int64


If we create tensors from Python floats, PyTorch creates tensors with a 32-bit precision by default:

In [7]:
floatvec = torch.tensor([1.0, 2.0, 3.0])
print(floatvec.dtype)

torch.float32


#### Why PyTorch Uses 32-bit Precision by Default

* **Balance between precision and efficiency**
  * 32-bit float offers sufficient precision for most deep learning tasks
  * Consumes less memory than 64-bit floating-point numbers
  * Requires fewer computational resources

* **GPU Optimization**
  * GPU architectures are specifically optimized for 32-bit computations
  * Significantly speeds up model training and inference
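
The memory difference is easy to verify. The following sketch compares the bytes per element of a default (32-bit) float tensor with an explicitly 64-bit one, using the tensor method `element_size()`.

```python
import torch

x32 = torch.tensor([1.0, 2.0, 3.0])                       # float32 by default
x64 = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)  # explicit 64-bit

# element_size() returns the number of bytes used per element
print(x32.dtype, x32.element_size())
print(x64.dtype, x64.element_size())
```

A 64-bit tensor needs twice the memory of a 32-bit tensor with the same shape, which adds up quickly for models with many parameters.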

#### Changing Tensor Precision

You can change a tensor's precision using the `.to()` method:
* Example: Convert a 64-bit integer tensor to a 32-bit float tensor

In [8]:
floatvec = tensor1d.to(torch.float32)
print(floatvec.dtype)

torch.float32


For more information about different tensor data types available in PyTorch, check the official documentation at https://pytorch.org/docs/stable/tensors.html.

### 2.3 Common PyTorch tensor operations

Comprehensive coverage of all the different PyTorch tensor operations and commands is outside the scope of this lesson. However, I will briefly describe relevant operations as we introduce them throughout the class.

We have already introduced the `torch.tensor()` function to create new tensors:

In [9]:
tensor2d = torch.tensor([[1, 2, 3], 
                         [4, 5, 6]])
tensor2d

tensor([[1, 2, 3],
        [4, 5, 6]])

In addition, the `.shape` attribute allows us to access the shape of a tensor:

In [10]:
tensor2d.shape

torch.Size([2, 3])

As you can see, `.shape` returns `torch.Size([2, 3])`, meaning the tensor has two rows and three columns. To reshape the tensor into a 3 × 2 tensor, we can use the `.reshape` method:

In [11]:
tensor2d.reshape(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

However, note that the more common command for reshaping tensors in PyTorch is `.view()`:

In [12]:
tensor2d.view(3, 2)

tensor([[1, 2],
        [3, 4],
        [5, 6]])

#### PyTorch Syntax Options

* PyTorch often provides multiple syntax options for the same operation
  * Initially followed Lua Torch conventions
  * Later added NumPy-like syntax by popular demand

* `.view()` vs `.reshape()`:
  * **`.view()`**: Requires contiguous data (elements stored sequentially in memory); fails otherwise
  * **`.reshape()`**: Works with any data; copies if necessary to achieve desired shape

#### Tensor Transposition

* Use `.T` to transpose a tensor (flip across diagonal)
* Transposition ≠ Reshaping
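  * Transposition swaps rows and columns, so the element at position (i, j) moves to (j, i); reshaping keeps the elements in their original order and only regroups them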
  * See example below:

In [13]:
tensor2d.T

tensor([[1, 4],
        [2, 5],
        [3, 6]])
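
A transposed tensor is also a convenient way to see the difference between `.view()` and `.reshape()` described earlier, because transposing produces a tensor whose elements are no longer stored contiguously. The sketch below assumes the same 2 × 3 example tensor; the names `t_T`, `view_failed`, and `flat` are illustrative.

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
t_T = t.T  # same underlying data, but no longer contiguous

print(t_T.is_contiguous())

# .view() cannot flatten the non-contiguous tensor and raises an error
try:
    t_T.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print("view failed:", view_failed)

# .reshape() falls back to copying the data when a view is not possible
flat = t_T.reshape(6)
print(flat)
```

In cases like this, calling `.contiguous()` before `.view()` is another option; `.reshape()` simply handles that decision for you.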

Lastly, the common way to multiply two matrices in PyTorch is the `.matmul` method:

In [14]:
tensor2d.matmul(tensor2d.T)

tensor([[14, 32],
        [32, 77]])

However, we can also adopt the `@` operator, which accomplishes the same thing more compactly:

In [15]:
tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])

For readers who’d like to browse through all the different tensor operations available in PyTorch (we won’t need most of these), I recommend checking out the official documentation at https://pytorch.org/docs/stable/tensors.html.

> **Exercise 2:** Given the following tensors:
> ```python
> t1 = torch.tensor([[1., 2.], [3., 4.]])
> t2 = torch.tensor([[5., 6.], [7., 8.]])
> ```
> Perform these operations:
> 1. Reshape `t1` into a 4x1 tensor.
> 2. Transpose `t2` using `.T`.
> 3. Calculate the element-wise sum of `t1` and `t2`.
> 4. Calculate the matrix multiplication of `t1` and `t2` using the `@` operator.

## 3 Seeing models as computation graphs

#### PyTorch's Automatic Differentiation (Autograd)

* **Autograd**: PyTorch's engine that automatically computes gradients in dynamic computational graphs
* **Key features**:
  * Tracks operations on tensors
  * Calculates gradients efficiently
  * Enables backpropagation for neural network training

#### Computational Graphs

* **Definition**: Directed graphs representing mathematical expressions
* **In deep learning**: 
  * Visualize the sequence of calculations in neural networks
  * Essential for computing gradients during backpropagation
  * Form the foundation of model training algorithms

#### Example: Logistic Regression

* We'll examine a simple logistic regression classifier (single-layer neural network)
* **Characteristics**:
  * Produces scores between 0 and 1
  * Compares predictions to true class labels (0 or 1)
  * Demonstrates how computation flows through a graph structure

In [16]:
import torch.nn.functional as F

y = torch.tensor([1.0])  # true label
x1 = torch.tensor([1.1]) # input feature
w1 = torch.tensor([2.2]) # weight parameter
b = torch.tensor([0.0])  # bias unit

z = x1 * w1 + b          # net input
a = torch.sigmoid(z)     # activation & output

loss = F.binary_cross_entropy(a, y)
print(loss)

tensor(0.0852)


If not all components in the preceding code make sense to you, don’t worry. The point of this example is not to implement a logistic regression classifier but rather to illustrate how we can think of a sequence of computations as a computation graph, as shown in figure 7.

<img src="images2/Figure7.webp" width="700px">

> **Figure 7:** A logistic regression forward pass as a computation graph. The input feature $x_1$ is multiplied by a model weight $w_1$ and passed through an activation function $\sigma$ after adding the bias. The loss is computed by comparing the model output $a$ with a given label $y$.

*   PyTorch builds a computation graph in the background.
*   This graph is used to calculate gradients of the loss function w.r.t. model parameters (e.g., `w1` and `b`).
*   Gradients are essential for training the model.
    *   Used in optimization algorithms like gradient descent.

> **Note:** A loss function is a mathematical measure used to quantify the difference between the predicted output of a model and the actual target values, guiding the optimization process to improve model accuracy by minimizing this difference.
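
As a concrete instance of such a measure, the binary cross-entropy used in the logistic regression example can be written out by hand. The sketch below recomputes the same forward pass and compares a manual formula with `F.binary_cross_entropy`; the names `manual_loss` and `builtin_loss` are illustrative.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])   # true label
x1 = torch.tensor([1.1])  # input feature
w1 = torch.tensor([2.2])  # weight parameter
b = torch.tensor([0.0])   # bias unit

a = torch.sigmoid(x1 * w1 + b)

# Binary cross-entropy written out: -[y*log(a) + (1-y)*log(1-a)],
# averaged over the examples (here there is only one)
manual_loss = -(y * torch.log(a) + (1 - y) * torch.log(1 - a)).mean()
builtin_loss = F.binary_cross_entropy(a, y)

print(manual_loss, builtin_loss)
```

Both values agree with the loss printed earlier: the closer the predicted probability `a` is to the label `y`, the smaller the loss.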

## 4 Automatic differentiation made easy

*   PyTorch builds a computational graph automatically when `requires_grad=True` for a tensor.
*   This is essential for computing gradients.
*   Gradients are crucial for training neural networks using backpropagation.
    *   Backpropagation is an application of the chain rule (see [Figure 8](images2/Figure8.webp)).

<img src="images2/Figure8.webp" width="700px">

> **Figure 8:** The most common way of computing the loss gradients in a computation graph involves applying the chain rule from right to left, also called reverse-mode automatic differentiation or backpropagation. We start from the output layer (or the loss itself) and work backward through the network to the input layer. We do this to compute the gradient of the loss with respect to each parameter (weights and biases) in the network, which informs how we update these parameters during training.

### Partial derivatives and gradients
*   **Partial Derivatives:** Measure the rate at which a function changes with respect to one of its variables.
*   **Gradients:** A vector containing all partial derivatives of a multivariate function.

*   **Calculus Concepts (Simplified):**
    *   Don't worry if you're unfamiliar with partial derivatives, gradients, or the chain rule.
    *   **Chain Rule (High-level):** A method to compute gradients of a loss function w.r.t. model parameters in a computation graph.
        *   Provides information to update parameters to minimize the loss (e.g., using gradient descent).
        *   Training loop implementation revisited in Section 7.

*   **PyTorch Autograd Engine:**
    *   The second core component of PyTorch.
    *   Constructs a computational graph in the background by tracking every operation on tensors.
    *   Enables automatic gradient computation.
    *   Example: Use the `torch.autograd.grad` function to compute gradients (e.g., `grad(loss, w1)`), as shown in the following listing.

In [17]:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b 
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

# retain_graph=True tells PyTorch to keep the computation graph in memory
# after computing gradients. This allows us to compute multiple gradients
# from the same graph. Without it, the graph would be freed after the first
# gradient computation, making the second one impossible.
grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)

(tensor([-0.0898]),)
(tensor([-0.0817]),)
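
These numbers can be checked against the chain rule by hand. For a sigmoid activation combined with binary cross-entropy, the derivative of the loss with respect to the net input simplifies to `a - y` (a standard result), and the net input `z = x1 * w1 + b` has derivatives `x1` and `1` with respect to `w1` and `b`. The following sketch compares that manual calculation with autograd; the `manual_*` and `auto_*` names are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)

# Passing a list of inputs returns one gradient per input in a single call
auto_w1, auto_b = grad(loss, [w1, b])

# Chain rule by hand: dL/dz = a - y, dz/dw1 = x1, dz/db = 1
with torch.no_grad():
    manual_w1 = (a - y) * x1
    manual_b = a - y

print(auto_w1, manual_w1)
print(auto_b, manual_b)
```

Requesting both gradients in one `grad` call is an alternative to calling it twice with `retain_graph=True`.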


Using `grad` manually:
*   Useful for experimentation, debugging, and demonstrating concepts.

In practice, PyTorch automates this:
*   Call `.backward()` on the loss tensor.
*   PyTorch computes gradients for all leaf nodes (tensors with `requires_grad=True`).
*   Gradients are stored in the `.grad` attribute of these tensors.

In [18]:
loss.backward()

print(w1.grad)
print(b.grad)

tensor([-0.0898])
tensor([-0.0817])


#### Autograd Summary:

*   Autograd explained using calculus concepts.
*   Don't be overwhelmed by the math details.
*   Key takeaway: PyTorch handles calculus automatically.
*   This is done via the `.backward()` method.
*   No need to compute derivatives or gradients by hand.
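
To show what the gradients are used for, here is a minimal sketch of a single gradient descent step on the logistic regression example. The learning rate `lr = 0.1` is a hypothetical value chosen for illustration; in practice this update is handled by an optimizer.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

loss_before = F.binary_cross_entropy(torch.sigmoid(x1 * w1 + b), y)
loss_before.backward()  # fills w1.grad and b.grad

lr = 0.1  # hypothetical learning rate
with torch.no_grad():   # the update itself should not be tracked by autograd
    w1 -= lr * w1.grad  # move each parameter against its gradient
    b -= lr * b.grad

loss_after = F.binary_cross_entropy(torch.sigmoid(x1 * w1 + b), y)
print(loss_before.item(), loss_after.item())
```

The loss after the update is slightly smaller than before, which is the effect that repeated updates in a training loop build on.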

> **Exercise 3:** 
> 1. Define `x = torch.tensor(2.0, requires_grad=True)`.
> 2. Calculate `y = 3*x**2 + 5`.
> 3. Compute the gradient of `y` with respect to `x` (dy/dx) using `y.backward()`. Print the result stored in `x.grad`. (Expected result: 12)

## 5 Implementing multilayer neural networks

Focus: PyTorch for Deep Neural Networks

*   Implementing NNs in PyTorch
*   Concrete example: Multilayer Perceptron (Fully Connected NN)
*   Illustrated in Figure 9

<img src="images2/Figure9.webp" width="600px">

> **Figure 9:** A multilayer perceptron with two hidden layers. Each node represents a unit in the respective layer. For illustration purposes, each layer has a very small number of nodes.

#### Implementing Neural Networks in PyTorch:

*   **Subclass `torch.nn.Module`**: Define custom network architectures by inheriting from this base class.
*   **`torch.nn.Module` Benefits**:
    *   Provides essential functionality for building and training models.
    *   Encapsulates layers and operations.
    *   Automatically tracks model parameters.
*   **Define Layers in `__init__`**: Set up the network's layers (e.g., Linear, ReLU) in the constructor.
*   **Define Forward Pass in `forward`**: Specify how data flows through the layers to create the computation graph.
*   **`backward` Method**:
    *   Used during training to compute gradients.
    *   Typically **do not** need to implement yourself (PyTorch handles this via autograd, see section 7).
*   **Example**: The following code illustrates a typical usage with a multilayer perceptron.

In [19]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()

        self.layers = torch.nn.Sequential(
                
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),

            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits

We can then instantiate a new neural network object as follows:

In [20]:
model = NeuralNetwork(50, 3)

Before using this new model object, we can call print on the model to see a summary of its structure:

In [21]:
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


*   **Using `torch.nn.Sequential`**:
    *   Not strictly required, but simplifies code for sequential layers.
    *   Allows calling `self.layers` in `forward` instead of individual layers.

*   **Next**: Check the total number of trainable parameters.

In [22]:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213
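
The total can be reproduced by hand: each `Linear` layer contributes `in_features * out_features` weights plus `out_features` biases. The sketch below rebuilds the same layer stack directly with `torch.nn.Sequential` (an equivalent stand-in for the `NeuralNetwork(50, 3)` instance) and lists the count per parameter tensor.

```python
import torch

layers = torch.nn.Sequential(
    torch.nn.Linear(50, 30), torch.nn.ReLU(),
    torch.nn.Linear(30, 20), torch.nn.ReLU(),
    torch.nn.Linear(20, 3),
)

# named_parameters() yields (name, tensor) pairs for weights and biases
total = 0
for name, p in layers.named_parameters():
    print(name, tuple(p.shape), p.numel())
    total += p.numel()

# weights + biases for each of the three linear layers
by_hand = (50 * 30 + 30) + (30 * 20 + 20) + (20 * 3 + 3)
print(total, by_hand)
```

The `ReLU` layers do not appear in the listing because they have no trainable parameters.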


*   **Trainable Parameters**:
    *   Parameters with `requires_grad=True` are trainable and updated during training (see section 7).
    *   In our neural network, these are found in the `torch.nn.Linear` layers.
    *   `torch.nn.Linear` layers (also known as feedforward or fully connected) multiply inputs by a weight matrix and add a bias vector.

*   **Accessing Parameters**:
    *   The first `Linear` layer is at index 0 in `model.layers`.
    *   Access its weight matrix using: `model.layers[0].weight`.

In [23]:
print(model.layers[0].weight)

Parameter containing:
tensor([[ 0.1182,  0.0606, -0.1292,  ..., -0.1126,  0.0735, -0.0597],
        [-0.0249,  0.0154, -0.0476,  ..., -0.1001, -0.1288,  0.1295],
        [ 0.0641,  0.0018, -0.0367,  ..., -0.0990, -0.0424, -0.0043],
        ...,
        [ 0.0618,  0.0867,  0.1361,  ..., -0.0254,  0.0399,  0.1006],
        [ 0.0842, -0.0512, -0.0960,  ..., -0.1091,  0.1242, -0.0428],
        [ 0.0518, -0.1390, -0.0923,  ..., -0.0954, -0.0668, -0.0037]],
       requires_grad=True)


Since this large matrix is not shown in its entirety, let’s use the `.shape` attribute to show its dimensions:

In [None]:
print(model.layers[0].weight.shape)

torch.Size([30, 50])

*   Accessing Bias: Similar to weights, use `model.layers[0].bias`.
*   Weight Matrix Details:
    *   Dimensions: 30x50.
    *   `requires_grad=True` (default): Entries are trainable.
*   Weight Initialization:
    *   Initialized with small random numbers (differs each time).
    *   **Purpose:** Break symmetry during training for effective learning.
*   Reproducibility:
    *   Make initialization reproducible by seeding the random number generator.
    *   Use `torch.manual_seed()`.

In [24]:
torch.manual_seed(123)

model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)
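
The effect of seeding can be verified directly. In this sketch, two layers created right after the same `torch.manual_seed` call receive identical initial weights, while a third layer created without reseeding does not. A single `Linear` layer stands in for the full model to keep the example short.

```python
import torch

torch.manual_seed(123)
layer_a = torch.nn.Linear(50, 30)

torch.manual_seed(123)  # reset the generator to the same state
layer_b = torch.nn.Linear(50, 30)

layer_c = torch.nn.Linear(50, 30)  # created without reseeding

same_seed = torch.equal(layer_a.weight, layer_b.weight)
no_reseed = torch.equal(layer_a.weight, layer_c.weight)
print(same_seed, no_reseed)
```

Seeding fixes the sequence of random numbers, so the seed must be set immediately before each initialization that should be reproduced.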


Now that we have spent some time inspecting the NeuralNetwork instance, let’s briefly see how it’s used via the forward pass:

In [26]:
torch.manual_seed(123)

X = torch.rand((1, 50))
out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


#### Forward Pass in Neural Networks

- **Input Generation**: We created a random 50-dimensional feature vector as input
- **Model Execution**: Calling `model(X)` automatically executes the forward pass
- **Forward Pass Definition**: Process of calculating outputs by passing inputs through all network layers
- **Output Interpretation**: The three returned values are scores for each output node

#### Understanding `grad_fn`

- **Purpose**: Records the last function used to compute the tensor in the computation graph
- **Example**: `grad_fn=<AddmmBackward0>` shows tensor was created via:
  - Matrix multiplication (mm)
  - Addition operation (Add)
- **Usage**: PyTorch uses this information during backpropagation

#### Optimizing Inference

- **Problem**: Tracking gradients during inference is wasteful
- **Solution**: Use `torch.no_grad()` context manager
- **Benefits**:
  - Prevents unnecessary computation
  - Reduces memory consumption
  - Improves inference speed

In [27]:
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])
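
The difference between the two outputs can also be inspected programmatically. In the sketch below, a single `Linear` layer stands in for the full model (an assumption to keep the example self-contained); the values are the same with and without `torch.no_grad()`, but only the tracked output carries a `grad_fn`.

```python
import torch

torch.manual_seed(123)
layer = torch.nn.Linear(50, 3)  # stand-in for the full model
X = torch.rand((1, 50))

out_tracked = layer(X)          # graph is recorded
with torch.no_grad():
    out_untracked = layer(X)    # no graph is recorded

print(out_tracked.requires_grad, out_tracked.grad_fn)
print(out_untracked.requires_grad, out_untracked.grad_fn)
```

Skipping the graph construction is what saves memory and computation during inference.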


#### PyTorch Model Outputs

* **Default Behavior**: Models return raw logits (last layer outputs without activation)
* **Reason**: PyTorch loss functions (like `CrossEntropyLoss`) internally combine:
  * Softmax (or sigmoid for binary classification)
  * Negative log-likelihood loss
* **Benefits**:
  * Numerical stability
  * Computational efficiency
* **Important**: To get probability distributions, explicitly apply softmax:

In [28]:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)

tensor([[0.3113, 0.3934, 0.2952]])


* **Output Interpretation (Post-Softmax):**
    * Values represent class-membership probabilities.
    * Probabilities sum to 1.
* **Observation (Untrained Model):**
    * Probabilities are roughly equal for the random input.
    * This is expected for a randomly initialized model before training.
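
The statement that the loss function combines softmax with the negative log-likelihood loss can be checked with the functional API. The sketch below reuses the logits printed above and assumes, purely for illustration, that the true class is 1.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-0.1262, 0.1080, -0.1792]])  # values from the output above
target = torch.tensor([1])                           # hypothetical true class

# One step: cross-entropy applied directly to the raw logits
loss_direct = F.cross_entropy(logits, target)

# Two steps: log-softmax followed by negative log-likelihood
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(loss_direct, loss_two_step)
```

Both routes give the same value, which is why the model should return raw logits during training and apply softmax only when probabilities are needed.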

> **Exercise 4:** Modify the `NeuralNetwork` class:
> 1. Change the number of units in the first hidden layer from 30 to 10.
> 2. Instantiate the modified network with `num_inputs=50` and `num_outputs=3`.
> 3. Print the modified model structure.

## 6 Setting up efficient data loaders

#### Efficient Data Loaders in PyTorch

*   Crucial for training deep learning models.
*   Iterated over during training.
*   Overall idea behind data loading in PyTorch is illustrated in [Figure 10](images2/Figure10.webp).

<img src="images2/Figure10.webp" width="700px">

> **Figure 10:** PyTorch implements a `Dataset` and a `DataLoader` class. The `Dataset` class is used to instantiate objects that define how each data record is loaded. The `DataLoader` handles how the data is shuffled and assembled into batches.

*   **Implementing Custom Dataset & Data Loaders**
    *   Following [Figure 10](images2/Figure10.webp), we'll implement a custom `Dataset` class.
    *   This class will be used to create training and test datasets.
    *   These datasets will then feed into data loaders.

*   **Creating a Simple Toy Dataset**
    *   Let's start by creating a toy dataset for demonstration.
    *   **Training Data:**
        *   5 examples, 2 features each.
        *   Labels: 3 for class 0, 2 for class 1.
    *   **Test Data:**
        *   2 examples, 2 features each.
        *   Labels: 1 for class 0, 1 for class 1.

In [29]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

In [30]:
X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

> **Note:** PyTorch requires that class labels start with label 0, and the largest class label value should not exceed the number of output nodes minus 1 (since Python index counting starts at zero). So, if we have class labels 0, 1, 2, 3, and 4, the neural network output layer should consist of five nodes.

Next, we create a custom dataset class, ToyDataset, by subclassing from PyTorch’s Dataset parent class, as shown in the following cell.

In [31]:
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]        
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

The custom `ToyDataset` class is primarily used to instantiate a PyTorch `DataLoader`.

Before using the `DataLoader`, let's review the structure of the `ToyDataset`:

*   **Main Components:** A custom `Dataset` class in PyTorch typically requires three methods:
    *   `__init__`: Constructor to set up attributes (like data, file paths, etc.). In our case, we store the `X` and `y` tensors.
    *   `__getitem__(index)`: Defines how to retrieve a single data item (features and label) given an `index`. The `DataLoader` will provide this index.
    *   `__len__`: Returns the total number of items in the dataset. We use `.shape[0]` on the labels tensor to get the number of rows.

*   **Example Usage:**
    *   In `__init__`, we assign `X` to `self.features` and `y` to `self.labels`.
    *   In `__getitem__`, we return `self.features[index]` and `self.labels[index]`.
    *   In `__len__`, we return `self.labels.shape[0]`.

*   **Verification:** We can double-check the length of the training dataset using `len(train_ds)`.

In [32]:
len(train_ds)

5
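
To see how `__getitem__` and `__len__` are invoked, a dataset object can be indexed directly, without a `DataLoader`. The following sketch repeats the class definition so that it runs on its own; the two-example tensors are illustrative values, not the full training set.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    # Same structure as the class defined above, repeated so this cell runs alone.
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]


ds = ToyDataset(
    torch.tensor([[-1.2, 3.1], [2.3, -1.1]]),
    torch.tensor([0, 1]),
)

x0, y0 = ds[0]   # square brackets call __getitem__(0)
print(x0, y0)
print(len(ds))   # len() calls __len__()
```

Indexing returns one (features, label) pair; assembling such pairs into batches is the job of the `DataLoader`.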

Now that we’ve defined a `Dataset` class for our toy dataset,\
we can use PyTorch’s `DataLoader` class to sample from it, as shown in the following cell.

In [33]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)

In [34]:
test_ds = ToyDataset(X_test, y_test)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

After instantiating the training data loader, we can iterate over it. Iterating over `test_loader` works the same way and is omitted for brevity:

In [35]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])


*   **Training Epoch:** Iterating over the training dataset once, visiting each example exactly once.
*   **Shuffling:**
    *   `shuffle=True` randomizes example order each epoch.
    *   `torch.manual_seed(123)` makes the shuffle order reproducible when the code is rerun from this point.
    *   Later epochs use a different order than the first, which is desired: it keeps the model from seeing the same batch sequence in every epoch.
*   **Batch Size & `drop_last`:**
    *   Batch size 2 with 5 examples results in a last batch of size 1 (5 % 2 != 0).
    *   A much smaller last batch gives a noisier gradient estimate and can disturb convergence during training.
    *   Set `drop_last=True` to discard the last incomplete batch.
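
The shuffling behavior can be sketched with a dedicated random generator. This example assumes a stand-in `TensorDataset` holding the indices 0 to 4 and a hypothetical helper, `epoch_orders`; passing `generator=` to `DataLoader` is an alternative to the global `torch.manual_seed` call used in this lesson, not the approach taken in the cells above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def epoch_orders(seed, num_epochs=2):
    """Hypothetical helper: record the example order seen in each epoch."""
    generator = torch.Generator()
    generator.manual_seed(seed)

    # Stand-in dataset whose "features" are simply the example indices 0..4.
    ds = TensorDataset(torch.arange(5))
    loader = DataLoader(ds, batch_size=2, shuffle=True, generator=generator)

    orders = []
    for _ in range(num_epochs):
        order = []
        for (batch,) in loader:          # each batch is a 1-tuple of tensors
            order.extend(batch.tolist())
        orders.append(order)
    return orders


run_a = epoch_orders(seed=123)
run_b = epoch_orders(seed=123)

print(run_a)            # one shuffled order per epoch
print(run_a == run_b)   # same seed, same sequence of orders
```

Every epoch still visits each of the five examples exactly once; only the order changes.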



In [36]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

Now, iterating over the training loader, we can see that the last batch is omitted:

In [37]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
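
The number of batches per epoch follows directly from the dataset size and the batch size. The sketch below uses the built-in `TensorDataset` with placeholder zeros as a stand-in for `ToyDataset` and compares `len(loader)` with the corresponding arithmetic.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

num_examples, batch_size = 5, 2

# TensorDataset is a built-in stand-in here for the custom ToyDataset.
ds = TensorDataset(torch.zeros(num_examples, 2), torch.zeros(num_examples))

keep_last = DataLoader(ds, batch_size=batch_size, drop_last=False)
drop_last = DataLoader(ds, batch_size=batch_size, drop_last=True)

print(len(keep_last), math.ceil(num_examples / batch_size))  # 3 3
print(len(drop_last), num_examples // batch_size)            # 2 2
```

With `drop_last=True`, one example is skipped in each epoch; because the order is reshuffled, it is generally a different example each time.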


#### Understanding `num_workers` in `DataLoader`

*   **`num_workers=0`:**
    *   Data loading happens in the **main process**.
    *   Can cause a **bottleneck** during training, especially with GPUs.
    *   CPU is busy loading data, while the GPU waits idle.

*   **`num_workers > 0`:**
    *   Launches **multiple worker processes** for parallel data loading.
    *   Frees the main process to focus on model training.
    *   Better utilizes system resources (CPU and GPU).
    *   See [Figure 11](images2/Figure11.webp) for illustration.


<img src="images2/Figure11.webp" width="700px">

> **Figure 11:** Loading data without multiple workers (setting `num_workers=0`) will create a data loading bottleneck where the model sits idle until the next batch is loaded (left). If multiple workers are enabled, the data loader can queue up the next batch in the background (right).

#### Considerations for `num_workers > 0`

*   **Small Datasets:**
    *   May not provide speedup.
    *   Can add overhead (spinning up processes).
    *   Total training time is already very short.

*   **Interactive Environments (e.g., Jupyter):**
    *   Can lead to issues (resource sharing, crashes).

*   **General Advice:**
    *   It's a tradeoff.
    *   Adapt to dataset size and computational environment.

*   **Practical Tip:**
    *   `num_workers=4` often works well for real-world datasets.
    *   Optimal setting depends on hardware and `Dataset` implementation.
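
Since the best value depends on the hardware and the dataset, one option is to time a full pass over the loader for a few candidate settings. The sketch below assumes synthetic stand-in data and a hypothetical helper, `seconds_per_pass`; it only measures `num_workers=0` so that it runs safely in any environment.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def seconds_per_pass(dataset, num_workers, batch_size=64):
    """Hypothetical helper: time one full pass over the dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    num_batches = 0
    for _ in loader:
        num_batches += 1
    return time.perf_counter() - start, num_batches


# Synthetic stand-in data; a real benchmark would use the actual Dataset.
ds = TensorDataset(torch.randn(1024, 2), torch.zeros(1024, dtype=torch.long))

elapsed, num_batches = seconds_per_pass(ds, num_workers=0)
print(f"num_workers=0: {elapsed:.4f} s for {num_batches} batches")
```

To compare settings, the same helper can be called with `num_workers=2` or `num_workers=4`. In a standalone script, such calls are usually placed under an `if __name__ == "__main__":` guard, because worker processes may re-import the script on platforms that start new processes by spawning.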

> **Exercise 5:** 
> - Create a new `DataLoader` for the `train_ds` with a `batch_size` of 3. Iterate through this new loader and print the shape of the features (`x`) and labels (`y`) for each batch. Observe how the batches are formed.

## 7. A typical training loop

Let’s now train a neural network on the toy dataset. The following cell shows the training code.

In [38]:
import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):
    
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)
        
        loss = F.cross_entropy(logits, labels) # Loss function
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00


-   **Training Outcome:** The loss dropped to 0.00 (rounded to two decimals) by the end of the third epoch, indicating that the model converged on the training set.
-   **Model Initialization:** Model initialized with 2 inputs and 2 outputs, matching the toy dataset's features and class labels.
-   **Optimizer:** Used Stochastic Gradient Descent (SGD) with a learning rate (`lr`) of 0.5.
-   **Hyperparameters:**
    -   Learning rate (`lr`) and number of epochs are tunable hyperparameters.
    -   Choose them by experimenting and observing loss convergence.
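
To make the role of the learning rate concrete, the sketch below compares a hand-written update with one `torch.optim.SGD` step on a toy parameter. It assumes plain SGD with default settings (no momentum or weight decay) and an illustrative sum-of-squares loss that is unrelated to the model above.

```python
import torch

lr = 0.5

# Two copies of the same parameter: one updated by hand, one by the optimizer.
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=lr)

# Same toy loss for both: sum of squares, so the gradient is 2 * w.
(w_manual ** 2).sum().backward()
(w_optim ** 2).sum().backward()

with torch.no_grad():
    w_manual -= lr * w_manual.grad   # param = param - lr * grad

optimizer.step()

print(w_manual)
print(w_optim)
```

Both parameters end up with the same values, which shows that the learning rate simply scales how far each parameter moves against its gradient.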

> **Note:** An optimizer adjusts the model's weights to minimize the loss. The SGD optimizer used here applies one fixed learning rate to all parameters. Optimizers such as AdamW additionally adapt the step size per parameter and apply weight decay, which penalizes large weights and can improve generalization.

Here's a summary of key concepts from the training loop:

*   **Validation Dataset:**
    *   Used for finding optimal hyperparameter settings.
    *   Similar to a test set, but used *multiple times* (test set used only *once* to avoid evaluation bias).

*   **Model Modes (`model.train()`, `model.eval()`):**
    *   Set the model to training or evaluation mode.
    *   Crucial for layers with different behavior during training vs. inference (e.g., Dropout, Batch Normalization).
    *   Redundant in simple models without such layers, but *best practice* to include for code reusability and future model changes.

*   **Loss Calculation & Optimization:**
    *   Pass logits directly to `F.cross_entropy`. Softmax is applied *internally* for efficiency and numerical stability.
    *   `loss.backward()`: Calculates gradients for all parameters in the computation graph.
    *   `optimizer.step()`: Updates model parameters using the calculated gradients (e.g., for SGD: `param = param - learning_rate * grad`).
    *   Remember `optimizer.zero_grad()` before `loss.backward()` to prevent gradient accumulation.
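
The statement that `F.cross_entropy` applies softmax internally can be checked against an explicit two-step computation. The logits and labels below are illustrative values, not outputs of the trained model.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])   # illustrative values
labels = torch.tensor([0, 1])

# Direct call on logits, as in the training loop.
loss_direct = F.cross_entropy(logits, labels)

# Equivalent two-step form: log-softmax, then negative log-likelihood.
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss_direct, loss_two_step)
```

Because the softmax step is built in, passing probabilities instead of logits to `F.cross_entropy` would apply softmax twice and yield a different loss.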

> **Note:** Call `optimizer.zero_grad()` in each update round to reset the gradients. Otherwise, the gradients from successive `backward()` calls are summed, so each update would also include the gradients of earlier batches.
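
Gradient accumulation can be observed on a single scalar parameter. This minimal sketch resets the gradient by hand with `w.grad.zero_()`; in the training loop, `optimizer.zero_grad()` plays this role.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()          # d(w^2)/dw = 2 * w = 4.0

(w ** 2).backward()            # no reset: the new gradient is added to the old one
accumulated = w.grad.item()    # 4.0 + 4.0 = 8.0

w.grad.zero_()                 # manual reset of the stored gradient
(w ** 2).backward()
after_reset = w.grad.item()    # 4.0 again

print(first, accumulated, after_reset)  # 4.0 8.0 4.0
```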

After we have trained the model, we can use it to make predictions:

In [39]:
model.eval()

with torch.no_grad():
    outputs = model(X_train)

print(outputs)

tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])


#### Interpreting Model Output & Making Predictions

*   **Output Interpretation:**
    *   The model's raw outputs are logits: unnormalized class scores for each input example. Applying softmax converts them into class-membership probabilities.
    *   Example: After softmax, the first input yields `[0.9991, 0.0009]`, i.e., a 99.91% probability for class 0 and 0.09% for class 1.
    *   (`torch.set_printoptions` can be used for better output readability).
*   **Getting Class Labels:**
    *   Use `torch.argmax` to find the index (class label) with the highest score/probability.
    *   `torch.argmax(..., dim=1)`: Finds the max index along each **row** (correct for getting predictions per example).
    *   `torch.argmax(..., dim=0)`: Finds the max index along each **column**.
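
The difference between the two `dim` settings is easiest to see on a small tensor. The scores below are illustrative values with three examples (rows) and two classes (columns).

```python
import torch

scores = torch.tensor([
    [1.0, 5.0],
    [7.0, 2.0],
    [3.0, 4.0],
])  # 3 examples (rows) x 2 classes (columns)

per_example = torch.argmax(scores, dim=1)  # one class index per row
per_class = torch.argmax(scores, dim=0)    # one row index per column

print(per_example)  # tensor([1, 0, 1])
print(per_class)    # tensor([1, 0])
```

For class predictions, the result should have one entry per example, which is why `dim=1` is the appropriate choice here.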

In [40]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

predictions = torch.argmax(probas, dim=1)
print(predictions)

tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])
tensor([0, 0, 0, 1, 1])
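
For reference, the softmax computation can be reproduced by hand for a single row. The sketch below is a plain-Python version (the helper name `softmax` is an assumption made for this example) applied to the first row of `outputs`.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


first_row = [2.8569, -4.1618]            # first row of `outputs` above
first_row_probas = softmax(first_row)

print([round(p, 4) for p in first_row_probas])  # [0.9991, 0.0009]
```

Softmax exponentiates and rescales the scores without changing their order, so the largest logit always maps to the largest probability.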


Note that it is unnecessary to compute softmax probabilities to obtain the class labels.\
We could also apply the argmax function to the logits (outputs) directly:

In [41]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


#### Verifying Training Predictions

*   We have computed the predicted labels for the training dataset.
*   For small datasets like this one, we can visually compare predictions to true labels.
*   To verify programmatically, use the `==` comparison operator:

In [42]:
predictions == y_train

tensor([True, True, True, True, True])

Using `torch.sum`, we can count the number of correct predictions:

In [43]:
torch.sum(predictions == y_train)

tensor(5)

**Training Accuracy**
*   Achieved 100% accuracy (5/5 correct) on the training set.
*   Need a general function to compute accuracy for any dataset size.
*   Implementing `compute_accuracy` in the next cell.

In [44]:
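def compute_accuracy(model, dataloader):

    model = model.eval()  # evaluation mode (matters for layers such as dropout)
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():  # gradients are not needed for evaluation
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)  # predicted class per example
        compare = labels == predictions            # True where the prediction is correct
        correct += torch.sum(compare)              # running count of correct predictions
        total_examples += len(compare)

    return (correct / total_examples).item()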

**`compute_accuracy` Function**

*   Iterates over data loader to compute correct predictions and accuracy.
*   Designed for large datasets where models process data in batches due to memory limits.
*   Scales to datasets of arbitrary size by processing data in chunks (batches) similar to training.
*   Internal logic (converting logits to class labels) is similar to previous steps.
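
One design detail worth noting is that the function accumulates counts across batches instead of averaging per-batch accuracies. The plain-Python sketch below uses hypothetical per-batch results to show why this matters when the last batch is smaller.

```python
# (number of correct predictions, batch size) for three hypothetical batches;
# the last batch is smaller, as happens when drop_last=False.
batches = [(2, 2), (2, 2), (0, 1)]

# Accumulate counts across batches, as compute_accuracy does.
correct = sum(c for c, _ in batches)
total = sum(n for _, n in batches)
accuracy = correct / total

# Averaging per-batch accuracies instead overweights the small last batch.
mean_of_batch_accuracies = sum(c / n for c, n in batches) / len(batches)

print(accuracy)                   # 0.8
print(mean_of_batch_accuracies)   # lower, because the single miss counts as a whole batch
```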

We can then apply the function to the training set:

In [45]:
compute_accuracy(model, train_loader)

1.0

Similarly, we can apply the function to the test set:



In [46]:
compute_accuracy(model, test_loader)

1.0

> **Exercise 7:** The `compute_accuracy` function was used on the training and test loaders. Now, use the trained `model` to:
> 1. Get the raw `logits` for the `X_test` tensor.
> 2. Convert the `logits` to predicted class labels (0 or 1) using `torch.argmax`.
> 3. Print the predicted labels and compare them to the actual `y_test` labels (`tensor([0, 1])`). Do they match?

## 8 Saving and loading models

Now that we’ve trained our model, let’s see how to save it so we can reuse it later.\
Here’s the recommended way of saving and loading models in PyTorch:

In [47]:
torch.save(model.state_dict(), "model.pth")

*   **`model.state_dict()`**: A Python dictionary that maps each parameter name to its tensor, covering the trainable weights and biases of every layer.
*   **Filename**: "`model.pth`" is an arbitrary name. Common conventions are `.pth` and `.pt`.
*   **Restoring**: After saving, the model can be restored from disk.
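
The contents of a state dictionary can be inspected like any other Python dictionary. The sketch below uses a small `torch.nn.Sequential` model as a stand-in, since the layer names of the `NeuralNetwork` class depend on how it was defined.

```python
import torch

# A small stand-in model with two linear layers.
toy_model = torch.nn.Sequential(
    torch.nn.Linear(2, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 2),
)

# Each entry maps a parameter name to the tensor that holds its values.
for name, tensor in toy_model.state_dict().items():
    print(name, tuple(tensor.shape))
```

The `ReLU` layer has no parameters, so it contributes no entries; only the weights and biases of the two linear layers appear.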

In [48]:
model = NeuralNetwork(2, 2) # needs to match the original model exactly
model.load_state_dict(torch.load("model.pth", weights_only=True))

<All keys matched successfully>

*   **Loading the state dictionary:**
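    *   `torch.load("model.pth", weights_only=True)` reads the file and reconstructs the Python dictionary containing the model's parameters. Setting `weights_only=True` restricts loading to tensors and other plain data types rather than arbitrary pickled Python objects.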
*   **Applying the state dictionary:**
    *   `model.load_state_dict()` applies these parameters to the model instance, restoring its learned state.
*   **Model Instance Requirement:**
    *   An instance of the model (`model = NeuralNetwork(2, 2)`) is needed in memory to apply the loaded parameters.
    *   The architecture (`NeuralNetwork(2, 2)`) must exactly match the original saved model.
    *   This line is not strictly necessary if loading in the same session where the model was saved, but included for illustration.
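
A quick way to confirm that a save/load round trip restores the parameters is to compare the outputs of the original and the restored model. The sketch below assumes a single `torch.nn.Linear` layer as a stand-in for the trained network and writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

import torch

torch.manual_seed(123)
original = torch.nn.Linear(2, 2)       # stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")
    torch.save(original.state_dict(), path)

    restored = torch.nn.Linear(2, 2)   # same architecture, fresh random weights
    restored.load_state_dict(torch.load(path, weights_only=True))

x = torch.tensor([[-0.8, 2.8], [2.6, -1.6]])
with torch.no_grad():
    same = torch.equal(original(x), restored(x))

print(same)  # True
```

If the architecture of the new instance did not match the saved one, `load_state_dict` would raise an error about missing or unexpected keys or mismatched shapes.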

> **Exercise 8:**\
> You have saved the model state to "model.pth" and loaded it back.
> 1. Use the newly loaded `model` (the one created just before `load_state_dict`) to compute the accuracy on the `test_loader`.
> 2. Verify that the accuracy is the same as the accuracy computed *before* saving the model (which should be 1.0 in this case). 

## Summary

This notebook serves as an introduction to PyTorch, guiding readers through the essential steps of building and training a simple neural network. Key topics covered include:

1.  **Tensor Basics:** Creating and manipulating PyTorch tensors, including operations like indexing, reshaping, transposition, and matrix multiplication.
2.  **Neural Network Definition:** Constructing a basic neural network model using `torch.nn.Module`, defining layers like `Linear`.
3.  **Dataset and DataLoader:** Preparing data with a custom `Dataset` subclass (`ToyDataset`) and loading it efficiently in batches using `DataLoader`.
4.  **Training Process:** Setting up a loss function (`F.cross_entropy`) and an optimizer (`SGD`), followed by a training loop that performs the forward pass, calculates the loss, performs the backward pass (backpropagation), and updates the model weights.
5.  **Evaluation:** Assessing the model's performance on a test set by calculating accuracy.
6.  **Saving and Loading Models:** Persisting the trained model's state (`state_dict`) to a file and loading it back into a new model instance for inference or further training.
