# Why Use torch.nn? Understanding the Benefits Over Manual PyTorch Models


### 1. The Challenge of Creating Models Without torch.nn

When you build neural networks without the torch.nn module, you have to:

- Manually initialize all weights and biases with the correct shapes.
- Manually implement the forward pass with tensor operations.
- Manually compute the loss and clamp intermediate values (for example, probabilities passed to log) to avoid numerical instability.
- Manually manage gradients: update parameters and zero the gradients at every training step.
- Manage complex architectures yourself (multiple layers, activations, dropout, batch normalization, etc.).
- Write the same boilerplate repeatedly for common components such as linear layers, convolutions, and activation functions.
- Handle device placement (CPU/GPU) explicitly for every tensor.

As complexity grows, this code becomes harder to read and more error-prone.
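
To make the list above concrete, here is a minimal sketch of one training step for a single sigmoid neuron written without torch.nn. The seed, learning rate, and clamping epsilon are illustrative assumptions, not values from this notebook.

```python
import torch

torch.manual_seed(0)  # illustrative seed so the run is repeatable

# Dummy data: 10 samples, 5 features, binary labels
features = torch.rand(10, 5)
labels = torch.randint(0, 2, (10, 1)).float()

# Manually initialize weights and bias with the right shapes
weights = torch.randn(5, 1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
lr = 0.1  # assumed learning rate

# Forward pass written out by hand: linear transform, then sigmoid
z = features @ weights + bias
probs = 1 / (1 + torch.exp(-z))

# Binary cross entropy, clamped so log() never sees exactly 0 or 1
eps = 1e-7  # assumed clamping epsilon
probs = probs.clamp(eps, 1 - eps)
loss = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()

# Backward pass, manual parameter update, manual gradient reset
loss.backward()
with torch.no_grad():
    weights -= lr * weights.grad
    bias -= lr * bias.grad
weights.grad.zero_()
bias.grad.zero_()

print("Loss:", loss.item())
```

Every step here (initialization, forward pass, loss, update, gradient reset) has to be repeated and kept consistent by hand for each additional layer.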

### 2. How Does torch.nn Help?

The torch.nn module is designed to:

- Provide prebuilt layer classes (nn.Linear, nn.Conv2d, nn.BatchNorm2d, etc.) that handle parameter initialization and computation.
- Automatically register parameters (weights, biases) assigned as attributes of nn.Module subclasses.
- Let you express forward propagation in a single forward() method.
- Integrate with autograd, so gradients are computed by calling backward() on the loss.
- Expose all model parameters through model.parameters().
- Provide common loss functions (nn.BCELoss(), nn.CrossEntropyLoss(), etc.) with built-in safeguards against numerical problems; nn.BCELoss, for example, clamps its log terms to be at least -100.
- Include activation functions (nn.ReLU(), nn.Sigmoid(), nn.Softmax()) as reusable modules.
- Move a whole model between CPU and GPU with model.to(device).
- Enable modular, composable architectures by nesting nn.Module instances.
- Work directly with optimizers from torch.optim, which accept model.parameters().
- Simplify training loops, reducing boilerplate and errors.
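
A few of these points can be seen on a single prebuilt layer. The following is a small sketch, assuming a 5-feature input as in the examples later in this notebook; the variable names are illustrative.

```python
import torch
import torch.nn as nn

# A prebuilt layer creates and initializes its own weight and bias
layer = nn.Linear(5, 1)

# Parameters are registered automatically and can be listed by name
for name, param in layer.named_parameters():
    print(name, tuple(param.shape))

# Move the module (and all of its parameters) to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = layer.to(device)

# Optimizers accept the parameter iterator directly
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
param_names = [name for name, _ in layer.named_parameters()]
```

Note that nn.Linear stores its weight with shape (out_features, in_features), so the weight here has shape (1, 5).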



### One Neuron with Sigmoid Activation

Let's first create a simple model with just one linear layer followed by a sigmoid activation, using torch.nn. 

This is much simpler than implementing the math and gradients manually.

In [31]:
import torch
import torch.nn as nn
from torchinfo import summary  # third-party package: pip install torchinfo

In [32]:
class SimpleModel(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # One linear layer from num_features to 1 output
        self.linear = nn.Linear(num_features, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        x = self.linear(x)  # Linear transformation
        x = self.sigmoid(x) # Sigmoid activation
        return x

# Create random features tensor of size (batch_size=10, num_features=5)
features = torch.rand(10, 5)

# Instantiate model
model = SimpleModel(num_features=features.shape[1])

# Forward pass
output = model(features)
print("Output shape:", output.shape)
print("Output:", output)
print("Weight:", model.linear.weight)
print("Bias:", model.linear.bias)


Output shape: torch.Size([10, 1])
Output: tensor([[0.5741],
        [0.5385],
        [0.5222],
        [0.5003],
        [0.5367],
        [0.5410],
        [0.5264],
        [0.5416],
        [0.5229],
        [0.5297]], grad_fn=<SigmoidBackward0>)
Weight: Parameter containing:
tensor([[ 0.1782,  0.2078, -0.1979, -0.1187, -0.0768]], requires_grad=True)
Bias: Parameter containing:
tensor([0.1280], requires_grad=True)


In [33]:
summary(model, input_size=(10, 5))

Layer (type:depth-idx)                   Output Shape              Param #
SimpleModel                              [10, 1]                   --
├─Linear: 1-1                            [10, 1]                   6
├─Sigmoid: 1-2                           [10, 1]                   --
Total params: 6
Trainable params: 6
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
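
The parameter count of 6 reported above comes from 5 weights plus 1 bias. As a rough check, the sketch below also compares the layer's output with the same computation written out by hand; it uses a fresh nn.Linear rather than the model instance from the cell above.

```python
import torch
import torch.nn as nn

linear = nn.Linear(5, 1)
x = torch.rand(10, 5)

# nn.Linear stores weight as (out_features, in_features), hence the transpose
manual = torch.sigmoid(x @ linear.weight.T + linear.bias)
module = torch.sigmoid(linear(x))
matches = torch.allclose(manual, module, atol=1e-6)

# 5 weights + 1 bias = 6 trainable parameters
num_params = sum(p.numel() for p in linear.parameters())
print(matches, num_params)
```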

### Adding Another Layer with ReLU Activation

Now, let's extend this to two linear layers with a ReLU activation in between, then a sigmoid at the end. This creates a simple 2-layer feedforward network.

In [34]:
import torch
import torch.nn as nn

class TwoLayerModelNoSequential(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Define layers explicitly
        self.linear1 = nn.Linear(num_features, 3)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(3, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.linear1(x)   # First linear layer
        x = self.relu(x)      # Activation
        x = self.linear2(x)   # Second linear layer
        x = self.sigmoid(x)   # Output activation
        return x

# Create dummy input
features = torch.rand(10, 5)

# Instantiate model
model = TwoLayerModelNoSequential(num_features=features.shape[1])

# Forward pass
output = model(features)
print("Output shape:", output.shape)
print("Output:", output)
print("linear1 Weight:", model.linear1.weight)
print("linear2 Weight:", model.linear2.weight)
print("linear1 Bias:", model.linear1.bias)
print("linear2 Bias:", model.linear2.bias)



Output shape: torch.Size([10, 1])
Output: tensor([[0.6656],
        [0.7235],
        [0.7083],
        [0.6975],
        [0.7161],
        [0.7185],
        [0.6836],
        [0.7008],
        [0.6939],
        [0.7224]], grad_fn=<SigmoidBackward0>)
linear1 Weight: Parameter containing:
tensor([[-0.1821,  0.1225, -0.1386, -0.2808, -0.0496],
        [ 0.1001,  0.2874,  0.2408,  0.2448,  0.3020],
        [ 0.2126,  0.2815,  0.1191, -0.2657, -0.3535]], requires_grad=True)
linear2 Weight: Parameter containing:
tensor([[-0.3266,  0.4533,  0.5282]], requires_grad=True)
linear1 Bias: Parameter containing:
tensor([-0.0424,  0.2290,  0.4403], requires_grad=True)
linear2 Bias: Parameter containing:
tensor([0.2549], requires_grad=True)


In [35]:
summary(model, input_size=(10, 5))

Layer (type:depth-idx)                   Output Shape              Param #
TwoLayerModelNoSequential                [10, 1]                   --
├─Linear: 1-1                            [10, 3]                   18
├─ReLU: 1-2                              [10, 3]                   --
├─Linear: 1-3                            [10, 1]                   4
├─Sigmoid: 1-4                           [10, 1]                   --
Total params: 22
Trainable params: 22
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
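
The per-layer counts in the summary (18 and 4) can be reproduced directly from the layer shapes. This is a minimal sketch; count_parameters is a hypothetical helper, not part of PyTorch or torchinfo.

```python
import torch.nn as nn

def count_parameters(module):
    """Hypothetical helper: number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

linear1 = nn.Linear(5, 3)   # 5*3 weights + 3 biases = 18
linear2 = nn.Linear(3, 1)   # 3*1 weights + 1 bias   = 4
counts = (count_parameters(linear1), count_parameters(linear2))
total = sum(counts)
print(counts, total)  # (18, 4) 22
```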

### Implementing the Same Model with nn.Sequential

In [49]:
class TwoLayerModel(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(num_features, 3),  # First layer: num_features -> 3 neurons
            nn.ReLU(),                   # Activation function
            nn.Linear(3, 1),             # Second layer: 3 neurons -> 1 output
            nn.Sigmoid()                 # Sigmoid activation for output
        )

    def forward(self, x):
        return self.network(x)

# Instantiate model
model = TwoLayerModel(num_features=features.shape[1])

# Forward pass
output = model(features)
print("Output shape:", output.shape)
print("Output:", output)
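# Note: without parentheses this prints the bound method, whose repr shows the
# layer structure; call model.network.parameters() to iterate over the tensors.
print("Parameters", model.network.parameters)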
for layer in model.network:
    if isinstance(layer, nn.Linear):
        print(layer)
        print("Weights:", layer.weight)
        print("Bias:", layer.bias)
        

Output shape: torch.Size([10, 1])
Output: tensor([[0.6209],
        [0.6209],
        [0.6209],
        [0.6209],
        [0.6209],
        [0.6209],
        [0.6209],
        [0.6209],
        [0.6209],
        [0.6209]], grad_fn=<SigmoidBackward0>)
Parameters <bound method Module.parameters of Sequential(
  (0): Linear(in_features=5, out_features=3, bias=True)
  (1): ReLU()
  (2): Linear(in_features=3, out_features=1, bias=True)
  (3): Sigmoid()
)>
Linear(in_features=5, out_features=3, bias=True)
Weights: Parameter containing:
tensor([[ 0.2097,  0.0632, -0.0453, -0.3714, -0.2241],
        [ 0.1517, -0.3611,  0.1420, -0.3632, -0.0558],
        [-0.3569, -0.1887,  0.1263, -0.1305, -0.3018]], requires_grad=True)
Bias: Parameter containing:
tensor([-0.2098, -0.4125, -0.2057], requires_grad=True)
Linear(in_features=3, out_features=1, bias=True)
Weights: Parameter containing:
tensor([[-0.2630,  0.4056, -0.4092]], requires_grad=True)
Bias: Parameter containing:
tensor([0.4932], requires_grad=True)
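
Every row of the output above is 0.6209. That is consistent with all three hidden ReLU units producing zero (or values too small to show at four decimals) for this batch, which leaves only the second layer's bias: sigmoid(0.4932) is approximately 0.6209. The sketch below checks this against the printed, rounded parameter values; it assumes inputs in [0, 1), as produced by torch.rand.

```python
import torch

# Values copied from the printed parameters above (rounded to 4 decimals)
w1 = torch.tensor([[ 0.2097,  0.0632, -0.0453, -0.3714, -0.2241],
                   [ 0.1517, -0.3611,  0.1420, -0.3632, -0.0558],
                   [-0.3569, -0.1887,  0.1263, -0.1305, -0.3018]])
b1 = torch.tensor([-0.2098, -0.4125, -0.2057])
b2 = torch.tensor([0.4932])

# Upper bound on each hidden pre-activation for inputs in [0, 1):
# positive weights contribute at most their value, negative weights at most 0
max_pre = w1.clamp(min=0).sum(dim=1) + b1
print("Max pre-activation per hidden unit:", max_pre)

# If every hidden activation is zero, the output is sigmoid(layer-2 bias)
dead_output = torch.sigmoid(b2)
print("Output with all hidden units at zero:", dead_output)
```

With these values, the second and third hidden units can never become positive, and the first can only barely do so. Re-instantiating the model draws a different random initialization and will usually give outputs that vary across rows.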

In [50]:
summary(model, input_size=(10, 5))

Layer (type:depth-idx)                   Output Shape              Param #
TwoLayerModel                            [10, 1]                   --
├─Sequential: 1-1                        [10, 1]                   --
│    └─Linear: 2-1                       [10, 3]                   18
│    └─ReLU: 2-2                         [10, 3]                   --
│    └─Linear: 2-3                       [10, 1]                   4
│    └─Sigmoid: 2-4                      [10, 1]                   --
Total params: 22
Trainable params: 22
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
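
nn.Sequential simply applies its layers in order, and the layers remain accessible by index. The following sketch, using a standalone nn.Sequential with the same layer sizes, illustrates both points.

```python
import torch
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
)
x = torch.rand(10, 5)

# Apply each layer in order by hand
out = x
for layer in network:
    out = layer(out)

same = torch.equal(out, network(x))  # identical to calling the container
first_layer = network[0]             # layers are accessible by index
print(same, len(network), first_layer)
```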

### Implementation with nn.Sequential, Loss Function, and Optimizer

In [51]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define model with Sequential
class TwoLayerModel(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(num_features, 3),
            nn.ReLU(),
            nn.Linear(3, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# Dummy data
features = torch.rand(10, 5)  # 10 samples, 5 features
labels = torch.randint(0, 2, (10, 1)).float()  # binary targets (0 or 1)

# Create model
model = TwoLayerModel(num_features=5)

# Loss function: Binary Cross Entropy (for sigmoid output)
loss_function = nn.BCELoss()

# Optimizer: SGD
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training step
model.train()
optimizer.zero_grad()                # Zero out gradients
outputs = model(features)            # Forward pass
loss = loss_function(outputs, labels)    # Compute loss
loss.backward()                      # Backpropagation
optimizer.step()                     # Update weights

# Print loss
print("Loss:", loss.item())


Loss: 0.7613458633422852
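
The cell above performs a single training step. In practice the same five lines are repeated for many iterations. Here is a minimal sketch of such a loop; the seed and the number of epochs are illustrative assumptions, and the model is written as a bare nn.Sequential for brevity.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # illustrative seed

# Dummy data, as in the cell above
features = torch.rand(10, 5)
labels = torch.randint(0, 2, (10, 1)).float()

model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
loss_function = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

num_epochs = 20  # assumed value for illustration
losses = []
model.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()                  # reset gradients from the last step
    outputs = model(features)              # forward pass
    loss = loss_function(outputs, labels)  # compute loss
    loss.backward()                        # backpropagation
    optimizer.step()                       # update weights
    losses.append(loss.item())

print("First loss:", losses[0])
print("Last loss:", losses[-1])
```

A common variant drops the final nn.Sigmoid and uses nn.BCEWithLogitsLoss, which combines the sigmoid and the binary cross entropy in one numerically more stable operation.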
