# [SOLUTION] Exercise 3: Experiment with MLP Architectures (Wider vs. Deeper)

Every neural network architect faces a fundamental question: _Should I make my network wider (more neurons per layer) or deeper (more layers)?_ This isn't just an aesthetic choice: it determines what your model can learn, how efficiently it uses parameters, and whether it fits deployment constraints. 

- Depth enables hierarchical learning: each layer builds increasingly abstract representations from simpler ones. 
- Width enables parallel pattern detection: more neurons capture more diverse features simultaneously. 

Understanding this trade-off is essential for effective network design.

> **Overview**: Build and compare multiple MLP architectures with different depth and width configurations. Analyze how architectural choices affect parameter count, model capacity, and design trade-offs.
> 
> **Scenario**: Your food delivery platform needs a restaurant success predictor that runs on mobile devices. You're evaluating different network architectures to find the right balance between capability and efficiency. Should you build a wide network that captures many patterns at once? A deep network that learns hierarchical features? Or something in between?
> 
> **Goal**: Develop practical intuition for architectural design decisions and understand how depth vs. width affects both capability and resource usage.
> 
> **Tools**: Python, PyTorch, NumPy, Pandas
> 
> **Estimated Time**: 15 minutes

## Step 1: Setup

Let's import our libraries and prepare the environment.

In [1]:
# Import core libraries
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")

Setup complete!
PyTorch version: 2.5.1+cu121


## Step 2: Load and prepare data

> Note: This step uses the same dataset and preprocessing as [demo 3](/cd1818-intro-to-deep-learning/3-mlp/demo3-building-first-mlp-sequential.ipynb).

We'll use the [MongoDB/whatscooking.restaurants](https://huggingface.co/datasets/MongoDB/whatscooking.restaurants) dataset from Hugging Face, which contains detailed information on restaurants across New York City.

For preprocessing, we'll:

1. Load the dataset from Hugging Face
2. Remove rows with null values for key features
3. Encode categorical variables (cuisine, borough) as numbers
4. Extract key features: stars, review_count, PriceRange, cuisine, borough
5. Normalize features for the neural network
6. Create binary success labels: high rating (4+ stars) AND proven popularity (50+ reviews)
7. Create the final feature and label datasets _(in tensor format)_

**IMPORTANT: Feel free to skip this data preparation section to focus on the MLP architecture itself**. Just know that we end up with about 23,000 restaurants, each with 5 fully populated features (stars, reviews, price, encoded cuisine, encoded borough) and a binary success label.

In [2]:
# 1. Load restaurant data from HuggingFace
print("Loading restaurant dataset...")
dataset = load_dataset("MongoDB/whatscooking.restaurants", split="train")
df = pd.DataFrame(dataset)

# # Sample 1000 restaurants
# df = df.sample(n=1000, random_state=42).reset_index(drop=True)

print(f"Total restaurants in dataset: {len(df):,}")
print(f"\nDataset columns: {df.columns.tolist()[:10]}...")  # Show first 10 columns
print(f"\nFirst few entries:")
print(df[['name', 'cuisine', 'stars', 'review_count', 'borough']].head())

Loading restaurant dataset...
Total restaurants in dataset: 25,361

Dataset columns: ['restaurant_id', 'location', '_id', 'review_count', 'DogsAllowed', 'embedding', 'PriceRange', 'menu', 'HappyHour', 'TakeOut']...

First few entries:
                     name   cuisine  stars  review_count        borough
0      Baby Bo'S Burritos   Tex-Mex    2.5            10      Manhattan
1      Buddy'S Wonder Bar  American    3.5            62  Staten Island
2  Great Kills Yacht Club  American    4.0            72  Staten Island
3        Keats Restaurant  American    4.0           149      Manhattan
4                 Olive'S    Bakery    5.0             7      Manhattan


In [3]:
# 2. Remove rows with a null value in any of the key features
df = df.dropna(subset=['stars', 'review_count', 'PriceRange', 'cuisine', 'borough'])

# 3. Encode categorical features
cuisine_encoder = LabelEncoder()
borough_encoder = LabelEncoder()

df['cuisine_encoded'] = cuisine_encoder.fit_transform(df['cuisine'])
df['borough_encoded'] = borough_encoder.fit_transform(df['borough'])

# 4. Select subset of features
feature_columns = ['stars', 'review_count', 'PriceRange', 'cuisine_encoded', 'borough_encoded']
df = df[feature_columns].copy()  # Explicit copy so adding 'success' below never touches a view

# 5. Normalize features (important for neural networks!)
scaler = StandardScaler()
X_raw = df.to_numpy()
X_scaled = scaler.fit_transform(X_raw)

# 6. Create success label: high rating (4+ stars) AND proven popularity (50+ reviews)
df['success'] = ((df['stars'] >= 4.0) &
                 (df['review_count'] >= 50)).astype(int)
y = df['success'].values

# 7. Define final datasets as PyTorch tensors
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

print(f"Prepared {X_tensor.shape[0]} restaurants with {X_tensor.shape[1]} features")
print(f"\nFeature names: {feature_columns}")
print(f"Target: success (0 or 1)")
print(f"Success rate: {df['success'].mean():.1%}")

Prepared 22947 restaurants with 5 features

Feature names: ['stars', 'review_count', 'PriceRange', 'cuisine_encoded', 'borough_encoded']
Target: success (0 or 1)
Success rate: 23.3%


> **Understanding data preparation for neural networks:**
> 
> - **Encoding categorical features**: Neural networks only understand numbers, not text. We converted cuisine types (e.g., "Italian", "Chinese") and boroughs (e.g., "Manhattan", "Brooklyn") into numerical codes (0, 1, 2, ...). This is called label encoding: each unique category gets a unique integer.
> 
> - **Normalization**: We scaled all features to have similar ranges using standardization (mean=0, standard deviation=1). Why? Features like `review_count` (range: 5-9185) would dominate features like `stars` (range: 1-5) without normalization. Neural networks learn better when all inputs are on comparable scales.
> 
> **The result**: Just like XOR, we now have numerical data that the network can process: features in columns, examples in rows, all as numbers. The difference? XOR had 2 simple binary features. Our restaurant data has 5 features with richer patterns: continuous values (stars, reviews), ordinal values (price), and encoded categories (cuisine, borough). Same data structure, but with more complexity.
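
To make the two preprocessing ideas concrete, here is a small plain-Python sketch applied to the five preview rows printed earlier. It only imitates what `LabelEncoder` and `StandardScaler` do (sorted classes mapped to integers, and rescaling with the population standard deviation); `label_encode` and `standardize` are hypothetical helper names, not part of the notebook.

```python
from statistics import mean, pstdev


def label_encode(labels):
    """Map each unique label to an integer, using sorted class order."""
    classes = sorted(set(labels))
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# The five preview rows shown in the df.head() output above
cuisines = ["Tex-Mex", "American", "American", "American", "Bakery"]
review_counts = [10, 62, 72, 149, 7]
stars = [2.5, 3.5, 4.0, 4.0, 5.0]

cuisine_codes, classes = label_encode(cuisines)
scaled_reviews = standardize(review_counts)
scaled_stars = standardize(stars)

print("Classes:", classes)
print("Cuisine codes:", cuisine_codes)
print("Scaled reviews:", [round(v, 2) for v in scaled_reviews])
print("Scaled stars:  ", [round(v, 2) for v in scaled_stars])
```

After scaling, both columns are centered on 0 with unit spread, so neither one dominates the other simply because of its units.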


> **Why these features?** We selected 5 features to create enough input dimensions that architectural choices become meaningful. With too few features (like 2-3), there's not enough complexity for depth vs. width to matter. With too many (50+), the exercise becomes dominated by the input layer's size rather than the architectural decisions in hidden layers.

## Step 3: Build and compare network architectures

Now comes the core of the exercise: you'll build three different MLP architectures and systematically compare them. This hands-on experimentation will reveal how depth, width, and activation choices affect network design.

**What you'll build:**
- A **wide & shallow** network (few layers, many neurons)
- A **deep & narrow** network (many layers, few neurons)
- A **deep network with different activation** (exploring activation functions)

**What you'll discover:**
- How to construct different MLPs using `nn.Sequential`
- How architectural choices affect parameter count
- Where parameters are distributed across layers
- Why depth is more parameter-efficient than width

Let's start building!

### Part A: Build a "Wide and Shallow" network

A wide network uses few layers but many neurons per layer. This architecture can learn many different patterns simultaneously in a single transformation step.

**Your task**: Build a 2-layer MLP with one wide hidden layer.

**Target architecture**: 5 inputs → 512 hidden neurons → 1 output

**Design choices to make**:
- What activation function should you use in the hidden layer?
- What activation function should you use for the output?

In [4]:
# Define network dimensions
input_size = X_tensor.shape[1]  # 5 features
hidden_size_wide = 512          # Many neurons in one layer
output_size = 1                 # Binary classification

# TODO: Build the wide network using nn.Sequential
# Hint: Your first Linear layer connects inputs to the hidden layer. What dimensions does it need?
# Hint: Then, the second layer connects the hidden neurons (same size!) to the single output.
# Hint: We want a probability output (0 to 1). Which activation function squashes numbers into that range?
# Reference: https://docs.pytorch.org/docs/stable/nn.html

# SOLUTION
model_wide = nn.Sequential(
    nn.Linear(input_size, hidden_size_wide),  # 5 → 512
    nn.ReLU(),                                # Activation for non-linearity
    nn.Linear(hidden_size_wide, output_size), # 512 → 1
    nn.Sigmoid()                              # Output probability [0,1]
)

print("Wide & Shallow Network:")
print(model_wide)
print(f"\nArchitecture: {input_size} → {hidden_size_wide} → {output_size}")

Wide & Shallow Network:
Sequential(
  (0): Linear(in_features=5, out_features=512, bias=True)
  (1): ReLU()
  (2): Linear(in_features=512, out_features=1, bias=True)
  (3): Sigmoid()
)

Architecture: 5 → 512 → 1


> **Wide network as a parallel processor.** Our "Wide and Shallow" network tries to detect 512 different patterns simultaneously based on the 5 input features. While this network has immense capacity to learn many simple, distinct rules right away, what might be the cost associated with having every one of those 512 neurons connected to every single input feature? 
> <br>_(Hint: The cost is directly related to the number of weights you need to store and train.)_

### Part B: Build the "Deep and Narrow" network

A deep network uses many layers but fewer neurons per layer. This architecture can learn hierarchical representations: each layer builds on the previous layer's features to create increasingly abstract patterns.

**Your task**: Build a 4-layer MLP with three narrow hidden layers.

**Target architecture**: 5 inputs → 32 neurons → 32 neurons → 32 neurons → 1 output

In [5]:
# Define network dimensions
hidden_size_deep = 32  # Fewer neurons per layer

# TODO: Build the deep network using nn.Sequential
# Hint: Think about the data flow: Input → Hidden1 → Hidden2 → Hidden3 → Output.
# Hint: Ensure the 'out_features' of one layer matches the 'in_features' of the next.
# Hint: And, remember: every hidden layer needs an activation function, or else they collapse into a single linear layer!
# Reference: https://docs.pytorch.org/docs/stable/nn.html

# SOLUTION
model_deep = nn.Sequential(
    nn.Linear(input_size, hidden_size_deep),  # 5 → 32
    nn.ReLU(),
    nn.Linear(hidden_size_deep, hidden_size_deep),  # 32 → 32
    nn.ReLU(),
    nn.Linear(hidden_size_deep, hidden_size_deep),  # 32 → 32
    nn.ReLU(),
    nn.Linear(hidden_size_deep, output_size),  # 32 → 1
    nn.Sigmoid()
)

print("Deep & Narrow Network:")
print(model_deep)
print(f"\nArchitecture: {input_size} → {hidden_size_deep} → {hidden_size_deep} → {hidden_size_deep} → {output_size}")

Deep & Narrow Network:
Sequential(
  (0): Linear(in_features=5, out_features=32, bias=True)
  (1): ReLU()
  (2): Linear(in_features=32, out_features=32, bias=True)
  (3): ReLU()
  (4): Linear(in_features=32, out_features=32, bias=True)
  (5): ReLU()
  (6): Linear(in_features=32, out_features=1, bias=True)
  (7): Sigmoid()
)

Architecture: 5 → 32 → 32 → 32 → 1


> **Deep network as hierarchical learner**: Our "Deep and Narrow" Network achieves its power through composition (stacking functions). Each of the three hidden layers (32 → 32 → 32) only learns 32 patterns, but the second layer learns patterns from the output of the first layer, and the third layer learns patterns from the second layer. Why might learning a "complex pattern" via three smaller, stacked steps be more efficient than trying to learn that same complex pattern in a single, massive step? 
> <br>_(Hint: Relate this to how you would solve a long math equation versus a complex logic puzzle.)_
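
The hint in the code cell says that hidden layers without an activation "collapse into a single linear layer". A quick NumPy sketch can illustrate this. It assumes random weights and the same `y = x @ W.T + b` convention as `nn.Linear`; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with NO activation in between: 5 -> 32 -> 1
W1, b1 = rng.normal(size=(32, 5)), rng.normal(size=32)
W2, b2 = rng.normal(size=(1, 32)), rng.normal(size=1)

x = rng.normal(size=(4, 5))  # a batch of 4 examples with 5 features

# Forward pass through both layers
stacked = (x @ W1.T + b1) @ W2.T + b2

# The equivalent single linear layer: W = W2 W1, b = W2 b1 + b2
W = W2 @ W1
b = W2 @ b1 + b2
single = x @ W.T + b

print("Same output without activation:", np.allclose(stacked, single))

# With a ReLU between the layers, the single-layer shortcut no longer matches
with_relu = np.maximum(x @ W1.T + b1, 0) @ W2.T + b2
print("Same output with ReLU:", np.allclose(with_relu, single))
```

Without the non-linearity, extra layers add parameters but no extra expressive power, which is why every hidden layer in the solution is followed by an activation.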

#### Part B.1: Experiment with activation functions - Do they add parameters?

You chose one of many possible activations for your hidden layers. But what would happen with different activations? Let's explore!

In [6]:
# TODO: Build a deep network with a different activation 
# Hint: Simply replace your current activation with a different one
# Hint: Keep everything else the same (same architecture, same output activation)
# Reference: https://docs.pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity

# SOLUTION
model_deep_alternative = nn.Sequential(
    nn.Linear(input_size, hidden_size_deep),  # 5 → 32
    nn.Tanh(),  # Changed from ReLU to Tanh
    nn.Linear(hidden_size_deep, hidden_size_deep),  # 32 → 32
    nn.Tanh(),  # Changed from ReLU to Tanh
    nn.Linear(hidden_size_deep, hidden_size_deep),  # 32 → 32
    nn.Tanh(),  # Changed from ReLU to Tanh
    nn.Linear(hidden_size_deep, output_size),  # 32 → 1
    nn.Sigmoid()  # Output activation stays the same
)

print("\nDeep Network with Tanh Activation:")
print(model_deep_alternative)


Deep Network with Tanh Activation:
Sequential(
  (0): Linear(in_features=5, out_features=32, bias=True)
  (1): Tanh()
  (2): Linear(in_features=32, out_features=32, bias=True)
  (3): Tanh()
  (4): Linear(in_features=32, out_features=32, bias=True)
  (5): Tanh()
  (6): Linear(in_features=32, out_features=1, bias=True)
  (7): Sigmoid()
)


> **How do activation functions contribute to network complexity?**: Do you expect the two versions with different activation functions to have the same number of parameters, or different? Why? 
> <br>_(Think about whether the activation function itself has any trainable weights.)_
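
One way to check your expectation is to count the parameters of the activation modules directly. The sketch below uses a hypothetical helper name, `n_params`. It also shows `nn.PReLU`, which is not used in this exercise but is a reminder that a few activations do carry a learnable parameter.

```python
import torch.nn as nn


def n_params(module):
    """Number of parameter elements registered on a module."""
    return sum(p.numel() for p in module.parameters())


# The activations used in this exercise have nothing to learn
for activation in (nn.ReLU(), nn.Tanh(), nn.Sigmoid()):
    print(f"{type(activation).__name__}: {n_params(activation)} parameters")

# PReLU learns its negative slope (one value by default)
print(f"PReLU: {n_params(nn.PReLU())} parameters")
```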

### Part C: Analyze parameter count

Now that you've built three different architectures, let's measure one concrete aspect: **how many parameters does each have?**

Parameter count matters for two reasons:
1. **Resource constraints**: More parameters = more memory and computation
2. **Architectural efficiency**: Reveals how different designs use their parameter budget

But remember: Parameters are just ONE way to measure architecture. The real power comes from what patterns the network can learn through its depth and width (something you'll explore when you learn about model training)!

In [7]:
# TODO: Count total parameters in each network
# Hint: The model object has a generator for parameters. Can you loop through it?
# Hint: Each item in that loop is a tensor. And, you need to sum the number of elements of every tensor in the model.
# Reference: 
# 1. https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.parameters
# 2. https://discuss.pytorch.org/t/how-do-i-check-the-number-of-parameters-of-a-model/4325

def count_parameters(model):
    """Count total trainable parameters in a model."""
    # SOLUTION
    # requires_grad filters out frozen tensors, so the count matches the docstring
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Count parameters for each model
wide_params = count_parameters(model_wide)
deep_params = count_parameters(model_deep)
deep_alternative_params = count_parameters(model_deep_alternative)

print("Parameter Counts:")
print("=" * 60)
print(f"Wide & Shallow (ReLU):    {wide_params:,} parameters")
print(f"Deep & Narrow (ReLU):     {deep_params:,} parameters")
print(f"Deep & Narrow (Tanh):     {deep_alternative_params:,} parameters")
print("=" * 60)

Parameter Counts:
Wide & Shallow (ReLU):    3,585 parameters
Deep & Narrow (ReLU):     2,337 parameters
Deep & Narrow (Tanh):     2,337 parameters


> **What to notice**: Compare the wide vs. deep networks. Which has more parameters? The difference reveals how width and depth scale differently: 
> 
> - Width is expensive because every neuron connects to every neuron in the layer before it AND the layer after it.
> - Depth adds parameters more gradually because each new layer only connects to the previous layer, so narrow layers stay cheap.
> 
> Even though the deep network has three hidden layers instead of one, it uses about 35% fewer parameters (2,337 vs. 3,585).


### Part D: Understand parameter distribution

Total parameter count tells one story, but WHERE those parameters live tells another. Let's break down each layer to see how parameters are distributed across the network.

In [8]:
# TODO: Print a detailed breakdown showing parameters in each layer
# Hint: Loop through enumerate(model) to get both index and layer
# Hint: Extract both .weight and .bias, but ONLY for layers with learnable parameters
# Hint: Print the shapes and parameter counts for each with a final total
# Reference: https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html

def print_parameter_breakdown(model, model_name):
    """Print detailed parameter breakdown for each layer."""
    print(f"\n{model_name} - Layer-by-Layer Breakdown:")
    print("=" * 70)
    
    # SOLUTION
    total = 0
    for idx, layer in enumerate(model):
        print(f"\nLayer {idx}: {layer}")
        
        if isinstance(layer, nn.Linear):
            weight_params = layer.weight.numel()
            bias_params = layer.bias.numel()
            layer_total = weight_params + bias_params
            
            print(f"  → Weight shape: {layer.weight.shape} = {weight_params:,} parameters")
            print(f"  → Bias shape: {layer.bias.shape} = {bias_params:,} parameters")
            print(f"  → Layer total: {layer_total:,} parameters")
            
            total += layer_total
        else:
            print(f"  → No trainable parameters (activation function)")
    
    print("=" * 70)
    print(f"Total: {count_parameters(model):,} parameters\n")

# Print breakdowns
print_parameter_breakdown(model_wide, "Wide & Shallow")
print_parameter_breakdown(model_deep, "Deep & Narrow")


Wide & Shallow - Layer-by-Layer Breakdown:

Layer 0: Linear(in_features=5, out_features=512, bias=True)
  → Weight shape: torch.Size([512, 5]) = 2,560 parameters
  → Bias shape: torch.Size([512]) = 512 parameters
  → Layer total: 3,072 parameters

Layer 1: ReLU()
  → No trainable parameters (activation function)

Layer 2: Linear(in_features=512, out_features=1, bias=True)
  → Weight shape: torch.Size([1, 512]) = 512 parameters
  → Bias shape: torch.Size([1]) = 1 parameters
  → Layer total: 513 parameters

Layer 3: Sigmoid()
  → No trainable parameters (activation function)
Total: 3,585 parameters


Deep & Narrow - Layer-by-Layer Breakdown:

Layer 0: Linear(in_features=5, out_features=32, bias=True)
  → Weight shape: torch.Size([32, 5]) = 160 parameters
  → Bias shape: torch.Size([32]) = 32 parameters
  → Layer total: 192 parameters

Layer 1: ReLU()
  → No trainable parameters (activation function)

Layer 2: Linear(in_features=32, out_features=32, bias=True)
  → Weight shape: torch.Siz

> **Parameter math for Linear layers**: For any `nn.Linear(in_features, out_features)`, the parameter count is:
> - **Weights**: `in_features × out_features` (one weight per connection)
> - **Biases**: `out_features` (one bias per output neuron)  
> - **Total**: `(in_features × out_features) + out_features`
> 
> For example, `nn.Linear(5, 512)` has (5 × 512) + 512 = 3,072 parameters.
>
> **Key observation**: The first layer in the wide network (5→512) dominates the parameter budget because it fans 5 inputs out to 512 neurons, each with its own weights and bias. In the deep network, most parameters sit in the two 32→32 layers, each with (32 × 32) + 32 = 1,056 parameters, while the 5→32 input layer (192) and the 32→1 output layer (33) stay small.
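
The formula above can be applied without building any model. The following sketch reproduces the per-layer counts for both architectures from their layer sizes alone; `linear_params` and `mlp_params` are hypothetical helper names introduced for illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_params(layer_sizes):
    """Per-layer parameter counts for an MLP given as [input, hidden..., output]."""
    return [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]


wide_layers = mlp_params([5, 512, 1])
deep_layers = mlp_params([5, 32, 32, 32, 1])

print(f"Wide & Shallow: {wide_layers} -> total {sum(wide_layers):,}")
print(f"Deep & Narrow:  {deep_layers} -> total {sum(deep_layers):,}")
```

The totals should match the values reported by `count_parameters` in Part C.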

## Step 4: Gather insights from architectural experiments

Now that you've built multiple architectures and analyzed their parameter counts, let's synthesize what you've discovered about the **Depth vs. Width** trade-off.

**TODO**: Provide an answer in markdown for each of the following questions.

### Question 1: Parameter efficiency

Look at your parameter counts for the wide vs. deep networks.

**Answer the following** (write your responses in this markdown cell):

1. Which network has more parameters? By approximately what factor (1.5x, 2x, 5x, 10x)?
2. If you had a strict memory constraint (say, 5,000 parameters maximum), which architecture style would fit better?

_Write your answer here:_
1. The Wide network has more parameters (3,585) than the Deep network (2,337). The difference is approximately 1.5x (3,585 / 2,337 ≈ 1.53).
2. Both networks fit under a 5,000-parameter limit, but the Deep architecture style fits better. Its three hidden layers use only 2,337 parameters, leaving room for about two more 32-neuron layers (1,056 parameters each), whereas the Wide network has already spent 3,585 parameters on a single hidden layer.

> **Think about**: The wide network's first layer (5→512) creates a weight matrix of 2,560 connections plus 512 biases. Meanwhile, the deep network's largest layer (32→32) has only 1,024 connections. This is why one very wide layer can cost more than several narrow ones!

### Question 2: Parameter distribution

Look at your layer-by-layer breakdowns.

**Answer the following**:

1. In the wide network, which layer dominates the parameter budget?
2. In the deep network, how are parameters distributed—mostly in one layer or spread evenly?
3. Why does the first layer in ANY network tend to have many parameters when connecting to a large hidden layer?

_Write your answer here:_
1. In the wide network, the first layer (5→512) dominates with 3,072 parameters, about 86% of the 3,585 total.
2. In the deep network, no single layer dominates. Most parameters sit in the two 32→32 layers (1,056 each, about 90% of the total combined), while the 5→32 input layer (192) and the 32→1 output layer (33) are small.
3. The first layer has many parameters when connecting to a large hidden layer because parameters = (inputs × neurons) + neurons. Even with just 5 inputs, connecting to 512 neurons creates (5 × 512) + 512 = 3,072 parameters.

> **Consider the math**: In a Linear layer, `parameters = (inputs × outputs) + outputs`. So when you connect 5 inputs to 512 outputs, you get `(5 × 512) + 512 = 3,072` parameters. But when you connect 32 to 32, you get `(32 × 32) + 32 = 1,056`. Between hidden layers of equal width, parameters grow quadratically with width (about neurons² per layer) but only linearly with depth (neurons² × layers). That is why stacking narrow layers is cheaper than widening them!
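
To see the two growth rates side by side, the sketch below varies width and depth separately using the same parameter formula. The specific widths and depths are arbitrary choices for illustration, and `mlp_params` is a hypothetical helper name.

```python
def mlp_params(layer_sizes):
    """Total parameters of an MLP given as [input, hidden..., output]."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))


# Widening a network with ONE hidden layer (5 -> w -> 1): linear in w (7w + 1)
one_hidden = {w: mlp_params([5, w, 1]) for w in (256, 512, 1024)}

# Widening a network with THREE hidden layers: the w x w layers grow quadratically
three_hidden = {w: mlp_params([5, w, w, w, 1]) for w in (32, 64, 128)}

# Deepening at a fixed width of 32: each extra layer adds a constant 1,056
by_depth = {d: mlp_params([5] + [32] * d + [1]) for d in (1, 2, 3, 4, 5)}

print("One hidden layer, by width:   ", one_hidden)
print("Three hidden layers, by width:", three_hidden)
print("Width 32, by number of layers:", by_depth)
```

Doubling the width of the three-hidden-layer network roughly quadruples its size, while adding one more 32-neuron layer always costs the same fixed amount.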

### Question 3: Architectural trade-offs

Based on everything you've observed, think about the broader implications.

**Reflect on the following**:

1. **Width vs. Depth for parameters**: If you wanted to add more model capacity but had limited memory, would you add width or depth? Why?

2. **Width vs. Depth for learning**: From the conceptual videos, you learned that depth enables hierarchical feature learning (simple → complex transformations), while width enables learning many parallel patterns. Which seems more powerful for complex problems?

3. **Real-world constraints**: Imagine deploying to a mobile app with a 10MB size limit. How would this constraint influence your architectural decisions?

_Write your answer here:_
1. I would add depth rather than width if memory is limited. Adding another 32→32 layer adds 1,056 parameters, while doubling the width of a hidden-to-hidden layer from 64 to 128 neurons roughly quadruples its parameters (from 4,160 to 16,512).
2. Depth seems more powerful for complex problems. The conceptual videos explained that depth enables hierarchical transformations—each layer builds increasingly abstract features. Width captures many parallel patterns but at one level of abstraction. For complex real-world problems that have hierarchical structure (like images: edges → shapes → objects), depth provides a more natural fit.
3. A 10MB limit would severely constrain architecture choices. I would:
    -   Favor deeper, narrower networks for parameter efficiency
    -   Carefully measure parameter counts for each design
    -   Translate the size limit into a parameter budget (at roughly 4 bytes per float32 parameter, 2.5 million parameters ≈ 10MB, before any file-format overhead)
    -   Test whether the model can achieve acceptable performance within the constraint, or whether the problem requires a different approach (like model compression techniques)
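
The size estimate in the last answer can be checked with a few lines of arithmetic. This is a rough sketch that assumes float32 weights (4 bytes each) and a decimal 10 MB budget (10,000,000 bytes). `model_size_bytes` and `max_params` are hypothetical helper names, and a real saved file would include some extra overhead.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats


def model_size_bytes(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Approximate storage needed for the parameters alone."""
    return num_params * bytes_per_param


def max_params(budget_bytes, bytes_per_param=BYTES_PER_FLOAT32):
    """Largest parameter count that fits in a given byte budget."""
    return budget_bytes // bytes_per_param


budget = 10 * 1000 * 1000  # 10 MB, decimal megabytes

print(f"Wide & Shallow: {model_size_bytes(3585):,} bytes")
print(f"Deep & Narrow:  {model_size_bytes(2337):,} bytes")
print(f"Parameters that fit in 10 MB: {max_params(budget):,}")
```

Both networks from this exercise use only a tiny fraction of that budget, so the constraint only starts to matter for much larger models.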

> **What depth and width should you start with?** The best starting point is a moderately Deep and moderately Wide network (e.g., 2-3 hidden layers, each with 50-100 neurons). This is the balanced approach that solves most common structured data problems, providing both hierarchical learning and a rich capacity for parallel pattern detection.
> 
> <details> <summary><i>Curious to see how experts combine the two?</i></summary>
> In production systems, the Wide & Deep architecture jointly trains both styles: the Wide component handles fast memorization, while the Deep component ensures superior generalization. Look at the original implementation used by Google for app recommendations: <a href="https://research.google/blog/wide-amp-deep-learning-better-together-with-tensorflow/">Wide & Deep Learning: Better Together with TensorFlow</a>.
> </details>

## Conclusion

Congratulations! You've explored the fundamental architectural trade-off in neural network design: depth vs. width.

**What you've accomplished:**

- [x] Built a wide & shallow network (few layers, many neurons per layer)
- [x] Built a deep & narrow network (many layers, few neurons per layer)  
- [x] Experimented with different activation functions
- [x] Measured parameter counts to understand resource requirements
- [x] Analyzed parameter distribution across layers
- [x] Discovered that depth is more parameter-efficient than width

**Critical insights:**

1. **Depth enables hierarchical learning**: Multiple layers transform problems through increasing abstraction  
2. **Width enables parallel patterns**: More neurons capture more diverse features simultaneously
3. **Depth is parameter-efficient**: Adding layers of a fixed width scales parameters linearly; widening hidden-to-hidden layers scales them quadratically
4. **Activations add zero parameters**: You can experiment with different activations without changing model size
5. **First layers often dominate**: Connecting inputs to wide hidden layers creates parameter bottlenecks
6. **Research often favors depth**: Deeper networks frequently match or outperform wider ones with comparable or fewer parameters, although the best choice depends on the task

The real power of depth comes from what it enables the network to learn: hierarchical representations where each layer builds increasingly abstract features. For now, you've gained the crucial insight that **architectural choices have concrete consequences**: both for what patterns a network *can* learn, and for whether it *fits* your deployment constraints.

> **Next steps to explore**: Explore balanced architectures, vary hidden-layer widths, and mix activation functions while comparing parameter counts to build intuition about how network design choices affect model complexity.