# Building Models with PyTorch: From Fundamentals to Advanced Architectures

This notebook demonstrates the foundational concepts and tools within PyTorch's `torch.nn` module for constructing neural network models. We'll explore essential classes for defining models and parameters, various common layer types, and crucial functions like activation and loss functions.

## Learning Objectives
- Master the core classes: `nn.Module` and `nn.Parameter`
- Understand common neural network layer types (Linear, Convolutional, RNN, Transformer)
- Explore essential functions: pooling, normalization, dropout, activation, and loss functions
- Build practical models with Australian context examples
- Compare PyTorch patterns with TensorFlow equivalents
- Implement multilingual NLP models (English-Vietnamese)

## Key Concepts Covered
1. **Core Classes**: `nn.Module` and `nn.Parameter`
2. **Layer Types**: Linear, Convolutional, RNN/LSTM, Transformer
3. **Essential Functions**: Pooling, Normalization, Dropout, Activation, Loss
4. **Practical Applications**: Australian tourism sentiment analysis, city classification

---

## 1. Environment Setup and Runtime Detection

This cell detects whether the notebook is running locally, in Google Colab, or on Kaggle, and then installs the packages it needs:

In [24]:
# Environment Detection and Setup
import sys
import subprocess
import os
import time

# Detect the runtime environment
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules or "kaggle" in os.environ.get('KAGGLE_URL_BASE', '')
IS_LOCAL = not (IS_COLAB or IS_KAGGLE)

print(f"Environment detected:")
print(f"  - Local: {IS_LOCAL}")
print(f"  - Google Colab: {IS_COLAB}")
print(f"  - Kaggle: {IS_KAGGLE}")

# Platform-specific system setup
if IS_COLAB:
    print("\nSetting up Google Colab environment...")
    !apt update -qq
    !apt install -y -qq software-properties-common
elif IS_KAGGLE:
    print("\nSetting up Kaggle environment...")
    # Kaggle usually has most packages pre-installed
else:
    print("\nSetting up local environment...")

# Install required packages for this notebook
required_packages = [
    "torch",
    "torchvision",
    "transformers",
    "datasets",
    "tokenizers",
    "pandas",
    "matplotlib",
    "seaborn",
    "tensorboard"
]

print("\nInstalling required packages...")
for package in required_packages:
    if IS_COLAB or IS_KAGGLE:
        !pip install -q {package}
    else:
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", package],
                      capture_output=True)
    print(f"✓ {package}")

# Verify PyTorch installation
import torch
print(f"\n✅ PyTorch {torch.__version__} ready!")
print(f"CUDA available: {torch.cuda.is_available()}")

Environment detected:
  - Local: False
  - Google Colab: True
  - Kaggle: False

Setting up Google Colab environment...
44 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
software-properties-common is already the newest version (0.99.22.9).
0 upgraded, 0 newly installed, 0 to remove and 44 not upgraded.

Installing required packages...
✓ torch
✓ torchvision
✓ transformers
✓ datasets
✓ tokenizers
✓ pandas
✓ matplotlib
✓ seaborn
✓ tensorboard

✅ PyTorch 2.8.0+cu126 ready!
CUDA available: False


## 2. Device Detection and Setup

The helper below selects the best available device, preferring CUDA, then Apple's MPS backend, and falling back to the CPU:

In [25]:
import torch
import platform

def detect_device():
    """
    Detect the best available PyTorch device with comprehensive hardware support.

    Priority order:
    1. CUDA (NVIDIA GPUs) - Best performance for deep learning
    2. MPS (Apple Silicon) - Optimized for M1/M2/M3 Macs
    3. CPU (Universal) - Always available fallback

    Returns:
        torch.device: The optimal device for PyTorch operations
        str: Human-readable device description for logging
    """
    # Check for CUDA (NVIDIA GPU)
    if torch.cuda.is_available():
        device = torch.device("cuda")
        gpu_name = torch.cuda.get_device_name(0)
        device_info = f"CUDA GPU: {gpu_name}"

        print(f"🚀 Using CUDA acceleration")
        print(f"   GPU: {gpu_name}")
        print(f"   CUDA Version: {torch.version.cuda}")

        return device, device_info

    # Check for MPS (Apple Silicon)
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        device = torch.device("mps")
        device_info = "Apple Silicon MPS"

        print(f"🍎 Using Apple Silicon MPS acceleration")
        print(f"   System: {platform.uname().system} {platform.uname().release}")

        return device, device_info

    # Fallback to CPU
    else:
        device = torch.device("cpu")
        device_info = "CPU (No GPU acceleration available)"

        print(f"💻 Using CPU (no GPU acceleration detected)")
        print(f"   Processor: {platform.uname().processor}")
        print(f"   PyTorch Threads: {torch.get_num_threads()}")

        return device, device_info

# Detect and set global device
DEVICE, DEVICE_INFO = detect_device()
print(f"\n✅ PyTorch device selected: {DEVICE}")
print(f"📊 Device info: {DEVICE_INFO}")

💻 Using CPU (no GPU acceleration detected)
   Processor: x86_64
   PyTorch Threads: 1

✅ PyTorch device selected: cpu
📊 Device info: CPU (No GPU acceleration available)


## 3. Core Classes: nn.Module and nn.Parameter

Model building in PyTorch revolves around two key classes within the `torch.nn` module:

### `nn.Module`: The Foundation of PyTorch Models

**`nn.Module`** is the base class for all neural network modules. It encapsulates entire models, as well as individual model components like neural network layers.

**Key aspects:**
- An `__init__` method where the model's layers and functions are defined
- A `forward` method that dictates how data flows through the layers
- Automatic registration of `nn.Parameter` objects assigned as attributes

**TensorFlow Comparison:**
- PyTorch: Explicit `nn.Module` subclass with `__init__` and `forward` methods
- TensorFlow/Keras: `tf.keras.Sequential` or the Functional API derive the forward pass from the layer graph; a subclassed `tf.keras.Model` defines it in `call`
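
The key aspects above can be seen in a few lines. This is a minimal sketch, assuming a hypothetical `TinyNet` class that is not part of the notebook: layers assigned in `__init__` show up as registered children, their weights and biases show up as parameters, and calling the instance runs `forward`.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical class, for illustration only
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 3)  # registered as a submodule
        self.out = nn.Linear(3, 2)     # registered as a submodule

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

net = TinyNet()
child_names = [name for name, _ in net.named_children()]
param_names = [name for name, _ in net.named_parameters()]
y = net(torch.randn(5, 4))  # __call__ dispatches to forward

print(child_names)  # ['hidden', 'out']
print(param_names)  # ['hidden.weight', 'hidden.bias', 'out.weight', 'out.bias']
print(y.shape)      # torch.Size([5, 2])
```

Calling `net(x)` rather than `net.forward(x)` is the usual convention, since it also runs any registered hooks.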

In [26]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Australian City Classifier - Demonstrates nn.Module fundamentals
class AustralianCityClassifier(nn.Module):
    """
    Simple classifier for Australian cities - demonstrates nn.Module patterns.

    TensorFlow equivalent:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(input_size,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(len(australian_cities), activation='softmax')
    ])
    """
    def __init__(self, input_size, hidden_size, num_cities):
        # MUST call parent constructor first
        super(AustralianCityClassifier, self).__init__()

        # Define layers - each is registered as a submodule; its weights and biases are nn.Parameters
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)
        self.fc3 = nn.Linear(hidden_size // 2, num_cities)
        self.dropout = nn.Dropout(0.2)

        # Store city names for interpretation
        self.australian_cities = [
            "Sydney", "Melbourne", "Brisbane", "Perth",
            "Adelaide", "Darwin", "Hobart", "Canberra"
        ]

    def forward(self, x):
        """
        Forward pass - defines how data flows through the model.

        Keras (Sequential/Functional): the forward pass is derived from the layer stack
        PyTorch: written explicitly here; calling model(x) invokes it
        """
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # No softmax here if using CrossEntropyLoss
        return x

    def get_model_info(self):
        """Utility method to get model information."""
        total_params = sum(p.numel() for p in self.parameters())
        trainable_params = sum(p.numel() for p in self.parameters() if p.requires_grad)

        return {
            'total_parameters': total_params,
            'trainable_parameters': trainable_params,
            'cities': self.australian_cities,
            'num_cities': len(self.australian_cities)
        }

# Create model instance
australian_cities = ["Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide", "Darwin", "Hobart", "Canberra"]
model = AustralianCityClassifier(input_size=100, hidden_size=64, num_cities=len(australian_cities))

# Move to detected device
model = model.to(DEVICE)

print("🏛️ Australian City Classifier Model:")
print(model)
print(f"\n📊 Model Info: {model.get_model_info()}")
print(f"🎯 Target cities: {australian_cities}")

🏛️ Australian City Classifier Model:
AustralianCityClassifier(
  (fc1): Linear(in_features=100, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=8, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)

📊 Model Info: {'total_parameters': 8808, 'trainable_parameters': 8808, 'cities': ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide', 'Darwin', 'Hobart', 'Canberra'], 'num_cities': 8}
🎯 Target cities: ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide', 'Darwin', 'Hobart', 'Canberra']
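
The reported total of 8,808 parameters can be checked by hand: each linear layer holds `in_features * out_features` weights plus one bias per output. The sketch below redoes that arithmetic for the three layer sizes used above; `linear_param_count` is a hypothetical helper, not a PyTorch function.

```python
import torch.nn as nn

def linear_param_count(in_features, out_features):
    """Weights (out x in) plus one bias per output unit."""
    return in_features * out_features + out_features

layer_sizes = [(100, 64), (64, 32), (32, 8)]  # fc1, fc2, fc3
by_hand = sum(linear_param_count(i, o) for i, o in layer_sizes)

# Cross-check against what PyTorch actually allocates
layers = nn.Sequential(*[nn.Linear(i, o) for i, o in layer_sizes])
by_pytorch = sum(p.numel() for p in layers.parameters())

print(by_hand, by_pytorch)  # 8808 8808
```

Dropout adds no parameters, which is why it does not appear in the count.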


### `nn.Parameter`: Learnable Weights and Biases

**`nn.Parameter`** is a subclass of `torch.Tensor` specifically designed to represent learnable weights and biases within a model.

**Key features:**
- Automatically have `requires_grad=True`
- Autograd tracks operations for gradient computation
- Automatically registered when assigned as module attributes
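
To make the registration behaviour concrete, here is a small sketch using a hypothetical `Scaler` module. It contrasts an `nn.Parameter`, a plain tensor attribute, and a buffer registered with `register_buffer`.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))   # learnable and registered
        self.offset = torch.zeros(3)               # plain tensor: NOT registered
        self.register_buffer("running_mean", torch.zeros(3))  # saved state, not learnable

    def forward(self, x):
        return (x - self.running_mean) * self.scale + self.offset

m = Scaler()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())

m(torch.randn(2, 3)).sum().backward()  # autograd fills in scale.grad

print(param_names)               # ['scale']
print(state_keys)                # contains 'scale' and 'running_mean', not 'offset'
print(m.scale.grad is not None)  # True
```

Only `scale` would be handed to an optimizer via `m.parameters()`. The plain `offset` tensor is also skipped by `m.to(device)`, which is a common source of device-mismatch errors.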

In [27]:
# Demonstrate nn.Parameter usage
class CustomLinearLayer(nn.Module):
    """
    Custom linear layer demonstrating nn.Parameter usage.

    This manually implements what nn.Linear does internally.
    """
    def __init__(self, in_features, out_features):
        super(CustomLinearLayer, self).__init__()

        # Define learnable parameters explicitly
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.randn(out_features))

        # Initialize weights (common practice)
        self._init_weights()

    def _init_weights(self):
        """Initialize weights using Xavier initialization."""
        nn.init.xavier_uniform_(self.weight)
        nn.init.zeros_(self.bias)

    def forward(self, x):
        # Manual linear transformation: y = xW^T + b
        return F.linear(x, self.weight, self.bias)

# Create custom layer
custom_layer = CustomLinearLayer(10, 5).to(DEVICE)

print("🔧 Custom Linear Layer:")
print(custom_layer)
print(f"\n📋 Parameters:")
for name, param in custom_layer.named_parameters():
    print(f"  {name}: {param.shape}, requires_grad={param.requires_grad}")

# Compare with built-in nn.Linear
builtin_layer = nn.Linear(10, 5).to(DEVICE)
print(f"\n🏗️ Built-in nn.Linear:")
for name, param in builtin_layer.named_parameters():
    print(f"  {name}: {param.shape}, requires_grad={param.requires_grad}")

# Test both layers with sample input
sample_input = torch.randn(1, 10).to(DEVICE)
custom_output = custom_layer(sample_input)
builtin_output = builtin_layer(sample_input)

print(f"\n🧪 Test Results:")
print(f"  Custom layer output shape: {custom_output.shape}")
print(f"  Built-in layer output shape: {builtin_output.shape}")
print(f"  Both layers produce outputs of the same shape ✅")

🔧 Custom Linear Layer:
CustomLinearLayer()

📋 Parameters:
  weight: torch.Size([5, 10]), requires_grad=True
  bias: torch.Size([5]), requires_grad=True

🏗️ Built-in nn.Linear:
  weight: torch.Size([5, 10]), requires_grad=True
  bias: torch.Size([5]), requires_grad=True

🧪 Test Results:
  Custom layer output shape: torch.Size([1, 5])
  Built-in layer output shape: torch.Size([1, 5])
  Both layers produce outputs of the same shape ✅


## 4. Common Neural Network Layer Types

PyTorch provides classes for standard neural network layer types. Let's explore each with practical examples.

### 4.1 Linear Layers (Fully Connected Layers)

**Linear layers** are the most basic type of neural network layer. Every input influences every output, with the degree of influence determined by the layer's weights.

**Key characteristics:**
- If a layer has `m` inputs and `n` outputs, PyTorch stores its weight matrix with shape `n x m` (`out_features x in_features`) and computes `y = xW^T + b`
- Widely used, commonly found at the end of classifier models
- Parameters automatically set to `requires_grad=True`
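
As a quick check of the shape convention, the sketch below reproduces what `nn.Linear` computes with an explicit matrix product. The sizes `m = 3` and `n = 2` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, n = 3, 2                      # m inputs, n outputs
layer = nn.Linear(m, n)
x = torch.randn(4, m)            # batch of 4 samples

manual = x @ layer.weight.T + layer.bias   # y = x W^T + b
builtin = layer(x)

print(layer.weight.shape)                          # torch.Size([2, 3])
print(torch.allclose(manual, builtin, atol=1e-6))  # True
```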

In [28]:
# Linear Layer Demonstration with Australian Tourism Example

# Simple linear layer example
print("🔗 Basic Linear Layer Example:")
linear_layer = nn.Linear(3, 2)  # 3 inputs -> 2 outputs
print(f"Layer: {linear_layer}")
print(f"Weight shape: {linear_layer.weight.shape}")
print(f"Bias shape: {linear_layer.bias.shape}")

# Test with sample input
sample_input = torch.tensor([[1.0, 2.0, 3.0]])  # Batch size 1, 3 features
output = linear_layer(sample_input)
print(f"Input: {sample_input}")
print(f"Output: {output}")
print(f"Output shape: {output.shape}")

🔗 Basic Linear Layer Example:
Layer: Linear(in_features=3, out_features=2, bias=True)
Weight shape: torch.Size([2, 3])
Bias shape: torch.Size([2])
Input: tensor([[1., 2., 3.]])
Output: tensor([[ 0.5259, -0.7117]], grad_fn=<AddmmBackward0>)
Output shape: torch.Size([1, 2])


In [29]:
# Practical Example: Australian Tourism Rating Predictor
class TourismRatingPredictor(nn.Module):
    """
    Predicts tourism ratings for Australian attractions using linear layers.

    Features: [location_score, accessibility, price_rating]
    Output: Rating on a 0-5 star scale

    TensorFlow equivalent:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(3,)),
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    """
    def __init__(self):
        super(TourismRatingPredictor, self).__init__()

        # Sequential linear layers with decreasing dimensions
        self.fc1 = nn.Linear(3, 16)   # Input: 3 features
        self.fc2 = nn.Linear(16, 8)   # Hidden layer
        self.fc3 = nn.Linear(8, 1)    # Output: 1 rating

        # Activation and regularization
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.sigmoid = nn.Sigmoid()   # Normalize output to 0-1

        # Australian attractions for demo
        self.attractions = [
            "Sydney Opera House",
            "Melbourne Laneways",
            "Uluru (Ayers Rock)",
            "Great Barrier Reef",
            "Blue Mountains"
        ]

    def forward(self, x):
        # Forward pass through linear layers
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))  # Output: 0-1 range
        return x * 5  # Scale the 0-1 sigmoid output to a 0-5 star range

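    def predict_rating(self, location_score, accessibility, price_rating):
        """Predict rating for given features."""
        # Build the input on the same device as the model's parameters
        device = next(self.parameters()).device
        features = torch.tensor([[location_score, accessibility, price_rating]],
                                dtype=torch.float32, device=device)
        self.eval()  # Disable dropout so the prediction is deterministic
        with torch.no_grad():
            rating = self(features)
        return rating.item()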

# Create and test the tourism rating predictor
tourism_model = TourismRatingPredictor().to(DEVICE)

print("🏖️ Australian Tourism Rating Predictor:")
print(tourism_model)

# Test with sample Australian attractions
print("\n🎯 Sample Predictions:")
test_attractions = [
    {"name": "Sydney Opera House", "features": [0.95, 0.85, 0.7]},
    {"name": "Melbourne Coffee Culture", "features": [0.8, 0.9, 0.9]},
    {"name": "Uluru Sacred Site", "features": [1.0, 0.6, 0.8]},
    {"name": "Great Barrier Reef", "features": [0.98, 0.5, 0.4]}
]

for attraction in test_attractions:
    rating = tourism_model.predict_rating(*attraction["features"])
    print(f"  {attraction['name']}: {rating:.2f} stars")

print(f"\n📊 Model Parameters:")
total_params = sum(p.numel() for p in tourism_model.parameters())
print(f"  Total parameters: {total_params}")
for name, param in tourism_model.named_parameters():
    print(f"  {name}: {param.shape}")

🏖️ Australian Tourism Rating Predictor:
TourismRatingPredictor(
  (fc1): Linear(in_features=3, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=8, bias=True)
  (fc3): Linear(in_features=8, out_features=1, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.1, inplace=False)
  (sigmoid): Sigmoid()
)

🎯 Sample Predictions:
  Sydney Opera House: 2.25 stars
  Melbourne Coffee Culture: 2.24 stars
  Uluru Sacred Site: 2.18 stars
  Great Barrier Reef: 2.25 stars

📊 Model Parameters:
  Total parameters: 209
  fc1.weight: torch.Size([16, 3])
  fc1.bias: torch.Size([16])
  fc2.weight: torch.Size([8, 16])
  fc2.bias: torch.Size([8])
  fc3.weight: torch.Size([1, 8])
  fc3.bias: torch.Size([1])


### 4.2 Convolutional Layers

**Convolutional layers** are designed for data with strong spatial correlations, such as images. They detect local features and compose them into larger features or recognized objects.

**Key concepts:**
- `nn.Conv2d(in_channels, out_channels, kernel_size)`
- `in_channels`: Number of input channels (1 for grayscale, 3 for color)
- `out_channels`: Number of features the layer learns to detect
- `kernel_size`: Size of the convolution "window" (e.g., 5x5)
- Output: Activation map showing spatial locations of detected features
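
The spatial size of each activation map follows from the kernel size, stride, and padding. The sketch below applies the output-size formula from the PyTorch `Conv2d` and `MaxPool2d` documentation to a LeNet-style stack, assuming a 32x32 grayscale input; `conv_out_size` is a hypothetical helper.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension for Conv2d / MaxPool2d."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

sizes = [32]
sizes.append(conv_out_size(sizes[-1], kernel_size=5))            # conv 5x5
sizes.append(conv_out_size(sizes[-1], kernel_size=2, stride=2))  # max pool 2x2
sizes.append(conv_out_size(sizes[-1], kernel_size=5))            # conv 5x5
sizes.append(conv_out_size(sizes[-1], kernel_size=2, stride=2))  # max pool 2x2
print(sizes)  # [32, 28, 14, 10, 5]

# Cross-check against real layers
features = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5), nn.MaxPool2d(2, 2),
)
out = features(torch.randn(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 16, 5, 5])
```

The final `16 x 5 x 5 = 400` values are what a following linear layer would receive after flattening.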

In [30]:
# LeNet-5 Inspired CNN for Australian Landmark Recognition
class AustralianLandmarkCNN(nn.Module):
    """
    CNN for recognizing Australian landmarks, inspired by LeNet-5 architecture.

    Architecture:
    - Conv2d(1, 6, 5) -> ReLU -> MaxPool2d
    - Conv2d(6, 16, 5) -> ReLU -> MaxPool2d
    - Flatten -> Linear layers

    TensorFlow equivalent:
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, activation='relu', input_shape=(32, 32, 1)),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(16, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation='relu'),
        tf.keras.layers.Dense(84, activation='relu'),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    """
    def __init__(self, num_classes=5):
        super(AustralianLandmarkCNN, self).__init__()

        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)

        # Pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully connected layers
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 16 channels x 5 x 5 feature map, assuming 32x32 inputs
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)

        # Activation and regularization
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

        # Australian landmarks for classification
        self.landmarks = [
            "Sydney Opera House",
            "Uluru (Ayers Rock)",
            "Sydney Harbour Bridge",
            "Twelve Apostles",
            "Parliament House Canberra"
        ]

    def forward(self, x):
        # Convolutional feature extraction
        x = self.pool(self.relu(self.conv1(x)))  # Conv1 -> ReLU -> Pool
        x = self.pool(self.relu(self.conv2(x)))  # Conv2 -> ReLU -> Pool

        # Flatten for fully connected layers
        x = x.view(-1, 16 * 5 * 5)  # Reshape to (batch_size, features)

        # Fully connected classification
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # No softmax if using CrossEntropyLoss

        return x

    def get_feature_maps(self, x):
        """Return intermediate feature maps for visualization."""
        conv1_out = self.relu(self.conv1(x))
        pool1_out = self.pool(conv1_out)
        conv2_out = self.relu(self.conv2(pool1_out))
        pool2_out = self.pool(conv2_out)

        return {
            'conv1': conv1_out,
            'pool1': pool1_out,
            'conv2': conv2_out,
            'pool2': pool2_out
        }

# Create and test the CNN
landmark_cnn = AustralianLandmarkCNN(num_classes=5).to(DEVICE)

print("🏛️ Australian Landmark CNN:")
print(landmark_cnn)

# Test with sample input (simulating 32x32 grayscale images)
sample_batch = torch.randn(4, 1, 32, 32).to(DEVICE)  # Batch of 4 images
output = landmark_cnn(sample_batch)

print(f"\n🧪 Test Results:")
print(f"  Input shape: {sample_batch.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Batch size: {output.shape[0]}")
print(f"  Number of classes: {output.shape[1]}")

# Analyze feature maps
with torch.no_grad():
    feature_maps = landmark_cnn.get_feature_maps(sample_batch[:1])  # Single image

print(f"\n🔍 Feature Map Analysis:")
for layer_name, feature_map in feature_maps.items():
    print(f"  {layer_name}: {feature_map.shape}")

print(f"\n🎯 Landmark Categories: {landmark_cnn.landmarks}")

# Parameter count
total_params = sum(p.numel() for p in landmark_cnn.parameters())
conv_params = sum(p.numel() for name, p in landmark_cnn.named_parameters() if 'conv' in name)
fc_params = sum(p.numel() for name, p in landmark_cnn.named_parameters() if 'fc' in name)

print(f"\n📊 Parameter Breakdown:")
print(f"  Total parameters: {total_params:,}")
print(f"  Convolutional parameters: {conv_params:,}")
print(f"  Fully connected parameters: {fc_params:,}")

🏛️ Australian Landmark CNN:
AustralianLandmarkCNN(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=5, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.2, inplace=False)
)

🧪 Test Results:
  Input shape: torch.Size([4, 1, 32, 32])
  Output shape: torch.Size([4, 5])
  Batch size: 4
  Number of classes: 5

🔍 Feature Map Analysis:
  conv1: torch.Size([1, 6, 28, 28])
  pool1: torch.Size([1, 6, 14, 14])
  conv2: torch.Size([1, 16, 10, 10])
  pool2: torch.Size([1, 16, 5, 5])

🎯 Landmark Categories: ['Sydney Opera House', 'Uluru (Ayers Rock)', 'Sydney Harbour Bridge', 'Twelve Apostles', 'Parliament House Canberra']

📊 Parameter Breakdown:
  Total parameters: 61,281
  Convolutional parameters: 2,572
  Fully connected parameters: 58,709

### 4.3 Recurrent Neural Networks (RNNs)

**RNNs** are used for sequential data like natural language sentences or time-series measurements.

**Key concepts:**
- **Hidden State/Memory**: RNNs maintain a hidden state that acts as memory
- **Sequential Processing**: Process one element at a time, maintaining context
- **Variants**: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit)
- **Applications**: NLP, time series prediction, sequence-to-sequence tasks
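
The hidden state is what carries context from one element to the next. As a rough illustration, the sketch below feeds a sequence to an `nn.LSTM` one time step at a time, passing the state forward manually, and checks that the result matches processing the whole sequence in one call. The sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 5, 8)  # [batch, seq_len, features]

# Whole sequence in one call
full_out, (h_n, c_n) = lstm(x)

# One step at a time, carrying (hidden, cell) state forward
state = None  # None means "start from zeros"
steps = []
for t in range(x.size(1)):
    out_t, state = lstm(x[:, t:t + 1, :], state)
    steps.append(out_t)
stepwise_out = torch.cat(steps, dim=1)

print(torch.allclose(full_out, stepwise_out, atol=1e-5))  # True
print(state[0].shape)  # torch.Size([1, 2, 16])
```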

In [31]:
# Part-of-Speech Tagger for Australian Tourism Text
class AustralianPOSTagger(nn.Module):
    """
    Part-of-Speech tagger for Australian tourism content using LSTM.

    Architecture:
    - Embedding layer: Maps vocabulary words to embeddings
    - LSTM layer: Processes sequence of embeddings
    - Linear layer: Classifies POS tags with log_softmax

    Example input: "Sydney beaches are beautiful"
    Expected output: [PROPN, NOUN, VERB, ADJ] (simplified POS tags)

    TensorFlow equivalent:
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.LSTM(hidden_dim, return_sequences=True),
        tf.keras.layers.Dense(tagset_size, activation='softmax')
    ])
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim, tagset_size):
        super(AustralianPOSTagger, self).__init__()

        # Embedding layer - maps words to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layer - processes sequences
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Classification layer
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

        # Store tag and word mappings
        self.pos_tags = [
            "NOUN",     # beaches, coffee, culture
            "PROPN",    # Sydney, Melbourne, Australia
            "VERB",     # are, visit, enjoy
            "ADJ",      # beautiful, amazing, expensive
            "DET",      # the, a, an
            "ADP",      # in, on, at
            "ADV"       # very, really, quite
        ]

        # Sample Australian tourism vocabulary
        self.sample_vocab = [
            "<PAD>", "<UNK>", "Sydney", "Melbourne", "beaches", "are", "beautiful",
            "coffee", "culture", "amazing", "Opera", "House", "Harbour", "Bridge",
            "the", "in", "at", "very", "quite", "visit", "enjoy", "expensive"
        ]

    def forward(self, sentence):
        """
        Forward pass for POS tagging.

        Args:
            sentence: Tensor of word indices [batch_size, seq_len]

        Returns:
            POS tag scores for each word [batch_size, seq_len, tagset_size]
        """
        # Get word embeddings
        embeds = self.embedding(sentence)  # [batch_size, seq_len, embedding_dim]

        # Process through LSTM
        lstm_out, _ = self.lstm(embeds)    # [batch_size, seq_len, hidden_dim]

        # Classify each position
        tag_space = self.hidden2tag(lstm_out)  # [batch_size, seq_len, tagset_size]

        # Apply log_softmax for NLL loss
        tag_scores = F.log_softmax(tag_space, dim=2)

        return tag_scores

    def predict_pos(self, sentence_indices):
        """Predict POS tags for a sentence."""
        with torch.no_grad():
            scores = self.forward(sentence_indices)
            predicted_tags = torch.argmax(scores, dim=2)
        return predicted_tags

# Create POS tagger
vocab_size = 1000
embedding_dim = 64
hidden_dim = 128
tagset_size = 7  # Number of POS tags

pos_tagger = AustralianPOSTagger(vocab_size, embedding_dim, hidden_dim, tagset_size).to(DEVICE)

print("🏷️ Australian Tourism POS Tagger:")
print(pos_tagger)

# Test with sample sentences
print(f"\n🎯 POS Tags: {pos_tagger.pos_tags}")
print(f"📚 Sample Vocabulary: {pos_tagger.sample_vocab[:10]}...")

# Create sample input (word indices)
# Simulating "Sydney beaches are beautiful"
sample_sentence = torch.tensor([[2, 4, 5, 6]], dtype=torch.long).to(DEVICE)  # [batch_size=1, seq_len=4]
sample_words = ["Sydney", "beaches", "are", "beautiful"]
expected_pos = ["PROPN", "NOUN", "VERB", "ADJ"]

# Forward pass
tag_scores = pos_tagger(sample_sentence)
predicted_tags = pos_tagger.predict_pos(sample_sentence)

print(f"\n🧪 Test Results:")
print(f"  Input shape: {sample_sentence.shape}")
print(f"  Output shape: {tag_scores.shape}")
print(f"  Sample sentence: {' '.join(sample_words)}")
print(f"  Expected POS tags: {' '.join(expected_pos)}")
print(f"  Predicted tag indices: {predicted_tags.squeeze().tolist()}")

# Show parameter breakdown
total_params = sum(p.numel() for p in pos_tagger.parameters())
embed_params = pos_tagger.embedding.weight.numel()
lstm_params = sum(p.numel() for name, p in pos_tagger.named_parameters() if 'lstm' in name)
linear_params = pos_tagger.hidden2tag.weight.numel() + pos_tagger.hidden2tag.bias.numel()

print(f"\n📊 Parameter Breakdown:")
print(f"  Total parameters: {total_params:,}")
print(f"  Embedding parameters: {embed_params:,}")
print(f"  LSTM parameters: {lstm_params:,}")
print(f"  Linear parameters: {linear_params:,}")

🏷️ Australian Tourism POS Tagger:
AustralianPOSTagger(
  (embedding): Embedding(1000, 64)
  (lstm): LSTM(64, 128, batch_first=True)
  (hidden2tag): Linear(in_features=128, out_features=7, bias=True)
)

🎯 POS Tags: ['NOUN', 'PROPN', 'VERB', 'ADJ', 'DET', 'ADP', 'ADV']
📚 Sample Vocabulary: ['<PAD>', '<UNK>', 'Sydney', 'Melbourne', 'beaches', 'are', 'beautiful', 'coffee', 'culture', 'amazing']...

🧪 Test Results:
  Input shape: torch.Size([1, 4])
  Output shape: torch.Size([1, 4, 7])
  Sample sentence: Sydney beaches are beautiful
  Expected POS tags: PROPN NOUN VERB ADJ
  Predicted tag indices: [2, 1, 2, 1]

📊 Parameter Breakdown:
  Total parameters: 164,231
  Embedding parameters: 64,000
  LSTM parameters: 99,328
  Linear parameters: 903


### Multilingual LSTM Example: English-Vietnamese Sentiment Analysis

In [32]:
# Multilingual Sentiment Analyzer for Australian Tourism
class MultilingualSentimentLSTM(nn.Module):
    """
    LSTM-based sentiment analyzer supporting English and Vietnamese.

    Designed for Australian tourism reviews in both languages.
    Example texts:
    - English: "Sydney Opera House is absolutely stunning!"
    - Vietnamese: "Nhà hát Opera Sydney thật tuyệt vời!"
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=2):
        super(MultilingualSentimentLSTM, self).__init__()

        # Shared embedding for both languages
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # Bidirectional LSTM for better context understanding
        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim, num_layers,
            batch_first=True, bidirectional=True, dropout=0.3
        )

        # Classification layers
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(hidden_dim * 2, 3)  # 3 sentiment classes

        # Sentiment labels
        self.sentiments = ["negative", "neutral", "positive"]

        # Sample tourism phrases in English and Vietnamese
        self.sample_phrases = {
            'en': [
                "Sydney beaches are amazing",
                "Melbourne coffee is expensive",
                "Great Barrier Reef is beautiful",
                "Tourist traps are disappointing"
            ],
            'vi': [
                "Bãi biển Sydney tuyệt vời",
                "Cà phê Melbourne đắt tiền",
                "Rạn san hô Great Barrier Reef đẹp",
                "Bẫy du lịch thật thất vọng"
            ]
        }

    def forward(self, x, lengths=None):
        """
        Forward pass for sentiment classification.

        Args:
            x: Input token indices [batch_size, seq_len]
            lengths: Actual sequence lengths for packing (optional)
        """
        # Embedding lookup
        embedded = self.embedding(x)  # [batch_size, seq_len, embedding_dim]

        # LSTM processing
        if lengths is not None:
            # Pack variable-length sequences; `lengths` must be a list or a CPU tensor
            packed = nn.utils.rnn.pack_padded_sequence(
                embedded, lengths, batch_first=True, enforce_sorted=False
            )
            lstm_out, (hidden, _) = self.lstm(packed)
            lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
        else:
            lstm_out, (hidden, _) = self.lstm(embedded)

        # Use last hidden state from both directions
        # hidden: [num_layers * num_directions, batch_size, hidden_dim]
        forward_hidden = hidden[-2, :, :]   # Last layer, forward direction
        backward_hidden = hidden[-1, :, :]  # Last layer, backward direction

        # Concatenate bidirectional hidden states
        final_hidden = torch.cat([forward_hidden, backward_hidden], dim=1)

        # Classification
        output = self.dropout(final_hidden)
        sentiment_scores = self.classifier(output)

        return sentiment_scores

# Create multilingual sentiment analyzer
multilingual_lstm = MultilingualSentimentLSTM(
    vocab_size=5000, embedding_dim=128, hidden_dim=64, num_layers=2
).to(DEVICE)

print("🌏 Multilingual Sentiment LSTM:")
print(multilingual_lstm)

# Test with sample multilingual input
sample_batch = torch.randint(1, 100, (2, 10)).to(DEVICE)  # 2 sentences, max 10 tokens
sentiment_scores = multilingual_lstm(sample_batch)

print(f"\n🧪 Test Results:")
print(f"  Input shape: {sample_batch.shape}")
print(f"  Output shape: {sentiment_scores.shape}")
print(f"  Sentiment classes: {multilingual_lstm.sentiments}")

# Show sample phrases
print(f"\n🗣️ Sample Tourism Phrases:")
for lang, phrases in multilingual_lstm.sample_phrases.items():
    print(f"  {lang.upper()}: {phrases[0]}")

# Parameter analysis
total_params = sum(p.numel() for p in multilingual_lstm.parameters())
print(f"\n📊 Multilingual LSTM Parameters: {total_params:,}")

🌏 Multilingual Sentiment LSTM:
MultilingualSentimentLSTM(
  (embedding): Embedding(5000, 128, padding_idx=0)
  (lstm): LSTM(128, 64, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (classifier): Linear(in_features=128, out_features=3, bias=True)
)

🧪 Test Results:
  Input shape: torch.Size([2, 10])
  Output shape: torch.Size([2, 3])
  Sentiment classes: ['negative', 'neutral', 'positive']

🗣️ Sample Tourism Phrases:
  EN: Sydney beaches are amazing
  VI: Bãi biển Sydney tuyệt vời

📊 Multilingual LSTM Parameters: 839,043
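
The test above calls the model without `lengths`, so the packing branch is never exercised. The following is a minimal sketch of what packing buys, using a small standalone bidirectional LSTM whose sizes are chosen purely for illustration (they are not the notebook's). It checks that, with packing, the final hidden state of a padded sequence matches the result of running the unpadded sequence on its own.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy sizes chosen for illustration only
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)

# Batch of 2 sequences padded to length 5; the second has only 3 real steps
batch = torch.randn(2, 5, 4)
batch[1, 3:] = 0.0                  # padded positions
lengths = torch.tensor([5, 3])      # lengths are kept on the CPU

packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
_, (hidden_packed, _) = lstm(packed)

# Reference: run the short sequence alone, without any padding
_, (hidden_alone, _) = lstm(batch[1:2, :3])

# With packing, padding never enters the recurrence, so both results agree
matches = torch.allclose(hidden_packed[:, 1], hidden_alone[:, 0], atol=1e-5)

# Without packing, the LSTM also steps over the padded positions
_, (hidden_unpacked, _) = lstm(batch)
differs = not torch.allclose(hidden_unpacked[:, 1], hidden_alone[:, 0], atol=1e-5)

print(f"packed == unpadded: {matches}, unpacked differs: {differs}")
```

In the bidirectional case this matters most for the backward direction, which would otherwise start its pass on padding tokens.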


### 4.4 Transformers

**Transformers** are multi-purpose neural networks, prominently used in modern NLP applications. They've revolutionized NLP with models like BERT, GPT, and T5.

**Key aspects:**
- **Self-attention mechanism**: Lets every token weigh all other tokens in the input when building its representation
- **Parallel processing**: Unlike RNNs, Transformers process all positions simultaneously
- **Encoder-decoder architecture**: Flexible for various NLP tasks; classification models often use only the encoder
- **PyTorch components**: `nn.Transformer`, `nn.TransformerEncoder`, `nn.TransformerDecoder`
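
The self-attention bullet above can be sketched in a few lines. This is a simplified, single-head version of scaled dot-product attention, written by hand for illustration. The learned query/key/value projections that `nn.MultiheadAttention` applies are omitted, and the function name and tensor sizes are assumptions for this sketch.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for [batch, seq_len, d_k] tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # [batch, seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v, weights

# Toy input: 1 sentence, 4 tokens, 8-dimensional token representations
x = torch.randn(1, 4, 8)

# In self-attention, queries, keys and values all come from the same sequence
attended, weights = scaled_dot_product_attention(x, x, x)

print(attended.shape)  # torch.Size([1, 4, 8])
print(weights.shape)   # torch.Size([1, 4, 4])
```

Each row of `weights` describes how much one token attends to every other token. All rows are computed in a single matrix product, which is where the parallelism over positions comes from.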

In [33]:
# Simple Transformer for Australian Tourism Text Classification
class AustralianTourismTransformer(nn.Module):
    """
    Transformer-based classifier for Australian tourism content.

    Uses PyTorch's built-in transformer components for text classification.
    Suitable for tasks like:
    - Categorizing tourism content (attractions, restaurants, activities)
    - Multilingual tourism review analysis
    - Australian destination recommendation

    Note: This is a simplified example. Production models like BERT
    can be built using similar transformer components with proper parameters.
    """
    def __init__(self, vocab_size, d_model, nhead, num_layers, num_classes):
        super(AustralianTourismTransformer, self).__init__()

        # Embedding and positional encoding
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = nn.Parameter(torch.randn(1000, d_model))  # Max sequence length

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Classification head
        self.classifier = nn.Linear(d_model, num_classes)
        self.dropout = nn.Dropout(0.1)

        # Australian tourism categories
        self.categories = [
            "Attractions",     # Opera House, Uluru, etc.
            "Restaurants",     # Dining experiences
            "Activities",      # Surfing, hiking, etc.
            "Accommodation",   # Hotels, hostels, etc.
            "Transportation"   # Flights, trains, etc.
        ]

        self.d_model = d_model

    def forward(self, x, mask=None):
        """
        Forward pass through transformer.

        Args:
            x: Input token indices [batch_size, seq_len]
            mask: Boolean padding mask [batch_size, seq_len], True at padding positions (optional)
        """
        seq_len = x.size(1)

        # Embedding with positional encoding
        x = self.embedding(x) * (self.d_model ** 0.5)  # Scale embeddings
        x = x + self.pos_encoding[:seq_len, :].unsqueeze(0)  # Add positional encoding

        # Transform through encoder
        x = self.transformer(x, src_key_padding_mask=mask)

        # Global average pooling for classification
        if mask is not None:
            # Mask out padding tokens for averaging
            x = x.masked_fill(mask.unsqueeze(-1), 0)
            x = x.sum(dim=1) / (~mask).sum(dim=1, keepdim=True).float()
        else:
            x = x.mean(dim=1)  # Simple average pooling

        # Classification
        x = self.dropout(x)
        logits = self.classifier(x)

        return logits

# Create transformer model
tourism_transformer = AustralianTourismTransformer(
    vocab_size=3000,
    d_model=256,
    nhead=8,
    num_layers=6,
    num_classes=5
).to(DEVICE)

print("🤖 Australian Tourism Transformer:")
print(f"Model parameters: {sum(p.numel() for p in tourism_transformer.parameters()):,}")
print(f"Categories: {tourism_transformer.categories}")

# Test transformer
sample_input = torch.randint(1, 1000, (2, 20)).to(DEVICE)  # 2 samples, 20 tokens max
output = tourism_transformer(sample_input)

print(f"\n🧪 Transformer Test:")
print(f"  Input shape: {sample_input.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Output categories: {len(tourism_transformer.categories)}")

# Show transformer architecture components
print(f"\n🏗️ Transformer Architecture:")
print(f"  Model dimension: {tourism_transformer.d_model}")
print(f"  Attention heads: 8")
print(f"  Encoder layers: 6")
print(f"  Vocabulary size: 3,000")

# Note about building BERT-like models
print(f"\n💡 Note: Models like BERT can be built using similar transformer components")
print(f"   with proper parameters, pre-training, and fine-tuning strategies.")

🤖 Australian Tourism Transformer:
Model parameters: 5,763,845
Categories: ['Attractions', 'Restaurants', 'Activities', 'Accommodation', 'Transportation']

🧪 Transformer Test:
  Input shape: torch.Size([2, 20])
  Output shape: torch.Size([2, 5])
  Output categories: 5

🏗️ Transformer Architecture:
  Model dimension: 256
  Attention heads: 8
  Encoder layers: 6
  Vocabulary size: 3,000

💡 Note: Models like BERT can be built using similar transformer components
   with proper parameters, pre-training, and fine-tuning strategies.
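
The test above does not pass a `mask`. As a rough sketch of how one might be built, the example below derives a boolean padding mask from token ids and repeats the masked average pooling used in `forward`. The padding id `PAD_ID = 0`, the helper `encode_and_pool`, and the small encoder sizes are assumptions for illustration and are not defined by the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_ID = 0  # assumed padding index (hypothetical)

embedding = nn.Embedding(50, 16)
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=1, enable_nested_tensor=False)
encoder.eval()

tokens = torch.tensor([[5, 8, 3, 9],
                       [7, 2, PAD_ID, PAD_ID]])
mask = tokens.eq(PAD_ID)  # True where the token is padding

def encode_and_pool(x):
    """Encode, then average over the non-padding positions only."""
    out = encoder(x, src_key_padding_mask=mask)
    out = out.masked_fill(mask.unsqueeze(-1), 0.0)
    valid = (~mask).sum(dim=1, keepdim=True).clamp(min=1)  # avoid division by zero
    return out.sum(dim=1) / valid

embedded = embedding(tokens).detach()
pooled = encode_and_pool(embedded)

# Overwrite the padded positions with noise: the pooled output should not change
noisy = embedded.clone()
noisy[mask] = torch.randn(int(mask.sum()), 16)
pooled_noisy = encode_and_pool(noisy)

print(torch.allclose(pooled, pooled_noisy, atol=1e-5))
```

The mask is used twice: once so that attention ignores padded keys, and once so that padded positions do not contribute to the average.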


## 5. Other Essential Layers and Functions

Beyond the main layer types, PyTorch offers supporting layers and functions, most with few or no learnable parameters, that are essential for building effective neural networks.

### 5.1 Pooling Layers

**Pooling layers** reduce the spatial size of feature maps (downsampling) by sliding a window over the input and keeping the maximum or average value within each window.
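
To make the windowing concrete, here is a small sketch on a 4x4 input whose values can be checked by hand. The helper `pooled_size` is a hypothetical name that applies the output-size formula for the unpadded, undilated case.

```python
import torch
import torch.nn as nn

# 4x4 feature map with values 0..15, shaped [batch, channels, height, width]
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_out = nn.MaxPool2d(kernel_size=2, stride=2)(x)
avg_out = nn.AvgPool2d(kernel_size=2, stride=2)(x)

# Top-left window holds 0, 1, 4, 5: max is 5, mean is 2.5
print(max_out.squeeze().tolist())  # [[5.0, 7.0], [13.0, 15.0]]
print(avg_out.squeeze().tolist())  # [[2.5, 4.5], [10.5, 12.5]]

def pooled_size(size, kernel, stride):
    """Output length along one dimension (no padding, no dilation)."""
    return (size - kernel) // stride + 1

print(pooled_size(8, 2, 2))  # 4
```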

In [34]:
# Pooling Layers Demonstration
import torch.nn.functional as F

print("🏊 Pooling Layers Demonstration:")

# Create sample feature map (simulating conv layer output)
# Shape: [batch_size, channels, height, width]
sample_feature_map = torch.randn(1, 3, 8, 8).to(DEVICE)
print(f"Original feature map shape: {sample_feature_map.shape}")

# MaxPool2d - takes maximum value in each window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
max_pooled = max_pool(sample_feature_map)
print(f"After MaxPool2d(2x2): {max_pooled.shape}")

# AvgPool2d - takes average value in each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
avg_pooled = avg_pool(sample_feature_map)
print(f"After AvgPool2d(2x2): {avg_pooled.shape}")

# Adaptive pooling - output size is fixed regardless of input size
adaptive_avg_pool = nn.AdaptiveAvgPool2d((4, 4))  # Always outputs 4x4
adaptive_pooled = adaptive_avg_pool(sample_feature_map)
print(f"After AdaptiveAvgPool2d(4x4): {adaptive_pooled.shape}")

# Global Average Pooling (common in modern architectures)
global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
global_pooled = global_avg_pool(sample_feature_map)
print(f"After Global Average Pooling: {global_pooled.shape}")

print(f"\n💡 Pooling Benefits:")
print(f"  • Reduces computation and memory requirements")
print(f"  • Makes feature detection more robust to small shifts")
print(f"  • Provides translation invariance")
print(f"  • Helps prevent overfitting")

🏊 Pooling Layers Demonstration:
Original feature map shape: torch.Size([1, 3, 8, 8])
After MaxPool2d(2x2): torch.Size([1, 3, 4, 4])
After AvgPool2d(2x2): torch.Size([1, 3, 4, 4])
After AdaptiveAvgPool2d(4x4): torch.Size([1, 3, 4, 4])
After Global Average Pooling: torch.Size([1, 3, 1, 1])

💡 Pooling Benefits:
  • Reduces computation and memory requirements
  • Makes feature detection more robust to small shifts
  • Provides translation invariance
  • Helps prevent overfitting


### 5.2 Normalization Layers

**Normalization layers** like `nn.BatchNorm1d` re-center and normalize the output of one layer before feeding it to the next, leading to faster and more stable training.
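
As a sketch of what "re-center and normalize" means in practice, the example below reproduces `nn.BatchNorm1d` by hand in training mode and contrasts it with `nn.LayerNorm`. It assumes a freshly created layer, where the learnable scale is 1 and the shift is 0. The input sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8) * 3.0 + 5.0  # [batch, features], deliberately off-center

bn = nn.BatchNorm1d(8)
bn.train()
bn_out = bn(x)

# Manual batch norm: per-feature statistics computed across the batch
mean = x.mean(dim=0, keepdim=True)
var = x.var(dim=0, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)
bn_matches = torch.allclose(bn_out, manual, atol=1e-5)

# In eval mode the layer uses its running statistics instead of batch statistics
bn.eval()
bn_eval_out = bn(x)
eval_differs = not torch.allclose(bn_out, bn_eval_out, atol=1e-3)

# LayerNorm normalizes each sample across its own features
ln_out = nn.LayerNorm(8)(x)
per_sample_mean = ln_out.mean(dim=1)

print(bn_matches, eval_differs, per_sample_mean.abs().max().item())
```

The train/eval difference is why `model.eval()` matters for batch normalization as well as for dropout.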

In [35]:
# Normalization Layers Demonstration
print("📊 Normalization Layers:")

# BatchNorm for different input types
batch_size, features, seq_len = 32, 128, 50

# 1D BatchNorm (for fully connected layers)
bn1d = nn.BatchNorm1d(features)
fc_input = torch.randn(batch_size, features).to(DEVICE)
bn1d_output = bn1d(fc_input)
print(f"BatchNorm1d: {fc_input.shape} -> {bn1d_output.shape}")
print(f"  Mean before: {fc_input.mean():.4f}, Std before: {fc_input.std():.4f}")
print(f"  Mean after: {bn1d_output.mean():.4f}, Std after: {bn1d_output.std():.4f}")

# 2D BatchNorm (for convolutional layers)
bn2d = nn.BatchNorm2d(64)  # 64 channels
conv_input = torch.randn(batch_size, 64, 32, 32).to(DEVICE)
bn2d_output = bn2d(conv_input)
print(f"\nBatchNorm2d: {conv_input.shape} -> {bn2d_output.shape}")

# LayerNorm (alternative to BatchNorm, common in transformers)
layer_norm = nn.LayerNorm(features)
ln_output = layer_norm(fc_input)
print(f"\nLayerNorm: {fc_input.shape} -> {ln_output.shape}")

# Practical example with Australian city classifier
class NormalizedAustralianClassifier(nn.Module):
    """Australian city classifier with normalization layers."""
    def __init__(self, input_size, hidden_size, num_cities):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.bn1 = nn.BatchNorm1d(hidden_size)  # Normalize after first layer
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)
        self.bn2 = nn.BatchNorm1d(hidden_size // 2)  # Normalize after second layer
        self.fc3 = nn.Linear(hidden_size // 2, num_cities)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))  # Linear -> BatchNorm -> ReLU
        x = F.relu(self.bn2(self.fc2(x)))  # Linear -> BatchNorm -> ReLU
        x = self.fc3(x)  # Final layer (no normalization)
        return x

normalized_model = NormalizedAustralianClassifier(100, 64, 8).to(DEVICE)
print(f"\n🏛️ Normalized Australian Classifier:")
print(f"  Parameters: {sum(p.numel() for p in normalized_model.parameters()):,}")

print(f"\n✅ Normalization Benefits:")
print(f"  • Prevents vanishing/exploding gradients")
print(f"  • Allows higher learning rates")
print(f"  • Faster and more stable convergence")
print(f"  • Reduces internal covariate shift")

📊 Normalization Layers:
BatchNorm1d: torch.Size([32, 128]) -> torch.Size([32, 128])
  Mean before: -0.0054, Std before: 0.9875
  Mean after: 0.0000, Std after: 1.0001

BatchNorm2d: torch.Size([32, 64, 32, 32]) -> torch.Size([32, 64, 32, 32])

LayerNorm: torch.Size([32, 128]) -> torch.Size([32, 128])

🏛️ Normalized Australian Classifier:
  Parameters: 9,000

✅ Normalization Benefits:
  • Prevents vanishing/exploding gradients
  • Allows higher learning rates
  • Faster and more stable convergence
  • Reduces internal covariate shift


### 5.3 Dropout Layers

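**Dropout** is a regularization technique that helps prevent overfitting by randomly zeroing a fraction `p` of activations during training, so the network cannot rely on any single unit. The surviving activations are scaled by `1 / (1 - p)`, and the layer does nothing in evaluation mode.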
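
The scaling of surviving activations is easiest to see on an input of ones. A minimal sketch, with an arbitrary tensor size:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
dropout = nn.Dropout(p=p)
x = torch.ones(1000)

# Training mode: elements are zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2
dropout.train()
y = dropout(x)
unique_values = sorted(y.unique().tolist())
print(unique_values)
print(f"mean after dropout: {y.mean().item():.3f}")  # stays close to 1 on average

# Evaluation mode: dropout is the identity
dropout.eval()
z = dropout(x)
print(torch.equal(z, x))
```

Because of the rescaling, the expected value of each activation is the same in training and evaluation, so no correction is needed at inference time.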

In [36]:
# Dropout Demonstration
print("🎭 Dropout Layers:")

# Create dropout layer
dropout = nn.Dropout(p=0.5)  # 50% dropout probability

# Sample input
sample_input = torch.randn(4, 10).to(DEVICE)
print(f"Original input:")
print(sample_input)

# Training mode - dropout is active
dropout.train()
training_output = dropout(sample_input)
print(f"\nDuring training (dropout active):")
print(training_output)
print(f"Zeros in output: {(training_output == 0).sum().item()} / {training_output.numel()}")

# Evaluation mode - dropout is inactive
dropout.eval()
eval_output = dropout(sample_input)
print(f"\nDuring evaluation (dropout inactive):")
print(eval_output)
print(f"Zeros in output: {(eval_output == 0).sum().item()} / {eval_output.numel()}")

# Practical example: Australian restaurant review classifier with dropout
class RestaurantReviewClassifier(nn.Module):
    """Restaurant review classifier with dropout regularization."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Multiple dropout layers for different stages
        self.embedding_dropout = nn.Dropout(0.2)  # Light dropout for embeddings
        self.lstm_dropout = nn.Dropout(0.3)       # Moderate dropout after LSTM
        self.classifier_dropout = nn.Dropout(0.5) # Heavy dropout before classifier

        self.classifier = nn.Linear(hidden_dim, 3)  # negative, neutral, positive

        self.restaurant_types = ["Asian", "European", "Australian", "Fusion"]

    def forward(self, x):
        # Embedding with light dropout
        embedded = self.embedding(x)
        embedded = self.embedding_dropout(embedded)

        # LSTM processing
        lstm_out, (hidden, _) = self.lstm(embedded)
        last_hidden = self.lstm_dropout(hidden[-1])

        # Classification with heavy dropout
        features = self.classifier_dropout(last_hidden)
        sentiment = self.classifier(features)

        return sentiment

restaurant_classifier = RestaurantReviewClassifier(5000, 128, 64).to(DEVICE)

print(f"\n🍽️ Restaurant Review Classifier:")
print(f"  Embedding dropout: 20%")
print(f"  LSTM dropout: 30%")
print(f"  Classifier dropout: 50%")
print(f"  Restaurant types: {restaurant_classifier.restaurant_types}")

# Demonstrate training vs evaluation mode
sample_reviews = torch.randint(1, 1000, (2, 15)).to(DEVICE)

restaurant_classifier.train()
train_output = restaurant_classifier(sample_reviews)
print(f"\nTraining mode output: {train_output.shape}")

restaurant_classifier.eval()
eval_output = restaurant_classifier(sample_reviews)
print(f"Evaluation mode output: {eval_output.shape}")

print(f"\n💡 Dropout Best Practices:")
print(f"  • Always turn off during inference: model.eval()")
print(f"  • Use different rates for different layers")
print(f"  • Start with 0.2-0.5 and adjust based on overfitting")
print(f"  • Higher dropout for layers closer to output")

🎭 Dropout Layers:
Original input:
tensor([[-0.2934, -1.2665,  0.6081, -0.2501, -0.3373, -0.7917, -0.4177, -1.2013,
         -0.1163,  1.2424],
        [-0.9284, -1.5218,  0.0698,  1.4362,  1.1988, -0.2821,  1.9995, -0.0930,
         -1.4426, -0.8318],
        [-0.3327, -0.1458,  0.0588,  1.1323, -0.6039, -0.9335, -0.8200, -0.5583,
         -1.0536, -2.0693],
        [ 0.9680, -0.6944, -0.0897, -0.8950,  1.0623,  1.5032, -1.0396, -1.6998,
         -1.9106, -1.0283]])

During training (dropout active):
tensor([[-0.0000, -0.0000,  1.2163, -0.5002, -0.6746, -1.5834, -0.8355, -2.4027,
         -0.0000,  0.0000],
        [-1.8567, -3.0436,  0.1397,  0.0000,  2.3977, -0.0000,  3.9990, -0.0000,
         -0.0000, -1.6637],
        [-0.6655, -0.2916,  0.0000,  0.0000, -1.2078, -0.0000, -1.6400, -0.0000,
         -2.1071, -0.0000],
        [ 1.9360, -0.0000, -0.0000, -0.0000,  0.0000,  3.0064, -2.0791, -3.3995,
         -0.0000, -2.0565]])
Zeros in output: 18 / 40

During evaluation (dropout inactive):
[output truncated]

### 5.4 Activation Functions

**Activation functions** introduce non-linearity into the model, enabling neural networks to learn complex, non-linear relationships. Without them, stacking linear layers would just result in another linear transformation.
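
The claim about stacked linear layers can be checked directly. In this sketch (layer sizes are arbitrary), two `nn.Linear` layers without an activation are collapsed into a single weight matrix and bias, using W = W2·W1 and b = W2·b1 + b2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(4, 6)
fc2 = nn.Linear(6, 3)
x = torch.randn(5, 4)

# Two linear layers applied back to back, with no activation in between
stacked = fc2(fc1(x))

# The same mapping expressed as one linear transformation
W = fc2.weight @ fc1.weight            # [3, 4]
b = fc2.weight @ fc1.bias + fc2.bias   # [3]
collapsed = x @ W.T + b

same_without_activation = torch.allclose(stacked, collapsed, atol=1e-5)

# With a ReLU in between, the network is no longer equivalent to that single layer
with_relu = fc2(torch.relu(fc1(x)))
same_with_relu = torch.allclose(with_relu, collapsed, atol=1e-5)

print(same_without_activation, same_with_relu)
```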

In [37]:
# Activation Functions Comprehensive Demonstration
import matplotlib.pyplot as plt
import numpy as np

print("⚡ Activation Functions in PyTorch:")

# Sample input for demonstration
x = torch.linspace(-3, 3, 100)

# Common activation functions
activations = {
    'ReLU': nn.ReLU(),
    'LeakyReLU': nn.LeakyReLU(0.1),  # negative_slope=0.1
    'Tanh': nn.Tanh(),
    'Sigmoid': nn.Sigmoid(),
    'GELU': nn.GELU(),  # Popular in transformers
    'Swish/SiLU': nn.SiLU(),  # Self-gated activation
}

print(f"\n🎯 Activation Function Properties:")
sample_input = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

for name, activation in activations.items():
    output = activation(sample_input)
    print(f"{name:12}: {output.tolist()}")

# Functional vs Module forms
print(f"\n🔧 Functional vs Module Forms:")
sample_tensor = torch.tensor([[-1.0, 0.0, 1.0]])

# Module form (for use in nn.Sequential or as class attributes)
relu_module = nn.ReLU()
module_output = relu_module(sample_tensor)
print(f"Module form (nn.ReLU):     {module_output.tolist()}")

# Functional form (for direct application)
functional_output = F.relu(sample_tensor)
print(f"Functional form (F.relu):  {functional_output.tolist()}")

# Softmax for classification
logits = torch.tensor([[2.0, 1.0, 0.1]])
softmax = F.softmax(logits, dim=1)
print(f"\nSoftmax probabilities: {softmax.tolist()[0]}")
print(f"Sum of probabilities: {softmax.sum().item():.4f}")

⚡ Activation Functions in PyTorch:

🎯 Activation Function Properties:
ReLU        : [0.0, 0.0, 0.0, 1.0, 2.0]
LeakyReLU   : [-0.20000000298023224, -0.10000000149011612, 0.0, 1.0, 2.0]
Tanh        : [-0.9640275835990906, -0.7615941762924194, 0.0, 0.7615941762924194, 0.9640275835990906]
Sigmoid     : [0.11920291930437088, 0.2689414322376251, 0.5, 0.7310585975646973, 0.8807970285415649]
GELU        : [-0.045500099658966064, -0.1586552858352661, 0.0, 0.8413447141647339, 1.9544999599456787]
Swish/SiLU  : [-0.23840583860874176, -0.2689414322376251, 0.0, 0.7310585975646973, 1.7615940570831299]

🔧 Functional vs Module Forms:
Module form (nn.ReLU):     [[0.0, 0.0, 1.0]]
Functional form (F.relu):  [[0.0, 0.0, 1.0]]

Softmax probabilities: [0.6590011715888977, 0.24243298172950745, 0.09856589138507843]
Sum of probabilities: 1.0000


In [38]:
# Practical Example: Australian Tourism Content Classifier with Different Activations
class TourismClassifierWithActivations(nn.Module):
    """
    Tourism classifier demonstrating different activation functions.

    Classifies Australian tourism content into categories:
    - Nature (beaches, national parks)
    - Culture (museums, galleries)
    - Food (restaurants, wineries)
    - Adventure (sports, activities)
    """
    def __init__(self, input_size, hidden_size, num_classes, activation_type='relu'):
        super().__init__()

        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)
        self.fc3 = nn.Linear(hidden_size // 2, num_classes)

        # Choose activation function
        self.activation_type = activation_type
        if activation_type == 'relu':
            self.activation = nn.ReLU()
        elif activation_type == 'leaky_relu':
            self.activation = nn.LeakyReLU(0.1)
        elif activation_type == 'gelu':
            self.activation = nn.GELU()
        elif activation_type == 'swish':
            self.activation = nn.SiLU()
        else:
            self.activation = nn.ReLU()  # Default

        self.dropout = nn.Dropout(0.2)

        self.categories = ["Nature", "Culture", "Food", "Adventure"]

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.dropout(x)
        x = self.activation(self.fc2(x))
        x = self.fc3(x)  # No activation on final layer for classification
        return x

    def predict_probabilities(self, x):
        """Get probability predictions using softmax."""
        with torch.no_grad():
            logits = self.forward(x)
            probabilities = F.softmax(logits, dim=1)
        return probabilities

# Compare different activation functions
activation_types = ['relu', 'leaky_relu', 'gelu', 'swish']
models = {}

print(f"\n🏛️ Tourism Classifiers with Different Activations:")

for activation in activation_types:
    model = TourismClassifierWithActivations(50, 32, 4, activation).to(DEVICE)
    models[activation] = model

    # Test with sample input
    sample_input = torch.randn(1, 50).to(DEVICE)
    probabilities = model.predict_probabilities(sample_input)

    print(f"\n{activation.upper()} Activation:")
    for category, prob in zip(model.categories, probabilities[0]):
        print(f"  {category}: {prob.item():.3f}")

print(f"\n📋 Activation Function Guidelines:")
print(f"  • ReLU: Most common, simple and effective")
print(f"  • LeakyReLU: Solves 'dying ReLU' problem")
print(f"  • GELU: Popular in transformers (BERT, GPT)")
print(f"  • Swish/SiLU: Self-gated, smooth activation")
print(f"  • Tanh: Centered around 0, good for RNNs")
print(f"  • Sigmoid: For binary classification output")


🏛️ Tourism Classifiers with Different Activations:

RELU Activation:
  Nature: 0.216
  Culture: 0.241
  Food: 0.265
  Adventure: 0.278

LEAKY_RELU Activation:
  Nature: 0.343
  Culture: 0.157
  Food: 0.266
  Adventure: 0.234

GELU Activation:
  Nature: 0.273
  Culture: 0.236
  Food: 0.239
  Adventure: 0.252

SWISH Activation:
  Nature: 0.239
  Culture: 0.264
  Food: 0.278
  Adventure: 0.219

📋 Activation Function Guidelines:
  • ReLU: Most common, simple and effective
  • LeakyReLU: Solves 'dying ReLU' problem
  • GELU: Popular in transformers (BERT, GPT)
  • Swish/SiLU: Self-gated, smooth activation
  • Tanh: Centered around 0, good for RNNs
  • Sigmoid: For binary classification output


### 5.5 Loss Functions

**Loss functions** (also known as criteria) measure the error between a model's prediction and the actual target. Choosing the appropriate loss function is crucial for effective training.
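
To show what "measuring the error" amounts to numerically, the sketch below recomputes two common criteria by hand. The tensors are made-up illustration data. Both losses use the default `reduction='mean'`, so they average over the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Cross-entropy: negative log-probability of the correct class, averaged over the batch
logits = torch.randn(4, 5)
targets = torch.tensor([1, 4, 1, 4])
ce = nn.CrossEntropyLoss()(logits, targets)
log_probs = F.log_softmax(logits, dim=1)
manual_ce = -log_probs[torch.arange(4), targets].mean()
print(torch.allclose(ce, manual_ce))

# Mean squared error: squared differences, averaged over all elements
pred = torch.tensor([[2.0], [4.0], [5.0]])
true = torch.tensor([[3.0], [4.0], [3.0]])
mse = nn.MSELoss()(pred, true)
manual_mse = ((pred - true) ** 2).mean()  # (1 + 0 + 4) / 3
print(f"{mse.item():.4f}")  # 1.6667
```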

In [39]:
# Loss Functions Comprehensive Demonstration
print("🎯 Loss Functions in PyTorch:")

# Sample data for demonstrations
batch_size = 4
num_classes = 5

# Classification data
predictions = torch.randn(batch_size, num_classes).to(DEVICE)  # Raw logits
class_targets = torch.randint(0, num_classes, (batch_size,)).to(DEVICE)  # Class indices

# Regression data
regression_pred = torch.randn(batch_size, 1).to(DEVICE)
regression_target = torch.randn(batch_size, 1).to(DEVICE)

print(f"\n📊 Sample Data:")
print(f"  Predictions shape: {predictions.shape}")
print(f"  Class targets: {class_targets.tolist()}")
print(f"  Regression pred: {regression_pred.shape}")

# 1. CrossEntropyLoss - Most common for multi-class classification
ce_loss = nn.CrossEntropyLoss()
ce_value = ce_loss(predictions, class_targets)
print(f"\n🏷️ CrossEntropyLoss: {ce_value.item():.4f}")
print(f"   Use case: Multi-class classification (Australian city classification)")
print(f"   Combines LogSoftmax + NLLLoss")

# 2. MSELoss - Mean Squared Error for regression
mse_loss = nn.MSELoss()
mse_value = mse_loss(regression_pred, regression_target)
print(f"\n📈 MSELoss: {mse_value.item():.4f}")
print(f"   Use case: Regression (tourism rating prediction)")
print(f"   Formula: mean((y_pred - y_true)²)")

# 3. BCELoss - Binary Cross Entropy for binary classification
binary_pred = torch.sigmoid(torch.randn(batch_size, 1)).to(DEVICE)  # Must be 0-1
binary_target = torch.randint(0, 2, (batch_size, 1)).float().to(DEVICE)
bce_loss = nn.BCELoss()
bce_value = bce_loss(binary_pred, binary_target)
print(f"\n🔢 BCELoss: {bce_value.item():.4f}")
print(f"   Use case: Binary classification (tourism recommendation: yes/no)")

# 4. NLLLoss - Negative Log Likelihood (often used with log_softmax)
log_probs = F.log_softmax(predictions, dim=1)
nll_loss = nn.NLLLoss()
nll_value = nll_loss(log_probs, class_targets)
print(f"\n📝 NLLLoss: {nll_value.item():.4f}")
print(f"   Use case: With log_softmax for classification")

# 5. Huber Loss - Robust regression loss
huber_loss = nn.HuberLoss(delta=1.0)
huber_value = huber_loss(regression_pred, regression_target)
print(f"\n🛡️ HuberLoss: {huber_value.item():.4f}")
print(f"   Use case: Robust regression (less sensitive to outliers)")

🎯 Loss Functions in PyTorch:

📊 Sample Data:
  Predictions shape: torch.Size([4, 5])
  Class targets: [1, 4, 1, 4]
  Regression pred: torch.Size([4, 1])

🏷️ CrossEntropyLoss: 1.9940
   Use case: Multi-class classification (Australian city classification)
   Combines LogSoftmax + NLLLoss

📈 MSELoss: 0.7074
   Use case: Regression (tourism rating prediction)
   Formula: mean((y_pred - y_true)²)

🔢 BCELoss: 0.6444
   Use case: Binary classification (tourism recommendation: yes/no)

📝 NLLLoss: 1.9940
   Use case: With log_softmax for classification

🛡️ HuberLoss: 0.3440
   Use case: Robust regression (less sensitive to outliers)
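
`nn.BCELoss` expects probabilities, so it is typically paired with a sigmoid. PyTorch also provides `nn.BCEWithLogitsLoss`, which takes raw logits and fuses the two steps. The sketch below, on made-up data, shows that the two agree for ordinary inputs and diverge once the sigmoid saturates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 1)
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# For moderate logits, both formulations give the same value
bce = nn.BCELoss()(torch.sigmoid(logits), targets)
bce_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
print(torch.allclose(bce, bce_with_logits, atol=1e-6))

# A confidently wrong prediction: sigmoid(200) rounds to exactly 1.0 in float32
extreme_logit = torch.tensor([[200.0]])
wrong_target = torch.tensor([[0.0]])

from_logits = nn.BCEWithLogitsLoss()(extreme_logit, wrong_target)
from_probs = nn.BCELoss()(torch.sigmoid(extreme_logit), wrong_target)

# The logits version keeps the true magnitude (about 200),
# while BCELoss clamps its log term and reports 100
print(from_logits.item(), from_probs.item())
```

For a binary head, returning raw logits from the model and applying `nn.BCEWithLogitsLoss` is therefore a common alternative to sigmoid plus `nn.BCELoss`.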


In [40]:
# Practical Example: Australian Tourism Multi-task Model
class AustralianTourismMultiTask(nn.Module):
    """
    Multi-task model for Australian tourism analysis.

    Tasks:
    1. City classification (multi-class): Which Australian city?
    2. Rating prediction (regression): 1-5 star rating
    3. Recommendation (binary): Recommend or not?

    Demonstrates different loss functions for different tasks.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()

        # Shared feature extractor
        self.shared_layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(0.2)
        )

        # Task-specific heads
        self.city_classifier = nn.Linear(hidden_size // 2, 8)  # 8 Australian cities
        self.rating_predictor = nn.Linear(hidden_size // 2, 1)  # Regression output
        self.recommender = nn.Linear(hidden_size // 2, 1)      # Binary classification

        # Australian cities
        self.cities = ["Sydney", "Melbourne", "Brisbane", "Perth",
                      "Adelaide", "Darwin", "Hobart", "Canberra"]

    def forward(self, x):
        # Shared feature extraction
        features = self.shared_layers(x)

        # Task-specific outputs
        city_logits = self.city_classifier(features)
        rating = 1 + 4 * torch.sigmoid(self.rating_predictor(features))  # Scale to (1, 5)
        recommendation = torch.sigmoid(self.recommender(features))   # 0-1 probability

        return city_logits, rating, recommendation

# Create multi-task model
multitask_model = AustralianTourismMultiTask(100, 64).to(DEVICE)

print(f"\n🏢 Australian Tourism Multi-task Model:")
print(f"  Cities: {multitask_model.cities}")
print(f"  Tasks: City classification, Rating prediction, Recommendation")

# Define appropriate loss functions for each task
city_loss_fn = nn.CrossEntropyLoss()        # Multi-class classification
rating_loss_fn = nn.MSELoss()               # Regression
recommendation_loss_fn = nn.BCELoss()       # Binary classification

print(f"\n🎯 Loss Functions for Each Task:")
print(f"  City classification: CrossEntropyLoss")
print(f"  Rating prediction: MSELoss")
print(f"  Recommendation: BCELoss")

# Sample training step
sample_input = torch.randn(4, 100).to(DEVICE)
sample_city_labels = torch.randint(0, 8, (4,)).to(DEVICE)
sample_ratings = 1 + 4 * torch.rand(4, 1).to(DEVICE)  # 1-5 scale
sample_recommendations = torch.randint(0, 2, (4, 1)).float().to(DEVICE)

# Forward pass
city_pred, rating_pred, rec_pred = multitask_model(sample_input)

# Calculate losses
city_loss = city_loss_fn(city_pred, sample_city_labels)
rating_loss = rating_loss_fn(rating_pred, sample_ratings)
rec_loss = recommendation_loss_fn(rec_pred, sample_recommendations)

# Combined loss (plain sum; every task currently has weight 1)
total_loss = city_loss + rating_loss + rec_loss

print(f"\n📊 Sample Loss Values:")
print(f"  City classification loss: {city_loss.item():.4f}")
print(f"  Rating prediction loss: {rating_loss.item():.4f}")
print(f"  Recommendation loss: {rec_loss.item():.4f}")
print(f"  Total loss: {total_loss.item():.4f}")

print(f"\n💡 Loss Function Selection Guidelines:")
print(f"  📋 Classification:")
print(f"    • CrossEntropyLoss: Multi-class (city, sentiment, category)")
print(f"    • BCELoss: Binary (recommend/not, positive/negative)")
print(f"  📈 Regression:")
print(f"    • MSELoss: Standard regression (ratings, prices)")
print(f"    • HuberLoss: Robust to outliers")
print(f"    • L1Loss (MAE): Less sensitive to outliers than MSE")
print(f"  🔧 Special Cases:")
print(f"    • NLLLoss: When using log_softmax manually")
print(f"    • Focal Loss: For imbalanced classification (custom)")
print(f"    • Triplet Loss: For similarity learning (nn.TripletMarginLoss)")


🏢 Australian Tourism Multi-task Model:
  Cities: ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide', 'Darwin', 'Hobart', 'Canberra']
  Tasks: City classification, Rating prediction, Recommendation

🎯 Loss Functions for Each Task:
  City classification: CrossEntropyLoss
  Rating prediction: MSELoss
  Recommendation: BCELoss

📊 Sample Loss Values:
  City classification loss: 2.2510
  Rating prediction loss: 2.0492
  Recommendation loss: 0.6879
  Total loss: 4.9880

💡 Loss Function Selection Guidelines:
  📋 Classification:
    • CrossEntropyLoss: Multi-class (city, sentiment, category)
    • BCELoss: Binary (recommend/not, positive/negative)
  📈 Regression:
    • MSELoss: Standard regression (ratings, prices)
    • HuberLoss: Robust to outliers
    • L1Loss (MAE): Less sensitive to outliers than MSE
  🔧 Special Cases:
    • NLLLoss: When using log_softmax manually
    • Focal Loss: For imbalanced classification (custom)
    • Triplet Loss: For similarity learning (nn.TripletMarginLoss)
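
The three task losses above live on different scales, so a plain sum lets the largest one dominate. One common option is a weighted sum. The sketch below reuses the sample loss values printed above; the weights in `task_weights` are hypothetical and would normally be tuned per project.

```python
import torch

# Hypothetical per-task weights (illustrative values, not from the notebook)
task_weights = {"city": 1.0, "rating": 0.5, "recommendation": 2.0}

# Sample loss values taken from the printed output above; in a real
# training step these would be the tensors returned by the loss functions
losses = {
    "city": torch.tensor(2.2510),
    "rating": torch.tensor(2.0492),
    "recommendation": torch.tensor(0.6879),
}

weighted_total = sum(task_weights[name] * losses[name] for name in losses)
print(f"Weighted total loss: {weighted_total.item():.4f}")  # 4.6514
```

Calling `backward()` on such a weighted total propagates gradients from all tasks through the shared layers in proportion to their weights.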


## 6. Comprehensive Example: Complete Australian Tourism Analysis Model

Let's put everything together in a comprehensive model that demonstrates all the concepts we've learned, with TensorBoard logging and proper device handling.

In [41]:
# Complete Australian Tourism Analysis System
import time
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader, TensorDataset

class ComprehensiveAustralianTourismModel(nn.Module):
    """
    Comprehensive model combining all PyTorch building blocks:
    - Embedding layers for text processing
    - Convolutional layers for image features
    - LSTM for sequential analysis
    - Linear layers for classification
    - Proper normalization, dropout, and activations

    Multi-modal analysis of Australian tourism content:
    - Text: Reviews in English and Vietnamese
    - Images: Tourism photos
    - Sequential: Time-series visitor data

    A TensorFlow equivalent could be built with the Keras functional API or `tf.keras.Model` subclassing.
    """
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256):
        super().__init__()

        # Text processing branch (multilingual)
        self.text_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim//2, batch_first=True, bidirectional=True)
        self.text_dropout = nn.Dropout(0.3)

        # Image processing branch
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.batch_norm1 = nn.BatchNorm2d(32)
        self.batch_norm2 = nn.BatchNorm2d(64)
        self.batch_norm3 = nn.BatchNorm2d(128)
        self.image_dropout = nn.Dropout(0.25)

        # Feature fusion
        self.fusion_layer = nn.Linear(hidden_dim + 128, 256)
        self.fusion_norm = nn.LayerNorm(256)
        self.fusion_dropout = nn.Dropout(0.4)

        # Multi-task outputs
        self.city_classifier = nn.Linear(256, 8)      # Australian cities
        self.sentiment_classifier = nn.Linear(256, 3) # Sentiment
        self.rating_predictor = nn.Linear(256, 1)     # Rating prediction

        # Activation functions
        self.relu = nn.ReLU()
        self.gelu = nn.GELU()  # For text features

        # Model metadata
        self.cities = ["Sydney", "Melbourne", "Brisbane", "Perth",
                      "Adelaide", "Darwin", "Hobart", "Canberra"]
        self.sentiments = ["negative", "neutral", "positive"]

        # Multilingual support
        self.languages = ["en", "vi"]
        self.sample_texts = {
            "en": ["Sydney Opera House is stunning", "Melbourne coffee is amazing"],
            "vi": ["Nhà hát Opera Sydney tuyệt đẹp", "Cà phê Melbourne tuyệt vời"]
        }

    def forward(self, text_input, image_input):
        """
        Multi-modal forward pass.

        Args:
            text_input: [batch_size, seq_len] - tokenized text
            image_input: [batch_size, 3, H, W] - tourism images

        Returns:
            Tuple of (city_logits, sentiment_logits, rating, fused_features)
        """

        # Text processing pipeline
        text_embedded = self.text_embedding(text_input)
        text_lstm_out, (text_hidden, _) = self.text_lstm(text_embedded)
        # Concatenate bidirectional hidden states
        text_features = torch.cat([text_hidden[-2], text_hidden[-1]], dim=1)
        text_features = self.text_dropout(self.gelu(text_features))

        # Image processing pipeline
        x = self.pool(self.relu(self.batch_norm1(self.conv1(image_input))))
        x = self.pool(self.relu(self.batch_norm2(self.conv2(x))))
        x = self.pool(self.relu(self.batch_norm3(self.conv3(x))))

        # Global average pooling for images
        image_features = F.adaptive_avg_pool2d(x, (1, 1)).squeeze(-1).squeeze(-1)
        image_features = self.image_dropout(image_features)

        # Feature fusion
        combined_features = torch.cat([text_features, image_features], dim=1)
        fused_features = self.fusion_layer(combined_features)
        fused_features = self.fusion_dropout(self.relu(self.fusion_norm(fused_features)))

        # Multi-task predictions
        city_logits = self.city_classifier(fused_features)
        sentiment_logits = self.sentiment_classifier(fused_features)
        rating = torch.sigmoid(self.rating_predictor(fused_features)) * 5  # sigmoid * 5 gives a 0-5 range

        # Return as a tuple for tracing compatibility
        return city_logits, sentiment_logits, rating, fused_features

    def get_model_info(self):
        """Get comprehensive model information."""
        total_params = sum(p.numel() for p in self.parameters())
        trainable_params = sum(p.numel() for p in self.parameters() if p.requires_grad)

        # Parameter breakdown by component
        text_params = sum(p.numel() for name, p in self.named_parameters()
                         if any(comp in name for comp in ['text_', 'embedding']))
        image_params = sum(p.numel() for name, p in self.named_parameters()
                          if any(comp in name for comp in ['conv', 'batch_norm']))
        fusion_params = sum(p.numel() for name, p in self.named_parameters()
                           if 'fusion' in name)
        task_params = sum(p.numel() for name, p in self.named_parameters()
                         if any(comp in name for comp in ['classifier', 'predictor']))

        return {
            'total_parameters': total_params,
            'trainable_parameters': trainable_params,
            'text_parameters': text_params,
            'image_parameters': image_params,
            'fusion_parameters': fusion_params,
            'task_parameters': task_params,
            'cities': self.cities,
            'sentiments': self.sentiments,
            'supported_languages': self.languages
        }

# Create comprehensive model
comprehensive_model = ComprehensiveAustralianTourismModel().to(DEVICE)

print("🌟 Comprehensive Australian Tourism Analysis Model:")
model_info = comprehensive_model.get_model_info()
print(f"  Total parameters: {model_info['total_parameters']:,}")
print(f"  Text processing: {model_info['text_parameters']:,} params")
print(f"  Image processing: {model_info['image_parameters']:,} params")
print(f"  Feature fusion: {model_info['fusion_parameters']:,} params")
print(f"  Task heads: {model_info['task_parameters']:,} params")
print(f"  Cities: {model_info['cities']}")
print(f"  Languages: {model_info['supported_languages']}")

🌟 Comprehensive Australian Tourism Analysis Model:
  Total parameters: 1,100,044
  Text processing: 904,192 params
  Image processing: 93,696 params
  Feature fusion: 99,072 params
  Task heads: 3,084 params
  Cities: ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide', 'Darwin', 'Hobart', 'Canberra']
  Languages: ['en', 'vi']
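
The parameter counts printed above can be reproduced by hand from the layer shapes. The sketch below uses plain Python arithmetic only. The helper names (`linear_params`, `conv2d_params`, `lstm_params`) are illustrative and are not part of PyTorch. It assumes the constructor defaults (`vocab_size=5000`, `embed_dim=128`, `hidden_dim=256`) and PyTorch's LSTM parameterization of four gates with separate input and hidden biases per direction.

```python
def linear_params(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features

def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus bias of a square-kernel 2D convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch

def lstm_params(input_size, hidden_size, bidirectional=True):
    """Single-layer LSTM: 4 gates, each with weight_ih, weight_hh, bias_ih, bias_hh."""
    per_direction = 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size
    return per_direction * (2 if bidirectional else 1)

vocab_size, embed_dim, hidden_dim = 5000, 128, 256

# Text branch: embedding table plus bidirectional LSTM
text = vocab_size * embed_dim + lstm_params(embed_dim, hidden_dim // 2)

# Image branch: three convolutions plus BatchNorm2d (weight and bias per channel)
image = (conv2d_params(3, 32, 3) + conv2d_params(32, 64, 3)
         + conv2d_params(64, 128, 3) + 2 * (32 + 64 + 128))

# Fusion: Linear layer plus LayerNorm (weight and bias per feature)
fusion = linear_params(hidden_dim + 128, 256) + 2 * 256

# Task heads: city, sentiment, rating
heads = linear_params(256, 8) + linear_params(256, 3) + linear_params(256, 1)

print(text, image, fusion, heads, text + image + fusion + heads)
# 904192 93696 99072 3084 1100044
```

The embedding table alone accounts for 640,000 of the 904,192 text parameters, which is why the text branch dominates the total.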

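The tensor shapes inside `forward` are easy to lose track of, so here is a minimal shape trace of the two branches. It is a sketch with freshly initialized layers rather than the trained model, and it omits batch normalization and dropout because neither changes tensor shapes. The batch size, sequence length and image size are the ones used in the pipeline test.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, seq_len, embed_dim, hidden_dim = 4, 50, 128, 256

# Text branch: a bidirectional LSTM returns h_n with shape
# [num_layers * 2, batch, hidden_size]; the last two entries are the
# forward and backward final states of the top layer.
lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
embedded = torch.randn(batch_size, seq_len, embed_dim)
_, (h_n, _) = lstm(embedded)
text_features = torch.cat([h_n[-2], h_n[-1]], dim=1)

# Image branch: a 3x3 kernel with padding=1 keeps H and W unchanged,
# and each MaxPool2d(2) halves them (64 -> 32 -> 16 -> 8).
x = torch.randn(batch_size, 3, 64, 64)
pool = nn.MaxPool2d(2)
for in_ch, out_ch in [(3, 32), (32, 64), (64, 128)]:
    x = pool(F.relu(nn.Conv2d(in_ch, out_ch, 3, padding=1)(x)))

# Global average pooling collapses the 8x8 map to one value per channel
image_features = F.adaptive_avg_pool2d(x, (1, 1)).flatten(1)

combined = torch.cat([text_features, image_features], dim=1)
print(h_n.shape, text_features.shape, x.shape, image_features.shape, combined.shape)
```

The concatenated width is `hidden_dim + 128 = 384`, which is the input size declared for `fusion_layer`.
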

In [42]:
# Training Setup with TensorBoard Logging
def setup_pytorch_training():
    """Setup comprehensive training with TensorBoard logging."""

    # TensorBoard setup: Colab uses an absolute path; Kaggle and local runs use the working directory
    root_logdir = "/content/tensorboard_logs" if IS_COLAB else "./tensorboard_logs"

    # Create timestamped run directory
    timestamp = time.strftime("%Y_%m_%d-%H_%M_%S")
    run_logdir = f"{root_logdir}/comprehensive_tourism_model_{timestamp}"

    # Initialize TensorBoard writer
    writer = SummaryWriter(log_dir=run_logdir)

    # Define loss functions for multi-task learning
    loss_functions = {
        'city_loss': nn.CrossEntropyLoss(),
        'sentiment_loss': nn.CrossEntropyLoss(),
        'rating_loss': nn.MSELoss()
    }

    # Optimizer with different learning rates for different components
    optimizer = torch.optim.Adam([
        {'params': comprehensive_model.text_embedding.parameters(), 'lr': 1e-4},
        {'params': comprehensive_model.text_lstm.parameters(), 'lr': 1e-4},
        {'params': [p for name, p in comprehensive_model.named_parameters()
                   if 'conv' in name or 'batch_norm' in name], 'lr': 1e-3},
        {'params': [p for name, p in comprehensive_model.named_parameters()
                   if 'fusion' in name], 'lr': 1e-3},
        {'params': [p for name, p in comprehensive_model.named_parameters()
                   if 'classifier' in name or 'predictor' in name], 'lr': 1e-3}
    ])

    return writer, loss_functions, optimizer, run_logdir

# Setup training components
writer, loss_functions, optimizer, run_logdir = setup_pytorch_training()

print(f"\n📊 Training Setup Complete:")
print(f"  TensorBoard logs: {run_logdir}")
print(f"  Loss functions: {list(loss_functions.keys())}")
print(f"  Optimizer: Adam with component-specific learning rates")

# Create sample data for testing
def create_sample_data(batch_size=8):
    """Create sample multi-modal data."""
    # Sample text data (tokenized)
    text_data = torch.randint(1, 1000, (batch_size, 50)).to(DEVICE)

    # Sample image data (tourism photos)
    image_data = torch.randn(batch_size, 3, 64, 64).to(DEVICE)

    # Sample targets
    city_targets = torch.randint(0, 8, (batch_size,)).to(DEVICE)
    sentiment_targets = torch.randint(0, 3, (batch_size,)).to(DEVICE)
    rating_targets = torch.rand(batch_size, 1).to(DEVICE) * 5

    return {
        'text': text_data,
        'images': image_data,
        'city_labels': city_targets,
        'sentiment_labels': sentiment_targets,
        'rating_labels': rating_targets
    }

# Test the complete pipeline
print(f"\n🧪 Testing Complete Pipeline:")
sample_data = create_sample_data(4)

# Forward pass
comprehensive_model.train()
city_pred, sentiment_pred, rating_pred, _ = comprehensive_model(sample_data['text'], sample_data['images'])

# Calculate losses
city_loss = loss_functions['city_loss'](city_pred, sample_data['city_labels'])
sentiment_loss = loss_functions['sentiment_loss'](sentiment_pred, sample_data['sentiment_labels'])
rating_loss = loss_functions['rating_loss'](rating_pred, sample_data['rating_labels'])

# Combined loss with weights
total_loss = city_loss + sentiment_loss + 0.5 * rating_loss

print(f"  Input shapes:")
print(f"    Text: {sample_data['text'].shape}")
print(f"    Images: {sample_data['images'].shape}")
print(f"  Output shapes:")
print(f"    City logits: {city_pred.shape}")
print(f"    Sentiment logits: {sentiment_pred.shape}")
print(f"    Ratings: {rating_pred.shape}")
print(f"  Loss values:")
print(f"    City: {city_loss.item():.4f}")
print(f"    Sentiment: {sentiment_loss.item():.4f}")
print(f"    Rating: {rating_loss.item():.4f}")
print(f"    Total: {total_loss.item():.4f}")

# Log to TensorBoard
writer.add_scalar('Loss/City_Classification', city_loss.item(), 0)
writer.add_scalar('Loss/Sentiment_Analysis', sentiment_loss.item(), 0)
writer.add_scalar('Loss/Rating_Prediction', rating_loss.item(), 0)
writer.add_scalar('Loss/Total', total_loss.item(), 0)

# Log model graph (trace in eval mode so dropout and batch norm behave deterministically)
comprehensive_model.eval()
writer.add_graph(comprehensive_model, (sample_data['text'][:1], sample_data['images'][:1]))

# Log sample predictions
with torch.no_grad():
    city_eval_pred, sentiment_eval_pred, rating_eval_pred, _ = comprehensive_model(sample_data['text'][:1], sample_data['images'][:1])

    city_probs = F.softmax(city_eval_pred, dim=1)
    sentiment_probs = F.softmax(sentiment_eval_pred, dim=1)

    print(f"\n🎯 Sample Predictions:")
    print(f"  Predicted city: {comprehensive_model.cities[torch.argmax(city_probs).item()]}")
    print(f"  Predicted sentiment: {comprehensive_model.sentiments[torch.argmax(sentiment_probs).item()]}")
    print(f"  Predicted rating: {rating_eval_pred.item():.2f} stars")

writer.close()

print(f"\n📊 TensorBoard Logging Complete!")
print(f"💡 To view logs, run: tensorboard --logdir {run_logdir}")


📊 Training Setup Complete:
  TensorBoard logs: /content/tensorboard_logs/comprehensive_tourism_model_2025_09_21-12_55_37
  Loss functions: ['city_loss', 'sentiment_loss', 'rating_loss']
  Optimizer: Adam with component-specific learning rates

🧪 Testing Complete Pipeline:
  Input shapes:
    Text: torch.Size([4, 50])
    Images: torch.Size([4, 3, 64, 64])
  Output shapes:
    City logits: torch.Size([4, 8])
    Sentiment logits: torch.Size([4, 3])
    Ratings: torch.Size([4, 1])
  Loss values:
    City: 1.9519
    Sentiment: 1.0064
    Rating: 3.8730
    Total: 4.8948

🎯 Sample Predictions:
  Predicted city: Canberra
  Predicted sentiment: positive
  Predicted rating: 3.23 stars

📊 TensorBoard Logging Complete!
💡 To view logs, run: tensorboard --logdir /content/tensorboard_logs/comprehensive_tourism_model_2025_09_21-12_55_37
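
The cell above computes the combined loss but stops before any weights are updated. Below is a minimal sketch of one optimization step. To stay self-contained it uses a small hypothetical stand-in model, `TinyMultiTask`, with the same three heads instead of the full model. The `zero_grad()`, `backward()`, `step()` sequence and the per-group learning rates follow the same pattern as the optimizer setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: one shared layer feeding three task heads."""
    def __init__(self, in_dim=16, hidden=32):
        super().__init__()
        self.shared = nn.Linear(in_dim, hidden)
        self.city_classifier = nn.Linear(hidden, 8)
        self.sentiment_classifier = nn.Linear(hidden, 3)
        self.rating_predictor = nn.Linear(hidden, 1)

    def forward(self, x):
        h = torch.relu(self.shared(x))
        rating = torch.sigmoid(self.rating_predictor(h)) * 5  # 0-5 range
        return self.city_classifier(h), self.sentiment_classifier(h), rating

model = TinyMultiTask()

# Per-group learning rates: slower for the shared layer, faster for the heads
optimizer = torch.optim.Adam([
    {'params': model.shared.parameters(), 'lr': 1e-4},
    {'params': [p for n, p in model.named_parameters()
                if 'classifier' in n or 'predictor' in n], 'lr': 1e-3},
])
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Random stand-in batch
x = torch.randn(4, 16)
city_y = torch.randint(0, 8, (4,))
sent_y = torch.randint(0, 3, (4,))
rating_y = torch.rand(4, 1) * 5

model.train()
before = model.shared.weight.detach().clone()

optimizer.zero_grad()   # clear gradients left over from any previous step
city_logits, sent_logits, rating = model(x)
loss = ce(city_logits, city_y) + ce(sent_logits, sent_y) + 0.5 * mse(rating, rating_y)
loss.backward()         # backpropagate the weighted, combined loss
optimizer.step()        # update every parameter group

weights_changed = not torch.equal(before, model.shared.weight.detach())
print(f"loss={loss.item():.4f}, shared weights updated: {weights_changed}")
```

In a real run this step would sit inside a loop over a `DataLoader`, with `writer.add_scalar` called at each step instead of only at step 0.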


## 7. Summary and Key Takeaways

🎉 **Congratulations!** You've mastered the fundamentals of building models with PyTorch!

In [43]:
# Summary of PyTorch Model Building Concepts
print("🎓 PyTorch Model Building: Complete Learning Summary")
print("=" * 60)

print("\n🏗️ CORE CLASSES:")
print("  📦 nn.Module: Base class for all models and layers")
print("    • Must implement __init__ and forward methods")
print("    • Automatic parameter registration")
print("    • Built-in training/evaluation modes")
print("  ⚙️ nn.Parameter: Learnable tensors with requires_grad=True")
print("    • Automatic gradient tracking")
print("    • Discoverable by optimizers")

print("\n🧱 LAYER TYPES:")
print("  🔗 Linear Layers (nn.Linear):")
print("    • Fully connected layers")
print("    • Every input influences every output")
print("    • Common in classification heads")
print("  🖼️ Convolutional Layers (nn.Conv2d):")
print("    • For spatial data (images)")
print("    • Local feature detection")
print("    • Translation invariance")
print("  🔄 Recurrent Layers (nn.LSTM, nn.GRU):")
print("    • For sequential data (text, time series)")
print("    • Hidden state memory")
print("    • Handle variable length sequences")
print("  🤖 Transformers (nn.Transformer):")
print("    • Self-attention mechanism")
print("    • Parallel processing")
print("    • State-of-the-art for NLP")

print("\n🛠️ ESSENTIAL FUNCTIONS:")
print("  🏊 Pooling (nn.MaxPool2d, nn.AvgPool2d):")
print("    • Dimensionality reduction")
print("    • Robust feature detection")
print("  📊 Normalization (nn.BatchNorm2d, nn.LayerNorm):")
print("    • Stable training")
print("    • Faster convergence")
print("  🎭 Dropout (nn.Dropout):")
print("    • Regularization")
print("    • Prevents overfitting")
print("  ⚡ Activations (ReLU, GELU, Sigmoid):")
print("    • Non-linearity")
print("    • Enable complex representations")
print("  🎯 Loss Functions (CrossEntropy, MSE, BCE):")
print("    • Task-specific error measurement")
print("    • Guide training optimization")

print("\n🌏 MULTILINGUAL & AUSTRALIAN CONTEXT:")
australian_examples = [
    "Australian city classification (Sydney, Melbourne, etc.)",
    "Tourism content analysis and recommendation",
    "English-Vietnamese translation for tourism",
    "Restaurant review sentiment analysis",
    "Multi-modal tourism content processing"
]

for i, example in enumerate(australian_examples, 1):
    print(f"  {i}. {example}")

print("\n🔄 TENSORFLOW VS PYTORCH:")
comparisons = [
    ("Model Definition", "tf.keras.Sequential/Functional", "nn.Module subclass"),
    ("Training Loop", "model.fit() automatic", "Manual with optimizer steps"),
    ("Execution", "Graph/Eager modes", "Eager by default (torch.compile optional)"),
    ("Device Management", "Automatic strategies", "Explicit .to(device)"),
    ("Parameter Access", "model.trainable_variables", "model.parameters()")
]

for concept, tf_way, pytorch_way in comparisons:
    print(f"  {concept}:")
    print(f"    TensorFlow: {tf_way}")
    print(f"    PyTorch: {pytorch_way}")

print("\n🚀 NEXT STEPS IN YOUR PYTORCH JOURNEY:")
next_steps = [
    "🔧 Advanced Architectures: ResNet, DenseNet, Vision Transformers",
    "🏋️ Training Strategies: Learning rate scheduling, mixed precision",
    "🤗 Hugging Face Integration: Pre-trained transformers, fine-tuning",
    "📊 Data Loading: Custom datasets, data augmentation",
    "🎯 Model Optimization: Quantization, pruning, ONNX export",
    "🌐 Deployment: TorchServe, mobile deployment"
]

for step in next_steps:
    print(f"  {step}")

print("\n💡 KEY BEST PRACTICES:")
best_practices = [
    "Always call super().__init__() in nn.Module subclasses",
    "Use appropriate loss functions for your task type",
    "Implement proper device handling for CPU/GPU compatibility",
    "Add normalization and dropout for stable training",
    "Use TensorBoard for training visualization",
    "Follow the train()/eval() mode switching pattern",
    "Implement proper error handling and data validation"
]

for practice in best_practices:
    print(f"  ✅ {practice}")

print("\n🎉 You're now equipped to build sophisticated PyTorch models!")
print("Ready to tackle real-world deep learning challenges with confidence.")
print("\n🤗 Next recommended: Hugging Face transformers for state-of-the-art NLP!")

🎓 PyTorch Model Building: Complete Learning Summary

🏗️ CORE CLASSES:
  📦 nn.Module: Base class for all models and layers
    • Must implement __init__ and forward methods
    • Automatic parameter registration
    • Built-in training/evaluation modes
  ⚙️ nn.Parameter: Learnable tensors with requires_grad=True
    • Automatic gradient tracking
    • Discoverable by optimizers

🧱 LAYER TYPES:
  🔗 Linear Layers (nn.Linear):
    • Fully connected layers
    • Every input influences every output
    • Common in classification heads
  🖼️ Convolutional Layers (nn.Conv2d):
    • For spatial data (images)
    • Local feature detection
    • Translation invariance
  🔄 Recurrent Layers (nn.LSTM, nn.GRU):
    • For sequential data (text, time series)
    • Hidden state memory
    • Handle variable length sequences
  🤖 Transformers (nn.Transformer):
    • Self-attention mechanism
    • Parallel processing
    • State-of-the-art for NLP

🛠️ ESSENTIAL FUNCTIONS:
  🏊 Pooling (nn.MaxPool2d, nn.AvgPoo