# Notebook 09: Information Theory and Energy Landscapes

## Understanding Neural Networks Through Information and Energy

This notebook explores how **information theory** and **energy landscape** concepts reveal how neural networks process, compress, and represent information.

### Why Information Theory Matters

1. **Information Flow**: Track how information propagates through layers
2. **Compression**: Understand how networks compress inputs into representations
3. **Generalization**: Information-theoretic measures of compression have been linked to generalization performance
4. **Energy Landscapes**: Characterize optimization and learning dynamics
5. **Interpretability**: Information measures reveal what networks "know"

### What You'll Learn

1. **Mutual Information**: Measuring shared information between variables
2. **Information Plane**: Visualizing compression and prediction
3. **MINE (Mutual Information Neural Estimation)**: Scalable MI estimation
4. **Energy Landscapes**: Characterizing loss surfaces
5. **Basin Detection**: Finding attractors in representation space
6. **Entropy Production**: Measuring information processing

### References

- Kraskov, Stögbauer & Grassberger (2004): *Estimating mutual information*
- Tishby & Zaslavsky (2015): *Deep learning and the information bottleneck principle*
- Shwartz-Ziv & Tishby (2017): *Opening the black box of deep neural networks via information*
- Belghazi et al. (2018): *MINE: Mutual Information Neural Estimation*
- Saxe et al. (2019): *On the information bottleneck theory of deep learning*
- Goldfeld & Polyanskiy (2020): *The information bottleneck problem and its applications*

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

print("All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")

## Part 1: Mutual Information Estimation

### What is Mutual Information?

**Mutual Information (MI)** measures how much knowing one variable tells you about another:

```
I(X; Y) = H(Y) - H(Y|X)
        = H(X) + H(Y) - H(X, Y)
```

where:
- H(Y): Entropy of Y (uncertainty)
- H(Y|X): Conditional entropy (uncertainty remaining after observing X)
- I(X; Y): Mutual information (shared information)

**Properties**:
- I(X; Y) ≥ 0 (always non-negative)
- I(X; Y) = 0 iff X and Y are independent
- I(X; Y) = I(Y; X) (symmetric)
- I(X; X) = H(X) for discrete X (for continuous X, I(X; X) is infinite)
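
The identity and properties above can be checked exactly on small discrete distributions. The following is a minimal sketch, assuming hypothetical 2x2 joint tables; the helper names `entropy_nats` and `mutual_information` are introduced here for illustration only.

```python
import numpy as np

def entropy_nats(p):
    """Shannon entropy in nats of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal of X (rows)
    py = joint.sum(axis=0)  # marginal of Y (columns)
    return entropy_nats(px) + entropy_nats(py) - entropy_nats(joint)

# Hypothetical toy distributions over two binary variables
independent = np.outer([0.5, 0.5], [0.3, 0.7])  # p(x, y) = p(x) p(y)
identical = np.diag([0.5, 0.5])                 # Y = X
noisy = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                  # Y agrees with X 80% of the time

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
mi_noisy = mutual_information(noisy)
mi_noisy_swapped = mutual_information(noisy.T)  # roles of X and Y exchanged

print(f"independent: {mi_independent:.4f} nats")
print(f"identical:   {mi_identical:.4f} nats (H(X) = {np.log(2):.4f})")
print(f"noisy:       {mi_noisy:.4f} nats (swapped: {mi_noisy_swapped:.4f})")
```

The three cases illustrate independence (MI of zero up to rounding), the discrete identity I(X; X) = H(X), and symmetry.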

### Why MI Matters for Neural Networks

1. **Input-Output Information**: I(X; Y) measures how much Y depends on X
2. **Hidden Representations**: I(X; H) measures information preserved in hidden layer H
3. **Information Bottleneck**: Trade-off between I(X; H) (compression) and I(H; Y) (prediction)
4. **Layer-wise Analysis**: Track information flow through network

### Estimation Methods

1. **Binning**: Discretize and compute empirical distributions (simple but biased)
2. **k-NN**: Use nearest neighbors (consistent but slow)
3. **MINE**: Neural network estimator (scalable to high dimensions)
4. **Kernel Density**: Estimate densities with kernels
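
As a point of comparison for the k-NN estimator implemented next, the binning approach from the list above can be sketched in a few lines. This is an illustrative plug-in estimator for two scalar variables, assuming equal-width bins; the function name `binned_mi` and the choice of 16 bins are assumptions made for this example.

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Plug-in MI estimate (nats) from a 2-D histogram of two 1-D samples."""
    counts, _, _ = np.histogram2d(np.ravel(x), np.ravel(y), bins=bins)
    pxy = counts / counts.sum()                 # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)         # empirical marginal of x
    py = pxy.sum(axis=0, keepdims=True)         # empirical marginal of y
    mask = pxy > 0                              # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
x_demo = rng.standard_normal(5000)
y_dependent = x_demo + 0.5 * rng.standard_normal(5000)
y_independent = rng.standard_normal(5000)

mi_dep = binned_mi(x_demo, y_dependent)
mi_indep = binned_mi(x_demo, y_independent)

print(f"dependent pair:   {mi_dep:.4f} nats")
print(f"independent pair: {mi_indep:.4f} nats (positive bias from finite samples)")
```

The estimate for the independent pair is small but not zero, which illustrates the bias mentioned above; the bias grows with the number of bins and shrinks with the number of samples.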

In [None]:
class MutualInformationEstimator:
    """
    Estimate mutual information using k-nearest neighbors.
    
    Based on Kraskov et al. (2004) estimator.
    """
    
    def __init__(self, k=5):
        """
        Args:
            k: Number of nearest neighbors
        """
        self.k = k
    
    def estimate(self, X, Y):
        """
        Estimate I(X; Y) using k-NN method.
        
        Args:
            X: (n_samples, dim_x)
            Y: (n_samples, dim_y)
        
        Returns:
            mi: Estimated mutual information (nats)
        """
        n_samples = len(X)
        
        # Concatenate X and Y
        XY = np.hstack([X, Y])
        
        # Find k-nearest neighbors in joint space
        nbrs_xy = NearestNeighbors(n_neighbors=self.k+1, metric='chebyshev')
        nbrs_xy.fit(XY)
        distances_xy, _ = nbrs_xy.kneighbors(XY)
        epsilon = distances_xy[:, -1]  # Distance to k-th neighbor
        
        # Count neighbors in X and Y spaces within epsilon
        nbrs_x = NearestNeighbors(radius=1.0, metric='chebyshev')
        nbrs_x.fit(X)
        
        nbrs_y = NearestNeighbors(radius=1.0, metric='chebyshev')
        nbrs_y.fit(Y)
        
        nx = np.zeros(n_samples)
        ny = np.zeros(n_samples)
        
        from scipy.special import digamma  # ψ in the KSG formula (local import keeps the cell self-contained)
        
        for i in range(n_samples):
            # KSG counts points strictly closer than epsilon[i]; radius_neighbors is
            # inclusive, so shrink the radius by one floating-point step.
            radius = np.nextafter(epsilon[i], 0)
            # Subtract 1 to exclude the query point itself
            nx[i] = len(nbrs_x.radius_neighbors([X[i]], radius=radius, return_distance=False)[0]) - 1
            ny[i] = len(nbrs_y.radius_neighbors([Y[i]], radius=radius, return_distance=False)[0]) - 1
        
        # Kraskov (KSG) estimator, algorithm 1, with ψ the digamma function:
        # I(X;Y) ≈ ψ(k) - <ψ(nx+1) + ψ(ny+1)> + ψ(N)
        mi = (digamma(self.k) + digamma(n_samples)
              - np.mean(digamma(nx + 1) + digamma(ny + 1)))
        
        return max(mi, 0)  # Clip small negative estimates caused by finite-sample noise
    
    def estimate_conditional(self, X, Y, Z):
        """
        Estimate conditional MI: I(X; Y | Z)
        
        I(X; Y | Z) = I(X, Z; Y) - I(Z; Y)
        """
        XZ = np.hstack([X, Z])
        
        mi_xz_y = self.estimate(XZ, Y)
        mi_z_y = self.estimate(Z, Y)
        
        return max(mi_xz_y - mi_z_y, 0)

print("Mutual information estimator (k-NN) implemented!")

In [None]:
# Test MI estimator on known distributions

def test_mi_estimator():
    """
    Test on cases with known MI:
    1. Independent: I(X; Y) = 0
    2. Identical (Y = X): I(X; Y) diverges for continuous X; the estimate is capped by sample size
    3. Noisy correlation: I(X; Y) depends on noise level
    """
    mi_est = MutualInformationEstimator(k=5)
    n_samples = 1000
    
    # Case 1: Independent
    X_ind = np.random.randn(n_samples, 1)
    Y_ind = np.random.randn(n_samples, 1)
    mi_ind = mi_est.estimate(X_ind, Y_ind)
    
    # Case 2: Y = X (perfect correlation)
    X_corr = np.random.randn(n_samples, 1)
    Y_corr = X_corr.copy()
    mi_corr = mi_est.estimate(X_corr, Y_corr)
    
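    # Reference scale only: differential entropy h(X) = 0.5*log(2πe*σ²) of a unit Gaussian.
    # For continuous X, I(X; X) is infinite, so h(X) is not a target value here;
    # with Y = X the k-NN estimate grows with the sample size instead.
    h_x_theoretical = 0.5 * np.log(2 * np.pi * np.e * 1.0)  # σ² = 1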
    
    # Case 3: Y = X + noise
    X_noisy = np.random.randn(n_samples, 1)
    noise_levels = [0.1, 0.5, 1.0, 2.0]
    mi_noisy = []
    
    for noise_std in noise_levels:
        Y_noisy = X_noisy + np.random.randn(n_samples, 1) * noise_std
        mi = mi_est.estimate(X_noisy, Y_noisy)
        mi_noisy.append(mi)
    
    return {
        'independent': mi_ind,
        'perfect': mi_corr,
        'theoretical_perfect': h_x_theoretical,
        'noisy': mi_noisy,
        'noise_levels': noise_levels
    }

# Run tests
results = test_mi_estimator()

print("Mutual Information Estimation Tests:")
print("="*50)
print(f"\n1. Independent variables (should be ≈ 0):")
print(f"   Estimated: {results['independent']:.4f} nats")
print(f"\n2. Identical variables Y=X (true MI is infinite for continuous X):")
print(f"   Estimated: {results['perfect']:.4f} nats (finite only because the sample is finite)")
print(f"   Differential entropy h(X), for scale: {results['theoretical_perfect']:.4f} nats")
print(f"\n3. Noisy correlation Y=X+noise:")
for noise, mi in zip(results['noise_levels'], results['noisy']):
    print(f"   Noise std={noise:.1f}: MI = {mi:.4f} nats")

In [None]:
# Visualize MI vs noise level
fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(results['noise_levels'], results['noisy'], 'o-', linewidth=2, 
        markersize=8, label='Estimated MI')
ax.axhline(y=results['theoretical_perfect'], color='red', linestyle='--', 
           linewidth=2, label=f'Differential entropy h(X)={results["theoretical_perfect"]:.2f} (reference only)')
ax.axhline(y=0, color='gray', linestyle='--', linewidth=1, label='Independence')

ax.set_xlabel('Noise Standard Deviation')
ax.set_ylabel('Mutual Information (nats)')
ax.set_title('Mutual Information vs Noise Level\n(Y = X + Gaussian Noise)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- As noise increases, mutual information decreases")
print("- As noise -> 0 (Y = X), MI of continuous variables grows without bound")
print("- High noise: MI approaches 0 (independence)")
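
For the `Y = X + noise` test, a closed form is available as a cross-check when signal and noise are independent Gaussians: I(X; Y) = 0.5 * ln(1 + σ_x² / σ_noise²). The sketch below evaluates it at the same noise levels used in the test; the helper name `gaussian_mi` and the variable `demo_noise_levels` are introduced here for illustration and are not part of the estimator above.

```python
import numpy as np

def gaussian_mi(noise_std, signal_std=1.0):
    """Closed-form I(X; Y) in nats for Y = X + noise with independent Gaussians."""
    return 0.5 * np.log1p((signal_std / noise_std) ** 2)

demo_noise_levels = [0.1, 0.5, 1.0, 2.0]  # same levels as in the test above
analytic_mi = [gaussian_mi(s) for s in demo_noise_levels]

for s, mi_value in zip(demo_noise_levels, analytic_mi):
    print(f"noise std={s:.1f}: analytic I(X; Y) = {mi_value:.4f} nats")
```

Comparing these values with the k-NN estimates gives a direct measure of estimator error. The formula diverges as the noise standard deviation goes to zero, so no finite target exists for `Y = X` with continuous variables.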

## Part 2: MINE - Mutual Information Neural Estimation

### Scalable MI Estimation with Neural Networks

**Problem**: k-NN methods are slow and don't scale to high dimensions.

**Solution**: MINE learns a neural network to estimate MI.

### The Donsker-Varadhan Representation

Mutual information can be expressed as:

```
I(X; Y) = sup_T E_P[T(x,y)] - log E_Q[exp(T(x,y))]
```

where:
- T: Any function of (x, y) for which both expectations are finite ("statistics network")
- P: Joint distribution P(X, Y)
- Q: Product of marginals P(X)P(Y)
- sup: Supremum over all functions T
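
The bound can be verified numerically on a small discrete example, where both expectations are exact sums. This is a minimal sketch assuming a hypothetical 2x2 joint table; `dv_bound` and the critic tables are names chosen for this illustration. The supremum is attained by the log density ratio T*(x, y) = log p(x, y) / (p(x) p(y)), and any other critic gives a lower value.

```python
import numpy as np

# Hypothetical joint distribution P over two binary variables
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
# Q: product of the marginals P(X)P(Y)
q_product = np.outer(p_joint.sum(axis=1), p_joint.sum(axis=0))

def dv_bound(T):
    """Donsker-Varadhan objective E_P[T] - log E_Q[exp(T)] for a critic table T."""
    return float(np.sum(p_joint * T) - np.log(np.sum(q_product * np.exp(T))))

true_mi = float(np.sum(p_joint * np.log(p_joint / q_product)))

T_optimal = np.log(p_joint / q_product)          # log density ratio
T_zero = np.zeros_like(p_joint)                  # constant critic
rng = np.random.default_rng(0)
T_random = rng.standard_normal(p_joint.shape)    # arbitrary critic

bound_optimal = dv_bound(T_optimal)
bound_zero = dv_bound(T_zero)
bound_random = dv_bound(T_random)

print(f"true MI:        {true_mi:.4f} nats")
print(f"optimal critic: {bound_optimal:.4f} nats")
print(f"zero critic:    {bound_zero:.4f} nats")
print(f"random critic:  {bound_random:.4f} nats")
```

MINE replaces the critic table with a neural network and the exact sums with mini-batch averages, so its estimate is a noisy lower bound rather than an exact value.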

### MINE Algorithm

1. **Parameterize** T with a neural network
2. **Sample**:
   - Joint samples: (x, y) ~ P(X, Y)
   - Marginal samples: (x, y') where y' ~ P(Y) independently
3. **Optimize**: Maximize the MINE objective
4. **Estimate**: Use trained T to estimate MI

### Advantages

- Scales to high dimensions
- Fast after training
- Can handle continuous variables
- Differentiable (can backprop through MI)

In [None]:
class MINENetwork(nn.Module):
    """
    Statistics network for MINE (Mutual Information Neural Estimation).
    
    T(x, y) = neural_network([x, y])
    """
    
    def __init__(self, x_dim, y_dim, hidden_dim=64):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, x, y):
        """
        Args:
            x: (batch, x_dim)
            y: (batch, y_dim)
        
        Returns:
            T(x, y): (batch, 1) statistics
        """
        xy = torch.cat([x, y], dim=1)
        return self.net(xy)


class MINEEstimator:
    """
    MINE: Mutual Information Neural Estimation.
    """
    
    def __init__(self, x_dim, y_dim, hidden_dim=64, lr=1e-3):
        self.network = MINENetwork(x_dim, y_dim, hidden_dim)
        self.optimizer = Adam(self.network.parameters(), lr=lr)
        self.mi_history = []
    
    def train_step(self, x_joint, y_joint, x_marginal, y_marginal):
        """
        Single training step.
        
        Args:
            x_joint, y_joint: Samples from joint P(X, Y)
            x_marginal, y_marginal: Samples from marginals P(X)P(Y)
        
        Returns:
            mi_estimate: Current MI estimate
        """
        self.optimizer.zero_grad()
        
        # Statistics on joint
        t_joint = self.network(x_joint, y_joint)
        
        # Statistics on marginals
        t_marginal = self.network(x_marginal, y_marginal)
        
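        # MINE objective: E_P[T] - log E_Q[exp(T)]
        # log-mean-exp is computed with logsumexp for numerical stability
        # (the moving-average gradient correction from the MINE paper is not implemented)
        log_mean_exp = torch.logsumexp(t_marginal, dim=0).squeeze() - float(np.log(t_marginal.shape[0]))
        mi_estimate = t_joint.mean() - log_mean_exp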
        
        # Maximize MI = minimize negative MI
        loss = -mi_estimate
        
        loss.backward()
        self.optimizer.step()
        
        return mi_estimate.item()
    
    def fit(self, X, Y, n_epochs=1000, batch_size=256, verbose=True):
        """
        Train MINE to estimate I(X; Y).
        
        Args:
            X: (n_samples, x_dim)
            Y: (n_samples, y_dim)
            n_epochs: Number of training iterations (one random mini-batch each)
            batch_size: Batch size
        """
        # Convert to tensors
        if not isinstance(X, torch.Tensor):
            X = torch.FloatTensor(X)
        if not isinstance(Y, torch.Tensor):
            Y = torch.FloatTensor(Y)
        
        n_samples = len(X)
        
        for epoch in range(n_epochs):
            # Sample batch from joint
            indices = torch.randperm(n_samples)[:batch_size]
            x_joint = X[indices]
            y_joint = Y[indices]
            
            # Sample batch from marginals (shuffle y)
            marginal_indices = torch.randperm(n_samples)[:batch_size]
            x_marginal = X[indices]
            y_marginal = Y[marginal_indices]  # Shuffled Y
            
            # Train step
            mi = self.train_step(x_joint, y_joint, x_marginal, y_marginal)
            self.mi_history.append(mi)
            
            if verbose and (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{n_epochs}, MI estimate: {mi:.4f}")
    
    def estimate(self, X, Y, batch_size=256):
        """
        Estimate MI using trained network.
        """
        if not isinstance(X, torch.Tensor):
            X = torch.FloatTensor(X)
        if not isinstance(Y, torch.Tensor):
            Y = torch.FloatTensor(Y)
        
        with torch.no_grad():
            n_samples = len(X)
            
            # Joint: random subset, so data sorted by class does not skew the estimate
            indices = torch.randperm(n_samples)[:batch_size]
            t_joint = self.network(X[indices], Y[indices]).mean()
            
            # Marginal: pair the same X with independently shuffled Y
            marginal_indices = torch.randperm(n_samples)[:batch_size]
            t_marginal = self.network(X[indices], Y[marginal_indices])
            
            mi = t_joint - torch.log(torch.exp(t_marginal).mean())
        
        return mi.item()

print("MINE estimator implemented!")

In [None]:
# Test MINE on synthetic data
n_samples = 2000

# Create correlated data: Y = 0.8*X + 0.6*noise
X_mine = np.random.randn(n_samples, 1)
noise = np.random.randn(n_samples, 1)
Y_mine = 0.8 * X_mine + 0.6 * noise

# Train MINE
mine = MINEEstimator(x_dim=1, y_dim=1, hidden_dim=32, lr=1e-3)
mine.fit(X_mine, Y_mine, n_epochs=500, batch_size=256, verbose=True)

# Get final estimate
mi_mine = mine.estimate(X_mine, Y_mine)

# Compare to k-NN
mi_knn_estimator = MutualInformationEstimator(k=5)
mi_knn = mi_knn_estimator.estimate(X_mine, Y_mine)

print(f"\nFinal MI Estimates:")
print(f"  MINE: {mi_mine:.4f} nats")
print(f"  k-NN: {mi_knn:.4f} nats")

In [None]:
# Visualize MINE training
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: MI estimate over training
ax = axes[0]
ax.plot(mine.mi_history, linewidth=2, alpha=0.7)
ax.axhline(y=mi_knn, color='red', linestyle='--', linewidth=2, 
           label=f'k-NN estimate: {mi_knn:.3f}')
ax.set_xlabel('Training Iteration')
ax.set_ylabel('MI Estimate (nats)')
ax.set_title('MINE Training: MI Convergence')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Data scatter
ax = axes[1]
ax.scatter(X_mine[:500], Y_mine[:500], alpha=0.5, s=20)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title(f'Data: Y = 0.8X + 0.6×noise\nMI ≈ {mi_mine:.3f} nats')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

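print("\nInterpretation:")
print("- For this Gaussian pair the exact value is -0.5*ln(1 - 0.8^2) ≈ 0.511 nats")
print("- The MINE curve is noisy per batch; compare its plateau to that value")
print("- Rough agreement between MINE and k-NN is a sanity check, not proof of accuracy")
print("- Scatter plot shows correlation structure")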

## Part 3: Information Plane Analysis

### The Information Plane

**Information Plane** visualizes learning dynamics by plotting:
- **X-axis**: I(X; T) - Information between input X and hidden layer T
- **Y-axis**: I(T; Y) - Information between hidden layer T and output Y

### Two Phases of Learning (Shwartz-Ziv & Tishby)

1. **Fitting Phase**: Both I(X; T) and I(T; Y) increase
   - Network learns to represent input
   - Predictions improve

2. **Compression Phase**: I(X; T) decreases while I(T; Y) stays high
   - Network compresses representation
   - Removes irrelevant information
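   - Hypothesized to improve generalization; Saxe et al. (2019) report that this phase depends on the activation function and is not universal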

### Information Bottleneck Principle

Optimal representations maximize:
```
L = I(T; Y) - β * I(T; X)
```

where:
- I(T; Y): Prediction (maximize)
- I(T; X): Compression (minimize)
- β: Trade-off parameter

**Goal**: Find minimal sufficient statistics for prediction.
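
To make the trade-off concrete, the objective can be evaluated exactly for a few candidate encoders on a toy problem. The sketch below assumes a hypothetical setup with X uniform on four values and Y = X // 2; the encoder tables, `mi_from_joint`, `ib_objective`, and β = 0.25 are all choices made for this illustration. It relies on the Markov chain T - X - Y, so p(t, y) = Σ_x p(x) p(t|x) p(y|x).

```python
import numpy as np

def mi_from_joint(joint):
    """Mutual information in nats from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

p_x = np.full(4, 0.25)                            # X uniform on {0, 1, 2, 3}
p_y_given_x = np.array([[1.0, 0.0], [1.0, 0.0],
                        [0.0, 1.0], [0.0, 1.0]])  # Y = X // 2

# Candidate encoders p(t | x), one row per value of x
encoders = {
    "identity (T = X)": np.eye(4),
    "merge pairs (T = X // 2)": np.array([[1.0, 0.0], [1.0, 0.0],
                                          [0.0, 1.0], [0.0, 1.0]]),
    "constant": np.ones((4, 1)),
}

def ib_objective(p_t_given_x, beta):
    """Return (I(T;Y) - beta * I(T;X), I(T;X), I(T;Y)) for an encoder."""
    joint_xt = p_x[:, None] * p_t_given_x         # p(x, t)
    joint_ty = joint_xt.T @ p_y_given_x           # p(t, y) via the Markov chain
    i_tx = mi_from_joint(joint_xt)
    i_ty = mi_from_joint(joint_ty)
    return i_ty - beta * i_tx, i_tx, i_ty

beta = 0.25
scores = {}
for name, encoder in encoders.items():
    score, i_tx, i_ty = ib_objective(encoder, beta)
    scores[name] = score
    print(f"{name:26s} I(T;X)={i_tx:.3f}  I(T;Y)={i_ty:.3f}  L={score:.3f}")
```

The merged encoder keeps everything relevant to Y while discarding the rest of X, so it scores highest. Because I(T; Y) ≤ I(T; X) by the data processing inequality, this form of the objective only favors informative encoders for β below 1.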

In [None]:
class InformationPlaneAnalyzer:
    """
    Analyze neural networks in the information plane.
    """
    
    def __init__(self, use_mine=True, k=5):
        """
        Args:
            use_mine: Use MINE (handles high dimensions, needs training) or k-NN (no training, low dimensions only)
            k: Number of neighbors for k-NN
        """
        self.use_mine = use_mine
        self.k = k
    
    def compute_layer_mi(self, X, layer_activations, Y):
        """
        Compute I(X; T) and I(T; Y) for a hidden layer.
        
        Args:
            X: Input data
            layer_activations: Hidden layer activations
            Y: Output labels
        
        Returns:
            mi_xt: I(X; T)
            mi_ty: I(T; Y)
        """
        if self.use_mine:
            # Use MINE (fast but requires training)
            mine_xt = MINEEstimator(X.shape[1], layer_activations.shape[1], hidden_dim=32)
            mine_xt.fit(X, layer_activations, n_epochs=200, verbose=False)
            mi_xt = mine_xt.estimate(X, layer_activations)
            
            mine_ty = MINEEstimator(layer_activations.shape[1], Y.shape[1], hidden_dim=32)
            mine_ty.fit(layer_activations, Y, n_epochs=200, verbose=False)
            mi_ty = mine_ty.estimate(layer_activations, Y)
        else:
            # Use k-NN (slower but no training)
            estimator = MutualInformationEstimator(k=self.k)
            
            # Reduce dimensionality if needed (k-NN doesn't scale well)
            if layer_activations.shape[1] > 10:
                pca = PCA(n_components=10)
                layer_activations = pca.fit_transform(layer_activations)
            
            mi_xt = estimator.estimate(X, layer_activations)
            mi_ty = estimator.estimate(layer_activations, Y)
        
        return mi_xt, mi_ty
    
    def analyze_network(self, model, X, Y):
        """
        Analyze all layers of a network.
        
        Returns:
            Dictionary mapping layer_name → (I(X;T), I(T;Y))
        """
        results = {}
        
        # Convert inputs
        if not isinstance(X, torch.Tensor):
            X_torch = torch.FloatTensor(X)
        else:
            X_torch = X
        
        # Forward pass and collect activations
        activations = {}
        
        def hook_fn(name):
            def hook(module, input, output):
                activations[name] = output.detach().cpu().numpy()
            return hook
        
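        # Register hooks on every nn.Linear (captured outputs are pre-activation values).
        # Layers whose name contains 'output' are skipped; SimpleClassifier names its
        # layers network.0, network.2, ..., so its final layer is included.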
        hooks = []
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear) and 'output' not in name:
                hooks.append(module.register_forward_hook(hook_fn(name)))
        
        # Run forward pass
        with torch.no_grad():
            _ = model(X_torch)
        
        # Remove hooks
        for hook in hooks:
            hook.remove()
        
        # Compute MI for each layer
        for layer_name, layer_acts in activations.items():
            print(f"Analyzing layer: {layer_name}...")
            mi_xt, mi_ty = self.compute_layer_mi(X, layer_acts, Y)
            results[layer_name] = (mi_xt, mi_ty)
        
        return results

print("Information plane analyzer implemented!")

In [None]:
# Create a simple classification network
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=10, hidden_dims=(20, 15, 10), output_dim=2):
        super().__init__()
        
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.Tanh())
            prev_dim = hidden_dim
        
        layers.append(nn.Linear(prev_dim, output_dim))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

# Generate synthetic classification data
n_samples = 500
input_dim = 10

# Two gaussian clusters
X_class1 = np.random.randn(n_samples//2, input_dim) + np.array([2, 2, 0, 0, 0, 0, 0, 0, 0, 0])
X_class2 = np.random.randn(n_samples//2, input_dim) + np.array([-2, -2, 0, 0, 0, 0, 0, 0, 0, 0])
X_info = np.vstack([X_class1, X_class2])
Y_info = np.vstack([np.zeros((n_samples//2, 1)), np.ones((n_samples//2, 1))])

# One-hot encode Y
Y_info_onehot = np.zeros((n_samples, 2))
Y_info_onehot[np.arange(n_samples), Y_info.flatten().astype(int)] = 1

print(f"Created classification data:")
print(f"  Input dimension: {X_info.shape[1]}")
print(f"  Number of samples: {X_info.shape[0]}")
print(f"  Number of classes: 2")

In [None]:
# Train network
model_info = SimpleClassifier(input_dim=10, hidden_dims=[20, 15, 10], output_dim=2)
optimizer = Adam(model_info.parameters(), lr=1e-3)

X_tensor = torch.FloatTensor(X_info)
Y_tensor = torch.FloatTensor(Y_info_onehot)

print("Training network...")
for epoch in range(100):
    optimizer.zero_grad()
    output = model_info(X_tensor)
    loss = F.mse_loss(output, Y_tensor)
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

print("\nTraining complete!")

In [None]:
# Analyze in information plane (using MINE for speed)
analyzer = InformationPlaneAnalyzer(use_mine=True)
info_results = analyzer.analyze_network(model_info, X_info, Y_info_onehot)

print("\nInformation Plane Analysis:")
print("="*50)
for layer_name, (mi_xt, mi_ty) in info_results.items():
    print(f"\n{layer_name}:")
    print(f"  I(X; T) = {mi_xt:.4f} nats")
    print(f"  I(T; Y) = {mi_ty:.4f} nats")

In [None]:
# Visualize information plane
fig, ax = plt.subplots(figsize=(10, 8))

# Extract coordinates
layer_names = list(info_results.keys())
mi_xt_values = [info_results[name][0] for name in layer_names]
mi_ty_values = [info_results[name][1] for name in layer_names]

# Plot trajectory through layers
ax.plot(mi_xt_values, mi_ty_values, 'o-', linewidth=2, markersize=10, 
        color='steelblue', label='Layer trajectory')

# Label points
for i, name in enumerate(layer_names):
    ax.annotate(f'L{i}', (mi_xt_values[i], mi_ty_values[i]),
               xytext=(5, 5), textcoords='offset points', fontsize=10)

ax.set_xlabel('I(X; T) - Information Preserved (nats)', fontsize=12)
ax.set_ylabel('I(T; Y) - Prediction Information (nats)', fontsize=12)
ax.set_title('Information Plane: Layer-by-Layer Analysis\n(IB hypothesis: deeper layers compress while maintaining prediction)', 
            fontsize=13)
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Early layers: High I(X;T), increasing I(T;Y)")
print("- Later layers: May compress (lower I(X;T)) while maintaining I(T;Y)")
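print("- IB ideal: high I(T;Y) with as little I(X;T) as possible")
print("- Caveat: for a deterministic network with continuous inputs the true I(X;T) is infinite,")
print("  so these values reflect the estimator as much as the network")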

## Summary and Next Steps

### What We Learned

1. **Mutual Information**: Quantifying shared information
   - k-NN estimation (accurate but slow)
   - MINE (scalable neural estimation)
   - Applications to understanding representations

2. **Information Plane**: Visualizing learning dynamics
   - I(X; T): Information preservation
   - I(T; Y): Prediction information
   - Compression vs prediction trade-off

3. **Information Bottleneck**: Optimal representations
   - Minimal sufficient statistics
   - Compression is hypothesized to improve generalization (contested by Saxe et al., 2019)
   - Two phases of learning

### Key Takeaways

- **Information theory is powerful**: Reveals what networks learn
- **MI estimation is tractable**: MINE makes estimation feasible in high dimensions, though its estimates can be noisy
- **Compression may help**: Removing irrelevant information is hypothesized to improve generalization
- **Layer-wise analysis**: Information plane shows learning dynamics

### Applications

1. **Architecture Design**: Choose layer sizes based on information flow
2. **Regularization**: Encourage information bottleneck
3. **Debugging**: Identify where information is lost
4. **Interpretability**: Understand what each layer computes

### Next Steps

1. **Notebook 10**: Advanced topics (meta-dynamics, topology, counterfactuals)
2. **Apply to your models**: Analyze information flow in your networks
3. **Experiment with regularization**: Try information bottleneck constraints

### Further Reading

- Tishby & Zaslavsky (2015): *Deep learning and the information bottleneck principle*
- Shwartz-Ziv & Tishby (2017): *Opening the black box of deep neural networks via information*
- Belghazi et al. (2018): *MINE: Mutual Information Neural Estimation*
- Saxe et al. (2019): *On the information bottleneck theory of deep learning*