# Neural Network Digit Classification - Complete Parameter Exploration

This notebook explores neural networks for MNIST digit classification with:
1. **Scikit-learn MLPClassifier** with extensive parameter tuning
2. **Pure Python implementation** from scratch for deep understanding
3. **Comprehensive comparison** of approaches and parameters

## Neural Network Fundamentals

### Key Components:
- **Neurons**: Basic processing units
- **Layers**: Input, Hidden, Output layers
- **Weights & Biases**: Learnable parameters
- **Activation Functions**: Non-linear transformations
- **Backpropagation**: Algorithm that computes the gradient of the loss with respect to every weight and bias

### Parameters to Explore:
- **Architecture**: Number of hidden layers and neurons
- **Activation Functions**: ReLU, Sigmoid, Tanh
- **Solvers**: SGD, Adam, L-BFGS
- **Learning Rate**: Step size of each parameter update
- **Regularization**: Preventing overfitting
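
To make these components concrete, here is a minimal sketch of a forward pass through one hidden layer in NumPy. The sizes (4 inputs, 3 hidden neurons, 2 classes) and the random inputs are made up for illustration and are not the MNIST setup used in the rest of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 input features, 3 hidden neurons, 2 output classes.
n_in, n_hidden, n_out = 4, 3, 2

# Weights and biases are the learnable parameters.
W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X):
    hidden = relu(X @ W1 + b1)          # hidden layer: linear map + non-linearity
    return softmax(hidden @ W2 + b2)    # output layer: class probabilities

X_toy = rng.normal(size=(5, n_in))      # 5 made-up samples
probs = forward(X_toy)
print(probs.shape)        # (5, 2)
print(probs.sum(axis=1))  # every row sums to 1
```

Training then amounts to adjusting `W1`, `b1`, `W2`, and `b2` so that these probabilities match the labels, which is what backpropagation and the solver do together.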

## 1. Import Libraries and Load Data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

# Load MNIST dataset
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
X, y = mnist.data, mnist.target.astype(int)

# Use subset for faster experimentation
subset_size = 15000  # Larger subset for neural networks
X_subset = X[:subset_size]
y_subset = y[:subset_size]

# Normalize data
X_normalized = X_subset / 255.0

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y_subset, test_size=0.2, random_state=42, stratify=y_subset
)

print(f"Dataset loaded: {X_subset.shape[0]} samples")
print(f"Training: {X_train.shape[0]}, Testing: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]} (28x28 pixels)")

Libraries imported successfully!
Loading MNIST dataset...
Dataset loaded: 15000 samples
Training: 12000, Testing: 3000
Features: 784 (28x28 pixels)


## 2. Neural Network Architecture Exploration

Let's explore different neural network architectures and parameters systematically.

In [2]:
# Function to train and evaluate neural network
def train_and_evaluate_nn(hidden_layer_sizes, activation='relu', solver='adam', 
                         learning_rate_init=0.001, max_iter=500, alpha=0.0001):
    """
    Train neural network with given parameters and return results
    """
    start_time = time.time()
    
    # Create and train model
    mlp = MLPClassifier(
        hidden_layer_sizes=hidden_layer_sizes,
        activation=activation,
        solver=solver,
        learning_rate_init=learning_rate_init,
        max_iter=max_iter,
        alpha=alpha,
        random_state=42
    )
    
    mlp.fit(X_train, y_train)
    
    # Make predictions
    y_pred = mlp.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    training_time = time.time() - start_time
    
    return {
        'model': mlp,
        'accuracy': accuracy,
        'training_time': training_time,
        'iterations': mlp.n_iter_,
        'loss': mlp.loss_
    }

print("Neural network training function defined!")

Neural network training function defined!


## 3. Architecture Comparison

Let's test different network architectures to understand the impact of depth and width.
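
Before comparing accuracies, it can help to see how many learnable parameters each architecture has. The helper below is a hypothetical sketch (`count_parameters` is not part of scikit-learn) that assumes fully connected layers with 784 inputs and 10 outputs, as in this notebook.

```python
def count_parameters(hidden_layer_sizes, n_inputs=784, n_outputs=10):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

for layers in [(50,), (100, 50), (200, 100, 50, 25), (300, 300)]:
    print(layers, count_parameters(layers))
# (50,) 39760
# (100, 50) 84060
# (200, 100, 50, 25) 183685
# (300, 300) 328810
```

Most parameters sit in the first layer because of the 784-pixel input, so widening the first hidden layer grows the model much faster than adding narrow layers on top.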

In [3]:
# Different architectures to test
architectures = {
    'Single Hidden (50)': (50,),
    'Single Hidden (100)': (100,),
    'Single Hidden (200)': (200,),
    'Two Hidden (100,50)': (100, 50),
    'Two Hidden (200,100)': (200, 100),
    'Three Hidden (100,50,25)': (100, 50, 25),
    'Deep Network (200,100,50,25)': (200, 100, 50, 25),
    'Wide Network (300,300)': (300, 300)
}

print("Testing different neural network architectures...\n")

architecture_results = []

for name, hidden_layers in architectures.items():
    print(f"Training {name}: {hidden_layers}")
    
    result = train_and_evaluate_nn(
        hidden_layer_sizes=hidden_layers,
        max_iter=300  # Reduce iterations for faster comparison
    )
    
    architecture_results.append({
        'Architecture': name,
        'Hidden Layers': str(hidden_layers),
        'Accuracy': result['accuracy'],
        'Training Time': result['training_time'],
        'Iterations': result['iterations'],
        'Final Loss': result['loss']
    })
    
    print(f"  Accuracy: {result['accuracy']:.4f} ({result['accuracy']*100:.2f}%)")
    print(f"  Training Time: {result['training_time']:.2f}s")
    print(f"  Iterations: {result['iterations']}\n")

# Create results DataFrame
arch_df = pd.DataFrame(architecture_results)
print("\n=== ARCHITECTURE COMPARISON ===\n")
print(arch_df.round(4))

Testing different neural network architectures...

Training Single Hidden (50): (50,)
  Accuracy: 0.9513 (95.13%)
  Training Time: 54.28s
  Iterations: 118

Training Single Hidden (100): (100,)
  Accuracy: 0.9570 (95.70%)
  Training Time: 64.24s
  Iterations: 87

Training Single Hidden (200): (200,)
  Accuracy: 0.9600 (96.00%)
  Training Time: 117.58s
  Iterations: 63

Training Two Hidden (100,50): (100, 50)
  Accuracy: 0.9613 (96.13%)
  Training Time: 51.06s
  Iterations: 60

Training Two Hidden (200,100): (200, 100)
  Accuracy: 0.9627 (96.27%)
  Training Time: 80.21s
  Iterations: 41

Training Three Hidden (100,50,25): (100, 50, 25)
  Accuracy: 0.9617 (96.17%)
  Training Time: 49.87s
  Iterations: 51

Training Deep Network (200,100,50,25): (200, 100, 50, 25)
  Accuracy: 0.9643 (96.43%)
  Training Time: 83.14s
  Iterations: 35

Training Wide Network (300,300): (300, 300)
  Accuracy: 0.9657 (96.57%)
  Training Time: 136.12s
  Iterations: 33


=== ARCHITECTURE COMPARISON ===

          

## 4. Activation Function Comparison

Different activation functions have different properties and performance characteristics.
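
As a quick illustration of those properties, the sketch below evaluates the three activations and their derivatives at a few hand-picked inputs. The input values are arbitrary; the formulas are the standard definitions.

```python
import numpy as np

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))
# Each entry maps a name to (activation value, derivative) evaluated at z.
activations = {
    "relu": (np.maximum(0.0, z), (z > 0).astype(float)),
    "tanh": (np.tanh(z), 1.0 - np.tanh(z) ** 2),
    "logistic": (sigmoid, sigmoid * (1.0 - sigmoid)),
}

for name, (value, grad) in activations.items():
    print(f"{name:9s} value={np.round(value, 3)} grad={np.round(grad, 3)}")
```

The logistic derivative never exceeds 0.25, and both logistic and tanh derivatives shrink toward zero for large |z|, while ReLU passes a gradient of exactly 1 for positive inputs. This saturation is one plausible contributor to slower training with logistic units, though this sketch alone does not prove it for the runs below.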

In [4]:
# Test different activation functions
activations = ['relu', 'tanh', 'logistic']
activation_results = []

print("Testing different activation functions...\n")

for activation in activations:
    print(f"Training with {activation} activation...")
    
    result = train_and_evaluate_nn(
        hidden_layer_sizes=(100, 50),
        activation=activation,
        max_iter=300
    )
    
    activation_results.append({
        'Activation': activation,
        'Accuracy': result['accuracy'],
        'Training Time': result['training_time'],
        'Iterations': result['iterations']
    })
    
    print(f"  Accuracy: {result['accuracy']:.4f}")
    print(f"  Training Time: {result['training_time']:.2f}s\n")

activation_df = pd.DataFrame(activation_results)
print("\n=== ACTIVATION FUNCTION COMPARISON ===\n")
print(activation_df.round(4))

Testing different activation functions...

Training with relu activation...
  Accuracy: 0.9613
  Training Time: 61.21s

Training with tanh activation...
  Accuracy: 0.9587
  Training Time: 65.45s

Training with logistic activation...
  Accuracy: 0.9567
  Training Time: 132.47s


=== ACTIVATION FUNCTION COMPARISON ===

  Activation  Accuracy  Training Time  Iterations
0       relu    0.9613        61.2053          60
1       tanh    0.9587        65.4457          66
2   logistic    0.9567       132.4698         107


## 5. Solver Comparison

Different optimization algorithms (solvers) can significantly impact training speed and final performance.
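
To get a feel for how update rules differ, here is a simplified sketch comparing plain gradient descent with the Adam update on a toy quadratic whose two coordinates have very different curvature. It uses full-batch gradients and no momentum for the gradient-descent variant, so it is not a reproduction of scikit-learn's solvers; the function names and the toy loss are assumptions for illustration.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (100 * w[0]**2 + w[1]**2)
    return np.array([100.0 * w[0], w[1]])

def run_sgd(w, lr=0.01, steps=200):
    # Plain gradient descent: the step is proportional to the raw gradient.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_adam(w, lr=0.01, steps=200, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)   # running mean of gradients
    v = np.zeros_like(w)   # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
print("Gradient descent:", run_sgd(w0))
print("Adam:            ", run_adam(w0))
```

With plain gradient descent, progress on each coordinate depends on its curvature. Adam divides by a running gradient magnitude, so both coordinates move at nearly the same pace despite the 100x difference in scale.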

In [5]:
# Test different solvers
solvers = ['adam', 'sgd', 'lbfgs']
solver_results = []

print("Testing different optimization solvers...\n")

for solver in solvers:
    print(f"Training with {solver} solver...")
    
    # Adjust parameters based on solver
    max_iter = 1000 if solver == 'lbfgs' else 300
    learning_rate = 0.01 if solver == 'sgd' else 0.001
    
    result = train_and_evaluate_nn(
        hidden_layer_sizes=(100, 50),
        solver=solver,
        learning_rate_init=learning_rate,
        max_iter=max_iter
    )
    
    solver_results.append({
        'Solver': solver,
        'Accuracy': result['accuracy'],
        'Training Time': result['training_time'],
        'Iterations': result['iterations']
    })
    
    print(f"  Accuracy: {result['accuracy']:.4f}")
    print(f"  Training Time: {result['training_time']:.2f}s\n")

solver_df = pd.DataFrame(solver_results)
print("\n=== SOLVER COMPARISON ===\n")
print(solver_df.round(4))

Testing different optimization solvers...

Training with adam solver...
  Accuracy: 0.9613
  Training Time: 57.39s

Training with sgd solver...
  Accuracy: 0.9547
  Training Time: 105.94s

Training with lbfgs solver...
  Accuracy: 0.9573
  Training Time: 31.42s


=== SOLVER COMPARISON ===

  Solver  Accuracy  Training Time  Iterations
0   adam    0.9613        57.3893          60
1    sgd    0.9547       105.9406         125
2  lbfgs    0.9573        31.4166         127


## 6. Learning Rate Impact

The learning rate is crucial for neural network training: if it is too high, training may fail to converge; if it is too low, training is slow.
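
The same trade-off shows up on a one-dimensional toy problem. The sketch below runs gradient descent on f(w) = 0.5 * curvature * w**2; the curvature value and step counts are arbitrary choices for illustration, not values taken from the MNIST experiments.

```python
def gradient_descent(lr, steps=50, w0=1.0, curvature=10.0):
    """Minimise f(w) = 0.5 * curvature * w**2 from w0; return the final |w|."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w   # the gradient of f is curvature * w
    return abs(w)

for lr in [0.25, 0.1, 0.01, 0.001]:
    print(f"lr={lr}: |w| after 50 steps = {gradient_descent(lr):.3e}")
```

For this quadratic, any learning rate above 2 / curvature = 0.2 makes the iterates grow instead of shrink, which is why `lr=0.25` diverges, while `lr=0.001` is stable but has barely moved after 50 steps.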

In [6]:
# Test different learning rates
learning_rates = [0.1, 0.01, 0.001, 0.0001]
lr_results = []

print("Testing different learning rates...\n")

for lr in learning_rates:
    print(f"Training with learning rate {lr}...")
    
    result = train_and_evaluate_nn(
        hidden_layer_sizes=(100, 50),
        learning_rate_init=lr,
        max_iter=300
    )
    
    lr_results.append({
        'Learning Rate': lr,
        'Accuracy': result['accuracy'],
        'Training Time': result['training_time'],
        'Iterations': result['iterations']
    })
    
    print(f"  Accuracy: {result['accuracy']:.4f}")
    print(f"  Training Time: {result['training_time']:.2f}s\n")

lr_df = pd.DataFrame(lr_results)
print("\n=== LEARNING RATE COMPARISON ===\n")
print(lr_df.round(4))

Testing different learning rates...

Training with learning rate 0.1...
  Accuracy: 0.8370
  Training Time: 27.49s

Training with learning rate 0.01...
  Accuracy: 0.9573
  Training Time: 27.25s

Training with learning rate 0.001...
  Accuracy: 0.9613
  Training Time: 54.78s

Training with learning rate 0.0001...
  Accuracy: 0.9580
  Training Time: 204.72s


=== LEARNING RATE COMPARISON ===

   Learning Rate  Accuracy  Training Time  Iterations
0         0.1000    0.8370        27.4926          32
1         0.0100    0.9573        27.2531          30
2         0.0010    0.9613        54.7794          60
3         0.0001    0.9580       204.7228         224


## 7. Regularization Impact

The `alpha` parameter sets the strength of the L2 penalty on the weights, which helps limit overfitting.
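
As a rough sketch of what an L2 penalty does, the example below fits a linear model by gradient descent with a penalty of 0.5 * alpha * ||w||^2, which adds `alpha * w` to the gradient. The data is synthetic and the penalty scaling is an assumption chosen for simplicity; scikit-learn's exact scaling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))                 # few samples, many features
y = X[:, 0] + 0.1 * rng.normal(size=40)       # only feature 0 carries signal

def fit_linear(alpha, lr=0.05, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        data_grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5 * mean squared error
        w -= lr * (data_grad + alpha * w)        # the L2 penalty contributes alpha * w
    return w

for alpha in [0.0, 0.01, 0.1, 1.0]:
    w = fit_linear(alpha)
    print(f"alpha={alpha}: ||w|| = {np.linalg.norm(w):.3f}")
```

Larger `alpha` values pull the weights toward zero, trading a little training fit for smaller, smoother solutions.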

In [7]:
# Test different regularization strengths
alphas = [0.0001, 0.001, 0.01, 0.1]
reg_results = []

print("Testing different regularization strengths...\n")

for alpha in alphas:
    print(f"Training with alpha (regularization) {alpha}...")
    
    result = train_and_evaluate_nn(
        hidden_layer_sizes=(100, 50),
        alpha=alpha,
        max_iter=300
    )
    
    reg_results.append({
        'Alpha (Regularization)': alpha,
        'Accuracy': result['accuracy'],
        'Training Time': result['training_time'],
        'Iterations': result['iterations']
    })
    
    print(f"  Accuracy: {result['accuracy']:.4f}")
    print(f"  Training Time: {result['training_time']:.2f}s\n")

reg_df = pd.DataFrame(reg_results)
print("\n=== REGULARIZATION COMPARISON ===\n")
print(reg_df.round(4))

Testing different regularization strengths...

Training with alpha (regularization) 0.0001...
  Accuracy: 0.9613
  Training Time: 57.17s

Training with alpha (regularization) 0.001...
  Accuracy: 0.9617
  Training Time: 56.56s

Training with alpha (regularization) 0.01...
  Accuracy: 0.9637
  Training Time: 78.81s

Training with alpha (regularization) 0.1...
  Accuracy: 0.9643
  Training Time: 84.73s


=== REGULARIZATION COMPARISON ===

   Alpha (Regularization)  Accuracy  Training Time  Iterations
0                  0.0001    0.9613        57.1720          60
1                  0.0010    0.9617        56.5623          57
2                  0.0100    0.9637        78.8118          87
3                  0.1000    0.9643        84.7314          94


## 8. Best Model Training

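Based on our experiments, let's retrain a strong configuration with a higher iteration cap (`max_iter=1000`). Note that the cell hard-codes the `(200, 100)` architecture rather than the top-scoring `(300, 300)`, and since that network already converged after 41 iterations, the higher cap leaves its accuracy unchanged at 96.27%.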

In [8]:
# Find best architecture from our tests
best_arch = arch_df.loc[arch_df['Accuracy'].idxmax()]
print(f"Best architecture: {best_arch['Architecture']} with accuracy {best_arch['Accuracy']:.4f}")

# Train best model with more iterations
print("\nTraining best model with extended iterations...")

best_model_result = train_and_evaluate_nn(
    hidden_layer_sizes=(200, 100),  # Hard-coded; not the top-scoring (300, 300) found above
    activation='relu',
    solver='adam',
    learning_rate_init=0.001,
    max_iter=1000,
    alpha=0.0001
)

best_model = best_model_result['model']
best_accuracy = best_model_result['accuracy']

print(f"\nBest Model Performance:")
print(f"Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
print(f"Training Time: {best_model_result['training_time']:.2f}s")
print(f"Iterations: {best_model_result['iterations']}")

# Detailed evaluation
y_pred_best = best_model.predict(X_test)
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_best))

Best architecture: Wide Network (300,300) with accuracy 0.9657

Training best model with extended iterations...

Best Model Performance:
Accuracy: 0.9627 (96.27%)
Training Time: 91.35s
Iterations: 41

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       299
           1       0.98      0.99      0.98       338
           2       0.97      0.95      0.96       292
           3       0.94      0.93      0.93       309
           4       0.96      0.97      0.96       294
           5       0.96      0.95      0.95       264
           6       0.96      0.98      0.97       298
           7       0.97      0.96      0.96       319
           8       0.96      0.94      0.95       286
           9       0.97      0.95      0.96       301

    accuracy                           0.96      3000
   macro avg       0.96      0.96      0.96      3000
weighted avg       0.96      0.96      0.96      3000
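
The report shows that digit 3 has the lowest F1 score (0.93), but not which digits it gets confused with. A confusion matrix answers that. The sketch below uses scikit-learn's `confusion_matrix` on made-up labels; in the notebook you would pass `y_test` and `y_pred_best` instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true_demo = np.array([3, 3, 3, 5, 5, 8, 8, 8, 8, 9])
y_pred_demo = np.array([3, 5, 3, 5, 3, 8, 8, 3, 8, 9])

labels = [3, 5, 8, 9]
# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)
# [[2 1 0 0]
#  [1 1 0 0]
#  [1 0 3 0]
#  [0 0 0 1]]

# Per-class recall is the diagonal divided by the row sums.
recall = cm.diagonal() / cm.sum(axis=1)
for label, r in zip(labels, recall):
    print(f"digit {label}: recall {r:.2f}")
```

Off-diagonal entries in a row show where that digit's errors go; here the toy 8s are sometimes predicted as 3.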



## 9. Neural Network from Scratch - Pure Python Implementation

Now let's implement a neural network from scratch to understand the underlying mathematics.

In [9]:
class NeuralNetworkFromScratch:
    def __init__(self, layers, learning_rate=0.01, activation='relu'):
        """
        Initialize neural network with given architecture
        
        Args:
            layers: List of layer sizes [input_size, hidden1, hidden2, ..., output_size]
            learning_rate: Learning rate for gradient descent
            activation: Activation function ('relu', 'sigmoid', 'tanh')
        """
        self.layers = layers
        self.learning_rate = learning_rate
        self.activation = activation
        
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        
        for i in range(len(layers) - 1):
            # He initialization (variance 2 / fan_in), suited to ReLU layers
            w = np.random.randn(layers[i], layers[i+1]) * np.sqrt(2.0 / layers[i])
            b = np.zeros((1, layers[i+1]))
            
            self.weights.append(w)
            self.biases.append(b)
    
    def relu(self, z):
        """ReLU activation function"""
        return np.maximum(0, z)
    
    def relu_derivative(self, z):
        """Derivative of ReLU"""
        return (z > 0).astype(float)
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip z to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def sigmoid_derivative(self, z):
        """Derivative of sigmoid"""
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def tanh(self, z):
        """Tanh activation function"""
        return np.tanh(z)
    
    def tanh_derivative(self, z):
        """Derivative of tanh"""
        return 1 - np.tanh(z)**2
    
    def softmax(self, z):
        """Softmax activation for output layer"""
        # Subtract max for numerical stability
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)
    
    def apply_activation(self, z, derivative=False):
        """Apply chosen activation function"""
        if self.activation == 'relu':
            return self.relu_derivative(z) if derivative else self.relu(z)
        elif self.activation == 'sigmoid':
            return self.sigmoid_derivative(z) if derivative else self.sigmoid(z)
        elif self.activation == 'tanh':
            return self.tanh_derivative(z) if derivative else self.tanh(z)
        else:
            raise ValueError(f"Unknown activation: {self.activation!r}")
    
    def forward_propagation(self, X):
        """Forward pass through the network"""
        self.z_values = []  # Store z values for backprop
        self.activations = [X]  # Store activations
        
        current_input = X
        
        # Forward through hidden layers
        for i in range(len(self.weights) - 1):
            z = np.dot(current_input, self.weights[i]) + self.biases[i]
            self.z_values.append(z)
            
            a = self.apply_activation(z)
            self.activations.append(a)
            current_input = a
        
        # Output layer (softmax)
        z_output = np.dot(current_input, self.weights[-1]) + self.biases[-1]
        self.z_values.append(z_output)
        
        output = self.softmax(z_output)
        self.activations.append(output)
        
        return output
    
    def backward_propagation(self, X, y, output):
        """Backward pass to compute gradients"""
        m = X.shape[0]  # Number of samples
        
        # Convert y to one-hot encoding (one column per output unit)
        y_one_hot = np.eye(self.layers[-1])[y]
        
        # Compute gradients
        dw = []
        db = []
        
        # Output layer error
        dz = output - y_one_hot
        
        # Backpropagate through all layers
        for i in range(len(self.weights) - 1, -1, -1):
            # Gradient for weights and biases
            dw_i = np.dot(self.activations[i].T, dz) / m
            db_i = np.sum(dz, axis=0, keepdims=True) / m
            
            dw.insert(0, dw_i)
            db.insert(0, db_i)
            
            # Compute error for previous layer (if not input layer)
            if i > 0:
                dz = np.dot(dz, self.weights[i].T) * self.apply_activation(self.z_values[i-1], derivative=True)
        
        return dw, db
    
    def update_parameters(self, dw, db):
        """Update weights and biases using gradients"""
        for i in range(len(self.weights)):
            self.weights[i] -= self.learning_rate * dw[i]
            self.biases[i] -= self.learning_rate * db[i]
    
    def compute_loss(self, y_true, y_pred):
        """Compute cross-entropy loss"""
        m = y_true.shape[0]
        y_one_hot = np.eye(self.layers[-1])[y_true]
        
        # Clip predictions to prevent log(0)
        y_pred_clipped = np.clip(y_pred, 1e-12, 1 - 1e-12)
        
        loss = -np.sum(y_one_hot * np.log(y_pred_clipped)) / m
        return loss
    
    def train(self, X, y, epochs=100, batch_size=32, verbose=True):
        """Train the neural network"""
        losses = []
        
        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(X.shape[0])
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            
            epoch_loss = 0
            num_batches = 0
            
            # Mini-batch training
            for i in range(0, X.shape[0], batch_size):
                X_batch = X_shuffled[i:i+batch_size]
                y_batch = y_shuffled[i:i+batch_size]
                
                # Forward and backward pass
                output = self.forward_propagation(X_batch)
                dw, db = self.backward_propagation(X_batch, y_batch, output)
                self.update_parameters(dw, db)
                
                # Compute loss
                batch_loss = self.compute_loss(y_batch, output)
                epoch_loss += batch_loss
                num_batches += 1
            
            avg_loss = epoch_loss / num_batches
            losses.append(avg_loss)
            
            if verbose and epoch % 20 == 0:
                print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")
        
        return losses
    
    def predict(self, X):
        """Make predictions"""
        output = self.forward_propagation(X)
        return np.argmax(output, axis=1)
    
    def accuracy(self, X, y):
        """Compute accuracy"""
        predictions = self.predict(X)
        return np.mean(predictions == y)

print("Neural Network from scratch implemented!")

Neural Network from scratch implemented!
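
The backward pass relies on the fact that, for softmax combined with cross-entropy, the gradient with respect to the output logits is simply `output - y_one_hot` (divided by the batch size). A finite-difference check is one way to gain confidence in that formula. The sketch below is standalone and uses toy sizes; it does not call the class above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z, y_one_hot):
    # Mean cross-entropy over the batch, as a function of the logits z.
    return -np.sum(y_one_hot * np.log(softmax(z))) / z.shape[0]

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))                      # 4 samples, 3 classes (toy sizes)
y_one_hot = np.eye(3)[np.array([0, 2, 1, 2])]

# Analytic gradient, including the 1/m factor applied in backward_propagation.
analytic = (softmax(z) - y_one_hot) / z.shape[0]

# Central finite differences, perturbing one logit at a time.
numeric = np.zeros_like(z)
eps = 1e-6
for idx in np.ndindex(*z.shape):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (cross_entropy(z_plus, y_one_hot)
                    - cross_entropy(z_minus, y_one_hot)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))        # should be very close to zero
```

The same idea extends to the weight gradients: perturb one weight, recompute the loss, and compare against the corresponding entry of `dw`.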


## 10. Training Custom Neural Network

Let's train our custom neural network and compare it with scikit-learn.

In [10]:
# Test different architectures with our custom implementation
custom_architectures = {
    'Simple': [784, 50, 10],
    'Medium': [784, 100, 50, 10],
    'Deep': [784, 200, 100, 50, 10]
}

custom_results = []

print("Training custom neural networks from scratch...\n")

# Use smaller subset for custom implementation (it's slower)
X_custom = X_train[:5000]
y_custom = y_train[:5000]
X_test_custom = X_test[:1000]
y_test_custom = y_test[:1000]

for name, architecture in custom_architectures.items():
    print(f"Training {name} architecture: {architecture}")
    
    start_time = time.time()
    
    # Create and train custom network
    nn = NeuralNetworkFromScratch(
        layers=architecture,
        learning_rate=0.01,
        activation='relu'
    )
    
    losses = nn.train(X_custom, y_custom, epochs=100, batch_size=64, verbose=False)
    
    training_time = time.time() - start_time
    
    # Evaluate
    train_accuracy = nn.accuracy(X_custom, y_custom)
    test_accuracy = nn.accuracy(X_test_custom, y_test_custom)
    
    custom_results.append({
        'Architecture': name,
        'Layers': str(architecture),
        'Train Accuracy': train_accuracy,
        'Test Accuracy': test_accuracy,
        'Training Time': training_time,
        'Final Loss': losses[-1]
    })
    
    print(f"  Train Accuracy: {train_accuracy:.4f}")
    print(f"  Test Accuracy: {test_accuracy:.4f}")
    print(f"  Training Time: {training_time:.2f}s\n")

custom_df = pd.DataFrame(custom_results)
print("\n=== CUSTOM NEURAL NETWORK RESULTS ===\n")
print(custom_df.round(4))

Training custom neural networks from scratch...

Training Simple architecture: [784, 50, 10]
  Train Accuracy: 0.9536
  Test Accuracy: 0.9230
  Training Time: 29.93s

Training Medium architecture: [784, 100, 50, 10]
  Train Accuracy: 0.9856
  Test Accuracy: 0.9260
  Training Time: 53.30s

Training Deep architecture: [784, 200, 100, 50, 10]
  Train Accuracy: 0.9982
  Test Accuracy: 0.9300
  Training Time: 112.06s


=== CUSTOM NEURAL NETWORK RESULTS ===

  Architecture                   Layers  Train Accuracy  Test Accuracy  \
0       Simple            [784, 50, 10]          0.9536          0.923   
1       Medium       [784, 100, 50, 10]          0.9856          0.926   
2         Deep  [784, 200, 100, 50, 10]          0.9982          0.930   

   Training Time  Final Loss  
0        29.9312      0.1755  
1        53.3006      0.0863  
2       112.0557      0.0254  


## 11. Comprehensive Comparison and Analysis

In [11]:
# Create comprehensive comparison
print("=== COMPREHENSIVE NEURAL NETWORK ANALYSIS ===\n")

print("1. ARCHITECTURE IMPACT:")
print(f"   Best: {arch_df.loc[arch_df['Accuracy'].idxmax(), 'Architecture']} - {arch_df['Accuracy'].max():.4f}")
print(f"   Worst: {arch_df.loc[arch_df['Accuracy'].idxmin(), 'Architecture']} - {arch_df['Accuracy'].min():.4f}")
print(f"   Range: {(arch_df['Accuracy'].max() - arch_df['Accuracy'].min())*100:.2f}% difference\n")

print("2. ACTIVATION FUNCTIONS:")
for _, row in activation_df.iterrows():
    print(f"   {row['Activation']}: {row['Accuracy']:.4f} accuracy, {row['Training Time']:.1f}s")
print()

print("3. OPTIMIZATION SOLVERS:")
for _, row in solver_df.iterrows():
    print(f"   {row['Solver']}: {row['Accuracy']:.4f} accuracy, {row['Training Time']:.1f}s")
print()

print("4. LEARNING RATE IMPACT:")
best_lr = lr_df.loc[lr_df['Accuracy'].idxmax(), 'Learning Rate']
print(f"   Best learning rate: {best_lr} - {lr_df['Accuracy'].max():.4f} accuracy")
print(f"   Learning rates tested: {list(lr_df['Learning Rate'])}")
print()

print("5. REGULARIZATION EFFECT:")
best_alpha = reg_df.loc[reg_df['Accuracy'].idxmax(), 'Alpha (Regularization)']
print(f"   Best alpha: {best_alpha} - {reg_df['Accuracy'].max():.4f} accuracy")
print()

print("6. IMPLEMENTATION COMPARISON:")
print(f"   Scikit-learn best: {best_accuracy:.4f} accuracy")
print(f"   Custom implementation best: {custom_df['Test Accuracy'].max():.4f} accuracy")
print(f"   Performance gap: {(best_accuracy - custom_df['Test Accuracy'].max())*100:.2f}%")
print()

print("7. KEY INSIGHTS:")
print("   ✓ Larger networks scored slightly higher here; the widest (300,300) was best and slowest")
print("   ✓ ReLU activation typically outperforms sigmoid/tanh for this problem")
print("   ✓ Adam optimizer usually converges faster than SGD")
print("   ✓ Learning rate around 0.001-0.01 works best")
print("   ✓ Custom implementation shows the underlying math works correctly")
print("   ✓ Scikit-learn is optimized and typically performs better")

print("\n" + "="*60)
print(f"FINAL RECOMMENDATION: Use {arch_df.loc[arch_df['Accuracy'].idxmax(), 'Architecture']}")
print(f"with ReLU activation, Adam solver, learning rate 0.001")
print(f"Expected accuracy: ~{best_accuracy*100:.1f}% on MNIST digits")
print("="*60)

=== COMPREHENSIVE NEURAL NETWORK ANALYSIS ===

1. ARCHITECTURE IMPACT:
   Best: Wide Network (300,300) - 0.9657
   Worst: Single Hidden (50) - 0.9513
   Range: 1.43% difference

2. ACTIVATION FUNCTIONS:
   relu: 0.9613 accuracy, 61.2s
   tanh: 0.9587 accuracy, 65.4s
   logistic: 0.9567 accuracy, 132.5s

3. OPTIMIZATION SOLVERS:
   adam: 0.9613 accuracy, 57.4s
   sgd: 0.9547 accuracy, 105.9s
   lbfgs: 0.9573 accuracy, 31.4s

4. LEARNING RATE IMPACT:
   Best learning rate: 0.001 - 0.9613 accuracy
   Learning rates tested: [0.1, 0.01, 0.001, 0.0001]

5. REGULARIZATION EFFECT:
   Best alpha: 0.1 - 0.9643 accuracy

6. IMPLEMENTATION COMPARISON:
   Scikit-learn best: 0.9627 accuracy
   Custom implementation best: 0.9300 accuracy
   Performance gap: 3.27%

7. KEY INSIGHTS:
   ✓ Larger networks scored slightly higher here; the widest (300,300) was best and slowest
   ✓ ReLU activation typically outperforms sigmoid/tanh for this problem
   ✓ Adam optimizer usually converges faster than SGD
   ✓ Learning rate around