# Is Your New Architecture Actually Better?

You've designed two CNN architectures. Model A gets 98.2% on MNIST. Model B gets 98.7%. Model B wins, right?

Not so fast. You've compared single runs. That's like flipping one coin for each model and declaring a winner.

In this notebook, we'll train each architecture multiple times and use proper statistics to determine if Model B is *actually* better — or just got lucky.

## Setup

In [None]:
# !pip install ictonyx tensorflow

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential, layers, Input

import ictonyx as ix
from ictonyx import (
    ModelConfig, 
    KerasModelWrapper, 
    ArraysDataHandler,
    run_variability_study,
    compare_two_models,
    plot_comparison_boxplots
)

print(f"TensorFlow: {tf.__version__}")
print(f"Ictonyx: {ix.__version__}")

## Load MNIST

In [None]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

X_train = X_train.astype('float32') / 255.0
X_train = X_train[..., np.newaxis]

# Use subset for speed
X_subset = X_train[:10000]
y_subset = y_train[:10000]

print(f"Using {len(X_subset)} training samples")

## Define Two Competing Architectures

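**Model A:** A shallow CNN with two conv layers feeding a dense layer.

**Model B:** A deeper CNN with three conv layers and dropout. The extra conv layer shrinks the feature map before the dense layer, so Model B actually has fewer parameters than Model A (93,322 vs. 121,930), but it has more depth and more things that can go wrong.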

In [None]:
def create_shallow_cnn(config: ModelConfig) -> KerasModelWrapper:
    """Model A: Simple, shallow CNN."""
    model = Sequential([
        Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return KerasModelWrapper(model, model_id='shallow_cnn')


def create_deep_cnn(config: ModelConfig) -> KerasModelWrapper:
    """Model B: Deeper CNN with dropout."""
    model = Sequential([
        Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return KerasModelWrapper(model, model_id='deep_cnn')


# Quick comparison of parameter counts
shallow = create_shallow_cnn(ModelConfig({}))
deep = create_deep_cnn(ModelConfig({}))

print(f"Shallow CNN: {shallow.model.count_params():,} parameters")
print(f"Deep CNN: {deep.model.count_params():,} parameters")
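The counts printed above can be checked by hand. The sketch below recomputes them from the layer shapes, assuming Keras defaults for `Conv2D` (no padding, stride 1, one bias per filter) and 2x2 max pooling that rounds down. The helper names `conv2d_params` and `dense_params` are made up for this example.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return (in_features + 1) * units


# Shallow CNN: 28x28x1 -> conv -> 26x26x32 -> pool -> 13x13x32
#              -> conv -> 11x11x64 -> pool -> 5x5x64 -> flatten (1600 features)
shallow_total = (
    conv2d_params(3, 3, 1, 32)
    + conv2d_params(3, 3, 32, 64)
    + dense_params(5 * 5 * 64, 64)
    + dense_params(64, 10)
)

# Deep CNN: identical up to 5x5x64, then a third conv -> 3x3x64 -> flatten (576 features)
deep_total = (
    conv2d_params(3, 3, 1, 32)
    + conv2d_params(3, 3, 32, 64)
    + conv2d_params(3, 3, 64, 64)
    + dense_params(3 * 3 * 64, 64)
    + dense_params(64, 10)
)

print(f"Shallow CNN: {shallow_total:,} parameters")  # 121,930
print(f"Deep CNN: {deep_total:,} parameters")        # 93,322
```

Most of the shallow model's parameters sit in the first dense layer, which sees 1,600 flattened features. The deep model's third conv layer cuts that to 576, which is why the deeper network comes out smaller.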

## Run Variability Studies for Both Models

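We'll train each model 10 times on the same data and collect the final validation accuracy from each run, giving us a distribution of accuracies per architecture instead of a single number.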

In [None]:
config = ModelConfig({
    'epochs': 5,
    'batch_size': 64,
    'verbose': 0
})

data_handler = ArraysDataHandler(X_subset, y_subset)

print("Training Shallow CNN (10 runs)...")
shallow_results = run_variability_study(
    model_builder=create_shallow_cnn,
    data_handler=data_handler,
    model_config=config,
    num_runs=10,
    epochs_per_run=5
)

print("\nTraining Deep CNN (10 runs)...")
deep_results = run_variability_study(
    model_builder=create_deep_cnn,
    data_handler=data_handler,
    model_config=config,
    num_runs=10,
    epochs_per_run=5
)
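Conceptually, a variability study is a loop: train from scratch several times, change only the random seed, and record one score per run. The sketch below shows that idea in plain Python. It is not Ictonyx's implementation; `repeat_runs`, `fake_train_once`, and the accuracy numbers inside it are hypothetical stand-ins.

```python
import random
import statistics


def repeat_runs(train_once, num_runs, base_seed=0):
    """Call train_once(seed) num_runs times and collect the returned scores."""
    scores = []
    for run in range(num_runs):
        seed = base_seed + run  # a different seed for every run
        scores.append(train_once(seed))
    return scores


def fake_train_once(seed):
    """Stand-in for a real training run.

    Returns a noisy 'accuracy'. The 0.98 centre and 0.003 spread are
    made-up numbers for illustration, not measured results.
    """
    rng = random.Random(seed)
    return rng.gauss(0.98, 0.003)


scores = repeat_runs(fake_train_once, num_runs=10)
print(f"mean={statistics.mean(scores):.4f}  sd={statistics.stdev(scores):.4f}")
```

In a real study, `train_once` would seed the framework (for example with `tf.keras.utils.set_random_seed(seed)`), build a fresh model, fit it, and return the final validation accuracy.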

## Compare the Results

Now we have two distributions of accuracies. Let's see if there's a statistically significant difference.

In [None]:
shallow_accs = list(shallow_results.final_val_accuracies.values())
deep_accs = list(deep_results.final_val_accuracies.values())

print("Shallow CNN:")
print(f"  Mean: {np.mean(shallow_accs):.4f}")
print(f"  Std:  {np.std(shallow_accs, ddof=1):.4f}")  # ddof=1: sample std across runs
print(f"  Range: [{np.min(shallow_accs):.4f}, {np.max(shallow_accs):.4f}]")

print("\nDeep CNN:")
print(f"  Mean: {np.mean(deep_accs):.4f}")
print(f"  Std:  {np.std(deep_accs, ddof=1):.4f}")  # ddof=1: sample std across runs
print(f"  Range: [{np.min(deep_accs):.4f}, {np.max(deep_accs):.4f}]")

## Statistical Test

Looking at means isn't enough. We need to know whether the difference is larger than run-to-run noise would explain. Ictonyx uses the Mann-Whitney U test by default, a rank-based test that doesn't assume the accuracies are normally distributed.

In [None]:
test_result = compare_two_models(
    scores_a=shallow_accs,
    scores_b=deep_accs,
    name_a='Shallow CNN',
    name_b='Deep CNN'
)

print(f"Test: {test_result.test_name}")
print(f"p-value: {test_result.p_value:.4f}")
print(f"\nConclusion: {test_result.conclusion}")
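To see what the Mann-Whitney U test measures, here is a minimal pure-Python sketch. It counts how often a score from one group beats a score from the other, then turns that count into a two-sided p-value with the normal approximation. This is an illustration under simplifying assumptions (no tie or continuity correction), not the code Ictonyx runs, and the example scores are made up.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (u_a, p_value). u_a counts the pairs in which a value from `a`
    beats a value from `b`, with ties counted as half a win.
    No tie correction or continuity correction is applied.
    """
    n_a, n_b = len(a), len(b)
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n_a * n_b / 2  # expected U if the groups do not differ
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_a, p_value


# Made-up accuracies for illustration only
example_a = [0.981, 0.979, 0.983, 0.980, 0.982]
example_b = [0.985, 0.986, 0.984, 0.987, 0.988]

u_stat, p_val = mann_whitney_u(example_a, example_b)
print(f"U = {u_stat}, p = {p_val:.4f}")
```

With samples this small, statistics libraries usually compute an exact p-value rather than relying on the normal approximation, so expect small differences from this sketch.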

## Visualize the Comparison

A boxplot makes the overlap (or lack thereof) between the two distributions immediately clear.

In [None]:
# Prepare data for the comparison plot
comparison_data = {
    'raw_data': {
        'Shallow CNN': shallow_accs,
        'Deep CNN': deep_accs
    }
}

plot_comparison_boxplots(comparison_data)

## What We Learned

If the p-value is below 0.05, a gap this large would be unlikely if the two architectures really performed the same, so we have reasonable evidence that the difference is real and not just noise. If it's higher, we haven't shown that the models are equivalent. With only 10 runs each, we simply can't rule out that the "better" result from the deeper model was a lucky run.

This is the difference between:
- "Model B got 98.7% vs Model A's 98.2%" (anecdote)
- "Model B outperforms Model A with p < 0.01" (evidence)

The second statement is what belongs in a paper, a report, or a production decision.

---

**Effect size matters too.** Even if the difference is statistically significant, it might be tiny in practical terms. An improvement of 0.1 percentage points might not be worth the extra complexity. That's a judgment call, but at least now you have the data to make it.
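One way to put a number on effect size is to report the raw difference in means alongside Cliff's delta, a rank-based measure that pairs naturally with the Mann-Whitney U test. The sketch below assumes two plain lists of accuracies. The function name `effect_sizes` and the example scores are made up for illustration.

```python
import statistics


def effect_sizes(scores_a, scores_b):
    """Return (mean difference, Cliff's delta) for B relative to A.

    Cliff's delta is the share of (a, b) pairs where b wins minus the share
    where b loses. It ranges from -1 (B always worse) to +1 (B always better).
    """
    mean_diff = statistics.mean(scores_b) - statistics.mean(scores_a)
    wins = sum(1 for a in scores_a for b in scores_b if b > a)
    losses = sum(1 for a in scores_a for b in scores_b if b < a)
    delta = (wins - losses) / (len(scores_a) * len(scores_b))
    return mean_diff, delta


# Made-up accuracies: B is shifted up by 0.001 relative to A
example_a = [0.981, 0.979, 0.983, 0.980, 0.982]
example_b = [0.982, 0.980, 0.984, 0.981, 0.983]

mean_diff, delta = effect_sizes(example_a, example_b)
print(f"Mean difference: {mean_diff:.4f}")
print(f"Cliff's delta:   {delta:.2f}")
```

In this notebook the call would be `effect_sizes(shallow_accs, deep_accs)`. The mean difference tells you how much accuracy you gain, and the delta tells you how consistently you gain it.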