# Ictonyx Quickstart

Train a model once and you get a number. Train it twenty times and you get a distribution. Ictonyx does the second thing.

This notebook runs in about a minute on a CPU: no GPU, no TensorFlow, no PyTorch. Just sklearn.

**Requirements:** `pip install ictonyx scikit-learn pandas`

In [1]:
# If running in Google Colab, uncomment the next line:
# !pip install ictonyx

import ictonyx as ix
print(f"Ictonyx v{ix.__version__}")

Ictonyx v0.3.2


## How variable is a model, really?

Let's find out. We'll train a Random Forest on the breast cancer dataset 20 times and look at the distribution of validation accuracy.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load data as a DataFrame — Ictonyx handles the splitting
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df['target'] = bc.target

print(f"Dataset: {len(df)} samples, {df.shape[1] - 1} features, 2 classes")

Dataset: 569 samples, 30 features, 2 classes


In [3]:
results = ix.variability_study(
    model=RandomForestClassifier,
    data=df,
    target_column='target',
    runs=20,
    seed=42,
)

print(results.summarize())

Loading and preparing data...
Loaded tabular data: 569 rows, 31 columns
Data loaded successfully

Starting Variability Study
  Runs: 20
  Epochs per run: 10
  Execution mode: in standard mode
  Seed: 42



Variability Study: 100%|████████████| 20/20 [00:21<00:00,  1.07s/run, val_accuracy=0.9825]


--------------------------------------------------

Study Summary:
  Successful runs: 20/20
  train_accuracy: 1.0000 (SD = 0.0000)
  train_loss: 0.0000 (SD = 0.0000)
  val_accuracy: 0.9798 (SD = 0.0063)
  val_loss: 0.0202 (SD = 0.0063)
Variability Study Results
Successful runs: 20
Seed: 42
train_accuracy:
  Mean: 1.0000
  Std:  0.0000
  Min:  1.0000
  Max:  1.0000
train_loss:
  Mean: 0.0000
  Std:  0.0000
  Min:  0.0000
  Max:  0.0000
val_accuracy:
  Mean: 0.9798
  Std:  0.0063
  Min:  0.9649
  Max:  0.9825
val_loss:
  Mean: 0.0202
  Std:  0.0063
  Min:  0.0175
  Max:  0.0351


In [4]:
# The raw per-run values
val_accs = results.get_metric_values("val_accuracy")
for i, acc in enumerate(val_accs, 1):
    print(f"  Run {i:2d}: {acc:.4f}")

  Run  1: 0.9825
  Run  2: 0.9825
  Run  3: 0.9825
  Run  4: 0.9825
  Run  5: 0.9649
  Run  6: 0.9825
  Run  7: 0.9825
  Run  8: 0.9825
  Run  9: 0.9825
  Run 10: 0.9825
  Run 11: 0.9825
  Run 12: 0.9825
  Run 13: 0.9825
  Run 14: 0.9825
  Run 15: 0.9649
  Run 16: 0.9825
  Run 17: 0.9825
  Run 18: 0.9825
  Run 19: 0.9649
  Run 20: 0.9825


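That standard deviation is the number most ML workflows ignore. It tells you how much your reported accuracy depends on which random seed you happened to use. Here, 17 of the 20 runs scored 0.9825 and three scored 0.9649, so a single run could have reported either number.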
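
To make the idea concrete, here is a minimal sketch of the same experiment done by hand with plain sklearn. It is not how Ictonyx is implemented. The fixed 90/10 stratified split, the seed schedule, and the helper name `repeated_val_accuracy` are all assumptions made for illustration, so the numbers will not match the study above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Assumption: one fixed, stratified 90/10 split, so only the model seed varies.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)


def repeated_val_accuracy(n_runs=20, base_seed=42):
    """Train one forest per seed and return the validation accuracies.

    Hypothetical helper for illustration; not part of Ictonyx.
    """
    scores = []
    for run in range(n_runs):
        model = RandomForestClassifier(random_state=base_seed + run)
        model.fit(X_train, y_train)
        scores.append(model.score(X_val, y_val))
    return np.array(scores)


scores = repeated_val_accuracy()

# Sample SD (ddof=1); which convention Ictonyx uses is not stated in this notebook.
print(f"mean={scores.mean():.4f}  SD={scores.std(ddof=1):.4f}  "
      f"min={scores.min():.4f}  max={scores.max():.4f}")
```

Holding the split fixed isolates the randomness inside the model. Reshuffling the split on every run as well would usually widen the distribution, because it adds data-sampling noise on top of seed noise.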

## Are two models actually different?

A common situation: Model B scores higher than Model A. Is B genuinely better, or did it just get a lucky seed? Let's test it properly.

In [5]:
from sklearn.tree import DecisionTreeClassifier

comparison = ix.compare_models(
    models=[DecisionTreeClassifier, RandomForestClassifier],
    data=df,
    target_column='target',
    runs=20,
    metric='val_accuracy',
)

print(comparison['overall_test'].get_summary())

--- Starting Comparison of 2 Models ---
Evaluating: DecisionTreeClassifier
Loading and preparing data...
Loaded tabular data: 569 rows, 31 columns
Data loaded successfully

Starting Variability Study
  Runs: 20
  Epochs per run: 10
  Execution mode: in standard mode
  Seed: 1043239339



Variability Study: 100%|████████████| 20/20 [00:19<00:00,  1.02run/s, val_accuracy=0.9298]


--------------------------------------------------

Study Summary:
  Successful runs: 20/20
  train_accuracy: 1.0000 (SD = 0.0000)
  train_loss: 0.0000 (SD = 0.0000)
  val_accuracy: 0.9439 (SD = 0.0205)
  val_loss: 0.0561 (SD = 0.0205)
Evaluating: RandomForestClassifier
Loading and preparing data...
Loaded tabular data: 569 rows, 31 columns
Data loaded successfully

Starting Variability Study
  Runs: 20
  Epochs per run: 10
  Execution mode: in standard mode
  Seed: 866708919



Variability Study: 100%|████████████| 20/20 [00:20<00:00,  1.01s/run, val_accuracy=0.9825]


--------------------------------------------------

Study Summary:
  Successful runs: 20/20
  train_accuracy: 1.0000 (SD = 0.0000)
  train_loss: 0.0000 (SD = 0.0000)
  val_accuracy: 0.9833 (SD = 0.0038)
  val_loss: 0.0167 (SD = 0.0038)
Kruskal-Wallis H-Test: 30.615, p=0.0000 ***, epsilon-squared=0.779


In [6]:
# Pairwise detail — effect size tells you if the difference matters in practice
for name, result in comparison['pairwise_comparisons'].items():
    print(f"{name}:")
    print(f"  p-value:     {result.p_value:.4f}")
    print(f"  Effect size: {result.effect_size:.3f} ({result.effect_size_interpretation})")
    print(f"  Conclusion:  {result.conclusion}")

DecisionTreeClassifier_vs_RandomForestClassifier:
  p-value:     0.0000
  Effect size: 0.953 (large)
  Conclusion:  Mann-Whitney U test indicates a statistically significant difference between groups (p=0.0000) with large effect size (rank-biserial correlation=0.953)
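
If you want to see what tests like these look like without Ictonyx, the sketch below runs them with SciPy on accuracies collected by hand. It is an illustration under assumptions, not Ictonyx's internals: the fixed split, the seeds, and the helper `val_scores` are hypothetical, and the effect-size formula is simply one that reproduces the 0.779 printed above from H = 30.615 with 2 groups and 40 runs.

```python
from scipy.stats import kruskal, mannwhitneyu
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumption: one fixed, stratified 90/10 split shared by both models.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)


def val_scores(model_cls, n_runs=20, base_seed=0):
    """Validation accuracy of model_cls over n_runs seeds (hypothetical helper)."""
    scores = []
    for run in range(n_runs):
        model = model_cls(random_state=base_seed + run)
        model.fit(X_train, y_train)
        scores.append(model.score(X_val, y_val))
    return scores


tree_scores = val_scores(DecisionTreeClassifier)
forest_scores = val_scores(RandomForestClassifier)

# Omnibus test across all groups (here only two).
h_stat, h_p = kruskal(tree_scores, forest_scores)

# Pairwise test. With SciPy >= 1.7 the statistic is U for the first sample.
u_stat, u_p = mannwhitneyu(forest_scores, tree_scores, alternative="two-sided")

# Rank-biserial correlation: +1 means every forest run beat every tree run.
n1, n2 = len(forest_scores), len(tree_scores)
rank_biserial = 2 * u_stat / (n1 * n2) - 1

# Effect size from H, assuming the form (H - k + 1) / (n - k).
k, n = 2, n1 + n2
h_effect = (h_stat - k + 1) / (n - k)

print(f"Kruskal-Wallis: H={h_stat:.3f}, p={h_p:.4g}, effect={h_effect:.3f}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4g}, rank-biserial={rank_biserial:.3f}")
```

Both tests are rank-based, so they make no normality assumption. That suits accuracy scores like these, which pile up on a handful of discrete values.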


That's the core of Ictonyx: **distributions instead of point estimates, and formal tests instead of eyeballing.**

---

For deeper examples — Keras CNNs, PyTorch models, learning rate sweeps, multi-model comparisons — see the [examples/](.) directory.