# Epsilon-Machine Inference: Discovering Hidden Structure

This notebook demonstrates how the **emic** library can discover emergent computational structure from raw data sequences. We'll:

1. Generate sequences from various stochastic processes
2. Use the CSSR algorithm to infer epsilon-machines
3. Analyze the complexity of both true and inferred machines

The key insight: **complex patterns emerge from simple local rules**, and epsilon-machines capture the minimal hidden state structure needed to optimally predict the future.

In [None]:
# Import the emic library
import itertools
from emic.sources import GoldenMeanSource, EvenProcessSource, PeriodicSource, BiasedCoinSource
from emic.inference import CSSR, CSSRConfig
from emic.analysis import analyze

In [None]:
# Verify kernel has updated CSSR code
import inspect
from emic.inference.cssr.algorithm import CSSR

source = inspect.getsource(CSSR._initialize_partition)
has_fix = "if len(history) == 0:" in source
print(f"Kernel has CSSR fix: {'✓ YES' if has_fix else '✗ NO (restart kernel!)'}")

In [None]:
import inspect
from emic.inference.cssr.algorithm import CSSR

# Get the source of _initialize_partition and check for the fix
source = inspect.getsource(CSSR._initialize_partition)
if "if len(history) == 0:" in source:
    print("✓ CSSR has the fix (excludes empty history)")
else:
    print("✗ CSSR does NOT have the fix")

## 1. The Golden Mean Process

The Golden Mean Process has a simple rule: **no two consecutive 1s allowed**.

- After seeing a `0`, the next symbol is `0` with probability `p` and `1` otherwise
- After seeing a `1`, the next symbol **must** be `0`

This creates hidden memory structure that the machine must track.
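
To make the rule concrete, here is a minimal pure-Python sketch that does not use emic. The function `golden_mean_sketch`, its starting condition, and the reading of `p` as the probability of `0` after a `0` are assumptions made for illustration. It generates a sequence and then estimates the next-symbol distribution for each two-symbol history.

```python
import random
from collections import Counter, defaultdict

def golden_mean_sketch(n, p=0.5, seed=0):
    """Emit n symbols with no two consecutive 1s (illustrative, not emic's code)."""
    rng = random.Random(seed)
    symbols, last = [], 0  # assumption: start as if the previous symbol was 0
    for _ in range(n):
        if last == 1:
            nxt = 0  # a 1 must be followed by a 0
        else:
            nxt = 0 if rng.random() < p else 1  # assumption: p = P(0 | last was 0)
        symbols.append(nxt)
        last = nxt
    return symbols

seq = golden_mean_sketch(20000)

# Estimate P(next symbol | previous two symbols) from the data.
k = 2
counts = defaultdict(Counter)
for i in range(k, len(seq)):
    counts[tuple(seq[i - k:i])][seq[i]] += 1

for history in sorted(counts):
    total = sum(counts[history].values())
    print(history, f"P(next=1) ~ {counts[history][1] / total:.2f}")
```

The history `(1, 1)` never appears, and `(0, 1)` is never followed by a `1`. The histories `(0, 0)` and `(1, 0)` give nearly the same estimate, so only the last symbol matters for prediction. That is why two hidden states are enough for this process.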

In [None]:
# Create a Golden Mean source
golden_mean = GoldenMeanSource(p=0.5, _seed=42)

# Generate a sample sequence
sample = list(itertools.islice(golden_mean, 50))
print("Sample sequence (first 50 symbols):")
print("".join(str(s) for s in sample))
print("\nNotice: no '11' patterns!")

In [None]:
# Analyze the TRUE machine
true_machine = golden_mean.true_machine
true_summary = analyze(true_machine)

print("=== TRUE Golden Mean Machine ===")
print(f"States: {true_summary.num_states}")
print(f"Statistical Complexity (Cμ): {true_summary.statistical_complexity:.4f} bits")
print(f"Entropy Rate (hμ): {true_summary.entropy_rate:.4f} bits/symbol")
print(f"Excess Entropy (E): {true_summary.excess_entropy:.4f} bits")

In [None]:
# Infer the machine from a sequence using CSSR
# Note: Use stricter significance (0.001) to avoid spurious state splits from noise
config = CSSRConfig(max_history=5, significance=0.001, min_count=10)
cssr = CSSR(config)

# Generate training data
training_source = GoldenMeanSource(p=0.5, _seed=42)
training_data = list(itertools.islice(training_source, 10000))

# Run inference
result = cssr.infer(training_data)
inferred_machine = result.machine

print(f"Inferred machine has {len(list(inferred_machine.states))} states")
print(f"Training sequence length: {len(training_data)}")
print(f"Converged: {result.converged}")

In [None]:
# Compare true vs inferred
inferred_summary = analyze(inferred_machine)

print("\n=== Comparison: True vs Inferred ===")
print(f"{'Metric':<25} {'True':>10} {'Inferred':>10}")
print("-" * 47)
print(f"{'States':<25} {true_summary.num_states:>10} {inferred_summary.num_states:>10}")
print(
    f"{'Statistical Complexity':<25} {true_summary.statistical_complexity:>10.4f} {inferred_summary.statistical_complexity:>10.4f}"
)
print(
    f"{'Entropy Rate':<25} {true_summary.entropy_rate:>10.4f} {inferred_summary.entropy_rate:>10.4f}"
)

In [None]:
# Experiment: CSSR inference accuracy
print("=== CSSR Inference Analysis ===\n")

# Test with a looser configuration (shorter history, higher significance level)
source = GoldenMeanSource(p=0.5, _seed=42)
data = list(itertools.islice(source, 10000))

config_test = CSSRConfig(max_history=3, significance=0.1, min_count=10)
cssr_test = CSSR(config_test)
result = cssr_test.infer(data)

print(f"True Golden Mean: 2 states")
print(f"CSSR inferred:    {len(list(result.machine.states))} states")
print(f"Converged: {result.converged}")

# Analyze inferred machine
inferred = analyze(result.machine)
print(f"\nInferred Machine Metrics:")
print(f"  Statistical Complexity: {inferred.statistical_complexity:.4f} bits")
print(f"  Entropy Rate: {inferred.entropy_rate:.4f} bits/symbol")

print(f"\nTrue Machine Metrics:")
print(f"  Statistical Complexity: {true_summary.statistical_complexity:.4f} bits")
print(f"  Entropy Rate: {true_summary.entropy_rate:.4f} bits/symbol")

print("\n📊 Observation: with this looser configuration, CSSR can over-estimate the number of states")
print("   while still capturing a similar entropy rate. This is a known limitation: the algorithm")
print("   can create equivalent states that could be merged with additional post-processing.")

## 2. The Even Process

The Even Process forbids **odd runs of 1s**. So `01110` is forbidden, but `0110` and `011110` are fine.

This requires tracking parity - a truly hidden state that can't be determined from the current symbol alone.
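
The parity idea can be sketched without emic as a two-state generator. The function `even_process_sketch`, the state names, and the reading of `p` as the probability of emitting `0` from the even-parity state are hypothetical choices for illustration, not emic's implementation.

```python
import random

def even_process_sketch(n, p=0.5, seed=0):
    """Emit n symbols in which every completed run of 1s has even length."""
    rng = random.Random(seed)
    symbols = []
    state = "even"  # parity of the number of 1s seen in the current run
    for _ in range(n):
        if state == "even":
            if rng.random() < p:
                symbols.append(0)      # stay in the even-parity state
            else:
                symbols.append(1)      # start (or continue) a pair of 1s
                state = "odd"
        else:
            symbols.append(1)          # an odd run must be completed
            state = "even"
    return symbols

text = "".join(map(str, even_process_sketch(500)))

# Runs touching either end of the window may be cut short, so check interior runs only.
interior_runs = [len(r) for r in text.split("0")[1:-1] if r]
print("All interior runs even?", all(length % 2 == 0 for length in interior_runs))
```

Seeing a `1` alone does not say which state the generator is in. Only the count of 1s since the last `0` does, which is what makes the state hidden.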

In [None]:
# Even Process
even_process = EvenProcessSource(p=0.5, _seed=42)

# Sample and true machine
sample = list(itertools.islice(even_process, 50))
print("Sample sequence:")
print("".join(str(s) for s in sample))

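# Count runs of 1s to verify (skip runs touching either end of the sample,
# since the 50-symbol window may cut them short)
runs = "".join(str(s) for s in sample).split("0")
one_runs = [len(r) for r in runs[1:-1] if r]
print(f"\nInterior runs of 1s: {one_runs}")
print(f"All even? {all(r % 2 == 0 for r in one_runs)}")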

In [None]:
# Analyze even process
true_even = analyze(even_process.true_machine)

# Infer from data
even_source = EvenProcessSource(p=0.5, _seed=456)
even_data = list(itertools.islice(even_source, 10000))
inferred_even = analyze(cssr.infer(even_data).machine)

print("=== Even Process ===")
print(f"{'Metric':<25} {'True':>10} {'Inferred':>10}")
print("-" * 47)
print(f"{'States':<25} {true_even.num_states:>10} {inferred_even.num_states:>10}")
print(
    f"{'Statistical Complexity':<25} {true_even.statistical_complexity:>10.4f} {inferred_even.statistical_complexity:>10.4f}"
)
print(f"{'Entropy Rate':<25} {true_even.entropy_rate:>10.4f} {inferred_even.entropy_rate:>10.4f}")

## 3. Periodic Process (Deterministic)

A periodic process repeats a fixed pattern. For `[0, 1, 2]`, it produces `012012012...`

This is fully deterministic - zero entropy rate, but requires states to track position in the cycle.
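
As a rough hand check, assume the machine has one state per position in the cycle and that each state is visited equally often. The statistical complexity is then the entropy of a uniform distribution over three states, and the entropy rate is zero because each state emits exactly one symbol. The variable names below are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(-q * math.log2(q) for q in probs if q > 0)

period = 3
phase_probs = [1 / period] * period          # each cycle position is equally likely
emission_probs = [[1.0]] * period            # each state emits one symbol with certainty

c_mu = entropy(phase_probs)
h_mu = sum(pi * entropy(dist) for pi, dist in zip(phase_probs, emission_probs))

print(f"C_mu = {c_mu:.4f} bits (log2 of the period)")
print(f"h_mu = {h_mu:.4f} bits/symbol")
```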

In [None]:
# Periodic process
periodic = PeriodicSource(pattern=[0, 1, 2])

sample = list(itertools.islice(periodic, 15))
print("Periodic sequence:", sample)

# Analyze
true_periodic = analyze(periodic.true_machine)
print(f"\nStates: {true_periodic.num_states}")
print(f"Entropy rate: {true_periodic.entropy_rate:.4f} (should be 0 - deterministic!)")
print(f"Statistical complexity: {true_periodic.statistical_complexity:.4f} bits")

## 4. Biased Coin (IID Process)

A biased coin is the simplest case: no memory, no hidden states.

The epsilon-machine has just **one state** - no structure to discover!
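
With a single state, the statistical complexity is zero and the entropy rate reduces to the binary entropy of the coin's bias. A small sketch, where `binary_entropy` is a helper defined here and not part of emic:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands on one side with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

h_mu = binary_entropy(0.7)
c_mu = 0.0  # a one-state machine stores no information about the past

print(f"h_mu = {h_mu:.4f} bits/symbol")
print(f"C_mu = {c_mu:.4f} bits")
```

The binary entropy is symmetric in `p` and `1 - p`, so the result is the same whichever symbol the bias `0.7` refers to.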

In [None]:
# Biased coin
coin = BiasedCoinSource(p=0.7, _seed=42)

sample = list(itertools.islice(coin, 30))
print("Biased coin (p=0.7):")
print("".join(str(s) for s in sample))
print(f"Fraction of 1s: {sum(sample) / len(sample):.2f}")

# Analyze
true_coin = analyze(coin.true_machine)
print(f"\nStates: {true_coin.num_states} (minimal!)")
print(f"Statistical complexity: {true_coin.statistical_complexity:.4f} bits")
print(f"Entropy rate: {true_coin.entropy_rate:.4f} bits/symbol")

## 5. Complexity Comparison

Let's compare all processes in a single table to see how complexity varies.

In [None]:
# Summary comparison
processes = [
    ("Biased Coin", true_coin),
    ("Golden Mean", true_summary),
    ("Even Process", true_even),
    ("Periodic (3)", true_periodic),
]

print("\n" + "=" * 70)
print("COMPLEXITY COMPARISON (True Machines)")
print("=" * 70)
print(f"{'Process':<20} {'States':>8} {'Cμ (bits)':>12} {'hμ (bits)':>12}")
print("-" * 70)
for name, summary in processes:
    print(
        f"{name:<20} {summary.num_states:>8} {summary.statistical_complexity:>12.4f} {summary.entropy_rate:>12.4f}"
    )
print("=" * 70)
print("\nKey insights:")
print("• Biased Coin: 1 state, no structure (IID)")
print("• Golden Mean: 2 states encode 'just saw a 1' memory")
print("• Even Process: 2 states track parity of 1-runs")
print("• Periodic: States = period length, zero entropy (deterministic)")

## Conclusion

We've demonstrated how **emic** can:

1. **Generate** sequences from well-defined stochastic processes
2. **Infer** the hidden state structure using CSSR
3. **Analyze** complexity measures that quantify the "computational depth" of processes

The epsilon-machine representation is powerful because:
- It's **minimal**: no redundant states
- It's **optimal**: its per-symbol prediction uncertainty equals the process's entropy rate, the lowest any predictor can reach
- It **reveals structure**: hidden patterns become explicit states
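
For readers who want to see where the two headline numbers come from, here is a minimal sketch that computes them by hand for the Golden Mean machine with `p = 0.5`. The dictionary layout and the helper names `entropy` and `stationary` are assumptions made for this example and do not reflect how emic represents machines.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(-q * math.log2(q) for q in probs if q > 0)

# state -> {symbol: (probability, next_state)}
# A: last symbol was 0, B: last symbol was 1
machine = {
    "A": {0: (0.5, "A"), 1: (0.5, "B")},
    "B": {0: (1.0, "A")},
}

def stationary(machine, iterations=200):
    """Approximate the stationary state distribution by repeated propagation."""
    pi = {s: 1 / len(machine) for s in machine}
    for _ in range(iterations):
        nxt = {s: 0.0 for s in machine}
        for state, edges in machine.items():
            for prob, target in edges.values():
                nxt[target] += pi[state] * prob
        pi = nxt
    return pi

pi = stationary(machine)

# Statistical complexity: entropy of the stationary state distribution.
c_mu = entropy(pi.values())
# Entropy rate: state-weighted entropy of each state's next-symbol distribution.
h_mu = sum(pi[s] * entropy(p for p, _ in machine[s].values()) for s in machine)

print(f"C_mu = {c_mu:.4f} bits")
print(f"h_mu = {h_mu:.4f} bits/symbol")
```

Repeated propagation settles here because the Golden Mean chain is aperiodic. For a strictly periodic machine it would oscillate unless started from the uniform distribution, so a linear solve would be the safer choice in general.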

This is the foundation for studying **emergent computation** in complex systems!