# Benchmarking RISK Combat and Probability Engines

This notebook contains benchmark tests and performance comparisons for different implementations of combat simulation and probability estimation in our Python version of the RISK game.

### Goals
1. Evaluate runtime efficiency of various `estimate_win_probability` implementations
2. Compare single-battle simulation methods (`battle` functions) for speed
3. Explore vectorized vs iterative vs parallel approaches
4. Document observations and trade-offs

### Structure
1. Setup: imports, parameters, utility functions
2. Baseline implementations benchmarking
3. Vectorized implementations benchmarking
4. Parallel and compiled (Numba) implementations benchmarking
5. Summary and conclusions

---

## 1. Setup

This section prepares the environment for benchmarking the RISK combat and probability engines.

### Tested Methods

- **`battle(num_attackers, num_defenders)`**:  
  Simulates a single RISK battle from start to finish between the attacker and defender armies. It repeatedly simulates battle rounds until one side runs out of troops, returning the remaining armies for both sides.

- **`estimate_win_probability(num_attackers, num_defenders, reps)`**:  
  Uses Monte Carlo simulation by running `reps` number of full battles to estimate the probability that the attacker wins (i.e., the defender is eliminated). This method aggregates results over many simulations to provide a statistical estimate of the attacker's success chance.

### Imports and Parameters

We will import required libraries and define default parameters such as the number of troops and simulation repetitions to ensure consistency in benchmarking:


In [120]:
import time, random
import numpy as np

# Past a certain troop limit, simulations get extremely slow and we would have to rely on complicated formulas/approximations.
num_attackers = 70
num_defenders = 70

reps = 10000
num_trials = 5

### Key points for choosing number of repetitions `reps`:

**1. Monte Carlo estimate variance:**

When estimating a probability $p$ via $n$ independent trials, the number of wins follows a Binomial distribution, so the estimate $\hat{p}$ is a random variable distributed as a scaled Binomial:

$$
\hat{p} \sim \text{Binomial}(n,p) / n
$$

So the **standard error** of $\hat{p}$ is:

$$
\text{SE} = \sqrt{\frac{p(1-p)}{n}}
$$

Using this, we can build an approximate confidence interval around $\hat{p}$, where $z_{\alpha/2}$ is the standard normal critical value (1.96 for 95% confidence):

$$
\hat{p} \pm z_{\alpha/2} \times \text{SE}
$$

**2. Choosing $n$ for desired margin of error:**

If you want to estimate $p$ to within a margin of error $E$, set $E$ equal to the half-width of the confidence interval and solve for $n$:

$$
E = z_{\alpha/2} \times \sqrt{\frac{p(1-p)}{n}} \implies n = \frac{z^2_{\alpha/2}p(1-p)}{E^2}
$$

Suppose we want to estimate $p$ with a margin of error of 0.01 with a 95% confidence interval. Assuming the worst case, $p = 0.5$ (which maximizes variance $p(1-p)$):

$$
n = \frac{z^2_{0.05/2} \times 0.5 \times (1-0.5)}{0.01^2} 
\approx \frac{(1.96)^2 \times 0.25}{0.0001} 
= \boxed{9604}
$$

**3. Understanding the precision vs. speed problem**

We would need about $n = 10000$ repetitions to get what I would consider a barely passing estimate of the probability. In the actual RISK game, estimates are rounded to the nearest percent (0.01), so a margin of error of 0.5% (0.005) would make more sense ($n \approx 38416$). Unfortunately, the former is probably the best we can do: $n$ grows with $1/E^2$, so halving the margin of error quadruples the number of repetitions, and the computation time quickly becomes unacceptable. The pattern is shown in the table below:

| Margin of Error ($E$) | Required $n$ for 95% CI (worst case $p = 0.5$) |
| ------------------- | -------------------------- |
| 0.1% (0.001)        | \~1,000,000                |
| 0.5% (0.005)        | \~40,000                   |
| **1% (0.01)**       | **\~10,000**               |
| 2% (0.02)           | \~2,400                    |
| 5% (0.05)           | \~400                      |
| 10% (0.10)          | \~100                      |

**Final answer? Use `reps = 10000` for Monte Carlo estimation.**
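The table values can be reproduced with a few lines of Python. This is only a sketch of the formula $n = z^2_{\alpha/2}\,p(1-p)/E^2$; the function name `required_reps` is hypothetical, and the rounding step is there purely to absorb floating-point noise before taking the ceiling.

```python
import math

def required_reps(margin, p=0.5, z=1.96):
    """Smallest n whose margin of error is at most `margin`.

    Defaults assume the worst case p = 0.5 and 95% confidence (z = 1.96).
    """
    n_exact = z**2 * p * (1 - p) / margin**2
    # Round first so that values like 9603.9999999 do not become 9605.
    return math.ceil(round(n_exact, 6))

for margin in (0.001, 0.005, 0.01, 0.02, 0.05, 0.10):
    print(f"E = {margin:.3f} -> n >= {required_reps(margin):,}")
```

For `margin=0.01` this gives 9,604, matching the boxed value above; the table rounds each figure for readability.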

### Utility functions

I've created a **generic benchmark wrapper function** that accepts any callable (function) along with its arguments, runs it a specified number of trials, and returns summary statistics like total time, average time per call, and standard deviation:

In [121]:
def benchmark_test(func, *args, trials=num_trials, verbose=True, **kwargs):
    """
    Benchmark wrapper to time execution of a function over multiple runs.

    Args:   func (callable): Function to benchmark.
            *args: Positional arguments to pass to func.
            trials (int): Number of trials to run func.
            verbose (bool): Whether to print summary stats.
            **kwargs: Keyword arguments to pass to func.

    Returns:    dict: Summary statistics with keys:

            - total_time (float): Total time for all trials (seconds).
            - avg_time (float): Average time per call (seconds).
            - std_time (float): Standard deviation of times (seconds).
            - times (np.ndarray): Array of individual run times.
            - result: The result of the last function call.
    """
    times = []
    result = None
    for _ in range(trials):
        start = time.perf_counter()  # monotonic, high-resolution timer
        result = func(*args, **kwargs)
        end = time.perf_counter()
        times.append(end - start)
    times = np.array(times)
    summary = {
        "total_time": times.sum(),
        "avg_time": times.mean(),
        "std_time": times.std(),
        "times": times,
        "result": result
    }
    if verbose:
        print(f"Function '{func.__name__}' benchmarked over {trials} trials:")
        print(f"  Total time: {summary['total_time']:.4f} sec")
        print(f"  Average time per call: {summary['avg_time']:.6f} sec")
        print(f"  Std dev of times: {summary['std_time']:.6f} sec")
    return summary
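As a cross-check on the wrapper, the standard library's `timeit` module can time the same kind of call. The sketch below is an alternative, not the method used in this notebook; `toy_workload` is a hypothetical stand-in for the function under test. By default `timeit` measures with `time.perf_counter`, which has finer resolution than `time.time` on some platforms and matters most for sub-millisecond calls.

```python
import timeit

def toy_workload(n=10_000):
    """Hypothetical stand-in for a function under test."""
    return sum(i * i for i in range(n))

calls_per_run = 10
# Each entry in `runs` is the total time for `calls_per_run` calls.
runs = timeit.repeat(lambda: toy_workload(), repeat=5, number=calls_per_run)
per_call = [t / calls_per_run for t in runs]

print(f"best per-call time: {min(per_call):.6f} sec")
print(f"mean per-call time: {sum(per_call) / len(per_call):.6f} sec")
```

Batching several calls per measurement averages out timer granularity, at the cost of hiding call-to-call variation.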

## 2. Baseline implementations benchmarking

We define our baseline `battle` and `estimate_win_probability` functions as follows: 

In [122]:
def battle(a=num_attackers, d=num_defenders):
        """
        Simulates a full battle in accordance with RISK True Random settings.

        Args:
            a (int): Number of attackers
            d (int): Number of defenders
                
        Returns:    
            (int, int): The total remaining troops on both sides.
        """
        while a > 0 and d > 0:
            # Attacker rolls 3 dice (while they have 3+ troops)
            atk_dice = min(3, a) 
            # Defender rolls 2 dice (while they have 2+ troops)
            def_dice = min(2, d)

            att_rolls = np.random.randint(1, 7, size=atk_dice)
            def_rolls = np.random.randint(1, 7, size=def_dice)

            # Sort rolls by descending
            att_top = np.sort(att_rolls)[::-1]
            def_top = np.sort(def_rolls)[::-1]

            # Compare best dice pairs, lesser side loses one troop.
            for i in range(min(atk_dice, def_dice)):
                # If dice are a tie, defender wins.
                if att_top[i] > def_top[i]: 
                    d -= 1
                else:
                    a -= 1

        return a, d
        
def estimate_win_probability(a=num_attackers, d=num_defenders, n=reps):
    """
    Estimates the likelihood of winning an attack via Monte Carlo.

    Args:
        a (int): Number of attackers
        d (int): Number of defenders
        n (int): Number of repetitions
            
    Returns: 
        float: The estimated probability of winning, as a percentage (0-100).
    """
    wins = 0
    for _ in range(n):
        _, defenders = battle(a, d)
        if defenders == 0:
            wins += 1
    return round(wins / n * 100, 4)

These implementations are the simplest, so we would hope improvements can be made. Let's take a look at their benchmark times below:

In [123]:
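random.seed("smashthategg")
# battle() draws its dice from np.random, so NumPy's generator needs a seed too.
# np.random.seed requires an integer; 42 is an arbitrary choice.
np.random.seed(42)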

# Smaller army counts (faster)
base_battle_10_summary = benchmark_test(battle, a=10, d=10) 
base_ewp_10_summary = benchmark_test(estimate_win_probability, a=10, d=10)

# Default 70 army counts (slower)
base_battle_summary = benchmark_test(battle) 
base_ewp_summary = benchmark_test(estimate_win_probability)

# Huge battle, estimation impossible
base_battle_15k_summary = benchmark_test(battle, a=15000, d=15000)



Function 'battle' benchmarked over 5 trials:
  Total time: 0.0015 sec
  Average time per call: 0.000301 sec
  Std dev of times: 0.000602 sec
Function 'estimate_win_probability' benchmarked over 5 trials:
  Total time: 6.5049 sec
  Average time per call: 1.300974 sec
  Std dev of times: 0.029198 sec
Function 'battle' benchmarked over 5 trials:
  Total time: 0.0059 sec
  Average time per call: 0.001185 sec
  Std dev of times: 0.000229 sec
Function 'estimate_win_probability' benchmarked over 5 trials:
  Total time: 85.4915 sec
  Average time per call: 17.098297 sec
  Std dev of times: 4.331547 sec
Function 'battle' benchmarked over 5 trials:
  Total time: 2.7144 sec
  Average time per call: 0.542887 sec
  Std dev of times: 0.028830 sec


Even though `random.seed` is set for reproducibility, you **will** see different runtimes because of differences in hardware. As long as the hardware stays the same across all benchmark tests, the relative comparisons should still hold. My data and observations are as follows:

- Calls to `battle` are extremely fast, as expected for a single-simulation function. However, since our main focus `estimate_win_probability` makes repeated calls to `battle`, optimizations to this function will still be of great benefit to us.
- `estimate_win_probability`, even at low troop counts (`10` on each side), takes about **1.3 seconds** per call. At higher counts (`70`), it takes about **17 seconds**, roughly **13 times as long**. This is far too slow if we want to use this function to recreate the *Balanced Blitz* setting, and it makes training reinforcement learning bots on it **practically impossible**.
- We desperately have to find better implementations, or otherwise greatly increase the margin of error to allow for reduced repetitions at the cost of precision.
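To put the observations above in numbers, the short calculation below uses only the average times printed in the benchmark output (the variable names are illustrative). It derives the slowdown between the two army sizes and the implied cost of one simulated battle inside `estimate_win_probability`.

```python
reps = 10_000          # battles per estimate_win_probability call
avg_10 = 1.300974      # avg seconds per call, 10 vs 10 (from the output above)
avg_70 = 17.098297     # avg seconds per call, 70 vs 70 (from the output above)

slowdown = avg_70 / avg_10
per_battle_10_ms = avg_10 / reps * 1000
per_battle_70_ms = avg_70 / reps * 1000

print(f"70v70 is {slowdown:.1f}x slower than 10v10")
print(f"per battle: {per_battle_10_ms:.3f} ms (10v10), {per_battle_70_ms:.3f} ms (70v70)")
```

On my numbers this works out to roughly a 13x slowdown, with each 70 vs 70 battle costing about 1.7 ms. These figures are specific to my hardware.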

