# Platform Validation for Typical Experiment Regimes:

This validation tests whether the platform produces results consistent with the typical experimental regimes described below:

| Scenario | Experiment Id | Sample Size | Experiment Duration | Expected Behavior in Experiments |
|---|---|---|---|---|
| Small Sample Size | exp_precision_small_n | 10,000 | 14 Days | Relatively Wider Confidence Intervals, Unstable Estimates |
| Large Sample Size | exp_precision_large_n | 200,000 | 14 Days | Narrow Confidence Intervals, Stable Estimates |
| Short Campaign Duration | exp_precision_short_run | 50,000 | 3 Days | Relatively Wider Confidence Intervals, Unstable Estimates |
| Long Campaign Duration | exp_precision_long_n | 200,000 | 28 Days | Narrow Confidence Intervals, Stable Estimates |

**Other experiment parameters common across all experiments:**
Note that the primary intent of these settings is to capture the realistic impact of duration and sample size on experiment results without introducing covariates that could accidentally drive the results.



```
    -  exposure_rate: 0.85
    -  treatment_share: 0.5
    -  base_conversion: 0.10
    -  effect_lift_conversion: 0.05
    -  revenue_mean: 12.5
    -  revenue_sd: 3.0
    -  pre_period_days: 14
```
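
As a rough sanity check, the parameters above can be turned into back-of-the-envelope expectations for each arm. This is only a sketch: `expected_values` is a hypothetical helper, and it assumes that the lift is additive in absolute terms, that only exposed users enter the analysis, and that only converting users generate revenue. None of these is stated explicitly in the configuration.

```python
# Parameters copied from the list above.
PARAMS = {
    "exposure_rate": 0.85,
    "treatment_share": 0.5,
    "base_conversion": 0.10,
    "effect_lift_conversion": 0.05,
    "revenue_mean": 12.5,
}


def expected_values(sample_size, params=PARAMS):
    """Back-of-the-envelope expectations for one experiment (hypothetical helper)."""
    # ASSUMPTION: only exposed users are analyzed, split by treatment_share.
    exposed = sample_size * params["exposure_rate"]
    n_treatment = exposed * params["treatment_share"]
    n_control = exposed - n_treatment
    conv_control = params["base_conversion"]
    # ASSUMPTION: the lift is additive in absolute terms (0.10 -> 0.15).
    conv_treatment = conv_control + params["effect_lift_conversion"]
    # ASSUMPTION: only converting users generate revenue.
    return {
        "n_control": n_control,
        "n_treatment": n_treatment,
        "conversion_control": conv_control,
        "conversion_treatment": conv_treatment,
        "revenue_control": conv_control * params["revenue_mean"],
        "revenue_treatment": conv_treatment * params["revenue_mean"],
    }


for size in (10_000, 200_000):
    print(size, expected_values(size))
```

Under these assumptions, the expected per-user values are a conversion rate of about 0.10 vs. 0.15 and revenue of about 1.25 vs. 1.875 for control and treatment respectively.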

## Comparing AB Results across Different Sample Sizes:

We compare two sample sizes: `10,000` and `200,000`. To reduce cognitive load, we focus only on the standard AB results and leave out CUPED and DiD.

### Results from Small Sample size:

| metric | control mean | treatment mean | delta | rel lift | CI low | CI high | p-value | method |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| conversion | 0.0930014 | 0.152668 | 0.0596667 | 0.641568 | 0.0457018 | 0.0736316 | 0 | two_proportion_z |
| pre_revenue | 12.4562 | 12.4455 | -0.0106622 | -0.00085598 | -0.139422 | 0.118097 | 0.871068 | welch_normal_approx |
| revenue | 1.17101 | 1.94678 | 0.775763 | 0.662472 | 0.592538 | 0.958989 | 0 | welch_normal_approx |

### Results from Large Sample size:

| metric | control mean | treatment mean | delta | rel lift | CI low | CI high | p-value | method |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| conversion | 0.101108 | 0.151056 | 0.0499485 | 0.494012 | 0.0468032 | 0.0530937 | 0 | two_proportion_z |
| pre_revenue | 12.4904 | 12.4999 | 0.0095004 | 0.000760617 | -0.0190107 | 0.0380115 | 0.513688 | welch_normal_approx |
| revenue | 1.26038 | 1.88826 | 0.627877 | 0.498164 | 0.587323 | 0.66843 | 0 | welch_normal_approx |


### Commentary on Results

**1. Precision Improves with Sample Size**

When increasing sample size from 10,000 to 200,000 users, confidence interval widths shrink by approximately 4.5×.

This closely matches the theoretical relationship between standard error and sample size:

$$ SE \propto \frac{1}{\sqrt{n}} $$

Since sample size increased by 20×, the standard error (and hence the CI width) is expected to shrink by a factor of:

$$ \sqrt{20} \approx 4.47 $$


The observed CI shrinkage aligns almost exactly with this prediction, demonstrating that the platform’s inference engine scales correctly with sample size.
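
The shrinkage factor can be recomputed directly from the CI bounds in the two results tables. The snippet below is a minimal sketch; the helper `ci_width` and the variable names are illustrative and not part of the platform.

```python
import math

# (CI low, CI high) copied from the results tables above.
CONVERSION_CI = {"small": (0.0457018, 0.0736316), "large": (0.0468032, 0.0530937)}
REVENUE_CI = {"small": (0.592538, 0.958989), "large": (0.587323, 0.66843)}


def ci_width(bounds):
    """Width of a confidence interval given as (low, high)."""
    low, high = bounds
    return high - low


def shrinkage(ci_by_size):
    """How many times narrower the large-sample CI is than the small-sample CI."""
    return ci_width(ci_by_size["small"]) / ci_width(ci_by_size["large"])


expected = math.sqrt(200_000 / 10_000)  # sqrt(20), about 4.47
conversion_ratio = shrinkage(CONVERSION_CI)
revenue_ratio = shrinkage(REVENUE_CI)

print(f"expected:   {expected:.2f}")
print(f"conversion: {conversion_ratio:.2f}")
print(f"revenue:    {revenue_ratio:.2f}")
```

Both observed ratios land within about 0.05 of the theoretical value.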

**2. Stabilization of Effect Estimates**

The small-sample experiment produces slightly larger delta estimates (both for conversion and revenue). This reflects higher variance in small samples, where random fluctuation can inflate observed lift.

In the large-sample experiment, estimates stabilize closer to the true underlying treatment effect.
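
One way to quantify this is to express each conversion delta's distance from the configured effect in standard-error units. The sketch below assumes that `effect_lift_conversion: 0.05` is the true absolute effect and that the reported intervals are symmetric 95% normal intervals, so the standard error can be recovered as width / (2 × 1.96). The helper name is hypothetical.

```python
Z_95 = 1.959964  # two-sided 95% normal quantile (ASSUMPTION: CIs are 95% intervals)
TRUE_EFFECT = 0.05  # ASSUMPTION: effect_lift_conversion is an absolute lift


def deviation_in_se(delta, ci_low, ci_high, true_effect=TRUE_EFFECT):
    """Distance between the estimate and the assumed true effect, in SE units."""
    se = (ci_high - ci_low) / (2 * Z_95)
    return (delta - true_effect) / se


# Conversion rows copied from the two results tables.
small_dev = deviation_in_se(0.0596667, 0.0457018, 0.0736316)
large_dev = deviation_in_se(0.0499485, 0.0468032, 0.0530937)

print(f"small sample: {small_dev:+.2f} SE from the configured effect")
print(f"large sample: {large_dev:+.2f} SE from the configured effect")
```

Under these assumptions the small-sample estimate sits a little over one standard error above the configured effect, which is an ordinary random fluctuation, while the large-sample estimate is almost exactly on target.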

**3. Guardrail Metric Behavior**

The revenue metric shows the same precision scaling behavior as the conversion metric.
Larger samples produce much narrower confidence intervals (roughly 4.5× narrower here) while the inference logic stays the same.

### Commentary on the results:
1. From an impact standpoint, the conversion deltas are fairly similar across the two experiments; however, the confidence intervals are much wider for the small sample size than for the large one.
    - The ratio of CI width to delta is `0.47` for the small sample size and `0.126` for the large sample size. A lower ratio means a tighter interval relative to the effect, so the estimates from the large experiment are more precise.
2. For the revenue guardrail, the observation is the same as for conversion: the small sample size has wider confidence intervals than the large sample size.
    - The ratio of CI width to delta is `0.472` for the small sample size and `0.129` for the large sample size. Thus, as with conversion, the revenue impact is estimated more precisely in the large experiment.
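
The ratios quoted above can be reproduced by dividing each CI width by its delta. This is a minimal sketch using values copied from the results tables; `relative_ci_width` is an illustrative name, not a platform function.

```python
def relative_ci_width(delta, ci_low, ci_high):
    """CI width divided by the point estimate; lower means more precise."""
    return (ci_high - ci_low) / delta


# (delta, CI low, CI high) copied from the results tables.
ROWS = {
    ("conversion", "small"): (0.0596667, 0.0457018, 0.0736316),
    ("conversion", "large"): (0.0499485, 0.0468032, 0.0530937),
    ("revenue", "small"): (0.775763, 0.592538, 0.958989),
    ("revenue", "large"): (0.627877, 0.587323, 0.66843),
}

ratios = {key: relative_ci_width(*row) for key, row in ROWS.items()}

for (metric, size), ratio in ratios.items():
    print(f"{metric:<10} {size:<5} {ratio:.3f}")
```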

**As expected of any realistic experimentation platform, when experiments are run with the same parameters at small and large sample sizes, the large-sample experiment provides more precise estimates.**


---

## Comparing AB Results across Different Durations:

We compare two experiment durations: `3 DAYS` and `28 DAYS`. To reduce cognitive load, we focus only on the standard AB results and leave out CUPED and DiD.

### Results from Short Duration Experiment:

| metric | control mean | treatment mean | delta | rel lift | CI low | CI high | p-value | method |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| conversion | 0.0956505 | 0.148083 | 0.0524327 | 0.548169 | 0.0462192 | 0.0586462 | 0 | two_proportion_z |
| pre_revenue | 12.5064 | 12.506 | -0.00038127 | -3.0486e-05 | -0.057365 | 0.0566025 | 0.989537 | welch_normal_approx |
| revenue | 1.19782 | 1.8549 | 0.657079 | 0.548563 | 0.57675 | 0.737407 | 0 | welch_normal_approx |

### Results from Long Duration Experiment:

| metric | control mean | treatment mean | delta | rel lift | CI low | CI high | p-value | method |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| conversion | 0.0983336 | 0.147467 | 0.0491335 | 0.499661 | 0.042895 | 0.0553721 | 0 | two_proportion_z |
| pre_revenue | 12.5345 | 12.4732 | -0.0613104 | -0.00489132 | -0.118267 | -0.00435353 | 0.0348744 | welch_normal_approx |
| revenue | 1.24209 | 1.83647 | 0.594383 | 0.478535 | 0.513739 | 0.675026 | 0 | welch_normal_approx |




### Commentary on Results:

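**1. Effect Size Stabilization**

For both conversion (NSM) and revenue (guardrail), the longer-duration experiment produces slightly smaller delta estimates than the 3-day run.
In a live experiment, this pattern could indicate early novelty effects or transient behavioral responses that moderate over time. Here, however, the two confidence intervals overlap substantially (for conversion, roughly `[0.046, 0.059]` for the short run vs. `[0.043, 0.055]` for the long run), so the difference is also consistent with sampling noise.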

**2. Precision Behavior (Conversion)**

   - Short CI width: ~0.0124
   - Long CI width: ~0.0125

Despite the longer runtime, the confidence interval width remains nearly identical.
This indicates that effective precision depends on variance and exposed sample size — not calendar duration alone.
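
To make this concrete, the sketch below computes the width of an unpooled two-proportion z interval. The formula takes only the observed rates and the per-arm user counts as inputs; duration does not appear anywhere. Several things here are assumptions: that the platform's `two_proportion_z` method uses the unpooled standard error, that the intervals are 95% intervals, and that the per-arm count is a hypothetical 21,250 (50,000 users × 0.85 exposure × 0.5 treatment share), held fixed for both runs to isolate the effect of the observed rates.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile (ASSUMPTION: 95% intervals)


def two_proportion_ci_width(p_control, n_control, p_treatment, n_treatment, z=Z_95):
    """Width of an unpooled two-proportion z confidence interval."""
    se = math.sqrt(
        p_control * (1 - p_control) / n_control
        + p_treatment * (1 - p_treatment) / n_treatment
    )
    return 2 * z * se


# HYPOTHETICAL per-arm exposed users, held fixed for both runs.
n_per_arm = round(50_000 * 0.85 * 0.5)

# Conversion rates copied from the short- and long-duration tables.
short_width = two_proportion_ci_width(0.0956505, n_per_arm, 0.148083, n_per_arm)
long_width = two_proportion_ci_width(0.0983336, n_per_arm, 0.147467, n_per_arm)

# Quadrupling the per-arm count halves the width, regardless of duration.
quadrupled_width = two_proportion_ci_width(
    0.0956505, 4 * n_per_arm, 0.148083, 4 * n_per_arm
)

print(f"short run width:     {short_width:.4f}")
print(f"long run width:      {long_width:.4f}")
print(f"short run, 4x users: {quadrupled_width:.4f}")
```

With the per-arm count held fixed, both widths come out close to the reported values of about 0.0124 and 0.0125; only a change in the number of exposed users moves them appreciably.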

**3. Precision Behavior (Revenue)**

   - Short CI width: ~0.1607
   - Long CI width: ~0.1613

Again, CI width remains stable. The longer run does not materially improve precision, suggesting that additional days did not significantly reduce variance per exposed user.

**4. Interpretation Principle**

Longer experiments often produce more stable estimates, not necessarily larger or smaller ones.
In this case, the longer run slightly moderates the effect estimate while maintaining similar uncertainty bounds.