# Statistical Analysis - Testing the Results

**Ridham Patel | January 2026**

## What I'm Doing

Now that I've simulated the A/B test, I need to analyze the results to see if the Vibe Shift feature actually works.

My analysis plan:
- Test if treatment increases engagement overall (t-test)
- Break it down by segment to see who benefits most
- Calculate effect sizes (Cohen's d) to measure how BIG the impact is
- Check guardrail metrics to make sure we're not breaking the user experience


## Imports and Loading Data

In [19]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

In [20]:
df = pd.read_csv('playlist_complete_metrics.csv')


In [21]:
control_df = df[df['experiment_group'] == 'Control'].copy()
treatment_df = df[df['experiment_group'] == 'Treatment'].copy()


## Overall Treatment Effect - Does It Work?

First, let me test if the treatment works overall (pooling all segments together).

**My hypotheses:**
- **Null (H₀):** Treatment has no effect - engagement is the same in both groups
- **Alternative (H₁):** Treatment increases engagement

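**Test:** Two-sample t-test (comparing Control vs Treatment means). `stats.ttest_ind` is two-sided by default, so the p-value below is conservative for the directional H₁.  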
**Significance level:** 0.05 (standard threshold we use in stats)
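
A note on the test setup: my H₁ is directional, but `stats.ttest_ind` defaults to a two-sided test that assumes equal variances. Here's a minimal sketch of the one-sided Welch variant for comparison. It runs on synthetic stand-in data (the `control` and `treatment` arrays below are made up, not the experiment's data), and the `alternative` argument needs SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the experiment's data): two groups with
# different means and different spreads.
rng = np.random.default_rng(0)
control = rng.normal(loc=7.0, scale=2.0, size=200)
treatment = rng.normal(loc=7.6, scale=2.6, size=200)

# SciPy defaults: two-sided test with the equal-variance assumption.
t_two, p_two = stats.ttest_ind(treatment, control)

# One-sided Welch test: H1 is "treatment mean > control mean", and
# equal_var=False drops the equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(
    treatment, control, equal_var=False, alternative="greater"
)

print(f"two-sided pooled: t={t_two:.3f}, p={p_two:.4f}")
print(f"one-sided Welch:  t={t_welch:.3f}, p={p_welch:.4f}")
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom and the tail used for the p-value differ.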


In [29]:


control_engagement = control_df['post_treatment_engagement']
treatment_engagement = treatment_df['post_treatment_engagement']

t_stat, p_value = stats.ttest_ind(treatment_engagement, control_engagement)
control_mean = control_engagement.mean()
treatment_mean = treatment_engagement.mean()
diff = treatment_mean - control_mean
pct_lift = (diff / control_mean) * 100

print(f"\nControl mean: {control_mean:.2f} songs/session")
print(f"Treatment mean: {treatment_mean:.2f} songs/session")
# {diff:+.2f} prints the correct sign even if the difference is negative
print(f"Difference: {diff:+.2f} songs/session ({pct_lift:+.1f}%)")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("\nCONCLUSION: Treatment significantly increases engagement overall (p<0.05)")
else:
    print("\nCONCLUSION: No significant overall effect (p≥0.05)")


Control mean: 6.86 songs/session
Treatment mean: 7.52 songs/session
Difference: +0.67 songs/session (+9.7%)
t-statistic: 3.565
p-value: 0.0004

CONCLUSION: Treatment significantly increases engagement overall (p<0.05)


### What This Means

**Statistically significant (p=0.0004):**  
If the treatment truly had no effect, we'd see a difference this large only about 0.04% of the time. That's strong evidence against the null hypothesis.

**Practically meaningful (+0.67 songs/session):**  
Users like about 2 extra songs every 3 sessions. That's a 9.7% increase.

**Bottom line:** Pooled across all segments, the Vibe Shift feature works. Recommending opposite music increases average engagement, though the pooled number can hide differences between segments.
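
To make the p-value interpretation concrete, here's a small permutation-test sketch. It shuffles the group labels many times and counts how often a shuffled difference is at least as large as the observed one, which is what a p-value estimates under H₀. The data is synthetic (not the experiment's), and the names `n_shuffles` and `p_perm` are just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data (NOT the experiment's data).
control = rng.normal(loc=7.0, scale=2.5, size=150)
treatment = rng.normal(loc=7.7, scale=2.5, size=150)

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])
n_treat = len(treatment)

n_shuffles = 5000
extreme = 0
for _ in range(n_shuffles):
    # Shuffling breaks any real link between group label and outcome,
    # which simulates a world where H0 is true.
    shuffled = rng.permutation(pooled)
    fake_diff = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()
    if abs(fake_diff) >= abs(observed):
        extreme += 1

# Add-one correction keeps the estimate strictly above zero.
p_perm = (extreme + 1) / (n_shuffles + 1)
print(f"observed diff: {observed:+.2f}, permutation p-value: {p_perm:.4f}")
```

The p-value is the share of "no effect" worlds that look at least as extreme as the real data. It is not the probability that the treatment doesn't work.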


## Breaking It Down by Segment

The overall effect is positive, but my hypothesis was that the feature would work differently for different user segments.

**What I expect:**
- Narrow users (most stuck) → Biggest boost
- Moderate users (balanced) → Medium boost
- Wide users (already exploring) → Little to no effect


In [None]:

segments = ['Narrow Comfort Zone', 'Moderate Comfort Zone', 'Wide Comfort Zone']
segment_results = []

for segment in segments:
    print(f"\n{segment}:\n")
    
    control_seg = control_df[control_df['segment'] == segment]['post_treatment_engagement']
    treatment_seg = treatment_df[treatment_df['segment'] == segment]['post_treatment_engagement']
    
    t_stat, p_value = stats.ttest_ind(treatment_seg, control_seg)
    
    control_mean = control_seg.mean()
    treatment_mean = treatment_seg.mean()
    diff = treatment_mean - control_mean
    pct_lift = (diff / control_mean) * 100
    
    pooled_std = np.sqrt(((len(control_seg)-1)*control_seg.std()**2 + (len(treatment_seg)-1)*treatment_seg.std()**2) / (len(control_seg) + len(treatment_seg) - 2))
    cohens_d = diff / pooled_std
    
    if abs(cohens_d) < 0.2:
        effect_interp = "Negligible"
    elif abs(cohens_d) < 0.5:
        effect_interp = "Small"
    elif abs(cohens_d) < 0.8:
        effect_interp = "Medium"
    else:
        effect_interp = "Large"
    
    se = np.sqrt(control_seg.var()/len(control_seg) + treatment_seg.var()/len(treatment_seg))
    ci_lower = diff - 1.96 * se
    ci_upper = diff + 1.96 * se
    
    print(f"  Control: {control_mean:.2f} songs/session (n={len(control_seg)})")
    print(f"  Treatment: {treatment_mean:.2f} songs/session (n={len(treatment_seg)})")
    print(f"  Lift: {diff:+.2f} songs ({pct_lift:+.1f}%)")
    print(f"  95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
    print(f"  t-statistic: {t_stat:.3f}")
    print(f"  p-value: {p_value:.4f}")
    print(f"  Cohen's d: {cohens_d:.3f} ({effect_interp})")
    print(f"  Significant: {'YES' if p_value < 0.05 else 'NO'}")
    
    # Keep one row of results per segment for later export
    segment_results.append({
        'segment': segment,
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'diff': diff,
        'pct_lift': pct_lift,
        'p_value': p_value,
        'cohens_d': cohens_d,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'significant': p_value < 0.05,
    })



Narrow Comfort Zone:

  Control: 5.12 songs/session (n=69)
  Treatment: 5.64 songs/session (n=70)
  Lift: +0.52 songs (+10.1%)
  95% CI: [0.05, 0.98]
  t-statistic: 2.190
  p-value: 0.0302
  Cohen's d: 0.372 (Small)
  Significant: YES

Moderate Comfort Zone:

  Control: 7.15 songs/session (n=110)
  Treatment: 8.00 songs/session (n=111)
  Lift: +0.85 songs (+11.9%)
  95% CI: [0.42, 1.28]
  t-statistic: 3.881
  p-value: 0.0001
  Cohen's d: 0.522 (Medium)
  Significant: YES

Wide Comfort Zone:

  Control: 8.44 songs/session (n=55)
  Treatment: 8.95 songs/session (n=55)
  Lift: +0.51 songs (+6.0%)
  95% CI: [-0.15, 1.17]
  t-statistic: 1.518
  p-value: 0.1320
  Cohen's d: 0.289 (Small)
  Significant: NO
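
The Cohen's d and confidence-interval math inside the loop can be pulled into small helpers. This is a sketch of the same formulas, not a replacement for the cell above. The function names `cohens_d` and `welch_ci` are my own, and the tiny arrays are made-up numbers chosen so the result is easy to check by hand.

```python
import numpy as np

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample std."""
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    n_t, n_c = len(treatment), len(control)
    # ddof=1 gives the sample variance, matching pandas' default .std()/.var()
    # (NumPy's own default is ddof=0).
    pooled_var = (
        (n_t - 1) * treatment.var(ddof=1) + (n_c - 1) * control.var(ddof=1)
    ) / (n_t + n_c - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

def welch_ci(treatment, control, z=1.96):
    """Approximate 95% CI for the mean difference (normal critical value)."""
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(
        treatment.var(ddof=1) / len(treatment)
        + control.var(ddof=1) / len(control)
    )
    return diff - z * se, diff + z * se

# Hand-checkable example: means 4 and 3, both sample variances equal 4,
# so the pooled std is 2 and d = 1 / 2.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(f"Cohen's d: {d:.3f}")  # 0.500
```

The 1.96 multiplier is the normal approximation. For groups as small as n=55, a t critical value would widen the interval slightly.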


### Results by Segment

Interpretation of what I found:

**Narrow Comfort Zone - Weaker Than Expected**  
+10.1% lift (p=0.030, Cohen's d=0.37)

The effect is only just significant, with p just below the 0.05 threshold. The confidence interval [0.05, 0.98] barely excludes zero - meaning the true effect could be quite small.

Effect size is small by Cohen's standards (d<0.5). This is surprising - I expected the biggest effect here since these users are most stuck in their comfort zones.

**Moderate Comfort Zone - The Sweet Spot!**  
+11.9% lift (p<0.001, Cohen's d=0.52)

Highly significant (p<0.001 means we're extremely confident this is real). Medium effect size (0.5<d<0.8) - meaningful impact. 

The confidence interval [0.42, 1.28] sits well above zero, so even the low end of the range is a meaningful effect.

These users are stuck enough to benefit from disruption, but not so stuck that opposite recs are jarring.

**Wide Comfort Zone - No Significant Effect**  
+6.0% lift (p=0.13, Cohen's d=0.29)

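Not statistically significant (p>0.05 - could just be random noise). The confidence interval [-0.15, 1.17] includes zero, so we can't be confident there's any real effect. Note that the point estimate (+0.51 songs) is almost identical to Narrow's (+0.52), but this is the smallest segment (n=55 per group), so the test has the least power here.

A small or absent effect would make sense - these users are already exploring diverse music, so opposite recommendations don't add much value. Still, the data can't separate "no effect" from "an effect this sample is too small to detect".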

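**Key insight:** The feature appears to work best for Moderate users, not Narrow users as I initially hypothesized. This is interesting and suggests Narrow users might find opposite recs too disruptive. One caveat: the segment confidence intervals overlap, and I haven't directly tested whether the lifts differ between segments, so this ranking is suggestive rather than proven.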
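
Whether Moderate really beats Narrow can be checked directly. Here's a rough sketch that recovers each segment's standard error from its reported 95% CI and runs a z-test on the gap between the two lifts. The numbers are copied from the rounded output above, so the result is approximate, and the helper name `se_from_ci` is my own.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Back out the standard error from a symmetric normal-based CI."""
    return (upper - lower) / (2 * z)

# Rounded values copied from the segment output above.
narrow = {"diff": 0.52, "ci": (0.05, 0.98)}
moderate = {"diff": 0.85, "ci": (0.42, 1.28)}

se_narrow = se_from_ci(*narrow["ci"])
se_moderate = se_from_ci(*moderate["ci"])

# The two segments are independent samples, so their variances add.
gap = moderate["diff"] - narrow["diff"]
se_gap = math.sqrt(se_narrow**2 + se_moderate**2)
z = gap / se_gap

# Two-sided p-value from the standard normal: erfc(|z|/sqrt(2)) = 2*(1 - Phi(|z|)).
p_gap = math.erfc(abs(z) / math.sqrt(2))

print(f"gap in lift: {gap:+.2f} songs, z={z:.2f}, p={p_gap:.2f}")
```

On these rounded inputs, z comes out near 1.0 and p near 0.3, so the Moderate-vs-Narrow gap in engagement lift is not statistically distinguishable with this sample.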

## Secondary Metric Analysis: Discovery Diversity

This checks whether the feature actually increases music discovery.

In [33]:

for segment in segments:
    control_seg = control_df[control_df['segment'] == segment]['post_diversity']
    treatment_seg = treatment_df[treatment_df['segment'] == segment]['post_diversity']
    
    t_stat, p_value = stats.ttest_ind(treatment_seg, control_seg)
    
    control_mean = control_seg.mean()
    treatment_mean = treatment_seg.mean()
    pct_lift = ((treatment_mean - control_mean) / control_mean) * 100
    
    print(f"\n{segment}:")
    print(f"  Lift: {pct_lift:+.1f}% (p={p_value:.4f}) {': Significant' if p_value < 0.05 else ': Not Significant'}")



Narrow Comfort Zone:
  Lift: +12.9% (p=0.0000) : Significant

Moderate Comfort Zone:
  Lift: +8.6% (p=0.0000) : Significant

Wide Comfort Zone:
  Lift: +5.3% (p=0.0144) : Significant
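
A `p=0.0000` in the output doesn't mean the p-value is exactly zero. It's just smaller than four decimals can show. Here's a small sketch of a formatting helper that reports those cases as `p<0.0001` instead; the name `format_p` and the `floor` parameter are my own.

```python
def format_p(p_value, floor=0.0001):
    """Format a p-value, reporting tiny values as 'p<floor' instead of 0.0000."""
    if p_value < floor:
        return f"p<{floor:.4f}"
    return f"p={p_value:.4f}"

print(format_p(3e-7))    # p<0.0001
print(format_p(0.0144))  # p=0.0144
```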


### What This Tells Me

Opposite recommendations successfully push all users to explore more diverse music.

**The pattern makes sense:**
- Narrow users (most stuck) → Largest discovery boost (+12.9%)
- Moderate users (somewhat stuck) → Medium boost (+8.6%)
- Wide users (already diverse) → Smaller boost (+5.3%)

Users with narrower tastes have more room to expand, so they benefit most from being pushed outside their comfort zones.

**Interesting:** Even though Narrow users discover the most new music (+12.9%), they don't convert that into engagement as much as Moderate users do. This suggests discovery alone isn't enough - the recommendations also need to feel accessible, not jarring.

The Vibe Shift feature achieves its intended goal of increasing music discovery, with strongest effects for users who need it most.



## Checking Guardrail Metrics - Making Sure We Didn't Break Anything

So far the results look good - engagement is up, discovery is working. But before recommending launch, I need to check the guardrail metrics, because a feature could increase engagement and still be a bad idea if it frustrates users. Guardrails catch these problems.



In [37]:

print("\nGuardrail 1: Session Length\n")
session_violations = []
for segment in segments:
    control_seg = control_df[control_df['segment'] == segment]['post_session_length']
    treatment_seg = treatment_df[treatment_df['segment'] == segment]['post_session_length']
    
    t_stat, p_value = stats.ttest_ind(treatment_seg, control_seg)
    pct_change = ((treatment_seg.mean() - control_seg.mean()) / control_seg.mean()) * 100
    
    if pct_change < -5 and p_value < 0.05:
        status = "VIOLATION (significant)"
        session_violations.append((segment, pct_change, p_value))
    elif pct_change < -5 and p_value >= 0.05:
        status = "WARNING (not significant)"
    else:
        status = "Safe"
    
    print(f"  {segment}: {pct_change:+.1f}% (p={p_value:.4f}) {status}")

print("\nGuardrail 2: Skip Rate\n")
skip_violations = []
for segment in segments:
    control_seg = control_df[control_df['segment'] == segment]['post_skip_rate']
    treatment_seg = treatment_df[treatment_df['segment'] == segment]['post_skip_rate']
    
    t_stat, p_value = stats.ttest_ind(treatment_seg, control_seg)
    pp_change = (treatment_seg.mean() - control_seg.mean()) * 100
    
    if pp_change > 10 and p_value < 0.05:
        status = "VIOLATION (significant)"
        skip_violations.append((segment, pp_change, p_value))
    elif pp_change > 10 and p_value >= 0.05:
        status = "WARNING (not significant)"
    else:
        status = "Safe"
    
    print(f"  {segment}: {pp_change:+.1f}pp (p={p_value:.4f}) {status}")




Guardrail 1: Session Length

  Narrow Comfort Zone: +1.2% (p=0.5996) Safe
  Moderate Comfort Zone: -0.0% (p=0.9955) Safe
  Wide Comfort Zone: -4.7% (p=0.0751) Safe

Guardrail 2: Skip Rate

  Narrow Comfort Zone: +11.9pp (p=0.0000) VIOLATION (significant)
  Moderate Comfort Zone: +2.8pp (p=0.0004) Safe
  Wide Comfort Zone: -0.1pp (p=0.9439) Safe


Here is what I found:

**1. Session Length (threshold: can't drop by more than 5%)**

No segment breaches the threshold:
- Narrow: +1.2% (actually stayed slightly longer)
- Moderate: 0.0% (no change)
- Wide: -4.7% (p=0.075, not significant), but this sits just inside the -5% limit

No guardrail violation here, though the Wide result is close enough to the limit that it's worth monitoring.

**2. Skip Rate (threshold: can't increase by more than 10 percentage points)**

**Narrow users: violated the threshold**  
Skip rate jumped +11.9pp (from 35% to 46.9%), and this is statistically significant (p<0.001).

This means Narrow users are skipping way more songs. The likely explanation is that the opposite recommendations are too jarring for them - they have to skip through a bunch of songs before finding something they like.

**Moderate users: Safe**  
+2.8pp increase (statistically significant but below our 10pp threshold). Some extra skipping, but acceptable given the +11.9% engagement boost.

**Wide users: Safe**  
Basically no change (-0.1pp).

**The critical finding:** Even though Narrow users show some engagement lift (+10.1%), they're having a frustrating experience (excessive skipping). This violates our guardrail and means we shouldn't launch for this segment.
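
Both guardrail loops apply the same decision rule, so it can be written once. This is a sketch of that rule as a reusable function, checked against the numbers printed above. The function name `guardrail_status` and its parameters are my own naming, not part of the original cells.

```python
def guardrail_status(change, p_value, threshold, harmful_direction, alpha=0.05):
    """Classify a guardrail metric as Safe, WARNING, or VIOLATION.

    change            -- observed change (percent or percentage points)
    threshold         -- positive size of the largest acceptable harmful move
    harmful_direction -- "down" if decreases are bad, "up" if increases are bad
    """
    if harmful_direction == "down":
        breached = change < -threshold
    elif harmful_direction == "up":
        breached = change > threshold
    else:
        raise ValueError("harmful_direction must be 'down' or 'up'")

    if not breached:
        return "Safe"
    if p_value < alpha:
        return "VIOLATION (significant)"
    return "WARNING (not significant)"

# Values copied from the guardrail output above.
print(guardrail_status(-4.7, 0.0751, threshold=5, harmful_direction="down"))  # Safe
print(guardrail_status(11.9, 0.0000, threshold=10, harmful_direction="up"))   # VIOLATION (significant)
print(guardrail_status(2.8, 0.0004, threshold=10, harmful_direction="up"))    # Safe
```

Note that a change only counts as a violation when it both crosses the threshold and is statistically significant, so a significant but small move (like Moderate's +2.8pp) stays Safe.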


## Export

In [38]:

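# Save the per-segment test results collected in segment_results
# (the raw data already lives in playlist_complete_metrics.csv)
pd.DataFrame(segment_results).to_csv('statistical_results.csv', index=False)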
