# Experiment Design - Simulating the A/B Test

**Ridham Patel | January 2026**

## What I'm Doing

Now that I have user segments from my EDA, I need to design and simulate an A/B test to see if my Vibe Shift feature actually works.

The plan:
- Randomly assign users to Control (normal recs) vs Treatment (opposite recs)
- Make sure each segment is balanced across both groups (stratified randomization from my ML course)
- Simulate what would happen to engagement
- Check for unintended consequences (guardrail metrics)

**Important note:** This is a simulated experiment for my portfolio. In production, we'd run this on real users and measure actual behavior. But for now I'm creating realistic synthetic data to demonstrate the experimental design process.



# Imports and Data Loading

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [29]:
df = pd.read_csv("playlist_comfort_zones.csv")

In [30]:
df.head()

Unnamed: 0,playlist_id,valence,energy,comfort_zone_score,segment
0,0275i1VNfBnsNbPl0QIBpG,0.204712,0.136614,0.170663,Moderate Comfort Zone
1,03qQtbNHoJuFezRu2CnLuF,0.235168,0.197765,0.216467,Wide Comfort Zone
2,03sDEv7FN58Mb9CJOs1Tgn,0.192502,0.109796,0.151149,Moderate Comfort Zone
3,06zrBJ5cts5aemZmqe80J7,0.188636,0.120561,0.154599,Moderate Comfort Zone
4,07SNJ4MwYba9wwmzrbjmYi,0.203764,0.133636,0.1687,Moderate Comfort Zone


# Stratified Random Assignment

Stratifying by segment ensures that each segment is equally represented in both the Treatment and Control groups.

### Methodology:
- Stratification variable: Segment (Narrow/Moderate/Wide)
- Assignment ratio: 50/50 (Control/Treatment)
- Randomization: random permutation of playlist indices within each segment, then a split at the midpoint
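The methodology above can be sketched as a single helper. This is a minimal sketch, not the exact cell used in this notebook: the function name `stratified_assign`, the toy segment labels, and the coin flip for odd-sized segments are all assumptions added for illustration. The coin flip matters only when a segment has an odd number of playlists, where a plain `n // 2` split would always hand the extra playlist to Treatment.

```python
import numpy as np
import pandas as pd

def stratified_assign(frame, strata_col, rng):
    """Assign Control/Treatment roughly 50/50 within each stratum.

    Hypothetical helper: for an odd-sized stratum, the extra row goes to a
    randomly chosen arm so neither arm is systematically larger.
    """
    groups = pd.Series(index=frame.index, dtype=object)
    for _, idx in frame.groupby(strata_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        half = len(shuffled) // 2
        # Coin flip decides which arm receives the odd row, if there is one.
        n_control = half + int(len(shuffled) % 2 == 1 and rng.random() < 0.5)
        groups.loc[shuffled[:n_control]] = "Control"
        groups.loc[shuffled[n_control:]] = "Treatment"
    return groups

# Toy data with made-up segment sizes (5, 8, 3), not the real playlists.
rng = np.random.default_rng(123)
toy = pd.DataFrame({"segment": ["Narrow"] * 5 + ["Moderate"] * 8 + ["Wide"] * 3})
toy["experiment_group"] = stratified_assign(toy, "segment", rng)

counts = pd.crosstab(toy["segment"], toy["experiment_group"])
print(counts)
```

In this dataset all three segments have even sizes (272, 120, 78), so the simple midpoint split already gives an exact 50/50 balance.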


In [31]:
random_seed = 123
# 42
np.random.seed(random_seed)


In [32]:
print("Current Segment Distribution")
segment_counts = df['segment'].value_counts().sort_index()
print(segment_counts)
print(f"\nTotal playlists: {len(df)}")

Current Segment Distribution
segment
Moderate Comfort Zone    272
Narrow Comfort Zone      120
Wide Comfort Zone         78
Name: count, dtype: int64

Total playlists: 470


In [33]:
def assign_experiment_group(segment_df):
    n = len(segment_df)
    indices = segment_df.index.tolist()

    shuffled_indices = np.random.permutation(indices)

    split_point =  n // 2
    control_indices = shuffled_indices[:split_point]
    treatment_indices = shuffled_indices[split_point:]

    # dtype=object avoids a dtype warning when assigning string labels
    assignment = pd.Series(index=indices, dtype=object)

    assignment.loc[control_indices] = 'Control'
    assignment.loc[treatment_indices] = 'Treatment'

    return assignment

In [34]:
#Performing stratified random assignment

for segment in df['segment'].unique():
    seg_mask = df['segment'] == segment
    segment_df = df[seg_mask]
    assignments = assign_experiment_group(segment_df)
    df.loc[seg_mask, 'experiment_group'] = assignments

   



In [35]:
print("Crosstab")

crosstab = pd.crosstab(df['segment'], df['experiment_group'], margins=True,margins_name='Total')
print("\n", crosstab)

print("\nPercentage Split within each segment")

crosstab_pct = pd.crosstab(df['segment'], df['experiment_group'], normalize='index') * 100
print("\n", crosstab_pct.round(1))

Crosstab

 experiment_group       Control  Treatment  Total
segment                                         
Moderate Comfort Zone      136        136    272
Narrow Comfort Zone         60         60    120
Wide Comfort Zone           39         39     78
Total                      235        235    470

Percentage Split within each segment

 experiment_group       Control  Treatment
segment                                  
Moderate Comfort Zone     50.0       50.0
Narrow Comfort Zone       50.0       50.0
Wide Comfort Zone         50.0       50.0


In [36]:
print("Verification: Total Counts")

total_control = (df['experiment_group'] == 'Control').sum()
total_treatment = (df['experiment_group'] == 'Treatment').sum()

print(f"\nSplit: {total_control/len(df)*100:.1f}% Control, {total_treatment/len(df)*100:.3f}% Treatment")

# Check for any missing assignments
if df['experiment_group'].isna().any():
    print("\nWARNING: Some playlists were not assigned!")
    print(f"Missing: {df['experiment_group'].isna().sum()}")
else:
    print("\nAll playlists successfully assigned!")

Verification: Total Counts

Split: 50.0% Control, 50.000% Treatment

All playlists successfully assigned!


- Perfect balance achieved across all segments

### Sanity Check - Are the Groups Balanced?

Before running the experiment, I need to verify that randomization worked properly. Let me check if Control and Treatment groups are similar on key variables:

**Comfort zone score** - This is what I used for segmentation, so it should be balanced. If Treatment has more stuck users on average, that would bias the results.

**Song count** - Bigger playlists = more opportunity to engage. Need to make sure one group doesn't have systematically larger playlists.

If these are balanced, differences in outcomes can be attributed to the treatment rather than to pre-existing differences between the groups.
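For the song count check, it helps to keep one row per playlist, so every playlist counts once regardless of its size. A minimal sketch, assuming toy tables (`playlists`, `songs`) that stand in for the real files:

```python
import pandas as pd

# Hypothetical toy tables standing in for the playlist and song files.
playlists = pd.DataFrame({
    "playlist_id": ["a", "b", "c", "d"],
    "experiment_group": ["Control", "Control", "Treatment", "Treatment"],
})
songs = pd.DataFrame({"playlist_id": ["a"] * 3 + ["b"] * 1 + ["c"] * 2 + ["d"] * 2})

# Count songs first, then attach the count to the one-row-per-playlist table.
song_counts = songs.groupby("playlist_id").size().rename("song_count")
per_playlist = playlists.merge(
    song_counts, left_on="playlist_id", right_index=True, how="left"
)

summary = per_playlist.groupby("experiment_group")["song_count"].agg(
    ["mean", "std", "count"]
)
print(summary)
# Control mean is (3 + 1) / 2 = 2.0 here. Averaging over song-level rows
# instead would give (3*3 + 1*1) / 4 = 2.5, because big playlists repeat.
```

With this layout the `count` column equals the number of playlists in each group, which is an easy way to confirm the table has the intended grain.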

In [39]:
df2 = pd.read_csv("spotify.csv")

df3 = df.merge(df2, on = "playlist_id")

In [40]:
# Count songs per playlist
song_counts = df3.groupby('playlist_id').size().reset_index(name='song_count')

df4 = df3.merge(song_counts, on = "playlist_id")

In [41]:
print("Check for Balance")


print("\n Average Comfort Zone Score:")
balance_check = df.groupby('experiment_group')['comfort_zone_score'].agg(['mean', 'std', 'count'])
print(balance_check.round(4))

difference = abs(balance_check.loc['Control', 'mean'] - balance_check.loc['Treatment', 'mean'])
print(f"Difference: {difference:.4f} {'GOOD' if difference < 0.005 else 'CHECK'}")

print("\n Average Song Count:")
balance_check_songs = df4.groupby('experiment_group')['song_count'].agg(['mean', 'std', 'count'])
print(balance_check_songs.round(1))

difference_songs = abs(balance_check_songs.loc['Control', 'mean'] - balance_check_songs.loc['Treatment', 'mean'])
print(f"Difference: {difference_songs:.1f} {'GOOD' if difference_songs < 2 else 'CHECK'}")



Check for Balance

 Average Comfort Zone Score:
                    mean     std  count
experiment_group                       
Control           0.1695  0.0278    235
Treatment         0.1661  0.0277    235
Difference: 0.0034 GOOD

 Average Song Count:
                  mean   std  count
experiment_group                   
Control           82.8  28.2  16205
Treatment         84.4  33.3  16627
Difference: 1.6 GOOD



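- Groups are closely balanced: the comfort zone means differ by 0.0034 and the song count means by 1.6, both under the thresholds I set. That makes it unlikely that outcome differences come from baseline differences rather than from the treatment.
- One caveat: the song count summary is computed on song-level rows (counts of 16,205 and 16,627 rather than 235 per group), so larger playlists carry more weight in those averages.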
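A scale-free way to judge balance is the standardized mean difference (SMD), which divides the gap between group means by a pooled standard deviation. The sketch below is an illustration only: the helper name `standardized_mean_difference` and the toy data are assumptions, and the "|SMD| below 0.1" cutoff is a commonly used rule of thumb rather than something this notebook defines.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(frame, value_col, group_col="experiment_group"):
    """(Treatment mean - Control mean) divided by the pooled standard deviation."""
    control = frame.loc[frame[group_col] == "Control", value_col]
    treatment = frame.loc[frame[group_col] == "Treatment", value_col]
    pooled_std = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    return (treatment.mean() - control.mean()) / pooled_std

# Toy data: both groups drawn from the same made-up distribution.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "experiment_group": ["Control"] * 200 + ["Treatment"] * 200,
    "comfort_zone_score": rng.normal(0.17, 0.028, 400),
})
smd = standardized_mean_difference(toy, "comfort_zone_score")
print(f"SMD: {smd:.3f}")

# Small hand-checkable case: means differ by 1, pooled std is sqrt(2).
known = pd.DataFrame({
    "experiment_group": ["Control", "Control", "Treatment", "Treatment"],
    "x": [0.0, 2.0, 1.0, 3.0],
})
known_smd = standardized_mean_difference(known, "x")
```

Plugging in the rounded summary values printed above (means 0.1695 and 0.1661, standard deviations near 0.028) gives an SMD of roughly -0.12, which is borderline under that rule of thumb. A relative measure like this can be a useful complement to the fixed 0.005 cutoff.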

## Simulating Baseline Engagement

Since this is a portfolio project, I don't have real user data. I need to simulate realistic engagement levels for each segment.

**My assumption:** users with wider comfort zones probably already engage more. Users with narrow zones engage less.

**Baseline engagement I'm simulating (songs liked per session):**
- Narrow users: 5 songs ± 1.2 (they're stuck, don't explore much)
- Moderate users: 7 songs ± 1.5 (balanced)
- Wide users: 9 songs ± 1.8 (already high engagement)

I based these numbers on what seems realistic for music streaming.

**Important:** These are made-up numbers for simulation purposes. In a real A/B test, I'd measure actual baseline engagement before the experiment starts.
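The same assumptions can be simulated in one vectorized call instead of row by row. This is a sketch under stated assumptions: `BASELINE_PARAMS` and `simulate_baseline` are hypothetical names, and because it uses `np.random.default_rng` it will not reproduce the exact numbers from a row-wise `np.random.seed` approach.

```python
import numpy as np
import pandas as pd

# (mean, std) of songs liked per session, taken from the assumptions above.
BASELINE_PARAMS = {
    "Narrow Comfort Zone": (5.0, 1.2),
    "Moderate Comfort Zone": (7.0, 1.5),
    "Wide Comfort Zone": (9.0, 1.8),
}

def simulate_baseline(segments, rng, floor=0.5):
    """Draw one baseline engagement value per row, floored at `floor`."""
    means = segments.map(lambda s: BASELINE_PARAMS[s][0]).to_numpy()
    stds = segments.map(lambda s: BASELINE_PARAMS[s][1]).to_numpy()
    draws = rng.normal(means, stds)  # one draw per row, each with its own mean/std
    return np.maximum(floor, draws)

rng = np.random.default_rng(42)
toy_segments = pd.Series(list(BASELINE_PARAMS) * 1000)  # 1000 rows per segment
baseline = simulate_baseline(toy_segments, rng)

segment_means = pd.Series(baseline).groupby(toy_segments.to_numpy()).mean()
print(segment_means.round(2))
```

Checking the per-segment means against the intended 5, 7, and 9 is a quick way to confirm the simulation matches the stated assumptions.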


In [42]:
np.random.seed(42)
def simulate_baseline_engagement(row):
    segment = row['segment']
    
    if segment == 'Narrow Comfort Zone':
        mean, std = 5.0, 1.2
    elif segment == 'Moderate Comfort Zone':
        mean, std = 7.0, 1.5
    else:  
        mean, std = 9.0, 1.8
    
    baseline = np.random.normal(mean, std)
    return max(0.5, baseline) 

df['baseline_engagement'] = df.apply(simulate_baseline_engagement, axis=1)


# Calculate Average Playlist Audio Features

In [43]:
df_songs = pd.read_csv("spotify.csv")

In [44]:
playlist_averages = df_songs.groupby('playlist_id').agg({'valence': 'mean', 'energy': 'mean'}).round(4)


In [45]:
playlist_averages.columns = ['avg_valence', 'avg_energy']


In [46]:
print("\nOverall Statistics:")
print(f"Valence - Mean: {playlist_averages['avg_valence'].mean():.3f}, "
      f"Std: {playlist_averages['avg_valence'].std():.3f}, "
      f"Range: [{playlist_averages['avg_valence'].min():.3f}, {playlist_averages['avg_valence'].max():.3f}]")
print(f"Energy  - Mean: {playlist_averages['avg_energy'].mean():.3f}, "
      f"Std: {playlist_averages['avg_energy'].std():.3f}, "
      f"Range: [{playlist_averages['avg_energy'].min():.3f}, {playlist_averages['avg_energy'].max():.3f}]")



Overall Statistics:
Valence - Mean: 0.517, Std: 0.126, Range: [0.045, 0.844]
Energy  - Mean: 0.701, Std: 0.110, Range: [0.243, 0.931]


In [47]:
if 'playlist_id' not in df.columns:
    df = df.reset_index()

df = df.merge(playlist_averages, left_on='playlist_id', right_index=True, how='left')

In [48]:
df.head()

Unnamed: 0,playlist_id,valence,energy,comfort_zone_score,segment,experiment_group,baseline_engagement,avg_valence,avg_energy
0,0275i1VNfBnsNbPl0QIBpG,0.204712,0.136614,0.170663,Moderate Comfort Zone,Treatment,7.745071,0.556,0.715
1,03qQtbNHoJuFezRu2CnLuF,0.235168,0.197765,0.216467,Wide Comfort Zone,Treatment,8.751124,0.5961,0.7341
2,03sDEv7FN58Mb9CJOs1Tgn,0.192502,0.109796,0.151149,Moderate Comfort Zone,Treatment,7.971533,0.6616,0.7013
3,06zrBJ5cts5aemZmqe80J7,0.188636,0.120561,0.154599,Moderate Comfort Zone,Treatment,9.284545,0.5303,0.6862
4,07SNJ4MwYba9wwmzrbjmYi,0.203764,0.133636,0.1687,Moderate Comfort Zone,Treatment,6.64877,0.4984,0.5044


In [49]:
missing_valence = df['avg_valence'].isna().sum()
missing_energy = df['avg_energy'].isna().sum()

if missing_valence > 0 or missing_energy > 0:
    print("Missing values after merge:")
    print(f"avg_valence: {missing_valence}")
    print(f"avg_energy: {missing_energy}")
else:
    print("No missing values - all playlists have averages")


No missing values - all playlists have averages


## Define Treatment: Opposite Recommendations

This is the logic that I used:

**Control Group:**
- Receives recommendations similar to current listening
- `target_valence = current_valence`
- `target_energy = current_energy`

**Treatment Group:**
- Receives **opposite** recommendations
- `target_valence = 1 - current_valence`
- `target_energy = 1 - current_energy`

This creates a shift in the valence-energy space, pushing users outside their comfort zones.
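The targeting rule above can be written without a row-wise `apply`. A minimal sketch on toy values (the three rows are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "experiment_group": ["Control", "Treatment", "Treatment"],
    "avg_valence": [0.60, 0.60, 0.50],
    "avg_energy": [0.70, 0.70, 0.90],
})

is_treatment = toy["experiment_group"].eq("Treatment")
for feature in ("valence", "energy"):
    current = toy[f"avg_{feature}"]
    # Treatment gets the reflection 1 - current; Control keeps current.
    toy[f"target_{feature}"] = np.where(is_treatment, 1 - current, current).round(4)

print(toy)
```

Note the third row: a playlist whose average valence is exactly 0.5 maps to itself, so "opposite" recommendations change nothing on that axis. The rule only moves users who sit away from the middle of the scale.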

In [50]:
def calculate_targets(row):
    if row['experiment_group'] == 'Control':
        target_valence = row['avg_valence']
        target_energy = row['avg_energy']
    else:  
        target_valence = 1 - row['avg_valence']
        target_energy = 1 - row['avg_energy']

    return pd.Series({'target_valence': round(target_valence, 4),
        'target_energy': round(target_energy, 4) })

In [51]:
df[['target_valence', 'target_energy']] = df.apply(calculate_targets, axis=1)


### Shift Distance Analysis

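- Here I calculate the Euclidean distance between each playlist's current position and its target in valence-energy space

- This applies to the Treatment group only, since Control targets equal the current values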


In [52]:

treatment_df = df[df['experiment_group'] == 'Treatment'].copy()

treatment_df['valence_shift'] = abs(treatment_df['target_valence'] - treatment_df['avg_valence'])
treatment_df['energy_shift'] = abs(treatment_df['target_energy'] - treatment_df['avg_energy'])
treatment_df['total_shift'] = np.sqrt(treatment_df['valence_shift']**2 + treatment_df['energy_shift']**2)

shift_by_segment = treatment_df.groupby('segment')[['valence_shift', 'energy_shift', 'total_shift']].agg(['mean', 'std']).round(3)

print("\nAverage Shift Distance by Segment:")
print(shift_by_segment.to_string())




Average Shift Distance by Segment:
                      valence_shift        energy_shift        total_shift       
                               mean    std         mean    std        mean    std
segment                                                                          
Moderate Comfort Zone         0.174  0.129        0.379  0.176       0.435  0.181
Narrow Comfort Zone           0.278  0.172        0.566  0.168       0.651  0.178
Wide Comfort Zone             0.165  0.109        0.318  0.169       0.378  0.160


### What this means:

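Looking at the results, Narrow users have the largest shifts (0.651 average distance in the valence-energy space). Because the target is the reflection `1 - current`, the shift equals twice the distance between a playlist's average and the midpoint (0.5, 0.5). So the large value means Narrow playlists sit farthest from the middle of the space, and their opposite recommendations are far away.

Wide users have smaller shifts (0.378): their playlist averages sit closer to the midpoint, plausibly because their songs are already spread out. Their opposite isn't that far from what they already listen to.

This is consistent with my hypothesis about who will be most affected by the feature.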
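One property of the `1 - current` rule is worth checking numerically: the shift distance is always twice the distance from the midpoint (0.5, 0.5). A small sketch with made-up points (the helper names are hypothetical):

```python
import numpy as np

def shift_distance(avg_valence, avg_energy):
    """Euclidean distance between (v, e) and its reflection (1 - v, 1 - e)."""
    valence_shift = np.abs((1 - avg_valence) - avg_valence)
    energy_shift = np.abs((1 - avg_energy) - avg_energy)
    return np.sqrt(valence_shift**2 + energy_shift**2)

def distance_from_midpoint(avg_valence, avg_energy):
    """Euclidean distance between (v, e) and the centre of the space."""
    return np.sqrt((avg_valence - 0.5) ** 2 + (avg_energy - 0.5) ** 2)

# Made-up playlist averages: one at the centre, one near it, one far away.
v = np.array([0.50, 0.55, 0.20])
e = np.array([0.50, 0.70, 0.90])

print(shift_distance(v, e))
print(2 * distance_from_midpoint(v, e))  # same values as the line above
```

So the shift table measures how far each segment's playlist averages sit from the centre, not directly how tightly the songs within a playlist cluster.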

## Simulating Treatment Effects

Here's where I simulate what happens when users get opposite recommendations.

**My hypothesis:** the effect depends on how stuck users are in their comfort zones:
- **Narrow users** (most stuck) → biggest boost (+18% engagement)
- **Moderate users** (balanced) → medium boost (+10% engagement)  
- **Wide users** (already exploring) → small boost (+2% engagement)

I'm adding some randomness (the lift itself is drawn with a standard deviation of 4-6 percentage points, depending on segment) because real effects always vary across users.

**The simulation math:**
- **Control group:** engagement = baseline + random noise (no change)
- **Treatment group:** engagement = baseline × (1 + lift%) + random noise

**Important reminder:** These effect sizes are made up for demonstration. In reality, you'd measure actual user behavior after launch and might see very different results.



In [53]:

def simulate_treatment_effect(row):
    baseline = row['baseline_engagement']
    segment = row['segment']
    group = row['experiment_group']
    
    if group == 'Control':
        post = baseline + np.random.normal(0, 0.5)
        return max(0.5, post)
    
    if segment == 'Narrow Comfort Zone':
        expected_lift = 0.18  
        std_lift = 0.05
    elif segment == 'Moderate Comfort Zone':
        expected_lift = 0.10  
        std_lift = 0.04
    else:  
        expected_lift = 0.02  
        std_lift = 0.06
    
    lift = np.random.normal(expected_lift, std_lift)
    
    post = baseline * (1 + lift) + np.random.normal(0, 0.5)
    return max(0.5, post)

df['post_treatment_engagement'] = df.apply(simulate_treatment_effect, axis=1)



In [197]:
# Actual lifts achieved
print("\nActual Average Lifts by Segment:")
treatment_df = df[df['experiment_group'] == 'Treatment'].copy()
treatment_df['actual_lift_pct'] = ((treatment_df['post_treatment_engagement'] - treatment_df['baseline_engagement']) / treatment_df['baseline_engagement'] * 100)

lift_summary = treatment_df.groupby('segment')['actual_lift_pct'].agg(['mean', 'std']).round(1)
print(lift_summary)


Actual Average Lifts by Segment:
                       mean   std
segment                          
Moderate Comfort Zone  10.4   8.4
Narrow Comfort Zone    20.1  12.7
Wide Comfort Zone       4.6   7.8


In [54]:
print("\nControl Group (should be ~0% lift):")
control_df = df[df['experiment_group'] == 'Control'].copy()
control_df['actual_lift_pct'] = ((control_df['post_treatment_engagement'] - control_df['baseline_engagement']) / control_df['baseline_engagement'] * 100)
control_lift = control_df['actual_lift_pct'].mean()
print(f"Average lift: {control_lift:.2f}% (expected: ~0%)")



Control Group (should be ~0% lift):
Average lift: -0.02% (expected: ~0%)
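To turn these per-group averages into a single effect estimate, the Treatment lift can be compared against the Control change directly. The sketch below is an assumption-laden illustration, not part of the original analysis: `lift_vs_control` is a hypothetical helper, the toy data use a made-up 10% true lift, and the standard error is the unequal-variance (Welch-style) formula.

```python
import numpy as np
import pandas as pd

def lift_vs_control(frame, pre_col, post_col, group_col="experiment_group"):
    """Treatment minus Control mean percent change, with its standard error."""
    pct_change = (frame[post_col] - frame[pre_col]) / frame[pre_col] * 100
    treat = pct_change[frame[group_col] == "Treatment"]
    ctrl = pct_change[frame[group_col] == "Control"]
    diff = treat.mean() - ctrl.mean()
    # Unequal-variance standard error of the difference in means.
    se = np.sqrt(treat.var(ddof=1) / len(treat) + ctrl.var(ddof=1) / len(ctrl))
    return diff, se, diff / se

# Toy data: 500 users per arm, made-up baseline and a made-up 10% true lift.
rng = np.random.default_rng(7)
n = 500
toy = pd.DataFrame({
    "experiment_group": np.repeat(["Control", "Treatment"], n),
    "baseline_engagement": rng.normal(7.0, 1.0, 2 * n).clip(min=0.5),
})
true_lift = np.where(toy["experiment_group"] == "Treatment", 0.10, 0.0)
toy["post_treatment_engagement"] = (
    toy["baseline_engagement"] * (1 + true_lift) + rng.normal(0, 0.5, 2 * n)
)

diff, se, t_stat = lift_vs_control(
    toy, "baseline_engagement", "post_treatment_engagement"
)
print(f"Lift vs control: {diff:.2f} pp (SE {se:.2f}, t = {t_stat:.1f})")
```

Subtracting the Control change removes anything that affects both groups equally, which is the reason for having a Control group in the first place.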


## Defining My Metrics

### Primary Metric: Songs Liked per Session

This is my main success metric - did opposite recs increase engagement?

I'm measuring how many songs users like (heart/save) during a session. This is a good metric because:
- It's direct engagement 
- We can actually improve it through better recommendations

**My hypothesis:** Opposite recommendations push users to discover new music they actually like, so they'll save more songs per session.


### Secondary Metric: Discovery Diversity  

I also want to check whether the feature works the way I think it does. Does it actually increase musical exploration?

I'll measure variance in the music characteristics users engage with (std dev across valence/energy). If the treatment works, users should explore a wider range of moods and energy levels.

This validates the mechanism behind the feature.
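One possible way to compute such a diversity score from song-level data is sketched below. The exact formula is an assumption: here I average the per-playlist standard deviations of valence and energy, and `discovery_diversity` plus the toy songs are hypothetical.

```python
import pandas as pd

def discovery_diversity(songs, id_col="playlist_id"):
    """Average of the per-playlist std dev of valence and of energy."""
    spread = songs.groupby(id_col)[["valence", "energy"]].std(ddof=0)
    return spread.mean(axis=1).rename("diversity")

# Toy songs: one tightly clustered playlist and one spread-out playlist.
toy_songs = pd.DataFrame({
    "playlist_id": ["narrow"] * 3 + ["wide"] * 3,
    "valence": [0.50, 0.52, 0.48, 0.10, 0.50, 0.90],
    "energy": [0.70, 0.71, 0.69, 0.20, 0.60, 1.00],
})
diversity = discovery_diversity(toy_songs)
print(diversity.round(3))
```

In a real test this would be computed on the songs users engage with before and after the feature launches, then compared across groups.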


### Guardrail Metrics 

Increasing engagement is great, but what if users hate the experience? A feature could technically work but frustrate users.

Two things I'm monitoring to catch problems early:

**1. Session Length** - Are users quitting early because they're annoyed?
- **Threshold:** Can't drop by more than 5%
- If users leave early, that's a red flag even if engagement goes up

**2. Skip Rate** - Are users skipping tons of songs to find something they like?
- **Threshold:** Can't increase by more than 10 percentage points  
- Some skipping is ok as they are exploring, but too much means bad recommendations

**Why this matters:** The primary metric can improve while the user experience gets worse. Guardrails prevent launching something that technically works but users hate.


### Adding baseline metrics

These are the baseline assumptions I used:

#### Secondary metric: Discovery diversity
- Narrow users have low diversity
- Moderate users fall in between
- Wide users have high diversity

#### Guardrail 1: Session length (minutes)
- Everyone has similar session length (~35 min)

#### Guardrail 2: Skip rate (proportion of songs skipped)
- Everyone has similar skip rate (~35%)

In [None]:

def assign_baseline_diversity(segment):
    if segment == 'Narrow Comfort Zone':
        return np.random.normal(0.15, 0.02)  
    elif segment == 'Moderate Comfort Zone':
        return np.random.normal(0.20, 0.02)  
    else:  
        return np.random.normal(0.25, 0.03) 

df['baseline_diversity'] = df['segment'].apply(assign_baseline_diversity)


df['baseline_session_length'] = np.random.normal(35, 5, len(df))

df['baseline_skip_rate'] = np.random.normal(0.35, 0.05, len(df))



## Simulation

### Simulation Logic:

#### **Control Group:**
No treatment effect - small random noise only
- Diversity: baseline + noise
- Session length: baseline + noise  
- Skip rate: baseline + noise

#### **Treatment Group (Segment-Specific Effects):**

**Narrow Comfort Zone:**
- Diversity: **+15%** (large discovery boost - exploring totally new genres)
- Session length: **+2%** (slight increase - curious to explore)
- Skip rate: **+12pp** (major increase - opposite recommendations feel too jarring)

**Reason:** Users stuck in high-energy music get low-energy recommendations. They skip several songs before finding something they like. Eventually engagement improves, but initial friction is high.

**Moderate Comfort Zone:**
- Diversity: **+8%** (medium discovery boost)
- Session length: **+1%** (minimal change)
- Skip rate: **+3pp** (small, acceptable increase)

**Reason:** Some disruption but manageable. Users are already somewhat diverse, so opposite recs are less shocking.

**Wide Comfort Zone:**
- Diversity: **+2%** (minimal - already diverse)
- Session length: **0%** (no change)
- Skip rate: **+1pp** (negligible)

**Reason:** Already exploring diverse music. Opposite recs don't add much value.

### Note:

The **+12pp skip rate for Narrow** is set above the 10pp threshold on purpose, so the simulation produces a **guardrail violation** and I can demonstrate how violations are detected and handled.

### Constraints Applied:

- **Diversity:** Minimum 0.05 (can't be negative)
- **Session length:** Minimum 10 minutes (realistic lower bound)
- **Skip rate:** Clipped to [0.1, 0.8] range (10-80%, realistic bounds)
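These bounds can be applied to whole columns at once. A minimal sketch with made-up raw values that deliberately fall outside each range:

```python
import numpy as np

# Made-up simulated values before constraints are applied.
raw_diversity = np.array([-0.02, 0.03, 0.21])
raw_session = np.array([4.0, 35.2, 61.0])
raw_skip = np.array([0.04, 0.47, 0.93])

post_diversity = np.maximum(0.05, raw_diversity)  # floor only
post_session = np.maximum(10, raw_session)        # floor only
post_skip = np.clip(raw_skip, 0.1, 0.8)           # floor and ceiling

print(post_diversity, post_session, post_skip)
```

Keep in mind that clipping truncates the noise distribution, so if many values hit a bound the average of the clipped column can drift away from the intended mean.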
    

In [None]:

def simulate_secondary_guardrails(row):
    
    if row['experiment_group'] == 'Control':
        diversity = row['baseline_diversity'] + np.random.normal(0, 0.01)
        session = row['baseline_session_length'] + np.random.normal(0, 1)
        skip = row['baseline_skip_rate'] + np.random.normal(0, 0.02)
        
    else:  
        segment = row['segment']
        
        if segment == 'Narrow Comfort Zone':
            diversity = row['baseline_diversity'] * 1.15 + np.random.normal(0, 0.01)
            session = row['baseline_session_length'] * 1.02 + np.random.normal(0, 1)
            
            skip = row['baseline_skip_rate'] + 0.12 + np.random.normal(0, 0.02)
            
        elif segment == 'Moderate Comfort Zone':
            diversity = row['baseline_diversity'] * 1.08 + np.random.normal(0, 0.01)
            session = row['baseline_session_length'] * 1.01 + np.random.normal(0, 1)
            skip = row['baseline_skip_rate'] + 0.03 + np.random.normal(0, 0.02)
            
        else:  
            diversity = row['baseline_diversity'] * 1.02 + np.random.normal(0, 0.01)
            session = row['baseline_session_length'] * 1.00 + np.random.normal(0, 1)
            skip = row['baseline_skip_rate'] + 0.01 + np.random.normal(0, 0.02)
    
    return pd.Series({
        'post_diversity': max(0.05, diversity),
        'post_session_length': max(10, session),
        'post_skip_rate': np.clip(skip, 0.1, 0.8),
    })

df[['post_diversity', 'post_session_length', 'post_skip_rate']] = df.apply(simulate_secondary_guardrails, axis=1)



## Verification

In [212]:

print("Primary Metric - Engagement")
treatment_df = df[df['experiment_group'] == 'Treatment'].copy()
treatment_df['engagement_lift_pct'] = ((treatment_df['post_treatment_engagement'] - treatment_df['baseline_engagement']) / treatment_df['baseline_engagement'] * 100)
engagement_summary = treatment_df.groupby('segment')['engagement_lift_pct'].agg(['mean', 'std']).round(1)
print("\nTreatment Group - Engagement Lift by Segment:")
print(engagement_summary)

# Control group
control_df = df[df['experiment_group'] == 'Control'].copy()
control_engagement_lift = ((control_df['post_treatment_engagement'] - control_df['baseline_engagement']) / control_df['baseline_engagement'] * 100).mean()
print(f"\nControl Group - Engagement Lift: {control_engagement_lift:.2f}% (expected: ~0%)")



Primary Metric - Engagement

Treatment Group - Engagement Lift by Segment:
                       mean   std
segment                          
Moderate Comfort Zone  10.4   8.4
Narrow Comfort Zone    20.1  12.7
Wide Comfort Zone       4.6   7.8

Control Group - Engagement Lift: -0.02% (expected: ~0%)


In [None]:

print("\nSecondary Metric - Discovery Diversity")
# Treatment group
treatment_df['diversity_lift_pct'] = ((treatment_df['post_diversity'] - treatment_df['baseline_diversity']) / treatment_df['baseline_diversity'] * 100)
diversity_summary = treatment_df.groupby('segment')['diversity_lift_pct'].agg(['mean', 'std']).round(1)
print("\nTreatment Group - Diversity Lift by Segment:")
print(diversity_summary)

# Control group
control_diversity_lift = ((control_df['post_diversity'] - control_df['baseline_diversity']) / control_df['baseline_diversity'] * 100).mean()
print(f"\nControl Group - Diversity Lift: {control_diversity_lift:.2f}% (expected: ~0%)")




Secondary Metric - Discovery Diversity

Treatment Group - Diversity Lift by Segment:
                       mean  std
segment                         
Moderate Comfort Zone   8.5  4.4
Narrow Comfort Zone    14.1  6.5
Wide Comfort Zone       2.2  4.6

Control Group - Diversity Lift: 0.17% (expected: ~0%)


In [None]:

print("\nGuardrail Metric 1 - Session Length")
# Treatment group
treatment_df['session_lift_pct'] = ((treatment_df['post_session_length'] - treatment_df['baseline_session_length']) / treatment_df['baseline_session_length'] * 100)
session_summary = treatment_df.groupby('segment')['session_lift_pct'].agg(['mean', 'std']).round(1)
print("\nTreatment Group - Session Length Change by Segment:")
print(session_summary)

# Control group
control_session_lift = ((control_df['post_session_length'] - control_df['baseline_session_length']) / control_df['baseline_session_length'] * 100).mean()
print(f"\nControl Group - Session Length Change: {control_session_lift:.2f}% (expected: ~0%)")






Guardrail Metric 1 - Session Length

Treatment Group - Session Length Change by Segment:
                       mean  std
segment                         
Moderate Comfort Zone   0.5  3.3
Narrow Comfort Zone     2.5  3.0
Wide Comfort Zone       0.0  2.9

Control Group - Session Length Change: -0.44% (expected: ~0%)


In [None]:

print("\nGuardrail Metric 2 - Skip Rate")

# Treatment group
treatment_df['skip_change_pp'] = ((treatment_df['post_skip_rate'] - treatment_df['baseline_skip_rate']) * 100)
skip_summary = treatment_df.groupby('segment')['skip_change_pp'].agg(['mean', 'std']).round(1)
print("\nTreatment Group - Skip Rate Change (pp) by Segment:")
print(skip_summary)

# Control group
control_skip_change = ((control_df['post_skip_rate'] - control_df['baseline_skip_rate']) * 100).mean()
print(f"\nControl Group - Skip Rate Change: {control_skip_change:.2f}pp (expected: ~0pp)")



Guardrail Metric 2 - Skip Rate

Treatment Group - Skip Rate Change (pp) by Segment:
                       mean  std
segment                         
Moderate Comfort Zone   3.0  1.9
Narrow Comfort Zone    11.6  1.8
Wide Comfort Zone       1.5  1.6

Control Group - Skip Rate Change: -0.17pp (expected: ~0pp)


### Guardrail Violations Check


- Threshold: Session length should not decrease by >5%
- Threshold: Skip rate should not increase by >10pp

In [210]:

violations = []
for segment in treatment_df['segment'].unique():
    seg_data = treatment_df[treatment_df['segment'] == segment]
    
    avg_session = seg_data['session_lift_pct'].mean()
    avg_skip = seg_data['skip_change_pp'].mean()
    
    if avg_session < -5:
        violations.append(f"{segment}: Session length decreased by {avg_session:.1f}%")
    
    if avg_skip > 10:
        violations.append(f"{segment}: Skip rate increased by {avg_skip:.1f}pp (EXCEEDS THRESHOLD)")

if violations:
    for v in violations:
        print(v)
else:
    print("All guardrails within acceptable ranges")

Narrow Comfort Zone: Skip rate increased by 11.6pp (EXCEEDS THRESHOLD)
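The check above compares Treatment users to their own baseline. A variant that also subtracts the Control group's change is sketched here. It is an illustration under assumptions: `guardrail_violations`, `THRESHOLDS`, and the toy rows are hypothetical, with the thresholds copied from the guardrail definitions.

```python
import pandas as pd

# Thresholds from the guardrail definitions: session length may not drop by
# more than 5%, skip rate may not rise by more than 10 percentage points.
THRESHOLDS = {"session_pct": -5.0, "skip_pp": 10.0}

def guardrail_violations(frame):
    """Return per-segment net changes (Treatment - Control) and flagged rows."""
    changes = pd.DataFrame({
        "segment": frame["segment"],
        "experiment_group": frame["experiment_group"],
        "session_pct": (frame["post_session_length"]
                        / frame["baseline_session_length"] - 1) * 100,
        "skip_pp": (frame["post_skip_rate"] - frame["baseline_skip_rate"]) * 100,
    })
    means = changes.groupby(["segment", "experiment_group"])[
        ["session_pct", "skip_pp"]
    ].mean()
    net = (means.xs("Treatment", level="experiment_group")
           - means.xs("Control", level="experiment_group"))
    flagged = net[(net["session_pct"] < THRESHOLDS["session_pct"])
                  | (net["skip_pp"] > THRESHOLDS["skip_pp"])]
    return net, flagged

# Toy rows: Narrow treatment skip rate rises by 12pp, Wide by 1pp.
toy = pd.DataFrame({
    "segment": ["Narrow"] * 4 + ["Wide"] * 4,
    "experiment_group": ["Control", "Control", "Treatment", "Treatment"] * 2,
    "baseline_session_length": [35.0] * 8,
    "post_session_length": [35.0, 35.0, 36.0, 36.0, 35.0, 35.0, 35.0, 35.0],
    "baseline_skip_rate": [0.35] * 8,
    "post_skip_rate": [0.35, 0.35, 0.47, 0.47, 0.35, 0.35, 0.36, 0.36],
})
net, flagged = guardrail_violations(toy)
print(flagged)
```

Netting out the Control change keeps a guardrail from firing (or staying silent) because of drift that affects every user, not just the ones who saw the feature.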


# Export

In [None]:
# index=False avoids writing the row index as another unnamed column
df.to_csv("playlist_complete_metrics.csv", index=False)