*A/B Test Analysis: Search Ranking System*
----------------------------------------

Goal
-----
Prepare the dataset and analyze an experiment comparing a new search ranking (variant)
against the current version (control). Decide whether to "go full on" (roll out) based on:

1) Primary metric: conversion (booking happened or not)
2) Guardrail metric: time_to_booking

Predefined parameters
---------------------
- Confidence level: 90%
- Significance level: alpha = 0.10

Steps
-----
1) Load data and join sessions (behavior) to users (experiment groups) -> sessions_x_users
2) Create the binary primary metric: conversion = 1 if booking_timestamp is present, else 0
3) Run a Sample Ratio Mismatch (SRM) check to confirm balanced assignment
4) Test experiment effect:
   - Primary (binary): two-sided Z-test for proportions
   - Guardrail (continuous): two-sided t-test of means
5) Compute relative effect sizes for both metrics:
   effect_size = mean(variant) / mean(control) - 1
6) Make decision:
   - "full on" (yes) only if both conditions hold:
       a) primary p-value < alpha AND effect_size_primary > 0
       b) guardrail p-value > alpha OR effect_size_guardrail <= 0
     Otherwise "pull back" (no).


In [34]:
# IMPORT PACKAGES
import pandas as pd
import numpy as np
from scipy.stats import chisquare, ttest_ind
from statsmodels.stats.proportion import proportions_ztest

In [3]:
# HELPER: relative effect size
def estimate_effect_size(df: pd.DataFrame, metric: str) -> float:
    """
    Calculate relative effect size

    Parameters:
    - df (pd.DataFrame): data with experiment_group ('control', 'variant') and metric columns.
    - metric (str): name of the metric column

    Returns:
    - effect_size (float): relative difference in group means, mean(variant) / mean(control) - 1
    """
    avg_metric_per_group = df.groupby('experiment_group')[metric].mean()
    effect_size = avg_metric_per_group['variant'] / avg_metric_per_group['control'] - 1
    return effect_size


In [22]:
# FIXED PARAMETERS
confidence_level = 0.90               # 90% confidence
alpha = 1 - confidence_level          # significance level (0.10)

In [5]:
# LOAD DATA
users = pd.read_csv('users_data.csv')       # contains user_id and experiment_group (control/variant)
sessions = pd.read_csv('sessions_data.csv') # contains user_id, booking_timestamp, time_to_booking, etc.

In [41]:
print("=== USERS: shape / columns / dtypes ===")
print(users.shape)
print(list(users.columns))
print(users.dtypes, "\n")
print(users.head(3), "\n")

=== USERS: shape / columns / dtypes ===
(10000, 2)
['user_id', 'experiment_group']
user_id             object
experiment_group    object
dtype: object 

            user_id experiment_group
0  TcCIMrtQ75wHGXVj          variant
1  GUGVzto9KGqeX3dc          variant
2  uNcuV49WhPJ8C0MH          variant 



In [42]:
print("=== SESSIONS: shape / columns / dtypes ===")
print(sessions.shape)
print(list(sessions.columns))
print(sessions.dtypes, "\n")
print(sessions.head(3))

=== SESSIONS: shape / columns / dtypes ===
(16981, 5)
['session_id', 'user_id', 'session_start_timestamp', 'booking_timestamp', 'time_to_booking']
session_id                  object
user_id                     object
session_start_timestamp     object
booking_timestamp           object
time_to_booking            float64
dtype: object 

         session_id           user_id        session_start_timestamp  \
0  CP0lbAGnb5UNi3Ut  TcCIMrtQ75wHGXVj  2025-01-26 20:02:39.177358627   
1  UQAjrPYair63L1p8  TcCIMrtQ75wHGXVj  2025-01-20 16:12:51.536912203   
2  9zQrAPxV5oi2SzSa  TcCIMrtQ75wHGXVj  2025-01-28 03:46:40.839362144   

  booking_timestamp  time_to_booking  
0               NaN              NaN  
1               NaN              NaN  
2               NaN              NaN  


In [6]:
# JOIN DATA
sessions_x_users = sessions.merge(users, on='user_id', how='inner')
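
The inner join silently drops sessions whose user_id has no match in users: the sessions table printed above has 16,981 rows, while the SRM cell later reports 15,283 joined rows. A minimal sketch of how such a drop could be audited is shown below, assuming users holds one row per user_id. The tiny frames and the names `audit`, `unmatched_sessions` and `matched` are hypothetical and only illustrate the pandas `validate` and `indicator` options.

```python
import pandas as pd

# Hypothetical toy data: user "u9" has sessions but no experiment assignment.
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "experiment_group": ["control", "variant", "control"],
})
sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4"],
    "user_id": ["u1", "u1", "u2", "u9"],
})

# A left join keeps every session; validate raises if a user_id repeats in users,
# and indicator adds a _merge column saying where each row found a match.
audit = sessions.merge(users, on="user_id", how="left",
                       validate="many_to_one", indicator=True)

unmatched_sessions = int((audit["_merge"] == "left_only").sum())
matched = audit.loc[audit["_merge"] == "both"].drop(columns="_merge")

print(f"sessions without an assigned user: {unmatched_sessions}")
print(f"sessions kept for analysis: {len(matched)}")
```

The `matched` frame has the same rows an inner join would produce, so the audit can run before the join without changing downstream results.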

In [25]:
# PRIMARY METRIC: conversion
# Binary conversion flag: 1 if booking occurred, 0 otherwise
sessions_x_users['conversion'] = sessions_x_users['booking_timestamp'].notnull().astype(int)

In [29]:
# 1) SANITY CHECK: SAMPLE RATIO MISMATCH TEST
# Count sessions in each arm (rows of sessions_x_users, not unique users); index order is control, variant
groups_count = sessions_x_users['experiment_group'].value_counts().reindex(['control','variant']).fillna(0).astype(int)
n = groups_count.sum()

# Expected counts under perfect 50/50 split
expected = [n/2, n/2]

srm_chi2_stat, srm_chi2_pval = chisquare(f_obs=groups_count.values, f_exp=expected)
srm_chi2_pval = round(float(srm_chi2_pval), 4)

print("=== SANITY CHECK: Sample Ratio Mismatch (SRM) ===")
print(f"Assignments -> control: {groups_count['control']:,} | variant: {groups_count['variant']:,} | total: {n:,}")
print(f"SRM chi-square p-value: {srm_chi2_pval:.4f}")
if srm_chi2_pval < alpha:
    print("Possible SRM (p < alpha). Interpret downstream results with caution.")
else:
    print("No SRM detected at alpha = 0.10.")

=== SANITY CHECK: Sample Ratio Mismatch (SRM) ===
Assignments -> control: 7,630 | variant: 7,653 | total: 15,283
SRM chi-square p-value: 0.8524
No SRM detected at alpha = 0.10.


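The control and variant arms contain 7,630 and 7,653 sessions respectively, a difference of 23 sessions, or about 0.15% of the total. These counts come from the session-level join, so they are sessions rather than unique users (the users table has only 10,000 rows). The chi-square test gives a p-value of 0.8524, well above the 0.10 significance threshold, so there is no evidence of a Sample Ratio Mismatch (SRM). Because assignment happens per user, a count of unique users per arm would test the randomization unit directly; at the session level the split looks balanced, and the analysis can proceed.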
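
The SRM figures can be reproduced by hand from the two printed counts. The sketch below uses only the standard library and relies on the fact that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)). The variable names are illustrative and not part of the notebook.

```python
import math

# Counts printed by the SRM cell.
observed = {"control": 7630, "variant": 7653}

total = sum(observed.values())
expected = total / 2  # perfect 50/50 split

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_stat = sum((n - expected) ** 2 / expected for n in observed.values())

# With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(f"chi2 = {chi2_stat:.4f}, p = {p_value:.4f}")  # chi2 = 0.0346, p = 0.8524
```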

In [31]:
# 2) EFFECT ON PRIMARY METRIC
# Compute success counts and sample sizes for each group
success_counts = sessions_x_users.groupby('experiment_group', observed=True)['conversion'].sum().reindex(['control','variant'])
sample_sizes   = sessions_x_users['experiment_group'].value_counts().reindex(['control','variant'])

# Run Z-test for proportions (binary conversion metric)
zstat_primary, pval_primary = proportions_ztest(count=success_counts.values,
                                                nobs=sample_sizes.values,
                                                alternative='two-sided')
pval_primary = round(float(pval_primary), 4)

# Estimate effect size for the conversion metric
effect_size_primary = round(estimate_effect_size(sessions_x_users, 'conversion'), 4)

print("\n=== PRIMARY METRIC: Conversion ===")
print(f"Control conversion rate: {success_counts['control'] / sample_sizes['control']:.4%}")
print(f"Variant conversion rate: {success_counts['variant'] / sample_sizes['variant']:.4%}")
print(f"Z-test p-value: {pval_primary:.4f}")
print(f"Relative effect size: {effect_size_primary:.4%} (=(variant/ control) - 1)")



=== PRIMARY METRIC: Conversion ===
Control conversion rate: 15.9240%
Variant conversion rate: 18.1889%
Z-test p-value: 0.0002
Relative effect size: 14.2200% (=(variant/ control) - 1)


Control sessions converted at 15.92% and variant sessions at 18.19%, an absolute increase of about 2.26 percentage points and a relative lift of 14.22%. The Z-test p-value (0.0002) is far below the 0.10 threshold, so the improvement is statistically significant. Note that the test treats each session as an independent observation, although one user can contribute several sessions. With that caveat, the result is strong evidence that the new search ranking system increases booking conversions.
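
The two-proportion Z-test can be checked by hand. In the sketch below, the success counts 1,215 and 1,392 are an assumption: they are back-calculated from the printed conversion rates and sample sizes, not read from the data. The statistic is computed as variant minus control, so its sign is the opposite of what `proportions_ztest` returns with control listed first; the two-sided p-value is unaffected.

```python
import math

n_control, n_variant = 7630, 7653   # sessions per arm, from the SRM output
x_control, x_variant = 1215, 1392   # ASSUMPTION: inferred from the printed rates

p_control = x_control / n_control
p_variant = x_variant / n_variant

# Pooled proportion under H0 (no difference between arms).
p_pooled = (x_control + x_variant) / (n_control + n_variant)
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_variant))

z_stat = (p_variant - p_control) / std_err
# Two-sided p-value for a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

relative_lift = p_variant / p_control - 1

print(f"z = {z_stat:.2f}, p = {p_value:.4f}, lift = {relative_lift:.4f}")
```

Under this assumption the rounded p-value and lift agree with the notebook output (0.0002 and 0.1422).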

In [36]:
# 3) EFFECT ON GUARDRAIL METRIC
# Note: Lower time_to_booking is better. Our decision rule treats:
# - Non-significant change as OK, OR
# - A significant change that *decreases* time_to_booking (i.e., variant <= control) as OK.
# Welch's t-test (equal_var=False) on time_to_booking, control vs variant.
# Sessions without a booking have NaN time_to_booking and are dropped.

control_ttb = sessions_x_users.loc[sessions_x_users['experiment_group']=='control', 'time_to_booking'].dropna()
variant_ttb = sessions_x_users.loc[sessions_x_users['experiment_group']=='variant', 'time_to_booking'].dropna()

tstat_guardrail, pval_guardrail = ttest_ind(variant_ttb, control_ttb, equal_var=False, alternative='two-sided')
pval_guardrail = round(float(pval_guardrail), 4)

# Estimate effect size for the guardrail metric
effect_size_guardrail = round(estimate_effect_size(sessions_x_users, 'time_to_booking'), 4)

print("\n=== GUARDRAIL METRIC: Time to Booking ===")
print(f"Control mean time_to_booking: {control_ttb.mean():.4f}")
print(f"Variant mean time_to_booking: {variant_ttb.mean():.4f}")
print(f"T-test p-value: {pval_guardrail:.4f}")
print(f"Relative effect size: {effect_size_guardrail:.4%} (negative is good here, i.e., faster)")


=== GUARDRAIL METRIC: Time to Booking ===
Control mean time_to_booking: 15.0124
Variant mean time_to_booking: 14.8940
T-test p-value: 0.5365
Relative effect size: -0.7900% (negative is good here, i.e., faster)


Mean time to booking was 15.01 in the control group and 14.89 in the variant, a relative change of -0.79%, meaning bookings were marginally faster in the variant. The p-value of 0.5365 is well above 0.10, so this difference is not statistically significant. The guardrail therefore passes: there is no evidence that the variant slows bookings. Because time_to_booking exists only for sessions that ended in a booking, this comparison covers converting sessions only.
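
The `equal_var=False` argument in the guardrail cell selects Welch's t-test, which does not assume equal variances in the two arms. A minimal standard-library sketch of the statistic and its Welch-Satterthwaite degrees of freedom follows. The two samples are hypothetical values chosen for illustration, not data from this experiment, and the p-value step (which needs the t distribution) is left to `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance

# HYPOTHETICAL samples, for illustration only.
variant_ttb = [14.0, 15.0, 13.0, 16.0]
control_ttb = [15.0, 17.0, 16.0, 18.0]

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite dof) for mean(a) - mean(b)."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof

t_stat, dof = welch_t(variant_ttb, control_ttb)
print(f"t = {t_stat:.4f}, dof = {dof:.1f}")  # t = -2.1909, dof = 6.0
```

With equal sample sizes and equal variances, as in this toy case, the degrees of freedom reduce to n1 + n2 - 2; the two tests diverge only when the arms differ in size or spread.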

In [23]:
# 4) DECISION LOGIC
# Primary metric must be statistically significant and show a positive effect (increase)
criteria_full_on_primary = (pval_primary < alpha) and (effect_size_primary > 0)

# Guardrail must either be statistically insignificant or show a favorable effect (decrease)
criteria_full_on_guardrail = (pval_guardrail > alpha) or (effect_size_guardrail <= 0)

In [39]:
if criteria_full_on_primary and criteria_full_on_guardrail:
    decision_full_on = 'yes'
    print("\nDecision: GO FULL ON")
    print("Reason: Primary metric improved significantly (p < 0.10 & effect > 0) AND guardrail not harmed")
    print("        (either no significant change or the time_to_booking decreased).")
else:
    decision_full_on = 'no'
    print("\nDecision: PULL BACK")
    print("Reason: Either the primary effect was not significantly positive, or the guardrail was harmed.")


Decision: GO FULL ON
Reason: Primary metric improved significantly (p < 0.10 & effect > 0) AND guardrail not harmed
        (either no significant change or the time_to_booking decreased).


Based on the decision criteria, the experiment supports rolling out the new search ranking system. The primary metric (conversion) showed a statistically significant and positive improvement. Meanwhile, the guardrail metric (time to booking) did not show any statistically significant harm and even suggested a slight speed-up. Together, these results justify moving forward with a full launch of the variant.
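
The decision rule can also be wrapped in a small function so that it can be exercised against edge cases. This is a sketch under the rule stated in the Steps section; the function name `decide_full_on` and its signature are hypothetical and not part of the notebook.

```python
def decide_full_on(pval_primary: float, effect_size_primary: float,
                   pval_guardrail: float, effect_size_guardrail: float,
                   alpha: float = 0.10) -> str:
    """Return 'yes' (full on) or 'no' (pull back) under the notebook's rule."""
    # Primary: significant AND positive lift.
    primary_ok = pval_primary < alpha and effect_size_primary > 0
    # Guardrail: not significant, OR the change is a decrease (faster booking).
    guardrail_ok = pval_guardrail > alpha or effect_size_guardrail <= 0
    return "yes" if primary_ok and guardrail_ok else "no"

# Values taken from the notebook's summary outputs.
decision = decide_full_on(0.0002, 0.1422, 0.5365, -0.0079)
print(decision)  # yes

# A significant slowdown on the guardrail blocks the rollout.
print(decide_full_on(0.0002, 0.1422, 0.01, 0.05))  # no
```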

In [40]:
# 5) SAVE KEY OUTPUTS (as variables)

# Already set: srm_chi2_pval, pval_primary, pval_guardrail, effect_size_primary, effect_size_guardrail, decision_full_on
print("\n=== SUMMARY OUTPUTS ===")
print(f"srm_chi2_pval         = {srm_chi2_pval}")
print(f"pval_primary          = {pval_primary}")
print(f"effect_size_primary   = {effect_size_primary}")
print(f"pval_guardrail        = {pval_guardrail}")
print(f"effect_size_guardrail = {effect_size_guardrail}")
print(f"decision_full_on      = {decision_full_on}")


=== SUMMARY OUTPUTS ===
srm_chi2_pval         = 0.8524
pval_primary          = 0.0002
effect_size_primary   = 0.1422
pval_guardrail        = 0.5365
effect_size_guardrail = -0.0079
decision_full_on      = yes
