<h1 style="text-align:center; font-size:40px; color:black;"> New search ranking system: A/B Testing Analysis </h1>


This project focuses on evaluating a new search ranking system for an online travel agency. The goal is to see if the new system increases booking conversion without making the booking process longer. Using Python, the project analyzes data from an A/B test, including session-level bookings and user-level test groups. The results will show whether the new ranking system has a real impact on user behavior and provide a recommendation based on data.

<h1 style="font-size:32px">🔬 Data Description </h1>

*  **`sessions_data.csv`**

| Column                    | Data Type | Description                                                                                         |
| ------------------------- | --------- | --------------------------------------------------------------------------------------------------- |
| `session_id`              | `string`  | Unique ID for each session (one row per session)                                                    |
| `user_id`                 | `string`  | Unique ID for each user (may be missing for non-logged-in users; a user can have multiple sessions) |
| `session_start_timestamp` | `string`  | Timestamp when the session started                                                                  |
| `booking_timestamp`       | `string`  | Timestamp when a booking was made (missing if no booking occurred)                                  |
| `time_to_booking`         | `float`   | Minutes from session start to booking (missing if no booking occurred)                              |
| `conversion`              | `integer` | Derived column (created in this notebook, not present in the raw file): 1 if the session resulted in a booking, 0 otherwise |

<br>

* **`users_data.csv`**

| Column             | Data Type | Description                                                             |
| ------------------ | --------- | ----------------------------------------------------------------------- |
| `user_id`          | `string`  | Unique ID for logged-in users only                                      |
| `experiment_group` | `string`  | Experiment group: `control` or `variant` (expected roughly 50/50 split) |


<br>

### Evaluation Criteria

* Primary metric: Conversion rate must show a statistically significant increase.

* Guardrail metric: Time to booking must either show no significant change or a decrease.
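
The two criteria above can be written as a small decision rule. This is a minimal sketch, not part of the original analysis: the function name `ship_decision`, its parameters, and the 0.05 significance level are assumptions made for illustration.

```python
def ship_decision(conv_lift, conv_p, time_lift, time_p, alpha=0.05):
    """Return True if the variant meets both evaluation criteria.

    conv_lift / time_lift: relative effect, variant / control - 1.
    conv_p / time_p: two-sided p-values for each metric.
    alpha: assumed significance level (0.05 here).
    """
    # Primary metric: conversion must be significantly higher in the variant.
    primary_ok = conv_p < alpha and conv_lift > 0
    # Guardrail: fail only if time to booking is significantly longer.
    guardrail_ok = not (time_p < alpha and time_lift > 0)
    return primary_ok and guardrail_ok


# Illustrative inputs only (hypothetical values, not results).
print(ship_decision(conv_lift=0.10, conv_p=0.01, time_lift=-0.01, time_p=0.60))
```

Writing the rule down before looking at the results makes it harder to reinterpret the criteria after the fact.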

In [16]:
import pandas as pd
from scipy.stats import chisquare, ttest_ind
from statsmodels.stats.proportion import proportions_ztest

In [17]:
sessions = pd.read_csv('sessions_data.csv')
users = pd.read_csv('users_data.csv')

## About the Dataset

* `sessions_data.csv`

In [18]:
sessions.head()

Unnamed: 0,session_id,user_id,session_start_timestamp,booking_timestamp,time_to_booking
0,CP0lbAGnb5UNi3Ut,TcCIMrtQ75wHGXVj,2025-01-26 20:02:39.177358627,,
1,UQAjrPYair63L1p8,TcCIMrtQ75wHGXVj,2025-01-20 16:12:51.536912203,,
2,9zQrAPxV5oi2SzSa,TcCIMrtQ75wHGXVj,2025-01-28 03:46:40.839362144,,
3,kkrz1M5vxrQ8wXRZ,GUGVzto9KGqeX3dc,2025-01-25 02:48:50.953303099,,
4,AKDXZWWFYKViHC27,,2025-01-28 00:30:49.979124308,,


In [19]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16981 entries, 0 to 16980
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   session_id               16981 non-null  object 
 1   user_id                  15283 non-null  object 
 2   session_start_timestamp  16981 non-null  object 
 3   booking_timestamp        2844 non-null   object 
 4   time_to_booking          2844 non-null   float64
dtypes: float64(1), object(4)
memory usage: 663.4+ KB


* `users_data.csv`

In [20]:
users.head()

Unnamed: 0,user_id,experiment_group
0,TcCIMrtQ75wHGXVj,variant
1,GUGVzto9KGqeX3dc,variant
2,uNcuV49WhPJ8C0MH,variant
3,v2EBIHmOdQfalI6k,variant
4,wnsKpRB9SE0gTZAq,variant


In [21]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_id           10000 non-null  object
 1   experiment_group  10000 non-null  object
dtypes: object(2)
memory usage: 156.4+ KB


## Create the `conversion` column
A session counts as a conversion if there’s a booking.

In [22]:
# Create conversion column
sessions['conversion'] = sessions['booking_timestamp'].notna().astype(int)

# check
print(sessions[['session_id', 'booking_timestamp', 'conversion']].sample(5))

            session_id              booking_timestamp  conversion
8739  2a6ChjFLVEWPsc0w                            NaN           0
8666  m9flO8JjhVpQmYzm                            NaN           0
3602  jQAcLAWOvao51KQR  2025-01-24 13:03:28.243145046           1
9191  1Dnm3tlOfIVidQ3w                            NaN           0
8539  mT0uBbazVOsyA2HF  2025-01-24 19:58:24.440769616           1
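
Because `conversion` is derived from `booking_timestamp`, it is worth confirming that `booking_timestamp` and `time_to_booking` are missing on the same rows (both show 2844 non-null values in `sessions.info()`). The sketch below runs that check on a tiny made-up table; `toy` and its values are placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the sessions table (values are made up for illustration).
toy = pd.DataFrame({
    "booking_timestamp": ["2025-01-24 13:03:28", None, None],
    "time_to_booking": [12.5, None, None],
})
toy["conversion"] = toy["booking_timestamp"].notna().astype(int)

# A booking timestamp and a time-to-booking value should appear together.
consistent = bool(
    (toy["booking_timestamp"].notna() == toy["time_to_booking"].notna()).all()
)
n_conversions = int(toy["conversion"].sum())
print(f"consistent: {consistent}, conversions: {n_conversions}")
```

The same two expressions could be applied to `sessions` directly.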


## Merge session data with experiment groups

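Attach each session to its user's experiment group. A left join keeps every session; sessions without a `user_id` (non-logged-in users) get no group and are excluded from the group-level calculations below.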

In [23]:
# Merge sessions with users on user_id
merged_data = sessions.merge(users, on='user_id', how='left')

merged_data.head()

Unnamed: 0,session_id,user_id,session_start_timestamp,booking_timestamp,time_to_booking,conversion,experiment_group
0,CP0lbAGnb5UNi3Ut,TcCIMrtQ75wHGXVj,2025-01-26 20:02:39.177358627,,,0,variant
1,UQAjrPYair63L1p8,TcCIMrtQ75wHGXVj,2025-01-20 16:12:51.536912203,,,0,variant
2,9zQrAPxV5oi2SzSa,TcCIMrtQ75wHGXVj,2025-01-28 03:46:40.839362144,,,0,variant
3,kkrz1M5vxrQ8wXRZ,GUGVzto9KGqeX3dc,2025-01-25 02:48:50.953303099,,,0,variant
4,AKDXZWWFYKViHC27,,2025-01-28 00:30:49.979124308,,,0,
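
A left join can silently duplicate sessions if a `user_id` appears more than once in the users table. One way to guard against that, sketched here on made-up tables (`sessions_toy` and `users_toy` are hypothetical), is pandas' `validate="many_to_one"` option, which raises an error when the right-hand keys are not unique.

```python
import pandas as pd

# Made-up miniature tables; s4 mimics a non-logged-in session.
sessions_toy = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4"],
    "user_id": ["u1", "u1", "u2", None],
})
users_toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "experiment_group": ["variant", "control"],
})

# validate="many_to_one" raises MergeError if users_toy has duplicate user_ids.
merged_toy = sessions_toy.merge(
    users_toy, on="user_id", how="left", validate="many_to_one"
)

# Sessions that could not be assigned to a group.
n_unassigned = int(merged_toy["experiment_group"].isna().sum())
print(f"rows: {len(merged_toy)}, unassigned: {n_unassigned}")
```

The row count should equal the number of sessions before the merge, and the unassigned count shows how many sessions fall outside the experiment.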


## SRM Test (Sample Ratio Mismatch) 

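Check that sessions are split roughly evenly between the two groups.
A chi-square goodness-of-fit test compares the observed group counts with the expected 50/50 split; a very small p-value would point to a problem with assignment or logging that could bias the results.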

In [24]:
groups_count = merged_data['experiment_group'].value_counts()

# Total of observed counts
total_count = groups_count.sum()

# Expected counts = equal split of observed counts
expected_counts = [total_count / 2, total_count / 2]

srm_chi2_stat, srm_chi2_pval = chisquare(f_obs=groups_count, f_exp=expected_counts)
print("SRM Test - group counts:\n", groups_count)
print(f"SRM Test p-value: {srm_chi2_pval:.4f}")


SRM Test - group counts:
 experiment_group
variant    7653
control    7630
Name: count, dtype: int64
SRM Test p-value: 0.8524


---

* The two groups are nearly equal in size (7,653 vs. 7,630 sessions, almost 50/50).
* The p-value (0.8524) is well above 0.05, so the difference in group sizes is not statistically significant.

The experiment is **balanced**: there is no evidence of a sample ratio mismatch, so group imbalance is unlikely to bias the A/B comparison.
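
For two groups, the chi-square statistic and its p-value can also be computed by hand, which makes the test easier to follow. This is a sketch using only the standard library; `srm_p_value` is a hypothetical helper, and it relies on the fact that a chi-square variable with 1 degree of freedom has survival function `erfc(sqrt(x / 2))`.

```python
import math


def srm_p_value(n_a, n_b):
    """Chi-square goodness-of-fit test against a 50/50 split (1 degree of freedom)."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    # With 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Session counts printed by the SRM cell above.
chi2, p = srm_p_value(7653, 7630)
print(f"chi2 = {chi2:.4f}, p = {p:.4f}")
```

Since groups are assigned per user, the same helper could also be applied to the counts of users in each group.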

## Effect Size Calculation

Measure how big the difference is between variant and control, not just whether it’s statistically significant.

In [25]:
# Function to calculate effect size
def estimate_effect_size(df, metric):
    """
    Calculate relative effect size: (variant_avg / control_avg) - 1
    """
    avg_per_group = df.groupby('experiment_group')[metric].mean()
    effect_size = avg_per_group['variant'] / avg_per_group['control'] - 1
    return effect_size

# Conversion (primary metric)
effect_size_conversion = estimate_effect_size(merged_data, 'conversion')
conversion_rate = merged_data.groupby('experiment_group')['conversion'].mean()
print("Conversion rates:\n", conversion_rate)
print(f"Effect size (conversion): {effect_size_conversion:.4f}\n")

Conversion rates:
 experiment_group
control    0.159240
variant    0.181889
Name: conversion, dtype: float64
Effect size (conversion): 0.1422



---

* **Effect size: 0.1422**: the variant's conversion rate is about 14.2% higher than control's in relative terms (18.2% vs. 15.9%, an absolute gain of about 2.3 percentage points).

The new search ranking system appears to be **helping more** users complete bookings.
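
A point estimate says nothing about its uncertainty, so a confidence interval for the absolute difference is a useful companion to the relative effect size. The sketch below uses a 95% Wald interval. The conversion counts are an assumption: the notebook does not print them, so they are inferred by rounding rate × group size. The interval also treats sessions as independent, although one user can have several sessions.

```python
import math

# Conversion counts are NOT printed in the notebook; they are inferred by
# rounding rate * group size (0.159240 * 7630 and 0.181889 * 7653).
conv_c, n_c = 1215, 7630  # control
conv_v, n_v = 1392, 7653  # variant

p_c, p_v = conv_c / n_c, conv_v / n_v
abs_lift = p_v - p_c      # difference in percentage points (as a fraction)
rel_lift = p_v / p_c - 1  # same definition as estimate_effect_size

# 95% Wald interval for the absolute difference (unpooled standard error).
se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
ci_low, ci_high = abs_lift - 1.96 * se, abs_lift + 1.96 * se

print(f"relative lift: {rel_lift:.4f}")
print(f"absolute lift: {abs_lift:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
```

An interval that excludes zero points the same way as a significant two-sided test at the 5% level.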

In [26]:
# Time to booking (guardrail metric)
effect_size_time = estimate_effect_size(merged_data, 'time_to_booking')
time_avg = merged_data.groupby('experiment_group')['time_to_booking'].mean()
print("Average time to booking:\n", time_avg)
print(f"Effect size (time to booking): {effect_size_time:.4f}")

Average time to booking:
 experiment_group
control    15.012404
variant    14.894029
Name: time_to_booking, dtype: float64
Effect size (time to booking): -0.0079


---

* **Effect size: -0.0079**: the variant's average time to booking is about 0.79% lower than control's (roughly 0.12 minutes, or 7 seconds), which is very small.

The new search ranking system **does not appear to make booking slower**, which is good because we don’t want to harm the user experience.

A negative effect size here is **favorable**, because it means users book slightly faster.

# 📝 Summary statistics for A/B groups
Check conversion rates and average time to booking for control and variant groups.

In [27]:
# Conversion rate by experiment group
conversion_summary = merged_data.groupby('experiment_group')['conversion'].mean()
print("Conversion rates:\n", conversion_summary)

Conversion rates:
 experiment_group
control    0.159240
variant    0.181889
Name: conversion, dtype: float64


---

Conversion rate: Variant (0.182) is **higher than** control (0.159). This suggests the new ranking system may improve bookings.

In [28]:
# Average time to booking
time_summary = merged_data.groupby('experiment_group')['time_to_booking'].mean()
print("\nAverage time to booking:\n", time_summary)


Average time to booking:
 experiment_group
control    15.012404
variant    14.894029
Name: time_to_booking, dtype: float64


---

Time to booking: Variant (14.89 min) is **slightly lower than** control (15.01 min). This indicates the new system doesn’t slow down the booking process.

# 📝 Statistical Testing


Test whether the observed differences are statistically significant.


#### a. Conversion rate (Did more people book?)

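Using a two-proportion `z-test` (two-sided, 0.05 significance level)

* If the p-value < 0.05, a difference this large would be unlikely if both groups truly converted at the same rate, so we treat the improvement as real.
* If the p-value ≥ 0.05, the difference could plausibly be due to chance.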

In [29]:
# Count conversions and total sessions per group
# (groupby sorts the keys, so the order is: control, variant)
conversion_counts = merged_data.groupby('experiment_group')['conversion'].sum().values
total_sessions = merged_data.groupby('experiment_group')['conversion'].count().values


z_stat, p_value = proportions_ztest(conversion_counts, total_sessions)
print(f"Z-statistic: {z_stat:.3f}, p-value: {p_value:.3f}")

Z-statistic: -3.722, p-value: 0.000


---

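* **Z-statistic = -3.722**: the difference in conversion rates is about 3.7 standard errors away from zero. The sign is negative because control is listed first and its rate is lower than the variant's.
* **p-value = 0.000**: the p-value rounds to 0.000 (it is below 0.001), far under 0.05, so a difference this large is very unlikely to be due to chance alone.

This means the new search ranking system produced a statistically significant **increase in the conversion rate**.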
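
The primary criterion asks specifically for an increase, so a one-sided test is a natural complement to the two-sided one. The sketch below implements the pooled two-proportion z-test with the standard library. The helper `two_proportion_ztest` is hypothetical, and the conversion counts are inferred by rounding rate × group size because the notebook does not print them.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; group 1 is compared against group 2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: P(|Z| >= |z|) under the standard normal.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    # One-sided p-value for H1: p1 < p2, i.e. P(Z <= z).
    p_less = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_two_sided, p_less


# Inferred counts (assumption): control first, then variant.
z, p_two, p_less = two_proportion_ztest(1215, 7630, 1392, 7653)
print(f"z = {z:.3f}, two-sided p = {p_two:.6f}, one-sided p = {p_less:.6f}")
```

With statsmodels, the one-sided version could be requested by passing `alternative='smaller'` to `proportions_ztest` when control is listed first.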

#### b. Time to booking (Did it take longer or shorter?)

In [30]:
# Filter sessions with a booking
booking_data = merged_data.dropna(subset=['time_to_booking'])

# Split by group
control_time = booking_data[booking_data['experiment_group'] == 'control']['time_to_booking']
variant_time = booking_data[booking_data['experiment_group'] == 'variant']['time_to_booking']


t_stat, t_p_value = ttest_ind(control_time, variant_time, equal_var=False)
print(f"T-statistic: {t_stat:.3f}, p-value: {t_p_value:.3f}")

T-statistic: 0.618, p-value: 0.536


---

* **T-statistic = 0.618**: the difference in mean time to booking (control minus variant) is only about 0.6 standard errors, which is small relative to the variability in the data.

* **p-value = 0.536**: much larger than 0.05, which means the small difference we saw (15.01 → 14.89 min) is consistent with random variation.

The new ranking system **did not** make users take longer to book. There is no significant change in time to booking.
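
Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which does not assume the two groups have the same variance. The sketch below shows how the statistic and the Welch-Satterthwaite degrees of freedom are computed. The helper `welch_t` and the two small samples are made up for illustration and are not taken from the dataset.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy samples with clearly different variances (made-up values).
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]
t_toy, df_toy = welch_t(sample_a, sample_b)
print(f"t = {t_toy:.3f}, df = {df_toy:.2f}")
```

When the variances differ, the degrees of freedom drop below the pooled value of `n_a + n_b - 2`, which makes the test more conservative.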

## Overall Conclusion 

* Conversion rate: Significantly improved (15.9% → 18.2%, p < 0.001)

* Time to booking: No significant change (15.01 → 14.89 min, p = 0.536)

The new search ranking system can be **rolled out to all users**, because it increases the conversion rate without slowing down the booking process.