In this notebook, we will analyze a synthetic dataset that simulates Search Everywhere usage. 

Our goal is to compare two experiment groups (Group 0 and Group 1) to understand the differences in user behavior and search effectiveness. We will calculate various metrics to evaluate the performance of each group:

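- **Mean Reciprocal Rank (MRR):** The average of 1/rank of the selected result across successful sessions. Higher values mean users pick results closer to the top of the list.
- **Kendall Tau Distance:** Measures how closely the position of the selected result agrees with an ideal ordering. In this notebook it is computed with `scipy.stats.kendalltau`, which returns a correlation coefficient, so higher values mean closer agreement.
- **Expected Reciprocal Rank (ERR):** Estimates how early in the result list users stop because they are satisfied. Higher values mean satisfaction at earlier positions.
- **Mean Event Rank:** The average `eventIndex` of the `sessionFinished` event, that is, the average step at which users complete the search.
- **Success Rate:** The percentage of sessions that end with a `sessionFinished` event containing a selected result.
- **Success Rate at N:** The percentage of sessions in which the selected result appears within the top N results.
- **Average Session Duration:** The time in seconds from the first event of a session to its `sessionFinished` event.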
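
As a minimal sketch of how the rank-based metrics above behave, the following computes the reciprocal rank, MRR, and Success Rate for a few made-up sessions. The function names and the toy data are illustrative assumptions and are not part of the notebook's code. The only convention carried over is that a selected index is treated as 0-based, so rank = index + 1.

```python
def reciprocal_rank(selected_index):
    """Reciprocal rank for a 0-based index of the selected result."""
    return 1.0 / (selected_index + 1)


def summarize(selected_by_session):
    """Return (MRR, Success Rate in percent) for a list of sessions.

    Each entry is the 0-based index of the selected result,
    or None when the session ended without a selection.
    """
    successful = [i for i in selected_by_session if i is not None]
    if successful:
        mrr = sum(reciprocal_rank(i) for i in successful) / len(successful)
    else:
        mrr = 0.0
    if selected_by_session:
        success_rate = 100.0 * len(successful) / len(selected_by_session)
    else:
        success_rate = 0.0
    return mrr, success_rate


# Hypothetical toy data: four sessions, one of them without a selection.
toy_sessions = [0, 1, None, 3]
mrr, success_rate = summarize(toy_sessions)
print(f"MRR = {mrr:.3f}, Success Rate = {success_rate:.1f}%")
# prints: MRR = 0.583, Success Rate = 75.0%
```

MRR here is averaged over successful sessions only, which matches how the metric is normalized later in this notebook.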

First, we import the necessary libraries.

In [1]:
import pandas as pd
from scipy.stats import kendalltau
import json

We load the dataset and extract relevant fields from the JSON-encoded `event_data` column. The extraction creates five new columns: `session_id`, `searchStateFeatures`, `experimentGroup`, `selectedIndexes`, and `eventIndex`.

After extracting the fields, we split the dataset into two separate DataFrames, one for each experiment group (Group 0 and Group 1). This prepares the data for calculating metrics and comparing the two groups.

In [3]:
# Extract relevant fields from 'event_data'
def extract_event_data_fields(row):
    event_data = json.loads(row['event_data'])
    return {
        'session_id': event_data.get('session_id'),
        'searchStateFeatures': event_data.get('searchStateFeatures'),
        'experimentGroup': event_data.get('experimentGroup'),
        'selectedIndexes': event_data.get('selectedIndexes'),
        'eventIndex': event_data.get('eventIndex')
    }

# Load the dataset
file_path = '2024InternshipData.csv'
data = pd.read_csv(file_path)

# Apply extraction function to each row and create new columns
extracted_data = data.apply(extract_event_data_fields, axis=1, result_type='expand')
data = pd.concat([data, extracted_data], axis=1)

# Split data into two dataframes by 'experimentGroup'
group_0 = data[data['experimentGroup'] == 0].copy()
group_1 = data[data['experimentGroup'] == 1].copy()

group_0.head()

| Unnamed: 0 | time_epoch | device_id | event_data | event_id | session_id | searchStateFeatures | experimentGroup | selectedIndexes | eventIndex |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 1699913000000.0 | ce7444e76fd636382f8193e69301e5a2 | {"session_id": "6fce0cbcd528ebe209090502c05e46... | searchRestarted | 6fce0cbcd528ebe209090502c05e46c4 | {'queryLength': 3} | 0 | | 0 |
| 21 | 1699913000000.0 | ce7444e76fd636382f8193e69301e5a2 | {"session_id": "6fce0cbcd528ebe209090502c05e46... | searchRestarted | 6fce0cbcd528ebe209090502c05e46c4 | {'queryLength': 0} | 0 | | 1 |
| 22 | 1699913000000.0 | ce7444e76fd636382f8193e69301e5a2 | {"session_id": "6fce0cbcd528ebe209090502c05e46... | searchRestarted | 6fce0cbcd528ebe209090502c05e46c4 | {'queryLength': 0} | 0 | | 2 |
| 23 | 1699913000000.0 | ce7444e76fd636382f8193e69301e5a2 | {"session_id": "6fce0cbcd528ebe209090502c05e46... | searchRestarted | 6fce0cbcd528ebe209090502c05e46c4 | {'queryLength': 1} | 0 | | 3 |
| 24 | 1699913000000.0 | ce7444e76fd636382f8193e69301e5a2 | {"session_id": "6fce0cbcd528ebe209090502c05e46... | sessionFinished | 6fce0cbcd528ebe209090502c05e46c4 | {'queryLength': 2} | 0 | [0] | 4 |
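
To make the extraction step above concrete, here is a small standard-library sketch that applies the same `.get`-based lookup to two hand-written JSON strings. The sample payloads and the helper name `extract_fields` are assumptions made for illustration; only the five field names come from the notebook.

```python
import json

# Field names taken from the notebook's extraction function.
FIELDS = (
    "session_id",
    "searchStateFeatures",
    "experimentGroup",
    "selectedIndexes",
    "eventIndex",
)


def extract_fields(raw_event_data):
    """Parse one JSON string and pick the fields of interest.

    dict.get returns None for a missing key, so absent fields
    become None instead of raising KeyError.
    """
    payload = json.loads(raw_event_data)
    return {name: payload.get(name) for name in FIELDS}


# Hypothetical payloads: one complete, one with most fields missing.
complete = extract_fields(
    '{"session_id": "abc", "experimentGroup": 0, "eventIndex": 4, '
    '"selectedIndexes": [0], "searchStateFeatures": {"queryLength": 2}}'
)
partial = extract_fields('{"session_id": "abc"}')

print(complete["selectedIndexes"])  # [0]
print(partial["selectedIndexes"])   # None
```

Both `None` and an empty list are falsy in Python, which is what lets a plain truthiness check on `selectedIndexes` separate sessions with a selection from sessions without one.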


We calculate key performance metrics for each experiment group. The function processes each session to compute the metrics using the available event data. These metrics provide insights into user behavior and search effectiveness for both groups.

The calculated metrics are then stored in a DataFrame for easy comparison between Experiment Group 0 and Experiment Group 1.

In [5]:
def calculate_metrics(group, n=5):
    """
    Calculate basic metrics for a given group of data.

    Parameters:
    - group: DataFrame containing the group's data.
    - n: The number of top results to consider for the Success Rate at N metric.

    Returns:
    - A dictionary containing the calculated metrics.
    """
    # Initialize metrics
    mrr_sum = 0
    event_total_rank = 0
    successful_sessions = 0
    successful_sessions_at_n = 0
    total_distance = 0
    total_err = 0
    total_sessions = len(group['session_id'].unique())
    session_durations = []

    # Loop through each session
    for session_id, session_data in group.groupby('session_id'):

        # Get the first event and the 'sessionFinished' event
        final_event = session_data[session_data['event_id'] == 'sessionFinished']
        first_event = session_data.iloc[0]

        # A session counts as successful when it finished with a selected result
        if not final_event.empty and final_event.iloc[0]['selectedIndexes']:
            # Calculate MRR (selectedIndexes is treated as 0-based, so rank is 1-based)
            rank = final_event.iloc[0]['selectedIndexes'][0] + 1
            mrr_sum += 1 / rank

            # Calculate Mean Event Rank
            event_total_rank += final_event.iloc[0]['eventIndex']

            # Calculate Success Rate
            successful_sessions += 1

            # Calculate Success Rate at N
            # NOTE: rank is 1-based, so `rank < n` counts positions 1..n-1 only.
            # Use `rank <= n` to include position n; the numbers reported below
            # were produced with the strict comparison.
            if rank < n:
                successful_sessions_at_n += 1

            # Calculate Kendall tau
            # (kendalltau returns a correlation coefficient, not a distance)
            ranking = [0] * rank + [1]
            true_order = list(range(len(ranking)))
            distance, _ = kendalltau(true_order, ranking)
            total_distance += distance

            # Calculate Expected Reciprocal Rank
            # Accumulate the probability of stopping at each rank
            err = 0
            p_continue = 1  # Probability of continuing the search

            for r in range(1, rank + 1):
                relevance = 1 if r == rank else 0
                p_stop = relevance / (rank + 1)  # The stopping probability at this rank
                err += p_continue * p_stop
                p_continue *= (1 - p_stop)  # Update the probability of continuing

            total_err += err

            # Calculate the duration in seconds from the first event to the 'sessionFinished' event
            start_time = first_event['time_epoch'] / 1000  # Convert milliseconds to seconds
            end_time = final_event.iloc[0]['time_epoch'] / 1000  # Convert milliseconds to seconds
            duration = end_time - start_time
            session_durations.append(duration)

    # Compute final metrics
    mrr = mrr_sum / successful_sessions if successful_sessions > 0 else 0
    mean_event_rank = event_total_rank / successful_sessions if successful_sessions > 0 else 0
    success_rate = (successful_sessions / total_sessions) * 100 if total_sessions > 0 else 0
    success_rate_at_n = (successful_sessions_at_n / total_sessions) * 100 if total_sessions > 0 else 0
    kendall_tau = total_distance / successful_sessions if successful_sessions > 0 else 0
    err = total_err / successful_sessions if successful_sessions > 0 else 0
    average_duration = sum(session_durations) / len(session_durations) if session_durations else 0

    # Return a dictionary with all the calculated metrics
    return {
        'MRR': mrr,
        'Mean Event Rank': mean_event_rank,
        'Success Rate': success_rate,
        'Success Rate at N': success_rate_at_n,
        'Kendall Tau Distance': kendall_tau,
        'ERR': err,
        'Average Session Duration': average_duration
    }


# Calculate metrics for both groups
metrics_group_0 = calculate_metrics(group_0, n=5)
metrics_group_1 = calculate_metrics(group_1, n=5)

# Convert metrics to a DataFrame and display to the user
metrics_df = pd.DataFrame([metrics_group_0, metrics_group_1], index=['Experiment Group 0', 'Experiment Group 1'])
print(metrics_df)

                         MRR  Mean Event Rank  Success Rate  \
Experiment Group 0  0.340626         5.995707     57.344092   
Experiment Group 1  0.383329         5.968467     56.440572   

                    Success Rate at N  Kendall Tau Distance       ERR  \
Experiment Group 0          23.564004              0.605933  0.212206   
Experiment Group 1          26.073429              0.640864  0.233878   

                    Average Session Duration  
Experiment Group 0                 25.683477  
Experiment Group 1                 25.511409  
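
The ERR loop in `calculate_metrics` gives nonzero relevance only to the selected position, so every earlier position contributes nothing and leaves the continuation probability at 1. Each session therefore contributes exactly 1/(rank + 1). The sketch below copies that loop into a standalone function (the name `err_as_implemented` is an assumption for illustration) and checks the closed form. Note that commonly cited formulations of ERR also weight each position by 1/r, which this loop does not do, so the values here are best read as a notebook-specific variant.

```python
def err_as_implemented(rank):
    """Replicates the ERR loop from calculate_metrics for a 1-based rank."""
    err = 0.0
    p_continue = 1.0  # probability the user has not stopped yet
    for r in range(1, rank + 1):
        relevance = 1 if r == rank else 0
        p_stop = relevance / (rank + 1)
        err += p_continue * p_stop
        p_continue *= 1 - p_stop
    return err


# The loop collapses to 1 / (rank + 1) for every rank.
for rank in range(1, 11):
    assert abs(err_as_implemented(rank) - 1 / (rank + 1)) < 1e-12

print(err_as_implemented(1))  # 0.5
print(err_as_implemented(4))  # 0.2
```

A selection at the very first position therefore scores 0.5 rather than 1, which explains why the reported ERR values sit well below the MRR values.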


The results from the analysis of Experiment Group 0 and Experiment Group 1 show some small differences in search performance metrics:

- **MRR (Mean Reciprocal Rank):** Group 1 has a higher MRR (0.383) compared to Group 0 (0.341), indicating that users in Group 1, on average, find relevant results more quickly. This suggests that the search experience might be more effective or efficient for Group 1.
- **Mean Event Rank:** The Mean Event Rank is quite similar for both groups (around 6), suggesting that the average number of steps taken by users to complete a search is nearly identical. This indicates that users in both groups generally require a similar number of interactions to reach a satisfactory result.
- **Success Rate:** Group 0 has a slightly higher Success Rate (57.3%) than Group 1 (56.4%), implying that users in Group 0 are marginally more likely to end their sessions with a meaningful interaction. The difference is under one percentage point.
- **Success Rate at N:** With N=5, Group 1 reaches 26.1% compared to 23.6% for Group 0. Note that the code tests `rank < n` on a 1-based rank, so these figures count sessions whose selected result was in the first four positions. The gap suggests that users in Group 1 more often find a satisfactory result near the top of the list, possibly due to better initial ranking or search experience.
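- **Kendall Tau Distance:** Despite the name, the value reported here is the Kendall tau correlation coefficient returned by `scipy.stats.kendalltau`, not a distance, so higher is better. Group 1 scores 0.641 against 0.606 for Group 0. Because the coefficient is computed from the selected position alone, it rises as selections move toward the top of the list. It therefore points in the same direction as MRR rather than providing independent evidence about ranking consistency.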
- **ERR (Expected Reciprocal Rank):** Group 1 also shows a higher ERR (0.234) than Group 0 (0.212), which implies that users in Group 1 are more likely to stop searching earlier due to finding satisfactory results sooner. This aligns with the higher MRR and Success Rate at N observed for Group 1.
- **Average Session Duration:** The Average Session Duration is almost the same for both groups, with Group 0 at 25.68 seconds and Group 1 at 25.51 seconds, a gap of less than 0.2 seconds. The overall time spent searching is effectively the same.
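
To see how the Kendall tau value depends on the selected position, the sketch below builds the same two lists as `calculate_metrics` and evaluates Kendall's tau-b, the variant that `scipy.stats.kendalltau` computes by default. The pure-Python `kendall_tau_b` helper is an illustrative stand-in written for this example, not the SciPy implementation. Under this construction the coefficient reduces to sqrt(2 / (rank + 1)).

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pairwise count)."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: counted nowhere
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


def tau_for_rank(rank):
    """Same list construction as in calculate_metrics, for a 1-based rank."""
    ranking = [0] * rank + [1]
    true_order = list(range(len(ranking)))
    return kendall_tau_b(true_order, ranking)


for rank in range(1, 11):
    assert abs(tau_for_rank(rank) - math.sqrt(2 / (rank + 1))) < 1e-12

print(tau_for_rank(1))  # 1.0
```

The value is 1.0 when the first result is selected and decreases monotonically as the selected position moves down the list.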

Overall, the results suggest that Experiment Group 1 has a slight advantage in search efficiency, as indicated by higher MRR, Success Rate at N, Kendall Tau Distance, and ERR. Users in Group 1 tend to select results closer to the top of the list. However, the differences are small, the Success Rate for Group 0 is slightly higher, and session durations are nearly identical, so users in both groups are similarly effective in completing their searches. No statistical test was run in this notebook, so the comparison is descriptive.
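
One way to check whether a gap in Success Rate like the one above could be due to chance is a two-proportion z-test. The sketch below uses only the standard library. The session counts are hypothetical placeholders chosen to resemble the reported percentages, because the notebook output does not show the number of sessions per group; the real counts would need to be substituted.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value) using the pooled standard error.
    """
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # both proportions are identically 0 or 1
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal: erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL counts: 1000 sessions per group is an assumption,
# not a figure taken from the dataset.
z, p_value = two_proportion_z_test(573, 1000, 564, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With the real per-group session counts in place of the placeholders, a large p-value would support reading the Success Rate gap as noise, while a small one would suggest a genuine difference.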