# Ablation Testing

This notebook analyzes and compares different chunking parameter combinations for the Readiscoverers backend. It tests various chunking strategies by:

1. **Loading test questions** from a shared Google Drive resource (Wizard of Oz questions)
2. **Running parameter combination tests** against the local backend API (see `README.md` for setup instructions)
3. **Comparing results** between the model's performance and expected baseline data
4. **Evaluating chunk matching accuracy** using distance metrics (chunk and chapter distances)
5. **Analyzing semantic similarity vs chunk size consistency** tradeoffs
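
To make step 4 concrete, here is a minimal sketch of one way a chunk distance and a "within k chunks" count could be computed. It is an illustration only: the helper names `chunk_distance` and `questions_within` and the example data are hypothetical, and the actual logic in `utils.chunk_matching` may differ.

```python
def chunk_distance(matched_index, expected_indices):
    """Smallest absolute gap between a matched chunk and any expected chunk.

    Returns None when there is nothing to compare (no match or no expectation).
    """
    if matched_index is None or not expected_indices:
        return None
    return min(abs(matched_index - expected) for expected in expected_indices)


def questions_within(results, max_distance):
    """Count questions that have at least one result within max_distance chunks."""
    hits = set()
    for question_number, matched_index, expected_indices in results:
        distance = chunk_distance(matched_index, expected_indices)
        if distance is not None and distance <= max_distance:
            hits.add(question_number)
    return len(hits)


# Hypothetical rows: (question_number, matched_chunk_index, expected_all_chunk_indices)
example_results = [
    (1, 12, [10, 11]),  # distance 1
    (1, 40, [10, 11]),  # distance 29
    (2, 5, [3]),        # distance 2
    (3, None, [5]),     # no chunk returned, so no distance
]

print(questions_within(example_results, max_distance=1))  # 1
print(questions_within(example_results, max_distance=3))  # 2
```

Counting distinct questions rather than rows means a question with several close results is only counted once, which matches how the summaries later in this notebook use `nunique()`.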

### Setup Requirements

**Important:** You need to obtain your own `client_secrets.json` file from Google Cloud Console and place it in the current directory (`/notebooks/`) to authenticate with Google Drive. This is required to pull the latest `oz_questions.csv` file, which is a shared resource that we always want to be up-to-date when testing.

Additionally, ensure your local backend is running before executing the test cells:
```bash
docker compose up --build
```
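Both requirements can be checked before running any cells. The sketch below is one possible pre-flight check, not part of the project: the function names are hypothetical, and the host and port (`localhost:8000`) are assumptions, so substitute whatever address your backend actually listens on.

```python
import os
import socket


def secrets_file_present(path="client_secrets.json"):
    """Return True if the Google OAuth client secrets file exists at path."""
    return os.path.isfile(path)


def backend_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_prerequisites(secrets_path="client_secrets.json", host="localhost", port=8000):
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    if not secrets_file_present(secrets_path):
        problems.append(f"Missing {secrets_path}: download it from Google Cloud Console.")
    if not backend_is_listening(host, port):
        # The port is an assumption; adjust it to match your docker compose setup.
        problems.append(f"Nothing is listening on {host}:{port}: is the backend running?")
    return problems


for problem in missing_prerequisites():
    print(problem)
```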

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import time

import pandas as pd

sys.path.append("../")

from utils.config import BOOK_URLS, DEV_PARAM_COMBOS, TEST_PARAM_COMBOS
from utils.pipeline_readiscovers_app import run_all_tests
from utils.pipeline_oz_extractor import preprocess_oz_extractor_results
from utils.pipeline_merge import merge_model_and_expected_data
from utils.chunk_matching import precompute_text_locations_in_chunks, normalize_text

filepath = "oz_questions.csv"

In [2]:
%%capture
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LoadClientConfigFile("client_secrets.json")
gauth.LocalWebserverAuth()

drive = GoogleDrive(gauth)

In [3]:
if os.path.exists(filepath):
    os.remove(filepath)
    print(f"Replacing existing file: {filepath}")
    time.sleep(1)

file = drive.CreateFile({"id": "1aa61xgSOBXu6qH1chEiFqUxgFmBNyvr4Q5bSCFUOkt8"})
file.GetContentFile(filepath, mimetype="text/csv")

Replacing existing file: oz_questions.csv


In [4]:
oz_q_df = pd.read_csv(filepath)

In [5]:
oz_q_df.head()

Unnamed: 0,Question,Book #,Book Title,Best Answer,length,Note
0,What color are Dorothy's shoes?,1,The Wonderful Wizard of Oz,"“She was so old,” explained the Witch of the N...",285.0,
1,How old is the Scarecrow when Dorothy finds him?,1,The Wonderful Wizard of Oz,“My life has been so short that I really know ...,412.0,
2,Which are the first antagonistic creatures the...,1,The Wonderful Wizard of Oz,In the morning they traveled on until they cam...,1085.0,
3,"When is the first time we read ""There's no pla...",1,The Wonderful Wizard of Oz,“That is because you have no brains” answered ...,236.0,
4,What is the wizard's secret in the Wonderful W...,1,The Wonderful Wizard of Oz,"“No, you are all wrong,” said the little man m...",546.0,


In [6]:
import numpy as np


def select_questions(
    oz_q_df, random_seed=42, questions_per_book=2, additional_random_questions=4
):
    # Set random seed for reproducibility
    np.random.seed(random_seed)

    # Get unique books
    unique_books = oz_q_df["Book #"].unique()

    # Select `questions_per_book` questions from each book (fewer if a book has less)
    selected_per_book = []
    for book_num in unique_books:
        book_questions = oz_q_df[oz_q_df["Book #"] == book_num]
        sample = book_questions.sample(
            n=min(questions_per_book, len(book_questions)), random_state=random_seed
        )
        selected_per_book.append(sample)

    selected_from_books = pd.concat(selected_per_book)

    # Select `additional_random_questions` more questions from those not yet chosen
    remaining_questions = oz_q_df[~oz_q_df.index.isin(selected_from_books.index)]
    additional_questions = remaining_questions.sample(
        n=min(additional_random_questions, len(remaining_questions)),
        random_state=random_seed,
    )

    # Combine all selected questions
    final_selection = pd.concat([selected_from_books, additional_questions])

    print(
        f"Selected {len(selected_from_books)} questions from books ({questions_per_book} per book)"
    )
    print(f"Selected {len(additional_questions)} additional random questions")
    print(f"Total: {len(final_selection)} questions\n")

    return final_selection


selected_questions_dev = select_questions(oz_q_df)
selected_questions_test = oz_q_df[~oz_q_df.index.isin(selected_questions_dev.index)].copy()

Selected 16 questions from books (2 per book)
Selected 4 additional random questions
Total: 20 questions



#### IMPORTANT

Start your backend locally before running the tests:

> Run `docker compose up --build`


This test will run through all development parameter combinations for all selected development questions.

> If the API times out or is killed, partial progress is not lost: results are continuously saved to `results_filename`.
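
The idea of saving continuously can be sketched as appending each result to a CSV as soon as it arrives, then skipping already-saved work on a rerun. This is an illustrative pattern only: `append_result` and `load_completed` are hypothetical helpers, and how `run_all_tests` actually persists results is not shown in this notebook.

```python
import csv
import os
import tempfile


def append_result(csv_path, row, fieldnames):
    """Append one result row, writing the header only when the file is new."""
    is_new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new_file:
            writer.writeheader()
        writer.writerow(row)


def load_completed(csv_path, key_fields):
    """Return the key tuples already saved, so a rerun can skip them."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {tuple(r[k] for k in key_fields) for r in csv.DictReader(handle)}


# Usage with made-up data, written to a temporary directory.
fields = ["test_number", "question_number", "matched_chunk_index"]
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "partial_results.csv")

append_result(demo_path, {"test_number": 1, "question_number": 1, "matched_chunk_index": 12}, fields)
append_result(demo_path, {"test_number": 1, "question_number": 2, "matched_chunk_index": 5}, fields)

# Values come back as strings because they are read from CSV.
done = load_completed(demo_path, ["test_number", "question_number"])
print(sorted(done))
```

Writing one row at a time costs a file open per result, but it means an interrupted run loses at most the request that was in flight.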

In [None]:
readiscovers_app_results_df = await run_all_tests(
    selected_questions=selected_questions_dev,
    param_combos=DEV_PARAM_COMBOS,
    book_urls=BOOK_URLS,
    skip_book_processing=False,
    results_filename="readiscovers_app_results_top_3_param_combos_RUN_5"
)

This test will run through all test parameter combinations for all selected test questions.

> If the API times out or is killed, partial progress is not lost: results are continuously saved to `results_filename`.

In [None]:
readiscovers_app_results_test_df = await run_all_tests(
    selected_questions=selected_questions_test,
    param_combos=TEST_PARAM_COMBOS,
    book_urls=BOOK_URLS,
    skip_book_processing=False,
    results_filename="readiscovers_app_results_test_param_combos_RUN_2"
)

In [None]:
for test_num in readiscovers_app_results_df["test_number"].unique():
    test_sorted_df = readiscovers_app_results_df[
        readiscovers_app_results_df["test_number"] == test_num
    ]
    print(
        "Number of questions where at least one result is within 1 chunk of the expected answer:",
        test_sorted_df[
            test_sorted_df["chunk_distance_from_expected"].apply(
                lambda x: isinstance(x, (int, float)) and x <= 1
            )
        ]["question_number"].nunique(),
    )
    print(
        "Number of questions where at least one result is within 3 chunks of the expected answer:",
        test_sorted_df[
            test_sorted_df["chunk_distance_from_expected"].apply(
                lambda x: isinstance(x, (int, float)) and x <= 3
            )
        ]["question_number"].nunique(),
    )
    print("")

readiscovers_app_results_df[(readiscovers_app_results_df['test_number'] == 1)][['test_number', 'original_query', 'enhanced_query',
                             'result_rank', 'matched_book_title', 'expected_book_title',
                             'matched_chunk_index', 'chunk_distance_from_expected','expected_all_chunk_indices']]

In [None]:
print(
    "Number of questions where at least one result is in the same chapter as the expected answer:",
    readiscovers_app_results_df[
        readiscovers_app_results_df["chapter_distance_from_expected"].apply(
            lambda x: isinstance(x, (int, float)) and x <= 0
        )
    ]["question_number"].nunique(),
)
readiscovers_app_results_df[
    readiscovers_app_results_df["chapter_distance_from_expected"].apply(
        lambda x: isinstance(x, (int, float)) and x <= 0
    )
][
    [
        "original_query",
        "result_rank",
        "matched_chunk_index",
        "expected_primary_chunk_index",
        "expected_all_chunk_indices",
        "chapter_distance_from_expected",
    ]
]

In [None]:
# Force reload the module
import importlib
import utils.chunk_matching
importlib.reload(utils.chunk_matching)
from utils.chunk_matching import precompute_text_locations_in_chunks, normalize_text

oz_extractor_results_df = pd.read_csv(
    "oz_extractor_5.1_1_is_match_results.csv",
    usecols=[
        "question",
        "excerpt_1",
        "loc_1",
        "excerpt_2",
        "loc_2",
        "excerpt_3",
        "loc_3",
    ],
)

oz_extractor_results_df = preprocess_oz_extractor_results(oz_extractor_results_df)

In [None]:
oz_extractor_results_df = precompute_text_locations_in_chunks(
    oz_extractor_results_df, use_expected_settings=False
)

In [None]:
# Debug
q_row = oz_extractor_results_df.iloc[192]  # Use iloc instead of filtering by question_number
test_excerpt = q_row['excerpt']

print("=" * 80)
print(f"DEBUGGING ROW AT INDEX 192 (Question #{q_row['question_number']})")
print("=" * 80)
print(f"Excerpt text: {test_excerpt[:200]}...")
print(f"\nNormalized: {normalize_text(test_excerpt)[:200]}...")

# Check which pickle files exist
import glob
pickle_files = glob.glob("../temp/*.pkl")
print(f"\nAvailable pickle files: {[os.path.basename(f) for f in pickle_files]}")

# Manually run the search with debug output
from utils.chunk_matching import find_chunk_locations_with_continuity

for pickle_path in pickle_files:
    print(f"\n--- Searching in {os.path.basename(pickle_path)} ---")
    result = find_chunk_locations_with_continuity(pickle_path, test_excerpt)
    if result:
        print(f"FOUND! Result: {result}")
        break
    else:
        print("Not found in this file")

In [None]:
oz_extractor_results_df.to_csv(
    "oz_extractor_5.1_1_is_match_results_CONT_CHECKED_1.csv", index=False
)

In [None]:
oz_extractor_results_df = pd.read_csv("full_continuous_results.csv")

In [None]:
combined_results_df = merge_model_and_expected_data(
    oz_extractor_results_df, readiscovers_app_results_df
)

In [None]:
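# Questions present in only one of the two sets (symmetric difference).
# `combined_results_df` is built from the dev run, so compare against the dev selection.
set(combined_results_df['question'].to_list()) ^ set(selected_questions_dev['Question'].str.strip().to_list())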

In [None]:
# Count questions with at least one correct match
questions_with_correct_match = combined_results_df[
    combined_results_df["correct_match"] == True
]["question"].nunique()

# Total questions
total_questions = combined_results_df["question"].nunique()

# Calculate percentage
percentage = (questions_with_correct_match / total_questions) * 100

print(f"Total unique questions: {total_questions}")
print(f"Questions with at least one correct match: {questions_with_correct_match}")
print(
    f"Questions without any correct match: {total_questions - questions_with_correct_match}"
)
print(f"Correct match rate: {percentage:.1f}%")

In [None]:
# Total unique questions in the dataset
total_questions = oz_extractor_results_df["question_number"].nunique()

# Questions with at least one match
num_questions_with_match = oz_extractor_results_df[
    oz_extractor_results_df["model_chunk_index"].notnull()
]["question_number"].nunique()

# Calculate percentage
percentage = (num_questions_with_match / total_questions) * 100

print(f"Total unique questions: {total_questions}")
print(f"Questions with at least one match: {num_questions_with_match}")
print(f"Questions without any match: {total_questions - num_questions_with_match}")
print(f"Match rate: {percentage:.1f}%")