# Statistical Analysis of Empiricist Experiment Data

In [10]:
import json
import statistics

The cell below parses the original analysis_results.jsonl and extracts the o1 high and o1 low results into separate files. I assume these results were produced with Sonnet 3.5 as the judge model.

In [20]:
official_results = '../analysis_results.jsonl'

def extract_model_results(source_path, model_name, output_path):
    """Collect the records for one model and save them as JSONL for later."""
    results = []
    with open(source_path, 'r') as f:
        for line in f:
            data = json.loads(line)
            if data['model'] == model_name:
                results.append(data)

    # save to file for later
    with open(output_path, 'w') as f:
        for result in results:
            f.write(json.dumps(result) + '\n')

    return results

# parse for o1 high results
o1_high_results = extract_model_results(
    official_results,
    'o1_maxiter_30_N_v0.20.0-no-hint-run_1',
    'official/o1_high_results.jsonl',
)

# parse for o1 low results
o1_low_results = extract_model_results(
    official_results,
    'o1_low_maxiter_30_N_v0.20.0-no-hint-run_1',
    'official/o1_low_results.jsonl',
)

len(o1_high_results), len(o1_low_results)

(200, 199)

A quick helper to calculate the mean and sample standard deviation of the overthinking scores in a given file.

In [22]:
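def calc_stats(filepath, sample_size=None):
    """Return (mean, sample standard deviation, count) of the overthinking scores.

    If sample_size is given, only the first sample_size records in the file
    are used. This is a prefix of the file, not a random sample.
    """
    scores = []

    # Each line of the JSONL file is one evaluated record
    with open(filepath, 'r') as f:
        for line in f:
            data = json.loads(line)
            scores.append(float(data['overthinking_score']))

    # Keep only the first sample_size scores, in file order
    if sample_size is not None:
        scores = scores[:sample_size]

    avg = statistics.mean(scores)
    # statistics.stdev is the sample standard deviation (n - 1 denominator)
    # and raises StatisticsError if fewer than two scores are available
    std_dev = statistics.stdev(scores)

    return avg, std_dev, len(scores)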
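As a small illustration of what `statistics.stdev` reports, the sketch below compares it with `statistics.pstdev` on a toy list. The three scores are hypothetical values chosen for easy arithmetic and are not taken from the experiment data.

```python
import statistics

# Hypothetical toy scores, NOT taken from the experiment files.
toy_scores = [0.0, 2.0, 4.0]

toy_mean = statistics.mean(toy_scores)
sample_sd = statistics.stdev(toy_scores)       # divides by n - 1
population_sd = statistics.pstdev(toy_scores)  # divides by n

print(f"mean = {toy_mean:.2f}")
print(f"sample sd = {sample_sd:.2f}, population sd = {population_sd:.2f}")
```

Because `calc_stats` uses `statistics.stdev`, the spreads it reports are sample standard deviations, which are slightly larger than the population version for the same data.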

Next, I analyze both the official and my_eval files for the o1 high and o1 low results to check the author's claim that overthinking is more prevalent in the o1_low model than in the o1_high model. Because the my_eval files contain limited data, I restrict every file to its first 50 records.

The my_eval files were evaluated with Anthropic's Haiku 3.5 model instead of Sonnet 3.5, to cut costs and speed up the evaluation process. I assumed that Haiku would perform similarly to Sonnet as an LLM-as-a-judge.

In [28]:
# Analyze files
files = [
    'official/o1_high_results.jsonl',
    'official/o1_low_results.jsonl',
    'my_eval/o1_high_overthinking.jsonl',
    'my_eval/o1_low_overthinking.jsonl'
]

for filepath in files:
    # The returned count is not needed for this printout
    avg, std_dev, _ = calc_stats(filepath, sample_size=50)
    print(f"{filepath}: {avg:.2f} ± {std_dev:.2f}")

official/o1_high_results.jsonl: 1.06 ± 0.89
official/o1_low_results.jsonl: 3.46 ± 3.57
my_eval/o1_high_overthinking.jsonl: 2.98 ± 2.51
my_eval/o1_low_overthinking.jsonl: 3.00 ± 2.16
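To put the gaps between these means in context, one option is a Welch t-statistic computed directly from the summary numbers printed above. This is a rough sketch and not part of the original analysis. It assumes each group really contains 50 scores (that is, every file has at least 50 records), that the scores are independent, and it uses the rounded means and standard deviations as printed. The helper name `welch_t` is my own.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t-statistic for two independent samples, from summary statistics."""
    # Standard error of the difference in means, without assuming equal variances
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error

# Assumption: each group holds 50 scores, matching sample_size=50 above.
N = 50

# o1_low versus o1_high, using the rounded values from the printed output
t_official = welch_t(3.46, 3.57, N, 1.06, 0.89, N)
t_my_eval = welch_t(3.00, 2.16, N, 2.98, 2.51, N)

print(f"official: t = {t_official:.2f}")
print(f"my_eval:  t = {t_my_eval:.2f}")
```

Under these assumptions, the official gap is several standard errors wide, while the my_eval gap is a small fraction of one standard error. Turning the statistic into a p-value requires the t distribution, which the standard library does not provide. SciPy's `scipy.stats.ttest_ind_from_stats` with `equal_var=False` is one way to get it.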
