# Analysis of Overthinking vs Token Usage

In [2]:
import json
import statistics

First, I parse the original `analysis_results.jsonl` and extract the o1 high and o1 low results into separate files. I assume these results were produced with Sonnet 3.5 as the judge model.

In [3]:
official_results = '../analysis_results.jsonl'

# model identifiers for the two o1 runs in the official results
o1_high_model = 'o1_maxiter_30_N_v0.20.0-no-hint-run_1'
o1_low_model = 'o1_low_maxiter_30_N_v0.20.0-no-hint-run_1'

# collect o1 high and o1 low results in a single pass over the file
o1_high_results = []
o1_low_results = []
with open(official_results, 'r') as f:
    for line in f:
        data = json.loads(line)
        if data['model'] == o1_high_model:
            o1_high_results.append(data)
        elif data['model'] == o1_low_model:
            o1_low_results.append(data)

# save to file for later
with open('official/o1_high_results.jsonl', 'w') as f:
    for result in o1_high_results:
        f.write(json.dumps(result) + '\n')

with open('official/o1_low_results.jsonl', 'w') as f:
    for result in o1_low_results:
        f.write(json.dumps(result) + '\n')

len(o1_high_results), len(o1_low_results)

(200, 199)

A quick function to calculate the mean and sample standard deviation of the overthinking scores in a given file. If `sample_size` is set, only the first `sample_size` scores in the file are used.

In [4]:
def calc_stats(filepath, sample_size=None):
    scores = []

    with open(filepath, 'r') as f:
        for line in f:
            data = json.loads(line)
            scores.append(float(data['overthinking_score']))

    if sample_size is not None:
        scores = scores[:sample_size]

    avg = statistics.mean(scores)
    std_dev = statistics.stdev(scores)

    return avg, std_dev, len(scores)

Next, I analyze both the official and my_eval files for the o1 high and o1 low results to check the author's claim that overthinking is more prevalent in the o1_low model than in the o1_high model. I use a sample size of 50 because the my_eval files contain limited data.

The my_eval files were evaluated using Anthropic's Haiku 3.5 model instead of Sonnet 3.5. I used Haiku to cut costs and speed up the evaluation process, on the assumption that it would perform similarly to Sonnet as an LLM-as-a-judge.

In [8]:
# Analyze files
files = [
    'official/o1_high_results.jsonl',
    'official/o1_low_results.jsonl',
    'my_eval/o1_high_overthinking.jsonl',
    'my_eval/o1_low_overthinking.jsonl'
]

for filepath in files:
    avg, std_dev, count = calc_stats(filepath, sample_size=50)
    print(f"{filepath}: {avg:.02f} ± {std_dev:.02f}")

official/o1_high_results.jsonl: 1.06 ± 0.89
official/o1_low_results.jsonl: 3.46 ± 3.57
my_eval/o1_high_overthinking.jsonl: 2.98 ± 2.51
my_eval/o1_low_overthinking.jsonl: 3.00 ± 2.16
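
The my_eval averages differ noticeably from the official ones (for example 2.98 vs 1.06 for o1 high), so the assumption that Haiku and Sonnet score similarly may be worth checking directly. Below is a minimal sketch of one way to do that: pair the two judges' scores on the instances they both scored and look at the differences. The `instance_id` key is an assumption about the record layout (it is not confirmed by the files above), and both helper names are hypothetical.

```python
import json
import statistics


def load_scores_by_id(filepath, id_key="instance_id"):
    """Map each record's identifier to its overthinking score.

    `id_key` is an assumed field name; change it to match the real files.
    """
    scores = {}
    with open(filepath, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            scores[data[id_key]] = float(data["overthinking_score"])
    return scores


def judge_agreement(scores_a, scores_b):
    """Compare two judges on the instances that both of them scored."""
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("the two files share no instances")
    diffs = [scores_a[i] - scores_b[i] for i in shared]
    return {
        "n_shared": len(shared),
        "mean_diff": statistics.mean(diffs),
        "mean_abs_diff": statistics.mean(abs(d) for d in diffs),
    }
```

For example, `judge_agreement(load_scores_by_id('official/o1_high_results.jsonl'), load_scores_by_id('my_eval/o1_high_overthinking.jsonl'))` would report how far apart the two judges are on the same problems. The `calc_stats` comparison above cannot show this, because it takes the first 50 records of each file and those may not be the same instances.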


The next step is to evaluate overthinking scores for a different model. I considered several ways to force "low reasoning effort" and "high reasoning effort". At first, I attempted to use Deepseek-R1-Distill-Qwen-14B, since the model's trajectories are already provided in the repository. I treated these trajectories as "high reasoning effort" because I assumed they were generated with an unlimited token budget. My plan was to run Deepseek-R1-Distill-Qwen-14B with max_token set to produce my "low reasoning effort" trajectories.

Since I was going to use the OpenHands framework to evaluate the model on SWE-Bench, I had to set this up first. However, this is where I hit my first major roadblock. Because OpenHands relies on Docker, I couldn't run it on KOA (HPC server), meaning I needed to host the model remotely and then run the OpenHands eval locally on my computer. Figuring out how to host the model on the server and expose it to my computer was a bit too difficult for me, so I looked into alternative models with an API that I could run the benchmark on instead. This led me to choose Gemini-2.5-Flash. The Gemini API allows you to set "low reasoning effort" and "high reasoning effort", which was perfect for my experiment!

Now that I had the model set up, I just needed to run it on SWE-Bench. This is where I hit my second major roadblock. I needed to set up OpenHands locally on my computer, but my Docker builds kept crashing on my Mac. I suspected this was because the Docker images were built for x86_64 while my Mac uses an M2 chip. Luckily, I have an old Windows PC with WSL set up at my parents' house, so I planned to use it over Thanksgiving. There were still issues with Docker builds failing due to resource constraints, but with some adjustments I was able to successfully run evaluations for Gemini.

I ran the evaluations with the following commands on OpenHands:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gemini_high HEAD CodeActAgent 10 10 1 princeton-nlp/SWE-bench_Verified test

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gemini_low HEAD CodeActAgent 10 10 1 princeton-nlp/SWE-bench_Verified test
```
Here the positional arguments `10 10 1` set `eval_limit=10`, `max_iter=10`, and `num_workers=1`. However, due to time constraints, I stopped each run early, after 5 SWE-Bench problems per setting.

Now I can analyze the results from Gemini-2.5-Flash for both low reasoning effort and high reasoning effort settings.

In [9]:
# Analyze files
files = [
    'my_eval/gemini_high_overthinking.jsonl',
    'my_eval/gemini_low_overthinking.jsonl'
]

for filepath in files:
    avg, std_dev, count = calc_stats(filepath, sample_size=5)
    print(f"{filepath}: {avg:.02f} ± {std_dev:.02f}")

my_eval/gemini_high_overthinking.jsonl: 4.80 ± 1.79
my_eval/gemini_low_overthinking.jsonl: 5.60 ± 2.19
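
With only 5 problems per setting, it helps to check how large the gap between the two means is relative to their spread. The sketch below computes Welch's t statistic (a standard two-sample comparison that does not assume equal variances) directly from the summary values printed above. The function name is hypothetical, and the inputs are the rounded means and standard deviations, so the result is approximate.

```python
import math


def welch_t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Takes sample means, sample standard deviations, and sample sizes.
    Both samples need n >= 2 and at least one non-zero standard deviation.
    """
    se2_a = sd_a ** 2 / n_a  # squared standard error of mean_a
    se2_b = sd_b ** 2 / n_b  # squared standard error of mean_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Gemini high vs low, using the rounded values printed above
t, df = welch_t_from_summary(4.80, 1.79, 5, 5.60, 2.19, 5)
print(f"t = {t:.2f}, df = {df:.1f}")
```

For these inputs, t comes out at about -0.63 with roughly 7.7 degrees of freedom, which is far from conventional significance thresholds. The direction of the difference matches the claim, but 5 samples per setting are not enough to separate the two settings statistically.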


Cool! So my experiment with Gemini-2.5-Flash is consistent with the author's claim that overthinking is more prevalent in the "low reasoning effort" setting than in the "high reasoning effort" setting, although with only 5 problems per setting the gap (5.60 vs 4.80) is smaller than one standard deviation, so this is weak evidence rather than a confirmation.

But how does Gemini configure "low reasoning effort" and "high reasoning effort"? Gemini's API documentation states that the "low" and "high" reasoning settings correspond to the `ThinkingBudget` parameter and map to 1024 and 24576 tokens respectively. The docs don't provide much more detail than that, but some [research](https://www.reddit.com/r/Bard/comments/1lq4llt/how_does_geminis_thinkingbudget_actually_work/) suggests that the `ThinkingBudget` parameter acts as a soft guide rather than an abrupt cutoff, so the model tries to finish its thinking within the token budget if possible. This might make the model rush through its steps in the "low" setting.

I also analyzed the LLM judge's explanations for each overthinking score. In the "low" reasoning settings for both o1 and Gemini, the judge cited "rogue actions" more often than in the "high" reasoning counterparts. This suggests that the models were stringing together quick actions to reach a solution, which shows up as a higher overthinking score.
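
The comparison of how often the judge mentions "rogue actions" can be approximated with a simple phrase count. This is a minimal sketch, not the exact analysis I ran: the `explanation` key is an assumed name for the field holding the judge's free-text reasoning, and `phrase_rate` is a hypothetical helper.

```python
import json


def phrase_rate(filepath, phrase="rogue action", text_key="explanation"):
    """Fraction of records whose judge explanation mentions `phrase`.

    `text_key` is an assumed field name; records without it count as misses.
    Matching is a case-insensitive substring test, so "rogue action" also
    matches "Rogue Actions".
    """
    total = 0
    hits = 0
    with open(filepath, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            total += 1
            if phrase.lower() in str(data.get(text_key, "")).lower():
                hits += 1
    return hits / total if total else 0.0
```

Calling `phrase_rate` on the "low" and "high" files for the same model gives two rates that can be compared side by side. A substring match is crude (it cannot tell "no rogue actions" from a real citation), so reading a sample of the matched explanations is still worthwhile.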