
Add baseline evaluation scores (34.2% on k=1)#1

Merged
jonasneves merged 2 commits into main from results/baseline
Mar 10, 2026

Conversation

@jonasneves
Collaborator

Summary

Baseline evaluation of LFM2.5-1.2B-Instruct on the held-out StepGame eval set (500 examples) before any fine-tuning.

Changes

  • results/baseline/scores.json — accuracy scores from notebook 01

Results

| Split   | Accuracy |
|---------|----------|
| Overall | 34.2%    |
| k=1     | 34.2%    |

Note: the eval set currently contains only k=1 examples, so the overall and k=1 scores are identical. Consider re-running notebook 02 with a stratified sample across k values to get a per-hop breakdown.
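A stratified re-sample of the eval set could look like the sketch below. The function name, the `"k"` field on each example, and the default of 50 per level are assumptions, not taken from notebook 02:

```python
import random
from collections import defaultdict

def stratified_eval_sample(examples, per_k=50, k_values=range(1, 6), seed=0):
    """Sample up to per_k examples for each hop level k.

    Assumes each example dict carries its hop count under the "k" key
    (a hypothetical field name). A fixed seed keeps the split reproducible.
    """
    by_k = defaultdict(list)
    for ex in examples:
        by_k[ex["k"]].append(ex)
    rng = random.Random(seed)
    sample = []
    for k in k_values:
        pool = by_k[k]
        sample.extend(rng.sample(pool, min(per_k, len(pool))))
    return sample
```

With 50 examples per level over k=1..5 this yields a 250-example eval set rather than 500 k=1-only examples.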

Checklist items addressed

  • Closes EVAL1 (metric defined)
  • Closes EVAL3 (baseline measured before fine-tuning)

@github-actions
Copy link
Copy Markdown
Contributor

PR Review: Baseline Evaluation Scores

Summary

This PR modifies the evaluation dataset to use stratified sampling across k-hop levels (k=1..5) instead of taking the first 500 k=1 examples, and updates .gitignore to allow committing data files for reproducibility. However, the PR removes the baseline scores.json that was added in an earlier commit, leaving no evaluation results committed despite claiming to close EVAL3.

Checklist Items Affected

  • EVAL3 (Baseline performance measured) — REGRESSED: PR deletes scores.json despite description claiming to add baseline scores
  • EVAL4 (Per-category breakdown) — PARTIALLY ADDRESSED: Stratified eval set enables per-hop analysis when baseline is re-run
  • EVAL2 (No data leakage) — MAINTAINED: Properly uses separate train/validation splits

Issues (Blocking)

  1. Critical Inconsistency Between PR Description and Actual Changes

    • Title: "Add baseline evaluation scores (34.2% on k=1)"
    • Description: Claims to close EVAL3 with "results/baseline/scores.json — accuracy scores from notebook 01"
    • Reality: Commit 009aca2 (PR HEAD) deletes scores.json with message "Removes stale scores.json — re-run notebook 01 to get new baseline"
    • Impact: EVAL3 requirement is not satisfied; no baseline scores are committed
  2. Code Comment Inconsistent with Data

    • File: notebooks/02_dataset_prep.ipynb:93
    • Issue: Comment claims EVAL_PER_K = 50 # 50 examples per hop level → 500 total across k=1..10
    • Reality: Actual eval data has 250 examples across k=1..5 (5 levels × 50 = 250)
    • Fix: Update comment to match reality: "250 total across k=1..5"

Suggestions (Non-blocking)

  1. Gitignore Strategy Risk: Consider documenting rationale for tracking data/eval/ and adding safeguards against large file commits

  2. Per-Hop Accuracy Keys: When re-running baseline, ensure scoring generates accuracy_k1 through accuracy_k5 keys for EVAL4 compliance

  3. Train/Eval Split Documentation: Consider documenting upstream dataset split sizes and sampling methodology
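For suggestion 2, the per-hop keys could be produced along these lines. This is a sketch under assumptions: the `(k, correct)` result shape, the function name, and the `accuracy_overall` key are invented here; only the `accuracy_k1`..`accuracy_k5` key pattern comes from the review:

```python
import json

def write_scores(results, path="results/baseline/scores.json"):
    """Write per-hop and overall accuracy to scores.json.

    results: iterable of (k, correct) pairs, where correct is 0 or 1.
    Emits accuracy_k1 .. accuracy_kN keys as required for EVAL4.
    """
    scores = {}
    total_correct = total = 0
    for k in sorted({k for k, _ in results}):
        per_hop = [c for kk, c in results if kk == k]
        scores[f"accuracy_k{k}"] = sum(per_hop) / len(per_hop)
        total_correct += sum(per_hop)
        total += len(per_hop)
    scores["accuracy_overall"] = total_correct / total
    with open(path, "w") as f:
        json.dump(scores, f, indent=2)
    return scores
```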

Verdict: Request Changes

Primary Action Required: Resolve the critical discrepancy between PR description and actual changes. Either:

  1. Option A: Revert commit 009aca2 to keep original k=1-only baseline scores
  2. Option B: Re-run notebook 01 with new stratified eval set, commit updated scores.json, and update PR description
  3. Option C: Update PR description to accurately state this only prepares infrastructure (doesn't claim to close EVAL3)

The PR makes valuable infrastructure improvements but fails its primary stated objective of providing baseline evaluation scores.

Generated by PR Review for issue #1

@github-actions
Contributor

Summary

This PR introduces baseline evaluation infrastructure and stratifies the eval set for better per-hop analysis. The code implements proper LoRA fine-tuning setup with reasonable hyperparameters for T4 compute constraints.

Checklist items affected

✅ Completed:

  • EVAL1 — Evaluation metric clearly defined (accuracy with per-hop breakdown)
  • EVAL2 — Separate datasets maintained (train/eval split, stratified)
  • FT1 — LoRA method documented with proper target modules
  • FT2 — Hyperparameters documented (rank=16, lr=2e-4, batch_size=4×4)
  • FT3 — Training procedure documented in notebook 03
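The FT2 hyperparameters above can be captured in one place as plain data, which also makes the effective batch size explicit. Dictionary key names here are assumptions styled after common trainer arguments, not taken from notebook 03:

```python
# Hedged sketch of the documented FT2 hyperparameters; key names assumed.
train_config = {
    "lora_rank": 16,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,   # batch_size=4×4 from the checklist
}

# Effective batch size = micro-batch × accumulation steps.
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
```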

❌ Missing completion:

  • EVAL3 — Claims baseline measured but results/baseline/scores.json not included in PR

Issues (blocking)

  1. 🚫 Missing baseline results file — PR claims to add results/baseline/scores.json with 34.2% accuracy but file doesn't exist in the diff. Notebook 01 generates this file but it wasn't committed.

  2. ⚠️ Results without execution — PR claims specific accuracy numbers (34.2% overall, k=1) but notebooks show execution_count: null indicating they haven't been run. This violates the "no results without notebook run" requirement (−3 point deduction).
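The `execution_count: null` check in issue 2 can be automated. The `.ipynb` format stores each code cell's `execution_count` in plain JSON, so a review script could verify execution like this (function name is an assumption):

```python
import json

def notebook_was_executed(path):
    """Return True iff every code cell has a non-null execution_count.

    Reads the notebook as plain JSON; markdown cells are skipped because
    they carry no execution_count.
    """
    with open(path) as f:
        nb = json.load(f)
    return all(cell.get("execution_count") is not None
               for cell in nb["cells"]
               if cell.get("cell_type") == "code")
```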

Suggestions (non-blocking)

  1. Model discrepancy — Notebooks reference LFM2.5-1.2B-Instruct but PR title/description mentions -Thinking variant. Consider clarifying which model was actually used.

  2. Evaluation completeness — Current eval set shows only k=1-5 (250 examples total) vs claimed 500 examples across k=1-10. Consider verifying eval set generation in notebook 02.

  3. Gitignore inconsistency — Changed from ignoring data/processed/ to allowing commits "for reproducibility" but then excludes large artifacts. Consider documenting this strategy.
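For suggestion 3, one way to document the strategy is in the `.gitignore` itself. This fragment is a sketch; the exact paths and artifact extensions are assumptions based on the directories mentioned in this PR:

```gitignore
# Ignore generated data by default; regenerate locally from the notebooks.
data/*
# Re-include the small eval split so evaluation is reproducible.
!data/eval/
# Safeguard: large binary artifacts never belong in the repo.
*.safetensors
*.ckpt
```

Note the `data/*` (not `data/`) pattern: git does not descend into a fully ignored directory, so `!data/eval/` only takes effect when the parent's contents, rather than the directory itself, are ignored.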

Verdict: Request Changes

Primary blocker: Missing results/baseline/scores.json file that PR claims to add. Cannot verify baseline performance without actual results data.

Secondary issue: Results claimed without evidence of notebook execution violates submission requirements.

Please run the baseline evaluation notebook and commit the generated scores.json file to complete this PR.

Generated by PR Review for issue #1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jonasneves jonasneves merged commit 9d97e36 into main Mar 10, 2026
4 checks passed
@jonasneves jonasneves deleted the results/baseline branch March 10, 2026 19:28