
Add baseline evaluation scores (34.2% on k=1)#1

Merged
jonasneves merged 2 commits into main from results/baseline
Mar 10, 2026

Conversation

@jonasneves
Collaborator

Summary

Baseline evaluation of LFM2.5-1.2B-Instruct on the held-out StepGame eval set (500 examples) before any fine-tuning.

Changes

  • results/baseline/scores.json — accuracy scores from notebook 01

Results

| Split   | Accuracy |
|---------|----------|
| Overall | 34.2%    |
| k=1     | 34.2%    |

Note: the eval set currently contains only k=1 examples, so the overall and k=1 scores are identical. Consider re-running notebook 02 with a stratified sample across k values to get a per-hop breakdown.
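A stratified re-sample of the eval set could look like the sketch below. The function name, the `"k"` field on each example, and the default of 50 per level are assumptions, not taken from notebook 02:

```python
import random
from collections import defaultdict

def stratified_eval_sample(examples, per_k=50, k_values=range(1, 6), seed=0):
    """Sample up to per_k examples for each hop level k.

    Assumes each example dict carries its hop count under the "k" key
    (a hypothetical field name). A fixed seed keeps the split reproducible.
    """
    by_k = defaultdict(list)
    for ex in examples:
        by_k[ex["k"]].append(ex)
    rng = random.Random(seed)
    sample = []
    for k in k_values:
        pool = by_k[k]
        sample.extend(rng.sample(pool, min(per_k, len(pool))))
    return sample
```

With 50 examples per level over k=1..5 this yields a 250-example eval set rather than 500 k=1-only examples.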

Checklist items addressed

  • Closes EVAL1 (metric defined)
  • Closes EVAL3 (baseline measured before fine-tuning)

@github-actions
Copy link
Copy Markdown
Contributor

PR Review: Baseline Evaluation Scores

Summary

This PR modifies the evaluation dataset to use stratified sampling across k-hop levels (k=1..5) instead of taking the first 500 k=1 examples, and updates .gitignore to allow committing data files for reproducibility. However, the PR removes the baseline scores.json that was added in an earlier commit, leaving no evaluation results committed despite claiming to close EVAL3.

Checklist Items Affected

  • EVAL3 (Baseline performance measured) — REGRESSED: PR deletes scores.json despite description claiming to add baseline scores
  • EVAL4 (Per-category breakdown) — PARTIALLY ADDRESSED: Stratified eval set enables per-hop analysis when baseline is re-run
  • EVAL2 (No data leakage) — MAINTAINED: Properly uses separate train/validation splits

Issues (Blocking)

  1. Critical Inconsistency Between PR Description and Actual Changes

    • Title: "Add baseline evaluation scores (34.2% on k=1)"
    • Description: Claims to close EVAL3 with "results/baseline/scores.json — accuracy scores from notebook 01"
    • Reality: Commit 009aca2 (PR HEAD) deletes scores.json with message "Removes stale scores.json — re-run notebook 01 to get new baseline"
    • Impact: EVAL3 requirement is not satisfied; no baseline scores are committed
  2. Code Comment Inconsistent with Data

    • File: notebooks/02_dataset_prep.ipynb:93
    • Issue: Comment claims EVAL_PER_K = 50 # 50 examples per hop level → 500 total across k=1..10
    • Reality: Actual eval data has 250 examples across k=1..5 (5 levels × 50 = 250)
    • Fix: Update comment to match reality: "250 total across k=1..5"

Suggestions (Non-blocking)

  1. Gitignore Strategy Risk: Consider documenting rationale for tracking data/eval/ and adding safeguards against large file commits

  2. Per-Hop Accuracy Keys: When re-running baseline, ensure scoring generates accuracy_k1 through accuracy_k5 keys for EVAL4 compliance

  3. Train/Eval Split Documentation: Consider documenting upstream dataset split sizes and sampling methodology
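For suggestion 2, the per-hop keys could be produced along these lines. This is a sketch under assumptions: the `(k, correct)` result shape, the function name, and the `accuracy_overall` key are invented here; only the `accuracy_k1`..`accuracy_k5` key pattern comes from the review:

```python
import json

def write_scores(results, path="results/baseline/scores.json"):
    """Write per-hop and overall accuracy to scores.json.

    results: iterable of (k, correct) pairs, where correct is 0 or 1.
    Emits accuracy_k1 .. accuracy_kN keys as required for EVAL4.
    """
    scores = {}
    total_correct = total = 0
    for k in sorted({k for k, _ in results}):
        per_hop = [c for kk, c in results if kk == k]
        scores[f"accuracy_k{k}"] = sum(per_hop) / len(per_hop)
        total_correct += sum(per_hop)
        total += len(per_hop)
    scores["accuracy_overall"] = total_correct / total
    with open(path, "w") as f:
        json.dump(scores, f, indent=2)
    return scores
```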

Verdict: Request Changes

Primary Action Required: Resolve the critical discrepancy between PR description and actual changes. Either:

  1. Option A: Revert commit 009aca2 to keep original k=1-only baseline scores
  2. Option B: Re-run notebook 01 with new stratified eval set, commit updated scores.json, and update PR description
  3. Option C: Update PR description to accurately state this only prepares infrastructure (doesn't claim to close EVAL3)

The PR makes valuable infrastructure improvements but fails its primary stated objective of providing baseline evaluation scores.

Generated by PR Review for issue #1

@github-actions
Contributor

Summary

This PR introduces baseline evaluation infrastructure and stratifies the eval set for better per-hop analysis. The code implements proper LoRA fine-tuning setup with reasonable hyperparameters for T4 compute constraints.

Checklist items affected

✅ Completed:

  • EVAL1 — Evaluation metric clearly defined (accuracy with per-hop breakdown)
  • EVAL2 — Separate datasets maintained (train/eval split, stratified)
  • FT1 — LoRA method documented with proper target modules
  • FT2 — Hyperparameters documented (rank=16, lr=2e-4, batch_size=4×4)
  • FT3 — Training procedure documented in notebook 03
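The FT2 hyperparameters above can be captured in one place as plain data, which also makes the effective batch size explicit. Dictionary key names here are assumptions styled after common trainer arguments, not taken from notebook 03:

```python
# Hedged sketch of the documented FT2 hyperparameters; key names assumed.
train_config = {
    "lora_rank": 16,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,   # batch_size=4×4 from the checklist
}

# Effective batch size = micro-batch × accumulation steps.
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
```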

❌ Missing completion:

  • EVAL3 — Claims baseline measured but results/baseline/scores.json not included in PR

Issues (blocking)

  1. 🚫 Missing baseline results file — PR claims to add results/baseline/scores.json with 34.2% accuracy but file doesn't exist in the diff. Notebook 01 generates this file but it wasn't committed.

  2. ⚠️ Results without execution — PR claims specific accuracy numbers (34.2% overall, k=1) but notebooks show execution_count: null indicating they haven't been run. This violates the "no results without notebook run" requirement (−3 point deduction).
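The `execution_count: null` check in issue 2 can be automated. The `.ipynb` format stores each code cell's `execution_count` in plain JSON, so a review script could verify execution like this (function name is an assumption):

```python
import json

def notebook_was_executed(path):
    """Return True iff every code cell has a non-null execution_count.

    Reads the notebook as plain JSON; markdown cells are skipped because
    they carry no execution_count.
    """
    with open(path) as f:
        nb = json.load(f)
    return all(cell.get("execution_count") is not None
               for cell in nb["cells"]
               if cell.get("cell_type") == "code")
```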

Suggestions (non-blocking)

  1. Model discrepancy — Notebooks reference LFM2.5-1.2B-Instruct but PR title/description mentions -Thinking variant. Consider clarifying which model was actually used.

  2. Evaluation completeness — Current eval set shows only k=1-5 (250 examples total) vs claimed 500 examples across k=1-10. Consider verifying eval set generation in notebook 02.

  3. Gitignore inconsistency — Changed from ignoring data/processed/ to allowing commits "for reproducibility" but then excludes large artifacts. Consider documenting this strategy.
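For suggestion 3, one way to document the strategy is in the `.gitignore` itself. This fragment is a sketch; the exact paths and artifact extensions are assumptions based on the directories mentioned in this PR:

```gitignore
# Ignore generated data by default; regenerate locally from the notebooks.
data/*
# Re-include the small eval split so evaluation is reproducible.
!data/eval/
# Safeguard: large binary artifacts never belong in the repo.
*.safetensors
*.ckpt
```

Note the `data/*` (not `data/`) pattern: git does not descend into a fully ignored directory, so `!data/eval/` only takes effect when the parent's contents, rather than the directory itself, are ignored.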

Verdict: Request Changes

Primary blocker: Missing results/baseline/scores.json file that PR claims to add. Cannot verify baseline performance without actual results data.

Secondary issue: Results claimed without evidence of notebook execution violates submission requirements.

Please run the baseline evaluation notebook and commit the generated scores.json file to complete this PR.

Generated by PR Review for issue #1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jonasneves jonasneves merged commit 9d97e36 into main Mar 10, 2026
4 checks passed
@jonasneves jonasneves deleted the results/baseline branch March 10, 2026 19:28