# Report (living draft)

## Project Scope (General)

1. Research Objective   
The central goal of this research is to investigate whether the correctness of a language model's reasoning trace can be predicted from simple, interpretable signals. We hypothesize that observable properties of a model's generated text, such as its length, structure, and semantic content, can serve as effective proxies for logical validity, particularly in formal domains like competitive mathematics. Our primary objective is to establish strong, reproducible baselines for this prediction task before exploring more complex, model-internal features.

2. Experimental Domain & Datasets   
Initial Domain: We begin our investigation with the AIME (American Invitational Mathematics Examination) dataset, covering problems from 1983–2024. This provides a rich, structured environment of formal reasoning problems with unambiguous ground-truth answers.     
Future Work: To test the generality of our findings, we plan to extend the analysis to the GPQA (Graduate-Level Google-Proof Q&A) dataset, which represents a different and more complex reasoning domain.

3. Models & Methodology   
Model Selection: To ensure a controlled and reproducible experimental setup, we will begin by focusing on small, open-source language models in the 1–4 billion parameter range. Our initial baseline will be established using the DeepSeek-R1-Distill-Qwen-1.5B model. We will subsequently introduce 1–2 additional models of a comparable size to measure the consistency of our findings across different architectures.    
Data Generation: For each problem in our dataset, we will generate a set of at least 8 independent reasoning traces to ensure a robust sample size for our analysis. All generated data will be stored in a standardized JSON format to facilitate downstream processing.



## Weekly Progress Report: AIME Data Cleaning & Initial Signal Analysis


This week, the AIME dataset was fully cleaned, standardized, and prepared for analysis. A simple quality gate was enforced on all generated reasoning traces, and uniform coverage was established by capping generations at 8 per problem. High-level figures were regenerated based on this clean data. The most significant early signal is that reasoning length (in tokens) strongly correlates with correctness, making it a powerful candidate for our initial baseline models. Logprobs were successfully used as a quality filter to remove degenerate traces, but are not being used as a primary modeling feature in this initial phase. The net result is that the AIME dataset is now analysis-ready, with consistent data quality across all years and variants.

**What I Did This Week** 

**Standardized Data**: Enforced a canonical key for all problems: (year, variant ∈ {I, II}, problem 1–15) and a consistent JSON schema for all outputs.  
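
As an illustration of the canonical key, here is a minimal sketch of how a record could be validated and serialized. The class name `ProblemKey`, the helper `to_record`, and the field names `trace` and `is_correct` are assumptions made for this example; the report fixes only the key `(year, variant, problem)` and does not spell out the rest of the schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProblemKey:
    """Canonical problem identifier: (year, variant, problem)."""
    year: int
    variant: str  # "I" or "II"
    problem: int  # 1 through 15

    def __post_init__(self):
        # Reject keys that fall outside the ranges stated in the report.
        if self.variant not in {"I", "II"}:
            raise ValueError(f"variant must be 'I' or 'II', got {self.variant!r}")
        if not 1 <= self.problem <= 15:
            raise ValueError(f"problem must be in 1..15, got {self.problem}")


def to_record(key, trace_text, is_correct):
    """Serialize one trace as a JSON string (field names are hypothetical)."""
    record = {**asdict(key), "trace": trace_text, "is_correct": is_correct}
    return json.dumps(record, sort_keys=True)


key = ProblemKey(year=2024, variant="II", problem=7)
print(to_record(key, "example trace text", True))
```

A frozen dataclass is hashable, so the same key can also be used directly as a dictionary key when grouping traces by problem.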

**Implemented a Quality Gate**: Kept only reasoning traces with avg_neg_logprob > 0, total_neg_logprob > 0, and a non-empty token_neg_logprobs list, which removes empty or degenerate traces.  
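
The three conditions above can be sketched as a single predicate. This assumes each trace is a dict that uses the field names from the report; the function name `passes_quality_gate` and the toy traces are illustrative only, not taken from the pipeline.

```python
def passes_quality_gate(trace):
    """Return True if a trace dict meets all three quality-gate conditions."""
    token_nlps = trace.get("token_neg_logprobs") or []
    avg_nlp = trace.get("avg_neg_logprob") or 0.0
    total_nlp = trace.get("total_neg_logprob") or 0.0
    # Missing or None fields are treated as failures rather than raising.
    return len(token_nlps) > 0 and avg_nlp > 0 and total_nlp > 0


# Toy traces for illustration (not real data from the run).
good = {"avg_neg_logprob": 0.4, "total_neg_logprob": 1.2,
        "token_neg_logprobs": [0.3, 0.5, 0.4]}
empty = {"avg_neg_logprob": 0.0, "total_neg_logprob": 0.0,
         "token_neg_logprobs": []}

kept = [t for t in (good, empty) if passes_quality_gate(t)]
print(len(kept))  # 1
```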

**Ensured Uniform Coverage**: Capped generations at a maximum of 8 per problem after the quality control step, ensuring a balanced dataset.
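
A minimal sketch of the capping step, assuming traces are dicts carrying the canonical key fields. Keeping the first 8 traces in input order is an assumption of this example; the report does not say how the retained traces are chosen.

```python
from collections import defaultdict

MAX_PER_PROBLEM = 8  # cap stated in the report


def cap_generations(traces, max_per_problem=MAX_PER_PROBLEM):
    """Keep at most `max_per_problem` traces per problem, preserving input order."""
    counts = defaultdict(int)
    kept = []
    for trace in traces:
        key = (trace["year"], trace["variant"], trace["problem"])
        if counts[key] < max_per_problem:
            counts[key] += 1
            kept.append(trace)
    return kept


# Toy input: 10 traces for one problem, 3 for another.
traces = [{"year": 2024, "variant": "I", "problem": 1, "idx": i} for i in range(10)]
traces += [{"year": 2024, "variant": "I", "problem": 2, "idx": i} for i in range(3)]
print(len(cap_generations(traces)))  # 11
```

Note that a cap only bounds the count from above: a problem with fewer than 8 surviving traces stays below 8 unless more traces are generated.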

**Generated Reports & Figures**: Created CSV summaries and initial plots for rapid inspection and signal analysis.  

**Conducted a Data Audit**: Verified that an AIME II variant exists for every post-2000 year and confirmed that any "missing" problems in our set correspond to known gaps in the source dataset, not pipeline errors. 


**Quick Numbers (Week 01 AIME Run)**

Total Generations after QC & Capping: **7,744**

Correct Answers: **2,804**

Incorrect Answers: **4,818**

Unknown (Missing Ground-Truth or Extracted Answer): **122**

**Accuracy on Labeled Data: 36.79%**

Note: "Unknown" rows are kept for statistical analysis but are excluded from accuracy calculations.
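
For reference, the accuracy figure follows directly from the counts above. The short check below reproduces it, with "Unknown" rows excluded from the denominator as described in the note.

```python
correct, incorrect, unknown = 2804, 4818, 122

total = correct + incorrect + unknown
labeled = correct + incorrect  # "Unknown" rows are excluded from accuracy
accuracy = correct / labeled

print(total)              # 7744
print(labeled)            # 7622
print(f"{accuracy:.2%}")  # 36.79%
```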


**Key Figures & Takeaways**

1. **Token Length by Correctness**
![Token Length by Correctness](../figures/week01/01_tok_len_boxplot.png)

**Takeaway**: There is a clear and strong signal: **longer reasoning traces tend to be correct more often.** This makes token count a simple, transparent, and powerful feature for our first baseline classifiers. 
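
One way to quantify how well a single feature such as token count separates the two classes is AUROC: the probability that a randomly chosen correct trace has a higher score than a randomly chosen incorrect one. The sketch below is a plain-Python illustration; the function name `auroc` and the toy values are assumptions for this example and are not results from the run.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a correct trace > score of an incorrect one).

    Ties contribute 0.5. Runs in O(n_pos * n_neg), which is adequate for a
    few thousand traces.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trace")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy token counts and correctness labels (illustrative only).
token_counts = [100, 200, 300, 400]
is_correct = [True, False, False, True]
print(auroc(token_counts, is_correct))  # 0.5
```

A value near 0.5 means the feature carries no ranking information, while values near 1.0 (or near 0.0, for an inverted relationship) indicate strong separation.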

2. **Mean Negative Logprob by Correctness**
![Mean −log p — Correct](../figures/week01/02_mean_nlp_hist_correct.png)
![Mean −log p — Incorrect](../figures/week01/03_mean_nlp_hist_incorrect.png)

**Takeaway**: While logprobs were essential for the quality control phase to filter out empty or degenerate traces, they do not show a strong separation between correct and incorrect answers in this initial view. They will be held in reserve as a potential feature for more advanced models later.


**Key Insights & Implications (So Far)**

We now have a **clean, uniform AIME dataset,** which puts all future comparisons and analyses on a consistent footing.

**Reasoning length is a powerful initial signal** for correctness and will be the primary feature for our first baseline models.

The logprobs have successfully served their purpose as a data cleaning filter, which was a critical step for ensuring the quality of our dataset.


**Action Plan for Next Week**

**Add a Second Small Model:** Replicate the exact same data generation and cleaning pipeline for a second, comparable model.

**Deepen the Feature Analysis:** The initial finding is that response_length correlates with correctness. The immediate next step is to investigate this relationship further and identify other simple textual cues. I will write a new analysis notebook to systematically extract and evaluate features such as:

- The frequency of specific logical connective keywords (e.g., "therefore," "hence," "thus").
- The count of numbers, equations, and special symbols.
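
A first pass at these features could look like the sketch below. The feature names, the regular expressions, and the use of `=` as a crude proxy for "equations" are all assumptions for illustration; the actual notebook may define them differently.

```python
import re

KEYWORDS = ("therefore", "hence", "thus")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")     # integers and simple decimals
SYMBOL_RE = re.compile(r"[=+\-*/^<>≤≥√∑]")   # small, hypothetical symbol set


def extract_features(text):
    """Count keywords, numbers, '=' signs, and symbols in one reasoning trace."""
    lowered = text.lower()
    features = {
        f"kw_{word}": len(re.findall(rf"\b{word}\b", lowered))
        for word in KEYWORDS
    }
    features["n_numbers"] = len(NUMBER_RE.findall(text))
    features["n_equals"] = text.count("=")  # crude proxy for equations
    features["n_symbols"] = len(SYMBOL_RE.findall(text))
    return features


sample = "Thus x = 3. Therefore 2x + 1 = 7, hence done."
print(extract_features(sample))
```

Word-boundary matching keeps a keyword like "thus" from being counted inside longer words, and returning a flat dict makes it straightforward to write the features to one CSV row per trace.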


**Update Notebooks & Slides:** Keep report.ipynb and the weekly slide deck up to date with these new findings.
