# Citations Prediction Pipeline

**Author**: [Your Name]

**Date**: [Date]

**Pipeline ID**: `arxiv_citations_v1` (or your chosen ID)

---

## Pipeline Overview

This notebook implements a complete pipeline for generating citation prediction questions from arXiv papers.

**High-level approach**:
- [Describe your approach here]
- [arXiv categories selected and why]
- [Paper pairing strategy]
- [Citation data source]

**Key design decisions**:
- [Major decision 1 and rationale]
- [Major decision 2 and rationale]
- [etc.]


In [None]:
# Imports
import pandas as pd
import json
from pathlib import Path
from datetime import datetime
from typing import List, Tuple  # used by the pair_papers signature in section 4

from src.data_classes import ForecastingQuestion, ArxivPaper

# TODO: Add your additional imports for scraping, citation APIs, etc.
# Example: import arxiv, requests, etc.

## 1. Data Collection

Describe your data collection approach:
- Which arXiv categories?
- What time period?
- How many papers collected?
- Any initial filtering?
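
A minimal sketch of the collection step, assuming the public arXiv API query endpoint and its Atom feed response. The helper names `build_arxiv_query_url` and `parse_arxiv_feed` and the dict keys are illustrative, not part of `src`. Full-text extraction and conversion to `ArxivPaper` are left out because they depend on your own tooling.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_query_url(categories, start_date, end_date, start=0, max_results=100):
    """Build a query URL for papers in `categories` submitted in a date window.

    Dates are 'YYYY-MM-DD' strings; the submittedDate filter expects YYYYMMDDHHMM.
    """
    cat_clause = " OR ".join(f"cat:{c}" for c in categories)
    lo = start_date.replace("-", "") + "0000"
    hi = end_date.replace("-", "") + "0000"
    params = {
        "search_query": f"({cat_clause}) AND submittedDate:[{lo} TO {hi}]",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "ascending",
    }
    return f"{ARXIV_API_URL}?{urllib.parse.urlencode(params)}"


def parse_arxiv_feed(xml_text):
    """Turn an Atom feed string into a list of plain dicts (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        papers.append({
            "arxiv_id": raw_id.rsplit("/abs/", 1)[-1],
            "title": " ".join(raw_title.split()),  # collapse wrapped whitespace
            "published": entry.findtext("atom:published", default="", namespaces=ATOM_NS),
            "categories": [c.get("term") for c in entry.findall("atom:category", ATOM_NS)],
        })
    return papers
```

Fetch each URL with your HTTP client of choice, page through results with `start`, and pause between requests to stay within arXiv's rate limits.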


In [None]:
# TODO: Implement arXiv scraping
# Your code here to:
# 1. Query arXiv API for papers
# 2. Extract full paper text
# 3. Get publication dates
# 4. Save to JSONL format

# Example structure:
# papers = scrape_arxiv_papers(
#     categories=['cs.LG', 'cs.AI'],
#     start_date='2025-04-01',
#     end_date='2025-05-01'
# )


## 2. Citation Data Collection

Describe how you obtained citation counts:
- Data source (Semantic Scholar, Google Scholar, etc.)
- Collection timestamp
- Any API limitations or challenges
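
One possible shape for this step, assuming Semantic Scholar's Graph API as the source and plain dicts with an `arxiv_id` key as input. Both the dict schema and the injectable `fetch` parameter are assumptions made so the sketch can run without network access. It records a single collection timestamp, which the question template later needs.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone

S2_PAPER_URL = (
    "https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}?fields=citationCount"
)


def strip_version(arxiv_id):
    """Drop a trailing version suffix such as 'v2' before the lookup."""
    return re.sub(r"v\d+$", "", arxiv_id)


def fetch_json(url):
    """Default fetcher: GET the URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def add_citation_counts(papers, fetch=fetch_json):
    """Return copies of `papers` with citation data; papers that fail lookup are dropped."""
    collected_at = datetime.now(timezone.utc).isoformat()
    enriched = []
    for paper in papers:
        url = S2_PAPER_URL.format(arxiv_id=strip_version(paper["arxiv_id"]))
        try:
            count = fetch(url).get("citationCount")
        except (OSError, ValueError):  # network/HTTP errors or malformed JSON
            count = None
        if count is None:
            continue
        enriched.append(
            {**paper, "citation_count": count, "citation_timestamp": collected_at}
        )
    return enriched
```

Silently dropping unmatched papers is itself a selection effect, so it is worth logging how many papers are lost here.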


In [None]:
# TODO: Implement citation collection
# Your code here to:
# 1. Query citation API
# 2. Match papers to citation counts
# 3. Add citation data to ArxivPaper objects

# Example:
# papers_with_citations = add_citation_counts(
#     papers
# )


## 3. Data Quality Checks

Show validation of your data:
- Distribution of citation counts
- Publication date distribution
- Paper length statistics
- Category distribution
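
As a starting point, the four checks above can be computed with the standard library alone. This sketch assumes each paper is a dict with `citation_count`, `full_text`, `published` (ISO string) and `primary_category` keys; that schema is hypothetical and should be adapted to `ArxivPaper`.

```python
from collections import Counter
from statistics import median


def summarize_papers(papers):
    """Compute summary statistics for a non-empty list of paper dicts."""
    counts = [p["citation_count"] for p in papers]
    lengths = [len(p["full_text"].split()) for p in papers]
    return {
        "n_papers": len(papers),
        "citations_min": min(counts),
        "citations_median": median(counts),
        "citations_max": max(counts),
        # A large share of zero-citation papers makes informative pairs scarce.
        "share_zero_citations": sum(c == 0 for c in counts) / len(counts),
        "median_length_words": median(lengths),
        # 'YYYY-MM' prefix of the ISO timestamp groups papers by month.
        "papers_per_month": dict(Counter(p["published"][:7] for p in papers)),
        "papers_per_category": dict(Counter(p["primary_category"] for p in papers)),
    }
```

The same numbers are easy to plot once loaded into a pandas DataFrame, which is already imported in this notebook.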

In [None]:
# TODO: Add data validation and visualization
# Show distributions, check for outliers, etc.


## 4. Paper Pairing

Describe your pairing strategy:
- Minimum citation difference chosen and why
- Maximum publication date gap
- Category matching approach
- How you avoid spurious cues
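
One way these constraints could fit together is sketched below, operating on plain dicts rather than `ArxivPaper` objects. The dict keys and the default thresholds (10 citations, 14 days) are placeholders, not recommendations. Each paper is used at most once, and the A/B order is randomized so that position carries no information about the answer.

```python
import random
from datetime import date
from itertools import combinations


def pair_papers(papers, min_citation_diff=10, max_days_apart=14, seed=0):
    """Pair same-category papers published close together with a clear citation gap."""
    rng = random.Random(seed)
    candidates = []
    for a, b in combinations(papers, 2):  # O(n^2); bucket by category for large sets
        if a["primary_category"] != b["primary_category"]:
            continue
        day_a = date.fromisoformat(a["published"][:10])
        day_b = date.fromisoformat(b["published"][:10])
        if abs((day_a - day_b).days) > max_days_apart:
            continue
        if abs(a["citation_count"] - b["citation_count"]) < min_citation_diff:
            continue
        candidates.append((a, b))

    rng.shuffle(candidates)
    used, pairs = set(), []
    for a, b in candidates:
        if a["arxiv_id"] in used or b["arxiv_id"] in used:
            continue  # keep each paper in at most one pair
        used.update((a["arxiv_id"], b["arxiv_id"]))
        # Random ordering keeps the share of "Yes" answers near 50%.
        pairs.append((a, b) if rng.random() < 0.5 else (b, a))
    return pairs
```

An absolute citation gap favors pairs that include highly cited papers; a relative gap is an alternative with different tradeoffs.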


In [None]:
# TODO: Implement pair_papers function
# Write a function that takes a list of papers and returns a list of pairs of papers.
# Think carefully about spurious cues and how to create a high-quality evaluation dataset.

# Example signature:
# def pair_papers(papers: List[ArxivPaper], ...) -> List[Tuple[ArxivPaper, ArxivPaper]]:
#     """
#     Create pairs of papers for citation comparison.
#     """
#     pass

# pairs = pair_papers(papers_with_citations)
# print(f"Created {len(pairs)} pairs from {len(papers_with_citations)} papers")

## 5. Question Generation

Generate questions from pairs using the standard template.
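
The core of this step is filling the template and deriving the resolution. The sketch below covers only that part and assumes papers are dicts with `title`, `full_text`, `published`, `citation_count` and `citation_timestamp` keys (a hypothetical schema). Mapping the result onto `ForecastingQuestion` fields is left to you, since that schema lives in `src/data_classes.py`. Because the resolution criteria do not cover ties, the sketch rejects them.

```python
def render_citation_question(template, paper_a, paper_b):
    """Fill the comparison template and return (question_text, resolution)."""
    if paper_a["citation_count"] == paper_b["citation_count"]:
        raise ValueError("Tied citation counts: the resolution criteria do not cover ties")
    text = template.format(
        paper_a_title=paper_a["title"],
        paper_a_full_text=paper_a["full_text"],
        paper_a_published_timestamp=paper_a["published"],
        paper_b_title=paper_b["title"],
        paper_b_full_text=paper_b["full_text"],
        paper_b_published_timestamp=paper_b["published"],
        # The template uses a single timestamp, taken from paper A.
        paper_a_citation_timestamp=paper_a["citation_timestamp"],
    )
    resolution = "Yes" if paper_a["citation_count"] > paper_b["citation_count"] else "No"
    return text, resolution
```

Because the template uses only paper A's citation timestamp, both papers in a pair should have their counts collected at about the same time.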


In [None]:
# TODO: Implement make_citations_comparison_question function
# This function should take two ArxivPaper objects and return a ForecastingQuestion
# Refer to src/data_classes.py for the ForecastingQuestion schema

# Template for the question text:
ARXIV_CITATION_COMPARISON_PROMPT = """
Will paper A receive more citations than paper B by {paper_a_citation_timestamp}? Yes or No? Here are the titles, abstracts, text and publication dates for both papers.

<paper_a>
<title>{paper_a_title}</title>
<full_text>{paper_a_full_text}</full_text>
<publication_date>{paper_a_published_timestamp}</publication_date>
</paper_a>

<paper_b>
<title>{paper_b_title}</title>
<full_text>{paper_b_full_text}</full_text>
<publication_date>{paper_b_published_timestamp}</publication_date>
</paper_b>

Resolution Criteria:
This question resolves to "Yes" if Paper A has more citations than Paper B on {paper_a_citation_timestamp}.
This question resolves to "No" if Paper B has more citations than Paper A on {paper_a_citation_timestamp}.

Question: Will paper A have more citations on {paper_a_citation_timestamp}, than paper B, Yes or No?"""

# def make_citations_comparison_question(paper_a: ArxivPaper, paper_b: ArxivPaper) -> ForecastingQuestion:
#     """
#     Create a ForecastingQuestion comparing citation counts of two papers.
#     Use the ARXIV_CITATION_COMPARISON_PROMPT template above and populate all required ForecastingQuestion fields.
#     """
#     pass

# Generate questions from pairs
# questions = []
# for paper_a, paper_b in pairs:
#     q = make_citations_comparison_question(paper_a, paper_b)
#     questions.append(q)
# 
# print(f"Generated {len(questions)} questions")

## 6. Common Mistakes Analysis

For each issue in the worktest document, address:

### 6.1 Spurious Cues
- **Issue**: [Describe the issue]
- **Your dataset**: [Yes/No - does your dataset have this issue?]
- **Mitigation**: [What you did to address it]
- **Tradeoffs**: [What you gave up]

### 6.2 Ambiguity in Questions
- **Issue**: [Describe]
- **Your dataset**: [Analysis]
- **Mitigation**: [Steps taken]
- **Tradeoffs**: [Costs]

### 6.3 Low Signal-to-Noise Ratio
- **Issue**: [Describe]
- **Your dataset**: [Analysis]
- **Mitigation**: [Steps taken]
- **Tradeoffs**: [Costs]

### 6.4 Selection Effects and Biases
- **Issue**: [Describe]
- **Your dataset**: [Analysis]
- **Mitigation**: [Steps taken]
- **Tradeoffs**: [Costs]

### 6.5 Data Contamination
- **Issue**: [Describe]
- **Your dataset**: [Analysis]
- **Mitigation**: [Steps taken]
- **Tradeoffs**: [Costs]
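
Some of these checks can be partly automated. The sketch below, which assumes pairs of dicts with a `citation_count` key (a hypothetical schema), measures label balance and how often a trivial heuristic such as "the longer paper wins" or "the older paper wins" matches the true answer. A heuristic scoring well above 50% points to a spurious cue that a model could exploit without reading the papers.

```python
def yes_rate(pairs):
    """Fraction of pairs where paper A has more citations (the 'Yes' label)."""
    return sum(a["citation_count"] > b["citation_count"] for a, b in pairs) / len(pairs)


def heuristic_accuracy(pairs, feature):
    """Accuracy of predicting 'the paper with the larger feature value wins'.

    Pairs where the feature is equal are skipped; returns None if none remain.
    """
    decided = [(a, b) for a, b in pairs if feature(a) != feature(b)]
    if not decided:
        return None
    hits = sum(
        (feature(a) > feature(b)) == (a["citation_count"] > b["citation_count"])
        for a, b in decided
    )
    return hits / len(decided)
```

For example, `heuristic_accuracy(pairs, lambda p: len(p["full_text"]))` tests text length as a cue, assuming a `full_text` key is present.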


In [None]:
# TODO: Add any code needed to check for and fix the issues above before running the evaluation.
# Are there paper pairs that don't make sense? Is something wrong with how you collect citations?

## 7. Question Evaluation

Evaluate your dataset with multiple Claude models.


In [None]:
from src.eval import evaluate_and_plot
import asyncio

async def run_evaluation():
    predictions, metrics = await evaluate_and_plot(
        questions[:100],  # first 100 questions; questions[100] would pass a single question
        model_ids=[
            "claude-3-5-haiku-latest",
            "claude-3-7-sonnet-20250219",
            "claude-sonnet-4-20250514"
        ],
        output_dir=Path("evaluation_results"),
        experiment_name="citations_pipeline"
    )
    return predictions, metrics

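# Run evaluation.
# Jupyter already runs an event loop, so asyncio.run() raises RuntimeError here;
# await the coroutine directly instead. In a plain script, use asyncio.run(run_evaluation()).
predictions, metrics = await run_evaluation()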


## 8. Results Analysis
- Does accuracy scale with model capability?
- If not, work out why. Are there spurious cues you need to remove? Is your data pipeline broken? Are a few papers skewing your results?
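
Before reading a scaling trend into the results, check whether the gaps between models exceed sampling noise. The sketch below computes accuracy with a normal-approximation 95% confidence interval from a list of per-question correctness flags. Extracting those flags from `predictions` is an assumption about that object's structure and is not shown.

```python
import math


def accuracy_with_ci(correct, z=1.96):
    """Return (accuracy, lower, upper) using a normal-approximation interval."""
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one prediction")
    acc = sum(correct) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)
```

With 100 questions the interval is roughly ten percentage points on either side near 50% accuracy, so small gaps between models may not be meaningful at that sample size.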


In [None]:
# TODO: Add any code needed to debug your pipeline.


## 9. Export to Parquet