# Step 1: Fetching Papers from OpenReview

This notebook demonstrates the first step in building a RAG (Retrieval-Augmented Generation) pipeline: **collecting research papers from OpenReview**.

## OpenReview

[OpenReview](https://openreview.net/) is an open platform for scientific peer review and publication. Major conferences like ICLR, NeurIPS, and ICML use OpenReview to manage their submission and review process. It provides a public API to access papers, reviews, and metadata.

Note: you can use your own PDFs and skip this notebook.

## Overview of This Step

In this notebook, we will:
1. Connect to the OpenReview API
2. Fetch papers from the ICLR 2025 Conference
3. Download paper PDFs and metadata (title, abstract, authors)
4. Save the data for the next step in our RAG pipeline

## Why This Matters for RAG

A RAG pipeline needs a knowledge base. By collecting academic papers, we create a corpus of high-quality technical content that can be:
- Indexed and searched
- Used to answer questions about recent research
- Referenced with citations and proper attribution

## Import Required Libraries

In [1]:
# https://docs.openreview.net/getting-started/using-the-api/installing-and-instantiating-the-python-client

!pip install openreview-py

Collecting openreview-py
  Downloading openreview_py-1.54.7-py3-none-any.whl.metadata (4.1 kB)
Collecting pycryptodome (from openreview-py)
  Downloading pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting Deprecated (from openreview-py)
  Downloading deprecated-1.3.1-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting pylatexenc (from openreview-py)
  Downloading pylatexenc-2.10.tar.gz (162 kB)
  Preparing metadata (setup.py) ... done
Collecting tld>=0.12 (from openreview-py)
  Downloading tld-0.13.1-py2.py3-none-any.whl.metadata (10 kB)
Collecting litellm==1.76.1 (from openreview-py)
  Downloading litellm-1.76.1-py3-none-any.whl.metadata (41 kB)
Collecting fastuuid>=0.12.0 (from

In [2]:
from pathlib import Path
from google.colab import drive
drive.mount('/content/drive')

BASE_PATH = Path("/content/drive/MyDrive/RAG")
PDF_FOLDER = BASE_PATH / "block1_output_pdfs"

PDF_FOLDER.mkdir(parents=True, exist_ok=True)



Mounted at /content/drive


In [3]:
import json
import os
from typing import List, Dict
import openreview
import openreview.api

# API endpoints
API_V2 = "https://api2.openreview.net"
API_V1 = "https://api.openreview.net"

print("Libraries imported successfully")

Libraries imported successfully


## Understanding the OpenReview API

OpenReview provides two API versions:

### API v2
- **Endpoint**: `https://api2.openreview.net`
- **Client**: `openreview.api.OpenReviewClient`
- **Features**: Better performance, structured data format
- **Data Structure**: Content fields use `{'value': actual_value}` format

### API v1 (Legacy)
- **Endpoint**: `https://api.openreview.net`
- **Client**: `openreview.Client`
- **Features**: Older format, still maintained for backward compatibility
- **Data Structure**: Content fields are direct values
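
The difference between the two content formats can be absorbed by a small helper. The sketch below is only illustrative: `unwrap_field` is a hypothetical name (it is not part of `openreview-py`), and the sample dictionaries are made up to mimic the two formats described above.

```python
from typing import Any


def unwrap_field(content: dict, key: str, default: Any = None) -> Any:
    """Return a content field whether it is stored v2-style or v1-style."""
    raw = content.get(key)
    if raw is None:
        return default
    # API v2 wraps each field as {'value': ...}; API v1 stores the value directly.
    if isinstance(raw, dict) and "value" in raw:
        return raw["value"]
    return raw


# Made-up sample content in both formats (not real OpenReview data)
v2_content = {"title": {"value": "A Sample Paper"}, "authors": {"value": ["A. Author"]}}
v1_content = {"title": "A Sample Paper", "authors": ["A. Author"]}

print(unwrap_field(v2_content, "title"))
print(unwrap_field(v1_content, "title"))
print(unwrap_field(v1_content, "abstract", default=""))
```

With a helper like this, the same extraction code can read notes from either API version.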

### Key Concepts

1. **Venue**: A conference or workshop (e.g., `ICLR.cc/2025/Conference`)
2. **Invitation**: Defines what type of notes to fetch (e.g., `Submission`, `Official_Review`)
3. **Note**: A paper submission, review, or comment
4. **Content**: The actual data (title, abstract, authors, PDF, etc.)
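
To make the venue and invitation concepts concrete, the sketch below builds a venue-level invitation ID by joining the two with `/-/`, which is the pattern this notebook uses for submissions. `build_invitation_id` is a hypothetical helper, and the assumption that a venue exposes an invitation under the given name should be checked per venue.

```python
def build_invitation_id(venue_id: str, name: str = "Submission") -> str:
    """Join a venue ID and an invitation name with the '/-/' separator."""
    return f"{venue_id.rstrip('/')}/-/{name}"


print(build_invitation_id("ICLR.cc/2025/Conference"))
# ICLR.cc/2025/Conference/-/Submission
```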



## Define the Main Function to Fetch Papers

This function handles:
- Connecting to OpenReview API v2 (a fallback to v1 is not implemented in this version of the notebook)
- Fetching paper submissions by invitation
- Extracting metadata (title, abstract, authors)
- Downloading PDFs
- Handling content fields that are either wrapped as `{'value': ...}` or stored as direct values

In [4]:
def fetch_papers_from_openreview(
    venue_id: str,
    n: int = 10,
    download_path: Path = PDF_FOLDER,
    download_pdfs: bool = False,
    verbose: bool = True
) -> List[Dict]:
    """
    Fetch papers from OpenReview.

    Args:
        venue_id: OpenReview venue identifier (e.g., 'ICLR.cc/2025/Conference')
        n: Number of papers to fetch
        download_path: Folder where downloaded PDFs are written
        download_pdfs: Whether to download PDF files
        verbose: Print progress messages

    Returns:
        List of paper dictionaries with metadata
    """
    papers = []
    notes = []

    # Use API v2 (recommended for newer conferences); no v1 fallback is implemented here
    try:
        if verbose:
            print("Connecting to OpenReview API v2...")

        client = openreview.api.OpenReviewClient(baseurl=API_V2)
        invitation = f"{venue_id}/-/Submission"

        if verbose:
            print(f"Fetching papers from invitation: {invitation}")

        # Fetch submissions
        notes_iter = client.get_all_notes(invitation=invitation)

        for i, note in enumerate(notes_iter):
            if i >= n:
                break

            content = note.content or {}
            if not isinstance(content, dict):
                continue

            # Extract fields (API v2 uses {'value': actual_value} structure)
            title_obj = content.get('title') or content.get('paper_title') or {}
            title = title_obj.get('value', '') if isinstance(title_obj, dict) else title_obj

            abstract_obj = content.get('abstract', {})
            abstract = abstract_obj.get('value', '') if isinstance(abstract_obj, dict) else abstract_obj

            authors_obj = content.get('authors', {})
            authors = authors_obj.get('value', []) if isinstance(authors_obj, dict) else authors_obj

            pdf_obj = content.get('pdf', {})
            pdf_value = pdf_obj.get('value') if isinstance(pdf_obj, dict) else pdf_obj

            paper = {
                'id': note.id,
                'number': note.number,
                'title': title,
                'abstract': abstract,
                'authors': authors,
                'pdf': pdf_value,
                'content': content,
            }
            papers.append(paper)
            notes.append(note)

        if verbose:
            print(f"Found {len(papers)} papers using API v2")

        # Download PDFs if requested
        if download_pdfs and papers:

            if verbose:
                print(f"Downloading PDFs to {download_path}/")

            for note in notes:
                if note.content and isinstance(note.content, dict) and note.content.get('pdf'):
                    try:
                        pdf_binary = client.get_pdf(id=note.id)
                        pdf_path = Path(download_path) / f"{note.number}.pdf"

                        with open(pdf_path, 'wb') as f:
                            f.write(pdf_binary)

                        # Add local path to paper dict
                        for paper in papers:
                            if paper['id'] == note.id:
                                paper['local_pdf_path'] = str(pdf_path)
                                break

                        if verbose:
                            print(f"  ✓ Downloaded paper {note.number}")
                    except Exception as e:
                        if verbose:
                            print(f"  ✗ Failed to download {note.id}: {e}")

        return papers

    except Exception as e:
        if verbose:
            print(f"API v2 failed: {e}")
            print("Falling back to API v1 is not implemented in this notebook version.")
        raise

print("✓ Function defined successfully")

✓ Function defined successfully
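
The function above attaches each local path by scanning the `papers` list once per downloaded note. As an alternative sketch, a dictionary keyed by paper ID does the same in a single pass. `attach_local_paths` and the sample records are hypothetical and only illustrate the lookup pattern; they are not part of the notebook's function.

```python
from pathlib import Path
from typing import Dict, List


def attach_local_paths(papers: List[dict], downloaded: Dict[str, Path]) -> List[dict]:
    """Add 'local_pdf_path' to every paper whose ID has a downloaded file."""
    by_id = {paper["id"]: paper for paper in papers}  # one-time index by paper ID
    for paper_id, pdf_path in downloaded.items():
        if paper_id in by_id:
            by_id[paper_id]["local_pdf_path"] = str(pdf_path)
    return papers


# Made-up records for illustration
sample_papers = [{"id": "aaa", "number": 1}, {"id": "bbb", "number": 2}]
sample_downloads = {"aaa": Path("pdfs") / "1.pdf"}

attach_local_paths(sample_papers, sample_downloads)
print(sample_papers)
```

For ten papers the difference is negligible; the indexed version mainly matters when fetching a full venue.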


## Configure Parameters

Set the parameters for fetching papers:
- **Venue**: ICLR 2025 Conference
- **Download PDFs**: Yes (we'll need these for the next step)

In [5]:
# Configuration
VENUE_ID = "ICLR.cc/2025/Conference"
NUM_PAPERS = 10
DOWNLOAD_PDFS = True


OUTPUT_FILE = BASE_PATH / "iclr_papers.json"

print(f"   Configuration:")
print(f"   Venue: {VENUE_ID}")
print(f"   Number of papers: {NUM_PAPERS}")
print(f"   Download PDFs: {DOWNLOAD_PDFS}")
print(f"   Output file: {OUTPUT_FILE}")

   Configuration:
   Venue: ICLR.cc/2025/Conference
   Number of papers: 10
   Download PDFs: True
   Output file: /content/drive/MyDrive/RAG/iclr_papers.json


## Fetch Papers from OpenReview

Now let's fetch the papers! This will:
1. Connect to the OpenReview API
2. Retrieve paper metadata
3. Download PDF files (this may take a few minutes)

In [6]:
# Fetch papers
papers = fetch_papers_from_openreview(
    venue_id=VENUE_ID,
    n=NUM_PAPERS,
    download_pdfs=DOWNLOAD_PDFS,
    verbose=True
)

print(f"\nSuccessfully fetched {len(papers)} papers!")

Connecting to OpenReview API v2...
Fetching papers from invitation: ICLR.cc/2025/Conference/-/Submission
Found 10 papers using API v2
Downloading PDFs to /content/drive/MyDrive/RAG/block1_output_pdfs/
  ✓ Downloaded paper 14296
  ✓ Downloaded paper 14294
  ✓ Downloaded paper 14293
  ✓ Downloaded paper 14290
  ✓ Downloaded paper 14287
  ✓ Downloaded paper 14286
  ✓ Downloaded paper 14284
  ✓ Downloaded paper 14282
  ✓ Downloaded paper 14280
  ✓ Downloaded paper 14279

Successfully fetched 10 papers!


## Explore the Data

Let's examine what we've collected:

In [7]:
# Display summary statistics
print(f"Dataset Summary:")
print(f"   Total papers: {len(papers)}")
print(f"   Papers with PDFs: {sum(1 for p in papers if 'local_pdf_path' in p)}")
print(f"\nFirst Paper Example:")
print(f"   Title: {papers[0]['title']}")
print(f"   Authors: {', '.join(papers[0]['authors'][:3])}{'...' if len(papers[0]['authors']) > 3 else ''}")
print(f"   Abstract (first 150 chars): {papers[0]['abstract'][:150]}...")
print(f"   Paper ID: {papers[0]['id']}")
print(f"   Paper Number: {papers[0]['number']}")

Dataset Summary:
   Total papers: 10
   Papers with PDFs: 10

First Paper Example:
   Title: Neuroacoustic Patterns: Constant Q Cepstral Coefficients for the Classification of Neurodegenerative Disorders
   Authors: Aastha Kachhi, Shashank Ojha, Megha Pandey...
   Abstract (first 150 chars): Early identification of neurodegenerative diseases is crucial for effective diagnosis in neurological disorders. However, the quasi-periodic nature of...
   Paper ID: 5sRnsubyAK
   Paper Number: 14296
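
The author-truncation expression used in the cell above is repeated later in the notebook, so it could be factored into a helper. The following is a minimal sketch; `format_authors` is a hypothetical name and the author names are placeholders.

```python
from typing import List


def format_authors(authors: List[str], limit: int = 3) -> str:
    """Join the first `limit` authors and append '...' if any were left out."""
    shown = ", ".join(authors[:limit])
    return shown + ("..." if len(authors) > limit else "")


print(format_authors(["Ada", "Bob", "Cy", "Di"]))  # Ada, Bob, Cy...
print(format_authors(["Ada"], limit=2))            # Ada
```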


## Save the Data

Save the paper metadata to a JSON file for easy access in subsequent steps:

In [8]:
# Save to JSON file
with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
    json.dump(papers, f, ensure_ascii=False, indent=2)

print(f"Saved {len(papers)} papers to {OUTPUT_FILE}")
print(f"File size: {Path(OUTPUT_FILE).stat().st_size / 1024:.1f} KB")

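# Verify PDFs directory (PDF_FOLDER is the Drive folder the PDFs were written to;
# a relative 'block1_output_pdfs' path would not point there)
if DOWNLOAD_PDFS:
    pdf_dir = PDF_FOLDER
    if pdf_dir.exists():
        pdf_count = len(list(pdf_dir.glob('*.pdf')))
        print(f"{pdf_count} PDFs saved in {pdf_dir}/")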

Saved 10 papers to /content/drive/MyDrive/RAG/iclr_papers.json
File size: 51.5 KB
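
A later step will need to read this file back. The sketch below shows one way to do that and checks the round trip on made-up data in a temporary directory, so it does not touch the Drive folder. `load_papers` is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path


def load_papers(path) -> list:
    """Read a list of paper dictionaries from a UTF-8 JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round-trip check with made-up data (non-ASCII title included on purpose)
sample = [{"id": "abc123", "number": 1, "title": "Échantillon", "authors": ["A. Author"]}]
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "papers.json"
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    restored = load_papers(out_file)

print(restored == sample)
```

Writing and reading with the same explicit `encoding="utf-8"` keeps non-ASCII titles and author names intact when `ensure_ascii=False` is used.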


## View Sample Papers

Let's look at the titles of the papers we collected:

In [9]:
# Display first 10 paper titles
print("First 10 Papers:\n")
for i, paper in enumerate(papers[:10], 1):
    print(f"{i:2d}. {paper['title']}")
    print(f"    Authors: {', '.join(paper['authors'][:2])}{'...' if len(paper['authors']) > 2 else ''}")
    print()

First 10 Papers:

 1. Neuroacoustic Patterns: Constant Q Cepstral Coefficients for the Classification of Neurodegenerative Disorders
    Authors: Aastha Kachhi, Shashank Ojha...

 2. A Feature-Aware Federated Learning Framework for Unsupervised Anomaly Detection in 5G Networks
    Authors: Saeid Sheikhi

 3. UnoLoRA: Single Low-Rank Adaptation for Efficient Multitask Fine-tuning
    Authors: Anirudh Lakhotia, Akash Kamalesh...

 4. Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval
    Authors: Adel Elmahdy, Sheng-Chieh Lin...

 5. EXecution-Eval: Can language models execute real-world code?
    Authors: Rob Kopel

 6. The Rate-Distortion-Perception Trade-Off with Algorithmic Realism
    Authors: Yassine Hamdi, Aaron B. Wagner...

 7. Beyond Random Masking: When Dropout meets Graph Convolutional Networks
    Authors: Yuankai Luo, Xiao-Ming Wu...

 8. Defining Deception in Decision Making
    Authors: Marwa Abdulhai, 

## Summary of What We've Accomplished

In this notebook, we:

1. ✅ **Connected to the OpenReview API**: Used the modern API v2 to access ICLR 2025 papers
2. ✅ **Fetched Paper Metadata**: Retrieved titles, abstracts, authors, and paper IDs
3. ✅ **Downloaded PDFs**: Saved PDF files locally for processing
4. ✅ **Saved Structured Data**: Created a JSON file with all paper information

### Data Structure

Each paper in our dataset contains:
- `id`: Unique OpenReview identifier
- `number`: Paper submission number
- `title`: Paper title
- `abstract`: Paper abstract
- `authors`: List of author names
- `pdf`: PDF reference as returned by OpenReview (if available)
- `local_pdf_path`: Local path to the downloaded PDF (present only when the download succeeded)
- `content`: Raw content dictionary from OpenReview
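
The record layout listed above can also be written down as a type, which makes it easier to validate records in later steps. This is a sketch under assumptions: the class names and `missing_keys` are hypothetical, and the field types (for example `number` as `int`) are inferred from the sample output rather than from an official schema.

```python
from typing import Any, Dict, List, Optional, TypedDict


class PaperRecordBase(TypedDict):
    """Keys that every paper dictionary carries."""
    id: str
    number: int
    title: str
    abstract: str
    authors: List[str]
    pdf: Optional[str]
    content: Dict[str, Any]


class PaperRecord(PaperRecordBase, total=False):
    """Adds the key that is only set after a successful PDF download."""
    local_pdf_path: str


REQUIRED_KEYS = frozenset(PaperRecordBase.__annotations__)


def missing_keys(paper: dict) -> List[str]:
    """Return the required keys that are absent from a paper dictionary."""
    return sorted(REQUIRED_KEYS - paper.keys())


# Made-up record for illustration
example: PaperRecord = {
    "id": "abc123", "number": 1, "title": "A Sample Paper", "abstract": "",
    "authors": ["A. Author"], "pdf": None, "content": {},
}
print(missing_keys(example))
print(missing_keys({"id": "abc123"}))
```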

### Files Created

- **`iclr_papers.json`**: JSON file with paper metadata
- **`block1_output_pdfs/`**: Directory containing downloaded PDFs
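
Before moving on, it can be worth checking that the saved files really are PDFs, since a failed or redirected download could leave an HTML error page behind. The sketch below uses a quick heuristic, testing for the `%PDF-` header at the start of the file. `looks_like_pdf` is a hypothetical helper, and the demo writes throwaway files to a temporary directory.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path) -> bool:
    """Heuristic check: the file exists and starts with the '%PDF-' header."""
    path = Path(path)
    if not path.is_file():
        return False
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demo on throwaway files (not real papers)
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    bad = Path(tmp) / "bad.pdf"
    good.write_bytes(b"%PDF-1.7\n% placeholder body\n")
    bad.write_bytes(b"<html>error</html>")
    results = {
        "good": looks_like_pdf(good),
        "bad": looks_like_pdf(bad),
        "missing": looks_like_pdf(Path(tmp) / "missing.pdf"),
    }

print(results)
```

In the notebook, the same check could be run over each paper's `local_pdf_path`.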

## Next Step: Converting PDFs to Markdown

### What's Next in the RAG Pipeline?

Now that we have collected the research papers, the next step is to **convert the PDFs into structured Markdown format**. This is crucial for our RAG system because:

#### Why Convert to Markdown?

1. **Text Extraction**: Extract clean, structured text from PDF documents
2. **Better Parsing**: Markdown preserves document structure (headings, lists, tables)
3. **Easier Chunking**: Structured text is easier to split into meaningful chunks for embeddings
4. **Improved Search**: LLMs can better understand and search through Markdown content
5. **Citation Preservation**: Maintain references and bibliography information
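
To illustrate the chunking point, the sketch below splits a Markdown string into sections at its headings. It is a simplified illustration, not the method used in the next notebook: `split_by_headings` is a hypothetical helper, the sample text is made up, and `#` lines inside fenced code blocks are not handled.

```python
import re
from typing import List, Tuple

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX headings: '# Title' to '###### Title'


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one per heading."""
    sections: List[Tuple[str, str]] = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the previous section before starting a new one
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample_md = "# Intro\nSome text.\n\n## Method\nStep one.\nStep two.\n"
for heading, text in split_by_headings(sample_md):
    print(repr(heading), repr(text))
```

Each resulting section keeps its heading, which can later serve as context when the chunk is embedded.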

#### How We'll Do It

In the next notebook (Step 2), we will:

1. **Use a Large Language Model (LLM)** to process each PDF
2. **Extract and structure the content** into clean Markdown format
3. **Preserve key elements**: sections, equations, figures, tables, and references
4. **Handle multi-column layouts** and complex formatting
5. **Save structured Markdown files** for each paper


### Why Use an LLM?

Traditional PDF parsers often struggle with:
- Complex multi-column layouts
- Mathematical equations
- Tables and figures
- Reference formatting
- Section hierarchies

An LLM can often recover the semantic structure of academic papers and produce high-quality Markdown that preserves meaning and readability.
