
# Data-Centric ML: Most Recent Papers (Semantic Scholar, Scrape Only)

This notebook **only** fetches recent papers related to *data-centric machine learning* from the Semantic Scholar Graph API, using:
- A **multi-query** strategy (the search endpoint does not support Boolean `OR` the way web search does); the `QUERIES` list in the fetch cell currently holds a single query.
- A strict **1 request/second** throttle with retry and backoff on 429/5xx responses.
- Local **deduplication**, **date parsing**, sorting by recency, and trimming to the `TARGET` most recent papers (`TARGET = 1000` in the fetch cell).
- Output: `dcml_recent_1000.csv` (title, abstract, publicationDate/year, venue, DOI, arXiv ID, URL).
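
To illustrate the multi-query idea, here is a minimal offline sketch of merging per-query result batches. The extra query strings and the helper `merge_by_paper_id` are hypothetical and not part of this notebook; the sketch also assumes each record carries a `paperId` key, as Graph API responses normally do.

```python
from typing import Dict, Iterable, List

# Hypothetical query variants; the notebook's own QUERIES list has one entry.
EXAMPLE_QUERIES = [
    "data-centric machine learning",
    "data-centric AI",
    "data quality for machine learning",
]

def merge_by_paper_id(batches: Iterable[List[dict]]) -> List[dict]:
    """Merge result batches, keeping the first record seen for each paperId."""
    seen: Dict[str, dict] = {}
    no_id: List[dict] = []  # records without an ID are kept as-is
    for batch in batches:
        for paper in batch:
            pid = paper.get("paperId")
            if pid is None:
                no_id.append(paper)
            elif pid not in seen:
                seen[pid] = paper
    return list(seen.values()) + no_id

# Toy batches standing in for the responses to two different queries
batch_a = [{"paperId": "p1", "title": "A"}, {"paperId": "p2", "title": "B"}]
batch_b = [{"paperId": "p2", "title": "B"}, {"paperId": "p3", "title": "C"}]
merged = merge_by_paper_id([batch_a, batch_b])
print([p["paperId"] for p in merged])
```

Merging on an ID before any title-based deduplication keeps overlapping queries from inflating the raw count.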


In [None]:

!pip -q install requests pandas tqdm



## Set your Semantic Scholar API key (securely)


In [None]:

import os
from getpass import getpass

if "S2_API_KEY" not in os.environ or not os.environ["S2_API_KEY"]:
    os.environ["S2_API_KEY"] = getpass("Enter your Semantic Scholar API key (input hidden): ")
print("API key set in environment variable S2_API_KEY.")


API key set in environment variable S2_API_KEY.



## (Optional) Mount Google Drive
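Uncomment to mount Drive and define `OUTDIR`. The save step at the end of the notebook writes to the working directory, so point its `to_csv` call at `OUTDIR` if the CSV should land on Drive.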


In [None]:

# from google.colab import drive
# drive.mount('/content/drive')
# OUTDIR = "/content/drive/MyDrive/dcml_outputs"
# import os; os.makedirs(OUTDIR, exist_ok=True)
# print("Using:", OUTDIR)
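
The final cell writes the CSV to a bare filename, so defining `OUTDIR` alone does not redirect the output. One possible way to connect the two is sketched below; `resolve_output_path` is a hypothetical helper, not something this notebook defines.

```python
import os
from typing import Optional

def resolve_output_path(filename: str, outdir: Optional[str] = None) -> str:
    """Return filename inside outdir (created if needed), or filename unchanged."""
    if not outdir:
        return filename
    os.makedirs(outdir, exist_ok=True)
    return os.path.join(outdir, filename)

# Falls back to the working directory when OUTDIR was never defined
csv_path = resolve_output_path("dcml_recent_1000.csv", globals().get("OUTDIR"))
print(csv_path)
```

The save step could then call `df_recent.to_csv(csv_path, index=False)` instead of passing the filename directly.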



## Fetch the most recent papers
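We issue each query in `QUERIES`, paginate up to `per_query_pages` pages of 100 results each, and merge locally. The search endpoint returns relevance-ranked results, so "most recent" here means the most recent among the results fetched.  
Then we deduplicate (DOI + normalized title), parse dates, filter to a whitelist of CS venues, sort by recency, and keep the `TARGET` freshest papers whose abstracts are at least 200 characters long.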
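
The deduplication step uses the (DOI, normalized title) pair as its key, so two records with the same title but different DOIs both survive. The toy sketch below, on made-up data, contrasts that with a title-only key; the title-only variant is an alternative shown for comparison, not what the notebook does.

```python
import pandas as pd

# Made-up records: rows 0 and 1 share a title but differ in DOI;
# rows 2 and 3 are exact duplicates.
toy = pd.DataFrame({
    "title": ["Data-Centric AI", "data-centric  ai",
              "Label Noise Survey", "Label Noise Survey"],
    "doi": ["10.1/a", "10.1/b", "10.2/c", "10.2/c"],
    "pub_dt": pd.to_datetime(["2023-05-01", "2024-01-15",
                              "2022-03-10", "2022-03-10"]),
})
toy["title_norm"] = toy["title"].str.lower().str.replace(r"\s+", " ", regex=True)

# Pair key, as in the notebook: only exact (doi, title_norm) repeats are dropped
pair_key = toy.drop_duplicates(subset=["doi", "title_norm"], keep="first")

# Title-only key: sort newest first so the freshest version of each title is kept
title_only = (
    toy.sort_values("pub_dt", ascending=False)
       .drop_duplicates(subset=["title_norm"], keep="first")
)

# Trim to the N freshest rows (N = 2 for this toy example)
freshest = title_only.sort_values("pub_dt", ascending=False).head(2)
print(len(pair_key), len(title_only))
```

Which key is appropriate depends on whether a preprint and its published version should count as one paper.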


In [None]:
import os, time, re, warnings
from datetime import datetime
import requests
import pandas as pd
from tqdm import tqdm

warnings.filterwarnings("ignore", category=UserWarning)

S2_API_KEY = os.getenv("S2_API_KEY")
assert S2_API_KEY, "Please set S2_API_KEY in your environment."
HEADERS = {"x-api-key": S2_API_KEY}
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

FIELDS = "title,abstract,year,venue,publicationDate,externalIds,url"
PAGE_LIMIT = 100
TARGET = 1000

# Focused queries covering the DCML landscape
QUERIES = [
    "data-centric machine learning"
]

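def robust_get(url, headers, params, sleep_s=1.1, max_retries=6):
    """GET `url` and return the parsed JSON, throttled to about one request per second.

    Retries on 429 and 5xx responses with a linearly growing wait; raises RuntimeError
    on any other non-200 status or when all retries are exhausted.
    """
    for attempt in range(1, max_retries + 1):
        time.sleep(sleep_s)  # pre-request throttle for 1 req/sec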
        resp = requests.get(url, headers=headers, params=params, timeout=60)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code in (429, 500, 502, 503, 504):
            wait = sleep_s * attempt * 1.7
            print(f"Backing off ({resp.status_code}). Waiting {wait:.1f}s...")
            time.sleep(wait)
            continue
        raise RuntimeError(f"HTTP {resp.status_code}: {resp.text[:300]}")
    raise RuntimeError("Failed after max retries.")

def fetch_search(query, offset=0, limit=100):
    params = {
        "query": query,
        "limit": limit,
        "offset": offset,
        "fields": FIELDS,
    }
    return robust_get(SEARCH_URL, HEADERS, params)

def fetch_recent_dcml_multi(target=TARGET, per_query_pages=8):
    all_items = []
    for q in QUERIES:
        print(f"Query: {q}")
        for page in range(per_query_pages):
            data = fetch_search(q, offset=page*PAGE_LIMIT, limit=PAGE_LIMIT)
            batch = data.get("data", [])
            if not batch:
                print("  (no more results)")
                break
            all_items.extend(batch)
            print(f"  +{len(batch)} (total={len(all_items)})")
            if len(all_items) >= target * 4:  # overfetch for dedupe & trimming
                break
        if len(all_items) >= target * 4:
            break
    return all_items

def parse_date(s):
    """Parse 'YYYY-MM-DD', 'YYYY-MM', or 'YYYY'; else fall back to the first 4-digit year found."""
    if not s:
        return None
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    m = re.search(r"(19|20)\d{2}", s)
    return datetime(int(m.group(0)), 1, 1) if m else None

def normalize_text(s):
    s = (s or "").strip()
    s = re.sub(r"\s+", " ", s)
    return s

def to_rows(raw_items):
    rows = []
    for p in raw_items:
        if not isinstance(p, dict):
            continue
        title = normalize_text(p.get("title"))
        abstract = normalize_text(p.get("abstract"))
        if not title or not abstract:
            continue
        ext = p.get("externalIds") or {}
        pub_dt = parse_date(p.get("publicationDate"))
        if (pub_dt is None) and p.get("year"):
            # Fall back to January 1 of the reported year
            try:
                pub_dt = datetime(int(p["year"]), 1, 1)
            except (TypeError, ValueError, OverflowError):
                pub_dt = None
        rows.append({
            "title": title,
            "abstract": abstract,
            "year": p.get("year"),
            "venue": (p.get("venue") or "")[:200],
            "publicationDate": p.get("publicationDate"),
            "pub_dt": pub_dt,
            "doi": ext.get("DOI"),
            "arxivId": ext.get("ArXiv"),
            "url": p.get("url"),
        })
    return pd.DataFrame(rows)

print("Step 1: Fetching (multi-query, throttled) ...")
raw = fetch_recent_dcml_multi(TARGET, per_query_pages=10)
print("Raw items fetched:", len(raw))

df = to_rows(raw)
print("After to_rows:", len(df))

# Diagnostics
# Note: this counts rows with a non-null DOI, not the number of distinct DOIs
print("Unique DOIs:", df["doi"].notna().sum())
print("Missing publicationDate:", df["publicationDate"].isna().sum(), "| Missing year:", df["year"].isna().sum())
print("Abstract length ≥ 80:", (df["abstract"].str.len() >= 80).sum())

# Deduplicate on the (DOI, normalized title) pair; a row is dropped only if both match
df["title_norm"] = df["title"].str.lower().str.replace(r"\s+", " ", regex=True)
df = df.drop_duplicates(subset=["doi", "title_norm"], keep="first")

# Keep records that have a usable date
df = df[df["pub_dt"].notna()].copy()

# Filter to a whitelist of CS/ML venues (a basic example that may need refinement).
# This is a case-sensitive substring match, so short names also match longer ones
# (e.g. "ACL" matches "NAACL") and venues outside the list are dropped entirely.
cs_venues = [
    "NeurIPS", "ICML", "ICLR", "ACL", "EMNLP", "ECML PKDD", "COLT", "UAI",
    "AISTATS", "IJCAI", "AAAI", "ECAI", "Artificial Intelligence",
    "Machine Learning", "Journal of Machine Learning Research",
    "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    "Data Mining and Knowledge Discovery", "Expert Systems with Applications",
    "Knowledge-Based Systems", "Pattern Recognition", "Neurocomputing",
    "Information Science",
]
venue_pattern = "|".join(re.escape(v) for v in cs_venues)  # escape regex metacharacters
df = df[df["venue"].str.contains(venue_pattern, na=False)].copy()

# Sort by recency and keep the freshest TARGET rows with substantial abstracts (>= 200 chars)
df = df.sort_values("pub_dt", ascending=False)
df_recent = df[df["abstract"].str.len() >= 200].head(TARGET).copy().reset_index(drop=True)

print(f"Kept {len(df_recent)} most recent with nontrivial abstracts from CS venues.")
df_recent = df_recent.drop(columns=["title_norm"])
df_recent.to_csv("dcml_recent_1000.csv", index=False)
print("Saved: dcml_recent_1000.csv")

Step 1: Fetching (multi-query, throttled) ...
Query: data-centric machine learning
  +100 (total=100)
  +100 (total=200)
  +100 (total=300)
  +100 (total=400)
  +100 (total=500)
  +100 (total=600)
  +100 (total=700)
  +100 (total=800)
  +100 (total=900)
  +100 (total=1000)
Raw items fetched: 1000
After to_rows: 762
Unique DOIs: 717
Missing publicationDate: 28 | Missing year: 0
Abstract length ≥ 80: 762
Kept 38 most recent with nontrivial abstracts from CS venues.
Saved: dcml_recent_1000.csv
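
The run above goes from 762 parsed rows to 38 kept rows across four steps (deduplication, date filter, venue filter, abstract-length filter), and the log does not show which step removes the most. A small sketch for recording the row count after each filter follows; `funnel` and the toy data are hypothetical and only illustrate the idea.

```python
from typing import Callable, List, Tuple

import pandas as pd

Mask = Callable[[pd.DataFrame], pd.Series]

def funnel(df: pd.DataFrame, steps: List[Tuple[str, Mask]]) -> List[Tuple[str, int]]:
    """Apply boolean-mask filters in order and record the row count after each."""
    counts = [("start", len(df))]
    for name, mask_fn in steps:
        df = df[mask_fn(df)]
        counts.append((name, len(df)))
    return counts

# Toy data standing in for the notebook's DataFrame
toy = pd.DataFrame({
    "venue": ["NeurIPS", "Nature", "ICML", ""],
    "abstract": ["x" * 250, "x" * 300, "x" * 50, "x" * 400],
})
steps = [
    ("venue whitelist", lambda d: d["venue"].str.contains("NeurIPS|ICML", na=False)),
    ("abstract >= 200", lambda d: d["abstract"].str.len() >= 200),
]
for name, n in funnel(toy, steps):
    print(f"{name:>16}: {n}")
```

Running the same kind of funnel on the real `df` would show how much each filter contributes to the final count.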
