# Milestone 1 — Automated Paper Search & PDF Download

**Week 1–2 milestone:** environment setup, automated Semantic Scholar search, paper selection and PDF download, and metadata CSV and dataset preparation. Enter your API key when Cell 3 prompts for it before running the later cells.

**Files uploaded:** This environment may already contain uploaded files in `/mnt/data/` (for example, a PDF you uploaded earlier). Cell 2 lists whatever is there.

**How to use:** Run the cells in order (Shift+Enter). Each cell is numbered and has its own section heading, so the demo is easy to follow.


## Cell 1 — Install dependencies

In [1]:
# Cell 1: Install dependencies
!pip install --upgrade pip
!pip install requests pandas tqdm PyMuPDF==1.22.3 nbformat

print("Dependencies installed: requests, pandas, tqdm, PyMuPDF, nbformat")

Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3
Collecting PyMuPDF==1.22.3
  Downloading PyMuPDF-1.22.3.tar.gz (59.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: PyMuPDF
  Building wheel for PyMuPDF (pyproject.toml) ... done
  Created wheel for PyMuPDF: filename=pymupdf-1.22.3-cp312-cp312
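
The pip log is long and easy to misread, so a quick check of what actually ended up installed can help. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the pip install line in Cell 1
    wanted = ["requests", "pandas", "tqdm", "PyMuPDF", "nbformat"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `None` entry means the install step should be re-run before continuing.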

## Cell 2 — Imports, configuration, and folders

This cell sets up folders and shows uploaded files in `/mnt/data/`. You'll add your API key in the next cell.

In [2]:
# Cell 2: Imports and folder setup
import os
from pathlib import Path
import json
import requests
import pandas as pd
from tqdm.auto import tqdm
import time
import getpass

ROOT_DIR = Path("/content/semantic_scholar_results")
PAPERS_DIR = ROOT_DIR / "papers"
ROOT_DIR.mkdir(parents=True, exist_ok=True)
PAPERS_DIR.mkdir(parents=True, exist_ok=True)

print("Results root:", ROOT_DIR)
print("Papers dir:", PAPERS_DIR)

# List uploaded files in /mnt/data to help you locate your uploaded PDF(s)
uploaded = list(Path("/mnt/data").glob("*"))
print('\nFiles in /mnt/data:')
for p in uploaded:
    print("-", p.name)

Results root: /content/semantic_scholar_results
Papers dir: /content/semantic_scholar_results/papers

Files in /mnt/data:


## Cell 3 — API Key (secure input)

**IMPORTANT:** For the demo, paste your Semantic Scholar API key when prompted. The input is hidden and the key is never printed.


In [4]:
# Cell 3: Load Semantic Scholar API key (secure)
import os, getpass
if 'SS_API_KEY' not in os.environ:
    key = getpass.getpass("API KEY:")
    os.environ['SS_API_KEY'] = key.strip()
SEMANTIC_SCHOLAR_API_KEY = os.environ['SS_API_KEY']
HEADERS = {"x-api-key": SEMANTIC_SCHOLAR_API_KEY, "User-Agent": "Colab-Milestone1/1.0"}
print("API key loaded into environment variable SS_API_KEY (hidden).")

API KEY:··········
API key loaded into environment variable SS_API_KEY (hidden).
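
Cell 3 stores whatever is typed at the prompt, including an empty string. A possible variation, sketched below, builds the headers from the environment and leaves out `x-api-key` when the value is blank. The function name `build_headers` is hypothetical, and whether keyless requests are accepted (and at what rate limit) is an assumption to check against the current Semantic Scholar API documentation.

```python
import os


def build_headers(env=None, var_name="SS_API_KEY"):
    """Build request headers, adding x-api-key only when a non-blank key is set."""
    env = os.environ if env is None else env
    headers = {"User-Agent": "Colab-Milestone1/1.0"}
    key = (env.get(var_name) or "").strip()
    if key:
        headers["x-api-key"] = key
    return headers


if __name__ == "__main__":
    # Prints the header names only, never the key itself
    print(sorted(build_headers().keys()))
```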


## Cell 4 — Quick API connectivity test

This confirms your key works and the API is reachable. If you see `Unauthorized`, re-check your key.

In [5]:
# Cell 4: Quick test
import requests
test_url = "https://api.semanticscholar.org/graph/v1/author/search"
try:
    r = requests.get(test_url, headers=HEADERS, params={"query":"test","limit":1}, timeout=15)
    print("Status code:", r.status_code)
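    if r.status_code in (401, 403):
        print("Unauthorized: check your API key.")
    elif r.status_code == 429:
        print("Rate limited: wait a moment and re-run this cell.")
    elif r.ok:
        print("API reachable. You can proceed.")
    else:
        print("Unexpected response:", r.text[:200])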
except Exception as e:
    print("Request failed:", str(e))

Status code: 200
API reachable. You can proceed.


## Cell 5 — Search function

This function calls the Semantic Scholar paper search endpoint and returns a pandas DataFrame with key fields.

In [6]:
# Cell 5: Semantic Scholar search function
SS_API_BASE = "https://api.semanticscholar.org/graph/v1"

def search_papers(query, limit=20, offset=0, year_from=None, year_to=None, open_access_only=False):
    fields = "title,authors,year,abstract,url,openAccessPdf,paperId,citationCount,isOpenAccess,venue"
    params = {"query": query, "limit": limit, "offset": offset, "fields": fields}
    resp = requests.get(f"{SS_API_BASE}/paper/search", headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json().get("data", [])
    rows = []
    for p in data:
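        rows.append({
            "paperId": p.get("paperId"),
            "title": p.get("title"),
            "year": p.get("year"),
            "venue": p.get("venue"),
            "abstract": p.get("abstract"),
            "url": p.get("url"),
            "openAccessPdf": p.get("openAccessPdf"),
            "isOpenAccess": p.get("isOpenAccess"),
            "citationCount": p.get("citationCount", 0),
            # "authors" can come back as null, so guard before iterating
            "authors": "; ".join((a.get("name") or "") for a in (p.get("authors") or []))
        })
    # Fixed column list: keeps the filters below working when the API returns no rows
    columns = ["paperId", "title", "year", "venue", "abstract", "url",
               "openAccessPdf", "isOpenAccess", "citationCount", "authors"]
    df = pd.DataFrame(rows, columns=columns)
    # Filters run client-side on the fetched page, so fewer than `limit` rows may remain
    if year_from is not None:
        df = df[df["year"] >= int(year_from)]
    if year_to is not None:
        df = df[df["year"] <= int(year_to)]
    if open_access_only:
        df = df[df["isOpenAccess"] == True]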
    return df

# Example (commented): df = search_papers("explainable AI", limit=5)
print("Search function ready.")

Search function ready.
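
`search_papers` calls `raise_for_status()`, so a rate-limit response (HTTP 429) surfaces as an exception on the first attempt. One way to soften that is to retry with exponential backoff. The sketch below is illustrative only: `call_with_backoff` is a hypothetical helper, and the retry count and delays are assumptions rather than values recommended by the API.

```python
import time


def call_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or the retries run out.

    fetch is any zero-argument callable returning an object with a
    .status_code attribute (for example, a requests.Response).
    """
    for attempt in range(max_retries + 1):
        resp = fetch()
        if resp.status_code != 429 or attempt == max_retries:
            return resp
        # Wait base_delay, then 2x, 4x, ... before the next attempt
        sleep(base_delay * (2 ** attempt))
```

Inside `search_papers`, the request could then be wrapped as `call_with_backoff(lambda: requests.get(f"{SS_API_BASE}/paper/search", headers=HEADERS, params=params, timeout=30))` before calling `raise_for_status()`.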


## Cell 6 — PDF download helpers

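Functions that pick the best PDF URL for a paper and download it with retries. The `openAccessPdf` link is tried first; the fallback is the paper's `url`, which is typically a Semantic Scholar landing page rather than a PDF, so the fallback rarely yields a usable file.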

In [7]:
# Cell 6: Download helpers
import shutil, re

def _safe_filename(s):
    s = re.sub(r'[^0-9a-zA-Z \-_\.]', '', s or "")
    return s.strip().replace(" ", "_")[:150]

def get_best_pdf_url(paper_row):
    oap = paper_row.get("openAccessPdf") or {}
    if isinstance(oap, dict) and oap.get("url"):
        return oap.get("url")
    return paper_row.get("url")

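def download_file(url, dest_path, max_retries=3):
    headers = {"User-Agent":"Mozilla/5.0"}
    for attempt in range(max_retries):
        try:
            with requests.get(url, stream=True, headers=headers, timeout=30, allow_redirects=True) as r:
                if r.status_code == 200:
                    # iter_content decodes gzip/deflate bodies; copying r.raw would not
                    chunks = r.iter_content(chunk_size=8192)
                    first = next(chunks, b"")
                    if b"%PDF" not in first[:1024]:
                        return False  # HTML page or other non-PDF body; retrying will not help
                    with open(dest_path, "wb") as f:
                        f.write(first)
                        for chunk in chunks:
                            f.write(chunk)
                    return True
        except requests.RequestException:
            pass  # network error; fall through to the retry delay
        time.sleep(1)
    return False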

def download_selected(df, indices, target_dir=PAPERS_DIR):
    results = []
    for i in tqdm(indices, desc="Downloading"):
        row = df.loc[i].to_dict()
        url = get_best_pdf_url(row)
        title = row.get("title") or row.get("paperId", "paper")
        fname = f"{row.get('paperId')}_{_safe_filename(title)}.pdf"
        out_path = target_dir / fname
        success = False
        if url:
            try:
                success = download_file(url, out_path)
            except Exception as ex:
                success = False
        results.append({
            "paperId": row.get("paperId"),
            "title": title,
            "url_used": url,
            "local_path": str(out_path) if success else None,
            "downloaded": bool(success),
            "year": row.get("year"),
            "authors": row.get("authors"),
            "citationCount": row.get("citationCount")
        })
    return pd.DataFrame(results)

print('Download helpers ready.')

Download helpers ready.
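
A file saved with a `.pdf` extension can be verified after the fact by looking for the `%PDF-` signature near its start, which is useful for auditing files already sitting in the papers folder. This is a minimal sketch; `is_pdf_file` and the 1024-byte probe window are assumptions for illustration, not part of the notebook.

```python
from pathlib import Path


def is_pdf_file(path, probe_bytes=1024):
    """Return True if the file exists and its first bytes contain the %PDF- signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return b"%PDF-" in f.read(probe_bytes)
```

For example, `[p.name for p in PAPERS_DIR.glob("*.pdf") if not is_pdf_file(p)]` would list files that were saved as PDFs but do not look like one.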


## Cell 7 — Dataset / metadata utilities

Save metadata and index local PDFs in the papers folder.

In [8]:
# Cell 7: Dataset utilities
def save_metadata(df_meta, csv_path=ROOT_DIR/"metadata_run.csv"):
    df_meta.to_csv(csv_path, index=False)
    print("Saved metadata:", csv_path)
    return csv_path

def index_pdfs(papers_dir=PAPERS_DIR, out_csv=ROOT_DIR/"indexed_papers.csv"):
    rows=[]
    for p in papers_dir.glob("*.pdf"):
        st = p.stat()
        rows.append({"filename":p.name, "local_path":str(p), "size_bytes": st.st_size})
    df = pd.DataFrame(rows)
    df.to_csv(out_csv, index=False)
    print("Indexed PDFs ->", out_csv)
    return df

print('Dataset utilities ready.')

Dataset utilities ready.
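
`index_pdfs` records the name and size of each file. When the same paper is fetched under two different queries, a content hash can reveal duplicates that differ only by filename. The sketch below uses only the standard library; `sha256_of` and `find_duplicate_pdfs` are hypothetical helpers, not functions from the notebook.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_pdfs(papers_dir):
    """Group PDF filenames by content hash; return only groups with more than one file."""
    by_hash = {}
    for p in sorted(Path(papers_dir).glob("*.pdf")):
        by_hash.setdefault(sha256_of(p), []).append(p.name)
    return [names for names in by_hash.values() if len(names) > 1]
```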


## Cell 8 — Demo: end-to-end example

Change `topic`, `fetch_k`, and `download_top` as needed. This cell performs: search → show results → download top N → save metadata → index PDFs.

In [9]:
# Cell 8: Demo end-to-end (run to demonstrate Milestone 1)
topic = "explainable AI interpretability"   # <-- change as needed
fetch_k = 10
download_top = 3

print(f"Searching for top {fetch_k} papers on: {topic}")
df = search_papers(topic, limit=fetch_k)
if df.empty:
    print("No results returned. Try a different query or increase limit.")
else:
    display(df[["title","year","venue","citationCount","isOpenAccess"]])

    # Download the top N results; positions 0..N-1 are valid labels because the frame is re-indexed below
    indices_to_download = list(range(min(download_top, len(df))))
    print(f"Downloading top {len(indices_to_download)} papers...")
    res = download_selected(df.reset_index(drop=True), indices=indices_to_download)
    display(res)
    # Save metadata and index files
    save_metadata(res, ROOT_DIR/f"metadata_{topic.replace(' ','_')}.csv")
    index_pdfs()
    print('Demo run complete. Check the papers folder and metadata CSV.')

Searching for top 10 papers on: explainable AI interpretability


| | title | year | venue | citationCount | isOpenAccess |
|---|---|---|---|---|---|
| 0 | Explainable AI: A Review of Machine Learning I... | 2020 | Entropy | 2277 | True |
| 1 | Improving explainable AI interpretability with... | 2025 | International journal of information technology | 2 | False |
| 2 | Explainable AI (XAI) for trustworthy and trans... | 2025 | World Journal of Advanced Engineering Technolo... | 10 | False |
| 3 | TRANSFORMING CYBER DEFENSE THROUGH EXPLAINABLE... | 2025 | INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING ... | 0 | False |
| 4 | From local explanations to global understandin... | 2020 | Nature Machine Intelligence | 6123 | True |
| 5 | Enhancing Explainable AI: A Hybrid Approach Co... | 2024 | arXiv.org | 10 | False |
| 6 | Exploring Explainable AI Techniques for Improv... | 2024 | arXiv.org | 8 | False |
| 7 | Eye into AI: Evaluating the Interpretability o... | 2023 | Proc. ACM Hum. Comput. Interact. | 14 | True |
| 8 | A Comparative Analysis of Explainable AI Techn... | 2023 | 2023 3rd International Conference on Pervasive... | 14 | False |
| 9 | Explainable AI for Tomato Leaf Disease Detecti... | 2023 | 2023 26th International Conference on Computer... | 29 | False |


Downloading top 3 papers...



| | paperId | title | url_used | local_path | downloaded | year | authors | citationCount |
|---|---|---|---|---|---|---|---|---|
| 0 | f156ecbbb9243522275490d698c6825f4d2e01af | Explainable AI: A Review of Machine Learning I... | https://www.mdpi.com/1099-4300/23/1/18/pdf?ver... | | False | 2020 | Pantelis Linardatos; Vasilis Papastefanopoulos... | 2277 |
| 1 | fcdf01034779263661adf7c0425ae1d2245908de | Improving explainable AI interpretability with... | https://www.semanticscholar.org/paper/fcdf0103... | | False | 2025 | P. N. Ambritta; Parkshit N. Mahalle; H. Bhapka... | 2 |
| 2 | f29ad386529f7a46aab64d8dae4dbc4186599115 | Explainable AI (XAI) for trustworthy and trans... | https://www.semanticscholar.org/paper/f29ad386... | | False | 2025 | Arunraju Chinnaraju | 10 |


Saved metadata: /content/semantic_scholar_results/metadata_explainable_AI_interpretability.csv
Indexed PDFs -> /content/semantic_scholar_results/indexed_papers.csv
Demo run complete. Check the papers folder and metadata CSV.
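
In the run above none of the three downloads succeeded, and two of the three attempts used a `semanticscholar.org/paper/...` landing page because those papers have no open-access PDF. Selecting only rows that carry a direct `openAccessPdf` link before downloading may give a better hit rate. The sketch below works on plain dictionaries (for example `df.to_dict("records")`); `pick_downloadable` is a hypothetical helper and the sample records are made up for illustration.

```python
def pick_downloadable(records, n):
    """Return the first n records that carry a non-empty openAccessPdf URL."""
    picked = []
    for rec in records:
        if len(picked) >= n:
            break
        oap = rec.get("openAccessPdf")
        # openAccessPdf is either missing/None or a dict with a "url" entry
        if isinstance(oap, dict) and oap.get("url"):
            picked.append(rec)
    return picked


if __name__ == "__main__":
    # Made-up records, not real API output
    sample = [
        {"paperId": "a", "openAccessPdf": {"url": "https://example.org/a.pdf"}},
        {"paperId": "b", "openAccessPdf": None},
        {"paperId": "c", "openAccessPdf": {"url": "https://example.org/c.pdf"}},
    ]
    print([r["paperId"] for r in pick_downloadable(sample, 2)])
```

A direct link can still fail (publishers sometimes block scripted requests), so the `downloaded` column remains the source of truth.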


## Cell 9 — Zip results (optional)

Create a zip file containing the `papers/` folder and metadata CSV for easy handover.

In [10]:
# Cell 9: Zip papers + metadata
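import shutil, tempfile
zip_path = ROOT_DIR/"papers_and_metadata.zip"
zip_path.unlink(missing_ok=True)  # drop an archive from an earlier run so it is not re-zipped
# Build outside ROOT_DIR: an archive written into the folder being zipped can end up containing itself
with tempfile.TemporaryDirectory() as tmp:
    tmp_zip = shutil.make_archive(str(Path(tmp)/"papers_and_metadata"), 'zip', root_dir=ROOT_DIR)
    shutil.move(tmp_zip, str(zip_path))
print("Created zip:", zip_path)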

Created zip: /content/semantic_scholar_results/papers_and_metadata.zip
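
Before handing the archive over, it can be worth listing what it actually contains. A minimal sketch with the standard-library `zipfile` module follows; `zip_summary` is a hypothetical helper name.

```python
import zipfile


def zip_summary(zip_path):
    """Return (name, uncompressed_size_in_bytes) for every file entry in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size)
                for info in zf.infolist() if not info.is_dir()]
```

Calling `zip_summary(ROOT_DIR / "papers_and_metadata.zip")` should show the metadata CSVs and any files under `papers/`; an empty `papers/` listing means no download succeeded.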
