## Architecture Overview
- `PDFDownloader` (base) provides shared HTTP + save helpers.
- Strategy subclasses: `UnpaywallDownloader`, `CrossrefDownloader`, `SciHubDownloader`.
- `PDFDownloadManager` tries the strategies in order and regenerates the email used for Unpaywall every 50 DOIs.
- `run_bulk_download` (in `downloader.py`) instantiates the strategies, returns a DataFrame of results, and saves `download_summary.csv` in the download directory.

> Add new sources by creating another subclass of `PDFDownloader` and including it in the list inside `run_bulk_download`.
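
The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `downloader.py`: the two toy strategies, `random_email`, the `email_refresh_every` parameter, and the `download_all` method are assumptions made for the example, and no HTTP requests are made. In the real manager the regenerated email would presumably be handed to the Unpaywall strategy.

```python
import random
import string
from typing import Dict, List


class PDFDownloader:
    """Stand-in base strategy (the real one also carries HTTP + save helpers)."""

    def try_download(self, doi: str) -> bool:
        raise NotImplementedError


class AlwaysFails(PDFDownloader):
    """Toy strategy that never finds a PDF."""

    def try_download(self, doi: str) -> bool:
        return False


class SucceedsOnPlos(PDFDownloader):
    """Toy strategy that only 'finds' DOIs with the 10.1371 prefix."""

    def try_download(self, doi: str) -> bool:
        return doi.startswith("10.1371/")


def random_email(length: int = 10) -> str:
    """Hypothetical email generator; the real format is an assumption."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length)) + "@example.com"


class PDFDownloadManager:
    def __init__(self, strategies: List[PDFDownloader], email_refresh_every: int = 50) -> None:
        self.strategies = strategies
        self.email_refresh_every = email_refresh_every
        self.email = random_email()  # initial email
        self.emails_generated = 1

    def download_all(self, dois: List[str]) -> List[Dict[str, object]]:
        results = []
        for i, doi in enumerate(dois):
            # Regenerate the email after every `email_refresh_every` DOIs.
            if i > 0 and i % self.email_refresh_every == 0:
                self.email = random_email()
                self.emails_generated += 1
            # any() stops at the first strategy that reports success.
            success = any(s.try_download(doi) for s in self.strategies)
            results.append({"doi": doi, "success": success})
        return results


# Placeholder DOIs; refresh interval lowered to 2 so the effect is visible.
manager = PDFDownloadManager([AlwaysFails(), SucceedsOnPlos()], email_refresh_every=2)
results = manager.download_all(["10.1371/example-a", "10.9999/example-b", "10.1371/example-c"])
print(results)
```

Because the strategies are tried lazily and in list order, putting the cheapest or most reliable source first keeps later sources as fallbacks only.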

In [1]:
# Imports: keep notebook light; heavy logic resides in downloader.py
import pandas as pd
from downloader import run_bulk_download
from pathlib import Path

In [2]:
# Load DOIs from CSV (expects a column named 'doi')
doi_file = 'sample_doi.csv'
dois = pd.read_csv(doi_file)['doi'].dropna().tolist()
len(dois)

30

In [3]:
# Run bulk download via plugin manager helper
results_df = run_bulk_download(dois, download_dir='fulldownloads')
results_df.head()

2025-11-14 14:07:55,379 - INFO - Starting download for 30 DOI(s). Initial email generated.
2025-11-14 14:07:55,380 - INFO - --- Starting download process for DOI: 10.1093/humrep/dex273 ---
2025-11-14 14:07:55,380 - INFO - Trying strategy: UnpaywallDownloader
2025-11-14 14:07:55,381 - INFO - Unpaywall: querying API at https://api.unpaywall.org/v2/10.1093/humrep/dex273?email=ar8vd7i5pi@gmail.com
2025-11-14 14:07:56,386 - INFO - Trying strategy: CrossrefDownloader
2025-11-14 14:07:56,386 - INFO - Crossref: querying API for 10.1093/humrep/dex273

| Unnamed: 0 | doi | success |
|---|---|---|
| 0 | 10.1093/humrep/dex273 | True |
| 1 | 10.1371/journal.pbio.2002173 | True |
| 2 | 10.1186/s13063-017-2034-0 | True |
| 3 | 10.1097/SLA.0000000000001795 | True |
| 4 | 10.1136/postgradmedj-2020-139392 | True |


In [4]:
# Quick success summary
success_count = results_df['success'].sum()
total = len(results_df)
print(f'Success: {success_count}/{total} ( {success_count/total:.1%} )')
print('Saved summary CSV at: fulldownloads/download_summary.csv')

Success: 28/30 ( 93.3% )
Saved summary CSV at: fulldownloads/download_summary.csv
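
To follow up on the DOIs that did not download, the `success` column can be used as a boolean mask. The sketch below uses a small stand-in frame with placeholder DOIs in place of the real `results_df`; it assumes only the `doi` and `success` columns shown in the output above.

```python
import pandas as pd

# Stand-in for results_df; the real frame is returned by run_bulk_download.
results_df = pd.DataFrame(
    {
        "doi": ["10.1000/example-a", "10.1000/example-b", "10.1000/example-c"],
        "success": [True, False, True],
    }
)

# Invert the mask to keep only rows where the download failed.
failed_mask = ~results_df["success"].astype(bool)
failed_dois = results_df.loc[failed_mask, "doi"].tolist()
print(failed_dois)
```

The resulting list could be passed back to `run_bulk_download` for a retry.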


## Extending
To add a new source, create a subclass of `PDFDownloader` in `plugins_class.py` that implements `try_download(doi)`, then add it to the strategy list built in `run_bulk_download` (in `downloader.py`). The notebook code remains unchanged.
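
As an illustration of the shape such a subclass might take, here is a hedged sketch. The base class below is a stand-in (the real `PDFDownloader` interface in `plugins_class.py` may differ), and `LocalCacheDownloader`, its constructor argument, and its file-naming rule are hypothetical. It treats a DOI as downloaded when a matching PDF already exists on disk, so it runs without network access.

```python
import tempfile
from pathlib import Path


class PDFDownloader:
    """Stand-in for the real base class; its actual constructor may differ."""

    def __init__(self, download_dir: str = "fulldownloads") -> None:
        self.download_dir = Path(download_dir)

    def try_download(self, doi: str) -> bool:
        raise NotImplementedError


class LocalCacheDownloader(PDFDownloader):
    """Hypothetical source: succeed if a PDF for this DOI is already on disk."""

    def _filename(self, doi: str) -> str:
        # DOIs contain '/', which cannot appear in a file name (assumed naming rule).
        return doi.replace("/", "_") + ".pdf"

    def try_download(self, doi: str) -> bool:
        return (self.download_dir / self._filename(doi)).is_file()


# Demo in a temporary directory with a placeholder DOI.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "10.1000_example-a.pdf").write_bytes(b"%PDF-1.4\n")
    cache = LocalCacheDownloader(download_dir=tmp)
    found = cache.try_download("10.1000/example-a")
    missing = cache.try_download("10.1000/example-b")

print(found, missing)
```

A strategy like this returns a plain boolean, which is all the manager needs in order to decide whether to move on to the next source.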

In [1]:
# Create a zip archive of all downloaded PDFs + summary CSV
from downloader import zip_downloads
zip_path = zip_downloads('fulldownloads', zip_name='papers_download')
zip_path

2025-11-14 14:11:16,478 - INFO - Created zip archive at papers_download.zip


WindowsPath('papers_download.zip')

## Download as Zip
Run the cell above to create `papers_download.zip` in the project root for manual download or sharing.
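
For readers curious how a helper of this kind could work, the sketch below archives a folder with the standard library's `shutil.make_archive`. It is an assumption about the approach, not the actual body of `zip_downloads`; the function name `zip_folder` is hypothetical, and the demo runs in a temporary directory with a placeholder CSV.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder: str, zip_name: str = "papers_download") -> Path:
    """Hypothetical stand-in: archive the contents of `folder` as `<zip_name>.zip`."""
    # make_archive appends the '.zip' extension and returns the archive path.
    archive = shutil.make_archive(zip_name, "zip", root_dir=folder)
    return Path(archive)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "fulldownloads"
    src.mkdir()
    (src / "download_summary.csv").write_text("doi,success\n")

    zip_path = zip_folder(str(src), zip_name=str(Path(tmp) / "papers_download"))
    with zipfile.ZipFile(zip_path) as zf:
        archived = zf.namelist()

print(zip_path.name, archived)
```

Passing `root_dir` makes the paths inside the archive relative to the download folder, so the zip unpacks without the enclosing directory.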