## Architecture Overview
- `PDFDownloader` (base) provides shared HTTP + save helpers.
- Strategy subclasses: `UnpaywallDownloader`, `CrossrefDownloader`, `SciHubDownloader`.
- `PDFDownloadManager` tries each strategy in order until one succeeds, and regenerates the email used for Unpaywall every 50 DOIs.
- `run_bulk_download` (in `downloader.py`) instantiates the strategies, returns a DataFrame of results, and saves `download_summary.csv`.

> To add a new source, create another subclass of `PDFDownloader` and add it to the strategy list inside `run_bulk_download`.
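
The ordered fallback described above can be sketched in a few lines. This is a minimal illustration, not the code in `downloader.py`: `SketchManager`, `AlwaysFails`, `AlwaysSucceeds`, `email_factory`, and the return type of `try_download` (a path string or `None`) are all assumptions made for the example, and how the email actually reaches the Unpaywall strategy is not shown.

```python
from typing import Callable, Dict, List, Optional


class PDFDownloader:
    """Stand-in base class; the real one also carries HTTP + save helpers."""

    def try_download(self, doi: str) -> Optional[str]:
        """Return a saved file path on success, or None (assumed contract)."""
        raise NotImplementedError


class AlwaysFails(PDFDownloader):
    """Hypothetical stub standing in for a source that has no PDF."""

    def try_download(self, doi: str) -> Optional[str]:
        return None


class AlwaysSucceeds(PDFDownloader):
    """Hypothetical stub standing in for a source that finds the PDF."""

    def try_download(self, doi: str) -> Optional[str]:
        return doi.replace("/", "_") + ".pdf"


class SketchManager:
    """Tries strategies in order and refreshes an email every N DOIs."""

    def __init__(self, strategies: List[PDFDownloader],
                 email_factory: Callable[[], str], refresh_every: int = 50):
        self.strategies = list(strategies)
        self.email_factory = email_factory
        self.refresh_every = refresh_every
        self.email = email_factory()  # initial email for the first batch

    def download_one(self, doi: str) -> Dict[str, object]:
        # The first strategy that returns a path wins; later ones are skipped.
        for strategy in self.strategies:
            saved = strategy.try_download(doi)
            if saved is not None:
                return {"doi": doi, "success": True,
                        "strategy": type(strategy).__name__, "path": saved}
        return {"doi": doi, "success": False, "strategy": None, "path": None}

    def download_all(self, dois: List[str]) -> List[Dict[str, object]]:
        results = []
        for index, doi in enumerate(dois):
            # Regenerate the email at DOI 50, 100, 150, ... (not at 0).
            if index > 0 and index % self.refresh_every == 0:
                self.email = self.email_factory()
            results.append(self.download_one(doi))
        return results


# Usage: a deterministic factory makes the refresh schedule easy to inspect.
issued: List[str] = []


def counting_email() -> str:
    issued.append(f"user{len(issued)}@example.com")
    return issued[-1]


manager = SketchManager([AlwaysFails(), AlwaysSucceeds()], counting_email)
results = manager.download_all([f"10.1234/demo.{n}" for n in range(120)])
```

Keeping the strategies in a plain list means the fallback order is just the list order, which is why adding a source only requires appending one more instance.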

In [4]:
!pip install -r requirements.txt

Collecting streamlit>=1.36.0 (from -r requirements.txt (line 1))
  Using cached streamlit-1.51.0-py3-none-any.whl.metadata (9.5 kB)
Collecting altair!=5.4.0,!=5.4.1,<6,>=4.0 (from streamlit>=1.36.0->-r requirements.txt (line 1))
  Using cached altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting blinker<2,>=1.5.0 (from streamlit>=1.36.0->-r requirements.txt (line 1))
  Using cached blinker-1.9.0-py3-none-any.whl.metadata (1.6 kB)
Collecting cachetools<7,>=4.0 (from streamlit>=1.36.0->-r requirements.txt (line 1))
  Using cached cachetools-6.2.2-py3-none-any.whl.metadata (5.6 kB)
Collecting click<9,>=7.0 (from streamlit>=1.36.0->-r requirements.txt (line 1))
  Using cached click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting pillow<13,>=7.1.0 (from streamlit>=1.36.0->-r requirements.txt (line 1))
  Using cached pillow-12.0.0-cp314-cp314-win_amd64.whl.metadata (9.0 kB)
Collecting protobuf<7,>=3.20 (from streamlit>=1.36.0->-r requirements.txt (line 1))
  Using cached protobuf-6.

  error: subprocess-exited-with-error
  
  × Building wheel for pyarrow (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [876 lines of output]
      !!
      
              ********************************************************************************
              Please use a simple string containing a SPDX expression for `project.license`. You can also use `project.license-files`. (Both options available on setuptools>=77.0.0).
      
              By 2026-Feb-18, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
              ********************************************************************************
      
      !!
        corresp(dist, value, root_dir)
      !!
      
              ********************************************************************************
              

In [1]:
# Imports: keep notebook light; heavy logic resides in downloader.py
import pandas as pd
from downloader import run_bulk_download
from pathlib import Path

In [2]:
# Load DOIs from CSV (expects a column named 'doi')
doi_file = 'sample_doi.csv'
dois = pd.read_csv(doi_file)['doi'].dropna().tolist()
len(dois)

30

In [3]:
# Run bulk download via plugin manager helper
results_df = run_bulk_download(dois, download_dir='fulldownloads')
results_df.head()

2025-11-17 22:18:52,747 - INFO - Starting download for 30 DOI(s). Initial email generated.
2025-11-17 22:18:52,748 - INFO - --- Starting download process for DOI: 10.1093/humrep/dex273 ---
2025-11-17 22:18:52,749 - INFO - Trying strategy: UnpaywallDownloader
2025-11-17 22:18:52,750 - INFO - Unpaywall: querying API at https://api.unpaywall.org/v2/10.1093/humrep/dex273?email=9shc0h7@outlook.com
2025-11-17 22:18:53,909 - INFO - Trying strategy: CrossrefDownloader
2025-11-17 22:18:53,910 - INFO - Crossref: querying API for 10.1093/humrep/dex273
2025-11-17 22:18:55,131 - INFO - Trying strategy: SciHubDownloader
2025-11-17 22:18:55,132 - INFO - Sci-Hub: trying mirrors for 10.1093/humrep/dex273
2025-11-17 22:18:55,133 - INFO - Trying mirror: https://sci-hub.st/
2025-11-17 22:18:55,629 - INFO - Found PDF URL: https://2024.sci-hub.st/6763/927de616dbf59438aa7c7b08220a216e/braakhekke2017.pdf#navpanes=0&view=FitH
2025-11-17 22:18:56,196 - INFO - SUCCESS with SciHubDownloader. Saved to: fulldownloa

| Unnamed: 0 | doi | success |
|---|---|---|
| 0 | 10.1093/humrep/dex273 | True |
| 1 | 10.1371/journal.pbio.2002173 | True |
| 2 | 10.1186/s13063-017-2034-0 | True |
| 3 | 10.1097/SLA.0000000000001795 | True |
| 4 | 10.1136/postgradmedj-2020-139392 | True |


In [4]:
# Quick success summary
success_count = results_df['success'].sum()
total = len(results_df)
print(f'Success: {success_count}/{total} ( {success_count/total:.1%} )')
print('Saved summary CSV at: fulldownloads/download_summary.csv')

Success: 28/30 ( 93.3% )
Saved summary CSV at: fulldownloads/download_summary.csv


## Extending
To add a new source, create a subclass of `PDFDownloader` in `plugins_class.py` that implements `try_download(doi)`, then add an instance of it to the strategy list built in `run_bulk_download` (in `downloader.py`). The notebook code stays unchanged.
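
As a rough sketch of what such a subclass might look like, the example below adds a hypothetical `ArxivDownloader` that handles arXiv-issued DOIs (prefix `10.48550/arXiv.`). The base class here is a simplified stand-in: the names `fetch`, `save_pdf`, and `http_get`, the constructor signature, and the return type are assumptions for illustration, since the real helper names in `plugins_class.py` are not shown in this notebook.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, Optional

ARXIV_DOI_PREFIX = "10.48550/arxiv."  # compared case-insensitively


def http_get(url: str) -> bytes:
    """Default fetcher (assumed); the real base class has its own HTTP helper."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


class PDFDownloader:
    """Simplified stand-in for the real base class in plugins_class.py."""

    def __init__(self, download_dir: str,
                 fetch: Callable[[str], bytes] = http_get):
        self.download_dir = Path(download_dir)
        self.fetch = fetch

    def save_pdf(self, doi: str, content: bytes) -> Optional[Path]:
        # Reject anything that does not look like a PDF (e.g. an HTML error page).
        if not content.startswith(b"%PDF"):
            return None
        self.download_dir.mkdir(parents=True, exist_ok=True)
        target = self.download_dir / (doi.replace("/", "_") + ".pdf")
        target.write_bytes(content)
        return target

    def try_download(self, doi: str) -> Optional[Path]:
        raise NotImplementedError


class ArxivDownloader(PDFDownloader):
    """Hypothetical new source: resolves arXiv DOIs to arxiv.org PDF URLs."""

    def pdf_url(self, doi: str) -> Optional[str]:
        if not doi.lower().startswith(ARXIV_DOI_PREFIX):
            return None  # not an arXiv DOI; let the next strategy try
        return "https://arxiv.org/pdf/" + doi[len(ARXIV_DOI_PREFIX):]

    def try_download(self, doi: str) -> Optional[Path]:
        url = self.pdf_url(doi)
        if url is None:
            return None
        try:
            content = self.fetch(url)
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            return None
        return self.save_pdf(doi, content)


# Usage with a fake fetcher, so the example runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    source = ArxivDownloader(tmp, fetch=lambda url: b"%PDF-1.7 demo bytes")
    skipped = source.try_download("10.1093/humrep/dex273")
    saved = source.try_download("10.48550/arXiv.1706.03762")
    saved_name = saved.name if saved else None
    saved_ok = bool(saved) and saved.read_bytes().startswith(b"%PDF")
```

Returning `None` for DOIs the source cannot handle lets the manager fall through to the next strategy without special-casing.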

In [1]:
# Create a zip archive of all downloaded PDFs + summary CSV
from downloader import zip_downloads
zip_path = zip_downloads('fulldownloads', zip_name='papers_download')
zip_path

2025-11-14 14:11:16,478 - INFO - Created zip archive at papers_download.zip


WindowsPath('papers_download.zip')

## Download as Zip
Run the previous cell to create `papers_download.zip` in the project root for manual download or sharing.
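
For readers curious how such a helper could work, here is one plausible sketch built on the standard library's `shutil.make_archive`. It is not the actual `zip_downloads` from `downloader.py`; the function name `zip_folder_sketch` and the `out_dir` parameter are assumptions introduced for the example.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_folder_sketch(folder, zip_name="papers_download", out_dir="."):
    """Zip every file under `folder` into `<out_dir>/<zip_name>.zip`."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"No such download folder: {folder}")
    # make_archive appends ".zip" itself and returns the archive path as a string.
    archive = shutil.make_archive(
        str(Path(out_dir) / zip_name), "zip", root_dir=str(folder)
    )
    return Path(archive)


# Usage on a throwaway folder so the example does not touch real downloads.
with tempfile.TemporaryDirectory() as work:
    source = Path(work) / "fulldownloads"
    source.mkdir()
    (source / "a.pdf").write_bytes(b"%PDF-1.7 demo bytes")
    (source / "download_summary.csv").write_text("doi,success\n")

    zip_path = zip_folder_sketch(source, out_dir=work)
    zip_file_name = zip_path.name
    with zipfile.ZipFile(zip_path) as archive_file:
        archived = sorted(archive_file.namelist())
```

Writing the archive outside the folder being zipped avoids the archive trying to include itself.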