# Stanford Preprints

This notebook downloads Stanford preprints from [OpenAlex](https://openalex.org).

If we query OpenAlex directly using the [has_oa_submitted_version](https://docs.openalex.org/api-entities/works/filter-works#has_oa_submitted_version) filter, we get all preprints, regardless of whether they have a DOI assigned. That also means we need to use the OpenAlex ID, rather than the DOI, to name the saved PDF.
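To illustrate why the OpenAlex ID works as a file name: every work has an `id` URL, and its last path segment is a short identifier that is safe to use on the filesystem. This is only a sketch, and the helper name `openalex_filename` is hypothetical (it is not part of pyalex).

```python
def openalex_filename(work_id: str) -> str:
    """Turn an OpenAlex work URL into a PDF file name (hypothetical helper)."""
    # the short ID is the last path segment of the URL, e.g. W2117539524
    short_id = work_id.rstrip("/").split("/")[-1]
    return f"{short_id}.pdf"


print(openalex_filename("https://openalex.org/W2117539524"))
```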

## Institution Codes

First we need to know which institution codes to use for Stanford. These can be collected from the API, along with the number of preprint publications for each.

In [2]:
import re
import pyalex

inst_codes = []

for inst in pyalex.Institutions().search("stanford").get():
    inst_id = re.sub(r'https://.+/', '', inst['id'])
    count = pyalex.Works().filter(institutions={"id": inst_id}, has_oa_submitted_version=True).count()
    print(f"{inst['display_name']} ({inst_id}): {count}")
    inst_codes.append(inst_id)

Stanford University (I97018004): 62331
Stanford Medicine (I4210137306): 1936
Stanford Synchrotron Radiation Lightsource (I4210120900): 7052
Stanford Health Care (I4210105015): 254
SRI International (I1298353152): 1359
Stanford Blood Center (I4210133340): 46
SLAC National Accelerator Laboratory (I2801935854): 15206
Stanford Cancer Institute (I4390039303): 0
Stanford SystemX Alliance (I4392738099): 0
Stanford Maternal and Child Health Research Institute (I4391767688): 0
Institute for Stem Cell Biology and Regenerative Medicine (I4394709089): 0


We can query them all at once by joining the IDs with `|`, which the OpenAlex filter syntax treats as OR:

In [132]:
pyalex.Works().filter(institutions={"id": '|'.join(inst_codes)}, has_oa_submitted_version=True).count()

70133
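Note that the combined count is lower than the sum of the per-institution counts. A plausible reading (an assumption, not something verified here) is that a work with authors at more than one of these institutions matches the OR filter only once. The sketch below just redoes the arithmetic with the numbers printed above and shows the filter value that `'|'.join(...)` produces.

```python
# per-institution counts copied from the output above (zero-count entries omitted)
counts = {
    "I97018004": 62331,    # Stanford University
    "I4210137306": 1936,   # Stanford Medicine
    "I4210120900": 7052,   # Stanford Synchrotron Radiation Lightsource
    "I4210105015": 254,    # Stanford Health Care
    "I1298353152": 1359,   # SRI International
    "I4210133340": 46,     # Stanford Blood Center
    "I2801935854": 15206,  # SLAC National Accelerator Laboratory
}

# the value passed as the institution ID filter: IDs separated by "|"
or_filter = "|".join(counts)
print(or_filter)

combined = 70133                      # count returned by the combined query
summed = sum(counts.values())         # what we would see with no overlap
print(summed, summed - combined)
```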

## Download PDFs

Next we want to download the PDFs for these works. Here's a (kinda complex) download function that takes an OpenAlex Work (a dictionary) and saves the PDF to the filesystem, in a directory named after the domain of the PDF URL, with the OpenAlex ID as the file name.

In [6]:
from urllib.parse import urlparse
import requests
import pathlib
import time

# we are fetching with SSL verification off because requests doesn't seem to like www.slac.stanford.edu (at least for me)
requests.packages.urllib3.disable_warnings() 

# Use a User-Agent that looks like a real browser; a contact email address can be
# appended at the end (none is included in the string below).
# Experimentation showed that some servers respond differently based on the User-Agent.

ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:130.0) Gecko/20100101 Firefox/130.0"

download_dir = pathlib.Path('preprints')

def download_work(work, sleep_seconds=0.5):
    preprints = [l for l in work['locations'] if l['version'] == 'submittedVersion']

    if len(preprints) == 0:
        print(f"No preprint location found for {work['id']}")
        return None

    pdf_url = preprints[0]['pdf_url']
    if pdf_url is None:
        print(f"No preprint pdf_url found for {work['id']}")
        return None

    # construct the pdf filename
    domain = urlparse(pdf_url).netloc
    filename = download_dir / domain / (work['id'].split('/')[-1] + ".pdf")

    # sleep to not overwhelm servers by accident
    time.sleep(sleep_seconds)

    # get the pdf, only waiting 30 seconds for the response
    try:
        resp = requests.get(pdf_url, allow_redirects=True, timeout=30, verify=False, headers={"User-Agent": ua})
    except Exception as e:
        print(f"exception when fetching {pdf_url} - {e}")
        return None

    # If it's a 200 OK response and looks like a PDF, write it to the filesystem using the OpenAlex ID.
    # Note: some servers serve up PDFs using binary/octet-stream 
    # e.g. https://figshare.com/articles/journal_contribution/Estimation_and_Inference_of_Heterogeneous_Treatment_Effects_using_Random_Forests_sup_sup_/4902002/2/files/8238752.pdf 
    
    content_type = resp.headers.get('Content-Type', '').lower()
    if resp.status_code != 200:
        print(f"got {resp.status_code} from {pdf_url}")
        return None
    elif 'pdf' in content_type or 'octet-stream' in content_type:
        filename.parent.mkdir(parents=True, exist_ok=True)
        filename.write_bytes(resp.content)
        return filename
    else:
        print(f"{pdf_url} didn't look like an application/pdf response")
        return None

download_work(pyalex.Works().filter(institutions={"id": "I97018004"}, has_oa_submitted_version=True).get()[0])

PosixPath('preprints/arxiv.org/W2117539524.pdf')
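Because some servers label PDFs as `binary/octet-stream`, the Content-Type check alone can let non-PDF payloads through. One possible extra safeguard, which is not part of the function above, is to look for the `%PDF-` marker near the start of the response body. The helper name `looks_like_pdf` and the 1024-byte window are assumptions made for this sketch.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Heuristic check: a PDF file starts with a '%PDF-' header (hypothetical helper)."""
    # tolerate a little leading junk by searching only the first 1024 bytes
    return b"%PDF-" in content[:1024]


print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html><body>Access denied</body></html>"))
```

In `download_work` this could be called on `resp.content` before writing the file, for example to reject HTML error pages served with a 200 status.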

Now we can download all the PDFs for Stanford University (I97018004), writing each work's metadata to a JSONL file as we go. Note that this took 3 days to complete!

In [None]:
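import json
import math
import tqdm

download_dir.mkdir(parents=True, exist_ok=True)
metadata_file = download_dir / "metadata.jsonl"

works = pyalex.Works().filter(institutions={"id": "I97018004"}, has_oa_submitted_version=True)

# append so that an interrupted run does not wipe earlier metadata;
# the with-block makes sure the file is flushed and closed at the end
with metadata_file.open('a') as jsonl:
    # the progress bar counts pages, so round the page total up
    for page in tqdm.tqdm(works.paginate(per_page=100, n_max=None), total=math.ceil(works.count() / 100)):
        for work in page:
            pdf_file = download_work(work)

            # only write the metadata if we were able to download a PDF
            if pdf_file:
                work['download_file'] = re.sub('^preprints/', '', str(pdf_file))
                jsonl.write(json.dumps(work) + "\n")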
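Since a run this long may well get interrupted, it can help to know which works were already recorded before starting again. The sketch below reads the JSONL metadata file back and collects the OpenAlex IDs it contains. The function name `load_downloaded_ids` is hypothetical, and skipping unparseable lines is an assumption about how a truncated final line should be handled.

```python
import json
import pathlib


def load_downloaded_ids(metadata_path):
    """Return the set of OpenAlex work IDs already recorded in a JSONL file."""
    path = pathlib.Path(metadata_path)
    if not path.exists():
        return set()

    ids = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                ids.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # e.g. a half-written last line from an interrupted run
                continue
    return ids
```

Inside the download loop, one could then call `load_downloaded_ids(metadata_file)` once up front and skip any `work` whose `work['id']` is already in the returned set.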