https://docs.openalex.org/about-the-data/work


We're going to filter down to authors on the Ann Arbor campus and at affiliated institutions with more than 1k works listed. The filter only requires that a work list at least one of these affiliations, so occasionally listing Michigan Sea Grant won't exclude someone from the search (unless that's their ONLY listed affiliation).

- https://explore.openalex.org/institutions/I27837315 U-M Ann Arbor (695.7k)
- https://explore.openalex.org/institutions/I4210114445 Michigan Medicine (2.8k)
- https://explore.openalex.org/institutions/I4210151884 C.S. Mott Children's Hospital (2.7k)

This produces 953 JSON files, up to 200 works each (one API page per file), about 3.1 GB.
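
The three institution IDs above are combined into a single OpenAlex filter, where `,` joins conditions (AND) and `|` joins alternative values (OR). A minimal sketch of building that string; `INSTITUTION_IDS` and `build_filter` are hypothetical names used only for illustration, and the two base filters are the ones used later in this notebook.

```python
# Institution IDs copied from the list above; the labels are only for readability.
INSTITUTION_IDS = {
    "I27837315": "U-M Ann Arbor",
    "I4210114445": "Michigan Medicine",
    "I4210151884": "C.S. Mott Children's Hospital",
}

def build_filter(institution_ids, base_filters=("type:journal-article", "has_abstract:true")):
    """Join filters with ',' (AND) and institution IDs with '|' (OR)."""
    institutions = "institutions.id:" + "|".join(institution_ids)
    return ",".join((*base_filters, institutions))

shared_filters = build_filter(INSTITUTION_IDS)
print(shared_filters)
```

Because the IDs are OR-ed, a work matches if any one of its listed institutions is in the set.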


## Setup and imports

In [46]:
import math
import json
from pathlib import Path
import pickle
import time

from IPython.display import display, clear_output
import numpy as np
import pandas as pd
import httpx

# for i in range(10):
#     clear_output(wait=True)
#     print(i)
#     time.sleep(0.1)

DATA_DIR = Path('./data')
WORKS_DATA_DIR = DATA_DIR / 'works_of_institutions'
# Create both directories if needed (parents=True also creates DATA_DIR)
WORKS_DATA_DIR.mkdir(parents=True, exist_ok=True)

# tester = DATA_DIR / 'test.txt'
# tester.touch()
# assert tester.exists()

In [32]:
# API details
PER_PAGE = 200 # API max

SHARED_FILTERS = "type:journal-article,has_abstract:true,institutions.id:I27837315|I4210114445|I4210151884"
CONTACT_EMAIL = "matvan@umich.edu"
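
These constants end up in the query string of each request. As an alternative to joining the pieces by hand, here is a minimal sketch that uses the standard library's `urlencode`; `build_works_url` is a hypothetical helper, not something the notebook defines, and the `safe` characters are chosen only to keep the URL readable.

```python
from urllib.parse import urlencode

def build_works_url(filters, email, per_page=200, page=1):
    """Assemble a works URL from the filter string and paging settings."""
    params = {"mailto": email, "filter": filters, "per_page": per_page, "page": page}
    # Leave ':', ',', '|' and '@' unescaped so the URL stays readable
    query = urlencode(params, safe=":,|@")
    return f"https://api.openalex.org/works?{query}"

example_url = build_works_url(
    "type:journal-article,has_abstract:true,institutions.id:I27837315|I4210114445|I4210151884",
    "matvan@umich.edu",
)
print(example_url)
```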

## Data building

In [33]:
base_url = f"https://api.openalex.org/works?mailto={CONTACT_EMAIL}"
filtering = f"filter={SHARED_FILTERS}"
pagination = f"per_page={PER_PAGE}&page=1"
url = "&".join((base_url, filtering, pagination))
r = httpx.get(url)
print(r)

j = r.json()
print(j['meta'])

print(url)

n_records = j['meta']['count']
print(f'Records:          {n_records:>8,}')
n_pages = math.ceil(n_records / PER_PAGE)
print(f'Pages of records: {n_pages:>8,}')

<Response [200 OK]>
{'count': 245131, 'db_response_time_ms': 80, 'page': 1, 'per_page': 200}
https://api.openalex.org/works?mailto=matvan@umich.edu&filter=type:journal-article,has_abstract:true,institutions.id:I27837315|I4210114445|I4210151884&per_page=200&page=1
Records:           245,131
Pages of records:    1,226
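
The page count above matters because page-based paging only reaches the first 10k results of a query. A quick sketch of the arithmetic; `PAGING_LIMIT`, `pages_needed`, and `fits_under_limit` are illustrative names, and the counts are the ones printed in this notebook.

```python
import math

PER_PAGE = 200         # page size used in this notebook
PAGING_LIMIT = 10_000  # results reachable with page-based paging

def pages_needed(count, per_page=PER_PAGE):
    """Number of pages required to cover `count` results."""
    return math.ceil(count / per_page)

def fits_under_limit(count, limit=PAGING_LIMIT):
    """True if every result of a query can be reached by page number."""
    return count <= limit

total = 245_131
print(pages_needed(total))       # 1226
print(PAGING_LIMIT // PER_PAGE)  # 50 reachable pages at this page size
print(fits_under_limit(total))   # False, so the query has to be split up
```

Any slice of the query with at most 10,000 results can be paged through completely.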


In [48]:
# Page-based paging has a 10k-result limit, so split the query into smaller
# slices (publication year x open-access status) and page through each one.

records = []
# for year in range(2000,2023):
for year in range(2022, 1994, -1):
    for is_oa in ('true', 'false'):
        
        filtering = f"filter={SHARED_FILTERS},publication_year:{year},is_oa:{is_oa}"
        
        for i in range(1, 1000):
            clear_output(wait=True)
            
            pagination = f"per_page={PER_PAGE}&page={i}"
            url = "&".join((base_url, filtering, pagination))
            print(url)

            path = WORKS_DATA_DIR / f"work-{year}-{is_oa}-{i}.json"
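            if path.exists():
                # Reuse the saved metadata so a rerun still stops at the last page
                # instead of requesting (and saving) an empty page past the end.
                print(f"Skipping {path}")
                with path.open('r', encoding='UTF-8') as infile:
                    meta = json.load(infile)['meta']
            else:
                r = httpx.get(url)
                r.raise_for_status()  # fail loudly rather than saving an error payload
                j = r.json()
                meta = j['meta']
                print(meta)

                with path.open('w', encoding='UTF-8') as outfile:
                    json.dump(j, outfile)
                time.sleep(1)  # pause between requests

            # Stop once this page reaches the end of the slice
            if meta['per_page'] * meta['page'] >= meta['count']:
                break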
print('done')


https://api.openalex.org/works?mailto=matvan@umich.edu&filter=type:journal-article,has_abstract:true,institutions.id:I27837315|I4210114445|I4210151884,publication_year:1995,is_oa:false&per_page=200&page=13
{'count': 2221, 'db_response_time_ms': 19, 'page': 13, 'per_page': 200}
done


In [53]:
# Next: confirm that every saved JSON file can be loaded.
# This takes a minute, but refetching one bad page now is much easier
# than tracking down a problem in the combined data later.

full_file_list = list(WORKS_DATA_DIR.glob('*.json'))
n_works = 0
for i, path in enumerate(full_file_list):
    clear_output(wait=True)
    print(i+1, 'of', len(full_file_list), path)
    with path.open('r', encoding='UTF-8') as infile:
        j = json.load(infile)
        print(j.keys())
        n_works += len(j['results'])
    print('done:', n_works)
# if it doesn't error out, we're good

1133 of 1133 data\works_of_institutions\work-2022-true-9.json
dict_keys(['meta', 'results', 'group_by'])
done: 199696
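
The check above stops at the first file that fails to load. If several pages need refetching, it can help to collect all of them in one pass. A minimal sketch under that assumption; `find_bad_pages` is a hypothetical helper, and the demo runs on throwaway files rather than the real data directory.

```python
import json
import tempfile
from pathlib import Path

def find_bad_pages(directory):
    """Return (path, reason) pairs for files that fail to load or lack 'results'."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open("r", encoding="UTF-8") as infile:
                payload = json.load(infile)
        except (json.JSONDecodeError, UnicodeDecodeError) as err:
            bad.append((path, f"unreadable: {err.__class__.__name__}"))
            continue
        if not isinstance(payload, dict) or "results" not in payload:
            bad.append((path, "missing 'results'"))
    return bad

# Small self-contained demo: one valid page and one truncated page
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "work-2022-true-1.json").write_text('{"meta": {}, "results": []}', encoding="UTF-8")
    (tmp_dir / "work-2022-true-2.json").write_text('{"meta": {', encoding="UTF-8")
    bad_pages = [(p.name, reason) for p, reason in find_bad_pages(tmp_dir)]

print(bad_pages)
```

Deleting the files it reports and rerunning the download cell would refetch only those pages, since existing files are skipped.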
