# Loading data from the Harvard Art Museums

This script loads all objects in the online collection of the [Harvard Art Museums](https://harvardartmuseums.org/collections).

Requests are parallelized for maximum efficiency. Note that there may be IP-based rate limits on the website, so you won't necessarily be able to download the data at maximum network speed. It shouldn't take more than a few minutes though.

In [1]:
import asyncio
import aiohttp

import sys
from aiohttp_retry import RetryClient
from tqdm import tqdm

In [2]:
async def load_page(client: aiohttp.ClientSession, num: int):
    offset = 100 * num
    async with client.get(f'https://harvardartmuseums.org/browse?load_amount=100&offset={offset}') as resp:
        assert resp.status == 200
        return await resp.json()

async def load_data():
    concurrency = 10

    records = []
    sema = asyncio.Semaphore(concurrency)

    async with aiohttp.ClientSession() as client:
        client = RetryClient(client)

        metadata = (await load_page(client, 0))["info"]  # type: ignore
        n_pages = metadata['pages']
        print(f"found {metadata['totalrecords']} objects and {n_pages} pages")

        async def query_task(num: int) -> None:
            try:
                data = await load_page(client, i)  # type: ignore
                records.extend(data['records'])
            except Exception as e:
                print(f"request failure: {e}", file=sys.stderr)
            finally:
                sema.release()

        for i in tqdm(range(n_pages)):
            await sema.acquire()
            asyncio.create_task(query_task(i))

    for _ in range(concurrency):
        await sema.acquire()
    
    return records

In [3]:
data = await load_data()

found 242827 objects and 2429 pages


100%|██████████| 2429/2429 [05:24<00:00,  7.49it/s]
request failure: Session is closed
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2747)
request failure: [Errno 1] [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] app

In [4]:
deduped_data = list({record['id']: record for record in data}.values())

print(f"Retrieved {len(data)} records, and {len(deduped_data)} distinct records")

Retrieved 241900 records, and 231500 distinct records


In [5]:
import json

with open("../data/artmuseums.json", "w") as f:
    json.dump(deduped_data, f)