RecordSearch Series Harvests
Code to harvest the metadata and digitised images of all items in a series from the National Archives of Australia.
Note that this will only work with series that have fewer than 20,000 items because of RecordSearch limits, but it could easily be modified to harvest a subset (or indeed a series of subsets).
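If you do need to split a large series into subsets, one way to plan the batches is sketched below. This is not part of the harvester: `plan_subsets` is a hypothetical helper, and it assumes you already have a rough count of items per year (the counts in the usage example are made up for illustration). It groups consecutive years so that each batch stays under the 20,000 item limit.

```python
RECORDSEARCH_LIMIT = 20000  # the RecordSearch limit mentioned above


def plan_subsets(counts_by_year, limit=RECORDSEARCH_LIMIT):
    """Group consecutive years into (start, end) batches with fewer than `limit` items."""
    batches = []
    current, total = [], 0
    for year in sorted(counts_by_year):
        count = counts_by_year[year]
        if count >= limit:
            # A single year at or over the limit needs a finer split than this sketch offers
            raise ValueError(f'{year} alone has {count} items; split it further')
        if current and total + count >= limit:
            # Close the current batch before it reaches the limit
            batches.append((current[0], current[-1]))
            current, total = [], 0
        current.append(year)
        total += count
    if current:
        batches.append((current[0], current[-1]))
    return batches


# Hypothetical item counts per year, for illustration only
example_counts = {1901: 12000, 1902: 7000, 1903: 5000, 1904: 15000}
example_batches = plan_subsets(example_counts)
```

Each `(start, end)` pair could then be used to restrict a separate harvest, however the modified harvester chooses to filter items.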
The metadata is saved into a MongoDB database. This can be local, or on a cloud service like mLab. Just copy credentials.py and add your database's URL.
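As a rough guide to what the database URL looks like, here is a minimal sketch. The variable name `MONGO_URL` is an assumption, so check credentials.py for the name the harvester actually expects. The small helper only confirms that a value uses one of the standard MongoDB connection schemes; it does not try to connect.

```python
from urllib.parse import urlparse

# Hypothetical variable name; use whatever name credentials.py defines.
# This value points at a MongoDB server running locally on the default port.
MONGO_URL = 'mongodb://localhost:27017/'


def looks_like_mongo_url(url):
    """Return True if the URL uses a MongoDB connection scheme."""
    return urlparse(url).scheme in ('mongodb', 'mongodb+srv')
```

A cloud service will normally give you a longer URL that includes a username, password, and database name.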
You'll need to git clone my recordsearch-tools repository into a directory called rstools. Then in Python you can just:
    import harvest

    # Initiate harvester with a series id
    harvester = harvest.SeriesClient(series='A712')

    # Harvest item metadata
    harvester.do_harvest()

    # Harvest ALL the digitised images in this series
    harvest.harvest_images()
Note that harvest_images() is set up to create derivatives of every image. To disable this, just delete the contents of the
IMAGE_SIZES list in
To save the metadata as a CSV file:
    import process

    process.export_csv('A712')
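Once the CSV has been exported, you can inspect it with Python's standard library. The sketch below is not part of this repository: `summarise_csv` is a hypothetical helper, and it assumes only that the exported file has a header row. It counts how often each value appears in a column you name.

```python
import csv
from collections import Counter


def summarise_csv(path, column):
    """Count how often each value appears in one column of a CSV file."""
    with open(path, newline='', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)  # uses the header row for field names
        return Counter(row[column] for row in reader)
```

For example, `summarise_csv('A712.csv', 'digitised')` would tally the values in a `digitised` column; both the filename and the column name here are placeholders, so substitute whatever the export actually produces.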
To save a summary of the harvested series to a CSV file:
    import harvest

    harvest.series_summary(['A712', 'A711'])