wragge/recordsearch-series-harvests

RecordSearch Series Harvests

Code to harvest the metadata and digitised images of all items in a series from the National Archives of Australia.

Note that this only works with series containing fewer than 20,000 items because of RecordSearch limits, but it could easily be modified to harvest a subset (or indeed a series of subsets).
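
One way to approach the subset idea is to keep halving a range (a range of years, say) until each piece falls under the limit. The sketch below is not part of this repository: `split_range` and `count_items` are hypothetical names, with `count_items` standing in for whatever query reports how many items match a range. The only detail taken from above is the 20,000 ceiling.

```python
RESULT_LIMIT = 20000  # RecordSearch ceiling mentioned above


def split_range(start, end, count_items, limit=RESULT_LIMIT):
    """Return (start, end) year ranges that each hold fewer than `limit` items.

    count_items(start, end) is a hypothetical callable returning the number
    of items in the inclusive range.
    """
    if count_items(start, end) < limit or start == end:
        # Small enough to harvest in one go, or a single year that cannot be split
        return [(start, end)]
    middle = (start + end) // 2
    return (split_range(start, middle, count_items, limit)
            + split_range(middle + 1, end, count_items, limit))


def fake_count(start, end):
    # Stand-in for a real query: pretend every year holds 6,000 items
    return (end - start + 1) * 6000


for year_range in split_range(1901, 1910, fake_count):
    print(year_range)
```

Each resulting range could then be harvested separately. A single year that still exceeds the limit is returned as-is, so it would need a finer split on some other field.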

Start harvesting

The metadata is saved into a MongoDB database. This can be local or hosted on a cloud service like mLab. Just copy credentials_blank.py to credentials.py and add your database's URL.
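
As an illustration of what a database URL might look like, the sketch below builds a standard `mongodb://` connection string. The function `build_mongo_url`, the variable name `MONGO_URL`, the host, and the database name are all assumptions made for the example; check credentials_blank.py for the name the harvester actually expects.

```python
from urllib.parse import quote_plus


def build_mongo_url(host, database, username=None, password=None, port=27017):
    """Build a mongodb:// connection string, percent-escaping the credentials."""
    if username and password:
        auth = '{}:{}@'.format(quote_plus(username), quote_plus(password))
    else:
        auth = ''
    return 'mongodb://{}{}:{}/{}'.format(auth, host, port, database)


# Hypothetical values; a local database often needs no credentials
MONGO_URL = build_mongo_url('localhost', 'recordsearch')
print(MONGO_URL)
```

Escaping matters mostly for hosted databases, where a password containing characters such as `@` or `/` would otherwise break the URL.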

You'll need to git clone my recordsearch-tools repository into a directory called rstools. Then, in Python, you can run:

import harvest
# Initiate harvester with a series id
harvester = harvest.SeriesClient(series='A712')
# Harvest item metadata
harvester.do_harvest()
# Harvest ALL the digitised images in this series
harvest.harvest_images()

Note that harvest_images() is set up to create derivatives of every image. To disable this, just delete the contents of the IMAGE_SIZES list in harvest.py.

To save the metadata as a CSV file:

import process
process.export_csv('A712')
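
For a sense of what an export like this involves, here is a minimal standalone sketch that writes a list of item records to CSV using the standard library. It is not the code in process.py: the `write_items_csv` function and the field names are hypothetical, and in practice the records would come from the MongoDB collection rather than a hard-coded list.

```python
import csv


def write_items_csv(items, path, fieldnames):
    """Write a list of dicts to a CSV file, ignoring any keys not in fieldnames."""
    with open(path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for item in items:
            writer.writerow(item)


# Hypothetical records standing in for harvested item metadata
items = [
    {'identifier': '1', 'title': 'First item', 'digitised_pages': 4},
    {'identifier': '2', 'title': 'Second item', 'digitised_pages': 0},
]
write_items_csv(items, 'A712.csv', ['identifier', 'title', 'digitised_pages'])
```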

To save a summary of the harvested series to a CSV file:

import harvest
harvest.series_summary(['A712', 'A711'])

Harvested series
