Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

README.md

RecordSearch Series Harvests

Code to harvest the metadata and digitised images of all items in a series from the National Archives of Australia.

Note this will only work with series that have less than 20,000 items because of RecordSearch limits, but it could be easily modified to harvest a subset (or indeed a series of subsets).

Start harvesting

The metadata is saved into a MongoDB database. This can be local, or on a cloud service like mLab. Just copy credentials_blank.py to credentials.py and add in your database's url.

You'll need to git clone my recordsearch-tools repository into a directory called rstools. Then in Python you can just:

import harvest
# Initiate harvester with a series id
harvester = harvest.SeriesClient(series='A712')
# Harvest item metadata
harvester.do_harvest()
# Harvest ALL the digitised images in this series
harvest.harvest_images()

Note that harvest_images() is set up to create derivatives of every image. To disable this, just delete the contents of the IMAGE_SIZES list in harvest.py.

To save the metadata as a CSV file:

import process
process.export_csv('A712')

To save a summary of the harvested series to a CSV file:

import harvest
harvest.series_summary(['A712', 'A711'])

Harvested series

About

Code to harvest the metadata and digitised images of all items in a series from the National Archives of Australia.

Topics

Resources

License

Releases

No releases published

Languages

You can’t perform that action at this time.