# Harvest Victorian naturalization records in the National Archives of Australia

This notebook was used to harvest item data from the following series in April 2019:

* [A7796](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A7796) – Nominal index for pre-1904 Victorian naturalizations
* [A3977](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A3977) – Naturalisation Index, Victoria - Register of Patents with index
* [A726](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A726) – 'Registers of Certificates of Naturalization' [Volumes of enrolled certificates with index]
* [A728](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A728) – [1] 'Naturalization Index, Victoria' [Register of Patents with Index]; [2] 'Naturalization Indexes, Victoria' [Registers and indexes to enrolled letters of naturalization]
* [A727](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A727) – Volumes of enrolled letters of naturalization
* [A712](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A712) – Letters received, annual single number series with letter prefix or infix 
* [A725](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A725) – Register of Patents and (from 19 Feb. 1851) Register of Certificates of naturalization volume of enrolled certificates with index
* [A801](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A801) – Cancelled certificates of naturalization, Victoria

This code is now out of date because some of the packages it uses have been deprecated. If you want to undertake your own harvest of these or other series in the NAA, go to the [RecordSearch section of the GLAM Workbench](https://glam-workbench.net/recordsearch/) for current versions of the harvesting code.

The harvested data is available as individual `tinydb` JSON files in the `data/victoria` directory, or as a [single combined CSV file](naa_victoria_combined.csv).
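
If you only want to inspect one of the harvested files without installing `tinydb`, the following is a minimal sketch using the standard library. It assumes TinyDB's default JSON storage layout, in which each table is an object mapping document ids to records, and the `items` table name used by the harvester in this notebook. The `load_items` helper is hypothetical, and the `db-A801.json` file name is an assumption based on the harvester's naming pattern.

```python
import json
import os


def load_items(path, table='items'):
    """Return the records in one table of a TinyDB JSON file as a list of dicts."""
    with open(path, encoding='utf-8') as f:
        db = json.load(f)
    # Assumed layout (TinyDB default storage): {table_name: {doc_id: record}}
    return list(db.get(table, {}).values())


if __name__ == '__main__':
    # Assumed file name, following the 'db-{series}.json' pattern
    path = os.path.join('data', 'victoria', 'db-A801.json')
    if os.path.exists(path):
        items = load_items(path)
        print('{} items loaded'.format(len(items)))
```

A table that is missing from the file yields an empty list rather than raising an error.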

See [this notebook](naa_victoria.ipynb) for an overview of the harvested data.

In [5]:
import time
import os
import math
import string
import requests
import pandas as pd
from requests import ConnectionError
from recordsearch_tools.utilities import retry
from recordsearch_tools.client import RSSearchClient, RSSeriesClient
from tinydb import TinyDB, Query
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO

os.makedirs('data/victoria', exist_ok=True)

In [6]:
series = [
    'A7796',
    'A3977',
    'A726',
    'A728',
    'A727',
    'A712',
    'A725',
    'A801'  
]

output_dir = 'data/victoria'

In [65]:
# %load series_harvester.py
class SeriesHarvester():
    def __init__(self, series, control=None, output_dir='data'):
        self.series = series
        self.control = control
        self.output_dir = output_dir
        self.total_pages = None
        self.pages_complete = 0
        self.client = RSSearchClient()
        self.prepare_harvest()
        self.db = TinyDB(os.path.join(output_dir, 'db-{}.json'.format(self.series.replace('/', '-'))))
        self.items = self.db.table('items')
        self.images = self.db.table('images')

    def get_total(self):
        return self.client.total_results

    def prepare_harvest(self):
        if self.control:
            self.client.search(series=self.series, control=self.control)
        else:
            self.client.search(series=self.series)
        total_results = self.client.total_results
        print('{} items'.format(total_results))
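        # Ceiling division: floor(...) + 1 adds an extra, empty page when the total is an exact multiple of the page size
        self.total_pages = math.ceil(int(total_results) / self.client.results_per_page)
        print('{} pages'.format(self.total_pages))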

    @retry(ConnectionError, tries=20, delay=10, backoff=1)
    def start_harvest(self, page=None):
        Record = Query()
        if not page:
            page = self.pages_complete + 1
        while self.pages_complete < self.total_pages:
            if self.control:
                response = self.client.search(series=self.series, page=page, control=self.control, sort='9')
            else:
                response = self.client.search(series=self.series, page=page, sort='9')
            for result in response['results']:
                self.items.upsert(result, Record.identifier == result['identifier'])
            self.pages_complete += 1
            page += 1
            print('{} pages complete'.format(self.pages_complete))
            time.sleep(1)
            
def harvest_series(series, output_dir):
    h = SeriesHarvester(series=series, output_dir=output_dir)
    h.start_harvest()
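
The page count worked out in `prepare_harvest` is the number of result pages needed to hold every item. As a rough illustration of that arithmetic, the sketch below compares ceiling division with the `floor(...) + 1` form, which produces one page too many whenever the total is an exact multiple of the page size. Both function names are hypothetical, and the page size of 20 in the usage note is an assumed value, not one taken from RecordSearch.

```python
import math


def count_pages(total_results, results_per_page):
    """Number of pages needed to hold total_results items (ceiling division)."""
    return math.ceil(int(total_results) / results_per_page)


def count_pages_floor_plus_one(total_results, results_per_page):
    """The floor + 1 form, which overcounts on exact multiples of the page size."""
    return math.floor(int(total_results) / results_per_page) + 1
```

With an assumed page size of 20, `count_pages(40, 20)` gives 2 pages while `count_pages_floor_plus_one(40, 20)` gives 3; for 41 items both give 3.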

In [None]:
for s in series:
    harvest_series(s, output_dir)
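
The `@retry(ConnectionError, tries=20, delay=10, backoff=1)` decorator on `start_harvest` comes from `recordsearch_tools.utilities`, and its source is not shown in this notebook. The sketch below is a hypothetical stand-in that accepts the same arguments, assuming the common retry-with-backoff pattern: call the function again after a wait whenever the named exception is raised, multiplying the wait by `backoff` each time. It is not the `recordsearch_tools` implementation, and the name `retry_with_backoff` is made up for illustration.

```python
import time
from functools import wraps


def retry_with_backoff(exceptions, tries=4, delay=3, backoff=2):
    """Hypothetical retry decorator; not the recordsearch_tools code."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= backoff  # backoff=1 keeps the wait constant
        return wrapper
    return decorator
```

Under these assumptions, the settings used by the harvester would wait a constant 10 seconds between attempts and give up after the twentieth failure. Because `start_harvest` works out its starting page from `self.pages_complete`, a retried call resumes from the first page that has not yet been completed.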