# Harvest South Australian naturalization records in the National Archives of Australia

This notebook was used to harvest item data from the following series in April 2019:

* [A7419](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A7419) – Nominal index for pre-1904 South Australian naturalizations
* [A729](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A729) – Books of enrolled certificates of naturalization, issued 1848-1858, enrolled 1850-1889
* [A730](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A730) – Naturalized Aliens Journals
* [A731](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A731) – (1) 'Index to Aliens', name index book to certificates of naturalization, issued 1848-1858, enrolled 1850-1888 (2) List of aliens registered
* [A734](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A734) – Journal and index, naturalized aliens
* [A821](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A821) – Memorials of naturalization, with unenrolled or uncollected certificates
* [A732](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A732) – Journal and index, naturalized aliens
* [A735](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A735) – Oaths of Allegiance
* [A822](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A822) – Memorials and certificates of naturalization (unenrolled or uncollected), for South Australia under Act 20 of 21 Victoria
* [A823](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A823) – Enrolled Certificates of Naturalization and Memorials
* [A825](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A825) – Memorials of Naturalization, unregistered (1865)
* [A826](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A826) – Uncollected Certificates of Naturalization
* [A711](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A711) – Memorials of naturalization
* [A733](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A733) – Volumes of enrolled letters of naturalization
* [A805](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A805) – Cancelled Certificates of Naturalisation, South Australia

This code is now out of date because some of the packages it uses have been deprecated. If you want to undertake your own harvest of these or other series in the NAA, go to the [RecordSearch section of the GLAM Workbench](https://glam-workbench.net/recordsearch/) for current versions of the harvesting code.

The harvested data is available as individual `tinydb` JSON files (one per series, named `db-<series>.json`) in the `data/south-australia` directory, or as a [single combined CSV file](naa_south_australia_combined.csv).

See [this notebook](naa_south_australia.ipynb) for an overview of the harvested data.
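
The per-series `tinydb` files are plain JSON, so they can be read without `tinydb` installed. The sketch below assumes TinyDB's default JSON storage layout (a top-level object keyed by table name, with each table keyed by document id) and the `items` table created by the harvester. The sample records, the `series` and `title` fields, and the `load_table` helper are invented for illustration; only `identifier` is a field name taken from the harvesting code.

```python
import json
import os
import tempfile

# Stand-in file in TinyDB's default JSON layout; these records are invented
sample = {
    'items': {
        '1': {'identifier': '1001', 'series': 'A711', 'title': 'Example memorial'},
        '2': {'identifier': '1002', 'series': 'A711', 'title': 'Another memorial'},
    },
    'images': {},
}


def load_table(path, table='items'):
    """Return the records of one table from a TinyDB JSON file as a list of dicts."""
    with open(path, encoding='utf-8') as f:
        db = json.load(f)
    # Each table maps document ids to records; the ids are not needed here
    return list(db.get(table, {}).values())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'db-A711.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    records = load_table(path)

print(len(records), 'records loaded')
```

If pandas is available, passing the resulting list to `pd.DataFrame(records)` gives a table for further analysis.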


In [10]:
import math
import os
import time

from requests import ConnectionError
from recordsearch_tools.client import RSSearchClient
from recordsearch_tools.utilities import retry
from tinydb import TinyDB, Query

# Each series is saved to its own TinyDB file in this directory
os.makedirs('data/south-australia', exist_ok=True)

In [3]:
series = [
    'A7419',
    'A729',
    'A730',
    'A731',
    'A734',
    'A821',
    'A732',
    'A735',
    'A822',
    'A823',
    'A825',
    'A826',
    'A711',
    'A733',
    'A805'
]

In [11]:
class SeriesHarvester():
    def __init__(self, series, control=None):
        self.series = series
        self.control = control
        self.total_pages = None
        self.pages_complete = 0
        self.client = RSSearchClient()
        self.prepare_harvest()
        self.db = TinyDB('data/south-australia/db-{}.json'.format(self.series.replace('/', '-')))
        self.items = self.db.table('items')
        self.images = self.db.table('images')

    def get_total(self):
        return self.client.total_results

    def prepare_harvest(self):
        if self.control:
            self.client.search(series=self.series, control=self.control)
        else:
            self.client.search(series=self.series)
        total_results = self.client.total_results
        print('{} items'.format(total_results))
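        # Round up to include a final partial page; floor + 1 would add an
        # empty extra page whenever the total is an exact multiple of the page size
        self.total_pages = math.ceil(int(total_results) / self.client.results_per_page)
        print('{} pages'.format(self.total_pages))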

    @retry(ConnectionError, tries=20, delay=10, backoff=1)
    def start_harvest(self, page=None):
        Record = Query()
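        if page:
            # Starting from an explicit page: count the earlier pages as done
            # so the loop stops at the last page instead of running past it
            self.pages_complete = page - 1
        else:
            # Resume from the first page not yet harvested
            page = self.pages_complete + 1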
        while self.pages_complete < self.total_pages:
            if self.control:
                response = self.client.search(series=self.series, page=page, control=self.control, sort='9')
            else:
                response = self.client.search(series=self.series, page=page, sort='9')
            for result in response['results']:
                self.items.upsert(result, Record.identifier == result['identifier'])
            self.pages_complete += 1
            page += 1
            print('{} pages complete'.format(self.pages_complete))
            time.sleep(1)
            
def harvest_series(series):
    h = SeriesHarvester(series=series)
    h.start_harvest()
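
`start_harvest` is wrapped in `retry(ConnectionError, tries=20, delay=10, backoff=1)`. The sketch below is a hypothetical stand-in, not the `recordsearch_tools` implementation. It assumes the decorator follows the common retry recipe, in which a failed call is repeated after `delay` seconds and the delay is multiplied by `backoff` each time. Under that assumption, `backoff=1` means a constant 10-second wait, and because `pages_complete` is stored on the instance, a retried call picks up at the first unfinished page. `retry_sketch`, `FlakyHarvest`, and the injectable `sleep` argument are names introduced here for illustration, and the built-in `ConnectionError` stands in for the one imported from `requests`.

```python
import functools
import time


def retry_sketch(exc_type, tries=4, delay=3, backoff=2, sleep=time.sleep):
    """Hypothetical retry decorator; an assumption about how `retry` behaves."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            remaining, wait = tries, delay
            while remaining > 1:
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    sleep(wait)
                    remaining -= 1
                    wait *= backoff  # backoff=1 keeps the wait constant
            # Final attempt: an exception raised here reaches the caller
            return func(*args, **kwargs)
        return wrapper
    return decorator


class FlakyHarvest:
    """Toy harvester that fails twice on page 2, then runs to completion."""
    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.pages_complete = 0
        self.failures_left = 2

    def fetch(self):
        while self.pages_complete < self.total_pages:
            if self.pages_complete == 1 and self.failures_left > 0:
                self.failures_left -= 1
                raise ConnectionError('simulated dropped connection')
            self.pages_complete += 1
        return self.pages_complete


waits = []  # records each wait instead of actually sleeping
harvest = FlakyHarvest(total_pages=3)
run = retry_sketch(ConnectionError, tries=20, delay=10, backoff=1,
                   sleep=waits.append)(harvest.fetch)
pages_done = run()
print(pages_done, 'pages after', len(waits), 'retries')
```

Progress made before each failure is kept, so only the failed page and those after it are requested again.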

In [None]:
for s in series:
    harvest_series(s)
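
Because `start_harvest` saves each result with `upsert` keyed on `identifier`, re-running the loop above should update existing records rather than duplicate them. The following sketch imitates that behaviour with a plain list instead of a TinyDB table. The `upsert` helper and the sample pages are hypothetical.

```python
def upsert(table, record, key='identifier'):
    """Merge the record into a matching entry, or append it if none matches."""
    for i, existing in enumerate(table):
        if existing[key] == record[key]:
            # Update the stored fields with the newly harvested values
            table[i] = {**existing, **record}
            return
    table.append(record)


# Two invented pages of search results
pages = [
    [{'identifier': '1', 'title': 'a'}, {'identifier': '2', 'title': 'b'}],
    [{'identifier': '3', 'title': 'c'}],
]

items = []
# Harvest the same pages twice, as would happen if the loop were re-run
for _ in range(2):
    for page in pages:
        for result in page:
            upsert(items, result)

print(len(items), 'items stored')
```

The second pass leaves the number of stored items unchanged, which is what makes an interrupted harvest safe to restart.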