# Explore South Australian naturalization records in the National Archives of Australia

This notebook explores data harvested from the National Archives of Australia's online database RecordSearch. Details of items in the following series relating to naturalizations in the colony of South Australia were harvested in April 2019:

* [A7419](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A7419) – Nominal index for pre-1904 South Australian naturalizations
* [A729](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A729) – Books of enrolled certificates of naturalization, issued 1848-1858, enrolled 1850-1889
* [A730](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A730) – Naturalized Aliens Journals
* [A731](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A731) – (1) 'Index to Aliens', name index book to certificates of naturalization, issued 1848-1858, enrolled 1850-1888 (2) List of aliens registered
* [A734](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A734) – Journal and index, naturalized aliens
* [A821](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A821) – Memorials of naturalization, with unenrolled or uncollected certificates
* [A732](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A732) – Journal and index, naturalized aliens
* [A735](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A735) – Oaths of Allegiance
* [A822](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A822) – Memorials and certificates of naturalization (unenrolled or uncollected), for South Australia under Act 20 of 21 Victoria
* [A823](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A823) – Enrolled Certificates of Naturalization and Memorials
* [A825](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A825) – Memorials of Naturalization, unregistered (1865)
* [A826](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A826) – Uncollected Certificates of Naturalization
* [A711](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A711) – Memorials of naturalization
* [A733](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A733) – Volumes of enrolled letters of naturalization
* [A805](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A805) – Cancelled Certificates of Naturalisation, South Australia

Although all of these series relate to naturalization, they don't contain the same types of records, and their provenance and level of description vary. This means the items can't be treated as a single, consistent dataset without further work to understand how they relate to each other. The series notes contain some of this information, but item-level investigation would also be needed. For this project we're only using item metadata to build a broad overview of each series and see what information is available. A future project might analyse the contents in more detail.

In [5]:
import pandas as pd
import altair as alt
from tinydb import TinyDB, Query
import os

In [17]:
series = [
    'A7419',
    'A729',
    'A730',
    'A731',
    'A734',
    'A821',
    'A732',
    'A735',
    'A822',
    'A823',
    'A825',
    'A826',
    'A711',
    'A733',
    'A805'
]

output_dir = 'data/south-australia'
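
Before loading anything, it can be useful to check that a harvest file exists for every series in the list. The sketch below assumes the `db-<series>.json` naming used by the loading code in this notebook. The helper name `missing_series_files` is hypothetical and not part of the original harvest code.

```python
import os

def missing_series_files(series_ids, data_dir):
    """Return the series IDs that have no 'db-<series>.json' file in data_dir."""
    return [
        s for s in series_ids
        if not os.path.exists(os.path.join(data_dir, 'db-{}.json'.format(s)))
    ]

# Example: report any series that have not been harvested yet
missing = missing_series_files(['A7419', 'A729'], 'data/south-australia')
if missing:
    print('No harvest file found for:', ', '.join(missing))
```

An empty list means every series has a file to load.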

## Combine harvested data 

Load the item data from each series into a single dataframe for exploration.

In [25]:
def convert_to_df(series, output_dir=output_dir):
    '''
    Get the series data from TinyDB and save as a Pandas dataframe.
    Also flattens the date dictionary, and does a bit of ordering.
    '''
    # Load the series db
    db = TinyDB(os.path.join(output_dir, 'db-{}.json'.format(series)))
    items = db.table('items')
    # Let's convert the database into a simple list
    item_list = [i for i in items]
    # Now let's turn that list into a Pandas DataFrame
    df = pd.json_normalize(item_list)
    # Rename column
    df.rename({
        'contents_dates.date_str': 'contents_date_str', 
        'contents_dates.start_date': 'contents_start_date',
        'contents_dates.end_date': 'contents_end_date'
    }, axis=1, inplace=True)
    # Put columns in preferred order
    df = df[['identifier', 'series', 'control_symbol', 'title', 'contents_date_str', 'contents_start_date', 'contents_end_date', 'access_status', 'location', 'digitised_status', 'digitised_pages']]
    # sort_values returns a new dataframe, so assign the result
    df = df.sort_values(['identifier'])
    return df 
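
To show what the flattening step does, here is a minimal sketch using a single made-up record. The identifier, series and date values are illustrative only. The nested `contents_dates` key is assumed from the column names renamed in the function above.

```python
import pandas as pd

# A made-up item record shaped like the harvested data (values are illustrative)
sample_items = [
    {
        'identifier': '1',
        'series': 'A000',
        'contents_dates': {
            'date_str': '1850 - 1851',
            'start_date': '1850',
            'end_date': '1851',
        },
    }
]

# json_normalize expands the nested dictionary into dotted column names,
# such as 'contents_dates.start_date'
flat = pd.json_normalize(sample_items)
print(list(flat.columns))
```

These dotted names are what the `rename` call in `convert_to_df` replaces with `contents_date_str`, `contents_start_date` and `contents_end_date`.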

In [30]:
# Convert each series to a dataframe and combine them,
# renumbering the index so that row labels are unique
df = pd.concat([convert_to_df(s) for s in series], ignore_index=True)
df.head()

| | identifier | series | control_symbol | title | contents_date_str | contents_start_date | contents_end_date | access_status | location | digitised_status | digitised_pages |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11318497 | A7419 | FICHE 1 | ABEL, J to BONACIACH, N [Nominal index for pre... | 1848 - 1903 | 1848 | 1903 | Open | Melbourne | False | 0 |
| 1 | 11318498 | A7419 | FICHE 10 | SHILLING, J to SPIEWKOWSKI, J [Nominal index f... | 1848 - 1903 | 1848 | 1903 | Open | Melbourne | False | 0 |
| 2 | 11318499 | A7419 | FICHE 11 | SPIEWKOWSKY, I to WANERBACK, O [Nominal index ... | 1848 - 1903 | 1848 | 1903 | Open | Melbourne | False | 0 |
| 3 | 11318500 | A7419 | FICHE 12 | WANKE, A to ZWECK, F [Nominal index for pre-19... | 1848 - 1903 | 1848 | 1903 | Open | Melbourne | False | 0 |
| 4 | 11318501 | A7419 | FICHE 2 | BONGUEY, M to DULFER, E [Nominal index for pre... | 1848 - 1903 | 1848 | 1903 | Open | Melbourne | False | 0 |


In [63]:
df.to_csv('naa_south_australia_combined.csv', index=False)

## Visualise date ranges

Using the contents start date, we'll group items by year and visualise the date range of each series.

In [36]:
# Get counts by series and year and create a new dataframe
series_year_counts = df.value_counts(['series', 'contents_start_date']).to_frame().reset_index()
series_year_counts.columns = ['series', 'year', 'count']
series_year_counts

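| | series | year | count |
|---|---|---|---|
| 0 | A711 | 1895 | 590 |
| 1 | A711 | 1884 | 405 |
| 2 | A821 | 1851 | 367 |
| 3 | A711 | 1883 | 339 |
| 4 | A821 | 1850 | 286 |
| ... | ... | ... | ... |
| 100 | A734 | 1898 | 1 |
| 101 | A735 | 1856 | 1 |
| 102 | A805 | 1887 | 1 |
| 103 | A823 | 1858 | 1 |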
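
The same counts can also be produced with `groupby`, which names the count column explicitly instead of relying on the column order that `value_counts().to_frame()` returns. This is a minimal sketch on toy data. The series IDs and years below are made up, not taken from the harvest.

```python
import pandas as pd

# Toy data standing in for the combined dataframe (values are illustrative)
toy = pd.DataFrame({
    'series': ['A1', 'A1', 'A1', 'A2'],
    'contents_start_date': [1850, 1850, 1851, 1850],
})

counts = (
    toy.groupby(['series', 'contents_start_date'])
    .size()                                    # number of items per series and year
    .reset_index(name='count')                 # name the count column explicitly
    .rename(columns={'contents_start_date': 'year'})
    .sort_values('count', ascending=False)     # largest groups first, like value_counts
)
print(counts)
```

Either approach gives one row per series and year, ready for charting.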


Chart the new dataset.

In [57]:
alt.Chart(series_year_counts).mark_bar(size=10).encode(
    x=alt.X('year:Q', axis=alt.Axis(format='c')),
    y=alt.Y('count:Q', title='number of records'),
    color=alt.Color('series:N', scale=alt.Scale(scheme='tableau20')),
    tooltip=['year', 'series', 'count']
).properties(width=700)

## Number of items described and digitised

In [32]:
counts = df['series'].value_counts()
digitised = df.groupby(['series'])['digitised_status'].agg('sum').astype('int')
totals = pd.concat([counts, digitised], axis=1, sort=True).reset_index()
totals.columns = ['series', 'items', 'digitised']
totals['not_digitised'] = totals['items'] - totals['digitised']
totals

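| | series | items | digitised | not_digitised |
|---|---|---|---|---|
| 0 | A711 | 4237 | 4232 | 5 |
| 1 | A729 | 2 | 2 | 0 |
| 2 | A730 | 2 | 2 | 0 |
| 3 | A731 | 1 | 1 | 0 |
| 4 | A732 | 2 | 1 | 1 |
| 5 | A733 | 9 | 9 | 0 |
| 6 | A734 | 2 | 2 | 0 |
| 7 | A735 | 1 | 0 | 1 |
| 8 | A7419 | 12 | 0 | 12 |
| 9 | A805 | 10 | 0 | 10 |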
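
Raw counts are hard to compare when series range from one item to several thousand, so a percentage column can help. The sketch below uses three rows copied from the table above. The `percent_digitised` column name is an assumption added for illustration.

```python
import pandas as pd

# Three rows copied from the totals table above
totals_sample = pd.DataFrame({
    'series': ['A711', 'A732', 'A735'],
    'items': [4237, 2, 1],
    'digitised': [4232, 1, 0],
})

# Share of described items that have been digitised, as a percentage
totals_sample['percent_digitised'] = (
    totals_sample['digitised'] / totals_sample['items'] * 100
).round(1)
print(totals_sample)
```

The same calculation could be applied to the full `totals` dataframe.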


In [31]:
# Number of pages digitised
df.groupby(['series'])['digitised_pages'].agg('sum')

series
A711     14277
A729       996
A730       105
A731        80
A732        33
A733      4532
A734       196
A735         0
A7419        0
A805         0
A821      5278
A822       301
A823       348
A825         3
A826       320
Name: digitised_pages, dtype: int64

## Access status

In [62]:
df.groupby(['series', 'access_status']).size()

series  access_status
A711    Open             4237
A729    Open                2
A730    Open                2
A731    Open                1
A732    Open                2
A733    Open                9
A734    Open                2
A735    Open                1
A7419   Open               12
A805    Open               10
A821    Open             1703
A822    Open               90
A823    Open              106
A825    Open                2
A826    Open              160
dtype: int64