# Explore Victorian naturalization records in the National Archives of Australia

This notebook explores data harvested from the National Archives of Australia's online database RecordSearch. Details of items in the following series relating to naturalizations in the colony of Victoria [were harvested](naa_victoria_harvesting.ipynb) in April 2019:

* [A7796](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A7796) – Nominal index for pre-1904 Victorian naturalizations
* [A3977](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A3977) – 'Naturalisation Index, Victoria' - Register of Patents with index
* [A726](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A726) – 'Registers of Certificates of Naturalization' [Volumes of enrolled certificates with index]
* [A728](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A728) – [1] 'Naturalization Index, Victoria' [Register of Patents with Index]; [2] 'Naturalization Indexes, Victoria' [Registers and indexes to enrolled letters of naturalization]
* [A727](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A727) – Volumes of enrolled letters of naturalization
* [A712](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A712) – Letters received, annual single number series with letter prefix or infix 
* [A725](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A725) – Register of Patents and (from 19 Feb. 1851) Register of Certificates of naturalization volume of enrolled certificates with index
* [A801](http://recordsearch.naa.gov.au/scripts/AutoSearch.asp?Number=A801) – Cancelled certificates of naturalization, Victoria

Although these series relate to naturalization, they do not all contain the same types of records. Their provenance and level of description also vary. This means we can't aggregate them into a single dataset without additional work to understand how the items relate to each other. Some of this information is contained within the series notes, but investigation at item level would also be necessary. For this project we're just looking at item metadata to construct a broad overview of each series to see what information is available. A future project might analyse their contents in more detail.

In [1]:
import pandas as pd
import altair as alt
from tinydb import TinyDB, Query
import os

In [2]:
series = [
    'A7796',
    'A3977',
    'A726',
    'A728',
    'A727',
    'A712',
    'A725',
    'A801'  
]

output_dir = 'data/victoria'

## Combine harvested data 

Load the item data from each series into a single dataframe for exploration.

In [3]:
def convert_to_df(series, output_dir=output_dir):
    '''
    Get the series data from TinyDB and save as a Pandas dataframe.
    Also flattens the date dictionary, and does a bit of ordering.
    '''
    # Load the series db
    db = TinyDB(os.path.join(output_dir, 'db-{}.json'.format(series)))
    items = db.table('items')
    # Let's convert the database into a simple list
    item_list = [i for i in items]
    # Now let's turn that list into a Pandas DataFrame
    df = pd.json_normalize(item_list)
    # Rename the flattened date columns
    df.rename({
        'contents_dates.date_str': 'contents_date_str', 
        'contents_dates.start_date': 'contents_start_date',
        'contents_dates.end_date': 'contents_end_date'
    }, axis=1, inplace=True)
    # Put columns in preferred order
    df = df[['identifier', 'series', 'control_symbol', 'title', 'contents_date_str', 'contents_start_date', 'contents_end_date', 'access_status', 'location', 'digitised_status', 'digitised_pages']]
    # sort_values returns a new dataframe, so assign the result
    df = df.sort_values(['identifier'])
    return df 
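
The function above relies on `pd.json_normalize` to flatten the nested `contents_dates` dictionary into dotted column names, which are then renamed. A minimal sketch of that flattening step, using two invented records (the `sample_items` values are hypothetical and only mimic the shape of the harvested data):

```python
import pandas as pd

# Hypothetical records shaped like harvested items; all values are invented
sample_items = [
    {
        'identifier': '1',
        'series': 'X1',
        'contents_dates': {'date_str': '1852 - 1903', 'start_date': '1852', 'end_date': '1903'},
    },
    {
        'identifier': '2',
        'series': 'X1',
        'contents_dates': {'date_str': '1860', 'start_date': '1860', 'end_date': '1860'},
    },
]

# Nested keys become columns named 'parent.child'
flat = pd.json_normalize(sample_items)
print(flat.columns.tolist())
```

The dotted names are why the rename step maps `contents_dates.start_date` to `contents_start_date`, and so on.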

In [4]:
dfs = []
for s in series:
    df_series = convert_to_df(s)
    dfs.append(df_series)
df = pd.concat(dfs)
df.head()

| | identifier | series | control_symbol | title | contents_date_str | contents_start_date | contents_end_date | access_status | location | digitised_status | digitised_pages |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11256435 | A7796 | 1 | AANENSEN to AH, Hen [Nominal index for pre-190... | 1852 - 1903 | 1852 | 1903 | Open | Melbourne | False | 0 |
| 1 | 11256436 | A7796 | 10 | FEUERMAN, A to GEISLER, G [Nominal index for p... | 1852 - 1903 | 1852 | 1903 | Open | Canberra | False | 0 |
| 2 | 11256437 | A7796 | 11 | GEITNNER, F to HABER, W [Nominal index for pre... | 1852 - 1903 | 1852 | 1903 | Open | Canberra | False | 0 |
| 3 | 11256438 | A7796 | 12 | HABERLE, A to HENNET, H [Nominal index for pre... | 1852 - 1903 | 1852 | 1903 | Open | Canberra | False | 0 |
| 4 | 11256439 | A7796 | 13 | HENNING, W to IMBESCHEID, J [Nominal index for... | 1852 - 1903 | 1852 | 1903 | Open | Canberra | False | 0 |


In [12]:
df.to_csv('naa_victoria_combined.csv', index=False)
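
When the saved CSV is loaded again, pandas infers a type for each column, so identifiers and control symbols can come back as numbers rather than text. A minimal sketch of a round trip that keeps them as strings, using an in-memory buffer and invented rows (the `sample` values are hypothetical, not taken from the harvest):

```python
import io
import pandas as pd

# Hypothetical rows; the control symbols are invented to show mixed formats
sample = pd.DataFrame({
    'identifier': ['100', '101'],
    'control_symbol': ['01', '1885/A1'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Ask for strings explicitly so values such as '01' are not read as the integer 1
reloaded = pd.read_csv(buffer, dtype={'identifier': str, 'control_symbol': str})
print(reloaded.dtypes)
```

The same `dtype` argument could be passed when reading `naa_victoria_combined.csv`, if string identifiers are what you want.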

## Visualise date ranges

Using the contents start date, we'll group items by year and visualise the date range of each series.

In [5]:
# Get counts by series and year and create a new dataframe
series_year_counts = df.value_counts(['series', 'contents_start_date']).to_frame().reset_index()
series_year_counts.columns = ['series', 'year', 'count']
series_year_counts

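| | series | year | count |
|---|---|---|---|
| 0 | A712 | 1885 | 1345 |
| 1 | A712 | 1897 | 988 |
| 2 | A712 | 1883 | 717 |
| 3 | A712 | 1884 | 685 |
| 4 | A712 | 1901 | 600 |
| ... | ... | ... | ... |
| 75 | A801 | 1874 | 1 |
| 76 | A801 | 1873 | 1 |
| 77 | A801 | 1871 | 1 |
| 78 | A801 | 1857 | 1 |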


Chart the new dataset.

In [6]:
alt.Chart(series_year_counts).mark_bar(size=10).encode(
    x=alt.X('year:Q', axis=alt.Axis(format='c')),
    y=alt.Y('count:Q', title='number of records'),
    color=alt.Color('series:N', scale=alt.Scale(scheme='tableau20')),
    tooltip=['year', 'series', 'count']
).properties(width=700)

## Number of items described and digitised

In [7]:
counts = df['series'].value_counts()
digitised = df.groupby(['series'])['digitised_status'].agg('sum').astype('int')
totals = pd.concat([counts, digitised], axis=1, sort=True).reset_index()
totals.columns = ['series', 'items', 'digitised']
totals['not_digitised'] = totals['items'] - totals['digitised']
totals

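| | series | items | digitised | not_digitised |
|---|---|---|---|---|
| 0 | A3977 | 1 | 1 | 0 |
| 1 | A712 | 12908 | 3476 | 9432 |
| 2 | A725 | 1 | 1 | 0 |
| 3 | A726 | 6 | 6 | 0 |
| 4 | A727 | 14 | 14 | 0 |
| 5 | A728 | 3 | 3 | 0 |
| 6 | A7796 | 29 | 0 | 29 |
| 7 | A801 | 1115 | 85 | 1030 |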


In [8]:
# Number of pages digitised
df.groupby(['series'])['digitised_pages'].agg('sum')

series
A3977      299
A712     21654
A725        81
A726      2272
A727     10934
A728       994
A7796        0
A801       214
Name: digitised_pages, dtype: int64
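
The raw counts are easier to compare across series as proportions. A rough sketch that combines the figures from the two outputs above into one table, adding the percentage of items digitised and the mean number of pages per digitised item (the `summary` dataframe and its derived column names are introduced here for illustration only):

```python
import pandas as pd

# Figures copied from the item and page counts shown above
summary = pd.DataFrame({
    'series': ['A3977', 'A712', 'A725', 'A726', 'A727', 'A728', 'A7796', 'A801'],
    'items': [1, 12908, 1, 6, 14, 3, 29, 1115],
    'digitised': [1, 3476, 1, 6, 14, 3, 0, 85],
    'digitised_pages': [299, 21654, 81, 2272, 10934, 994, 0, 214],
})

# Percentage of described items that have been digitised
summary['percent_digitised'] = (summary['digitised'] / summary['items'] * 100).round(1)

# Mean pages per digitised item; series with nothing digitised are left as NaN
has_digitised = summary['digitised'] > 0
summary['pages_per_digitised_item'] = (
    summary['digitised_pages'].where(has_digitised)
    / summary['digitised'].where(has_digitised)
).round(1)

print(summary)
```

Series made up of a few bound volumes would be expected to show far more pages per item than series of individual letters or certificates, so the last column is only meaningful within a series type.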

## Access status

In [9]:
df.groupby(['series', 'access_status']).size()

series  access_status   
A3977   Open                    1
A712    Not yet examined        1
        Open                12907
A725    Open                    1
A726    Open                    6
A727    Open                   14
A728    Open                    3
A7796   Open                   29
A801    Open                 1115
dtype: int64
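
The grouped counts above can also be pivoted so that each access status becomes a column, which makes gaps easier to spot. A small sketch of that reshaping on invented rows (the `toy` dataframe is hypothetical and stands in for the combined data):

```python
import pandas as pd

# Hypothetical rows standing in for the combined dataframe
toy = pd.DataFrame({
    'series': ['S1', 'S1', 'S1', 'S2'],
    'access_status': ['Open', 'Open', 'Not yet examined', 'Open'],
})

# Count rows per series and status, then move the statuses into columns;
# combinations that never occur are filled with 0
status_table = (
    toy.groupby(['series', 'access_status'])
    .size()
    .unstack(fill_value=0)
)

print(status_table)
```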