# NSW naturalizations data from NSW State Archives 

For information about relevant records see the [Naturalization / Citizenship Guide](https://www.records.nsw.gov.au/archives/collections-and-research/guides-and-indexes/naturalization-citizenship-guide).

Relevant series include:

* [NRS 1038](https://www.records.nsw.gov.au/series/1038) – Letters of Denization
* [NRS 1039](https://www.records.nsw.gov.au/series/1039) – Certificates of Naturalization
* [NRS 1040](https://www.records.nsw.gov.au/series/1040) – Registers of Certificates of Naturalization
* [NRS 1041](https://www.records.nsw.gov.au/series/1041) – Lists of Aliens to whom Certificates of Naturalization have been issued
* [NRS 1042](https://www.records.nsw.gov.au/series/1042) – Index to Registers of Certificates of Naturalization and Lists of Aliens to whom Certificates of Naturalization have been issued

NRS 1042 is an index to NRS 1040 and NRS 1041. Data transcribed from NRS 1042 is available online in the [Naturalisation Index, 1834-1903](https://www.records.nsw.gov.au/archives/collections-and-research/guides-and-indexes/naturalization-and-denization/indexes).

Along with other online indexes, this data was scraped from the State Archives website and [shared on GitHub](https://github.com/wragge/srnsw-indexes) as a [CSV file](https://github.com/wragge/srnsw-indexes/blob/master/data/naturalisation.csv). See the [NSW State Archives section](https://glam-workbench.net/nsw-state-archives/) of the GLAM Workbench for more information.

In [1]:
import pandas as pd
import re
import altair as alt
import numpy as np
# alt.renderers.enable('default')
# alt.data_transformers.enable('default')

## Load the data

Load the CSV data scraped from the online index.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/wragge/srnsw-indexes/master/data/naturalisation.csv')

In [3]:
# Try to convert the date string into a datetime object
df['date'] = pd.to_datetime(df['DateOfCertificate'], errors='coerce')

In [4]:
df.head()

Unnamed: 0,Surname,FirstName,NativePlace,DateOfCertificate,RegisterNo,Page,Remarks,Item,Reel,date
0,AARONS,Iszra,Turkey,27 Feb 1900,14,370,-,[4/1213],138,1900-02-27
1,AARONS,Joseph,Bagdad,21 Nov 1900,15,98,-,[4/1214],138,1900-11-21
2,ABAID,Faid,Syria,23 Sep 1903,17,330,-,[4/1216],141,1903-09-23
3,ABDOO,Anthony,Syria,18 Mar 1902,16,253,-,[4/1215],139,1902-03-18
4,ABDULLAH,Assid,Syria,24 Dec 1896,13,183,-,[4/1212],137,1896-12-24


These are records with problematic dates that wouldn't automatically convert into datetimes.

In [5]:
df.loc[df['date'].isnull()]

Unnamed: 0,Surname,FirstName,NativePlace,DateOfCertificate,RegisterNo,Page,Remarks,Item,Reel,date
898,BEILEITER,John,Germany,31 Sep 1872,3,130,-,[4/1202],130,NaT
1522,CARNAP,-,-,-,0,0,Not naturalised See B06/2056,-,0,NaT
5408,LOLLBACH,Jacob,Germany,16 Jun/Jul 1858,2,283,-,[4/1201],129,NaT
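
As a small illustration of why these rows end up as `NaT`, the sketch below parses a few of the date strings shown above with an explicit format. The `format='%d %b %Y'` argument is an assumption about how the dates are written in the index; the notebook itself relies on pandas inferring the format.

```python
import pandas as pd

# Sample values copied from the tables above: one valid date and three problem cases
raw_dates = pd.Series(['27 Feb 1900', '31 Sep 1872', '-', '16 Jun/Jul 1858'])

# Assumed format: day, abbreviated month name, four-digit year.
# Anything that doesn't match, or isn't a real calendar date, is coerced to NaT.
parsed = pd.to_datetime(raw_dates, format='%d %b %Y', errors='coerce')

# Keep the original strings that failed to parse so they can be checked by hand
unparsed = raw_dates[parsed.isna()]
print(unparsed.tolist())
```

'31 Sep 1872' fails because September has only 30 days, while the other two values are not single dates at all.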


How many records are there?

In [6]:
len(df)

9860

What is the earliest date?

In [7]:
df['date'].min()

Timestamp('1834-05-24 00:00:00')

What is the latest date?

In [8]:
df['date'].max()

Timestamp('1903-12-31 00:00:00')

## Removing duplicates

Some preliminary investigation showed that a number of records were duplicates — these seem to be the result of uncertainty around the name order in Chinese names. So, for example, 'Ah You' is also entered in its reversed form as 'You Ah'. Similarly, 'Jimmy Ah Moy' is entered both as 'Jimmy Ah' with the surname 'Moy', and 'Jimmy' with the surname 'Ah Moy'.

Duplicates like these would inflate any counts drawn from the data, so I've made an attempt to remove them. Note that only the first record in each set of duplicates is kept, so any variations in the `NativePlace` and `Remarks` fields of the discarded records will be lost.

In [9]:
deduped_df = df.copy()
# Create a new column with both names combined in a single string
# This will enable us to identify duplicated records where the division between first name and surname varies
deduped_df['name'] = deduped_df['FirstName'].str.cat(deduped_df['Surname'], sep=' ').str.lower()
# Create a new column with both names combined in a single string in reverse order
deduped_df['name_reversed'] = deduped_df['Surname'].str.cat(deduped_df['FirstName'], sep=' ').str.lower()

def make_name_list(row):
    names = sorted([row['name'], row['name_reversed']])
    return ' '.join(names)

# Create a new column with both name forms in the same order
# This will enable us to identify duplicated records with reversed names
deduped_df['names'] = deduped_df.apply(make_name_list, axis=1)

In [10]:
# First round of deduping -- remove those where the separation between first name and surname varies
deduped_df.drop_duplicates(['name', 'DateOfCertificate', 'RegisterNo', 'Page', 'Item', 'Reel'], inplace=True)
# Second round -- remove those where name parts are reversed
deduped_df.drop_duplicates(['names', 'DateOfCertificate', 'RegisterNo', 'Page', 'Item', 'Reel'], inplace=True)
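
To see how the two rounds work, here is a minimal sketch using a handful of made-up rows modelled on the 'Ah You' and 'Jimmy Ah Moy' examples. The dates and page numbers are hypothetical, and only a subset of the identifying columns is used to keep things short.

```python
import pandas as pd

# Hypothetical rows: two pairs of duplicates plus one unique record
toy = pd.DataFrame({
    'Surname': ['YOU', 'AH', 'MOY', 'AH MOY', 'LEE'],
    'FirstName': ['Ah', 'You', 'Jimmy Ah', 'Jimmy', 'George'],
    'DateOfCertificate': ['1 Jan 1880', '1 Jan 1880', '2 Feb 1881', '2 Feb 1881', '3 Mar 1882'],
    'Page': [10, 10, 20, 20, 30],
})

# Same combined-name columns as in the cell above
toy['name'] = toy['FirstName'].str.cat(toy['Surname'], sep=' ').str.lower()
toy['name_reversed'] = toy['Surname'].str.cat(toy['FirstName'], sep=' ').str.lower()
toy['names'] = toy.apply(
    lambda row: ' '.join(sorted([row['name'], row['name_reversed']])), axis=1
)

# Round one catches 'Jimmy Ah' + 'MOY' vs 'Jimmy' + 'AH MOY' (same combined name)
step1 = toy.drop_duplicates(['name', 'DateOfCertificate', 'Page'])
# Round two catches 'Ah YOU' vs 'You AH' (same names, reversed order)
step2 = step1.drop_duplicates(['names', 'DateOfCertificate', 'Page'])

print(len(toy), len(step1), len(step2))
```

Five rows go in, four remain after the first round, and three after the second.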

In [11]:
len(deduped_df)

9097

In [12]:
deduped_df['year'] = deduped_df['date'].dt.year
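
Because some dates could not be parsed, the extracted years include missing values and appear as floats (for example `1857.0`) in the outputs below. If whole-number years are preferred, one option is pandas' nullable integer type. This is an optional sketch on sample data, not a step the rest of the notebook depends on.

```python
import pandas as pd

# One valid date and one unparseable value, mimicking the 'date' column
dates = pd.to_datetime(
    pd.Series(['11 Sep 1857', 'not a date']), format='%d %b %Y', errors='coerce'
)

# Casting to the nullable 'Int64' dtype keeps missing years as <NA>
# instead of forcing the whole column to float
year_int = dates.dt.year.astype('Int64')
print(year_int)
```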

## Countries of origin

Group the records by the `NativePlace` field.

In [21]:
country_counts = deduped_df['NativePlace'].value_counts().to_frame().reset_index()
country_counts.columns = ['NativePlace', 'Count']
country_counts[:25]

Unnamed: 0,NativePlace,Count
0,Germany,3170
1,Sweden,700
2,Prussia,598
3,China,530
4,Denmark,522
5,"Canton, China",484
6,France,352
7,Italy,314
8,Norway,288
9,Syria,239


Save the complete list of places to a [CSV file](nsw_country_counts.csv) for further investigation.

In [22]:
country_counts.to_csv('nsw_country_counts.csv', index=False)

## Aggregate Chinese places

Examination of the country counts shows that places in China are recorded in a number of different ways. Here we'll try to aggregate them.

First let's look at the values in the `NativePlace` field that include the word 'China'.

In [15]:
sorted(list(pd.unique(df.loc[df['NativePlace'].str.contains('China')]['NativePlace'])))

['Amoy, China',
 'Canton, China',
 'China',
 'Foochow, China',
 'Kouton, China',
 'Macao China',
 'Macoa, China',
 'Shanghai, China',
 'Sun On, China',
 'W Canton, China',
 'Whampoa, China',
 'Whompoa, China']

As well as these, examination of the country counts showed a number of other variations – these are listed below.

In [16]:
places_in_china = [
                    'Amoy',
                    'Amoy, China',
                    'Canton',
                    'Canton, China',
                    'China',
                    'Foochow, China',
                    'Hong Kong',
                    'Kouton, China',
                    'Macao',
                    'Macao China',
                    'Macoa, China',
                    'Near Hong Kong',
                    'Shanghai',
                    'Shanghai, China',
                    'Singapore',
                    'Sun On, China',
                    'Vhina',
                    'W Canton, China',
                    'West Canton',
                    'Whompoa',
                    'Whampoa, China',
                    'Whompoa, China'
                  ]
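
Misspelt variants such as 'Vhina' are easy to miss when scanning the counts by eye. As a rough aid, the sketch below uses `difflib.get_close_matches` from the standard library to suggest values that resemble a known place name. The helper `find_near_matches`, the sample values, and the `cutoff` of 0.75 are all illustrative assumptions, and any suggestions would still need checking by hand.

```python
from difflib import get_close_matches


def find_near_matches(values, known_places, cutoff=0.75):
    """Suggest known places that closely resemble each unrecognised value."""
    known_lower = {place.lower(): place for place in known_places}
    suggestions = {}
    for value in values:
        # Skip values that are already in the known list
        if value.lower() in known_lower:
            continue
        matches = get_close_matches(value.lower(), list(known_lower), n=3, cutoff=cutoff)
        if matches:
            suggestions[value] = [known_lower[m] for m in matches]
    return suggestions


# Hypothetical sample of NativePlace values
sample_values = ['Vhina', 'Whompoa', 'Germany', 'China']
near_matches = find_near_matches(sample_values, ['China', 'Whampoa'])
print(near_matches)
```

In practice `values` could be the unique entries in the `NativePlace` column, with the manually compiled list supplying the known places.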

Now we can create a new dataset containing the records of people who came from China and the surrounding region.

In [17]:
chinese_nats = deduped_df.loc[deduped_df['NativePlace'].isin(places_in_china)].sort_values(by='date')
chinese_nats

Unnamed: 0,Surname,FirstName,NativePlace,DateOfCertificate,RegisterNo,Page,Remarks,Item,Reel,date,name,name_reversed,names,year
8409,SOUT,Yan,"Amoy, China",11 Sep 1857,2,47,-,[4/1201],129,1857-09-11,yan sout,sout yan,sout yan yan sout,1857.0
1640,CHIAM,-,"Amoy, China",15 Oct 1857,2,53,-,[4/1201],129,1857-10-15,- chiam,chiam -,- chiam chiam -,1857.0
4426,KAI,Koon,China,16 Nov 1857,2,77,-,[4/1201],129,1857-11-16,koon kai,kai koon,kai koon koon kai,1857.0
4385,JUANSING,John,"Amoy, China",22 Jan 1858,2,103,-,[4/1201],129,1858-01-22,john juansing,juansing john,john juansing juansing john,1858.0
4512,KEONG,George Arthur,"Amoy, China",1 Mar 1858,2,127,-,[4/1201],129,1858-03-01,george arthur keong,keong george arthur,george arthur keong keong george arthur,1858.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,AH WHA,James,China,12 Jan 1886,9,434,-,[4/1208],133,1886-01-12,james ah wha,ah wha james,ah wha james james ah wha,1886.0
5119,LEE,George,Macao China,23 Feb 1886,9,452,-,[4/1208],133,1886-02-23,george lee,lee george,george lee lee george,1886.0
2678,FONG,Loodi,"Canton, China",27 Mar 1886,10,6,-,[4/1209],134,1886-03-27,loodi fong,fong loodi,fong loodi loodi fong,1886.0
130,AH,Hi,"Canton, China",16 Apr 1886,10,17,Impounded,[4/1209],134,1886-04-16,hi ah,ah hi,ah hi hi ah,1886.0


In [23]:
chinese_nats.to_csv('nsw_from_china.csv', index=False)

## Naturalisations over time

Chart the number of naturalizations over time, highlighting records where the `NativePlace` is in China or the surrounding region.

First we'll create a new column that indicates whether the `NativePlace` is in China or not.

In [24]:
deduped_df['country'] = np.where(deduped_df['NativePlace'].isin(places_in_china), 'China', 'Other')

Now we'll calculate the total number of records by year and `country` (i.e. 'China' or 'Other').

In [25]:
china_counts = deduped_df.value_counts(['year', 'country']).to_frame().reset_index()
china_counts.columns = ['year', 'country', 'count']
china_counts

Unnamed: 0,year,country,count
0,1901.0,Other,519
1,1903.0,Other,404
2,1902.0,Other,393
3,1883.0,China,352
4,1900.0,Other,322
...,...,...,...
81,1887.0,China,1
82,1836.0,Other,1
83,1841.0,Other,1
84,1837.0,Other,1
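
The labelling and counting steps can be tried out on a small sample. The sketch below uses made-up rows and a shortened stand-in for `places_in_china`. It tallies with `groupby(...).size()`, which should give the same counts as `value_counts` but ordered by year rather than by count.

```python
import numpy as np
import pandas as pd

# Hypothetical records with the year already extracted
toy = pd.DataFrame({
    'NativePlace': ['Canton, China', 'Germany', 'Amoy', 'Sweden', 'China'],
    'year': [1883, 1883, 1883, 1901, 1901],
})
toy_places = ['Amoy', 'Canton, China', 'China']  # shortened stand-in for places_in_china

# Label each record 'China' or 'Other' depending on its NativePlace
toy['country'] = np.where(toy['NativePlace'].isin(toy_places), 'China', 'Other')

# Count the records in each year/country combination
toy_counts = toy.groupby(['year', 'country']).size().reset_index(name='count')
print(toy_counts)
```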


Visualise the results, showing both the combined dataset, and just the 'China' records.

In [26]:
c1 = alt.Chart(china_counts).mark_bar(size=8).encode(
    x=alt.X('year:Q', axis=alt.Axis(format='c')),
    y=alt.Y('count:Q', stack=True),
    color='country:N',
    tooltip=['country', 'year', 'count'],
).properties(
    width=700
)

c2 = alt.Chart(china_counts.loc[china_counts['country'] == 'China']).mark_bar(size=8).encode(
    x=alt.X('year:Q', axis=alt.Axis(format='c'), scale=alt.Scale(domain=(1835,1905))),
    y='count:Q'
).properties(
    width=700
)

c1 & c2

Save the deduped dataset to a CSV file.

In [29]:
deduped_df.to_csv('nsw_deduped_with_country.csv', index=False)