<center> <h1>Exploring the Sir Joseph Banks Collection</h1> </center>

In this notebook, we will use the AdLib and Transcription APIs to explore the file contents and transcripts of the Sir Joseph Banks collection.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Exploring-the-Banks-Collection-Through-AdLib-API" data-toc-modified-id="Exploring-the-Banks-Collection-Through-AdLib-API-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Exploring the Banks Collection Through AdLib API</a></span><ul class="toc-item"><li><span><a href="#Exploring-a-Series-in-the-Banks-Collection" data-toc-modified-id="Exploring-a-Series-in-the-Banks-Collection-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Exploring a Series in the Banks Collection</a></span></li></ul></li><li><span><a href="#Exploring-the-Banks-Collection-Through-the-Transcription-Tool-API" data-toc-modified-id="Exploring-the-Banks-Collection-Through-the-Transcription-Tool-API-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploring the Banks Collection Through the Transcription Tool API</a></span><ul class="toc-item"><li><span><a href="#Exploring-the-Collections-in-the-Transcription-Tool" data-toc-modified-id="Exploring-the-Collections-in-the-Transcription-Tool-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Exploring the Collections in the Transcription Tool</a></span></li><li><span><a href="#Exploring-the-Series-Within-the-Banks-Collection" data-toc-modified-id="Exploring-the-Series-Within-the-Banks-Collection-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Exploring the Series Within the Banks Collection</a></span></li><li><span><a href="#Exploring-a-Specific-Series" data-toc-modified-id="Exploring-a-Specific-Series-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Exploring a Specific Series</a></span></li><li><span><a href="#Retrieving-the-Transcript-of-a-Specific-Document" data-toc-modified-id="Retrieving-the-Transcript-of-a-Specific-Document-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Retrieving the Transcript of a Specific Document</a></span></li></ul></li></ul></div>

## Exploring the Banks Collection Through AdLib API

Through the AdLib API, we can traverse the Banks collection database and obtain the IE (intellectual entity) ID of each document.

We can then use the Rosetta API to download the file contents of the documents through their IE IDs.

In [1]:
# Importing relevant libraries

import numpy as np
import pandas as pd
import requests
import IPython.display as Disp
import xml.etree.ElementTree as ET
from io import StringIO
from lxml import etree, objectify
from xmljson import yahoo
import collections
import json

Knowing the 'priref' identification of the Banks collection, we can make RESTful queries to the AdLib API endpoint.

Here, 'resp' holds the database entry for the collection itself, and 'resp1' holds the database entries for the series within the collection.

In [2]:
banks_priref = 110311728
archive_url = "http://oai-archival.sl.nsw.gov.au/oaix_primo/wwwopac.ashx"

search_collection = 'priref='+str(banks_priref)
resp = requests.get(archive_url,params={'database':'archive', 'search':search_collection, 'limit':999})

search_series = 'part_of_reference.lref='+str(banks_priref)
resp1 = requests.get(archive_url,params={'database':'archive', 'search':search_series, 'limit':999})
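
The two requests above differ only in their `search` expression. As a minimal sketch, the standard library can show the URL such a query resolves to without sending anything. The helper name `adlib_query_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

ARCHIVE_URL = "http://oai-archival.sl.nsw.gov.au/oaix_primo/wwwopac.ashx"

def adlib_query_url(priref, children=False, limit=999):
    """Build the query URL for a record, or for its child records.

    Hypothetical helper: it mirrors the two `search` expressions used above
    ('priref=...' and 'part_of_reference.lref=...') and sends no request.
    """
    field = 'part_of_reference.lref' if children else 'priref'
    params = {'database': 'archive', 'search': f'{field}={priref}', 'limit': limit}
    return ARCHIVE_URL + '?' + urlencode(params)

collection_url = adlib_query_url(110311728)               # the collection record
series_url = adlib_query_url(110311728, children=True)    # its child series
print(collection_url)
print(series_url)
```

Printing the URL before sending it is a cheap way to check that the `=` inside the search expression is percent-encoded rather than read as a second parameter.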

In [3]:
# Function for converting XML responses to JSON

def xml_json(data, remove_ns=True, preserve_root=False, encoding='utf-8') -> dict:
    """Convert an XML string (or an already-parsed lxml object) to a dict."""
    if isinstance(data, str):
        if remove_ns:
            # Parse incrementally and drop the '{namespace}' prefix from every tag
            xml_data = ET.iterparse(StringIO(data))
            for _, el in xml_data:
                if '}' in el.tag:
                    el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
            data = ET.tostring(xml_data.root, encoding=encoding).decode(encoding)
        encoded_data = data.encode(encoding)
        # noinspection PyArgumentList
        parser = etree.XMLParser(encoding=encoding, recover=False, huge_tree=True)
        xml_data = objectify.fromstring(encoded_data, parser=parser)
    else:
        xml_data = data
    # Map the XML tree to nested dicts using the Yahoo convention
    json_data = yahoo.data(xml_data)
    # Unless asked otherwise, unwrap the single root element
    if isinstance(json_data, collections.OrderedDict) and not preserve_root:
        json_data = json_data.get(list(json_data.keys())[0])
    return json_data
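
The namespace-stripping step inside `xml_json` can be tried in isolation. The sketch below uses only the standard library and a made-up XML snippet; the `urn:example` namespace and the record content are assumptions for illustration, not real AdLib output.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Made-up sample document with a default namespace (illustrative only)
sample_xml = (
    '<adlibXML xmlns="urn:example">'
    '<recordList><record><priref>1</priref></record></recordList>'
    '</adlibXML>'
)

def strip_namespaces(xml_text):
    """Parse XML text and remove the '{namespace}' prefix from every tag."""
    parser = ET.iterparse(StringIO(xml_text))
    for _, element in parser:          # default event is 'end'
        if '}' in element.tag:
            element.tag = element.tag.split('}', 1)[1]
    return parser.root                 # available once parsing has finished

root = strip_namespaces(sample_xml)
print(root.tag)                                     # tag without namespace
print(root.find('recordList/record/priref').text)   # plain path lookup now works
```

Without this step, every tag would carry the namespace in braces and every lookup path would have to repeat it.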

We will now convert 'resp' from XML to JSON.

In [4]:
banks = xml_json(resp.text)
banks_json = json.dumps([banks])
banks_dict = json.loads(banks_json)

In [6]:
banks_collection_info = {
    'priref' : banks_dict[0]['recordList']['record']['priref'],
    'Title' : banks_dict[0]['recordList']['record']['Title']['title']['value']['content'],
    'contents' : banks_dict[0]['recordList']['record']['content.description']
}
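
Chained lookups like the ones above raise a `KeyError` as soon as one level is missing. A small, hypothetical helper (`dig` is not part of the notebook) can walk the same path and fall back to a default instead. The sample record is invented and only mimics the key path used above.

```python
def dig(data, *keys, default=None):
    """Follow a path of dict keys / list indexes, returning `default` if it breaks."""
    current = data
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif (isinstance(current, list) and isinstance(key, int)
              and -len(current) <= key < len(current)):
            current = current[key]
        else:
            return default
    return current

# Invented record that mimics the structure accessed above
sample_dict = [{
    'recordList': {
        'record': {
            'priref': '110311728',
            'Title': {'title': {'value': {'content': 'Example title'}}},
        }
    }
}]

title = dig(sample_dict, 0, 'recordList', 'record', 'Title', 'title', 'value', 'content')
missing = dig(sample_dict, 0, 'recordList', 'record', 'content.description', default='')
print(title, repr(missing))
```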

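Printing the 'contents' field shows the collection's description, in which the titles of all 95 series within the Banks collection can be seen. The output shown below is truncated.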

In [8]:
print(banks_collection_info['contents'])

Joseph Banks was born in London on 13 February 1743. He was educated at Harrow and Eton Schools and Oxford University. In 1767 he was elected a Fellow of the Royal Society. In 1778 he was elected President of the Royal Society, a position which he held until his death in 1820. 
He successfully lobbied the Royal Society to be included on James Cook's first Pacific voyage on board the Endeavour (1768-1771). Following his return from the Pacific, Banks was actively involved in almost every aspect of Pacific exploration and early Australian colonial life. He actively supported the proposal of Botany Bay in New South Wales as a site for British settlement, founded on 26 January 1788. He corresponded with the early governors of New South Wales and recommended William Bligh to succeed Philip Gidley King as fourth Governor. 
Banks sent botanists to all parts of the known world, often at his own expense. He organised the breadfruit voyages under the command of William Bligh (1787-1789 and 1791-

Now, we will convert resp1 from XML to JSON, and obtain the prirefs of all the series within the Banks collection.

In [9]:
j = xml_json(resp1.text)
banks_series_json = json.dumps([j])
banks_series_dict = json.loads(banks_series_json)

In [10]:
banks_series = []
for f in banks_series_dict[0]['recordList']['record']:
    ser = f['Title']['title']['value']['content'].split(':')[0]
    s = [s1 for s1 in ser if s1.isdigit()]
    serie = ''.join(s)
    d = {
        'Series':int(serie),
        'priref':int(f['priref'])
    }
    banks_series.append(d)
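
The digit-joining step above can also be written with a regular expression. This is a sketch of an alternative, not the notebook's method: it takes the first run of digits before the colon, and the sample titles are invented for illustration.

```python
import re

def series_number(title):
    """Return the first integer found before the first ':' in a title, or None."""
    label = title.split(':')[0]
    match = re.search(r'\d+', label)
    return int(match.group()) if match else None

# Invented titles, shaped like 'Series NN: description'
print(series_number('Series 03: Example title'))
print(series_number('Untitled papers'))
```

Returning `None` for a title without digits avoids the `ValueError` that `int('')` would raise in the loop above.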

In [11]:
banks_series =  sorted(banks_series, key = lambda i: i['Series'])
banks_series

[{'Series': 1, 'priref': 110355936},
 {'Series': 2, 'priref': 110374173},
 {'Series': 3, 'priref': 110327948},
 {'Series': 4, 'priref': 110374174},
 {'Series': 5, 'priref': 110374176},
 {'Series': 6, 'priref': 110578133},
 {'Series': 7, 'priref': 110374185},
 {'Series': 8, 'priref': 110374192},
 {'Series': 9, 'priref': 110374203},
 {'Series': 10, 'priref': 110374204},
 {'Series': 11, 'priref': 110374225},
 {'Series': 12, 'priref': 110374230},
 {'Series': 13, 'priref': 110374265},
 {'Series': 14, 'priref': 110374337},
 {'Series': 15, 'priref': 110374398},
 {'Series': 16, 'priref': 110374399},
 {'Series': 17, 'priref': 110374400},
 {'Series': 18, 'priref': 110374402},
 {'Series': 19, 'priref': 110374404},
 {'Series': 20, 'priref': 110374437},
 {'Series': 21, 'priref': 110374438},
 {'Series': 22, 'priref': 110374439},
 {'Series': 23, 'priref': 110374440},
 {'Series': 24, 'priref': 110374441},
 {'Series': 25, 'priref': 110374443},
 {'Series': 26, 'priref': 110374444},
 {'Series': 27, 'prir

### Exploring a Series in the Banks Collection

In this example, we take the priref of Series 3 (110327948, from the list above) and send a query to the AdLib API.

In [12]:
series_priref = 110327948
archive_url = "http://oai-archival.sl.nsw.gov.au/oaix_primo/wwwopac.ashx"

search_collection = 'priref='+str(series_priref)
resp = requests.get(archive_url,params={'database':'archive', 'search':search_collection, 'limit':999})

search_series = 'part_of_reference.lref='+str(series_priref)
resp1 = requests.get(archive_url,params={'database':'archive', 'search':search_series, 'limit':999})

In [13]:
j = xml_json(resp.text)
series_json = json.dumps([j])
series_dict = json.loads(series_json)

The IE IDs of the documents, along with the FL IDs of the pages of each document, can be seen in series_child_dict, as defined below:

In [24]:
j = xml_json(resp1.text)
series_child_json = json.dumps([j])
series_child_dict = json.loads(series_child_json)

For example, the first reproduction entry of the first record pairs a page FL ID with its document IE ID:

In [25]:
series_child_dict[0]['recordList']['record'][0]['Reproduction'][0]

{'reproduction.reference': 'FL3327997',
 'reproduction.Rosetta.intellectual_entity': 'IE3327995'}
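
To gather every page of every document, the same two keys can be read in a loop. The sketch below assumes each record carries a 'Reproduction' entry that is either a list of such dicts or a single dict. The helper name and the sample records are hypothetical, apart from the one FL/IE pair shown above.

```python
from collections import defaultdict

def pages_by_document(records):
    """Map each document IE ID to the list of its page FL IDs."""
    pages = defaultdict(list)
    for record in records:
        reproductions = record.get('Reproduction', [])
        if isinstance(reproductions, dict):   # assumed: a lone entry may not be wrapped in a list
            reproductions = [reproductions]
        for item in reproductions:
            ie_id = item.get('reproduction.Rosetta.intellectual_entity')
            fl_id = item.get('reproduction.reference')
            if ie_id and fl_id:
                pages[ie_id].append(fl_id)
    return dict(pages)

# Sample records: only the first FL/IE pair comes from the output above
sample_records = [
    {'Reproduction': [
        {'reproduction.reference': 'FL3327997',
         'reproduction.Rosetta.intellectual_entity': 'IE3327995'},
        {'reproduction.reference': 'FL-EXAMPLE-2',
         'reproduction.Rosetta.intellectual_entity': 'IE3327995'},
    ]},
    {'Reproduction': {'reproduction.reference': 'FL-EXAMPLE-3',
                      'reproduction.Rosetta.intellectual_entity': 'IE-EXAMPLE'}},
]

page_map = pages_by_document(sample_records)
print(page_map)
```

In the notebook, the argument would be `series_child_dict[0]['recordList']['record']`.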

## Exploring the Banks Collection Through the Transcription Tool API

With the transcription tool API, we can obtain the transcripts of the papers in the Banks collection.

### Exploring the Collections in the Transcription Tool

We will find the Banks collection among the other collections available for transcription.



We will enter login credentials for the transcription tool API and request the list of collections in the database (stored there as nodes of type 'project').

In [27]:
from requests.auth import HTTPBasicAuth
import getpass
        
user = getpass.getpass('User: ', stream=None) 
password = getpass.getpass(prompt='Password: ', stream=None)

coll_json = requests.get(
    'https://transcripts.sl.nsw.gov.au/node.json',
    auth=HTTPBasicAuth(user, password),params={'type':'project','status':1}
    )
coll_json

User: ········
Password: ········


<Response [200]>

In [28]:
collj = json.loads(coll_json.text)

In [29]:
hh = [f['field_canonical_id'] for f in collj['list']]

As the output below shows, the first entry in the collection list is the Banks collection, so we read its node ID from that entry.

In [31]:
hh

['Banks', 'Ind.Langs.', 'WW1', 'Cook', 'Macarthur', 'ww1_internal', 'Hassall']

In [32]:
collection_node_id = int(collj['list'][0]['nid'])
collection_node_id

1
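
Indexing with `[0]` relies on the Banks collection staying first in the list. A sketch of a lookup by canonical ID is given below. The helper name is hypothetical, and the sample list is invented except for the 'Banks' entry and its node ID of 1.

```python
def node_id_by_canonical_id(nodes, canonical_id):
    """Return the integer node ID of the entry whose canonical ID matches."""
    for node in nodes:
        if node.get('field_canonical_id') == canonical_id:
            return int(node['nid'])
    raise KeyError(f'no node with canonical id {canonical_id!r}')

# Stand-in for collj['list']; the 'Cook' node ID is made up
sample_nodes = [
    {'field_canonical_id': 'Banks', 'nid': '1'},
    {'field_canonical_id': 'Cook', 'nid': '999'},
]

banks_node_id = node_id_by_canonical_id(sample_nodes, 'Banks')
print(banks_node_id)
```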

### Exploring the Series Within the Banks Collection

Now that we know the node ID of the Banks collection, we can make a subsequent query to retrieve a list of the series within it.

In [33]:
series_json = requests.get(
    'https://transcripts.sl.nsw.gov.au/node.json',
    auth=HTTPBasicAuth(user, password),params={'type':'collection','field_tt_project':collection_node_id,'status':1}
    )
series_json

<Response [200]>
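
The queries in this section all call the same `node.json` endpoint with different filters, so they could share one helper. The sketch below is an assumption about how that might look, not the notebook's code: `fetch_nodes` is a hypothetical name, and a stub session with a canned, illustrative payload stands in for the real service so the example runs offline.

```python
NODE_URL = 'https://transcripts.sl.nsw.gov.au/node.json'

def fetch_nodes(session, node_type, **filters):
    """Request published nodes of one type and return the 'list' part of the reply."""
    params = {'type': node_type, 'status': 1, **filters}
    response = session.get(NODE_URL, params=params, timeout=30)
    response.raise_for_status()        # fail early on HTTP errors
    return response.json()['list']

class StubResponse:
    """Stand-in for a requests response; returns a canned payload."""
    def __init__(self, payload):
        self._payload = payload
    def raise_for_status(self):
        pass
    def json(self):
        return self._payload

class StubSession:
    """Stand-in for requests.Session that records the parameters it receives."""
    def __init__(self):
        self.calls = []
    def get(self, url, params=None, timeout=None):
        self.calls.append((url, params))
        # Illustrative payload, not real API output
        return StubResponse({'list': [{'nid': '131366'}]})

stub = StubSession()
nodes = fetch_nodes(stub, 'collection', field_tt_project=1)
print(stub.calls[0][1])
print(nodes)
```

Against the real service, a `requests.Session()` whose `auth` attribute is set to `HTTPBasicAuth(user, password)` could be passed in place of the stub.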

In [34]:
seriesj = json.loads(series_json.text)

Here, <b>seriesj['list']</b> represents a list of the series within this collection. Iterating through the items in this list shows the name of each series, along with other metadata. For our purpose, we only need the node IDs of the series.

In [35]:
series_node = []
for f in seriesj['list']:
    series_name = f['field_canonical_id']
    s = [s1 for s1 in series_name if s1.isdigit()]
    s_id = int(''.join(s))
    d = {
        'series': s_id,
        'node_id': int(f['nid'])
    }
    series_node.append(d)
series_node = sorted(series_node, key = lambda i: i['series']) 

In [36]:
series_node

[{'series': 1, 'node_id': 131366},
 {'series': 2, 'node_id': 131365},
 {'series': 4, 'node_id': 131363},
 {'series': 5, 'node_id': 131362},
 {'series': 6, 'node_id': 131361},
 {'series': 7, 'node_id': 131360},
 {'series': 8, 'node_id': 131359},
 {'series': 9, 'node_id': 131358},
 {'series': 10, 'node_id': 131357},
 {'series': 11, 'node_id': 131356},
 {'series': 12, 'node_id': 131355},
 {'series': 13, 'node_id': 131354},
 {'series': 14, 'node_id': 131353},
 {'series': 15, 'node_id': 131352},
 {'series': 16, 'node_id': 131351},
 {'series': 17, 'node_id': 131350},
 {'series': 18, 'node_id': 131349},
 {'series': 19, 'node_id': 131348},
 {'series': 20, 'node_id': 131347},
 {'series': 21, 'node_id': 131346},
 {'series': 22, 'node_id': 131345},
 {'series': 23, 'node_id': 131344},
 {'series': 24, 'node_id': 131343},
 {'series': 25, 'node_id': 131342},
 {'series': 26, 'node_id': 131341},
 {'series': 27, 'node_id': 131340},
 {'series': 28, 'node_id': 131339},
 {'series': 29, 'node_id': 131338},
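
A dictionary keyed by series number makes it easy to look up a node ID and to notice gaps. In the output above, for instance, no entry appears for series 3. The sketch below uses the first three entries of that output as sample data; the variable names are hypothetical.

```python
# First three entries copied from the output above
series_node_sample = [
    {'series': 1, 'node_id': 131366},
    {'series': 2, 'node_id': 131365},
    {'series': 4, 'node_id': 131363},
]

# Index the list by series number for direct lookups
node_by_series = {entry['series']: entry['node_id'] for entry in series_node_sample}

print(node_by_series.get(4))   # node ID of series 4
print(node_by_series.get(3))   # None: series 3 is not in the list
```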


### Exploring a Specific Series 

Now, let's take note of the node ID for series 95 (131272, used in the query below) and explore its contents.

In [37]:
doc_json = requests.get(
    'https://transcripts.sl.nsw.gov.au/node.json',
    auth=HTTPBasicAuth(user, password),params={'type':'document','field_tt_collection':131272,'status':1}
    )
doc_json

<Response [200]>

In [38]:
docj = json.loads(doc_json.text)

Here, <b>docj['list']</b> represents a list of the documents within this series. Iterating through the items in this list shows the name of each document, the number of pages transcribed, and so on. For our purpose, we only need the node IDs of the documents in this series.

In [68]:
docs_node = []
for f in docj['list']:
    d = {
        'document': int(f['field_canonical_id'].split('.')[1]),
        'node_id': int(f['nid']),
        'pagecount': int(f['field_total_completed_pages'])
    }
    docs_node.append(d)
docs_node = sorted(docs_node, key = lambda i: i['document']) 

In [69]:
docs_node

[{'document': 1, 'node_id': 71376, 'pagecount': 2},
 {'document': 2, 'node_id': 71378, 'pagecount': 2},
 {'document': 3, 'node_id': 71380, 'pagecount': 3},
 {'document': 4, 'node_id': 71382, 'pagecount': 3},
 {'document': 5, 'node_id': 71384, 'pagecount': 2},
 {'document': 6, 'node_id': 71386, 'pagecount': 3},
 {'document': 7, 'node_id': 71388, 'pagecount': 3},
 {'document': 8, 'node_id': 71390, 'pagecount': 2},
 {'document': 9, 'node_id': 71392, 'pagecount': 9},
 {'document': 10, 'node_id': 71394, 'pagecount': 1}]
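
As a small worked example, the completed page counts above can be summed to see how much transcribed text the series holds. The list below is copied from the output above; the variable names are hypothetical.

```python
# Copied from the docs_node output above
docs_node_sample = [
    {'document': 1, 'node_id': 71376, 'pagecount': 2},
    {'document': 2, 'node_id': 71378, 'pagecount': 2},
    {'document': 3, 'node_id': 71380, 'pagecount': 3},
    {'document': 4, 'node_id': 71382, 'pagecount': 3},
    {'document': 5, 'node_id': 71384, 'pagecount': 2},
    {'document': 6, 'node_id': 71386, 'pagecount': 3},
    {'document': 7, 'node_id': 71388, 'pagecount': 3},
    {'document': 8, 'node_id': 71390, 'pagecount': 2},
    {'document': 9, 'node_id': 71392, 'pagecount': 9},
    {'document': 10, 'node_id': 71394, 'pagecount': 1},
]

total_pages = sum(doc['pagecount'] for doc in docs_node_sample)
longest = max(docs_node_sample, key=lambda doc: doc['pagecount'])

print(total_pages)            # 30 completed pages across the ten documents
print(longest['document'])    # 9, the document with the most completed pages
```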

### Retrieving the Transcript of a Specific Document

Now, we will explore a document in this series, and obtain the transcriptions of the pages within it.

In [73]:
doctitle = docj['list'][3]['body']['value']
print(doctitle)

Letter received by Dorothea, Lady Banks from the Dutch Ambassador,Baron Fugel, 3 November 1818 (Series 95.04)
Created by: Fugel, Baron


In [75]:
nid = int(docj['list'][3]['nid'])
nid

71382

In [76]:
pages_json = requests.get(
    'https://transcripts.sl.nsw.gov.au/node.json',
    auth=HTTPBasicAuth(user, password),params={'type':'page','field_document':nid,'status':1}
    )
pages_json

<Response [200]>

In [77]:
pages = json.loads(pages_json.text)

In [78]:
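import re
document_text = ''
for page in pages['list']:
    # The page transcript is read from the node's description metatag
    page_text = page['metatag']['value']['description']
    # Remove anything enclosed in square brackets
    page_text = re.sub(r"\[.*?\]", "", page_text)
    document_text += page_text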

In [79]:
import unicodedata
new_str = unicodedata.normalize("NFKD", document_text)
print(new_str.lstrip())

Lady Banks Soho Square Whitehall pl. 3. Nov. 1818 Dear Madam, I have just received a Letter from HRH the Princess Dowager of Orange containing the following passage which I beg your Ladyship will have the goodness to communicate to Sir Joseph.  interested in his health at this moment from a motive which is a little Selfish as I look forward to his complete re-establishment for the honor of Your Ladyship & his Company to Dinner the promise of which gives me and my Brother the greatest pleasure. I have the honor to be, Dear Madam Your Ladyship's most obedient faithful Servant H Fugel
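
The two clean-up steps used above, removing bracketed text and applying NFKD normalisation, can be combined in one function. This is a sketch on an invented sample string, and `clean_transcript` is a hypothetical name.

```python
import re
import unicodedata

def clean_transcript(text):
    """Drop [bracketed] passages, then fold compatibility characters such as
    non-breaking spaces into their plain equivalents."""
    without_brackets = re.sub(r'\[.*?\]', '', text)
    return unicodedata.normalize('NFKD', without_brackets).strip()

# Invented sample containing a bracketed note and a non-breaking space (\u00a0)
sample_text = ' Dear Madam, [indecipherable] I have\u00a0just received a Letter '
cleaned = clean_transcript(sample_text)
print(repr(cleaned))
```

Removing a bracketed passage leaves the spaces on both sides of it behind, so a doubled space can remain in the cleaned text.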
