<center> <h1>Fetching the ALTOs for a Specific Book Through Rosetta API</h1> </center>

In this example, we use pandas to read an Excel file into a dataframe, look up IE IDs in it, and retrieve the ALTO files for a specific IE ID.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-relevant-libraries" data-toc-modified-id="Importing-relevant-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing relevant libraries</a></span></li><li><span><a href="#Reading-in-the-excel-file" data-toc-modified-id="Reading-in-the-excel-file-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reading in the excel file</a></span></li><li><span><a href="#Obtaining-the-METS-file-for-a-specific-IE-ID" data-toc-modified-id="Obtaining-the-METS-file-for-a-specific-IE-ID-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Obtaining the METS file for a specific IE ID</a></span><ul class="toc-item"><li><span><a href="#Setting-API-variables" data-toc-modified-id="Setting-API-variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Setting API variables</a></span></li><li><span><a href="#Creating-a-Rosetta-object-from-utils.rosetta-and-authenticating-through-HTTP." data-toc-modified-id="Creating-a-Rosetta-object-from-utils.rosetta-and-authenticating-through-HTTP.-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Creating a Rosetta object from utils.rosetta and authenticating through HTTP.</a></span></li><li><span><a href="#Creating-a-SOAP-client-object-using-Zeep's-Client-class." 
data-toc-modified-id="Creating-a-SOAP-client-object-using-Zeep's-Client-class.-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Creating a SOAP client object using Zeep's Client class.</a></span></li><li><span><a href="#Retrieving-the-pdsHandle-for-issuing-requests" data-toc-modified-id="Retrieving-the-pdsHandle-for-issuing-requests-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Retrieving the pdsHandle for issuing requests</a></span></li><li><span><a href="#Accessing-the-getIE-method" data-toc-modified-id="Accessing-the-getIE-method-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Accessing the getIE method</a></span></li></ul></li><li><span><a href="#Exploring-the-METS-file" data-toc-modified-id="Exploring-the-METS-file-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Exploring the METS file</a></span><ul class="toc-item"><li><span><a href="#Converting-the-METS-xml-data-to-json" data-toc-modified-id="Converting-the-METS-xml-data-to-json-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Converting the METS xml data to json</a></span></li><li><span><a href="#Exploring-the-METS-json-through-pandas-dataframes" data-toc-modified-id="Exploring-the-METS-json-through-pandas-dataframes-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Exploring the METS json through pandas dataframes</a></span></li></ul></li><li><span><a href="#Retrieving-the-ALTOs-for-a-specific-book" data-toc-modified-id="Retrieving-the-ALTOs-for-a-specific-book-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Retrieving the ALTOs for a specific book</a></span></li></ul></div>

## Importing relevant libraries

We import NumPy and pandas to read the Excel file and build the dataframe. The Excel file is located in the same directory as the notebook.

The requests library issues HTTP requests to the Rosetta API and, later in the notebook, downloads the ALTO files. The zeep classes Client and Transport are used to create an authenticated SOAP client object. The Rosetta helper class is imported from the utils directory.

In [1]:
import numpy as np
import pandas as pd
import requests
import collections
import IPython.display as Disp
import xml.etree.ElementTree as ET
from io import StringIO
from lxml import etree, objectify
from xmljson import yahoo
import os

from requests.auth import HTTPBasicAuth  # or HTTPDigestAuth, or OAuth1, etc.
from zeep.transports import Transport
from zeep import Client
from utils.rosetta import Rosetta  # this is a rosetta helper class from the SL github

## Reading in the excel file

We will create a dataframe "df" which we will use to fetch specific ALTO files.


In [2]:
df = pd.read_excel('ALTO_IEs.xlsx', sheet_name='Query List 2019-09-23 10.31.03')
df = df[["IE PID", "MMSIDs", "Barcodes","Title (DC)"]]

In [3]:
df.head() # Displaying first 5 rows

Unnamed: 0,IE PID,MMSIDs,Barcodes,Title (DC)
0,IE14008261,#991000259739702626,2147751.0,"Statutes in force in the colony of Queensland,..."
1,IE6230053,#991000259739702626,2147752.0,"Statutes in force in the colony of Queensland,..."
2,IE6240045,#991000259739702626,2147750.0,"Statutes in force in the colony of Queensland,..."
3,IE4783007,#991000317789702626,1665038.0,Pioneer work in the alps of New Zealand : a re...
4,IE4877439,#991000317789702626,2128780.0,Pioneer work in the alps of New Zealand : a re...


## Obtaining the METS file for a specific IE ID

We will use the Rosetta API to obtain the METS file for a specific IE PID. 

The METS file will include the latest revisions of all the Representations, including the Derivative Copies.

### Setting API variables

In [4]:
api_endpoint = 'http://digital.sl.nsw.gov.au'
api_pds_endpoint = 'https://libprd70.sl.nsw.gov.au/pds'
api_sru_endpoint = 'http://digital.sl.nsw.gov.au/search/permanent/sru'

ws_url = api_endpoint + '/dpsws/repository/IEWebServices?wsdl'

api_username = 'xxxxxxx'
api_password = 'xxxxxxxxxxx'
api_institude_code = 'SLNSW'

### Creating a Rosetta object from the Rosetta helper file and authenticating it.

In [5]:
ros = Rosetta(api_endpoint, api_pds_endpoint, api_sru_endpoint, api_username, api_password, api_institude_code, api_timeout=1200)
ros.session.auth = HTTPBasicAuth(api_username, api_password)

### Creating a SOAP client object using Zeep's Client class.

In [6]:
transport = Transport(session=ros.session, timeout=ros.api_timeout, operation_timeout=ros.api_timeout)
client = Client(ws_url, transport=transport, plugins=[ros.client_history])

### Requesting the METS File

Here, we retrieve the METS through the getIE method.

In [7]:
pdsHandle = ros.get_pds_handle()

In [8]:
IE_PID = df["IE PID"][12]
r = client.service.getIE(pdsHandle,IE_PID,2)

In [9]:
root = ET.fromstring(r)      # parse the METS XML string returned by getIE
tree = ET.ElementTree(root)  # wrap the root without relying on the private _setroot()
tree.write("mets.xml")
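To get a quick feel for what a METS document contains, one option is to list each file group (representation) together with the file IDs under it. The sketch below is only an illustration: it runs on a small hand-written METS fragment rather than the real `mets.xml`, the IDs `REP1`, `REP2` and `FL1` to `FL3` are invented, and `list_file_groups` is a hypothetical helper that is not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written METS fragment for illustration only; all IDs are invented.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp ID="REP1">
      <mets:file ID="FL1"/>
      <mets:file ID="FL2"/>
    </mets:fileGrp>
    <mets:fileGrp ID="REP2">
      <mets:file ID="FL3"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def list_file_groups(mets_xml):
    """Map each fileGrp ID to the list of file IDs it contains."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.findall(".//mets:fileGrp", METS_NS):
        groups[grp.get("ID")] = [f.get("ID") for f in grp.findall("mets:file", METS_NS)]
    return groups


file_groups = list_file_groups(SAMPLE_METS)
print(file_groups)
```

Assuming the string returned by getIE uses the standard METS namespace, calling `list_file_groups(r)` on it should give the same kind of overview for the real book.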

## Exploring the METS file

### Converting the METS xml data to json

We will output the equivalent 'mets.json' file, which we will use to create pandas dataframes.

In [10]:
def xml_json(data, remove_ns=True, preserve_root=False, encoding='utf-8') -> dict:
    """Convert an XML string (or an lxml objectify tree) to a dict using the Yahoo convention."""
    if isinstance(data, str):
        if remove_ns:
            # Parse incrementally and drop the '{namespace}' prefix from every tag.
            xml_data = ET.iterparse(StringIO(data))
            for _, el in xml_data:
                if '}' in el.tag:
                    el.tag = el.tag.split('}', 1)[1]
            # .root is available once the iterator has been exhausted.
            data = ET.tostring(xml_data.root, encoding=encoding).decode(encoding)
        encoded_data = data.encode(encoding)
        # noinspection PyArgumentList
        parser = etree.XMLParser(encoding=encoding, recover=False, huge_tree=True)
        xml_data = objectify.fromstring(encoded_data, parser=parser)
    else:
        xml_data = data
    json_data = yahoo.data(xml_data)
    # Unless the caller asks to keep it, unwrap the single root key.
    if isinstance(json_data, dict) and not preserve_root:
        json_data = json_data.get(list(json_data.keys())[0])
    return json_data
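The namespace-stripping step inside `xml_json` can be tried in isolation with the standard library alone. The following is a minimal sketch on a made-up two-element fragment; `strip_namespaces` is a hypothetical helper, not part of the notebook or of xmljson.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(xml_text):
    """Return the parsed root with '{uri}' prefixes removed from every tag."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # ElementTree stores qualified tags as '{namespace-uri}localname'.
        if isinstance(el.tag, str) and "}" in el.tag:
            el.tag = el.tag.split("}", 1)[1]
    return root


# Made-up fragment for illustration only.
sample = '<mets:mets xmlns:mets="http://www.loc.gov/METS/"><mets:structMap ID="S1"/></mets:mets>'
root = strip_namespaces(sample)
tags = [el.tag for el in root.iter()]
print(tags)
```

Note that this only rewrites element tags. Namespaced attribute names, if any, are left untouched, which is also true of the loop in `xml_json`.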

In [11]:
j=xml_json(r)

In [12]:
import json
with open('mets.json', 'w') as f:
    # Wrap the dict in a list so that pd.read_json yields a one-row dataframe.
    json.dump([j], f)

### Exploring the METS json through pandas

We will create a dataframe to improve our visualisation of the METS file contents.

In [13]:
json_file = 'mets.json'
json1 = pd.read_json(json_file)

In [15]:
table_of_contents = json1['structMap'][0][0]

The file ID under the label 'Table of Contents' fetches the entire book in PDF format.

We will store the list of ALTO file IDs in 'altos_list' and the list of image file IDs in 'image_list'.

In [16]:
structMap = json1['structMap'][0]

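# Pick the structMap entry for the ALTO transcripts and walk down to its per-page divs.
altos = list(filter(lambda file: file['div']['LABEL'] == 'Dynamically Linked Transcript', structMap))[0]['div']['div']['div']
# Each page div points at one ALTO file through fptr/FILEID.
altos_list = [page['fptr']['FILEID'] for page in altos]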

In [17]:
altos_list[:5] # Displaying first 5 items

['FL4837857', 'FL4837858', 'FL4837859', 'FL4837860', 'FL4837861']
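The filter-and-index expression above depends on how deeply the structMap JSON is nested. The sketch below spells out the same walk as a small function and runs it on a hand-built list that mimics that nesting; the sample data and the helper name `file_ids_for_label` are assumptions made for illustration, and the real structMap carries many more keys.

```python
def file_ids_for_label(struct_maps, label):
    """Collect FILEIDs from the first structMap entry whose top-level div has the given LABEL."""
    for entry in struct_maps:
        if entry["div"]["LABEL"] == label:
            pages = entry["div"]["div"]["div"]
            # Assumption: a lone child may arrive as a dict rather than a one-item list.
            if isinstance(pages, dict):
                pages = [pages]
            return [page["fptr"]["FILEID"] for page in pages]
    return []


# Hand-built stand-in for json1['structMap'][0]; the IDs are invented.
sample_struct_maps = [
    {"div": {"LABEL": "Derivative Copy",
             "div": {"div": [{"fptr": {"FILEID": "FL10"}}]}}},
    {"div": {"LABEL": "Dynamically Linked Transcript",
             "div": {"div": [{"fptr": {"FILEID": "FL1"}},
                             {"fptr": {"FILEID": "FL2"}}]}}},
]

alto_ids = file_ids_for_label(sample_struct_maps, "Dynamically Linked Transcript")
print(alto_ids)
```

This version returns the first matching entry and an empty list when no label matches, so it would need an extra index argument to select a later entry with the same label.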

In [20]:
structMap = json1['structMap'][0]
# Several entries are labelled 'Derivative Copy'; index [1] selects the second one.
images = list(filter(lambda file: file['div']['LABEL'] == 'Derivative Copy', structMap))[1]['div']['div']['div']
image_list = [page['fptr']['FILEID'] for page in images]

In [21]:
image_list[:5] # Displaying first 5 items

['FL4838053', 'FL4838054', 'FL4838055', 'FL4838056', 'FL4838057']

## Saving the ALTOs for a specific book to file.

In [22]:
def altopull(flid):
    """Request one file from the delivery servlet by its FL ID."""
    delivery_url = "http://digital.sl.nsw.gov.au/delivery/DeliveryManagerServlet"
    resp = requests.get(delivery_url, params={'dps_pid': flid, 'dps_func': 'stream'})
    resp.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    return resp

def write_alto(r, flid):
    """Parse the response body as XML and save it as <IE_PID>/<flid>.xml."""
    root = ET.fromstring(r.text)
    tree = ET.ElementTree(root)
    os.makedirs(IE_PID, exist_ok=True)  # create the output folder on first use
    filename = os.path.join(IE_PID, flid + ".xml")
    tree.write(filename)
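To see exactly which URL `altopull` requests for a given file ID, the query string can be built by hand with the standard library. This is a small sketch for inspection only; `delivery_url_for` is a hypothetical helper, and the notebook itself leaves the URL construction to requests.

```python
from urllib.parse import urlencode

DELIVERY_URL = "http://digital.sl.nsw.gov.au/delivery/DeliveryManagerServlet"


def delivery_url_for(flid):
    """Build the full GET URL that streams one file by its FL ID."""
    query = urlencode({"dps_pid": flid, "dps_func": "stream"})
    return DELIVERY_URL + "?" + query


url = delivery_url_for("FL4837857")
print(url)
```

Pasting the printed URL into a browser is a quick way to check a single file ID before downloading the whole list.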

In [24]:
for i in altos_list:
    r = altopull(i)
    write_alto(r,i)