# e-rara: accessing metadata and fulltexts

<h1><span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#0-Introduction" data-toc-modified-id="0-Introduction-1">0 Introduction</a></span><ul class="toc-item"><li><span><a href="#0.0-Scope-and-content" data-toc-modified-id="0.0-Scope-and-content-1.1">0.0 Scope and content</a></span></li><li><span><a href="#0.1-E-rara" data-toc-modified-id="0.1-E-rara-1.2">0.1 E-rara</a></span></li><li><span><a href="#0.2-OAI-PMH" data-toc-modified-id="0.2-OAI-PMH-1.3">0.2 OAI-PMH</a></span></li></ul></li><li><span><a href="#1-Metadata-access-with-Polymatheia" data-toc-modified-id="1-Metadata-access-with-Polymatheia-2">1 Metadata access with Polymatheia</a></span><ul class="toc-item"><li><span><a href="#1.0-Prerequisites" data-toc-modified-id="1.0-Prerequisites-2.1">1.0 Prerequisites</a></span></li><li><span><a href="#1.1-Start-with-the-OAI-interface-via-Polymatheia" data-toc-modified-id="1.1-Start-with-the-OAI-interface-via-Polymatheia-2.2">1.1 Start with the OAI interface via Polymatheia</a></span></li><li><span><a href="#1.2-Retrieve-metadata-records-via-Polymatheia" data-toc-modified-id="1.2-Retrieve-metadata-records-via-Polymatheia-2.3">1.2 Retrieve metadata records via Polymatheia</a></span></li><li><span><a href="#1.3-Retrieve-the-licences" data-toc-modified-id="1.3-Retrieve-the-licences-2.4">1.3 Retrieve the licences</a></span></li><li><span><a href="#1.4-Save--and-recover-complex-metadata-structures" data-toc-modified-id="1.4-Save--and-recover-complex-metadata-structures-2.5">1.4 Save  and recover complex metadata structures</a></span></li></ul></li><li><span><a href="#2-Direct-metadata-access-via-OAI-PMH" data-toc-modified-id="2-Direct-metadata-access-via-OAI-PMH-3">2 Direct metadata access via OAI-PMH</a></span><ul class="toc-item"><li><span><a href="#2.0-Prerequisites" data-toc-modified-id="2.0-Prerequisites-3.1">2.0 Prerequisites</a></span></li><li><span><a href="#2.1-Start-with-the-native-OAI-interface" 
data-toc-modified-id="2.1-Start-with-the-native-OAI-interface-3.2">2.1 Start with the native OAI interface</a></span></li><li><span><a href="#2.2-Download-metadata-records" data-toc-modified-id="2.2-Download-metadata-records-3.3">2.2 Download metadata records</a></span></li><li><span><a href="#2.3-Get-set-size-&amp;-download-metadata-by-set" data-toc-modified-id="2.3-Get-set-size-&amp;-download-metadata-by-set-3.4">2.3 Get set size &amp; download metadata by set</a></span></li></ul></li><li><span><a href="#3-Download-fulltext-files-from-e-rara-website" data-toc-modified-id="3-Download-fulltext-files-from-e-rara-website-4">3 Download fulltext files from e-rara website</a></span><ul class="toc-item"><li><span><a href="#3.0-Prerequisites" data-toc-modified-id="3.0-Prerequisites-4.1">3.0 Prerequisites</a></span></li><li><span><a href="#3.1-Download-fulltext-files-by-e-rara-ID" data-toc-modified-id="3.1-Download-fulltext-files-by-e-rara-ID-4.2">3.1 Download fulltext files by e-rara ID</a></span></li><li><span><a href="#3.2-Download-fulltext-files-by-set" data-toc-modified-id="3.2-Download-fulltext-files-by-set-4.3">3.2 Download fulltext files by set</a></span></li></ul></li></ul></div>

## 0 Introduction

### 0.0 Scope and content

This Python [Jupyter notebook](https://jupyter.org/) helps you **access metadata and fulltexts of the [e-rara platform](https://www.e-rara.ch/)**. It uses the OAI-PMH interface of the e-rara service to retrieve metadata in different formats, and the e-rara website itself to download fulltexts.

The notebook consists of three parts:
1. Metadata access with Polymatheia
2. Direct metadata access via OAI-PMH
3. Download fulltext files from e-rara website.

So, there are two ways to access e-rara metadata. The **first chapter** introduces the Polymatheia library, which allows very convenient requests to the OAI interface by wrapping otherwise more elaborate functions. Working with Polymatheia is an **easy solution for quick access** without going deep into coding.

The **second (and the third) chapter** shows how to access the OAI interface natively. This takes more code, and **some functions will be defined**. You can use these functions without deeper programming skills, although such skills help if you want to adapt them.

You may start from the beginning and walk through the whole notebook or jump to the section that suits you. Also, it's a good idea to play around with the code in the cells and see what happens. Have fun!

Have any comments, questions and the like? Try kathi.woitas[at]ub.unibe.ch.

### 0.1 E-rara

[E-rara](https://www.e-rara.ch/) is the platform for digitized rare prints from Swiss institutions. E-rara holds approximately 60k old books, 17k printed graphic works, 6600 maps and 700 music prints as image files, in PDF format and partly as TXT files. You may consult e-rara's [Terms of Use](https://www.e-rara.ch/wiki/termsOfUse?lang=en) to check the licences of the e-rara documents.

### 0.2 OAI-PMH

The **Open Archives Initiative Protocol for Metadata Harvesting** (**OAI-PMH**) is a well-known interface that libraries, archives and similar institutions use to deliver their metadata in various formats, both library-specific ones like [MODS](http://www.loc.gov/standards/mods/index.html) and general ones like [Dublin Core](https://www.dublincore.org/specifications/dublin-core/dces/).
Further information on OAI-PMH is available [here](http://www.openarchives.org/OAI/openarchivesprotocol.html).

First of all, a few OAI-PMH related concepts should be introduced:

**repository**:
A repository is a server-side application that exposes metadata via OAI-PMH. It can process the *six OAI-PMH request types* aka *OAI verbs*. So, the e-rara OAI-PMH facility is a repository in this sense. 

**harvester**: OAI-PMH client applications are called harvesters. When you access the OAI-PMH interface and request records, you are *harvesting*.

**resource**: A resource is the object that the delivered metadata is "about". In the case of e-rara, the resources are the publications on the e-rara platform. Note that the resources themselves always live outside of OAI-PMH.

**record**: A record is the XML-encoded container for the metadata of a single resource (i.e. publication) item. It consists of a header and a metadata section.

**header**:
The record header contains the unique identifier of the record, a datestamp and optionally the set specification.

**metadata**: The record metadata contains the resource (i.e. publication) metadata in a defined metadata format.

**set**: A structure for grouping records for selective harvesting. Sets often refer to collections of thematic scopes/subjects, to collections of different owners/institutions (in case of aggregated content) or to collections of certain publication types.
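
To make these concepts more tangible, here is a minimal sketch that parses a single record with Python's standard library. The XML string is a hand-written, abridged sample (an assumption for illustration, not a live response); its values are taken from the Bernensia example used later in this notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written, abridged sample record (not a live response from e-rara)
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:www.e-rara.ch:1395833</identifier>
    <datestamp>2012-09-26T14:23:16Z</datestamp>
    <setSpec>bes_1</setSpec>
    <setSpec>bernensia</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Adressbuch der Stadt Bern</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""

# Namespace prefixes used in the search paths below
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

record = ET.fromstring(SAMPLE_RECORD.strip())

# The header carries the identifier, the datestamp and the set memberships
header = record.find("oai:header", NS)
identifier = header.findtext("oai:identifier", namespaces=NS)
datestamp = header.findtext("oai:datestamp", namespaces=NS)
set_specs = [el.text for el in header.findall("oai:setSpec", NS)]

# The metadata section carries the description of the resource itself
title = record.findtext("oai:metadata//dc:title", namespaces=NS)

print(identifier, datestamp, set_specs, title)
```

The header tells you *which* record you have and which sets it belongs to; the metadata section tells you about the publication.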

Now let's look at some example requests of the e-rara OAI interface with the **six OAI verbs**:

- Identify ([specification](http://www.openarchives.org/OAI/openarchivesprotocol.html#Identify)):
 https://www.e-rara.ch/oai?verb=Identify

- ListSets ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListSets)):
https://www.e-rara.ch/oai?verb=ListSets

- ListMetadataFormats ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListMetadataFormats)):
https://www.e-rara.ch/oai?verb=ListMetadataFormats

- ListIdentifiers ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListIdentifiers)):
https://www.e-rara.ch/oai?verb=ListIdentifiers&metadataPrefix=mods&set=bernensia

- GetRecord ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#GetRecord)):
https://www.e-rara.ch/oai?verb=GetRecord&metadataPrefix=mods&identifier=23216296 

- ListRecords ([spec](http://www.openarchives.org/OAI/openarchivesprotocol.html#ListRecords)):
https://www.e-rara.ch/oai?verb=ListRecords&from=1900-01-01&set=bernensia&metadataPrefix=oai_dc

These examples and their *parameters* are fairly easy to read, and building similar request URLs is just as easy.
But how do you download the delivered data and work with it? That's the aim of this notebook. So, here we go!
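
As a small illustration of how such request URLs are put together, the following sketch builds them from a verb and its parameters. The helper name `build_oai_url` is made up for this example; only `urllib.parse.urlencode` from the standard library is used.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.e-rara.ch/oai"

def build_oai_url(verb, **params):
    """Return an OAI-PMH request URL for the given verb and parameters (hypothetical helper)."""
    query = {"verb": verb}
    # Skip parameters that were passed as None
    query.update({key: value for key, value in params.items() if value is not None})
    return f"{BASE_URL}?{urlencode(query)}"

list_sets_url = build_oai_url("ListSets")
identifiers_url = build_oai_url("ListIdentifiers", metadataPrefix="mods", set="bernensia")
# 'from' is a reserved word in Python, so it is passed via dictionary unpacking
records_url = build_oai_url("ListRecords", **{"from": "1900-01-01"}, set="bernensia", metadataPrefix="oai_dc")

print(list_sets_url)
print(identifiers_url)
print(records_url)
```

The printed URLs match the example requests listed above and can be opened in a browser.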

## 1 Metadata access with Polymatheia

### 1.0 Prerequisites

First, some basic Python libraries have to be imported. Just **click on the arrow icon** on the left side of the code cell - or first click into the cell and then press 'Ctrl' + 'Enter' or 'Shift' + 'Enter'. While the code runs, a star symbol appears next to the cell, and when it's done a number turns up. Most importantly, the resulting output is shown beneath the code cell.

In [1]:
import os                              # navigate and manipulate file directories
import pandas as pd                    # pandas is the de facto standard Python library for dataframes
from IPython.display import IFrame     # embed website views in Jupyter notebook
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


**Polymatheia** is a Python library to support working with digital library/archive metadata. It supports accessing metadata of different formats from OAI-PMH and also offers methods to handle the retrieved data. The metadata will be turned into a Python-style ['navigable dictionary'](https://polymatheia.readthedocs.io/en/latest/concepts.html), which allows convenient access to certain metadata fields.
Its aim is not necessarily to cover all ways of working with metadata, but to make it easy to undertake most types of tasks and analysis. See the [documentation](https://polymatheia.readthedocs.io/en/latest/) of the Polymatheia library.

If you are using the Polymatheia package **for the first time**, you will need to **install this code library**: Just remove the `#` from the second line of code, and then execute the cell like the one before.

In [2]:
# uncomment !pip command to install polymatheia 
#!pip install polymatheia          
from polymatheia.data.reader import OAISetReader               # list OAI sets
from polymatheia.data.reader import OAIMetadataFormatReader    # list available metadata formats
from polymatheia.data.reader import OAIRecordReader            # read one metadata record from OAI
from polymatheia.data.writer import PandasDFWriter             # easy transformation of flat data into a dataframe
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


https://www.e-rara.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

In [3]:
oai = 'https://www.e-rara.ch/oai/'

### 1.1 Start with the OAI interface via Polymatheia

First, it's good to know **which sets are available**. To see the sets as the native OAI interface delivers them, let's look at https://www.e-rara.ch/oai?verb=ListSets with the `IFrame` function. Every set has a `setName` and a `setSpec`; the `setSpec` is a short code for the set and is used as a parameter in OAI requests.

In [4]:
IFrame('https://www.e-rara.ch/oai?verb=ListSets', width=970, height=300)

That's nice, but how do we retrieve these contents as data? Polymatheia's 'OAISetReader' does this conveniently. Here's how it works.

In [5]:
reader = OAISetReader(oai)             # instantiate ('make') an OAISetReader named reader
# 'Instantiation' is a standard procedure in Python - so it's a good idea to get familiar with it.

print(type(reader))                    # print the object type of 'reader' for information

<class 'polymatheia.data.reader.OAISetReader'>


In [6]:
for x in reader:                       # for-loop which iterates through the reader-content and prints each
    print(x)                           # note that 'x' is an arbitrary term

{
  "setSpec": "frc_g",
  "setName": "BCU Fribourg (GLN)"
}
{
  "setSpec": "elibch",
  "setName": "Alle Bibliotheken"
}
{
  "setSpec": "zbs",
  "setName": "Zentralbibliothek Solothurn"
}
{
  "setSpec": "sbs",
  "setName": "Stadtbibliothek Schaffhausen"
}
{
  "setSpec": "astrorara",
  "setName": "Astronomie-rara"
}
{
  "setSpec": "bau_1",
  "setName": "UB Basel (DSV01)"
}
{
  "setSpec": "kbg",
  "setName": "Kantonsbibliothek Graub\u00fcnden"
}
{
  "setSpec": "nep_r",
  "setName": "Biblioth\u00e8que des Pasteurs, BPU Neuch\u00e2tel (RERO)"
}
{
  "setSpec": "astrozut",
  "setName": "ETH-Bibliothek Z\u00fcrich"
}
{
  "setSpec": "frc_r",
  "setName": "BCU Fribourg (RERO)"
}
{
  "setSpec": "ebs",
  "setName": "Eisenbibliothek Schlatt"
}
{
  "setSpec": "nep_g",
  "setName": "Biblioth\u00e8que des Pasteurs, BPU Neuch\u00e2tel (GLN)"
}
{
  "setSpec": "lg1",
  "setName": "Biblioteca Salita dei Frati, Lugano"
}
{
  "setSpec": "demusmu",
  "setName": "Deutsches Museum, M\u00fcnchen"
}
{
  "setSpec

We can put this together and then turn the retrieved data into a *Pandas dataframe* with the 'PandasDFWriter'. A **dataframe** is a table-like data object, which gives a clear overview and is a useful format for further investigation. *Pandas* is the de facto standard Python library for dataframe handling.

In [7]:
reader = OAISetReader(oai)
setspec = []                          # make an empty list named 'setspec'

for x in reader:                 
    setspec.append(x)                 # .append adds all the single reader-contents to the list 'setspec'

print(setspec[0:3])                   # print the first 3 items of the list (of key-value pairs), just to see

df = PandasDFWriter().write(setspec)  # write list 'setspec' into a Pandas dataframe named 'df'
df                                    # shows 'df' 

[{'setSpec': 'frc_g', 'setName': 'BCU Fribourg (GLN)'}, {'setSpec': 'elibch', 'setName': 'Alle Bibliotheken'}, {'setSpec': 'zbs', 'setName': 'Zentralbibliothek Solothurn'}]


| | setSpec | setName |
|---|---|---|
| 0 | frc_g | BCU Fribourg (GLN) |
| 1 | elibch | Alle Bibliotheken |
| 2 | zbs | Zentralbibliothek Solothurn |
| 3 | sbs | Stadtbibliothek Schaffhausen |
| 4 | astrorara | Astronomie-rara |
| ... | ... | ... |
| 77 | doi | unknown spec |
| 78 | notated_music | notated music |
| 79 | book | book |
| 80 | illustration_document | illustration document |


If there is a large number of sets, you might **search for a certain collection by string**. This also helps you **find the short code** `setSpec` that the OAI interface uses for a certain set, which you need for further investigation of that set.

In [8]:
# Example: Searching for strings 'bern' or 'Bern' in the 'setName' column
for i in df.index:                                             # iterate over the rows of 'df'
    if df.setName[i]:                                          # skip rows whose 'setName' is empty ('None')
        if 'bern' in df.setName[i] or 'Bern' in df.setName[i]: # look for 'bern' or 'Bern' in 'setName'
            print(df.loc[i])                                   # print the row if the condition is True

setSpec              bes_5
setName    UB Bern (NEBIS)
Name: 26, dtype: object
setSpec              bes_1
setName    UB Bern (DSV01)
Name: 28, dtype: object
setSpec                                            bernensia
setName    Bernensia des 18. bis frühen 20. Jahrhunderts ...
Name: 47, dtype: object
setSpec                        rossica
setName    Rossica Europeana (UB Bern)
Name: 56, dtype: object
setSpec                                   russexil
setName    Russisches Schrifttum im Exil (UB Bern)
Name: 57, dtype: object
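
The same search can also be written without caring about upper and lower case at all. The following is a minimal sketch on a plain list of dictionaries (the shape of the `setspec` list built above); the helper name `find_sets` and the last sample entry with an empty name are assumptions for illustration.

```python
# Small sample in the shape of the 'setspec' list; the last entry is hypothetical
sets = [
    {"setSpec": "bes_1", "setName": "UB Bern (DSV01)"},
    {"setSpec": "zbs", "setName": "Zentralbibliothek Solothurn"},
    {"setSpec": "rossica", "setName": "Rossica Europeana (UB Bern)"},
    {"setSpec": "example", "setName": None},
]

def find_sets(sets, term):
    """Return all sets whose name contains 'term', ignoring case (hypothetical helper)."""
    term = term.casefold()
    return [
        s for s in sets
        if s.get("setName") and term in s["setName"].casefold()  # skip empty names
    ]

matches = find_sets(sets, "bern")
for match in matches:
    print(match["setSpec"], "-", match["setName"])
```

Comparing case-folded strings means that 'bern', 'Bern' and 'BERN' are all found with a single condition.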


In [9]:
# A nicer view to explore all given sets
df.style

| | setSpec | setName |
|---|---|---|
| 0 | frc_g | BCU Fribourg (GLN) |
| 1 | elibch | Alle Bibliotheken |
| 2 | zbs | Zentralbibliothek Solothurn |
| 3 | sbs | Stadtbibliothek Schaffhausen |
| 4 | astrorara | Astronomie-rara |
| 5 | bau_1 | UB Basel (DSV01) |
| 6 | kbg | Kantonsbibliothek Graubünden |
| 7 | nep_r | Bibliothèque des Pasteurs, BPU Neuchâtel (RERO) |
| 8 | astrozut | ETH-Bibliothek Zürich |
| 9 | frc_r | BCU Fribourg (RERO) |


It's also very useful to know in which **formats the metadata records** are available. The native interface answers this at the URL https://www.e-rara.ch/oai?verb=ListMetadataFormats. Here, we use the 'OAIMetadataFormatReader' of Polymatheia.

As you can see, you can directly select information such as `metadataPrefix` and `metadataNamespace` from the retrieved data by **using dot-notation**: simply append the name of the desired sub-element after a dot.

In [10]:
reader = OAIMetadataFormatReader(oai)
for formats in reader:
    print(formats)
    print(formats.metadataPrefix)    # dot-notation: chooses element 'metadataPrefix' from the reader content
    print('---')

{
  "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
  "metadataPrefix": "oai_dc",
  "metadataNamespace": "http://www.openarchives.org/OAI/2.0/oai_dc/"
}
oai_dc
---
{
  "schema": "http://www.loc.gov/standards/mets/mets.xsd",
  "metadataPrefix": "mets",
  "metadataNamespace": "http://www.loc.gov/METS/"
}
mets
---
{
  "schema": "http://www.loc.gov/standards/mods/v3/mods-3-0.xsd",
  "metadataPrefix": "mods",
  "metadataNamespace": "http://www.loc.gov/mods/v3"
}
mods
---
{
  "schema": "http://www.loc.gov/standards/mods/v3/mods-3-0.xsd",
  "metadataPrefix": "rawmods",
  "metadataNamespace": "http://www.loc.gov/mods/v3"
}
rawmods
---
{
  "schema": "http://www.persistent-identifier.de/xepicur/version1.0/xepicur.xsd",
  "metadataPrefix": "epicur",
  "metadataNamespace": "urn:nbn:de:1111-2004033116"
}
epicur
---


In [11]:
reader = OAIMetadataFormatReader(oai)
[formats.metadataPrefix for formats in reader]   # shorter notation for the for-loops above, which outputs a list

['oai_dc', 'mets', 'mods', 'rawmods', 'epicur']

### 1.2 Retrieve metadata records via Polymatheia

Retrieving available **metadata in bulk** is simple with the 'OAIRecordReader'. Just specify the following parameters when creating the 'OAIRecordReader':

- `metadata_prefix`: mandatory
- `set_spec` (the short code of the set you want to retrieve): optional, but the default is *all* available records, which can be a lot!
- `max_records` (the number of records): optional, but the default is *all* available records, which can be a lot!

To compare this result with the native OAI interface you might check the top item of 
https://www.e-rara.ch/oai?verb=ListRecords&metadataPrefix=oai_dc&set=bernensia.


In [12]:
reader = OAIRecordReader(oai, metadata_prefix='oai_dc', set_spec='bernensia', max_records=1)
[record for record in reader]      

[{'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1395833'},
   'datestamp': {'_text': '2012-09-26T14:23:16Z'},
   'setSpec': [{'_text': 'bes_1'},
    {'_text': 'journal'},
    {'_text': 'collections'},
    {'_text': 'bernensia'},
    {'_text': 'ch'},
    {'_text': 'ch19'}]},
  'metadata': {'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc': {'_attrib': {'xsi_schemaLocation': 'http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd'},
    'dc_title': {'_text': 'Adressbuch der Stadt Bern'},
    'dc_creator': {'_text': '[s.n.]'},
    'dc_description': [{'_text': '1860 - Jg. 75(1957)'},
     {'_text': 'Mit Stadtplan (zuerst eingeklebt, später als lose Beilage)'}],
    'dc_publisher': {'_text': 'Hallwag'},
    'dc_date': [{'_text': '1860'}, {'_text': '1957'}],
    'dc_type': [{'_text': 'Text'},
     {'_text': 'Periodical'},
     {'_text': 'Zeitschrift'}],
    'dc_format': {'_text': '35 cm'},
    'dc_identifier': [{'_text': 'doi:10.3931/e-rara-4614'},

To access a certain metadata content, you can **follow down the *navigable dictionary* path** with dot-notation, like the following example.

In [13]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.identifier._text)        # compare to the first line of the output above

oai:www.e-rara.ch:1395833


Metadata content is not always a simple flat value like the identifier above. **Many fields in structured metadata formats are lists** as they hold multiple values. A good example is the `header` field `setSpec`, which holds the information about the different OAI set memberships of the item.

In [14]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.setSpec)

[{'_text': 'bes_1'}, {'_text': 'journal'}, {'_text': 'collections'}, {'_text': 'bernensia'}, {'_text': 'ch'}, {'_text': 'ch19'}]


The surrounding square brackets `[ ]` indicate a list (here of key-value pairs). To access the list items individually you can use *subsetting*, which selects an item by its position in the list (counting starts at 0).

In [15]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.header.setSpec[0]._text)
    print(record.header.setSpec[1]._text)
    print(record.header.setSpec[2]._text)
    print(record.header.setSpec[3]._text)
    print(record.header.setSpec[4]._text)
    print(record.header.setSpec[5]._text)

bes_1
journal
collections
bernensia
ch
ch19
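
Writing out every index by hand only works if you know how many items the list holds. A loop copes with any length. The sketch below uses plain Python dictionaries as a stand-in for the navigable dictionary (an assumption made so the example runs without network access); the helper `as_list` is made up for this example and also covers fields that hold a single value instead of a list.

```python
# Plain-dict stand-in for 'record.header', using the values shown above
header = {
    "identifier": {"_text": "oai:www.e-rara.ch:1395833"},
    "setSpec": [
        {"_text": "bes_1"},
        {"_text": "journal"},
        {"_text": "collections"},
        {"_text": "bernensia"},
        {"_text": "ch"},
        {"_text": "ch19"},
    ],
}

def as_list(value):
    """Wrap a single value in a list so that callers can always iterate (hypothetical helper)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Works for list-valued fields ...
set_specs = [item["_text"] for item in as_list(header["setSpec"])]
# ... and for single-valued fields alike
identifiers = [item["_text"] for item in as_list(header["identifier"])]

print(set_specs)
print(identifiers)
```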


To retrieve contents from the `metadata` section, a similar bracket lookup is needed, this time with the qualifying string `'{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'` as the key. As background: the part in curly braces is the `metadataNamespace` we saw when listing the available metadata formats above, and it is followed by the element name `dc`.

In [16]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)
for record in reader:
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_title._text)
    print('---')
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[0]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[1]._text)
    print(record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_identifier[2]._text)

Adressbuch der Stadt Bern
---
doi:10.3931/e-rara-4614
https://www.e-rara.ch/bes_1/doi/10.3931/e-rara-4614
system:99116914771105511
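
The qualifying string has the same shape as the `{namespace}tag` names (so-called Clark notation) that Python's standard `xml.etree.ElementTree` module uses for namespaced XML elements. The following sketch shows where such a string comes from and how to assemble it instead of typing it by hand; the helper name `qualified` is made up for this example.

```python
import xml.etree.ElementTree as ET

# A namespaced root element as it appears in the 'metadata' section of an oai_dc record
xml = '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"/>'
root = ET.fromstring(xml)

# ElementTree replaces the prefix 'oai_dc:' by the namespace URI in curly braces
print(root.tag)

def qualified(namespace, local_name):
    """Build a '{namespace}name' key from its two parts (hypothetical helper)."""
    return f"{{{namespace}}}{local_name}"

dc_key = qualified("http://www.openarchives.org/OAI/2.0/oai_dc/", "dc")
mods_key = qualified("http://www.loc.gov/mods/v3", "mods")
print(dc_key)
print(mods_key)
```

Storing such keys in variables keeps the long lookups in the cells of this notebook shorter and less error-prone.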


Of course, MODS metadata is far richer in content. The request can easily be adjusted with the `metadata_prefix` parameter. But, as you can see in the title fields for instance, it also brings more complexity.

In [17]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=1)
[record for record in reader]

[{'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1395833'},
   'datestamp': {'_text': '2012-09-26T14:23:16Z'},
   'setSpec': [{'_text': 'bes_1'},
    {'_text': 'journal'},
    {'_text': 'collections'},
    {'_text': 'bernensia'},
    {'_text': 'ch'},
    {'_text': 'ch19'}]},
  'metadata': {'{http://www.loc.gov/mods/v3}mods': {'_attrib': {'version': '3.6',
     'xsi_schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-6.xsd'},
    'mods_titleInfo': [{'mods_title': {'_text': 'Adressbuch der Stadt Bern'}},
     {'_attrib': {'type': 'alternative'},
      'mods_title': {'_text': 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...'}},
     {'_attrib': {'type': 'alternative'},
      'mods_title': {'_text': 'Adressbuch der Stadt Bern und Umgebung'}},
     {'_attrib': {'type': 'alternative'},
      'mods_title': {'_text': 'Adress-Kalender der Stadt Bern'}}],
    'mods_typeOfResource': {'_text': 'text'},
    'mods_genre': [{'_text': 

When you are heading for some information in the `metadata` section of the record, the qualifying string has to be adapted to `'{http://www.loc.gov/mods/v3}mods'`.

In [18]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=1)
for record in reader:
    print(record.header.identifier._text)
    print('---')
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo)
    print('---')
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[0].mods_title._text)
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[1].mods_title._text)
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[2].mods_title._text)
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[3].mods_title._text)
    print('---')
    print(record.metadata['{http://www.loc.gov/mods/v3}mods'].mods_titleInfo[3]._attrib.type)

oai:www.e-rara.ch:1395833
---
[{'mods_title': {'_text': 'Adressbuch der Stadt Bern'}}, {'_attrib': {'type': 'alternative'}, 'mods_title': {'_text': 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...'}}, {'_attrib': {'type': 'alternative'}, 'mods_title': {'_text': 'Adressbuch der Stadt Bern und Umgebung'}}, {'_attrib': {'type': 'alternative'}, 'mods_title': {'_text': 'Adress-Kalender der Stadt Bern'}}]
---
Adressbuch der Stadt Bern
Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...
Adressbuch der Stadt Bern und Umgebung
Adress-Kalender der Stadt Bern
---
alternative


Because drilling down the *navigable dictionary* path can lead to long and complicated commands - which might not be very clear, either - there is a handier way to do so with the `get` command applied to the records.
And: There is **no issue anymore with single values versus lists and qualifying strings**. Just put the path elements together as a list of `get` parameters!

Note that if more than one element is retrieved, a result list (in square brackets) will be created.

In [19]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=1)
for record in reader:
    print(record.get(['header', 'identifier', '_text']))
    print('---')
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_titleInfo', 'mods_title', '_text']))
    print('---')
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_titleInfo', '_attrib', 'type']))

oai:www.e-rara.ch:1395833
---
['Adressbuch der Stadt Bern', 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...', 'Adressbuch der Stadt Bern und Umgebung', 'Adress-Kalender der Stadt Bern']
---
[None, 'alternative', 'alternative', 'alternative']
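
To see why lists stop being a problem, it can help to look at how such a path lookup might work. The function below is a simplified sketch written for this notebook, not Polymatheia's actual implementation: it follows the path key by key and, whenever it meets a list, applies the rest of the path to every list item. The sample data is an abridged plain-dict copy of the record shown above.

```python
def get_path(data, path):
    """Follow 'path' (a list of keys) through nested dicts; lists are handled item by item."""
    if not path:                      # path used up: this is the requested value
        return data
    if isinstance(data, list):        # apply the remaining path to every list item
        return [get_path(item, path) for item in data]
    if isinstance(data, dict):
        key = path[0]
        if key not in data:           # missing elements yield None instead of an error
            return None
        return get_path(data[key], path[1:])
    return None

# Abridged plain-dict copy of the MODS record from the cells above
record = {
    "header": {"identifier": {"_text": "oai:www.e-rara.ch:1395833"}},
    "metadata": {
        "{http://www.loc.gov/mods/v3}mods": {
            "mods_titleInfo": [
                {"mods_title": {"_text": "Adressbuch der Stadt Bern"}},
                {"_attrib": {"type": "alternative"},
                 "mods_title": {"_text": "Adressbuch der Stadt Bern und Umgebung"}},
                {"_attrib": {"type": "alternative"},
                 "mods_title": {"_text": "Adress-Kalender der Stadt Bern"}},
            ]
        }
    },
}

mods = ["metadata", "{http://www.loc.gov/mods/v3}mods"]
identifier = get_path(record, ["header", "identifier", "_text"])
titles = get_path(record, mods + ["mods_titleInfo", "mods_title", "_text"])
types = get_path(record, mods + ["mods_titleInfo", "_attrib", "type"])

print(identifier)
print(titles)
print(types)
```

The first title has no `_attrib` element, which is why the last result starts with `None`, just like in the output above.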


This also works with the shorter form of for-loops. But mind that it delivers a nested - or 'doubled' - list if an element occurs several times within a record, like `dc_identifier` here.

In [20]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='oai_dc', max_records=1)

[record.get(['metadata', '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc', 'dc_identifier', '_text']) \
            for record in reader]                          # '\' indicates that command proceeds on the next line

[['doi:10.3931/e-rara-4614',
  'https://www.e-rara.ch/bes_1/doi/10.3931/e-rara-4614',
  'system:99116914771105511']]

Now, it's really easy to access whatever metadata content you like.

For instance, you might be interested in **all responsible persons and bodies**, their role, and GND identifiers...

In [21]:
# First look at the 'mods_name' section of two records to get an overview of its structure
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='bernensia', max_records=2)

[record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name']) for record in reader]

[None,
 {'_attrib': {'type': 'personal',
   'usage': 'primary',
   'authority': 'gnd',
   'authorityURI': 'http://d-nb.info/gnd/',
   'valueURI': 'http://d-nb.info/gnd/127809651,'},
  'mods_nameIdentifier': {'_text': '(DE-588)127809651,'},
  'mods_namePart': [{'_text': 'Raemy, Alfred de'},
   {'_text': '1825-1909', '_attrib': {'type': 'date'}}],
  'mods_role': [{'mods_roleTerm': {'_text': 'Verfasser',
     '_attrib': {'type': 'text'}}},
   {'mods_roleTerm': {'_text': 'aut',
     '_attrib': {'authority': 'marcrelator', 'type': 'code'}}}]}]

In [22]:
# Selecting the 'mods_name' sub-elements of interest of ten records
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='mods', max_records=10)
for record in reader:
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name', 'mods_namePart', '_text']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name', '_attrib', 'valueURI']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_name', 'mods_role', 'mods_roleTerm', \
                      '_text']))
    print('---')

None
None
None
---
['Raemy, Alfred de', '1825-1909']
http://d-nb.info/gnd/127809651,
['Verfasser', 'aut']
---
Sommerlatt, Christian Vollrath von
http://d-nb.info/gnd/100559921
None
---
Typographische Societäts-Buchhandlung (Bern)
http://d-nb.info/gnd/1086438582,
['Drucker', 'prt']
---
Messerli, Johann Ch.
None
None
---
None
None
None
---
Sterchi, Jakob
None
None
---
['Jenni, Christian Albrecht', '1786-1861']
http://d-nb.info/gnd/1037555503
['Herausgeber', 'edt']
---
None
None
None
---
['Tscharner, Friedrich', ['Haller, Ludwig Albrecht', '1773-1837'], 'Stadtbibliothek Bern']
[None, 'http://d-nb.info/gnd/1037511646', 'http://d-nb.info/gnd/508313-8']
[None, ['Drucker', 'prt'], None]
---


### 1.3 Retrieve the licences

E-rara's documents are published under different licences. To get an overview you might look at the [Terms of Use](https://www.e-rara.ch/wiki/termsOfUse?lang=en). Here's a short how-to to get the rights information out of the OAI metadata.

First, let's check the rights information in the *Dublin Core* format. Often 'pdm' for *Public Domain Mark* will be delivered.

In [23]:
reader = OAIRecordReader(oai, set_spec='ch20', metadata_prefix='oai_dc', max_records=10)
[record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_rights._text \
  for record in reader]

['pdm', 'pdm', 'pdm', 'pdm', 'pdm', 'pdm', 'pdm', 'pdm', 'pdm', 'pdm']

Grabbing the rights information of the *MODS* format might be an alternative.

In [24]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='ch20', max_records=5)
for record in reader:
    print(record.header.identifier._text) 
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_accessCondition', '_attrib', \
                      'type']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_accessCondition', '_attrib', \
                      'displayLabel']))
    print('---')

oai:www.e-rara.ch:1423055
use and reproduction
Public Domain Mark
---
oai:www.e-rara.ch:2554533
use and reproduction
Public Domain Mark
---
oai:www.e-rara.ch:3869077
use and reproduction
Public Domain Mark
---
oai:www.e-rara.ch:3870000
use and reproduction
Public Domain Mark
---
oai:www.e-rara.ch:6286695
use and reproduction
Public Domain Mark
---


To get a better view and a more usable format for further work, you can write the rights into a dataframe, too.

In [25]:
reader = OAIRecordReader(oai, metadata_prefix='mods', set_spec='ch20', max_records=5)
identifier = []
dc_rights = []                          

for record in reader:
    identifier.append(record.header.identifier._text)
    dc_rights.append(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_accessCondition', \
                                 '_attrib', 'displayLabel']))

dic = {'ID': identifier, 'rights': dc_rights} 
df = pd.DataFrame(dic)  
df

| | ID | rights |
|---|---|---|
| 0 | oai:www.e-rara.ch:1423055 | Public Domain Mark |
| 1 | oai:www.e-rara.ch:2554533 | Public Domain Mark |
| 2 | oai:www.e-rara.ch:3869077 | Public Domain Mark |
| 3 | oai:www.e-rara.ch:3870000 | Public Domain Mark |
| 4 | oai:www.e-rara.ch:6286695 | Public Domain Mark |


You can also easily get an overview of the rights in a whole OAI set with the `Counter` class from Python's `collections` module. The maximum number of records is again set with the `max_records` parameter.

In [26]:
import collections
reader = OAIRecordReader(oai, set_spec='ch20', metadata_prefix='oai_dc', max_records=100)

dc_rights = [record.metadata['{http://www.openarchives.org/OAI/2.0/oai_dc/}dc'].dc_rights._text \
  for record in reader]
counter = collections.Counter(dc_rights)
counter

Counter({'pdm': 100})
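
A `Counter` becomes more informative once a set contains different licences or records without any rights statement. The sketch below uses a made-up list of values (hypothetical, not harvested from e-rara) to show how missing values can be labelled and how shares can be computed.

```python
from collections import Counter

# Hypothetical rights values; None stands for a record without rights statement
rights = ["pdm", "pdm", "pdm", None, "pdm"]

# Replace missing values by a readable label before counting
counter = Counter(r if r is not None else "(no rights statement)" for r in rights)

# most_common() returns the values sorted by frequency, highest first
for value, count in counter.most_common():
    share = count / len(rights)
    print(f"{value}: {count} ({share:.0%})")
```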

### 1.4 Save  and recover complex metadata structures

Before downloading any data, let's create a common folder named `data` inside the working directory to store everything in.

In [27]:
print(os.getcwd())                                # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access


If you need to change the directory, use `os.chdir()`: `os.chdir('<name>')` changes into the given directory (for example a subdirectory), while `os.chdir(os.pardir)` moves up to the parent directory. Just uncomment (and, if necessary, repeat) the commands you need.

In [28]:
#os.chdir(os.pardir)                              # change to parent directory
#os.chdir('...')                                  # change to '...' subdirectory
#print(os.getcwd())

In [29]:
os.makedirs('data', exist_ok=True)                # make new folder 'data'
os.chdir('data')                                  # change to 'data' folder
print(os.getcwd())

C:\Users\kwoit\Documents\GitHub\e-rara-access\data


To **download a whole bunch of metadata items** in nested formats like *MODS*, the `JSONWriter` from Polymatheia is very helpful.
It creates a complex folder structure and JSON files to reproduce the structured metadata. And with `JSONReader` one can easily recover the metadata set.

In [30]:
from polymatheia.data.writer import JSONWriter     # also available: CSVReader (for flat data), XMLReader and Writer
from polymatheia.data.reader import JSONReader

`JSONWriter` takes two parameters:
- The first is the name of the directory in which the data should be stored.
- The second is the dot-notated path to the identifier within each record (here `header.identifier._text`); the writer uses this identifier to decide where each record is stored.
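
To make the idea of a dot-notated path concrete, here is a minimal sketch that resolves such a path against a nested dictionary shaped like the records shown in this notebook. The helper `resolve_path` is a hypothetical name for illustration only; it is not part of Polymatheia and does not claim to mirror its internals.

```python
def resolve_path(record, dotted_path):
    """Follow a dot-notated path through nested dictionaries (illustrative helper)."""
    value = record
    for key in dotted_path.split('.'):
        value = value[key]
    return value

# A trimmed record header in the shape used throughout this notebook
record = {'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1757592'},
                     'datestamp': {'_text': '2012-05-08T13:37:52Z'}}}

print(resolve_path(record, 'header.identifier._text'))
```

Note that this naive split on `.` would not work for keys that themselves contain dots, such as `{http://www.loc.gov/mods/v3}mods`; for those, the list form used with `record.get([...])` is the safer choice.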

For more clarity, these are the contents of `header.identifier` for the first ten Bernensia records we will refer to:

In [31]:
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='mods', max_records=10)
for record in reader:
    print(record.header.identifier._text)

oai:www.e-rara.ch:1395833
oai:www.e-rara.ch:1396731
oai:www.e-rara.ch:1397203
oai:www.e-rara.ch:1757425
oai:www.e-rara.ch:1757509
oai:www.e-rara.ch:1757592
oai:www.e-rara.ch:1757931
oai:www.e-rara.ch:1758267
oai:www.e-rara.ch:2069554
oai:www.e-rara.ch:4709578


In [32]:
# Download and save the first ten Bernensia records from MODS format
# 'poly_metadata' = directory to store into
reader = OAIRecordReader(oai, set_spec='bernensia', metadata_prefix='mods', max_records=10)
writer = JSONWriter('poly_metadata', 'header.identifier._text')
writer.write(reader)

In [33]:
# Recover the first ten Bernensia records from local disk
reader = JSONReader('poly_metadata')
[record for record in reader]

[{'header': {'identifier': {'_text': 'oai:www.e-rara.ch:1757592'},
   'datestamp': {'_text': '2012-05-08T13:37:52Z'},
   'setSpec': [{'_text': 'bes_1'},
    {'_text': 'journal'},
    {'_text': 'collections'},
    {'_text': 'bernensia'},
    {'_text': 'ch'},
    {'_text': 'ch19'}]},
  'metadata': {'{http://www.loc.gov/mods/v3}mods': {'_attrib': {'version': '3.6',
     'xsi_schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-6.xsd'},
    'mods_titleInfo': {'mods_title': {'_text': 'Hand- und Adressbuch der Bundesstadt Bern'},
     'mods_subTitle': {'_text': 'Verzeichniss der Behörden, Angabe der Häuserbesitzer und Wohnungen, der Handels- und Gewerbstreibenden, der Gesellschaften und Vereine, Tarife und Verordnungen der Verkehrsanstalten u. dgl. m'}},
    'mods_typeOfResource': {'_text': 'text'},
    'mods_genre': [{'_text': 'periodical', '_attrib': {'authority': 'marcgt'}},
     {'_text': 'Text', '_attrib': {'authority': 'rdacontent'}},
     {'_text': 

The stored data **can be used in just the same way** as the directly accessed data, for instance to read the `mods_titleInfo` element. Note that the records now come back in a different order.

In [34]:
reader = JSONReader('poly_metadata')
for record in reader:
    print(record.header.identifier._text)
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_titleInfo', 'mods_title', '_text']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_genre', '_text']))
    print('---')

oai:www.e-rara.ch:1757592
Hand- und Adressbuch der Bundesstadt Bern
['periodical', 'Text', 'Zeitschrift']
---
oai:www.e-rara.ch:1397203
Adressenbuch der Republik Bern
Text
---
oai:www.e-rara.ch:4709578
['Verzeichniss aller auf der Stadt-Bibliothek in Bern vorhandenen gedruckten Werke', 'Verzeichnis aller auf der Stadtbibliothek in Bern vorhandenen gedruckten Werke', '[Erstes - Drittes] Supplement zum Catalog der Stadt-Bibliothek in Bern']
None
---
oai:www.e-rara.ch:1758267
['Sammlung der Grabschriften der gegenwärtigen Bernischen Gottesäcker Monbijou, Rosengarten und Klösterlein', 'Fortsetzung der Sammlung der Grabschriften der ... Gottesäcker Monbijou und Rosengarten']
None
---
oai:www.e-rara.ch:1757931
Berner Stadtchronik
None
---
oai:www.e-rara.ch:1395833
['Adressbuch der Stadt Bern', 'Adressbuch der Stadt Bern einschliesslich Bümpliz, Köniz, Liebefeld...', 'Adressbuch der Stadt Bern und Umgebung', 'Adress-Kalender der Stadt Bern']
['periodical', 'Text', 'Zeitschrift']
---
oai:www.e

In [35]:
# One more example: The 'mods_note' element and its '_attrib' sub-element
reader = JSONReader('poly_metadata')
for record in reader:
    print(record.header.identifier._text)
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_note', '_text']))
    print(record.get(['metadata', '{http://www.loc.gov/mods/v3}mods', 'mods_note', '_attrib', 'type']))
    print('---')

oai:www.e-rara.ch:1757592
1859
date/sequential designation
---
oai:www.e-rara.ch:1397203
['Bearb. und hrsg. von C. v. Sommerlatt', 'Beigebunden: Ergänzungsheft zu dem Adressbuch der Republik Bern von 1836. Erschienen im April 1839, Thun.']
['statement of responsibility', None]
---
oai:www.e-rara.ch:4709578
['Erster alphabetischer Katalog der damaligen Stadtbibliothek, verfasst vom Oberbibliothekar F. Tscharner', 'Theil 1 (1811): XLVIII, 470 S.; Theil 2 (1811): 508 S.; Theil 3 (1811): 462 S.; Suppl. (1839): XXII, 287 S.; Suppl. 2 (1847): 224 S.; Suppl. 3 (1856): 390 S.', 'Betrifft die Handschrift Cod. 757.I der Burgerbibliothek Bern (S. 212).']
[None, None, None]
---
oai:www.e-rara.ch:1758267
['Hrsg. von Christian Albrecht Jenni ; Fortsetzung der Sammlung der Grabschriften der ... Gottesäcker Monbijou und Rosengarten', 'Herausgeber am Ende des Vorworts von Band 2: "C. A. Jenni"']
['statement of responsibility', None]
---
oai:www.e-rara.ch:1757931
von J. Sterchi
statement of responsibili

Of course, you can also read out metadata fields **via basic dot-notation**. This takes a bit more code, though, to cope with fields that hold a list in some records and a single value in others.

In [36]:
reader = JSONReader('poly_metadata')
for record in reader:
    print(record.header.identifier._text)
    mods = record.metadata['{http://www.loc.gov/mods/v3}mods']
    if 'mods_genre' in mods:
        if isinstance(mods.mods_genre, list):
            # several genre entries: print each one
            for genre in mods.mods_genre:
                print(genre._text)
        else:
            # exactly one genre entry
            print(mods.mods_genre._text)
    else:
        print(None)
    print('---')

oai:www.e-rara.ch:1757592
periodical
Text
Zeitschrift
---
oai:www.e-rara.ch:1397203
Text
---
oai:www.e-rara.ch:4709578
None
---
oai:www.e-rara.ch:1758267
None
---
oai:www.e-rara.ch:1757931
None
---
oai:www.e-rara.ch:1395833
periodical
Text
Zeitschrift
---
oai:www.e-rara.ch:2069554
Text
---
oai:www.e-rara.ch:1757425
Text
---
oai:www.e-rara.ch:1396731
Text
---
oai:www.e-rara.ch:1757509
Text
---
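
The list vs. single value issue can also be handled once in a small helper, so that the calling code always receives a list. The following is a minimal sketch; `as_list` is a hypothetical helper name, not a Polymatheia function.

```python
def as_list(value):
    """Normalize None, a single value, or a list to a list without None entries."""
    if value is None:
        return []
    if isinstance(value, list):
        return [item for item in value if item is not None]
    return [value]

# The three shapes that occur in the outputs above
print(as_list(None))
print(as_list('Text'))
print(as_list(['statement of responsibility', None]))
```

Wrapping a `record.get([...])` call in `as_list()` would then allow a plain `for` loop over the result, whatever the record contains.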


## 2 Direct metadata access via OAI-PMH 

Unfortunately, the Polymatheia library doesn't offer methods for *all* OAI verbs. For instance, there is no `ListIdentifiers` method (which delivers only the identifiers of a given set) and no `GetRecord` for retrieving the metadata of a certain item using its e-rara ID.

This is where the common libraries **requests** and **BeautifulSoup** come into play, along with some more manual coding.


### 2.0 Prerequisites

In [37]:
# Load the necessary libraries
import requests                                 # request URLs
from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
import lxml                                     # XML parser supported by bs4
                                                # call with soup(markup, 'lxml-xml' OR 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import pandas as pd                             # pandas is the de facto standard Python library for dataframes
from IPython.display import IFrame              # embed website views in jupyter notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


https://www.e-rara.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

In [38]:
oai = 'https://www.e-rara.ch/oai/'

### 2.1 Start with the native OAI interface

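The very **core of all operations on the OAI interface** will be a small function called `load_xml()`. It simply requests the base URL with the given parameters and parses the response with BeautifulSoup. It can therefore be used with all OAI verbs and their respective parameters. Note that the `"lxml"` parser used here is an HTML parser: it wraps the response in `<html><body>` and lowercases all tag and attribute names (e.g. `resumptiontoken` instead of `resumptionToken`), which the code in the following cells relies on.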

In [39]:
def load_xml(params):
    '''
    Accesses the OAI interface according to given parameters and scrapes its content.
    Parameters:
    All available native OAI verbs and parameter/value pairs.
    '''
    base_url = oai
    response = requests.get(base_url, params=params)
    output_soup = soup(response.content, "lxml")
    return output_soup

You may use it to read out the basic `Identify` response of the OAI interface.

Note that the parameters passed to `load_xml()` are the same as in the plain URL `https://www.e-rara.ch/oai?verb=Identify`: `verb` is the parameter key and `Identify` the parameter value. Such **parameter key-value pairs** are passed as a Python dictionary, written in curly braces.

In [40]:
xml_soup = load_xml({'verb': 'Identify'})
xml_soup

<html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2021-06-23T16:28:36Z</responsedate><request verb="Identify">https://www.e-rara.ch/oai/</request><identify><repositoryname>Visual Library Server</repositoryname><baseurl>https://www.e-rara.ch/oai/</baseurl><protocolversion>2.0</protocolversion><adminemail>issue-erara@library.ethz.ch</adminemail><earliestdatestamp>2009-11-10T09:38:31Z</earliestdatestamp><deletedrecord>no</deletedrecord><granularity>YYYY-MM-DDThh:mm:ssZ</granularity></identify></oai-pmh></body></html>

You can easily compare this with the live response in the `IFrame` below.

In [41]:
IFrame('https://www.e-rara.ch/oai?verb=Identify', width=970, height=330)
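
To see how a parameter dictionary corresponds to the query string of a request URL, the following sketch builds the URL with the standard library. `build_oai_url` is a hypothetical helper for illustration; `requests` assembles an equivalent query string internally when given `params`.

```python
from urllib.parse import urlencode

def build_oai_url(base_url, params):
    """Return the URL that results from appending params as a query string."""
    return base_url + '?' + urlencode(params)

base = 'https://www.e-rara.ch/oai/'
identify_url = build_oai_url(base, {'verb': 'Identify'})
record_url = build_oai_url(base, {'verb': 'GetRecord',
                                  'metadataPrefix': 'oai_dc',
                                  'identifier': 20329783})
print(identify_url)
print(record_url)
```

Printing the URL this way can be handy for pasting a request into the browser while debugging.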

### 2.2 Download metadata records

The same can be done with the `GetRecord` verb, for which `metadataPrefix` and `identifier` are mandatory parameters. Since the identifier is a number, it can be passed as an integer without quotation marks; passing it as a string works just as well and keeps the parameters consistent.

In [42]:
# Example for accessing a single metadata record
# https://www.e-rara.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=20329783

xml_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': 'oai_dc', 'identifier': 20329783})
xml_soup

<html><body><oai-pmh xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responsedate>2021-06-23T16:28:37Z</responsedate><request identifier="20329783" metadataprefix="oai_dc" verb="GetRecord">https://www.e-rara.ch/oai/</request><getrecord><record><header><identifier>oai:www.e-rara.ch:20329783</identifier><datestamp>2018-11-12T15:27:09Z</datestamp><setspec>zut</setspec><setspec>book</setspec></header><metadata><oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"><dc:title>Bibliotheca magica et pneumatica, oder, wissenschaftlich geordnete Bibliographie der wichtigsten in das Gebiet des Zauber-, Wunder-, Geister- 

Again, before downloading anything, create a dedicated folder for the retrieved metadata.

In [43]:
print(os.getcwd())                        # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data


If you need to change the directory, use `os.chdir()`: `os.chdir('<name>')` changes into the given directory (for example a subdirectory), while `os.chdir(os.pardir)` moves up to the parent directory. Just uncomment (and, if necessary, repeat) the commands you need.

In [44]:
#os.chdir(os.pardir)                      # change to parent directory
#os.chdir(...)                            # change to subdirectory '...'
os.makedirs('metadata', exist_ok=True)    # make folder 'metadata'
os.chdir('metadata')                      # change to folder 'metadata'
print(os.getcwd()) 

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\metadata


You might want to **download the metadata record directly** by its e-rara ID and in a specified metadata format. The `download_record()` function does this for you easily. If you choose no format, *MODS* will be delivered.

In [45]:
def download_record(ID, metadataPrefix='mods'):
    '''
    Downloads a certain metadata record from OAI to a single XML file.
    Throws a notice if metadata file already exists and leaves the existing one.
    Parameters:
    ID = E-rara ID of the wanted record.
    metadataPrefix = Metadata format to be delivered. Default value is MODS.
    '''
    path = os.getcwd()
    output_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': metadataPrefix, 'identifier': ID})
    outfile = path + '/{}.xml'.format(ID) 
    try:
        # mode 'x' refuses to overwrite an existing file
        with open(outfile, mode='x', encoding='utf-8') as f:
            f.write(output_soup.decode())
            print("Metadata file {}.xml saved".format(ID))
    except FileExistsError:
        print("Metadata file {}.xml exists already".format(ID))

In [46]:
# Example for downloading a single metadata record
# https://www.e-rara.ch/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=20329783

download_record(20329783, 'oai_dc')

Metadata file 20329783.xml saved
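
Note that `download_record()` saves whatever the interface returns, including error responses. OAI-PMH reports problems in an `<error>` element with a `code` attribute (for example `idDoesNotExist`). The sketch below checks a response for such elements; it swaps in the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, `oai_errors` is a hypothetical helper name, and the sample response and its message text are hand-made for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'

def oai_errors(xml_text):
    """Return a list of (code, message) tuples for all <error> elements in a response."""
    root = ET.fromstring(xml_text)
    return [(el.get('code'), (el.text or '').strip())
            for el in root.iter(OAI_NS + 'error')]

# Hand-made sample responses (not real server output)
sample_error = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
                '<request verb="GetRecord">https://www.e-rara.ch/oai/</request>'
                '<error code="idDoesNotExist">No matching identifier</error>'
                '</OAI-PMH>')
sample_ok = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
             '<request verb="Identify">https://www.e-rara.ch/oai/</request>'
             '<Identify><repositoryName>Visual Library Server</repositoryName></Identify>'
             '</OAI-PMH>')

print(oai_errors(sample_error))
print(oai_errors(sample_ok))
```

With a live request, the raw bytes in `response.content` could be passed to `oai_errors()` before deciding whether to write the file.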


### 2.3 Get set size & download metadata by set

Scraping the OAI interface output directly runs into a problem with larger data volumes: the output is **split into segments of ten records, each delivered as a separate response page**. Looking at a sample request with the `ListIdentifiers` verb, you will find the `resumptionToken` element, which carries the `completeListSize` attribute as well as the resumption token itself. The [resumption token](http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl) is required to access the next page, which in turn includes a resumption token for the page after it, and so on.

In [47]:
# Scroll to the end of the page for the resumption token
IFrame('https://www.e-rara.ch/oai?verb=ListIdentifiers&set=russexil&metadataPrefix=oai_dc', width=970, height=300)

We can use this information to define a function `setsize()` which reads out the size of a certain set.

In [48]:
def setsize(Set):  
    '''
    Accesses the OAI interface and retrieves the size of a given OAI set.
    Parameters:
    Set: The 'setSpec' short cut of the desired OAI set.
    '''
    base_url = oai
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': 'oai_dc', 'set': Set}
    
    # Basic function
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        Parameters:
        All available native OAI verbs and parameter/value pairs.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup
    
    xml_soup = load_xml(listsearch_term)
    if xml_soup.resumptiontoken:
        set_size = int(xml_soup.resumptiontoken['completelistsize'])
    else:
        set_size = len(xml_soup.find_all('identifier'))
    return set_size

In [49]:
setsize('russexil')

320
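
Since the identifier list is delivered in pages of ten, the set size also tells us how many requests a complete harvest will take. A minimal sketch of that arithmetic follows; `pages_needed` is a hypothetical helper, and the page size of ten is the value observed for e-rara above, which may differ for other repositories.

```python
import math

def pages_needed(set_size, page_size=10):
    """Number of result pages required to list set_size identifiers."""
    return math.ceil(set_size / page_size)

for size in (320, 19, 20, 5, 0):
    print(size, 'records ->', pages_needed(size), 'page(s)')
```

Note the true division (`/`) inside `math.ceil()`: with integer division (`//`) the rounding up would have no effect.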

With `setsize()` and the Polymatheia functions from section 1, it is also possible to **read out all the set sizes** of the OAI interface at once.

In [50]:
from polymatheia.data.reader import OAISetReader
from polymatheia.data.writer import PandasDFWriter

reader = OAISetReader(oai)
setspec = list(reader)                                       # collect all set records
df = PandasDFWriter().write(setspec)
df['setSize'] = [setsize(spec) for spec in df['setSpec']]    # one OAI request per set
df.style

Unnamed: 0,setSpec,setName,setSize
0,frc_g,BCU Fribourg (GLN),0
1,elibch,Alle Bibliotheken,0
2,zbs,Zentralbibliothek Solothurn,77
3,sbs,Stadtbibliothek Schaffhausen,68
4,astrorara,Astronomie-rara,0
5,bau_1,UB Basel (DSV01),13185
6,kbg,Kantonsbibliothek Graubünden,207
7,nep_r,"Bibliothèque des Pasteurs, BPU Neuchâtel (RERO)",9
8,astrozut,ETH-Bibliothek Zürich,0
9,frc_r,BCU Fribourg (RERO),46


Because the results are split into pages linked by resumption tokens, bulk downloading metadata directly from the OAI interface is a bit more complex. With `retrieve_set_metadata()` we create a function to **retrieve all metadata records of a set** in a given format and save the XML files to a designated folder.

In [51]:
print(os.getcwd())                        # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\metadata


In [52]:
def retrieve_set_metadata(Set, foldername, metadataPrefix='mods'):
    '''
    Downloads metadata records of a given set and in a given format from OAI to XML files
    in a designated folder.
    Therefore it
    * requests e-rara OAI-PMH interface according to a set 
    * creates a folder for the records according to parameter foldername
    * retrieves the set's e-rara IDs
    * retrieves metadata according to IDs and given metadata format (default: MODS)
    * saves metadata to single <e-rara ID>.xml files in the folder.
    Parameters:
    Set = The 'setSpec' short cut of the desired OAI set.
    foldername = The name of the folder which will be created to hold the metadata files.
    metadataPrefix = Metadata format to be delivered. Default value is MODS.
    '''
    start = time.perf_counter()

    # Set parameters to the interface
    base_url = oai
    recordsearch_term = {'verb': 'GetRecord', 'metadataPrefix': metadataPrefix}
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': metadataPrefix, 'set': Set}
    
    # Make a folder <metadata> with subfolder named like the set to store files in it
    path = os.getcwd() + '/' + foldername
    try:
        os.makedirs(path, exist_ok = True)
        print("Path {} is already available or created successfully".format(path))
    except OSError as error:
        print("Path {} can not be created".format(path))
    
        
    # Basic functions
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup

    def download_record(ID):
        '''
        Downloads a certain metadata record from OAI to a single XML file.
        Throws a notice if metadata file already exists and leaves the existing one.
        Parameter:
        ID = E-rara ID of the desired record.
        '''
        output_soup = load_xml({'verb': 'GetRecord', 'metadataPrefix': metadataPrefix, 'identifier': ID})
        outfile = path + '/{}.xml'.format(ID) 
        try:
            with open(outfile, mode='x', encoding='utf-8') as f:
                    f.write(output_soup.decode())
        except FileExistsError:
                print("Metadata file {}.xml exists already".format(ID))
        finally:
                pass

    # Start with the first access to OAI interface - get the item IDs of a set
    xml_soup = load_xml(listsearch_term)

    # Calculate how many accesses it takes to go through all the pages of the results list, print notice
    # (sets that fit on a single page come without a resumption token)
    if xml_soup.resumptiontoken:
        list_size = int(xml_soup.resumptiontoken['completelistsize'])
    else:
        list_size = len(xml_soup.find_all('identifier'))
    splits = math.ceil(list_size / 10)
    print(list_size, 'identifiers to request in ', splits, 'data splits')
    

    for i in range(splits):
        if i == 0:
            # First access for item IDs - first page + information about whole length of results list
            xml_soup_new = load_xml(listsearch_term)      
        else:
            # Following accesses for item IDs
            xml_soup_new = load_xml({'verb': 'ListIdentifiers', 'resumptionToken': resumption_token})

        # Scraping out the e-rara IDs
        ids = [] 
        for ID in [(i.contents[0]) for i in xml_soup_new.find_all('identifier')]:
            match = re.search(r'oai:www\.e-rara\.ch:(\d+)', ID)   # extract the number following 'oai:www.e-rara.ch:'
            if match:
                ids.append(match.group(1))     # first parenthesized subgroup of group() = number

        # Download the MODS metadata records according to retrieved e-rara IDs
        print('Start retrieving metadata for e-rara IDs ', ids)  
        for ID in ids:
            download_record(ID)
        ids = []

        # Update the resumption token to retrieve the next page
        try:
            new_token = xml_soup_new.find('resumptiontoken').get_text()
            resumption_token = new_token
            print('New resumption token:', resumption_token)
        except AttributeError:
            print('Reached end of IDs/results list')       # notice when last page is done

    with os.scandir(path) as entries:
        count = 0
        for entry in entries:
            count += 1       
    print("{} metadata files in {}".format(count, path))
    finish = time.perf_counter()
    print("Finished in {} second(s)".format(round(finish - start, 2)))

In [53]:
# Just choose the appropriate set short cut, the desired folder name and metadata format
retrieve_set_metadata('snm', 'e-rara_Nationalmuseum', 'oai_dc')

Path C:\Users\kwoit\Documents\GitHub\e-rara-access\data\metadata/e-rara_Nationalmuseum is already available or created successfully
19 identifiers to request in  2 data splits
Start retrieving metadata for e-rara IDs  ['19682232', '21806832', '19571743', '19571670', '19571862', '19571979', '19572108', '19668657', '19668871', '19668984']
New resumption token: 0x17704e76c786fc03363ee26700362373-cursor_p_3D10_p_26set_p_3Dsnm_p_26metadataPrefix_p_3Doai_dc_p_26batch_size_p_3D11
Start retrieving metadata for e-rara IDs  ['19669094', '19669201', '19669322', '19669447', '19669572', '19669648', '21807225', '24463697', '24463840']
Reached end of IDs/results list
19 metadata files in C:\Users\kwoit\Documents\GitHub\e-rara-access\data\metadata/e-rara_Nationalmuseum
Finished in 3.12 second(s)
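
The paging logic can also be written without precomputing the number of splits, by simply following resumption tokens until none is returned. The sketch below shows this flow under some assumptions: it uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, the names `parse_identifier_page`, `iter_identifiers` and `fake_fetch` are hypothetical, and the two pages are hand-made stand-ins for real responses (their IDs are taken from the output above).

```python
import xml.etree.ElementTree as ET

OAI_NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

def parse_identifier_page(xml_text):
    """Return (numeric IDs, resumption token or None) for one ListIdentifiers page."""
    root = ET.fromstring(xml_text)
    ids = [el.text.rsplit(':', 1)[-1]
           for el in root.iterfind('.//oai:header/oai:identifier', OAI_NS)]
    token_el = root.find('.//oai:resumptionToken', OAI_NS)
    token = (token_el.text or '').strip() if token_el is not None else ''
    return ids, token or None

def iter_identifiers(fetch_page):
    """Yield all IDs; fetch_page(token) returns XML text, token is None for the first page."""
    token = None
    while True:
        ids, token = parse_identifier_page(fetch_page(token))
        yield from ids
        if token is None:          # no (or empty) token: last page reached
            break

# Hand-made pages standing in for real responses
PAGE_1 = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
          '<header><identifier>oai:www.e-rara.ch:19682232</identifier></header>'
          '<header><identifier>oai:www.e-rara.ch:21806832</identifier></header>'
          '<resumptionToken completeListSize="3">token-1</resumptionToken>'
          '</ListIdentifiers></OAI-PMH>')
PAGE_2 = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
          '<header><identifier>oai:www.e-rara.ch:19571743</identifier></header>'
          '</ListIdentifiers></OAI-PMH>')

def fake_fetch(token):
    """Simulate the interface: first page without token, second page with one."""
    return PAGE_1 if token is None else PAGE_2

all_ids = list(iter_identifiers(fake_fetch))
print(all_ids)
```

Against the live interface, `fetch_page` could wrap a `requests.get()` call that sends the `set` and `metadataPrefix` parameters for the first page and only `verb` plus `resumptionToken` for all following pages, returning `response.content`.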


## 3 Download fulltext files from e-rara website

### 3.0 Prerequisites

In [54]:
# Load the necessary libraries
import requests                                 # request URLs
from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
import lxml                                     # XML parser supported by bs4
                                                # call with soup(markup, 'lxml-xml' OR 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import pandas as pd                             # pandas is the de facto standard Python library for dataframes
from IPython.display import IFrame              # embed website views in jupyter notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


https://www.e-rara.ch/oai/ will be the **base URL** for all OAI requests. To make life easier we put it into the variable `oai`.

In [55]:
oai = 'https://www.e-rara.ch/oai/'

### 3.1 Download fulltext files by e-rara ID

Fulltexts can be downloaded from the e-rara website. For e-rara items that have a fulltext file available, a link is provided in the *Links* > *Download* section of the item page.

In [56]:
IFrame('https://www.e-rara.ch/content/titleinfo/20329783', width=970, height=300)

First, a new directory `fulltexts` is created next to the `metadata` folder.

In [57]:
print(os.getcwd())                                # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\metadata


If you need to change the directory, use `os.chdir()`: `os.chdir('<name>')` changes into the given directory (for example a subdirectory), while `os.chdir(os.pardir)` moves up to the parent directory.

In [58]:
os.chdir(os.pardir)                               # change to parent directory
os.makedirs('fulltexts', exist_ok=True)           # make new folder 'fulltexts'
os.chdir('fulltexts')                             # change to 'fulltexts' folder
print(os.getcwd())

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts


A single fulltext file can be retrieved by a given e-rara ID with the following function `download_fulltext()`. Note that **for fulltexts a different base URL**  - in combination with the given e-rara ID - has to be used: `https://www.e-rara.ch/download/fulltext/plain/`.

In [59]:
def download_fulltext(ID):
    '''
    Downloads a certain fulltext file of TXT format.
    Builds with e-rara ID the fulltext URL, reads the TXT and saves it to 
    <e-rara ID>.txt file on local disk.
    Parameter:
    ID = E-rara ID of the desired record.
    '''
    baseurl_fulltext = "https://www.e-rara.ch/download/fulltext/plain/"
    webadd = baseurl_fulltext + str(ID)
    response = requests.get(webadd) 
    soup_out = soup(response.text, 'html.parser')
    outfile = '{}.txt'.format(ID)
    
    try:
        # mode 'x' refuses to overwrite an existing file
        with open(outfile, 'x', encoding='utf-8') as f:
            f.write(soup_out.get_text())
            print("Fulltext file {}.txt saved".format(ID))
    except FileExistsError:
        print("Fulltext file {}.txt exists already".format(ID))
    except OSError:
        # any other file system problem, e.g. missing write permission
        print("Saving fulltext file {}.txt failed".format(ID))

In [60]:
# Retrieving example fulltexts by e-rara IDs
e_rara_ids = [20329783, 6156847, 6094442, 6125674]

for identifier in e_rara_ids: 
    download_fulltext(identifier)

Fulltext file 20329783.txt saved
Fulltext file 6156847.txt saved
Fulltext file 6094442.txt saved
Fulltext file 6125674.txt saved


We can then read the files from local disk.

In [61]:
with open('20329783.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
print(fulltext)

m. ‘ ■* ■VÆ:- \ wm ETH-Bibliothek EM000007324054 mm

V A 'V MM MW I v-,v V. MiràSKKZ 8WKWWW Km'»? W« »Ü5E 2 » • WM **C '4M

y .. ' . ä® 'S I V j' }' y I  l'xT' " \ ' ■ \ i Jf/Ayy x - . .. V .'AM V ^ ' y '' / A A W'^ ■ / a- X’ / V r 7 V S ; ' v r‘,.V \- /T'A 'V; V> A' vì'", A yA' ''. V.'; y\,V 'X '- 'À:V'V- ,\ T V';'* * ■:' 'j-\ - ^ y s^X_> ' l^'v ; , ’ y ' 7 ,s t J ' X T ? • ,: . : ■  r ; t, v .s'- t-y '- v ». tttA, :■> ■'' ' 7 ' 'TA>'T.i ■: i y- ! y ■' ■ '/■, y . \' .-‘y I , ' ;\- ■ ■ ‘/ \ y ;■ '} ^ /!  ' <. • y > \ V \C •' ' ' v . \ '' ' J ’ ' y <  . 1 \  ‘ L ' • 1 ■•' • • ■’ H V v V, ,; '  V ; vTy'V AA A a* \ aaa'ta x y. - - .'.V /;■ V ,,y y • ; N- ^ >• r‘vA- / ' fi K*** V < - , A A X ls~ ; .yAA ,y < j y yyx X / -, ' ' / , V. 'V ^ " v ■■• r y- '■ ■'  • 1 "°' ' ? . / • A/ A ' ; '-y y IV ■ ' t / ^ ^ ^ . \ ‘‘yyy ^^ ' ; -a,;-:';■'. 'VTy A if'T: y•: r • * A *.v / .. „y. *fisAA - , A - A; y ■ y ' v\ - Yy. ,, V .' 1 /ly./- ■■. ■ ' ! , / f * \ 1  ? /A' ,  i y - J. ■ !}■{■ v- -t , r '■ y, ï'

But when we try to read another of the downloaded fulltext files, we face a problem: there is nothing in it!
The explanation is simple: the file was downloaded successfully, but it *is* just an empty file. We can check this at the source.

Note that not all documents on the e-rara platform have an OCR-based fulltext file available.

In [62]:
# Empty file! Resulting from an empty fulltext page
with open('6125674.txt', 'r', encoding='utf-8') as f:
    fulltext = f.read()
fulltext

''

In [63]:
# Looking at the source fulltext page online
IFrame('https://www.e-rara.ch/download/fulltext/plain/6125674', width=970, height=100)

So, if you **don't know which of the e-rara items have fulltext files available** and which do not, the following cell shows a way to get rid of the resulting empty files.

In [64]:
print(os.getcwd())                 # print current working directory

C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts


In [65]:
# Delete empty files in the current working directory
path = os.getcwd()
count = 0
count_empty = 0
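for entry in os.scandir(path):
    if not entry.is_file():
        continue                                  # skip subfolders
    if os.path.getsize(entry.path) == 0:
        print("File {} is empty".format(entry.name))
        os.remove(entry.path)                     # full path, independent of the working directory
        count_empty += 1
    else:
        count += 1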
print("{} empty fulltext files in {} deleted".format(count_empty, path))
print("{} fulltext files in {}".format(count, path))

File 6094442.txt is empty
File 6125674.txt is empty
2 empty fulltext files in C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts deleted
2 fulltext files in C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts
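
Alternatively, empty files can be avoided in the first place by checking the text before writing it. The following is a minimal sketch of that idea; `save_text_if_present` is a hypothetical helper and the demonstration writes into a temporary folder rather than the `fulltexts` directory.

```python
import os
import tempfile

def save_text_if_present(text, outfile):
    """Write text to outfile unless it is empty or whitespace only.

    Returns True if a file was written, False otherwise.
    """
    if not text or not text.strip():
        return False
    with open(outfile, 'x', encoding='utf-8') as f:   # 'x' refuses to overwrite
        f.write(text)
    return True

# Demonstration in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    empty_target = os.path.join(folder, 'empty.txt')
    full_target = os.path.join(folder, 'full.txt')
    wrote_empty = save_text_if_present('', empty_target)
    wrote_full = save_text_if_present('ETH-Bibliothek', full_target)
    empty_exists = os.path.exists(empty_target)
    with open(full_target, encoding='utf-8') as f:
        stored = f.read()

print(wrote_empty, empty_exists, wrote_full, stored)
```

In `download_fulltext()`, the text returned by `soup_out.get_text()` could be passed through such a check instead of being written unconditionally.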


### 3.2 Download fulltext files by set

Finally, let's build a function `retrieve_set_fulltexts` to retrieve **all fulltexts of a certain e-rara set**. Note that empty fulltext files are also retrieved and stored temporarily, but they are cleaned up at the end.

**WARNING**: Note that e-rara sets can be large, so better **check the set size** before starting the download.

In [66]:
def setsize(Set):  
    base_url = oai
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': 'oai_dc', 'set': Set}
    
    # Basic function
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup
    
    xml_soup = load_xml(listsearch_term)
    if xml_soup.resumptiontoken:
        set_size = int(xml_soup.resumptiontoken['completelistsize'])
    else:
        set_size = len(xml_soup.find_all('identifier'))
    return set_size

In [67]:
setsize('zhdk')

31

In [68]:
def retrieve_set_fulltexts(Set, foldername):
    '''
    Downloads the TXT fulltext files of a given OAI set to a local folder.
    Builds the fulltext URLs from the e-rara IDs, reads the TXT and saves each one
    as <e-rara ID>.txt on local disk.
    To do so, it
    * queries the e-rara OAI-PMH interface for the given set
    * creates a folder named after the parameter foldername
    * retrieves the set's e-rara IDs from the OAI interface
    * retrieves the fulltexts for these IDs from the e-rara website
    * writes each fulltext to its own <e-rara ID>.txt file in the folder
    * finally checks all fulltext files in the folder and deletes the empty ones.
    Parameters:
    Set = The 'setSpec' shortcut of the desired set.
    foldername = The name of the folder that will be created to hold the fulltext files.
    '''
    start = time.perf_counter()

    # Set parameters to the interface
    base_url = oai
    baseurl_fulltext = "https://www.e-rara.ch/download/fulltext/plain/"
    listsearch_term = {'verb': 'ListIdentifiers', 'metadataPrefix': 'oai_dc', 'set': Set}
    
    # Create the folder <foldername> inside the current working directory to store the files
    path = os.getcwd() + '/' + foldername
    try:
        os.makedirs(path, exist_ok = True)
        print("Path {} is already available or created successfully".format(path))
    except OSError:
        print("Path {} cannot be created".format(path))
        raise  # without the folder, no fulltext can be saved
           
    # Basic functions
    def load_xml(params):
        '''
        Accesses the OAI interface according to given parameters and scrapes its content.
        '''
        response = requests.get(base_url, params=params)
        output_soup = soup(response.content, "lxml")
        return output_soup

    def download_fulltext(ID):
        '''
        Downloads fulltext from e-rara website to a single TXT file.
        Prints a notice if the fulltext file already exists and leaves it untouched.
        Parameter:
        ID = E-rara ID of the desired record.
        '''
        webadd = baseurl_fulltext + str(ID)
        response = requests.get(webadd)
        soup_out = soup(response.text, 'html.parser')
        outfile = path + '/{}.txt'.format(ID)
        try:
            # Mode 'x' raises FileExistsError instead of overwriting an existing file
            with open(outfile, 'x', encoding='utf-8') as f:
                f.write(soup_out.get_text())
            print("Fulltext file {}.txt saved".format(ID))
        except FileExistsError:
            print("Fulltext file {}.txt exists already".format(ID))
        except OSError:
            print("Saving fulltext file {}.txt failed".format(ID))
            
    # Start with the first access to OAI interface
    xml_soup = load_xml(listsearch_term)

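    # Calculate how many accesses it takes to go through all the pages of the results list, print notice.
    # Sets that fit on a single page come without a resumption token.
    if xml_soup.resumptiontoken:
        list_size = int(xml_soup.resumptiontoken['completelistsize'])
    else:
        list_size = len(xml_soup.find_all('identifier'))
    splits = max(1, math.ceil(list_size / 10))   # assumes 10 identifiers per page
    print(list_size, 'identifiers to request in ', splits, 'data splits')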
            
            
    for i in range(splits):
        if i == 0:
            # The first page was already loaded above, so reuse it instead of requesting it again
            xml_soup_new = xml_soup
        else:
            # Following accesses to OAI for e-rara IDs
            xml_soup_new = load_xml({'verb': 'ListIdentifiers', 'resumptionToken': resumption_token})

        # Scraping out the e-rara IDs
        e_rara_ids = []
        for identifier in [tag.contents[0] for tag in xml_soup_new.find_all('identifier')]:
            match = re.search(r'oai:www\.e-rara\.ch:(\d+)', identifier) # extract number following 'oai:www.e-rara.ch:'
            if match:
                e_rara_ids.append(match.group(1))       # group(1) = first parenthesized subgroup = number

        # Download the fulltexts according to retrieved e-rara IDs
        print('Start retrieving fulltetxts for e-rara IDs ', e_rara_ids) 
        for e_rara_id in e_rara_ids:
            download_fulltext(e_rara_id)
        e_rara_ids = []

        # Update the resumption token to retrieve the next page
        try:
            resumption_token = xml_soup_new.find('resumptiontoken').get_text()
            print('New resumption token:', resumption_token)
        except AttributeError:
            print('Reached end of IDs/results list')       # notice when last page is done
        
    # Clean up empty files
    count = 0
    count_empty = 0
    for entry in os.scandir(path):
        if os.path.getsize(entry) == 0:
            print("File {} is empty".format(entry.name))
            os.remove(path + '/' + entry.name)
            count_empty += 1
        else: 
            count += 1
    print("{} empty fulltext file(s) in {} deleted".format(count_empty, path))
    print("{} fulltext file(s) in {}".format(count, path))

    finish = time.perf_counter()
    print("Finished in {} second(s)".format(round(finish - start, 2)))
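
`retrieve_set_fulltexts` works out the number of pages up front. An alternative is to follow the resumption tokens until the server stops returning one, which does not depend on the page size. The following is a minimal sketch of that idea using only the standard library, not the notebook's own approach: the names `parse_page`, `iter_ids` and `fake_fetch` are hypothetical, and the two sample responses with the IDs 111 and 222 are invented so that the example runs without network access.

```python
import re
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def parse_page(xml_text):
    """Return (numeric e-rara IDs, resumption token or None) for one OAI response."""
    root = ET.fromstring(xml_text)
    ids = []
    for node in root.iterfind(".//oai:identifier", OAI_NS):
        match = re.search(r"oai:www\.e-rara\.ch:(\d+)", node.text or "")
        if match:
            ids.append(match.group(1))
    token_node = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken element marks the last page
    token = (token_node.text or "").strip() if token_node is not None else ""
    return ids, token or None

def iter_ids(fetch, first_params):
    """Yield IDs page by page; fetch(params) must return the response body as text."""
    params = first_params
    while params is not None:
        ids, token = parse_page(fetch(params))
        yield from ids
        params = {"verb": "ListIdentifiers", "resumptionToken": token} if token else None

# Invented sample responses standing in for the OAI interface
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>
<header><identifier>oai:www.e-rara.ch:111</identifier></header>
<resumptionToken completeListSize="2">tok-1</resumptionToken>
</ListIdentifiers></OAI-PMH>"""
PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>
<header><identifier>oai:www.e-rara.ch:222</identifier></header>
<resumptionToken completeListSize="2"/>
</ListIdentifiers></OAI-PMH>"""

def fake_fetch(params):
    return PAGE_2 if "resumptionToken" in params else PAGE_1

first = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc", "set": "zhdk"}
print(list(iter_ids(fake_fetch, first)))  # ['111', '222']
```

Against the real interface, `fetch` could be something like `lambda params: requests.get(oai, params=params).text`, with `oai` being the base URL variable used throughout this notebook.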
    

In [69]:
retrieve_set_fulltexts('zhdk', 'e-rara_ZHDK')

Path C:\Users\kwoit\Documents\GitHub\e-rara-access\data\fulltexts/e-rara_ZHDK is already available or created successfully
31 identifiers to request in  4 data splits
Start retrieving fulltetxts for e-rara IDs  ['24464158', '24464392', '24464517', '24464637', '24537418', '24537593', '24537791', '24538034', '24770753', '24770918']
Fulltext file 24464158.txt saved
Fulltext file 24464392.txt saved
Fulltext file 24464517.txt saved
Fulltext file 24464637.txt saved
Fulltext file 24537418.txt saved
Fulltext file 24537593.txt saved
Fulltext file 24537791.txt saved
Fulltext file 24538034.txt saved
Fulltext file 24770753.txt saved
Fulltext file 24770918.txt saved
New resumption token: 0x29c03ab5a09eaf8245bf06f379f4c64a-cursor_p_3D10_p_26set_p_3Dzhdk_p_26metadataPrefix_p_3Doai_dc_p_26batch_size_p_3D11
Start retrieving fulltetxts for e-rara IDs  ['24771232', '24771517', '24567987', '24573367', '24573559', '24573632', '24573804', '24573911', '24764534', '24765325']
Fulltext file 24771232.txt saved
