# Examples of protocols for harvesting metadata from Zenodo

---------------------
#### Notebook outline 
 - Zenodo OAI-PMH protocol
 - Zenodo REST API
     - Explore the REST API answer (payload) with the `requests` library
     - Using `eossr` library
 - Pros and cons of both methods
 
---------------------

## Pros and cons of each method
 - Using OAI-PMH for harvesting:
    + $+$ More efficient harvest:
       - faster,
       - designed for large and continuous queries of a repository.
    + $-$ The available metadata representations are decided by the data provider.
 - Using the REST API:
    + $+$ Access to the full entry/record/community information.
    + $-$ Harvest not optimised for large searches.
 

## OAI-PMH protocol

#### First, have a look at this [tutorial on the protocol](https://indico.cern.ch/event/5710/sessions/108048/attachments/988151/1405129/Simeon_tutorial.pdf).

The [OAI-PMH protocol](https://www.openarchives.org/pmh/) combines a base URL with request arguments (a 'verb' plus its parameters) to query a data provider for the metadata representation(s) of its records.

In the case of Zenodo, the base URL is https://zenodo.org/oai2d.

For example:
- to retrieve all the entries (`verb=ListRecords`)
- belonging to escape2020 community (`set=user-escape2020`)
- in the OAI DataCite metadata representation (`metadataPrefix=oai_datacite`)     


https://zenodo.org/oai2d?verb=ListRecords&set=user-escape2020&metadataPrefix=oai_datacite


Second example:
- to obtain a single entry (`verb=GetRecord`)
- of a given Zenodo record, identified by its entry ID (`identifier=oai:zenodo.org:4105896`)
- in the Dublin Core metadata representation (`metadataPrefix=oai_dc`)
 
https://zenodo.org/oai2d?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:zenodo.org:4105896
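
The two request URLs above can also be assembled programmatically. The following is a minimal sketch that only builds the URLs (it does not contact Zenodo); the helper name `build_oai_url` is hypothetical and not part of any library.

```python
from urllib.parse import urlencode

OAI_BASE_URL = "https://zenodo.org/oai2d"


def build_oai_url(verb, **arguments):
    """Assemble an OAI-PMH request URL from a verb and its arguments."""
    query = {"verb": verb, **arguments}
    # Keep ':' unescaped so that identifiers stay readable in the URL
    return f"{OAI_BASE_URL}?{urlencode(query, safe=':')}"


# All records of the escape2020 community, in the OAI DataCite representation
list_url = build_oai_url("ListRecords", set="user-escape2020",
                         metadataPrefix="oai_datacite")

# A single record, in the Dublin Core representation
get_url = build_oai_url("GetRecord", metadataPrefix="oai_dc",
                        identifier="oai:zenodo.org:4105896")

print(list_url)
print(get_url)
```

Either URL can then be opened in a browser or passed to `requests.get`; the answer is an XML document.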


### Example with the OAI-PMH protocol: a Python OAI harvester

```
pip install oaiharvest
oai-harvest -h

# Examples of usage
oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d oai_dc
oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d oai_datacite4
oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d datacite3

# Example of output
$ oai-harvest https://zenodo.org/oai2d -s "user-escape2020" -d datacite3
$ cd datacite3
$ ls
oai:zenodo.org:1689986.oai_dc.xml oai:zenodo.org:3884963.oai_dc.xml
oai:zenodo.org:2533132.oai_dc.xml oai:zenodo.org:3967386.oai_dc.xml
oai:zenodo.org:2542652.oai_dc.xml oai:zenodo.org:4012169.oai_dc.xml
oai:zenodo.org:2542664.oai_dc.xml oai:zenodo.org:4028908.oai_dc.xml
oai:zenodo.org:3356656.oai_dc.xml oai:zenodo.org:4044010.oai_dc.xml
oai:zenodo.org:3362435.oai_dc.xml oai:zenodo.org:4055176.oai_dc.xml
oai:zenodo.org:3572655.oai_dc.xml oai:zenodo.org:4105896.oai_dc.xml
oai:zenodo.org:3614662.oai_dc.xml oai:zenodo.org:4311271.oai_dc.xml
oai:zenodo.org:3659184.oai_dc.xml oai:zenodo.org:4419866.oai_dc.xml
oai:zenodo.org:3675081.oai_dc.xml oai:zenodo.org:4601451.oai_dc.xml
oai:zenodo.org:3734091.oai_dc.xml oai:zenodo.org:4687123.oai_dc.xml
oai:zenodo.org:3743489.oai_dc.xml oai:zenodo.org:4786641.oai_dc.xml
oai:zenodo.org:3743490.oai_dc.xml oai:zenodo.org:4790629.oai_dc.xml
oai:zenodo.org:3854976.oai_dc.xml
$ cat <FILE>
```
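
Instead of reading a harvested file with `cat`, its content can be parsed in Python. The sketch below uses the standard library and assumes the file holds a Dublin Core (`oai_dc`) record. `SAMPLE_RECORD` is a hand-written, hypothetical record used only for illustration, and `summarise_dublin_core` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by the oai_dc (Dublin Core) representation
NAMESPACES = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written record used only to illustrate the parsing
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Doe, John</dc:creator>
  <dc:identifier>https://example.org/record/1</dc:identifier>
</oai_dc:dc>
"""


def summarise_dublin_core(xml_text):
    """Return the title, creators and identifiers found in an oai_dc record."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext(".//dc:title", namespaces=NAMESPACES),
        "creators": [el.text for el in root.iterfind(".//dc:creator", NAMESPACES)],
        "identifiers": [el.text for el in root.iterfind(".//dc:identifier", NAMESPACES)],
    }


summary = summarise_dublin_core(SAMPLE_RECORD)
print(summary)

# For a harvested file, read it from disk first, for example:
# summary = summarise_dublin_core(open("<FILE>", encoding="utf-8").read())
```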


No token is needed to fetch the metadata files provided by Zenodo (the provider).
However, please note that the **metadata schema representation of the records is chosen by the provider!**  
 
Zenodo supports the following schema representations:
 - `DataCite` (several versions),
 - `Dublin Core`,
 - `MARC21`.

However, it **does not provide** metadata under the `codemeta.json` schema.
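
The representations offered by a provider can also be requested from the provider itself: the OAI-PMH specification defines a `ListMetadataFormats` verb, so the request would be `https://zenodo.org/oai2d?verb=ListMetadataFormats`. The sketch below extracts the metadata prefixes from such an answer. `SAMPLE_RESPONSE` is a trimmed, hand-written illustration (not copied from Zenodo), and `list_metadata_prefixes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH response envelope
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed, illustrative response (not copied from Zenodo)
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>oai_datacite</metadataPrefix>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""


def list_metadata_prefixes(xml_text):
    """Return every metadataPrefix listed in a ListMetadataFormats answer."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//oai:metadataPrefix", OAI_NS)]


prefixes = list_metadata_prefixes(SAMPLE_RESPONSE)
print(prefixes)
```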
 

## Zenodo's REST API

In [None]:
import requests

We need to pass some query parameters to narrow the search:

In [None]:
parameters = {'communities': 'escape2020',  # restrict the search to the escape2020 community
              'size': 100}                  # number of entries returned per page

**NOTE:** No token is needed to read from the REST API.
However, you need to [create one](https://zenodo.org/account/settings/applications/) if you would like to write or publish through the API.

### Example with the `requests` lib

How can we retrieve all the records of the ESCAPE2020 community?

In [None]:
escape2020 = requests.get('https://zenodo.org/api/records', params=parameters).json()
escape2020.keys()

Let's explore the REST API payload to find the desired information.

In [None]:
# Nice summary of the request we just made
escape2020['aggregations']

In [None]:
# Total number of entries in the payload
print(escape2020['hits'].keys())
print(escape2020['hits']['total'])

In [None]:
all_entries = escape2020['hits']['hits']
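
A single request returns at most `size` entries, so a community with more records than that needs several requests. The sketch below assumes that the API accepts a `page` parameter alongside `size`. Both `fetch_page` and `iter_entries` are hypothetical helper names. The loop stops at the first page that comes back with fewer than `size` entries.

```python
def fetch_page(page, size=100, community="escape2020"):
    """Fetch one page of records from the REST API (needs network access)."""
    # Imported here so that iter_entries can be reused with any fetch function
    import requests

    # Assumption: 'page' selects the page of results, 'size' its length
    params = {"communities": community, "size": size, "page": page}
    response = requests.get("https://zenodo.org/api/records",
                            params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_entries(fetch, size=100):
    """Yield entries page by page until a page is shorter than `size`."""
    page = 1
    while True:
        hits = fetch(page)["hits"]["hits"]
        yield from hits
        if len(hits) < size:
            break
        page += 1


# Usage (performs real requests, hence left commented out):
# all_entries = list(iter_entries(fetch_page, size=100))
```

Passing the fetch function as an argument keeps the paging logic independent of the HTTP library, which also makes it easy to try out with an in-memory stand-in.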

In [None]:
# The content of the first entry of the payload. It contains all the info that we can also find on Zenodo
all_entries[0]

In [None]:
# Example: retrieve the entry IDs and titles
for entry in all_entries:
    print(f"{entry['id']} \t {entry['metadata']['title']}")

In [None]:
# Example: print the keywords of each entry (entries without keywords are skipped)
for entry in all_entries:
    try:
        print(f"{entry['id']} \t {entry['metadata']['keywords']}")
    except KeyError:
        pass
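
The same loop can be written without `try`/`except` by using `dict.get` with a default value. The sketch below goes one step further and counts how often each keyword appears. `count_keywords` is a hypothetical helper name, and `sample_entries` is made-up data that only mimics the structure of the payload.

```python
from collections import Counter


def count_keywords(entries):
    """Count how often each keyword appears across a list of entries."""
    counts = Counter()
    for entry in entries:
        # .get() returns an empty list when an entry has no 'keywords' field
        counts.update(entry["metadata"].get("keywords", []))
    return counts


# Hypothetical entries that mimic the structure of the payload
sample_entries = [
    {"id": 1, "metadata": {"title": "A", "keywords": ["python", "astronomy"]}},
    {"id": 2, "metadata": {"title": "B"}},
    {"id": 3, "metadata": {"title": "C", "keywords": ["python"]}},
]

keyword_counts = count_keywords(sample_entries)
print(keyword_counts.most_common())
```

With the real payload, `count_keywords(all_entries)` would give the keyword frequencies of the retrieved entries.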

#### A specific ESCAPE2020 entry: `agnpy`.

In [None]:
agnpy = requests.get('https://zenodo.org/api/records/4687123', params=parameters).json()
agnpy.keys()

In [None]:
agnpy['metadata']

In [None]:
for file in agnpy['files']:
    print(file['links']['self'])

A simple `wget` of the URL printed above would download the file uploaded to Zenodo.

Let's see an example with several uploaded files.

In [None]:
ESCAPE_template = requests.get('https://zenodo.org/api/records/4790629', params=parameters).json()

In [None]:
for file in ESCAPE_template['files']:
    print(file['links']['self'])
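
As an alternative to `wget`, the files can be downloaded from Python. This is a minimal sketch using only the standard library. It assumes that each file entry stores its file name under a `key` field (the `links` and `self` fields are the ones used above). `list_files` and `download_record_files` are hypothetical helper names, and `sample_record` is made-up data.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def list_files(record):
    """Return (file name, download URL) pairs for a record payload."""
    # Assumption: each file entry stores its name under 'key'
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def download_record_files(record, directory="."):
    """Download every file of a record into `directory` (needs network access)."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    for name, url in list_files(record):
        # Path(name).name drops any directory part from the file name
        with urlopen(url) as response, open(target / Path(name).name, "wb") as out:
            shutil.copyfileobj(response, out)


# Hypothetical record that mimics the structure of the payload
sample_record = {"files": [
    {"key": "data.csv", "links": {"self": "https://example.org/files/data.csv"}},
    {"key": "README.md", "links": {"self": "https://example.org/files/README.md"}},
]}

print(list_files(sample_record))
```

With a real payload, `download_record_files(ESCAPE_template, "downloads")` would fetch all the files of that record.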

## eossr

The `eossr` library uses the Zenodo REST API.
See the OSSR API notebooks `Explore the OSSR` and `How to upload records to the OSSR` for examples of how to use it.
