# A guide to using the OAI-PMH protocol to harvest metadata and subsequently crawl ETDs from institutional repositories

## Keep a record of the URLs you use in this process, for each repository, in a separate file.
Check the site's robots.txt and obey its crawling instructions. Look at how many requests per second the site allows, and add a delay (sleep) between requests if needed. Usage: url/robots.txt
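
The robots.txt check can be sketched with Python's standard `urllib.robotparser`. This is only a minimal illustration: the robots.txt content, the user-agent name `etd-harvester`, the host `repository.example.edu`, and the one-second fallback delay are all assumptions, not values taken from a real repository.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice, download it from <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

def crawl_policy(robots_txt, url, user_agent="etd-harvester", default_delay=1.0):
    """Return (allowed, delay_seconds) for `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    # Fall back to an assumed default when the site declares no Crawl-delay.
    return allowed, (float(delay) if delay is not None else default_delay)

allowed, delay = crawl_policy(SAMPLE_ROBOTS_TXT, "https://repository.example.edu/do/oai/")
print(allowed, delay)
```

Calling `time.sleep(delay)` between requests is then one simple way to respect the declared crawl delay.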

### Verbs:
1. Identify - This verb is used to retrieve information about a repository.

2. ListMetadataFormats - This verb is used to retrieve the metadata formats available from a repository.

3. ListSets - This verb is used to retrieve the set structure of a repository.

4. ListRecords - This verb is used to harvest records from a repository.

5. GetRecord - This verb is used to retrieve an individual metadata record from a repository.

For information on which verb to use and how to use it, see the documentation at https://www.openarchives.org/OAI/openarchivesprotocol.html
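
Every OAI-PMH request is the endpoint URL plus a `verb` parameter and that verb's arguments. A small helper along these lines can compose the URLs; the function name `build_oai_url` is made up for this sketch, and the verb set also includes `ListIdentifiers`, which the protocol defines alongside the five verbs listed here.

```python
from urllib.parse import urlencode

OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

def build_oai_url(base_url, verb, **arguments):
    """Compose an OAI-PMH request URL from an endpoint, a verb, and its arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    # Keep ':' and '/' readable; both are common in setSpec and identifier values.
    return f"{base_url}?{urlencode(query, safe=':/')}"

print(build_oai_url("https://scholar.afit.edu/do/oai/", "Identify"))
print(build_oai_url("https://scholar.afit.edu/do/oai/", "ListRecords",
                    set="publication:etd", metadataPrefix="oai_dc"))
```

Because `from` is a reserved word in Python, that argument has to be passed as `**{"from": "2020-01-01"}`.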

## Find OAI-PMH endpoint and use verb "Identify"


In [1]:
# To do so, navigate to the institutional repository you want to harvest.
# Look at its URL, then find the OAI-PMH endpoint by appending likely path keywords to that URL.
# For example, https://scholar.afit.edu/do/oai/?verb=Identify or https://vtechworks.lib.vt.edu/server/oai/request?verb=Identify
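
The trial-and-error search for an endpoint can be partly automated. The sketch below generates candidate Identify URLs and checks whether a response body looks like a valid Identify reply. The first two path guesses come from the examples above; the other two, the function names, and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# The first two paths appear in the examples above; the rest are guesses.
CANDIDATE_PATHS = ["do/oai/", "server/oai/request", "oai/request", "oai"]

# Trimmed, hypothetical Identify response.
SAMPLE_IDENTIFY_XML = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>https://repository.example.edu/oai</baseURL>
  </Identify>
</OAI-PMH>
"""

def candidate_identify_urls(repository_url, paths=CANDIDATE_PATHS):
    """List Identify URLs worth trying for a repository home page URL."""
    root = repository_url.rstrip("/") + "/"
    return [urljoin(root, path) + "?verb=Identify" for path in paths]

def parse_identify(xml_text):
    """Return name and base URL if xml_text is an Identify response, else None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        return None
    return {
        "repositoryName": identify.findtext("oai:repositoryName", namespaces=OAI_NS),
        "baseURL": identify.findtext("oai:baseURL", namespaces=OAI_NS),
    }

print(candidate_identify_urls("https://scholar.afit.edu"))
print(parse_identify(SAMPLE_IDENTIFY_XML))
```

The first candidate whose response parses to a non-`None` result is a likely endpoint; the `baseURL` it reports is the one to record.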


## Use verb "ListMetadataFormats" to find the available metadata formats
We need this information to get the metadata record in the format desired and supported by the repository. 

In [2]:
# Examples:
## https://scholar.afit.edu/do/oai/?verb=ListMetadataFormats
## https://vtechworks.lib.vt.edu/server/oai/request?verb=ListMetadataFormats
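
As a rough sketch, the prefixes can be pulled out of a ListMetadataFormats response with the standard library. The sample XML is a trimmed, hypothetical response, and the preference order (`etdms` before `oai_dc`) is an assumption; prefix names other than `oai_dc`, which the protocol requires every repository to support, vary between repositories.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed, hypothetical ListMetadataFormats response.
SAMPLE_FORMATS_XML = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>etdms</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""

def list_metadata_prefixes(xml_text):
    """Return the metadataPrefix values advertised by the repository."""
    root = ET.fromstring(xml_text)
    path = ".//oai:metadataFormat/oai:metadataPrefix"
    return [el.text.strip() for el in root.iterfind(path, OAI_NS) if el.text]

def choose_prefix(available, preferred=("etdms", "oai_dc")):
    """Pick the first preferred prefix the repository offers, or None."""
    for prefix in preferred:
        if prefix in available:
            return prefix
    return None

prefixes = list_metadata_prefixes(SAMPLE_FORMATS_XML)
print(prefixes, choose_prefix(prefixes))
```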

## Use verb "ListSets" to find the ETD metadata endpoint. For this, first retrieve the set list using the ListSets verb, then identify the name of the ETD set.


In [3]:
### For example, for https://scholar.afit.edu/do/oai/?verb=ListSets , the ETD set name is publication:etd 
### For https://vtechworks.lib.vt.edu/server/oai/request?verb=ListSets, the ETD set name is com_10919_5534

# There might be multiple such sets, depending on how the university names and organizes its collections. For example, there
# might be separate collections for master's theses and doctoral dissertations. Make sure to collect the information for all such ETD sets.
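
Scanning a long set list by eye is error-prone, so a keyword filter over a ListSets response can produce a shortlist. This is a sketch under assumptions: the sample XML is hypothetical apart from the `publication:etd` setSpec, and the keyword pattern is a guess. Opaque setSpecs such as `com_10919_5534` only match through their setName, so the results still need a manual check.

```python
import re
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListSets response; only "publication:etd" comes from the examples above.
SAMPLE_SETS_XML = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>publication:etd</setSpec><setName>Theses and Dissertations</setName></set>
    <set><setSpec>publication:facpub</setSpec><setName>Faculty Publications</setName></set>
    <set><setSpec>publication:masters</setSpec><setName>Masters Theses</setName></set>
  </ListSets>
</OAI-PMH>
"""

# Assumed keywords that tend to appear in ETD set names or specs.
ETD_PATTERN = re.compile(r"thes[ie]s|dissertation|\betd", re.IGNORECASE)

def find_etd_sets(xml_text, pattern=ETD_PATTERN):
    """Return (setSpec, setName) pairs whose spec or name looks ETD-related."""
    root = ET.fromstring(xml_text)
    matches = []
    for set_el in root.iterfind(".//oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        if pattern.search(spec) or pattern.search(name):
            matches.append((spec, name))
    return matches

for spec, name in find_etd_sets(SAMPLE_SETS_XML):
    print(spec, "|", name)
```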

## Use verb "ListRecords" to fetch the records from the ETD "set" in the chosen metadata format

Arguments

- **from** an optional argument with a UTCdatetime value, which specifies a lower bound for datestamp-based selective harvesting.

- **until** an optional argument with a UTCdatetime value, which specifies an upper bound for datestamp-based selective harvesting.

- **metadataPrefix** a required argument (unless the exclusive argument resumptionToken is used) that specifies the metadataPrefix of the format that should be included in the metadata part of the returned records. Records are included only for items from which the metadata format matching the metadataPrefix can be disseminated. The metadata formats supported by a repository and for a particular item can be retrieved using the ListMetadataFormats request.

- **set** an optional argument with a setSpec value, which specifies set criteria for selective harvesting.

- **resumptionToken** an exclusive argument with a value that is the flow control token returned by a previous ListRecords request that issued an incomplete list.



In [4]:
# Examples: https://vtechworks.lib.vt.edu/server/oai/request?verb=ListRecords&set=com_10919_5534&metadataPrefix=oai_dc
# and https://scholar.afit.edu/do/oai/?verb=ListRecords&set=publication:etd&metadataPrefix=oai_openaire
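
A ListRecords response usually covers only one page of results. The rest is fetched by sending the returned resumptionToken back as the only argument besides the verb. The loop below is a minimal standard-library sketch of that flow control; the function names, the 60-second timeout, and the one-second delay are assumptions, and `fetch` is injectable so the loop can be tried without network access.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def fetch_xml(url):
    """Download one response body as bytes."""
    with urlopen(url, timeout=60) as response:
        return response.read()

def harvest_records(base_url, metadata_prefix, set_spec=None, fetch=fetch_xml, delay=1.0):
    """Yield every <record> element, following resumptionTokens to the last page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        error = root.find("oai:error", OAI_NS)
        if error is not None:
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
        yield from root.iterfind("oai:ListRecords/oai:record", OAI_NS)
        token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=OAI_NS)
        token = (token or "").strip()
        if not token:
            break  # a missing or empty token marks the final page
        # resumptionToken is exclusive: send it with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token}
        time.sleep(delay)  # be polite between page requests
```

For example, `harvest_records("https://scholar.afit.edu/do/oai/", "oai_dc", set_spec="publication:etd")` would iterate over the ETD set page by page. Note that a repository reports an empty result as an error with code `noRecordsMatch`, which this sketch raises like any other error.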

## Use verb "GetRecord" to fetch information about one record.

Arguments

- **identifier** a required argument that specifies the unique identifier of the item in the repository from which the record must be disseminated.

- **metadataPrefix** a required argument that specifies the metadataPrefix of the format that should be included in the metadata part of the returned record. A record should only be returned if the format specified by the metadataPrefix can be disseminated from the item identified by the value of the identifier argument. The metadata formats supported by a repository and for a particular record can be retrieved using the ListMetadataFormats request.

In [5]:
# Examples: https://vtechworks.lib.vt.edu/server/oai/request?verb=GetRecord&identifier=oai:vtechworks.lib.vt.edu:10919/10342&metadataPrefix=oai_dc
# and https://scholar.afit.edu/do/oai/?verb=GetRecord&identifier=oai:scholar.afit.edu:etd-1211&metadataPrefix=oai_dc
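
Once a record is retrieved in `oai_dc` format, its Dublin Core fields can be flattened into a dictionary. The sketch below assumes a hypothetical GetRecord response; the identifier, title, creator, and URLs in it are made up for illustration. Each field maps to a list because Dublin Core elements may repeat.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC_TAG_PREFIX = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical GetRecord response in oai_dc format.
SAMPLE_GETRECORD_XML = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:repository.example.edu:etd-0001</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Hypothetical Thesis Title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:identifier>https://repository.example.edu/etd/1</dc:identifier>
          <dc:identifier>https://repository.example.edu/etd/1/thesis.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""

def parse_dc_record(record):
    """Flatten one oai_dc <record> element into a dict of lists keyed by field name."""
    oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
    fields = {"oai_identifier": [oai_id]}
    for el in record.iterfind("oai:metadata//*", NS):
        # Keep only Dublin Core elements that carry text.
        if el.tag.startswith(DC_TAG_PREFIX) and el.text and el.text.strip():
            fields.setdefault(el.tag[len(DC_TAG_PREFIX):], []).append(el.text.strip())
    return fields

record = ET.fromstring(SAMPLE_GETRECORD_XML).find("oai:GetRecord/oai:record", NS)
print(parse_dc_record(record))
```

The same function works on each `<record>` element of a ListRecords response, since the record structure is identical.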

### Harvesting metadata 

In [6]:
from sickle import Sickle
# Sickle documentation: https://sickle.readthedocs.io/en/latest/ and https://sickle.readthedocs.io/_/downloads/en/latest/pdf/

In [7]:
# from sickle.iterator import OAIResponseIterator
# sickle = Sickle('https://vtechworks.lib.vt.edu/server/oai/request', iterator=OAIResponseIterator)
# responses = sickle.ListRecords(metadataPrefix='oai_dc')
# # Fetch the first page once and save it; calling next() a second time would skip ahead to the next page.
# first_page = responses.next()
# with open('response_VT.xml', 'wb') as fp:
#     fp.write(first_page.raw.encode('utf8'))

In [8]:
# from sickle.iterator import OAIResponseIterator
# sickle = Sickle('https://scholar.afit.edu/do/oai', iterator=OAIResponseIterator)
# responses = sickle.ListRecords(metadataPrefix='oai_dc')
# # Fetch the first page once and save it; calling next() a second time would skip ahead to the next page.
# first_page = responses.next()
# with open('response_AFIT.xml', 'wb') as fp:
#     fp.write(first_page.raw.encode('utf8'))

In [9]:
# Examples of parsing the harvested XML files can be found at https://github.com/lamps-lab/ETDMiner/blob/master/webcrawler/Duke/harvest_etds_Duke.py

## Exceptions: Universities not using OAI-PMH - UC system, Georgia Tech, Caltech.
Use sitemap information to find the metadata. Sitemaps can be named in different ways, and the site's robots.txt can help locate them, e.g. https://escholarship.org/robots.txt
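
For sites without OAI-PMH, the sitemap route can be sketched as follows: read the `Sitemap:` lines from robots.txt, then collect the `<loc>` URLs from each sitemap. The robots.txt body, the sitemap XML, and the host `repository.example.edu` are hypothetical; real sitemaps are often sitemap indexes that point to further sitemap files, which can be parsed with the same function.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical robots.txt and sitemap bodies for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Sitemap: https://repository.example.edu/sitemap.xml
"""

SAMPLE_SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://repository.example.edu/item/1001</loc></url>
  <url><loc>https://repository.example.edu/item/1002</loc></url>
</urlset>
"""

def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

def urls_from_sitemap(xml_text):
    """Return all <loc> values from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]

print(sitemaps_from_robots(SAMPLE_ROBOTS_TXT))
print(urls_from_sitemap(SAMPLE_SITEMAP_XML))
```
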
## More resources:
- https://www.openarchives.org/Register/BrowseSites
  