### Using the Delpher API to download newspaper content at page level


This notebook guides you through the endpoints of the KB, National Library of the Netherlands:

- The SRU (Search/Retrieve via URL) endpoint allows you to search newspaper issues and articles.
- The OAI-PMH endpoint implements a protocol for harvesting metadata.
- The resolver endpoint provides the actual content, like transcripts (OCR) and images (scans).

These endpoints are used together to collect content. Content can be collected either at the level of the entire newspaper issue or at page level.

The starting point for this tutorial is the *ppn* of a newspaper. A ppn is the identifier of a (newspaper) title in the library catalogue. You can find a list of newspaper titles and corresponding links here: https://www.kb.nl/kbhtml/delpher/documentatie/beschikbare_kranten_alfabetisch.pdf. Please note that one newspaper title can have several ppns, due to name changes, precursors and successors of titles, supplements or special editions.

As an example, we use the newspaper 'Trouw' with the ppn '412789353' from 1 January 1970 until 31 December 1970.




### Install the necessary packages

It is preferred to install the packages from the command line, but installing them from within the Jupyter Notebook is also possible.

In [None]:
# If not already installed, install the following packages
!pip install pandas
!pip install requests
!pip install BeautifulSoup4
!pip install six

### Import the necessary packages

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from six.moves import urllib
import os

### Defining the API key

To access post-1945 material, you need an API key, which can be requested from our Dataservices department
via dataservices@kb.nl.
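
To avoid storing the key in the notebook itself, it can be read from an environment variable. The sketch below is one way to do that; the variable name `KB_API_KEY` and the helper `load_apikey` are assumptions made for illustration and are not part of the KB services.

```python
import os

def load_apikey(env_var="KB_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both the helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or empty")
    return key

# Possible usage, once the variable has been set in your shell:
# apikey = load_apikey()
```

With this approach the notebook can be shared without exposing the key.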



In [None]:
apikey = "insert api key" ## Insert the received API key between the quotes

### Defining the search parameters

Various parameters can be used to search through the collections.
The code in this notebook uses a ppn (the identifier of a newspaper title) as its entry point.

Other parameters are also available, such as the desired start and end date, the type of content or the geographical area. In this notebook, we use only the ppn and the date range. Refer to the user manual (available upon request at dataservices@kb.nl) for more information about the other parameters.
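
The ppn and the date range end up in a single query string that has to be percent-encoded before it is placed in the URL. As a minimal sketch, assuming the query syntax used in the `sru_query` function further down, the encoding can be done with the standard library; the helper name `build_cql` is hypothetical, and unlike `sru_query` it only adds the date clause when both dates are given.

```python
from urllib.parse import quote

def build_cql(ppn, startdate="", enddate=""):
    """Return the percent-encoded query for a ppn and an optional date range."""
    cql = f"(ppn={ppn})"
    if startdate and enddate:
        cql = f'(date within "{startdate} {enddate}") and ' + cql
    # Keep parentheses and '=' readable; spaces and double quotes are encoded.
    return quote(cql, safe="()=")

print(build_cql("412789353", "1970-01-01", "1970-12-31"))
# (date%20within%20%221970-01-01%201970-12-31%22)%20and%20(ppn=412789353)
```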

In [None]:
## Insert required variables
ppn = "412789353" ## Example ppn of the newspaper 'Trouw', please replace with the desired ppn
startdate = "1970-01-01" ## Example date, please replace with the desired start date, use "" (empty string) for all years
enddate = "1970-12-31" ## Example date, please replace with the desired end date, use "" (empty string) for all years

### Retrieving the newspaper issue identifiers

Before we can download the actual content, we need a list of identifiers of the newspaper issues that match the selection we made above. We put this list in a dataframe, together with some additional metadata about each issue. This dataframe is used later on for accessing the content.

In [None]:
## We start by creating the sru queries based on the parameters above
def sru_query(ppn, startdate, enddate, startRecord, batchSize = 1000):
    if startdate == "" and enddate == "":
        sru_query = f"http://jsru.kb.nl/sru/sru/{apikey}?operations=searchRetrieve&recordSchema=ddd&x-collection=DDD_krantnr"\
            f"&query=(ppn={ppn})"\
            f"&startRecord={startRecord}&maximumRecords={batchSize}"
    else:
        sru_query = f"http://jsru.kb.nl/sru/sru/{apikey}?operations=searchRetrieve&recordSchema=ddd&x-collection=DDD_krantnr"\
            f"&query=(date%20within%20%22{startdate}%20{enddate}%22)%20and%20(ppn={ppn})"\
            f"&startRecord={startRecord}&maximumRecords={batchSize}" 
    return(sru_query)

## Request metadata from SRU as specified in query and return processed XML (beautiful soup)
def sru_request(query):
    page = requests.get(query)
    soup = BeautifulSoup(page.content, "xml")
    return soup
    

In [None]:
## First, we extract the total number of newspapers that were found. 
## We need this to iterate through the SRU results until all records are retrieved
soup = sru_request(sru_query(ppn, startdate, enddate, startRecord = 0, batchSize = 0))
for item in soup.findAll('srw:searchRetrieveResponse'):
    nRecords = int(item.find('srw:numberOfRecords').text)
    
## Create an empty dataframe to store the identifiers and some metadata
dfIdentifiers = pd.DataFrame(columns = ['identifier', 'papertitle', 'date'])

## Then we iterate through the SRU results, extract the required data and put it in the dataframe
## After each loop, increase 'startRecord' to access the next batch of newspaper identifiers.  
startRecord = 0
batchSize = 1000
while startRecord <= nRecords:
    soup = sru_request(sru_query(ppn, startdate, enddate, startRecord,batchSize = batchSize))
    identifiers, papertitles, dates = [],[],[]
    for item in soup.findAll('srw:recordData'):
        identifiers.append(item.find('ddd:metadataKey').text)
        papertitles.append(item.find('ddd:publisher').text)
        dates.append(item.find('dc:date').text)
    batch = pd.DataFrame({'identifier': identifiers, 'papertitle': papertitles, 'date': dates})
    dfIdentifiers = pd.concat([dfIdentifiers,batch], ignore_index = True)
    startRecord += batchSize
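
The batching in the loop above comes down to simple offset arithmetic. The sketch below isolates that step; the helper `batch_starts` is hypothetical, and whether the endpoint counts records from 0 (as the loop above assumes) or from 1 is an assumption you may need to check, which is why the first offset is a parameter.

```python
def batch_starts(n_records, batch_size, first=0):
    """Yield the startRecord value of every batch needed to cover n_records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    yield from range(first, first + n_records, batch_size)

print(list(batch_starts(5, 2)))           # [0, 2, 4]
print(list(batch_starts(5, 2, first=1)))  # [1, 3, 5]
```

Iterating over these offsets requests each batch exactly once and avoids an extra empty request when the number of records is a multiple of the batch size.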


In [None]:
## Show the first 30 records of the dataframe to see what we got
dfIdentifiers.head(30)

In [None]:
## And show how many newspapers were found
print('Number of newspaper issues: ' + str(len(dfIdentifiers)))

### Download the desired items

You can receive each newspaper in several formats:
* The image of a page (jp2 format, high quality)
* The complete newspaper in pdf (lower quality)
* The alto xml of a page
* The metadata of the complete newspaper 
* The plain text of every article in the newspaper

First, you have to define the function 'download_file' (see next cell). Then, you can choose which of the above types you want to download and run the corresponding cells beneath to collect the data. Please note: depending on the number of newspapers, this can take a while. 

NB: the download cells below create the folder named in `path` if it does not exist yet. If you write files to another location yourself, make sure that folder exists before running the code; otherwise the code stops with an error. 

**Explanation of the download code** <br>
You iterate through the dataframe and perform the following steps for each newspaper:
* Retrieve the identifier.
* Add or change information from the identifier if needed (see the cells: download image, alto and metadata).
* Create a filename out of the identifier, using only characters that are allowed in filenames (e.g. replace the colons in the identifier with underscores).
* Create the URL for querying the resolver or the OAI endpoint. 
* Run the function "download_file" to download the content from the url. 
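
The identifier steps above can be sketched as a single helper. This mirrors the logic of the page-level download cells below; the function name `page_request` and the identifier values in the example are made up for illustration.

```python
def page_request(identifier, pagenumber, kind):
    """Return (url, filename) for one page of a newspaper issue.

    identifier: issue identifier as returned by the SRU endpoint
    pagenumber: three-digit page number, e.g. '001'
    kind:       the content type to request, e.g. 'image' or 'alto'
    """
    prefix = identifier.split("=")[1].split(":")[0]
    if prefix == "ddd":
        page_id = f"{identifier}:p{pagenumber}"
    else:
        page_id = identifier.replace("mpeg21", pagenumber)
    # Colons are not allowed in filenames on every system, so replace them.
    filename = page_id.split("=")[1].replace(":", "_")
    return f"{page_id}:{kind}", filename

# Illustrative identifier, not a real issue
url, filename = page_request(
    "http://resolver.kb.nl/resolve?urn=ddd:000000000:mpeg21", "001", "image")
print(url)       # http://resolver.kb.nl/resolve?urn=ddd:000000000:mpeg21:p001:image
print(filename)  # ddd_000000000_mpeg21_p001
```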

In [None]:
## Function for downloading a file from a url
def download_file(download_url, filename):
    ## Retrieve the desired information from the given url
    response = urllib.request.urlopen(download_url)
    # TODO: check for errors
    ## Save to the desired location
    with open(filename, 'wb') as f:
        f.write(response.read())
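
The error check left as a TODO above could look roughly like the following. This is a sketch rather than the notebook's own method: the name `download_file_checked`, the timeout value and the choice to return `True`/`False` instead of raising are all assumptions.

```python
import shutil
import urllib.error
import urllib.request

def download_file_checked(download_url, filename, timeout=60):
    """Download download_url to filename; return True on success, False otherwise."""
    try:
        with urllib.request.urlopen(download_url, timeout=timeout) as response:
            # The output file is only created once the request has succeeded.
            with open(filename, "wb") as f:
                shutil.copyfileobj(response, f)
    except (urllib.error.URLError, OSError) as err:
        # HTTPError (e.g. 403 or 404) is a subclass of URLError.
        print(f"Could not download {download_url}: {err}")
        return False
    return True
```

Returning a flag lets a long download loop continue after a failed item, and the failed identifiers can be collected for a later retry.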

### Download image (jp2 format)

NB: the jp2 files are large, and not all image manipulation software can open them.

In [None]:
path = "images" ## Change to the desired location
if not os.path.exists(path):
   os.makedirs(path)
pagenumber = '001' ##  Choose the pagenumber you want to retrieve, NB: pagenumbers are always three digits 

for index, row in dfIdentifiers.iterrows():
    identifier = row['identifier']
    prefix = identifier.split("=")[1].split(":")[0]
    if prefix == 'ddd':
        identifier = identifier + ":p"+pagenumber
    else: 
        identifier = identifier.replace("mpeg21", pagenumber)
    filename = identifier.split("=")[1].replace(":","_")
    url = identifier + ":image"
    download_file(url, os.path.join(path, filename+ ".jp2"))

### Download pdf

In [None]:
path = "pdfs" ## Change to the desired location
if not os.path.exists(path):
   os.makedirs(path)

for index, row in dfIdentifiers.iterrows():
    identifier = row['identifier']
    filename = identifier.split("=")[1].replace(":","_")
    url = identifier + ":pdf"
    download_file(url, os.path.join(path, filename+ ".pdf"))

### Download alto xml of page

In [None]:
path = "alto" ## Change to the desired location
if not os.path.exists(path):
   os.makedirs(path)

pagenumber = '001' ##  Choose the pagenumber you want to retrieve, NB: pagenumbers are always three digits 

for index, row in dfIdentifiers.iterrows():
    identifier = row['identifier']
    prefix = identifier.split("=")[1]
    prefix = prefix.split(":",1)[0]
    if prefix == 'ddd':
        identifier = identifier + ":p"+pagenumber
    else:
        identifier = identifier.replace("mpeg21", pagenumber)
    filename = identifier.split("=")[1].replace(":","_")
    url = identifier + ":alto"
    download_file(url, os.path.join(path, filename+ ".xml"))

### Download metadata

In [None]:
path = "metadata" ## Change to the desired location
## The API key defined at the top of the notebook is used here
if not os.path.exists(path):
    os.makedirs(path)
for index, row in dfIdentifiers.iterrows():
    identifier = row['identifier']
    identifier = identifier.split("=")[1]
    prefix = identifier.split(":",1)[0]
    identifier = identifier.split(":",1)[1]
    filename = identifier.replace(":","_")
    if prefix == 'ddd':
        url = f"http://services.kb.nl/mdo/oai/{apikey}?verb=GetRecord&" \
              f"identifier=DDD:ddd:{identifier}&metadataPrefix=didl"
    elif prefix == 'ABCDDD':
        url = f"http://services.kb.nl/mdo/oai/{apikey}?verb=GetRecord&" \
              f"identifier=KRANTEN:DDD:ddd:{identifier}&metadataPrefix=didl"
    else:
        url = f"http://services.kb.nl/mdo/oai/{apikey}?verb=GetRecord&" \
              f"identifier=KRANTEN:{prefix}:{prefix}:{identifier}&metadataPrefix=didl"
    download_file(url, os.path.join(path, filename+ ".xml"))
    # TODO: check that the request succeeded, i.e. you were authorized to view the record
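
One way to approach the TODO above is to inspect the downloaded XML: the OAI-PMH protocol reports problems in an `<error>` element with a `code` attribute inside the `OAI-PMH` root element. The sketch below checks for that element; the helper name `oai_error` and the sample response are written for illustration, and it is an assumption that the KB endpoint reports authorization problems this way rather than through an HTTP status code.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def oai_error(xml_bytes):
    """Return (code, message) if the OAI-PMH response holds an error, else None."""
    root = ET.fromstring(xml_bytes)
    error = root.find("oai:error", OAI_NS)
    if error is None:
        return None
    return error.get("code"), (error.text or "").strip()

# Hand-written sample response, not actual output of the KB endpoint
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""
print(oai_error(sample))  # ('idDoesNotExist', 'No matching identifier')
```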

### Download plain text from articles

For this step, you first need to download the metadata files (see the previous step) and store them on your computer.

In [None]:
### Collect all the identifiers for the individual articles. These are used later on for downloading the text

metadata_path = "metadata" ## Change to the path where you saved the metadata files

def obtain_article_identifiers(metadatafile):
    ## Read the given metadata file from the metadata folder and parse it as XML
    with open(os.path.join(metadata_path, metadatafile), "r", encoding="utf8") as f:
        soup = BeautifulSoup(f.read(), "xml")
    identifiers = []
    for identifier in soup.findAll('dc:identifier'):
        identifier = identifier.text
        if identifier.startswith("http://resolver.kb.nl/"):
            if 'a' in identifier.rsplit(":", 1)[1]: 
                identifiers.append(identifier)
    return identifiers

def obtain_article_content(identifier_ocr):
    r = requests.get(identifier_ocr)
    soup = BeautifulSoup(r.content, "xml")
    try:
        title = soup.find('title').text
    except AttributeError: 
        print(f'Item without title: {identifier_ocr}')
        title = ''
    content = ' '.join([item.text for item in soup.findAll('p')])
    return title, content

## Create empty dataframe to contain article content
dfArticleContent = pd.DataFrame(columns = ['identifier', 'title', 'content'])
i=0

## Iterate over the metadata files. Retrieve article identifiers and obtain corresponding contents
for (_, _, filenames) in os.walk(metadata_path):
    for filename in [fn for fn in filenames if fn.endswith('.xml')]:
        for article_identifier in obtain_article_identifiers(filename):
            title, content = obtain_article_content(article_identifier+':ocr')
            dfArticleContent.loc[i] = pd.Series({'identifier': article_identifier, 'title': title, 'content':  content})
            i+=1

            
## Show the first records of the dataframe to see what we got
dfArticleContent.head()
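
The filter in `obtain_article_identifiers` keeps only resolver identifiers whose last colon-separated part contains an 'a'. The sketch below isolates that test so it can be tried on a few values; the helper name `is_article_identifier` and the example identifiers are made up for illustration.

```python
def is_article_identifier(identifier):
    """Apply the same test as obtain_article_identifiers to one identifier."""
    if not identifier.startswith("http://resolver.kb.nl/"):
        return False
    # Look only at the part after the last colon.
    return "a" in identifier.rsplit(":", 1)[1]

base = "http://resolver.kb.nl/resolve?urn=ddd:000000000:mpeg21"  # illustrative
print(is_article_identifier(base + ":a0001"))  # True
print(is_article_identifier(base + ":p001"))   # False
print(is_article_identifier(base))             # False
```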

In [None]:
## Save the dataframe for further use
dataframe_path = "Insert pathname" ## Insert the path to the folder where the dataframe should be stored
dataframe_name = "Insert filename" ## Insert the desired filename (without extension)

## Save the dataframe as a comma-separated file
dfArticleContent.to_csv(os.path.join(dataframe_path, dataframe_name + ".csv"))