# Download news articles per subject

This notebook guides you through the SRU and OAI services of the KB, National Library of the Netherlands, to collect news articles by subject and time range.

### Install the necessary packages

It is preferred to install the packages from the command line, but installing them from within the Jupyter Notebook is also possible.

In [1]:
!pip install pandas
!pip install requests
!pip install BeautifulSoup4



### Import the necessary packages

In [1]:
## Import the necessary packages 
import pandas as pd
from bs4 import BeautifulSoup
import requests

### Defining the API key

An API key is needed to query and download material. 

In [2]:
apikey = "insert API key" #Insert the API key here
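As an alternative to pasting the key into the notebook, the key can be read from an environment variable so that it does not end up in shared copies of the notebook. The sketch below is one possible approach, not part of the original workflow; the variable name `KB_API_KEY` and the helper `load_api_key` are assumptions made for illustration.

```python
import os

def load_api_key(env_var="KB_API_KEY", default="insert API key"):
    """Return the API key stored in the environment variable `env_var`.

    `KB_API_KEY` is a hypothetical name chosen for this sketch.
    If the variable is not set, the placeholder string is returned.
    """
    return os.environ.get(env_var, default)

# Falls back to the placeholder when the variable is not set
apikey = load_api_key()
```

The variable has to be set before Jupyter is started, so that the notebook process can see it.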

### Defining the search parameters

There are various parameters that can be used to search through the collection.
The code in this notebook is based on searching with a keyword and a time range.  

Other parameters can also be used, such as the type of content or the spatial coverage. See the user manual for more information about the other parameters.

In [3]:
keyword = '(Inhuldiging+and+koningin+and+Beatrix)' ## use '+or+' or '+and+' to search with multiple keywords, such as 'griep+and+ziekte'
startdate = '01-01-1980'
enddate = '31-12-1980'
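The keyword string above joins the search terms with `+and+` or `+or+` and wraps them in parentheses. A minimal sketch of how such a string could be assembled from a list of terms is shown below; the helper `build_keyword` is hypothetical and only reproduces the format used in this notebook.

```python
def build_keyword(terms, operator="and"):
    """Join search terms with '+and+' or '+or+' and wrap them in parentheses.

    This helper is an illustration; it only mirrors the keyword format
    used in the cell above.
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return "(" + f"+{operator}+".join(terms) + ")"

keyword = build_keyword(["Inhuldiging", "koningin", "Beatrix"])
print(keyword)  # (Inhuldiging+and+koningin+and+Beatrix)
```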

### Retrieving the article identifiers

Before we can download the actual content, we need a list of identifiers of the news articles that match the selection criteria defined above. We store this list in a dataframe, together with some additional metadata about each article. This dataframe is used later on to access the content.

In [4]:
## Extract the identifiers
## This might take a while
identifierList = []
startRecord = 0
recordCounter = 0

## Assemble the query based on the parameters; maximumRecords is set to 1000 to prevent overloading the system
query = f"https://jsru.kb.nl/sru/sru/{apikey}?query={keyword}"\
        f"%20and%20(date%20within%20%22{startdate}%20{enddate}%22)"\
        f"&x-collection=DDD_artikel&recordSchema=dc&startRecord={startRecord}&maximumRecords=1000"
print(query)

## Read the total number of matching records from the first response
page = requests.get(query)
soup = BeautifulSoup(page.content, "xml")
records = int(soup.find('srw:numberOfRecords').text)

## Iterate through the query results to extract the metadata
while recordCounter < records:
    page = requests.get(query)
    soup = BeautifulSoup(page.content, "xml")
    ## The query returns an XML page with (in this example) up to 1000 articles
    ## We extract the metadata per article
    items = soup.find_all('srw:recordData')
    if not items:
        ## Stop when a page comes back empty, so the loop cannot run forever
        break
    for item in items:
        identifier = item.find('dc:identifier')
        kind = item.find('dc:type')
        title = item.find('dc:title')
        date = item.find('dc:date')
        ## Some records have no title; store an empty string in that case
        titleText = title.text if title is not None else ""
        identifierList.append([identifier.text, kind.text, titleText, date.text])
        recordCounter += 1
    ## If there are more than 1000 results,
    ## this code is used to proceed to the next pages to collect the remainder of the results
    startRecord = startRecord + 1000
    query = f"https://jsru.kb.nl/sru/sru/{apikey}?query={keyword}"\
        f"%20and%20(date%20within%20%22{startdate}%20{enddate}%22)"\
        f"&x-collection=DDD_artikel&recordSchema=dc&startRecord={startRecord}&maximumRecords=1000"

https://jsru.kb.nl/sru/sru/f7ff6440-2c1b-4303-8617-ec23fc783021?query=(Inhuldiging+and+koningin+and+Beatrix)%20and%20(date%20within%20%2201-01-1980%2031-12-1980%22)&x-collection=DDD_artikel&recordSchema=dc&startRecord=0&maximumRecords=1000
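The cell above builds the same query string twice and advances `startRecord` in steps of 1000. The sketch below shows one way this paging logic could be separated into two small helpers; `build_query` and `page_offsets` are hypothetical names, and the URL pattern is copied from the cell above rather than taken from the API documentation.

```python
def build_query(apikey, keyword, startdate, enddate, start_record=0, page_size=1000):
    """Assemble the SRU query URL in the same format as used in this notebook."""
    return (
        f"https://jsru.kb.nl/sru/sru/{apikey}?query={keyword}"
        f"%20and%20(date%20within%20%22{startdate}%20{enddate}%22)"
        f"&x-collection=DDD_artikel&recordSchema=dc"
        f"&startRecord={start_record}&maximumRecords={page_size}"
    )

def page_offsets(total_records, page_size=1000):
    """Return the startRecord value of every page needed to cover all records."""
    return list(range(0, total_records, page_size))

# With the 1079 results of the example query, two pages are needed
offsets = page_offsets(1079)
print(offsets)  # [0, 1000]

# One URL per page; "insert API key" is the placeholder used earlier
urls = [
    build_query("insert API key", "(Inhuldiging+and+koningin+and+Beatrix)",
                "01-01-1980", "31-12-1980", start_record=offset)
    for offset in offsets
]
```

Each URL in `urls` can then be requested and parsed in the same way as in the loop above.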


In [5]:
## Create the dataframe
dfIdentifiers = pd.DataFrame(identifierList, columns = ['identifier', 'type', 'title', 'date'])

In [6]:
## Show the number of found identifiers
len(dfIdentifiers)

1079

In [7]:
dfIdentifiers.head(40)

Unnamed: 0,identifier,type,title,date
0,http://resolver.kb.nl/resolve?urn=ddd:01120070...,artikel,Inhuldiging Beatrix op de plaat,1980/02/18 00:00:00
1,http://resolver.kb.nl/resolve?urn=ddd:01057316...,artikel,Inhuldiging koningin Beatrix trok meeste kijke...,1980/08/15 00:00:00
2,http://resolver.kb.nl/resolve?urn=ddd:01037692...,artikel,Troonsafstand en inhuldiging op één dag,1980/02/11 00:00:00
3,http://resolver.kb.nl/resolve?urn=KBPERS01:002...,artikel,Prins Charles bij inhuldiging,1980/03/07 00:00:00
4,http://resolver.kb.nl/resolve?urn=ddd:01057080...,artikel,NEDERLAND I,1980/04/30 00:00:00
5,http://resolver.kb.nl/resolve?urn=KBPERS01:002...,artikel,TV,1980/04/29 00:00:00
6,http://resolver.kb.nl/resolve?urn=ddd:01120069...,artikel,VROUW DE KLEREN VAN DE KONINGIN door Dieuwke G...,1980/02/09 00:00:00
7,http://resolver.kb.nl/resolve?urn=ddd:01120076...,artikel,Beatrix na inhuldiging beschermvrouwe Unicef,1980/03/31 00:00:00
8,http://resolver.kb.nl/resolve?urn=ddd:01120490...,artikel,Inhuldiging Koningin,1980/08/07 00:00:00
9,http://resolver.kb.nl/resolve?urn=ABCDDD:01084...,artikel,Troonsafstand zal niet tot collectieve gratie ...,1980/02/09 00:00:00


### Retrieve the content of the articles

In [8]:
## Retrieve the content of the articles based on the identifiers
## If there are a lot of articles, this can take a while

contentList = []

for index, row in dfIdentifiers.iterrows():
    identifier = row['identifier']
    response = requests.get(identifier)
    soup = BeautifulSoup(response.content, "xml")
    ## Concatenate the text of all paragraphs of the article
    text = ''
    for item in soup.find_all('p'):
        text = text + item.text
    contentList.append([identifier, text])
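With more than a thousand requests, a single failed download would stop the loop above. The sketch below shows one possible way to retry failed downloads and continue with the next article. The helper `collect_texts` is hypothetical, and the download itself is passed in as a function so that the example runs without network access; in the notebook, that function would wrap the `requests.get` and `BeautifulSoup` calls from the cell above.

```python
import time

def collect_texts(identifiers, fetch, delay=0.5, retries=2):
    """Call fetch(identifier) for every identifier, retrying on failure.

    Returns a list of [identifier, text] rows. If every attempt fails,
    the text is left empty so the row can be inspected afterwards.
    Catching all exceptions is a simplification for this sketch.
    """
    rows = []
    for identifier in identifiers:
        text = ""
        for attempt in range(retries + 1):
            try:
                text = fetch(identifier)
                break
            except Exception:
                # Wait briefly before the next attempt
                time.sleep(delay)
        rows.append([identifier, text])
    return rows

# Stand-in for the real download function, used only for demonstration
def fake_fetch(identifier):
    return "text of " + identifier

contentList = collect_texts(["id-1", "id-2"], fake_fetch, delay=0)
print(contentList)  # [['id-1', 'text of id-1'], ['id-2', 'text of id-2']]
```

Rows with an empty text can be filtered out or downloaded again later.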

In [9]:
## Create a dataframe
dfText = pd.DataFrame(contentList, columns = ['identifier', 'content'])

In [10]:
len(dfText)

1079

In [11]:
dfText.head(4)

Unnamed: 0,identifier,content
0,http://resolver.kb.nl/resolve?urn=ddd:01120070...,PHONOGRAM heeft de rechten van de NOS gekocht ...
1,http://resolver.kb.nl/resolve?urn=ddd:01057316...,HILVERSUM [ANP] - Met een kijkdichtheidspercen...
2,http://resolver.kb.nl/resolve?urn=ddd:01037692...,De troonsafstand van Juliana en de inhuldiging...
3,http://resolver.kb.nl/resolve?urn=KBPERS01:002...,LONDEN (Reuter) Kroonprins Charles zal de Brit...


In [18]:
dfText[dfText['content'].str.contains('rel')]

Unnamed: 0,identifier,content
25,http://resolver.kb.nl/resolve?urn=KBPERS01:002...,• WIELERKAMPIOENEN OP BEZOEK BIJ KONINGINSlaot...
26,http://resolver.kb.nl/resolve?urn=KBPERS01:003...,Van een onzer verslaggevers AMSTERDAM - Onze n...
43,http://resolver.kb.nl/resolve?urn=ddd:01062141...,DEN HAAG - Koningin Beatrix zal in een toespra...
58,http://resolver.kb.nl/resolve?urn=ddd:01101904...,Koningin Beatrix zal op dinsdag 10 juni in een...
66,http://resolver.kb.nl/resolve?urn=ddd:01120078...,f Twee informatieve kleurenfolders van > Monum...
...,...,...
1073,http://resolver.kb.nl/resolve?urn=ABCDDD:01087...,"r» ui gp„n, r- Als nst'me* de regelmaat van ee..."
1074,http://resolver.kb.nl/resolve?urn=ABCDDD:01088...,Ll artkruis is bondscoach en militair. Hij sta...
1076,http://resolver.kb.nl/resolve?urn=ddd:01057314...,JTW??T^CTIfj_[3B^No. 1361: 'n boek hoort erbij...
1077,http://resolver.kb.nl/resolve?urn=ddd:01057309...,_TT?H?f!ff!ffl No. 1352 Nagelaten boodschap Di...


### Merge the metadata with the content

This is an additional step to store everything in one dataframe. 

In [19]:
dfArticles = dfIdentifiers.merge(dfText, on = 'identifier', how = 'inner')
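An inner merge repeats rows when an identifier occurs more than once in both dataframes, so the merged dataframe can be longer than either input. The sketch below illustrates this with a small made-up example (the values are not from the KB collection) and shows one possible way to avoid it, by dropping duplicate identifiers first and letting pandas check the result with `validate`.

```python
import pandas as pd

# Made-up example data in which identifier "b" occurs twice
dfMeta = pd.DataFrame({"identifier": ["a", "b", "b"], "title": ["A", "B", "B"]})
dfContent = pd.DataFrame({"identifier": ["a", "b", "b"], "content": ["x", "y", "y"]})

# A plain inner merge pairs every "b" row with every other "b" row
dfNaive = dfMeta.merge(dfContent, on="identifier", how="inner")
print(len(dfNaive))  # 5

# Dropping duplicates first gives one row per identifier;
# validate raises an error if duplicates are still present
dfClean = dfMeta.drop_duplicates(subset="identifier").merge(
    dfContent.drop_duplicates(subset="identifier"),
    on="identifier",
    how="inner",
    validate="one_to_one",
)
print(len(dfClean))  # 2
```

Whether duplicates should be dropped depends on the research question, so this step is optional.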

In [20]:
dfArticles.head(4)

Unnamed: 0,identifier,type,title,date,content
0,http://resolver.kb.nl/resolve?urn=ddd:01120070...,artikel,Inhuldiging Beatrix op de plaat,1980/02/18 00:00:00,PHONOGRAM heeft de rechten van de NOS gekocht ...
1,http://resolver.kb.nl/resolve?urn=ddd:01057316...,artikel,Inhuldiging koningin Beatrix trok meeste kijke...,1980/08/15 00:00:00,HILVERSUM [ANP] - Met een kijkdichtheidspercen...
2,http://resolver.kb.nl/resolve?urn=ddd:01037692...,artikel,Troonsafstand en inhuldiging op één dag,1980/02/11 00:00:00,De troonsafstand van Juliana en de inhuldiging...
3,http://resolver.kb.nl/resolve?urn=KBPERS01:002...,artikel,Prins Charles bij inhuldiging,1980/03/07 00:00:00,LONDEN (Reuter) Kroonprins Charles zal de Brit...


In [22]:
dfArticles[dfArticles['content'].str.contains('rel')]

Unnamed: 0,identifier,type,title,date,content
25,http://resolver.kb.nl/resolve?urn=KBPERS01:002...,artikel,SOBERE INHULDIGING,1980/02/09 00:00:00,• WIELERKAMPIOENEN OP BEZOEK BIJ KONINGINSlaot...
26,http://resolver.kb.nl/resolve?urn=KBPERS01:003...,artikel,Ongedwongen koningin,1980/05/01 00:00:00,Van een onzer verslaggevers AMSTERDAM - Onze n...
43,http://resolver.kb.nl/resolve?urn=ddd:01062141...,artikel,Danktoespraak van koningin Beatrix,1980/05/29 00:00:00,DEN HAAG - Koningin Beatrix zal in een toespra...
58,http://resolver.kb.nl/resolve?urn=ddd:01101904...,artikel,Dankwoord koningin Beatrix,1980/05/30 00:00:00,Koningin Beatrix zal op dinsdag 10 juni in een...
66,http://resolver.kb.nl/resolve?urn=ddd:01120078...,advertentie,Advertentie,1980/04/26 00:00:00,f Twee informatieve kleurenfolders van > Monum...
...,...,...,...,...,...
1075,http://resolver.kb.nl/resolve?urn=ABCDDD:01087...,artikel,"AMOK, de homars van het nieuwe Pompeï",1980/05/10 00:00:00,"r» ui gp„n, r- Als nst'me* de regelmaat van ee..."
1076,http://resolver.kb.nl/resolve?urn=ABCDDD:01088...,artikel,SPORT „Huidige generatie niet spontaan of zelf...,1980/06/07 00:00:00,Ll artkruis is bondscoach en militair. Hij sta...
1078,http://resolver.kb.nl/resolve?urn=ddd:01057314...,advertentie,Advertentie,1980/07/19 00:00:00,JTW??T^CTIfj_[3B^No. 1361: 'n boek hoort erbij...
1079,http://resolver.kb.nl/resolve?urn=ddd:01057309...,advertentie,Advertentie,1980/05/17 00:00:00,_TT?H?f!ff!ffl No. 1352 Nagelaten boodschap Di...


In [24]:
dfArticles = dfArticles.head(10)