# Scraping from Wikipedia

[The User-Agent Policy of the Wikimedia Foundation](https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy) states that requests should contain a descriptive User-Agent header. The header used in this notebook therefore identifies the bot name and version, a contact address on my domain, and the HTTP library that sends the requests.
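
As a minimal sketch of the general format the policy describes (client name and version, contact information in parentheses, then the library and its version), the header value can be assembled from its parts. The helper `build_user_agent` and the name `example_headers` are illustrative assumptions, not part of any library.

```python
def build_user_agent(bot_name, bot_version, contact, library, library_version):
    """Assemble a descriptive User-Agent value from its parts (hypothetical helper)."""
    return f"{bot_name}/{bot_version} ({contact}) {library}/{library_version}"

# Same values as the header used in this notebook
example_headers = {
    "User-Agent": build_user_agent(
        "WikiScraper", "1.0", "contact@tijndonners.nl", "requests", "2.32.3"
    )
}
print(example_headers["User-Agent"])
```

Building the string from named parts makes it harder to forget the contact details when the bot or library version changes.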

The following code requests an English Wikipedia category page that lists persons with a Wikipedia entry (here: 16th-century Flemish painters). The code collects the links on that page in a list that we later use for web scraping.




In [68]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

useragent = {                               # User-Agent header that complies with the Wikimedia policy
    "User-Agent": (
        "WikiScraper/1.0 (contact@tijndonners.nl) "
        "requests/2.32.3"
    )
}

# The URL below can be changed to any Wikipedia category page that lists persons.
# This English Wikipedia category lists 16th-century Flemish painters.
response = requests.get("https://en.wikipedia.org/wiki/Category:16th-century_Flemish_painters", headers=useragent).text

soup = BeautifulSoup(response, "html.parser")

links = []

for a in soup.select("#mw-pages li a"):       # select() takes a CSS selector
    href = "https://en.wikipedia.org" + a["href"]    # a is a BeautifulSoup Tag; a["href"] is the site-relative path (/wiki/...), so the host must match the wiki that was requested
    links.append(href)
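
The links on a category page are site-relative (`/wiki/...`), so the host that is prepended has to match the wiki the page was fetched from. A small sketch using `urljoin` from the standard library, which takes the host from the page URL instead of a hard-coded string. The function name `absolute_links` and the sample paths are assumptions for illustration.

```python
from urllib.parse import urljoin

def absolute_links(page_url, hrefs):
    """Resolve site-relative hrefs against the URL of the page they came from."""
    return [urljoin(page_url, href) for href in hrefs]

category_url = "https://en.wikipedia.org/wiki/Category:16th-century_Flemish_painters"
sample_hrefs = ["/wiki/Simon_Bening", "/wiki/Paul_Bril"]  # illustrative paths
print(absolute_links(category_url, sample_hrefs))
```

In the loop over `soup.select(...)`, `urljoin(category_url, a["href"])` could take the place of the string concatenation.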

# Wikidata Links

Since Wikidata entries contain more structured data than a Wikipedia page does, we retrieve the Wikidata link for each of the Wikipedia links collected above.

In [69]:
wikidata_links = []

# This may take a while: one request is sent for every link
for url in links:
    r = BeautifulSoup(requests.get(url, headers=useragent).text, "html.parser")
    # The Wikidata link is in the sidebar: <li id="t-wikibase" ...> <a href="https://www.wikidata.org/wiki/Q...">
    wd = r.select_one("li#t-wikibase a")
    if wd:
        wikidata_links.append(wd["href"])
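
Because this loop sends one request per page, it may be worth pausing between requests. A rough sketch, assuming a one-second pause is acceptable (that number is my assumption, not a documented limit). The names `fetch_all`, `fetch`, `delay` and `sleep` are hypothetical, and `fetch` stands in for the `requests.get(...)` call used above.

```python
import time

def fetch_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Stand-in fetch function so the sketch runs without network access
pages = fetch_all(["a", "b", "c"], fetch=str.upper, delay=0.0)
print(pages)
```

With the real scraper, `fetch` could be a small function that wraps `requests.get(url, headers=useragent).text`.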

# Scraping data
Now that we have the Wikidata links of all persons in the category that have a Wikipedia entry, we can scrape the data. Wikidata offers the given name, family name, gender, date of birth, date of death, place of birth, place of death and a short description. The function below scrapes the label, the description, the gender, both dates and both places. We add this data to a pandas DataFrame.
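
Text scraped from the rendered page may need some cleaning before it goes into a DataFrame. In the output of this notebook, date values can carry a calendar suffix (`1526Gregorian`) and the heading combines the label with the item identifier (`Nicolaus van Aelst` followed by `(Q3876454)`). A minimal sketch of such a cleanup, assuming these are the only patterns that occur; `clean_date` and `split_heading` are hypothetical helpers.

```python
import re

def clean_date(text):
    """Strip whitespace and a trailing calendar name from a scraped date.

    'Gregorian' appears in the scraped output; 'Julian' is an assumption.
    """
    return re.sub(r"\s*(Gregorian|Julian)\s*$", "", text.strip())

def split_heading(text):
    """Split 'Label (Q123)' into (label, qid); qid is '' if no identifier is found."""
    match = re.search(r"\((Q\d+)\)\s*$", text)
    if not match:
        return text.strip(), ""
    return text[:match.start()].strip(), match.group(1)

print(clean_date("19 July 1613Gregorian"))
print(split_heading("Nicolaus van Aelst\n(Q3876454)\n"))
```

Keeping the cleanup in separate functions leaves the raw scraped strings available in case other patterns turn up.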

In [72]:

def scrape(url):
    r = BeautifulSoup(requests.get(url, headers= useragent).text, 'html.parser')

    #scrape the title and wikidata item identifier
    wikidata_item = r.find('h1').text
    print(wikidata_item)

    #scrape description
    description = r.find('div', attrs="wikibase-entitytermsview-heading-description").text
    print(description)

    #scrape gender
    try:
        gender = r.find("div", id="P21").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text  #P21 is the wikidata property tag for sex or gender
        print(gender)
    except Exception as e:
        print("gender not found. error:", e)
        gender = ''


    #scrape birth date 
    try:
        bdate = r.find("div", id="P569").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        print(bdate)
    except Exception as e:
        print('birth date not found. error:', e)
        bdate = ''


    #scrape death date
    try:
        ddate = r.find("div", id="P570").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        print(ddate)
    except Exception as e:
        print('death date not found. error:', e)
        ddate = ''

    #scrape birth place
    try:
        bplace = r.find("div", id="P19").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        print(bplace)
    except Exception as e:
        print('birth place not found. error:', e)
        bplace = ''

    #scrape death place
    try:
        dplace = r.find("div", id="P20").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        print(dplace)
    except Exception as e:
        print('death place not found. error:', e)
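        dplace = ''

    # return one record per person so the results can be collected in a DataFrame
    return {
        "wikidata_item": wikidata_item,
        "description": description,
        "gender": gender,
        "birth_date": bdate,
        "death_date": ddate,
        "birth_place": bplace,
        "death_place": dplace,
    }

# one row per Wikidata item
df = pd.DataFrame([scrape(url) for url in wikidata_links])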




Nicolaus van Aelst
(Q3876454)

Flemish engraver and publisher (1526-1613)
male
1526Gregorian
19 July 1613Gregorian
City of Brussels
Rome

Simon Bening
(Q928015)

Flemish painter and manuscript illuminator (1483-1561)
male
1483Gregorian
6 November 1561
Ghent
Bruges

Matthijs Bril
(Q18757752)

Wikimedia disambiguation page
gender not found. error: 'NoneType' object has no attribute 'find'
birth date not found. error: 'NoneType' object has no attribute 'find'
death date not found. error: 'NoneType' object has no attribute 'find'
birth place not found. error: 'NoneType' object has no attribute 'find'
death place not found. error: 'NoneType' object has no attribute 'find'

Paul Bril
(Q540753)

Flemish painter
male
1556
7 October 1626Gregorian
Antwerp
Rome

Pedro de Campaña
(Q2066867)

Flemish Renaissance painter (1503-1580)
male
1503
1580Gregorian
City of Brussels
City of Brussels

Wenceslas Cobergher
(Q513704)

Flemish painter, draftsman, architect and engineer (1560-1634)
male
1560
23 No