# Scraping from Wikipedia

[The User-Agent Policy of the Wikimedia Foundation](https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy) states that requests should contain a descriptive User-Agent header. The requests below therefore send a header that identifies the bot name and version together with a contact address.

The following code requests a Wikipedia page that lists links to the Wikipedia entries of Dutch painters and collects the links found under the 17th-century heading. We use this list of links later for web scraping.

## 1. Getting a list of links


In [56]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

useragent = {                               # Create a user agent compliant with the Wikimedia policy
    "User-Agent": ("WikiScraper/1.0 (contact@tijndonners.nl) "
                   "requests/2.32.3")}

# This Dutch Wikipedia page lists links to painters, grouped by century
response = requests.get("https://nl.wikipedia.org/wiki/Lijst_van_Nederlandse_kunstschilders", headers=useragent).text
soup = BeautifulSoup(response, "html.parser")

h2_17 = soup.find("h2", id="17e_eeuw")

links = []

# Iterate through the elements that follow h2_17 (all HTML elements after the 17th century header)
for el in h2_17.find_all_next():
    # Stop when the next h2 (18th century) starts
    if el.name == "h2":
        break
    
    # Extract the link tags inside the element
    if el.name == "ul":      #this leaves out figure elements that appear in between the person links
        for a in el.find_all("a", href=True):
            href = "https://nl.wikipedia.org" + a["href"]    #a is a beautifulsoup tag object, ['href'] extracts the second part of the link
            links.append(href)
        
print(len(links))

468
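
The section-bounded walk can be tried offline on a small piece of markup. The sketch below is a minimal illustration, not the notebook's code: `SAMPLE_HTML`, `links_in_section`, and the `Painter_*` paths are invented for the example, and the real page has far more markup between the headings.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure: heading, figure, list, next heading.
SAMPLE_HTML = """
<h2 id="17e_eeuw">17e eeuw</h2>
<figure><a href="/wiki/Bestand:Voorbeeld.jpg">image</a></figure>
<ul>
  <li><a href="/wiki/Painter_A">Painter A</a></li>
  <li><a href="/w/index.php?title=Painter_B&amp;action=edit&amp;redlink=1">Painter B</a></li>
</ul>
<h2 id="18e_eeuw">18e eeuw</h2>
<ul><li><a href="/wiki/Painter_C">Painter C</a></li></ul>
"""


def links_in_section(html, heading_id, base="https://nl.wikipedia.org"):
    """Collect absolute links from <ul> lists between one h2 and the next."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("h2", id=heading_id)
    if start is None:
        return []
    found = []
    for el in start.find_all_next():
        if el.name == "h2":   # the next century starts here
            break
        if el.name == "ul":   # skips the link inside <figure>
            found.extend(base + a["href"] for a in el.find_all("a", href=True))
    return found


section_links = links_in_section(SAMPLE_HTML, "17e_eeuw")
print(section_links)  # two links; Painter_C belongs to the next section
```

The link to the page that does not exist yet keeps its `redlink=1` parameter, which is what makes it recognisable later on.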


### Wikidata Links

Since Wikidata entries contain more structured data than a Wikipedia page does, we want to retrieve the Wikidata links from the Wikipedia links that were just collected.

#### This takes a long time! (minutes)

In [57]:
import json

wikidata_links =[]

#this might take a while! It's processing a lot of requests
for url in links:
    r = BeautifulSoup(requests.get(url, headers= useragent).text, 'html.parser')
    wd = r.select_one("li#t-wikibase a")
    if wd:
        wikidata_links.append(wd["href"])        
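
One way to keep the Wikipedia and Wikidata links aligned from the start is to record a placeholder whenever a page has no Wikidata link. The sketch below assumes the same `li#t-wikibase a` selector; `wikidata_url` and `pair_links` are hypothetical helpers, and the HTML snippets and the item id are made up so the example runs without any requests.

```python
from bs4 import BeautifulSoup


def wikidata_url(html):
    """Return the Wikidata link found in a page's HTML, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.select_one("li#t-wikibase a")
    return anchor["href"] if anchor else None


def pair_links(pages):
    """Map every Wikipedia URL to its Wikidata URL (None when missing).

    `pages` is a dict of URL -> HTML text; in the notebook the HTML would
    come from requests.get(url, headers=useragent).text instead.
    """
    return {url: wikidata_url(html) for url, html in pages.items()}


# Made-up pages: one with a Wikidata link in the sidebar, one without.
pages = {
    "https://nl.wikipedia.org/wiki/Painter_A":
        '<ul><li id="t-wikibase">'
        '<a href="https://www.wikidata.org/wiki/Special:EntityPage/Q00000">Wikidata-item</a>'
        '</li></ul>',
    "https://nl.wikipedia.org/w/index.php?title=Painter_B&redlink=1":
        "<p>Deze pagina bestaat niet.</p>",
}

pairs = pair_links(pages)
missing = [url for url, wd in pairs.items() if wd is None]
print(missing)
```

Filtering on `None` afterwards shows exactly which pages lack a Wikidata item, instead of leaving two lists of different lengths.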

#### To save time, we save the wikidata_links and links variables to JSON files so we can load them locally rather than repeating the requests.


In [58]:
import json

with open("wikidata_links.json", "w", encoding="utf-8") as f:
    json.dump(wikidata_links, f, ensure_ascii=False, indent=2)

with open("links.json", "w", encoding="utf-8") as f:
    json.dump(links, f, ensure_ascii=False, indent=2)

#### Now that we have JSON files with the links saved, we can load them without performing any more requests.

In [61]:
with open("links.json", "r", encoding="utf-8") as f:
    links = json.load(f)

with open("wikidata_links.json", "r", encoding="utf-8") as f:
    wikidata_links = json.load(f)
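
The save-then-load pattern can also be wrapped in one helper that only does the slow work when no file exists yet. This is a sketch under assumptions: `load_or_build` and `slow_build` are hypothetical names, and a temporary directory stands in for the notebook's working directory.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(path, build):
    """Return the JSON content of `path`, creating it with `build()` if absent."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    data = build()  # the slow step, e.g. the scraping loop
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data


calls = []  # records how often the slow step actually runs


def slow_build():
    calls.append(1)
    return ["https://nl.wikipedia.org/wiki/Painter_A"]  # invented link


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "links_demo.json"
    first = load_or_build(cache, slow_build)   # builds and writes the file
    second = load_or_build(cache, slow_build)  # reads the file, no rebuild

print(len(calls))
```

Deleting the JSON file is then enough to force a fresh scrape.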

Some painters do not have a Wikipedia page yet, so their links have no matching Wikidata link. These links should be removed so that both lists have the same length, which is important for creating a DataFrame later.

In [62]:
print(len(wikidata_links))
print(len(links))

# Build a new list instead of removing items while iterating, which can skip entries
links = [link for link in links if 'redlink' not in link]

print(len(wikidata_links))
print(len(links))

456
468
456
456
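
Filtering with a list comprehension is a safe way to drop the red links, because removing items from a list while looping over that same list skips the element that follows each removal. The URLs below are invented for the illustration.

```python
sample = [
    "https://nl.wikipedia.org/wiki/Painter_A",
    "https://nl.wikipedia.org/w/index.php?title=Painter_B&action=edit&redlink=1",
    "https://nl.wikipedia.org/w/index.php?title=Painter_C&action=edit&redlink=1",
    "https://nl.wikipedia.org/wiki/Painter_D",
]

# Removing inside the loop shifts later items one position to the left,
# so the item right after each removed one is never inspected.
in_place = list(sample)
for link in in_place:
    if "redlink" in link:
        in_place.remove(link)

# Building a new list checks every item exactly once.
filtered = [link for link in sample if "redlink" not in link]

print(len(in_place), len(filtered))  # 3 2
```

With two red links in a row, the in-place loop leaves the second one behind, while the comprehension removes both.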


## 2. Scraping data
Now that we have all the Wikidata links alongside the Wikipedia links, we can scrape the articles themselves as well as metadata such as gender, place of birth, and place of death from Wikidata. We add this data to a pandas DataFrame.

#### This takes a lot of time! Grab some coffee. (490 seconds for me)

In [65]:
import re  # needed for the year extraction below

results = {'Name':[],
           'Wikidata Identifier':[],
           'Wikipedia Article':[],
           'Gender':[],
           'Year of Birth':[],
           'Year of Death':[],
           'Place of Birth':[],
           'Place of Death':[]}

for url in links:
    r = BeautifulSoup(requests.get(url, headers=useragent).text, "html.parser")
    text = " ".join(p.get_text() for p in r.find_all("p"))                    #Note that section headings are not included
    results['Wikipedia Article'].append(text.replace('\n', ' '))

    
for url in wikidata_links:
    r = BeautifulSoup(requests.get(url, headers= useragent).text, 'html.parser')
    
    #scrape the title and wikidata item identifier
    wikidata_item = r.find('h1').text.replace('\n', ' ').strip().rsplit(" ",1)
    results['Wikidata Identifier'].append(wikidata_item[1].replace('(','').replace(')',''))
    results['Name'].append(wikidata_item[0])

    #scrape birth year
    try:
        byear = r.find("div", id="P569").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Year of Birth'].append(int(re.search(r"\d{4}", byear).group(0)))  #extract only the 4 digit sequence and remove clutter
    except Exception as e:
        print(e)
        results['Year of Birth'].append('')

    #scrape death year
    try:
        dyear = r.find("div", id="P570").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Year of Death'].append(int(re.search(r"\d{4}", dyear).group(0)))
    except Exception as e:
        print(e)
        results['Year of Death'].append('')
    
    
    #scrape gender
    try:
        gender = r.find("div", id="P21").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text  #P21 is the wikidata property tag for sex or gender
        results['Gender'].append(gender)
    except Exception as e:
        results['Gender'].append('')

    #scrape birth place
    try:
        bplace = r.find("div", id="P19").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Place of Birth'].append(bplace)
    except Exception as e:
        results['Place of Birth'].append('')

    #scrape death place
    try:
        dplace = r.find("div", id="P20").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Place of Death'].append(dplace)
    except Exception as e:
        results['Place of Death'].append('')

df = pd.DataFrame(results)
df.head(10)

'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'find'
'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'find'


| | Name | Wikidata Identifier | Wikipedia Article | Gender | Year of Birth | Year of Death | Place of Birth | Place of Death |
|---|---|---|---|---|---|---|---|---|
| 0 | Johannes van der Aeck | Q1033616 | Johannes Claesz. van der Aeck of Aack (gedoopt... | male | 1636 | 1682 | Leiden | Leiden |
| 1 | Evert van Aelst | Q759747 | Evert van Aelst (Delft, 1602 - aldaar, 19 febr... | male | 1602 | 1657 | Delft | Delft |
| 2 | Willem van Aelst | Q553273 | Willem van Aelst of Guillermo van Aelst (Delft... | male | 1627 | 1679 | Delft | Amsterdam |
| 3 | Jan van Aken | Q6150343 | Jan van Aken, ook Jean van Aken (Amsterdam, ca... | male | 1614 | 1661 | Kampen | Amsterdam |
| 4 | Herman van Aldewereld | Q1703946 | Herman van Aldewereld (1628/1629, Amsterdam – ... | male | 1620 | 1669 | Amsterdam | Amsterdam |
| 5 | Philips Angel | Q457712 | Philips Angel I of Philips Angel van Middelbur... | male | 1616 | 1683 | Middelburg | Middelburg |
| 6 | Philips Angel II | Q1562098 | Philips Angel II of Philips Angel van Leiden (... | male | 1618 | 1664 | Leiden | Batavia |
| 7 | Pieter van Anraedt | Q2311583 | Pieter van Anraedt, ook Pieter van Anraadt, (U... | male | 1635 | 1678 | Utrecht | Deventer |
| 8 | Arnoldus van Anthonissen | Q696967 | Arnoldus van Anthonissen (Leiden, gedoopt op 1... | male | 1631 | 1703 | Leiden | Zierikzee |
| 9 | Hendrick van Anthonissen | Q920814 | Hendrick van Anthonissen (Amsterdam, 29 mei 16... | male | 1605 | 1656 | Amsterdam | Amsterdam |
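
The `'NoneType'` messages printed during the scrape correspond to two situations: `r.find("div", id=...)` returns `None` when an item has no block for that property, and `re.search` returns `None` when the value contains no four-digit year. A hedged sketch of helpers that handle both cases without `try`/`except` follows. `claim_text` and `extract_year` are hypothetical names, `class_=` is used in place of the positional `attrs` string, and the HTML is a simplified stand-in for a Wikidata page.

```python
import re

from bs4 import BeautifulSoup

VALUE_CLASS = "wikibase-snakview-value wikibase-snakview-variation-valuesnak"


def claim_text(soup, prop_id):
    """Return the text of the first value for a property, or None if absent."""
    block = soup.find("div", id=prop_id)
    if block is None:          # the item has no statement for this property
        return None
    value = block.find("div", class_=VALUE_CLASS)
    return value.get_text(strip=True) if value is not None else None


def extract_year(text):
    """Return the first four-digit sequence as an int, or None if there is none."""
    match = re.search(r"\d{4}", text or "")
    return int(match.group(0)) if match else None


# Simplified, invented markup: a precise birth year, a vague death date,
# and no place of birth (P19) at all.
SAMPLE = """
<div id="P569"><div class="wikibase-snakview-value wikibase-snakview-variation-valuesnak">1636</div></div>
<div id="P570"><div class="wikibase-snakview-value wikibase-snakview-variation-valuesnak">17th century</div></div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
birth = extract_year(claim_text(soup, "P569"))
death = extract_year(claim_text(soup, "P570"))
birthplace = claim_text(soup, "P19")
print(birth, death, birthplace)  # 1636 None None
```

Returning `None` for every missing value also keeps the columns consistent, rather than mixing integers with empty strings.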


#### Let's quickly save this DataFrame to a CSV after all the waiting.

In [66]:
df.to_csv('data.csv', index=False)

Now we can load the local CSV instead of scraping again.

In [72]:
# dtype is required: by default read_csv turns integer columns with missing values into floats
df = pd.read_csv('data.csv', dtype={"Year of Birth": "Int64", "Year of Death": "Int64"})
df.head(10)

| | Name | Wikidata Identifier | Wikipedia Article | Gender | Year of Birth | Year of Death | Place of Birth | Place of Death |
|---|---|---|---|---|---|---|---|---|
| 0 | Johannes van der Aeck | Q1033616 | Johannes Claesz. van der Aeck of Aack (gedoopt... | male | 1636 | 1682 | Leiden | Leiden |
| 1 | Evert van Aelst | Q759747 | Evert van Aelst (Delft, 1602 - aldaar, 19 febr... | male | 1602 | 1657 | Delft | Delft |
| 2 | Willem van Aelst | Q553273 | Willem van Aelst of Guillermo van Aelst (Delft... | male | 1627 | 1679 | Delft | Amsterdam |
| 3 | Jan van Aken | Q6150343 | Jan van Aken, ook Jean van Aken (Amsterdam, ca... | male | 1614 | 1661 | Kampen | Amsterdam |
| 4 | Herman van Aldewereld | Q1703946 | Herman van Aldewereld (1628/1629, Amsterdam – ... | male | 1620 | 1669 | Amsterdam | Amsterdam |
| 5 | Philips Angel | Q457712 | Philips Angel I of Philips Angel van Middelbur... | male | 1616 | 1683 | Middelburg | Middelburg |
| 6 | Philips Angel II | Q1562098 | Philips Angel II of Philips Angel van Leiden (... | male | 1618 | 1664 | Leiden | Batavia |
| 7 | Pieter van Anraedt | Q2311583 | Pieter van Anraedt, ook Pieter van Anraadt, (U... | male | 1635 | 1678 | Utrecht | Deventer |
| 8 | Arnoldus van Anthonissen | Q696967 | Arnoldus van Anthonissen (Leiden, gedoopt op 1... | male | 1631 | 1703 | Leiden | Zierikzee |
| 9 | Hendrick van Anthonissen | Q920814 | Hendrick van Anthonissen (Amsterdam, 29 mei 16... | male | 1605 | 1656 | Amsterdam | Amsterdam |
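
The `dtype` argument matters because of the empty strings written for missing years: pandas reads them back as missing values, and a default integer column cannot hold those, so the column falls back to floats. A small in-memory illustration (the two-row CSV is invented):

```python
import io

import pandas as pd

# Invented data: the second painter has no known year of birth.
csv_text = "Name,Year of Birth\nPainter A,1636\nPainter B,\n"

default = pd.read_csv(io.StringIO(csv_text))
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"Year of Birth": "Int64"})

print(default["Year of Birth"].dtype)   # float64
print(nullable["Year of Birth"].dtype)  # Int64
```

With the nullable `Int64` type the known year stays the integer 1636 and the missing one becomes `<NA>`.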


## 3. Text cleaning

#### When looking at the text scraped from the Wikipedia pages, two main problems occur:
1. Citations like [2] and [5]
2. Some articles have a standard text: 'This article about a Dutch painter is a stub. You can help Wikipedia by expanding it.'

These elements will be removed (using regex when necessary).

Additionally, the first entry in the current DataFrame is not a Dutch painter but a Wikipedia article about Dutch Golden Age painting. It appears in the first row and can be removed.

In [75]:
import re

def clean_text(text):
    cleaned_text = re.sub(r"\[\d+\]", '', text)
    cleaned_text = cleaned_text.replace("This article about a Dutch painter is a stub. You can help Wikipedia by expanding it.", " ").strip()
    return cleaned_text
    
cleaned_df = df.copy()

cleaned_df['Wikipedia Article'] = cleaned_df['Wikipedia Article'].apply(clean_text) #note that this overwrites the original column
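
A quick check of the cleaning step on an invented sentence. This variant is a sketch with two assumptions beyond the original function: it returns an empty string for non-text values (such as the `NaN` that `read_csv` produces for an empty article), and it collapses the double spaces left behind by removals. The name `clean_text_safe` is hypothetical.

```python
import re

STUB_NOTICE = ("This article about a Dutch painter is a stub. "
               "You can help Wikipedia by expanding it.")


def clean_text_safe(text):
    """Remove citation markers and the stub notice; tolerate missing values."""
    if not isinstance(text, str):   # e.g. NaN for an empty article
        return ""
    text = re.sub(r"\[\d+\]", "", text)       # citations like [2] and [15]
    text = text.replace(STUB_NOTICE, " ")
    return re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace


sample = "Painter A was born in Delft.[2] He died in Leiden.[15] " + STUB_NOTICE
print(clean_text_safe(sample))
# Painter A was born in Delft. He died in Leiden.
```

The pattern `\[\d+\]` only matches digits between square brackets, so bracketed words in the running text are left alone.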

#### Save the cleaned DataFrame to a CSV


In [76]:
cleaned_df.to_csv('cleaned_data.csv', index=False)