# Scraping from Wikipedia

[The User-Agent Policy of the Wikimedia Foundation](https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy) states that requests should contain a descriptive User-Agent header. The requests below therefore send a header that identifies the bot's name and version together with a contact address on my domain.

The following code requests a Wikipedia page that contains a list of links to the Wikipedia entries of 17th-century Dutch painters. The code collects these links in a list that we later use for web scraping.

The following libraries are used:
- pandas
- requests
- beautifulsoup4 (imported as bs4)
- re
- json
- geopy / Nominatim

## 1. Getting a list of links (skip to 1.1 to save time on the scraping process)

In [12]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Create a User-Agent header that complies with the Wikimedia policy
useragent = {
    "User-Agent": ("WikiScraper/1.0 (contact@tijndonners.nl) "
                   "requests/2.32.3")}

# This Dutch Wikipedia page lists the 17th-century painters that have a Wikipedia page
response = requests.get("https://nl.wikipedia.org/wiki/Lijst_van_Nederlandse_kunstschilders", headers=useragent).text
soup = BeautifulSoup(response, "html.parser")

h2_17 = soup.find("h2", id="17e_eeuw")

links = []

# Iterate through all HTML elements that follow the 17th-century header
for el in h2_17.find_all_next():
    # Stop when the next h2 (18th century) starts
    if el.name == "h2":
        break
    
    # Extract the link tags inside the element
    if el.name == "ul":      #this leaves out figure elements that appear in between the person links
        for a in el.find_all("a", href=True):
            href = "https:" + a["href"]    # a is a BeautifulSoup Tag; a["href"] returns its link target, to which the scheme is prepended
            links.append(href)
            
print(links[0:5])

['https://nl.wikipedia.org/wiki/Johannes_van_der_Aeck', 'https://nl.wikipedia.org/wiki/Evert_van_Aelst', 'https://nl.wikipedia.org/wiki/Willem_van_Aelst', 'https://nl.wikipedia.org/wiki/Jan_van_Aken_(1614-1661)', 'https://nl.wikipedia.org/wiki/Herman_van_Aldewereld']


### Wikidata Links

Since Wikidata entries contain more structured data than Wikipedia pages do, we retrieve the Wikidata link for each of the Wikipedia links that were just collected.

#### This takes a long time! (minutes)

In [13]:
import json

wikidata_links =[]

#this might take a while! It's processing a lot of requests
for url in links:
    r = BeautifulSoup(requests.get(url, headers= useragent).text, 'html.parser')
    wd = r.select_one("li#t-wikibase a") #identify the element that contains the wikidata link
    if wd:
        wikidata_links.append(wd["href"]) #if there is a link, append it to the wikidata_links list  

print(wikidata_links[0:5])

['https://www.wikidata.org/wiki/Special:EntityPage/Q1033616', 'https://www.wikidata.org/wiki/Special:EntityPage/Q759747', 'https://www.wikidata.org/wiki/Special:EntityPage/Q553273', 'https://www.wikidata.org/wiki/Special:EntityPage/Q6150343', 'https://www.wikidata.org/wiki/Special:EntityPage/Q1703946']
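
As an alternative to loading every article page, the MediaWiki Action API can report the Wikidata item for up to 50 page titles per request through `prop=pageprops`. The sketch below is not the method used in this notebook. The helper names `title_from_url`, `build_params` and `parse_wikidata_ids` are hypothetical, and the sample payload is hand-written to illustrate the response shape rather than copied from a real response.

```python
from urllib.parse import unquote

API_URL = "https://nl.wikipedia.org/w/api.php"  # MediaWiki Action API endpoint

def title_from_url(url):
    """Turn '.../wiki/Jan_van_Aken_(1614-1661)' into 'Jan van Aken (1614-1661)'."""
    return unquote(url.rsplit("/wiki/", 1)[-1]).replace("_", " ")

def build_params(titles):
    """Query parameters for one batch of page titles (at most 50 per request)."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),
        "format": "json",
    }

def parse_wikidata_ids(payload):
    """Map page title -> Wikidata ID, or None when the page has no linked item."""
    pages = payload.get("query", {}).get("pages", {})
    return {
        page["title"]: page.get("pageprops", {}).get("wikibase_item")
        for page in pages.values()
    }

# Hand-written sample in the shape of an API response (assumption, not real output)
sample = {"query": {"pages": {
    "123": {"pageid": 123, "title": "Evert van Aelst",
            "pageprops": {"wikibase_item": "Q759747"}},
    "-1": {"title": "Nonexistent painter", "missing": ""},
}}}
ids = parse_wikidata_ids(sample)
```

A real call would look like `requests.get(API_URL, params=build_params(batch), headers=useragent).json()`. Because the result is keyed by page title, pages without a Wikidata item stay visible as `None` instead of silently shortening the list.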


In [15]:
print(len(wikidata_links))
print(len(links))

456
468


### To reduce request overhead and save time, we cache the wikidata_links and links variables in local JSON files for reuse.


In [18]:
import json

with open("wikidata_links.json", "w", encoding="utf-8") as f:
    json.dump(wikidata_links, f, ensure_ascii=False, indent=2)

with open("links.json", "w", encoding="utf-8") as f:
    json.dump(links, f, ensure_ascii=False, indent=2)


## 1.1  Here you can import the results from all previous code without performing the lengthy request process:

In [19]:
import json

with open("links.json", "r", encoding="utf-8") as f:
    links = json.load(f)

with open("wikidata_links.json", "r", encoding="utf-8") as f:
    wikidata_links = json.load(f)
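
The save and load steps can also be folded into one helper that only runs the slow scraping step when no cache file exists yet. This is a minimal sketch; `load_or_build` and `slow_scrape` are hypothetical names, and the demo writes to a temporary directory instead of the notebook's own JSON files.

```python
import json
import tempfile
from pathlib import Path

def load_or_build(path, build):
    """Return the JSON cached at `path`; if absent, call `build()` and cache the result."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    data = build()
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data

# Demo: the stand-in for the scraping step runs only on the first call
calls = []

def slow_scrape():
    calls.append(1)
    return ["https://nl.wikipedia.org/wiki/Evert_van_Aelst"]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "links_demo.json"
    first = load_or_build(cache_file, slow_scrape)   # builds and writes the cache
    second = load_or_build(cache_file, slow_scrape)  # reads the cache
```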

Some painters do not have a Wikipedia page yet. Their links must be removed from the list so that both lists contain the same number of links, which is important for creating a DataFrame later.

In [23]:
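# Build a new list rather than removing items while iterating, which would skip adjacent matches
links = [link for link in links if 'redlink' not in link]   # the word 'redlink' identifies links to pages that do not exist yet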
        
if len(links) == len(wikidata_links):
    print(f'lists have equal length: {len(links)} & {len(wikidata_links)}')

lists have equal length: 456 & 456
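
Filtering a list needs some care: removing items from a list while looping over it shifts the remaining items, so the element right after a removed one is never examined. The toy example below (the link strings are made up) contrasts in-place removal with building a new list.

```python
items = ["a", "x?redlink=1", "y?redlink=1", "b"]

# In-place removal: after "x?redlink=1" is removed at index 1, the loop
# continues at index 2, which now holds "b", so "y?redlink=1" is skipped.
buggy = list(items)
for link in buggy:
    if "redlink" in link:
        buggy.remove(link)

# A list comprehension checks every element exactly once.
fixed = [link for link in items if "redlink" not in link]

print(buggy)
print(fixed)
```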


## 2. Scraping data
Now that we have all the Wikidata links alongside the Wikipedia links, we can scrape the articles themselves as well as metadata such as gender, place of birth and place of death from Wikidata. We add this data to a pandas DataFrame.

**Please note that since this data concerns historical persons, not all data is available on wikipedia/wikidata. Therefore, missing values will be assigned 'na'.**

## This takes a lot of time! Grab some coffee. (c. 8 minutes for me)

In [None]:
import re

# initiate a dictionary for the scraped data
results = {'Name':[],
           'Wikidata Identifier':[],
           'Wikipedia Article':[],
           'Gender':[],
           'Year of Birth':[],
           'Year of Death':[],
           'Place of Birth':[],
           'Place of Death':[]} 

for url in links:
    r = BeautifulSoup(requests.get(url, headers=useragent).text, "html.parser")
    text = " ".join(p.get_text() for p in r.find_all("p"))                    #Note that headers are not included
    results['Wikipedia Article'].append(text.replace('\n', ' '))

    
for url in wikidata_links:
    r = BeautifulSoup(requests.get(url, headers= useragent).text, 'html.parser')
    
    #scrape the title and wikidata item identifier
    wikidata_item = r.find('h1').text.replace('\n', ' ').strip().rsplit(" ",1)
    results['Wikidata Identifier'].append(wikidata_item[1].replace('(','').replace(')',''))
    results['Name'].append(wikidata_item[0])

    # For the rest of the data, try blocks are used: when a value is missing, find() returns None
    # and the chained call raises an error. Instead of the code breaking, 'na' is added to the results dictionary.
    
    # scrape birth year
    try:
        byear = r.find("div", id="P569").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Year of Birth'].append(int(re.search(r"\d{4}", byear).group(0)))  #extract only the 4 digit sequence and remove clutter
    except Exception as e:
        results['Year of Birth'].append('na')

    # scrape death year
    try:
        dyear = r.find("div", id="P570").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Year of Death'].append(int(re.search(r"\d{4}", dyear).group(0)))
    except Exception as e:
        results['Year of Death'].append('na')
    
    # scrape gender
    try:
        gender = r.find("div", id="P21").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text  #P21 is the wikidata property tag for sex or gender
        results['Gender'].append(gender)
    except Exception as e:
        results['Gender'].append('na')

    # scrape birth place
    try:
        bplace = r.find("div", id="P19").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Place of Birth'].append(bplace)
    except Exception as e:
        results['Place of Birth'].append('na')

    # scrape death place
    try:
        dplace = r.find("div", id="P20").find("div", attrs="wikibase-snakview-value wikibase-snakview-variation-valuesnak").text
        results['Place of Death'].append(dplace)
    except Exception as e:
        results['Place of Death'].append('na')

df = pd.DataFrame(results)
df.head()
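
Wikidata also serves each item as JSON at `Special:EntityData/<QID>.json`, which avoids depending on the CSS classes of the HTML page. The following is a hedged sketch of how the same properties could be read from that JSON; it is not the approach used above. The helper names are hypothetical, and `sample_entity` is a hand-written fragment in the shape of the `claims` section, with a placeholder place ID.

```python
import re

def entity_data_url(qid):
    """URL of the JSON representation of a Wikidata item."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

def first_claim_value(entity, prop):
    """Return the value of the first statement for `prop` that has one, else None."""
    for statement in entity.get("claims", {}).get(prop, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue is not None:
            return datavalue["value"]
    return None

def year_of(time_value):
    """Extract the year from a time value such as {'time': '+1636-00-00T00:00:00Z'}."""
    if not time_value:
        return None
    match = re.match(r"[+-](\d+)-", time_value["time"])
    return int(match.group(1)) if match else None

# Hand-written fragment (assumption); a real response nests this under
# payload["entities"][qid]. "Q999999999" is a placeholder, not a real place ID.
sample_entity = {"claims": {
    "P569": [{"mainsnak": {"snaktype": "value", "datavalue": {
        "type": "time", "value": {"time": "+1636-00-00T00:00:00Z"}}}}],
    "P570": [{"mainsnak": {"snaktype": "somevalue"}}],  # unknown value: no datavalue
    "P19": [{"mainsnak": {"snaktype": "value", "datavalue": {
        "type": "wikibase-entityid", "value": {"id": "Q999999999"}}}}],
}}

birth_year = year_of(first_claim_value(sample_entity, "P569"))
death_year = year_of(first_claim_value(sample_entity, "P570"))
birth_place_id = first_claim_value(sample_entity, "P19")["id"]
```

Note that places and gender come back as item IDs rather than labels, so this route would need an extra lookup to obtain names such as "Leiden".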

## 3. Text cleaning

The Wikipedia articles contain citation markers of the form [1] and [2]. These are removed.

In [25]:
def clean_text(text):
    return re.sub(r"\[\d+\]", '', text) #regex is used to replace citations with empty strings
    
cleaned_df = df.copy()

cleaned_df['Wikipedia Article'] = cleaned_df['Wikipedia Article'].apply(clean_text) #note that this overwrites the original column
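
To illustrate what the pattern does and does not match, here is the same function applied to a made-up sentence. Only bracketed digit sequences are removed; other bracketed notes are left in place.

```python
import re

def clean_text(text):
    # Remove numeric citation markers such as [1] or [12]
    return re.sub(r"\[\d+\]", "", text)

sample = "Hij werkte in Delft.[1] Later verhuisde hij naar Amsterdam.[12][bron?]"
cleaned = clean_text(sample)
print(cleaned)
```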

## 4. Cleaning the data
A manual check of the data reveals a few anomalies. One way to find rows that are not persons is to check whether the gender is something other than male or female.

In [26]:
# check if data entry is a person:
filtered_df = cleaned_df[cleaned_df['Gender'].isin(['na'])]
filtered_df.head()

| | Name | Wikidata Identifier | Wikipedia Article | Gender | Year of Birth | Year of Death | Place of Birth | Place of Death |
|---|---|---|---|---|---|---|---|---|
| 128 | Groningen | Q749 | De Nederlandse stad Groningen uitspraakⓘ (Gron... | na | na | na | na | na |
| 263 | Cambrai | Q181285 | Cambrai (Nederlands, in historische context ge... | na | na | na | na | na |
| 326 | Jan and Jacob Pynas | Q14632488 | De broers Jan Symonsz Pynas (ca. 1583, Alkmaar... | na | na | na | na | na |
| 327 | Jan and Jacob Pynas | Q14632488 | De broers Jan Symonsz Pynas (ca. 1583, Alkmaar... | na | na | na | na | na |


As we can see, Groningen and Cambrai (cities) and the brothers Jan and Jacob Pynas show up. Since no metadata is available for the brothers, we remove them from the data along with the two cities.

In [27]:
cleaned_df = cleaned_df[cleaned_df['Gender'].isin(['male','female'])] #filters out the unwanted rows

#### Now we check for duplicates and remove them

In [28]:
cleaned_df['Wikidata Identifier'].value_counts()

Wikidata Identifier
Q339270     2
Q2459649    2
Q1854019    2
Q2057521    2
Q944834     2
           ..
Q864447     1
Q2112735    1
Q2462870    1
Q2103594    1
Q5529917    1
Name: count, Length: 447, dtype: int64

We can see some duplicate values!
In the output above, we see **Length: 447**, which means there are 447 unique Wikidata IDs and therefore 447 painters.
In the code below, the duplicates are removed.

In [29]:
cleaned_df = cleaned_df.drop_duplicates('Wikidata Identifier') #removes duplicate rows (keeps the first row by default, but this does not matter much in this case)
cleaned_df.shape

(447, 8)

## 5. Adding coordinates
In order to make some geographical visualizations, we need to geocode the birth and death places and add them to the dataframe.
The coordinates are decimal degrees (WGS84, EPSG:4326) in (lat,lon) format.

#### Once again, this takes a while (c. 16 minutes for me)
The RateLimiter may log timeout errors in the output cell. It retries those requests automatically, so they do not affect the result.

In [30]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

df = cleaned_df.copy() 

# load geocoder
geolocator = Nominatim(user_agent="tijn_d_scraper")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1) # wait at least 1 second between calls to respect the rate limit of Nominatim (OpenStreetMap)

# geocode function
def get_coords(place):
    try:
        location = geocode(place)
        if location:
            return location.latitude, location.longitude
        else:
            return "na"
    except Exception:   # a bare except would also swallow KeyboardInterrupt
        return "na"

df['Birth Coordinates'] = df['Place of Birth'].apply(get_coords)
df['Death Coordinates'] = df['Place of Death'].apply(get_coords)
df.head()

RateLimiter caught an error, retrying (0/2 tries). Called with (*('Amsterdam',), **{}).
Traceback (most recent call last):
  File "/home/tijn-do/anaconda3/lib/python3.12/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/home/tijn-do/anaconda3/lib/python3.12/site-packages/urllib3/connection.py", line 507, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tijn-do/anaconda3/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/home/tijn-do/anaconda3/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/home/tijn-do/anaconda3/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  

| | Name | Wikidata Identifier | Wikipedia Article | Gender | Year of Birth | Year of Death | Place of Birth | Place of Death | Birth Coordinates | Death Coordinates |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Johannes van der Aeck | Q1033616 | Johannes Claesz. van der Aeck of Aack (gedoopt... | male | 1636 | 1682 | Leiden | Leiden | (52.1594747, 4.4908843) | (52.1594747, 4.4908843) |
| 1 | Evert van Aelst | Q759747 | Evert van Aelst (Delft, 1602 - aldaar, 19 febr... | male | 1602 | 1657 | Delft | Delft | (52.0114017, 4.35839) | (52.0114017, 4.35839) |
| 2 | Willem van Aelst | Q553273 | Willem van Aelst of Guillermo van Aelst (Delft... | male | 1627 | 1679 | Delft | Amsterdam | (52.0114017, 4.35839) | (52.3730796, 4.8924534) |
| 3 | Jan van Aken | Q6150343 | Jan van Aken, ook Jean van Aken (Amsterdam, ca... | male | 1614 | 1661 | Kampen | Amsterdam | (52.5559484, 5.9033303) | (52.3730796, 4.8924534) |
| 4 | Herman van Aldewereld | Q1703946 | Herman van Aldewereld (1628/1629, Amsterdam – ... | male | 1620 | 1669 | Amsterdam | Amsterdam | (52.3730796, 4.8924534) | (52.3730796, 4.8924534) |
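
Many painters share a place of birth or death, so geocoding each distinct place name once would cut the number of requests considerably and give every occurrence of a place the same coordinates. The sketch below assumes a geocoder callable like the rate-limited `geocode` above; `geocode_unique` is a hypothetical helper, and `fake_geocode` is a stand-in so the example runs without network access. It also leaves the 'na' placeholder untouched instead of sending it to the geocoder.

```python
from types import SimpleNamespace

def geocode_unique(places, geocode_fn, missing="na"):
    """Geocode each distinct place once; return {place: (lat, lon) or `missing`}."""
    coords = {}
    for place in places:
        if place in coords:
            continue                      # already looked up
        if place == missing:
            coords[place] = missing       # never send the placeholder to the geocoder
            continue
        try:
            location = geocode_fn(place)
        except Exception:
            location = None               # treat a failed lookup as missing
        coords[place] = (location.latitude, location.longitude) if location else missing
    return coords

# Stand-in geocoder for the demo (assumption): knows only Delft
calls = []

def fake_geocode(place):
    calls.append(place)
    if place == "Delft":
        return SimpleNamespace(latitude=52.0114017, longitude=4.35839)
    return None

lookup = geocode_unique(["Delft", "na", "Delft", "Atlantis"], fake_geocode)
```

With the real geocoder, the lookup could then be applied with `df['Place of Birth'].map(lookup)`.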


### Dutch Painters from Namibia?

For the 'na' values for place of birth/death, Nominatim returned the coordinates of Namibia, probably because 'na' was interpreted as the country code NA. We therefore replace the coordinates of Namibia (-23.2335499, 17.3231107) with 'na'.

In [51]:
def na_cleaning(coords):
    if coords == (-23.2335499, 17.3231107):
        return 'na'
    else:
        return coords

df['Birth Coordinates'] = df['Birth Coordinates'].apply(na_cleaning)
df['Death Coordinates'] = df['Death Coordinates'].apply(na_cleaning)


#### Unfortunately, geopy has assigned slightly different coordinates to some places (like The Hague and Haarlem), so we have to normalize them first

In [2]:
def normalize_coords(df, place_col, coord_col):
    # create a dictionary of place -> first coordinate found
    mapping = df.groupby(place_col)[coord_col].first()  # takes the first coordinate found for each place name and maps it to that place name
    # apply mapping to column
    df[coord_col] = df[place_col].map(mapping)

# applying the function
normalize_coords(df, 'Place of Birth', 'Birth Coordinates')
normalize_coords(df, 'Place of Death', 'Death Coordinates')

Birth Coordinates
(52.3730796, 4.8924534)    69
(52.3885317, 4.6388048)    62
(52.0799838, 4.3113461)    32
(52.1594747, 4.4908843)    29
(52.0114017, 4.35839)      27
                           ..
(51.782164, 5.1898267)      1
(52.0373934, 4.3225028)     1
(50.6450944, 5.5736112)     1
(51.5725501, 8.1061259)     1
(50.6365654, 3.0635282)     1
Name: count, Length: 84, dtype: int64

### Finally, we save the dataframe as a CSV File

In [3]:
df.to_csv("data.csv", index=False)

**Continue to the folium_analysis.ipynb notebook in this repo to see geographical visualizations and analysis of the collected data.**
