# Scraper for collecting Belgian deaths
## 1. Background
This program scrapes the deaths reported in Belgium that are publicly available on inmemoriam.be. It was made to get an idea of the impact of COVID-19 on the Belgian death rate.
The parameters are set to collect all deaths from 1 January 2009 up to 1 January 2021. Because the scraper was last run on 18 October 2020, data for the final months of 2020 is not yet included.
The collected data consists of the name of the deceased (`name`), their age (`age`), the date of death (`date`), the place of death (`location`), and separate columns for the month (`month`), week (`week`) and day (`day`) of the year.

## 2. Necessary packages
In order to build the scraper, the following packages were imported and used.

In [12]:
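# Standard library (sys, time and inspect are used by the retry logic in get_deaths)
import sys
import time
import inspect
import datetime as dt
import multiprocessing

# Third-party packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from joblib import Parallel, delayed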


## 3. Scraper
### 3.1 Set parameters
First, the parameters for the scraper are set to cover the desired date range. Because the website splits the results across several pages, the first page is requested and its pagination links are read to find the total number of pages that need to be scraped.

In [2]:
current_page = 1
begin_date = "2009-01-01"
end_date = "2021-01-01"
url = "https://www.inmemoriam.be/nl/rouwberichten/?page=" + str(current_page) + "&filter=&periodStart=" + str(begin_date) + "&periodEnd=" + str(end_date) + "&yearOfBirth=&undertakerId=&placeOfResidence=&provinceId=&newsPaper=&obituary=1"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

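# The pagination links hold the page numbers as text.
page_list = soup.find_all('a', class_ = 'c-pagination__item')

# Collect the text of every pagination link.
pages = [link.get_text() for link in page_list]

# The second-to-last link is taken as the highest page number
# (presumably the last link is the "next page" control).
end_page = pages[-2]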
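
The long concatenated URL above is easier to maintain when the query string is built from a dictionary. The following is a minimal sketch, assuming the same query parameters as in the cell above; `BASE_URL` and `build_url` are hypothetical names that are not part of the original scraper.

```python
from urllib.parse import urlencode

# Base address taken from the URL used in the cell above.
BASE_URL = "https://www.inmemoriam.be/nl/rouwberichten/"

def build_url(page, begin_date, end_date):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        "page": page,
        "filter": "",
        "periodStart": begin_date,
        "periodEnd": end_date,
        "yearOfBirth": "",
        "undertakerId": "",
        "placeOfResidence": "",
        "provinceId": "",
        "newsPaper": "",
        "obituary": 1,
    }
    # urlencode keeps the dictionary order and escapes values where needed.
    return BASE_URL + "?" + urlencode(params)

print(build_url(1, "2009-01-01", "2021-01-01"))
```

With a helper like this, the same function could produce the URL for the first page and for every page requested later, so the two URL strings cannot drift apart.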

### 3.2 Collect the data
Next, the scraper is run using the parameters set above, collecting `name`, `age`, `date` and `location` for each deceased person in the set date range. The pages are fetched in parallel with joblib, each request is retried up to three times, and the per-page results are returned as a list of dataframes stored in `df`. With thousands of result pages to scrape (14570 in the run shown below), this process takes a long time to finish.

In [14]:
df = pd.DataFrame()
df["name"] = []
df["age"] = []
df["date"] = []
df["location"] = []
page_list = np.arange(current_page,int(end_page) + 1)
print("Total amount of pages is: ", end_page)

def get_deaths(num):
    print("Page ", num, " is being scraped now.")
    tempdf = pd.DataFrame()
    url = "https://www.inmemoriam.be/nl/rouwberichten/?page=" + str(num) + "&filter=&periodStart=" + str(begin_date) + "&periodEnd=" + str(end_date) + "&yearOfBirth=&undertakerId=&placeOfResidence=&provinceId=&newsPaper=&obituary=1"
    # Try the request up to three times before giving up.
    for i in [1,2,3]:
        try:  
            page = requests.get(url, timeout=30)
            msg = page.text
            if msg: break
        except Exception as e:
            sys.stderr.write('Got error when requesting URL "' + url + '": ' + str(e) + '\n')
            if i == 3 :
                sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL ==> {1}\n'.format(inspect.getframeinfo(inspect.currentframe()), url))
                raise e
            # Waits 0 s after the first failure and 10 s after the second.
            time.sleep(10*(i-1))

    soup = BeautifulSoup(page.content, 'html.parser')

    name_list = soup.find_all('h3', class_ = 'c-deceased__name')
    age_list = soup.find_all('span', class_ = 'c-deceased__age')
    date_list = soup.find_all('div', class_ = 'c-deceased__departed')
    location_list = soup.find_all('div', class_ = 'c-deceased__location')
    
    names = []
    ages = []
    dates = []
    locations = []

    for name in name_list:
        name = name.get_text()
        names.append(name)

    for age in age_list:
        age = age.get_text()
        age = int(age[0:-5])
        ages.append(age)

    for location in location_list:
        location = location.get_text()
        locations.append(location)

    for date in date_list:
        date = date.get_text()
        date = date[11:-1]
        date = dt.datetime.strptime(date, "%d/%m/%Y")
        dates.append(date)

    ages = Series(ages)
    locations = Series(locations)
    dates = Series(dates)
    names = Series(names)
    
    tempdf["name"] = names
    tempdf["age"] = ages
    tempdf["date"] = dates
    tempdf["location"] = locations
    return tempdf

df = Parallel(n_jobs=-1, verbose=10, backend="loky")(map(delayed(get_deaths), page_list))

Total amount of pages is:  14570
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   11.7s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   15.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   26.9s
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   34.1s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   44.9s
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:   54.2s
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]:
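
The retry loop inside `get_deaths` can also be written as a small reusable function. The sketch below is one possible way to do this and is not the original implementation: `fetch_with_retries` is a hypothetical helper, and unlike the loop above it also waits after the first failure. The `sleep` parameter exists only so that the example can run without real delays.

```python
import time

def fetch_with_retries(fetch, attempts=3, base_delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out (hypothetical helper).

    fetch is any zero-argument callable, for example
    lambda: requests.get(url, timeout=30).
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # give up and surface the last error
            # Wait longer after each failure: base_delay, 2 * base_delay, ...
            sleep(base_delay * attempt)

# Usage with a stand-in function that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retries(flaky, sleep=lambda seconds: None)
print(result, calls["n"])
```

Keeping the retry policy in one place makes it easier to change the number of attempts or the waiting time without touching the parsing code.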

In [15]:
# Combine the per-page dataframes returned by Parallel into one dataframe.
new_df = pd.concat(df)

print(new_df.shape)

(174759, 4)
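
The fixed slices used in `get_deaths` (`age[0:-5]` and `date[11:-1]`) depend on the exact length of the text around the number and the date. A more tolerant alternative is to search for the pattern itself. The following is a minimal sketch under that assumption; `parse_age` and `parse_date` are hypothetical helpers, and the sample strings are invented for illustration, so the real page text may differ.

```python
import re
import datetime as dt

def parse_age(text):
    """Return the first integer found in text, or None (hypothetical helper)."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def parse_date(text):
    """Return the first dd/mm/yyyy date in text as a datetime, or None."""
    match = re.search(r"\d{2}/\d{2}/\d{4}", text)
    return dt.datetime.strptime(match.group(), "%d/%m/%Y") if match else None

# Invented sample strings; the wording on the real page may be different.
print(parse_age("73 jaar"))
print(parse_date("overleden op 01/01/2009 "))
print(parse_age(""))
```

Returning `None` for text without a match avoids an exception when the layout of a single entry deviates from the expected one.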


## 4. Clean dataset
### 4.1 Reshape
The index is reset so that the rows are numbered consecutively, and the row order is reversed so that the oldest deaths come first.

In [16]:
new_df.reset_index(inplace = True, drop = True)
new_df = new_df[::-1].reset_index(drop = True)
new_df.shape

(174759, 4)

### 4.2 Reformat
The `month`, `week` and `day` of the year are extracted from `date` and stored in separate columns to make exploratory analysis of the dataset more straightforward. The text columns are then cast to the pandas `string` dtype.

In [17]:
dates = new_df["date"]

months = []
weeks = []
days = []

for date in dates:
    month = dt.datetime.strftime(date, "%B")
    week = dt.datetime.strftime(date, "%W")
    day = dt.datetime.strftime(date, "%j")
    months.append(month)
    weeks.append(week)
    days.append(day)
    
new_df["month"] = months
new_df["week"] = weeks
new_df["day"] = days

new_df['location']= pd.Series(new_df['location'], dtype="string")
new_df['month']= pd.Series(new_df['month'], dtype="string")
new_df['week']= pd.Series(new_df['week'], dtype="string")
new_df['name']= pd.Series(new_df['name'], dtype="string")
new_df['day']= pd.Series(new_df['day'], dtype="string")

# Example filter: keep only the deaths in March (the result is not used further below).
booler = new_df["month"] == "March"
df = new_df.loc[booler,]

new_df.head()

| | name | age | date | location | month | week | day |
|---|---|---|---|---|---|---|---|
| 0 | Marcel Willems | 73.0 | 2009-01-01 | Lummen | January | 0 | 1 |
| 1 | Leopoldine Willems | 77.0 | 2009-01-01 | Sint-Truiden | January | 0 | 1 |
| 2 | Lydia Velle | 86.0 | 2009-01-01 | Antwerpen 1 | January | 0 | 1 |
| 3 | Suzette Van der Venne | 87.0 | 2009-01-01 | Zwijndrecht | January | 0 | 1 |
| 4 | Jaak Van de Velde | 87.0 | 2009-01-01 | Wijnegem | January | 0 | 1 |
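
To make the meaning of the three format codes concrete, the short example below applies them to the first date in the table above, using only the standard library. It also shows ISO 8601 week numbering as a possible alternative to `%W`; that alternative is a suggestion and is not used in the original notebook.

```python
import datetime as dt

date = dt.datetime(2009, 1, 1)  # first date in the table above (a Thursday)

month = date.strftime("%B")  # full month name (depends on the locale)
week = date.strftime("%W")   # week of the year, Monday first; days before the first Monday are week 00
day = date.strftime("%j")    # day of the year, zero-padded to three digits

print(month, week, day)      # January 00 001 (with the default C or an English locale)

# ISO 8601 numbering: the week containing the first Thursday of the year is week 1.
iso_year, iso_week, iso_weekday = date.isocalendar()
print(iso_year, iso_week, iso_weekday)  # 2009 1 4
```

Note that `strftime` returns zero-padded strings, so the values sort correctly as text but need a conversion to `int` before any arithmetic.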


## 5. Save the data
The resulting dataset is saved to a .csv file called `deaths_2009-2020.csv` for later analysis.

In [18]:
new_df.to_csv("deaths_2009-2020.csv")