# Scraper for collecting Belgian deaths
## 1. Background
This program scrapes the reported deaths in Belgium that are publicly available on inmemoriam.be. It was written to get an idea of the impact of COVID-19 on the Belgian death rate.
The parameters are set to collect all deaths from 1 January 2016 up to 1 January 2021. Because the scraper was last run on 18 October 2020, data for the final months of 2020 is not yet included.
The collected data consists of the name of the deceased (`name`), their age (`age`), the date of death (`date`) and the place of death (`location`), plus separate columns for the month (`month`), week (`week`) and day (`day`) of the year.

## 2. Necessary packages
The scraper imports the following packages.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import datetime as dt
import matplotlib.pyplot as plt

## 3. Scraper
### 3.1 Set parameters
First, the parameters are set to cover the desired date range. Because the website spreads the results over many pages, the first page is requested and its pagination links are read to determine the total number of pages to scrape.

In [4]:
current_page = 1
begin_date = "2016-01-01"
end_date = "2021-01-01"
url = "https://www.inmemoriam.be/nl/rouwberichten/?page=" + str(current_page) + "&filter=&periodStart=" + str(begin_date) + "&periodEnd=" + str(end_date) + "&yearOfBirth=&undertakerId=&placeOfResidence=&provinceId=&newsPaper=&obituary=1"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

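# Collect the text of every pagination link on the first results page.
page_list = soup.find_all('a', class_ = 'c-pagination__item')

pages = []

for link in page_list:
    pages.append(link.get_text())

# The second-to-last pagination item is taken as the number of the last page.
end_page = pages[-2]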
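
The long concatenated URL can also be assembled from a dictionary of query parameters, which keeps the individual filter values readable. The sketch below uses `urllib.parse.urlencode` from the standard library. The helper name `build_url` is hypothetical and not part of the original notebook; the parameter names are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.inmemoriam.be/nl/rouwberichten/"

def build_url(page, begin_date, end_date):
    """Return the listing URL for one results page (hypothetical helper)."""
    params = {
        "page": page,
        "filter": "",
        "periodStart": begin_date,
        "periodEnd": end_date,
        "yearOfBirth": "",
        "undertakerId": "",
        "placeOfResidence": "",
        "provinceId": "",
        "newsPaper": "",
        "obituary": 1,
    }
    # urlencode keeps the insertion order of the dict and escapes the values.
    return BASE_URL + "?" + urlencode(params)

url = build_url(1, "2016-01-01", "2021-01-01")
print(url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, so the query string does not have to be built by hand.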

### 3.2 Collect the data
Next, the scraper is run with the parameters set above, collecting `name`, `age`, `date` and `location` for each deceased person in the date range. The results are stored in a dataframe called `df`. With more than 5600 pages to scrape, this process takes a long time to finish.
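
The cell below extracts the age and the date by slicing a fixed number of characters off each string, which only works while every entry has exactly the expected layout. As an alternative, the values could be pulled out with regular expressions. This is a minimal sketch under assumptions: `parse_age` and `parse_date` are hypothetical helpers, and the sample strings are made up for illustration, so the actual text on the site may differ.

```python
import re
import datetime as dt

def parse_age(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def parse_date(text):
    """Return the first dd/mm/yyyy date found in text, or None."""
    match = re.search(r"\d{2}/\d{2}/\d{4}", text)
    if match is None:
        return None
    return dt.datetime.strptime(match.group(), "%d/%m/%Y")

# Hypothetical sample strings, not copied from the site.
print(parse_age(" 82 jaar "))
print(parse_date("Overleden: 01/03/2016 "))
print(parse_age(""))
```

Returning `None` instead of raising an error means that a single malformed entry does not stop a run that spans thousands of pages.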

In [6]:
df = pd.DataFrame()
df["name"] = []
df["age"] = []
df["date"] = []
df["location"] = []
page_list = np.arange(current_page,int(end_page) + 1)
print("Total amount of pages is: ", end_page)

for num in page_list:
    print("Page ", num, " is being scraped now.")
    tempdf = pd.DataFrame()
    url = "https://www.inmemoriam.be/nl/rouwberichten/?page=" + str(num) + "&filter=&periodStart=" + str(begin_date) + "&periodEnd=" + str(end_date) + "&yearOfBirth=&undertakerId=&placeOfResidence=&provinceId=&newsPaper=&obituary=1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    name_list = soup.find_all('h3', class_ = 'c-deceased__name')
    age_list = soup.find_all('span', class_ = 'c-deceased__age')
    date_list = soup.find_all('div', class_ = 'c-deceased__departed')
    location_list = soup.find_all('div', class_ = 'c-deceased__location')
    
    names = []
    ages = []
    dates = []
    locations = []

    for name in name_list:
        name = name.get_text()
        names.append(name)

    for age in age_list:
        age = age.get_text()
        age = int(age[0:-5])
        ages.append(age)

    for location in location_list:
        location = location.get_text()
        locations.append(location)

    for date in date_list:
        date = date.get_text()
        date = date[11:-1]
        date = dt.datetime.strptime(date, "%d/%m/%Y")
        dates.append(date)
        
    ages = Series(ages)
    locations = Series(locations)
    dates = Series(dates)
    names = Series(names)
    
    tempdf["name"] = names
    tempdf["age"] = ages
    tempdf["date"] = dates
    tempdf["location"] = locations
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
    df = pd.concat([df, tempdf])

print("Done.")

.
Page  352  is being scraped now.
Page  353  is being scraped now.
Page  354  is being scraped now.
Page  355  is being scraped now.
Page  356  is being scraped now.
Page  357  is being scraped now.
Page  358  is being scraped now.
Page  359  is being scraped now.
Page  360  is being scraped now.
Page  361  is being scraped now.
Page  362  is being scraped now.
Page  363  is being scraped now.
Page  364  is being scraped now.
Page  365  is being scraped now.
Page  366  is being scraped now.
Page  367  is being scraped now.
Page  368  is being scraped now.
Page  369  is being scraped now.
Page  370  is being scraped now.
Page  371  is being scraped now.
Page  372  is being scraped now.
Page  373  is being scraped now.
Page  374  is being scraped now.
Page  375  is being scraped now.
Page  376  is being scraped now.
Page  377  is being scraped now.
Page  378  is being scraped now.
Page  379  is being scraped now.
Page  380  is being scraped now.
Page  381  is being scraped now.
Page  38

ConnectionError: HTTPSConnectionPool(host='www.inmemoriam.be', port=443): Max retries exceeded with url: /nl/rouwberichten/?page=957&filter=&periodStart=2016-01-01&periodEnd=2021-01-01&yearOfBirth=&undertakerId=&placeOfResidence=&provinceId=&newsPaper=&obituary=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f0d0055d520>: Failed to establish a new connection: [Errno -2] Name or service not known'))
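
The run above stopped with a `ConnectionError` before all pages had been scraped. One common way to make a long scrape more robust is to retry a failed request after a pause. The sketch below shows that pattern with a generic helper. `with_retries` and its parameters are hypothetical names, and `flaky_request` only simulates a failing request so that the example runs without network access.

```python
import time

def with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or the attempts are used up.

    Waits delay seconds after the first failure and doubles the wait
    after each further failure. Re-raises the last error if all fail.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2

# Simulated request that fails twice before succeeding.
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "page content"

result = with_retries(flaky_request, attempts=5, delay=0.01,
                      exceptions=(ConnectionError,))
print(result, "after", calls["count"], "calls")
```

In the scraper itself, `func` could be something like `lambda: requests.get(url, timeout=30)` with `exceptions=(requests.exceptions.ConnectionError,)`. The timeout value here is an arbitrary choice.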

## 4. Clean dataset
### 4.1 Reshape
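Because each page was appended with its own index, the combined dataframe contains repeated index values. The index is therefore reset so that the rows are numbered consecutively, and the row order is reversed so that the rows scraped first come last.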

In [158]:
df.reset_index(inplace = True, drop = True)
df = df[::-1].reset_index(drop = True)
df.shape

(1368, 4)

### 4.2 Reformat
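The `month`, `week` and `day` of the year are extracted from `date` and stored in separate columns to make exploratory analysis of the dataset more straightforward. The text columns are then converted to the `string` dtype, and the dataframe is filtered to keep only the deaths in March.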

In [159]:
dates = df["date"]

months = []
weeks = []
days = []

for date in dates:
    month = dt.datetime.strftime(date, "%B")
    week = dt.datetime.strftime(date, "%W")
    day = dt.datetime.strftime(date, "%j")
    months.append(month)
    weeks.append(week)
    days.append(day)
    
df["month"] = months
df["week"] = weeks
df["day"] = days

df['location'] = df['location'].astype("string")
df['month'] = df['month'].astype("string")
df['week'] = df['week'].astype("string")
df['name'] = df['name'].astype("string")
df['day'] = df['day'].astype("string")

# Keep only the rows for deaths in March.
is_march = df["month"] == "March"
df = df.loc[is_march]

df.head()

| Unnamed: 0 | name | age | date | location | month | week | day |
|---|---|---|---|---|---|---|---|
| 0 | Maria Winters | 82.0 | 2016-03-01 | LOMMEL | March | 9 | 61 |
| 1 | Albert WINDELS | 95.0 | 2016-03-01 | Ingooigem | March | 9 | 61 |
| 2 | Yvonne Wijnants | 90.0 | 2016-03-01 | ZONHOVEN | March | 9 | 61 |
| 3 | LAOUREUX Victorine | 87.0 | 2016-03-01 | Stembert | March | 9 | 61 |
| 4 | Philippe VANDENHOVEN | 59.0 | 2016-03-01 | Bastogne | March | 9 | 61 |
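
The three format codes used in the cell above behave as follows: `%B` gives the full month name, `%W` gives the week number of the year with Monday as the first day of the week (days before the first Monday fall in week 00), and `%j` gives the day of the year. The latter two are returned as zero-padded strings. A small check on a single date, using only the standard library:

```python
import datetime as dt

date = dt.datetime(2016, 3, 1)

week = date.strftime("%W")   # week of the year, Monday-based, zero-padded
day = date.strftime("%j")    # day of the year, zero-padded to three digits
month = date.strftime("%B")  # full month name in the current locale

print(month, week, day)

# Converting to int drops the zero padding, e.g. for numeric sorting.
print(int(week), int(day))
```

Note that `%W` is not the ISO 8601 week number. If ISO weeks are needed, `date.isocalendar()` provides them instead.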


## 5. Save the data
Finally, the resulting dataset is saved to a CSV file called `deaths_2016-2020.csv` for later analysis.

In [160]:
df.to_csv("deaths_2016-2020.csv")
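
By default, `to_csv` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below illustrates the difference with a small made-up dataframe and an in-memory buffer instead of a file. The variable names are hypothetical.

```python
import io
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({"name": ["A", "B"], "age": [82, 95]})

# Default: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```

Whether to keep the index is a design choice. After the reset in section 4.1 it carries only row numbers, so leaving it out would lose no information.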