# Implement scraper
Sociepy's dataset is built from multiple country data sources. The first step in retrieving and constructing this dataset is handled by the country scrapers in the `covid_updater` library. In this notebook we explain how to add a new scraper to the project, using Finland as the example.


We will illustrate how to (1) get the data, (2) process the data and (3) export the data. Once this logic is clear and reliable, we will integrate it into the `covid_updater` library, specifically into the `covid_updater.scraping.countries` module.


Note that scraping is only the first part of the `update_all.py` pipeline, which is responsible for updating all of the project's data. More details about the whole pipeline can be found [here](https://github.com/sociepy/covid19-vaccination-subnational/blob/main/scripts/update_all.sh).

## 0. Fork project
The first step is to fork the project, clone it and start editing.

Remember to keep your fork [synced with upstream](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork).

### 0.1 Find country candidate
Use the following tool to see what is available on OWID:

In [1]:
# Get potential candidates and urls
from covid_updater.additions import get_owid_diff_source_urls
get_owid_diff_source_urls()

[{'country': 'Bulgaria', 'source_url': 'https://coronavirus.bg/bg/statistika'},
 {'country': 'Estonia', 'source_url': 'https://www.terviseamet.ee/et/uudised'},
 {'country': 'Finland',
  'source_url': 'https://sampo.thl.fi/pivot/prod/en/vaccreg/cov19cov/fact_cov19cov.csv?row=cov_vac_dose-533174L&column=measure-533185.533172.433796.533175&'},
 {'country': 'Indonesia', 'source_url': 'https://www.kemkes.go.id/'},
 {'country': 'Ireland', 'source_url': ''},
 {'country': 'Jersey',
  'source_url': 'https://www.gov.je/Health/Coronavirus/Vaccine/Pages/VaccinationStatistics.aspx'},
 {'country': 'Liechtenstein',
  'source_url': 'https://www.covid19.admin.ch/en/epidemiologic/vacc-doses?detGeo=FL'},
 {'country': 'Luxembourg',
  'source_url': 'https://data.public.lu/fr/datasets/covid-19-rapports-journaliers/#_'},
 {'country': 'Montenegro',
  'source_url': 'https://atlas.jifo.co/api/connectors/520021dc-c292-4903-9cdb-a2467f64ed97'},
 {'country': 'Morocco',
  'source_url': 'http://www.covidmaroc.ma/pag

## 1. Get the data
### 1.1 Format of the data
You first have to determine the format of the source data. Depending on the format, different libraries may come in handy.

- CSV: You might want to use `pandas`.
- API, JSON-like: Use `requests`, `json`.
- Table in a static website: Use `bs4`, `requests`, `urllib`.
- Table in a dynamic website: Use `selenium`, `bs4`, `requests`, `urllib`.

### 1.2 Download the data
In our case, the data is available as one CSV file per week.

The URLs of these files are a bit tricky to get in this particular case, so some scraping is required!

In [2]:
from datetime import datetime, timedelta
import urllib.request
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

In [3]:
def get_urls(url_base):
    """Return the CSV URL of every week code linked from the base pivot page."""
    # Get HTML of the page identified by the base code
    code_base = "dateweek20201226-525425"
    html_page = urllib.request.urlopen(url_base.format(ext="", code=code_base))
    soup = BeautifulSoup(html_page, "html.parser")
    # Collect the "data-ref" code of every level-1 row in the pivot table
    div = soup.find_all(class_="pivot-content")[0]
    cells = div.find_all(class_="row-target sticky-cell leaf", attrs={"data-level": "1"})
    refs = [cell.find("a").get("data-ref") for cell in cells]
    # Keep unique codes, excluding the base code itself
    date_codes = [code for code in np.unique(refs).tolist() if code != code_base]
    # Build one CSV URL per code
    urls = [url_base.format(ext=".csv", code=code) for code in date_codes]
    return urls
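
To see what the extraction inside `get_urls` does without hitting the network, here is a minimal sketch that runs the same BeautifulSoup calls on a small inline HTML snippet. The markup, the code values and the helper name `extract_week_codes` are hypothetical and only imitate the structure that `get_urls` relies on; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the structure get_urls relies on.
SAMPLE_HTML = """
<div class="pivot-content">
  <div class="row-target sticky-cell leaf" data-level="1"><a data-ref="base-code">All</a></div>
  <div class="row-target sticky-cell leaf" data-level="1"><a data-ref="week-02">Week 2</a></div>
  <div class="row-target sticky-cell leaf" data-level="1"><a data-ref="week-01">Week 1</a></div>
  <div class="row-target sticky-cell leaf" data-level="1"><a data-ref="week-01">Week 1</a></div>
  <div class="row-target sticky-cell leaf" data-level="2"><a data-ref="day-01">Day 1</a></div>
</div>
"""


def extract_week_codes(html, code_base):
    """Return the sorted, unique data-ref codes of level-1 rows, minus code_base."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find_all(class_="pivot-content")[0]
    # Exact match on the class string, restricted to rows with data-level="1"
    cells = container.find_all(
        class_="row-target sticky-cell leaf", attrs={"data-level": "1"}
    )
    refs = {cell.find("a").get("data-ref") for cell in cells}
    return sorted(refs - {code_base})


week_codes = extract_week_codes(SAMPLE_HTML, code_base="base-code")
print(week_codes)  # ['week-01', 'week-02']
```

Duplicated codes collapse into one entry, the level-2 row is skipped, and the base code is removed, which matches the `np.unique` plus filter logic in `get_urls`.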

In [4]:
# Get CSV urls
url_base = "https://sampo.thl.fi/pivot/prod/en/vaccreg/cov19cov/fact_cov19cov{ext}?row=area-518362&row={code}&column=cov_vac_dose-533174"
urls = get_urls(url_base)

In [5]:
# Build df
dfs = [pd.read_csv(url, sep=";") for url in urls]
df = pd.concat(dfs)
df = df[df["Time"].apply(lambda x: "Year" not in x)]

In [6]:
# Show some rows
df.head(5)

Unnamed: 0,Vaccination dose,Area,Time,val
0,First dose,Åland,2021-01-11,88.0
1,Second dose,Åland,2021-01-11,0.0
2,All doses,Åland,2021-01-11,88.0
3,First dose,Åland,2021-01-12,53.0
4,Second dose,Åland,2021-01-12,0.0


## 2. Process data
We want to process the data so that its format complies with the project's format. See [Australia.csv](https://github.com/sociepy/covid19-vaccination-subnational/blob/main/data/countries/Australia.csv) for an example. The following fields are required:

- `location`: Country name.
- `region`: Region name.
- `date`: Date of observation.
- `location_iso`: ISO 3166-1 alpha-2 code of the country.
- `region_iso`: ISO 3166-2 code of the region.
- `total_vaccinations`: Total number of vaccinations (doses) administered.
- `people_vaccinated` (OPTIONAL): Number of people with at least one vaccination administered. Not all countries make this data available.
- `people_fully_vaccinated` (OPTIONAL): Number of people fully vaccinated (with all required doses administered). Not all countries make this data available.


### 2.1 Pivot table
We observe that the shape of the table is not the one we want. Pivoting is required!

In [7]:
# Pivot
df = df.pivot(index=["Time", "Area"], columns="Vaccination dose", values="val").reset_index()
df.columns.name = None

In [8]:
df.head(5)

Unnamed: 0,Time,Area,All doses,First dose,Second dose
0,2020-12-26,All areas,,,
1,2020-12-26,Central Finland Hospital District,,,
2,2020-12-26,Central Ostrobothnia Hospital District,,,
3,2020-12-26,Helsinki and Uusimaa Hospital District,,,
4,2020-12-26,Itä-Savo Hospital District,,,
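
The effect of the pivot is easier to see on a tiny table. The sketch below uses toy values (not the real download) with the same column names as the source data and applies the same `pivot` call.

```python
import pandas as pd

# Toy long-format data: one row per (date, area, dose type)
long_df = pd.DataFrame({
    "Time": ["2021-01-11"] * 3 + ["2021-01-12"] * 3,
    "Area": ["Åland"] * 6,
    "Vaccination dose": ["First dose", "Second dose", "All doses"] * 2,
    "val": [88.0, 0.0, 88.0, 53.0, 0.0, 53.0],
})

# Wide format: one row per (date, area), one column per dose type
wide_df = long_df.pivot(
    index=["Time", "Area"], columns="Vaccination dose", values="val"
).reset_index()
wide_df.columns.name = None

print(list(wide_df.columns))
# ['Time', 'Area', 'All doses', 'First dose', 'Second dose']
```

Note that `pivot` raises a `ValueError` if the same (`Time`, `Area`, `Vaccination dose`) combination appears more than once, so duplicated rows have to be removed or aggregated beforehand.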


### 2.2 Rename fields
Which of the retrieved columns are we interested in?

After pivoting, we have five columns: `Time`, `Area`, `All doses`, `First dose` and `Second dose`. They map to the project's fields as follows:

- `Time` -> `date`
- `Area` -> `region`
- `All doses` -> `total_vaccinations`
- `First dose` -> `people_vaccinated`
- `Second dose` -> `people_fully_vaccinated`

In [9]:
df = df.rename(columns={
    "Time": "date",
    "Area": "region",
    "First dose": "people_vaccinated",
    "Second dose": "people_fully_vaccinated",
    "All doses": "total_vaccinations"
})

In [10]:
df.head(5)

Unnamed: 0,date,region,total_vaccinations,people_vaccinated,people_fully_vaccinated
0,2020-12-26,All areas,,,
1,2020-12-26,Central Finland Hospital District,,,
2,2020-12-26,Central Ostrobothnia Hospital District,,,
3,2020-12-26,Helsinki and Uusimaa Hospital District,,,
4,2020-12-26,Itä-Savo Hospital District,,,


### 2.3 Region field
Identify the region field and make sure it is consistent with the country's regions. For this example, you may want to look up "ISO 3166-2 Finland" (https://en.wikipedia.org/wiki/ISO_3166-2:FI) or "Regions of Finland" (https://en.wikipedia.org/wiki/Regions_of_Finland) on Wikipedia.

Once we are satisfied that the region field makes sense and is complete, we need to rename the regions using the standardized names from our ISO database:

- Note 1: In this example, some regions comprise more than one hospital district.
- Note 2: We will ignore two entries, namely "All areas" and "Other areas".

In [11]:
# Regions to be ignored
df = df[~df["region"].isin(["All areas", "Other areas"])]

In [12]:
# Get Finland's standard region names
from covid_updater.iso import ISODB
df_iso = ISODB()._df
df_iso.loc[df_iso["location_iso"] == "FI"]

Unnamed: 0,location_iso,subdivision_name,region_iso
805,FI,Aland,FI-01
806,FI,Etela-Karjala,FI-02
807,FI,Etela-Pohjanmaa,FI-03
808,FI,Etela-Savo,FI-04
809,FI,Kainuu,FI-05
810,FI,Kanta-Hame,FI-06
811,FI,Keski-Pohjanmaa,FI-07
812,FI,Keski-Suomi,FI-08
813,FI,Kymenlaakso,FI-09
814,FI,Lappi,FI-10


In [13]:
# Area renaming
region_renaming = {
    'Åland': 'Aland',
    'South Karelia Hospital District': 'Etela-Karjala',
    'South Ostrobothnia Hospital District': 'Etela-Pohjanmaa',
    'South Savo Hospital District': 'Etela-Savo',
    'Helsinki and Uusimaa Hospital District': 'Uusimaa',
    'Itä-Savo Hospital District': 'Etela-Savo',
    'Kainuu Hospital District': 'Kainuu',
    'Kanta-Häme Hospital District': 'Kanta-Hame',
    'Central Ostrobothnia Hospital District': 'Keski-Pohjanmaa',
    'Central Finland Hospital District': 'Keski-Suomi',
    'Kymenlaakso Hospital District': 'Kymenlaakso',
    'Lappi Hospital District': 'Lappi',
    'Länsi-Pohja Hospital District': 'Lappi',
    'Pirkanmaa Hospital District': 'Pirkanmaa',
    'North Karelia Hospital District': 'Pohjois-Karjala',
    'North Ostrobothnia Hospital District': 'Pohjois-Pohjanmaa',
    'North Savo Hospital District': 'Pohjois-Savo',
    'Päijät-Häme Hospital District': 'Paijat-Hame',
    'Satakunta Hospital District': 'Satakunta',
    'Vaasa Hospital District': 'Pohjanmaa',
    'Southwest Finland Hospital District': 'Varsinais-Suomi'
}

df.loc[:, "region"] = df.loc[:, "region"].replace(region_renaming)

In [14]:
df.tail()

Unnamed: 0,date,region,total_vaccinations,people_vaccinated,people_fully_vaccinated
1582,2021-03-04,Etela-Pohjanmaa,,,
1583,2021-03-04,Etela-Savo,,,
1584,2021-03-04,Varsinais-Suomi,,,
1585,2021-03-04,Pohjanmaa,,,
1586,2021-03-04,Aland,,,
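
As Note 1 points out, several hospital districts can map to the same region, so after renaming there may be more than one row per (`date`, `region`). The notebook does not show how these rows are combined; one possible approach, sketched here on toy values, is to sum the metrics per date and region. This assumes that adding the district figures is the right aggregation for the source data.

```python
import numpy as np
import pandas as pd

# Toy data after renaming: two districts ended up as "Etela-Savo"
renamed = pd.DataFrame({
    "date": ["2021-03-03", "2021-03-03", "2021-03-03", "2021-03-04", "2021-03-04"],
    "region": ["Etela-Savo", "Etela-Savo", "Kainuu", "Etela-Savo", "Etela-Savo"],
    "total_vaccinations": [200.0, 50.0, 10.0, np.nan, np.nan],
    "people_vaccinated": [200.0, 45.0, 10.0, np.nan, np.nan],
    "people_fully_vaccinated": [0.0, 5.0, 0.0, np.nan, np.nan],
})

metrics = ["total_vaccinations", "people_vaccinated", "people_fully_vaccinated"]

# min_count=1 keeps NaN when every district is missing, instead of returning 0
aggregated = renamed.groupby(["date", "region"], as_index=False)[metrics].sum(min_count=1)
print(aggregated)
```

With `min_count=1`, a date on which all districts report NaN stays NaN rather than turning into a misleading zero.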


### 2.4 Discard today's entries
The data for the date of access is NaN, so we can discard it (it will be added on the following day).

In [15]:
date_str_limit = datetime.now().date().strftime("%Y-%m-%d")
df = df[df["date"] < date_str_limit]

In [16]:
df.tail()

Unnamed: 0,date,region,total_vaccinations,people_vaccinated,people_fully_vaccinated
1559,2021-03-03,Etela-Pohjanmaa,963.0,960.0,3.0
1560,2021-03-03,Etela-Savo,200.0,200.0,0.0
1561,2021-03-03,Varsinais-Suomi,2825.0,2804.0,21.0
1562,2021-03-03,Pohjanmaa,922.0,920.0,2.0
1563,2021-03-03,Aland,37.0,36.0,1.0
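
The filter above compares dates as strings, which works because `YYYY-MM-DD` strings sort in chronological order. A minimal sketch of the same idea as a function follows; the name `drop_from` and the `today` parameter are hypothetical and only make the cut-off explicit, so the step can be checked with a fixed date.

```python
from datetime import date

import pandas as pd


def drop_from(df, today):
    """Keep only rows dated strictly before `today` (ISO date strings)."""
    limit = today.strftime("%Y-%m-%d")
    return df[df["date"] < limit]


sample = pd.DataFrame({
    "date": ["2021-03-03", "2021-03-04", "2021-03-05"],
    "total_vaccinations": [37.0, None, None],
})
kept = drop_from(sample, today=date(2021, 3, 4))
print(kept["date"].tolist())  # ['2021-03-03']
```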


### 2.5 Add missing columns
In this case, we are missing the columns related to location and ISO codes.

In [17]:
# Country
df = df.assign(location="Finland")

In [18]:
# ISO
df = ISODB().merge(df, country_iso="FI")

In [19]:
df.head()

Unnamed: 0,date,region,total_vaccinations,people_vaccinated,people_fully_vaccinated,location,location_iso,region_iso
0,2020-12-26,Keski-Suomi,,,,Finland,FI,FI-08
1,2020-12-26,Keski-Pohjanmaa,,,,Finland,FI,FI-07
2,2020-12-26,Uusimaa,,,,Finland,FI,FI-18
3,2020-12-26,Etela-Savo,,,,Finland,FI,FI-04
4,2020-12-26,Kainuu,,,,Finland,FI,FI-05
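
The ISO step only works if every region name matches a `subdivision_name` in the ISO table. The sketch below is not how `ISODB.merge` is implemented (its internals are not shown here); it is a plain pandas equivalent on a toy ISO table, assuming the column names from the listing in section 2.3, and it shows one way to spot region names that were not renamed correctly.

```python
import pandas as pd

# Toy excerpt of the ISO table (rows taken from the listing in section 2.3)
iso = pd.DataFrame({
    "location_iso": ["FI", "FI", "FI"],
    "subdivision_name": ["Aland", "Etela-Savo", "Kainuu"],
    "region_iso": ["FI-01", "FI-04", "FI-05"],
})

# Toy data; the last region name was deliberately left unmapped
data = pd.DataFrame({
    "date": ["2021-03-03", "2021-03-03", "2021-03-03"],
    "region": ["Aland", "Kainuu", "Lappi Hospital District"],
    "total_vaccinations": [37.0, 10.0, 5.0],
})

# A left join keeps every data row; unmatched regions get NaN ISO codes
merged = data.merge(
    iso, left_on="region", right_on="subdivision_name", how="left"
).drop(columns="subdivision_name")

unmatched = merged.loc[merged["region_iso"].isna(), "region"].unique().tolist()
print(unmatched)  # ['Lappi Hospital District']
```

An empty `unmatched` list is a quick sanity check that the renaming dictionary covers every area in the source.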


## 3. Exporting
Exporting consists of:

- Order columns based on standard (check [Australia.csv](https://github.com/sociepy/covid19-vaccination-subnational/blob/main/data/countries/Australia.csv) as an example).
- Sort rows by region & date.
- Ensure fields `total_vaccinations`, `people_vaccinated`, `people_fully_vaccinated` are of type int.
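
The three export steps above can be sketched as follows. The column order used here simply follows the field list in section 2 and is an assumption; check it against Australia.csv. The helper name `prepare_export` is hypothetical. The nullable `Int64` dtype is used so that missing values do not force the columns back to float.

```python
import numpy as np
import pandas as pd

# Assumed column order (follows the field list in section 2)
COLUMNS = [
    "location", "region", "date", "location_iso", "region_iso",
    "total_vaccinations", "people_vaccinated", "people_fully_vaccinated",
]
INT_COLUMNS = ["total_vaccinations", "people_vaccinated", "people_fully_vaccinated"]


def prepare_export(df):
    """Order columns, sort rows by region and date, cast metrics to integers."""
    out = df[COLUMNS].sort_values(["region", "date"]).reset_index(drop=True)
    # "Int64" (capital I) is pandas' nullable integer dtype: NaN becomes <NA>
    return out.astype({col: "Int64" for col in INT_COLUMNS})


toy = pd.DataFrame({
    "date": ["2021-03-03", "2021-03-02", "2021-03-03"],
    "region": ["Kainuu", "Kainuu", "Aland"],
    "total_vaccinations": [10.0, np.nan, 37.0],
    "people_vaccinated": [10.0, np.nan, 36.0],
    "people_fully_vaccinated": [0.0, np.nan, 1.0],
    "location": "Finland",
    "location_iso": "FI",
    "region_iso": ["FI-05", "FI-05", "FI-01"],
})

exported = prepare_export(toy)
csv_text = exported.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```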

## 4. Implementing scraper
First, you need to decide whether to implement this as a batch scraper or an incremental scraper. Here, we will implement it as a batch scraper, that is, as a class inheriting from `covid_updater.scraping.base.Scraper`.

### Steps to follow:

- Create a new file `covid_updater.scraping.countries.finland.py`.
- Design and implement a new class `FinlandScraper` in the aforementioned file.
    - Use the code that you have tested in this notebook and place it in the different class methods (`load_data`, `_process`, or new ones that you create).
    - Use other country files as reference.
    - Remember to inherit from `covid_updater.scraping.base.Scraper` and reuse what it already implements.
- Add the required lines in `covid_updater.scraping.core.py`:
    - New import: `from covid_updater.scraping.countries.finland import FinlandScraper`
    - Add an instance of `FinlandScraper` to the list `scrapers`.
- Add the corresponding ISO code to the list `ISO_CODES` in the script `scripts/update_countries.py`.
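
To give an idea of how the notebook code might be split across class methods, here is a rough, self-contained sketch. It does not use the real `covid_updater.scraping.base.Scraper`, whose interface is not shown in this notebook: `ScraperStandIn`, its `run` method and the `read_frames` argument are hypothetical, and only the method names `load_data` and `_process` come from the steps above. Region renaming, date filtering, the ISO merge and exporting are left out for brevity.

```python
import pandas as pd

RENAMING = {
    "Time": "date",
    "Area": "region",
    "All doses": "total_vaccinations",
    "First dose": "people_vaccinated",
    "Second dose": "people_fully_vaccinated",
}


class ScraperStandIn:
    """Hypothetical stand-in for covid_updater.scraping.base.Scraper."""

    def run(self):  # invented here, only to chain the two steps
        return self._process(self.load_data())

    def load_data(self):
        raise NotImplementedError

    def _process(self, df):
        raise NotImplementedError


class FinlandScraperSketch(ScraperStandIn):
    def __init__(self, read_frames):
        # read_frames: hypothetical callable returning the raw weekly DataFrames,
        # e.g. lambda: [pd.read_csv(url, sep=";") for url in urls]
        self.read_frames = read_frames

    def load_data(self):
        df = pd.concat(self.read_frames(), ignore_index=True)
        # Drop rows whose Time value contains "Year", as in step 1.2
        return df[~df["Time"].str.contains("Year")]

    def _process(self, df):
        df = df.pivot(index=["Time", "Area"], columns="Vaccination dose", values="val")
        df = df.reset_index().rename(columns=RENAMING)
        df.columns.name = None
        df = df[~df["region"].isin(["All areas", "Other areas"])]
        return df.assign(location="Finland").reset_index(drop=True)


# Toy input standing in for the downloaded CSVs ("Year 2021" is a made-up value)
toy = pd.DataFrame({
    "Vaccination dose": ["First dose", "Second dose", "All doses", "All doses", "All doses"],
    "Area": ["Åland", "Åland", "Åland", "All areas", "Åland"],
    "Time": ["2021-01-11", "2021-01-11", "2021-01-11", "2021-01-11", "Year 2021"],
    "val": [88.0, 0.0, 88.0, 88.0, 88.0],
})
result = FinlandScraperSketch(lambda: [toy]).run()
print(result)
```

Passing the data source in as a callable keeps the processing logic testable without network access; the actual class should follow whatever the existing country scrapers in the project do.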