# Implement scraper
Sociepy's dataset is built from multiple country data sources. The first step in retrieving and constructing this dataset is handled by the country scrapers from the `covid_updater` library. In this notebook we explain how to add a new scraper to the project, using Finland as an example.


We will illustrate how to (1) get the data, (2) process the data and (3) export the data. Once this is clear and the code is reliable, we will integrate the logic into the `covid_updater` library, specifically into the `covid_updater.scraping.countries` module.


Note that scraping is only the first part of the `update_all.sh` pipeline, which is responsible for updating all of the project's data. More details about the whole pipeline can be found [here](https://github.com/sociepy/covid19-vaccination-subnational/blob/main/scripts/update_all.sh).

## 0. Fork project
The first thing to do is fork the project, clone it and start editing.

Remember to keep your fork [synced with upstream](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork).

### 0.1 Find country candidate
To find a candidate, use the following tool to see what is available on OWID:

In [1]:
# Get potential candidates and urls
from covid_updater.additions import get_owid_diff_source_urls
get_owid_diff_source_urls()

[{'country': 'Bulgaria', 'source_url': 'https://coronavirus.bg/bg/statistika'},
 {'country': 'Estonia', 'source_url': 'https://www.terviseamet.ee/et/uudised'},
 {'country': 'Finland',
  'source_url': 'https://sampo.thl.fi/pivot/prod/en/vaccreg/cov19cov/fact_cov19cov.csv?row=cov_vac_dose-533174L&column=measure-533185.533172.433796.533175&'},
 {'country': 'Indonesia', 'source_url': 'https://www.kemkes.go.id/'},
 {'country': 'Ireland', 'source_url': ''},
 {'country': 'Jersey',
  'source_url': 'https://www.gov.je/Health/Coronavirus/Vaccine/Pages/VaccinationStatistics.aspx'},
 {'country': 'Liechtenstein',
  'source_url': 'https://www.covid19.admin.ch/en/epidemiologic/vacc-doses?detGeo=FL'},
 {'country': 'Luxembourg',
  'source_url': 'https://data.public.lu/fr/datasets/covid-19-rapports-journaliers/#_'},
 {'country': 'Montenegro',
  'source_url': 'https://atlas.jifo.co/api/connectors/520021dc-c292-4903-9cdb-a2467f64ed97'},
 {'country': 'Morocco',
  'source_url': 'http://www.covidmaroc.ma/pag

## 1. Get the data
### 1.1 Format of the data
You have to determine the format of the source data. Depending on the format, different libraries may come in handy.

- csv: Might want to use pandas.
- API, json-like: Use requests, json
- Table in static website: Use bs4, requests, urllib
- Table in dynamic website: Use selenium, bs4, requests, urllib
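
As a rough illustration of the first two cases, the sketch below parses a CSV payload with pandas and a JSON payload with the standard `json` module. Both payloads are made-up in-memory strings (not real source data), so nothing is downloaded.

```python
import io
import json

import pandas as pd

# Made-up payloads standing in for a downloaded file and an API response.
csv_text = "region,doses\nA,10\nB,20\n"
json_text = '[{"region": "A", "doses": 10}, {"region": "B", "doses": 20}]'

# CSV: pandas parses it directly from a file-like object (a URL also works).
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: decode first, then build a DataFrame from the list of records.
df_json = pd.DataFrame.from_records(json.loads(json_text))

print(df_csv.shape, df_json.shape)
```

Tables embedded in web pages usually need an extra parsing step (e.g. with bs4) before they can be turned into a DataFrame.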

### 1.2 Download the data
In our case, the data is served as JSON by an API. Each element of the response describes one area, with its name (`areaName`) and its daily records (`shotHistory`).

We load the records of each area into a DataFrame, tag them with the area name and concatenate everything.

In [79]:
import requests
import json
import pandas as pd

In [80]:
def load_data(url):
    data_raw = json.loads(requests.get(url).content)
    dfs = [pd.DataFrame.from_records(sample["shotHistory"]).assign(region=sample["areaName"]) for sample in data_raw]
    return pd.concat(dfs)

In [81]:
url = "https://piikki-api.lab.juiciness.io/administrations"
df = load_data(url)

In [82]:
# Show some rows
df.head(5)

Unnamed: 0,areaId,date,firstDoseShots,secondDoseShots,region
0,ffb7d597-9211-49ed-8570-53347cb5b304,2020-12-26T00:00:00.000Z,0,0,Åland
1,ffb7d597-9211-49ed-8570-53347cb5b304,2020-12-27T00:00:00.000Z,0,0,Åland
2,ffb7d597-9211-49ed-8570-53347cb5b304,2020-12-28T00:00:00.000Z,0,0,Åland
3,ffb7d597-9211-49ed-8570-53347cb5b304,2020-12-29T00:00:00.000Z,0,0,Åland
4,ffb7d597-9211-49ed-8570-53347cb5b304,2020-12-30T00:00:00.000Z,0,0,Åland
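
To see what `load_data` does without hitting the network, here is a minimal sketch that applies the same flattening to a hand-written payload. The payload only mimics the structure used above (`areaName`, `shotHistory`); its values are invented.

```python
import pandas as pd

# Hand-written payload: a list of areas, each with a name and daily records.
data_raw = [
    {"areaName": "Area A", "shotHistory": [
        {"date": "2021-01-01", "firstDoseShots": 1, "secondDoseShots": 0},
        {"date": "2021-01-02", "firstDoseShots": 2, "secondDoseShots": 1},
    ]},
    {"areaName": "Area B", "shotHistory": [
        {"date": "2021-01-01", "firstDoseShots": 5, "secondDoseShots": 0},
    ]},
]

# One DataFrame per area, each tagged with the area name.
dfs = [
    pd.DataFrame.from_records(sample["shotHistory"]).assign(region=sample["areaName"])
    for sample in data_raw
]

# ignore_index=True yields a clean 0..n-1 index; without it, each area
# keeps its own 0-based index and labels repeat.
toy = pd.concat(dfs, ignore_index=True)
print(toy)
```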


## 2. Process data
We want to process the data so that it complies with the project's format. Check [Australia.csv](https://github.com/sociepy/covid19-vaccination-subnational/blob/main/data/countries/Australia.csv) as an example. In short, the following fields are required:

- `location`: Country name.
- `region`: Region name.
- `date`: Date of observation.
- `location_iso`: ISO 3166-1 alpha-2 code of country.
- `region_iso`: ISO 3166-2 code of region.
- `total_vaccinations`: Total number of vaccinations (doses) administered.
- `people_vaccinated` (OPTIONAL): Number of people with at least one vaccination administered. Not all countries make this data available.
- `people_fully_vaccinated` (OPTIONAL): Number of people fully vaccinated (with all required doses administered). Not all countries make this data available.
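
A quick way to check a DataFrame against this list is sketched below. The helper `missing_columns` is hypothetical (it is not part of `covid_updater`); it simply reports which required fields are absent.

```python
import pandas as pd

# Field names taken from the list above.
REQUIRED = [
    "location", "region", "date",
    "location_iso", "region_iso", "total_vaccinations",
]
OPTIONAL = ["people_vaccinated", "people_fully_vaccinated"]


def missing_columns(df):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [column for column in REQUIRED if column not in df.columns]


# Example: a frame that only has two of the required fields so far.
partial = pd.DataFrame(columns=["region", "date", "people_vaccinated"])
print(missing_columns(partial))
```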

### 2.1 Rename and discard fields
From the retrieved columns, which are interesting?

The following should be kept or renamed:

- `date` -> `date`
- `region` -> `region`
- `firstDoseShots` -> `people_vaccinated`
- `secondDoseShots` -> `people_fully_vaccinated`

And `areaId` should be removed.

In [83]:
df = df.rename(columns={
    "firstDoseShots": "people_vaccinated",
    "secondDoseShots": "people_fully_vaccinated",
})
df = df.drop(columns="areaId")

In [84]:
# Show some rows
df.head(5)

Unnamed: 0,date,people_vaccinated,people_fully_vaccinated,region
0,2020-12-26T00:00:00.000Z,0,0,Åland
1,2020-12-27T00:00:00.000Z,0,0,Åland
2,2020-12-28T00:00:00.000Z,0,0,Åland
3,2020-12-29T00:00:00.000Z,0,0,Åland
4,2020-12-30T00:00:00.000Z,0,0,Åland


### 2.2 Date field
For our dataset it is important to have the same date format across all countries. In particular, we use "YYYY-mm-dd". Let us process the `date` field accordingly.

In [85]:
df.loc[:, "date"] = pd.to_datetime(df.date).apply(lambda x: x.strftime("%Y-%m-%d"))

In [86]:
# Show some rows
df.head(5)

Unnamed: 0,date,people_vaccinated,people_fully_vaccinated,region
0,2020-12-26,0,0,Åland
1,2020-12-27,0,0,Åland
2,2020-12-28,0,0,Åland
3,2020-12-29,0,0,Åland
4,2020-12-30,0,0,Åland
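
The same conversion can also be written with the vectorized `.dt.strftime` accessor. The sketch below applies it to two timestamps in the source's format and checks that the result matches the "YYYY-mm-dd" pattern.

```python
import pandas as pd

# Two timestamps in the format returned by the source.
dates = pd.Series(["2020-12-26T00:00:00.000Z", "2020-12-27T00:00:00.000Z"])

# Parse to datetimes, then format as strings without the time part.
formatted = pd.to_datetime(dates).dt.strftime("%Y-%m-%d")
print(formatted.tolist())  # ['2020-12-26', '2020-12-27']

# Sanity check: every value looks like YYYY-mm-dd.
all_valid = formatted.str.match(r"^\d{4}-\d{2}-\d{2}$").all()
```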


### 2.3 Region field
Identify the region field, and make sure it is consistent with the country's regions. For this example, you may want to look up "ISO 3166-2:FI" (https://en.wikipedia.org/wiki/ISO_3166-2:FI) or "Regions of Finland" (https://en.wikipedia.org/wiki/Regions_of_Finland) on Wikipedia.

Once we deem that the region field makes sense and is complete, we need to rename the regions, using the standardized names from our ISO db:

- Note 1: In this example we have that some regions have more than one district.
- Note 2: There are two regions which we will be ignoring, namely "All areas" and "Other areas".

In [87]:
# Regions to be ignored
df = df[~df["region"].isin(["All areas", "Other areas"])]

In [88]:
# Get standard Finland's region names
from covid_updater.iso import ISODB
df_iso = ISODB()._df
df_iso.loc[df_iso["location_iso"] == "FI"]

Unnamed: 0,location_iso,subdivision_name,region_iso
805,FI,Aland,FI-01
806,FI,Etela-Karjala,FI-02
807,FI,Etela-Pohjanmaa,FI-03
808,FI,Etela-Savo,FI-04
809,FI,Kainuu,FI-05
810,FI,Kanta-Hame,FI-06
811,FI,Keski-Pohjanmaa,FI-07
812,FI,Keski-Suomi,FI-08
813,FI,Kymenlaakso,FI-09
814,FI,Lappi,FI-10


We define a mapping dictionary that "translates" input region names to our standard ISO names. Note that, due to the nature of the data, more than one Hospital District may correspond to a region.

In [89]:
# Area renaming
region_renaming = {
    'Åland': 'Aland',
    'South Karelia Hospital District': 'Etela-Karjala',
    'South Ostrobothnia Hospital District': 'Etela-Pohjanmaa',
    'South Savo Hospital District': 'Etela-Savo',
    'Helsinki and Uusimaa Hospital District': 'Uusimaa',
    'Itä-Savo Hospital District': 'Etela-Savo',
    'Kainuu Hospital District': 'Kainuu',
    'Kanta-Häme Hospital District': 'Kanta-Hame',
    'Central Ostrobothnia Hospital District': 'Keski-Pohjanmaa',
    'Central Finland Hospital District': 'Keski-Suomi',
    'Kymenlaakso Hospital District': 'Kymenlaakso',
    'Lappi Hospital District': 'Lappi',
    'Länsi-Pohja Hospital District': 'Lappi',
    'Pirkanmaa Hospital District': 'Pirkanmaa',
    'North Karelia Hospital District': 'Pohjois-Karjala',
    'North Ostrobothnia Hospital District': 'Pohjois-Pohjanmaa',
    'North Savo Hospital District': 'Pohjois-Savo',
    'Päijät-Häme Hospital District': 'Paijat-Hame',
    'Satakunta Hospital District': 'Satakunta',
    'Vaasa Hospital District': 'Pohjanmaa',
    'Southwest Finland Hospital District': 'Varsinais-Suomi'
}

df.loc[:, "region"] = df.loc[:, "region"].replace(region_renaming)
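
Since `replace` silently leaves unknown names untouched, it can be worth checking that nothing was missed. The sketch below uses a hypothetical helper, `unmapped_regions`, with a toy mapping and a toy set of valid names; in practice the valid names would come from the ISO table shown above.

```python
import pandas as pd


def unmapped_regions(regions, renaming, valid_names):
    """Return region names that are still not valid after renaming (hypothetical helper)."""
    renamed = pd.Series(list(regions)).replace(renaming)
    return sorted(set(renamed) - set(valid_names))


# Toy inputs: two known names and one the mapping does not cover.
toy_renaming = {"Åland": "Aland", "Kainuu Hospital District": "Kainuu"}
toy_valid = {"Aland", "Kainuu"}
leftover = unmapped_regions(
    ["Åland", "Kainuu Hospital District", "Mystery District"],
    toy_renaming,
    toy_valid,
)
print(leftover)  # ['Mystery District']
```

An empty result means every input name was mapped to a standard name.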

### 2.4 Aggregate data per region
Next, we need to aggregate the data per region, since some regions appear more than once for a given date (data from more than one hospital district).

In [93]:
df = df.groupby(["region", "date"]).sum().reset_index()

In [94]:
# Show some rows
df.head(5)

Unnamed: 0,region,date,people_vaccinated,people_fully_vaccinated
0,Aland,2020-12-26,0,0
1,Aland,2020-12-27,0,0
2,Aland,2020-12-28,0,0
3,Aland,2020-12-29,0,0
4,Aland,2020-12-30,0,0
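
The effect of the aggregation is easier to see on a toy frame (invented numbers) in which two rows share the same region and date, as happens when two hospital districts map to one region:

```python
import pandas as pd

# Two rows for Lappi on the same date, e.g. from two hospital districts.
toy = pd.DataFrame({
    "region": ["Lappi", "Lappi", "Kainuu"],
    "date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "people_vaccinated": [3, 4, 5],
    "people_fully_vaccinated": [1, 0, 2],
})

# Sum the numeric columns within each (region, date) pair.
agg = toy.groupby(["region", "date"], as_index=False).sum()
print(agg)
```

The result has one row per region and date, with the two Lappi rows added together.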


### 2.5 Add missing columns
In this case, we are missing the columns related to location and ISO codes, as well as `total_vaccinations`.

In [95]:
# Country
df = df.assign(location="Finland")

In [96]:
# ISO
df = ISODB().merge(df, country_iso="FI")

In [98]:
# total_vaccinations
df.loc[:, "total_vaccinations"] = df.loc[:, "people_vaccinated"] + df.loc[:, "people_fully_vaccinated"]

In [99]:
# Show some rows
df.head(5)

Unnamed: 0,region,date,people_vaccinated,people_fully_vaccinated,location,location_iso,region_iso,total_vaccinations
0,Aland,2020-12-26,0,0,Finland,FI,FI-01,0
1,Aland,2020-12-27,0,0,Finland,FI,FI-01,0
2,Aland,2020-12-28,0,0,Finland,FI,FI-01,0
3,Aland,2020-12-29,0,0,Finland,FI,FI-01,0
4,Aland,2020-12-30,0,0,Finland,FI,FI-01,0
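
For intuition, attaching ISO codes amounts to a join between the data and the ISO table on the region name. The sketch below does this with plain pandas on toy data; it is only an illustration and not necessarily how `ISODB.merge` is implemented. The column names follow the ISO table shown earlier.

```python
import pandas as pd

# Toy ISO table with the same columns as the one shown in section 2.3.
iso = pd.DataFrame({
    "location_iso": ["FI", "FI"],
    "subdivision_name": ["Aland", "Kainuu"],
    "region_iso": ["FI-01", "FI-05"],
})

# Toy data, including one region name that does not exist in the ISO table.
data = pd.DataFrame({
    "region": ["Aland", "Kainuu", "Typo-region"],
    "people_vaccinated": [1, 2, 3],
})

# Left join keeps every data row; unmatched regions get NaN ISO codes.
merged = data.merge(
    iso, how="left", left_on="region", right_on="subdivision_name"
).drop(columns="subdivision_name")

# Regions without an ISO code point to a naming problem upstream.
unmatched = merged.loc[merged["region_iso"].isna(), "region"].tolist()
print(unmatched)  # ['Typo-region']
```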


### 2.6 Get cumulative values
The data from some countries is already cumulative. For others it is not, which is the case for Finland: the values below are daily counts.

In [102]:
# See last values for Aland
df[df["region"]=="Aland"].tail()

Unnamed: 0,region,date,people_vaccinated,people_fully_vaccinated,location,location_iso,region_iso,total_vaccinations
65,Aland,2021-03-01,38,0,Finland,FI,FI-01,38
66,Aland,2021-03-02,46,0,Finland,FI,FI-01,46
67,Aland,2021-03-03,36,1,Finland,FI,FI-01,37
68,Aland,2021-03-04,0,0,Finland,FI,FI-01,0
69,Aland,2021-03-05,0,0,Finland,FI,FI-01,0


To overcome this, we apply the `cumsum` function within each region.

In [103]:
for field in ["people_vaccinated", "people_fully_vaccinated", "total_vaccinations"]:
    df.loc[:, field] = df.groupby("region")[field].cumsum().values

In [105]:
# See again last values for Aland
df[df["region"]=="Aland"].tail()

Unnamed: 0,region,date,people_vaccinated,people_fully_vaccinated,location,location_iso,region_iso,total_vaccinations
65,Aland,2021-03-01,3363,776,Finland,FI,FI-01,4139
66,Aland,2021-03-02,3409,776,Finland,FI,FI-01,4185
67,Aland,2021-03-03,3445,777,Finland,FI,FI-01,4222
68,Aland,2021-03-04,3445,777,Finland,FI,FI-01,4222
69,Aland,2021-03-05,3445,777,Finland,FI,FI-01,4222
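
Note that a cumulative sum only makes sense if the rows of each region are in chronological order. The sketch below, on a toy frame with invented numbers, sorts explicitly before accumulating:

```python
import pandas as pd

# Daily counts, deliberately out of order for region A.
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "date": ["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-02"],
    "total_vaccinations": [5, 3, 10, 0],
})

# Sort by region and date so the running total follows time.
toy = toy.sort_values(["region", "date"]).reset_index(drop=True)

# Running total within each region.
toy["total_vaccinations"] = toy.groupby("region")["total_vaccinations"].cumsum()
print(toy["total_vaccinations"].tolist())  # [3, 8, 10, 10]
```

In the notebook, the earlier `groupby(["region", "date"])` already leaves the rows sorted by region and date.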


## 3. Exporting
Exporting consists of:

- Order columns based on standard (check [Australia.csv](https://github.com/sociepy/covid19-vaccination-subnational/blob/main/data/countries/Australia.csv) as an example).
- Sort rows by region & date.
- Ensure fields `total_vaccinations`, `people_vaccinated`, `people_fully_vaccinated` are of type int.
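
The three steps above can be sketched as a single function. The helper name `prepare_export` is hypothetical, and the column order is an assumption (it follows the field list in section 2); check Australia.csv for the order actually used by the project.

```python
import pandas as pd

# Assumed column order (taken from the field list in section 2).
COLUMNS = [
    "location", "region", "date", "location_iso", "region_iso",
    "total_vaccinations", "people_vaccinated", "people_fully_vaccinated",
]
INT_FIELDS = ["total_vaccinations", "people_vaccinated", "people_fully_vaccinated"]


def prepare_export(df):
    """Cast counts to int, order columns and sort rows (hypothetical helper)."""
    df = df.astype({field: int for field in INT_FIELDS})
    return df[COLUMNS].sort_values(["region", "date"]).reset_index(drop=True)


# Toy frame with float counts and unsorted rows.
toy = pd.DataFrame({
    "region": ["Kainuu", "Aland"],
    "date": ["2021-01-01", "2021-01-01"],
    "location": "Finland",
    "location_iso": "FI",
    "region_iso": ["FI-05", "FI-01"],
    "total_vaccinations": [3.0, 5.0],
    "people_vaccinated": [2.0, 4.0],
    "people_fully_vaccinated": [1.0, 1.0],
})
out = prepare_export(toy)
print(out)
```

The result could then be written with `out.to_csv(path, index=False)`, where `path` is the destination file of your choice.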

## 4. Implementing scraper
First, you need to decide whether to implement this as a batch scraper or as an incremental scraper. Here we implement it as a batch scraper, so we will write a class inheriting from `covid_updater.scraping.base.Scraper`.

### Steps to follow:

- Create a new file `covid_updater/scraping/countries/finland.py`.
- Design and implement a new class `FinlandScraper` in the aforementioned file.
    - Use the code that you have tested in this notebook, and place it in the different class methods (`load_data`, `_process`, or create new ones, etc.)
    - Use other country files as reference. Take your time to familiarize yourself with their different parts and with the parent's logic, too.
    - Remember to inherit from `covid_updater.scraping.base.Scraper` and reuse the logic it already implements.
- Add the required lines in `covid_updater/scraping/core.py`:
    - New import: `from covid_updater.scraping.countries.finland import FinlandScraper`
    - Add an instance of `FinlandScraper` to list `scrapers`.
- Add the corresponding ISO code to list `ISO_CODES` in script `scripts/update_countries.py`.
- Run the `update_all.sh` script with option `--update-population` to update the `data/population.csv` file with the newly added regions.

```
$ bash scripts/update_all.sh --update-population
```

- Verify that `data/vaccinations.csv` has been correctly generated (i.e., contains the newly added country data, fields `*_per_100` do not contain NaNs or empty values, etc.)
- [Sync your fork with upstream](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)
- Create a Pull Request
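
To give an idea of how the notebook code could be arranged inside `FinlandScraper`, here is a rough, self-contained sketch. The `Scraper` class below is a stand-in written for this example only; the real `covid_updater.scraping.base.Scraper` has its own interface, so check it (and other country files) before copying anything. The `_fetch` and `run` methods are likewise assumptions, not part of the library.

```python
import json
from urllib.request import urlopen

import pandas as pd


class Scraper:
    """Stand-in base class for this sketch; NOT the real covid_updater Scraper."""

    def load_data(self):
        raise NotImplementedError

    def _process(self, df):
        raise NotImplementedError

    def run(self):
        # Assumed flow: load the raw data, then process it.
        return self._process(self.load_data())


class FinlandScraper(Scraper):
    data_url = "https://piikki-api.lab.juiciness.io/administrations"
    renaming = {
        "firstDoseShots": "people_vaccinated",
        "secondDoseShots": "people_fully_vaccinated",
    }

    def _fetch(self):
        # Network call, isolated so it can be replaced when testing.
        with urlopen(self.data_url) as response:
            return json.load(response)

    def load_data(self):
        frames = [
            pd.DataFrame.from_records(area["shotHistory"]).assign(region=area["areaName"])
            for area in self._fetch()
        ]
        return pd.concat(frames, ignore_index=True)

    def _process(self, df):
        df = df.rename(columns=self.renaming).drop(columns="areaId", errors="ignore")
        df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
        df = df[~df["region"].isin(["All areas", "Other areas"])]
        # Region renaming, aggregation, ISO merge and cumulative sums
        # (sections 2.3 to 2.6) would follow here.
        return df.assign(location="Finland")
```

Calling `_process` directly on a small hand-made DataFrame is a convenient way to test the processing logic without downloading anything.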