 <img src="https://i0.wp.com/kpmgeestiblog.ee/wp-content/uploads/2018/12/cropped-LOGIO_KPMG_NoCP_RGB_280-2.png?resize=110%2C51&ssl=1"> </img>

## Web-mining document registry


#### The purpose of this notebook is to quickly:
1. Review the public Document Registry of Riigikantselei (the Government Office of Estonia),
2. Understand the structure of document entries and define several sample fields (such as document type, direction of document flow, date, etc.),
3. Devise functions to web-scrape the contents, intentionally at a very limited scope,
4. Run the scraper for the last full month (May 2021) and save the results as structured data to a table file,
5. Check the results.

The time frame to devise and implement the scraper is 1-2 hours.

Link to the registry:
https://dhs.riigikantselei.ee/avalikteave.nsf/byjournalkey?open

#### 1. Review the public Document Registry of Riigikantselei (the Government Office of Estonia)

The registry has a comprehensive search engine.
<br>The registry uses customized HTML tags such as

<br><i>&lt;fieldtitle name="receivedfrom">Kellelt saabunud&lt;/fieldtitle></i>

<br>The registry uses hexadecimal (HEX) numbering to generate links to documents, for example

<i>"https://dhs.riigikantselei.ee/avalikteave.nsf/documents/NT003822E6"</i>

#### 2. Understand structure of document entries and define several sample fields

The fields can be parsed out by manipulating the HTML text directly. This is less ideal than using Beautiful Soup functionality, especially for parsing HTML tables. However, Riigikantselei's document registry uses fully customized tags for its tables instead of the standard ones, so digging into them is a must. Also, only certain fields are needed for this particular business task (see point 2 above), such as receivers, senders, type of document and very few others.
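
As a minimal sketch of the text-manipulation approach described above, the snippet below pulls the `name` attribute and the visible label out of the registry's custom `<fieldtitle>` tags with a regular expression. The function name `extract_field_titles` is hypothetical and is not part of `riigikantselei_functions`; only the `<fieldtitle name="receivedfrom">` example shown earlier comes from the registry itself.

```python
import re

# Matches the custom tag seen in the registry, e.g.
# <fieldtitle name="receivedfrom">Kellelt saabunud</fieldtitle>
FIELDTITLE_RE = re.compile(
    r'<fieldtitle\s+name="([^"]+)"\s*>(.*?)</fieldtitle>',
    re.IGNORECASE | re.DOTALL,
)


def extract_field_titles(html):
    """Return {field name: visible label} for every <fieldtitle> tag found.

    Hypothetical helper for illustration; not the notebook's parse_entry.
    """
    return {name: label.strip() for name, label in FIELDTITLE_RE.findall(html)}


sample = '<fieldtitle name="receivedfrom">Kellelt saabunud</fieldtitle>'
print(extract_field_titles(sample))
```

A regular expression is enough here because the tag is flat and carries a single attribute; for nested or irregular markup a real parser would likely be more robust.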

#### 3. Devise functions to web-scrape contents

The fields are defined in the .py module "riigikantselei_functions".
<br>The module provides several functions, listed below.

In [1]:
import riigikantselei_functions

Functions defined:

 get_web_contents(link)			returns "contents"
 get_links(soup)			returns "links"
 parse_entry(contents, url, counter)	returns "dataframe"
 generate_hexes(hex_code = "37E2A2")	returns "hex links"
 scrape_web(links)			saves web scraped contents to Excel file


The fields to scrape and place into the table are defined inside the function
"parse_entry" and can be printed with that function's built-in help.

In [6]:
from riigikantselei_functions import get_web_contents, get_links, parse_entry, generate_hexes, scrape_web

help(parse_entry)

Help on function parse_entry in module riigikantselei_functions:

parse_entry(contents, url, counter)
    This function parses out the contents of one document entry in Document Registry
    The idea is to use ad-hoc functionalities based on text manipulation and obtain values 
    for only select fields, which are:
                         columns = ['URL',
                                'Kellelt',
                                'Kellele',
                                'Väljaandja',
                                'Dok No', 
                                'Kuupäev',
                                'Dok Tüüp', 
                                'Dok Klass', 
                                'AK']



#### 4. Run the scraper for the last full month (May 2021) and save the results as structured data to a table file

As mentioned above, in addition to customizing HTML tags, Riigikantselei also customizes its links by replacing normal decimal incrementation with HEX incrementation. Therefore, in the code below I create 16,341 links that use HEX incrementation and cover May 2021.

The initial link is based on the link patterns from the first few days of May, which were found by manual review. The 16K generated links are very likely to contain all of the document turnover registered in the registry during the whole of May 2021.

In [2]:
generated_links = generate_hexes(hex_code = "37E2A2")

After creating the 16K links starting from the beginning of May, I also create 15K links going back in time, to be sure that no links are missed.

In [4]:
generated_links_prev = generate_hexes(hex_code = '37A80A')
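
The real `generate_hexes` is not shown in this notebook, so the following is only a sketch of how HEX-incremented links could be produced. It assumes that a document link is the base URL plus the prefix `NT00` plus a six-digit uppercase HEX code, which is inferred from the single example link shown earlier; `sketch_generate_hex_links` and its `count` parameter are hypothetical.

```python
BASE_URL = "https://dhs.riigikantselei.ee/avalikteave.nsf/documents/"
PREFIX = "NT00"  # assumption: inferred from the example link ".../NT003822E6"


def sketch_generate_hex_links(hex_code="37E2A2", count=5):
    """Return `count` links whose HEX suffix increases by one per link.

    Hypothetical stand-in for generate_hexes; the real function may differ.
    """
    start = int(hex_code, 16)   # HEX text -> integer
    width = len(hex_code)       # keep the same number of HEX digits
    return [
        f"{BASE_URL}{PREFIX}{start + offset:0{width}X}"
        for offset in range(count)
    ]


for link in sketch_generate_hex_links("37E2A2", count=3):
    print(link)

# Distance between the two starting codes used in this notebook.
print(int("37E2A2", 16) - int("37A80A", 16))
```

The two starting codes used in this notebook are consistent with such a scheme: `37E2A2` minus `37A80A` is exactly 15000, so 15K consecutive codes starting at `37A80A` end right where the May range begins.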

The code below scrapes the 16K links that were just generated automatically, parses all documents into the predefined table and saves the results to MS Excel XLSX files.

In [166]:
scrape_web(generated_links)

Commencing scraping... 20:51:44
Saving.. 11_06_2021_20_54_52_408_temp_.xlsx on 408 iteration 20:54:52
Saving.. 11_06_2021_20_58_04_812_temp_.xlsx on 812 iteration 20:58:05
Saving.. 11_06_2021_21_01_16_1212_temp_.xlsx on 1212 iteration 21:01:16
Saving.. 11_06_2021_22_03_03_7720_temp_.xlsx on 7720 iteration 22:03:03
Saving.. 11_06_2021_22_06_09_8128_temp_.xlsx on 8128 iteration 22:06:09
Saving.. 11_06_2021_22_09_12_8532_temp_.xlsx on 8532 iteration 22:09:12
Saving.. 11_06_2021_22_12_12_8932_temp_.xlsx on 8932 iteration 22:12:12
Saving.. 11_06_2021_22_15_15_9336_temp_.xlsx on 9336 iteration 22:15:15
Saving.. 11_06_2021_23_14_19_15836_temp_.xlsx on 15836 iteration 23:14:19
Saving.. 11_06_2021_23_17_19_16240_temp_.xlsx on 16240 iteration 23:17:19
Saving.. 11_06_2021_23_18_08_16341_FINAL_.xlsx on 16341 iteration
Scraping completed with 16341 runs altogether 1025 collected 23:18:09
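
The log above shows that most generated links yield no document (1025 collected out of 16341 runs) and that intermediate results are saved at intervals before the final file. The internals of `scrape_web` are not shown, so the following is only a sketch of such a loop. The names `checkpoint_name`, `scrape_with_checkpoints`, `fetch`, `save` and the fixed `every` interval are assumptions; only the file-name pattern is taken from the log.

```python
from datetime import datetime


def checkpoint_name(iteration, final=False, now=None):
    """Build a file name in the same pattern as the log above."""
    now = now or datetime.now()
    tag = "FINAL" if final else "temp"
    return f"{now:%d_%m_%Y_%H_%M_%S}_{iteration}_{tag}_.xlsx"


def scrape_with_checkpoints(links, fetch, save, every=400):
    """Fetch each link, keep non-empty results, and save periodically.

    fetch(link) -> dict or None  (hypothetical: None means no document found)
    save(rows, filename)         (hypothetical: e.g. writes the rows to XLSX)
    """
    rows = []
    for i, link in enumerate(links, start=1):
        row = fetch(link)
        if row is not None:
            rows.append(row)
        # Intermediate save, so a crash does not lose everything collected.
        if i % every == 0 and i != len(links):
            save(rows, checkpoint_name(i))
    save(rows, checkpoint_name(len(links), final=True))
    return rows


# Usage with stand-in callables, so nothing is downloaded or written to disk.
saved = []
rows = scrape_with_checkpoints(
    links=[f"link-{n}" for n in range(10)],
    fetch=lambda link: {"URL": link} if link.endswith(("0", "5")) else None,
    save=lambda current, name: saved.append((len(current), name)),
    every=4,
)
print(len(rows), [count for count, _ in saved])
```

Passing `fetch` and `save` in as callables keeps the loop testable without network access; in the notebook's setting they would presumably wrap `get_web_contents`, `parse_entry` and an Excel writer.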


The code below additionally scrapes the 15K automatically generated links going back in time, parses all documents into the predefined table and saves the results to MS Excel XLSX files.

In [186]:
scrape_web(generated_links_prev)

Commencing scraping... 09:37:02
Saving.. 12_06_2021_10_00_30_2636_temp_.xlsx on 2636 iteration 10:00:30
Saving.. 12_06_2021_10_03_37_3040_temp_.xlsx on 3040 iteration 10:03:37
Saving.. 12_06_2021_10_06_43_3444_temp_.xlsx on 3444 iteration 10:06:43
Saving.. 12_06_2021_10_09_49_3848_temp_.xlsx on 3848 iteration 10:09:49
Saving.. 12_06_2021_10_12_51_4248_temp_.xlsx on 4248 iteration 10:12:51
Saving.. 12_06_2021_10_53_14_8712_temp_.xlsx on 8712 iteration 10:53:15
Saving.. 12_06_2021_10_56_17_9112_temp_.xlsx on 9112 iteration 10:56:17
Saving.. 12_06_2021_10_59_19_9512_temp_.xlsx on 9512 iteration 10:59:20
Saving.. 12_06_2021_11_02_33_9920_temp_.xlsx on 9920 iteration 11:02:34
Saving.. 12_06_2021_11_05_38_10320_temp_.xlsx on 10320 iteration 11:05:39
Saving.. 12_06_2021_11_46_05_14792_temp_.xlsx on 14792 iteration 11:46:05
Saving.. 12_06_2021_11_47_45_15000_FINAL_.xlsx on 15000 iteration
Scraping completed with 15000 runs altogether 1151 collected 11:47:45


#### 5. Check results
I will check the downloaded files to see whether a well-formed table was created.

In [13]:
import pandas as pd

columns = ['URL','Kellelt','Kellele','Väljaandja','Dok No','Kuupäev','Dok Tüüp','Dok Klass','AK']
pd.read_excel('11_06_2021_23_18_08_16341_FINAL_.xlsx')[columns][200:203].fillna('')

| | URL | Kellelt | Kellele | Väljaandja | Dok No | Kuupäev | Dok Tüüp | Dok Klass | AK |
|---|---|---|---|---|---|---|---|---|---|
| 200 | https://dhs.riigikantselei.ee/avalikteave.nsf/... | | Kandidaat | | 21-00376-10 | 07.05.2021 | Kiri | 18 Avaliku teenistuse tippjuhtide värbamine ja... | Asutusesiseseks kasutamiseks |
| 201 | https://dhs.riigikantselei.ee/avalikteave.nsf/... | Justiitsministeerium | | | 21-01098-1 | 07.05.2021 | Määruse eelnõu | 02 Vabariigi Valitsuse istungite ja nõupidamis... | |
| 202 | https://dhs.riigikantselei.ee/avalikteave.nsf/... | Lääne-Viru Omavalitsuste Liit | | | 21-01094-1 | 06.05.2021 | Kiri | 07 Vabariigi Valitsuse ja peaministri muu asja... | |
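
Because the links were generated by HEX range rather than by date, a further check could confirm that the collected rows really fall in May 2021 and summarize them by document type. The sketch below does this with the standard library only, using the three rows from the output above; the helper `in_month` is hypothetical, and the `dd.mm.yyyy` date format is assumed from the `Kuupäev` values shown.

```python
from collections import Counter
from datetime import datetime

# Three rows copied from the output above (only the fields used here).
rows = [
    {"Kuupäev": "07.05.2021", "Dok Tüüp": "Kiri"},
    {"Kuupäev": "07.05.2021", "Dok Tüüp": "Määruse eelnõu"},
    {"Kuupäev": "06.05.2021", "Dok Tüüp": "Kiri"},
]


def in_month(date_text, year, month):
    """True if a dd.mm.yyyy date string falls in the given year and month."""
    try:
        parsed = datetime.strptime(date_text, "%d.%m.%Y")
    except (TypeError, ValueError):
        return False  # missing or malformed dates are treated as out of range
    return parsed.year == year and parsed.month == month


may_rows = [row for row in rows if in_month(row["Kuupäev"], 2021, 5)]
type_counts = Counter(row["Dok Tüüp"] for row in may_rows)
print(len(may_rows), dict(type_counts))
```

The same filter could be applied to the full table after loading it with `pd.read_excel`, for instance by converting each row to a dict.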
