# Exploring COVID-19 and dengue numbers: the data acquisition process

### Environment log:

(1) Using the "analytics3" Anaconda environment; Python packages are installed directly with the *pip* command.  
(2) Installed the *pyvirtualdisplay*, *selenium*, and *[bs4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)* packages.  
(3) Installed *chromedriver* through "pip install chromedriver".  

(4) Got this error when initializing the Display function:
> display = Display(visible=0, size=(800,600))
> display.start()  
>> FileNotFoundError: [Errno 2] No such file or directory: 'Xvfb': 'Xvfb'
>> EasyProcessError: start error EasyProcess cmd_param=['Xvfb', '-help'] cmd=['Xvfb', '-help'] oserror=[Errno 2] No such file or directory: 'Xvfb': 'Xvfb' return_code=None stdout="None" stderr="None" timeout_happened=False>

**Solution found in: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=860501**   
 
(5) Installed the system package xvfb through the shell line: *sudo apt-get install xvfb*.

(6) When running the "driver = webdriver.Chrome()" Python command, I received the following error:  
> driver = webdriver.Chrome()
>> FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'  

This happened even though, since I am using a conda environment, I had set the chromedriver path beforehand:  
> chromeDriverPath = '~/anaconda3/envs/analytics3/'

Note that this variable is never passed to *webdriver.Chrome()* in this notebook, so it does not change where Selenium looks for the driver. The solution was to create a symbolic link in the path shown above, through the system bash, which made it work properly:  
> $ ln -s ~/anaconda3/envs/analytics3/chromedriver-Linux64 chromedriver  

(7) There are some broken links on the official BMH page: they lack the "https://portalarquivos.saude.gov.br/" portion of the address. We defined a function to address this issue.
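
Both "No such file or directory" errors in items (4) to (6) come from an executable that cannot be found on the PATH. The check below is a minimal sketch, using only the standard library, that reports which of the two programs are missing before any browser is started. The helper name `find_missing_executables` is hypothetical and not part of the original notebook.

```python
import shutil

def find_missing_executables(names=("Xvfb", "chromedriver")):
    """Return the names from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = find_missing_executables()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found.")
```

If a name is reported as missing, installing the system package (item 5) or creating the symbolic link (item 6) in a directory that is on the PATH should make it visible.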


## Epidemiological bulletins from World Health Organization - WHO
https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/  
https://github.com/danielsteman/COVID-19_WHO  
https://github.com/danielsteman/COVID-19_WHO/blob/master/WHO_webscrape.ipynb (the author uses PyPDF2)

## Epidemiological bulletins from Brazilian Ministry of Health
https://www.saude.gov.br/boletins-epidemiologicos
  - Coronavírus/COVID-19  
  - Dengue

### Initial statements

In [1]:
import time
import pdb   # Python debugger
from pyvirtualdisplay import Display
from selenium import webdriver
from bs4 import BeautifulSoup

In [2]:
# Parameter definitions
url = 'https://www.saude.gov.br/boletins-epidemiologicos'
chromeDriverPath = '~/anaconda3/envs/analytics3/'

### Web browsing

In this step of the data crawling we access the web page through the Chrome driver. It is appropriate to write functions both for accessing the page and for analyzing its content, but that is only possible once we know the page structure. For now, therefore, we will read and explore the page, looking for the patterns and links we are interested in.

In [3]:
# Display starting:
display = Display(visible=0, size=(800,600))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '800x600x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '800x600x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [4]:
# Open the Chromium.driver for the intended page:
driver = webdriver.Chrome()

In [5]:
# Reading the page content and saving it specifying its encoding scheme:
driver.get(url)
page = driver.page_source.encode('utf-8')

### Initial exploration

In [6]:
# How large is the loaded page?
print(len(page))
# A bytes object is expected:
print(type(page))

490660
<class 'bytes'>


In [7]:
## Showing only the first 2k positions of the bytes stream:
print(page[:2000])

b'<!DOCTYPE html><!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="pt-br" dir="ltr"> <![endif]--><!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8" lang="pt-br" dir="ltr"> <![endif]--><!--[if IE 8]>         <html class="no-js lt-ie9" lang="pt-br" dir="ltr"> <![endif]--><!--[if gt IE 8]><!--><html xmlns="http://www.w3.org/1999/xhtml" class="no-js" lang="pt-br" dir="ltr"><!--<![endif]--><head>\n<!-- Google Tag Manager -->\n<script type="text/javascript" async="" src="https://www.google-analytics.com/gtm/js?id=GTM-W26S6PL&amp;t=gtm2&amp;cid=639199945.1588552376"></script><script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script><script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script><script src="https://connect.facebook.net/pt_BR/sdk.js?hash=054ff8773cb77fd6b084be4b8df5b59e&amp;ua=modern_es6" async="" crossorigin="anonymous"></script><script id="facebook-jssdk" src="//connect.

**Since the loaded page has too many links, as well as other content we are not interested in, we will use BeautifulSoup to parse it and obtain the desired links.**

In [8]:
# Parsing the page; the parser is named explicitly so that bs4 does not have to guess one:
soup = BeautifulSoup(page, "lxml")


Although every HTML file shares an equivalent structure, parsing a web page involves knowing how it was specifically built. If you are not the page developer, you should expect to spend some time on it, exploring how the information you are looking for is arranged among all the classes and divisions.  

A helpful and simpler way to do this (at least from the point of view of a data scientist and not a web developer) is to use the Chrome inspection tool, as demonstrated in the image below:  
![Inspecting the BMH web page with the Chrome browser tool.](BMH_Chrome_inspection.png)

---------------------

### Getting the data:  
We are interested in the *Coronavírus/COVID-19* and *Dengue* data, which can be reached from the "Por assunto" (By subject) menu on the main page. After some iterations, we can see that the links we are looking for sit in an element that is a sibling of the one containing the reference title. So we search for the parent element of our title of interest and, from there, collect all the links that match our criteria (an address starting with "http" and pointing to a PDF file). The function below condenses this search:

In [9]:
def BMH_findLinks_bySubject(subject, soup, interestStructure = 'h3'):
    '''Given a subject of interest, look for the HTTP links available from the web page of the Brazilian Ministry of Health.
    BMH page: https://www.saude.gov.br/boletins-epidemiologicos
    Syntax: 
        subject: string type with the subject of interest. Check in the BMH page for the right spelling.
            (e.g.: "Coronavírus/COVID-19" or "Dengue")
        soup: BeautifulSoup data structure from the web page of interest.
        interestStructure: the HTML type of the structure of interest. The default value is 'h3'. 
        
    Returns:
        links: a list of links corresponding to the searching criteria.
    '''
    ## Finding the section corresponding to the subject:
    section = soup.find(interestStructure, string=subject)
    
    ## Finding the parent element:
    section_parent = section.find_parent()
    
    ## Finding all the HTTP links in the parent section corresponding to a PDF file:
    linkList = []
    for link in section_parent.find_all('a', href=True):
        if link['href'].lower().endswith(".pdf"):
            if link['href'].lower().startswith("http"):
                linkList.append(link['href'])

    return linkList
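
The two conditions of the filter (the address ends with ".pdf" and starts with "http") can be tried on a toy snippet without opening a browser. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so it is not the notebook's method, only a dependency-free illustration of the same filter. The class name `PdfLinkCollector`, the toy markup, and the example.org address are assumptions made for this example.

```python
from html.parser import HTMLParser

class PdfLinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points to a PDF file."""

    def __init__(self):
        super().__init__()
        self.absolute = []   # PDF hrefs starting with "http"
        self.relative = []   # every other PDF href

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            if href.lower().startswith("http"):
                self.absolute.append(href)
            else:
                self.relative.append(href)

# Toy markup (hypothetical), loosely shaped like a subject section of the page:
toy_html = '''
<div>
  <h3><a href="#">Subject</a></h3>
  <a href="https://example.org/files/report-2.pdf">Report 2</a>
  <a href="/images/pdf/report-1.pdf">Report 1</a>
</div>
'''

collector = PdfLinkCollector()
collector.feed(toy_html)
print(collector.absolute)
print(collector.relative)
```

Keeping the non-"http" PDF addresses in a separate list, instead of discarding them, makes it easy to see what such a filter leaves out.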

---------------------

Using *BMH_findLinks_bySubject*, we can now look for all epidemiological bulletins about *Coronavírus/COVID-19* in the scraped page of the Brazilian Ministry of Health:

In [10]:
## Obtaining the list:
links_COVID = BMH_findLinks_bySubject('Coronavírus/COVID-19', soup)

In [11]:
## Checking out each item:
for link in links_COVID:
    print(link)

https://portalarquivos.saude.gov.br/images/pdf/2020/April/27/2020-04-27-18-05h-BEE14-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/21/BE13---Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/03/BE6-Boletim-Especial-do-COE.pdf
https://portalarquivos2.saude.gov.br/images/pdf/2020/fevereiro/13/Boletim-epidemiologico-COEcorona-SVS-13fev20.pdf


----------------------------

#### WARNING: broken links
Always check whether your scripts are retrieving the desired data. If we open the page of interest in a web browser, we find more bulletins than those returned by the query above. To understand why, we go back to a step-by-step manual inspection:

In [12]:
## (1) Looking for the section we are interested in the parsed page:
section = soup.find('h3', string='Coronavírus/COVID-19')

In [13]:
## (2) Moving up to the parent section:
section_parent = section.find_parent()

In [14]:
## (3) Checking manually what is going on:
section_parent.find_all('a')

[<a href="#">Coronavírus/COVID-19</a>,
 <a href="https://portalarquivos.saude.gov.br/images/pdf/2020/April/27/2020-04-27-18-05h-BEE14-Boletim-do-COE.pdf" rel="noopener" target="_blank">BE 14 - Boletim COE Coronavírus</a>,
 <a href="https://portalarquivos.saude.gov.br/images/pdf/2020/April/21/BE13---Boletim-do-COE.pdf">BE 13 - Boletim COE Coronavírus</a>,
 <a href="https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf">BE 12 - Boletim COE Coronavírus</a>,
 <a href="/images/pdf/2020/April/18/2020-04-17---BE11---Boletim-do-COE-21h.pdf" target="_blank">BE 11 - Boletim COE Coronavírus</a>,
 <a href="https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf" rel="noopener" target="_blank">BE 10 - Boletim COE Coronavirus</a>,
 <a href="https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf" rel="noopener" target="_blank">BE9 - Boletim Especial do COE Coronavírus Avaliação de Ri

> We can always make use of visualization commands to make things easier:

In [15]:
for report in section_parent.find_all('a')[:8]:  # Notice we are limiting the visualization to the first eight links
    print(report['href'])
    print(report.contents,'\n')

#
['Coronavírus/COVID-19'] 

https://portalarquivos.saude.gov.br/images/pdf/2020/April/27/2020-04-27-18-05h-BEE14-Boletim-do-COE.pdf
['BE 14 - Boletim COE Coronavírus'] 

https://portalarquivos.saude.gov.br/images/pdf/2020/April/21/BE13---Boletim-do-COE.pdf
['BE 13 - Boletim COE Coronavírus'] 

https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf
['BE 12 - Boletim COE Coronavírus'] 

/images/pdf/2020/April/18/2020-04-17---BE11---Boletim-do-COE-21h.pdf
['BE 11 - Boletim COE Coronavírus'] 

https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf
['BE 10 - Boletim COE Coronavirus'] 

https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf
['BE9 - Boletim Especial do COE Coronavírus Avaliação de Risco'] 

/images/pdf/2020/April/09/be-covid-08-final-2.pdf
['BE8 - Boletim Especial do COE Coronavírus Avaliação de Risco'] 



Bingo! **There are some broken links...** Observe the lines that, instead of starting with "https", go straight to the directory structure. To address this issue, let's rewrite our previous function:

In [16]:
def BMH_findLinks_bySubject_v2(subject, soup, rootPage='https://www.saude.gov.br', interestStructure ='h3'):
    '''Given a subject of interest, look for the HTTP links available from the web page of the Brazilian Ministry of Health.
    BMH page: https://www.saude.gov.br/boletins-epidemiologicos
    Syntax: 
        subject: string type with the subject of interest. Check in the BMH page for the right spelling.
            (e.g.: "Coronavírus/COVID-19" or "Dengue")
        soup: BeautifulSoup data structure from the web page of interest.
        interestStructure: the HTML type of the structure of interest. The default value is 'h3'.
        rootPage: in case there are broken links, i.e., those starting with a directory structure 
            instead of "http", this address is prepended to them.
            E.g.: 'https://www.saude.gov.br' or 'https://portalarquivos.saude.gov.br'
        
    Returns:
        links: a list of links corresponding to the searching criteria.
    '''
    ## Finding the section corresponding to the subject:
    section = soup.find(interestStructure, string=subject)
    
    ## Finding the parent element:
    section_parent = section.find_parent()
    
    ## Finding all the HTTP links in the parent section corresponding to a PDF file:
    linkList = []
    for link in section_parent.find_all('a', href=True):
        if link['href'].lower().endswith(".pdf"):
            if link['href'].lower().startswith("http"):
                linkList.append(link['href'])

            ## Adding the root page to the broken links:
            elif link['href'].lower().startswith("/images/"):
                linkList.append(rootPage+link['href'])

    return linkList
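
As an alternative to concatenating strings, the standard library's `urllib.parse.urljoin` resolves a relative address against the address of the page it came from, which is what a browser does when such a link is clicked. The sketch below applies it to two addresses taken from the inspection output. It only builds the addresses; whether each file is actually served from the resulting host is not verified here.

```python
from urllib.parse import urljoin

page_url = "https://www.saude.gov.br/boletins-epidemiologicos"

hrefs = [
    "https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf",
    "/images/pdf/2020/April/09/be-covid-08-final-2.pdf",
]

# urljoin leaves absolute URLs untouched and resolves relative ones
# against the address of the page they were found on.
resolved = [urljoin(page_url, href) for href in hrefs]
for link in resolved:
    print(link)
```

With this approach no special case for addresses starting with "/images/" is needed, since any relative address is resolved the same way.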

---------------------

Testing, once again, the function we created to retrieve the bulletins about COVID-19 from BMH:

In [17]:
## Obtaining the list:
links_COVID = BMH_findLinks_bySubject_v2('Coronavírus/COVID-19', soup)

In [18]:
## Checking out each item:
for link in links_COVID:
    print(link)

https://portalarquivos.saude.gov.br/images/pdf/2020/April/27/2020-04-27-18-05h-BEE14-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/21/BE13---Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf
https://www.saude.gov.br/images/pdf/2020/April/18/2020-04-17---BE11---Boletim-do-COE-21h.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf
https://www.saude.gov.br/images/pdf/2020/April/09/be-covid-08-final-2.pdf
https://www.saude.gov.br/images/pdf/2020/April/09/be-covid-08-final.pdf
https://www.saude.gov.br/images/pdf/2020/April/06/2020-04-06-BE7-Boletim-Especial-do-COE-Atualizacao-da-Avaliacao-de-Risco.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/03/BE6-Boletim-Especial-do-COE.pdf
https://www.saude.gov.br/images/pdf/2020/marco/21/2020-03-13-Bolet

Since there are not so many links, one may audit the list against the original web page. Doing so, one finds an additional link here: there are two files for the 8th bulletin, probably a second version released because of formatting issues. From a data science perspective this is not a problem, since such overlap must be handled during the data processing step anyway.
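
One simple piece of that later processing can be sketched already: once files are on disk, byte-identical copies can be grouped by a content hash. This is only an assumption about how the overlap might be handled. It will not flag a revised version of a bulletin, whose content differs and would have to be compared after text extraction. The function name and the placeholder files below are made up for the example.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def group_identical_files(directory):
    """Map a SHA-256 digest to the names of the files that share that exact content."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir(), key=lambda p: p.name):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only the digests shared by more than one file
    return {digest: names for digest, names in groups.items() if len(names) > 1}

# Self-contained demonstration with placeholder files (names and contents are made up):
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'report-a.pdf').write_bytes(b'same content')
    Path(tmp, 'report-a-copy.pdf').write_bytes(b'same content')
    Path(tmp, 'report-b.pdf').write_bytes(b'other content')
    duplicates = group_identical_files(tmp)
    for names in duplicates.values():
        print(names)
```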

---------------------

As a check, we can also try the same function for other subjects, for example "Asbestose":

In [19]:
links_asbestose = BMH_findLinks_bySubject_v2('Asbestose', soup)

In [20]:
## Checking out each item:
for link in links_asbestose:
    print(link)

https://www.saude.gov.br/images/pdf/2016/fevereiro/02/2015-011---Asbestose.pdf


---------------------

### Downloading the files - Version 1.0 [There's an updated version below]
One could call *wget* through os.system ('wget http ...') to download each file. However, to keep the code portable, it is better to rely on Python's own libraries whenever possible. 
[Ref.: http://stackoverflow.com/questions/2467609/using-wget-via-python]

In [21]:
import urllib.request   # "import urllib" alone does not guarantee that the request submodule is loaded
import re
import os

In the current version of the urllib library (*request* module), *urlretrieve* takes the full destination file path rather than a separate destination folder. Based on [this reference](https://stackoverflow.com/questions/20338452/saving-files-downloaded-from-urlretrieve-to-another-folder-other), we use the *os* library to build that full path (*set_fullpath* function).

In [22]:
## Defining the destination directory:
pathDir = 'BMH_Bulletins'

In [150]:
def set_fullpath(directory, url, createDir=False):
    ## Getting the file name from the URL using RegEx:
    filename = re.findall(r'(?:[^/]+)$(?<=(?:\.jpg)|(?:\.pdf))', url)[0]  # [0] takes the single match out of the list returned by findall; the dots are escaped so they match a literal "."
        
    ## Checking if the directory exists:    
    if os.path.isdir(directory):
        fullpath = os.path.join(directory, filename)
        return fullpath
    elif createDir:
        os.makedirs(directory)
        fullpath = os.path.join(directory, filename)
        return fullpath
    else:
        print('Directory "{}" does not exist.'.format(directory))
        return -1

In [146]:
set_fullpath('BMH_Bulletins/Dengue', url, createDir=False)

'BMH_Bulletins/Dengue/Boletim-epidemiologico-SVS-04fev20.pdf'

In [140]:
url

'https://www.saude.gov.br/images/pdf/2020/fevereiro/04/Boletim-epidemiologico-SVS-04fev20.pdf'
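
The regular expression in *set_fullpath* only matches names ending with ".jpg" or ".pdf", and indexing the result with `[0]` raises an error for any other address. As a hedged alternative, the file name can be taken from the path component of the URL with the standard library. The helper name `filename_from_url` is hypothetical, and unlike the regular expression it accepts any extension.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url):
    """Return the last path component of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(unquote(path))

example = 'https://www.saude.gov.br/images/pdf/2020/fevereiro/04/Boletim-epidemiologico-SVS-04fev20.pdf'
print(filename_from_url(example))   # Boletim-epidemiologico-SVS-04fev20.pdf
```

The result can then be passed to `os.path.join` together with the destination directory, as *set_fullpath* does.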

In [165]:
# Downloading COVID-19 bulletins:
def BMH_download_bulletins(subjectLinks, pathDir, verbose=False, createDir=False):
    countSuccess = 0
    countFails = 0
    for url in subjectLinks:
        if verbose:
            print(url)
        try:
            urllib.request.urlretrieve(url, set_fullpath(pathDir, url, createDir))
            countSuccess += 1
            if verbose:
                print('Success! \n')
        except Exception:  # a bare "except:" would also swallow KeyboardInterrupt
            countFails += 1
            if verbose: 
                print('FAILED! :( \n')
    #if verbose: -- We opt to keep this final message, even when verbose is False.
    print('{0} files were successfully downloaded, with {1} fails.'.format(countSuccess, countFails))

    return

In [166]:
BMH_download_bulletins(links_asbestose,pathDir='BMH_Bulletins/asbestose', verbose=True, createDir=True)

https://www.saude.gov.br/images/pdf/2016/fevereiro/02/2015-011---Asbestose.pdf
Success! 

1 files were successfully downloaded, with 0 fails.


-------------------

### Downloading the files - Version 2.0
This version **checks whether the file was already downloaded** before fetching it. As before, to keep the code portable, it relies on Python's own libraries rather than on external tools. 
[Ref.: http://stackoverflow.com/questions/2467609/using-wget-via-python]

In [23]:
import urllib.request   # "import urllib" alone does not guarantee that the request submodule is loaded
import re
import os

In the current version of the urllib library (*request* module), *urlretrieve* takes the full destination file path rather than a separate destination folder. Based on [this reference](https://stackoverflow.com/questions/20338452/saving-files-downloaded-from-urlretrieve-to-another-folder-other), we use the *os* library to build that full path (*set_fullpath* function).

In [24]:
## Defining the destination directory:
pathDir = 'BMH_Bulletins'

In [44]:
def set_fullpath(directory, url, createDir=False):
    '''Given a local path for a directory and a URL name for a PDF file, 
    this function defines the full path (directory + filename.pdf) for downloading it. 
    Syntax: 
        directory: the directory path;
        url: the URL containing the PDF file;
        createDir: if True, creates the directory in case it doesn't exist. 
            If False, returns an error (-1) if the directory doesn't exist.
    '''
    ## Getting the file name from the URL using RegEx:
    filename = re.findall(r'(?:[^/]+)$(?<=(?:\.jpg)|(?:\.pdf))', url)[0]  # [0] takes the single match out of the list returned by findall; the dots are escaped so they match a literal "."
        
    ## Checking if the directory exists:    
    if os.path.isdir(directory):
        fullpath = os.path.join(directory, filename)
        return fullpath
    elif createDir:
        os.makedirs(directory)
        fullpath = os.path.join(directory, filename)
        return fullpath
    else:
        print('Directory "{}" does not exist.'.format(directory))
        return -1

In [45]:
def get_filenameFromURL(url):
    '''From a URL name for a PDF file, gets only the filename.pdf by using regular expression.'''
    ## Getting the file name from the URL using RegEx:
    filename = re.findall(r'(?:[^/]+)$(?<=(?:\.jpg)|(?:\.pdf))', url)[0]  # [0] takes the single match out of the list returned by findall; the dots are escaped so they match a literal "."
    return filename

In [54]:
def doesFileExists(pathDir, filename, verbose=True):
    '''Check whether a file already exists in a given directory.
    Returns True if it exists and False otherwise (including when the directory itself is missing).''' 
    ## os.path.isfile does not raise when pathDir does not exist yet, unlike os.listdir:
    if os.path.isfile(os.path.join(pathDir, filename)):
        if verbose:
            print('The file {0} already exists in the specified path.\n'.format(filename))
        return True
    else:
        return False

In [55]:
# Downloading COVID-19 bulletins:
def BMH_download_bulletins_checkExists(subjectLinks, pathDir, verbose=False, createDir=False):
    countSuccess = 0
    countFails = 0
    countExists = 0
    for url in subjectLinks:
        if verbose:
            print(url)
        filename = get_filenameFromURL(url)
        try:
            ## Checking if the file already exists:
            if doesFileExists(pathDir, filename, verbose):
                countExists += 1
            else:
                ## Downloading the file:
                urllib.request.urlretrieve(url, set_fullpath(pathDir, url, createDir))
                countSuccess += 1
                if verbose:
                    print('Success! \n')
        except Exception:  # a bare "except:" would also swallow KeyboardInterrupt
            countFails += 1
            if verbose: 
                print('FAILED! :( \n')
    #if verbose: -- We opt to keep this final message, even when verbose is False.
    print('{0} files were successfully downloaded, with {1} fails. {2} already exist.'.format(countSuccess, countFails, countExists))
    return

#### Testing the functions above by downloading the "Covid-19" reports:

In [58]:
BMH_download_bulletins_checkExists(links_COVID,pathDir='BMH_Bulletins/covid/', verbose=False, createDir=True)

0 files were successfully downloaded, with 1 fails. 15 already exist.
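
The summary reports a failure but not its cause, because the exception is discarded. The sketch below is one possible way to keep the reason for each failed address. The function name `download_with_reasons` is hypothetical, and the demonstration uses local `file://` addresses created in a temporary directory so that it runs without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def download_with_reasons(urls, dest_dir):
    """Download each URL into dest_dir and return {url: reason} for the failures."""
    os.makedirs(dest_dir, exist_ok=True)
    failures = {}
    for url in urls:
        target = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
        try:
            urllib.request.urlretrieve(url, target)
        except (OSError, ValueError) as err:
            # urllib's URLError and HTTPError are subclasses of OSError;
            # ValueError is raised for malformed addresses.
            failures[url] = '{0}: {1}'.format(type(err).__name__, err)
    return failures

# Demonstration with local placeholder files instead of real bulletins:
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, 'source', 'bulletin.pdf')
    source.parent.mkdir()
    source.write_bytes(b'%PDF-1.4 placeholder')
    urls = [source.as_uri(), Path(tmp, 'source', 'missing.pdf').as_uri()]
    failures = download_with_reasons(urls, os.path.join(tmp, 'downloads'))
    for failed_url, reason in failures.items():
        print(failed_url, '->', reason)
```

For a real address, the recorded reason would typically show an HTTP status code, which helps to tell a dead link from a temporary network problem.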


-------------------

## Concluding with a wrapper function:
In this final section we will build a wrapper function with all those pieces of code previously built.  

Of course, any change in the website we scrape may break this code, so it requires constant maintenance and adaptation.

Still, as long as the structure of the Brazilian Ministry of Health portal remains the same, the function below remains valid and practical.

In [62]:
### The web page must already be loaded with Chromium and parsed with BeautifulSoup, i.e., the "soup" must be ready!
def BMH_get_bulletins(subject, soup, pathDir='BMH_Bulletins', verbose=True):
    links = BMH_findLinks_bySubject_v2(subject, soup)
    BMH_download_bulletins_checkExists(links,pathDir, verbose, createDir=True)
    return

In [63]:
BMH_get_bulletins('Coronavírus/COVID-19', soup, pathDir='BMH_Bulletins/covid', verbose=True)

https://portalarquivos.saude.gov.br/images/pdf/2020/April/27/2020-04-27-18-05h-BEE14-Boletim-do-COE.pdf
The file 2020-04-27-18-05h-BEE14-Boletim-do-COE.pdf already exists in the specified path.

https://portalarquivos.saude.gov.br/images/pdf/2020/April/21/BE13---Boletim-do-COE.pdf
The file BE13---Boletim-do-COE.pdf already exists in the specified path.

https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf
The file BE12-Boletim-do-COE.pdf already exists in the specified path.

https://www.saude.gov.br/images/pdf/2020/April/18/2020-04-17---BE11---Boletim-do-COE-21h.pdf
The file 2020-04-17---BE11---Boletim-do-COE-21h.pdf already exists in the specified path.

https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf
The file 2020-04-16---BE10---Boletim-do-COE-21h.pdf already exists in the specified path.

https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf
The file 2020

---------------------

In [64]:
BMH_get_bulletins('Dengue', soup, pathDir='BMH_Bulletins/dengue', verbose=False)

5 files were successfully downloaded, with 6 fails. 167 already exist.


In [65]:
BMH_get_bulletins('Dengue', soup, pathDir='BMH_Bulletins/dengue', verbose=False)

0 files were successfully downloaded, with 6 fails. 172 already exist.
