### Environment log:

(1) Using the "analytics3" Anaconda environment; Python packages are installed directly with the *pip* command.  
(2) Installed the *pyvirtualdisplay*, *selenium*, and *bs4* packages.  
(3) Installed *chromedriver* through "pip install chromedriver".  

(4) Got this error when creating and starting the Display object:
> display = Display(visible=0, size=(800,600))
> display.start()  
>> FileNotFoundError: [Errno 2] No such file or directory: 'Xvfb': 'Xvfb'
>> EasyProcessError: start error EasyProcess cmd_param=['Xvfb', '-help'] cmd=['Xvfb', '-help'] oserror=[Errno 2] No such file or directory: 'Xvfb': 'Xvfb' return_code=None stdout="None" stderr="None" timeout_happened=False>

**Solution found in: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=860501**   
 
(5) Installed the system package xvfb through the shell line: *sudo apt-get install xvfb*.

- When running the "driver = webdriver.Chrome()" python command, I received the following error:  
> driver = webdriver.Chrome()
>> FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'  

(6) This happened even though, since I am using a conda environment, I had already set the chromedriver path:  
> chromeDriverPath = '~/anaconda3/envs/analytics3/'

The solution was to create a symbolic link in the directory shown above, from the system shell, which made it work properly:  
> $ ln -s ~/anaconda3/envs/analytics3/chromedriver-Linux64 chromedriver
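
Both errors in this log have the same cause: an external program (first Xvfb, then chromedriver) could not be found on the PATH. The following is a minimal sketch of a check that could be run before starting the display and the driver. It uses only the standard library; the helper name `missing_binaries` is hypothetical and is not part of selenium or pyvirtualdisplay. It also illustrates that Python does not expand `~` in a path string by itself, which is an assumption about why setting the path variable alone may not have helped.

```python
import os
import shutil

def missing_binaries(names):
    """Return the names that cannot be found as executables on the PATH."""
    return [name for name in names if shutil.which(name) is None]

# Hypothetical pre-flight check before calling Display() and webdriver.Chrome():
missing = missing_binaries(['Xvfb', 'chromedriver'])
if missing:
    print('Not found on PATH:', ', '.join(missing))

# '~' is expanded by the shell, not by Python; expand it explicitly.
env_dir = os.path.expanduser('~/anaconda3/envs/analytics3/')
print(env_dir)
```

If the check reports `chromedriver`, creating the symbolic link described above (or adding the directory to the PATH) should make it disappear from the list.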


## Epidemiological bulletins from the World Health Organization (WHO)
https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/  

## Epidemiological bulletins from the Brazilian Ministry of Health
https://www.saude.gov.br/boletins-epidemiologicos
  - Coronavírus/COVID-19  
  - Dengue

### Initial statements

In [17]:
import time
import pdb   # Python debugger
from pyvirtualdisplay import Display
from selenium import webdriver
from bs4 import BeautifulSoup

In [25]:
# Parameter definitions
url = 'https://www.saude.gov.br/boletins-epidemiologicos'
chromeDriverPath = '~/anaconda3/envs/analytics3/'

### Web browsing

In this step of the data crawling we want to access the webpage through the Chromium driver. It would be appropriate to write functions both for accessing the page and for analyzing its content, but that is only possible once we know the page structure. So, for now, we will simply read and explore the page, looking for the patterns and links we are after.

In [26]:
# Start the virtual display:
display = Display(visible=0, size=(800,600))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '800x600x24', ':1005'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '800x600x24', ':1005'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [27]:
# Start the browser through the Chromium driver:
driver = webdriver.Chrome()

In [28]:
# Reading the page content and saving it specifying its encoding scheme:
driver.get(url)
page = driver.page_source.encode('utf-8')
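
The page-access steps above could be wrapped in a single function. The following is a minimal sketch, not the notebook's actual code: `fetch_page_source` and its `make_driver` parameter are hypothetical names. It relies only on `get`, `page_source` and `quit`, which a Selenium WebDriver provides. The driver factory is passed in as an argument so that the function can be tried without a browser, here with a stand-in class.

```python
def fetch_page_source(url, make_driver):
    """Open `url` with a driver built by `make_driver` and return the page source.

    `make_driver` is any zero-argument callable returning an object with
    `get(url)`, `page_source` and `quit()`, e.g. selenium's `webdriver.Chrome`.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


class FakeDriver:
    """Stand-in driver used only to exercise the function without a browser."""

    def __init__(self):
        self.page_source = ''
        self.closed = False

    def get(self, url):
        self.page_source = '<html><body>' + url + '</body></html>'

    def quit(self):
        self.closed = True


print(fetch_page_source('https://example.org', FakeDriver))
```

In this notebook the call would presumably look like `page = fetch_page_source(url, webdriver.Chrome).encode('utf-8')`, made after `display.start()`.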

#### Initial exploration

In [30]:
# How large is the loaded page (in bytes)?
print(len(page))
# A bytes object is expected, since the page source was encoded:
print(type(page))

485346
<class 'bytes'>


In [31]:
# Given its small size, I will print the whole page:
print(page)



Since the loaded page has many links, as well as other content we are not interested in, we will use BeautifulSoup to parse it and extract only the desired links.

In [32]:
# Printing the main page URL, to add the link manually and evaluate the concept.
print(url)

https://www.saude.gov.br/boletins-epidemiologicos


In [19]:
# The following document was accessed:
# http://web.eecs.umich.edu/~radev/coursera-slides/nlpintro_co8_12.05_DR_Edit.pdf

In [33]:
# Parsing the page. The parser is passed explicitly, since calling
# BeautifulSoup(page) without one raises a warning that suggests "lxml":
soup = BeautifulSoup(page, 'lxml')


In [21]:
# The links on the page are laid out in the following format:
#</p><ul>
#<li><a href="nlpintro_co3_01.01_DR_Edit.pdf">01.01.pdf</a>
#</li><li><a href="nlpintro_co3_01.02_DR_Edit.pdf">01.02.pdf</a>

In [23]:
#links = soup.findAll("div", {"class": "complaint-item ng-scope"}, recursive=True)

In [44]:
tags = soup('a')

In [45]:
print(type(tags))
print("Number of links on the page analyzed:", len(tags))

<class 'bs4.element.ResultSet'>
Number of links on the page analyzed: 1571


In [46]:
# To get only the file name, i.e. the value of the href attribute of one link (here, the one at index 1000):
tags[1000].get('href')

'/images/pdf/2016/marco/23/2016-008---DengueSE8-publica----o.pdf'

In [41]:
tags

[<a class="hide" href="#accessibility" id="topo">Ir direto para menu de acessibilidade.</a>,
 <a class="link-barra" href="https://gov.br">Brasil</a>,
 <a href="#" id="menu-icon"></a>,
 <a href="http://www.saude.gov.br/coronavirus" style="color:red;">CORONAVÍRUS (COVID-19)</a>,
 <a class="link-barra" href="http://www.simplifique.gov.br">Simplifique!</a>,
 <a class="link-barra" href="https://www.gov.br/pt-br/participacao-social/">Participe</a>,
 <a class="link-barra" href="http://www.acessoainformacao.gov.br">Acesso à informação</a>,
 <a class="link-barra" href="http://www.planalto.gov.br/legislacao">Legislação</a>,
 <a class="link-barra" href="https://gov.br/pt-br/canais-do-executivo-federal">Canais</a>,
 <a class="logo-vlibras" href="#" id="logovlibras"></a>,
 <a href="http://www.vlibras.gov.br">VLibras</a>,
 <a accesskey="1" href="#content" id="link-conteudo">
                                     Ir para o conteúdo
                                     <span>1</span>
 </a>,
 <a accessk

In [48]:
#soup = BeautifulSoup(''.join(page), 'lxml')
for link in soup.find_all('a', href=True):
    if link['href'].lower().endswith(".pdf"):
        if link['href'].lower().startswith("http"):
            print(link['href'])

https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/03/BE6-Boletim-Especial-do-COE.pdf
https://portalarquivos2.saude.gov.br/images/pdf/2020/fevereiro/13/Boletim-epidemiologico-COEcorona-SVS-13fev20.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/17/2020-04-16---BE10---Boletim-do-COE-21h.pdf
https://portalarquivos.saude.gov.br/images/pdf/2020/April/12/2020-04-11-BE9-Boletim-do-COE.pdf
https://portalarquivos2.saude.gov.br/images/pdf/2020/fevereiro/13/Boletim-epidemiologico-COEcorona-SVS-13fev20.pdf
http://saude.gov.br/images/pdf/2019/outubro/24/Boletim-epidemiologico-SVS-31.pdf
http://portalarquivos2.saude
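
The output above contains repeated links, and the filter only keeps hrefs that already start with "http", so relative links such as the one at `tags[1000]` are dropped. The following is a minimal sketch of one way to handle both points, assuming relative links should be resolved against the page URL. The function name `unique_pdf_urls` is hypothetical; it uses only `urllib.parse` from the standard library, and the sample hrefs are taken from the outputs shown earlier.

```python
from urllib.parse import urljoin, urlsplit

def unique_pdf_urls(hrefs, base_url):
    """Resolve each href against base_url and keep the distinct PDF links, in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Look only at the path, so query strings or fragments do not interfere.
        if not urlsplit(absolute).path.lower().endswith('.pdf'):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample_hrefs = [
    '#',
    '/images/pdf/2016/marco/23/2016-008---DengueSE8-publica----o.pdf',
    'https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf',
    'https://portalarquivos.saude.gov.br/images/pdf/2020/April/19/BE12-Boletim-do-COE.pdf',
]
for link in unique_pdf_urls(sample_hrefs, 'https://www.saude.gov.br/boletins-epidemiologicos'):
    print(link)
```

With the parsed page, the hrefs could be supplied as `(a['href'] for a in soup.find_all('a', href=True))`.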

### Downloading the files
At first, one could call wget through os.system ('wget http ...') to download each file. However, to keep the code portable across operating systems, it is better to use Python's own libraries whenever possible.  
[Ref.: http://stackoverflow.com/questions/2467609/using-wget-via-python]

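In [49]:
# In Python 3, urlretrieve lives in urllib.request; urljoin resolves relative links
import os
import urllib.request
from urllib.parse import urljoin

In [43]:
# Test of the command to be used inside the loop:
urllib.request.urlretrieve(urljoin(url, tags[2].get('href')), filename='apostila/' + os.path.basename(tags[2].get('href')))

('apostila/nlpintro_co3_01.03_DR_Edit.pdf',
 <httplib.HTTPMessage instance at 0x7f9fe198f200>)

In [44]:
# Loop to download every PDF link (anchors without an href, or not pointing to a PDF, are skipped):
for tag in soup.find_all('a', href=True):
    nome_arquivo = tag['href']
    if nome_arquivo.lower().endswith('.pdf'):
        urllib.request.urlretrieve(urljoin(url, nome_arquivo), filename='apostila/' + os.path.basename(nome_arquivo))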
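
The download loop could also be packaged as a reusable function. The following is a minimal sketch under a few assumptions: the function name `download_all` and its `fetch` parameter are hypothetical, the destination folder is created if missing, and files already on disk are not downloaded again. By default it calls `urllib.request.urlretrieve`, the standard-library function for saving a URL to a local file.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve

def download_all(urls, dest_dir, fetch=urlretrieve):
    """Download each URL into dest_dir and return the list of local paths.

    `fetch(url, filename)` defaults to urllib.request.urlretrieve.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last component of the URL path as the local file name.
        name = os.path.basename(urlsplit(url).path)
        if not name:
            continue  # the URL has no file name component; skip it
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):  # skip files downloaded earlier
            fetch(url, target)
        saved.append(target)
    return saved
```

A call such as `download_all(pdf_urls, 'apostila')`, where `pdf_urls` is a list of absolute PDF links, would then replace the loop. Passing a different `fetch` makes it possible to try the function without network access.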

## Conclusion
81 clicks were 'saved' ;)