## UFRN - EEC2006 - TOPICOS ESPECIAIS F
## Third assignment

Team members:
* **20171021275 - Fabio Fonseca de Oliveira**
* **2016102462  - Júlio César Melo Gomes de Oliveira**
* **20171021201 - Tiago Fernandes de Miranda**

This notebook contains the solution to the third assignment proposed in the course. A brief description of each solution follows, along with its specific details.

This notebook holds the first part of the solution: the web scraping procedure used to collect the names of UFRN employees.

1. Portal da Transparência
==

[Portal da Transparência](http://www.portaldatransparencia.gov.br/) is a Brazilian government portal dedicated to making public all expenditures of the federal government. It has a list of all expenses and money transfers the federal government has made.

### 1.1 Motivations

- How many employees do the IES (*instituições de ensino superior*) have?
- What is the gender gap between the employees? 
    - https://www.dicionariodenomesproprios.com.br/
    - https://gender-api.com/
    - https://pypi.python.org/pypi/Genderize
    - http://fmeireles.com/blog/rstats/genderbr-predizer-sexo


2. Number of employees by IES
==

- [Units of the Ministry of Education](http://www.portaltransparencia.gov.br/servidores/OrgaoExercicio-ListaOrgaos.asp?CodOS=15000)

### 2.1 Identifying the URL structure

In [None]:
# import package
from requests import get

# specify the url
url = 'http://www.portaldatransparencia.gov.br/servidores/\
OrgaoExercicio-ListaOrgaos.asp?CodOS=15000'

# build the request, send it, and capture the response
response = get(url)

# extract the text
text = response.text

print(text[:500])

### 2.1 Understanding the HTML structure of a single page

In [None]:
from bs4 import BeautifulSoup

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

In [None]:
# information about IES is within a table element
unit_table = html_soup.find_all('table')

In [None]:
# there are two table elements 
len(unit_table)

In [None]:
# the second one is the target
unit_rows  = unit_table[1].find_all('tr')

In [None]:
len(unit_rows)

In [None]:
# the first tr is the header of table
unit_rows

In [None]:
unit_rows = unit_rows[1:]

In [None]:
unit_rows[0:4]

In [None]:
# Get the last page number from the pagination label,
# which has the form '<label> <current>/<last>'
ttag_p = html_soup.find('p', class_ = 'paginaAtual').text
text1 = ttag_p.split(' ', 1 )
text2 = text1[1].split('/', 1 )
lastPage = text2[1]
print(lastPage)
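
The two `split` calls above depend on the label having exactly one space before the `current/last` pair. A slightly more defensive sketch uses a regular expression instead. The helper name `parse_last_page` and the sample label `'Página 1/8'` are assumptions made for illustration; the real label text on the portal may differ.

```python
import re

def parse_last_page(label):
    """Return the last page number from a label such as 'Página 1/8'."""
    # Look for '<current>/<last>' anywhere in the label
    match = re.search(r'(\d+)\s*/\s*(\d+)', label)
    if match is None:
        raise ValueError('unexpected pagination label: {!r}'.format(label))
    return int(match.group(2))

# Hypothetical label, used only to exercise the helper
print(parse_last_page('Página 1/8'))
```

Returning an `int` also removes the need to convert the value later when building the list of pages.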

### 2.2 Extracting the RH code

In [None]:
unit_rh_code = unit_rows[0].find('td', class_ = 'firstChild').text
unit_rh_code
for k in unit_rows:
    p = k.find('td', class_ = 'firstChild').text
    print(p)

### 2.3 Extracting the name of IES


In [None]:
unit_name = unit_rows[0].find('a').text
unit_name
for k in unit_rows:
    p = k.find('a').text
    print(p)

### 2.4 Extracting the number of employees

In [None]:
unit_number_of_employees = unit_rows[0].find('td', attrs = {'style':'text-align: right;'}).text
unit_number_of_employees = int(unit_number_of_employees)
unit_number_of_employees
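
Calling `int()` directly works as long as the cell holds only digits. If the portal ever formats large counts with a thousands separator, the conversion would fail. The sketch below is one way to guard against that; the helper name `parse_employee_count` is hypothetical, and it assumes the counts are whole numbers, so `.` and `,` can only be grouping separators.

```python
def parse_employee_count(text):
    """Convert a table cell such as ' 1.234 ' to the integer 1234."""
    # Assumption: counts are whole numbers, so '.' and ',' are only
    # thousands separators and can be dropped safely
    cleaned = text.strip().replace('.', '').replace(',', '')
    return int(cleaned)

# Hypothetical cell contents, used only to exercise the helper
print(parse_employee_count('5421'))
print(parse_employee_count(' 1.234 '))
```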

### 2.5 The script for a single page

In [None]:
# Lists to store the scraped data in
rh_codes = []
names = []
number_of_employees = []

# Extract data from individual ies rows
for row in unit_rows:
    
    # rh codes
    codes = row.find('td', class_ = 'firstChild').text
    rh_codes.append(codes)
    
    # ies names
    name = row.find('a').text
    names.append(name)
    
    # number of employees
    employees = row.find('td', attrs = {'style':'text-align: right;'}).text
    number_of_employees.append(int(employees))

In [None]:
# Let’s check the data collected so far. 
# Pandas makes it easy for us to see whether 
# we’ve scraped our data successfully.

import pandas as pd

web_scraping_df = pd.DataFrame({'Code': rh_codes,
                       'IES_name': names,
                       'Number_employees': number_of_employees})
print(web_scraping_df.info())
web_scraping_df

### 2.6 The script for multiple pages

Scraping multiple pages is a bit more challenging. We’ll build upon our one-page script by doing three more things:

- Making all the requests we want from within the loop.
- Controlling the loop’s rate to avoid bombarding the server with requests.
- Monitoring the loop while it runs.

We’ll scrape all 8 pages that contain information about the number of employees of each IES. Each page has up to 15 rows (excluding the header) of target information, and the last page is incomplete, so we’ll scrape data for at most 120 IES.
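
The three points listed above can be sketched without touching the network. In this minimal example the function `fetch_all` and its parameters (`fetch`, `pause`, `min_wait`, `max_wait`) are hypothetical names introduced for illustration; passing the fetch and pause functions as arguments makes the loop easy to try out with stand-ins before pointing it at the real portal.

```python
from random import randint
from time import sleep, time

def fetch_all(pages, fetch, pause=sleep, min_wait=5, max_wait=10):
    """Call fetch(page) for every page, pausing between requests."""
    results = []
    start_time = time()
    for count, page in enumerate(pages, start=1):
        # 1. Make the request from within the loop
        results.append(fetch(page))
        # 3. Monitor the loop while it runs
        elapsed = max(time() - start_time, 1e-9)
        print('Request: {}; Frequency: {:.3f} requests/s'.format(count, count / elapsed))
        # 2. Control the rate: wait a random time before the next request
        if count < len(pages):
            pause(randint(min_wait, max_wait))
    return results

# Usage with a stand-in fetch function and no real waiting
demo = fetch_all(['1', '2', '3'],
                 fetch=lambda page: 'page ' + page,
                 pause=lambda seconds: None)
print(demo)
```

In real use, `fetch` would wrap `requests.get` and `pause` would stay at its default, `time.sleep`.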


### 2.6.1 Changing the URL’s parameters

As shown earlier, the URLs follow a certain logic as the web pages change.

http://www.portaltransparencia.gov.br/servidores/OrgaoExercicio-ListaOrgaos.asp?CodOS=15000&Pagina=5

As we make the requests, we only have to vary the value of the last URL parameter: the <span style="background-color: #F9EBEA; color:#C0392B">Pagina</span> parameter.


In [None]:
pages = [str(i) for i in range(1,9)]
pages
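
As an alternative to formatting the URL string by hand, the query string can be built with `urllib.parse.urlencode` from the standard library. This is only a sketch: `BASE_URL` and `page_url` are names introduced here, and the base address is copied from the example URL in this section.

```python
from urllib.parse import urlencode

# Base address taken from the example URL shown in this section
BASE_URL = ('http://www.portaltransparencia.gov.br/servidores/'
            'OrgaoExercicio-ListaOrgaos.asp')

def page_url(page, cod_os='15000'):
    """Build the listing URL for one page of results."""
    query = urlencode({'CodOS': cod_os, 'Pagina': page})
    return '{}?{}'.format(BASE_URL, query)

# One URL per page, from page 1 to page 8
urls = [page_url(i) for i in range(1, 9)]
print(urls[4])
```

Building the query this way avoids the line-continuation backslash and the need to strip stray spaces from the URL afterwards.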

### 2.6.2  Piecing everything together

In [None]:
from time import sleep
from random import randint
from time import time
from warnings import warn
from IPython.display import clear_output

In [None]:
# Lists to store the scraped data in
rh_codes = []
names = []
number_of_employees = []

# Preparing the monitoring of the loop
start_time = time()
requests = 0

# For each page
for page in pages:
    
    #url 
    url = 'http://www.portaltransparencia.gov.br/servidores/\
    OrgaoExercicio-ListaOrgaos.asp?CodOS=15000&Pagina={}'.format(page).replace(" ", "")
        
    # Make a get request
    response = get(url)
        
    # Pause the loop
    sleep(randint(5,10))
    
    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
              
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
              
    # Break the loop if the number of requests is greater than expected
    # (one request per page)
    if requests > len(pages):
        warn('Number of requests was greater than expected.')  
        break 
        
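    # Parse the HTML of the current response; without this step every
    # iteration would re-read the soup of the page fetched earlier
    html_soup = BeautifulSoup(response.text, 'html.parser')

    # information about IES is within a table element
    unit_table = html_soup.find_all('table')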
    
    # the second one is the target
    unit_rows  = unit_table[1].find_all('tr')
    unit_rows = unit_rows[1:]
    
    # Extract data from individual ies rows
    for row in unit_rows:
    
        # rh codes
        codes = row.find('td', class_ = 'firstChild').text
        rh_codes.append(codes)
    
        # ies names
        name = row.find('a').text
        names.append(name)
    
        # number of employees
        employees = row.find('td', attrs = {'style':'text-align: right;'}).text
        number_of_employees.append(int(employees))


In [None]:
# Let’s check the data collected so far. 
# Pandas makes it easy for us to see whether 
# we’ve scraped our data successfully.

import pandas as pd

web_scraping_df = pd.DataFrame({'Code': rh_codes,
                       'IES_name': names,
                       'Number_employees': number_of_employees})
print(web_scraping_df.info())
web_scraping_df

In [None]:
web_scraping_df.to_csv('number_of_employees.csv')

### 2.7. Plotting and analyzing the distributions

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#to switch to seaborn defaults, simply call the set() function.
sns.set()

# The four preset contexts, in order of relative size, are paper, notebook, talk, and poster
sns.set_context("notebook")

# plot a univariate distribution of observations
# (histplot replaces the deprecated distplot in seaborn >= 0.11)
sns.histplot(web_scraping_df["Number_employees"], bins=50, kde=False)
plt.show()

In [None]:
sns.boxplot(x=web_scraping_df["Number_employees"])
plt.show()

In [None]:
print('Mean: %d' % (web_scraping_df["Number_employees"].mean()))
print('Median: %d' % (web_scraping_df["Number_employees"].median()))
print('Standard deviation: %d' % (web_scraping_df["Number_employees"].std()))

### 3.0. Adding the UFRN employees to a DataFrame for processing

In [None]:
# Check how many pages must be visited for each unit to list its employees

df = pd.read_csv('number_of_employees.csv', encoding='utf-8')
column_names = df.columns
code = df['Code']
pages = []

# Restart the monitoring counters for this loop
start_time = time()
requests = 0

for c in code:
    url = 'http://www.portaltransparencia.gov.br/servidores/\
    OrgaoExercicio-ListaServidores.asp?CodOrg={}'.format(c).replace(" ", "")

    # Make a get request and parse the response
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Pause the loop
    sleep(randint(5,10))

    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))

    # Break the loop if the number of requests is greater than expected
    if requests > 372:
        warn('Number of requests was greater than expected.')
        break

    # Read the last page number from the pagination label
    ttag_p = html_soup.find('p', class_ = 'paginaAtual').text
    text1 = ttag_p.split(' ', 1 )
    text2 = text1[1].split('/', 1 )
    lastPage = text2[1]
    pages.append(lastPage)
    print(lastPage)


In [None]:
from requests import get
from bs4 import BeautifulSoup
from time import sleep
from random import randint
from time import time
from warnings import warn
from IPython.display import clear_output
import pandas as pd

In [None]:
# Search the employee portal for UFRN
# Fetch each page and add every employee to a list

code = '26243'
url ='http://www.portaltransparencia.gov.br/servidores/\
OrgaoExercicio-ListaServidores.asp?CodOrg={}'.format(code).replace(" ", "")

# Make a get request
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

ttag_p = html_soup.find('p', class_ = 'paginaAtual').text
text1 = ttag_p.split(' ', 1 )
text2 = text1[1].split('/', 1 )
lastPage = text2[1]

pages = [str(i) for i in range(2,int(lastPage)+1)]
pages

unit_table = html_soup.find_all('table')
unit_rows  = unit_table[1].find_all('tr')

func = []
for k in unit_rows:
    p = k.find('a').text
    func.append(p)
    
# Preparing the monitoring of the loop
start_time = time()
requests = 0    

for p in pages:
        url ='http://www.portaltransparencia.gov.br/servidores/\
        OrgaoExercicio-ListaServidores.asp?CodOrg={}&Pagina={}'.format(code,p).replace(" ", "")

        # Make a get request
        response = get(url)

        html_soup = BeautifulSoup(response.text, 'html.parser')
        type(html_soup)

        # Pause the loop
        sleep(randint(5,10))

        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        unit_table = html_soup.find_all('table')
        unit_rows  = unit_table[1].find_all('tr')

        for k in unit_rows:
            pson = k.find('a').text
            func.append(pson)

# Drop the table header entries and count the collected names
func = [p for p in func if p != 'Nome do servidor']
len(func)



In [None]:
# Remove the extra spaces at the end of the full names
people = []
for f in func:
    people.append(f.rstrip()) 

In [None]:
code = '26243' # UFRN search code
gen = 'I' # Undefined
codes_func = []
gen_func = []
for p in people:
    codes_func.append(code)
    gen_func.append(gen)
func_scraping_df = pd.DataFrame({'Inst_code': codes_func,
                       'Serv_name': people,
                       'Serv_gen': gen_func})
print(func_scraping_df.info())
func_scraping_df    
func_scraping_df.to_csv('func_scraping_df.csv')

In [None]:
from genderize import Genderize
df = pd.read_csv('func_scraping_df.csv',encoding = 'utf-8')
column_names = df.columns
func_name = df['Serv_name']


In [None]:
# Organize the data for storage: query Genderize once per employee name

first_n = []   # predicted gender for each name
prob_n = []
gender_n = []  # full Genderize record for each name
count = 0
for name in func_name:
    gender = Genderize().get([name])
    gender_n.append(gender[0])
    first_n.append(gender[0]['gender'])
    sleep(randint(6,12))
    count += 1
    print(count," ")
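
Many employees are likely to share a first name, so one way to reduce the number of API calls is to look up each distinct first name only once and reuse the answer. The sketch below is an assumption about how that could be organized, not the procedure used in this notebook. The helpers `first_name`, `unique_first_names`, and `genders_for` are hypothetical, the sample names are made up, and `lookup` stands in for whatever function performs the real query.

```python
def first_name(full_name):
    """Return the first word of a full name, e.g. 'ANA SOUZA' -> 'Ana'."""
    parts = full_name.split()
    return parts[0].capitalize() if parts else ''

def unique_first_names(full_names):
    """Distinct first names, in order of first appearance."""
    seen = {}
    for full_name in full_names:
        seen.setdefault(first_name(full_name), None)
    return [name for name in seen if name]

def genders_for(full_names, lookup):
    """Look up each distinct first name once, then map back to every row."""
    cache = {name: lookup(name) for name in unique_first_names(full_names)}
    return [cache.get(first_name(full_name)) for full_name in full_names]

# Made-up names and a stand-in lookup table instead of a real API call
sample = ['ANA SOUZA', 'ana lima ', 'JOAO SILVA']
fake_lookup = {'Ana': 'female', 'Joao': 'male'}.get
print(genders_for(sample, fake_lookup))
```

With this layout the pause between requests applies once per distinct first name rather than once per employee.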

In [None]:
# Organize the data for processing: store the gender predicted from each name

df_n = pd.read_csv('func_scraping_df.csv',encoding = 'utf-8')
column_names = df_n.columns

for i,gen  in df_n.iterrows():
    df_n.loc[i,'Serv_gen'] = first_n[i]

In [None]:
df_n.to_csv('func_scraping_df_n.csv', encoding = 'utf-8')
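
Once the `Serv_gen` column is filled in, the gender gap raised in the motivations can be estimated from the share of each label. The following is a minimal sketch using toy labels rather than the scraped data. The helper name `gender_shares` is hypothetical, and anything that is not a string (for example a missing prediction) is counted as `'unknown'`.

```python
from collections import Counter

def gender_shares(labels):
    """Return the fraction of each gender label in the list."""
    # Non-string entries (None, NaN) are grouped under 'unknown'
    counts = Counter(label if isinstance(label, str) else 'unknown'
                     for label in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels, for illustration only
shares = gender_shares(['female', 'male', 'male', None])
print(shares)
```

In the notebook, the same function could be called as `gender_shares(df_n['Serv_gen'].tolist())`.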