# Normalizing institution `orgname` and `orgdiv1` at USP in `pt_BR`

To run this notebook you'll need some external packages,
which can be installed with:

```shell
pip install bs4 lxml pandas
```

In [1]:
import pandas as pd
pd.options.display.max_colwidth = 400 # Avoid "..." in large strings
pd.options.display.max_rows = 120     # Avoid "..." in the table representation of lengthy dataframes

## The `institution_orgdiv1` in the CSV created from raw XML documents

This is Clea's CSV, generated from raw XML documents:

In [2]:
dataset = pd.read_csv("inner_join_2018-06-04.csv",
                      dtype=str,
                      keep_default_na=False) \
            .drop_duplicates()

The goal isn't to normalize everything, just a subset of the data.
We're concerned with these columns/fields:

In [3]:
fields = [
    "addr_city",
    "addr_country",
    "addr_country_code",
    "addr_state",
    "institution_orgname",
    "institution_orgdiv1",
]
idataset = dataset[fields] # Input dataset

In [4]:
pd.DataFrame({
    "filled_in": (idataset.applymap(len) > 0).sum(),
    "distinct": idataset.apply(lambda col: col[col.apply(len) > 0].unique().size),
})

Unnamed: 0,filled_in,distinct
addr_city,79922,2728
addr_country,85968,266
addr_country_code,67724,124
addr_state,68545,671
institution_orgname,88527,9651
institution_orgdiv1,56518,9903


Which are the most common rows?

In [5]:
idataset.groupby(fields) \
        .size() \
        .rename("count") \
        .sort_values(ascending=False) \
        .head(10) \
        .reset_index()

Unnamed: 0,addr_city,addr_country,addr_country_code,addr_state,institution_orgname,institution_orgdiv1,count
0,,,,,,,537
1,São Paulo,Brazil,BR,SP,Universidade de São Paulo,Faculdade de Medicina,515
2,Santa Maria,Brazil,BR,RS,Universidade Federal de Santa Maria,,250
3,Belo Horizonte,Brazil,BR,MG,Universidade Federal de Minas Gerais,,234
4,Porto Alegre,Brazil,BR,RS,Universidade Federal do Rio Grande do Sul,,205
5,Santa Maria,Brazil,BR,RS,Universidade Federal de Santa Maria,Centro de Ciências Rurais,200
6,,Brasil,BR,,USP,Escola de Enfermagem,182
7,São Paulo,Brazil,BR,SP,Universidade de São Paulo,,169
8,São Paulo,Brazil,BR,SP,Universidade Federal de São Paulo,Escola Paulista de Medicina,164
9,São Paulo,Brazil,BR,SP,Universidade Federal de São Paulo,,155


USP (University of São Paulo) has $3$ entries in this top $10$.
There's no known number of typos, gross errors or inconsistencies for these fields,
but we can assume that any information
regarding normalizing this university name and its divisions' names
should be helpful.
On the other hand,
seeing the raw "first division" names that follow,
we know that:

* There is more than one spelling for the same school/college/institute;
* Sometimes their acronyms are used instead of their names;
* Usually they're in Brazilian Portuguese but sometimes they're not;
* Sometimes they're not school/college/institute names, but:
  * Internal department names;
  * NAP (research group) names;
  * Graduate program names.

In [6]:
idataset[idataset["institution_orgname"].isin(["Universidade de São Paulo", "USP"])] \
        ["institution_orgdiv1"] \
        .drop_duplicates()

50                                                                             Faculdade de Odontologia
267                                                                             Instituto Oceanográfico
348                                                                           Instituto de Oceanografia
390                                                                                                    
941                                                                                  Escola Politécnica
988                                                                  Escola de Engenharia de São Carlos
1134                                                     Escola Superior de Agricultura Luiz de Queiroz
1194                                                                           Departamento de Ecologia
1214                                                   Escola Superior de Agricultura “Luiz de Queiroz”
1283                                                            
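
Two of the raw names listed above differ only in the curly quotes around "Luiz de Queiroz". As a minimal sketch of how such spelling variants could be grouped, the hypothetical helper `fold_name` below (not part of the original notebook) strips accents, punctuation and case before comparing. It assumes these differences are never meaningful, and it does nothing for acronyms or translated names.

```python
import re
import unicodedata


def fold_name(name):
    """Hypothetical helper: reduce a raw name to a comparison key."""
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    # Remove punctuation (e.g. straight or curly quotes)
    no_punct = re.sub(r"[^\w\s]", "", no_accents)
    # Collapse whitespace and ignore case
    return " ".join(no_punct.split()).casefold()


variants = [
    "Escola Superior de Agricultura Luiz de Queiroz",
    "Escola Superior de Agricultura “Luiz de Queiroz”",
]
keys = {fold_name(v) for v in variants}  # Both variants share one key
```

A key like this is only useful for grouping candidates; the displayed name should still come from a trusted source.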

## Getting the name of institutes, colleges and schools at USP (University of São Paulo)

The Brazilian Portuguese names
for the USP institutes/colleges/schools
can be found
[here](http://www5.usp.br/institucional/escolas-faculdades-e-institutos/),
and the English names can be found
[here](http://www5.usp.br/english/institutional/escolas-faculdades-e-institutos-2/?lang=en).

In [7]:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
from lxml import etree

We can download the Brazilian Portuguese HTML:

In [8]:
pt_br_usp_orgdiv1_url = "http://www5.usp.br/institucional/escolas-faculdades-e-institutos/"
raw_html = urlopen(pt_br_usp_orgdiv1_url).read()
bsoup = BeautifulSoup(raw_html, "lxml")

Scraping the links with CSS is quite straightforward:

In [9]:
usp_orgdiv1_bsoup_info = pd.DataFrame([(el.text, el.attrs["href"])
                                       for el in bsoup.select(".post_content li a")],
                                       columns=["name", "url"])
usp_orgdiv1_bsoup_info.head()

Unnamed: 0,name,url
0,"Escola de Artes, Ciências e Humanidades (EACH)",http://each.uspnet.usp.br/
1,Escola de Comunicações e Artes (ECA),http://www.eca.usp.br/
2,Escola de Educação Física e Esporte (EEFE),http://www.eefe.usp.br/
3,Escola de Enfermagem (EE),http://www.ee.usp.br/
4,Escola Politécnica (Poli),http://www.poli.usp.br/


The cities aren't in these links
but in the preceding `<h2>` headings,
so we need to get those as well.
However, bs4 loses the elements' order when using the
`.post_content li a | .post_content h2` CSS selector,
so we need to use something else.

With `lxml` directly
and a custom dictionary to match the campus city
with the school/college/institute element index, we get:

In [10]:
tree = etree.fromstring(raw_html, parser=etree.HTMLParser())
prefix_cutter_regex = re.compile("(?:Campus de |Em )(.*)")
a_h2_els = tree.xpath("//*[contains(@class, 'post_content')]//li/a | "
                      "//*[contains(@class, 'post_content')]//h2")

class PrevDict(dict):
    """Dict that falls back to the value of the greatest key below a missing key."""
    def __missing__(self, key):
        return next(self[k] for k in sorted(self, reverse=True) if k < key)

usp_campi = PrevDict((idx, prefix_cutter_regex.match(el.findtext("*")).groups()[0])
                     for idx, el in enumerate(a_h2_els) if el.tag == "h2")
usp_campi

{0: 'São Paulo',
 30: 'Bauru',
 32: 'São Carlos',
 38: 'Lorena',
 40: 'Piracicaba',
 43: 'Pirassununga',
 45: 'Ribeirão Preto',
 54: 'São Sebastião'}
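
To make the lookup behaviour concrete, here is a small self-contained sketch of the same `PrevDict` idea on toy data. The three entries are copied from the output above, while the probed indices are arbitrary illustrations. A missing index resolves to the city of the closest heading index below it.

```python
class PrevDict(dict):
    """Dict that falls back to the value of the greatest key below a missing key."""
    def __missing__(self, key):
        return next(self[k] for k in sorted(self, reverse=True) if k < key)


# Heading positions taken from the usp_campi output; the rest is illustrative
campi = PrevDict({0: "São Paulo", 30: "Bauru", 32: "São Carlos"})

# Indices 1 and 29 fall under heading 0, 31 under 30, 33 under 32
lookups = {idx: campi[idx] for idx in (1, 29, 30, 31, 33)}
```

Note that `__missing__` does not store the resolved key, so the dictionary still holds only the heading indices. A key smaller than every stored key would exhaust the generator and raise `StopIteration`, which can't happen here as long as the first selected element is a heading.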

In [11]:
usp_orgdiv1 = pd.DataFrame([(usp_campi[idx], el.text, el.attrib["href"])
                            for idx, el in enumerate(a_h2_els)
                            if el.tag == "a"],
                           columns=["city", "name", "url"])
usp_orgdiv1.tail()

Unnamed: 0,city,name,url
43,Ribeirão Preto,"Faculdade de Economia, Administração e Contabilidade de Ribeirão Preto (FEARP)",http://www.fearp.usp.br/
44,Ribeirão Preto,"Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto (FFCLRP)",http://www.ffclrp.usp.br/
45,Ribeirão Preto,Faculdade de Medicina de Ribeirão Preto (FMRP),http://www.fmrp.usp.br/
46,Ribeirão Preto,Faculdade de Odontologia de Ribeirão Preto (FORP),http://www.forp.usp.br/
47,São Sebastião,Centro de Biologia Marinha (CEBIMar),http://www.usp.br/cbm


In this notebook, these scraped names are treated as the
valid, normalized and consistent values for `addr_city` and `institution_orgdiv1`.

In [12]:
ndataset = usp_orgdiv1 \
    .rename(columns={"name": "institution_orgdiv1",
                     "city": "addr_city"}) \
    .assign(
        addr_country="Brazil",
        addr_country_code="BR",
        addr_state="SP",
        institution_orgname="Universidade de São Paulo (USP)",
    ).reindex(fields, axis=1)
ndataset # Ground-truth normalized dataset

Unnamed: 0,addr_city,addr_country,addr_country_code,addr_state,institution_orgname,institution_orgdiv1
0,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),"Escola de Artes, Ciências e Humanidades (EACH)"
1,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Escola de Comunicações e Artes (ECA)
2,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Escola de Educação Física e Esporte (EEFE)
3,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Escola de Enfermagem (EE)
4,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Escola Politécnica (Poli)
5,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Faculdade de Arquitetura e Urbanismo (FAU)
6,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Faculdade de Ciências Farmacêuticas (FCF)
7,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Faculdade de Direito (FD)
8,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),"Faculdade de Economia, Administração e Contabilidade (FEA)"
9,São Paulo,Brazil,BR,SP,Universidade de São Paulo (USP),Faculdade de Educação (FE)


Rows that carry the same information as a row of this normalized dataset,
possibly with some fields missing,
are considered consistent.
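
One possible reading of this consistency rule, sketched in plain Python over dictionaries instead of DataFrames: a row is consistent when each of its filled-in fields equals the same field of some normalized row. The function name `is_consistent` and the sample rows are hypothetical, and an empty string is assumed to mean "missing", matching the `keep_default_na=False` loading used earlier.

```python
def is_consistent(row, normalized_rows):
    """Hypothetical check: every filled-in field matches some normalized row."""
    return any(
        all(value == "" or value == norm[field]
            for field, value in row.items())
        for norm in normalized_rows
    )


normalized_rows = [
    {"addr_city": "São Paulo", "addr_state": "SP",
     "institution_orgname": "Universidade de São Paulo (USP)",
     "institution_orgdiv1": "Escola de Enfermagem (EE)"},
]

# Same information with a missing city: consistent
partial = {"addr_city": "", "addr_state": "SP",
           "institution_orgname": "Universidade de São Paulo (USP)",
           "institution_orgdiv1": "Escola de Enfermagem (EE)"}

# Raw spelling that still needs normalization: not consistent
raw = {"addr_city": "", "addr_state": "",
       "institution_orgname": "USP",
       "institution_orgdiv1": "Escola de Enfermagem"}

results = (is_consistent(partial, normalized_rows),
           is_consistent(raw, normalized_rows))
```

Under this reading a completely empty row is trivially consistent with every normalized row, so such rows would need separate handling.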

This table misses historical changes in schools/colleges/institutes
and perhaps some nuances,
e.g. the music department at FFCLRP used to be the
[CMU-RP department at ECA](https://web.archive.org/web/20070614064907/http://www.musica.pcarp.usp.br/)
(i.e., it used to be a department of a school that happens to be in another city,
 $315$ km away from the remaining departments).