# Looking for articles from public state universities in SP

Three public universities
are managed by the São Paulo state government:

* USP, *Universidade de São Paulo*
* UNICAMP, *Universidade Estadual de Campinas*
* UNESP, *Universidade Estadual Paulista*

The goal here is to find all research articles
in the `scl` (SciELO Brazil) collection
from any of these universities.

In [1]:
import re

In [2]:
import regex
import pandas as pd

## Loading the data

The university of each author
should appear in the `documents_authors.csv` report.
When there's no link between an author and the given affiliations,
the institution should still appear in the `documents_affiliations.csv` report.
Let's open both files,
applying the field name normalization step
that can be found in <https://github.com/scieloorg/scielo20gt6/>:

In [3]:
def normalize_column_title(name):
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")
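
As a quick check of what the normalization does, the sketch below repeats the function and applies it to three headers. The raw header strings are hypothetical, made up for illustration; the real CSV headers may differ.

```python
import re


def normalize_column_title(name):
    # Same logic as the notebook cell above, repeated so this runs on its own.
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")


# Hypothetical raw headers (assumptions, not taken from the actual reports).
examples = [
    "document publishing year",
    "document publishing ID (PID SciELO)",
    "title is agricultural sciences",
]
normalized = [normalize_column_title(header) for header in examples]
print(normalized)
# ['doc_publishing_year', 'pid_scielo', 'is_agricultural_sciences']
```

When a header has a bracketed part, only the bracketed text is kept, which is why the second header collapses to `pid_scielo`.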

In [4]:
doc_affs = pd.read_csv("2018-11-10_scl/documents_affiliations.csv") \
             .rename(columns=normalize_column_title)

In [5]:
doc_authors = pd.read_csv("2018-11-10_scl/documents_authors.csv") \
                .rename(columns=normalize_column_title)

In [6]:
doc_affs.columns

Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns',
       'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences',
       'is_applied_social_sciences', 'is_biological_sciences',
       'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences',
       'is_human_sciences', 'is_linguistics_letters_arts',
       'is_multidisciplinary', 'title_current_status', 'pid_scielo',
       'doc_publishing_year', 'doc_type', 'doc_is_citable',
       'doc_affiliation_instituition', 'doc_affiliation_country',
       'doc_affiliation_country_iso_3166', 'doc_affiliation_state',
       'doc_affiliation_city'],
      dtype='object')

In [7]:
doc_authors.columns

Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns',
       'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences',
       'is_applied_social_sciences', 'is_biological_sciences',
       'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences',
       'is_human_sciences', 'is_linguistics_letters_arts',
       'is_multidisciplinary', 'title_current_status', 'pid_scielo',
       'doc_publishing_year', 'doc_type', 'doc_is_citable', 'doc_author',
       'doc_author_institution', 'doc_author_affiliation_country',
       'doc_author_affiliation_state', 'doc_author_affiliation_city'],
      dtype='object')

In [8]:
doc_affs.tail(2).T

| | 817809 | 817810 |
|---|---|---|
| extraction_date | 2018-11-10 | 2018-11-10 |
| study_unit | document | document |
| collection | scl | scl |
| issn_scielo | 2446-4740 | 2446-4740 |
| issns | 2446-4732;2446-4740 | 2446-4732;2446-4740 |
| title_scielo | Research on Biomedical Engineering | Research on Biomedical Engineering |
| title_thematic_areas | Engineering | Engineering |
| is_agricultural_sciences | 0 | 0 |
| is_applied_social_sciences | 0 | 0 |
| is_biological_sciences | 0 | 0 |


In [9]:
doc_authors.tail(2).T

| | 1437245 | 1437246 |
|---|---|---|
| extraction_date | 2018-11-10 | 2018-11-10 |
| study_unit | document | document |
| collection | scl | scl |
| issn_scielo | 2446-4740 | 2446-4740 |
| issns | 2446-4732;2446-4740 | 2446-4732;2446-4740 |
| title_scielo | Research on Biomedical Engineering | Research on Biomedical Engineering |
| title_thematic_areas | Engineering | Engineering |
| is_agricultural_sciences | 0 | 0 |
| is_applied_social_sciences | 0 | 0 |
| is_biological_sciences | 0 | 0 |


There is much more information in `doc_authors`.

Let's get all the institution names:

In [10]:
all_names = set(doc_affs["doc_affiliation_instituition"].dropna()) \
    .union(doc_authors["doc_author_institution"].dropna())
len(all_names)

60437
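
The same pattern on a toy example, as a minimal sketch: the two `Series` below are invented stand-ins for the institution columns, not real data.

```python
import pandas as pd

# Toy stand-ins for the two institution columns (values invented for illustration).
aff_names = pd.Series(["USP", None, "UNESP"])
author_names = pd.Series(["USP", "UNICAMP", None])

# dropna() removes the missing entries;
# set.union accepts any iterable, including a pandas Series.
toy_names = set(aff_names.dropna()).union(author_names.dropna())
print(sorted(toy_names))
```

Building a set removes duplicates across both files, so each distinct spelling is inspected only once.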

## USP

The following names have either the "`usp`" substring
or something like the university name:

In [11]:
usp_regex = r"usp|(university\s+of|univ(ersidade?)?(\s*de)?)\s+(s[aãÃ]o|s\.)\s+paulo"
names_with_usp = {name for name in all_names
                  if re.search(usp_regex, name.lower())}
names_with_usp

{' Departamento de Entomologia, Fitopatologia e Zoologia Agrícola, Escola Superior de Agricultura "Luiz de Queiroz", Universidade de São Paulo',
 ' Departamento de Genética e Biologia Evolutiva, Instituto de Biociências, Universidade de São Paulo',
 ' Departamento de Genética e Biologia Evolutiva, Universidade de São Paulo',
 ' Departamento de Genética, Escola Superior de Agricultura Luiz de Queiroz, Universidade de São Paulo',
 ' Laboratório de Ecotoxicologia Marinha, Instituto Oceanográfico, Universidade de São Paulo',
 ', Universidade de São Paulo',
 '-ESALQ/USP',
 '-FMUSP',
 '/USP',
 '/Universidade de São Paulo',
 'Ambulatório Geral do Instituto da Criança - HCFMUSP',
 'Anhanguera University of São Paulo',
 'Bandeirante University of São Paulo',
 'Campus USP',
 'Catholic University of São Paulo',
 'Clínicas da Faculdade de Medicina da Universidade de São Paulo (IOT-HCFMUSP)',
 'Comissão de Publicações da FEUSP',
 'Departamento de Ciência Política da USP',
 'Departamento de Farmacol

Only the following names aren't from the University of São Paulo.
Sometimes `São` appears without the tilde,
and some names appear more than once because of other neighboring words:

* `Anhanguera University of São Paulo`
* `Bandeirante University of São Paulo`
* `Catholic University of São Paulo`
* `Federal University of São Paulo`
* `Methodist University of São Paulo`
* `Pontifícia Universidade de São Paulo`
* `USPMS` (US Petroleum Marine Services)
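
To see why such names get through, here is a small sketch that runs the same pattern on a few sample strings. The samples are picked for illustration, and the last one is included only to show a non-match.

```python
import re

# Same pattern as usp_regex, written as a raw string.
usp_regex = r"usp|(university\s+of|univ(ersidade?)?(\s*de)?)\s+(s[aãÃ]o|s\.)\s+paulo"

samples = [
    "Campus USP",                         # the university itself
    "Federal University of São Paulo",    # matched by the university name branch
    "USPMS",                              # matched by the bare "usp" substring
    "Universidade Estadual de Campinas",  # not matched at all
]
matches = {name: bool(re.search(usp_regex, name.lower())) for name in samples}
for name, matched in matches.items():
    print(f"{matched!s:5} {name}")
```

The pattern has no way to tell which word precedes "University of São Paulo", so qualifiers such as "Federal" or "Catholic" have to be filtered out in a separate step.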

There's also `PUC/USP`, which isn't a valid name,
though it appears explicitly in the [article PDF](
  http://w.scielo.br/pdf/rlae/v7n2/13461.pdf
).
It could be a typo for `PUC/SP` (*Pontifícia Universidade Católica*/São Paulo),
but the author's Lattes CV/résumé tells us that she got her Ph.D. degree at UNIFESP,
not at USP or PUC/SP. Either way, we're just looking for USP entries for now,
so there's no need to worry any further about this entry.

In [12]:
doc_authors[doc_authors["pid_scielo"].isin(
    doc_authors[doc_authors["doc_author_institution"] == "PUC/USP"]["pid_scielo"]
)].T

| | 178925 | 178926 | 178927 | 178928 | 178929 |
|---|---|---|---|---|---|
| extraction_date | 2018-11-10 | 2018-11-10 | 2018-11-10 | 2018-11-10 | 2018-11-10 |
| study_unit | document | document | document | document | document |
| collection | scl | scl | scl | scl | scl |
| issn_scielo | 0104-1169 | 0104-1169 | 0104-1169 | 0104-1169 | 0104-1169 |
| issns | 1518-8345 | 1518-8345 | 1518-8345 | 1518-8345 | 1518-8345 |
| title_scielo | Revista Latino-Americana de Enfermagem | Revista Latino-Americana de Enfermagem | Revista Latino-Americana de Enfermagem | Revista Latino-Americana de Enfermagem | Revista Latino-Americana de Enfermagem |
| title_thematic_areas | Health Sciences | Health Sciences | Health Sciences | Health Sciences | Health Sciences |
| is_agricultural_sciences | 0 | 0 | 0 | 0 | 0 |
| is_applied_social_sciences | 0 | 0 | 0 | 0 | 0 |
| is_biological_sciences | 0 | 0 | 0 | 0 | 0 |


The only entry that remains strange is `USPI`,
which seems to be a typo for `USP`.

In [13]:
doc_authors[doc_authors["pid_scielo"].isin(
    doc_authors[doc_authors["doc_author_institution"] == "USPI"]["pid_scielo"]
)].T

| | 942280 | 942281 | 942282 |
|---|---|---|---|
| extraction_date | 2018-11-10 | 2018-11-10 | 2018-11-10 |
| study_unit | document | document | document |
| collection | scl | scl | scl |
| issn_scielo | 0103-166X | 0103-166X | 0103-166X |
| issns | 0103-166X;1982-0275 | 0103-166X;1982-0275 | 0103-166X;1982-0275 |
| title_scielo | Estudos de Psicologia (Campinas) | Estudos de Psicologia (Campinas) | Estudos de Psicologia (Campinas) |
| title_thematic_areas | Human Sciences | Human Sciences | Human Sciences |
| is_agricultural_sciences | 0 | 0 | 0 |
| is_applied_social_sciences | 0 | 0 | 0 |
| is_biological_sciences | 0 | 0 | 0 |
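
Both lookups for odd entries use the same two-step selection: find the documents where the odd name shows up, then list every author row of those documents to get the context. A minimal sketch on a toy table follows; the PIDs and author names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for doc_authors (values invented for illustration).
toy_authors = pd.DataFrame({
    "pid_scielo": ["A1", "A1", "B2"],
    "doc_author": ["First Author", "Second Author", "Third Author"],
    "doc_author_institution": ["USPI", "UNICAMP", "USP"],
})

# Step 1: PIDs of the documents where the odd name appears.
odd_pids = toy_authors.loc[
    toy_authors["doc_author_institution"] == "USPI", "pid_scielo"
]
# Step 2: every author row that belongs to those documents.
context_rows = toy_authors[toy_authors["pid_scielo"].isin(odd_pids)]
print(context_rows)
```

Seeing the co-authors' institutions side by side often makes it easier to tell whether an odd name is a typo.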


The university name is sometimes misspelled,
so a fuzzy search helps to find more entries,
but we should be careful because similar names belong to other institutions.

In [14]:
usp_fuzzy_regex = (
    r"((university\s+of|universidade?(\s*de)?)\s+(s[aãÃ]o|s\.)\s+paulo"
    r"|(s[aãÃ]o|s\.)\s+paulo university"
    r"){e<=3}"
)
other_names_with_usp = {name for name in all_names - names_with_usp
                        if regex.search(usp_fuzzy_regex, name.lower())}
other_names_with_usp

{'City of Sao Paulo University',
 'City of São Paulo University',
 'Clinical Hospital of São Paulo University',
 'De Paul University',
 'Federal University of S. Carlos',
 'Federal University of Sao Carlos',
 'Federal University of São Carlos',
 'Federal University of São Carlos – UFSCar',
 'Federal University of São Carlos,Department of Physiological Sciences',
 'Federal University of São Carlos,Laboratory of Neurosciences, Department of Physiotherapy',
 'Nove de Julho University, São Paulo',
 'Sao Camilo University Center',
 'Sao Paulo University',
 'São Camilo University Center',
 'São Paulo University',
 'São Paulo University Medical',
 'São Paulo University Medical School',
 'São Paulo University School of Medicine',
 'São Paulo University of Pharmaceutical Sciences',
 'São Paulo Univesity',
 'Univeridade de São Paulo',
 'Univers, de S. Paulo',
 'Universiade de São Paulo',
 'Universidad San Pablo CEU',
 'Universidad de San Carlos',
 'Universidad de San Carlos de Guatemala',
 'Univ

Some names clearly refer to other universities,
like *Universidad Católica San Pablo* (CEU).
Names with *Camilo*, *Carlos*, *Marcos* or *Pablo* in place of *Paulo* aren't USP.
Again, USP isn't a national/federal university,
and *City of São Paulo University* reads like UNICID, not USP.
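
The `{e<=3}` suffix in `usp_fuzzy_regex` tells the `regex` module to accept up to three errors (insertions, deletions or substitutions). To give a feeling for that budget, the sketch below counts edits with a plain-Python Levenshtein distance. This is only an illustration under a simplifying assumption (whole strings compared against one canonical spelling); it is not how `regex` performs its fuzzy search.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


target = "universidade de são paulo"
typo_distance = edit_distance("univeridade de são paulo", target)
other_distance = edit_distance("universidad de san carlos", target)
print(typo_distance, other_distance)
```

The first name is a single deletion away from the canonical spelling. The second is more than three edits away as a whole string, yet it still shows up in the fuzzy search output: the pattern has optional parts and alternatives (for instance, `s[aãÃ]o` also accepts `sao`), and a search may match only part of a name. That permissiveness is why the results need a manual filter.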

In [15]:
def has_none_of(text, avoidance_list):
    lower_text = text.lower()
    return all((el not in lower_text) for el in avoidance_list)

In [16]:
not_paulo = ["city", "federal", "camilo", "carlos", "marcos", "pablo", "julho", "de paul"]
typoed_usp = {name for name in other_names_with_usp if has_none_of(name, not_paulo)}
typoed_usp

{'Clinical Hospital of São Paulo University',
 'Sao Paulo University',
 'São Paulo University',
 'São Paulo University Medical',
 'São Paulo University Medical School',
 'São Paulo University School of Medicine',
 'São Paulo University of Pharmaceutical Sciences',
 'São Paulo Univesity',
 'Univeridade de São Paulo',
 'Univers, de S. Paulo',
 'Universiade de São Paulo',
 'Universidada de São Paulo',
 'Universidade d São Paulo',
 'Universidade da São Paulo',
 'Universidade de So Paulo',
 'Universidade de SÃ£o Paulo',
 'Universidade de Sâo Paulo',
 'Universidade de São Paul',
 'Universidade de Säo Paulo',
 'Universidade of São Paulo',
 'Universidsade de São Paulo',
 'University São Paulo',
 'University of San Paolo',
 'University os São Paulo',
 'Univesidade de São Paulo',
 'niversidade de São Paulo'}
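
A small sketch of how the filter behaves, using four names taken from the outputs above. Note that the avoidance terms must be lowercase, since the text is lowercased before the substring test.

```python
def has_none_of(text, avoidance_list):
    # True only if no element of avoidance_list occurs in the lowercased text.
    lower_text = text.lower()
    return all((el not in lower_text) for el in avoidance_list)


not_paulo = ["city", "federal", "camilo", "carlos", "marcos", "pablo", "julho", "de paul"]
candidates = [
    "São Paulo Univesity",           # typo, kept
    "City of São Paulo University",  # dropped because of "city"
    "De Paul University",            # dropped because of "de paul"
    "Universidade de São Paul",      # kept: "de são paul" doesn't contain "de paul"
]
kept = [name for name in candidates if has_none_of(name, not_paulo)]
print(kept)
```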

In [17]:
not_usp = ["state", "anhanguera", "bandeirante", "catholic",
           "federal", "methodist", "pontifícia", "uspms"]
usp_names = {
    name for name in names_with_usp
         if has_none_of(name, not_usp)
}.union(typoed_usp)
len(usp_names)

219
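
With the set of accepted names in hand, one possible next step is to select the documents that have at least one author from the university. The sketch below is an assumption about how that could be done, not a step shown in this notebook; the table and the names are toy values invented for illustration.

```python
import pandas as pd

# Toy stand-in for doc_authors (PIDs and names invented for illustration).
toy_authors = pd.DataFrame({
    "pid_scielo": ["A1", "A1", "B2", "C3"],
    "doc_author_institution": ["USP", "UNICAMP", "Univesidade de São Paulo", None],
})
toy_usp_names = {"USP", "Univesidade de São Paulo"}

# Keep the rows whose institution is one of the accepted names,
# then collect the distinct document identifiers.
is_usp = toy_authors["doc_author_institution"].isin(toy_usp_names)
usp_pids = set(toy_authors.loc[is_usp, "pid_scielo"])
print(sorted(usp_pids))
```

Matching on exact names from a reviewed set avoids re-running the regular expressions on every row, and missing institutions simply don't match.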

## UNESP

In [18]:
unesp_regex = (
    r"unesp"
    r"|universidade?\s*esta(du|t)al\s*paulista"
    r"|j[uúÚ]lio.*mesquita\s+filho"
    r"|(s[aãÃ]o|s\.)\s*paulo\s*state\s*university"
    r"|state.*(s[aãÃ]o|s\.)\s*paulo\s*university"
)
names_with_unesp = {name for name in all_names - usp_names
                    if re.search(unesp_regex, name.lower())}
names_with_unesp

{' Núcleo de Estudos em Poluição e Ecotoxicologia Aquática, Universidade Estadual Paulista',
 'Bióloga (UNESP/Botucatu)',
 'CEVAP-UNESP',
 'Campus de Botucatu - UNESP,Departamento de Doenças Tropicais e Diagnóstico por Imagem - Faculdade de Medicina de Botucatu',
 'Doutor wsaad@fmb.unesp.br',
 'FCA-UNESP',
 'FCA/UNESP Fazenda Experimental Lajeado',
 'FCAV-Unesp',
 'FCAV/UNESP',
 'FMUNESP',
 'Faculdade de Ciências e Tecnologia, UNESP',
 'Faculdade de Medicina - UNESP',
 'Faculdade de Odontologia de Araraquara - Unesp',
 'IB-UNESP',
 'IBUNESP',
 'IIUniversidade Estadual Paulista',
 'Light SESA/UNESP',
 'Paulista State University ?Julio de Mesquita Filho?',
 'Sao Paulo State University',
 'Sao Paulo State University Júlio de Mesquita Filho',
 'School of Dentistry, São Paulo State University',
 'School of São Paulo State University Julio de Mesquita Filho',
 'State University of São Paulo (Júlio de Mesquita Filho)',
 'State University of São Paulo - “Julio de Mesquita Filho”',
 'São Paulo 

Here, the only names that don't refer to UNESP are the three UNESPAR (*Universidade Estadual do Paraná*) entries.
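
The UNESPAR matches come from the bare `unesp` substring. This notebook removes them afterwards with a separate `"unespar"` test; as an alternative sketch, a negative lookahead could reject them inside the pattern itself. The sample names below are chosen for illustration, and the UNESPAR one is invented.

```python
import re

samples = [
    "FCAV/UNESP",
    "FMUNESP",
    "UNESPAR - Campus de Paranavaí",  # invented example of a UNESPAR entry
    "Unesp, Araraquara",
]

# Plain substring test, as in unesp_regex.
plain = [name for name in samples if re.search(r"unesp", name.lower())]
# Alternative: the lookahead rejects "unesp" when "ar" follows immediately.
strict = [name for name in samples if re.search(r"unesp(?!ar)", name.lower())]
print(plain)
print(strict)
```

The lookahead would also reject a name where `UNESP` is glued to a word starting with "ar", so the explicit post-filter may still be the safer choice.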

In [19]:
unesp_fuzzy_regex = (
    r"(universidade?\s*esta(du|t)al\s*paulista"
    r"|j[uúÚ]lio.*mesquita\s+filho"
    r"|(s[aãÃ]o|s\.)\s*paulo\s*state\s*university"
    r"|state.*(s[aãÃ]o|s\.)\s*paulo\s*university"
    r"){e<=3}"
)
other_names_with_unesp = {name for name in all_names - names_with_unesp - usp_names
                          if regex.search(unesp_fuzzy_regex, name.lower())}
other_names_with_unesp

{'Univerisidade Estadual Paulista',
 'Universdiade Estadual Paulista',
 'Universidad Estatal a Distancia',
 'Universidad Estatal a Distancia de Costa Rica',
 'Universidade Estadual Paupsta',
 'Universidade Estadula Paulista',
 'Universidade Estatal a Distancia, Sistema de estúdios de Posgrado',
 'Universidade Julio de Mequita Filho',
 'Univesidade Estadual Paulista',
 'Unuiversidade Estadual Paulista',
 'niversidade Estadual Paulista'}

UNESP has nothing to do with the entries containing the word "`Distancia`".

In [20]:
typoed_unesp = {name for name in other_names_with_unesp if "distancia" not in name.lower()}
typoed_unesp

{'Univerisidade Estadual Paulista',
 'Universdiade Estadual Paulista',
 'Universidade Estadual Paupsta',
 'Universidade Estadula Paulista',
 'Universidade Julio de Mequita Filho',
 'Univesidade Estadual Paulista',
 'Unuiversidade Estadual Paulista',
 'niversidade Estadual Paulista'}

In [21]:
unesp_names = {
    name for name in names_with_unesp
         if "unespar" not in name.lower()
}.union(typoed_unesp)
len(unesp_names)

147