# Looking for articles from EMBRAPA and public state universities in SP (SciELO Network)

As seen in a previous notebook, there are $3$ public universities
that are managed by the São Paulo state government:

* USP, *Universidade de São Paulo*
* UNICAMP, *Universidade Estadual de Campinas*
* UNESP, *Universidade Estadual Paulista*

The goal here is to find all research articles
in all collections in the SciELO network
coming from any of these $3$ universities
or from [EMBRAPA](https://www.embrapa.br/en/international).
Though this problem was addressed in a previous notebook,
we're now including EMBRAPA, and the dataset isn't the same.
Using the whole network instead of the SciELO Brazil collection
brings new challenges regarding:

* The *USP* acronym appears elsewhere (homonyms);
* *Saint Paul* (translation of *São Paulo*)
  is part of the name of [another university](https://ustpaul.ca/),
  in Canada;
* There are many more alternative spellings;
* Several institution names are written in languages other than
  Portuguese, English and Spanish.

Some parts of this notebook resemble the previous approach,
but evaluating just the institution name doesn't suffice for this new dataset.

In [1]:
from functools import partial
import re

In [2]:
import pandas as pd
import regex
from unidecode import unidecode

## Part 1: Loading the data

The information about each author's institution
should be in the `documents_authors.csv` report,
but if there's no link between the authors and the given affiliations,
it should still be in the `documents_affiliations.csv` report.
Let's open both files
using the field name normalization step
that can be found in <https://github.com/scieloorg/scielo20gt6/>:

In [3]:
def normalize_column_title(name):
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
        "instituition": "institution",
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")
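
As a quick illustration of what this normalization does, the sketch below applies the same function to a few raw column titles. The titles in `raw_titles` are assumptions made up for this example (the actual CSV headers aren't shown in this notebook); only the resulting names are the ones used in the next cells.

```python
import re

def normalize_column_title(name):
    # Same logic as the cell above, repeated so this sketch runs on its own
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
        "instituition": "institution",
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")

# Hypothetical raw titles (assumed for illustration only)
raw_titles = [
    "document publishing ID (PID SciELO)",
    "document affiliation instituition",
    "document author affiliation country",
]
normalized_titles = [normalize_column_title(title) for title in raw_titles]
print(normalized_titles)
```

When a title has a bracketed part, only the bracketed text is kept, and the `replacements` dictionary also fixes the misspelled word "instituition".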

We'll use the same snapshot data from the previous experiments,
but including all data in the SciELO network,
not just SciELO Brazil.

In [4]:
reports_version = "2018-12-10" # Directory name

In [5]:
doc_affs = pd.read_csv(reports_version + "/documents_affiliations.csv") \
             .rename(columns=normalize_column_title)
doc_authors = pd.read_csv(reports_version + "/documents_authors.csv") \
                .rename(columns=normalize_column_title)

Let's join these to get a smaller dataset,
removing duplicates
and entries without an explicit institution.

In [6]:
def renormalize_column_title(name):
    return name.replace("_scielo", "").split("_")[-1]
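
To see why this second renaming helps, here's a small sketch applying it to some of the column names selected in the next cell. Both selections end up with the same short names, so the concatenation can align them.

```python
def renormalize_column_title(name):
    # Drop the "_scielo" suffix, then keep only the last "_"-separated word
    return name.replace("_scielo", "").split("_")[-1]

columns = [
    "collection",
    "pid_scielo",
    "doc_affiliation_institution",
    "doc_author_affiliation_country",
]
short_names = [renormalize_column_title(col) for col in columns]
print(short_names)
```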

In [7]:
dataset = pd.concat([
    doc_affs[[
        "collection",
        "pid_scielo",
        "doc_affiliation_institution",
        "doc_affiliation_country",
        "doc_affiliation_state",
        "doc_affiliation_city",
    ]].rename(columns=renormalize_column_title),
    doc_authors[[
        "collection",
        "pid_scielo",
        "doc_author_institution",
        "doc_author_affiliation_country",
        "doc_author_affiliation_state",
        "doc_author_affiliation_city",
    ]].rename(columns=renormalize_column_title),
], sort=False).dropna(subset=["institution"]).drop_duplicates()
print(dataset.shape)
dataset.head()

(1367987, 6)


| | collection | pid | institution | country | state | city |
|---|---|---|---|---|---|---|
| 0 | scl | S0100-879X1998000800006 | University of Gorakhpur | | | |
| 1 | scl | S0100-879X1998000800011 | Universidade Estadual de Londrina | | | |
| 2 | scl | S0100-879X1998000800005 | Southern Sea Biology Institute | | | |
| 3 | scl | S0100-879X1998000800005 | Carleton University | | | |
| 4 | scl | S0100-879X1998000800005 | Ivano-Frankivsk State Medical Academy | | | |


How many *non-empty* entries are there in each field?

In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1367987 entries, 0 to 1738918
Data columns (total 6 columns):
collection     1367987 non-null object
pid            1367987 non-null object
institution    1367987 non-null object
country        1134869 non-null object
state          604598 non-null object
city           888309 non-null object
dtypes: object(6)
memory usage: 73.1+ MB


How many *distinct* values are there in each field?

In [9]:
dataset.apply(lambda x: len(x.unique()))

collection         20
pid            732991
institution    168568
country          1542
state            4231
city            19783
dtype: int64
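
Keep in mind that `len(x.unique())` counts the missing value as one more distinct value, which matters for columns with unfilled entries such as `country`. A minimal sketch with made-up values, comparing it with `nunique`:

```python
import pandas as pd

toy = pd.Series(["Brazil", None, "Brazil", "Chile"])  # made-up values
with_missing = len(toy.unique())    # the missing value counts as a value
without_missing = toy.nunique()     # missing values are skipped by default
print(with_missing, without_missing)
```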

Some institution names are just garbage or placeholder content,
and others are too short (at most two word characters)
to be identified as USP/UNESP/Unicamp/EMBRAPA
due to the lack of information.

In [10]:
raw_institution_names = dataset["institution"].drop_duplicates()
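small_institution_names = raw_institution_names[
    # Raw string and explicit regex=True: newer pandas versions
    # treat the pattern as a literal string by default
    raw_institution_names.str.replace(r"\W", "", regex=True).str.len() <= 2
].unique()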
small_institution_names.sort()
small_institution_names

array(['*', '-', '.', '15', '3M', 'AC', 'BA', 'Bt', 'CA', 'CE', 'CI',
       'CP', 'DS', 'DZ', 'EE', 'ES', 'EU', 'F M', 'FC', 'FM', 'FO', 'FS',
       'GV', 'H. C.', 'H.C.', 'HC', 'HI', 'HP', 'HR', 'I', 'I & D', 'I-A',
       'IA', 'IB', 'IC', 'IF', 'IG', 'IP', 'IT', 'IZ', 'JB', 'JK', 'K.U',
       'K.U.', 'KU', 'LM', 'M.D.', 'M.S', 'MA', 'MC', 'MD', 'ME', 'MF',
       'MG', 'ML', 'MP', 'MS', 'MT', 'MZ', 'O & S', 'O&S', 'PA', 'PE',
       'PR', 'PT', 'QN', 'R & D', 'RJ', 'RN', 'RN.', 'RS', 'RT', 'S.L.',
       'S.S.', 'SC', 'SN', 'SP', 'SU', 'T.U', 'TM', 'U.A', 'U.A.', 'U.C',
       'U.C.', 'U.F.', 'U.H', 'U.J.', 'UA', 'UB', 'UC', 'UD', 'UE', 'UF',
       'UG', 'UH', 'UL', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'UU', 'UV',
       'UZ', 'V', 'VO', 'VS', 'VU', 'WP', 'a', 'aa', 'bu', 'e', 'nd',
       's.a', 'v', 'xx'], dtype=object)
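
The filter above can be sketched in plain Python as follows. Most samples come from the output above; `UNESP` and `U.S.P.` are just assumed inputs to show names that pass the filter.

```python
import re

def is_too_small(name):
    # Strip every non-word character, then check what is left
    return len(re.sub(r"\W", "", name)) <= 2

samples = ["H. C.", "I & D", "s.a", "*", "UNESP", "U.S.P."]
flags = {name: is_too_small(name) for name in samples}
print(flags)
```

Punctuation and spaces don't count toward the length, so `H. C.` is as small as `HC`, while `U.S.P.` still has three letters.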

## Part 2: Country normalization

*Almost* all entries that matter are from Brazil.
That said, most affiliations that are obviously not from Brazil
can be discarded, and we should just replace the `country` field
with two fields:
an `is_brazil` and
an `mb_brazil` (standing for "might be Brazil"),
both holding either `True` or `False`.
Unknown entries should be marked as "might be Brazil".
However, the `country` field isn't normalized:

In [11]:
dataset["country"].dropna().drop_duplicates()

756                               Brazil
3222                              Brasil
3227                            Alemanha
3244                                 USA
3394                              France
3408                                  UK
3409                              Turkey
3415                           Indonesia
3416                           Australia
3424                             Belgium
3426                            Portugal
3429                                 U.K
3431                            Scotland
3432                             England
3433                               Kenya
3434                        Burkina Faso
3437                             Austria
3444                                Togo
3445                      United Kingdom
3447                               Italy
3450                           Venezuela
3461                               India
3478                           Argentina
3498                              Israel
3512            

Some entries have an e-mail instead of the country:

In [12]:
email_as_country_df = dataset[dataset["country"].fillna("").str.contains("@")]
email_as_country_df

| | collection | pid | institution | country | state | city |
|---|---|---|---|---|---|---|
| 203191 | scl | S0101-31572008000200004 | University of Lagos | muyiwaking@yahoo.com | Lagos | Akoka |
| 212359 | scl | S1516-35982007001000013 | UNESP | rareis@fcav.unesp.br | SP | Jaboticabal |
| 282518 | scl | S0006-87052009000300021 | Universidade Federal de Mato Grosso | emilioaz@ufmt.br | MT | Cuiabá |
| 347198 | scl | S0006-87052010000500007 | Centro de Pesquisa e Desenvolvimento de Solos ... | sidney@iac.sp.gov.br | SP | Campinas |
| 733339 | scl | S0101-31572017000200381 | Instituto de Pesquisa Econômica Aplicada | marcos.cintra@ipea.gov.br | | |
| 894811 | bol | S1562-38232012000400012 | Argentina | jsalvador@citedef.gob.ar | | |
| 897188 | bol | S1012-29662016000200009 | UMSS | aleantezana22@hotmail.com | | Cochabamba |
| 956770 | chl | S0717-95532006000100007 | Universidad San Sebastián | mariaelenaneira@hotmail.com | | Concepción |
| 989186 | chl | S0717-95532016000200008 | UESB | faby_jq@hotmail.com | BA | Jequié |
| 1008190 | chl | S0718-58392009000500001 | Universidad de Buenos Aires | afcirelli@fvet.uba.ar | | Buenos Aires |


We can keep just the part after the last dot (the top-level domain),
as in most entries that's a country code:

In [13]:
email_as_country_df["country"].str.replace(r".*@.*\.", "", regex=True)

203191     com
212359      br
282518      br
347198      br
733339      br
894811      ar
897188     com
956770     com
989186     com
1008190     ar
1107981     es
1123269     co
1132014     co
1152738     co
1209263    com
1474020    com
1527449    com
1650447     br
1681807     nz
1682902     za
1715604     es
1716731     es
1738654     cl
Name: country, dtype: object
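
The pattern above relies on greedy matching. Here's a minimal sketch with the standard `re` module, using two addresses from the table above and a regular country name; `country_from_email` is a hypothetical helper name.

```python
import re

def country_from_email(value):
    # ".*@.*\." is greedy: it consumes everything up to the LAST dot
    # after the "@", so only the top-level domain remains
    return re.sub(r".*@.*\.", "", value)

values = ["sidney@iac.sp.gov.br", "muyiwaking@yahoo.com", "Brazil"]
suffixes = [country_from_email(value) for value in values]
print(suffixes)
```

Values without an `@` are left untouched, and generic domains such as `com` carry no country information.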

Then we can apply a fuzzy regex to find which entries are from Brazil:

In [14]:
is_br_re_search = partial(
    regex.search,
    "^br$|^(bra[sz]il){e<=2}$|(bra[sz]il){e<=1}"
)
countries_df = pd.DataFrame(
    dataset
    [["country"]]
    .dropna()
    .drop_duplicates()
    .assign(
        country_pre=lambda df: df["country"].apply(unidecode)
                                            .str.lower()
                                            .str.replace(r"\W|.*@.*\.", "",
                                                         regex=True),
    ).assign(
        is_br=lambda df: df["country_pre"].apply(is_br_re_search).astype(bool),
    )
)
brazil_names = countries_df[countries_df["is_br"]]["country"].values.tolist()

In [15]:
pd.DataFrame([brazil_names[0::3],
              brazil_names[1::3],
              brazil_names[2::3]]).fillna("").T

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Brazil | Brasil | Brazi1 |
| 1 | BRAZIL | Brazi | Brasi |
| 2 | BRASIL | Br | Brasília |
| 3 | Brzail | Brésil | Bra sil |
| 4 | Barzil | Brazill | - Brasil |
| 5 | Brasil. | Brasi l | rareis@fcav.unesp.br |
| 6 | BR | Brasíl | Brasill |
| 7 | - BRASIL | - BR | -BR |
| 8 | emilioaz@ufmt.br | sidney@iac.sp.gov.br | SP, Brazil |
| 9 | Brazil. | Brazile | Brasile |


Other names that contain `BR` as a substring aren't *Brazil*:

In [16]:
countries_df[countries_df["country_pre"].str.contains("br") &
             ~countries_df["is_br"]]["country"]

11520                                          Great Britain
180116                                          Grã-Bretanha
290163                                   British West Indies
293668                                            Rio Branco
375598                                  Syrian Arab Republic
421894                  Estado Libre Asociado de Puerto Rico
658240                                          Grã-bretanha
660211     United Kingdom of Great Britain and Northern I...
792721     United Kingdom of Great Britain na Northern Ir...
810638                                                Brunei
859423                                          Gran Bretaña
1218903                                    Brunei Darussalam
1248918                                              Ginebra
1391946                               Arab Republic of Egypt
1719635                                            Bruxelles
Name: country, dtype: object
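
To make the fuzzy matching more concrete, the sketch below runs the same pattern on a few names from the outputs above, typed by hand in their preprocessed form (ASCII, lowercase, no non-word characters). In the third-party `regex` package, `{e<=2}` allows up to two errors (insertions, deletions or substitutions) in the preceding group.

```python
from functools import partial

import regex  # third-party "regex" package, as imported earlier

is_br_re_search = partial(
    regex.search,
    "^br$|^(bra[sz]il){e<=2}$|(bra[sz]il){e<=1}"
)

# Already preprocessed inputs, typed by hand for this example
matched = ["br", "brasil", "brazi1", "brzail", "bresil", "brasilia"]
not_matched = ["greatbritain", "brunei", "riobranco"]

results = {name: bool(is_br_re_search(name))
           for name in matched + not_matched}
print(results)
```

The whole-string alternative tolerates two errors (as in `brzail`), whereas the substring alternative tolerates just one, which seems to be what keeps names like `brunei` out.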

The remaining names are either:

- Another country;
- A mistake (e.g. a Brazilian state);
- Just some noise (actually unfilled data).

In [17]:
country_counts = dataset["country"].fillna("").value_counts()
empty_countries = countries_df[countries_df["country_pre"].str.len() <= 1] \
                              ["country"].tolist() + [""]
country_counts[country_counts.index.isin(empty_countries)]

     233118
-         6
(         2
U         1
z         1
E         1
a         1
.         1
Name: country, dtype: int64

In [18]:
br_states_in_country = countries_df[countries_df["country_pre"] \
    .isin(["sp", "rj", "go", "df", "pb", "ce", "rr",
           "minasgerais", "saopaulo", "riodejaneiro", "espiritosanto",
           "goias", "matogrosso", "matogrossodosul", "distritofederal",
           "parana", "riograndedosul", "santacatarina",
           "riograndedonorte", "sergipe", "bahia", "pernambuco",
           "piaui", "paraiba", "ceara", "maranhao", "alagoas",
           "amazonas", "acre", "roraima", "rondonia",
           "amapa", "tocantins", "para",
    ])]["country"].tolist()
country_counts[country_counts.index.isin(br_states_in_country)]

s.p                    93
SP                     14
São Paulo               7
Bahia                   7
Ceará                   6
RJ                      5
Minas Gerais            5
Paraná                  4
Rio de Janeiro          4
PB                      3
CE                      3
Amazonas                3
Pernambuco              3
Distrito Federal        2
GO                      2
D. F                    1
Rio Grande do Norte     1
.sp                     1
Sergipe                 1
Piauí                   1
Pará                    1
DF                      1
Goiás                   1
Paraíba                 1
Name: country, dtype: int64

It's pretty hard to find these other names.
`SP`, `RJ`, `GO`, `DF`, `PB`, `CE` and `RR`
aren't ISO 3166-1 alpha-2 codes,
so they're probably just Brazilian state abbreviations,
though they might be an acronym for a country name
in another language.

In [19]:
might_be_br = countries_df[countries_df["country_pre"] \
    .isin(["mg", # Madagascar or Minas Gerais?
           "es", # Spain or Espírito Santo?
           "mt", # Malta or Mato Grosso?
           "ms", # Montserrat or Mato Grosso do Sul?
           "pr", # Puerto Rico or Paraná?
           "rs", # Serbia or Rio Grande do Sul?
           "sc", # Seychelles or Santa Catarina?
           "rn", # Niger or Rio Grande do Norte?
           "se", # Sweden or Sergipe?
           "ba", # Bosnia and Herzegovina or Bahia?
           "pe", # Peru or Pernambuco?
           "pi", # Philippines or Piauí?
           "ma", # Morocco or Maranhão?
           "al", # Albania or Alagoas?
           "am", # Armenia or Amazonas?
           "ac", # Ascension Island or Acre?
           "ro", # Romania or Rondônia?
           "ap", # African Regional Industrial Property Organization or Amapá?
           "to", # Tonga or Tocantins?
           "pa", # Panama or Pará?
    ]) & (countries_df["country"].str.len() <= 5)]["country"].tolist()
country_counts[country_counts.index.isin(might_be_br)]

PR      11
RS      10
ES       7
MG       4
PE       4
BA       3
MA       2
PA       2
AL       2
P.R.     2
SC       1
MT       1
Name: country, dtype: int64
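
The two groups of two-letter codes used in the cells above can be summarized as a small lookup. This is only a sketch: `classify_two_letter_code` is a hypothetical helper, not something used elsewhere in this notebook.

```python
# Codes copied from the two cells above (lowercase, as in `country_pre`)
state_only = {"sp", "rj", "go", "df", "pb", "ce", "rr"}
ambiguous = {"mg", "es", "mt", "ms", "pr", "rs", "sc", "rn", "se", "ba",
             "pe", "pi", "ma", "al", "am", "ac", "ro", "ap", "to", "pa"}

def classify_two_letter_code(code):
    """Hypothetical helper: label a two-letter `country` value."""
    code = code.lower()
    if code in state_only:
        return "brazilian state"
    if code in ambiguous:
        return "might be brazil"
    return "other"

labels = {code: classify_two_letter_code(code) for code in ["SP", "PR", "US"]}
print(labels)
print(len(state_only | ambiguous))  # all Brazilian federative units
```

Together, the two sets cover the 26 Brazilian states plus the Federal District.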

Can the country be in another field?

In [20]:
dataset[
    ~dataset["country"].isin(brazil_names) &
    ~dataset["country"].isna() &
    dataset["state"].isin(brazil_names)
]

| | collection | pid | institution | country | state | city |
|---|---|---|---|---|---|---|
| 732797 | scl | S2237-101X2017000200381 | Jardim Botânico do Rio de Janeiro | E | Brasil | Rio de Janeiro |
| 831244 | arg | S1850-15322009000300007 | Universitäts Klinikum Freiburg | Germany | Br | Freiburg |
| 1522434 | rve | S0104-35522012000200018 | Universidade Federal do Rio de Janeiro | E-mail | Brasil | Rio de Janeiro |


Yes. So let's find Brazil in all text fields.

In [21]:
all_texts = (dataset
    .drop(columns=["collection", "pid"])
    .fillna("")
    .applymap(unidecode)
    .apply(" ".join, axis=1) # One string per row, joining all text fields
    .str.lower()
    .str.replace(r"\W", " ", regex=True)
)

In [22]:
br_in_all_texts = all_texts.apply(is_br_re_search).apply(bool)
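
Here's a minimal sketch of that row-wise concatenation on two rows copied from earlier outputs. It skips the `unidecode` step to avoid the extra dependency, so accented characters are kept.

```python
import pandas as pd

# Two rows copied from outputs shown earlier in this notebook
toy = pd.DataFrame({
    "institution": ["Jardim Botânico do Rio de Janeiro",
                    "University of Gorakhpur"],
    "country": ["E", None],
    "state": ["Brasil", None],
    "city": ["Rio de Janeiro", None],
})

toy_texts = (toy
    .fillna("")                    # missing fields become empty strings
    .apply(" ".join, axis=1)       # one string per row
    .str.lower()
    .str.replace(r"\W", " ", regex=True)
)
print(toy_texts.tolist())
```

The first row now exposes `brasil` as a standalone word, even though it was stored in the `state` field.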

In [23]:
pd.DataFrame(
    dataset[
        ~dataset["country"].isin(brazil_names) &
        ~dataset["country"].isna() &
        br_in_all_texts
    ]
    .fillna("")
    .groupby(["country", "state", "city", "institution"])
    .size()
    .rename("count")
)

| country | state | city | institution | count |
|---|---|---|---|---|
| 70610-200 | DF | Brasília | Polícia Civil do Distrito Federal | 1 |
| Amazonas | | Manaus | Universidade Luterana do Brasil | 1 |
| Argentina | | | Clínica Basilea | 2 |
| Argentina | | C.A.B.A | Clínica Basilea | 2 |
| Argentina | | Ciudad de Buenos Aires | Clínica Basilea | 1 |
| Buenos Aires | PI | Teresina | Empresa Brasileira de Pesquisa Agropecuária | 1 |
| Canada | | Toronto | University of Brasilia | 1 |
| Colombia | | Bogotá | Universidad de Brasília | 1 |
| Colombia | | Brasília | Organização Pan-Americana da Saúde | 4 |
| Colombia | D.F. | Brasilia | Organização Pan-Americana da Saúde | 1 |

Several of the above entries aren't from Brazil.
On the other hand, it won't hurt
to say all these entries "might be Brazil",
since the number of entries is small.
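
Putting the pieces of this part together, the two flags could be derived roughly as sketched below. This is just one possible way to combine the lists built above, under the assumption that unknown, ambiguous and "found elsewhere" entries all count as "might be Brazil". The toy rows and the short stand-in lists are made up for the example.

```python
import pandas as pd

# Toy rows; "E"/"Brasil" mimics a row shown above, the rest are made up
toy = pd.DataFrame({
    "country": ["Brasil", "France", None, "PR", "E"],
    "state": ["SP", None, None, None, "Brasil"],
})
# Small stand-ins for the lists and the mask built in the previous cells
brazil_names = ["Brazil", "Brasil", "BR"]
might_be_br = ["PR", "RS", "ES"]
br_in_all_texts = pd.Series([True, False, False, False, True])

is_brazil = toy["country"].isin(brazil_names)
mb_brazil = (
    is_brazil
    | toy["country"].isna()             # unknown country
    | toy["country"].isin(might_be_br)  # ambiguous two-letter codes
    | br_in_all_texts                   # "Brazil" found in another field
)
flags = toy.assign(is_brazil=is_brazil, mb_brazil=mb_brazil)
print(flags)
```

Only the row explicitly filled with another country (`France`) ends up with both flags set to `False`.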

## Part 3: State normalization

For USP/UNESP/Unicamp,
it would be useful to filter by the state name,
as it should be `São Paulo` for all of them,
but it's not a normalized field:

In [24]:
dataset["state"].dropna().drop_duplicates()

3222                                      RS
3224                                      PR
3226                                      SP
3236                                      MG
3396                                      PE
3397                                      RJ
3412                                      CE
3417                                      MS
3420                                      SC
3421                                      CA
3443                                      CO
3458                                   Texas
3461                                  Mumbai
3462                      State of São Paulo
3492                               São Paulo
3512                                    D.F.
3550                                   S. P.
3598                                      Pr
3600                                      DF
3602                                      DC
3604                                    S.C.
3606                                    S.P.
3607      

There's no e-mail filled in as the state name:

In [25]:
dataset[dataset["state"].fillna("").str.contains("@")].empty

True

In [26]:
is_sampa_re_search = partial(
    regex.search,
    "^s+p+$|(s(ao)?paulo|s(ain)?tpaul){e<=1}"
)
not_sampa_list = ["pablo", "palmas", "galo", "spain", "seoul"]
states_df = pd.DataFrame(
    dataset
    [["state"]]
    .dropna()
    .drop_duplicates()
    .assign(
        state_pre=lambda df: df["state"].apply(unidecode)
                                        .str.lower()
                                        .str.replace(r"\W", "", regex=True),
    ).assign(
        is_sp=lambda df: df["state_pre"].apply(
                             lambda name: all(ns not in name
                                              for ns in not_sampa_list)
                                          and bool(is_sampa_re_search(name))
                         ) | df["state"]
                             .str.lower()
                             .str.replace(r"\W", " ", regex=True)
                             .apply(lambda name: "sp" in name.split()),
    )
)
sp_names = states_df[states_df["is_sp"]]["state"].tolist()
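
The first half of the `is_sp` rule combines the fuzzy regex with a list of excluded substrings. The sketch below isolates that half on a few preprocessed names. `is_sp_name` is a hypothetical helper, and `saopablo` is a made-up input chosen to show the exclusion list at work.

```python
from functools import partial

import regex  # third-party "regex" package, as imported earlier

is_sampa_re_search = partial(
    regex.search,
    "^s+p+$|(s(ao)?paulo|s(ain)?tpaul){e<=1}"
)
not_sampa_list = ["pablo", "palmas", "galo", "spain", "seoul"]

def is_sp_name(name):
    # `name` is assumed to be already preprocessed like `state_pre`
    blocked = any(ns in name for ns in not_sampa_list)
    return not blocked and bool(is_sampa_re_search(name))

checks = {name: is_sp_name(name)
          for name in ["sp", "ssp", "saopaulo", "spaulo",
                       "saopablo", "espiritosanto"]}
print(checks)
```

`saopablo` is just one substitution away from `saopaulo`, so the regex alone would accept it; the exclusion list is what rejects it.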

In [27]:
pd.DataFrame([sp_names[0::3],
              sp_names[1::3],
              sp_names[2::3]]).fillna("").T

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | SP | State of São Paulo | São Paulo |
| 1 | S. P. | S.P. | Sao Paulo |
| 2 | SãoPaulo | S.P | SP, |
| 3 | S.Paulo | S. Paulo | São Paulo State |
| 4 | SP. | S P | Sao Paulo State |
| 5 | SP) | - SP | S. P |
| 6 | SPP | -SP | S/P |
| 7 | São Paulo, | São Paulo/SP | Campinas/SP |
| 8 | São Paulo\|SP | São Paulo, SP | SSP |
| 9 | Sp | sp | SÃ£o Paulo |
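
The second half of the `is_sp` rule looks for `sp` as a standalone token in the raw state name. A minimal sketch with the standard `re` module, where `has_sp_token` is a hypothetical helper name:

```python
import re

def has_sp_token(state):
    # Replace non-word characters by spaces and look for a standalone "sp"
    tokens = re.sub(r"\W", " ", state.lower()).split()
    return "sp" in tokens

samples = ["Campinas/SP", "São Paulo|SP", "- SP", "Espírito Santo", "Spain"]
token_flags = {state: has_sp_token(state) for state in samples}
print(token_flags)
```

Names that merely contain the letters `sp` inside a longer word, like `Spain`, aren't matched by this rule.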


The remaining names are probably not *São Paulo*.

In [28]:
states_df[states_df["state_pre"].str.contains("sp|sampa") & ~states_df["is_sp"]]["state"]

7700                                 Espírito Santo
31181                                       Espanha
107627                                       España
173221                                        Spain
204148                                Nueva Esparta
260417                               Espirito Santo
669499                              San Luis Potosí
735021                              San Luís Potosí
768580                              Sancti Spíritus
805434                        Buenos Aires Province
836983                               Espíritu Santo
887410                             Villa Carlos Paz
913024                         Estado Nueva Esparta
922003                      Núcleo de Nueva Esparta
956833                     Estado do Espírito Santo
970422                   Las Palmas de Gran Canaria
1028180                              Santi Spíritus
1048559                             Sancti Spiritus
1056351                              Sacti Spiritus
1061255     

Unfilled (including single-lettered) states:

In [29]:
state_counts = dataset["state"].fillna("").value_counts()
empty_states = states_df[states_df["state_pre"].str.len() <= 1] \
                        ["state"].tolist() + [""]
state_counts[state_counts.index.isin(empty_states)]

      763389
M          3
S          3
D          1
N          1
(          1
G          1
D.         1
i          1
-          1
,          1
Name: state, dtype: int64

Collecting the names of the cities with at least one campus
belonging to one of the chosen universities:

In [30]:
sp_cities_in_state = states_df[states_df["state_pre"] \
    .isin(["bauru", "ribeiraopreto", "saocarlos", "franca",
           "piracicaba", "pirassununga", "lorena", "santos",
           "campinas", "limeira", "paulinia", "saojoaodaboavista",
           "saojosedoriopreto", "saojose", "saojosedoscampos",
           "dracena", "botucatu", "rioclaro", "araraquara",
           "ourinhos", "assis", "itapeva", "registro",
           "saovicente", "guaratingueta", "sorocaba",
           "jaboticabal", "marilia", "tupa", "presidenteprudente",
           "aracatuba", "ilhasolteira", "rosana",
    ]) & (states_df["state"] != "França")]["state"].tolist()
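# Note: these names are looked up in the `country` counts (hence the
# "Name: country" in the output); `state_counts` would count them
# in the `state` field instead
country_counts[country_counts.index.isin(sp_cities_in_state)]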

Bauru    1
Name: country, dtype: int64

Is there any other field filled with "São Paulo"?

In [31]:
sp_in_all_texts = all_texts.apply(is_sampa_re_search).apply(bool)

In [32]:
pd.DataFrame(
    dataset[
        ~dataset["state"].isin(sp_names) &
        ~dataset["state"].isna() &
        sp_in_all_texts
    ]
    .fillna("")
    .groupby(["country", "state", "city", "institution"])
    .size()
    .rename("count")
)

| country | state | city | institution | count |
|---|---|---|---|---|
| | 14040-902 | | Universidade de São Paulo | 1 |
| | 37950-000 | | Universidade de São Paulo | 1 |
| | BA | Paulo Afonso | Companhia Hidro Elétrica do São Francisco | 1 |
| | BA | Paulo Afonso | Universidade do Estado da Bahia | 1 |
| | Brasil | Sao Paulo | Universidad Cruceiro do Sol Sao Paulo Brasil | 1 |
| | Brasil | São Paulo | CNEM | 1 |
| | Brasil | São Paulo | FMB/UNESP | 1 |
| | Brasil | São Paulo | Instituto Israelita de Ensino e Pesquisa Albert Einstein | 1 |
| | Brasil | São Paulo | SciELO | 2 |
| | Brasil | São Paulo | Secretaria de Estado da Saúde de São Paulo | 1 |

Again, several entries have nothing to do with the Brazilian state,
but we'll still say all these entries "might be São Paulo".