# Looking for articles from EMBRAPA and public state universities in SP (SciELO Network)

As seen in a previous notebook, there are $3$ public universities
that are managed by the São Paulo state government:

* USP, *Universidade de São Paulo*
* UNICAMP, *Universidade Estadual de Campinas*
* UNESP, *Universidade Estadual Paulista*

The goal here is to find all research articles
in all collections in the SciELO network
coming from any of these $3$ universities
or from [EMBRAPA](https://www.embrapa.br/en/international).
Though this problem was addressed in a previous notebook,
we're now including EMBRAPA, and the dataset isn't the same.
Using the whole network instead of the SciELO Brazil collection
brings new challenges regarding:

* The *USP* acronym appears elsewhere (homonyms);
* *Saint Paul* (translation of *São Paulo*)
  is part of the name of [another university](https://ustpaul.ca/),
  in Canada;
* There are far more alternative spellings;
* Several institution names are written in languages other than
  Portuguese, English and Spanish.

This notebook might have some parts that are similar to the last approach,
but the name-only evaluation doesn't suffice for this new dataset.

In [1]:
from functools import partial
import re

In [2]:
import pandas as pd
import regex
from unidecode import unidecode

## Part 1: Loading the data

The information about each author institution
should be in the `documents_authors.csv` reports,
but if there's no link between the authors and the given affiliations,
it should still be in the `documents_affiliations.csv` file.
Let's open these
using the field name normalization step
that can be found in <https://github.com/scieloorg/scielo20gt6/>:

In [3]:
def normalize_column_title(name):
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
        "instituition": "institution",
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")
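To see what this normalization does, here's a small sketch
applying it to a few header strings.
The raw headers below are hypothetical examples (an assumption),
not copied from the actual CSV files;
the function is repeated only to keep the sketch self-contained.

```python
import re


def normalize_column_title(name):
    # Same function as in the cell above, repeated to keep this runnable.
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
        "instituition": "institution",  # Fixes a misspelled header word
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")


# Hypothetical raw headers (assumptions for illustration only)
hypothetical_headers = [
    "document publishing ID (PID SciELO)",  # Only the bracketed part is kept
    "document affiliation instituition",
    "document publishing year",
]
normalized = [normalize_column_title(h) for h in hypothetical_headers]
print(normalized)
```

When a header ends with a bracketed part, only the text inside the brackets survives,
which is how a short name like `pid_scielo` can come out of a long header.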

We'll use the same snapshot data from the previous experiments,
but including all data in the SciELO network,
not just SciELO Brazil.

In [4]:
reports_version = "2018-12-10" # Directory name

In [5]:
doc_affs = pd.read_csv(reports_version + "/documents_affiliations.csv") \
             .rename(columns=normalize_column_title)
doc_authors = pd.read_csv(reports_version + "/documents_authors.csv") \
                .rename(columns=normalize_column_title)

Let's concatenate these to get a smaller dataset,
removing duplicates
and entries without an explicit institution.

In [6]:
def renormalize_column_title(name):
    return name.replace("_scielo", "").split("_")[-1]
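As a quick check of this second renaming step,
the sketch below applies it to the column names selected in the next cell.
Both reports end up with the same six short names,
which is what lets `pd.concat` stack them into the same columns.

```python
def renormalize_column_title(name):
    # Drop the "_scielo" suffix, then keep only the last "_"-separated word
    return name.replace("_scielo", "").split("_")[-1]


affiliation_columns = [
    "collection", "pid_scielo", "doc_affiliation_institution",
    "doc_affiliation_country", "doc_affiliation_state", "doc_affiliation_city",
]
author_columns = [
    "collection", "pid_scielo", "doc_author_institution",
    "doc_author_affiliation_country", "doc_author_affiliation_state",
    "doc_author_affiliation_city",
]
short_affiliation = [renormalize_column_title(c) for c in affiliation_columns]
short_author = [renormalize_column_title(c) for c in author_columns]
print(short_affiliation)
print(short_affiliation == short_author)
```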

In [7]:
dataset = pd.concat([
    doc_affs[[
        "collection",
        "pid_scielo",
        "doc_affiliation_institution",
        "doc_affiliation_country",
        "doc_affiliation_state",
        "doc_affiliation_city",
    ]].rename(columns=renormalize_column_title),
    doc_authors[[
        "collection",
        "pid_scielo",
        "doc_author_institution",
        "doc_author_affiliation_country",
        "doc_author_affiliation_state",
        "doc_author_affiliation_city",
    ]].rename(columns=renormalize_column_title),
], sort=False).dropna(subset=["institution"]).drop_duplicates()
print(dataset.shape)
dataset.head()

(1367987, 6)


| | collection | pid | institution | country | state | city |
|---|---|---|---|---|---|---|
| 0 | scl | S0100-879X1998000800006 | University of Gorakhpur | | | |
| 1 | scl | S0100-879X1998000800011 | Universidade Estadual de Londrina | | | |
| 2 | scl | S0100-879X1998000800005 | Southern Sea Biology Institute | | | |
| 3 | scl | S0100-879X1998000800005 | Carleton University | | | |
| 4 | scl | S0100-879X1998000800005 | Ivano-Frankivsk State Medical Academy | | | |


How many *non-empty* entries are there in each field?

In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1367987 entries, 0 to 1738918
Data columns (total 6 columns):
collection     1367987 non-null object
pid            1367987 non-null object
institution    1367987 non-null object
country        1134869 non-null object
state          604598 non-null object
city           888309 non-null object
dtypes: object(6)
memory usage: 73.1+ MB


How many *distinct* values are there in each field?

In [9]:
dataset.apply(lambda x: len(x.unique()))

collection         20
pid            732991
institution    168568
country          1542
state            4231
city            19783
dtype: int64

Some institution names are just garbage or placeholder content,
and others are too short to carry enough information
to be classified as USP/UNESP/Unicamp/EMBRAPA.
Let's collect the names with at most two word characters
(letters, digits or underscores):

In [10]:
raw_institution_names = dataset["institution"].drop_duplicates()
small_institution_names = raw_institution_names[
    # regex=True is explicit, since newer pandas defaults to literal matching
    raw_institution_names.str.replace(r"\W", "", regex=True).str.len() <= 2
].unique()
small_institution_names.sort()
small_institution_names

array(['*', '-', '.', '15', '3M', 'AC', 'BA', 'Bt', 'CA', 'CE', 'CI',
       'CP', 'DS', 'DZ', 'EE', 'ES', 'EU', 'F M', 'FC', 'FM', 'FO', 'FS',
       'GV', 'H. C.', 'H.C.', 'HC', 'HI', 'HP', 'HR', 'I', 'I & D', 'I-A',
       'IA', 'IB', 'IC', 'IF', 'IG', 'IP', 'IT', 'IZ', 'JB', 'JK', 'K.U',
       'K.U.', 'KU', 'LM', 'M.D.', 'M.S', 'MA', 'MC', 'MD', 'ME', 'MF',
       'MG', 'ML', 'MP', 'MS', 'MT', 'MZ', 'O & S', 'O&S', 'PA', 'PE',
       'PR', 'PT', 'QN', 'R & D', 'RJ', 'RN', 'RN.', 'RS', 'RT', 'S.L.',
       'S.S.', 'SC', 'SN', 'SP', 'SU', 'T.U', 'TM', 'U.A', 'U.A.', 'U.C',
       'U.C.', 'U.F.', 'U.H', 'U.J.', 'UA', 'UB', 'UC', 'UD', 'UE', 'UF',
       'UG', 'UH', 'UL', 'UM', 'UN', 'UP', 'UR', 'US', 'UT', 'UU', 'UV',
       'UZ', 'V', 'VO', 'VS', 'VU', 'WP', 'a', 'aa', 'bu', 'e', 'nd',
       's.a', 'v', 'xx'], dtype=object)
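The filter above can be sketched for single strings with the standard library alone.
The helper name `is_too_short` is hypothetical (it isn't used elsewhere in this notebook);
the sample names come from the array above, plus two acronyms we do want to keep.

```python
import re


def is_too_short(name, max_len=2):
    # Remove every non-word character, then count what's left
    return len(re.sub(r"\W", "", name)) <= max_len


samples = ["H. C.", "I & D", "*", "s.a", "USP", "UNESP"]
flags = {name: is_too_short(name) for name in samples}
print(flags)
```

Punctuation and spacing don't count towards the length,
so `H. C.` and `I & D` are both seen as two-character names.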

## Part 2: Country normalization

*Almost* all entries that matter are from Brazil.
That said, most affiliations that are obviously not from Brazil
can be discarded, and we should just replace the `country` field
by two fields:
an `is_br` and
an `mb_br` (standing for "might be Brazil"),
both with either `True` or `False`.
Unknown entries should be marked as "might be Brazil".
However, the `country` field isn't normalized:

In [11]:
dataset["country"].dropna().drop_duplicates()

756                               Brazil
3222                              Brasil
3227                            Alemanha
3244                                 USA
3394                              France
3408                                  UK
3409                              Turkey
3415                           Indonesia
3416                           Australia
3424                             Belgium
3426                            Portugal
3429                                 U.K
3431                            Scotland
3432                             England
3433                               Kenya
3434                        Burkina Faso
3437                             Austria
3444                                Togo
3445                      United Kingdom
3447                               Italy
3450                           Venezuela
3461                               India
3478                           Argentina
3498                              Israel
3512            

Some entries have an e-mail instead of the country:

In [12]:
email_as_country_df = dataset[dataset["country"].fillna("").str.contains("@")]
email_as_country_df

| | collection | pid | institution | country | state | city |
|---|---|---|---|---|---|---|
| 203191 | scl | S0101-31572008000200004 | University of Lagos | muyiwaking@yahoo.com | Lagos | Akoka |
| 212359 | scl | S1516-35982007001000013 | UNESP | rareis@fcav.unesp.br | SP | Jaboticabal |
| 282518 | scl | S0006-87052009000300021 | Universidade Federal de Mato Grosso | emilioaz@ufmt.br | MT | Cuiabá |
| 347198 | scl | S0006-87052010000500007 | Centro de Pesquisa e Desenvolvimento de Solos ... | sidney@iac.sp.gov.br | SP | Campinas |
| 733339 | scl | S0101-31572017000200381 | Instituto de Pesquisa Econômica Aplicada | marcos.cintra@ipea.gov.br | | |
| 894811 | bol | S1562-38232012000400012 | Argentina | jsalvador@citedef.gob.ar | | |
| 897188 | bol | S1012-29662016000200009 | UMSS | aleantezana22@hotmail.com | | Cochabamba |
| 956770 | chl | S0717-95532006000100007 | Universidad San Sebastián | mariaelenaneira@hotmail.com | | Concepción |
| 989186 | chl | S0717-95532016000200008 | UESB | faby_jq@hotmail.com | BA | Jequié |
| 1008190 | chl | S0718-58392009000500001 | Universidad de Buenos Aires | afcirelli@fvet.uba.ar | | Buenos Aires |


We can get the final part after the dot,
as in most entries that's a country code:

In [13]:
email_as_country_df["country"].str.replace(r".*@.*\.", "", regex=True)

203191     com
212359      br
282518      br
347198      br
733339      br
894811      ar
897188     com
956770     com
989186     com
1008190     ar
1107981     es
1123269     co
1132014     co
1152738     co
1209263    com
1474020    com
1527449    com
1650447     br
1681807     nz
1682902     za
1715604     es
1716731     es
1738654     cl
Name: country, dtype: object
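The same substitution can be tried on single strings with the standard `re` module.
This is just a sketch with a hypothetical helper name (`email_suffix`):
the greedy `.*` after the `@` runs up to the *last* dot,
so only the final part of the domain survives,
and values without an `@` are left untouched.

```python
import re


def email_suffix(value):
    # Remove everything up to (and including) the last dot after the "@"
    return re.sub(r".*@.*\.", "", value)


suffixes = [email_suffix(value) for value in [
    "rareis@fcav.unesp.br",
    "sidney@iac.sp.gov.br",
    "muyiwaking@yahoo.com",
    "Brazil",  # Not an e-mail: nothing matches, nothing is replaced
]]
print(suffixes)
```

Generic suffixes like `com` tell nothing about the country,
so these e-mails only help when the suffix is a country code.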

Then, we should apply a fuzzy regex to get which entries are from Brazil:

In [14]:
is_br_re_search = partial(
    regex.search,
    "^br$|^(bra[sz]il){e<=2}$|(bra[sz]il){e<=1}"
)
countries_df = pd.DataFrame(
    dataset
    [["country"]]
    .dropna()
    .drop_duplicates()
    .assign(
        country_pre=lambda df: df["country"].apply(unidecode)
                                            .str.lower()
                                            .str.replace(r"\W|.*@.*\.", "", regex=True),
    ).assign(
        is_br=lambda df: df["country_pre"].apply(is_br_re_search).astype(bool),
    )
)
brazil_names = countries_df[countries_df["is_br"]]["country"].values.tolist()

In [15]:
pd.DataFrame([brazil_names[0::3],
              brazil_names[1::3],
              brazil_names[2::3]]).fillna("").T

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Brazil | Brasil | Brazi1 |
| 1 | BRAZIL | Brazi | Brasi |
| 2 | BRASIL | Br | Brasília |
| 3 | Brzail | Brésil | Bra sil |
| 4 | Barzil | Brazill | - Brasil |
| 5 | Brasil. | Brasi l | rareis@fcav.unesp.br |
| 6 | BR | Brasíl | Brasill |
| 7 | - BRASIL | - BR | -BR |
| 8 | emilioaz@ufmt.br | sidney@iac.sp.gov.br | SP, Brazil |
| 9 | Brazil. | Brazile | Brasile |
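The `{e<=2}` part of the pattern asks the `regex` package
to accept up to two errors (insertions, deletions or substitutions).
As a rough standard-library illustration of that idea
(not a reimplementation of the `regex` package),
the sketch below computes the edit distance
between an already preprocessed name and `brasil`/`brazil`.
It only mimics the anchored `^(bra[sz]il){e<=2}$` alternative;
the unanchored `{e<=1}` alternative, which also looks inside longer strings, is left out.
The helper names are hypothetical.

```python
def edit_distance(a, b):
    # Classic dynamic programming (Levenshtein) distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def close_to_brazil(name, max_errors=2):
    # "name" is assumed to be lowercase, unaccented and without punctuation
    return any(edit_distance(name, target) <= max_errors
               for target in ("brasil", "brazil"))


for name in ["brzail", "brazi1", "bresil", "brasilia", "brunei"]:
    print(name, close_to_brazil(name))
```

This also shows why `Brasília` ends up in the list above:
it's only two insertions away from `brasil`.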


Other names that contain `BR` as a substring aren't *Brazil*:

In [16]:
countries_df[countries_df["country_pre"].str.contains("br") &
             ~countries_df["is_br"]]["country"]

11520                                          Great Britain
180116                                          Grã-Bretanha
290163                                   British West Indies
293668                                            Rio Branco
375598                                  Syrian Arab Republic
421894                  Estado Libre Asociado de Puerto Rico
658240                                          Grã-bretanha
660211     United Kingdom of Great Britain and Northern I...
792721     United Kingdom of Great Britain na Northern Ir...
810638                                                Brunei
859423                                          Gran Bretaña
1218903                                    Brunei Darussalam
1248918                                              Ginebra
1391946                               Arab Republic of Egypt
1719635                                            Bruxelles
Name: country, dtype: object

The remaining names are either:

- Another country;
- A mistake (e.g. a Brazilian state);
- Just some noise (actually unfilled data).

In [17]:
country_counts = dataset["country"].fillna("").value_counts()
empty_countries = countries_df[countries_df["country_pre"].str.len() <= 1] \
                              ["country"].tolist() + [""]
country_counts[country_counts.index.isin(empty_countries)]

     233118
-         6
(         2
U         1
z         1
E         1
a         1
.         1
Name: country, dtype: int64

In [18]:
br_states_in_country = countries_df[countries_df["country_pre"] \
    .isin(["sp", "rj", "go", "df", "pb", "ce", "rr",
           "minasgerais", "saopaulo", "riodejaneiro", "espiritosanto",
           "goias", "matogrosso", "matogrossodosul", "distritofederal",
           "parana", "riograndedosul", "santacatarina",
           "riograndedonorte", "sergipe", "bahia", "pernambuco",
           "piaui", "paraiba", "ceara", "maranhao", "alagoas",
           "amazonas", "acre", "roraima", "rondonia",
           "amapa", "tocantins", "para",
    ])]["country"].tolist()
country_counts[country_counts.index.isin(br_states_in_country)]

s.p                    93
SP                     14
São Paulo               7
Bahia                   7
Ceará                   6
RJ                      5
Minas Gerais            5
Paraná                  4
Rio de Janeiro          4
PB                      3
CE                      3
Amazonas                3
Pernambuco              3
Distrito Federal        2
GO                      2
D. F                    1
Rio Grande do Norte     1
.sp                     1
Sergipe                 1
Piauí                   1
Pará                    1
DF                      1
Goiás                   1
Paraíba                 1
Name: country, dtype: int64

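Misplaced names like these are hard to find exhaustively.
`SP`, `RJ`, `GO`, `DF`, `PB`, `CE` and `RR`
aren't ISO 3166-1 alpha-2 codes,
so they're probably just Brazilian state abbreviations,
though they might be acronyms for a country name
in another language.
The remaining state abbreviations, on the other hand,
collide with codes used for countries or other territories: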

In [19]:
might_be_br = countries_df[countries_df["country_pre"] \
    .isin(["mg", # Madagascar or Minas Gerais?
           "es", # Spain or Espírito Santo?
           "mt", # Malta or Mato Grosso?
           "ms", # Montserrat or Mato Grosso do Sul?
           "pr", # Puerto Rico or Paraná?
           "rs", # Serbia or Rio Grande do Sul?
           "sc", # Seychelles or Santa Catarina?
           "rn", # Niger or Rio Grande do Norte?
           "se", # Sweden or Sergipe?
           "ba", # Bosnia and Herzegovina or Bahia?
           "pe", # Peru or Pernambuco?
           "pi", # Philippines or Piauí?
           "ma", # Morocco or Maranhão?
           "al", # Albania or Alagoas?
           "am", # Armenia or Amazonas?
           "ac", # Ascension Island or Acre?
           "ro", # Romania or Rondônia?
           "ap", # African Regional Industrial Property Organization or Amapá?
           "to", # Tonga or Tocantins?
           "pa", # Panama or Pará?
    ]) & (countries_df["country"].str.len() <= 5)]["country"].tolist()
country_counts[country_counts.index.isin(might_be_br)]

PR      11
RS      10
ES       7
MG       4
PE       4
BA       3
MA       2
PA       2
AL       2
P.R.     2
SC       1
MT       1
Name: country, dtype: int64

Can the country be in another field?

In [20]:
dataset[
    ~dataset["country"].isin(brazil_names) &
    ~dataset["country"].isna() &
    dataset["state"].isin(brazil_names)
]

| | collection | pid | institution | country | state | city |
|---|---|---|---|---|---|---|
| 732797 | scl | S2237-101X2017000200381 | Jardim Botânico do Rio de Janeiro | E | Brasil | Rio de Janeiro |
| 831244 | arg | S1850-15322009000300007 | Universitäts Klinikum Freiburg | Germany | Br | Freiburg |
| 1522434 | rve | S0104-35522012000200018 | Universidade Federal do Rio de Janeiro | E-mail | Brasil | Rio de Janeiro |


Yes. So let's find Brazil in all text fields.

In [21]:
all_texts = (dataset
    .drop(columns=["collection", "pid"])
    .fillna("")
    .applymap(unidecode)
    .T.apply(lambda row: " ".join(row))
    .str.lower()
    .str.replace(r"\W", " ", regex=True)
)

In [22]:
br_in_all_texts = all_texts.apply(is_br_re_search).apply(bool)

In [23]:
pd.DataFrame(
    dataset[
        ~dataset["country"].isin(brazil_names) &
        ~dataset["country"].isna() &
        br_in_all_texts
    ]
    .fillna("")
    .groupby(["country", "state", "city", "institution"])
    .size()
    .rename("count")
)

| country | state | city | institution | count |
|---|---|---|---|---|
| 70610-200 | DF | Brasília | Polícia Civil do Distrito Federal | 1 |
| Amazonas | | Manaus | Universidade Luterana do Brasil | 1 |
| Argentina | | | Clínica Basilea | 2 |
| Argentina | | C.A.B.A | Clínica Basilea | 2 |
| Argentina | | Ciudad de Buenos Aires | Clínica Basilea | 1 |
| Buenos Aires | PI | Teresina | Empresa Brasileira de Pesquisa Agropecuária | 1 |
| Canada | | Toronto | University of Brasilia | 1 |
| Colombia | | Bogotá | Universidad de Brasília | 1 |
| Colombia | | Brasília | Organização Pan-Americana da Saúde | 4 |
| Colombia | D.F. | Brasilia | Organização Pan-Americana da Saúde | 1 |


Several of the above entries aren't from Brazil.
On the other hand, since there are only a few of them,
it won't hurt to say that all these entries "might be Brazil".

## Part 3: State normalization

For USP/UNESP/Unicamp,
it would be useful to filter by the state name,
as it should be `São Paulo` for all of them,
but it's not a normalized field:

In [24]:
dataset["state"].dropna().drop_duplicates()

3222                                      RS
3224                                      PR
3226                                      SP
3236                                      MG
3396                                      PE
3397                                      RJ
3412                                      CE
3417                                      MS
3420                                      SC
3421                                      CA
3443                                      CO
3458                                   Texas
3461                                  Mumbai
3462                      State of São Paulo
3492                               São Paulo
3512                                    D.F.
3550                                   S. P.
3598                                      Pr
3600                                      DF
3602                                      DC
3604                                    S.C.
3606                                    S.P.
3607      

There's no e-mail filled in as the state name:

In [25]:
dataset[dataset["state"].fillna("").str.contains("@")].empty

True

In [26]:
is_sampa_re_search = partial(
    regex.search,
    "^s+p+$|(s(ao)?paulo|s(ain)?tpaul){e<=1}"
)
not_sampa_list = ["pablo", "palmas", "galo", "spain", "seoul"]
states_df = pd.DataFrame(
    dataset
    [["state"]]
    .dropna()
    .drop_duplicates()
    .assign(
        state_pre=lambda df: df["state"].apply(unidecode)
                                        .str.lower()
                                        .str.replace(r"\W", "", regex=True),
    ).assign(
        is_sp=lambda df: df["state_pre"].apply(
                             lambda name: all(ns not in name
                                              for ns in not_sampa_list)
                                          and bool(is_sampa_re_search(name))
                         ) | df["state"]
                             .str.lower()
                             .str.replace(r"\W", " ", regex=True)
                             .apply(lambda name: "sp" in name.split()),
    )
)
sp_names = states_df[states_df["is_sp"]]["state"].tolist()

In [27]:
pd.DataFrame([sp_names[0::3],
              sp_names[1::3],
              sp_names[2::3]]).fillna("").T

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | SP | State of São Paulo | São Paulo |
| 1 | S. P. | S.P. | Sao Paulo |
| 2 | SãoPaulo | S.P | SP, |
| 3 | S.Paulo | S. Paulo | São Paulo State |
| 4 | SP. | S P | Sao Paulo State |
| 5 | SP) | - SP | S. P |
| 6 | SPP | -SP | S/P |
| 7 | São Paulo, | São Paulo/SP | Campinas/SP |
| 8 | São Paulo\|SP | São Paulo, SP | SSP |
| 9 | Sp | sp | SÃ£o Paulo |
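Two of the criteria used for `is_sp` need no fuzzy matching,
and can be sketched with the standard library:
the `^s+p+$` test on the preprocessed name,
and the search for an isolated `sp` word in the raw name.
In this sketch, `unicodedata` is a stand-in for `unidecode` (an assumption,
good enough for accented Latin letters), the helper names are hypothetical,
and the fuzzy *São Paulo*/*Saint Paul* alternative is left out on purpose.

```python
import re
import unicodedata


def strip_accents(text):
    # Decompose accented letters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(state):
    # Lowercase, unaccented, with no punctuation or spaces
    return re.sub(r"\W", "", strip_accents(state).lower())


def has_sp_word(state):
    # "sp" must be a whole word, not a substring (as in "Espanha")
    return "sp" in re.sub(r"\W", " ", state.lower()).split()


def looks_like_sp_acronym(state):
    return bool(re.fullmatch(r"s+p+", preprocess(state))) or has_sp_word(state)


for state in ["S. P.", "S/P", "SPP", "Campinas/SP", "São Paulo|SP",
              "Espanha", "São Paulo"]:
    print(repr(state), looks_like_sp_acronym(state))
```

The last name, `São Paulo`, isn't caught by these two criteria alone:
that's the job of the fuzzy alternative in `is_sampa_re_search`.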


The remaining names are probably not *São Paulo*.

In [28]:
states_df[states_df["state_pre"].str.contains("sp|sampa") & ~states_df["is_sp"]]["state"]

7700                                 Espírito Santo
31181                                       Espanha
107627                                       España
173221                                        Spain
204148                                Nueva Esparta
260417                               Espirito Santo
669499                              San Luis Potosí
735021                              San Luís Potosí
768580                              Sancti Spíritus
805434                        Buenos Aires Province
836983                               Espíritu Santo
887410                             Villa Carlos Paz
913024                         Estado Nueva Esparta
922003                      Núcleo de Nueva Esparta
956833                     Estado do Espírito Santo
970422                   Las Palmas de Gran Canaria
1028180                              Santi Spíritus
1048559                             Sancti Spiritus
1056351                              Sacti Spiritus
1061255     

Unfilled (including single-lettered) states:

In [29]:
state_counts = dataset["state"].fillna("").value_counts()
empty_states = states_df[states_df["state_pre"].str.len() <= 1] \
                        ["state"].tolist() + [""]
state_counts[state_counts.index.isin(empty_states)]

      763389
M          3
S          3
D          1
N          1
(          1
G          1
D.         1
i          1
-          1
,          1
Name: state, dtype: int64

Collecting the names of the cities with at least one campus
belonging to one of the chosen universities:

In [30]:
sp_cities_in_state = states_df[states_df["state_pre"] \
    .isin(["bauru", "ribeiraopreto", "saocarlos", "franca",
           "piracicaba", "pirassununga", "lorena", "santos",
           "campinas", "limeira", "paulinia", "saojoaodaboavista",
           "saojosedoriopreto", "saojose", "saojosedoscampos",
           "dracena", "botucatu", "rioclaro", "araraquara",
           "ourinhos", "assis", "itapeva", "registro",
           "saovicente", "guaratingueta", "sorocaba",
           "jaboticabal", "marilia", "tupa", "presidenteprudente",
           "aracatuba", "ilhasolteira", "rosana",
    ]) & (states_df["state"] != "França")]["state"].tolist()
country_counts[country_counts.index.isin(sp_cities_in_state)]

Bauru    1
Name: country, dtype: int64

Is there any other field filled with "São Paulo"?

In [31]:
sp_in_all_texts = all_texts.apply(is_sampa_re_search).apply(bool)

In [32]:
pd.DataFrame(
    dataset[
        ~dataset["state"].isin(sp_names) &
        ~dataset["state"].isna() &
        sp_in_all_texts
    ]
    .fillna("")
    .groupby(["country", "state", "city", "institution"])
    .size()
    .rename("count")
)

| country | state | city | institution | count |
|---|---|---|---|---|
| | 14040-902 | | Universidade de São Paulo | 1 |
| | 37950-000 | | Universidade de São Paulo | 1 |
| | BA | Paulo Afonso | Companhia Hidro Elétrica do São Francisco | 1 |
| | BA | Paulo Afonso | Universidade do Estado da Bahia | 1 |
| | Brasil | Sao Paulo | Universidad Cruceiro do Sol Sao Paulo Brasil | 1 |
| | Brasil | São Paulo | CNEM | 1 |
| | Brasil | São Paulo | FMB/UNESP | 1 |
| | Brasil | São Paulo | Instituto Israelita de Ensino e Pesquisa Albert Einstein | 1 |
| | Brasil | São Paulo | SciELO | 2 |
| | Brasil | São Paulo | Secretaria de Estado da Saúde de São Paulo | 1 |


Again, several entries have nothing to do with the Brazilian state,
but we'll still say all these entries "might be São Paulo".

## Part 4: Iterative "collateral effect" manual classification

As it'll be way too hard to find a single criterion
that will match every single entry in this dataset,
we'll use both a *subtractive* and an *additive* approach
to include and remove entries,
updating a dataset of "remaining" PIDs.

In [33]:
solution = pd.DataFrame({
    "usp": None,
    "unesp": None,
    "unicamp": None,
    "embrapa": None,
}, index=dataset.index, dtype=float).assign(pid=dataset["pid"])

Where the values in this table are:

- `0.0`: False;
- `1.0`: True;
- `NaN` (Not a Number): Unknown.

The short institution names found in Part 1 don't carry enough information,
so their entries are marked as not belonging to any of the four institutions.

In [34]:
solution.loc[
    dataset[dataset["institution"].isin(small_institution_names)].index,
    ["usp", "unesp", "unicamp", "embrapa"]
] = 0.

How many PIDs do we have in total?

In [35]:
solution["pid"].drop_duplicates().size

732991

### Wordlist acronym approach

Let's add some fields regarding the previous steps
and a wordlist for the institution name.
This is the first try for a classification:
if the acronym is in the list,
we already know the correct classification
(though there might be some false positives,
 e.g. `USP` names from other countries).

In [36]:
wl_dataset = dataset.assign(
    is_br=dataset["country"].isin(brazil_names + br_states_in_country),
    is_sp=dataset["state"].isin(sp_names + sp_cities_in_state),
    word_list=dataset["institution"].fillna("")
                                    .apply(unidecode)
                                    .str.replace(r"\W", " ", regex=True)
                                    .str.lower()
                                    .str.split(),
).assign(
    mb_br=lambda df: df["is_br"]
                   | df["country"].isna()
                   | df["country"].isin(empty_countries + might_be_br)
                   | br_in_all_texts,
    mb_sp=lambda df: df["is_sp"]
                   | df["state"].isna()
                   | df["state"].isin(empty_states)
                   | sp_in_all_texts,
    ipre=lambda df: df["word_list"].str.join(" "),
    has_usp=lambda df: df["word_list"].apply(lambda wl: "usp" in wl),
    has_unesp=lambda df: df["word_list"].apply(lambda wl: "unesp" in wl),
    has_unicamp=lambda df: df["word_list"].apply(lambda wl: "unicamp" in wl),
    has_embrapa=lambda df: df["word_list"].apply(lambda wl: "embrapa" in wl),
)
wl_dataset[1000:5000:1200].T

| | 1495 | 3006 | 4415 | 5795 |
|---|---|---|---|---|
| collection | scl | scl | scl | scl |
| pid | S0103-90161998000100010 | S0103-90161996000100024 | S0100-879X1999000700006 | S0074-02761999000600004 |
| institution | USP | ESALQ | Universidade de São Paulo | Universidade Federal de Minas Gerais |
| country | | | Brasil | Brasil |
| state | | | SP | MG |
| city | | | Ribeirão Preto | Belo Horizonte |
| is_br | False | False | True | True |
| is_sp | False | False | True | False |
| word_list | [usp] | [esalq] | [universidade, de, sao, paulo] | [universidade, federal, de, minas, gerais] |
| mb_br | True | True | True | True |
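The reason for testing the acronym against a word list,
instead of looking for a substring,
can be seen in this standard-library sketch.
It skips the `unidecode` step (not needed for the acronyms themselves),
and the last institution name is a made-up example (an assumption),
chosen only because it contains `usp` inside a longer word.

```python
import re


def to_word_list(institution):
    # Replace non-word characters by spaces, lowercase and split into words
    return re.sub(r"\W", " ", institution).lower().split()


multi = to_word_list(
    "Universidade de São Paulo (USP)/Universidade Estadual Paulista (UNESP)"
)
labex = to_word_list("Embrapa-Labex-USA")
made_up = to_word_list("Instituto Cuspide")  # Hypothetical name

print("usp" in multi, "unesp" in multi)  # Both acronyms are whole words
print("embrapa" in labex)
print("usp" in made_up, "usp" in "instituto cuspide")  # Word vs. substring
```

A substring search would flag the made-up name as *USP*,
while the word list only accepts the acronym as a whole word.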


The number of `True` entries in each field
(`mb` stands for "might be",
 and all entries that "are something" also "might be something"):

In [37]:
wl_dataset[["is_br", "mb_br", "is_sp", "mb_sp",
            "has_usp", "has_unicamp", "has_unesp", "has_embrapa"]].sum()

is_br          580292
mb_br          813586
is_sp          156830
mb_sp          920649
has_usp          4554
has_unicamp      1287
has_unesp        1635
has_embrapa     14266
dtype: int64

Are there inconsistent entries?

In [38]:
wl_multi_valued = wl_dataset[
    wl_dataset[["has_usp", "has_unicamp",
                "has_unesp", "has_embrapa"]].T.sum() > 1
]
wl_multi_valued.T

| | 1144697 |
|---|---|
| collection | col |
| pid | S1794-47242015000200002 |
| institution | Universidade de São Paulo (USP)/Universidade E... |
| country | Brasil |
| state | |
| city | |
| is_br | True |
| is_sp | False |
| word_list | [universidade, de, sao, paulo, usp, universida... |
| mb_br | True |


That's actually a multi-valued non-ambiguous entry,
not an inconsistency:

In [39]:
wl_multi_valued["institution"].tolist()

['Universidade de São Paulo (USP)/Universidade Estadual Paulista (UNESP)']

In [40]:
solution.loc[
    wl_multi_valued.index,
    ["usp", "unesp", "unicamp", "embrapa"]
] = 1. * wl_multi_valued[["has_usp", "has_unesp",
                          "has_unicamp", "has_embrapa"]].values

We should look for inconsistencies regarding the country and the state.

#### EMBRAPA acronym consistency

In [41]:
ac_cols2remove = ["word_list", "ipre",
                  "has_usp", "has_unesp", "has_unicamp", "has_embrapa"]

In [42]:
wl_dataset[
    ~wl_dataset["is_br"] &
    ~wl_dataset["country"].isna() &
    wl_dataset["has_embrapa"]
].drop(columns=ac_cols2remove)

| | collection | pid | institution | country | state | city | is_br | is_sp | mb_br | mb_sp |
|---|---|---|---|---|---|---|---|---|---|---|
| 39855 | scl | S1519-566X2002000400001 | Embrapa-Labex-USA | USA | | | False | False | False | True |
| 54349 | scl | S0100-41582003000500001 | Embrapa | France | | | False | False | False | True |
| 60767 | scl | S0100-67622003000500017 | Embrapa Labex | USA | Maryland | Beltsville | False | False | False | False |
| 293668 | scl | S1413-70542009000700003 | Embrapa Acre | Rio Branco | AC | Zona Rural | False | False | False | False |
| 402502 | scl | S0101-20612011000400012 | Embrapa Labex Europa | France | | | False | False | False | True |
| 530884 | scl | S0044-59672014000400011 | Embrapa Labex Europa | France | | Montpellier | False | False | False | True |
| 543687 | scl | S1983-40632010000400006 | Embrapa Labex Europa | France | | Montpellier | False | False | False | True |
| 584465 | scl | S0100-06832015000200377 | Embrapa | France | | Montpellier | False | False | False | True |
| 606243 | scl | S1984-29612015000300317 | Embrapa Labex | United States | | Beltsville | False | False | False | True |
| 642176 | scl | S0100-69162015000601172 | Embrapa Labex USA | United States | | | False | False | False | True |


Indeed, *EMBRAPA* exists in other countries,
in a program called [Labex](https://www.embrapa.br/programa-embrapa-labex).

However, all `Spain` entries are actually from Brazil.
In one such entry, `Brasil` appeared as the city instead of the country,
but the institution name
has both the detailed EMBRAPA unit name and its address,
which identify it as
[Embrapa Dairy Cattle](https://www.embrapa.br/en/gado-de-leite/dados-cadastrais)
(Brazil, MG).
`Embrapa Tropical Semiárido` happens to be
[Embrapa Semi-Arid](https://www.embrapa.br/en/semiarido/dados-cadastrais)
(Brazil, PE).

The `Rio Branco` country is actually a city name (Brazil, AC).
All other names actually belong to EMBRAPA units in other countries,
including
[Embrapa África](
  http://memoria.ebc.com.br/agenciabrasil/galeria/2008-04-20/20-de-abril-de-2008
).
These should be the only EMBRAPA entries from outside Brazil,
unless some other entry doesn't have the explicit acronym.

That said, every entry that has the EMBRAPA acronym
should be seen as belonging to it,
no matter the country.

In [43]:
solution.loc[
    wl_dataset["has_embrapa"],
    "embrapa"
] = 1.

#### Unicamp/UNESP code/acronym consistency

All entries are or might be from Brazil.

In [44]:
wl_dataset[~wl_dataset["mb_br"] & wl_dataset["has_unicamp"]].empty

True

In [45]:
wl_dataset[~wl_dataset["mb_br"] & wl_dataset["has_unesp"]].empty

True

Which *Unicamp* entries aren't from São Paulo?

In [46]:
unicamp_notsp = wl_dataset[
    wl_dataset["mb_br"] &
    ~wl_dataset["is_sp"] &
    ~wl_dataset["state"].isna() &
    wl_dataset["has_unicamp"]
].drop(columns=ac_cols2remove)
unicamp_notsp.set_index("pid")

| pid | collection | institution | country | state | city | is_br | is_sp | mb_br | mb_sp |
|---|---|---|---|---|---|---|---|---|---|
| S0104-59702003000200017 | scl | Unicamp | Brasil | RJ | Rio de Janeiro | True | False | True | False |
| S0021-75572011000400014 | scl | UNICAMP | | SC | Florianópolis | False | False | True | False |
| S0103-73072011000300014 | scl | Unicamp | | FE | | False | False | True | False |
| S1518-61482009000400006 | psi | UNICAMP | | Minas Gerais | Uberlândia | False | False | True | False |
| S0104-07072012000300015 | rve | Unicamp | | Brasil | São Paulo | False | False | True | True |

The last row just has some shifted fields.
The other ones don't look right,
since `FE` isn't a Brazilian state,
and *Unicamp* has no campus in these other Brazilian states.
The authors and publishing years related to these entries are:

In [47]:
unicamp_notsp_authors = doc_authors[
    (doc_authors["doc_author_affiliation_state"].isin([
        "RJ", "SC", "FE", "Minas Gerais",
    ])) &
    (doc_authors["doc_author_institution"].str.upper() == "UNICAMP")
][["pid_scielo", "doc_publishing_year", "doc_author"]].set_index("pid_scielo")
unicamp_notsp_authors

| pid_scielo | doc_publishing_year | doc_author |
|---|---|---|
| S0104-59702003000200017 | 2003 | Alex Gonçalves Varela |
| S0021-75572011000400014 | 2011 | Camila Isabel Santos Schivinski |
| S0103-73072011000300014 | 2011 | Carlos Eduardo Albuquerque Miranda |
| S1518-61482009000400006 | 2009 | João Luiz Leitão Paravidini |


Their Lattes:

- [Alex Gonçalves Varela](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4795093J6
  ): master's degree at UNICAMP in 2003.
  His undergraduate degree and postdoc were in RJ, but not in that year;
- [Camila Isabel Santos Schivinski](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4133346E6
  ): Has worked for UDESC since 2009.
  Her Lattes says she's also a collaborator at Unicamp,
  but [the article itself](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0021-75572011000400014&lng=en&nrm=iso&tlng=en
  ) says she's from UDESC, not Unicamp;
- [Carlos Eduardo Albuquerque Miranda](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4782899J4
  ): `FE` stands for
  [*Faculdade de Educação*](https://www.fe.unicamp.br/)
  (Pedagogy college),
  and [the article](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0103-73072011000300014&lng=en&nrm=iso&tlng=en
  ) clearly states the author is from Unicamp;
- [João Luiz Leitão Paravidini](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4772134H5
  ): has worked in Uberlândia since 1994
  and studied in Campinas until 2002;
  it seems that, formally, he never went back there.

The Camila and João entries shouldn't have been assigned to Unicamp,
whereas the other two entries (Alex and Carlos) are from Unicamp
but have misleading data in the remaining fields.

Another way to detect that is to count
the publications per year in each institution for these authors:

In [48]:
pd.DataFrame(doc_authors
    [doc_authors["doc_author"].isin(unicamp_notsp_authors["doc_author"])]
    .groupby(["doc_author", "doc_publishing_year", "doc_author_institution"])
    .size()
    .rename("count")
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count
doc_author,doc_publishing_year,doc_author_institution,Unnamed: 3_level_1
Alex Gonçalves Varela,2002,Universidade Estadual de Campinas,1
Alex Gonçalves Varela,2003,Unicamp,1
Alex Gonçalves Varela,2004,Universidade Estadual de Campinas,1
Alex Gonçalves Varela,2005,Universidade Estadual de Campinas,1
Alex Gonçalves Varela,2006,Universidade Estadual de Campinas,1
Alex Gonçalves Varela,2007,"Ministério da Ciência, Tecnologia e Inovação",2
Alex Gonçalves Varela,2007,Museu de Astronomia e Ciências Afins,2
Alex Gonçalves Varela,2008,"Ministério da Ciência, Tecnologia e Inovação",1
Alex Gonçalves Varela,2010,Ministério da Ciência e Tecnologia,1
Alex Gonçalves Varela,2010,Museu de Astronomia e Ciências Afins,1


In [49]:
solution.loc[
    unicamp_notsp.index,
    ["usp", "unesp", "unicamp", "embrapa"]
] = unicamp_notsp.assign(
    usp=0.,
    unesp=0.,
    unicamp=1. * ~unicamp_notsp["state"].isin(["SC", "Minas Gerais"]),
    embrapa=0.,
)[["usp", "unesp", "unicamp", "embrapa"]]
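The assignment above relies on turning a boolean mask into float scores. A minimal sketch of that idiom on made-up data follows; `toy` and `toy_solution` are hypothetical stand-ins for `unicamp_notsp` and `solution`, and only the four `state` values come from the table above.

```python
import pandas as pd

# Hypothetical toy rows mimicking the `state` column of `unicamp_notsp`
toy = pd.DataFrame(
    {"state": ["RJ", "SC", "FE", "Minas Gerais"]},
    index=[10, 11, 12, 13],
)

# Hypothetical stand-in for the `solution` frame, starting with no scores
toy_solution = pd.DataFrame(
    float("nan"),
    index=toy.index,
    columns=["usp", "unesp", "unicamp", "embrapa"],
)

# `~` negates the boolean mask, and multiplying by `1.` casts it to float
unicamp_score = 1. * ~toy["state"].isin(["SC", "Minas Gerais"])

toy_solution.loc[toy.index, ["usp", "unesp", "embrapa"]] = 0.
toy_solution.loc[toy.index, "unicamp"] = unicamp_score
```

Rows whose `state` is `SC` or `Minas Gerais` end up with a `unicamp` score of `0.`, and the other two rows get `1.`.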

Which *UNESP* entries aren't from São Paulo?

In [50]:
unesp_notsp = wl_dataset[
    wl_dataset["mb_br"] &
    ~wl_dataset["is_sp"] &
    ~wl_dataset["state"].isna() &
    wl_dataset["has_unesp"]
].drop(columns=ac_cols2remove)
unesp_notsp.set_index("pid")

Unnamed: 0_level_0,collection,institution,country,state,city,is_br,is_sp,mb_br,mb_sp
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
S1519-69842006000100013,scl,Light SESA/UNESP,Brazil,RJ,Piraí,True,False,True,False
S0104-80232005000100010,scl,UNESP/Araraquara,,Pr,Maringá,False,False,True,False
S0006-87052008000300001,scl,UNESP,Brasil,PE,Petrolina,True,False,True,False
S0365-66912006001200002,esp,Universidad del Estado de San Pablo (UNESP),Brasil,San Pablo,Araraquara,True,False,True,False
S0104-07072009000100012,rve,FMB/UNESP,,Brasil,São Paulo,False,False,True,True
S0378-18442003000200007,ven,Universidade Estadual de São Paulo (UNESP),Brasil,PR,Maringá,True,False,True,True


Some remarks:

- The [S1519-69842006000100013](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1519-69842006000100013&lng=en&nrm=iso&tlng=en
  ) entry looks misleading, as not even the author name is clear.
  The actual author is [Rinaldo Jose da Silva Rocha](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4782602U5
  ), and Light says that [UNESP had some participation](
    http://www.light.com.br/repositorio/ped%20balancos/balanco2009.htm
  ) in some research and development project,
  but the years don't seem to match.
  Nevertheless, UNESP seems to have something to do with this entry;
- `Araraquara` is the name of a city in the State of *São Paulo*
  where lies a UNESP campus;
- [FMB/UNESP](http://fmb.unesp.br/) is at Botucatu (*São Paulo*);
- The [S0006-87052008000300001](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0006-87052008000300001&lng=en&nrm=iso&tlng=en
  ) entry is from UNESP at Botucatu, EMBRAPA (Brazil, PE)
  and Universidade Federal de Pelotas (Brazil, RS), all at once;
- There's no UNESP campus at Maringá (*Paraná*),
  but the authors' entries in Lattes
  ([Tuleski, Silvana](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4798208U8
  ); [Hahn, Norma](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4798522P0
  ))
  belong to people related to both UNESP
  and *Universidade Estadual de Maringá*.
  Only the Silvana entry seems to be really from UNESP
  (its year falls within her doctorate years);
  - The [S0378-18442003000200007](
      http://www.scielo.org.ve/scielo.php?script=sci_arttext&pid=S0378-18442003000200007&lng=es&nrm=iso&tlng=pt
    ) entry isn't from UNESP,
    it's there because of a misleading "bio" in the article.

That said, just that single entry in the `ven` collection
isn't from UNESP.
But we've found that the `PE` entry is also from EMBRAPA!

In [51]:
da_cols2remove = [k for k in doc_authors.columns
                    if k[:3] in ["is_", "ext", "stu"]]

In [52]:
doc_authors[
    (doc_authors["doc_author_affiliation_city"] == "Maringá") &
    (doc_authors["doc_author_institution"].str.contains("UNESP"))
].drop(columns=da_cols2remove).fillna("").T

Unnamed: 0,233019,2902013
collection,scl,ven
issn_scielo,0104-8023,0378-1844
issns,0104-8023,0378-1844
title_scielo,Revista do Departamento de Psicologia. UFF,Interciencia
title_thematic_areas,Human Sciences,Agricultural Sciences;Biological Sciences;Engi...
title_current_status,deceased,suspended
pid_scielo,S0104-80232005000100010,S0378-18442003000200007
doc_publishing_year,2005,2003
doc_type,research-article,research-article
doc_is_citable,1,1


In [53]:
pd.DataFrame(doc_authors
    [doc_authors["doc_author"] == "Norma Segatti Hahn"]
    .groupby(["doc_author_institution", "doc_publishing_year"])
    .size()
    .rename("count")
)

Unnamed: 0_level_0,Unnamed: 1_level_0,count
doc_author_institution,doc_publishing_year,Unnamed: 2_level_1
"Núcleo de Pesquisas em Limnologia, Ictiologia e Aqüicultura",2007,1
PEA,2007,1
Universidade Estadual de Maringá,1997,1
Universidade Estadual de Maringá,1998,1
Universidade Estadual de Maringá,2003,1
Universidade Estadual de Maringá,2004,3
Universidade Estadual de Maringá,2005,1
Universidade Estadual de Maringá,2007,2
Universidade Estadual de Maringá,2008,2
Universidade Estadual de Maringá,2010,1


In [54]:
solution.loc[
    unesp_notsp.index,
    ["usp", "unesp", "unicamp", "embrapa"]
] = unesp_notsp.assign(
    usp=0.,
    unesp=1. * (unesp_notsp["collection"] != "ven"),
    unicamp=0.,
    embrapa=1.0 * (unesp_notsp["state"] == "PE"),
)[["usp", "unesp", "unicamp", "embrapa"]]

#### USP acronym consistency

USP can be a reference to [USP Hospitales](http://www.usphospitales.com),
a [no longer active](https://empresite.eleconomista.es/USP-HOSPITALES-MADRID.html)
private hospital group in Spain
that [had been merged](
  https://www.elconfidencial.com/economia/2012-03-22/los-hospitales-usp-y-quiron-se-fusionan-para-crear-un-gigante-de-la-sanidad-privada_417839/
) with the Clínica Quirón Palmaplanas.
It's unknown whether USP is really an acronym in this case.
Some pages like [Unidad de Arritmias y Síncope](http://www.unidadarritmias.com/)
belong to that group but
no longer have the word "USP" anywhere,
and there's no institution name history available.
[USP Dexeus University Institute](https://www.quironsalud.es/dexeus-barcelona)
is a [UAB-partner research institute](
  https://www.uab.cat/web/research/itineraries/relation-with-surrounding-areas/research-centres-institutes/instituto-centro-de-investigacion-1345467963242.html?param1=1345659470618
) (UAB stands for *Universitat Autònoma de Barcelona*).
That's still just about *USP Hospitales*;
there are several other homonyms.

United States Pharmacopeia
(*Farmacopea de los Estados Unidos*)
[is another name with the same acronym](https://www.usp.org/).

USP may also mean
`University of the South Pacific` (intergovernmental organization, Oceania).

It might be pretty hard to distinguish between them.
To do that we'll need some other information
(hopefully, just the country),
but we know some words that shouldn't appear
in any *University of São Paulo* entry
and that can appear in these other names:

- `Palmaplanas`
- `Quirónsalud`
- `Dexeus`
- `Pharmacopeia`
- `Farmacopea`
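As a rough illustration of how such a word list could be applied, the sketch below flags institution names containing any of these words, ignoring case and accents. The helper names (`strip_accents`, `looks_like_usp_homonym`) are hypothetical and not part of this notebook's pipeline, and the accent stripping uses the standard library instead of `unidecode`.

```python
import re
import unicodedata

# Words taken from the list above; none should appear in a
# University of São Paulo affiliation
HOMONYM_WORDS = ["Palmaplanas", "Quirónsalud", "Dexeus",
                 "Pharmacopeia", "Farmacopea"]


def strip_accents(text):
    """Lowercase the text and drop combining marks (e.g. "ó" -> "o")."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


# Hypothetical helper regex: any of the words, after normalization
HOMONYM_REGEX = re.compile(
    "|".join(re.escape(strip_accents(word)) for word in HOMONYM_WORDS)
)


def looks_like_usp_homonym(institution):
    """Tell whether a normalized institution name has a homonym word."""
    return HOMONYM_REGEX.search(strip_accents(institution)) is not None


examples = [
    "Clínica USP- Palmaplanas",
    "USP Instituto Universitario Dexeus",
    "Universidade de São Paulo (USP)",
]
flags = [looks_like_usp_homonym(name) for name in examples]
```

A name such as `Hospital USP La Colina` has none of these words, so a filter like this can only complement the country information, not replace it.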

Let's see all entries having the `USP` acronym
that aren't from Brazil.

In [55]:
usp_notbr = wl_dataset[
    ~wl_dataset["mb_br"] & wl_dataset["has_usp"]
].drop(columns=ac_cols2remove)
usp_notbr.set_index("pid")

Unnamed: 0_level_0,collection,institution,country,state,city,is_br,is_sp,mb_br,mb_sp
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
S0102-64451988000100008,scl,USP,EUA,,Madison,False,False,False,True
S0004-06142006000600003,esp,USP-Policlínico Santa Teresa,España,,La Coruña,False,False,False,True
S1139-76322010000500053,esp,Instituto Universitario Dexeus USP,España,,Barcelona,False,False,False,True
S1139-76322010000500023,esp,USP Instituto Universitario Dexeus,España,,Barcelona,False,False,False,True
S1139-76322010000500037,esp,USP Instituto Universitario Dexeus,España,,Barcelona,False,False,False,True
S0210-56912011000200013,esp,Hospital USP La Colina,España,,Santa Cruz de Tenerife,False,False,False,True
S0210-56912011000400014,esp,Hospital USP La Colina,España,,Santa Cruz de Tenerife,False,False,False,True
S0210-56912012000300008,esp,Clínica USP- Palmaplanas,España,,Palma de Mallorca,False,False,False,True
S1130-01082012001100014,esp,USP Hospital Marbella,Spain,,Málaga,False,False,False,True
S1405-99402013000300002,mex,Hospital USP,Spain,,Vitoria,False,False,False,True


From these, only the one with `Capital` as the country name
belongs to *University of São Paulo*.
All other entries are from some of those other USP-named institutions.

In [56]:
country_counts["Capital"]  # It's Brazil!

1

In [57]:
solution.loc[
    usp_notbr.index,
    ["usp", "unesp", "unicamp", "embrapa"]
] = usp_notbr.assign(
    usp=1. * (usp_notbr["country"] == "Capital"),
    unesp=0.,
    unicamp=0.,
    embrapa=0.,
)[["usp", "unesp", "unicamp", "embrapa"]]

How about the entries that aren't from São Paulo?

In [58]:
usp_notsp = wl_dataset[
    wl_dataset["mb_br"] &
    ~wl_dataset["is_sp"] &
    ~wl_dataset["state"].isna() &
    wl_dataset["has_usp"]
].drop(columns=ac_cols2remove).sort_values(by="pid")
usp_notsp.set_index("pid")

Unnamed: 0_level_0,collection,institution,country,state,city,is_br,is_sp,mb_br,mb_sp
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
S0080-62342008000200011,rve,USP,Brasil,PR,Londrina,True,False,True,False
S0080-62342008000300019,rve,USP,Brasil,BA,Salvador,True,False,True,False
S0103-166X2003000300005,scl,USP,,PR,Londrina,False,False,True,False
S0104-07072007000300006,rve,USP,,Brasil,Ceará,False,False,True,False
S0104-07072010000300006,rve,USP,,Brasil,São Paulo,False,False,True,True
S0104-07072012000300016,rve,USP,,Brasil,São Paulo,False,False,True,True
S0104-12902009000200017,spa,USP,Brasil,MG,Belo Horizonte,True,False,True,False
S0104-12902009000300011,spa,USP,Brasil,MG,Uberaba,True,False,True,False
S0104-12902011000400011,spa,USP,,Brasil,São Paulo,False,False,True,True
S1413-81232005000500012,spa,USP,Brazil,RJ,Rio de Janeiro,True,False,True,False


All these entries are from some actual Brazilian city,
and sometimes the `state` field has the country name.
The long institution name is just a pair of universities in a single entry.

In [59]:
usp_notsp["institution"].unique()

array(['USP',
       'Universidade de São Paulo (USP); Universidade Federal do Triângulo Mineiro (UFTM)'],
      dtype=object)
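Rows where the `state` field holds the country name can be spotted programmatically. The sketch below is only an illustration on three rows copied from the table above; the rule itself (an empty `country` plus a country name in `state` means the fields were shifted by one column) is an assumption, and `COUNTRY_NAMES`, `shifted` and `fixed` are hypothetical names.

```python
import pandas as pd

# Toy rows copied from the table above (only the location fields)
rows = pd.DataFrame({
    "country": ["Brasil", None, None],
    "state": ["PR", "Brasil", "Brasil"],
    "city": ["Londrina", "Ceará", "São Paulo"],
})

# Assumption: a country name in `state` together with an empty `country`
# means the fields were shifted one column to the right
COUNTRY_NAMES = {"brasil", "brazil"}
shifted = (
    rows["country"].isna() &
    rows["state"].str.lower().isin(COUNTRY_NAMES)
)

# Shift the values back for the flagged rows only
fixed = rows.copy()
fixed.loc[shifted, "country"] = rows.loc[shifted, "state"]
fixed.loc[shifted, "state"] = rows.loc[shifted, "city"]
fixed.loc[shifted, "city"] = None
```

After the shift, the `state` column of the flagged rows holds whatever was in `city` (here `Ceará` and `São Paulo`), which still has to be checked against the list of Brazilian states.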

Of the cities in these entries,
only *São Paulo* and *Ribeirão Preto*
(written as `Ribeiro Preto` in the data)
are in the State of *São Paulo*,
and both have USP campi.

In [60]:
sorted(usp_notsp["city"].unique())

['Belo Horizonte',
 'Cascavel',
 'Ceará',
 'Goiânia',
 'Londrina',
 'Maringá',
 'Ribeiro Preto',
 'Rio de Janeiro',
 'Salvador',
 'São Paulo',
 'Uberaba']

The entries with the other cities are:

- [S0080-62342008000200011](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0080-62342008000200011&lng=en&nrm=iso&tlng=en
  ): In 2008, [Denise Rodrigues Costa Schmidt](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4130285P2
  ) was in her doctorate at USP.
  She has worked for UEL (Londrina, PR) since 2001;
- [S0080-62342008000300019](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0080-62342008000300019&lng=en&nrm=iso&tlng=en
  ): In 2008, [Darci de Oliveira Santa Rosa](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4792909D8
  ) was no longer in her doctorate (finished in 1999 at USP).
  That seems like part of a bio,
  not something related to that specific document;
- [S0103-166X2003000300005](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0103-166X2003000300005&lng=en&nrm=iso&tlng=en
  ): In 2003, [Jocelaine Martins da Silveira](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4707774P7
  ) was in her doctorate at USP.
  From 1994 until 2006, she worked at UEL (Londrina, PR);
- [S0104-07072007000300006](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0104-07072007000300006&lng=en&nrm=iso&tlng=en
  ): In 2007, [Ana Patrícia Pereira Morais](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4799339Z6
  ) was in her doctorate at USP.
  She has worked for UECE (CE/Ceará) since 2000;
- [S0104-12902009000200017](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0104-12902009000200017&lng=en&nrm=iso&tlng=en
  ): Belo Horizonte (MG) is just
  a personal or postal/mailing address in the text.
  This entry has a bogus "Unicamp" in the text as part of some bio,
  but, as desired and as shown below, it's not in the data;
- [S0104-12902009000300011](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0104-12902009000300011&lng=en&nrm=iso&tlng=en
  ): Uberaba (MG) is just
  a personal or postal/mailing address in the text;
- [S1413-81232005000500012](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1413-81232005000500012&lng=en&nrm=iso&tlng=en
  ): In 2005, [Cláudio Picanço Magalhães](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4797783Y2
  ) had already finished his doctorate (USP, from 1998 to 2002),
  and was working for Faculdade JK (from 2001 until 2006).
  On the other hand, Illona Maria de Brito Sá Stoppelli
  (who doesn't have a Lattes)
  finished her doctorate [in 2005](
    http://www.teses.usp.br/teses/disponiveis/18/18139/tde-25062005-192546/en.php
  ), and Rio de Janeiro (RJ) looks like
  her personal or postal/mailing address;
- [S1415-52731998000200001](
    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-52731998000200001&lng=en&nrm=iso&tlng=en
  ): In 1998, [Maria Margareth Veloso Naves](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4795292Z2
  ) was in her doctorate at USP.
  She has worked for UFG (GO/Goiás) since 1983;
- S1517-38522012000300023 [\[PDF\]](
    http://www.repositorio.ufc.br/bitstream/riufc/12322/1/2012_art_bfmartins.pdf
  ): In 2012, [Maycon Rogério Seleghim](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4240127Y2
  ) was in his doctorate at USP,
  as well as in a remote teaching job for UEM (Maringá, PR);
- S1518-19442008000200020 [\[PDF\]](
    https://www.fen.ufg.br/fen_revista/v10/n2/pdf/v10n2a21.pdf
  ): In 2008, [Ariana Rodrigues da Silva Carvalho](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4758442D8
  ) was in her doctorate at USP.
  She has worked for UNIOESTE (Cascavel, PR) since 2003;
- [S1518-61482009000200008](
    http://pepsic.bvsalud.org/scielo.php?script=sci_arttext&pid=S1518-61482009000200008&lng=en&nrm=iso&tlng=en
  ): In 2009, [Perla Klautau](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4760122U9
  ) was in her postdoc at USP.
  Rio de Janeiro (RJ) is
  her personal or postal/mailing address;
- [S1806-24902014000100004](
    http://pepsic.bvsalud.org/scielo.php?script=sci_arttext&pid=S1806-24902014000100004&lng=en&nrm=iso&tlng=en
  ): In 2014, [Ana Carolina Zuanazzi Fernandes](
    http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4478464U7
  ) was in her master's program at USP.
  Londrina (PR) is
  her personal or postal/mailing address;
- [S2316-51972013000100007](
    http://pepsic.bvsalud.org/scielo.php?script=sci_arttext&pid=S2316-51972013000100007&lng=en&nrm=iso&tlng=en
  ): An author from USP, another from UFTM.

Then, just `S0080-62342008000300019` isn't from USP.
These entries have also been manually checked
to see whether they belong to UNESP, Unicamp or EMBRAPA,
and it's clear they have nothing to do with these.

In [61]:
# No bogus "Unicamp" entry for this PID!
dataset[dataset["pid"] == "S0104-12902009000200017"]["institution"].unique()

array(['Universidade de São Paulo',
       'Centro de Estudos, Pesquisa e Documentação em Cidades Saudáveis',
       'USP'], dtype=object)

In [62]:
solution.loc[usp_notsp.index, "usp"] = \
    1. * (usp_notsp["pid"] != "S0080-62342008000300019")
solution.loc[
    solution["pid"].isin(
        usp_notsp[
            ~usp_notsp["city"].isin(["São Paulo", "Ribeiro Preto"])
        ]["pid"]
    ),
    ["unesp", "unicamp", "embrapa"]
] = 0.

#### South Pacific in this dataset

*University of the South Pacific* also appears in this dataset,
but it's still unknown if it appears as "USP".
Below, only the
*Centre for Oceanographic Research in the Eastern South-Pacific*
(COPAS)
isn't part of that university,
but it's not *University of São Paulo* either.
Although it has "Eastern" in its name,
it's actually from [Chile](http://www.copas.cl/eng/).

In [63]:
south_pacific = wl_dataset[
    wl_dataset["ipre"].str.contains("south pacific")
].drop(columns=ac_cols2remove)
south_pacific.set_index("pid")

Unnamed: 0_level_0,collection,institution,country,state,city,is_br,is_sp,mb_br,mb_sp
pid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
S1517-83822006000200011,scl,University of the South Pacific,Fiji,,Suva,False,False,False,True
S1516-14392010000400006,scl,University of the South Pacific,Fiji,,Suva,False,False,False,True
S1516-14392011000400004,scl,The University of the South Pacific,,,Fiji,False,False,True,True
S1516-14392012000200019,scl,University of the South Pacific,Fiji,,Suva,False,False,False,True
S1516-14392014000100026,scl,The University of the South Pacific,Fiji,,Suva,False,False,False,True
S0250-71611999007500004,chl,The University of the South Pacific,Fiji,,,False,False,False,True
S0716-078X2004000400004,chl,Centre for Oceanographic Research in the Easte...,,,,False,False,True,True
S0717-34581998000100001,chl,University of the South Pacific Apia Western S...,Samoa,,,False,False,False,True
S0034-77442007000300004,cri,The University of the South Pacific,Fiji,,,False,False,False,True
S2182-12672016000100011,prt,University of South Pacific,Fiji,,Suva,False,False,False,True


In [64]:
south_pacific["institution"].unique()

array(['University of the South Pacific',
       'The University of the South Pacific',
       'Centre for Oceanographic Research in the Eastern South-Pacific (COPAS)',
       'University of the South Pacific Apia Western Samoa',
       'University of South Pacific'], dtype=object)

In [65]:
solution.loc[
    south_pacific.index,
    ["usp", "unesp", "unicamp", "embrapa"]
] = 0.
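The distinction between the university and COPAS can also be checked by name alone. The sketch below is a hedged illustration using the five institution names from the output above; the `SOUTH_PACIFIC_UNIVERSITY` pattern is a hypothetical one, not something used elsewhere in this notebook.

```python
import re

# Institution names copied from `south_pacific["institution"].unique()`
names = [
    "University of the South Pacific",
    "The University of the South Pacific",
    "Centre for Oceanographic Research in the Eastern South-Pacific (COPAS)",
    "University of the South Pacific Apia Western Samoa",
    "University of South Pacific",
]

# Hypothetical pattern: "university of", an optional "the",
# then "south pacific" separated by whitespace
SOUTH_PACIFIC_UNIVERSITY = re.compile(
    r"\buniversity\s+of\s+(?:the\s+)?south\s+pacific\b",
    re.IGNORECASE,
)

is_university = [
    bool(SOUTH_PACIFIC_UNIVERSITY.search(name)) for name in names
]
```

Only the COPAS name fails to match, since it has no "University of" before "South-Pacific". Either way, none of these entries belong to the four institutions of interest, which is why all of them get a zero score.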