# Applying FastText directly on ISO data

In [1]:
from functools import reduce
from pathlib import Path
import re
import warnings

from gensim.models import FastText
import numpy as np
import pandas as pd
from sklearn import ensemble, metrics, model_selection
from unidecode import unidecode

from ioisis import iso

In [2]:
warnings.filterwarnings(
    action="ignore",
    message="(?:.* and F-score are ill-defined|F-score is ill-defined.*)",
    category=metrics.classification.UndefinedMetricWarning,
)

## Loading ISO data with [`ioisis`](https://github.com/scieloorg/ioisis) 0.1.0 and Pandas

The next cell sets the directory holding the ISO data.
It can be the mounted `isis2mongo` source data directory,
a copy of it,
or a symbolic link to the actual path.
For this experiment,
we're using [this ISIS ISO dump](
  https://drive.google.com/open?id=101-oKPeKF2LM0L2uO_dYL9fp0eKOCE_-
), which includes the data of the first $200$ article headers
from every SciELO collection.

In [3]:
iso_root_dir = "./2019-11-13_iso200/"

The structure of this directory should follow the `COLLECTION/DATASET.iso` pattern,
where `COLLECTION` is the three-letter lowercase code of a collection,
and `DATASET` is the name of a dataset
(`title`, `artigo`, `bib4cit` or `issue`).
For now we'll use just the `artigo.iso` files:

In [4]:
iso_file_names_map = {iso_file.parts[-2]: str(iso_file.resolve())
                      for iso_file in Path(iso_root_dir).glob("???/artigo.iso")}

These functions will let us get the first records
from each of these datasets.
We are going to use only the records of the *header* type,
since the `artigo.iso` data files have more than one schema:
one can think of it as a distinct "table"
for each value of the field $706$.

In [5]:
def filter_headers(iterator):
    for el in iterator:
        if el["706"] == ["h"]:
            yield el

In [6]:
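def take(n, iterator):
    # range(n) comes first in zip, so no extra element is consumed
    # from the iterator once n elements have been taken
    return [el for _, el in zip(range(n), iterator)]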
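
As a small illustration (using a toy iterator instead of ISO records), `take` consumes only the requested elements, so calling it again on the same iterator continues from where the previous call stopped:

```python
def take(n, iterator):
    return [el for _, el in zip(range(n), iterator)]

numbers = iter(range(10))  # Toy stand-in for a generator of ISO records
first = take(3, numbers)   # First three elements
second = take(3, numbers)  # Next three: nothing was skipped in between
rest = take(100, numbers)  # Asking for more than what is left is fine
print(first, second, rest)
```

This behavior matters later on, when the same generators are reused to fetch a second batch of records.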

Let's get the first record from the SciELO Brazil ISO file
to see what it looks like:

In [7]:
scl1, = take(1, filter_headers(iso.iter_records(iso_file_names_map["scl"])))
scl1

defaultdict(list,
            {'1': ['br1.1'],
             '2': ['S0044-5967(04)03400101'],
             '4': ['v34n1'],
             '10': ['^rND^1A01^nSilvio Roberto Miranda dos^sSantos',
              '^rND^1A01^nIzildinha de Souza^sMiranda',
              '^rND^1A01^nManoel Malheiros^sTourinho'],
             '12': ['Estimativa de biomassa de sistemas agroflorestais das várzeas do rio juba, Cametá, Pará^lpt',
              'Biomass estimation of agroforestry systems of the Juba river floodplain in Cametá, Pará^len'],
             '14': ['^f01^l08'],
             '30': ['Acta Amaz.'],
             '31': ['34'],
             '32': ['1'],
             '35': ['0044-5967'],
             '38': ['ILUS', 'TAB'],
             '40': ['pt'],
             '42': ['1'],
             '49': ['AA010'],
             '58': ['Projeto Várzea/UFRA'],
             '65': ['20040000'],
             '70': ['Universidade Federal Rural da Amazônia-UFRA^iA01^cBelém^sPA^pBrasil^e<a href="mailto:izildinha@ufra.

It's a dictionary of lists of strings,
in the `{field_id: [field, field, field, ...]}` format.
Each field string can have multiple subfields,
where each subfield starts with a `^` symbol
followed by its single-char identifier
(any text before the first `^` gets an empty identifier).
This regular expression converts a field string
to a list of subfield pairs
(similar to the format type 2 of
[isis2json](https://github.com/scieloorg/isis2json)):

In [8]:
get_all_subfields_pairs = re.compile(r"(?:^|\^(.))([^^]*)").findall
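
To illustrate, here is that expression applied to two shortened field strings modeled on the record above. This is a sketch that assumes Python 3.7 or newer, since older versions handle the empty match at the start of the string differently:

```python
import re

get_all_subfields_pairs = re.compile(r"(?:^|\^(.))([^^]*)").findall

# Text before the first "^" is paired with an empty identifier
with_prefix = get_all_subfields_pairs("UFRA^iA01^cBelém")
print(with_prefix)

# A field starting with "^" yields an empty leading pair,
# which is dropped here by keeping only the non-empty values
contrib = {k: v for k, v in get_all_subfields_pairs("^rND^1A01^sSantos") if v}
print(contrib)
```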

What matters most for us
is a way to get the pairs of affiliation and contributors
with their data in a "first normal form" sense joined together,
keeping the affiliation data even if it has no linked contributor.
We can get a dataframe like that with Pandas:

In [9]:
def field_df(record, field_id):
    return pd.DataFrame([
        {k: v for k, v in get_all_subfields_pairs(field) if v}
        for field in record[field_id]
    ]).rename(columns=lambda name: field_id + (name or "_"))

In [10]:
scl1_affs = field_df(scl1, "70")
scl1_affs

| | 70_ | 70i | 70c | 70s | 70p | 70e |
|---|---|---|---|---|---|---|
| 0 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | `<a href="mailto:izildinha@ufra.edu.br">izildin...` |


In [11]:
scl1_contribs = field_df(scl1, "10")
scl1_contribs

| | 10r | 101 | 10n | 10s |
|---|---|---|---|---|
| 0 | ND | A01 | Silvio Roberto Miranda dos | Santos |
| 1 | ND | A01 | Izildinha de Souza | Miranda |
| 2 | ND | A01 | Manoel Malheiros | Tourinho |


In [12]:
scl1_aff_contrib_pairs = pd.merge(scl1_affs, scl1_contribs,
                                  how="left", left_on="70i", right_on="101")
scl1_aff_contrib_pairs

| | 70_ | 70i | 70c | 70s | 70p | 70e | 10r | 101 | 10n | 10s |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | `<a href="mailto:izildinha@ufra.edu.br">izildin...` | ND | A01 | Silvio Roberto Miranda dos | Santos |
| 1 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | `<a href="mailto:izildinha@ufra.edu.br">izildin...` | ND | A01 | Izildinha de Souza | Miranda |
| 2 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | `<a href="mailto:izildinha@ufra.edu.br">izildin...` | ND | A01 | Manoel Malheiros | Tourinho |
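
The record above has a single affiliation, so it doesn't show what the left join does with an affiliation that has no contributor. A minimal sketch with toy data (the names and IDs below are made up for illustration):

```python
import pandas as pd

# Toy data: the affiliation A02 has no contributor linked to it
affs = pd.DataFrame({"70_": ["Org A", "Org B"], "70i": ["A01", "A02"]})
contribs = pd.DataFrame({"101": ["A01", "A01"], "10s": ["Santos", "Miranda"]})

# how="left" keeps every affiliation row, filling the missing
# contributor columns with NaN
pairs = pd.merge(affs, contribs, how="left", left_on="70i", right_on="101")
print(pairs)
```

The `Org B` row survives with missing contributor fields, which is the behavior we want for the affiliation data.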


The ISO data has less information
than what we would get from the XML data
(e.g. it has neither the citations nor the issue information),
but it has some other information, like the PID.
We can "join" (in a cartesian product sense)
the multiple values of some fields:

In [13]:
def cartesian_df(*dfs, free_name="_empty"):
    return reduce(
        pd.DataFrame.join,
        [df.assign(**{free_name: 0}).set_index(free_name) for df in dfs],
    ).reset_index(drop=True)
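
As a toy illustration of this cartesian join (the values below are made up), every dataframe gets the same constant index, so joining on it pairs each row with every row of the other dataframes:

```python
from functools import reduce

import pandas as pd

def cartesian_df(*dfs, free_name="_empty"):
    return reduce(
        pd.DataFrame.join,
        [df.assign(**{free_name: 0}).set_index(free_name) for df in dfs],
    ).reset_index(drop=True)

pids = pd.DataFrame({"pid": ["S0001"]})
titles = pd.DataFrame({"title": ["Título", "Title"], "lang": ["pt", "en"]})
years = pd.DataFrame({"year": ["2004"]})

# 1 row x 2 rows x 1 row = 2 rows in the result
product = cartesian_df(pids, titles, years)
print(product)
```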

In [14]:
scl1_other = cartesian_df(field_df(scl1, "2"), field_df(scl1, "12"), field_df(scl1, "936"))
scl1_other

| | 2_ | 12_ | 12l | 936i | 936y | 936o |
|---|---|---|---|---|---|---|
| 0 | S0044-5967(04)03400101 | Estimativa de biomassa de sistemas agroflorest... | pt | 0044-5967 | 2004 | 1 |
| 1 | S0044-5967(04)03400101 | Biomass estimation of agroforestry systems of ... | en | 0044-5967 | 2004 | 1 |


That article has a title in more than one language.
We can use the cartesian product again to join these rows
with the affiliation-contributor pairs
in order to get a dataframe from a single ISO record,
and we can also replace these field/subfield IDs
by some meaningful names:

In [15]:
COLUMN_NAMES = {
    "70_": "orgname",
    "701": "orgdiv1",
    "702": "orgdiv2",
    "703": "orgdiv3",
    "704": "normalized",
    "708": "c8",
    "709": "original",
    "70e": "email",
    "70c": "city",
    "70d": "division",
    "70s": "state",
    "70p": "country",
    "70q": "country_full",
    "70z": "zip",
    "70i": "affid",
    "70l": "label",
    "101": "affid",
    "10s": "surname",
    "10n": "names",
    "10p": "prefix",
    "10z": "suffix",
    "10r": "role",
    "10k": "orcid",
    "2_": "pid",
    "12_": "title",
    "12l": "title_lang",
    "936i": "issn",
    "936y": "year",
    "936o": "number",
}

In [16]:
def record2df(record):
    affs = field_df(record, "70")
    if "70i" not in affs.columns:
        return pd.DataFrame()
    
    contribs = field_df(record, "10")
    if "101" not in contribs.columns:
        contribs = pd.DataFrame(columns=["101"])
        
    return cartesian_df(
        pd.merge(affs, contribs,
                 how="left", left_on="70i", right_on="101"),
        *(field_df(record, field_id)
          for field_id in ["2", "12", "936"]),
    ).drop(columns="101").rename(columns=COLUMN_NAMES)

In [17]:
record2df(scl1).drop(columns=["email"])

| | orgname | affid | city | state | country | role | names | surname | pid | title | title_lang | issn | year | number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | ND | Silvio Roberto Miranda dos | Santos | S0044-5967(04)03400101 | Estimativa de biomassa de sistemas agroflorest... | pt | 0044-5967 | 2004 | 1 |
| 1 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | ND | Silvio Roberto Miranda dos | Santos | S0044-5967(04)03400101 | Biomass estimation of agroforestry systems of ... | en | 0044-5967 | 2004 | 1 |
| 2 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | ND | Izildinha de Souza | Miranda | S0044-5967(04)03400101 | Estimativa de biomassa de sistemas agroflorest... | pt | 0044-5967 | 2004 | 1 |
| 3 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | ND | Izildinha de Souza | Miranda | S0044-5967(04)03400101 | Biomass estimation of agroforestry systems of ... | en | 0044-5967 | 2004 | 1 |
| 4 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | ND | Manoel Malheiros | Tourinho | S0044-5967(04)03400101 | Estimativa de biomassa de sistemas agroflorest... | pt | 0044-5967 | 2004 | 1 |
| 5 | Universidade Federal Rural da Amazônia-UFRA | A01 | Belém | PA | Brasil | ND | Manoel Malheiros | Tourinho | S0044-5967(04)03400101 | Biomass estimation of agroforestry systems of ... | en | 0044-5967 | 2004 | 1 |


## Building a small dataset

This dataset, which is stored in a single `pd.DataFrame` instance,
is built from only the first $100$ header records of each collection.

In [18]:
datagens = {k: filter_headers(iso.iter_records(v))
            for k, v in iso_file_names_map.items()}
dataset = pd.concat([
    record2df(record).assign(collection=k)
    for k, v in datagens.items()
    for record in take(100, v)
], axis="index", ignore_index=True, sort=True)

How many entries are there?

In [19]:
dataset.shape

(9022, 23)

Let's see some statistics for this data:

In [20]:
dataset.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| 12s | 4 | 1 | a review | 4 |
| affid | 9022 | 26 | A01 | 4548 |
| city | 4517 | 403 | São Paulo | 319 |
| collection | 9022 | 23 | mex | 1070 |
| country | 5946 | 85 | Brasil | 937 |
| email | 1744 | 665 | jemunozf@palmira.unal.edu.co | 18 |
| issn | 9022 | 25 | 0043-3144 | 1250 |
| label | 1402 | 9 | 1 | 964 |
| names | 8496 | 2373 | M | 98 |
| number | 9022 | 11 | 1 | 3144 |


### Pre-normalization of `country` and `state`

The data is simple but quite noisy.

In [21]:
country_counts = dataset["country"].value_counts()
print("Total countries:", len(country_counts))
country_counts[country_counts < 3]

Total countries: 85


COLOMBIA                    2
Honduras                    2
Austria                     2
Poland                      2
Norway                      2
Colombia.                   2
Australia                   2
Rusia                       2
Bélgica                     2
The Netherlands             2
UK                          2
Hungary                     2
the Netherlands             2
República Dominicana        2
Netherlands                 2
Venezuela,                  2
England                     2
Cuba<A NAME="cargo"></A>    2
Belgium                     2
CostaRica                   1
Reino Unido                 1
New Zealand                 1
Pennsylvania                1
Grecia                      1
Namibia                     1
Republic of China           1
Name: country, dtype: int64

Some of these country names are noisy
(upper/lower case, dot, accents, HTML tag, etc.),
but some might be just uncommon.
Let's try to pre-normalize these:

In [22]:
pd.DataFrame([
    country_counts.index.to_series().rename("country"),
    country_counts.index.to_series()
        .map(unidecode)
        .str.lower()
        .str.replace(r"<.*>", "", regex=True)  # Strip HTML tag remnants
        .str.replace(r"[^a-z]", "", regex=True)  # Keep only a-z letters
        .rename("norm"),
    country_counts.rename("count"),
]).T.infer_objects().groupby("norm").apply(
    lambda df: pd.DataFrame() if df.shape[0] == 1 else
                df.assign(norm=df["count"].idxmax())
).reset_index(drop=True)

| | country | norm | count |
|---|---|---|---|
| 0 | Argentina | Argentina | 259.0 |
| 1 | (Argentina) | Argentina | 6.0 |
| 2 | Colombia | Colombia | 497.0 |
| 3 | COLOMBIA | Colombia | 2.0 |
| 4 | Colombia. | Colombia | 2.0 |
| 5 | Costa Rica | Costa Rica | 36.0 |
| 6 | CostaRica | Costa Rica | 1.0 |
| 7 | Cuba | Cuba | 62.0 |
| 8 | `Cuba<A NAME="cargo"></A>` | Cuba | 2.0 |
| 9 | Mexico | Mexico | 874.0 |


That's not completely normalized:
the typos aren't fixed,
translated names are still kept "as is",
and there might be acronyms and distinct writings
for the same country.
However, that should suffice for now.
From that, we can write a simple function
that creates a pre-normalization map
(to be used with `pd.Series.replace`):

In [23]:
def prenorm_map(column):
    counts = column.value_counts()
    return pd.DataFrame([
        counts.index.to_series().rename(column.name),
        counts.index.to_series()
            .map(unidecode)
            .str.lower()
            .str.replace(r"<.*>", "", regex=True)  # Strip HTML tag remnants
            .str.replace(r"[^a-z]", "", regex=True)  # Keep only a-z letters
            .rename("norm"),
        counts.rename("count"),
    ]).T.infer_objects().groupby("norm").apply(
        lambda df: pd.DataFrame() if df.shape[0] == 1 else
                    df.assign(norm=df["count"].idxmax())
                              .drop(df["count"].idxmax())
    ).set_index(column.name)["norm"]

This is the map:

In [24]:
country_prenorm_map = prenorm_map(dataset["country"])
country_prenorm_map

country
(Argentina)                       Argentina
COLOMBIA                           Colombia
Colombia.                          Colombia
CostaRica                        Costa Rica
Cuba<A NAME="cargo"></A>               Cuba
México                               Mexico
Peru                                   Perú
the Netherlands             The Netherlands
U.S.A.                                  USA
U.S.A                                   USA
Venezuela.                        Venezuela
Venezuela,                        Venezuela
Name: norm, dtype: object
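
To make the grouping key used by `prenorm_map` concrete, here is a minimal standard-library sketch of it. The `prenorm_key` helper is hypothetical, and it uses `unicodedata` as a stand-in for `unidecode`, which is an assumption that only holds for accents that can be decomposed:

```python
import re
import unicodedata

def prenorm_key(name):
    """Hypothetical helper mirroring the grouping key of prenorm_map."""
    # Stand-in for unidecode: decompose and drop the combining accents
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = ascii_name.lower()
    without_tags = re.sub(r"<.*>", "", lowered)  # Strip HTML tag remnants
    return re.sub(r"[^a-z]", "", without_tags)  # Keep only a-z letters

keys = {raw: prenorm_key(raw)
        for raw in ["COLOMBIA", "Colombia.", 'Cuba<A NAME="cargo"></A>',
                    "CostaRica", "Costa Rica", "México"]}
for raw, key in keys.items():
    print(repr(raw), "->", key)
```

Writings that share a key fall in the same group, and the most frequent writing of each group becomes its normalized name.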

The same `prenorm_map` function can be used to normalize the state names.

In [25]:
state_prenorm_map = prenorm_map(dataset["state"])
state_prenorm_map

state
Am                                                   AM
Buenos Aires                              Buenos Aires.
Estado Bolívar                           estado Bolívar
Fl                                                   FL
Provincia de Buenos Aires.    Provincia de Buenos Aires
Rondonia                                       Rondônia
SC.                                                  SC
Name: norm, dtype: object

Unfortunately, *Buenos Aires* got a trailing dot,
because the most frequent writing in each group is the one that is kept.
That's okay for now:
we just need to group the distinct writings
of what clearly looks like the same place.
There are many distinct states, though:

In [26]:
state_counts = dataset["state"].value_counts()
print("Total states:", len(state_counts))
state_counts[state_counts < 3]

Total states: 146


Montecillo          2
Junín               2
Alta Verapaz        2
Colorado            2
Coyoacán            2
Rondônia            2
Puno                2
BC                  2
Ciudad de México    2
Zulia               2
Washington, DC      2
Argentina.          2
Coahuila            2
Rio de Janeiro      2
Santa Teresa        2
Ceará               2
Connecticut         2
Genève              2
RR                  2
Portuguesa          2
Tabasco             2
Bahia               2
TO                  2
Chihuahua           2
Anzóategui          2
MO                  2
Espírito Santo      2
Lara                2
Rondonia            2
Fl                  2
Illinois            2
Edo. Mexico         2
Rennes              2
Antioquia           2
Santa Catarina      1
WA                  1
Freiburg            1
Madrid              1
Ciudad Real         1
CA                  1
Philadelphia        1
Name: state, dtype: int64

### A really simple FastText algorithm on raw `orgname` to detect `country` and `state`

From all the selected fields, and using the entire dataset,
are we able to detect the `country` and the `state`
using a FastText model?
It's hard to know and perhaps also hard to try,
so let's keep it simple for now:
can we detect the `country` and `state` fields
from the `orgname` alone using FastText?

For now, let's use a really simple model,
one that stores every `orgname` as a vector with just $4$ numbers.
That should avoid issues with overfitting.

In [27]:
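# Each "sentence" is a single token holding the entire orgname
sentences = dataset["orgname"].dropna().apply(lambda orgname: [orgname])
# Note: the "size" parameter was renamed to "vector_size" in gensim 4.0
orgname_model = FastText(size=4, window=5, min_count=1)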
orgname_model.build_vocab(sentences=sentences)
orgname_model.train(
    sentences=sentences,
    total_examples=len(sentences),
    epochs=15,
)

Every `orgname` now gets represented by a vector like this:

In [28]:
orgname_model.wv["USP"]

array([ 0.00727715, -0.01557652, -0.03035402,  0.00647145], dtype=float32)

But it's not so meaningful for us.
Given the data we used to train it,
this model should not be regarded as "semantic":

In [29]:
orgname_model.wv["Universidade de São Paulo"]  # Other writing for USP

array([0.02402348, 0.00684705, 0.00068171, 0.00122255], dtype=float32)

In [30]:
orgname_model.wv.similar_by_word("USP")

[('Professor of the Inter-institutional Postgraduate in Animal Production',
  0.9954269528388977),
 ('Universidad de Costa Rica Universidad Tecnológica Indoamérica',
  0.9834996461868286),
 ('Universidad del Quindío', 0.9822467565536499),
 ('Universidad del País Vasco', 0.9767181873321533),
 ('SBPC', 0.9696468710899353),
 ('Universidad de Carabobo', 0.9617018103599548),
 ('Intituto Nacional de Seguros', 0.9605212211608887),
 ('Universidad del Valle', 0.9558581709861755),
 ('Universidad de Concepción, Chile', 0.942765474319458),
 ('INPA', 0.9378964900970459)]
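
Part of the reason is that FastText composes its vectors with the help of character n-grams of each token, wrapped in `<` and `>` boundary markers. A minimal sketch of that extraction (the `char_ngrams` helper is hypothetical and not part of gensim; the 3 to 6 range is assumed from the library's default `min_n` and `max_n`) shows that the two writings of USP share no n-gram at all:

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Hypothetical helper: character n-grams with boundary markers."""
    wrapped = f"<{token}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]

usp = set(char_ngrams("USP"))
full = set(char_ngrams("Universidade de São Paulo"))
print(sorted(usp))
print("Shared n-grams:", usp & full)
```

With nothing in common at the character level, any proximity between these two names would have to come from the training contexts, and single-token sentences provide none.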

In [31]:
def build_rf_model_1to1(
    fasttext_model, fasttext_field, raw_data, field, norm_map,
    test_size=.5,
    random_states=(42, 101),
    n_estimators=20,
    criterion="gini",
):
    dataset_filled = raw_data[~raw_data[field].isna()]
    X = fasttext_model.wv[dataset_filled[fasttext_field].fillna("")]
    y = dataset_filled[field].replace(norm_map)

    y_counts = y.value_counts()
    y_strata = y.replace(y_counts[y_counts == 1].index, "")

    X_train, X_test, y_train, y_test = \
        model_selection.train_test_split(
            X, y,
            test_size=test_size,
            random_state=random_states[0],
            stratify=y_strata,
        )

    y_train_counts = y_train.value_counts()
    y_weights = y_train.replace(1 / y_train_counts).values

    rf_model = ensemble.RandomForestClassifier(
        random_state=random_states[1],
        n_estimators=n_estimators,
        criterion=criterion,
    )
    rf_model.fit(X_train, y_train, sample_weight=y_weights)
    return rf_model, X_test, y_test
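
Two details of the function above are the stratification labels and the sample weights. The sketch below reproduces both ideas on toy labels, using an equivalent formulation with `Series.map` and `Series.where` instead of the `replace` calls above (an assumption made for clarity, not the notebook's exact code):

```python
import pandas as pd

y = pd.Series(["Brasil"] * 4 + ["Mexico"] * 2 + ["Namibia"])  # Toy labels

# Number of occurrences of each row's own class
per_item_count = y.map(y.value_counts())

# Classes seen only once can't be split in a stratified way,
# so they are grouped together in a single "" stratum
strata = y.where(per_item_count > 1, "")

# Inverse-frequency weights: every class gets the same total weight
weights = 1 / per_item_count
class_totals = weights.groupby(y).sum()
print(strata.tolist())
print(class_totals)
```

Note that the function computes the weights from the training partition only, whereas this sketch uses all the toy labels.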

In [32]:
country_rf_model, country_X_test, country_y_test = build_rf_model_1to1(
    fasttext_model=orgname_model,
    fasttext_field="orgname",
    raw_data=dataset,
    field="country",
    norm_map=country_prenorm_map,
)
country_y_pred = country_rf_model.predict(country_X_test)

In [33]:
print(metrics.classification_report(country_y_test, country_y_pred))

                          precision    recall  f1-score   support

                Alemanha       1.00      1.00      1.00         3
                Alemania       0.17      0.50      0.25         2
                 Antigua       0.80      1.00      0.89         8
     Antigua and Barbuda       1.00      1.00      1.00         4
               Argentina       0.83      0.90      0.86       132
               Australia       1.00      1.00      1.00         1
                 Austria       0.50      1.00      0.67         1
                 Bahamas       1.00      1.00      1.00         2
                Barbados       0.05      1.00      0.09        13
                 Belgium       1.00      1.00      1.00         1
                  Belize       1.00      1.00      1.00         2
                 Bolivia       0.87      0.81      0.84        32
                  Brasil       0.92      0.92      0.92       469
                  Brazil       0.75      0.64      0.69        78
         

In [34]:
state_rf_model, state_X_test, state_y_test = build_rf_model_1to1(
    fasttext_model=orgname_model,
    fasttext_field="orgname",
    raw_data=dataset,
    field="state",
    norm_map=state_prenorm_map,
)
state_y_pred = state_rf_model.predict(state_X_test)

In [35]:
print(metrics.classification_report(state_y_test, state_y_pred))

                           precision    recall  f1-score   support

                       AC       1.00      1.00      1.00         1
                       AM       0.84      0.44      0.57        94
                  Alagoas       1.00      1.00      1.00         3
             Alta Verapaz       1.00      1.00      1.00         1
                 Amazonas       0.67      0.67      0.67        18
                   Ancash       1.00      1.00      1.00         3
                Antioquia       1.00      1.00      1.00         1
               Anzóategui       1.00      1.00      1.00         1
                   Aragua       1.00      1.00      1.00         6
                 Arequipa       0.89      1.00      0.94        17
               Argentina.       1.00      1.00      1.00         1
                       BA       1.00      1.00      1.00         4
                       BC       1.00      1.00      1.00         1
                    Bahia       1.00      1.00      1.00     

Such high values for the macro averages above are somewhat impressive
for two models as simple as these
and for input data that is not so normalized
(some names appeared more than once in distinct languages).

### Applying to the next 100 records from each collection

Let's apply this model on a second dataset
containing the next $100$ article headers
from each collection.
(the generators in `datagens` resume from where the first batch stopped).
Is it able to generalize to names
that have never appeared,
not even in the FastText training?

In [36]:
dataset2 = pd.concat([
    record2df(record).assign(collection=k)
    for k, v in datagens.items()
    for record in take(100, v)
], axis="index", ignore_index=True, sort=True)
dataset2.shape

(7017, 22)

Applying the model and evaluating the country/state
after the pre-normalization step:

In [37]:
d2country = dataset2[["orgname", "country"]].dropna()
d2country["country"] = d2country["country"].replace(
    prenorm_map(d2country["country"])
)
d2country_wvs = orgname_model.wv[d2country["orgname"]]
d2country["pred"] = country_rf_model.predict(d2country_wvs)
print(metrics.classification_report(d2country["country"], d2country["pred"]))

                                    precision    recall  f1-score   support

                             (RSA)       0.00      0.00      0.00         2
                          Alemanha       0.00      0.00      0.00         2
                          Alemania       0.00      0.00      0.00         4
               Antigua and Barbuda       0.60      0.75      0.67        16
                         Argentina       0.07      0.22      0.10        65
                         Australia       0.00      0.00      0.00         8
                           Bahamas       0.00      0.00      0.00        42
                          Barbados       0.03      1.00      0.05         8
                           Belgium       0.00      0.00      0.00         0
                            Belize       0.00      0.00      0.00         0
                           Bolivia       0.86      0.51      0.64       172
                            Brasil       0.33      0.61      0.43       235
           

In [38]:
d2state = dataset2[["orgname", "state"]].dropna()
d2state["state"] = d2state["state"].replace(
    prenorm_map(d2state["state"])
)
d2state_wvs = orgname_model.wv[d2state["orgname"]]
d2state["pred"] = state_rf_model.predict(d2state_wvs)
print(metrics.classification_report(d2state["state"], d2state["pred"]))

                           precision    recall  f1-score   support

                 A Coruña       0.00      0.00      0.00         1
                       AC       0.33      1.00      0.50         4
                       AM       0.44      0.22      0.29       110
                       AZ       0.00      0.00      0.00         6
                     Acre       0.00      0.00      0.00         4
           Aguascalientes       0.00      0.00      0.00         2
                 Alicante       0.00      0.00      0.00         6
                 Amazonas       0.00      0.00      0.00         6
                   Ancash       0.00      0.00      0.00         0
                      Ant       0.00      0.00      0.00         2
                Antioquia       0.00      0.00      0.00         2
                   Aragua       0.67      1.00      0.80        12
                 Arequipa       0.20      1.00      0.33         2
                Argentina       0.00      0.00      0.00     

That's pretty bad.
That probably doesn't have much to do with the random forest model,
since the previous optimistic analysis was already based
on a "testing-only" part of the split dataset
that had no influence on the random forest training.
On the other hand, the (not so orthodox) FastText model
based on "single word" documents
was created using all the data we had available,
so this is evidence that the FastText model
biased the earlier outcomes
towards the first $100$ records from each collection.
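
One way to check how much of the second dataset is new to the FastText vocabulary is to measure the share of `orgname` values that never appeared in the first dataset. A minimal sketch, where `unseen_fraction` is a hypothetical helper and the names are toy data:

```python
def unseen_fraction(train_names, new_names):
    """Hypothetical helper: share of new names absent from the training set."""
    known = set(train_names)
    new_names = list(new_names)
    if not new_names:
        return 0.0
    unseen = [name for name in new_names if name not in known]
    return len(unseen) / len(new_names)

# Toy names; with the notebook's data this could be called as
# unseen_fraction(dataset["orgname"].dropna(), dataset2["orgname"].dropna())
train = ["USP", "Universidad del Valle", "INPA"]
new = ["USP", "Universidade de São Paulo", "UNAM", "INPA"]
fraction = unseen_fraction(train, new)
print(fraction)
```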

There are some new country and state names,
so it was expected that the macro averages would drop
(though not that much).
The new names were:

In [39]:
def reshaped_df(data, n=5):
    """Function created just to display the data as a DataFrame."""
    return pd.DataFrame(
        np.pad(data, [0, (n - len(data)) % n], constant_values="")
        .reshape(-1, n)
    )

In [40]:
d2nd_country = set(d2country["country"].unique()) - set(dataset["country"].unique())
reshaped_df(sorted(d2nd_country))

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | (RSA) | China | Costa | Croatia | EE. UU. |
| 1 | EUA | Estados Unidos de América | Guyana | Holanda | Inglaterra |
| 2 | Iran | Ireland | Irlanda | Italia | Japan |
| 3 | Kenya | Kingston | Lithuania | Malawi | Nueva Zelanda |
| 4 | República Federal de Alemania | San José | Universidad de las Fuerzas Armadas | Zambia | |


In [41]:
d2nd_state = set(d2state["state"].unique()) - set(dataset["state"].unique())
reshaped_df(sorted(d2nd_state))

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | A Coruña | AZ | Acre | Aguascalientes | Alicante |
| 1 | Ant | Argentina | Arizona | Barcelona | British Columbia |
| 2 | CO | Campeche | Cantabria | Cd. Mx. | Cundinamarca |
| 3 | D.F. | Distrito Capital | Distrito Federal | ES | Edo. Aragua |
| 4 | Edo. Carabobo | España | Falcón | Galicia | Girona |
| 5 | Grand-duchy | Huila | Iowa | La Coruña | La Pampa |
| 6 | Lagos | Los Ríos | Louisiana | Lugo | MN |
| 7 | MP | Mallorca | Maracaibo | Maranhão | Meta |
| 8 | Michoacán | Miranda | Málaga | NS | Navarra |
| 9 | Norte de Santander | Nuevo León | Oaxaca | Ontario | Paraíba |


Given that even the high support targets like Brazil and Mexico
had really low scores,
the proper solution would be
to use more classification data
and a better FastText model,
perhaps one already pre-trained on a huge corpus
to model both semantic and morphological features.