# Example of how to use the trained model based on Word2Vec

Trained files to download:

* [dictionary_w2v_both.dump](https://drive.google.com/open?id=1z4vAm2m3ANp48b2XnRtSlNDM2Gp4vrMX)
* [rf_w2v_200.dump](https://drive.google.com/open?id=1EEI-sY-nprjzQ1yyS11F_fhocAKzRpIt)
* [rf_w2v_1000.dump](https://drive.google.com/open?id=1_HeYOyjPlM6s1taoXSpG48XjIWd6A921)

This is an example of how to use the model based on `word2vec`.
No neural network is required:
the actual model is an already trained random forest
applied to a document vector, which is obtained
by simply multiplying a bag-of-words vector
by the matrix of eigenvectors
stored in the `u` attribute of the model object.
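
The multiplication described above can be illustrated with a toy case. This is only a sketch: the three-word vocabulary and the 3x2 matrix `u` below are made-up values, not data from the trained files, and `bag_of_words` is a hypothetical helper standing in for `dictionary.doc2bow` plus `corpus2csc`.

```python
import numpy as np

# Hypothetical toy vocabulary (token -> column index); NOT the real dictionary.
vocabulary = {"universidad": 0, "cadiz": 1, "mar": 2}

# Hypothetical projection matrix with one row per vocabulary entry,
# playing the role of the "u" attribute of the model object.
u = np.array([
    [1.0, 0.0],  # universidad
    [0.0, 2.0],  # cadiz
    [0.5, 0.5],  # mar
])

def bag_of_words(tokens):
    """Count each known token; tokens outside the vocabulary are skipped."""
    counts = np.zeros(len(vocabulary))
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            counts[index] += 1
    return counts

tokens = "universidad de cadiz cadiz".split()
bow = bag_of_words(tokens)  # one count per vocabulary entry
doc_vector = bow @ u        # weighted sum of the rows of u

print(bow)
print(doc_vector)
```

The document vector is therefore a sum of the rows of `u`, each weighted by how often its word appears in the message.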

In [1]:
%%time
import re
from unidecode import unidecode

from gensim.matutils import corpus2csc
from joblib import load
import pandas as pd

TEXT_ONLY_REGEX = re.compile("[^a-zA-Z ]")

def pre_normalize(name):
    return TEXT_ONLY_REGEX.sub("", unidecode(name).lower())

dictionary = load("dictionary_w2v_both.dump")
rf_model = load("rf_w2v_200.dump")

def vectorize(msg):
    # corpus2csc returns a (terms x documents) sparse matrix,
    # hence the transpose before projecting with rf_model.u
    bow = dictionary.doc2bow(pre_normalize(msg).split())
    return corpus2csc([bow], num_terms=len(dictionary)).T @ rf_model.u

def predict(msg):
    # Most likely country code for a single message
    return rf_model.predict(vectorize(msg))[0]

def predict_proba(msg):
    # Country codes with non-zero probability, most likely first
    result = rf_model.predict_proba(vectorize(msg))
    result_series = pd.Series(result.ravel(), index=rf_model.classes_)
    return result_series[result_series > 0].sort_values(ascending=False)

CPU times: user 2.09 s, sys: 1.84 s, total: 3.93 s
Wall time: 11.3 s
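
To see what the normalization in `pre_normalize` does to a message before it is split into tokens, here is a small sketch. It uses the standard library's `unicodedata` as a rough stand-in for `unidecode`, which is an assumption: the two agree for the accented Latin letters used here, but not for every character. The name `normalize_sketch` is hypothetical.

```python
import re
import unicodedata

TEXT_ONLY_REGEX = re.compile("[^a-zA-Z ]")

def normalize_sketch(name):
    """Strip accents, lowercase, and drop everything but ASCII letters and spaces."""
    # Stand-in for unidecode: decompose accented letters, then drop the accents.
    ascii_text = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return TEXT_ONLY_REGEX.sub("", ascii_text.lower())

normalized = normalize_sketch("Química-Física, Cádiz (CEIMAR).")
tokens = normalized.split()
print(tokens)
```

Note that punctuation is removed rather than replaced by a space, so a hyphenated name such as "Química-Física" becomes a single token.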


This is the example that appeared in the training notebook.

In [2]:
spain_example = (
    "Cádiz "
    "2 Universidad de Cádiz Facultad de Ciencias del Mar y Ambientales "
        "Departamento de Química-Física Cádiz "
        "Departamento de Química-Física, "
        "Facultad de Ciencias del Mar y Ambientales, "
        "Universidad de Cádiz, "
        "UNESCO/UNITWIN WiCoP, "
        "Campus de Excelencia International del Mar (CEIMAR). "
        "(Polígono Río San Pedro s/n, Puerto Real 11510) "
        "Universidad de Cádiz "
    "Metals impact into the Paranaguá Estuarine Complex (Brazil) "
        "during the exceptional flood of 2011 "
    "Rocha Marilia Lopes da "
    "Rocha "
    "Facultad de Ciencias del Mar y Ambientales "
    "Departamento de Química-Física "
    "Universidad de Cádiz "
    "Universidad de Cádiz "
    "Departamento de Química-Física, "
        "Facultad de Ciencias del Mar y Ambientales, "
        "Universidad de Cádiz, "
        "UNESCO/UNITWIN WiCoP, "
        "Campus de Excelencia International del Mar (CEIMAR). "
        "(Polígono Río San Pedro s/n, Puerto Real 11510, Cádiz, ) "
    "Universidad de Cádiz "
    "Brazilian Journal of Oceanography "
    "Universidade de São Paulo, Instituto Oceanográfico"
)

In [3]:
%%time
predict(spain_example)

CPU times: user 34.6 ms, sys: 24.1 ms, total: 58.7 ms
Wall time: 66.1 ms


'ES'

The `predict_proba` function shows every country with a non-zero probability.

In [4]:
%%time
predict_proba(spain_example)

CPU times: user 71.7 ms, sys: 1.36 ms, total: 73.1 ms
Wall time: 82.9 ms


ES    0.522222
CO    0.111111
BR    0.100000
CL    0.066667
PE    0.055556
MX    0.033333
EC    0.022222
VE    0.011111
UY    0.011111
NI    0.011111
FR    0.011111
CU    0.011111
CR    0.011111
BO    0.011111
AR    0.011111
dtype: float64

The same applies to another example,
this time from Italy, taken from Clea's CSV output of 2018-06-04,
but with the country name removed from the message contents.

In [5]:
%%time
predict_proba(
    "Pontedera "
    "Workcenter of Jerzy Grotowski and Thomas Richards - Pontedera, "
    " Workcenter of Jerzy Grotowski and Thomas Richards Pontedera  "
    "Sobre The Living Room "
    "Thomas,Richards Thomas, "
    "Richards, "
    "Workcenter of Jerzy Grotowski and Thomas Richards, "
    "Workcenter of Jerzy Grotowski and Thomas Richards "
    "Pontedera,  "
    "Revista Brasileira de Estudos da Presença "
    "Universidade Federal do Rio Grande do Sul "
)

CPU times: user 60 ms, sys: 16.5 ms, total: 76.5 ms
Wall time: 75.1 ms


IT    0.633333
BR    0.122222
US    0.088889
GB    0.055556
TR    0.011111
TH    0.011111
RU    0.011111
MX    0.011111
FR    0.011111
CZ    0.011111
CN    0.011111
BE    0.011111
AU    0.011111
dtype: float64

Remember that the message contents should include (in any order):

- Affiliation city and state
- Raw affiliation text coming from the XML
- Document title
- Contributor mini-bio, prefix, name and surname
- Institution name (3x), division, subdivision
- Journal title
- Publisher name

All of these fields refer to a single author-affiliation pair in a document.
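
A message with the fields listed above could be assembled along these lines. This is a minimal sketch: `build_message` is a hypothetical helper, not part of the trained model or of Clea, and the field values are copied from the Italy example.

```python
def build_message(*parts):
    """Join the non-empty parts with single spaces, skipping missing fields."""
    cleaned = (part.strip() for part in parts if part)
    return " ".join(part for part in cleaned if part)

message = build_message(
    "Pontedera",                                                      # affiliation city
    None,                                                             # affiliation state (missing)
    "Workcenter of Jerzy Grotowski and Thomas Richards - Pontedera",  # raw affiliation text
    "Sobre The Living Room",                                          # document title
    "Thomas Richards",                                                # contributor name and surname
    "Workcenter of Jerzy Grotowski and Thomas Richards",              # institution name
    "Revista Brasileira de Estudos da Presença",                      # journal title
    "Universidade Federal do Rio Grande do Sul",                      # publisher name
)
print(message)
```

The resulting string can then be passed to `predict` or `predict_proba`; missing fields are simply left out.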