# A word2vec approach with `gensim`

What's `word2vec`?
Roughly speaking, it's a shallow neural network model
that can be trained to create a word embedding for NLP.
There are two architectures:

* Continuous BoW (Bag of Words),
  which tries to predict a word given its context;
* Continuous skip-gram,
  which tries to predict the context from a given word.
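
To make the difference between the two architectures concrete, here is a minimal sketch of how the training examples could be laid out for each of them. The `training_pairs` helper and the sample sentence are made up for this illustration, and the fixed window ignores details of the real algorithm (such as how the window size is sampled or how the network is optimized).

```python
def training_pairs(tokens, window=2):
    """Yield (context, target) pairs for each position in a token list."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        yield context, target

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = list(training_pairs(sentence, window=1))

# CBOW: the whole context is the input, the center word is the output
cbow_examples = [(context, target) for context, target in pairs]

# Skip-gram: the center word is the input, each context word is one output
skipgram_examples = [(target, word)
                     for context, target in pairs
                     for word in context]

print(cbow_examples[1])       # (['the', 'sat'], 'cat')
print(skipgram_examples[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Both lists carry the same co-occurrence information; what changes is which side the model has to predict.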

Here we'll use `gensim` again. It's an open source library
that was created as part of
[Radim Řehůřek's Ph.D. thesis](
  https://radimrehurek.com/phd_rehurek.pdf
), *Scalability of semantic analysis
    in natural language processing*, 2011.
His thesis focuses mainly on LSA (Latent Semantic Analysis)
and LDA (Latent Dirichlet Allocation).
However, `word2vec` was published after that,
by Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean
(a team of Google researchers),
in the paper *Efficient Estimation of
             Word Representations in Vector Space*, 2013
\[[PDF](https://arxiv.org/pdf/1301.3781.pdf),
  [C++ code](https://code.google.com/archive/p/word2vec/)\].
Radim Řehůřek himself added `word2vec` to his `gensim` library,
and published a [short tutorial for it](
  https://rare-technologies.com/word2vec-tutorial/
).

## Wikipedia trigram model

For a first try of `word2vec` in \[Brazilian\] Portuguese,
one can see Felipe Parpinelli's
[word2vec-pt-br](https://github.com/felipeparpinelli/)
repository (unfortunately, only available for Python 2).
Nevertheless, [he trained a trigram vector model and published it](
  https://drive.google.com/file/d/0B_eXEo_eUPCDWnJ0YWtUdW1kVFk/view
),
so we can use it directly here in Python 3.7 with `gensim`.
The model is a 2GB file whose SHA256 is
`5421465d49a5f709f81cec3607c64b1e6a0724fdce94f9d507a48fe07f95d098`.
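
Since the download is large, it might be worth checking that digest before loading the model. The following is a minimal sketch using only the standard library; the `sha256_of_file` helper is a name made up here, and the final check assumes the file was saved as `wiki.pt.trigram.vector` in the current directory.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Reading in chunks avoids loading the whole 2GB file in memory
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "5421465d49a5f709f81cec3607c64b1e6a0724fdce94f9d507a48fe07f95d098"
# Uncomment once the model file is available locally:
# assert sha256_of_file("wiki.pt.trigram.vector") == EXPECTED
```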

In [1]:
from gensim.models import KeyedVectors
import numpy as np
import pandas as pd

In [2]:
wiki_model = KeyedVectors.load_word2vec_format("wiki.pt.trigram.vector", binary=True)

It has a vocabulary of more than one million words and expressions,
all in lower case, with underscores as separators:

In [3]:
len(wiki_model.vocab)

1264918

### Word similarity and "maths with words"

As of today, in the cited repository,
Parpinelli is only using two model methods:
`most_similar` and `doesnt_match`.
The first one can be used to find similar words,
with a cosine similarity measurement
ranging from $-1$ to $1$.
This example shows the names of cities
in the São Paulo state, Brazil,
given the name of one city:

In [4]:
wiki_model.most_similar("campinas")


[('ribeirão_preto', 0.7867798805236816),
 ('sorocaba', 0.7684873342514038),
 ('jundiaí', 0.7378007173538208),
 ('araraquara', 0.7296241521835327),
 ('são_paulo', 0.7239118814468384),
 ('guarulhos', 0.7190227508544922),
 ('bauru', 0.708629310131073),
 ('botucatu', 0.6960499882698059),
 ('taubaté', 0.6935198307037354),
 ('mogi_das_cruzes', 0.6845329403877258)]

These are the words
that have the highest similarity with `campinas`.
Such a list of tuples can be easily converted
to a Pandas dataframe:

In [5]:
pd.DataFrame(
    wiki_model.most_similar("campinas"),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|:--|--:|
| ribeirão_preto | 0.78678 |
| sorocaba | 0.768487 |
| jundiaí | 0.737801 |
| araraquara | 0.729624 |
| são_paulo | 0.723912 |
| guarulhos | 0.719023 |
| bauru | 0.708629 |
| botucatu | 0.69605 |
| taubaté | 0.69352 |
| mogi_das_cruzes | 0.684533 |


As we're trying to predict the context from a word,
what this gives is that
all these words can easily appear in the same contexts.
Note that the training process of `word2vec` is unsupervised
(it can be seen as a dimensionality reduction algorithm).


Instead of a single word,
we can also give a list of *positive* and *negative* words,
performing something akin to this math:

$$
\begin{array}{cll}
{}   & Brasília & \text{# federal capital of Brazil} \\
{} - & Brasil   & \text{# Brazil, in Brazilian Portuguese} \\
{} + & Alemanha & \text{# Germany, in Brazilian Portuguese} \\ \hline
{}   & ???
\end{array}
$$

In [6]:
wiki_model.most_similar(
    positive=["brasilia", "alemanha"],
    negative=["brasil"],
    topn=1,
)

[('berlin', 0.5845881104469299)]

The typical "equation" is $king - man + woman$,
the first example in the 2013 paper,
which here also yields $queen$
(but with all the words in Brazilian Portuguese):

In [7]:
wiki_model.most_similar(
    positive=["rei", "mulher"], # ["king", "woman"]
    negative=["homem"],         # ["man"]
    topn=1,
)

[('rainha', 0.6084680557250977)]

### Word vector normalization

We can also get the vector of a word to do some actual maths with it:

In [8]:
type(wiki_model["brasilia"])

numpy.ndarray

In [9]:
wiki_model["brasilia"].shape

(400,)

But when using the `similar_by_vector` method,
the word itself appears in the result,
and we need to remove it manually:

In [10]:
wiki_model.similar_by_vector(wiki_model["campinas"], topn=5)

[('campinas', 1.0),
 ('ribeirão_preto', 0.7867798805236816),
 ('sorocaba', 0.7684873342514038),
 ('jundiaí', 0.7378007769584656),
 ('araraquara', 0.7296241521835327)]
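
Removing the query word from such a result can be done with a plain list comprehension. This is only a sketch: the `drop_words` helper is a name invented here, and the list is the output shown above, copied by hand so that the example runs without the model.

```python
# Output of similar_by_vector for "campinas", copied from the cell above
results = [
    ("campinas", 1.0),
    ("ribeirão_preto", 0.7867798805236816),
    ("sorocaba", 0.7684873342514038),
    ("jundiaí", 0.7378007769584656),
    ("araraquara", 0.7296241521835327),
]

def drop_words(pairs, excluded):
    """Remove the (token, similarity) pairs whose token is in `excluded`."""
    excluded = set(excluded)
    return [(token, sim) for token, sim in pairs if token not in excluded]

print(drop_words(results, ["campinas"])[0])
# ('ribeirão_preto', 0.7867798805236816)
```

When a fixed number of neighbors is needed, one would ask for `topn` plus the number of input words before filtering.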

And performing the maths by hand doesn't give the same results
we've got from `most_similar`,
as the vectors don't all have the same weight:

In [11]:
pd.DataFrame(
    wiki_model.similar_by_vector(
        wiki_model["brasilia"] - wiki_model["brasil"] + wiki_model["alemanha"],
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|:--|--:|
| magdeburg | 0.542283 |
| erfurt | 0.525711 |
| krefeld | 0.520014 |
| alta_baviera | 0.517179 |
| aachen | 0.516245 |
| freiburg | 0.516041 |
| baixa_saxónia | 0.508358 |
| salzburg | 0.506056 |
| ulm | 0.505461 |
| koblenz | 0.505086 |


In [12]:
pd.DataFrame(
    wiki_model.similar_by_vector(
        wiki_model["rei"] - wiki_model["homem"] + wiki_model["mulher"],
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|:--|--:|
| rei | 0.713224 |
| rainha | 0.626994 |
| consorte | 0.553226 |
| rainha_viúva | 0.531954 |
| mulher | 0.513182 |
| rainha_consorte | 0.50805 |
| rainha_isabel | 0.507507 |
| monarca | 0.502872 |
| princesa | 0.501628 |
| rainha_regente | 0.49872 |


That's because the vector magnitudes are way too different,
whereas we care mostly about the vector direction,
not the vector magnitude.
Let's calculate the vector magnitude/norm
for each of these words:

In [13]:
{k: np.sqrt((wiki_model[k] ** 2).sum())
 for k in ["brasilia", "brasil", "alemanha"]}

{'brasilia': 9.4632, 'brasil': 28.960426, 'alemanha': 24.099485}

In [14]:
{k: np.sqrt((wiki_model[k] ** 2).var())
 for k in ["rei", "homem", "mulher"]}

{'rei': 2.9873471, 'homem': 2.229483, 'mulher': 2.2330337}

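To give the same weight to these vectors,
we need to normalize them before doing that sum/subtraction math.
We can simply divide each vector by its norm
(the square root of the sum of its squared components, as in `In [13]`;
`In [14]` calls `.var()` instead of `.sum()`, so its values aren't norms),
but that's already done by the `word_vec` method
when `use_norm=True`: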

In [15]:
(wiki_model.word_vec("rei", use_norm=True) ** 2).sum()

1.0
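
A toy example may help to see why the magnitudes matter. The two 2D vectors below are made up for this illustration (they don't come from the model), and `unit` and `cosine` are helper names introduced here.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length (Euclidean norm equal to 1)."""
    return v / np.linalg.norm(v)

def cosine(u, v):
    """Cosine similarity: dot product of the unit-length vectors."""
    return float(np.dot(unit(u), unit(v)))

small = np.array([1.0, 0.0])   # short vector pointing along the x axis
large = np.array([0.0, 10.0])  # long vector pointing along the y axis

raw_sum = small + large               # dominated by the long vector
unit_sum = unit(small) + unit(large)  # both directions weigh the same

print(round(cosine(raw_sum, large), 3))   # 0.995
print(round(cosine(unit_sum, large), 3))  # 0.707
```

Without normalization, the sum points almost exactly where the longest vector points, so the shorter one barely affects the result.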

Calculating the most similar vectors again
(using the direction, not the magnitude):

In [16]:
pd.DataFrame(
    wiki_model.similar_by_vector(
          wiki_model.word_vec("brasilia", True)
        - wiki_model.word_vec("brasil", True)
        + wiki_model.word_vec("alemanha", True),
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|:--|--:|
| berlin | 0.584588 |
| hamburg | 0.580028 |
| salzburg | 0.579465 |
| münchen | 0.572176 |
| freiburg | 0.571325 |
| sinsheim | 0.5625 |
| köln | 0.561137 |
| nürnberg | 0.560043 |
| krefeld | 0.559181 |
| frankfurt_oder | 0.558645 |


In [17]:
pd.DataFrame(
    wiki_model.similar_by_vector(
          wiki_model.word_vec("rei", True)
        - wiki_model.word_vec("homem", True)
        + wiki_model.word_vec("mulher", True),
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|:--|--:|
| rei | 0.656337 |
| rainha | 0.608468 |
| consorte | 0.547408 |
| mulher | 0.534614 |
| rainha_viúva | 0.525085 |
| esposa | 0.499288 |
| rainha_consorte | 0.498275 |
| princesa | 0.494415 |
| rainha_isabel | 0.493366 |
| rainha_regente | 0.490066 |


That's what the `most_similar` method does,
and it's also what the original *word2vec* implementation does,
as the method documentation states:

> The method corresponds to the `word-analogy` and `distance` scripts
> in the original word2vec implementation.

As an explanation of the maths,
we're looking for a vector $b^*$ that maximizes $b^* \cdot (b - a + a^*)$,
working with unit-length vectors $b^*$, $b$, $a$ and $a^*$,
where the last three correspond to the input words.
That can also be written as a maximization of
$b^* \cdot b - b^* \cdot a + b^* \cdot a^*$,
or, defining a binary operator $\cos$ as the cosine similarity
(dot product divided by the norms of each vector), we can write it as:

$$\arg \max_{b^*} \cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*)$$
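
The equivalence between the single dot product and the three cosine similarities can be checked numerically. This is a sketch with a tiny 3D embedding invented for the illustration (the numbers have nothing to do with the Wikipedia model), in which the candidate words are the ones that aren't in the input.

```python
import numpy as np

words = ["king", "man", "woman", "queen", "apple"]
raw = np.array([
    [1.0, 1.0, 0.0],  # king
    [0.0, 1.0, 0.0],  # man
    [0.0, 0.0, 1.0],  # woman
    [1.0, 0.0, 1.0],  # queen
    [1.0, 1.0, 1.0],  # apple
])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # unit-length rows
vec = dict(zip(words, unit))

b, a, a_star = vec["king"], vec["man"], vec["woman"]

# A single dot product with the combined vector...
scores = unit @ (b - a + a_star)
# ...equals the sum/subtraction of three cosine similarities,
# since the dot product of unit-length vectors is their cosine
three_cos = unit @ b - unit @ a + unit @ a_star
assert np.allclose(scores, three_cos)

# The input words are left out of the candidates
candidates = [i for i, w in enumerate(words)
              if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda i: scores[i])
print(words[best])  # queen
```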

### Alternative similarity formula

Omer Levy and Yoav Goldberg, in
*Linguistic Regularities in Sparse and Explicit Word Representations*, 2014
\[[PDF](http://www.aclweb.org/anthology/W14-1618)\],
named the similarity algorithm we've just seen *3CosAdd*,
and proposed an alternative "multiplicative" algorithm called *3CosMul*,
which amounts to applying the logarithm to each addend/subtrahend
in the "sum/subtraction of cosine similarities" function we were maximizing.
This new equation can also be seen as something like
taking a geometric mean of the terms, instead of a simple average.

$$
\arg \max_{b^*} \dfrac{\cos(b^*, b) \cos(b^*, a^*)}
                      {\cos(b^*, a) + \varepsilon}
$$

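That's essentially what the `most_similar_cosmul` method does
(it first shifts each cosine similarity to the $[0, 1]$ range
with $(1 + \cos) / 2$, so that every factor is non-negative).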
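
A minimal sketch of the multiplicative formula, under a few assumptions: the 3D embedding is a toy one made up for the illustration, the cosines are shifted to the $[0, 1]$ range so that no factor is negative, and the value of `EPSILON` is just a small constant chosen here. The `shifted_cos` and `cosmul` names are hypothetical helpers, not part of any library.

```python
import numpy as np

EPSILON = 1e-6  # assumed small constant that avoids a division by zero

words = ["king", "man", "woman", "queen", "apple"]
raw = np.array([
    [1.0, 1.0, 0.0],  # king
    [0.0, 1.0, 0.0],  # man
    [0.0, 0.0, 1.0],  # woman
    [1.0, 0.0, 1.0],  # queen
    [1.0, 1.0, 1.0],  # apple
])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # unit-length rows
vec = dict(zip(words, unit))

def shifted_cos(vector):
    """Cosine similarity of every word with `vector`, shifted to [0, 1]."""
    return (1 + unit @ vector) / 2

def cosmul(positive, negative):
    """3CosMul score of every word, given lists of input words."""
    numerator = np.prod([shifted_cos(vec[w]) for w in positive], axis=0)
    denominator = np.prod([shifted_cos(vec[w]) for w in negative], axis=0)
    return numerator / (denominator + EPSILON)

scores = cosmul(positive=["king", "woman"], negative=["man"])
ranking = [words[i] for i in np.argsort(-scores)]
print(ranking[0])  # queen
```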

In [18]:
wiki_model.most_similar_cosmul(
    positive=["brasilia", "alemanha"],
    negative=["brasil"],
    topn=10,
)

[('frankfurt_oder', 0.9992457628250122),
 ('hamburg', 0.9958572387695312),
 ('salzburg', 0.993118941783905),
 ('magdeburg', 0.979534387588501),
 ('krefeld', 0.9781520962715149),
 ('oberhausen', 0.9737989902496338),
 ('zürich', 0.9721503853797913),
 ('budapest', 0.9673693776130676),
 ('berlin', 0.9658992290496826),
 ('vienna', 0.9648484587669373)]

In [19]:
wiki_model.most_similar_cosmul(
    positive=["rei", "mulher"],
    negative=["homem"],
    topn=10,
)

[('rainha', 0.9783447980880737),
 ('consorte', 0.9311118721961975),
 ('rainha_viúva', 0.9254377484321594),
 ('rainha_consorte', 0.9043235778808594),
 ('rainha_isabel', 0.8911893963813782),
 ('rainha_regente', 0.8879780173301697),
 ('infanta', 0.8857831358909607),
 ('princesa', 0.8801829814910889),
 ('rainha_reinante', 0.8782250881195068),
 ('concubina', 0.8746837377548218)]

It kept *rainha* (queen) as the most similar word
in the $king - man + woman$ equation,
but *berlin* (Berlin) is no longer the most similar word
in the $Brasília - Brazil + Germany$ equation.
The similarity values are way higher,
but the list now includes cities that aren't in Germany,
like Vienna, the federal capital of Austria.
In some sense, these $2$ cherry-picked examples
aren't better in *3CosMul* than in *3CosAdd*,
which is enough reason for us not to drop either of them.

### Lack of data cleaning

We've found that some words aren't properly cleaned in the input,
and that might spoil what we would expect
from "word maths" like the ones previously performed here.
An example is the word *frequency*,
which in Portuguese is *frequência*,
yet it appears in the dataset with some different spellings:

In [20]:
wiki_model.similar_by_vector(wiki_model["frequência"], topn=5)

[('frequência', 1.0),
 ('freqüência', 0.8619202971458435),
 ('frequencia', 0.6580279469490051),
 ('amplitude', 0.6013369560241699),
 ('frequências', 0.6001715660095215)]

The $frequency - hertz + seconds$ equation
is expected to return something like *period*, *duration* or *time*,
but none of these appears as the $5$ most similar words/expressions:

In [21]:
wiki_model.most_similar(
    positive=["frequência", "segundos"],
    negative=["hertz"],
    topn=5,
)

[('cinco_minutos', 0.5157788991928101),
 ('minutos', 0.5141439437866211),
 ('freqüência', 0.506064772605896),
 ('dez_minutos', 0.47891607880592346),
 ('cada_vez', 0.47068893909454346)]

Translation of each result, in order:

- Five minutes
- Minutes
- Frequency (another writing of the same word)
- Ten minutes
- Each time/turn/cycle
  (*vez* has nothing to do with the physical meaning of *time*)

The fifth entry has something to do with the idea of a cycle,
but the word *frequency* appeared again,
and it's strange to see these time durations in minutes
as distinct tokens.
Using *3CosMul* doesn't help that much:
its new entries are *do* (of) and *já* (already / right now).

In [22]:
wiki_model.most_similar_cosmul(
    positive=["frequência", "segundos"],
    negative=["hertz"],
    topn=5,
)

[('do', 0.9138926267623901),
 ('já', 0.9123607873916626),
 ('cinco_minutos', 0.9002106785774231),
 ('cada_vez', 0.890281081199646),
 ('minutos', 0.8871634006500244)]

An expected result was *período*,
but this word also has other entries in the model:

- One missing the acute accent (*periodo*);
- One with a typo (*perído*).

In [23]:
wiki_model.similar_by_vector(wiki_model["período"], topn=5)

[('período', 1.0),
 ('periodo', 0.7241699695587158),
 ('perído', 0.6035453081130981),
 ('longo_período', 0.5876239538192749),
 ('períodos', 0.5766887068748474)]

And these are obviously not the only words
that haven't been properly cleaned.
Perhaps the result wasn't as expected
in the last "word math" calculation
because of this splitting of a single word into several tokens,
though there's no reason to believe
that these alternative spellings have some systematic bias.

In [24]:
wiki_model.similar_by_vector(wiki_model["periodo"], topn=5)

[('periodo', 1.0000001192092896),
 ('período', 0.7241699695587158),
 ('perído', 0.7190518379211426),
 ('conturbado_período', 0.6516879200935364),
 ('interregno', 0.6100722551345825)]
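
One way to spot such variants in a vocabulary would be to group the tokens by their accent-free form. The sketch below uses only the standard library; `strip_accents` is a helper name chosen here, and the token list was copied from the outputs above.

```python
import unicodedata
from collections import defaultdict

def strip_accents(token):
    """Remove combining marks, e.g. 'período' -> 'periodo'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Tokens taken from the outputs above
tokens = ["frequência", "freqüência", "frequencia",
          "período", "periodo", "perído"]

groups = defaultdict(list)
for token in tokens:
    groups[strip_accents(token)].append(token)

print(dict(groups))
# {'frequencia': ['frequência', 'freqüência', 'frequencia'],
#  'periodo': ['período', 'periodo'],
#  'perido': ['perído']}
```

Accent variants end up in the same group, but a typo like *perído* doesn't, so this kind of cleaning would only solve part of the problem.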

## Training a `word2vec` model

How could a `word2vec` model replace an LSI model
to create a word embedding
that would be helpful for
either finding invalid/inconsistent entries
or filling missing values
in a dataset of document affiliations?

Let's perform the same procedure applied to LSI in the previous experiment,
but, this time, let's use it to train a `word2vec` model.

The goal is to:

- Fit a `word2vec` model with the CSV created on 2018-06-04,
  ignoring the country fields;
- Fit a random forest model to detect the country from the `word2vec` vectors.

### Simple/small example data

Let's use the `gensim` word list example:

In [25]:
from gensim.test.utils import common_texts
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Training the model is quite straightforward.

In [26]:
from gensim.models import Word2Vec
simple_model = Word2Vec(common_texts,
    size=100,
    window=5,
    min_count=1, # Don't ignore rare words
    sg=1, # Skip-gram architecture
    workers=4,
)
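
To get a feeling of what `min_count` means for this tiny corpus, we can count the words ourselves. This is a sketch with `collections.Counter`, not the `gensim` implementation; the `kept_words` helper is a name made up here, and the texts were copied from the output above so that the example runs on its own.

```python
from collections import Counter

# Same content as the common_texts output shown above
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]

counts = Counter(word for text in texts for word in text)

def kept_words(counts, min_count):
    """Words that appear at least `min_count` times in the corpus."""
    return {word for word, n in counts.items() if n >= min_count}

print(len(kept_words(counts, min_count=1)))     # 12
print(sorted(kept_words(counts, min_count=3)))  # ['graph', 'system', 'trees', 'user']
```

With `min_count=1` the $12$ distinct words are kept, whereas a threshold of $3$ would leave only $4$ of them.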

The embedded vectors are stored as a numpy array:

In [27]:
print(simple_model.wv.vectors.shape)
simple_model.wv.vectors

(12, 100)


array([[-0.00429325, -0.00053441, -0.00069498, ..., -0.00280878,
        -0.00281088,  0.00136091],
       [ 0.00264861, -0.00271103, -0.00236115, ..., -0.00461583,
         0.00246741, -0.00283012],
       [-0.00305232,  0.00020281, -0.00423114, ..., -0.00494703,
        -0.00124028,  0.00197053],
       ...,
       [ 0.0021847 , -0.00360832, -0.00217316, ..., -0.0048031 ,
        -0.00139142, -0.00275932],
       [-0.0002061 ,  0.00336424, -0.00263451, ...,  0.00407349,
         0.00052204, -0.00373554],
       [ 0.00118459,  0.00014319, -0.00420273, ..., -0.00492786,
         0.00173646,  0.00473093]], dtype=float32)

The word to which each row belongs is stored in the `index2word` list:

In [28]:
simple_model.wv.index2word

['system',
 'user',
 'trees',
 'graph',
 'human',
 'interface',
 'computer',
 'survey',
 'response',
 'time',
 'eps',
 'minors']

Checking the order:

In [29]:
# Each row of the matrix should be the vector of the word at the same index
np.all(simple_model.wv.vectors ==
       np.array([simple_model.wv[word]
                 for word in simple_model.wv.index2word]))

True

That information should be enough
for us to create a model with the actual data.

### Getting the word lists and training `word2vec` w/ non-country text fields from Clea's CSV

We'll work with the same dataset
(coming from Clea's output, created on 2018-06-04)
as the previous experiment regarding LSI/LSA (which started on 2018-08-23):

In [30]:
dataset = pd.read_csv("inner_join_2018-06-04.csv",
                      dtype=str,
                      keep_default_na=False) \
            .drop_duplicates()

The first cleaning step is the same one from that experiment,
which removes accents (keeping the letters), lowercases everything
and removes any other character.

In [31]:
import re
from unidecode import unidecode

TEXT_ONLY_REGEX = re.compile("[^a-zA-Z ]")

def pre_normalize(name):
    return TEXT_ONLY_REGEX.sub("", unidecode(name).lower())

Using the same stopwords as before, we can create a function
that gets a dataset and casts it as a word list.
Here we don't need to worry about uncommon (single occurrence) words,
since ignoring them is a matter of setting a `word2vec` parameter.

In [32]:
stop_words = ["da", "de", "desta", "do", "em", "ii", "iii", "in", "mesma", "no", "pela", "pelos"]

def df2wlist(dset, stop_words=stop_words):
    return dset.T.apply(
        lambda row: [
            word for word in pre_normalize(" ".join(row)).split()
                 if word not in stop_words
                 and len(word) > 1
        ]
    )

The following entries are the field names from that CSV
whose content might help us
find the `addr_country_code`.
When training with a joined-columns word list,
the order of the fields matters!

In [33]:
x_fields = [
    "addr_city",
    "addr_state",
    "aff_text",
    "article_title",
    "contrib_bio",
    "contrib_prefix",
    "contrib_name",
    "contrib_surname",
    "institution_orgdiv1",
    "institution_orgdiv2",
    "institution_orgname",
    "institution_orgname_rewritten",
    "institution_original",
    "institution_orgname_rewritten",
    "journal_title",
    "publisher_name",
]
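
The order matters because the words of all fields are joined in a single list, and only the words that are close enough to each other share a context window. The sketch below illustrates that with a hypothetical record (the field contents and the `window_pairs`/`join_fields` helpers are made up for this example).

```python
def window_pairs(tokens, window):
    """Set of unordered word pairs at most `window` positions apart."""
    return {
        frozenset((tokens[i], tokens[j]))
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 1 + window, len(tokens)))
    }

def join_fields(record, order):
    """Concatenate the word lists of a record, following a field order."""
    return [word for field in order for word in record[field]]

# Hypothetical record, with each field already split into words
record = {
    "addr_city": ["campinas"],
    "article_title": ["soil", "analysis", "using", "remote", "sensing"],
    "institution_orgname": ["unicamp"],
}

order_a = ["addr_city", "institution_orgname", "article_title"]
order_b = ["addr_city", "article_title", "institution_orgname"]
pair = frozenset(("campinas", "unicamp"))

print(pair in window_pairs(join_fields(record, order_a), window=2))  # True
print(pair in window_pairs(join_fields(record, order_b), window=2))  # False
```

A long field placed between two short ones can therefore keep their words from ever appearing in the same window.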

From this, we can get the desired wordlist:

In [34]:
%%time
wlist = df2wlist(dataset[x_fields])

CPU times: user 14.6 s, sys: 118 ms, total: 14.7 s
Wall time: 14.8 s


In [35]:
wlist.tail()

94655    [universidade, estado, rio, janeiro, instituto...
94656    [ciudad, buenos, aires, ciudad, buenos, aires,...
94657    [joao, pessoa, pb, universidade, federal, para...
94658    [sao, paulo, universidade, sao, paulo, departa...
94659    [campinas, sp, universidade, estadual, campina...
dtype: object

In [36]:
%%time
model = Word2Vec(wlist,
    size=200,
    window=7,
    min_count=2, # Ignore uncommon words
    sg=1, # Skip-gram architecture
)

CPU times: user 5min 48s, sys: 460 ms, total: 5min 49s
Wall time: 1min 58s


That was really fast!

Although the `word2vec` neural network model is shallow,
the data we've got here might be too small.
Anyway, we know how to compare words,
but we still don't know how to compare documents.