# A word2vec approach with `gensim`

What's `word2vec`?
Roughly speaking, it's a shallow neural network model
that can be trained to create a word embedding for NLP.
There are two architectures:

* Continuous BoW (Bag of Words),
  which tries to predict a word given its context;
* Continuous skip-gram,
  which tries to predict the context from a given word.
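
To make the difference between the two architectures more concrete,
here is a minimal sketch of how the training samples could be laid out
for each of them.
The `training_pairs` helper, the toy sentence and the window size
are assumptions made up for this illustration;
this is not how `gensim` or the original C++ code builds its samples.

```python
def training_pairs(tokens, window=2):
    """Yield (center, context) pairs, where context is the list of
    neighbors at most `window` positions away from the center word."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


# Hypothetical, already tokenized sentence ("the king and the queen")
sentence = ["o", "rei", "e", "a", "rainha"]

# CBOW: the input is the whole context, the target is the center word
cbow_samples = [
    (context, center)
    for center, context in training_pairs(sentence, window=1)
]

# Skip-gram: the input is the center word, each context word is a target
skipgram_samples = [
    (center, neighbor)
    for center, context in training_pairs(sentence, window=1)
    for neighbor in context
]

print(cbow_samples[1])       # (['o', 'e'], 'rei')
print(skipgram_samples[:3])  # [('o', 'rei'), ('rei', 'o'), ('rei', 'e')]
```

Both lists come from the same sentence;
only the direction of the prediction changes.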

Here we'll use `gensim` again, an open source library
that was created as part of
[Radim Řehůřek's Ph.D. thesis](
  https://radimrehurek.com/phd_rehurek.pdf
), *Scalability of semantic analysis
    in natural language processing*, 2011.
His thesis is mainly about LSA (Latent Semantic Analysis)
and LDA (Latent Dirichlet Allocation).
However, `word2vec` was published after that,
by Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean
(a team of Google researchers),
in the paper *Efficient Estimation of
              Word Representations in Vector Space*, 2013
\[[PDF](https://arxiv.org/pdf/1301.3781.pdf),
  [C++ code](https://code.google.com/archive/p/word2vec/)\].
Radim Řehůřek himself added `word2vec` to his `gensim` library,
and published a [short tutorial for it](
  https://rare-technologies.com/word2vec-tutorial/
).

## Wikipedia trigram model

For a first try of `word2vec` in \[Brazilian\] Portuguese,
one can see Felipe Parpinelli's
[word2vec-pt-br](https://github.com/felipeparpinelli/)
repository (unfortunately, only available for Python 2).
However, [he trained a trigram vector model and published it](
  https://drive.google.com/file/d/0B_eXEo_eUPCDWnJ0YWtUdW1kVFk/view
),
so we can use it directly here in Python 3.7 with `gensim`.
The model is a 2GB file whose SHA256 is
`5421465d49a5f709f81cec3607c64b1e6a0724fdce94f9d507a48fe07f95d098`.

In [1]:
from gensim.models import KeyedVectors
import numpy as np
import pandas as pd

In [2]:
wiki_model = KeyedVectors.load_word2vec_format("wiki.pt.trigram.vector", binary=True)

It has a vocabulary of more than one million words and expressions,
all in lower case, with underscores as separators:

In [3]:
len(wiki_model.vocab)

1264918

As of today, in the cited repository,
Parpinelli is only using two model methods:
`most_similar` and `doesnt_match`.
The first one can be used to find similar words,
ranked by their cosine similarity,
which ranges from $-1$ to $1$.
This example shows the names of cities
in the state of São Paulo, Brazil,
given the name of one city:

In [4]:
wiki_model.most_similar("campinas")

[('ribeirão_preto', 0.7867798805236816),
 ('sorocaba', 0.7684873342514038),
 ('jundiaí', 0.7378007173538208),
 ('araraquara', 0.7296241521835327),
 ('são_paulo', 0.7239118814468384),
 ('guarulhos', 0.7190227508544922),
 ('bauru', 0.708629310131073),
 ('botucatu', 0.6960499882698059),
 ('taubaté', 0.6935198307037354),
 ('mogi_das_cruzes', 0.6845329403877258)]

These are the words
that have the highest similarity with `campinas`.
Such a list of tuples can easily be converted
to a pandas `DataFrame`:

In [5]:
pd.DataFrame(
    wiki_model.most_similar("campinas"),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|---|---|
| ribeirão_preto | 0.78678 |
| sorocaba | 0.768487 |
| jundiaí | 0.737801 |
| araraquara | 0.729624 |
| são_paulo | 0.723912 |
| guarulhos | 0.719023 |
| bauru | 0.708629 |
| botucatu | 0.69605 |
| taubaté | 0.69352 |
| mogi_das_cruzes | 0.684533 |
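
The ranking above is based on the cosine similarity between vectors,
a value between $-1$ and $1$ that depends only on their directions.
As a rough sketch of the idea, using made-up 3D vectors
instead of the real 400-dimensional ones
(the `toy_vectors` values and the `most_similar_toy` helper are hypothetical,
they aren't part of `gensim`):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, just for illustration (not taken from the model)
toy_vectors = {
    "campinas": [1.0, 2.0, 0.0],
    "sorocaba": [2.0, 3.0, 0.5],
    "berlin": [-1.0, 0.5, 2.0],
}


def most_similar_toy(word, vectors, topn=10):
    """Rank every other word by its cosine similarity with `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word  # the query itself isn't part of the result
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar_toy("campinas", toy_vectors)
print([token for token, score in ranking])  # ['sorocaba', 'berlin']
```

In this toy data, `berlin` is orthogonal to `campinas`,
so its similarity is zero,
and two vectors pointing in opposite directions would have a similarity of $-1$.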


As we're trying to predict the context vector from a word,
what this tells us is that
all these words can easily appear in the same contexts.
Note that the training process of `word2vec` is unsupervised
(it works as a dimensionality reduction algorithm).


Instead of a single word,
we can also give a list of *positive* and *negative* words,
performing something akin to this math:

$$
\begin{array}{cll}
{}   & Brasília & \text{# federal capital of Brazil} \\
{} - & Brasil   & \text{# Brazil, in Brazilian Portuguese} \\
{} + & Alemanha & \text{# Germany, in Brazilian Portuguese} \\ \hline
{}   & ???
\end{array}
$$

In [6]:
wiki_model.most_similar(
    positive=["brasilia", "alemanha"],
    negative=["brasil"],
    topn=1,
)

[('berlin', 0.5845881104469299)]

The typical "equation" is $king - man + woman$,
the first example in the 2013 paper,
which here also yields $queen$
(but with all the words in Brazilian Portuguese):

In [7]:
wiki_model.most_similar(
    positive=["rei", "mulher"], # ["king", "woman"]
    negative=["homem"],         # ["man"]
    topn=1,
)

[('rainha', 0.6084680557250977)]

We can also get the vector of a word to do some actual math with it:

In [8]:
type(wiki_model["brasilia"])

numpy.ndarray

In [9]:
wiki_model["brasilia"].shape

(400,)

But when using the `similar_by_vector` method,
we need to manually remove the word itself from the result:

In [10]:
wiki_model.similar_by_vector(wiki_model["campinas"], topn=5)

[('campinas', 1.0),
 ('ribeirão_preto', 0.7867798805236816),
 ('sorocaba', 0.7684873342514038),
 ('jundiaí', 0.7378007769584656),
 ('araraquara', 0.7296241521835327)]
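
Removing the query word can be done with a simple filter.
This small sketch reuses the output above as plain data,
so it runs without loading the model;
the `results`, `query` and `without_query` names are only illustrative:

```python
# Output of similar_by_vector shown above, copied as plain data
results = [
    ("campinas", 1.0),
    ("ribeirão_preto", 0.7867798805236816),
    ("sorocaba", 0.7684873342514038),
    ("jundiaí", 0.7378007769584656),
    ("araraquara", 0.7296241521835327),
]
query = "campinas"

# Keep every pair whose token isn't the query word itself
without_query = [
    (token, similarity)
    for token, similarity in results
    if token != query
]

print(len(without_query))   # 4
print(without_query[0][0])  # ribeirão_preto
```

Since one entry gets dropped,
one would ask for `topn + 1` items to end up with `topn` of them.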

And doing the math directly on these vectors
doesn't yield the same results we got from `most_similar`,
as not all vectors have the same weight:

In [11]:
pd.DataFrame(
    wiki_model.similar_by_vector(
        wiki_model["brasilia"] - wiki_model["brasil"] + wiki_model["alemanha"],
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|---|---|
| magdeburg | 0.542283 |
| erfurt | 0.525711 |
| krefeld | 0.520014 |
| alta_baviera | 0.517179 |
| aachen | 0.516245 |
| freiburg | 0.516041 |
| baixa_saxónia | 0.508358 |
| salzburg | 0.506056 |
| ulm | 0.505461 |
| koblenz | 0.505086 |


In [12]:
pd.DataFrame(
    wiki_model.similar_by_vector(
        wiki_model["rei"] - wiki_model["homem"] + wiki_model["mulher"],
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|---|---|
| rei | 0.713224 |
| rainha | 0.626994 |
| consorte | 0.553226 |
| rainha_viúva | 0.531954 |
| mulher | 0.513182 |
| rainha_consorte | 0.50805 |
| rainha_isabel | 0.507507 |
| monarca | 0.502872 |
| princesa | 0.501628 |
| rainha_regente | 0.49872 |


That's because the vector magnitudes differ a lot,
and we care mostly about the direction of each vector,
not about its magnitude.
Let's calculate the magnitude/norm
of the vector of each of these words:

In [13]:
{k: np.sqrt((wiki_model[k] ** 2).sum())
 for k in ["brasilia", "brasil", "alemanha"]}

{'brasilia': 9.4632, 'brasil': 28.960426, 'alemanha': 24.099485}

In [14]:
{k: np.sqrt((wiki_model[k] ** 2).sum())  # norm: square root of the sum of squares
 for k in ["rei", "homem", "mulher"]}

To give the same weight to these vectors,
we need to normalize them before doing that sum/subtraction math.
We can simply divide the vectors by the numbers above (their norm),
but that's already done by the `word_vec` method
when `use_norm=True`:

In [15]:
(wiki_model.word_vec("rei", use_norm=True) ** 2).sum()

1.0
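
To see why the normalization matters,
here is a small sketch with made-up 2D vectors of very different magnitudes.
The `norm`, `unit` and `combine` helpers and all the values are hypothetical,
written in plain Python just for this illustration:

```python
import math


def norm(v):
    """Euclidean norm: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    """Same direction as v, but with norm 1."""
    n = norm(v)
    return [x / n for x in v]


def combine(a, b, c):
    """Element-wise a - b + c."""
    return [x - y + z for x, y, z in zip(a, b, c)]


# Made-up vectors (not from the model) with norms 5, 30 and 20
short = [3.0, 4.0]
long_ = [30.0, 0.0]
other = [0.0, 20.0]

# Without normalization, the longest vectors dominate the result
raw = combine(short, long_, other)
print(raw)  # [-27.0, 24.0]

# With normalization, each vector contributes only its direction
normalized = combine(unit(short), unit(long_), unit(other))
print([round(x, 6) for x in normalized])  # [-0.4, 1.8]
```

In the raw result the contribution of `short` is barely noticeable,
whereas in the normalized one the three vectors have the same weight.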

Calculating the most similar vectors again
(using the direction, not the magnitude):

In [16]:
pd.DataFrame(
    wiki_model.similar_by_vector(
          wiki_model.word_vec("brasilia", True)
        - wiki_model.word_vec("brasil", True)
        + wiki_model.word_vec("alemanha", True),
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|---|---|
| berlin | 0.584588 |
| hamburg | 0.580028 |
| salzburg | 0.579465 |
| münchen | 0.572176 |
| freiburg | 0.571325 |
| sinsheim | 0.5625 |
| köln | 0.561137 |
| nürnberg | 0.560043 |
| krefeld | 0.559181 |
| frankfurt_oder | 0.558645 |


In [17]:
pd.DataFrame(
    wiki_model.similar_by_vector(
          wiki_model.word_vec("rei", True)
        - wiki_model.word_vec("homem", True)
        + wiki_model.word_vec("mulher", True),
        topn=10,
    ),
    columns=["token", "similarity"],
).set_index("token")

| token | similarity |
|---|---|
| rei | 0.656337 |
| rainha | 0.608468 |
| consorte | 0.547408 |
| mulher | 0.534614 |
| rainha_viúva | 0.525085 |
| esposa | 0.499288 |
| rainha_consorte | 0.498275 |
| princesa | 0.494415 |
| rainha_isabel | 0.493366 |
| rainha_regente | 0.490066 |
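
Now the similarities of `berlin` and `rainha` match
the ones we got from `most_similar`.
The remaining difference is that `rei` and `mulher` still show up here,
which suggests that `most_similar` also drops the input words from its result.
The sketch below puts these steps together on made-up 2D vectors.
It's an approximation inferred from the results above, not the `gensim` source code,
and the `most_similar_sketch` function and the `toy` data are hypothetical:

```python
import math


def unit(v):
    """Same direction as v, but with norm 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(u, v):
    """Cosine similarity, as the dot product of the unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def most_similar_sketch(vectors, positive, negative, topn=10):
    """Add the unit vectors of the positive words, subtract the ones
    of the negative words, and rank the remaining words by cosine."""
    size = len(next(iter(vectors.values())))
    target = [0.0] * size
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            target[i] += sign * x
    inputs = set(positive) | set(negative)
    scores = [
        (word, cosine(target, vector))
        for word, vector in vectors.items()
        if word not in inputs  # input words are left out of the result
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, only the tokens resemble the ones used in this text
toy = {
    "rei": [4.0, 4.0],
    "homem": [3.0, 0.0],
    "mulher": [0.0, 2.0],
    "rainha": [1.0, 5.0],
    "mesa": [5.0, -1.0],
}

result = most_similar_sketch(toy, ["rei", "mulher"], ["homem"], topn=1)
print(result[0][0])  # rainha
```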
