## Word2Vec with Gensim

But why Word2Vec?

Word2Vec captures semantic and syntactic relations between words, which traditional TF-IDF or other frequency-based approaches cannot do. During training, each one-hot encoded word is mapped to a point in a dense vector space, and words with similar meanings end up grouped close together.

The neural network used here is shallow: it has a single hidden layer.

Note that Word2Vec needs a large amount of text to learn the relations between words and produce meaningful results.

In general, Word2Vec is based on a window method: we choose a window size, and the words that fall within that many positions of a given word are treated as its context.
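To make the window idea concrete, here is a minimal sketch in plain Python. The function name `context_windows` and the sample sentence are made up for illustration and are not part of Gensim.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in tokens."""
    for i, center in enumerate(tokens):
        # Up to `window` words on each side of the center word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


sentence = "the quick brown fox jumps".split()
windows = list(context_windows(sentence, window=2))

for center, context in windows:
    print(center, "->", context)
```

Words near the start or end of the sentence simply get a shorter context, because there are fewer neighbours on one side.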

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/gensim.png" width="1200">

## Gensim
- Gensim is a fairly easy-to-use module that implements both CBOW and Skip-gram.
- We can install it by running !pip install gensim in a Jupyter Notebook.
- The alternative is to implement Word2Vec from scratch, which is quite complex.
- Read more about Gensim: https://radimrehurek.com/gensim/index.html
- FYI, Gensim was developed and is maintained by the NLP researcher Radim Řehůřek and his company RaRe Technologies.
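The difference between CBOW and Skip-gram lies in how the training examples are framed. The sketch below is a simplified illustration in plain Python, not Gensim's implementation; `training_examples` and the sample tokens are hypothetical names chosen for this example.

```python
def training_examples(tokens, window, mode):
    """Build (input, target) examples; mode is "cbow" or "skipgram"."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if mode == "cbow":
            # CBOW: all context words together predict the center word.
            examples.append((tuple(context), center))
        elif mode == "skipgram":
            # Skip-gram: the center word predicts each context word separately.
            examples.extend((center, word) for word in context)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return examples


tokens = "i like natural language".split()
cbow = training_examples(tokens, window=1, mode="cbow")
skipgram = training_examples(tokens, window=1, mode="skipgram")

print(cbow)
print(skipgram)
```

When training a model with Gensim's Word2Vec class, this choice is made through the sg parameter (0 for CBOW, 1 for Skip-gram), and the context size through the window parameter.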

In [2]:
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

--2021-03-30 09:40:53--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.102.254
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.102.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-03-30 09:41:10 (94.8 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [3]:
# Extract the .gz file with gunzip.
# Google provides a model pre-trained on a Google News corpus, containing 3 million words and 300 dimensions.
!gunzip GoogleNews-vectors-negative300.bin.gz

print('gunzip done!')

In this notebook we use gensim to load the pre-trained model. To do so, we just call the .load_word2vec_format(filepath) function; since the file is binary, we must set the binary parameter to True.

In [6]:
from gensim.models import KeyedVectors

# Load the pre-trained vectors; binary=True because the file is stored in binary format.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(type(model))

print('load model!')

<class 'gensim.models.keyedvectors.Word2VecKeyedVectors'>
load model!


In [7]:
model["man"].shape

(300,)

In [8]:
model["man"]

array([ 0.32617188,  0.13085938,  0.03466797, -0.08300781,  0.08984375,
       -0.04125977, -0.19824219,  0.00689697,  0.14355469,  0.0019455 ,
        0.02880859, -0.25      , -0.08398438, -0.15136719, -0.10205078,
        0.04077148, -0.09765625,  0.05932617,  0.02978516, -0.10058594,
       -0.13085938,  0.001297  ,  0.02612305, -0.27148438,  0.06396484,
       -0.19140625, -0.078125  ,  0.25976562,  0.375     , -0.04541016,
        0.16210938,  0.13671875, -0.06396484, -0.02062988, -0.09667969,
        0.25390625,  0.24804688, -0.12695312,  0.07177734,  0.3203125 ,
        0.03149414, -0.03857422,  0.21191406, -0.00811768,  0.22265625,
       -0.13476562, -0.07617188,  0.01049805, -0.05175781,  0.03808594,
       -0.13378906,  0.125     ,  0.0559082 , -0.18261719,  0.08154297,
       -0.08447266, -0.07763672, -0.04345703,  0.08105469, -0.01092529,
        0.17480469,  0.30664062, -0.04321289, -0.01416016,  0.09082031,
       -0.00927734, -0.03442383, -0.11523438,  0.12451172, -0.02

### Cosine Similarity


<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/cosine.png" width="1200">

Mathematically speaking, cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians.

It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is advantageous because even if two similar documents are far apart in Euclidean distance (for example, because of document size), they may still be oriented in nearly the same direction. The smaller the angle, the higher the cosine similarity.
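The properties described above can be checked with a few lines of plain Python. This is a minimal sketch on made-up 2-dimensional vectors; the helper name `cosine_similarity` is chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, very different magnitude.
same_direction = cosine_similarity([1.0, 2.0], [10.0, 20.0])
euclidean = math.dist([1.0, 2.0], [10.0, 20.0])

# Perpendicular and diametrically opposed vectors.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-2.0, -4.0])

print(same_direction, euclidean)
print(orthogonal, opposite)
```

The first pair of vectors is far apart in Euclidean terms, yet its cosine similarity is (up to floating-point rounding) 1, because only the orientation matters.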

By computing the cosine similarity between the vector representations of two words, we can tell how similar they are.
NOTE
SciPy's cosine(u, v) function computes the cosine distance; we can turn the distance into a similarity by subtracting it from 1.

In [9]:
from scipy.spatial.distance import cosine

cosine(model["man"],model["boy"])

0.3175129294395447

In [10]:
1-cosine(model["man"],model["boy"])

0.6824870705604553

The .similarity(word1, word2) method is already implemented to compute the similarity of two words directly.

In [11]:
model.similarity("man","boy") # these words are very similar

0.68248713

In [12]:
model.similarity("cat","mouse") # these words are less similar (or at least I hope they are)

0.46566275

We can look up the words most similar to a given keyword using the .most_similar method.

In [13]:
model.most_similar(positive=['shocked'], topn=10)

[('stunned', 0.8812650442123413),
 ('surprised', 0.8090525269508362),
 ('flabbergasted', 0.8001877069473267),
 ('horrified', 0.7986997365951538),
 ('dismayed', 0.7774383425712585),
 ('dumbfounded', 0.7773443460464478),
 ('appalled', 0.7613470554351807),
 ('astonished', 0.757473349571228),
 ('taken_aback', 0.7515914440155029),
 ('astounded', 0.7368210554122925)]

With the same method we can also run more complex queries, such as finding the words most similar to certain keywords, passed in the positive parameter, but dissimilar to other keywords, passed in the negative parameter.

In [14]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]
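The idea behind a positive/negative query can be sketched as vector arithmetic followed by a cosine ranking. The code below is a simplified illustration in plain Python, not Gensim's actual implementation. The 3-dimensional vectors in `toy_vectors` are invented for this example and are not taken from the Google News model.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The query words themselves are excluded from the results.
    excluded = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.7, 0.2],
    "apple": [0.1, 0.5, 0.5],
}

result = most_similar(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)
```

With these toy vectors, "queen" ranks first, mirroring the result obtained above with the real model.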

Another useful method is .doesnt_match(words), which takes a list of words and returns the one that fits least with the others.

In [15]:
model.doesnt_match("breakfast football dinner lunch".split())

'football'
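One way to pick the odd word out is to average the normalized vectors and return the word that is least similar to that average. The sketch below does this in plain Python, as an illustration of the idea rather than Gensim's own code. The 2-dimensional vectors in `toy_vectors` are invented for this example.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = unit([
        sum(vec[d] for vec in normed.values()) / len(normed)
        for d in range(dims)
    ])
    # Lowest cosine similarity to the mean direction is the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "football": [0.1, 0.9],
}

odd_one = doesnt_match(toy_vectors, "breakfast football dinner lunch".split())
print(odd_one)
```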

In [None]:
## Small Exercise

https://randomwordgenerator.com/

For 10 words, find:
* most similar
* model.similarity for pairs
* model.doesnt_match
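
Randomly generated words may not be in the model's vocabulary, and looking up a missing word raises a KeyError. A small helper like the one below can filter the list first. It is a sketch: `split_by_vocabulary` is a hypothetical name, and a plain dict stands in for the loaded model here. It relies only on the `in` operator, which the loaded KeyedVectors object also supports.

```python
from itertools import combinations


def split_by_vocabulary(model, words):
    """Separate words the model knows from words it does not."""
    known = [w for w in words if w in model]
    unknown = [w for w in words if w not in model]
    return known, unknown


# A dict stands in for the real model in this sketch (assumption).
toy_model = {"breakfast": [0.9, 0.1], "lunch": [0.8, 0.2], "football": [0.1, 0.9]}

known, unknown = split_by_vocabulary(toy_model, ["breakfast", "lunch", "qwertyzz", "football"])
pairs = list(combinations(known, 2))

print(known, unknown)
print(pairs)
```

Each pair in `pairs` can then be passed to model.similarity, and `known` to model.doesnt_match.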



In [None]:
model.most_similar(positive=['xxxxxxxxxx'], topn=10)

In [None]:
model.most_similar(positive=['xxxxxxxxx', 'yyyyyyy'])

In [None]:
model.doesnt_match("zzzzzzz kkkkkkkk uuuuuuuu eeeeeeeeee ttttttttttt pppppppppppppp".split())