# Word2Vec implementation with Gensim
(Adapted from Kavita Ganesan's tutorial)

The idea behind Word2Vec is that you can tell the meaning of a word by analyzing its neighbors. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related.

In this notebook, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Dataset
First, our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset, which contains full user reviews of cars and hotels. I have concatenated all of the hotel reviews into one big file, which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.

In [None]:
# imports needed and set up logging
import gzip
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
data_file = "reviews_data.txt.gz"

# print the first review (one review per line, read as raw bytes)
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Read files into a list
Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the Word2Vec model. Notice in the code below that I am reading the compressed file directly. I'm also doing some mild pre-processing of the reviews using `gensim.utils.simple_preprocess(line)`. This does some basic pre-processing such as tokenization and lowercasing, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).
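To make the pre-processing step concrete, here is a rough standard-library sketch of the same idea: lowercase the text, keep alphabetic runs, and drop very short or very long tokens. The function name `simple_tokenize` and the length bounds are assumptions made for this illustration; it approximates, and does not reproduce, what `simple_preprocess` does.

```python
import re

def simple_tokenize(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, drop very short or very long tokens.

    Hypothetical helper for illustration only, not Gensim's implementation.
    """
    if isinstance(text, bytes):
        # the reviews are read from the gzip file as raw bytes
        text = text.decode("utf8", errors="ignore")
    # [^\W\d_]+ matches runs of letters (word characters minus digits and "_")
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_tokenize(b"Oct 12 2009 \tNice trendy hotel, location not too bad."))
```

Numbers such as `12` and `2009` disappear because only alphabetic runs are kept, which is usually what you want for review text.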


In [None]:
def read_input(input_file):
    """This method reads the input file which is in gzip format"""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f):

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)


documents = list (read_input(data_file))
logging.info ("Done reading data file")

2022-07-05 09:26:23,793 : INFO : reading file reviews_data.txt.gz...this may take a while
2022-07-05 09:26:23,797 : INFO : read 0 reviews
2022-07-05 09:26:26,509 : INFO : read 10000 reviews
2022-07-05 09:26:29,053 : INFO : read 20000 reviews
2022-07-05 09:26:32,028 : INFO : read 30000 reviews
2022-07-05 09:26:34,877 : INFO : read 40000 reviews
2022-07-05 09:26:37,658 : INFO : read 50000 reviews
2022-07-05 09:26:39,531 : INFO : read 60000 reviews
2022-07-05 09:26:41,138 : INFO : read 70000 reviews
2022-07-05 09:26:42,632 : INFO : read 80000 reviews
2022-07-05 09:26:44,158 : INFO : read 90000 reviews
2022-07-05 09:26:46,130 : INFO : read 100000 reviews
2022-07-05 09:26:47,600 : INFO : read 110000 reviews
2022-07-05 09:26:49,109 : INFO : read 120000 reviews
2022-07-05 09:26:50,667 : INFO : read 130000 reviews
2022-07-05 09:26:52,298 : INFO : read 140000 reviews
2022-07-05 09:26:53,874 : INFO : read 150000 reviews
2022-07-05 09:26:55,440 : INFO : read 160000 reviews
2022-07-05 09:26:57,022

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the `documents`). We are essentially passing a list of lists, where each inner list contains the tokens of one user review. Word2Vec uses all these tokens to internally create a vocabulary.

After building the vocabulary, we call `train(...)` to train the Word2Vec model. Note that when `documents` is passed to the constructor, Gensim already builds the vocabulary and runs an initial round of training, so the explicit `train(...)` call below continues training for 10 more epochs.

Behind the scenes, we are training a simple neural network with a single hidden layer. However, we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we're trying to learn.
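The claim that the hidden-layer weights are the word vectors can be checked on a toy example. In the sketch below, the vocabulary and the weight values are made up; the point is only that multiplying a one-hot input by the weight matrix selects a single row, and that row is the word's vector.

```python
# Toy illustration with made-up numbers, not a trained model.
vocab = ["hotel", "room", "clean"]   # hypothetical 3-word vocabulary
W = [                                # hypothetical 3 x 2 hidden-layer weights
    [0.1, 0.9],                      # row for "hotel"
    [0.4, 0.3],                      # row for "room"
    [0.8, 0.2],                      # row for "clean"
]

def one_hot(word):
    """Encode a word as a vector with a single 1 at its vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_activation(word):
    """Multiply the one-hot input (1 x V) by W (V x N) to get the hidden layer (1 x N)."""
    x = one_hot(word)
    return [sum(x[i] * W[i][j] for i in range(len(vocab))) for j in range(len(W[0]))]

print(hidden_activation("room"))
```

Because the input is one-hot, the matrix product is just a table lookup, which is why a trained model can hand back a vector for a word without running the network.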

In [None]:
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents,total_examples=len(documents),epochs=10)
# https://radimrehurek.com/gensim/models/word2vec.html

2022-07-05 09:27:11,374 : INFO : collecting all words and their counts
2022-07-05 09:27:11,375 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-07-05 09:27:11,564 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2022-07-05 09:27:11,761 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2022-07-05 09:27:11,989 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2022-07-05 09:27:12,201 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2022-07-05 09:27:12,430 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2022-07-05 09:27:12,655 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2022-07-05 09:27:12,842 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2022-07-05 09:27:13,015 : INFO : PROG

(303486983, 415193580)

## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
```

### `size`
The size of the dense vector that represents each token or word. If you have very limited data, then `size` should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me.

### `window`
The maximum distance between the target word and its neighboring word. If a neighbor's position is farther from the target word than the window width, to the left or to the right, then it is not considered related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
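As a minimal sketch of what the window means, the snippet below lists the (target, neighbor) pairs that a fixed window produces for one tokenized review. The helper `context_pairs` is hypothetical and always uses the full window; Gensim's actual training loop differs in its details.

```python
def context_pairs(tokens, window):
    """Yield (target, neighbor) pairs within `window` positions on each side."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbor
                yield target, tokens[j]

review = ["the", "room", "was", "very", "clean"]
pairs = list(context_pairs(review, window=1))
print(pairs)
```

With `window=1`, `room` is paired with `the` and `was` but not with `very`; raising the window to 2 would add that pair.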

### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.
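The effect of `min_count` can be sketched with a plain word count. The helper `build_vocab` and the toy documents below are made up for illustration; the sketch keeps every word that occurs at least `min_count` times across all documents.

```python
from collections import Counter

def build_vocab(docs, min_count):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(token for doc in docs for token in doc)
    return {word for word, n in counts.items() if n >= min_count}

# hypothetical tokenized reviews; "rooom" is a one-off typo
toy_docs = [["clean", "room"], ["clean", "bathroom"], ["rooom", "clean"]]
vocab = build_vocab(toy_docs, min_count=2)
print(vocab)
```

On a corpus this small the filter removes almost everything, which is the "really tiny dataset" case mentioned above. On a large corpus it mostly removes rare typos and one-off tokens.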


In [None]:
# Save the model
model.save("word2vec-OPINRank.model")

2022-07-05 09:32:18,247 : INFO : saving Word2Vec object under word2vec-OPINRank.model, separately None
2022-07-05 09:32:18,249 : INFO : storing np array 'vectors' to word2vec-OPINRank.model.wv.vectors.npy
2022-07-05 09:32:18,280 : INFO : not storing attribute vectors_norm
2022-07-05 09:32:18,281 : INFO : storing np array 'syn1neg' to word2vec-OPINRank.model.trainables.syn1neg.npy
2022-07-05 09:32:18,309 : INFO : not storing attribute cum_table
2022-07-05 09:32:18,440 : INFO : saved word2vec-OPINRank.model


In [None]:
# Load the model: Gensim allows training to be continued later
model = gensim.models.Word2Vec.load("word2vec-OPINRank.model")

2022-07-05 09:32:18,449 : INFO : loading Word2Vec object from word2vec-OPINRank.model
2022-07-05 09:32:19,101 : INFO : loading wv recursively from word2vec-OPINRank.model.wv.* with mmap=None
2022-07-05 09:32:19,102 : INFO : loading vectors from word2vec-OPINRank.model.wv.vectors.npy with mmap=None
2022-07-05 09:32:19,123 : INFO : setting ignored attribute vectors_norm to None
2022-07-05 09:32:19,124 : INFO : loading vocabulary recursively from word2vec-OPINRank.model.vocabulary.* with mmap=None
2022-07-05 09:32:19,124 : INFO : loading trainables recursively from word2vec-OPINRank.model.trainables.* with mmap=None
2022-07-05 09:32:19,125 : INFO : loading syn1neg from word2vec-OPINRank.model.trainables.syn1neg.npy with mmap=None
2022-07-05 09:32:19,145 : INFO : setting ignored attribute cum_table to None
2022-07-05 09:32:19,145 : INFO : loaded word2vec-OPINRank.model


In [None]:
vector = model.wv['room']
print(vector)

[ 1.0338641e+00 -6.4515978e-01  9.7491533e-01 -2.5768504e+00
 -3.8499403e-01 -3.1500733e+00  4.0942492e+00 -7.9168373e-01
 -3.4173593e-01 -2.7893143e+00 -2.2020018e+00  4.5127887e-01
 -7.3172611e-01  4.8634477e+00 -1.0662341e+00  6.0559525e+00
  3.2156411e-01 -5.8192587e-01 -2.5803950e+00  4.8855715e+00
 -3.2599993e+00  4.9360234e-01  2.4993925e-01  2.4927576e+00
  3.8899567e+00 -3.0643025e-01  2.2596550e+00  1.2166240e+00
 -2.6572332e+00 -5.8885515e-01  3.5694442e+00  1.5582733e+00
 -1.0725802e+00 -1.5443007e+00 -1.9529471e-01 -2.3242235e+00
  4.3981800e+00 -2.7963874e+00  2.7555578e+00  3.7171381e+00
  5.7213402e-01 -5.8329701e+00  7.7622014e-01  1.2758117e+00
  2.5325222e+00  2.2387545e+00  4.1407201e-01 -4.1228833e+00
  7.3756444e-01  7.2111540e-02  1.6436992e+00  3.8360384e+00
  1.0493493e+00 -1.0342419e+00  4.8277225e+00 -1.2769881e+00
 -2.9975235e+00 -1.2046417e+00 -2.1433125e+00  1.7713475e+00
 -1.3990079e-01 -3.7347598e+00 -5.2563004e+00  9.5838511e-01
  2.0864408e+00  3.80079

## Now, let's look at some output
This first example shows a simple case of looking up words similar to the word `room`. All we need to do here is call the `most_similar` function and provide the word `room` as the positive example. This returns the top 10 similar words.
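Conceptually, a similarity lookup of this kind ranks the other words in the vocabulary by the cosine similarity of their vectors to the query vector. The sketch below shows that idea on made-up two-dimensional vectors; `toy_vectors` and this `most_similar` are hypothetical stand-ins, not Gensim's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy_vectors = {  # hypothetical 2-d word vectors
    "room": [1.0, 0.2],
    "suite": [0.9, 0.3],
    "breakfast": [0.1, 1.0],
}

def most_similar(word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = toy_vectors[word]
    scored = [(w, cosine(query, v)) for w, v in toy_vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("room"))
```

The result has the same shape as the Gensim output: a list of `(word, score)` tuples in descending order of similarity.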

![title](https://samyzaf.com/ML/nlp/word2vec2.png)

In [None]:
w1 = "room"
model.wv.most_similar(w1)

2022-07-05 09:32:19,271 : INFO : precomputing L2-norms of word weight vectors


[('rooms', 0.6112707257270813),
 ('suite', 0.5912639498710632),
 ('bathroom', 0.5482749938964844),
 ('rooom', 0.544310986995697),
 ('washroom', 0.5111060738563538),
 ('unit', 0.5072125196456909),
 ('rm', 0.5023929476737976),
 ('bathroon', 0.48620980978012085),
 ('bedroom', 0.4830615520477295),
 ('balcony', 0.47041428089141846)]

In [None]:
w2 = "excellent"
model.wv.most_similar(w2)
# https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html


[('outstanding', 0.8643324971199036),
 ('superb', 0.8458866477012634),
 ('excellant', 0.8231006264686584),
 ('great', 0.8062571287155151),
 ('terrific', 0.805456280708313),
 ('exceptional', 0.8036608099937439),
 ('fantastic', 0.7959946393966675),
 ('awesome', 0.7794080972671509),
 ('incredible', 0.7753171324729919),
 ('amazing', 0.7472891211509705)]

In [None]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)

[('germany', 0.672870934009552),
 ('canada', 0.635601818561554),
 ('spain', 0.6352485418319702),
 ('mexico', 0.6124602556228638),
 ('greece', 0.6115262508392334),
 ('rome', 0.6114856004714966)]

In [None]:
# look up top 6 words similar to 'horrified'
w1 = ["horrified"]
model.wv.most_similar (positive=w1,topn=6)

[('shocked', 0.8021960258483887),
 ('dismayed', 0.7235410809516907),
 ('stunned', 0.7071316242218018),
 ('mortified', 0.699037492275238),
 ('appalled', 0.6896869540214539),
 ('astonished', 0.678575336933136)]

In [None]:
pos = ["paris", "madrid"]
neg = ["france"]
model.wv.most_similar(positive=pos,negative=neg)

[('vienna', 0.5171779990196228),
 ('sofitel', 0.5132507085800171),
 ('raffles', 0.5100057125091553),
 ('bangkok', 0.4957188367843628),
 ('luxor', 0.4954802691936493),
 ('rome', 0.4874468445777893),
 ('hong', 0.48338907957077026),
 ('tokyo', 0.48272350430488586),
 ('barcelona', 0.46742141246795654),
 ('bellagio', 0.4592193365097046)]

In [None]:
pos = ["paris", "france"]
neg = ["spain"]
model.wv.most_similar(positive=pos,negative=neg)

[('alpenhaus', 0.5069018602371216),
 ('rome', 0.49667128920555115),
 ('barcelona', 0.48386555910110474),
 ('venetian', 0.48059549927711487),
 ('vin', 0.47832345962524414),
 ('vere', 0.4652695059776306),
 ('raffles', 0.4632904827594757),
 ('brussels', 0.45971423387527466),
 ('crillon', 0.45862504839897156),
 ('vacancier', 0.4543246924877167)]

In [None]:
model.wv.most_similar(positive=['father', 'girl'], negative=['mother'], topn=10)

[('lady', 0.7344351410865784),
 ('guy', 0.7197610139846802),
 ('chap', 0.7068520784378052),
 ('gentleman', 0.698444664478302),
 ('woman', 0.6913731694221497),
 ('man', 0.6906266212463379),
 ('gal', 0.6423194408416748),
 ('receptionist', 0.6367693543434143),
 ('bloke', 0.61173015832901),
 ('lad', 0.6091463565826416)]

In [None]:
d = (model.wv['king'] - model.wv['man']) + (model.wv['queen'])
d

array([  7.679434  ,  -5.080357  ,  -8.801502  ,  -0.591382  ,
        -0.6605388 ,   1.45326   ,   3.5242076 ,  10.27034   ,
        -6.1279073 ,  -0.9242047 ,  -0.66062593,   4.3310866 ,
         2.4123845 ,  -3.5539737 ,  -7.068388  ,  14.074575  ,
        -0.87745565,   2.2695732 ,  -1.1855704 ,  -0.5794375 ,
        -8.419677  ,   1.3832095 ,  -9.85326   ,   6.9531317 ,
        -1.402785  ,  -5.511939  ,   2.3033757 ,   0.98649883,
        -2.2998977 ,  -2.8698518 ,  10.731483  , -11.404169  ,
        -7.0918655 ,  -8.665656  ,   0.05643851,  -6.014915  ,
         6.690112  ,  -3.8634753 ,   4.294281  ,   6.3559775 ,
         3.5707755 ,  -0.8635861 ,   6.919281  ,  -2.952892  ,
         8.207981  ,   7.683299  ,  -8.783346  ,  -7.5964317 ,
        -8.108012  ,   7.373233  ,   3.7055397 ,  -1.7219545 ,
         3.4231548 ,  -1.0096405 ,   7.572375  ,  -3.990315  ,
         2.3143418 ,  -2.1566367 ,   3.1369524 ,  -9.612807  ,
        -5.31958   ,  -1.4896635 ,  -5.0531154 ,   4.17

### Similarity between two words in the vocabulary

In [None]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")

0.7678789

In [None]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0

In [None]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty",w2="clean")

0.26613754

![title](https://samyzaf.com/ML/nlp/word2vec1.png)

# Comparison of CBOW, SkipGram and SkipGram with Subword Information

### Train a char n-gram model (subword information) with fastText
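Before training, it helps to see what "char n-grams" are. In the fastText approach, a word is wrapped in the boundary markers `<` and `>` and then broken into all character n-grams whose length lies between `min_n` and `max_n`; the word's vector is built from the vectors of these pieces. The helper `char_ngrams` below is a hypothetical sketch of that decomposition only and ignores how the library stores or hashes the n-grams.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("room", min_n=3, max_n=4))
```

Two spellings such as `bathroom` and `bathrooom` share most of their n-grams, so a subword model tends to place them very close together.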

In [None]:
model_subword = gensim.models.FastText(documents, size=150, window=10, min_count=2, workers=10, min_n=3, max_n=6)  # instantiate
model_subword.train(documents,total_examples=len(documents),epochs=10)

2022-07-05 09:40:44,453 : INFO : resetting layer weights
2022-07-05 09:40:59,745 : INFO : collecting all words and their counts
2022-07-05 09:40:59,753 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-07-05 09:41:00,066 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2022-07-05 09:41:00,306 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2022-07-05 09:41:00,618 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2022-07-05 09:41:00,924 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2022-07-05 09:41:01,211 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2022-07-05 09:41:01,475 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2022-07-05 09:41:01,706 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keepi

In [None]:
model_subword.save("word2vec-OPINRank-FastText.model")
#https://arxiv.org/abs/1607.04606

2022-07-05 10:27:49,193 : INFO : saving FastText object under word2vec-OPINRank-FastText.model, separately None
2022-07-05 10:27:49,199 : INFO : storing np array 'vectors' to word2vec-OPINRank-FastText.model.wv.vectors.npy
2022-07-05 10:27:49,239 : INFO : storing np array 'vectors_vocab' to word2vec-OPINRank-FastText.model.wv.vectors_vocab.npy
2022-07-05 10:27:49,268 : INFO : storing np array 'vectors_ngrams' to word2vec-OPINRank-FastText.model.wv.vectors_ngrams.npy
2022-07-05 10:27:51,885 : INFO : not storing attribute vectors_norm
2022-07-05 10:27:51,886 : INFO : not storing attribute vectors_vocab_norm
2022-07-05 10:27:51,887 : INFO : not storing attribute vectors_ngrams_norm
2022-07-05 10:27:51,888 : INFO : not storing attribute buckets_word
2022-07-05 10:27:51,892 : INFO : storing np array 'syn1neg' to word2vec-OPINRank-FastText.model.trainables.syn1neg.npy
2022-07-05 10:27:51,924 : INFO : storing np array 'vectors_vocab_lockf' to word2vec-OPINRank-FastText.model.trainables.vector

### Train a SkipGram model

In [None]:
model_skipgram = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10, sg=1)
model_skipgram.train(documents,total_examples=len(documents),epochs=10)

2022-07-05 10:28:13,348 : INFO : collecting all words and their counts
2022-07-05 10:28:13,349 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-07-05 10:28:13,756 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2022-07-05 10:28:14,281 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2022-07-05 10:28:14,822 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2022-07-05 10:28:15,329 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2022-07-05 10:28:15,885 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2022-07-05 10:28:16,468 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2022-07-05 10:28:16,942 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2022-07-05 10:28:17,378 : INFO : PROG

(303480294, 415193580)

In [None]:
model_skipgram.save("word2vec-OPINRank-skipgram.model")

2022-07-05 10:53:31,897 : INFO : saving Word2Vec object under word2vec-OPINRank-skipgram.model, separately None
2022-07-05 10:53:31,901 : INFO : storing np array 'vectors' to word2vec-OPINRank-skipgram.model.wv.vectors.npy
2022-07-05 10:53:31,940 : INFO : not storing attribute vectors_norm
2022-07-05 10:53:31,942 : INFO : storing np array 'syn1neg' to word2vec-OPINRank-skipgram.model.trainables.syn1neg.npy
2022-07-05 10:53:31,972 : INFO : not storing attribute cum_table
2022-07-05 10:53:32,090 : INFO : saved word2vec-OPINRank-skipgram.model


## Comparison

In [None]:
from gensim.models import KeyedVectors
from IPython.display import display_html
import pandas as pd

cbow_vectors = KeyedVectors.load("word2vec-OPINRank.model")
subword_vectors = KeyedVectors.load("word2vec-OPINRank-FastText.model")
skipgram_vectors = KeyedVectors.load("word2vec-OPINRank-skipgram.model")

2022-07-05 10:53:34,082 : INFO : loading Word2VecKeyedVectors object from word2vec-OPINRank.model
2022-07-05 10:53:34,891 : INFO : loading wv recursively from word2vec-OPINRank.model.wv.* with mmap=None
2022-07-05 10:53:34,892 : INFO : loading vectors from word2vec-OPINRank.model.wv.vectors.npy with mmap=None
2022-07-05 10:53:34,932 : INFO : setting ignored attribute vectors_norm to None
2022-07-05 10:53:34,933 : INFO : loading vocabulary recursively from word2vec-OPINRank.model.vocabulary.* with mmap=None
2022-07-05 10:53:34,934 : INFO : loading trainables recursively from word2vec-OPINRank.model.trainables.* with mmap=None
2022-07-05 10:53:34,935 : INFO : loading syn1neg from word2vec-OPINRank.model.trainables.syn1neg.npy with mmap=None
2022-07-05 10:53:34,992 : INFO : setting ignored attribute cum_table to None
2022-07-05 10:53:34,993 : INFO : loaded word2vec-OPINRank.model
2022-07-05 10:53:34,994 : INFO : loading Word2VecKeyedVectors object from word2vec-OPINRank-FastText.model
202

From the docs: The reason for separating the trained vectors into KeyedVectors is that if you don't need the full model state any more (don't need to continue training), its state can be discarded, keeping just the vectors and their keys proper.

This results in a much smaller and faster object that can be mmapped for lightning fast loading and sharing the vectors in RAM between processes.

## 1. Most similar concepts

In [None]:
def display_html_table(html_str):
    """Change the look and display style of table"""

    display_html(html_str.replace('table','table style="padding:20px;display:inline;color:navy;font-size:1.1em"'),raw=True)

def display_side_by_side(*args):
    html_str=''

    for df in args:
        html_str+=df.to_html()

    display_html_table(html_str)

def display_similar(positive:list,topn=10):
    """get similar concepts from 3 different models"""

    # query each model with the words passed in, not a global variable
    topn_cbow=cbow_vectors.wv.most_similar(positive=positive, topn=topn)
    topn_subword=subword_vectors.wv.most_similar(positive=positive, topn=topn)
    topn_skipgram=skipgram_vectors.wv.most_similar(positive=positive, topn=topn)

    display_side_by_side(
                     pd.DataFrame(topn_cbow,columns=['cbow','cosine_sim']),
                     pd.DataFrame(topn_skipgram,columns=['skipgram','cosine_sim']),
                     pd.DataFrame(topn_subword,columns=['skipgramsi','cosine_sim']))

In [None]:
w1=['food','hotel']
display_similar(w1,topn=8)

2022-07-05 10:53:41,366 : INFO : precomputing L2-norms of word weight vectors
2022-07-05 10:53:41,446 : INFO : precomputing L2-norms of word weight vectors
2022-07-05 10:53:41,509 : INFO : precomputing L2-norms of ngram weight vectors
2022-07-05 10:53:44,937 : INFO : precomputing L2-norms of word weight vectors


Unnamed: 0,cbow,cosine_sim
0,cuisine,0.577597
1,restaurant,0.550258
2,place,0.541984
3,property,0.526256
4,foods,0.518593
5,atmosphere,0.49453
6,meal,0.488425
7,meals,0.487389

Unnamed: 0,skipgram,cosine_sim
0,restaurant,0.742725
1,place,0.6639
2,service,0.661621
3,breakfast,0.660158
4,location,0.64349
5,is,0.640454
6,it,0.639137
7,the,0.62815

Unnamed: 0,skipgramsi,cosine_sim
0,foodis,0.726847
1,hoteli,0.718882
2,foodand,0.71272
3,atmospherehotel,0.710441
4,sinohotel,0.704555
5,foodi,0.70399
6,hotelbar,0.702716
7,foodwe,0.698903


In [None]:
w1=['bathroom']
display_similar(w1,topn=8)

Unnamed: 0,cbow,cosine_sim
0,bath,0.800389
1,washroom,0.741017
2,bathrooms,0.729996
3,bathtub,0.703139
4,bathroon,0.687404
5,shower,0.658591
6,bathrrom,0.640843
7,bathrooom,0.640236

Unnamed: 0,skipgram,cosine_sim
0,shower,0.809292
1,washroom,0.80357
2,bath,0.797698
3,vanity,0.786386
4,bathtub,0.76454
5,bathrooms,0.738423
6,sink,0.733037
7,cubicle,0.711759

Unnamed: 0,skipgramsi,cosine_sim
0,bathrooom,0.98171
1,bathroomn,0.970242
2,thebathroom,0.967723
3,bathroomi,0.963018
4,etcbathroom,0.960945
5,bathrroom,0.960438
6,bathroomno,0.958134
7,bathroomhad,0.957669


In [None]:
w1=['cheap']
display_similar(w1,topn=8)

Unnamed: 0,cbow,cosine_sim
0,inexpensive,0.682197
1,expensive,0.600288
2,affordable,0.539129
3,cheep,0.529366
4,fancy,0.5234
5,basic,0.519452
6,overpriced,0.515347
7,bargain,0.51498

Unnamed: 0,skipgram,cosine_sim
0,inexpensive,0.683589
1,expensive,0.623501
2,reasonable,0.580902
3,cheep,0.566319
4,cheapest,0.552123
5,fancy,0.549619
6,snt,0.54825
7,economical,0.542272

Unnamed: 0,skipgramsi,cosine_sim
0,cheapy,0.892279
1,scheap,0.874474
2,cheapp,0.873063
3,cheapo,0.844658
4,cheapish,0.834217
5,cheapos,0.822588
6,cheapie,0.821421
7,inexpensivegood,0.737797


In [None]:
w1=['fire']
display_similar(w1,topn=8)

Unnamed: 0,cbow,cosine_sim
0,police,0.524918
1,evacuation,0.517801
2,sprinkler,0.505275
3,roared,0.466151
4,gas,0.462502
5,evacuate,0.453599
6,firetrucks,0.448916
7,petrol,0.443834

Unnamed: 0,skipgram,cosine_sim
0,alarms,0.73337
1,alarm,0.716886
2,evacuation,0.689325
3,firetrucks,0.652635
4,suppressor,0.652266
5,evacuated,0.645844
6,evacuate,0.635812
7,detection,0.632991

Unnamed: 0,skipgramsi,cosine_sim
0,firenze,0.842171
1,firey,0.811261
2,firefox,0.80818
3,gunfire,0.807786
4,befire,0.805579
5,firehall,0.788535
6,firefly,0.770007
7,firebrigade,0.762


## Compute similarity between words

In [None]:
# similarity between two related words
def get_word_sim(w1,w2,concept_type):

    sim_cbow=model.wv.similarity(w1=w1,w2=w2)
    sim_skipgram=model_skipgram.wv.similarity(w1=w1,w2=w2)
    sim_subword=model_subword.wv.similarity(w1=w1,w2=w2)

    return {"a_word":w1,"b_word":w2,"score_cbow":sim_cbow,"score_skipgram":sim_skipgram,"score_skipgramsi":sim_subword,"concept_type":concept_type}

# word pairs
word_pairs=[
    ['friendly','staff','neighboring'],
    ['shower','curtain','neighboring'],
    ['very','clean','neighboring'],
    ['hotel','property','synonymous'],
    ['dirty','filthy','synonymous'],
    ['washroom','bathroom','synonymous'],
    ['staff','staffs','near_duplicates'],
    ['calendar','calender','near_duplicates'],
    ['bathrroom','bathrooms','near_duplicates'],
]

# get similarity
results=[]
for p in word_pairs:
    results.append(get_word_sim(p[0],p[1],p[2]))

#put in dataframe
df=pd.DataFrame(results)
display_html_table(df.to_html())

Unnamed: 0,a_word,b_word,score_cbow,score_skipgram,score_skipgramsi,concept_type
0,friendly,staff,0.094583,0.746728,0.200714,neighboring
1,shower,curtain,0.250079,0.729328,0.368841,neighboring
2,very,clean,0.40439,0.680793,0.444848,neighboring
3,hotel,property,0.802641,0.680755,0.81201,synonymous
4,dirty,filthy,0.877223,0.894301,0.888053,synonymous
5,washroom,bathroom,0.741017,0.80357,0.899494,synonymous
6,staff,staffs,0.816281,0.629267,0.944527,near_duplicates
7,calendar,calender,0.246457,0.481848,0.688518,near_duplicates
8,bathrroom,bathrooms,0.205031,0.485334,0.824829,near_duplicates


## Evaluate Similarity Between Phrases
https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.n_similarity.html
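The idea behind comparing two sets of words is to average the word vectors on each side and take the cosine similarity between the two averages. The sketch below shows this on made-up two-dimensional vectors; `toy_vectors` and `phrase_similarity` are hypothetical names used only for this illustration.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy_vectors = {  # hypothetical 2-d word vectors
    "polite": [0.9, 0.1],
    "rude": [0.8, 0.3],
    "staff": [0.2, 0.9],
}

def phrase_similarity(phrase1, phrase2):
    """Cosine similarity between the averaged vectors of two phrases."""
    v1 = mean_vector([toy_vectors[t] for t in phrase1.split()])
    v2 = mean_vector([toy_vectors[t] for t in phrase2.split()])
    return cosine(v1, v2)

print(phrase_similarity("polite staff", "rude staff"))
```

In this toy setup the two phrases score well above 0.6, because they share a word and the vectors for `polite` and `rude` were chosen to be close, as words used in similar contexts often are. That is worth keeping in mind when a fixed threshold is used to decide whether phrases with opposite meanings are "similar".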

In [None]:
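def get_phrase_similarity(p1,p2,model):
    """Return 1 if the two phrases are judged similar by the model, else 0"""

    # keep only the tokens that are in the model's vocabulary
    tokens_1=[t for t in p1.split() if t in model.wv.vocab]
    tokens_2=[t for t in p2.split() if t in model.wv.vocab]

    # without known tokens on both sides there is nothing to compare
    if len(tokens_1) == 0 or len(tokens_2) == 0:
        return 0

    # compute cosine similarity using word embeddings
    cosine=model.wv.n_similarity(tokens_1,tokens_2)

    # phrases count as similar above a fixed threshold of 0.6
    return 1 if cosine > 0.6 else 0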


df=pd.read_csv("similarity_test.txt")
df['cbow_sim']=df.apply(lambda x:get_phrase_similarity(x.phrase1,x.phrase2,model),axis=1)
df['skipgram_sim']=df.apply(lambda x:get_phrase_similarity(x.phrase1,x.phrase2,model_skipgram),axis=1)
df['skipgramsi_sim']=df.apply(lambda x:get_phrase_similarity(x.phrase1,x.phrase2,model_subword),axis=1)
display_html_table(df.to_html())

Unnamed: 0,phrase1,phrase2,similar,cbow_sim,skipgram_sim,skipgramsi_sim
0,polite staff,rude staff,0,1,1,1
1,friendly manager,rude manager,0,1,1,1
2,room was huge,large rooms,1,0,1,1
3,staff was friendly,very polite manager,1,1,1,1
4,bathroom was very dirty,filthy bathroom,1,1,1,1
5,clean and tidy rooms,the room was a mess,0,0,1,0
6,the views were awesome,the breakfast was nice,0,0,1,0
7,what lovely breakfast,friendly staff,0,0,0,0
8,would recommend,highly recommended,1,0,1,0
9,the manager was rude,staff were arrogant and rude,1,1,1,1


In [None]:
from sklearn.metrics import precision_recall_fscore_support

def get_prf(model_type,x_true,x_pred):
    """Compute precision, recall and f-score"""

    per_class_prf=precision_recall_fscore_support(x_true,x_pred,average='binary')

    precision = per_class_prf[0]
    recall = per_class_prf[1]
    fscore = per_class_prf[2]

    return {"a_model_type":model_type,"b_precision":precision,"c_recall":recall,"d_fscore":fscore}

results=[]
results.append(get_prf("cbow_sim",df['similar'].values,df['cbow_sim'].values))
results.append(get_prf("skipgram_sim",df['similar'].values,df['skipgram_sim'].values))
results.append(get_prf("skipgramsi_sim",df['similar'].values,df['skipgramsi_sim'].values))

df_results=pd.DataFrame(results)
display_html_table(df_results.to_html())

Unnamed: 0,a_model_type,b_precision,c_recall,d_fscore
0,cbow_sim,0.777778,0.583333,0.666667
1,skipgram_sim,0.75,1.0,0.857143
2,skipgramsi_sim,0.714286,0.416667,0.526316
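For readers who want to see what the three scores measure, here is a small from-scratch sketch of binary precision, recall and F-score. The labels below are hypothetical and are not taken from `similarity_test.txt`; the function `precision_recall_f1` is an illustration, not the scikit-learn implementation.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0]  # hypothetical gold labels
y_pred = [1, 1, 0, 1, 0]  # hypothetical model decisions
print(precision_recall_f1(y_true, y_pred))
```

A model that labels almost every pair as similar gets high recall at the cost of precision, so the three numbers are best read together.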


## Gensim Pretrained Models
https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models

Gensim also lets you use pre-trained GloVe and Word2Vec embeddings with its libraries. In addition, there are some reusable corpora that you can download and immediately use to train a Word2Vec embedding.

### Pre-trained: Twitter GloVe Embeddings

This first step downloads the pre-trained embeddings and loads them for re-use. Note that these are GloVe embeddings built from tweets, as the name suggests. These vectors are based on 2B tweets, 27B tokens, 1.2M vocab, uncased. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The `25` in the model name refers to the dimensionality of the vectors.

In [None]:
import gensim.downloader as api

In [None]:
model_glove_twitter = api.load("glove-twitter-25")

2022-07-05 11:02:09,145 : INFO : loading projection weights from C:\Users\rufra/gensim-data\glove-twitter-25\glove-twitter-25.gz
2022-07-05 11:02:54,420 : INFO : loaded (1193514, 25) matrix from C:\Users\rufra/gensim-data\glove-twitter-25\glove-twitter-25.gz


In [None]:
model_glove_twitter["trump"],model_glove_twitter['obama']

(array([-0.56174 ,  0.69419 ,  0.16733 ,  0.055867, -0.26266 , -0.6303  ,
        -0.28311 , -0.88244 ,  0.57317 , -0.82376 ,  0.46728 ,  0.48607 ,
        -2.1942  , -0.41972 ,  0.31795 , -0.70063 ,  0.060693,  0.45279 ,
         0.6564  ,  0.20738 ,  0.84496 , -0.087537, -0.38856 , -0.97028 ,
        -0.40427 ], dtype=float32),
 array([ 0.77126 ,  0.81259 , -0.5901  , -0.015908, -0.082797, -1.2261  ,
         0.098286,  0.087488,  0.012586, -0.35884 ,  0.80733 ,  0.12569 ,
        -4.0522  ,  0.14856 ,  0.6988  , -0.78948 , -0.77125 ,  0.49512 ,
         0.16366 , -0.9713  ,  0.95064 ,  0.19921 , -0.27903 , -1.6844  ,
        -0.79424 ], dtype=float32))

In [None]:
model_glove_twitter.most_similar("trump",topn=10)

2022-07-05 11:03:04,344 : INFO : precomputing L2-norms of word weight vectors


[('banks', 0.9113253355026245),
 ('warren', 0.9105228781700134),
 ('clinton', 0.8849892616271973),
 ('gates', 0.8760884404182434),
 ('founder', 0.8722785115242004),
 ('buffett', 0.8699301481246948),
 ('kerry', 0.8676391839981079),
 ('murdoch', 0.8675356507301331),
 ('reagan', 0.8649043440818787),
 ('newman', 0.8631280660629272)]

In [None]:
model_glove_twitter.most_similar("policies",topn=10)


[('policy', 0.9484813213348389),
 ('reforms', 0.9403933882713318),
 ('laws', 0.94012051820755),
 ('government', 0.9230710864067078),
 ('regulations', 0.9168934226036072),
 ('economy', 0.9110006093978882),
 ('immigration', 0.9105909466743469),
 ('legislation', 0.9089651107788086),
 ('govt', 0.9054746627807617),
 ('regulation', 0.9050778746604919)]

In [None]:
model_glove_twitter.doesnt_match(["trump","bernie","obama","pelosi","orange"])


'orange'

## Kaggle Tutorial

https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial/notebook