
In [None]:
import warnings
warnings.filterwarnings('ignore')

# Word Embeddings
## Word2Vec

In [None]:
# First, you'll need to install gensim
# !pip install gensim

# Import the necessary modules

from gensim.test.utils import common_texts

from gensim.models import Word2Vec

In [None]:
print(common_texts) #Sample Data

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


Word2Vec accepts several parameters that affect both training speed and quality.

One of them prunes the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there is not enough data to train meaningful vectors for those words, so it is best to ignore them:

`model = Word2Vec(sentences, min_count=10)  # default value is 5`

A reasonable value for `min_count` is between 0 and 100, depending on the size of your dataset.
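To make the effect of `min_count` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary pruning. It is only an illustration of the idea, not Gensim's internal implementation; the names `toy_sentences` and `prune_vocabulary` are hypothetical.

```python
from collections import Counter

# Toy corpus: a list of tokenized sentences (hypothetical data).
toy_sentences = [
    ["human", "interface", "computer"],
    ["user", "computer", "system"],
    ["system", "human", "system"],
    ["trees"],
]

def prune_vocabulary(sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

kept = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(kept))  # ['computer', 'human', 'system']
```

With `min_count=2`, the words that appear only once ("interface", "user", "trees") are dropped before any training would happen.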

Another parameter is `vector_size`, the dimensionality of the word vectors (the size of the NN hidden layer), which corresponds to the “degrees of freedom” the training algorithm has:

`model = Word2Vec(sentences, vector_size=200)  # default value is 100`

Larger `vector_size` values require more training data but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

Other hyper-parameters:

*   window: Maximum distance between the target word and the context words around it (`window=window_size`).

*   sample: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).

*   workers: Number of worker threads used to train the model (faster training on multicore machines).

*   sg: Training algorithm: skip-gram if sg=1, otherwise CBOW.

*   epochs: Number of iterations (epochs) over the corpus. This parameter was named `iter` in Gensim versions before 4.0.
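The role of the context window in skip-gram training can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that every word within `window` positions of the target counts as context; real implementations add details such as subsampling and negative sampling that are omitted here. The helper name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["survey", "user", "computer", "system"]
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

A larger `window` produces more pairs per target word, which is why it affects both training time and the kind of similarity the vectors capture.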


In [None]:
# vector_size=10 sets the length of each word embedding
model = Word2Vec(sentences=common_texts, vector_size=10, window=5, min_count=1, workers=4)
model.save("word2vec.model")

If you save the model, you can load it and continue training it later:

In [None]:
# load the saved model
model = Word2Vec.load("word2vec.model")

The trained word vectors are stored in a `KeyedVectors` instance, available as `model.wv`:

In [None]:
# Get the embeddings for the word 'human'
embedding = model.wv['human']

print(embedding)
print(len(embedding))

[-0.00410223 -0.08368949 -0.05600012  0.07104538  0.0335254   0.0722567
  0.06800248  0.07530741 -0.03789154 -0.00561806]
10


In [None]:
# Get the most similar words (those with the most similar embeddings)
similar_words = model.wv.most_similar('human', topn=3)  # topn=3 returns the 3 most similar words
print(similar_words)

[('graph', 0.3586882948875427), ('system', 0.22743132710456848), ('time', 0.1153423935174942)]
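The scores returned by `most_similar` are cosine similarities. As a rough sketch of what such a ranking involves, the snippet below computes cosine similarity by hand and sorts the candidates. The 2-dimensional vectors in `toy_vectors` are made up for illustration and are not taken from the trained model; `cosine` and `top_similar` are hypothetical helpers.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional embeddings, chosen only for illustration.
toy_vectors = {
    "human": [1.0, 0.0],
    "graph": [1.0, 1.0],
    "system": [0.0, 1.0],
    "time": [-1.0, 0.0],
}

def top_similar(word, vectors, topn):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = top_similar("human", toy_vectors, topn=2)
print(ranked)
```

Here "graph" ranks first because its vector points in the direction closest to that of "human", while "time" points the opposite way and scores lowest.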


In [None]:
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")

In [None]:
# Load back with memory-mapping = read-only, shared across processes.
from gensim.models import KeyedVectors
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
wv['computer']  # Get numpy vector embedding for 'computer'

array([ 0.0163195 ,  0.00189972,  0.03474648,  0.00217841,  0.09621626,
        0.05062076, -0.08919986, -0.0704361 ,  0.00901718,  0.06394394],
      dtype=float32)

### Refer to the link below for more details:
https://radimrehurek.com/gensim/models/word2vec.html

# Gensim comes with several pre-trained models in the Gensim-data repository

In [None]:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [None]:
# Download the "glove-twitter-25" embeddings
# Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.
glove_vectors = gensim.downloader.load('glove-twitter-25')
glove_vectors



<gensim.models.keyedvectors.KeyedVectors at 0x7d8a02e6b700>

In [None]:
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')

[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863)]

# Document/Sentence Embeddings
Paragraph, Sentence, and Document embeddings

## Doc2vec

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Define your sentences (example)
sentences = ["this is the first sentence", "this is the second sentence", "yet another sentence", "one more sentence", "and the final sentence"]

# Tag the sentences for training
tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)]) for i, sentence in enumerate(sentences)]

# Train the model
model = Doc2Vec(tagged_data, vector_size=10, window=2, min_count=1, workers=4)

# Get the embeddings for the sentences
sentence_vectors = [model.infer_vector(sentence.split()) for sentence in sentences]
# infer_vector expects its input as a list of word tokens (e.g. the output of nltk.word_tokenize())

print("Sentence Embeddings:")
print(sentence_vectors) #Embeddings of the sentences

import numpy as np
print("\nShape:")
print(np.array(sentence_vectors).shape)

Sentence Embeddings:
[array([-0.02210109,  0.01075956,  0.01045155, -0.00168552,  0.04389304,
       -0.01646093, -0.01722671,  0.00359221, -0.03919   ,  0.04840723],
      dtype=float32), array([-0.04275137, -0.04523604,  0.00160485, -0.04087964, -0.00095917,
        0.01051954,  0.02245842, -0.01437612,  0.04036413, -0.03224698],
      dtype=float32), array([ 0.02298321, -0.00912871, -0.03395214, -0.03105471,  0.02194284,
       -0.01394829, -0.02887552, -0.04132728,  0.0022589 , -0.03036125],
      dtype=float32), array([ 0.04444262, -0.04302356, -0.02289297, -0.03036175, -0.03440027,
       -0.02493767, -0.04262125,  0.01890945, -0.04977329,  0.02532519],
      dtype=float32), array([ 0.04400432, -0.02729604,  0.0402323 ,  0.03534522, -0.0328272 ,
        0.00672655,  0.03224795,  0.0401442 ,  0.00959703,  0.01975554],
      dtype=float32)]

Shape:
(5, 10)


In [None]:
print(sentence_vectors[0]) #the first embedding

[-0.02210109  0.01075956  0.01045155 -0.00168552  0.04389304 -0.01646093
 -0.01722671  0.00359221 -0.03919     0.04840723]


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# Cosine similarity between the embeddings of the second and third sentences
cosine_similarity(sentence_vectors[1].reshape(1, -1), sentence_vectors[2].reshape(1, -1))[0][0]

0.18947186

In [None]:
# Find the similarity between all the sentences
similarity = cosine_similarity(sentence_vectors)
similarity

array([[ 1.        , -0.3741152 ,  0.3242597 , -0.03134688, -0.41970065],
       [-0.3741152 ,  1.        ,  0.19963288,  0.09588649,  0.08309498],
       [ 0.3242597 ,  0.19963288,  0.99999994, -0.3109943 , -0.09789877],
       [-0.03134688,  0.09588649, -0.3109943 ,  1.        ,  0.44980752],
       [-0.41970065,  0.08309498, -0.09789877,  0.44980752,  0.9999999 ]],
      dtype=float32)
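Reading such a matrix row by row gives the nearest neighbour of every sentence at once. The sketch below does this in plain Python on the values printed above, rounded to four decimals; the diagonal is skipped because each sentence is trivially most similar to itself. The helper name `most_similar_per_row` is hypothetical.

```python
# Similarity matrix copied from the output above, rounded to four decimals.
similarity_matrix = [
    [1.0000, -0.3741, 0.3243, -0.0313, -0.4197],
    [-0.3741, 1.0000, 0.1996, 0.0959, 0.0831],
    [0.3243, 0.1996, 1.0000, -0.3110, -0.0979],
    [-0.0313, 0.0959, -0.3110, 1.0000, 0.4498],
    [-0.4197, 0.0831, -0.0979, 0.4498, 1.0000],
]

def most_similar_per_row(matrix):
    """For each row i, return (j, score) of the best match with j != i."""
    best = []
    for i, row in enumerate(matrix):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        score, j = max(candidates)
        best.append((j, score))
    return best

neighbours = most_similar_per_row(similarity_matrix)
for i, (j, score) in enumerate(neighbours):
    print(f"sentence {i} -> sentence {j} (cosine similarity {score})")
```

For row 0 the best off-diagonal entry is at index 2, so the first sentence is closest to the third one under these embeddings.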

In [None]:
# Find the sentence most similar to the first sentence (at index 0)
idx = 0  # Index of the sentence for which you want to find the most similar sentence
max_sim = float("-inf")  # Highest cosine similarity found so far (avoids shadowing the built-in max)
max_idx = -1  # Index of the most similar sentence found so far
print("Input Sentence -->", sentences[idx])
for i in range(len(sentence_vectors)):
    if i == idx:
        continue  # Skip comparing the sentence with itself
    sim = cosine_similarity(sentence_vectors[i].reshape(1, -1),
                            sentence_vectors[idx].reshape(1, -1))[0][0]
    if sim > max_sim:
        max_sim = sim
        max_idx = i

print("Most Similar Sentence -->", sentences[max_idx])
print("Cosine Similarity:", max_sim)

Input Sentence --> this is the first sentence
Most Similar Sentence --> yet another sentence
Cosine Similarity: 0.3242597


#### More about Doc2vec here:
https://radimrehurek.com/gensim/models/doc2vec.html