<a href="https://colab.research.google.com/github/shubham62025865/deeplearning/blob/main/Word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec

Word2Vec represents each word as a dense numerical vector (a distributed representation) whose values capture features of the word, learned from the contexts in which the word appears in the corpus. These word embeddings make it possible to relate a word to other words of similar meaning by comparing their vectors.

As seen in the image below, where word embeddings are plotted, words with similar meanings lie closer together in space, indicating their semantic similarity.

<img src = https://editor.analyticsvidhya.com/uploads/93033pic1.png >



## Word2Vec Architecture

The effectiveness of Word2Vec comes from its ability to group together the vectors of similar words. Given a large enough dataset, Word2Vec can make strong estimates about a word’s meaning based on its occurrences in the text. These estimates yield associations between that word and other words in the corpus. For example, the vectors for “King” and “Queen” would be very similar to one another. Algebraic operations on word embeddings can also approximate relationships between words. For example, taking the 2-dimensional embedding vector of "king", subtracting the vector of "man", and adding the vector of "woman" yields a vector very close to the embedding vector of "queen". Note that the values below were chosen arbitrarily.
```
King    -    Man    +    Woman    =    Queen
[5,3]   -    [2,1]  +    [3, 2]   =    [6,4]  
```
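
The same arithmetic can be checked with NumPy. This is only a sketch using the arbitrary 2-dimensional values from the example above; real embeddings are learned and usually have far more dimensions.

```python
import numpy as np

# Toy 2-dimensional embeddings; these are the arbitrary values from the example above.
king = np.array([5, 3])
man = np.array([2, 1])
woman = np.array([3, 2])
queen = np.array([6, 4])

# king - man + woman, computed element by element
result = king - man + woman
print(result)                         # [6 4]
print(np.array_equal(result, queen))  # True
```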

<img src = https://miro.medium.com/max/720/1*hnu-NqrK3C7wmYWcKXpb-Q.webp >

Two main architectures underlie the success of Word2Vec: skip-gram and CBOW.

## CBOW (Continuous Bag of Words)
This architecture is very similar to a feed-forward neural network. It tries to predict a target word from a list of context words. The intuition is simple: given the phrase "Have a great day", we choose “a” as the target word and [“have”, “great”, “day”] as the context words. The model then takes the distributed representations of the context words and uses them to predict the target word.
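
As a rough illustration of the CBOW input, the sketch below splits the example phrase into a target and its context words, then averages the context vectors into a single hidden representation. The embedding values, the dimension of 4, and the variable names are assumptions made for illustration; in a trained model the vectors are learned.

```python
import numpy as np

sentence = "have a great day".split()
target_index = 1  # position of the target word "a"
target = sentence[target_index]
context = [w for i, w in enumerate(sentence) if i != target_index]

# Hypothetical, randomly initialised embeddings (dimension 4 chosen for illustration)
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for w in sorted(set(sentence))}

# CBOW combines the context vectors, here by averaging, into one hidden vector
hidden = np.mean([embeddings[w] for w in context], axis=0)

print(target)        # a
print(context)       # ['have', 'great', 'day']
print(hidden.shape)  # (4,)
```

The hidden vector would then be passed to an output layer that scores every word in the vocabulary as a candidate target.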

<img src = https://miro.medium.com/max/640/1*_8Ul4ICaCtmZWPrWqH32Ow.webp >

## Continuous Skip-Gram Model
The skip-gram model is a simple neural network with one hidden layer, trained to predict the probability of a given word appearing near an input word. Intuitively, you can think of the skip-gram model as the opposite of the CBOW model: it takes the current word as input and tries to predict the words before and after it. In other words, the model learns to predict the context words around the specified input word. Experiments assessing this model found that increasing the range of context words considered around the input word improves the quality of the resulting word vectors, but it also increases the computational complexity. The process can be described visually as seen below.

<img src = https://miro.medium.com/max/828/1*M6UxaLSbNMeoDFWRN_kPeQ.webp >


As seen above, given some corpus of text, a target word is selected over a rolling window. The training data for the neural network consists of pairwise combinations of that target word with every other word in the window. Once the model is trained, it can give the probability of a word being a context word for a given target. The image below represents the architecture of the neural network for the skip-gram model.

<img src = https://miro.medium.com/max/828/1*UYAkOS9JQwdozQjCzttuow.webp >
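
The rolling-window pairing described earlier can be sketched in plain Python. The function name `skipgram_pairs` and the window size are hypothetical choices for illustration, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["have", "a", "great", "day"]
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
# ('have', 'a')
# ('a', 'have')
# ('a', 'great')
# ('great', 'a')
# ('great', 'day')
# ('day', 'great')
```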

The vocabulary of a corpus can be represented as a vector of size N, where each element corresponds to a unique word in the corpus. During training, each example is a pair of target and context words. The input array is a one-hot vector: it has 0 in every element except the position of the target word, which is set to 1. The hidden layer learns the embedding representation of each word, yielding a d-dimensional embedding space. The output layer is a dense layer with a softmax activation function. It yields a vector of the same size as the input, in which each element is a probability. This probability indicates how likely the associated vocabulary word is to appear in the context of the target word.
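
A minimal NumPy sketch of this forward pass follows, assuming a four-word vocabulary, an embedding dimension of 3, and randomly initialised weights. The names `W_in` and `W_out` are hypothetical, and no training is performed here.

```python
import numpy as np

vocab = ["have", "a", "great", "day"]
word_to_index = {w: i for i, w in enumerate(vocab)}
N, d = len(vocab), 3  # vocabulary size and (assumed) embedding dimension

rng = np.random.default_rng(42)
W_in = rng.normal(size=(N, d))   # hidden-layer weights: one d-dimensional row per word
W_out = rng.normal(size=(d, N))  # output-layer weights

# One-hot input vector for the target word "great"
x = np.zeros(N)
x[word_to_index["great"]] = 1.0

# Multiplying a one-hot vector by W_in simply selects that word's embedding row
hidden = x @ W_in

# Softmax over the output scores gives one probability per vocabulary word
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.shape)  # (4,)
```

After training, the rows of `W_in` are what is normally kept as the word embeddings.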

In [None]:
!pip install gensim



In [None]:
import pandas as pd
import numpy as np
import os
import gensim

import nltk
nltk.download("popular")
from nltk import sent_tokenize
from nltk.stem import WordNetLemmatizer

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

In [None]:
sample = '''
Shiva looked up at the Pandit, his eyes full of surprise and shame.

‘I know what you have done, Oh Neelkanth,’said the Pandit. ‘And I ask again, is it really so
bad?’

‘Don’t call me the Neelkanth,’ glared Shiva. ‘I don’t deserve the tide. I have the blood of
thousands on my hands.’

‘Many more than thousands have died,’said the Pandit. ‘Probably hundreds of thousands. But
you really think they wouldn’t have died if you hadn’t been around? Is the blood really on your
hands?’
'''
sample

'\nShiva looked up at the Pandit, his eyes full of surprise and shame.\n\n‘I know what you have done, Oh Neelkanth,’said the Pandit. ‘And I ask again, is it really so\nbad?’\n\n‘Don’t call me the Neelkanth,’ glared Shiva. ‘I don’t deserve the tide. I have the blood of\nthousands on my hands.’\n\n‘Many more than thousands have died,’said the Pandit. ‘Probably hundreds of thousands. But\nyou really think they wouldn’t have died if you hadn’t been around? Is the blood really on your\nhands?’\n'

In [None]:
# sentence tokenisation
# stop word removal
# punctuations
# lower case
# lemmatization



In [None]:
tokens = sent_tokenize(sample)

In [None]:
tokens

['\nShiva looked up at the Pandit, his eyes full of surprise and shame.',
 '‘I know what you have done, Oh Neelkanth,’said the Pandit.',
 '‘And I ask again, is it really so\nbad?’\n\n‘Don’t call me the Neelkanth,’ glared Shiva.',
 '‘I don’t deserve the tide.',
 'I have the blood of\nthousands on my hands.’\n\n‘Many more than thousands have died,’said the Pandit.',
 '‘Probably hundreds of thousands.',
 'But\nyou really think they wouldn’t have died if you hadn’t been around?',
 'Is the blood really on your\nhands?’']

In [None]:
rs = [remove_stopwords(token) for token in tokens]
rs

['Shiva looked Pandit, eyes surprise shame.',
 '‘I know done, Oh Neelkanth,’said Pandit.',
 '‘And I ask again, bad?’ ‘Don’t Neelkanth,’ glared Shiva.',
 '‘I don’t deserve tide.',
 'I blood thousands hands.’ ‘Many thousands died,’said Pandit.',
 '‘Probably hundreds thousands.',
 'But think wouldn’t died hadn’t around?',
 'Is blood hands?’']

In [None]:
sp = [simple_preprocess(sentence) for sentence in rs]
sp

[['shiva', 'looked', 'pandit', 'eyes', 'surprise', 'shame'],
 ['know', 'done', 'oh', 'neelkanth', 'said', 'pandit'],
 ['and', 'ask', 'again', 'bad', 'don', 'neelkanth', 'glared', 'shiva'],
 ['don', 'deserve', 'tide'],
 ['blood',
  'thousands',
  'hands',
  'many',
  'thousands',
  'died',
  'said',
  'pandit'],
 ['probably', 'hundreds', 'thousands'],
 ['but', 'think', 'wouldn', 'died', 'hadn', 'around'],
 ['is', 'blood', 'hands']]

In [None]:
lemma = WordNetLemmatizer()  # create the lemmatizer here so this cell runs on its own
[[lemma.lemmatize(word) for word in sentence] for sentence in sp]

[['shiva', 'looked', 'pandit', 'eye', 'surprise', 'shame'],
 ['know', 'done', 'oh', 'neelkanth', 'said', 'pandit'],
 ['and', 'ask', 'again', 'bad', 'don', 'neelkanth', 'glared', 'shiva'],
 ['don', 'deserve', 'tide'],
 ['blood', 'thousand', 'hand', 'many', 'thousand', 'died', 'said', 'pandit'],
 ['probably', 'hundred', 'thousand'],
 ['but', 'think', 'wouldn', 'died', 'hadn', 'around'],
 ['is', 'blood', 'hand']]

In [None]:
tokens[6]

'But\nyou really think they wouldn’t have died if you hadn’t been around?'

In [None]:
rs = remove_stopwords(tokens[6])
rs

'But think wouldn’t died hadn’t around?'

In [None]:
sp = simple_preprocess(rs)
sp

['but', 'think', 'wouldn', 'died', 'hadn', 'around']



In [None]:
lemma = WordNetLemmatizer()
# shiva = []

# open txt file with python

with open("/content/shiva_01.txt") as f:
  corpus = f.read()
  # sentence level tokenization
  tokens = sent_tokenize(corpus)

  # stop_words removal
  rs = [remove_stopwords(token) for token in tokens]

  # simple preprocess
  sp = [simple_preprocess(sentence) for sentence in rs]

  # lemmatization
  shiva = [[lemma.lemmatize(word) for word in sentence] for sentence in sp]

In [None]:
shiva

In [None]:
shiva = []

with open("/content/shiva_01.txt") as f:
  corpus = f.read()
  tokens = sent_tokenize(corpus)
  for token in tokens:
    rs = remove_stopwords(token)
    sp = simple_preprocess(rs)
    shiva.append(sp)

In [None]:
shiva

In [None]:
corpus

https://radimrehurek.com/gensim/parsing/preprocessing.html#

https://radimrehurek.com/gensim/utils.html

In [None]:
shiva[:5]

[['chapter', 'he', 'come'],
 ['bc',
  'mansarovar',
  'lake',
  'at',
  'foot',
  'mount',
  'kailash',
  'tibet',
  'shiva',
  'gazed',
  'orange',
  'sky'],
 ['the',
  'cloud',
  'hovering',
  'mansarovar',
  'parted',
  'reveal',
  'setting',
  'sun'],
 ['the', 'brilliant', 'giver', 'life', 'calling', 'day', 'again'],
 ['shiva', 'seen', 'sunrise', 'twenty', 'one', 'year']]

In [None]:
a = [[1,2,3],[4,5,6],[],[7,8,9],[]]

In [None]:
count =0
for x in a:
  if len(x) == 0:
    count+=1
count

2

In [None]:
len(shiva)

7901

In [None]:
count = 0
for x in shiva:
  if len(x) ==0:
    count+=1

count

0

https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=3
)

Parameters:

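- window: The maximum distance between the current word and a predicted word within a sentence, i.e. how many words before and after the target word are treated as context.
- min_count: The minimum total frequency a word must have in the corpus to be kept in the vocabulary. Words that occur fewer times than this are ignored during training.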
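
To make the effect of `min_count` concrete, the sketch below applies the same kind of frequency filter by hand with `collections.Counter`. It only mimics the idea on made-up sentences and is not how gensim implements it internally.

```python
from collections import Counter

# Made-up tokenised sentences, for illustration only
sentences = [
    ["shiva", "looked", "pandit"],
    ["pandit", "said", "shiva"],
    ["shiva", "smiled"],
]
min_count = 2

# Count how often each word occurs across the whole corpus
counts = Counter(word for sentence in sentences for word in sentence)

# Keep only the words that reach the frequency threshold
vocab = sorted(word for word, count in counts.items() if count >= min_count)
print(vocab)  # ['pandit', 'shiva']
```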

In [None]:
model.build_vocab(shiva)

In [None]:
model.corpus_count

7901

In [None]:
model.train(shiva, total_examples=model.corpus_count, epochs=20)

(899037, 1121920)

In [None]:
model.corpus_count

7901

model.wv.most_similar('<word>')

This method returns the words most similar to the word passed to it. By default it returns the top 10 (`topn=10`).
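
Conceptually, this kind of lookup ranks every other word by its cosine similarity to the query word. The sketch below shows the idea on hypothetical 2-dimensional vectors; the helper names `cosine` and `most_similar` are made up here, and this is not gensim's code.

```python
import numpy as np

# Hypothetical toy embeddings, chosen by hand for illustration
vectors = {
    "sword": np.array([1.0, 0.0]),
    "knife": np.array([0.9, 0.1]),
    "shield": np.array([0.7, 0.7]),
    "river": np.array([0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("sword")
print([word for word, _ in ranked])  # ['knife', 'shield']
```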

In [None]:
model.wv.most_similar('shiva')

[('sati', 0.9506832361221313),
 ('ayurvati', 0.9458413124084473),
 ('parvateshwar', 0.9420881271362305),
 ('pandit', 0.9318587183952332),
 ('brahaspati', 0.9285432696342468),
 ('krittika', 0.9249944686889648),
 ('daksha', 0.9151716232299805),
 ('surprised', 0.9114599823951721),
 ('smile', 0.9104136824607849),
 ('bhadra', 0.9091978669166565)]

In [None]:
model.wv.most_similar('sword')

[('knife', 0.9774574041366577),
 ('shield', 0.975297749042511),
 ('moved', 0.9733710885047913),
 ('brought', 0.9623579978942871),
 ('side', 0.9613814949989319),
 ('left', 0.9572250247001648),
 ('drew', 0.953218936920166),
 ('distance', 0.9529476761817932),
 ('arm', 0.9493415355682373),
 ('reached', 0.9475962519645691)]

In [None]:
model.wv.most_similar('saraswati')

[('water', 0.9891326427459717),
 ('karachapa', 0.9859890937805176),
 ('narmada', 0.9856348037719727),
 ('led', 0.9841127991676331),
 ('equal', 0.9833917617797852),
 ('indus', 0.9829319715499878),
 ('one', 0.9820099472999573),
 ('surrounding', 0.979999840259552),
 ('extravagantly', 0.9796984791755676),
 ('south', 0.9796000719070435)]

In [None]:
model.wv.most_similar('bhagirath')

[('beamed', 0.9950608015060425),
 ('confused', 0.9950509071350098),
 ('supported', 0.9949643611907959),
 ('concern', 0.9946037530899048),
 ('while', 0.9945772886276245),
 ('guilt', 0.9945439100265503),
 ('agnibaan', 0.9945375919342041),
 ('caused', 0.9944381713867188),
 ('keeper', 0.9943970441818237),
 ('implication', 0.9941989779472351)]

In [None]:
model.wv.most_similar('meluha')

[('society', 0.9830188155174255),
 ('people', 0.9788488149642944),
 ('person', 0.9786986708641052),
 ('believe', 0.9759514927864075),
 ('and', 0.973882794380188),
 ('life', 0.9695990085601807),
 ('somras', 0.9686427116394043),
 ('u', 0.9628500938415527),
 ('way', 0.9627066850662231),
 ('country', 0.9612454175949097)]

model.wv.doesnt_match([array of words])

This method returns the word that is least similar to the other words in the list.
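
One way to picture this is to normalise each word vector to unit length, average them, and pick the word whose vector agrees least with that average. The sketch below follows that idea on hypothetical toy vectors; it is an approximation for intuition, and the helper name `doesnt_match` is reused here only for readability.

```python
import numpy as np

# Hypothetical toy embeddings, chosen by hand for illustration
vectors = {
    "sword": np.array([1.0, 0.0]),
    "knife": np.array([0.9, 0.1]),
    "river": np.array([0.0, 1.0]),
}

def doesnt_match(words):
    """Return the word whose unit vector is least aligned with the group mean."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    scores = unit @ mean  # dot product of each word with the mean vector
    return words[int(np.argmin(scores))]

odd_one = doesnt_match(["sword", "knife", "river"])
print(odd_one)  # river
```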

In [None]:
model.wv.doesnt_match(['shiva', 'sati', "krittika"])

'shiva'

model.wv.similarity('first word', 'second word')


This method returns the cosine similarity between two words. The closer the value is to 1, the more similar the words are.
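
For reference, cosine similarity between two word vectors is conventionally defined as follows. The symbols $u$, $v$, and $d$ (the embedding dimension) are notation introduced here, not taken from the notebook.

$$
\text{similarity}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \, \sqrt{\sum_{i=1}^{d} v_i^2}}
$$

As a small worked example, for $u = (1, 0)$ and $v = (1, 1)$ the dot product is $1$ and the norms are $1$ and $\sqrt{2}$, so the similarity is $1 / \sqrt{2} \approx 0.707$.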

In [None]:
model.wv.similarity("krittika","sati")

0.978521

In [None]:
model.wv.similarity("shiva","sati")

0.9506833

In [None]:
model.wv.similarity("krittika","warrior")

0.49806416

In [None]:
model.wv.similarity("parvateshwar","warrior")

0.4553571

model.wv['<word>']:

Returns the vector representation of a word that exists in the vocabulary.

In [None]:
model.wv["shiva"]

array([ 0.00543561, -0.23771565,  0.9131056 ,  0.31328544, -0.4621426 ,
       -0.6545937 ,  0.08630609,  0.49457797, -0.5719148 , -0.5421846 ,
        0.35363135, -0.4895273 , -0.1379944 ,  0.30238163, -0.08937825,
       -0.25313798,  0.40557015, -0.4270343 , -0.14119995, -0.8741759 ,
        0.877691  ,  0.8957698 ,  0.68501717, -0.11159179,  0.31111106,
        0.09970777, -0.35146347,  0.02964295, -0.16365702,  0.04136806,
        0.08374083,  0.19053611, -0.05081322, -0.09916646, -0.2665848 ,
        0.9192197 ,  0.2264377 , -0.49741697, -0.43907112, -1.1605176 ,
       -0.09629039, -0.578658  , -0.512166  , -0.09168068,  0.33681706,
        0.21794127, -0.4603607 ,  0.6652997 ,  0.3787522 , -0.05612567,
        0.14346245, -0.14787324,  0.19197871, -0.57200986, -0.42425135,
        0.23468103,  0.05179806,  0.27557212, -0.02733316,  0.3486072 ,
       -0.3917228 , -0.11519313, -0.29998434, -0.5995471 , -0.46796036,
        0.51135015, -0.27743217,  0.6505596 , -0.5845145 ,  0.35

In [None]:
model.wv["shiva"].shape

(100,)

In [None]:
model.wv.vectors.shape

(3000, 100)