<h1>Word2Vec</h1>
Word2Vec is a technique used to generate word embeddings, which are vector representations of words that capture semantic relationships between them. These embeddings are learned from a large corpus of text using shallow neural networks, allowing for mathematical operations on words and the ability to infer meaning based on context. <br><br>
Here is a more detailed explanation:<br>
Core Idea: Word2Vec aims to represent words as vectors in a multi-dimensional space, where similar words are located closer to each other. This allows for arithmetic on word vectors: the vector for "king - man + woman" lies close to the vector for "queen". <br>
How it Works:<br>
Word Embeddings: Word2Vec learns these vector representations, also known as word embeddings, by analyzing the context in which words appear in a large text corpus. <br>
Neural Networks: It uses a shallow neural network in one of two architectures, Continuous Bag-of-Words (CBOW) or Skip-gram, to learn these embeddings. <br>
CBOW: Predicts a target word based on its surrounding context words. <br>
Skip-gram: Predicts the surrounding context words given a target word. <br>
Semantic Relationships: Words with similar meanings will have similar vector representations, enabling the capture of semantic relationships. 
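
To make the difference between CBOW and Skip-gram concrete, the sketch below builds the training examples each architecture would see from a single sentence. It is only an illustration: the function names `context_window`, `cbow_pairs` and `skipgram_pairs`, the sample sentence, and the window size of 2 are assumptions made for this example, and real implementations add further details such as subsampling and negative sampling.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens, window=2):
    """CBOW: one example per position, (context words) -> target word."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one example per (target word, context word) pair."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


sentence = "the quick brown fox jumps".split()

for context, target in cbow_pairs(sentence):
    print("CBOW     ", context, "->", target)

for target, context in skipgram_pairs(sentence):
    print("Skip-gram", target, "->", context)
```

CBOW produces one example per word position, while Skip-gram produces one example per (target, context) pair, so the same sentence yields more Skip-gram examples.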

<h1>Demo</h1><br>We will use the pre-trained word2vec vectors that were trained on part of the Google News
corpus (about 100 billion words). This model consists of 300-dimensional vectors
for 3 million words and phrases.

In [None]:
import gensim


In [None]:
from gensim.models import Word2Vec,KeyedVectors

In [3]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: wget
  Building wheel for wget (setup.py): started
  Building wheel for wget (setup.py): finished with status 'done'
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9712 sha256=286c4955e0bf9cbefe996c409941ba849dee887823f4159c7c8a810c7afe2544
  Stored in directory: c:\users\sinha\appdata\local\pip\cache\wheels\40\b3\0f\a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


  DEPRECATION: Building 'wget' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'wget'. Discussion can be found at https://github.com/pypa/pip/issues/6334


In [4]:
# `pip install wget` adds a Python module, not a `wget` shell command, so call the module directly
import wget
wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")


In [None]:
# Load the pre-trained Word2Vec vectors (trained on the Google News dataset)
# The file holds 300-dimensional vectors for 3 million words and phrases in binary format
# KeyedVectors stores only the vectors, not the full trainable model, which saves memory
# limit=500000 loads only the first 500,000 vectors to reduce memory use further

model = KeyedVectors.load_word2vec_format(r'D:\CODING\PYTHON\NLP\GoogleNews-vectors-negative300.bin', binary=True, limit=500000)

In [None]:
# Example usage of the model
# Indexing the model with a word returns that word's embedding
# For this model the result is a NumPy array of 300 floats

model['cricket']

array([-3.67187500e-01, -1.21582031e-01,  2.85156250e-01,  8.15429688e-02,
        3.19824219e-02, -3.19824219e-02,  1.34765625e-01, -2.73437500e-01,
        9.46044922e-03, -1.07421875e-01,  2.48046875e-01, -6.05468750e-01,
        5.02929688e-02,  2.98828125e-01,  9.57031250e-02,  1.39648438e-01,
       -5.41992188e-02,  2.91015625e-01,  2.85156250e-01,  1.51367188e-01,
       -2.89062500e-01, -3.46679688e-02,  1.81884766e-02, -3.92578125e-01,
        2.46093750e-01,  2.51953125e-01, -9.86328125e-02,  3.22265625e-01,
        4.49218750e-01, -1.36718750e-01, -2.34375000e-01,  4.12597656e-02,
       -2.15820312e-01,  1.69921875e-01,  2.56347656e-02,  1.50146484e-02,
       -3.75976562e-02,  6.95800781e-03,  4.00390625e-01,  2.09960938e-01,
        1.17675781e-01, -4.19921875e-02,  2.34375000e-01,  2.03125000e-01,
       -1.86523438e-01, -2.46093750e-01,  3.12500000e-01, -2.59765625e-01,
       -1.06933594e-01,  1.04003906e-01, -1.79687500e-01,  5.71289062e-02,
       -7.41577148e-03, -

In [None]:
# most_similar method can be used to find words similar to a given word
# This will return the top 10 most similar words
# You can use this to find synonyms or related terms


model.most_similar('man')

[('woman', 0.7664012908935547),
 ('boy', 0.6824871301651001),
 ('teenager', 0.6586930155754089),
 ('teenage_girl', 0.6147903203964233),
 ('girl', 0.5921714305877686),
 ('robber', 0.5585119128227234),
 ('Robbery_suspect', 0.5584409832954407),
 ('teen_ager', 0.5549196600914001),
 ('men', 0.5489763021469116),
 ('guy', 0.5420035123825073)]

In [9]:
model.most_similar('cricket')

[('cricketing', 0.8372225761413574),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068488240242004),
 ('Twenty##', 0.7624265551567078),
 ('Cricket', 0.75413978099823),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316356897354126),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.6987985968589783)]

In [None]:
model.most_similar('facebook') 

[('Facebook', 0.7563533186912537),
 ('FaceBook', 0.7076998949050903),
 ('twitter', 0.6988552212715149),
 ('myspace', 0.6941817998886108),
 ('Twitter', 0.664244532585144),
 ('Facebook.com', 0.6529868245124817),
 ('FacebookFacebook', 0.6162722110748291),
 ('facebook.com', 0.6135972142219543),
 ('Twitter.com', 0.6102108359336853),
 ('TwitterTwitter', 0.6085205674171448)]

In [None]:
# You can also measure the similarity between two words
# similarity() returns the cosine similarity of the two word vectors, a float between -1 and 1
# A value close to 1 means the words are similar; a value near 0 or below means they have little in common
# This is useful for understanding the semantic relationship between words

model.similarity('man', 'woman')

0.76640123

In [12]:
model.similarity('man','PHP')

-0.032995153
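
The score that `similarity` returns is the cosine of the angle between the two word vectors. A minimal sketch of that calculation in plain Python follows. `cosine_similarity` is a helper written for this example, and the short vectors are made up rather than taken from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]    # same direction as a
c = [0.0, 0.0, 3.0]    # orthogonal to a
d = [-1.0, -2.0, 0.0]  # opposite direction to a

print(cosine_similarity(a, b))  # same direction: close to 1
print(cosine_similarity(a, c))  # orthogonal: 0
print(cosine_similarity(a, d))  # opposite direction: close to -1
```

With the model loaded, `cosine_similarity(model['man'], model['woman'])` should give approximately the same value as `model.similarity('man', 'woman')`.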

In [None]:
# You can also find the word that does not belong in a list
# doesnt_match() returns the word that is least similar to the rest of the group
# For example, given two programming languages and one animal, it should return the animal
# This is useful for spotting outliers or unrelated terms in a set of words

model.doesnt_match(['PHP', 'java', 'monkey'])

'monkey'
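
One way to picture what `doesnt_match` is doing: scale each word vector to unit length, average them, and report the word whose vector is least similar to that average. The sketch below follows that idea on made-up 2-dimensional vectors. `unit`, `odd_one_out` and the toy vectors are assumptions for this example, and gensim's own implementation may differ in detail.

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the key whose unit vector has the smallest dot product with the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)


# Made-up vectors: the two "language" vectors point in nearly the same direction.
toy = {
    "PHP": [0.9, 0.1],
    "java": [0.8, 0.2],
    "monkey": [0.1, 0.9],
}

print(odd_one_out(toy))
```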

In [None]:
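# You can also perform vector arithmetic with the model
# For example, king - man + woman should give a vector close to 'queen'
# The input word 'king' can also appear in the results, because a raw vector query does not exclude it

vec = model['king'] - model['man'] + model['woman']
model.most_similar([vec])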

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('queens', 0.5289887189865112),
 ('ruler', 0.5247419476509094)]
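
In the output above, 'king' ranks first because a raw query vector is compared against every word in the vocabulary, including the words used to build it. Analogy lookups usually skip the input words. gensim's `most_similar` accepts `positive` and `negative` word lists for this, for example `model.most_similar(positive=['king', 'woman'], negative=['man'])`. The sketch below shows the same idea on made-up 2-dimensional vectors. The `analogy` and `cosine` helpers and the toy vectors are assumptions for illustration, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, a, b, c, topn=3):
    """Rank words by similarity to vectors[a] - vectors[b] + vectors[c], skipping a, b and c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]


# Made-up vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.1],
}

print(analogy(toy, "king", "man", "woman"))
```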

In [None]:
# You can apply the same arithmetic to other relationships
# For example, INR - India + England asks which currency is to England what INR is to India
# This returns the top 10 most similar words to the resulting vector
# This is useful for exploring relationships between countries and other entities in the model

vec = model['INR'] - model['India'] + model['England']
model.most_similar([vec])

[('INR', 0.6442341208457947),
 ('GBP', 0.5040826797485352),
 ('England', 0.44649264216423035),
 ('£', 0.43340998888015747),
 ('Â_£', 0.4307197630405426),
 ('£_#.##m', 0.42561301589012146),
 ('Pounds_Sterling', 0.42512619495391846),
 ('GBP##', 0.42464491724967957),
 ('stg', 0.42324796319007874),
 ('£_#.###m', 0.4201711118221283)]