<h1> Word Embedding</h1>

<p>Word embedding is a technique in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers. These vectors capture semantic meaning, so a model can use them to reason about context, similarity, and relationships between words.</p>
<p>The basic idea behind word embeddings is that words used in similar contexts tend to have similar meanings.</p>
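
<p>A minimal sketch of how similarity between word vectors can be measured, using cosine similarity. The three-dimensional vectors below are invented for illustration only; they are not the output of any trained model, and the names <code>toy_embeddings</code> and <code>cosine_similarity</code> are hypothetical.</p>

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "lion":  [0.9, 0.8, 0.1],
    "tiger": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

lion_tiger = cosine_similarity(toy_embeddings["lion"], toy_embeddings["tiger"])
lion_car = cosine_similarity(toy_embeddings["lion"], toy_embeddings["car"])

# "lion" and "tiger" point in nearly the same direction, so their
# similarity is higher than that of "lion" and "car".
print(f"lion vs tiger: {lion_tiger:.3f}")
print(f"lion vs car:   {lion_car:.3f}")
```

<p>Real embeddings usually have tens to hundreds of dimensions, but the comparison works the same way.</p>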

<h2>Techniques of Word Embedding</h2>
<ul>
    <li>Word2Vec</li>
    <li>GloVe</li>
    <li>FastText</li>
    <li>BERT</li>
</ul>

<h4>Word2Vec</h4>
<p>
    Word2Vec is a word embedding technique developed by Google. The fundamental idea is to represent words in a corpus of text as vectors in a continuous vector space, where semantically similar words are mapped to nearby points. <br>For example, let's take the following corpus:
    <br>
    <i>'Lions are the <span style="color:blue; font-weight: bold;">only big cats</span> that live in social groups called prides, which usually consist of related females, their offspring, and a few adult males. Known as the "King of the Jungle," a lion's roar can be heard up to 5 miles away, serving to communicate with pride members and ward off intruders.'</i><br>
    
    To convert the above corpus to word vectors using Word2Vec, we first choose a window size, which determines how many context words around a target word are considered. Here we use a window of 3 words: the target word plus one word on either side, as highlighted in blue above. We can then use one of two approaches:
</p>
<p>
    <ol>
        <li><strong>Continuous Bag of Words (CBOW):</strong> This approach uses the surrounding context words to predict the target word. In our example, we take "only" and "cats" and try to predict "big". CBOW is generally faster to train and tends to do slightly better on frequent words.</li>
        <li><strong>Skip-Gram:</strong> This approach uses the target word to predict the surrounding context words. In our example, we take "big" and try to predict "only" and "cats". Skip-Gram is slower to train, but it works well even with small amounts of training data and is better at representing rare words.</li>
    </ol>
    Training the neural network with either approach on all the words in the corpus adjusts its weights to minimize the prediction error. After training, we extract the word embeddings, which are the weights of the network's hidden layer: each word's embedding is the row of that weight matrix belonging to the word. These embeddings capture the semantic relationships between words and can be used for various natural language processing tasks, such as text classification, clustering, and sentiment analysis.
</p>
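
<p>The two training setups can be sketched in plain Python. This is an illustrative sketch, not Gensim's internals: the helper names <code>context_indices</code>, <code>cbow_pairs</code>, and <code>skipgram_pairs</code> are hypothetical, and <code>window</code> here is assumed to mean the number of words taken on each side of the target, so <code>window=1</code> reproduces the "only big cats" example.</p>

```python
def context_indices(n_tokens, i, window):
    """Indices of the words within `window` positions of index i."""
    start = max(0, i - window)
    end = min(n_tokens, i + window + 1)
    return [j for j in range(start, end) if j != i]

def cbow_pairs(tokens, window):
    """CBOW: one (context words, target word) example per position."""
    return [
        ([tokens[j] for j in context_indices(len(tokens), i, window)], target)
        for i, target in enumerate(tokens)
    ]

def skipgram_pairs(tokens, window):
    """Skip-Gram: one (target word, context word) pair per context word."""
    return [
        (target, tokens[j])
        for i, target in enumerate(tokens)
        for j in context_indices(len(tokens), i, window)
    ]

tokens = ["lions", "are", "the", "only", "big", "cats"]

# Training examples centred on "big" (index 4), one word on each side.
print(cbow_pairs(tokens, window=1)[4])
# (['only', 'cats'], 'big')
print([pair for pair in skipgram_pairs(tokens, window=1) if pair[0] == "big"])
# [('big', 'only'), ('big', 'cats')]
```

<p>CBOW produces one training example per target word, while Skip-Gram produces one per (target, context) pair, which is one reason Skip-Gram takes longer to train on the same text.</p>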
<p>
    Word2Vec's ability to capture semantic meaning makes it useful in many applications. For instance, feeding Word2Vec embeddings to a machine learning model in place of arbitrary word IDs gives the model representations in which words that appear in similar contexts lie close together, which can improve its performance.
</p>



#### Implementation of Word2Vec with Gensim
I will implement Word2Vec for our corpus using Gensim, an open-source Python library for natural language processing (NLP) that is designed to handle large text datasets efficiently.

In [25]:
# Example corpus of sentences
corpus = [
    ['lions', 'are', 'the', 'only', 'big', 'cats', 'that', 'live', 'in', 'social', 'groups', 'called', 'prides'],
    ['a', 'lion', 's', 'roar', 'can', 'be', 'heard', 'up', 'to', '5', 'miles', 'away'],
    ['lions', 'communicate', 'with', 'pride', 'members', 'using', 'their', 'roars']
]

##### Train Word2Vec model

In [30]:
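from gensim.models import Word2Vec

# corpus: the training data, a list of tokenized sentences
# vector_size: dimensionality of each word vector
# window: maximum distance between the target word and a context word
# min_count: minimum number of occurrences for a word to be kept
model = Word2Vec(corpus, vector_size=10, window=3, min_count=1)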

In [32]:
word_vector = model.wv['lions']

In [33]:
print("Word Vector for 'lions':", word_vector)

Word Vector for 'lions': [-0.00538927  0.00237865  0.05100801  0.09009357 -0.09300808 -0.07116912
  0.06462646  0.08974411 -0.05014143 -0.03761414]


In [34]:
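# Find the words whose vectors are closest to 'lions' (by cosine similarity).
# With a corpus of only three sentences, these neighbours are mostly noise.
similar_words = model.wv.most_similar('lions', topn=5)
print("Similar words to 'lions':", similar_words)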

Similar words to 'lions': [('their', 0.5435051321983337), ('up', 0.5108237266540527), ('social', 0.43186718225479126), ('communicate', 0.4005759358406067), ('live', 0.379443883895874)]
