### Word Embedding in NLP

Word embedding is a technique used in Natural Language Processing (NLP) to map words or phrases to vectors of real numbers. These vectors capture semantic information, so words with similar meanings have similar vector representations.

### Concept Overview

Traditional methods such as **Bag-of-Words (BOW)** or **TF-IDF** represent text as sparse vectors of counts or frequencies, with one dimension per vocabulary word. **Word embeddings** instead give each word a dense, low-dimensional vector that captures **semantic relationships** between words. The key advantage is that words with similar meanings are mapped to nearby points in the vector space, whereas in a count-based representation every word is equally distant from every other word.
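To make the contrast concrete, here is a minimal Bag-of-Words sketch in plain Python. The two-document corpus and the `bow_vector` helper are assumptions made up for illustration; they are not part of any library.

```python
from collections import Counter

# Toy corpus, assumed for illustration only
docs = [
    "the movie was great",
    "the film was boring",
]

# Vocabulary: one vector position per unique word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

Each vector is as long as the vocabulary, and "movie" and "film" occupy unrelated positions, so nothing in this representation indicates that the two words are related.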

#### Common Techniques for Word Embeddings

- **Word2Vec**: Developed at Google, Word2Vec trains a shallow neural network to learn word vectors, either by predicting the context words from a target word (Skip-Gram) or by predicting the target word from its context words (Continuous Bag of Words, CBOW).
  
- **GloVe**: Global Vectors for Word Representation (GloVe) fits word vectors to word co-occurrence counts collected over the entire corpus, an approach closely related to matrix factorization.

- **FastText**: An extension of Word2Vec, FastText models words as bags of character n-grams, allowing it to capture information about rare and out-of-vocabulary words.
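The character n-gram idea behind FastText can be sketched in a few lines. The `char_ngrams` helper below is hypothetical, and the `<`/`>` boundary markers and the 3-to-6 length range are assumptions modeled on the convention commonly described for FastText, not a reproduction of its implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with boundary markers added."""
    marked = f"<{word}>"  # mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

# Trigrams only, to keep the output short
print(char_ngrams("where", min_n=3, max_n=3))
```

Because a word's vector is built from the vectors of its n-grams, an unseen word such as "wherever" still shares pieces like `<wh` and `whe` with words seen during training.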

### How Word Embedding Works

Word embeddings are learned from the **contexts** in which words appear: words that are used in similar surroundings across many sentences end up with similar vectors. For example:

- The words “king” and “queen” are related in meaning, so their word embeddings will be placed near each other in the vector space.
- Similarly, the words “cat” and “dog” will have embeddings that are closer to each other than to other unrelated words like “car” or “house.”
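"Closer" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors that are purely hypothetical; real embeddings are learned from data and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for illustration
toy = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print("cat vs dog:", cosine_similarity(toy["cat"], toy["dog"]))
print("cat vs car:", cosine_similarity(toy["cat"], toy["car"]))
```

With these toy values the "cat"/"dog" score is higher than the "cat"/"car" score, which is the pattern a well-trained model is expected to show.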

### Why Use Word Embeddings?

- **Captures Semantics**: Word embeddings capture both syntactic similarity (words that play the same grammatical role) and **semantic similarity** (words that are related in meaning).
- **Handling Synonyms**: Words with similar meanings have similar embeddings, so a model can generalize from a word it has seen to its synonyms.
- **Reduces Dimensionality**: A BOW vector has one dimension per unique word in the corpus, which can mean tens of thousands of mostly zero entries. Word embeddings typically use a few hundred dimensions or fewer, regardless of vocabulary size.

In [None]:
from gensim.models import Word2Vec

# Sample sentences (a toy corpus, far too small to learn meaningful vectors)
sentences = [
    "The movie was fantastic and exciting",
    "I loved the movie, it was great",
    "The film was boring and dull",
    "I hated the movie, it was a disappointment"
]

# Tokenize sentences into words (split on spaces for simplicity).
# This keeps case and punctuation, so "The"/"the" and "movie"/"movie,"
# are treated as different tokens.
tokenized_sentences = [sentence.split() for sentence in sentences]

# Create the Word2Vec model; min_count=1 keeps words that occur only once
model = Word2Vec(tokenized_sentences, min_count=1)

# Now, let's check the embedding for some words
embedding_movie = model.wv['movie']
embedding_boring = model.wv['boring']
embedding_fantastic = model.wv['fantastic']

print("Embedding for 'movie':", embedding_movie)
print("Embedding for 'boring':", embedding_boring)
print("Embedding for 'fantastic':", embedding_fantastic)


Embedding for 'movie': [ 1.30016683e-03 -9.80430283e-03  4.58776252e-03 -5.38222783e-04
  6.33209571e-03  1.78347470e-03 -3.12979822e-03  7.75997294e-03
  1.55466562e-03  5.52093989e-05 -4.61295387e-03 -8.45352374e-03
 -7.76683213e-03  8.67050979e-03 -8.92496016e-03  9.03471559e-03
 -9.28101782e-03 -2.76756298e-04 -1.90704700e-03 -8.93114600e-03
  8.63005966e-03  6.77781366e-03  3.01943906e-03  4.83345287e-03
  1.12190246e-04  9.42468084e-03  7.02128746e-03 -9.85372625e-03
 -4.43322072e-03 -1.29011157e-03  3.04772262e-03 -4.32395237e-03
  1.44916656e-03 -7.84589909e-03  2.77807354e-03  4.70269192e-03
  4.93731257e-03 -3.17570218e-03 -8.42704065e-03 -9.22061782e-03
 -7.22899451e-04 -7.32746487e-03 -6.81496272e-03  6.12000562e-03
  7.17230327e-03  2.11741915e-03 -7.89940078e-03 -5.69898821e-03
  8.05184525e-03  3.92084382e-03 -5.24047017e-03 -7.39190448e-03
  7.71554711e-04  3.46375466e-03  2.07919348e-03  3.10080405e-03
 -5.62050007e-03 -9.88948625e-03 -7.02083716e-03  2.30308768e-04
  

In [None]:
similar_words = model.wv.most_similar('movie', topn=3)
print("\nMost similar words to 'movie':")
print(similar_words)


Most similar words to 'movie':
[('was', 0.21883949637413025), ('great', 0.1747603565454483), ('The', 0.16371983289718628)]


### How Does This Work?

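1. **Training the Model**: The `Word2Vec` model is trained on the tokenized sentences by learning to predict a word from its surrounding words (CBOW, the gensim default) or the surrounding words from a word (Skip-Gram, `sg=1`). The weights learned for this prediction task become the word vectors.
2. **Vector Representation**: Each word, such as “movie,” is represented as a dense vector (100 dimensions by default in gensim). Given enough training text, this vector ends up close to the vectors of words that appear in similar contexts (e.g., “film”). A four-sentence corpus like the one above provides far too little data for that, which is why the similarity scores in the output are low and the nearest neighbours include function words such as “was” and “The.”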
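The prediction task in step 1 is built from word pairs that fall within a sliding window. The sketch below generates Skip-Gram style (target, context) pairs for one of the sample sentences. The `skipgram_pairs` helper is hypothetical and deliberately simplified: it is an assumption about the general idea, and it leaves out refinements such as subsampling and negative sampling that real implementations use.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The movie was fantastic and exciting".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

A larger `window` produces more pairs per word and tends to emphasize topical similarity, while a smaller one focuses on immediate neighbours.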

### Benefits of Word Embeddings

- **Semantic Relationships**: Words that often appear in similar contexts, such as “machine,” “learning,” and “AI,” end up with nearby vectors when the model is trained on a sufficiently large corpus. This makes it easy to find related words and to express semantic relationships as distances.
  
- **Handling Similar Words**: Synonyms like "car" and "automobile" will have similar embeddings, which helps in tasks like **paraphrase detection** and **text classification**.

- **Rich Features for Models**: Word embeddings are often used as inputs for other NLP models, such as for **text classification**, **sentiment analysis**, **translation**, and **question answering**, providing a powerful representation of text.

### Word Embedding Applications

- **Text Classification**: Using word embeddings as features, we can classify text for tasks like spam detection or sentiment analysis.
- **Named Entity Recognition (NER)**: Word embeddings help in recognizing named entities like people, organizations, or locations by providing semantic context.
- **Machine Translation**: Embeddings from two languages can be aligned in a shared vector space so that a word and its translation land near each other; neural translation models also use embeddings as their input representation.
- **Text Similarity**: Embeddings help measure how similar two texts are by comparing the distance (for example, cosine similarity) between their vectors.
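Several of these applications need one vector per text rather than one per word. A common baseline is to average the word vectors (mean pooling). The sketch below uses hypothetical 2-dimensional embeddings and a made-up `sentence_vector` helper; it illustrates the idea and is not a recommended production approach.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical embeddings, made up for illustration
toy_embeddings = {
    "great": [1.0, 0.0],
    "movie": [0.0, 1.0],
}

# "tonight" is out of vocabulary and is skipped
print(sentence_vector(["great", "movie", "tonight"], toy_embeddings, dim=2))
```

The resulting fixed-length vector can be passed to any standard classifier or compared with another text's vector. Averaging discards word order, which is the main limitation of this baseline.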

### In Summary

Word embeddings turn words into continuous vectors whose distances reflect how the words are used, which lets models work with relationships between words rather than isolated symbols. Techniques like **Word2Vec**, **GloVe**, and **FastText** learn these vectors from large corpora, enabling more accurate and efficient NLP applications such as search, classification, and translation.

In [None]:
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Prepare a simple corpus
corpus = [
    "Word2Vec is a great technique for word embeddings.",
    "It helps in converting words into vectors.",
    "Word2Vec is used in NLP tasks like sentiment analysis and machine translation.",
    "The Skip-gram model predicts context words given a target word.",
    "CBOW model predicts the target word from context words."
]

# Tokenize the sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=3, min_count=1, sg=1)

# Get the vector for a specific word
word_vector = model.wv['word2vec']
print(f"Word2Vec vector: {word_vector}")

# Find similar words
similar_words = model.wv.most_similar('word2vec', topn=3)
print(f"Words similar to 'word2vec': {similar_words}")

# Save and load the model
model.save("word2vec_model.model")
loaded_model = Word2Vec.load("word2vec_model.model")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word2Vec vector: [ 1.5611659e-02 -1.9036822e-02 -3.6917988e-04  6.9416170e-03
 -1.8735953e-03  1.6766759e-02  1.8041069e-02  1.3068029e-02
 -1.4369206e-03  1.5424302e-02 -1.7057002e-02  6.3956575e-03
 -9.2831803e-03 -1.0187318e-02  7.1869758e-03  1.0755858e-02
  1.5555240e-02 -1.1488882e-02  1.4866323e-02  1.3249519e-02
 -7.3942998e-03 -1.7453466e-02  1.0939996e-02  1.3034756e-02
 -1.5832902e-03 -1.3417154e-02 -1.4160382e-02 -4.9939575e-03
  1.0307272e-02 -7.3558744e-03 -1.8780066e-02  7.6191304e-03
  9.7674970e-03 -1.2905392e-02  2.3778665e-03 -4.1718069e-03
  5.8256763e-05 -1.9791406e-02  5.3453855e-03 -9.4990488e-03
  2.2296992e-03 -3.1321580e-03  4.4113072e-03 -1.5753033e-02
 -5.4209209e-03  5.3441166e-03  1.0683629e-02 -4.8003504e-03
 -1.9042172e-02  9.0447301e-03]
Words similar to 'word2vec': [('tasks', 0.22867631912231445), ('machine', 0.20394939184188843), ('helps', 0.19049221277236938)]


In [None]:
similar_words = model.wv.most_similar('word2vec', topn=10)
print(f"Top 10 words similar to 'word2vec': {similar_words}")

Top 10 words similar to 'word2vec': [('tasks', 0.22867631912231445), ('machine', 0.20394939184188843), ('helps', 0.19049221277236938), ('skip-gram', 0.1684013158082962), ('analysis', 0.13033269345760345), ('like', 0.05815975368022919), ('cbow', 0.04925908148288727), ('nlp', 0.048944756388664246), ('predicts', 0.0446699857711792), ('model', -0.00908462330698967)]
