# Word2Vec

Word2Vec is a popular technique in natural language processing (NLP) for learning word embeddings: dense, continuous vector representations of words, typically with a few hundred dimensions. Word embeddings capture semantic and syntactic relationships between words and are a fundamental component of many NLP applications. Word2Vec was introduced by researchers at Google in a pair of papers published in 2013.

The name "Word2Vec" is derived from its primary function: it transforms words into vectors.


Word2Vec has been widely adopted in various NLP applications, including:

- **Text Classification**: Word embeddings serve as features for training classifiers to categorize text into different classes.
- **Information Retrieval**: Word2Vec can be used to represent documents or queries in vector space, making it easier to find similar documents.
- **Named Entity Recognition (NER)**: Embeddings help recognize named entities and their types in text.
- **Machine Translation**: Word embeddings enable better translation models by capturing semantic relationships between words in different languages.
- **Question Answering**: Word2Vec embeddings are used to match questions with relevant answers.
- **Recommendation Systems**: They are used to recommend products or content based on users' preferences and content similarity.
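
Several of the applications above (classification, retrieval, recommendation) need one vector per document rather than one per word. A common baseline is to average the vectors of the words in the document. The sketch below assumes a tiny hand-written embedding table, `toy_vectors`, in place of a trained model; the table, its values, and the `document_vector` helper are illustrative assumptions, not part of Gensim.

```python
# Toy embedding table; the values are made up for illustration only.
toy_vectors = {
    "gas":      [0.9, 0.1, 0.0],
    "pipeline": [0.8, 0.2, 0.1],
    "cake":     [0.0, 0.9, 0.3],
    "sugar":    [0.1, 0.8, 0.4],
}


def document_vector(tokens, vectors):
    """Average the vectors of the tokens that appear in the table."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None  # no known words, so there is nothing to average
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# "unknown" is not in the table and is simply ignored
doc_vec = document_vector(["gas", "pipeline", "unknown"], toy_vectors)
print(doc_vec)
```

Averaging ignores word order, so it is only a starting point, but it is often a reasonable first feature for a classifier or a similarity search.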


Word2Vec is a foundational technique in NLP and has paved the way for later embedding methods such as GloVe (Global Vectors for Word Representation) and fastText. These techniques are widely used for understanding and processing natural language text.



## gensim

Gensim is an open-source Python library for natural language processing (NLP) that focuses on topic modeling, document similarity, and word embeddings. It was originally developed for efficient, scalable topic modeling and document similarity analysis and has since grown to cover other text-processing tasks, including training word embeddings. Researchers, data scientists, and developers use it to work with large text corpora.


**Reference**
- https://radimrehurek.com/gensim/auto_examples/
- https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html



In [None]:
!pip install gensim numpy nltk



In [None]:
import re
import nltk
import gensim
from gensim.models import word2vec

### Step 1: Prepare Your Text Data
Before you can train a Word2Vec model, you need to prepare your text data. This includes cleaning and tokenizing your text. Here's a simple example:

In [None]:
text = """
The Russian state-owned energy giant, Gazprom, said the restrictions on the Nord Stream 1 pipeline would last for the next three days.
Russia has already significantly reduced gas exports via the pipeline.
It denies accusations it has used energy supplies as a weapon of war against Western countries.
The Nord Stream 1 pipeline stretches 1,200km (745 miles) under the Baltic Sea from the Russian coast near St Petersburg to north-eastern Germany.
It opened in 2011, and can send a maximum of 170 million cubic metres of gas per day from Russia to Germany.
"""

In [None]:
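import nltk
nltk.download('stopwords')
nltk.download('punkt')
# Recent NLTK releases may need the 'punkt_tab' resource instead of 'punkt':
# nltk.download('punkt_tab')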

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# English stop words from NLTK (all lowercase)
stop_words = nltk.corpus.stopwords.words('english')
# stop_words

In [None]:
def sentence_tokenize(text):
    # Passing the text to the constructor trains the Punkt model on it
    punkt_tokenizer = nltk.tokenize.PunktSentenceTokenizer(text)
    return punkt_tokenizer.tokenize(text)


def clean_text(text):
    # Tokenize each sentence into words and drop English stop words.
    # Punctuation, numbers, and capitalized words such as 'It' are kept,
    # because the comparison against stop_words is case-sensitive.
    tokens = []
    for sentence in sentence_tokenize(text):
        words = nltk.tokenize.word_tokenize(sentence)
        # Filter into a new list; removing from a list while looping
        # over it skips the item after each removal
        tokens.extend(word for word in words if word not in stop_words)
    return tokens

In [None]:
corpus = clean_text(text)
# corpus
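
The cleaning step above keeps punctuation, numbers, and capitalized stop words. If you also want to lowercase the text and keep only alphabetic tokens, a regular expression is enough. The following is a minimal sketch under stated assumptions: `SMALL_STOP_WORDS` is a short hypothetical stop-word set (NLTK's English list is much longer) and `simple_clean` is an illustrative helper, not part of NLTK or Gensim.

```python
import re

# Short, hypothetical stop-word set for illustration only
SMALL_STOP_WORDS = {"the", "it", "has", "as", "a", "of", "on",
                    "for", "to", "in", "and", "can"}


def simple_clean(text, stop_words=SMALL_STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


sample = ("It opened in 2011, and can send a maximum of 170 million "
          "cubic metres of gas per day.")
cleaned = simple_clean(sample)
print(cleaned)
# ['opened', 'send', 'maximum', 'million', 'cubic', 'metres', 'gas', 'per', 'day']
```

Note that this pattern splits hyphenated words such as "north-eastern" into two tokens, which may or may not be what you want.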

### Step 2: Train a Word2Vec Model

Next, train a Word2Vec model on the tokenized text. You can configure parameters such as the vector dimensionality (`vector_size`, called `size` before Gensim 4.0), the context window size (`window`), and the training algorithm (`sg=1` for Skip-gram, `sg=0` for Continuous Bag of Words, CBOW).


```
# Train a Word2Vec model
from gensim.models import Word2Vec

# tokenized_data: a list of sentences, each a list of tokens
model = Word2Vec(
    tokenized_data,
    vector_size=100,   # Dimensionality of the word vectors
    window=5,          # Context window size
    min_count=1,       # Minimum word frequency to consider
    sg=0,              # Use CBOW (0) or Skip-gram (1)
    workers=4          # Number of worker threads used for training
)
```
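
To make the `window` parameter more concrete: Skip-gram learns to predict each context word from the center word, while CBOW uses the same window but predicts the center word from its context. The sketch below only enumerates the (center, context) pairs that a fixed window produces. It is a simplified illustration, not Gensim's implementation, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["gas", "exports", "via", "pipeline"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1` each word is paired only with its immediate neighbours, so the four tokens above yield six pairs. A larger window produces more pairs per word and tends to capture broader, more topical relationships.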

In [None]:
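# Word2Vec expects an iterable of sentences, each a list of tokens,
# so the flat token list is wrapped in an outer list (one long "sentence")
# model = word2vec.Word2Vec([corpus], vector_size=200, window=20, min_count=1, workers=4)
model = word2vec.Word2Vec([corpus], min_count=1)
# To stream sentences from a file instead: gensim.models.word2vec.LineSentence(filename)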
print(model)

Word2Vec<vocab=63, vector_size=100, alpha=0.025>
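
The cell above sets `min_count=1` because Gensim's default is `min_count=5`, which would discard almost every word in a corpus this small. The sketch below illustrates the idea of frequency-based vocabulary pruning with `collections.Counter`. It is an assumption-laden simplification, not Gensim's code, and `build_vocab` here is a hypothetical stand-alone function.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Count tokens across all sentences; keep those seen at least min_count times."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return {tok: n for tok, n in counts.items() if n >= min_count}


toy_sentences = [["gas", "pipeline", "gas"], ["pipeline", "russia"]]
print(build_vocab(toy_sentences, min_count=1))
# {'gas': 2, 'pipeline': 2, 'russia': 1}
print(build_vocab(toy_sentences, min_count=2))
# {'gas': 2, 'pipeline': 2}
```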


### Step 3: Use the Word2Vec Model

```
# The query word must be in the model's vocabulary, otherwise a KeyError is raised
similar_words = model.wv.most_similar("pipeline", topn=5)
print(similar_words)
```

In [None]:
model.wv['Russia']

array([-0.00949597,  0.00958518, -0.00776668, -0.00263947, -0.00490856,
       -0.00499557, -0.00801268, -0.0077347 , -0.00456166, -0.00129522,
       -0.00512872,  0.00612848, -0.0095164 , -0.00530883,  0.00945281,
        0.00697663,  0.00769548,  0.00422311,  0.00048875, -0.00599704,
        0.00602594,  0.00263872,  0.00772202,  0.00636618,  0.00792931,
        0.00865827, -0.00993999, -0.00675818,  0.0013466 ,  0.00645069,
        0.0074159 ,  0.00551455,  0.00768108, -0.0051468 ,  0.00657358,
       -0.00409449, -0.00903703,  0.00914577,  0.00130429, -0.00277189,
       -0.00246584, -0.00421045,  0.00481935,  0.00440827, -0.00265704,
       -0.00735463, -0.00356974, -0.00036105,  0.00609615, -0.00282682,
       -0.00011138,  0.00085623, -0.00711399,  0.00205407, -0.00146161,
        0.00280777,  0.00485327, -0.00134534, -0.00278401,  0.00774932,
        0.0050701 ,  0.00672746,  0.00454453,  0.00868534,  0.0074941 ,
       -0.00105813,  0.00875308,  0.00461585,  0.00538305, -0.00

In [None]:
similarity = model.wv.similarity("weapon", "war")
print("Similarity between 'weapon' and 'war':", similarity)

Similarity between 'weapon' and 'war': 0.15172634
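
The score returned by `similarity` is the cosine similarity between the two word vectors: 1 means they point in the same direction, 0 means they are orthogonal. A minimal pure-Python sketch of that formula, using made-up two-dimensional vectors rather than the model's 100-dimensional ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

A value of about 0.15, as in the output above, therefore indicates only a weak relationship, which is expected for a model trained on a handful of sentences.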


In [None]:
model.wv.most_similar('countries', topn=10)

[('via', 0.18393656611442566),
 ('significantly', 0.167282834649086),
 ('.', 0.14069588482379913),
 ('last', 0.13259515166282654),
 ('energy', 0.12228170782327652),
 ('Russian', 0.11194396018981934),
 ('1', 0.09922793507575989),
 ('supplies', 0.08814319968223572),
 ('It', 0.08411835879087448),
 ('would', 0.0820133239030838)]
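
Conceptually, `most_similar` ranks every other word in the vocabulary by its cosine similarity to the query word. The sketch below shows that ranking on a toy table of made-up vectors. The `rank_neighbours` helper and the `toy_vectors` values are illustrative assumptions, and Gensim's own implementation is vectorized rather than a Python loop.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_neighbours(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [(other, _cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-d vectors for illustration only
toy_vectors = {
    "gas": [0.9, 0.1],
    "pipeline": [0.8, 0.3],
    "cake": [0.1, 0.9],
}
ranked = rank_neighbours("gas", toy_vectors)
print(ranked)
```

In the real output above, tokens such as `'.'` and `'1'` appear among the neighbours with scores close to zero. With so little training data the ranking is essentially noise; meaningful neighbours need a much larger corpus.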

### Step 4: Save and Load a Trained Model

In [None]:
model.save('model.bin')

In [None]:
# load the saved model
new_model = word2vec.Word2Vec.load('model.bin')
print(new_model)

Word2Vec<vocab=63, vector_size=100, alpha=0.025>


In [None]:
new_model.wv.most_similar('countries', topn=10)

[('via', 0.18393656611442566),
 ('significantly', 0.167282834649086),
 ('.', 0.14069588482379913),
 ('last', 0.13259515166282654),
 ('energy', 0.12228170782327652),
 ('Russian', 0.11194396018981934),
 ('1', 0.09922793507575989),
 ('supplies', 0.08814319968223572),
 ('It', 0.08411835879087448),
 ('would', 0.0820133239030838)]

## Another Example

In [None]:
from gensim.models import Word2Vec
import nltk

# define training data
content = """Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked.
In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies."""

# split into sentences, then tokenize each sentence into words
sentences = nltk.sent_tokenize(content)
words = [nltk.word_tokenize(sentence) for sentence in sentences]
print(words)

# train model
model = Word2Vec(words, min_count=1)

# summarize the trained model
print(model)

# summarize vocabulary
# print(model.wv.index_to_key)

# access vector for one word (vectors live on model.wv)
# print(model.wv['sugar'])
model.wv['is']

# save model
# model.save('model.bin')

# load model
# new_model = Word2Vec.load('model.bin')
# print(new_model)


[['Cake', 'is', 'a', 'form', 'of', 'sweet', 'food', 'made', 'from', 'flour', ',', 'sugar', ',', 'and', 'other', 'ingredients', ',', 'that', 'is', 'usually', 'baked', '.'], ['In', 'their', 'oldest', 'forms', ',', 'cakes', 'were', 'modifications', 'of', 'bread', ',', 'but', 'cakes', 'now', 'cover', 'a', 'wide', 'range', 'of', 'preparations', 'that', 'can', 'be', 'simple', 'or', 'elaborate', ',', 'and', 'that', 'share', 'features', 'with', 'other', 'desserts', 'such', 'as', 'pastries', ',', 'meringues', ',', 'custards', ',', 'and', 'pies', '.']]
Word2Vec<vocab=48, vector_size=100, alpha=0.025>


array([-8.7274825e-03,  2.1301615e-03, -8.7354420e-04, -9.3190884e-03,
       -9.4281426e-03, -1.4107180e-03,  4.4324086e-03,  3.7040710e-03,
       -6.4986930e-03, -6.8730675e-03, -4.9994122e-03, -2.2868442e-03,
       -7.2502876e-03, -9.6033178e-03, -2.7436293e-03, -8.3628409e-03,
       -6.0388758e-03, -5.6709289e-03, -2.3441375e-03, -1.7069972e-03,
       -8.9569986e-03, -7.3519943e-04,  8.1525063e-03,  7.6904297e-03,
       -7.2061159e-03, -3.6668312e-03,  3.1185520e-03, -9.5707225e-03,
        1.4764392e-03,  6.5244664e-03,  5.7464195e-03, -8.7630618e-03,
       -4.5171441e-03, -8.1401607e-03,  4.5956374e-05,  9.2636338e-03,
        5.9733056e-03,  5.0673080e-03,  5.0610625e-03, -3.2429171e-03,
        9.5521836e-03, -7.3564244e-03, -7.2703874e-03, -2.2653891e-03,
       -7.7856064e-04, -3.2161034e-03, -5.9258583e-04,  7.4888230e-03,
       -6.9751858e-04, -1.6249407e-03,  2.7443992e-03, -8.3591007e-03,
        7.8558037e-03,  8.5361041e-03, -9.5840869e-03,  2.4462664e-03,
      

In [None]:
model.wv.most_similar('sugar', topn=5)

[('form', 0.1960250288248062),
 ('desserts', 0.1585622876882553),
 ('of', 0.12292061746120453),
 ('share', 0.10217307507991791),
 ('can', 0.08713015913963318)]

In [None]:
model.wv.most_similar('can', topn=10)

[('bread', 0.14504408836364746),
 ('but', 0.14474980533123016),
 ('share', 0.12842437624931335),
 ('their', 0.12143993377685547),
 ('other', 0.10018284618854523),
 ('baked', 0.09942895919084549),
 ('sweet', 0.09211055934429169),
 ('sugar', 0.08713015168905258),
 ('such', 0.08693084120750427),
 ('Cake', 0.08146469295024872)]

# Urdu Word2Vec

Ref: https://github.com/samarh/urduvec

Download the model into the workspace:
https://drive.google.com/u/0/uc?id=1K_4Fbdv9GJDNjR_avLbzKdJPEMVWdYBm&export=download


In [None]:
!pip install --upgrade gensim

In [None]:
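from gensim.models import KeyedVectors

# Load the pretrained Urdu vectors. Files in the original word2vec format are
# loaded with KeyedVectors; Word2Vec.load_word2vec_format was removed from Gensim.
urdu_model = KeyedVectors.load_word2vec_format('urduvec_140M_100K_300d.bin', binary=True)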
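
The `binary=True` flag selects the binary variant of the original word2vec file format. The plain-text variant is easier to inspect: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its space-separated values. The sketch below parses that text layout only; it does not read the `.bin` file above, and `read_word2vec_text` is a hypothetical helper written for illustration.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {word: [floats]} dict."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = handle.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors


# A two-word, three-dimensional example file held in memory
sample = io.StringIO("2 3\ngas 0.1 0.2 0.3\npipeline 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(vectors["pipeline"])
# [0.4, 0.5, 0.6]
```

For real files, `KeyedVectors.load_word2vec_format` remains the practical choice, since it handles both variants and returns an object with `most_similar` and `similarity` methods.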