# Building Word2Vec models from scratch

### Importing the Word2Vec module from Gensim

In [1]:
from gensim.models import Word2Vec

### Defining a text corpus

In [2]:
sentences = [["I", "am", "trying", "to", "understand", "Natural", "Language", "Processing"],
             ["Natural", "Language", "Processing", "is", "fun", "to", "learn"],
             ["There", "are", "numerous", "use", "cases", "of", "Natural", "Language", "Processing"]]

### Building a basic Word2Vec model

In [3]:
model = Word2Vec(sentences, min_count=1)

### Visualizing the embeddings for the word "Natural"

In [4]:
model.wv["Natural"]

array([-8.6196875e-03,  3.6657380e-03,  5.1898835e-03,  5.7419385e-03,
        7.4669183e-03, -6.1676754e-03,  1.1056137e-03,  6.0472824e-03,
       -2.8400505e-03, -6.1735227e-03, -4.1022300e-04, -8.3689485e-03,
       -5.6000124e-03,  7.1045388e-03,  3.3525396e-03,  7.2256695e-03,
        6.8002474e-03,  7.5307419e-03, -3.7891543e-03, -5.6180597e-04,
        2.3483764e-03, -4.5190323e-03,  8.3887316e-03, -9.8581640e-03,
        6.7646410e-03,  2.9144168e-03, -4.9328315e-03,  4.3981876e-03,
       -1.7395747e-03,  6.7113843e-03,  9.9648498e-03, -4.3624435e-03,
       -5.9933780e-04, -5.6956373e-03,  3.8508223e-03,  2.7866268e-03,
        6.8910765e-03,  6.1010956e-03,  9.5384968e-03,  9.2734173e-03,
        7.8980681e-03, -6.9895042e-03, -9.1558648e-03, -3.5575271e-04,
       -3.0998408e-03,  7.8943167e-03,  5.9385742e-03, -1.5456629e-03,
        1.5109634e-03,  1.7900408e-03,  7.8175711e-03, -9.5101865e-03,
       -2.0553112e-04,  3.4691966e-03, -9.3897223e-04,  8.3817719e-03,
      

### Size of each word vector

In [5]:
model.vector_size

100

### Size of the vocabulary

In [6]:
len(model.wv.key_to_index)


17

### Building a 2nd Word2Vec model restricting the vocabulary using min_count

In [7]:
model = Word2Vec(sentences, min_count=2)

### Visualizing the embeddings for the word "Natural"

In [8]:
model.wv["Natural"]

array([ 9.4563962e-05,  3.0773198e-03, -6.8126451e-03, -1.3754654e-03,
        7.6685809e-03,  7.3464094e-03, -3.6732971e-03,  2.6427018e-03,
       -8.3171297e-03,  6.2054861e-03, -4.6373224e-03, -3.1641065e-03,
        9.3113566e-03,  8.7338570e-04,  7.4907029e-03, -6.0740625e-03,
        5.1605068e-03,  9.9228229e-03, -8.4573915e-03, -5.1356913e-03,
       -7.0648370e-03, -4.8626517e-03, -3.7785638e-03, -8.5361991e-03,
        7.9556061e-03, -4.8439382e-03,  8.4236134e-03,  5.2625705e-03,
       -6.5500261e-03,  3.9578713e-03,  5.4701497e-03, -7.4265362e-03,
       -7.4057197e-03, -2.4752307e-03, -8.6257253e-03, -1.5815723e-03,
       -4.0343284e-04,  3.2996845e-03,  1.4418805e-03, -8.8142155e-04,
       -5.5940580e-03,  1.7303658e-03, -8.9737179e-04,  6.7936908e-03,
        3.9735902e-03,  4.5294715e-03,  1.4343059e-03, -2.6998555e-03,
       -4.3668128e-03, -1.0320747e-03,  1.4370275e-03, -2.6460087e-03,
       -7.0737829e-03, -7.8053069e-03, -9.1217868e-03, -5.9351693e-03,
      

### Size of the vocabulary

In [9]:
len(model.wv.key_to_index)


4

### Vocabulary

In [10]:
words = list(model.wv.key_to_index.keys())
print(words)

['Processing', 'Language', 'Natural', 'to']
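The drop from 17 words to 4 follows directly from word frequencies in the corpus. The sketch below reproduces that filtering with the standard library only; `filter_vocab` is a hypothetical helper written for illustration, not part of Gensim, and it only mimics the effect of `min_count`, not Gensim's internal vocabulary building.

```python
from collections import Counter

sentences = [
    ["I", "am", "trying", "to", "understand", "Natural", "Language", "Processing"],
    ["Natural", "Language", "Processing", "is", "fun", "to", "learn"],
    ["There", "are", "numerous", "use", "cases", "of", "Natural", "Language", "Processing"],
]

def filter_vocab(corpus, min_count):
    """Return {word: count} for words seen at least `min_count` times.

    Hypothetical helper for illustration; it is not a Gensim function.
    """
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

vocab_all = filter_vocab(sentences, min_count=1)       # every word is kept
vocab_frequent = filter_vocab(sentences, min_count=2)  # only repeated words survive

print(len(vocab_all), len(vocab_frequent))
print(sorted(vocab_frequent))
```

Only "Natural", "Language", "Processing" (three occurrences each) and "to" (two occurrences) appear more than once, which matches the vocabulary printed above.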


### Size of each word vector

In [23]:
model.vector_size

100

### Building a 3rd Word2Vec model restricting the vocabulary using min_count and each vector of size 300

In [25]:
model = Word2Vec(sentences, min_count=2, vector_size=300)

### Visualizing the embeddings for the word "Natural"

In [26]:
model.wv["Natural"]

array([ 2.71075731e-03, -1.48577814e-03, -3.56119068e-04,  3.35454941e-04,
       -6.37046469e-05,  3.82725790e-04,  2.03795359e-03, -6.75718002e-06,
       -1.08198845e-03, -5.03576186e-04,  1.96576631e-03,  5.04700758e-04,
       -2.41420668e-04,  3.11108236e-03, -1.64042786e-03, -2.79469881e-04,
        3.05847055e-03,  2.24980921e-03,  5.00952010e-04, -2.96085351e-03,
        3.82915343e-04, -7.62751908e-04,  3.12274578e-03,  4.03309270e-04,
        4.96687891e-04,  8.02136667e-04, -6.12002215e-04, -1.66654470e-03,
        7.74764994e-05, -6.71393471e-04,  2.20031105e-03,  2.98004108e-03,
       -2.24918127e-04,  9.92338290e-04, -2.03588489e-03,  5.66441624e-04,
       -2.30874424e-03, -2.89800880e-03, -1.96673442e-03, -2.98549165e-03,
        2.42586504e-03, -1.92401046e-03,  2.75878399e-03, -2.41451501e-03,
        1.14055828e-03,  3.22499941e-03, -2.59514921e-03, -3.31501919e-03,
       -1.44304871e-03, -8.94376833e-04, -9.04297849e-05, -2.94385036e-03,
       -2.87251920e-03,  

### Size of the vocabulary

In [28]:
num_words = len(model.wv.key_to_index)
print(num_words)

4


### Size of each word vector

In [29]:
model.vector_size

300
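As a rough way to see how `min_count` and `vector_size` interact, the sketch below computes the size of a dense embedding table with one row per vocabulary word and one column per dimension. The helper `embedding_table_shape` is hypothetical, and the vocabulary sizes are simply the ones reported in this notebook. In Gensim the corresponding array should be available as `model.wv.vectors`, but that is not exercised here.

```python
def embedding_table_shape(vocab_size, vector_size):
    """Shape of a dense embedding table: one row per word, one column per dimension.

    Hypothetical helper for illustration only.
    """
    return (vocab_size, vector_size)

# (vocabulary size, vector_size) pairs taken from the models built so far.
configs = {
    "min_count=1, vector_size=100": (17, 100),
    "min_count=2, vector_size=100": (4, 100),
    "min_count=2, vector_size=300": (4, 300),
}

sizes = {}
for name, (vocab_size, vector_size) in configs.items():
    rows, cols = embedding_table_shape(vocab_size, vector_size)
    sizes[name] = rows * cols  # number of floats in the table
    print(f"{name}: {rows} x {cols} = {sizes[name]} floats")
```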

### Building a 4th Word2Vec model using 2 worker threads, the skip-gram approach (sg=1) and negative sampling (negative=1)

In [31]:
model = Word2Vec(sentences, min_count=1, vector_size=300, workers=2, sg=1, negative=1)
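With `sg=1` the model is trained with skip-gram, which learns to predict the words surrounding each center word. The sketch below shows, under simplifying assumptions, how (center, context) training pairs can be generated from one sentence. `skipgram_pairs` is a hypothetical helper that uses a fixed window; real implementations, including Gensim's, differ in details such as how the effective window is chosen.

```python
def skipgram_pairs(sentence, window=2):
    """Yield (center, context) pairs for words within `window` positions.

    Simplified illustration with a fixed window; not Gensim's implementation.
    """
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, sentence[j]

sentence = ["Natural", "Language", "Processing", "is", "fun", "to", "learn"]
pairs = list(skipgram_pairs(sentence, window=1))

print(len(pairs))
print(pairs[:3])
```

With `window=1` each interior word contributes two pairs and each end word contributes one, so a seven-word sentence yields 12 pairs.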

### Size of the vocabulary

In [33]:
num_words = len(model.wv.key_to_index)
print(num_words)

17


### Size of each word vector

In [36]:
model.vector_size

300

### Summary
In this assignment we used Gensim, a Python library that is widely used for NLP tasks. Here is a brief description of it:

1. Gensim:
Gensim is a popular library for topic modeling and document similarity analysis. It provides a straightforward implementation of Word2Vec, with an easy-to-use API, efficient handling of large text corpora, and support for related models such as FastText and Doc2Vec.

### Steps Followed

1. Import the Word2Vec Module:
The necessary module for creating Word2Vec models is imported from the gensim library.

2. Define a Text Corpus:
A set of example sentences is defined to train the Word2Vec model.

3. Build a Basic Word2Vec Model:
An initial Word2Vec model is created from the defined text corpus with min_count=1, so every word in the corpus is kept in the vocabulary.

4. Visualize Word Embeddings:
The word embeddings for a specific word (e.g., "Natural") are visualized to inspect the learned vector.

5. Check the Size of Each Word Vector:
The dimensionality of the word vectors in the model is examined.

6. Check the Size of the Vocabulary:
The size of the vocabulary, i.e., the number of unique words in the model, is checked.

7. Build a Second Word2Vec Model with Restricted Vocabulary:
A new Word2Vec model is built with a higher minimum word count (min_count=2), which restricts the vocabulary to words that occur at least twice.

8. Visualize the Embeddings and Vocabulary Size Again:
The word embeddings and the vocabulary size for the new model are checked and compared with the previous model.

9. Build a Word2Vec Model with Custom Vector Size:
A Word2Vec model is built with a specified vector size (vector_size=300 instead of the default 100).

10. Build a Word2Vec Model with Additional Parameters:
A fourth Word2Vec model is created with additional parameters: two worker threads (workers=2), the skip-gram approach (sg=1), and negative sampling (negative=1).
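
Step 10 relies on negative sampling, where a few "noise" words are drawn for each training pair. In the original word2vec formulation these are sampled from the unigram distribution raised to the power 0.75, and Gensim exposes this exponent as the `ns_exponent` parameter. The sketch below computes such a distribution for the example corpus. `noise_distribution` is a hypothetical helper for illustration and does not reproduce Gensim's internal sampling table.

```python
from collections import Counter

sentences = [
    ["I", "am", "trying", "to", "understand", "Natural", "Language", "Processing"],
    ["Natural", "Language", "Processing", "is", "fun", "to", "learn"],
    ["There", "are", "numerous", "use", "cases", "of", "Natural", "Language", "Processing"],
]

def noise_distribution(corpus, exponent=0.75):
    """Return {word: probability} proportional to count ** exponent.

    Hypothetical helper; an illustration of the smoothed unigram distribution.
    """
    counts = Counter(word for sentence in corpus for word in sentence)
    weights = {word: n ** exponent for word, n in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

probs = noise_distribution(sentences)
for word in ("Natural", "to", "fun"):
    print(f"{word}: {probs[word]:.4f}")
```

Raising counts to a power below 1 flattens the distribution, so frequent words such as "Natural" are still sampled more often than rare ones, but less than their raw frequency would suggest.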