Word2Vec is a popular algorithm used in natural language processing (NLP) for generating word embeddings. Word embeddings are dense vector representations of words that capture their meanings, syntactic properties, and relationships with other words. Word2Vec was developed by a team of researchers at Google led by Tomas Mikolov and is based on the idea that words that occur in similar contexts tend to have similar meanings.

### How Word2Vec Works

Word2Vec uses a shallow neural network (a single hidden layer) to learn word embeddings from a large corpus of text. It comes in two main architectures:

1. **Continuous Bag of Words (CBOW)**:
    - Predicts the current word based on its context (surrounding words).
    - Given a context window (e.g., the words before and after the target word), the model tries to predict the target word.
    - For example, in the sentence "The cat sat on the mat", if the context window is 2 (two words on each side), the context for the word "sat" is ["The", "cat", "on", "the"]. The model tries to predict "sat" from this context.

2. **Skip-Gram**:
    - Predicts the context words from the current word.
    - Given a target word, the model tries to predict the surrounding words within a context window.
    - Using the same example, if the context window is 2, the model uses "sat" to predict ["The", "cat", "on", "the"].
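
To make the two setups concrete, here is a small plain-Python sketch of how training examples could be derived from the sentence above. The helper `context_pairs` is a hypothetical name used only for illustration and is not part of any library; real implementations add further details such as subsampling.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        yield target, left + right

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# CBOW: the whole context is the input, the target word is the label
cbow_examples = [(context, target) for target, context in context_pairs(tokens)]

# Skip-Gram: the target word is the input, each context word is a separate label
skipgram_examples = [
    (target, context_word)
    for target, context in context_pairs(tokens)
    for context_word in context
]

print(cbow_examples[2])       # (['The', 'cat', 'on', 'the'], 'sat')
print(skipgram_examples[:2])  # [('The', 'cat'), ('The', 'sat')]
```

Note that Skip-Gram produces several training pairs per position, which is one reason it tends to train more slowly than CBOW on the same corpus.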

### Training Process

The training process involves the following steps:

1. **Initialization**: Initialize the word vectors with small random values.
2. **Training**:
    - For CBOW: The model takes the average of the context word vectors and tries to predict the target word.
    - For Skip-Gram: The model takes the target word vector and tries to predict each of the context words.
3. **Optimization**: Use techniques like gradient descent to adjust the word vectors in such a way that the prediction error is minimized.
4. **Output**: After training, the word vectors are learned such that words with similar contexts have vectors close to each other in the vector space.
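
The steps above can be sketched in NumPy as follows. This is a minimal illustration under simplifying assumptions, not the actual Word2Vec implementation: it uses a full softmax over a five-word vocabulary, whereas practical implementations rely on approximations such as negative sampling or hierarchical softmax. The names `W_in`, `W_out`, and `skipgram_step` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {word: i for i, word in enumerate(vocab)}
vocab_size, dim = len(vocab), 8

# Step 1 (Initialization): small random values
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # word (input) vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context (output) vectors

def skipgram_step(target, context, lr=0.1):
    """One SGD update for a (target, context) pair; returns the loss before the update."""
    t, c = word_to_id[target], word_to_id[context]
    h = W_in[t].copy()                      # Step 2: look up the target word vector
    scores = W_out @ h                      # one score per vocabulary word
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    loss = -np.log(probs[c])                # cross-entropy for the true context word

    # Step 3 (Optimization): gradients of the loss, then a gradient descent step
    grad_scores = probs.copy()
    grad_scores[c] -= 1.0
    grad_h = W_out.T @ grad_scores          # computed before W_out is modified
    W_out[:] -= lr * np.outer(grad_scores, h)
    W_in[t] -= lr * grad_h
    return loss

pairs = [("cat", "sat"), ("sat", "cat"), ("sat", "on"), ("on", "sat")]
epoch_losses = []
for epoch in range(200):
    epoch_losses.append(float(np.mean([skipgram_step(t, c) for t, c in pairs])))

print(f"mean loss: first epoch {epoch_losses[0]:.3f}, last epoch {epoch_losses[-1]:.3f}")
```

After training, the rows of `W_in` play the role of the learned word vectors (Step 4).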

### Example Using Gensim Library in Python

Here’s a simple example of how to use the Word2Vec model from the Gensim library:

```python
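import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')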

# Sample sentences
sentences = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets."
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 means CBOW, sg=1 means Skip-Gram

# Get the vector for a word
vector = model.wv['cat']
print(vector)

# Find similar words
similar_words = model.wv.most_similar('cat')
print(similar_words)
```
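
For intuition about what `most_similar` returns: it ranks the other words by cosine similarity to the query vector. The sketch below reproduces that idea with NumPy on three hand-made vectors. The vectors and the helper `most_similar_toy` are invented for illustration and are not Gensim output.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "mat": np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the two vectors after length normalization."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_toy("cat", toy_vectors))
```

Here "dog" ranks above "mat" because its vector points in nearly the same direction as the one for "cat".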

### Key Points

- **Context Window**: Determines the number of surrounding words to consider for each target word.
- **Vector Size**: Dimensionality of the word vectors. Common sizes are 100, 200, or 300.
- **Training Algorithm**: CBOW (faster, good for larger datasets) vs. Skip-Gram (better for small datasets and infrequent words).
- **Min Count**: Ignores all words with total frequency lower than this.
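
As an illustration of the last point, the following plain-Python sketch shows the effect of a minimum-count threshold on a vocabulary. It mimics the idea behind `min_count` and is not Gensim's own code; `build_vocab` is a hypothetical helper.

```python
from collections import Counter

tokenized_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

def build_vocab(sentences, min_count):
    """Keep only words whose total frequency is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

print(sorted(build_vocab(tokenized_sentences, min_count=1)))
# ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(sorted(build_vocab(tokenized_sentences, min_count=2)))
# ['on', 'sat', 'the']
```

Words dropped this way get no vector at all, so lookups for them fail unless they are filtered out first.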

### Applications

Word2Vec is used in various NLP applications, including:
- Semantic similarity
- Text classification
- Named entity recognition
- Machine translation
- Sentiment analysis

Word2Vec's ability to capture semantic relationships between words makes it a powerful tool for many NLP tasks.

In [24]:
from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample training data
X_train = [["king", "man", "ruler"], ["queen", "woman", "ruler"], ["cat", "pet", "animal"], ["dog", "pet", "animal"]]

# Train Word2Vec model
w2v_model = Word2Vec(X_train, vector_size=100, window=5, min_count=1, sg=1)  # min_count=1 for this example

# Function to get the average Word2Vec vector for a document
def document_vector(doc):
    # Keep only words in the Word2Vec vocabulary (key_to_index is a dict, so lookups are O(1))
    doc = [word for word in doc if word in w2v_model.wv.key_to_index]
    # Return the mean vector, handling empty documents
    if len(doc) > 0:
        return np.mean(w2v_model.wv[doc], axis=0)
    else:
        # Return a zero vector if the document has no valid words
        return np.zeros(w2v_model.vector_size)

# Check word similarity
word1 = "king"
word2 = "queen"
similarity = w2v_model.wv.similarity(word1, word2)
print(f"Similarity between '{word1}' and '{word2}': {similarity}")

# Find most similar words
word = "queen"
most_similar_words = w2v_model.wv.most_similar(word, topn=5)
print(f"Most similar words to '{word}':")
for similar_word, similarity in most_similar_words:
    print(f"Word: {similar_word}, Similarity: {similarity}")

# Example documents
doc1 = ["king", "man", "ruler"]
doc2 = ["queen", "woman", "ruler"]

# Get document vectors
vector1 = document_vector(doc1)
vector2 = document_vector(doc2)

# Compute cosine similarity
similarity = cosine_similarity([vector1], [vector2])[0][0]
print(f"Similarity between doc1 and doc2: {similarity}")


Similarity between 'king' and 'queen': 0.00882619060575962
Most similar words to 'queen':
Word: ruler, Similarity: 0.14595064520835876
Word: dog, Similarity: 0.041577357798814774
Word: cat, Similarity: 0.03476494178175926
Word: woman, Similarity: 0.01915230229496956
Word: animal, Similarity: 0.01613471284508705
Similarity between doc1 and doc2: 0.4244907796382904


In [25]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from nltk.tokenize import word_tokenize
import nltk
import joblib

# Download the necessary NLTK data
nltk.download('punkt')

# Load dataset
data = pd.read_csv('./fake-news/train.csv')




In [26]:
data = data.head(100).copy()  # copy so the column added later does not write into a slice of the full frame
data

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |
| ... | ... | ... | ... | ... | ... |
| 95 | 95 | White House Confirms More Gitmo Transfers Befo... | Edwin Mora | President Barack Obama will likely release mor... | 0 |
| 96 | 96 | The Geometry of Energy and Meditation of Buddha | | License DMCA \nA mandala is a visual symbol of... | 1 |
| 97 | 97 | Poll: Most Voters Have Not Heard of Democratic... | Katherine Rodriguez | There is a minefield of potential 2020 electio... | 0 |
| 98 | 98 | Migrants Confront Judgment Day Over Old Deport... | Vivian Yee | There are a little more than two weeks between... | 0 |


In [28]:
# Preprocess text data: lowercasing and tokenizing
data['tokenized_text'] = data['text'].apply(lambda x: word_tokenize(x.lower()))
data

| | id | title | author | text | label | tokenized_text |
|---|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | [house, dem, aide, :, we, didn, ’, t, even, se... |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | [ever, get, the, feeling, your, life, circles,... |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | [why, the, truth, might, get, you, fired, octo... |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | [videos, 15, civilians, killed, in, single, us... |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | [print, an, iranian, woman, has, been, sentenc... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 95 | White House Confirms More Gitmo Transfers Befo... | Edwin Mora | President Barack Obama will likely release mor... | 0 | [president, barack, obama, will, likely, relea... |
| 96 | 96 | The Geometry of Energy and Meditation of Buddha | | License DMCA \nA mandala is a visual symbol of... | 1 | [license, dmca, a, mandala, is, a, visual, sym... |
| 97 | 97 | Poll: Most Voters Have Not Heard of Democratic... | Katherine Rodriguez | There is a minefield of potential 2020 electio... | 0 | [there, is, a, minefield, of, potential, 2020,... |
| 98 | 98 | Migrants Confront Judgment Day Over Old Deport... | Vivian Yee | There are a little more than two weeks between... | 0 | [there, are, a, little, more, than, two, weeks... |


In [29]:



# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['tokenized_text'], data['label'], test_size=0.2, random_state=42)


In [40]:
for i in range(1,5):
    print(len(X_train.to_list()[i]))

332
586
673
229


In [41]:

# Train Word2Vec model
w2v_model = Word2Vec(X_train, vector_size=100, window=5, min_count=2, sg=1)  # sg=1 means Skip-Gram

# Function to get the average Word2Vec vector for a document
def document_vector(doc):
    # Keep only in-vocabulary words; key_to_index is a dict, so each lookup is O(1)
    doc = [word for word in doc if word in w2v_model.wv.key_to_index]
    if len(doc) > 0:
        return np.mean(w2v_model.wv[doc], axis=0)
    else:
        # Return a zero vector if the document has no valid words
        return np.zeros(w2v_model.vector_size)

# Transform documents to vectors (one row per document, so rows stay aligned with the labels)
X_train_vect = np.array([document_vector(doc) for doc in X_train])
X_test_vect = np.array([document_vector(doc) for doc in X_test])

# Handle NaN values in case of empty documents
X_train_vect = np.nan_to_num(X_train_vect)
X_test_vect = np.nan_to_num(X_test_vect)

# Train a Logistic Regression classifier
clf = LogisticRegression()
clf.fit(X_train_vect, y_train)

# Predict on test data
y_pred = clf.predict(X_test_vect)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
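# zero_division=0 reports 0 (instead of warning) for a class that is never predicted
print(classification_report(y_test, y_pred, zero_division=0))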

# Save models (Word2Vec.save writes Gensim's own format, not the original word2vec binary format)
w2v_model.save("word2vec_model.bin")
joblib.dump(clf, "fake_news_classifier.pkl")

# Load models (if needed later)
# w2v_model = Word2Vec.load("word2vec_model.bin")
# clf = joblib.load("fake_news_classifier.pkl")


Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.50      1.00      0.67        10
           1       0.00      0.00      0.00        10

    accuracy                           0.50        20
   macro avg       0.25      0.50      0.33        20
weighted avg       0.25      0.50      0.33        20





['fake_news_classifier.pkl']