### Word Embedding using Word2Vec

We will use Word2Vec to create word embeddings for our preprocessed text. 

The advantage of word embeddings over methods like Bag of Words or TF-IDF is that they can capture the semantic meaning of words. Bag of Words and TF-IDF treat every word as an independent dimension, whereas an embedding places words that occur in similar contexts close together in vector space, so words with similar meanings tend to have similar vector representations. This can help improve the performance of our machine learning model.
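To make "similar vector representations" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three vectors are hand-picked toy values chosen for illustration, not output from any trained model, and `cosine_similarity` is a helper defined here rather than part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-picked toy vectors (assumption: illustrative only, not learned).
toy_vectors = {
    "fire": np.array([0.9, 0.8, 0.1]),
    "blaze": np.array([0.8, 0.9, 0.2]),
    "pizza": np.array([0.1, 0.2, 0.9]),
}

sim_related = cosine_similarity(toy_vectors["fire"], toy_vectors["blaze"])
sim_unrelated = cosine_similarity(toy_vectors["fire"], toy_vectors["pizza"])

print(f"fire vs blaze: {sim_related:.3f}")
print(f"fire vs pizza: {sim_unrelated:.3f}")
```

With a trained model, the same comparison would be made on the learned vectors instead of these toy values.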

First, let's import the necessary libraries and train our Word2Vec model on the tokenized text.

In [1]:
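import ast

import pandas as pd
import numpy as np
from gensim.models import Word2Vec

# Load the preprocessed train and test sets
train_data_mod = pd.read_csv('../preprocessing/train_data_mod.csv')
test_data_mod = pd.read_csv('../preprocessing/test_data_mod.csv')


def to_tokens(text):
    # read_csv returns 'preprocess_text' as plain strings, and Word2Vec would
    # treat each string as a sequence of characters. Assumption: the column
    # holds either a stringified list ("['a', 'b']") or space-separated tokens.
    if not isinstance(text, str):
        return []
    return ast.literal_eval(text) if text.startswith('[') else text.split()


train_data_mod['preprocess_text'] = train_data_mod['preprocess_text'].apply(to_tokens)
test_data_mod['preprocess_text'] = test_data_mod['preprocess_text'].apply(to_tokens)
preprocessed_text = train_data_mod['preprocess_text'].tolist()

# Train 50-dimensional embeddings; min_count=1 keeps every token in the vocabulary
word2vec_model = Word2Vec(sentences=preprocessed_text, vector_size=50, window=5, min_count=1, workers=4)

word2vec_model.save("word2vec_model.model")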

Now that we have trained our Word2Vec model, let's use it to represent each text as a single vector by averaging the vectors of its tokens. A text with no in-vocabulary tokens gets a zero vector.

In [2]:
def average_word_vector(tokens, word2vec_model, vector_size):
    # Keep only in-vocabulary tokens. `token in wv` is a constant-time lookup,
    # unlike scanning the index_to_key list for every token.
    word_vectors = [word2vec_model.wv[token] for token in tokens if token in word2vec_model.wv]

    # No known tokens: fall back to a zero vector
    if not word_vectors:
        return np.zeros(vector_size)

    # One vector per text: the element-wise mean of its token vectors
    return np.mean(word_vectors, axis=0)

vector_size = word2vec_model.wv.vector_size  # 50, as set when training the model
train_data_mod['word2vec_vectors'] = train_data_mod['preprocess_text'].apply(lambda x: average_word_vector(x, word2vec_model, vector_size))
test_data_mod['word2vec_vectors'] = test_data_mod['preprocess_text'].apply(lambda x: average_word_vector(x, word2vec_model, vector_size))
train_data_mod[['keyword', 'word2vec_vectors']].head()


Unnamed: 0,keyword,word2vec_vectors
0,missing,"[0.0005248459, -0.038611013, -0.06718053, -0.3..."
1,missing,"[0.023838915, -0.027701743, -0.08011638, -0.37..."
2,missing,"[-0.017775554, -0.050490983, -0.0845786, -0.36..."
3,missing,"[-0.035682485, -0.01963313, -0.093268305, -0.3..."
4,missing,"[-0.0022137025, -0.07134746, -0.0651533, -0.38..."
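The averaging step can be checked by hand on toy data. The sketch below uses a plain dictionary of hypothetical 3-dimensional embeddings in place of `word2vec_model.wv`, and `mean_pool` is an illustrative stand-in for `average_word_vector`; neither the values nor the names come from the trained model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for word2vec_model.wv.
toy_embeddings = {
    "forest": np.array([1.0, 0.0, 2.0]),
    "fire": np.array([3.0, 2.0, 0.0]),
}


def mean_pool(tokens, embeddings, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)


# "unknownword" is skipped, so the result is the mean of the two known vectors.
doc_vector = mean_pool(["forest", "fire", "unknownword"], toy_embeddings, 3)

# No known tokens at all: the zero-vector fallback is used.
empty_vector = mean_pool(["unknownword"], toy_embeddings, 3)

print(doc_vector)
print(empty_vector)
```

Note that averaging discards word order, so two texts containing the same tokens in a different order receive the same vector.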


In [3]:
# export
train_data_mod.to_csv('train_data_mod_word2vec_50d.csv', index=False)
test_data_mod.to_csv('test_data_mod_word2vec_50d.csv', index=False)
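One caveat with this export: `to_csv` writes each NumPy array in `word2vec_vectors` as its string representation, so the column comes back from `read_csv` as text rather than numbers. A possible alternative, sketched here on a toy frame with made-up values, is to expand the vectors into one numeric column per dimension before saving. The helper `expand_vector_column` and the `w2v_` prefix are hypothetical names introduced for this example.

```python
import io

import numpy as np
import pandas as pd


def expand_vector_column(df, column="word2vec_vectors", prefix="w2v_"):
    """Replace a column of equal-length vectors with one numeric column per dimension."""
    expanded = pd.DataFrame(df[column].tolist(), index=df.index).add_prefix(prefix)
    return pd.concat([df.drop(columns=[column]), expanded], axis=1)


# Toy frame standing in for train_data_mod (values are made up).
toy = pd.DataFrame({
    "keyword": ["missing", "missing"],
    "word2vec_vectors": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})
flat = expand_vector_column(toy)

# Round-trip through an in-memory CSV to confirm the values come back as floats.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.dtypes)
```

Applied to the real data, this would produce 50 columns (`w2v_0` through `w2v_49`) that can be loaded straight into a feature matrix.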