## Example of Vector Embeddings using Word2Vec

Note that the gensim library used for Word2Vec requires a NumPy version below 2.0, which is incompatible with the more recent versions we have been using. To resolve this, you may need to downgrade NumPy to a compatible version. If you get an error message when you run the code block that pre-processes the text, you will likely need to restart the Jupyter kernel, because a kernel that has already imported NumPy keeps using that copy even after a different version is installed. To restart, open the command palette by pressing Cmd+Shift+P (Mac) or Ctrl+Shift+P (Windows) and select "Jupyter: Restart Kernel". Restarting clears all variables and loaded modules and gives this code a clean slate.

In [None]:
# Downgrade NumPy to a version compatible with the gensim library.

# First, remove NumPy completely so that no conflicting version remains installed
!pip uninstall -y numpy
!pip cache purge

# Next, reinstall gensim together with a NumPy version earlier than 2.0,
# so that both packages end up in compatible versions
!pip install --force-reinstall gensim "numpy<2.0"

# Finally, check which NumPy the kernel is using.
# If this still reports 2.x, restart the kernel and run these three lines again.
import numpy as np
print("NumPy version:", np.__version__)
print("NumPy location:", np.__file__)
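
If you want to confirm which NumPy version is installed on disk without importing it (an import can return a stale copy that the kernel loaded earlier), a small check along these lines may help. This is only a sketch: the helper name `installed_major_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major_version(package):
    """Return the major version of an installed package, or None if it is missing."""
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


numpy_major = installed_major_version("numpy")
if numpy_major is None:
    print("numpy is not installed")
elif numpy_major >= 2:
    print("numpy 2.x is installed: downgrade it before importing gensim")
else:
    print("numpy 1.x is installed: this matches the 'numpy<2.0' requirement")
```

If the message and the version printed by `np.__version__` disagree, the kernel is most likely still holding an older import and needs a restart.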

Install necessary libraries if you have not already done so.

In [None]:
!pip install pandas
!pip install gensim
!pip install scikit-learn
!pip install matplotlib
!pip install "numpy<2.0"  # keep NumPy below 2.0 so it stays compatible with gensim

Import necessary libraries.

In [None]:
# Import necessary libraries
import pandas as pd
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

Create a toy dataset.

In [None]:
# Example dataset
data = {
    "tweet": [
        "I love the new iPhone! It's fantastic. #Apple",
        "The service at this restaurant was terrible. Never going back. #Disappointed",
        "Tesla's new model is groundbreaking! #Innovation",
        "I had an average experience with the product. It's okay. #Neutral",
    ]
}
df = pd.DataFrame(data)

Next, we preprocess our data.  The first step is to tokenize the tweets.  As you look at the code, think about what it is doing.
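
As a quick illustration outside the DataFrame, the same lowercase-and-split operation can be tried on a single string in plain Python. Note that splitting on whitespace leaves punctuation attached to the words. The `simple_tokenize` helper is a hypothetical alternative, shown only for comparison; it is not used elsewhere in this notebook.

```python
import re

tweet = "I love the new iPhone! It's fantastic. #Apple"

# Same effect as .str.lower().str.split() applied to one string
whitespace_tokens = tweet.lower().split()
print(whitespace_tokens)
# ['i', 'love', 'the', 'new', 'iphone!', "it's", 'fantastic.', '#apple']


def simple_tokenize(text):
    """Hypothetical alternative: keep letters, digits, apostrophes, and a leading '#'."""
    return re.findall(r"#?[a-z0-9']+", text.lower())


regex_tokens = simple_tokenize(tweet)
print(regex_tokens)
# ['i', 'love', 'the', 'new', 'iphone', "it's", 'fantastic', '#apple']
```

With whitespace splitting, "iphone!" and "iphone" would count as two different vocabulary entries, which matters once the corpus grows beyond a toy example.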

The next step is to train our Word2Vec model. Copy the `Word2Vec(...)` line into ChatGPT and ask it to explain the model parameters (`vector_size`, `window`, `min_count`, and `workers`) to you.

Finally, we store the trained word vectors (`model.wv`) in a new variable called `word_vectors`.

In [None]:
# Preprocessing: Tokenize tweets
df["tokens"] = df["tweet"].str.lower().str.split()

# Train Word2Vec model
model = Word2Vec(sentences=df["tokens"], vector_size=50, window=5, min_count=1, workers=4)

# Get embeddings for words in the vocabulary
word_vectors = model.wv


The following function visualizes the word embeddings using t-SNE, which projects the 50-dimensional vectors down to two dimensions so they can be plotted.

In [None]:
# Visualize embeddings using t-SNE
def plot_word_embeddings(word_vectors):
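    words = list(word_vectors.key_to_index.keys())
    vectors = word_vectors[words]

    # Reduce dimensions for visualization.
    # t-SNE requires perplexity < number of points, so cap it for small vocabularies.
    perplexity = min(30, len(words) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    reduced_vectors = tsne.fit_transform(vectors)
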
    # Plot the embeddings
    plt.figure(figsize=(10, 8))
    for i, word in enumerate(words):
        plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1])
        plt.text(reduced_vectors[i, 0] + 0.01, reduced_vectors[i, 1] + 0.01, word, fontsize=9)
    plt.title("Word2Vec Embeddings Visualization")
    plt.show()

plot_word_embeddings(word_vectors)
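
To see what t-SNE takes in and returns without training a model first, the following sketch runs it on random data. The random vectors are a hypothetical stand-in for real word vectors, and `perplexity=5` is an arbitrary choice for this small example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for word vectors: 12 random 50-dimensional points
rng = np.random.default_rng(0)
vectors = rng.normal(size=(12, 50))

# perplexity must be smaller than the number of points (12 here)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
reduced = tsne.fit_transform(vectors)

print(vectors.shape)  # (12, 50)
print(reduced.shape)  # (12, 2)
```

Each row of `reduced` is the 2-D position of one input point, which is what the plotting loop uses as x and y coordinates.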