# E3 - Word Representations
### 1. Training a Word Embedder using Word2Vec

#### Selecting a Dataset
Find a good dataset in any language you like. It does not have to be English; any language you know well enough to judge a model's output will work.

We will be exploring different word embeddings and tokenizers.
Here are some resources for datasets to use for tokenization and for training embeddings:
- [Kaggle Datasets](https://www.kaggle.com/datasets)

- [Hugging Face Datasets](https://huggingface.co/datasets)

- [The Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)

- [NLTK Data (Natural Language Toolkit)](https://www.nltk.org/nltk_data)

- [Common Crawl](https://commoncrawl.org)

#### Import Libraries

In [None]:
#%pip install nltk gensim scikit-learn matplotlib torch

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from gensim.models import Word2Vec

In [None]:
# TODO: download the pre-trained nltk 'punkt' word tokenizer
# (newer NLTK releases may ask for 'punkt_tab' instead)

# TODO: Load your chosen dataset, e.g. from the Leipzig Corpora Collection
with open('your_dataset', 'r', encoding='utf-8') as f:
    text = f.read()

# TODO: Perform basic preprocessing (lowercasing, removing special characters) with regular expressions

# TODO: Use the word_tokenize() function to tokenize the text and print the first ten tokens
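
The preprocessing TODOs above could be approached roughly as follows. This is a minimal sketch using only the standard library: `preprocess` is a hypothetical helper name, the sample string is made up, and `str.split()` stands in for `word_tokenize()` so the snippet runs without the punkt download.

```python
import re

def preprocess(text):
    """Lowercase text and replace special characters with spaces (hypothetical helper)."""
    text = text.lower()
    # \w matches Unicode letters, digits and underscore, so non-English alphabets survive
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removed characters
    return re.sub(r"\s+", " ", text).strip()

sample = "Hello, World! It's 2024."  # made-up sample text
cleaned = preprocess(sample)
tokens = cleaned.split()  # stand-in for word_tokenize(cleaned)
print(tokens[:10])
```

Whether to keep digits, apostrophes or hyphens depends on the language and the dataset, so the character class in the first `re.sub` is the part most worth adapting.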

### Create Word Embeddings with Word2Vec

We use Word2Vec to create word embeddings. Word2Vec learns vector representations of words based on the context in which they appear, so that words used in similar contexts end up with vectors that lie close together in the embedding space.

Word2Vec has two main approaches:
1. **Skip-Gram**: Predicts context words from a center word.
2. **Continuous Bag of Words (CBOW)**: Predicts a center word from surrounding context words.

For simplicity, we'll use the CBOW approach (`sg=0`, which is gensim's default).
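
The difference between the two approaches shows up in the training examples each one builds from a token sequence. The following is an illustrative sketch only: `skipgram_pairs` and `cbow_examples` are hypothetical helper names, and real implementations such as gensim add further details during training that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs: Skip-Gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples

tokens = ["the", "cat", "sat", "down"]  # made-up example sentence
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-Gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context as input.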

In [None]:
embedding_dim = 100 
window_size = 5
min_word_count = 1
workers = 4

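# TODO: Train a Word2Vec model with gensim's 'Word2Vec' class and store it in
# 'word2vec_model' (the visualization cell below uses this name).
# Note: 'Word2Vec' expects a list of sentences, each given as a list of tokens.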

### Explore and Visualize the Word Embeddings

Once the model is trained, we can explore the word embeddings. We'll look at some examples of similar words and visualize the embedding space with dimensionality reduction techniques.
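
Word similarity in an embedding space is usually measured with cosine similarity, which is also what gensim's `most_similar` ranks by. The sketch below shows the idea with the standard library only; the three-dimensional vectors are made up for illustration, and `cosine_similarity` and `most_similar` are hypothetical helper names, not the gensim API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

print(most_similar("king", toy_vectors))
```

With a trained gensim model, a call such as `word2vec_model.wv.most_similar('hello', topn=5)` performs this kind of ranking over the whole vocabulary, provided the query word is in the vocabulary.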

In [None]:
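# TODO: Select a subset of words to visualize and play around with different tokens.
# PCA with n_components=2 needs at least two words, and every word must be in the vocab.
words_to_visualize = ['hello']  # TODO: add more words from your corpus
# Drop out-of-vocabulary words so the lookup below cannot raise a KeyError
words_to_visualize = [word for word in words_to_visualize if word in word2vec_model.wv]
word_vectors = [word2vec_model.wv[word] for word in words_to_visualize]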

# Reduce dimensionality
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(word_vectors)

# Plot the word vectors
plt.figure(figsize=(10, 8))
for i, word in enumerate(words_to_visualize):
    plt.scatter(reduced_vectors[i][0], reduced_vectors[i][1])
    plt.annotate(word, xy=(reduced_vectors[i][0], reduced_vectors[i][1]), fontsize=12)
plt.title("2D Visualization of Word Embeddings")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()

In [None]:
from gensim.downloader import load
dataset = load("text8")
sentences = list(dataset)  # Each item is a list of word tokens (text8 has no punctuation, so items are chunks rather than true sentences)

In [None]:
# Train Word2Vec model
model = Word2Vec(sentences=sentences, vector_size=embedding_dim, window=window_size,
                 min_count=min_word_count, workers=workers)
print("Word2Vec model training complete.")

In [None]:
# Dimensionality Reduction and Visualization with PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

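# TODO: Try different words with the Word2Vec model trained on the text8 dataset
words_to_visualize = ['king', 'queen', 'man', 'woman', 'great', 'awesome', 'amazing']
# Drop out-of-vocabulary words first so labels and vectors stay aligned in the plot loop
words_to_visualize = [word for word in words_to_visualize if word in model.wv]
word_vectors = [model.wv[word] for word in words_to_visualize]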

pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(word_vectors)

plt.figure(figsize=(10, 8))
for i, word in enumerate(words_to_visualize):
    plt.scatter(reduced_vectors[i][0], reduced_vectors[i][1])
    plt.annotate(word, xy=(reduced_vectors[i][0], reduced_vectors[i][1]), fontsize=12)
plt.title("2D Visualization of Word Embeddings")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()