# Assignment 4: Neural Networks

---

## Task 1) Skip-grams

Tomas Mikolov's [original paper](https://arxiv.org/abs/1301.3781) for word2vec is not very specific on how to actually compute the embedding matrices.
Xin Rong provides a much more detailed [walk-through](https://arxiv.org/pdf/1411.2738.pdf) of the math; I recommend you go through it before you continue with this assignment.
While the original implementation was written in C and estimates the matrices directly, in this assignment we want to use PyTorch (and autograd) to train the matrices.
There are plenty of example implementations and blog posts out there that show how to do it; I particularly recommend [Mateusz Bednarski's](https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb) version. Familiarize yourself with skip-grams and how to train them using PyTorch.
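
To make "training the matrices with autograd" concrete, here is a minimal sketch of a single full-softmax skip-gram update. The toy sizes, the word indices, the learning rate and the names `W_in` and `W_out` are assumptions chosen for illustration; they are not taken from the papers or from the recommended blog post.

```python
import torch

torch.manual_seed(0)

vocab_size, embed_size = 6, 4  # toy sizes, chosen for illustration only

# The two matrices word2vec learns: input (center) and output (context) embeddings.
W_in = torch.randn(vocab_size, embed_size, requires_grad=True)
W_out = torch.randn(vocab_size, embed_size, requires_grad=True)

center = torch.tensor([2])   # index of the center word (hypothetical)
context = torch.tensor([4])  # index of one observed context word (hypothetical)

# Scores of every vocabulary word as context of the center word.
logits = W_in[center] @ W_out.T  # shape: (1, vocab_size)

# Softmax cross-entropy = negative log-probability of the true context word.
loss = torch.nn.functional.cross_entropy(logits, context)
loss.backward()  # autograd fills W_in.grad and W_out.grad

# One plain gradient-descent step on both matrices.
with torch.no_grad():
    W_in -= 0.1 * W_in.grad
    W_out -= 0.1 * W_out.grad
```

Only the row of `W_in` that belongs to the center word receives a non-zero gradient, whereas every row of `W_out` is touched by the softmax. That cost is the reason approximations such as negative sampling are commonly used.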

### Data

Download the `theses.csv` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 3,000 thesis topics chosen by students in the past.
Here are some examples of the file content:

```
27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
```
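
As a rough sketch of how rows like these could be parsed, the snippet below reads the three example lines with pandas. It assumes a `;` separator and no header row in the sample; the column names are the ones listed under "Prepare the Data". The real file may differ in separator, header and encoding.

```python
import io
import pandas as pd

# Column names follow the header given under "Prepare the Data".
COLUMNS = ["Anmeldedatum", "Abgabedatum", "JahrAkademisch", "Art",
           "Grad", "Sprache", "Titel", "Abstract"]

sample = """27.10.94;14.07.95;1995;intern;Diplom;DE;Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System;
04.11.94;14.03.95;1995;intern;Diplom;DE;Implementierung eines Testüberdeckungsgrad-Analysators für RAS;
01.11.20;01.04.21;2021;intern;Bachelor;DE;Landessprachenerkennung mittels X-Vektoren und Meta-Klassifikation;
"""

# The trailing ";" yields an empty eighth field, which pandas reads as NaN.
df = pd.read_csv(io.StringIO(sample), sep=";", header=None, names=COLUMNS)

# Keep German titles only and lower-case them.
german_titles = df.loc[df["Sprache"] == "DE", "Titel"].str.lower()
print(german_titles.tolist())
```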

### Basic Setup

For the upcoming assignments on Neural Networks, we'll be heavily using [PyTorch](https://pytorch.org) as our go-to Deep Learning library.
If you're not already familiar with PyTorch, now's the time to get started with it.
Head over to the [Basics](https://pytorch.org/tutorials/beginner/basics/intro.html) and gain some understanding about the essentials.
Before starting this assignment, make sure you've got PyTorch installed in your working environment. 
It's a quick setup, and you'll find all the instructions you need on the PyTorch website.
As always, you can use [NumPy](https://numpy.org) and [Pandas](https://pandas.pydata.org) for data handling etc.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [1]:
# Dependencies
import numpy as np
import pandas as pd
import re
import torch



### Prepare the Data

1.1 Spend some time preparing the dataset. It may be helpful to lower-case the data and to filter for German titles. The CSV file should have the following format:

```
Anmeldedatum;Abgabedatum;JahrAkademisch;Art;Grad;Sprache;Titel;Abstract
```

1.2 Create the vocabulary from the prepared dataset. You'll need it for the initialization of the matrices and to map tokens to indices.

1.3 Generate the training pairs with center word and context word. Which window size do you choose?
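
Steps 1.2 and 1.3 can be sketched on toy data as follows. The two tokenized titles, the helper name `skipgram_pairs` and the window size of 1 are assumptions for illustration only; which window size works best on the real data is left open.

```python
from collections import Counter

# Two hypothetical, already tokenized and lower-cased titles.
titles = [
    ["implementierung", "eines", "analysators", "für", "ras"],
    ["konzeption", "eines", "analysators"],
]

# Count tokens; sorting makes the index assignment reproducible across runs.
counts = Counter(token for title in titles for token in title)
vocab = sorted(counts)
word2idx = {word: idx for idx, word in enumerate(vocab)}

def skipgram_pairs(tokens, window_size):
    """Yield (center, context) pairs at most `window_size` positions apart."""
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

pairs = [p for title in titles for p in skipgram_pairs(title, window_size=1)]
print(len(word2idx), len(pairs))
```

The token counts are not needed for the index itself, but they are handy if you later want to drop rare words or sample negatives by frequency.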

In [2]:
def load_theses_dataset(filepath):
    """Loads all theses instances and returns them as a dataframe."""
    ### YOUR CODE HERE
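    # Assumption: the file is Mac Roman encoded; the Windows-only 'ANSI' alias mis-decodes the umlauts.
    return pd.read_csv(filepath, sep="\t", header=0, encoding="mac_roman")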
    ### END YOUR CODE

In [3]:
def preprocess(dataframe: pd.DataFrame):
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE
    pattern = r"\b[\wäöüÄÖÜß]+(?:[-'][\wäöüÄÖÜß]+)*\b"
    df_german = dataframe[dataframe.Sprache == "DE"]
    tokens = [re.findall(pattern, str(row).lower(), flags=re.UNICODE) for row in df_german.Titel]
    return tokens
    ### END YOUR CODE

In [4]:
def create_training_pairs(data, word2idx, window_size, num_negatives=5):
    """Creates skip-gram (word, context, label) pairs with negative sampling."""
    train_pairs = []
    vocab = list(word2idx.values())

    # Positive skip-gram pairs (label = 1)
    for title in data:
        for target_idx in range(len(title)):
            target_word = title[target_idx]
            target_id = word2idx[target_word]

            start = max(0, target_idx - window_size)
            end = min(len(title), target_idx + window_size + 1)

            for context_idx in range(start, end):
                if context_idx != target_idx:
                    context_word = title[context_idx]
                    context_id = word2idx[context_word]
                    train_pairs.append((target_id, context_id, 1))

                    # Negative sampling (label = 0)
                    for _ in range(num_negatives):
                        negative_id = np.random.choice(vocab)
                        while negative_id == context_id:
                            negative_id = np.random.choice(vocab)
                        train_pairs.append((target_id, negative_id, 0))

    return train_pairs
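
The function above draws negatives uniformly from the vocabulary. The word2vec authors' follow-up work on negative sampling instead draws them from the unigram distribution raised to the power 3/4. A minimal sketch of that variant, assuming hypothetical token counts indexed by word id and a made-up helper name `sample_negatives`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token counts, indexed by word id.
counts = np.array([50, 20, 10, 5, 1], dtype=np.float64)

# Smoothed unigram distribution: frequent words are sampled more often,
# but less than proportionally to their raw frequency.
probs = counts ** 0.75
probs /= probs.sum()

def sample_negatives(exclude_id, num_negatives):
    """Draw `num_negatives` word ids from `probs`, never returning `exclude_id`."""
    p = probs.copy()
    p[exclude_id] = 0.0
    p /= p.sum()
    return rng.choice(len(p), size=num_negatives, p=p)

negatives = sample_negatives(exclude_id=0, num_negatives=5)
print(negatives)
```

Drawing all negatives for a pair in one `rng.choice` call also avoids the per-sample Python loop.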

In [6]:
dataframe = load_theses_dataset("C:\\Users\\Felix Rittmaier\\PythonProjects\\seqlrn_assignments\\4-nnet\\data\\theses.tsv")
tokenized_data = preprocess(dataframe)
vocabulary = set()
for tokens in tokenized_data:
    vocabulary.update(tokens)
  
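# Sort the vocabulary so the index assignment is reproducible across runs.
sorted_vocabulary = sorted(vocabulary)
word2idx = {word: idx for idx, word in enumerate(sorted_vocabulary)}
idx2word = {idx: word for word, idx in word2idx.items()}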

training_pairs = create_training_pairs(tokenized_data, word2idx, 2)

for target_idx, context_idx, label in training_pairs[:20]:
    print(f"({idx2word[target_idx]}, {idx2word[context_idx]})", end=", ")

for target_idx, context_idx, label in training_pairs[-20:]:
    print(f"({idx2word[target_idx]}, {idx2word[context_idx]})", end=", ")

(email, am), (email, pc-cd-rom-steuerung), (email, kostenüberwachung), (email, referenzmodells), (email, technologievergleich), (email, on-premises), (email, beispiel), (email, technologiebewertung), (email, impact), (email, modellbasierte), (email, networks), (email, werkzeuggestützte), (am, email), (am, webgl-basierten), (am, liquiditätsmanagements), (am, strukturierung), (am, mannschaften), (am, unternehmenssoftware), (am, beispiel), (am, marktforschung), (von, untersuchung), (von, cyber-threat-modells), (von, automotive-spice-prozessmodellen), (von, fallbeispiel), (von, kompatibles), (von, e-mail), (von, systemzuständen), (von, performancetestergebnissen), (automotive-spice-prozessmodellen, bereitstellung), (automotive-spice-prozessmodellen, fax), (automotive-spice-prozessmodellen, bucketlist), (automotive-spice-prozessmodellen, tcp), (automotive-spice-prozessmodellen, ua), (automotive-spice-prozessmodellen, axapta), (automotive-spice-prozessmodellen, von), (automotive-spice-prozes

### Train and Analyze

2.1 Implement and train the word2vec model with your training data.

2.2 Implement a method to find the top-k similar words for a given word (token).

2.3 Analyze: What are the most similar words to "Konzeption", "Cloud" and "virtuelle"?

In [15]:
### TODO: 2.1 Implement and train the word2vec model.

### YOUR CODE HERE
VOCAB_SIZE = len(vocabulary)
EMBEDDING_SIZE = 300  # from the original word2vec paper
EPOCHS = 50

class SkipGramDataset(torch.utils.data.Dataset):
    def __init__(self, training_pairs):
        """
        training_pairs: list of tuples (target_id, context_id, label)
        """
        self.pairs = training_pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        target, context, label = self.pairs[idx]
        return (
            torch.tensor(target, dtype=torch.long),
            torch.tensor(context, dtype=torch.long),
            torch.tensor(label, dtype=torch.float),
        )

class Word2Vec(torch.nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(Word2Vec, self).__init__()
        self.linear1 = torch.nn.Linear(vocab_size, embed_size)
        self.linear2 = torch.nn.Linear(embed_size, vocab_size, bias=False)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

dataset = SkipGramDataset(training_pairs)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
one_hot_encoding = np.zeros(shape=(len(vocabulary), len(vocabulary)))
np.fill_diagonal(one_hot_encoding ,1)
print(one_hot_encoding)

model = Word2Vec(VOCAB_SIZE, EMBEDDING_SIZE)
loss_function = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for e in range(EPOCHS):
    total_loss = 0

    for target_idx, context_idx, label in dataloader:
        one_hot_target = torch.tensor(one_hot_encoding[target_idx], dtype=torch.float)

        output = model(one_hot_target)

        # Predict only context word columns
        pred = output.gather(1, context_idx.view(-1, 1))

        # Compute loss; squeeze only dim 1 so a final batch of size 1 keeps its batch dimension
        loss = loss_function(pred.squeeze(1), label)

        # Backprop and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {e+1}, Loss: {total_loss:.4f}")
### END YOUR CODE

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
Epoch 1, Loss: 5441.7821
Epoch 2, Loss: 2865.9892
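
Multiplying a one-hot vector by the weight matrix of `linear1` selects a single column of that matrix (plus the bias), so the same idea can be expressed with embedding lookups, which avoids the dense vocabulary-sized one-hot matrix. The following is a sketch of that alternative, not a drop-in replacement for the cell above; the class name `SkipGramNS`, the toy sizes and the toy batch are assumptions.

```python
import torch

class SkipGramNS(torch.nn.Module):
    """Skip-gram with negative sampling, using embedding lookups."""

    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.center = torch.nn.Embedding(vocab_size, embed_size)
        self.context = torch.nn.Embedding(vocab_size, embed_size)

    def forward(self, center_ids, context_ids):
        # Dot product between center and context vectors: one logit per pair.
        v = self.center(center_ids)
        u = self.context(context_ids)
        return (v * u).sum(dim=1)

torch.manual_seed(0)
model_ns = SkipGramNS(vocab_size=10, embed_size=8)
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer_ns = torch.optim.SGD(model_ns.parameters(), lr=0.1)

# Toy batch of (center, context, label) triples; label 1 = observed, 0 = negative.
center_ids = torch.tensor([1, 1, 2, 2])
context_ids = torch.tensor([3, 7, 4, 9])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

losses = []
for _ in range(20):
    logits = model_ns(center_ids, context_ids)
    loss = loss_fn(logits, labels)
    optimizer_ns.zero_grad()
    loss.backward()
    optimizer_ns.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

With this layout, `model_ns.center.weight` already has one row per word, so it can be used as the embedding matrix without transposing.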


In [None]:
### TODO: 2.2 Implement a method to find the top-k similar words.

### YOUR CODE HERE
def similarity_search(word_idx, trained_weights, k=10):
    word_embedding = trained_weights[word_idx]
    sim_scores = []

    for i, vec in enumerate(trained_weights):
        if i == word_idx:
            continue  # skip comparing the word to itself

        norm_product = np.linalg.norm(word_embedding) * np.linalg.norm(vec)
        if norm_product == 0:  # avoid division with 0
            similarity = 0
        else:
            similarity = np.dot(word_embedding, vec) / norm_product

        sim_scores.append((i, similarity))

    sim_scores.sort(key=lambda x: x[1], reverse=True)

    return sim_scores[:k]


### END YOUR CODE
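
The loop above computes one cosine similarity per iteration. A vectorized variant, sketched here with NumPy on a toy matrix, normalizes all rows once and gets every similarity from a single matrix-vector product. The function name `top_k_similar` and the toy values are assumptions.

```python
import numpy as np

def top_k_similar(word_idx, embeddings, k=10):
    """Return (index, cosine similarity) for the k nearest rows to `word_idx`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # guard against zero vectors
    sims = unit @ unit[word_idx]
    sims[word_idx] = -np.inf                      # exclude the query word itself
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy embedding matrix: rows 0 and 1 point in nearly the same direction.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(top_k_similar(0, emb, k=2))
```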

In [None]:
### TODO: 2.3 Find the most similar words for "Konzeption", "Cloud" and "virtuelle".

### YOUR CODE HERE
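print(model)
# linear1.weight has shape (embed_size, vocab_size); transpose so that row i is the embedding of word i.
trained_weights = model.linear1.weight.detach().cpu().numpy().T

for query in ["konzeption", "cloud", "virtuelle"]:
    neighbours = similarity_search(word2idx[query], trained_weights, k=5)
    print(query, [(idx2word[i], round(float(score), 3)) for i, score in neighbours])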

### END YOUR CODE

### Play with the Embeddings

3.1 Use the computed embeddings: Can you identify the most similar theses for some examples?

3.2 Visualize the embeddings for a subset of theses using [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). You can use [Scikit-Learn](https://scikit-learn.org/stable/) and [Matplotlib](https://matplotlib.org) or [Seaborn](https://seaborn.pydata.org).
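
One simple way to get a vector per thesis is to average the word vectors of its title and compare theses by cosine similarity. This is only one possible choice, and the task does not prescribe it. In the sketch below, the helper names and the toy vocabulary, embeddings and titles are all made up for illustration.

```python
import numpy as np

def title_embedding(tokens, word2idx, embeddings):
    """Mean of the word vectors of all in-vocabulary tokens of one title."""
    ids = [word2idx[t] for t in tokens if t in word2idx]
    if not ids:
        return np.zeros(embeddings.shape[1])
    return embeddings[ids].mean(axis=0)

def most_similar_titles(query_idx, title_matrix, k=3):
    """Indices of the k titles closest to title `query_idx` by cosine similarity."""
    norms = np.linalg.norm(title_matrix, axis=1, keepdims=True)
    unit = title_matrix / np.maximum(norms, 1e-12)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query title itself
    return np.argsort(-sims)[:k].tolist()

# Toy data standing in for the trained embeddings and tokenized titles.
toy_word2idx = {"cloud": 0, "migration": 1, "virtuelle": 2, "maschinen": 3}
toy_embeddings = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
toy_titles = [["cloud", "migration"], ["virtuelle", "maschinen"], ["cloud"]]

title_matrix = np.stack(
    [title_embedding(t, toy_word2idx, toy_embeddings) for t in toy_titles]
)
print(most_similar_titles(0, title_matrix, k=1))
```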

In [None]:
### TODO: 3.1 Compute the embeddings for the theses and transform with TSNE.

### YOUR CODE HERE



### END YOUR CODE

In [None]:
### TODO: 3.2 Visualize the samples in the 2D space.

### YOUR CODE HERE



### END YOUR CODE
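
As a starting point for the t-SNE step, here is a minimal sketch of projecting thesis vectors to 2D with scikit-learn. The random vectors only stand in for real thesis embeddings, and the perplexity, `init` and seed values are arbitrary assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Random vectors standing in for 40 thesis embeddings of dimension 16.
thesis_vectors = rng.normal(size=(40, 16))

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points_2d = tsne.fit_transform(thesis_vectors)

print(points_2d.shape)
```

The resulting array can be passed to a scatter plot, for example `plt.scatter(points_2d[:, 0], points_2d[:, 1])` with Matplotlib, and annotated with a few titles to check whether related theses end up close together.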