# Transformer-based Natural Language Processing
## Introduction to PyTorch & Deep Learning in NLP 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/texttechnologylab/SoSe22-M-PNLR-PR-TbNLP/blob/main/embeddings.ipynb)

### Acquiring Some Data

- Use the code below to acquire some sentence-segmented data.
    - Note: You may also use any other corpus available to you.

In [None]:
# Download a small dataset of sentences from the English Wikipedia, provided by the "Wortschatz" project of Leipzig University
# - D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
#   In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012
!mkdir -p data
!curl http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng_wikipedia_2016_10K.tar.gz -o data/eng_wikipedia_2016_10K.tar.gz
!tar -xf data/eng_wikipedia_2016_10K.tar.gz -C data/

plain_text_file = "data/eng_wikipedia_2016_10K/eng_wikipedia_2016_10K-sentences.txt"

### Installing Necessary Packages (e.g. if on Colab)

In [None]:
%pip install torch datasets tokenizers

### Working with `🤗 datasets`

1. Familiarize yourself with the `🤗 datasets` package and its API.
2. Load the plain text corpus that was downloaded using the code above.
3. (Pre-)Process the data:
    - Remove the line number preceding each sentence.
    - Split the sentences into words/tokens.


#### Resources

- [`🤗 datasets` Documentation](https://huggingface.co/docs/datasets/index)
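
The pre-processing step can be sketched as a small per-example function. This is a minimal sketch, not a reference solution: it assumes each line of the corpus file has the form `<line number><whitespace><sentence>` and that the corpus was loaded with the generic text loader (`load_dataset("text", data_files=plain_text_file)`), which stores every line in a field named `"text"`. The function name `preprocess` and the plain whitespace split are choices made for illustration only.

```python
import re

# Assumed layout of each line: "<line number><whitespace><sentence>".
LINE_NUMBER = re.compile(r"^\d+\s+")


def preprocess(example):
    """Strip the leading line number and split the sentence on whitespace."""
    sentence = LINE_NUMBER.sub("", example["text"])
    return {"text": sentence, "tokens": sentence.split()}


# Stand-alone check on a hand-written example line (not taken from the corpus).
example = {"text": "1\tThe quick brown fox jumps."}
processed = preprocess(example)
print(processed["tokens"])
```

A function of this shape (dictionary in, dictionary out) can be passed to `Dataset.map`, which applies it to every example and adds the returned fields as columns.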

In [None]:
from datasets import ...  # TODO

corpus = ...  # TODO: Load the plain text file

### Working with `🤗 tokenizers`

1. Implement a tokenization approach using the `🤗 tokenizers` library.
    - There are [multiple different models](https://huggingface.co/docs/tokenizers/python/latest/components.html#models) of tokenizers available. Which one do you choose for the task at hand?
2. Tokenize your dataset using the new tokenizer and rerun your experiment from above.
3. Evaluate the results and compare them with the results from above.

#### Resources

- [`🤗 tokenizers` Documentation](https://huggingface.co/docs/tokenizers/python/latest/)
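
To build intuition for the model choice: byte-pair encoding (BPE), one of the models offered by the library, learns its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The toy function below sketches that loop in plain Python. `learn_bpe_merges` is a hypothetical name and not part of `🤗 tokenizers`; the library's implementation differs in many details (pre-tokenization, special tokens, tie-breaking, efficiency).

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is a single symbol, nothing left to merge
        # Most frequent pair; on ties, the pair seen first wins.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair by one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

Word-level models, by contrast, keep whole words as vocabulary entries and map everything unseen to an unknown token, which is one of the trade-offs to weigh when choosing a model.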

In [None]:
import tokenizers

def process_with_tokenizers(*args, **kwargs):  # TODO
    pass

dataset = ...  # TODO: Process the corpus

### NLP in PyTorch &ndash; Basic Word2Vec

Familiarize yourself with the Word2Vec model introduced in Mikolov et al. (2013).

1. Implement the CBOW model (without hierarchical softmax or negative sampling improvements).
2. Train the model on the small dataset from above.
3. Evaluate your results.
    - Note: The small dataset provided above is only sufficient for development, not for training a model that gives reasonable results.
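
Before training, each sentence has to be turned into (context, target) examples. The generator below is one possible sketch, assuming sentences are already lists of token ids; the name `cbow_pairs` and the default window size are illustrative. For simplicity it skips positions that lack a full window on both sides; padding the sentence would be an alternative.

```python
def cbow_pairs(token_ids, window=2):
    """Yield (context, target) pairs for positions with a full window on both sides."""
    for i in range(window, len(token_ids) - window):
        # Context: `window` tokens to the left and to the right, target excluded.
        context = token_ids[i - window:i] + token_ids[i + 1:i + window + 1]
        yield context, token_ids[i]


# Hand-written token ids, for illustration only.
sentence = [10, 11, 12, 13, 14, 15]
pairs = list(cbow_pairs(sentence, window=2))
print(pairs)
```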

#### References
- "Efficient Estimation of Word Representations in Vector Space", Mikolov et al., 2013, [arxiv.org/abs/1301.3781](http://arxiv.org/abs/1301.3781)

![Mikolov et al. (2013)](images/cbow.png)
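
Read as formulas, the architecture in the figure with a full softmax output (i.e. without hierarchical softmax) can be written as follows. The notation is an assumption of this sketch: window size $c$, input embeddings $\mathbf{v}_w$, output vectors $\mathbf{u}_w$, and vocabulary $V$.

$$
\mathbf{h}_t = \frac{1}{2c} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \mathbf{v}_{w_{t+j}},
\qquad
p(w_t \mid w_{t-c}, \dots, w_{t+c}) = \frac{\exp\left(\mathbf{u}_{w_t}^\top \mathbf{h}_t\right)}{\sum_{w \in V} \exp\left(\mathbf{u}_w^\top \mathbf{h}_t\right)}
$$

Training then minimizes the negative log-likelihood $-\sum_t \log p(w_t \mid w_{t-c}, \dots, w_{t+c})$ over all positions $t$ in the corpus.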

#### Hints
- Use the `torch.nn.Embedding` module to create your vector representations.
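
As a small illustration of this hint, the snippet below looks up a batch of context token ids in an `nn.Embedding` and averages the resulting vectors. The sizes and token ids are toy values chosen for the example; it shows only the lookup and averaging, not a complete model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim = 10, 4  # toy sizes, for illustration only
embedding = nn.Embedding(vocab_size, embedding_dim)

# A batch of 2 examples, each with 4 context token ids (2 left, 2 right).
context_ids = torch.tensor([[1, 2, 4, 5],
                            [0, 3, 7, 9]])

context_vectors = embedding(context_ids)  # shape: (batch, context, embedding_dim)
hidden = context_vectors.mean(dim=1)      # shape: (batch, embedding_dim)

print(context_vectors.shape, hidden.shape)
```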

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, *args, **kwargs):  # TODO
        super().__init__()
        pass

    def forward(self, *args, **kwargs):  # TODO
        pass

In [None]:
model = CBOW(...)  # TODO: Model initialization
optimizer = ...  # TODO: Optimizer initialization

for epoch in ...:
    model.train()
    pass  # TODO: Training code goes here

model.eval()
pass  # TODO: Evaluation code goes here
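
The exercise does not prescribe an evaluation method. One common way to inspect word vectors qualitatively is to list the nearest neighbours of a word by cosine similarity. The sketch below does this in plain Python on hand-made toy vectors (they are not trained results); with a trained model, the vectors would instead come from its embedding weights. The function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k=3):
    """Return the k words most similar to `word`, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hand-made 2-dimensional vectors, for illustration only.
toy_vectors = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "apple": [0.1, 0.9]}
neighbours = nearest_neighbours("king", toy_vectors, k=2)
print([word for word, _ in neighbours])
```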

### Bonus: Wang2Vec

1. Implement the CWINDOW model from Ling et al. (2015).
2. Train the model on the same dataset as before.
3. Evaluate and compare your results.

#### References
- "Two/Too Simple Adaptations of Word2Vec for Syntax Problems", Ling et al., 2015, [aclanthology.org/N15-1142/](https://aclanthology.org/N15-1142/)


![Ling et al. (2015)](images/cwindow.png)
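
The key difference from CBOW is that the context embeddings are concatenated in order instead of averaged, so the output layer sees each context position separately. In formulas (notation assumed for this sketch: window size $c$, embedding dimension $d$, input embeddings $\mathbf{v}_w$, output matrix $\mathbf{O}$, vocabulary $V$):

$$
\mathbf{h}_t = \left[\mathbf{v}_{w_{t-c}}; \dots; \mathbf{v}_{w_{t-1}}; \mathbf{v}_{w_{t+1}}; \dots; \mathbf{v}_{w_{t+c}}\right] \in \mathbb{R}^{2cd},
\qquad
p(w_t \mid w_{t-c}, \dots, w_{t+c}) = \operatorname{softmax}\left(\mathbf{O}\,\mathbf{h}_t\right)_{w_t},
\quad
\mathbf{O} \in \mathbb{R}^{|V| \times 2cd}
$$

Compared with CBOW, the input dimension of the output layer therefore grows from $d$ to $2cd$, and the model becomes sensitive to word order within the window.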

In [None]:
class CWINDOW(nn.Module):
    def __init__(self, *args, **kwargs):  # TODO
        super().__init__()
        pass

    def forward(self, *args, **kwargs):  # TODO
        pass

In [None]:
model_cwindow = CWINDOW(...)  # TODO: Model initialization
optimizer = ...  # TODO: Optimizer initialization

for epoch in ...:
    model_cwindow.train()
    pass  # TODO: Training code goes here

model_cwindow.eval()
pass  # TODO: Evaluation code goes here