## Introduction

This assignment aims to familiarize you with word embeddings and with fine-tuning them for a specific downstream task. It has two parts: **Part 1: fastText** and **Part 2: POS Tagging**. The due date is **16th of April, 11:59 pm**. Submit your work via Blackboard.


## Install PyTorch (If you are working locally)

1. Have the latest version of Anaconda installed on your machine.
2. Create a new conda environment starting from Python 3.7. In this setup example, we'll call it `torch_env`.
3. Run the command: `conda activate torch_env`
4. Run the command: `pip3 install torch==1.13.1`

## Or just work on this notebook in Google Colab
https://colab.research.google.com/

You don't need a GPU to train the models for this assignment. If you want to use one anyway, this video shows how to open a Colab notebook and request and allocate a GPU for your work: https://www.youtube.com/watch?v=TI9mTiTKoUc&ab_channel=SinaTofighi

# COMP542/442 - Assignment 1 - Part 1

The assignment consists of the following parts:
* **Part I**: Preparation
  * Installing the required packages
  * Downloading data
  * Preprocessing
* **Part II**: Model Training
  * Training word embeddings on the Turkish dataset
  * Training with the Continuous Bag of Words approach (Optional)
  * Training with the Skipgram approach
* **Part III**: Observations
  * Make observations using the `get_nearest_neighbors` and `get_analogies` methods
  * Compare CBOW with Skipgram (Optional)

##### Inline question 1: Describe the n-gram, BPE, and WordPiece/unigram tokenization methods in one or two sentences each. Compare their advantages and disadvantages.

<font color='red'>Your answer:</font>


## Installing fastText

You may follow the instructions in the documentation:
* https://fasttext.cc/docs/en/support.html
* https://fasttext.cc/docs/en/unsupervised-tutorial.html

In [None]:
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!pip install .
!python setup.py install
%cd ..

In [None]:
# Test if installation was successful
import fasttext

We use the following data for training the embeddings: trwiki-20230401-pages-articles-multistream.xml


**Download preprocessing script**

A raw Wikipedia dump contains a lot of HTML/XML markup. To preprocess it, you may use the script from: https://github.com/hghodrati/wikifil.git

In [None]:
!git clone https://github.com/hghodrati/wikifil.git

In [None]:
# preprocess xml and save to new file
!perl wikifil/wikifil.pl dataset/trwiki-20230401-pages-articles-multistream.xml > dataset/data_embed

In [2]:
# Visualize data
!head -c 80 dataset/data_embed

## Training with fastText

You may find the documentation for training word representations here:
* https://fasttext.cc/docs/en/unsupervised-tutorial.html
* https://fasttext.cc/docs/en/python-module.html#train_unsupervised-parameters

You may set the embedding dimension to 100, which is the fastText default.

In [None]:
import fasttext

EMBEDDING_DIM = 100

You will need to train the word embeddings using two approaches:
* Continuous Bag of Words (CBOW)
* Skipgram

After training, save the models to their respective paths. You may refer to the tutorial and documentation linked above.
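
As a rough sketch of what a training call might look like, assuming the preprocessed corpus sits at `dataset/data_embed` as above: `train_unsupervised` and its `model` and `dim` parameters come from the fastText Python module, while the helper name `train_embeddings` is hypothetical and only used for illustration. The import is deferred so the cell can be defined even before fastText is installed.

```python
EMBEDDING_DIM = 100  # fastText's default dimension, as noted above

def train_embeddings(corpus_path: str, approach: str, dim: int = EMBEDDING_DIM):
    """Hypothetical helper: train unsupervised fastText embeddings.

    approach must be "cbow" or "skipgram", the two model names that
    fasttext.train_unsupervised accepts.
    """
    if approach not in ("cbow", "skipgram"):
        raise ValueError(f"unknown approach: {approach!r}")
    import fasttext  # deferred import: only needed when training actually runs
    return fasttext.train_unsupervised(corpus_path, model=approach, dim=dim)

# Example usage (not executed here; training on a full Wikipedia dump takes a while):
# embed_model_skipgram = train_embeddings("dataset/data_embed", "skipgram")
# embed_model_skipgram.save_model("results/embed_skipgram.bin")
```

Keeping the approach as a parameter makes it easy to train both variants with identical settings, which matters if you later compare CBOW with Skipgram.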

In [None]:
%mkdir results
CBOW_EMBED = "results/embed_cbow.bin"
SKIPGRAM_EMBED = "results/embed_skipgram.bin"

In [None]:
# embed_model_cbow = None
# embed_model_cbow.save_model(CBOW_EMBED)
# embed_model_cbow = fasttext.load_model(CBOW_EMBED)

In [None]:
# embed_model_skipgram = None
# embed_model_skipgram.save_model(SKIPGRAM_EMBED)
# embed_model_skipgram = fasttext.load_model(SKIPGRAM_EMBED)

In [None]:
print(embed_model_cbow.words)

In [None]:
print(embed_model_cbow['kral']) 

In [None]:
word = ""

In [None]:
embed_model_cbow.get_nearest_neighbors(word)

In [None]:
print(embed_model_skipgram['kral']) 

In [None]:
embed_model_skipgram.get_nearest_neighbors(word)

#### Inline Question 2: Find an example of an analogy that holds, using the `get_analogies` function. Explain the analogy and also how the analogies are calculated.

<font color='red'>Your answer:</font>


In [4]:
# Placeholder: pass three words, e.g. get_analogies(word_a, word_b, word_c)
embed_model_skipgram.get_analogies()

# COMP542/442 - Assignment 1 - Part 2

In this assignment, we are implementing an RNN-based POS (Part-of-Speech) tagger using BiLSTM (bidirectional Long Short-Term Memory) networks. 

The assignment consists of the following parts:
* **Part I**: Preparation
  * Installing the required packages
  * Data loading and preprocessing
  * Creating the datasets and dataloaders
* **Part II**: Model Implementation and Training
  * Implementing the BiLSTMPOS Tagger model
  * Defining training and evaluation functions
  * Running the training loop and observing the loss and accuracy
  * Plotting the training metrics such as loss and accuracy
  * Saving the model
* **Part III**: Initializing BiLSTM with fastText Embeddings
  * Loading the fastText model
  * Initializing the BiLSTM model with fastText embeddings
  * Training the model
  * Evaluating the model

Throughout the assignment, you will work with a POS dataset to train and test the model to recognize different POS tags for the given sentences. You can also run this notebook on Google Colab, which allows you to allocate a GPU for faster training.

For more details about the POS tags, check the following link: https://universaldependencies.org/tr/pos/index.html

# Part I. Preparation

First, we load the Part-of-Speech (POS) dataset. Make sure you have downloaded the dataset using the provided script. Check the assignment handout for more details.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from typing import List, Tuple, Dict
import numpy as np
import random
from torch.nn.utils.rnn import pad_sequence
import time
from tqdm import tqdm
from matplotlib import pyplot as plt
from collections import Counter

In [None]:
#Set the seeds for reproducibility
SEED = 542

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

You have the option to **use a GPU by setting the flag below to True**. A GPU is not necessary for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fall back to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment. 

In [None]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print('using device:', device)

In [None]:
def parse_file(file_path):
    """
    Parses a file in the Universal Dependencies (UD) annotation style and returns a list of all the sentences in the file.
    Note: The data files you need in this part of the assignment are stored under the dataset/ directory. You can open the
    files to have a better understanding of the format. If you want to learn more about the POS tags, you can visit the
    Universal Dependencies website: https://universaldependencies.org/tr/pos/index.html

    The output should be a list of sentences, where each sentence is a list of (word, POS tag) tuples, one for each
    word in the sentence. For example, the following sentence:

    "The quick brown fox jumps over the lazy dog."
    should be represented as:
    [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT")]
    
    Args:
    file_path (str): The path to the file to be parsed.
    
    Returns:
    list: A list of sentences, where each sentence is a list of (word, POS tag) tuples, one for each word in the sentence.
    """
    # *****START OF YOUR CODE*****
    
    pass

    # *****END OF YOUR CODE*****
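
To make the expected structure concrete, here is a minimal sketch of grouping CoNLL-U lines into sentences. It relies on the standard CoNLL-U layout (ten tab-separated columns, with the word form in column 2 and the universal POS tag in column 4, `#` comment lines, and blank lines between sentences). The function name `parse_conllu_lines`, the toy sentence, and the choice to skip multiword-token ranges and empty nodes are assumptions for illustration, not requirements of the assignment.

```python
from typing import Iterable, List, Tuple

def parse_conllu_lines(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Sketch: group CoNLL-U token lines into sentences of (word, UPOS) pairs."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():            # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comments (sent_id, text)
            continue
        cols = line.split("\t")
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue                    # assumption: skip multiword ranges and empty nodes
        current.append((form, upos))
    if current:                         # the input may not end with a blank line
        sentences.append(current)
    return sentences

# Made-up toy sentence in CoNLL-U layout
sample = [
    "# sent_id = toy-1",
    "# text = Kedi uyudu.",
    "1\tKedi\tkedi\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tuyudu\tuyu\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
]
print(parse_conllu_lines(sample))
# [[('Kedi', 'NOUN'), ('uyudu', 'VERB'), ('.', 'PUNCT')]]
```

Working on an iterable of lines rather than a path keeps the sketch testable without files; `parse_file` would open `file_path` and apply the same grouping logic.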


In [6]:
def build_vocab(data: List[List[Tuple[str, str]]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """
    Builds a vocabulary of words and part-of-speech (POS) tags based on the input data. Don't forget to add special tokens (e.g. <PAD>, <UNK>, etc.)

    Args:
    data (List[List[Tuple[str, str]]]): A list of sentences, where each sentence is represented as a list of (word, POS tag) tuples.

    Returns:
    Tuple[Dict[str, int], Dict[str, int]]: A tuple containing two dictionaries. The first dictionary maps words to their index in the vocabulary, and the second dictionary maps POS tags to their index in the vocabulary.
    """  
    # *****START OF YOUR CODE*****
    
    pass

    # *****END OF YOUR CODE*****
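
One possible way to think about the vocabulary is sketched below on made-up toy data. The name `build_vocab_sketch` and the specific index assignments (`<PAD>` at 0, `<UNK>` at 1) are assumptions; the only constraint visible in this notebook is that both dictionaries contain a `<PAD>` entry, since `collate_batch` looks it up in each.

```python
from typing import Dict, List, Tuple

def build_vocab_sketch(
    data: List[List[Tuple[str, str]]]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Sketch: assign each new word and tag the next free index."""
    word_to_idx = {"<PAD>": 0, "<UNK>": 1}  # assumed special-token indices
    pos_to_idx = {"<PAD>": 0}
    for sentence in data:
        for word, tag in sentence:
            # setdefault leaves existing entries untouched, so repeated
            # words and tags keep the index they were first given
            word_to_idx.setdefault(word, len(word_to_idx))
            pos_to_idx.setdefault(tag, len(pos_to_idx))
    return word_to_idx, pos_to_idx

toy_data = [
    [("Kedi", "NOUN"), ("uyudu", "VERB")],
    [("Kedi", "NOUN"), ("geldi", "VERB")],
]
toy_words, toy_tags = build_vocab_sketch(toy_data)
print(toy_words)
print(toy_tags)
```

Indices are contiguous, so `len(word_to_idx)` can later serve directly as the number of rows in the embedding layer.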

In [None]:
# Parse the training and validation data files using the `parse_file` function
training_data = parse_file("./dataset/train.conllu")
validation_data = parse_file("./dataset/val.conllu")

# Build the vocabulary for the training data using the `build_vocab` function
# The `build_vocab` function returns two dictionaries:
#   - `word_to_idx`: maps words to their index in the vocabulary
#   - `pos_to_idx`: maps POS tags to their index in the vocabulary
word_to_idx, pos_to_idx = build_vocab(training_data)

In [None]:
#Helper functions to convert between indices and human-readable format. You don't need to do anything here.
#Just reading and making sure you understand what's going on is enough.

idx_to_word = {idx: word for word, idx in word_to_idx.items()}
idx_to_pos = {idx: pos for pos, idx in pos_to_idx.items()}

def convert_idx_to_words(indices: torch.Tensor) -> str:
    """Converts a sequence of word indices to a human-readable format.
    
    Args:
        indices (torch.Tensor): A sequence of word indices.
    
    Returns:
        str: A string representation of the sequence of words.
    """
    return " ".join([idx_to_word[idx.item()] for idx in indices])

def convert_idx_to_pos(indices: torch.Tensor) -> str:
    """Converts a sequence of POS tag indices to a human-readable format.
    
    Args:
        indices (torch.Tensor): A sequence of POS tag indices.
    
    Returns:
        str: A string representation of the sequence of POS tags.
    """
    return " ".join([idx_to_pos[idx.item()] for idx in indices])

In [None]:
# Helper function used for minibatching. You don't need to do anything here. Just reading and making sure you understand what's going on is enough.

def collate_batch(batch):
    """
    Collates a batch of sentences into padded tensors that can be processed by the model.

    Args:
    batch: A list of tuples, where each tuple contains a sentence and its corresponding POS tags.

    Returns:
    A tuple of two padded tensors: one containing the text data and the other containing the POS tags.
    """
    
    tag_list, text_list = [], []
    for (line, label) in batch:
        text_list.append(line)
        tag_list.append(label)
        
    return (
        pad_sequence(text_list, padding_value=word_to_idx['<PAD>']),
        pad_sequence(tag_list, padding_value=pos_to_idx['<PAD>'])
    )
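
To see what the padding step produces, here is a plain-Python stand-in for `pad_sequence` with its default `batch_first=False`. It is not the PyTorch implementation, only an illustration of the resulting layout: sentences of different lengths become columns of a `[max sent len, batch size]` grid, with short sentences filled up by the padding value. The helper name `pad_time_major` is hypothetical.

```python
from typing import List

def pad_time_major(sequences: List[List[int]], padding_value: int) -> List[List[int]]:
    """Stand-in for pad_sequence(batch_first=False) on plain lists."""
    max_len = max(len(seq) for seq in sequences)
    # Row t holds the t-th token of every sentence in the batch
    return [
        [seq[t] if t < len(seq) else padding_value for seq in sequences]
        for t in range(max_len)
    ]

batch = [[5, 8, 2], [7, 3]]          # two sentences as word indices
padded = pad_time_major(batch, padding_value=0)
print(padded)  # [[5, 7], [8, 3], [2, 0]]
```

This time-major layout is why the model's `forward` method documents its input as `[sent len, batch size]`.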

### Create the Dataset

In [None]:
class POSDataset(Dataset):
    """
    A class representing a Part-Of-Speech (POS) tagging dataset, which inherits from PyTorch's Dataset class.
    You need to implement four methods for this class:
    - __init__: Initializes the dataset object.
    - __len__: Returns the number of sentences in the dataset.
    - __getitem__: Returns the i-th sentence in the dataset.
    - vocab_lookup: Converts a sentence represented as a list of word/POS-tag pairs (tuples) to a pair of PyTorch tensors 
                    representing the corresponding sequences of word and POS tag indices. Out of vocabulary words are
                    represented by the index of the "<UNK>" token.
    """

    def __init__(self, data: List[List[Tuple[str, str]]], word_to_idx: Dict, pos_to_idx: Dict):
        """
        Initializes a new POSDataset object.
        Args:
        - data: A list of sentences, where each sentence is a list of word/POS-tag pairs (tuples).
        - word_to_idx: A dictionary mapping words to their corresponding indices.
        - pos_to_idx: A dictionary mapping POS tags to their corresponding indices.
        """
        # *****START OF YOUR CODE*****
        
        pass

        # *****END OF YOUR CODE*****

    def vocab_lookup(self, sentence: List[Tuple[str, str]]) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Converts a sentence represented as a list of word/POS-tag pairs (tuples) to a pair of PyTorch tensors
        representing the corresponding sequences of word and POS tag indices. Out of vocabulary words are
        represented by the index of the "<UNK>" token.

        Args:
        - sentence: A list of word/POS-tag pairs (tuples) representing a single sentence.

        Returns:
        A tuple containing two PyTorch tensors, the first representing the sequence of word indices in the sentence,
        and the second representing the sequence of POS tag indices in the sentence.
        """
        # *****START OF YOUR CODE*****
        
        pass

        # *****END OF YOUR CODE*****

    
    def __len__(self):
        """
        Returns the number of sentences in the dataset.
        """
        # *****START OF YOUR CODE*****
        
        pass

        # *****END OF YOUR CODE*****

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Returns a single sentence from the dataset as a pair of PyTorch tensors representing the corresponding
        sequences of word and POS tag indices.

        Args:
        - idx: The index of the sentence to retrieve.

        Returns:
        A tuple containing two PyTorch tensors, the first representing the sequence of word indices in the sentence,
        and the second representing the sequence of POS tag indices in the sentence.
        """
        # *****START OF YOUR CODE*****
        
        pass

        # *****END OF YOUR CODE*****


In [None]:
BATCH_SIZE = None 

In [None]:
# This part is preparing the training and validation datasets by creating POSDataset objects 
# using training_data and validation_data. The word_to_idx and pos_to_idx dictionaries created in build_vocab 
# are passed to POSDataset so that each sentence in the datasets can be converted to a tensor of word and POS tag indices.
# Then, DataLoader objects are created for both the training and validation datasets, with BATCH_SIZE samples per batch. 
# shuffle=True reshuffles the samples at every epoch, so batches differ between epochs and the model does not depend on the order of the data. 
# collate_batch is used as the function to merge samples into batches, as it pads sequences to the same length and 
# returns two tensors, one for the word indices and one for the POS tag indices.
# This code block is essential to prepare the data for training the model. 
# The training and validation dataloaders can be used in the training loop to iterate over the dataset in batches.

training_dataset = None 
validation_dataset = None 

training_dataloader = None
validation_dataloader = None

In [None]:
# It is always useful to see dataset statistics to get a better understanding of the data.
print(f"Unique tokens in word vocabulary: {len(word_to_idx)}")
print(f"Unique tokens in tag vocabulary: {len(pos_to_idx)}")
print()
print(f"Number of training examples: {len(training_dataset)}")
print(f"Number of validation examples: {len(validation_dataset)}")

In [None]:
# Check one sample from the training dataset to see if the data is correctly loaded.
print("Sample from the dataset:", training_dataset[4])
print()
print("Human-readable version:", convert_idx_to_words(training_dataset[4][0]), convert_idx_to_pos(training_dataset[4][1]))

# Part II: Model Implementation and Training

In this part of the assignment, the focus is on Model Implementation and Training. This section involves the following steps:

* **Implementing the BiLSTM POS tagger model**: You will create a class called `BiLSTMPOSTagger` that inherits from `nn.Module`. It implements a BiLSTM (bidirectional Long Short-Term Memory) network for POS tagging and consists of an Embedding layer, an LSTM layer, a Dropout layer, and a Linear layer that makes the predictions.

* **Defining training and evaluation functions**: After implementing the model, you will define two essential functions. The `train_for_single_epoch` function trains the model for one epoch on the training dataset. The `evaluate` function measures the model's performance on a given dataset. Both receive the model and a dataset iterator; the training function also receives the optimizer and the criterion (loss function).

* **Running the training loop and observing the loss and accuracy**: You will first initialize the model, optimizer, and criterion. Then you will run the training loop for a certain number of epochs (e.g., 10). In each epoch, you will train the model using `train_for_single_epoch` and evaluate it on the validation dataset using `evaluate`. Finally, you will print and store the training loss and validation accuracy for each epoch.

By the end of Part II, you will have a model that has been trained on the POS tagging dataset, and you can observe how the training process affects the loss and accuracy metrics.
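
The layer stack described in this part can be traced shape by shape without building the model. The sketch below only computes tuples; the function name `trace_shapes` and the example sizes are made up. The one fact it encodes from the `nn.LSTM` documentation is that a bidirectional LSTM concatenates the forward and backward hidden states, so the Linear layer receives `hidden_dim * 2` features.

```python
def trace_shapes(sent_len, batch_size, embedding_dim, hidden_dim, output_dim,
                 bidirectional=True):
    """Sketch: tensor shapes after each layer (time-major, batch_first=False)."""
    num_directions = 2 if bidirectional else 1
    return {
        "input": (sent_len, batch_size),
        "embedding": (sent_len, batch_size, embedding_dim),
        # forward and backward states are concatenated on the last axis
        "lstm_output": (sent_len, batch_size, hidden_dim * num_directions),
        "linear": (sent_len, batch_size, output_dim),
    }

# Hypothetical sizes, chosen only for illustration
for name, shape in trace_shapes(12, 4, 100, 128, 18).items():
    print(f"{name:12s} {shape}")
```

Dropout does not appear in the trace because it leaves shapes unchanged.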

In [None]:
class BiLSTMPOSTagger(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        """
        BiLSTM model for POS tagging.
        Check this link for more details: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html

        Args:
            input_dim (int): Number of unique words in the vocabulary.
            embedding_dim (int): Dimension of the word embeddings.
            hidden_dim (int): Dimension of the LSTM hidden states.
            output_dim (int): Number of unique POS tags.
            n_layers (int): Number of LSTM layers.
            bidirectional (bool): Whether to use a bidirectional LSTM.
            dropout (float): Probability of dropout, if any.
            pad_idx (int): Index of the <PAD> token in the vocabulary.
        """
        super().__init__()

        # *****START OF YOUR CODE*****
        
        pass

        # *****END OF YOUR CODE*****
        
    def forward(self, text):
        """
        Perform forward pass through the model.

        Args:
            text (Tensor): Input text of shape [sent len, batch size].

        Returns:
            Tensor: Predictions of shape [sent len, batch size, output dim].
        """
        # *****START OF YOUR CODE*****
        
        pass

        # *****END OF YOUR CODE*****

##### Inline question 3: How do you compare the advantages and disadvantages of using a bidirectional LSTM versus a unidirectional LSTM? #####
<font color='red'>Your answer:</font>

In [None]:
# initialize the model
model = None

In [None]:
# See the output of the model for the first sample from the training dataset.
# It is wrapped in torch.no_grad() because we are not training the model.
with torch.no_grad():
    # Add a batch dimension so the input has the documented shape [sent len, batch size]
    inputs = training_dataset[0][0].unsqueeze(1)
    tag_scores = model(inputs)
    print(tag_scores)

In [None]:
def train_for_single_epoch(model, iterator, optimizer, criterion, device):
    """
    Trains the model for one epoch on the given iterator with the specified optimizer and criterion.

    Args:
        model: The neural network model to train.
        iterator: The iterator over the training dataset.
        optimizer: The optimizer to use for gradient descent.
        criterion: The loss function to use.
        device: The device (CPU or GPU) on which to run the computations.

    Returns:
        The average loss and accuracy for the epoch.
    """
    # *****START OF YOUR CODE*****
    
    pass

    # *****END OF YOUR CODE*****

In [None]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns the categorical accuracy between predictions and the ground truth, ignoring pad tokens.
    """
    # *****START OF YOUR CODE*****
    
    pass

    # *****END OF YOUR CODE*****
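
The idea of ignoring pad tokens can be illustrated on plain lists of tag indices. This is a simplified sketch, not the tensor implementation asked for above: in the notebook, `preds` holds per-tag scores, so the highest-scoring tag would first have to be selected for each position. The name `masked_accuracy` and the toy values are made up.

```python
def masked_accuracy(predicted_tags, gold_tags, pad_idx):
    """Sketch: fraction of correct predictions over non-padding positions."""
    # Keep only positions whose gold tag is a real tag, not padding
    pairs = [(p, g) for p, g in zip(predicted_tags, gold_tags) if g != pad_idx]
    if not pairs:
        return 0.0
    return sum(p == g for p, g in pairs) / len(pairs)

gold = [3, 1, 2, 0, 0]   # the last two positions are padding (index 0)
pred = [3, 2, 2, 1, 0]
print(masked_accuracy(pred, gold, pad_idx=0))  # 2 of the 3 real tokens are correct
```

Without the mask, long runs of padding in short sentences would distort the score, since predictions on padded positions say nothing about tagging quality.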

In [None]:
def evaluate(model, iterator, tag_pad_idx):
    """
    Evaluate the performance of a BiLSTMPOSTagger model on a given dataset iterator. Use the categorical_accuracy function
    you implemented above to calculate the accuracy on a batch of predictions.

    Args:
    - model: a BiLSTMPOSTagger object.
    - iterator: a DataLoader object containing (text, tags) tuples.
    - tag_pad_idx: an integer representing the index of the padding token in the tag vocabulary.

    Returns:
    - A float representing the categorical accuracy of the model on the given dataset iterator.

    """
    # *****START OF YOUR CODE*****
    
    pass

    # *****END OF YOUR CODE*****

In [None]:
# Check the model's accuracy without training
accuracy = evaluate(model, validation_dataloader, tag_pad_idx=pos_to_idx['<PAD>'])
print(f'Accuracy before training: {accuracy*100:.2f}%')

In [None]:
# Calculate the accuracy of random predictions
epoch_correct = epoch_n_label = random_accuracy = most_frequent_accuracy = 0

# *****START OF YOUR CODE*****

pass

# *****END OF YOUR CODE*****

print(f'Accuracy of random predictions: {random_accuracy*100:.2f}%')

# Calculate the accuracy of predicting the most frequent class

# Get the most frequent class

# *****START OF YOUR CODE*****

pass

# *****END OF YOUR CODE*****
print(f'Accuracy of predicting the most frequent class: {most_frequent_accuracy*100:.2f}%')
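
Both baselines can be sketched on flat lists of tags. The function names and toy tags below are made up, and taking the most frequent tag from the training tags while scoring on separate evaluation tags is one reasonable setup, not necessarily the one intended here. With uniform random guessing over K tags, the expected accuracy is about 1/K.

```python
import random
from collections import Counter

def most_frequent_baseline(train_tags, eval_tags):
    """Sketch: always predict the tag seen most often in training."""
    most_common_tag, _ = Counter(train_tags).most_common(1)[0]
    return sum(tag == most_common_tag for tag in eval_tags) / len(eval_tags)

def random_baseline(tagset, eval_tags, seed=542):
    """Sketch: predict a uniformly random tag for every token."""
    rng = random.Random(seed)  # local generator, leaves the global seed alone
    guesses = [rng.choice(tagset) for _ in eval_tags]
    return sum(g == t for g, t in zip(guesses, eval_tags)) / len(eval_tags)

train_tags = ["NOUN", "NOUN", "VERB", "PUNCT", "NOUN"]
eval_tags = ["NOUN", "VERB", "NOUN", "PUNCT"]
print(most_frequent_baseline(train_tags, eval_tags))  # 0.5
print(random_baseline(sorted(set(train_tags)), eval_tags))
```

These two numbers give a floor for judging the trained tagger: a model that does not clearly beat the most-frequent-class baseline has learned very little.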

In [None]:
#Define hyperparameters

NUM_OF_EPOCHS = None
LEARNING_RATE = None

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

criterion = nn.CrossEntropyLoss(ignore_index=pos_to_idx['<PAD>']) #Modify this part if you are using a different padding token.
criterion = criterion.to(device)
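
The effect of `ignore_index` can be illustrated in plain Python: positions whose target equals the ignored index contribute nothing, and the mean is taken over the remaining positions only. This is a simplified stand-in for `nn.CrossEntropyLoss` with its default mean reduction, not the PyTorch implementation; the function name and the toy scores are made up.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Sketch: mean cross-entropy over positions whose target is not ignored."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue                      # padding positions are skipped entirely
        peak = max(scores)                # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of the gold tag
        count += 1
    # This sketch returns nan when every position is ignored
    return total / count if count else float("nan")

logits = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3], [1.0, 3.0, 0.2]]
targets = [1, 0, 2]                       # index 0 plays the role of <PAD>
print(masked_cross_entropy(logits, targets, ignore_index=0))
```

Because padded positions are excluded from both the sum and the count, the loss value does not depend on how much padding a batch happens to contain.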

In [None]:
loss_history = []
accuracy_history = []

for x in range(NUM_OF_EPOCHS):
    # Call the train_for_single_epoch function and store the result in the training_loss variable.
    # Call the evaluate function and store the result in the validation_accuracy variable.
    # Print out the current epoch number, training loss, and validation accuracy using the print function and formatted string syntax. 
    # Append the training_loss and validation_accuracy values to their respective history lists (loss_history and accuracy_history).
    # *****START OF YOUR CODE*****
    
    pass

    # *****END OF YOUR CODE*****

In [None]:
#To plot the training metrics, use the matplotlib library.

# *****START OF YOUR CODE*****

pass

# *****END OF YOUR CODE*****

In [None]:
# Model's accuracy after training
accuracy = evaluate(model, validation_dataloader, tag_pad_idx=pos_to_idx['<PAD>'])
print(f'Accuracy after training: {accuracy*100:.2f}%')

In [None]:
# Save the model
torch.save(model.state_dict(), 'pos_tagger_model.pt')

##### Inline question 4: What modifications do you need to make to convert this model to a character-level BiLSTM POS tagger?
<font color='red'>Your answer:</font>

# Part III: Initializing BiLSTM with fastText Embeddings

In Part III of this assignment, you will combine fastText embeddings with the sequence modeling capability of the BiLSTM model. You will load the fastText model for Turkish that you trained in Part 1 on a large corpus of Turkish text. Because fastText builds word vectors from subword information, the model can generate embeddings for any Turkish word, including words that are not in the training data for our specific task. This is an effective way to handle the out-of-vocabulary (OOV) problem that can occur in natural language processing tasks.

After initializing the model, you will train it on our dataset and evaluate its performance on a held-out development set (the same validation set as above). The hope is that the pretrained embeddings improve the accuracy and robustness of the POS tagger.

* Optional: Evaluate with both the skipgram and CBOW embeddings.
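
fastText can embed unseen words because it represents a word through its character n-grams, with `<` and `>` marking the word boundaries, and composes the word vector from the n-gram vectors. The sketch below only enumerates those n-grams; it is a simplification that leaves out fastText's hashing of n-grams into buckets. The function name `char_ngrams` is hypothetical, and the defaults mirror fastText's default `minn=3` and `maxn=6`.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Sketch: character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("kral", min_n=3, max_n=4))
# ['<kr', 'kra', 'ral', 'al>', '<kra', 'kral', 'ral>']
```

An inflected form such as "krallar" shares several of these n-grams with "kral", which is how a vector for an unseen form can still land near related words.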

In [None]:
# Initialize the model
model = None
# Load the FastText pre-trained embeddings and set them as the model's embedding layer
# pretrained_embeddings = None
# model.embedding.weight.data.copy_(pretrained_embeddings)
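
One way to assemble the weights is to build one row per vocabulary index, sketched here with plain lists. The helper name `build_embedding_rows` and the stand-in lookup function are hypothetical. In the notebook, the lookup could be the fastText model's `get_word_vector` method, and the rows would be converted with `torch.tensor(rows, dtype=torch.float32)` before being copied into the embedding layer. Looking up special tokens such as `<UNK>` as literal strings is a simplification made in this sketch.

```python
def build_embedding_rows(word_to_idx, get_vector, dim):
    """Sketch: one embedding row per vocabulary index, zeros for <PAD>."""
    rows = [[0.0] * dim for _ in range(len(word_to_idx))]
    for word, idx in word_to_idx.items():
        if word == "<PAD>":
            continue                  # keep the padding row at zero
        vector = list(get_vector(word))
        if len(vector) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(vector)}")
        rows[idx] = vector
    return rows

# Stand-in lookup for illustration; a real run would use the trained fastText model
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "kral": 2}
toy_lookup = lambda word: [float(len(word))] * 3
toy_rows = build_embedding_rows(toy_vocab, toy_lookup, dim=3)
print(toy_rows)
```

The dimension check matters because the model's `embedding_dim` has to match the dimension of the fastText vectors (100 in Part 1) for the copy to succeed.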

In [None]:
#Define hyperparameters

NUM_OF_EPOCHS = 10
LEARNING_RATE = 0.01

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

criterion = nn.CrossEntropyLoss(ignore_index=pos_to_idx['<PAD>'])
criterion = criterion.to(device)

In [None]:
# Plot the loss and accuracy curves
# *****START OF YOUR CODE*****

pass

# *****END OF YOUR CODE*****