Based on [BobaZooba's homework](https://github.com/BobaZooba/DeepNLP/blob/2020/Week%203/Homework%202.ipynb)

In [None]:
import math
import numpy as np

from tqdm import tqdm

import torch

import zipfile

import seaborn as sns

from data import Downloader, Parser

### Loading the file with embeddings for English
We will need them a little later.

For other languages: https://fasttext.cc/docs/en/crawl-vectors.html

In [None]:
# download the archive with the English vectors (about 650 MB)
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

--2020-10-05 06:56:55--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: 'wiki-news-300d-1M.vec.zip'


2020-10-05 06:58:08 (9.13 MB/s) - 'wiki-news-300d-1M.vec.zip' saved [681808098/681808098]



In [None]:
# path to data
data_path = './data/'

### Data reader
The details are not important here: the reader downloads the data, parses it, and builds three datasets from it:
- training
- validation
- unlabeled

Unlabeled data is not essential, but you may need it, for example, for a language model or to improve embeddings.

In [None]:
downloader = Downloader(data_path=data_path)

In [None]:
downloader.run()

single: 100%|██████████| 21/21 [02:18<00:00,  6.60s/it]
multiple: 100%|██████████| 17/17 [03:46<00:00, 13.32s/it]


In [None]:
parser = Parser(data_path=data_path)

In [None]:
unlabeled, train, valid = parser.run()

### Let's look at the datasets

In [None]:
unlabeled

In [None]:
train

In [None]:
valid

## Task
Classify the question field into one of the categories in the category field.

This is data from the Amazon QA service, that is, a service where you can ask a question and get an answer from other users.

The idea of the task is the following: let's help clients determine which category to post their question to, so that they quickly get the most relevant answer.

### Converting a class into an index
We will write a mapper that converts each class label into a unique integer index. We need this because our model does not work directly with the class label, but with its index.

In [None]:
# checking that the train and the validation datasets contain the same categories
set(train.category.unique().tolist()) == set(valid.category.unique().tolist())

In [None]:
# sorted so that the category -> index mapping is identical on every run
unique_categories = sorted(set(train.category.unique().tolist() + valid.category.unique().tolist()))

In [None]:
category2index = {category: index for index, category in enumerate(unique_categories)}

In [None]:
category2index

In [None]:
train['target'] = train.category.map(category2index)
valid['target'] = valid.category.map(category2index)

In [None]:
train

### Torch Dataset, DataLoader

This is a very important abstraction for Torch.

We will always use it to work with data.

`Dataset` is a class that you need to inherit from to write your own data handler. Inside it, you need to implement two methods,
which will be discussed below. That is, in this class you describe how to convert your data into a Torch format (converting texts
into word indexes, etc.).

`DataLoader` is a class that assembles your data into batches for you. It is iterable, so you typically work with it as follows:
```python
for batch in data_loader:
    ...
```
That is, at each iteration, one batch of data is given. Iteration ends when you go through all the batches.

Why do we need these abstractions? To simplify and unify our work with data.
You could implement all of this yourself, but these classes already do the work for you.

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
# toy dataset
# 121535 examples, 4 features, 3 classes
some_data_x = np.random.rand(121535, 4)
some_data_y = np.random.randint(3, size=(121535,))

In [None]:
# just random numbers
some_data_x[:10]

In [None]:
# and classes
some_data_y

### Example of usefulness
To train a model, you need to feed it batches of data. How could we implement this if we didn't have `Dataset` and `DataLoader`?

In [None]:
batch_size = 16

for i_batch in range(math.ceil(some_data_x.shape[0] / batch_size)):

    x_batch = some_data_x[i_batch * batch_size:(i_batch + 1) * batch_size]
    y_batch = some_data_y[i_batch * batch_size:(i_batch + 1) * batch_size]

    x_batch = torch.tensor(x_batch)
    y_batch = torch.tensor(y_batch)

    break

In [None]:
x_batch

In [None]:
x_batch.shape, y_batch.shape

This is a fairly simple example. We were able to do it ourselves, but almost always, processing the data to feed it into a model is more complicated.
And some things are often needed more than once, for example, if we want to shuffle our data every epoch to get different batches.
We can do this, but to do so, we will have to drag some code with us from project to project. Also, co-development or simply reading someone else's code is much easier when you use unified formats.
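
For instance, reshuffling the data by hand every epoch might look like the sketch below. The array sizes and the names `n_epochs`, `permutation` and `batch_sizes` are illustrative assumptions, not part of the notebook.

```python
import math
import numpy as np

# small illustrative arrays (hypothetical sizes, not the notebook's data)
data_x = np.random.rand(100, 4)
data_y = np.random.randint(3, size=(100,))

batch_size = 16
n_epochs = 2
n_batches = math.ceil(len(data_x) / batch_size)

batch_sizes = []

for epoch in range(n_epochs):
    # draw a new random order of example indices for this epoch
    permutation = np.random.permutation(len(data_x))

    for i_batch in range(n_batches):
        batch_indices = permutation[i_batch * batch_size:(i_batch + 1) * batch_size]

        # indexing both arrays with the same indices keeps every x aligned with its y
        x_batch = data_x[batch_indices]
        y_batch = data_y[batch_indices]

        batch_sizes.append(len(x_batch))
```

This is exactly the kind of bookkeeping that `DataLoader(..., shuffle=True)` takes care of.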

### Moving on to Dataset
Let's wrap our data in this handler.

In [None]:
class ToyDataset(Dataset):

    def __init__(self, data_x, data_y):

        super().__init__()

        self.data_x = data_x
        self.data_y = data_y

    def __len__(self):

        # this method must be defined
        # it should return the size of the dataset
        # it is needed for DataLoader to sample batches

        return len(self.data_x)

    def __getitem__(self, idx):

        # this method needs to be defined as well
        # that is, how we will get our data by index

        return self.data_x[idx], self.data_y[idx]

In [None]:
some_dataset = ToyDataset(some_data_x, some_data_y)

In [None]:
some_dataset[5], some_dataset[467]

### This may look pointless, but it is only the simplest example.

### DataLoader
We can set parameters on it, for example the batch size and whether to shuffle the data on every pass so that the batches are composed differently each time.

In [None]:
some_loader = DataLoader(some_dataset, batch_size=16, shuffle=True)

In [None]:
for x, y in some_loader:
    break

x

In [None]:
x.shape

In [None]:
for x, y in some_loader:
    pass

len(x)

In [None]:
# why 15?
# because the amount of our data is not divisible by 16
# and therefore the last batch is less than 16
len(some_dataset) % 16

### Let's complicate the handler

In [None]:
class ToyDataset(Dataset):

    def __init__(self, data_x, data_y):

        super().__init__()

        self.data_x = data_x
        self.data_y = data_y

    def __len__(self):

        # this method must be defined
        # it should return the size of the dataset
        # it is needed for DataLoader to sample batches

        return len(self.data_x)

    @staticmethod
    def pow_features(x, n=2):

        return x ** n

    @staticmethod
    def log_features(x):

        return np.log(x)

    def __getitem__(self, idx):

        # this method needs to be defined as well
        # that is, how we will get our data by index

        x = self.data_x[idx]

        # inside the dataset we can do whatever we want with our data
        # for example, to define functions that add power features
        x_p_2 = self.pow_features(x, n=2)
        x_p_3 = self.pow_features(x, n=3)
        # and let's also add logarithmic features
        x_log = self.log_features(x)

        # let's concatenate our features
        x = np.concatenate([x, x_p_2, x_p_3, x_log])

        y = self.data_y[idx]

        return x, y

In [None]:
toy_dataset = ToyDataset(some_data_x, some_data_y)

In [None]:
toy_loader = DataLoader(dataset=toy_dataset, batch_size=128)

In [None]:
for x, y in toy_loader:
    break

In [None]:
x.shape

In [None]:
# note that we get torch tensors right away: DataLoader converts numpy arrays automatically
x

In [None]:
y

In [None]:
# let's create a small model and calculate the loss

model = torch.nn.Sequential(torch.nn.Linear(16, 8),
                            torch.nn.ReLU(),
                            torch.nn.Linear(8, 4),
                            torch.nn.ReLU(),
                            torch.nn.Linear(4, 3))

criterion = torch.nn.CrossEntropyLoss()

with torch.no_grad():

    prediction = model(x.float())

    loss = criterion(prediction, y)

loss.item()

### Let's create a dataset for our text data
Each item is a string and its target, retrieved by index.

In [None]:
class TextClassificationDataset(Dataset):

    def __init__(self, texts, targets):
        super().__init__()

        self.texts = texts
        self.targets = targets

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):

        text = self.texts[index]
        target = self.targets[index]

        return text, target

In [None]:
# preparing the data
train_x = list(train.question)
train_y = list(train.target)

valid_x = list(valid.question)
valid_y = list(valid.target)

In [None]:
train_dataset = TextClassificationDataset(texts=list(train.question), targets=list(train.target))

In [None]:
# sampling the data
text, target = train_dataset[0]

In [None]:
text

In [None]:
target

### The point of the handler
We need to transform our data into a format that we can then pass to the model.
Right now we have strings, and Torch knows nothing about strings: it needs tensors.

### Loading Embeddings
To work with text data, we can split our texts into words and convert the words into vectors. Where do we get these vectors?
We talked about a method called word2vec; the file we downloaded at the beginning of this notebook contains pretrained vectors of this kind, trained with fastText.


In [None]:
import zipfile
import numpy as np

from tqdm import tqdm

In [None]:
def load_embeddings(zip_path, filename, pad_token='PAD', max_words=100_000, verbose=True):

    vocab = dict()
    embeddings = list()

    with zipfile.ZipFile(zip_path) as zipped_file:
        with zipped_file.open(filename) as file_object:

            vocab_size, embedding_dim = file_object.readline().decode('utf-8').strip().split()

            vocab_size = int(vocab_size)
            embedding_dim = int(embedding_dim)

            # there are 1,000,000 words with vectors in the file, let's limit this dictionary for simplicity
            max_words = vocab_size if max_words <= 0 else max_words

            # let's add the pad token and embedding to our embedding matrix and dictionary
            vocab[pad_token] = len(vocab)
            embeddings.append(np.zeros(embedding_dim))

            progress_bar = tqdm(total=max_words, disable=not verbose)

            for line in file_object:
                parts = line.decode('utf-8').strip().split()

                token = ' '.join(parts[:-embedding_dim]).lower()

                if token in vocab:
                    continue

                word_vector = np.array(list(map(float, parts[-embedding_dim:])))

                vocab[token] = len(vocab)
                embeddings.append(word_vector)

                progress_bar.update()

                if len(vocab) == max_words:
                    break

            progress_bar.close()

    embeddings = np.stack(embeddings)

    return vocab, embeddings

In [None]:
vocab, embeddings = load_embeddings('./wiki-news-300d-1M.vec.zip', 'wiki-news-300d-1M.vec', max_words=100_000)

### Let's look at the word's closest neighbors by embeddings

In [None]:
index2token = {index: token for token, index in vocab.items()}

In [None]:
emb_norms = np.linalg.norm(embeddings, axis=1)

In [None]:
def get_k_nearest_neighbors(word, embeddings, emb_norms, vocab, index2token, k=5):

    if word not in vocab:
        print('Not in vocab')
        return

    word_index = vocab[word]

    word_vector = embeddings[word_index]
    word_vector = np.expand_dims(word_vector, 0)

    scores = (word_vector @ embeddings.T)[0]

    # convert to cosines, dividing by vector norms
    # epsilon 1e-6 so as not to divide by 0
    scores = scores / (emb_norms + 1e-6) / emb_norms[word_index]

    # 1:k+1 because 0-indexed element is the word itself
    for idx in scores.argsort()[::-1][1:k+1]:
        print(f'The word {index2token[idx]} is similar by {scores[idx]:.2f} to the word {word}')

In [None]:
get_k_nearest_neighbors('anna', embeddings, emb_norms, vocab, index2token)

In [None]:
get_k_nearest_neighbors('mom', embeddings, emb_norms, vocab, index2token)

In [None]:
get_k_nearest_neighbors('have', embeddings, emb_norms, vocab, index2token)

In [None]:
get_k_nearest_neighbors('money', embeddings, emb_norms, vocab, index2token)

In [None]:
get_k_nearest_neighbors('music', embeddings, emb_norms, vocab, index2token)

### Choosing a tokenization method
We now have a mapping from each word to its embedding.
Tokenization is the process of splitting a text into tokens, that is, into parts of the text.
A "token" is a more general concept than a "word": for example, a number or a punctuation mark is also a token.
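
To make the difference concrete, the sketch below compares whitespace splitting with a regular expression that separates punctuation from alphanumeric runs. The pattern is an assumption meant to mirror the behaviour of NLTK's `wordpunct_tokenize`, and the sample sentence is made up.

```python
import re

sample = "Does it cost $3.50, or less?"

# whitespace tokenization keeps punctuation glued to the words
space_tokens = sample.split()

# alphanumeric runs and punctuation runs become separate tokens
punct_tokens = re.findall(r"\w+|[^\w\s]+", sample)

print(space_tokens)
print(punct_tokens)
```

A token such as `less?` is unlikely to be found in an embedding vocabulary, while `less` and `?` taken separately usually are, which is why the choice of tokenizer affects the share of unknown tokens.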

In [None]:
# More details about the differences can be found, for example, here
# https://stackoverflow.com/questions/50240029/nltk-wordpunct-tokenize-vs-word-tokenize
from nltk.tokenize import word_tokenize, wordpunct_tokenize

In [None]:
total_n_words = 0
unknown_words = list()

for sample in tqdm(train_x):
    # tokenization by space
    tokens = sample.split()

    for tok in tokens:
        # checking if the token is in our dictionary
        if tok not in vocab:
            unknown_words.append(tok)

        total_n_words += 1

print(f'We do not know {len(unknown_words)} words out of {total_n_words} words in the dataset')
print(f'Which is {len(unknown_words) * 100 / total_n_words:.2f}% of the dataset')
print()
print(f'Unique unknown words: {len(set(unknown_words))}')

In [None]:
total_n_words = 0
unknown_words = list()

for sample in tqdm(train_x):
    # tokenization
    tokens = wordpunct_tokenize(sample)

    for tok in tokens:
        # checking if the token is in our dictionary
        if tok not in vocab:
            unknown_words.append(tok)

        total_n_words += 1

print(f'We do not know {len(unknown_words)} words out of {total_n_words} words in the dataset')
print(f'Which is {len(unknown_words) * 100 / total_n_words:.2f}% of the dataset')
print()
print(f'Unique unknown words: {len(set(unknown_words))}')

In [None]:
total_n_words = 0
unknown_words = list()

for sample in tqdm(train_x):
    # tokenization
    tokens = word_tokenize(sample)

    for tok in tokens:
        # checking if the token is in our dictionary
        if tok not in vocab:
            unknown_words.append(tok)

        total_n_words += 1

print(f'We do not know {len(unknown_words)} words out of {total_n_words} words in the dataset')
print(f'Which is {len(unknown_words) * 100 / total_n_words:.2f}% of the dataset')
print()
print(f'Unique unknown words: {len(set(unknown_words))}')

### Results
- The speed of word_tokenize is much lower than that of wordpunct_tokenize
- Using word_tokenize, we lose about 1% of the information from the dataset compared to wordpunct_tokenize

### wordpunct_tokenize is the clear choice

In [None]:
class TextClassificationDataset(Dataset):

    def __init__(self, texts, targets, vocab):
        super().__init__()

        self.texts = texts
        self.targets = targets
        self.vocab = vocab

    def __len__(self):
        return len(self.texts)

    def tokenization(self, text):

        # the vocabulary keys are lowercased, so lowercase the text before the lookup
        tokens = wordpunct_tokenize(text.lower())

        token_indices = [self.vocab[tok] for tok in tokens if tok in self.vocab]

        return token_indices

    def __getitem__(self, index):

        text = self.texts[index]
        target = self.targets[index]

        tokenized_text = self.tokenization(text)

        # convert the token indices into a Torch tensor
        # the target will be converted automatically by the DataLoader
        tokenized_text = torch.tensor(tokenized_text)

        return tokenized_text, target

In [None]:
train_dataset = TextClassificationDataset(texts=train_x, targets=train_y, vocab=vocab)

In [None]:
x, y = train_dataset[5]

In [None]:
x

In [None]:
y

In [None]:
# we can restore the text back by word indexes
[index2token[idx.item()] for idx in x]

### We still have the problem of different text lengths
To put a batch of texts into a single tensor, all texts must have the same length

In [None]:
## this won't work because the rows have different lengths; you can uncomment and check

# x = [
#     [1, 2, 3],
#     [1, 2, 3, 4, 5],
#     [1, 2, 3, 4, 5, 6, 7]
# ]

# torch.tensor(x), torch.tensor(x).shape

In [None]:
# this will work

x = [
    [1, 2, 3, 0, 0, 0, 0],
    [1, 2, 3, 4, 5, 0, 0],
    [1, 2, 3, 4, 5, 6, 7]
]

torch.tensor(x), torch.tensor(x).shape

### Text length
We need to understand to what length we should pad each of our examples.
We can find the maximum length of an example in tokens in our data and pad to this length, but this approach has a downside:
we may have several texts with an abnormally large length, that is, some outliers.

In this case, it is easier to limit the length of the texts to some statistic of our dataset. For example, if 95% of our texts
are at most 25 words long, that length is enough for us. We truncate the texts to this length: almost the entire dataset fits within it, and we do not need to pad to a very large length.

We need padding so that we can place examples of different lengths in one batch, but we do not want the model to take the padding tokens into account. Computation spent on them is wasted, so by choosing a length n that covers most of the dataset we keep the amount of padding small and speed up training.

<br>


> Why don't we just throw away these long texts?

We are looking for a compromise between the maximum length and the loss of information. If we take the 95th percentile of our lengths (that is, 95% of our texts are no longer than n) and throw away the remaining 5%, we lose a significant number of examples.
On the other hand, truncating the texts may seem wrong, and it can indeed distort the meaning of an example, but in practice this is usually accepted.

In [None]:
train_lengths = [len(wordpunct_tokenize(sample)) for sample in tqdm(train_x)]

In [None]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(train_lengths, kde=True)

In [None]:
# we see large outliers in the data
# 95% of our texts are no longer than this many tokens
np.percentile(train_lengths, 95)

In [None]:
class TextClassificationDataset(Dataset):

    def __init__(self, texts, targets, vocab, pad_index=0, max_length=32):
        super().__init__()

        self.texts = texts
        self.targets = targets
        self.vocab = vocab

        self.pad_index = pad_index
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def tokenization(self, text):

        # the vocabulary keys are lowercased, so lowercase the text before the lookup
        tokens = wordpunct_tokenize(text.lower())

        token_indices = [self.vocab[tok] for tok in tokens if tok in self.vocab]

        return token_indices

    def padding(self, tokenized_text):

        tokenized_text = tokenized_text[:self.max_length]

        tokenized_text += [self.pad_index] * (self.max_length - len(tokenized_text))

        return tokenized_text

    def __getitem__(self, index):

        text = self.texts[index]
        target = self.targets[index]

        tokenized_text = self.tokenization(text)
        tokenized_text = self.padding(tokenized_text)

        tokenized_text = torch.tensor(tokenized_text)

        return tokenized_text, target

In [None]:
train_dataset = TextClassificationDataset(texts=train_x, targets=train_y, vocab=vocab)

In [None]:
x, y = train_dataset[0]
x

In [None]:
[index2token[idx.item()] for idx in x]

In [None]:
train_dataset = TextClassificationDataset(texts=train_x, targets=train_y, vocab=vocab)
valid_dataset = TextClassificationDataset(texts=valid_x, targets=valid_y, vocab=vocab)

# shuffle the training data every epoch; the validation order does not matter
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=128)

In [None]:
for x, y in train_loader:
    break

In [None]:
x.shape, y.shape

### How can we define layers?

In [None]:
from torch import nn

In [None]:
embedding_layer = nn.Embedding(num_embeddings=len(vocab),
                               embedding_dim=embeddings.shape[-1],
                               padding_idx=0)

In [None]:
x_embed = embedding_layer(x)

In [None]:
x_embed

In [None]:
x_embed.shape

### But we have already loaded our embedding matrix
So the layer can be initialized with pretrained weights.
With this initialization the weights are frozen by default: ```.from_pretrained(embeddings, padding_idx=0)``` has a ```freeze``` flag that controls whether the weights are frozen. Frozen weights are not updated during training.
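
A minimal sketch of the ```freeze``` flag, using a small random matrix as a stand-in for the real embeddings (the names `toy_weights`, `frozen_layer` and `trainable_layer` are hypothetical):

```python
import torch
from torch import nn

# stand-in for a pretrained matrix: 10 tokens, 4 dimensions
toy_weights = torch.rand(10, 4)

# default freeze=True: the weights receive no gradient
frozen_layer = nn.Embedding.from_pretrained(toy_weights, padding_idx=0)

# freeze=False lets the optimizer update the embeddings during training
trainable_layer = nn.Embedding.from_pretrained(toy_weights, freeze=False, padding_idx=0)

print(frozen_layer.weight.requires_grad)
print(trainable_layer.weight.requires_grad)
```

Keeping the embeddings frozen is cheaper and protects the pretrained vectors; unfreezing them lets the model adapt the vectors to the task at the cost of many more trainable parameters.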

In [None]:
embeddings = torch.tensor(embeddings).float()

In [None]:
embedding_layer = nn.Embedding.from_pretrained(embeddings, padding_idx=0)

In [None]:
x_embed = embedding_layer(x)

### A bit of LSTM
```batch_first=True``` is explained below.

In [None]:
lstm = nn.LSTM(input_size=300, hidden_size=128, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True)

In [None]:
x_lstm, _ = lstm(x_embed)

In [None]:
# 256 because it is a concatenation of the LSTM that read the text from left to right
# and the LSTM that read the text from right to left
x_lstm.shape

In [None]:
# got rid of the time dimension
x_lstm.mean(dim=1).shape

### Let's create our own network
There is more detailed information about why we use classes at the end of the first homework.

In [None]:
class DeepAverageNetwork(nn.Module):

    def __init__(self, embeddings, linear_1_size, linear_2_size, n_classes):
        super().__init__()

        self.embedding_layer = nn.Embedding.from_pretrained(embeddings, padding_idx=0)

        self.batch_norm = nn.BatchNorm1d(num_features=embeddings.shape[-1])

        self.linear_1 = nn.Linear(in_features=embeddings.shape[-1], out_features=linear_1_size)
        self.linear_2 = nn.Linear(in_features=linear_1_size, out_features=linear_2_size)
        self.linear_3 = nn.Linear(in_features=linear_2_size, out_features=n_classes)

    def forward(self, x):

        # translating word indices into embeddings of these words
        # (batch_size, sequence_length) -> (batch_size, sequence_length, embedding_dim)
        x = self.embedding_layer(x)

        # aggregating our embeddings by time dimension
        # (batch_size, sequence_length, embedding_dim) -> (batch_size, embedding_dim)
        x = x.sum(dim=1)

        # normalization
        # (batch_size, embedding_dim) -> (batch_size, embedding_dim)
        x = self.batch_norm(x)

        # passing through the first linear layer
        # (batch_size, embedding_dim) -> (batch_size, linear_1_size)
        x = self.linear_1(x)

        # applying nonlinearity
        # (batch_size, linear_1_size) -> (batch_size, linear_1_size)
        x = torch.relu(x)

        # passing through the second linear layer
        # (batch_size, linear_1_size) -> (batch_size, linear_2_size)
        x = self.linear_2(x)

        # applying nonlinearity
        # (batch_size, linear_2_size) -> (batch_size, linear_2_size)
        x = torch.relu(x)

        # converting into the number of classes using a linear transformation
        # (batch_size, linear_2_size) -> (batch_size, n_classes)
        x = self.linear_3(x)

        ## in theory there should have been a softmax here
        ## but we will use the nn.CrossEntropyLoss() loss
        ## its documentation says
        ## This criterion combines :func:`nn.LogSoftmax` and :func:`nn.NLLLoss` in one single class.
        ## this is some optimization that includes both the softmax and the negative log likelihood loss itself
        ## since we have a softmax in the loss, we will not use it in the net
        ## at the prediction stage (not training) we will separately do the softmax to obtain the class distribution
        ##
        ## (batch_size, n_classes) -> (batch_size, n_classes)
        # x = torch.softmax(x, dim=-1)

        return x

In [None]:
model = DeepAverageNetwork(embeddings=embeddings,
                           linear_1_size=256,
                           linear_2_size=128,
                           n_classes=len(category2index))

In [None]:
criterion = nn.CrossEntropyLoss()

# set the optimizer
# optimizer = ...

### Write a training loop
What it should include:
1. Obtaining model predictions
1. Calculating the loss function
1. Calculating gradients
1. Gradient descent step
1. Zeroing of the gradients
1. Saving the loss value
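
The steps above can be sketched on toy data as follows. This is only an illustration under assumptions: `toy_model` and `toy_batches` are hypothetical stand-ins for the real model and `train_loader`, and SGD with `lr=0.1` is an arbitrary choice of optimizer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# hypothetical stand-ins for the real model and data loader
toy_model = nn.Linear(4, 3)
toy_batches = [(torch.rand(16, 4), torch.randint(3, (16,))) for _ in range(5)]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []

toy_model.train()

for x, y in toy_batches:
    prediction = toy_model(x)        # 1. model predictions
    loss = criterion(prediction, y)  # 2. loss function
    loss.backward()                  # 3. gradients
    optimizer.step()                 # 4. gradient descent step
    optimizer.zero_grad()            # 5. zeroing of the gradients
    losses.append(loss.item())       # 6. saving the loss value as a plain float
```

Calling `loss.item()` stores a Python number rather than a tensor, so the saved losses do not keep the computation graph alive.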

In [None]:
losses = list()

# in model training we have a situation where some layers behave differently at the training and prediction stages
# for example, batch norm (as well as all other normalizations) and dropout
# this puts the model in the training mode
model.train()

for x, y in train_loader:

    ...

### Write a validation loop
What it should include:
1. Getting model predictions
1. Calculating the loss function
1. Saving the loss value

Also, with the ```with torch.no_grad():``` context you explicitly tell torch not to store the intermediate values needed to compute gradients. This saves memory and time, so use it in the prediction mode.
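
A small sketch of such a loop on toy data; `toy_model` and `toy_batches` are hypothetical stand-ins for the real model and `valid_loader`:

```python
import torch
from torch import nn

# hypothetical stand-ins for the real model and validation loader
toy_model = nn.Linear(4, 3)
toy_batches = [(torch.rand(16, 4), torch.randint(3, (16,))) for _ in range(3)]

criterion = nn.CrossEntropyLoss()

losses = []

toy_model.eval()

# nothing inside this context is tracked for backpropagation
with torch.no_grad():
    for x, y in toy_batches:
        prediction = toy_model(x)
        loss = criterion(prediction, y)
        losses.append(loss.item())

mean_loss = sum(losses) / len(losses)

# the predictions carry no gradient information
print(prediction.requires_grad)
```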

In [None]:
losses = list()

# this puts the model in the prediction mode
# that is, batch norm uses its stored running statistics and dropout is disabled
model.eval()

# note that we have changed our loader to the validation one
for x, y in valid_loader:

    with torch.no_grad():
        # getting model predictions
        # loss calculation
        ...

    ...

### Train for several epochs
One epoch is one pass through the dataset.
Steps:
- Change something in the model, add a dropout, etc.
- Stop training with early stopping
- Add metric calculation during training and prediction (e.g. micro F1). To do this, you can, for example, save the model's predictions
- After training, plot how the loss function and the metrics change on the training and validation datasets as training progresses
- Optional: build a confusion matrix

Hints:
- To save predictions correctly, you need to detach the variable from the graph, that is, do ```x.detach()```
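
A minimal sketch of this hint, assuming a hypothetical `toy_model` and a list `all_predictions` that collects the predicted classes:

```python
import torch
from torch import nn

toy_model = nn.Linear(4, 3)
x = torch.rand(8, 4)

# the logits are attached to the computation graph
logits = toy_model(x)

# detach() returns a tensor with the same data that is cut off from the graph
predicted_classes = logits.detach().argmax(dim=-1)

# store plain Python integers, which can be passed to any metric function
all_predictions = []
all_predictions.extend(predicted_classes.cpu().tolist())
```

Storing attached tensors instead would keep every batch's graph in memory for the whole epoch.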

In [None]:
for n_epoch in range(2):
    ...

### Important and not so intuitive points about LSTM in Torch

By default, LSTM accepts data with the following dimensions:
```python
(seq_len, batch, input_size)
```
This is done for the purpose of optimization at a lower level.

Our tensors have the following dimensions:
```python
(batch, seq_len, input_size)
```
For the LSTM to work correctly, we can either pass the parameter ```batch_first=True``` during layer initialization,
or transpose (change) the first and second dimensions of our x before feeding it to the layer.
[More on LSTM](https://pytorch.org/docs/stable/nn.html#lstm)

- 128 - batch size
- 64 - sequence length (number of words)
- 1024 - word embedding size

In [None]:
x = torch.rand(128, 64, 1024)

In [None]:
# first way
lstm = torch.nn.LSTM(1024, 512, batch_first=True)

pred, mem = lstm(x)

In [None]:
pred.shape

In [None]:
# second way
lstm = torch.nn.LSTM(1024, 512)

# swap the dimensions of batch and seq_len
x_transposed = x.transpose(0, 1)
pred_transposed, mem = lstm(x_transposed)

In [None]:
# the output still has the (seq_len, batch, hidden_size) dimensions
pred_transposed.shape

In [None]:
# just transpose again
pred = pred_transposed.transpose(0, 1)
pred.shape

### Another important point about LSTM

The input can also be a packed variable length sequence. See [torch.nn.utils.rnn.pack_padded_sequence()](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence) or [torch.nn.utils.rnn.pack_sequence()](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_sequence) for details.

This is a Torch data structure that lets the LSTM skip the ```PAD``` tokens while still working with batches. That is, we can tell the LSTM that the data inside a batch has variable length. Don't forget that the output is then also a [torch.nn.utils.rnn.PackedSequence](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.PackedSequence).
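
A minimal sketch of the round trip, assuming a toy batch with known true lengths (the tensor sizes and the name `lengths` are illustrative):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# hypothetical batch: 3 sequences padded to length 5, embedding size 8
x = torch.rand(3, 5, 8)
# true lengths of the sequences, kept on the CPU
lengths = torch.tensor([5, 3, 2])

lstm = nn.LSTM(input_size=8, hidden_size=4, batch_first=True)

# enforce_sorted=False allows the lengths to come in any order
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

# the LSTM processes only the real time steps and returns a PackedSequence
packed_output, _ = lstm(packed)

# convert back to a regular padded tensor of shape (batch, seq_len, hidden_size)
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True, total_length=5)

print(output.shape)
```

After unpacking, the positions beyond each sequence's true length are filled with zeros rather than with LSTM outputs.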

## Homework

1. Create a neural network class, add the necessary operations, the architecture is described below
1. Write the training procedure (summarize what was discussed above)
1. Add logging
    1. Save the loss at each training iteration __0.25 points__
    1. Save the training and validation loss at each epoch __0.25 points__
    1. Calculate metrics at each epoch __0.25 points__
    1. Add a progress bar that shows the average loss of the last 500 iterations __0.25 points__
1. Add early stopping __0.5 points__
1. Plot the loss, the metrics, and the confusion matrix __0.5 points__

### Architecture (what to try)
1. Pre-trained embeddings. Read [here](https://pytorch.org/docs/stable/nn.html#embedding) (from_pretrained) how to add your own embeddings, above we read the embedding matrix. __0 points__
1. Retrain the embeddings together with the network and with a different learning rate (specified in the optimizer). __2 points__
1. Bidirectional LSTM. __1 point__
1. Write the correct mean/max pooling, which does not take into account paddings, or rather masks them. __2 points__
1. Add [torch.nn.utils.rnn.pack_padded_sequence()](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence) and [torch.nn.utils.rnn.pack_sequence()](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_sequence) for LSTM. Info [here](#Another-important-point-about-LSTM) __2 points__
1. Add spatial dropout for the LSTM input (not just the standard ```dropout``` argument when initializing the LSTM) __1 point__
1. Add BatchNorm/LayerNorm/Dropout/Residual/etc __2 points__
1. Add a learning rate scheduler __1 point__
1. Train on GPU __2 points__
1. your madness
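
As an illustration of the pooling item in the list above, here is one possible way to mask paddings. It is a sketch under assumptions: index `0` marks padding, and `features` stands in for embedding or LSTM outputs.

```python
import torch

pad_index = 0

# hypothetical batch of token indices, 0 marks padding
token_indices = torch.tensor([[4, 7, 0, 0],
                              [3, 5, 9, 2]])

# stand-in for embedding or LSTM outputs: (batch, seq_len, features)
features = torch.ones(2, 4, 3)
# put large values at the padded positions to show that they are ignored
features[0, 2:] = 100.0

# (batch, seq_len, 1): 1.0 for real tokens, 0.0 for padding
mask = (token_indices != pad_index).unsqueeze(-1).float()

# mean over real tokens only; clamp avoids division by zero for empty texts
summed = (features * mask).sum(dim=1)
counts = mask.sum(dim=1).clamp(min=1.0)
mean_pooled = summed / counts

# max pooling: padded positions are set to -inf so they can never be the maximum
max_pooled = features.masked_fill(mask == 0, float('-inf')).max(dim=1).values
```

A plain `features.mean(dim=1)` would divide by the padded length and mix in the values at the padded positions, which is what the masking avoids.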

## Grade: 10 points maximum

# Write down the results of the experiments
# What worked and what didn't and why
# And conclusions