# Task definition
Implement an LSTM sentiment tagger for the IMDb reviews dataset.

1. (5pt) Fill in the missing code below
    * 1pt: implement vectorization
    * 2pt: implement the \_\_init\_\_ and forward methods of the models
    * 2pt: implement the collate function
2. (4pt) Implement the training loop and choose a proper loss function. Use ClearML for max points.
    * 2pt is the baseline for well-written, working code
    * 2pt if ClearML is used properly
3. (3pt) Train the models (find proper hyperparameters). Make sure you are neither overfitting nor underfitting. Visualize the training of your best model (plot training and test loss/accuracy over time). Your model should reach at least 87% accuracy. For max points it should exceed 89%.
    * 1pt for accuracy above 89%
    * 1pt for accuracy above 87%
    * 1pt for visualizations

Remarks:
* Use embeddings of size 50.
* Use a 0.5 threshold when computing accuracy.
* Use the supplied dataset for training and evaluation.
* You do not have to use a validation set.
* You should monitor overfitting during training.
* For max points, use ClearML to store and manage logs from your experiments.
* We encourage you to use the PyTorch Lightning library (one additional point for using it; however, the total must not exceed 12).

[Clear ML documentation](https://clear.ml/docs/latest/docs/)

[Clear ML notebook exercise from bootcamp](https://colab.research.google.com/drive/1wtLb4gg8beLS7smcyJlOZppn6_rQvSxL?usp=sharing)
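
The 0.5-threshold remark can be illustrated with a minimal, framework-free sketch. It assumes labels are encoded as 0/1 and that a score of exactly 0.5 counts as positive. The helper name `accuracy_at_threshold` and the sample values are hypothetical and not part of the assignment.

```python
def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # A score at or above the threshold counts as the positive class.
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


scores = [0.91, 0.40, 0.50, 0.08]   # hypothetical sigmoid outputs
labels = [1, 1, 1, 0]               # hypothetical ground-truth labels
print(accuracy_at_threshold(scores, labels))
```

With these made-up values, three of the four thresholded predictions match their labels, so the call prints 0.75.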

In [None]:
!pip install wheel numpy pandas tqdm torch torchtext
!pip install clearml pytorch-lightning torchmetrics
!pip install plotly nltk "ray[tune]"

In [None]:
import os
from collections import defaultdict

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import torch
from torch import nn
from torch import optim
from torch.nn.utils import rnn

from torch.utils.data import Dataset, DataLoader

import torchtext
from clearml import Task
from ray.tune.integration.pytorch_lightning import TuneReportCallback

In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1hK-3iiRPlbePb99Fe-34LJNZ5yB-nduq
!tar -xvzf imdb_dataset.gz
data = pd.read_csv("imdb_dataset.csv")

Downloading...
From: https://drive.google.com/uc?id=1hK-3iiRPlbePb99Fe-34LJNZ5yB-nduq
To: /home/vitreus/lstm/imdb_dataset.gz
100%|██████████████████████████████████████| 77.0M/77.0M [00:01<00:00, 55.9MB/s]
imdb_dataset.csv


In [None]:
web_server = 'https://app.community.clear.ml'
api_server = 'https://api.community.clear.ml'
files_server = 'https://files.community.clear.ml'
access_key = 'FEGJE7XJQU0JXNVQF87T'#@param {type:"string"}
secret_key = 'r5ZqZr0MsZi1qmXOopKo8AYTHk2TUO7DUMb1T1CnbXZecDrzGf'#@param {type:"string"}

Task.set_credentials(web_host=web_server,
                     api_host=api_server,
                     files_host=files_server,
                     key=access_key,
                     secret=secret_key)

In [None]:
class NaiveVectorizer:
    def __init__(self, tokenized_data, **kwargs):
        """Converts data from string to vector of ints that represent words. 
        Prepare lookup dict (self.wv) that maps token to int. Reserve index 0 for padding.
        """
        self.vocab = {word for seq in tokenized_data for word in seq.split()}
        self.wv = {word: idx for idx, word in enumerate(self.vocab, 1)}        
        self.vocab_size = len(self.vocab)

    def vectorize(self, tokenized_seq):
        """Converts sequence of tokens into sequence of indices.
        If a token does not appear in the vocabulary (self.wv), it is omitted.
        Returns torch tensor of shape (seq_len,) and type long."""

        token_indices = [self.wv[word] for word in tokenized_seq 
                         if word in self.wv]
        return torch.LongTensor(token_indices)

class ImdbDataset(Dataset):
    SPLIT_TYPES = ["train", "test", "unsup"]

    def __init__(self, data, preprocess_fn, split="train"):
        super(ImdbDataset, self).__init__()
        if split not in self.SPLIT_TYPES:
            raise ValueError(f"No such split type: {split}")

        self.split = split
        self.label = data.columns.get_loc("sentiment")
        self.data_col = data.columns.get_loc("tokenized")
        self.data = data[data["split"] == self.split]
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq = self.preprocess_fn(self.data.iloc[idx, self.data_col].split())
        label = self.data.iloc[idx, self.label]
        return (seq, label)

naive_vectorizer = NaiveVectorizer(data.loc[data["split"] == "train", "tokenized"])

def get_datasets():
    train_dataset = ImdbDataset(data, naive_vectorizer.vectorize)
    test_dataset = ImdbDataset(data, naive_vectorizer.vectorize, split="test")
    return train_dataset, test_dataset

def process_seqs(sequences):
    padded_seqs = rnn.pad_sequence(sequences, padding_value=0, batch_first=True)
    orig_lengths = [len(unpadded) for unpadded in sequences]
    orig_lengths = torch.LongTensor(orig_lengths)
    return padded_seqs, orig_lengths

def custom_collate_fn(pairs):
    """This function is supposed to be used by dataloader to prepare batches
    Input: list of tuples (sequence, label)
    Output: sequences_padded_to_the_same_length, original_lengths_of_sequences, labels.
    torch.nn.utils.rnn.pad_sequence might be useful here.
    """
    sequences, labels = zip(*pairs)
    labels = torch.LongTensor(labels)
    padded, orig_lengths = process_seqs(sequences)
    return padded, orig_lengths, labels
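
The vectorization and collate steps above can be traced on a toy example without PyTorch. This is only a sketch of the same idea: index 0 is reserved for padding, out-of-vocabulary tokens are dropped, and each batch is right-padded to its longest sequence. The helper names (`build_vocab`, `vectorize`, `pad_batch`) and the sample sentences are hypothetical. Sorting the tokens is an addition made here so that the indices are reproducible; the notebook iterates over a set, whose order for strings can differ between interpreter runs.

```python
def build_vocab(texts):
    """Map each distinct whitespace token to an index, starting at 1 (0 = padding)."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(tokens, start=1)}


def vectorize(text, vocab):
    """Convert a text to indices, silently dropping out-of-vocabulary tokens."""
    return [vocab[tok] for tok in text.split() if tok in vocab]


def pad_batch(seqs, pad_value=0):
    """Right-pad every sequence to the longest one and return the original lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    padded = [s + [pad_value] * (max_len - len(s)) for s in seqs]
    return padded, lengths


vocab = build_vocab(["a great movie", "a dull movie"])
batch = [vectorize(t, vocab) for t in ["a great movie", "dull unseen"]]
padded, lengths = pad_batch(batch)
print(vocab)            # {'a': 1, 'dull': 2, 'great': 3, 'movie': 4}
print(padded, lengths)  # [[1, 3, 4], [2, 0, 0]] [3, 1]
```

Because index 0 is taken by padding, an embedding table for this toy vocabulary needs `len(vocab) + 1` rows, which matches the `vocab_size+1` used for `nn.Embedding` in the model.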

In [None]:
from torch.nn.utils import rnn
import torch.nn.functional as func
import pytorch_lightning as pl
import torchmetrics

cpu = torch.device("cpu")

"""Implement LSTMSentimentTagger. 
The model should use an LSTM module.
Use torch.nn.utils.rnn.pack_padded_sequence to optimize processing of sequences.
When computing the vocab_size of the embedding layer, remember that the padding symbol counts toward the vocabulary.
Use sigmoid activation function.
"""
class LSTMSentimentTagger(pl.LightningModule):
    def __init__(self, vocab_size=naive_vectorizer.vocab_size, embedding_dim=50,
                 hidden_dim=256, num_layers=2, dropout=0.8):
        super(LSTMSentimentTagger, self).__init__()
        
        self.embed = nn.Embedding(vocab_size+1, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
                            batch_first=True, num_layers=num_layers)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden_dim, 1, bias=True)

        self.train_acc = torchmetrics.Accuracy()
        self.val_acc = torchmetrics.Accuracy()
    
    def _logits(self, sentence, lengths):
      embeddings = self.embed(sentence)
      packed_emb = rnn.pack_padded_sequence(embeddings, lengths.to(cpu), enforce_sorted=False,
                                            batch_first=True)
      packed_out, _ = self.lstm(packed_emb)
      output, _ = rnn.pad_packed_sequence(packed_out, batch_first=True, padding_value=0)
      final_h = output[range(output.shape[0]),lengths-1,:]

      x = final_h
      x = self.dropout(x)
      x = self.proj(x)
      return x.flatten()

    def forward(self, sentence, lengths):
      logits = self._logits(sentence, lengths)
      scores = torch.sigmoid(logits)
      return scores
    
    def configure_optimizers(self):
      return torch.optim.Adam(self.parameters(), lr=1e-3)
    
    def training_step(self, train_batch, batch_idx):
      sentence, lengths, labels = train_batch
      logits = self._logits(sentence, lengths)
      loss = func.binary_cross_entropy_with_logits(logits, labels.float())
      self.log(f"train_loss", loss)
      scores = torch.sigmoid(logits)
      preds = torch.where(scores < 0.5, 0, 1)
      self.train_acc(preds, labels)
      self.log(f"train_acc_step", self.train_acc)
      return loss
    
    def training_epoch_end(self, outs):
      # Log the accumulated accuracy once per epoch, mirroring validation_epoch_end.
      self.log("train_acc_epoch", self.train_acc)
    
    def validation_step(self, val_batch, batch_idx):
      sentence, lengths, labels = val_batch
      logits = self._logits(sentence, lengths)
      loss = func.binary_cross_entropy_with_logits(logits, labels.float())
      self.log(f"val_loss", loss)
      scores = torch.sigmoid(logits)
      preds = torch.where(scores < 0.5, 0, 1)
      self.val_acc(preds, labels)
      self.log(f"val_acc_step", self.val_acc)
    
    def validation_epoch_end(self, outs):
      self.log(f"val_acc_epoch", self.val_acc)
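
The model computes its loss on raw logits and applies the sigmoid only when producing scores. The pure-Python sketch below suggests why that split is reasonable: for moderate logits, cross-entropy on `sigmoid(z)` and a rearranged "with logits" form agree, while for large logits only the rearranged form stays finite. The function names are hypothetical, and the stable formula is the standard algebraic rearrangement, not a copy of PyTorch's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_on_probability(p, y):
    """Plain binary cross-entropy on a probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# For moderate logits both routes give the same value up to rounding.
for z, y in [(2.0, 1), (-1.5, 1), (0.3, 0)]:
    a = bce_on_probability(sigmoid(z), y)
    b = bce_with_logits(z, y)
    print(f"z={z:+.1f} y={y}  on probability: {a:.6f}  with logits: {b:.6f}")

# For a large logit, sigmoid(z) rounds to exactly 1.0 in double precision,
# so log(1 - p) is undefined, while the stable form remains finite.
print(bce_with_logits(40.0, 0))
```

In this sketch, calling `bce_on_probability(sigmoid(40.0), 0)` would raise a `ValueError` from `math.log(0.0)`, whereas the logits form returns a value very close to 40.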

# Training loop and visualizations


In [None]:
task_name = "hidden_dim of 256, dropout of 0.8, I've no other ideas atm"#@param {type:"string"}
task = Task.init(project_name="RNN Homework", task_name=task_name)

ClearML Task: created new task id=2d813ca0b4ab4bb292ad268eeb61b1f2
2022-01-15 11:21:52,899 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: https://app.community.clear.ml/projects/4e656aeff4644496b51b081559013894/experiments/2d813ca0b4ab4bb292ad268eeb61b1f2/output/log


In [None]:
BATCH_SIZE = 256
train_dataset, test_dataset = get_datasets()

loader_args = dict(batch_size=BATCH_SIZE, collate_fn=custom_collate_fn)
train_loader = DataLoader(train_dataset, shuffle=True, **loader_args)
val_loader = DataLoader(test_dataset, **loader_args)

In [None]:
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger
import shutil
from pathlib import Path

checkpoint_path = Path("net/")
checkpoint_cb = ModelCheckpoint(dirpath=str(checkpoint_path))
if checkpoint_path.exists():
  shutil.rmtree(checkpoint_path)

early_stopping_cb = EarlyStopping(monitor="val_acc_epoch", min_delta=0.0,
                                  patience=8, verbose=False, mode="max")

callbacks = [checkpoint_cb, early_stopping_cb]

tb_logger = TensorBoardLogger("lightning_logs/")

model = LSTMSentimentTagger()

pl.seed_everything(42, workers=True)
trainer = pl.Trainer(gpus=1, precision=16, deterministic=True,
                     callbacks=callbacks, logger=tb_logger)
trainer.fit(model, train_loader, val_loader)

Global seed set to 42
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type      | Params
----------------------------------------
0 | embed     | Embedding | 4.4 M 
1 | lstm      | LSTM      | 841 K 
2 | dropout   | Dropout   | 0     
3 | proj      | Linear    | 257   
4 | train_acc | Accuracy  | 0     
5 | val_acc   | Accuracy  | 0     
----------------------------------------
5.2 M     Trainable params
0         Non-trainable params
5.2 M     Total params
10.387    Total estimated model params size (MB)


2022-01-15 11:23:02,371 - clearml.frameworks - INFO - Found existing registered model id=d2908d8603c74b8ca90af350691a0e34 [/home/vitreus/lstm/net/epoch=0-step=97.ckpt] reusing it.


2022-01-15 11:23:59,403 - clearml.frameworks - INFO - Found existing registered model id=8520b685649e4577b78e4773996e7d55 [/home/vitreus/lstm/net/epoch=1-step=195.ckpt] reusing it.


2022-01-15 11:24:55,659 - clearml.frameworks - INFO - Found existing registered model id=90d99ac72b2743c38d39c1794775ab5b [/home/vitreus/lstm/net/epoch=2-step=293.ckpt] reusing it.


2022-01-15 11:25:51,454 - clearml.frameworks - INFO - Found existing registered model id=214a3398b5c74eda9de314533df8f3f3 [/home/vitreus/lstm/net/epoch=3-step=391.ckpt] reusing it.


2022-01-15 11:26:47,955 - clearml.frameworks - INFO - Found existing registered model id=93c47b77bb004a1c8184348c6b3cdec2 [/home/vitreus/lstm/net/epoch=4-step=489.ckpt] reusing it.


2022-01-15 11:27:42,334 - clearml.frameworks - INFO - Found existing registered model id=9ae424b5dc394c9ba8bbceba2d0bf112 [/home/vitreus/lstm/net/epoch=5-step=587.ckpt] reusing it.


2022-01-15 11:28:36,239 - clearml.frameworks - INFO - Found existing registered model id=679a549a04e247aab63b3de699d00026 [/home/vitreus/lstm/net/epoch=6-step=685.ckpt] reusing it.


2022-01-15 11:29:30,535 - clearml.frameworks - INFO - Found existing registered model id=797896233f9e4d9c87f783df49f904cf [/home/vitreus/lstm/net/epoch=7-step=783.ckpt] reusing it.


2022-01-15 11:30:24,948 - clearml.frameworks - INFO - Found existing registered model id=1cdfb89247174b19bddeedd7d1cb0af0 [/home/vitreus/lstm/net/epoch=8-step=881.ckpt] reusing it.


2022-01-15 11:31:18,098 - clearml.frameworks - INFO - Found existing registered model id=78ccd47f925c4a5aa75ae802ddd1299c [/home/vitreus/lstm/net/epoch=9-step=979.ckpt] reusing it.


2022-01-15 11:32:11,717 - clearml.frameworks - INFO - Found existing registered model id=ec2b0446960b49bc9b0a9ad799b08e7f [/home/vitreus/lstm/net/epoch=10-step=1077.ckpt] reusing it.


2022-01-15 11:33:07,984 - clearml.frameworks - INFO - Found existing registered model id=8e37b16da3904117b57505ea471c7b27 [/home/vitreus/lstm/net/epoch=11-step=1175.ckpt] reusing it.


2022-01-15 11:34:00,909 - clearml.frameworks - INFO - Found existing registered model id=5b9c4608efb346b69aea1b16899ccf8d [/home/vitreus/lstm/net/epoch=12-step=1273.ckpt] reusing it.


2022-01-15 11:34:54,053 - clearml.frameworks - INFO - Found existing registered model id=206a1823f54c44c3a14e83353de89efe [/home/vitreus/lstm/net/epoch=13-step=1371.ckpt] reusing it.


2022-01-15 11:35:46,997 - clearml.frameworks - INFO - Found existing registered model id=843b2fc5e12e4b1f9f64449a841730bc [/home/vitreus/lstm/net/epoch=14-step=1469.ckpt] reusing it.


2022-01-15 11:36:40,318 - clearml.frameworks - INFO - Found existing registered model id=f53093dd51bb4b9095810090e599e540 [/home/vitreus/lstm/net/epoch=15-step=1567.ckpt] reusing it.


2022-01-15 11:37:34,170 - clearml.frameworks - INFO - Found existing registered model id=32dcbc7971e94b26bb9cb90a0ed8ee3f [/home/vitreus/lstm/net/epoch=16-step=1665.ckpt] reusing it.


2022-01-15 11:38:28,353 - clearml.frameworks - INFO - Found existing registered model id=4bb450aa314c48268c23501b66857d12 [/home/vitreus/lstm/net/epoch=17-step=1763.ckpt] reusing it.


2022-01-15 11:39:21,842 - clearml.frameworks - INFO - Found existing registered model id=15d8ec59309944f5b5be2b2da29322d8 [/home/vitreus/lstm/net/epoch=18-step=1861.ckpt] reusing it.


2022-01-15 11:40:15,023 - clearml.frameworks - INFO - Found existing registered model id=c2d0bbce116b474ebf7877429d2afeb4 [/home/vitreus/lstm/net/epoch=19-step=1959.ckpt] reusing it.


2022-01-15 11:41:08,297 - clearml.frameworks - INFO - Found existing registered model id=f0c3458ad1fc4e63a83b8b5d5a564132 [/home/vitreus/lstm/net/epoch=20-step=2057.ckpt] reusing it.


2022-01-15 11:42:02,136 - clearml.frameworks - INFO - Found existing registered model id=a1dd405a7acd413f85d19ee1c86c72f0 [/home/vitreus/lstm/net/epoch=21-step=2155.ckpt] reusing it.
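
The run above is governed by the `EarlyStopping` callback with `patience=8` and `mode="max"`. The following is a simplified, library-free sketch of that stopping rule, assuming an epoch counts as an improvement only when the monitored value exceeds the best one so far by more than `min_delta`. The function name `early_stop_epoch` and the accuracy values are hypothetical and are not taken from this run.

```python
def early_stop_epoch(val_accs, patience=8, min_delta=0.0):
    """Return the index of the epoch after which training would stop, or None.

    Simplified "max" mode: an epoch counts as an improvement only if its
    metric exceeds the best value seen so far by more than min_delta.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accs):
        if acc > best + min_delta:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation accuracies, not taken from the run above.
history = [0.70, 0.80, 0.85, 0.84, 0.86, 0.85, 0.85]
print(early_stop_epoch(history, patience=2))
```

With `patience=2`, the best value (0.86) occurs at epoch index 4 and the two following epochs fail to beat it, so the sketch prints 6.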


In [None]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/