In [1]:
!wget https://www.manythings.org/anki/rus-eng.zip
!unzip rus-eng.zip

--2023-04-09 15:38:31--  https://www.manythings.org/anki/rus-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15460248 (15M) [application/zip]
Saving to: 'rus-eng.zip'


2023-04-09 15:41:35 (82.8 KB/s) - 'rus-eng.zip' saved [15460248/15460248]

Archive:  rus-eng.zip
  inflating: rus.txt                 
  inflating: _about.txt              


In [1]:
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks import LearningRateMonitor
import sys, os
import importlib

sys.path.append(os.path.join(os.getcwd(), "./src"))

from models import seq2seq_rnn
from data.datamodule import DataManager


importlib.reload(seq2seq_rnn)

<module 'models.seq2seq_rnn' from '/home/somov-od/mashine_translation/./src/models/seq2seq_rnn.py'>

In [2]:
eng_prefixes = (
    "i am ",
    "i m ",
    "he is",
    "he s ",
    "she is",
    "she s ",
    "you are",
    "you re ",
    "we are",
    "we re ",
    "they are",
    "they re ",
)

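def filter_func(pair):
    """Keep a pair if both sentences are short and the English one starts with a known prefix."""
    MAX_LENGTH = 5  # maximum number of space-separated tokens per sentence
    source, target = pair[0], pair[1]
    is_short = len(source.split(" ")) <= MAX_LENGTH and len(target.split(" ")) <= MAX_LENGTH
    return is_short and source.startswith(eng_prefixes)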

config = {
    "batch_size": 128,          # batch size
    "num_workers": 8,           # number of DataLoader worker processes
    "filter": filter_func,      # callable used to filter sentence pairs
    "filename": "./rus.txt",    # path to the file with sentence pairs
    "lang1": "en",              # name of the first language
    "lang2": "ru",              # name of the second language
    "reverse": False,           # whether to reverse the order within each pair
    "train_size": 0.8,          # fraction of pairs used for training
    "run_name": "tutorial",     # run name for the logger and checkpoints
    "quantile": 0.95,           # the longest (1 - quantile) share of sentences is removed
}
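
To see what the filter keeps, the sketch below applies the same two rules (length limit and English prefix) to a few hand-written pairs. The helper name `keep_pair` and the sample pairs are made up for illustration; they are not taken from `rus.txt`, and the real `DataManager` may normalize the text before filtering.

```python
# Illustrative re-statement of the filter above; the sample data is hypothetical.
ENG_PREFIXES = ("i am ", "i m ", "he is", "he s ", "she is", "she s ",
                "you are", "you re ", "we are", "we re ", "they are", "they re ")
MAX_LENGTH = 5  # maximum number of space-separated tokens per sentence


def keep_pair(pair):
    """Return True if both sentences are short and the source has a known prefix."""
    source, target = pair[0], pair[1]
    short_enough = (len(source.split(" ")) <= MAX_LENGTH
                    and len(target.split(" ")) <= MAX_LENGTH)
    return short_enough and source.startswith(ENG_PREFIXES)


sample_pairs = [
    ("i m ill", "я болен"),                                              # kept
    ("go away", "уходи"),                                                # dropped: no matching prefix
    ("he is the best player on the team", "он лучший игрок в команде"),  # dropped: too long
]
kept = [pair for pair in sample_pairs if keep_pair(pair)]
print(kept)
```

Prefixes without a trailing space, such as `"he is"`, also match sentences that begin with `"he isn t"`, so negated sentences pass the prefix check as well.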

In [3]:
# Data manager
dm = DataManager(config)
dm.prepare_data()
dm.setup()

input_lang_n_words = dm.input_lang_n_words
output_lang_n_words = dm.output_lang_n_words

Reading from file: 100%|██████████| 467119/467119 [00:06<00:00, 75847.90it/s]


### Data representation

In [4]:
dm.train_data[:5]

[('you re a filthy liar', 'ты грязная лгунья'),
 ('they are both good', 'они оба хорошие'),
 ('i m ill', 'я болен'),
 ('i m young and innocent', 'я молода и невинна'),
 ('he s a liar', 'он лгун')]

In [5]:
dm.train_dataset[0]

(tensor([[  0],
         [132],
         [ 68],
         [ 22],
         [  1]]),
 tensor([[  0],
         [298],
         [201],
         [642],
         [  1]]))

In [12]:
dm.output_lang.word2index['лгунья']

642
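
The outputs above suggest that each sentence is stored as a column of word indices wrapped in two special tokens: 0 at the start and 1 at the end. A minimal sketch of that encoding with plain lists follows. The names `SOS`, `EOS`, and `encode` and the toy vocabulary are assumptions made for illustration; only the index 642 for 'лгунья' is confirmed by the lookup above, and the actual `DataManager` may work differently.

```python
SOS, EOS = 0, 1  # assumed start-of-sentence and end-of-sentence indices

# Toy vocabulary: 642 is confirmed by the lookup above; 298 and 201 are
# inferred from their positions in the target tensor and may be wrong.
word2index = {"ты": 298, "грязная": 201, "лгунья": 642}


def encode(sentence, vocab):
    """Return a column of indices: [[SOS], [w1], ..., [wn], [EOS]]."""
    indices = [SOS] + [vocab[word] for word in sentence.split(" ")] + [EOS]
    return [[index] for index in indices]


encoded = encode("ты грязная лгунья", word2index)
print(encoded)
```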

### Training

In [13]:
model = seq2seq_rnn.Seq2SeqRNN(
    encoder_vocab_size=input_lang_n_words,
    encoder_embedding_size=256,
    decoder_embedding_size=256,
    decoder_vocab_size=output_lang_n_words,
    lr=1e-3,
    output_lang_index2word=dm.train_dataset.output_lang.index2word,
)

In [14]:
# TB Logger
logger = TensorBoardLogger("lightning_logs", name=config["run_name"])

# Callbacks
checkpoint_callback = ModelCheckpoint(
    save_top_k=3,
    monitor="val_loss",
    mode="min",
    dirpath="runs/{}/".format(config["run_name"]),
    filename="{epoch:02d}-{step:d}-{val_loss:.4f}",
    verbose=True,
    every_n_epochs=1,
)
lr_monitor = LearningRateMonitor(logging_interval="step")

# Initialize a Trainer
trainer = pl.Trainer(
    accelerator='cuda',
    overfit_batches=1,
    devices=[0],
    precision=16,
    max_epochs=50,
    min_epochs=1,
    callbacks=[lr_monitor, checkpoint_callback],
    check_val_every_n_epoch=1,
    logger=logger,
    log_every_n_steps=1,
)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(overfit_batches=1)` was configured so 1 batch will be used.


In [None]:
trainer.fit(model, dm)

Reading from file:  64%|██████▍   | 298621/467119 [00:03<00:02, 74069.97it/s]

The TensorBoard logs can be viewed in two ways: from the command line or with the TensorBoard extension for Jupyter notebooks.

CLI:
1. Run `tensorboard --logdir=./lightning_logs --port=6006`
2. Open `localhost:6006` in a browser. If you are working on a remote machine, first forward the chosen port through your SSH connection.


Jupyter:
1. Load the extension: `%load_ext tensorboard`
2. Launch the built-in TensorBoard: `%tensorboard --logdir=./lightning_logs`

In [12]:
%load_ext tensorboard
%tensorboard --logdir=./lightning_logs


# Hints

#### Load model from checkpoint
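Calling `self.save_hyperparameters()` in the `__init__` of a `pl.LightningModule` stores the constructor arguments in the checkpoint, so the model can be restored without passing them again:
```python
from models.seq2seq_rnn import Seq2SeqRNN
model = Seq2SeqRNN.load_from_checkpoint('checkpoint.ckpt')  # placeholder path
```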

Alternatively, build the model yourself and load the weights the plain PyTorch way:
```python
import torch
model = Seq2SeqRNN(*args, **kwargs)  # same arguments as used for training
model.load_state_dict(torch.load('checkpoint.ckpt')['state_dict'])
```

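#### Add custom metrics to logger
Call `self.log("metric_name", value)` inside `training_step` or `validation_step` to send a custom metric to the logger. See the logging documentation for details:
https://lightning.ai/docs/pytorch/stable/extensions/logging.html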


#### Enable grad accumulation
In this example the schedule is as follows (epochs are counted from 0):
1. epochs 0 to 14: accumulate gradients over 4 batches
2. epochs 15 to 24: accumulate gradients over 2 batches
3. from epoch 25 on: no accumulation (1 batch per optimizer step)

```python
from pytorch_lightning.callbacks import GradientAccumulationScheduler

accumulator = GradientAccumulationScheduler(scheduling={0: 4, 15: 2, 25: 1})
trainer = pl.Trainer(
    ...
    callbacks=[..., accumulator],
    ...
)
```
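
A pure-Python sketch of how such a schedule can be read: for a given zero-based epoch, the factor attached to the largest key not exceeding that epoch applies. The helper `accumulation_factor` is hypothetical and is not Lightning's implementation; the batch size of 128 is the value from the config earlier in this notebook.

```python
scheduling = {0: 4, 15: 2, 25: 1}
batch_size = 128  # value from the notebook config


def accumulation_factor(epoch, schedule):
    """Return the factor of the latest schedule entry starting at or before `epoch`."""
    start = max(key for key in schedule if key <= epoch)
    return schedule[start]


for epoch in (0, 14, 15, 24, 25, 40):
    factor = accumulation_factor(epoch, scheduling)
    print(f"epoch {epoch:2d}: accumulate {factor} batch(es), "
          f"effective batch size {batch_size * factor}")
```

Accumulating over 4 batches of 128 gives an effective batch size of 512 per optimizer step, which shrinks back to 128 once the factor reaches 1.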

#### Configure learning rate scheduler

```python
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',
        factor=0.1,
        patience=0,
        threshold=1e-2,
        threshold_mode='rel',
        cooldown=0,
        min_lr=0,
        eps=1e-09,
        verbose=True
    )
    lr_dict = {
        "scheduler": lr_scheduler,
        "interval": "epoch",
        "frequency": 1,
        "monitor": "val_loss"
    }
    return [optimizer], [lr_dict]
```
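
To build intuition for these settings, here is a simplified pure-Python re-implementation of the plateau rule in `min` mode with a relative threshold. It is a sketch, not PyTorch's code: cooldown and `eps` are ignored, the function name `simulate_plateau` is hypothetical, and the validation losses are invented. With `patience=0`, every epoch that fails to improve the best loss by more than 1% multiplies the learning rate by the factor 0.1.

```python
import math


def simulate_plateau(val_losses, lr, factor=0.1, patience=0, threshold=1e-2, min_lr=0.0):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = math.inf
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best * (1.0 - threshold):  # relative improvement in 'min' mode
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:            # plateau detected: reduce the rate
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


losses = [1.0, 0.9, 0.895, 0.5, 0.6]  # invented validation losses
rates = simulate_plateau(losses, lr=1e-3)
for epoch, rate in enumerate(rates):
    print(f"epoch {epoch}: val_loss={losses[epoch]:.3f} lr={rate:.0e}")
```

In this run the rate drops at epoch 2, because 0.895 is not more than 1% below the best loss of 0.9, and again at epoch 4. Such an aggressive setting can shrink the learning rate quickly on a noisy validation loss, so a larger `patience` may be worth trying.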

#### Other logger: WandB
You can also use the Weights & Biases logger, which PyTorch Lightning supports natively:
https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.loggers.wandb.html#module-lightning.pytorch.loggers.wandb