This script:  
1. Loads GloVe Twitter embeddings
2. Loads an example arbitrary dataset in a Pandas DataFrame
3. Initializes a PyTorch uni-directional LSTM model
4. Wraps this model in a [PyTorch Lightning](https://www.pytorchlightning.ai/) [LightningModule](https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html)
5. Trains this model

NOTE: this script is still a [WIP](https://www.urbandictionary.com/define.php?term=Wip)! PRs are welcome.  

Future plans:  
1. Incorporate hyperparameter tuning using [Ray Tune](https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html).
2. Incorporate EarlyStopping using [PyTorch Lightning callbacks](https://pytorch-lightning.readthedocs.io/en/1.2.0/extensions/generated/pytorch_lightning.callbacks.EarlyStopping.html?highlight=earlystopping).
3. Better commenting, including that of the model architecture.

Environment setup

In [None]:
!pip install pytorch-lightning

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import pandas as pd
import os

Download GloVe embeddings.  

I have already downloaded the [glove.twitter.27B.200d](https://nlp.stanford.edu/projects/glove/) embeddings and converted them into a convenient format:  
1. `terms.npy` (~25MB): a numpy array of shape (size of vocabulary,), each element a *token*.  
2. `embs.npy` (~2GB): a numpy array of shape (size of vocabulary, embeddings dimension), each row the embedding for the corresponding token.  

Feel free to:
1. pre-process and load your own embeddings to pass to the model using a similar format (vocabulary in `terms.npy` and the actual embeddings in `embs.npy`).  
2. randomly initialize embeddings which can then be trained along with the rest of the model.
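
As a rough illustration of the format described above, the sketch below converts GloVe-style text lines (`token v1 v2 ...`) into the two arrays. The helper name `parse_glove_lines` and the three sample lines are made up for this example; the conversion actually used to produce the files above is not shown in this notebook and may differ.

```python
import numpy as np

def parse_glove_lines(lines):
    """Split GloVe-style text lines into a token array and an embedding matrix."""
    terms, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(' ')
        terms.append(parts[0])                           # first field is the token
        vectors.append([float(v) for v in parts[1:]])    # remaining fields are the vector
    return np.array(terms, dtype=object), np.array(vectors, dtype=np.float32)

# Hypothetical sample in the GloVe text format (the real file has 200 values per line).
sample_lines = [
    "hello 0.1 0.2 0.3",
    "world 0.4 0.5 0.6",
    "irony 0.7 0.8 0.9",
]
toy_terms, toy_embs = parse_glove_lines(sample_lines)
print(toy_terms.shape, toy_embs.shape)  # (3,) (3, 3)

# Alternative: randomly initialized embeddings of the same shape,
# intended to be trained along with the rest of the model.
rng = np.random.default_rng(0)
random_embs = rng.normal(size=toy_embs.shape).astype(np.float32)
```

The two arrays could then be written with `np.save` and loaded the same way as the pre-converted files.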

In [None]:
import gdown
glove_terms_url = 'https://drive.google.com/uc?id=1-3B9Zq9DJphA4TdtuxigNfQNrdUhZlhn'
gdown.download(glove_terms_url, 'glove.twitter.27B.200d.terms.npy', quiet=False)

glove_embs_url = 'https://drive.google.com/uc?id=1-A7WSkxmaRS27BmBGtCfWizk_sv37QDY'
gdown.download(glove_embs_url, 'glove.twitter.27B.200d.embs.npy', quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1-3B9Zq9DJphA4TdtuxigNfQNrdUhZlhn
To: /content/glove.twitter.27B.200d.terms.npy
24.4MB [00:00, 130MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1-A7WSkxmaRS27BmBGtCfWizk_sv37QDY
To: /content/glove.twitter.27B.200d.embs.npy
1.91GB [00:16, 118MB/s] 


'glove.twitter.27B.200d.embs.npy'

Load the downloaded embeddings.

In [None]:
glove_terms_npa = np.load('glove.twitter.27B.200d.terms.npy',allow_pickle=True)
PAD_TOK, UNK_TOK = '<pad>','<unk>'
# Prepend the special tokens so that <pad> gets index 0 and <unk> gets index 1
glove_terms_npa = np.concatenate([[PAD_TOK, UNK_TOK],glove_terms_npa])

glove_embs_npa = np.load('glove.twitter.27B.200d.embs.npy',allow_pickle=True)
# <pad> is embedded as a zero vector, <unk> as the mean of all GloVe vectors
glove_embs_npa = np.concatenate([np.zeros((1,glove_embs_npa.shape[1])),np.mean(glove_embs_npa,axis=0,keepdims=True),glove_embs_npa],axis=0)

print(glove_terms_npa.shape)
print(glove_embs_npa.shape)

(1193516,)
(1193516, 200)


Load dataset  

In this script, we work with the task of **irony detection**.  
**Feel free to load your own dataset instead.**

In [None]:
def preprocess_data_df(df):
    #Lower-case the text column (vectorized equivalent of a row-wise apply)
    df = df.copy()
    df['text'] = df['text'].str.lower()
    return df

In [None]:
!test -d /tmp/tweeteval && echo "FYI: tweeteval directory already exists, to pull the latest version uncomment the line below: !rm -r /tmp/tweeteval"
#!rm -r /tmp/tweeteval
!test -d /tmp/tweeteval || git clone https://github.com/cardiffnlp/tweeteval /tmp/tweeteval

In [None]:
#####Load train data#####
with open('/tmp/tweeteval/datasets/irony/train_text.txt','rt') as fi:
    train_texts = fi.read().strip().split('\n')
train_text_dfs = pd.Series(data=train_texts,name='text',dtype='str')
train_labels_dfs = pd.read_csv('/tmp/tweeteval/datasets/irony/train_labels.txt',names=['label'],index_col=False).label
irony_train_df = pd.concat([train_text_dfs,train_labels_dfs],axis=1)
#####Load val data#######
with open('/tmp/tweeteval/datasets/irony/val_text.txt','rt') as fi:
    val_texts = fi.read().strip().split('\n')
val_text_dfs = pd.Series(data=val_texts,name='text',dtype='str')
val_labels_dfs = pd.read_csv('/tmp/tweeteval/datasets/irony/val_labels.txt',names=['label'],index_col=False).label
irony_val_df = pd.concat([val_text_dfs,val_labels_dfs],axis=1)
#####Load test data######
with open('/tmp/tweeteval/datasets/irony/test_text.txt','rt') as fi:
    test_texts = fi.read().strip().split('\n')
test_text_dfs = pd.Series(data=test_texts,name='text',dtype='str')
test_labels_dfs = pd.read_csv('/tmp/tweeteval/datasets/irony/test_labels.txt',names=['label'],index_col=False).label
irony_test_df = pd.concat([test_text_dfs,test_labels_dfs],axis=1)
#######Preprocess########
irony_train_df = preprocess_data_df(irony_train_df)
irony_val_df = preprocess_data_df(irony_val_df)
irony_test_df = preprocess_data_df(irony_test_df)
#######Clean up##########
train_text_dfs,train_labels_dfs = None,None
val_text_dfs,val_labels_dfs = None,None
test_text_dfs,test_labels_dfs = None,None

Design the [Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) which can then later be loaded into a [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).  
We create a map-style Dataset because the example data is already loaded in memory. Each map-style Dataset must override the `__getitem__()` method (and should also implement `__len__()`). Each call to `__getitem__()` must fetch the `sample_id`th sample from the underlying data.  

\[If you have a streamed dataset (e.g. data streaming in live from a file descriptor), consider using an [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) instead.\]
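
For the streaming case mentioned above, a minimal sketch of an iterable-style dataset might look like the following. `LineStreamDataset` is a hypothetical name, and `io.StringIO` merely stands in for a live stream; the rest of this notebook does not use it.

```python
import io
from torch.utils.data import DataLoader, IterableDataset

class LineStreamDataset(IterableDataset):
    """Yields one lower-cased line at a time from a file-like object (hypothetical helper)."""
    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        for line in self.stream:
            line = line.strip()
            if line:                  # skip blank lines
                yield line.lower()

# io.StringIO stands in for a live stream such as an open file descriptor.
stream = io.StringIO("Great, another Monday\nI LOVE waiting in line\n")

# batch_size=None disables automatic batching, so samples come out one by one.
streamed_texts = list(DataLoader(LineStreamDataset(stream), batch_size=None))
print(streamed_texts)
```

Unlike a map-style Dataset, this class defines `__iter__()` instead of `__getitem__()` and needs no `__len__()`.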

In [None]:
class LSTMDataset(torch.utils.data.Dataset):
    def __init__(self, df, vocab, max_seq_length, pad_token, unk_token):
        self.labels = df.label.tolist()
        self.word2idx = {term:idx for idx,term in enumerate(vocab)}
        self.idx2word = {idx:word for word,idx in self.word2idx.items()}
        
        self.pad_token,self.unk_token = pad_token,unk_token
        self.input_ids = []
        self.sequence_lens = []
        self.labels = []
        for i in range(df.shape[0]):
            input_ids,sequence_len = self.convert_text_to_input_ids(df.iloc[i].text,pad_to_len=max_seq_length)
            
            self.input_ids.append(input_ids.reshape(-1))
            self.sequence_lens.append(sequence_len)
            self.labels.append(df.iloc[i].label)
        
        assert len(self.input_ids) == df.shape[0]
        assert len(self.sequence_lens) == df.shape[0]
        assert len(self.labels) == df.shape[0]
    
    def convert_text_to_input_ids(self,text,pad_to_len):
        words = text.strip().split()[:pad_to_len]
        deficit = pad_to_len - len(words)
        words.extend([self.pad_token]*deficit)
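        unk_idx = self.word2idx[self.unk_token]
        #Map each token to its vocabulary index, falling back to the unknown-token index
        input_ids = [self.word2idx.get(word, unk_idx) for word in words]
        #Return the padded index tensor and the unpadded sequence length
        return torch.tensor(input_ids, dtype=torch.long),pad_to_len - deficit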

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, sample_id):
        sample_dict = dict()
        sample_dict['input_ids'] = self.input_ids[sample_id].reshape(-1)
        sample_dict['sequence_len'] = torch.tensor(self.sequence_lens[sample_id]).long()
        sample_dict['labels'] = torch.tensor(self.labels[sample_id])
        return sample_dict
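
To make the truncate/pad/lookup logic of `convert_text_to_input_ids` easier to follow, here is a standalone sketch of the same idea on a toy vocabulary. The function `text_to_ids` and the five-word vocabulary are illustrative assumptions, not part of the class above.

```python
def text_to_ids(text, word2idx, pad_to_len, pad_token='<pad>', unk_token='<unk>'):
    """Whitespace-tokenize, truncate/pad to pad_to_len, and map tokens to indices."""
    words = text.strip().split()[:pad_to_len]
    seq_len = len(words)                                   # length before padding
    words = words + [pad_token] * (pad_to_len - seq_len)
    unk_idx = word2idx[unk_token]
    return [word2idx.get(w, unk_idx) for w in words], seq_len

toy_vocab = ['<pad>', '<unk>', 'i', 'love', 'mondays']      # hypothetical vocabulary
toy_word2idx = {term: idx for idx, term in enumerate(toy_vocab)}

ids, seq_len = text_to_ids('i love rainy mondays', toy_word2idx, pad_to_len=6)
print(ids, seq_len)  # [2, 3, 1, 4, 0, 0] 4
```

The out-of-vocabulary word `rainy` maps to the `<unk>` index, and the returned length (4) is what later lets the encoder ignore the two padding positions.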

Design the LSTM encoder

In [None]:
class LSTMEncoder(torch.nn.Module):
    def __init__(self, config):
        super(LSTMEncoder, self).__init__()
        
        pretrained_embeddings = config['pretrained_embeddings'] if 'pretrained_embeddings' in config else None
        freeze_embeddings = config['freeze_embeddings'] if 'freeze_embeddings' in config else False
        if pretrained_embeddings is not None:
            self.vocab_size = pretrained_embeddings.shape[0]
            self.embedding_dim = pretrained_embeddings.shape[1]
            self.embedding = torch.nn.Embedding.from_pretrained(torch.from_numpy(pretrained_embeddings).float(),freeze=freeze_embeddings)
        else:
            assert 'vocab' in config and 'embedding_dim' in config
            self.vocab_size = config['vocab'].shape[0]
            self.embedding_dim = config['embedding_dim']
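            if freeze_embeddings:
                print('warning: freezing randomly initialized embeddings')
            #torch.nn.Embedding has no `freeze` argument, so set requires_grad on the weight instead
            self.embedding = torch.nn.Embedding(self.vocab_size,self.embedding_dim)
            self.embedding.weight.requires_grad = not freeze_embeddings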
        
        self.hidden_size = config['hidden_size']
        lstm_unit_cnt = config['lstm_unit_cnt']
        self.lstm = torch.nn.LSTM(input_size=self.embedding_dim,hidden_size=self.hidden_size,num_layers=lstm_unit_cnt,batch_first=True,bidirectional=False)

    def forward(self, batch):
        x = batch['input_ids']
        x_lengths = batch['sequence_len']
        embed_out = self.embedding(x)
        packed_input = torch.nn.utils.rnn.pack_padded_sequence(embed_out, x_lengths.tolist(),enforce_sorted=False,batch_first=True)
        packed_out,_ = self.lstm(packed_input)
        output,_ = torch.nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)    #inverse operation of pack_padded_sequence
        lstm_out = output[range(len(output)), x_lengths - 1, :self.hidden_size]
        return lstm_out
    
    def get_embedding_dims(self):
        return self.vocab_size,self.embedding_dim
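
The `forward` method packs the padded batch, runs the LSTM, and then picks each sequence's output at its last non-padded time step. The sketch below reproduces that pattern on random toy tensors; all sizes are arbitrary choices for illustration.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
batch_size, max_len, emb_dim, hidden = 3, 5, 4, 8      # toy sizes chosen for illustration
lstm = torch.nn.LSTM(input_size=emb_dim, hidden_size=hidden, num_layers=2, batch_first=True)

embedded = torch.randn(batch_size, max_len, emb_dim)   # stands in for the embedding output
lengths = torch.tensor([2, 5, 3])                      # true (unpadded) length of each sequence

packed = pack_padded_sequence(embedded, lengths.tolist(), batch_first=True, enforce_sorted=False)
packed_out, (h_n, _) = lstm(packed)
output, _ = pad_packed_sequence(packed_out, batch_first=True)

# For each sequence, pick the output at its last non-padded time step.
last_step = output[torch.arange(batch_size), lengths - 1]
print(last_step.shape)  # torch.Size([3, 8])

# For a uni-directional LSTM this should match the final hidden state of the top layer.
matches_h_n = torch.allclose(last_step, h_n[-1], atol=1e-6)
print(matches_h_n)
```

Packing matters here because, without it, the LSTM would keep updating its state over the padding positions of the shorter sequences.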

Design the PyTorch Lightning [LightningModule](https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html)

In [None]:
class LitClassifier(pl.LightningModule):

    def __init__(self, config):
        super().__init__()
        #Embeddings init
        self.vocab = config['vocab']
        self.vocab_size = self.vocab.shape[0]
        self.pad_token = config['pad_token']
        self.unk_token = config['unk_token']
        #Model init
        self.label_count = config['label_count']
        self.model = LSTMEncoder(config)
        self.lr = config['lr']
        model_out_size = self.model.hidden_size
        self.pooler_layer = torch.nn.Linear(model_out_size, model_out_size)
        self.dropout = torch.nn.Dropout(config['dropout_prob'])
        self.classifier = torch.nn.Linear(model_out_size, self.label_count)

        #Data init
        if 'df' in config:
            self.df = config['df']
            self.train_df,self.val_df,self.test_df = None,None,None
        elif 'train_df' in config and 'val_df' in config and 'test_df' in config:
            self.train_df = config['train_df']
            self.val_df = config['val_df']
            self.test_df = config['test_df']
            self.df = None
        else:
            raise ValueError("config must contain either 'df' or all of 'train_df', 'val_df' and 'test_df'")

        self.batch_size = config['batch_size']# if 'batch_size' in config else 32
        self.max_seq_length = config['max_seq_length']
        #Fall back to the unsplit DataFrame when no explicit train split is given
        self.labels = (self.df if self.df is not None else self.train_df).label.tolist()
        
        #self.criterion = torch.nn.CrossEntropyLoss(ignore_index = self.TRG.vocab.stoi[self.TRG.pad_token])
        self.criterion = torch.nn.CrossEntropyLoss()
    
    def shared_step(self, batch, batch_idx):

        batched_contextual_embeddings = self.model(batch)
        
        pooler_layer_out = self.pooler_layer(batched_contextual_embeddings)

        pooler_layer_out = self.dropout(pooler_layer_out)
        logits = self.classifier(pooler_layer_out)
        pred_labels = torch.argmax(logits,dim=1)
        actual_labels = batch['labels']
        assert actual_labels.shape[0] == pred_labels.shape[0]

        loss = self.criterion(logits, actual_labels)

        metrics = {}

        pred_labels = pred_labels.detach().cpu().numpy()
        actual_labels = actual_labels.detach().cpu().numpy()
        logits = logits.detach().cpu().numpy()

        metrics['loss'] = loss
        metrics['acc'] = (pred_labels == actual_labels).sum() / pred_labels.shape[0]
        metrics['macro_f1'] = f1_score(actual_labels,pred_labels,average='macro')

        return_dict = {'loss':loss,'metrics':metrics}

        return return_dict

    def training_step(self, batch, batch_idx):
        shared_step_out_dict = self.shared_step(batch, batch_idx)
        loss,metrics = shared_step_out_dict['loss'],shared_step_out_dict['metrics']

        metrics = {'train_'+k:v for k,v in metrics.items()}
        self.log_dict(metrics,on_epoch=True,prog_bar=True,logger=True)
        return loss
    
    def validation_step(self, batch, batch_idx):
        shared_step_out_dict = self.shared_step(batch, batch_idx)
        loss,metrics = shared_step_out_dict['loss'],shared_step_out_dict['metrics']

        metrics = {'val_'+k:v for k,v in metrics.items()}
        self.log_dict(metrics,on_epoch=True,prog_bar=True,logger=True)
        return loss
    
    def test_step(self, batch, batch_idx):
        shared_step_out_dict = self.shared_step(batch, batch_idx)
        loss,metrics = shared_step_out_dict['loss'],shared_step_out_dict['metrics']

        metrics = {'test_'+k:v for k,v in metrics.items()}
        self.log_dict(metrics,on_epoch=True,prog_bar=True,logger=True)
        return loss
    
    def setup(self, stage=None):
        pass
    
    def prepare_data(self):
        if self.df is not None:
            complete_dataset = LSTMDataset(self.df,self.vocab,self.max_seq_length,self.pad_token,self.unk_token)
            train_len,val_len = int(0.8*len(complete_dataset)),int(0.1*len(complete_dataset))
            test_len = len(complete_dataset) - train_len - val_len
            self.train_dataset,self.val_dataset,self.test_dataset = random_split(complete_dataset,[train_len,val_len,test_len])
        else:
            self.train_dataset = LSTMDataset(self.train_df,self.vocab,self.max_seq_length,self.pad_token,self.unk_token)
            self.val_dataset = LSTMDataset(self.val_df,self.vocab,self.max_seq_length,self.pad_token,self.unk_token)
            self.test_dataset = LSTMDataset(self.test_df,self.vocab,self.max_seq_length,self.pad_token,self.unk_token)

    def train_dataloader(self):
        #Shuffle the training data at every epoch
        return DataLoader(self.train_dataset, batch_size=int(self.batch_size), shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=int(self.batch_size))
    
    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=int(self.batch_size))
    
    def configure_optimizers(self):
        #define optimizers and LR schedulers
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
        return optimizer
        #scheduler = CyclicLR(optimizer, base_lr = 5e-5,max_lr = 1e-3,step_size_up=100,step_size_down=200,cycle_momentum=False,verbose=False)
        #return [optimizer],[{'scheduler': scheduler,'interval': 'step','frequency': 10}]#,'reduce_on_plateau': False,'monitor': 'val_loss',}]

Next we specify the configuration of our LightningModule.  

Apart from model hyperparameters (e.g. `max_seq_length`), we also need to pass:  
1. configurations for initializing embeddings  
  * if pre-trained embeddings used, provide:  
    1. `pretrained_embeddings`: a numpy array of shape (size of vocabulary, embeddings dimension).
    2. `freeze_embeddings`: `True` to keep the embeddings fixed during training, `False` to make them trainable.
    3. `vocab`: a numpy array of shape (size of vocabulary,), each element a *token*.  
    4. `pad_token`: token to be used as padding.
    5. `unk_token`: token to be used for unknown tokens encountered in input sequences.  
    **NOTE: `pad_token` and `unk_token` should already be present in `vocab`.**
  * if pre-trained embeddings *NOT* used, provide:  
    1. `vocab`: (same as above)
    2. `embedding_dim`: desired embeddings dimension
2. data
  * if data is divided into train/val/test [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)s  
    1. `train_df`
    2. `val_df`
    3. `test_df`
  * if data is not split
    1. `df`: this DataFrame will get randomly divided into train/val/test splits in an 80/10/10 ratio by default.  

NOTE: all data DataFrames must contain 2 columns: `text` and `label`.
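
The cell further down uses pre-trained embeddings with separate train/val/test DataFrames. For the other combination described above (no pre-trained embeddings, a single unsplit DataFrame), a configuration might be assembled roughly as follows. The toy data, the vocabulary construction, and all hyperparameter values are assumptions made for illustration; the dictionary is only built here, not passed to the model.

```python
import numpy as np
import pandas as pd

PAD, UNK = '<pad>', '<unk>'

# Hypothetical toy data; any DataFrame with `text` and `label` columns would do.
toy_df = pd.DataFrame({
    'text': ['oh great , another monday', 'i love this song', 'what a lovely traffic jam'],
    'label': [1, 0, 1],
})

# Build the vocabulary from the data, with the special tokens placed first.
tokens = sorted({tok for text in toy_df.text for tok in text.split()})
toy_vocab = np.array([PAD, UNK] + tokens, dtype=object)

toy_config = {
    # model configurations (values are illustrative only)
    'batch_size': 2, 'max_seq_length': 10, 'lr': 1e-3, 'label_count': 2,
    'dropout_prob': 0.2, 'hidden_size': 16, 'lstm_unit_cnt': 1,
    # embeddings configurations: no `pretrained_embeddings`, so `embedding_dim` is required
    'vocab': toy_vocab, 'embedding_dim': 50, 'pad_token': PAD, 'unk_token': UNK,
    # data: a single DataFrame, to be split into train/val/test
    'df': toy_df,
}
print(len(toy_vocab))
```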

In [None]:
config = {
    #model configurations
    'batch_size':32,
    'max_seq_length':100,
    'lr':1e-3,
    'label_count':2,
    'dropout_prob':2e-1,
    'hidden_size':256,
    'lstm_unit_cnt':2,

    #embeddings configurations
    'pretrained_embeddings':glove_embs_npa,
    'freeze_embeddings':True,
    'vocab':glove_terms_npa,
    'pad_token':PAD_TOK,
    'unk_token':UNK_TOK,

    #data
    'train_df':irony_train_df,
    'val_df':irony_val_df,
    'test_df':irony_test_df,
}
model = LitClassifier(config)

If your Jupyter environment has TensorBoard support, feel free to execute the cell below to analyse training and validation.  
PyTorch Lightning logs to the directory `{current_directory}/lightning_logs/` by default.

In [None]:
%load_ext tensorboard
%tensorboard --logdir "lightning_logs"

Finally, we initialize the PyTorch Lightning trainer and start the training process! 🎉

In [None]:
trainer = pl.Trainer(max_epochs=10,gpus=1)
trainer.fit(model)

In [None]:
trainer.test(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]




--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.5612244606018066,
 'test_loss': 1.7929543256759644,
 'test_macro_f1': 0.5529208183288574}
--------------------------------------------------------------------------------


[{'test_acc': 0.5612244606018066,
  'test_loss': 1.7929543256759644,
  'test_macro_f1': 0.5529208183288574}]

🤔 Hmmm...the trained model performs well on the train set but quite poorly on the validation and test splits!

The next step should be to explore regularization techniques such as [EarlyStopping](https://pytorch-lightning.readthedocs.io/en/1.2.0/extensions/generated/pytorch_lightning.callbacks.EarlyStopping.html?highlight=earlystopping). Hyperparameter tuning is also likely to help.

Fortunately, these techniques plug easily into PyTorch Lightning!