This Colab notebook reproduces the Natural Language Inference task on the MNLI dataset with 300-d GloVe embeddings from "Beyond Fully Connected Layers with Quaternions: Parametrization of Hypercomplex Multiplications with 1/n Parameters". We implement the PHM-LSTM version of the model as described in the paper. Because LSTMs parallelize poorly and Colab limits memory and GPU usage, we were not able to finish the whole training process. More extensive experiments would be possible with sufficient memory and GPU time. You can find the detailed explanation [here](https://github.com/sinankalkan/CENG501-Spring2021/blob/main/project_BarutcuDemir/readme.md).

# Install & Import Modules

In [None]:
!pip install torchinfo
from tqdm import tqdm
import pandas as pd
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch
import random
import matplotlib.pyplot as plt
import torch.optim as optim
from torchinfo import summary
from nltk.tokenize import WordPunctTokenizer



# Connect to GPU

In [None]:
if torch.cuda.is_available():
  print("Cuda (GPU support) is available and enabled!")
  device = torch.device("cuda")
else:
  print("Cuda (GPU support) is not available :(")
  device = torch.device("cpu")

Cuda (GPU support) is not available :(


# Download GloVe Embeddings
We use GloVe embeddings to convert the words in the MNLI dataset to 300-dimensional vectors.

In [None]:
# Get pretrained glove for word embedding
!wget http://nlp.stanford.edu/data/glove.42B.300d.zip
!unzip -q glove.42B.300d.zip

--2021-07-23 16:14:25--  http://nlp.stanford.edu/data/glove.42B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.42B.300d.zip [following]
--2021-07-23 16:14:25--  https://nlp.stanford.edu/data/glove.42B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.42B.300d.zip [following]
--2021-07-23 16:14:25--  http://downloads.cs.stanford.edu/nlp/data/glove.42B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1877800501 (1.7G) [application/zip]
Sav

# Load Dataset
We use the Hugging Face "datasets" library to load the MNLI dataset. The "validation matched" and "validation mismatched" splits serve as validation and test data, respectively.

In [None]:
# Load MNLI dataset
!pip install datasets
from datasets import load_dataset
train_ds, valid_ds, test_ds = load_dataset('multi_nli',split=['train','validation_matched','validation_mismatched'])



Using custom data configuration default
Reusing dataset multi_nli (/root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39)


In [None]:
# Convert dataset to dataframe
train_df = train_ds.to_pandas()
valid_df = valid_ds.to_pandas()
test_df = test_ds.to_pandas()

# Converting GloVe to a Dictionary
We store each word in the GloVe vocabulary as a key and its 300-d vector representation as the value. This process allocates about 2 GB of RAM.

In [None]:
# Convert Glove to dictionary
embeddings_dict = {}
with open("glove.42B.300d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector
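
To make the file layout concrete: each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory string. The sample words and values are illustrative only, the 3-d vectors stand in for the real 300-d ones, and plain Python lists replace the NumPy arrays used above.

```python
import io

# Two made-up lines in the GloVe text layout: "<word> <v1> <v2> ... <vd>".
# Real GloVe vectors are 300-d; 3-d values keep the example readable.
sample_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.5 0.0 -1.5\n"
)

def parse_glove(handle):
    """Return a dict mapping each word to its vector (a list of floats)."""
    table = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        table[values[0]] = [float(x) for x in values[1:]]
    return table

toy_embeddings = parse_glove(sample_file)
print(toy_embeddings["cat"])  # [0.5, 0.0, -1.5]
```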

In [None]:
train_df = train_df[['premise','hypothesis','label']]
valid_df = valid_df[['premise','hypothesis','label']]
test_df = test_df[['premise','hypothesis','label']]

# Loading the Data
Converting all words to embedding vectors before training would require 70 GB of memory, so we define a "DataLoader" class that converts each batch of training and validation instances into embedding vectors at training time.

In [None]:
class DataLoader:
  def __init__(self, embedding_dict,train_df, val_df,batch_size):

    self.MAX_LEN = 256
    self.batch_size = batch_size
    self.embedding_dict = embedding_dict
    self.train_df = train_df
    self.val_df = val_df
    self.tokenizer = WordPunctTokenizer()

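  def padding(self,word_list):
    # Truncate to MAX_LEN tokens, or right-pad with the literal token '0',
    # which is looked up in the embedding dictionary like any other token.
    if len(word_list) > self.MAX_LEN:
      return word_list[:self.MAX_LEN]
    return word_list + ['0']*(self.MAX_LEN-len(word_list))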

  def sample_data(self,df):
    df_train = self.train_df.sample(n=self.batch_size)
    df_valid = self.val_df.sample(n=self.batch_size)
    return df_train,df_valid

  def load_data(self, df,mode='train'):

    if mode == 'train':
      df,_ = self.sample_data(df)
    else:
      _,df = self.sample_data(df)

    premise_embedding = []
    hypothesis_embedding = []
    labels = []

    premise_list = df['premise'].to_list()
    hypothesis_list = df['hypothesis'].to_list()
    label_list = df['label'].to_list()

    for (premise, hypothesis,label) in zip(premise_list, hypothesis_list,label_list):

      premise_token = self.padding(self.tokenizer.tokenize((premise.lower())))
      hypothesis_token = self.padding(self.tokenizer.tokenize((hypothesis.lower())))
      try:
        premise_embed = [self.embedding_dict[x] for x in premise_token]
        hypothesis_embed = [self.embedding_dict[x] for x in hypothesis_token]
        
        premise_embedding.append(premise_embed)
        hypothesis_embedding.append(hypothesis_embed)
        labels.append(label)
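      except KeyError:
        # Skip pairs that contain an out-of-vocabulary token, so a batch
        # can hold fewer than batch_size samples.
        pass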

    premise_embedding = torch.tensor(premise_embedding)
    hypothesis_embedding = torch.tensor(hypothesis_embedding)
    labels = torch.tensor(labels)  #.reshape((len(labels),1))
    return premise_embedding,hypothesis_embedding,labels

  def get_train(self):
    return self.load_data(self.train_df)

  def get_valid(self):
    return self.load_data(self.val_df,mode='valid')
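
The per-sample steps inside "load_data" (tokenize, pad or truncate, look up each token, drop the pair on an unknown word) can be sketched without any dependencies. In the sketch below, the toy dictionary, the small MAX_LEN, and the helper names are assumptions made for illustration, and a regular expression stands in for nltk's WordPunctTokenizer.

```python
import re

MAX_LEN = 6  # the class above uses 256; a small value keeps the output short
PAD = "0"    # same padding token as the class above

# Hypothetical 2-d embeddings; real GloVe vectors are 300-d.
toy_embeddings = {
    "a": [1.0, 0.0], "dog": [0.0, 1.0], "runs": [1.0, 1.0],
    ".": [0.5, 0.5], PAD: [0.0, 0.0],
}

def tokenize(text):
    # Stand-in for WordPunctTokenizer: runs of word characters,
    # or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def pad(tokens):
    # Truncate long sequences, right-pad short ones.
    return tokens[:MAX_LEN] + [PAD] * max(0, MAX_LEN - len(tokens))

def embed(text):
    """Return MAX_LEN vectors, or None if any token is out of vocabulary."""
    try:
        return [toy_embeddings[tok] for tok in pad(tokenize(text))]
    except KeyError:
        return None

kept = embed("A dog runs.")
dropped = embed("A zebra runs.")  # "zebra" is not in the toy dictionary
print(len(kept), dropped)  # 6 None
```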

# Measure Vector Embedding Process Time
Embedding one batch of 128 samples takes about 4.6 seconds when sequences are truncated at 256 tokens. The processing time grows linearly with the truncation length and the batch size. Because we convert the words at training time, each training iteration takes longer.

In [None]:
import time
data = DataLoader(embeddings_dict,train_df,valid_df,128)
t0= time.process_time()
data.get_train()
t1 = time.process_time()
print(t1-t0)

4.582791444999998
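
One way to see why the cost scales linearly is to count dictionary lookups per batch. This is a back-of-the-envelope count rather than a measurement, and the function name is ours.

```python
def lookups_per_batch(batch_size, max_len, sentences_per_pair=2):
    # Every pair contributes a premise and a hypothesis, each padded or
    # truncated to max_len tokens, and every token costs one dict lookup.
    return batch_size * sentences_per_pair * max_len

base = lookups_per_batch(128, 256)
print(base)                                  # 65536
print(lookups_per_batch(256, 256) // base)   # 2
print(lookups_per_batch(128, 512) // base)   # 2
```

Doubling either the batch size or the truncation length doubles the number of lookups.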


# PHM-LSTM Module
We replace the standard unidirectional LSTM with a PHM implementation built on the Kronecker product, as described in the [paper](https://openreview.net/pdf?id=rcQdycl0zyk). A detailed explanation can be found [here](https://github.com/sinankalkan/CENG501-Spring2021/blob/main/project_BarutcuDemir/readme.md).

In [None]:
class PHMLSTM(torch.nn.Module):
    def __init__(self,n, input_size, hidden_size):
        """
          input_size: the size of the input at a time step.
          hidden_size: the number of neurons in the hidden state.
        """
        super().__init__()

        self.n = n
        self.inp = input_size
        self.hid = hidden_size

        self.a_f = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((self.n, self.n, self.n))))
        self.s_f = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((self.n, self.inp//self.n, self.hid//self.n))))

        self.au_f = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((self.n, self.n, self.n))))
        self.su_f = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((self.n, self.hid//n, self.hid//n))))

        self.b_f = nn.Parameter(torch.zeros(self.hid))

        self.a_i = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((self.n, self.n, self.n))))
        self.s_i = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((self.n, self.inp//n, self.hid//n))))

        self.au_i =  nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, n, n))))
        self.su_i = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, self.hid//n, self.hid//n))))

        self.b_i = nn.Parameter(torch.zeros(self.hid))

        self.a_o = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, n, n))))
        self.s_o = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, self.inp//n, self.hid//n))))

        self.au_o = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, n, n))))
        self.su_o = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, self.hid//n, self.hid//n))))

        self.b_o = nn.Parameter(torch.zeros(self.hid))

        self.a_c = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, n, n))))
        self.s_c = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, self.inp//n, self.hid//n))))

        self.au_c = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, n, n))))
        self.su_c = nn.Parameter(nn.init.xavier_uniform_(torch.zeros((n, self.hid//n, self.hid//n))))

        self.b_c = nn.Parameter(torch.zeros(self.hid))


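    def kronecker_product1(self, a, b):
      # Batched Kronecker product over the last two dimensions:
      # a: (..., p, q) and b: (..., r, s) give out: (..., p*r, q*s).
      siz1 = torch.Size(torch.tensor(a.shape[-2:]) * torch.tensor(b.shape[-2:]))
      # Broadcast to (..., p, r, q, s) so that res[..., i, k, j, l] = a[..., i, j] * b[..., k, l]
      res = a.unsqueeze(-1).unsqueeze(-3) * b.unsqueeze(-2).unsqueeze(-4)
      siz0 = res.shape[:-4]
      out = res.reshape(siz0 + siz1)
      return out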
    
    def forward(self, X):
        """
          X: An input that has L time steps and for each time step, it has 
          input_size many elements. Has shape (B, L, input_size) with B being 
          the batch size.

          Output: Tuple (h, c) where h is the tensor holding the hidden state for L
          time steps, and c is the tensor holding the memory state for L time steps. 
          Both have shape (B, L, hidden_size).
        """

        B,L,inp_size = X.shape

        self.W_f = torch.sum(self.kronecker_product1(self.a_f,self.s_f),dim=0)
        self.U_f = torch.sum(self.kronecker_product1(self.au_f,self.su_f),dim=0)

        self.W_i = torch.sum(self.kronecker_product1(self.a_i,self.s_i),dim=0)
        self.U_i = torch.sum(self.kronecker_product1(self.au_i,self.su_i),dim=0)

        self.W_o = torch.sum(self.kronecker_product1(self.a_o,self.s_o),dim=0)
        self.U_o = torch.sum(self.kronecker_product1(self.au_o,self.su_o),dim=0)

        self.W_c = torch.sum(self.kronecker_product1(self.a_c,self.s_c),dim=0)
        self.U_c = torch.sum(self.kronecker_product1(self.au_c,self.su_c),dim=0)

        # Allocate the states on the same device as the input
        self.h_prev = torch.zeros((B,self.hid), device=X.device)
        self.c_prev = torch.zeros((B,self.hid), device=X.device)

        h = torch.zeros((B,L,self.hid), device=X.device)
        c = torch.zeros((B,L,self.hid), device=X.device)

        for t in range(L):
          f = torch.sigmoid(torch.matmul(X[:,t,:],self.W_f) + torch.matmul(self.h_prev,self.U_f)+ self.b_f)
          i = torch.sigmoid(torch.matmul(X[:,t,:],self.W_i) + torch.matmul(self.h_prev,self.U_i) + self.b_i)
          o = torch.sigmoid(torch.matmul(X[:,t,:],self.W_o) + torch.matmul(self.h_prev,self.U_o) + self.b_o)
          c_cap = torch.tanh(torch.matmul(X[:,t,:],self.W_c) + torch.matmul(self.h_prev,self.U_c) + self.b_c)
          c[:,t,:] = f*self.c_prev + i*c_cap
          # Standard LSTM update: the hidden state is o * tanh(c)
          h[:,t,:] = o*torch.tanh(c[:,t,:])
          self.h_prev = h[:,t,:].clone()
          self.c_prev = c[:,t,:].clone()
        return (h,c)
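
Each weight matrix built in "forward" above is a sum of Kronecker products, W = sum_i A_i ⊗ S_i, which is what "kronecker_product1" followed by "torch.sum" computes. The following dependency-free sketch shows the same construction on tiny matrices with nested lists. The sizes (n = 2, a 4x2 weight) and the helper names are assumptions chosen so the result can be checked by hand.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]

def phm_weight(A, S):
    """Sum of Kronecker products: W = sum_i kron(A[i], S[i])."""
    terms = [kron(a, s) for a, s in zip(A, S)]
    rows, cols = len(terms[0]), len(terms[0][0])
    return [[sum(t[r][c] for t in terms) for c in range(cols)]
            for r in range(rows)]

n = 2
A = [[[1, 0], [0, 1]],      # A_1: identity, shape (n, n)
     [[0, 1], [1, 0]]]      # A_2: swap, shape (n, n)
S = [[[1], [2]],            # S_1, shape (in/n, out/n) = (2, 1)
     [[3], [4]]]            # S_2, same shape
W = phm_weight(A, S)        # shape (4, 2)
for row in W:
    print(row)
# [1, 3]
# [2, 4]
# [3, 1]
# [4, 2]
```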

# Define The Network
The network applies the PHM-LSTM module to the premise and the hypothesis, concatenates pooled features of their hidden outputs, and feeds the result to an MLP with a tanh activation function. The concatenation step is explained in the [GitHub](https://github.com/sinankalkan/CENG501-Spring2021/blob/main/project_BarutcuDemir/readme.md) repository.

In [None]:
class PHMModule(torch.nn.Module):

    def __init__(self,n, input_dim, hidden_dim,max_len):
        super().__init__()
        
        random.seed(501)
        np.random.seed(501)
        torch.manual_seed(501)

        self.LSTM = PHMLSTM(n,input_dim,hidden_dim)
        self.Linear1 = nn.Linear(max_len*4,100)
        self.Linear2 = nn.Linear(100,3)

    def forward(self, premise,hypothesis):

        h_premise,c_premise = self.LSTM(premise)
        h_hypothesis,c_hypothesis = self.LSTM(hypothesis)

        hp_average = torch.mean(h_premise,2)
        hp_max = torch.max(h_premise,2).values

        hh_average = torch.mean(h_hypothesis,2)
        hh_max = torch.max(h_hypothesis,2).values

        v = torch.cat((hp_average,hp_max,hh_average,hh_max),1)

        v = self.Linear1(v)
        v = torch.tanh(v)
        v = self.Linear2(v)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself,
        # so an extra softmax here would be applied twice.
        return v

# Training Function

In [None]:
def train(model, criterion, optimizer, epochs, train_df,valid_df,embed_dict,batch_size,verbose_it=True,verbose=True):
  """
    Define the trainer function. We can use this for training any model.
    The parameter names are self-explanatory.

    Returns: the training and validation loss histories.
  """
  data = DataLoader(embed_dict,train_df,valid_df,batch_size)
  num_train = len(train_df)
  loss_history = []
  valid_loss = [] 
  for epoch in range(epochs):  
    for it in range(int(num_train/batch_size)):
      model.train()    

      # Our batch:
      premise,hypothesis, labels = data.get_train()
      premise = premise.to(device)
      hypothesis = hypothesis.to(device)
      labels = labels.to(device)

      # zero the gradients as PyTorch accumulates them
      optimizer.zero_grad()

      # Obtain the scores
      outputs = model(premise,hypothesis)

      # Calculate loss
      loss = criterion(outputs.to(device), labels)

      # Backpropagate
      loss.backward()

      # Update the weights
      optimizer.step()

      loss_history.append(loss.item())
      with torch.no_grad():

        # Calculate validation loss in batches.

        model.eval()
        v_premise,v_hypothesis, v_labels = data.get_valid()
        v_premise = v_premise.to(device)
        v_hypothesis = v_hypothesis.to(device)
        v_labels = v_labels.to(device)
        v_outputs = model(v_premise,v_hypothesis)
        v_loss = criterion(v_outputs.to(device), v_labels)
        valid_loss.append(v_loss.item())
        
      if verbose_it: print(f'Train Loss at iteration {it+1}: {loss_history[-1]} | Validation Loss: {valid_loss[-1]}')
      
    if verbose: print(f'Epoch {epoch+1} / {epochs}: Avg. Train loss of last 5 iterations {np.sum(loss_history[:-6:-1])/5} | Validation Loss {np.sum(valid_loss[:-6:-1])/5} ')  
  return loss_history,valid_loss

# Define The Model

In [None]:
epochs = 10
lr = 0.0004
batch_size = 256
hidden = 300
max_len = 256
input_size = 300
n = 10

model = PHMModule(n,input_size,hidden,max_len)
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
summary(model)

Layer (type:depth-idx)                   Param #
PHMModule                                --
├─PHMLSTM: 1-1                           81,200
├─Linear: 1-2                            102,500
├─Linear: 1-3                            303
Total params: 184,003
Trainable params: 184,003
Non-trainable params: 0
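
The counts in the summary above can be reproduced by hand from the tensor shapes declared in "PHMLSTM" and "PHMModule". The sketch below does that arithmetic in plain Python; the helper names are ours.

```python
def phm_matrix_params(n, in_dim, out_dim):
    # n matrices A_i of shape (n, n) plus n matrices S_i of shape
    # (in_dim/n, out_dim/n), matching the a_* and s_* tensors above.
    return n * n * n + n * (in_dim // n) * (out_dim // n)

def phm_lstm_params(n, inp, hid):
    per_gate = (phm_matrix_params(n, inp, hid)    # input weights W
                + phm_matrix_params(n, hid, hid)  # recurrent weights U
                + hid)                            # bias
    return 4 * per_gate  # forget, input, output, and candidate gates

def linear_params(in_features, out_features):
    return in_features * out_features + out_features

n, inp, hid, max_len = 10, 300, 300, 256
lstm = phm_lstm_params(n, inp, hid)
fc1 = linear_params(4 * max_len, 100)  # four pooled vectors of length max_len
fc2 = linear_params(100, 3)
print(lstm, fc1, fc2, lstm + fc1 + fc2)  # 81200 102500 303 184003
```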

# Train The Model
Due to Colab's limited runtime, we were able to train the model for only 857 iterations in 3 hours. Although the model is not fully trained, we observed that both the training and validation losses decrease over the iterations, which indicates that the model is learning with the PHM implementation.

In [None]:
train_loss,validation_loss = train(model, criterion, optimizer, epochs, train_df,valid_df,embeddings_dict,batch_size)

Train Loss at iteration 1: 1.0985090732574463 | Validation Loss: 1.102396845817566
Train Loss at iteration 2: 1.0911962985992432 | Validation Loss: 1.1048578023910522
Train Loss at iteration 3: 1.1106410026550293 | Validation Loss: 1.125192403793335
Train Loss at iteration 4: 1.1193898916244507 | Validation Loss: 1.1178134679794312
Train Loss at iteration 5: 1.110683560371399 | Validation Loss: 1.089297890663147
Train Loss at iteration 6: 1.1067328453063965 | Validation Loss: 1.0998872518539429
Train Loss at iteration 7: 1.09945547580719 | Validation Loss: 1.1015677452087402
Train Loss at iteration 8: 1.1018390655517578 | Validation Loss: 1.0983408689498901
Train Loss at iteration 9: 1.1089692115783691 | Validation Loss: 1.1009223461151123
Train Loss at iteration 10: 1.1052794456481934 | Validation Loss: 1.101319432258606
Train Loss at iteration 11: 1.0924773216247559 | Validation Loss: 1.1011736392974854
Train Loss at iteration 12: 1.0983141660690308 | Validation Loss: 1.1079963445663
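
For reading the log above, a useful reference point is the cross-entropy of a classifier that assigns equal probability to all three MNLI classes, which is ln 3. The sketch below computes it with the standard library; the function name is ours.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target])

uniform = [1 / 3, 1 / 3, 1 / 3]  # equal probability for the three classes
baseline = cross_entropy(uniform, target=0)
print(round(baseline, 4))  # 1.0986
```

Losses around 1.10, as in the first iterations above, are therefore what a near-uniform prediction produces.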

### Standard LSTM Implementation

In [None]:
class LSTM(torch.nn.Module):
  def __init__(self,inp,hid,max_len):
    super().__init__()
    self.inp = inp
    self.hid = hid
    self.max_len = max_len
    # Inputs have shape (B, L, inp), so the LSTM must be batch-first
    self.lstm = nn.LSTM(self.inp,self.hid,batch_first=True)
    self.fc1 = nn.Linear(self.max_len*4,100)
    self.fc2 = nn.Linear(100,3)
  def forward(self,premise,hypothesis):
    # nn.LSTM returns (output, (h_n, c_n)); only the per-step outputs are used
    h_premise,_ = self.lstm(premise)
    h_hypothesis,_ = self.lstm(hypothesis)

    hp_average = torch.mean(h_premise,2)
    hp_max = torch.max(h_premise,2).values

    hh_average = torch.mean(h_hypothesis,2)
    hh_max = torch.max(h_hypothesis,2).values

    v = torch.cat((hp_average,hp_max,hh_average,hh_max),1)

    v = self.fc1(v)
    v = torch.tanh(v)
    v = self.fc2(v)
    # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself,
    # so an extra softmax here would be applied twice.
    return v

In [None]:
epochs = 10
lr = 0.0004
batch_size = 256
hidden = 300
max_len = 256
input_size = 300

model_2 = LSTM(input_size,hidden,max_len)
model_2 = model_2.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_2.parameters(), lr=lr)

In [None]:
summary(model=model_2)

Layer (type:depth-idx)                   Param #
LSTM                                     --
├─LSTM: 1-1                              722,400
├─Linear: 1-2                            102,500
├─Linear: 1-3                            303
Total params: 825,203
Trainable params: 825,203
Non-trainable params: 0
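
The recurrent layer's count in this summary follows from the shapes that nn.LSTM stores, and comparing it with the PHM-LSTM count from the earlier summary gives the compression ratio. The sketch below is plain arithmetic with a helper name of our own.

```python
def lstm_params(inp, hid):
    # Per gate: input weights, recurrent weights, and the two bias vectors
    # (input-side and hidden-side) that nn.LSTM keeps.
    per_gate = inp * hid + hid * hid + 2 * hid
    return 4 * per_gate

standard = lstm_params(300, 300)
phm = 81_200  # PHM-LSTM count from the earlier summary
print(standard)                  # 722400
print(round(phm / standard, 3))  # 0.112
```

With n = 10 the ratio is close to 1/n = 0.1; the remainder comes from the A matrices and the biases, which do not shrink with n.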

In [None]:
train_loss_2,validation_loss_2 = train(model_2, criterion, optimizer, epochs, train_df,valid_df,embeddings_dict,batch_size)

Train Loss at iteration 1: 1.1003645658493042 | Validation Loss: 1.0977150201797485
Train Loss at iteration 2: 1.0992730855941772 | Validation Loss: 1.1090463399887085
Train Loss at iteration 3: 1.0999653339385986 | Validation Loss: 1.105333685874939
Train Loss at iteration 4: 1.1002124547958374 | Validation Loss: 1.1003557443618774
Train Loss at iteration 5: 1.104783296585083 | Validation Loss: 1.0984902381896973
Train Loss at iteration 6: 1.0972894430160522 | Validation Loss: 1.101171612739563
Train Loss at iteration 7: 1.101003885269165 | Validation Loss: 1.1009111404418945
Train Loss at iteration 8: 1.099508285522461 | Validation Loss: 1.0963428020477295
Train Loss at iteration 9: 1.099807858467102 | Validation Loss: 1.0997064113616943
Train Loss at iteration 10: 1.0992439985275269 | Validation Loss: 1.097491979598999
Train Loss at iteration 11: 1.1010271310806274 | Validation Loss: 1.1002843379974365
Train Loss at iteration 12: 1.0979918241500854 | Validation Loss: 1.1008261442184