# 0 TorchText

## Dataset Preview

Your first step to deep learning in NLP. We will be mostly using PyTorch. Just like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines. 

We will be using the tweet dataset from the previous session. Let's preview it first.

In [1]:
import pandas as pd
import numpy as np


In [2]:
phrases_data = pd.read_csv('/content/sample_data/dictionary.txt',sep='|',header=None).sort_values(by=1).rename(columns={0:'phrase',1:'id'})
label_data = pd.read_csv('/content/sample_data/sentiment_labels.txt',sep='|').rename(columns={'sentiment values':'sentiment'}).rename(columns={'phrase ids':'id'})
assert phrases_data.shape[0] == label_data.shape[0]  
joined_data = pd.merge(phrases_data,label_data,on='id')

In [3]:
label_dict = dict(zip(joined_data['phrase'],joined_data['sentiment']))

In [4]:
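labels = []
with open('/content/sample_data/SOStr.txt','r') as f:
    for line in f:
        # each line is one sentence whose tokens are separated by '|'
        temp = line.strip().split('|')
        # look up the sentiment score of every token (None if the token is missing)
        value = [label_dict.get(c) for c in temp]
        labels.append(value)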

In [5]:
df = pd.read_csv('/content/sample_data/datasetSentences.txt',sep='\t')
df.head()
df['label'] = labels

In [6]:
df['label'] = df['label'].apply(lambda x: np.ceil(25*np.mean(x))).astype('int')

In [7]:
label_map=dict(zip(list(df.label.unique()),range(len(df.label.unique()))))
df['label']=df['label'].map(label_map)
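
The two label steps above can be sketched in plain Python. The token scores below are made-up values, not taken from the dataset, and `math.ceil` and `statistics.mean` stand in for the NumPy calls; treat it as an illustration of the idea rather than the notebook's exact pipeline.

```python
import math
from statistics import mean

# Hypothetical per-token sentiment scores in [0, 1] for four sentences.
token_scores = [
    [0.5, 0.9, 0.8],   # mostly positive
    [0.1, 0.2],        # negative
    [0.5, 0.5, 0.5],   # neutral
    [0.12, 0.18],      # negative, lands in the same bucket as sentence 2
]

# Step 1: average the scores and scale to an integer bucket: ceil(25 * mean).
buckets = [math.ceil(25 * mean(s)) for s in token_scores]

# Step 2: re-index the buckets to contiguous ids in order of first appearance,
# mirroring zip(unique(), range(...)) on the DataFrame column.
label_map = {b: i for i, b in enumerate(dict.fromkeys(buckets))}
labels = [label_map[b] for b in buckets]

print(buckets)  # [19, 4, 13, 4]
print(labels)   # [0, 1, 2, 1]
```

Because the ids follow the order of first appearance, they carry no ordinal meaning: id 0 is simply the first bucket seen, not the most negative one.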

In [8]:
datasplit = pd.read_csv('/content/sample_data/datasetSplit.txt')

In [9]:
df['split'] = datasplit['splitset_label']

In [10]:
train_data = df[df['split']==1]
valid_data = df[df['split']!=1]

In [11]:
import random

def random_deletion(words, p=0.5):
    # drop each word independently with probability p
    if len(words) <= 1: # nothing to delete from an empty or single-word list
        return words
    remaining = [w for w in words if random.uniform(0, 1) > p]
    if len(remaining) == 0: # if nothing is left, keep one random word
        return [random.choice(words)]
    return remaining
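
As a quick check of the behaviour, here is a self-contained variant of the function with a usage example. The seed and the sample sentence are arbitrary choices for illustration.

```python
import random

def random_deletion(words, p=0.5):
    # Keep each word with probability 1 - p; never return an empty list.
    if len(words) <= 1:
        return words
    remaining = [w for w in words if random.uniform(0, 1) > p]
    return remaining if remaining else [random.choice(words)]

random.seed(0)  # arbitrary seed, only to make the run repeatable
words = "good fun , good action , good acting".split()
augmented = random_deletion(words, p=0.2)

# Whatever was dropped, the survivors keep their original relative order.
print(augmented)
```

Deletion never reorders words, so the augmented sentence is always a subsequence of the original one.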

In [12]:
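def random_swap(sentence, n=5):
    # swap two randomly chosen words, n times; returns the result as a string
    sentence = list(sentence) # copy so the caller's list is not modified
    if len(sentence) < 2: # random.sample needs at least two positions
        return " ".join(sentence)
    length = range(len(sentence))
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1]
    return " ".join(sentence)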
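
The property that matters for swapping is that it changes word order but never the words themselves. The sketch below uses a hypothetical helper, `swap_words`, that returns a list instead of a string so the property is easy to check.

```python
import random
from collections import Counter

def swap_words(words, n=5):
    # Hypothetical non-mutating variant: works on a copy and returns a list.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

original = "good pace , good cinematography .".split()
swapped = swap_words(original, n=5)

print(swapped)
# Same multiset of words, possibly in a new order.
print(Counter(swapped) == Counter(original))
```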

In [13]:
!pip install googletrans==3.1.0a0



In [14]:
import random
import googletrans
from googletrans import Translator
def random_translation(sentence,translator=Translator()):
    # back-translation: English -> random language -> English
    # `sentence` is passed as a list of strings, so the results are iterated as a list
    available_langs = list(googletrans.LANGUAGES.keys())
    trans_lang = random.choice(available_langs)

    translations = translator.translate(sentence, dest=trans_lang)
    t_text = [t.text for t in translations]

    translations_en_random = translator.translate(t_text, src=trans_lang, dest='en')
    en_text = " ".join(t.text for t in translations_en_random)
    return en_text

Adding random deletions

In [15]:
# add random deletions
# note: DataFrame.append was removed in pandas 2.0; on newer versions collect the rows and use pd.concat
aug_data = pd.DataFrame()
for i,row in train_data.iterrows():
    row['sentence'] = ' '.join(random_deletion(row['sentence'].split(), p=0.2))
    aug_data = aug_data.append(row).reset_index(drop=True)

For the time being, only 2,600 samples were translated (the run below was interrupted manually).

In [16]:
# add random translations
aug_data1 = pd.DataFrame()
for i,row in train_data.iterrows():
    print(i) # progress indicator: index of the row being translated
    row['sentence'] = random_translation([str(row['sentence'])])
    aug_data1 = aug_data1.append(row).reset_index(drop=True)

0
1
60
61
...
1822
1823

KeyboardInterrupt: ignored

Adding random swapping

In [17]:
# add random swaps
aug_data2 = pd.DataFrame()
for i,row in train_data.iterrows():
    row['sentence'] = random_swap(row['sentence'].split(" "))
    aug_data2 = aug_data2.append(row).reset_index(drop=True)

In [19]:
train_data = train_data.append(aug_data).reset_index(drop=True)

In [21]:
train_data = train_data.append(aug_data1).reset_index(drop=True)

In [22]:
train_data = train_data.append(aug_data2).reset_index(drop=True)

In [23]:
train_data.shape

(36242, 4)

In [24]:
train_data.drop_duplicates(inplace=True)

In [25]:
train_data.shape

(25102, 4)

## Defining Fields

Now we define Label as a LabelField, a subclass of Field that sets sequential to False (as it’s our numerical category class). Tweet is a standard Field object that uses the spaCy tokenizer. The text is not lowercased here; passing lower=True to Field would do that.

In [27]:
# Import Library
import random
import torch, torchtext
from torchtext import data 

# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fa8ca01ff30>

In [28]:
Tweet = data.Field(sequential = True, tokenize = 'spacy', batch_first =True, include_lengths=True)
Label = data.LabelField(tokenize ='spacy', is_target=True, batch_first =True, sequential =False)

Having defined those fields, we now need a list that maps each field onto a column of the data:

In [29]:
fields = [('tweets', Tweet),('labels',Label)]

In [30]:
valid_data=valid_data.reset_index(drop=True)
train_data=train_data.reset_index(drop=True)

Armed with our declared fields, let's convert from a pandas DataFrame to a list of Examples and then to a torchtext Dataset. We could also use TabularDataset to apply the field definitions to a CSV file directly, but this shows an alternative approach.

In [31]:
train_example = [data.Example.fromlist([train_data.sentence[i],train_data.label[i]], fields) for i in range(train_data.shape[0])] 


In [32]:
valid_example = [data.Example.fromlist([valid_data.sentence[i],valid_data.label[i]], fields) for i in range(valid_data.shape[0])]

In [33]:
# Creating dataset
#twitterDataset = data.TabularDataset(path="tweets.csv", format="CSV", fields=fields, skip_header=True)

train = data.Dataset(train_example, fields)
valid = data.Dataset(valid_example, fields)

The training and validation sets were already separated using the split column, so there is no need to call split(); we only check their sizes:

In [34]:
(len(train), len(valid))

(25102, 3311)

An example from the dataset:

In [35]:
vars(train.examples[10])

{'labels': 4.0,
 'tweets': ['Good',
  'fun',
  ',',
  'good',
  'action',
  ',',
  'good',
  'acting',
  ',',
  'good',
  'dialogue',
  ',',
  'good',
  'pace',
  ',',
  'good',
  'cinematography',
  '.']}

## Building Vocabulary

At this point we would normally have to build a one-hot encoding of each word present in the dataset, a rather tedious process. Thankfully, torchtext will do this for us, and it also accepts a max_size parameter that limits the vocabulary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don’t want our GPUs too overwhelmed, after all.

The cell below builds the vocabulary from the full training set. To cap it at, say, 5000 words, pass max_size=5000 to build_vocab:


In [36]:
Tweet.build_vocab(train)
Label.build_vocab(train)

By default, torchtext will add two more special tokens, <unk> for unknown words and <pad>, a padding token that will be used to pad all our text to roughly the same size to help with efficient batching on the GPU.
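
To show what a vocabulary with special tokens and a size cap amounts to, here is a minimal sketch in plain Python. It is not torchtext's implementation; the function name `build_vocab`, the tie-breaking rule, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=None, specials=("<unk>", "<pad>")):
    # Count tokens, keep the max_size most frequent, and put specials first.
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(freqs, key=lambda tok: (-freqs[tok], tok))
    itos = list(specials) + ordered[:max_size]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

texts = [["good", "fun", ","], ["good", "pace", ","], ["good", "acting"]]
itos, stoi = build_vocab(texts, max_size=3)

# Words outside the vocabulary fall back to the <unk> index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["good", "dialogue", ","]]

print(itos)     # ['<unk>', '<pad>', 'good', ',', 'acting']
print(encoded)  # [2, 0, 3]
```

In this sketch the cap applies to regular words only, so the two special tokens come on top of `max_size`.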

In [37]:
print('Size of input vocab : ', len(Tweet.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 most frequent words :', list(Tweet.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  17212
Size of label vocab :  14
Top 10 most frequent words : [('.', 22073), (',', 19845), ('the', 16830), ('and', 12381), ('of', 12357), ('a', 12275), ('to', 8442), ('-', 7565), ("'s", 7075), ('is', 7017)]
Labels :  defaultdict(<function _default_unk_index at 0x7fa87ad74488>, {0.0: 0, 1.0: 1, 3.0: 2, 2.0: 3, 7.0: 4, 4.0: 5, 6.0: 6, 5.0: 7, 10.0: 8, 9.0: 9, 8.0: 10, 12.0: 11, 11.0: 12, 13.0: 13})


**Lots of stopwords!!**

Now we need to create a data loader to feed into our training loop. Torchtext provides the BucketIterator class, which groups examples of similar length and yields Batch objects; these are almost, but not quite, like the batches from the data loader we used on images.

But first, declare the device we are using.

In [38]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [39]:
train_iterator, valid_iterator = data.BucketIterator.splits((train, valid), batch_size = 32, 
                                                            sort_key = lambda x: len(x.tweets),
                                                            sort_within_batch=True, device = device)
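
The idea behind bucketing can be sketched without torchtext: group sequences of similar length, then pad each batch only up to its own longest sequence. This is a simplified illustration (the real iterator also shuffles); `bucket_batches` and the token ids are hypothetical, and `pad_id=1` assumes `<pad>` sits at index 1.

```python
def bucket_batches(sequences, batch_size, pad_id=1):
    # Sort by length so that each batch groups sequences of similar size.
    ordered = sorted(sequences, key=len, reverse=True)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = len(chunk[0])  # chunk is sorted, so the first item is longest
        lengths = [len(seq) for seq in chunk]
        padded = [seq + [pad_id] * (longest - len(seq)) for seq in chunk]
        batches.append((padded, lengths))
    return batches

# Hypothetical token-id sequences of different lengths.
data = [[5, 6], [7], [8, 9, 10, 11], [12, 13, 14], [15, 16, 17, 18, 19]]
batches = bucket_batches(data, batch_size=2)
for padded, lengths in batches:
    print(padded, lengths)
```

Keeping the lengths next to the padded batch, sorted in decreasing order, is what `include_lengths=True` and `sort_within_batch=True` provide for the packed-sequence step in the model.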

Save the vocabulary for later use

In [40]:
import os, pickle
with open('tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Tweet.vocab.stoi, tokens)
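
A saved mapping like this is typically loaded again at inference time to turn tokens into ids. The sketch below assumes a plain dict as a stand-in for `Tweet.vocab.stoi` and uses an in-memory buffer instead of the file; `encode` is a hypothetical helper.

```python
import io
import pickle

# Hypothetical stand-in for Tweet.vocab.stoi: a plain dict of token -> index.
stoi = {"<unk>": 0, "<pad>": 1, "good": 2, "fun": 3, ",": 4}

# Round-trip through pickle (an in-memory buffer replaces the file here).
buffer = io.BytesIO()
pickle.dump(stoi, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

def encode(tokens, vocab, unk_token="<unk>"):
    # Map tokens to ids, falling back to the unknown-word index.
    unk = vocab[unk_token]
    return [vocab.get(tok, unk) for tok in tokens]

ids = encode(["good", "fun", ",", "great"], loaded)
print(ids, len(ids))  # [2, 3, 4, 0] 4
```

The length printed alongside the ids is what the model's `text_lengths` argument expects for a single sentence.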

## Defining Our Model

We use the Embedding and LSTM modules in PyTorch to build a simple model for classifying tweets.

In this model we create three layers.
1. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding.
2. That’s then fed into 2 stacked LSTMs with 100 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). We stack 2 LSTMs because the dropout argument of nn.LSTM is only applied between layers.
3. Finally, the final hidden state of the LSTM after processing the incoming tweet is pushed through a standard fully connected layer with 14 outputs, one for each class in the label vocabulary.

In [41]:
import torch.nn as nn
import torch.nn.functional as F

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer
        self.encoder = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           dropout=dropout,
                           batch_first=True)
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances
        
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
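    def forward(self, text, text_lengths):
        
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]
      
        # packed sequence (lengths must be on the CPU and sorted in decreasing order)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
        
        packed_output, (hidden, cell) = self.encoder(packed_embedded)
        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]
        # (batch_first only affects the input and output tensors, not hidden and cell)
    
        # dense_outputs = [num layers * num directions, batch size, output dim]
        dense_outputs = self.fc(hidden)   
        
        # Final activation function softmax, applied to index 0 of the layer dimension
        # NOTE: index 0 is the first LSTM layer; hidden[-1] is the top layer.
        # NOTE: nn.CrossEntropyLoss already applies log-softmax, so it expects raw scores, not probabilities.
        output = F.softmax(dense_outputs[0], dim=1)
            
        return output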
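
The forward pass returns softmax probabilities, while nn.CrossEntropyLoss applies log-softmax to whatever it receives. The plain-Python sketch below works through the arithmetic of that combination for 14 classes; the scores are made up, and the helper functions are simplified stand-ins rather than PyTorch code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    # Cross-entropy computed from raw scores: -log(softmax(scores)[target]).
    return -math.log(softmax(scores)[target])

n_classes = 14
target = 0

# Case 1: confident raw scores (logits) for the correct class.
logits = [10.0] + [0.0] * (n_classes - 1)
loss_from_logits = cross_entropy(logits, target)

# Case 2: the same prediction turned into probabilities first, then passed
# to the loss, which applies softmax a second time.
probs = softmax(logits)
loss_from_probs = cross_entropy(probs, target)

# Best case when the loss only ever sees values in [0, 1].
floor = cross_entropy([1.0] + [0.0] * (n_classes - 1), target)

print(f"{loss_from_logits:.4f} {loss_from_probs:.4f} {floor:.4f}")
```

With probabilities as input, the loss cannot drop below ln(1 + 13/e), about 1.75, even for a perfect prediction, and a uniform guess gives ln(14), about 2.64. Returning raw scores from the top layer (`hidden[-1]`) removes that floor.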

In [42]:
# Define hyperparameters
size_of_vocab = len(Tweet.vocab)
embedding_dim = 300
num_hidden_nodes = 100
num_output_nodes = 14
num_layers = 2
dropout = 0.2

# Instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers, dropout = dropout)

In [43]:
print(model)

# No. of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

classifier(
  (embedding): Embedding(17212, 300)
  (encoder): LSTM(300, 100, num_layers=2, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=100, out_features=14, bias=True)
)
The model has 5,406,614 trainable parameters
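
The parameter count can be reproduced by hand from the layer sizes printed above. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and two bias vectors; `lstm_layer_params` is a helper named here for illustration.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, emb_dim, hidden, n_out = 17212, 300, 100, 14

embedding = vocab_size * emb_dim                      # 5,163,600
lstm = (lstm_layer_params(emb_dim, hidden)            # layer 1: 160,800
        + lstm_layer_params(hidden, hidden))          # layer 2: 80,800
fc = hidden * n_out + n_out                           # 1,414

total = embedding + lstm + fc
print(f"{total:,}")  # 5,406,614
```

Over 95% of the parameters sit in the embedding table, which is why limiting the vocabulary size has such a large effect on model size.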


## Model Training and Evaluation

First, define the optimizer and the loss function.

In [44]:
import torch.optim as optim

# define optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss()
#criterion = nn.BCEWithLogitsLoss()

# define metric
# (despite its name, this computes multi-class accuracy)
def binary_accuracy(preds, y):
    # the predicted class is the index of the highest score
    _, predictions = torch.max(preds, 1)
    
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
    
# push to cuda if available
model = model.to(device)
criterion = criterion.to(device)
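
The metric is an argmax followed by a comparison with the targets. A plain-Python sketch of the same computation, with made-up scores for three classes:

```python
def accuracy(score_rows, targets):
    # Predicted class = index of the largest score in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in score_rows]
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical scores for a batch of four examples.
scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
targets = [1, 0, 1, 0]

print(accuracy(scores, targets))  # 0.75
```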

The main thing to be aware of in this new training loop is that we have to reference `batch.tweets` and `batch.labels` to get the particular fields we’re interested in; they don’t fall out quite as nicely from the enumerator as they do in torchvision.

**Training Loop**

In [45]:
def train(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # reset the gradients before every batch
        optimizer.zero_grad()
        
        # retrieve text and no. of words
        tweet, tweet_lengths = batch.tweets
        
        # forward pass: predictions = [batch size, no. of classes]
        predictions = model(tweet, tweet_lengths).squeeze()
        
        # compute the loss
        loss = criterion(predictions, batch.labels)
        
        # compute the accuracy
        acc = binary_accuracy(predictions, batch.labels)
        
        # backpropagate the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Evaluation Loop**

In [46]:
def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            tweet, tweet_lengths = batch.tweets
            
            # forward pass: predictions = [batch size, no. of classes]
            predictions = model(tweet, tweet_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.labels)
            acc = binary_accuracy(predictions, batch.labels)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Let's Train and Evaluate**

In [47]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

	Train Loss: 2.230 | Train Acc: 55.01%
	 Val. Loss: 2.196 |  Val. Acc: 55.94% 

	Train Loss: 2.191 | Train Acc: 56.37%
	 Val. Loss: 2.195 |  Val. Acc: 55.94% 

	Train Loss: 2.191 | Train Acc: 56.39%
	 Val. Loss: 2.196 |  Val. Acc: 55.94% 

	Train Loss: 2.190 | Train Acc: 56.56%
	 Val. Loss: 2.196 |  Val. Acc: 55.94% 

	Train Loss: 2.184 | Train Acc: 57.25%
	 Val. Loss: 2.191 |  Val. Acc: 56.36% 

	Train Loss: 2.152 | Train Acc: 60.67%
	 Val. Loss: 2.184 |  Val. Acc: 57.02% 

	Train Loss: 2.111 | Train Acc: 64.69%
	 Val. Loss: 2.175 |  Val. Acc: 58.07% 

	Train Loss: 2.083 | Train Acc: 67.57%
	 Val. Loss: 2.172 |  Val. Acc: 58.10% 

	Train Loss: 2.056 | Train Acc: 70.20%
	 Val. Loss: 2.170 |  Val. Acc: 58.43% 

	Train Loss: 2.032 | Train Acc: 72.67%
	 Val. Loss: 2.158 |  Val. Acc: 59.24% 

	Train Loss: 2.011 | Train Acc: 74.79%
	 Val. Loss: 2.160 |  Val. Acc: 59.12% 

	Train Loss: 1.996 | Train Acc: 76.11%
	 Val. Loss: 2.157 |  Val. Acc: 59.33% 

	Train Loss: 1.987 | Train Acc: 77.04%
	