# Using Built-in Datasets

PyTorch provides various built-in datasets that you can use to train and test your model. These high-quality datasets are available in:

*   [torchvision (for vision)](https://pytorch.org/docs/stable/torchvision/datasets.html)
*   [torchaudio (for audio)](https://pytorch.org/audio/datasets.html)
*   [torchtext (for text)](https://pytorch.org/text/datasets.html)

You became familiar with torchvision in the previous session. This time we will use one of the torchtext datasets, the IMDB reviews dataset. Torchtext provides datasets for several kinds of text classification problems. The IMDB dataset is used for sentiment analysis: we will label each review in the dataset as positive or negative.

In [1]:
import torch
import torchtext
import torchtext.data
import torch.nn.functional as F
import matplotlib.pyplot as plt
import spacy

from torch import nn
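# Field, BucketIterator and datasets.IMDB.splits belong to the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed in 0.12), so an older torchtext is required.
from torchtext import data, datasets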
from torchtext.vocab import GloVe
from tqdm import tqdm

TEXT and LABEL are the fields that define how the review text and its label are preprocessed. You can find out more about fields [here](https://torchtext.readthedocs.io/en/latest/data.html).

In [2]:
TEXT = data.Field(lower=True, batch_first=True, fix_length=200)
LABEL = data.Field(sequential=False)
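
To make the options passed to the text field concrete, here is a minimal pure-Python sketch of what `lower=True` and `fix_length=200` amount to, assuming the default whitespace tokenizer and right-side padding. The function name `preprocess` is hypothetical and is not part of torchtext.

```python
def preprocess(text, fix_length=200, pad_token="<pad>"):
    """Lowercase, split on whitespace, then truncate or right-pad to fix_length."""
    tokens = text.lower().split()
    tokens = tokens[:fix_length]  # drop everything past fix_length
    return tokens + [pad_token] * (fix_length - len(tokens))


example = preprocess("Nicely and intelligently played", fix_length=6)
print(example)
# ['nicely', 'and', 'intelligently', 'played', '<pad>', '<pad>']
```

Because every review ends up with the same length, a batch can be stored as one rectangular tensor of token indices.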

In [3]:
train, test = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 147k/84.1M [00:00<00:57, 1.47MB/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 64.4MB/s]


In [8]:
print('train', len(train))
print('test', len(test))

train 25000
test 25000


In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print("Running on the GPU")
else:
    device = torch.device("cpu")
    print("Running on the CPU")

Running on the GPU


In [None]:
print('train.fields', train.fields)

train.fields {'text': <torchtext.data.field.Field object at 0x7f2b6f590a90>, 'label': <torchtext.data.field.Field object at 0x7f2b6f590a58>}


In [None]:
print(vars(train[0]))

{'text': ['nicely', 'and', 'intelligently', 'played', 'by', 'the', 'two', 'young', 'girls,', 'mischa', 'barton', 'as', 'frankie,', 'and', 'ingrid', 'uribe', 'as', 'hazel,', 'although', 'the', 'plot', 'is', 'rather', 'a', 'stretch', 'of', 'the', 'imagination.', 'young', 'hazel', 'running', 'for', 'mayor', 'seems', 'out', 'of', 'place,', 'to', 'be', 'honest.<br', '/><br', '/>while', 'the', 'acting', 'is', 'well', 'done', 'by', 'all', 'concerned', 'the', 'movie', 'tends', 'to', 'lack', 'a', 'genuine', 'atmosphere', 'of', 'drama.', 'perhaps', "we've", 'grown', 'to', 'expect', 'gritty', 'reality', 'in', 'movies,', 'rather', 'like', 'comparing', 'pollyanna', 'to', 'how', 'green', 'was', 'my', 'valley!', 'never', 'mind,', 'each', 'of', 'them', 'are', 'good', 'in', 'their', 'own', 'way.<br', '/><br', '/>i', 'do', 'admire', 'joan', 'plowright', 'even', 'if', 'her', 'role', 'is', 'somewhat', 'subdued', 'here.', 'middle', 'of', 'the', 'road', 'entertainment', 'well', 'suited', 'for', 'younger', '

Using pretrained word embeddings helps a lot. You don't have to reinvent the wheel: you can use rich pretrained word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/).

In [None]:
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300), max_size=10000, min_freq=10)
LABEL.build_vocab(train)
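
The effect of `max_size` and `min_freq` can be illustrated with a small pure-Python sketch. It assumes two special tokens (`<unk>` and `<pad>`) are placed first, which is consistent with the `Embedding(10002, 300)` size printed later in this notebook. The helper `build_vocab` below is hypothetical and only mimics the idea, not the library's implementation.

```python
from collections import Counter


def build_vocab(token_lists, max_size=10000, min_freq=10, specials=("<unk>", "<pad>")):
    """Keep at most max_size tokens seen at least min_freq times, after the specials."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # most frequent first, ties broken alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)
    for tok, freq in ordered:
        if freq < min_freq or len(itos) >= max_size + len(specials):
            break
        itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
itos, stoi = build_vocab(docs, max_size=2, min_freq=2)
print(itos)
# ['<unk>', '<pad>', 'good', 'movie']

# words outside the vocabulary fall back to the <unk> index
print(stoi.get("plot", stoi["<unk>"]))
# 0
```

Under this assumption, the zeros that appear in a batch of token indices correspond to words that did not make it into the vocabulary, and the ones correspond to padding.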

Then you can look at the embedding vectors of the words in the vocabulary.

In [None]:
print(TEXT.vocab.vectors)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0466,  0.2132, -0.0074,  ...,  0.0091, -0.2099,  0.0539],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.7724, -0.1800,  0.2072,  ...,  0.6736,  0.2263, -0.2919],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])


In [None]:
print(TEXT.vocab.stoi)



Now we can define iterators over the training and test data.

In [None]:
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size=128, device=device, shuffle=True)

In [None]:
batch = next(iter(train_iter))
batch.text

tensor([[ 222,    0,    7,  ...,   17,  167,  396],
        [   9,  200,    0,  ...,    1,    1,    1],
        [   9,   62,  317,  ..., 3869,    0,    7],
        ...,
        [   0,  105,    0,  ...,  177,    0,   22],
        [ 819,    0, 1619,  ...,    1,    1,    1],
        [   0,  173,  274,  ...,    1,    1,    1]], device='cuda:0')

In [None]:
batch.label

tensor([2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
        2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2,
        1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1,
        2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2,
        1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2,
        2, 1, 1, 2, 1, 1, 1, 1], device='cuda:0')

# RNN-based Model for Classification

We are ready to define our RNN-based model. torch.nn provides [RNN](https://pytorch.org/docs/master/generated/torch.nn.RNN.html) and [LSTM](https://pytorch.org/docs/master/generated/torch.nn.LSTM.html) modules that can act as the core of RNN-based models. We are going to use an LSTM, which takes an embedding vector as input at each step and, based on it and the previous hidden state, produces the updated hidden state and the output. Here we only want a single true/false output, so we take the final hidden state, apply dropout, pass it through a fully connected layer, and feed the result to a sigmoid activation to obtain the final output.

In [None]:
class SentimentLSTM(nn.Module):    

    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_size, n_layers, drop_prob=0.3):        
        super().__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
    

    def forward(self, x):
        # x: (batch, seq_len) token indices -> embeds: (batch, seq_len, embedding_dim)
        embeds = self.embedding(x)
        # hidden is the tuple (h_n, c_n); h_n has shape (n_layers, batch, hidden_dim)
        lstm_out, hidden = self.lstm(embeds)
        # keep the final hidden state of the last LSTM layer: (batch, hidden_dim)
        hidden = hidden[0][-1].view(-1, self.hidden_dim)
        # dropout and fully-connected layer
        out = self.dropout(hidden)
        out = self.fc(out)
        # sigmoid squashes the output into (0, 1)
        sig_out = self.sig(out)
        # return one probability per review, shape (batch,)
        return sig_out.view(-1)
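
The indexing `hidden[0][-1]` in `forward` is easier to follow with concrete shapes. The sketch below uses small, arbitrary dimensions (not the ones used in this notebook) and checks that, for a unidirectional LSTM with `batch_first=True`, the final hidden state of the last layer equals the LSTM output at the last time step.

```python
import torch
from torch import nn

torch.manual_seed(0)

# illustrative sizes only
batch_size, seq_len, embedding_dim, hidden_dim, n_layers = 4, 7, 5, 3, 2

lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
embeds = torch.randn(batch_size, seq_len, embedding_dim)

# lstm_out holds the top layer's output at every time step;
# h_n and c_n hold the final hidden and cell state of every layer
lstm_out, (h_n, c_n) = lstm(embeds)
print(lstm_out.shape)  # torch.Size([4, 7, 3])
print(h_n.shape)       # torch.Size([2, 4, 3])

# final hidden state of the last layer, one vector per sequence
last_hidden = h_n[-1]
same = torch.allclose(last_hidden, lstm_out[:, -1, :])
print(same)  # True
```

Note that with fixed-length padding the last time step may be a padding token, so the final hidden state also reflects whatever the LSTM does with the padded positions.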

In the next block we instantiate the model with the chosen hyperparameters. Feel free to experiment with different values.

In [None]:
vocab_size = len(TEXT.vocab)
embedding_dim = 300
output_size = 1
hidden_dim = 256
n_layers = 2

net = SentimentLSTM(vocab_size, embedding_dim, hidden_dim, output_size, n_layers)
net.embedding.weight.data = TEXT.vocab.vectors
net = net.to(device)
print(net)

SentimentLSTM(
  (embedding): Embedding(10002, 300)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


Next we define the loss function and the optimizer. Since the desired output is true/false, the binary cross-entropy loss is an appropriate choice.

In [None]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
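
As a small worked example of what `nn.BCELoss` computes, the sketch below compares it with the formula written out by hand on made-up probabilities and labels. It also evaluates the loss of a model that always predicts 0.5, which is ln 2 ≈ 0.693. That value is consistent with the plateau visible in the first epochs of the training log further down.

```python
import math

import torch
from torch import nn

# made-up predicted probabilities and 0/1 labels
probs = torch.tensor([0.9, 0.2, 0.5])
labels = torch.tensor([1.0, 0.0, 1.0])

# binary cross-entropy written out: -mean(y*log(p) + (1-y)*log(1-p))
manual = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()
builtin = nn.BCELoss()(probs, labels)
print(torch.allclose(manual, builtin))  # True

# a model that always outputs 0.5 has loss ln(2), whatever the labels are
chance = nn.BCELoss()(torch.full((4,), 0.5), torch.tensor([0.0, 1.0, 1.0, 0.0]))
print(round(chance.item(), 4), round(math.log(2), 4))  # 0.6931 0.6931
```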

Finally we are ready to train our model and test it on the test data.

In [None]:
n_epoch = 20
print_every = 100

In [None]:
for epoch in range(n_epoch):
    running_loss = 0
    i = 0    
    for batch in tqdm(train_iter, desc='Training epoch ' + str(epoch + 1) + '', position=0):
        i += 1
        inputs = batch.text        
        labels = batch.label.float() - 1        
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        
        outputs = net(inputs)        
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print('\nloss: %.3f' % (running_loss / len(train_iter)), flush=True, end='')
    running_loss = 0

Training epoch 1: 100%|██████████| 196/196 [00:32<00:00,  6.00it/s]


loss: 0.691


Training epoch 2: 100%|██████████| 196/196 [00:32<00:00,  6.01it/s]


loss: 0.679


Training epoch 3: 100%|██████████| 196/196 [00:32<00:00,  6.02it/s]


loss: 0.692


Training epoch 4: 100%|██████████| 196/196 [00:32<00:00,  6.02it/s]


loss: 0.693


Training epoch 5: 100%|██████████| 196/196 [00:32<00:00,  6.03it/s]


loss: 0.690


Training epoch 6: 100%|██████████| 196/196 [00:32<00:00,  6.03it/s]


loss: 0.632


Training epoch 7: 100%|██████████| 196/196 [00:32<00:00,  6.02it/s]


loss: 0.688


Training epoch 8: 100%|██████████| 196/196 [00:32<00:00,  6.03it/s]


loss: 0.682


Training epoch 9: 100%|██████████| 196/196 [00:32<00:00,  6.02it/s]


loss: 0.413


Training epoch 10: 100%|██████████| 196/196 [00:32<00:00,  6.02it/s]


loss: 0.270


Training epoch 11: 100%|██████████| 196/196 [00:32<00:00,  6.03it/s]


loss: 0.206


Training epoch 12: 100%|██████████| 196/196 [00:32<00:00,  6.01it/s]


loss: 0.157


Training epoch 13: 100%|██████████| 196/196 [00:32<00:00,  6.03it/s]


loss: 0.116


Training epoch 14: 100%|██████████| 196/196 [00:32<00:00,  6.01it/s]


loss: 0.088


Training epoch 15: 100%|██████████| 196/196 [00:32<00:00,  5.99it/s]


loss: 0.071


Training epoch 16: 100%|██████████| 196/196 [00:32<00:00,  5.99it/s]


loss: 0.063


Training epoch 17: 100%|██████████| 196/196 [00:32<00:00,  5.99it/s]


loss: 0.058


Training epoch 18: 100%|██████████| 196/196 [00:32<00:00,  5.99it/s]


loss: 0.051


Training epoch 19: 100%|██████████| 196/196 [00:32<00:00,  5.99it/s]


loss: 0.050


Training epoch 20: 100%|██████████| 196/196 [00:32<00:00,  5.99it/s]


loss: 0.046




In [None]:
correct = 0
total = 0
class_correct = [0 for i in range(2)]
class_total = [0 for i in range(2)]

# Note: net.eval() is not called here, so dropout stays active during testing;
# call net.eval() first for deterministic evaluation.
with torch.no_grad():
    for batch in tqdm(test_iter, desc='Test', position=0):
        inputs = batch.text
        labels = batch.label.float() - 1
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = net(inputs)
        # round the probabilities to 0/1 and compare with the labels
        c = (torch.round(outputs) == labels).squeeze()
        total += labels.size(0)
        # .item() keeps correct a Python int, so the accuracy below is a true division
        correct += c.sum().item()
        for i in range(labels.size(0)):
            label = labels[i].int()
            class_correct[label] += c[i].item()
            class_total[label] += 1
print()
# Class i corresponds to LABEL.vocab.itos[i + 1]; check it before trusting the hard-coded '+'/'-' names.
for i in range(2):
    print('Accuracy of %5s : %2d %%' % (('+' if i == 0 else '-'), 100 * class_correct[i] / class_total[i]))
    print(class_correct[i], class_total[i])


print('%d / %d' % (correct, total))
print('Accuracy: %0.2f' % (100 * correct / total))

Test: 100%|██████████| 196/196 [00:15<00:00, 12.96it/s]


Accuracy of     + : 79 %
9953 12500
Accuracy of     - : 86 %
10834 12500
20787 / 25000
Accuracy: 83.15





# Using Different Loss Functions

In this problem we had to deal with a binary output, so we used the binary cross-entropy loss function. For categorical outputs, cross-entropy is an appropriate choice. For continuous outputs, mean squared error is a common choice, but there are more complex loss functions. [Here](https://pytorch.org/docs/master/nn.html#loss-functions!) you can read about the built-in loss functions provided in torch.nn.
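
The two other cases mentioned above can be illustrated with a short sketch on made-up tensors: `nn.CrossEntropyLoss` for a three-class output (it takes raw logits and integer class indices) and `nn.MSELoss` for a continuous output. The manual computations are included only to show what each loss evaluates.

```python
import torch
from torch import nn

# categorical output: raw logits for 3 classes, integer class labels
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
classes = torch.tensor([0, 2])
ce = nn.CrossEntropyLoss()(logits, classes)
# same thing by hand: negative log-softmax of the correct class, averaged
manual_ce = -torch.log_softmax(logits, dim=1)[torch.arange(2), classes].mean()
print(torch.allclose(ce, manual_ce))  # True

# continuous output: mean of squared differences
preds = torch.tensor([2.5, 0.0, 1.0])
targets = torch.tensor([3.0, -0.5, 1.0])
mse = nn.MSELoss()(preds, targets)
manual_mse = ((preds - targets) ** 2).mean()  # (0.25 + 0.25 + 0) / 3
print(torch.allclose(mse, manual_mse))  # True
```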

# Writing Custom Loss Function

Sometimes you want to define your own loss function that fits the problem better. This is easy to do: you only have to define a function that takes a batch of outputs and labels and returns the corresponding loss. Autograd then calculates the gradients with respect to your loss function, provided the loss is built from differentiable tensor operations. For example, in text classification problems it may be useful to define a custom loss function that ignores padding entries. There is such an example [here](https://cs230.stanford.edu/blog/namedentity/).
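
A minimal sketch of such a padding-aware loss is shown below. It is not taken from the linked post: the function name `masked_cross_entropy` and the choice of `-1` as the padding label are assumptions made for illustration. The loss is computed per entry, padded entries are zeroed out with a mask, and the sum is divided by the number of real entries.

```python
import torch
import torch.nn.functional as F


def masked_cross_entropy(logits, targets, pad_index=-1):
    """logits: (N, C); targets: (N,), where pad_index marks entries to ignore."""
    mask = targets != pad_index
    # replace padding labels by a valid class so cross_entropy can be evaluated
    safe_targets = targets.masked_fill(~mask, 0)
    per_entry = F.cross_entropy(logits, safe_targets, reduction="none")
    # average over the non-padded entries only
    return (per_entry * mask).sum() / mask.sum().clamp(min=1)


torch.manual_seed(0)
logits = torch.randn(5, 3, requires_grad=True)
targets = torch.tensor([0, 2, -1, 1, -1])  # two padded entries

loss = masked_cross_entropy(logits, targets)
loss.backward()

# padded rows receive no gradient
print(logits.grad[2].abs().sum().item(), logits.grad[4].abs().sum().item())  # 0.0 0.0
```

For this particular case the built-in `ignore_index` argument of `F.cross_entropy` gives the same value; writing the mask by hand is mainly useful when the loss needs further customisation.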

### Sample Loss Function with ROC-AUC

You can read more about ROC-AUC [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).

In [13]:
from sklearn.metrics import roc_auc_score
from torch import nn
from torch.nn.modules.loss import _Loss

class ROC_AUCLoss(_Loss):
    def __init__(self, threshold):
        super(ROC_AUCLoss, self).__init__()
        self.threshold = threshold

    def forward(self, input, target):
        # MSE is the only differentiable term, so it alone drives the gradients
        mse_loss = nn.MSELoss()(input, target)
        # binarize predictions and labels at the threshold
        y_pred = (input >= self.threshold).float().tolist()
        y_true = (target >= self.threshold).float().tolist()
        try:
            # roc_auc_score expects (y_true, y_score); the result is a plain float
            auc = roc_auc_score(y_true, y_pred)
        except ValueError:
            # raised when the batch contains only one class
            auc = 0.0
        # subtracting the AUC lowers the reported loss but contributes no gradient
        return mse_loss - auc
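
An AUC computed from thresholded predictions or through scikit-learn is not connected to the autograd graph, so it cannot influence the gradients. The sketch below illustrates this. It also shows one possible differentiable alternative, a pairwise sigmoid surrogate. This is a swapped-in technique, not the method used in the class above, and both function names are hypothetical. Each function assumes the batch contains at least one positive and one negative example.

```python
import torch


def pairwise_auc(scores, targets):
    """Exact AUC: fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos, neg = scores[targets == 1], scores[targets == 0]
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)  # all positive-negative score differences
    return ((diff > 0).float() + 0.5 * (diff == 0).float()).mean()


def soft_auc_loss(scores, targets):
    """Differentiable surrogate: the step function is replaced by a sigmoid."""
    pos, neg = scores[targets == 1], scores[targets == 0]
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)
    return 1.0 - torch.sigmoid(diff).mean()


scores = torch.tensor([0.9, 0.3, 0.6, 0.2], requires_grad=True)
targets = torch.tensor([1.0, 1.0, 0.0, 0.0])

auc = pairwise_auc(scores, targets)
print(auc.item())         # 0.75 (3 of the 4 pairs are ordered correctly)
print(auc.requires_grad)  # False: the comparison cuts the autograd graph

loss = soft_auc_loss(scores, targets)
loss.backward()
print(scores.grad is not None)  # True: the surrogate does provide gradients
```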

## Acknowledgements

Most of this notebook has been adapted from the PyTorch documentation and the CE550 DL workshops in Spring 2020.

