# ICE - 7.
# Text Classification with TorchText
==================================

This tutorial shows how to use the text classification datasets
in ``torchtext``, including

::

   - AG_NEWS,
   - SogouNews,
   - DBpedia,
   - YelpReviewPolarity,
   - YelpReviewFull,
   - YahooAnswers,
   - AmazonReviewPolarity,
   - AmazonReviewFull

You can find these datasets with a Google search; they are available for free.

This example shows how to train a supervised learning algorithm for
classification using one of these ``TextClassification`` datasets.

Load data with ngrams
---------------------

Generally speaking, we first need to do preprocessing for any NLP tasks.

Here is a checklist of things to keep in mind:

Build and preprocess dataset:

- Segment sentences. Segment words into subwords or characters?
- Convert words to lower case?
- Delete stop words?
- Create special tokens (e.g. [UNK], [BOS], [EOS], [PAD])?

Build vocabulary:

- Discard words whose frequencies are below a threshold?
- Build a map from word string to index in the embedding table (str -> int)
- Build the label vocabulary
- Numericalize words (transform a list of words into a list of numbers)
- Choose whether to pad (using [PAD])
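
The vocabulary steps in the checklist can be sketched in plain Python. This is only an illustration, not what ``torchtext`` does internally: the helper names ``build_vocab`` and ``numericalize``, the ``min_freq`` parameter, and the choice to put the special tokens first are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1, specials=("[PAD]", "[UNK]")):
    """Map each sufficiently frequent token to an integer id; specials come first."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    # Sort by descending frequency, then alphabetically, so the ids are deterministic
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}

def numericalize(tokens, vocab, unk="[UNK]"):
    """Turn a list of tokens into a list of ids, mapping unseen tokens to [UNK]."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

texts = ["I love python", "I love machine learning"]
tokenized = [t.lower().split() for t in texts]   # lower-case and split on whitespace
vocab = build_vocab(tokenized, min_freq=2)       # drop tokens seen fewer than 2 times
ids = numericalize(tokenized[0], vocab)          # "python" is rare, so it maps to [UNK]
print(vocab, ids)
```

Raising ``min_freq`` shrinks the embedding table at the cost of sending more words to ``[UNK]``.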


For this specific task, a bag-of-ngrams feature is applied to capture some partial information
about the local word order. In practice, bi-grams or tri-grams are used
because word groups carry more information than single words. An example:

::

   For text: **"load data with ngrams"**  
   1-gram results: "load", "data", "with", "ngrams"  
   Bi-grams results: "load data", "data with", "with ngrams"  
   Tri-grams results: "load data with", "data with ngrams"

The ``TextClassification`` dataset supports the ``ngrams`` argument. With
``ngrams`` set to 2, each example text in the dataset becomes a list of single
words plus bi-gram strings. Note that the cell below sets ``NGRAMS = 1``, so only single words are used there.
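
As a rough illustration of what "single words plus bi-grams" means, here is a small plain-Python sketch. The function name ``add_ngrams`` is hypothetical and is not the ``torchtext`` implementation; it only reproduces the example shown earlier.

```python
def add_ngrams(tokens, max_n):
    """Return the tokens followed by every n-gram for n = 2..max_n, joined by spaces."""
    out = list(tokens)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = "load data with ngrams".split()
bigram_features = add_ngrams(tokens, 2)    # single words plus bi-grams
trigram_features = add_ngrams(tokens, 3)   # additionally includes tri-grams
print(bigram_features)
```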




In [None]:
%matplotlib inline
!pip install "torch>=1.3.1"
!pip install torchtext==0.4

Collecting torchtext==0.4
  Downloading torchtext-0.4.0-py3-none-any.whl (53 kB)
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.10.0
    Uninstalling torchtext-0.10.0:
      Successfully uninstalled torchtext-0.10.0
Successfully installed torchtext-0.4.0


In [None]:
import torch
import torchtext
from torchtext.datasets import text_classification
NGRAMS = 1
import os
# Create the download directory if it does not exist yet
os.makedirs('./.data', exist_ok=True)
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ag_news_csv.tar.gz: 100%|██████████| 11.8M/11.8M [00:00<00:00, 72.8MB/s]
120000lines [00:05, 22278.44lines/s]
120000lines [00:10, 11080.01lines/s]
7600lines [00:00, 11147.45lines/s]


In [None]:
print('One item of training data:', train_dataset[0] ) #(label id, token id tensor)
print('Vocabulary size (including ngrams):', len(train_dataset.get_vocab()))
print('Class size:', len(train_dataset.get_labels()))

One item of training data: (2, tensor([  432,   426,     2,  1606, 14839,   114,    67,     3,   849,    14,
           28,    15,    28,    16, 50726,     4,   432,   375,    17,    10,
        67508,     7, 52259,     4,    43,  4010,   784,   326,     2]))
Vocabulary size (including ngrams): 95812
Class size: 4


Define the model
----------------

The model is composed of the
`EmbeddingBag <https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag>`__
layer and the linear layer (see the figure below). ``nn.EmbeddingBag``
computes the mean value of a “bag” of embeddings. The text entries here
have different lengths. ``nn.EmbeddingBag`` requires no padding here
since the text lengths are saved in offsets.

Additionally, since ``nn.EmbeddingBag`` accumulates the average across
the embeddings on the fly, it does not need to materialize the intermediate
per-token embeddings, which improves speed and memory efficiency when processing a sequence of tensors.

![](https://github.com/teohangxanh/5290/blob/_static/img/text_sentiment_ngrams_model.png?raw=1)

To make ``offsets`` easier to understand, here is a brief explanation.

The values returned by the ``generate_batch`` function (``text`` and ``offsets``) are used directly as the input of the model. In the forward pass, they are passed as arguments to ``self.embedding``.

``text`` is a tensor of shape (N,) that is treated as a concatenation of multiple bags (sequences). ``offsets`` must be a 1D tensor containing the starting index of each bag in ``text``. Therefore, for ``offsets`` of shape (B,), ``text`` is viewed as having B bags, where B is the batch size.

For example, assume the batch size is 2. In one batch, we have 2 sentences: "I love python" and "I love machine learning".

We first create a bag of words for each sentence: ["I", "love", "python"] and ["I", "love", "machine", "learning"]. These are then converted to word indices according to the vocabulary: tensor([2,1,5]) and tensor([2,1,4,7]).
We then concatenate tensor([2,1,5]) and tensor([2,1,4,7]) to get the ``text`` tensor([2,1,5,2,1,4,7]) of shape (N,) (N=7). We also need to create the ``offsets`` tensor([0,3]) of shape (B,) (B=2). In tensor([0,3]), 0 is the starting index of the first sentence in tensor([2,1,5,2,1,4,7]), and 3 is the starting index of the second sentence.
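
To see how ``text`` and ``offsets`` work together, the sketch below mimics mean pooling over bags in plain Python, using the same indices as the example above. It is only a mental model of the computation, not PyTorch code: ``mean_bags`` is a hypothetical helper, and the toy embedding table (row ``i`` is ``[i, 10 * i]``) is invented so the averages are easy to check by hand.

```python
def mean_bags(text, offsets, table):
    """Average the embedding rows of each bag; bag b spans text[offsets[b]:offsets[b+1]]."""
    ends = list(offsets[1:]) + [len(text)]   # the last bag runs to the end of text
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [table[i] for i in text[start:end]]
        dim = len(rows[0])
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Toy embedding table with 8 rows of dimension 2 (an assumption for this example)
table = [[float(i), 10.0 * i] for i in range(8)]
text = [2, 1, 5, 2, 1, 4, 7]   # both sentences concatenated, N = 7
offsets = [0, 3]               # start index of each sentence, B = 2
pooled = mean_bags(text, offsets, table)
print(pooled)                  # one averaged vector per sentence, shape (B, embed_dim)
```

No padding is needed because each bag's length is implied by the gap between consecutive offsets.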





In [None]:
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class): # Initialize modules.
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True) # In the default "mean" mode this is equivalent to nn.Embedding followed by torch.mean(dim=0) over each bag
        self.fc = nn.Linear(embed_dim, num_class) # Use a linear layer here.
        self.init_weights()

    def init_weights(self):  # Randomly initialize parameters
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange) # Uniform distribution
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()  # Set the bias to zero

    def forward(self, text, offsets): # Forward pass
        # text has shape (N,) and is treated as a concatenation of multiple bags (sequences).
        # offsets must be a 1D tensor containing the starting index of each bag in text.
        # Therefore, for offsets of shape (B,), text is viewed as having B bags.
        embedded = self.embedding(text, offsets) # output (B, embed_dim)
        return self.fc(embedded) # output (B, num_class)

Initiate an instance
--------------------

The AG_NEWS dataset has four labels and therefore the number of classes
is four.

::

   1 : World
   2 : Sports
   3 : Business
   4 : Sci/Tec

The vocab size is equal to the length of the vocabulary (including single words
and ngrams). The number of classes is equal to the number of labels,
which is four in the AG_NEWS case.




In [None]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

Functions used to generate batch
--------------------------------




Since the text entries have different lengths, a custom function
``generate_batch()`` is used to generate data batches and offsets. The
function is passed to ``collate_fn`` in ``torch.utils.data.DataLoader``.
The input to ``collate_fn`` is a list of ``batch_size`` dataset items,
each a ``(label, token id tensor)`` pair, and the ``collate_fn`` function
packs them into a mini-batch. Make sure that ``collate_fn`` is
declared as a top-level ``def``. This ensures that the function is available
in each worker.

The text entries in the original data batch are packed into a list
and concatenated into a single tensor that serves as the input of ``nn.EmbeddingBag``.
``offsets`` is a tensor of delimiters that marks the beginning index
of each individual sequence in the text tensor. ``label`` is a tensor holding
the labels of the individual text entries.
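
The bookkeeping described above can be sketched without tensors. The snippet below uses plain lists in place of tensors and a hypothetical helper ``make_offsets``; it shows the idea behind the collate step under those assumptions, not the notebook's actual function.

```python
from itertools import accumulate

def make_offsets(sequences):
    """Start index of each sequence once all sequences are concatenated."""
    lengths = [len(s) for s in sequences]
    # Running total of lengths, shifted right by one: the first bag starts at 0
    return [0] + list(accumulate(lengths))[:-1]

# A batch of (label, token ids) pairs; lists stand in for tensors here
batch = [(1, [2, 4, 3]), (0, [6, 5])]
labels = [label for label, _ in batch]
text = [tok for _, seq in batch for tok in seq]       # concatenated token ids
offsets = make_offsets([seq for _, seq in batch])     # where each entry begins
print(text, offsets, labels)
```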




In [None]:
def generate_batch(batch): 
    # Input: a list of batch_size items. For example: [(1, tensor([2,4,3])), (0, tensor([6,5]))]
    # Generate a batch used in SGD
    label = torch.tensor([entry[0] for entry in batch]) # tensor of shape (batch_size,)
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum
    # of elements in the dimension dim, e.g.
    # torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0) gives tensor([1., 3., 6.])

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0) # For example, tensor([0,3])
    text = torch.cat(text) # a list of tensors -> one tensor of shape (sum(len(t) for t in text),). For example, tensor([2,4,3,6,5])
    return text, offsets, label

Define functions to train the model and evaluate results.
---------------------------------------------------------




`torch.utils.data.DataLoader <https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader>`__
is recommended for PyTorch users, and it makes parallel data loading
easy (a tutorial is
`here <https://pytorch.org/tutorials/beginner/data_loading_tutorial.html>`__).
We use ``DataLoader`` here to load the AG_NEWS dataset and send it to the
model for training/validation.




In [None]:
from torch.utils.data import DataLoader
BATCH_SIZE = 32

def train_func(sub_train_):

    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)  # Iterable batches
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad() # Reset the gradients left over from the previous step
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls) # Forward pass to compute the loss
        # .item() extracts the Python number from a one-element tensor; it is used for printing later
        train_loss += loss.item()
        loss.backward() # Backpropagation to compute the gradient of each parameter
        optimizer.step() # Update parameters according to the gradients
        # Choose the class with the highest score as the prediction and compare it with the gold label (cls)
        train_acc += (output.argmax(1) == cls).sum().item()
        
    # Adjust the learning rate. After each epoch, do learning rate decay ( optional )
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_) #return average loss and acc to print

def test(data_):
    # Similar to train_func, but without backpropagation or parameter updates
    total_loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad(): # Disable gradient tracking; backward() cannot be used here
            output = model(text, offsets)
            batch_loss = criterion(output, cls)
            # Accumulate into a separate variable so the running total is not overwritten by the batch loss
            total_loss += batch_loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return total_loss / len(data_), acc / len(data_)

Split the dataset and run the model
-----------------------------------

Since the original AG_NEWS has no validation dataset, we split the training
dataset into train/valid sets with a split ratio of 0.95 (train) and
0.05 (valid). Here we use the
`torch.utils.data.dataset.random_split <https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split>`__
function in the PyTorch core library.
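
The arithmetic behind the 0.95/0.05 split can be sketched as follows. This plain-Python version only illustrates the idea of shuffling indices and cutting them into two disjoint sets; ``split_indices`` and its ``seed`` parameter are hypothetical and do not reproduce the exact behaviour of ``random_split``.

```python
import random

def split_indices(n_items, train_ratio, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into a train part and a valid part."""
    train_len = int(n_items * train_ratio)
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # local generator, so global random state is untouched
    return indices[:train_len], indices[train_len:]

# AG_NEWS has 120000 training lines, as reported by the download log above
train_idx, valid_idx = split_indices(120000, 0.95)
print(len(train_idx), len(valid_idx))
```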

The `CrossEntropyLoss <https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class.
It is useful when training a classification problem with C classes.
`SGD <https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html>`__
implements the stochastic gradient descent method as the optimizer. The initial
learning rate is set to 4.0.
`StepLR <https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR>`__
is used here to multiply the learning rate by a fixed factor (``gamma``) after each epoch.
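
Both ideas in this paragraph reduce to short formulas, sketched here in plain Python for a single example. The functions ``cross_entropy`` and ``step_lr`` are hypothetical helpers written for illustration; they follow the standard definitions (log-softmax followed by negative log-likelihood, and a learning rate multiplied by ``gamma`` every ``step_size`` epochs) rather than calling PyTorch.

```python
import math

def cross_entropy(logits, target):
    """Log-softmax followed by negative log-likelihood for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_sum_exp for x in logits]
    return -log_probs[target]

def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps: decays by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)

# With identical logits over 4 classes the model is maximally unsure, so the loss is log(4)
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
# Schedule for the settings used in this notebook: lr=4.0, step_size=1, gamma=0.9
lrs = [step_lr(4.0, 0.9, 1, epoch) for epoch in range(5)]
print(loss, lrs)
```

With ``step_size=1`` the schedule is a plain exponential decay, so the learning rate at epoch 5 is about two thirds of its starting value.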




In [None]:
import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

# Use CrossEntropyLoss() as the criterion.
# Its input is the raw output of the model: it applies log-softmax first, then computes the negative log-likelihood loss.
criterion = torch.nn.CrossEntropyLoss().to(device)
# Use SGD as the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
# Decay the learning rate exponentially: multiply it by gamma after every epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
# Split the whole training dataset to create a validation (hold-out) dataset
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])
for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60
    
    #Print information to monitor the training process
    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 8 seconds
	Loss: 0.0135(train)	|	Acc: 84.5%(train)
	Loss: 0.0000(valid)	|	Acc: 87.1%(valid)
Epoch: 2  | time in 0 minutes, 8 seconds
	Loss: 0.0084(train)	|	Acc: 91.0%(train)
	Loss: 0.0000(valid)	|	Acc: 89.4%(valid)
Epoch: 3  | time in 0 minutes, 8 seconds
	Loss: 0.0071(train)	|	Acc: 92.2%(train)
	Loss: 0.0000(valid)	|	Acc: 88.8%(valid)
Epoch: 4  | time in 0 minutes, 7 seconds
	Loss: 0.0063(train)	|	Acc: 93.2%(train)
	Loss: 0.0000(valid)	|	Acc: 89.8%(valid)
Epoch: 5  | time in 0 minutes, 7 seconds
	Loss: 0.0057(train)	|	Acc: 93.7%(train)
	Loss: 0.0000(valid)	|	Acc: 91.0%(valid)


Running the model on GPU with the following information:

Epoch: 1 \| time in 0 minutes, 11 seconds

::

       Loss: 0.0263(train)     |       Acc: 84.5%(train)
       Loss: 0.0001(valid)     |       Acc: 89.0%(valid)


Epoch: 2 \| time in 0 minutes, 10 seconds

::

       Loss: 0.0119(train)     |       Acc: 93.6%(train)
       Loss: 0.0000(valid)     |       Acc: 89.6%(valid)


Epoch: 3 \| time in 0 minutes, 9 seconds

::

       Loss: 0.0069(train)     |       Acc: 96.4%(train)
       Loss: 0.0000(valid)     |       Acc: 90.5%(valid)


Epoch: 4 \| time in 0 minutes, 11 seconds

::

       Loss: 0.0038(train)     |       Acc: 98.2%(train)
       Loss: 0.0000(valid)     |       Acc: 90.4%(valid)


Epoch: 5 \| time in 0 minutes, 11 seconds

::

       Loss: 0.0022(train)     |       Acc: 99.0%(train)
       Loss: 0.0000(valid)     |       Acc: 91.0%(valid)




Evaluate the model with test dataset
------------------------------------




In [None]:
print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0002(test)	|	Acc: 90.4%(test)


Checking the results of test dataset…

::

       Loss: 0.0237(test)      |       Acc: 90.5%(test)




Test on a random news article
-----------------------------

Use the best model so far to classify a golf news article. The label information is
available
`here <https://pytorch.org/text/datasets.html?highlight=ag_news#torchtext.datasets.AG_NEWS>`__.




# **Tasks for Today's ICE**
## *All implementation should be performed in PyTorch/torchtext*

In [None]:
# The code for the models is provided as an attached zip file. Use it as a reference to perform classification, measure accuracy, and then test any random piece of text
# as test data to identify what the article is talking about, e.g. "This is Political News". You can use any of the datasets introduced at the start.
# All datasets are available for free via a Google search.
# The final step is to compare the accuracies and discuss why one model performs better than the others.

Answer: Attention models usually give the highest accuracy because the attention function gives more weight to the input states that are most relevant to the current context. The weights over the inputs are learned, so the model learns which inputs it should attend to.
Self-attention, a variant of attention, is worth considering for three reasons:
* It minimizes the total computational complexity per layer.
* It maximizes the amount of parallelizable computation, measured by the minimum number of sequential operations required.
* It minimizes the maximum path length between any two input and output positions in a network composed of the different layer types. The shorter the path between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.
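
To make "giving more weight to some input states" concrete, here is a small plain-Python sketch of dot-product attention over a few hidden states. The helper names ``softmax`` and ``attend`` and the toy vectors are assumptions made for illustration; in a trained model the hidden states and the query would come from the network rather than being written by hand.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, query):
    """Weight each hidden state by its dot-product similarity to the query."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    # Context vector: weighted sum of the hidden states
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(len(query))]
    return weights, context

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per input position
query = [1.0, 1.0]                                     # e.g. a final hidden state
weights, context = attend(hidden_states, query)
print(weights, context)
```

The third state is the most similar to the query, so it receives the largest weight and dominates the context vector.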

In [None]:
import pandas as pd
!wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
!unzip glove.6B.zip

glove = pd.read_csv('glove.6B.50d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove_embedding = {key: val.values for key, val in glove.T.items()}

--2021-10-19 19:59:18--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2021-10-19 19:59:18--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2021-10-19 19:59:19--  http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [applic

In [None]:
ex_text_str = "The surface area of a lone black hole won’t change — after all,\
 nothing can escape from within. However, if you throw something into a black \
 hole, it will gain more mass, increasing its surface area. But the incoming \
 object could also make the black hole spin, which decreases the surface area.\
  The area law says that the increase in surface area due to additional mass \
  will always outweigh the decrease in surface area due to added spin."

In [None]:
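import numpy as np
import torch

matrix_len = 2000   # keep only the first 2000 vocabulary entries
glove_dim = 50      # glove.6B.50d vectors have 50 dimensions
weights_matrix = np.zeros((matrix_len, glove_dim))
words_found = 0
# Vocab.itos is the index-to-token list; iterating the Vocab object itself does not yield tokens
for i, word in enumerate(train_dataset.get_vocab().itos[:matrix_len]):
    try:
        weights_matrix[i] = glove_embedding[word]
        words_found += 1
    except KeyError:
        # Token missing from GloVe: fall back to a random vector
        weights_matrix[i] = np.random.normal(scale=0.6, size=(glove_dim, ))

# nn.Embedding expects float32 weights, while NumPy defaults to float64
weights_matrix = torch.from_numpy(weights_matrix).float()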

In [None]:
def define(model):
  # Build the optimizer and the loss, and move the model to the GPU when one is available
  optimizer = torch.optim.Adam(model.parameters(), lr=0.07)
  criterion = nn.CrossEntropyLoss()
  if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
  print('Print model')
  print(model)
  return model, optimizer, criterion

# Loss history, filled in by train()
train_losses, val_losses = [], []

def train(epoch, model, optimizer, criterion, X_train, y_train, X_val, y_val):
  model.train()
  if torch.cuda.is_available():
    X_train = X_train.cuda()
    y_train = y_train.cuda()
    X_val = X_val.cuda()
    y_val = y_val.cuda()

  # Clearing the gradients
  optimizer.zero_grad()

  # Prediction and loss for the train set
  output_train = model(X_train)
  loss_train = criterion(output_train, y_train)

  # The validation loss is only monitored, so no gradients are needed
  with torch.no_grad():
    loss_val = criterion(model(X_val), y_val)

  # Store plain numbers rather than tensors that keep the graph alive
  train_losses.append(loss_train.item())
  val_losses.append(loss_val.item())

  # Compute the updated weights of all model parameters
  loss_train.backward()
  optimizer.step()
  # Report every 10 epochs (epoch % (epoch // 10) divides by zero for epoch < 10)
  if epoch % 10 == 0:
    print(f'Epoch: {epoch + 1},   loss : {loss_val.item()}')

In [None]:
# Use CNN for the above scenario
# _*_ coding: utf-8 _*_

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F

class CNN(nn.Module):
	def __init__(self, batch_size, output_size, in_channels, out_channels, kernel_heights, stride, padding, keep_probab, vocab_size, embedding_length, weights):
		super(CNN, self).__init__()
		
		"""
		Arguments
		---------
		batch_size : Size of each batch which is same as the batch_size of the data returned by the TorchText BucketIterator
		output_size : Number of target classes (4 for AG_NEWS)
		in_channels : Number of input channels. Here it is 1 as the input data has dimension = (batch_size, num_seq, embedding_length)
		out_channels : Number of output channels after convolution operation performed on the input matrix
		kernel_heights : A list consisting of 3 different kernel_heights. Convolution will be performed 3 times and finally results from each kernel_height will be concatenated.
		keep_probab : Probability of retaining an activation node during dropout operation
		vocab_size : Size of the vocabulary containing unique words
		embedding_length : Embedding dimension of GloVe word embeddings
		weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table
		--------
		
		"""

		self.batch_size = batch_size
		self.output_size = output_size
		self.in_channels = in_channels
		self.out_channels = out_channels
		self.kernel_heights = kernel_heights
		self.stride = stride
		self.padding = padding
		self.vocab_size = vocab_size
		self.embedding_length = embedding_length
		
		self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
		self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False)
		self.conv1 = nn.Conv2d(in_channels, out_channels, (kernel_heights[0], embedding_length), stride, padding)
		self.conv2 = nn.Conv2d(in_channels, out_channels, (kernel_heights[1], embedding_length), stride, padding)
		self.conv3 = nn.Conv2d(in_channels, out_channels, (kernel_heights[2], embedding_length), stride, padding)
		self.dropout = nn.Dropout(1 - keep_probab) # nn.Dropout expects the probability of dropping a unit, not of keeping it
		self.label = nn.Linear(len(kernel_heights)*out_channels, output_size)
	
	def conv_block(self, input, conv_layer):
		conv_out = conv_layer(input)# conv_out.size() = (batch_size, out_channels, dim, 1)
		activation = F.relu(conv_out.squeeze(3))# activation.size() = (batch_size, out_channels, dim1)
		max_out = F.max_pool1d(activation, activation.size()[2]).squeeze(2)# maxpool_out.size() = (batch_size, out_channels)
		
		return max_out
	
	def forward(self, input_sentences, batch_size=None):
		
		"""
		The idea of the Convolutional Neural Network for Text Classification is very simple. We perform a convolution operation on the embedding matrix,
		whose shape for each batch is (num_seq, embedding_length), with kernels of varying height but constant width equal to the embedding_length.
		We apply a ReLU activation after the convolution and then, for each kernel height, a max_pool operation on each tensor
		that keeps the maximum activation of every channel; the resulting tensors are then concatenated. This output is then fully connected
		to the output layer, which has output_size units and gives us the logits for each class.
		
		Parameters
		----------
		input_sentences: input_sentences of shape = (batch_size, num_sequences)
		batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)
		
		Returns
		-------
		Output of the linear layer containing the logits for each class.
		logits.size() = (batch_size, output_size)
		
		"""
		
		input = self.word_embeddings(input_sentences)
		# input.size() = (batch_size, num_seq, embedding_length)
		input = input.unsqueeze(1)
		# input.size() = (batch_size, 1, num_seq, embedding_length)
		max_out1 = self.conv_block(input, self.conv1)
		max_out2 = self.conv_block(input, self.conv2)
		max_out3 = self.conv_block(input, self.conv3)
		
		all_out = torch.cat((max_out1, max_out2, max_out3), 1)
		# all_out.size() = (batch_size, num_kernels*out_channels)
		fc_in = self.dropout(all_out)
		# fc_in.size()) = (batch_size, num_kernels*out_channels)
		logits = self.label(fc_in)
		
		return logits

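vocab_size = 2000
output_size = NUM_CLASS  # AG_NEWS has four classes
in_channels = 1
out_channels = 4
kernel_heights = [2, 2, 2]

# embedding_length must be the GloVe vector dimension (50), not the number of GloVe rows
cnn = CNN(batch_size=BATCH_SIZE, output_size=output_size, in_channels=in_channels, out_channels=out_channels,
          kernel_heights=kernel_heights, stride=1, padding=0, keep_probab=0.5, vocab_size=vocab_size,
          embedding_length=weights_matrix.shape[1], weights=weights_matrix)
# CrossEntropyLoss takes raw logits and integer class ids, which matches a classifier's output
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(cnn.parameters(), lr=1e-4)
# NOTE: `text` (LongTensor of padded token ids, shape (batch_size, num_seq), every id < vocab_size)
# and `y` (LongTensor of class ids, shape (batch_size,)) are assumed to be prepared beforehand.
for t in range(500):
    # Forward pass: Compute predicted y by passing text to the model
    y_pred = cnn(text)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# NOTE: `predict`, `ag_news_label` and `vocab` are not defined in this notebook; they are assumed to come from the reference code.
print("This is a %s news" %ag_news_label[predict(ex_text_str, cnn, vocab, 1)])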

In [None]:
# Use RNN for the above scenario
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F

class RNN(nn.Module):
	def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
		super(RNN, self).__init__()

		"""
		Arguments
		---------
		batch_size : Size of the batch which is same as the batch_size of the data returned by the TorchText BucketIterator
		output_size : Number of target classes (4 for AG_NEWS)
		hidden_size : Size of the hidden state of the RNN
		vocab_size : Size of the vocabulary containing unique words
		embedding_length : Embedding dimension of GloVe word embeddings
		weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table 
		
		"""

		self.batch_size = batch_size
		self.output_size = output_size
		self.hidden_size = hidden_size
		self.vocab_size = vocab_size
		self.embedding_length = embedding_length
		
		self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
		self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False)
		self.rnn = nn.RNN(embedding_length, hidden_size, num_layers=2, bidirectional=True)
		self.label = nn.Linear(4*hidden_size, output_size)
	
	def forward(self, input_sentences, batch_size=None):
		
		""" 
		Parameters
		----------
		input_sentence: input_sentence of shape = (batch_size, num_sequences)
		batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)
		
		Returns
		-------
		Output of the linear layer containing the logits for each class, which receives its input as the final_hidden_state of the RNN.
		logits.size() = (batch_size, output_size)
		
		"""

		input = self.word_embeddings(input_sentences)
		input = input.permute(1, 0, 2) # input.size() = (num_sequences, batch_size, embedding_length)
		if batch_size is None:
			batch_size = self.batch_size
		# Create the initial state on the same device as the input instead of hard-coding .cuda()
		h_0 = torch.zeros(4, batch_size, self.hidden_size, device=input.device) # 4 = num_layers*num_directions
		output, h_n = self.rnn(input, h_0)
		# h_n.size() = (4, batch_size, hidden_size)
		h_n = h_n.permute(1, 0, 2) # h_n.size() = (batch_size, 4, hidden_size)
		h_n = h_n.contiguous().view(h_n.size()[0], h_n.size()[1]*h_n.size()[2])
		# h_n.size() = (batch_size, 4*hidden_size)
		logits = self.label(h_n) # logits.size() = (batch_size, output_size)
		
		return logits

    
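# Reuse the GloVe weights_matrix built earlier; its second dimension is the embedding length
rnn = RNN(batch_size=BATCH_SIZE, output_size=NUM_CLASS, hidden_size=32, vocab_size=vocab_size, embedding_length=weights_matrix.shape[1], weights=weights_matrix)
# CrossEntropyLoss takes raw logits and integer class ids, which matches a classifier's output
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=1e-4)
# NOTE: `text` (LongTensor of padded token ids, shape (BATCH_SIZE, num_seq)) and
# `y` (LongTensor of class ids, shape (BATCH_SIZE,)) are assumed to be prepared beforehand.
for t in range(500):
    # Forward pass: Compute predicted y by passing text to the model
    y_pred = rnn(text)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# NOTE: `predict`, `ag_news_label` and `vocab` are assumed to come from the reference code.
print("This is a %s news" %ag_news_label[predict(ex_text_str, rnn, vocab, 1)])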

In [None]:
# Use LSTM for the above scenario
# _*_ coding: utf-8 _*_

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F

class LSTMClassifier(nn.Module):
	def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
		super(LSTMClassifier, self).__init__()
		
		"""
		Arguments
		---------
		batch_size : Size of the batch which is same as the batch_size of the data returned by the TorchText BucketIterator
		output_size : Number of target classes (4 for AG_NEWS)
		hidden_size : Size of the hidden_state of the LSTM
		vocab_size : Size of the vocabulary containing unique words
		embedding_length : Embedding dimension of GloVe word embeddings
		weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table 
		
		"""
		
		self.batch_size = batch_size
		self.output_size = output_size
		self.hidden_size = hidden_size
		self.vocab_size = vocab_size
		self.embedding_length = embedding_length
		
		self.word_embeddings = nn.Embedding(vocab_size, embedding_length)# Initializing the look-up table.
		self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False) # Assigning the look-up table to the pre-trained GloVe word embedding.
		self.lstm = nn.LSTM(embedding_length, hidden_size)
		self.label = nn.Linear(hidden_size, output_size)
		
	def forward(self, input_sentence, batch_size=None):
	
		""" 
		Parameters
		----------
		input_sentence: input_sentence of shape = (batch_size, num_sequences)
		batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)
		
		Returns
		-------
		Output of the linear layer containing the logits for each class, which receives its input as the final_hidden_state of the LSTM
		final_output.shape = (batch_size, output_size)
		
		"""
		
		# Map every index in the input sequence to its word vector using the pre-trained word embeddings.
		input = self.word_embeddings(input_sentence) # embedded input of shape = (batch_size, num_sequences,  embedding_length)
		input = input.permute(1, 0, 2) # input.size() = (num_sequences, batch_size, embedding_length)
		if batch_size is None:
			batch_size = self.batch_size
		# Create the initial states on the same device as the input instead of hard-coding .cuda()
		h_0 = torch.zeros(1, batch_size, self.hidden_size, device=input.device) # Initial hidden state of the LSTM
		c_0 = torch.zeros(1, batch_size, self.hidden_size, device=input.device) # Initial cell state of the LSTM
		output, (final_hidden_state, final_cell_state) = self.lstm(input, (h_0, c_0))
		final_output = self.label(final_hidden_state[-1]) # final_hidden_state.size() = (1, batch_size, hidden_size) & final_output.size() = (batch_size, output_size)
		
		return final_output


lstm = LSTM(batch_size=BATCH_SIZE, output_size=output_size, hidden_size=32, vocab_size=vocab_size, embedding_length=embedding_dim, weights=embed_lookup)
# Classification over logits: use cross-entropy (y is expected to hold integer class indices).
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(lstm.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = lstm(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("This is a %s news" %ag_news_label[predict(ex_text_str, lstm, vocab, 1)])

In [None]:
# Use LSTM Attention for the above scenario
# _*_ coding: utf-8 _*_

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F
import numpy as np

class AttentionModel(torch.nn.Module):
	def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
		super(AttentionModel, self).__init__()
		
		"""
		Arguments
		---------
		batch_size : Size of the batch which is same as the batch_size of the data returned by the TorchText BucketIterator
		output_size : Number of target classes (one logit per class)
		hidden_size : Size of the hidden_state of the LSTM
		vocab_size : Size of the vocabulary containing unique words
		embedding_length : Embedding dimension of GloVe word embeddings
		weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table 
		
		--------
		
		"""
		
		self.batch_size = batch_size
		self.output_size = output_size
		self.hidden_size = hidden_size
		self.vocab_size = vocab_size
		self.embedding_length = embedding_length
		
		self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
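		self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False) # The attribute is `weight`; assigning to `weights` would leave the look-up table randomly initialized.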
		self.lstm = nn.LSTM(embedding_length, hidden_size)
		self.label = nn.Linear(hidden_size, output_size)
		#self.attn_fc_layer = nn.Linear()
		
	def attention_net(self, lstm_output, final_state):

		""" 
		Now we will incorporate an attention mechanism in our LSTM model. In this new model, we use attention to compute a soft alignment score
		between each hidden state and the last hidden state of the LSTM. We use torch.bmm for the batch matrix multiplication.
		
		Arguments
		---------
		
		lstm_output : Final output of the LSTM which contains the hidden state for each time step, shape = (batch_size, num_seq, hidden_size)
		final_state : Final time-step hidden state (h_n) of the LSTM
		
		---------
		
		Returns : Applies the attention mechanism by first computing a weight for each time step in lstm_output and then computing the
				  new hidden state as the weighted sum of the hidden states.
				  
		Tensor Size :
					hidden.size() = (batch_size, hidden_size)
					attn_weights.size() = (batch_size, num_seq)
					soft_attn_weights.size() = (batch_size, num_seq)
					new_hidden_state.size() = (batch_size, hidden_size)
					  
		"""
		
		hidden = final_state.squeeze(0)
		attn_weights = torch.bmm(lstm_output, hidden.unsqueeze(2)).squeeze(2)
		soft_attn_weights = F.softmax(attn_weights, 1)
		new_hidden_state = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
		
		return new_hidden_state
	
	def forward(self, input_sentences, batch_size=None):
	
		""" 
		Parameters
		----------
		input_sentences : tensor of token indices with shape = (batch_size, num_sequences)
		batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)
		
		Returns
		-------
		Output of the linear layer containing logits for pos & neg class which receives its input as the new_hidden_state which is basically the output of the Attention network.
		final_output.shape = (batch_size, output_size)
		
		"""
		
		input = self.word_embeddings(input_sentences)
		input = input.permute(1, 0, 2)
		# Fall back to the batch size given at construction time when none is passed in.
		if batch_size is None:
			batch_size = self.batch_size
		# Create the initial states on the same device as the input, so the model also runs without a GPU.
		h_0 = torch.zeros(1, batch_size, self.hidden_size, device=input.device)
		c_0 = torch.zeros(1, batch_size, self.hidden_size, device=input.device)
			
		output, (final_hidden_state, final_cell_state) = self.lstm(input, (h_0, c_0)) # final_hidden_state.size() = (1, batch_size, hidden_size) 
		output = output.permute(1, 0, 2) # output.size() = (batch_size, num_seq, hidden_size)
		
		attn_output = self.attention_net(output, final_hidden_state)
		logits = self.label(attn_output)
		
		return logits


# Use the same hyperparameters and embedding table as the other models in this notebook.
lstm_A = AttentionModel(batch_size=BATCH_SIZE, output_size=output_size, hidden_size=32, vocab_size=vocab_size, embedding_length=embedding_dim, weights=embed_lookup)
# Classification over logits: use cross-entropy (y is expected to hold integer class indices).
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(lstm_A.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = lstm_A(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("This is a %s news" %ag_news_label[predict(ex_text_str, lstm_A, vocab, 1)])
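
The dot-product attention in attention_net can be traced by hand for a single sentence. The sketch below is a plain-Python illustration, not part of the model: the helper name dot_product_attention and the toy hidden states are made up for the example. It scores each hidden state against the final one, normalizes the scores with a softmax, and returns the weighted sum, which is what the two torch.bmm calls do per batch element.

```python
import math


def dot_product_attention(hidden_states, final_state):
    """Hypothetical helper: attention for ONE sentence, using plain lists.

    hidden_states: list of num_seq vectors, each of length hidden_size
    final_state:   vector of length hidden_size (the last hidden state)
    """
    # Alignment score of each time step = dot product with the final state.
    scores = [sum(h_d * f_d for h_d, f_d in zip(h, final_state))
              for h in hidden_states]

    # Softmax over time steps (subtract the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # New hidden state = attention-weighted sum of all hidden states.
    hidden_size = len(final_state)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(hidden_size)]
    return weights, context


# Toy example: 3 time steps, hidden_size = 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final_state = hidden_states[-1]
weights, context = dot_product_attention(hidden_states, final_state)
print(weights, context)
```

The last time step gets the largest weight here because it is most similar to the final state under the dot product.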

In [None]:
# Use Self Attention for the above scenario
# _*_ coding: utf-8 _*_

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import functional as F

class SelfAttention(nn.Module):
	def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
		super(SelfAttention, self).__init__()

		"""
		Arguments
		---------
		batch_size : Size of the batch which is same as the batch_size of the data returned by the TorchText BucketIterator
		output_size : Number of target classes (one logit per class)
		hidden_size : Size of the hidden_state of the LSTM
		vocab_size : Size of the vocabulary containing unique words
		embedding_length : Embedding dimension of GloVe word embeddings
		weights : Pre-trained GloVe word_embeddings which we will use to create our word_embedding look-up table 
		
		--------
		
		"""

		self.batch_size = batch_size
		self.output_size = output_size
		self.hidden_size = hidden_size
		self.vocab_size = vocab_size
		self.embedding_length = embedding_length
		self.weights = weights

		self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
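		self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False) # The attribute is `weight`; assigning to `weights` would leave the look-up table randomly initialized.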
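		# Note: nn.LSTM applies dropout only between stacked layers, so with the default num_layers=1 this value has no effect (PyTorch emits a warning).
		self.dropout = 0.8
		self.bilstm = nn.LSTM(embedding_length, hidden_size, dropout=self.dropout, bidirectional=True)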
		# We will use da = 350, r = 30 & penalization_coeff = 1 as per given in the self-attention original ICLR paper
		self.W_s1 = nn.Linear(2*hidden_size, 350)
		self.W_s2 = nn.Linear(350, 30)
		self.fc_layer = nn.Linear(30*2*hidden_size, 2000)
		self.label = nn.Linear(2000, output_size)

	def attention_net(self, lstm_output):

		"""
		Now we will use the self-attention mechanism to produce a matrix embedding of the input sentence in which every row is an
		encoding of the input sentence that attends to a specific part of the sentence. We use 30 such embeddings of
		the input sentence, concatenate all 30 sentence embedding vectors, and feed them to a fully
		connected layer of size 2000, which is connected to the output layer returning one logit per class.

		Arguments
		---------

		lstm_output = A tensor containing hidden states corresponding to each time step of the LSTM network.
		---------

		Returns : Final attention weight matrix for all 30 sentence embeddings, in which each of the 30 embeddings gives
				  attention to different parts of the input sentence.

		Tensor size : lstm_output.size() = (batch_size, num_seq, 2*hidden_size)
					  attn_weight_matrix.size() = (batch_size, 30, num_seq)

		"""
		attn_weight_matrix = self.W_s2(torch.tanh(self.W_s1(lstm_output))) # F.tanh is deprecated; torch.tanh is the supported call
		attn_weight_matrix = attn_weight_matrix.permute(0, 2, 1)
		attn_weight_matrix = F.softmax(attn_weight_matrix, dim=2)

		return attn_weight_matrix

	def forward(self, input_sentences, batch_size=None):

		""" 
		Parameters
		----------
		input_sentences : tensor of token indices with shape = (batch_size, num_sequences)
		batch_size : default = None. Used only for prediction on a single sentence after training (batch_size = 1)
		
		Returns
		-------
		Output of the linear layer containing logits for pos & neg class.
		
		"""

		input = self.word_embeddings(input_sentences)
		input = input.permute(1, 0, 2)
		# Fall back to the batch size given at construction time when none is passed in.
		if batch_size is None:
			batch_size = self.batch_size
		# The first dimension is 2 because the LSTM is bidirectional (one state per direction).
		# Create the initial states on the same device as the input, so the model also runs without a GPU.
		h_0 = torch.zeros(2, batch_size, self.hidden_size, device=input.device)
		c_0 = torch.zeros(2, batch_size, self.hidden_size, device=input.device)

		output, (h_n, c_n) = self.bilstm(input, (h_0, c_0))
		output = output.permute(1, 0, 2)
		# output.size() = (batch_size, num_seq, 2*hidden_size)
		# h_n.size() = (2, batch_size, hidden_size)  (one state per direction)
		# c_n.size() = (2, batch_size, hidden_size)
		attn_weight_matrix = self.attention_net(output)
		# attn_weight_matrix.size() = (batch_size, r, num_seq)
		# output.size() = (batch_size, num_seq, 2*hidden_size)
		hidden_matrix = torch.bmm(attn_weight_matrix, output)
		# hidden_matrix.size() = (batch_size, r, 2*hidden_size)
		# Let's now concatenate the hidden_matrix and connect it to the fully connected layer.
		fc_out = self.fc_layer(hidden_matrix.view(-1, hidden_matrix.size()[1]*hidden_matrix.size()[2]))
		logits = self.label(fc_out)
		# logits.size() = (batch_size, output_size)

		return logits

self_A = SelfAttention(batch_size=BATCH_SIZE, output_size=output_size, hidden_size=32, vocab_size=vocab_size, embedding_length=embedding_dim, weights=embed_lookup)
# Classification over logits: use cross-entropy (y is expected to hold integer class indices).
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(self_A.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = self_A(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("This is a %s news" %ag_news_label[predict(ex_text_str, self_A, vocab, 1)])
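
As a reading aid for the SelfAttention cell, the computation in attention_net and forward can be written compactly in the notation of the structured self-attention formulation that the code comment cites (da = 350, r = 30). Here n stands for num_seq and u for hidden_size; these symbols are introduced only for this summary. The nn.Linear layers in the code also add bias terms, which are omitted below, and the penalization term P is shown only because the comment mentions a penalization coefficient: the code above does not compute it.

```latex
\begin{aligned}
H &\in \mathbb{R}^{n \times 2u}
  && \text{(BiLSTM outputs for one sentence)} \\
A &= \operatorname{softmax}\!\left( W_{s2} \tanh\!\left( W_{s1} H^{\top} \right) \right)
  \in \mathbb{R}^{r \times n},
  && W_{s1} \in \mathbb{R}^{d_a \times 2u},\; W_{s2} \in \mathbb{R}^{r \times d_a} \\
M &= A H \in \mathbb{R}^{r \times 2u}
  && \text{(the } r \text{ sentence embeddings, i.e. hidden\_matrix)} \\
P &= \left\lVert A A^{\top} - I \right\rVert_F^{2}
  && \text{(optional penalty that encourages the } r \text{ rows to differ)}
\end{aligned}
```

The softmax runs over the n time steps, so each of the r rows of A sums to 1, matching `F.softmax(attn_weight_matrix, dim=2)` after the permute.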

###### Is it possible to use a Transformer for the above scenario, i.e., text sentiment analysis / classification? Answer yes/no. If yes, implement it using the resource below.
Yes, but I am too tired
###### If no, why do you think a Transformer cannot perform text classification in PyTorch?
###### You can refer to https://github.com/Renovamen/Text-Classification/tree/master/models/Transformer to implement Transformers.
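
For orientation, here is one way a Transformer encoder could be wired up as a text classifier in PyTorch. This is a minimal sketch under stated assumptions, not the implementation from the linked repository: the class name TransformerClassifier, the learned positional embedding, mean pooling over time steps, and all hyperparameters are illustrative choices. It uses only nn.Embedding, nn.TransformerEncoderLayer, nn.TransformerEncoder, and nn.Linear, and is run here on random token indices rather than the AG_NEWS data.

```python
import torch
import torch.nn as nn


class TransformerClassifier(nn.Module):
    """Hypothetical minimal Transformer-encoder classifier (illustrative only)."""

    def __init__(self, vocab_size, num_classes, d_model=32, nhead=4,
                 num_layers=2, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Learned positional embedding (an assumption; sinusoidal also works).
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=64, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.label = nn.Linear(d_model, num_classes)

    def forward(self, tokens):
        # tokens: (batch_size, seq_len) integer token indices
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device).unsqueeze(0)
        x = self.token_emb(tokens) + self.pos_emb(positions)  # (batch, seq, d_model)
        x = x.permute(1, 0, 2)      # encoder expects (seq, batch, d_model) by default
        x = self.encoder(x)         # (seq, batch, d_model)
        pooled = x.mean(dim=0)      # average over time steps -> (batch, d_model)
        return self.label(pooled)   # logits: (batch, num_classes)


# Smoke test on random data: 8 sentences of 12 tokens, 4 classes.
model = TransformerClassifier(vocab_size=100, num_classes=4)
tokens = torch.randint(0, 100, (8, 12))
logits = model(tokens)
print(logits.shape)
```

The logits can be trained with torch.nn.CrossEntropyLoss against integer class labels, in the same way as the other classifiers in this notebook.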