### Loading AG News with Torchtext

The AG News dataset is one of many included in Torchtext.
It is grouped together with many of the other text classification datasets.
While we could download the source text online ourselves, Torchtext makes it retrievable with a quick API call&ast;. If you are running this notebook on your machine, you can uncomment and run this block:

In [12]:
import torchtext
import torch
from torchtext.datasets.ag_news import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torch.nn as nn
import torch.nn.functional as F

The first step is to build a vocabulary from the raw training dataset. Here we use the built-in factory function `build_vocab_from_iterator`, which accepts an iterator that yields lists (or iterators) of tokens. Users can also pass any special symbols to be added to the vocabulary.

In [13]:
train_iter, test_iter = AG_NEWS(split=("train", "test"))
flag, text = next(iter(train_iter))  # Each item yielded by the iterator is a (label, text) pair
tokenizer = get_tokenizer('basic_english')  # Tokenizer that lowercases the text and splits it into tokens
print(text)
print(tokenizer(text))

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short-sellers', ',', 'wall', 'street', "'", 's', 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']


In [14]:
def yield_tokens(data_iter):
    # Lazily tokenize each text, yielding one list of tokens per sample
    for _, text in data_iter:
        yield tokenizer(text)

# Build the vocabulary, which maps each token to an integer index
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
# Tokens that are not in the vocabulary map to the index of "<unk>"
vocab.set_default_index(vocab["<unk>"])

> What is yield in Python? [link](https://www.simplilearn.com/tutorials/python-tutorial/yield-in-python#:~:text=let's%20get%20started.-,What%20Is%20Yield%20In%20Python%3F,of%20simply%20returning%20a%20value)<br><br>The `yield` keyword in Python is similar to a `return` statement used for returning values or objects. However, there is a key difference. **Calling a function that contains `yield` returns a generator object to the caller**, instead of simply returning a value. The function body then runs lazily, producing one value each time the generator is advanced.
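
To make the lazy behaviour concrete, here is a minimal sketch of a generator function. The name `count_up_to` is made up for illustration and is not part of the notebook. It shows that calling the function only creates the generator, and that the body runs when values are requested, which is the same mechanism `yield_tokens` relies on.

```python
def count_up_to(limit):
    """Generator function: yields 1, 2, ..., limit one value at a time."""
    print("body started")  # runs only once iteration begins
    for n in range(1, limit + 1):
        yield n

gen = count_up_to(3)   # nothing is printed yet: the body has not started
first = next(gen)      # prints "body started", then pauses at the first yield
rest = list(gen)       # resumes the body and collects the remaining values
print(first, rest)
```

Because values are produced on demand, a generator like this can feed a large corpus to a consumer without holding every tokenized text in memory at once.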

In [15]:
print(vocab(['here', 'is', 'an', 'example']))

[475, 21, 30, 5297]
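
To see what a token-to-index mapping like this involves, here is a simplified plain-Python stand-in. It is not Torchtext's implementation: the helper names `build_vocab` and `lookup`, the toy corpus, and the alphabetical tie-breaking are all assumptions made for illustration. The idea is that special symbols take the first indices, the remaining tokens are ordered by frequency, and unknown tokens fall back to a default index.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (high to low); break ties alphabetically for determinism
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default="<unk>"):
    """Convert tokens to indices, falling back to the default token's index."""
    return [stoi.get(tok, stoi[default]) for tok in tokens]

# Toy corpus, not AG News
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi = build_vocab(corpus)
ids = lookup(stoi, ["the", "cat", "flew"])
print(ids)  # "flew" is not in the corpus, so it maps to the "<unk>" index
```

Under a frequency-based ordering like this one, very common words end up with small indices.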


In [16]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

In [17]:
print(text_pipeline('here is the an example'))
print(label_pipeline('10'))

[475, 21, 2, 30, 5297]
9


In [18]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # Flatten a batch into one 1-D token tensor plus the start offset of each sample
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length, then cumulative-sum the lengths to get start positions
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
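
The offsets bookkeeping in `collate_batch` can be sketched without tensors. This plain-Python version uses made-up helper names (`flatten_with_offsets`, `split_by_offsets`) and toy token ids. It concatenates variable-length samples into one flat list and records where each sample starts, then shows that the original samples can be recovered from those two pieces.

```python
from itertools import accumulate

def flatten_with_offsets(samples):
    """Concatenate variable-length samples and record where each one starts."""
    flat = [tok for sample in samples for tok in sample]
    lengths = [len(sample) for sample in samples]
    # Start of sample i = total length of samples 0..i-1
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets

def split_by_offsets(flat, offsets):
    """Inverse operation: recover the samples from the flat list and offsets."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]

samples = [[475, 21], [30, 5297, 2], [7]]  # toy token ids
flat, offsets = flatten_with_offsets(samples)
print(flat)     # [475, 21, 30, 5297, 2, 7]
print(offsets)  # [0, 2, 5]
```

This layout avoids padding: no sample has to be extended to the length of the longest one in the batch.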

In [19]:
class SWEM(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_dim, num_outputs):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embedding_size, sparse=True)
        self.fc1 = nn.Linear(embedding_size, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_outputs)
        self.init_weights()


    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc1.weight.data.uniform_(-initrange, initrange)
        self.fc1.bias.data.zero_()
        self.fc2.weight.data.uniform_(-initrange, initrange)
        self.fc2.bias.data.zero_()

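    def forward(self, x, offsets):
        # EmbeddingBag averages the token embeddings of each sample (default mode is "mean")
        embed = self.embedding(x, offsets)
        h = self.fc1(embed)
        h = F.relu(h)
        h = self.fc2(h)  # Raw class scores (logits); CrossEntropyLoss applies the softmax
        return h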
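
As a rough illustration of what the embedding layer computes in mean mode, here is a plain-Python sketch. The function name `embedding_bag_mean` and the tiny weight table are assumptions for illustration, and the sketch assumes every bag is non-empty. Each bag, delimited by the offsets, is reduced to the element-wise average of its embedding rows.

```python
def embedding_bag_mean(weight, indices, offsets):
    """Average the embedding rows of each bag; bags are delimited by offsets."""
    ends = offsets[1:] + [len(indices)]
    dim = len(weight[0])
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [weight[i] for i in indices[start:end]]
        # Element-wise mean over the rows in this bag (assumes a non-empty bag)
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Toy embedding table with 4 tokens and 2 dimensions
weight = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
indices = [1, 2, 3, 3]   # two samples: [1, 2] and [3, 3]
offsets = [0, 2]
pooled = embedding_bag_mean(weight, indices, offsets)
print(pooled)  # [[2.0, 3.0], [5.0, 6.0]]
```

Every sample becomes one fixed-size vector regardless of its length, which is what lets the linear layers that follow operate on a regular batch.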

In [20]:

VOCAB_SIZE = len(vocab)
EMBED_DIM = 100
HIDDEN_DIM = 60
NUM_OUTPUTS = len({label for (label, text) in train_iter})
NUM_EPOCHS = 3

model = SWEM(
    vocab_size = VOCAB_SIZE,
    embedding_size = EMBED_DIM, 
    hidden_dim = HIDDEN_DIM, 
    num_outputs = NUM_OUTPUTS,
).to(device)
print(model)

SWEM(
  (embedding): EmbeddingBag(95811, 100, mode=mean)
  (fc1): Linear(in_features=100, out_features=60, bias=True)
  (fc2): Linear(in_features=60, out_features=4, bias=True)
)


In [21]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
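
The training step clips gradients by their global norm before the optimizer update. Here is a plain-Python sketch of that idea; the function name `clip_by_global_norm`, the small `eps` term, and the toy gradients are assumptions for illustration, and the sketch approximates norm-based clipping in general rather than reproducing PyTorch's code.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale factor is capped at 1 so small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # toy gradients with global norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, round(norm_after, 4))
```

All gradients are scaled by the same factor, so the update direction is preserved and only its magnitude is limited. That matters here because the learning rate in this notebook is large.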

In [22]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 50 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 2280 batches | accuracy    0.716
| epoch   1 |  1000/ 2280 batches | accuracy    0.849
| epoch   1 |  1500/ 2280 batches | accuracy    0.874
| epoch   1 |  2000/ 2280 batches | accuracy    0.878
-----------------------------------------------------------
| end of epoch   1 | time: 13.59s | valid accuracy    0.878 
-----------------------------------------------------------
| epoch   2 |   500/ 2280 batches | accuracy    0.901
| epoch   2 |  1000/ 2280 batches | accuracy    0.901
| epoch   2 |  1500/ 2280 batches | accuracy    0.905
| epoch   2 |  2000/ 2280 batches | accuracy    0.904
-----------------------------------------------------------
| end of epoch   2 | time: 12.59s | valid accuracy    0.908 
-----------------------------------------------------------
| epoch   3 |   500/ 2280 batches | accuracy    0.920
| epoch   3 |  1000/ 2280 batches | accuracy    0.917
| epoch   3 |  1500/ 2280 batches | accuracy    0.917
| epoch   3 |  2000/ 2280 batches | accuracy
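
The learning-rate rule in the loop above is easy to miss: the scheduler is stepped only when validation accuracy falls below the best value seen so far, and each step multiplies the learning rate by `gamma`. The sketch below replays that rule in plain Python. The function name `simulate_lr` and the accuracy values are hypothetical and are not results from this notebook.

```python
def simulate_lr(valid_accuracies, lr=5.0, gamma=0.1):
    """Replay the loop's rule: decay the LR when validation accuracy drops."""
    best, history = None, []
    for acc in valid_accuracies:
        if best is not None and best > acc:
            lr *= gamma   # effect of one scheduler step with a step size of 1
        else:
            best = acc    # accuracy did not drop, so remember it
        history.append(lr)
    return history

# Hypothetical validation accuracies, one per epoch
history = simulate_lr([0.88, 0.91, 0.90, 0.92, 0.91])
print([round(lr, 4) for lr in history])  # [5.0, 5.0, 0.5, 0.5, 0.05]
```

So the learning rate stays at its starting value for as long as validation accuracy keeps improving, and shrinks tenfold each time it regresses.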

In [23]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.918


In [24]:
ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        # A single sample, so its only bag starts at offset 0
        output = model(text, torch.tensor([0]))
        # Shift back to the dataset's 1-based labels
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news
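
The last step of `predict` turns raw class scores into a label: it takes the index of the largest score and adds 1, undoing the `- 1` applied by `label_pipeline`. Here is a plain-Python sketch of that mapping, with an optional softmax to read the scores as probabilities. The helper names and the score values are hypothetical and are not actual model output.

```python
import math

AG_NEWS_LABELS = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits):
    """Pick the highest-scoring class and shift to the dataset's 1-based labels."""
    class_index = max(range(len(logits)), key=lambda i: logits[i])
    return class_index + 1

logits = [0.2, 3.1, -1.0, 0.5]  # hypothetical scores for one article
probs = softmax(logits)
predicted = AG_NEWS_LABELS[label_from_logits(logits)]
print(predicted, round(max(probs), 3))
```

Softmax does not change which class scores highest, so taking the argmax of the raw scores, as `predict` does, gives the same label.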
