<a href="https://colab.research.google.com/github/tejuafonja/nlp/blob/master/text_classification_with_torchtext.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Tutorial from official PyTorch Website.**


*pip install --upgrade git+https://github.com/pytorch/text for latest torchtext*



This tutorial shows how to train a supervised learning algorithm for classification using a Topic Classification Dataset - AG News

Our goal is to classify a some document into 4 categories
1. World
2. Sport
3. Business
4. Sci / Tech

In [0]:
# !pip install --upgrade git+https://github.com/pytorch/text 

In [0]:
import torch
import torchtext
from torchtext.datasets import text_classification
import os

### Load Dataset

In [2]:
NGRAMS = 2
if not os.path.isdir('./data'):
  os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ag_news_csv.tar.gz: 11.8MB [00:00, 16.2MB/s]
120000lines [00:08, 13614.34lines/s]
120000lines [00:17, 6695.84lines/s]
7600lines [00:01, 6853.87lines/s]


In [0]:
import torch.nn as nn
import torch.nn.functional as F

### Define the Model

In [0]:
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

### Initiate an Instance

In [0]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUN_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUN_CLASS).to(device)

### Function used to Generate Batch

In [0]:
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum
    # of elements in the dimension dim.
    # torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0)

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    # concatenate all the texts
    text = torch.cat(text)
    return text, offsets, label

### Define Functions to Train the Model and Evaluate Results

In [0]:
from torch.utils.data import DataLoader
def train_func(sub_train_):
  
  # Train the model
  train_loss = 0
  train_acc = 0
  data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generate_batch)
  
  for i, (text, offsets, cls) in enumerate(data):
    optimizer.zero_grad()
    text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
    output = model(text, offsets)
    loss = criterion(output, cls)
    train_loss += loss.item()
    loss.backward()
    optimizer.step()
    train_acc += (output.argmax(1) == cls).sum().item()

  # Adjust learning rate
  scheduler.step()

  return train_loss / len(sub_train_), train_acc / len(sub_train_)


def test(data_):
  loss = 0
  acc = 0
  data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
  for text, offsets, cls in data:
    text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
    with torch.no_grad():
      output = model(text, offsets)
      loss = criterion(output, cls) 
      loss += loss.item()
      acc += (output.argmax(1) == cls).sum().item()

  return loss / len(data_), acc / len(data_)

### Split the Dataset and Run the Model

In [31]:
import time
from torch.utils.data.dataset import random_split

N_EPOCHS = 10
min_valid_loss = float("inf")

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)

sub_train_, sub_valid_ = random_split(train_dataset, [train_len, len(train_dataset) - train_len])

best_valid_acc = 0

for epoch in range(N_EPOCHS):

  start_time = time.time()
  train_loss, train_acc = train_func(sub_train_)
  valid_loss, valid_acc = test(sub_valid_)

  if valid_acc > best_valid_acc:
    best_valid_acc = valid_acc
  else:
    break

  secs = int(time.time() - start_time)
  mins = secs / 60
  secs = secs % 60

  print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
  print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
  print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 10 seconds
	Loss: 0.0262(train)	|	Acc: 84.8%(train)
	Loss: 0.0001(valid)	|	Acc: 90.1%(valid)
Epoch: 2  | time in 0 minutes, 10 seconds
	Loss: 0.0119(train)	|	Acc: 93.6%(train)
	Loss: 0.0001(valid)	|	Acc: 90.3%(valid)
Epoch: 3  | time in 0 minutes, 10 seconds
	Loss: 0.0069(train)	|	Acc: 96.4%(train)
	Loss: 0.0001(valid)	|	Acc: 90.9%(valid)
Epoch: 4  | time in 0 minutes, 10 seconds
	Loss: 0.0038(train)	|	Acc: 98.1%(train)
	Loss: 0.0001(valid)	|	Acc: 91.0%(valid)
Epoch: 5  | time in 0 minutes, 10 seconds
	Loss: 0.0023(train)	|	Acc: 99.0%(train)
	Loss: 0.0001(valid)	|	Acc: 91.1%(valid)


### Evaluate the Model with Test Dataset

In [32]:
print('Checking the results of the test dataset...')
test_loss, test_acc = test(test_dataset)

print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of the test dataset...
	Loss: 0.0002(test)	|	Acc: 91.4%(test)


### Test on Random News

In [0]:
import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

In [34]:
ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tech"}

def predict(text, model, vocab, ngrams):
  tokenizer = get_tokenizer("basic_english")
  with torch.no_grad():
    text = torch.tensor([vocab[token] for token in ngrams_iterator(tokenizer(text), ngrams)])
    output = model(text, torch.tensor([0]))
    return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a Sports news


In [0]:
ex_business = "The S&P 500 and the Dow industrials slipped in a shortened, pre-Christmas session on Tuesday, as investors paused after a record-setting rally fueled by improving U.S.-China trade relations that has put the market on course for its best year since 2013."

In [36]:
print("This is a %s news" %ag_news_label[predict(ex_business, model, vocab, 2)])

This is a Business news


In [0]:
ex_sci = '''
We know a lot about Lego bricks, especially how they interact with bare feet. A team of physicists at Lancaster University in the UK wanted to learn something new, so they chilled some Lego pieces to near absolute zero and discovered they have some intriguing thermal properties.
The scientists used a special "dilution refrigerator" that plunged the plastic to a mere 4 millidegrees above absolute zero. Absolute zero is theoretically the lowest temperature possible, which works out to  –459.67 degrees Fahrenheit or –273.15 degrees Celsius. 
The researchers popped four stacked bricks and an astronaut minifigure into the super-cool machine. You can check out the complicated process in a video deep-dive into the experiment.
'''

In [38]:
print("This is a %s news" %ag_news_label[predict(ex_sci, model, vocab, 2)])

This is a Sci/Tec news


In [0]:
ex_world = '''
About 2.4 billion people -- about half the population of Asia -- live in areas vulnerable to extreme weather events.
This year, flooding and landslides, triggered by torrential monsoon rains, swept across India, Nepal, Pakistan, and Bangladesh, leaving devastation in each country and hundreds of deaths.
China, Vietnam, Japan, India, Bangladesh, South Korea, Thailand, Sri Lanka and the Philippines, were all hit by tropical storms and typhoons -- or cyclones -- in 2019, causing dozens of deaths, hundreds of thousands displaced and millions of dollars in damage.
The climate crisis is expected to create higher storm surges, increased rainfall and stronger winds.
Joanna Sustento has been campaigning for climate action since Typhoon Haiyan devastated her home in Tacloban, the Philippines, in 2013.
Sustento lost both her parents, her eldest brother, sister-in-law, and her young nephew in the storm -- one of the most powerful ever recorded.
'''

In [40]:
print("This is a %s news" %ag_news_label[predict(ex_world, model, vocab, 2)])

This is a World news


In [0]:
ex_scifi = '''
Intelligent machines catastrophically misinterpreting human desires is a frequent trope in science fiction, perhaps used most memorably in Isaac Asimov’s stories of robots that misconstrue the famous “three laws of robotics.” The idea of artificial intelligence going awry resonates with human fears about technology. But current discussions of superhuman A.I. are plagued by flawed intuitions about the nature of intelligence.
We don’t need to go back all the way to Isaac Asimov — there are plenty of recent examples of this kind of fear. Take a recent Op-Ed essay in The New York Times and a new book, “Human Compatible,” by the computer scientist Stuart Russell. Dr. Russell believes that if we’re not careful in how we design artificial intelligence, we risk creating “superintelligent” machines whose objectives are not adequately aligned with our own.
As one example of a misaligned objective, Dr. Russell asks, “What if a superintelligent climate control system, given the job of restoring carbon dioxide concentrations to preindustrial levels, believes the solution is to reduce the human population to zero?” He claims that “if we insert the wrong objective into the machine and it is more intelligent than us, we lose.”
Dr. Russell’s view expands on arguments of the philosopher Nick Bostrom, who defined A.I. superintelligence as “an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills.” Dr. Bostrom and Dr. Russell envision a superintelligence with vast general abilities unlike today’s best machines, which remain far below the level of humans in all but relatively narrow domains (such as playing chess or Go).
Dr. Bostrom, Dr. Russell and other writers argue that even if there is just a small probability that such superintelligent machines will emerge in the foreseeable future, it would be an event of such magnitude and potential danger that we should start preparing for it now. In Dr. Bostrom’s view, “a plausible default outcome of the creation of machine superintelligence is existential catastrophe.” That is, humans would be toast.
These thinkers — let’s call them the “superintelligentsia” — speculate that if machines were to attain general human intelligence, the machines would quickly become superintelligent. They speculate that a computer with general intelligence would be able to speedily read all existing books and documents, absorbing the totality of human knowledge. Likewise, the machine would be able to use its logical abilities to make discoveries that increase its cognitive power.
Such a machine, the speculation goes, would not be bounded by bothersome human limitations, such as slowness of thought, emotions, irrational biases and need for sleep. Instead, the machine would possess something like a “pure” intelligence without any of the cognitive shortcomings that limit humans.
The assumption seems to be that this A.I. could surpass the generality and flexibility of human intelligence while seamlessly retaining the speed, precision and programmability of a computer. This imagined machine would be far smarter than any human, far better at “general wisdom and social skills,” but at the same time it would preserve unfettered access to all of its mechanical capabilities. And as Dr. Russell’s example shows, it would lack humanlike common sense.
The problem with such forecasts is that they underestimate the complexity of general, human-level intelligence. Human intelligence is a strongly integrated system, one whose many attributes — including emotions, desires, and a strong sense of selfhood and autonomy — can’t easily be separated.
Similarly, if generally intelligent A.I. is ever created (something that will take many decades, if not centuries), its objectives, like ours, will not be easily “inserted” or “aligned.” They will rather develop along with the other qualities that form its intelligence, as a result of being embedded in human society and culture. The machines’ push to achieve these objectives will be tempered by the common sense, values and social judgment without which general intelligence cannot exist.
What’s more, the notion of superintelligence without humanlike limitations may be a myth. It seems likely to me that many of the supposed deficiencies of human cognition are inseparable aspects of our general intelligence, which evolved in large part to allow us to function as a social group. It’s possible that the emotions, “irrational” biases and other qualities sometimes considered cognitive shortcomings are what enable us to be generally intelligent social beings rather than narrow savants. I can’t prove it, but I believe that general intelligence can’t be isolated from all these apparent shortcomings, either in humans or in machines that operate in our human world.
In his 1979 Pulitzer Prize-winning book, “Gödel, Escher, Bach: an Eternal Golden Braid,” the cognitive scientist Douglas Hofstadter beautifully captures the counterintuitive complexity of intelligence by posing a deceptively simple question: “Will a thinking computer be able to add fast?” Dr. Hofstadter’s surprising but insightful answer was, “perhaps not.”
As Dr. Hofstadter explains: “We ourselves are composed of hardware which does fancy calculations but that doesn’t mean that our symbol level, where ‘we’ are, knows how to carry out the same fancy calculations. Let me put it this way: There’s no way that you can load numbers into your own neurons to add up your grocery bill. Luckily for you, your symbol level (i.e., you) can’t gain access to the neurons which are doing your thinking — otherwise you’d get addlebrained.” So, why, he asks, “should it not be the same for an intelligent program?”
In other words, the intelligent part of your mind can’t harness the fast-adding skills of your own neurons, and for good reason. This barrier — between the “self” that you are aware of and the detailed activity of your brain — permits the kind of thinking that matters for survival without getting overwhelmed (“addlebrained”) by your own thought processes. Similarly, a thinking computer’s hardware, like ours, would presumably include circuits for fast arithmetic, but at the level of its cognitive awareness, the machine wouldn’t be able to tap into these circuits any more than we humans can.
It’s fine to speculate about aligning an imagined superintelligent — yet strangely mechanical — A.I. with human objectives. But without more insight into the complex nature of intelligence, such speculations will remain in the realm of science fiction and cannot serve as a basis for A.I policy in the real world.
Understanding our own thinking is a hard problem for our plain old intelligent minds. But I’m hopeful that we, and our future thinking computers, will eventually achieve such understanding in spite of — or perhaps thanks to — our shared lack of superintelligence.
'''

In [42]:
print("This is a %s news" %ag_news_label[predict(ex_scifi, model, vocab, 2)])

This is a Sci/Tec news
