# Learning Project: Document Classification with PyTorch and TorchText

This project builds a text classification model using the AG_NEWS dataset with PyTorch and TorchText. The model predicts the category of a news article (World, Sports, Business, or Sci/Tech) from its raw text. It uses:
- __Tokenization__ and __vocabulary building__ to convert raw text into numeric format
- A __collate function__ with EmbeddingBag for efficient text representation without padding
- A __simple feedforward neural network__ for classification
- __Cross-entropy loss__ and __stochastic gradient descent (SGD)__ for training
- A learning __rate scheduler__ to dynamically adjust training speed

The project demonstrates the full NLP pipeline: from data preprocessing to model training, evaluation, and optimization.

In [1]:
# load libraries 
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence
from torch.optim.lr_scheduler import StepLR

In [2]:
# Train test split 
train_iter, test_iter = AG_NEWS(split=('train','test'))

In [None]:
# Tokenizaiton and Vocab building 
tokenizer = get_tokenizer('basic_english')
# build vocabulary (with generator function to avoid memory inefficiency)
#def yeild_tokens(data_iter): 
#    for _,text in data_iter: 
#        yield tokenizer(text)
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _,text in train_iter), 
    specials=['<unk>']
)
vocab.set_default_index(vocab['<unk>'])

In [15]:
# Collate function 
## collate function essentially preprocess the batch on the fly 
def collate_batch(batch): 
    # label_list -> store the true class label as int
    # text_list -> store tokenized & indexed text 
    # offset -> starting index of each sample 
    label_list, text_list, offset = [],[],[0]
    for label,text in batch: 
        label_list.append(label - 1) # -1 for 0 indexed 
        processed_text = torch.tensor(vocab(tokenizer(text)), dtype=torch.int64) # tokenized & numericalized
        text_list.append(processed_text)
        offset.append(processed_text.size(0)) # basically length of tensor .size is a tensor method to get dim
    # convert label list to label tensor 
    label_tensor = torch.tensor(label_list,dtype=torch.int64)
    # convert text list to text tensor 
    text_tensor = torch.cat(text_list)
    # cumulative offsets: start index of each sample 
    offset_tensor = torch.tensor(offset[:-1]).cumsum(dim=0) # last length no needed; dim = 0 row wised 

    return label_tensor,text_tensor,offset_tensor


In [16]:
# DataLoader 
from torch.utils.data.dataset import random_split
# 95% for training and 5% for validation 
t_len = len(list(AG_NEWS(split='train')))
num_train = int(t_len * 0.95)
num_valid = t_len - num_train
# randomly split training set into training and validation 
train_set, valid_set = random_split(list(AG_NEWS(split='train')), [num_train,num_valid])
# wrap training set into train_dataloader 
train_dataloader = DataLoader(train_set,batch_size=8,shuffle=True,collate_fn=collate_batch)
# wrap validation set into valid_dataloader 
valid_dataloader = DataLoader(valid_set,batch_size=8,collate_fn=collate_batch) # no shuffle, deterministic 
# wrap test set into test_dataloader 
test_dataloader = DataLoader(list(AG_NEWS(split='test')), batch_size=8,collate_fn=collate_batch)

In [17]:
# define the model 
class TextClassificationModel(nn.Module):  # inherits from nn.Module
    def __init__(self,vocab_size, embed_dim, num_class): 
        # super() access parent class (nn.Module) methods 
        super().__init__() # initialize the base(constructor) nn.Module 
        # voccab_size -> # of unique tokens 
        # embed_dim -> dim of word embedding (each word is represented by embed_dim dimension vector)
        self.embedding = nn.EmbeddingBag(vocab_size,embed_dim,sparse = True)
        # fully connected layer 
        ## a simple linear layer that projects the final embedding to class logits.
        self.fc = nn.Linear(embed_dim,num_class)
        # call function init_weights to initiate model weight 
        self.init_weights() 

    # probably not necessary, but I will have one just for good practice 
    def init_weights(self): 
        initrange = 0.5 
        self.embedding.weight.data.uniform_(-initrange,initrange)
        self.fc.weight.data.uniform_(-initrange,initrange)
        self.fc.bias.data.zero_() # initialize biases to zero 

    def forward(self,text,offset): 
        # embedded is a tensor shape of [batch size, embed_dim], which is pooled embedding of a doc 
        embedded = self.embedding(text,offset) # text is 1D tensor 
        return self.fc(embedded)


In [18]:
# initialize model, loss, optimizer, and scheduler 

num_class = 4 # business, sci, sports, world 
vocab_size = len(vocab)
embed_dim = 64 

# init model 
model = TextClassificationModel(vocab_size=vocab_size,embed_dim=embed_dim,num_class=num_class)

# cross entropy loss
criterion = nn.CrossEntropyLoss()
# optimizer 
optimizer = torch.optim.SGD(model.parameters(), lr = 1)
# scheduler 
## step size -> period of learning rate decay 
## gamma -> multiplicative factor of learning rate decay 
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,step_size = 1,gamma = 0.9) 

In [19]:
# Model Training 
def train(dataloder): # dataloader iterate over taining batch 
    # tells the model to operate in training mode 
    model.train()
    # metrics 
    ## total_acc -> total # of correct prediction 
    ## total_loss -> total loss over batches 
    ## total_count -> total # of samples seen 
    total_acc, total_loss, total_count = 0,0,0

    # dataloader contains label, text, offset 
    for label,text,offset in dataloder: 
        # clear old gradients 
        optimizer.zero_grad()
        output = model(text,offset) # nn.Module calls model.__call__() which calls forward() 
        loss = criterion(output,label) # compute how different the predicted vs. actual probabilities are 
        # backpropagation 
        loss.backward() 
        # gradient clipping (prevents explosion)
        torch.nn.utils.clip_grad_norm_(model.parameters(),max_norm=0.5)
        # apply gradient update 
        optimizer.step() 

        # accumulate metrics 
        total_loss += loss.item() # float value from loss tensor 
        # argmax(1) find predicted label compare to true label and sum over bool tensor 
        total_acc += (output.argmax(1)==label).sum().item() 
        total_count += label.size(0) 

    # return avg loss and accuracy 
    return total_loss/total_count, total_acc/total_count
    
for epoch in range(15): 
    loss,acc = train(train_dataloader) 
    # scheduler step 
    scheduler.step() 
    print(f"Epoch {epoch+1}: Accuracy = {acc}, Loss = {loss}")

NotImplementedError: Could not run 'aten::_foreach_norm.Scalar' with arguments from the 'SparseCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_foreach_norm.Scalar' is only available for these backends: [CPU, MPS, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/build/aten/src/ATen/RegisterCPU.cpp:31034 [kernel]
MPS: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:39 [backend fallback]
BackendSelect: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]
Functionalize: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/FunctionalizeFallbackKernel.cpp:280 [backend fallback]
Named: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradCPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradCUDA: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradHIP: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradXLA: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradMPS: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradIPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradXPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradHPU: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradVE: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradLazy: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradMeta: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradMTIA: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradPrivateUse1: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradPrivateUse2: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradPrivateUse3: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
AutogradNestedTensor: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17488 [autograd kernel]
Tracer: registered at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/generated/TraceType_2.cpp:16726 [kernel]
AutocastCPU: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/autocast_mode.cpp:487 [backend fallback]
AutocastCUDA: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/autocast_mode.cpp:354 [backend fallback]
FuncTorchBatched: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:815 [backend fallback]
FuncTorchVmapMode: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/LegacyBatchingRegistrations.cpp:1073 [backend fallback]
VmapMode: fallthrough registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/TensorWrapper.cpp:210 [backend fallback]
PythonTLSSnapshot: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:152 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:487 [backend fallback]
PythonDispatcher: registered at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]


In [8]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")


Using device: mps
