# HW2-1 Text classification

### Preprocessing

#### Explain how the data is processed. 

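1. Load the JSON Lines file (one JSON object per line)

2. Tokenize the headline and short description with spaCy and lowercase the tokens

3. Build the vocabulary and map raw labels to integer IDs

4. Convert the token sequences to ID tensors, padding manually to the longest sequence

5. Wrap the tensors in a Dataset and batch them with a DataLoader for training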

I chose the spaCy tokenizer (en_core_web_sm) because it is a reliable English tokenizer that handles punctuation, contractions, and special cases. I also build an explicit label mapping (label2id / id2label) so that raw labels map to contiguous class indices for training and back to the original label values at inference.
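
To make steps 3 and 4 concrete, here is a minimal sketch on two toy sentences. It is plain Python and uses `str.split()` as a stand-in for the spaCy tokenizer (an assumption made only so the example runs without the en_core_web_sm model); the sentences and the `toy_` names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus; str.split() stands in for the spaCy tokenizer here.
toy_texts = ["Markets rally again", "Markets fall"]
toy_tokens = [text.lower().split() for text in toy_texts]

# Count tokens, then assign IDs after the two reserved entries.
toy_counter = Counter(tok for toks in toy_tokens for tok in toks)
toy_vocab = {"<PAD>": 0, "<UNK>": 1}
for word in toy_counter:
    toy_vocab[word] = len(toy_vocab)

# Map tokens to IDs and right-pad every sequence to the longest one.
toy_max_len = max(len(toks) for toks in toy_tokens)
toy_sequences = []
for toks in toy_tokens:
    seq = [toy_vocab.get(tok, toy_vocab["<UNK>"]) for tok in toks]
    seq += [toy_vocab["<PAD>"]] * (toy_max_len - len(seq))
    toy_sequences.append(seq)

print(toy_vocab)      # {'<PAD>': 0, '<UNK>': 1, 'markets': 2, 'rally': 3, 'again': 4, 'fall': 5}
print(toy_sequences)  # [[2, 3, 4], [2, 5, 0]]
```

The second sentence is one token shorter, so it receives a single trailing 0 (the <PAD> ID), which is what the model later masks out.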

In [None]:
import spacy
from collections import Counter
import torch
import torch.nn as nn
import torch.nn.functional as F
import json

json_file = './News_train.json'
dataset = []
with open(json_file, 'r') as f:
    for line in f:
        if line.strip():
            dataset.append(json.loads(line))

print(f"Loaded {len(dataset)} samples")
print(f"Example: {dataset[0]}")

# Load spaCy tokenizer
nlp = spacy.load("en_core_web_sm")

# collect all raw labels
raw_label_list = [item['label'] for item in dataset]
unique_labels = sorted(set(raw_label_list))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}
print(f"Unique labels: {unique_labels}")
print(f"Label to ID mapping: {label2id}")

tokenized_texts = []
labels = []
vocab_counter = Counter()

print("Label mapping:", label2id)

for item in dataset:
    text = f"{item['headline']} {item['short_description']}"
    tokens = [token.text.lower() for token in nlp(text)]
    tokenized_texts.append(tokens)

    raw_label = item['label']
    label_idx = label2id[raw_label]
    labels.append(label_idx)

    vocab_counter.update(tokens)

# Build vocab
vocab = {"<PAD>": 0, "<UNK>": 1}
for word in vocab_counter:
    vocab[word] = len(vocab)

print(f"Vocab size: {len(vocab)}")

max_len = max(len(tokens) for tokens in tokenized_texts)
sequences = []

for tokens in tokenized_texts:
    seq = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
    seq += [vocab["<PAD>"]] * (max_len - len(seq))
    sequences.append(seq)

# Convert input, label to tensor
input_tensor = torch.tensor(sequences)
labels_tensor = torch.tensor(labels)

print(f"Input tensor shape: {input_tensor.shape}")
print(f"Labels tensor shape: {labels_tensor.shape}")

Loaded 135608 samples
Example: {'id': 0, 'headline': 'Trump Officials Repeatedly Violated Hatch Act, Probe Finds', 'short_description': 'At least 13 former Trump administration officials violated the law by intermingling campaigning with their official government duties, according to a new investigation.', 'label': 0.0}
Unique labels: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
Label to ID mapping: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5, 6.0: 6, 7.0: 7, 8.0: 8, 9.0: 9, 10.0: 10, 11.0: 11, 12.0: 12, 13.0: 13, 14.0: 14}
Label mapping: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4, 5.0: 5, 6.0: 6, 7.0: 7, 8.0: 8, 9.0: 9, 10.0: 10, 11.0: 11, 12.0: 12, 13.0: 13, 14.0: 14}
Vocab size: 75541
Input tensor shape: torch.Size([135608, 310])
Labels tensor shape: torch.Size([135608])


In [2]:
from torch.utils.data import Dataset, DataLoader

class NewsDataset(Dataset):
    def __init__(self, input_tensor, labels_tensor):
        self.inputs = input_tensor
        self.labels = labels_tensor

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]
    
class TestNewsDataset(Dataset):
    def __init__(self, input_tensor, ids):
        self.inputs = input_tensor
        self.ids = ids

    def __len__(self):
        return len(self.ids)
    
    def __getitem__(self, idx):
        return self.inputs[idx], self.ids[idx]
    
dataset = NewsDataset(input_tensor, labels_tensor)
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

### Build Transformer 

#### Discuss the model structure or hyperparameter setting in your design.

1. Positional Encoding

    **Self-attention is permutation-equivariant.** Unlike RNNs, Transformers have **no inherent notion of sequence order.** Therefore, a **sinusoidal positional encoding** is added to the input embeddings to inject **sequence order information**.

    The original paper hypothesizes that sinusoidal encodings may let the model extrapolate to sequences longer than those seen during training.

    The maximum length is set to 512, a commonly used size for short to moderate-length text. It comfortably covers the longest training sequence here (310 tokens of headline + description).

2. Embedding Layer

    Converts token IDs to dense vectors. **d_model=64 is downsized** compared to the original paper (d_model=512) because the dataset is **smaller** and the inputs are short and simple (headlines + descriptions).

    * Tradeoff: **smaller d_model = faster training + less overfitting risk, but less expressive power.**

3. TransformerEncoderLayer

    Follows the original paper. Instead of a **single attention function**, it uses **multiple heads** so the model can **jointly attend to information from different representation subspaces.**

    The original paper sets n_head to 8 alongside d_model=512. Since d_model is only 64 here, n_head is lowered to 4, which still leaves each head 16 dimensions to work with.

    Keeps the dim_feedforward setting of the original paper (2048). Even with d_model=64, this wide MLP gives each token's representation additional **non-linearity and capacity**.

4. Dropout + Normalization

    Dropout is applied in MultiheadAttention, in the feedforward MLP, and on the embedding output. The dropout ratio is set higher (0.3) than in the original design (0.1) for stronger regularization.

    LayerNorm is applied after attention + residual and after feedforward + residual, matching the Post-Norm design of the original paper.

5. Classifier + Pooling

    The original Transformer is an encoder-decoder model without a classification head; BERT-style classifiers usually read a [CLS] token, and decoder-style models often use the last token. Instead, I add no special token and use masked mean pooling: it ignores <PAD> tokens, averages over real tokens only, and is simpler than adding a [CLS] token.

6. Other hyperparameters

    num_layers=2, compared to 6 layers in the original, makes the model more lightweight (faster, less overfitting risk).

    dropout=0.3 gives more aggressive regularization.

    The classifier is a single linear layer on top of the pooled representation.
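
As a rough check on these choices, the sketch below counts trainable parameters for the configuration described above. The formulas assume the layer layout used in this notebook (packed Q/K/V projection and output projection with biases, two feedforward linears, two LayerNorms); `encoder_layer_params` is a hypothetical helper, and the vocabulary size and class count are the values printed by the preprocessing cell.

```python
def encoder_layer_params(d_model: int, dim_feedforward: int) -> int:
    """Trainable parameters of one encoder layer under the stated assumptions."""
    # Attention: packed Q/K/V projection plus output projection, both with biases.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feedforward block: two linear layers with biases.
    feedforward = (d_model * dim_feedforward + dim_feedforward) + (
        dim_feedforward * d_model + d_model
    )
    # Two LayerNorms, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feedforward + layer_norms


d_model, nhead, num_layers, dim_feedforward = 64, 4, 2, 2048
vocab_size, num_classes = 75541, 15  # from the preprocessing output

head_dim = d_model // nhead  # the head count splits d_model; it adds no parameters
per_layer = encoder_layer_params(d_model, dim_feedforward)
embedding = vocab_size * d_model
classifier = d_model * num_classes + num_classes
total = embedding + num_layers * per_layer + classifier

print(f"head_dim={head_dim}")                 # head_dim=16
print(f"per_layer={per_layer:,}")             # per_layer=281,152
print(f"embedding={embedding:,}")             # embedding=4,834,624
print(f"total={total:,}")                     # total=5,397,903
```

Under these assumptions the embedding table holds most of the roughly 5.4M parameters, and within each encoder layer the 2048-wide feedforward block dominates the attention weights. The sinusoidal positional encoding is a buffer and contributes no trainable parameters.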

In [3]:
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x
    
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.3):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, src, padding_mask=None):
        # src shape: (seq_len, batch_size, d_model)
        src2, _ = self.self_attn(src, src, src, key_padding_mask=padding_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
    
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, num_classes=2, dropout=0.3, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
        self.embedding_dropout = nn.Dropout(dropout)
        self.pos_encoder = PositionalEncoding(d_model)
        
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, nhead, dropout=dropout) for _ in range(num_layers)
        ])

        self.classifier = nn.Linear(d_model, num_classes)
        self.pad_idx = pad_idx

    def forward(self, src):
        # src shape: (batch_size, seq_len)
        padding_mask = (src == self.pad_idx)
        src = self.embedding(src) # (batch_size, seq_len, d_model)
        src = self.embedding_dropout(src)
        src = self.pos_encoder(src)

        src = src.permute(1, 0, 2) # (seq_len, batch_size, d_model)

        for layer in self.layers:
            src = layer(src, padding_mask=padding_mask)

        src = src.permute(1, 0, 2) # (batch_size, seq_len, d_model)

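        # Masked mean pooling: average over non-<PAD> positions only
        mask = (~padding_mask).unsqueeze(-1)  # (batch_size, seq_len, 1)
        src = src * mask

        summed = src.sum(1)
        lengths = mask.sum(1)  # number of real tokens per sample

        pooled = summed / lengths.clamp(min=1)  # guard against division by zero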
        
        logits = self.classifier(pooled)
        return logits
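
The masked mean pooling at the end of `forward` can be checked by hand on a tiny example. The sketch below is plain Python rather than PyTorch (an assumption made so it runs without the trained model), and `masked_mean` and the toy values are hypothetical; it only illustrates that <PAD> positions do not influence the pooled vector.

```python
def masked_mean(hidden, token_ids, pad_idx=0):
    """Average the hidden vectors of the non-padding positions of one sequence."""
    keep = [tok != pad_idx for tok in token_ids]
    length = max(sum(keep), 1)  # avoid dividing by zero for an all-padding input
    dim = len(hidden[0])
    summed = [0.0] * dim
    for vec, use in zip(hidden, keep):
        if use:
            for j in range(dim):
                summed[j] += vec[j]
    return [s / length for s in summed]


# One sequence of four positions with d_model=2; the last two positions are <PAD>.
token_ids = [5, 9, 0, 0]
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [-7.0, 8.0]]

print(masked_mean(hidden, token_ids))          # [2.0, 3.0]
print([sum(col) / 4 for col in zip(*hidden)])  # [24.25, 28.5] (unmasked mean)
```

Without the mask, whatever the encoder outputs at padded positions would leak into the average, and with padding up to 310 tokens those positions would outnumber the real tokens for most samples.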

### Training

In [8]:
# Hyperparameters
vocab_size = len(vocab)
d_model = 64
nhead = 4
num_layers = 2
num_classes = len(set(labels_tensor.tolist()))

model = TransformerClassifier(vocab_size, d_model, nhead, num_layers, num_classes, dropout=0.3)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

epochs = 20
for epoch in range(epochs):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_inputs, batch_labels in dataloader:
        batch_inputs = batch_inputs.to(device)
        batch_labels = batch_labels.long().to(device)
        optimizer.zero_grad()
        outputs = model(batch_inputs)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        _, predicted = torch.max(outputs, 1)
        correct += (predicted == batch_labels).sum().item()
        total += batch_labels.size(0)

    avg_loss = total_loss / len(dataloader)
    acc = correct / total 
    print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_loss:.4f} | Accuracy: {acc*100:.4f}%")

Using device: cuda
Epoch 1/20 | Loss: 1.5877 | Accuracy: 51.7116%
Epoch 2/20 | Loss: 1.0370 | Accuracy: 68.7201%
Epoch 3/20 | Loss: 0.8482 | Accuracy: 74.2493%
Epoch 4/20 | Loss: 0.7414 | Accuracy: 77.3435%
Epoch 5/20 | Loss: 0.6678 | Accuracy: 79.2999%
Epoch 6/20 | Loss: 0.6062 | Accuracy: 80.9495%
Epoch 7/20 | Loss: 0.5477 | Accuracy: 82.5497%
Epoch 8/20 | Loss: 0.4922 | Accuracy: 84.1042%
Epoch 9/20 | Loss: 0.4424 | Accuracy: 85.4891%
Epoch 10/20 | Loss: 0.4033 | Accuracy: 86.7095%
Epoch 11/20 | Loss: 0.3740 | Accuracy: 87.6077%
Epoch 12/20 | Loss: 0.3502 | Accuracy: 88.3525%
Epoch 13/20 | Loss: 0.3313 | Accuracy: 88.9483%
Epoch 14/20 | Loss: 0.3126 | Accuracy: 89.5692%
Epoch 15/20 | Loss: 0.3040 | Accuracy: 89.8214%
Epoch 16/20 | Loss: 0.2923 | Accuracy: 90.1798%
Epoch 17/20 | Loss: 0.2834 | Accuracy: 90.4076%
Epoch 18/20 | Loss: 0.2745 | Accuracy: 90.8685%
Epoch 19/20 | Loss: 0.2720 | Accuracy: 90.8339%
Epoch 20/20 | Loss: 0.2623 | Accuracy: 91.2409%


### Inference

In [None]:
import json
import csv

test_file = './News_test.json'

with open(test_file, 'r') as f:
    test_dataset = []
    for line in f:
        if line.strip():
            test_dataset.append(json.loads(line))

print(f"Loaded {len(test_dataset)} test samples")

test_tokenized_texts = []
test_ids = []

for item in test_dataset:
    text = f"{item['headline']} {item['short_description']}"
    tokens = [token.text.lower() for token in nlp(text)]
    test_tokenized_texts.append(tokens)
    test_ids.append(item['id'])

test_sequences = []
for tokens in test_tokenized_texts:
    seq = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
    seq = seq[:max_len]  # truncate test texts longer than the training max_len
    seq += [vocab["<PAD>"]] * (max_len - len(seq))
    test_sequences.append(seq)

test_input_tensor = torch.tensor(test_sequences)

model.eval()

test_input_tensor = test_input_tensor.to(device)

test_dataset = TestNewsDataset(test_input_tensor, test_ids)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

results = []

with torch.no_grad():
    for batch_inputs, batch_ids in test_loader:
        batch_inputs = batch_inputs.to(device)
        outputs = model(batch_inputs)
        _, predicted = torch.max(outputs, 1)
        for id_, label in zip(batch_ids, predicted.cpu()):
            original_label = id2label[label.item()]
            results.append((id_.item(), original_label))

output_file = 'predictions.csv'

with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["ID", "label"])
    for id_, label in results:
        writer.writerow([id_, label])

print(f"Predictions saved to {output_file}")

Loaded 1000 test samples
Predictions saved to predictions.csv
