# 20 Newsgroups Text Classification using Transformer (DistilBERT)

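This notebook uses the Hugging Face Transformers library to fine-tune DistilBERT for text classification on a four-category subset of the 20 Newsgroups dataset (`alt.atheism`, `talk.religion.misc`, `comp.graphics`, `sci.space`).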

## Goals
- Use a pre-trained DistilBERT model
- Fine-tune it on the 20 Newsgroups dataset
- Reach at least 90% test accuracy




## 1. Install Required Libraries


In [None]:
# Install necessary libraries
%pip install transformers datasets accelerate scikit-learn torch tqdm -q


## 2. Import Libraries


In [None]:
import numpy as np
import pandas as pd
import torch
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score, classification_report
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    pipeline
)
import time
from tqdm import tqdm

# Set random seeds for reproducibility
import random
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

print("✅ Libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")


✅ Libraries imported successfully
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4


## 3. Load 20 Newsgroups Dataset


In [None]:
print("Loading 20 Newsgroups dataset...")

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space'
]


# Load training and test sets
# Note: headers, footers, and quotes are kept (the fetch_20newsgroups default).
# This raises accuracy, partly because header metadata can reveal the newsgroup.
train_data = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=46
)

test_data = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=46
)



## 4. Data Exploration


In [None]:
# View first sample
print("=" * 60)
print("Sample text:")
print("=" * 60)
print(f"Category: {train_data.target_names[train_data.target[0]]}")
print(f"\nText content (first 500 characters):")
print(train_data.data[0][:500])
print("...")


Sample text:
Category: comp.graphics

Text content (first 500 characters):
From: young@is.s.u-tokyo.ac.jp (YOUNG Shio Hong)
Subject: Looking for Dr. Bala R. Vatti's email address
Nntp-Posting-Host: rabbit-gw
Organization: Dept. of Information Science, Univ. of Tokyo, Japan.
Distribution: comp.graphics
X-Bytes: 660
Lines: 27

Hi!

I am looking for the email address of the author to
"A Generic Solution to Polygon Clipping", 
Communication of the ACM, July 1992, Vol. 35, No. 7. 
I got information about the author as follows
	Mr. Bala R. Vatti
	LCEC, 65 River Road, Hudson,
...
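
Beyond a single sample, it usually helps to check how balanced the categories are before training. The sketch below is a minimal helper for that; `label_counts` and the toy inputs are illustrative names that are not part of the notebook, and the toy data stands in for `train_data.target` and `train_data.target_names`.

```python
from collections import Counter

def label_counts(targets, target_names):
    """Return {category name: number of samples} for integer-encoded targets."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Toy stand-ins for train_data.target and train_data.target_names
toy_names = ["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"]
toy_targets = [1, 1, 2, 0, 3, 2, 1]
toy_counts = label_counts(toy_targets, toy_names)
print(toy_counts)
```

In the notebook this would be called as `label_counts(train_data.target, train_data.target_names)`.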


## 5. Create PyTorch Dataset


In [None]:
class Dataset(torch.utils.data.Dataset):
    """PyTorch Dataset wrapper for 20 Newsgroups"""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Convert encodings to PyTorch tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

print("✅ Dataset class defined")


✅ Dataset class defined


## 6. Load Tokenizer and Process Data


In [None]:
print("Loading DistilBERT tokenizer...")

# Use DistilBERT tokenizer (same base model as ML_DL_models_SA.ipynb)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

print("\nTokenizing texts...")
print("This may take a few minutes...")

# Tokenize training data
train_encodings = tokenizer(
    train_data.data,
    truncation=True,      # Cut texts longer than max_length
    padding=True,         # Pad every text to the longest one in this call
    max_length=512,       # DistilBERT's maximum input length
    return_tensors=None   # Keep Python lists; tensors are built in Dataset.__getitem__
)

# Tokenize test data
test_encodings = tokenizer(
    test_data.data,
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors=None
)

print("✅ Tokenization complete")
print(f"\nTrain encoding shape: {len(train_encodings['input_ids'])} samples")
print(f"Test encoding shape: {len(test_encodings['input_ids'])} samples")


Loading DistilBERT tokenizer...


Tokenizing texts...
This may take a few minutes...
✅ Tokenization complete

Train encoding shape: 2034 samples
Test encoding shape: 1353 samples
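
Because every text is cut at 512 tokens, it can be worth checking how many posts actually reach that limit. The following is a rough sketch under the assumption that the encodings expose an `attention_mask` list per sample (1 for real tokens, 0 for padding); `truncation_stats` is a hypothetical helper, and the toy masks use a limit of 4 for readability.

```python
def truncation_stats(attention_masks, max_length=512):
    """Summarize real (non-padding) token counts from attention masks."""
    lengths = [sum(mask) for mask in attention_masks]
    at_limit = sum(1 for n in lengths if n >= max_length)
    return {
        "mean_tokens": sum(lengths) / len(lengths),
        "max_tokens": max(lengths),
        "share_at_limit": at_limit / len(lengths),
    }

# Toy masks with max_length=4: 1 marks a real token, 0 marks padding
toy_masks = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
toy_stats = truncation_stats(toy_masks, max_length=4)
print(toy_stats)
```

In the notebook this would be called as `truncation_stats(train_encodings["attention_mask"])`. A sample counted as "at limit" was either truncated or happened to be exactly 512 tokens long, so the share is an upper bound on the truncated fraction.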


## 7. Create Dataset Objects


In [None]:
# Create train and test datasets
train_dataset = Dataset(train_encodings, train_data.target)
test_dataset = Dataset(test_encodings, test_data.target)

print("✅ Datasets created")
print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")


✅ Datasets created
Training set size: 2034
Test set size: 1353


## 8. Load Pre-trained Model


In [None]:
print("Loading DistilBERT model...")

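# Load pre-trained DistilBERT model with a fresh classification head
# Note: num_labels=20 covers the full dataset, but only 4 categories are loaded
# above, so num_labels=len(categories) would be enough here
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=20  # ← Size of the output layer (not 2 as in sentiment analysis)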
)

# Calculate model parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("\n✅ Model loaded successfully")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")


Loading DistilBERT model...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



✅ Model loaded successfully
Total parameters: 66,968,852
Trainable parameters: 66,968,852
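
The reported total can be broken down by hand. The arithmetic below is a sketch that assumes the classification head consists of the two newly initialized layers named in the warning above (`pre_classifier` and `classifier`) and that the hidden size is 768; `linear_params` and `head_params` are illustrative helpers, not library functions. It also shows how little the total would change with 4 output classes instead of 20.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def head_params(num_labels, hidden_size=768):
    """Parameters in the pre_classifier and classifier layers."""
    return linear_params(hidden_size, hidden_size) + linear_params(hidden_size, num_labels)

total_reported = 66_968_852          # from the output above (num_labels=20)
encoder_params = total_reported - head_params(20)
total_with_4_labels = encoder_params + head_params(4)

print(head_params(20), head_params(4), total_with_4_labels)
```

Under these assumptions the head accounts for well under 1% of the parameters, so the choice of `num_labels` has almost no effect on model size.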


## 9. Define Evaluation Metrics


In [None]:
def compute_metrics(pred):
    """Calculate accuracy"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

print("✅ Evaluation function defined")


✅ Evaluation function defined
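
To see what `compute_metrics` receives, the sketch below feeds hand-made logits through the same argmax-then-compare logic. `SimpleNamespace` is only a stand-in for the prediction object the `Trainer` passes in, the logits are made up, and `compute_accuracy` is a NumPy-only rewrite used for illustration.

```python
from types import SimpleNamespace
import numpy as np

def compute_accuracy(pred):
    """Same logic as compute_metrics, written with NumPy only."""
    preds = pred.predictions.argmax(-1)          # highest logit per row
    return {"accuracy": float((preds == pred.label_ids).mean())}

# Hypothetical logits for 4 samples and 3 classes
fake_pred = SimpleNamespace(
    predictions=np.array([[2.0, 0.1, 0.3],
                          [0.2, 1.5, 0.1],
                          [0.9, 0.8, 0.1],
                          [0.1, 0.2, 3.0]]),
    label_ids=np.array([0, 1, 1, 2]),
)
toy_result = compute_accuracy(fake_pred)
print(toy_result)
```

Three of the four rows have their largest logit at the true class, so the toy accuracy is 0.75.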


In [None]:
import transformers
print(transformers.__version__)


4.57.3


In [None]:
from transformers import TrainingArguments
import inspect
print(TrainingArguments)
print(inspect.getfile(TrainingArguments))


<class 'transformers.training_args.TrainingArguments'>
/usr/local/lib/python3.12/dist-packages/transformers/training_args.py


## 10. Configure Training Arguments


In [None]:

# Training configuration
training_args = TrainingArguments(
    output_dir='./results_20newsgroups',  # Checkpoints are written here
    num_train_epochs=3,
    # num_train_epochs=6,
    # num_train_epochs=5,
    learning_rate=1.5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.15,                    # Warm up over the first 15% of steps
    # warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    # eval_strategy="steps",
    eval_strategy="epoch",                # Evaluate once per epoch
    # eval_steps=500,
    save_strategy="epoch",                # Must match eval_strategy for load_best_model_at_end
    # save_strategy="steps",
    # save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",     # Key returned by compute_metrics
    greater_is_better=True,
    save_total_limit=2,                   # Keep at most two checkpoints on disk
    report_to="none",                     # Disable external experiment trackers
    do_train=True,
    do_eval=True,
)

print("✅ TrainingArguments configured")
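
The settings above imply a concrete step schedule. The arithmetic below is a rough sketch assuming a single device, no gradient accumulation, and warmup steps rounded up from `warmup_ratio * total_steps`; `schedule_summary` is a hypothetical helper, and 2034 is the training set size reported earlier.

```python
import math

def schedule_summary(num_samples, batch_size, epochs, warmup_ratio):
    """Return (steps per epoch, total optimizer steps, warmup steps)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

steps_per_epoch, total_steps, warmup_steps = schedule_summary(2034, 16, 3, 0.15)
print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has 128 steps per epoch and 384 steps in total, with the learning rate ramping up over the first 58 steps. With `logging_steps=100`, the training loss is logged at steps 100, 200, and 300.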

## 11. Create Trainer and Start Training


In [None]:
# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
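    # Note: the test set doubles as the evaluation set, so the checkpoint
    # chosen by load_best_model_at_end is selected on test data
    eval_dataset=test_dataset,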
    compute_metrics=compute_metrics,
)

print("✅ Trainer created")
print("\n" + "="*60)
print("Starting training...")
print("="*60)

# Start training
start_time = time.time()
train_result = trainer.train()
training_time = time.time() - start_time

print("\n" + "="*60)
print("Training complete!")
print("="*60)
print(f"Total training time: {training_time/60:.2f} minutes")


✅ Trainer created

Starting training...


Epoch,Training Loss,Validation Loss,Accuracy
1,0.0052,0.845903,0.904656
2,0.0041,0.837809,0.906874
3,0.0,0.868997,0.902439



Training complete!
Total training time: 8.22 minutes
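
Since the best checkpoint is picked using the same data that is later reported as the test result, one alternative is to hold out part of the training set for validation. The sketch below only produces the index split; `split_indices` is a hypothetical helper, the 10% fraction is an arbitrary choice, and the split is random rather than stratified by class.

```python
import random

def split_indices(n_samples, val_fraction=0.1, seed=42):
    """Shuffle 0..n_samples-1 and split into train and validation index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(2034, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

The index lists could then be wrapped with `torch.utils.data.Subset(train_dataset, train_idx)` and `torch.utils.data.Subset(train_dataset, val_idx)`, passing the validation subset as `eval_dataset` and keeping `test_dataset` for the final evaluation only.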


## 12. Evaluate Model


In [None]:
print("\n" + "="*60)
print("Final evaluation...")
print("="*60)

# Evaluate on test set
eval_results = trainer.evaluate()

print("\n🎯 Final results:")
print(f"  - Test accuracy: {eval_results['eval_accuracy']:.4f} ({eval_results['eval_accuracy']*100:.2f}%)")
print(f"  - Test loss: {eval_results['eval_loss']:.4f}")



Final evaluation...



🎯 Final results:
  - Test accuracy: 0.9128 (91.28%)
  - Test loss: 0.6925
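
A single accuracy figure hides which categories get confused with each other, which may matter here because `alt.atheism` and `talk.religion.misc` cover related topics. The sketch below builds a confusion matrix with NumPy; `confusion_matrix` is a hand-written helper for illustration, and the labels and predictions are toy values, not results from this model.

```python
import numpy as np

def confusion_matrix(labels, preds, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (np.asarray(labels), np.asarray(preds)), 1)
    return matrix

toy_labels = [0, 0, 1, 2, 3, 3, 3]
toy_preds = [0, 3, 1, 2, 3, 0, 3]
toy_matrix = confusion_matrix(toy_labels, toy_preds, num_classes=4)
per_class_recall = toy_matrix.diagonal() / toy_matrix.sum(axis=1)
print(toy_matrix)
print(per_class_recall)
```

In the notebook, the inputs could come from `output = trainer.predict(test_dataset)`, using `output.label_ids` and `output.predictions.argmax(-1)`. Because the model was created with `num_labels=20`, passing `num_classes=model.config.num_labels` avoids an index error if an unused class is ever predicted.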


## 13. Save Model


In [None]:
# Save model and tokenizer
model_save_path = "./20newsgroups_distilbert_model"

print(f"Saving model to: {model_save_path}")
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print("✅ Model saved successfully!")


Saving model to: ./20newsgroups_distilbert_model
✅ Model saved successfully!


## 14. Use Pipeline for Prediction


In [None]:
print("Creating classification pipeline...")

# Create pipeline (similar to sentiment analysis usage)
device = 0 if torch.cuda.is_available() else -1

classifier = pipeline(
    "text-classification",
    model=model_save_path,
    tokenizer=model_save_path,
    device=device,
    truncation=True
)

print("✅ Pipeline created successfully")


Device set to use cuda:0


Creating classification pipeline...
✅ Pipeline created successfully
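
Because the model was loaded without an `id2label` mapping, the pipeline is expected to return generic labels of the form `LABEL_0`, `LABEL_1`, and so on. The sketch below maps such a label back to a newsgroup name; `decode_label` is a hypothetical helper, the name list is assumed to equal `train_data.target_names`, and the pipeline output shown is made up for illustration.

```python
def decode_label(raw_label, target_names):
    """Map a default label such as 'LABEL_2' to its category name."""
    class_id = int(raw_label.rsplit("_", 1)[1])
    if class_id >= len(target_names):
        return raw_label  # an unused output class; leave it untouched
    return target_names[class_id]

# Assumed to match train_data.target_names for the four categories above
names = ["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"]
# Hypothetical pipeline output, for illustration only
fake_output = [{"label": "LABEL_2", "score": 0.97}]
decoded = [decode_label(item["label"], names) for item in fake_output]
print(decoded)
```

In the notebook this would look like `decode_label(classifier(text)[0]["label"], train_data.target_names)`. Setting `id2label` and `label2id` when loading the model would make the pipeline return the names directly.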
