<a href="https://colab.research.google.com/github/sionkimadd/news_classifier/blob/main/News_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **News Classifier**


*   **Objective**: Fine-tune *microsoft/deberta-v3-large* to classify news headlines into 7 categories:
  *   World
  *   Business
  *   Technology
  *   Entertainment
  *   Sports
  *   Science
  *   Health
*   Resources
  *   [Model](https://huggingface.co/microsoft/deberta-v3-large)
  *   [Dataset](https://huggingface.co/datasets/logicalqubit/news_133k)

# Environment Setup

*   Install Libraries

In [None]:
!pip install transformers datasets evaluate wandb scikit-learn


*   Import Libraries

In [None]:
import re
import numpy as np
import pandas as pd
import torch
import evaluate
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
    DataCollatorWithPadding,
    set_seed
)
import wandb

*   Set Random Seed for Reproducibility

In [None]:
SEED = 42
set_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Tracking

*   Initialize Wandb

In [None]:
wandb.init(
    project="deberta-v3-large_news_classifier",
    config={
        "model": "microsoft/deberta-v3-large",
        "seed": SEED,
        "batch_size": 8,
        "learning_rate": 6e-6,
        "num_train_epochs": 2,
        "dataset_size": 133000
    }
)


# Data Preparation

*   Define Label Mapping

In [None]:
LABEL_MAP = {
    "World": 0,
    "Business": 1,
    "Technology": 2,
    "Entertainment": 3,
    "Sports": 4,
    "Science": 5,
    "Health": 6
}
ID2LABEL = {v: k for k, v in LABEL_MAP.items()}
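
As a rough illustration of how the two mappings are used downstream, the sketch below turns one row of classifier scores into a category name. The `decode` helper and the example scores are hypothetical and not part of the notebook.

```python
LABEL_MAP = {
    "World": 0, "Business": 1, "Technology": 2, "Entertainment": 3,
    "Sports": 4, "Science": 5, "Health": 6,
}
ID2LABEL = {v: k for k, v in LABEL_MAP.items()}

def decode(scores):
    """Return the category whose score is highest (hypothetical helper)."""
    best_id = max(range(len(scores)), key=scores.__getitem__)
    return ID2LABEL[best_id]

# Made-up scores for one headline, one value per class id.
example_scores = [0.1, 0.3, 2.4, -0.5, 0.0, 1.1, -1.2]
print(decode(example_scores))  # Technology
```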

*   Load Dataset

In [None]:
df = pd.read_csv("news_133k.csv")

*   Filter Rows and Map Labels to Integer IDs

In [None]:
# Copy the filtered frame so the label assignment below does not write to a view.
df = df[df['label'].isin(LABEL_MAP.keys())].copy()
df['label'] = df['label'].map(LABEL_MAP).astype(int)

*   Sample an Equal Number of Rows per Label

In [None]:
SAMPLES_PER_LABEL = 19000
df = (
    df.groupby('label', group_keys=False)
    .apply(lambda x: x.sample(n=SAMPLES_PER_LABEL, random_state=SEED))
    .reset_index(drop=True)
)

In [None]:
print(f"Dataset Scale: {len(df)}")
print(f"Label Distribution:\n{df['label'].value_counts()}")

Dataset Scale: 133000
Label Distribution:
label
0    19000
1    19000
2    19000
3    19000
4    19000
5    19000
6    19000
Name: count, dtype: int64


*   Split Dataset

In [None]:
train_val, test_df = train_test_split(
    df,
    test_size=0.1,
    stratify=df['label'],
    random_state=SEED
)

train_df, val_df = train_test_split(
    train_val,
    test_size=0.1,
    stratify=train_val['label'],
    random_state=SEED
)

In [None]:
print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

Train: 107730, Validation: 11970, Test: 13300
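
The two chained 10% splits give an effective split of roughly 81% / 9% / 10%, not 80 / 10 / 10. The arithmetic can be sketched as below, assuming `train_test_split` rounds the held-out count up. The `two_stage_split` helper is hypothetical.

```python
import math

def two_stage_split(n_total, test_pct=10, val_pct=10):
    """Sizes after holding out a test set, then a validation set from the rest."""
    n_test = math.ceil(n_total * test_pct / 100)
    n_train_val = n_total - n_test
    n_val = math.ceil(n_train_val * val_pct / 100)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = two_stage_split(133_000)
print(n_train, n_val, n_test)  # 107730 11970 13300

# Share of the full dataset that lands in each split.
for name, size in (("train", n_train), ("validation", n_val), ("test", n_test)):
    print(name, f"{size / 133_000:.2%}")
```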


*   Convert pandas DataFrame to Hugging Face DatasetDict

In [None]:
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "validation": Dataset.from_pandas(val_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False)
})

# Tokenizing

*   Initialize Tokenizer

In [None]:
MODEL_NAME = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.model_max_length = 256


*   Define Tokenization Function

In [None]:
def tokenize_fn(batch):
    return tokenizer(
        batch["title"],
        truncation=True,
        max_length=256,
        padding=False,
        add_special_tokens=True
    )
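
With `padding=False`, the tokenizer returns sequences of different lengths, and padding is left to `DataCollatorWithPadding` at batch time. The sketch below shows the idea of padding each batch only to its own longest sequence. The `pad_batch` helper, the token ids, and `pad_id=0` are illustrative assumptions, not the collator's actual implementation.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three headlines of different lengths.
batch = [[1, 45, 67, 2], [1, 12, 2], [1, 8, 9, 10, 11, 2]]
padded = pad_batch(batch)
print(len(padded["input_ids"][0]))  # 6: the longest headline here, not max_length=256
```

For short inputs such as headlines, this keeps batches far narrower than padding everything to 256 tokens would.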

*   Tokenize Dataset & Format for PyTorch

In [None]:
tokenized_ds = dataset.map(
    tokenize_fn,
    batched=True,
    batch_size=2048,
    remove_columns=["title"],
    num_proc=8
)
tokenized_ds.set_format("torch")


# Model Initialization

*   Load Pretrained Model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=7,
    id2label=ID2LABEL,
    label2id=LABEL_MAP
)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


*   Set Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./deberta-v3-large-news-classifier-checkpoints",
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=15,

    evaluation_strategy="steps",
    eval_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,

    learning_rate=6e-6,
    weight_decay=0.01,
    warmup_steps=50,
    lr_scheduler_type="linear",
    optim="adamw_torch_fused",

    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,

    fp16=True,
    bf16=False,
    max_grad_norm=1.0,

    logging_steps=1000,
    report_to="wandb",

    seed=SEED,
    dataloader_num_workers=8
)
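
To see what `warmup_steps=50` and `lr_scheduler_type="linear"` mean for the learning rate, here is a minimal sketch of a linear warmup followed by linear decay to zero. The `linear_lr` function is hypothetical. The total of 26934 steps is taken from the `TrainOutput` of this run.

```python
def linear_lr(step, base_lr=6e-6, warmup_steps=50, total_steps=26934):
    """Learning rate at a given optimizer step (illustrative, not the library code)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Then decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 25, 50, 13467, 26934):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the warmup covers well under 1% of training, so the schedule is almost entirely linear decay.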



# Metrics & Trainer

*   Define Metric Functions

In [None]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = np.argmax(preds, axis=1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }
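
For readers unfamiliar with macro averaging, the sketch below computes per-class F1 by hand and averages the classes with equal weight, which is what `average="macro"` requests. The `macro_f1` helper and the toy logits are illustrative only.

```python
def macro_f1(preds, labels, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        # A class with no predictions and no true examples scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy logits for four examples and three classes.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4], [0.1, 0.2, 3.0]]
labels = [0, 1, 2, 2]
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]  # argmax per row

accuracy_value = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy_value, round(macro_f1(preds, labels, 3), 4))
```

Because the dataset here is balanced at 19000 examples per label, accuracy and macro F1 track each other closely, which matches the near-identical columns in the training log.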


*   Initialize Trainer

In [None]:
data_collator = DataCollatorWithPadding(tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=15)]
)
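
`early_stopping_patience=15` counts evaluations, not steps: with `eval_steps=1000`, training would stop only after 15 consecutive evaluations (15000 steps) without an improvement in F1. A simplified sketch of that rule follows. The `stop_index` helper and the scores are hypothetical, and the real callback tracks the best metric through the trainer state.

```python
def stop_index(scores, patience):
    """Index of the evaluation at which training would stop, or None."""
    best = float("-inf")
    evals_without_improvement = 0
    for i, score in enumerate(scores):
        if score > best:
            best = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None

# Made-up F1 values from five consecutive evaluations.
scores = [0.90, 0.92, 0.91, 0.92, 0.915]
print(stop_index(scores, patience=3))   # 4
print(stop_index(scores, patience=15))  # None
```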

# Train & Evaluate

In [None]:
trainer.train()



| Step | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1000 | 0.3923 | 0.021418 | 0.996324 | 0.996325 |
| 2000 | 0.0343 | 0.016972 | 0.997661 | 0.997661 |
| 3000 | 0.0311 | 0.015862 | 0.997828 | 0.997827 |
| 4000 | 0.0365 | 0.008855 | 0.998413 | 0.998413 |
| 5000 | 0.0286 | 0.019409 | 0.99741 | 0.99741 |
| 6000 | 0.0223 | 0.01036 | 0.998663 | 0.998663 |
| 7000 | 0.0217 | 0.017186 | 0.998079 | 0.998078 |
| 8000 | 0.0244 | 0.010051 | 0.998914 | 0.998915 |
| 9000 | 0.0102 | 0.012425 | 0.998496 | 0.998496 |
| 10000 | 0.0223 | 0.01126 | 0.99858 | 0.99858 |


TrainOutput(global_step=26934, training_loss=0.02768356897273821, metrics={'train_runtime': 7278.766, 'train_samples_per_second': 29.601, 'train_steps_per_second': 3.7, 'total_flos': 1.271018795786514e+16, 'train_loss': 0.02768356897273821, 'epoch': 2.0})
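
The reported `global_step=26934` can be reproduced from the split size and batch size. The sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

train_examples = 107_730   # size of the training split
batch_size = 8             # per_device_train_batch_size
epochs = 2                 # num_train_epochs
eval_every = 1000          # eval_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
scheduled_evals = total_steps // eval_every

print(steps_per_epoch, total_steps, scheduled_evals)  # 13467 26934 26
```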

*   Evaluate on the Validation Set

In [None]:
val_results = trainer.evaluate(tokenized_ds["validation"])
print(f"Validation Accuracy: {val_results['eval_accuracy']:.4f}")
print(f"Validation F1: {val_results['eval_f1']:.4f}")

Validation Accuracy: 0.9997
Validation F1: 0.9997


*   Evaluate on the Test Set

In [None]:
test_results = trainer.evaluate(tokenized_ds["test"])
print(f"Test Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"Test F1: {test_results['eval_f1']:.4f}")

Test Accuracy: 0.9995
Test F1: 0.9995


*   Save Model

In [None]:
trainer.save_model("deberta-v3-large-news-classifier")
tokenizer.save_pretrained("deberta-v3-large-news-classifier")

('deberta-v3-large-news-classifier/tokenizer_config.json',
 'deberta-v3-large-news-classifier/special_tokens_map.json',
 'deberta-v3-large-news-classifier/spm.model',
 'deberta-v3-large-news-classifier/added_tokens.json',
 'deberta-v3-large-news-classifier/tokenizer.json')

*   Finish Wandb

In [None]:
wandb.finish()

Run history (sparkline charts omitted): eval/accuracy, eval/f1, eval/loss, eval/runtime, eval/samples_per_second, eval/steps_per_second, train/epoch, train/global_step, train/grad_norm, train/learning_rate

Run summary:

| Metric | Value |
|---|---|
| eval/accuracy | 0.99955 |
| eval/f1 | 0.99955 |
| eval/loss | 0.00394 |
| eval/runtime | 102.501 |
| eval/samples_per_second | 129.755 |
| eval/steps_per_second | 16.224 |
| total_flos | 1.271018795786514e+16 |
| train/epoch | 2.0 |
| train/global_step | 26934.0 |
| train/grad_norm | 0.00012 |
