<a href="https://colab.research.google.com/github/shahzadali1-git/DevelopersHub-AI-and-ML-Advanced-Tasks/blob/main/Copy_of_TASK_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Topic Classifier Using BERT
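- Fine-tunes DistilBERT (`distilbert-base-uncased`) for one epoch on an 8,000-example subset of AG News
- Evaluates on 2,000 test examples using accuracy and weighted F1-score
- Serves the fine-tuned model through a live Gradio demo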


## Task 1: News Topic Classifier Using BERT

**Objective:**

Fine-tune a transformer model (e.g., BERT) to classify news headlines into topic categories.

In [None]:
import torch
print("Torch version:", torch.__version__)

In [None]:
!pip install --quiet transformers datasets scikit-learn gradio


**Dataset:**

AG News (available on Hugging Face Datasets), with four topic labels: World, Sports, Business, and Sci/Tech.

**Import Libraries**

In [None]:
import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score
import gradio as gr



**Load the AG News Dataset**

In [None]:
dataset = load_dataset("ag_news")

**Inspect a Sample Record**

In [None]:
dataset["train"][0]

**Load the Tokenizer and Model, Then Tokenize the Dataset**

In [None]:
model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4
)
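
The classification head above is created with four outputs but no class names, so the model config falls back to generic names such as `LABEL_0`. As a minimal sketch, assuming the label order used in the Gradio cell of this notebook (World, Sports, Business, Sci/Tech), the two mappings below could be built and passed to `from_pretrained` through its `id2label` and `label2id` arguments. The names `label_names`, `id2label`, and `label2id` are illustrative and not part of the original notebook.

```python
# Label order assumed from the Gradio cell of this notebook.
label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Integer id -> class name, and the inverse mapping.
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)

# Possible usage (not run here, since it downloads the checkpoint):
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name, num_labels=len(label_names),
#     id2label=id2label, label2id=label2id,
# )
```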


In [None]:
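def tokenize_function(batch):
    # With batched=True, batch["text"] is a list of strings
    return tokenizer(
        batch["text"],
        padding="max_length",  # pad every example to exactly max_length tokens
        truncation=True,       # cut off texts longer than max_length
        max_length=128
    )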

tokenized_dataset = dataset.map(tokenize_function, batched=True)
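
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch that pads or truncates a list of token ids to a fixed length and builds the matching attention mask. It is not the DistilBERT tokenizer: `pad_or_truncate`, the token ids, and `pad_id=0` are made up for illustration, and the real tokenizer also takes care of special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)         # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded; a long one gets truncated.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to 128 tokens keeps tensor shapes uniform across batches, at the cost of computing over padding positions for short headlines.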


**Prepare Dataset for Training**

In [None]:
tokenized_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "label"]
)


In [None]:
tokenized_dataset["train"][0]


**Evaluation Metrics (Accuracy + F1)**

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)

    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted")
    }
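
`average="weighted"` computes one F1 score per class and averages them, weighting each class by its number of true examples (its support). The sketch below reproduces that calculation in plain Python on made-up labels. `weighted_f1`, `y_true`, and `y_pred` are illustrative names; in the notebook itself, scikit-learn's `f1_score` does this work.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(y_true, y_pred) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(y_true, y_pred) if y != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_true               # weight by class support
    return total / len(y_true)

# Toy data: 4 classes, 8 examples (made up for illustration).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]

accuracy = sum(y == p for y, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```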


**Training Settings**

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    report_to="none"  # disables logging integrations such as wandb, avoiding the login prompt
)



**Train the Model (8,000-Example Subset, 1 Epoch)**

In [None]:
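# Reload the pretrained checkpoint so training starts from fresh weights even if this cell is re-run
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)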

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].shuffle(seed=42).select(range(8000)),
    eval_dataset=tokenized_dataset["test"].select(range(2000)),
    compute_metrics=compute_metrics
)
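
As a rough check on how long training will run, the number of optimizer steps follows from the subset size, batch size, and epoch count above. This sketch assumes a single device and no gradient accumulation; with several GPUs the effective batch size, and so the step count, would differ. The variable names are illustrative.

```python
import math

train_examples = 8000   # size of the shuffled training subset
batch_size = 16         # per_device_train_batch_size
num_epochs = 1          # num_train_epochs
num_devices = 1         # assumption: one GPU or CPU

steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs

eval_examples = 2000    # size of the evaluation subset
eval_batch_size = 16    # per_device_eval_batch_size
eval_batches = math.ceil(eval_examples / (eval_batch_size * num_devices))

print("optimizer steps:", total_steps)
print("evaluation batches:", eval_batches)
```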


In [None]:
trainer.train()
trainer.evaluate()


**Deploy with Gradio**

In [None]:
labels = ["World", "Sports", "Business", "Sci/Tech"]

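model.eval()  # disable dropout for inference

def predict_news(text):
    # Use the same 128-token limit as training and move the tensors to the
    # model's device (after training, the model may be on the GPU)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    return labels[prediction]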

gr.Interface(fn=predict_news, inputs="text", outputs="text",
             title="News Topic Classifier").launch()
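
The demo returns only the top label. If class confidences are also wanted, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python on made-up logits; `softmax`, `example_logits`, and `scores` are illustrative names. A label-to-probability dictionary like `scores` is the kind of value Gradio's `Label` output component can display, though that wiring is an assumption and not part of the original demo.

```python
import math

labels = ["World", "Sports", "Business", "Sci/Tech"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                               # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one headline (illustration only).
example_logits = [0.2, 3.1, -0.5, 1.0]
probs = softmax(example_logits)

# Map each class name to its probability and pick the most likely class.
scores = {name: p for name, p in zip(labels, probs)}
top_label = max(scores, key=scores.get)

print(top_label)
print({name: round(p, 3) for name, p in scores.items()})
```

With the real model, `torch.softmax(outputs.logits, dim=1)` would play the role of `softmax` here.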
