# Task
Fine-tune a BERT model for news topic classification using the AG News dataset from Hugging Face, evaluate its performance using accuracy and F1-score, and deploy the model using Streamlit or Gradio.

## Install and import necessary libraries

### Subtask:
Install the required libraries such as `transformers`, `datasets`, `torch`, `evaluate`, and `accelerate`. Import the necessary modules for data loading, processing, model training, and evaluation.


**Reasoning**:
Install the necessary libraries using pip.



In [4]:
%pip install transformers datasets torch evaluate accelerate



In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from evaluate import load
import torch

In [6]:
# 1. Load the 'ag_news' dataset
dataset = load_dataset('ag_news')

# 2. Print the dataset structure
print("Dataset structure:")
print(dataset)

# 3. Select one of the splits and print information about it
train_dataset = dataset['train']
print("\nInformation about the training split:")
print(train_dataset)

# 4. Display a few examples from the dataset
print("\nExamples from the training split:")
for i in range(3):
    print(f"Example {i+1}:")
    print(f"  Text: {train_dataset[i]['text']}")
    print(f"  Label: {train_dataset[i]['label']}")

# 5. Examine the distribution of labels in the training set
from collections import Counter

print("\nLabel distribution in the training split:")
# Count examples per label id by reading the label column directly,
# which avoids decoding every full row of the dataset
label_counts = dict(Counter(train_dataset['label']))

print(label_counts)



Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Information about the training split:
Dataset({
    features: ['text', 'label'],
    num_rows: 120000
})

Examples from the training split:
Example 1:
  Text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
  Label: 2
Example 2:
  Text: Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
  Label: 2
Example 3:
  Text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\

In [7]:
# 1. Load the bert-base-uncased tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# 2. Define a tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# 3. Apply the tokenization function to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Print the structure of the tokenized dataset
print("\nTokenized dataset structure:")
print(tokenized_datasets)

# Print an example from the tokenized dataset
print("\nExample from tokenized dataset:")
print(tokenized_datasets['train'][0])


Tokenized dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
})

Example from tokenized dataset:
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2, 'input_ids': [101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006, 26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, 1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010, 2024, 3773, 2665, 2153, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [8]:
# 1. Rename the 'label' column to 'labels'
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# 2. Set the format of the dataset to 'torch'
tokenized_datasets.set_format("torch")

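# 3. Use the train split for training and the test split for evaluation
#    (AG News provides only 'train' and 'test'; there is no separate validation split)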
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]

print("\nTokenized dataset structure after renaming and setting format:")
print(tokenized_datasets)

print("\nStructure of the training dataset:")
print(train_dataset)

print("\nStructure of the validation dataset:")
print(eval_dataset)


Tokenized dataset structure after renaming and setting format:
DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
})

Structure of the training dataset:
Dataset({
    features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 120000
})

Structure of the validation dataset:
Dataset({
    features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 7600
})


In [9]:
# Load the bert-base-uncased model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

print("\nModel loaded:")
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model loaded:
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm

In [None]:
# 1. Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',  # Output directory for checkpoints and logs
    eval_strategy='epoch',  # Evaluate every epoch
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,  # Batch size for evaluation
    num_train_epochs=3,  # Number of training epochs
    weight_decay=0.01,  # Weight decay
    logging_dir='./logs',  # Directory for storing logs
    logging_steps=10, # Log every 10 steps
)

# 2. Initialize the Trainer
trainer = Trainer(
    model=model,  # The loaded model
    args=training_args,  # The training arguments
    train_dataset=train_dataset,  # The training dataset
    eval_dataset=eval_dataset,  # The evaluation dataset
)

# 3. Start the training process
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mprettty565[0m ([33mprettty565-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin




## Evaluate the Model

Evaluate the fine-tuned BERT model on the test dataset using accuracy and F1-score.
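
As a point of reference for these two metrics, the minimal sketch below computes accuracy and support-weighted F1 by hand on a small made-up set of labels. The helper names (`accuracy`, `weighted_f1`) and the toy data are illustrative assumptions and are not part of the notebook, which relies on the `evaluate` library instead.

```python
from collections import Counter

def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def weighted_f1(predictions, references):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(references)
    total = len(references)
    score = 0.0
    for cls, n in support.items():
        pairs = list(zip(predictions, references))
        tp = sum(p == cls and r == cls for p, r in pairs)  # true positives
        fp = sum(p == cls and r != cls for p, r in pairs)  # false positives
        fn = n - tp                                        # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score

# Toy example with four classes (hypothetical data, not model output)
refs = [0, 1, 2, 3, 2, 2]
preds = [0, 1, 2, 3, 2, 1]

print(f"accuracy: {accuracy(preds, refs):.4f}")        # accuracy: 0.8333
print(f"weighted F1: {weighted_f1(preds, refs):.4f}")  # weighted F1: 0.8444
```

Weighting by support keeps the score comparable to accuracy when classes are balanced, which is why the two numbers tend to be close on a roughly uniform label distribution.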

In [1]:
# 1. Import what this cell needs and load the evaluation metrics
#    (the imports are repeated so the cell still runs after a kernel restart;
#    `trainer` must already exist from the training cell above)
import numpy as np
from evaluate import load

accuracy_metric = load("accuracy")
f1_metric = load("f1")

# 2. Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The Trainer passes NumPy arrays; pick the highest-scoring class per example
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    # Weighted average: per-class F1 weighted by the number of examples in each class
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    return {"accuracy": accuracy["accuracy"], "f1": f1["f1"]}

# 3. Attach compute_metrics to the existing Trainer
trainer.compute_metrics = compute_metrics

# 4. Evaluate the model on the test split
evaluation_results = trainer.evaluate()

# 5. Print the evaluation results
print("\nEvaluation Results:")
print(evaluation_results)

# News Topic Classification using BERT

This project fine-tunes a BERT model for news topic classification using the AG News dataset from Hugging Face.

## Table of Contents

- [Task Overview](#task-overview)
- [Setup](#setup)
- [Data](#data)
- [Model](#model)
- [Training](#training)
- [Evaluation](#evaluation)
- [Deployment](#deployment)
- [Results](#results)
- [Contributing](#contributing)
- [License](#license)

## Task Overview

The main objectives of this project are:
1. Fine-tune a BERT model on the AG News dataset for accurate news topic classification.
2. Evaluate the fine-tuned model's performance using standard metrics like accuracy and F1-score.
3. Provide a framework for deploying the trained model (using Streamlit or Gradio).
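
For the deployment objective, whichever UI is chosen, its callback has to turn the model's four output logits into a readable topic. The sketch below shows only that post-processing step in plain Python. It assumes the label order given on the AG News dataset card (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech); the function names and the sample logits are hypothetical, and the wiring to the fine-tuned model and to Streamlit or Gradio is deliberately left out.

```python
import math

# Assumed label order, taken from the AG News dataset card
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_topic(logits):
    """Return the most likely topic name and a {topic: probability} mapping."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], {ID2LABEL[i]: p for i, p in enumerate(probs)}

# Hypothetical logits for a single headline
topic, scores = logits_to_topic([0.1, -1.2, 3.4, 0.3])
print(topic)  # Business
```

Returning the full probability mapping alongside the top label is a design choice that suits UI components able to display per-class confidences.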

## Setup

To get started with this project, you need to install the required libraries. You can do this by running the following command in your environment: