## News Topic Classification using Transformer

Approach: Fine-tune only the final layers (the classification head) of the Transformer model


This approach is often used when the downstream task differs from the task the Transformer model was pre-trained on: the pre-trained body is kept frozen and only the final layers are updated to make predictions for the new task.


1. Read DATASET
2. Clean Text data
3. Prepare Tokenizer
4. Initialize Model
5. Train Model
6. Evaluate Model

Run the cells below if these libraries are not installed yet.

In [49]:
# !pip install datasets evaluate
# !pip install transformers[torch]
# !pip install torchinfo

In [1]:
# dependencies
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments,Trainer
import evaluate
from sklearn.metrics import accuracy_score, f1_score

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Read Dataset

In [5]:
# # load dataset from GoogleDrive
# train_df = pd.read_csv('/content/drive/MyDrive/Dataset/train.csv')
# test_df = pd.read_csv('/content/drive/MyDrive/Dataset/test.csv')
# load dataset from GoogleDrive
train_df = pd.read_csv('/content/drive/MyDrive/Fractal/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Fractal/test.csv')

In [6]:
# Understand the number of rows and columns for both train and test dataset
train_df.shape, test_df.shape

((120000, 3), (7600, 3))

In [7]:
# To reduce computation time, use a 30% sample of the training set (~36K rows)
train_df = train_df.sample(frac=0.3)
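
Because `sample(frac=0.3)` is called without a seed, it draws a different subset on every run, so the class counts shown later will vary slightly between runs. A minimal sketch of a reproducible, per-class alternative follows. It uses a small synthetic frame in place of the real CSV; `toy_df` and the seed `42` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Synthetic stand-in for train_df: 4 classes with 100 rows each (hypothetical data).
toy_df = pd.DataFrame({
    "Class Index": [c for c in (1, 2, 3, 4) for _ in range(100)],
    "Title": [f"title {i}" for i in range(400)],
})

# Sample 30% within each class; a fixed random_state makes the draw repeatable.
sampled = toy_df.groupby("Class Index").sample(frac=0.3, random_state=42)

# Every class keeps the same number of rows.
print(sampled["Class Index"].value_counts().sort_index())
```

Sampling inside each group keeps the classes exactly balanced, whereas a plain random sample is only balanced on average.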

In [8]:
# Let's look at a few rows of the sampled dataset
train_df.head()

Unnamed: 0,Class Index,Title,Description
51897,1,"Car bomb explodes in Iraqi town, several killed",A car bomb blew up in the western Iraqi town o...
92471,1,China #39;regrets #39; submarine #39;s trip i...,CHINA has admitted a quot;regrettable quot; i...
34817,4,Nokia Revamps VPN System Security (NewsFactor),NewsFactor - Nokia (NYSE: NOK) has upgraded th...
43052,1,Embattled Blair pushes for third term,Here we are again .. my toughest week yet sinc...
35913,3,Oil Tops \$48 After U.S. Inventories Dive,LONDON (Reuters) - Oil prices jumped more tha...


The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples.

The first column is Class Id, the second column is Title and the third column is Description. The class ids are numbered 1-4 where 1 represents World, 2 represents Sports, 3 represents Business and 4 represents Sci/Tech.

We remap the Class Index values from the range 1-4 to the range 0-3, since the model expects labels that start at 0.

In [9]:
# Map the class index from the range 1-4 to the range 0-3
map_dict = {1:0,2:1,3:2,4:3}
train_df['Class Index'] = train_df['Class Index'].map(map_dict)
test_df['Class Index'] = test_df['Class Index'].map(map_dict)
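
One thing to keep in mind with `Series.map` and a dictionary: any value that is missing from the dictionary becomes `NaN` rather than raising an error. The sketch below shows a quick coverage check on a hypothetical column that contains one unexpected class id (the series `raw` is made up for illustration).

```python
import pandas as pd

map_dict = {1: 0, 2: 1, 3: 2, 4: 3}

# Hypothetical column containing one unexpected class id (5).
raw = pd.Series([1, 2, 3, 4, 5], name="Class Index")
mapped = raw.map(map_dict)

# Values absent from the dictionary come back as NaN; list the offenders.
unmapped = raw[mapped.isna()].tolist()
print(unmapped)
```

An empty `unmapped` list would confirm that every class id was covered by the mapping.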

In [10]:
# Get the number of classes(News topics)
num_labels = train_df['Class Index'].nunique()
num_labels

4

In [11]:
# What is the class distribution of topics?
train_df['Class Index'].value_counts()

2    9026
0    9016
1    8996
3    8962
Name: Class Index, dtype: int64
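
To put these counts in perspective, the short sketch below turns them into class shares and computes the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the output above; the baseline refers to this training sample only.

```python
# Class counts copied from the value_counts() output above.
counts = {2: 9026, 0: 9016, 1: 8996, 3: 8962}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = max(counts.values()) / total

print(total)
for label in sorted(shares):
    print(label, f"{shares[label]:.2%}")
print(f"majority-class baseline: {majority_baseline:.2%}")
```

Every class holds very close to a quarter of the sample, so any useful model has to beat roughly 25% accuracy.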

After remapping, the class ids are numbered 0-3:


            0 - World
            1 - Sports
            2 - Business
            3 - Sci/Tech

### Preprocessing

In [12]:
def preprocess(dataset):
    """
    Clean the dataset : Both train and test dataset
    - Combine title and description to a new column

    Input : Raw csv data
    Output : Cleaned data
    """
    # Combine title and description
    dataset['News'] = dataset['Title'] + ': \n' + dataset['Description']
    dataset.drop(['Title', 'Description'], axis = 1, inplace = True)

    # rename to text and labels
    dataset.rename(columns={'News':'text','Class Index':'label'},inplace=True)
    return dataset

In [13]:
# Do the preprocessing for both train and test dataset
train_df = preprocess(train_df)
test_df = preprocess(test_df)

In [14]:
# Data after the preprocessing - Only two columns here, label and text
train_df.head()

Unnamed: 0,label,text
51897,0,"Car bomb explodes in Iraqi town, several kille..."
92471,0,China #39;regrets #39; submarine #39;s trip i...
34817,3,Nokia Revamps VPN System Security (NewsFactor)...
43052,0,Embattled Blair pushes for third term: \nHere ...
35913,2,Oil Tops \$48 After U.S. Inventories Dive: \n ...


In [15]:
# create Huggingface datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

A Hugging Face `Dataset` describes its schema through its `features` attribute, which is a dictionary of column name and column type pairs.

### Prepare Tokenizer

Convert the data into a numerical representation suitable for input into the Transformer model. This typically involves tokenizing the text into subwords or words, mapping the tokens to integers, and encoding the input as a tensor.
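
To make the subword step concrete, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece, followed by the token-to-integer lookup. The nine-entry `vocab` is hypothetical (the real vocabulary has roughly 30,000 entries), and the real tokenizer also normalizes text and splits punctuation, which this sketch skips.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a piece that continues a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary mapping tokens to integer ids.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "oil": 4, "price": 5, "##s": 6, "jump": 7, "##ed": 8}


def encode(text):
    """Lowercase, split on whitespace, split into pieces, map to ids."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("Oil prices jumped")
print(tokens)
print(ids)
```

Words that are not in the vocabulary as a whole, such as "prices", are broken into known pieces ("price" and "##s") instead of being mapped to the unknown token.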

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance.

Here we use the model 'distilbert-base-uncased', so we load the tokenizer for DistilBERT with the "base" architecture in its "uncased" version.

In [16]:
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [17]:
# Tokenize the text column with truncation and padding enabled
def tokenize(data):
    return tokenizer(data['text'],truncation=True,padding=True)

The `truncation` argument is set to `True`, so the tokenizer truncates sequences that are longer than the maximum length the model accepts (512 tokens for DistilBERT). Setting `padding` to `True` pads shorter sequences with the padding token (id 0) up to the length of the longest sequence in the batch being tokenized.
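
The following pure-Python sketch illustrates what truncation and padding do to a batch of token-id lists and how the attention mask marks real tokens versus padding. The ids and the `max_length` of 5 are hypothetical, and the helper `truncate_and_pad` is an illustration rather than the tokenizer's actual implementation (for instance, the real tokenizer keeps the closing special token when it truncates; this sketch simply cuts the list).

```python
def truncate_and_pad(batch_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in truncated]
    return input_ids, attention_mask


# Hypothetical token ids for three texts of different lengths.
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
input_ids, attention_mask = truncate_and_pad(batch, max_length=5)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

After this step every row has the same length, which is what allows the batch to be stacked into a single tensor.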

In [18]:
# tokenize both train and test dataset
tokenized_train = train_dataset.map(tokenize,batched=True)
tokenized_test = test_dataset.map(tokenize,batched=True)
# batched=True speeds up processing by applying the tokenize function to batches of examples


The tokenizer adds two new columns to the dataset: input_ids and attention_mask. These are the model inputs. Unlike BERT, DistilBERT does not use token_type_ids.

### Modelling

In [35]:
id2label = {0: "World", 1: "Sports",2:"Business",3:'Sci/Tech'}
label2id = {"World":0, "Sports":1,"Business":2,'Sci/Tech':3}
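
Since the two dictionaries must stay exact inverses of each other, one option is to write only `id2label` by hand and derive `label2id` from it. This is a small sketch of that idea, not a change the original notebook makes.

```python
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Invert the mapping so the two dictionaries cannot drift apart.
label2id = {label: idx for idx, label in id2label.items()}

# Round-trip check: name -> id -> name returns the same name.
assert all(id2label[label2id[name]] == name for name in label2id)
print(label2id)
```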

In [36]:
# Initialize the model; the warning about newly initialized weights is expected, because the classification head has not been trained yet
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels = num_labels,
                                                           id2label=id2label, label2id=label2id)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [37]:
# Get the total number of weights in the model
print("num params:", model.num_parameters())

num params: 66956548


In [38]:
# Let's see the different layers in the model
for name, param in model.named_parameters():
    print(name)

distilbert.embeddings.word_embeddings.weight
distilbert.embeddings.position_embeddings.weight
distilbert.embeddings.LayerNorm.weight
distilbert.embeddings.LayerNorm.bias
distilbert.transformer.layer.0.attention.q_lin.weight
distilbert.transformer.layer.0.attention.q_lin.bias
distilbert.transformer.layer.0.attention.k_lin.weight
distilbert.transformer.layer.0.attention.k_lin.bias
distilbert.transformer.layer.0.attention.v_lin.weight
distilbert.transformer.layer.0.attention.v_lin.bias
distilbert.transformer.layer.0.attention.out_lin.weight
distilbert.transformer.layer.0.attention.out_lin.bias
distilbert.transformer.layer.0.sa_layer_norm.weight
distilbert.transformer.layer.0.sa_layer_norm.bias
distilbert.transformer.layer.0.ffn.lin1.weight
distilbert.transformer.layer.0.ffn.lin1.bias
distilbert.transformer.layer.0.ffn.lin2.weight
distilbert.transformer.layer.0.ffn.lin2.bias
distilbert.transformer.layer.0.output_layer_norm.weight
distilbert.transformer.layer.0.output_layer_norm.bias
distil

We keep the weights of the pre-trained model parameters frozen and optimize only the weights of the head layers.

In [39]:
# Freeze all model parameters (the head is unfrozen again in the next cell)
for param in model.parameters():
    param.requires_grad = False

Here, we optimize only the pre_classifier and classifier layers.

In [40]:
# unfreeze head layers weights - pre_classifier and classifier
for param in model.pre_classifier.parameters():
    param.requires_grad = True

for param in model.classifier.parameters():
    param.requires_grad = True

In [41]:
print("Number of params:", model.num_parameters())
print("Number of trainable params:", model.num_parameters(only_trainable=True))
print("Percentage of trainable parameters to total parameters:", 100 * model.num_parameters(only_trainable=True) / model.num_parameters())

Number of params: 66956548
Number of trainable params: 593668
Percentage of trainable parameters to total parameters: 0.886646665237282
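
The trainable count can be checked by hand. The sketch below assumes that `pre_classifier` is a linear layer from the hidden size to the hidden size, that `classifier` is a linear layer from the hidden size to the number of labels, and that the hidden size of distilbert-base-uncased is 768. Each linear layer has one weight per input-output pair plus one bias per output.

```python
hidden_size = 768   # assumed hidden width of distilbert-base-uncased
num_labels = 4      # World, Sports, Business, Sci/Tech
total_params = 66_956_548  # total reported by model.num_parameters()


def linear_params(in_features, out_features):
    """Weights (in * out) plus biases (out) of one fully connected layer."""
    return in_features * out_features + out_features


pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
trainable = pre_classifier + classifier
percentage = 100 * trainable / total_params

print(pre_classifier, classifier, trainable)
print(percentage)
```

The sum matches the 593,668 trainable parameters reported above, which supports the assumption about the two head layers.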


In [42]:
# Data collator that dynamically pads each batch to its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
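
To illustrate why dynamic padding can help, the sketch below counts the token slots a model would process when every sequence is padded to the global maximum versus when each batch is padded only to its own longest sequence. The `lengths` list and the batch size of 4 are hypothetical, and `padded_tokens` is a helper written for this illustration.

```python
def padded_tokens(lengths, batch_size, dynamic):
    """Total token slots after padding, per batch (dynamic) or globally."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding pads to the batch maximum, static to the global one.
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total


# Hypothetical token counts for eight texts: four short, four long.
lengths = [12, 15, 9, 14, 60, 58, 61, 55]

static_total = padded_tokens(lengths, batch_size=4, dynamic=False)
dynamic_total = padded_tokens(lengths, batch_size=4, dynamic=True)
print(static_total, dynamic_total)
```

Note that the `tokenize` function in this notebook already pads with `padding=True`, so sequences reach the collator pre-padded and it likely has little left to do; leaving padding to the collator alone would be one way to get the per-batch savings.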

In [43]:
# Get the metric from evaluate
metric = evaluate.load('accuracy')

In [44]:
# Accuracy-only metric function (superseded by the definition in the next cell, which also reports F1)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits,axis=-1)
    return metric.compute(predictions=predictions,references=labels)

In [45]:
# Final metric function: accuracy and weighted F1 (this definition overrides the one above)
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
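
For readers who want to see what these two numbers mean, here is a hand-rolled sketch of accuracy and support-weighted F1 computed from raw logits with NumPy. It is an illustration of the definitions, not the scikit-learn implementation, and the logits and labels are made up.

```python
import numpy as np


def accuracy_and_weighted_f1(logits, labels):
    """Accuracy and support-weighted F1 from raw logits and true labels."""
    preds = np.argmax(logits, axis=-1)  # predicted class = largest logit
    labels = np.asarray(labels)
    accuracy = float((preds == labels).mean())

    weighted_f1 = 0.0
    for c in np.unique(labels):
        tp = int(np.sum((preds == c) & (labels == c)))
        fp = int(np.sum((preds == c) & (labels != c)))
        fn = int(np.sum((preds != c) & (labels == c)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class F1 by its share of the true labels.
        weighted_f1 += f1 * np.sum(labels == c) / len(labels)
    return accuracy, float(weighted_f1)


# Hypothetical logits for six examples and three classes.
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 2.2, 0.4],
    [0.3, 1.9, 0.2],
    [0.1, 0.2, 3.0],
    [1.7, 0.4, 0.9],
])
labels = [0, 0, 1, 1, 2, 2]

acc, f1 = accuracy_and_weighted_f1(logits, labels)
print(acc, f1)
```

With balanced classes, as in this dataset, weighted F1 and accuracy tend to stay close to each other.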

In [46]:
# Set up the training arguments
trainig_args = TrainingArguments(
                output_dir= './NewsClassification',
                learning_rate= 2e-5,
                per_device_eval_batch_size= 16,
                per_device_train_batch_size= 16,
                num_train_epochs= 5,
                evaluation_strategy='epoch',
                logging_strategy='epoch',
                save_strategy = 'epoch',
                weight_decay=0.01,
                load_best_model_at_end=True,
)
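
As a rough sanity check on training length, the number of optimizer steps implied by these settings can be worked out directly. The sketch assumes the 36,000-row training sample, a single device, and no gradient accumulation.

```python
import math

num_examples = 36_000   # 30% sample of the 120,000 training rows
batch_size = 16         # per_device_train_batch_size, single device assumed
num_epochs = 5

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With evaluation, logging, and saving all set to `'epoch'`, each of them would run once per `steps_per_epoch` optimizer steps.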

In [47]:
# Set up the Trainer
trainer = Trainer(
                model= model,
                args= trainig_args,
                train_dataset= tokenized_train,
                eval_dataset= tokenized_test,
                data_collator= data_collator,
                compute_metrics = compute_metrics
)

In [52]:
# Train the model
trainer.train()

The training log for this run was lost.


We trained for 5 epochs. Accuracy was 86% after the first epoch and reached 89% by the end of training, an improvement of 3 percentage points. The number of trainable parameters and the dataset size were limited because of computational constraints.

### Evaluate the Model

We use accuracy as the main evaluation metric, since the classes are balanced; weighted F1 is reported alongside it.

In [53]:
trainer.evaluate(tokenized_test)

Epoch,Training Loss,Validation Loss,Accuracy,F1
0,No log,0.319963,0.889474,0.889048


{'eval_loss': 0.31996336579322815,
 'eval_accuracy': 0.8894736842105263,
 'eval_f1': 0.8890479942247339}

In [54]:
test_pred = trainer.predict(tokenized_test)

Epoch,Training Loss,Validation Loss,Accuracy,F1
0,No log,0.319963,0.889474,0.889048
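
The object returned by `trainer.predict` carries the raw logits in `predictions` and the true labels in `label_ids`. The sketch below shows one way to turn such arrays into readable class names and a confusion matrix. The `logits` and `label_ids` arrays here are small hypothetical stand-ins, not the notebook's actual predictions.

```python
import numpy as np

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Hypothetical stand-ins for test_pred.predictions and test_pred.label_ids.
logits = np.array([
    [3.1, 0.2, 0.5, 0.1],
    [0.1, 2.8, 0.3, 0.2],
    [0.4, 0.1, 1.2, 1.9],
    [0.2, 0.1, 0.6, 2.5],
])
label_ids = np.array([0, 1, 2, 3])

# Predicted class id = index of the largest logit in each row.
pred_ids = logits.argmax(axis=-1)
pred_names = [id2label[int(i)] for i in pred_ids]

# Rows are true classes, columns are predicted classes.
confusion = np.zeros((4, 4), dtype=int)
np.add.at(confusion, (label_ids, pred_ids), 1)

print(pred_names)
print(confusion)
```

Off-diagonal entries show which topics get confused with each other; in this made-up example, one Business item is predicted as Sci/Tech.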


In [None]:
#!ls OUTPUT
# !huggingface-cli login
# trainer.push_to_hub()