# LR Classification by Fine-Tuning DistilBERT

In this notebook, we Fine-Tune the DistilBERT model from the transformers library, for classifying News Headlines as Left-leaning or Right-leaning. The following functions are performed in this notebook:

This link introduced the DistilBERT model: https://huggingface.co/docs/transformers/model_doc/distilbert

It is essentially a "distilled" form of the BERT model, being 40% smaller in size, while retaining 97% of its language understanding abilities, thus being 60% faster.

The labels are :

- left-leaing: CNN, The Washington Post
- Neutral: Business Insider, USA Today
- Right-leaning: Fox News, Breitbart News

We only use the publications that are Left and Right leaning.

1. Importing necessary libraries
2. Preparing the data for training
3. Preparing the model for fine-tuning
4. Defining helper functions and classes
5. Fine-tuning the model
6. Performing 10-fold Cross Validation

## Importing necessary libraries

In [1]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import sys
import random
import pickle as pkl

# libraries to fine-tune the model on our data
from transformers import BertTokenizer, AutoTokenizer
from transformers import AutoTokenizer, TrainingArguments, Trainer
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AdamW

# libraries to evaluate our results
from statsmodels.stats.contingency_tables import mcnemar
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

import warnings
warnings.filterwarnings('ignore')

  from .autonotebook import tqdm as notebook_tqdm
2023-11-18 18:01:30.439057: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Setting the random seed for Python's random module
random_seed = 42
random.seed(random_seed)

# Setting the random seed for NumPy
np.random.seed(random_seed)

torch.manual_seed(random_seed)
torch.cuda.manual_seed(random_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # Disable cuDNN benchmark for reproducibility

## Preparing the data

In [3]:
# Reading in the data
news_df = pd.read_csv('../news_headlines_dataset.csv', index_col = 0)

# Printing the shape of the data
print(news_df.shape)

# Selecting only the required features for fine-tuning
news_df = news_df[['title', 'publication']]

# printing the values in each label
print(news_df.publication.value_counts())

# creating the target column for classification
# Left-leaning rows have value 0
# Right-leaning rows have value 1
# Neutral rows have value 2
news_df['leaning'] = news_df.publication.str.replace('CNN', '0').str.replace('The Washington Post', '0').str.replace('Business Insider', '2').str.replace('USA Today', '2').str.replace('Fox News', '1').str.replace('Breitbart News', '1')


(35886, 20)
publication
Fox News               8327
Breitbart News         7377
CNN                    6485
Business Insider       4803
The Washington Post    4678
USA Today              4216
Name: count, dtype: int64


In [4]:
# printing the columns
news_df.columns

Index(['title', 'publication', 'leaning'], dtype='object')

In [5]:
# printing the values in each label of the target column
news_df['leaning'].value_counts()

leaning
1    15704
0    11163
2     9019
Name: count, dtype: int64

In [6]:
# dropping the rows that are neutral, and are not left-leaning or right-leaning
news_df = news_df[news_df['leaning'] != '2']

# converting to int65 type
news_df.leaning = news_df.leaning.astype('int64')

In [7]:
# dropping the publication column
news_df = news_df.drop(['publication'], axis=1)
news_df.columns

Index(['title', 'leaning'], dtype='object')

In [8]:
def preprocessing(news_df, target):
    """
    This is the main text preprocessing function. It performs the following functions:
    
    1. Dropping duplicates
    2. Undersampling the majority class samples
    3. Random shuffling the dataset
    """
    news_df = news_df.drop_duplicates()
    first = news_df[news_df[target] == 1]
    second = news_df[news_df[target] == 0]
    
    majority = first if len(first) > len(second) else second
    minority = second if len(first) > len(second) else first

    n = len(minority)
    majority = majority.sample(n=n)

    frames = [majority, minority]
    result = pd.concat(frames).sample(frac=1, random_state=42)

    return result

In [9]:
# preprocessing the data
result = preprocessing(news_df, "leaning")

In [10]:
# a glance at the processed data
result.head(), result.shape, result.leaning.value_counts()

(                                                   title  leaning
 29176  Trump associates, including Giuliani, are aski...        0
 21414  Donald Trump: Fox News No Longer Great, an Obs...        1
 25432  William Cohen: Joe Biden will be a steady, sta...        1
 19217  Sen. Brown: Putin Has Something on Trump  We H...        1
 30901  Opinion: Neither alternative facts nor alterna...        0,
 (21350, 2),
 leaning
 0    10675
 1    10675
 Name: count, dtype: int64)

## Defining helper functions and classes for Fine-Tuning

In [11]:
class setDataset(torch.utils.data.Dataset):
    """
    This class defines the dataset with the text encodings of the 
    News Headlines and the label values 
    labels are 0 (left) or 1 (right)
    """
    # initialising the dataset
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # returning an item of the dataset
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    # returning the length of the dataset
    def __len__(self):
        return len(self.labels)

In [12]:
def get_pytorch_model(model, train_dataset):
    """
    This function takes as input the model to be fine-tuned and the training dataset. 
    It utilisies GPU power to fine-tune the model, and then retrn the final model
    """

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    
    model.to(device)
    model.train()

    # loading the data
    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=False)

    # initialising the optimiser
    optim = AdamW(model.parameters(), lr=5e-5)

    # training the model
    for epoch in range(3):
        for batch in train_loader:
            optim.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs[0]
            loss.backward()
            optim.step()
    model.eval()
    
    return model

In [13]:
def get_predictions(X_test, tokenizer, model):
    """
    This function gets the predictons of the test set from the fine-tuned model.
    It used a batch size of a 100.
    It takes as input the test set, the tokeniser and the model.
    It gets a batch of data from the test set, tokenises it, and then predicts the labels.
    After getting labels for all batches, it compiles the results together and returns it.
    """
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    
    x = 0
    pred_labs = []
    while x < X_test.shape[0]:
        if((x+10) < X_test.shape[0]):
            y = x + 10
        else:
            y = X_test.shape[0]

        x_test = X_test[x:y].tolist()
        inputs = tokenizer(x_test,  padding=True, truncation=True, max_length=500, return_tensors="pt").to(device)
        outputs = model(**inputs)

        y_pred = [int(x.argmax()) for x in outputs[0].softmax(1)]
        pred_labs.extend(y_pred)
        x+=10
        
    return pred_labs

In [14]:
def training(data):
    
    """
    This is the main training function that combined the 
    functionalities discussed above - 

    1. Initialising the tokeniser and the model
    2. Encoding the training and testing datasets using the tokeniser
    3. Creating an instance of the Dataset class for the training and testing encoding
    4. Fine-tuning the DistilBERT model
    5. Getting predictions for the test set using the newly fine-tuned model
    6. Evaluating the results.
    
    """

    # initialising the tokeniser and the model for fine-tuning
    # the model is taken from the transformer's library
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
    autoBERTmodel = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    X_train, X_test, y_train, y_test = data

    # getting train and test set encodings
    train_encodings = tokenizer(X_train.tolist(), truncation=True, max_length=500, padding=True)
    test_encodings = tokenizer(X_test.tolist(), truncation=True, max_length=500, padding=True)

    # getting train dataset and test dataset
    train_dataset = setDataset(train_encodings, y_train)
    test_dataset = setDataset(test_encodings, y_test)

    # fine-tuning and predicting the results
    ftBERTmodel = get_pytorch_model(autoBERTmodel, train_dataset)   # training the model
    y_pred = get_predictions(X_test, tokenizer, ftBERTmodel)      # test set predictions
    train_pred = get_predictions(X_train, tokenizer, ftBERTmodel)   # train set predictions

    # Evaluation metrics
    print(classification_report(y_test, y_pred))
    
    test_acc = accuracy_score(y_test, y_pred)  # testset accuracies
    train_acc = accuracy_score(y_train, train_pred)  # trainset accuracies
    test_f1_score = f1_score(y_test, y_pred)  # testset f1 score
    train_f1_score = f1_score(y_train, train_pred)  # train set f1 score

    return test_acc, train_acc, test_f1_score, train_f1_score, y_pred, train_pred

## Model Fine-Tuning

In [15]:
# the text data and the labels for model fine-tuning
X = np.array(result.title)
y = np.array(result.leaning)

In [16]:
# splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify=y, 
                                                    test_size=0.2, 
                                                    random_state = 42)
data = (X_train, X_test, y_train, y_test)

# calling the main fine-tuning function and 
# getting the evaluation results of the test set
test_acc, train_acc, test_f1_score, train_f1_score, y_pred, train_pred = training(data)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.88      0.82      0.85      2135
           1       0.83      0.89      0.86      2135

    accuracy                           0.85      4270
   macro avg       0.85      0.85      0.85      4270
weighted avg       0.85      0.85      0.85      4270



## Performing 10-fold Cross Validation

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.

In this case, we split the data into 10 parts, and for each iteration, we use one part for testing and the remaining nine for training. 

If the evaluation results are similar across all folds, our model is robust

In [17]:
# initialising lists to store the evaluation results

train_predictions = []
test_predictions = []
test_accs = []
train_accs = []
test_f1 = []
train_f1 = []

In [18]:
X_train.shape, X_test.shape

((17080,), (4270,))

In [19]:
y_train.shape, y_test.shape

((17080,), (4270,))

In [20]:
"""
1. Performing 10-fold cross validation by calling the training function 
on each split of the data.
2. The split was made earlier (using the same preprocessing methods discussed above) 
3. The saved splits are simply imported here. 
"""

for fold in range(10):
    print("Fold: ", fold, "\n\n")

    # importing the data
    with open(f"../LR_tenfoldCV/fold_{fold}_train_data.pkl", "rb") as f:
        X_train, y_train = pkl.load(f)

    with open(f"../LR_tenfoldCV/fold_{fold}_test_data.pkl", "rb") as f:
        X_test, y_test = pkl.load(f)
    
    data = (X_train, X_test, y_train, y_test)

    # model training 
    test_acc, train_acc, test_f1_score, train_f1_score, y_pred, train_pred = training(data)

    # compiling evaluation metrics
    test_accs.append(test_acc)
    train_accs.append(train_acc)
    test_f1.append(test_f1_score)
    train_f1.append(train_f1_score)
    test_predictions.append(y_pred)
    train_predictions.append(train_pred)

Fold:  0 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.81      0.92      0.86      1068
           1       0.90      0.78      0.84      1067

    accuracy                           0.85      2135
   macro avg       0.86      0.85      0.85      2135
weighted avg       0.86      0.85      0.85      2135

Fold:  1 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.78      0.92      0.85      1068
           1       0.91      0.74      0.82      1067

    accuracy                           0.83      2135
   macro avg       0.85      0.83      0.83      2135
weighted avg       0.85      0.83      0.83      2135

Fold:  2 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.85      0.88      0.87      1068
           1       0.88      0.84      0.86      1067

    accuracy                           0.86      2135
   macro avg       0.86      0.86      0.86      2135
weighted avg       0.86      0.86      0.86      2135

Fold:  3 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.81      0.93      0.86      1068
           1       0.91      0.78      0.84      1067

    accuracy                           0.85      2135
   macro avg       0.86      0.85      0.85      2135
weighted avg       0.86      0.85      0.85      2135

Fold:  4 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.79      0.92      0.85      1068
           1       0.91      0.76      0.83      1067

    accuracy                           0.84      2135
   macro avg       0.85      0.84      0.84      2135
weighted avg       0.85      0.84      0.84      2135

Fold:  5 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.80      0.92      0.86      1067
           1       0.91      0.76      0.83      1068

    accuracy                           0.84      2135
   macro avg       0.85      0.84      0.84      2135
weighted avg       0.85      0.84      0.84      2135

Fold:  6 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.79      0.91      0.84      1067
           1       0.90      0.75      0.82      1068

    accuracy                           0.83      2135
   macro avg       0.84      0.83      0.83      2135
weighted avg       0.84      0.83      0.83      2135

Fold:  7 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.82      0.87      0.85      1067
           1       0.87      0.81      0.84      1068

    accuracy                           0.84      2135
   macro avg       0.84      0.84      0.84      2135
weighted avg       0.84      0.84      0.84      2135

Fold:  8 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.86      0.76      0.81      1067
           1       0.79      0.87      0.83      1068

    accuracy                           0.82      2135
   macro avg       0.82      0.82      0.82      2135
weighted avg       0.82      0.82      0.82      2135

Fold:  9 




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

           0       0.84      0.85      0.85      1067
           1       0.85      0.84      0.85      1068

    accuracy                           0.85      2135
   macro avg       0.85      0.85      0.85      2135
weighted avg       0.85      0.85      0.85      2135



Since the results are similar, we prove that our model is robust, and the accuracy of our trained model is 85%