<a href="https://www.kaggle.com/code/sanprofnext/roberta-fake-news-classifier?scriptVersionId=156275956" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Fake News Detection with a RoBERTa LLM

The algorithm laid out below was developed as a solution to the [Fake News](https://www.kaggle.com/competitions/fake-news) challenge on Kaggle. It fine-tunes a RoBERTa Base LLM through transfer learning to identify unreliable news articles.

This technique achieved private and public scores of 1.0 on the test data provided by the Kaggle competition.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/fake-news/submit.csv
/kaggle/input/fake-news/train.csv
/kaggle/input/fake-news/test.csv


## 1. Module Installation and Library Import

Install the required dependencies and import the essential libraries for building the algorithm. Also enable the GPU runtime so that the transformer computations are parallelized on the accelerator.

In [None]:
!pip install transformers
!pip install torchmetrics

In [None]:
!pip install datasets

In [None]:
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, RobertaForSequenceClassification
from tqdm import tqdm
import torch
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy
from datasets import Dataset
import datasets

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 2. Data intake, cleanup and tokenization

Clean up the training data by imputing missing values with suitable placeholders. Concatenate every article body with its title and author to prevent loss of relevant information during training.

In [None]:
data = pd.read_csv('/kaggle/input/fake-news/train.csv')

#Handle any missing values by substituting them with placeholder text
data['author'] = data['author'].fillna('Unknown',axis=0)
data['title'] = data['title'].fillna('Unknown',axis=0)
data['text'] = data['text'].fillna('Not Available',axis=0)

#Concatenate the article body with title and author fields to prevent loss of valuable textual training samples
data['comb_news'] = 'Title:'+data['title']+'\nAuthor:'+data['author']+'\nBody:'+data['text']

#Extract only the concatenated text and the corresponding label indicating the authenticity of the news from the input data
data_in = data[['comb_news','label']]

Instantiate the tokenizer tied to the RoBERTa model and define a function that applies it to each training sample. The function takes a batch from a Dataset structure as input and splits every tokenized article into fixed chunks of 128 tokens, padding the last chunk where necessary. Each chunk inherits the remaining fields (such as the label) of the article it came from and is encoded for insertion into the model.

In [None]:
#Instantiate the tokenizer linked to the Roberta model
model_name = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)

#Define the function for tokenizing the input data
def tokenize(el):
    #Split every tokenized article into chunks of 128 tokens to ensure streamlined processing. If the size of
    #the last sample chunk is less than 128, it can be padded for attaining a token length of 128
    result =  tokenizer(el['comb_news'],truncation=True,max_length=128,padding='max_length',return_overflowing_tokens=True)
    
    #Ensure that the remaining fields are correctly mapped to each chunk associated with the original sample
    sample_map = result.pop('overflow_to_sample_mapping')
    for key,value in el.items():
        result[key] = [value[i] for i in sample_map]
        
    return result
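
The chunking and label propagation described above can be illustrated without the tokenizer. The following is a minimal sketch in plain Python, assuming token ids are already available as lists. The helper `chunk_with_mapping`, the window size of 4 and the padding id of 0 are hypothetical choices for illustration; the real tokenizer additionally inserts special tokens into every chunk and uses its own padding id.

```python
def chunk_with_mapping(samples, max_length=4, pad_id=0):
    """Split each token-id list into fixed windows and record the source row."""
    input_ids, attention_mask, sample_map = [], [], []
    for row, ids in enumerate(samples):
        # An empty sample still yields one (fully padded) chunk
        for start in range(0, max(len(ids), 1), max_length):
            window = ids[start:start + max_length]
            pad = max_length - len(window)
            input_ids.append(window + [pad_id] * pad)
            # 1 marks real tokens, 0 marks padding positions
            attention_mask.append([1] * len(window) + [0] * pad)
            sample_map.append(row)
    return input_ids, attention_mask, sample_map


# Two toy "articles": the first needs three chunks, the second only one
samples = [[11, 12, 13, 14, 15, 16, 17, 18, 19], [21, 22]]
labels = [1, 0]

ids, mask, sample_map = chunk_with_mapping(samples)
# Copy each article's label onto every chunk derived from it
chunk_labels = [labels[i] for i in sample_map]
print(sample_map, chunk_labels)
```

The `sample_map` list plays the role of `overflow_to_sample_mapping`: it is what allows the label (and later the article `id`) to be repeated once per chunk.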

Split the input data frame into training and validation data frames for training the RoBERTa LLM through transfer learning. Translate the data frames into Dataset structures for tokenization. The tokenized datasets are then converted into tensors and collated into data loaders that feed batches into the model.

In [None]:
#Function used to tokenize the data and package the tokenized data into data loaders for model ingestion
def create_loaders(data_in):
    #Split the input data columns into training and validation datasets 
    train_data, test_data, train_label, test_label = train_test_split(data_in['comb_news'],data_in['label'],test_size=0.25,random_state=42)
    
    #Translate the training and validation data frames into Dataset structures for data pre-processing stage
    data_train = Dataset.from_pandas(pd.concat([train_data,train_label],axis=1))
    data_valid = Dataset.from_pandas(pd.concat([test_data,test_label],axis=1))
    
    #Create a combined dataset dictionary from the Dataset structures and subject it to the tokenization operation
    tr_data = datasets.DatasetDict({'train':data_train,'valid':data_valid})
    tok_data = tr_data.map(tokenize,batched=True)
    
    #Remove unwanted columns once the encoded data compatible with the LLM is generated
    tok_data = tok_data.remove_columns(['comb_news','__index_level_0__'])
    
    #Re-package the encoded data, attention masks and labels into tensors for insertion into the model 
    tok_data.set_format('pandas')
    train_in = tok_data['train'][:]
    train_set = TensorDataset(torch.tensor(train_in['input_ids']),torch.tensor(train_in['attention_mask']),torch.tensor(train_in['label']))
    valid_in = tok_data['valid'][:]
    valid_set = TensorDataset(torch.tensor(valid_in['input_ids']),torch.tensor(valid_in['attention_mask']),torch.tensor(valid_in['label']))
    
    #Collate the tensor datasets into training and validation data loaders for generating batches that can be used for 
    #training the pre-trained Roberta LLM through stochastic gradient descent optimization
    train_loader = DataLoader(train_set,batch_size=64,shuffle=True)
    valid_loader = DataLoader(valid_set,batch_size=64,shuffle=False)
    
    return train_loader,valid_loader

In [None]:
#Store the generated training and validation data loaders
train_loader,valid_loader = create_loaders(data_in)

## 3. Model instantiation and training environment configuration

Load the RoBERTa model instance onto the GPU for accelerated computation. Then freeze all parameters except those of encoder layers 8, 9 and 10 and the feed-forward classification head, which are the only ones updated during transfer learning.

In [None]:
#Instantiate the text sequence classification model from HuggingFace model hub based on the Roberta architecture
model = RobertaForSequenceClassification.from_pretrained(model_name)

In [None]:
#Load the instantiated model onto the GPU for accelerated computation
model = model.to(device)

#Unfreeze the parameters across layers 8, 9 and 10 along with the feed-forward classifier layers for configuring the architecture to perform transfer learning and identify fake news
for idx,(name,params) in enumerate(model.named_parameters()):
    if 'classifier' in name or 'encoder.layer.8' in name or 'encoder.layer.9' in name or 'encoder.layer.10' in name:
        params.requires_grad = True
    else:
        params.requires_grad = False

#Count the parameters that remain trainable and will be updated by the AdamW optimizer
total_params = 0
for param in model.parameters():
    if param.requires_grad:
        total_params+= param.numel()
print(total_params)
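
The name-based selection above relies on substring matching, which works here because none of the chosen layer numbers is a prefix of another layer's number. A minimal sketch of a stricter matcher is shown below. The parameter names are hypothetical examples that follow the `encoder.layer.<N>.` pattern assumed by the cell above, and `is_trainable` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical parameter names following the `encoder.layer.<N>.` pattern
names = [f"roberta.encoder.layer.{n}.output.dense.weight" for n in range(12)]
names.append("classifier.out_proj.weight")


def is_trainable(name, layers=(8, 9, 10)):
    """Return True for classifier weights and for the listed encoder layers."""
    if name.startswith("classifier"):
        return True
    match = re.search(r"encoder\.layer\.(\d+)\.", name)
    return match is not None and int(match.group(1)) in layers


trainable = [n for n in names if is_trainable(n)]

# Plain substring matching is looser: 'encoder.layer.1' also hits layers 10 and 11
loose = [n for n in names if "encoder.layer.1" in n]
print(len(trainable), len(loose))
```

Parsing the layer index as an integer avoids accidental matches if the set of unfrozen layers is changed later, for example to include layer 1.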

Configure the training environment: the optimizer, the learning rate, the loss function used for gradient computation, the accuracy metrics and the number of epochs.

In [None]:
epochs=2
optimizer = torch.optim.AdamW(model.parameters(),lr=5e-5,eps=1e-8)
criterion = torch.nn.CrossEntropyLoss()
train_acc,valid_acc = Accuracy(task='binary',num_classes=2).to(device),Accuracy(task='binary',num_classes=2).to(device)
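
As a worked illustration of the loss chosen above, the sketch below computes the cross-entropy of a single sample from its raw logits in plain Python. The helper name `cross_entropy` and the logit values are made up for illustration; `torch.nn.CrossEntropyLoss` applies the same softmax-then-negative-log formula and, by default, averages it over the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Two-class logits in the order [reliable, unreliable] (illustrative values)
confident_right = cross_entropy([4.0, -4.0], target=0)
undecided = cross_entropy([0.0, 0.0], target=0)
confident_wrong = cross_entropy([4.0, -4.0], target=1)
print(confident_right, undecided, confident_wrong)
```

An undecided prediction costs `log(2)`, a confident correct one costs almost nothing, and a confident wrong one is penalized heavily, which is what drives the classifier head towards well-separated logits.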

## 4. Model training for Transfer learning to align with the Task context

Unpack the training batches from the data loader and train the unfrozen layers with the AdamW optimizer, using cross-entropy as the loss function. Gradient clipping is applied to avoid exploding gradients. At the end of every epoch, predict the labels of the validation batches and report the training and validation loss and accuracy to assess the performance of the model.

In [None]:
for epoch in range(epochs):
    train_loss, valid_loss = list(),list()
    print(f'Epoch:{epoch}----------------------->')
    
    model.train()
    for idx,(x_ids,x_mask,x_label) in tqdm(enumerate(train_loader),total=len(train_loader)):
        optimizer.zero_grad()
        #Make sure that the unpacked training data batches are loaded onto the GPU
        x_ids, x_mask, x_label = x_ids.to(device), x_mask.to(device), x_label.to(device)
        #Determine the label predictions from the model and compute the cross-entropy loss on the logits
        preds = model(x_ids,attention_mask = x_mask)
        loss = criterion(preds.logits,x_label)
        train_loss.append(loss.item())
        #Accumulate the training accuracy statistics for the current epoch
        train_acc.update(torch.argmax(preds.logits,dim=1),x_label)
        
        #Backpropagate the loss first so that gradients exist before they are clipped
        loss.backward()
        #Clip the gradient norm to 1.0 to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(),1.0)
        #Update the unfrozen model parameters with each optimizer step
        optimizer.step()
        
    model.eval()
    #Disable gradient tracking during validation to save memory and computation
    with torch.no_grad():
        for idx,(v_ids,v_mask,v_label) in tqdm(enumerate(valid_loader),total=len(valid_loader)):
            #Make sure that the unpacked validation data batches are loaded onto the GPU
            v_ids, v_mask, v_label = v_ids.to(device), v_mask.to(device), v_label.to(device)
            #Determine label predictions on the validation batch
            preds = model(v_ids,attention_mask = v_mask)
            #Compute the validation cross-entropy loss for each batch in the validation data
            loss = criterion(preds.logits,v_label)
            valid_loss.append(loss.item())
            valid_acc.update(torch.argmax(preds.logits,dim=1),v_label)
        
    #Publish the training and validation cross-entropy losses as well as classification accuracy at the end of every epoch
    avg_train_loss, avg_valid_loss = sum(train_loss)/len(train_loss),sum(valid_loss)/len(valid_loss)
    print(f'Training loss:{avg_train_loss}\tValidation loss:{avg_valid_loss}')
    print(f'Training accuracy:{train_acc.compute().item()}\tValidation accuracy:{valid_acc.compute().item()}')
    #Reset the accuracy metrics so that the next epoch does not include statistics from this one
    train_acc.reset()
    valid_acc.reset()
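
Gradient clipping by norm rescales all gradients with a single factor, so it only has an effect once the gradients have been populated, that is between `loss.backward()` and `optimizer.step()`. The plain-Python sketch below illustrates the idea on made-up gradient values. The helper `clip_by_global_norm` and the small epsilon in the denominator are assumptions for illustration rather than a copy of the PyTorch implementation.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: gradients that are already small enough are left untouched
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], total_norm


# Two toy "parameter tensors" whose joint gradient norm is 5
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is capped.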

## 5. Determine the model performance on the challenge test data for submission

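Define two helper functions for post-processing the test predictions before submission: `flatten_data` merges the per-batch prediction tensors into one flat list, and `find_issue` counts the articles whose token chunks received conflicting labels.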

In [None]:
def flatten_data(data):
    flat_list = list()
    for item in data:
        flat_list += item.tolist()
    return flat_list

def find_issue(d_out):
    #Count the articles whose token chunks were assigned more than one distinct label
    labels_per_article = d_out.groupby('id')['label'].nunique()
    return int((labels_per_article > 1).sum())

In [None]:
#Extract the test data for submission creation
d_test = pd.read_csv('/kaggle/input/fake-news/test.csv')

#Define a function for determination of test predictions and its post-processing into the submission template
def generate_result(test):
    #Impute any missing values with suitable placeholders to prevent loss of valuable information
    test['author'] = test['author'].fillna('Unknown',axis=0)
    test['title'] = test['title'].fillna('Unknown',axis=0)
    test['text'] = test['text'].fillna('Not Available',axis=0)
    
    #Concatenate the article body with title and author fields to mirror the pre-processing performed on training data
    test['comb_news'] = 'Title:'+test['title']+'\nAuthor:'+test['author']+'\nBody:'+test['text']
    d_test_in = test[['id','comb_news']]
    
    #Translate the data into Dataset structure to mirror the pre-processing performed on training data
    test_dset = Dataset.from_pandas(d_test_in)
    
    #Tokenize the test data with the Roberta tokenizer and remove unwanted columns
    test_tokens = test_dset.map(tokenize,batched=True)
    test_tokens = test_tokens.remove_columns(['comb_news'])
    test_tokens.set_format('pandas')
    
    #Package the tokenized data into data loaders for generating model predictions
    torch_t_data = TensorDataset(torch.tensor(test_tokens['input_ids']),torch.tensor(test_tokens['attention_mask']))
    test_dataloader = DataLoader(torch_t_data,batch_size=64,shuffle=False)
    label_preds = list()
    
    #Pass every batch from the test data loader into the model for determining the corresponding label predictions
    model.eval()
    #Gradient tracking is not needed for inference, so disable it to save memory
    with torch.no_grad():
        for idx,(t_inputs,t_amask) in tqdm(enumerate(test_dataloader),total=len(test_dataloader)):
            t_inputs,t_amask = t_inputs.to(device),t_amask.to(device)
            preds = model(t_inputs,attention_mask = t_amask)
            p_label = torch.argmax(preds.logits,dim=1)
            label_preds.append(p_label)
    
    #Post-process the predicted labels with the helper functions and restructure them into a format suitable for submission creation
    test_predictions = flatten_data(label_preds)
    output = pd.concat([test_tokens['id'],pd.DataFrame(test_predictions,columns=['label'])],axis='columns')
    
    #Also ensure that the prediction is identical on every token chunk coming from the same news article
    if find_issue(output)==0:
        return output.drop_duplicates(subset=['id'])
    else:
        return -1
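
The function above expects every chunk of an article to receive the same label and gives up with `-1` otherwise. One possible alternative, which is not what this notebook does, is to resolve disagreements with a majority vote over the chunks of each article. The sketch below shows that idea in plain Python; the article ids, the chunk labels and the helper `tally_votes` are all hypothetical.

```python
from collections import Counter, defaultdict


def tally_votes(article_ids, chunk_labels):
    """Count how often each label was predicted for each article id."""
    votes = defaultdict(Counter)
    for article_id, label in zip(article_ids, chunk_labels):
        votes[article_id][label] += 1
    return votes


# Hypothetical chunk-level predictions: article 101 has one dissenting chunk
article_ids = [101, 101, 101, 102, 103, 103]
chunk_labels = [1, 0, 1, 0, 1, 1]

votes = tally_votes(article_ids, chunk_labels)
# Articles whose chunks did not all agree
conflicts = [aid for aid, counts in votes.items() if len(counts) > 1]
# Majority label per article; ties go to the label encountered first
article_labels = {aid: counts.most_common(1)[0][0] for aid, counts in votes.items()}
print(conflicts, article_labels)
```

Whether voting, averaging the chunk logits or trusting only the first chunk works best is an empirical question for this dataset and would need to be checked on the validation split.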

In [None]:
#Generate the submission outputs and create a submission file
f_output = generate_result(d_test)

In [None]:
f_output.to_csv('submission.csv',index=False)